Our sites were battery energy storage systems (BESS): grid-scale installations of batteries used to store and dispatch electricity, in our case containerized lithium-ion units of roughly 200 kW each. The first version of our BESS site controller was a fork of a heating optimization application. Not because it was a natural fit, but because it existed, it more or less worked, and we had edge devices already deployed: industrial Linux boxes managed by Telia, connected via fiber or Ethernet with mobile as a fallback. The devices weren’t ours to configure. The Python version was 3.6, locked by the vendor, and that was that.

This is not an unusual place to start a proof of concept. You use what you have. The question is always what you learn on the way to something real.

Key Takeaways

  • FCR-D prequalification requires a minimum 10 Hz sampling rate (ten samples per second, one measurement every 100 milliseconds) across all monitored units, per the joint Nordic TSO technical requirements (SVK/Nordic TSOs, March 2025). FCR-D, Frequency Containment Reserve for Disturbances, is a Nordic grid ancillary service where batteries respond automatically to frequency deviations caused by sudden power imbalances, like a large generator tripping
  • A single-threaded Python controller polling twelve containers sequentially cannot complete one full sample cycle within 100 ms; the work per window exceeds what one thread can reliably finish
  • Decoupling the hot path from storage via an in-memory queue and per-container read/write threads eliminated the jitter that was causing setpoint delivery failures

The proof of concept

Two battery containers. Around 150 kW each. The app communicated via Modbus RTU, read state, sent setpoints, and forwarded telemetry to our existing platform for storage and visualisation. (Modbus RTU is a serial industrial communication protocol, originally designed for PLCs. Devices are polled one at a time over a shared cable: slow but very reliable on short runs.) Single-threaded Python. The code was simple enough that the logic was obvious on first read, and that simplicity was valuable: a PoC you can reason about entirely in your head is a PoC you can actually debug.

It worked. We shipped it. We learned from it. (More on how I ended up doing this kind of work in the first place.)

[Image: industrial battery energy storage units in a warehouse facility. Caption: Large-scale battery units of the kind we were managing via Modbus RTU. Photo: Pexels.]

Scaling up

As we grew from two containers to around twelve, each around 200 kW and each with its own dedicated connection, the communication layer evolved alongside. We moved from Modbus RTU to Modbus TCP (the same Modbus protocol transported over standard Ethernet/IP networks, which allows faster polling and easier integration into modern infrastructure) where the network allowed it, and added MQTT for sites where it made more sense architecturally. MQTT is a lightweight publish-subscribe messaging protocol common in IoT systems: devices publish data to a broker and subscribers receive it asynchronously, a good fit for intermittent mobile connections. Each container needed its own connection; the protocol doesn’t allow meaningful multiplexing across independent units.

The single-threaded model still worked, mostly. Reading and writing across twelve connections sequentially introduced some latency, but nothing that broke functionality. The app was slow in the way that software is often slow without consequence, until the moment consequence arrives.

Why did FCR prequalification expose the limits of single-threaded control?

Svenska kraftnät (SVK) is Sweden’s transmission system operator, responsible for grid stability and the body that runs FCR prequalification tests for Swedish resources. SVK’s FCR Technical Requirements mandate that resources deliver 86% of active power within 7.5 seconds of activation, with test data logged at a minimum of 10 Hz throughout each test window (SVK/Nordic TSOs, March 2025). FCR-D is the most demanding case: within defined windows, some as short as thirty seconds, the system must demonstrate precise frequency response curves sampled every 100 ms.
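To make the 10 Hz logging requirement concrete, here is a minimal sketch of a cadence check over a list of sample timestamps. The function names, the 5 ms tolerance, and the synthetic data are illustrative assumptions, not part of the TSO requirements or of our controller.

```python
from typing import List

MAX_GAP_S = 0.100  # 10 Hz means at most 100 ms between consecutive samples


def largest_gap(timestamps: List[float]) -> float:
    """Return the largest spacing, in seconds, between consecutive samples."""
    if len(timestamps) < 2:
        raise ValueError("need at least two samples to measure a gap")
    return max(b - a for a, b in zip(timestamps, timestamps[1:]))


def meets_cadence(timestamps: List[float], tolerance_s: float = 0.005) -> bool:
    """True if no gap exceeds 100 ms plus a tolerance.

    The tolerance is a hypothetical allowance for clock and float rounding,
    not a figure taken from the technical requirements.
    """
    return largest_gap(timestamps) <= MAX_GAP_S + tolerance_s


# Synthetic logs: a clean 10 Hz series, and one where sampling stalled once.
steady = [i * 0.1 for i in range(50)]
stalled = steady[:20] + [t + 0.35 for t in steady[20:]]

print(meets_cadence(steady))   # clean log passes
print(meets_cadence(stalled))  # a single long gap fails the whole log
```

A single stalled sample is enough to fail the check, which is why occasional jitter matters as much as average throughput.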

For a single-threaded Python 3.6 application sequentially polling twelve containers over separate connections, computing control logic, writing setpoints, and trying to log all of it to a local SQLite database: this is not a scheduling problem. It is a physics problem. The work to be done in each 100 ms window is larger than what a single thread can reliably complete.
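A back-of-the-envelope version of that budget, as a sketch. Every latency figure below is a hypothetical placeholder chosen for illustration; we are not quoting measured values, and the parallel case is idealised (it ignores thread scheduling and GIL overhead).

```python
WINDOW_MS = 100.0  # one sample period at 10 Hz


def sequential_cycle_ms(units, read_ms, write_ms, logic_ms, db_write_ms):
    """One thread polls every unit in turn, then computes, then logs."""
    return units * (read_ms + write_ms) + logic_ms + db_write_ms


def parallel_cycle_ms(read_ms, write_ms, logic_ms):
    """Idealised case: one thread per unit, logging moved off the hot path."""
    return read_ms + write_ms + logic_ms


# Hypothetical figures: 8 ms per read, 8 ms per setpoint write,
# 5 ms of control logic, 10 ms for a synchronous SQLite write.
sequential = sequential_cycle_ms(12, 8.0, 8.0, 5.0, 10.0)
parallel = parallel_cycle_ms(8.0, 8.0, 5.0)

print(sequential, sequential > WINDOW_MS)  # 207.0 True
print(parallel, parallel > WINDOW_MS)      # 21.0 False
```

With these assumed numbers the sequential loop needs roughly twice the window before any jitter is added, while the per-container layout leaves headroom.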

We saw the failure before we measured it. The setpoint delivery was delayed, sometimes by milliseconds, sometimes by full seconds. The response curves the prequalification tool was recording didn’t match what the grid required. FCR-D tests, which are both logic-heavy and CPU-heavy, were the clearest failures. The app was reacting, but it was reacting late, and in a timed test, late is wrong.

The SQLite situation compounded it. Local writes were happening synchronously in the hot path, occasionally causing blocking that added unpredictable jitter to an already overloaded loop. A hanging database connection during a prequalification window is exactly as bad as it sounds.

A 2026 study on Python GIL behaviour in edge computing systems found that P99 latency increases 2.0x on single-core hardware and up to 4.8x on quad-core as thread count saturates, with throughput dropping over 40% past the saturation threshold (Mandal & Shende, arXiv 2601.10582, 2026). The GIL, Python's Global Interpreter Lock, is a mutex that prevents more than one thread from executing Python bytecode at the same time, meaning Python threads can't truly run in parallel on multiple CPU cores. The numbers are from AI inference workloads, but the GIL contention mechanism is identical: a single lock serialising execution across threads that have real deadlines. In a control loop with a hard 100 ms ceiling, even a modest latency multiplier makes the system structurally unreliable.

How the multithreaded rewrite solved the 10 Hz problem

We rewrote the application as a multithreaded system built around one design principle: the hot path (sampling hardware state and delivering setpoints) must never wait on anything else.

The architectural pattern we landed on was a base thread class we called the zombie thread. The name is more accurate than it sounds. Any thread inheriting from it would, on death, restart. The loop was unconditional: do the work, handle exceptions, alert on failure, restart and try again. A thread that crashed was not a crashed system. It was a briefly interrupted one.
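A minimal sketch of what such a base class could look like, written to run on Python 3.6. The class and attribute names (`ZombieThread`, `work`, `restart_delay_s`, `FlakyReader`) are hypothetical, and the alerting is reduced to a log line. Note that a finished Python thread cannot be started again, so this sketch restarts the work loop inside the same thread rather than spawning a new one.

```python
import logging
import threading


class ZombieThread(threading.Thread):
    """Thread whose work loop resumes after any unhandled exception."""

    restart_delay_s = 0.01  # hypothetical pause before trying again

    def __init__(self, name):
        super().__init__(name=name, daemon=True)
        self._halt = threading.Event()
        self.restarts = 0

    def work(self):
        """One iteration of work; subclasses override this.

        A real worker would block here on I/O or a timer.
        """
        raise NotImplementedError

    def run(self):
        while not self._halt.is_set():
            try:
                self.work()
            except Exception:
                # Alert on failure, then go around the loop again.
                self.restarts += 1
                logging.exception("%s died, restart #%d", self.name, self.restarts)
                self._halt.wait(self.restart_delay_s)

    def stop(self):
        self._halt.set()


class FlakyReader(ZombieThread):
    """Demo worker: fails on its first two iterations, then succeeds."""

    def __init__(self):
        super().__init__(name="flaky-reader")
        self.calls = 0
        self.done = threading.Event()

    def work(self):
        self.calls += 1
        if self.calls <= 2:
            raise IOError("simulated Modbus timeout")
        self.done.set()
        self.stop()
```

Starting a `FlakyReader` and waiting on its `done` event shows the behaviour: two logged failures, two restarts, and a successful third iteration, with the rest of the process unaffected.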

This is Erlang’s OTP supervisor model implemented in Python threads. The Erlang supervisor behaviour automatically restarts failed child processes, with configurable strategies for whether to restart siblings and how many restarts to permit before escalating failure up the supervision tree (Erlang/OTP Documentation v29.0). Our implementation was less configurable and more pragmatic, but the principle was identical: let it fail, restart it, keep the system running.

From this base, we built the thread hierarchy:

[Diagram: zombie thread architecture, hot path isolation]
  • Schedule thread: dispatch plan → setpoints
  • R/W threads, one per container (1 to 12), each connected to its own battery (BATT 1 to BATT 12, ~200 kW, Modbus TCP)
  • In-memory queue: read threads write here and return immediately, never touching SQLite directly
  • Queue drainer: queue → SQLite (slow, async)
  • Uplink thread: SQLite → cloud platform
  • Any thread crash → automatic restart: let it fail, restart it, keep the system running

The in-memory queue was the critical decoupling. The read threads, after sampling hardware state, wrote to the queue and returned immediately to the next sample. They never touched SQLite. The queue-drainer ran at whatever pace SQLite could sustain, and SQLite’s pace was no longer anyone’s problem at 3:00 AM during an FCR-D test.
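The decoupling can be sketched with the standard library alone. This is an illustration under assumptions, not our production code: the table layout, the function names (`record_sample`, `drain`, `open_store`), and the batch size are all hypothetical.

```python
import queue
import sqlite3
import time

# Unbounded in-memory buffer between the hot path and storage.
samples = queue.Queue()


def record_sample(container_id, power_kw, timestamp=None):
    """Hot path: enqueue and return immediately; never touches SQLite."""
    ts = time.time() if timestamp is None else timestamp
    samples.put((ts, container_id, power_kw))


def open_store(path=":memory:"):
    """Open the store. A real drainer thread would open its own connection,
    since sqlite3 connections are tied to the creating thread by default."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS telemetry "
        "(ts REAL, container_id INTEGER, power_kw REAL)"
    )
    return conn


def drain(conn, max_batch=500):
    """Slow path: move whatever is queued into SQLite in one transaction."""
    batch = []
    while len(batch) < max_batch:
        try:
            batch.append(samples.get_nowait())
        except queue.Empty:
            break
    if batch:
        with conn:  # commits on success, rolls back on error
            conn.executemany(
                "INSERT INTO telemetry (ts, container_id, power_kw) "
                "VALUES (?, ?, ?)",
                batch,
            )
    return len(batch)
```

In this sketch a read thread would call `record_sample` once per 100 ms tick, and a separate drainer thread would call `drain` in a loop at whatever pace the disk allows. The queue is unbounded, so a stalled disk shows up as growing memory and delayed persistence rather than as a missed sample.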

The consequence the design accepted was that data in the queue was not yet in persistent storage. There was a propagation delay, sometimes seconds, sometimes longer if the mobile connection was the only uplink. We accepted this. The alternative was letting I/O contention degrade the sampling rate during prequalification, which we did not accept.

Every engineer who has separated a write-ahead log from a storage engine, or decoupled a message producer from a consumer to protect the producer’s latency, has solved a version of this problem. The domain was different; the trade-off was the same.

What we learned

The zombie thread pattern is Erlang’s supervisor model implemented in Python threads: let it fail, restart it, keep the system running. It is not elegant by the standards of fault-tolerant systems design, but it is remarkably effective for a constrained environment where you cannot replace the runtime or upgrade the operating system.

The deeper lesson was about what “performance” means in a real-time control system. The single-threaded app was not slow in any conventional sense. It handled the PoC correctly, scaled to a dozen units, and processed data faithfully. It was slow only when measured against a hard deadline that didn’t exist until prequalification testing revealed it. The 100 ms boundary was always there; it just wasn’t visible until we ran the tests that made it matter.

Building for a deadline you haven’t yet been given is difficult to justify. Recognising when you’ve hit one, and being willing to rewrite rather than optimise around it, is what turns a proof of concept into something that can actually qualify for grid service delivery.

The hardware is still Telia-managed. The Python version is still 3.6. The zombie threads are still running.

If you're building a flexibility platform or energy optimization product and need someone who has shipped this end-to-end — let's talk.

For how these latency constraints translate to the full real-time control layer, see the upcoming post on real-time control: latency, reliability, and correctness. The second-generation controller that replaced this architecture on newer sites is covered in a forthcoming post on the diagnostic controller for wind and hydro. For a deeper look at what FCR prequalification actually tests and why the timing requirements are as tight as they are, the FCR deep dive covers the physics and the test sequence.