How I ended up building software for the power grid

Key Takeaways

Why energy software has a categorically different reliability model than web or enterprise software
The specific failure modes that only appear when the grid is watching
What I built at Helicon Technologies, Stockholm Exergi, and Ingrid, from edge controllers to cloud EMS

I didn’t start in energy. I came to it the long way.

I studied computer science in Zilina, Slovakia, followed by electrical engineering and information technology, a degree split between software and electrical systems. The decade after was generalist work: a freight TMS taken from zero to first paying customer in three months with a team of five, a medical data platform, car sharing infrastructure, logistics tooling, and a dozen smaller things through a small business I ran across Central European markets. I led teams, shipped products, and got things out faster than most people thought was reasonable. (Full background on my CV if you want the timeline.) Then I landed somewhere the stakes were different.

I moved to Sweden and started working on battery energy storage systems (BESS) connected to the Nordic power grid. I wouldn’t have called it an immediate shift in mindset at the time. It was more gradual, each piece making a bit more sense, until one day it all just clicked.

High-voltage transmission infrastructure, the physical layer that frequency regulation software must serve.

Grid frequency drops. Software either responds or it doesn’t.

The mechanics of frequency regulation are simple. Grid frequency drops below 50 Hz, batteries discharge. It rises, they charge. Simple, right? Spoiler: the devil is in the details, but more on that later. Get enough of these systems coordinated across Sweden and you’re actively stabilizing the electrical grid. Think about all the life-critical infrastructure that runs on electricity: hospitals, water treatment, traffic management, data centers, banking and transport. We take uninterrupted supply for granted in a way that hides exactly how much engineering goes into maintaining it.
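To make the "simple" part concrete, here is a minimal sketch of a proportional frequency response. It assumes a linear response that saturates at a deviation of 0.1 Hz from nominal; the function name, the sign convention, and the constants are illustrative choices, not taken from any TSO specification or from a production controller.

```python
NOMINAL_HZ = 50.0
# Assumption for illustration: full activation at +/-0.1 Hz from nominal.
FULL_ACTIVATION_DEVIATION_HZ = 0.1


def fcr_setpoint_kw(frequency_hz: float, committed_kw: float) -> float:
    """Return a power setpoint in kW.

    Positive means discharge (frequency is low), negative means charge
    (frequency is high). The response is linear and clamped to the
    committed capacity.
    """
    deviation_hz = NOMINAL_HZ - frequency_hz
    fraction = deviation_hz / FULL_ACTIVATION_DEVIATION_HZ
    fraction = max(-1.0, min(1.0, fraction))  # clamp to [-1, 1]
    return fraction * committed_kw


if __name__ == "__main__":
    for f in (49.80, 49.95, 50.00, 50.05, 50.20):
        print(f"{f:.2f} Hz -> {fcr_setpoint_kw(f, 1000.0):+.0f} kW")
```

The formula really is this small. The hard part is everything around it: sampling on time, writing the setpoint on time, and doing both every cycle without exception.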

That’s not a metaphor. That’s literally what the software was doing.

This is what the TSO monitoring screen actually shows — a 60-minute frequency trace with the FCR-N tolerance band, and the current reading on the gauge operators watch in real time.

[Figure: Nordic grid frequency over a 60-minute window, SvK-style monitor, simulated data. Gauge reading: 50.000 Hz, within the FCR-N band.]

In most of my earlier work, software failure meant a degraded user experience, a missed SLA, a customer complaint. Recoverable. Annoying, but recoverable. Grid services don’t work that way. When a battery commits to Frequency Containment Reserves (FCR), the Transmission System Operator (TSO) counts that capacity in its reserve calculations. If the software misses a setpoint, it shows up on a frequency trace that regulators review. There’s no hotfix, especially during prequalification (PQ) tests. The test either passes or it doesn’t.

Reliability model comparison

| | Web software | Grid software |
|---|---|---|
| Failure event | Missed deadline or exception | Missed setpoint activation |
| Consequence | Slow page, degraded UX | Appears on TSO frequency trace |
| Retry? | Yes: retry queue, backoff | No: activation window is gone |
| Fix route | Hotfix deploy, hours | PQ test restart, weeks |
| Who notices | User sees spinner | Regulator sees trace |

That distinction rewired how I think about software. Reliability stopped being a quality attribute to optimize and became a constraint with the same standing as physics. My team and I made mistakes that cost us days or weeks:

A logging implementation that introduced enough jitter to fail a prequalification test the hardware had actually passed
A single-threaded architecture that worked fine until 10 Hz sampling across twelve containers exposed its limits
A SQLite write that blocked the control loop long enough to appear as a setpoint miss on the TSO’s trace

Every time, same lesson: the system doesn’t care what you intended. This is critical infrastructure, and at scale that’s not a slogan, it’s an engineering constraint.
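One common way to keep a blocking write out of a control loop is to hand rows to a background thread through a bounded queue. The sketch below illustrates that general pattern and is not the fix we actually shipped; `TelemetrySink`, the table layout, and the drop-on-full policy are all assumptions made for the example.

```python
import queue
import sqlite3
import threading


def _writer(db_path: str, q: "queue.Queue") -> None:
    """Drain the queue and persist rows; runs in its own thread."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS telemetry "
        "(ts REAL, container INTEGER, power_kw REAL)"
    )
    while True:
        row = q.get()
        if row is None:  # sentinel: shut down
            break
        conn.execute("INSERT INTO telemetry VALUES (?, ?, ?)", row)
        conn.commit()  # slow disk I/O happens here, off the control loop
    conn.close()


class TelemetrySink:
    """Non-blocking front end for telemetry persistence (hypothetical)."""

    def __init__(self, db_path: str, max_pending: int = 1000) -> None:
        self._q: "queue.Queue" = queue.Queue(maxsize=max_pending)
        self.dropped = 0
        self._thread = threading.Thread(
            target=_writer, args=(db_path, self._q), daemon=True
        )
        self._thread.start()

    def submit(self, ts: float, container: int, power_kw: float) -> None:
        """Called from the control loop; never waits on the database."""
        try:
            self._q.put_nowait((ts, container, power_kw))
        except queue.Full:
            # Assumed policy: lose a log row rather than miss a setpoint.
            self.dropped += 1

    def close(self) -> None:
        self._q.put(None)
        self._thread.join()
```

The design choice worth noticing is the `queue.Full` branch. When the disk cannot keep up, the control loop keeps its timing and the log loses rows, not the other way around.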

What I’ve actually built: edge controllers, EMS, and fleet coordination

At Helicon Technologies, Stockholm Exergi, and Ingrid, I built these systems from the hardware up. Now I have to rethink almost all of it from the opposite direction.

It started close to the metal. Site controllers running on industrial edge devices, reading battery telemetry, writing setpoints to inverter or battery registers within tight latency budgets. I’ve designed this… actually multiple times. The second time was a complete rewrite, after PQ testing showed that a single-threaded Python application can’t sample twelve battery containers at 10 Hz while also writing to a local database. More on that in a later post.
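The arithmetic behind that limit is short. At 10 Hz each cycle has a 100 ms budget, and if twelve containers are polled one after another, that leaves roughly 8.3 ms per container before any logging or database write. The sketch below shows one way to run a loop against absolute deadlines and count overruns instead of silently drifting. `run_fixed_rate` and its parameters are hypothetical names for illustration, not the production scheduler.

```python
import time

PERIOD_S = 0.1  # 10 Hz sampling means a 100 ms budget per cycle


def run_fixed_rate(step, cycles, period_s=PERIOD_S,
                   clock=time.monotonic, sleep=time.sleep):
    """Call step() once per period on a fixed time grid.

    Deadlines are absolute (start + k * period), so a slow cycle does not
    shift every later cycle. Returns how many cycles overran their budget.
    """
    overruns = 0
    next_tick = clock()
    for _ in range(cycles):
        step()
        next_tick += period_s
        now = clock()
        if now > next_tick:
            overruns += 1
            # Skip the ticks we missed rather than bursting to catch up.
            missed = (now - next_tick) // period_s + 1
            next_tick += missed * period_s
        sleep(max(0.0, next_tick - now))
    return overruns


if __name__ == "__main__":
    def poll_all_containers():
        time.sleep(0.02)  # stand-in for twelve register reads

    print("overruns:", run_fixed_rate(poll_all_containers, cycles=10))
```

Injecting `clock` and `sleep` keeps the loop testable without real waiting, and the overrun counter is the number you want on a dashboard long before a PQ test finds it for you.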

[Figure: BESS software stack, where the layers live]

MARKET: TSO / SvK market. FCR-N, FCR-D, mFRR; capacity bids and activations.
(dispatch signals)
CLOUD: EMS / fleet coordinator. Market dispatch, SoC management, multi-site coordination.
(setpoints, telemetry)
EDGE: edge controller, the failure zone. 10 Hz sampling, setpoint dispatch, Modbus/DNP3. Failure modes: logging jitter, single-threaded polling, blocking I/O.
(Modbus, DNP3)
HARDWARE: inverter / battery hardware. Physical cells, BMS, inverter registers. The hardware passes PQ.

What it has become now, redesigned at a different company, is very different from the first version. It is the same control problem, lifted off the edge device into the cloud and built to steer not just batteries but any renewable resource: BESS, wind, hydro. The architecture doesn’t care what the asset is, which providers you dispatch to, or what kind of hardware you have. The internal logic is decoupled from the integration layer. When you onboard new hardware, the biggest limitations are the hardware limits of the site itself, not additional logic that needs to be implemented.
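A rough sketch of what that decoupling can look like: the dispatch logic only sees a small interface, and each asset type or vendor gets its own adapter behind it. Every name here (`AssetAdapter`, `FakeBatteryAdapter`, `dispatch`) is hypothetical, and the pro-rata split is just one simple allocation rule chosen for the example, not the platform's actual strategy.

```python
from typing import List, Protocol, Sequence


class AssetAdapter(Protocol):
    """What the core logic needs from any asset, regardless of vendor."""

    def max_power_kw(self) -> float: ...

    def apply_setpoint_kw(self, power_kw: float) -> None: ...


class FakeBatteryAdapter:
    """Stand-in adapter; a real one would speak Modbus, DNP3, or an API."""

    def __init__(self, limit_kw: float) -> None:
        self.limit_kw = limit_kw
        self.last_setpoint_kw = 0.0

    def max_power_kw(self) -> float:
        return self.limit_kw

    def apply_setpoint_kw(self, power_kw: float) -> None:
        self.last_setpoint_kw = power_kw


def dispatch(total_kw: float, assets: Sequence[AssetAdapter]) -> List[float]:
    """Split a requested power across assets in proportion to capability."""
    capacity = sum(a.max_power_kw() for a in assets)
    if capacity <= 0:
        return [0.0 for _ in assets]
    # Never ask the fleet for more than it can physically deliver.
    request = max(-capacity, min(capacity, total_kw))
    shares = []
    for asset in assets:
        share = request * asset.max_power_kw() / capacity
        asset.apply_setpoint_kw(share)
        shares.append(share)
    return shares
```

Adding a wind farm or a hydro unit then means writing one more adapter. `dispatch` itself does not change, which is the property the paragraph above describes.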

If you're building a flexibility platform or energy optimization product and need someone who has shipped this end-to-end — let's talk.

Wind farm, one of the asset classes the cloud controller is designed to coordinate alongside BESS and hydro.

The Energy Management System (EMS) layer came with it: coordinating assets across FCR and manual Frequency Restoration Reserves (mFRR) markets, managing state-of-charge headroom across concurrent commitments, running the SvK prequalification process end to end. Seven months of TSO review teaches you exactly what a submission needs to survive it.

Then fleet coordination: sites in different price areas, different TSOs, hardware from vendors who had no intention of making this easy. The job was building a coherent system out of that, at a scale where a 10-20% availability gap in the fleet is the baseline, not an incident, while PQ limits allow only 5% to at most 10% error.

Control room operations, coordinating assets across multiple sites, TSOs, and market zones.

I also built a prequalification verification tool that processes raw test data and surfaces failures automatically. It exists simply because we needed it and nothing like it existed. There is a tool from SvK that supposedly does that, but it only runs on Windows, is really tricky to use, and looks like it was designed in 1997. Good tool. Needs a funeral.
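To give a feel for what "surfaces failures automatically" can mean, here is a deliberately simplified sketch. It compares delivered power against expected power sample by sample and flags anything outside a tolerance expressed as a fraction of committed capacity. The actual SvK criteria are more involved and are not reproduced here; `find_failures`, the `Failure` record, and the 10% default are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Failure:
    index: int             # position of the offending sample
    error_fraction: float  # |delivered - expected| / committed capacity


def find_failures(expected_kw: Sequence[float],
                  delivered_kw: Sequence[float],
                  committed_kw: float,
                  tolerance: float = 0.10) -> List[Failure]:
    """Return every sample whose error exceeds the tolerance."""
    if len(expected_kw) != len(delivered_kw):
        raise ValueError("expected and delivered traces differ in length")
    if committed_kw <= 0:
        raise ValueError("committed capacity must be positive")
    failures = []
    for i, (exp, got) in enumerate(zip(expected_kw, delivered_kw)):
        error = abs(got - exp) / committed_kw
        if error > tolerance:
            failures.append(Failure(index=i, error_fraction=error))
    return failures


if __name__ == "__main__":
    expected = [0.0, 500.0, 1000.0]
    delivered = [0.0, 480.0, 850.0]
    for failure in find_failures(expected, delivered, committed_kw=1000.0):
        print(failure)
```

Even a check this naive is useful when it runs on raw test data within seconds of the test ending, rather than after a submission has been reviewed.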

Why grid-side software needs engineers who’ve been burned by a frequency trace

We’re in a genuinely interesting moment. Gate closures are getting shorter. Markets are moving toward continuous trading. TSOs across Europe are pushing for more flexibility from more asset types. The grid is getting smarter and more software-driven, and that means more software engineers landing here, often from backgrounds where reliability was a KPI, not a physical constraint.

And now we have AI tools that make writing and shipping code faster than ever. That’s mostly a good thing. But it also means people can move quickly through domains they don’t fully understand yet. Some shortcuts are fine. Some will show up on a frequency trace at 3am on Sunday :)

That gap is what I write about: what these systems actually have to do, where they break, and what it costs when you get that wrong. The failure modes that don’t appear in any TSO document. What a prequalification test is really checking. What it looks like when the software takes reliability as seriously as physics does.

The energy transition needs good software. Haven’t found the ceiling of this problem yet. Follow the writing here if this is your kind of problem too.