What Small Services Taught Me About Large-Scale System Design
I used to think building large-scale systems meant thinking in large-scale terms: distributed consensus, massive clusters, orchestration layers, exotic caching strategies. And while that’s certainly part of it, I’ve realized over time that the lessons that really stick (the ones that actually make a system work at scale) often come from building the smallest services.
I’m talking about those seemingly trivial internal tools, daemons that sync two systems every hour, or ad-hoc APIs you write to bridge some legacy gap. The stuff no one writes whitepapers about. Ironically, these small services have taught me more about system design than many of the formal architectural efforts I’ve been part of.
Here’s what I’ve learned.
1. Every system is eventually a distributed system
When you build a tiny service that sends alerts from a database to Slack, you don’t think of it as “distributed.” It’s one HTTP request. One cron job. But then the Slack API times out. The cron job silently fails. The service retries. Two alerts are sent. Someone yells.
That moment is when you learn your service is distributed, even if the architecture diagram is a single box. It has multiple failure domains: the network, the API you call, the persistence layer, your scheduler. Those things don’t go down in unison, and they don’t recover in sync.
This lesson scales directly. In large systems, you might reach for Kafka or SQS to help with delivery guarantees. But the conceptual pain (how to deal with things going wrong independently) is exactly the same.
I now find myself asking: “Where is the time boundary?” and “What assumes reliability here?” far earlier in a design. Because even if you don’t build for scale, time will find you. So will retries.
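To make that failure mode concrete, here’s a rough sketch of the alert sender described above. The webhook URL and the retry count are invented for illustration; the point is just that a timeout on our side doesn’t mean the request failed on Slack’s side.

import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # hypothetical placeholder

def send_alert(text: str) -> None:
    for attempt in range(3):
        try:
            # The POST can succeed on Slack's side and still time out on ours;
            # the retry then delivers the same alert a second time.
            requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=5)
            return
        except requests.Timeout:
            continue  # the network, the API, and the scheduler fail independently

Nothing here looks wrong at a glance, which is exactly why the duplicate alert is surprising the first time it happens.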
2. Idempotency isn’t optional; it just hasn’t hurt you yet
One of my earliest “small” services pushed data from one internal system to another via a REST endpoint. I didn’t think much about it, just posted the data and moved on.
Then the endpoint started intermittently timing out, and our retry logic kicked in. But the receiver didn’t handle duplicates well. We got inconsistent states, user reports of phantom entries, and downstream chaos.
I now treat idempotency as a first-class concern, even in simple APIs. If I expose a write endpoint, I either:
- Support idempotency keys
- Make the write operation naturally idempotent
- Or explicitly document that the client must handle retries with care
Here’s a very simplified pattern I reach for now:
# Pseudo-code: already_processed, previously_returned_response, do_work,
# and store_result are placeholders backed by whatever store is shared across retries.
def process_request(request):
    # Seen this key before? Return the stored response instead of redoing the work.
    if already_processed(request.idempotency_key):
        return previously_returned_response(request.idempotency_key)

    result = do_work(request.payload)
    store_result(request.idempotency_key, result)
    return result
It’s not complex, but that check can prevent so much future ambiguity. This gets even more valuable as your system scales out and retries become opaque across services and teams.
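The client side has a complementary habit: generate one key per logical operation and reuse it across retries. This is a hedged sketch with an invented endpoint, using the common Idempotency-Key header convention rather than anything specific to the services above.

import uuid
import requests

def post_with_retries(url: str, payload: dict, attempts: int = 3) -> dict:
    key = str(uuid.uuid4())  # one key for the whole logical operation, not per attempt
    for attempt in range(attempts):
        try:
            resp = requests.post(
                url,
                json=payload,
                headers={"Idempotency-Key": key},
                timeout=5,
            )
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            if attempt == attempts - 1:
                raise  # surface the failure instead of hiding it

Paired with the server-side check, a blind retry becomes a safe one.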
3. Naming and ownership outlast architecture
In small services, naming might seem like a bike-shed issue. But here’s what happens in large systems: services evolve, responsibilities shift, and what’s more expensive than changing code is changing understanding.
A service I wrote years ago called inventory-sync was initially just syncing one vendor’s inventory to our internal database. It grew, added endpoints, connected to multiple vendors, and became critical infrastructure. But no one wanted to touch it, because they couldn’t tell what it really did from the name. Was it read-only? Was it authoritative? Could you write to it?
Now, I advocate for naming services by their domain ownership, not their function. A name like product-ingestion tells me who owns it (the product team), and roughly what it’s about. If it evolves to include caching, syncing, or enrichment, the name still holds. If you name it by what it does right now, you’re setting up a future contradiction.
Ownership, likewise, is clearer in small teams. But when systems grow, ambiguity about who can change what (or what guarantees are assumed) becomes the enemy. Every broken integration I’ve debugged across teams boiled down to either:
- Misunderstood contract (no formal API schema or versioning)
- Implicit behavior assumptions (e.g., “it always returns sorted results”)
Small services that force you to write things down (OpenAPI specs, contracts, ownership in README.md) are giving you practice reps for the real thing.
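As one small example of writing it down, here’s a hedged sketch of a contract test. The endpoint, the schema, and the jsonschema dependency are my own stand-ins for illustration, not what that service actually used.

import jsonschema  # third-party: pip install jsonschema
import requests

# A pinned response shape; any rename or type change fails the test loudly.
INVENTORY_ITEM_SCHEMA = {
    "type": "object",
    "required": ["sku", "quantity"],
    "properties": {
        "sku": {"type": "string"},
        "quantity": {"type": "integer"},
    },
}

def test_inventory_contract():
    resp = requests.get("https://inventory.internal.example/items/sku-123", timeout=5)
    resp.raise_for_status()
    jsonschema.validate(resp.json(), INVENTORY_ITEM_SCHEMA)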
4. State is a liability; treat it like a live wire
One of the most humbling bugs I ever dealt with came from a small service that processed webhook callbacks from a third party. I assumed the data could always be trusted. So I stored it as-is.
Fast forward a few months, and the provider made a silent change in field naming. My service started overwriting the wrong records. We had no idea until someone noticed incorrect totals on a dashboard. State had silently drifted.
When you build at scale, you get more tooling to deal with this: migrations, schema registries, change data capture. But the lesson is the same: storing state creates coupling over time. And that coupling is usually invisible until it bites.
Now, even for small services, I do two things:
- Validate external inputs as if they’re coming from a hostile source (because they’re certainly not under your control)
- Model internal state changes with explicit versioning or guards (even something as simple as a schema hash; a rough sketch of both habits follows this list)
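Here’s that sketch, with invented field names standing in for the real webhook payload.

import hashlib

EXPECTED_FIELDS = {"order_id", "vendor_sku", "quantity"}  # hypothetical contract

def validate_webhook(payload: dict) -> dict:
    # Treat external input as hostile: reject anything missing, keep only what we expect.
    missing = EXPECTED_FIELDS - payload.keys()
    if missing:
        raise ValueError(f"webhook missing fields: {sorted(missing)}")
    return {field: payload[field] for field in EXPECTED_FIELDS}

def schema_fingerprint(payload: dict) -> str:
    # Hash the sorted field names; a silent rename upstream changes this value,
    # which lets us alert instead of quietly overwriting the wrong records.
    return hashlib.sha256(",".join(sorted(payload.keys())).encode()).hexdigest()[:12]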
If a service owns state, I try to treat it as sacred. If it doesn’t, I avoid touching it unless I absolutely have to.
5. Observability is a multiplier, not an accessory
A common trap with small services is under-investing in observability. “It’s just 200 lines of code, we can debug it directly.” And for a while, you can.
Until it’s running in prod, and something happens at 3 a.m., and you’re SSHing into a box and grepping logs like a cryptologist trying to decipher what happened.
The small services that caused me the most pain were not the most complex; they were the ones that gave me the least information. And the ones I could fix fastest? They had three things:
- Structured logs with request IDs
- Clear separation of error levels
- Minimal but useful metrics (like success/error counts per operation)
In large-scale systems, good observability is assumed. But the practice is often learned the hard way: through these “small” projects that didn’t seem worth instrumenting.
I now include basic observability scaffolding even in the tiniest internal tool. It takes minutes to add up front. It saves hours later. And once you scale, it’s not just helpful, it’s the only way to function.
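Concretely, the scaffolding I mean is small. A minimal sketch using only the standard library might look like this; the service name and log fields are placeholders, not a recommendation of any particular stack.

import json
import logging
import time
import uuid
from collections import Counter

metrics = Counter()  # success/error counts per operation

class JsonFormatter(logging.Formatter):
    def format(self, record):
        # One structured line per event, carrying a request ID you can grep on later.
        return json.dumps({
            "ts": time.time(),
            "level": record.levelname,
            "msg": record.getMessage(),
            "request_id": getattr(record, "request_id", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("tiny-internal-tool")  # placeholder name
log.addHandler(handler)
log.setLevel(logging.INFO)

def handle(payload):
    request_id = str(uuid.uuid4())
    try:
        # ... actual work goes here ...
        metrics["handle.success"] += 1
        log.info("processed", extra={"request_id": request_id})
    except Exception:
        metrics["handle.error"] += 1
        log.error("failed", extra={"request_id": request_id})
        raise

A Counter isn’t a metrics system, but even this much tells you whether the thing is failing and which request to go look at.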
Final thoughts
Building small services has been my sandbox for understanding large-scale design. They expose the same fault lines: inconsistency, coordination, state, and communication. But they do it at a scale where you can actually see and feel the failure modes.
It’s tempting to think of system design as a high-level architectural activity. But more often, it’s an accumulation of practices (learned through tiny, sharp lessons) that harden your intuition about what works at scale.
The irony is that most of those lessons never came from designing “at scale.” They came from shipping something small, and watching what happened when it lived longer than I expected.