Running Ceiba on a Single Node First: Platform Boundaries Before Scale
In a previous post, Designing Failure Domains on a Single Node, I wrote about building platform boundaries before workloads arrived. Now I want to share how that kind of infrastructure becomes meaningful once something real is running on it.
It seems easy to design a clean platform in theory, but it gets harder to keep it clean once an actual product workload needs runtime paths, data paths, deployment flow, recovery behavior, and enough observability to explain what happened when something inevitably behaves differently than expected.
Ceiba is the first workload that makes that question concrete for me.
In From Production APIs to Productized APIs: Why I’m Building Ceiba, I described Ceiba as an API access and monetization layer for existing Node APIs. The first version is focused on practical access concerns such as API keys, plans, quotas, usage tracking, and subscription-gated access. I think that makes it a good workload to test on a small platform.
That is not because a single node is the final production architecture for Ceiba, but because Ceiba has enough moving parts to expose weak platform boundaries without requiring a large deployment surface from day one. It has a control plane, runtime authorization checks, API key state, quota decisions, usage writes, and eventually billing-linked access rules. That is much more useful as a test than a static app that either serves a page or does not.
I will not pretend a single-node homelab is the cloud. The point is to use a constrained environment to learn whether the workload is understandable before scaling adds more places for misunderstanding to hide.
Why start on a single node
There is a temptation to treat scale as the moment when architecture becomes serious. I've been in these kinds of technical discussions first-hand, with engineers probably more experienced than me, and I've come to think that view is backwards.
Scale reveals architecture, but it does not create it. If a system is already confused on one node, adding more nodes usually spreads the confusion around with better uptime terminology.
A single-node environment forces a useful kind of honesty. There is less room to hide behind managed services, autoscaling groups, or abstract diagrams. Storage boundaries matter. Runtime paths matter. Recovery steps matter. A slow database, full disk, broken mount, noisy container, or missing backup is not theoretical anymore.
So I decided to run Ceiba there first. Ceiba doesn't need a massive cluster to become real. It needs to answer request-time questions clearly.
Should this API key be allowed? Which project owns it? What policy applies? Has the quota been exceeded? Should usage be recorded? What should happen if a non-critical usage write fails? What should the protected API see when the runtime cannot respond?
Answering these questions requires a workload that is small enough to inspect and serious enough to fail in meaningful ways.
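To make the request-time questions above concrete, here is a minimal sketch of a single decision function. Every type, field, and reason string in it is a hypothetical name chosen for illustration, not Ceiba's actual data model, and the in-memory maps stand in for whatever store holds that state.

```typescript
// Hypothetical shapes for illustration only; not Ceiba's real schema.
type ApiKeyRecord = {
  keyId: string;
  projectId: string; // which project owns the key
  planId: string; // which plan (policy) applies
  revoked: boolean;
};

type Plan = { planId: string; monthlyQuota: number };

type Decision =
  | { allowed: true; projectId: string; planId: string; recordUsage: boolean }
  | {
      allowed: false;
      reason: "unknown_key" | "revoked_key" | "unknown_plan" | "quota_exceeded";
    };

function decide(
  apiKey: string,
  keys: Map<string, ApiKeyRecord>, // looked up by the presented API key
  plans: Map<string, Plan>, // looked up by planId
  usedThisPeriod: Map<string, number>, // usage count per keyId
): Decision {
  // Should this API key be allowed at all?
  const record = keys.get(apiKey);
  if (!record) return { allowed: false, reason: "unknown_key" };
  if (record.revoked) return { allowed: false, reason: "revoked_key" };

  // What policy applies?
  const plan = plans.get(record.planId);
  if (!plan) return { allowed: false, reason: "unknown_plan" };

  // Has the quota been exceeded?
  const used = usedThisPeriod.get(record.keyId) ?? 0;
  if (used >= plan.monthlyQuota) return { allowed: false, reason: "quota_exceeded" };

  // Allowed: the caller should also record usage for this request.
  return { allowed: true, projectId: record.projectId, planId: plan.planId, recordUsage: true };
}

// Example data, also hypothetical.
const keys = new Map<string, ApiKeyRecord>([
  ["demo-key", { keyId: "k1", projectId: "proj_a", planId: "starter", revoked: false }],
]);
const plans = new Map<string, Plan>([["starter", { planId: "starter", monthlyQuota: 1000 }]]);

const underQuota = decide("demo-key", keys, plans, new Map([["k1", 999]]));
const atQuota = decide("demo-key", keys, plans, new Map([["k1", 1000]]));
```

Returning a reason with every denial is a deliberate choice in this sketch: it keeps the decision explainable after the fact, which is the property the questions above are really about.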
Mapping Ceiba onto the platform
The single-node platform from the failure-domain post was organized around four responsibilities: boot/system, platform, data, and backups. The reason for that structure was not neatness. It was to avoid coupling unrelated failures.
Ceiba maps onto that model naturally.
The boot/system tier owns the operating system and the baseline services the node needs to come back after maintenance or failure. It should not be polluted with application state just because that is convenient in the moment.
The platform tier owns the container runtime, ingress, orchestration layer, and deployment surface. For Ceiba, that is where the control plane and runtime services can initially live. The runtime should be close enough to protected APIs to keep authorization checks predictable, while still being a separate service with its own responsibility.
The data tier owns the state Ceiba cannot casually lose: projects, API keys, policies, plans, subscription-linked state, quota counters, and usage records. Some of that state may eventually belong in different stores depending on the access pattern, but the principle is the same. Runtime execution and durable product state should not be treated as the same kind of thing.
The backup tier owns recovery artifacts. This is where database dumps, config exports, and restore evidence belong. A backup strategy that only exists as hope is not a strategy. Ceiba makes that visible because the product depends on state that would be painful to reconstruct manually.
The first useful map looks like this:
That flow is intentionally plain.
I want the first version to be explainable without needing a platform diagram full of icons. A developer should be able to understand where the request goes, where the decision is made, where state lives, and what happens if part of that path has a bad day.
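The tier responsibilities described above can also be written down as data, which makes "who owns this?" a question with a checkable answer. This is only a sketch: the item names are taken from the prose, while the `Tier` type and the `ownerOf` helper are hypothetical.

```typescript
type Tier = "boot/system" | "platform" | "data" | "backups";

// Responsibilities per tier, as described in the text above.
const tierResponsibilities: Record<Tier, string[]> = {
  "boot/system": ["operating system", "baseline services"],
  platform: [
    "container runtime",
    "ingress",
    "orchestration layer",
    "deployment surface",
    "Ceiba control plane",
    "Ceiba runtime",
  ],
  data: [
    "projects",
    "API keys",
    "policies",
    "plans",
    "subscription-linked state",
    "quota counters",
    "usage records",
  ],
  backups: ["database dumps", "config exports", "restore evidence"],
};

// Returns the tier that owns an item, or undefined when no tier claims it.
// An undefined result is a prompt to decide ownership, not to guess.
function ownerOf(item: string): Tier | undefined {
  for (const tier of Object.keys(tierResponsibilities) as Tier[]) {
    if (tierResponsibilities[tier].includes(item)) return tier;
  }
  return undefined;
}
```

For example, `ownerOf("usage records")` resolves to the data tier, while an item nobody has claimed yet, such as application logs, comes back as `undefined`.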
What Ceiba should be proving early
What I want the first deployment to prove is that the boundaries are useful. Can the control plane be updated without making request-time authorization fragile? Can the runtime make access decisions without turning into a management app? Does the SDK fail in a way that is clear to the API owner? Do backups restore the state that actually matters? The first Ceiba deployment is not about proving that the platform can handle enormous traffic. It is about answering these early questions.
Throughput can be improved once the shape of the system is clear. Confused ownership is harder to fix later because it tends to leak into code, data, deployment scripts, and mental models at the same time.
The biggest thing I want to avoid is a platform where every part knows too much. The dashboard should configure access, not become the hot path. The runtime should evaluate access, not become the product management interface. The SDK should integrate with existing APIs, not become the place where policy logic secretly lives. The database should preserve product state, not become an accidental dumping ground for every log-like event forever.
This is where the single-node setup helps. Its limits make sloppy boundaries easier to notice.
Failure behavior before scaling
A workload like Ceiba has an uncomfortable, almost paradoxical property: it disappears when it works. I've come to think that is what an access layer should do. It should sit between existing APIs and product rules without making itself the main character of every request. However, when it fails, the behavior has to be intentional.
The SDK needs a clear posture when the runtime is unavailable. Depending on the route and product policy, that may mean failing closed for protected paid access, or failing soft only for non-critical paths. That decision should not be improvised inside every application.
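One way to keep that decision out of individual applications is to declare the posture per route and resolve it in one place. The sketch below assumes a hypothetical `RoutePolicy` shape and models the runtime's answer as a plain value, so the network call itself is left out.

```typescript
type FailureMode = "fail_closed" | "fail_soft";

// Hypothetical per-route policy; the field names are illustrative.
type RoutePolicy = { route: string; onRuntimeUnavailable: FailureMode };

// What the SDK got back from the runtime, or the fact that it got nothing.
type RuntimeResult = { kind: "decision"; allowed: boolean } | { kind: "unavailable" };

type Outcome = { allow: boolean; degraded: boolean };

function resolveOutcome(policy: RoutePolicy, result: RuntimeResult): Outcome {
  if (result.kind === "decision") {
    // The runtime answered: its decision stands, whatever the failure mode says.
    return { allow: result.allowed, degraded: false };
  }
  // The runtime could not answer: apply the posture declared for this route.
  // "degraded" lets the caller log or signal that no real decision was made.
  return { allow: policy.onRuntimeUnavailable === "fail_soft", degraded: true };
}

// Example routes, also hypothetical.
const paidRoute: RoutePolicy = { route: "/v1/reports", onRuntimeUnavailable: "fail_closed" };
const statusRoute: RoutePolicy = { route: "/v1/status", onRuntimeUnavailable: "fail_soft" };
```

With this shape, a fail-soft route never overrides an explicit denial. The failure mode only applies when there is no decision at all.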
The same applies to the other failure modes. When usage tracking is delayed or partially unavailable, the product should not lose the ability to explain access decisions. Usage is important, especially for quotas and billing context, but not every usage write should be allowed to collapse the user-facing API experience. If the data tier is slow, that should show up as a platform signal rather than a mystery inside a route handler. If backups fail, that should be visible before a restore is needed. If a deployment breaks the runtime, rollback should be a known path, not a late-night archaeology project.
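As an illustration of a usage write that fails without taking the request down with it, here is a small best-effort wrapper. The failure counter and the retry queue are assumptions of this sketch rather than anything decided for Ceiba, and the writer is synchronous only to keep the example short.

```typescript
type UsageEvent = { keyId: string; route: string; at: number };

// Hypothetical platform signal: a counter something else can watch or alert on.
type Signals = { usageWriteFailures: number };

// Attempts the write. On failure it counts the error and keeps the event
// for a later retry instead of throwing into the request path.
function recordUsageBestEffort(
  event: UsageEvent,
  write: (e: UsageEvent) => void,
  signals: Signals,
  retryQueue: UsageEvent[],
): boolean {
  try {
    write(event);
    return true;
  } catch {
    signals.usageWriteFailures += 1; // visible as a signal, not a mystery
    retryQueue.push(event); // assumption: retried out of band
    return false;
  }
}
```

This only suits writes that are genuinely non-critical. A quota-enforcing or billing-relevant write may need a stricter path, which is a product decision rather than something this helper can settle.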
The environment is modest, but the questions are production-shaped.
The scaling path I want to preserve
Starting on a single node only makes sense if the system does not become trapped there. The goal is to make the first deployment small without making it parochial. Container images, environment configuration, migrations, ingress assumptions, service boundaries, and backup/restore flow should all be shaped so that moving later is a continuation, not a rewrite.
This is where the initial boundary testing pays off. The runtime and control plane should already be separate enough to scale differently. The data layer should have a clear migration story. The SDK should not care whether the runtime is on the same node, another node, or a managed environment, as long as the access contract remains stable.
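A small sketch of what "the SDK should not care where the runtime is" can look like in practice: the location is configuration, and the path the SDK calls stays the same. The variable names `CEIBA_RUNTIME_URL` and `CEIBA_RUNTIME_TIMEOUT_MS`, the default values, and the `/v1/check` path are all hypothetical.

```typescript
type RuntimeConfig = { baseUrl: string; timeoutMs: number };

// Builds the runtime config from an environment-like record.
// The variable names and defaults are assumptions of this sketch.
function runtimeConfigFrom(env: Record<string, string | undefined>): RuntimeConfig {
  // Strip trailing slashes so joining with a path stays predictable.
  const baseUrl = (env["CEIBA_RUNTIME_URL"] ?? "http://127.0.0.1:8787").replace(/\/+$/, "");
  const parsed = Number(env["CEIBA_RUNTIME_TIMEOUT_MS"]);
  // Fall back to a default when the timeout is missing or not a positive number.
  const timeoutMs = Number.isFinite(parsed) && parsed > 0 ? parsed : 250;
  return { baseUrl, timeoutMs };
}

// The contract path is identical wherever the runtime lives.
function checkUrl(config: RuntimeConfig): string {
  return `${config.baseUrl}/v1/check`;
}

// Same node today, somewhere else later: only the environment changes.
const sameNode = checkUrl(runtimeConfigFrom({}));
const elsewhere = checkUrl(runtimeConfigFrom({ CEIBA_RUNTIME_URL: "https://runtime.example.com/" }));
```

Nothing in the calling code changes between the two cases, which is the property worth preserving when the runtime eventually moves.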
If Ceiba grows, the likely path is not mysterious. First, run the system on the single-node platform and validate the workload shape. Then separate the most sensitive state from the application runtime more deliberately. Then move the runtime closer to the APIs that depend on it, especially if latency or availability demands it. Later, split deployment concerns across multiple nodes or cloud infrastructure when the reliability needs justify the cost and complexity.
Scaling here is not meant to rescue unclear design, but to amplify a design that already has understandable boundaries.
Summarizing some of this learning
A single-node platform gives a workload like Ceiba a disciplined place to become real before it becomes large. Ceiba gives the platform a real workload that can challenge whether the boundaries were useful or just tidy. That feedback loop is valuable.
A product workload forces different questions than infrastructure work by itself. It asks whether deployment is boring enough. Whether logs answer the right questions. Whether state is recoverable. Whether request-time behavior is explainable. Whether the architecture supports the product instead of just looking good in a repository.
Ceiba is still early, and the platform will change as the product changes. That's fine. What matters is that the first environment is not treated as disposable chaos. It is treated as a proving ground: not final production, and not a pretend cloud.
It is a place to learn how the system behaves under real constraints, with enough discipline that the lessons can carry forward. For me, the key is to make the system understandable before it grows.