Data & System Architecture, from the ground up Lesson 71 / 80

The 'rebuild it cheaper' decision

When the vendor invoice gets painful enough that building it in-house starts to look attractive. The honest math, when rebuilding works, when it does not, and the hybrid that often wins.

There is a moment in the life of a growing company when somebody opens the monthly cloud and SaaS spreadsheet, points at one line, and says some version of “we could build that ourselves for less than that.” The line item is usually Snowflake, or Datadog, or Segment, or Sentry, or Salesforce, or one of the half-dozen vendors whose pricing curves have been shaped to extract a meaningful fraction of a series-B-and-up tech company’s infrastructure budget. The number is large, growing, and visibly correlated with the number of users or events or hosts the company has, in a way that makes it easy to extrapolate forward and panic.

The reaction is often correct in spirit and wrong in detail. Sometimes building the thing is genuinely a good trade. Often it is a much worse trade than it looks at the moment of the realisation, because the cost being compared against is the vendor invoice, and the cost actually being committed to is engineer salaries, infrastructure, opportunity cost, and the ongoing maintenance burden of a piece of internal infrastructure that nobody outside the team that built it cares about. This lesson is about how to make that comparison honestly.

It is the natural follow-on to the cost levers Module 9 has covered. Storage tiering, query optimisation, right-sizing, and FinOps culture are all about spending less inside a vendor’s pricing model. Build-versus-buy is the meta-question: should you be spending inside the vendor’s pricing model at all, or is the workload now mature and stable enough that owning it makes sense?

The pattern

The shape repeats across categories. A vendor sells a managed service that solves a problem your team has. In the early years, the vendor is a clear win: their product is more capable than anything you would build, the price is small relative to the rest of your spend, and your team has more important things to do. You sign up, you integrate, the platform team breathes out.

Then the company grows. The bill grows with it, sometimes faster than the company does, because vendor pricing is tuned to capture more value per unit as customers scale. A vendor that cost you fifty thousand dollars a year when you had ten engineers is now costing you three million a year when you have three hundred. The fraction of the use case you actually exercise is often small: you use ten percent of the vendor’s surface area, but you are paying for the full product because that is how the contract works.

At some threshold, somebody runs the back-of-the-envelope. Two engineers full-time for a year is roughly half a million dollars all-in. The infrastructure to host the thing is maybe another hundred thousand. So if the vendor bill is over a million a year and the workload is a thin slice of what they sell, building it ourselves looks like a saving of forty percent or more. The pitch lands, the team gets approved, the project starts.
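The back-of-the-envelope above can be written down as a few lines of arithmetic. This is only a sketch of the naive pitch: the function name `naive_saving` is made up for illustration, and the per-engineer figure of 250,000 is simply the lesson's half a million split across two people. Under those figures the saving is 40% at a bill of exactly one million and 50% at 1.2 million.

```python
def naive_saving(vendor_bill: float, engineers: int,
                 cost_per_engineer: float, infra: float) -> float:
    """First-year saving as a fraction of the vendor bill (naive version).

    Counts only salaries and hosting, which is exactly what the pitch does.
    """
    in_house = engineers * cost_per_engineer + infra
    return (vendor_bill - in_house) / vendor_bill


# Two engineers at roughly $250k each all-in, plus $100k of infrastructure.
print(f"{naive_saving(1_000_000, 2, 250_000, 100_000):.0%}")  # 40%
print(f"{naive_saving(1_200_000, 2, 250_000, 100_000):.0%}")  # 50%
```

Every term the pitch leaves out makes the in-house side more expensive, so this number is a ceiling on the saving rather than an estimate of it.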

Whether it ends well is not determined by the spreadsheet. It is determined by which side of a handful of factors the workload sits on.

The honest math

The naive comparison is “vendor bill versus engineer salaries plus infrastructure”. The honest comparison has more terms.

Vendor cost. The current bill, projected forward over the period you are planning for. Three to five years is the right horizon, because that is the timescale on which an internal rebuild has to recoup its costs.

Engineer salaries. Not just the ones who build it. The ones who maintain it after it is built. A two-engineer build is a six-engineer-year commitment over three years if you account for the ongoing operations, and that number scales up as the system grows.

Infrastructure. The compute, storage, and network the in-house version will run on. This is often underestimated, especially for systems that need high availability, redundancy across regions, or careful operational instrumentation.

Opportunity cost. What those engineers would have built instead. The hardest term to estimate and often the largest. If your two best platform engineers spend a year cloning a metrics database, that is a year they did not spend on the next product.

Vendor R&D you are replicating. This is the term most often missed. The vendor has dozens or hundreds of engineers working on the product. They are shipping features, fixing bugs, improving performance, handling new compliance regimes, supporting new clouds. Your in-house version starts as a snapshot of what the vendor had three years ago, and the gap widens over time. The cost of staying competitive with the vendor’s roadmap is real and recurring.

Switching cost back, if it goes wrong. Some rebuilds fail. The honest plan accounts for the cost of returning to the vendor (data migration, contract renegotiation, the embarrassing internal post-mortem) if the in-house effort does not work out.

When you sum those terms, the spreadsheet is rarely as one-sided as the original framing made it sound. Sometimes building still wins. Sometimes it doesn’t. The point is to do the calculation rather than skip it.
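One way to keep all six terms in view is to put them in a single structure and sum both sides over the planning horizon. The sketch below is illustrative only: the class `RebuildCase`, its field names, and every number in the example are assumptions chosen to show the shape of the calculation, not figures from any real company. It also simplifies by ignoring the period during the build when the vendor is still being paid.

```python
from dataclasses import dataclass


@dataclass
class RebuildCase:
    vendor_bill: float                    # current annual vendor bill
    vendor_growth: float                  # assumed annual growth of that bill
    build_engineers: float                # engineers during the build year
    maintain_engineers: float             # engineers per year after the build
    cost_per_engineer: float              # all-in cost per engineer-year
    opportunity_cost_per_engineer: float  # value of the work not done instead
    infra_per_year: float                 # compute, storage, network
    catch_up_per_year: float              # keeping pace with the vendor roadmap
    failure_probability: float            # chance the rebuild is abandoned
    switch_back_cost: float               # cost of returning to the vendor
    years: int = 3


def vendor_total(c: RebuildCase) -> float:
    """Vendor bill projected forward, compounding each year."""
    return sum(c.vendor_bill * (1 + c.vendor_growth) ** y for y in range(c.years))


def in_house_total(c: RebuildCase) -> float:
    """Salaries, opportunity cost, running costs, and expected switch-back cost."""
    engineer_years = c.build_engineers + c.maintain_engineers * (c.years - 1)
    people = engineer_years * (c.cost_per_engineer + c.opportunity_cost_per_engineer)
    running = (c.infra_per_year + c.catch_up_per_year) * c.years
    risk = c.failure_probability * c.switch_back_cost
    return people + running + risk


# Hypothetical inputs: a $1M bill growing 20% a year, two engineers throughout.
case = RebuildCase(
    vendor_bill=1_000_000, vendor_growth=0.20,
    build_engineers=2, maintain_engineers=2,
    cost_per_engineer=250_000, opportunity_cost_per_engineer=250_000,
    infra_per_year=100_000, catch_up_per_year=100_000,
    failure_probability=0.2, switch_back_cost=500_000,
)
print(f"vendor:   ${vendor_total(case):,.0f}")    # vendor:   $3,640,000
print(f"in-house: ${in_house_total(case):,.0f}")  # in-house: $3,700,000
```

With these made-up inputs the two totals land within two percent of each other, which is the point: once the missing terms are included, the apparent saving from the naive comparison can disappear entirely. The opportunity-cost and catch-up figures are the least certain inputs, so it is worth rerunning the sum with a range of values for each.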

When rebuilding makes sense

Four conditions, all of which should hold before the project gets approved.

The vendor cost is genuinely a large fraction of total infra spend. “Large” here means more than ten percent of the total infrastructure budget, sustained, with a credible projection that it will continue to grow. A vendor line that is three percent of the total budget is not worth disrupting, even if the absolute number is annoying. The arithmetic favours the rebuild only when the savings are large enough to fund the engineering effort and still leave a margin worth the risk.

The functionality you use is a thin slice of what the vendor provides. If you use ten percent of Datadog’s product (just metrics, just one alerting integration, just a small set of dashboards), you are paying for the other ninety percent of features you do not use. The thinner the slice, the more credibly you can replicate it without absorbing the vendor’s full R&D burden. If you use most of the product, the rebuild is the full product, and that is a different conversation.

You have engineers with the relevant skills sitting idle, or the skills are a strategic investment. The team that will build the replacement needs to know the domain. A team that has never built a metrics database is not going to clone Datadog in a quarter, regardless of how motivated they are. The realistic candidates for in-house replacement are categories where your team has demonstrated expertise or where building expertise is a strategic goal worth funding on its own.

The thing has shifted from differentiated to commoditised. This is the under-appreciated factor. A metrics database in 2016 was a hard problem with a small set of viable open-source building blocks. A metrics database in 2026 is a much easier problem, because Prometheus, ClickHouse, VictoriaMetrics, Mimir, and Grafana have made the building blocks abundant and well-understood. The same shift has happened in event pipelines (Kafka plus a small framework versus a custom Segment replacement), error tracking (open-source Sentry versus the hosted version), feature flags, and several other categories. When the building blocks have commoditised, the rebuild is mostly integration work, and the calculation tilts toward build.

When all four hold, building has a real chance of paying off. The published cases (companies that left Segment for a custom event pipeline, companies that built their own observability and saved six-figure monthly bills, companies that replaced expensive feature-flag vendors with a small internal service) tend to satisfy all four conditions.

When rebuilding does not make sense

The mirror set, equally important, easier to forget when the spreadsheet is shouting.

The vendor’s R&D is much larger than your team’s. Datadog has more engineers working on their observability product than most companies have engineers, period. Trying to clone it with three people is a project that will produce something demoable in six months, something half-credible in eighteen, and something genuinely competitive never. The honest answer is to accept the bill and use the team’s time on problems where you have a chance of being world-class.

The thing is critical-path infrastructure you do not want to be on the hook for at three in the morning. Vendors have on-call rotations, SLAs, and twenty-four-seven support staff. Your in-house version has the team that built it, who would also like to sleep. For systems where reliability is load-bearing on the rest of the business (authentication, payments, observability itself when an outage cascades), the vendor’s operational maturity is part of what you are paying for, and replacing it is replacing a lot more than the feature set.

Your team is small or focused on product. If the company is fifty engineers and the bottleneck is shipping product to customers, taking two of those engineers out of product and pointing them at a platform infrastructure rebuild is the wrong allocation. The rebuild question becomes credible only at scales where there is a platform team large enough to absorb the work without cannibalising product velocity. That threshold is usually somewhere in the low hundreds of engineers, with exceptions in both directions.

The workload is still evolving fast. If your usage of the vendor is changing every quarter as the product evolves, you are a moving target, and a rebuild is chasing requirements that will have shifted by the time the rebuild ships. Stable, mature workloads make better rebuild candidates than fast-changing ones.

The “we built it in a weekend and now it’s a millstone” pattern lives at the intersection of these factors. An engineer with a good week and a strong opinion builds an internal replacement for some annoying SaaS tool. It works. The engineer moves on or leaves. Two years later, nobody fully understands the code, the dependencies have rotted, the tool is still in the critical path of half a dozen workflows, and the team curses it every time something breaks. The original saving was real on paper and negative in practice, because nobody costed the long-tail maintenance.

The hybrid that often wins

The pattern most successful rebuilds actually follow is not a clean replacement. It is a partition. Build the easy eighty percent in-house, where the workload is well-understood and the vendor’s surface area is much larger than what you use. Keep the hard twenty percent on a vendor, where the operational complexity, edge cases, or sheer feature breadth would dominate the rebuild cost.

The pattern repeats. Companies that left Segment often did so for the high-volume server-side event pipeline (Kafka, a small ingestion service, a few well-understood transforms) while keeping the vendor for the long tail of mobile SDKs and partner integrations. Companies that built their own observability typically own metrics ingestion and basic dashboards while paying a vendor for synthetic checks, advanced anomaly detection, or APM in languages where the in-house instrumentation gap is largest. Companies that built their own feature flagging often kept a vendor for the parts of the product that need experimentation analytics integrated with their A/B platform.

The hybrid is unsexy. It does not produce the satisfying “we cancelled our Datadog contract” announcement. It does produce the actual cost savings, because it keeps paying for the vendor’s R&D on the parts of the surface area where that R&D matters most and stops paying for it where it does not.

The decision tree

The Mermaid flowchart below walks the rebuild-versus-buy decision. Inputs: vendor annual cost, vendor cost as a percentage of total infra spend, fraction of the vendor's surface area used, team headcount available with relevant skills, whether the category has commoditised, and whether the workload is critical-path. Outputs: keep on vendor, hybrid (build core, keep edge), full rebuild, or negotiate harder first.

flowchart TD
    A[Vendor bill is painful] --> B{>10% of total<br/>infra spend?}
    B -- No --> Z1[Negotiate. Stop here.]
    B -- Yes --> C{Use thin slice<br/>of product?}
    C -- No --> Z2[Stay on vendor.<br/>Optimise usage.]
    C -- Yes --> D{Category<br/>commoditised?}
    D -- No --> Z3[Stay. R&D gap<br/>too large.]
    D -- Yes --> E{Skilled team<br/>available?}
    E -- No --> Z4[Stay or hire first.]
    E -- Yes --> F{Critical-path<br/>at 3am?}
    F -- Yes --> H[Hybrid: build core,<br/>keep vendor for edge]
    F -- No --> G[Full rebuild candidate.<br/>Plan 3-year horizon.]

The tree is not a substitute for the calculation. It is a filter that catches the cases where the calculation is not even worth running. Most build-versus-buy debates die at the first gate, because the vendor bill is annoying but not large enough to fund the alternative. Of the ones that pass that gate, most die at the second or third, because the team’s actual usage is broader than the framing implied or the category is harder to build than it looks. The cases that survive all the gates are the ones worth building a real plan around.
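The same gates can be expressed as a small function, which makes the filter easy to run over a list of vendor line items. This is a sketch that mirrors the flowchart gate for gate; the function name `rebuild_filter` and its parameter names are invented for illustration, and the yes/no inputs still need human judgement to fill in.

```python
def rebuild_filter(share_of_infra_spend: float,
                   uses_thin_slice: bool,
                   category_commoditised: bool,
                   skilled_team_available: bool,
                   critical_path: bool) -> str:
    """Walk the gates in flowchart order and return the first outcome reached."""
    if share_of_infra_spend <= 0.10:
        return "negotiate, stop here"
    if not uses_thin_slice:
        return "stay on vendor, optimise usage"
    if not category_commoditised:
        return "stay, R&D gap too large"
    if not skilled_team_available:
        return "stay or hire first"
    if critical_path:
        return "hybrid: build core, keep vendor for edge"
    return "full rebuild candidate, plan 3-year horizon"


# A vendor line at 3% of infra spend dies at the first gate.
print(rebuild_filter(0.03, True, True, True, False))
# A large, thin-slice, commoditised, staffed, critical-path workload goes hybrid.
print(rebuild_filter(0.15, True, True, True, True))
```

The early returns reflect the order of the gates: a workload that fails an early gate never reaches the later questions, which is why most debates end before the calculation is needed.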

A note on negotiation

Before any rebuild project gets approved, run the negotiation play. Vendors will often discount substantially when faced with a credible threat of leaving, especially at renewal time. A twenty-to-forty percent discount on a multi-million-dollar contract pays for a lot of optimisation work without spending a single engineer-year on a rebuild. The information you gathered while costing the in-house alternative is exactly the leverage you need at the negotiating table.

This is not a substitute for the rebuild question. It is the cheapest way to get a meaningful chunk of the savings without taking on the rebuild risk, and the team that runs the negotiation play first usually finds that the rebuild question changes shape afterwards.
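To see why the negotiation play comes first, it helps to express the discount in the same unit as the rebuild: engineer-years. The sketch below assumes a hypothetical three-million-dollar contract and reuses the lesson's rough figure of 250,000 per engineer-year; the function name `discount_in_engineer_years` is made up for illustration.

```python
def discount_in_engineer_years(contract: float, discount: float,
                               cost_per_engineer_year: float) -> float:
    """Annual saving from a negotiated discount, expressed in engineer-years."""
    return contract * discount / cost_per_engineer_year


contract = 3_000_000  # hypothetical annual contract value
for discount in (0.20, 0.40):
    saved = contract * discount
    years = discount_in_engineer_years(contract, discount, 250_000)
    print(f"{discount:.0%} off: ${saved:,.0f}/yr, about {years:.1f} engineer-years")
# 20% off: $600,000/yr, about 2.4 engineer-years
# 40% off: $1,200,000/yr, about 4.8 engineer-years
```

Under these assumptions, even the low end of the discount range is worth more per year than the two engineers the original pitch proposed to spend, and it carries none of the rebuild risk.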

Where this leaves Module 9

Cost optimisation has been the spine of Module 9. Storage layout, compute right-sizing, query rewrites, FinOps practice, and now the meta-question of whether the vendor relationships themselves need to change. The next lesson closes Module 9 with a single company’s published cost-optimisation programme, where most of these levers got pulled at once and the savings compounded. Then Module 10 opens by stepping out of the data platform entirely, into the broader question of how services in a system divide responsibility, with microservices and the trade-offs that come with them.
