A CAD Kernel in a Weekend

I wanted to build a B-rep CAD kernel. The result, Knot, is written in Rust, meant to complement Rhino3D, built for the web, and designed to scale like Google’s Manifold while maintaining the geometric precision that meshes can’t provide. I wanted to see if a fleet of coding agents could build it in a weekend, operating in a loop against a real dataset that told them whether they were converging.

This is how it went.

The problem with existing kernels

The idea came from working with Rhino Compute. Rhino is great software, but hosting it means running Windows VMs in the cloud because it depends on Windows APIs. Scaling it is expensive. There’s no path to running it at the edge or embedding it in a web app.

Manifold from Google is the modern alternative. It’s fast and open source and runs anywhere. But it operates on meshes, which are approximations. If you need the precision that NURBS-based B-rep provides (exact tangency, exact offsets, exact fillets), meshes are the wrong representation. You’re working with a discretized approximation of the geometry rather than the geometry itself.

So the target was clear: the precision of Rhino’s NURBS-based approach, built from scratch in Rust for the web, with no Windows dependencies.

Setting up the environment

The first thing I needed was a way to know whether the kernel worked. Not on toy geometry, but on real CAD models with real degeneracies. I used the ABC dataset, a large corpus of CAD models, as the stress probe.

I set up two tiers of validation. The first was synthetic primitives: 300 boolean ops on parametrically generated boxes, spheres, cylinders, cones, and tori with a fixed seed. Fast, deterministic, near-perfect pass rate. The regression smoke test. The second was ABC chunk 0000: 30 pairs with 3 operations each, for 90 booleans on real CAD models. Slower, messier, and it exposed every kind of degeneracy the synthetic tests missed.
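A minimal sketch of what a fixed-seed synthetic tier can look like. Everything here is an assumption for illustration: the `Case` fields, the parameter ranges, the choice of union, intersection, and difference as the operation set, and the use of a small SplitMix64 generator so the example needs no external crates. The point is only that the same seed always yields the same list of cases.

```rust
/// Primitive kinds named in the text; the enum itself is hypothetical.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Primitive { Box, Sphere, Cylinder, Cone, Torus }

/// Assumed operation set for illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum BoolOp { Union, Intersection, Difference }

/// One synthetic boolean test case (hypothetical layout).
#[derive(Debug, Clone, PartialEq)]
pub struct Case {
    pub a: Primitive,
    pub b: Primitive,
    pub op: BoolOp,
    /// Characteristic size of each primitive, in model units.
    pub size_a: f64,
    pub size_b: f64,
    /// Offset of `b` from `a`, so some pairs overlap and some do not.
    pub offset: f64,
}

/// SplitMix64: a tiny deterministic PRNG, used here only to avoid dependencies.
pub struct SplitMix64(pub u64);

impl SplitMix64 {
    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }

    /// Roughly uniform float between `lo` and `hi`.
    pub fn range(&mut self, lo: f64, hi: f64) -> f64 {
        let unit = (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64;
        lo + unit * (hi - lo)
    }

    /// Pick one element of a non-empty slice.
    pub fn pick<T: Copy>(&mut self, items: &[T]) -> T {
        items[(self.next_u64() % items.len() as u64) as usize]
    }
}

/// Generate `count` cases; the same seed always gives the same cases.
pub fn generate_cases(seed: u64, count: usize) -> Vec<Case> {
    const PRIMS: [Primitive; 5] = [
        Primitive::Box, Primitive::Sphere, Primitive::Cylinder,
        Primitive::Cone, Primitive::Torus,
    ];
    const OPS: [BoolOp; 3] = [BoolOp::Union, BoolOp::Intersection, BoolOp::Difference];
    let mut rng = SplitMix64(seed);
    (0..count)
        .map(|_| Case {
            a: rng.pick(&PRIMS),
            b: rng.pick(&PRIMS),
            op: rng.pick(&OPS),
            size_a: rng.range(0.5, 2.0),
            size_b: rng.range(0.5, 2.0),
            offset: rng.range(0.0, 3.0),
        })
        .collect()
}
```

Calling `generate_cases(42, 300)` twice returns identical vectors, which is what makes a tier like this usable as a regression smoke test.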

The agent’s job was to implement the boolean operations pipeline. ABC was how it knew whether things worked. Implement, run the harness, read the results, iterate. The dataset was the environment.

The first surprise: measurement was broken

The agents started iterating and the numbers looked encouraging. We hit 94.4% on ABC. But the number wasn’t stable. Runs would vary by 5 or more percentage points with no code changes between them.

The first real work wasn’t algorithmic. It was fixing the measurement.

An infinite loop in the boolean could wedge the harness without anyone noticing. Hung pairs got counted into different buckets depending on timing. That “94.4% peak” was an artifact. We added a watchdog timeout and the real number dropped to a noisy 88-94%.
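One way to sketch a watchdog of this kind, using only the standard library. This is an assumption about the shape of the fix, not the harness's actual code: the job runs on a worker thread and the caller waits on a channel with `recv_timeout`. The names `Watched` and `run_with_watchdog` are hypothetical.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// What the watchdog observed for one job.
#[derive(Debug, PartialEq)]
pub enum Watched<T> {
    Finished(T),
    /// The job did not report back within the limit.
    TimedOut,
    /// The worker dropped its sender without a result (for example, it panicked).
    Crashed,
}

/// Run `job` on a worker thread and wait at most `limit` for its result.
pub fn run_with_watchdog<T, F>(limit: Duration, job: F) -> Watched<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // If the receiver has already given up, the send fails; ignore that.
        let _ = tx.send(job());
    });
    match rx.recv_timeout(limit) {
        Ok(value) => Watched::Finished(value),
        Err(RecvTimeoutError::Timeout) => Watched::TimedOut,
        Err(RecvTimeoutError::Disconnected) => Watched::Crashed,
    }
}
```

A thread cannot be killed from outside, so a truly wedged job keeps running in the background here. A harness that needs to reclaim the CPU would more likely run each pair in a child process and kill it on timeout.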

File ordering wasn’t deterministic. The test harness walked directories, and HashMap iteration order, which Rust’s default hasher randomizes per process, changed between runs. We sorted the walk and the noise disappeared.
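A sketch of a sorted walk, assuming a plain recursive traversal with `std::fs::read_dir`. The function names are hypothetical. `read_dir` makes no ordering promise, so the sort at the end is what makes the result stable across runs and machines.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Collect every file under `root` and return the paths in sorted order.
pub fn sorted_files(root: &Path) -> io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    collect(root, &mut files)?;
    // Sorting once at the end fixes the order regardless of how the
    // filesystem or any intermediate container happened to enumerate entries.
    files.sort();
    Ok(files)
}

fn collect(dir: &Path, out: &mut Vec<PathBuf>) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            collect(&path, out)?;
        } else {
            out.push(path);
        }
    }
    Ok(())
}
```

If results are keyed by path, a `BTreeMap` in place of a `HashMap` gives sorted iteration without a separate sort step.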

We also had no way to tell failure types apart. A topology error, a timeout, and a bad input file all counted the same way. We added per-pair categorization: valid, empty, bad_input, topo_fail, tess_fail, crash, timeout. Now when the agent fixed something, we could see whether it closed a problem or moved it between buckets.
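The categories above map naturally onto an enum. The sketch below is illustrative rather than the harness's real types: `Outcome` mirrors the seven buckets, and the hypothetical `bucket_shift` compares two runs over the same pairs to show whether a change closed a problem or moved it into another bucket.

```rust
use std::collections::BTreeMap;

/// One result category per boolean, mirroring the buckets in the text.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Outcome {
    Valid,
    Empty,
    BadInput,
    TopoFail,
    TessFail,
    Crash,
    Timeout,
}

/// Count how many results landed in each bucket.
pub fn tally(outcomes: &[Outcome]) -> BTreeMap<Outcome, usize> {
    let mut counts = BTreeMap::new();
    for &outcome in outcomes {
        *counts.entry(outcome).or_insert(0) += 1;
    }
    counts
}

/// Per-bucket change between two runs; buckets that did not change are omitted.
pub fn bucket_shift(before: &[Outcome], after: &[Outcome]) -> BTreeMap<Outcome, i64> {
    let mut delta: BTreeMap<Outcome, i64> = BTreeMap::new();
    for &outcome in after {
        *delta.entry(outcome).or_insert(0) += 1;
    }
    for &outcome in before {
        *delta.entry(outcome).or_insert(0) -= 1;
    }
    delta.retain(|_, change| *change != 0);
    delta
}
```

A shift of `TopoFail: -1, Timeout: +1` with `Valid` unchanged means the fix moved a failure rather than closing it, which a single pass rate would hide.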

After this work the number settled at 93.3%. Lower than our previous “peak” but real. Every improvement after this point was measured against something trustworthy.

Queuing the wrong work

With reliable measurement in place, the natural instinct was to go after the hard algorithmic problems. We queued up fat-plane Bezier clipping for NURBS-vs-NURBS intersections (1-2 weeks of estimated work) and BVH spatial culling to speed up candidate filtering (3-5 days). Both seemed like obvious next steps based on what the failures looked like.

Before starting either, I had the agents write short diagnostic harnesses. One per question, about an hour each. Six total. This changed the project.

The filter diagnostic asked: what’s the bounding-box filter survival rate per pair? Answer: 4.6% on one pair, 0.3% on another. The hard cases weren’t NURBS-vs-NURBS: the model had 206 planar faces out of 253, so the Bezier clipping project was addressing the wrong surface type.
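A sketch of how a survival rate like this can be measured, assuming it means the fraction of face pairs whose axis-aligned bounding boxes overlap. That definition, the `Aabb` type, and the function names are assumptions for illustration.

```rust
/// Axis-aligned bounding box of one face (hypothetical type).
#[derive(Debug, Clone, Copy)]
pub struct Aabb {
    pub min: [f64; 3],
    pub max: [f64; 3],
}

impl Aabb {
    /// Two boxes overlap when their intervals overlap on every axis.
    pub fn overlaps(&self, other: &Aabb) -> bool {
        (0..3).all(|i| self.min[i] <= other.max[i] && other.min[i] <= self.max[i])
    }
}

/// Fraction of all face pairs (one face from each solid) that survive the
/// bounding-box filter and would go on to exact intersection.
pub fn survival_rate(faces_a: &[Aabb], faces_b: &[Aabb]) -> f64 {
    let total = faces_a.len() * faces_b.len();
    if total == 0 {
        return 0.0;
    }
    let survivors: usize = faces_a
        .iter()
        .map(|fa| faces_b.iter().filter(|fb| fa.overlaps(fb)).count())
        .sum();
    survivors as f64 / total as f64
}
```

Breaking the surviving pairs down by surface type is a small extension of the same loop, and it is the kind of count that shows whether the expensive pairs are planar or NURBS.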

The stage-trace diagnostic asked: where does the 8-second budget go per pair? Answer: 23.9 seconds in classify_face_exact on the hardest pair. Candidate filtering was already 1ms. The BVH project was optimizing a part of the pipeline that wasn’t the bottleneck.
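A stage trace can be as small as a wrapper that times each stage with `std::time::Instant`. The sketch below is an assumption about the approach, not the actual diagnostic. `StageTrace` is a hypothetical name; `classify_face_exact` comes from the text and would be passed in as a stage label.

```rust
use std::time::{Duration, Instant};

/// Records how long each named pipeline stage took.
#[derive(Default)]
pub struct StageTrace {
    stages: Vec<(&'static str, Duration)>,
}

impl StageTrace {
    /// Run `stage`, record its wall-clock time under `name`, return its result.
    pub fn time<T>(&mut self, name: &'static str, stage: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let result = stage();
        self.stages.push((name, start.elapsed()));
        result
    }

    /// The stage that consumed the most time, if any stage was recorded.
    pub fn slowest(&self) -> Option<(&'static str, Duration)> {
        self.stages.iter().copied().max_by_key(|&(_, elapsed)| elapsed)
    }

    /// One line per stage, in execution order.
    pub fn report(&self) -> String {
        self.stages
            .iter()
            .map(|(name, elapsed)| format!("{}: {:.1} ms", name, elapsed.as_secs_f64() * 1000.0))
            .collect::<Vec<_>>()
            .join("\n")
    }
}
```

Wrapping each stage as `trace.time("classify_face_exact", || ...)` and printing `trace.report()` per pair is enough to show which stage dominates before committing to an optimization.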

Both multi-week projects got cancelled. The fix for both problems was a SolidClassifier: about 50 lines of acceleration structure plus a dispatcher refactor. The agents wrote it in hours.

The pattern that emerged

This happened repeatedly. We’d look at the failure data, form an intuition about the cause, queue a project, then write a diagnostic that told us the intuition was wrong. The diagnostic would point at the real problem, and the real fix would be smaller and faster than the queued plan.

A validation diagnostic showed 14 Euler violations and 8 non-manifold edges on import. This drove a line-edge reconciliation fix in the STEP importer that closed 6 topology failures. We’d assumed those failures were in the boolean pipeline. They were in the import.
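The text does not say how the diagnostic defined its checks, so the following is only a simplified sketch of the idea. It assumes each face is a single closed loop of vertex indices and tests the closed genus-0 case, where V - E + F should equal 2 and every edge should be shared by exactly two faces. A real B-rep check would use the full Euler-Poincaré relation with inner loops, shells, and genus.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Summary of a simplified topology check on one shell.
#[derive(Debug, PartialEq)]
pub struct ShellReport {
    /// V - E + F; 2 for a closed genus-0 shell with simple faces.
    pub euler_characteristic: i64,
    /// Edges used by exactly one face (the shell is open there).
    pub boundary_edges: usize,
    /// Edges used by more than two faces.
    pub non_manifold_edges: usize,
}

/// `faces` holds one closed loop of vertex indices per face (a simplification).
pub fn check_shell(faces: &[Vec<usize>]) -> ShellReport {
    let mut vertices = BTreeSet::new();
    let mut edge_uses: BTreeMap<(usize, usize), usize> = BTreeMap::new();
    for face in faces {
        for (i, &a) in face.iter().enumerate() {
            let b = face[(i + 1) % face.len()];
            vertices.insert(a);
            // Store each edge with its endpoints ordered so both directions match.
            *edge_uses.entry((a.min(b), a.max(b))).or_insert(0) += 1;
        }
    }
    let v = vertices.len() as i64;
    let e = edge_uses.len() as i64;
    let f = faces.len() as i64;
    ShellReport {
        euler_characteristic: v - e + f,
        boundary_edges: edge_uses.values().filter(|&&n| n == 1).count(),
        non_manifold_edges: edge_uses.values().filter(|&&n| n > 2).count(),
    }
}
```

Running a check like this immediately after import, before any boolean, is what separates importer defects from defects in the boolean pipeline.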

Each time the lesson was the same: verify first, then commit. The agents could write a diagnostic harness in about an hour. That hour saved days.

Making time part of correctness

A boolean that takes 30 seconds is not a successful boolean for a CAD user. So we made time budgets part of the correctness criterion, not a tunable we could loosen when a case ran long.

We used three layers. A pipeline budget of 8 seconds, checked between major stages, returns a clean timeout error. A per-call SSI budget of 200ms caps the surface intersection marcher per face pair so one pathological pair can’t consume the whole pipeline. A harness watchdog of 10 seconds catches anything escaping the inner budgets.
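The two inner layers can be sketched with a shared deadline type. The limits come from the text; everything else is an assumption for illustration, including the stage names, the `Budget` type, and modelling the marcher as a closure that yields one sample per step. The outer harness watchdog sits outside this code.

```rust
use std::time::{Duration, Instant};

pub const PIPELINE_BUDGET: Duration = Duration::from_secs(8);
pub const SSI_CALL_BUDGET: Duration = Duration::from_millis(200);
pub const WATCHDOG_LIMIT: Duration = Duration::from_secs(10);

/// A clean error the pipeline can return instead of hanging.
#[derive(Debug, PartialEq)]
pub enum BooleanError {
    /// The budget ran out; `stage` is the stage that had just finished.
    Timeout { stage: &'static str },
}

/// A deadline fixed at construction time.
pub struct Budget {
    deadline: Instant,
}

impl Budget {
    pub fn start(limit: Duration) -> Self {
        Budget { deadline: Instant::now() + limit }
    }

    pub fn expired(&self) -> bool {
        Instant::now() >= self.deadline
    }

    pub fn check(&self, stage: &'static str) -> Result<(), BooleanError> {
        if self.expired() {
            Err(BooleanError::Timeout { stage })
        } else {
            Ok(())
        }
    }
}

/// Layer 1: check the pipeline budget between major stages (two shown here).
pub fn run_pipeline(
    limit: Duration,
    intersect: impl FnOnce(),
    classify: impl FnOnce(),
) -> Result<(), BooleanError> {
    let budget = Budget::start(limit);
    intersect();
    budget.check("intersect")?;
    classify();
    budget.check("classify")?;
    Ok(())
}

/// Layer 2: cap one marcher call. Returns the samples collected so far and
/// whether the call was cut short by its budget.
pub fn march_with_budget(
    limit: Duration,
    mut step: impl FnMut() -> Option<f64>,
) -> (Vec<f64>, bool) {
    let budget = Budget::start(limit);
    let mut samples = Vec::new();
    while let Some(sample) = step() {
        samples.push(sample);
        if budget.expired() {
            return (samples, true);
        }
    }
    (samples, false)
}
```

Because the pipeline check only runs between stages, a single stage can still overrun 8 seconds. That is the gap the per-call budget and the 10-second watchdog are there to cover.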

The 8-second limit shaped every engineering decision. The question for any optimization was “can we make this fit in 8s on real CAD.” This eliminated approaches that would produce correct output at non-interactive speed.

What moved the number

The reliability journey, in order:

The instinctively prioritized work (algebraic SSI subsystems, BVH spatial culling, Bezier clipping) moved the number less than the boring work (instrumentation, deterministic ordering, soft-accept calibration, classify-stage caching). If we’d guessed instead of measured, we would have done the work in the opposite order and spent weeks on changes that didn’t matter.

For a reliability-bound system, the binding constraint is rarely where intuition places it. A diagnostic harness that takes an hour to write and points you at the right problem is worth more than a month of work on the wrong one.

How the agent loop worked

Zooming out: the whole approach is real-time CI applied to agentic authorship. The agent makes a change, the harness runs, the agent sees whether reliability went up or down and in which failure category, then it iterates.

This extends to anything with a verifiable environment. The agent could call a local API to check values, hit a live test deployment to validate integration, or run against a dataset to check correctness. The environment provides the signal and the agent iterates until it converges.

The signal needs to be fast (so the agent can run many iterations) and honest (so it’s not chasing measurement noise). If you can also categorize failures by type, the agent can target specific problems rather than flailing at a pass/fail binary. Getting the signal right was half the project. Once it was right, the agents converged.

Not a free lunch, but a silver bullet

This approach is not free. Compute costs add up. Agent loops that run for hours burn tokens and CPU time. You need good data or a good environment to optimize against. If your verification signal is noisy or incomplete, the agent will converge on something that passes checks but doesn’t work. Testing and specification discipline still apply.

But I think this is a silver bullet in the sense Fred Brooks used the term in “No Silver Bullet”: a single development that delivers an order-of-magnitude improvement in productivity. The constraint on what you can build is no longer human engineering time. It’s what you can compute. If you can generate the test cases, construct the dataset, and define what “correct” looks like in a way the agent can check, you can build the system. You shift from writing the software to defining the problem and providing the verification signal. The cost of searching program space has dropped to the price of compute, and compute gets cheaper every year.

What it cost

I’m not sure yet what this means economically. When I had Claude report its session cost using Anthropic’s default pricing, the estimate for each agent session was around $150. I’m on the 5x Claude Max plan at $100/month, so I’m subsidized. Without that subsidy, running a fleet of agents on a frontier model would be expensive.

Based on the status line cost tracking script, I used about $900 in credits to build the kernel. For a weekend of work that produced a functional B-rep kernel, that’s a good deal. But it scales with problem complexity and model choice. Smaller models cost less per token but may need more iterations. Frontier models converge faster but cost more.

The cost is a moving target. Inference costs drop. What cost $900 this year might cost $90 next year. The pattern (agent fleet optimizing against a dataset) works with any model. As cost per token drops, the range of problems where this makes economic sense widens.