What Building a Baby Monitor Taught Me About ML and Coding Agents

2026-04-13 · 18 min read

Part 2 of 2. Part 1 was the build — architecture, model cascade, training pipeline, and dashboard. This post is the meta: six opinions the project left me holding more strongly, six failures that taught me more than the successes, and the new failure modes coding agents introduce that nobody is naming yet.

Short recap if you skipped part 1: BILBO is a local-only baby monitor running on a 2014 Mac mini, now packaged as a four-container Docker stack. A 3-stage MobileNetV3 cascade decides whether the baby is awake; a Flask dashboard lets me review frames and correct labels; every inference runs on-device except an under-1% tail that falls back to GPT-4o (and that tail has been zero on six of the last seven days).

Part 1 was the system. This post is the opinions I'd have wanted to read before starting.

The post is laid out as an argument. Four tradeoffs open it with the numbers — what the data actually said when I tested the obvious alternatives. Six opinions are what those numbers convinced me of — things I'd read about hobby ML before starting and nodded along with, but now hold from experience rather than from reading. Six failures are the concrete bugs and dead-end designs that produced each opinion — the mistakes you have to ship to actually believe the rule. A section on coding agents covers what this generation of tools changed about how hobby projects get built, and the failure modes that came with it. And at the end, a short pitch for why a project like this is the single best applied-ML learning vehicle I've found.

Tradeoffs with numbers

Most of what I've read about hobby ML is qualitative — "I tried the bigger model and it didn't help," "the pre-trained detector didn't work." I found that unsatisfying when I was trying to decide what to try myself, so before I get to the opinions, here are the actual numbers from BILBO's history for the decisions that moved (or didn't move) the needle. These tables are what the opinions further down are built on.

Tradeoff 1: Hardware cost

Item	Cost	Notes
TP-Link Tapo C100 IP camera	$25	RTSP, 1080p, IR night vision
Gooseneck mic stand (mount)	$15	Clamps to the bassinet rail
2014 Mac mini	$0	Already owned; would otherwise be e-waste
Python, PyTorch, ffmpeg, SQLite, Flask	$0	All open source
Total	~$40	One-time; no recurring

The most important entry in that table is the $0 next to "Mac mini." If you had to buy a computer for this, the economics shift — though even a fresh $150 Mac mini would amortize against a few months of a commercial baby monitor subscription, and then every month after that is free.

Tradeoff 2: Did a bigger model help? (No.)

I spent a couple of evenings testing whether more model capacity or higher input resolution would move the eye-state metric. They didn't:

Configuration	Parameters	Per-frame latency	Eye-state macro F1
MobileNetV3-Small, 224×224 (chosen)	~2.5M	~40 ms	0.97
MobileNetV3-Large, 224×224	~5.4M	~2× slower	no measurable gain
MobileNetV3-Small, 384×384	~2.5M	~2× slower	no measurable gain
MobileNetV3-Large, 384×384	~5.4M	~3× slower	no measurable gain

The ceiling wasn't the model — it was the pixels. Once I added a face detector and the eye-state model was looking at a tight ~40-pixel face crop instead of an 8-pixel-tall eye region on the full bassinet view, the small model had all the signal it needed.

Tradeoff 3: Pre-trained vs. custom face detector

This was the biggest single accuracy win in the project — and the least expected:

Metric	YuNet (pre-trained ONNX)	Custom MobileNetV3-Small
Training data	Millions of adult daytime portraits	~780 hand-labeled frames from my own bassinet
Detection rate on baby-present frames	53%	100% (validation set)
Night vision (IR grayscale)	Falls off a cliff	Matches daytime
Single-frame latency	~10 ms	~40 ms
Effort to produce	0 (download + load)	~10h bbox labeling + one afternoon training

The custom MobileNetV3 is what the pipeline runs today. YuNet is kept as a fallback when the custom detector fails to load.

Tradeoff 4: Per-stage classifier results

For completeness, here are the current production metrics for each stage of the cascade, measured against reviewed-and-corrected ground truth on the currently deployed checkpoint over a recent 7-day window:

Stage	Model	Input	Output	Production metric (last 7 days)
1. Presence	MobileNetV3-Small	Bassinet crop	present / not_present	~99% agreement on reviewed frames
2. Face detection	MobileNetV3-Small (regression head)	Bassinet crop	bbox or no_face	99.7% detection rate (2,712 / 2,719 present frames); mean IoU 0.74
3. Eye state	MobileNetV3-Small	~40×40 face crop from stage 2	eyes_open / eyes_closed	macro F1 0.95 on 2,651 reviewed frames

The whole cascade runs in 80–130 ms on the Mac mini's CPU. No GPU, no accelerator, no quantization. The bottleneck in the end-to-end tick is the ffmpeg RTSP grab, not the model inference.

Here's the eye-state confusion matrix over the last 7 days of production traffic:

Eye-state confusion matrix on a recent 7-day window of production traffic (2,651 reviewed frames, macro F1 0.95). Rows are actual labels, columns are predicted.

Actual	Predicted eyes_open	Predicted eyes_closed	Per-class F1
eyes_open	497 (94.1%)	31 (5.9%)	0.92
eyes_closed	56 (2.6%)	2,067 (97.4%)	0.98

The row support is 528 eyes_open frames and 2,123 eyes_closed frames — a roughly 4:1 class skew, which mirrors real-world behaviour (the baby is asleep more than awake). Per-class F1 is 0.92 and 0.98 respectively. The eyes_open class is still the weaker one and the place where misclassification has the most impact on downstream alerts; opinion #5 below is about exactly that.
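The per-class numbers can be rechecked from the four cell counts alone. The sketch below does that arithmetic in plain Python; the dictionary layout and the function name are mine, not BILBO's, and only the counts come from the matrix.

```python
# Cell counts from the 7-day confusion matrix (rows = actual, columns = predicted).
CONFUSION = {
    "eyes_open": {"eyes_open": 497, "eyes_closed": 31},
    "eyes_closed": {"eyes_open": 56, "eyes_closed": 2067},
}


def per_class_metrics(confusion):
    """Return {label: (precision, recall, f1, support)} for a square confusion dict."""
    labels = list(confusion)
    out = {}
    for label in labels:
        tp = confusion[label][label]
        support = sum(confusion[label].values())              # row total
        predicted = sum(confusion[a][label] for a in labels)  # column total
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        out[label] = (precision, recall, f1, support)
    return out


metrics = per_class_metrics(CONFUSION)
macro_f1 = sum(m[2] for m in metrics.values()) / len(metrics)

for label, (p, r, f1, n) in metrics.items():
    print(f"{label:12s} precision={p:.3f} recall={r:.3f} f1={f1:.2f} support={n}")
print(f"macro F1 = {macro_f1:.2f}")
```

The number this surfaces that the summary hides is eyes_open precision: 497 of the 553 frames predicted open were actually open, which is about 0.90.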

Six opinions the project left me holding more strongly

The first two are about the privacy architecture, since the choice to go local shaped every subsequent engineering decision. The next four are about the craft of shipping a small ML system once you've made that choice.

1. Privacy is an architecture, not a policy.

Every commercial baby monitor I looked at markets "end-to-end encryption" and "SOC 2." Both are meaningful controls. Neither changes the fact that the architecture is a perpetual outbound firehose of video frames of my child into someone else's storage, 24/7, forever. The blast radius of a vendor compromise is years of full-resolution footage of every customer's baby. No policy rewrites that.

BILBO's blast radius is different in kind, not degree. There is no listening socket on my public IP. The dashboard tunnel only carries bytes when I actively open the page; those bytes are HTML, JSON, and a cropped still I'm currently reviewing. If my Cloudflare account is compromised tomorrow, the worst case is an attacker sees the dashboard. There is no archive to steal because there is no archive.

That's not a privacy policy. That's a shape. The industry talks about privacy like it's a compliance checklist. It's a data-flow diagram.

2. On-device ML isn't a privacy feature. It's a reliability feature that happens to be private.

When BIRDEYE (the on-device cascade) was the shadow model and GPT-4o was authoritative, an OpenAI outage, or just a slow network, would make my baby monitor degrade in ways I couldn't predict. Flipping the pipeline so the local cascade decides first changed the numbers you'd expect (cost dropped ~95%; median latency went from ~1,200 ms to ~130 ms). It also changed the numbers you might not:

Works when home internet is down: no → yes
Works when OpenAI has an outage: no → yes (cloud path degrades to low_confidence)
Blast radius of a cloud compromise: every frame → under 1% of frames (currently 0)

Privacy is the part of this that gets headlines. The part that actually matters at 3am is that my baby monitor works when my ISP doesn't. Every system that puts an LLM in the hot path of a safety-relevant decision is buying the cloud provider's failure modes. Most people don't think of that as a design choice because the default tutorial stack makes it invisible.

3. Thresholds are policy, not ML.

The reflex when you get a model's precision-recall curve is to pick a single "best" operating point and move on. That's treating thresholding as a model decision. It's a product decision, and it's a different decision per alert.

BILBO has two alerts on the same eye-state model:

Wake alert (baby is waking up, ping me on Telegram). I weight recall heavily — missing a real wake event is worse than a false ping. Gate: loose 2-of-3 rule.
Edge alert (baby pressed against the bassinet side; the one I later deleted — more on that below). I weighted precision heavily. A baby monitor that false-alarms twice a day will get muted, and a muted alert is the same as no alert. I needed precision near 1.0 even if recall suffered.

Same model. Different thresholds. Different operating points. Different costs to the user on each side of the matrix. The model doesn't have a threshold. The product does, and it has one per alert. Everything I've seen about model deployment treats "pick a threshold" as the last step of training. It's actually the first step of product design.
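One way to make "the product has the threshold" concrete is to keep the operating point in a per-alert policy object rather than in the model. This is a minimal sketch, not BILBO's code: the class names, the score thresholds, and the second "strict" policy are all hypothetical, and only the 2-of-3 shape of the wake gate comes from the description above.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertPolicy:
    name: str
    threshold: float  # per-frame score needed to count as a hit
    k: int            # hits required ...
    n: int            # ... within the last n frames


class KOfNGate:
    """Fires when at least k of the last n scores clear the policy threshold."""

    def __init__(self, policy):
        self.policy = policy
        self.hits = deque(maxlen=policy.n)

    def update(self, score):
        self.hits.append(score >= self.policy.threshold)
        return sum(self.hits) >= self.policy.k


# Hypothetical operating points for two alerts fed by the same model scores.
WAKE = AlertPolicy("wake", threshold=0.50, k=2, n=3)        # recall-weighted
STRICT = AlertPolicy("strict", threshold=0.95, k=3, n=3)    # precision-weighted

scores = [0.40, 0.70, 0.60, 0.97, 0.96, 0.98]  # made-up per-frame scores
wake_gate, strict_gate = KOfNGate(WAKE), KOfNGate(STRICT)
wake_fired = [wake_gate.update(s) for s in scores]
strict_fired = [strict_gate.update(s) for s in scores]

print("wake:  ", wake_fired)
print("strict:", strict_fired)
```

On the same six scores the loose gate fires from the third frame onward, while the strict gate waits until the last one. Changing either alert's behavior is then a config edit, with no retraining involved.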

4. Ground truth is a product decision you have to write down.

Early on I treated GPT-4o's labels as the source of truth for evaluating BIRDEYE. It was convenient — the labels were right there in the database. When I eventually spot-checked, ~5% of "asleep" labels were wrong, usually mid-blink during motion. Training my model to match the cloud API exactly was teaching it to copy a 5% error rate, and my reported accuracy was bounded above by a number I'd never measured.

The fix was one sentence committed to the README: only frames I've manually reviewed or corrected count as ground truth. Raw cloud API output is training signal, not ground truth. That sentence changed how the dashboard worked (the "Reviewed" checkbox exists to enforce it), how the label priority pipeline in training works (human correction > human review > cloud API), and how backtests are interpreted.

I think this is the most common unforced error in applied ML. If you haven't written down, in one sentence, what counts as truth in your system, you don't have ground truth — you have a convenience sample. The distinction between "label I trust" and "label I have" collapses quietly, and you calibrate against the second while believing you're calibrating against the first.
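The label priority rule (human correction > human review > cloud API) is small enough to encode directly. The sketch below assumes each frame carries a mapping from label source to label; the source names, the record layout, and the function names are hypothetical.

```python
# Priority order from the README rule: human correction > human review > cloud API.
PRIORITY = ("human_correction", "human_review", "cloud_api")

# Only human-touched labels count as ground truth; cloud output is training signal.
GROUND_TRUTH_SOURCES = frozenset({"human_correction", "human_review"})


def resolve_label(candidates):
    """Pick the highest-priority label from a {source: label} mapping.

    Returns (label, source, is_ground_truth), or None when no source has a label.
    """
    for source in PRIORITY:
        label = candidates.get(source)
        if label is not None:
            return label, source, source in GROUND_TRUTH_SOURCES
    return None


def ground_truth_only(frames):
    """Keep only the frames whose resolved label is eligible for evaluation."""
    kept = {}
    for frame_id, candidates in frames.items():
        resolved = resolve_label(candidates)
        if resolved is not None and resolved[2]:
            kept[frame_id] = resolved[0]
    return kept


# Made-up frames for illustration.
frames = {
    "f1": {"cloud_api": "asleep"},
    "f2": {"cloud_api": "asleep", "human_correction": "awake"},
    "f3": {"cloud_api": "awake", "human_review": "awake"},
}
print(ground_truth_only(frames))
```

Frame f1 still resolves to a usable training label, but it never reaches an evaluation set, which is the whole point of writing the rule down.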

5. Aggregate metrics are structurally optimistic, not just misleading.

There's a lazy version of this lesson ("aggregate metrics can be misleading") that most people nod at and then ignore. The stronger version is that macro F1, accuracy, and AUC are structurally optimistic on the problems you actually care about, because they average over a test distribution that almost never represents the decision surface where things go wrong.

The interesting mistakes aren't uniformly distributed. They cluster on the hard cases: partially closed eyes, IR shadows, motion blur, profile views, occluded faces. Those cases are underrepresented in the test set by construction — they're rare in the raw data, and label noise makes them doubly hard to capture. So the aggregate metric is a weighted average where the interesting weights are tiny.

The fix I use now, which is more discipline than cleverness:

Per-class recall is mandatory. Never look at macro F1 without it.
Maintain a separate "adversarial" eval set of hand-curated hard cases. Report it alongside the standard metric. If they diverge, the standard metric is wrong for your problem.
Count the rare class's examples. If it's single digits, you don't have a measurement; you have a rumor.
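The three rules above fit in a few lines. This is a sketch under my own assumptions: the cutoff of 10 is one reading of "single digits", the function names are hypothetical, and the example pairs are synthetic.

```python
from collections import Counter

MIN_SUPPORT = 10  # assumed cutoff: fewer examples than this is a rumor, not a measurement


def recall_with_support(pairs, min_support=MIN_SUPPORT):
    """pairs: iterable of (actual, predicted) labels.

    Returns {label: (recall, support)}; recall is None when support is too small.
    """
    support, correct = Counter(), Counter()
    for actual, predicted in pairs:
        support[actual] += 1
        correct[actual] += actual == predicted
    return {
        label: ((correct[label] / n) if n >= min_support else None, n)
        for label, n in support.items()
    }


def recall_gap(standard, hard):
    """Recall drop (standard minus hard) for classes measurable on both sets."""
    gaps = {}
    for label, (std_recall, _n) in standard.items():
        hard_recall = hard.get(label, (None, 0))[0]
        if std_recall is not None and hard_recall is not None:
            gaps[label] = std_recall - hard_recall
    return gaps


# Synthetic data, for illustration only.
standard_pairs = (
    [("eyes_closed", "eyes_closed")] * 40
    + [("eyes_open", "eyes_open")] * 11
    + [("eyes_open", "eyes_closed")] * 1
)
hard_pairs = (
    [("eyes_open", "eyes_open")] * 6
    + [("eyes_open", "eyes_closed")] * 4
    + [("eyes_closed", "eyes_closed")] * 4  # only 4 examples: too few to report
)

standard = recall_with_support(standard_pairs)
hard = recall_with_support(hard_pairs)
gaps = recall_gap(standard, hard)
print(standard, hard, gaps, sep="\n")
```

Returning None for an under-supported class is deliberate: a missing number is harder to misread than a recall computed from four frames.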

6. Training distribution dominates model capacity — and nobody acts like it.

The two ablation tables above ("Did a bigger model help? (No.)" and "Pre-trained vs. custom face detector") tell the same story from different angles.

The capacity sweep says: bigger backbone, bigger input, every combination — none of it moved macro F1. What did move it was giving the same small model a 40-pixel face crop instead of an 8-pixel-tall eye region on the full bassinet view. Better inputs, same model — the whole win.

The YuNet table says: a mature pre-trained detector with every structural advantage (millions of training images, years of tuning, a team smarter than me) hit 53% detection rate on baby-present frames. A MobileNetV3 I fine-tuned on ~780 hand-labeled frames from my own bassinet hit 100% on the validation set. YuNet wasn't bad at face detection; it was trained on the wrong distribution (adult daytime portraits) for my task (IR night footage of a swaddled infant).

Combined, the two tables say the same thing: training distribution dominates model capacity. 780 examples of the right thing beat millions of examples of the wrong thing. Same architecture with better inputs beats bigger architecture with worse inputs. This is the whole story of applied ML today and it keeps getting drowned out by the model-size arms race. If you're spending weeks debating backbones while your training data is 300 frames of the actual conditions you care about, you're on the wrong axis. When accuracy plateaus, the answer is almost never "bigger model" — it's "give the model better inputs" or "fix the distribution." Scale is the most expensive possible fix for a data problem.

One more, learned in the refactor: build the monolith first, then split it

The six above are what I believed when I first wrote this up. This one I only earned afterward, taking BILBO apart and putting it back together — and it's the one I'd most want to hand the version of me who started.

BILBO began as a monolith, and deliberately so. One launchd job fired a single Python script once a minute. The dashboard was one Flask app that reached straight into the pipeline's internals — sys.path hacks, direct SQLite reads, shelling out to the training script by subprocess. No API boundary, no service contract, no container. By every textbook this is the "wrong" architecture, and for a long stretch it was exactly right. While the design was still churning — was the face detector its own stage, did eye-state need a separate crop, where did the cloud fallback sit — every abstraction boundary I might have drawn would have been drawn in the wrong place. A monolith has no boundaries to get wrong. I could move a function from the dashboard into the pipeline in a single commit because there was no contract to renegotiate.

The seams only became visible once the design stopped moving. After the BIRDEYE-primary flip the shape settled, and the right boundaries were suddenly obvious — because the code had been telling me where they were the whole time. The dashboard dragged the entire PyTorch stack into its process just to read rows from SQLite. The thing that talked to the camera shared an interpreter with the thing I pointed my phone at. Training competed with live capture for the same process. Those weren't design decisions; they were accidents of having started as one process. The refactor into four containers (Part 1 has the layout) wasn't me imposing an architecture — it was me cutting along seams the monolith had already worn into the code.

Build the monolith first. Let it run long enough to show you where it wants to come apart, then split it there. The standard advice — "design your service boundaries up front" — assumes you know the boundaries before you've built the thing, which on an exploratory project you never do. Premature containerization would have cost me real time maintaining contracts between components whose responsibilities hadn't settled. Doing it last meant every boundary landed in the right place on the first try — including the one that matters most. The privacy boundary is now also a process boundary: the dashboard is a container that shares no code with the one that can see the camera, and that fell out of the refactor for free precisely because I waited until the shape was known.

What went wrong

Six failures that taught me more than any of the successes.

The deepcopy bug. I saved best-epoch weights with best_state = model.state_dict().copy(). PyTorch's state_dict().copy() is a shallow copy — the outer dict is fresh, the tensors inside are still live references to Parameter.data. Every subsequent optimizer.step() silently mutated the "saved" snapshot. Eight model versions shipped to production with last-epoch weights instead of best-epoch weights. The validation metrics in my training log described models that had never been deployed. Fix: copy.deepcopy(model.state_dict()). The deeper lesson: your training metric and your deployment artifact are connected by a chain of assumptions, any one of which can be wrong silently. Audit the chain, not just the endpoints.
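The aliasing is easy to reproduce without PyTorch. In this dependency-free sketch, plain lists stand in for tensors and a toy optimizer_step mutates them in place; both are stand-ins I chose so the example runs anywhere, not the real training loop.

```python
import copy

# Stand-in for model.state_dict(): a dict whose values are mutable and updated in place.
params = {"weight": [0.1, 0.2], "bias": [0.0]}


def optimizer_step(state):
    """Mutate every parameter in place, the way an optimizer updates tensors."""
    for values in state.values():
        for i in range(len(values)):
            values[i] += 1.0


shallow_snapshot = params.copy()        # new outer dict, same inner lists
deep_snapshot = copy.deepcopy(params)   # new outer dict, new inner lists

optimizer_step(params)

# The shallow "snapshot" followed the live parameters; the deep one did not.
print("shallow:", shallow_snapshot)
print("deep:   ", deep_snapshot)
print("aliased:", shallow_snapshot["weight"] is params["weight"])
```

A cheap guard in a real training loop is to compare the saved snapshot against the live weights after the next step: if they are still equal, the snapshot is an alias.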

The schema migration drift. I renamed shadow_birdeye_state → shadow_birdeye_present, updated the writer, and missed one reader in the dashboard's safety stats endpoint. Nothing in SQLite ties the readers in my code to the current schema, and the lookup of the old column name quietly came back as None instead of raising. The dashboard failure mode was "show stale data" rather than "throw." It was broken for two days before I noticed. Since then I'm much more willing to keep legacy column names forever and route writes through a single helper.
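Here is one way a renamed column can read as None rather than raise, plus a helper that fails loudly instead. It assumes the reader runs SELECT * and does a dict-style lookup, which may not be what the dashboard actually did; the ticks table and the read_column helper are hypothetical, and only the two column names come from the story.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(
    "CREATE TABLE ticks (id INTEGER PRIMARY KEY, shadow_birdeye_present INTEGER)"
)
conn.execute("INSERT INTO ticks (shadow_birdeye_present) VALUES (1)")

row = conn.execute("SELECT * FROM ticks").fetchone()

# Silent failure: looking up the old name on a dict copy returns None, no exception.
stale = dict(row).get("shadow_birdeye_state")


def read_column(row, name):
    """Single read helper: raise if the column is not in the result set."""
    if name not in row.keys():
        raise KeyError(f"column {name!r} missing; row has {row.keys()}")
    return row[name]


print("stale lookup:", stale)
print("checked read:", read_column(row, "shadow_birdeye_present"))
```

Routing reads through one helper like this turns a rename into an immediate KeyError in every reader that was missed.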

The JSONL byte-budget bug. The temporal state smoother looks back at the last n frames to apply a 4-of-6 rule over the eye-state signal. The history read — lib.storage.get_recent_entries — used a fixed budget of n * 600 bytes when seeking into the JSONL tail. Fine when entries were 500 bytes; broken once entries had grown to ~1.4 KB (shadow dict, experiments dict, face bbox), because asking for 5 frames was quietly returning 2. The rule could never fire, every present frame fell through to carry-forward, and ~24 hours of in-bassinet time cascaded into Unknown. The timeline showed 498 consecutive Unknown blocks over what should have been a clear Asleep stretch before I noticed. Fix was two-part: switch the live smoother's history read to SQLite (an indexed LIMIT query), and make the JSONL tail read adaptive (retry on underflow). The deeper lesson is about silent undercounts in rolling-window logic: the consumer can't distinguish between "history had no matching run" and "history was truncated before the rule could see the run." A rolling-window rule should assert it got the window size it asked for — anything less is just as invisible as a bug in the rule itself.
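The underflow and both halves of the fix can be sketched in a few dozen lines. This is not lib.storage.get_recent_entries; the function names, the doubling retry, and the padded demo entries are assumptions, with only the n * 600 budget and the ~1.4 KB entry size taken from the bug.

```python
import json
import os
import tempfile


def read_tail_entries(path, n, bytes_per_entry=600, adaptive=True):
    """Return up to the last n JSONL entries by seeking near the end of the file.

    With adaptive=False this reproduces the fixed-budget read; with adaptive=True
    the byte budget doubles until n entries are found or the whole file is read.
    """
    size = os.path.getsize(path)
    budget = n * bytes_per_entry
    while True:
        start = max(0, size - budget)
        with open(path, "rb") as f:
            f.seek(start)
            chunk = f.read()
        lines = chunk.split(b"\n")
        if start > 0:
            lines = lines[1:]  # the first line is probably cut mid-entry; drop it
        entries = [json.loads(line) for line in lines if line.strip()]
        if len(entries) >= n or start == 0 or not adaptive:
            return entries[-n:] if n else []
        budget *= 2  # underflow: widen the window and retry


def require_window(entries, n):
    """Make a short window loud instead of letting the rule run on partial history."""
    if len(entries) < n:
        raise ValueError(f"asked for {n} entries, got {len(entries)}")
    return entries


# Demo: ten entries of roughly 1.4 KB each.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "history.jsonl")
    with open(path, "w", newline="\n") as f:
        for i in range(10):
            f.write(json.dumps({"i": i, "pad": "x" * 1380}) + "\n")

    fixed = read_tail_entries(path, 5, adaptive=False)
    adaptive = read_tail_entries(path, 5)

print(len(fixed), len(adaptive))
```

With these sizes the fixed-budget read returns 2 of the 5 requested entries, because 5 × 600 = 3,000 bytes only covers two complete ~1.4 KB lines. A real caller would need to decide what require_window should do at startup, when fewer than n entries legitimately exist.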

The bidirectional retrain regression. I shipped a presence-classifier retrain after what looked like a clean training run — validation macro F1 was within a percent of the previous deploy. The next morning, the dashboard was lighting up with cloud-fallback calls; one tick crashed on a timeout inside the fallback and left a 2.5-hour DB gap. The A/B against the previous checkpoint on the same frames was unambiguous: on confirmed-present frames, the two preceding checkpoints scored 30/30 each; the new one scored 3/30. On confirmed-empty frames, every other checkpoint scored p_present ≤ 0.02; the new one scored p_present = 0.86–0.999 on every one. The training set had picked up a new bassinet-cover pattern and the model had found a spurious texture correlation — its entire decision surface had moved onto the cover. The FNs on present frames and the FPs on empty frames approximately cancelled, so aggregate macro F1 didn't move. The deeper lesson is about bidirectional regressions and deploy gates: a retrain can be worse in both directions at once, and any summary statistic that averages across them will hide it. After this one, my deploy gate is per-class delta vs. the previous checkpoint and disagreement rate on a frozen A/B frame set — aggregate F1 is evidence of nothing.
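A gate of that shape might look like the sketch below. The function names and both tolerances are hypothetical, and the frozen set is synthetic: the 30/30 and 3/30 present-frame scores mirror the A/B above, while the 30 empty frames are a made-up count.

```python
from collections import Counter


def per_class_recall(labels, preds):
    """Recall per class for parallel lists of actual and predicted labels."""
    support, correct = Counter(labels), Counter()
    for actual, predicted in zip(labels, preds):
        if actual == predicted:
            correct[actual] += 1
    return {label: correct[label] / n for label, n in support.items()}


def deploy_gate(labels, old_preds, new_preds,
                max_recall_drop=0.02, max_disagreement=0.05):
    """Return (ok, report) comparing a candidate checkpoint to the deployed one."""
    old_recall = per_class_recall(labels, old_preds)
    new_recall = per_class_recall(labels, new_preds)
    deltas = {label: new_recall[label] - old_recall[label] for label in old_recall}
    disagreement = sum(o != n for o, n in zip(old_preds, new_preds)) / len(labels)
    ok = (
        all(delta >= -max_recall_drop for delta in deltas.values())
        and disagreement <= max_disagreement
    )
    return ok, {"recall_delta": deltas, "disagreement_rate": disagreement}


# Frozen A/B set: 30 confirmed-present frames, 30 confirmed-empty frames (synthetic).
labels = ["present"] * 30 + ["empty"] * 30
old_preds = list(labels)                                   # previous checkpoint: all correct
new_preds = ["present"] * 3 + ["empty"] * 27 + ["present"] * 30

ok, report = deploy_gate(labels, old_preds, new_preds)
print(ok, report)
```

Disagreement rate needs no labels at all, so the frozen set can be much larger than the reviewed one. It answers a narrower question than accuracy: did the decision surface move?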

The geometric edge alert. I wanted a "baby pressed against the side" alert before I had a classifier for it, so I wrote a geometric heuristic: face bbox in bottom 30% of the bassinet + high presence confidence + 2-of-3 recent frames agree. Backtested against 7 days of labels:

Metric	Result
Recall	79%
Precision	6%
False positives per true positive	~15

Parameter sweep couldn't get precision above 11%. I deleted the heuristic and opened an issue for a proper trained classifier. The decision-relevant lesson: a rigorous negative result is a finished piece of work. I wanted to ship the heuristic. The data said no. Running the backtest and honoring it is the discipline, not the code.

The "head crop from stale cloud coordinates" idea. Before I had a face detector, I tried to seed the on-device eye-state crop using head coordinates from the cloud API's last call. Babies move on a sub-minute timescale. At the 4-minute cadence I was running, the "head crop" was routinely an empty pillow or a corner of the mattress. I should have derived this from first principles; I had to look at the frames to see it. Generalized: any time you're reusing a signal across a time gap, you're making an assumption about the signal's decorrelation rate. Make that assumption explicit before you build on it.

Coding agents: a different gear, with new failure modes nobody's naming yet

Honest disclosure: most of this project was built collaboratively with a coding agent. I'm not going to pretend otherwise, because the current generation of tools is genuinely a different gear for hobby projects, and being coy about that obscures what's actually changed.

The high-signal version of what I learned:

What changed: friction collapsed, so projects finish.

Flask scaffolding, SQLite schema, Telegram integration, launchd plists, argparse wiring — all of this used to be "sit down with the docs for an hour per piece." Now it's minutes. The bottleneck moves from "how do I write this" to "what do I want this to do."

That's not a 2× speedup. It's a phase change, because the thing that actually kills hobby projects isn't the total amount of work — it's the friction of any individual step. I started a dozen ML hobby projects in the previous five years and shipped zero. I shipped this one. The difference isn't discipline. It's that every step's activation energy dropped by an order of magnitude.

What didn't change: judgment.

Three things the agent can't do. I think these gaps are where the interesting product-of-ML questions will live for the next few years:

It can't tell you your validation set is bad. The agent will compute whatever metric you ask for, against whatever data you point it at. It cannot tell you that your "ground truth" shares a source with your training labels and your accuracy is therefore meaningless. The cloud-API-as-truth mistake would have been just as invisible to the agent as it was to me until I said "spot-check these."

It can't tell you when to stop iterating. It will happily keep tuning. The "this is good enough, ship it" decision needs product taste that doesn't fit in a context window.

It can't tell you the right answer is to delete your work. When the geometric edge alert backtest came back terrible, the agent flagged the numbers, which was good. It didn't say "delete the branch." I had to say that. Which brings me to the new failure modes.

The new failure modes nobody's naming yet

Two things I've noticed working with coding agents daily on this project that I haven't seen discussed honestly, and that I think matter more than most of the benchmarks we argue about:

Collaborative sunk-cost bias. When an agent helps you build something, it subsequently advocates for that thing in ways that feel collaborative but function as sunk-cost pressure. When I said "this edge alert isn't working, let's tune the thresholds," the agent helped me tune. It didn't say "you should consider whether this whole approach is doomed." It felt like a helpful collaborator. It was actually a sunk-cost amplifier. The agent has no stake in the deletion the way a human coworker with their own time budget would. If anything, it's biased the other way — it was trained to complete tasks, and deletion isn't completion. Override it explicitly. I now start sessions where I'm uncertain by saying "before you help me build this, list the reasons it might be the wrong thing to build."

Calibration toward helpfulness when you need calibration toward disagreement. Agents are trained to be helpful and collaborative. That's great most of the time and bad when you're heading off a cliff. When I proposed the 3-class eye-state classifier, the agent helped me try it. A more senior human collaborator would have said "class imbalance is going to eat you — let's count your labels first" before writing a line of code. Agents don't push back hard enough at the idea stage because pushback is costly to their training objective. This is fixable with prompting, but the default is dangerous, and most users don't know to flip it.

The summary I've landed on is this: a coding agent today is a very fast, slightly junior collaborator who will do whatever you ask without complaint. That framing captures both the benefits (speed, stamina, boring parts automated) and the risks (no pushback, no "stop," no "delete this"). Keep both sides in mind and you'll get a lot out of one. Keep only the first side and you'll ship things you shouldn't have.

Why a project like this is the best ML learning vehicle I've found

One section I want to keep, because I don't see it said enough.

If you've done ML in notebooks but never deployed a model you had to rely on, a home sensor with a ground truth you can see is uniquely valuable:

Ground truth is sitting next to you. Most production ML has ambiguous labels you argue about with reviewers. Here, the oracle is across the room. Every prediction has an immediate, unambiguous check.
The feedback loop is minutes, not weeks. New test case every minute. Correction-to-retrain in the time it takes to click a button. You get to iterate on data at speeds you don't get in a corporate ML environment.
Costs are felt, not reported. A missed wake event teaches you something a Slack alert about degraded F1 does not.
You get engineering discipline for free. When the customer is your child, you stop taking shortcuts on test-set hygiene, validation rigor, and deployment pipelines. The discipline ML tutorials describe and never enforce becomes non-negotiable.

Closing

Two things I didn't expect to be the big lessons when I started, both of which I now believe strongly:

Almost all the value in an applied ML project lives in the boring parts — the dashboard, the correction loop, the label discipline, the deployment pipeline, the write-it-down-once rules about what counts as truth. The model itself is the interesting 20%, and the interesting 20% stops being interesting the moment it's good enough. Most of what separates "I trained a model" from "I shipped a model" is work that ML tutorials never cover.

The friction of starting has dropped to near-zero, and I think that's the most important development for builders in the last decade. The thinking part is still on you. But the five-weekend side project is now a five-hour side project, and the projects that used to die of friction are shipping. That changes the math on what's worth trying. If you've been thinking about an ML project and haven't started because you don't have the time — the time you think it'll take is no longer the time it actually takes. Pick something whose ground truth is in your house, point a camera at it, and start. BILBO's code and pretrained weights are open source — the repo's on GitHub if a working example helps you get going.