Tech Blog

Insights on SME automation, production AI, and reliable engineering systems.

AI/ML

The VLM That Scored a Collapsing Robot 62/100

After metrics kept missing degenerate gaits and LLM-iterated rewards hit the survival cliff, we tried a vision-language model as the fitness scorer. The VLM was harsher than the metrics and produced actionable failure descriptions, yet it still scored a collapsing robot 62/100. A case study in honest fitness, plus the four-layer evaluation stack we landed on.

AI/ML

When LLMs Iterate on Rewards: A Negative Result From Humanoid Locomotion

We ran NVIDIA's Eureka methodology (an LLM iteratively proposing reward functions for an RL agent) on cold-start humanoid bipedal walking. Across two rounds, all 24 candidates fell by step 70. The LLM was thoughtful; the survival cliff came from state-space coverage, not reward design.

AI/ML

Five Ways a Humanoid Cheats at Walking

Pure RL, physics priors, single-image poses, adversarial structural rewards, LLM-iterated rewards: five attempts to train a humanoid walker without mocap, and five distinct ways the policy cheated. Includes the v16 'flamingo hopping' retraction the user caught.

AI/ML

Two Poses Are Enough: How Much Mocap Data Does a Humanoid Need to Walk?

We ablated mocap reference data for a humanoid walker from 50 frames down to 2 poses. The minimal version walked fastest, a result with practical implications for any humanoid project bottlenecked on mocap.

AI/ML

The VecNormalize Trap: Two Silent Bugs That Hid a Working Walking Policy

Same checkpoint, two evaluations, 12× different episode lengths. Two silent bugs in series were responsible. The fix replaces SB3's VecNormalize with a fixed, physics-derived normalization that reproduces deterministically.

AI/ML

I Ran Karpathy's Autoresearch on a $1,299 MacBook — Here's What Happened

Head-to-head MLX vs PyTorch MPS comparison of Karpathy's autoresearch on an M2 MacBook Pro 16GB. Includes stability tests, a real autonomous agent loop that found a 5% improvement, and a cost analysis of Mac vs cloud GPU.

Productivity

Two Weeks, 40 Commits, and an AI That Remembers My Preferences

First-hand account of shipping 40+ commits across multiple projects with Claude Code. Covers the memory system, sub-agent patterns, testing culture, and honest reflections on where AI coding assistants shine and stumble.

AI/ML

Claude Code Skills: How to Set Them Up and Use Them

Learn how to create and use Claude Code skills: folder structure, YAML frontmatter, progressive disclosure, and when skills beat repeating yourself.
