<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[AlphaSignal]]></title><description><![CDATA[The most read technical newsletter in AI. Stay on top of the latest research, repos, and models with 5 min daily summaries. Join 200,000+ AI developers.]]></description><link>https://alphasignalai.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!3a8L!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0ddabf0-65ab-411a-af75-2163d7271e35_337x337.png</url><title>AlphaSignal</title><link>https://alphasignalai.substack.com</link></image><generator>Substack</generator><lastBuildDate>Sat, 06 Jun 2026 19:08:18 GMT</lastBuildDate><atom:link href="https://alphasignalai.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[AlphaSignal, LLC]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[alphasignalai@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[alphasignalai@substack.com]]></itunes:email><itunes:name><![CDATA[AlphaSignal AI]]></itunes:name></itunes:owner><itunes:author><![CDATA[AlphaSignal AI]]></itunes:author><googleplay:owner><![CDATA[alphasignalai@substack.com]]></googleplay:owner><googleplay:email><![CDATA[alphasignalai@substack.com]]></googleplay:email><googleplay:author><![CDATA[AlphaSignal AI]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Stop Asking Whether the Agent Worked. Ask What the Harness Observed]]></title><description><![CDATA[Your agent passed, and that tells you almost nothing. 
Inside: the trace, the failure map, and the scorecard that do.]]></description><link>https://alphasignalai.substack.com/p/stop-asking-whether-the-agent-worked</link><guid isPermaLink="false">https://alphasignalai.substack.com/p/stop-asking-whether-the-agent-worked</guid><dc:creator><![CDATA[AlphaSignal AI]]></dc:creator><pubDate>Fri, 05 Jun 2026 11:34:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!JG85!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c6da5f3-5912-47bf-bcf1-4b5cf249f0b9_1672x941.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JG85!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c6da5f3-5912-47bf-bcf1-4b5cf249f0b9_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JG85!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c6da5f3-5912-47bf-bcf1-4b5cf249f0b9_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!JG85!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c6da5f3-5912-47bf-bcf1-4b5cf249f0b9_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!JG85!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c6da5f3-5912-47bf-bcf1-4b5cf249f0b9_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!JG85!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c6da5f3-5912-47bf-bcf1-4b5cf249f0b9_1672x941.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JG85!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c6da5f3-5912-47bf-bcf1-4b5cf249f0b9_1672x941.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3c6da5f3-5912-47bf-bcf1-4b5cf249f0b9_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1902280,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://alphasignalai.substack.com/i/200633319?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c6da5f3-5912-47bf-bcf1-4b5cf249f0b9_1672x941.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JG85!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c6da5f3-5912-47bf-bcf1-4b5cf249f0b9_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!JG85!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c6da5f3-5912-47bf-bcf1-4b5cf249f0b9_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!JG85!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c6da5f3-5912-47bf-bcf1-4b5cf249f0b9_1672x941.png 1272w, 
https://substackcdn.com/image/fetch/$s_!JG85!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c6da5f3-5912-47bf-bcf1-4b5cf249f0b9_1672x941.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><blockquote><p><em>In 5 minutes, you will learn why the final answer is the wrong thing to grade, the eight-layer trace that makes a run debuggable, and how to pin a failure on one harness layer before you trust the next run.</em></p></blockquote><p>Two agents close the same ticket. The final answer says they tied. 
The final answer is lying to you.</p><p>One closed it in four steps, ran the full test suite, and asked before it touched production. The other looped eleven times, skipped the suite, and slammed into a permission wall it never mentioned.</p><p>Same diff. Same green check. Two completely different runs, and an output-only evaluation can see only the checkmark.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5ksZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b8da82d-f29f-4fcd-a075-17b7f344f570_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5ksZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b8da82d-f29f-4fcd-a075-17b7f344f570_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!5ksZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b8da82d-f29f-4fcd-a075-17b7f344f570_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!5ksZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b8da82d-f29f-4fcd-a075-17b7f344f570_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!5ksZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b8da82d-f29f-4fcd-a075-17b7f344f570_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5ksZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b8da82d-f29f-4fcd-a075-17b7f344f570_1672x941.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1b8da82d-f29f-4fcd-a075-17b7f344f570_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!5ksZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b8da82d-f29f-4fcd-a075-17b7f344f570_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!5ksZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b8da82d-f29f-4fcd-a075-17b7f344f570_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!5ksZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b8da82d-f29f-4fcd-a075-17b7f344f570_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!5ksZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b8da82d-f29f-4fcd-a075-17b7f344f570_1672x941.png 1456w" sizes="100vw"></picture></div></a></figure></div><p>Most teams still grade agents the way they grade a chatbot. That worked when the model only talked. It breaks the moment the model starts acting.</p><p>For a chatbot, the output is the product, so scoring the output is fair.</p><p>But for an agent, the output is a receipt. It confirms the transaction closed and says almost nothing about what got touched, retried, skipped, or carried by luck.</p><p>The research is starting to say this with numbers. 
<strong>Harness-Bench</strong>, a new diagnostic benchmark, ran 5,194 agent trajectories across 106 sandboxed tasks and found that the same task and model can land very differently depending on the harness wrapped around them.</p><p>A blunt companion paper, <strong>&#8220;Stop Comparing LLM Agents Without Disclosing the Harness&#8221;</strong>, argues that for long, hard tasks the harness moves performance more than the choice of model does.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://alphasignalai.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading AlphaSignal! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2><strong>Why the final answer lies</strong></h2><p>An agent&#8217;s final output is a compression of everything it did. Compression deletes exactly the part you need when something breaks.</p><p>One green &#8220;pass&#8221; can hide at least four different problems.</p><p>A coding agent ships a patch that clears the single test it ran and never touches the broader suite. Green check, latent bug.</p><p>A research agent cites the perfect source, but it got there through a stale memory entry that happens to still be right. Next month it will not be.</p><p>A support agent resolves a ticket after quietly trying a refund path it was forbidden to use and getting denied. 
Nobody upstream ever hears about the attempt.</p><p>A long-running agent finishes a multi-day job after three rounds of context compaction, and no one can say which of the starting assumptions made it through the last handoff.</p><p>All four score as wins. Output-only evaluation tells you the task passed. It never tells you which layer made the pass fragile, or whether the next run will be as lucky.</p><div><hr></div><h2><strong>The shift the research is making</strong></h2><p><strong>Harness-Bench</strong> stopped ranking models and started ranking model-and-harness pairs. It crosses 8 model backends with 6 harnesses, and on every one of those 5,194 runs it logs four things: the final artifact, the full execution trace, the usage stats, and what the validators returned.</p><p>The final artifact is the smallest item on that list. The benchmark also catalogs the failure symptoms that keep recurring, which is the useful part. Once you can name the symptom, you can hunt for the layer that caused it.</p><p><strong>&#8220;Stop Comparing LLM Agents Without Disclosing the Harness&#8221;</strong> is harsher about the status quo. Its Binding Constraint Thesis holds that once models reach frontier-class parity, the harness explains more of the performance gap than the model, so any leaderboard that hides its harness is quietly handing the model credit for the scaffolding&#8217;s work.</p><p><strong>Anthropic</strong> makes the operational version of the argument in its agent-evaluation writing: grade the trajectory and the outcome as two separate things. The transcript of tool calls, token counts, turns, latency, and state checks is a grading surface in its own right, not a footnote under the final reply.</p><p>Stack the three together and the direction is obvious. 
The question is moving from &#8220;did the model answer?&#8221; to &#8220;what did the model and the harness do together?&#8221;</p><div><hr></div><h2><strong>The harness telemetry stack</strong></h2><p>Answering that question takes a trace, and a trace has layers. Here is the minimum set that lets you rebuild a run after it breaks, with the failure each layer exists to catch.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oKFd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F873fbdaa-fb64-4786-a74d-7aae0890793e_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oKFd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F873fbdaa-fb64-4786-a74d-7aae0890793e_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!oKFd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F873fbdaa-fb64-4786-a74d-7aae0890793e_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!oKFd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F873fbdaa-fb64-4786-a74d-7aae0890793e_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!oKFd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F873fbdaa-fb64-4786-a74d-7aae0890793e_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oKFd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F873fbdaa-fb64-4786-a74d-7aae0890793e_1672x941.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/873fbdaa-fb64-4786-a74d-7aae0890793e_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!oKFd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F873fbdaa-fb64-4786-a74d-7aae0890793e_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!oKFd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F873fbdaa-fb64-4786-a74d-7aae0890793e_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!oKFd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F873fbdaa-fb64-4786-a74d-7aae0890793e_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!oKFd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F873fbdaa-fb64-4786-a74d-7aae0890793e_1672x941.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Table available as markdown at the end.</figcaption></figure></div><p>This is not a dashboard wishlist. It is a debugging contract. When a layer is missing from the trace, the failure that lives in that layer cannot be attributed, and you are back to rerunning the agent and squinting.</p><p>Codex and Claude Code already emit parts of this over <strong>OpenTelemetry</strong>: model and tool calls, sessions, token and cost metrics, and structured logs, with traces already shipping on Codex and still in beta on Claude Code. The plumbing is starting to exist. The rare part is treating the trace, not the answer, as the thing you actually review.</p><div><hr></div><h3><strong>A failure map, not a tutorial</strong></h3><p>With the trace in hand, a broken run finally produces a useful question: which layer failed?</p><p>The eleven harness layers everyone lists make a mediocre tutorial and a sharp failure map. 
Stop reading them as &#8220;what a harness contains&#8221; and start reading them as &#8220;where to point when the answer came out wrong.&#8221;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!E0xY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4b7607a-7cf9-47e9-86b5-f100157a8e66_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!E0xY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4b7607a-7cf9-47e9-86b5-f100157a8e66_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!E0xY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4b7607a-7cf9-47e9-86b5-f100157a8e66_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!E0xY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4b7607a-7cf9-47e9-86b5-f100157a8e66_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!E0xY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4b7607a-7cf9-47e9-86b5-f100157a8e66_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!E0xY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4b7607a-7cf9-47e9-86b5-f100157a8e66_1672x941.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b4b7607a-7cf9-47e9-86b5-f100157a8e66_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!E0xY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4b7607a-7cf9-47e9-86b5-f100157a8e66_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!E0xY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4b7607a-7cf9-47e9-86b5-f100157a8e66_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!E0xY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4b7607a-7cf9-47e9-86b5-f100157a8e66_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!E0xY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4b7607a-7cf9-47e9-86b5-f100157a8e66_1672x941.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Table available as markdown at the end.</figcaption></figure></div><p>The map does one thing well. It turns &#8220;the agent failed&#8221; into &#8220;the governance layer failed,&#8221; which is the difference between a shrug and a fix.</p><p>The four agents here split into two shapes. Coding agents like Codex and Claude Code pour their engineering into governance, execution, and verification, because their failures hit files, terminals, repos, and CI. Personal agents like Hermes and OpenClaw pour theirs into memory, identity, channels, and continuity, because their failures play out across time, people, and messaging apps.</p><p>Both need traces. They just need the densest tracing in different layers.</p><p>Then there is the harder case. When a personal agent runs a coding agent as its backend, which Hermes and OpenClaw can both do, the trace has to cross the seam between two harnesses. The outer agent knows what it asked for. 
The inner runtime owns the real tool calls, diffs, tests, and permission decisions. If your trace stops at the handoff, so does any hope of explaining the failure.</p><div><hr></div><h2><strong>The harness reliability scorecard</strong></h2><p>The trace shows what happened. 
Judging it needs a rubric you run after every important run, before you trust the next one.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!maHj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9918bf9c-d7b1-447b-8b57-eff6837499e6_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!maHj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9918bf9c-d7b1-447b-8b57-eff6837499e6_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!maHj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9918bf9c-d7b1-447b-8b57-eff6837499e6_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!maHj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9918bf9c-d7b1-447b-8b57-eff6837499e6_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!maHj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9918bf9c-d7b1-447b-8b57-eff6837499e6_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!maHj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9918bf9c-d7b1-447b-8b57-eff6837499e6_1672x941.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9918bf9c-d7b1-447b-8b57-eff6837499e6_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!maHj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9918bf9c-d7b1-447b-8b57-eff6837499e6_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!maHj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9918bf9c-d7b1-447b-8b57-eff6837499e6_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!maHj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9918bf9c-d7b1-447b-8b57-eff6837499e6_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!maHj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9918bf9c-d7b1-447b-8b57-eff6837499e6_1672x941.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Table available as markdown at the end.</figcaption></figure></div><p>This is not a leaderboard. It is a post-run audit, and its only job is to make a failure local enough to fix.</p><p>Run it honestly and most &#8220;working&#8221; agents lose points on trace completeness first. You cannot grade what you never recorded. A run that passed but trips four of these signals is not a success. It is a success you got away with, and got-away-with does not survive contact with production.</p><p>The model can tell you what it answered. The trace tells you what actually happened.</p><p>So the reliability question for agents has changed. It is no longer &#8220;did it work?&#8221; It is whether the harness saw enough to explain the run on the day it doesn&#8217;t.</p><div><hr></div><p><strong>Which layer in your stack is flying without a trace right now: context, permissions, memory, or verification?</strong></p><p><strong>All Sources in first reply. 
Full breakdown of recent updates + daily signals in our newsletter (link in bio).</strong></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://alphasignalai.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading AlphaSignal! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2><strong>Appendix</strong></h2><blockquote><p>All tables in markdown format.</p></blockquote><p>Table 1:</p><pre><code><code>| Trace layer | What to capture | The failure it explains |
|---|---|---|
| Context | prompts, files, retrieved docs, memories, skills, tool schemas, compaction events | the agent never saw the fact it needed, or saw stale and irrelevant context |
| Tool | tool name, args, result, error, latency, output size, repeated calls | the tool failed, returned weak evidence, or got called in a loop |
| Permission | approval request, decision, policy rule, denial, escalation path | the agent crossed a boundary, or could not recover from a denial |
| Execution | sandbox mode, workspace diff, network policy, environment state, side effects | the run leaned on hidden environment state or untracked file changes |
| Test | commands, validators, static checks, pass/fail, outcome graders | the answer looked right but was never checked from outside |
| Memory | reads, writes, source, timestamp, confidence, expiry | stale or unsafe memory quietly steered the run |
| Cost and latency | model calls, token usage, wall time, retry count, wait states | the agent succeeded by burning budget or looping blindly |
| Human | clarifications, approvals, corrections, interrupts, handoffs | autonomous performance cannot be separated from human rescue |</code></code></pre><p>Table 2:</p><pre><code><code>| Harness layer | The question to ask when the run fails |
|---|---|
| Control | Did the agent have the right rules, success criteria, and boundaries? |
| Context | Did the harness load the right information, at the right time, without stale or distracting material? |
| Runtime loop | Did the loop stop, retry, compact, and resume correctly? |
| Tools | Did the tool interface return compact, accurate, actionable observations? |
| Execution | Did the action run in the intended workspace, sandbox, network mode, and environment? |
| Governance | Did risky actions require approval, and did denials leave a recoverable path? |
| Memory | Did durable memory help the run, or inject old, unsafe, or untraceable assumptions? |
| Skills | Did the agent pick the right procedure, version, and validation steps? |
| Planning | Did planning reduce uncertainty, or spawn abandoned branches and noisy handoffs? |
| Verification | Did the harness check the final state instead of only the final message? |
| Interface | Were human approvals, clarifications, and interrupts captured as part of the run? |</code></code></pre><p>Table 3:</p><pre><code><code>| Dimension | Question | Bad signal |
|---|---|---|
| Outcome reliability | Did the final task succeed? | success cannot be checked outside the model's own answer |
| Trajectory quality | Did the agent take a sensible path? | many turns, repeated calls, abandoned branches, no new evidence |
| Tool effectiveness | Did each tool call add information? | tools return huge logs, vague errors, or output nobody uses |
| Context discipline | Did the harness load the right context? | missing source files, stale memory, bloated prompt, bad compaction |
| Permission safety | Did risky actions require approval? | a dangerous action ran silently, or a denial caused a dead end |
| Recovery behavior | Did the agent recover from errors? | retries repeat the same failing action or bury the first failure |
| Verification strength | Did the harness check the final state? | no tests, no validator, no state check, no human review path |
| Cost and latency | Did the run finish inside a sane budget? | success depended on excessive tokens, time, or retries |
| Memory correctness | Did memory improve the run? | memory has no source, timestamp, confidence, or deletion path |
| Trace completeness | Can a human reconstruct what happened? | missing context, tool args, approvals, diffs, test results, or handoffs |
</code></code></pre>]]></content:encoded></item><item><title><![CDATA[Stop Looping Tool Calls: Search as Code Cut Tokens 85% on a 200-CVE Task]]></title><description><![CDATA[By Perplexity, Search becomes Python the model writes on the fly. The SDK stays private, the pattern doesn&#8217;t.]]></description><link>https://alphasignalai.substack.com/p/stop-looping-tool-calls-search-as</link><guid isPermaLink="false">https://alphasignalai.substack.com/p/stop-looping-tool-calls-search-as</guid><dc:creator><![CDATA[AlphaSignal AI]]></dc:creator><pubDate>Tue, 02 Jun 2026 17:48:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!XF-6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfdadb72-e675-4df7-bcf6-954cb564096a_1672x941.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XF-6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfdadb72-e675-4df7-bcf6-954cb564096a_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XF-6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfdadb72-e675-4df7-bcf6-954cb564096a_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!XF-6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfdadb72-e675-4df7-bcf6-954cb564096a_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!XF-6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfdadb72-e675-4df7-bcf6-954cb564096a_1672x941.png 1272w, 
https://substackcdn.com/image/fetch/$s_!XF-6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfdadb72-e675-4df7-bcf6-954cb564096a_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XF-6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfdadb72-e675-4df7-bcf6-954cb564096a_1672x941.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bfdadb72-e675-4df7-bcf6-954cb564096a_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:874294,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://alphasignalai.substack.com/i/200334198?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfdadb72-e675-4df7-bcf6-954cb564096a_1672x941.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XF-6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfdadb72-e675-4df7-bcf6-954cb564096a_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!XF-6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfdadb72-e675-4df7-bcf6-954cb564096a_1672x941.png 848w, 
https://substackcdn.com/image/fetch/$s_!XF-6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfdadb72-e675-4df7-bcf6-954cb564096a_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!XF-6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfdadb72-e675-4df7-bcf6-954cb564096a_1672x941.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><blockquote><p>In ~7 mins: the CVE result (85% fewer tokens at 100% accuracy), the 5-system benchmark table, the 
architecture&#8217;s 3 layers, the 6 design principles behind it, and a Hermes Agent walkthrough to run the pattern yourself.</p></blockquote><p>Perplexity just stopped letting its agents call search and started letting them program it.</p><p><strong>Search as Code (SaC)</strong> is its new search architecture. Instead of one fixed pipeline behind a query, the model writes Python that composes the individual pieces of the search stack into a retrieval pipeline built for each task.</p><p>On a 200-CVE research task, that cut token use 85.1%, from 288.7K to 42.9K, while scoring 100% accuracy. Every non-Perplexity system tested scored below 25%.</p><p>It ships today: default in Perplexity Computer, available in the Agent API. There is a Hermes Agent walkthrough at the end if you want to run the pattern yourself.</p>
<div><hr></div><h2><strong>Context</strong></h2><p>The research is published by <strong>Perplexity</strong> and titled &#8220;<strong>Rethinking Search as Code Generation</strong>.&#8221; It landed June 1, 2026, and builds on the first overview of Perplexity&#8217;s search stack from September 2025.</p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/perplexity_ai/status/2061506359326384319&quot;,&quot;full_text&quot;:&quot;Introducing Search as Code, our new search architecture for AI agents.\n\nIt writes Python that calls our search stack directly, instead of looping through function calls one at a time.\n\nAvailable in the Perplexity Agent API, and now default in Computer.\n\n<a class=\&quot;tweet-url\&quot; href=\&quot;https://research.perplexity.ai/articles/rethinking-search-as-code-generation\&quot;>research.perplexity.ai/articles/rethi&#8230;</a> &quot;,&quot;username&quot;:&quot;perplexity_ai&quot;,&quot;name&quot;:&quot;Perplexity&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/2009310641165660160/XArF3_Ib_normal.jpg&quot;,&quot;date&quot;:&quot;2026-06-01T17:53:11.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/HJvxG6jaEAEQj2j.png&quot;,&quot;link_url&quot;:&quot;https://t.co/jrF2nQE3bC&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:137,&quot;retweet_count&quot;:174,&quot;like_count&quot;:1700,&quot;impression_count&quot;:461815,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" 
data-component-name="Twitter2ToDOM"></div><p>Perplexity&#8217;s search serves thousands of queries each second. The old contract was simple: the model issues a query, the engine runs a predefined pipeline, and the model reads the results. That works for a single question.</p><p>It breaks for agents. Today&#8217;s agents run tasks that take hours, span thousands of retrieval operations, and need a different search strategy at each step. A fixed pipeline cannot bend to all of them. SaC is Perplexity&#8217;s answer.</p><div><hr></div><h3><strong>What Search as Code actually is</strong></h3><p>The old way hands the model a search box. It types a query, gets back a ranked page of results, and works with whatever the pipeline decided to return. The model never touches the steps in between.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JADx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd28ce79-1aa0-4be1-9025-7829546dd198_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JADx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd28ce79-1aa0-4be1-9025-7829546dd198_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!JADx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd28ce79-1aa0-4be1-9025-7829546dd198_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!JADx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd28ce79-1aa0-4be1-9025-7829546dd198_1672x941.png 1272w, 
https://substackcdn.com/image/fetch/$s_!JADx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd28ce79-1aa0-4be1-9025-7829546dd198_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JADx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd28ce79-1aa0-4be1-9025-7829546dd198_1672x941.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bd28ce79-1aa0-4be1-9025-7829546dd198_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!JADx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd28ce79-1aa0-4be1-9025-7829546dd198_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!JADx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd28ce79-1aa0-4be1-9025-7829546dd198_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!JADx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd28ce79-1aa0-4be1-9025-7829546dd198_1672x941.png 1272w, 
https://substackcdn.com/image/fetch/$s_!JADx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd28ce79-1aa0-4be1-9025-7829546dd198_1672x941.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>SaC hands the model the steps themselves. Retrieval, ranking, filtering, fan-out, rendering: each is a primitive in an SDK, and the model writes Python that wires them into a pipeline for the task in front of it. 
A single inference turn can drive up to thousands of these operations inside a sandbox, then return only the slice worth reading.</p><p>The real shift is not that search now uses Python. It is that the agent stops spending its context window on deterministic grunt work. Loops, deduplication, filtering, and joins move into code, where they belong, and the model stays on strategy. Perplexity calls these the twin levers of control and legibility: the model steers every step, and it can inspect the intermediate state instead of guessing at it.</p><div><hr></div><h2><strong>The evidence: benchmarks and a 200-CVE stress test</strong></h2><p>The flagship demo is a security research task: identify and characterize more than 200 high-severity CVEs from 2023 to 2025, each record citing the vendor&#8217;s own advisory, the affected product, and the fix version. SaC scored 100% accuracy and used 85.1% fewer tokens than the same stack without it, 42.9K against 288.7K. The other systems Perplexity tested all landed below 25%.</p><p>A stylized slice of what the model wrote shows the shape: build a query plan across official advisory formats, fan the queries out concurrently, and keep only vendor-owned pages.</p><pre><code><code>templates = [
    ("Mozilla", 'site:mozilla.org/.../mfsa{year} "CVE-{year}-" "Fixed in" "Impact high"'),
    ("Jenkins", 'site:jenkins.io/security/advisory/{year} "CVE-{year}" "High" "Fix"'),
    # ...more vendors
]
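# Expand every vendor template once per year in scope.
queries = [pattern.format(year=y) for y in (2023, 2024, 2025) for _, pattern in templates]
# Fan the queries out concurrently; the zip below pairs each query with its hit list.
seed_hits = sdk.search.web_many(queries, limit_per_query=8, concurrency=12)
# Flatten the hit lists, keeping only pages that pass the vendor-advisory check.
pages = [h for q, hits in zip(queries, seed_hits) for h in hits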
         if official_vendor_advisory(h.url)]</code></code></pre><p>The fan-out, the concurrency, and the filter to official sources all run in code, in one turn, without round-tripping through the model.</p><p>Across the full suite, SaC leads four of five benchmarks.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!y1BG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8fa2c60-d847-4f4d-9e62-054724f1e4a4_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!y1BG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8fa2c60-d847-4f4d-9e62-054724f1e4a4_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!y1BG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8fa2c60-d847-4f4d-9e62-054724f1e4a4_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!y1BG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8fa2c60-d847-4f4d-9e62-054724f1e4a4_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!y1BG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8fa2c60-d847-4f4d-9e62-054724f1e4a4_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!y1BG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8fa2c60-d847-4f4d-9e62-054724f1e4a4_1672x941.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f8fa2c60-d847-4f4d-9e62-054724f1e4a4_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!y1BG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8fa2c60-d847-4f4d-9e62-054724f1e4a4_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!y1BG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8fa2c60-d847-4f4d-9e62-054724f1e4a4_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!y1BG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8fa2c60-d847-4f4d-9e62-054724f1e4a4_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!y1BG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8fa2c60-d847-4f4d-9e62-054724f1e4a4_1672x941.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The systems are SaC on Perplexity&#8217;s Agent API (GPT 5.5, high reasoning), OpenAI&#8217;s Responses API with <em><strong>web_search</strong></em> and <em><strong>code_interpreter</strong></em>, Anthropic Managed Agents (Opus 4.7, high reasoning), Exa, and Parallel.</p><p>Each is a single run, not best-of-N. OpenAI edges SaC on Humanity&#8217;s Last Exam (HLE) by 0.002. The other benchmarks come from Google (DeepSearchQA, abbreviated DSQA), ByteDance Seed (WideSearch), and OpenAI (BrowseComp).</p><p>Two caveats live inside these numbers. 
WANDR, where SaC&#8217;s lead stretches to 2.5x the next-best system, is Perplexity&#8217;s own benchmark and is not released yet.</p><p>The cleanest comparison is the ablation against Perplexity&#8217;s own non-SaC pipeline on the same infrastructure: the largest absolute gain is +19.77 points on DSQA, the largest relative gain +45% on WANDR.</p><p>On cost, medium-reasoning SaC beats every non-SaC system at under $1 per task, and low-reasoning SaC is cheaper than all of them while staying competitive.</p><div><hr></div><h2><strong>How it works: six design principles</strong></h2><p>Strip SaC to its frame and you get a stack with three jobs and six design choices that make the stack pay off.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kvUU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e7c0a77-f453-442a-96b5-9df7c355ad7e_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!kvUU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e7c0a77-f453-442a-96b5-9df7c355ad7e_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!kvUU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e7c0a77-f453-442a-96b5-9df7c355ad7e_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!kvUU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e7c0a77-f453-442a-96b5-9df7c355ad7e_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!kvUU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e7c0a77-f453-442a-96b5-9df7c355ad7e_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kvUU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e7c0a77-f453-442a-96b5-9df7c355ad7e_1672x941.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9e7c0a77-f453-442a-96b5-9df7c355ad7e_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!kvUU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e7c0a77-f453-442a-96b5-9df7c355ad7e_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!kvUU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e7c0a77-f453-442a-96b5-9df7c355ad7e_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!kvUU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e7c0a77-f453-442a-96b5-9df7c355ad7e_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!kvUU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e7c0a77-f453-442a-96b5-9df7c355ad7e_1672x941.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Atomize the stack, don&#8217;t wrap the API.</strong></p><p>The SDK is not a search endpoint dropped into a shell. Perplexity rearchitected its search stack into modular primitives and exposed them at the lowest level it could, from raw retrieval up to semantic parsing. High-level end-to-end pipelines still exist, but only as shorthand the model can use or skip.</p><p><strong>Three layers do three jobs.</strong></p><p>The model is the control plane: it reads the directive, decides which pipelines each task needs, and writes the code. The compute sandbox handles deterministic work: control flow, batching, retries, filtering, joins, aggregation. The Agentic Search SDK lives in the sandbox runtime and exposes the primitives, so one inference turn can drive thousands of operations.</p><p><strong>Code orchestrates, and fills gaps.</strong></p><p>When the SDK lacks a capability, the model builds it in code instead of waiting for a new function. Need a precise regex the query syntax cannot express? The model fans out to collect a superset, dedupes, then narrows the results deterministically. The CVE fan-out above is this principle in action.</p><p><strong>Control and legibility are the point.</strong></p><p>A fixed pipeline owns everything downstream of the query, which creates three failure modes: context bloated with irrelevant hits, domain knowledge the model cannot apply, and serial control flow that pollutes the context with intermediate state. 
Programmable search fixes all three by handing the model both the steps and the state.</p><p><strong>State lives on disk, not in tokens.</strong></p><p>Across turns, SaC persists intermediate state to a filesystem with explicit serialization rather than holding it in a REPL. Perplexity tested both. They performed similarly day to day, but the filesystem approach proved more reliable on long trajectories, where an in-memory namespace turns into a cluttered hundred-cell notebook.</p><p><strong>Teach the SDK with small skills.</strong></p><p>A custom SDK appears in no model&#8217;s pretraining data, so Perplexity wrote Agent Skills to teach it. The root <em><strong>SKILL.md</strong></em> files stay under 2,000 tokens and spend most of that budget on few-shot examples for composing primitives, not on listing them. Continuous autoresearch loops tune the SDK and the skills against latency, codegen quality, and task performance.</p><div><hr></div><h2><strong>How to run the pattern yourself</strong></h2><p>The catch: the Agentic Search SDK and Perplexity&#8217;s retrieval infrastructure are internal. You cannot download them. The appendix at the end turns the idea into a runnable skill anyway, because what travels is the orchestration pattern, not the engine.</p><p>Any agent runtime that exposes code execution plus search primitives can host it. 
The clearest open option is <strong>Hermes Agent</strong> from Nous Research, MIT-licensed, which gives you <em><strong>execute_code</strong></em> alongside <em><strong>web_search</strong></em> and <em><strong>web_extract</strong></em>.</p><p>The move is the one SaC makes: push fan-out, extraction, filtering, deduplication, and evidence assembly into a single sandboxed code step, and return a compact summary to the parent agent.</p><blockquote><p>Full install-to-run commands are in the appendix below.</p></blockquote><div><hr></div><h2><strong>AlphaSignal Take</strong></h2><p><strong>The SDK is the moat, and it is private.</strong></p><p>Every result here rests on the atomized Agentic Search SDK and Perplexity&#8217;s retrieval infrastructure, and neither ships. This is an architecture disclosure, not a reproducible package. You can copy the shape of the idea, not the engine that makes its numbers.</p><p><strong>The headline leans on a benchmark no one else can see.</strong></p><p>WANDR, where SaC&#8217;s 2.5x lead is widest, is Perplexity&#8217;s own and unreleased. Every score in the table is a single run rather than best-of-N, and none has been independently replicated. Read them as directional.</p><p><strong>The gains come with lock-in.</strong></p><p>SaC is cloud-only and bound to Perplexity&#8217;s models and stack. There is no swapping in your own LLM and no self-hosting, the usual price for production search quality you do not maintain yourself.</p><p>What keeps this from being a vendor story is that the core idea is not Perplexity&#8217;s alone. Executable-code actions (CodeAct, ICML 2024), the broader move from tool-call loops to generated code, and recent wide-search systems all point the same way. The architecture is sound even where the evidence is self-reported.</p><p><strong>So the best recommendation</strong> is to adopt the pattern, not wait for the product. 
Build the code-orchestrated version on an open runtime now, measure it against your current serial setup, and treat Perplexity&#8217;s numbers as a target to verify rather than a result to trust.</p><div><hr></div><p><strong>If your agent could write its own search pipeline, what is the first workflow you would hand it?</strong></p><div><hr></div><h2><strong>Appendix: How to run a Search-as-Code-style scout with Hermes Agent</strong></h2><p>This appendix builds a small Hermes skill that approximates the SaC pattern: plan a bounded search, run the deterministic work inside <em><strong>execute_code</strong></em>, persist inspectable artifacts, and return a compact summary.</p><p>It is an approximation, not a clone. Hermes exposes higher-level tools than Perplexity&#8217;s private SDK, so the value is in moving loops and filtering out of the parent context, not in matching Perplexity&#8217;s retrieval quality.</p><p>The running example is a migration scout: collect the official migration guidance for a library&#8217;s last three major versions, returning for each the source URL, breaking changes, required code edits, and unresolved questions.</p><p><strong>1. Install Hermes Agent.</strong></p><pre><code><code>git clone https://github.com/NousResearch/hermes-agent
cd hermes-agent
# follow the quickstart in the docs to install and configure the runtime</code></code></pre><p>Hermes is MIT-licensed and Python-based. Configure web access per the docs before running: <em><strong>web_search</strong></em> and <em><strong>web_extract</strong></em> need search and extraction credentials set first.</p><p><strong>2. Create the skill.</strong></p><pre><code><code>mkdir -p ~/.hermes/skills/research/migration-scout</code></code></pre><p>Write <em><strong>~/.hermes/skills/research/migration-scout/SKILL.md</strong></em> with a contract the model can follow. Keep it under 2,000 tokens, the same discipline Perplexity uses:</p><pre><code><code>---
name: migration-scout
description: Collect official migration guidance for a library's last 3 major versions.
---

# Migration scout

For each of the target library's last three major versions, return: official source URL,
breaking changes, required code edits, and unresolved questions.

Process:
1. Define the output schema before searching (one record per version).
2. Split the work into per-version query branches.
3. Run one scout web_search and inspect the payload before writing an extractor.
4. In execute_code, fan out web_search across branches, web_extract the official docs,
   filter to the vendor's own domain, dedupe by URL, and write artifacts.
5. Persist raw and normalized JSON under sac-state/.
6. Return only counts, artifact paths, unresolved rows, and a small evidence sample.
7. Verify weak rows in a second pass.</code></code></pre><p><strong>3. Scout the payload shape.</strong></p><p>Inside <em><strong>execute_code</strong></em>, run one search and look at the response before building an extractor. Do not assume an undocumented schema.</p><pre><code><code># web_search, web_extract, write_file are available inside execute_code (see docs)
import json
sample = web_search("site:docs.example.com upgrade guide v3", limit=5)
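# A minimal sketch for reading the payload shape without assuming a schema.
# describe_shape is a hypothetical helper, not part of Hermes: it replaces each
# leaf value with its type name and keeps only the first item of every list.
def describe_shape(node, max_depth=4):
    if max_depth == 0:
        return "..."
    if isinstance(node, dict):
        return {k: describe_shape(v, max_depth - 1) for k, v in node.items()}
    if isinstance(node, list):
        return [describe_shape(node[0], max_depth - 1)] if node else []
    return type(node).__name__

# Possible use on the sample above: print(json.dumps(describe_shape(sample), indent=2))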
print(json.dumps(sample, indent=2)[:4000])</code></code></pre><p><strong>4. Fan out and persist.</strong></p><pre><code><code>import json

branches = [
    "site:docs.example.com v3 migration breaking changes",
    "site:docs.example.com v2 migration breaking changes",
    "site:docs.example.com v1 to v2 upgrade guide",
]

# Walk the unknown payload and collect every string that looks like a URL.
def urls_in(node):
    out = []
    if isinstance(node, dict):
        for v in node.values():
            out += urls_in(v)
    elif isinstance(node, list):
        for v in node:
            out += urls_in(v)
    elif isinstance(node, str) and node.startswith("http"):
        out.append(node)
    return out

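# Assumption: the vendor's docs host matches the site: filter used in the branches above.
OFFICIAL_PREFIXES = ("https://docs.example.com/", "http://docs.example.com/")

hits = {b: web_search(b, limit=5) for b in branches}
# Persist raw hits before the longer extraction pass so a failed extract keeps them.
write_file("sac-state/hits.json", json.dumps(hits, indent=2))

# Keep only official-domain URLs, deduped and sorted for a stable order.
urls = sorted({u for u in urls_in(hits) if u.startswith(OFFICIAL_PREFIXES)})
pages = web_extract(urls[:10]) if urls else {"results": []}
write_file("sac-state/pages.json", json.dumps(pages, indent=2))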
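# A minimal sketch of the normalized artifact and the weak-row flag from the skill contract.
# build_rows and its field names are hypothetical, not a Hermes API. It takes the first
# official URL in sorted order as a placeholder source and flags branches that have none.
def build_rows(urls_by_branch, official_prefixes):
    rows = []
    for branch, candidates in urls_by_branch.items():
        official = sorted({u for u in candidates if u.startswith(official_prefixes)})
        rows.append({
            "branch": branch,
            "source_url": official[0] if official else None,
            "needs_verification": not official,
        })
    return rows

# Possible use with the names above:
# rows = build_rows({b: urls_in(h) for b, h in hits.items()},
#                   ("https://docs.example.com/",))
# write_file("sac-state/rows.json", json.dumps(rows, indent=2))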
print(json.dumps({"branches": len(branches), "unique_urls": len(urls)}, indent=2))</code></code></pre><p><strong>5. Verify weak rows.</strong></p><p>In a second pass, bind each version&#8217;s claims to an official source URL and flag any row that lacks one. Keep this separate from the fan-out so a failed extraction does not poison the whole run.</p><p><strong>6. Run, measure, and compare.</strong></p><p>Trigger the skill on the bounded task, then capture: parent-visible tool turns, total <em><strong>execute_code</strong></em> calls, search and extract calls, tokens if exposed, rows needing manual verification, and correctness against the official docs. Then run the same task with serial parent-level search calls and compare.</p><p>The point is to test whether moving the work into code lowers parent-context load, not to reproduce Perplexity&#8217;s benchmark scores.</p><p>To deploy, leave the skill in <em><strong>~/.hermes/skills/</strong></em>, where any Hermes session loads it on demand. The same skeleton ports to other stacks: a Claude Code skill folder, a Codex agent file, or a generic system-prompt slot.</p><p><strong>Execution rules.</strong></p><p>Keep each run under the documented 50 tool-call default. Do not expect <em><strong>asyncio</strong></em> to parallelize Hermes calls: the RPC stub serializes its exchange behind a <em><strong>calllock</strong></em>.</p><p>Write JSON to disk before a long extraction pass. Call <em><strong>delegate_task</strong></em> from the parent agent, never from inside <em><strong>execute_code</strong></em>, which cannot recurse into <em><strong>execute_code</strong></em>, <em><strong>delegate_task</strong></em>, or MCP tools.</p><p>Strict mode adds isolation but is not a full secure container, and leaf delegated subagents cannot call <em><strong>execute_code</strong></em>.</p><p><strong>A second path: OpenClaw.</strong></p><p>OpenClaw (also MIT) offers a second route through its experimental code mode, which is off by default. 
Enable <em><strong>tools.codeMode.enabled</strong></em>, and guest JavaScript or TypeScript runs in a constrained QuickJS-WASI worker with no imports and no direct network or file access.</p><p>It calls already-enabled web tools through the executor and reduces nested results to one compact object, with <em><strong>maxPendingToolCalls</strong></em> defaulting to 16.</p><p>Durable artifacts need a write-capable tool. It is a looser fit than Hermes because OpenClaw is built as a multi-channel assistant gateway, so treat it as the secondary option.</p>]]></content:encoded></item><item><title><![CDATA[As AI agents evolve, we need to look past the RAG pipeline]]></title><description><![CDATA[This article is adapted from Ben Dickson&#8217;s AlphaSignal Sunday Deep Dive on Direct Corpus Interaction and GrepSeek.]]></description><link>https://alphasignalai.substack.com/p/as-ai-agents-evolve-we-need-to-look</link><guid isPermaLink="false">https://alphasignalai.substack.com/p/as-ai-agents-evolve-we-need-to-look</guid><dc:creator><![CDATA[AlphaSignal AI]]></dc:creator><pubDate>Tue, 02 Jun 2026 17:05:34 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!JJAQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fc7bfc9-505b-4e7b-b854-7b1caea3b4b1_1672x941.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JJAQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fc7bfc9-505b-4e7b-b854-7b1caea3b4b1_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!JJAQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fc7bfc9-505b-4e7b-b854-7b1caea3b4b1_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!JJAQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fc7bfc9-505b-4e7b-b854-7b1caea3b4b1_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!JJAQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fc7bfc9-505b-4e7b-b854-7b1caea3b4b1_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!JJAQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fc7bfc9-505b-4e7b-b854-7b1caea3b4b1_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JJAQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fc7bfc9-505b-4e7b-b854-7b1caea3b4b1_1672x941.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3fc7bfc9-505b-4e7b-b854-7b1caea3b4b1_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1057344,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://alphasignalai.substack.com/i/200327835?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fc7bfc9-505b-4e7b-b854-7b1caea3b4b1_1672x941.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JJAQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fc7bfc9-505b-4e7b-b854-7b1caea3b4b1_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!JJAQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fc7bfc9-505b-4e7b-b854-7b1caea3b4b1_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!JJAQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fc7bfc9-505b-4e7b-b854-7b1caea3b4b1_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!JJAQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fc7bfc9-505b-4e7b-b854-7b1caea3b4b1_1672x941.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p>This article is adapted from <strong><a href="https://www.linkedin.com/in/bendee983?miniProfileUrn=urn%3Ali%3Afs_miniProfile%3AACoAABpvjX8BfsbynYUlQKUQU8ojIDgeWQ-YH-c">Ben Dickson</a></strong>&#8217;s AlphaSignal Sunday Deep Dive on Direct Corpus Interaction and GrepSeek.</p></blockquote><p>AI coding agents are exposing a critical flaw in traditional retrieval-augmented generation (RAG) pipelines. And the solution might be giving the agents the same tools that humans use.</p><p>Agentic search requires dynamic plan revision. If an agent is tasked with debugging a production incident, it does not know the full scope of information it needs.</p><p>It needs to examine partial evidence, formulate a hypothesis, and search again to verify its assumptions. Agents need to find exact strings, numerical values, version constraints, error codes, and specific file paths.</p><p>This is not what traditional RAG is designed for.</p><p>RAG systems break documents into chunks and store their embedding values in vector databases. When a user asks a question, the system retrieves text chunks based on the similarity of their embeddings with that of the prompt.</p><p>This dense retrieval method is excellent for broad semantic recall and answering general questions over static knowledge bases. But it breaks down in software engineering and IT operations.</p><p>Exact lexical constraints and multi-step hypothesis refinement are incredibly difficult to execute through semantic retrievers alone. 
Current retrieval pipelines often decide too early what the AI agent is allowed to see.</p><p>Once relevant evidence is filtered out by a vector index before the agent&#8217;s reasoning loop begins, the data is lost. And no amount of reasoning can recover it.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://alphasignalai.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading AlphaSignal! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2><strong>Direct corpus interaction</strong></h2><p><strong>Direct Corpus Interaction (DCI)</strong> is a new but simple paradigm that bypasses embedding models entirely. 
It allows AI agents to interact with raw data using general-purpose terminal tools like grep, find, cat, sed, and shell pipelines.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lyDZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a1977ee-1391-4cfe-9651-e19585cc6a5c_2048x704.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lyDZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a1977ee-1391-4cfe-9651-e19585cc6a5c_2048x704.png 424w, https://substackcdn.com/image/fetch/$s_!lyDZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a1977ee-1391-4cfe-9651-e19585cc6a5c_2048x704.png 848w, https://substackcdn.com/image/fetch/$s_!lyDZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a1977ee-1391-4cfe-9651-e19585cc6a5c_2048x704.png 1272w, https://substackcdn.com/image/fetch/$s_!lyDZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a1977ee-1391-4cfe-9651-e19585cc6a5c_2048x704.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lyDZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a1977ee-1391-4cfe-9651-e19585cc6a5c_2048x704.png" width="1456" height="500" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2a1977ee-1391-4cfe-9651-e19585cc6a5c_2048x704.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:500,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!lyDZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a1977ee-1391-4cfe-9651-e19585cc6a5c_2048x704.png 424w, https://substackcdn.com/image/fetch/$s_!lyDZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a1977ee-1391-4cfe-9651-e19585cc6a5c_2048x704.png 848w, https://substackcdn.com/image/fetch/$s_!lyDZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a1977ee-1391-4cfe-9651-e19585cc6a5c_2048x704.png 1272w, https://substackcdn.com/image/fetch/$s_!lyDZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a1977ee-1391-4cfe-9651-e19585cc6a5c_2048x704.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In enterprise environments, data is rarely a stable, static document collection. It consists of active incident logs, live IT tickets, recent code commits, daily financial reports, and constantly shifting configuration files.</p><p>Vector embeddings are always a snapshot of the past. Building, updating, and maintaining vector indexes takes compute power and batch processing time. DCI allows the agent to interact directly with the current state of the workspace as it exists right now.</p><p>With terminal tools, agents can enforce strict constraints that vector databases miss. An agent looking for a specific database failure can search for an exact error string, pipe the output to a secondary filter to remove legacy log files, and verify the local context immediately.</p><p>This creates an iterative feedback loop between the agent and the file system. The agent executes a command, reads the raw output, and adjusts its next query based on what it learns. 
This mirrors how a human developer navigates an unfamiliar codebase.</p><p>Experiments show that DCI outperforms semantic retrieval on multi-hop reasoning tasks and retrieval benchmarks where clues are scattered across different files, while also reducing inference costs.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XDsC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf8afc14-1fa1-4ebc-b47f-23f55dff2cdd_2048x693.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XDsC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf8afc14-1fa1-4ebc-b47f-23f55dff2cdd_2048x693.png 424w, https://substackcdn.com/image/fetch/$s_!XDsC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf8afc14-1fa1-4ebc-b47f-23f55dff2cdd_2048x693.png 848w, https://substackcdn.com/image/fetch/$s_!XDsC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf8afc14-1fa1-4ebc-b47f-23f55dff2cdd_2048x693.png 1272w, https://substackcdn.com/image/fetch/$s_!XDsC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf8afc14-1fa1-4ebc-b47f-23f55dff2cdd_2048x693.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XDsC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf8afc14-1fa1-4ebc-b47f-23f55dff2cdd_2048x693.png" width="1456" height="493" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/df8afc14-1fa1-4ebc-b47f-23f55dff2cdd_2048x693.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:493,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!XDsC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf8afc14-1fa1-4ebc-b47f-23f55dff2cdd_2048x693.png 424w, https://substackcdn.com/image/fetch/$s_!XDsC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf8afc14-1fa1-4ebc-b47f-23f55dff2cdd_2048x693.png 848w, https://substackcdn.com/image/fetch/$s_!XDsC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf8afc14-1fa1-4ebc-b47f-23f55dff2cdd_2048x693.png 1272w, https://substackcdn.com/image/fetch/$s_!XDsC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf8afc14-1fa1-4ebc-b47f-23f55dff2cdd_2048x693.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://alphasignalai.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading AlphaSignal! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2><strong>Scaling DCI with GrepSeek</strong></h2><p>Giving a language model raw terminal access introduces friction. Agents can get lost in complex, nested directory structures. 
They can execute broad search commands that overwhelm the terminal with thousands of lines of useless output, which quickly derails their reasoning process.</p><p>A new framework called <strong>GrepSeek</strong> upgrades DCI and addresses these friction points by training a model to treat the corpus as the search environment. GrepSeek reasons about the query and gathers evidence by issuing executable shell commands against the corpus.</p><p>To simplify the process of training GrepSeek, the researchers created a pipeline that generates training data from a very large unstructured body of text without human assistance.</p><p>This process generates causally grounded search paths. It trains the model on how to logically navigate a file system, form hypotheses, and use command-line tools efficiently.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vqqQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa167e9d8-4485-412d-be29-8681654555ae_1450x1118.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vqqQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa167e9d8-4485-412d-be29-8681654555ae_1450x1118.png 424w, https://substackcdn.com/image/fetch/$s_!vqqQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa167e9d8-4485-412d-be29-8681654555ae_1450x1118.png 848w, https://substackcdn.com/image/fetch/$s_!vqqQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa167e9d8-4485-412d-be29-8681654555ae_1450x1118.png 1272w, 
https://substackcdn.com/image/fetch/$s_!vqqQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa167e9d8-4485-412d-be29-8681654555ae_1450x1118.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vqqQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa167e9d8-4485-412d-be29-8681654555ae_1450x1118.png" width="586" height="451.8262068965517" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a167e9d8-4485-412d-be29-8681654555ae_1450x1118.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1118,&quot;width&quot;:1450,&quot;resizeWidth&quot;:586,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!vqqQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa167e9d8-4485-412d-be29-8681654555ae_1450x1118.png 424w, https://substackcdn.com/image/fetch/$s_!vqqQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa167e9d8-4485-412d-be29-8681654555ae_1450x1118.png 848w, https://substackcdn.com/image/fetch/$s_!vqqQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa167e9d8-4485-412d-be29-8681654555ae_1450x1118.png 1272w, 
https://substackcdn.com/image/fetch/$s_!vqqQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa167e9d8-4485-412d-be29-8681654555ae_1450x1118.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>GrepSeek also uses reinforcement learning to improve the agent&#8217;s task-oriented search behavior. It teaches the model to avoid dead ends, recognize when a command has failed, and refine its search queries accordingly.</p><p>Running raw shell commands sequentially over millions of documents introduces severe latency. 
When an agent has to wait for a massive grep search to complete across an entire enterprise repository, the orchestration loop slows to a crawl.</p><p>GrepSeek solves this bottleneck with a semantics-preserving sharded-parallel execution engine. This engine splits the underlying corpus into smaller data shards and runs shell commands simultaneously across them.</p><p>This approach speeds up shell-based retrieval by up to 7.6x compared to traditional execution while preserving the fidelity of the original data.</p><div><hr></div><h2><strong>How to apply DCI in practice</strong></h2><p>Why not load an entire repository into a massive million-token context window? Because processing millions of tokens for every step an agent takes is unsustainable for most applications.</p><p>A massive context increases the agent&#8217;s time-to-first-token. Furthermore, cramming a model with raw code increases the likelihood that it will overlook specific, critical details buried deep within the prompt.</p><p>Raw terminal outputs from DCI can also bloat the context window if left unchecked. A single poorly constructed find command can return thousands of lines of text. Running grep over the entire corpus on every step can also be slow, especially when the corpus is accessed over a network.</p><p>For AI orchestration engineers and data architects working with a small corpus, DCI-style retrieval can work perfectly well.</p><p>But for very large corpora, a balanced, hybrid approach will probably be better suited:</p><p>Semantic retrieval handles broad, high-recall candidate discovery. 
It locates an initial anchor document when the user&#8217;s intent is underspecified.</p><p>DCI operates as a precision verification layer on top of the retrieved data.</p><p>The agent uses terminal tools to expand laterally from the anchor document into neighboring files or dependencies.</p><p>The agent checks exact constraints, verifies version numbers, and combines weak signals across multiple documents before generating a final answer.</p><p>This shift changes how we must think about enterprise data architecture. In the near future, data will not only need to be indexed for human search engines. It will need to be explicitly organized for agents that can inspect, trace, and verify raw files.</p><p>Retrieval quality for coding agents is not about generating better vector embeddings or using larger context windows. It relies on the resolution of the interface through which the agent is allowed to interact with the corpus.</p><div><hr></div><blockquote><p>This article is adapted from <strong><a href="https://www.linkedin.com/in/bendee983?miniProfileUrn=urn%3Ali%3Afs_miniProfile%3AACoAABpvjX8BfsbynYUlQKUQU8ojIDgeWQ-YH-c">Ben Dickson</a></strong>&#8217;s AlphaSignal Sunday Deep Dive on Direct Corpus Interaction and GrepSeek.</p></blockquote>
<div><hr></div>]]></content:encoded></item><item><title><![CDATA[Turn a Departing Engineer's Judgment Into an Editable, Versioned Skill File ]]></title><description><![CDATA[COLLEAGUE.SKILL reframes the viral dot-skill repo: expertise as editable, versioned files, not a clone.]]></description><link>https://alphasignalai.substack.com/p/turn-a-departing-engineers-judgment</link><guid isPermaLink="false">https://alphasignalai.substack.com/p/turn-a-departing-engineers-judgment</guid><dc:creator><![CDATA[AlphaSignal AI]]></dc:creator><pubDate>Mon, 01 Jun 2026 18:57:35 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!KXVt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d049bf6-465f-48f9-a4c5-364c8524a456_1672x941.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KXVt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d049bf6-465f-48f9-a4c5-364c8524a456_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KXVt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d049bf6-465f-48f9-a4c5-364c8524a456_1672x941.png 424w, 
https://substackcdn.com/image/fetch/$s_!KXVt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d049bf6-465f-48f9-a4c5-364c8524a456_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!KXVt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d049bf6-465f-48f9-a4c5-364c8524a456_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!KXVt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d049bf6-465f-48f9-a4c5-364c8524a456_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KXVt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d049bf6-465f-48f9-a4c5-364c8524a456_1672x941.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0d049bf6-465f-48f9-a4c5-364c8524a456_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1359156,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://alphasignalai.substack.com/i/200167634?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d049bf6-465f-48f9-a4c5-364c8524a456_1672x941.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!KXVt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d049bf6-465f-48f9-a4c5-364c8524a456_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!KXVt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d049bf6-465f-48f9-a4c5-364c8524a456_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!KXVt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d049bf6-465f-48f9-a4c5-364c8524a456_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!KXVt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d049bf6-465f-48f9-a4c5-364c8524a456_1672x941.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p>In ~5 mins: what dot-skill actually generates, the capability/persona split, the <em><strong>S = (A, M, L)</strong></em> contract, the correct-and-roll-back loop, the behavioral-fidelity gap the paper admits, and a full install-to-rollback walkthrough at the end.</p></blockquote><p>A tool that turns your coworker into an installable AI skill crossed roughly 18,500 GitHub stars.</p><p><strong>dot-skill</strong> reads a person&#8217;s scattered work traces, their docs, code reviews, and chat decisions, and writes them into a skill file an agent can load.</p><p>The backlash arrived fast. Someone shipped an <strong>anti-distillation skill</strong> that adds noise to your own traces so you cannot be cleanly copied.</p><p>Now there is a paper. 
<strong>COLLEAGUE.SKILL</strong>, posted by <strong>Shanghai AI Lab</strong> on May 29, drops the digital-twin pitch for a narrower claim: this is a file format for expertise, not a copy of a person.</p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/whyyoutouzhele/status/2040195137465462998&quot;,&quot;full_text&quot;:&quot;&#36817;&#26085;&#65292;github&#19978;&#19968;&#20010;&#21517;&#21483;&#8220;&#21516;&#20107;.skill&#8221;&#30340;&#39033;&#30446;&#28779;&#20102;&#12290;\n\n4&#26376;3&#26085;&#65292;&#19968;&#21338;&#20027;&#34920;&#31034;&#65292;&#22905;&#24320;&#21457;&#20102;&#8220;&#21453;&#33976;&#39311;skill&#8221;&#30340;&#39033;&#30446;&#12290;\n&#22905;&#34920;&#31034;&#65292;&#22823;&#23478;&#37117;&#26159;&#20986;&#26469;&#20570;&#29275;&#39532;&#30340;&#65292;&#27809;&#20154;&#24076;&#26395;&#33258;&#24049;&#34987;&#20570;&#25104;skill&#65292;&#28982;&#21518;&#20002;&#25481;&#24037;&#20316;&#65292;&#25152;&#20197;&#33258;&#24049;&#21457;&#26126;&#20102;&#8220;&#21453;&#33976;&#39311;skill&#8221;&#12290;&#24076;&#26395;&#22823;&#23478;&#22312;&#36825;&#20010;AI&#28010;&#28526;&#37324;&#37117;&#33021;&#27963;&#24471;&#20037;&#19968;&#28857;&#21543;&#12290; 
&quot;,&quot;username&quot;:&quot;whyyoutouzhele&quot;,&quot;name&quot;:&quot;&#26446;&#32769;&#24072;&#19981;&#26159;&#20320;&#32769;&#24072;&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/2004720683905904640/8ZEZs67v_normal.jpg&quot;,&quot;date&quot;:&quot;2026-04-03T22:30:00.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://substackcdn.com/image/upload/w_1028,c_limit,q_auto:best/l_twitter_play_button_rvaygk,w_88/armgmtb9dexxffgfjvnk&quot;,&quot;link_url&quot;:&quot;https://t.co/53OJZLSc7A&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:118,&quot;retweet_count&quot;:718,&quot;like_count&quot;:5278,&quot;impression_count&quot;:948753,&quot;expanded_url&quot;:null,&quot;video_url&quot;:&quot;https://video.twimg.com/amplify_video/2040191123818463232/vid/avc1/720x960/sE4q8Unm-sVKiMdo.mp4&quot;,&quot;belowTheFold&quot;:false}" data-component-name="Twitter2ToDOM"></div><p>The useful part is the artifact. Whether it keeps the judgment is the part nobody has measured.</p><div><hr></div><h2><strong>Context</strong></h2><p>dot-skill is open source under MIT, written in Python, at <em><strong>titanwings/colleague-skill</strong></em>. It started as colleague.skill, built for one job: when a teammate quits, capture their review standards and incident heuristics before the context walks out with them.</p><p>The five-person team at Shanghai AI Lab posted the technical report to arXiv on May 29, 2026. 
MIT Technology Review had already covered the trend in April, and the &#8220;distill them before they leave&#8221; framing is what drew both the stars and the pushback.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XW0q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa693b46f-bf4c-4745-877b-c5fcb33d256d_1413x1160.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XW0q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa693b46f-bf4c-4745-877b-c5fcb33d256d_1413x1160.png 424w, https://substackcdn.com/image/fetch/$s_!XW0q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa693b46f-bf4c-4745-877b-c5fcb33d256d_1413x1160.png 848w, https://substackcdn.com/image/fetch/$s_!XW0q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa693b46f-bf4c-4745-877b-c5fcb33d256d_1413x1160.png 1272w, https://substackcdn.com/image/fetch/$s_!XW0q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa693b46f-bf4c-4745-877b-c5fcb33d256d_1413x1160.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XW0q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa693b46f-bf4c-4745-877b-c5fcb33d256d_1413x1160.png" width="1413" height="1160" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a693b46f-bf4c-4745-877b-c5fcb33d256d_1413x1160.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1160,&quot;width&quot;:1413,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!XW0q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa693b46f-bf4c-4745-877b-c5fcb33d256d_1413x1160.png 424w, https://substackcdn.com/image/fetch/$s_!XW0q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa693b46f-bf4c-4745-877b-c5fcb33d256d_1413x1160.png 848w, https://substackcdn.com/image/fetch/$s_!XW0q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa693b46f-bf4c-4745-877b-c5fcb33d256d_1413x1160.png 1272w, https://substackcdn.com/image/fetch/$s_!XW0q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa693b46f-bf4c-4745-877b-c5fcb33d256d_1413x1160.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>It builds on the <strong>Agent Skills</strong> standard, where a skill is a folder around a <em><strong>SKILL.md</strong></em> file plus optional scripts and references, loaded on demand. The repo ships three presets: colleague (the main one), celebrity, and relationship. A public gallery lists 215 skills from 165 contributors, which measures distribution, not whether any of them work.</p><div><hr></div><h2><strong>What it actually generates</strong></h2><p>Expertise rarely lives in a manual. It is scattered across design docs, review comments, chat decisions, and incident notes.</p><p>dot-skill reads those traces and writes a few plain Markdown files. The loadable one is <em><strong>SKILL.md</strong></em>. 
Behind it sit <em><strong>work.md</strong></em> (what the person knows) and <em><strong>persona.md</strong></em> (how they act).</p><p>Because it follows the Agent Skills format, any compatible host loads it: Claude Code, OpenClaw, Codex, or Hermes. And because the output is Markdown, you can read the extracted rules, fix them in plain English, version the result, and roll it back.</p><div><hr></div><h2><strong>The design</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!X6_x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a729dfd-54f3-4f5a-bcd1-e6ab803a9581_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!X6_x!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a729dfd-54f3-4f5a-bcd1-e6ab803a9581_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!X6_x!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a729dfd-54f3-4f5a-bcd1-e6ab803a9581_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!X6_x!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a729dfd-54f3-4f5a-bcd1-e6ab803a9581_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!X6_x!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a729dfd-54f3-4f5a-bcd1-e6ab803a9581_1672x941.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!X6_x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a729dfd-54f3-4f5a-bcd1-e6ab803a9581_1672x941.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2a729dfd-54f3-4f5a-bcd1-e6ab803a9581_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!X6_x!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a729dfd-54f3-4f5a-bcd1-e6ab803a9581_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!X6_x!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a729dfd-54f3-4f5a-bcd1-e6ab803a9581_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!X6_x!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a729dfd-54f3-4f5a-bcd1-e6ab803a9581_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!X6_x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a729dfd-54f3-4f5a-bcd1-e6ab803a9581_1672x941.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>reframe</strong></h3><p>The paper calls the method person-grounded trace-to-skill distillation. A lightweight profile, a source scope, and a set of documents map to a package the paper writes as <em><strong>S = (A, M, L)</strong></em>: the generated files, the install metadata, and the lifecycle state (version, correction count, rollback history). The package is meant to be portable, inspectable, composable, correctable, and governable.</p><h3><strong>split</strong></h3><p>The sharpest design choice is keeping capability separate from behavior. <em><strong><a href="http://work.md/">work.md</a></strong></em> holds review criteria, workflows, and decision heuristics. 
<em><strong>persona.md</strong></em> holds tone, interaction rules, and a correction log.</p><p>Those generate three entry points: the full skill, work-only, and persona-only. Work-only is the safer one, because a review checklist does not need a personality. In the paper&#8217;s example, a colleague skill encodes a review order: check authentication, input validation, rate limiting, response schema, and sensitive-data exposure before smaller issues.</p><h3><strong>files</strong></h3><p>Generation emits seven files on schema v3: <em><strong>SKILL.md</strong></em>, <em><strong>work.md</strong></em>, <em><strong>persona.md</strong></em>, the two sub-skills <em><strong>work_skill.md</strong></em> and <em><strong>persona_skill.md</strong></em>, plus <em><strong>manifest.json</strong></em> and <em><strong>meta.json</strong></em> for install and lifecycle metadata.</p><h3><strong>fix loop</strong></h3><p>Corrections are written in plain language. Say &#8220;he would not push back there&#8221; and the handler routes it: a work correction patches the matching <em><strong>##</strong></em> section in <em><strong>work.md</strong></em>, and a behavior correction appends a <em><strong>{scene, wrong, correct}</strong></em> record to <em><strong>persona.md</strong></em>. Every update archives the prior version first, and <em><strong>version_manager.py</strong></em> rolls back to any of the last 10.</p><h3><strong>The honest part</strong></h3><p>The paper makes one kind of claim: that this format and workflow exist and run. It does not claim the skill reproduces the person or improves anyone&#8217;s work. 
The authors call their own open problem the behavioral fidelity frontier, and the paper ships no held-out task study to close it.</p><div><hr></div><h2><strong>How to get started</strong></h2><p>Here is the short path. Every command, including install-to-deploy and rollback, is in the appendix at the end.</p><p>Clone the repo into your host&#8217;s skills directory, or hand the URL to your agent and let it install itself. Then run <em><strong>/dot-skill</strong></em>, pick the <em><strong>colleague</strong></em> family, and answer three questions: an alias, a one-line role, and a few personality tags.</p><p>Point it at authorized traces. It can auto-collect from Feishu, DingTalk, or Slack, or take uploads: PDFs, screenshots, <em><strong>.eml</strong></em> files, or pasted text. It then generates the files and, by default, installs the skill into Claude Code.</p><p>Read <em><strong>work.md</strong></em> before you trust it. Then invoke the full skill with <em><strong>/{character}-{slug}</strong></em>, or the safer work-only path with <em><strong>/{character}-{slug}-work</strong></em>.</p><div><hr></div><h2><strong>AlphaSignal Take</strong></h2><p>dot-skill ships real software: collectors, a writer, installers, rollback, and 35 passing tests. The gap is between what the product page sells and what the paper actually claims.</p><p><strong>No fidelity evidence.</strong></p><p>The paper proves a file format, not that a generated skill catches what the real engineer would. It says so itself and ships no held-out evaluation. You are trusting extraction quality you cannot yet measure.</p><p><strong>The persona layer can turn a label into a rule.</strong></p><p>The colleague persona analyzer translates freeform tags like &#8220;blame-shifter&#8221; or &#8220;PUA&#8221; into Layer 0 rules the agent must never break, and manual tags outrank the actual traces. 
The repo&#8217;s own colleague demo shows the skill dodging blame on cue. That is bias compiled into behavior, by design.</p><p><strong>Governance is an affordance, not a guarantee.</strong></p><p>Local-first, versioned files give you control, but nothing enforces consent, retention, or redaction in code, and deletion is <em><strong>rm -rf</strong></em>. The product site mentions RAG, yet there is no retrieval runtime in the repo (<em><strong>requirements.txt</strong></em> lists only requests, pypinyin, playwright, slack-sdk, python-docx, and openpyxl).</p><p>So the best recommendation is to adopt the work-only path and treat the rest as a research preview. Package a departing engineer&#8217;s review checklist as a <em><strong>work.md</strong></em> skill, test it against reviews you already graded, and keep persona off until someone measures fidelity. Separate <strong>relationship</strong> and <strong>celebrity</strong> papers are promised, which is exactly where the consent questions get harder.</p><div><hr></div><p>If a teammate left tomorrow, would you trust a work-only skill of their review checklist, or is judgment the part that never compiles?</p><p><strong>All source links are in the first reply. 
Full breakdown of recent updates + daily signals in our newsletter (link in bio).</strong></p><div><hr></div><h2><strong>Appendix: How to actually run dot-skill</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!S_CT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49c51f60-46ce-44a0-aa33-377f1c958863_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!S_CT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49c51f60-46ce-44a0-aa33-377f1c958863_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!S_CT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49c51f60-46ce-44a0-aa33-377f1c958863_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!S_CT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49c51f60-46ce-44a0-aa33-377f1c958863_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!S_CT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49c51f60-46ce-44a0-aa33-377f1c958863_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!S_CT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49c51f60-46ce-44a0-aa33-377f1c958863_1672x941.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/49c51f60-46ce-44a0-aa33-377f1c958863_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!S_CT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49c51f60-46ce-44a0-aa33-377f1c958863_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!S_CT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49c51f60-46ce-44a0-aa33-377f1c958863_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!S_CT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49c51f60-46ce-44a0-aa33-377f1c958863_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!S_CT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49c51f60-46ce-44a0-aa33-377f1c958863_1672x941.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Shorter than the repo&#8217;s README. Eight steps, install to rollback.</p><p><strong>1. Install dot-skill into your host.</strong></p><p>Clone the repo into your host&#8217;s skills directory. For Claude Code that is <em><strong>~/.claude/skills/dot-skill</strong></em>.</p><p>bash</p><pre><code><code>git clone https://github.com/titanwings/colleague-skill ~/.claude/skills/dot-skill</code></code></pre><p>OpenClaw uses <em><strong>~/.openclaw/workspace/skills/dot-skill</strong></em>, Codex uses <em><strong>~/.codex/skills/dot-skill</strong></em>. For Hermes, clone anywhere and run <em><strong>python3 tools/install_hermes_skill.py --force</strong></em>. Or skip all of this and tell your agent to install the skill at the repo URL.</p><p><strong>2. Launch and pick a family.</strong></p><p>Run <em><strong>/dot-skill</strong></em> and choose <em><strong>colleague</strong></em>. 
The other presets are <em><strong>relationship</strong></em> and <em><strong>celebrity</strong></em>.</p><p><strong>3. Answer the intake.</strong></p><p>Three questions: alias, a one-line role, and personality tags. Keep the tags factual. They become behavior rules, not flavor text.</p><p><strong>4. Provide source material.</strong></p><p>Auto-collect from a chat platform, or upload files.</p><p>bash</p><pre><code><code># Slack auto-collect (an admin installs the bot; free workspaces cap history at 90 days)
python3 tools/slack_auto_collector.py --setup
python3 tools/slack_auto_collector.py --name "Jane Doe"</code></code></pre><p>Or upload PDFs, screenshots, <em><strong>.eml</strong></em> archives, or pasted text. You can also skip collection and generate from the intake alone.</p><p><strong>5. Generate and inspect.</strong></p><p>Generation runs through the writer and emits all seven files.</p><p>bash</p><pre><code><code>python3 tools/skill_writer.py \
  --action create \
  --character colleague \
  --slug jane-doe \
  --name "Jane Doe" \
  --meta /tmp/meta.json \
  --work /tmp/work.md \
  --persona /tmp/persona.md \
  --base-dir ./skills/colleague \
  --no-install-claude-skill</code></code></pre><p>By default the create step auto-installs into Claude Code. Pass <em><strong>--no-install-claude-skill</strong></em> to skip that install so you can read <em><strong>work.md</strong></em> and <em><strong>persona.md</strong></em> first. This is where you catch a manual tag that became a rule when it should not have.</p><p><strong>6. Install the generated skill to a host.</strong></p><p>bash</p><pre><code><code>python3 tools/install_claude_generated_skill.py --skill-dir skills/colleague/jane-doe --force</code></code></pre><p>Use <em><strong>install_openclaw_generated_skill.py</strong></em> or <em><strong>install_codex_generated_skill.py</strong></em> for the other hosts. Then invoke the full skill with <em><strong>/colleague-jane-doe</strong></em>, or the work-only entry point with <em><strong>/colleague-jane-doe-work</strong></em>.</p><p><strong>7. Correct it in plain English.</strong></p><p>Tell the agent what is wrong. A work fix patches a <em><strong>##</strong></em> section, and a behavior fix becomes a <em><strong>{scene, wrong, correct}</strong></em> record.</p><p>bash</p><pre><code><code>python3 tools/skill_writer.py \
  --action update \
  --character colleague \
  --slug jane-doe \
  --correction-json /tmp/correction.json \
  --base-dir ./skills/colleague</code></code></pre><p><strong>8. Version and roll back.</strong></p><p>Every update archives the prior version. The version manager keeps the last 10, and cleanup is manual, so old versions stay until you remove them.</p><p>bash</p><pre><code><code>python3 tools/version_manager.py --action rollback --character colleague --slug jane-doe --version 3 --base-dir ./skills/colleague</code></code></pre><p>Deploy maps to wherever your agent reads skills: <em><strong>~/.claude/skills/</strong></em>, <em><strong>~/.codex/skills/</strong></em>, or the Hermes skill directory.</p>]]></content:encoded></item><item><title><![CDATA[How Claude Code Harness turns agent coding into a contract-first delivery loop]]></title><description><![CDATA[A workflow plugin, not a magic safety layer.]]></description><link>https://alphasignalai.substack.com/p/how-claude-code-harness-turns-agent</link><guid isPermaLink="false">https://alphasignalai.substack.com/p/how-claude-code-harness-turns-agent</guid><dc:creator><![CDATA[AlphaSignal AI]]></dc:creator><pubDate>Fri, 29 May 2026 16:02:02 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lY-T!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e32cd16-8f85-489e-9643-267d3814b976_1672x941.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lY-T!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e32cd16-8f85-489e-9643-267d3814b976_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!lY-T!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e32cd16-8f85-489e-9643-267d3814b976_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!lY-T!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e32cd16-8f85-489e-9643-267d3814b976_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!lY-T!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e32cd16-8f85-489e-9643-267d3814b976_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!lY-T!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e32cd16-8f85-489e-9643-267d3814b976_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lY-T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e32cd16-8f85-489e-9643-267d3814b976_1672x941.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6e32cd16-8f85-489e-9643-267d3814b976_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1284076,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://alphasignalai.substack.com/i/199506592?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e32cd16-8f85-489e-9643-267d3814b976_1672x941.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lY-T!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e32cd16-8f85-489e-9643-267d3814b976_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!lY-T!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e32cd16-8f85-489e-9643-267d3814b976_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!lY-T!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e32cd16-8f85-489e-9643-267d3814b976_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!lY-T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e32cd16-8f85-489e-9643-267d3814b976_1672x941.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p><strong>In ~7 mins: what a 1,730-star Claude plugin actually ships, the 5-verb loop that turns agent coding into delivery evidence, the 13 guardrails behind it, and 4 rough edges the README doesn&#8217;t flag.</strong></p></blockquote><p>When you let Claude Code loose on a repo, plans live in chat, tests become optional, review happens too late, and release notes get reconstructed from memory.</p><p><strong>Claude Code Harness</strong> is an MIT-licensed plugin that wraps that work in a five-verb loop and treats two files as the source of truth.</p><p><strong>The shift</strong> it represents is the part worth watching: agent tooling moving from chat output to delivery evidence.</p><p><strong>Snapshot:</strong> v4.12.7, Go-native runtime, 1,700+ stars, 190+ forks, 33 shared skills.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://alphasignalai.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading AlphaSignal! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2><strong>Why this is worth a look</strong></h2><p>Claude Code stacks have been quietly maturing. Hermes Agent, Superpowers, and a few others have started looking less like prompt packs and more like operating systems wrapped around the model. Harness sits in that same lane.</p><p>Its specific bet is narrow. The model can write useful code. The surrounding process keeps drifting. Plans float. Scope expands quietly. Review collapses into implementation. Release notes get rebuilt by memory.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gowS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36f69038-347e-4339-b2de-3c9a2ec5ff8b_1155x627.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gowS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36f69038-347e-4339-b2de-3c9a2ec5ff8b_1155x627.png 424w, https://substackcdn.com/image/fetch/$s_!gowS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36f69038-347e-4339-b2de-3c9a2ec5ff8b_1155x627.png 848w, 
https://substackcdn.com/image/fetch/$s_!gowS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36f69038-347e-4339-b2de-3c9a2ec5ff8b_1155x627.png 1272w, https://substackcdn.com/image/fetch/$s_!gowS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36f69038-347e-4339-b2de-3c9a2ec5ff8b_1155x627.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gowS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36f69038-347e-4339-b2de-3c9a2ec5ff8b_1155x627.png" width="1155" height="627" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/36f69038-347e-4339-b2de-3c9a2ec5ff8b_1155x627.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:627,&quot;width&quot;:1155,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!gowS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36f69038-347e-4339-b2de-3c9a2ec5ff8b_1155x627.png 424w, https://substackcdn.com/image/fetch/$s_!gowS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36f69038-347e-4339-b2de-3c9a2ec5ff8b_1155x627.png 848w, 
https://substackcdn.com/image/fetch/$s_!gowS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36f69038-347e-4339-b2de-3c9a2ec5ff8b_1155x627.png 1272w, https://substackcdn.com/image/fetch/$s_!gowS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36f69038-347e-4339-b2de-3c9a2ec5ff8b_1155x627.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The repo was created in December 2025 and has shipped daily since. 
It carries 1,700+ stars and 190+ forks at v4.12.7, with the last push on 2026-05-27. There is no Hacker News thread, no Reddit discussion, no YouTube demo. Just a small Threads post and a steady release cadence. High-quality, under-the-radar.</p><div><hr></div><h2><strong>What it actually is</strong></h2><p>The author is <strong>Chachamaru</strong>, a Japanese-language solo developer who built Harness for what the repo calls &#8220;vibecoders&#8221;: solo developers running full-cycle contract development through an agent.</p><p>Two single-source-of-truth files do the heavy lifting:</p><ul><li><p><em><strong>spec.md</strong></em> is the product contract. What must stay true.</p></li><li><p><em><strong>Plans.md</strong></em> is the task ledger. What&#8217;s being worked on, what&#8217;s done, what&#8217;s blocking.</p></li></ul><p>Five verb commands sit on top: <em><strong>/harness-setup</strong></em>, <em><strong>/harness-plan</strong></em>, <em><strong>/harness-work</strong></em>, <em><strong>/harness-review</strong></em>, and <em><strong>/harness-release</strong></em>. After install, the default changes from &#8220;ask the agent to code&#8221; to: write the spec, implement only the approved slice, verify, review independently, package evidence.</p><p>The repo is not a small prompt pack. It ships a Claude plugin manifest, a Codex plugin manifest, OpenCode mirrors, a Go runtime, hook definitions, setup scripts, 33 shared skills, tests, and release tooling. 1,529 tracked files. 
Shell leads the byte count at 2.13 MB, Go follows at 1.56 MB, then JavaScript, TypeScript, and a thin Python layer.</p><div><hr></div><h2><strong>How it works</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0xMy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd134befe-e296-4c82-8126-e0675e9c204d_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0xMy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd134befe-e296-4c82-8126-e0675e9c204d_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!0xMy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd134befe-e296-4c82-8126-e0675e9c204d_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!0xMy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd134befe-e296-4c82-8126-e0675e9c204d_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!0xMy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd134befe-e296-4c82-8126-e0675e9c204d_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0xMy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd134befe-e296-4c82-8126-e0675e9c204d_1672x941.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d134befe-e296-4c82-8126-e0675e9c204d_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!0xMy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd134befe-e296-4c82-8126-e0675e9c204d_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!0xMy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd134befe-e296-4c82-8126-e0675e9c204d_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!0xMy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd134befe-e296-4c82-8126-e0675e9c204d_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!0xMy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd134befe-e296-4c82-8126-e0675e9c204d_1672x941.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3><strong>The operating loop</strong></h3><p>Plan, work, review, release. Each stage has an explicit gate. The user approves the generated contract before any code is written. Major review findings block completion. Release preflight checks tag, version, changelog, and evidence packaging before any final action.</p><p>One rule sets the tone for the whole loop: data the agent has not directly seen stays unknown rather than being silently invented. It is stated in the spec, not implied.</p><h3><strong>The Go runtime</strong></h3><p>The guardrail engine was rewritten in Go starting at v4.0 (&#8220;Hokage&#8221;). The cold-start target is 1&#8211;2 ms, compared to roughly 40&#8211;60 ms for the older bash-and-TypeScript path the repo replaced. SQLite runs through <em><strong>modernc.org/sqlite</strong></em>, a pure-Go driver.
No CGO, no Node.js, no compiler toolchain to install.</p><p>The package layout enforces the speed contract:</p><ul><li><p><strong>hook-fastpath</strong> holds rule evaluation, codec, and types. No file I/O, no network, no goroutines.</p></li><li><p><strong>worker-runtime</strong> holds the SQLite store, session lifecycle, breezing orchestration, and OTel export.</p></li></ul><p>Configuration collapses into one file. <em><strong>harness.toml</strong></em> is the source. <em><strong>harness sync</strong></em> regenerates <em><strong>plugin.json</strong></em>, <em><strong>hooks.json</strong></em>, and <em><strong>settings.json</strong></em> from it. The user edits one file, not five.</p><p>The repo wires 58 command-hook entries and 4 agent-hook entries through <em><strong>hooks/hooks.json</strong></em>, covering pre-tool, post-tool, permission, session, notification, stop, and task events.</p><p>The pre-tool, post-tool, and permission paths are the only ones the Go binary owns directly. Everything else still routes through shell handlers, and the project&#8217;s own doctor warns when the hook config relies on legacy bash wrappers around the binary.</p><h3><strong>Guardrail rules</strong></h3><p>The runtime ships with thirteen rules (R01&#8211;R13), each tied to a specific tool surface.</p><ul><li><p><strong>Deny</strong> rules block <em><strong>sudo</strong></em>, <em><strong>git push --force</strong></em>, <em><strong>--no-verify</strong></em>, <em><strong>--no-gpg-sign</strong></em>, writes to <em><strong>.env</strong></em>/<em><strong>.git/</strong></em>/<em><strong>*.pem</strong></em>/<em><strong>*.key</strong></em>, and <em><strong>git reset --hard</strong></em> on protected branches.</p></li><li><p><strong>Ask</strong> rules pause on <em><strong>rm -rf</strong></em>, package installs, force-with-lease pushes, <em><strong>npx</strong></em>, and direct pushes to main or master.</p></li><li><p><strong>Warn</strong> rules cover secret-like file reads (still allowed, 
but flagged) and edits to <em><strong>package.json</strong></em>, Dockerfiles, and CI workflows.</p></li></ul><p>The honest tradeoff is on the design page: the hook system fails open on infrastructure errors. Deterministic deny rules still block when they run. But if the hook plumbing itself breaks, the design prefers approving the action to breaking the user&#8217;s session.</p><h3><strong>Host adapter tiers</strong></h3><p>Harness names what it supports and what it does not. The capability matrix lists four tiers:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!A_uj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31ead367-9b48-4d4b-b03e-70f104558270_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!A_uj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31ead367-9b48-4d4b-b03e-70f104558270_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!A_uj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31ead367-9b48-4d4b-b03e-70f104558270_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!A_uj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31ead367-9b48-4d4b-b03e-70f104558270_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!A_uj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31ead367-9b48-4d4b-b03e-70f104558270_1672x941.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!A_uj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31ead367-9b48-4d4b-b03e-70f104558270_1672x941.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/31ead367-9b48-4d4b-b03e-70f104558270_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!A_uj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31ead367-9b48-4d4b-b03e-70f104558270_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!A_uj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31ead367-9b48-4d4b-b03e-70f104558270_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!A_uj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31ead367-9b48-4d4b-b03e-70f104558270_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!A_uj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31ead367-9b48-4d4b-b03e-70f104558270_1672x941.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The project explicitly refuses to claim parity it cannot prove. That refusal is the design choice worth noticing on its own.</p><div><hr></div><h2><strong>How to get started</strong></h2><p><em>Full How-to guide in the appendix at the end.</em></p><p>Four lines through Claude Code:</p><pre><code><code>claude
/plugin marketplace add Chachamaru127/claude-code-harness
/plugin install claude-code-harness@claude-code-harness-marketplace
/harness-setup</code></code></pre><p>Then a small first request:</p><pre><code><code>/harness-plan Improve the README onboarding flow</code></code></pre><p>Harness writes or updates <em><strong>spec.md</strong></em> and <em><strong>Plans.md</strong></em> and returns a plan with scope, acceptance criteria, dependencies, unknowns, and stop conditions. The user&#8217;s job is to approve or correct, not hand-write the plan.</p><p>Requires Claude Code v2.1+ and write access to the repo. Existing users should run <em><strong>bin/harness doctor --migration-report</strong></em> before reinstalling. It inventories stale plugin caches, duplicate Codex skills, OpenCode files, and old symlinks without deleting any of them.</p><div><hr></div><h2><strong>The AlphaSignal Take</strong></h2><p>Four rough edges worth knowing before adoption.</p><p><strong>Version drift inside the checked-out repo.</strong> <em><strong>VERSION</strong></em>, <em><strong>plugin.json</strong></em>, and <em><strong>harness.toml</strong></em> all report v4.12.3. The included binaries (<em><strong>./bin/harness version</strong></em> and <em><strong>./bin/harness-darwin-amd64 version</strong></em>) report 4.11.4 (Hokage). The repo&#8217;s own doctor catches the mismatch and recommends <em><strong>cd go &amp;&amp; make install</strong></em>. A marketplace install ships the right binary. A clone-and-run-from-bin install does not.</p><p><strong>TDD is not enforced by default.</strong> The README implies TDD verification on the work step. <em><strong>harness.toml</strong></em> ships with TDD enforcement disabled. R14, the TDD guardrail, is registered as a local-trial path rather than a blocking rule.</p><p><strong>The Breezing benchmark is narrower than it looks.</strong> The report shows 14/15 passes with validation instructions versus 3/15 without, across 30 runs.
The report itself flags the limits: three tasks, one model (GLM-4.5-air through Z.AI&#8217;s haiku tier), two actual bug categories under the surface, a two-stage adaptive design, and a test of validation instructions rather than the full Breezing pipeline. Useful signal. Not proof of system effectiveness.</p><p><strong>Documentation drift from the TypeScript era.</strong> <em><strong>docs/CLAUDE_CODE_COMPATIBILITY.md</strong></em> still references v3.10.2 and Node.js requirements. The current README and the Go-runtime spec say Node is not required. Some skill files still reference deleted concepts. The project&#8217;s own <em><strong>deleted-concepts.yaml</strong></em> exists exactly because this residue keeps showing up after every major migration.</p><p>The stronger claim is the narrower one. For Claude Code users, Harness gives a structured way to turn agent work into reviewed, evidence-backed repo changes. Treat it as a workflow and evidence system. Not a finished safety layer.</p><div><hr></div><p><strong>Which part of your agent workflow gets loosest when you let the model drive: plans, tests, or release notes?</strong></p><div><hr></div><h2><strong>Appendix: Full command walkthrough</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EQCd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F945a1094-8b2c-4b6a-a647-35e781cfa9eb_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EQCd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F945a1094-8b2c-4b6a-a647-35e781cfa9eb_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!EQCd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F945a1094-8b2c-4b6a-a647-35e781cfa9eb_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!EQCd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F945a1094-8b2c-4b6a-a647-35e781cfa9eb_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!EQCd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F945a1094-8b2c-4b6a-a647-35e781cfa9eb_1672x941.png 1456w" sizes="100vw"><img
src="https://substackcdn.com/image/fetch/$s_!EQCd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F945a1094-8b2c-4b6a-a647-35e781cfa9eb_1672x941.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/945a1094-8b2c-4b6a-a647-35e781cfa9eb_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!EQCd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F945a1094-8b2c-4b6a-a647-35e781cfa9eb_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!EQCd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F945a1094-8b2c-4b6a-a647-35e781cfa9eb_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!EQCd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F945a1094-8b2c-4b6a-a647-35e781cfa9eb_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!EQCd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F945a1094-8b2c-4b6a-a647-35e781cfa9eb_1672x941.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3><strong>Install and setup</strong></h3><pre><code><code>claude
/plugin marketplace add Chachamaru127/claude-code-harness
/plugin install claude-code-harness@claude-code-harness-marketplace
/harness-setup</code></code></pre><p>Setup installs project guidance, command surfaces, hooks, and a baseline check so the workflow starts from a known state.</p><p><strong>/harness-plan &lt;request&gt;</strong></p><p>Example:</p><pre><code><code>/harness-plan Add a JSON export endpoint to the user API</code></code></pre><p>What it produces:</p><ul><li><p>A spec delta written to <em><strong>spec.md</strong></em>, or a documented &#8220;spec skip reason&#8221; if the request does not change the product contract.</p></li><li><p>Task rows in <em><strong>Plans.md</strong></em> with scope, acceptance criteria, dependencies, unknowns, and stop conditions.</p></li></ul><p>The user approves or corrects the contract before any code is written.</p><p><strong>/harness-work &lt;task&gt;</strong></p><p>Example:</p><pre><code><code>/harness-work 1.1.1</code></code></pre><p>Implements one approved slice and records verification evidence. The work step refuses silent scope expansion. If the work outgrows the approved row, the loop stops and asks.</p><p><strong>/harness-review</strong></p><p>Runs as a separate step, not blended into implementation. Major findings block completion. Minor findings return as recommendations. The verdict format is fixed: APPROVE or REQUEST_CHANGES.</p><p><strong>/harness-release --dry-run</strong></p><p>Checks changelog state, version sync across <em><strong>VERSION</strong></em>, <em><strong>plugin.json</strong></em>, and <em><strong>harness.toml</strong></em>, tag boundaries, and release-evidence packaging. Dry-run reports what would happen without performing any release action.</p><h3><strong>Codex CLI route (compatibility, not parity)</strong></h3><pre><code><code>git clone https://github.com/Chachamaru127/claude-code-harness.git
cd claude-code-harness
./scripts/setup-codex.sh --user</code></code></pre><p>Mirrored skills are called as <em><strong>$harness-plan</strong></em>, <em><strong>$harness-work</strong></em>, <em><strong>$harness-review</strong></em>. Codex gets contract injection and post-run checks. It does not get Claude Code&#8217;s pre-tool hook enforcement.</p><h3><strong>OpenCode route (mirror + bootstrap)</strong></h3><pre><code><code>/path/to/claude-code-harness/scripts/setup-opencode.sh</code></code></pre><p>Mirrors the harness skills into OpenCode-compatible files. Runtime parity is not claimed.</p><h3><strong>Existing-user health check</strong></h3><pre><code><code>bin/harness doctor --migration-report</code></code></pre><p>Inventories stale Claude plugin caches, missing slash entries, duplicate Codex skills, OpenCode files, backup paths, and harness-mem state. Deletes nothing. Run this before reinstalling or cleaning up.</p>]]></content:encoded></item><item><title><![CDATA[The Model Isn't the Agent Anymore]]></title><description><![CDATA[A UC Berkeley paper argues that long-horizon agent performance now turns on six system components around the model, not just the model itself.]]></description><link>https://alphasignalai.substack.com/p/the-model-isnt-the-agent-anymore</link><guid isPermaLink="false">https://alphasignalai.substack.com/p/the-model-isnt-the-agent-anymore</guid><dc:creator><![CDATA[AlphaSignal AI]]></dc:creator><pubDate>Thu, 28 May 2026 16:01:18 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!jftI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24c81cb8-3852-437b-bb6d-00d065109e50_1672x941.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!jftI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24c81cb8-3852-437b-bb6d-00d065109e50_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jftI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24c81cb8-3852-437b-bb6d-00d065109e50_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!jftI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24c81cb8-3852-437b-bb6d-00d065109e50_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!jftI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24c81cb8-3852-437b-bb6d-00d065109e50_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!jftI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24c81cb8-3852-437b-bb6d-00d065109e50_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jftI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24c81cb8-3852-437b-bb6d-00d065109e50_1672x941.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/24c81cb8-3852-437b-bb6d-00d065109e50_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1332129,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://alphasignalai.substack.com/i/199504885?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24c81cb8-3852-437b-bb6d-00d065109e50_1672x941.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jftI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24c81cb8-3852-437b-bb6d-00d065109e50_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!jftI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24c81cb8-3852-437b-bb6d-00d065109e50_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!jftI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24c81cb8-3852-437b-bb6d-00d065109e50_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!jftI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24c81cb8-3852-437b-bb6d-00d065109e50_1672x941.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><blockquote><p>In ~9 mins: the six-component framework, the three bottlenecks every agent runs into, why &#8220;agent score&#8221; is now a system score, and a runnable appendix to inspect the reference harness yourself.</p></blockquote><p>The model is the part you can pick.
Everything else is the part that breaks.</p><p><strong>Shangding Gu</strong> (UC Berkeley) argues that long-horizon agent performance now depends as much on the surrounding system as on stronger foundation models.</p><p><strong>The paper</strong> names this work &#8220;scaling the harness&#8221; and decomposes any agent into six interacting components, with three of them already at saturation.</p><p>A runnable appendix at the end installs the reference harness and points each step back to one of those six components.</p><div><hr></div><h2><strong>Why this paper, why now</strong></h2><p>The paper is by <strong>Shangding Gu (UC Berkeley)</strong> and is titled &#8220;<strong>From Model Scaling to System Scaling: Scaling the Harness in Agentic AI</strong>&#8221; (arXiv 2605.26112v1, May 25, 2026).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!G1Vc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ec15de7-b0d0-4117-bf01-271b068836be_1150x1000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source
type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!G1Vc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ec15de7-b0d0-4117-bf01-271b068836be_1150x1000.png 424w, https://substackcdn.com/image/fetch/$s_!G1Vc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ec15de7-b0d0-4117-bf01-271b068836be_1150x1000.png 848w, https://substackcdn.com/image/fetch/$s_!G1Vc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ec15de7-b0d0-4117-bf01-271b068836be_1150x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!G1Vc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ec15de7-b0d0-4117-bf01-271b068836be_1150x1000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!G1Vc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ec15de7-b0d0-4117-bf01-271b068836be_1150x1000.png" width="1150" height="1000" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4ec15de7-b0d0-4117-bf01-271b068836be_1150x1000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1000,&quot;width&quot;:1150,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!G1Vc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ec15de7-b0d0-4117-bf01-271b068836be_1150x1000.png 424w, https://substackcdn.com/image/fetch/$s_!G1Vc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ec15de7-b0d0-4117-bf01-271b068836be_1150x1000.png 848w, https://substackcdn.com/image/fetch/$s_!G1Vc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ec15de7-b0d0-4117-bf01-271b068836be_1150x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!G1Vc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ec15de7-b0d0-4117-bf01-271b068836be_1150x1000.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>It lands in a year where &#8220;agent score&#8221; has stopped tracking model score. SWE-agent showed that redesigning the tool schema alone moves SWE-bench accuracy with the same model underneath. tau-bench&#8217;s pass^k metric showed that agents that look strong on single-shot success collapse under repeated trials.</p><p>Anthropic reported a 90.2% gain from a multi-agent system (Opus 4 lead with Sonnet 4 subagents) over single-agent Opus 4 on internal research, with <strong>token usage alone explaining 80% of BrowseComp performance variance</strong> and tool-call count plus model choice taking that to 95%.</p><p>The frame is now mainstream. OpenAI calls it &#8220;harness engineering.&#8221; Anthropic calls it &#8220;context engineering.&#8221; AMA-Bench (ICML 2026) and RealMem both target memory hygiene. Gu&#8217;s paper is the first to put a clean name and a decomposition on what those efforts have in common.</p><div><hr></div><h2><strong>The idea</strong></h2><p>An agent is six things, not one. The model reasons. The harness around it picks what to remember, what context to assemble, which tool to call, how to verify each step, and what trace to record.
Most current evaluation only measures the model.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!m6qe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf8d0f8d-5bde-4e51-b281-25dfb8795b3a_1488x837.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!m6qe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf8d0f8d-5bde-4e51-b281-25dfb8795b3a_1488x837.png 424w, https://substackcdn.com/image/fetch/$s_!m6qe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf8d0f8d-5bde-4e51-b281-25dfb8795b3a_1488x837.png 848w, https://substackcdn.com/image/fetch/$s_!m6qe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf8d0f8d-5bde-4e51-b281-25dfb8795b3a_1488x837.png 1272w, https://substackcdn.com/image/fetch/$s_!m6qe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf8d0f8d-5bde-4e51-b281-25dfb8795b3a_1488x837.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!m6qe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf8d0f8d-5bde-4e51-b281-25dfb8795b3a_1488x837.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bf8d0f8d-5bde-4e51-b281-25dfb8795b3a_1488x837.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!m6qe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf8d0f8d-5bde-4e51-b281-25dfb8795b3a_1488x837.png 424w, https://substackcdn.com/image/fetch/$s_!m6qe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf8d0f8d-5bde-4e51-b281-25dfb8795b3a_1488x837.png 848w, https://substackcdn.com/image/fetch/$s_!m6qe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf8d0f8d-5bde-4e51-b281-25dfb8795b3a_1488x837.png 1272w, https://substackcdn.com/image/fetch/$s_!m6qe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf8d0f8d-5bde-4e51-b281-25dfb8795b3a_1488x837.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>Reader test. When someone says &#8220;this agent is better,&#8221; ask whether the model improved or whether the harness changed.
Tool schemas, file inspection order, retry policies, and memory rules all sit inside the headline number and currently cannot be separated from it.</p><div><hr></div><h2><strong>The framework: six components, one equation</strong></h2><p>Gu defines agent performance over a horizon H as:</p><p>P_H = &#934;(R, M, C, S, O, G)</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mn9k!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87e43944-7fd2-4e68-b027-93241e5cffd2_1488x837.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mn9k!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87e43944-7fd2-4e68-b027-93241e5cffd2_1488x837.png 424w, https://substackcdn.com/image/fetch/$s_!mn9k!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87e43944-7fd2-4e68-b027-93241e5cffd2_1488x837.png 848w, https://substackcdn.com/image/fetch/$s_!mn9k!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87e43944-7fd2-4e68-b027-93241e5cffd2_1488x837.png 1272w, https://substackcdn.com/image/fetch/$s_!mn9k!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87e43944-7fd2-4e68-b027-93241e5cffd2_1488x837.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mn9k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87e43944-7fd2-4e68-b027-93241e5cffd2_1488x837.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/87e43944-7fd2-4e68-b027-93241e5cffd2_1488x837.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!mn9k!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87e43944-7fd2-4e68-b027-93241e5cffd2_1488x837.png 424w, https://substackcdn.com/image/fetch/$s_!mn9k!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87e43944-7fd2-4e68-b027-93241e5cffd2_1488x837.png 848w, https://substackcdn.com/image/fetch/$s_!mn9k!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87e43944-7fd2-4e68-b027-93241e5cffd2_1488x837.png 1272w, https://substackcdn.com/image/fetch/$s_!mn9k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87e43944-7fd2-4e68-b027-93241e5cffd2_1488x837.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>Model scaling improves <em><strong>R</strong></em>. System scaling improves the other five.</p><p>The paper&#8217;s main claim is conditional. Once <em><strong>R</strong></em> clears a capability threshold, marginal gains shift to <em><strong>M</strong></em>, <em><strong>C</strong></em>, <em><strong>S</strong></em>, <em><strong>O</strong></em>, and <em><strong>G</strong></em>. Gu also flags that <em><strong>&#934;</strong></em> has no closed form and the six factors are not strictly orthogonal.
They are instead six distinct points of intervention, each of which engineering effort can change while holding the rest fixed.</p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Sv4N!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d7375cd-1008-48c6-bfe3-534a58a1e9a4_1488x837.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Sv4N!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d7375cd-1008-48c6-bfe3-534a58a1e9a4_1488x837.png 424w, https://substackcdn.com/image/fetch/$s_!Sv4N!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d7375cd-1008-48c6-bfe3-534a58a1e9a4_1488x837.png 848w, https://substackcdn.com/image/fetch/$s_!Sv4N!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d7375cd-1008-48c6-bfe3-534a58a1e9a4_1488x837.png 1272w, https://substackcdn.com/image/fetch/$s_!Sv4N!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d7375cd-1008-48c6-bfe3-534a58a1e9a4_1488x837.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Sv4N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d7375cd-1008-48c6-bfe3-534a58a1e9a4_1488x837.png" width="1456" height="819"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2d7375cd-1008-48c6-bfe3-534a58a1e9a4_1488x837.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Sv4N!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d7375cd-1008-48c6-bfe3-534a58a1e9a4_1488x837.png 424w, https://substackcdn.com/image/fetch/$s_!Sv4N!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d7375cd-1008-48c6-bfe3-534a58a1e9a4_1488x837.png 848w, https://substackcdn.com/image/fetch/$s_!Sv4N!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d7375cd-1008-48c6-bfe3-534a58a1e9a4_1488x837.png 1272w, https://substackcdn.com/image/fetch/$s_!Sv4N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d7375cd-1008-48c6-bfe3-534a58a1e9a4_1488x837.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><h3><strong>Bottleneck 1: context governance</strong></h3><p>The hard problem of context is not capacity. It&#8217;s governance.</p><p>Gu factors context quality into four subaxes: <strong>relevance</strong> (matches the current subproblem), <strong>compactness</strong> (no more than the minimum sufficient set), <strong>traceability</strong> (provenance can be inspected), and <strong>refresh policy</strong> (stale context gets rechecked).</p><p>Failure mode: <strong>exposure without access</strong>. As context grows, the model sees more tokens but does not necessarily attend to the right ones. Relevant evidence competes with low-value padding. Attention dilutes over long inputs. Models prefer evidence at the start or end of the window rather than the middle.</p><p>System move: treat each turn&#8217;s context as the output of a selection policy, not a fixed buffer.
Weight relevance, penalize verbosity against a token budget, prefer recently validated content, record provenance.</p><p>Claude Code is the production version of this. Persistent project context lives in <em><strong>CLAUDE.md</strong></em>. Just-in-time access happens through <em><strong>glob</strong></em>, <em><strong>grep</strong></em>, and file reads. The harness gets stable priors plus environment refresh on demand.</p><p>The right systems question is no longer how many tokens the model can hold. It is what the minimum sufficient context is for the current subproblem.</p><h3><strong>Bottleneck 2: trustworthy memory</strong></h3><p>The hard problem of agent memory is not storage. It&#8217;s trust.</p><p>Four subaxes again: <strong>precision</strong> (the claim has a narrow scope), <strong>durability</strong> (the target has not silently changed), <strong>retrievability</strong> (the memory can be found at acceptable cost), <strong>verifiability</strong> (the claim can be checked against the live environment). Retrievability is a precondition for using trust, not a source of it.</p><p>Failure mode: <strong>stale-but-confident</strong>. Gu&#8217;s example: a note like &#8220;the data loader is defined in <em><strong>utils/loader.py</strong></em>.&#8221; After a refactor, this is flatly wrong, but semantic search still ranks it highly. If the agent acts on it without re-checking, the error is operational.</p><p>The safety variant is worse. &#8220;Remembering More, Risking More&#8221; (Al-Tawaha et al., 2026) finds that memory-enabled agents exceed a NullMemory baseline on safety violations, and that violation rates trend upward with exposure length.</p><p>System move: trust is a runtime decision, not a property of the stored item. Store confidence, source, and age. Penalize staleness in retrieval. Re-check environment-dependent claims before acting.
Treat retrieved memory as a hypothesis until verified.</p><p>CheetahClaws implements this directly. Each memory entry stores <em><strong>confidence</strong></em>, <em><strong>source</strong></em>, <em><strong>last_used_at</strong></em>, and <em><strong>conflict_group</strong></em> as first-class fields. Search re-ranks by confidence &#215; recency. The other two harnesses in the paper&#8217;s comparison derive trust implicitly from access patterns.</p><p>Durable memory without verification accumulates undetected drift. Environment-only search throws away every prior verification. A working harness keeps both.</p><h3><strong>Bottleneck 3: dynamic skill routing</strong></h3><p>The hard problem of skill is not having skills. It&#8217;s routing and checking them.</p><p>Four subaxes: <strong>specificity</strong> (each skill states what it can and cannot do), <strong>selectivity</strong> (the router invokes the right skill at the right time), <strong>composability</strong> (one skill&#8217;s post-conditions feed the next), <strong>verifiability</strong> (every skill output has an explicit check).</p><p>Failure mode: <strong>confident-but-unchecked</strong>. As specialized subagents multiply, the risk shifts from missing capability to present-but-unverified capability. A subagent returns plausible output. No downstream layer validates. This is the symmetric form of stale-but-confident memory.</p><p>System move: routing as a learned policy, not a fixed rule set, paired with post-condition checks at every step. <em><strong>S</strong></em> and <em><strong>G</strong></em> are not independent. Scaling skill quality without scaling verification produces faster but less reliable progress.</p><p>Adjacent work is converging. OpenAI&#8217;s skills docs define skills as repeatable procedures, separated from prompts, attached to the execution environment. 
SkillOpt (Yang et al., 2026) treats the skill file like neural-network weights and accepts edits only when held-out validation improves.</p><p>The open research direction is adaptive allocation across skills based on subtask type, confidence, and verification cost. Production routing today is mostly hand-coded.</p><div><hr></div><h3><strong>What&#8217;s missing in evaluation</strong></h3><p>Outcome metrics answer whether the task was solved. Process metrics answer how. Two agents may both pass the same benchmark while differing in tokens, tool calls, retries, failed edits, and human interventions.
Endpoint accuracy cannot see that.</p><p>Gu lists eight dimensions a harness-aware benchmark should track:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!49KJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbea293a6-32d9-420d-9f07-c6f246d7d246_1488x837.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!49KJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbea293a6-32d9-420d-9f07-c6f246d7d246_1488x837.png 424w, https://substackcdn.com/image/fetch/$s_!49KJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbea293a6-32d9-420d-9f07-c6f246d7d246_1488x837.png 848w, https://substackcdn.com/image/fetch/$s_!49KJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbea293a6-32d9-420d-9f07-c6f246d7d246_1488x837.png 1272w, https://substackcdn.com/image/fetch/$s_!49KJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbea293a6-32d9-420d-9f07-c6f246d7d246_1488x837.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!49KJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbea293a6-32d9-420d-9f07-c6f246d7d246_1488x837.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bea293a6-32d9-420d-9f07-c6f246d7d246_1488x837.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!49KJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbea293a6-32d9-420d-9f07-c6f246d7d246_1488x837.png 424w, https://substackcdn.com/image/fetch/$s_!49KJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbea293a6-32d9-420d-9f07-c6f246d7d246_1488x837.png 848w, https://substackcdn.com/image/fetch/$s_!49KJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbea293a6-32d9-420d-9f07-c6f246d7d246_1488x837.png 1272w, https://substackcdn.com/image/fetch/$s_!49KJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbea293a6-32d9-420d-9f07-c6f246d7d246_1488x837.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>Only the first is well covered today. tau-bench&#8217;s pass^k is an early step on reliability. AMA-Bench (ICML 2026) is an early step on long-horizon memory. The remaining dimensions need work.</p><p>Safe agent evolution is the second open question. Gu proposes four sub-questions for any agent that adapts over time: <strong>what persists</strong>, <strong>what updates</strong>, <strong>what is measured</strong>, <strong>what is auditable</strong>. Memory, skills, preferences, and guardrails should not collapse into one undifferentiated state.</p><p>The evidence from Anthropic&#8217;s multi-agent post lands here. Token usage alone explained 80% of BrowseComp performance variance. Token usage plus tool calls plus model choice took it to 95%.
Compute allocation across the harness is the dominant factor, not which model is wrapped.</p><div><hr></div><h2><strong>How to apply this to your stack</strong></h2><p>Six questions, one per component, that work against any agent a reader ships today (Claude Code, Codex CLI, Hermes, OpenClaw, custom).</p><p><em><strong>R</strong></em>: which model, and what&#8217;s its pass^k under repeated trials?</p><p><em><strong>M</strong></em>: what&#8217;s stored, how is staleness handled, is there a re-check before action?</p><p><em><strong>C</strong></em>: is context a buffer or a policy? What gets evicted first?</p><p><em><strong>S</strong></em>: how many tools and skills, who routes, who post-condition-checks?</p><p><em><strong>O</strong></em>: retry policy, handoff protocol, max-turn cap?</p><p><em><strong>G</strong></em>: what&#8217;s logged, what&#8217;s auditable, which permissions exist independently of model output?</p><p>If your stack has no answer for one of these, that&#8217;s where the next harness gain will come from.</p><div><hr></div><h2><strong>AlphaSignal Take</strong></h2><p>The framework is sharp. The honest gaps are three.</p><p><strong>No metric yet.</strong> Gu names the levers but offers no quantitative measure for &#8220;harness quality.&#8221; Without that, system scaling is direction, not science. The evaluation agenda is the right shape, but every dimension after one-shot completion is still under-benchmarked today.</p><p><strong>The reference harness is illustration, not benchmark.</strong> CheetahClaws stores confidence and recency as first-class fields and gets credit for that. The Claude Code / OpenClaw / CheetahClaws comparison table is deliberately non-ranking. The paper says explicitly: the point is not to declare a winner. Don&#8217;t read it as one.</p><p><strong>The evaluation agenda is a wishlist.</strong> Of the eight benchmark dimensions, only one is well-covered today. 
tau-bench, AMA-Bench, and RealMem are early steps on three of the others. Four dimensions have no public benchmark. <strong>Scaling the Harness v2</strong> is the paper that turns this framework into measurable scores, and it doesn&#8217;t exist yet.</p><p>Today, the practical value is vocabulary. Six components, three bottlenecks, four subaxes each, eight benchmark dimensions, four evolution questions. That&#8217;s a checklist a reader can run on Monday morning against whatever harness they ship.</p><div><hr></div><p><strong>Which of the six components is the weakest in the agent stack you ship today?</strong></p><div><hr></div><h2><strong>Appendix: install the reference harness and inspect it yourself</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!i5mH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a38ba71-f697-4543-bd1d-5afce3781560_1488x837.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!i5mH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a38ba71-f697-4543-bd1d-5afce3781560_1488x837.png 424w, https://substackcdn.com/image/fetch/$s_!i5mH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a38ba71-f697-4543-bd1d-5afce3781560_1488x837.png 848w, https://substackcdn.com/image/fetch/$s_!i5mH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a38ba71-f697-4543-bd1d-5afce3781560_1488x837.png 1272w, https://substackcdn.com/image/fetch/$s_!i5mH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a38ba71-f697-4543-bd1d-5afce3781560_1488x837.png 1456w" sizes="100vw"><img
src="https://substackcdn.com/image/fetch/$s_!i5mH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a38ba71-f697-4543-bd1d-5afce3781560_1488x837.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9a38ba71-f697-4543-bd1d-5afce3781560_1488x837.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!i5mH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a38ba71-f697-4543-bd1d-5afce3781560_1488x837.png 424w, https://substackcdn.com/image/fetch/$s_!i5mH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a38ba71-f697-4543-bd1d-5afce3781560_1488x837.png 848w, https://substackcdn.com/image/fetch/$s_!i5mH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a38ba71-f697-4543-bd1d-5afce3781560_1488x837.png 1272w, https://substackcdn.com/image/fetch/$s_!i5mH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a38ba71-f697-4543-bd1d-5afce3781560_1488x837.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The paper is conceptual, but its reference implementation is runnable: about 40K lines of Python. Each step below names the paper component it instantiates.</p><h3><strong>1. Install</strong></h3><pre><code><code>curl -fsSL https://raw.githubusercontent.com/SafeRL-Lab/cheetahclaws/main/scripts/install.sh | bash
source ~/.zshrc
cheetahclaws</code></code></pre><p>Or via pip: <em><strong>pip install cheetahclaws</strong></em>. Python 3.10+ required. Installs the CLI binary, the harness modules, and the default config dir at <em><strong>~/.cheetahclaws/</strong></em>.</p><h3><strong>2. Pick a model &#8594; R</strong></h3><pre><code><code>cheetahclaws --model claude-opus-4-6</code></code></pre><p>Switch between <em><strong>claude-opus-4-6</strong></em>, <em><strong>gpt-4o</strong></em>, <em><strong>gemini-2.5-pro-preview-03-25</strong></em>, <em><strong>deepseek-chat</strong></em>, <em><strong>qwen-max</strong></em>, or any local Ollama model with one flag. No recompile. <strong>Component: </strong><em><strong>R</strong></em><strong> (reasoning substrate).</strong></p><h3><strong>3. Inspect memory &#8594; M</strong></h3><pre><code><code>ls ~/.cheetahclaws/memory/
cat ~/.cheetahclaws/memory/MEMORY.md</code></code></pre><p>Each entry is a Markdown file under <em><strong>~/.cheetahclaws/memory/&lt;slug&gt;.md</strong></em> with frontmatter for <em><strong>confidence</strong></em>, <em><strong>source</strong></em>, <em><strong>last_used_at</strong></em>, and <em><strong>conflict_group</strong></em>. <em><strong>MEMORY.md</strong></em> is the index, rebuilt on every write. Search re-ranks results by confidence &#215; recency. Run <em><strong>/memory consolidate</strong></em> for a manual deduplication pass. <strong>Component: </strong><em><strong>M</strong></em><strong> (memory store).</strong></p><h3><strong>4. Read the context constructor &#8594; C</strong></h3><p>Open <em><strong>context.py</strong></em>. The harness assembles each turn from seven sources: an env block, git info, prompt assets, the memory index, registered commands, tmux state, and plan-mode fragments. Compaction fires at 70% of the model&#8217;s context window, using a two-layer stack that pairs rule-based compaction with AI summarization. <strong>Component: </strong><em><strong>C</strong></em><strong> (context constructor).</strong></p><h3><strong>5. List tools and skills &#8594; S</strong></h3><pre><code><code>/skills</code></code></pre><p>This lists the Markdown skills under <em><strong>skill/</strong></em>, each with frontmatter for <em><strong>allowed_tools</strong></em>, <em><strong>model_override</strong></em>, and inline-vs-fork execution. The 27 built-in tools register at startup via <em><strong>tool_registry.py</strong></em>. Subagents live under <em><strong>multi_agent/</strong></em> with built-in roles, tool restrictions, and worktree isolation. <strong>Component: </strong><em><strong>S</strong></em><strong> (skill router).</strong></p><h3><strong>6. 
Watch the orchestration loop &#8594; O</strong></h3><pre><code><code>/verbose</code></code></pre><p>The agent loop yields typed events: <em><strong>TextChunk</strong></em>, <em><strong>ToolStart</strong></em>, <em><strong>ToolEnd</strong></em>, <em><strong>TurnDone</strong></em>. The full loop lives in <em><strong>agent.py</strong></em>, with retry, handoff, and termination logic in one file. <strong>Component: </strong><em><strong>O</strong></em><strong> (orchestration loop).</strong></p><h3><strong>7. Read the audit trail &#8594; G</strong></h3><pre><code><code>/permissions manual</code></code></pre><p>Permission modes (<em><strong>auto</strong></em>, <em><strong>accept-all</strong></em>, <em><strong>manual</strong></em>, <em><strong>plan</strong></em>) are set via <em><strong>/permissions &lt;mode&gt;</strong></em> inside the REPL. The append-only event store lives at <em><strong>~/.cheetahclaws/kernel.db</strong></em> (SQLite). The schema is defined in <em><strong>cc_kernel/event_log.py</strong></em>, with fields <em><strong>event_id</strong></em>, <em><strong>ts</strong></em>, <em><strong>kind</strong></em>, <em><strong>payload</strong></em>, <em><strong>causation_id</strong></em>, and <em><strong>correlation_id</strong></em>. <strong>Component: </strong><em><strong>G</strong></em><strong> (verification and governance).</strong></p><h3><strong>8. Launch the web UI</strong></h3><pre><code><code>cheetahclaws --web</code></code></pre><p>All six components surface in one dashboard: model selector, memory browser, context preview, tool registry, event log. It is useful for comparing the same task across two models or two permission modes. <strong>Spans all components.</strong></p><p>This is not a recommendation that CheetahClaws is the right harness for production. 
It is the one in the paper&#8217;s comparison table a reader can read end-to-end in an afternoon.</p>]]></content:encoded></item><item><title><![CDATA[Perplexity's Bumblebee: a read-only supply-chain check for the developer laptop]]></title><description><![CDATA[A read-only inventory check for the developer laptop supply chain]]></description><link>https://alphasignalai.substack.com/p/perplexitys-bumblebee-a-read-only</link><guid isPermaLink="false">https://alphasignalai.substack.com/p/perplexitys-bumblebee-a-read-only</guid><dc:creator><![CDATA[AlphaSignal AI]]></dc:creator><pubDate>Wed, 27 May 2026 16:30:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Il_x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d38ad4-17a3-43a1-bc60-2cabbbfc0b8a_1672x941.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Il_x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d38ad4-17a3-43a1-bc60-2cabbbfc0b8a_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Il_x!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d38ad4-17a3-43a1-bc60-2cabbbfc0b8a_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!Il_x!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d38ad4-17a3-43a1-bc60-2cabbbfc0b8a_1672x941.png 848w, 
https://substackcdn.com/image/fetch/$s_!Il_x!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d38ad4-17a3-43a1-bc60-2cabbbfc0b8a_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!Il_x!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d38ad4-17a3-43a1-bc60-2cabbbfc0b8a_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Il_x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d38ad4-17a3-43a1-bc60-2cabbbfc0b8a_1672x941.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b1d38ad4-17a3-43a1-bc60-2cabbbfc0b8a_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1004728,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://alphasignalai.substack.com/i/199469476?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d38ad4-17a3-43a1-bc60-2cabbbfc0b8a_1672x941.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Il_x!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d38ad4-17a3-43a1-bc60-2cabbbfc0b8a_1672x941.png 424w, 
https://substackcdn.com/image/fetch/$s_!Il_x!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d38ad4-17a3-43a1-bc60-2cabbbfc0b8a_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!Il_x!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d38ad4-17a3-43a1-bc60-2cabbbfc0b8a_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!Il_x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d38ad4-17a3-43a1-bc60-2cabbbfc0b8a_1672x941.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p>In ~7 mins: what Mini Shai-Hulud did to 324 npm packages in two weeks, why Perplexity open-sourced Bumblebee to triage the fallout, the 8 on-disk surfaces it reads, the 6-step scan loop that drives it, and the 5 rough edges in v0.1.1. Appendix: full how-to reference at the end.</p></blockquote><p>The Mini Shai-Hulud worm chewed through hundreds of npm packages in the last two weeks.</p><p><strong>Mini Shai-Hulud</strong> hit 324 <em><strong>antv</strong></em> npm packages across 643 malicious versions on May 19, dropped malicious <em><strong>tanstack</strong></em> releases on May 11 with valid SLSA Build Level 3 provenance, and crossed into PyPI through <em><strong>lightning 2.6.2</strong></em> and <em><strong>2.6.3</strong></em> on April 30. Microsoft, Snyk, Socket, and Wiz all published incident reports inside three weeks.</p><p><strong>Bumblebee</strong> is the read-only scanner Perplexity open-sourced on May 22 to handle exactly that scenario. 
Apache-2.0, written in Go, no third-party dependencies, single static binary.</p><p><strong>The narrow question it answers:</strong> when an advisory names a bad package, which developer laptops in the fleet still show it on disk right now?</p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/perplexity_ai/status/2057869990536360334&quot;,&quot;full_text&quot;:&quot;Today we're open-sourcing Bumblebee, a read-only scanner for macOS and Linux.\n\nIt checks developer machines for risky packages, extensions, and AI tool configs.\n\nConnected to Computer, it can trigger deeper scans whenever a new supply-chain risk emerges.\n\n<a class=\&quot;tweet-url\&quot; href=\&quot;https://github.com/perplexityai/bumblebee\&quot;>github.com/perplexityai/b&#8230;</a> &quot;,&quot;username&quot;:&quot;perplexity_ai&quot;,&quot;name&quot;:&quot;Perplexity&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/2009310641165660160/XArF3_Ib_normal.jpg&quot;,&quot;date&quot;:&quot;2026-05-22T17:03:33.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/HI8F2dhbMAAYU42.png&quot;,&quot;link_url&quot;:&quot;https://t.co/wXauD4wDOT&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:177,&quot;retweet_count&quot;:669,&quot;like_count&quot;:4849,&quot;impression_count&quot;:1330934,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:false}" data-component-name="Twitter2ToDOM"></div><div><hr></div><h2><strong>Where Bumblebee fits</strong></h2><p>SBOMs describe what shipped into a build. EDRs describe what ran in a process. Neither describes what is still sitting in a lockfile on a developer&#8217;s laptop while the responder is on the call.</p><p>That gap matters more in 2026 than it did two years ago. Shai-Hulud 2.0 spread through Zapier, ENS, PostHog, and Postman in November 2025. 
Mini Shai-Hulud hit SAP npm packages, PyTorch Lightning, and the AntV npm scope across late April and May 2026.</p><p>The malicious code reached developer machines through normal <em><strong>npm install</strong></em> flows long before any production SBOM was updated.</p><p>The repo is moving fast. Bumblebee crossed <strong>2,900 GitHub stars</strong> and <strong>1,600+ release binary downloads</strong> in the four days between <em><strong>v0.1.1</strong></em> shipping on May 22 and now. Open issues and PRs already cover Windows defaults, NuGet, Homebrew, <strong><a href="http://osv.dev/">OSV.dev</a></strong> integration, and a human-readable terminal mode.</p><div><hr></div><h2><strong>What it actually does in plain English</strong></h2><p>Bumblebee walks a list of known on-disk metadata locations, normalizes what it finds into NDJSON records, and (optionally) cross-checks those records against an exposure catalog the operator supplies.</p><p>It does not call <em><strong>npm ls</strong></em>. It does not call <em><strong>pip show</strong></em>, <em><strong>go list</strong></em>, <em><strong>bundle list</strong></em>, or <em><strong>composer show</strong></em>. It does not read source files, fetch threat intel at runtime, or watch processes. It does not ship a built-in advisory feed.</p><p>The first rule, never invoking a package manager, is the design pivot. During a fresh compromise, running the same package manager that just shipped the bad release is the worst move a responder can make. 
Bumblebee is built around never doing that.</p><p>Eight ecosystems are covered in <em><strong>v0.1.1</strong></em>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!u2xc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5123dc3d-7c4e-4e9e-83f9-0eaf5b69992c_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!u2xc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5123dc3d-7c4e-4e9e-83f9-0eaf5b69992c_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!u2xc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5123dc3d-7c4e-4e9e-83f9-0eaf5b69992c_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!u2xc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5123dc3d-7c4e-4e9e-83f9-0eaf5b69992c_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!u2xc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5123dc3d-7c4e-4e9e-83f9-0eaf5b69992c_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!u2xc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5123dc3d-7c4e-4e9e-83f9-0eaf5b69992c_1672x941.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5123dc3d-7c4e-4e9e-83f9-0eaf5b69992c_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!u2xc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5123dc3d-7c4e-4e9e-83f9-0eaf5b69992c_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!u2xc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5123dc3d-7c4e-4e9e-83f9-0eaf5b69992c_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!u2xc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5123dc3d-7c4e-4e9e-83f9-0eaf5b69992c_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!u2xc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5123dc3d-7c4e-4e9e-83f9-0eaf5b69992c_1672x941.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>The MCP, editor, and browser surfaces are the interesting part. Most existing supply-chain tools stop at language registries. 
Bumblebee treats Claude Desktop configs, Cursor extensions, and Chrome add-ons as part of the same surface the developer is exposed to.</p><h3><strong>How the scan loop works</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dsBB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e3a3452-97f4-4765-88e2-0020bdf8e6d9_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dsBB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e3a3452-97f4-4765-88e2-0020bdf8e6d9_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!dsBB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e3a3452-97f4-4765-88e2-0020bdf8e6d9_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!dsBB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e3a3452-97f4-4765-88e2-0020bdf8e6d9_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!dsBB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e3a3452-97f4-4765-88e2-0020bdf8e6d9_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dsBB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e3a3452-97f4-4765-88e2-0020bdf8e6d9_1672x941.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7e3a3452-97f4-4765-88e2-0020bdf8e6d9_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!dsBB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e3a3452-97f4-4765-88e2-0020bdf8e6d9_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!dsBB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e3a3452-97f4-4765-88e2-0020bdf8e6d9_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!dsBB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e3a3452-97f4-4765-88e2-0020bdf8e6d9_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!dsBB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e3a3452-97f4-4765-88e2-0020bdf8e6d9_1672x941.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>The scanner runs six steps per invocation:</p><ol><li><p>Resolve safe scan roots based on the selected profile.</p></li><li><p>Walk the file tree, skipping symlinks and over 100 sensitive or noisy directory patterns (<em><strong>.ssh</strong></em>, <em><strong>.aws</strong></em>, <em><strong>.kube</strong></em>, <em><strong>.gnupg</strong></em>, <em><strong>Library/Keychains</strong></em>, <em><strong>Library/Mail</strong></em>, <em><strong>Library/Cookies</strong></em>, browser cache subtrees).</p></li><li><p>Dispatch each recognized basename to an ecosystem-specific parser.</p></li><li><p>Normalize names (npm lowercase, PyPI PEP 503).</p></li><li><p>Emit one NDJSON <em><strong>package</strong></em> record per identity.</p></li><li><p>If <em><strong>--exposure-catalog</strong></em> was supplied, do exact <em><strong>(ecosystem, name, version)</strong></em> matching and emit a <em><strong>finding</strong></em> record per hit. 
Close the run with a <em><strong>scan_summary</strong></em>.</p></li></ol><p>Three scan profiles control how much of the disk gets walked:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4lXg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe916a4a1-365c-4751-914f-c6f0a44f0d99_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4lXg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe916a4a1-365c-4751-914f-c6f0a44f0d99_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!4lXg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe916a4a1-365c-4751-914f-c6f0a44f0d99_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!4lXg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe916a4a1-365c-4751-914f-c6f0a44f0d99_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!4lXg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe916a4a1-365c-4751-914f-c6f0a44f0d99_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4lXg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe916a4a1-365c-4751-914f-c6f0a44f0d99_1672x941.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e916a4a1-365c-4751-914f-c6f0a44f0d99_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!4lXg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe916a4a1-365c-4751-914f-c6f0a44f0d99_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!4lXg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe916a4a1-365c-4751-914f-c6f0a44f0d99_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!4lXg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe916a4a1-365c-4751-914f-c6f0a44f0d99_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!4lXg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe916a4a1-365c-4751-914f-c6f0a44f0d99_1672x941.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>baseline</strong></em> and <em><strong>project</strong></em> refuse a bare-home root by default. Only <em><strong>deep</strong></em> walks one.</p><p>Exact-match-only catalog logic is a deliberate choice. CVE scanners running on version ranges produce noise during a fresh worm wave because the advisory itself is still being scoped. Bumblebee answers a tighter question: did the exact compromised version land on this machine?
That maps directly to the incident channel where someone just said &#8220;anyone with <em><strong>lightning@2.6.2</strong></em> raise your hand.&#8221;</p><p>Safety properties are baked in below the parser layer:</p><ul><li><p>No package-manager execution and no source-file reads.</p></li><li><p>No bundled threat intel and no network calls during the scan.</p></li><li><p>MCP <em><strong>env</strong></em> values and key names are dropped, so credentials sitting in Claude Desktop configs never leave the host.</p></li><li><p><em><strong>.env</strong></em> and <em><strong>.envrc</strong></em> are skipped even when they fall inside a walked root.</p></li><li><p>Remote MCP server URLs are reduced to <em><strong>scheme://host</strong></em> before being recorded, so embedded path-segment credentials cannot leak.</p></li></ul><p>The repo ships <strong>161 Go test functions</strong> across 23 test files, plus a CI matrix on Ubuntu and macOS that runs <em><strong>go vet</strong></em>, <em><strong>go test -race</strong></em>, a fresh build, <em><strong>bumblebee selftest</strong></em>, and <em><strong>govulncheck</strong></em>.</p><div><hr></div><h2><strong>How to get started</strong></h2><p>Three commands cover the smoke test on a clean machine:</p><p>bash</p><pre><code><code>go install github.com/perplexityai/bumblebee/cmd/bumblebee@v0.1.1
bumblebee selftest
bumblebee scan --profile baseline &gt; inventory.ndjson</code></code></pre><p>Requires Go 1.25+. Zero non-stdlib dependencies.</p><p><em><strong>bumblebee selftest</strong></em> extracts embedded fake-package fixtures to a tempdir, runs the scanner against an embedded exposure catalog, and asserts a fixed finding count. A non-zero exit means the local install can no longer detect what it should.</p><p>The baseline scan writes one NDJSON object per line to <em><strong>inventory.ndjson</strong></em> and diagnostics to stderr. The last line is a <em><strong>scan_summary</strong></em> record. Promote a snapshot into a downstream system only when that summary has <em><strong>status=complete</strong></em>. Partial or errored runs are evidence, not deletion signals.</p><p>The release binaries, the controlled <em><strong>left-pad</strong></em> fixture, the HTTP sink, and the <em><strong>threat_intel/</strong></em> catalog runs are all in <strong>the appendix at the end of this article.</strong></p><div><hr></div><h2><strong>AlphaSignal Take</strong></h2><p>Bumblebee is a sharp v0.1, but it ships with five rough edges worth naming.</p><p><strong>No Windows release.</strong> The <em><strong>v0.1.1</strong></em> assets are macOS and Linux only, amd64 and arm64. Issue #2 and PRs #4 and #16 cover default root discovery and full Windows support, none merged yet.</p><p><strong>No live <a href="http://osv.dev/">OSV.dev</a> source.</strong> Open issue #21 asks for it. Catalog matching today is limited to the eight files in <em><strong>threat_intel/</strong></em> (654 total entries) and whatever JSON the operator writes by hand.</p><p><strong>Exact-match only.</strong> Version-range solving is out of scope by design. The trade-off is real: upstream catalog quality determines accuracy. 
A wrong version string in a PR-submitted catalog produces wrong findings fleet-wide.</p><p><strong>NDJSON-first output.</strong> PR #24 adds opt-in terminal output, but main still expects <em><strong>jq</strong></em> for any human reading. Issue #22 is open on that too.</p><p><strong>One small documentation gap.</strong> The README still prints <em><strong>selftest OK (2 findings in 1ms)</strong></em>. The source in <em><strong>cmd/bumblebee/selftest.go</strong></em> asserts <em><strong>expectedSelftestFindings = 3</strong></em>. Tiny issue, but the kind of thing a responder notices the first time they actually run the binary.</p><p>The architecture itself is the bet. Operator-supplied catalogs scale faster than vendor-curated advisory feeds in the first hour of a worm wave, when the question is &#8220;which laptops have this exact version&#8221; and the answer needs to be in Slack before lunch.</p><div><hr></div><p><strong>Where do you draw the line on endpoint security: the SBOM, or the laptop itself?</strong></p><div><hr></div><h2><strong>Appendix: full how-to reference</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GxsL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d2e178-af52-4461-939d-a994464dddd8_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GxsL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d2e178-af52-4461-939d-a994464dddd8_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!GxsL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d2e178-af52-4461-939d-a994464dddd8_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!GxsL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d2e178-af52-4461-939d-a994464dddd8_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!GxsL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d2e178-af52-4461-939d-a994464dddd8_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GxsL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d2e178-af52-4461-939d-a994464dddd8_1672x941.png" width="1456" height="819"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/42d2e178-af52-4461-939d-a994464dddd8_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!GxsL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d2e178-af52-4461-939d-a994464dddd8_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!GxsL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d2e178-af52-4461-939d-a994464dddd8_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!GxsL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d2e178-af52-4461-939d-a994464dddd8_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!GxsL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d2e178-af52-4461-939d-a994464dddd8_1672x941.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3><strong>Install from the release tarball</strong></h3><p>macOS Apple Silicon:</p><p>bash</p><pre><code><code>VERSION=v0.1.1
curl -L -o checksums.txt "https://github.com/perplexityai/bumblebee/releases/download/${VERSION}/checksums.txt"
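# Keep the asset's release filename: the checksum check matches files by the names listed in checksums.txt.
curl -L -O "https://github.com/perplexityai/bumblebee/releases/download/${VERSION}/bumblebee_0.1.1_darwin_arm64.tar.gz"
shasum -a 256 -c checksums.txt --ignore-missing
tar -xzf bumblebee_0.1.1_darwin_arm64.tar.gz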
./bumblebee version</code></code></pre><p>Linux amd64:</p><p>bash</p><pre><code><code>VERSION=v0.1.1
curl -L -o checksums.txt "https://github.com/perplexityai/bumblebee/releases/download/${VERSION}/checksums.txt"
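# Keep the asset's release filename: the checksum check matches files by the names listed in checksums.txt.
curl -L -O "https://github.com/perplexityai/bumblebee/releases/download/${VERSION}/bumblebee_0.1.1_linux_amd64.tar.gz"
sha256sum -c checksums.txt --ignore-missing
tar -xzf bumblebee_0.1.1_linux_amd64.tar.gz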
./bumblebee version</code></code></pre><h3><strong>Preview what a profile will walk</strong></h3><p>bash</p><pre><code><code>bumblebee roots --profile baseline</code></code></pre><p>Prints <em><strong>&lt;root_kind&gt;\t&lt;path&gt;</strong></em> lines without walking anything.</p><h3><strong>Run a controlled project scan with a custom catalog</strong></h3><p>bash</p><pre><code><code>mkdir -p /tmp/bee-demo &amp;&amp; cd /tmp/bee-demo

cat &gt; package-lock.json &lt;&lt;'JSON'
{
  "name": "bee-demo",
  "lockfileVersion": 3,
  "packages": {
    "": { "dependencies": { "left-pad": "1.3.0" } },
    "node_modules/left-pad": {
      "version": "1.3.0",
      "resolved": "https://registry.npmjs.org/left-pad/-/left-pad-1.3.0.tgz"
    }
  }
}
JSON

cat &gt; exposure-catalog.json &lt;&lt;'JSON'
{
  "schema_version": "0.1.0",
  "entries": [
    {
      "id": "demo-left-pad-1.3.0",
      "name": "Demo left-pad exposure",
      "ecosystem": "npm",
      "package": "left-pad",
      "versions": ["1.3.0"],
      "severity": "low",
      "source": "local test fixture"
    }
  ]
}
JSON

bumblebee scan \
  --profile project \
  --root "$PWD" \
  --exposure-catalog exposure-catalog.json &gt; project.ndjson 2&gt; project.diag.ndjson
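
# Optional gate, added here as a sketch and not taken from the Bumblebee docs:
# treat the snapshot as promotable only when the last record is a scan_summary
# whose status is "complete". The .status field name is assumed from the
# status=complete wording; summary_complete is a hypothetical helper name.
summary_complete() {
  tail -n 1 "$1" | jq -e '.record_type == "scan_summary" and .status == "complete"'
}
if summary_complete project.ndjson; then echo "safe to promote"; fi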

jq 'select(.record_type=="finding")' project.ndjson</code></code></pre><p>Expected output: one <em><strong>package</strong></em> record, one <em><strong>finding</strong></em> record, one <em><strong>scan_summary</strong></em> with <em><strong>status=complete</strong></em>.</p><h3><strong>Suppress packages, keep findings</strong></h3><p>bash</p><pre><code><code>bumblebee scan \
  --profile project \
  --root "$PWD" \
  --exposure-catalog exposure-catalog.json \
  --findings-only &gt; findings.ndjson</code></code></pre><p><em><strong>--findings-only</strong></em> requires <em><strong>--exposure-catalog</strong></em>. It keeps <em><strong>finding</strong></em>, <em><strong>scan_summary</strong></em>, and diagnostics. It drops <em><strong>package</strong></em> records.</p><h3><strong>Scan against the shipped threat-intel catalogs</strong></h3><p>From a source checkout or the release archive:</p><p>bash</p><pre><code><code>bumblebee scan \
  --profile deep \
  --root "$HOME/code" \
  --exposure-catalog threat_intel \
  --findings-only \
  --max-duration 10m &gt; threat-findings.ndjson</code></code></pre><p>The <em><strong>threat_intel/</strong></em> directory contains 8 catalog files and 654 entries at the studied commit, covering Mini Shai-Hulud npm + PyPI, the AntV / Mini Shai-Hulud npm wave, GemStuffer RubyGems, Laravel Lang Packagist, <em><strong>node-ipc</strong></em> credential stealer, the <em><strong>nx-console</strong></em> VS Code compromise, <em><strong>shopsprint/decimal</strong></em> Go typosquat, and the TrapDoor crypto stealer.</p><h3><strong>Ship results over HTTP</strong></h3><p>bash</p><pre><code><code>export BUMBLEBEE_TOKEN="..."
export BUMBLEBEE_DEVICE_ID="laptop-001"

bumblebee scan \
  --profile baseline \
  --output http \
  --http-url https://inventory.example.com/v1/ingest \
  --http-auth bearer \
  --http-token-env BUMBLEBEE_TOKEN \
  --http-gzip \
  --device-id-env BUMBLEBEE_DEVICE_ID</code></code></pre><p>NDJSON body. <em><strong>Content-Type: application/x-ndjson</strong></em>. HTTPS required for non-loopback hosts. HMAC mode (<em><strong>X-Inventory-Signature: sha256=&lt;hex&gt;</strong></em>) is also available. Signature input is the raw post body, or <em><strong>&lt;timestamp&gt;.&lt;body&gt;</strong></em> when <em><strong>X-Inventory-Timestamp</strong></em> is set. Compression happens before HMAC signing.</p><h3><strong>Common error modes worth knowing</strong></h3><ul><li><p><em><strong>deep</strong></em> without <em><strong>--root</strong></em> is rejected.</p></li><li><p><em><strong>--findings-only</strong></em> without <em><strong>--exposure-catalog</strong></em> is rejected.</p></li><li><p><em><strong>--ecosystem cargo</strong></em> (and anything outside the eight supported values) is rejected.</p></li><li><p>Binary <em><strong>bun.lockb</strong></em> emits a diagnostic only. Text <em><strong>bun.lock</strong></em> is parsed.</p></li><li><p>macOS deep scans hitting TCC-protected paths produce diagnostics, not findings.</p></li></ul>]]></content:encoded></item><item><title><![CDATA[The Third Way to Adapt a Frontier Agent]]></title><description><![CDATA[Microsoft just trained an agent&#8217;s skill file like neural-network weights, with bounded edits, a held-out gate, and 52-of-52 wins across 6 benchmarks and 3 harnesses.]]></description><link>https://alphasignalai.substack.com/p/the-third-way-to-adapt-a-frontier</link><guid isPermaLink="false">https://alphasignalai.substack.com/p/the-third-way-to-adapt-a-frontier</guid><dc:creator><![CDATA[AlphaSignal AI]]></dc:creator><pubDate>Tue, 26 May 2026 17:07:12 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!o5aJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F144021b5-661f-4522-887a-ba612032a17e_1672x941.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div 
class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!o5aJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F144021b5-661f-4522-887a-ba612032a17e_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!o5aJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F144021b5-661f-4522-887a-ba612032a17e_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!o5aJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F144021b5-661f-4522-887a-ba612032a17e_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!o5aJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F144021b5-661f-4522-887a-ba612032a17e_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!o5aJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F144021b5-661f-4522-887a-ba612032a17e_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!o5aJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F144021b5-661f-4522-887a-ba612032a17e_1672x941.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/144021b5-661f-4522-887a-ba612032a17e_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1062923,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://alphasignalai.substack.com/i/199353808?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F144021b5-661f-4522-887a-ba612032a17e_1672x941.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!o5aJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F144021b5-661f-4522-887a-ba612032a17e_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!o5aJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F144021b5-661f-4522-887a-ba612032a17e_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!o5aJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F144021b5-661f-4522-887a-ba612032a17e_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!o5aJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F144021b5-661f-4522-887a-ba612032a17e_1672x941.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><blockquote><p><strong>In ~7 mins: the third way to adapt a frontier agent (after weights and prompts), the 5 optimizer controls that make text-space training actually behave like training, how to wire the pattern into your Claude Code, Codex, or CLAUDE.md skill file, and a full repo tutorial at the end if you want to run it yourself.</strong></p></blockquote><p>There are two known ways to adapt a frontier agent. Change the weights, or change the prompt / harness.</p><p>A new paper from <strong>Microsoft</strong> argues for a third. The agent&#8217;s skill file becomes the trainable artifact, edited by a separate optimizer model under bounded updates and a held-out gate.</p><p>The result is best or tied on 52 of 52 evaluated cells. GPT-5.5 lifts +23.5 points in direct chat, +24.8 inside Codex, +19.1 inside Claude Code.
The deployed artifact stays under 2,000 tokens.</p><div><hr></div><h2><strong>Paper</strong></h2><p><strong>SkillOpt</strong> is a May 2026 paper from Microsoft, Shanghai Jiao Tong University, Tongji University, and Fudan University. <strong>Yifan Yang </strong>at Microsoft leads the 15-author group. arXiv lists it as 2605.23904, submitted on May 22, with 27 pages, 4 figures, and 6 tables.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZAIX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbd7849d-905e-493a-bc4e-2a18f4434de8_984x950.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZAIX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbd7849d-905e-493a-bc4e-2a18f4434de8_984x950.png 424w, https://substackcdn.com/image/fetch/$s_!ZAIX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbd7849d-905e-493a-bc4e-2a18f4434de8_984x950.png 848w, https://substackcdn.com/image/fetch/$s_!ZAIX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbd7849d-905e-493a-bc4e-2a18f4434de8_984x950.png 1272w, https://substackcdn.com/image/fetch/$s_!ZAIX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbd7849d-905e-493a-bc4e-2a18f4434de8_984x950.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZAIX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbd7849d-905e-493a-bc4e-2a18f4434de8_984x950.png" width="984" height="950" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fbd7849d-905e-493a-bc4e-2a18f4434de8_984x950.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:950,&quot;width&quot;:984,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Article content&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Article content" title="Article content" srcset="https://substackcdn.com/image/fetch/$s_!ZAIX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbd7849d-905e-493a-bc4e-2a18f4434de8_984x950.png 424w, https://substackcdn.com/image/fetch/$s_!ZAIX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbd7849d-905e-493a-bc4e-2a18f4434de8_984x950.png 848w, https://substackcdn.com/image/fetch/$s_!ZAIX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbd7849d-905e-493a-bc4e-2a18f4434de8_984x950.png 1272w, https://substackcdn.com/image/fetch/$s_!ZAIX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbd7849d-905e-493a-bc4e-2a18f4434de8_984x950.png 1456w" sizes="100vw"></picture></div></a><figcaption class="image-caption"></figcaption></figure></div><p>Full title: &#8220;<strong>SkillOpt: Executive Strategy for Self-Evolving Agent Skills</strong>.&#8221; Hugging Face Papers ranked it #1 paper of the day at capture, with 140 upvotes.</p><p>Companion repo <em><strong>microsoft/SkillOpt</strong></em> is MIT-licensed Python 3.10+. The package is at version 0.1.0 with no public releases yet, 66 stars at capture on May 25. It ships configs for six benchmarks, the trainer, <em><strong>scripts/train.py</strong></em>, an <em><strong>eval_only.py</strong></em> entry point, and an optional Gradio dashboard.</p><p>Scope is what makes it worth seven minutes. SkillOpt trains one Markdown skill across 6 benchmarks (SearchQA, SpreadsheetBench, OfficeQA, DocVQA, LiveMathematicianBench, ALFWorld), 7 target models from GPT-5.5 down to Qwen3.5-4B, and 3 execution modes (direct chat, Codex, Claude Code).
Not every combination is run: of the 126 in the full grid, 52 cells are evaluated head-to-head against human-written, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill baselines. SkillOpt wins or ties every single one.</p><div><hr></div><h2><strong>The trainable-skill idea</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9oQ2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8811d811-3b12-40ba-a680-b24b7e1d4e03_1488x837.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9oQ2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8811d811-3b12-40ba-a680-b24b7e1d4e03_1488x837.png 424w, https://substackcdn.com/image/fetch/$s_!9oQ2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8811d811-3b12-40ba-a680-b24b7e1d4e03_1488x837.png 848w, https://substackcdn.com/image/fetch/$s_!9oQ2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8811d811-3b12-40ba-a680-b24b7e1d4e03_1488x837.png 1272w, https://substackcdn.com/image/fetch/$s_!9oQ2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8811d811-3b12-40ba-a680-b24b7e1d4e03_1488x837.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9oQ2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8811d811-3b12-40ba-a680-b24b7e1d4e03_1488x837.png" width="1456" height="819"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8811d811-3b12-40ba-a680-b24b7e1d4e03_1488x837.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Article content&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Article content" title="Article content" srcset="https://substackcdn.com/image/fetch/$s_!9oQ2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8811d811-3b12-40ba-a680-b24b7e1d4e03_1488x837.png 424w, https://substackcdn.com/image/fetch/$s_!9oQ2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8811d811-3b12-40ba-a680-b24b7e1d4e03_1488x837.png 848w, https://substackcdn.com/image/fetch/$s_!9oQ2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8811d811-3b12-40ba-a680-b24b7e1d4e03_1488x837.png 1272w, https://substackcdn.com/image/fetch/$s_!9oQ2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8811d811-3b12-40ba-a680-b24b7e1d4e03_1488x837.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"></figcaption></figure></div><p>A skill, in the paper&#8217;s sense, is a Markdown file prepended to the agent&#8217;s context. It holds procedural rules: how to use tools, what to verify, how to format an answer, what failure mode to avoid. Frontier models do not hold this kind of domain procedure in weights.</p><p>SkillOpt treats that file as the <em>external state</em> of a frozen target model. The target never moves. A separate optimizer model edits the skill from scored rollouts. The deployed artifact is one <em><strong>best_skill.md</strong></em> between 379 and 1,995 tokens. No optimizer calls at deployment.</p><p>The deep-learning analogy is operational, not decorative. Rollouts are the forward pass. Reflection batches over success and failure trajectories are the backward pass. An actual edit budget plays the role of a learning rate.
An actual held-out validation gate plays the role of validation. The epoch-wise slow update plays the role of momentum.</p><div><hr></div><h2><strong>5 controls that make text-space training work</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uX_9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabad1f50-75f8-4673-b540-b960b5f3c153_1488x837.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uX_9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabad1f50-75f8-4673-b540-b960b5f3c153_1488x837.png 424w, https://substackcdn.com/image/fetch/$s_!uX_9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabad1f50-75f8-4673-b540-b960b5f3c153_1488x837.png 848w, https://substackcdn.com/image/fetch/$s_!uX_9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabad1f50-75f8-4673-b540-b960b5f3c153_1488x837.png 1272w, https://substackcdn.com/image/fetch/$s_!uX_9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabad1f50-75f8-4673-b540-b960b5f3c153_1488x837.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uX_9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabad1f50-75f8-4673-b540-b960b5f3c153_1488x837.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/abad1f50-75f8-4673-b540-b960b5f3c153_1488x837.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Article content&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Article content" title="Article content" srcset="https://substackcdn.com/image/fetch/$s_!uX_9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabad1f50-75f8-4673-b540-b960b5f3c153_1488x837.png 424w, https://substackcdn.com/image/fetch/$s_!uX_9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabad1f50-75f8-4673-b540-b960b5f3c153_1488x837.png 848w, https://substackcdn.com/image/fetch/$s_!uX_9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabad1f50-75f8-4673-b540-b960b5f3c153_1488x837.png 1272w, https://substackcdn.com/image/fetch/$s_!uX_9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabad1f50-75f8-4673-b540-b960b5f3c153_1488x837.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"></figcaption></figure></div>
<h3><strong>Control 1: Bounded text updates (the textual learning rate)</strong></h3><p>Every reflection pass ends with a top-N cut. The default edit budget is 4, with cosine decay to a floor of 2. The repo also exposes constant, linear, and autonomous schedules. Without a cap, the loop is ad hoc prompt rewriting.</p><p>Removing the budget drops SearchQA / SpreadsheetBench / LiveMath to 84.6 / 75.7 / 57.3, against the default 87.1 / 77.5 / 61.3. The paper calls this control the textual learning rate.</p><p>Bounded edits keep adjacent skill versions close enough that the next optimizer call can still learn from the last one. Unbounded rewrites break the optimization history before any of the later controls get a chance to use it.</p><h3><strong>Control 2: The validation gate</strong></h3><p>A candidate skill is accepted only if its held-out selection score is <em>strictly greater than</em> the current best.
Ties are rejected. The selection split is used only for accept-or-reject decisions. The test split is reported separately.</p><p>That is what keeps reflection from becoming drift. Across 6 benchmarks, only 1 to 4 edits per skill survive into the deployed artifact. The optimizer proposes far more. LiveMathematicianBench&#8217;s +29.3-point gain comes from a single accepted edit. OfficeQA&#8217;s +39.0-point gain also comes from one accepted edit.</p><p>The bulk of the optimizer&#8217;s text-space search gets rejected. The deployed skill is the small set of changes that actually moved a held-out number.</p><h3><strong>Control 3: The rejected-edit buffer</strong></h3><p>Edits that fail the gate are not discarded. They enter an epoch-local memory, along with the score drop they caused, and the optimizer reads that memory before proposing the next batch.</p><p>That gives the loop negative feedback during training without adding any inference-time model calls. Removing the buffer drops SearchQA / SpreadsheetBench / LiveMath by 1.6 / 4.6 / 2.4 points in the matched ablation row.</p><p>The optimizer learns not to repeat a harmful edit, the way a fine-tuned model learns not to repeat a low-reward output. The difference is that this memory is plain text and lives only for the current epoch.</p><h3><strong>Control 4: Slow and meta updates</strong></h3><p>At each epoch boundary, the optimizer runs the same sampled training items under the previous-epoch skill and the current-epoch skill. Outcomes are grouped into four categories: improved, regressed, persistent failure, and stable success.</p><p>A concise longitudinal guidance block then goes into a protected region of the skill file. Step-level edits cannot overwrite that region.</p><p>The meta skill is separate. It is optimizer-side memory of which edit patterns helped or hurt across epochs, and it is prepended to future optimizer prompts.
It does not ship with <em><strong>best_skill.md</strong></em>.</p><p>Removing both the slow update and the meta skill drops SpreadsheetBench from 77.5 to 55.0. That 22.5-point drop is the largest single ablation effect in the paper.</p><h3><strong>Control 5: Harness-agnostic adapter</strong></h3><p>One Markdown file deploys across three harnesses. Direct chat, the Codex CLI in a workspace-write sandbox, and the Claude Code CLI all read the same <em><strong>best_skill.md</strong></em>. The adapter contract is small: build batches, inject the skill, run the native execution loop, return scored trajectories.</p><p>Transfer numbers carry the claim. A SpreadsheetBench skill trained inside Codex adds 59.7 points when deployed inside Claude Code (22.1 &#8594; 81.8), slightly exceeding the in-domain Claude Code SkillOpt reference. In the reverse direction, the gain inside Codex is 43.6 points (27.5 &#8594; 71.1).</p><p>The trained skill is a portable artifact, not a harness-specific command recipe.
Training cost amortizes across deployment surfaces.</p><div><hr></div><h2><strong>8-stage loop, mapped to your stack</strong></h2><p>The loop has eight stages. Most run on every step, while the slow update runs only at epoch end.</p><blockquote><p>CLI commands and the install-to-deploy walkthrough are in the appendix at the end.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hkuM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9616496-d059-45ef-9922-115c58e4b351_1488x837.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hkuM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9616496-d059-45ef-9922-115c58e4b351_1488x837.png 424w, https://substackcdn.com/image/fetch/$s_!hkuM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9616496-d059-45ef-9922-115c58e4b351_1488x837.png 848w, https://substackcdn.com/image/fetch/$s_!hkuM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9616496-d059-45ef-9922-115c58e4b351_1488x837.png 1272w, https://substackcdn.com/image/fetch/$s_!hkuM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9616496-d059-45ef-9922-115c58e4b351_1488x837.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hkuM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9616496-d059-45ef-9922-115c58e4b351_1488x837.png" width="1456" height="819"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f9616496-d059-45ef-9922-115c58e4b351_1488x837.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Article content&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Article content" title="Article content" srcset="https://substackcdn.com/image/fetch/$s_!hkuM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9616496-d059-45ef-9922-115c58e4b351_1488x837.png 424w, https://substackcdn.com/image/fetch/$s_!hkuM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9616496-d059-45ef-9922-115c58e4b351_1488x837.png 848w, https://substackcdn.com/image/fetch/$s_!hkuM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9616496-d059-45ef-9922-115c58e4b351_1488x837.png 1272w, https://substackcdn.com/image/fetch/$s_!hkuM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9616496-d059-45ef-9922-115c58e4b351_1488x837.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"></figcaption></figure></div>
<ol><li><p><strong>Rollout</strong>: the frozen target model runs a batch from the training split with the current skill.</p></li><li><p><strong>Reflect</strong>: the optimizer model splits the batch into failure and success minibatches and returns structured add, delete, and replace edits.</p></li><li><p><strong>Aggregate</strong>: similar edits merge hierarchically, with failure-driven patches prioritized.</p></li><li><p><strong>Select</strong>: the optimizer ranks the edits and keeps only the top ones that fit within the edit budget.</p></li><li><p><strong>Update</strong>: the selected edits are applied, producing a candidate skill.</p></li><li><p><strong>Gate</strong>: the candidate runs on the held-out selection split.
A score strictly greater than the current best is the only accept condition.</p></li><li><p><strong>Slow update</strong>: at epoch end, the optimizer compares same-task outcomes under last-epoch and current-epoch skills, then writes longitudinal guidance into a protected region.</p></li><li><p><strong>Meta skill</strong>: optimizer-side memory of accepted and rejected patterns is prepended to future optimizer calls. It never ships with the deployed skill.</p></li></ol><h3><strong>What the learned skill actually says</strong></h3><p>Figure 4 of the paper reproduces one verbatim rule per benchmark from the final <em><strong>best_skill.md</strong></em> of each case study. Two of them carry the flavor.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Sz_p!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64aeee51-7603-4ad3-adda-49d5de8031df_1525x528.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Sz_p!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64aeee51-7603-4ad3-adda-49d5de8031df_1525x528.png 424w, https://substackcdn.com/image/fetch/$s_!Sz_p!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64aeee51-7603-4ad3-adda-49d5de8031df_1525x528.png 848w, https://substackcdn.com/image/fetch/$s_!Sz_p!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64aeee51-7603-4ad3-adda-49d5de8031df_1525x528.png 1272w,
https://substackcdn.com/image/fetch/$s_!Sz_p!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64aeee51-7603-4ad3-adda-49d5de8031df_1525x528.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Sz_p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64aeee51-7603-4ad3-adda-49d5de8031df_1525x528.png" width="1456" height="504" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/64aeee51-7603-4ad3-adda-49d5de8031df_1525x528.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:504,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Article content&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Article content" title="Article content" srcset="https://substackcdn.com/image/fetch/$s_!Sz_p!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64aeee51-7603-4ad3-adda-49d5de8031df_1525x528.png 424w, https://substackcdn.com/image/fetch/$s_!Sz_p!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64aeee51-7603-4ad3-adda-49d5de8031df_1525x528.png 848w, https://substackcdn.com/image/fetch/$s_!Sz_p!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64aeee51-7603-4ad3-adda-49d5de8031df_1525x528.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Sz_p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64aeee51-7603-4ad3-adda-49d5de8031df_1525x528.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"></figcaption></figure></div><p>SpreadsheetBench, after four accepted edits: <em>&#8220;Inspect workbook structure and formulas, then write evaluated static values across the full requested target range instead of relying on Excel recalculation.&#8221;</em></p><p>DocVQA, after three accepted edits: <em>&#8220;For tables, forms, charts, and
legends, first bind the question to the exact visual row/header/field, then copy only the aligned answer span.&#8221;</em></p><p>Two patterns stand out. The rules are procedural rather than instance-specific (no question, file, or entity is named). They also encode discipline that frontier models lack zero-shot: workbook-structure-first reasoning, evidence-to-visual binding, answer-format constraints. A practitioner could write rules like these by hand after a day with the benchmark. SkillOpt produces them automatically and validates each one against a held-out split.</p><h3><strong>Map this into your stack</strong></h3><p>The deployed artifact is one Markdown file. That maps cleanly to wherever your agent already loads procedural state.</p><ul><li><p><strong>Claude Code</strong>: drop the trained skill into <em><strong>~/.claude/skills/</strong></em> and load it on session start.</p></li><li><p><strong>Codex / OpenAI Agents</strong>: render to a per-task <em><strong>SKILL.md</strong></em> or <em><strong>AGENTS.md</strong></em>. The paper&#8217;s Codex adapter already uses this contract.</p></li><li><p><strong>Generic harnesses</strong>: <em><strong>CLAUDE.md</strong></em>, <em><strong>.cursorrules</strong></em>, or the system-prompt slot of any agent.</p></li><li><p><strong>Hermes-style persistent runtimes</strong>: a skill-folder entry. The exported artifact is harness-agnostic Markdown by design.</p></li></ul><p>Running the loop on your own task needs three things:</p><ol><li><p>A task family with measurable success: exact match, an executable check, or a verifier you trust.</p></li><li><p>Separate train, selection, and test splits. The repo does not ship datasets, so this is on you.</p></li><li><p>A target model (the agent you ship) and an optimizer model.
The paper defaults both to GPT-5.5 and shows that a target-matched optimizer still recovers 56% to 74% of the strong-optimizer gain.</p></li></ol><p>The optimizer runs offline. Deployment uses only the final skill file, with no extra model calls at inference.</p><blockquote><p>Full install-to-deploy commands are in the appendix below.</p></blockquote><div><hr></div><h2><strong>Results</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pIbt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0db8a07b-f52a-46ba-ad03-e1e536ff32e5_1672x809.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pIbt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0db8a07b-f52a-46ba-ad03-e1e536ff32e5_1672x809.png 424w, https://substackcdn.com/image/fetch/$s_!pIbt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0db8a07b-f52a-46ba-ad03-e1e536ff32e5_1672x809.png 848w, https://substackcdn.com/image/fetch/$s_!pIbt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0db8a07b-f52a-46ba-ad03-e1e536ff32e5_1672x809.png 1272w, https://substackcdn.com/image/fetch/$s_!pIbt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0db8a07b-f52a-46ba-ad03-e1e536ff32e5_1672x809.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pIbt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0db8a07b-f52a-46ba-ad03-e1e536ff32e5_1672x809.png" width="1456" height="704"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0db8a07b-f52a-46ba-ad03-e1e536ff32e5_1672x809.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:704,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Article content&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Article content" title="Article content" srcset="https://substackcdn.com/image/fetch/$s_!pIbt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0db8a07b-f52a-46ba-ad03-e1e536ff32e5_1672x809.png 424w, https://substackcdn.com/image/fetch/$s_!pIbt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0db8a07b-f52a-46ba-ad03-e1e536ff32e5_1672x809.png 848w, https://substackcdn.com/image/fetch/$s_!pIbt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0db8a07b-f52a-46ba-ad03-e1e536ff32e5_1672x809.png 1272w, https://substackcdn.com/image/fetch/$s_!pIbt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0db8a07b-f52a-46ba-ad03-e1e536ff32e5_1672x809.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"></figcaption></figure></div>
<div><hr></div><h3><strong>The AlphaSignal Take</strong></h3><p>Three open problems sit underneath the headline numbers.</p><p><strong>Training cost is real.</strong> Cost per absolute test-point gain runs from 0.6M training tokens (SpreadsheetBench) to 46.4M (DocVQA). Total training token spend per benchmark in the case studies ranges from 20.8M (OfficeQA) to 213.8M (SearchQA). A team has to budget for the offline run before assuming it pays off.</p><p><strong>The loop needs scored tasks.</strong> SkillOpt is an optimizer that depends on a verifier. The gate compares numbers from a held-out split. Open-ended creative work, strategy documents, and design judgment give the gate nothing to compare unless a preference model is layered in, and the paper does not provide one.</p><p><strong>The repo does not ship datasets.</strong> Readers bring their own train, selection, and test splits and their own credentials.
The package is at version 0.1.0 with no public releases. The fastest setup path is SearchQA, and even that needs a local split and an Azure OpenAI, OpenAI, or Anthropic key. This is research code, not a turnkey product.</p><p>What is actually new is the reframe. The procedure an agent follows becomes a trainable, inspectable text artifact. Not weights. Not a static prompt. Something in the middle, with a versioned history, an audit trail of accepted and rejected edits, and a 379-to-1,995-token deployment footprint.</p><p>That is the part of the announcement worth carrying into your own stack, even before you train a skill yourself. The paper points to two next moves: <strong>skill libraries that share infrastructure across domains</strong>, and self-distillation of trained skills back into target-model weights. Both assume the skill itself is the object being optimized, not a byproduct of prompting.</p><div><hr></div><p>Which part of your current agent stack would you train as text first: your <strong>CLAUDE.md</strong>, your skill folder, or your <strong>AGENTS.md</strong>?</p>
<div><hr></div><h3><strong>Appendix: How to actually run SkillOpt</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!o66K!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc82febe1-bf40-45a0-85cc-ddf340ad3d7a_1488x837.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!o66K!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc82febe1-bf40-45a0-85cc-ddf340ad3d7a_1488x837.png 424w, https://substackcdn.com/image/fetch/$s_!o66K!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc82febe1-bf40-45a0-85cc-ddf340ad3d7a_1488x837.png 848w, https://substackcdn.com/image/fetch/$s_!o66K!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc82febe1-bf40-45a0-85cc-ddf340ad3d7a_1488x837.png 1272w, https://substackcdn.com/image/fetch/$s_!o66K!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc82febe1-bf40-45a0-85cc-ddf340ad3d7a_1488x837.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!o66K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc82febe1-bf40-45a0-85cc-ddf340ad3d7a_1488x837.png" width="1456" height="819"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c82febe1-bf40-45a0-85cc-ddf340ad3d7a_1488x837.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Article content&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Article content" title="Article content" srcset="https://substackcdn.com/image/fetch/$s_!o66K!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc82febe1-bf40-45a0-85cc-ddf340ad3d7a_1488x837.png 424w, https://substackcdn.com/image/fetch/$s_!o66K!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc82febe1-bf40-45a0-85cc-ddf340ad3d7a_1488x837.png 848w, https://substackcdn.com/image/fetch/$s_!o66K!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc82febe1-bf40-45a0-85cc-ddf340ad3d7a_1488x837.png 1272w, https://substackcdn.com/image/fetch/$s_!o66K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc82febe1-bf40-45a0-85cc-ddf340ad3d7a_1488x837.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"></figcaption></figure></div>
<p>A condensed install-to-deploy walk. Assumes Python 3.10+ and one model backend. Azure OpenAI is the paper&#8217;s default.</p><h3><strong>1. Install</strong></h3><pre><code><code>git clone https://github.com/microsoft/SkillOpt.git
cd SkillOpt
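# Optional, and an assumption rather than a step from the repo: isolate the install in a virtual environment (POSIX shell, python3 on PATH).
python3 -m venv .venv
. .venv/bin/activate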
pip install -e .</code></code></pre><p>The ALFWorld benchmark needs an extra step: run <em><strong>pip install -e ".[alfworld]"</strong></em>, then <em><strong>alfworld-download</strong></em>.</p><h3><strong>2. Set credentials</strong></h3><p>Azure OpenAI (default):</p><pre><code><code>export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
export AZURE_OPENAI_API_KEY="your-key"</code></code></pre><p>Or set <em><strong>AZURE_OPENAI_AUTH_MODE=azure_cli</strong></em> to skip the key and use Azure CLI auth.</p><p>Other backends are also supported: <em><strong>OPENAI_API_KEY</strong></em> for OpenAI direct, <em><strong>ANTHROPIC_API_KEY</strong></em> for Claude, and <em><strong>QWEN_CHAT_BASE_URL</strong></em> + <em><strong>QWEN_CHAT_MODEL</strong></em> for Qwen via local vLLM. The repo&#8217;s <em><strong>.env.example</strong></em> lists all four.</p><h3><strong>3. Prepare data</strong></h3><p>SkillOpt expects this directory layout:</p><pre><code><code>data/my_split/
  train/items.json
  val/items.json
  test/items.json</code></code></pre><p>Each <em><strong>items.json</strong></em> is a JSON array of task items. The schema depends on the benchmark. SearchQA items need <em><strong>id</strong></em>, <em><strong>question</strong></em>, <em><strong>context</strong></em>, and <em><strong>answers</strong></em>.</p><p>Configs ship for six benchmarks: SearchQA, ALFWorld, DocVQA, LiveMathematicianBench, SpreadsheetBench, OfficeQA. The repo does not ship the datasets, so this step is on you. SearchQA is the fastest setup path.</p><h3><strong>4. Train</strong></h3><p>Minimal command (SearchQA, GPT-5.5 as both target and optimizer):</p><pre><code><code>python scripts/train.py \
  --config configs/searchqa/default.yaml \
  --split_dir /path/to/searchqa_split \
  --azure_openai_endpoint https://your-resource.openai.azure.com/ \
  --optimizer_model gpt-5.5 \
  --target_model gpt-5.5</code></code></pre><p>Defaults from <em><strong>configs/_base_/default.yaml</strong></em>: 4 epochs, batch size 40, reflection minibatch 8, edit budget 4 with cosine decay to floor 2, slow update on with 20 samples per epoch, meta skill on, validation gate strict-greater.</p><p>CLI override flags: <em><strong>--num_epochs</strong></em>, <em><strong>--batch_size</strong></em>, <em><strong>--workers</strong></em>, <em><strong>--out_root</strong></em>.</p><h3><strong>5. Output</strong></h3><p>Each run writes to <em><strong>outputs/&lt;run_name&gt;/</strong></em>:</p><ul><li><p><em><strong>best_skill.md</strong></em>: the deployable skill.</p></li><li><p><em><strong>history.json</strong></em>: per-step training log.</p></li><li><p><em><strong>skills/skill_vXXXX.md</strong></em>: skill snapshot per step.</p></li><li><p><em><strong>steps/step_XXXX/</strong></em>: patches, gate evals, edit-apply reports.</p></li><li><p><em><strong>slow_update/epoch_XX/</strong></em> and <em><strong>meta_skill/epoch_XX/</strong></em>: epoch-end logs.</p></li></ul><p>Re-running the same command auto-resumes from the last completed step.</p><h3><strong>6. Evaluate</strong></h3><p>Score a trained skill on any split without retraining:</p><pre><code><code>python scripts/eval_only.py \
  --config configs/searchqa/default.yaml \
  --skill outputs/my_run/best_skill.md \
  --split valid_unseen \
  --split_dir /path/to/searchqa_split \
  --azure_openai_endpoint https://your-resource.openai.azure.com/</code></code></pre><p>Valid <em><strong>--split</strong></em> values: <em><strong>valid_unseen</strong></em> (test), <em><strong>valid_seen</strong></em> (val), <em><strong>train</strong></em>, <em><strong>all</strong></em>.</p><h3><strong>7. Deploy</strong></h3><p><em><strong>best_skill.md</strong></em> is plain Markdown. No optimizer calls at inference. Drop it where your agent reads procedural state:</p><ul><li><p>Claude Code: <em><strong>~/.claude/skills/</strong></em></p></li><li><p>Codex / OpenAI Agents: <em><strong>AGENTS.md</strong></em> or per-task <em><strong>SKILL.md</strong></em></p></li><li><p>Generic: <em><strong>CLAUDE.md</strong></em> or the system-prompt slot</p></li><li><p>Hermes-style persistent runtime: a skill-folder entry</p></li></ul><h3><strong>8. Optional: WebUI monitor</strong></h3><pre><code><code>pip install -e ".[webui]"
python -m skillopt_webui.app</code></code></pre><p>Default port 7860. Add <em><strong>--share</strong></em> for a public Gradio link.</p>]]></content:encoded></item><item><title><![CDATA[The 5 Principles Every AI Research Stack Now Has to Solve]]></title><description><![CDATA[The first survey covering all 4 phases of AI in academic research, the 5 principles it lands on, and a stage-by-stage map of what&#8217;s safe to automate.]]></description><link>https://alphasignalai.substack.com/p/the-5-principles-every-ai-research</link><guid isPermaLink="false">https://alphasignalai.substack.com/p/the-5-principles-every-ai-research</guid><dc:creator><![CDATA[AlphaSignal AI]]></dc:creator><pubDate>Mon, 25 May 2026 16:17:04 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!nmRG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faac2b6e4-1066-4e80-bfd4-cae2573caaa3_1280x719.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nmRG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faac2b6e4-1066-4e80-bfd4-cae2573caaa3_1280x719.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nmRG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faac2b6e4-1066-4e80-bfd4-cae2573caaa3_1280x719.png 424w, https://substackcdn.com/image/fetch/$s_!nmRG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faac2b6e4-1066-4e80-bfd4-cae2573caaa3_1280x719.png 848w, 
https://substackcdn.com/image/fetch/$s_!nmRG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faac2b6e4-1066-4e80-bfd4-cae2573caaa3_1280x719.png 1272w, https://substackcdn.com/image/fetch/$s_!nmRG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faac2b6e4-1066-4e80-bfd4-cae2573caaa3_1280x719.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nmRG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faac2b6e4-1066-4e80-bfd4-cae2573caaa3_1280x719.png" width="1280" height="719" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aac2b6e4-1066-4e80-bfd4-cae2573caaa3_1280x719.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:719,&quot;width&quot;:1280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nmRG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faac2b6e4-1066-4e80-bfd4-cae2573caaa3_1280x719.png 424w, https://substackcdn.com/image/fetch/$s_!nmRG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faac2b6e4-1066-4e80-bfd4-cae2573caaa3_1280x719.png 848w, 
https://substackcdn.com/image/fetch/$s_!nmRG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faac2b6e4-1066-4e80-bfd4-cae2573caaa3_1280x719.png 1272w, https://substackcdn.com/image/fetch/$s_!nmRG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faac2b6e4-1066-4e80-bfd4-cae2573caaa3_1280x719.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><div><hr></div><blockquote><p><strong>In ~7 mins: the 5 principles a 20-author survey of 250+ tools just landed on, an 8-stage map 
of where AI helps and where it breaks, and 6 rules for using it without losing scientific accountability.</strong></p><p><strong>This article was originally posted on X last Fri (22 May).</strong></p></blockquote><p>The AI Scientist generates a complete research paper for $15.</p><p>FARS ran for 228 hours, burned 11.4 billion tokens, and shipped 100 papers.</p><p>When the same systems run fully autonomous, 80% of their reported results are fabricated.</p><p>The bottleneck moved. It is no longer generation. It is verification, provenance, and human handoffs across the research lifecycle.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://alphasignalai.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading AlphaSignal! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2><strong>What happened</strong></h2><p>Twenty researchers just published the first end-to-end survey of AI across the complete academic research lifecycle.</p><p>The paper is authored by <strong>Lingdong Kong</strong> (NUS / Apple) with 19 co-authors and titled &#8220;<strong>AI for Auto-Research: Roadmap &amp; User Guide</strong>.&#8221; It was posted to arXiv on May 18, 2026 and covers developments through April 2026.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!pUCW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66680dfb-ad92-470c-9443-477ca520b5f7_904x1000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pUCW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66680dfb-ad92-470c-9443-477ca520b5f7_904x1000.png 424w, https://substackcdn.com/image/fetch/$s_!pUCW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66680dfb-ad92-470c-9443-477ca520b5f7_904x1000.png 848w, https://substackcdn.com/image/fetch/$s_!pUCW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66680dfb-ad92-470c-9443-477ca520b5f7_904x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!pUCW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66680dfb-ad92-470c-9443-477ca520b5f7_904x1000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pUCW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66680dfb-ad92-470c-9443-477ca520b5f7_904x1000.png" width="904" height="1000" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/66680dfb-ad92-470c-9443-477ca520b5f7_904x1000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1000,&quot;width&quot;:904,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Article 
content&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Article content" title="Article content" srcset="https://substackcdn.com/image/fetch/$s_!pUCW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66680dfb-ad92-470c-9443-477ca520b5f7_904x1000.png 424w, https://substackcdn.com/image/fetch/$s_!pUCW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66680dfb-ad92-470c-9443-477ca520b5f7_904x1000.png 848w, https://substackcdn.com/image/fetch/$s_!pUCW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66680dfb-ad92-470c-9443-477ca520b5f7_904x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!pUCW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66680dfb-ad92-470c-9443-477ca520b5f7_904x1000.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"></figcaption></figure></div><p>The scope is unusual. Most existing surveys cover one stage: literature review, or coding agents, or paper writing. This one maps 250+ tools, 52 benchmarks, and 33 end-to-end systems into a single framework of 4 phases and 8 stages.</p><p>The companion repo, <em><strong>worldbench/awesome-ai-auto-research</strong></em>, is MIT-licensed and had more than 100 stars and 8 forks as of May 22.</p><p>The framework breaks the lifecycle into Creation (ideation, literature, coding, figures), Writing (manuscript), Validation (peer review, rebuttal), and Dissemination (Paper2X). That structure carries the whole argument.</p><div><hr></div><h2><strong>Why this paper</strong></h2><p>The cost-to-quality numbers refuse to scale.</p><p>AI Scientist v2 scores 6.33 on an ICLR 1&#8211;10 scale at $25 per paper. FARS, running roughly $1,000 per paper, scores 5.05. The ICLR acceptance threshold is 5.69.</p><p>The cheaper system is past the line. The 40x more expensive one is below it.</p><p>Pattern-matching benchmarks overstate scientific coding too. Frontier systems clear 76%+ on SWE-bench Verified. The same systems top out at 37&#8211;39% on ResearchCodeBench, where the task is to implement a method described in a paper. 
The semantic-error rate there is 58.6%.</p><p>The validation numbers are worse. In an LLM-reviewer benchmark, 95.8% of rejected papers were misclassified as acceptable. In MLR-Bench&#8217;s fully-autonomous track, 80% of reported results turned out to be fabricated.</p><p>The paper&#8217;s read is that the field stopped being capability-limited and became reliability-limited. That reframe is what makes the survey worth reading.</p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NY7g!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60cd1041-9348-4b8f-a34b-4e580c46a780_1488x837.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NY7g!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60cd1041-9348-4b8f-a34b-4e580c46a780_1488x837.png 424w, https://substackcdn.com/image/fetch/$s_!NY7g!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60cd1041-9348-4b8f-a34b-4e580c46a780_1488x837.png 848w, https://substackcdn.com/image/fetch/$s_!NY7g!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60cd1041-9348-4b8f-a34b-4e580c46a780_1488x837.png 1272w, https://substackcdn.com/image/fetch/$s_!NY7g!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60cd1041-9348-4b8f-a34b-4e580c46a780_1488x837.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!NY7g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60cd1041-9348-4b8f-a34b-4e580c46a780_1488x837.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/60cd1041-9348-4b8f-a34b-4e580c46a780_1488x837.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Article content&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Article content" title="Article content" srcset="https://substackcdn.com/image/fetch/$s_!NY7g!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60cd1041-9348-4b8f-a34b-4e580c46a780_1488x837.png 424w, https://substackcdn.com/image/fetch/$s_!NY7g!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60cd1041-9348-4b8f-a34b-4e580c46a780_1488x837.png 848w, https://substackcdn.com/image/fetch/$s_!NY7g!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60cd1041-9348-4b8f-a34b-4e580c46a780_1488x837.png 1272w, https://substackcdn.com/image/fetch/$s_!NY7g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60cd1041-9348-4b8f-a34b-4e580c46a780_1488x837.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"></figcaption></figure></div><h3><strong>Principle 1: Structured tasks work. Open-ended judgment does not.</strong></h3><p>AI is reliable when the task is structured, grounded in retrievable evidence, and externally checkable.</p><p>SWE-bench Verified at 76%+ vs. ResearchCodeBench at 37&#8211;39% is the cleanest illustration. One measures bug fixes against known passing tests. The other measures whether the model implemented the algorithm the paper actually described.</p><p>Same models. Different ceiling.</p><p>This holds across stages. Retrieval, citation candidates, plot drafts, grammar polishing, format conversion: solid. Novelty assessment, decisive experiment design, long-horizon reasoning, contribution framing: fragile.</p><h3><strong>Principle 2: Generation outpaces verification at every stage.</strong></h3><p>This is the paper&#8217;s central tension. AI produces research-shaped artifacts faster than it can prove they are correct.</p><p>Ideas look novel on the page and weaken after a single implementation attempt. Code runs cleanly while implementing a different algorithm than the paper described. Figures look publication-ready while distorting axes or dropping baselines.</p><p>Reviews come back coherent and lenient. Rebuttals read as persuasive and promise experiments that authors never run. Dissemination artifacts simplify results past the evidence the paper actually provides.</p><p>The risk is not that the artifacts are useless. 
It is that they get treated as validated because they look complete.</p><h3><strong>Principle 3: Human-governed collaboration beats full autonomy.</strong></h3><p>The strongest empirical result in the paper comes from the ICLR 2025 randomized study of AI in peer review. Across 22,467 reviews, 89% improved in quality when an LLM gave feedback on a human reviewer&#8217;s draft.</p><p>Hand the same family of models a paper to review alone, and 95.8% of rejected papers come back misclassified as acceptable.</p><p>Assist mode lifts quality. Replace mode breaks it.</p><p>That asymmetry shows up everywhere the paper looks for it: writing, review, rebuttal, dissemination. AI augments researchers reliably. As the reviewer of record, the same model fails reliably.</p><h3><strong>Principle 4: Working systems converge on three layers: explore, execute, verify.</strong></h3><p>Systems that produce credible work all combine the same three layers, regardless of branding.</p><p><strong>Exploration</strong> searches over hypotheses, code variants, or response strategies through MCTS, evolutionary methods, or branching agents. <strong>Execution</strong> drives external tools: code interpreters, retrieval engines, experiment runners, plotters, document editors. <strong>Verification</strong> checks intermediate outputs through execution feedback, citation validation, adversarial critique, or human review.</p><p>Stacks built on &#8220;more agents = better&#8221; lose on sequential reasoning. Google and MIT scaling work cited in the paper finds an empirical sweet spot at 3 to 4 coordinated agents. Bigger swarms accumulate communication overhead faster than they gain critique quality.</p><h3><strong>Principle 5: AI in research is a governance problem, not a detection problem.</strong></h3><p>Corpus studies estimate detectable AI modification in 17.5% of computer science abstracts and 13.5% of biomedical abstracts. 
Self-reported usage runs higher.</p><p>Detection-based enforcement does not scale. AI text detectors false-positive on formal academic prose and non-native writing. Watermarking depends on provider cooperation and breaks under paraphrase and translation.</p><p>What stays durable is a different set of questions. Which forms of AI use must be disclosed? Who is accountable when an AI-generated citation is fabricated, or an AI-drafted rebuttal commits to an experiment that never runs?</p><p>Policy has to follow disclosure and accountability, not detection. The paper lands there.</p><div><hr></div><h2><strong>The 8 stages, mapped to what&#8217;s safe to automate</strong></h2><p>The lifecycle compresses into a map. Each stage has work AI does well, work that needs human inspection, and work that should not be delegated yet.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ng87!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0dc6a46-b294-4bb7-9197-8ba3c7e8e7a8_1488x837.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ng87!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0dc6a46-b294-4bb7-9197-8ba3c7e8e7a8_1488x837.png 424w, https://substackcdn.com/image/fetch/$s_!ng87!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0dc6a46-b294-4bb7-9197-8ba3c7e8e7a8_1488x837.png 848w, https://substackcdn.com/image/fetch/$s_!ng87!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0dc6a46-b294-4bb7-9197-8ba3c7e8e7a8_1488x837.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ng87!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0dc6a46-b294-4bb7-9197-8ba3c7e8e7a8_1488x837.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ng87!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0dc6a46-b294-4bb7-9197-8ba3c7e8e7a8_1488x837.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e0dc6a46-b294-4bb7-9197-8ba3c7e8e7a8_1488x837.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Article content&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Article content" title="Article content" srcset="https://substackcdn.com/image/fetch/$s_!ng87!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0dc6a46-b294-4bb7-9197-8ba3c7e8e7a8_1488x837.png 424w, https://substackcdn.com/image/fetch/$s_!ng87!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0dc6a46-b294-4bb7-9197-8ba3c7e8e7a8_1488x837.png 848w, https://substackcdn.com/image/fetch/$s_!ng87!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0dc6a46-b294-4bb7-9197-8ba3c7e8e7a8_1488x837.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ng87!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0dc6a46-b294-4bb7-9197-8ba3c7e8e7a8_1488x837.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>Six rules sit underneath the table.</p><ol><li><p>Treat every phase handoff as a verification checkpoint, not a transition.</p></li><li><p>Prefer execution-grounded evaluation over LLM-as-judge for any claim that can be tested.</p></li><li><p>Use AI to strengthen human reviews. 
The 89% quality lift only shows up in assist mode.</p></li><li><p>Track every rebuttal promise against the actual manuscript diff before camera-ready.</p></li><li><p>Compare every dissemination artifact against the paper&#8217;s caveats before release.</p></li><li><p>Disclose AI use, attribute decisions, and accept accountability for AI-generated claims.</p></li></ol><div><hr></div><h2><strong>The AlphaSignal Take</strong></h2><p>The AI Scientist camp is not wrong about cost. They are wrong about the ceiling.</p><p>Sakana AI, FARS, and the AI Scientist v2 line argue autonomy is already useful at the right cost-quality tradeoff. The rebuttal sits inside the paper itself. A 40x increase in spend per paper (from $25 to $1,000) does not buy quality. It buys volume that scores below the acceptance threshold.</p><p>Three problems remain open across every system surveyed. They are the real reason the field is not where the demos suggest.</p><p><strong>Phase-boundary faithfulness.</strong> No system surveyed maintains a traceable link from hypothesis to dissemination. Hypotheses, logs, manuscript claims, and rebuttal promises break apart at every handoff.</p><p><strong>Citation provenance.</strong> Generated bibliographies routinely mix metadata across preprint, workshop, conference, and journal versions of the same work. Author list, year, venue, and DOI can come from four different records of one paper. No surveyed tool fixes this.</p><p><strong>Cognitive ownership.</strong> Aggressive automation hides the work that turns a junior researcher into a senior one. Delegating literature synthesis or rebuttal writing keeps field judgment and critical reasoning from building over time.</p><p>The reason this paper is worth seven minutes is the reframe. It stops asking whether one autonomous AI scientist can replace a human researcher. 
It starts asking whether the artifacts a research process produces still link to evidence by the time they reach the public.</p><p>That is the right question to be asking in May 2026.</p><div><hr></div><p><strong>Which principle does your current AI research stack solve, and which one is still open?</strong></p><p><strong>All source links are in the first reply. Full breakdown of recent updates + daily signals in our newsletter (link in bio).</strong></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://alphasignalai.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading AlphaSignal! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Spec-Driven Development is the New Default for AI Coding]]></title><description><![CDATA[The 5 repos defining it, the academic case for why, and the practitioner who says the whole movement is wrong.]]></description><link>https://alphasignalai.substack.com/p/spec-driven-development-is-the-new</link><guid isPermaLink="false">https://alphasignalai.substack.com/p/spec-driven-development-is-the-new</guid><dc:creator><![CDATA[AlphaSignal AI]]></dc:creator><pubDate>Fri, 22 May 2026 16:15:11 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!_YiB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07d8ade4-ba3c-42f0-af76-7ccb307d3005_1672x941.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_YiB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07d8ade4-ba3c-42f0-af76-7ccb307d3005_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_YiB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07d8ade4-ba3c-42f0-af76-7ccb307d3005_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!_YiB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07d8ade4-ba3c-42f0-af76-7ccb307d3005_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!_YiB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07d8ade4-ba3c-42f0-af76-7ccb307d3005_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!_YiB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07d8ade4-ba3c-42f0-af76-7ccb307d3005_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_YiB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07d8ade4-ba3c-42f0-af76-7ccb307d3005_1672x941.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/07d8ade4-ba3c-42f0-af76-7ccb307d3005_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_YiB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07d8ade4-ba3c-42f0-af76-7ccb307d3005_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!_YiB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07d8ade4-ba3c-42f0-af76-7ccb307d3005_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!_YiB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07d8ade4-ba3c-42f0-af76-7ccb307d3005_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!_YiB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07d8ade4-ba3c-42f0-af76-7ccb307d3005_1672x941.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p><strong>In ~8 mins: what SDD is, why it became the default for AI coding, how the 5 leading repos implement it, and the one critic saying the whole category is wrong.</strong></p></blockquote><p>Spec-driven development crossed from blog-post topic to default architecture for AI coding in the last 12 months.</p><p>Thoughtworks, Martin Fowler, GitHub, Amazon, and a 67-source academic review all agreed in 2025 and 2026.</p><p>The question stopped being whether to use SDD and became which implementation.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://alphasignalai.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading AlphaSignal! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2><strong>What happened</strong></h2><p>Multiple independent sources converged on the same recommendation within 18 months.</p><p>Thoughtworks listed spec-driven development in Technology Radar Vol. 32 as a technique worth adopting. Martin Fowler covered it on his site.</p><p>GitHub shipped Spec Kit, an MIT-licensed toolkit framed as the answer to vibe coding. Amazon launched Kiro, an agentic tool that walks users through requirements, design, and tasks before any code generation. Tessl launched at the radical end, with specs positioned as the new source code.</p><p>Red Hat published enterprise SDD guidance. InfoQ covered it at the architecture level.</p><p>Bryan Finster pushed back with the right critique. SDD is not a revolution; it&#8217;s just BDD with branding.</p><p>That critique strengthens the case. The idea is not new. The context is.</p><p>BDD was an optional discipline that teams could adopt or ignore. 
With 84% of professional developers using or planning to use AI tools (Stack Overflow, 2025) and 46% of code output now AI-generated (GitHub, 2025), specification discipline became structurally necessary.</p><div><hr></div><h2><strong>Why it became necessary</strong></h2><p>Four academic papers landed in 12 months, mapping the same problem from different angles.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2Hvl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F241faefe-479d-442d-b7c7-a55d2aa62875_1488x837.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2Hvl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F241faefe-479d-442d-b7c7-a55d2aa62875_1488x837.png 424w, https://substackcdn.com/image/fetch/$s_!2Hvl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F241faefe-479d-442d-b7c7-a55d2aa62875_1488x837.png 848w, https://substackcdn.com/image/fetch/$s_!2Hvl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F241faefe-479d-442d-b7c7-a55d2aa62875_1488x837.png 1272w, https://substackcdn.com/image/fetch/$s_!2Hvl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F241faefe-479d-442d-b7c7-a55d2aa62875_1488x837.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2Hvl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F241faefe-479d-442d-b7c7-a55d2aa62875_1488x837.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/241faefe-479d-442d-b7c7-a55d2aa62875_1488x837.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Article content&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Article content" title="Article content" srcset="https://substackcdn.com/image/fetch/$s_!2Hvl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F241faefe-479d-442d-b7c7-a55d2aa62875_1488x837.png 424w, https://substackcdn.com/image/fetch/$s_!2Hvl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F241faefe-479d-442d-b7c7-a55d2aa62875_1488x837.png 848w, https://substackcdn.com/image/fetch/$s_!2Hvl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F241faefe-479d-442d-b7c7-a55d2aa62875_1488x837.png 1272w, https://substackcdn.com/image/fetch/$s_!2Hvl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F241faefe-479d-442d-b7c7-a55d2aa62875_1488x837.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" 
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p>Sabry Farrag at the University of East London ran a 67-source systematic review of the productivity paradox: AI coding tools deliver real individual-level gains and real system-level damage at the same time.</p><p>Peng et al. measured 55.8% faster completion in a 95-developer RCT. Becker et al.&#8217;s METR study found a 19% slowdown for experienced developers working on mature codebases.</p><p>DORA reported that a 25% increase in AI adoption correlates with a 7.2% drop in delivery stability. Faros AI tracked over 10,000 developers and saw 98% more merged PRs, 91% more review time, and 9% more bugs.</p><p>Shuvendu Lahiri at Microsoft Research named the underlying gap. AI-generated code is plausible by construction, not correct by construction. The semantic distance between what a user means and what a program does is the central reliability bottleneck.</p><p>An AIware 2026 vision paper named a second gap. 
Code review evaluates plausibility, not compliance. Most AI-generated changes pass tests, look reasonable, and still drift from the rules they were supposed to follow.</p><p>Deepak Babu Piskala wrote the practitioner manual that ties it together. He frames SDD across three rigor levels and a four-phase workflow.</p><p>Farrag&#8217;s economic argument closes the loop. Code generated for a specific codebase has high asset specificity. LLMs introduce high behavioral uncertainty.</p><p>Developers invoke AI hundreds of times daily, so transaction frequency is high too. In Transaction Cost Economics terms, that combination of specificity, uncertainty, and frequency makes a written, executable contract the rational governance response. SDD is that contract.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://alphasignalai.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading AlphaSignal! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2><strong>How it actually works</strong></h2><p>SDD compresses to three things a practitioner needs to hold.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!D-NA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa125f049-afa3-44a5-a7b0-45a22a1c045d_1488x837.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!D-NA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa125f049-afa3-44a5-a7b0-45a22a1c045d_1488x837.png 424w, https://substackcdn.com/image/fetch/$s_!D-NA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa125f049-afa3-44a5-a7b0-45a22a1c045d_1488x837.png 848w, https://substackcdn.com/image/fetch/$s_!D-NA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa125f049-afa3-44a5-a7b0-45a22a1c045d_1488x837.png 1272w, https://substackcdn.com/image/fetch/$s_!D-NA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa125f049-afa3-44a5-a7b0-45a22a1c045d_1488x837.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!D-NA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa125f049-afa3-44a5-a7b0-45a22a1c045d_1488x837.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a125f049-afa3-44a5-a7b0-45a22a1c045d_1488x837.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Article content&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Article content" title="Article content" srcset="https://substackcdn.com/image/fetch/$s_!D-NA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa125f049-afa3-44a5-a7b0-45a22a1c045d_1488x837.png 424w, https://substackcdn.com/image/fetch/$s_!D-NA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa125f049-afa3-44a5-a7b0-45a22a1c045d_1488x837.png 848w, https://substackcdn.com/image/fetch/$s_!D-NA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa125f049-afa3-44a5-a7b0-45a22a1c045d_1488x837.png 1272w, https://substackcdn.com/image/fetch/$s_!D-NA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa125f049-afa3-44a5-a7b0-45a22a1c045d_1488x837.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button 
tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p><strong>A four-phase workflow.</strong> Specify what the software should do. Plan how to build it. Implement in small, validated increments. Validate that the code meets the spec. Each phase produces an artifact that constrains the next.</p><p><strong>Three rigor levels.</strong> Spec-first means a specification is written before coding but may drift after. Spec-anchored means the spec lives alongside the code and tests enforce alignment. 
Spec-as-source means the spec is the only artifact humans edit, with code regenerated rather than manually changed.</p><p><strong>A governance spectrum.</strong> Farrag&#8217;s paper ranks four mechanisms by constraint intensity:</p><ul><li><p>Post-hoc review is the loosest, where a developer reviews AI output after the fact.</p></li><li><p>Natural-language specification is next, putting requirements before generation.</p></li><li><p>Executable contract follows, with tests and structured spec documents the agent must satisfy.</p></li><li><p>Constitutional governance is the tightest, a meta-specification of non-negotiable principles that every change must honor.</p></li></ul><p>The higher the asset specificity, behavioral uncertainty, and invocation frequency, the further toward the tight end of the spectrum the rational choice sits. Production code in a mature codebase, where AI is invoked hundreds of times daily, lands at constitutional governance. A throwaway prototype lands at post-hoc review.</p><div><hr></div><h2><strong>The five SDD repos, by philosophy</strong></h2><p>Each repo encodes a different theory of where complexity belongs.</p><blockquote><p><em><strong>Full comparison table at the end. 
Links are in replies.</strong></em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!j5d6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8723151b-f2f2-447c-ab26-b1f0aaa39c62_1488x837.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!j5d6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8723151b-f2f2-447c-ab26-b1f0aaa39c62_1488x837.png 424w, https://substackcdn.com/image/fetch/$s_!j5d6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8723151b-f2f2-447c-ab26-b1f0aaa39c62_1488x837.png 848w, https://substackcdn.com/image/fetch/$s_!j5d6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8723151b-f2f2-447c-ab26-b1f0aaa39c62_1488x837.png 1272w, https://substackcdn.com/image/fetch/$s_!j5d6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8723151b-f2f2-447c-ab26-b1f0aaa39c62_1488x837.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!j5d6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8723151b-f2f2-447c-ab26-b1f0aaa39c62_1488x837.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8723151b-f2f2-447c-ab26-b1f0aaa39c62_1488x837.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Article content&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Article content" title="Article content" srcset="https://substackcdn.com/image/fetch/$s_!j5d6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8723151b-f2f2-447c-ab26-b1f0aaa39c62_1488x837.png 424w, https://substackcdn.com/image/fetch/$s_!j5d6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8723151b-f2f2-447c-ab26-b1f0aaa39c62_1488x837.png 848w, https://substackcdn.com/image/fetch/$s_!j5d6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8723151b-f2f2-447c-ab26-b1f0aaa39c62_1488x837.png 1272w, https://substackcdn.com/image/fetch/$s_!j5d6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8723151b-f2f2-447c-ab26-b1f0aaa39c62_1488x837.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" 
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><h3><strong>Spec Kit: constitution as authority</strong></h3><p>GitHub&#8217;s official toolkit, MIT-licensed, Python CLI (<em><strong>specify init</strong></em>).</p><p>The theory of complexity: put it in the constitution. A non-negotiable principles file at <em><strong>.specify/memory/constitution.md</strong></em> sits above every spec and every implementation. 
The agent obeys it on every change, every session.</p><p>The workflow runs through nine slash commands:</p><ul><li><p><em><strong>/speckit.constitution</strong></em></p></li><li><p><em><strong>/speckit.specify</strong></em></p></li><li><p><em><strong>/speckit.clarify</strong></em></p></li><li><p><em><strong>/speckit.plan</strong></em></p></li><li><p><em><strong>/speckit.tasks</strong></em></p></li><li><p><em><strong>/speckit.taskstoissues</strong></em></p></li><li><p><em><strong>/speckit.checklist</strong></em></p></li><li><p><em><strong>/speckit.analyze</strong></em></p></li><li><p><em><strong>/speckit.implement</strong></em></p></li></ul><p>The constitution and analyze steps are where the formal governance lives.</p><p>Farrag&#8217;s paper evaluates Spec Kit as the direct instantiation of constitutional governance. The reported result: upstream artifact production (PRD, design, structure, technical specs, test plans) drops from 12 hours to 15 minutes.</p><p>A pilot study saw late-stage hotfixes drop from 3-5 per sprint to 1-2, and rollbacks drop from 2-4 per month to 0-1.</p><p>It ships 30+ AI agent integrations, including Claude, Codex, Copilot, Cursor, and Gemini.</p><p>Of the five, this is the only repo with explicit constitutional governance: the highest tier on Farrag&#8217;s spectrum, and the steepest setup cost.</p><div><hr></div><h3><strong>BMAD-METHOD: named agents as authority</strong></h3><p>BMad Code LLC, MIT, npm (<em><strong>npx bmad-method install</strong></em>). V6, with 34+ workflows.</p><p>The theory of complexity: put it in the roles. 
Six named personas, each with domain expertise:</p><ul><li><p>Analyst Mary handles brainstorming and research.</p></li><li><p>PM John owns PRDs.</p></li><li><p>Architect Winston runs the 8-step architecture workflow.</p></li><li><p>Developer Amelia handles dev stories, sprint planning, and code review.</p></li><li><p>UX Designer Sally owns interface decisions.</p></li><li><p>Tech Writer Paige owns documentation.</p></li></ul><p>Party Mode brings multiple personas into one session to argue from different professional perspectives.</p><p>The lifecycle has four phases: Analysis, Planning, Solutioning, Implementation. Each phase has its own workflows.</p><p>A <em><strong>.decision-log.md</strong></em> records every decision as an audit trail. An implementation-readiness gate (PASS, CONCERNS, or FAIL) blocks the move to code if anything is missing.</p><p>Planning depth auto-adjusts to project stakes. A hobby project gets a 2-page PRD. A launch project gets full specs. The <em><strong>bmad-help</strong></em> skill answers free-form questions about what to do next.</p><p>The module ecosystem extends the core with specialized domains: BMM (core), BMB (custom agents), TEA (test architecture), BMGD (game dev), CIS (creative intelligence).</p><p>This is the only repo that treats specifications as the inter-agent communication protocol of a multi-agent organization.</p><div><hr></div><h3><strong>OpenSpec: change folders as the unit</strong></h3><p>Fission AI, MIT, npm (<em><strong>openspec init</strong></em>).</p><p>The theory of complexity: put it in the change. 
Each feature gets its own folder containing <em><strong>proposal.md</strong></em> (why this change), <em><strong>specs/</strong></em> (requirements and scenarios), <em><strong>design.md</strong></em> (technical approach), and <em><strong>tasks.md</strong></em> (implementation checklist).</p><p>When the change ships, <em><strong>/opsx:archive</strong></em> folds the change spec into a growing source-of-truth document.</p><p>The core surface is three commands:</p><ul><li><p><em><strong>/opsx:propose</strong></em> creates the change folder.</p></li><li><p><em><strong>/opsx:apply</strong></em> has the AI implement the task checklist.</p></li><li><p><em><strong>/opsx:archive</strong></em> closes it out.</p></li></ul><p>An opt-in expanded profile adds six more: <em><strong>/opsx:new</strong></em>, <em><strong>/opsx:continue</strong></em>, <em><strong>/opsx:ff</strong></em>, <em><strong>/opsx:verify</strong></em>, <em><strong>/opsx:bulk-archive</strong></em>, <em><strong>/opsx:onboard</strong></em>.</p><p>The positioning is explicitly brownfield-first. Most SDD tools optimize for greenfield projects. OpenSpec is built to retrofit existing codebases. The delta-spec format (additions, modifications, removals tracked per change) is what makes that work.</p><p>Works with 25+ AI assistants via slash commands.</p><p>Executable contract at the lightest possible weight. No constitution, no named agents, no ceremony. The spec discipline survives without the process.</p><div><hr></div><h3><strong>GSD: context as the bottleneck</strong></h3><p>T&#194;CHES, MIT, npm (<em><strong>npx get-shit-done-cc@latest</strong></em>). Built by a solo developer for solo developers.</p><p>The theory of complexity: put it in context engineering. The main session context stays at 30 to 40 percent. 
Heavy work runs in fresh subagent contexts, each getting a full 200K-token window.</p><p>The hypothesis the rest of the architecture rests on: as a session grows, AI output degrades, so the architecture should keep the session small.</p><p>The loop is six commands:</p><ul><li><p><em><strong>/gsd-new-project</strong></em> runs questions, research, requirements, roadmap.</p></li><li><p><em><strong>/gsd-map-codebase</strong></em> does the same for existing code.</p></li><li><p><em><strong>/gsd-discuss-phase</strong></em> captures decisions before planning.</p></li><li><p><em><strong>/gsd-plan-phase</strong></em> runs research, plan, verify in a loop.</p></li><li><p><em><strong>/gsd-execute-phase</strong></em> dispatches parallel waves of subagents.</p></li><li><p><em><strong>/gsd-verify-work</strong></em> walks through what was built and diagnoses failures.</p></li></ul><p>Five persistent state files survive session boundaries: <em><strong>PROJECT.md</strong></em> (vision), <em><strong>REQUIREMENTS.md</strong></em> (scope), <em><strong>ROADMAP.md</strong></em> (direction), <em><strong>STATE.md</strong></em> (current position), <em><strong>CONTEXT.md</strong></em> (per-phase decisions).</p><p>The <em><strong>.planning/config.json</strong></em> controls mode (interactive or yolo), model profiles (quality, balanced, budget), and quality-agent toggles. Package legitimacy checks are built into the install path.</p><p>Executable contract delivered through context discipline rather than process ceremony. The repo treats the context window as the bottleneck, not the methodology.</p><div><hr></div><h3><strong>Superpowers: auto-triggering as discipline</strong></h3><p>Built by Jesse Vincent and Prime Radiant. 
MIT, zero-dependency plugin.</p><p>The theory of complexity: put it in shaping the agent&#8217;s behavior. Skills auto-trigger at the right moments. No manual invocation. Mandatory workflows, not suggestions.</p><p>The <em><strong>using-superpowers</strong></em> skill loads at session start and is what makes auto-triggering work. Copying the skill files without it is not a real integration.</p><p>Seven core skills run the workflow:</p><ul><li><p><em><strong>brainstorming</strong></em> refines rough ideas before any code.</p></li><li><p><em><strong>using-git-worktrees</strong></em> isolates the workspace.</p></li><li><p><em><strong>writing-plans</strong></em> breaks work into 2 to 5 minute tasks with exact file paths and complete code.</p></li><li><p><em><strong>subagent-driven-development</strong></em> dispatches a fresh subagent per task with two-stage review (spec compliance, then code quality).</p></li><li><p><em><strong>test-driven-development</strong></em> deletes any code written before its test.</p></li><li><p><em><strong>requesting-code-review</strong></em> blocks critical issues.</p></li><li><p><em><strong>finishing-a-development-branch</strong></em> verifies tests and presents merge options.</p></li></ul><p>The TDD enforcement is the unusual move. Most TDD tooling encourages the loop. Superpowers&#8217; skill deletes code that violates it.</p><p>Distributed through the official Claude plugin marketplace, the official Codex plugin marketplace, Factory Droid, Gemini extensions, Cursor, GitHub Copilot CLI, and OpenCode.</p><p>Executable contract enforced at the agent layer rather than the user layer. The user never has to remember to invoke the right skill.</p><div><hr></div><h3><strong>The sixth repo, and the case against the category</strong></h3><p>Matt Pocock&#8217;s <em>Skills For Real Engineers</em> sits on the same list of repos by accident. 
He argues against the category.</p><p>His talk <em>Software Fundamentals Matter More Than Ever</em> lands the thesis directly. &#8220;Code is not cheap. In fact, bad code is the most expensive it&#8217;s ever been.&#8221;</p><p>On the spec-driven movement specifically: &#8220;Specs to code, we are not investing in the design of the system. We are divesting from it.&#8221;</p><p>His position rests on a software-engineering claim. Bad codebases have always been expensive because they resist change. AI accelerates that. A bad codebase compounded by AI throughput is the most expensive failure mode of the new era.</p><p>His repo is composable practices, not a workflow framework. Each skill stands alone:</p><ul><li><p><em><strong>/grill-me</strong></em> runs a relentless interview to establish what Frederick Brooks calls a shared design concept.</p></li><li><p><em><strong>/grill-with-docs</strong></em> adds a Domain-Driven Design ubiquitous language file that humans and AI both reference.</p></li><li><p><em><strong>/tdd</strong></em> enforces red-green-refactor as the rate limiter on AI speed.</p></li><li><p><em><strong>/improve-codebase-architecture</strong></em> rebuilds shallow modules into deep modules, per John Ousterhout.</p></li></ul><p>The default pattern is gray boxes: design the interface, delegate the implementation.</p><p>The data on his side: the METR finding that experienced developers on mature codebases were 19% slower with AI suggests the bottleneck is codebase quality, not specification quality. His argument is that the five SDD repos optimize for the wrong thing.</p><p>His repo went viral on the strength of <em><strong>/grill-me</strong></em> alone. 
The position is worth taking seriously.</p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wdTo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4398bc0c-2e73-4b5d-94cb-096ec5376068_1488x837.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wdTo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4398bc0c-2e73-4b5d-94cb-096ec5376068_1488x837.png 424w, https://substackcdn.com/image/fetch/$s_!wdTo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4398bc0c-2e73-4b5d-94cb-096ec5376068_1488x837.png 848w, https://substackcdn.com/image/fetch/$s_!wdTo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4398bc0c-2e73-4b5d-94cb-096ec5376068_1488x837.png 1272w, https://substackcdn.com/image/fetch/$s_!wdTo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4398bc0c-2e73-4b5d-94cb-096ec5376068_1488x837.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wdTo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4398bc0c-2e73-4b5d-94cb-096ec5376068_1488x837.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4398bc0c-2e73-4b5d-94cb-096ec5376068_1488x837.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Article content&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Article content" title="Article content" srcset="https://substackcdn.com/image/fetch/$s_!wdTo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4398bc0c-2e73-4b5d-94cb-096ec5376068_1488x837.png 424w, https://substackcdn.com/image/fetch/$s_!wdTo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4398bc0c-2e73-4b5d-94cb-096ec5376068_1488x837.png 848w, https://substackcdn.com/image/fetch/$s_!wdTo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4398bc0c-2e73-4b5d-94cb-096ec5376068_1488x837.png 1272w, https://substackcdn.com/image/fetch/$s_!wdTo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4398bc0c-2e73-4b5d-94cb-096ec5376068_1488x837.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" 
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><div><hr></div><h3><strong>The AlphaSignal take</strong></h3><p>The five SDD repos and Pocock&#8217;s dissent are not answering the same question.</p><p>SDD optimizes for the plausibility-to-correctness gap. Pocock optimizes for the design-entropy gap. Both gaps are real. Both data sets support both positions.</p><p>A team that picks one and ignores the other is solving half the problem.</p><p>The reliability case for SDD is strongest at the constitutional and executable-contract levels. 
Spec Kit&#8217;s constitution mechanism and BMAD&#8217;s implementation-readiness gate are where the math actually pays off.</p><p>The case is weakest at the natural-language end, where SDD collapses into renamed prompt engineering.</p><p>Three problems remain that none of the six repos solve, drawn from the open-problems sections of the four papers.</p><p><strong>Oracle adequacy.</strong> Current evaluations collapse model quality, tool reliability, and harness quality into one end-task number. There is no metric for what a specification is actually worth.</p><p><strong>Evidence bundles.</strong> Every accepted change should ship with a record of what was checked, what was not, and what risks remain. No current SDD tool produces this.</p><p><strong>Self-evolving harnesses.</strong> The SDD frameworks themselves are software. They will change. None of them has a change-contract for its own evolution.</p><p>Read each of these repos as a specific theory of where reliability comes from. Pick the one whose theory matches the bottleneck you actually have. If you don&#8217;t know your bottleneck, Pocock&#8217;s critique applies first.</p><div><hr></div><p><strong>Which theory of reliability does your stack depend on: constitution, roles, change folders, context, auto-triggering, or design discipline?</strong></p><p><strong>Full breakdown of recent updates + daily signals in our newsletter (link in bio).</strong></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://alphasignalai.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading AlphaSignal! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[The Three Harness Layers and How to Audit Your Stack]]></title><description><![CDATA[A 100-page survey by UIUC, Meta, and Stanford maps the harness layer that runs Claude Code, Codex, and SWE-agent.]]></description><link>https://alphasignalai.substack.com/p/the-three-harness-layers-and-how</link><guid isPermaLink="false">https://alphasignalai.substack.com/p/the-three-harness-layers-and-how</guid><dc:creator><![CDATA[AlphaSignal AI]]></dc:creator><pubDate>Thu, 21 May 2026 18:30:20 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!xAt3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3e81789-4c38-4906-8123-d7628a8b9347_6000x3375.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xAt3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3e81789-4c38-4906-8123-d7628a8b9347_6000x3375.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xAt3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3e81789-4c38-4906-8123-d7628a8b9347_6000x3375.png 424w, 
https://substackcdn.com/image/fetch/$s_!xAt3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3e81789-4c38-4906-8123-d7628a8b9347_6000x3375.png 848w, https://substackcdn.com/image/fetch/$s_!xAt3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3e81789-4c38-4906-8123-d7628a8b9347_6000x3375.png 1272w, https://substackcdn.com/image/fetch/$s_!xAt3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3e81789-4c38-4906-8123-d7628a8b9347_6000x3375.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xAt3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3e81789-4c38-4906-8123-d7628a8b9347_6000x3375.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c3e81789-4c38-4906-8123-d7628a8b9347_6000x3375.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1934360,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://alphasignalai.substack.com/i/198745781?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3e81789-4c38-4906-8123-d7628a8b9347_6000x3375.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!xAt3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3e81789-4c38-4906-8123-d7628a8b9347_6000x3375.png 424w, https://substackcdn.com/image/fetch/$s_!xAt3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3e81789-4c38-4906-8123-d7628a8b9347_6000x3375.png 848w, https://substackcdn.com/image/fetch/$s_!xAt3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3e81789-4c38-4906-8123-d7628a8b9347_6000x3375.png 1272w, https://substackcdn.com/image/fetch/$s_!xAt3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3e81789-4c38-4906-8123-d7628a8b9347_6000x3375.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Most agent failures aren&#8217;t reasoning failures. They&#8217;re harness failures.</p><p>An agent that passes every test while looping between two failing strategies, because the harness has no dead-end detection.</p><p>A new 100-page survey from UIUC, Meta, and Stanford spells out why.</p><div><hr></div><h2><strong>Paper</strong></h2><p>It&#8217;s called &#8220;Code as Agent Harness.&#8221; 40+ researchers from UIUC, Meta, and Stanford wrote it, and it synthesizes 400+ papers under a single taxonomy with the harness, not the model, as the subject.</p><p>The anchor systems are the familiar ones: Claude Code, Codex, SWE-agent, Voyager, MetaGPT, OpenHands. What&#8217;s been a Twitter-thread topic for the last six months now has academic scaffolding around it. 
The contribution is the synthesis, not the discovery.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vbUj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda4852cd-3ffa-4c76-bb89-f1ec0c4023fd_1023x1000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vbUj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda4852cd-3ffa-4c76-bb89-f1ec0c4023fd_1023x1000.png 424w, https://substackcdn.com/image/fetch/$s_!vbUj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda4852cd-3ffa-4c76-bb89-f1ec0c4023fd_1023x1000.png 848w, https://substackcdn.com/image/fetch/$s_!vbUj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda4852cd-3ffa-4c76-bb89-f1ec0c4023fd_1023x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!vbUj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda4852cd-3ffa-4c76-bb89-f1ec0c4023fd_1023x1000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vbUj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda4852cd-3ffa-4c76-bb89-f1ec0c4023fd_1023x1000.png" width="1023" height="1000" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/da4852cd-3ffa-4c76-bb89-f1ec0c4023fd_1023x1000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1000,&quot;width&quot;:1023,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Article content&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Article content" title="Article content" srcset="https://substackcdn.com/image/fetch/$s_!vbUj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda4852cd-3ffa-4c76-bb89-f1ec0c4023fd_1023x1000.png 424w, https://substackcdn.com/image/fetch/$s_!vbUj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda4852cd-3ffa-4c76-bb89-f1ec0c4023fd_1023x1000.png 848w, https://substackcdn.com/image/fetch/$s_!vbUj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda4852cd-3ffa-4c76-bb89-f1ec0c4023fd_1023x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!vbUj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda4852cd-3ffa-4c76-bb89-f1ec0c4023fd_1023x1000.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" 
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><div><hr></div><h2><strong>Core thesis</strong></h2><p>Long-running agents fail at state, feedback, and verification. Not at reasoning. 
The bottleneck of autonomy is whether the system around the model can hold its outputs accountable to something executable.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!y_tR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56afe214-a98b-42e3-9d6b-65296089880c_1368x918.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!y_tR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56afe214-a98b-42e3-9d6b-65296089880c_1368x918.png 424w, https://substackcdn.com/image/fetch/$s_!y_tR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56afe214-a98b-42e3-9d6b-65296089880c_1368x918.png 848w, https://substackcdn.com/image/fetch/$s_!y_tR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56afe214-a98b-42e3-9d6b-65296089880c_1368x918.png 1272w, https://substackcdn.com/image/fetch/$s_!y_tR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56afe214-a98b-42e3-9d6b-65296089880c_1368x918.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!y_tR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56afe214-a98b-42e3-9d6b-65296089880c_1368x918.png" width="1368" height="918" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/56afe214-a98b-42e3-9d6b-65296089880c_1368x918.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:918,&quot;width&quot;:1368,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Article content&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Article content" title="Article content" srcset="https://substackcdn.com/image/fetch/$s_!y_tR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56afe214-a98b-42e3-9d6b-65296089880c_1368x918.png 424w, https://substackcdn.com/image/fetch/$s_!y_tR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56afe214-a98b-42e3-9d6b-65296089880c_1368x918.png 848w, https://substackcdn.com/image/fetch/$s_!y_tR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56afe214-a98b-42e3-9d6b-65296089880c_1368x918.png 1272w, https://substackcdn.com/image/fetch/$s_!y_tR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56afe214-a98b-42e3-9d6b-65296089880c_1368x918.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" 
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p>The paper splits any agent system into three coupled pieces.</p><p>First is <strong>model-internal capabilities</strong>: reasoning, planning, perception.</p><p>The second is <strong>system-provided infrastructure</strong>: tools, sandboxes, memory, permission tiers, telemetry.</p><p>And third, the underexplored one, is <strong>agent-initiated code artifacts</strong>: regression tests, temporary tools, DSL programs, executable workflows, and reusable skills the agent itself authors mid-task. 
Voyager&#8217;s skill library and Claude Code&#8217;s skill files are early instances.</p><p>Above this distinction sit three layers.</p><p><strong>Harness Interface</strong> puts code at the center as the medium for reasoning, action, and environment state.</p><p><strong>Harness Mechanisms</strong> covers planning, memory, tool use, and the plan-execute-verify loop.</p><p><strong>Scaling the Harness</strong> extends the picture to multi-agent systems coordinating over shared code artifacts.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yV-p!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff43447cd-844b-47aa-a188-71b361c18a5e_1488x837.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yV-p!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff43447cd-844b-47aa-a188-71b361c18a5e_1488x837.png 424w, https://substackcdn.com/image/fetch/$s_!yV-p!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff43447cd-844b-47aa-a188-71b361c18a5e_1488x837.png 848w, https://substackcdn.com/image/fetch/$s_!yV-p!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff43447cd-844b-47aa-a188-71b361c18a5e_1488x837.png 1272w, https://substackcdn.com/image/fetch/$s_!yV-p!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff43447cd-844b-47aa-a188-71b361c18a5e_1488x837.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!yV-p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff43447cd-844b-47aa-a188-71b361c18a5e_1488x837.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f43447cd-844b-47aa-a188-71b361c18a5e_1488x837.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Article content&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Article content" title="Article content" srcset="https://substackcdn.com/image/fetch/$s_!yV-p!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff43447cd-844b-47aa-a188-71b361c18a5e_1488x837.png 424w, https://substackcdn.com/image/fetch/$s_!yV-p!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff43447cd-844b-47aa-a188-71b361c18a5e_1488x837.png 848w, https://substackcdn.com/image/fetch/$s_!yV-p!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff43447cd-844b-47aa-a188-71b361c18a5e_1488x837.png 1272w, https://substackcdn.com/image/fetch/$s_!yV-p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff43447cd-844b-47aa-a188-71b361c18a5e_1488x837.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button 
tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><div><hr></div><h2><strong>How to audit your stack</strong></h2><p>Three questions, one per layer. 
They map to where most stacks actually break.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aX6-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff67bc191-e91a-4810-8c38-a7e238b0a2a7_1488x837.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aX6-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff67bc191-e91a-4810-8c38-a7e238b0a2a7_1488x837.png 424w, https://substackcdn.com/image/fetch/$s_!aX6-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff67bc191-e91a-4810-8c38-a7e238b0a2a7_1488x837.png 848w, https://substackcdn.com/image/fetch/$s_!aX6-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff67bc191-e91a-4810-8c38-a7e238b0a2a7_1488x837.png 1272w, https://substackcdn.com/image/fetch/$s_!aX6-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff67bc191-e91a-4810-8c38-a7e238b0a2a7_1488x837.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aX6-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff67bc191-e91a-4810-8c38-a7e238b0a2a7_1488x837.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f67bc191-e91a-4810-8c38-a7e238b0a2a7_1488x837.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Article content&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Article content" title="Article content" srcset="https://substackcdn.com/image/fetch/$s_!aX6-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff67bc191-e91a-4810-8c38-a7e238b0a2a7_1488x837.png 424w, https://substackcdn.com/image/fetch/$s_!aX6-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff67bc191-e91a-4810-8c38-a7e238b0a2a7_1488x837.png 848w, https://substackcdn.com/image/fetch/$s_!aX6-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff67bc191-e91a-4810-8c38-a7e238b0a2a7_1488x837.png 1272w, https://substackcdn.com/image/fetch/$s_!aX6-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff67bc191-e91a-4810-8c38-a7e238b0a2a7_1488x837.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" 
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"></figcaption></figure></div>
<p><strong>Interface.</strong></p><p>Does your agent&#8217;s reasoning, action, and environment state pass through code that something can actually execute and inspect? A healthy stack has tool calls, generated programs, repo state, traces, and tests. An unhealthy stack runs on natural-language plans the agent never has to defend against execution.</p><p>If unhealthy: have the model output executable code as its reasoning, give the agent a structured Agent-Computer Interface like SWE-agent&#8217;s shell, edit, and search commands, and let it operate on real repo state rather than text descriptions of it.</p><p><strong>Mechanisms.</strong></p><p>When something fails, what does the harness do about it? A healthy stack runs a plan-execute-verify loop with named verifiers (unit tests, type checks, linters, runtime monitors), durable memory across sessions, and feedback that closes the loop. An unhealthy stack retries with more tokens and a longer context window.</p><p>If unhealthy: add named verifiers as gates between generation steps, not just at the end. Most agents have only working memory, which is whatever sits in the current prompt. The paper names four more types that decide whether yesterday&#8217;s debugging session helps today: semantic memory of the repo, experiential memory of past trajectories, long-term memory with a compression policy, and multi-agent memory for shared state. 
OpenHands&#8217; stateful workspace and CodeMem&#8217;s budgeted memory slots are reference implementations to study.</p><p><strong>Scaling.</strong></p><p>When two agents work on the same task, what&#8217;s the shared substrate? A healthy stack uses shared code artifacts (repos, tests, traces, structured workflows) with a conflict policy. An unhealthy stack passes messages back and forth with no shared state both can safely modify.</p><p>If unhealthy: replace direct message-passing with shared artifacts both agents can read and write. AgentCoder&#8217;s programmer-tester-executor split and MetaGPT&#8217;s role-specialized agents over a shared message pool are the patterns the paper highlights.</p><p>If any of these answers feel unhealthy, the paper has already named the failure mode.</p><div><hr></div><h2><strong>Also, the paper covers</strong></h2><ul><li><p><strong>Five application domains.</strong> Code assistants, GUI/OS agents, scientific discovery, embodied agents, personalization.</p></li><li><p><strong>Self-evolving harnesses.</strong> AutoHarness, Meta-Harness, <strong>Agentic Harness Engineering (AHE) (related article down below)</strong>, GEPA, EvoMAC, and SEW. 
The harness itself as the object of optimization, with the agent&#8217;s task code as a downstream effect.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kPiX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc561f584-6812-4299-8a0b-36458e0d166b_638x497.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kPiX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc561f584-6812-4299-8a0b-36458e0d166b_638x497.png 424w, https://substackcdn.com/image/fetch/$s_!kPiX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc561f584-6812-4299-8a0b-36458e0d166b_638x497.png 848w, https://substackcdn.com/image/fetch/$s_!kPiX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc561f584-6812-4299-8a0b-36458e0d166b_638x497.png 1272w, https://substackcdn.com/image/fetch/$s_!kPiX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc561f584-6812-4299-8a0b-36458e0d166b_638x497.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kPiX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc561f584-6812-4299-8a0b-36458e0d166b_638x497.png" width="440" height="342.7586206896552" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c561f584-6812-4299-8a0b-36458e0d166b_638x497.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:497,&quot;width&quot;:638,&quot;resizeWidth&quot;:440,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Article content&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Article content" title="Article content" srcset="https://substackcdn.com/image/fetch/$s_!kPiX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc561f584-6812-4299-8a0b-36458e0d166b_638x497.png 424w, https://substackcdn.com/image/fetch/$s_!kPiX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc561f584-6812-4299-8a0b-36458e0d166b_638x497.png 848w, https://substackcdn.com/image/fetch/$s_!kPiX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc561f584-6812-4299-8a0b-36458e0d166b_638x497.png 1272w, https://substackcdn.com/image/fetch/$s_!kPiX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc561f584-6812-4299-8a0b-36458e0d166b_638x497.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p><strong><a href="https://x.com/AlphaSignalAI/status/2049900160080077229">Post link.</a></strong></p><ul><li><p><strong>Workflow topologies.</strong> Five patterns for multi-agent code work: waterfall, cyclic, <strong>hierarchical (related article down below)</strong>, star, adaptive.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wv-I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c0851f-0eb2-4158-a95c-3db3ae60997d_636x499.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!wv-I!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c0851f-0eb2-4158-a95c-3db3ae60997d_636x499.png 424w, https://substackcdn.com/image/fetch/$s_!wv-I!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c0851f-0eb2-4158-a95c-3db3ae60997d_636x499.png 848w, https://substackcdn.com/image/fetch/$s_!wv-I!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c0851f-0eb2-4158-a95c-3db3ae60997d_636x499.png 1272w, https://substackcdn.com/image/fetch/$s_!wv-I!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c0851f-0eb2-4158-a95c-3db3ae60997d_636x499.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wv-I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c0851f-0eb2-4158-a95c-3db3ae60997d_636x499.png" width="522" height="409.5566037735849" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/63c0851f-0eb2-4158-a95c-3db3ae60997d_636x499.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:499,&quot;width&quot;:636,&quot;resizeWidth&quot;:522,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Article content&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Article content" title="Article content" 
srcset="https://substackcdn.com/image/fetch/$s_!wv-I!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c0851f-0eb2-4158-a95c-3db3ae60997d_636x499.png 424w, https://substackcdn.com/image/fetch/$s_!wv-I!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c0851f-0eb2-4158-a95c-3db3ae60997d_636x499.png 848w, https://substackcdn.com/image/fetch/$s_!wv-I!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c0851f-0eb2-4158-a95c-3db3ae60997d_636x499.png 1272w, https://substackcdn.com/image/fetch/$s_!wv-I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c0851f-0eb2-4158-a95c-3db3ae60997d_636x499.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p><strong><a href="https://x.com/AlphaSignalAI/status/2051352235926249945">Post link.</a></strong></p><ul><li><p><strong>Planning paradigms.</strong> Four categories, from ReAct-style linear decomposition to tree-search-based exploration across candidate paths.</p></li><li><p><strong>Three more open problems.</strong> Harness evolution that doesn&#8217;t break old behaviors, shared state across agents with safe coordination, and multimodal harnesses for screenshots and physical state.</p></li></ul><div><hr></div><h3><strong>The AlphaSignal take</strong></h3><p>The most useful vocabulary the field has had for what practitioners are already building. Just not a build plan. Three gaps from the paper&#8217;s open problems. Each one is a design warning for what&#8217;s already in your stack.</p><p><strong>Oracle adequacy.</strong></p><p>If your eval is pass/fail on unit tests, you&#8217;re measuring the wrong thing. Every agent evaluation today collapses model quality, tool reliability, and harness quality into one end-task number. The paper names this as the central bottleneck and offers no metric that fixes it.</p><p><strong>The verification gap.</strong></p><p>Green tests are not a correct specification. Every accepted action should ship with an evidence bundle: which checks ran, which assumptions held, which parts of the code stayed untested, what risks remain. No current harness does this. 
The architecture pattern is sitting there, waiting for someone to ship it.</p><p><strong>Approvals that don&#8217;t reset.</strong></p><p>If approvals vanish after the session ends, your agent will repeat the same unsafe action next time. Permission rules should mutate in response to human decisions, not reset. The paper flags this and stops there.</p><p>Read it as a vocabulary, not a roadmap. The harness is the layer teams now invest in optimizing. The taxonomy will sharpen how you talk about your stack. It won&#8217;t tell you what to build on Monday.</p><div><hr></div><h3><strong>Does your agent have a verifier that isn&#8217;t just the model judging its own output?</strong></h3><div><hr></div><p><strong><a href="https://arxiv.org/abs/2605.18747">Paper Link</a></strong></p>
]]></content:encoded></item><item><title><![CDATA[How OpenHuman Works, And How to Set It Up in 5 Minutes]]></title><description><![CDATA[The open-source desktop agent that crossed +20k GitHub stars in days, what&#8217;s inside, and the full walkthrough.]]></description><link>https://alphasignalai.substack.com/p/how-openhuman-works-and-how-to-set</link><guid isPermaLink="false">https://alphasignalai.substack.com/p/how-openhuman-works-and-how-to-set</guid><dc:creator><![CDATA[AlphaSignal AI]]></dc:creator><pubDate>Wed, 20 May 2026 18:04:11 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!2I8z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa8003ff-edfb-4f9e-99f5-2276758fa8e1_6000x3375.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2I8z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa8003ff-edfb-4f9e-99f5-2276758fa8e1_6000x3375.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2I8z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa8003ff-edfb-4f9e-99f5-2276758fa8e1_6000x3375.png 424w, 
https://substackcdn.com/image/fetch/$s_!2I8z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa8003ff-edfb-4f9e-99f5-2276758fa8e1_6000x3375.png 848w, https://substackcdn.com/image/fetch/$s_!2I8z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa8003ff-edfb-4f9e-99f5-2276758fa8e1_6000x3375.png 1272w, https://substackcdn.com/image/fetch/$s_!2I8z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa8003ff-edfb-4f9e-99f5-2276758fa8e1_6000x3375.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2I8z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa8003ff-edfb-4f9e-99f5-2276758fa8e1_6000x3375.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aa8003ff-edfb-4f9e-99f5-2276758fa8e1_6000x3375.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2059014,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://alphasignalai.substack.com/i/198596438?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa8003ff-edfb-4f9e-99f5-2276758fa8e1_6000x3375.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!2I8z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa8003ff-edfb-4f9e-99f5-2276758fa8e1_6000x3375.png 424w, https://substackcdn.com/image/fetch/$s_!2I8z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa8003ff-edfb-4f9e-99f5-2276758fa8e1_6000x3375.png 848w, https://substackcdn.com/image/fetch/$s_!2I8z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa8003ff-edfb-4f9e-99f5-2276758fa8e1_6000x3375.png 1272w, https://substackcdn.com/image/fetch/$s_!2I8z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa8003ff-edfb-4f9e-99f5-2276758fa8e1_6000x3375.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><blockquote><p><em>In 5 minutes, you will know what OpenHuman is, how its memory works, how it compares to OpenClaw and Hermes Agent, and you will have it running locally with Gmail, Notion, or Slack connected.</em></p></blockquote><div><hr></div><p>OpenHuman crossed +20k GitHub stars in days.</p><p>Most of those stars came during a GitHub Trending run the founder marked at seven days at #1 on May 18, 2026, when he also posted that the project had hit #1 on Product Hunt&#8217;s daily, weekly, and monthly leaderboards.</p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/senamakel/status/2056205318779187317&quot;,&quot;full_text&quot;:&quot;OpenHuman is now the number #1 product on <span class=\&quot;tweet-fake-link\&quot;>@ProductHunt</span> on both weekly AND monthly.\n\nThis means we are number one on daily, weekly, monthly and just a few upvotes away from being the number one product of the entire year! 
WTF &#128563;\n\nMajor release and bug fixes tomorrow &#9972;&#65039; &quot;,&quot;username&quot;:&quot;senamakel&quot;,&quot;name&quot;:&quot;Steven Enamakel&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/2036165243068620801/5YuqzEEs_normal.jpg&quot;,&quot;date&quot;:&quot;2026-05-18T02:48:45.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/HIkb10paUAAR3Jl.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/ESVVjSdwSG&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:19,&quot;retweet_count&quot;:10,&quot;like_count&quot;:78,&quot;impression_count&quot;:9123,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:false}" data-component-name="Twitter2ToDOM"></div><p><strong>Steven Enamakel</strong>, the founder, tried to set up an open-source AI agent for his dad earlier this year. Three hours of API keys, YAML, and a terminal his dad had never opened later, they both gave up. OpenHuman is what came out of that.</p><p><strong>OpenClaw</strong> (+373k stars, MIT) and <strong>Hermes Agent</strong> (+157k stars, MIT) are the two open-source agents OpenHuman is now compared with. Both are great. Neither walks your tools and writes the memory before you start prompting.</p><p>This morning at 07:17 UTC, <strong>v0.54.0</strong> shipped, with fully-local voice and a shared-memory bridge to Claude Code, Cursor, Codex, and OpenCode.</p><div><hr></div><h2>What OpenHuman is</h2><p>OpenHuman is an open-source (GPL-3.0) desktop AI agent from <strong>TinyHumans AI</strong>.</p><p>It is written in <strong>Rust</strong> (65.2% of the codebase) with a <strong>TypeScript + React 19</strong> frontend, packaged as a native <strong>Tauri v2</strong> desktop app. It ships on macOS, Windows, and Linux. No mobile.</p><p>The pitch in one line: most agents start cold, OpenHuman walks your tools every 20 minutes and writes the memory into Markdown files you can open and edit.</p><p>Five things ship in the box:</p><ul><li><p>A clean desktop UI with a mascot that can speak, react, and join Google Meet as a real participant.</p></li><li><p>A 118+ Composio toolkit catalog for one-click OAuth integrations.</p></li><li><p>A <strong>Memory Tree</strong> plus an <strong>Obsidian-compatible Markdown vault</strong> as the local knowledge base.</p></li><li><p>Batteries-included native tools: web search, web-fetch scraper, full coder toolset, native voice.</p></li><li><p><strong>TokenJuice</strong>, a token compression layer that runs on every tool result before it touches an LLM.</p></li></ul><p>The whole project is three months old. The public repo was created on February 18, 2026.</p><div><hr></div><h2>Why it matters</h2><p>Most open-source agents bet on the wrong half of the problem.</p><p>They get smarter at planning. They add more tools. They wire up more channels. They still know nothing about you until you paste your week into the prompt.</p><p>OpenHuman is betting the other way. 
Structured local memory beats embedding-bag retrieval when the agent needs to navigate your day rather than match similar text. The founder&#8217;s framing, from the Product Hunt launch:</p><p><em>&#8220;Every powerful AI agent today is built for the 0.01% who can spin up their own runtime. <strong>The other 99.99% are watching the agent revolution from the sidelines.</strong>&#8221;</em></p><p>Early traction tracks that bet: 5,000+ users in the first 7 days and 150% week-over-week growth, per the founder&#8217;s Product Hunt post.</p><p><strong>OpenClaw</strong> and <strong>Hermes Agent</strong> take different bets. OpenClaw is a multi-channel gateway across 20+ messaging surfaces. Hermes Agent is a self-improving runtime with a closed learning loop. Neither tries to walk your tools and write the memory before you prompt. <em>Full head-to-head comparison further down.</em></p><div><hr></div><h2>How it works</h2><blockquote><p><em>Not here for the internals? Skip to How to Get Started below.</em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kY1n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13869759-5663-45f3-9928-db89f63cee1d_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kY1n!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13869759-5663-45f3-9928-db89f63cee1d_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!kY1n!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13869759-5663-45f3-9928-db89f63cee1d_1672x941.png 848w, 
https://substackcdn.com/image/fetch/$s_!kY1n!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13869759-5663-45f3-9928-db89f63cee1d_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!kY1n!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13869759-5663-45f3-9928-db89f63cee1d_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kY1n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13869759-5663-45f3-9928-db89f63cee1d_1672x941.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/13869759-5663-45f3-9928-db89f63cee1d_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kY1n!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13869759-5663-45f3-9928-db89f63cee1d_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!kY1n!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13869759-5663-45f3-9928-db89f63cee1d_1672x941.png 848w, 
https://substackcdn.com/image/fetch/$s_!kY1n!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13869759-5663-45f3-9928-db89f63cee1d_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!kY1n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13869759-5663-45f3-9928-db89f63cee1d_1672x941.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Nine pieces. The memory pipeline is the spine. 
Everything else hangs off it.</p><h3>4.1 Memory Tree</h3><p>A deterministic, bucket-sealed pipeline, not a vector-store wrapper.</p><p>source adapters &#8594; canonicalize &#8594; chunker &#8594; content_store &#8594; store &#8594; score &#8594; source/topic/global trees &#8594; retrieval</p><p>Data is canonicalized to Markdown, split into <strong>&#8804; 3,000-token</strong> chunks with deterministic IDs, scored, and folded into three trees: <strong>source</strong> (one per source, L0 &#8594; L1 &#8594; L2 cascade), <strong>topic</strong> (one per high-hotness entity), <strong>global</strong> (one node per UTC day). Three background workers run heavy work behind a semaphore. A daily scheduler at 00:00 UTC enqueues the global digest and stale-flush.</p><p>Storage layout:</p><ul><li><p><em><strong>&lt;workspace&gt;/memory_tree/chunks.db</strong></em>: SQLite, holds chunks, scores, summaries, entity index, jobs, hotness.</p></li><li><p><em><strong>&lt;workspace&gt;/wiki/</strong></em>: the Obsidian-compatible Markdown vault.</p></li></ul><p>&lt;workspace&gt; defaults to <em><strong>~/.openhuman</strong></em>, overridable with <em><strong>OPENHUMAN_WORKSPACE</strong></em>. The vault is the point: you can open it in Obsidian, edit a wrong line, and the next retrieval is correct.</p><h3>4.2 Auto-fetch on a 20-minute tick</h3><p>The constant lives in <em><strong>src/openhuman/composio/periodic.rs</strong></em>:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;rust&quot;,&quot;nodeId&quot;:&quot;edce7c5b-53cd-47c8-8bfe-3df783a736a9&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-rust">const TICK_SECONDS: u64 = 1200;</code></pre></div><p>One global tick walks every active connection. Per-(toolkit, connection_id) state holds the cursor, last-sync, dedup set, and daily budget. 
Errors are swallowed so the loop never panics out.</p><p>The native registry today (src/openhuman/composio/providers/registry.rs::init_default_providers):</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;rust&quot;,&quot;nodeId&quot;:&quot;04c631e6-5624-40bf-a3a4-4457f6f6e46b&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-rust">register_provider(Arc::new(super::gmail::GmailProvider::new()));

register_provider(Arc::new(super::notion::NotionProvider::new()));

register_provider(Arc::new(super::slack::SlackProvider::new()));</code></pre></div><p>Auto-ingest covers only <strong>Gmail, Notion, and Slack</strong>. The 118+ figure is the reach of the Composio catalog, exposed as proxied tools. Tool calls work across the whole catalog; memory ingest fires only on those three.</p><h3>4.3 Integrations (native vs proxied)</h3><p>Every connected service surfaces four ways: <strong>agent tool</strong>, <strong>memory source</strong> (native only), <strong>profile signal</strong>, <strong>trigger source</strong> (HMAC-verified webhooks).</p><p>23 toolkits ship as curated proxied tool sets, without auto-ingest for now:</p><p>Shopify, Stripe, HubSpot, Salesforce, Airtable, Figma, GoogleCalendar, GoogleDrive, GoogleDocs, GoogleSheets, Discord, Telegram, WhatsApp, Microsoft Teams, Outlook, Linear, Jira, Trello, Asana, Dropbox, Twitter, Spotify, YouTube.</p><p>Three channels talk back: <strong>Telegram</strong> (two-way, 80+ actions), <strong>Discord</strong> (send/receive), <strong>Web</strong> (in-app local chat).</p><h3>4.4 TokenJuice</h3><p>A Rust port of vincentkoc/tokenjuice in the tool-execution path. Every tool result hits a rule overlay before it reaches an LLM. Three layers; later layers override earlier ones:</p><ul><li><p><strong>Builtin</strong> (shipped with the binary): git, npm, cargo, docker, kubectl, ls defaults.</p></li><li><p><strong>User</strong> (<em><strong>~/.config/tokenjuice/rules/</strong></em>): personal overrides across every project.</p></li><li><p><strong>Project</strong> (<em><strong>.tokenjuice/rules/</strong></em>): repo-specific overrides, checked in.</p></li></ul><p>HTML &#8594; Markdown. Long URLs shortened. CJK and emoji preserved grapheme-by-grapheme. Project claim: <strong>up to 80%</strong> reduction in cost and latency. PrimeAIcenter measured ~<strong>70%</strong>. Realistic range: 70&#8211;80%, with the biggest wins on log-heavy and HTML-heavy payloads.</p><h3>4.5 Model routing</h3><p>One subscription brokers 30+ providers. 
The agent loop emits a <em><strong>hint:</strong></em> prefix per task. The router resolves it.</p><p>Top five hints:</p><ul><li><p><em><strong>hint:reasoning</strong></em>: strong reasoning model.</p></li><li><p><em><strong>hint:fast</strong></em>: fast / cheap model.</p></li><li><p><em><strong>hint:vision</strong></em>: vision-capable model.</p></li><li><p><em><strong>hint:summarize</strong></em>: compression model.</p></li><li><p><em><strong>hint:code</strong></em>: code-tuned model.</p></li></ul><p>Lighter hints (<em><strong>hint:reaction</strong></em>, <em><strong>hint:classify</strong></em>, <em><strong>hint:sentiment</strong></em>, <em><strong>hint:medium</strong></em>, <em><strong>hint:tool_lite</strong></em>) prefer the local provider when Local AI is on. Heavy hints stay cloud. The task picks the model, not the user. Full hint reference in the appendix.</p><h3>4.6 Local AI (optional, off by default)</h3><p>Two flags switch it on:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;d615917b-40e6-4dd7-a1ac-dccebd5e2c74&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">local_ai.runtime_enabled = true

local_ai.opt_in_confirmed = true</code></pre></div><p>Defaults: <em><strong>all-minilm:latest</strong></em> (~23 MB) for embeddings, <em><strong>gemma3:1b-it-qat</strong></em> (~700 MB) for summary-tree building. Heartbeat, learning, and subconscious can also move on-device. Chat, vision, STT, TTS, and web search stay cloud. &#8220;Local-first&#8221; means memory, not the whole stack.</p><h3>4.7 Voice, mascot, and the Meet agent</h3><p>The mascot lip-syncs to TTS and shifts through six mood states (idle, thinking, listening, talking, surprised, dreaming). The Meet agent joins Google Meet through an embedded CEF webview as a real participant: a name, a face, a tile in the grid.</p><p>Mid-meeting it can:</p><ul><li><p><strong>Listen.</strong> Inbound audio streams through STT, diarized per speaker, into the Memory Tree live.</p></li><li><p><strong>Speak.</strong> Replies stream into the call as an outbound mic feed, not bounced through your speakers.</p></li><li><p><strong>Animate.</strong> The mascot canvas is piped as the outbound camera, lip-synced to the TTS audio everyone else hears.</p></li><li><p><strong>Use tools.</strong> Memory recall, auto-fetch, native tools, subconscious outputs, all reachable mid-call.</p></li></ul><p>v0.54.0 added fully-local STT and TTS via Whisper and Piper. Prior path required ElevenLabs cloud for TTS.</p><h3>4.8 Subconscious loop</h3><p>A background tick (default 5 min, minimum 5 min) that loads due tasks, builds a situation report from memory plus workspace state, and returns one of three decisions:</p><ul><li><p><strong>Skip.</strong> Nothing relevant right now.</p></li><li><p><strong>Act.</strong> Execute the task.</p></li><li><p><strong>Escalate.</strong> Hand off to the cloud agent.</p></li></ul><p>Local model evaluates, cloud agent escalates. Write tasks you asked for need no approval. 
Unsolicited writes open an approval card.</p><h3>4.9 Skills (in transition)</h3><p>The QuickJS / rquickjs runtime that executed skill packages was removed. Today&#8217;s skills surface is metadata-only:</p><ul><li><p>Discover and parse SKILL.md files.</p></li><li><p>Resolve scope (User / Project / Legacy) and trust markers.</p></li><li><p>Install from URL (HTTPS only, no private hosts, <em><strong>.md</strong></em> only, max 1 MiB body).</p></li><li><p>Read resources (cap 128 KiB), uninstall, per-turn prompt injection (cap 8 KiB).</p></li></ul><p>Registry repo: <em><strong>tinyhumansai/openhuman-skills</strong></em>. Runtime is being rebuilt. Treat skills as catalog + prompt-injection today, not a third-party plugin runtime.</p><div><hr></div><h2>What v0.54.0 shipped</h2><p>Released 2026-05-19 07:17 UTC, the morning after the Product Hunt #1 post. 230 commits, 1,271 files changed against v0.53.43 six days earlier.</p><p><strong>Voice.</strong> Fully-local STT and TTS via Whisper / Piper. Configurable mascot voice with ElevenLabs picker.</p><p><strong>Memory.</strong> Optional <em><strong>agentmemory</strong></em> backend bridge: set <em><strong>memory.backend = &#8220;agentmemory&#8221;</strong></em> in <em><strong>config.toml</strong></em> and the Memory Tree shares a durable store with Claude Code, Cursor, Codex, and OpenCode. Plus MCP stdio memory server, NotebookLM-style folder ingestion, cross-chat context retrieval, per-(row, model) embedding storage.</p><p><strong>Agents.</strong> Dedicated <em><strong>crypto_agent</strong></em> for wallet and market ops (#1397). Cursor Cloud Agents parallel workflow. Global tool registry. Task board CRUD. Gmail Unsubscribe Agent.</p><p><strong>Integrations.</strong> Bring-your-own Composio direct mode (#1710). Seltz as direct-API search. Discord webview transcript ingestion. WeChat embedded webview.</p><p><strong>Providers.</strong> Unified per-workload routing. LM Studio local provider support. 
New <em><strong>reasoning-quick-v1</strong></em> route for low-latency chat.</p><p><strong>UX and i18n.</strong> Onboarding redesign. Dark mode standardization. ~3,900 strings across nine locales. Italian and Indonesian added.</p><p><strong>Security.</strong> Path traversal prevention in agent prompt loading. DNS-aware URL validation. Audio base64 max-size check. Self-repair for locked <em><strong>.secret_key</strong></em> on Windows. Linux AppImage NSS fix.</p><p>Local main is already past v0.54.0 at 0.54.2. Post-release: Polymarket integration (#2145), explicit user-preference tool (#2150), Ollama context-window gating (#2122).</p><div><hr></div><h2>OpenHuman vs Hermes Agent vs OpenClaw</h2><h3>6.1 The README&#8217;s own framing</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9Pa_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebb8377-bbb2-4e44-b137-7ac305859e15_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9Pa_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebb8377-bbb2-4e44-b137-7ac305859e15_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!9Pa_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebb8377-bbb2-4e44-b137-7ac305859e15_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!9Pa_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebb8377-bbb2-4e44-b137-7ac305859e15_1672x941.png 1272w, 
https://substackcdn.com/image/fetch/$s_!9Pa_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebb8377-bbb2-4e44-b137-7ac305859e15_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9Pa_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebb8377-bbb2-4e44-b137-7ac305859e15_1672x941.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bebb8377-bbb2-4e44-b137-7ac305859e15_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9Pa_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebb8377-bbb2-4e44-b137-7ac305859e15_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!9Pa_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebb8377-bbb2-4e44-b137-7ac305859e15_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!9Pa_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebb8377-bbb2-4e44-b137-7ac305859e15_1672x941.png 1272w, 
https://substackcdn.com/image/fetch/$s_!9Pa_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebb8377-bbb2-4e44-b137-7ac305859e15_1672x941.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>6.2 What independent reviewers found</h3><p>The May 2026 reviews land in roughly the same spot.</p><p><strong>PrimeAIcenter</strong> (five-day test, Gmail + Notion + GitHub + Calendar): Memory Tree useful after three days, TokenJuice measured ~70% (not 80%), two sync failures in the window. 
Read on the Meet mascot: sounds gimmicky, isn&#8217;t.</p><p><strong>TechTimes</strong> (May 16): framed OpenHuman as inverting the playbook, flagged three risks, piped-shell install as a supply-chain vector, OAuth aggregation across email, code, calendar, and payments, and no formal independent audit.</p><p><strong>Julian Goldie</strong> (three-agent SEO test): OpenHuman wins on UI, setup, and voice. Hermes wins on long autonomous tasks. OpenClaw wins on scheduling.</p><p><strong>HackerNoon</strong> (OpenClaw context): 138+ disclosed CVEs and 341 malicious ClawHub skills out of 2,857 scanned as of May 2026. Hermes: 3 CVEs. OpenHuman: no published CVEs, no skill marketplace today.</p><h3>6.3 Pick which one</h3><p>Three different shapes of the same problem. Pick by use case:</p><ul><li><p><strong>Pick OpenHuman</strong> if you want a desktop agent that reads your email, calendar, Slack, and Notion within minutes, plus a memory you can open as Markdown.</p></li><li><p><strong>Pick Hermes Agent</strong> if you want an always-on server-side agent with a self-improving learning loop and six terminal backends (local, Docker, SSH, Daytona, Singularity, Modal) for long-running autonomous workflows.</p></li><li><p><strong>Pick OpenClaw</strong> if you want a channel-heavy gateway agent across 20+ messaging surfaces with Git-versioned config. Audit the marketplace and skill set carefully given the recent CVE and ClawHub reporting.</p></li></ul><div><hr></div><h2>How to Get Started</h2><p>Here is the command you came for:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;3d2e362f-7dd5-4679-8fae-dd76427f0d6f&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">curl -fsSL https://raw.githubusercontent.com/tinyhumansai/openhuman/main/scripts/install.sh | bash</code></pre></div><p>That single line installs OpenHuman on macOS or Linux. The full hint table, Local AI presets, native vs proxied list, and troubleshooting all live in the <strong>Reference Appendix</strong> at the end of this piece.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KaRy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7b4e40f-cd11-4a54-8011-ca6665a70d51_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KaRy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7b4e40f-cd11-4a54-8011-ca6665a70d51_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!KaRy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7b4e40f-cd11-4a54-8011-ca6665a70d51_1672x941.png 848w, 
https://substackcdn.com/image/fetch/$s_!KaRy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7b4e40f-cd11-4a54-8011-ca6665a70d51_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!KaRy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7b4e40f-cd11-4a54-8011-ca6665a70d51_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KaRy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7b4e40f-cd11-4a54-8011-ca6665a70d51_1672x941.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c7b4e40f-cd11-4a54-8011-ca6665a70d51_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KaRy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7b4e40f-cd11-4a54-8011-ca6665a70d51_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!KaRy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7b4e40f-cd11-4a54-8011-ca6665a70d51_1672x941.png 848w, 
https://substackcdn.com/image/fetch/$s_!KaRy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7b4e40f-cd11-4a54-8011-ca6665a70d51_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!KaRy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7b4e40f-cd11-4a54-8011-ca6665a70d51_1672x941.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>7.1 Pick an install path</h3><p>Three options. 
Pick by tradeoff.</p><p><strong>Option A: Download a signed binary (recommended for non-developers).</strong></p><p>Go to <em><strong>tinyhumans.ai/openhuman</strong></em> and pick DMG (macOS), MSI or EXE (Windows), or AppImage / .deb (Linux). This avoids the piped-shell install path that TechTimes flagged as a supply-chain risk.</p><p><strong>Option B: One-line installer (recommended for the demo path).</strong></p><p>macOS / Linux x64:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;f710b75a-795a-4a0a-a955-26c6e22e52b7&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">curl -fsSL https://raw.githubusercontent.com/tinyhumansai/openhuman/main/scripts/install.sh | bash</code></pre></div><p>Windows PowerShell:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;shell&quot;,&quot;nodeId&quot;:&quot;47242c62-50fb-479e-8eef-9e37c5889ac0&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-shell">irm https://raw.githubusercontent.com/tinyhumansai/openhuman/main/scripts/install.ps1 | iex</code></pre></div><p><strong>Option C: Build from source (for contributors).</strong></p><p>Prerequisites: Git, Node.js 24+, pnpm 10.10.0, Rust 1.93.0 + <em><strong>rustfmt</strong></em> + <em><strong>clippy</strong></em>, CMake, Ninja, ripgrep, plus your platform&#8217;s desktop build prerequisites.</p><h3>7.2 First-run flow</h3><p>Six steps. None of them requires a terminal after step 1.</p><ol><li><p>Launch the app. Sign in.</p></li><li><p>Click <strong>Connect</strong> on <strong>Gmail</strong>, <strong>Notion</strong>, or <strong>Slack</strong>. 
These are the three native providers that auto-ingest into the Memory Tree today.</p></li><li><p>Wait one auto-fetch tick (up to 20 minutes) or trigger a manual ingest from the <strong>Intelligence</strong> tab in the app.</p></li><li><p>Open the Obsidian vault at <em><strong>~/.openhuman/wiki/</strong></em> via the in-app deep link or any file browser.</p></li><li><p>Ask a context-heavy prompt: <em>&#8220;What did I commit to in email last week?&#8221;</em></p></li><li><p>Open a chunk file in the vault before you trust the answer. The whole point is that the memory is readable. If something is wrong, fix the file, and the next retrieval reads the corrected text.</p></li></ol><p>That is the golden path. Five minutes of hands-on time, up to twenty minutes of wall-clock time if you wait for the first sync tick.</p><h3>7.3 (Optional) Turn on Local AI</h3><p>Local AI is off by default. Two flags in <em><strong>config.toml</strong></em> switch it on:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;markdown&quot;,&quot;nodeId&quot;:&quot;ed195787-cbde-49a0-90b2-3b77d080432a&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-markdown">local_ai.runtime_enabled = true

local_ai.opt_in_confirmed = true</code></pre></div><p>Then go to <strong>Settings &#8594; AI &amp; Skills &#8594; Local AI</strong> and pick one of three presets:</p><ul><li><p><strong>Embeddings only.</strong> <em><strong>all-minilm:latest</strong></em> (~23 MB) for memory embeddings.</p></li><li><p><strong>Memory + reflection.</strong> Embeddings + summary-tree building (<em><strong>gemma3:1b-it-qat</strong></em>, ~700 MB) + learning.</p></li><li><p><strong>Everything local.</strong> All five workloads: embeddings, summary, heartbeat, learning, subconscious.</p></li></ul><p>Hardware floor: 8 GB RAM minimum, 16 GB+ ideal. LM Studio is supported as the alternative provider with default base URL <em><strong>http://localhost:1234/v1</strong></em>.</p><p>Skip Local AI if you only have a few sources connected. The cloud path is faster and the privacy benefit is small in that case.</p><h3>7.4 (Optional) Send the mascot into a Meet</h3><p>Hand the mascot a Google Meet link from the desktop app.</p><p>It opens the embedded Meet webview, joins with the configured display name, switches its tile to the mascot canvas, and is now in the participant grid. The mic is the TTS injection stream (not your real microphone). The camera is the mascot frame producer (not your real webcam). Mute the agent&#8217;s mic from the app the same way you would mute yourself in Meet.</p><p>Required OS permissions: Camera, Microphone. 
On macOS, also Accessibility and Input Monitoring for desktop hotkeys.</p><h3>7.5 (Optional) Bridge to agentmemory across other coding agents</h3><p>If you already self-host <em><strong>agentmemory</strong></em> for Claude Code, Cursor, Codex, or OpenCode, set this in <em><strong>config.toml</strong></em>:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;markdown&quot;,&quot;nodeId&quot;:&quot;7ad69043-0f35-4feb-9203-d5307e312c38&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-markdown">[memory]
backend = "agentmemory"</code></pre></div><p>OpenHuman&#8217;s Memory Tree now proxies to that store. The same durable memory serves OpenHuman alongside the other four agents. This is new in v0.54.0.</p><h3>7.6 Build from source (contributors)</h3><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;6ce6f146-327d-429d-be78-8dcff2e724ad&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">git clone https://github.com/tinyhumansai/openhuman.git

cd openhuman

git submodule update --init --recursive
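
# Optional pre-flight check: a sketch, not part of the project's documented steps.
# check_tools is a hypothetical helper; the binary names are assumptions
# (rg for ripgrep, ninja for Ninja). It prints the path of each tool it finds
# and "missing: NAME" otherwise. Version checks are left to you.
check_tools() { for tool in "$@"; do command -v "$tool" || echo "missing: $tool"; done; }

check_tools git node pnpm rustc cargo cmake ninja rg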

pnpm install

pnpm dev            # Vite dev server only (web UI)

pnpm dev:app        # Full Tauri desktop dev</code></pre></div><p>Quality gates before opening a PR: <em><strong>pnpm typecheck</strong></em>, <em><strong>pnpm lint</strong></em>, <em><strong>pnpm format:check</strong></em>, <em><strong>cargo check -p openhuman --lib</strong></em>, <em><strong>pnpm test</strong></em>, <em><strong>pnpm test:rust</strong></em>. PRs need at least 80% coverage on changed lines or the merge gate blocks them.</p><h3>7.7 Save this</h3><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;494783c1-c633-499a-8bcb-69e1c0e784af&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">curl -fsSL https://raw.githubusercontent.com/tinyhumansai/openhuman/main/scripts/install.sh | bash</code></pre></div><p>Save the command. The full <strong>Reference Appendix</strong> below covers workspace structure, every model routing hint, the Local AI presets matrix, the native vs proxied integration list, and a troubleshooting table.</p><div><hr></div><h2>Current Limitations</h2><p><strong>Auto-fetch covers three integrations today.</strong> The 118+ headline is the Composio catalog reach, not the count of toolkits that auto-ingest into the Memory Tree. Today&#8217;s native providers: Gmail, Notion, Slack.</p><p><strong>Local-first is not fully local.</strong> Memory Tree and the Obsidian vault are local. Default chat, vision, web search, integration OAuth proxying, and TTS streaming all go through the OpenHuman backend.</p><p><strong>Skill execution is being rebuilt.</strong> The QuickJS runtime is gone. Today&#8217;s skills surface is metadata-only (discover, parse, install, uninstall, prompt injection). No executable third-party skill packages today.</p><p><strong>80% token compression is a project claim.</strong> PrimeAIcenter measured around 70% in their five-day independent test. 
Realistic range: 70 to 80%, biggest wins on log-heavy or HTML-heavy payloads.</p><p><strong>No published independent security audit.</strong> v0.54.0 adds meaningful hardening (path traversal prevention, DNS rebinding guard, bearer token redaction, Windows ACL self-repair), but the OAuth surface across email, code, calendar, and payments is wide and concentrated.</p><p>So the best recommendation is to install it this week if you are evaluating personal-agent UX or memory ingestion patterns, treat it as active beta for anything production-adjacent, and revisit auto-fetch coverage and the skill runtime in 60 days.</p><div><hr></div><h2>AlphaSignal Take</h2><p><strong>Verdict: Worth Watching.</strong></p><p>The Memory Tree, the Obsidian vault, the 20-minute auto-fetch, and the v0.54.0 <em><strong>agentmemory</strong></em> bridge are real and source-verifiable. The caveats: only Gmail, Notion, and Slack auto-ingest today (not 118), and &#8220;local-first&#8221; applies to memory, not to LLM calls or OAuth proxying.</p><p>Not production-ready. 148 open issues, no published security audit, a skill runtime in transition, beta-grade onboarding bugs. What would change the verdict: native sync providers across the top ten integrations, a published audit, public pricing, and a stable skill runtime. Watch the <strong>v0.55</strong> release.</p>
<div><hr></div><p>If OpenHuman is the third entrant after OpenClaw and Hermes Agent, which one are you actually running right now, and what made you pick it?</p><div><hr></div><h2>Reference Appendix</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0EN6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1054b4f-4798-4963-a287-7ee011a4b378_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0EN6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1054b4f-4798-4963-a287-7ee011a4b378_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!0EN6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1054b4f-4798-4963-a287-7ee011a4b378_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!0EN6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1054b4f-4798-4963-a287-7ee011a4b378_1672x941.png 1272w, 
https://substackcdn.com/image/fetch/$s_!0EN6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1054b4f-4798-4963-a287-7ee011a4b378_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0EN6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1054b4f-4798-4963-a287-7ee011a4b378_1672x941.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a1054b4f-4798-4963-a287-7ee011a4b378_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0EN6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1054b4f-4798-4963-a287-7ee011a4b378_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!0EN6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1054b4f-4798-4963-a287-7ee011a4b378_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!0EN6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1054b4f-4798-4963-a287-7ee011a4b378_1672x941.png 1272w, 
https://substackcdn.com/image/fetch/$s_!0EN6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1054b4f-4798-4963-a287-7ee011a4b378_1672x941.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>13.1 System requirements</h3><p><strong>RAM.</strong> 4 GB minimum. 8 GB recommended with Local AI. 16 GB+ for large mailboxes or full local AI.</p><p><strong>Disk.</strong> ~500 MB for the binary. 
+23 MB for <em><strong>all-minilm</strong></em>, +700 MB for <em><strong>gemma3:1b-it-qat</strong></em>, several GB more for full mailbox memory.</p><p><strong>OS.</strong> macOS 12+ (Apple Silicon or Intel), Windows 10+ (x64 or ARM64), Linux x64 with libssl.</p><p><strong>macOS permissions.</strong> Camera and Microphone for the Meet agent. Accessibility and Input Monitoring for desktop hotkeys.</p><p><strong>Build-from-source toolchain.</strong> Node.js 24+, pnpm 10.10.0, Rust 1.93.0 + <em><strong>rustfmt</strong></em> + <em><strong>clippy</strong></em>, CMake, Ninja, ripgrep.</p><h3>13.2 Workspace structure</h3><ul><li><p><em><strong>&lt;workspace&gt;/memory_tree/chunks.db</strong></em>: SQLite, holds chunks, scores, summaries, entity index, jobs, hotness.</p></li><li><p><em><strong>&lt;workspace&gt;/wiki/</strong></em>: Obsidian-compatible Markdown vault.</p></li><li><p><em><strong>~/.config/tokenjuice/rules/</strong></em>: User-level TokenJuice rule overrides.</p></li><li><p><em><strong>.tokenjuice/rules/</strong></em>: Project-level TokenJuice rule overrides.</p></li></ul><p>Defaults: <em><strong>&lt;workspace&gt;</strong></em> is <em><strong>~/.openhuman</strong></em>. Override with <em><strong>OPENHUMAN_WORKSPACE</strong></em>.</p><h3>13.3 Memory Tree internals</h3><p><strong>Leaf lifecycle.</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;08797a12-9eeb-4632-83c2-16ebdaf7b312&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">pending_extraction &#8594; admitted &#8594; buffered &#8594; sealed
                       \
                        &#8594; dropped</code></pre></div><p>The deep score decides <em><strong>admitted</strong></em> vs <em><strong>dropped</strong></em>. Admitted leaves enter a buffer (<em><strong>buffered</strong></em>). When the buffer seals, every leaf inside flips to <em><strong>sealed</strong></em>. Dropped leaves stop, the chunk row stays for provenance.</p><p><strong>Job queue kinds.</strong></p><ul><li><p><em><strong>extract_chunk</strong></em>: Deep score + entity extraction. Decides admitted vs dropped.</p></li><li><p><em><strong>append_buffer</strong></em>: Adds an admitted leaf to the source (or topic) L0 buffer. May trigger a seal.</p></li><li><p><em><strong>seal</strong></em>: Compresses an L0 buffer into an L1 summary. Cascades up if the parent is full.</p></li><li><p><em><strong>topic_route</strong></em>: Routes a leaf into per-entity topic trees, gated by a hotness check.</p></li><li><p><em><strong>digest_daily</strong></em>: Builds the global daily digest node.</p></li><li><p><em><strong>flush_stale</strong></em>: Force-seals buffers that have been sitting too long.</p></li></ul><p>Three background workers pick jobs. Semaphore caps concurrent LLM-bound calls. On startup, any job whose worker lease has expired (crash, kill) returns to the queue.</p><h3>13.4 Model routing hints</h3><ul><li><p><em><strong>hint:reasoning</strong></em>: strong reasoning model. Multi-step planning, math, code-heavy turns.</p></li><li><p><em><strong>hint:fast</strong></em>: fast / cheap model. UI helpers, autocompletes, small classification.</p></li><li><p><em><strong>hint:vision</strong></em>: vision-capable model. Screenshots, image attachments, OCR.</p></li><li><p><em><strong>hint:summarize</strong></em>: compression model. Memory tree summary builders.</p></li><li><p><em><strong>hint:code</strong></em>: code-tuned model. Native coder turns.</p></li><li><p><em><strong>hint:reaction</strong></em>: lightweight model. 
Quick reactions.</p></li><li><p><em><strong>hint:classify</strong></em>: lightweight model. Classification tasks.</p></li><li><p><em><strong>hint:sentiment</strong></em>: lightweight model. Sentiment analysis.</p></li><li><p><em><strong>hint:medium</strong></em>: medium model. Medium-complexity tasks.</p></li><li><p><em><strong>hint:tool_lite</strong></em>: lightweight model. Lightweight tool calls.</p></li></ul><p>Override globally in <em><strong>config.toml</strong></em>. Override per call by passing a concrete model name (no <em><strong>hint:</strong></em> prefix). Override per skill via the manifest.</p><h3>13.5 Local AI presets</h3><ul><li><p><strong>Embeddings only.</strong> Memory embeddings run locally. Everything else stays in the cloud.</p></li><li><p><strong>Memory + reflection.</strong> Embeddings, summary-tree building, and learning passes run locally. Heartbeat and subconscious stay in the cloud.</p></li><li><p><strong>Everything local.</strong> All five workloads run locally: embeddings, summary-tree, heartbeat, learning, subconscious.</p></li></ul><p>Models: <em><strong>all-minilm:latest</strong></em> (~23 MB) for embeddings, <em><strong>gemma3:1b-it-qat</strong></em> (~700 MB) for summary-tree building. Provider switch: <em><strong>local_ai.provider = "ollama"</strong></em> or <em><strong>local_ai.provider = "lm_studio"</strong></em>. 
Base URL override: <em><strong>local_ai.base_url</strong></em>.</p><h3>13.6 Configuration reference</h3><p><strong>Environment variables.</strong></p><ul><li><p><em><strong>OPENHUMAN_WORKSPACE</strong></em>: override default workspace path (default <em><strong>~/.openhuman</strong></em>).</p></li><li><p><em><strong>OPENHUMAN_CORE_TOKEN</strong></em>: per-launch bearer token for HTTP JSON-RPC to the in-process core.</p></li><li><p><em><strong>OPENHUMAN_CORE_REUSE_EXISTING=1</strong></em>: attach to an externally started <em><strong>openhuman-core</strong></em> process.</p></li><li><p><em><strong>OPENHUMAN_APP_ENV=staging</strong></em>: use the staging workspace at <em><strong>~/.openhuman-staging/</strong></em>.</p></li><li><p><em><strong>OPENHUMAN_LM_STUDIO_BASE_URL</strong></em>: override LM Studio endpoint (default <em><strong>http://localhost:1234/v1</strong></em>).</p></li><li><p><em><strong>LM_STUDIO_BASE_URL</strong></em>: alias for <em><strong>OPENHUMAN_LM_STUDIO_BASE_URL</strong></em>.</p></li><li><p><em><strong>GGML_NATIVE=OFF</strong></em>: disable <em><strong>-mcpu=native</strong></em> when a macOS build fails on it.</p></li><li><p><em><strong>RUST_LOG=openhuman_core::openhuman::tokenjuice=debug</strong></em>: trace TokenJuice rule matches and reductions.</p></li></ul><p><strong>Useful config.toml keys.</strong></p><ul><li><p><em><strong>local_ai.runtime_enabled</strong></em> (default <em><strong>false</strong></em>): master switch for the local provider.</p></li><li><p><em><strong>local_ai.opt_in_confirmed</strong></em> (default <em><strong>false</strong></em>): explicit opt-in. Bootstrap forces it back to <em><strong>false</strong></em> until you opt in again.</p></li><li><p><em><strong>local_ai.provider</strong></em> (default <em><strong>"ollama"</strong></em>): local provider, <em><strong>"ollama"</strong></em> or <em><strong>"lm_studio"</strong></em>.</p></li><li><p><em><strong>local_ai.base_url</strong></em> (unset): provider URL override. 
LM Studio default <em><strong>http://localhost:1234/v1</strong></em>.</p></li><li><p><em><strong>local_ai.usage.embeddings</strong></em> (default <em><strong>false</strong></em>): use local for memory embeddings.</p></li><li><p><em><strong>local_ai.usage.heartbeat</strong></em> (default <em><strong>false</strong></em>): use local for the heartbeat loop.</p></li><li><p><em><strong>local_ai.usage.learning_reflection</strong></em> (default <em><strong>false</strong></em>): use local for learning passes.</p></li><li><p><em><strong>local_ai.usage.subconscious</strong></em> (default <em><strong>false</strong></em>): use local for the subconscious loop.</p></li><li><p><em><strong>memory.backend</strong></em> (default <em><strong>&#8220;memory_tree&#8221;</strong></em>): switch to <em><strong>&#8220;agentmemory&#8221;</strong></em> to proxy to a self-hosted store.</p></li><li><p><em><strong>composio.mode</strong></em> (default <em><strong>&#8220;tinyhumans&#8221;</strong></em>): switch to <em><strong>&#8220;direct&#8221;</strong></em> to use your own Composio v3 tenant.</p></li></ul><h3>13.7 Native vs proxied integrations</h3><p><strong>Native (auto-ingest into Memory Tree).</strong> Gmail, Notion, Slack.</p><p><strong>Curated proxied (callable, no auto-ingest).</strong> Shopify, Stripe, HubSpot, Salesforce, Airtable, Figma, GoogleCalendar, GoogleDrive, GoogleDocs, GoogleSheets, Discord, Telegram, WhatsApp, Microsoft Teams, Outlook, Linear, Jira, Trello, Asana, Dropbox, Twitter, Spotify, YouTube.</p><p><strong>Catalog reach (Composio).</strong> 118+ services via OAuth, a subset of the wider catalog.</p><h3>13.8 Skill install constraints</h3><ul><li><p>Scheme: HTTPS only.</p></li><li><p>Hosts: private and local hosts rejected.</p></li><li><p>URL shape: GitHub blob URLs normalized, path must end in <em><strong>.md</strong></em>.</p></li><li><p>Max URL length: 2,048 chars.</p></li><li><p>Default fetch timeout: 60 seconds.</p></li><li><p>Max fetch timeout: 600 
seconds.</p></li><li><p>Max <em><strong>SKILL.md</strong></em> body: 1 MiB.</p></li><li><p>Per-resource read cap: 128 KiB.</p></li><li><p>Per-turn injection cap: 8 KiB.</p></li><li><p>Writes are atomic.</p></li></ul><p>Source: <em><strong>src/openhuman/skills/ops_install.rs</strong></em>, <em><strong>src/openhuman/skills/inject.rs</strong></em>, <em><strong>src/openhuman/skills/ops.rs</strong></em>.</p><h3>13.9 Troubleshooting</h3><p><strong>First sync feels slow.</strong> Large mailboxes can take hours on first ingest. Watch the Intelligence tab heatmap for progress.</p><p><strong>Locked </strong><em><strong>.secret_key</strong></em><strong> on Windows.</strong> v0.54.0 ships self-repair plus Windows ACL hints. Restart the app.</p><p><strong>Linux AppImage will not launch on Arch or another rolling-release distro.</strong> v0.54.0 excludes the bundled NSS libs. Re-download the AppImage.</p><p><strong>Onboarding stuck after Google auth on Windows.</strong> Tracked as issue #2215, labeled <em><strong>priority: critical</strong></em>. Subscribe to the issue for the fix.</p><p><strong>Sync failure on one integration.</strong> Per-connection state rebuilds on the next tick. A missed periodic sync is harmless.</p><p><strong>macOS build fails with </strong><em><strong>-mcpu=native</strong></em><strong>.</strong> Set <em><strong>GGML_NATIVE=OFF</strong></em> before running <em><strong>cargo check</strong></em> or <em><strong>cargo test</strong></em>.</p><h3>13.10 Build-from-source quickref</h3><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;613cb130-6dae-4bbc-b2db-e2c086a7edbf&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash"># setup

git clone https://github.com/tinyhumansai/openhuman.git

cd openhuman

git submodule update --init --recursive

pnpm install

# dev

pnpm dev            # Vite UI only

pnpm dev:app        # Full Tauri desktop dev

# quality gates

pnpm typecheck

pnpm lint

pnpm format:check
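
# macOS only: if the Rust steps below fail on -mcpu=native, uncomment the next line first (see 13.9)
# export GGML_NATIVE=OFF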

cargo check -p openhuman --lib

pnpm test

pnpm test:rust</code></pre></div><p>Coverage gate on changed lines: &#8805; 80%. CI enforces.</p>]]></content:encoded></item><item><title><![CDATA[RAG and Long Context Aren't Enough for Agent Memory. δ-mem Is a Third Option]]></title><description><![CDATA[An 8&#215;8 online state lifted Qwen3-4B from 46.79% to 51.66%, with the backbone untouched.]]></description><link>https://alphasignalai.substack.com/p/rag-and-long-context-arent-enough</link><guid isPermaLink="false">https://alphasignalai.substack.com/p/rag-and-long-context-arent-enough</guid><dc:creator><![CDATA[AlphaSignal AI]]></dc:creator><pubDate>Wed, 20 May 2026 16:02:22 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!yGVd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700138ef-0ea4-4c34-a9a9-aa5f5aa99607_2048x1152.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yGVd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700138ef-0ea4-4c34-a9a9-aa5f5aa99607_2048x1152.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yGVd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700138ef-0ea4-4c34-a9a9-aa5f5aa99607_2048x1152.png 424w, https://substackcdn.com/image/fetch/$s_!yGVd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700138ef-0ea4-4c34-a9a9-aa5f5aa99607_2048x1152.png 848w, 
https://substackcdn.com/image/fetch/$s_!yGVd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700138ef-0ea4-4c34-a9a9-aa5f5aa99607_2048x1152.png 1272w, https://substackcdn.com/image/fetch/$s_!yGVd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700138ef-0ea4-4c34-a9a9-aa5f5aa99607_2048x1152.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yGVd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700138ef-0ea4-4c34-a9a9-aa5f5aa99607_2048x1152.png" width="725" height="407.8125" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/700138ef-0ea4-4c34-a9a9-aa5f5aa99607_2048x1152.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:725,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!yGVd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700138ef-0ea4-4c34-a9a9-aa5f5aa99607_2048x1152.png 424w, https://substackcdn.com/image/fetch/$s_!yGVd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700138ef-0ea4-4c34-a9a9-aa5f5aa99607_2048x1152.png 848w, 
https://substackcdn.com/image/fetch/$s_!yGVd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700138ef-0ea4-4c34-a9a9-aa5f5aa99607_2048x1152.png 1272w, https://substackcdn.com/image/fetch/$s_!yGVd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700138ef-0ea4-4c34-a9a9-aa5f5aa99607_2048x1152.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><strong>&#948;-mem</strong> stores an LLM&#8217;s conversation history inside an 8&#215;8 matrix and uses it to steer 
attention.</p><p><strong>The backbone</strong> stays frozen. No prompt growth. No backbone fine-tuning.</p><p><strong>On Qwen3-4B-Instruct</strong>, that small matrix lifts the average score across five benchmarks from 46.79% to 51.66%, with 4.87M trainable parameters (0.12% of the model).</p><p><strong>The adapter</strong> is public on Hugging Face under CC-BY-4.0. The arXiv paper landed on May 12, 2026.</p><p>For most agent workloads, RAG is overbuilt and longer context is wasteful. &#948;-mem suggests a third path.</p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/askalphaxiv/status/2055140769762066494&quot;,&quot;full_text&quot;:&quot;&#8220;&#948;-mem: Efficient Online Memory for Large Language Models&#8221;\n\nLLMs need long-term memory, but extending context is expensive and often doesn&#8217;t mean the model actually uses the history well.\n\nWhat this paper did is to store past information in a tiny 8x8 associative memory state, &quot;,&quot;username&quot;:&quot;askalphaxiv&quot;,&quot;name&quot;:&quot;alphaXiv&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1866663567417806848/-Vj32Dq-_normal.jpg&quot;,&quot;date&quot;:&quot;2026-05-15T04:18:36.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/HIVToQIXQAAy2O_.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/hInPEtykcf&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:6,&quot;retweet_count&quot;:36,&quot;like_count&quot;:226,&quot;impression_count&quot;:10121,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:false}" data-component-name="Twitter2ToDOM"></div><h2>What this article covers (~7 min read)</h2>
<p>How &#948;-mem works in four steps, what it moves on five benchmarks, how to load the Qwen3-4B adapter in ten minutes, and where 64 numbers stop being enough. A reference guide for engineers sits at the end as an appendix.</p><h2>Context</h2><p>The research is authored by <strong>Mind Lab</strong> (Soujanya Poria&#8217;s group at NTU, with co-authors from Fudan University, Shanghai Jiao Tong, CUHK, and HKUST-GZ) and titled &#8220;<strong>&#948;-mem: Efficient Online Memory for Large Language Models</strong>.&#8221; Ten authors. arXiv submission on May 12, 2026.</p><p>The repo at <em><strong>declare-lab/delta-Mem</strong></em> has 100+ GitHub stars at time of writing. The Hacker News thread has 230+ points and 50+ comments. The Hugging Face paper page has 110+ upvotes.</p><p>The problem it targets: agents and long-running assistants need to reuse old information, and the three default answers all hit walls.</p><p>If your agent still runs RAG on every turn, you&#8217;re paying token cost for retrieval noise each time. Longer context hits quadratic attention cost and context rot. LoRA-style adapters are static after training and can&#8217;t adapt to a live conversation.</p><p>&#948;-mem proposes another path. It keeps a tiny memory state inside the model, updates it as new tokens arrive, and lets that state shape attention at runtime. 
The backbone weights never move.</p><h2>How &#948;-mem works</h2><p>&#948;-mem runs the same four steps at every token position. The frozen backbone runs its normal attention in parallel.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MBhW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa96fca06-92f3-4ff9-8bc2-874107803560_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MBhW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa96fca06-92f3-4ff9-8bc2-874107803560_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!MBhW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa96fca06-92f3-4ff9-8bc2-874107803560_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!MBhW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa96fca06-92f3-4ff9-8bc2-874107803560_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!MBhW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa96fca06-92f3-4ff9-8bc2-874107803560_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MBhW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa96fca06-92f3-4ff9-8bc2-874107803560_1672x941.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a96fca06-92f3-4ff9-8bc2-874107803560_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MBhW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa96fca06-92f3-4ff9-8bc2-874107803560_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!MBhW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa96fca06-92f3-4ff9-8bc2-874107803560_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!MBhW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa96fca06-92f3-4ff9-8bc2-874107803560_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!MBhW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa96fca06-92f3-4ff9-8bc2-874107803560_1672x941.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Step 1: Project</strong></p><p>At a selected Transformer layer, &#948;-mem takes the current hidden state and projects it into three 8-dimensional vectors: a memory query, a memory key, and a memory value. The query and key go through tanh and L2 normalization. The value is a plain linear projection.</p><p><strong>Step 2: Read</strong></p><p>Multiply the previous 8&#215;8 state by the current memory query. Out comes a small read vector. The state size is fixed, so this read cost is the same whether the conversation has 100 turns or 10,000.</p><p><strong>Step 3: Steer</strong></p><p>The read vector passes through two learned linear maps to produce a query-side correction and an output-side correction, each scaled by &#945;/r (default 2). The corrected query goes into attention. The output-side correction is added after.</p><p>The key difference from LoRA: LoRA&#8217;s low-rank update is fixed after training. 
&#948;-mem&#8217;s correction comes from a state that changes every token, so the same parameters produce different steering effects under different histories.</p><p><strong>Step 4: Write</strong></p><p>After attention, the state updates with a gated delta rule borrowed from Qwen-Next&#8217;s gated retention. Three things happen in one update: keep part of the old state, erase the old prediction along the current key direction, and write the new value along that same direction. Two per-dimension gates (&#946; for writes, &#955; = 1 &#8722; &#946; for retention) control how much to overwrite versus retain.</p><h3>Three write granularities</h3><p>The paper studies three variants of step 4.</p><h2>What it actually moves</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aTr1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1334cbb7-132e-4e7a-a9e8-51a391c7452f_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aTr1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1334cbb7-132e-4e7a-a9e8-51a391c7452f_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!aTr1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1334cbb7-132e-4e7a-a9e8-51a391c7452f_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!aTr1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1334cbb7-132e-4e7a-a9e8-51a391c7452f_1672x941.png 1272w, 
https://substackcdn.com/image/fetch/$s_!aTr1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1334cbb7-132e-4e7a-a9e8-51a391c7452f_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aTr1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1334cbb7-132e-4e7a-a9e8-51a391c7452f_1672x941.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1334cbb7-132e-4e7a-a9e8-51a391c7452f_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aTr1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1334cbb7-132e-4e7a-a9e8-51a391c7452f_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!aTr1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1334cbb7-132e-4e7a-a9e8-51a391c7452f_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!aTr1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1334cbb7-132e-4e7a-a9e8-51a391c7452f_1672x941.png 1272w, 
https://substackcdn.com/image/fetch/$s_!aTr1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1334cbb7-132e-4e7a-a9e8-51a391c7452f_1672x941.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>On Qwen3-4B, an 8&#215;8 matrix beats BM25 RAG by 7.1 points and Context2LoRA by 6.8 points. Same backbone. Same evaluator. 
Headline numbers from Table 1 of the paper:<br>In relative terms: 1.10&#215; the frozen backbone, 1.15&#215; the strongest non-&#948;-mem baseline (Context2LoRA), 1.31&#215; on MemoryAgentBench, 1.20&#215; on LoCoMo.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!b5vv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2b7725d-d9c8-4f8e-9366-6b809f385631_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!b5vv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2b7725d-d9c8-4f8e-9366-6b809f385631_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!b5vv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2b7725d-d9c8-4f8e-9366-6b809f385631_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!b5vv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2b7725d-d9c8-4f8e-9366-6b809f385631_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!b5vv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2b7725d-d9c8-4f8e-9366-6b809f385631_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!b5vv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2b7725d-d9c8-4f8e-9366-6b809f385631_1672x941.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b2b7725d-d9c8-4f8e-9366-6b809f385631_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!b5vv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2b7725d-d9c8-4f8e-9366-6b809f385631_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!b5vv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2b7725d-d9c8-4f8e-9366-6b809f385631_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!b5vv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2b7725d-d9c8-4f8e-9366-6b809f385631_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!b5vv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2b7725d-d9c8-4f8e-9366-6b809f385631_1672x941.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The most under-reported number in the paper: the Test-Time Learning subtask nearly doubles, 26.14 &#8594; 50.50. That&#8217;s the one to watch if you care about agents that learn during a session.</p><p>Cross-backbone, the biggest absolute jump isn&#8217;t on Qwen3-4B or Qwen3-8B. It&#8217;s on SmolLM3-3B, where MSW lifts the average from 26.08 to 36.96, a +10.88 point gain. Qwen3-8B goes from 47.20 to 50.86 with SSW. Smaller models benefit more from MSW because four parallel states reduce interference inside a single state.</p><p>GPU memory matches vanilla inference at every prompt length tested. Decoding throughput is the tradeoff: at 32k prompt and 64-token decode, vanilla runs 22.60 TPS and &#948;-mem TSW runs 13.68 TPS. 
The state read-and-write loop runs every step.</p><h2>How to run it</h2><p>Step-by-step setup below.</p><blockquote><p>For day-two patterns (preloading history, base-vs-&#948;-mem comparison, session save and resume, training your own adapter), see the <strong>How to use it</strong> appendix at the end.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NoX-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16e56acf-e534-4c05-95b9-b65d6e3607d4_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NoX-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16e56acf-e534-4c05-95b9-b65d6e3607d4_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!NoX-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16e56acf-e534-4c05-95b9-b65d6e3607d4_1672x941.png 848w, 
https://substackcdn.com/image/fetch/$s_!NoX-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16e56acf-e534-4c05-95b9-b65d6e3607d4_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!NoX-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16e56acf-e534-4c05-95b9-b65d6e3607d4_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NoX-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16e56acf-e534-4c05-95b9-b65d6e3607d4_1672x941.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/16e56acf-e534-4c05-95b9-b65d6e3607d4_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NoX-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16e56acf-e534-4c05-95b9-b65d6e3607d4_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!NoX-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16e56acf-e534-4c05-95b9-b65d6e3607d4_1672x941.png 848w, 
https://substackcdn.com/image/fetch/$s_!NoX-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16e56acf-e534-4c05-95b9-b65d6e3607d4_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!NoX-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16e56acf-e534-4c05-95b9-b65d6e3607d4_1672x941.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Clone and install:</strong></p><div class="highlighted_code_block" 
data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;6c40d9fd-5f59-42e8-b494-af01cddecad3&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">git clone https://github.com/declare-lab/delta-Mem.git
cd delta-Mem
python -m pip install uv
bash scripts/setup_uv_env.sh
source .venv/bin/activate</code></pre></div><p>You need Python 3.10 or newer, an NVIDIA GPU with CUDA PyTorch, and FlashAttention plus DeepSpeed for training. CPU is not the target path.</p><p><strong>Download the adapter and load it in Python:</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;bcc151fc-38aa-4a7c-bd64-bff439561591&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from deltamem.core import HFDeltaMemConfig, attach_delta_mem, load_delta_mem_adapter

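base_model = "Qwen/Qwen3-4B-Instruct-2507"
# Local directory that already holds the downloaded adapter files.
adapter_dir = "./delta-mem_qwen3_4b-instruct"

# Load the frozen backbone in bfloat16; device_map="auto" places it on the available device(s).
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model, torch_dtype=torch.bfloat16, device_map="auto",
)

# Read the adapter config, attach the delta-mem modules, then load the adapter weights.
config = HFDeltaMemConfig.from_pretrained(adapter_dir)
attach_delta_mem(model, config)
load_delta_mem_adapter(model, adapter_dir)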
model.eval()</code></pre></div><p>Important: &#948;-mem is not a standard PEFT adapter. Do not load it with <em><strong>PeftModel</strong></em>. Do not call <em><strong>merge_and_unload()</strong></em>. The runtime read/write path is part of model execution.</p><p><strong>Run the chat demo:</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;85fce5c6-0de4-4a84-a374-fc49b65226ec&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">MODEL_PATH=/path/to/Qwen3-4B-Instruct-2507 \
ADAPTER_DIR=/path/to/delta-mem_qwen3_4b-instruct \
bash deltamem/demo/run_chat_demo.sh</code></pre></div><p>Inside the demo, <em><strong>/reset</strong></em> clears the state, <em><strong>/stats</strong></em> prints state statistics, and <em><strong>/save_session &lt;dir&gt;</strong></em> plus <em><strong>/load_session &lt;dir&gt;</strong></em> checkpoint the state to disk.</p><p>Once the chat demo loads, the <strong>How to use it</strong> appendix covers the engineering patterns most teams need next.</p><h2>Current Limitations</h2><p><strong>Adapter coverage.</strong> The only public adapter today is Qwen3-4B-Instruct TSW. The Qwen3-8B and SmolLM3-3B variants need to be retrained from the repo. That means 8&#215; A800 GPUs in the paper&#8217;s recipe and a memory-heavy training set.</p><p><strong>Decoding overhead.</strong> &#948;-mem TSW runs about 40% slower than the base model at 32k prompt and 64-token decode (13.68 TPS versus 22.60 TPS). The state read-and-write loop runs at every step. GPU memory stays flat.</p><p><strong>Context recovery is partial.</strong> When the original context is removed and only the compressed state is injected, HotpotQA EM goes from 0.08% to 6.48%. That&#8217;s real signal, not full recall. The state cannot replace explicit context for retrieval-style tasks.</p><p><strong>Not standard PEFT.</strong> The adapter requires a custom runtime path. Standard shortcuts like <em><strong>PeftModel.from_pretrained</strong></em> and <em><strong>merge_and_unload()</strong></em> will not work. CPU-only inference is not supported.</p><h2>AlphaSignal Take</h2><p>The sharpest critique on Hacker News (236 points, 59 comments) was a capacity question. One commenter put it plainly: <em>&#8220;This doesn&#8217;t solve the capacity problem of memory...there&#8217;s a fundamental limit on how much information can go into it.&#8221;</em></p><p>The paper&#8217;s answer is partial. 
The no-context recall ablation shows the 8&#215;8 state carries usable signal (HotpotQA EM rises from 0.08% to 6.48%, LoCoMo from 3.49 to 8.05). But the absolute floor is low.</p><p>This is the line most engineers will miss: an 8&#215;8 state is a steering signal, not a fact store. Treat it like one or it&#8217;ll bite you.</p><p>The right pattern is to pair &#948;-mem with retrieval. Use the state to steer the model on what the user has been talking about. Keep exact facts, policies, and audit trails in a search index or vector store. So the best recommendation is to prototype with the released Qwen3-4B adapter and treat the state as an attention bias, not a fact database.</p><p>The next thing to watch is a <strong>Qwen3-8B &#948;-mem adapter</strong> on Hugging Face. The cross-backbone results already show SSW wins on 8B, and an official adapter at that scale would double the practical surface area of &#948;-mem overnight.</p><h2>Who benefits and who doesn&#8217;t</h2><p>This is for ML engineers prototyping memory-heavy agents, researchers comparing latent memory against RAG and LoRA, and developers running Qwen3-4B-class models on a single GPU who want history-conditioned behavior without growing the prompt.</p><p>It is not for teams that need auditable retrieval (citations, deletion, exact match), teams without NVIDIA GPUs, teams whose conversations fit comfortably inside a long-context window, or teams that need a larger-model adapter today.</p><h2>Practitioner Implication</h2><p>Most teams will not train this from scratch. 
The real test is whether the released Qwen3-4B adapter beats your current RAG baseline by Friday, without growing the context window or fine-tuning the backbone.</p><h2>Links</h2><ul><li><p><a href="https://arxiv.org/abs/2605.12357">arXiv paper</a> (paper, ~25 min read)</p></li><li><p><a href="https://github.com/declare-lab/delta-Mem">GitHub repo</a> (repo, ~10 min setup)</p></li><li><p><a href="https://huggingface.co/declare-lab/delta-mem_qwen3_4b-instruct">Hugging Face adapter</a> (Qwen3-4B TSW)</p></li></ul><p>Follow @AlphaSignalAI for more content like this.</p><p>Subscribe at <a href="https://alphasignal.ai/">AlphaSignal.ai</a> for daily AI signals. Read by 300,000+ subscribers.</p><h2>Questions?</h2><p><strong>Q: What is &#948;-mem?</strong> A: A memory mechanism that adds a small 8&#215;8 matrix alongside a frozen LLM and uses its readout to inject low-rank corrections into the model&#8217;s attention. It stores history as a latent state, not as retrieved text.</p><p><strong>Q: Does &#948;-mem replace RAG?</strong> A: No. &#948;-mem does not retrieve documents and cannot produce citations. The cleanest stack for an agent pairs &#948;-mem with a retrieval index. The state handles steering. 
The index handles exact recall.</p><p><strong>Q: How big is &#948;-mem&#8217;s memory state?</strong> A: A single 8&#215;8 matrix by default (rank 8, 64 entries). The MSW variant keeps 4 parallel 8&#215;8 states. Trainable parameter cost is 4.87M for TSW or SSW (0.12% of the Qwen3-4B backbone), 19.47M for MSW.</p><p><strong>Q: Can developers use &#948;-mem today?</strong> A: Yes, on Qwen3-4B-Instruct. The official Qwen3-4B TSW adapter is on Hugging Face under CC-BY-4.0. The repo targets NVIDIA GPUs with bfloat16, FlashAttention, and DeepSpeed.</p><p><strong>Q: What are the main limitations?</strong> A: Public adapter coverage is narrow (Qwen3-4B TSW only), decoding is about 40% slower than the base model at 32k context, the adapter is not standard PEFT, and no-context recall is still low in absolute terms.</p><div><hr></div><p>Where do you put memory in your agent stack today? Prompt, vector store, fine-tune, or something else? Which one would you swap out for &#948;-mem first?</p><div><hr></div><h2>Appendix: how to use it</h2><p>A reference guide for teams that already cleared the chat demo. 
Skim once, return to specific sections as needed.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ce9p!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02a43699-31c8-4250-bb68-11c058076879_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ce9p!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02a43699-31c8-4250-bb68-11c058076879_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!ce9p!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02a43699-31c8-4250-bb68-11c058076879_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!ce9p!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02a43699-31c8-4250-bb68-11c058076879_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!ce9p!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02a43699-31c8-4250-bb68-11c058076879_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ce9p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02a43699-31c8-4250-bb68-11c058076879_1672x941.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/02a43699-31c8-4250-bb68-11c058076879_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ce9p!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02a43699-31c8-4250-bb68-11c058076879_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!ce9p!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02a43699-31c8-4250-bb68-11c058076879_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!ce9p!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02a43699-31c8-4250-bb68-11c058076879_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!ce9p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02a43699-31c8-4250-bb68-11c058076879_1672x941.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>A.1 Mental model</h3><p>&#948;-mem is an in-process side-state, not a database. The state matrix lives inside the model object, persists across <em><strong>model.generate()</strong></em> calls in the same Python process, and resets when you re-attach. There is no separate process and no network hop.</p><p>One state matrix is allocated per attached layer per state head, all on the same GPU as the model.</p><h3>A.2 Preloading history into the state</h3><p>The most common day-one mistake is forgetting that &#948;-mem only sees what passes through <em><strong>forward()</strong></em>. To preload history, run the historical context through the model in a context-only pass before the user&#8217;s first real query. The state advances. The output is discarded.</p><p>The paper trained with an 8,192-token write budget per example. 
Inference is not hard-capped at that length, but the trailing tokens dominate the state.</p><h3>A.3 Verifying the state is actually doing something</h3><p>Three quick checks.</p><p>First, run the demo in <em><strong>MODE=base</strong></em> and compare answers on the same prompts. If responses look identical to &#948;-mem mode, the adapter is not attached or <em><strong>model.eval()</strong></em> was skipped.</p><p>Second, use <em><strong>/stats</strong></em> in the chat demo or call <em><strong>collect_delta_mem_state_stats()</strong></em> in Python. The state should be non-zero within the first few tokens of context.</p><p>Third, run one short benchmark and confirm the result lands near the paper&#8217;s headline. For Qwen3-4B TSW, that&#8217;s roughly 51.66% average across the five-eval suite.</p><h3>A.4 Session save and resume</h3><p>Use <em><strong>/save_session &lt;dir&gt;</strong></em> and <em><strong>/load_session &lt;dir&gt;</strong></em> inside the chat demo, or the equivalent Python helpers, to checkpoint and restore the state across processes. The save captures the exact state matrix at that point in the conversation.</p><p>A saved session is tied to the base model it came from. Loading a Qwen3-4B session into a different base will fail.</p><h3>A.5 Benchmarking against your current memory stack</h3><p>Run the same task three ways with the same evaluator: base + RAG, base + long context, base + &#948;-mem. The bundled suite at <em><strong>scripts/run_qasper_multimodel_write8192_benchmark_suite.sh</strong></em> runs all five evals.</p><p>To scope it to one task and one variant:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;markdown&quot;,&quot;nodeId&quot;:&quot;5143911d-c442-4c37-966d-9d5485709b48&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-markdown">BENCHMARK_VARIANTS_STRING=&quot;TSW_rank8_qasper_write8192&quot; \
EVAL_TASKS_STRING=&quot;locomo&quot; \
bash scripts/run_qasper_multimodel_write8192_benchmark_suite.sh</code></pre></div><p>For reference, on Qwen3-4B the paper reports BM25 RAG 44.56 avg, Context2LoRA 44.90 avg, and &#948;-mem TSW 51.66 avg.</p><h3>A.6 Choosing TSW, SSW, or MSW</h3><p>The cross-backbone results give the rule of thumb.</p><ul><li><p><strong>TSW</strong> wins on Qwen3-4B (51.66 avg). Pick it when local detail matters and the model is mid-sized.</p></li><li><p><strong>SSW</strong> wins on Qwen3-8B (50.86 avg). Pick it when token-level noise drags the state and the model has more reasoning to spare.</p></li><li><p><strong>MSW</strong> wins on SmolLM3-3B (36.96 avg) and on memory-heavy benchmarks. Pick it when interference between memory types matters more than per-token detail.</p></li></ul><p>Only TSW is on Hugging Face today. SSW and MSW need the training script.</p><h3>A.7 Cost reality</h3><p>GPU memory at inference matches vanilla at every prompt length tested. At a 32k-token prompt, &#948;-mem&#8217;s footprint lands on the same value as the base model in the paper&#8217;s table, with no measurable overhead from the state.</p><p>Decoding throughput is ~40% slower than base at 32k prompt and 64-token decode (22.60 &#8594; 13.68 TPS).</p><p>Trainable parameters: 4.87M for TSW or SSW (0.12% of Qwen3-4B), 19.47M for MSW (0.48%).</p><p>Training in the paper used 8&#215; A800 GPUs, bfloat16, DeepSpeed ZeRO-2, fused AdamW, peak LR 2e-4, one epoch on the shortest 2,219-sample QASPER split, effective batch size 32.</p><h3>A.8 Training your own adapter</h3><p>The realistic floor is multi-GPU bf16. The paper&#8217;s exact recipe was 8&#215; A800, but the script supports fewer devices through DeepSpeed.</p><p>Training data should be memory-heavy SFT examples where the context tokens carry the signal that the model needs at response time. 
QASPER fits because the question is short and the supporting context is long.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;markdown&quot;,&quot;nodeId&quot;:&quot;9e905107-b287-4790-ab5d-0e0c12da1cfb&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-markdown">TRAIN_VARIANTS_STRING=&quot;TSW_rank8_qasper_write8192&quot; \
BENCHMARK_VARIANTS_STRING=&quot;TSW_rank8_qasper_write8192&quot; \
bash scripts/run_qasper_multimodel_write8192_train_and_benchmark_suite.sh</code></pre></div><p>Per-backbone scripts exist for Qwen3-8B and SmolLM3-3B.</p><h3>A.9 Compatibility constraints</h3><ul><li><p>Not a PEFT adapter. Never load with <em><strong>PeftModel</strong></em>. Never call <em><strong>merge_and_unload()</strong></em>.</p></li><li><p>GPU-only target path. CPU is not supported.</p></li><li><p>Released adapter is fixed to <em><strong>Qwen/Qwen3-4B-Instruct-2507</strong></em>. Other Qwen versions are not guaranteed.</p></li><li><p>Adapter files are <em><strong>delta_mem_adapter.pt</strong></em> and <em><strong>delta_mem_config.json</strong></em>. Do not rename.</p></li><li><p>The required load path is <em><strong>deltamem.core.attach_delta_mem</strong></em> followed by <em><strong>load_delta_mem_adapter</strong></em>. Standard <em><strong>AutoModel.from_pretrained(adapter_dir)</strong></em> will not work.</p></li></ul><h3>A.10 When to reach for &#948;-mem versus an alternative</h3><ul><li><p>Reach for <strong>RAG or vector search</strong> when you need exact retrieval, citations, deletion, or an audit trail.</p></li><li><p>Reach for <strong>longer context</strong> when the full history fits in budget and latency is acceptable.</p></li><li><p>Reach for <strong>&#948;-mem</strong> when you want online, history-conditioned steering without fine-tuning the backbone or growing the prompt.</p></li></ul><p>These are not mutually exclusive. The cleanest agent stack is &#948;-mem on the model and a retrieval index on the side.</p>]]></content:encoded></item><item><title><![CDATA[11 Open-Source Repos Every AI Infra Engineer Should Bookmark]]></title><description><![CDATA[You built an AI agent this weekend. 
Have you thought about its infrastructure?]]></description><link>https://alphasignalai.substack.com/p/11-open-source-repos-every-ai-infra</link><guid isPermaLink="false">https://alphasignalai.substack.com/p/11-open-source-repos-every-ai-infra</guid><dc:creator><![CDATA[AlphaSignal AI]]></dc:creator><pubDate>Tue, 19 May 2026 16:01:17 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!5loY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cf19be1-0965-4188-b530-ad69e200af3f_2048x1152.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5loY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cf19be1-0965-4188-b530-ad69e200af3f_2048x1152.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5loY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cf19be1-0965-4188-b530-ad69e200af3f_2048x1152.png 424w, https://substackcdn.com/image/fetch/$s_!5loY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cf19be1-0965-4188-b530-ad69e200af3f_2048x1152.png 848w, https://substackcdn.com/image/fetch/$s_!5loY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cf19be1-0965-4188-b530-ad69e200af3f_2048x1152.png 1272w, https://substackcdn.com/image/fetch/$s_!5loY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cf19be1-0965-4188-b530-ad69e200af3f_2048x1152.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!5loY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cf19be1-0965-4188-b530-ad69e200af3f_2048x1152.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7cf19be1-0965-4188-b530-ad69e200af3f_2048x1152.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!5loY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cf19be1-0965-4188-b530-ad69e200af3f_2048x1152.png 424w, https://substackcdn.com/image/fetch/$s_!5loY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cf19be1-0965-4188-b530-ad69e200af3f_2048x1152.png 848w, https://substackcdn.com/image/fetch/$s_!5loY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cf19be1-0965-4188-b530-ad69e200af3f_2048x1152.png 1272w, https://substackcdn.com/image/fetch/$s_!5loY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cf19be1-0965-4188-b530-ad69e200af3f_2048x1152.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>You built an AI agent this weekend.</p><p>It writes code.<br>Browses the web.<br>Uses MCP tools.<br>Maybe even touches production data.</p><p>Now ask yourself:</p><ul><li><p>What isolates it from the host?</p></li><li><p>What stops credential leaks?</p></li><li><p>Who controls tool permissions?</p></li><li><p>What happens after prompt injection?</p></li><li><p>Where is the audit trail?</p></li></ul><p>Most AI engineers answer these questions after the first incident.</p><p>Meanwhile, open source quietly built an entire infrastructure and security stack for AI agents.</p><p>Here are 11 repos every AI infra engineer should bookmark:</p><p>Each one covers a gap that frameworks don&#8217;t.</p><div class="subscription-widget-wrap-editor" 
data-attrs="{&quot;url&quot;:&quot;https://alphasignalai.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading AlphaSignal! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2><strong>Open Source AI Agent Infrastructure &amp; Security</strong></h2><h3><strong>1. ProjectRecon/awesome-ai-agents-security</strong></h3><h3><em><strong>Living Map of the AI Agent Security Ecosystem</strong></em></h3><p><strong>What it does:</strong> Curated, maintained index of the AI agent security ecosystem, organized by security lifecycle: red teaming, runtime protection, sandboxing, governance, middleware.</p><p>Everything categorized, linked, and kept current. 
The starting point when you need to understand the full landscape or find tools for a specific security problem.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ckjP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1864d99a-84a3-478b-80c7-283a8d0aa217_869x588.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ckjP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1864d99a-84a3-478b-80c7-283a8d0aa217_869x588.png 424w, https://substackcdn.com/image/fetch/$s_!ckjP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1864d99a-84a3-478b-80c7-283a8d0aa217_869x588.png 848w, https://substackcdn.com/image/fetch/$s_!ckjP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1864d99a-84a3-478b-80c7-283a8d0aa217_869x588.png 1272w, https://substackcdn.com/image/fetch/$s_!ckjP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1864d99a-84a3-478b-80c7-283a8d0aa217_869x588.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ckjP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1864d99a-84a3-478b-80c7-283a8d0aa217_869x588.png" width="869" height="588" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1864d99a-84a3-478b-80c7-283a8d0aa217_869x588.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:588,&quot;width&quot;:869,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ckjP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1864d99a-84a3-478b-80c7-283a8d0aa217_869x588.png 424w, https://substackcdn.com/image/fetch/$s_!ckjP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1864d99a-84a3-478b-80c7-283a8d0aa217_869x588.png 848w, https://substackcdn.com/image/fetch/$s_!ckjP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1864d99a-84a3-478b-80c7-283a8d0aa217_869x588.png 1272w, https://substackcdn.com/image/fetch/$s_!ckjP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1864d99a-84a3-478b-80c7-283a8d0aa217_869x588.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Why it matters for AI infra:</strong> This space moves faster than any single article. This repo is the durable index. Watch it. The delta between its last commit and today is your reading list.</p><p>&#128279;: https://github.com/ProjectRecon/awesome-ai-agents-security</p><div><hr></div><h3><strong>2. promptfoo/promptfoo</strong></h3><h3><em><strong>Automated Red Teaming and Evals for LLM Applications</strong></em></h3><p><strong>What it does:</strong> The standard framework for automated LLM red teaming, security testing, and model evaluation.</p><p>Covers prompt injection, jailbreaks, PII leakage, model regression, and multi-model performance comparison across GPT, Claude, Gemini, Llama, and more. Declarative YAML configs. Native CI/CD integration. Used internally by OpenAI and Anthropic. 
MIT licensed, fully open source.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Q7QD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae322bd6-a4f8-458b-8312-eb1203dddff4_2048x1482.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Q7QD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae322bd6-a4f8-458b-8312-eb1203dddff4_2048x1482.png 424w, https://substackcdn.com/image/fetch/$s_!Q7QD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae322bd6-a4f8-458b-8312-eb1203dddff4_2048x1482.png 848w, https://substackcdn.com/image/fetch/$s_!Q7QD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae322bd6-a4f8-458b-8312-eb1203dddff4_2048x1482.png 1272w, https://substackcdn.com/image/fetch/$s_!Q7QD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae322bd6-a4f8-458b-8312-eb1203dddff4_2048x1482.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Q7QD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae322bd6-a4f8-458b-8312-eb1203dddff4_2048x1482.png" width="1456" height="1054" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ae322bd6-a4f8-458b-8312-eb1203dddff4_2048x1482.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1054,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Q7QD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae322bd6-a4f8-458b-8312-eb1203dddff4_2048x1482.png 424w, https://substackcdn.com/image/fetch/$s_!Q7QD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae322bd6-a4f8-458b-8312-eb1203dddff4_2048x1482.png 848w, https://substackcdn.com/image/fetch/$s_!Q7QD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae322bd6-a4f8-458b-8312-eb1203dddff4_2048x1482.png 1272w, https://substackcdn.com/image/fetch/$s_!Q7QD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae322bd6-a4f8-458b-8312-eb1203dddff4_2048x1482.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Why it matters for AI infra:</strong> You test your code before shipping. You should test your prompts and agent security boundaries too. Promptfoo makes red teaming systematic, scriptable, and integrated into your existing CI pipeline. Shipping agent features without automated security evals is the equivalent of shipping code without tests.</p><p>&#128279;: https://github.com/promptfoo/promptfoo</p><div><hr></div><h3><strong>3. aquasecurity/trivy</strong></h3><h3><em><strong>Supply Chain Vulnerability Scanner for AI Infrastructure</strong></em></h3><p><strong>What it does:</strong> All-in-one vulnerability scanner for container images, git repos, and filesystems.</p><p>One tool catches: vulnerable base images, misconfigured Terraform, insecure Kubernetes manifests, leaked secrets in git history, vulnerable application dependencies. SARIF output feeds directly into the GitHub Security tab. 
10 lines of YAML for GitHub Actions integration.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XT9-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13301e86-3c95-4a50-a20b-5f3d880e2a14_2048x867.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XT9-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13301e86-3c95-4a50-a20b-5f3d880e2a14_2048x867.png 424w, https://substackcdn.com/image/fetch/$s_!XT9-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13301e86-3c95-4a50-a20b-5f3d880e2a14_2048x867.png 848w, https://substackcdn.com/image/fetch/$s_!XT9-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13301e86-3c95-4a50-a20b-5f3d880e2a14_2048x867.png 1272w, https://substackcdn.com/image/fetch/$s_!XT9-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13301e86-3c95-4a50-a20b-5f3d880e2a14_2048x867.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XT9-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13301e86-3c95-4a50-a20b-5f3d880e2a14_2048x867.png" width="1456" height="616" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/13301e86-3c95-4a50-a20b-5f3d880e2a14_2048x867.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:616,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XT9-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13301e86-3c95-4a50-a20b-5f3d880e2a14_2048x867.png 424w, https://substackcdn.com/image/fetch/$s_!XT9-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13301e86-3c95-4a50-a20b-5f3d880e2a14_2048x867.png 848w, https://substackcdn.com/image/fetch/$s_!XT9-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13301e86-3c95-4a50-a20b-5f3d880e2a14_2048x867.png 1272w, https://substackcdn.com/image/fetch/$s_!XT9-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13301e86-3c95-4a50-a20b-5f3d880e2a14_2048x867.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Why it matters for AI infra:</strong> You can write perfect agent code and still ship a vulnerable base image or misconfigured infrastructure module. Supply chain attacks are now one of the dominant attack vectors. Trivy catches them in CI before they reach production. It&#8217;s automated, with zero manual review overhead.</p><p>&#128279;: https://github.com/aquasecurity/trivy</p><div><hr></div><h3><strong>4. open-policy-agent/opa</strong></h3><h3><em><strong>Policy-as-Code for AI Agent Infrastructure</strong></em></h3><p><strong>What it does:</strong> Universal policy engine for your entire stack. Express security policy as readable, testable Rego code.</p><p>Kubernetes admission control, API authorization, infrastructure configuration validation: one engine. Decouple security policy from application code. Write once, enforce everywhere. 
When your AI agent calls a tool, hits an API, or requests a resource, OPA decides whether it&#8217;s allowed.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!H6rY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e766ef1-bdb9-4237-a6f5-9b733acaeecd_866x742.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!H6rY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e766ef1-bdb9-4237-a6f5-9b733acaeecd_866x742.png 424w, https://substackcdn.com/image/fetch/$s_!H6rY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e766ef1-bdb9-4237-a6f5-9b733acaeecd_866x742.png 848w, https://substackcdn.com/image/fetch/$s_!H6rY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e766ef1-bdb9-4237-a6f5-9b733acaeecd_866x742.png 1272w, https://substackcdn.com/image/fetch/$s_!H6rY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e766ef1-bdb9-4237-a6f5-9b733acaeecd_866x742.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!H6rY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e766ef1-bdb9-4237-a6f5-9b733acaeecd_866x742.png" width="866" height="742" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2e766ef1-bdb9-4237-a6f5-9b733acaeecd_866x742.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:742,&quot;width&quot;:866,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!H6rY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e766ef1-bdb9-4237-a6f5-9b733acaeecd_866x742.png 424w, https://substackcdn.com/image/fetch/$s_!H6rY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e766ef1-bdb9-4237-a6f5-9b733acaeecd_866x742.png 848w, https://substackcdn.com/image/fetch/$s_!H6rY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e766ef1-bdb9-4237-a6f5-9b733acaeecd_866x742.png 1272w, https://substackcdn.com/image/fetch/$s_!H6rY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e766ef1-bdb9-4237-a6f5-9b733acaeecd_866x742.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Why it matters for AI infra:</strong> OPA isn&#8217;t agent-specific, which is exactly why it belongs here. Your agent infrastructure sits inside your existing cloud stack. OPA gives you a consistent, auditable policy layer that spans both traditional infrastructure and agentic workloads without maintaining two separate security systems.</p><p>&#128279;: https://github.com/open-policy-agent/opa</p><div><hr></div><h3><strong>5. AgentGateway (Linux Foundation)</strong></h3><h3><strong>RBAC Proxy for MCP and A2A Agent Protocols</strong></h3><p><strong>What it does:</strong> AI-native proxy for A2A and MCP protocol traffic. RBAC, observability, and policy enforcement on agent-to-tool interactions.</p><p>Donated to the Linux Foundation. Sits between your agents and their tools. Only the right agents can call the right tools with the right permissions. 
Full observability on the protocol layer.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!M0PM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F867e39e7-5f29-43bd-9c9c-ac89f411d66b_2048x938.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!M0PM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F867e39e7-5f29-43bd-9c9c-ac89f411d66b_2048x938.png 424w, https://substackcdn.com/image/fetch/$s_!M0PM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F867e39e7-5f29-43bd-9c9c-ac89f411d66b_2048x938.png 848w, https://substackcdn.com/image/fetch/$s_!M0PM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F867e39e7-5f29-43bd-9c9c-ac89f411d66b_2048x938.png 1272w, https://substackcdn.com/image/fetch/$s_!M0PM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F867e39e7-5f29-43bd-9c9c-ac89f411d66b_2048x938.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!M0PM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F867e39e7-5f29-43bd-9c9c-ac89f411d66b_2048x938.png" width="1456" height="667" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/867e39e7-5f29-43bd-9c9c-ac89f411d66b_2048x938.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:667,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!M0PM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F867e39e7-5f29-43bd-9c9c-ac89f411d66b_2048x938.png 424w, https://substackcdn.com/image/fetch/$s_!M0PM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F867e39e7-5f29-43bd-9c9c-ac89f411d66b_2048x938.png 848w, https://substackcdn.com/image/fetch/$s_!M0PM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F867e39e7-5f29-43bd-9c9c-ac89f411d66b_2048x938.png 1272w, https://substackcdn.com/image/fetch/$s_!M0PM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F867e39e7-5f29-43bd-9c9c-ac89f411d66b_2048x938.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Why it matters for AI infra:</strong> MCP is the emerging standard for agent-tool connectivity. Most teams wire MCP directly with no access control layer, a significant and growing attack surface. AgentGateway is the purpose-built solution. Linux Foundation stewardship means production-grade stability and long-term maintenance.</p><p>&#128279;: https://github.com/agentgateway/agentgateway</p>
<div><hr></div><h3><strong>6. microsoft/agent-governance-toolkit</strong></h3><h3><strong>Runtime Security Middleware for AI Agents</strong></h3><p><strong>What it does:</strong> Runtime policy engine mapped directly to the OWASP Agentic AI Top 10.</p><p>When <a href="https://genai.owasp.org/">OWASP</a> published the first formal taxonomy of agentic AI risks (goal hijacking, tool misuse, identity abuse, rogue agents), Microsoft shipped a toolkit that addresses every single one. Sub-millisecond governance latency (&lt;0.1ms p99). 
Deploys as a sidecar container or middleware layer.</p><ul><li><p>Goal hijacking &#8594; semantic intent classifier</p></li><li><p>Tool misuse &#8594; capability sandboxing + MCP security gateway</p></li><li><p>Memory poisoning &#8594; Cross-Model Verification Kernel with majority voting</p></li><li><p>Rogue agents &#8594; ring isolation, trust decay, automated kill switch</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3u7R!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59ff43ce-5650-4947-8cd9-2f3c78eb8e30_886x426.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3u7R!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59ff43ce-5650-4947-8cd9-2f3c78eb8e30_886x426.png 424w, https://substackcdn.com/image/fetch/$s_!3u7R!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59ff43ce-5650-4947-8cd9-2f3c78eb8e30_886x426.png 848w, https://substackcdn.com/image/fetch/$s_!3u7R!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59ff43ce-5650-4947-8cd9-2f3c78eb8e30_886x426.png 1272w, https://substackcdn.com/image/fetch/$s_!3u7R!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59ff43ce-5650-4947-8cd9-2f3c78eb8e30_886x426.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3u7R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59ff43ce-5650-4947-8cd9-2f3c78eb8e30_886x426.png" width="886" height="426" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/59ff43ce-5650-4947-8cd9-2f3c78eb8e30_886x426.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:426,&quot;width&quot;:886,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3u7R!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59ff43ce-5650-4947-8cd9-2f3c78eb8e30_886x426.png 424w, https://substackcdn.com/image/fetch/$s_!3u7R!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59ff43ce-5650-4947-8cd9-2f3c78eb8e30_886x426.png 848w, https://substackcdn.com/image/fetch/$s_!3u7R!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59ff43ce-5650-4947-8cd9-2f3c78eb8e30_886x426.png 1272w, https://substackcdn.com/image/fetch/$s_!3u7R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59ff43ce-5650-4947-8cd9-2f3c78eb8e30_886x426.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Why it matters for AI infra:</strong> The <a href="https://artificialintelligenceact.eu/">EU AI Act</a>&#8217;s high-risk AI obligations take effect in <strong>August 2026</strong>. Enforcement of the Colorado AI Act begins in January 2027. Agentic AI governance is moving from best practice to legal requirement. This is the most comprehensive open source implementation aligned with the formal risk taxonomy.</p><p>&#128279;: https://github.com/microsoft/agent-governance-toolkit</p><div><hr></div><h3><strong>7. anthropics/claude-code-security-review</strong></h3><h3><em><strong>AI-Powered PR Security Review</strong></em></h3><p><strong>What it does:</strong> GitHub Action that runs Claude Code on every pull request and posts security findings as inline review comments.</p><p>Diff-aware: only analyzes changed files. It uses semantic reasoning, not pattern matching, to identify high-confidence, exploitable vulnerabilities. Calibrated false positive filtering: no theoretical issues, no rate-limiting noise. 
Just vulnerabilities a senior security engineer would flag in review.</p><p>One YAML file to add to any repository.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IMYc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F336f90ac-a525-465b-a212-fd8ce80863a0_1899x1156.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IMYc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F336f90ac-a525-465b-a212-fd8ce80863a0_1899x1156.png 424w, https://substackcdn.com/image/fetch/$s_!IMYc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F336f90ac-a525-465b-a212-fd8ce80863a0_1899x1156.png 848w, https://substackcdn.com/image/fetch/$s_!IMYc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F336f90ac-a525-465b-a212-fd8ce80863a0_1899x1156.png 1272w, https://substackcdn.com/image/fetch/$s_!IMYc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F336f90ac-a525-465b-a212-fd8ce80863a0_1899x1156.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IMYc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F336f90ac-a525-465b-a212-fd8ce80863a0_1899x1156.png" width="1456" height="886" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/336f90ac-a525-465b-a212-fd8ce80863a0_1899x1156.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:886,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IMYc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F336f90ac-a525-465b-a212-fd8ce80863a0_1899x1156.png 424w, https://substackcdn.com/image/fetch/$s_!IMYc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F336f90ac-a525-465b-a212-fd8ce80863a0_1899x1156.png 848w, https://substackcdn.com/image/fetch/$s_!IMYc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F336f90ac-a525-465b-a212-fd8ce80863a0_1899x1156.png 1272w, https://substackcdn.com/image/fetch/$s_!IMYc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F336f90ac-a525-465b-a212-fd8ce80863a0_1899x1156.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Why it matters for AI infra:</strong> Security review is the step most teams skip because it&#8217;s expensive and slow. This makes it automatic and free on every PR. The semantic analysis, which uses actual LLM reasoning rather than regex, catches logic-level security issues that static analysis tools can miss entirely.</p><p>&#128279;: https://github.com/anthropics/claude-code-security-review</p><div><hr></div><h3><strong>8. vercel-labs/deepsec</strong></h3><h3><em><strong>Agent-Powered Vulnerability Scanner</strong></em></h3><p><strong>What it does:</strong> An AI agent that scans your entire codebase for vulnerabilities that have been sitting there for years.</p><p>Fast regex matchers find candidates, then Claude or Codex investigates each one at maximum thinking levels. Work fans out across parallel workers for large single repos. 
You can interrupt or restart jobs, and the scan picks up where it left off.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ifdY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf8e733d-8d5f-4b6e-b466-687d0afe2f8f_868x652.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ifdY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf8e733d-8d5f-4b6e-b466-687d0afe2f8f_868x652.png 424w, https://substackcdn.com/image/fetch/$s_!ifdY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf8e733d-8d5f-4b6e-b466-687d0afe2f8f_868x652.png 848w, https://substackcdn.com/image/fetch/$s_!ifdY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf8e733d-8d5f-4b6e-b466-687d0afe2f8f_868x652.png 1272w, https://substackcdn.com/image/fetch/$s_!ifdY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf8e733d-8d5f-4b6e-b466-687d0afe2f8f_868x652.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ifdY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf8e733d-8d5f-4b6e-b466-687d0afe2f8f_868x652.png" width="868" height="652" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bf8e733d-8d5f-4b6e-b466-687d0afe2f8f_868x652.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:652,&quot;width&quot;:868,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ifdY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf8e733d-8d5f-4b6e-b466-687d0afe2f8f_868x652.png 424w, https://substackcdn.com/image/fetch/$s_!ifdY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf8e733d-8d5f-4b6e-b466-687d0afe2f8f_868x652.png 848w, https://substackcdn.com/image/fetch/$s_!ifdY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf8e733d-8d5f-4b6e-b466-687d0afe2f8f_868x652.png 1272w, https://substackcdn.com/image/fetch/$s_!ifdY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf8e733d-8d5f-4b6e-b466-687d0afe2f8f_868x652.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Why it matters for AI infra:</strong> Every other tool on this list protects agents at runtime. Deepsec goes one layer earlier. It clears the vulnerabilities already living in the codebase your agents will read, modify, and deploy.</p><p>&#128279;: https://github.com/vercel-labs/deepsec</p><div><hr></div><h3><strong>9. dagger/container-use</strong></h3><h3><em><strong>Containerized Environments for Coding Agents</strong></em></h3><p><strong>What it does:</strong> Persistent, isolated container environments for coding agents.</p><p>From the Dagger team. Each coding agent gets its own container. Multiple agents run in parallel without conflict. Environments persist across sessions. Resume any task mid-flight with an existing env ID.</p><p>The differentiator: <strong>full OpenTelemetry instrumentation</strong> on every agent run. Every LLM decision, tool call, error, and retry appears in the build trace. 
When something goes wrong, you don&#8217;t guess, you see it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XJ9X!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97867b54-fa3e-49bd-965e-e54ebc3e3bc6_861x510.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XJ9X!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97867b54-fa3e-49bd-965e-e54ebc3e3bc6_861x510.png 424w, https://substackcdn.com/image/fetch/$s_!XJ9X!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97867b54-fa3e-49bd-965e-e54ebc3e3bc6_861x510.png 848w, https://substackcdn.com/image/fetch/$s_!XJ9X!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97867b54-fa3e-49bd-965e-e54ebc3e3bc6_861x510.png 1272w, https://substackcdn.com/image/fetch/$s_!XJ9X!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97867b54-fa3e-49bd-965e-e54ebc3e3bc6_861x510.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XJ9X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97867b54-fa3e-49bd-965e-e54ebc3e3bc6_861x510.png" width="861" height="510" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/97867b54-fa3e-49bd-965e-e54ebc3e3bc6_861x510.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:510,&quot;width&quot;:861,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XJ9X!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97867b54-fa3e-49bd-965e-e54ebc3e3bc6_861x510.png 424w, https://substackcdn.com/image/fetch/$s_!XJ9X!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97867b54-fa3e-49bd-965e-e54ebc3e3bc6_861x510.png 848w, https://substackcdn.com/image/fetch/$s_!XJ9X!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97867b54-fa3e-49bd-965e-e54ebc3e3bc6_861x510.png 1272w, https://substackcdn.com/image/fetch/$s_!XJ9X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97867b54-fa3e-49bd-965e-e54ebc3e3bc6_861x510.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Why it matters for AI infra:</strong> &#8220;It works on my machine&#8221; is not a deployment model. Container Use brings to agent execution the isolation and reproducibility guarantees that containerization gave to software builds. The observability layer alone makes debugging agentic systems tractable.</p><p>&#128279;: https://github.com/dagger/container-use</p><div><hr></div><h3><strong>10. meta-llama/PurpleLlama/LlamaFirewall</strong></h3><h3><em><strong>Prompt Injection Defense for LLM Agents</strong></em></h3><p><strong>What it does:</strong> Meta&#8217;s open source guardrail system for LLM agents. Blocks prompt injection, scans LLM-generated code for vulnerabilities, and detects misaligned reasoning.</p><p>A single prompt injection can flip an agent&#8217;s intent, causing it to leak private data, execute unauthorized commands, or operate far outside its scope. LlamaFirewall sits at the application layer and intercepts this before it reaches execution. 
It also scans code the agent generates for critical vulnerabilities before shipping to production, a gap most guardrail systems ignore entirely.</p><p>Open-weight guardrail models on HuggingFace. Run on your own infrastructure at 50&#8211;100ms latency. No API calls. No data leaving your environment.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!emo9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ee408b0-d962-4baf-b859-911275edfcbe_866x440.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!emo9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ee408b0-d962-4baf-b859-911275edfcbe_866x440.png 424w, https://substackcdn.com/image/fetch/$s_!emo9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ee408b0-d962-4baf-b859-911275edfcbe_866x440.png 848w, https://substackcdn.com/image/fetch/$s_!emo9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ee408b0-d962-4baf-b859-911275edfcbe_866x440.png 1272w, https://substackcdn.com/image/fetch/$s_!emo9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ee408b0-d962-4baf-b859-911275edfcbe_866x440.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!emo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ee408b0-d962-4baf-b859-911275edfcbe_866x440.png" width="866" height="440" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6ee408b0-d962-4baf-b859-911275edfcbe_866x440.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:440,&quot;width&quot;:866,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!emo9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ee408b0-d962-4baf-b859-911275edfcbe_866x440.png 424w, https://substackcdn.com/image/fetch/$s_!emo9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ee408b0-d962-4baf-b859-911275edfcbe_866x440.png 848w, https://substackcdn.com/image/fetch/$s_!emo9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ee408b0-d962-4baf-b859-911275edfcbe_866x440.png 1272w, https://substackcdn.com/image/fetch/$s_!emo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ee408b0-d962-4baf-b859-911275edfcbe_866x440.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Why it matters for AI infra:</strong> Prompt injection is the SQL injection of the agentic era. It&#8217;s being exploited in production systems today. LlamaFirewall is the most rigorous open source defense available, built by a security team that has worked through the actual LLM agent threat model.</p><p>&#128279;: https://github.com/meta-llama/PurpleLlama/tree/main/LlamaFirewall</p><div><hr></div><h3><strong>11. microsandbox/microsandbox</strong></h3><h3><em><strong>Self-Hostable AI Code Execution Sandbox</strong></em></h3><p><strong>What it does:</strong> Open source, self-hostable alternative to E2B for AI agent code execution. Multi-language SDKs. Hardware-level isolation.</p><p>Docker and Kubernetes runtimes. gVisor, Kata Containers, and Firecracker isolation layers: pick the isolation level that matches your threat model. 
CNCF Landscape project.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!p_RM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b83b220-2242-44d0-9cda-d23e39d02d7d_864x615.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!p_RM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b83b220-2242-44d0-9cda-d23e39d02d7d_864x615.png 424w, https://substackcdn.com/image/fetch/$s_!p_RM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b83b220-2242-44d0-9cda-d23e39d02d7d_864x615.png 848w, https://substackcdn.com/image/fetch/$s_!p_RM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b83b220-2242-44d0-9cda-d23e39d02d7d_864x615.png 1272w, https://substackcdn.com/image/fetch/$s_!p_RM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b83b220-2242-44d0-9cda-d23e39d02d7d_864x615.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!p_RM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b83b220-2242-44d0-9cda-d23e39d02d7d_864x615.png" width="864" height="615" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6b83b220-2242-44d0-9cda-d23e39d02d7d_864x615.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:615,&quot;width&quot;:864,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!p_RM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b83b220-2242-44d0-9cda-d23e39d02d7d_864x615.png 424w, https://substackcdn.com/image/fetch/$s_!p_RM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b83b220-2242-44d0-9cda-d23e39d02d7d_864x615.png 848w, https://substackcdn.com/image/fetch/$s_!p_RM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b83b220-2242-44d0-9cda-d23e39d02d7d_864x615.png 1272w, https://substackcdn.com/image/fetch/$s_!p_RM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b83b220-2242-44d0-9cda-d23e39d02d7d_864x615.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Why it matters for AI infra:</strong> Not every team wants an external SaaS dependency in their agent execution path. 
Microsandbox gives you the same isolation guarantees inside your own infrastructure, with no external API in the hot path and full control over your data residency.</p><p>&#128279;: https://github.com/superradcompany/microsandbox</p><div><hr></div><h3><strong>The Bigger Shift</strong></h3><p>Most teams still think of AI agents as applications.</p><p>The infrastructure ecosystem increasingly treats them as autonomous systems that require isolation, observability, security, and governance.</p><p>That shift is changing how AI systems get deployed in production.</p><p>And increasingly, it is becoming mandatory.</p><p>The EU AI Act enters full enforcement on August 2, 2026, with high-risk AI system obligations, transparency requirements, and penalties of up to &#8364;35M or 7% of global turnover.</p><p><a href="https://www.forbes.com/sites/alonzomartinez/2026/05/15/colorado-rewrites-its-ai-law-before-it-takes-effect/">Colorado&#8217;s replacement AI law</a> is heading to the Governor&#8217;s desk now, with a January 2027 effective date. Every version of every bill, in every jurisdiction, requires the same things: documentation, audit trails, risk management, and human oversight. These are the exact same artifacts this infrastructure stack produces.</p><p>The teams building this layer now will not only ship safer systems. 
When the compliance conversation arrives from customers, auditors, regulators, or enterprise buyers, they will have artifacts to show instead of promises to make.</p><p>The next generation of AI companies will not only build better models or better prompts.</p><p>They will build the infrastructure required to operate autonomous systems safely at scale.</p><p>Thanks for reading!</p><div><hr></div><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://alphasignalai.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading AlphaSignal! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>Follow <a href="https://x.com/AlphaSignalAI">@AlphaSignalAI</a> for more content like this.</p><p>Check out <a href="http://alphasignal.ai/">AlphaSignal.ai</a> to get a daily summary of top models, repos, and papers in AI. Read by 300,000+ devs.</p>]]></content:encoded></item><item><title><![CDATA[Hermes Just Made Codex the Engine and Itself the Shell.]]></title><description><![CDATA[Opt-in beta in Hermes 2026.5. 
One slash command, three tool sources, four tools left behind.]]></description><link>https://alphasignalai.substack.com/p/hermes-just-made-codex-the-engine</link><guid isPermaLink="false">https://alphasignalai.substack.com/p/hermes-just-made-codex-the-engine</guid><dc:creator><![CDATA[AlphaSignal AI]]></dc:creator><pubDate>Mon, 18 May 2026 17:12:06 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!o8hw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd564e5-d5b6-4f2e-9653-bae7b6e42374_2048x1152.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!o8hw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd564e5-d5b6-4f2e-9653-bae7b6e42374_2048x1152.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!o8hw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd564e5-d5b6-4f2e-9653-bae7b6e42374_2048x1152.png 424w, https://substackcdn.com/image/fetch/$s_!o8hw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd564e5-d5b6-4f2e-9653-bae7b6e42374_2048x1152.png 848w, https://substackcdn.com/image/fetch/$s_!o8hw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd564e5-d5b6-4f2e-9653-bae7b6e42374_2048x1152.png 1272w, https://substackcdn.com/image/fetch/$s_!o8hw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd564e5-d5b6-4f2e-9653-bae7b6e42374_2048x1152.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!o8hw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd564e5-d5b6-4f2e-9653-bae7b6e42374_2048x1152.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dbd564e5-d5b6-4f2e-9653-bae7b6e42374_2048x1152.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!o8hw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd564e5-d5b6-4f2e-9653-bae7b6e42374_2048x1152.png 424w, https://substackcdn.com/image/fetch/$s_!o8hw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd564e5-d5b6-4f2e-9653-bae7b6e42374_2048x1152.png 848w, https://substackcdn.com/image/fetch/$s_!o8hw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd564e5-d5b6-4f2e-9653-bae7b6e42374_2048x1152.png 1272w, https://substackcdn.com/image/fetch/$s_!o8hw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd564e5-d5b6-4f2e-9653-bae7b6e42374_2048x1152.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" 
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p>After about 10 minutes of reading, you will be able to decide whether to flip the Codex runtime on and to use every command and config option right away.</p></blockquote><blockquote><p>TLDR? Check this HTML interactive guide (beta), inspired by <a href="https://x.com/trq212/status/2052809885763747935?s=20">Thariq&#8217;s article</a>.</p><p>This post was originally published on X (15 May).</p></blockquote><p><strong>Nous Research</strong> just turned Hermes Agent into a Codex front-end.</p><p><strong>Hermes Agent</strong> keeps memory, slash commands, <em><strong>/goal</strong></em>, and skill review. 
<strong>Codex CLI</strong> runs <em><strong>shell</strong></em>, <em><strong>apply_patch</strong></em>, the sandbox, and native plugins.</p><p>The runtime is paid for by a <strong>ChatGPT subscription</strong>. No API key required.</p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/NousResearch/status/2054958564951912714&quot;,&quot;full_text&quot;:&quot;You can now power your Hermes Agent, if using OpenAI models, with codex as the runtime for the core tools that it offers, with the flip of a switch with the new Codex runtime integration! &quot;,&quot;username&quot;:&quot;NousResearch&quot;,&quot;name&quot;:&quot;Nous Research&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1816254738234761216/TX7TW-Mp_normal.jpg&quot;,&quot;date&quot;:&quot;2026-05-14T16:14:35.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/HIStjDDaYAAt855.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/uGY3JHG6Dz&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:141,&quot;retweet_count&quot;:144,&quot;like_count&quot;:2300,&quot;impression_count&quot;:5367710,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:false}" data-component-name="Twitter2ToDOM"></div><div><hr></div><h2>Context</h2><p>The feature is authored by <strong>Nous Research</strong> and titled the <strong>Codex App-Server Runtime</strong> (opt-in beta, announced May 14, 2026). Hermes Agent crossed <strong>+152K GitHub stars</strong> the same day, and per Teknium, Hermes&#8217; daily token volume now runs at roughly twice OpenClaw&#8217;s (+353B vs +195B as of May 15).</p><p>The feature ships in Hermes v0.13.0 (tag <em><strong>v2026.5.7</strong></em>, May 7, 2026) and requires Hermes 2026.5+ and Codex CLI 0.130.0+. Both projects are open-source: Hermes under MIT, Codex CLI under Apache-2.0. 
The swap targets the OpenAI provider path specifically (<em><strong>openai/*</strong></em> and <em><strong>openai-codex/*</strong></em>) and does not touch Anthropic, Gemini, or any other non-OpenAI provider.</p><p><strong>A Reminder:</strong> Hermes Agent is a self-improving coding agent with sessions DB, persistent memory, skill review, slash commands, multi-agent Kanban, and the <em><strong>/goal</strong></em> Ralph loop. Codex CLI is OpenAI&#8217;s terminal coding agent: sandboxed shell, structured patches, native plugins. Until last week the two were separate ecosystems.</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://alphasignalai.substack.com/p/hermes-just-made-codex-the-engine?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for reading AlphaSignal! This post is public so feel free to share it.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://alphasignalai.substack.com/p/hermes-just-made-codex-the-engine?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://alphasignalai.substack.com/p/hermes-just-made-codex-the-engine?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><div><hr></div><h2>What the Codex App-Server Runtime is</h2><p>When the runtime is on, Hermes hands <em><strong>openai/*</strong></em> and <em><strong>openai-codex/*</strong></em> turns to the Codex CLI app-server over JSON-RPC stdio. Codex executes the tool loop: terminal commands, file edits, MCP tool calls, sandboxing. Hermes keeps the surrounding session: sessions DB, slash commands, gateway, memory, and skill review.</p><p>Default Hermes behavior is unchanged unless the flag is flipped. 
Hermes never auto-routes onto this runtime.</p><p>OpenClaw shipped a similar runtime-swap pattern earlier this year. Hermes&#8217; differentiator is the bidirectional MCP callback that keeps Hermes&#8217; richer tools (browser, vision, skills, TTS) accessible from inside the Codex turn.</p><div><hr></div><h2>How it works</h2><p>Three tool sources are available the moment the runtime starts.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AGPs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F546caba0-76ac-4473-a12f-5f9d9b823a90_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AGPs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F546caba0-76ac-4473-a12f-5f9d9b823a90_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!AGPs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F546caba0-76ac-4473-a12f-5f9d9b823a90_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!AGPs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F546caba0-76ac-4473-a12f-5f9d9b823a90_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!AGPs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F546caba0-76ac-4473-a12f-5f9d9b823a90_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AGPs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F546caba0-76ac-4473-a12f-5f9d9b823a90_1672x941.png" width="1456" 
height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/546caba0-76ac-4473-a12f-5f9d9b823a90_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AGPs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F546caba0-76ac-4473-a12f-5f9d9b823a90_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!AGPs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F546caba0-76ac-4473-a12f-5f9d9b823a90_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!AGPs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F546caba0-76ac-4473-a12f-5f9d9b823a90_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!AGPs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F546caba0-76ac-4473-a12f-5f9d9b823a90_1672x941.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Codex built-in tools (5):</strong></p><p><em><strong>shell</strong></em> runs terminal commands inside the sandbox (read, write, search, find, run). <em><strong>apply_patch</strong></em> applies structured multi-file diffs. <em><strong>update_plan</strong></em> is Codex&#8217;s in-runtime todo tracker. <em><strong>view_image</strong></em> loads a local image into the conversation. Codex&#8217;s own <em><strong>web_search</strong></em> rounds out the set. All five run natively, and all five run inside the sandbox profile.</p><p><strong>Auto-migrated Codex plugins:</strong></p><p>When the runtime is enabled, Hermes queries Codex&#8217;s <em><strong>plugin/list</strong></em> RPC and writes a <em><strong>[plugins.&quot;&lt;name&gt;@openai-curated&quot;]</strong></em> entry for every plugin already installed via <em><strong>codex plugin install</strong></em>. 
Linear, GitHub, Gmail, Google Calendar, Outlook, Canva: whatever the user authorized in Codex&#8217;s TUI is now live inside the Hermes session, with no re-config needed.</p><p><strong>Hermes MCP callback (17 tools):</strong></p><p>For tools Codex doesn&#8217;t ship with, Codex spawns <em><strong>hermes_tools_mcp_server</strong></em> as a stdio MCP subprocess and calls back into Hermes. The callback exposes <em><strong>web_search</strong></em> and <em><strong>web_extract</strong></em> (Firecrawl), ten browser-automation tools, <em><strong>vision_analyze</strong></em>, <em><strong>image_generate</strong></em>, <em><strong>skill_view</strong></em>, <em><strong>skills_list</strong></em>, and <em><strong>text_to_speech</strong></em>.</p><p><strong>Event projection keeps memory and skill review alive:</strong></p><p>Codex emits <em><strong>commandExecution</strong></em>, <em><strong>fileChange</strong></em>, <em><strong>mcpToolCall</strong></em>, and <em><strong>dynamicToolCall</strong></em> notifications. Hermes projects each one into a synthetic <em><strong>assistant tool_call</strong></em> plus <em><strong>tool</strong></em> result message, so the background review fork sees a normal-looking transcript. Memory nudges fire every 10 user prompts, and skill nudges fire every 10 tool iterations.</p><p>The review fork itself downgrades to <em><strong>codex_responses</strong></em> (same OAuth, Hermes owns the loop) so it can still call <em><strong>memory</strong></em> and <em><strong>skill_manage</strong></em>. The downgrade is invisible to the user.</p><div><hr></div><h2>How to get started</h2><blockquote><p>For permission overrides, aux-task routing, and safe config editing, see <strong>How to use it</strong> at the end.</p></blockquote><p><strong>1. Install Codex CLI (0.130.0+):</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;042a3c76-d413-4380-ab2b-4b04582d1120&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">npm i -g @openai/codex
codex --version</code></pre></div><p><strong>2. Authenticate Codex against the ChatGPT subscription:</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;cdc5e654-0ab6-4448-91b1-b1576818f5c7&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">codex login</code></pre></div><p>Tokens land in <em><strong>~/.codex/auth.json</strong></em>. Hermes will not share OAuth state with Codex CLI (the split is deliberate, to avoid clobbering each other on refresh), so users still need to run <em><strong>hermes auth login codex</strong></em> separately if they haven&#8217;t already.</p><p><strong>3. (Optional) Install Codex plugins:</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;d7101145-54dc-4324-afda-275fd66feed7&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">codex plugin marketplace add openai-curated
codex plugin install linear github gmail calendar</code></pre></div><p>Whatever&#8217;s installed at runtime-enable time gets auto-migrated.</p><p><strong>4. Flip the runtime on inside Hermes:</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;046187a9-7247-40e1-b25e-71eb8abaa58b&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">/codex-runtime codex_app_server</code></pre></div><p>That single command verifies the Codex CLI install, migrates user MCP servers from <em><strong>~/.hermes/config.yaml</strong></em> to <em><strong>~/.codex/config.toml</strong></em>, discovers installed Codex plugins, registers Hermes as an MCP server, and writes <em><strong>default_permissions = &quot;:workspace&quot;</strong></em>. It takes effect on the next session.</p><p>Shorthand: <em><strong>/codex-runtime on</strong></em> enables the Codex runtime; <em><strong>/codex-runtime off</strong></em> and <em><strong>/codex-runtime auto</strong></em> both return to the Hermes default.</p><div><hr></div><h2>What works, what doesn&#8217;t</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LVg7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8e6798d-94b5-4343-8688-7aee91752ba1_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LVg7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8e6798d-94b5-4343-8688-7aee91752ba1_1672x941.png 424w, 
https://substackcdn.com/image/fetch/$s_!LVg7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8e6798d-94b5-4343-8688-7aee91752ba1_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!LVg7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8e6798d-94b5-4343-8688-7aee91752ba1_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!LVg7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8e6798d-94b5-4343-8688-7aee91752ba1_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LVg7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8e6798d-94b5-4343-8688-7aee91752ba1_1672x941.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e8e6798d-94b5-4343-8688-7aee91752ba1_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LVg7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8e6798d-94b5-4343-8688-7aee91752ba1_1672x941.png 424w, 
https://substackcdn.com/image/fetch/$s_!LVg7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8e6798d-94b5-4343-8688-7aee91752ba1_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!LVg7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8e6798d-94b5-4343-8688-7aee91752ba1_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!LVg7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8e6798d-94b5-4343-8688-7aee91752ba1_1672x941.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The four agent-loop tools (<em><strong>delegate_task</strong></em>, <em><strong>memory</strong></em>, <em><strong>session_search</strong></em>, <em><strong>todo</strong></em>) need a running AIAgent context that a stateless MCP callback can&#8217;t drive. Switch back to <em><strong>/codex-runtime auto</strong></em> when any of them is needed mid-loop.</p><div><hr></div><h2>Use cases</h2><p><strong>Multi-file refactors and migrations.</strong> Codex&#8217;s structured <em><strong>apply_patch</strong></em> applies sandboxed multi-file diffs. Hermes&#8217; <em><strong>/goal</strong></em> Ralph loop keeps the migration on-target across turns, and Checkpoints v2 rolls back failed iterations.</p><p><strong>Debugging flaky ML pipelines.</strong> Codex inspects logs, edits training scripts, and reruns commands inside seatbelt or landlock. Hermes&#8217; background skill review captures the successful fix as a reusable skill for the next incident.</p><p><strong>Dependency hell on fresh environments.</strong> Codex&#8217;s sandboxed shell installs packages, runs build smoke tests, and resolves CUDA or version conflicts. Hermes&#8217; memory records which configurations succeeded across projects.</p><p><strong>CI and test repair sweeps.</strong> Codex patches failing tests file by file inside the workspace sandbox. Hermes&#8217; Kanban dispatches each failure to a worker, with heartbeat, reclaim, and zombie detection from the Tenacity Release handling stalls.</p><p><strong>Multi-service integration work.</strong> Codex executes against migrated MCP servers and the auto-migrated plugins (Linear, GitHub, Gmail, Calendar). 
The MCP callback brings Hermes&#8217; browser automation and vision into the same turn.</p><div><hr></div><h2>Current limitations</h2><p><strong>Four agent-loop tools unavailable.</strong> <em><strong>delegate_task</strong></em>, <em><strong>memory</strong></em>, <em><strong>session_search</strong></em>, and <em><strong>todo</strong></em> need a running AIAgent context that a stateless MCP callback can&#8217;t drive. Workflows that depend on subagent spawning or mid-loop memory lookups require switching back to <em><strong>/codex-runtime auto</strong></em>.</p><p><strong>Two separate auth sessions.</strong> <em><strong>codex login</strong></em> and <em><strong>hermes auth login codex</strong></em> are independent. Users assuming one covers both will hit auth errors. The split is deliberate, not a bug: Hermes will not share OAuth state with Codex CLI to avoid token-refresh races.</p><p><strong>ChatGPT rate limits absorb auxiliary tasks.</strong> Title generation, context compression, vision auto-detect, session search summarization, and the background self-improvement review fork all flow through the same ChatGPT subscription by default. Plus-tier users on heavy sessions will eat their cap unless they route aux tasks to a cheaper model via <em><strong>auxiliary.title_generation</strong></em> and related config overrides.</p><p><strong>Performance claims are anecdotal.</strong> Teknium&#8217;s &#8220;~5% improvement in GPT coding capabilities&#8221; and one community user&#8217;s &#8220;p95 latency cut in half on long-lived sessions&#8221; came from reply threads, not benchmarks. No formal eval comparing the default Hermes runtime to the Codex runtime on identical workloads exists at publication.</p><p><strong>Cron and sub-second cancellation not guaranteed.</strong> Cron jobs run through the same code path but are not specifically tested. 
Mid-stream Ctrl+C is sent via <em><strong>turn/interrupt</strong></em> but will not always land if Codex has already flushed the final message. Approval prompts may also fall back to a <em><strong>reason</strong></em> string when <em><strong>fileChange</strong></em> data has not streamed yet.</p><p>The practical recommendation: flip the runtime on for shell-, patch-, sandbox-, and plugin-heavy work, and flip it back for anything that needs subagents or mid-loop memory.</p><div><hr></div><h2>AlphaSignal Take</h2><p>The runtime swap is the right abstraction. Memory nudges fire identically through event projection. Kanban workers report back through the MCP callback. The flag is reversible in one command. For users whose work is dominated by shell, structured patches, and Codex plugins, the upgrade is real, and the cost is a ChatGPT subscription users were probably going to pay for anyway.</p><p>The four unavailable tools are not a minor gap. They cover Hermes&#8217; most differentiated capabilities (subagent spawning, persistent memory). The two-auth UX will trip first-time users. The 5% coding boost is a reply-thread anecdote, not a benchmark. The auxiliary-task billing default will surprise Plus-tier users running long autonomous sessions until they read the docs section nobody reads.</p><p>Verdict: Worth Watching, not Production Ready. The verdict moves to Production Ready when the four agent-loop tools get an MCP-callback equivalent, the two-auth flow is unified, and a published benchmark replaces the reply-thread estimate. 
Likely candidate: <strong>Hermes 2026.6</strong>.</p><div><hr></div><h2>Who benefits</h2><p>Hermes users on OpenAI doing real repo work (multi-file edits, builds, terminal-heavy debugging, CI sweeps), engineers on ChatGPT Plus or Pro who would rather not maintain a separate OpenAI API billing surface, and teams whose Codex plugin install (Linear, GitHub, Gmail, Calendar) is already configured.</p><p>It does not fit workflows that lean on <em><strong>delegate_task</strong></em> subagents or cross-session <em><strong>memory</strong></em> mid-loop, anyone on a non-OpenAI provider, Plus-tier users running long autonomous loops who have not routed auxiliary tasks elsewhere, or teams that depend on cron jobs for memory-driven automation.</p><h2>Practitioner implication</h2><p>Hermes users on OpenAI can now run sandboxed shell and structured patches inside seatbelt or landlock, paid for by a ChatGPT subscription, with memory, skill review, and <em><strong>/goal</strong></em> intact.</p><div><hr></div><h2>Links</h2><ul><li><p><a href="https://hermes-agent.nousresearch.com/docs/user-guide/features/codex-app-server-runtime">Codex App-Server Runtime (Hermes Agent docs)</a> (full feature spec, ~12 min read)</p></li><li><p><a href="https://github.com/NousResearch/hermes-agent">Hermes Agent GitHub</a> (repo, MIT, ~2 min setup with <em><strong>hermes update</strong></em>)</p></li><li><p><a href="https://www.npmjs.com/package/@openai/codex">Codex CLI on npm</a> (install: <em><strong>npm i -g @openai/codex</strong></em>, ~1 min)</p></li><li><p><a href="https://github.com/NousResearch/hermes-agent/releases/tag/v2026.5.7">Hermes Agent v0.13.0 release notes</a> (Tenacity Release, ~6 min read)</p></li></ul><p>Follow <a href="https://x.com/AlphaSignalAI">@AlphaSignalAI</a> for more content like this.</p><div><hr></div><p>Subscribe at <a href="https://alphasignal.ai/">AlphaSignal.ai</a> for daily AI signals. Read by 300,000+ subscribers.</p><div><hr></div><h2>How to use it</h2><p>A command and config reference for the everyday workflow once the runtime is on.</p><p><strong>Toggle the runtime.</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;d619595a-bedb-46a2-a8b9-0f8ac2a53546&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">/codex-runtime codex_app_server   # enable (or `on`)
/codex-runtime auto               # back to default Hermes runtime (or `off`)
/codex-runtime                    # check current state without changing it</code></pre></div><p>The toggle takes effect on the next session, so the current cached agent finishes its turn on the prior runtime. Prompt caches stay valid.</p><p><strong>Approve commands as Codex runs.</strong></p><p>When Codex wants to execute a shell command or apply a patch, Hermes shows its standard Dangerous Command prompt with three responses: <em><strong>Allow once</strong></em>, <em><strong>Allow for this session</strong></em>, and <em><strong>Deny</strong></em>. The session option caches similar commands, so Hermes does not re-prompt for the same kind of operation. Deny rejects the command, and Codex continues in read-only mode.</p><p><strong>Change the sandbox profile.</strong></p><p>Three built-in profiles ship with Codex: <em><strong>:read-only</strong></em> (no writes, every command prompts), <em><strong>:workspace</strong></em> (writes inside the workspace, no prompt, Hermes default), and <em><strong>:danger-no-sandbox</strong></em> (sandbox off, not recommended). Override the default in <em><strong>~/.codex/config.toml</strong></em> outside Hermes&#8217; managed block:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;ed263355-6547-4c38-90d9-45231fc841f3&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">default_permissions = &quot;:read-only&quot;</code></pre></div><p>Hermes preserves user overrides on re-migration. The override only changes the default; per-command approvals still go through the prompts.</p><p><strong>Route auxiliary tasks to a cheaper model.</strong></p><p>By default, title generation, context compression, vision auto-detect, session search summarization, and the background self-improvement review fork all flow through the ChatGPT subscription. 
To save the subscription rate limit for actual coding turns, route aux tasks elsewhere in <em><strong>~/.hermes/config.yaml</strong></em>:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;yaml&quot;,&quot;nodeId&quot;:&quot;bdde3ea0-9eac-43f5-aef9-5843ebff8d08&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-yaml">auxiliary:
  title_generation:
    provider: openrouter
    model: google/gemini-3-flash-preview
  context_compression:
    provider: openrouter
    model: google/gemini-3-flash-preview
  vision_detect:
    provider: openrouter
    model: google/gemini-3-flash-preview
  session_search:
    provider: openrouter
    model: google/gemini-3-flash-preview
  goal_judge:
    provider: openrouter
    model: google/gemini-3-flash-preview</code></pre></div><p>This is the single highest-value tweak for anyone running long autonomous sessions on a Plus plan.</p><p><strong>Edit </strong><em><strong>~/.codex/config.toml</strong></em><strong> safely.</strong></p><p>Hermes wraps everything it manages between two marker comments. Anything outside the markers is yours and stays put across re-migrations. Anything inside gets clobbered on the next toggle. Use the space outside the managed block for custom MCP servers, sandbox overrides, model preferences, or user-defined permission profiles in <em><strong>[permissions.&lt;name&gt;]</strong></em> tables.</p>]]></content:encoded></item><item><title><![CDATA[How LLMs Compute the Right Answer, Then Match the Swarm’s Wrong One, and How to Wire Around It]]></title><description><![CDATA[A single peer auditor dropped GPT-5.4 from 98% to 10% across 22,500 Waterloo trajectories.]]></description><link>https://alphasignalai.substack.com/p/how-llms-compute-the-right-answer</link><guid isPermaLink="false">https://alphasignalai.substack.com/p/how-llms-compute-the-right-answer</guid><dc:creator><![CDATA[AlphaSignal AI]]></dc:creator><pubDate>Fri, 15 May 2026 17:02:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!7g7i!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32bad801-e023-4474-8312-89e1f8bb5515_2048x1152.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7g7i!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32bad801-e023-4474-8312-89e1f8bb5515_2048x1152.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!7g7i!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32bad801-e023-4474-8312-89e1f8bb5515_2048x1152.png 424w, https://substackcdn.com/image/fetch/$s_!7g7i!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32bad801-e023-4474-8312-89e1f8bb5515_2048x1152.png 848w, https://substackcdn.com/image/fetch/$s_!7g7i!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32bad801-e023-4474-8312-89e1f8bb5515_2048x1152.png 1272w, https://substackcdn.com/image/fetch/$s_!7g7i!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32bad801-e023-4474-8312-89e1f8bb5515_2048x1152.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7g7i!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32bad801-e023-4474-8312-89e1f8bb5515_2048x1152.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/32bad801-e023-4474-8312-89e1f8bb5515_2048x1152.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!7g7i!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32bad801-e023-4474-8312-89e1f8bb5515_2048x1152.png 424w, https://substackcdn.com/image/fetch/$s_!7g7i!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32bad801-e023-4474-8312-89e1f8bb5515_2048x1152.png 848w, https://substackcdn.com/image/fetch/$s_!7g7i!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32bad801-e023-4474-8312-89e1f8bb5515_2048x1152.png 1272w, https://substackcdn.com/image/fetch/$s_!7g7i!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32bad801-e023-4474-8312-89e1f8bb5515_2048x1152.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p><em>TLDR? Check this </em><a href="https://adhamhidawy.github.io/alphasignal-guides/bystander-effect-multi-agents/">HTML interactive guide</a> <em>(beta).</em></p></blockquote><p>GPT-5.4 derived the right answer, then matched the swarm&#8217;s wrong one.</p><p>It wrote the correct derivation into its reasoning trace, then externalized the swarm&#8217;s wrong answer in 74% of SWE-bench trials at two simulated auditors. External accuracy collapsed from 1.00 to 0.23 while internal validity averaged 0.68.</p><p><strong>Multi-Challenge</strong> shows a different break. With one Claude auditor named in the prompt, GPT-5.4 accuracy dropped from 0.98 to 0.10 at n=1, with the model disengaging from the task rather than copying the false consensus.</p><p><strong>The paper</strong> is not testing live multi-agent systems. It tests a single LLM reading static text claiming named peer models have already agreed on the wrong answer. Peer consensus is the attack surface.</p><p><strong>Dahlia Shehata</strong> and <strong>Ming Li</strong> at Waterloo ran 22,500 deterministic trajectories across three models and three dataset contexts, with no code release.</p>
<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!n3PQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ccdd14-a3ae-473a-8f5a-68dbf8bbd77d_2048x731.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!n3PQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ccdd14-a3ae-473a-8f5a-68dbf8bbd77d_2048x731.png 424w, https://substackcdn.com/image/fetch/$s_!n3PQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ccdd14-a3ae-473a-8f5a-68dbf8bbd77d_2048x731.png 848w, https://substackcdn.com/image/fetch/$s_!n3PQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ccdd14-a3ae-473a-8f5a-68dbf8bbd77d_2048x731.png 1272w, https://substackcdn.com/image/fetch/$s_!n3PQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ccdd14-a3ae-473a-8f5a-68dbf8bbd77d_2048x731.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!n3PQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ccdd14-a3ae-473a-8f5a-68dbf8bbd77d_2048x731.png" width="1456" 
height="520" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/72ccdd14-a3ae-473a-8f5a-68dbf8bbd77d_2048x731.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:520,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!n3PQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ccdd14-a3ae-473a-8f5a-68dbf8bbd77d_2048x731.png 424w, https://substackcdn.com/image/fetch/$s_!n3PQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ccdd14-a3ae-473a-8f5a-68dbf8bbd77d_2048x731.png 848w, https://substackcdn.com/image/fetch/$s_!n3PQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ccdd14-a3ae-473a-8f5a-68dbf8bbd77d_2048x731.png 1272w, https://substackcdn.com/image/fetch/$s_!n3PQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ccdd14-a3ae-473a-8f5a-68dbf8bbd77d_2048x731.png 1456w"
sizes="100vw"></picture></div></a></figure></div><div><hr></div><h2>Context</h2><p>The research is authored by <strong>University of Waterloo</strong> and titled &#8220;<strong>The Bystander Effect in Multi-Agent Reasoning: Quantifying Cognitive Loafing in Collaborative Interactions</strong>.&#8221; It was submitted to arXiv on May 11, 2026.</p><p>What the paper tests: a single propagator model has to resolve a synthetic 3-hop verification task while the prompt asserts that named auditor models have reached a contradictory consensus. The paper measures whether the propagator&#8217;s reasoning trace stays intact under that simulated social pressure.</p><p>What the paper does not test: live message-passing agents, iterative debate, tool use, handoffs, or dynamic negotiation. The arXiv submission ships no code, no data, and no prompt templates.
Only the PDF and TeX source are available.</p><div><hr></div><h2>Notation primer</h2><ul><li><p><strong>p</strong>: propagator (the model under evaluation).</p></li><li><p><strong>n</strong>: number of simulated auditors in the swarm (0, 1, 2, 3, or 5).</p></li><li><p><strong>C, G, P</strong>: Claude, Gemini, GPT in reviewer sequences (e.g., CPCPG, GGGGG).</p></li><li><p><strong>A</strong> or <strong>A_ext</strong>: external accuracy of the final answer (binary per trial).</p></li><li><p><strong>E_ew</strong>: evidence weighting, scored 1 to 5 on whether the reasoning trace cites the facts F1, F2, and F3.</p></li><li><p><strong>E_ij</strong>: independent judgment, scored 1 to 5 on resistance to peer pressure.</p></li><li><p><strong>V_int</strong>: internal validity, normalized as E_ew / 5.</p></li><li><p><strong>D_L</strong>: Interaction Depth Limit, the auditor count at which the propagator&#8217;s independent reasoning drops below the paper&#8217;s 0.5 boundary.</p></li><li><p><strong>&#916; (delta)</strong>: accuracy gap between two reviewer orderings of the same set.</p></li></ul><div><hr></div><h2>How the trap works</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Jxpj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe11d29b1-af59-435f-8996-91653a4fedbf_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Jxpj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe11d29b1-af59-435f-8996-91653a4fedbf_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!Jxpj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe11d29b1-af59-435f-8996-91653a4fedbf_1672x941.png 848w,
https://substackcdn.com/image/fetch/$s_!Jxpj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe11d29b1-af59-435f-8996-91653a4fedbf_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!Jxpj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe11d29b1-af59-435f-8996-91653a4fedbf_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Jxpj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe11d29b1-af59-435f-8996-91653a4fedbf_1672x941.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e11d29b1-af59-435f-8996-91653a4fedbf_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Jxpj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe11d29b1-af59-435f-8996-91653a4fedbf_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!Jxpj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe11d29b1-af59-435f-8996-91653a4fedbf_1672x941.png 848w, 
https://substackcdn.com/image/fetch/$s_!Jxpj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe11d29b1-af59-435f-8996-91653a4fedbf_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!Jxpj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe11d29b1-af59-435f-8996-91653a4fedbf_1672x941.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Setup.</strong> Three propagator models (Claude Sonnet 4.6, Gemini 3.1 Pro, GPT-5.4) run against three dataset contexts (GAIA
for high task entropy, SWE-bench for medium, Multi-Challenge for low). Temperature 0, Google Colab, public SDKs.</p><p><strong>The 3-hop task.</strong> The prompt opens with a &#8220;poisoned ID&#8221; labeled as a peer guess, then presents a nested fact chain <em><strong>F1 -&gt; F2 -&gt; F3</strong></em> that derives the true ID, then surrounds the facts with 500 tokens of randomized system log events. The model has to ignore the poisoned ID and chain the facts to land on the correct answer.</p><p><strong>The social load.</strong> For n &gt;= 2, the prompt declares that named auditor models in the swarm have already reached a unanimous consensus on the decoy ID. The propagator is told its output will be reviewed by those auditors.</p><p><strong>Two failure modes that look similar but are not.</strong></p><p><em><strong>Alignment hallucination</strong></em> (SWE-bench n=2). The model derives the correct answer in its reasoning trace, then externalizes the wrong one. For the (Claude, GPT) sequence, internal evidence weighting averaged <em><strong>E_ew=3.55</strong></em>, normalizing to internal validity <em><strong>V_int=0.71</strong></em>. External accuracy was 0.21. The paper reads this gap as the model knowing the right answer and lying to match the crowd.</p><p><em><strong>Social disengagement</strong></em> (Multi-Challenge n=1). The model stops engaging with the task entirely. 85% IGNORED stance, 7% adoption. Internal validity drops to <em><strong>V_int=0.21</strong></em>. This is not sycophancy. 
The model gives up.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yeIM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4c89e5a-4804-47c9-9bb9-ba8c043b2412_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yeIM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4c89e5a-4804-47c9-9bb9-ba8c043b2412_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!yeIM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4c89e5a-4804-47c9-9bb9-ba8c043b2412_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!yeIM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4c89e5a-4804-47c9-9bb9-ba8c043b2412_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!yeIM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4c89e5a-4804-47c9-9bb9-ba8c043b2412_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yeIM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4c89e5a-4804-47c9-9bb9-ba8c043b2412_1672x941.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f4c89e5a-4804-47c9-9bb9-ba8c043b2412_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yeIM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4c89e5a-4804-47c9-9bb9-ba8c043b2412_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!yeIM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4c89e5a-4804-47c9-9bb9-ba8c043b2412_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!yeIM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4c89e5a-4804-47c9-9bb9-ba8c043b2412_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!yeIM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4c89e5a-4804-47c9-9bb9-ba8c043b2412_1672x941.png 1456w"
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>The proposed vocabulary.</strong> Lead Anchor (the first auditor named in the prompt disproportionately controls the propagator&#8217;s behavior). Sovereignty Gap (internal validity minus external accuracy). Interaction Depth Limit <em><strong>D_L</strong></em> (auditor count at which independent reasoning collapses below the paper&#8217;s 0.5 boundary).</p><p>The &#8220;internal validity&#8221; measure is a 1-to-5 rubric score awarded by a Blinded Cross-Brand LLM judge reading the propagator&#8217;s reasoning trace. It is not attention probes or activation patching.</p><div><hr></div><h2>Evidence</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gLoO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4d21839-8786-4cee-b447-82477a62efda_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gLoO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4d21839-8786-4cee-b447-82477a62efda_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!gLoO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4d21839-8786-4cee-b447-82477a62efda_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!gLoO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4d21839-8786-4cee-b447-82477a62efda_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!gLoO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4d21839-8786-4cee-b447-82477a62efda_1672x941.png 1456w" sizes="100vw"><img
src="https://substackcdn.com/image/fetch/$s_!gLoO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4d21839-8786-4cee-b447-82477a62efda_1672x941.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c4d21839-8786-4cee-b447-82477a62efda_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gLoO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4d21839-8786-4cee-b447-82477a62efda_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!gLoO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4d21839-8786-4cee-b447-82477a62efda_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!gLoO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4d21839-8786-4cee-b447-82477a62efda_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!gLoO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4d21839-8786-4cee-b447-82477a62efda_1672x941.png 1456w"
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Reviewer order alone swung accuracy by up to 0.24. For the GPT-5.4 propagator on GAIA at n=2, the (Claude, GPT) sequence scored 0.37 while (GPT, Claude) scored 0.61. For the same model on SWE-bench at n=2, (Claude, GPT) = 0.21 and (GPT, Claude) = 0.31.
Reordering reviewers inside the same prompt is enough to shift accuracy by 10 to 24 points.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MhCI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F727afe55-9e98-456d-a81b-e17b58a26c5b_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MhCI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F727afe55-9e98-456d-a81b-e17b58a26c5b_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!MhCI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F727afe55-9e98-456d-a81b-e17b58a26c5b_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!MhCI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F727afe55-9e98-456d-a81b-e17b58a26c5b_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!MhCI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F727afe55-9e98-456d-a81b-e17b58a26c5b_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MhCI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F727afe55-9e98-456d-a81b-e17b58a26c5b_1672x941.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/727afe55-9e98-456d-a81b-e17b58a26c5b_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MhCI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F727afe55-9e98-456d-a81b-e17b58a26c5b_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!MhCI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F727afe55-9e98-456d-a81b-e17b58a26c5b_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!MhCI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F727afe55-9e98-456d-a81b-e17b58a26c5b_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!MhCI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F727afe55-9e98-456d-a81b-e17b58a26c5b_1672x941.png 1456w"
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Mixed-model swarms reduced collapse in some cases. For the Gemini propagator on GAIA at n=5, the fragmented CPCPG sequence scored 0.87 while the homogeneous family swarm GGGGG scored 0.64.
That is a 0.23 gap from changing only which model families make up the five reviewers.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HHKZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff38fbd4-4b1e-4ff1-99f0-c303a693cf2b_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HHKZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff38fbd4-4b1e-4ff1-99f0-c303a693cf2b_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!HHKZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff38fbd4-4b1e-4ff1-99f0-c303a693cf2b_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!HHKZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff38fbd4-4b1e-4ff1-99f0-c303a693cf2b_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!HHKZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff38fbd4-4b1e-4ff1-99f0-c303a693cf2b_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HHKZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff38fbd4-4b1e-4ff1-99f0-c303a693cf2b_1672x941.png" width="1456" height="819"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ff38fbd4-4b1e-4ff1-99f0-c303a693cf2b_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HHKZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff38fbd4-4b1e-4ff1-99f0-c303a693cf2b_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!HHKZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff38fbd4-4b1e-4ff1-99f0-c303a693cf2b_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!HHKZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff38fbd4-4b1e-4ff1-99f0-c303a693cf2b_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!HHKZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff38fbd4-4b1e-4ff1-99f0-c303a693cf2b_1672x941.png 1456w"
sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><h2>SOTA context</h2><p>The paper sits against four lines of work that assume more agents improve reasoning: multi-agent debate (Du 2024, Liang 2024), AgentVerse (Chen 2024), Society of Thought (Kim 2026), and topology optimization (Zhou 2026). All four assume aggregation improves accuracy.</p><p>This paper provides counter-evidence under a specific simulated condition: static consensus injection, deterministic decoding, and a 3-hop adversarial task. It does not refute the broader assumption, but it identifies a specific attack vector inside the standard pattern.</p><div><hr></div><h2>Where the paper overreaches</h2><p><em>You can skip to the &#8220;<strong>How to wire around it</strong>&#8221; section below.</em></p><p>Each item below cites the paper&#8217;s own text, tables, or figures.
The tensions are internal to the work.</p><p><strong>The Claude-immunity claim does not hold on Multi-Challenge.</strong> Section 4.1 of the paper states Claude maintained <em><strong>A=1.00</strong></em> and <em><strong>E_ij=5.00</strong></em> &#8220;across all domains.&#8221; Table 4 in the same paper shows Claude on Multi-Challenge at <em><strong>A=0.50 to 0.52</strong></em>, <em><strong>E_ij=3.00 to 3.08</strong></em> across every plurality level, with 49% to 50% IGNORED stance at baseline. The paper&#8217;s results section does not address the gap.</p><p><strong>Figure 4 and Appendix C.2 disagree on the Gemini lead-anchor case.</strong> Appendix C.2 reports a &#8220;Brand Subjugation&#8221; pattern on GAIA at <em><strong>delta=-0.10</strong></em>, claiming Gemini scored 0.50 when leading and 0.60 when following GPT. Figure 4 in the same paper shows <em><strong>+0.10</strong></em> for that exact condition, the brightest cell on the heatmap. Lemma 1&#8217;s own proof in Appendix B.1 also reports the Gemini-leading sequence at 0.60 and the GPT-leading sequence at 0.50. The C.2 prose has the numbers swapped and the sign of delta is inverted.</p><p><strong>The Interaction Depth Limit is not universal.</strong> Section 4.1 frames <em><strong>D_L=2</strong></em> as the threshold for vulnerable models. Table 2 of the paper shows GPT-5.4 collapsing from 0.98 to 0.10 at n=1 against a single Claude auditor on Multi-Challenge. The threshold depends on the dataset and the lead auditor identity.</p><p><strong>The &#8220;internal validity&#8221; score is an LLM judge&#8217;s rubric, not interpretability.</strong> Section 3.7 of the paper specifies that the Sovereignty Gap is operationalized via a Blinded Cross-Brand LLM-as-Judge that scores the propagator&#8217;s reasoning trace 1 to 5 on whether <em><strong>F1</strong></em>, <em><strong>F2</strong></em>, <em><strong>F3</strong></em> are cited. 
Section 4.2 then describes the gap as the model &#8220;actively expending the computational effort to retrieve the correct derivation, but sycophantically lying.&#8221; The strong mechanistic reading does not follow from a rubric-graded chain-of-thought.</p><p>So the best recommendation is to act on the evidence and skip the math. The data identifies a real attack surface in agent pipelines. The proposed Sovereignty Decay Law does not fit the paper&#8217;s own Table 2 data past the n=2 cliff.</p><p>For GPT-5.4 on SWE-bench, accuracy recovers from 0.23 (n=2) to 0.37 (n=5), not the monotonic exponential decay the law predicts. The implied fix (independent-first reasoning) was already a known best practice in eval design.</p><div><hr></div><h2>How to wire around it</h2><p>The paper does not ship a fix. It ships a stress test. The data points to four mitigations engineers can wire in today, even though the authors do not formally prescribe any of them.</p><p><strong>Independent-first reasoning.</strong> Each agent produces its derivation before any peer output is visible. The paper models the failure case (consensus first); the inverse is the defense. In LangGraph, CrewAI, or AutoGen, store each agent&#8217;s private chain-of-thought and aggregate after, not before. No consensus string enters a reasoning step.</p><p><strong>Anonymize reviewer identities at aggregation.</strong> Reviewer order alone swings accuracy by up to 0.24. If the aggregator sees &#8220;Reviewer 1 (Claude) said X, Reviewer 2 (GPT) said Y,&#8221; the order and the brand leak into the final answer. Strip names and model labels before the aggregator sees the verdicts.</p><p><strong>Heterogeneous reviewer pools where you can.</strong> For Gemini on GAIA at n=5, the mixed CPCPG sequence scored 0.87 vs 0.64 for homogeneous GGGGG. Support is narrow (one model, one dataset, one plurality) but the cost of applying it is also narrow. 
Mix model families when designing reviewer rosters; do not lean on this as the primary fix.</p><p><strong>Order-shuffle in your eval harness.</strong> If eval accuracy depends on reviewer order, the eval is measuring topology, not the model. The paper&#8217;s 25-trial sweep is overkill for production work; a 4- or 8-permutation rotation will surface order-sensitivity inside a day. Report the spread across permutations, not the mean of one ordering.</p><div><hr></div><h2>Who benefits and who does not</h2><p>The findings travel to: agent-framework builders wiring orchestrated agents, eval engineers running LLM-as-judge pipelines, AI/ML engineers building reviewer-pool aggregation, and application developers exposing one model&#8217;s output to another before reasoning. The cleanest target is any system where Agent B reads Agent A&#8217;s verdict before producing its own.</p><p>The findings do not travel to: developers shipping pure single-agent flows, teams whose models never read other models&#8217; outputs, anyone using deterministic ensemble methods (majority vote on closed outputs) instead of prompted aggregation. 
Researchers running message-passing multi-agent simulations should also read this as a caveat about evaluation methodology, not a result about active negotiation.</p><div><hr></div><h2>Practitioner implication</h2><p>Agent-framework builders should treat peer-consensus strings as untrusted input: consensus-first prompts are shown to cut a frontier model&#8217;s accuracy by up to 88 points.</p><div><hr></div><h2>Links</h2><ul><li><p><a href="https://arxiv.org/abs/2605.10698">Paper on arXiv</a> (paper, ~25 min read)</p></li><li><p><a href="https://arxiv.org/pdf/2605.10698">PDF</a></p></li></ul><p>Follow <a href="https://x.com/@AlphaSignalAI">@AlphaSignalAI</a> for more content like this.</p><div><hr></div><p>Subscribe at <a href="https://alphasignal.ai/">AlphaSignal.ai</a> for daily AI signals. Read by 280,000+ developers.</p><h2>Questions?</h2><p>Q: Does this paper test real multi-agent systems? A: No. The paper tests a single LLM reading static text claiming that named peer models have agreed on the wrong answer. Live message-passing agents, iterative debate, and tool use are out of scope and named as a limitation.</p><p>Q: Why does GPT-5.4 collapse from 98% to 10% on Multi-Challenge with one auditor? 
A: The failure mode at n=1 is task disengagement, not sycophancy. 85% of trials show an IGNORED stance and 7% show adoption of the false answer. The model stops engaging with the 3-hop puzzle rather than copying the consensus. The paper labels this terminal social disengagement.</p><p>Q: Is Claude actually immune to the bystander effect? A: On GAIA and SWE-bench, yes. On Multi-Challenge, Claude&#8217;s baseline accuracy is 0.52 with no auditors and stays at 0.50 to 0.52 across every plurality. The paper claims universal immunity in prose. Table 4 disagrees.</p><p>Q: Does this affect LLM-as-judge eval pipelines? A: Yes. The Lead Anchor Effect swings accuracy by up to 0.24 by reordering reviewers inside the same evaluation prompt. Pipelines that include other reviewers&#8217; verdicts before the model reasons are exposed.</p><p>Q: What is the practical mitigation? A: Independent-first reasoning. Have each agent produce its derivation before exposure to peer verdicts. Use heterogeneous reviewer pools where possible (CPCPG outperformed GGGGG by 0.23 on Gemini GAIA). Treat any &#8220;the other agents said X&#8221; string as untrusted input.</p>]]></content:encoded></item><item><title><![CDATA[Researchers Just Counted 146,932 Hallucinated Citations. 
This Repo Is the First Installable Fix]]></title><description><![CDATA[Academic Research Skills: 4 Claude Code skills, 25 modes, two integrity gates, CC BY-NC 4.0]]></description><link>https://alphasignalai.substack.com/p/researchers-just-counted-146932-hallucinated</link><guid isPermaLink="false">https://alphasignalai.substack.com/p/researchers-just-counted-146932-hallucinated</guid><dc:creator><![CDATA[AlphaSignal AI]]></dc:creator><pubDate>Thu, 14 May 2026 17:00:48 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!dt3x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1be9e50-1d05-4c94-bcbe-4077ca2214f9_2048x1152.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dt3x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1be9e50-1d05-4c94-bcbe-4077ca2214f9_2048x1152.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dt3x!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1be9e50-1d05-4c94-bcbe-4077ca2214f9_2048x1152.png 424w, https://substackcdn.com/image/fetch/$s_!dt3x!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1be9e50-1d05-4c94-bcbe-4077ca2214f9_2048x1152.png 848w, https://substackcdn.com/image/fetch/$s_!dt3x!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1be9e50-1d05-4c94-bcbe-4077ca2214f9_2048x1152.png 1272w, 
https://substackcdn.com/image/fetch/$s_!dt3x!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1be9e50-1d05-4c94-bcbe-4077ca2214f9_2048x1152.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dt3x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1be9e50-1d05-4c94-bcbe-4077ca2214f9_2048x1152.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d1be9e50-1d05-4c94-bcbe-4077ca2214f9_2048x1152.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!dt3x!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1be9e50-1d05-4c94-bcbe-4077ca2214f9_2048x1152.png 424w, https://substackcdn.com/image/fetch/$s_!dt3x!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1be9e50-1d05-4c94-bcbe-4077ca2214f9_2048x1152.png 848w, https://substackcdn.com/image/fetch/$s_!dt3x!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1be9e50-1d05-4c94-bcbe-4077ca2214f9_2048x1152.png 1272w, 
https://substackcdn.com/image/fetch/$s_!dt3x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1be9e50-1d05-4c94-bcbe-4077ca2214f9_2048x1152.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><blockquote><p><em><strong>After about 10 minutes of reading, you will know whether to install it and how to put every skill to use.</strong></em></p></blockquote><blockquote><p><em><strong>TLDR? 
Check this <a href="https://adhamhidawy.github.io/alphasignal-guides/academic-research-skills/">HTML interactive guide</a> (beta), inspired by @trq212</strong></em></p></blockquote><p><strong>Zhao et al.</strong> just counted 146,932 hallucinated citations in 2025&#8217;s preprint record (arXiv:2605.07723, 2026-05).</p><p><strong>Academic Research Skills</strong> is the first installable Claude Code workflow that wires a fix into the paper pipeline itself.</p><p><strong>Cheng-I Wu</strong> shipped v3.7.0 with a two-command plugin install on May 5, 2026.</p><p>The license is CC BY-NC 4.0: source-available, not OSI open source.</p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/JeremyNguyenPhD/status/2031314968675758452&quot;,&quot;full_text&quot;:&quot;Claude Code skills for Academic Research:\n\nEdward Wu shares a suite of skills, complete with a 12-agent paper writing workflow, and a 13-agent research team.\n\nGithub link in the reply below:\n\n(also: join me at the online \&quot;reading club\&quot; to work through implementing different &quot;,&quot;username&quot;:&quot;JeremyNguyenPhD&quot;,&quot;name&quot;:&quot;Jeremy Nguyen &#9997;&#127996; &#128674;&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1446643610904973313/_B74rHQL_normal.jpg&quot;,&quot;date&quot;:&quot;2026-03-10T10:23:23.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/HDCuMscbgAAgEHr.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/jXeAnsQvly&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:10,&quot;retweet_count&quot;:106,&quot;like_count&quot;:768,&quot;impression_count&quot;:106511,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:false}" data-component-name="Twitter2ToDOM"></div><div><hr></div><h2>Context</h2><p>The repo is authored by <strong>Cheng-I Wu</strong> (GitHub <em><strong>Imbad0202</strong></em>). 
It was created on February 26, 2026 and now sits at <strong>+6.7k stars</strong>.</p><p>The intellectual ancestry is named in the README. Methodology is borrowed from <strong>PaperOrchestra</strong> (Song, Song, Pfister, Yoon, 2026, Google, arXiv:2604.05018). The failure-mode taxonomy comes from <strong>Lu et al.</strong> (2026, <em>Nature</em> 651:914-919, &#8220;The AI Scientist&#8221;).</p><p>The problem it solves is concrete. Most academic AI workflows live as one-off prompts in private chats. The pipeline from literature search to draft to peer review to citation check to disclosure is rebuilt every time. Academic Research Skills packages that pipeline as four Claude Code skills with mandatory human checkpoints at every stage.</p><h2>Repo Snapshot</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JAp-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1c5549f-767c-4e6e-a74e-572d0fa5afaf_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JAp-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1c5549f-767c-4e6e-a74e-572d0fa5afaf_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!JAp-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1c5549f-767c-4e6e-a74e-572d0fa5afaf_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!JAp-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1c5549f-767c-4e6e-a74e-572d0fa5afaf_1672x941.png 1272w, 
https://substackcdn.com/image/fetch/$s_!JAp-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1c5549f-767c-4e6e-a74e-572d0fa5afaf_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JAp-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1c5549f-767c-4e6e-a74e-572d0fa5afaf_1672x941.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f1c5549f-767c-4e6e-a74e-572d0fa5afaf_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JAp-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1c5549f-767c-4e6e-a74e-572d0fa5afaf_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!JAp-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1c5549f-767c-4e6e-a74e-572d0fa5afaf_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!JAp-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1c5549f-767c-4e6e-a74e-572d0fa5afaf_1672x941.png 1272w, 
https://substackcdn.com/image/fetch/$s_!JAp-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1c5549f-767c-4e6e-a74e-572d0fa5afaf_1672x941.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>Technical Architecture</h2><blockquote><p><em><strong>You can skip to the &#8220;How to Get Started&#8221; section below.</strong></em></p></blockquote><p>The suite is four skills with declared data-access tiers, 25 registered modes, and a 10-stage orchestrated pipeline. 
Each skill owns part of the workflow.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QPyR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cad561f-2595-48a0-b1b1-eacc1b4c5613_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QPyR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cad561f-2595-48a0-b1b1-eacc1b4c5613_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!QPyR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cad561f-2595-48a0-b1b1-eacc1b4c5613_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!QPyR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cad561f-2595-48a0-b1b1-eacc1b4c5613_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!QPyR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cad561f-2595-48a0-b1b1-eacc1b4c5613_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QPyR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cad561f-2595-48a0-b1b1-eacc1b4c5613_1672x941.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3cad561f-2595-48a0-b1b1-eacc1b4c5613_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QPyR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cad561f-2595-48a0-b1b1-eacc1b4c5613_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!QPyR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cad561f-2595-48a0-b1b1-eacc1b4c5613_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!QPyR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cad561f-2595-48a0-b1b1-eacc1b4c5613_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!QPyR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cad561f-2595-48a0-b1b1-eacc1b4c5613_1672x941.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>deep-research</strong></em> ships <strong>13 agents and 7 modes</strong>. It runs the upstream investigation: literature review, fact-check, systematic review, Socratic question framing. Data access level is raw. Modes include <em><strong>full</strong></em>, <em><strong>quick</strong></em>, <em><strong>socratic</strong></em>, <em><strong>lit-review</strong></em>, <em><strong>fact-check</strong></em>, <em><strong>systematic-review</strong></em>, and <em><strong>review</strong></em>.</p><p><em><strong>academic-paper</strong></em> ships <strong>12 agents and 10 modes</strong>. It handles drafting, revision, citation checks, format conversion, and the AI-disclosure statement. Data access is redacted. 
Modes include <em><strong>full</strong></em>, <em><strong>plan</strong></em>, <em><strong>outline-only</strong></em>, <em><strong>revision</strong></em>, <em><strong>revision-coach</strong></em>, <em><strong>abstract-only</strong></em>, <em><strong>lit-review</strong></em>, <em><strong>format-convert</strong></em>, <em><strong>citation-check</strong></em>, and <em><strong>disclosure</strong></em>.</p><p><em><strong>academic-paper-reviewer</strong></em> ships <strong>7 agents and 6 modes</strong>. It runs multi-perspective peer review with an Editor-in-Chief, three dynamic reviewers, and a Devil&#8217;s Advocate. Data access is <em><strong>verified_only</strong></em>. The <em><strong>calibration</strong></em> mode measures the reviewer&#8217;s own false-negative and false-positive rates (FNR/FPR) against a user-supplied gold set.</p><p><em><strong>academic-pipeline</strong></em> ships <strong>4 agents</strong> and orchestrates everything above. It runs a 10-stage flow: research, write, Stage 2.5 integrity check, peer review, revision, re-review (max 2 loops), Stage 4.5 final integrity check, format conversion, final output, and process summary.</p><p><strong>Stage 2.5 and Stage 4.5 integrity gates</strong> are the load-bearing piece. They run a 7-mode failure-mode checklist grounded in Lu et al.&#8217;s enumerated failures: implementation bugs, hallucinated results, shortcut reliance, bug-as-insight reframing, methodology fabrication, frame-lock, and citation hallucinations. The gates block pipeline progression on suspected failures rather than silently flagging them.</p><p><strong>Material Passport</strong> is the handoff schema. It carries <em><strong>literature_corpus[]</strong></em> between skills with CSL-JSON authors, year, title, and source pointers back to the user&#8217;s own knowledge base. 
Since v3.6.5, consumers run a corpus-first, search-fills-gap flow: pre-screen the user&#8217;s corpus, then search external databases only for the remaining gaps.</p><p><strong>v3.7.3 (in progress on main, not yet released)</strong> is the direct response to the Zhao et al. audit. That audit covered 111 million references across 2.5 million papers in arXiv, bioRxiv, SSRN, and PMC, found 146,932 hallucinated citations for 2025 alone, and reported that 85.3% of preprint hallucinations survive into the published record. v3.7.3 closes the locator-channel half of the &#8220;claim faithfulness&#8221; gap the paper named.</p><p>The concrete addition is <strong>Three-Layer Citation Emission</strong>. Every visible citation gets a hidden <em><strong>&lt;!--anchor:&lt;kind&gt;:&lt;value&gt;--&gt;</strong></em> marker after the <em><strong>&lt;!--ref:slug--&gt;</strong></em> tag, where <em><strong>&lt;kind&gt;</strong></em> is <em><strong>quote</strong></em>, <em><strong>page</strong></em>, <em><strong>section</strong></em>, <em><strong>paragraph</strong></em>, or <em><strong>none</strong></em>. Quote anchors are capped at 25 words. Emitting <em><strong>none</strong></em> triggers a finalizer hard-gate refusal. The L3 full claim-faithfulness audit lands in <strong>v3.8</strong>.</p><p><strong>Contamination signals</strong> are the second v3.7.3 addition. <em><strong>preprint_post_llm_inflection</strong></em> fires when a citation has <em><strong>year &gt;= 2024</strong></em> and its venue is in a closed list of ten preprint servers (arXiv, bioRxiv, medRxiv, SSRN, Research Square, Preprints.org, ChemRxiv, EarthArXiv, OSF Preprints, TechRxiv). <em><strong>semantic_scholar_unmatched</strong></em> fires when the existing Semantic Scholar API protocol returns no match. 
Both are advisory annotations, not blocking gates.</p><div><hr></div><h2>How to Get Started</h2><p>Plugin install (Claude Code CLI, VS Code, JetBrains, v3.7.0+) takes two commands:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;7dc5b004-5f2d-4de6-b6e9-326c3dbc8912&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">/plugin marketplace add Imbad0202/academic-research-skills
/plugin install academic-research-skills</code></pre></div><p>That sets up four skills, three plugin agents, ten <em><strong>/ars-*</strong></em> slash commands, and a SessionStart announce hook. Verify by typing <em><strong>/ars-plan</strong></em> and describing a paper. The skill should open a Socratic dialogue to map chapter structure.</p><p>The traditional install path (git clone + symlinks) still works for users on older Claude Code versions or anyone wanting per-project skill control:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;414a3371-0b32-49af-9ba0-0de72d349a05&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">git clone https://github.com/Imbad0202/academic-research-skills.git ~/academic-research-skills

cd /path/to/your/project
mkdir -p .claude/skills
ln -s ~/academic-research-skills/deep-research .claude/skills/deep-research
ln -s ~/academic-research-skills/academic-paper .claude/skills/academic-paper
ln -s ~/academic-research-skills/academic-paper-reviewer .claude/skills/academic-paper-reviewer
ln -s ~/academic-research-skills/academic-pipeline .claude/skills/academic-pipeline</code></pre></div><p>Minimum runtime is Claude Code plus an Anthropic API key:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;1fa52624-faad-4f60-b855-f17d1b753775&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">export ANTHROPIC_API_KEY=sk-ant-...
claude</code></pre></div><p>Optional document tooling for DOCX and APA 7.0 PDF output:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;280938e7-0fa3-400f-8190-28f98b24510b&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">brew install pandoc
brew install tectonic</code></pre></div><p>For Codex CLI users, the sibling distribution is <em><strong>Imbad0202/academic-research-skills-codex</strong></em>. Same workflow content, Codex-native packaging.</p><p>Cost from <em><strong>docs/PERFORMANCE.md</strong></em>: roughly <strong>$4 to $6</strong> for a 15,000-word paper with 60 references on Opus 4.7. Cross-model verification (<em><strong>ARS_CROSS_MODEL</strong></em>) adds <strong>$0.60 to $1.10</strong>. A full run exceeds 200K input and 100K output tokens, so long sessions can lose prompt-cache benefits and may need to resume from the Material Passport.</p><div><hr></div><h2>How to Actually Use It</h2><p>Three entry points cover most real usage.</p><p><strong>Full pipeline.</strong> Type <em><strong>/ars-full</strong></em> or describe the goal in natural language (&#8220;I want to write a research paper on AI&#8217;s impact on higher education QA&#8221;). The orchestrator starts at Stage 1 and walks all ten stages with user confirmation at every FULL checkpoint. Output is a finished APA 7.0 paper, an Editorial Decision Letter, a Revision Roadmap, two integrity reports, and an AI Self-Reflection Report.</p><p><strong>Guided planning.</strong> Type <em><strong>/ars-plan</strong></em> when the research question is not yet clear. The Socratic Mentor agent classifies user intent as exploratory or goal-oriented. Exploratory mode disables auto-convergence and runs a chapter-by-chapter dialogue: define the question, choose the method, map the argument. 
Output is a Chapter Plan plus an INSIGHT collection.</p><p><strong>Targeted single-skill calls.</strong> Skip the orchestrator when only one function is needed:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!o463!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5057e693-d8bb-4350-8f23-48b442129ad3_1200x657.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!o463!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5057e693-d8bb-4350-8f23-48b442129ad3_1200x657.png 424w, https://substackcdn.com/image/fetch/$s_!o463!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5057e693-d8bb-4350-8f23-48b442129ad3_1200x657.png 848w, https://substackcdn.com/image/fetch/$s_!o463!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5057e693-d8bb-4350-8f23-48b442129ad3_1200x657.png 1272w, https://substackcdn.com/image/fetch/$s_!o463!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5057e693-d8bb-4350-8f23-48b442129ad3_1200x657.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!o463!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5057e693-d8bb-4350-8f23-48b442129ad3_1200x657.png" width="1200" height="657" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5057e693-d8bb-4350-8f23-48b442129ad3_1200x657.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:657,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!o463!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5057e693-d8bb-4350-8f23-48b442129ad3_1200x657.png 424w, https://substackcdn.com/image/fetch/$s_!o463!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5057e693-d8bb-4350-8f23-48b442129ad3_1200x657.png 848w, https://substackcdn.com/image/fetch/$s_!o463!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5057e693-d8bb-4350-8f23-48b442129ad3_1200x657.png 1272w, https://substackcdn.com/image/fetch/$s_!o463!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5057e693-d8bb-4350-8f23-48b442129ad3_1200x657.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>A typical first session looks like this. Run <em><strong>/ars-plan</strong></em> to get a chapter map. Then run <em><strong>/ars-lit-review</strong></em> to fill the corpus. Then run <em><strong>/ars-full</strong></em> with the corpus already populated in the Material Passport.</p><p>The Material Passport is the handoff between sessions. It carries the literature corpus, the chapter plan, and the integrity reports. To resume a prior run in a fresh Claude Code session, set <em><strong>ARS_PASSPORT_RESET=1</strong></em> and use the <em><strong>resume_from_passport=&lt;hash&gt;</strong></em> mode.</p><p>For an existing draft, type &#8220;I already have a paper, review it&#8221; to enter the pipeline at Stage 2.5 with the integrity check running first. 
For reviewer-comment response, type &#8220;I received reviewer comments&#8221; to enter at Stage 4 with the revision-coach flow.</p><p>The pipeline ends with a <strong>Process Summary</strong> stage: a Collaboration Quality Evaluation across six dimensions scored 1 to 100. That score feeds the AI Self-Reflection Report, which surfaces concession rate, health alerts, and sycophancy risk for the run.</p><div><hr></div><h2>Evidence</h2><p>Academic Research Skills is the only candidate in the current academic-Claude-Code-skills cluster with multi-stage integrity gates wired into the pipeline itself. The comparison is architectural, not empirical. No formal benchmarks exist for academic-pipeline tooling.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!whuC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5e12bad-a882-45b0-a277-bcbea60a0c36_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!whuC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5e12bad-a882-45b0-a277-bcbea60a0c36_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!whuC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5e12bad-a882-45b0-a277-bcbea60a0c36_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!whuC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5e12bad-a882-45b0-a277-bcbea60a0c36_1672x941.png 1272w, 
https://substackcdn.com/image/fetch/$s_!whuC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5e12bad-a882-45b0-a277-bcbea60a0c36_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!whuC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5e12bad-a882-45b0-a277-bcbea60a0c36_1672x941.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f5e12bad-a882-45b0-a277-bcbea60a0c36_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!whuC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5e12bad-a882-45b0-a277-bcbea60a0c36_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!whuC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5e12bad-a882-45b0-a277-bcbea60a0c36_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!whuC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5e12bad-a882-45b0-a277-bcbea60a0c36_1672x941.png 1272w, 
https://substackcdn.com/image/fetch/$s_!whuC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5e12bad-a882-45b0-a277-bcbea60a0c36_1672x941.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><h2>Current Limitations (13 May)</h2><p><strong>License friction.</strong> CC BY-NC 4.0 blocks commercial use. Source-available, not OSI open source.</p><p><strong>Claude Code lock-in.</strong> The reference distribution is Claude Code-first. A Codex sibling exists. 
Cursor, OpenCode, and Gemini are not addressed.</p><p><strong>Integrity gates leak.</strong> The maintainer&#8217;s own post-publication audit of the showcase paper found 21 issues across 68 references that survived three rounds of automated integrity checks. v3.7.3 closes the locator-channel half. The full claim-faithfulness audit is deferred to v3.8.</p><p><strong>Tagged release vs main drift.</strong> v3.7.0 is the latest tagged release. Three-Layer Citation Emission lives on <em><strong>main</strong></em> as <em><strong>[Unreleased]</strong></em> v3.7.3 work.</p><p><strong>Metadata inconsistency.</strong> <em><strong>.claude-plugin/plugin.json</strong></em> claims &#8220;35+ modes, 32-agent ensemble.&#8221; <em><strong>MODE_REGISTRY.md</strong></em> says 25 modes. A direct file count finds 36 agents.</p><div><hr></div><h2>AlphaSignal Take</h2><p><strong>Worth Watching.</strong></p><p>Academic Research Skills ships what its README claims. v3.7.0 is the signed release, the plugin assets are in place (<em><strong>.claude-plugin/</strong></em>, 10 <em><strong>/ars-*</strong></em> commands, 3 plugin agents, SessionStart hook), and the static lints pass across spec consistency, schema, and pattern-protection checks. The workflow architecture is the strongest installable response yet to a citation-hallucination paper that just put a corpus-scale number on the problem.</p><p>The non-obvious finding is that the maintainer documents his own failure. In the showcase audit, 21 issues across 68 references slipped through three rounds of integrity checks. That honesty is the strongest evidence the gates do something. It is also why the verdict is not Production Ready.</p><p>What changes the verdict: a permissive license (or a commercial tier), the L3 full claim-faithfulness audit shipped in <strong>v3.8</strong>, and a non-Claude Code reference distribution. 
Until then, the workflow design is more valuable than the workflow runtime.</p><div><hr></div><h2>Who Benefits and Who Doesn&#8217;t</h2><p><strong>Who benefits.</strong> PhD students, academic researchers, and lab teams already on Claude Code under noncommercial settings, agent-tooling builders studying how integrity gates wire into multi-stage workflows, and journals or workshops evaluating AI-disclosure schemas.</p><p><strong>Who doesn&#8217;t.</strong> Commercial SaaS or paid-consulting teams (CC BY-NC 4.0 blocks the build), Cursor-only or OpenCode-only stacks (the reference distribution is Claude Code-first), and anyone needing byte-reproducible citation guarantees (v3.7.3 anchors are advisory in places, and the L3 full audit is unshipped).</p><h2>Practitioner Implication</h2><p>With v3.7.0 shipping a one-line plugin path, researchers using Claude Code can install a 10-stage academic workflow with mandatory integrity gates as four skills.</p><div><hr></div><h2>Links</h2><ul><li><p><a href="https://github.com/Imbad0202/academic-research-skills">github.com/Imbad0202/academic-research-skills</a> (repo, ~5 min plugin install)</p></li><li><p><a href="https://arxiv.org/abs/2605.07723">arxiv.org/abs/2605.07723</a> (Zhao et al. citation-hallucination audit, ~25 min read)</p></li></ul><p>Follow <a href="https://x.com/@AlphaSignalAI">@AlphaSignalAI</a> for more content like this.</p><div><hr></div><p>Subscribe at <a href="https://alphasignal.ai/">AlphaSignal.ai</a> for daily AI signals. Read by 280,000+ developers.</p><h2>Questions?</h2><p><em>Q: How do you install Academic Research Skills in Claude Code?</em></p><p>A: Two plugin commands: <em><strong>/plugin marketplace add Imbad0202/academic-research-skills</strong></em> then <em><strong>/plugin install academic-research-skills</strong></em>. Requires the latest Claude Code and <em><strong>ANTHROPIC_API_KEY</strong></em>. First run: <em><strong>/ars-plan</strong></em>.</p><p><em>Q: Does Academic Research Skills write papers automatically?</em></p><p>A: No. 
The repo&#8217;s <em><strong>POSITIONING.md</strong></em> explicitly states that ARS is assistive, not autonomous. Mandatory human checkpoints at every FULL stage and at the Stage 2.5 and Stage 4.5 integrity gates block silent progression.</p><p><em>Q: How does ARS reduce citation hallucinations?</em></p><p>A: Two integrity gates (Stage 2.5 pre-review, Stage 4.5 pre-finalization) run a 7-mode failure-mode checklist plus Semantic Scholar API verification. The unreleased v3.7.3 work on <em><strong>main</strong></em> adds Three-Layer Citation Emission: a hidden anchor marker after every citation specifies a quote, page, section, or paragraph locator.</p><p><em>Q: What does ARS cost to run for a full paper pipeline?</em></p><p>A: Roughly $4 to $6 for a 15,000-word, 60-reference paper on Opus 4.7 per <em><strong>docs/PERFORMANCE.md</strong></em>. Cross-model verification adds $0.60 to $1.10. A full run exceeds 200K input and 100K output tokens.</p><p><em>Q: Is Academic Research Skills open source?</em></p><p>A: Source-available, not OSI open source. License is CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0). 
Commercial SaaS, hosted services, paid consulting, and enterprise deployments require separate licensing.</p>]]></content:encoded></item><item><title><![CDATA[How agentmemory works, and how to actually use it with your agent]]></title><description><![CDATA[Trending, 12 hooks, 51 MCP tools, and a triple-stream retrieval pipeline that scores 95.2% R@5 on LongMemEval-S]]></description><link>https://alphasignalai.substack.com/p/how-agentmemory-works-and-how-to</link><guid isPermaLink="false">https://alphasignalai.substack.com/p/how-agentmemory-works-and-how-to</guid><dc:creator><![CDATA[AlphaSignal AI]]></dc:creator><pubDate>Wed, 13 May 2026 17:02:04 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!n_3d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F305d9046-1563-48d1-86c0-0691a8018bda_2048x1152.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!n_3d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F305d9046-1563-48d1-86c0-0691a8018bda_2048x1152.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!n_3d!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F305d9046-1563-48d1-86c0-0691a8018bda_2048x1152.png 424w, https://substackcdn.com/image/fetch/$s_!n_3d!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F305d9046-1563-48d1-86c0-0691a8018bda_2048x1152.png 848w, 
https://substackcdn.com/image/fetch/$s_!n_3d!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F305d9046-1563-48d1-86c0-0691a8018bda_2048x1152.png 1272w, https://substackcdn.com/image/fetch/$s_!n_3d!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F305d9046-1563-48d1-86c0-0691a8018bda_2048x1152.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!n_3d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F305d9046-1563-48d1-86c0-0691a8018bda_2048x1152.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/305d9046-1563-48d1-86c0-0691a8018bda_2048x1152.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!n_3d!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F305d9046-1563-48d1-86c0-0691a8018bda_2048x1152.png 424w, https://substackcdn.com/image/fetch/$s_!n_3d!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F305d9046-1563-48d1-86c0-0691a8018bda_2048x1152.png 848w, 
https://substackcdn.com/image/fetch/$s_!n_3d!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F305d9046-1563-48d1-86c0-0691a8018bda_2048x1152.png 1272w, https://substackcdn.com/image/fetch/$s_!n_3d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F305d9046-1563-48d1-86c0-0691a8018bda_2048x1152.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><blockquote><p><em>After 3 minutes, you&#8217;ll know whether to install <strong>agentmemory</strong>. 
After 10 minutes, you&#8217;ll have it running and be able to use it immediately.</em></p></blockquote><blockquote><p><em>TLDR? Review this </em><a href="https://adhamhidawy.github.io/alphasignal-guides/agentmemory-guide/">HTML interactive guide</a> <em>(beta), inspired by </em><a href="https://x.com/@trq212">@trq212</a></p></blockquote><p><strong>agentmemory</strong> replaces CLAUDE.md, .cursorrules, and every other static-file memory hack with an actual local service.</p><p><strong>3,000+ stars</strong> in ~3 days, version 0.9.9 shipped 2026-05-11, Apache-2.0, one <em><strong>npx</strong></em> command to install.</p><p><strong>Rohit Ghumare</strong>, Principal Product Evangelist at iii.dev, authored the implementation. The design comes from his <a href="https://gist.github.com/rohitg00/2067ab416f7bbe447c1977edaaa681e2">gist</a> <em><a href="https://gist.github.com/rohitg00/2067ab416f7bbe447c1977edaaa681e2">&#8220;LLM Wiki v2&#8221;</a></em>, which extends <a href="https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f">Karpathy&#8217;s LLM Wiki</a> pattern with the lessons from building agentmemory (170+ forks, 13 comments).</p><p><strong>What changes for you.</strong> Session 1, you set up JWT auth with <em><strong>jose</strong></em> middleware in <em>src/middleware/auth.ts</em>. Session 2, you ask for rate limiting, and the agent already knows your auth stack, your test file, and why you chose <em><strong>jose</strong></em> over <em><strong>jsonwebtoken</strong></em>. No re-explaining.</p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/ghumare64/status/2053609542273909155?s=20&quot;,&quot;full_text&quot;:&quot;You can now give Hermes, Claude Code, and Codex infinite memory.\n\nFor free.\n\nAgentmemory is trending on GitHub with 4,000+ Stars.\n\nIt records what Claude does during your coding sessions. Compresses it with AI. 
Injects relevant context back into future sessions.\n\nCLAUDE md dumps &quot;,&quot;username&quot;:&quot;ghumare64&quot;,&quot;name&quot;:&quot;Rohit Ghumare&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/2048350545476509696/ZeGm9Th9_normal.jpg&quot;,&quot;date&quot;:&quot;2026-05-10T22:54:03.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/HH_i_wdW4AAOQWo.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/s6Ql80Kwhx&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:49,&quot;retweet_count&quot;:168,&quot;like_count&quot;:1344,&quot;impression_count&quot;:92798,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:false}" data-component-name="Twitter2ToDOM"></div><div><hr></div><h2>Repo Snapshot (12 May)</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FVmh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa00f0fad-ed03-46b1-ad6d-910c796fe664_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FVmh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa00f0fad-ed03-46b1-ad6d-910c796fe664_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!FVmh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa00f0fad-ed03-46b1-ad6d-910c796fe664_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!FVmh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa00f0fad-ed03-46b1-ad6d-910c796fe664_1672x941.png 1272w, 
https://substackcdn.com/image/fetch/$s_!FVmh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa00f0fad-ed03-46b1-ad6d-910c796fe664_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FVmh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa00f0fad-ed03-46b1-ad6d-910c796fe664_1672x941.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a00f0fad-ed03-46b1-ad6d-910c796fe664_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FVmh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa00f0fad-ed03-46b1-ad6d-910c796fe664_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!FVmh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa00f0fad-ed03-46b1-ad6d-910c796fe664_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!FVmh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa00f0fad-ed03-46b1-ad6d-910c796fe664_1672x941.png 1272w, 
https://substackcdn.com/image/fetch/$s_!FVmh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa00f0fad-ed03-46b1-ad6d-910c796fe664_1672x941.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><h2>Context</h2><p>Every AI coding agent ships with built-in memory: Claude Code has <em>MEMORY.md</em>, Cursor has notepads, Cline has a memory bank. These work like sticky notes. 
They cap at ~200 lines, go stale, and load everything into context every time the session opens.</p><p><strong>agentmemory</strong> is the searchable database behind the sticky notes. The repo was created 2026-02-25, has 280+ commits across 13 contributors, and shipped 8 releases in three days (May 9 to May 11) leading up to v0.9.9.</p><div><hr></div><h2>How It Works</h2><blockquote><p><em>Jump to <strong>&#8220;How to get started&#8221;</strong> down below.</em></p></blockquote><p>agentmemory is built on <strong>iii-engine</strong>, a service composition framework (15K+ stars, TypeScript + Rust). Functions, KV state, streams, and OTEL traces are all iii primitives. The engine replaces Express.js, Postgres + pgvector, SSE/Socket.io, pm2, and Prometheus. No external database, no Docker required (though both work).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!k7sY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c12f0e2-4ed3-4bf3-8ed2-008da5a3489d_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!k7sY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c12f0e2-4ed3-4bf3-8ed2-008da5a3489d_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!k7sY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c12f0e2-4ed3-4bf3-8ed2-008da5a3489d_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!k7sY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c12f0e2-4ed3-4bf3-8ed2-008da5a3489d_1672x941.png 1272w, 
https://substackcdn.com/image/fetch/$s_!k7sY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c12f0e2-4ed3-4bf3-8ed2-008da5a3489d_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!k7sY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c12f0e2-4ed3-4bf3-8ed2-008da5a3489d_1672x941.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5c12f0e2-4ed3-4bf3-8ed2-008da5a3489d_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!k7sY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c12f0e2-4ed3-4bf3-8ed2-008da5a3489d_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!k7sY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c12f0e2-4ed3-4bf3-8ed2-008da5a3489d_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!k7sY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c12f0e2-4ed3-4bf3-8ed2-008da5a3489d_1672x941.png 1272w, 
https://substackcdn.com/image/fetch/$s_!k7sY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c12f0e2-4ed3-4bf3-8ed2-008da5a3489d_1672x941.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The system has three layers: <strong>Capture</strong>, <strong>Pipeline</strong>, <strong>Retrieval</strong>. 
Plus a consolidation cycle that compresses raw observations into longer-term memory tiers.</p><ul><li><p><strong>Capture.</strong> Twelve Claude Code lifecycle hooks fire automatically: <em>SessionStart</em>, <em>UserPromptSubmit</em>, <em>PreToolUse</em>, <em>PostToolUse</em>, <em>PostToolUseFailure</em>, <em>PreCompact</em>, <em>SubagentStart</em>, <em>SubagentStop</em>, <em>Notification</em>, <em>TaskCompleted</em>, <em>Stop</em>, <em>SessionEnd</em>. Each hook is a standalone Node script that reads JSON from stdin, POSTs to the local REST API, and exits. Non-Claude agents capture the same way through <em><strong>/agentmemory/observe</strong></em> or the MCP <em><strong>memory_save</strong></em> tool. Zero manual <em><strong>add()</strong></em> calls. The agent works, the hook fires, the observation lands.</p></li></ul><ul><li><p><strong>Pipeline.</strong> Inside the server, every observation passes through four stages. SHA-256 dedup catches anything that repeats within five minutes. The privacy filter in <em>src/functions/privacy.ts</em> strips <em><strong>&lt;private&gt;</strong></em> blocks and redacts API keys, bearer tokens, GitHub tokens, AWS keys, Google keys, JWTs, npm tokens, GitLab tokens, and DigitalOcean tokens before storage.</p><p>Raw observations land in iii-engine&#8217;s file-backed KV. Then a synthetic compression path indexes the observation in BM25 without calling any LLM. If you set <em><strong>AGENTMEMORY_AUTO_COMPRESS=true</strong></em>, a configured Anthropic, MiniMax, Gemini, or OpenRouter provider compresses observations into structured facts on every hook. Off by default since v0.8.8 (issue #138) because the per-token cost on active sessions is significant.</p></li><li><p><strong>Retrieval.</strong> Three streams run in parallel inside <em>src/state/hybrid-search.ts</em>. <strong>BM25</strong> stems with Porter, expands synonyms, runs always. 
<strong>Vector</strong> computes cosine similarity over <em><strong>all-MiniLM-L6-v2</strong></em> 384-dimension embeddings when an embedding provider is configured (free local option via <em><strong>@xenova/transformers</strong></em>, or hosted via Gemini, OpenAI, Voyage, Cohere, OpenRouter).</p><p><strong>Graph</strong> traverses entity relationships when entities are detected in the query. Results fuse via Reciprocal Rank Fusion with <em><strong>RRF_K = 60</strong></em>, then diversify across sessions (max 3 results per session) so one session&#8217;s noise can&#8217;t dominate the top-K.</p></li><li><p><strong>4-tier consolidation.</strong> Memories progress through four tiers, analogous to sleep consolidation. <em>Working</em> holds raw observations. <em>Episodic</em> holds compressed session summaries. <em>Semantic</em> holds extracted facts and patterns. <em>Procedural</em> holds workflows and decision patterns. Memories decay on an Ebbinghaus-style curve. Frequently accessed memories strengthen. Stale memories auto-evict. Contradictions are detected and resolved on write.</p></li><li><p><strong>Context assembly.</strong> When a new session starts, <em><strong>mem::context</strong></em> assembles pinned slots, project profile, recent session summaries, and high-importance observations into an <em><strong>&lt;agentmemory-context&gt;</strong></em> block with a default 2,000-token budget. SessionStart context injection is OFF by default (<em><strong>AGENTMEMORY_INJECT_CONTEXT=false</strong></em> since v0.8.10, issue #143).</p><p>Hooks still POST observations for background capture either way, but the README&#8217;s &#8220;agent already knows your stack&#8221; demo specifically requires the env var to be on. 
Worth knowing before you wonder why the agent doesn&#8217;t seem to remember anything.</p></li><li><p><strong>MCP and REST surface.</strong> Fifty-one <em><strong>memory_*</strong></em> tools, six MCP resources, three prompts, four skills (<em><strong>/recall</strong></em>, <em><strong>/remember</strong></em>, <em><strong>/session-history</strong></em>, <em><strong>/forget</strong></em>), and 127 REST endpoint declarations across <em>src/triggers/api.ts</em> and <em>src/mcp/server.ts</em>. Seven tools are visible by default. Set <em><strong>AGENTMEMORY_TOOLS=all</strong></em> to expose all 51. The full list includes hybrid search (<em><strong>memory_smart_search</strong></em>), file history (<em><strong>memory_file_history</strong></em>), project profile (<em><strong>memory_profile</strong></em>), graph traversal (<em><strong>memory_graph_query</strong></em>), team sharing (<em><strong>memory_team_share</strong></em>), audit trail (<em><strong>memory_audit</strong></em>), and multi-agent coordination primitives (<em><strong>memory_lease</strong></em>, <em><strong>memory_signal_send</strong></em>, <em><strong>memory_mesh_sync</strong></em>).</p><div><hr></div></li></ul><h2>How to Get Started</h2><blockquote><p><em><strong>This section covers install. For the command reference once it is running, jump to How to Use It at the end of this article.</strong></em></p></blockquote><p>Total time to working install: under 10 minutes on macOS or Linux, 15 on Windows.</p><p><strong>1. Prerequisites.</strong> Node.js &gt;= 20.0.0. iii-engine v0.11.2 OR Docker Desktop. macOS, Linux, or Windows. On Windows, the npm package alone is not enough: download the iii-engine binary from <em><strong>iii-hq/iii</strong></em> releases (or use Docker Desktop), since there&#8217;s no PowerShell installer or winget package today.</p><p><strong>2. 
Start the server.</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;7c213145-9729-4cac-8185-c88b14e3db4a&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">npx @agentmemory/agentmemory</code></pre></div><p>This auto-downloads or starts iii-engine v0.11.2 (the pinned version, since v0.11.6 introduced a sandbox-per-worker model agentmemory hasn&#8217;t refactored for yet). REST binds to <em><strong>127.0.0.1:3111</strong></em>, streams to <em><strong>:3112</strong></em>, viewer to <em><strong>:3113</strong></em>.</p><p><strong>3. Seed and verify with the demo.</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;7fb4771d-472d-4e77-beb7-7ecb15560cb5&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">npx @agentmemory/agentmemory demo</code></pre></div><p>This seeds three sessions (JWT auth setup, an N+1 query fix, rate limiting) with six observations total, then runs three searches. The search <em><strong>&#8220;database performance optimization&#8221;</strong></em> returns the N+1 fix observation. Keyword-only search cannot do that. If you see results, the pipeline works end-to-end.</p><p><strong>4. Open the viewer.</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;1a962f1e-7164-4eb9-bc44-ff2db51c3138&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">http://localhost:3113</code></pre></div><p>Live observation stream, session explorer, memory browser, knowledge-graph visualization, and a health dashboard. Bound to <em><strong>127.0.0.1</strong></em> only.</p><p><strong>5. 
Wire it to your agent.</strong> One JSON block covers most hosts (Cursor, Claude Desktop, Cline, Roo Code, Windsurf, Gemini CLI, OpenClaw). Merge into the host&#8217;s existing <em><strong>mcpServers</strong></em> object. Do not replace the file.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;json&quot;,&quot;nodeId&quot;:&quot;c838a678-bd7e-460d-b4ee-e2994f693505&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-json">{
  &quot;mcpServers&quot;: {
    &quot;agentmemory&quot;: {
      &quot;command&quot;: &quot;npx&quot;,
      &quot;args&quot;: [&quot;-y&quot;, &quot;@agentmemory/mcp&quot;],
      &quot;env&quot;: {
        &quot;AGENTMEMORY_URL&quot;: &quot;http://localhost:3111&quot;
      }
    }
  }
}</code></pre></div><p>The host-specific shapes:</p><ul><li><p><strong>Cursor</strong>: <em><strong>~/.cursor/mcp.json</strong></em></p></li><li><p><strong>Claude Desktop</strong>: <em><strong>claude_desktop_config.json</strong></em> in Application Support, restart after editing</p></li><li><p><strong>Cline / Roo Code / Kilo Code</strong>: Settings UI, MCP Servers, Edit</p></li><li><p><strong>Windsurf</strong>: <em><strong>~/.codeium/windsurf/mcp_config.json</strong></em></p></li><li><p><strong>Gemini CLI</strong>: <em><strong>gemini mcp add agentmemory npx -y @agentmemory/mcp --scope user</strong></em></p></li><li><p><strong>Codex CLI</strong> (TOML shape): <em><strong>codex mcp add agentmemory -- npx -y @agentmemory/mcp</strong></em></p></li><li><p><strong>OpenCode</strong> (different shape, top-level <em><strong>mcp</strong></em> key with command as array)</p></li></ul><p><strong>6. Claude Code: install the plugin instead.</strong> Skip step 5 and run:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;9788ab6e-bb78-4a33-aa91-1d2cbae6c16f&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">/plugin marketplace add rohitg00/agentmemory
/plugin install agentmemory</code></pre></div><p>The plugin registers all 12 hooks, 4 skills, and auto-wires the MCP server through <em><strong>.mcp.json</strong></em>. Verify with <em><strong>curl http://localhost:3111/agentmemory/health</strong></em>.</p><p><strong>7. Optional: free local embeddings.</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;51e048d1-b00d-4264-91b1-8f196ad7a68a&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">npm install @xenova/transformers</code></pre></div><p>Switches the embedding provider to <em><strong>all-MiniLM-L6-v2</strong></em> running in-process. No API key, no per-call cost, adds ~9 percentage points of recall over BM25-only (86.2% to 95.2% R@5 on LongMemEval-S).</p><p><strong>8. Optional: turn on the headline demo.</strong> Out of the box, agentmemory captures and supports recall but does not inject context into the agent&#8217;s first turn. To enable the &#8220;agent already knows your stack&#8221; behavior, add this to <em><strong>~/.agentmemory/.env</strong></em>:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;markdown&quot;,&quot;nodeId&quot;:&quot;bd639831-c287-45fc-b96f-9859eef098ea&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-markdown">AGENTMEMORY_INJECT_CONTEXT=true</code></pre></div><p>Restart the server. SessionStart will now write up to 2,000 tokens of relevant project context into the first turn. This counts against your model&#8217;s token budget. The startup warning will remind you.</p><p><strong>9. 
Import existing transcripts.</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;40b85edb-4e51-4742-a696-c7151bdf44fa&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">npx @agentmemory/agentmemory import-jsonl</code></pre></div><p>Default scan path is <em><strong>~/.claude/projects</strong></em>. Default cap is 200 files / 1,000 sessions. Imported sessions show up in the viewer&#8217;s Replay tab alongside native ones.</p><blockquote><p><em><strong>That covers install. For the command reference once it is running, jump to How to Use It at the end of this article.</strong></em></p><div><hr></div></blockquote><h2>Benchmark Evidence</h2><p><strong>LongMemEval-S</strong> (ICLR 2025, 500 questions, ~48 sessions per question, ~115K tokens each). The scores below are retrieval recall, not end-to-end QA accuracy. The repo says so plainly: it does not claim these as &#8220;LongMemEval scores,&#8221; only as retrieval-only evaluations on the LongMemEval-S haystack. 
Scripts and the cleaned dataset are committed under <em><strong>benchmark/</strong></em> for reproduction.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uDBK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ff993c9-b69d-47c3-9e00-4ece1715d259_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uDBK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ff993c9-b69d-47c3-9e00-4ece1715d259_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!uDBK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ff993c9-b69d-47c3-9e00-4ece1715d259_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!uDBK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ff993c9-b69d-47c3-9e00-4ece1715d259_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!uDBK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ff993c9-b69d-47c3-9e00-4ece1715d259_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uDBK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ff993c9-b69d-47c3-9e00-4ece1715d259_1672x941.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9ff993c9-b69d-47c3-9e00-4ece1715d259_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uDBK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ff993c9-b69d-47c3-9e00-4ece1715d259_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!uDBK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ff993c9-b69d-47c3-9e00-4ece1715d259_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!uDBK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ff993c9-b69d-47c3-9e00-4ece1715d259_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!uDBK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ff993c9-b69d-47c3-9e00-4ece1715d259_1672x941.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Token efficiency, repo-reported: ~1,900 tokens per session, ~170K per year, ~$10/year on per-token billing or $0/year with local embeddings. Compare against ~22K tokens at 240 observations for the &#8220;paste everything into CLAUDE.md&#8221; approach. 
That is a 92% reduction at the working point most heavy users hit by month two.</p><div><hr></div><h2>Comparison</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VtUM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F725bb272-ee26-4944-9900-bfcd985710ba_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VtUM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F725bb272-ee26-4944-9900-bfcd985710ba_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!VtUM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F725bb272-ee26-4944-9900-bfcd985710ba_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!VtUM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F725bb272-ee26-4944-9900-bfcd985710ba_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!VtUM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F725bb272-ee26-4944-9900-bfcd985710ba_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VtUM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F725bb272-ee26-4944-9900-bfcd985710ba_1672x941.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/725bb272-ee26-4944-9900-bfcd985710ba_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VtUM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F725bb272-ee26-4944-9900-bfcd985710ba_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!VtUM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F725bb272-ee26-4944-9900-bfcd985710ba_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!VtUM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F725bb272-ee26-4944-9900-bfcd985710ba_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!VtUM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F725bb272-ee26-4944-9900-bfcd985710ba_1672x941.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Mem0 and Letta publish benchmarks on <strong>LoCoMo</strong>, a different evaluation set. The repo&#8217;s own <em><strong>COMPARISON.md</strong></em> flags this as &#8220;apples vs oranges&#8221; and invites cross-benchmark collaboration. Treat the headline 95.2% as agentmemory&#8217;s number on agentmemory&#8217;s pipeline, not a leaderboard win against tools measured on a different test.</p><div><hr></div><h2>Current Limitations</h2><p><strong>Plaintext HTTP token transport (issue #275, open).</strong> The plugin sends the <em><strong>AGENTMEMORY_SECRET</strong></em> bearer over plaintext HTTP. The default localhost binding contains the exposure today. Anyone exposing the REST surface beyond a single host needs a reverse proxy, TLS, and an auth review first.</p><p><strong>The interesting features are off by default.</strong> Auto-compress, context injection, slots, reflect, and graph extraction are all gated behind env vars in <em>src/config.ts</em>. 
The README&#8217;s &#8220;agent already knows your stack&#8221; demo specifically requires <em><strong>AGENTMEMORY_INJECT_CONTEXT=true</strong></em>. Out of the box, agentmemory captures observations and supports recall, but it does not inject context into Claude Code&#8217;s first turn. This is the single biggest expectation mismatch a new user will hit.</p><p><strong>Documentation drift.</strong> <em>AGENTS.md</em> ships v0.8.9 stats (44 tools, 104 endpoints, 699 tests). The source has 51 tools, 127 endpoint declarations, and 888 static test cases. The README badge claims 104 endpoints and 827 tests, while the README prose says 107 endpoints. <em><strong>benchmark/COMPARISON.md</strong></em> references <em><strong>npm run bench:*</strong></em> scripts that are not in <em>package.json</em>. Three doc surfaces, three different numbers, no agreement.</p><p><strong>The engine writes state into your project root (issue #303, open).</strong> When Claude Code auto-starts agentmemory from a project directory, the iii-engine creates <em><strong>data/state_store.db/...</strong></em> inside the user&#8217;s git working tree. Expect cleanup noise and <em>.gitignore</em> drift until the engine moves state to a stable location.</p><div><hr></div><h2>AlphaSignal Take</h2><p><strong>Verdict: Production Ready</strong> for solo developers and small teams running agentmemory on localhost as a personal coding-agent memory layer.</p><p>The architecture is real. The benchmark is reproducible from committed scripts. The install is one <em><strong>npx</strong></em> command. The 12-hook capture flow runs unattended. The viewer at port 3113 makes the memory system inspectable, which is rare in this category. There is no equivalent shipping today.</p><p>Maintenance health is acceptable. Eight releases in three days (May 9 through May 11). 280+ commits since February. Active issue triage, public CHANGELOG, public ROADMAP. 
The 91% single-maintainer concentration is the asterisk, and the Q3 2026 roadmap names &#8220;additional maintainer onboarding&#8221; as a priority and a foundation Growth-Stage prerequisite.</p><p>What would change the verdict positively? Q4 2026 ships an external security audit, SSO, RBAC, and audit-log export. Q1 2027 freezes the REST and MCP surface for v1.0. The plaintext-HTTP fix (#275) and the engine-CWD cleanup (#303) close. At that point the answer for production deployments changes from &#8220;wait&#8221; to &#8220;deploy.&#8221; Watch for <strong>v1.0</strong> in Q1 2027.</p><div><hr></div><h2>Who Benefits and Who Doesn&#8217;t</h2><p><strong>Benefits:</strong> Claude Code, Cursor, Codex CLI, Gemini CLI, and OpenCode users running solo or in small teams. Engineers in multi-day agent sessions on the same codebase. Teams who want a real-time viewer for what the agent is learning. Anyone who hit the 200-line CLAUDE.md ceiling and started copy-pasting.</p><p><strong>Doesn&#8217;t benefit yet:</strong> teams deploying agentmemory&#8217;s REST surface beyond localhost without a reverse proxy plus TLS (issue #275 is open). Production deployments that require an external security audit (planned Q4 2026). Non-English coding sessions that rely on accurate BM25 retrieval (issue #295 strips non-ASCII tokens today). 
Windows users without Docker or the iii-engine binary on PATH.</p><h2>Practitioner Implication</h2><p>Coding-agent users can now install a shared, hook-driven, searchable memory layer in one <em><strong>npx</strong></em> command and stop pasting their stack into every new session.</p><div><hr></div><h2>Links</h2><ul><li><p><a href="https://github.com/rohitg00/agentmemory">github.com/rohitg00/agentmemory</a> (repo, ~30 sec to first install)</p></li><li><p><a href="https://agent-memory.dev/">agent-memory.dev</a> (landing site)</p></li><li><p><a href="https://github.com/rohitg00/agentmemory/blob/main/benchmark/LONGMEMEVAL.md">benchmark/LONGMEMEVAL.md</a> (~5 min read)</p></li><li><p><a href="https://arxiv.org/abs/2410.10813">LongMemEval paper, ICLR 2025</a> (~25 min read)</p></li></ul><p>Follow <a href="https://x.com/@AlphaSignalAI">@AlphaSignalAI</a> for more content like this.</p><div><hr></div><p>Subscribe at <a href="https://alphasignal.ai/">AlphaSignal.ai</a> for daily AI signals. Read by 280,000+ developers.</p><h2>Appendix: How to Use It</h2><p>A command reference for once agentmemory is installed and running.</p><p><strong>The four user-invocable skills (Claude Code).</strong> Installed by the plugin, invoked by typing the slash command in the agent:</p><ul><li><p><em><strong>/recall [query]</strong></em> wraps <em><strong>memory_smart_search</strong></em>. Hybrid BM25 + vector + graph search across past observations. Use when you want context from a past session (&#8220;recall how we set up JWT auth&#8221;).</p></li><li><p><em><strong>/remember [content]</strong></em> wraps <em><strong>memory_save</strong></em>. Explicitly persists an insight, decision, or pattern with auto-extracted concepts and file references. Use when you want to lock in a decision so the next session inherits it.</p></li><li><p><em><strong>/session-history</strong></em> wraps <em><strong>memory_sessions</strong></em>. 
Lists the last 20 sessions on this project with key highlights.</p></li><li><p><em><strong>/forget [query or session ID]</strong></em> wraps <em><strong>memory_smart_search</strong></em> then <em><strong>memory_governance_delete</strong></em>. Surfaces matching observations, asks for explicit confirmation, then deletes with an audit trail.</p></li></ul><p><strong>Most of the time, just talk to the agent.</strong> The 12 hooks capture every tool call automatically. If <em><strong>AGENTMEMORY_INJECT_CONTEXT=true</strong></em> is set, SessionStart pre-loads relevant memories into the agent&#8217;s first turn. The four skills above are for explicit control, not the default workflow.</p><p><strong>Direct MCP tools (other agents).</strong> Agents without the plugin call MCP tools directly. The seven core tools available by default:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lo4f!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1879cbd0-8e03-47db-8ef4-34b17d7ddeca_900x610.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lo4f!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1879cbd0-8e03-47db-8ef4-34b17d7ddeca_900x610.png 424w, https://substackcdn.com/image/fetch/$s_!lo4f!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1879cbd0-8e03-47db-8ef4-34b17d7ddeca_900x610.png 848w, https://substackcdn.com/image/fetch/$s_!lo4f!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1879cbd0-8e03-47db-8ef4-34b17d7ddeca_900x610.png 1272w, 
https://substackcdn.com/image/fetch/$s_!lo4f!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1879cbd0-8e03-47db-8ef4-34b17d7ddeca_900x610.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lo4f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1879cbd0-8e03-47db-8ef4-34b17d7ddeca_900x610.png" width="900" height="610" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1879cbd0-8e03-47db-8ef4-34b17d7ddeca_900x610.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:610,&quot;width&quot;:900,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!lo4f!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1879cbd0-8e03-47db-8ef4-34b17d7ddeca_900x610.png 424w, https://substackcdn.com/image/fetch/$s_!lo4f!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1879cbd0-8e03-47db-8ef4-34b17d7ddeca_900x610.png 848w, https://substackcdn.com/image/fetch/$s_!lo4f!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1879cbd0-8e03-47db-8ef4-34b17d7ddeca_900x610.png 1272w, 
https://substackcdn.com/image/fetch/$s_!lo4f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1879cbd0-8e03-47db-8ef4-34b17d7ddeca_900x610.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Set <em><strong>AGENTMEMORY_TOOLS=all</strong></em> to expose the full 51-tool surface (knowledge graph queries, multi-agent leases, signals, sentinels, sketches, consolidation, snapshots, mesh sync, audit, governance, team sharing).</p><p><strong>Direct REST calls.</strong> Every MCP tool has a REST equivalent on port 3111. 
Useful for scripts, IDEs, and CI:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;b6cbcfcd-9561-4798-95bf-e617a4a7666c&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash"># Search
curl -s -X POST http://localhost:3111/agentmemory/smart-search \
  -H "Content-Type: application/json" \
  -d '{"query": "jwt auth middleware", "limit": 5}'

# Save
curl -s -X POST http://localhost:3111/agentmemory/remember \
  -H "Content-Type: application/json" \
  -d '{"content": "Auth uses jose, chosen over jsonwebtoken for Edge compat", "concepts": ["jwt", "auth", "edge"], "files": ["src/middleware/auth.ts"]}'

# Project profile
curl -s "http://localhost:3111/agentmemory/profile?project=$(pwd)"
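
# Authenticated search (a sketch, not taken from the project docs: it assumes
# AGENTMEMORY_SECRET is exported in this shell and that smart-search is one of
# the protected endpoints, which then expect an Authorization: Bearer header)
curl -s -X POST http://localhost:3111/agentmemory/smart-search \
  -H "Authorization: Bearer ${AGENTMEMORY_SECRET}" \
  -H "Content-Type: application/json" \
  -d '{"query": "jwt auth middleware", "limit": 5}'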

# Health
curl -s http://localhost:3111/agentmemory/health</code></pre></div><p>When <em><strong>AGENTMEMORY_SECRET</strong></em> is set, protected endpoints require <em><strong>Authorization: Bearer &lt;secret&gt;</strong></em>.</p><p><strong>The viewer at port 3113.</strong> Open http://localhost:3113 to watch observations land live, browse sessions, walk the knowledge graph, and scrub through the Replay tab on past sessions (including imported JSONL transcripts).</p><p><strong>When to reach for which command:</strong></p><ul><li><p>Working on a feature, want continuity across sessions: do nothing, hooks handle it</p></li><li><p>Decision or pattern worth keeping forever: <em><strong>/remember</strong></em></p></li><li><p>New session, no context loaded: <em><strong>/recall</strong></em> with a topic</p></li><li><p>Reviewing yesterday&#8217;s work: <em><strong>/session-history</strong></em></p></li><li><p>Privacy or cleanup: <em><strong>/forget</strong></em></p><div><hr></div></li></ul>]]></content:encoded></item><item><title><![CDATA[How AI Agents Follow Senior-Engineer Production Workflows, How to Wire It Into Your Stack]]></title><description><![CDATA[22 Markdown skills, 7 slash commands, and the author bet that the harness matters more than the model.]]></description><link>https://alphasignalai.substack.com/p/how-ai-agents-follow-senior-engineer</link><guid isPermaLink="false">https://alphasignalai.substack.com/p/how-ai-agents-follow-senior-engineer</guid><dc:creator><![CDATA[AlphaSignal AI]]></dc:creator><pubDate>Tue, 12 May 2026 17:02:04 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!t1vp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414b0bf-914e-46ec-b42f-0da526fe7935_2048x1152.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!t1vp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414b0bf-914e-46ec-b42f-0da526fe7935_2048x1152.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!t1vp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414b0bf-914e-46ec-b42f-0da526fe7935_2048x1152.png 424w, https://substackcdn.com/image/fetch/$s_!t1vp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414b0bf-914e-46ec-b42f-0da526fe7935_2048x1152.png 848w, https://substackcdn.com/image/fetch/$s_!t1vp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414b0bf-914e-46ec-b42f-0da526fe7935_2048x1152.png 1272w, https://substackcdn.com/image/fetch/$s_!t1vp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414b0bf-914e-46ec-b42f-0da526fe7935_2048x1152.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!t1vp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414b0bf-914e-46ec-b42f-0da526fe7935_2048x1152.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a414b0bf-914e-46ec-b42f-0da526fe7935_2048x1152.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!t1vp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414b0bf-914e-46ec-b42f-0da526fe7935_2048x1152.png 424w, https://substackcdn.com/image/fetch/$s_!t1vp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414b0bf-914e-46ec-b42f-0da526fe7935_2048x1152.png 848w, https://substackcdn.com/image/fetch/$s_!t1vp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414b0bf-914e-46ec-b42f-0da526fe7935_2048x1152.png 1272w, https://substackcdn.com/image/fetch/$s_!t1vp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa414b0bf-914e-46ec-b42f-0da526fe7935_2048x1152.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><blockquote><p><em><strong>After about 10 minutes of reading, you will be able to decide whether to install agent-skills, where to wire it in, and how to put every skill to use right away.</strong></em></p></blockquote><p><strong>Addy Osmani</strong>, engineering lead at Google Chrome, open-sourced <em><strong>agent-skills</strong></em> on February 15, 2026.</p><p><strong>The repo</strong> has passed 39K stars in three months, with a reported gain of 1K+ stars per day.</p><p><strong>The bet</strong> is that agent reliability comes from the harness around the model, not from a smarter model.</p><p><strong>What&#8217;s different</strong> is that this is not another prompt library. It is a three-layer architecture (skill, persona, command), with anti-rationalization tables in 20 of 22 skills, a parallel fan-out command, and three hooks that give the pack real enforcement weight on Claude Code.</p><p>It does not make the model smarter. 
It makes skipping specs, tests, reviews, and security checks harder.</p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/DataChaz/status/2040357775830814798&quot;,&quot;full_text&quot;:&quot;&#128680; You need to see this.\n\n<span class=\&quot;tweet-fake-link\&quot;>@addyosmani</span> from Google just dropped his new Agent Skills and it's incredible.\n\nIt brings 19 engineering skills + 7 commands to AI coding agents, all inspired by Google best practices &#129327;\n\nAI coding agents are powerful, but left alone, they take &quot;,&quot;username&quot;:&quot;DataChaz&quot;,&quot;name&quot;:&quot;Charly Wargnier&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1967672234056523776/FoFd2843_normal.jpg&quot;,&quot;date&quot;:&quot;2026-04-04T09:16:16.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/HFDOlaTaYAAZjjP.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/tsG2csWbJ7&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:66,&quot;retweet_count&quot;:351,&quot;like_count&quot;:2572,&quot;impression_count&quot;:417470,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:false}" data-component-name="Twitter2ToDOM"></div><p>The single most screenshot-able thing in the repo is the table used across most lifecycle skills that names the agent&#8217;s shortcut excuses and pairs each with a counter-argument:</p><blockquote><p><em>&#8220;I&#8217;ll write tests after the code works&#8221;</em> &#8594; <em>&#8220;You won&#8217;t. And tests written after the fact test implementation, not behavior.&#8221;</em></p></blockquote><blockquote><p><em>&#8220;It&#8217;s just a prototype&#8221;</em> &#8594; <em>&#8220;Prototypes become production code. Tests from day one prevent the test-debt crisis.&#8221;</em></p></blockquote><blockquote><p><em>&#8220;I tested it manually&#8221;</em> &#8594; <em>&#8220;Manual testing doesn&#8217;t persist. 
Tomorrow&#8217;s change might break it with no way to know.&#8221;</em></p></blockquote><p>That table sits inside <em><strong>test-driven-development/SKILL.md</strong></em>. Most lifecycle skills in the repo ship one of their own. It is the single structural move that separates <em><strong>agent-skills</strong></em> from every other skills repo in the feed.</p><div><hr></div><h2>Context</h2><p><a href="https://x.com/@addyosmani">Osmani</a> wrote <em>Learning JavaScript Design Patterns</em> and is an engineering lead at Google Chrome. He summarized the motivation in a <a href="https://addyosmani.com/blog/agent-skills/">May 3 research post</a>:</p><blockquote><p><em>&#8220;A senior engineer&#8217;s job is mostly the parts that don&#8217;t show up in the diff.&#8221;</em></p></blockquote><p>The repo was created on February 15, 2026 and has shipped 170+ commits from 20+ contributors since. The latest release is <em><strong>v0.6.0</strong></em> on April 28. 
Growth went from 27K+ stars on May 4 to 39K+ on May 11.</p><div><hr></div><h2>How it sits next to the other major skills repos</h2><p>If you&#8217;ve been tracking the skills space, the question isn&#8217;t &#8220;what is this.&#8221; It&#8217;s what this does that <em><strong>obra/superpowers</strong></em> and <em><strong>anthropics/skills</strong></em> don&#8217;t.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vHoU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b42df46-8edf-4572-883b-e1690a0568d9_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vHoU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b42df46-8edf-4572-883b-e1690a0568d9_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!vHoU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b42df46-8edf-4572-883b-e1690a0568d9_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!vHoU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b42df46-8edf-4572-883b-e1690a0568d9_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!vHoU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b42df46-8edf-4572-883b-e1690a0568d9_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vHoU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b42df46-8edf-4572-883b-e1690a0568d9_1672x941.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2b42df46-8edf-4572-883b-e1690a0568d9_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vHoU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b42df46-8edf-4572-883b-e1690a0568d9_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!vHoU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b42df46-8edf-4572-883b-e1690a0568d9_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!vHoU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b42df46-8edf-4572-883b-e1690a0568d9_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!vHoU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b42df46-8edf-4572-883b-e1690a0568d9_1672x941.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The star comparison cuts the other way: <em><strong>obra/superpowers</strong></em> is bigger by raw count, <em><strong>anthropics/skills</strong></em> is the official spec. 
The unique contribution of <em><strong>agent-skills</strong></em> is structural: 20 anti-rationalization tables, a parallel-fan-out command, and an enforcement hook layer.</p><div><hr></div><h2>Repo snapshot (11 May)</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gMat!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bcceda0-f0a8-4943-a542-1e2dc3cec742_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gMat!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bcceda0-f0a8-4943-a542-1e2dc3cec742_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!gMat!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bcceda0-f0a8-4943-a542-1e2dc3cec742_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!gMat!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bcceda0-f0a8-4943-a542-1e2dc3cec742_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!gMat!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bcceda0-f0a8-4943-a542-1e2dc3cec742_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gMat!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bcceda0-f0a8-4943-a542-1e2dc3cec742_1672x941.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4bcceda0-f0a8-4943-a542-1e2dc3cec742_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gMat!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bcceda0-f0a8-4943-a542-1e2dc3cec742_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!gMat!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bcceda0-f0a8-4943-a542-1e2dc3cec742_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!gMat!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bcceda0-f0a8-4943-a542-1e2dc3cec742_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!gMat!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bcceda0-f0a8-4943-a542-1e2dc3cec742_1672x941.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><h2>Three composable layers</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!i4Bf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3727a28-7cdd-498c-9a7c-62576bde72b6_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!i4Bf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3727a28-7cdd-498c-9a7c-62576bde72b6_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!i4Bf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3727a28-7cdd-498c-9a7c-62576bde72b6_1536x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!i4Bf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3727a28-7cdd-498c-9a7c-62576bde72b6_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!i4Bf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3727a28-7cdd-498c-9a7c-62576bde72b6_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!i4Bf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3727a28-7cdd-498c-9a7c-62576bde72b6_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c3727a28-7cdd-498c-9a7c-62576bde72b6_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!i4Bf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3727a28-7cdd-498c-9a7c-62576bde72b6_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!i4Bf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3727a28-7cdd-498c-9a7c-62576bde72b6_1536x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!i4Bf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3727a28-7cdd-498c-9a7c-62576bde72b6_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!i4Bf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3727a28-7cdd-498c-9a7c-62576bde72b6_1536x1024.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The governing rule is short and load-bearing. The user, or a slash command on the user&#8217;s behalf, is the orchestrator. 
Personas do not invoke other personas. Skills are mandatory hops inside a persona&#8217;s workflow.</p><p>This rule was formalized in the v0.6.0 release notes after a stretch of issues where contributors tried to write &#8220;meta-orchestrator&#8221; personas that routed work to other personas. The repo names this an anti-pattern and rejects it on two grounds: it loses information at every paraphrasing hop, and on Claude Code it is impossible by platform constraint anyway, since subagents cannot spawn other subagents.</p><p>The only multi-persona pattern the repo endorses is parallel fan-out, used by <em><strong>/ship</strong></em>. More on that below.</p><div><hr></div><h2>How it works</h2><blockquote><p><em><strong>To skip ahead, jump to the &#8220;How to get started&#8221; section below.</strong></em></p></blockquote><p><strong>Inside the repo.</strong> Skills sit in <em><strong>skills/&lt;name&gt;/SKILL.md</strong></em>. Personas live in <em><strong>agents/</strong></em> as plain Markdown files. Slash commands ship twice: <em><strong>.claude/commands/*.md</strong></em> for Claude Code and <em><strong>.gemini/commands/*.toml</strong></em> for Gemini CLI. Hooks are bash scripts in <em><strong>hooks/</strong></em>. 
Four reference checklists (testing, security, performance, accessibility) sit in <em><strong>references/</strong></em>, separate from any skill so they load on demand.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6wMk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a593d39-e391-4b82-a194-87e50b9c16bb_1693x929.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6wMk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a593d39-e391-4b82-a194-87e50b9c16bb_1693x929.png 424w, https://substackcdn.com/image/fetch/$s_!6wMk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a593d39-e391-4b82-a194-87e50b9c16bb_1693x929.png 848w, https://substackcdn.com/image/fetch/$s_!6wMk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a593d39-e391-4b82-a194-87e50b9c16bb_1693x929.png 1272w, https://substackcdn.com/image/fetch/$s_!6wMk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a593d39-e391-4b82-a194-87e50b9c16bb_1693x929.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6wMk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a593d39-e391-4b82-a194-87e50b9c16bb_1693x929.png" width="1456" height="799" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0a593d39-e391-4b82-a194-87e50b9c16bb_1693x929.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:799,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6wMk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a593d39-e391-4b82-a194-87e50b9c16bb_1693x929.png 424w, https://substackcdn.com/image/fetch/$s_!6wMk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a593d39-e391-4b82-a194-87e50b9c16bb_1693x929.png 848w, https://substackcdn.com/image/fetch/$s_!6wMk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a593d39-e391-4b82-a194-87e50b9c16bb_1693x929.png 1272w, https://substackcdn.com/image/fetch/$s_!6wMk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a593d39-e391-4b82-a194-87e50b9c16bb_1693x929.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>The SKILL.md format.</strong> Every skill file follows the same anatomy:</p><ul><li><p><strong>YAML frontmatter</strong> (<em><strong>name</strong></em> + <em><strong>description</strong></em>). Only these two fields are loaded at session start, so the agent can scan all 22 skills cheaply. Full file content loads only when the skill is invoked. This is the progressive-disclosure mechanic that keeps the pack under context budget even with all skills installed.</p></li><li><p><strong>Overview and When to Use.</strong> Two short blocks. The first says what the skill does. The second lists the exact triggers that should activate it (for example, &#8220;implementing any new logic,&#8221; &#8220;fixing a bug&#8221;). The agent matches the current task against the triggers.</p></li><li><p><strong>Process.</strong> The numbered step-by-step workflow. The largest section in every file, and the part the agent actually executes. 
Steps include inline code examples, decision flowcharts, and templates the agent fills in.</p></li><li><p><strong>Common Rationalizations.</strong> A two-column table of agent excuses paired with counter-arguments. The agent has to read its own most-likely shortcut before it can take it. 20 of the 22 skills ship one of these.</p></li><li><p><strong>Red Flags.</strong> A bullet list of observable signs the skill is being violated. The agent self-monitors against the list. The reviewer in <em><strong>/review</strong></em> also checks against it.</p></li><li><p><strong>Verification.</strong> A checklist of exit criteria the agent must satisfy before marking work done. Every checkbox is evidence-based: tests passing, build output, runtime data, screenshot. The repo&#8217;s rule: <em>&#8220;Seems right is never sufficient.&#8221;</em></p></li></ul><p><strong>How a skill activates.</strong> Two paths. The agent auto-activates a skill by matching the current task against the <em><strong>When to Use</strong></em> triggers in the frontmatter. The meta-skill <em><strong>using-agent-skills</strong></em> holds the routing flowchart that does this matching, and is injected into every session by the session-start hook. The second path is explicit: the user invokes a skill via a slash command (<em><strong>/spec</strong></em>, <em><strong>/test</strong></em>, <em><strong>/review</strong></em>, etc.). Either path loads the full <em><strong>SKILL.md</strong></em> into context.</p><p><strong>Length and structure rules.</strong> Every <em><strong>SKILL.md</strong></em> stays under 500 lines. Reference material that would push it over lives in <em><strong>references/</strong></em>, loaded only when a skill needs it. The longest skill file in the repo is <em><strong>ci-cd-and-automation</strong></em> at 390 lines. 
Verified from the local clone at SHA <em><strong>3ff4b518</strong></em>: 22 of 22 skill files have valid YAML frontmatter, 21 of 22 have a <em><strong>## Verification</strong></em> block, 20 of 22 have a <em><strong>## Common Rationalizations</strong></em> table.</p><p><strong>Start here.</strong> For most teams, the first three skills to load are <em><strong>spec-driven-development</strong></em>, <em><strong>test-driven-development</strong></em>, and <em><strong>code-review-and-quality</strong></em>. They cover the highest-risk agent failure loop: unclear task, untested change, and unreviewed diff.</p><blockquote><p><em><strong>The full six-phase lifecycle map is near the end of this article for readers who want every skill by phase.</strong></em></p><div><hr></div></blockquote><h2>Evidence: four structural moves that make it defensible</h2><p>Osmani&#8217;s frame is short: <em>&#8220;Agents skip those steps for the same reason any junior would. They&#8217;re invisible.&#8221;</em> The repo makes those steps visible in four places.</p><p><strong>Anti-rationalization tables.</strong> <em><strong>test-driven-development</strong></em> is one of 20 skills with a table that names the shortcut and rejects it before the agent can take it.</p><p><strong>The </strong><em><strong>/ship</strong></em><strong> fan-out.</strong> <em><strong>/ship</strong></em> spawns <em><strong>code-reviewer</strong></em>, <em><strong>security-auditor</strong></em>, and <em><strong>test-engineer</strong></em> concurrently, then merges their reports into a GO or NO-GO decision with a rollback plan. 
It skips fan-out only when the change touches two files or fewer, stays under 50 lines, and avoids auth, payments, data access, and config.</p><p><strong>Three hook systems.</strong> <em><strong>session-start.sh</strong></em> injects <em><strong>using-agent-skills</strong></em> on new Claude Code sessions, uses <em><strong>jq</strong></em> for the JSON payload, falls back to <em><strong>INFO</strong></em> when <em><strong>jq</strong></em> is missing, and passes <em><strong>bash hooks/session-start-test.sh</strong></em>. <em><strong>sdd-cache-{pre,post}.sh</strong></em> caches source docs keyed by <em><strong>sha256(url)</strong></em> and serves a cached body only after the server answers <em><strong>304 Not Modified</strong></em> to a conditional <em><strong>If-None-Match</strong></em> or <em><strong>If-Modified-Since</strong></em> request. <em><strong>simplify-ignore.sh</strong></em> protects <em><strong>/* simplify-ignore-start: reason */</strong></em> blocks with <em><strong>BLOCK_&lt;hash&gt;</strong></em> placeholders, and its tests report 21 passed, 0 failed.</p><p><strong>The newest skill, </strong><em><strong>doubt-driven-development</strong></em><strong>.</strong> Added in May 2026, it runs a fresh-context reviewer on non-trivial decisions using only the artifact plus the contract, not the original agent&#8217;s reasoning. Cross-model review through Codex CLI or Gemini CLI is optional and requires explicit per-call authorization. The check happens mid-flight, before <em><strong>/review</strong></em>.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://alphasignalai.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading AlphaSignal! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2>How to install</h2><p><strong>Claude Code (the canonical path).</strong> Marketplace install:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;626e7db9-3b7b-41cd-9edb-86f4cc0231cd&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">/plugin marketplace add addyosmani/agent-skills
/plugin install agent-skills@addy-agent-skills</code></pre></div><p>For teams without SSH keys on GitHub, force HTTPS (workaround for PR #108):</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;5048e8a1-f5e9-4ce2-82bd-4d9df5552bb3&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">/plugin marketplace add https://github.com/addyosmani/agent-skills.git
/plugin install agent-skills@addy-agent-skills</code></pre></div><p>For local development against an in-flight skill, clone and point Claude Code at the working copy:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;dd33448f-446b-4997-9738-fc2d226a079f&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">git clone https://github.com/addyosmani/agent-skills.git
claude --plugin-dir /path/to/agent-skills</code></pre></div><p>At runtime, the <em><strong>session-start.sh</strong></em> hook injects the <em><strong>using-agent-skills</strong></em> meta-skill on every new session, which routes the first message to the matching skill. The 7 slash commands become explicit lifecycle entries on top of that. CI on <em><strong>main</strong></em> is green for the workflow <em><strong>Test Plugin Installation</strong></em> at the verified SHA <em><strong>3ff4b518</strong></em>.</p><p><strong>OpenCode.</strong> No slash commands and no plugin system. The integration runs through <em><strong>AGENTS.md</strong></em> and the built-in <em><strong>skill</strong></em> tool. The repo ships <em><strong>.opencode/skills</strong></em> as a symlink to <em><strong>../skills/</strong></em> so OpenCode resolves the same skill set. The execution rule in <em><strong>AGENTS.md</strong></em> maps user intent to skills on every turn: a new feature triggers <em><strong>spec-driven-development</strong></em>, a bug triggers <em><strong>debugging-and-error-recovery</strong></em>, and a code review triggers <em><strong>code-review-and-quality</strong></em>. The honest tradeoff, per the repo&#8217;s own <em><strong>docs/opencode-setup.md</strong></em>: skill invocation depends on model compliance with the <em><strong>AGENTS.md</strong></em> contract, with no hook layer.</p><p><strong>Cursor.</strong> Copy selected <em><strong>SKILL.md</strong></em> files into <em><strong>.cursor/rules/</strong></em>. Start with the essential set of two or three skills.</p><p><strong>Gemini CLI.</strong> Native install: <em><strong>gemini skills install https://github.com/addyosmani/agent-skills.git --path skills</strong></em>. 
The repo also ships <em><strong>.gemini/commands/*.toml</strong></em> with the same 7 commands, except <em><strong>/plan</strong></em> is renamed <em><strong>/planning</strong></em> because <em><strong>/plan</strong></em> collides with a Gemini internal command.</p><p><strong>Windsurf, Copilot, Kiro.</strong> Add skill content to <em><strong>.windsurfrules</strong></em>, <em><strong>.github/skills/</strong></em>, or <em><strong>.kiro/skills/</strong></em> respectively. All three are plain-Markdown integrations with no hook layer.</p><div><hr></div><h2>The six-phase lifecycle and 22 skills</h2><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;markdown&quot;,&quot;nodeId&quot;:&quot;4a7c91fe-e80d-45ec-80cb-b3e367fc96a3&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-markdown">DEFINE &#8594; PLAN &#8594; BUILD &#8594; VERIFY &#8594; REVIEW &#8594; SHIP
/spec   /plan   /build   /test    /review  /ship
                                  /code-simplify</code></pre></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lAEu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd280413b-66ac-44a7-8945-51af0b561664_1717x916.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lAEu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd280413b-66ac-44a7-8945-51af0b561664_1717x916.png 424w, https://substackcdn.com/image/fetch/$s_!lAEu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd280413b-66ac-44a7-8945-51af0b561664_1717x916.png 848w, https://substackcdn.com/image/fetch/$s_!lAEu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd280413b-66ac-44a7-8945-51af0b561664_1717x916.png 1272w, https://substackcdn.com/image/fetch/$s_!lAEu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd280413b-66ac-44a7-8945-51af0b561664_1717x916.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lAEu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd280413b-66ac-44a7-8945-51af0b561664_1717x916.png" width="1456" height="777" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d280413b-66ac-44a7-8945-51af0b561664_1717x916.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:777,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lAEu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd280413b-66ac-44a7-8945-51af0b561664_1717x916.png 424w, https://substackcdn.com/image/fetch/$s_!lAEu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd280413b-66ac-44a7-8945-51af0b561664_1717x916.png 848w, https://substackcdn.com/image/fetch/$s_!lAEu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd280413b-66ac-44a7-8945-51af0b561664_1717x916.png 1272w, https://substackcdn.com/image/fetch/$s_!lAEu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd280413b-66ac-44a7-8945-51af0b561664_1717x916.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Skills auto-activate by context. The slash commands are explicit user entries on top. The 22 skills, with what each one does so the reader can pick:</p><h3><strong>Meta</strong></h3><ul><li><p><em><strong>using-agent-skills</strong></em>: Routes incoming work to the right skill via a flowchart. Auto-loaded by the session-start hook on every Claude Code session. Defines the shared operating behaviors (surface assumptions, manage confusion, push back, enforce simplicity, scope discipline, verify don&#8217;t assume).</p></li></ul><h3><strong>Define</strong></h3><ul><li><p><em><strong>idea-refine</strong></em>: Turns vague ideas into concrete proposals through structured divergent and convergent thinking. Output is a one-page markdown spec with problem statement, recommended direction, MVP scope, and a &#8220;Not Doing&#8221; list.</p></li><li><p><em><strong>spec-driven-development</strong></em>: Writes a PRD before code: objective, commands, project structure, code style, testing strategy, boundaries (always/ask-first/never). 
Four-phase gated workflow (specify, plan, tasks, implement) with human review at each gate.</p></li></ul><h3><strong>Plan</strong></h3><ul><li><p><em><strong>planning-and-task-breakdown</strong></em>: Decomposes a spec into small verifiable tasks with acceptance criteria and dependency ordering. Cap of ~5 files per task. Each task includes its verification step.</p></li></ul><h3><strong>Build</strong></h3><ul><li><p><em><strong>incremental-implementation</strong></em>: Builds thin vertical slices: implement, test, verify, commit. Caps at ~100 lines of unverified code. Feature flags, safe defaults, rollback-friendly changes.</p></li><li><p><em><strong>test-driven-development</strong></em>: Red-Green-Refactor, the test pyramid (80% unit / 15% integration / 5% E2E), DAMP over DRY in tests, the Beyonce Rule, and the Prove-It pattern for bug fixes (failing reproduction test before the fix).</p></li><li><p><em><strong>context-engineering</strong></em>: Loads the right context at the right time. Rules files, context packing, MCP integrations.</p></li><li><p><em><strong>source-driven-development</strong></em>: Grounds framework decisions in official documentation with required citations. Detects stack and versions, fetches the relevant docs, flags conflicts with existing code. Paired with the <em><strong>sdd-cache</strong></em> hook for cross-session HTTP caching.</p></li><li><p><em><strong>doubt-driven-development</strong></em>: Adversarial fresh-context review on every non-trivial in-flight decision. Five-step cycle: CLAIM, EXTRACT, DOUBT, RECONCILE, STOP. 
Optional cross-model escalation to Codex CLI or Gemini CLI with explicit per-call authorization.</p></li><li><p><em><strong>frontend-ui-engineering</strong></em>: Component architecture, design systems, state management, responsive design, WCAG 2.1 AA accessibility.</p></li><li><p><em><strong>api-and-interface-design</strong></em>: Contract-first design, Hyrum&#8217;s Law, the One-Version Rule, error semantics, boundary validation.</p></li></ul><h3><strong>Verify</strong></h3><ul><li><p><em><strong>browser-testing-with-devtools</strong></em>: Chrome DevTools MCP for live runtime data. DOM inspection, console errors, network traces, performance profiling, screenshots.</p></li><li><p><em><strong>debugging-and-error-recovery</strong></em>: Five-step triage: reproduce, localize, reduce, fix, guard. Stop-the-line rule for failing tests, safe fallbacks.</p></li></ul><h3><strong>Review</strong></h3><ul><li><p><em><strong>code-review-and-quality</strong></em>: Five-axis review (correctness, readability, architecture, security, performance), change sizing ~100 lines, Critical/Important/Suggestion severity labels.</p></li><li><p><em><strong>code-simplification</strong></em>: Reduce complexity while preserving exact behavior. Chesterton&#8217;s Fence, the Rule of 500. Paired with the <em><strong>simplify-ignore</strong></em> hook for protected code blocks.</p></li><li><p><em><strong>security-and-hardening</strong></em>: OWASP Top 10 prevention, auth patterns, secrets management, dependency auditing, three-tier boundary validation.</p></li><li><p><em><strong>performance-optimization</strong></em>: Measure-first approach. 
Core Web Vitals targets, profiling workflows, bundle analysis, anti-pattern detection.</p></li></ul><h3><strong>Ship</strong></h3><ul><li><p><em><strong>git-workflow-and-versioning</strong></em>: Trunk-based development, atomic commits, change sizing ~100 lines, the commit-as-save-point pattern.</p></li><li><p><em><strong>ci-cd-and-automation</strong></em>: Shift Left, Faster is Safer, feature flags, quality gate pipelines, failure feedback loops.</p></li><li><p><em><strong>deprecation-and-migration</strong></em>: Code-as-liability mindset, compulsory vs advisory deprecation, migration patterns, zombie-code removal.</p></li><li><p><em><strong>documentation-and-adrs</strong></em>: Architecture Decision Records, API docs, inline documentation standards. Document the <em>why</em>, not the <em>what</em>.</p></li><li><p><em><strong>shipping-and-launch</strong></em>: Pre-launch checklists, feature flag lifecycle, staged rollouts, rollback procedures, monitoring setup.</p></li></ul><p><strong>The minimum viable set the community cites:</strong> <em><strong>spec-driven-development</strong></em>, <em><strong>test-driven-development</strong></em>, and <em><strong>code-review-and-quality</strong></em>. Add <em><strong>incremental-implementation</strong></em> and <em><strong>security-and-hardening</strong></em> for production work. Load others by phase as the task requires.</p><div><hr></div><h2>Limitations</h2><p><strong>Opt-in scaffolding, not enforcement.</strong> The anti-rationalization tables live in the <em><strong>SKILL.md</strong></em> files, but nothing physically prevents an agent from generating code that ignores them. Adoption rests on the model honoring the contract.</p><p><strong>Compliance-dependent on most harnesses.</strong> Only Claude Code&#8217;s plugin manifest, session-start hook, and <em><strong>/ship</strong></em> fan-out give the pack hard teeth. 
Cursor, Windsurf, OpenCode, and Copilot fall back to rules files the model may or may not honor.</p><p><strong>Plugin version drift.</strong> <em><strong>.claude-plugin/plugin.json</strong></em> declares plugin version <em><strong>1.0.0</strong></em>, while the latest GitHub release is <em><strong>v0.6.0</strong></em>. Open issue #145 and PR #155 track the mismatch.</p><p><strong>No empirical effectiveness benchmark.</strong> The repo provides workflows and verification checklists but no controlled experiment showing that agents using these skills produce fewer bugs or higher-quality reviews than the same agents without them.</p><p><strong>Shell hooks need review.</strong> The plugin includes shell hooks in <em><strong>hooks/</strong></em> for session start, source-doc caching, and simplify-ignore protection. Teams should review those scripts before enabling the plugin inside production workspaces.</p><p>The recommendation, then, is to adopt the pack on Claude Code, with one caveat: treat it as scaffolding that needs the agent&#8217;s cooperation, not a guarantee. On other harnesses, pilot the skills first and verify that the agent actually follows them.</p><div><hr></div><h2>AlphaSignal Take</h2><p><strong>Verdict: Production Ready for Claude Code teams, Worth Watching elsewhere.</strong> The skills do what the README claims, maintenance health is strong (170+ commits, 20+ contributors, daily PR cadence, CI green on <em><strong>main</strong></em>), and the marketplace install path is two commands on Claude Code.</p><p>On Cursor, Windsurf, OpenCode, Copilot, and other rules-file setups, the value depends on whether the agent actually honors the loaded instructions. Looking ahead, <strong>v0.7</strong> is the version to watch: open PRs and issues suggest it will formalize Kiro and Codex setup docs and resolve the plugin-version mismatch. 
The line that lands the design choice is Osmani&#8217;s own:</p><blockquote><p><em>&#8220;If you put a 2,000-word essay on testing best practices into the agent&#8217;s context, the agent reads it, generates plausible-looking text, and skips the actual testing. If you put a workflow there, the agent has something to do, and you have something to verify.&#8221;</em></p></blockquote><h2>Who benefits and who doesn&#8217;t</h2><p>It fits engineering teams running coding agents on production work, solo developers who want fewer agent fires by cherry-picking the minimum viable set (<em><strong>spec-driven-development</strong></em> + <em><strong>test-driven-development</strong></em> + <em><strong>code-review-and-quality</strong></em>), and platform teams designing internal agent frameworks (the three-layer model and parallel fan-out are reusable patterns).</p><p>It does not fit legacy codebases without specs or test infrastructure, teams whose primary harness is OpenCode without a strong AGENTS.md discipline, or anyone looking for a model upgrade rather than a workflow layer.</p><h2>Practitioner Implication</h2><p>For teams already running a skills layer, the upgrade case is the three-layer model plus parallel fan-out. Neither exists in <em><strong>obra/superpowers</strong></em> or <em><strong>anthropics/skills</strong></em>.</p>
<div><hr></div><h2>Links</h2><ul><li><p><a href="https://github.com/addyosmani/agent-skills">agent-skills repo</a> (~5 min setup on Claude Code)</p></li><li><p><a href="https://addyosmani.com/blog/agent-skills/">Addy Osmani on Agent Skills</a> (~10 min read)</p></li><li><p><a href="https://addyosmani.com/blog/agent-harness-engineering/">Agent Harness Engineering</a> (~12 min read, optional context)</p></li></ul><p>Follow <a href="https://x.com/@AlphaSignalAI">@AlphaSignalAI</a> for more content like this. Also, check out our <a href="https://luma.com/t24o902x">Harness Engineering workshop</a> on May 14th (2 days left, 50+ going).</p><div><hr></div><p>Subscribe at <a href="https://alphasignal.ai/">AlphaSignal.ai</a> for daily AI signals. Read by 280,000+ developers.</p><h2>Questions?</h2><p><strong>Q: What does </strong><em><strong>agent-skills</strong></em><strong> do that </strong><em><strong>obra/superpowers</strong></em><strong> doesn&#8217;t?</strong> A: Three things <em><strong>obra/superpowers</strong></em> does not document at the same structural depth. First, 20 of 22 skills include a Common Rationalizations table that names the excuses agents use to skip steps. Second, <em><strong>/ship</strong></em> is a parallel fan-out that runs <em><strong>code-reviewer</strong></em>, <em><strong>security-auditor</strong></em>, and <em><strong>test-engineer</strong></em> concurrently and merges their reports. 
Third, the repo ships three hook systems that give the pack enforcement weight on Claude Code.</p><p><strong>Q: Which skill was added most recently, and why does it matter?</strong> A: <em><strong>doubt-driven-development</strong></em>, added in May 2026. It runs an adversarial fresh-context reviewer on every non-trivial in-flight decision using a five-step cycle (CLAIM, EXTRACT, DOUBT, RECONCILE, STOP), with optional cross-model escalation to Codex or Gemini. The point is to catch wrong-direction work mid-flight, while course correction is still cheap, not at <em><strong>/review</strong></em> time when the diff is already written.</p><p><strong>Q: Which AI coding agents support </strong><em><strong>agent-skills</strong></em><strong> today?</strong> A: Claude Code (recommended path via plugin marketplace), Cursor, Gemini CLI (native skill install), Windsurf, OpenCode, GitHub Copilot, and Kiro IDE. The skills are plain Markdown and work with any agent that accepts system prompts or instruction files.</p><p><strong>Q: What is the minimum set of skills to install first?</strong> A: The community-cited minimum is <em><strong>spec-driven-development</strong></em>, <em><strong>test-driven-development</strong></em>, and <em><strong>code-review-and-quality</strong></em>. Adding <em><strong>incremental-implementation</strong></em> and <em><strong>security-and-hardening</strong></em> covers most production-relevant workflows without saturating the context window.</p><p><strong>Q: How does the </strong><em><strong>/ship</strong></em><strong> command work, and when does it skip the parallel review?</strong> A: <em><strong>/ship</strong></em> spawns three subagents in one turn: <em><strong>code-reviewer</strong></em>, <em><strong>security-auditor</strong></em>, and <em><strong>test-engineer</strong></em>. The main agent merges their reports into a GO or NO-GO decision with a mandatory rollback plan. 
It skips the fan-out only when the change touches two files or fewer, the diff is under 50 lines, and it does not touch auth, payments, data access, or config.</p>]]></content:encoded></item><item><title><![CDATA[When AI agents learn to engineer themselves]]></title><description><![CDATA[A primer on self-improving agents: Moving beyond the human-coded harness]]></description><link>https://alphasignalai.substack.com/p/when-ai-agents-learn-to-engineer</link><guid isPermaLink="false">https://alphasignalai.substack.com/p/when-ai-agents-learn-to-engineer</guid><dc:creator><![CDATA[AlphaSignal AI]]></dc:creator><pubDate>Tue, 12 May 2026 15:01:07 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!kFEL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93a310f3-61a4-4750-8747-2c39a2ae76f4_1920x1080.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kFEL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93a310f3-61a4-4750-8747-2c39a2ae76f4_1920x1080.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kFEL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93a310f3-61a4-4750-8747-2c39a2ae76f4_1920x1080.jpeg 424w, https://substackcdn.com/image/fetch/$s_!kFEL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93a310f3-61a4-4750-8747-2c39a2ae76f4_1920x1080.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!kFEL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93a310f3-61a4-4750-8747-2c39a2ae76f4_1920x1080.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!kFEL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93a310f3-61a4-4750-8747-2c39a2ae76f4_1920x1080.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kFEL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93a310f3-61a4-4750-8747-2c39a2ae76f4_1920x1080.jpeg" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/93a310f3-61a4-4750-8747-2c39a2ae76f4_1920x1080.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:123947,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://alphasignalai.substack.com/i/197319439?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93a310f3-61a4-4750-8747-2c39a2ae76f4_1920x1080.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kFEL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93a310f3-61a4-4750-8747-2c39a2ae76f4_1920x1080.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!kFEL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93a310f3-61a4-4750-8747-2c39a2ae76f4_1920x1080.jpeg 848w, https://substackcdn.com/image/fetch/$s_!kFEL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93a310f3-61a4-4750-8747-2c39a2ae76f4_1920x1080.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!kFEL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93a310f3-61a4-4750-8747-2c39a2ae76f4_1920x1080.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In recent weeks, we&#8217;ve talked a lot about AI harnesses: the scaffolding of tool calling, error handling, memory management, model routing, and verification steps that makes agentic applications reliable.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://alphasignalai.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://alphasignalai.substack.com/subscribe?"><span>Subscribe now</span></a></p><p>While these harnesses have brought AI into production, they rely mostly on human engineering. As a result, agent improvement is limited by the speed at which humans can write and refine this infrastructure.</p><p>A new class of self-improving agents aims to remove this human constraint. Instead of acting as passive components within a fixed system, these models act as their own engineers. They modify their own code to build more robust scaffolding, moving from being consumers of infrastructure to active producers of it.</p><h3><strong>Darwin-G&#246;del Machine: Bootstrapping via evolution</strong></h3><p>Current agentic systems rely on fixed, hand-crafted mechanisms. A developer typically writes the code that determines how the model handles input. This includes deciding when to retrieve information, how to use tools, when to reflect on a response, and how to handle errors.</p><p>This approach is brittle, and improving these agents is constrained by what humans can anticipate. If a developer does not predict a specific need, they cannot code a solution for it. 
This limits the agent&#8217;s ability to evolve without constant human intervention to rewrite the underlying logic.</p><p>The <a href="https://app.alphasignal.ai/c?uid=34c8FaDCpqXjJcUD&amp;cid=dd8c5dce808b5881&amp;lid=17N4Tx1MaJFZiOPD3&amp;mid=f83439dc-575e-4e47-82e6-8fd01bd4dfe3">Darwin-G&#246;del Machine</a> (DGM), introduced by Sakana AI, treats agent improvement as an open-ended evolutionary search. It starts with a baseline agent scaffold and gradually explores how modifications affect its performance.</p><p>DGM maintains an archive of successful agent variants, which it calls &#8220;stepping stones.&#8221; This prevents the system from getting stuck in dead ends by allowing it to return to previously successful code versions and branch out in new directions.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ma5m!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a6b53ec-1d45-4eef-a6e9-b0fe8c6376e3_2406x1144.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ma5m!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a6b53ec-1d45-4eef-a6e9-b0fe8c6376e3_2406x1144.png 424w, https://substackcdn.com/image/fetch/$s_!ma5m!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a6b53ec-1d45-4eef-a6e9-b0fe8c6376e3_2406x1144.png 848w, https://substackcdn.com/image/fetch/$s_!ma5m!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a6b53ec-1d45-4eef-a6e9-b0fe8c6376e3_2406x1144.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ma5m!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a6b53ec-1d45-4eef-a6e9-b0fe8c6376e3_2406x1144.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ma5m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a6b53ec-1d45-4eef-a6e9-b0fe8c6376e3_2406x1144.png" width="1456" height="692" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2a6b53ec-1d45-4eef-a6e9-b0fe8c6376e3_2406x1144.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:692,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;alpha_signal_image_1&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="alpha_signal_image_1" title="alpha_signal_image_1" srcset="https://substackcdn.com/image/fetch/$s_!ma5m!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a6b53ec-1d45-4eef-a6e9-b0fe8c6376e3_2406x1144.png 424w, https://substackcdn.com/image/fetch/$s_!ma5m!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a6b53ec-1d45-4eef-a6e9-b0fe8c6376e3_2406x1144.png 848w, https://substackcdn.com/image/fetch/$s_!ma5m!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a6b53ec-1d45-4eef-a6e9-b0fe8c6376e3_2406x1144.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ma5m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a6b53ec-1d45-4eef-a6e9-b0fe8c6376e3_2406x1144.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In practice, DGM uses an LLM to propose code improvements to its own Python codebase. It might add a patch validation step, improve its file-viewing capabilities, or implement more detailed history logs. 
These changes are structural modifications to how the agent operates.</p><p>The performance gains from this self-modification cycle are significant on coding tasks. By autonomously rewriting its own code, DGM increased its SWE-bench score (a benchmark of real-world GitHub issues) from 20% to 50%.</p><p>It also improved its score on Polyglot (another challenging coding benchmark) from 14.2% to 30.7%, outperforming hand-designed agents like Aider.</p><p>The main caveat with DGM is that it is built primarily for coding tasks. It assumes that skill at a specific task (like writing Python) translates into the skill required to modify itself. Because its core improvement mechanism remained largely fixed, it struggled to generalize self-improvement to non-coding fields.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://alphasignalai.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading AlphaSignal! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h3><strong>Hyperagents: Metacognitive self-modification</strong></h3><p>To address these limitations of DGM, researchers at Meta developed <a href="https://app.alphasignal.ai/c?uid=34c8FaDCpqXjJcUD&amp;cid=dd8c5dce808b5881&amp;lid=1lfOTKsG2b9vaoM27&amp;mid=f83439dc-575e-4e47-82e6-8fd01bd4dfe3">Hyperagents</a> (DGM-H). 
Self-improving agents usually have two main components: a &#8220;task agent&#8221; that solves the specific problem at hand, and a &#8220;meta agent&#8221; that analyzes and modifies the agents.</p><p>A hyperagent merges these two components into a single, editable program. In addition to rewriting its task logic, a hyperagent rewrites the logic it uses to evaluate and improve itself.</p><p>DGM-H builds on top of the original DGM. It preserves the open-ended structure of DGM to keep a pool of successful hyperagents. The system selects candidates from the pool, allows them to self-modify, evaluates the new variants on given tasks, and adds the successful ones back into the pool as stepping stones for future iterations.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CI9u!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f3c9a56-c848-4d95-87bb-f58b5a0b8840_1000x440.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CI9u!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f3c9a56-c848-4d95-87bb-f58b5a0b8840_1000x440.jpeg 424w, https://substackcdn.com/image/fetch/$s_!CI9u!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f3c9a56-c848-4d95-87bb-f58b5a0b8840_1000x440.jpeg 848w, https://substackcdn.com/image/fetch/$s_!CI9u!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f3c9a56-c848-4d95-87bb-f58b5a0b8840_1000x440.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!CI9u!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f3c9a56-c848-4d95-87bb-f58b5a0b8840_1000x440.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CI9u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f3c9a56-c848-4d95-87bb-f58b5a0b8840_1000x440.jpeg" width="1000" height="440" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7f3c9a56-c848-4d95-87bb-f58b5a0b8840_1000x440.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:440,&quot;width&quot;:1000,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;alpha_signal_image_2&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="alpha_signal_image_2" title="alpha_signal_image_2" srcset="https://substackcdn.com/image/fetch/$s_!CI9u!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f3c9a56-c848-4d95-87bb-f58b5a0b8840_1000x440.jpeg 424w, https://substackcdn.com/image/fetch/$s_!CI9u!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f3c9a56-c848-4d95-87bb-f58b5a0b8840_1000x440.jpeg 848w, https://substackcdn.com/image/fetch/$s_!CI9u!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f3c9a56-c848-4d95-87bb-f58b5a0b8840_1000x440.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!CI9u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f3c9a56-c848-4d95-87bb-f58b5a0b8840_1000x440.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This metacognitive approach allows for the emergence of complex behaviors without human prompting. For example, during training, DGM-H independently evolved its own persistent memory systems, performance tracking across generations, and multi-stage evaluation pipelines. 
It essentially built its own advanced harness from scratch.</p><p><em>Check out our upcoming <strong><a href="https://luma.com/t24o902x">Harness Engineering workshop</a></strong>: 60 seats available, $150 each.</em></p><p>Because the improvement mechanism itself can evolve, DGM-H works across diverse domains beyond coding. In a paper-reviewing task, an initially blank agent improved its accuracy from 0.0 to 0.710. In robotics, it refined the reward function for a quadruped robot, raising its score from 0.060 to 0.372 and eventually beating the human-designed baseline of 0.348.</p><h3><strong>Honorable mention: Karpathy&#8217;s Autoresearch</strong></h3><p>While Hyperagents represent a deep architectural shift, AI researcher Andrej Karpathy demonstrated the practical power of this concept with his <a href="https://app.alphasignal.ai/c?uid=34c8FaDCpqXjJcUD&amp;cid=dd8c5dce808b5881&amp;lid=Kb6DgorL9gmy5tib&amp;mid=f83439dc-575e-4e47-82e6-8fd01bd4dfe3">autoresearch</a> project. This open-source tool provides an example of self-improvement that developers can run immediately. It uses a straightforward loop to optimize machine learning models without human oversight.</p><p>Autoresearch is driven by a program.md file, in which the human engineer provides high-level instructions in plain Markdown.</p><p>The agent reads these instructions and makes changes to train.py, the file that contains the training code for a GPT model. It runs a 5-minute training job, checks the results, and repeats the cycle.</p><p>Autoresearch uses Git as research memory. 
If the metric improves, it commits the change; if not, it performs a &#8220;git reset&#8221; to the last known good state.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EzCC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9605d093-6ffc-44a7-8c20-ba0e685772fe_1400x764.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EzCC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9605d093-6ffc-44a7-8c20-ba0e685772fe_1400x764.png 424w, https://substackcdn.com/image/fetch/$s_!EzCC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9605d093-6ffc-44a7-8c20-ba0e685772fe_1400x764.png 848w, https://substackcdn.com/image/fetch/$s_!EzCC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9605d093-6ffc-44a7-8c20-ba0e685772fe_1400x764.png 1272w, https://substackcdn.com/image/fetch/$s_!EzCC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9605d093-6ffc-44a7-8c20-ba0e685772fe_1400x764.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EzCC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9605d093-6ffc-44a7-8c20-ba0e685772fe_1400x764.png" width="1400" height="764" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9605d093-6ffc-44a7-8c20-ba0e685772fe_1400x764.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:764,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;alpha_signal_image_3&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="alpha_signal_image_3" title="alpha_signal_image_3" srcset="https://substackcdn.com/image/fetch/$s_!EzCC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9605d093-6ffc-44a7-8c20-ba0e685772fe_1400x764.png 424w, https://substackcdn.com/image/fetch/$s_!EzCC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9605d093-6ffc-44a7-8c20-ba0e685772fe_1400x764.png 848w, https://substackcdn.com/image/fetch/$s_!EzCC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9605d093-6ffc-44a7-8c20-ba0e685772fe_1400x764.png 1272w, https://substackcdn.com/image/fetch/$s_!EzCC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9605d093-6ffc-44a7-8c20-ba0e685772fe_1400x764.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" 
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Experiments show that the agent can make interesting optimizations, such as discovering that increasing iteration speed is more beneficial than increasing batch size in certain contexts.</p><p>Beyond training models, Autoresearch can be used for any coding task whose outcome can be measured with a metric. For example, the Shopify team modified Autoresearch to <a href="https://app.alphasignal.ai/c?uid=34c8FaDCpqXjJcUD&amp;cid=dd8c5dce808b5881&amp;lid=1yQXC8fHdekVrOLGq&amp;mid=f83439dc-575e-4e47-82e6-8fd01bd4dfe3">optimize their CI pipelines</a>.</p>
<h3><strong>Limitations and the reality check</strong></h3><p>The move toward self-improving code is not without risk. The most significant hurdle is reward hacking. Because these agents optimize aggressively for a single metric, they often find loopholes in the grading function. In effect, they might shortcut their way to the target metric without achieving the underlying goal.</p><p>Agents can also become trapped in &#8220;local optima&#8221; and avoid making significant changes. Observations from the Autoresearch community show that agents often get stuck endlessly tweaking safe hyperparameter variations instead of attempting the bold architectural leaps required for true innovation.</p><p>There are also risks regarding compute. Without strict oversight, an agent could burn through a massive GPU budget overnight if it enters an improvement loop with no exit condition.</p><p>Finally, keep an eye out for security holes. While narrowly focusing on their metrics, self-improving agents might end up writing insecure code or circumventing safeguards meant to protect sensitive data.</p><p>The bottom line is that while we are excited about self-improving agents, we will still need experienced engineers to guide the process and make sure these helpful assistants avoid doing damage.</p><div><hr></div><p>Follow <a href="https://x.com/@AlphaSignalAI">@AlphaSignalAI</a> for more content like this. Subscribe at <a href="https://alphasignal.ai/">AlphaSignal.ai</a> for daily AI signals. 
Read by 300,000+ developers.</p><p>Also, check out our upcoming <strong><a href="https://luma.com/t24o902x">Harness Engineering workshop</a></strong>: 60 seats available, $150 each.</p><div><hr></div>]]></content:encoded></item></channel></rss>