Building accessibility audit agents for AI-generated UI
May 2026
TLDR: Vibe coding ships more interface in a week than human audit can review in a quarter. AI-generated UI inherits the same accessibility floor as the web it's trained on, and that floor is low. I built two surfaces, a Claude skill and a Claude plugin, that share one methodology and audit at AI speed. This is the context, the build, what dogfooding turned up, and what's next.

Where accessibility breaks today
Researchers behind the A11y-CUA benchmark (CHI 2026) tested Claude Sonnet 4.5 as a computer-use agent on sixty everyday web tasks. Book a flight. Fill a form. Find a result. Three input conditions, three ways of constraining how the agent sees and clicks the page.
- Standard access (mouse plus full viewport, how most users browse): 78% success
- Keyboard-only (no cursor, the path blind and motor-impaired users take through screen readers and switches): 42%
- Magnified viewport (zoomed to where most layouts reflow or break, the path low-vision users take): 28%
The agent fails worst on the paths blind, motor-impaired, and low-vision users rely on every day. If it can't navigate them, why would we expect it to generate them any better?
A11y was always backloaded. Automated scanners catch the mechanical class of issues; humans catch the rest. That was a manageable workflow when humans were the bottleneck on UI production. Now Cursor, v0, Lovable, and a hundred Claude skills are generating interface faster than any audit pipeline can keep up. The bottleneck didn't get smaller. The pipe got wider.
What I built (and the research that shaped it)
Two surfaces, one methodology, two stages of the design lifecycle.
A skill (/ally) for design time, inside Claude Code. Use it while you're prototyping a component, refining an AI-generated layout, or reviewing a handoff, before anything ships. It reads the file you're working in, follows the related components, and returns findings inline in chat.
A plugin (/audit) for already-shipped UI. Point it at any live URL and it opens a real browser, runs the industry-standard scanner (axe-core, the engine inside Lighthouse and most other a11y tools), screenshots the page, and reasons over what it sees. Use it on your portfolio, a competitor you're researching, a launched product, an AI-generated landing page, anything you can open in a browser.
Both write to the same methodology.md. The methodology is POUR plus a fifth category for AI-generated content. POUR (Perceivable, Operable, Understandable, Robust) has been the WCAG spine since 2008 and still supplies the right primitives. The fifth covers what the existing tooling has nothing to say about: alt text quality on generated images, focus management after a streaming response, announcement timing on live regions, hallucinated accessibility claims in body copy, and whether the AI-produced label matches the AI-produced action.
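One way to picture the shared methodology is as a single finding shape that both surfaces could emit. This is a minimal sketch, not the contents of methodology.md: the type names, field names, and the needsHumanReview helper are all hypothetical, and the severity labels are borrowed from the dogfooding results later in this post.

```typescript
// Hypothetical sketch: none of these names come from the real methodology.md.

// POUR plus the fifth category for AI-generated content.
type Category =
  | "perceivable"
  | "operable"
  | "understandable"
  | "robust"
  | "ai-generated";

// Labels as used in the dogfooding section. "AI-uncertain" means the agent
// declines to decide and hands the call to a person.
type Severity = "Critical" | "Major" | "AI-uncertain";

interface Finding {
  category: Category;
  severity: Severity;
  location: string; // file path for the skill, URL or selector for the plugin
  summary: string;
}

// Findings the agent should cite but not rule on.
function needsHumanReview(findings: Finding[]): Finding[] {
  return findings.filter((f) => f.severity === "AI-uncertain");
}

const sampleFindings: Finding[] = [
  {
    category: "operable",
    severity: "Critical",
    location: "Nav.tsx",
    summary: "Missing focus indicators on nav logo and links",
  },
  {
    category: "perceivable",
    severity: "AI-uncertain",
    location: "Nav.tsx",
    summary: "Nav text contrast depends on content scrolled underneath",
  },
];

console.log(needsHumanReview(sampleFindings).map((f) => f.summary));
```

Keeping one shape for both the skill and the plugin is what would let a design-time finding and a live-URL finding be compared or deduplicated later.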
Standards
The agent ships against WCAG 2.2 AA, a W3C Recommendation since October 2023, and prioritizes the new 2.2 criteria aimed at modern UI patterns: focus not obscured (2.4.11), target size (2.5.8, 24×24 CSS px), dragging alternatives (2.5.7), consistent help (3.2.6), redundant entry (3.3.7), and accessible authentication (3.3.8). WCAG 3.0 is still a Working Draft; Candidate Recommendation isn't expected until Q4 2027.
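The prioritized criteria can be written down as a small lookup table, and the target-size rule reduces to a simple comparison. This is a sketch under assumptions: the agent's real configuration is not shown in this post, so the constant and function names here are hypothetical. The criterion names and levels are the official WCAG 2.2 ones.

```typescript
// Hypothetical sketch: the agent's actual configuration is not shown in this post.

interface Criterion {
  id: string;
  name: string; // official WCAG 2.2 success criterion name
  level: "A" | "AA";
}

const PRIORITY_CRITERIA: Criterion[] = [
  { id: "2.4.11", name: "Focus Not Obscured (Minimum)", level: "AA" },
  { id: "2.5.7", name: "Dragging Movements", level: "AA" },
  { id: "2.5.8", name: "Target Size (Minimum)", level: "AA" },
  { id: "3.2.6", name: "Consistent Help", level: "A" },
  { id: "3.3.7", name: "Redundant Entry", level: "A" },
  { id: "3.3.8", name: "Accessible Authentication (Minimum)", level: "AA" },
];

// 2.5.8 asks for targets of at least 24 by 24 CSS pixels. The criterion also
// lists exceptions (spacing, inline targets, equivalent controls, and others)
// that this simple size check ignores.
const MIN_TARGET_PX = 24;

function meetsMinimumTargetSize(widthPx: number, heightPx: number): boolean {
  return widthPx >= MIN_TARGET_PX && heightPx >= MIN_TARGET_PX;
}

console.log(PRIORITY_CRITERIA.map((c) => `${c.id} ${c.name}`).join("\n"));
console.log(meetsMinimumTargetSize(24, 24)); // true
console.log(meetsMinimumTargetSize(44, 20)); // false
```

A failing size check is better treated as a prompt to look at the exceptions than as a confirmed violation.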
Tool gap (the agent's reason for existing)
axe-core is the best-in-class automated engine. Google Lighthouse uses axe-core under the hood, and so do most other scanners on the market. Same engine, different surface, same ceiling. The current axe-core release is 4.11.4 (May 2026), which added oklch/oklab color support, better aria-hidden handling, and lazy-load fixes. Improvements ship steadily, but the ceiling is structural. Automated tools can only check what rules express. Contrast ratio is a rule. Whether the alt text on an AI-generated image actually describes the image is not. Intelligent Guided Testing (semi-automated workflows that walk a human through AI-suggested checks) pushes coverage to about 80%. AI agents that reason about user impact, not just rule violations, close more still and scale to the volume vibe coding produces.
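To make "contrast ratio is a rule" concrete, here is the WCAG 2.x contrast formula as a short function. It is an illustrative sketch, not axe-core's implementation: the function names are my own, and it handles only opaque sRGB colors, with no alpha, gradients, or background images.

```typescript
// Illustrative sketch of the WCAG 2.x contrast-ratio formula.
// Not axe-core's code; names are hypothetical and only opaque sRGB is handled.

type Rgb = [number, number, number]; // sRGB channels, 0 to 255

// Linearize one sRGB channel per the WCAG relative luminance definition.
function linearize(channel: number): number {
  const c = channel / 255;
  return c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

function relativeLuminance([r, g, b]: Rgb): number {
  return 0.2126 * linearize(r) + 0.7152 * linearize(g) + 0.0722 * linearize(b);
}

// Ratio of the lighter to the darker luminance, each offset by 0.05.
function contrastRatio(a: Rgb, b: Rgb): number {
  const la = relativeLuminance(a);
  const lb = relativeLuminance(b);
  return (Math.max(la, lb) + 0.05) / (Math.min(la, lb) + 0.05);
}

// AA for normal-size text (success criterion 1.4.3) requires at least 4.5:1.
function passesAaNormalText(fg: Rgb, bg: Rgb): boolean {
  return contrastRatio(fg, bg) >= 4.5;
}

console.log(contrastRatio([0, 0, 0], [255, 255, 255]).toFixed(2)); // "21.00"
console.log(passesAaNormalText([118, 118, 118], [255, 255, 255])); // true
console.log(passesAaNormalText([119, 119, 119], [255, 255, 255])); // false
```

Everything here is arithmetic on two colors, which is why a scanner can check it. Nothing comparable exists for judging whether alt text describes its image.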
Existing AI a11y tool landscape
Credit where it's due. Community-Access agents by Taylor Arndt (screen reader user, COO at Techopolis) and Jeff Bishop: 57 MIT-licensed specialists enforcing WCAG 2.2 AA at code generation in Claude Code, Copilot, and Cursor (Arndt's "Fifty-Five Agents, Three Teams" is the canonical introduction). Matthew Stephens's "I Built 33 Claude Skills to Fix the Vibe Design Accessibility Gap" coined the phrase the discourse needed. Both intercept AI as a producer to stop bad a11y from being generated. Ally sits at the next stage and audits the result regardless of what produced it: the AI-generated artifacts already in production and the human-written ones nobody got around to checking.
Dogfooding and testing results
Skill run, on my own portfolio
First pass on Nav.tsx, following the import chain into LiquidText.tsx. Five findings: one Critical, one Major, three AI-uncertain. The two confident ones got fixed in the same session.
Critical (fixed). Missing focus indicators on the nav logo and links. Keyboard users got no visible feedback while tabbing through the header. The fix shipped at the token level: a new --focus-ring variable that every future component using :focus-visible inherits. One fix, every future header.
Major (fixed). LiquidText.tsx splits text per-character at runtime for an animated hover effect. The spans had no aria-hidden, so screen readers announced each letter individually on every page. axe couldn't catch it because the split happens at runtime; the agent followed the import chain and flagged it. The fix was to wrap the output in <span aria-hidden="true"> with an sr-only sibling carrying the real text. Right structurally, but the wrapper broke CSS inheritance, and underlines from parent <a> elements stopped propagating to the text. The agent is coverage and consistency, not authority. I added the regression to the methodology so the next run doesn't repeat it.
The other three findings were AI-uncertain: nav text contrast at the top state (depends on what page content scrolls underneath), an emoji-plus-text label redundancy (brand-intent call), and <nav> landmark labeling (depends on whether a footer nav ever ships). Three of five being AI-uncertain isn't a failure to decide; it's the right refusal.
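The shape of the LiquidText fix can be sketched as plain string-building. This is a stand-in under assumptions, not the real component: LiquidText.tsx is not shown in this post, so renderSplitText, escapeHtml, and the "char" class are hypothetical, and "sr-only" is assumed to be a visually-hidden utility class defined elsewhere.

```typescript
// Hypothetical stand-in for the real LiquidText.tsx, whose source is not shown.

function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Emits the real text once in a visually hidden span, then the per-character
// spans inside a single aria-hidden wrapper so they are not announced.
function renderSplitText(text: string): string {
  const chars = Array.from(text) // iterate by code point, not UTF-16 unit
    .map((ch) => `<span class="char">${escapeHtml(ch)}</span>`)
    .join("");
  return (
    `<span class="sr-only">${escapeHtml(text)}</span>` +
    `<span aria-hidden="true">${chars}</span>`
  );
}

console.log(renderSplitText("Hi"));
```

As the regression above showed, the extra wrapper can interrupt inherited styling such as link underlines, so the visual result is worth checking after a change like this.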


Plugin runs, on vercel.com and skylrk.com, two viewports each
Two test sites picked for contrast: mature design system versus indie vibe-design.
Vercel. axe-detected: 6 violation types across 24 nodes. AI-layer added 2 findings axe missed (duplicate "Start Deploying" CTAs, a possible live-region risk on the analytics ticker). Three design-system fixes resolve every axe finding: darken the --gray-800 token (12 contrast nodes), enforce accessible-name on the Link component (6 unnamed-link failures), patch the NavigationMenu collapse to use inert (3 hidden-but-focusable popovers). The maturity didn't produce the failures. It made the fixes leverageable.
Skylrk. axe-detected: 2 nodes (a cookie button at 3.81:1 contrast, a viewport meta that disables pinch zoom). AI-layer findings: 3 axe couldn't see. The site's whole interaction language is hover-reveal of floating clothing items, which is invisible to axe but a wall for every touch user and every screen reader. The mostly-black scroll below the hero has no headings or landmarks. Looks like a bug. Probably brand atmosphere.
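Counts like "violation types across nodes" fall out of grouping axe's violations by rule id. The sketch below assumes a simplified local version of the axe-core result shape (a violations array whose entries carry an id and a nodes list). The summarize helper is hypothetical, and the sample data is illustrative only, not the actual output from either site.

```typescript
// Simplified local types mirroring only the parts of an axe-core result used
// here. The real types carry more fields (impact, help text, html, and so on).
interface AxeNode {
  target: string[]; // CSS selector path to the failing element
}

interface AxeViolation {
  id: string; // axe rule id, for example "color-contrast"
  nodes: AxeNode[];
}

interface Summary {
  violationTypes: number;
  nodes: number;
  byRule: Record<string, number>;
}

// Hypothetical helper: counts distinct rules and failing nodes per rule.
function summarize(violations: AxeViolation[]): Summary {
  const byRule: Record<string, number> = {};
  let nodes = 0;
  for (const v of violations) {
    byRule[v.id] = (byRule[v.id] ?? 0) + v.nodes.length;
    nodes += v.nodes.length;
  }
  return { violationTypes: Object.keys(byRule).length, nodes, byRule };
}

// Illustrative data only, not the real vercel.com or skylrk.com results.
const sampleViolations: AxeViolation[] = [
  { id: "color-contrast", nodes: [{ target: [".cookie-accept"] }] },
  { id: "meta-viewport", nodes: [{ target: ['meta[name="viewport"]'] }] },
];

console.log(summarize(sampleViolations));
```

Grouping by rule is also what surfaces the design-system pattern: when many nodes share one rule and one token or component, a single fix clears them together.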
The pattern that emerged
Mature sites concentrate their issues in shared design system tokens and components: one fix, every place. Vibe-coded sites break differently, in structural choices like hover-as-primary-interaction or pages with no headings, which aren't rule violations and so automated scanners can't see them. And modern production sites increasingly block scanners outright: the plugin's first run on vercel.com failed before axe could load. The AI reasoning layer worked in all three cases because it reads what's already rendered. Automated tools weren't built for the failures vibe coding tends to ship, and the web is getting harder to scan at the same time.
Takeaways and next steps
Cite, never certify. The agent defers on anything that needs lived experience, real user testing, or knowledge of what hasn't shipped yet.
Agents-as-users. Semantic HTML serves AI agents the way it serves screen readers. The web they navigate at 42% is the web we built without landmarks. If AI generates the next web with the same gaps, the next agents will navigate it worse. Compounding runs one direction unless the fix is structural.
Property, not phase. Accessibility is a property of the work, owed to every person who uses it and embedded in the tools that build it.
Next. Test more, and explore whether running ally during prototyping measurably improves AI-generated UI at ship time.
Sources
Standards
Legal
Tools
AI accessibility landscape
- A11y-CUA Dataset: Characterizing the Accessibility Gap in Computer Use Agents (UC Berkeley + University of Michigan, CHI 2026), source of the 78/42/28 numbers
- Community-Access agents (GitHub)
- Community-Access org site
- Fifty-Five Agents, Three Teams by Taylor Arndt
- I Built 33 Claude Skills to Fix the Vibe Design Accessibility Gap by Matthew Stephens