Claim: ARC-AGI-3 might be solved with GPT-5.5-xhigh + tools. Sentiment analysis shows 58.33% of replies expressing skepticism and 25% expressing support, a mixed community reaction.
there's a chance ARC-AGI-3 is already solved with GPT-5.5-xhigh + tools
Real-time analysis of public opinion and engagement
What the community is saying — both sides
Many replies argue that model performance depends far more on whether it can test, search and recover in a loop than on raw parameter-count or single-turn intelligence.
Critics say tools augment but do not replace the need for a strong underlying model: planning, prompt understanding, and error correction still rely on core architecture and training.
Requests focus on better plugin/tool integration, longer context, faster inference, local control, and robust developer APIs rather than just a new version name.
People worry that autonomous tool access creates dangerous feedback loops (unsupervised web queries, code execution), and call for strict permissioning, auditing, and sandboxing.
Several replies emphasize the need for benchmarks that measure closed-loop, tool-using behavior (multi-step retrieval+execution, recovery from failure), not just static LLM metrics.
A portion of replies treats “5.5 Pro” as marketing—demanding transparent changelogs and reproducible gains instead of hype-driven names.
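Several of the points above hinge on "closed-loop" behavior: the model proposes, an executable check rejects or accepts, and the failure signal feeds the next attempt. A minimal sketch of such a harness is below; `propose` (standing in for a model call) and `check` (standing in for a verifier/tool) are hypothetical names for illustration, not any real API, and the hidden-number task is a toy stand-in for a benchmark task.

```python
def closed_loop_solve(task, propose, check, max_steps=5):
    """Minimal closed-loop harness: propose -> check -> feed failure back.

    `propose(task, feedback)` stands in for a model call and `check(task,
    candidate)` for an executable verifier; both are illustrative stubs.
    Returns (answer, steps_used) or (None, max_steps) on failure.
    """
    feedback = None
    for step in range(max_steps):
        candidate = propose(task, feedback)
        ok, feedback = check(task, candidate)
        if ok:
            return candidate, step + 1
    return None, max_steps


# Toy task: find a hidden integer in [0, 10]; feedback says "higher"/"lower".
target = 6

def propose(task, feedback, state={"lo": 0, "hi": 10}):
    # Mutable-default dict as cross-call state is a toy shortcut, not a pattern.
    if feedback == "higher":
        state["lo"] = state["last"] + 1
    elif feedback == "lower":
        state["hi"] = state["last"] - 1
    state["last"] = (state["lo"] + state["hi"]) // 2  # binary-search guess
    return state["last"]

def check(task, guess):
    if guess == target:
        return True, None
    return False, "higher" if guess < target else "lower"

answer, steps = closed_loop_solve(None, propose, check)
print(answer, steps)  # finds 6 in 3 steps (guesses 5, 8, 6)
```

The point the replies make is visible even in this toy: a single-shot guess at the midpoint fails, while the same "model" with a verify-and-retry loop recovers in a few steps.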
A headline "solved" claim is misleading; the right unit is cost-per-successful-task at the budget cap. Without a capped inference spend, many results are an expensive lottery and evaporate when you constrain cost.
Echoing Chollet's position, some insist a model should be judged on its innate capabilities only, not propped up by toolchains or allowed to learn/improve during evaluation.
Others argue that sustainable progress requires models that are roughly 1000x cheaper than today's.
Skeptics see tool use as a way to claim progress rather than accept real failure modes.
Inflating scores by changing the rubric conflates success on easy cases with mastery of hard ones.
Some frame the remaining failures as a structural gap, not just a budget problem.
The benchmarks themselves are accused of being poor tests that fail to measure the right capabilities.
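The cost-per-successful-task argument above can be made concrete. A minimal sketch, assuming each evaluation run is a `(cost_usd, solved)` pair: runs that only succeed by exceeding the spend cap are counted as failures, which is exactly the "capped spend halves the numbers" point. The function name and the example costs are illustrative, not from any published ARC-AGI-3 result.

```python
def cost_per_successful_task(runs, budget_cap):
    """Score eval runs under a per-task inference budget cap.

    runs: iterable of (cost_usd, solved) pairs.
    A solve that only appears when spending past the cap does not count;
    spend on any run is charged only up to the cap.
    Returns total capped spend divided by capped successes (inf if none).
    """
    spent = 0.0
    successes = 0
    for cost, solved in runs:
        spent += min(cost, budget_cap)       # charge spend only up to the cap
        if solved and cost <= budget_cap:    # over-cap solves count as failures
            successes += 1
    return float("inf") if successes == 0 else spent / successes


# Hypothetical runs: two cheap solves, one cheap failure, one $12 solve.
runs = [(0.40, True), (3.00, True), (0.25, False), (12.0, True)]
print(cost_per_successful_task(runs, budget_cap=5.0))  # (0.40+3.00+0.25+5.0)/2 = 4.325
```

Note how the $12 solve inflates an uncapped success rate to 3/4 but contributes nothing under the cap, which is the "expensive lottery" effect the reply describes.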
Most popular replies, ranked by engagement
didn’t u hear bro francois doesn’t believe in harnesses in benchmarks. not agi unless the model spawns in with perfect innate capabilities and not allowed to learn/improve with tools.
Waiting for Fran to move the goal post again.
ARC-AGI-3 'solved' with xhigh + tools is the wrong unit. cost-per-successful-task at the budget cap is what determines whether the result is reproducible or just expensive lottery. cap inference spend and those numbers usually halve.
https://t.co/LqsXN8b1CS
Tools are the hidden variable. Raw intelligence matters less once the environment lets the model test, search, and recover inside the loop.
what about GPT 5.5 Pro?