What is vibe-testing?
Vibe-testing is the informal practice of comparing language models by trying them on one's own tasks and judging qualities such as clarity, style, trust, and workflow fit.
How is this different from standard LLM benchmarks?
Standard benchmarks evaluate every model on fixed prompts with fixed metrics. This paper instead studies how users personalize both what they test (the prompts) and how they judge model outputs (the criteria).
What does the paper find?
The main result is that personalization can change which model is preferred in head-to-head comparisons.
Why does personalization matter?
Different users care about different response qualities, such as clarity, ambiguity handling, explanation style, or workflow compatibility. Those preferences are often not captured by static leaderboards.
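As a concrete illustration, here is a minimal sketch (the scores and weights are hypothetical, not data from the paper) in which two users judge the same pair of outputs on the same qualities but weight them differently, and the head-to-head preference flips:

```python
# Hypothetical illustration: personalized weights over response qualities
# can flip which model a user prefers. All numbers are invented.

# Judged scores (0-1) for the same two outputs on three qualities.
scores = {
    "model_a": {"clarity": 0.9, "ambiguity_handling": 0.5, "workflow_fit": 0.4},
    "model_b": {"clarity": 0.6, "ambiguity_handling": 0.7, "workflow_fit": 0.9},
}

# Two users weight the same qualities differently.
user_weights = {
    "writer":   {"clarity": 0.7, "ambiguity_handling": 0.2, "workflow_fit": 0.1},
    "engineer": {"clarity": 0.1, "ambiguity_handling": 0.2, "workflow_fit": 0.7},
}

def preferred(weights: dict) -> str:
    """Return the model with the higher weighted total score."""
    totals = {
        model: sum(weights[q] * qualities[q] for q in weights)
        for model, qualities in scores.items()
    }
    return max(totals, key=totals.get)

for user, weights in user_weights.items():
    print(user, "prefers", preferred(weights))
# writer prefers model_a; engineer prefers model_b
```

Because the winner depends on the weights, no single static leaderboard can report a preference that holds for both users.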
What tasks are studied in this paper?
The proof-of-concept pipeline is evaluated on coding tasks and existing coding benchmarks.
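To make the setup concrete, here is a minimal sketch of what such a personalized evaluation loop might look like; the function names, rubric, and stubbed models below are assumptions for illustration, not the paper's actual interface:

```python
# Sketch of a personalized vibe-testing loop for coding tasks.
# Each model is represented as a callable that maps a task prompt to an
# output; the judge encodes one user's personal rubric.
from typing import Callable

def vibe_test(
    task: str,
    models: dict[str, Callable[[str], str]],  # model name -> generate function
    judge: Callable[[str, str], float],       # (task, output) -> score
) -> str:
    """Run one personalized head-to-head: generate, judge, pick a winner."""
    outputs = {name: generate(task) for name, generate in models.items()}
    totals = {name: judge(task, out) for name, out in outputs.items()}
    return max(totals, key=totals.get)

# Toy rubric standing in for one user's criteria: reward comments,
# penalize length. A different user would plug in a different judge.
def toy_judge(task: str, output: str) -> float:
    return output.count("#") - len(output) / 200

winner = vibe_test(
    "Write a function that reverses a string.",
    {
        "model_a": lambda t: "def rev(s):\n    # reverse via slicing\n    return s[::-1]",
        "model_b": lambda t: "def rev(s):\n    out = ''\n    for ch in s:\n        out = ch + out\n    return out",
    },
    toy_judge,
)
print("winner:", winner)
```

Swapping in a different judge (or different tasks) personalizes both what is tested and how outputs are scored, which is the axis of variation the paper studies.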