What is vibe-testing?
Vibe-testing is the informal practice of comparing language models by trying them on one's own tasks and judging qualities such as clarity, style, trust, and workflow fit.
How is this different from standard LLM benchmarks?
Standard benchmarks evaluate every model on fixed prompts with fixed metrics. This paper instead studies how users personalize both what they test (the prompts) and how they judge model outputs (the criteria).
What does the paper find?
The main result is that personalization can change which model is preferred in head-to-head comparisons.
Why does personalization matter?
Different users care about different response qualities, such as clarity, ambiguity handling, explanation style, or workflow compatibility. Those preferences are often not captured by static leaderboards.
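As a concrete illustration, here is a minimal sketch (the scores and weights are hypothetical, not data from the paper) in which two users judge the same pair of outputs on the same qualities but weight them differently, and the head-to-head preference flips:

```python
# Hypothetical illustration: personalized weights over response qualities
# can flip which model a user prefers. All numbers are invented.

# Judged scores (0-1) for the same two outputs on three qualities.
scores = {
    "model_a": {"clarity": 0.9, "ambiguity_handling": 0.5, "workflow_fit": 0.4},
    "model_b": {"clarity": 0.6, "ambiguity_handling": 0.7, "workflow_fit": 0.9},
}

# Two users weight the same qualities differently.
user_weights = {
    "writer":   {"clarity": 0.7, "ambiguity_handling": 0.2, "workflow_fit": 0.1},
    "engineer": {"clarity": 0.1, "ambiguity_handling": 0.2, "workflow_fit": 0.7},
}

def preferred(weights: dict) -> str:
    """Return the model with the higher weighted total score."""
    totals = {
        model: sum(weights[q] * qualities[q] for q in weights)
        for model, qualities in scores.items()
    }
    return max(totals, key=totals.get)

for user, weights in user_weights.items():
    print(user, "prefers", preferred(weights))
# writer prefers model_a; engineer prefers model_b
```

Because the winner depends on the weights, no single static leaderboard can report a preference that holds for both users.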
What tasks are studied in this paper?
The proof-of-concept pipeline is evaluated on coding tasks and existing coding benchmarks.
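To make the setup concrete, here is a minimal sketch of what such a personalized evaluation loop might look like; the function names, rubric, and stubbed models below are assumptions for illustration, not the paper's actual interface:

```python
# Sketch of a personalized vibe-testing loop for coding tasks.
# Each model is represented as a callable that maps a task prompt to an
# output; the judge encodes one user's personal rubric.
from typing import Callable

def vibe_test(
    task: str,
    models: dict[str, Callable[[str], str]],  # model name -> generate function
    judge: Callable[[str, str], float],       # (task, output) -> score
) -> str:
    """Run one personalized head-to-head: generate, judge, pick a winner."""
    outputs = {name: generate(task) for name, generate in models.items()}
    totals = {name: judge(task, out) for name, out in outputs.items()}
    return max(totals, key=totals.get)

# Toy rubric standing in for one user's criteria: reward comments,
# penalize length. A different user would plug in a different judge.
def toy_judge(task: str, output: str) -> float:
    return output.count("#") - len(output) / 200

winner = vibe_test(
    "Write a function that reverses a string.",
    {
        "model_a": lambda t: "def rev(s):\n    # reverse via slicing\n    return s[::-1]",
        "model_b": lambda t: "def rev(s):\n    out = ''\n    for ch in s:\n        out = ch + out\n    return out",
    },
    toy_judge,
)
print("winner:", winner)
```

Swapping in a different judge (or different tasks) personalizes both what is tested and how outputs are scored, which is the axis of variation the paper studies.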