Star-Agents: Using LLM Agents to Optimize Training Data

LLMs

Data Optimization

Instruction Tuning

NeurIPS

Zhou et al.’s Star-Agents system uses multiple LLM agent pairs, dual-model evaluation, and evolutionary sampling to automatically generate and curate high-quality instruction-tuning data, achieving a 12% average improvement and up to 40% on Fermi tasks.

Author

Sean Lewis

Published

March 2, 2026

📄 Read the Full Paper

The Gist

The quality of instruction-tuning data is often more important than quantity, but curating good data is expensive and subjective. Star-Agents automates this process with three key components:

Diverse Data Generation via Multiple Agent Pairs — Rather than using a single LLM to generate instructions and responses, Star-Agents deploys multiple instruction agents and response agents drawn from different models (Phi-2, ChatGLM, Gemma, Mistral, Qwen, ChatGPT). This maximizes diversity of perspective and reasoning style.
Dual-Model Evaluation using IFD Metric — Each generated example is scored by comparing how a small model versus a large model handle it (Instruction Following Difficulty metric), plus evaluation by an LLM quality judge. This captures both difficulty and quality simultaneously.
Evolutionary Sampling — Probability updates inspired by evolutionary algorithms up-weight agent-pairs that produce high-quality data, creating a feedback loop that steers generation toward better examples.

When tested on Pythia-1B and Llama-2-7B, the system achieved an average 12% improvement across standard benchmarks, with up to 40% improvement on Fermi estimation tasks. Star Instruct outperforms strong baselines like Evol-Instruct, Alpaca, and IFD-based methods on MT-bench, Vicuna-bench, and WizardLM testsets.

Why It Matters Now

Data quality is the new bottleneck. As models grow larger, the marginal return on additional data shrinks: each doubling of the dataset buys a smaller gain than the last. The return on better data, by contrast, keeps growing. A small set of high-quality instruction-response pairs can teach more than a far larger set of mediocre ones.

Star-Agents addresses a real pain point: manually curating instruction-tuning datasets is slow, expensive, and inconsistent. Automating this curation—especially in a way that adapts and improves over iterations—could be a force multiplier for smaller teams and organizations without unlimited human annotation budgets.

Key Results

Benchmark	Star-Agents	Evol-Instruct	Alpaca	IFD Baseline
MT-bench	+12–15%	Baseline	-5%	-2%
Vicuna-bench	+10–14%	Baseline	-3%	-1%
WizardLM testset	+11–13%	Baseline	-4%	-2%
Fermi Tasks	+40%	+15%	+5%	+8%

(Results on Pythia-1B and Llama-2-7B; improvements vary by model and task category)

The System Pipeline

graph LR
    A["Multiple Agent Pairs<br/>(Different LLMs)"] -->|Generate| B["Instruction + Response<br/>Candidates"]
    B -->|Evaluate| C["Dual-Model Evaluation<br/>(IFD + LLM Judge)"]
    C -->|Score & Rank| D["Curated Data<br/>High Quality Pool"]
    D -->|Feedback| E["Evolutionary Update<br/>(Agent Pair Weights)"]
    E -->|Next Round| A
    D -->|Output| F["Star Instruct Dataset"]
    
    style A fill:#0d7c5f,stroke:#1a1a1a,color:#fff
    style B fill:#6d5acd,stroke:#1a1a1a,color:#fff
    style C fill:#d4563a,stroke:#1a1a1a,color:#fff
    style D fill:#0d7c5f,stroke:#1a1a1a,color:#fff
    style E fill:#6d5acd,stroke:#1a1a1a,color:#fff
    style F fill:#0d7c5f,stroke:#1a1a1a,color:#fff

The Three Components

1. Diverse Data Generation

Star-Agents doesn’t rely on a single LLM to both generate instructions and respond to them. Instead, it maintains a pool of instructor models and responder models:

Instructor agents: Phi-2, ChatGLM, Gemma, Mistral, Qwen, ChatGPT (subset)
Responder agents: Same pool, but assigned independently

Why diversity matters: Different models have different inductive biases, reasoning styles, and knowledge bases. A math-focused model like Qwen might generate different (and complementary) reasoning traces than a general-purpose model. By pairing them in different combinations, you get a wider range of instruction complexity, style, and domain coverage.

In practice: At each generation step, the system selects an instructor-responder pair (uniformly at random at first, then weighted by the learned probabilities), has the instructor generate a new instruction, and has the responder answer it.
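The pair selection described above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the model names come from the list above, but treating agents as plain strings and the helper names `build_pairs` and `sample_pair` are assumptions made for the example.

```python
import itertools
import random

# Model names are taken from the article; representing agents as strings is an assumption.
AGENTS = ["Phi-2", "ChatGLM", "Gemma", "Mistral", "Qwen", "ChatGPT"]

def build_pairs(agents):
    """Every (instructor, responder) combination, including same-model pairs."""
    return list(itertools.product(agents, repeat=2))

def sample_pair(pairs, weights=None, rng=random):
    """Draw one pair. weights=None gives the uniform draw used before any learning."""
    return rng.choices(pairs, weights=weights, k=1)[0]

pairs = build_pairs(AGENTS)
rng = random.Random(0)  # seeded so the draw is repeatable
instructor, responder = sample_pair(pairs, rng=rng)
```

Passing a `weights` list of the same length as `pairs` switches from the uniform draw to the probability-weighted one used in later rounds.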

2. Dual-Model Evaluation (IFD)

Not all instruction-response pairs are equally valuable. An easy instruction that any model can follow teaches you nothing. An impossibly hard one is noise. The sweet spot is examples that challenge the target model but aren’t pure noise.

Instruction Following Difficulty (IFD) captures this by:
- Running a small target model (e.g., Pythia-1B) on the instruction
- Running a large reference model (e.g., ChatGPT, Llama-2-70B) on the same instruction
- IFD Score = How much harder the target model found it relative to the reference

High IFD = The instruction is challenging for the target model but solvable by the reference. Low IFD = Either trivial or too hard.

LLM Judge: In parallel, an LLM evaluates the quality of the response (correctness, clarity, helpfulness). This catches cases where the response is technically correct but poorly written.

Combined: IFD + LLM quality score = overall example score.
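The summary above does not spell out how the two scores are combined, so here is one plausible sketch: a weighted blend after mapping both to the range 0 to 1. The 10-point judge scale (`quality_max`) and the mixing weight `alpha` are assumptions for illustration, not values from the paper.

```python
def combine_scores(ifd_score: float, quality: float,
                   quality_max: float = 10.0, alpha: float = 0.5) -> float:
    """Blend difficulty and quality into one score in [0, 1].

    alpha is the weight on IFD; (1 - alpha) goes to the judge's quality score.
    Both defaults are illustrative assumptions.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    ifd = min(max(ifd_score, 0.0), 1.0)            # clamp IFD to [0, 1]
    q = min(max(quality / quality_max, 0.0), 1.0)  # rescale judge score to [0, 1]
    return alpha * ifd + (1.0 - alpha) * q

score = combine_scores(ifd_score=1.0, quality=5.0)  # 0.5 * 1.0 + 0.5 * 0.5 = 0.75
```

Rescaling first keeps either signal from dominating just because it lives on a larger numeric scale.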

3. Evolutionary Sampling

After evaluation, which agent pairs should you use more? Star-Agents applies evolutionary algorithm-inspired updates:

Agent pairs that produced high-scoring examples get higher sampling probability in the next round.
Agent pairs that produced low-scoring examples get lower probability.
Over many rounds, the system converges on a distribution favoring high-quality generators.

Pseudo-code:

For each round t:
    1. For each agent pair (i, j):
        Sample k examples
        Evaluate them → get scores s_1, ..., s_k
        Compute mean score μ_ij
    2. Update probabilities:
        p_ij(t+1) ∝ p_ij(t) × exp(β × μ_ij)
          where β controls learning rate
    3. Resample agent pairs according to new p for next round

This creates a positive feedback loop: good agent pairs generate more examples, which get evaluated, which reinforces that they’re good.
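Step 2 of the pseudo-code is easy to try in isolation. The sketch below applies the multiplicative update and renormalizes. The dictionary layout, the function name `update_pair_probs`, and the example scores are made up for illustration, and treating a pair with no score as having a mean of 0 is an assumption.

```python
import math

def update_pair_probs(probs, mean_scores, beta=1.0):
    """Apply p_ij(t+1) ∝ p_ij(t) * exp(beta * mu_ij), then renormalize to sum to 1.

    Pairs missing from mean_scores are treated as having a mean score of 0 (assumption).
    """
    unnormalized = {
        pair: p * math.exp(beta * mean_scores.get(pair, 0.0))
        for pair, p in probs.items()
    }
    total = sum(unnormalized.values())
    return {pair: w / total for pair, w in unnormalized.items()}

# Illustrative numbers only
probs = {("Mistral", "Qwen"): 0.5, ("Gemma", "Phi-2"): 0.5}
mean_scores = {("Mistral", "Qwen"): 0.9, ("Gemma", "Phi-2"): 0.3}
new_probs = update_pair_probs(probs, mean_scores, beta=2.0)
```

A larger `beta` shifts probability toward the best-scoring pairs faster, and `beta=0` leaves the distribution unchanged.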

The Lineage

Star-Agents sits at the intersection of several research threads:

Self-Instruct (Wang et al., 2023): The original idea that models can generate their own instruction-tuning data in a bootstrapping loop.
Evol-Instruct (WizardLM, Xu et al., 2023): Iteratively evolve instructions to make them harder and more diverse—Star-Agents borrows this iterative philosophy but adds multi-agent diversity and formal evaluation.
LIMA (Zhou et al., 2023): “Less is More in Alignment”—the insight that 1000 high-quality examples can rival 100K mediocre ones. Star-Agents is a method to find those 1000.
Data-Centric AI (Andrew Ng, others): The emerging consensus that, for many tasks, the priority is shifting from larger models toward better data.
Quality Filtering: Recent work on filtering synthetic data (e.g., TinyStories, CulturaX) shows that automated filtering + diversity is more efficient than raw scale.

Star-Agents unifies these themes: it’s data-centric, evolutionary, multi-agent, and quality-focused.

Rubber-Ducking the Jargon

Instruction Tuning: Fine-tuning a model on instruction-response pairs (input: “Translate to Spanish”, output: “The translation is…”) to teach it to follow natural language commands.
IFD (Instruction Following Difficulty): A metric comparing how easily a small model vs. a large model solves an instruction. High IFD = good learning signal.
Evolutionary Sampling: Iteratively favoring agent pairs that produce better outputs, inspired by natural selection.
Agent Pair: A combination of (instruction_generator, response_generator). Each pair is a possible “strategy” for generating data.
MT-bench: A benchmark of 80 multi-turn questions spanning eight categories (writing, reasoning, coding, and others); responses are typically scored by a strong LLM judge such as GPT-4.
Vicuna-bench: A set of 80 diverse instructions; typically evaluated by comparing model outputs against each other.
Data Curation: The process of selecting, filtering, and ranking data examples to maximize learning value.

What to Watch Out For

Limited Scale Testing: Star-Agents was evaluated on relatively small models (Pythia-1B, Llama-2-7B). Does it scale to 70B or 700B parameter models? Early indications are yes, but this wasn’t the focus of the paper.
Benchmark Bias: MT-bench and Vicuna-bench are themselves noisy and may not capture real-world utility. A model scoring higher on these benchmarks might not be better at your specific downstream task.
LLM Judge Bias: The LLM quality judge introduces its own biases. If your judge is Claude, it might over-reward certain writing styles or undervalue others that your actual users prefer.
Computational Cost: Running evaluations with both small and large models, plus LLM judge, adds overhead. The paper doesn’t deeply analyze the compute-benefit tradeoff.
Evolutionary Convergence: The evolutionary loop might converge prematurely to a local optimum, oversample certain agent pairs, or lose diversity over time if not carefully tuned.
Generalization: The generated data is optimized for the specific benchmark tasks. Does Star Instruct transfer well to tasks very different from MT-bench?
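On the Evolutionary Convergence point above, one common guard is to mix the learned distribution with a uniform one so that no agent pair's probability ever reaches zero. This is a general exploration trick, not something taken from the paper. The function name `mix_with_uniform` and the default `epsilon` are assumptions.

```python
def mix_with_uniform(probs, epsilon=0.1):
    """Return (1 - epsilon) * normalized probs + epsilon * uniform.

    Every entry ends up with probability at least epsilon / n,
    which keeps low-scoring pairs in play.
    """
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must be in [0, 1]")
    n = len(probs)
    total = sum(probs.values())
    return {
        key: (1.0 - epsilon) * (p / total) + epsilon / n
        for key, p in probs.items()
    }

# A distribution that has collapsed onto one pair regains some spread
mixed = mix_with_uniform({"pair_a": 1.0, "pair_b": 0.0}, epsilon=0.1)
```

The trade-off is that a fraction `epsilon` of each round's budget goes to pairs the system currently rates poorly.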

So What?

Three takeaways:

Data quality > data quantity for instruction tuning, especially for smaller models. If you’re fine-tuning a 1B–7B parameter model, focus on curating or generating fewer high-quality examples rather than scraping more low-quality ones.
Automated curation is possible and practical. You don’t need a massive team of human annotators. A well-designed system of agent pairs + evaluation can do much of the work.
Diversity of data sources matters. Using multiple LLMs to generate candidates, rather than a single model, outperforms single-source approaches. This mirrors findings in ensemble learning and data augmentation.

For practitioners: If you’re building a fine-tuned model, consider investing in data quality (automated or manual). Consider using multiple open-source LLMs as generators to maximize diversity. And measure your progress not just on benchmark scores but on actual downstream performance.

Reproduction & Implementation

Setup

You’ll need:
- A small target model (Pythia-1B, TinyLlama, or similar)
- A large reference model (Llama-2-70B, ChatGPT, or similar) for evaluation
- Several diverse generator models (Mistral, Qwen, ChatGLM, etc.)
- An evaluator LLM (GPT-4 or open-source like Llama-2-70B)

Pseudocode: The Main Loop

import math
import random
from statistics import mean
from typing import List

# `LLM` is a placeholder interface, not a real library class. Agents are assumed to
# expose generate_instruction(), respond(), solve(), and score_quality().
def star_agents_loop(
    agent_pool: List["LLM"],
    target_model: "LLM",
    reference_model: "LLM",
    evaluator: "LLM",
    rounds: int = 100,
    examples_per_round: int = 1000,
    keep_per_round: int = 100,  # assumed value; must be below examples_per_round to filter anything
    beta: float = 1.0,          # assumed learning rate for the probability update
):
    # One sampling weight per agent (a simplification of per-pair weights), equal initially
    agent_probs = [1.0 / len(agent_pool)] * len(agent_pool)
    curated_data = []

    for _ in range(rounds):
        candidates = []

        # Generation: sample agent pairs and generate examples
        for _ in range(examples_per_round):
            # random.choices samples with replacement, so one model can fill both roles
            instructor, responder = random.choices(agent_pool, weights=agent_probs, k=2)

            instruction = instructor.generate_instruction()
            response = responder.respond(instruction)
            candidates.append((instruction, response, instructor, responder))

        # Evaluation: compute IFD and quality scores
        scored_candidates = []
        for instruction, response, inst_agent, resp_agent in candidates:
            target_result = target_model.solve(instruction)
            reference_result = reference_model.solve(instruction)
            ifd_score = compute_ifd(target_result, reference_result)

            quality = evaluator.score_quality(instruction, response)

            # combine_scores is a helper you supply, e.g. a weighted sum of the two scores
            overall_score = combine_scores(ifd_score, quality)
            scored_candidates.append({
                'instruction': instruction,
                'response': response,
                'score': overall_score,
                'inst_agent': inst_agent,
                'resp_agent': resp_agent,
            })

        # Curation: keep top-k examples
        top_k = sorted(scored_candidates, key=lambda x: x['score'], reverse=True)[:keep_per_round]
        curated_data.extend(top_k)

        # Evolutionary update: reweight each agent by the mean score of every
        # example it took part in this round (as instructor or responder)
        for i, agent in enumerate(agent_pool):
            agent_scores = [c['score'] for c in scored_candidates
                            if c['inst_agent'] is agent or c['resp_agent'] is agent]
            avg_score = mean(agent_scores) if agent_scores else 0.0
            agent_probs[i] *= math.exp(beta * avg_score)
        total = sum(agent_probs)
        agent_probs = [p / total for p in agent_probs]  # Re-normalize

    return curated_data

IFD Calculation

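from typing import Callable

def is_correct(result: str) -> bool:
    """Placeholder check (assumption): treat any non-empty answer as correct.
    Replace with a task-specific grader such as exact match or an LLM judge."""
    return bool(result.strip())

def compute_ifd(target_result: str, reference_result: str,
                check: Callable[[str], bool] = is_correct) -> float:
    """
    Simplified pass/fail proxy for IFD:
    1.0 if target failed but reference succeeded,
    0.5 if both succeeded or both failed,
    0.0 if target succeeded but reference failed (shouldn't happen with a good reference).
    """
    target_correct = check(target_result)
    ref_correct = check(reference_result)

    if target_correct == ref_correct:
        return 0.5  # Both succeeded (trivial) or both failed (too hard)
    return 1.0 if ref_correct else 0.0  # 0.0 is the edge case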
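The pass/fail version above is a deliberate simplification. In the data-selection work that introduced IFD (Li et al., 2023), the score is, as I understand it, a ratio of losses: the model's average loss on the response given the instruction, divided by its average loss on the response alone. Here is a minimal sketch of that ratio, assuming you already have per-token log-probabilities from your model. The function names and inputs are hypothetical, and this is not claimed to be the exact computation used in Star-Agents.

```python
from typing import Sequence

def mean_nll(token_logprobs: Sequence[float]) -> float:
    """Average negative log-likelihood over the response tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / len(token_logprobs)

def ifd_ratio(logprobs_with_instruction: Sequence[float],
              logprobs_without_instruction: Sequence[float]) -> float:
    """Loss on the response given the instruction, divided by loss on the response alone.

    Values near or above 1 mean the instruction barely helps the model
    predict the response, i.e. the example is hard for that model.
    """
    conditioned = mean_nll(logprobs_with_instruction)
    unconditioned = mean_nll(logprobs_without_instruction)
    if unconditioned == 0.0:
        raise ValueError("unconditioned loss is zero; ratio is undefined")
    return conditioned / unconditioned

# Toy numbers: the instruction halves the loss on the response
example_ifd = ifd_ratio([-0.5, -0.5], [-1.0, -1.0])  # 0.5
```

Computing this ratio once with the small model and once with the large model gives two numbers you can compare, which is the spirit of the dual-model evaluation described earlier.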

Resources & Links

Paper: Zhou et al., “Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning” (NeurIPS 2024) | 📄 Hosted PDF
Related: Self-Instruct (Wang et al.), Evol-Instruct (WizardLM), LIMA (Zhou et al.)
Benchmarks: MT-bench (Zheng et al.), Vicuna-bench (Vicuna team)
Models: Hugging Face (Mistral, Qwen, Llama, etc.), OpenAI (ChatGPT as evaluator)

Closing

Star-Agents is an elegant reminder that curation beats collection. In an era where we can generate infinite synthetic data, the bottleneck is no longer quantity but quality. By combining diversity, rigorous evaluation, and feedback-driven adaptation, this system shows that machines can do much of the curator’s work, and do it well.

If you’re fine-tuning a small-to-medium model, you now have both a conceptual framework and a blueprint to try this yourself.