
Human-Centred Evaluation of an AI Coding Assistant

Client: AI Coding Tool
Industry: AI/ML Development • Software Tools • Human-in-the-Loop Evaluation
Services Provided: AI Model Evaluation • Human Feedback Integration • Workflow Simulation • Quality Benchmarking • Data-Driven Reporting

Background

A rapidly growing AI company developing an intelligent coding assistant wanted to assess its real-world performance beyond synthetic benchmarks. The assistant included advanced features designed to help developers explore repositories, refactor code, and execute commands seamlessly.

To ensure the product could perform effectively in authentic coding environments, the client sought to measure not just functionality but how developers naturally interacted with the assistant as they switched files, ran commands, and made edits in context.

Coaldev was engaged to perform a comprehensive side-by-side human evaluation (SxS Human Eval) that replicated the daily workflows of developers. The objective was to gather human feedback that reflected both the usability and performance of the AI tool, enabling the client’s team to improve reliability and user experience before a wider rollout.
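
In a side-by-side evaluation of this kind, each evaluator typically works through the same task with two assistant configurations and records a preference along with structured judgments. The sketch below shows what a single SxS rating record could look like; every field name here is illustrative, not the client's actual schema.

```javascript
// Illustrative shape of one side-by-side (SxS) rating record.
// Field names and rubric scales are hypothetical, not the client's schema.
const sxsRating = {
  taskId: "repo-refactor-014",     // task drawn from a live codebase
  evaluatorId: "eval-07",
  variantA: "build-A",             // assistant configuration shown first
  variantB: "build-B",             // assistant configuration shown second
  preference: "A",                 // "A", "B", or "tie"
  correctness: { A: 4, B: 3 },     // 1-5 rubric scores
  contextUse: { A: 5, B: 4 },
  notes: "Variant A recovered from a failed command; B repeated the error.",
};
```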

Challenges

Evaluating a coding assistant in real developer environments required capturing natural behavior, balancing multiple feature modes, and assessing code quality with human judgment. The goal was to simulate authentic workflows while still producing consistent, comparable data across evaluators. Below are the main challenges Coaldev addressed while designing this human-centred evaluation framework.

Natural workflow fidelity: Traditional testing methods failed to capture how developers actually switch files, navigate code, and execute commands in sequence.

Multi-feature consistency: The assistant’s various modes had to be tested in equivalent, unbiased conditions to ensure comparability.

Evaluating complex code accurately: Human evaluators had to assess the assistant’s responses for correctness and contextual understanding.

Scalable evaluation framework: The client required a consistent and repeatable method for collecting human judgments across multiple evaluators and codebases.

Solution

Coaldev needed to strike the right balance between human realism and structured repeatability in the evaluation process.

Coaldev developed a rigorous human-in-the-loop evaluation framework to simulate real-world developer behavior across the assistant’s core features. Evaluators engaged with context-rich prompts and live codebases, systematically using Composer, Agent, and Search to complete end-to-end tasks. Their interactions—ranging from terminal execution to code validation—were logged and analyzed by Coaldev’s QA and data science teams to assess tool reliability, error recovery, and code quality. A custom reporting dashboard visualized performance metrics and developer feedback, enabling the client to pinpoint weaknesses, prioritize improvements, and benchmark future builds.
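
To illustrate the kind of interaction logging described above, a minimal session logger might append timestamped records of file switches, command runs, edits, and assistant responses so that each session can be replayed during analysis. The function and file names below are assumptions for the sketch, not Coaldev's actual logging framework.

```javascript
const fs = require("fs");

// Minimal sketch of a session logger: each evaluator action is appended
// as one JSON line for later analysis. Names (createSessionLogger,
// session-<id>.jsonl, event types) are illustrative assumptions.
function createSessionLogger(sessionId) {
  const path = `session-${sessionId}.jsonl`;
  return function logEvent(type, payload) {
    const record = {
      sessionId,
      type,        // e.g. "file_switch", "command_run", "edit", "assistant_response"
      payload,     // free-form details: file path, command, exit code, diff summary
      timestamp: new Date().toISOString(),
    };
    fs.appendFileSync(path, JSON.stringify(record) + "\n");
  };
}

// Example usage during an evaluation session
const log = createSessionLogger("eval-07-task-014");
log("command_run", { command: "npm test", exitCode: 1 });
log("assistant_response", { feature: "Agent", accepted: true });
```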

Here are the key components of our solution:

1. Structured Evaluation Protocol

Standardized testing across all assistant modes to ensure consistency.

2. Realistic Developer Simulation

End-to-end task execution using live codebases and assistant features.

3. Quantitative & Qualitative Logging

Captured detailed insights from evaluator interactions.

4. Custom Analytics Pipeline

Assessed reliability, recovery, and code quality across sessions (a rough aggregation sketch follows this list).

5. Reporting Dashboard

Visualized task completion, accuracy, and feedback for traceable benchmarking.
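
To make the analytics pipeline concrete, the sketch below shows one way per-session logs could be rolled up into the dashboard metrics named above (task completion, error recovery, code quality). The field names, score scale, and recovery definition are illustrative assumptions, not the pipeline Coaldev actually built.

```javascript
// Hypothetical aggregation over per-session summaries produced by the
// logging framework. A session "recovers" if it hit errors but still
// completed the task; codeQualityScore is a 1-5 rubric score.
function summarizeSessions(sessions) {
  const completed = sessions.filter((s) => s.taskCompleted).length;
  const withErrors = sessions.filter((s) => s.errorCount > 0);
  const recovered = withErrors.filter((s) => s.taskCompleted).length;
  const avgQuality =
    sessions.reduce((sum, s) => sum + s.codeQualityScore, 0) / sessions.length;

  return {
    completionRate: completed / sessions.length,
    errorRecoveryRate: withErrors.length ? recovered / withErrors.length : null,
    averageCodeQuality: avgQuality,
  };
}

// Example: three evaluator sessions rolled up for the dashboard
console.log(
  summarizeSessions([
    { taskCompleted: true, errorCount: 0, codeQualityScore: 5 },
    { taskCompleted: true, errorCount: 2, codeQualityScore: 4 },
    { taskCompleted: false, errorCount: 3, codeQualityScore: 2 },
  ])
);
```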

Results

The project provided the client with critical, human-driven insights that directly influenced their AI assistant’s next iteration.

Key results

  • Human-verified evaluation data: Delivered qualitative feedback highlighting usability friction, error types, and feature-level strengths.
  • Accuracy insights: Clear visibility into the assistant’s ability to generate correct and executable code.
  • Structured benchmark framework: Established a repeatable human evaluation protocol for future testing cycles.
  • Faster refinement: Enabled the client’s engineering team to focus improvements on high-impact areas, accelerating iteration timelines by over 30%.

Coaldev’s work ensured that the AI model’s evaluation transitioned from synthetic, lab-based metrics to real-world, developer-centred performance validation.

Technical Overview

Tools & Technologies

  • JavaScript
  • Command-Line Tools
  • Human Evaluation Logging Framework

Methods

  • Human-in-the-Loop Evaluation
  • Real-Time Interaction Logging
  • Usability Benchmarking
  • Code Execution Testing
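
For the code-execution checks, one simple approach is to run the project's test command against assistant-generated changes and record the exit code alongside the evaluator's judgment. The sketch below uses Node's built-in child_process; the "npm test" command, timeout, and pass/fail criterion are assumptions for illustration, not the client's actual harness.

```javascript
const { spawnSync } = require("child_process");

// Minimal sketch: run a repository's test command and record whether it
// passed. Command, timeout, and return shape are illustrative defaults.
function runExecutionCheck(repoDir, command = "npm", args = ["test"]) {
  const result = spawnSync(command, args, {
    cwd: repoDir,
    encoding: "utf8",
    timeout: 120000,
  });
  return {
    passed: result.status === 0,
    exitCode: result.status,
    stderrSnippet: (result.stderr || "").slice(0, 500), // kept for evaluator review
  };
}
```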
