Engineering · AI Ops

AI Agent for Comparing LLM responses side-by-side with Google Sheets

Capture a user prompt, duplicate it to two LLMs, log both responses to Google Sheets, and display side-by-side outputs for evaluation.

How it works
Step 1: Capture and Route Prompt. Capture the user input, duplicate it, and route it to two LLMs in parallel.
Step 2: Run in Parallel. Each model processes the prompt with its own memory context.
Step 3: Log and Compare. Log prompts and outputs to Google Sheets and present side-by-side results in chat.

Overview

End-to-end paired-model evaluation and auditing.

This AI agent captures a user prompt, duplicates it to two LLMs, and runs the prompts in parallel. It logs prompts, model outputs, and context to Google Sheets for traceability. It presents side-by-side results in chat and supports manual or automated scoring to choose the best model for production.
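As a rough illustration of the fan-out step, the sketch below duplicates one prompt across two models and awaits both responses concurrently. It assumes the openai Python SDK with an OpenAI-compatible endpoint; the model names and helper functions are placeholders, not the template's actual configuration.

```python
# Minimal sketch of the parallel fan-out (illustrative, not the template's exact flow).
# Assumes the openai Python SDK (>=1.x); model names are placeholders.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def ask(model: str, prompt: str) -> str:
    """Send the same prompt to one model and return its text response."""
    resp = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

async def compare(prompt: str) -> dict:
    """Duplicate the prompt and run both models concurrently."""
    model_a, model_b = "gpt-4o-mini", "gpt-4o"  # placeholder model pair
    out_a, out_b = await asyncio.gather(ask(model_a, prompt), ask(model_b, prompt))
    return {model_a: out_a, model_b: out_b}

if __name__ == "__main__":
    results = asyncio.run(compare("Give a concise executive summary of the article: ..."))
    for model, text in results.items():
        print(f"--- {model} ---\n{text}\n")
```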


Capabilities

What AI Agent for Comparing LLM responses side-by-side with Google Sheets does

Executes a paired-model evaluation workflow and records results for auditing.

01

Capture the user prompt and log the context.

02

Duplicate the prompt and send it to two LLMs in parallel.

03

Process each model's memory context independently.

04

Log prompts, inputs, and outputs to Google Sheets (see the record sketch after this list).

05

Display side-by-side outputs in chat for direct comparison.

06

Enable manual or automated scoring of results.
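To make the logged fields concrete, here is one possible shape for a single comparison record, as referenced in capability 04. The field names are illustrative assumptions, not the template's actual sheet schema.

```python
# Illustrative shape of one logged comparison record; field names are assumptions,
# not the template's actual Google Sheets schema.
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional

@dataclass
class ComparisonRecord:
    timestamp: str                   # when the prompt was captured
    prompt: str                      # the original user prompt
    context: str                     # preceding interaction context
    model_a: str                     # identifier of the first model
    output_a: str                    # first model's response
    model_b: str                     # identifier of the second model
    output_b: str                    # second model's response
    score_a: Optional[float] = None  # manual or automated score, if any
    score_b: Optional[float] = None
    notes: str = ""                  # evaluator comments

record = ComparisonRecord(
    timestamp=datetime.now(timezone.utc).isoformat(),
    prompt="Summarize the attached article in three sentences.",
    context="(no prior turns)",
    model_a="model-a", output_a="Summary A ...",
    model_b="model-b", output_b="Summary B ...",
)
row = list(asdict(record).values())  # one flat row, ready to append to a sheet
```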

Why you should use AI Agent for Comparing LLM responses side-by-side with Google Sheets

Before adopting this AI agent, teams face scattered results and manual comparisons with no auditable log. After adopting it, you gain centralized logs, repeatable comparisons, and clear go/no-go decisions.

Before
Results are scattered across documents and chats, making it hard to compare models.
Manual comparison is slow and inconsistent across evaluators.
Memory contexts differ between models, leading to unreliable judgments.
There is no auditable log to justify production choices.
Without a structured workflow, duplicated prompts generate unnecessary token usage.
After
All prompts, inputs, and outputs are stored in a single Google Sheet.
Side-by-side results are instantly visible in chat.
Evaluation can be standardized with manual scores or automated scoring.
You identify the model that consistently meets criteria for your use case.
The workflow provides a clear record for governance and deployment decisions.
Process

How it works

A simple three-step process that is easy for non-technical users.

Step 01

Capture and Route Prompt

Capture the user input, duplicate it, and route it to two LLMs in parallel.

Step 02

Run in Parallel

Each model processes the prompt with its own memory context.

Step 03

Log and Compare

Log prompts and outputs to Google Sheets and present side-by-side results in chat.
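A minimal sketch of the "Log and Compare" step, assuming the gspread library and a service-account credential; the template itself may use a different Google Sheets connector, and the sheet name, credential file, and row layout here are placeholders.

```python
# Sketch of appending one comparison row to a spreadsheet with gspread
# (an assumption; the template may use another Google Sheets connector).
import gspread

def log_comparison(row: list, sheet_name: str = "LLM Comparisons") -> None:
    """Append one comparison row to the first worksheet of the target spreadsheet."""
    gc = gspread.service_account(filename="service_account.json")  # placeholder credential file
    worksheet = gc.open(sheet_name).sheet1
    worksheet.append_row(row, value_input_option="RAW")

log_comparison([
    "2024-05-01T12:00:00Z",       # timestamp
    "Summarize the article ...",  # prompt
    "model-a", "Summary A ...",   # first model and its output
    "model-b", "Summary B ...",   # second model and its output
])
```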


Example

Example workflow

A realistic paired-model evaluation scenario.

Scenario: A product manager asks for a concise executive summary of a 1,000-word article. The AI agent sends the prompt to two models in parallel, each using its own memory. After both return, the prompts, outputs, and context are logged to Google Sheets. The chat presents the two summaries side by side for immediate visual comparison, and an evaluation score can be added in Sheets or by a separate model.
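For a scenario like this, the side-by-side chat message could be assembled roughly as below; the formatting is illustrative and the real template's chat presentation may differ.

```python
# Illustrative formatting of the side-by-side chat reply as a two-column table.
def render_side_by_side(prompt: str, outputs: dict) -> str:
    """Build a two-column Markdown table comparing both model outputs."""
    (name_a, out_a), (name_b, out_b) = outputs.items()

    def one_line(s: str) -> str:
        return " ".join(s.split())  # keep each table cell on a single line

    return "\n".join([
        f"Prompt: {prompt}",
        "",
        f"| {name_a} | {name_b} |",
        "| --- | --- |",
        f"| {one_line(out_a)} | {one_line(out_b)} |",
    ])

print(render_side_by_side(
    "Give a concise executive summary of the article.",
    {"model-a": "Summary A ...", "model-b": "Summary B ..."},
))
```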

Engineering · Google Sheets · OpenAI API · Vertex AI · OpenRouter · AI Agent flow

Audience

Who can benefit

Roles that gain from structured model evaluation workflows.

✍️ Product managers

to quickly compare responses and decide which model to productionize

💼 Research engineers

to run controlled prompt experiments with parallel models

🧠 QA engineers

to validate outputs against criteria before release

📊 Data scientists

to compare generation quality across prompts and models

🎯 AI governance leads

to maintain auditable evidence for model decisions

📋 Content teams

to benchmark output quality for published material

Integrations

Tools connected to the AI agent to enable end-to-end evaluation.

Google Sheets

Logs prompts, model outputs, context, and evaluation notes for audit and comparison.

OpenAI API

Sends prompts to two OpenAI models in parallel and collects their responses for comparison.

Vertex AI

Provides an alternative provider pathway to run parallel evaluations and capture results.

OpenRouter

Offers an additional provider option to compare model outputs in parallel.
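Because OpenRouter (and several other providers) exposes an OpenAI-compatible endpoint, swapping one side of the comparison to a different provider can be as small as changing the client's base URL. In the sketch below, the model slug and environment variable name are placeholders.

```python
# Sketch of routing one side of the comparison through OpenRouter via its
# OpenAI-compatible API; model slug and env var name are illustrative.
import os
from openai import OpenAI

openrouter = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

resp = openrouter.chat.completions.create(
    model="meta-llama/llama-3.1-70b-instruct",  # placeholder model slug
    messages=[{"role": "user", "content": "Summarize the article in three sentences."}],
)
print(resp.choices[0].message.content)
```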

Applications

Best use cases

Practical scenarios where side-by-side LLM evaluation adds value.

Model evaluation for production readiness
Prompt engineering experiments
Summarization model benchmarking
Code generation model testing
Content generation quality comparison
Translation model benchmarking

FAQ

FAQ

Common practical questions about using this AI agent.

What does the Google Sheets log contain?
The log includes the user prompt, both model outputs, model identifiers, timestamps, and optional evaluation scores. It may also include the interaction context that led to the outputs. Logs are stored in a single sheet to enable side-by-side review and auditing. You can control which fields are captured and how long they are retained. If needed, you can apply data governance rules to limit access.

Can I compare more than two models?
The current template is designed for two models. To extend it, modify the AI agent flow to query additional models in parallel and store their outputs in the same or separate sheets. This increases complexity and token usage, so plan accordingly. You can replicate the parallel processing pattern for each extra model and adjust the evaluation scheme. Consider governance and cost when expanding.

How is each model's memory context handled?
Each model uses its own memory context when processing prompts. Outputs are independent and logged with the associated context to avoid cross-contamination. The chat displays both responses alongside the original prompt and the preceding history for clarity. This setup supports fair comparison even when models handle internal context differently.
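A minimal sketch of how isolated per-model memory can be kept, assuming a simple in-process store; the template's actual memory nodes may work differently.

```python
# Each model gets its own message history, so one model's replies never leak
# into the other's context.
from collections import defaultdict

histories: dict = defaultdict(list)

def build_messages(model: str, prompt: str) -> list:
    """Append the new user turn to this model's own history and return a copy."""
    histories[model].append({"role": "user", "content": prompt})
    return list(histories[model])

def record_reply(model: str, reply: str) -> None:
    """Store the model's reply only in its own history."""
    histories[model].append({"role": "assistant", "content": reply})
```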

Can scoring be automated?
Yes. You can apply an automated evaluator model or a scoring rubric within Google Sheets, or run an external AI agent to score responses against defined criteria. Automation reduces manual workload but requires a carefully designed rubric to ensure consistency. You can mix automated scores with manual reviews for balance.
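As an illustration of automated scoring, the sketch below asks an evaluator model to grade both answers against a rubric and return JSON; the rubric wording, evaluator model, and response shape are assumptions rather than the template's built-in scorer.

```python
# Sketch of "LLM as judge" scoring; rubric, evaluator model, and JSON shape are assumptions.
import json
from openai import OpenAI

client = OpenAI()
RUBRIC = "Score each answer from 1 to 5 for accuracy, brevity, and clarity. Reply as JSON."

def score_pair(prompt: str, answer_a: str, answer_b: str) -> dict:
    """Ask an evaluator model to grade both answers against the rubric."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder evaluator model
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": f"{RUBRIC}\n\nPrompt: {prompt}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}",
        }],
    )
    return json.loads(resp.choices[0].message.content)
```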

How is data privacy handled?
The AI agent operates within your configured cloud environment and logs data to Sheets in your workspace. Ensure proper access controls, data policies, and encryption as applicable. Consider anonymizing inputs or using private endpoints for sensitive prompts. Review governance rules before enabling external model providers.

Can I customize the AI agent?
Yes. You can tailor the system prompt and the tools used by the AI agent to suit your use case, and adjust prompts, evaluation criteria, and the sheet schema. Changes apply to the entire evaluation workflow and help align results with your standards. Test changes in a controlled environment before production use.

What does this cost in tokens?
Expect higher token usage because prompts are processed by two models in parallel. The exact cost depends on the model types and prompt lengths. Use token caps and monitor the sheet-based logs to track spend. Plan a token budget when running large-scale evaluations.
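For a rough pre-flight estimate of input-token usage, something like the sketch below can help; tiktoken's encodings only approximate non-OpenAI models, so treat the numbers as indicative.

```python
# Rough input-token estimate for sending the same prompt to n models in parallel.
import tiktoken

def estimate_input_tokens(prompt: str, n_models: int = 2) -> int:
    """Approximate input-token count for a multi-model run (outputs add more on top)."""
    enc = tiktoken.get_encoding("cl100k_base")  # approximation; actual encoding varies by model
    return len(enc.encode(prompt)) * n_models

print(estimate_input_tokens("Give a concise executive summary of this 1,000-word article: ..."))
```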


AI Agent for Comparing LLM responses side-by-side with Google Sheets

Capture a user prompt, duplicate it to two LLMs, log both responses to Google Sheets, and display side-by-side outputs for evaluation.

Use this template → Read the docs