Automated Evaluation of Language Model Responses in Google Sheets


This n8n workflow automates the testing and evaluation of language model outputs stored in Google Sheets. It begins by fetching test cases from a Google Sheet, then submits each input to an LLM (e.g., GPT-4 via OpenRouter) for response generation. A custom prompt checks whether the model's output meets specific criteria, such as factual accuracy, relevance, and completeness; the evaluation itself is performed by an external webhook that acts as the "judge". The results, including decisions and reasoning, are parsed and written back into the Google Sheet, creating an efficient loop for performance monitoring and model improvement. Practical applications include automated AI model testing, quality control of AI-generated content, and continuous evaluation in machine learning pipelines.
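As a rough sketch of the loop this workflow implements, the snippet below shows one test case flowing through generation, judging, and verdict parsing. This is an illustration, not the workflow itself: in n8n these steps are nodes, and the webhook URL, field names, and `Verdict` shape here are all assumptions.

```typescript
// Hypothetical sketch of the per-row evaluation loop; the real workflow
// performs these steps with n8n nodes (chainLlm, httpRequest,
// outputParserStructured), not standalone code.
type TestCase = { row: number; input: string };
type Verdict = { decision: "pass" | "fail"; reasoning: string };

// Placeholder for the generation step (the chainLlm node backed by an
// OpenRouter chat model); the actual model call is an assumption.
async function callModel(prompt: string): Promise<string> {
  return "model answer"; // stub
}

async function evaluateTestCase(tc: TestCase): Promise<Verdict> {
  // 1. Generate a candidate answer for this test case's input.
  const output = await callModel(tc.input);

  // 2. Send input and output to the external "judge" webhook, which
  //    grades against criteria such as accuracy, relevance, completeness.
  //    (URL is a placeholder, not the workflow's endpoint.)
  const res = await fetch("https://example.com/webhook/judge", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ input: tc.input, output }),
  });

  // 3. Parse the judge's structured verdict, which the workflow would
  //    then write back to the test case's row in the Google Sheet.
  return (await res.json()) as Verdict;
}
```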

Node Count

11 – 20 Nodes

Nodes Used

@n8n/n8n-nodes-langchain.chainLlm, @n8n/n8n-nodes-langchain.lmChatOpenRouter, @n8n/n8n-nodes-langchain.outputParserStructured, googleSheets, httpRequest, limit, manualTrigger, merge, set, stickyNote, webhook
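The `outputParserStructured` node in this list typically enforces a JSON schema on the judge's reply so that the decision and reasoning can be mapped to sheet columns. A minimal example of what such a schema might look like follows; the field names are assumptions, not taken from the workflow:

```typescript
// Hypothetical schema for the outputParserStructured node; the field
// names ("decision", "reasoning") are illustrative assumptions.
const judgeSchema = {
  type: "object",
  properties: {
    decision: { type: "string", enum: ["pass", "fail"] },
    reasoning: { type: "string" },
  },
  required: ["decision", "reasoning"],
} as const;
```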
