This n8n workflow automates the comparison of multiple large language models (LLMs) by testing, evaluating, and logging their responses to the same prompts. It is designed to help developers and AI practitioners determine the best-performing model for a specific use case. The workflow begins with a webhook trigger that fires when a chat message arrives. The message is then duplicated and sent to two different LLMs, such as OpenAI’s GPT series and Mistral models. Each model processes the same prompt independently, and each reply is stored in a session-specific memory buffer to preserve conversational context.
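For readers who want to see the mechanics outside of n8n, here is a minimal TypeScript sketch of the fan-out step, assuming two OpenAI-compatible chat-completion endpoints; the model names, the `askModel`/`askAndRemember` helpers, and the in-memory session store are illustrative stand-ins for what the workflow's HTTP and memory nodes do declaratively:

```typescript
// Minimal sketch of the fan-out step: the same prompt goes to two models in
// parallel, and each reply is appended to that model's session-scoped memory.
// Endpoints are assumed to be OpenAI-compatible; names here are illustrative.

type ChatMessage = { role: "user" | "assistant"; content: string };

// Per-model, per-session memory buffers (the workflow uses n8n memory nodes
// for this; a Map stands in for them here).
const memory = new Map<string, ChatMessage[]>();

async function askModel(
  baseUrl: string,
  apiKey: string,
  model: string,
  messages: ChatMessage[],
): Promise<string> {
  const res = await fetch(`${baseUrl}/v1/chat/completions`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({ model, messages }),
  });
  if (!res.ok) throw new Error(`HTTP ${res.status} from ${model}`);
  const data = await res.json();
  return data.choices[0].message.content as string;
}

async function askAndRemember(
  sessionId: string,
  baseUrl: string,
  apiKey: string,
  model: string,
  prompt: string,
): Promise<string> {
  // Each model keeps its own buffer, so multi-turn context reflects that
  // model's own earlier answers rather than the other model's.
  const key = `${sessionId}:${model}`;
  const history = memory.get(key) ?? [];
  const messages = [...history, { role: "user" as const, content: prompt }];
  const reply = await askModel(baseUrl, apiKey, model, messages);
  memory.set(key, [...messages, { role: "assistant", content: reply }]);
  return reply;
}

// Fan out: both models receive the identical prompt independently.
async function compareModels(sessionId: string, prompt: string) {
  const [openaiReply, mistralReply] = await Promise.all([
    askAndRemember(sessionId, "https://api.openai.com",
      process.env.OPENAI_API_KEY!, "gpt-4o-mini", prompt),
    askAndRemember(sessionId, "https://api.mistral.ai",
      process.env.MISTRAL_API_KEY!, "mistral-small-latest", prompt),
  ]);
  return { prompt, openaiReply, mistralReply };
}
```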
The workflow captures and formats each model’s reply, together with the user’s input and prior conversation history, and presents them side by side in a chat interface for easy manual comparison. In parallel, the responses, input, and context data are logged to a Google Sheet, so users can evaluate and compare model outputs across multiple interactions.
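As a rough sketch of what the formatting and logging steps produce (the record shape and column order are assumptions, not the template's exact schema):

```typescript
// Hypothetical shape of the comparison record that is both displayed and
// logged; the field and column names are illustrative.
interface ComparisonRow {
  timestamp: string;
  sessionId: string;
  userInput: string;
  modelA: string;
  responseA: string;
  modelB: string;
  responseB: string;
}

// Format both replies for side-by-side display in the chat interface.
function formatSideBySide(row: ComparisonRow): string {
  return [
    `**${row.modelA}**:\n${row.responseA}`,
    `**${row.modelB}**:\n${row.responseB}`,
  ].join("\n\n---\n\n");
}

// Values in the sheet's column order, ready for an "append row" call
// (in the workflow itself, n8n's Google Sheets node does this declaratively).
function toSheetValues(row: ComparisonRow): string[] {
  return [
    row.timestamp,
    row.sessionId,
    row.userInput,
    row.modelA,
    row.responseA,
    row.modelB,
    row.responseB,
  ];
}
```

Deriving both the display string and the sheet row from one record keeps the chat view and the log consistent across turns.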
Included are nodes for defining which models to compare, setting up session IDs for context management, and handling memory for each conversation. The workflow also aggregates and formats the responses for display, making it ideal for teams that want to evaluate multiple LLMs without extensive manual effort. This setup is particularly useful in AI development, research, model benchmarking, and the development of AI-powered applications, where choosing the best language model is crucial.
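A configuration step of this kind might look roughly like the following; the field names are illustrative rather than the node's actual parameters:

```typescript
// Sketch of the kind of configuration the "define models" step carries.
import { randomUUID } from "node:crypto";

interface ModelConfig {
  label: string;   // how the model is identified in the chat output
  baseUrl: string; // OpenAI-compatible endpoint (assumed)
  model: string;   // model name passed to the API
}

const modelsToCompare: ModelConfig[] = [
  { label: "OpenAI",  baseUrl: "https://api.openai.com", model: "gpt-4o-mini" },
  { label: "Mistral", baseUrl: "https://api.mistral.ai", model: "mistral-small-latest" },
];

// A fresh session ID starts a new conversation; reusing one keeps the
// per-model memory buffers (and thus multi-turn context) intact.
const sessionId = randomUUID();
```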