LLM Evaluation Tools
Langfuse
Langfuse serves as an advanced open-source platform designed for collaborative debugging and analysis of LLM applications. It offers essential features such as tracing, prompt management, datasets, and evaluation, so teams can inspect, score, and iterate on their production LLM calls.
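As a rough illustration of the tracing workflow, the sketch below assumes the Langfuse Python SDK's @observe decorator and credentials set via environment variables; the import path varies across SDK versions, and the answer function is a made-up placeholder.

```python
# Minimal sketch: recording a function call as a Langfuse trace.
# Assumes LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY (and host) are set in the
# environment. Older SDKs import the decorator from langfuse.decorators instead.
from langfuse import observe

@observe()  # logs inputs, outputs, and timing of this call as a trace
def answer(question: str) -> str:
    # placeholder for a real model call; Langfuse also ships drop-in wrappers
    # (e.g. for the OpenAI SDK) that capture model, token, and cost details
    return f"Echo: {question}"

answer("What does Langfuse record?")
```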
Scale Evaluation
Scale Evaluation serves as an advanced platform for the assessment of large language models, addressing critical gaps in evaluation datasets by combining expert-built test sets with human review to benchmark and compare models.
Chatbot Arena
Chatbot Arena allows users to engage with various anonymous AI chatbots, including ChatGPT, Gemini, and Claude. Users can ask questions, compare the anonymized responses side by side, and vote for the better answer; those votes feed a public Elo-style leaderboard that ranks the models.
Arize Phoenix
Phoenix is an open-source observability tool that empowers AI engineers and data scientists to experiment, evaluate, and troubleshoot AI and LLM applications. It collects traces, runs evaluations over them, and can be launched locally in a notebook or self-hosted alongside production systems.
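For a sense of the local workflow, here is a minimal sketch assuming the arize-phoenix package; instrumenting a specific framework (for example via OpenInference instrumentors) is a separate, additional step.

```python
# Minimal sketch: start the Phoenix UI locally and open it in a browser.
# Assumes `pip install arize-phoenix`; traces and evals sent to this session
# then become inspectable in the web interface.
import phoenix as px

session = px.launch_app()  # starts the local Phoenix server and UI
print(session.url)         # visit this URL to explore traces and evaluations
```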
Opik
Opik empowers developers to seamlessly debug, evaluate, and monitor LLM applications and workflows. By enabling trace logging and performance scoring, it gives teams visibility into how prompts and pipelines behave from development through production.
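The sketch below shows the decorator-based logging pattern, assuming the opik Python package and that credentials or a local Opik server have already been configured; the summarize function is a stand-in for a real LLM call.

```python
# Minimal sketch: logging a traced function to Opik with the @track decorator.
# Assumes `pip install opik` and prior configuration (e.g. `opik configure`).
from opik import track

@track  # records inputs, outputs, and timing as a trace in Opik
def summarize(text: str) -> str:
    # placeholder for a real LLM call
    return text[:100]

summarize("Opik records this call so it can be scored and monitored later.")
```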
promptfoo
With over 70,000 developers utilizing it, promptfoo revolutionizes LLM testing through automated red teaming for generative AI. Its custom probes target application-specific failure modes rather than generic jailbreaks, and its declarative test suites run locally from the command line or in CI.
Galileo
Galileo's Evaluation Intelligence Platform empowers AI teams to effectively evaluate and monitor their generative AI applications at scale. With tools for experimentation, guardrail metrics, and production monitoring, it helps teams catch hallucinations and quality regressions before and after release.
Ragas
Ragas is an open-source framework that empowers developers to rigorously test and evaluate Large Language Model applications. It provides automatic, largely reference-free metrics for retrieval-augmented generation (RAG) pipelines, such as faithfulness, answer relevancy, and context precision.
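As a concrete example of how such metrics are computed, here is a minimal sketch using the older (pre-0.2) Ragas API, in which evaluate accepts a Hugging Face Dataset with question, answer, and contexts columns; newer releases reorganized this around EvaluationDataset and sample objects, so treat the exact names as version-dependent.

```python
# Minimal sketch of a Ragas evaluation run (pre-0.2 style API).
# The metrics use an LLM judge under the hood, so a configured judge model
# (e.g. OPENAI_API_KEY in the environment) is expected.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

data = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "answer": ["The capital of France is Paris."],
    "contexts": [["Paris is the capital and most populous city of France."]],
})

result = evaluate(data, metrics=[faithfulness, answer_relevancy])
print(result)  # per-metric scores for the evaluated samples
```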
DeepEval
DeepEval is an open-source framework designed for evaluating large language models (LLMs) in Python. It offers specialized unit testing akin to Pytest, with research-backed metrics such as G-Eval, answer relevancy, faithfulness, and hallucination detection.
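The Pytest-style workflow looks roughly like the sketch below; it assumes an LLM judge is configured (for example an OPENAI_API_KEY in the environment), and the example strings are illustrative only.

```python
# Minimal sketch of a DeepEval unit test, typically run with
# `deepeval test run test_llm.py` or plain pytest.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy():
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day, no-questions-asked refund.",
    )
    # fails the test if the judged relevancy score falls below the threshold
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```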
AgentBench
AgentBench is an evaluation framework tailored for assessing the performance of autonomous AI agents. It employs a standardized set of interactive environments, ranging from operating systems and databases to web browsing, to measure how well LLMs plan and act as agents.
Keywords AI
An innovative platform for AI startups, Keywords AI streamlines the monitoring and debugging of LLM workflows. With a unified API that routes requests to hundreds of models, it adds logging, tracing, and usage analytics without requiring changes to application code.
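The typical gateway pattern is to keep the standard OpenAI client and point it at the proxy. The sketch below illustrates this; the base URL and key shown are assumptions for illustration, so check the provider's documentation for the actual endpoint and model names.

```python
# Minimal sketch: sending OpenAI-style requests through an LLM gateway such as
# Keywords AI. The base_url below is an assumed placeholder, not a verified endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.keywordsai.co/api/",  # assumed gateway endpoint
    api_key="YOUR_KEYWORDSAI_API_KEY",          # gateway-issued key
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello from behind the gateway"}],
)
print(response.choices[0].message.content)
```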
ChainForge
ChainForge is an innovative open-source visual programming environment tailored for prompt engineering and evaluating large language models. It empowers users to compare prompts and models side by side, run the same queries across many responses, and inspect results in a node-based interface without writing code.
Symflower
Enhancing software development, Symflower integrates static, dynamic, and symbolic analyses with Large Language Models (LLMs) to deliver superior code quality and automatically generated test suites, and its DevQualityEval benchmark scores how well LLMs produce working code and tests.
Literal AI
Literal AI serves as a dynamic platform for engineering and product teams, streamlining the development of production-grade Large Language Model (LLM) applications with collaborative prompt management, logging, and evaluation of model outputs.
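A rough sketch of the logging setup follows; it assumes the literalai Python SDK exposes a client with an OpenAI instrumentation helper, and the exact class and method names may differ across SDK versions.

```python
# Rough sketch: logging OpenAI calls to Literal AI. Names are based on the
# literalai Python SDK and should be checked against the current docs.
from literalai import LiteralClient
from openai import OpenAI

literal_client = LiteralClient(api_key="YOUR_LITERAL_API_KEY")
literal_client.instrument_openai()  # patches the OpenAI SDK so calls are logged

openai_client = OpenAI()
openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "This call is logged for evaluation."}],
)
```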
Traceloop
Traceloop empowers developers to monitor Large Language Models (LLMs) by providing real-time alerts for quality changes and insights into how model outputs evolve over time. Its OpenLLMetry SDK builds on OpenTelemetry, so traces flow into existing observability backends.
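Instrumentation is typically a one-line init plus optional decorators, as in the sketch below; it assumes the traceloop-sdk package and an API key (or another OpenTelemetry-compatible exporter), and answer_question is a made-up placeholder.

```python
# Minimal sketch: instrumenting an app with Traceloop's OpenLLMetry SDK.
# Assumes `pip install traceloop-sdk` and TRACELOOP_API_KEY in the environment.
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import workflow

Traceloop.init(app_name="qa_service")

@workflow(name="answer_question")  # groups nested LLM calls into one workflow span
def answer_question(question: str) -> str:
    # placeholder for a real LLM call; supported SDKs are auto-instrumented
    return f"You asked: {question}"

answer_question("How do output quality trends show up over time?")
```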
TruLens
TruLens 1.0 is a powerful open-source Python library designed for developers to evaluate and enhance their Large Language Model (LLM) applications. It instruments apps to record traces and scores them with programmatic feedback functions such as groundedness, context relevance, and answer relevance.
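To show the feedback-function idea, here is a rough sketch using the pre-1.0 trulens_eval package names for illustration; the 1.0 release reorganized these modules (for example into trulens.core and related namespaces), so consult the current docs for exact imports.

```python
# Rough sketch: defining a feedback function and a local TruLens session.
# Uses pre-1.0 `trulens_eval` names; module paths changed in TruLens 1.0.
from trulens_eval import Feedback, Tru
from trulens_eval.feedback.provider import OpenAI

provider = OpenAI()  # LLM judge used to score app inputs and outputs
f_answer_relevance = Feedback(provider.relevance).on_input_output()

tru = Tru()  # local session that stores records and feedback results
# Wrapping an app (e.g. with TruChain for LangChain) records each call and
# attaches feedback scores; tru.run_dashboard() then visualizes the results.
```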