EvalLM

Interactive Evaluation of Large Language Model Prompts on User-Defined Criteria

Tae Soo Kim (KAIST), Yoonjoo Lee (KAIST), Jamin Shin (NAVER AI Lab), Young-Ho Kim (NAVER AI Lab), Juho Kim (KAIST)

EvalLM ⚗️ is an interactive system that helps prompt designers iterate on prompts by evaluating and comparing generated outputs on user-defined criteria. With the aid of an LLM-based evaluation assistant, the user can iteratively evolve both criteria and prompts to distinguish more specific qualities in outputs, and then improve the outputs on those qualities.

Animation of the overall workflow of EvalLM where users sample inputs from a dataset, generate outputs from each input using two different prompts, and then comparatively evaluate these outputs on user-defined criteria.

Interface

The main screen of the interface consists of three panels.

Main screen of EvalLM shows three panels. The generation panel shows text boxes for the prompt and task instruction, and buttons for input sampling. The evaluation panel shows text boxes for the criteria, buttons for evaluating, and stacked bar charts for the evaluation results.

Generation Panel: To generate outputs, the user defines their overall task instruction (A), two prompts they want to compare (B), and then samples inputs from a dataset (C) which will be used to test the prompts.

Evaluation Panel: To evaluate outputs, the user defines a set of evaluation criteria (D). Then, after evaluating, they can verify the overall evaluation performance of each prompt (E) or, if they created a validation set, validate how automatic evaluations align with ground-truth evaluations (F).

Data Panel: This panel shows data rows containing inputs, outputs, and evaluation results.
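The validation step (F) compares automatic evaluations against ground-truth evaluations. A minimal sketch of one way to quantify that alignment is shown here, using plain agreement rate. The function name, the "A"/"B"/"tie" labels, and the sample data are assumptions made for illustration; this page does not state which alignment measure EvalLM computes.

```python
def agreement_rate(automatic, ground_truth):
    """Fraction of validation items where the automatic winner matches the human label."""
    if len(automatic) != len(ground_truth):
        raise ValueError("both lists must have the same length")
    if not automatic:
        raise ValueError("at least one validation item is required")
    matches = sum(a == g for a, g in zip(automatic, ground_truth))
    return matches / len(automatic)


# Hypothetical labels for one criterion: which prompt's output was judged better.
automatic = ["A", "B", "A", "tie"]
ground_truth = ["A", "B", "B", "tie"]

print(agreement_rate(automatic, ground_truth))
```

Plain agreement is easy to read but ignores chance agreement, so a chance-corrected statistic may be preferable when the labels are imbalanced.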

Criteria

EvalLM allows users to evaluate outputs on their own criteria specific to their application and/or context.

To define a criterion, the user simply provides a name (A) and a description (B) in natural language.
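Because a criterion is just a name and a natural-language description, it maps onto a very small data structure. The sketch below is illustrative only: the `Criterion` class, the `format_criteria` helper, and the example criteria are hypothetical and are not taken from the EvalLM codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """A user-defined criterion: a short name plus a natural-language description."""
    name: str
    description: str


def format_criteria(criteria):
    """Render criteria as a numbered list that could be placed in an evaluation prompt."""
    return "\n".join(
        f"{i}. {c.name}: {c.description}" for i, c in enumerate(criteria, start=1)
    )


# Hypothetical example criteria.
criteria = [
    Criterion("Clarity", "The response is easy to follow and avoids jargon."),
    Criterion("Conciseness", "The response contains no unnecessary detail."),
]

print(format_criteria(criteria))
```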

To help users create more effective criteria, the system automatically reviews their criteria (C) and suggests (D) how they can be refined, merged, or split.

Criteria are represented as a set of text boxes that contain the name and description of the criteria. Suggested revisions are shown below the criteria.

Data Row

Data Rows in the interface display inputs, output pairs, and evaluation results. Clicking on evaluation results opens a panel that shows the explanation for that evaluation underneath the row.

For each sampled input (A), the interface presents the outputs generated from each prompt side-by-side (B) and the evaluation results for each criterion next to the outputs (C). For each criterion, the evaluation results show which prompt produced the output that better satisfied it.
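Per-criterion results like these can be summarized across data rows by counting how often each prompt won, which is the kind of data a stacked bar chart would display. The following is a minimal sketch under stated assumptions: the row format, the "A"/"B" prompt labels, and the "tie" label are hypothetical and do not reflect EvalLM's internal representation.

```python
from collections import Counter, defaultdict

# Hypothetical data rows: for each sampled input, the winner per criterion.
# "A" and "B" stand for the two prompts; "tie" is an assumed label for no clear winner.
rows = [
    {"Clarity": "A", "Conciseness": "B"},
    {"Clarity": "A", "Conciseness": "tie"},
    {"Clarity": "B", "Conciseness": "B"},
]


def tally_winners(rows):
    """Count, per criterion, how often each label was assigned as the winner."""
    counts = defaultdict(Counter)
    for row in rows:
        for criterion, winner in row.items():
            counts[criterion][winner] += 1
    # Convert to plain dicts so the result is easy to print or serialize.
    return {criterion: dict(c) for criterion, c in counts.items()}


summary = tally_winners(rows)
print(summary)
```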

If the user wants more detail, they can click on one of these evaluations to see the assistant’s explanation (D). To help the user match the explanation to the outputs, the system also highlights the spans of the outputs that were considered important when evaluating that criterion (E).

If the user chose to evaluate outputs over multiple trials, they can browse the evaluations from the other trials through the carousel (F).
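When an output pair is evaluated over several trials, the per-trial results may disagree. One simple way to summarize them is a majority vote, sketched below. This is an assumption made for illustration: this page does not describe how EvalLM combines trials, and the `majority_winner` function and its labels are hypothetical.

```python
from collections import Counter


def majority_winner(trial_winners):
    """Return the most frequent winner across trials, or "tie" if the top labels are tied."""
    if not trial_winners:
        raise ValueError("at least one trial is required")
    ranked = Counter(trial_winners).most_common()
    # If the two most frequent labels have equal counts, there is no majority.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie"
    return ranked[0][0]


print(majority_winner(["A", "B", "A"]))
print(majority_winner(["A", "B"]))
```

An odd number of trials reduces the chance of a split vote between the two prompts.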

Video Demo

See EvalLM in action in this Video Demo.

Bibtex

@inproceedings{10.1145/3613904.3642216,
author = {Kim, Tae Soo and Lee, Yoonjoo and Shin, Jamin and Kim, Young-Ho and Kim, Juho},
title = {EvalLM: Interactive Evaluation of Large Language Model Prompts on User-Defined Criteria},
year = {2024},
isbn = {9798400703300},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3613904.3642216},
doi = {10.1145/3613904.3642216},
booktitle = {Proceedings of the CHI Conference on Human Factors in Computing Systems},
articleno = {306},
numpages = {21},
keywords = {Evaluation, Human-AI Interaction, Large Language Models, Natural Language Generation},
location = {Honolulu, HI, USA},
series = {CHI '24}
}

This research was supported by the KAIST-NAVER Hypercreative AI Center.