HumanEval
Evaluation · Advanced
Definition
A code generation benchmark of 164 hand-crafted Python programming problems, each paired with unit tests. Models are scored with the pass@k metric: the probability that at least one of k generated solutions to a problem passes all of its tests, averaged over the problems. It is a standard benchmark for comparing the coding capabilities of LLMs.
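As a rough illustration of how pass@k can be computed, the sketch below uses the unbiased estimator commonly associated with HumanEval: draw n samples per problem, count the c that pass all tests, and compute 1 - C(n-c, k) / C(n, k). The function names `pass_at_k` and `mean_pass_at_k` and the example counts are assumptions made for this sketch, not part of any official harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: samples generated, c: samples that passed all tests, k: budget.
    Returns the probability that a random size-k subset of the n samples
    contains at least one passing sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the size-k subsets made up only of failures;
    # it is 0 when fewer than k samples failed, which gives pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates over (n, c) pairs."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical (n, c) counts for three problems, for illustration only.
example_results = [(10, 3), (10, 0), (10, 10)]
score = mean_pass_at_k(example_results, k=1)
```

With k = 1 the estimate reduces to c / n per problem, so the score is simply the average fraction of passing samples. Larger k rewards a model that solves a problem at least occasionally.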
Tags
#code #benchmark #Python #programming #testing
Maxx Stacks Editorial
Reviewed by enterprise AI practitioners