Palmyra LLMs achieve top benchmark scores

Today, we’re excited to announce that Palmyra, our family of open-source LLMs, has achieved top benchmark scores on Stanford’s Holistic Evaluation of Language Models (HELM).
Palmyra scores top marks on Stanford HELM
HELM is a benchmarking initiative from Stanford University’s Center for Research on Foundation Models (CRFM) that evaluates prominent language models across a wide range of scenarios. We’re thrilled to share that Palmyra has earned top scores on the HELM tests that measure a model’s ability to apply knowledge and accurately answer natural-language questions.
- Palmyra ranked first in several important tests, scoring 60.9% on Massive Multitask Language Understanding (MMLU), 89.6% on BoolQ, and 79.0% on NaturalQuestions.
- Palmyra ranked second in two additional key tests, with 49.7% on Question Answering in Context (QuAC) and 61.6% on TruthfulQA.
- Palmyra outperformed models from OpenAI, Cohere, Anthropic, and Microsoft, as well as prominent open-source models such as Falcon-40B and LLaMA-30B, on key tests.
MMLU evaluates a model’s world knowledge and problem-solving abilities across subjects such as abstract algebra, college-level chemistry, computer security, econometrics, and US foreign policy. BoolQ, NaturalQuestions, and QuAC evaluate a model’s ability to make inferences and answer questions that are open-ended, context-dependent, and phrased naturally rather than as structured prompts. TruthfulQA evaluates whether a model avoids reproducing common falsehoods picked up from imitating human text. Together, these scores highlight Palmyra’s ability to handle advanced tasks, making it uniquely capable of tackling a wide range of enterprise use cases.
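To make concrete what a benchmark like MMLU measures, here’s a minimal sketch of a multiple-choice accuracy evaluation. The sample question and the `model_answer` callable are hypothetical stand-ins, not HELM’s actual harness, which handles prompting, adaptation, and scoring in far more depth.

```python
# Minimal sketch of an MMLU-style multiple-choice evaluation.
# `model_answer` and the sample item below are hypothetical stand-ins.

from typing import Callable

# One MMLU-style item: a question, candidate answers, and the index
# of the correct choice.
QUESTIONS = [
    {
        "question": "Which protocol secures HTTP traffic with encryption?",
        "choices": ["FTP", "TLS", "SMTP", "ARP"],
        "answer": 1,
    },
]

def accuracy(model_answer: Callable[[str, list[str]], int]) -> float:
    """Fraction of questions where the model picks the correct choice."""
    correct = 0
    for item in QUESTIONS:
        prediction = model_answer(item["question"], item["choices"])
        if prediction == item["answer"]:
            correct += 1
    return correct / len(QUESTIONS)

if __name__ == "__main__":
    # Example: a trivial baseline that always picks the first option.
    print(f"Accuracy: {accuracy(lambda q, c: 0):.1%}")
```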
The secure enterprise-ready LLM
Compared to other foundation models, such as GPT-4, which is said to have 1.76 trillion parameters, Palmyra LLMs are relatively small: no model exceeds 43 billion parameters. Smaller models have several advantages over larger ones: they’re faster, less costly to maintain, and quicker to train and update. These benchmark scores further demonstrate that Palmyra is not just efficient in size but can also deliver superior results compared to much larger models.
Unlike other LLMs, Palmyra is built for the enterprise. Our models are trained on formal and business writing, are transparent and auditable rather than a black box, and are built so your data stays private and is never used in model training. Plus, we offer models fine-tuned for industries like healthcare and financial services, and companies can choose to self-host, as sketched below.
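Because Palmyra models are open source, self-hosting can be as simple as loading a checkpoint on your own infrastructure. This is a minimal sketch using the Hugging Face `transformers` library; the model ID `Writer/palmyra-base` is an assumption based on Writer’s public releases, so verify the exact checkpoint on the Hub before use.

```python
# Minimal sketch of self-hosting an open-source Palmyra model with
# Hugging Face `transformers`. The model ID is an assumption; check
# Writer's published checkpoints for the one you want.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Writer/palmyra-base"  # assumed model ID; verify on the HF Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Generation runs locally, so prompts never leave your infrastructure.
prompt = "Summarize the key risks in this quarterly report:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```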
Palmyra is the foundation of our AI platform, but we don’t stop there. Writer connects to your business data through Knowledge Graph to ensure your output is accurate, and our governance features automatically enforce your brand, compliance, and AI rules. We’re committed to using generative AI to transform the way you work and empower your entire organization to maximize creativity and 10x productivity.
Ready to experience LLMs built for enterprise needs? Give Writer a try.