Research
– 8 min read
Palmyra-mini: Small models, big throughput, powerful reasoning

Introducing a family of small (1.5B-1.7B-parameter), open-source models tuned for extreme throughput and practical reasoning—built to run privately, cheaply, and nearly anywhere.
At WRITER, our mission has always been to bring the transformative power of AI to the enterprise. As our product suite and platform have evolved, we have always maintained a focus on deep research and continued to develop our Palmyra family of large language models, which have consistently ranked among the best in the world.
This week we released Palmyra-mini, a new family of smaller, open, and remarkably powerful models. These models are engineered to deliver state-of-the-art performance with a fraction of the computational footprint, making it possible to run highly capable AI in environments where it was previously impractical or impossible.
Sometimes an engineering milestone is less about brute scale and more about efficiency per parameter. Palmyra-mini is exactly that: a small, 1.5B-parameter model that delivers meaningful reasoning, strong benchmark results, and eye-popping throughput on a single GPU. It’s the kind of accomplishment that makes you re-evaluate where—and how—intelligence belongs in a modern AI stack.
Palmyra-mini demonstrates that a well-tuned 1.5B model can be small enough to run anywhere, but smart enough to matter. The way it combines high-throughput, low-latency inference with competitive reasoning performance opens a new design space for private, on-device, and cost-sensitive AI workflows. We’ve also made it an open model so that others in the community can build with us and drive innovation faster.
Pushing the boundaries of performance
There are three versions of the Palmyra-mini model, each optimized for different use cases.
- Palmyra-mini 1.7B: A lightweight, non-thinking base model that serves as the foundation for more specialized variants.
- Palmyra-mini-thinking-a 1.7B: Optimized for complex reasoning and logic, this variant is perfect for applications that require nuanced decision-making.
- Palmyra-mini-thinking-b 1.5B: Excelling at mathematical equations and reasoning, this model is ideal for tasks that demand precision and accuracy.
Let’s run through some numbers:
Palmyra-mini-thinking-b is an open-source 1.5B-parameter model designed for high-throughput, low-cost inference on real workloads. It’s tuned for ~10,382 tokens/sec per billion parameters, and in practice it delivers.
On a single H200 GPU at 512-way concurrency, the model achieves:
- 8.39 inferences/sec
- ~15.6K tokens/sec generation throughput
- ~29 ms inter-token latency
- ~1.3 s time-to-first-token (TTFT)
- ~60 s end-to-end on 2K-in / 1.8K-out sequences
Even with 8K input tokens, Palmyra-mini-thinking-b still comes in at 5.45 inferences/sec and ~9.8K tokens/sec—a notable retention of throughput as sequence length grows.
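For teams that want to sanity-check these numbers on their own hardware, here is a minimal sketch of a single-GPU throughput measurement with vLLM. It is not the harness behind the figures above; the Hugging Face model ID, prompt construction, and sampling settings are assumptions for illustration.

```python
# Minimal sketch of a single-GPU throughput check with vLLM (not the exact
# benchmark harness used for the numbers above). The model ID and prompt
# lengths are assumptions for illustration.
import time
from vllm import LLM, SamplingParams

MODEL_ID = "Writer/palmyra-mini-thinking-b"  # assumed Hugging Face repo name
CONCURRENCY = 512                            # matches the 512-way setting above

llm = LLM(model=MODEL_ID, dtype="bfloat16")
params = SamplingParams(temperature=0.6, max_tokens=1800)  # ~1.8K-out

# 512 prompts of roughly 2K tokens each stand in for the 2K-in workload.
prompts = ["Summarize the following document:\n" + "lorem ipsum " * 700] * CONCURRENCY

start = time.time()
outputs = llm.generate(prompts, params)      # vLLM batches these internally
elapsed = time.time() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{len(outputs) / elapsed:.2f} inferences/sec, "
      f"{generated / elapsed:,.0f} generated tokens/sec")
```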
On widely followed reasoning and knowledge benchmarks (Pass@1, avg-of-64), Palmyra-mini-thinking-b posts strong scores for its size:
- AIME24 — 59.42
- AIME25 — 49.68
- GPQA — 42.00
- HMMT25 — 27.86
- HLE — 5.22
- MMLU-PRO — 55.49
- MATH500 — 93.80
- LCB — 34.50
These results are meaningful because they show the model isn’t only fast—it’s usefully accurate for a swath of reasoning-centric tasks that end users care about. Its quality is comparable to, and occasionally surpasses, that of some 8B and 32B models; on a few benchmarks, and in rare cases, it even reaches the 70B mark. While you should always evaluate on your own tasks, it’s notable when a 1.5B model punches above its weight while maintaining outstanding latency characteristics.
For workloads where speed, concurrency, and cost matter more than maximal depth – such as immediate content checks, scoring, routing, or pre-draft generation at massive scale – Palmyra-mini changes the calculus.
Open source + privacy: run it where your data lives
Breakthroughs like this shouldn’t happen in a silo, which is why we are making the Palmyra-mini family available as completely open models. The family joins Palmyra Fin and Palmyra Med, which are fine-tuned for specific industry use cases and are also open.
Our goal is to give back to the research community and contribute to the progress of our entire industry. Because of Palmyra-mini’s size, developers and researchers can fine-tune, study, and innovate without needing access to massive compute clusters.
The Palmyra-mini family provides engineering teams with models that offer:
- Inspectable model configuration and reproducible evaluations. You can understand how it’s built and tune it for your latency, memory, and accuracy budgets.
- Privacy when you need it. Because it’s small, Palmyra-mini is practical to run privately—on your own infra or devices—for use cases where data residency, regulatory posture, or vendor risk make cloud inference tricky.
The team is already seeing interest in on-device deployments. These are early signals, but the direction is clear: privacy-preserving, local intelligence is within reach for more scenarios when a 1.5B-parameter model delivers this much capability.
Open, small, and fast models give you more placement options in a system. You can move intelligence closer to data – think devices, VPCs, branch offices – or cut out network hops and tailor security boundaries without giving up too much quality. That flexibility is hard to get using only large, very centralized models.
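To make the private-deployment point concrete, here is a minimal sketch of fully local inference with Hugging Face transformers. The repository ID is an assumption; check the Hugging Face page linked in the Resources section for the exact model names.

```python
# A minimal sketch of fully local inference with Hugging Face transformers,
# assuming the checkpoint is published as "Writer/palmyra-mini" (verify the
# exact repo name on the Hugging Face page linked in Resources).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Writer/palmyra-mini"  # assumed model ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Classify this ticket: 'My invoice total looks wrong.'"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    out = model.generate(inputs, max_new_tokens=128)

# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```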
What this unlocks for products and customers
Palmyra-mini doesn’t seek to replace larger models like Palmyra X5, our frontier foundation model; it complements them. Think of it as an everywhere-capable layer that can handle the speed-sensitive and cost-sensitive parts of a workflow, and then hand off to a bigger model when depth is needed.
Here are some possible use cases it enables:
- Airplane-mode assistants. For scenarios where there is no internet and you want to get things done, Palmyra-mini can provide instant, offline assistance. That makes it ideal for travel, field work, or secure facilities where outbound connections are restricted. An on-device model means zero cloud latency, stronger privacy, and lower per-interaction cost.
- High-throughput back-end copilots. In the data plane, speed and concurrency often matter as much as raw capability. Palmyra-mini is well-suited for real-time content moderation, quick fact-checking or misinformation flagging, scoring and routing tickets or customer support calls, and similar tasks where milliseconds and dollars per call add up quickly. Its throughput and TTFT profile let you fan out requests and keep pipelines flowing, with inference 10-50x faster than calling a much larger model and response times under 100 milliseconds.
- Cost/latency-aware orchestration. With Palmyra-mini in your fleet, you can route based on SLA and budget. You can use the small model for fast paths (classification, extraction, templated drafting, guardrails), and call a larger model only for edge cases. The result is lower average latency and lower average cost without materially impacting quality on the median task.
High-throughput back-end copilots built on models in the 1.5B-parameter range represent a strategic trade-off in AI system design, where latency and throughput trump raw capability. Many back-end tasks don’t need the full reasoning power of larger models—they need fast, consistent, “good enough” responses that can handle massive concurrent load while maintaining sub-second latency.
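To make the orchestration pattern concrete, here is a minimal routing sketch in Python. It assumes both models are served behind OpenAI-compatible endpoints; the URLs, model names, and the task-based fast-path heuristic are illustrative assumptions rather than a prescribed setup.

```python
# A minimal sketch of cost/latency-aware routing, assuming both models are
# exposed behind OpenAI-compatible endpoints (endpoint URLs, model names, and
# the complexity heuristic are illustrative, not a prescribed setup).
from openai import OpenAI

small = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")   # Palmyra-mini, self-hosted
large = OpenAI(base_url="https://api.example.com/v1", api_key="...")    # larger Palmyra model

FAST_PATH_TASKS = {"classify", "extract", "guardrail", "route"}

def answer(task: str, prompt: str) -> str:
    """Send fast-path tasks to the small model; escalate everything else."""
    client, model = (
        (small, "palmyra-mini") if task in FAST_PATH_TASKS else (large, "palmyra-x5")
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    return resp.choices[0].message.content

# Fast path: ticket routing stays on the small, cheap model.
print(answer("route", "Customer says their export job failed twice overnight."))
```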
What comes next?
Building on our success with Palmyra-mini, we hope to continue this research and push our other models further, optimizing for quality-per-parameter and exploring how much token usage and latency can be saved without compromising outcomes.
Today’s small models don’t yet match the performance of very large models with more than 100 billion parameters, but research on closing that gap is valuable because it drives efficiency. It could lead us toward a sweet spot where a medium-sized model of roughly 30-50 billion parameters performs as well as those much larger 100B+ models. Achieving that balance between size and performance would be an excellent foundation to build on, providing the same capabilities with significantly less compute.
Improving performance while shrinking model size is good for customer experience, with snappier apps. It’s good for infrastructure, with denser scheduling and lower spend. It’s more power efficient, which cuts cost and limits environmental impact. And it’s good for deployment flexibility, with more options for where a model can credibly run.
Try it out for yourself
At 1.5B parameters, Palmyra-mini delivers practical reasoning, competitive Pass@1, and single-GPU throughput that makes new system designs feasible: private by default, edge-deployable, and cheap to serve—without giving up the behaviors that matter to engineers.
Models like Palmyra-mini let you balance speed and usefulness on your own terms—placing intelligence where it makes the most sense, preserving privacy when you must, and reserving your largest models for the work that truly demands them.
RESOURCES
Read more about these models in our Hugging Face article.
Dive into the code and start building with Palmyra-mini here.
Inference on iOS: Palmyra-mini is the first WRITER LLM to run 100% locally on an iPhone. A working implementation using llama.cpp on iOS 18 and the full open-source code can be found in this repo.
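For desktop or server environments, the same fully offline setup can be sketched with llama-cpp-python, assuming a quantized GGUF export of Palmyra-mini is available locally (the filename below is an assumption).

```python
# A minimal sketch of fully offline inference with llama-cpp-python, assuming
# a GGUF export of Palmyra-mini exists locally (the filename is an assumption;
# the iOS repo above uses llama.cpp directly in Swift).
from llama_cpp import Llama

llm = Llama(model_path="palmyra-mini-q4_k_m.gguf", n_ctx=4096)  # assumed local quantized file

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Draft a two-sentence status update for my team."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```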
Notes on evaluation methodology:
Pass@1 (avg-of-1): computed using lm_eval and lighteval.
Pass@1 (avg-of-64) and Majority@64: computed using nemoskills.
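For readers who want the metrics spelled out, here is a minimal sketch of how Pass@1 (avg-of-k) and Majority@k can be computed from k sampled answers per problem; the harnesses above handle answer extraction and scoring in more detail.

```python
# A minimal sketch of Pass@1 (avg-of-64) and Majority@64 from 64 sampled
# completions per problem; the actual harnesses (lm_eval, lighteval,
# nemoskills) handle answer extraction and matching with more care.
from collections import Counter

def pass_at_1_avg_of_k(answers: list[str], reference: str) -> float:
    """Fraction of the k samples that are correct (averaged Pass@1)."""
    return sum(a == reference for a in answers) / len(answers)

def majority_at_k(answers: list[str], reference: str) -> bool:
    """Whether the most frequent answer among the k samples is correct."""
    most_common, _ = Counter(answers).most_common(1)[0]
    return most_common == reference

samples = ["42"] * 40 + ["41"] * 24          # 64 sampled answers for one problem
print(pass_at_1_avg_of_k(samples, "42"))     # 0.625
print(majority_at_k(samples, "42"))          # True
```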
If you’re interested in AI research, check out our open positions.