NAIRR Workshop · Jetstream2 Demo

What This Demo Is Doing — and Why

We take one AI model and run it three different ways to see which is fastest, lightest, and best — all on free cloud computing from NAIRR.

The big idea

An "AI model" is just a file full of math — the brain. But that brain can't do anything on its own; you need a piece of software to run it. There are several popular programs for this, and they're not all equally fast or efficient.

So we ask a simple, practical question: if we use the exact same AI brain, does it matter which program we run it with? We test three of the most popular ones and measure the difference.

🏎️ Think of it like a car engine. The model is the engine. The three programs are three different cars you could drop that same engine into. Same engine — but one car might be quicker off the line, another sips less fuel, another is built to carry a whole team. We're test‑driving all three.

What we're doing & why it matters

🔬 What we're doing

Running the same model (called Qwen3) through three programs — Ollama, llama.cpp, and vLLM — asking each the identical set of questions, and timing everything.

🎯 Why it matters

If you want to use AI in a classroom or research project, you have to pick a tool. This shows you, with real numbers, which one fits your needs — and proves you can do it on free NAIRR computing, no expensive hardware required.

💡 Why such a small AI model? We're using a small model on purpose — it's the size that fits comfortably in the free, shared computing we have. It's not the "best" model; it's the right model for our resources. On a bigger allocation or with a graphics card you'd scale up. Matching the model to the computing you actually have is the real skill — and it's exactly what lets your whole class run at once on free national resources.

The sequence — what happens, step by step

Pick one model & write the questions

We choose a single AI model and prepare a fixed list of questions, so every program gets the exact same test. Fair comparison starts here.

Load the model into the first program

We start Ollama, give it the model, and ask it our questions — recording how long each answer takes.

Repeat with the next program

We do the very same thing with llama.cpp, then vLLM. (vLLM needs a graphics card — on a basic instance it's skipped automatically.)

Measure everything

For each answer we track: how fast it types (words per second), how quickly it starts replying, how much memory it uses, and how good the answer is.

Put it all in one comparison table

Finally we line the three programs up side by side so the winner — and the trade‑offs — are obvious at a glance.

The questions we ask (and what each one reveals)

Quick hello

"Hi, who are you?" — measures the bare minimum response time. How snappy is it?

Explain

"Why is the sky blue?" — a everyday explanation, easy to judge for quality.

Write code

"Write a program to find prime numbers." — can it produce correct, working code?

Long answer

"Write a 400-word essay on photosynthesis." — the best test of sustained speed.

Reasoning

A two-step discount math problem. — does it actually think, or just guess?

Follow format

"List the planets as structured data." — can it follow precise instructions?

Comprehension

Read a passage, then answer about it. — how well does it handle longer input?

All at once

Every question at the same time. — which program handles a "crowd" (like a full classroom) best?

🎓 The one thing to take away

By the end, you'll have seen — with real numbers — that there's a simple rule behind all of it:

The program you choose mostly changes the speed.
The model's size mostly changes the quality of the answers.

Knowing that, you can confidently pick the right setup for your own classroom or research — and you just did it all on free, shared national computing.