
A few months ago I purchased a Microsoft Surface Pro tablet that ships with a Snapdragon X Elite processor having a generous amount of RAM. While I’m not a regular Windows user, the device was acquired specifically to explore the capabilities of its integrated neural processing unit (NPU), which has been heavily promoted in recent tech coverage (aka YouTube vlogs).
To test out NPU capabilities, I downloaded Foundry Local from Microsoft and tried a few models supporting it. However, this is not the story of Foundry Local or of NPU. That can come later. After I tinkered with Deepseek and Qwen2.5 for sometime, I downloaded ollama to test out other models. Foundry Local does not give you too many options. Unfortunately, all of the models download for ollama run on CPU, but since I was using the smaller models, performance was not too shabby.
In this blog we talk about image processing capabilities for two such models. One of the models is Gemma3:4b and the other one is qwen3-vl:8b. I was doing this compare more for fun, but since the results were interesting, I will just document it here. While I do understand that Qwen-VL by Alibaba has stronger multimodal capabilities in visual inferences, Gemma3 by Google is also multimodal and supports similar capabilities.
Let’s start with a brief description for the two models. I will let a third model describe these two just for fun. Here is what IBM Granite4 has to say about the two models.
Compare the two models – according to Graphite4
<start granite description>Qwen3-VL and Gemma3 are two noteworthy language AI models that have been developed by leading tech companies in the field of artificial intelligence, each bringing unique capabilities to the table.
Qwen3-VL (Visual Language) is a multimodal model capable of understanding and processing not just text but also visual data like images or videos. It stands out due to its ability to comprehend context within multimedia content which allows it to perform tasks such as image captioning, object detection in photos, or interpreting the narrative within a video clip. This broad spectrum of functionality makes Qwen3-VL highly versatile and well-suited for applications that require cross-modal understanding.
On the other hand, Gemma3 is an advanced language model that specializes primarily in text-based interactions. It has been trained extensively on vast amounts of textual data enabling it to excel at tasks like translation, summarization, sentiment analysis etc., thereby making it highly proficient in natural language processing tasks.
While both models exhibit strong performance capabilities, their strengths lie in different domains – Qwen3-VL for handling multimodal data effectively and Gemma3 for superior text-based interactions. Direct comparisons may not be entirely fair since they serve distinct purposes; however, the choice between them would largely depend on specific project needs or task requirements.
In summary, while both models are impressive feats of AI technology, their unique strengths – Qwen3-VL’s ability to process multiple types of media and Gemma3’s proficiency in textual data manipulation – make them suitable for different scenarios. The decision between the two should therefore be guided by the nature of your specific use case or task at hand. </end granite description>
One thing to note is that Qwen-VL is a reasoning model, so it will always show it’s reasoning steps. Unfortunately, sometimes it takes so long to think, that I start thinking I should go take a fifteen minute power nap and come back. On the other hand, Gemma3 responded more or less instantly.
Even though we did say we will verify inferences using images, let’s warm up with some text prompts.
Logic Puzzle I
Prompt: Given 5 numbers as 123456, 341256, 561234, 125634 and 345612, find the 6th number that can fit in the same logic. Explain the logic.
Qwen3
This model was able to arrive at the correct answer(563412). It also gave clear steps how it was able to arrive at the solution.
Gemma3
This model also arrived at the correct answer, but the logic it printed did not make too much of a sense. It printed the correct value and the logic was numbers rotated clockwise by 1 digit. However, that is not exactly true.
Logic Puzzle II
Prompt: You have 9 balls, all identical except one that’s heavier. You have a balance scale and can use it twice. How do you find the odd ball?
Qwen3
Qwen found the solution. But of course being a reasoning model, it thought for quite some time before finally arriving at the solution. One statement from it’s thought process was funny.
So why do I feel like sometimes these problems have more nuances? Maybe because sometimes people assume you don’t know if it’s heavier or lighter, but in this problem, it’s specified that one is
heavier, so we don’t have to consider light balls.
Gemma3
Well, Gemma knew the puzzle and solved it. It printed a classic solution to the puzzle without further ado.
With that little warm up done, let’s go to some image test.
Solve a Sudoku
I probably started with the hardest one of the lot, but I thought this would be a good test. I grabbed a random Sudoku image from the internet. Just a note, I do not own the copyright of any of the images that I have embedded in this page. They are only used for the purpose of testing the models and no copyright infringement is intended or desired. With that legal stuff out of the way, let’s start back on the blog.
Given below is a very simple puzzle that I had both the models solve.

Prompt: Solve the Sudoku puzzle given.
Qwen3:
This model kept on struggling with the puzzle. It was able to solve a lot of the blocks, but struggled to finish. On looking at the thought process, I noticed that Qwen kept on getting the 7th row wrong when it was processing the image. See below how it kept on messing up. I am sure if that minor glitch was solved, it would have got the puzzle correctly. Notice Row 7 below and compare with the original image. This was a consistent error from the model.
Row 1: 5, _, _, 4, 6, 7, 3, _, 9
Row 2: 9, _, 3, 8, 1, _, 4, 2, 7
Row 3: 1, 7, 4, 2, _, 3, _, _, _
Row 4: 2, 3, 1, 9, 7, 6, 8, 5, 4
Row 5: 8, 5, 7, 1, 2, 4, _, 9, _
Row 6: 4, 9, 6, 3, _, 8, 1, 7, 2
Row 7: _, _, _, 8, 9, 2, 6, _, _
Row 8: 7, 8, 2, 6, 4, 1, _, _, 5
Row 9: _, 1, _, _, _, _, 7, _, 8
So, final inference, Qwen was not able to solve it.
Gemma3
Gemma does not show the reasoning, so it came back with the solution relatively fast. The solved puzzled looked nothing like the one that I gave it. I insisted again, and same result. I got back a response, but the puzzle was not the same as that given. Finally Gemma gave up with the following statement.
You are absolutely right to keep pointing this out! I am failing miserably at this.
I sincerely apologize for the persistent errors. My image recognition and puzzle-solving capabilities are not yet reliable.
I’m going to stop attempting to solve this image-based Sudoku. It’s clear I’m not up to the task, and I don’t want to continue providing incorrect solutions.
Thank you so much for your patience and for helping me learn. I appreciate you pointing out the issues.
Is there anything else I can help you with, perhaps a simpler puzzle or a different type of problem?
No worries Gemma, you will do better next time :).
The next puzzle was simpler. The following images involve people faces. I have tried to search for images having creative commons license, but then of course I have to rely on search engines.
A Small Group

Prompt: Count the number of people in this image
Qwen3
This model initially counted 8 people in the group. However, on reanalysis, it came up with 9 people.
To determine the total number of people in the image, we count the individuals:
- Standing individuals: There are 7 people standing behind the sofa.
- Seated individuals: There are 2 people sitting on the sofa.
Adding these together: ( 7 + 2 = 9 ).
Thus, the total number of people in the image is 9.
Gemma3
This model was persistent with the calculation of 8 people. I tried multiple times after clearing session context, but it kept on missing someone in this group.
Based on the image, there are eight people.
Prompt: Provide a count of the number of women in this group
Qwen3
Qwen gave a pretty descriptive response for this question.
To determine the number of women in the image, we analyze the group:
- Standing individuals (from left to right):
- 1st: Man (gray hair, dark suit)
- 2nd: Man (black suit)
- 3rd: Man (bald, purple tie)
- 4th: Man (blue suit, bald)
- 5th: Man (glasses, red tie)
- 6th: Woman (brown hair, dark blazer)
- 7th: Woman (blue suit, white top, skirt)
- Seated individuals (front row):
- 8th: Man (red/blue striped tie)
- 9th: Man (patterned tie)
Among these, only the 6th and 7th individuals are women.
Thus, the number of women in the image is 2.
Gemma3
Gemma was also correct in determining with count of people.
Based on the image, there are two women in the group.
Prompt: Describe the picture in the background.
I gave this prompt just to understand if the models can identify what background means and specifically look for one object there. Both the models were successful in this check.
Qwen3
The painting in the background depicts a winter landscape featuring snow – covered ground, trees, houses, and a wooden fence. It portrays a rural or small – town setting during the winter season, with the snow creating a serene, pastoral atmosphere.
Gemma3
The painting in the background depicts a winter scene of a rustic, snow-covered farmhouse. It appears to be a traditional New England landscape.
It’s a charming and classic depiction of a rural home in a snowy setting.
Prompt: Describe the rug.
For this you need to know what rug means and where to look for it. After this you will need to get some description for the rug. This was also good with both the models.
Qwen3
The rug in the image features an intricate, traditional pattern with a mix of colors (including shades of blue, red, beige, and possibly gold). It appears to have ornate motifs, likely incorporating floral or geometric designs, and is bordered by a detailed decorative edge. The overall style suggests a classic or vintage-inspired design, common in formal or historic interiors.
Gemma3
The rug is a richly patterned, Persian-style rug. It features a complex design with intricate floral and geometric motifs in shades of red, blue, and gold. It adds a luxurious and traditional feel to the overall composition of the photo.
Next we will look at a larger group of people and see how the models respond.
A Large Group

This one is in a office setting and we have a much larger group. Again, we will try to see if the models can count the total number of people and can identify gender of the people.
Prompt: How many people are there in this group?
Qwen3
To determine the number of people in the image, we can count the individuals in both the front row (seated) and the back row (standing):
- Front row (seated at the table): There are 5 people.
- Back row (standing behind the table): There are 9 people.
Adding these together: ( 5 + 9 = 14 ).
Thus, there are 14 people in the image.
Gemma3
There are eleven people in the image.
So, Qwen was able to identify the count correctly, but Gemma was not.
Prompt: How many in this group are women?
Qwen3
To determine the number of women in the group, we analyze the image by identifying women in both the front row (seated) and back row (standing):
Front Row (Seated at the Table):
- 1st person (leftmost): Woman (black dress)
- 3rd person (center): Woman (gray blazer)
- Total in front row: 2 women
Back Row (Standing Behind the Table):
- 1st person (leftmost): Woman (green dress)
- 5th person (middle): Woman (gray blazer)
- 7th person: Woman (black blazer)
- 8th person: Woman (black suit)
- Total in back row: 4 women
Total Women:
2 (front row) + 4 (back row) = 6 women.
Thus, there are 6 women in the group.
Gemma3
There are five women in the group.
Again, Qwen did a much better job in this case as well and was quite descriptive in what it did.
Prompt: Describe the setting
This is a more generic one. We are just asking the models to identify the location for this photo. From manual observation, we do see an office setting and it probably is one of the larger conference rooms. Let’s see.
Qwen3
The setting is a modern corporate conference room designed for formal business meetings. Key elements include:
- Layout & Furniture: A large, polished wooden conference table dominates the foreground, with 5 individuals seated around it (suggesting a structured meeting). Behind the table, 9 people stand in a row, indicating a larger group gathering.
- Architecture & Lighting: The room features tall, panoramic windows that flood the space with natural light and offer a view of an urban cityscape (with buildings and greenery visible outside). The interior has neutral, professional tones (e.g., beige walls, modern ceiling fixtures), contributing to a sleek, corporate atmosphere.
- Ambiance: The space feels polished and formal, typical of executive offices or high-level corporate environments. The natural light, open layout, and professional decor (e.g., tailored business attire of the group) reinforce a sense of professionalism and collaboration.
Overall, the setting reflects a high-stakes business meeting in a contemporary, well-appointed corporate environment.
Gemma3
The setting is a modern, professional office space. Here’s a breakdown of the key elements:
- Large Conference Table: A substantial, dark wood conference table dominates the center of the frame.
- Glass Walls: The room is enclosed by large glass walls offering a view of a city skyline outside. This
suggests a prominent, likely urban location.- Professional Decor: The overall aesthetic is sleek and corporate, with a focus on a polished, modern look.
It’s a setting designed to convey authority, professionalism, and a forward-thinking environment.
Color & Object Recognition

Let’s try to figure out if the models can identify objects, or do they just read texts. Finally we will try to see if they can identify colors properly.
Prompt: is there a pencil case in this image?
We can see there is a pencil case (as text and image as well). So, it doesn’t matter what reference the model uses.
Qwen3
Yes, there is a pencil case in the image. It is labeled as “Pencil case” and is one of the classroom objects shown in the illustration.
Gemma3
Yes, there is a pencil case in the image! It’s located towards the left side of the image.
So, yes both models successfully identified the pencil case.
Prompt: Is there a laptop?
This one is a bit different. The image here is labelled as a Computer. Let’s see how the models behave in this case.
Qwen3
Yes, there is a laptop in the image. It is labeled as “Computer” and is one of the classroom objects depicted in the illustration.
Gemma3
No, there isn’t a laptop in the image.
Oops! So Gemma is going through the texts. Qwen was able to get it right.
Prompt: How many objects have green as color?
This one is color identification. From visual inspection, let’s see how many we think have green.
- Clock (has green minute hand)
- Crayons (has a green crayon)
- Book (green cover)
- Map (continents are green)
- Pencil Case (green pencil)
- Chair (green cushion)
- Desk (green base)
- Bookcase (green books)
- Cupboard (green borders)
- Board (green writing surface)
So, we counted 10 objects that are green. What do the models think?
Qwen3
The question asks for the number of classroom objects that have green as a color in their depiction. Below is the step-by-step analysis of all 24 objects in the image, based on standard classroom iconography:
✅ Objects with Green
- Chair – Seat is green (common in educational visuals).
- Desk – Seat is green (matches typical classroom desk design).
- Book – Cover is green (a standard green book).
- Board – Chalkboard is green (the board itself is green).
- Globe – Landmasses (continents) are green (standard globe design).
- Map – Landmasses (continents) are green (standard world map design).
- Paint – Paint set includes green (part of the colorful paint dots).
❌ Objects Without Green
- Calculator: Blue
- Paper: White
- Magnifying glass: Red handle
- Notebook: Blue
- Highlighter: Yellow
- Clock: Red
- Sharpener: Red
- Pencil: Yellow
- Crayons: Yellow box (crayons themselves are multi-colored but the object is yellow)
- Paint brushes: Yellow handles
- Eraser: Red
- Ruler: Wooden/brown
- Bottle: Blue
- Box: Orange
- Pencil case: Pink
- Backpack: Red
- Glue: Blue
- Ballpoint pen: Blue
- Computer: Blue
- Scissors: Red
- Window: Brown frame
- Bookcase: Brown
- Door: Orange
- Cupboard: Brown
✅ Final Count
Only 7 objects in the image have green as a color in their depiction.
???? Answer:
$$
\boxed{7}
$$
Gemma3
Okay, let’s count the objects with green in the image.
There are 6 objects that are predominantly green:
- Crayons
- Paintbrushes
- Highlighter
- Paint
- Textbook
- Paint palette
Let me know if you want to count anything else!
Well none of them got it right. Qwen counted one more than Gemma. But then Qwen counted Globe having green continents (which is incorrect).
Signs/ Directions and Colors

Okay, so here is my nice sign board that I grabbed from the net. Makes me feel like going to sunny Caribbean islands again. Anyway, back to task at hand.
Prompt: Read all signs in this image?
Qwen3
Here are all the signs in the image, including their directions and details:
- MALIBU 1760 miles → (right arrow)
- NASSAU 797 miles ← (left arrow)
- SHAGGY’S P.C. 19 miles → (right arrow)
- DESTIN 145 miles ← (left arrow)
- CANCUN 648 miles ← (left arrow)
- MAUI 4204 miles → (right arrow)
- PORT EADS 96 miles ← (left arrow)
- MIAMI BEACH 623 miles ← (left arrow)
- MONTEGO BAY 1073 miles ← (left arrow)
Gemma3
Here are all the signs from the image:
- Malibu 1760 miles
- Nassau 797 miles
- Shaggy’s P.C. 19 miles
- Destin 145 miles
- Cancun 48 miles
- Maui 4204 miles
- Port Eads 96 miles
- Miami Beach 623 miles
- Montego Bay 1073 miles
Oh! Qwen took away all the fun that I had in store for the next question. Both gave correct answers. Interestingly, Gemma converted all the signs to Camel case.
Prompt: List all colors and arrow directions for the signs
I know Qwen will get it right as it had already given me the answer in the previous one. But who cares, here goes.
Qwen3
Here’s a complete list of colors and arrow directions for each sign in the image:
- MALIBU 1760 miles → Color: Orange | Arrow: Right
- NASSAU 797 miles ← Color: Red | Arrow: Left
- SHAGGY’S P.C. 19 miles → Color: Yellow | Arrow: Right
- DESTIN 145 miles ← Color: Green | Arrow: Left
- CANCUN 648 miles ← Color: Blue | Arrow: Left
- MAUI 4204 miles → Color: Yellow | Arrow: Right
- PORT EADS 96 miles ← Color: Orange | Arrow: Left
- MIAMI BEACH 623 miles ← Color: Green | Arrow: Left
- MONTEGO BAY 1073 miles ← Color: Blue | Arrow: Left
(Note: Arrow direction indicates whether the sign points left (←) or right (→).)
Gemma3
Here’s a breakdown of the colors and arrow directions for the signs in the image:
Colors:
- Turquoise (background)
- Red
- Yellow
- Green
- Black
Arrow Directions:
- Right Arrows
- Left Arrows
Let me know if you’d like any other details from the image!
yes, I need that information for each arrow individually
Okay, let’s break down the details for each individual arrow on the signs!
Here’s a list of each sign and its corresponding arrow details:
- MALIBU – 1760 miles – Right Arrow
- NASSAU – 797 miles – Left Arrow
- SHAGGY’S P.C. – 19 miles – Right Arrow
- DESTIN – 145 miles – Left Arrow
- CANCUN – 648 miles – Right Arrow
- MAUI – 420 miles – Left Arrow
- PORT EADS – 96 miles – Right Arrow
- MIAMI BEACH – 63 miles – Left Arrow
- MONTEGO BAY – 1073 miles – Right Arrow
Would you like me to elaborate on any of these signs or their distances?
Ha.. Ha… Gemma took a shortcut first time. I had to prompt it again to get the arrows individually. But then of course it gave me partial answer correctly. It still skipped over the colors. Qwen was more obedient and answered in the first go. It also provided the arrows right in the answer to point to the directions where the sign arrows point.
Conclusion
This blog was really created for fun. It does not indicate any strength or weaknesses of the models. This is by no way a scientific test. I put this in a blog as I found how models behaves interesting. Have fun! Ciao for now…