As the use of Large Language Models (LLMs) grows, one of the main challenges for any organization is deciding whether to use hosted LLMs or to run them on premises. Each approach has its advantages; it is all about balancing performance, control, and data security.

In this blog we will work through a couple of different options for installing an LLM locally, which we will use in later blogs for practice. First, however, let’s discuss the topic of installing LLMs on-premises.
Benefits of keeping a local LLM
- Data Privacy and Security: One of the most compelling arguments for local installations is data privacy. Sensitive information, such as customer data or intellectual property, remains on internal servers, and regulatory compliance is easier to manage because the organization retains full control.
- Performance and Latency: Local LLMs typically provide lower latency than hosted solutions. Eliminating network round trips to external servers can significantly improve response times, which is particularly beneficial for applications requiring real-time interactions.
- Customization and Fine-Tuning: With local installations, organizations can tailor models to specific requirements. This flexibility allows fine-tuning on proprietary datasets, making the model more aligned with the company’s unique language and terminology.
- Cost Efficiency Over Time: Hosting providers usually offer pay-as-you-go pricing, and costs can escalate rapidly as usage grows. Deploying a private LLM involves higher upfront costs but can lead to significant savings in the long run.
- No Downtime Risks: Relying on external services opens the door to possible downtime and outages. With local installations, organizations mitigate these risks and keep their applications available.
The Pros and Cons: Private vs. Hosted LLMs
| Criteria | Private LLM | Hosted LLM |
|---|---|---|
| Data Privacy | High | Low |
| Latency | Low | Medium to High |
| Customization | High | Limited |
| Cost | Long term savings | Pay as you go |
| Predictable Monthly Expense | Yes | Varies |
| Infrastructure Cost | Ongoing | None |
| Dependency on Internet | No | Yes |
Tools for On-Premise Install
In this blog we will talk about two different tools that let you run large language models on your own hardware.
- OLLAMA
- Microsoft Foundry Local
Most blogs only cover OLLAMA, but since Foundry Local is an easy option that works on Windows and macOS, I will cover it here as well. Foundry Local also has an additional advantage that I will cover when we get to that tool. We will briefly describe both tools, install them, and finally run an LLM/SLM on each.
OLLAMA
Ollama is described on their website as follows:
Ollama is the easiest way to get up and running with large language models such as gpt-oss, Gemma 3, DeepSeek-R1, Qwen3 and more.
That’s the gist of the tool. It provides a convenient way to run large language models locally on your machine.
Installation
Ollama can be installed on Windows, macOS, or Linux. Installation is a single script (or installer) on each system.
| Operating System | Script to Run |
|---|---|
| macOS (installer available) | `curl -fsSL https://ollama.com/install.sh \| sh` |
| Windows (installer available) | `irm https://ollama.com/install.ps1 \| iex` |
| Linux | `curl -fsSL https://ollama.com/install.sh \| sh` |
Basic Commands
```shell
% ollama --help
Large language model runner

Usage:
  ollama [flags]
  ollama [command]

Available Commands:
  serve       Start Ollama
  create      Create a model
  show        Show information for a model
  run         Run a model
  stop        Stop a running model
  pull        Pull a model from a registry
  push        Push a model to a registry
  signin      Sign in to ollama.com
  signout     Sign out from ollama.com
  list        List models
  ps          List running models
  cp          Copy a model
  rm          Remove a model
  launch      Launch the Ollama menu or an integration
  help        Help about any command

Flags:
  -h, --help        help for ollama
      --nowordwrap  Don't wrap words to the next line automatically
      --verbose     Show timings for response
  -v, --version     Show version information
```
We will go over some frequently used commands here.
Ollama exposes a REST endpoint that can be used to access the models, offering both its own custom API and OpenAI-SDK-compatible endpoints. The first command to give Ollama is serve, which starts the server so it can serve models.
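To see both endpoint styles in action, here is a quick sketch using curl. This assumes the server is running on Ollama's default port 11434 and that a model such as llama3.1:8b has already been pulled; substitute whatever model you have locally.

```shell
# Native Ollama endpoint: single-shot generation
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b", "prompt": "Say hello", "stream": false}'

# OpenAI-compatible endpoint served by the same process
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1:8b", "messages": [{"role": "user", "content": "Say hello"}]}'
```

The second form is what lets the OpenAI SDK talk to Ollama later in this blog, simply by pointing the client's base URL at port 11434.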
Ollama maintains a list of models in its registry. To download a model to your local repository, use the pull command. If you just want to run a model, local or cloud, use the run command, which pulls the model first if it is not already present.
```shell
% ollama pull gemma3:12b-it-qat
pulling manifest
pulling 1fb99eda86dc:   1% ▕          ▏  51 MB/8.9 GB
```
Other frequently used commands are ps (list running models), list (list all downloaded models), and rm (remove a model from local downloads).
```shell
% ollama list
NAME                    ID              SIZE      MODIFIED
llama3.1:8b             46e0c10c039e    4.9 GB    2 weeks ago
qwen3-vl:8b-instruct    0533d74300e4    6.1 GB    5 weeks ago
phi4:14b                ac896e5b8b34    9.1 GB    2 months ago
nomic-embed-text:v1.5   0a109f422b47    274 MB    2 months ago
```
The model name encodes a lot of information. Take qwen3-vl:8b-instruct, for example: the base model is Alibaba's Qwen, version 3; vl indicates a vision-language model; 8b is the number of parameters; and instruct tells us this model is specifically trained to follow instructions. A similar suffix you may see is it, which also means instruction-tuned.
Ollama also supports GGUF (GPT-Generated Unified Format) quantized models, which is likewise indicated in the name. For example, a tag like llama3.1:8b-instruct-q4_K_M indicates a 4-bit quantized model (q4). q4_K_M is the GGUF quantization scheme: K means the model was quantized with the K-quants algorithm from llama.cpp, and M indicates how aggressive the quantization is (here, Medium). Quantization itself is out of scope for this blog.
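The naming convention above can be decoded mechanically. Here is a small sketch; the split rules are my own illustration, not an official Ollama parser, since real tags are free-form strings chosen by model publishers.

```python
def parse_model_tag(tag: str) -> dict:
    """Split an Ollama-style model tag into family, size, tuning and quantization.

    Illustrative only; assumes the common "family:size-tuning-quant" convention.
    """
    family, _, variant = tag.partition(":")
    info = {"family": family, "size": None, "tuning": None, "quant": None}
    for part in variant.split("-") if variant else []:
        low = part.lower()
        if low.endswith("b") and low[:-1].replace(".", "").isdigit():
            info["size"] = part          # e.g. "8b" -> parameter count
        elif low in ("instruct", "it", "chat"):
            info["tuning"] = part        # instruction-tuned variants
        elif low.startswith("q") and low[1:2].isdigit():
            info["quant"] = part         # e.g. "q4_K_M" GGUF quantization scheme
    return info

print(parse_model_tag("qwen3-vl:8b-instruct"))
print(parse_model_tag("llama3.1:8b-instruct-q4_K_M"))
```

Running this shows how the family, parameter count, tuning style, and (when present) quantization scheme fall out of the tag.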
Microsoft Foundry Local
We will again rely on the website definition for Foundry Local:
Foundry Local is an on-device AI inference solution that you use to run AI models locally through a CLI, SDK, or REST API.
It does pretty much the same thing Ollama does. One thing to note: Foundry Local is still in preview and, according to Microsoft, is being actively developed.
One main differentiating factor of Foundry Local for me was its native support for QNN, the Qualcomm NPU (Neural Processing Unit) runtime, on Windows devices. Since I bought a Qualcomm-based Copilot+ PC for testing, this mattered to me. Not all models are supported, but quite a few already use the NPU.
Installation
| Operating System | Script to Run |
|---|---|
| Windows | winget install Microsoft.FoundryLocal |
| MacOS | brew install foundrylocal |
Basic Commands
```shell
PS C:\> foundry --help
Description:
  Foundry Local CLI: Run AI models on your device.

Getting started:
  1. To view available models: foundry model list
  2. To run a model: foundry model run <model>

EXAMPLES:
  foundry model run phi-3-mini-4k

Usage:
  foundry [command] [options]

Options:
  -?, -h, --help  Show help and usage information
  --version       Show version information
  --license       Display foundry license information

Commands:
  model    Discover, run and manage models
  cache    Manage the local cache
  service  Manage the local model inference service
```
We can get a list of all available models by running foundry model list. The list can also be filtered by capability; for example, to see only models that can use the QNN NPU:
```shell
C:\> foundry model list --filter device=npu
Alias            Device  Task             File Size  License  Model ID
---------------------------------------------------------------------------------------------
phi-3.5-mini     NPU     chat-completion  2.78 GB    MIT      phi-3.5-mini-instruct-qnn-npu:1
phi-3-mini-128k  NPU     chat-completion  2.78 GB    MIT      phi-3-mini-128k-instruct-qnn-npu:2
phi-3-mini-4k    NPU     chat-completion  2.78 GB    MIT      phi-3-mini-4k-instruct-qnn-npu:2
deepseek-r1-14b  NPU     chat-completion  7.12 GB    MIT      deepseek-r1-distill-qwen-14b-qnn-npu:1
deepseek-r1-7b   NPU     chat-completion  3.71 GB    MIT      deepseek-r1-distill-qwen-7b-qnn-npu:1
qwen2.5-1.5b     NPU     chat-completion  2.78 GB    MIT      qwen2.5-1.5b-instruct-qnn-npu:2
qwen2.5-7b       NPU     chat-completion  2.78 GB    MIT      qwen2.5-7b-instruct-qnn-npu:2
```
Note that when you run a model that is not yet available locally, it will be downloaded first.
To see all installed models, you can use:

```shell
C:\> foundry cache list
```

To delete a model, use foundry cache rm <model>.
Using APIs to access Models
Both Ollama and Microsoft Foundry Local expose APIs that can be used to interact with the models. Each provides its own custom API and also supports the OpenAI SDK. For this blog we will show how to use the OpenAI SDK with each framework.
Let’s create a translator, specifically a Pig Latin translator. We will use one of Microsoft’s models, Phi. We need to install a couple of libraries, openai and foundry-local-sdk; both can be installed with pip.
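Both packages can be installed in one go (package names as published on PyPI):

```shell
pip install openai foundry-local-sdk
```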
```python
import openai
from foundry_local import FoundryLocalManager
```
Ollama and Foundry Local work a little differently here. Ollama always exposes a fixed port, while Foundry Local may expose its API on a different port each time. We will use the Foundry manager to find out which port it is running on.
To get an LLM client for Ollama, we can connect to the fixed port as follows:
```python
def get_ollama_model(self):
    """
    Get Ollama model instance.
    """
    api_key = "not-a-real-key"  # Ollama ignores the key, but the SDK requires one
    endpoint = "http://localhost:11434/v1"
    llm = openai.OpenAI(
        base_url=endpoint,
        api_key=api_key
    )
    return llm
```

Here we use the fixed port 11434 to access the OpenAI-compliant endpoint. For Foundry Local, on the other hand, the following is needed:
```python
def get_foundry_model(self):
    """
    Get Foundry Local model instance.
    """
    fdy_manager = FoundryLocalManager(self.model_alias)
    llm = openai.OpenAI(
        base_url=fdy_manager.endpoint,
        api_key=fdy_manager.api_key
    )
    return llm
```

Here we do not need to know where Foundry Local is running; we get the endpoint from the Foundry manager.
Let’s put this in a class. We will also initialize the model_alias instance variable.
```python
class PigLatinTranslator:
    """
    A simple class to demonstrate how to use Foundry Local and Ollama models
    for generating responses.
    """
    def __init__(self, framework="foundry_local"):
        if framework == "foundry_local":
            self.model_alias = "Phi-4-generic-cpu:1"
            self.llm = self.get_foundry_model()
        elif framework == "ollama":
            self.model_alias = "phi4:14b"
            self.llm = self.get_ollama_model()
        else:
            raise ValueError(f"Unsupported framework: {framework}")
```

Notice that in both cases we are using Phi-4 as the language model. As the next step, we will write a method that uses the model to convert text to Pig Latin.
```python
def generate_response(self, user_input) -> str:
    """
    Generates a response based on the user input.
    """
    prompt = [
        {"role": "system", "content": "You are a helpful assistant that translates English to Pig Latin."},
        {"role": "user", "content": user_input}
    ]
    response = self.llm.chat.completions.create(
        model=self.model_alias,
        messages=prompt,
        stream=False
    )
    return response.choices[0].message.content
```

Finally, we will create the main block.
```python
if __name__ == "__main__":
    # translator = PigLatinTranslator(framework="foundry_local")
    translator = PigLatinTranslator(framework="ollama")
    prompt = "Translate the following English text to Pig Latin: 'We are having fun understanding Ollama and Foundry Local.'"
    translation = translator.generate_response(prompt)
    print(f"Pig Latin Translation: {translation}")
```

That’s it; we are done with a very simple implementation for both Ollama and Foundry Local. The output looks similar for both models.
```
Pig Latin Translation: Sure! Here is your text translated into Pig Latin:

"eway areway avinghay unfay understandingway Allomaa owardn-Forylay Ecal-Locaway."

Let me know if you need anything else!
```
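As the output shows, the model's Pig Latin is only approximately right, especially on proper nouns. For comparison, the classic rules (words starting with a vowel get "way" appended; otherwise the leading consonant cluster moves to the end followed by "ay") fit in a few lines of plain Python. This is a deterministic sketch for reference, not part of the LLM pipeline:

```python
def pig_latin_word(word: str) -> str:
    """Translate a single lowercase word using the classic Pig Latin rules."""
    vowels = "aeiou"
    if word[0] in vowels:
        return word + "way"
    # Move the leading consonant cluster to the end, then append "ay".
    for i, ch in enumerate(word):
        if ch in vowels:
            return word[i:] + word[:i] + "ay"
    return word + "ay"  # no vowels at all

print(" ".join(pig_latin_word(w) for w in "we are having fun".split()))
# -> eway areway avinghay unfay
```

A handful of lines of deterministic code beats a multi-gigabyte model at this particular task, which is a useful reminder to reach for an LLM only when the problem actually needs one.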
Conclusion
This blog went over the advantages of using a private LLM, and then implemented a basic Python program using two different frameworks, Foundry Local and Ollama. I hope you found it helpful. Ciao for now!