As the use of Large Language Models (LLMs) grows, one of the main challenges for any organization is deciding whether to use hosted LLMs or to run them on premises. Each approach has its advantages; it is all about balancing performance, control, and data security.

In this blog we will work through a couple of different options for installing an LLM locally, which we will use in later blogs for practice. First, however, let’s discuss the topic of installing LLMs on-premises.
Benefits of keeping a local LLM
- Data Privacy and Security: One of the most compelling arguments for local installations is data privacy. Sensitive information, such as customer data or intellectual property, remains on internal servers, and regulatory compliance is easier to manage because the organization retains full control.
- Performance and Latency: Local LLMs typically provide lower latency than hosted solutions. Eliminating network round trips to external servers can significantly improve response times, which is particularly beneficial for applications requiring real-time interactions.
- Customization and Fine-Tuning: With local installations, organizations can tailor models to specific requirements. This flexibility allows fine-tuning on proprietary datasets, making the model more aligned with the company’s unique language and terminology.
- Cost Efficiency Over Time: Hosting providers usually offer pay-as-you-go pricing, and costs can escalate rapidly as usage grows. Deploying a private LLM involves higher upfront costs but can lead to significant savings in the long run.
- No Downtime Risks: Relying on external services opens the door to possible downtime and outages. With local installations, organizations mitigate these risks and keep their applications available.
The Pros and Cons: Private vs. Hosted LLMs
| Criteria | Private LLM | Hosted LLM |
|---|---|---|
| Data Privacy | High | Low |
| Latency | Low | Medium to High |
| Customization | High | Limited |
| Cost | Long term savings | Pay as you go |
| Predictable Monthly Expense | Yes | Varies |
| Infrastructure Cost | Ongoing | None |
| Dependency on Internet | No | Yes |
Tools for On-Premise Install
In this blog we will talk about two different tools that let you run large language models on your own hardware.
- OLLAMA
- Microsoft Foundry Local
Most blogs only cover OLLAMA, but since Foundry Local is an easy option that works on Windows and macOS, I will cover it here as well. Foundry Local also has an additional advantage that I will cover when we get to that tool. We will briefly describe both tools, install them, and finally run an LLM/SLM on each.
OLLAMA
Ollama is described on their website as follows:
Ollama is the easiest way to get up and running with large language models such as gpt-oss, Gemma 3, DeepSeek-R1, Qwen3 and more.
That’s the gist of the tool. It provides a convenient way to run large language models locally on your machine.
Installation
Ollama can be installed on Windows, macOS, or Linux. Installation is a single script (or installer) on each system.
| Operating System | Script to Run |
|---|---|
| macOS (installer available) | `curl -fsSL https://ollama.com/install.sh \| sh` |
| Windows (installer available) | `irm https://ollama.com/install.ps1 \| iex` |
| Linux | `curl -fsSL https://ollama.com/install.sh \| sh` |
Basic Commands
```shell
% ollama --help
Large language model runner

Usage:
  ollama [flags]
  ollama [command]

Available Commands:
  serve       Start Ollama
  create      Create a model
  show        Show information for a model
  run         Run a model
  stop        Stop a running model
  pull        Pull a model from a registry
  push        Push a model to a registry
  signin      Sign in to ollama.com
  signout     Sign out from ollama.com
  list        List models
  ps          List running models
  cp          Copy a model
  rm          Remove a model
  launch      Launch the Ollama menu or an integration
  help        Help about any command

Flags:
  -h, --help        help for ollama
      --nowordwrap  Don't wrap words to the next line automatically
      --verbose     Show timings for response
  -v, --version     Show version information
```
We will go over some frequently used commands here.
Ollama exposes a REST endpoint that can be used to access the models, offering both its own custom API and OpenAI-SDK-compatible endpoints. The first command to give Ollama is serve, which starts the server so it can serve models.
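To see both endpoint styles in action, here is a quick sketch using curl. This assumes the server is running on Ollama's default port 11434 and that a model such as llama3.1:8b has already been pulled; substitute whatever model you have locally.

```shell
# Native Ollama endpoint: single-shot generation
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b", "prompt": "Say hello", "stream": false}'

# OpenAI-compatible endpoint served by the same process
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1:8b", "messages": [{"role": "user", "content": "Say hello"}]}'
```

The second form is what lets the OpenAI SDK talk to Ollama later in this blog, simply by pointing the client's base URL at port 11434.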
Ollama maintains a list of models in its registry. To download a model to your local repository, use the pull command. If you just want to run a model, local or cloud, use the run command, which pulls the model first if it is not already present.
```shell
% ollama pull gemma3:12b-it-qat
pulling manifest
pulling 1fb99eda86dc:   1% ▕          ▏  51 MB/8.9 GB
```
Other frequently used commands are ps (list running models), list (list all downloaded models), and rm (remove a model from local downloads).
```shell
% ollama list
NAME                    ID              SIZE      MODIFIED
llama3.1:8b             46e0c10c039e    4.9 GB    2 weeks ago
qwen3-vl:8b-instruct    0533d74300e4    6.1 GB    5 weeks ago
phi4:14b                ac896e5b8b34    9.1 GB    2 months ago
nomic-embed-text:v1.5   0a109f422b47    274 MB    2 months ago
```
The model name encodes a lot of information. Take qwen3-vl:8b-instruct, for example: the base model is Alibaba's Qwen, version 3; vl indicates a vision-language model; 8b is the number of parameters; and instruct tells us this model is specifically trained to follow instructions. A similar suffix you may see is it, which also means instruction-tuned.
Ollama also supports GGUF (GPT-Generated Unified Format) quantized models, which is likewise indicated in the name. For example, a tag like llama3.1:8b-instruct-q4_K_M indicates a 4-bit quantized model (q4). q4_K_M is the GGUF quantization scheme: K means the model was quantized with the K-quants algorithm from llama.cpp, and M indicates how aggressive the quantization is (here, Medium). Quantization itself is out of scope for this blog.
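The naming convention above can be decoded mechanically. Here is a small sketch; the split rules are my own illustration, not an official Ollama parser, since real tags are free-form strings chosen by model publishers.

```python
def parse_model_tag(tag: str) -> dict:
    """Split an Ollama-style model tag into family, size, tuning and quantization.

    Illustrative only; assumes the common "family:size-tuning-quant" convention.
    """
    family, _, variant = tag.partition(":")
    info = {"family": family, "size": None, "tuning": None, "quant": None}
    for part in variant.split("-") if variant else []:
        low = part.lower()
        if low.endswith("b") and low[:-1].replace(".", "").isdigit():
            info["size"] = part          # e.g. "8b" -> parameter count
        elif low in ("instruct", "it", "chat"):
            info["tuning"] = part        # instruction-tuned variants
        elif low.startswith("q") and low[1:2].isdigit():
            info["quant"] = part         # e.g. "q4_K_M" GGUF quantization scheme
    return info

print(parse_model_tag("qwen3-vl:8b-instruct"))
print(parse_model_tag("llama3.1:8b-instruct-q4_K_M"))
```

Running this shows how the family, parameter count, tuning style, and (when present) quantization scheme fall out of the tag.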
Microsoft Foundry Local
We will again rely on the website definition for Foundry Local:
Foundry Local is an on-device AI inference solution that you use to run AI models locally through a CLI, SDK, or REST API.
It does pretty much the same thing Ollama does. One thing to note: Foundry Local is still in preview and, according to Microsoft, is being actively developed.
One main differentiating factor of Foundry Local for me was its native support for QNN, the Qualcomm NPU (Neural Processing Unit) runtime, on Windows devices. Since I bought a Qualcomm-based Copilot+ PC for testing, this mattered to me. Not all models are supported, but quite a few already use the NPU.
Installation
| Operating System | Script to Run |
|---|---|
| Windows | winget install Microsoft.FoundryLocal |
| MacOS | brew install foundrylocal |
Basic Commands
```shell
PS C:\> foundry --help
Description:
  Foundry Local CLI: Run AI models on your device.

Getting started:
  1. To view available models: foundry model list
  2. To run a model: foundry model run <model>

EXAMPLES:
  foundry model run phi-3-mini-4k

Usage:
  foundry [command] [options]

Options:
  -?, -h, --help  Show help and usage information
  --version       Show version information
  --license       Display foundry license information

Commands:
  model    Discover, run and manage models
  cache    Manage the local cache
  service  Manage the local model inference service
```
We can get a list of all available models by running foundry model list. The list can also be filtered by capability; for example, to see only models that can use the QNN NPU:
```shell
C:\> foundry model list --filter device=npu
Alias            Device  Task             File Size  License  Model ID
---------------------------------------------------------------------------------------------
phi-3.5-mini     NPU     chat-completion  2.78 GB    MIT      phi-3.5-mini-instruct-qnn-npu:1
phi-3-mini-128k  NPU     chat-completion  2.78 GB    MIT      phi-3-mini-128k-instruct-qnn-npu:2
phi-3-mini-4k    NPU     chat-completion  2.78 GB    MIT      phi-3-mini-4k-instruct-qnn-npu:2
deepseek-r1-14b  NPU     chat-completion  7.12 GB    MIT      deepseek-r1-distill-qwen-14b-qnn-npu:1
deepseek-r1-7b   NPU     chat-completion  3.71 GB    MIT      deepseek-r1-distill-qwen-7b-qnn-npu:1
qwen2.5-1.5b     NPU     chat-completion  2.78 GB    MIT      qwen2.5-1.5b-instruct-qnn-npu:2
qwen2.5-7b       NPU     chat-completion  2.78 GB    MIT      qwen2.5-7b-instruct-qnn-npu:2
```
Note that when you run a model that is not yet available locally, it will be downloaded first.
To see all installed models, you can use:

```shell
C:\> foundry cache list
```

To delete a model, use foundry cache rm <model>.
Using APIs to access Models
Both Ollama and Microsoft Foundry Local expose APIs that can be used to interact with the models. Each provides its own custom API and also supports the OpenAI SDK. For this blog we will show how to use the OpenAI SDK with each framework.
Let’s create a translator, specifically a Pig Latin translator. We will use one of Microsoft’s models, Phi. We need to install a couple of libraries, openai and foundry-local-sdk; both can be installed with pip.
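Both packages can be installed in one go (package names as published on PyPI):

```shell
pip install openai foundry-local-sdk
```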
```python
import openai
from foundry_local import FoundryLocalManager
```
Ollama and Foundry Local work a little differently here. Ollama always exposes a fixed port, while Foundry Local may expose its API on a different port each time. We will use the Foundry manager to find out which port it is running on.
To get an LLM client for Ollama, we can connect to the fixed port as follows:
```python
def get_ollama_model(self):
    """
    Get Ollama model instance.
    """
    api_key = "not-a-real-key"  # Ollama ignores the key, but the SDK requires one
    endpoint = "http://localhost:11434/v1"
    llm = openai.OpenAI(
        base_url=endpoint,
        api_key=api_key
    )
    return llm
```

Here we use the fixed port 11434 to access the OpenAI-compliant endpoint. For Foundry Local, on the other hand, the following is needed:
```python
def get_foundry_model(self):
    """
    Get Foundry Local model instance.
    """
    fdy_manager = FoundryLocalManager(self.model_alias)
    llm = openai.OpenAI(
        base_url=fdy_manager.endpoint,
        api_key=fdy_manager.api_key
    )
    return llm
```

Here we do not need to know where Foundry Local is running; we get the endpoint from the Foundry manager.
Let’s put this in a class. We will also initialize the model_alias instance variable.
```python
class PigLatinTranslator:
    """
    A simple class to demonstrate how to use Foundry Local and Ollama models
    for generating responses.
    """
    def __init__(self, framework="foundry_local"):
        if framework == "foundry_local":
            self.model_alias = "Phi-4-generic-cpu:1"
            self.llm = self.get_foundry_model()
        elif framework == "ollama":
            self.model_alias = "phi4:14b"
            self.llm = self.get_ollama_model()
        else:
            raise ValueError(f"Unsupported framework: {framework}")
```

Notice that in both cases we are using Phi-4 as the language model. As the next step, we will write a method that uses the model to convert text to Pig Latin.
```python
def generate_response(self, user_input) -> str:
    """
    Generates a response based on the user input.
    """
    prompt = [
        {"role": "system", "content": "You are a helpful assistant that translates English to Pig Latin."},
        {"role": "user", "content": user_input}
    ]
    response = self.llm.chat.completions.create(
        model=self.model_alias,
        messages=prompt,
        stream=False
    )
    return response.choices[0].message.content
```

Finally, we will create the main block.
```python
if __name__ == "__main__":
    # translator = PigLatinTranslator(framework="foundry_local")
    translator = PigLatinTranslator(framework="ollama")
    prompt = "Translate the following English text to Pig Latin: 'We are having fun understanding Ollama and Foundry Local.'"
    translation = translator.generate_response(prompt)
    print(f"Pig Latin Translation: {translation}")
```

That’s it; we are done with a very simple implementation for both Ollama and Foundry Local. The output looks similar for both models.
```
Pig Latin Translation: Sure! Here is your text translated into Pig Latin:

"eway areway avinghay unfay understandingway Allomaa owardn-Forylay Ecal-Locaway."

Let me know if you need anything else!
```
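As the output shows, the model's Pig Latin is only approximately right, especially on proper nouns. For comparison, the classic rules (words starting with a vowel get "way" appended; otherwise the leading consonant cluster moves to the end followed by "ay") fit in a few lines of plain Python. This is a deterministic sketch for reference, not part of the LLM pipeline:

```python
def pig_latin_word(word: str) -> str:
    """Translate a single lowercase word using the classic Pig Latin rules."""
    vowels = "aeiou"
    if word[0] in vowels:
        return word + "way"
    # Move the leading consonant cluster to the end, then append "ay".
    for i, ch in enumerate(word):
        if ch in vowels:
            return word[i:] + word[:i] + "ay"
    return word + "ay"  # no vowels at all

print(" ".join(pig_latin_word(w) for w in "we are having fun".split()))
# -> eway areway avinghay unfay
```

A handful of lines of deterministic code beats a multi-gigabyte model at this particular task, which is a useful reminder to reach for an LLM only when the problem actually needs one.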
Conclusion
This blog went over the advantages of using a private LLM, and then implemented a basic Python program using two different frameworks, Foundry Local and Ollama. I hope you found it helpful. Ciao for now!