Tutorial emka

Run Local AI on Fedora 44 CPU Without Expensive GPU

Posted on May 25, 2026

Many people think you must buy a very expensive graphics card to run AI on your computer. But you can run a useful AI model on Fedora 44 without any GPU. If you use a program called Ollama with small models like Gemma 3 1B or Qwen 2.5 1.5B, even a small virtual machine with only 2 CPUs and 4 GB of RAM answers quickly: on our test machine it produced 12 to 25 tokens (pieces of words) every second. That is faster than most people read. This setup works well as a private coding helper, a tool to summarize long text files, or a chatbot to talk with. You only need a big GPU if you want to run huge models or serve many people at the same time. This simple guide will help you install and configure everything on your Fedora machine.
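Speeds like these are really tokens per second, the unit Ollama itself reports. For a rough comparison with reading speed, the sketch below converts tokens per second into words per minute. It assumes the common rule of thumb of about 0.75 English words per token; the real ratio depends on the model's tokenizer and the text, so treat the result as an estimate.

```shell
#!/bin/sh
# Rough conversion from generation speed (tokens/s) to words per minute.
# ASSUMPTION: about 0.75 English words per token (a rule of thumb, not exact).
tokens_to_wpm() {
  tokens_per_sec=$1
  words_per_token=${2:-0.75}
  # awk does the floating-point math; +0.5 rounds to the nearest whole word.
  awk -v t="$tokens_per_sec" -v w="$words_per_token" \
    'BEGIN { printf "%d\n", t * w * 60 + 0.5 }'
}

for rate in 12 25; do
  echo "$rate tokens/s is about $(tokens_to_wpm "$rate") words per minute"
done
```

It prints 540 and 1125 words per minute for 12 and 25 tokens per second. Adult silent reading speed is often quoted at roughly 200 to 300 words per minute, so even the slower figure is comfortably ahead.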

To start, we must install Ollama on our Fedora 44 system. The Ollama developers provide an install script that detects your CPU architecture (and any supported GPU), downloads the matching build, and sets everything up for you. Just open your terminal and run the command below. It uses curl to download the script and pipes it to sh. If you prefer to read a script before running it, save it first with curl -fsSL https://ollama.com/install.sh -o install.sh, look it over, and then run sh install.sh.

curl -fsSL https://ollama.com/install.sh | sh

After the installation finishes, the program is available at /usr/local/bin/ollama. The script also creates a system user named ollama. This is good for security because the AI server does not run as the root user. Finally, it registers and enables a systemd service called ollama.service, so the server starts automatically when your computer boots. You can check that the installation worked by typing this command to see the version.

ollama --version

You should also check if the background service is running correctly. You can use the systemctl command to see the status. If everything is fine, it will say that the service is active and running. The service only uses about 45 megabytes of memory when it first starts because no AI model is loaded yet. This is very lightweight and does not slow down your computer.

systemctl status ollama --no-pager | head -8
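The checks above can be rolled into one small script that prints OK or MISSING for each item. The function names (report, check_ollama_install) are our own; the commands inside are standard, and the last check assumes Ollama's /api/version endpoint answers on the default port 11434.

```shell
#!/bin/sh
# Post-install sanity check for Ollama on Fedora (a sketch).
# Prints one OK/MISSING line per check and never changes the system.

# report LABEL CMD [ARGS...]: run CMD quietly and say whether it succeeded.
report() {
  label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    printf 'OK       %s\n' "$label"
  else
    printf 'MISSING  %s\n' "$label"
  fi
}

check_ollama_install() {
  report "ollama binary on PATH" command -v ollama
  report "ollama system user"    id -u ollama
  report "ollama.service active" systemctl is-active --quiet ollama
  report "API answers on 11434"  curl -fsS http://localhost:11434/api/version
}

check_ollama_install
```

Save it as, for example, ollama-check.sh and run sh ollama-check.sh. A MISSING line for the service usually means it failed to start; journalctl -u ollama shows why.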

Now we need to choose an AI model that runs well on a CPU. If a model is too big, it either does not fit in RAM or generates text so slowly that it is painful to use. We want small models with between 1 billion and 3 billion parameters. The versions Ollama downloads by default are also quantized: their weights are stored with fewer bits (typically around 4 bits instead of 16), so they use much less memory but still give good answers. There are three very good models we can try on our Fedora machine.

The first model is gemma3:1b. It is very small, only about 815 megabytes, and needs around 2 gigabytes of RAM. It is the fastest of the three on a CPU and is great for quick chats and article summaries. The second model is qwen2.5:1.5b. It is about 986 megabytes and also needs about 2 gigabytes of RAM. This model is very good at writing code and handles many languages well. The third model is llama3.2:3b. It is bigger, about 2 gigabytes, and needs about 4 gigabytes of RAM. It gives the best and longest answers, but it is noticeably slower on a CPU.
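If you are unsure which of the three to start with, this sketch reads total RAM from /proc/meminfo and suggests one. The thresholds are assumptions based on the rough figures above, set slightly below 4 GB and 2 GB because a VM with "4 GB" usually reports a little less than that. It prefers gemma3:1b for speed; swap in qwen2.5:1.5b if you mainly want coding help. The function names are our own.

```shell
#!/bin/sh
# Suggest one of this guide's three models from total RAM (a sketch).
# ASSUMPTION: thresholds loosely follow the RAM figures quoted above.

# suggest_model MIB: print a model name for the given RAM size in MiB.
suggest_model() {
  mib=${1:-0}
  if [ "$mib" -ge 3500 ]; then
    echo "llama3.2:3b"
  elif [ "$mib" -ge 1800 ]; then
    echo "gemma3:1b"
  else
    echo "none"   # below ~2 GB even the smallest model is likely to struggle
  fi
}

# total_mib: MemTotal from /proc/meminfo (reported in kB), converted to MiB.
total_mib() {
  awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo
}

mib=$(total_mib)
echo "MemTotal: ${mib:-unknown} MiB, suggested model: $(suggest_model "$mib")"
```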

You can download all three models to your computer using the pull command. Ollama stores them in /usr/share/ollama/.ollama/models, a hidden folder in the ollama user's home directory. Running these commands will download the files from the internet.

ollama pull qwen2.5:1.5b
ollama pull gemma3:1b
ollama pull llama3.2:3b

After you download them, you can see the list of models you have on your computer. Use the list command. It will show the name of each model, how big it is, and when you downloaded it. Remember, do not download models that are bigger than 4 billion parameters if you only have a CPU. Big models will make you wait too long for one answer, and it will feel like the program is broken.

ollama list
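When you re-run a setup script, you may want to skip models that are already downloaded. This sketch reads the output of ollama list and only pulls what is missing. It assumes the first line of ollama list is a header and the model name is the first column, which matches the output described above; has_model and pull_missing are our own helper names.

```shell
#!/bin/sh
# Pull only the models that are not already present (a sketch).

# has_model NAME: read `ollama list` output on stdin; succeed if NAME is listed.
has_model() {
  awk -v want="$1" 'NR > 1 && $1 == want { found = 1 } END { exit !found }'
}

# pull_missing NAME...: pull each model unless `ollama list` already shows it.
pull_missing() {
  for model in "$@"; do
    if ollama list | has_model "$model"; then
      echo "already present: $model"
    else
      ollama pull "$model"
    fi
  done
}
```

Usage: . ./pull-models.sh && pull_missing gemma3:1b qwen2.5:1.5b llama3.2:3b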

We can now run a model and see how fast it is. The ollama run command has a --verbose flag that prints statistics at the end of the answer, including how many tokens the model generated per second. Let us ask the Qwen model a question. We can send the question on standard input using the echo command.

echo "Explain what SELinux does in one sentence." | ollama run qwen2.5:1.5b --verbose

The model will output the answer and then show some numbers. The most important number is called the eval rate. This is the number of tokens, which are like small parts of words, that the AI makes in one second. On our test machine with 2 CPUs, the Qwen model can do about 23 tokens per second. This is very fast and comfortable to read. If we test all three models, we can see which one is the best for our needs.
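To compare the three models without reading the numbers by hand, the sketch below keeps only the eval rate. It assumes the --verbose statistics are printed one per line in the form "eval rate: 23.10 tokens/s", and it merges stderr into the pipe because the statistics may be printed there rather than on stdout. eval_rate and bench are hypothetical helper names.

```shell
#!/bin/sh
# Extract generation speed from `ollama run --verbose` statistics (a sketch).

# eval_rate: read the statistics on stdin and print the tokens/s number from
# the "eval rate:" line, skipping the separate "prompt eval rate:" line.
eval_rate() {
  awk '/eval rate:/ && !/prompt eval rate:/ { print $(NF - 1) }'
}

# bench MODEL PROMPT: run one prompt and print "MODEL <tokens/s>".
bench() {
  rate=$(echo "$2" | ollama run "$1" --verbose 2>&1 | eval_rate)
  echo "$1 ${rate:-n/a}"
}
```

Usage: for m in gemma3:1b qwen2.5:1.5b llama3.2:3b; do bench "$m" "Explain what SELinux does in one sentence."; done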

Gemma 3 1B is the fastest: it does about 25 tokens per second and loads in only 2 seconds. Qwen 2.5 1.5B is also very fast at 23 tokens per second and loads almost instantly after the first time. Llama 3.2 3B is slower, at 11 tokens per second with a 6-second load time, but its answers are much better written. By default, Ollama keeps a model in RAM for 5 minutes after the last request. If you ask another question within that time, it answers immediately without loading the model again.
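You can also control loading from scripts. Ollama's API documentation describes two shortcuts: a /api/generate request with a model but no prompt loads that model into memory, and the same request with keep_alive set to 0 unloads it. The sketch wraps both; OLLAMA_URL and the function names are our own and are not read by Ollama. Model names contain no quotes, so building the JSON with printf is safe here.

```shell
#!/bin/sh
# Warm up or unload a model through Ollama's native API (a sketch).
OLLAMA_URL=${OLLAMA_URL:-http://localhost:11434}   # our own variable

# keepalive_body MODEL KEEP_ALIVE: JSON body for /api/generate with no prompt.
# A duration such as "15m" is sent as a string; 0 (unload now) as a number.
keepalive_body() {
  if [ "$2" = 0 ]; then
    printf '{"model": "%s", "keep_alive": 0}\n' "$1"
  else
    printf '{"model": "%s", "keep_alive": "%s"}\n' "$1" "$2"
  fi
}

# preload MODEL [DURATION]: load the model now and keep it (default 15m).
preload() {
  curl -s "$OLLAMA_URL/api/generate" -d "$(keepalive_body "$1" "${2:-15m}")" >/dev/null
}

# unload MODEL: free the model's RAM immediately.
unload() {
  curl -s "$OLLAMA_URL/api/generate" -d "$(keepalive_body "$1" 0)" >/dev/null
}
```

For example, run preload qwen2.5:1.5b 30m before a coding session and unload qwen2.5:1.5b when you need the RAM back. ollama ps lists the models that are currently loaded.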

One of the best things about Ollama is its web API. This means other programs on your computer, like code editors or scripts, can talk to the AI. Ollama has its own native API, and it also offers an OpenAI-compatible API that accepts the same request format as OpenAI's chat completions endpoint. This is very useful because many developer tools are made to talk to OpenAI: you can often just change the server address in your tool to point to your local Ollama.

Let us test the native Ollama API first. We use curl to send a JSON message to our local server on port 11434, ask a simple math question, and pipe the reply to python3 -m json.tool so it is printed in a readable format.

curl -s http://localhost:11434/api/chat -d '{ "model": "qwen2.5:1.5b", "messages": [{"role": "user", "content": "What is 2+2? Answer in one word."}], "stream": false }' | python3 -m json.tool

The server will send back a JSON response. The answer should be something like “4”, and the response also includes statistics such as how long it took to generate the answer. This is very easy to use if you are writing your own scripts.
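Writing JSON by hand in the shell breaks as soon as a prompt contains quotes or newlines. A safer sketch lets python3, which is normally installed on Fedora, build the request and pull the reply text out of the message.content field of the response. chat_payload and ask are our own helper names.

```shell
#!/bin/sh
# Ask the native /api/chat endpoint a question from a script (a sketch).
# python3 builds and parses the JSON, so quotes and newlines in prompts are safe.

# chat_payload MODEL PROMPT: print a non-streaming /api/chat request body.
chat_payload() {
  python3 -c '
import json, sys
print(json.dumps({
    "model": sys.argv[1],
    "messages": [{"role": "user", "content": sys.argv[2]}],
    "stream": False,
}))' "$1" "$2"
}

# ask MODEL PROMPT: send the request and print only the assistant reply text.
ask() {
  chat_payload "$1" "$2" |
    curl -s http://localhost:11434/api/chat -d @- |
    python3 -c 'import json, sys; print(json.load(sys.stdin)["message"]["content"])'
}
```

Usage: ask qwen2.5:1.5b 'What is 2+2? Answer in one word.'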

Now let us test the OpenAI API format. This is important if you want to use plugins in editors like VS Code or Vim. We send a request to a different address ending in /v1/chat/completions.

curl -s http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "qwen2.5:1.5b", "messages": [{"role": "user", "content": "Hello in 3 words."}] }' | python3 -m json.tool

This will return a response in the same format OpenAI uses. Because of this, the official OpenAI client libraries for Python and JavaScript can talk to your local Fedora server. You just have to set the base address to http://localhost:11434/v1/ and use any placeholder text as the API key: the libraries require a key to be set, but Ollama does not check it.
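For tools built on the official OpenAI SDKs, you often do not need to touch their code: the Python and Node.js SDKs read the base URL and key from the OPENAI_BASE_URL and OPENAI_API_KEY environment variables when none are passed explicitly. The sketch sets both and adds a helper, openai_content (our own name), that prints just the reply text (choices[0].message.content) from a /v1/chat/completions response. The key value ollama is an arbitrary placeholder.

```shell
#!/bin/sh
# Point OpenAI-SDK-based tools at the local server (a sketch).
# The key only has to be non-empty; Ollama ignores its value.
export OPENAI_BASE_URL=http://localhost:11434/v1/
export OPENAI_API_KEY=ollama

# openai_content: read a /v1/chat/completions response on stdin and print
# the first choice's message text.
openai_content() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}
```

Pipe the curl command above into openai_content to see only the text. Tools that keep their own connection settings may ignore these variables.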

When you install Ollama, it only accepts requests from the same computer, because it listens on 127.0.0.1 (localhost). If you want to share your AI with other computers in your home network, you have to change this setting. We can do this safely by making a systemd override file. This keeps our changes from being overwritten when we update Ollama in the future.

sudo systemctl edit ollama.service

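A text editor will open. Write these lines under the [Service] section. OLLAMA_HOST=0.0.0.0:11434 tells Ollama to listen on all network interfaces. OLLAMA_ORIGINS=* allows browser-based apps served from any website to call the API; it controls cross-origin browser requests, not which computers can connect. Only web front-ends need it, while curl, scripts, and editor plugins work without it, so leave that line out if you do not use a browser UI.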

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_ORIGINS=*"

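After saving the file and closing the editor, we must tell systemd to load the new configuration and restart the Ollama service. Then we open the port in the Fedora firewall, but only for our home network, so random people on public WiFi cannot reach our AI. Note that adding the port to firewalld's trusted zone has no real effect: that zone already accepts all traffic, and it only applies to interfaces or addresses you have assigned to it. A rich rule that accepts port 11434 only from your local subnet is more precise; replace 192.168.1.0/24 with your own network. Also keep in mind that on Fedora Workstation the default FedoraWorkstation zone already allows ports 1025 to 65535, so on that edition Ollama is reachable as soon as it listens on 0.0.0.0.

sudo systemctl daemon-reload
sudo systemctl restart ollama
sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="192.168.1.0/24" port port="11434" protocol="tcp" accept'
sudo firewall-cmd --reload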
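To confirm the override actually reached the running service, ask systemd which environment it passes to Ollama. The unit_env helper below is our own, and it assumes systemctl show prints the variables on one line separated by spaces, so it is only a rough parser that will not handle values containing spaces.

```shell
#!/bin/sh
# Check that the systemd override was picked up (a sketch).

# unit_env NAME: read `systemctl show -p Environment` output on stdin and
# print the value of NAME (nothing if NAME is not set).
unit_env() {
  sed 's/^Environment=//' | tr ' ' '\n' |
    awk -F= -v key="$1" '$1 == key { sub(/^[^=]*=/, ""); print }'
}
```

On the Fedora host, systemctl show ollama -p Environment | unit_env OLLAMA_HOST should print 0.0.0.0:11434, and ss -ltn | grep 11434 should show the server listening on all addresses rather than 127.0.0.1. From another computer on your network, curl -s http://192.168.1.50:11434/api/version should return the version as JSON; replace the address with your Fedora host's IP.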

You must be careful, because Ollama has no password protection. If you open this port, anyone on that network can use your AI and keep your CPU very busy. If you want to reach it over the internet, put a reverse proxy such as Nginx in front of it to require a password, and keep Ollama itself listening only on localhost.
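Here is a minimal sketch of that setup using Nginx basic authentication. It assumes Ollama stays on its default 127.0.0.1:11434 (so drop the OLLAMA_HOST=0.0.0.0 override), and the file names and port are our own choices. We picked 8443 because, as far as we know, Fedora's SELinux policy already labels it as an HTTP port Nginx may bind to. The Host header line follows Ollama's FAQ, since Ollama may reject requests with unfamiliar Host headers.

```shell
#!/bin/sh
# Put Ollama behind Nginx with HTTP basic auth (a sketch, not hardened).

# write_ollama_proxy_conf DIR: write DIR/ollama-proxy.conf.
write_ollama_proxy_conf() {
  cat > "$1/ollama-proxy.conf" <<'EOF'
server {
    listen 8443;
    location / {
        auth_basic           "Ollama";
        auth_basic_user_file /etc/nginx/ollama.htpasswd;
        proxy_pass           http://127.0.0.1:11434;
        # Present the request to Ollama as if it came to localhost.
        proxy_set_header     Host localhost:11434;
    }
}
EOF
}
```

Install the tools with sudo dnf install nginx httpd-tools, create a user with sudo htpasswd -c /etc/nginx/ollama.htpasswd yourname, write the file with sudo sh -c '. ./ollama-proxy.sh && write_ollama_proxy_conf /etc/nginx/conf.d', allow Nginx to make network connections under SELinux with sudo setsebool -P httpd_can_network_connect 1, open port 8443 instead of 11434 in the firewall, and start it with sudo systemctl enable --now nginx. This listens on plain HTTP, so the password can be sniffed on the network; add TLS before exposing it beyond a trusted LAN.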

We can make the CPU run the AI more smoothly by setting a few more environment variables. Add these lines to the same override file we edited before, under the existing [Service] header. These settings control how many requests run at once, how many models stay in RAM, and how long a model stays loaded.

[Service]
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
Environment="OLLAMA_KEEP_ALIVE=15m"

Let us explain what these settings do. The first setting, OLLAMA_NUM_PARALLEL=1, tells Ollama to process only one request at a time per model; other requests wait in a queue. If two people ask questions at the same time on a CPU, both answers would get very slow, so it is much better to handle them one after the other. The second setting, OLLAMA_MAX_LOADED_MODELS=1, keeps only one model in RAM at a time: loading a different model unloads the current one. This is important because our CPU machine does not have enough memory to hold several models at once.

The third setting, OLLAMA_KEEP_ALIVE=15m, keeps the model loaded in RAM for 15 minutes after the last request instead of the default 5 minutes. If you are writing code and asking the AI a question every 10 minutes, raising this to 15 minutes means you do not have to wait for the model to load from disk again. After you save these settings, remember to reload systemd and restart Ollama.
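If you prefer not to use the interactive editor, for example in a provisioning script, you can write the same settings to a separate drop-in file. systemd merges every .conf file in /etc/systemd/system/ollama.service.d/ into the unit, just like the override.conf that systemctl edit creates there, and these files also survive Ollama updates. The file name cpu-tuning.conf and the function name are our own choices.

```shell
#!/bin/sh
# Write the CPU tuning settings as their own systemd drop-in (a sketch).

# write_tuning_dropin DIR: create DIR if needed and write DIR/cpu-tuning.conf.
write_tuning_dropin() {
  mkdir -p "$1"
  cat > "$1/cpu-tuning.conf" <<'EOF'
[Service]
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
Environment="OLLAMA_KEEP_ALIVE=15m"
EOF
}
```

Run sudo sh -c '. ./ollama-tuning.sh && write_tuning_dropin /etc/systemd/system/ollama.service.d', then sudo systemctl daemon-reload and sudo systemctl restart ollama. systemctl cat ollama shows the unit together with every drop-in. Drop-ins are read in alphabetical order, so if override.conf sets the same variable, its value wins.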

Although CPU AI is very useful, it is not good for everything. You should not use a CPU for very big models, such as those with 70 billion parameters; they need powerful graphics cards with lots of VRAM. A CPU is also a poor fit for an app where hundreds of people use the AI at the same time, because requests will pile up immediately. For heavy tasks like processing thousands of documents or real-time voice applications, you really need a supported GPU, such as an NVIDIA card with CUDA or an AMD card with ROCm.

Sometimes things do not work and you might get errors. If you see an error mentioning “pull model manifest” or “dial tcp”, your computer cannot reach the Ollama registry on the internet. This is usually a DNS problem. You can check whether your system can resolve the registry's hostname with the command below. nslookup comes from the bind-utils package, so install it with sudo dnf install bind-utils if it is missing.

nslookup registry.ollama.ai
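If you would rather not install nslookup, getent works on any Fedora system because it is part of glibc, and it uses the same resolver as other programs. This sketch checks name resolution first and then whether an HTTPS connection to the registry succeeds at all, which helps tell a DNS problem apart from a proxy or firewall problem. The helper names are our own.

```shell
#!/bin/sh
# Step-by-step connectivity check for model pulls (a sketch).

# resolves HOST: succeed if the system resolver returns an address for HOST.
resolves() {
  getent hosts "$1" >/dev/null
}

check_registry() {
  host=registry.ollama.ai
  if ! resolves "$host"; then
    echo "DNS: cannot resolve $host (see: resolvectl status)"
    return 1
  fi
  # Any HTTP answer counts here; we only test that a connection is possible.
  if ! curl -sS -o /dev/null --max-time 10 "https://$host/"; then
    echo "HTTPS: $host resolves but the connection failed (proxy or firewall?)"
    return 1
  fi
  echo "OK: $host resolves and answers over HTTPS"
}
```

Usage: . ./registry-check.sh && check_registry. resolvectl status shows which DNS servers systemd-resolved is using.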

If the nslookup test fails, you need to fix the DNS settings on your Fedora host. Another common issue is the Ollama service restarting over and over in a loop. Run journalctl -u ollama to see why it stopped; one frequent cause is a full disk. Ollama stores models under /usr/share/ollama/.ollama/models, and the three models in this guide take up about 4 gigabytes together, so check that you have enough free space.

df -h /usr/share/ollama
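Before pulling a large model you can check the free space in a script. This sketch uses GNU df's --output=avail option to get the free space in MiB and compares it with what you need. The helper names are our own, and the 2500 MiB figure in the usage line is our rough allowance for the 2 GB llama3.2:3b download plus some headroom.

```shell
#!/bin/sh
# Check free disk space before pulling a model (a sketch, GNU df required).

# free_mib PATH: free space on the filesystem holding PATH, in MiB.
free_mib() {
  df --output=avail -BM "$1" | tail -n 1 | tr -dc '0-9'
}

# enough_space PATH NEED_MIB: succeed if at least NEED_MIB MiB are free.
enough_space() {
  free=$(free_mib "$1")
  [ "${free:-0}" -ge "$2" ]
}
```

Usage: enough_space /usr/share/ollama 2500 && ollama pull llama3.2:3b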

If your computer has only 2 gigabytes of RAM and you try to run the Llama 3.2 3B model, the program might crash because it runs out of memory. To fix this, use a smaller model like Gemma 3 1B, or add more RAM to your virtual machine. You can also reduce memory use by requesting a smaller context window with the num_ctx option in your API call, at the cost of the model remembering less of the conversation.

Lastly, if the AI gives you strange answers or repeats the same words over and over, you can change the options in your API call. A lower temperature, like 0.3, makes the answers more focused and predictable. A repeat_penalty above 1 discourages the model from looping, and num_predict caps the length of the answer in tokens (80 in this example).

curl -s http://localhost:11434/api/chat -d '{ "model": "qwen2.5:1.5b", "messages": [{"role": "user", "content": "Summarize Linux in 50 words."}], "stream": false, "options": {"temperature": 0.3, "repeat_penalty": 1.1, "num_predict": 80} }'
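Because num_predict stops generation after 80 tokens, a summary can be cut off mid-sentence. In recent Ollama versions a non-streaming /api/chat response includes a done_reason field, which we expect to be "length" when the token limit was reached and "stop" when the model finished on its own. This sketch (reply_and_reason is our own name) prints the reply and warns on stderr in the first case; if your version does not send done_reason, it simply prints the reply.

```shell
#!/bin/sh
# Print the reply and warn when num_predict cut it short (a sketch).

# reply_and_reason: read a non-streaming /api/chat response on stdin; print
# the reply text, plus a warning on stderr if done_reason is "length".
reply_and_reason() {
  python3 -c '
import json, sys
resp = json.load(sys.stdin)
print(resp["message"]["content"])
if resp.get("done_reason") == "length":
    print("warning: hit the num_predict limit; raise it for a complete answer", file=sys.stderr)
'
}
```

Usage: pipe the curl command above into reply_and_reason.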

Running AI on a CPU with Fedora 44 is not a replacement for massive commercial models like GPT-4. But it is a great, private, and free way to handle daily tasks without sending your private data to big companies. It fits nicely into the Fedora system and gives you full control over your machine and your data.
