Tutorial emka

Run Local AI on a Fedora 44 CPU Without an Expensive GPU

Posted on May 25, 2026

Many people think you must buy an expensive graphics card to run AI on your own computer. In fact, you can run a capable AI model on Fedora 44 without any GPU. If you use Ollama with a small model like Gemma 3 1B or Qwen 2.5 1.5B, a small virtual machine with only 2 CPUs and 4 GB of RAM can generate about 12 to 25 tokens every second. A token is a small piece of a word, so this is still faster than most people read. This setup works well as a private coding helper, a tool to summarize long text files, or a chatbot to talk with. You only need a big GPU if you want to run huge models or have many people using the AI at the same time. This guide shows how to install and configure everything on your Fedora machine.

To start, we must install Ollama on our Fedora 44 system. The Ollama project provides an install script that detects your CPU architecture and downloads the matching build. You just need to open your terminal and run one command. This command uses curl to fetch the script and pipes it to the shell.

curl -fsSL https://ollama.com/install.sh | sh

After the installation finishes, the program is installed as /usr/local/bin/ollama. The script also creates a dedicated user on your computer named ollama. This is good for security because the AI program does not run as the root user. Finally, it sets up a systemd service called ollama so the AI server starts automatically when your computer turns on. You can check if the installation worked by printing the version.

ollama --version

You should also check that the background service is running correctly. Use the systemctl command to see its status. If everything is fine, it will say that the service is active (running). The service uses only about 45 megabytes of memory when it first starts because no AI model is loaded yet, so it does not slow down your computer.

systemctl status ollama --no-pager | head -8

Now we need to choose an AI model that runs well on a CPU. If a model is too big, the CPU will run at full load and the answers will come out extremely slowly. We want small models with between 1 billion and 3 billion parameters. These models are also quantized, which means their weights are stored at lower precision so the files are smaller. They use less memory but still give smart answers. There are three very good models we can try on our Fedora machine.

The first model is gemma3:1b. It is very small, only about 815 megabytes, and needs around 2 gigabytes of RAM to run. It is the fastest model on CPU and is great for quick chats and summaries of articles. The second model is qwen2.5:1.5b. It is about 986 megabytes and also needs 2 gigabytes of RAM. This model is very good at writing code and handles many human languages well. The third model is llama3.2:3b. It is bigger, about 2 gigabytes, and needs 4 gigabytes of RAM. It gives the most detailed answers, but it is slower on the CPU.
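
The sizing advice above can be written down as a small lookup. This is only a sketch in Python and not part of Ollama: the function name pick_model and the table layout are made up for illustration, and the numbers are the approximate figures quoted in this article.

```python
from typing import Optional

# Approximate figures quoted in this article:
# (model tag, download size in GB, RAM needed in GB), smallest first.
MODELS = [
    ("gemma3:1b", 0.815, 2.0),
    ("qwen2.5:1.5b", 0.986, 2.0),
    ("llama3.2:3b", 2.0, 4.0),
]


def pick_model(ram_gb: float) -> Optional[str]:
    """Return the last (largest) listed model whose RAM need fits, or None."""
    fitting = [tag for tag, _size_gb, need_gb in MODELS if need_gb <= ram_gb]
    return fitting[-1] if fitting else None
```

With 4 GB of RAM this picks llama3.2:3b, and with 2 GB it falls back to qwen2.5:1.5b. You may still prefer gemma3:1b when speed matters more than answer quality.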

You can download all three models to your computer using the pull command. Ollama downloads them from the internet and stores them in a hidden .ollama folder inside the /usr/share/ollama directory.

ollama pull qwen2.5:1.5b
ollama pull gemma3:1b
ollama pull llama3.2:3b

After you download them, you can see the models you have on your computer with the list command. It shows the name of each model, how big it is, and when it was last modified. Remember, do not download models that are bigger than about 4 billion parameters if you only have a CPU. Big models make you wait so long for one answer that it feels like the program is broken.

ollama list

We can now run a model and see how fast it is. Ollama has a flag called --verbose. When you use this flag, the program prints statistics at the end of the answer, including how many tokens it generated per second. Let us ask the Qwen model a question. We can send the question using the echo command.

echo "Explain what SELinux does in one sentence." | ollama run qwen2.5:1.5b --verbose

The model will output the answer and then show some numbers. The most important number is called the eval rate. This is the number of tokens, which are small pieces of words, that the AI generates in one second. On our test machine with 2 CPUs, the Qwen model reaches about 23 tokens per second. This is fast and comfortable to read. If we test all three models, we can see which one is best for our needs.

Gemma 3 1B is the fastest at about 25 tokens per second, and it loads in only 2 seconds. Qwen 2.5 1.5B is also very fast at 23 tokens per second and loads almost instantly after the first time. Llama 3.2 3B is slower, running at 11 tokens per second and taking 6 seconds to load, but its answers are much better written. By default, Ollama keeps the model in RAM for 5 minutes after you stop talking to it. This means if you ask another question quickly, it answers immediately without loading the model again.
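
To put the eval rates measured above in context, you can convert them to an approximate reading pace. This is a rough sketch: the ratio of 0.75 English words per token is a common rule of thumb and an assumption here, not a number reported by Ollama, and the function name is made up for illustration.

```python
def words_per_minute(tokens_per_second: float, words_per_token: float = 0.75) -> float:
    """Convert a token rate into words per minute using an assumed words-per-token ratio."""
    return tokens_per_second * words_per_token * 60


# Rates measured in this article, in tokens per second.
for name, rate in [("gemma3:1b", 25), ("qwen2.5:1.5b", 23), ("llama3.2:3b", 11)]:
    print(name, round(words_per_minute(rate)), "words per minute (approx.)")
```

Under that assumption even the slowest of the three models writes faster than the few hundred words per minute that most people read.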

One of the coolest things about Ollama is that it has an HTTP API. This means other programs on your computer, like code editors or scripts, can talk to the AI. Ollama has its own native API, and it also offers an OpenAI-compatible API. This is very useful because many developer tools are made to talk to OpenAI. You can just change the server address in your tool to point to your local Ollama.

Let us test the native Ollama API first. We can use curl to send a JSON message to our local server on port 11434. We will ask a simple math question and pretty-print the JSON reply with Python's json.tool module.

curl -s http://localhost:11434/api/chat -d '{ "model": "qwen2.5:1.5b", "messages": [{"role": "user", "content": "What is 2+2? Answer in one word."}], "stream": false }' | python3 -m json.tool

The server will send back a JSON response. It will show the answer, which is “4”, and it will also show all the statistics like how long it took to generate the answer. This is very easy to use if you are writing your own scripts.
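
If you write your own scripts, the same request can be sent from Python with only the standard library. This is a minimal sketch, assuming the server runs on the default local address used above; the function names are made up for illustration. The fields eval_count and eval_duration (in nanoseconds) are the statistics Ollama includes in the reply, and dividing them gives the same eval rate that --verbose prints.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # default local address from this article


def build_chat_payload(model: str, prompt: str) -> dict:
    """Build a non-streaming request body for the /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }


def eval_rate(response: dict) -> float:
    """Tokens generated per second; eval_duration is given in nanoseconds."""
    seconds = response["eval_duration"] / 1e9
    return response["eval_count"] / seconds


def ask(model: str, prompt: str) -> dict:
    """Send one question to the local server and return the decoded JSON reply."""
    body = json.dumps(build_chat_payload(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


# Example use (needs a running Ollama server):
#   reply = ask("qwen2.5:1.5b", "What is 2+2? Answer in one word.")
#   print(reply["message"]["content"], eval_rate(reply))
```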

Now let us test the OpenAI API format. This is important if you want to use plugins in editors like VS Code or Vim. We send a request to a different address ending in /v1/chat/completions.

curl -s http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "qwen2.5:1.5b", "messages": [{"role": "user", "content": "Hello in 3 words."}] }' | python3 -m json.tool

This returns a response in the same format that OpenAI sends. Because of this, the official OpenAI client libraries for Python or JavaScript can talk to your local Fedora server. You just have to set the base address to http://localhost:11434/v1/ and pass any placeholder text as the API key, because Ollama does not check for a real key.
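
To show what such a tool does under the hood, here is a small sketch that calls the OpenAI-style endpoint with only the Python standard library. The key "ollama" is an arbitrary placeholder and the function names are made up for illustration. The reply is read from choices[0].message.content, which is where the OpenAI chat format puts the answer.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # local address from this article
API_KEY = "ollama"  # arbitrary placeholder; Ollama does not check the key


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + API_KEY,
        },
    )


def first_reply(response: dict) -> str:
    """Extract the answer text from an OpenAI-style chat completion."""
    return response["choices"][0]["message"]["content"]


def chat(model: str, prompt: str) -> str:
    """Send the request to the local server and return the answer text."""
    with urllib.request.urlopen(build_request(model, prompt)) as reply:
        return first_reply(json.load(reply))


# Example use (needs a running Ollama server):
#   print(chat("qwen2.5:1.5b", "Hello in 3 words."))
```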

When you install Ollama, it listens only on localhost, so only programs on the same computer can reach it. If you want to share your AI with other computers in your home network, you have to change this setting. We can do this safely by making a systemd override file. This ensures our changes do not get overwritten when we update Ollama in the future.

sudo systemctl edit ollama.service

A text editor will open. Write these lines in it. The first setting tells Ollama to listen on all network addresses, and the second allows browser-based requests from any origin.

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_ORIGINS=*"

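After saving the file and closing the editor, we must tell systemd to load the new configuration and restart the Ollama service. We also need to open the port in the Fedora firewall so other computers can reach it. We open it only in the trusted zone so random people on public WiFi cannot access our AI. Keep in mind that a firewalld zone applies only to the network interfaces or source addresses assigned to it, so your home network must be bound to the trusted zone for this rule to take effect.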

sudo systemctl daemon-reload
sudo systemctl restart ollama
sudo firewall-cmd --permanent --zone=trusted --add-port=11434/tcp
sudo firewall-cmd --reload

You must be careful because Ollama does not have any password protection. If you open this port, anyone on that network can use your AI and keep your CPU busy. If you want to expose it to the internet, put a reverse proxy such as Nginx in front of it and add authentication there.

We can tune how Ollama uses the CPU and RAM by setting a few more environment variables. Add them to the same override file we edited before, under the same [Service] header.

[Service]
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
Environment="OLLAMA_KEEP_ALIVE=15m"

Let us explain what these settings do. The first setting, OLLAMA_NUM_PARALLEL=1, tells Ollama to process only one request at a time. If two people ask questions at the same time on a CPU, generation gets very slow for both of them. It is much better to answer them one after the other. The second setting, OLLAMA_MAX_LOADED_MODELS=1, makes sure only one model stays in RAM. This is important because our small machine does not have enough memory to hold several models at the same time.

The third setting, OLLAMA_KEEP_ALIVE=15m, tells Ollama to keep the model loaded in RAM for 15 minutes after you stop using it. The default is 5 minutes. If you are writing code and asking the AI a question every 10 minutes, raising this to 15 minutes means you do not have to wait for the model to load from disk again. After you save these settings, remember to reload systemd and restart Ollama.
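
If you applied both the network settings and the tuning settings, they all live in one override file under a single [Service] header. The sketch below only renders that text so you can compare it with your own file. It does not write anything to disk, and the function name render_override is made up for illustration.

```python
def render_override(settings: dict) -> str:
    """Render a systemd drop-in: one [Service] header, one Environment line per setting."""
    lines = ["[Service]"]
    for name, value in settings.items():
        lines.append(f'Environment="{name}={value}"')
    return "\n".join(lines) + "\n"


# All values below are the ones used in this article.
SETTINGS = {
    "OLLAMA_HOST": "0.0.0.0:11434",
    "OLLAMA_ORIGINS": "*",
    "OLLAMA_NUM_PARALLEL": "1",
    "OLLAMA_MAX_LOADED_MODELS": "1",
    "OLLAMA_KEEP_ALIVE": "15m",
}

print(render_override(SETTINGS), end="")
```

Leave out OLLAMA_HOST and OLLAMA_ORIGINS if you only use the AI from the same computer.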

Although CPU AI is very useful, it is not good for everything. You should not use a CPU if you want to run very big models like those with 70 billion parameters. Those models need powerful graphics cards with lots of VRAM. You also should not use a CPU if you want to build an app where hundreds of people use the AI at the same time, because the CPU will be overloaded immediately. For heavy tasks like processing thousands of documents or real-time voice applications, you really need a GPU, such as an NVIDIA card with CUDA support.

Sometimes things do not work, and you might get errors. If you see an error mentioning “pull model manifest” or “dial tcp”, it means your computer cannot reach the Ollama registry on the internet. This is usually a DNS problem. You can check whether your machine can resolve the registry's host name by running a simple test command in your terminal.

nslookup registry.ollama.ai
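
The same lookup can be done from a script, for example before an automated pull. This is a small sketch with a made-up function name. The resolver parameter exists only so the check can be tried without a network connection; by default it uses the system resolver through socket.getaddrinfo.

```python
import socket


def can_resolve(host: str, resolver=socket.getaddrinfo) -> bool:
    """Return True if the host name resolves to at least one address."""
    try:
        return len(resolver(host, 443)) > 0
    except OSError:  # socket.gaierror (lookup failure) is a subclass of OSError
        return False


# Example use (needs working DNS):
#   print(can_resolve("registry.ollama.ai"))
```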

If the lookup fails, you need to fix the DNS settings on your Fedora host. Another common issue is when the Ollama service keeps restarting in a loop. This often happens because your disk is full. Ollama stores models in a folder under /usr/share/ollama. The three models from this guide take up about 4 gigabytes of space together. You should check if you have enough free space on that disk.

df -h /usr/share/ollama
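
A script can make the same check before pulling models. This is a sketch with made-up function names; the 1 GiB safety margin is an assumption for illustration, not a requirement stated by Ollama.

```python
import shutil

GIB = 1024 ** 3  # bytes in one gibibyte


def free_bytes_at(path: str) -> int:
    """Free space, in bytes, on the filesystem that holds path."""
    return shutil.disk_usage(path).free


def has_room(free_bytes: int, needed_gib: float, margin_gib: float = 1.0) -> bool:
    """True if free space covers the models plus an assumed safety margin."""
    return free_bytes >= (needed_gib + margin_gib) * GIB


# Example use on the machine from this guide (about 4 GB for three models):
#   print(has_room(free_bytes_at("/usr/share/ollama"), 4.0))
```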

If your computer has only 2 gigabytes of RAM and you try to run the Llama 3.2 3B model, the program might crash because it runs out of memory. To fix this, use a smaller model like Gemma 3 1B, or give your virtual machine more RAM. A setting called OLLAMA_LOW_VRAM=1 in the service file is sometimes suggested to make the system use less memory at the cost of some speed, but confirm that your Ollama version supports it before relying on it.
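
Before loading a model you can check how much memory is actually available. On Linux this is the MemAvailable line in /proc/meminfo, reported in kB. The sketch below parses that text; the function name is made up for illustration.

```python
from typing import Optional


def available_gib(meminfo_text: str) -> Optional[float]:
    """Return MemAvailable in GiB from /proc/meminfo text, or None if the line is missing."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kib = int(line.split()[1])  # second column is the value in kB
            return kib / (1024 ** 2)
    return None


# Example use on a Linux host:
#   with open("/proc/meminfo") as f:
#       print(available_gib(f.read()))
```

If the result is well below the 4 gigabytes that llama3.2:3b needs, pick one of the smaller models instead.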

Lastly, if the AI gives you weird answers or repeats the same words over and over, you can change the options in your API call. Set the temperature lower, for example 0.3, to make the answers more focused and predictable. You can also add a repetition penalty to stop it from looping, and use num_predict to limit how many tokens it generates.

curl -s http://localhost:11434/api/chat -d '{ "model": "qwen2.5:1.5b", "messages": [{"role": "user", "content": "Summarize Linux in 50 words."}], "stream": false, "options": {"temperature": 0.3, "repeat_penalty": 1.1, "num_predict": 80} }'

Running AI on your CPU on Fedora 44 is not a replacement for massive commercial models like GPT-4. But it is a private and free way to do daily tasks without sending your data to big companies. It integrates nicely with the Fedora system and gives you full control over your machine and your data.
