Run Local AI on a Fedora 44 CPU Without an Expensive GPU

Many people think you need an expensive graphics card to run AI on your own computer. In fact, you can run a capable model on Fedora 44 without any GPU. With Ollama and a small model such as Gemma 3 1B or Qwen 2.5 1.5B, a small virtual machine with only 2 CPUs and 4 GB of RAM can generate roughly 11 to 25 tokens per second, depending on the model. That is faster than most people read. This setup works well as a private coding helper, a tool to summarize long text files, or a chatbot. You only need a GPU if you want to run huge models or serve many people at the same time. This guide walks through installing and configuring everything on your Fedora machine.

To start, we must install Ollama on our Fedora 44 system. The Ollama project provides an install script that detects your CPU architecture and downloads the matching build. You just need to open your terminal and run one command, which uses curl to fetch the script and pipes it to sh.

curl -fsSL https://ollama.com/install.sh | sh

After the installation finishes, the program is installed at /usr/local/bin/ollama. The script also creates a dedicated system user named ollama. This is good for security because the AI program does not run as the root user. Finally, it registers and starts a systemd service so that Ollama starts automatically when your computer boots. You can check that the installation worked by printing the version.

ollama --version

You should also check that the background service is running correctly. Use the systemctl command to see its status. If everything is fine, it reports the service as active (running). The service uses only about 45 megabytes of memory right after it starts because no AI model is loaded yet, so it does not slow down your computer.

systemctl status ollama --no-pager | head -8

Now we need to choose an AI model that runs well on a CPU. If a model is too big, the CPU will sit at full load and answers will come extremely slowly. We want small models that have between 1 billion and 3 billion parameters. These models are also quantized, which means their weights are stored at lower precision so the files are smaller. They use less memory but still give good answers. There are three good models we can try on our Fedora machine.

The first model is gemma3:1b. It is very small, only about 815 megabytes, and needs around 2 gigabytes of RAM to run. It is the fastest model on CPU and is great for quick chats and summarizing articles. The second model is qwen2.5:1.5b. It is about 986 megabytes and also needs 2 gigabytes of RAM. This model is good at writing code and understands different languages well. The third model is llama3.2:3b. It is bigger, about 2 gigabytes, and needs 4 gigabytes of RAM. It gives the best and longest answers, but it is slower on the CPU.
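The comparison above can be written down as a small lookup. This is only a sketch: the figures are copied from the paragraph (with "about 2 gigabytes" rounded to 2000 MB), the helper name models_that_fit is made up for illustration, and it ignores the memory the rest of the system needs.

```python
# Figures copied from the model descriptions above.
# download_mb is the approximate download size, ram_gb the stated RAM need.
MODELS = [
    {"name": "gemma3:1b", "download_mb": 815, "ram_gb": 2},
    {"name": "qwen2.5:1.5b", "download_mb": 986, "ram_gb": 2},
    {"name": "llama3.2:3b", "download_mb": 2000, "ram_gb": 4},
]


def models_that_fit(ram_gb: float) -> list[str]:
    """Return the names of models whose stated RAM need is at most ram_gb."""
    return [m["name"] for m in MODELS if m["ram_gb"] <= ram_gb]


print(models_that_fit(2))
print(models_that_fit(4))
```

With 2 GB only the two smaller models are listed. With 4 GB all three are.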

You can download all three models to your computer using the pull command. Ollama stores them in a hidden folder inside the /usr/share/ollama directory.

ollama pull qwen2.5:1.5b
ollama pull gemma3:1b
ollama pull llama3.2:3b

After you download them, you can list the models on your computer with the list command. It shows the name of each model, its size, and when it was last modified. Remember, do not download models bigger than about 4 billion parameters if you only have a CPU. Big models make you wait so long for one answer that it will feel like the program is broken.

ollama list

We can now run a model and see how fast it is. Ollama has a flag called --verbose. With this flag, the program prints statistics after each answer, including how many tokens it generated per second. Let us ask the Qwen model a question by piping it in with the echo command.

echo "Explain what SELinux does in one sentence." | ollama run qwen2.5:1.5b --verbose

The model will output the answer and then show some numbers. The most important number is called the eval rate. This is the number of tokens, which are like small parts of words, that the AI makes in one second. On our test machine with 2 CPUs, the Qwen model can do about 23 tokens per second. This is very fast and comfortable to read. If we test all three models, we can see which one is the best for our needs.

Gemma 3 1B is the fastest because it does about 25 tokens per second and loads in only 2 seconds. Qwen 2.5 1.5B is also very fast with 23 tokens per second and loads almost instantly after the first time. Llama 3.2 3B is slower, running at 11 tokens per second and taking 6 seconds to load, but its answers are much better written. By default, Ollama keeps the model in RAM for 5 minutes after your last request. This means if you ask another question quickly, it will answer immediately without waiting to load again.
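To compare these rates with reading speed, we can convert tokens per second into words per minute. The ratio of 0.75 English words per token used here is a common rule of thumb and an assumption, not a measurement from this test machine, so treat the results as rough estimates.

```python
# Assumption (not measured in this guide): about 0.75 English words per token.
WORDS_PER_TOKEN = 0.75

# Eval rates quoted above, in tokens per second.
EVAL_RATES = {"gemma3:1b": 25, "qwen2.5:1.5b": 23, "llama3.2:3b": 11}


def words_per_minute(tokens_per_second: float,
                     words_per_token: float = WORDS_PER_TOKEN) -> float:
    """Convert a token generation rate into an approximate words-per-minute figure."""
    return tokens_per_second * words_per_token * 60


for name, rate in EVAL_RATES.items():
    print(f"{name}: about {words_per_minute(rate):.0f} words per minute")
```

Under that assumption even the slowest of the three models comes out near 500 words per minute, which is above the few hundred words per minute usually quoted for silent reading.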

One of the most useful things about Ollama is that it has an HTTP API. This means other programs on your computer, like code editors or scripts, can talk to the AI. Ollama has its own native API, and it also offers an endpoint that is compatible with the OpenAI API. This is very useful because many developer tools are built to talk to OpenAI. You can just change the base URL in your tool to point to your local Ollama.

Let us test the native Ollama API first. We can use curl to send a JSON message to our local server on port 11434. We will ask a simple math question and pipe the reply through Python's json.tool module to pretty-print it.

curl -s http://localhost:11434/api/chat -d '{ "model": "qwen2.5:1.5b", "messages": [{"role": "user", "content": "What is 2+2? Answer in one word."}], "stream": false }' | python3 -m json.tool

The server will send back a JSON response. It will show the answer, which is “4”, and it will also show all the statistics like how long it took to generate the answer. This is very easy to use if you are writing your own scripts.
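In a script you usually want only the answer and the speed out of that JSON. The sketch below reads the message.content, eval_count and eval_duration fields of a non-streaming /api/chat reply, where eval_duration is in nanoseconds. The sample reply is made up for illustration and is not real output from the test machine. The function name summarize_chat_response is also hypothetical.

```python
import json


def summarize_chat_response(raw: str) -> dict:
    """Extract the answer text and generation speed from a non-streaming /api/chat reply."""
    data = json.loads(raw)
    eval_count = data.get("eval_count", 0)           # number of generated tokens
    eval_duration_ns = data.get("eval_duration", 0)  # generation time in nanoseconds
    rate = eval_count / (eval_duration_ns / 1e9) if eval_duration_ns else 0.0
    return {
        "answer": data["message"]["content"],
        "tokens": eval_count,
        "tokens_per_second": round(rate, 1),
    }


# Made-up sample reply, trimmed to the fields used above.
sample = (
    '{"model": "qwen2.5:1.5b", '
    '"message": {"role": "assistant", "content": "4"}, '
    '"done": true, "eval_count": 46, "eval_duration": 2000000000}'
)
print(summarize_chat_response(sample))
```

You can feed it the output of the curl command above instead of the sample string.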

Now let us test the OpenAI API format. This is important if you want to use plugins in editors like VS Code or Vim. We send a request to a different address ending in /v1/chat/completions.

curl -s http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "qwen2.5:1.5b", "messages": [{"role": "user", "content": "Hello in 3 words."}] }' | python3 -m json.tool

This returns a response in the same format that OpenAI uses. Because of this, the OpenAI client libraries for Python or JavaScript can talk to your local Fedora server. You just have to set their base URL to http://localhost:11434/v1/ and you can put any placeholder text in the API key because Ollama does not check it.
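If you do not want to install a client library, the same request can be made with Python's standard library. This is a minimal sketch: the function names build_openai_request and ask are made up for illustration, the API key value is an arbitrary placeholder, and ask only works while Ollama is running on the same machine.

```python
import json
import urllib.request


def build_openai_request(prompt: str,
                         model: str = "qwen2.5:1.5b",
                         base_url: str = "http://localhost:11434/v1",
                         api_key: str = "placeholder") -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            # Ollama ignores the key, but OpenAI-style clients always send one.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the request and return the reply text (needs a running Ollama)."""
    with urllib.request.urlopen(build_openai_request(prompt), timeout=120) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

Calling ask("Hello in 3 words.") sends the same question as the curl example above.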

When you install Ollama, it only allows requests from the same computer. This is called localhost. If you want to share your AI with other computers in your home network, you have to change this setting. We can do this safely by making a systemd override file. This ensures our changes do not get deleted when we update Ollama in the future.

sudo systemctl edit ollama.service

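A text editor will open. Add these lines, including the [Service] header. The first setting tells Ollama to listen on all network addresses instead of only localhost. The second sets the allowed origins to any, so web pages and browser-based tools served from other origins can call the API.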

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_ORIGINS=*"
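To make the meaning of the OLLAMA_HOST value concrete, here is a small sketch that classifies an address:port string. It handles only the IPv4 address:port form used in this guide, and the function name describe_bind is made up for illustration.

```python
import ipaddress


def describe_bind(ollama_host: str) -> str:
    """Classify an IPv4 'address:port' value such as the OLLAMA_HOST setting above."""
    address, _, _port = ollama_host.rpartition(":")
    ip = ipaddress.ip_address(address)
    if ip.is_loopback:
        return "local only"        # reachable from this machine only
    if ip.is_unspecified:
        return "all interfaces"    # reachable on every network the machine is on
    return "single interface"      # reachable only through that one address


print(describe_bind("127.0.0.1:11434"))  # local only (the default)
print(describe_bind("0.0.0.0:11434"))    # all interfaces
```

The value 0.0.0.0 is the one that makes the firewall step matter, because Ollama then answers on every network the machine is connected to.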

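After saving the file and closing the editor, we must tell systemd to load the new configuration and restart the Ollama service. We also need to open the port in the Fedora firewall so other computers can reach it. We add the rule to the trusted zone only. A firewalld zone applies just to the network interfaces or source addresses assigned to it, so put your home network in that zone and keep public WiFi connections out of it. That way random people on public WiFi cannot access our AI.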

sudo systemctl daemon-reload
sudo systemctl restart ollama
sudo firewall-cmd --permanent --zone=trusted --add-port=11434/tcp
sudo firewall-cmd --reload

You must be careful because Ollama has no built-in password protection. If you open this port, anyone on that network can use your AI and keep your CPU busy. If you want to expose it to the internet, you must put other software in front of it, such as Nginx as a reverse proxy, to add authentication.

We can make the CPU run the AI better by setting some environment variables. We can add these to the same override file we edited before. These settings will help the computer manage the RAM and CPU threads better.

[Service]
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
Environment="OLLAMA_KEEP_ALIVE=15m"

Let us explain what these settings do. The first setting OLLAMA_NUM_PARALLEL=1 tells Ollama to process only one question at a time. If two people ask questions at the same time on a CPU, the computer will get very slow for both of them. It is much better to do them one after the other. The second setting OLLAMA_MAX_LOADED_MODELS=1 makes sure only one model stays in the RAM. This is important because our CPU machine does not have enough memory to hold many models at the same time.

The third setting OLLAMA_KEEP_ALIVE=15m tells Ollama to keep the model loaded in RAM for 15 minutes after your last request. The default is 5 minutes. If you are writing code and asking the AI a question every 10 minutes, raising this to 15 minutes means you do not have to wait for the model to load from disk again. After you save these settings, remember to reload systemd and restart Ollama.
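Since the network settings and the tuning settings go into the same override file, it is tidier to keep them under a single [Service] header. The sketch below generates that text from a dictionary. The function name render_override is made up for illustration, and the values are the ones used in this guide.

```python
def render_override(env: dict[str, str]) -> str:
    """Build the text of a systemd drop-in with a single [Service] section."""
    lines = ["[Service]"]
    for name, value in env.items():
        lines.append(f'Environment="{name}={value}"')
    return "\n".join(lines) + "\n"


settings = {
    "OLLAMA_HOST": "0.0.0.0:11434",
    "OLLAMA_ORIGINS": "*",
    "OLLAMA_NUM_PARALLEL": "1",
    "OLLAMA_MAX_LOADED_MODELS": "1",
    "OLLAMA_KEEP_ALIVE": "15m",
}
print(render_override(settings), end="")
```

You can paste the printed text into the editor opened by sudo systemctl edit ollama.service. Leave out the first two entries if you do not want to share Ollama on the network.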

Although CPU AI is very useful, it is not good for everything. You should not use a CPU to run very big models like those with 70 billion parameters. Those models need powerful graphics cards with a lot of VRAM. You also should not use a CPU for an app where hundreds of people use the AI at the same time, because the CPU will be overloaded immediately. For heavy tasks like processing thousands of documents or real-time voice applications, you really need a GPU, for example one with CUDA support.

Sometimes things do not work, and you might get errors. If you see an error saying “pull model manifest” or “dial tcp”, it means your computer cannot talk to the Ollama registry on the internet. This is usually a DNS problem. You can check if your internet connection can find the website by running a simple test command in your terminal.

nslookup registry.ollama.ai

If that test fails, you need to fix your DNS settings on your Fedora host. Another common issue is when the Ollama service keeps restarting in a loop. This usually happens because your hard drive is full. Ollama downloads models to a folder under /usr/share/ollama. If you download three models, they will take up about 4 gigabytes of space. You should check if you have enough space on your hard drive.

df -h /usr/share/ollama
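The same check can be done from a script before pulling models. This sketch adds up the download sizes quoted earlier in this guide (with the Llama model rounded to 2000 MB) and compares the total with the free space reported by shutil.disk_usage. The function names total_gb and has_room are made up for illustration.

```python
import shutil

# Approximate download sizes in MB, taken from the model descriptions in this guide.
MODEL_SIZES_MB = {"gemma3:1b": 815, "qwen2.5:1.5b": 986, "llama3.2:3b": 2000}


def total_gb(sizes_mb: dict[str, int]) -> float:
    """Return the combined size in gigabytes, rounded to one decimal place."""
    return round(sum(sizes_mb.values()) / 1000, 1)


def has_room(path: str, needed_gb: float) -> bool:
    """Return True if the filesystem that holds path has at least needed_gb free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed_gb


print(total_gb(MODEL_SIZES_MB))
```

On the Fedora machine you would call has_room("/usr/share/ollama", total_gb(MODEL_SIZES_MB)) once the directory exists.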

If your computer has only 2 gigabytes of RAM and you try to run the Llama 3.2 3B model, the program might crash because of low memory. To fix this, use a smaller model like Gemma 3 1B, or add more RAM to your virtual machine. You can also try setting OLLAMA_LOW_VRAM=1 in the service file, which is meant to reduce memory use at the cost of some speed, provided your Ollama version recognizes it.

Lastly, if the AI gives you strange answers or repeats the same words over and over, you can change the options in your API call. You can set the temperature lower, like 0.3, to make the answers more focused and predictable. You can also add a repetition penalty to stop it from looping, and num_predict to cap the length of the answer in tokens.

curl -s http://localhost:11434/api/chat -d '{ "model": "qwen2.5:1.5b", "messages": [{"role": "user", "content": "Summarize Linux in 50 words."}], "stream": false, "options": {"temperature": 0.3, "repeat_penalty": 1.1, "num_predict": 80} }'
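If you call the API from a script, it helps to keep these options in one place and override them per request. The sketch below builds the same JSON body as the curl command. The function name chat_payload and the idea of a shared defaults dictionary are only illustrations, not part of Ollama.

```python
import json

# Option values from the curl example above.
DEFAULT_OPTIONS = {"temperature": 0.3, "repeat_penalty": 1.1, "num_predict": 80}


def chat_payload(prompt: str, model: str = "qwen2.5:1.5b", **overrides) -> str:
    """Build the JSON body for /api/chat, applying per-request option overrides."""
    options = {**DEFAULT_OPTIONS, **overrides}
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": options,
    })


# Same request as above, but with an even lower temperature.
print(chat_payload("Summarize Linux in 50 words.", temperature=0.1))
```

The printed string can be passed to curl with -d or sent with any HTTP client.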

Running AI on your CPU on Fedora 44 is not a replacement for massive commercial models like GPT-4. But it is a private and free way to handle daily tasks without sending your data to big companies. It integrates nicely with the Fedora system and gives you full control over your machine and your data.