Introduction
I’m writing this just before hopping on a long transatlantic flight. I’m not betting on the in-flight WiFi being very good, so I’m taking the time to get a local AI assistant running on my laptop. I think many travelers will be interested in setting something like this up, so I’d like to document how I’m doing it… and hopefully make it easier for others to get it going.
This will be more of a “live document” as I iteratively improve on my setup. Important parts are going to be missing until I get enough time to document it all.
Prerequisites
- A Mac (M1 or newer) with at least 16GB of RAM (but more is definitely better)
- Docker or Podman installed on your Mac
- An internet connection to download the models and other resources
Setting Up Docker
If you don’t already have Docker installed, you can download it from the official Docker website.
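If you’d rather install it from the terminal, Homebrew can do it too. This is just a sketch that assumes you already have Homebrew set up:
# Install Docker Desktop via Homebrew (assumes Homebrew is already installed)
brew install --cask docker
# Launch Docker Desktop once so it can finish its first-run setup
open -a Docker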
Setting Up Podman
If you don’t already have Podman installed, you can download it from the official Podman website.
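The Homebrew route works for Podman as well (again, assuming Homebrew is installed). Podman on a Mac runs containers inside a small Linux VM, which the machine commands below create and start:
# Install Podman and bring up its Linux VM (assumes Homebrew)
brew install podman
podman machine init
podman machine start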
Installing Ollama
You can install Ollama using Docker or Podman, but I installed it natively on macOS instead, since a native install can take advantage of Apple-silicon GPU acceleration in a way a container can’t. You can get Ollama here.
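For reference, a native command-line install looks roughly like this; I’m assuming Homebrew here, and the regular installer from the Ollama site gets you to the same place:
# Install the Ollama CLI and server natively (assumes Homebrew)
brew install ollama
# Start the server (skip this if you installed the desktop app, which runs it for you)
ollama serve &
# Quick sanity check
ollama --version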
Running Ollama
Before Ollama will be of any use, you’re going to need to download some useful models. The instructions below will grab the ones that I’m currently finding most useful day to day. I run these on an M1 MacBook Air with 16GB of RAM, so they should generally run on most modern Apple-silicon Macs.
Pulling Models
You’ll need to open a terminal app. I’m assuming here that you’re using macOS’s default zsh shell; if you’re using a different shell, you’ll need to adapt this for loop accordingly. Once you fire this off, it’s going to take a long time. If you’ve ever run a docker pull command, it looks a lot like that, but these models are multiple gigabytes each, so expect it to be quite slow.
# Pull a few models that are very useful for local work. This will take a while.
for model in "granite3.1-dense:2b" "qwen2.5-coder:1.5b" "llama3.2:1b" \
             "nomic-embed-text:latest" "granite3.1-dense:8b" \
             "qwen2.5-coder:latest" "llama3.2:latest"; do
  ollama pull "$model"
done
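Once the loop finishes, it’s worth a quick check that the models actually landed and respond. The prompt here is just a throwaway example:
# List the models that are now available locally
ollama list
# Try a tiny prompt against the smallest model
ollama run llama3.2:1b "Say hello in five words."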
Each of these models has certain pros and cons. You’ll need to work out for yourself which ones you want to use in different circumstances. You can find more information about each model on the Ollama website.
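If you want a model’s local details (parameter count, context length, license, and so on) without leaving the terminal, ollama show prints its metadata. For example, assuming you’ve already pulled it:
# Show metadata for a pulled model
ollama show granite3.1-dense:8b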
Some high-level thoughts from me at this stage of my AI journey:
- The smaller models are faster, but they may not be as accurate or comprehensive.
- The larger models are slower, but they can provide more accurate and comprehensive results.
- The models I’ve pulled so far come from a few different families (Granite, Llama, Qwen, plus Nomic’s embedding model), so between them they cover general chat, coding help, and text embeddings.
- The granite3.1-dense:8b model is the largest one in this list, which means it takes longer to load and respond. In exchange, it can provide the most accurate and comprehensive results of the bunch.
- The llama3.2:latest model is relatively new and has been trained on a much larger dataset than granite3.1-dense:8b. That means it may be more accurate and comprehensive, but it will also be slower to run.
- llama3.2:1b is light and fast. It’s good for smaller tasks like creating chat session titles, generating search queries, etc.
- The qwen2.5-coder models are really good at (you guessed it) coding! They can generate code that is both correct and efficient across most programming languages and frameworks, which makes them great for developers who need to write code quickly and accurately, and also for beginners who want to learn how to code. The smaller variants handle light, fast work; for more comprehensive tasks like generating whole chunks of code, the larger ones are better. I use this family a lot inside of VS Code, particularly for getting my head around a larger shared project that I’m starting to work on. It’s really good at summarizing all of the classes, APIs, and methods in a single file, at commenting code that the original author didn’t comment, and even at offering optimizations like removing unnecessary imports or suggesting refactors. It’s also great for debugging, since it can show you where errors are occurring and suggest potential fixes. Overall, I’m really impressed and highly recommend it for coding tasks that need accuracy and efficiency. I know I’m saying a lot about this one, but for developers, this one (well, really a family of models) is the game changer!
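One practical note before moving on: tools like Open WebUI and editor integrations talk to Ollama over its local HTTP API on port 11434, and you can hit that API directly from your own scripts too. A minimal sketch, with a throwaway model choice and prompt:
# Ask a model a question through Ollama's local HTTP API
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:1b",
  "prompt": "Summarize what an embedding model is in one sentence.",
  "stream": false
}'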
Open WebUI
Open WebUI is a web-based interface for managing and interacting with local (and remote) AI services. It gives you a user-friendly way to chat with your models and to manage the pieces around them, such as connections to backends like Ollama or OpenAI-compatible APIs, prompts, and document collections. When it’s up and running well, it can really substantially change the way you interact with your AI infrastructure (whether it’s local on the laptop, or as a value add on top of ChatGPT).
I’ll be honest: I think this project is still very young, changing substantially and trying to figure out what it wants to be. There’s often a lot more effort put into the code than into the docs at the time of this writing. So getting it set up the way you want it might feel frustrating and I can really identify with that sentiment. But! I’m finding it to be a really powerful tool in my arsenal, and I think it’s worth the effort to get running correctly.
Ironically, perhaps, I used ChatGPT extensively to troubleshoot and dial in my configuration. OpenAI did a really good job of training gpt-4o on Open WebUI, so I was able to figure out what settings were needed to get it running smoothly.
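For concreteness, the containerized route looks roughly like this, following the upstream project’s documented image; the host port and the OLLAMA_BASE_URL value are assumptions that match Ollama running natively on the Mac:
# Run Open WebUI in a container, pointed at Ollama running natively on the host
docker run -d \
  --name open-webui \
  -p 3000:8080 \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main
Once it’s up, the UI should be reachable at http://localhost:3000; host.docker.internal is how the container reaches the Ollama server on the host.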
I’ve been doing some work on packaging it up for easy containerized deployments on Macs.