Ollama and running AI on the edge
Running AI on the edge processes data locally, which is beneficial for anyone concerned about privacy and security. It also allows for a more tailored experience, which is particularly useful for parents who want to give their children age-appropriate AI-powered tools. Edge AI also operates without an external internet connection, making it suitable for environments where connectivity is slow or expensive. From my perspective as a developer, running open-source tools on the edge means I don't have to worry about unexpected bills accruing because I forgot to spin down a VM in AWS, especially during training phases. And as we move forward, it is likely that more and more AI will run on the edge.
The Ollama Platform: Simplifying Deployment, Management, and Scaling of Deep Learning Models
Ollama simplifies the deployment, management, and scaling of deep learning models, streamlining lifecycle management with features like model versioning, performance monitoring, and load balancing.
For my Debian environment, installation is straightforward:
```bash
curl -fsSL https://ollama.com/install.sh | sh
ollama run llama3
```
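Once the script finishes, the Ollama server listens on localhost port 11434 by default, and a quick sanity check from the shell might look like this (`llama3` is simply the model pulled above):

```bash
# Confirm the server is up and responding on its default port.
curl http://localhost:11434/api/version

# Confirm the model downloaded by `ollama run` is present locally.
ollama list
```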
Ollama’s API
I divide Ollama’s API into three sections: generating results, managing models, and generating embeddings.
Generating Results
- Generate a Completion: Produce a prediction or output for a single prompt sent to a running model.
- Generate a Chat Completion: Tailored for conversational AI, this function processes a conversation's messages to generate contextually appropriate responses. A sketch of both calls follows this list.
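To make the distinction concrete, here is a minimal sketch of both requests against a local Ollama instance, assuming the `llama3` model pulled earlier and the default port; the prompts are placeholders:

```bash
# One-shot completion: a single prompt in, a single response out.
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Explain edge computing in one sentence.",
  "stream": false
}'

# Chat completion: the conversation history travels with each request.
curl http://localhost:11434/api/chat -d '{
  "model": "llama3",
  "messages": [
    {"role": "user", "content": "What hardware do I need to run you locally?"}
  ],
  "stream": false
}'
```

Setting `stream` to `false` returns a single JSON object rather than a stream of tokens, which is easier to read when experimenting from the shell.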
Managing Models
Managing models in Ollama is akin to managing containers in Docker: you can pull and run models from Ollama's library.
These functions include the following (a CLI sketch follows the list):
- Pull a Model: Similar to pulling images in Docker, this downloads a model from a remote repository to a local environment.
- Run a Model: Equivalent to `docker run my-image`, this command runs a model, pulling it from the remote repository automatically if it is not already local.
- Delete a Model: Comparable to `docker rm`, this command removes a model from the system.
- List Local Models: Analogous to `docker images`, this returns a list of models currently stored locally.
- List Running Models: Similar to `docker ps`, this provides an overview of all models currently in operation, aiding in monitoring and resource management.
- Show Model Information: Similar to `docker inspect`, this displays detailed information about a specific model, including its parameters, version, and operational status.
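Sticking with the Docker analogy, the CLI side of these functions looks roughly like this; `llama3` is just an example model from Ollama's library:

```bash
ollama pull llama3    # download a model locally           (docker pull)
ollama run llama3     # run it, pulling first if needed    (docker run)
ollama rm llama3      # remove it from disk                (docker rm)
ollama list           # models stored locally              (docker images)
ollama ps             # models currently loaded            (docker ps)
ollama show llama3    # parameters, template, and details  (docker inspect)
```

Each command also has an HTTP counterpart under `/api` (for example, `/api/pull`, `/api/tags`, and `/api/ps`), which is useful when scripting against a remote host.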
Generating Embeddings
- Generate Embeddings: Creates vector representations of data, useful for tasks requiring similarity comparisons or feature extraction, particularly for Retrieval-Augmented Generation (RAG).
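A sketch of the call is below. I am assuming an embedding-oriented model such as `nomic-embed-text` from Ollama's library has been pulled; chat models can also return embeddings through the same endpoint:

```bash
# Returns a JSON object containing an "embedding" array of floats,
# ready to be stored in a vector database for RAG lookups.
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "Ollama runs models on the edge."
}'
```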
Hypervisor (Proxmox) Considerations
While it's feasible to run Ollama on a VM with minimal resources (a single core, 2 GB of memory, and 10 GB of storage), I recommend allocating considerably more for any real AI workload. I use a GPU passthrough configuration to leverage the computing power of my NVIDIA RTX 4070 for AI tasks.
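For reference, the passthrough setup on the Proxmox host boils down to enabling the IOMMU and handing the card's PCI device to the VM. The sketch below assumes an Intel CPU and GRUB as the bootloader, and the PCI address (`01:00`) and VM ID (`101`) are placeholders; check `lspci -nn` and your own VM ID before copying anything:

```bash
# 1. Enable the IOMMU at boot (AMD CPUs use amd_iommu=on instead), then apply it.
#    /etc/default/grub: GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"
update-grub

# 2. Load the VFIO modules so the host releases the GPU to the guest.
cat >> /etc/modules <<'EOF'
vfio
vfio_iommu_type1
vfio_pci
EOF

# Reboot the host after steps 1 and 2 before attaching the device.

# 3. Find the GPU's PCI address and attach it to the VM (q35 machine type for PCIe).
lspci -nn | grep -i nvidia
qm set 101 --machine q35
qm set 101 --hostpci0 01:00,pcie=1
```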
It’s important to note that consumer GPUs, including higher-end models like the RTX 4070, typically support passthrough to only one VM at a time. This setup implies that all AI processing tasks are consolidated onto a single VM. While this configuration maximizes the use of GPU resources, it also introduces a single point of failure. This can be a critical consideration as you experiment and develop since any issues with the VM will impact all AI operations relying on that GPU.