What is llama.cpp?

llama.cpp is an open-source library for LLM inference across a wide range of environments.

Logo of llama.cpp

 - It supports most LLM models with various quantization levels via the GGUF format (see the conversion example after this list)

 - It can run with or without a GPU

 - It has minimal dependencies

 - It supports several acceleration backends, such as BLAS, RPC (Remote Procedure Call), and KleidiAI.
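For example, a model that is not yet published as GGUF can be converted and quantized with the tools that ship with llama.cpp. The sketch below assumes a locally downloaded Hugging Face checkout of Qwen3-0.6B and the build paths from the Docker image described below; the directory names are illustrative, not from the original post.

# The conversion script needs the Python packages from requirements.txt
pip install -r requirements.txt

# 1. Convert the Hugging Face model directory to a full-precision GGUF file
python convert_hf_to_gguf.py /models/Qwen3-0.6B --outfile /models/Qwen3-0.6B-F16.gguf --outtype f16

# 2. Quantize it, e.g. to Q8_0 (other types such as Q4_K_M work the same way)
/app/build/bin/llama-quantize /models/Qwen3-0.6B-F16.gguf /models/Qwen3-0.6B-Q8_0.gguf Q8_0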

 

llama.cpp installation

I used Docker. The following Dockerfile is used to build the Docker image.

FROM arm64v8/python:3.12-slim AS builder
COPY --from=docker.io/astral/uv:latest /uv /uvx /bin/

RUN apt-get update && apt-get install -y --no-install-recommends \
    git \
    build-essential \
    cmake \
    libopenblas-dev \
    pkg-config \
    curl \
    libcurl4-openssl-dev \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

RUN git clone https://github.com/ggml-org/llama.cpp.git .

RUN cmake -B build -DGGML_RPC=ON

RUN cmake --build build --config Release
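With -DGGML_RPC=ON, the build also produces the rpc-server binary (used in the distributed example further down) alongside the usual tools. Inside the image, a quick sanity check might look like this:

# All executables land in /app/build/bin; rpc-server is only built when GGML_RPC=ON
ls /app/build/bin/llama-cli /app/build/bin/llama-server /app/build/bin/rpc-server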

 

Building and running llama.cpp with Docker is very simple.

# Build
docker build -t llama-cpp .

# Run (interactive shell)
docker run --rm -it -v /llama.cpp-models:/models --entrypoint /bin/bash llama-cpp

# Run (llama-cli for local inference)
docker run -v /llama.cpp-models:/models llama-cpp /app/build/bin/llama-cli -m /models/Qwen3-0.6B-Q8_0.gguf -p "hello"
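# Run (llama-server, OpenAI-compatible HTTP API -- a sketch, not from the original post;
# --host, --port and -m are standard llama-server options)
docker run -v /llama.cpp-models:/models -p 8080:8080 llama-cpp /app/build/bin/llama-server -m /models/Qwen3-0.6B-Q8_0.gguf --host 0.0.0.0 --port 8080

# Query it from another terminal
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":"hello"}]}'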

# Run (llama-cli with RPC servers)
docker run -v /llama.cpp-models:/models llama-cpp /app/build/bin/llama-cli --rpc 192.168.1.51:50052,192.168.1.52:50052,192.168.1.53:50052,192.168.1.54:50052 -m /models/Qwen3-8B-Q8_0.gguf -p "hello"
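The --rpc flag assumes that each worker node in the list is already running an rpc-server instance on the given port. A minimal sketch of starting one worker with the same image (the -H and -p flags follow the llama.cpp RPC example; adjust the port to your setup):

# Run (rpc-server on each worker node, listening on all interfaces on port 50052)
docker run --rm -p 50052:50052 llama-cpp /app/build/bin/rpc-server -H 0.0.0.0 -p 50052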

 

Have FUN !!!

llama-cli running Qwen3-0.6B-Q8_0.gguf on a Raspberry Pi 4 with 4 GB RAM

 
