13 Ways of Serving LLMs for Inferencing

Sameer Mahajan
4 min read · Apr 2, 2024

ChatGPT, Bard, and GitHub Copilot were all good first steps in Generative AI. However, you may not want to send all your sensitive questions or data to the cloud all the time. There are also a number of fine-tuned and domain-specific LLMs that might serve your needs better. Thus there is an increasing need to serve your own LLMs rather than relying on cloud hosting. Fortunately, there are a number of ways to do this today.

LoRAX

The foremost infrastructure for doing so currently is LoRAX. It employs a number of optimization techniques for serving these LLMs. To name a few:

KV Caching

LLMs predict a sequence of tokens one at a time. This is where caching the Keys and Values computed in previous steps comes into the picture: instead of recomputing the attention Keys and Values for the entire prefix at every step, only the new token's projections are computed and appended to the cache. It almost halves the average latency of predicting the next token.
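To make the idea concrete, here is a toy numpy sketch of KV caching for a single attention head during decoding (not any particular framework's implementation): keys and values for past tokens are computed once and reused, so each new token only pays for its own projections plus one attention step.

import numpy as np

d = 16                                    # embedding / head dimension
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []                 # grows by one entry per generated token

def decode_step(x_new):
    """x_new: embedding of the newly generated token, shape (d,)."""
    q = x_new @ Wq
    k_cache.append(x_new @ Wk)            # compute K/V only for the new token...
    v_cache.append(x_new @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)   # ...and reuse the cached past
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                    # attention output for the new token

for _ in range(5):                        # pretend to generate 5 tokens
    out = decode_step(rng.standard_normal(d))
print(out.shape)                          # (16,)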

KV caching improvements with the number of tokens predicted in a sequence

Heterogeneous Continuous Batching

Batching typically improves throughput.

Throughput vs Latency tradeoff with increasing batch size

Making the batching continuous also contains the latency, since new requests can join a running batch instead of waiting for it to drain. The heterogeneous nature lets dissimilar requests share the same batches, reaping broader benefits from the approach.
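For intuition, here is a toy Python sketch of continuous batching (not LoRAX's actual scheduler): at every decode step, finished sequences leave the batch immediately and waiting requests join, instead of the whole batch having to finish before the next one starts.

from collections import deque

MAX_BATCH = 4

def decode_step(seq):
    """Pretend to generate one token; return True once the sequence is finished."""
    seq["generated"] += 1
    return seq["generated"] >= seq["target_len"]

waiting = deque({"id": i, "generated": 0, "target_len": 3 + i % 5} for i in range(10))
active = []
steps = 0

while waiting or active:
    # Admit new requests as soon as slots free up (the "continuous" part).
    while waiting and len(active) < MAX_BATCH:
        active.append(waiting.popleft())
    # One decode step for every sequence currently in the batch.
    active = [seq for seq in active if not decode_step(seq)]
    steps += 1

print(f"Served 10 requests in {steps} decode steps with a batch size of {MAX_BATCH}")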

Quantization

With quantization techniques, the memory footprint of the loaded model can be reduced as required, albeit with a little computational overhead: the model is quantized up front and the weights are dequantized just in time before inference, so that prediction quality is not hampered much.
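As one concrete (assumed) setup, here is a minimal sketch of loading a model with 4-bit quantization using Hugging Face Transformers and bitsandbytes; the model name is only an example and a CUDA GPU is assumed:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"      # example model from the HF Hub
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,            # dequantize to fp16 just in time for compute
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

inputs = tokenizer("Why quantize an LLM?", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))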

Dynamic LoRA Adapters

LLMs can be adapted to particular tasks or domains by fine-tuning. Low-Rank Adaptation (LoRA) is a technique that reduces the number of trainable parameters for such fine-tuning by orders of magnitude (up to roughly 10,000x), which makes hosting the adapted models for inference much easier. The dynamic nature lets LoRAX pick and load the right adapters from its ensemble based on the request at hand.
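A minimal numpy sketch of why LoRA shrinks the trainable parameter count (the exact reduction depends on the rank and on which layers are adapted; the numbers below are only an example): the frozen base weight W stays fixed, and only two small low-rank matrices A and B are trained.

import numpy as np

d_out, d_in, r, alpha = 4096, 4096, 8, 16

W = np.zeros((d_out, d_in))              # frozen pretrained weight (placeholder values)
A = np.random.randn(r, d_in) * 0.01      # trainable, rank-r
B = np.zeros((d_out, r))                 # trainable, initialized to zero

W_effective = W + (alpha / r) * (B @ A)  # what the adapted layer actually applies

full, lora = W.size, A.size + B.size
print(f"trainable params: {lora:,} vs {full:,} ({full / lora:.0f}x fewer)")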

LoRAX also provides a managed service if you do not want to host it yourself but still want to make use of all its benefits. One downside is that LoRAX currently needs a GPU for hosting.
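As a usage sketch (assuming a LoRAX server is already running locally and that the adapter ID below, which is hypothetical, has been fine-tuned and published), the lorax-client Python package lets you choose the adapter per request:

# pip install lorax-client
from lorax import Client

client = Client("http://127.0.0.1:8080")          # endpoint of your running LoRAX server
response = client.generate(
    "Summarize this support ticket in one line.",
    adapter_id="my-org/my-lora-adapter",          # hypothetical adapter identifier
    max_new_tokens=64,
)
print(response.generated_text)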

vLLM

vLLM is another way of doing it. In fact, LoRAX uses vLLM underneath for hosting. vLLM provides complete OpenAI API server compatibility on top of your hosted LLM. It also provides most of the above-mentioned optimization techniques (some of them currently experimental). After https://github.com/vllm-project/vllm/pull/3634 it now supports CPU-only systems (without any GPU) for inference.
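A minimal sketch of vLLM's offline Python API (the model name is only an example; the OpenAI-compatible server can instead be started with python -m vllm.entrypoints.openai.api_server --model <model>):

# pip install vllm
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")   # example model from the HF Hub
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["What is KV caching?"], params)
print(outputs[0].outputs[0].text)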

GPT4All

GPT4All offers a desktop application that can download and host LLMs for you, and you can develop your chatbot on top of it. It also supports RAG with your proprietary local data.
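Beyond the desktop app, GPT4All also ships Python bindings; a minimal sketch (the model file name is only an example and is downloaded on first use):

# pip install gpt4all
from gpt4all import GPT4All

model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")   # example model file name
with model.chat_session():
    print(model.generate("Why does local LLM hosting matter?", max_tokens=128))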

Ollama

Ollama is an easy way to download and run LLMs locally. All you need to do is:

ollama run model-name

It will first download the model if it is not already present. It has integrations with LangChain as well as PrivateGPT.
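Once the Ollama server is running, you can also hit its local REST API from code; a minimal sketch (the model name llama3 is only an example and must already be pulled):

import requests

response = requests.post(
    "http://localhost:11434/api/generate",              # Ollama's default local endpoint
    json={"model": "llama3", "prompt": "Why host LLMs locally?", "stream": False},
)
print(response.json()["response"])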

LLM

LLM is a Python package that lets you host an LLM locally and interact with it on the command line. You first install llm:

pip install llm

Then you install a plugin that provides the model family of your choice (for example llm-gpt4all):

llm install <model plugin>

Then you send your query to the locally installed and hosted model and get its response:

llm -m <your model> "Your query"
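The same package also exposes a Python API; a minimal sketch (the model alias depends on which plugin you installed and is only an example here):

import llm

model = llm.get_model("orca-mini-3b-gguf2-q4_0")   # example alias from the llm-gpt4all plugin
response = model.prompt("Give three reasons to self-host an LLM, one line each.")
print(response.text())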

h2oGPT

h2oGPT has a desktop application for hosting LLMs. It also provides RAG capability to interact with your local proprietary data.

h2oGPT with proprietary data

PrivateGPT

PrivateGPT also lets you query your own documents by hosting LLMs locally.

Jan

Jan interface

Chat with RTX

llamafile

llamafile interface

LocalGPT

LM Studio

LM Studio interface

LangChain

Last but not least, and for the sake of completeness, you can also run LLMs locally through the popular AI orchestration tool LangChain.
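For example, a minimal sketch using LangChain's community Ollama integration (assumes a local Ollama server with the example llama3 model already pulled):

# pip install langchain langchain-community
from langchain_community.llms import Ollama

llm = Ollama(model="llama3")                         # example local model served by Ollama
print(llm.invoke("Give me one benefit of serving LLMs locally."))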
