13 Ways of Serving LLMs for Inferencing
ChatGPT, Bard, and GitHub Copilot were all good first steps in generative AI. However, you may not want to send all your sensitive questions or data to the cloud all the time. There are also a number of fine-tuned and domain-specific LLMs that might serve your needs better. So there is an increasing need to serve your own LLMs rather than relying on cloud hosting. Fortunately, there are a number of ways to do this today.
LoRAX
Currently, the most capable infrastructure for this is arguably LoRAX. It employs a number of optimization techniques for serving LLMs. To name a few:
KV Caching
LLMs predict a sequence of tokens one step at a time. Caching the keys and values computed in previous steps means they do not have to be recomputed for every new token, which can nearly halve the average latency of predicting the next token.
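To make the mechanism concrete, here is a toy single-head attention decode step with a key/value cache in Python. This is only a sketch of the idea, not LoRAX's implementation; all names and sizes are made up for illustration.

# Toy decode step with a KV cache: only the new token's K/V are computed,
# everything generated earlier is reused from the cache.
import numpy as np

d = 8                                           # embedding size
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
kv_cache = {"K": np.empty((0, d)), "V": np.empty((0, d))}

def decode_step(x_new, cache):
    q = x_new @ Wq                                      # query only for the new token
    cache["K"] = np.vstack([cache["K"], x_new @ Wk])    # append, never recompute
    cache["V"] = np.vstack([cache["V"], x_new @ Wv])
    scores = q @ cache["K"].T / np.sqrt(d)              # attend over all cached positions
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ cache["V"]

for _ in range(5):                              # five decode steps, one token each
    out = decode_step(np.random.randn(1, d), kv_cache)
print(kv_cache["K"].shape)                      # (5, 8): the cache grows step by step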
Heterogeneous Continuous Batching
Batching typically improves throughput. Making it continuous, so that requests join and leave the batch at token boundaries instead of waiting for the whole batch to finish, also keeps latency in check. The heterogeneous part allows dissimilar requests (for example, requests targeting different fine-tuned adapters) to share the same batch, reaping broader benefits from the approach.
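The scheduling idea can be illustrated with a toy loop in which requests are admitted as soon as a slot frees up and leave as soon as they finish, rather than the whole batch draining first. This is purely an illustration of the concept, not LoRAX's actual scheduler; the request lengths and batch size are arbitrary.

# Toy continuous batching: admit and retire requests at every token step.
from collections import deque

waiting = deque([("req-A", 3), ("req-B", 6), ("req-C", 2), ("req-D", 4)])  # (id, tokens to generate)
running = {}                                     # request id -> tokens still to generate
MAX_BATCH = 2
steps = 0

while waiting or running:
    while waiting and len(running) < MAX_BATCH:  # fill freed slots immediately
        req_id, length = waiting.popleft()
        running[req_id] = length
    for req_id in list(running):                 # one decode step = one token per request
        running[req_id] -= 1
        if running[req_id] == 0:
            del running[req_id]                  # finished requests leave right away
    steps += 1

print(f"served 4 requests of mixed lengths in {steps} decode steps")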
Quantization
Quantization techniques let you shrink the memory footprint of the loaded model as required, albeit with a small computational overhead: the weights are quantized up front and dequantized just in time before inference, so the precision of the predictions is not hampered much.
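A minimal sketch of the pattern, using symmetric int8 quantization of a single weight matrix in NumPy. Real serving stacks use more elaborate schemes (GPTQ, AWQ, bitsandbytes and the like), but the store-small, dequantize-just-in-time idea is the same.

# Store weights in int8 (4x smaller than fp32), dequantize just before the matmul.
import numpy as np

W = np.random.randn(4096, 4096).astype(np.float32)   # fp32 weights: 64 MB
scale = np.abs(W).max() / 127.0                       # symmetric per-tensor scale
W_int8 = np.round(W / scale).astype(np.int8)          # quantized copy: 16 MB

def linear(x):
    W_deq = W_int8.astype(np.float32) * scale         # just-in-time dequantization
    return x @ W_deq

x = np.random.randn(1, 4096).astype(np.float32)
err = np.abs(linear(x) - x @ W).max()
print(f"weights are 4x smaller, max absolute output error: {err:.4f}")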
Dynamic LoRA Adapters
LLMs can be adapted to particular tasks or domains by fine-tuning. Low-Rank Adaptation (LoRA) is a technique that reduces the number of trainable parameters by several orders of magnitude (up to roughly 10,000x), which makes the resulting adapters cheap to host for inference. The dynamic part lets LoRAX pick and load the right adapter from an ensemble based on the request at hand.
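A back-of-the-envelope sketch of where the savings come from: instead of fine-tuning the full weight matrix W, LoRA trains two small low-rank factors and the adapted layer effectively uses W + B @ A. The sizes below are illustrative; the overall reduction factor depends on the rank and on how many layers are adapted.

# Parameter count of a full fine-tune vs. a rank-8 LoRA adapter for one matrix.
import numpy as np

d_out, d_in, r = 4096, 4096, 8                  # hidden sizes and LoRA rank
W = np.random.randn(d_out, d_in) * 0.02         # frozen base weights
A = np.random.randn(r, d_in) * 0.02             # trainable down-projection
B = np.zeros((d_out, r))                        # trainable up-projection (starts at zero)

W_adapted = W + B @ A                           # what the adapted layer computes with

full = W.size                                   # parameters a full fine-tune would touch
lora = A.size + B.size                          # parameters LoRA actually trains
print(f"full fine-tune: {full:,} params, LoRA: {lora:,} params ({full // lora}x fewer)")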
LoRAX also offers a managed service if you do not want to host it yourself but still want all of its benefits. One downside is that LoRAX currently needs a GPU for hosting.
vLLM
vLLM is another way of doing it. It provides a complete OpenAI-compatible API server in front of your hosted LLM, and it implements most of the optimization techniques mentioned above (some of them still experimental). Since https://github.com/vllm-project/vllm/pull/3634 it also supports CPU-only systems (without any GPU) for inference.
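A minimal offline-inference example with vLLM's Python API could look like the following; the model name and sampling settings are placeholders, so pick whatever fits your hardware.

# Offline batch inference with vLLM (the model downloads on first use).
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")                 # placeholder model
params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["What is KV caching?"], params)
print(outputs[0].outputs[0].text)

The OpenAI-compatible server can then be started separately with python -m vllm.entrypoints.openai.api_server --model <your model>.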
GPT4All
GPT4All offers a desktop application that downloads and hosts LLMs for you, and you can develop your own chatbot on top of it. It also supports RAG over your proprietary local data.
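Beyond the desktop application, GPT4All also ships Python bindings, so a small chatbot interaction can be sketched as below; the model file name is just an example from the GPT4All catalog and is downloaded on first use.

# Minimal chat with the gpt4all Python bindings (pip install gpt4all).
from gpt4all import GPT4All

model = GPT4All("orca-mini-3b-gguf2-q4_0.gguf")      # example catalog model
with model.chat_session():
    reply = model.generate("Why host LLMs locally?", max_tokens=128)
print(reply)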
Ollama
Ollama is an easy way to download and run LLMs locally. All you need to do is:
ollama run model-name
It downloads the model first if it is not already present. Ollama has integrations with LangChain as well as PrivateGPT, and it also exposes a local REST API.
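Once a model has been pulled, the local REST API (port 11434 by default) can be queried from any language. A small Python sketch, assuming the llama2 model has already been pulled:

# Query a locally running Ollama server over its REST API.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Why host LLMs locally?", "stream": False},
)
print(resp.json()["response"])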
LLM
LLM is a Python package that lets you host an LLM locally and interact with it from the command line. You first install llm:
pip install llm
Then you install a plugin for the model of your choice (the model weights themselves are downloaded on first use):
pip install <your model>
Then you send your query to the locally hosted model and get its response:
llm -m <your model> "Your query"
h2oGPT
h2oGPT has a desktop application for hosting LLMs. It also provides RAG capability so you can interact with your private, local, proprietary data.
PrivateGPT
PrivateGPT also lets you query your own documents by hosting LLMs locally.
Chat with RTX
NVIDIA's Chat with RTX is a demo application that runs an LLM locally on GeForce RTX GPUs and lets you chat with your own files and documents.
LM Studio
LM Studio is a desktop application for discovering, downloading, and running LLMs locally, and it can expose the loaded model through a local OpenAI-compatible server.
LangChain
Last but not least, and for the sake of completeness, you can also run LLMs locally through LangChain, the popular AI orchestration framework.
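One way to do that, assuming the transformers package (and a backend such as PyTorch) is installed, is to wrap a Hugging Face pipeline as a LangChain LLM; the model id and generation parameters below are placeholders.

# Run a local Hugging Face model inside LangChain.
from langchain_community.llms import HuggingFacePipeline

llm = HuggingFacePipeline.from_model_id(
    model_id="gpt2",                              # placeholder; any local or hub model
    task="text-generation",
    pipeline_kwargs={"max_new_tokens": 64},
)
print(llm.invoke("Local LLM serving is useful because"))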