Language models for your own data


Language models such as GPT, ChatGPT, LLaMA, Bard and Alpaca are the focus of everyone’s attention and are incredibly fascinating: they provide surprisingly accurate answers to all kinds of questions, complete texts, formulate emails and summarise presentations and documents. By now, most of us have probably had the opportunity to try out such a model. Wouldn’t it be helpful to use all these possibilities on your own documents and code? And wouldn’t it be useful to be sure that the data does not fall into the hands of third parties? Let’s take a look at what this could look like.

So far, large language models have mostly run in the data centres of OpenAI, Microsoft, Meta and Google, and much of the data sent to the models is also used to improve them.

If one now wants to use language models on one’s own company-internal and sensitive data, two fundamental challenges arise:

  1. Processing documents: Language models are designed to work with text (e.g. a question), but they cannot deal directly with unstructured data such as PDF documents, Excel sheets, PowerPoint files, etc. The challenge is to pre-process these documents so that their information is accessible to the language model and can be used to provide answers.
  2. A closed model: To be sure that data from company-internal documents does not leak into third-party language models or onto foreign servers, you need a trustworthy language model. At present, the simplest and safest option is to run your own language model locally.

In the following, I would like to briefly show how you can put this into practice with relatively little code and host your own language model that answers questions about your own documents.

But before we start, a brief outline of what we intend to do:

Theory

[Figure: Large language model]

The graphic above shows a simple system that uses indexed documents to generate suitable answers to questions.

The process from question to finished answer consists of three steps:

  1. Question Vectorization: The question is converted into a vector with the help of an embedding model.
  2. Context Retrieval: The question vector is used to search a vector database for suitable information from the previously indexed documents.
  3. Answer Generation: The search results and the question are passed to a language model, which generates a suitable answer from the knowledge provided.
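To make the flow more tangible, here is a minimal sketch of how the three steps fit together. The function answer_question and the three stand-in functions are hypothetical names chosen for illustration; in a real system they would be backed by an embedding model, a vector database and a language model.

```python
def answer_question(question, embed, search, generate):
    """Run the three steps: vectorize the question, retrieve context, generate an answer."""
    question_vector = embed(question)               # step 1: question vectorization
    information_context = search(question_vector)   # step 2: context retrieval
    return generate(question, information_context)  # step 3: answer generation


# Hypothetical stand-ins so that the sketch runs without any model or database.
def fake_embed(text):
    return [float(len(text))]


def fake_search(question_vector):
    return ["ETECTURE is a software company."]


def fake_generate(question, information_context):
    return f"Based on {len(information_context)} passage(s): {information_context[0]}"


answer = answer_question(
    "What does the company ETECTURE do?", fake_embed, fake_search, fake_generate
)
print(answer)
```

Passing the three steps in as functions keeps the flow independent of any particular model or database, which is roughly the kind of glue that a framework takes care of.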

Let’s take a closer look at the steps:

Embeddings and Vectorization

The embedding model maps text into a high-dimensional vector space by creating representative vectors. It has been trained so that similar or related text elements are close to each other in this vector space.

In a pre-processing step, text elements are extracted from the documents that serve as the knowledge base and represented as vectors with the help of the embedding model. The text-vector pairs are then stored in a vector database.
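The idea of text-vector pairs can be illustrated with a deliberately simple stand-in for an embedding model. The sketch below uses normalised word counts instead of a trained model, so toy_embed, cosine_similarity and the example passages are assumptions for illustration only; a real embedding model also captures meaning, not just shared words.

```python
import math
from collections import Counter


def toy_embed(text, vocabulary):
    """Map text to a unit-length word-count vector over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    vector = [float(counts[word]) for word in vocabulary]
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else vector


def cosine_similarity(a, b):
    """For unit-length vectors the cosine similarity is simply the dot product."""
    return sum(x * y for x, y in zip(a, b))


passages = [
    "the company builds software for customers",
    "the cat sleeps on the sofa",
]
vocabulary = sorted({word for passage in passages for word in passage.split()})

# The "database": a list of (text, vector) pairs.
index = [(passage, toy_embed(passage, vocabulary)) for passage in passages]

question_vector = toy_embed("which software does the company build", vocabulary)
best_passage = max(index, key=lambda pair: cosine_similarity(question_vector, pair[1]))[0]
print(best_passage)
```

The question shares more words with the first passage, so its vector lies closer to that passage's vector than to the second one.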

Retrieval from vector space

If you now ask a question, it will also be encoded as a vector by the embedding model. This vector corresponds to a point in the high-dimensional vector space. Since similar and linked text passages are close to each other in this vector space, you can now search for the closest data points and get possible information that matches the question.

This is where the advantages of a vector database come into play: it can not only store vectors very efficiently, but also return as search hits those vectors that lie within a certain distance limit.

The set of unfiltered text passages that are related to the question in some way is called the information context.
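A search with a distance limit can be sketched in a few lines. This is a brute-force version under simplifying assumptions: two-dimensional made-up vectors, Euclidean distance and a hypothetical retrieve function. A real vector database uses specialised index structures to avoid comparing against every stored vector.

```python
import math


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve(question_vector, index, max_distance, k):
    """Return up to k passages whose vectors lie within max_distance, nearest first."""
    hits = []
    for text, vector in index:
        distance = euclidean_distance(question_vector, vector)
        if distance <= max_distance:
            hits.append((distance, text))
    hits.sort(key=lambda hit: hit[0])
    return [text for _, text in hits[:k]]


# Made-up text-vector pairs standing in for indexed document passages.
index = [
    ("passage about software projects", [0.9, 0.1]),
    ("passage about consulting services", [0.6, 0.4]),
    ("passage about cooking recipes", [0.0, 1.0]),
]

question_vector = [0.8, 0.2]
information_context = retrieve(question_vector, index, max_distance=0.5, k=2)
print(information_context)
```

The third passage is too far away from the question vector and is dropped; the remaining two form the information context.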

Answer generation

Since a user wants a coherent answer and not a lot of more or less relevant information, one last step is necessary: the generation of a suitable answer.

To do this, we can exploit the strengths of large language models (LLMs): the information context is passed to the model, which uses it to generate a consistent answer.
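In practice, passing the information context to the model usually means placing it in the prompt next to the question. The following sketch shows one possible way to assemble such a prompt; build_prompt and the wording of the instruction are assumptions for illustration, not the template any particular framework uses.

```python
def build_prompt(question, information_context):
    """Combine the retrieved passages and the question into a single prompt string."""
    context_block = "\n\n".join(information_context)
    return (
        "Use the following context to answer the question. "
        "If the context does not contain the answer, say that you do not know.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt_text = build_prompt(
    "What does the company ETECTURE do?",
    ["First retrieved passage.", "Second retrieved passage."],
)
print(prompt_text)
```

The language model then continues the text after "Answer:", so it can draw on the passages in addition to what it learned during training.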

Frameworks

Enough of the theory. Before we get to the practical part, here are two packages that are used in the following and are worth describing in a sentence or two:

LangChain

LangChain is a Python package that provides many useful functionalities such as data loaders and all the glue code needed to implement the above system very easily.

It allows you to focus on experimenting with models, different databases and action chains instead of having to write the boilerplate code in between.


llama.cpp

llama.cpp is an open-source C/C++ implementation of inference for Meta’s LLaMA model. While Meta’s model was trained in data centres with huge numbers of GPUs, the goal of this implementation is to make the model run on a normal CPU. This way, you can run your own language model on your own computer.

This means accepting some drawbacks in the quality of the answers and in overall performance, but that should be acceptable for a simple experiment.

Implementation

Requirements

  • Python ≥ 3.10
  • Pip package manager
  • C/C++ Compiler (e.g. with Visual Studio Build Tools or gcc)

Setup

1. Installation of the required Python packages (C/C++ compiler required):

pip install --user pypdf==3.9.0 chromadb==0.3.22
pip install --user langchain==0.0.166 llama-cpp-python==0.1.48

2. Download of the model files (here) and setting of the path:

GPT4ALL_MODEL_PATH = r"<absolute_path_to_model_dir>\gpt4all-converted.bin"

3. Download of the document that serves as the knowledge base (here) and setting of the path:

PDF_PATH = r"<path_to_doc_dir>\Mosolf_Story_ETE_Kom_1.pdf"

That’s it – let’s go:

Questions to the raw model without access to additional documents (Baseline)

First, we want to ask the model a simple question without giving it access to the documents that form the actual knowledge base:

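from langchain import PromptTemplate, LLMChain
from langchain.llms import LlamaCpp

# Prompt template with a placeholder for the question
template = """
Question: {question}
Answer: Let's think step by step.
"""

prompt = PromptTemplate(template=template, input_variables=["question"])
# Load the local model from the path set during setup
llm = LlamaCpp(model_path=GPT4ALL_MODEL_PATH)

# Chain the prompt and the model, then ask the question
lm = LLMChain(prompt=prompt, llm=llm)
lm.run("What does the company ETECTURE do?")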

The output is as follows:

“Firstly, to answer this question, we need to know what is meant by ‘company’. In general terms, a company can be defined as an organization that has a distinctive name and/or logo, and a legal entity or structure separate from its owners or shareholders. ETECTURE can be understood as an engineering consultancy that provides various services such as infrastructure design and construction management for clients ranging from governmental entities to private individuals.”

The model is already very communicative, but unfortunately the answer is not correct. Then again, how could it know what ETECTURE (a software company based in western Germany) does? So let’s give it some additional information:

Loading of the documents

First, we want to load the document that will later form the knowledge base of the model. We will use a published company presentation as an example document – it should tell us what ETECTURE actually does:

from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader(PDF_PATH)
pages = loader.load_and_split()
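A note on the splitting: longer texts are cut into smaller passages so that each passage can be embedded and retrieved on its own. As far as I know, load_and_split falls back to LangChain’s RecursiveCharacterTextSplitter when no splitter is passed, which prefers to cut at paragraph and line boundaries. The sketch below is not that splitter; it is a much simpler, hypothetical fixed-size splitter that only illustrates the idea of overlapping chunks.

```python
def split_with_overlap(text, chunk_size, overlap):
    """Cut text into chunks of chunk_size characters; neighbours share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("chunk_size must be positive and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current chunk already reaches the end of the text
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, overlap=1)
print(chunks)
```

The overlap reduces the chance that a sentence relevant to a question is cut in half exactly at a chunk boundary.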

Indexing of the documents

Next, we will use the embedding model to create the vectors for the loaded text passages and store everything in a vector database (ChromaDB):

from langchain.vectorstores import Chroma
from langchain.embeddings import LlamaCppEmbeddings

embeddings = LlamaCppEmbeddings(model_path=GPT4ALL_MODEL_PATH)
vectordb = Chroma.from_documents(pages, embedding=embeddings, persist_directory=".")

vectordb.persist()

Computing the vectors is very time-consuming, so this step can take a while (~15 min). However, it only has to be done once.

Ask the model some questions

Now we want to ask the same question as above to the model again – except that this time the model has access to the indexed documents in the database:

from langchain.llms import LlamaCpp
from langchain.chains import RetrievalQA

MIN_DOCS = 1 # only one result from the database

# llm is the LlamaCpp instance created in the baseline example above
qa = RetrievalQA.from_chain_type(llm=llm, retriever=vectordb.as_retriever(search_kwargs={"k": MIN_DOCS}))

query = "What does the company ETECTURE do?"
qa.run(query)

The output this time:

“The company ETECTURE is an IT consulting firm that provides solutions and services for complex business challenges.”

Perfect – I couldn’t have said it more precisely. 😀

Conclusion

As you can see, it is not that complicated to run a local language model on your own computer and extend it for your own purposes. A next step would be to add more sources and perhaps a small plugin or UI.

Whether language models are a big hype or not is something everyone can decide for themselves – in any case you have to admit that language models and the possibilities they offer are quite fascinating, aren’t they?
