Set up local chat with our own data

Let's set up Llama 2 locally using Python so that we can ask English questions about our own data. It's like having ChatGPT, but without requiring an internet connection or API keys.

Disclaimer: This article describes my experience following the steps in this valuable blog. I had an issue with the library version, but it is resolved here.

Prerequisites

  1. Install Python and pip.
    Verify the installation by running python --version and pip --version.
    I am using Python 3.11.4 and pip 23.2.1.

  2. Install a C++ compiler.

  3. Install llama-cpp-python by running pip install llama-cpp-python==0.1.78 (newer versions didn't work, as explained here).

  4. Install langchain, sentence-transformers, faiss-cpu, and ctransformers packages:
    pip install langchain sentence-transformers faiss-cpu ctransformers
    Verify their versions by running
    pip show langchain sentence-transformers faiss-cpu ctransformers
    I am using versions 0.0.310, 2.2.2, 1.7.4, and 0.2.27 respectively (see the sanity check after this list).
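
Optionally, you can confirm that everything installed cleanly with a short script. This is just a sanity check using the standard library; it prints the installed version of each package:

from importlib.metadata import version

# print the installed version of each required package
for pkg in ("llama-cpp-python", "langchain", "sentence-transformers",
            "faiss-cpu", "ctransformers"):
    print(pkg, version(pkg))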

Step 1: Download the LLM

Download the LLM file llama-2-7b-chat.ggmlv3.q8_0.bin (7 GB, the largest). You can choose a smaller, faster model from Llama-2-7B-Chat-GGML/tree/main that might be good enough for simpler applications. The difference between the quantization levels is explained here.
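
If you prefer to script the download instead of fetching the file manually, here is a minimal sketch using the huggingface_hub package (not needed elsewhere in this article). The repo_id is my assumption based on the repository linked above; adjust it and the filename to the model you picked:

# assumes: pip install huggingface_hub
from huggingface_hub import hf_hub_download

# download the quantized model file into the current directory
path = hf_hub_download(repo_id="TheBloke/Llama-2-7B-Chat-GGML",
                       filename="llama-2-7b-chat.ggmlv3.q8_0.bin",
                       local_dir=".")
print("Model saved to", path)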

Step 2: Verify it by running a prompt

Run the following Python script:

from llama_cpp import Llama
import time

# start the timer
start = time.time()

# load the large language model file, use higher n_ctx for more complex questions
LLM = Llama(model_path="./llama-2-7b-chat.ggmlv3.q8_0.bin", n_ctx=512, seed=1337)

# create a text prompt
prompt = "Question: Why did we change file fingerprint from MD5 to SHA3-256? Answer in 2 sentences or less:"

# generate a response, use higher max_tokens or 0 for longer response
output = LLM(prompt, max_tokens=256)

# display the response
print(output["choices"][0]["text"])

# display the time taken
end = time.time()
print('Time taken', end - start)

The responses to the question "Why did we change file fingerprint from MD5 to SHA3-256?" when run on a 1.8 GHz Core i7 CPU with 16 GB RAM are as follows.

Using the largest model (7 GB), llama-2-7b-chat.ggmlv3.q8_0.bin:

We changed the file fingerprint algorithm from MD5 to SHA3-256 because MD5 is vulnerable to collision attacks, whereas SHA3-256 is more secure. Additionally, SHA3-256 produces a larger and more unique output, which provides better protection against tampering and modification of files.

Time taken 36.86833596229553

Using the mid-size model (4 GB), llama-2-7b-chat.ggmlv3.q4_1.bin:

We changed the file fingerprint algorithm from MD5 to SHA3-256 because MD5 has been shown to be vulnerable to collision attacks, which could allow an attacker to create a modified file that appears to have the same fingerprint as the original file. By using SHA3-256 instead of MD5, we can provide stronger cryptographic protection against these types of attacks.

Time taken 33.310399532318115

Using the smallest model (2.8 GB), llama-2-7b-chat.ggmlv3.q2_K.bin:

File fingerprinting is a technique used to verify the integrity of a file by calculating its digital "fingerprint." Historically, MD5 has been the most commonly used hash function for this purpose. However, with growing concerns over MD5's security vulnerabilities and potential collisions, SHA-3 has become the preferred choice for many organizations today.

Time taken 21.492518424987793

The AI-generated responses above contain only information built into the model itself. For further parameter tuning, check the llama-cpp-python API reference; a small example of tweaking the sampling parameters follows.
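
For example, the call from Step 2 can be tuned with a few common sampling parameters. These keyword arguments exist in llama-cpp-python 0.1.x; the values below are only illustrative, not recommendations:

# same call as in Step 2, with a few common sampling parameters
output = LLM(prompt,
             max_tokens=256,
             temperature=0.2,     # lower = more deterministic answers
             top_p=0.9,           # nucleus sampling cutoff
             stop=["Question:"])  # stop before the model invents a new question
print(output["choices"][0]["text"])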

Step 3: Prepare your own custom data

Create one or more text files containing the custom data. Example:

Engineering

Knowledge Transfer - File Fingerprint Migration, MD5 to SHA3-256

Background
MD5 is largely seen as an insecure hash algorithm because a hash collision
can be engineered relatively easily. The motivation to shift file fingerprint
from MD5 to SHA3-256 is because the optics of using MD5 in any part of
Kiteworks is not good and not because of an exploitable vulnerability.
File fingerprint is exposed to the end user via mail attachments in the Web UI and is
thus the most visible part of Kiteworks that still uses MD5 hence having
higher priority in migration.

Evaluation of Hash Algorithms
In exploring new hash algorithms to replace MD5, benchmarking tests were
performed.

Details can be found in: Hash Algorithm Performance Comparison

SHA3-256 was chosen to replace MD5 based on performance, lower probability
of it being broken over the next few years and industry-wide acceptance.
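
To preview how Step 4 will slice this file into overlapping chunks, you can run the splitter on its own. A minimal sketch, assuming the text above was saved as custom_data.txt (a hypothetical name; any *.txt file in the current directory works):

from langchain.text_splitter import RecursiveCharacterTextSplitter

# preview the ~500-character chunks that Step 4 will embed
with open("custom_data.txt") as f:
    text = f.read()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
for i, chunk in enumerate(splitter.split_text(text)):
    print(i, len(chunk), repr(chunk[:60]))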

Step 4: Digest the custom data

Run the following Python script:

"""
This script creates a database of information gathered from local text files.
"""

from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

import time

# start the timer
start = time.time()

# define what documents to load
loader = DirectoryLoader("./", glob="*.txt", loader_cls=TextLoader)

# interpret information in the documents
documents = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500,
                                          chunk_overlap=50)
texts = splitter.split_documents(documents)
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={'device': 'cpu'})

# create and save the local database
db = FAISS.from_documents(texts, embeddings)
db.save_local("faiss")

# display the time taken
end = time.time()
print('Time taken', end - start)

It took 13 seconds on my local machine to digest that one custom data file.
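
Before wiring the index into the chat, you can query it directly to confirm it returns sensible chunks. A minimal sketch, assuming it runs right after the script above (so embeddings is still defined and the index was saved to ./faiss):

# optional sanity check: search the saved index directly
db = FAISS.load_local("faiss", embeddings)
for doc in db.similarity_search("MD5 to SHA3-256", k=2):
    print(doc.metadata.get("source"), ":", doc.page_content[:80])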

Step 5: Ask a question about custom data

Run the following Python script:

"""
This script reads the database of information from local text files
and uses a large language model to answer questions about their content.
"""

from langchain.llms import CTransformers
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain import PromptTemplate
from langchain.chains import RetrievalQA
import time

# start the timer
start = time.time()

# prepare the template we will use when prompting the AI
template = """Use the following pieces of information to answer the user's question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Context: {context}
Question: {question}
Only return the helpful answer below and nothing else.
Helpful answer:
"""

# load the language model
llm = CTransformers(model='./llama-2-7b-chat.ggmlv3.q8_0.bin',
                    model_type='llama',
                    config={'max_new_tokens': 1024, 'temperature': 0.01})

# load the interpreted information from the local database
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={'device': 'cpu'})
db = FAISS.load_local("faiss", embeddings)

# prepare a version of the llm pre-loaded with the local content
retriever = db.as_retriever(search_kwargs={'k': 2})
prompt = PromptTemplate(
    template=template,
    input_variables=['context', 'question'])
qa_llm = RetrievalQA.from_chain_type(llm=llm,
                                     chain_type='stuff',
                                     retriever=retriever,
                                     return_source_documents=True,
                                     chain_type_kwargs={'prompt': prompt})

# ask the AI chat about information in our local files
prompt = "Why did we change file fingerprint from MD5 to SHA3-256?"
output = qa_llm({'query': prompt})
print(output["result"])

# display the time taken
end = time.time()
print('Time taken', end - start)

The response to the same question as in Step 2 is as follows:

We changed file fingerprint from MD5 to SHA3-256 because it is considered more secure than MD5, which has been shown to be vulnerable to collisions. While there are no known exploitable vulnerabilities in MD5 used in Kiteworks, the optics of using an insecure hash algorithm are not good and we want to prioritize security and trustworthiness in our product.

Time taken 84.27316904067993

Note that the answer now uses the information from the custom data!
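
Because the chain was created with return_source_documents=True, the script can also show which chunks of the custom data the answer was grounded in. A small optional addition at the end of the Step 5 script:

# show which chunks of the custom data were retrieved for the answer
for doc in output["source_documents"]:
    print(doc.metadata.get("source"), ":", doc.page_content[:80])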