LangChain, GPT-4, Embeddings and sqlite-vss as a vector db
I had to switch from Colab to a local instance because Colab kept crashing with sqlite-vss. Since Colab is mainly valuable for its GPU, which we don't need here, I didn't spend time troubleshooting; it's probably a package issue.
The code is also in the repo.
Ask Phrack
In the following, 3.txt is taken from this article:
|=-----------------------------------------------------------------------=|
|=---------------=[ The Art of Exploitation ]=---------------=|
|=-----------------------------------------------------------------------=|
|=------------------=[ Attacking JavaScript Engines ]=-------------------=|
|=--------=[ A case study of JavaScriptCore and CVE-2016-4622 ]=---------=|
|=-----------------------------------------------------------------------=|
|=----------------------------=[ saelo ]=--------------------------------=|
|=-----------------------=[ phrack@saelo.net ]=--------------------------=|
|=-----------------------------------------------------------------------=|
We can build a vector database from this text and then run various searches against it.
Prepare the text
The original text, in the 3.txt file, must first be chunked. Chunking splits it into pieces small enough to compute embeddings for individually.
from langchain.text_splitter import RecursiveCharacterTextSplitter

with open('3.txt') as f:
    js_engines_phrack_21 = f.read()

# Split the article into small, overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    length_function=len
)
chunks = text_splitter.create_documents([js_engines_phrack_21])

# print(chunks[2])
# print(chunks[10].page_content)
print(f'Now you have {len(chunks)} chunks')
Costs
Regrettably, this is all within a locked SaaS world.
def print_embedding_cost(texts):
    import tiktoken
    # Tokenizer for the embedding model used below
    enc = tiktoken.encoding_for_model('text-embedding-ada-002')
    total_tokens = sum(len(enc.encode(page.page_content)) for page in texts)
    print(f'Total Tokens: {total_tokens}')
    # ada-002 pricing at the time of writing: USD 0.0004 per 1K tokens
    print(f'Embedding Cost in USD: {total_tokens / 1000 * 0.0004:.6f}')

print_embedding_cost(chunks)
Self-hosting LLMs requires a GPU, which can get expensive. The field is becoming too proprietary, and openly available models deserve consideration. Here, though, it's an evening project, mostly prototyping.
Vectorization of English text with OpenAI Embeddings
Reference: the Embeddings guide on the OpenAI Platform.
You’ll get a vector of 1536 dimensions per chunk.
The chunk size will be 100 characters.
Adjacent chunks will have a 20 character overlap.
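A minimal sketch of that step, using LangChain's OpenAIEmbeddings wrapper; the variable names are mine, and it assumes OPENAI_API_KEY is set in the environment:

from langchain.embeddings.openai import OpenAIEmbeddings

# Defaults to text-embedding-ada-002, which returns 1536-dimensional vectors
embeddings = OpenAIEmbeddings()

# Embed a single chunk to sanity-check the dimensionality
vector = embeddings.embed_query(chunks[0].page_content)
print(len(vector))  # 1536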
Store Embeddings in a Vector DB
The concept of a vector database is the same across the board. Whether you use PostgreSQL or SQLite mainly depends on your DBMS architecture; here there is only one client, so an embedded SQLite file is enough.
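A sketch of the storage step with LangChain's SQLiteVSS integration, reusing the embeddings object from above; the table and file names are my own choices:

from langchain.vectorstores import SQLiteVSS

# Embed every chunk and store it in a local sqlite-vss database
db = SQLiteVSS.from_texts(
    texts=[chunk.page_content for chunk in chunks],
    embedding=embeddings,
    table='phrack',       # arbitrary table name
    db_file='phrack.db'   # local, single-client database file
)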
Query the vector DB
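Roughly like this, assuming the db object from the previous step; the query string is my example:

# Return the k chunks whose embeddings are closest to the query
results = db.similarity_search('What is SpiderMonkey?', k=4)
for doc in results:
    print(doc.page_content)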
Result:
It looks like an index result, raw matching chunks rather than a synthesized answer.
Vector DB result management with GPT-4 LLM
The "stuff" chain type describes the approach: stuff all retrieved chunks into GPT-4 in a single prompt, fire and forget. That does not work with all kinds of models, since everything has to fit into the context window.
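A sketch of that chain, assuming the db store from above; using GPT-4 as the model and this particular question are my choices:

from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# GPT-4 reads the retrieved chunks and formulates an answer;
# default sampling temperature makes the output non-deterministic
llm = ChatOpenAI(model_name='gpt-4')

# 'stuff' = put all retrieved chunks into a single prompt
chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type='stuff',
    retriever=db.as_retriever()
)
print(chain.run('What is SpiderMonkey?'))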
Result (non-deterministic):
SpiderMonkey is the JavaScript engine used in Mozilla's Firefox browser. It is responsible for interpreting and executing JavaScript code.
This looks pretty good tbh.
Ebooks, txts, etc.
This can also be done with ebooks, PDFs, etc. Give me another free evening, and we'll see about that.