LangChain, GPT-4, Embeddings and sqlite-vss as a vector db
I had to switch from Colab to a local instance because Colab kept crashing with sqlite-vss. Since Colab is mainly valuable for its GPU, which we don't need here, I didn't spend time troubleshooting; it's probably a package issue.
The code is also in the repo.
Ask Phrack
In the following, 3.txt is taken from this article:
|=-----------------------------------------------------------------------=|
|=---------------=[ The Art of Exploitation ]=---------------=|
|=-----------------------------------------------------------------------=|
|=------------------=[ Attacking JavaScript Engines ]=-------------------=|
|=--------=[ A case study of JavaScriptCore and CVE-2016-4622 ]=---------=|
|=-----------------------------------------------------------------------=|
|=----------------------------=[ saelo ]=--------------------------------=|
|=-----------------------=[ phrack@saelo.net ]=--------------------------=|
|=-----------------------------------------------------------------------=|
We can build a vector database from this text and then run various searches against it.
Prepare the text
The original text, in the 3.txt file, must first be chunked. Chunking splits it into pieces small enough to compute embeddings for individually.
from langchain.text_splitter import RecursiveCharacterTextSplitter

with open('3.txt') as f:
    js_engines_phrack_21 = f.read()

# Split the article into small, overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    length_function=len
)
chunks = text_splitter.create_documents([js_engines_phrack_21])

# print(chunks[2])
# print(chunks[10].page_content)
print(f'Now you have {len(chunks)} chunks')
Costs
Regrettably, this is all within a locked SaaS world.
def print_embedding_cost(texts):
    import tiktoken
    # Tokenizer for the embedding model used below
    enc = tiktoken.encoding_for_model('text-embedding-ada-002')
    total_tokens = sum(len(enc.encode(page.page_content)) for page in texts)
    print(f'Total Tokens: {total_tokens}')
    # ada-002 pricing at the time of writing: USD 0.0004 per 1K tokens
    print(f'Embedding Cost in USD: {total_tokens / 1000 * 0.0004:.6f}')

print_embedding_cost(chunks)
Self-hosting LLMs requires a GPU, which can get expensive. The field is becoming too proprietary, and openly available models deserve consideration. Here, though, it's an evening project, mostly prototyping.
Vectorization of English text with OpenAI Embeddings
Reference: the Embeddings guide on the OpenAI Platform.
You’ll get a vector of 1536 dimensions per chunk.
The chunk size will be 100 characters.
Adjacent chunks will have a 20 character overlap.
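A minimal sketch of that step, using LangChain's OpenAIEmbeddings wrapper; the variable names are mine, and it assumes OPENAI_API_KEY is set in the environment:

from langchain.embeddings.openai import OpenAIEmbeddings

# Defaults to text-embedding-ada-002, which returns 1536-dimensional vectors
embeddings = OpenAIEmbeddings()

# Embed a single chunk to sanity-check the dimensionality
vector = embeddings.embed_query(chunks[0].page_content)
print(len(vector))  # 1536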
Store Embeddings in a Vector DB
The concept of a vector database is the same across the board. Whether you use PostgreSQL or SQLite mainly depends on your DBMS architecture; here there is only one client, so an embedded SQLite file is enough.
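A sketch of the storage step with LangChain's SQLiteVSS integration, reusing the embeddings object from above; the table and file names are my own choices:

from langchain.vectorstores import SQLiteVSS

# Embed every chunk and store it in a local sqlite-vss database
db = SQLiteVSS.from_texts(
    texts=[chunk.page_content for chunk in chunks],
    embedding=embeddings,
    table='phrack',       # arbitrary table name
    db_file='phrack.db'   # local, single-client database file
)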
Query the vector DB
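Roughly like this, assuming the db object from the previous step; the query string is my example:

# Return the k chunks whose embeddings are closest to the query
results = db.similarity_search('What is SpiderMonkey?', k=4)
for doc in results:
    print(doc.page_content)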
Result:
It looks like an index result, raw matching chunks rather than a synthesized answer.
Vector DB result management with GPT-4 LLM
The "stuff" chain type describes the approach: stuff all retrieved chunks into GPT-4 in a single prompt, fire and forget. That does not work with all kinds of models, since everything has to fit into the context window.
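A sketch of that chain, assuming the db store from above; using GPT-4 as the model and this particular question are my choices:

from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# GPT-4 reads the retrieved chunks and formulates an answer;
# default sampling temperature makes the output non-deterministic
llm = ChatOpenAI(model_name='gpt-4')

# 'stuff' = put all retrieved chunks into a single prompt
chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type='stuff',
    retriever=db.as_retriever()
)
print(chain.run('What is SpiderMonkey?'))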
Result (non-deterministic):
SpiderMonkey is the JavaScript engine used in Mozilla's Firefox browser. It is responsible for interpreting and executing JavaScript code.
This looks pretty good tbh.
Ebooks, txts, etc.
This can also be done with ebooks, PDFs, etc. Give me another free evening, and we'll see about that.