LangChain, GPT-4, Embeddings and sqlite-vss as a vector db

I needed to switch from Colab to a local instance because Colab keeps crashing with sqlite-vss. Since Colab is primarily known for its GPU capabilities, which we don't need here, I didn't spend time troubleshooting the issue thoroughly. It's probably a package issue.

 

https://github.com/norandom/project_bookworm/tree/main

The code is also within the repo.

 

Ask Phrack

 

In the following, 3.txt contains the text of this Phrack article:

|=-----------------------------------------------------------------------=|
|=---------------=[ The Art of Exploitation ]=---------------=|
|=-----------------------------------------------------------------------=|
|=------------------=[ Attacking JavaScript Engines ]=-------------------=|
|=--------=[ A case study of JavaScriptCore and CVE-2016-4622 ]=---------=|
|=-----------------------------------------------------------------------=|
|=----------------------------=[ saelo ]=--------------------------------=|
|=-----------------------=[ phrack@saelo.net ]=--------------------------=|
|=-----------------------------------------------------------------------=|

 

We can create a vector database from this text and then run similarity searches against it.

 

Prepare the text

The original text, stored in the 3.txt file, must first be split into chunks. The chunks are what get turned into embeddings.

from langchain.text_splitter import RecursiveCharacterTextSplitter

with open('3.txt') as f:
    js_engines_phrack_21 = f.read()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    length_function=len
)

chunks = text_splitter.create_documents([js_engines_phrack_21])

# print(chunks[2])
# print(chunks[10].page_content)
print(f'Now you have {len(chunks)} chunks')

 

Costs

Regrettably, this is all within a locked SaaS world.

def print_embedding_cost(texts):
    import tiktoken
    enc = tiktoken.encoding_for_model('text-embedding-ada-002')
    total_tokens = sum([len(enc.encode(page.page_content)) for page in texts])
    print(f'Total Tokens: {total_tokens}')
    print(f'Embedding Cost in USD: {total_tokens / 1000 * 0.0004:.6f}')

print_embedding_cost(chunks)

Self-hosting LLMs requires a GPU, which can get expensive. In my opinion the field is becoming too proprietary, and we should consider using openly available models only.

Here, it’s an evening project, mostly prototyping.

 

Vectorization of English text with OpenAI Embeddings

See: Embeddings on the OpenAI Platform

 

  • You’ll get a vector of 1536 dimensions per chunk (see the sketch below this list).

  • The chunk size will be 100 characters.

  • Adjacent chunks will have a 20-character overlap.
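
For illustration, here's a minimal sketch of the embedding call, assuming the classic LangChain API and an OPENAI_API_KEY in the environment (the variable names follow the chunking code above):

from langchain.embeddings import OpenAIEmbeddings

# assumes OPENAI_API_KEY is set in the environment
embeddings = OpenAIEmbeddings(model='text-embedding-ada-002')

# embed a single chunk and check the dimensionality
vector = embeddings.embed_query(chunks[0].page_content)
print(len(vector))  # 1536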

 

Prep and chunk the Phrack txt

The chunking code from the “Prepare the text” section above produces the documents that get embedded here.

 

Store Embeddings in a Vector DB

The concept of a vector database is the same across the board. Whether you use PostgreSQL or SQLite mainly depends on your DBMS architecture. Here there is only one client, so an embedded SQLite file is enough.
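
A minimal sketch with LangChain's SQLiteVSS wrapper; the table and file names are my own choice, not from the original notebook:

from langchain.vectorstores import SQLiteVSS

# requires the sqlite-vss package; table and db_file are arbitrary names
db = SQLiteVSS.from_texts(
    texts=[chunk.page_content for chunk in chunks],
    embedding=embeddings,
    table='phrack',
    db_file='/tmp/phrack_vss.db',
)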

 

Query the vector DB
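
A sketch of the lookup, assuming the db handle from the previous step; the question itself is a hypothetical example:

# nearest-neighbour search over the stored embeddings
query = 'What is SpiderMonkey?'
docs = db.similarity_search(query, k=4)

for doc in docs:
    print(doc.page_content)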

Result:

  • looks like an index result: the raw nearest chunks, not a synthesized answer

 

Vector DB result management with GPT-4 LLM

  • the “stuff” chain type describes the approach: stuff the retrieved chunks into GPT-4, fire and forget (sketched below). That does not work with all kinds of models
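
A minimal sketch of such a “stuff” chain, assuming the classic RetrievalQA API; the model name, temperature and k are my assumptions:

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# "stuff" concatenates all retrieved chunks into a single GPT-4 prompt
llm = ChatOpenAI(model_name='gpt-4', temperature=0)
retriever = db.as_retriever(search_kwargs={'k': 4})

chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type='stuff',
    retriever=retriever,
)

print(chain.run('What is SpiderMonkey?'))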

 

Result (non-deterministic):

SpiderMonkey is the JavaScript engine used in Mozilla's Firefox browser. It is responsible for interpreting and executing JavaScript code.

 

This looks pretty good tbh.

 

 

Ebooks, txts, etc.

This can also be done with ebooks, PDFs, etc. Give me another free evening, and we’ll see about that.