Sunday, June 30, 2024

Splitting & Embedding Text Using LangChain

LangChain ships many types of document loaders, including but not limited to:

  • CSV
  • Facebook Chat
  • File directory
  • HTML
  • PowerPoint
  • Hugging Face
  • Hacker News
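
To make the idea of a loader concrete, here is a minimal sketch of what one does: read a source and return documents made of text plus metadata. It uses only the standard library. The Document class and the load_csv function are hypothetical stand-ins written for illustration, not LangChain's own classes, and the sample data is made up.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_csv(file_obj, source="in-memory.csv"):
    """Turn each CSV row into one Document (a simplified illustration)."""
    docs = []
    for row_number, row in enumerate(csv.DictReader(file_obj)):
        # Render the row as "column: value" lines so it reads as plain text.
        text = "\n".join(f"{key}: {value}" for key, value in row.items())
        docs.append(Document(text, {"source": source, "row": row_number}))
    return docs


# Made-up sample data, held in memory so the example needs no file on disk.
sample = io.StringIO("name,role\nAda,engineer\nGrace,admiral\n")
docs = load_csv(sample)
print(docs[0].page_content)
print(docs[1].metadata)
```

Whatever the source format, the output has the same shape, which is why the splitting and embedding steps do not need to care where the text came from.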

from langchain.text_splitter import RecursiveCharacterTextSplitter

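# Read the whole speech into a single string.
with open('files/churchill_speech.txt') as f:
    churchill_speech = f.read()

# No arguments are passed here, so the library defaults apply;
# chunk_size and chunk_overlap can be set to control chunk length.
text_splitter = RecursiveCharacterTextSplitter()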

chunks = text_splitter.create_documents([churchill_speech])
# print(chunks[2])
# print(chunks[10].page_content)
print(f'Now you have {len(chunks)} chunks')
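
For intuition about what a recursive splitter is doing, here is a simplified sketch in plain Python. It is not LangChain's implementation: it ignores chunk overlap, and the function name, separator list, and sample text are assumptions made for illustration. The idea is to try the coarsest separator first, pack pieces into chunks up to the size limit, and fall back to finer separators only for pieces that are still too long.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split text into chunks of at most chunk_size characters (simplified)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to hard cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # The piece still fits, so keep growing the current chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


# Illustrative sample text, not the contents of the speech file.
sample_text = (
    "We shall fight on the beaches.\n\n"
    "We shall fight on the landing grounds.\n\n"
    "We shall never surrender."
)
sample_chunks = recursive_split(sample_text, chunk_size=40)
for chunk in sample_chunks:
    print(len(chunk), repr(chunk))
```

Splitting on paragraph breaks before lines, and lines before words, keeps related sentences together whenever the size limit allows it.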

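from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Pinecone

# Reads the API key from the OPENAI_API_KEY environment variable.
embeddings = OpenAIEmbeddings()

# index_name must hold the name of an existing Pinecone index,
# and the Pinecone client must already be initialized.
vector_store = Pinecone.from_documents(chunks, embeddings, index_name=index_name)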

In a nutshell, Pinecone.from_documents processes the input documents, generates embeddings using the provided OpenAI embeddings instance, and returns a new Pinecone vector store.

The resulting vector store object can perform similarity searches and retrieve relevant documents based on user queries.

With this, we've successfully embedded the text into vectors and inserted them into a Pinecone index.
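
The indexing step can be pictured with a toy, in-memory version. This is only a sketch under loud assumptions: toy_embed is a hypothetical hashed bag-of-words function standing in for a real embedding model, and build_index is a plain list standing in for a vector database. Neither reflects how OpenAI or Pinecone work internally.

```python
import hashlib
import math


def toy_embed(text, dim=8):
    """Hypothetical embedding: hashed bag-of-words, scaled to unit length."""
    vector = [0.0] * dim
    for word in text.lower().split():
        # A stable hash picks which dimension each word contributes to.
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vector[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else vector


def build_index(texts, embed):
    """Embed every chunk and store it as an (id, vector, text) record."""
    return [
        {"id": f"chunk-{i}", "vector": embed(text), "text": text}
        for i, text in enumerate(texts)
    ]


# Illustrative sample chunks.
sample_chunks = ["we shall fight on the beaches", "we shall never surrender"]
index = build_index(sample_chunks, toy_embed)
print(len(index), len(index[0]["vector"]))
```

Each record keeps the original text next to its vector, which is what lets a later search return readable passages rather than raw numbers.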

query = 'Where should we fight?'
result = vector_store.similarity_search(query)

The user defines a query. 
The query is embedded into a vector.
A similarity search is performed in the vector database, and the text behind the most similar vectors is returned as the passages most relevant to the user's question.
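
The three steps above can be sketched end to end in plain Python. Everything here is an assumption made for illustration: embed is a hypothetical word-count embedding, so it only captures word overlap rather than meaning, and the sample texts are made up. A real setup would use a learned embedding model and a vector database instead.

```python
import math
from collections import Counter


def embed(text):
    """Hypothetical embedding: a sparse word-count vector."""
    words = (w.strip(".,?!").lower() for w in text.split())
    return Counter(w for w in words if w)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_search(query, texts, k=2):
    """Return the k texts whose vectors are closest to the query vector."""
    query_vector = embed(query)  # step 2: embed the query
    scored = [(cosine(query_vector, embed(t)), t) for t in texts]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # step 3: rank by similarity
    return [text for _, text in scored[:k]]


# Illustrative sample texts.
texts = [
    "We shall fight on the beaches.",
    "The weather was mild that week.",
    "We shall fight in the hills.",
]
top = similarity_search("Where should we fight?", texts, k=2)  # step 1: the query
print(top)
```

The unrelated sentence about the weather shares no words with the query, so it scores zero and is left out of the results.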

