Sunday, June 30, 2024

Splitting & Embedding Text using Langchain

 There are different types of document loaders:

https://python.langchain.com/v0.2/docs/integrations/document_loaders/ 

including, but not limited to:

  • CSV
  • Facebook Chat
  • File Directory
  • HTML
  • PowerPoint
  • Hugging Face
  • Hacker News
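Whatever the source, every loader produces a list of Document objects: a page_content string plus a metadata dict. A minimal pure-Python sketch of that shape (not LangChain's actual classes, just the idea, with a hypothetical load_text_file helper):

```python
import tempfile
from dataclasses import dataclass, field

@dataclass
class Document:
    # Mirrors the shape LangChain loaders return: text plus source metadata
    page_content: str
    metadata: dict = field(default_factory=dict)

def load_text_file(path: str) -> list[Document]:
    # Toy "file loader": one Document per file, tagged with its source path
    with open(path, encoding='utf-8') as f:
        return [Document(page_content=f.read(), metadata={'source': path})]

# Demo with a throwaway file
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('We shall fight on the beaches.')

docs = load_text_file(tmp.name)
print(docs[0].page_content)
print(docs[0].metadata['source'])
```

Every loader on the page linked above ultimately hands you this same structure, which is why the splitting and embedding steps below work the same way regardless of where the text came from.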

from langchain.text_splitter import RecursiveCharacterTextSplitter

with open('files/churchill_speech.txt') as f:
    churchill_speech = f.read()


text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    length_function=len
)

chunks = text_splitter.create_documents([churchill_speech])
# print(chunks[2])
# print(chunks[10].page_content)
print(f'Now you have {len(chunks)} chunks')
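To see what chunk_size and chunk_overlap actually do, here is a toy fixed-width splitter. (RecursiveCharacterTextSplitter is smarter: it prefers to break on paragraph, line, and word boundaries before falling back to raw characters. But the overlap idea is the same.)

```python
def naive_split(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    # Each chunk starts chunk_size - chunk_overlap characters after the
    # previous one, so consecutive chunks share chunk_overlap characters.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = naive_split('We shall fight on the beaches', chunk_size=10, chunk_overlap=4)
print(chunks[0])  # 'We shall f'
print(chunks[1])  # 'll fight o'
print(chunks[0][-4:] == chunks[1][:4])  # the 4-character overlap
```

The overlap means a sentence cut in half at a chunk boundary still appears intact in one of the neighbouring chunks, which improves retrieval quality later.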

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Pinecone  # requires the pinecone-client package

embeddings = OpenAIEmbeddings()

# index_name must refer to an existing Pinecone index
vector_store = Pinecone.from_documents(chunks, embeddings, index_name=index_name)

In a nutshell, from_documents processes the input documents, generates embeddings using the provided OpenAI embeddings instance, and returns a new Pinecone vector store.

The resulting vector store object can perform similarity searches and retrieve relevant documents based on user queries.
(See also: https://www.reddit.com/r/LangChain/comments/18ehcm7/guys_anyone_use_pineconefrom_documents)

With this, we've successfully embedded the text into vectors and inserted them into a Pinecone index.


query = 'Where should we fight?'
result = vector_store.similarity_search(query)
print(result)

The user defines a query. 
The query is embedded into a vector.
A similarity search is performed in the vector database, and the text behind the most similar vectors is the answer to the user's question.
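Those three steps can be sketched end to end in plain Python. The toy "embedding" below is just a word-count vector and the index is a list, stand-ins for OpenAIEmbeddings and Pinecone, but the query flow is identical: embed the query, score it against every stored vector, and return the text behind the best match.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy embedding: bag-of-words counts (real systems use dense model vectors)
    return Counter(re.findall(r'\w+', text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# "Index": each chunk stored alongside its vector
chunks = ['We shall fight on the beaches',
          'We shall never surrender',
          'Blood, toil, tears and sweat']
index = [(c, embed(c)) for c in chunks]

query = 'Where should we fight?'
qv = embed(query)                                      # 1. embed the query
best = max(index, key=lambda item: cosine(qv, item[1]))  # 2. similarity search
print(best[0])                                         # 3. text behind the best vector
```

Pinecone does the same scoring at scale, over millions of vectors, with an approximate nearest-neighbour index instead of a linear scan.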


