Sunday, June 30, 2024

Splitting & Embedding Text using Langchain

 There are different types of document loaders:

https://python.langchain.com/v0.2/docs/integrations/document_loaders/ 

including, but not limited to:

  • CSV
  • Facebook Chat
  • File Directory
  • HTML
  • PowerPoint
  • Hugging Face
  • Hacker News
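Whatever the source, every loader produces a list of Document objects: a page_content string plus a metadata dict. A minimal pure-Python sketch of that shape (not LangChain's actual classes, just the idea, with a hypothetical load_text_file helper):

```python
import tempfile
from dataclasses import dataclass, field

@dataclass
class Document:
    # Mirrors the shape LangChain loaders return: text plus source metadata
    page_content: str
    metadata: dict = field(default_factory=dict)

def load_text_file(path: str) -> list[Document]:
    # Toy "file loader": one Document per file, tagged with its source path
    with open(path, encoding='utf-8') as f:
        return [Document(page_content=f.read(), metadata={'source': path})]

# Demo with a throwaway file
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('We shall fight on the beaches.')

docs = load_text_file(tmp.name)
print(docs[0].page_content)
print(docs[0].metadata['source'])
```

Every loader on the page linked above ultimately hands you this same structure, which is why the splitting and embedding steps below work the same way regardless of where the text came from.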

from langchain.text_splitter import RecursiveCharacterTextSplitter

with open('files/churchill_speech.txt') as f:
    churchill_speech = f.read()


text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    length_function=len
)

chunks = text_splitter.create_documents([churchill_speech])
# print(chunks[2])
# print(chunks[10].page_content)
print(f'Now you have {len(chunks)} chunks')
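To see what chunk_size and chunk_overlap actually do, here is a toy fixed-width splitter. (RecursiveCharacterTextSplitter is smarter: it prefers to break on paragraph, line, and word boundaries before falling back to raw characters. But the overlap idea is the same.)

```python
def naive_split(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    # Each chunk starts chunk_size - chunk_overlap characters after the
    # previous one, so consecutive chunks share chunk_overlap characters.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = naive_split('We shall fight on the beaches', chunk_size=10, chunk_overlap=4)
print(chunks[0])  # 'We shall f'
print(chunks[1])  # 'll fight o'
print(chunks[0][-4:] == chunks[1][:4])  # the 4-character overlap
```

The overlap means a sentence cut in half at a chunk boundary still appears intact in one of the neighbouring chunks, which improves retrieval quality later.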

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Pinecone  # requires the pinecone-client package

embeddings = OpenAIEmbeddings()

# index_name must refer to an existing Pinecone index
vector_store = Pinecone.from_documents(chunks, embeddings, index_name=index_name)

In a nutshell, from_documents processes the input documents, generates embeddings using the provided OpenAI embeddings instance, and returns a new Pinecone vector store.

The resulting vector store object can perform similarity searches and retrieve relevant documents based on user queries.
(See also: https://www.reddit.com/r/LangChain/comments/18ehcm7/guys_anyone_use_pineconefrom_documents)

With this, we've successfully embedded the text into vectors and inserted them into a Pinecone index.


query = 'Where should we fight?'
result = vector_store.similarity_search(query)
print(result)

The user defines a query. 
The query is embedded into a vector.
A similarity search is performed in the vector database, and the text behind the most similar vectors is the answer to the user's question.
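Those three steps can be sketched end to end in plain Python. The toy "embedding" below is just a word-count vector and the index is a list, stand-ins for OpenAIEmbeddings and Pinecone, but the query flow is identical: embed the query, score it against every stored vector, and return the text behind the best match.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy embedding: bag-of-words counts (real systems use dense model vectors)
    return Counter(re.findall(r'\w+', text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# "Index": each chunk stored alongside its vector
chunks = ['We shall fight on the beaches',
          'We shall never surrender',
          'Blood, toil, tears and sweat']
index = [(c, embed(c)) for c in chunks]

query = 'Where should we fight?'
qv = embed(query)                                      # 1. embed the query
best = max(index, key=lambda item: cosine(qv, item[1]))  # 2. similarity search
print(best[0])                                         # 3. text behind the best vector
```

Pinecone does the same scoring at scale, over millions of vectors, with an approximate nearest-neighbour index instead of a linear scan.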


