Sunday, June 30, 2024

Vector Databases: What, Why, How

What are vector databases?

They are a new type of database designed to store and query unstructured data.

Unstructured data is data that does not have a fixed schema, such as text, images and audio.

Examples of vector databases: https://lakefs.io/blog/12-vector-databases-2023/

  • Pinecone
  • Milvus
  • Chroma


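Three steps in how a vector database works

  • 1) Embedding: an embedding model converts each item (text, image, audio) into a vector of numbers.
  • 2) Indexing: the vectors are stored in a structure built for fast similarity search.
  • 3) Querying: the query is embedded the same way and the most similar stored vectors are returned.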

https://www.pinecone.io/learn/vector-database/ 
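
The three steps above can be sketched in a few lines of plain Python. This is only a toy illustration, not how Pinecone or any other vector database is implemented: the word-count "embedding", the VOCAB list, the sample documents and the function names are all made up for this example, and the query step is a simple linear scan, whereas production systems typically use an approximate nearest-neighbour index.

```python
import math
from collections import Counter

# Hypothetical toy vocabulary; a real system would use a learned embedding model.
VOCAB = ["cat", "dog", "pet", "car", "engine", "road"]


def embed(text):
    """Step 1 (embedding): map text to a fixed-length vector of word counts."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCAB]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Step 2 (indexing): store each vector under an id (here just a dict).
documents = {
    "doc1": "cat and dog are a pet",
    "doc2": "car engine on the road",
    "doc3": "dog on the road",
}
index = {doc_id: embed(text) for doc_id, text in documents.items()}


def query(text, top_k=2):
    """Step 3 (querying): rank stored vectors by similarity to the query vector."""
    q = embed(text)
    scored = [(doc_id, cosine(q, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


results = query("pet dog")
print(results)
```

Here the query "pet dog" ranks doc1 first and doc3 second, because they share the most words with the query. With a real embedding model the same ranking logic applies, but similar meanings (not just shared words) end up close together.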


Why do we need them?

Vector databases are useful for storing and querying complex, unstructured data, such as images, audio, and user preferences, that traditional databases may struggle with. They let developers retrieve data quickly and simply by finding similarities between data points. They also support semantic search, which considers the context and meaning of a search query rather than just matching exact words or phrases. This can lead to more relevant and accurate search results.

How to use them?

https://www.youtube.com/watch?v=AGKY_Q3GjRc&list=PLRLVhGQeJDTLiw-ZJpgUtZW-bseS2gq9- 

Using Pinecone in production:

https://www.youtube.com/watch?v=fo0F-DAum7E&pp=ygUWcGluZWNvbmUgaW4gcHJvZHVjdGlvbg%3D%3D


Code : http://hilite.me/

# authenticating to Pinecone. 
# the API KEY is in .env
import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv(), override=True)



from pinecone import Pinecone, ServerlessSpec
# Initializing and authenticating the Pinecone client
pc = Pinecone()
# pc = Pinecone(api_key='YOUR_API_KEY')
# checking authentication
pc.list_indexes()

##################################
# Working with pinecone indexes
#################################

# listing all indexes
pc.list_indexes()

index_name = 'langchain'
# getting a complete description of a specific index:
pc.describe_index(index_name)

# getting a list with the index names 
pc.list_indexes().names()

# deleting an index
if index_name in pc.list_indexes().names():
    print(f'Deleting index {index_name} ... ')
    pc.delete_index(index_name)
    print('Done')
else:
    print(f'Index {index_name} does not exist!')


# creating a Serverless Pinecone index 
# starter free plan permits 1 project, up to 5 indexes, up to 100 namespaces per index
index_name = 'langchain'

if index_name not in pc.list_indexes().names():
    print(f'Creating index {index_name}')
    pc.create_index(
        name=index_name,
        dimension=1536,
        metric='cosine',
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"
        ) 
    )
    print('Index created! 😊')
else:
    print(f'Index {index_name} already exists!')

############################
# working with vectors
############################
index = pc.Index(index_name)
index.describe_index_stats()

# inserting vectors
import random
vectors = [[random.random() for _ in range(1536)] for v in range(5)]
# print(vectors)
# ... above code generates 5 vectors with 1536 dimensions
ids = list('abcde')

index_name = 'langchain'
index = pc.Index(index_name)

index.upsert(vectors=zip(ids, vectors))

# updating vectors
index.upsert(vectors=[('c', [0.5] * 1536)])

# fetching vectors
# index = pc.Index(index_name)
index.fetch(ids=['c', 'd'])

# deleting vectors
index.delete(ids=['b', 'c'])

index.describe_index_stats()

# querying a non-existing vector returns an empty vector
index.fetch(ids=['x'])

# querying vectors
query_vector = [random.random() for _ in range(1536)]

index.query(
    vector=query_vector,
    top_k=3,
    include_values=False
)

############################
# Namespaces
############################
# index.describe_index_stats()
index = pc.Index('langchain')

import random
vectors = [[random.random() for _ in range(1536)] for v in range(5)]
ids = list('abcde')
index.upsert(vectors=zip(ids, vectors))

# partition the index into namespaces
# creating a new namespace
vectors = [[random.random() for _ in range(1536)] for v in range(3)]
ids = list('xyz')
index.upsert(vectors=zip(ids, vectors), namespace='first-namespace')

vectors = [[random.random() for _ in range(1536)] for v in range(2)]
ids = list('qp')
index.upsert(vectors=zip(ids, vectors), namespace='second-namespace')

index.describe_index_stats()

index.fetch(ids=['x'])
index.fetch(ids=['x'], namespace='first-namespace')

index.delete(ids=['x'], namespace='first-namespace')

index.delete(delete_all=True, namespace='first-namespace')
index.describe_index_stats()
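
To make the vector and namespace operations above easier to picture, here is a rough in-memory model of them that runs without a Pinecone account. ToyIndex and everything inside it is hypothetical and written only for this post; it is not Pinecone's implementation or API. It just mimics the behaviour seen above: upsert inserts or overwrites by id, each namespace is a separate partition, and fetch or query only looks inside the namespace you pass (the default namespace when you pass none).

```python
import math
from collections import defaultdict


class ToyIndex:
    """Hypothetical in-memory stand-in for a vector index (not Pinecone's code)."""

    def __init__(self, dimension):
        self.dimension = dimension
        # namespace name -> {vector id -> values}; '' plays the role of the default namespace
        self.namespaces = defaultdict(dict)

    def upsert(self, vectors, namespace=""):
        for vec_id, values in vectors:
            if len(values) != self.dimension:
                raise ValueError(f"expected {self.dimension} values, got {len(values)}")
            # an existing id in the same namespace is overwritten, otherwise it is inserted
            self.namespaces[namespace][vec_id] = list(values)

    def fetch(self, ids, namespace=""):
        store = self.namespaces.get(namespace, {})
        # ids that are not in this namespace are simply left out of the result
        return {i: store[i] for i in ids if i in store}

    def delete(self, ids=None, delete_all=False, namespace=""):
        store = self.namespaces.get(namespace, {})
        if delete_all:
            store.clear()
        else:
            for i in ids or []:
                store.pop(i, None)

    def query(self, vector, top_k, namespace=""):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0

        store = self.namespaces.get(namespace, {})
        scored = sorted(((cosine(vector, v), i) for i, v in store.items()), reverse=True)
        return [(i, score) for score, i in scored[:top_k]]


toy = ToyIndex(dimension=3)
toy.upsert([("a", [1.0, 0.0, 0.0]), ("b", [0.0, 1.0, 0.0])])
toy.upsert([("x", [1.0, 1.0, 0.0])], namespace="first-namespace")
toy.upsert([("a", [0.0, 0.0, 1.0])])  # same id, same namespace: overwrites 'a'

default_hit = toy.fetch(["x"])                          # 'x' is not in the default namespace
ns_hit = toy.fetch(["x"], namespace="first-namespace")  # found in its own namespace
nearest = toy.query([0.0, 0.9, 0.1], top_k=1)           # closest vector in the default namespace
```

This is why index.fetch(ids=['x']) in the code above comes back empty while the same call with namespace='first-namespace' returns the vector: the id only exists inside that namespace.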

