Upstash Vector, a vector database built for scalable similarity search, has vectorized 11 million Wikipedia articles into roughly 144 million indexed vectors. The goal of the project is to build a semantic search engine and a RAG (Retrieval-Augmented Generation) chatbot on top of Wikipedia data. Wikipedia was chosen as the data source because of its rich information base and easy access.
The team downloaded the large-scale Wikipedia dumps, cleaned the text, split it into manageable paragraphs, and embedded those paragraphs with the BGE-M3 model hosted by Upstash. The embedding process took nearly a week and ultimately produced approximately 144 million vectors covering 11 languages (English, German, French, Russian, Spanish, Italian, Chinese, Japanese, Portuguese, Persian, and Turkish). These vectors were indexed in Upstash Vector to enable efficient semantic search.
Upstash Vector is a vector database provided by Upstash, designed for efficient similarity search. It is mainly used to process and query large amounts of vectorized data. Vectorization is the process of converting text, images, or other data into digital vectors so that similarity comparisons can be made between these vectors. Upstash Vector provides the following key features:
Vectorized storage and query
It can store millions or even billions of vectors and supports efficient similarity search, which is essential for applications that need to find similar content quickly.
Namespace
Supports the use of namespaces to manage different data sets, making it possible to isolate and organize different data in the same database.
Metadata filtering
Supports metadata-based filtering, making queries more flexible and accurate.
Built-in embedding models
Pre-trained embedding models are provided, so users can vectorize their data directly without having to host or run an embedding model themselves.
Problems that Upstash Vector solves
- Semantic search over large-scale data: Traditional keyword matching struggles to capture context and semantic relationships. Upstash Vector uses vector embeddings to enable semantic search, improving accuracy, especially for natural-language queries. For example, a keyword search for "dog" may miss documents that only mention "puppy", while a semantic search understands the meaning behind the query and also surfaces related content such as "puppy" or "pet".
- Cross-language support: With multilingual embedding models, Upstash Vector can process and query content in multiple languages, which is very useful for applications serving multilingual users around the world. For example, a Chinese query for "the highest mountain in the world" can also find information about "Mount Everest" written in English.
- Efficient processing of large-scale data: Traditional databases and search engines may hit performance bottlenecks on data sets as large as Wikipedia. Upstash Vector is designed to store and query huge numbers of vectors efficiently, keeping responses fast and accurate even as the data grows.
- Simplified application development: By providing built-in models and simple APIs, Upstash Vector streamlines the process of building complex applications such as semantic search engines and chatbots, letting developers focus on business logic rather than low-level technical details.
Main Features
Vectorized storage and retrieval
- Storing multi-dimensional vectors: Upstash Vector can store large-scale, high-dimensional vector data and supports efficient storage and retrieval.
- Similarity search: It can search across large numbers of vectors and quickly find those most similar to a query vector. This means you can convert information such as text and images into vectors and then quickly find other, similar content.
- Application scenarios: This is very important for applications that need to process massive amounts of data and quickly find relevant results, such as recommendation systems, image search, text retrieval, etc.
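A minimal sketch of storing and querying raw vectors with the @upstash/vector TypeScript SDK (toy 4-dimensional vectors for illustration; a real index has a fixed dimension such as 1024, and the IDs and metadata here are made up):

```typescript
import { Index } from "@upstash/vector";

// Credentials come from the Upstash console.
const index = new Index({
  url: process.env.UPSTASH_VECTOR_REST_URL!,
  token: process.env.UPSTASH_VECTOR_REST_TOKEN!,
});

async function storeAndSearch() {
  // Store vectors with IDs and optional metadata.
  await index.upsert([
    { id: "doc-1", vector: [0.9, 0.1, 0.1, 0.2], metadata: { title: "Mount Everest" } },
    { id: "doc-2", vector: [0.1, 0.9, 0.2, 0.1], metadata: { title: "Pacific Ocean" } },
  ]);

  // Find the vectors most similar to a query vector.
  const results = await index.query({
    vector: [0.85, 0.15, 0.1, 0.2],
    topK: 2,
    includeMetadata: true,
  });
  console.log(results); // [{ id: "doc-1", score: ..., metadata: ... }, ...]
}
```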
Namespace support
- Function description: Upstash Vector supports the use of namespaces to manage and organize different data sets. A namespace can be understood as an independent data space, so that you can store multiple data sets in one database without worrying about them interfering with each other.
- Application scenarios: When you need to deal with different projects or different data sets, namespaces can help you better manage and isolate these data.
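A minimal sketch of namespaces with the same SDK (the per-language namespace names are illustrative and assume an index created with a built-in embedding model):

```typescript
import { Index } from "@upstash/vector";

const index = new Index({
  url: process.env.UPSTASH_VECTOR_REST_URL!,
  token: process.env.UPSTASH_VECTOR_REST_TOKEN!,
});

async function namespacesExample() {
  // One namespace per language keeps the data sets isolated in a single index.
  const en = index.namespace("en");
  const de = index.namespace("de");

  await en.upsert({ id: "article-1", data: "Mount Everest is Earth's highest mountain." });
  await de.upsert({ id: "artikel-1", data: "Der Mount Everest ist der höchste Berg der Erde." });

  // Queries run against one namespace at a time.
  const results = await en.query({ data: "highest mountain", topK: 3 });
  console.log(results);
}
```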
Metadata filtering
- Function description: Upstash Vector lets you attach metadata to each vector. Metadata is "data about the data", such as when a record was created, where it came from, or what language it is in.
- Metadata filtering: When searching, Upstash Vector allows you to filter results based on that metadata, making it possible to narrow a query to records that match specific conditions and further improving the accuracy of the search.
- Application scenario: Metadata filtering is useful if you only want to search for content within a certain time period or from a specific source.
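A minimal sketch of metadata filtering (the metadata fields and the SQL-like filter string below are illustrative; check Upstash's docs for the exact filter syntax):

```typescript
import { Index } from "@upstash/vector";

const index = new Index({
  url: process.env.UPSTASH_VECTOR_REST_URL!,
  token: process.env.UPSTASH_VECTOR_REST_TOKEN!,
});

async function filteredSearch() {
  // Metadata is stored alongside each vector...
  await index.upsert({
    id: "doc-1",
    data: "Mount Everest is Earth's highest mountain.",
    metadata: { source: "wikipedia", year: 2024 },
  });

  // ...and can be used to narrow a search at query time.
  const results = await index.query({
    data: "highest mountain",
    topK: 5,
    includeMetadata: true,
    filter: "source = 'wikipedia' AND year >= 2023",
  });
  console.log(results);
}
```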
Built-in embedding models
- Automatic vector embedding: Upstash Vector provides built-in embedding models. Users can insert raw content such as text directly into the database, and the system generates the vectors automatically, with no need to spend time and resources running an embedding model themselves. This greatly simplifies building semantic search engines or RAG (Retrieval-Augmented Generation) applications.
- Application scenarios: Suitable for users who want to quickly implement vectorized processing, such as building an intelligent search engine in a short time.
Cross-language support
- Cross-language search: Using multilingual embedding models such as BGE-M3, Upstash Vector can process and understand content in multiple languages and perform similarity searches across them: users can query in one language and find matching results stored in another.
- Semantic understanding: Through the embedding model, Upstash Vector captures the semantics of the text rather than just its surface form, making search results more accurate and relevant.
- Application scenario: This function is particularly important for global applications or projects with multi-language support.
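For example, a query written in Chinese can surface passages stored in English, because a multilingual model such as BGE-M3 maps all languages into one vector space (a sketch, assuming a text-mode index built with such a model):

```typescript
import { Index } from "@upstash/vector";

const index = new Index({
  url: process.env.UPSTASH_VECTOR_REST_URL!,
  token: process.env.UPSTASH_VECTOR_REST_TOKEN!,
});

async function crossLanguageQuery() {
  // The query text is Chinese for "the highest mountain in the world";
  // English passages about Mount Everest can still rank highly.
  const results = await index.query({
    data: "世界上最高的山",
    topK: 3,
    includeMetadata: true,
  });
  console.log(results);
}
```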
Efficient vector indexing and querying
- Fast indexing: Upstash Vector indexes vectors efficiently and maintains good performance even on large-scale data sets, so massive collections can be indexed in a short time and retrieved quickly.
- Approximate Nearest Neighbor Search (ANN): Upstash Vector uses optimized algorithms (such as DiskANN) for approximate nearest neighbor search, balancing search accuracy and speed.
- Application scenarios: Suitable for scenarios that process large-scale data sets, such as systems that require fast search of massive text or images.
Integration and scalability:
- Integration with other Upstash tools: Upstash Vector can be seamlessly integrated with other Upstash tools such as Redis and the QStash LLM APIs to build complex applications such as RAG chatbots or advanced recommendation systems.
- Scalability: Upstash Vector is designed to scale to support growing data volumes and user demands, making it suitable for building scalable enterprise-level systems.
Application scenarios:
- Recommendation system: recommends similar products, content or services to users.
- Image Search: Retrieve similar images based on a query image or description.
- Text retrieval: Finding the text most relevant to a query within a large set of documents or web pages.
- RAG applications: Use the vector database as a knowledge base and combine it with a large language model to implement an advanced question-answering system or chatbot.
Operation Mode
Upstash Vector provides two main operation modes to meet different application needs and usage scenarios:
1. Vector Mode with User-Provided Embeddings
In this mode, users provide their own pre-generated vector embeddings, and Upstash Vector is responsible for storing, managing, and retrieving them. This mode suits situations where you already have embedding data or want to use a specific embedding model.
Run steps:
- Generate Embeddings: Users generate vector embeddings for their data using a model of their choice.
- Upload Vector: Upload the generated vector to Upstash Vector for storage.
- Similarity Search: Users can perform similarity search by querying a vector, and Upstash Vector will return the results that are most similar to the query vector.
Applicable scenarios:
- Users need a high degree of control over the embedding model and want to manage the vector generation process themselves.
- Specific embedding models are needed for data processing, such as using specialized models for images, text, or other fields.
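These steps map onto the SDK roughly as follows (a sketch; `embedWithMyModel` is a hypothetical stand-in for whatever embedding model you run yourself):

```typescript
import { Index } from "@upstash/vector";

// Client for an index created without a built-in embedding model.
const index = new Index({
  url: process.env.UPSTASH_VECTOR_REST_URL!,
  token: process.env.UPSTASH_VECTOR_REST_TOKEN!,
});

// Hypothetical helper: replace with your own model (e.g. a locally served BGE-M3).
async function embedWithMyModel(text: string): Promise<number[]> {
  throw new Error("plug in your own embedding model here");
}

async function vectorMode() {
  // Steps 1 and 2: generate the embedding yourself, then upload it.
  const vector = await embedWithMyModel("Mount Everest is Earth's highest mountain.");
  await index.upsert({ id: "everest-1", vector, metadata: { title: "Mount Everest" } });

  // Step 3: embed the query the same way and run a similarity search.
  const queryVector = await embedWithMyModel("highest mountain in the world");
  const results = await index.query({ vector: queryVector, topK: 5, includeMetadata: true });
  console.log(results);
}
```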
2. Text-to-Vector Mode with Built-in Embeddings
In this mode, users can directly input raw text data, and Upstash Vector will automatically use its built-in embedding model to convert the text into vectors for storage and retrieval. This mode greatly simplifies the usage process and is suitable for users who do not want to deal with embedding generation.
Run steps:
- Text Input: Users enter text data directly into Upstash Vector.
- Automatic embedding generation: Upstash Vector converts text data into vectors using its built-in embedding model.
- Storage and Retrieval: Upstash Vector stores the generated vectors and allows users to perform similarity searches by querying text. The system automatically handles the conversion of text to vectors and returns relevant results.
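In SDK terms, the whole round trip reduces to passing raw text in the `data` field (a sketch, assuming the index was created with a built-in embedding model):

```typescript
import { Index } from "@upstash/vector";

const index = new Index({
  url: process.env.UPSTASH_VECTOR_REST_URL!,
  token: process.env.UPSTASH_VECTOR_REST_TOKEN!,
});

async function textMode() {
  // Insert raw text; the built-in model embeds it server-side.
  await index.upsert({
    id: "everest-1",
    data: "Mount Everest is Earth's highest mountain above sea level.",
    metadata: { title: "Mount Everest" },
  });

  // Query with raw text as well; the same model embeds the query.
  const results = await index.query({
    data: "highest mountain in the world",
    topK: 5,
    includeMetadata: true,
  });
  console.log(results);
}
```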
Applicable scenarios:
- Users hope to simplify the process of generating vector embeddings and focus on application development rather than underlying technologies.
- Applications such as semantic search engines or chatbots need to be deployed and run quickly without manually managing embedding models.
Considerations for mode selection:
- Control vs. simplicity: If you need more control and customization over embedding generation, choose the vector mode with user-provided embeddings. If you want to simplify development as much as possible, choose the text-to-vector mode.
- Data type: If you already have an embedding model suited to your data (for example, a specialized image model), the user-provided vector mode is the appropriate choice. If the data is text and you want to deploy quickly, the text-to-vector mode is the better option.
How to use this vector data and Upstash's tools to build a RAG chatbot
This section explains how to use Upstash's tools (including Redis and the QStash LLM API) together with the generated vector data to build a RAG (Retrieval-Augmented Generation) chatbot:
1. Concept of RAG Chatbot
A RAG chatbot is an advanced chat system that combines retrieval and generation techniques. The user's query first retrieves relevant information from the knowledge base via vector search, and a generative model (such as a large language model, LLM) then uses the retrieved information to produce a targeted, context-aware answer. This yields more accurate and informative responses to user queries.
2. Use Upstash Vector for semantic search
- Vector database as knowledge base: Upstash Vector plays the role of a knowledge base in this system. It stores a large number of vectors generated from Wikipedia articles and performs fast similarity searches when users issue queries.
- Query vector generation: When a user enters a query, Upstash Vector automatically converts the query text into a vector (if using text input mode) and finds the most similar vector in the database.
- Search results: The search returns the Wikipedia passages most relevant to the user query, which are passed to the generative model for further processing.
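A sketch of this retrieval step, assuming a text-mode index over the Wikipedia passages whose metadata carries the passage text (the `text` metadata field is illustrative):

```typescript
import { Index } from "@upstash/vector";

const index = new Index({
  url: process.env.UPSTASH_VECTOR_REST_URL!,
  token: process.env.UPSTASH_VECTOR_REST_TOKEN!,
});

// Retrieve the passages most relevant to the user's question.
async function retrieveContext(question: string): Promise<string[]> {
  const results = await index.query({
    data: question, // embedded server-side in text mode
    topK: 5,
    includeMetadata: true,
  });
  // Assume each vector's metadata carries its paragraph under "text".
  return results.map((r) => (r.metadata as { text?: string } | undefined)?.text ?? "");
}
```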
3. Use Redis to store chat sessions
- Chat Transcript Storage: Upstash Redis is used to store chat session transcripts for each user. This allows the chatbot to maintain context throughout the conversation, providing more consistent and coherent responses.
- State management: Through Redis, the system can effectively manage and maintain the state of user sessions, such as tracking past conversation content, query history, etc., which is crucial for building an efficient chatbot.
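A minimal sketch of session storage with the @upstash/redis TypeScript SDK (the key naming, trimming, and expiry policy are illustrative choices, not the project's exact implementation):

```typescript
import { Redis } from "@upstash/redis";

const redis = new Redis({
  url: process.env.UPSTASH_REDIS_REST_URL!,
  token: process.env.UPSTASH_REDIS_REST_TOKEN!,
});

type ChatMessage = { role: "user" | "assistant"; content: string };

// Append a message to a user's session, keeping only the last 20 turns.
async function saveMessage(sessionId: string, message: ChatMessage) {
  const key = `chat:${sessionId}`;
  await redis.rpush(key, JSON.stringify(message));
  await redis.ltrim(key, -20, -1);
  await redis.expire(key, 60 * 60 * 24); // drop idle sessions after a day
}

// Load the stored history so the bot can answer with conversation context.
async function loadHistory(sessionId: string): Promise<ChatMessage[]> {
  const raw = await redis.lrange(`chat:${sessionId}`, 0, -1);
  // The SDK may deserialize JSON automatically; handle both cases.
  return raw.map((m) => (typeof m === "string" ? JSON.parse(m) : m) as ChatMessage);
}
```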
4. Integration of QStash LLM API
- Generative model: The QStash LLM APIs provide access to large language models (such as Meta-Llama-3-8B-Instruct), which are responsible for generating the final response. The model produces an answer based on the retrieved information and returns it to the user through the API.
- Integration with the vector database: The core of a RAG system is combining the retrieved information with the generative model. By feeding the paragraphs retrieved from Upstash Vector into the LLM, the generative model can use this information to produce highly relevant, context-aware answers.
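A sketch of the generation step, assuming QStash's OpenAI-compatible chat-completions endpoint; the URL, model name, and response shape follow Upstash's documentation at the time of writing and should be verified against the current docs:

```typescript
// Ask the QStash-hosted model to answer using the retrieved passages.
async function generateAnswer(question: string, passages: string[]): Promise<string> {
  const res = await fetch("https://qstash.upstash.io/llm/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.QSTASH_TOKEN}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "meta-llama/Meta-Llama-3-8B-Instruct",
      messages: [
        {
          role: "system",
          content: `Answer using only this context:\n${passages.join("\n---\n")}`,
        },
        { role: "user", content: question },
      ],
    }),
  });
  const data = await res.json();
  // OpenAI-compatible responses return the text under choices[0].message.content.
  return data.choices[0].message.content;
}
```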
5. Integration process
- Easy integration: Using the tools provided by Upstash, developers can integrate the entire system with very little code. Because Upstash provides high-level APIs and tools, complex parts of the system (such as vector retrieval, state management, and generative model integration) are greatly simplified.
- Reuse of existing indexes: Since Upstash Vector already processes and stores vector indexes of Wikipedia, these indexes can be directly used in the RAG chatbot without reprocessing the data.
6. Code examples and project links
- The source code of this project is provided for readers to view and learn how to implement this entire system. The code shows how to combine Upstash Vector, Redis and QStash LLM API to build a complete RAG chatbot.
Wikipedia-semantic-search: https://github.com/upstash/wikipedia-semantic-search
Upstash RAG Chat SDK: https://github.com/upstash/rag-chat
Online experience: https://wikipedia-semantic-search.vercel.app/
Original post: Indexing millions of Wikipedia articles using Upstash Vector
- Author: KCGOD
- URL: https://kcgod.com/upstash-vector
- Copyright: Unless otherwise stated, all articles on this blog are licensed under CC BY-NC-SA. Please credit the source when reposting!