
Vector databases became the hottest topic in tech, with every startup chasing VC funding for their vector database solutions. This fever caught me too, leading to my journey of building a minimalist vector database from scratch using familiar tools: SQLite, NumPy, and K-Nearest Neighbors.
Understanding Vector Databases
What Are Vector Databases?
Vector databases are specialized databases designed to handle the unique challenges of managing vector embeddings in production environments. Unlike traditional databases that store structured data, vector databases excel at storing and querying high-dimensional numerical representations of data.
Why Vector Databases Matter
The AI revolution brought us powerful Large Language Models (LLMs), but these models have a limitation: they're confined to their training data. To extend their capabilities, we need to feed them external information in a format they understand - high-dimensional vectors.
LLMs don't process language like humans do. Instead, they:
- Convert text into numerical representations (vectors)
- Process information as high-dimensional arrays (ndarray)
- Operate in mathematical vector spaces rather than in semantic meaning
Think of LLMs as "language wizards speaking in numbers" - they need everything translated into their mathematical language.
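To make this concrete, here's a toy sketch of what an embedding looks like as data (the 384-dimension width is just a common embedding size, and the random values stand in for a real embedding model, which is out of scope here):
import numpy as np
# Purely illustrative: a real embedding model would map text to a dense
# vector; random values stand in here just to show the shape and dtype.
rng = np.random.default_rng(seed=0)
embedding = rng.normal(size=384).astype(np.float32)  # 384 dims is a common width
print(embedding.shape, embedding.dtype)  # (384,) float32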
Building the Vector Database
The Challenge with Traditional Databases
Here's the problem: traditional relational databases like SQLite, PostgreSQL, and MySQL weren't designed to handle high-dimensional ndarray data efficiently. They excel at structured data with fixed schemas, but struggle with the fluid, mathematical nature of vector embeddings.
Solution: Serialization Strategy
The key insight is to serialize vectors before storage and deserialize after retrieval. Our approach:
- Convert ndarray to bytes using NumPy's native serialization
- Store the serialized vector as binary data in SQLite BLOB fields
- Use SQLite's register_adapter and register_converter for seamless conversion
Vector Storage Implementation
import sqlite3
import numpy as np
import io

def adapt_array(arr):
    """Serialize a NumPy array to bytes so SQLite can store it as a BLOB."""
    out = io.BytesIO()
    np.save(out, arr)
    out.seek(0)
    return sqlite3.Binary(out.read())

def convert_array(blob):
    """Deserialize BLOB bytes read from SQLite back into a NumPy array."""
    out = io.BytesIO(blob)
    out.seek(0)
    return np.load(out)

# Tell SQLite how to write ndarrays and how to read "array" columns back
sqlite3.register_adapter(np.ndarray, adapt_array)
sqlite3.register_converter("array", convert_array)
This elegant solution allows us to store and retrieve vectors seamlessly, treating them as native SQLite data types.
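To see the round trip in action, here's a minimal usage sketch (the embeddings table and its schema are illustrative, not from the repo). Note that the connection must be opened with detect_types=sqlite3.PARSE_DECLTYPES for the "array" converter to fire on reads:
con = sqlite3.connect("vectors.db", detect_types=sqlite3.PARSE_DECLTYPES)
cur = con.cursor()
# Declaring the column type as "array" is what triggers convert_array on SELECT
cur.execute("CREATE TABLE IF NOT EXISTS embeddings (id INTEGER PRIMARY KEY, vec array)")
cur.execute("INSERT INTO embeddings (vec) VALUES (?)", (np.arange(3, dtype=np.float32),))
con.commit()
row = cur.execute("SELECT vec FROM embeddings").fetchone()
print(type(row[0]), row[0])  # <class 'numpy.ndarray'> [0. 1. 2.]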
Implementing Similarity Search
Beyond Traditional Querying
Traditional database queries work like this: "Give me everything that matches this exact condition."
SELECT * FROM db_table WHERE condition = 'matched';
But vector databases require a fundamentally different approach. Instead of exact matches, we need similarity search - finding vectors that are "close" to our query vector in high-dimensional space.
Enter K-Nearest Neighbors (kNN)
K-Nearest Neighbors is our similarity search algorithm. Think of it as a matchmaking service that finds the most similar vectors to any given query.
How kNN Works:
- Calculate distances between the query vector and all stored vectors
- Sort by distance to find the closest matches
- Return the top-k most similar vectors
For distance calculation, we use the Euclidean distance - the straight-line distance between two points in multi-dimensional space.
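Formally, for two n-dimensional points p and q:
d(p, q) = √((p₁ − q₁)² + (p₂ − q₂)² + … + (pₙ − qₙ)²)
This is exactly what np.linalg.norm(point1 - point2) computes in the implementation below.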
Distance Calculation and kNN Implementation
def euclidean_distance(point1, point2):
    """Calculate the Euclidean distance between two points represented as NumPy arrays."""
    if point1.shape != point2.shape:
        raise ValueError("Input points must have the same shape.")
    # Calculate the Euclidean distance using NumPy's optimized norm function
    return np.linalg.norm(point1 - point2)

def get_nearest_neighbor(train, test_row, num_neighbors: int = 1):
    """Find the nearest neighbors of a test data point in a dataset."""
    distances = []
    # Calculate the distance from the query to each stored vector
    for train_row in train:
        dist = euclidean_distance(test_row, train_row)
        distances.append((train_row, dist))
    # Sort by distance (closest first)
    distances.sort(key=lambda tup: tup[1])
    # Return the k nearest neighbors
    return [row for row, _ in distances[:num_neighbors]]
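A quick sanity check with toy 2-D points (the data here is made up purely for illustration):
train = [np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([5.0, 5.0])]
query = np.array([0.9, 1.1])
print(get_nearest_neighbor(train, query, num_neighbors=2))
# -> [array([1., 1.]), array([0., 0.])]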
Performance Considerations
Optimization Strategies
While this implementation is educational, production vector databases implement several optimizations:
- Indexing: Approximate Nearest Neighbor (ANN) algorithms like HNSW or IVF
- Caching: Keep frequently accessed vectors in memory
- Batch Processing: Vectorize distance calculations for better performance (see the sketch after this list)
- Compression: Reduce storage size through quantization techniques
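To give a taste of the batch-processing point above, here's a minimal sketch (knn_vectorized is a hypothetical helper, not from the repo) that replaces the Python loop with a single vectorized NumPy call:
import numpy as np

def knn_vectorized(corpus: np.ndarray, query: np.ndarray, k: int = 1) -> np.ndarray:
    """Brute-force kNN over an (n, d) corpus matrix for a (d,) query vector."""
    distances = np.linalg.norm(corpus - query, axis=1)  # all n distances at once
    return corpus[np.argsort(distances)[:k]]  # rows of the k closest vectors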
Scalability Limitations
Our SQLite-based approach works well for:
- Prototyping and experimentation
- Small to medium datasets (< 1M vectors)
- Single-node applications
For larger scale applications, consider specialized vector databases like Pinecone, Weaviate, or Qdrant.
Conclusion
Building a vector database from scratch reveals the fundamental concepts behind these powerful tools. While our SQLite + NumPy + kNN implementation is minimalist, it demonstrates the core principles that power modern AI applications.
Key Takeaways
- Vector databases solve real problems in AI applications
- Serialization enables traditional databases to store high-dimensional data
- Similarity search is fundamentally different from exact matching
- Simple implementations can be surprisingly effective for many use cases
Whether this leads to VC funding or not, understanding these fundamentals helps build better AI systems! 🚀
Want to explore the code yourself?
Repository: sqlite_vector
License: MIT