Building a Vector Database on SQLite, Numpy and KNNs

By David Okpare • August 10, 2023

[Image: vector database visualization of high-dimensional data. Source: ClipDrop by Stability AI]

Vector databases have become one of the hottest topics in tech, with startups everywhere chasing VC funding for their vector database solutions. The fever caught me too, and it led me to build a minimalist vector database from scratch using familiar tools: SQLite, NumPy, and K-Nearest Neighbors.

Understanding Vector Databases

What Are Vector Databases?

Vector databases are specialized databases designed to handle the unique challenges of managing vector embeddings in production environments. Unlike traditional databases that store structured data, vector databases excel at storing and querying high-dimensional numerical representations of data.

Why Vector Databases Matter

The AI revolution brought us powerful Large Language Models (LLMs), but these models have a limitation: they're confined to their training data. To extend their capabilities, we need to feed them external information in a format they understand - high-dimensional vectors.

LLMs don't process language like humans do. Instead, they:

  • Convert text into numerical representations (vectors)
  • Process information as high-dimensional arrays (ndarray)
  • Operate in mathematical vector spaces rather than on semantic meaning directly

Think of LLMs as "language wizards speaking in numbers" - they need everything translated into their mathematical language.

Building the Vector Database

The Challenge with Traditional Databases

Here's the problem: traditional relational databases like SQLite, PostgreSQL, and MySQL weren't designed to handle high-dimensional ndarray data efficiently. They excel at structured data with fixed schemas, but struggle with the fluid, mathematical nature of vector embeddings.

Solution: Serialization Strategy

The key insight is to serialize vectors before storage and deserialize after retrieval. Our approach:

  1. Convert ndarray to bytes using NumPy's native serialization
  2. Store as binary data in SQLite using BLOB fields
  3. Use Python's sqlite3.register_adapter and sqlite3.register_converter for seamless conversion

Vector Storage Implementation

import sqlite3
import numpy as np
import io

def adapt_array(arr):
    """Serialize a NumPy array to bytes so SQLite can store it as a BLOB."""
    out = io.BytesIO()
    np.save(out, arr)  # write the array in NumPy's native .npy format
    out.seek(0)
    return sqlite3.Binary(out.read())

def convert_array(blob):
    """Deserialize a BLOB read from SQLite back into a NumPy array."""
    out = io.BytesIO(blob)
    out.seek(0)
    return np.load(out)

# Apply these automatically: adapt on insert, convert on read
# for any column declared with the type name "array".
sqlite3.register_adapter(np.ndarray, adapt_array)
sqlite3.register_converter("array", convert_array)

This elegant solution allows us to store and retrieve vectors seamlessly, treating them as native SQLite data types.
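
One caveat worth knowing: the converter only fires when the connection is opened with detect_types=sqlite3.PARSE_DECLTYPES and the column is declared with the registered type name. Here is a minimal round-trip sketch (the database file, table, and column names are illustrative, not from the original project):

# detect_types makes sqlite3 honor the declared column type "array",
# which triggers convert_array on every read
conn = sqlite3.connect("vectors.db", detect_types=sqlite3.PARSE_DECLTYPES)
conn.execute("CREATE TABLE embeddings (id INTEGER PRIMARY KEY, vec array)")

vec = np.random.rand(384)  # a stand-in for a real embedding
conn.execute("INSERT INTO embeddings (vec) VALUES (?)", (vec,))
conn.commit()

# The value comes back as an np.ndarray, not raw bytes
(restored,) = conn.execute("SELECT vec FROM embeddings").fetchone()
assert np.array_equal(vec, restored)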

Beyond Traditional Querying

Traditional database queries work like this: "Give me everything that matches this exact condition."

SELECT * FROM db_table WHERE condition = 'MATCHED';

But vector databases require a fundamentally different approach. Instead of exact matches, we need similarity search - finding vectors that are "close" to our query vector in high-dimensional space.

Enter K-Nearest Neighbors (kNN)

K-Nearest Neighbors is our similarity search algorithm. Think of it as a matchmaking service that finds the most similar vectors to any given query.

How kNN Works:

  1. Calculate distances between the query vector and all stored vectors
  2. Sort by distance to find the closest matches
  3. Return the top-k most similar vectors

For distance calculation, we use the Euclidean distance - the straight-line distance between two points in multi-dimensional space.
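
In symbols, for two n-dimensional vectors p and q:

d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}

This is exactly what np.linalg.norm(p - q) computes, which is why the implementation below stays so short.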

Distance Calculation and kNN Implementation

def euclidean_distance(point1, point2):
    """Calculate the Euclidean distance between two points represented as NumPy arrays."""
    if point1.shape != point2.shape:
        raise ValueError("Input points must have the same shape.")

    # Calculate the Euclidean distance using NumPy optimized function
    distance = np.linalg.norm(point1 - point2)
    return distance


def get_nearest_neighbor(train, test_row, num_neighbors: int = 1):
    """Find the nearest neighbors of a test data point in a dataset."""
    distances = []

    # Calculate distance to each training vector
    for train_row in train:
        dist = euclidean_distance(test_row, train_row)
        distances.append((train_row, dist))

    # Sort by distance (closest first)
    distances.sort(key=lambda tup: tup[1])
    
    # Return the k nearest neighbors (closest first)
    return [row for row, _ in distances[:num_neighbors]]
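
Combining both pieces, a query against the SQLite store is just a read followed by the brute-force scan (continuing the illustrative embeddings table from the earlier sketch):

# Load every stored vector, then scan for the closest matches
stored = [row[0] for row in conn.execute("SELECT vec FROM embeddings")]
query = np.random.rand(384)  # a stand-in for the query embedding
top_3 = get_nearest_neighbor(stored, query, num_neighbors=3)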

Performance Considerations

Optimization Strategies

While this implementation is educational, production vector databases implement several optimizations:

  • Indexing: Approximate Nearest Neighbor (ANN) algorithms like HNSW or IVF
  • Caching: Keep frequently accessed vectors in memory
  • Batch Processing: Vectorize distance calculations for better performance (see the sketch after this list)
  • Compression: Reduce storage size through quantization techniques
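
To make the batch-processing point concrete: the Python loop in get_nearest_neighbor can collapse into a handful of NumPy calls. A sketch, assuming the stored vectors are stacked into one 2-D matrix of shape (n_vectors, dim):

def knn_vectorized(train_matrix, query, k: int = 1):
    """Brute-force kNN with a single vectorized distance computation."""
    # Distance from the query to every row in one NumPy call
    dists = np.linalg.norm(train_matrix - query, axis=1)
    # argpartition surfaces the k smallest distances without a full sort
    idx = np.argpartition(dists, k - 1)[:k]
    # Order those k candidates from closest to farthest
    return train_matrix[idx[np.argsort(dists[idx])]]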

Scalability Limitations

Our SQLite-based approach works well for:

  • Prototyping and experimentation
  • Small to medium datasets (< 1M vectors)
  • Single-node applications

For larger scale applications, consider specialized vector databases like Pinecone, Weaviate, or Qdrant.

Conclusion

Building a vector database from scratch reveals the fundamental concepts behind these powerful tools. While our SQLite + NumPy + kNN implementation is minimalist, it demonstrates the core principles that power modern AI applications.

Key Takeaways

  • Vector databases solve real problems in AI applications
  • Serialization enables traditional databases to store high-dimensional data
  • Similarity search is fundamentally different from exact matching
  • Simple implementations can be surprisingly effective for many use cases

Whether this leads to VC funding or not, understanding these fundamentals helps build better AI systems! 🚀


Want to explore the code yourself?

Repository: sqlite_vector
License: MIT
