Back to all posts

Using Full-Text Search in Large Scale Datasets

Aniket
Using Full-Text Search in Large Scale Datasets

1 Background and Motivation

We worked with a client on a project where the task is to enhance a dataset containing various foods, their ingredients and nutrition data. As part of this effort, Open Food Facts (≈ 3.8 million rows) is one of the primary sources we used to collect ingredients, nutrition and allergen facts and merge them with the base product information.

The challenge: product names rarely match exactly.

My DatabaseOpen Food Facts
KURUKAHVECI MEHMET EFENDI Turkish Coffee, Arabica Beans… (17.6 oz) Pack of 3KURUKAHVECI MEHMET EFENDI Turkish Coffee

The traditional exact joins fail.
What we need is fuzzy, full-text similarity search that is:

  • Tolerant of “noise” like pack size, quantity, or marketing copy
  • Fast enough to run interactively on millions of rows

The original dataset contains ~1M rows. The search must be performant (fast and efficient) in order to perform search for each of these products from a largish dataset (3.8M rows from Open Food Facts) and enrich them.

We decided to use bm25 for this task.

2 Short primer on bm25

BM25 (Best Matching 25) is a powerful ranking function used by search engines to estimate the relevance of documents to a given search query. It's a step up from simpler methods like TF-IDF (Term Frequency-Inverse Document Frequency). In essence, BM25 scores documents based on the terms appearing in the query, considering:

  • Term Frequency (TF): How often do the query words appear in a document?
  • Inverse Document Frequency (IDF): How common or rare are these words across all documents in the collection? Rare words get more weight.
  • Document Length: It normalizes for document length, so shorter documents aren't unfairly advantaged or disadvantaged.

By using the longer, more descriptive name from the database as the search query against the Open Food Facts product names, BM25 can find the closest matches even with the variations. More details can be read from here.

3 The Dataset

The dataset is a TSV file named en.openfoodfacts.org.products.csv, downloaded from Open Food Facts. It occupies approximately 11 GB on disk and contains 3.8M records. We will focus on the product_name column as the key field for matching records. We also tried with parquet file for this excercise and it is generally efficient as it only takes up ~4GB space on the hard drive and is very efficient for downloading the file. But of course, as soon as you load it in the database, the memory requirements across filetypes are similar.

To evaluate different fuzzy-search techniques at scale, we’ll first conduct experiments in Python using BM25 libraries (pure-Python rank_bm25 versus Cython-based bm25s) and then perform tests inside DuckDB by using its native full-text search capabilities using their fts extension.

4 Hardware

I used a Dell Poweredge with 64GB memory, 24 cores CPU fit with dual NVIDIA RTX 3090 GPUs (not using NVLink).

5 Dataframe Supercharged

We made use of Dask CuDF for loading and performing operations.

A short write-up about what is CuDF (quoted from this link):

cuDF is a Python GPU DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating tabular data using a DataFrame style API in the style of pandas.
Dask is a flexible library for parallel computing in Python that makes scaling out your workflow smooth and simple. On the CPU, Dask uses Pandas to execute operations in parallel on DataFrame partitions.
Dask cuDF extends Dask where necessary to allow its DataFrame partitions to be processed using cuDF GPU DataFrames instead of Pandas DataFrames.
import dask
import dask.dataframe as dd
from dask_cuda import LocalCUDACluster
from distributed import Client

# Define a GPU-aware cluster to leverage multiple GPUs
client = Client(
  LocalCUDACluster(
    CUDA_VISIBLE_DEVICES="0,1",  # Use two workers (on devices 0 and 1)
    rmm_pool_size=0.9,  # Use 90% of GPU memory as a pool for faster allocations
    enable_cudf_spill=True,  # Improve device memory stability
    local_directory="./",  # Use fast local storage for spilling
  )
)

# Set the default dataframe backend to "cudf"
dask.config.set({"dataframe.backend": "cudf"})

df = dd.read_csv("en.openfoodfacts.org.products.csv", sep='\t')
df["product_name"] = df["product_name"].astype("string")
df["code"]         = df["code"].astype("string")

df_keys = df[["product_name", "code"]]
df["product_name"] = df["product_name"].astype("string")
df["code"]         = df["code"].astype("string")
deduped = df.drop_duplicates(subset=["product_name", "code"])
deduped_keys_ddf = df_keys.drop_duplicates(subset=["product_name", "code"])
deduped_keys_ddf = deduped_keys_ddf.persist()
cudf_keys_df = deduped_keys_ddf.compute()

The impressive part is:

  • On an dual NVIDIA RTX-3090 GPUs, this takes ~17 seconds and fits in < 8 GB memory.
  • If you don’t have a GPU, using vanilla Pandas on a modern CPU takes around ~2.5 minutes with 24 GB memory.

6 Baseline: rank_bm25 (Pure Python)

# 1. Install requirements:
#    pip install rank_bm25 nltk

import pandas as pd
import numpy as np
from rank_bm25 import BM25Okapi
from nltk.tokenize import word_tokenize
import nltk

nltk.download("punkt")

# 2. Prepare tokenized corpus:
corpus = (
    df["product_name"]
      .fillna("")
      .str.lower()
      .apply(word_tokenize)
      .tolist()
)

# 3. Build BM25 index:
bm25 = BM25Okapi(corpus)

# 4. Define a search function:

def bm25_search_rank(query, k=5):
    tokenized_query = word_tokenize(query.lower())
    scores = bm25.get_scores(tokenized_query)
    df["bm25_score"] = scores
    top = df.sort_values("bm25_score", ascending=False).head(k)
    return top[["_id", "product_name", "bm25_score"]]

# 5. Example query:
query = (
    "Spicy Nuts and Cajun Sticks Trail Mix"
)
results = bm25_search_rank(query)
print(results)

Drawbacks:

  • No stemming or stop‐word removal out of the box.
  • Slow for large corpora: a single query takes 90– 120 s.

7 Turbocharged: bm25s + Stemming

bm25s is a Cython implementation of BM25, storing token‐ID arrays to reduce memory and speed up computations.
It integrates a Porter stemmer (via PyStemmer) for normalization, and can drop English stop words automatically.

# 1. Install requirements:
#    pip install bm25s PyStemmer

import bm25s
import Stemmer
import pandas as pd

# 2. Create a stemmer:
stemmer = Stemmer.Stemmer("english")

# 3. Tokenize & stem entire corpus (list of strings):
corpus = (
    df["product_name"]
      .fillna("")
      .str.lower()
      .apply(word_tokenize)
      .tolist()
)
text_list = corpus.fillna("").str.lower().tolist()
corpus_tokens = bm25s.tokenize(
    text_list,
    stopwords="en",
    stemmer=stemmer
)

# 4. Build BM25 index in C:
retriever = bm25s.BM25()
retriever.index(corpus_tokens)

# 5. Define a search function:
def bm25_search_s(query, k=5):
    q_tokens = bm25s.tokenize(query.lower(), stemmer=stemmer)
    doc_ids, scores = retriever.retrieve(q_tokens, k=k)
    results = []
    for idx, score in zip(doc_ids[0], scores[0]):
        results.append((df.iloc[idx]["_id"], df.iloc[idx]["product_name"], float(score)))
    return pd.DataFrame(results, columns=["_id", "product_name", "bm25_score"])

# 6. Example query:
query = "Spicy Nuts and Cajun Sticks Trail Mix 1 LB"
results = bm25_search_s(query)
print(results)

Performance gains:

  • Average query latency: 50-80 ms

8 Did It Work? – BM25 Examples

For the test query “KURUKAHVECI MEHMET EFENDI Arabica Beans Turkish Coffee 500 g”, BM25 successfully identified close matches despite the query containing extra tokens such as package size and descriptive adjectives. The top result was “KURUKAHVECI MEHMET EFENDI Turkish Coffee”, with a high score of 12.7, followed by “Turkish Coffee – Kurukahveci Mehmet Efendi” with a score of 11.9. A third notable match was “KURUKAHVECI MEHMET EFENDİ Turkish Coffee (Ground, 500 g)”, which scored 10.8. These results demonstrate that BM25 can effectively rank and retrieve the most relevant product names even when the user’s query includes additional details that do not exactly match the stored titles.

9 Production Tips for BM25

  • Persist the index on disk, then reload for inference:
retriever.save("bm25_openfoodfacts")
  • For bulk matching (e.g., matching 10,000 queries), vectorize queries and use batch retrieve() to amortize overhead.
  • Monitor memory if running on limited‐RAM instances; bm25s eats ~ 2 GB for 3.8 M docs.

10 Using DuckDB’s Full-Text Search

DuckDB now offers built-in FTS (full-text search) with BM25 scoring via a PRAGMA command. This requires no external Python libraries and can be run directly from the DuckDB shell or via Python/ODBC connectors.

10.1 Loading the JSONL into DuckDB

Download the JSON lines (JSONL) file and optionally gzip it:

wget https://static.openfoodfacts.org/data/en.openfoodfacts-products.jsonl.gz

Start DuckDB and create a new database:

duckdb fsg_duck.db

Create the products table by reading JSONL directly:

CREATE OR REPLACE TABLE products AS
SELECT
  _id,
  product_name
FROM read_ndjson(
  'openfoodfacts-products.jsonl.gz',
  auto_detect = FALSE,
  ignore_errors = TRUE,
  -- Only project the columns we need:
  columns = {
    '_id'          : 'VARCHAR',
    'product_name' : 'VARCHAR'
  }
)
WHERE product_name IS NOT NULL;

Deduplicate rows by product_name (simple de-dup): Now products contains one row per unique product_name.

DELETE FROM products
USING (
  SELECT _id,
         ROW_NUMBER() OVER (PARTITION BY product_name ORDER BY _id) AS rn
  FROM products
) d
WHERE products._id = d._id AND d.rn > 1;

10.2 Creating an FTS Index with BM25

PRAGMA create_fts_index(
  'products',                  -- table name
  '_id',                       -- unique identifier column
  'product_name',              -- text column to index
  stemmer = 'porter',          -- Porter stemmer for English
  stopwords = 'english',       -- drop common words (e.g., "and", "the")
  ignore = '(\.|[^a-z])+'     -- ignore punctuation & non-letter chars
);

DuckDB builds a specialized FTS index that supports:

  • match_bm25(...) to compute BM25 scores directly in SQL.
  • Stemming & stop-word filtering on the fly.

10.3 Querying with match_bm25

Once the index is ready, a simple SQL query yields the top‐k matches:

SELECT _id, product_name, score
FROM (
  SELECT *,
    fts_main_products.match_bm25(
      _id,
      'Spicy Nuts and Cajun Sticks Trail Mix',
      fields := 'product_name'
    ) AS score
  FROM products
) sq
WHERE score IS NOT NULL
ORDER BY score DESC
LIMIT 10;

Example output:

codeproduct_namescore
0029000072190Spicy nuts and Cajun sticks trail mix 11.941239958639224
0011822584814Spicy Cajun Trail Mix 9.966428902378025
0029000000254Planters Spicy nuts and cajun sticks 9.029640508620329
  • Query latency: ≈ 0.3-05 s (wall‐clock) to scan & rank 3.8 M rows once the index is built.
  • No additional data movement; everything happens inside DuckDB.

11 Summary

Efficiently matching products across large-scale datasets requires robust fuzzy search techniques that traditional exact matching methods fail to provide. By leveraging the power of BM25, we demonstrated an effective way to achieve high-quality fuzzy matching at scale. Through practical examples, we compared pure Python implementations, optimized Cython-based methods, and DuckDB's built-in full-text search capabilities. Our experiments highlighted significant differences in performance, ease of setup, and resource utilization, with DuckDB offering a particularly compelling balance between simplicity and speed.

MethodLatency [s]Memory Consumption [GB]Comments
rank_bm2590-1202-3Easy to implement but impractical at scale
bm25s0.02-0.052-3Optimized and fast. If you have GPUs, this method is highly recommended
DuckDB FTS0.2-0.420-30Balance of simplicity and ease of use. Downside is integration with app which might be written in a different language

Ultimately, selecting the right tool depends on your specific needs, whether prioritizing ease of integration, speed, or scalability.

Regardless of the approach, BM25 proved to be an invaluable technique for handling noisy, real-world data efficiently, enabling rapid and accurate data enrichment at scale.