Build a Pure Node.js Search Engine from Scratch
Grace Collins
Solutions Engineer · Leapcell

Building an English Search Engine with TF-IDF Using Pure Node.js: From Principles to CSV Inverted Index Implementation
In the era of information explosion, search engines have become the core tool for people to access information. From Google to Bing, these large-scale search engines are backed by complex technical architectures, but their core principles can be implemented using basic technology stacks. This article will guide you through building a TF-IDF algorithm-based English search engine from scratch using pure Node.js, without any third-party libraries, storing the inverted index in CSV files. Through this practice, you will gain a deep understanding of the core mechanisms of information retrieval and master key technologies in text processing, weight calculation, and index construction.
Fundamentals of Search Engines and TF-IDF
A search engine is essentially an information filtering system whose core function is to receive user queries, quickly find the most relevant results from massive documents, and rank them. Modern search engines consist of four core modules: crawler (data acquisition), indexer (construction of retrieval structures), retriever (query processing), and ranking algorithm (result optimization). Our simplified version will focus on the indexer and retriever, using local documents as the data source.
TF-IDF (Term Frequency-Inverse Document Frequency) is a classic weighting scheme in information retrieval; its IDF component was proposed by Karen Spärck Jones in 1972, and it is still widely used in text systems today. Its core idea is: the importance of a word to a document is proportional to how often the word appears in that document and inversely proportional to how many documents in the corpus contain it.
- Term Frequency (TF): The number of times a word appears in the current document divided by the total number of words in the document. The formula is TF(t,d) = count(t,d) / totalTerms(d). For example, if "machine" appears 5 times in a document with 100 total words, the TF value is 0.05.
- Inverse Document Frequency (IDF): The logarithm of the total number of documents divided by the number of documents containing the word. The formula is IDF(t) = log(totalDocs / docsWithTerm(t)). If the corpus has 1000 documents, 20 of which contain "learning", the IDF value is log(1000/20) = log(50) ≈ 3.912.
Multiplying the two gives TF-IDF(t,d) = TF(t,d) × IDF(t). The higher this value, the more representative the word is of the current document. During searching, by calculating the TF-IDF vector similarity (such as cosine similarity) between the query terms and documents, result ranking can be achieved.
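As a quick sanity check of the numbers above, the following standalone snippet (illustrative values only, not part of the project code) reproduces the calculation:

// Worked example of the formulas above (standalone, illustrative values only)
const tf = 5 / 100;               // TF: the term appears 5 times in a 100-word document -> 0.05
const idf = Math.log(1000 / 20);  // IDF: 1000 documents, 20 contain the term -> log(50) ≈ 3.912
const tfidf = tf * idf;           // TF-IDF ≈ 0.196
console.log({ tf, idf: idf.toFixed(3), tfidf: tfidf.toFixed(3) });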
Implementation Steps with Pure Node.js
Environment Preparation and Project Structure
This project only relies on the Node.js runtime (v14+ is sufficient) and does not require installing any npm packages. Create the following directory structure:
tfidf-search/
├── corpus/ # Stores English documents
│ ├── doc1.txt
│ ├── doc2.txt
│ └── ...
├── index/ # Stores index files
│ └── inverted.csv
├── src/
│ ├── indexer.js # Index construction program
│ └── searcher.js # Search program
└── run.js # Entry file
The corpus directory stores the English documents to be searched, the index directory stores the CSV file of the inverted index, and src contains the core implementation code.
1. Text Data Processing
Text processing is the foundation of a search engine, requiring document reading, cleaning, and tokenization. Create src/indexer.js and implement the basic processing functions:
// Read document content
const fs = require('fs');
const path = require('path');

function readDocuments(corpusPath) {
  const docs = [];
  const files = fs.readdirSync(corpusPath);
  for (const file of files) {
    if (file.endsWith('.txt')) {
      const content = fs.readFileSync(path.join(corpusPath, file), 'utf8');
      docs.push({
        id: file, // Use the file name as the document id (e.g., "ai.txt")
        content: content
      });
    }
  }
  return docs;
}
Text cleaning involves removing punctuation, converting to lowercase, and tokenization. English tokenization is relatively simple and can be done by splitting on spaces, but special cases need to be handled:
function cleanText(text) {
  // Remove HTML tags if any
  let cleaned = text.replace(/<[^>]*>/g, ' ');
  // Convert to lowercase
  cleaned = cleaned.toLowerCase();
  // Remove punctuation and special characters
  cleaned = cleaned.replace(/[^a-z0-9\s]/g, ' ');
  // Merge multiple spaces into one
  cleaned = cleaned.replace(/\s+/g, ' ').trim();
  return cleaned;
}

function tokenize(text) {
  return cleanText(text).split(' ');
}
2. TF-IDF Calculation Implementation
Term Frequency (TF) Calculation
TF calculation needs to count the frequency of each word in the document, implemented as follows:
function calculateTF(tokens) {
  const tf = {};
  const totalTerms = tokens.length;
  for (const token of tokens) {
    if (token) { // Skip empty strings
      tf[token] = (tf[token] || 0) + 1;
    }
  }
  // Convert to frequency (term count / total words)
  for (const term in tf) {
    tf[term] = tf[term] / totalTerms;
  }
  return tf;
}
Inverse Document Frequency (IDF) Calculation
IDF first requires counting the number of documents containing each word:
function calculateDF(docs) {
  const df = {};
  const totalDocs = docs.length;
  for (const doc of docs) {
    const tokens = new Set(tokenize(doc.content)); // Deduplicate terms per document
    for (const token of tokens) {
      if (token) {
        df[token] = (df[token] || 0) + 1;
      }
    }
  }
  // Calculate IDF: log(total number of documents / number of documents containing the term)
  const idf = {};
  for (const term in df) {
    idf[term] = Math.log(totalDocs / df[term]);
  }
  return idf;
}
Document Vector Generation
Combine TF and IDF to generate vectors for each document:
function generateDocVectors(docs, idf) {
  const docVectors = {};
  for (const doc of docs) {
    const tokens = tokenize(doc.content);
    const tf = calculateTF(tokens);
    const vector = {};
    for (const term in tf) {
      vector[term] = tf[term] * (idf[term] || 0);
    }
    docVectors[doc.id] = vector;
  }
  return docVectors;
}
3. Inverted Index Construction and CSV Storage
The inverted index is the core data structure of a search engine, mapping each term to the list of documents that contain it, along with the term's weight in each document. Its structure is usually: term,docId1:weight1;docId2:weight2;…
function buildInvertedIndex(docVectors) {
  const index = {};
  // Build in-memory inverted index
  for (const docId in docVectors) {
    const vector = docVectors[docId];
    for (const term in vector) {
      if (!index[term]) {
        index[term] = [];
      }
      index[term].push({ docId, weight: vector[term] });
    }
  }
  // Convert to CSV format
  let csvContent = 'term,documents\n';
  for (const term in index) {
    const docsStr = index[term]
      .map(item => `${item.docId}:${item.weight.toFixed(6)}`)
      .join(';');
    csvContent += `"${term}",${docsStr}\n`;
  }
  return csvContent;
}

// Save index to CSV file
function saveIndex(csvContent, indexPath) {
  fs.writeFileSync(indexPath, csvContent, 'utf8');
}
The reasons for choosing the CSV format are: plain text is easy to read, the implementation is simple (no serialization library needed), and tools like Excel can open it directly. The drawback is that every query requires scanning the whole file, which can later be optimized with more efficient formats.
4. Complete Index Construction Process
Integrate the above functions into the index construction process:
async function buildIndex(corpusPath, indexPath) {
  try {
    console.log('Reading documents...');
    const docs = readDocuments(corpusPath);
    console.log('Calculating document frequency (DF)...');
    const idf = calculateDF(docs);
    console.log('Generating document vectors...');
    const docVectors = generateDocVectors(docs, idf);
    console.log('Building inverted index...');
    const csvContent = buildInvertedIndex(docVectors);
    console.log('Saving index to CSV...');
    saveIndex(csvContent, indexPath);
    console.log(`Index construction completed, processed ${docs.length} documents`);
  } catch (error) {
    console.error('Index construction failed:', error);
  }
}

// Export the helpers as well, since run.js and searcher.js reuse them
module.exports = { buildIndex, readDocuments, calculateDF, calculateTF, tokenize };
5. Search Function Implementation
The search program needs to: load the index, parse the query, calculate similarity, and return results.
// src/searcher.js
const fs = require('fs');
const { tokenize, calculateTF } = require('./indexer'); // Reuse tokenization and TF helpers

// Load CSV index and convert to in-memory object
function loadIndex(indexPath) {
  const index = {};
  const csvContent = fs.readFileSync(indexPath, 'utf8');
  const lines = csvContent.split('\n');
  // Skip header
  for (let i = 1; i < lines.length; i++) {
    const line = lines[i].trim();
    if (!line) continue;
    // Parse CSV line (handle quoted terms)
    const [termPart, docsPart] = line.split(',');
    const term = termPart.replace(/"/g, ''); // Remove quotes
    if (docsPart) {
      index[term] = docsPart.split(';').map(item => {
        const [docId, weight] = item.split(':');
        return { docId, weight: parseFloat(weight) };
      });
    }
  }
  return index;
}

// Calculate query vector
function getQueryVector(query, idf) {
  const tokens = tokenize(query);
  const tf = calculateTF(tokens);
  const vector = {};
  for (const term in tf) {
    vector[term] = tf[term] * (idf[term] || 0);
  }
  return vector;
}

// Execute search
function search(query, index, idf) {
  const queryVector = getQueryVector(query, idf);
  const results = {};
  // Accumulate document scores
  for (const term in queryVector) {
    if (index[term]) {
      const queryWeight = queryVector[term];
      for (const doc of index[term]) {
        // Simple similarity calculation: sum of weight products
        const score = (results[doc.docId] || 0) + queryWeight * doc.weight;
        results[doc.docId] = score;
      }
    }
  }
  // Convert to a sorted array
  return Object.entries(results)
    .map(([docId, score]) => ({ docId, score }))
    .sort((a, b) => b.score - a.score);
}

module.exports = { loadIndex, search };
6. Entry Program Implementation
Create run.js as the program entry, supporting two modes: index construction and search:
// run.js
const path = require('path');
const { buildIndex, calculateDF, readDocuments } = require('./src/indexer');
const { loadIndex, search } = require('./src/searcher');

const CORPUS_PATH = path.join(__dirname, 'corpus');
const INDEX_PATH = path.join(__dirname, 'index', 'inverted.csv');

// Command line argument processing
const args = process.argv.slice(2);

if (args[0] === 'build') {
  buildIndex(CORPUS_PATH, INDEX_PATH);
} else if (args[0] === 'search' && args[1]) {
  const query = args[1];
  try {
    console.log(`Searching: "${query}"`);
    const index = loadIndex(INDEX_PATH);
    const docs = readDocuments(CORPUS_PATH);
    const idf = calculateDF(docs);
    const results = search(query, index, idf);
    console.log('Search results (sorted by relevance):');
    results.forEach((result, i) => {
      console.log(`${i + 1}. ${result.docId} (Score: ${result.score.toFixed(6)})`);
    });
  } catch (error) {
    console.error('Search failed:', error);
  }
} else {
  console.log('Usage:');
  console.log('  Build index: node run.js build');
  console.log('  Search:      node run.js search "query terms"');
}
Function Testing and Verification
Test Data Preparation
Create 3 English documents in the corpus directory:
- ai.txt: "Artificial intelligence is the simulation of human intelligence processes by machines, especially computer systems. These processes include learning, reasoning, and self-correction."
- ml.txt: "Machine learning is a subset of artificial intelligence focused on developing algorithms that learn from data. It enables computers to improve performance on a task through experience."
- dl.txt: "Deep learning is a type of machine learning based on artificial neural networks with multiple layers. It excels at processing unstructured data like images and text."
Index Construction Test
Run node run.js build. The console will log each processing step, and after completion index/inverted.csv will contain content like the following (excerpt):
term,documents
"intelligence",ai.txt:0.057538;ml.txt:0.028769
"artificial",ai.txt:0.057538;dl.txt:0.038359
"machine",ai.txt:0.028769;ml.txt:0.028769;dl.txt:0.025573
"learning",ml.txt:0.057538;dl.txt:0.057538
Search Function Test
Execute node run.js search "machine learning", which should return:
Searching: "machine learning"
Search results (sorted by relevance):
1. ml.txt (Score: 0.003308)
2. dl.txt (Score: 0.001462)
3. ai.txt (Score: 0.000832)
The results are as expected: ml.txt (machine learning) is the most relevant, followed by dl.txt (deep learning), and ai.txt (artificial intelligence) has the lowest relevance.
Technical Optimization and Expansion Directions
Limitations of the Basic Version
The current implementation has obvious limitations:
- Simple tokenization: only splits on spaces, not handling hyphenated words (e.g., "state-of-the-art"), abbreviations (e.g., "don't"), etc.
- Low index efficiency: CSV format requires full-text scanning during queries, with performance degradation for large datasets.
- Basic ranking algorithm: only uses simple weighted summation, without implementing more accurate algorithms like cosine similarity.
- No stop word processing: high-frequency meaningless words like "the" and "is" occupy index space.
Optimization Implementation Plans
1. Advanced Tokenization Implementation
Improve the tokenization function to handle special cases:
function advancedTokenize(text) {
  let cleaned = cleanText(text);
  // Handle hyphens and abbreviations
  cleaned = cleaned.replace(/-/g, ' ').replace(/'s/g, ' ');
  return cleaned.split(' ').filter(token => token.length > 1); // Filter single-character tokens
}
2. Stop Word Filtering
Add a stop word list and filter:
const STOP_WORDS = new Set([
  'the', 'and', 'of', 'to', 'a', 'in', 'is', 'it', 'you', 'that'
  // Can be extended with more stop words
]);

function tokenizeWithStopWords(text) {
  return advancedTokenize(text).filter(token => !STOP_WORDS.has(token));
}
3. Index Format Optimization
Change the CSV to a more efficient key-value storage format (e.g., a pipe-delimited term|docId1:weight1|docId2:weight2 layout):
// Optimized index format generation (takes the in-memory inverted index built above)
function buildOptimizedIndex(index) {
  let indexContent = '';
  for (const term in index) {
    const docsStr = index[term]
      .map(item => `${item.docId}:${item.weight.toFixed(6)}`)
      .join('|');
    indexContent += `${term}|${docsStr}\n`;
  }
  return indexContent;
}
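A matching loader for this pipe-delimited layout might look like the sketch below; the function name loadOptimizedIndex is an assumption (it is not part of the code above), and it reuses the fs module already required in searcher.js:

// Sketch: load the pipe-delimited index produced by buildOptimizedIndex
// Assumed line layout: term|docId1:weight1|docId2:weight2
function loadOptimizedIndex(indexPath) {
  const index = {};
  const lines = fs.readFileSync(indexPath, 'utf8').split('\n');
  for (const line of lines) {
    if (!line.trim()) continue;
    const [term, ...docEntries] = line.split('|');
    index[term] = docEntries.map(entry => {
      const [docId, weight] = entry.split(':');
      return { docId, weight: parseFloat(weight) };
    });
  }
  return index;
}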
4. Cosine Similarity Ranking
Implement more accurate similarity calculation:
function calculateCosineSimilarity(queryVector, docVector) {
  let dotProduct = 0;
  let queryNorm = 0;
  let docNorm = 0;
  // Calculate dot product and vector magnitudes
  for (const term in queryVector) {
    queryNorm += Math.pow(queryVector[term], 2);
    if (docVector[term]) {
      dotProduct += queryVector[term] * docVector[term];
    }
  }
  for (const term in docVector) {
    docNorm += Math.pow(docVector[term], 2);
  }
  // Prevent division by zero
  if (queryNorm === 0 || docNorm === 0) return 0;
  return dotProduct / (Math.sqrt(queryNorm) * Math.sqrt(docNorm));
}
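To actually rank results with this function, the searcher needs the full document vectors (from generateDocVectors) rather than only the inverted index. A possible sketch, assuming it lives in searcher.js next to getQueryVector (the searchWithCosine name is hypothetical):

// Sketch: rank all documents by cosine similarity to the query
// Assumes docVectors comes from generateDocVectors and idf from calculateDF
function searchWithCosine(query, docVectors, idf) {
  const queryVector = getQueryVector(query, idf);
  return Object.entries(docVectors)
    .map(([docId, docVector]) => ({
      docId,
      score: calculateCosineSimilarity(queryVector, docVector)
    }))
    .filter(result => result.score > 0)
    .sort((a, b) => b.score - a.score);
}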
Technical Summary and Application Scenarios
Although the search engine implemented in this article is simple, it fully contains the core components of modern search engines: text processing, TF-IDF weighting, inverted indexing, and result ranking. By implementing it with pure Node.js, we avoid dependencies on external libraries and gain an in-depth understanding of the implementation details of each technical link.
This technology can be applied in:
- Small document library retrieval: such as internal search for project documents and help systems
- Teaching demonstrations: intuitively showing the core principles of information retrieval
- Secondary development foundation: can be extended to support Chinese word segmentation (requires adding ICU library or custom segmentation), distributed indexing, and other more complex systems
From a technical evolution perspective, TF-IDF is the foundation of more advanced algorithms like BM25, and CSV indexing can be seen as a simplified version of professional indexing libraries like Lucene. Understanding these basic implementations can help developers better grasp the working principles of advanced tools like Elasticsearch.
Leapcell: The Best of Serverless Web Hosting
Finally, we recommend the most suitable platform for deploying Node.js services: Leapcell
🚀 Build with Your Favorite Language
Develop effortlessly in JavaScript, Python, Go, or Rust.
🌍 Deploy Unlimited Projects for Free
Only pay for what you use—no requests, no charges.
⚡ Pay-as-You-Go, No Hidden Costs
No idle fees, just seamless scalability.
🔹 Follow us on Twitter: @LeapcellHQ