Build a Pure Node.js Search Engine from Scratch
Grace Collins
Solutions Engineer · Leapcell

Building an English Search Engine with TF-IDF Using Pure Node.js: From Principles to CSV Inverted Index Implementation
In the era of information explosion, search engines have become the core tool for people to access information. From Google to Bing, these large-scale search engines are backed by complex technical architectures, but their core principles can be implemented using basic technology stacks. This article will guide you through building a TF-IDF algorithm-based English search engine from scratch using pure Node.js, without any third-party libraries, storing the inverted index in CSV files. Through this practice, you will gain a deep understanding of the core mechanisms of information retrieval and master key technologies in text processing, weight calculation, and index construction.
Fundamentals of Search Engines and TF-IDF
A search engine is essentially an information filtering system whose core function is to receive user queries, quickly find the most relevant results from massive documents, and rank them. Modern search engines consist of four core modules: crawler (data acquisition), indexer (construction of retrieval structures), retriever (query processing), and ranking algorithm (result optimization). Our simplified version will focus on the indexer and retriever, using local documents as the data source.
TF-IDF (Term Frequency-Inverse Document Frequency) is a classic weighting scheme in information retrieval; its IDF component was proposed by Karen Spärck Jones in 1972, and it is still widely used in text systems today. Its core idea is: the importance of a word to a document is proportional to how often the word appears in that document and inversely proportional to how many documents in the corpus contain it.
- Term Frequency (TF): The number of times a word appears in the current document divided by the total number of words in the document. The formula is TF(t,d) = count(t,d) / totalTerms(d). For example, if "machine" appears 5 times in a document with 100 total words, the TF value is 0.05.
- Inverse Document Frequency (IDF): The logarithm of the total number of documents divided by the number of documents containing the word. The formula is IDF(t) = log(totalDocs / docsWithTerm(t)). If the corpus has 1000 documents, 20 of which contain "learning", the IDF value is log(1000/20) = log(50) ≈ 3.912.
Multiplying the two gives TF-IDF(t,d) = TF(t,d) × IDF(t). The higher this value, the more representative the word is of the current document. During searching, by calculating the TF-IDF vector similarity (such as cosine similarity) between the query terms and documents, result ranking can be achieved.
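As a quick sanity check of the numbers above, the following standalone snippet (illustrative values only, not part of the project code) reproduces the calculation:

// Worked example of the formulas above (standalone, illustrative values only)
const tf = 5 / 100;               // TF: the term appears 5 times in a 100-word document -> 0.05
const idf = Math.log(1000 / 20);  // IDF: 1000 documents, 20 contain the term -> log(50) ≈ 3.912
const tfidf = tf * idf;           // TF-IDF ≈ 0.196
console.log({ tf, idf: idf.toFixed(3), tfidf: tfidf.toFixed(3) });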
Implementation Steps with Pure Node.js
Environment Preparation and Project Structure
This project only relies on the Node.js runtime (v14+ is sufficient) and does not require installing any npm packages. Create the following directory structure:
tfidf-search/
├── corpus/ # Stores English documents
│ ├── doc1.txt
│ ├── doc2.txt
│ └── ...
├── index/ # Stores index files
│ └── inverted.csv
├── src/
│ ├── indexer.js # Index construction program
│ └── searcher.js # Search program
└── run.js # Entry file
The corpus directory stores the English documents to be searched, the index directory stores the CSV file of the inverted index, and src contains the core implementation code.
1. Text Data Processing
Text processing is the foundation of a search engine, requiring document reading, cleaning, and tokenization. Create src/indexer.js and implement the basic processing functions:
// Read document content
const fs = require('fs');
const path = require('path');

function readDocuments(corpusPath) {
  const docs = [];
  const files = fs.readdirSync(corpusPath);
  for (const file of files) {
    if (file.endsWith('.txt')) {
      const content = fs.readFileSync(path.join(corpusPath, file), 'utf8');
      docs.push({
        id: file, // Use the file name as the document id (e.g., "ai.txt")
        content: content
      });
    }
  }
  return docs;
}
Text cleaning involves removing punctuation, converting to lowercase, and tokenization. English tokenization is relatively simple and can be done by splitting on spaces, but special cases need to be handled:
function cleanText(text) {
  // Remove HTML tags if any
  let cleaned = text.replace(/<[^>]*>/g, ' ');
  // Convert to lowercase
  cleaned = cleaned.toLowerCase();
  // Remove punctuation and special characters
  cleaned = cleaned.replace(/[^a-z0-9\s]/g, ' ');
  // Merge multiple spaces into one
  cleaned = cleaned.replace(/\s+/g, ' ').trim();
  return cleaned;
}

function tokenize(text) {
  return cleanText(text).split(' ');
}
2. TF-IDF Calculation Implementation
Term Frequency (TF) Calculation
TF calculation needs to count the frequency of each word in the document, implemented as follows:
function calculateTF(tokens) {
  const tf = {};
  const totalTerms = tokens.length;
  for (const token of tokens) {
    if (token) { // Skip empty strings
      tf[token] = (tf[token] || 0) + 1;
    }
  }
  // Convert to frequency (term count / total words)
  for (const term in tf) {
    tf[term] = tf[term] / totalTerms;
  }
  return tf;
}
Inverse Document Frequency (IDF) Calculation
IDF first requires counting the number of documents containing each word:
function calculateDF(docs) {
  const df = {};
  const totalDocs = docs.length;
  for (const doc of docs) {
    const tokens = new Set(tokenize(doc.content)); // Deduplicate terms per document
    for (const token of tokens) {
      if (token) {
        df[token] = (df[token] || 0) + 1;
      }
    }
  }
  // Calculate IDF: log(total number of documents / number of documents containing the term)
  const idf = {};
  for (const term in df) {
    idf[term] = Math.log(totalDocs / df[term]);
  }
  return idf;
}
Document Vector Generation
Combine TF and IDF to generate vectors for each document:
function generateDocVectors(docs, idf) {
  const docVectors = {};
  for (const doc of docs) {
    const tokens = tokenize(doc.content);
    const tf = calculateTF(tokens);
    const vector = {};
    for (const term in tf) {
      vector[term] = tf[term] * (idf[term] || 0);
    }
    docVectors[doc.id] = vector;
  }
  return docVectors;
}
3. Inverted Index Construction and CSV Storage
The inverted index is the core data structure of a search engine, mapping each term to the list of documents that contain it, along with the term's weight in each document. Its structure is usually: term,docId1:weight1;docId2:weight2;…
function buildInvertedIndex(docVectors) {
  const index = {};
  // Build in-memory inverted index
  for (const docId in docVectors) {
    const vector = docVectors[docId];
    for (const term in vector) {
      if (!index[term]) {
        index[term] = [];
      }
      index[term].push({ docId, weight: vector[term] });
    }
  }
  // Convert to CSV format
  let csvContent = 'term,documents\n';
  for (const term in index) {
    const docsStr = index[term]
      .map(item => `${item.docId}:${item.weight.toFixed(6)}`)
      .join(';');
    csvContent += `"${term}",${docsStr}\n`;
  }
  return csvContent;
}

// Save index to CSV file
function saveIndex(csvContent, indexPath) {
  fs.writeFileSync(indexPath, csvContent, 'utf8');
}
The reasons for choosing the CSV format are: plain text is easy to read, the implementation is simple (no serialization library needed), and tools like Excel can open it directly. The drawback is that every query requires scanning the whole file, which can later be optimized with more efficient formats.
4. Complete Index Construction Process
Integrate the above functions into the index construction process:
async function buildIndex(corpusPath, indexPath) {
  try {
    console.log('Reading documents...');
    const docs = readDocuments(corpusPath);
    console.log('Calculating document frequency (DF)...');
    const idf = calculateDF(docs);
    console.log('Generating document vectors...');
    const docVectors = generateDocVectors(docs, idf);
    console.log('Building inverted index...');
    const csvContent = buildInvertedIndex(docVectors);
    console.log('Saving index to CSV...');
    saveIndex(csvContent, indexPath);
    console.log(`Index construction completed, processed ${docs.length} documents`);
  } catch (error) {
    console.error('Index construction failed:', error);
  }
}

// Export the helpers as well, since run.js and searcher.js reuse them
module.exports = { buildIndex, readDocuments, calculateDF, calculateTF, tokenize };
5. Search Function Implementation
The search program needs to: load the index, parse the query, calculate similarity, and return results.
// src/searcher.js
const fs = require('fs');
const { tokenize, calculateTF } = require('./indexer'); // Reuse tokenization and TF helpers

// Load CSV index and convert to in-memory object
function loadIndex(indexPath) {
  const index = {};
  const csvContent = fs.readFileSync(indexPath, 'utf8');
  const lines = csvContent.split('\n');
  // Skip header
  for (let i = 1; i < lines.length; i++) {
    const line = lines[i].trim();
    if (!line) continue;
    // Parse CSV line (handle quoted terms)
    const [termPart, docsPart] = line.split(',');
    const term = termPart.replace(/"/g, ''); // Remove quotes
    if (docsPart) {
      index[term] = docsPart.split(';').map(item => {
        const [docId, weight] = item.split(':');
        return { docId, weight: parseFloat(weight) };
      });
    }
  }
  return index;
}

// Calculate query vector
function getQueryVector(query, idf) {
  const tokens = tokenize(query);
  const tf = calculateTF(tokens);
  const vector = {};
  for (const term in tf) {
    vector[term] = tf[term] * (idf[term] || 0);
  }
  return vector;
}

// Execute search
function search(query, index, idf) {
  const queryVector = getQueryVector(query, idf);
  const results = {};
  // Accumulate document scores
  for (const term in queryVector) {
    if (index[term]) {
      const queryWeight = queryVector[term];
      for (const doc of index[term]) {
        // Simple similarity calculation: sum of weight products
        const score = (results[doc.docId] || 0) + queryWeight * doc.weight;
        results[doc.docId] = score;
      }
    }
  }
  // Convert to a sorted array
  return Object.entries(results)
    .map(([docId, score]) => ({ docId, score }))
    .sort((a, b) => b.score - a.score);
}

module.exports = { loadIndex, search };
6. Entry Program Implementation
Create run.js as the program entry, supporting two modes: index construction and search:
// run.js
const path = require('path');
const { buildIndex, calculateDF, readDocuments } = require('./src/indexer');
const { loadIndex, search } = require('./src/searcher');

const CORPUS_PATH = path.join(__dirname, 'corpus');
const INDEX_PATH = path.join(__dirname, 'index', 'inverted.csv');

// Command line argument processing
const args = process.argv.slice(2);

if (args[0] === 'build') {
  buildIndex(CORPUS_PATH, INDEX_PATH);
} else if (args[0] === 'search' && args[1]) {
  const query = args[1];
  try {
    console.log(`Searching: "${query}"`);
    const index = loadIndex(INDEX_PATH);
    const docs = readDocuments(CORPUS_PATH);
    const idf = calculateDF(docs);
    const results = search(query, index, idf);
    console.log('Search results (sorted by relevance):');
    results.forEach((result, i) => {
      console.log(`${i + 1}. ${result.docId} (Score: ${result.score.toFixed(6)})`);
    });
  } catch (error) {
    console.error('Search failed:', error);
  }
} else {
  console.log('Usage:');
  console.log('  Build index: node run.js build');
  console.log('  Search:      node run.js search "query terms"');
}
Function Testing and Verification
Test Data Preparation
Create 3 English documents in the corpus directory:
- ai.txt: "Artificial intelligence is the simulation of human intelligence processes by machines, especially computer systems. These processes include learning, reasoning, and self-correction."
- ml.txt: "Machine learning is a subset of artificial intelligence focused on developing algorithms that learn from data. It enables computers to improve performance on a task through experience."
- dl.txt: "Deep learning is a type of machine learning based on artificial neural networks with multiple layers. It excels at processing unstructured data like images and text."
Index Construction Test
Run node run.js build. The console will log each processing step, and after completion index/inverted.csv will contain content like the following (excerpt):
term,documents
"intelligence",ai.txt:0.057538;ml.txt:0.028769
"artificial",ai.txt:0.057538;dl.txt:0.038359
"machine",ai.txt:0.028769;ml.txt:0.028769;dl.txt:0.025573
"learning",ml.txt:0.057538;dl.txt:0.057538
Search Function Test
Execute node run.js search "machine learning", which should return:
Searching: "machine learning"
Search results (sorted by relevance):
1. ml.txt (Score: 0.003308)
2. dl.txt (Score: 0.001462)
3. ai.txt (Score: 0.000832)
The results are as expected: ml.txt (machine learning) is the most relevant, followed by dl.txt (deep learning), and ai.txt (artificial intelligence) has the lowest relevance.
Technical Optimization and Expansion Directions
Limitations of the Basic Version
The current implementation has obvious limitations:
- Simple tokenization: only splits on spaces, not handling hyphenated words (e.g., "state-of-the-art"), abbreviations (e.g., "don't"), etc.
- Low index efficiency: CSV format requires full-text scanning during queries, with performance degradation for large datasets.
- Basic ranking algorithm: only uses simple weighted summation, without implementing more accurate algorithms like cosine similarity.
- No stop word processing: high-frequency meaningless words like "the" and "is" occupy index space.
Optimization Implementation Plans
1. Advanced Tokenization Implementation
Improve the tokenization function to handle special cases:
function advancedTokenize(text) {
  let cleaned = cleanText(text);
  // Handle hyphens and abbreviations
  cleaned = cleaned.replace(/-/g, ' ').replace(/'s/g, ' ');
  return cleaned.split(' ').filter(token => token.length > 1); // Filter single-character tokens
}
2. Stop Word Filtering
Add a stop word list and filter:
const STOP_WORDS = new Set([
  'the', 'and', 'of', 'to', 'a', 'in', 'is', 'it', 'you', 'that'
  // Can be extended with more stop words
]);

function tokenizeWithStopWords(text) {
  return advancedTokenize(text).filter(token => !STOP_WORDS.has(token));
}
3. Index Format Optimization
Change the CSV to a more efficient key-value storage format (e.g., a pipe-delimited term|docId1:weight1|docId2:weight2 layout):
// Optimized index format generation (takes the in-memory inverted index built above)
function buildOptimizedIndex(index) {
  let indexContent = '';
  for (const term in index) {
    const docsStr = index[term]
      .map(item => `${item.docId}:${item.weight.toFixed(6)}`)
      .join('|');
    indexContent += `${term}|${docsStr}\n`;
  }
  return indexContent;
}
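A matching loader for this pipe-delimited layout might look like the sketch below; the function name loadOptimizedIndex is an assumption (it is not part of the code above), and it reuses the fs module already required in searcher.js:

// Sketch: load the pipe-delimited index produced by buildOptimizedIndex
// Assumed line layout: term|docId1:weight1|docId2:weight2
function loadOptimizedIndex(indexPath) {
  const index = {};
  const lines = fs.readFileSync(indexPath, 'utf8').split('\n');
  for (const line of lines) {
    if (!line.trim()) continue;
    const [term, ...docEntries] = line.split('|');
    index[term] = docEntries.map(entry => {
      const [docId, weight] = entry.split(':');
      return { docId, weight: parseFloat(weight) };
    });
  }
  return index;
}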
4. Cosine Similarity Ranking
Implement more accurate similarity calculation:
function calculateCosineSimilarity(queryVector, docVector) {
  let dotProduct = 0;
  let queryNorm = 0;
  let docNorm = 0;
  // Calculate dot product and vector magnitudes
  for (const term in queryVector) {
    queryNorm += Math.pow(queryVector[term], 2);
    if (docVector[term]) {
      dotProduct += queryVector[term] * docVector[term];
    }
  }
  for (const term in docVector) {
    docNorm += Math.pow(docVector[term], 2);
  }
  // Prevent division by zero
  if (queryNorm === 0 || docNorm === 0) return 0;
  return dotProduct / (Math.sqrt(queryNorm) * Math.sqrt(docNorm));
}
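To actually rank results with this function, the searcher needs the full document vectors (from generateDocVectors) rather than only the inverted index. A possible sketch, assuming it lives in searcher.js next to getQueryVector (the searchWithCosine name is hypothetical):

// Sketch: rank all documents by cosine similarity to the query
// Assumes docVectors comes from generateDocVectors and idf from calculateDF
function searchWithCosine(query, docVectors, idf) {
  const queryVector = getQueryVector(query, idf);
  return Object.entries(docVectors)
    .map(([docId, docVector]) => ({
      docId,
      score: calculateCosineSimilarity(queryVector, docVector)
    }))
    .filter(result => result.score > 0)
    .sort((a, b) => b.score - a.score);
}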
Technical Summary and Application Scenarios
Although the search engine implemented in this article is simple, it fully contains the core components of modern search engines: text processing, TF-IDF weighting, inverted indexing, and result ranking. By implementing it with pure Node.js, we avoid dependencies on external libraries and gain an in-depth understanding of the implementation details of each technical link.
This technology can be applied in:
- Small document library retrieval: such as internal search for project documents and help systems
- Teaching demonstrations: intuitively showing the core principles of information retrieval
- Secondary development foundation: can be extended to support Chinese word segmentation (requires adding ICU library or custom segmentation), distributed indexing, and other more complex systems
From a technical evolution perspective, TF-IDF is the foundation of more advanced algorithms like BM25, and CSV indexing can be seen as a simplified version of professional indexing libraries like Lucene. Understanding these basic implementations can help developers better grasp the working principles of advanced tools like Elasticsearch.
Leapcell: The Best of Serverless Web Hosting
Finally, we recommend the most suitable platform for deploying Node.js services: Leapcell
🚀 Build with Your Favorite Language
Develop effortlessly in JavaScript, Python, Go, or Rust.
🌍 Deploy Unlimited Projects for Free
Only pay for what you use—no requests, no charges.
⚡ Pay-as-You-Go, No Hidden Costs
No idle fees, just seamless scalability.
🔹 Follow us on Twitter: @LeapcellHQ