Running machine learning models directly in the user's web browser is no longer a futuristic concept; it is practical in production today. By shifting inference from centralized server clusters to client devices, web developers can eliminate per-request inference costs, keep features working offline once the model is cached, and keep user text on the device. However, executing neural networks in a browser comes with real constraints: high memory usage, multi-megabyte model downloads, and a UI that can freeze under heavy computation.
In this article, we will examine the client-side embedding generation architecture implemented in the Interview Copilot project. We will explore how it leverages the Web Crypto API, ONNX Runtime Web, and Transformers.js inside background Web Workers to run real-time semantic search without blocking the browser's user interface thread.
The Threading Architecture: Main Thread vs Web Workers
Generating vector embeddings requires passing input text through a tokenizer, running it through several layers of transformer blocks (which perform high-dimensional matrix multiplications), and pooling the outputs. If these calculations run on the browser's main thread, the page freezes, touch gestures fail, and audio recording stutters. To avoid blocking, we split responsibilities across threads using Web Workers: the main thread keeps handling rendering and user input, while a background worker loads the model and runs inference. The two sides communicate by message passing.
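The main-thread side of this split can be sketched as a small promise wrapper around the message protocol. This is a minimal sketch, not the project's actual client code: the message types GENERATE_EMBEDDING, EMBEDDING_SUCCESS, and ERROR match the worker shown later in this article, while `WorkerLike` and `requestEmbedding` are hypothetical names introduced here for illustration.

```typescript
// Message shapes mirror the worker protocol used in this article.
type EmbeddingRequest = {
  type: "GENERATE_EMBEDDING";
  payload: { text: string; context?: unknown };
};

type WorkerResponse =
  | { type: "EMBEDDING_SUCCESS"; payload: { text: string; embedding: number[]; context?: unknown } }
  | { type: "ERROR"; payload: string };

type ResponseListener = (event: { data: WorkerResponse }) => void;

// Hypothetical structural type: just the parts of a Worker this helper needs,
// so it can also be driven by a test double.
interface WorkerLike {
  postMessage(message: EmbeddingRequest): void;
  addEventListener(type: "message", listener: ResponseListener): void;
  removeEventListener(type: "message", listener: ResponseListener): void;
}

// Sends one text to the worker and resolves with its embedding.
// Replies are matched by text, so two in-flight requests with identical
// text would both resolve on the first reply (acceptable for a sketch).
function requestEmbedding(worker: WorkerLike, text: string): Promise<number[]> {
  return new Promise((resolve, reject) => {
    const onMessage: ResponseListener = (event) => {
      const message = event.data;
      if (message.type === "EMBEDDING_SUCCESS") {
        if (message.payload.text !== text) return; // reply to a different request
        worker.removeEventListener("message", onMessage);
        resolve(message.payload.embedding);
      } else if (message.type === "ERROR") {
        worker.removeEventListener("message", onMessage);
        reject(new Error(message.payload));
      }
    };
    worker.addEventListener("message", onMessage);
    worker.postMessage({ type: "GENERATE_EMBEDDING", payload: { text } });
  });
}
```

Because the main thread only posts a message and awaits a promise, it stays free to render and handle input while the worker does the heavy computation.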
The Tech Stack: ONNX Runtime and Transformers.js
We rely on one model format and two open-source libraries for client-side embedding generation:
- ONNX (Open Neural Network Exchange): An open, framework-neutral format. Models trained in PyTorch or TensorFlow are exported to ONNX as a portable binary graph that a runtime can then optimize and execute.
- ONNX Runtime Web: Runs ONNX models directly in browser environments, using WebAssembly (WASM) for CPU execution and WebGPU for GPU acceleration.
- Transformers.js (@xenova/transformers): A JavaScript library that mirrors the pipeline API of the Hugging Face Transformers Python library, managing model downloading, tokenization, model execution, and tensor post-processing under one unified API.
Web Worker Implementation
The Web Worker is loaded asynchronously by the main thread. It acts as an event-driven listener, receiving text blocks, compiling them into vectors, and returning them via message passing. Below is the worker initialization and extraction logic:
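import { pipeline, env } from "@xenova/transformers";

// Skip the lookup for locally hosted model files; always fetch from the Hugging Face Hub
env.allowLocalModels = false;
// Restrict WASM backend execution to a single thread to avoid worker thread contention
env.backends.onnx.wasm.numThreads = 1;

// Cache the promise (not the resolved instance) so that messages arriving
// while the model is still loading share one load instead of starting another
let extractorPromise: Promise<any> | null = null;

function getExtractor(progressCallback: (data: any) => void) {
  if (!extractorPromise) {
    extractorPromise = pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2", {
      progress_callback: progressCallback
    }).catch((error) => {
      // Clear the cache on failure so a later request can retry the load
      extractorPromise = null;
      throw error;
    });
  }
  return extractorPromise;
}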
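Here, we set allowLocalModels to false so that Transformers.js does not first look for model files hosted under a local path on our own origin; the files are fetched from the Hugging Face Hub instead. Once downloaded, the model files are stored through the browser's Cache Storage API, so subsequent loads skip the network and work offline. The model still has to be initialized in memory each time the worker starts, so later loads are much faster but not free.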
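Since the first load involves a sizeable download, the progress callback is a natural place to drive a loading indicator. The following is a minimal sketch under stated assumptions: it assumes the callback receives events carrying `status`, `file`, `loaded`, and `total` fields (a subset of what Transformers.js reports, worth verifying against the installed version), and `ModelProgressEvent` and `createProgressTracker` are hypothetical names, not part of the library.

```typescript
// Assumed subset of the event object passed to progress_callback.
type ModelProgressEvent = {
  status: string;   // e.g. "initiate", "progress", "done"
  file?: string;    // which model file the event refers to
  loaded?: number;  // bytes received so far
  total?: number;   // expected bytes for this file
};

// Returns a callback that aggregates per-file byte counts into one
// overall percentage and reports it through onUpdate.
function createProgressTracker(onUpdate: (percent: number) => void) {
  const files = new Map<string, { loaded: number; total: number }>();

  return (event: ModelProgressEvent): void => {
    // Only byte-level progress events with a known size are counted.
    if (event.status !== "progress" || !event.file || !event.total) return;
    files.set(event.file, { loaded: event.loaded ?? 0, total: event.total });

    let loaded = 0;
    let total = 0;
    files.forEach((entry) => {
      loaded += entry.loaded;
      total += entry.total;
    });
    onUpdate(Math.round((loaded / total) * 100));
  };
}
```

In the worker, the tracker could be passed as `getExtractor(createProgressTracker((percent) => self.postMessage({ type: "MODEL_PROGRESS", payload: percent })))`, where MODEL_PROGRESS is a made-up message type. Note that the overall percentage can drop when a new file starts downloading, because the known total grows.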
Generating Embeddings: Tokenization, Forward Pass, and Normalization
When the worker receives a GENERATE_EMBEDDING request, it executes the feature-extraction pipeline. The raw text must be converted into a mathematical representation called a tensor, processed by the model, and then pooled. Here is how we generate a normalized 384-dimensional float array:
self.addEventListener("message", async (event: MessageEvent) => {
  const { type, payload } = event.data;
  if (type === "GENERATE_EMBEDDING") {
    try {
      const ext = await getExtractor(() => {});
      // Mean pooling and normalization are required to produce correct cosine similarity vectors
      const output = await ext(payload.text, { pooling: "mean", normalize: true });
      // For a single input string the output has one row; take it as a plain number[]
      const embedding = output.tolist()[0];
      self.postMessage({
        type: "EMBEDDING_SUCCESS",
        payload: { text: payload.text, embedding, context: payload.context }
      });
    } catch (error) {
      // Under strict TypeScript the caught value is `unknown`, so narrow it before reading .message
      const message = error instanceof Error ? error.message : String(error);
      self.postMessage({ type: "ERROR", payload: message });
    }
  }
});
The call to ext() performs a series of computational tasks:
- Tokenization: The text is split into tokens, which are mapped to integer IDs using the model's vocabulary.
- Forward Pass: The token IDs are fed to the model (in this case, Xenova/all-MiniLM-L6-v2, a lightweight Sentence Transformer model).
- Mean Pooling: Averages token-level embeddings across the sequence to produce a single sentence vector.
- Normalization: Scales the vector to unit length (L2 norm = 1). For unit-length vectors, cosine similarity equals the dot product, so similarity search in SQLite WASM needs only a dot product per comparison, with no norm computation or division.
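The pooling and normalization steps above can be sketched in plain TypeScript. This is an illustration of the math only, not the library's internal implementation; the function names and the use of nested `number[]` arrays instead of tensors are assumptions made for readability.

```typescript
// Averages token embeddings, skipping positions masked out as padding.
function meanPool(tokenEmbeddings: number[][], attentionMask: number[]): number[] {
  const dim = tokenEmbeddings[0].length;
  const sum = new Array<number>(dim).fill(0);
  let count = 0;
  tokenEmbeddings.forEach((vector, i) => {
    if (attentionMask[i] === 0) return; // padding token: ignore
    count += 1;
    for (let d = 0; d < dim; d++) sum[d] += vector[d];
  });
  return sum.map((value) => value / Math.max(count, 1));
}

// Scales a vector to unit length (L2 norm = 1); a zero vector is returned unchanged.
function l2Normalize(vector: number[]): number[] {
  const norm = Math.sqrt(vector.reduce((acc, v) => acc + v * v, 0));
  return norm === 0 ? vector.slice() : vector.map((v) => v / norm);
}

function dot(a: number[], b: number[]): number {
  return a.reduce((acc, v, i) => acc + v * b[i], 0);
}

// Full cosine similarity, for comparison with the dot-product shortcut.
function cosineSimilarity(a: number[], b: number[]): number {
  return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

// Toy example: three 2-dimensional token vectors, the last one is padding.
const sentence = meanPool([[1, 2], [3, 4], [9, 9]], [1, 1, 0]);
const query = [1, 0];

// With both vectors normalized, the dot product matches the cosine similarity.
const viaDot = dot(l2Normalize(sentence), l2Normalize(query));
const viaCosine = cosineSimilarity(sentence, query);
```

Normalizing once at indexing time moves the norm computation out of the search loop, which is why the stored vectors can be compared with a dot product alone.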
Key Optimizations for Web-Based Machine Learning
Several settings have a direct effect on performance:
- Single Thread Limit inside Web Workers: We explicitly set env.backends.onnx.wasm.numThreads = 1. Multi-threaded WASM depends on SharedArrayBuffer, which browsers expose only on cross-origin isolated pages, and inside a Web Worker the extra threads can add overhead and contention that slow down overall execution. A single thread avoids both problems and behaves more predictably.
- Quantized Models: Transformers.js loads quantized (8-bit integer) ONNX weights by default. This reduces model size (e.g. all-MiniLM-L6-v2 is compressed down to ~23 MB) and significantly decreases the amount of data downloaded on first use.
- Cache Storage API: By caching the model files in Cache Storage, the web app functions offline after the initial visit, running vector indexing entirely on the client without contacting external API servers.
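The effect of quantization on download size can be checked with back-of-the-envelope arithmetic. This sketch assumes roughly 22.7 million parameters for all-MiniLM-L6-v2 (an approximate figure, not taken from this project) and ignores file-format overhead and any tensors left unquantized; `estimateWeightSizeMB` is a hypothetical helper.

```typescript
// Estimated size of the raw weights in megabytes (1 MB = 1,000,000 bytes).
function estimateWeightSizeMB(parameterCount: number, bitsPerWeight: number): number {
  const bytes = (parameterCount * bitsPerWeight) / 8;
  return bytes / 1e6;
}

// Assumption: approximate parameter count for all-MiniLM-L6-v2.
const APPROX_PARAMETER_COUNT = 22.7e6;

const float32SizeMB = estimateWeightSizeMB(APPROX_PARAMETER_COUNT, 32); // about 90.8
const int8SizeMB = estimateWeightSizeMB(APPROX_PARAMETER_COUNT, 8);     // about 22.7

console.log(`float32: ~${float32SizeMB.toFixed(1)} MB, int8: ~${int8SizeMB.toFixed(1)} MB`);
```

The 8-bit estimate lands close to the ~23 MB figure quoted above, about a quarter of the 32-bit size, which is the ratio expected when each weight shrinks from four bytes to one.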
Moving machine learning inference into the user's browser not only saves cloud resources but also gives users control over their own data: text is embedded on the device, with no inference server to pay for and none to send data to.