Why Machines Need Vectors
Neural networks operate exclusively on numbers. They don't understand words like "cat," "bank," or "love" in any symbolic sense — they process continuous floating-point variables on which matrix operations can be performed, as required by backpropagation and gradient descent. The traditional approach, one-hot encoding, represented each word as a vector of zeros with a single 1 at the position corresponding to its index in the vocabulary. With a vocabulary of 100,000 words, that meant 100,000-dimensional vectors, nearly all zeros. This representation was not only computationally wasteful but semantically blind — the dot product between "dog" and "cat" was zero, as if the two words had nothing to do with each other.
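The blindness of one-hot vectors is easy to see in a few lines. This is a minimal sketch with an invented five-word vocabulary; the helper names `one_hot` and `dot` are illustrative, not from any library.

```python
# Toy vocabulary (an assumption for illustration); a real one might hold 100,000 entries.
vocab = ["cat", "dog", "bank", "love", "bell"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

cat, dog = one_hot("cat"), one_hot("dog")
print(cat)            # [1, 0, 0, 0, 0]
print(dot(cat, dog))  # 0: two different words are always orthogonal
print(dot(cat, cat))  # 1: a word only ever matches itself
```

Whatever two distinct words you pick, the dot product is zero, so the representation carries no notion of "more similar" or "less similar".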
Another approach, TF-IDF (Term Frequency-Inverse Document Frequency), assigned statistical weights to words based on their frequency in a document relative to the whole corpus. Useful for basic retrieval, but still ignorant of word order and semantic relationships.
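As a rough illustration, here is one common TF-IDF variant (raw term frequency times the log of the inverse document frequency). The three-document corpus and the function name `tf_idf` are assumptions made for this sketch; real libraries use slightly different smoothing.

```python
import math
from collections import Counter

# Tiny invented corpus, already tokenized.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the bank approved the loan".split(),
]

def tf_idf(term, doc, corpus):
    """Term frequency in `doc` times log inverse document frequency over `corpus`."""
    tf = Counter(doc)[term] / len(doc)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

print(tf_idf("the", docs[0], docs))  # 0.0: "the" appears in every document
print(tf_idf("cat", docs[0], docs))  # positive: "cat" appears in only one document

# Word order is invisible: reversing the document changes nothing.
print(tf_idf("cat", docs[0][::-1], docs) == tf_idf("cat", docs[0], docs))  # True
```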
An embedding solves both problems at once. Each word, sentence, or document is mapped to a dense vector of real numbers with a fixed dimensionality, typically 50 to several thousand dimensions. In this space, geometric closeness of vectors corresponds to semantic similarity: "cat" and "dog" lie near each other, and in context-aware models "bank" (financial institution) sits far from "bank" (riverbank).
[Figure: Embedding Vector Space]
Word2Vec: Learning Meaning by Predicting Neighbors
The breakthrough came in 2013 with Word2Vec, published by Tomas Mikolov and his team at Google. Their key insight: instead of defining semantic relationships by hand, let the network learn them by predicting words from context (or context from words).
Word2Vec comes in two architectural variants:
CBOW (Continuous Bag-of-Words): the model predicts a central word from its surrounding context. For the sentence "The peon is ringing the bell" with a window of 2, the model tries to predict "ringing" given [peon, is, the, bell].
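Skip-Gram: the reverse. Given a central word, the model predicts the surrounding context. Formally, it minimizes a loss function defined as the average negative log-likelihood of the context words, where T is the number of words in the corpus, c is the window size, and w_t is the word at position t:

$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p(w_{t+j} \mid w_t)$$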
After training, the prediction task itself is discarded. What matters is the "side effect": the input-to-hidden layer weights form the embedding matrix — and those weights encode semantic knowledge.
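To make the two training setups concrete, the sketch below only generates the training examples each variant would see for the sample sentence; it does not train a network. The function names are hypothetical.

```python
def context_window(tokens, i, window):
    """Tokens within `window` positions of index i, excluding position i itself."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def cbow_examples(tokens, window=2):
    """CBOW: (list of context words, center word to predict)."""
    return [(context_window(tokens, i, window), w) for i, w in enumerate(tokens)]

def skipgram_examples(tokens, window=2):
    """Skip-Gram: (center word, one context word to predict)."""
    return [(w, c)
            for i, w in enumerate(tokens)
            for c in context_window(tokens, i, window)]

tokens = "the peon is ringing the bell".split()

print(cbow_examples(tokens)[3])
# (['peon', 'is', 'the', 'bell'], 'ringing')

print([pair for pair in skipgram_examples(tokens) if pair[0] == "ringing"])
# [('ringing', 'peon'), ('ringing', 'is'), ('ringing', 'the'), ('ringing', 'bell')]
```

CBOW produces one example per position, while Skip-Gram produces one example per (center, context) pair, which is part of why it sees rare words more often during training.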
The best-known demonstration that embeddings capture relationships is vector arithmetic: vector("king") − vector("man") + vector("woman") ≈ vector("queen"). Subtracting the "maleness" direction and adding "femaleness" lands closer to "queen" than to any other word in the vocabulary.
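The arithmetic can be tried on hand-made vectors. Everything below is invented for illustration: the three axes (royalty, gender, fruitiness) and all the values are assumptions, whereas real embeddings learn hundreds of unlabeled dimensions and the analogy only holds approximately.

```python
import math

# Invented 3-d vectors: (royalty, gender, fruitiness).
vectors = {
    "king":  [0.9,  0.8, 0.0],
    "queen": [0.9, -0.8, 0.1],
    "man":   [0.1,  0.9, 0.0],
    "woman": [0.1, -0.9, 0.1],
    "apple": [0.0,  0.0, 1.0],
}

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Word nearest to vector(a) - vector(b) + vector(c), excluding the three inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman"))  # queen
```

Excluding the input words from the candidates mirrors how the analogy test is usually run; without that step the nearest neighbor is often one of the inputs.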
GloVe and FastText: Global Statistics and Morphology
A year later, in 2014, Stanford University introduced GloVe (Global Vectors for Word Representation). Authors Pennington, Socher, and Manning argued that Word2Vec captures local context well (a window of a few words) but misses global co-occurrence statistics across the whole corpus.
GloVe builds a giant co-occurrence matrix — how many times each pair of words appears together across all data — and trains vectors so that their dot product approximates the logarithm of those co-occurrences. The key innovation: it's not the raw frequencies that carry meaning, but the ratio of co-occurrence probabilities. The word "solid" co-occurs often with "ice" but rarely with "steam" — that ratio reveals the ice-steam relationship.
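The ratio idea can be reproduced on a toy scale. The sketch below only builds the co-occurrence counts and the probability ratios; it does not train GloVe vectors. The six-sentence corpus, the window size, and the helper names are all assumptions chosen so the effect is visible.

```python
from collections import defaultdict

# Invented corpus: "solid" mostly appears near "ice", "gas" only near "steam".
corpus = [
    "ice is a cold solid",
    "solid ice floats on lakes",
    "ice is solid water",
    "steam is a hot gas",
    "steam is water as gas",
    "steam is never solid",
]

def cooccurrence(sentences, window=4):
    """Symmetric counts X[w][k]: how often k appears within `window` words of w."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = sentence.split()
        for i, w in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[w][tokens[j]] += 1
    return counts

X = cooccurrence(corpus)

def prob(k, w):
    """P(k | w): the share of w's co-occurrences that involve k."""
    return X[w][k] / sum(X[w].values())

def ratio(k):
    """P(k | ice) / P(k | steam)."""
    return prob(k, "ice") / prob(k, "steam")

for k in ("solid", "gas", "water"):
    print(k, round(ratio(k), 2))
# solid: well above 1, gas: below 1, water: about 1
```

A ratio far from 1 marks a word that discriminates between "ice" and "steam"; a ratio near 1 marks a word that is equally related (or unrelated) to both.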
GloVe and Word2Vec share a common weakness: they fail with words absent from the training vocabulary (Out-Of-Vocabulary / OOV) and ignore morphology. FastText (Facebook AI Research) solves this by breaking words into character n-grams. The word "unrecognizable" is represented as the sum of vectors for its fragments: <un, unr, nre… Encountering a neologism or a misspelled word, FastText can construct a meaningful vector from already-known n-grams. This matters especially for morphologically rich languages.
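The fragmenting step itself is simple. The sketch below extracts character n-grams with the `<` and `>` boundary markers and uses lengths 3 to 6, which are the defaults in the fastText library; the function name `char_ngrams` is hypothetical, and the real implementation additionally hashes the n-grams into a fixed number of buckets.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of the word wrapped in boundary markers '<' and '>'."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(wrapped) - n + 1)]

trigrams = char_ngrams("unrecognizable", n_min=3, n_max=3)
print(trigrams[:3])   # ['<un', 'unr', 'nre']
print(trigrams[-1])   # 'le>'

# A misspelling still shares most of its n-grams with the known word,
# which is why its summed vector lands close to the original.
known = set(char_ngrams("unrecognizable"))
typo = set(char_ngrams("unrecognisable"))
print(len(known & typo), "shared n-grams out of", len(typo))
```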
BERT and Contextual Embeddings: The Bidirectional Revolution
Static models have a fundamental flaw: every word has exactly one vector, regardless of context. The English "bank" in "I went to the bank to deposit money" and "I'll bank on your support" receives the same averaged vector — despite referring to a financial institution in one case and to relying on someone in the other.
The Transformer architecture (2017, "Attention Is All You Need") and the BERT model built on it (Bidirectional Encoder Representations from Transformers) solved this problem. BERT generates vectors dynamically through a self-attention mechanism that considers all other words in the sentence — both to the left and to the right simultaneously (hence "bidirectional"). The representation of "apple" in a sentence about fruit looks nothing like its representation in a sentence about a tech company.
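The core of that mechanism can be sketched in plain Python. This is a heavily simplified version of scaled dot-product self-attention: it sets the queries, keys, and values equal to the inputs, with no learned projection matrices, no multiple heads, and no positional information, all of which BERT has. The 2-d vectors are invented.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (no learned weights)."""
    d = len(X[0])
    out = []
    for q in X:
        # Every position scores every other position, left and right alike.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

# Invented input vectors: axis 0 is "finance", axis 1 is "trust".
bank, money, support = [1.0, 1.0], [2.0, 0.0], [0.0, 2.0]

in_finance = self_attention([bank, money])[0]
in_trust = self_attention([bank, support])[0]
print(in_finance)  # leans toward the "finance" axis
print(in_trust)    # leans toward the "trust" axis
```

The same input vector for "bank" comes out different depending on its neighbors, which is the essence of a contextual embedding.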
The extension Sentence-BERT (SBERT) encodes entire sentences into a single dense vector. This enables fast semantic search: instead of comparing sentences word by word, you compute the distance between their vectors. Hosted APIs, such as OpenAI's text-embedding-3 models, embed documents in the cloud, so an entire corporate document set can be embedded once and then queried by vector comparison alone.
The Mathematics of Similarity: Cosine as a Measure of Meaning
In a multidimensional vector space, we need a metric of "closeness." Two metrics dominate:
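Euclidean distance: the straight-line distance between points. Simple, but sensitive to vector magnitude: two vectors pointing in the same direction can still be far apart if their lengths differ, so document length can skew the result.
Cosine similarity: measures the angle between vectors, ignoring their length:

$$\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$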
A value of 1.0 means identical direction (full similarity), 0 means orthogonality (no relation), and −1 means opposite direction. Cosine similarity is computationally lightweight and widely used in vector databases, where it is the core operation of most RAG engines.
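Both metrics fit in a few lines of standard-library Python. The vectors below are invented to show the magnitude effect: `long_doc` points in exactly the same direction as `short_doc` but is ten times longer.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

short_doc = [1.0, 2.0]
long_doc = [10.0, 20.0]   # same direction, ten times the magnitude
other_doc = [2.0, 1.0]

print(cosine_similarity(short_doc, long_doc))    # about 1.0: same direction
print(euclidean(short_doc, long_doc))            # about 20.1: far apart anyway
print(cosine_similarity(short_doc, other_doc))   # about 0.8
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0: orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0: opposite
```

If the vectors are normalized to unit length first, the two metrics rank neighbors identically, which is why many systems normalize embeddings at indexing time.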
Research has shown that the vector structures of GloVe, Word2Vec, and BERT, despite different training methods, can be mapped onto each other using Procrustes matrix transformations — suggesting that all of them discovered similar underlying linguistic regularities.
The Dark Side of Embeddings: Bias Baked into Vectors
Embeddings learn from human-generated data, and humans carry biases. Classic models reveal disturbing patterns: vector("computer programmer") − vector("man") + vector("woman") ≈ vector("homemaker"). Similarly: father : doctor :: mother : nurse.
These mathematically encoded stereotypes can influence recruitment algorithms, job search engines, and recommendation systems. Intensive work is underway on embedding debiasing — geometrically neutralizing gender-related dimensions for occupations that should be neutral. The problem remains open and ethically important.
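One simple form of geometric neutralization is to estimate a gender direction and subtract each occupation vector's projection onto it. The sketch below assumes a single direction taken from one word pair and uses invented 3-d vectors; published methods estimate the direction from many pairs and add further steps, and this kind of projection is known to remove only part of the bias.

```python
import math

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Scale a vector to length 1."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def neutralize(vec, direction):
    """Remove the component of `vec` that lies along `direction`."""
    g = unit(direction)
    scale = dot(vec, g)
    return [x - scale * gx for x, gx in zip(vec, g)]

# Invented vectors for illustration only.
man, woman = [0.2, 0.9, 0.1], [0.2, -0.9, 0.1]
programmer = [0.8, 0.3, 0.5]

gender_direction = sub(man, woman)
debiased = neutralize(programmer, gender_direction)

print(dot(programmer, unit(gender_direction)))  # about 0.3: leans "male" before
print(dot(debiased, unit(gender_direction)))    # about 0.0 after neutralizing
```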
RAG: How Embeddings Cure LLM Hallucinations
Large Language Models (GPT, Claude, Llama) have two structural weaknesses: they are "frozen in time" (unaware of events after their training cut-off) and they hallucinate — generating convincingly phrased but factually wrong statements. Retrieval-Augmented Generation (RAG) is the architecture that addresses both problems by combining an LLM with a vector-based retrieval engine.
The RAG pipeline operates in six steps:
Chunking: knowledge documents (manuals, policies, regulations) are split into small, self-contained fragments — "chunks."
Embedding: each chunk is transformed by an embedding model (e.g., E5, BGE, SBERT) into a numerical vector.
Vector database: vectors are stored in a specialized database or index (Pinecone, ChromaDB, FAISS), optimized for fast similarity search.
Query embedding: the user's question is embedded using the same model as the documents.
Semantic search: the system retrieves the top-k chunks with the highest cosine similarity to the query vector.
Generation: the LLM receives the question together with the retrieved chunks and formulates an answer grounded in them, which reduces hallucination and makes it possible to cite sources.
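The six steps can be walked through end to end in a toy form. Everything here is a stand-in: `embed` counts words instead of calling a real embedding model (so it captures word overlap, not meaning), the "vector database" is a Python list searched by brute force, chunking is a fixed word window, and no LLM is called. The document text and function names are invented.

```python
import math
from collections import Counter

def chunk(text, max_words=10):
    """Step 1: split a document into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(text):
    """Steps 2 and 4: toy stand-in for an embedding model (lowercased word counts)."""
    cleaned = text.lower().replace(".", "").replace("?", "")
    return Counter(cleaned.split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, store, k=1):
    """Step 5: top-k chunks by cosine similarity to the query vector."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

document = (
    "Refunds are issued within 14 days of a return request. "
    "Support is available on weekdays from nine to five."
)

# Step 3: the "vector database" is just a list of (chunk, vector) pairs here.
store = [(c, embed(c)) for c in chunk(document)]

question = "How many days do refunds take?"
context = retrieve(question, store, k=1)

# Step 6: the prompt an LLM would receive (no model is called in this sketch).
prompt = ("Answer using only this context:\n" + "\n".join(context)
          + "\n\nQuestion: " + question)
print(prompt)
```

Swapping in a real system means replacing `embed` with an embedding model and `store` with a vector index; the shape of the pipeline stays the same.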
Vector databases avoid comparing the query against every stored vector (brute force). They use Approximate Nearest Neighbor (ANN) algorithms, such as HNSW (Hierarchical Navigable Small World), which can search collections of up to billions of vectors in tens of milliseconds at the cost of occasionally missing a true nearest neighbor.
[Figure: RAG Pipeline, How It Works]
Choosing an Embedding Model: The MTEB Benchmark
There is no single embedding model that is best for everything. On Hugging Face, the MTEB (Massive Text Embedding Benchmark) ranks dozens of models evaluated on clustering, classification, retrieval, and semantic similarity tasks.
Top performers (2024–2025) include:
intfloat/multilingual-e5-large-instruct — multilingual, excellent for non-English RAG
Alibaba-NLP/gte-Qwen2-7B-instruct — 7 billion parameters, among the highest-scoring models on the benchmark
jinaai/jina-embeddings-v2-base-code — 137M parameters, 8192-token context window, optimized for code
BAAI/bge-base-en-v1.5 — small, fast, a solid starting point
Practical rule of thumb: start with models under 500 million parameters. Larger models tend to give better quality, but at the cost of slower inference and higher server costs. Domain also matters: a medical model will encode the word "scrubbed" entirely differently from a general-purpose one.
Multimodal Embeddings and the Future of Representation
Embeddings are not limited to text. Vision Transformers (ViT) split images into patches that are treated like tokens and embedded as vectors. Models like CLIP (OpenAI) go a step further and learn a shared text-image space, enabling image search using natural language descriptions.
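The patch-splitting step is the easiest part to show. This sketch assumes a single-channel image stored as a list of rows whose height and width are divisible by the patch size; in an actual ViT each flattened patch is then passed through a learned linear projection and combined with a position embedding, which is omitted here.

```python
def to_patches(image, patch):
    """Split an H x W grid into flattened patch x patch blocks, row by row."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            block = [image[r][c]
                     for r in range(top, top + patch)
                     for c in range(left, left + patch)]
            patches.append(block)
    return patches

# A 4 x 4 single-channel "image" with pixel values 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]

patches = to_patches(image, patch=2)
print(len(patches))  # 4: each patch plays the role of one token
print(patches[0])    # [0, 1, 4, 5]
print(patches[3])    # [10, 11, 14, 15]
```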
In genomics and pharmacology, molecular structures are represented as graphs embedded by Graph Neural Networks (GNN). In biometrics, fingerprint and facial vectors are used for identification. The textual inversion technique in Stable Diffusion allows creating an embedding for a new visual concept — and assigning it a keyword for image generation.
Embedding-Free RAG approaches are also emerging — systems where the LLM itself reasons about search paths without vector databases. They currently work best in narrow specialist domains, but they point toward a future where the boundary between retrieval and reasoning blurs further.
Conclusion
Embeddings are the bridge between the symbolic language of humans and the numerical world of machines. The journey from one-hot encoding through Word2Vec and GloVe to contextual BERTs and multimodal CLIP models is one of the most important trajectories in AI history. Today, vector embeddings power semantic search engines, recommendation systems, machine translation, and RAG architectures that let LLMs answer questions grounded in current, verifiable sources. Their mathematical elegance — a simple angle between vectors as a measure of meaning — has made embedding the cornerstone of all modern NLP and conversational AI.
References
- Mikolov et al. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781
- Mikolov et al. (2013). Distributed Representations of Words and Phrases and their Compositionality. arXiv:1310.4546
- Pennington et al. (2014). GloVe: Global Vectors for Word Representation. EMNLP 2014. nlp.stanford.edu/projects/glove/
- Devlin et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805
- Vaswani et al. (2017). Attention Is All You Need. arXiv:1706.03762
- Lewis et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401
