Researchers at Google DeepMind have developed Gecko, a new method for creating versatile text embedding models. Text embeddings are a crucial tool in natural language processing: they convert unstructured text into meaningful vector representations that capture semantic similarities and relationships between words, sentences, and documents, enabling a wide range of applications such as efficient document retrieval, semantic search, text clustering, sentence similarity scoring, and text classification.
The key challenge in building text embeddings is creating representations that are compact yet highly expressive, and that can generalize well across diverse tasks and domains. Gecko tackles this by leveraging the power of large language models (LLMs) in an innovative way. The result is a state-of-the-art text embedding model that is both highly effective and efficient.
Link to the preprint: https://arxiv.org/abs/2403.20327
How Gecko works
The central idea behind Gecko is distilling the extensive knowledge contained in LLMs into a compact text embedding model.
LLMs, such as GPT-3.5, PaLM, and Chinchilla, are giant neural networks trained on vast amounts of diverse text data. They have been shown to possess broad world knowledge and strong language understanding and generation capabilities.
However, their immense size (often over 100 billion parameters) makes them impractical to use directly in most applications.
Gecko taps into the knowledge of LLMs via a clever two-stage training process:
- Synthetic query-passage generation: In the first stage, an LLM is used to generate a large synthetic dataset called FRet (Few-shot Prompted Retrieval). For each passage drawn from a large text corpus, the LLM generates a relevant task description and a query that the passage could answer. This yields a diverse set of (task, query, passage) triples covering task types such as question answering, fact checking, and text similarity.
- Positive and negative mining: Simply using the original passage as the positive target for each generated query is often suboptimal, because another passage in the corpus may answer the query better. To address this, in the second stage Gecko retrieves the top-N passages most similar to the query using an existing embedding model, then employs the LLM to score each retrieved passage's relevance to the query. The highest-scored passage is selected as the positive target, while a lower-scored one is chosen as a hard negative. This LLM-based relabeling ensures that Gecko learns from the most informative positive and negative examples. Both stages are sketched in the code below.
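To make the pipeline concrete, here is a minimal Python sketch of the two FRet stages under stated assumptions: the LLM calls (`llm_generate`, `llm_relevance`), the pre-trained embedder (`embed_query`), and the prompt wording are illustrative placeholders, not the paper's actual prompts or APIs.

```python
# Minimal sketch of the two-stage FRet pipeline described above.
# The LLM and the existing embedder are passed in as callables; their names,
# signatures, and the prompt text are assumptions for illustration only.

from typing import Callable, Sequence
import numpy as np


def generate_task_and_query(passage: str,
                            llm_generate: Callable[[str], tuple[str, str]]) -> tuple[str, str]:
    """Stage 1: few-shot prompt an LLM to produce a task description and a query
    that the given passage could answer."""
    prompt = ("Given a passage, write a retrieval task description and a query "
              f"answered by the passage.\nPassage: {passage}")
    return llm_generate(prompt)  # hypothetical LLM call returning (task, query)


def mine_positive_and_negative(query: str,
                               original_passage: str,
                               corpus: Sequence[str],
                               corpus_embeddings: np.ndarray,  # (N, d) from an existing embedder
                               embed_query: Callable[[str], np.ndarray],
                               llm_relevance: Callable[[str, str], float],
                               top_n: int = 20) -> tuple[str, str]:
    """Stage 2: retrieve neighbours with an existing embedding model, rescore them
    with the LLM, and pick a (possibly better) positive plus a hard negative."""
    q = embed_query(query)
    neighbour_ids = np.argsort(-(corpus_embeddings @ q))[:top_n]
    candidates = [corpus[i] for i in neighbour_ids] + [original_passage]

    # The LLM judges how well each candidate answers the query.
    scores = [llm_relevance(query, p) for p in candidates]

    positive = candidates[int(np.argmax(scores))]        # may differ from the original passage
    hard_negative = candidates[int(np.argmin(scores))]   # a low-scored neighbour
    return positive, hard_negative
```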
The generated FRet dataset is combined with existing human-annotated datasets spanning tasks like semantic textual similarity, natural language inference, and text classification. Gecko is trained on this mixture of synthetic and real data with a contrastive learning objective: for each query, the model learns to place its embedding close to the positive passage and far from the negative passage.
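Here is a minimal sketch of a contrastive loss of this kind, assuming in-batch negatives plus one mined hard negative per query and a cosine-similarity setup with a temperature; the paper's exact loss, temperature, and batching may differ.

```python
# A minimal contrastive training objective over (query, positive, hard negative)
# triples with in-batch negatives; an illustration of the general recipe, not the
# paper's exact formulation.

import torch
import torch.nn.functional as F


def contrastive_loss(q_emb: torch.Tensor,    # (B, d) query embeddings
                     pos_emb: torch.Tensor,  # (B, d) positive passage embeddings
                     neg_emb: torch.Tensor,  # (B, d) hard-negative passage embeddings
                     temperature: float = 0.05) -> torch.Tensor:
    q = F.normalize(q_emb, dim=-1)
    cands = F.normalize(torch.cat([pos_emb, neg_emb], dim=0), dim=-1)  # (2B, d)

    # Row i scores query i against every positive and negative in the batch.
    logits = q @ cands.T / temperature            # (B, 2B)
    labels = torch.arange(q.size(0), device=q.device)  # column i is query i's positive
    return F.cross_entropy(logits, labels)
```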
The results
Gecko delivers remarkable performance on the MTEB (Massive Text Embedding Benchmark), which is a comprehensive evaluation suite covering 7 categories of tasks across 56 individual datasets. Some key highlights:
- Gecko with 768-dimensional embeddings and 1.2B parameters achieved an outstanding average score of 66.31, surpassing all models of similar size (<5B parameters) and dimensionality (256–768).
- It matched or even beat models that are much larger in size (7B+ parameters) and/or use embeddings that are 3–4x higher in dimensionality (up to 4096).
- Gecko particularly excelled on core tasks like text classification (81.17 average), semantic textual similarity (85.06 average), and summarization (32.63). It also performed competitively on challenging retrieval tasks (55.70 average).
- Surprisingly, a Gecko model trained solely on the synthetic FRet data, without any exposure to MTEB datasets, still demonstrated strong zero-shot generalization. It outperformed several competitive baselines, validating the quality of the LLM-generated data.
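For context, this is roughly how any embedding model that exposes an `encode()` method can be scored on MTEB using the open-source `mteb` package; the model and tasks below are arbitrary stand-ins for illustration, not Gecko itself.

```python
# Minimal sketch of running MTEB tasks on an embedding model with the `mteb` package.
# The stand-in model here is a small sentence-transformers encoder, not Gecko.

from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any object with an encode() method works

evaluation = MTEB(tasks=["Banking77Classification", "STSBenchmark"])
results = evaluation.run(model, output_folder="results")
```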
Beyond the raw numbers, Gecko exhibited versatility across a spectrum of NLP tasks.
On the classification front, it handled sentiment analysis on reviews and tweets, identified toxic conversations, recognized intents and emotions, and categorized content by topic.
For semantic similarity, it accurately scored sentence pairs on literal and pragmatic similarity across news, social media, web forums, and academic domains.
In retrieval and question-answering settings, it surfaced relevant passages for diverse query types spanning lookup questions, entity queries, fact-checking claims, and abstract questions requiring reasoning.
Gecko’s strong summarization performance demonstrated its ability to capture key information and overarching themes.
Analysis
Extensive ablations revealed key insights into what makes Gecko so powerful:
- LLM-based relabeling to surface the optimal positive and negative passages for each query was vital. It significantly outperformed the naive approach of using the original passage associated with the query. This showcases the value of tapping into an LLM’s global knowledge to find the most informative training examples.
- Generating a diverse set of tasks and queries was crucial. Gecko performed best when trained on a balanced mix spanning question answering, search, fact-checking, and semantic similarity. Models trained on a narrow task distribution were less robust.
- Careful formatting of the training data had an outsized impact. Prepending task descriptors to queries and standardizing the text formats helped the model distinguish between tasks, while symmetrizing the inputs for semantic similarity and mining hard negatives further boosted performance. A small illustration of this formatting follows below.
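As an illustration of the formatting idea from the list above, the template strings below are assumptions for illustration, not the paper's verbatim prompts.

```python
# Illustrative query-side formatting with task descriptors; the exact template
# is an assumption, not the paper's actual format.

def format_query(task_description: str, query: str) -> str:
    """Prepend the task descriptor so the encoder can tell tasks apart."""
    return f"task: {task_description} | query: {query}"

# Symmetric tasks like semantic textual similarity format both sides identically.
sts_a = format_query("sentence similarity", "A man is playing a guitar.")
sts_b = format_query("sentence similarity", "Someone strums a guitar on stage.")

# Retrieval-style example: the query carries a task descriptor, passages are encoded as-is.
search_q = format_query("search result", "who wrote the origin of species")
```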
The multilingual capabilities of Gecko were also noteworthy. By training on English FRet data augmented with multilingual academic datasets, Gecko demonstrated strong cross-lingual transfer. It achieved state-of-the-art results on the MIRACL benchmark covering retrieval in 18 typologically diverse languages. This suggests that synthetic data from large English LLMs can benefit multilingual models.
Conclusion
Gecko sets a new standard for compact, efficient, and versatile text embeddings. Its innovative training procedure, centered around distilling knowledge from LLMs, yields highly effective representations that excel across diverse tasks and domains. By generating purposeful synthetic data and employing LLMs to strategically select positive and negative examples, Gecko packs a powerful punch in a small package.
Gecko embeddings could power a new wave of intelligent language applications that are both highly capable and practical to deploy. From semantic search engines and document clustering tools to content recommendation systems and AI writing assistants, Gecko enables NLP models that marry broad knowledge with computational efficiency.
The techniques pioneered in Gecko also light the path for future advances. The idea of using LLMs to generate high-quality task-specific synthetic data could be extended to other modalities like speech and vision. Gecko’s relabeling strategy points to the potential of LLMs as general-purpose tools for curating and aligning training data. As LLMs continue to grow in size and capabilities, clever distillation methods like Gecko will become increasingly important for realizing their benefits in real-world applications.