Haystack 2025 Takeaways
Haystack 2025 highlighted the future of search with advances in vector search infrastructure, quantization, re-ranking strategies, and user behavior insights. Learn how modern search engines are scaling relevance and performance with cutting-edge techniques.

There were two prevailing themes at Haystack this year: vector search infrastructure and relevance. As an industry, we've recognized that vector search removes much of the intense work that was previously needed to build search; synonym graphs and ontologies have given way to fine-tuned embedding models. However, because these systems are so new, there is still a lot of engineering required to make vector search stable and relevant.
Vector Search Infrastructure:
Everyone at the conference acknowledged the challenges of deploying vector search to production. Quantization is going to be a necessary technology to adopt if we want to deploy vector search at scale. We need to be careful, however: our lack of quality relevance metrics makes it easy to deploy quantization and tank relevance without noticing. For Elastic, here were the quantization types and their profiles (a configuration sketch follows the list):
- Binary Quantization - 40x faster and 32x less memory, but at the cost of potentially tanking relevance.
- Scalar Quantization (int4/int8) - 4x less memory with minimal precision loss.
- Product Quantization (PQ) - up to 64x less memory, but requires much more data engineering. Of these options, it achieves the least precision loss.
- Better Binary Quantization (BBQ) - Binary quantization can have the highest compression (aside from PQ) but is typically very lossy. BBQ addresses this by normalizing vectors around a centroid, which lets it perform faster than PQ without PQ's infrastructure and engineering requirements. The caveat: while BBQ does have higher recall, Elastic noted its precision is better when you re-rank against the original vectors after retrieval.
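
To make this concrete, here's a minimal sketch of enabling quantization on an Elasticsearch dense_vector field via the Python client. The index name, field names, and 384-dimension embedding are illustrative assumptions, and it presumes a local 8.x cluster that supports int8_hnsw (and bbq_hnsw on newer versions):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# int8_hnsw stores scalar-quantized vectors in the HNSW graph (~4x less
# memory); swap in "bbq_hnsw" to try Better Binary Quantization instead.
es.indices.create(
    index="docs",  # illustrative index/field names
    mappings={
        "properties": {
            "title": {"type": "text"},
            "embedding": {
                "type": "dense_vector",
                "dims": 384,
                "index": True,
                "similarity": "cosine",
                "index_options": {"type": "int8_hnsw"},
            },
        }
    },
)
```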

Relevance:
Two primary patterns were discussed in depth when it comes to relevance: re-ranking and precomputed enrichment.
Re-ranking:
While this pattern has been around for a while, it's gaining focus again. With larger and larger vector data stores, we need to rethink search so it can meet our performance goals, and this is where re-ranking comes into play. It starts with a much more broadly scoped query that retrieves a list of documents that are "relevant enough": the retrieval phase is meant to capture every document that may be relevant, at the cost of including many that are not.
After retrieval comes the re-ranking step, which takes the initial document set and adds precision using a more costly ranking algorithm. Some common re-ranking strategies include:
- Exact kNN - Once you have the first ~1000 or so documents, you can run exact (brute-force) kNN over just that candidate set to find a better sort order.
- Cross Encoders - A cross encoder rescores each result by passing the query and the document text through the same model, producing a more accurate relevance score. These are expensive to run, which is why we reserve them for the re-ranking step (see the sketch after this list).
- Personalization - Here we could use a two-tower style approach to further refine the results toward what the current user expects.
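
As an illustration of the cross encoder strategy, here's a minimal re-ranking sketch using the sentence-transformers CrossEncoder API; the model choice and the shape of the candidate documents are assumptions:

```python
from sentence_transformers import CrossEncoder

# Assumes a first-pass retriever already returned ~1000 candidates;
# we only rescore that slice to keep latency bounded.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[dict], top_k: int = 10) -> list[dict]:
    # The cross encoder sees query and document together, producing a
    # more precise relevance score than a bi-encoder similarity.
    scores = reranker.predict([(query, c["text"]) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```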
Enrichment:
Data enrichment was one of the more bleeding-edge practices beginning to gain traction. It involves adding value by aggregating or augmenting documents to make them easier to retrieve or to utilize in a generative workflow. A few of the particularly interesting use cases (the first is sketched after the list):
- Generate questions a user might ask to retrieve a document. This can aid retrieval, particularly for very long documents.
- Summary generation for LLMs to utilize.
- Classification for better retrieval (e.g., whether an issue has a solution or not).
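
Here's a sketch of that first use case: generating questions at index time. It uses the OpenAI Python SDK purely for illustration (any LLM provider works), and the model name and field names are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def enrich_document(doc: dict) -> dict:
    """Precompute retrieval aids at index time rather than query time."""
    prompt = (
        "Write 3 short questions a user might ask that this document answers:\n\n"
        + doc["text"][:4000]
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    # Index the generated questions alongside the document so short,
    # question-shaped queries can still match very long documents.
    doc["generated_questions"] = resp.choices[0].message.content
    return doc
```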

Bleeding Edge:
There were three things I saw at the conference that I feel are on the bleeding edge for search. First, and most accessible, was an ML-based boosting method for weighting hybrid queries. The approach, presented in a talk by Daniel Wrigley, extracts features from the query (number of tokens, average token length, etc.) to generate an optimal set of weights, determining whether to bias toward semantic or lexical retrieval. A sample of this workflow can be found in the "learning-to-hybrid-search" repo.
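
The sketch below is not the repo's implementation, just a minimal illustration of the idea: extract cheap query features, learn the lexical/semantic weight from queries labeled with whatever weight maximized an offline metric, and blend the two scores. The training data and feature set are stand-ins:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def query_features(query: str) -> list[float]:
    """Cheap shape features about the query, as described in the talk."""
    tokens = query.split()
    return [len(tokens), sum(len(t) for t in tokens) / len(tokens)]

# Illustrative training data: each query labeled with the lexical weight
# that maximized an offline relevance metric (e.g. NDCG) for that query.
queries = ["error code 0x80070057", "how do I reset my router"]
best_lexical_weight = [0.9, 0.2]  # keyword-ish vs. natural-language

model = GradientBoostingRegressor().fit(
    [query_features(q) for q in queries], best_lexical_weight
)

def hybrid_score(query: str, bm25_score: float, vector_score: float) -> float:
    # Predicted alpha biases keyword-ish queries lexical, questions semantic.
    alpha = float(np.clip(model.predict([query_features(query)])[0], 0.0, 1.0))
    return alpha * bm25_score + (1.0 - alpha) * vector_score
```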
In their talk, the Vespa team introduced a concept called mapped tensor boosting. In this method, boosts are applied to known tensor values, allowing for a more explainable search experience; an example would be boosting cars that have "hatchback" as a tensor value. While this style of boosting is actually old news at larger search engines like Yahoo and Google, it has yet to gain mainstream popularity.
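
Vespa expresses this with mapped tensors inside rank expressions; the plain-Python sketch below only shows the underlying arithmetic, with attribute names invented for illustration:

```python
# Conceptual sketch of mapped tensor boosting (not Vespa's rank expression
# syntax): documents carry a sparse mapped tensor of attribute -> weight,
# and the query supplies boosts for known attribute values.

def mapped_tensor_boost(doc_tensor: dict[str, float],
                        query_boosts: dict[str, float]) -> float:
    # Sparse dot product: only attributes present in both contribute,
    # which is what makes the final score easy to explain.
    return sum(w * query_boosts.get(k, 0.0) for k, w in doc_tensor.items())

doc = {"body_style:hatchback": 1.0, "color:red": 1.0}
boosts = {"body_style:hatchback": 2.5}  # boost hatchbacks for this query
relevance_bonus = mapped_tensor_boost(doc, boosts)  # 2.5
```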
Finally, user behavior insights (UBI) is the last bit of bleeding-edge technology. Clickstream logging has been around for a long time, but it has lacked a standard. With one in place, clickstream-based workflows can begin to be standardized across tools.
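
To make "a standard" concrete, here's an illustrative UBI-style click event. The field names follow my reading of the OpenSearch UBI schema and should be checked against the current spec:

```python
import json
import time
import uuid

# Illustrative UBI-style click event; treat the field names as assumptions.
event = {
    "action_name": "click",
    "query_id": str(uuid.uuid4()),  # ties the click back to the query that produced it
    "client_id": "user-123",
    "timestamp": int(time.time() * 1000),
    "event_attributes": {
        "object": {"object_id": "doc-42"},  # which result was clicked
        "position": {"ordinal": 3},         # where it sat in the result list
    },
}
print(json.dumps(event))  # ship to your event store / UBI index
```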