How to Best Fit Filtering into Vector Similarity Search

April 18, 2022

Learn about three types of attribute filtering in vector similarity search and explore how to improve the efficiency and accuracy of similarity search.
Attribute filtering, or simply "filtering," is a basic function that users of vector databases expect. Yet this seemingly simple function is surprisingly complex to implement well.

Suppose Steve sees a photograph of a fashion blogger on a social media platform and wants to find a similar jean jacket on an online shopping platform that supports image similarity search. After he uploads the image, the platform shows him a plethora of similar jean jackets. However, Steve only wears Levi’s, so the results of the image similarity search need to be filtered by brand. The question is when to apply the filter: before or after the approximate nearest neighbor search (ANNS)?

This article examines the pros and cons of three common attribute filtering mechanisms in vector databases, then looks at the integrated filtering solution offered by Milvus, an open-source vector database. It also provides some suggestions for optimizing filtering.

Generally, there are three types of attribute filtering: post-query, in-query, and pre-query filtering. Each type has its own pros and cons.

As its name suggests, post-query filtering applies filter conditions to the TopK results obtained from a query. For instance, in the case mentioned at the beginning, the system first searches for the most similar jean jackets in its inventory, and the results are then filtered by their brand metadata. However, one inevitable shortcoming of this strategy is that the number of results whose metadata satisfies the condition is highly unpredictable, so in some cases we cannot get as many results as we wanted. If we ask for the TopK (K=10) results, the filter eliminates every vector among them whose metadata does not meet the requirement, and we end up with fewer hits than intended.
In some worst-case scenarios, we will get no results at all after applying post-query filtering. What if we increase the number of returned query results? We certainly can get enough results even after applying the filter condition.
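
To make the behavior of post-query filtering concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how Milvus or any particular vector database implements filtering: an exact brute-force L2 search stands in for ANNS, the vectors and brand labels are randomly generated, and the names `top_k`, `post_query_filter`, and the `fetch` parameter are hypothetical.

```python
import math
import random


def top_k(query, vectors, k):
    """Return indices of the k vectors closest to `query`.

    Exact L2 search, used here only as a stand-in for ANNS.
    """
    ranked = sorted((math.dist(query, v), i) for i, v in enumerate(vectors))
    return [i for _, i in ranked[:k]]


def post_query_filter(query, vectors, brands, wanted_brand, k, fetch=None):
    """Search first, then drop hits whose brand metadata fails the filter.

    `fetch` is the number of candidates retrieved before filtering;
    it defaults to k, which is the plain post-query strategy.
    """
    fetch = k if fetch is None else fetch
    candidates = top_k(query, vectors, fetch)
    hits = [i for i in candidates if brands[i] == wanted_brand]
    return hits[:k]


# Synthetic inventory: 1,000 random 8-dimensional vectors with random brands.
random.seed(0)
vectors = [[random.random() for _ in range(8)] for _ in range(1000)]
brands = [random.choice(["Levi's", "Wrangler", "Lee", "Gap", "Other"])
          for _ in vectors]
query = [random.random() for _ in range(8)]

# Filter only the top 10: the number of surviving hits is unpredictable.
plain = post_query_filter(query, vectors, brands, "Levi's", k=10)

# Retrieve 200 candidates first, then filter and keep at most 10.
widened = post_query_filter(query, vectors, brands, "Levi's", k=10, fetch=200)

print(len(plain), len(widened))
```

With `fetch` left at its default, the function can return anywhere from zero to ten hits depending on how many of the nearest neighbors happen to carry the wanted brand. Raising `fetch` makes a full set of ten more likely, at the price of searching for and then discarding many more candidates.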