Monday, September 16, 2024

BigQuery Vector Search

Vector search in Google's BigQuery has recently become generally available (GA). This is a good opportunity to take a look at vector search and its possibilities.

Editorial note: This article was first published in German. The image is not translated.

The word vector comes from the Latin vector, which can be translated as carrier or conveyor. Vectors appear in various mathematical and physical subfields. Vectors in geometry are very similar to vectors in physics, but not identical. (n-)Tuples, for example, are also sometimes referred to as vectors. What they all have in common is that addition and scalar multiplication (linear algebra) can be applied to them in a vector or tuple space; that is why they are called vectors. Physical and many geometric vectors have a magnitude and a direction and are therefore often represented as an arrow. They stand, for example, for a physical quantity such as velocity. Vectors in the sense of tuples are far more generic and can represent a wide range of information. And while vectors in geometry and physics are usually at home in spaces of 2 and 3 dimensions, tuple vectors often form high-dimensional spaces, which makes their visual representation correspondingly demanding.

Vector calculus was established by the German mathematician Hermann Günther Grassmann in 1844. 180 years later, BigQuery offers vector-based search.


Vector search in BigQuery – a platform with building blocks


The term “vector search” may suggest an overly simplistic idea. It is not as if you simply write VECTOR “com” instead of LIKE “%com%” in BigQuery and BigQuery then performs a text-based vector search. Rather, vector search in BigQuery is a platform with which a vector search can be built. A vector search consists roughly of two components: the vectors themselves and the distance algorithm.

So, unlike with geometric vectors, for example, the aim is not to move them in space but to calculate the distance between vectors. For a search in the context of BigQuery to deliver meaningful results, the distance between the vectors must reflect their semantic similarity. After all, the aim of the search is to find similar, semantically related texts, images, videos, and so on.

2 variants in BigQuery: with and without a model

As mentioned, vector search in BigQuery is a construction kit. The most minimal variant from BigQuery's perspective is to store the vectors in a table and pass a vector directly as an Array<Float64> with the query. BigQuery then calculates the desired nearest candidates using the configured distance algorithm. Google has already done the first part in the public patent dataset, for example, where there is a new column “embeddings_v1”:

Google's embeddings in the patent dataset


We will come to the topic of embeddings in a moment. To let BigQuery know how you want to calculate the distance, you have to define which distance algorithm you want to use to calculate the vector distance. BigQuery currently offers the following algorithms:

  • Euclidean distance
  • Cosine similarity
  • Dot product (scalar product)
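For two vectors a, b in R^n, these three measures can be written as follows (note that, as far as I understand the documentation, BigQuery's COSINE distance type is the distance 1 − cos rather than the similarity itself):

```latex
d_{\text{euclid}}(a,b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2},
\qquad
\cos(a,b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert},
\qquad
a \cdot b = \sum_{i=1}^{n} a_i b_i
```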

You can also define additional parameters such as top_k (the number of neighbors you want to get back).
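A minimal query of this kind can be sketched roughly as follows. The syntax follows BigQuery's VECTOR_SEARCH table function; the dataset, table and column names are purely illustrative:

```sql
-- Find the 5 nearest neighbors of a literal query vector.
-- `mydataset.items` and its `embedding` column are hypothetical.
SELECT
  base.id,
  base.content,
  distance
FROM VECTOR_SEARCH(
  TABLE mydataset.items,                       -- table holding the stored vectors
  'embedding',                                 -- the ARRAY<FLOAT64> column to search
  (SELECT [0.12, -0.45, 0.88] AS embedding),   -- the query vector, passed directly
  top_k => 5,                                  -- number of neighbors to return
  distance_type => 'COSINE'                    -- or 'EUCLIDEAN' / 'DOT_PRODUCT'
);
```

The query vector here is a literal; in practice it would of course have as many dimensions as the stored embeddings.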

For large amounts of data, it is worth additionally indexing the stored vectors. BigQuery offers special vector indices for this. This significantly increases query performance.
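Creating such an index can be sketched like this; the index and table names are again illustrative, and IVF is one of the index types BigQuery offers:

```sql
-- Create an IVF vector index on the stored embeddings.
-- `my_vector_index` and `mydataset.items` are hypothetical names.
CREATE VECTOR INDEX my_vector_index
ON mydataset.items(embedding)
OPTIONS (
  index_type = 'IVF',
  distance_type = 'COSINE'   -- should match the distance used at query time
);
```

Once the index exists, VECTOR_SEARCH can use it automatically for approximate nearest-neighbor lookups.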


Embeddings and the models

Now we come to the somewhat more difficult part. How do I get the vectors and how do the semantics get into these vectors?

So far, we have simply assumed that you have stored the vectors in BigQuery and specify a vector when making a query. This raises the question of how you obtain the vectors to store, and which models are available to you.

Google provides you with over 20 different models in BigQuery:

The CREATE MODEL statement | BigQuery | Google Cloud

You can also use remote models. As far as I have seen, Vertex AI models are currently available as remote models. You define the desired model in Vertex AI, train it if necessary, and can then use it directly in BigQuery. With Vertex AI in particular, you essentially have two options: you can calculate the vectors yourself directly in Vertex AI and then save them in BigQuery, which requires writing some program code and is the most minimal variant from BigQuery's point of view. Or you can integrate the models into BigQuery as remote models and then use the model directly in BigQuery's SQL dialect.
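The remote-model route can be sketched as follows. The connection, dataset and table names are illustrative; the endpoint named here is one of Google's multilingual embedding models:

```sql
-- Register a Vertex AI embedding model as a remote model in BigQuery.
-- `us.my_connection`, `mydataset` and `items` are hypothetical names.
CREATE OR REPLACE MODEL mydataset.embedding_model
  REMOTE WITH CONNECTION `us.my_connection`
  OPTIONS (ENDPOINT = 'text-multilingual-embedding-002');

-- Compute embeddings for stored texts directly in SQL.
SELECT *
FROM ML.GENERATE_EMBEDDING(
  MODEL mydataset.embedding_model,
  (SELECT description AS content FROM mydataset.items),
  STRUCT(TRUE AS flatten_json_output)
);
```

The resulting embedding arrays can then be written to a table and searched with VECTOR_SEARCH as shown earlier.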

Which brings us back to the question of what embeddings are and how semantics get into the vector.

It is not the case that simply placing words or letters as numbers in a multidimensional vector space automatically enables a semantic search. Rather, vectorizing semantic information helps the search because the algorithms of linear algebra, with all their efficiency, can then be applied. Conversely, such a naive encoding does not automatically make vectors more semantically similar the closer they are to each other; the semantics have to be put into the vectors deliberately.

Before we look at how semantics are created based on textual information, it must be mentioned for the sake of completeness that the vector search is not only suitable for texts. It can also be used to search images, audio and video files. The way in which the semantics enter the vector in such cases is correspondingly different from that of texts. The specific procedure depends on the purpose. Do you want to find similar music, detect anomalies in sounds (e.g. for quality assurance) or have speech recognition (only music with Italian lyrics)? Not every method is equally well suited for every purpose.


To put it simply, an embedding is the semantic positioning of a word or text in a defined vector space. An embedding is thus actually nothing more than a specific vector: a vector that represents the semantics (of texts). There are various methods and procedures for calculating embeddings, such as Word2Vec, GloVe or FastText. In addition, there are transformer-based methods. I wrote about transformers in this post here. I can only speculate, but I assume that Google uses a transformer-based solution in the Gemini embedding engine (link). Google currently offers 6 different embedding models, 4 English-language and 2 multilingual. The multilingual ones are:

  • textembedding-gecko-multilingual@001
  • text-multilingual-embedding-002

For vector search of texts in BigQuery, the embeddings from Gemini are therefore probably best suited as a vector. The embeddings from Gemini can also be used directly as a remote model in BigQuery.
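Putting the pieces together, an end-to-end text search could then be sketched roughly like this, assuming the hypothetical remote model and table from above; the output column name follows ML.GENERATE_EMBEDDING with flattened output:

```sql
-- Embed the search text with the remote model, then search the stored vectors.
-- `mydataset.items` and `mydataset.embedding_model` are hypothetical.
SELECT base.id, base.content, distance
FROM VECTOR_SEARCH(
  TABLE mydataset.items,
  'embedding',
  (
    SELECT ml_generate_embedding_result AS embedding
    FROM ML.GENERATE_EMBEDDING(
      MODEL mydataset.embedding_model,
      (SELECT 'sample search text' AS content),
      STRUCT(TRUE AS flatten_json_output)
    )
  ),
  top_k => 10,
  distance_type => 'COSINE'
);
```

A single statement thus covers both the vectorization of the query and the nearest-neighbor search.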

Excursus on arrays: the representation of embeddings in BigQuery as Array<Float64> is nothing other than the coordinate (column-matrix) representation of vectors.


Limits of similarity

Ultimately, embeddings are only as good as the training data and their appropriate use. For example, in the context of vehicles and escape, the names Bonnie & Clyde are close to each other. If you just want to know whether Bonnie is similar to Clyde, this is of course not very helpful. As an aside, names in general are extremely difficult for meaningful embeddings, since they are either strongly context-related (it is about this specific name right now) or not relevant at all. In addition, many names do not occur frequently in training data. Searching for similarities between names using embeddings is therefore unlikely to deliver sufficiently robust results. Fortunately, there are specialized software solutions for this (such as ESC). For many texts, however, embeddings are very well suited to determining semantic similarity.

Google's BigQuery Vector Search documentation can be found here, among other places.


What does this capability mean from an IT architecture perspective?

A new search function is nice, but what does it mean from an IT architecture perspective? It allows you to simplify the architecture. Because the function can be implemented directly in BigQuery, you don't need index pipelines to load and vectorize the data. Nor do you need to write your own query or search logic that you then have to enrich with data from BigQuery; a single query does it. This reduces the number of components, services and systems and thus simplifies your architecture overall. You can also reuse existing SQL/BigQuery know-how.


There is a slight risk, however, that your application will become a monolith. Try to modularize within BigQuery as well and work with interfaces for data models to reduce complexity. Make it possible to develop and customize the rest of the application independently of the vector search. And, of course, the other way around, too: you should be able to improve the vector search independently of the rest in an evolutionary way.


This is how you can get the most out of the integrated vector search.


Out of the box? No.

At the risk of repeating myself: vector search in BigQuery is a powerful construction kit, not a ready-made out-of-the-box function. To implement a powerful vector search in BigQuery, you need machine-learning expertise. You need to understand the different models and their parameters. For evaluation and quality assurance, you need to be able to turn the similarity search into a classification so that established evaluation metrics such as F1 can be applied, and for that you need appropriate (classified) test data. Vector search in BigQuery is only worthwhile if you can use it to generate real business value. Google gives you computing power and a toolbox; to use it effectively in your own application, you have to roll up your sleeves and get to work.
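Such an evaluation can even be sketched in SQL itself. Assuming a hypothetical table of labeled search results with a ground-truth column is_relevant and a column is_returned indicating whether the vector search returned the row, precision, recall and F1 can be computed like this:

```sql
-- Compute precision, recall and F1 from labeled search results.
-- `mydataset.results` with `is_relevant` / `is_returned` is hypothetical.
WITH counts AS (
  SELECT
    COUNTIF(is_returned AND is_relevant)     AS tp,  -- true positives
    COUNTIF(is_returned AND NOT is_relevant) AS fp,  -- false positives
    COUNTIF(NOT is_returned AND is_relevant) AS fn   -- false negatives
  FROM mydataset.results
)
SELECT
  SAFE_DIVIDE(tp, tp + fp)              AS precision,
  SAFE_DIVIDE(tp, tp + fn)              AS recall,
  SAFE_DIVIDE(2 * tp, 2 * tp + fp + fn) AS f1
FROM counts;
```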
