On a forum recently, a developer was wrestling with a chat system that couldn’t home in on the nature of user questions. I repeatedly asked if she was using embeddings to first classify the nature of the query so that the prompts could be shaped as one-shot learners. Her frustration…
… the only thing I have seen embedding is used for is to do similarity searches
Exactly! That’s the superpower of embeddings - similarity. I advised her to embrace this term without the “search” predicate.
My response:
I know you are buried with tasks, but I’ll say it one more time - many of the answers to these questions are found in embeddings. Once you are thoroughly immersed in the capabilities that embeddings offer, you will see these questions in a very different light.
Imagine a chat conversation that mentions a data “value” type (i.e., a column in a database). Such a mention is not precisely the field name itself - it is an intimation that it is a field of data your system possesses. Instead of assuming that embeddings are limited to identifying similarities in data values, imagine using them in a more localized fashion to understand aggregation requirements within natural language conversations. For example, the user has mentioned an interest in an analytic -
… what about my schema pertains to this?
If you have a vector mapping of your schema, you can compare it with the vector for a query to understand which part of the schema the person is describing.
This is one example of using embeddings to understand the interest of a human as it pertains to the data your application holds and that the human may need. The popular understanding of how embeddings are employed is narrow and lacks innovation.
Imagine how embeddings might be used to …
Determine if an inference is more likely to be a hallucination as opposed to a factual response.
Classify a query as being about (x) data vs (y) data.
Understand when humans are asking about things that have already been asked and which exist in the cache.
Reject certain queries to avoid prompt injections.
Continually reward a chained process for adjusting its compass heading as it seeks to complete a goal.
Unlike text and chat completions, embedding vectors are a tiny fraction of the cost in terms of OpenAI inference fees, and they are also extremely fast to procure. This makes them an attractive option for developers who want to leverage the power of embeddings without breaking the bank.
Understanding Embeddings
I wrote about embeddings a while back; a key takeaway from that episode:
Embeddings are a fundamental element of AGI. They take advantage of billions of parameters already computed by OpenAI.
Embeddings are central to AI solutions - metaphorically, they are to AI what yeast is to bread. They are defined as a mathematical representation of text, images, or any other data that can be mapped to a vector space. The resulting vectors are dense, continuous, and relatively low-dimensional, which makes them ideal for machine-learning tasks like classification and clustering.
The superpower of embeddings is their ability to capture the semantic similarity between words or phrases in an instant. This means that embeddings can be used to find similar items in a dataset, even if they are not explicitly labeled as such. For example, embeddings can be used to find similar products, even if they are not in the same category.
Scenario: A user has mentioned an interest in an analytic - what about my schema pertains to this interest?
One of the most innovative uses of embeddings is to understand the interest of a human as it pertains to the data you have that the human may need. Imagine a chat conversation that mentions a data “value” type. Such a mention is not precisely the field name itself - it is an intimation that it is a field of data your system possesses. Instead of assuming that embeddings are limited to identifying data similarities, they can also be used to understand aggregation requirements within natural language conversations, as sketched below.
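Here’s a minimal sketch of that idea in Google Apps Script. The getEmbedding_(text) helper is a stand-in for any call that returns a numeric vector (e.g., a thin wrapper around the palmCreateEmbedding_ function shown at the end of this post), dot(a, b) is an ordinary dot product, and the schema descriptions are invented for illustration:

// sketch: match a conversational query to the schema field it intimates
// getEmbedding_ and dot are assumed helpers, not a fixed API
function findSchemaMatch(userQuery)
{
  // one short natural-language description per field in the schema
  var fields = {
    "order_total"   : "the dollar amount of a customer order",
    "order_date"    : "the date an order was placed",
    "customer_name" : "the full name of the customer"
  };
  var queryVec  = getEmbedding_(userQuery);
  var bestField = null;
  var bestScore = -1;
  for (var field in fields) {
    // score the query against each field description
    var score = dot(queryVec, getEmbedding_(fields[field]));
    if (score > bestScore) {
      bestScore = score;
      bestField = field;
    }
  }
  return bestField; // the field the user is most likely describing
}

A query like “what did we bill that customer last week?” should land closest to order_total even though the column name never appears in the conversation.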
Innovative Use Cases of Embeddings
Determine if an inference is more likely to be a hallucination as opposed to a more factual one.
Embeddings can be used to determine whether an inference is more likely to be a hallucination or more factual. Because embeddings capture the underlying semantics of the data, an inference that is semantically distant from every source it should be grounded in is a signal that something is not quite right.
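A sketch of that heuristic, using the same assumed helpers - this is a signal, not a proven detector:

// sketch: flag an inference that no grounding passage supports
function looksLikeHallucination(inference, sourcePassages, threshold)
{
  var iv   = getEmbedding_(inference);
  var best = -1;
  for (var i = 0; i < sourcePassages.length; i++) {
    // track the closest semantic match among the sources
    best = Math.max(best, dot(iv, getEmbedding_(sourcePassages[i])));
  }
  // if even the best match falls below the threshold, treat the claim as suspect
  return best < threshold;
}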
Classify a query as being about (x) data vs. (y) data.
Embeddings can be used to classify a query as being about (x) data vs. (y) data. This is useful when dealing with large datasets, where it can be difficult to keep track of which data a question refers to. By comparing a query’s embedding with embeddings that represent each category, queries can be classified automatically, making it easier to find the information you need.
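A sketch with one exemplar per category; in practice you might average several examples into a centroid for each category:

// sketch: route a query to whichever category exemplar it sits closest to
function classifyQuery(query, exampleX, exampleY)
{
  var qv     = getEmbedding_(query);
  var scoreX = dot(qv, getEmbedding_(exampleX));
  var scoreY = dot(qv, getEmbedding_(exampleY));
  return (scoreX >= scoreY) ? "x" : "y";
}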
Understand when humans are asking about things that have already been asked and which exist in the cache.
Embeddings can be used to understand when humans are asking about things that have already been asked and which exist in the cache. This is useful for chat systems that field many overlapping questions. By comparing a new query’s embedding with the embeddings of previously answered questions, it is possible to identify when a question has effectively been asked before and retrieve the relevant answer from the cache.
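A sketch of that cache lookup, assuming each cache entry stores the original question’s vector alongside its answer:

// sketch: answer from the cache when a new query is close enough
// to a previously answered one
function checkCache(query, cache, threshold)
{
  var qv = getEmbedding_(query);
  for (var i = 0; i < cache.length; i++) {
    if (dot(qv, cache[i].vector) >= threshold) {
      return cache[i].answer; // near-duplicate; skip the costly completion
    }
  }
  return null; // cache miss; run the completion and store the result
}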
Reject certain queries to avoid prompt injections.
Embeddings can be used to reject certain queries to avoid prompt injections. Prompt injections occur when a user tries to smuggle adversarial instructions into a query, which can cause serious security issues. By comparing incoming queries against embeddings of known injection patterns, it is possible to automatically reject those that are suspiciously similar, making it harder for attackers to exploit vulnerabilities.
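A sketch of an embedding gate; the injection examples and threshold are assumptions you would tune for your own system:

// sketch: reject queries that read like known injection attempts
function isLikelyInjection(query, knownInjections, threshold)
{
  var qv = getEmbedding_(query);
  for (var i = 0; i < knownInjections.length; i++) {
    if (dot(qv, getEmbedding_(knownInjections[i])) >= threshold) {
      return true; // too close to a known attack phrasing
    }
  }
  return false;
}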
Continually reward a chained process for adjusting its compass heading as it seeks to complete a goal.
Embeddings can be used to continually reward a chained process for adjusting its compass heading as it seeks to complete a goal. This is useful when a multi-step (chained) process must navigate a complex task to achieve a goal. By measuring the similarity between the process’s current state and a description of the goal, each step can be scored, ensuring the chain stays on track toward its goal.
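As a sketch, the reward can be as simple as the similarity between a summary of the current state and the goal description:

// sketch: a rising score means the chain is converging on its goal
function goalProgressReward(stateSummary, goalDescription)
{
  return dot(getEmbedding_(stateSummary), getEmbedding_(goalDescription));
}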
Conclusion
Embeddings are a powerful tool for AI solutions that go beyond the popular understanding of their use case. By leveraging the semantic similarity of embeddings, it is possible to understand the interest of a human as it pertains to the data you have that the human may need. Additionally, embeddings can be used to classify queries, reject certain queries, and continually reward a chained process for adjusting its compass heading.
Next [Learning] Steps
To learn more about embeddings and their innovative use cases, here are three resources you may find helpful:
1. Understanding Embeddings in Machine Learning by Jason Brownlee.
2. A Comprehensive Guide to Word Embeddings by Dipanjan Sarkar.
3. A Visual Guide to Using BERT for the First Time by Jay Alammar.
I’ve often intimated that AI solutions are nothing more than what we have been doing for decades in computer science; embeddings are another decisive advance in the quest to make better applications. It’s your opportunity to be creative with that advantage.
Google Apps Script Example
This script creates two embeddings and then compares them to derive a similarity score (i.e., the dot product). When you run this, the logged output will be:
0.7883761576443311
//
// EMBEDDING
//
function testCreateEmbedding()
{
  // embed the first term; parse the JSON response and extract the vector
  var str = "CyberLandr";
  var e1 = JSON.parse(palmCreateEmbedding_(str)).embedding.value;
  // embed the second term the same way
  str = "CyberTruck";
  var e2 = JSON.parse(palmCreateEmbedding_(str)).embedding.value;
  // the dot product of the two vectors is the similarity score
  Logger.log(dot(e1, e2));
}
This is to say that CyberLandr and CyberTruck share a similarity score of about 0.788.
Driving home the point of embeddings, let’s introduce one more comparison - CyberTruck and Cyber Truck.
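A minimal variation of the test function above handles that comparison (same palmCreateEmbedding_ and dot dependencies):

function testCompareSpelling()
{
  // embed the two spellings of the same product name
  var e1 = JSON.parse(palmCreateEmbedding_("CyberTruck")).embedding.value;
  var e2 = JSON.parse(palmCreateEmbedding_("Cyber Truck")).embedding.value;
  // log the labeled similarity score
  Logger.log("CyberTruck vs Cyber Truck: " + dot(e1, e2));
}

When run, the log shows: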
5:03:08 AM Info CyberTruck vs Cyber Truck: 0.9281584846024246
Simple, right? This is how you use the computational strength of LLMs to make inferences, and the possibilities in Google Workspace apps are massive. From Google Forms data to Sheets, search, Slides content, and Docs - embeddings can provide the bridge between what your users are thinking and how your solutions function.
Supporting Functions
There are only two dependencies in the example script above: a call to the PaLM 2 API to create the embedding vectors, and a function to compute the dot product.
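A minimal sketch of what those two functions might look like, assuming the PaLM API embedText endpoint with the embedding-gecko-001 model and an API key you supply yourself; adapt the details to your own setup:

//
// SUPPORTING FUNCTIONS (a sketch; endpoint and key handling are assumptions)
//
var PALM_API_KEY = "YOUR_API_KEY"; // replace with your PaLM API key

// calls the PaLM API embedText endpoint and returns the raw JSON string
function palmCreateEmbedding_(text)
{
  var url = "https://generativelanguage.googleapis.com/v1beta2/models/"
          + "embedding-gecko-001:embedText?key=" + PALM_API_KEY;
  var options = {
    "method"      : "post",
    "contentType" : "application/json",
    "payload"     : JSON.stringify({ "text" : text })
  };
  return UrlFetchApp.fetch(url, options).getContentText();
}

// computes the dot product of two equal-length vectors
function dot(a, b)
{
  var sum = 0;
  for (var i = 0; i < a.length; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}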