Embeddings and Cosine Similarity
Vector embeddings are a basic concept in AI, and they can be put to use for many purposes.
In another post over on an Airtable forum, I recall mentioning that AI would likely become the best approach to de-duping data sets. This idea applies to all data sets, including Google Sheets. A few people have emailed me asking for a concrete example that explains and demonstrates how to compute similarities.
As we know, OpenAI has shaken the tree and shown us a path to artificial general intelligence (AGI), with examples and eye-opening experiences suggesting it is poised to help or harm humanity in profound ways. But back on earth, we presently have simple, practical needs, such as de-duping data rows or finding similarities among rows in a spreadsheet.
Just so you know, you may not want or need this deep dive into embeddings and vectors. If you’d rather just make some stuff work, you might want to take a look at CustomGPT.
Finding Similarities Without Filters
What if you could perform a simple mathematical computation and find all similarities in a data set?
In this brief example, I define the essence of a “dot” product, which is like a cosine similarity function but a little less elegant. Its objective is to compare two arrays of numbers that each represent an embedding. Embeddings are a fundamental element of AGI; they take advantage of billions of parameters already computed by OpenAI.
As it happens, embeddings are for sale; they each cost about 1/600th of a cent, making their use quite practical for AI applications in Google Apps Script. I also realized recently that it’s not necessary to store embedding vectors in a specialized database like Pinecone or Weaviate; they can be stored in a spreadsheet or in JSON data files. They’re big arrays, but not onerously large. I also learned that computing similarities, while conceptually simple, is not especially performant at scale.
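For instance, here’s a minimal sketch of round-tripping a vector through a sheet cell, assuming a Google Apps Script context (the column choice and function names are illustrative, not a prescribed schema):

```javascript
// A sketch: persist an embedding as a JSON string in a single cell.
// Column 2 is an illustrative assumption.
function storeEmbedding(sheet, rowIndex, vector) {
  // ada-002 vectors are 1,536 floats; serialized as JSON, they
  // typically fit comfortably within one cell.
  sheet.getRange(rowIndex, 2).setValue(JSON.stringify(vector));
}

function loadEmbedding(sheet, rowIndex) {
  return JSON.parse(sheet.getRange(rowIndex, 2).getValue());
}
```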
Given a topic, like all rows similar to John Smith, we can get the embeddings for all names in a sheet and then decide which are closely related through simple math. This applies to any data, not just names, of course.
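Fetching one such embedding might look like the sketch below; it assumes an OpenAI API key stored in Script Properties and uses the text-embedding-ada-002 model:

```javascript
// A sketch: fetch an embedding vector from OpenAI's embeddings endpoint.
// The OPENAI_API_KEY script property is an assumption of this example.
function getEmbedding(text) {
  const apiKey = PropertiesService.getScriptProperties().getProperty('OPENAI_API_KEY');
  const response = UrlFetchApp.fetch('https://api.openai.com/v1/embeddings', {
    method: 'post',
    contentType: 'application/json',
    headers: { Authorization: 'Bearer ' + apiKey },
    payload: JSON.stringify({ model: 'text-embedding-ada-002', input: text })
  });
  // The response carries the vector at data[0].embedding.
  return JSON.parse(response.getContentText()).data[0].embedding;
}
```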
Note the similarity outcome values in the code. Relative to the query, Sally Smith (0.8865357851100655) is far less similar than John R. Smith (0.952003729179478). If you want a really powerful search feature, perform these computations and order the results descending. Bob’s your uncle.
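That search might look something like this sketch, which assumes the getEmbedding() helper above and the dotProduct() function shown later in this post:

```javascript
// A sketch: score every name against the query and sort descending,
// so the most similar rows surface first.
function rankBySimilarity(query, names) {
  const queryVector = getEmbedding(query);
  return names
    .map(name => ({ name: name, score: dotProduct(queryVector, getEmbedding(name)) }))
    .sort((a, b) => b.score - a.score);
  // In practice you'd fetch and cache the embeddings once (see the
  // storage sketch above) rather than calling the API on every search.
}
```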
And how about that de-dupe process that craves fuzzy search? Embeddings might be the answer.
Using this technique, you can create magic in your apps while positioning yourself as a purveyor of AI.
Or, you could get a CustomGPT account and bypass all of this.
The “Dot” Product
The dot product, also known as the scalar product or inner product, is a mathematical operation that takes two vectors and returns a scalar quantity. It is denoted by a dot (·) between the two vectors.
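Concretely, for vectors a = (a₁, …, aₙ) and b = (b₁, …, bₙ):

a · b = a₁b₁ + a₂b₂ + … + aₙbₙ

For example, (1, 2, 3) · (4, 5, 6) = 4 + 10 + 18 = 32. When both vectors have unit length, as OpenAI’s embedding vectors do, the dot product and the cosine similarity are the same number.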
The dot product is useful in many areas of mathematics and physics, such as vector calculus, mechanics, and computer graphics, among others. It has various applications, such as calculating the work done by a force on an object, determining the angle between two vectors, and projecting one vector onto another.
The following JavaScript function calculates the dot product of two embedding vectors.
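Here is a minimal sketch of such a function (illustrative; the guard clause and loop style are one reasonable way to write it):

```javascript
// A sketch: the dot product of two equal-length numeric arrays.
function dotProduct(a, b) {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length');
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += a[i] * b[i]; // accumulate the pairwise products
  }
  return sum;
}
```

If your vectors aren’t normalized to unit length, divide this result by the product of the two vectors’ magnitudes to get true cosine similarity.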