Using word embedding to enable semantic queries in relational databases – Bordawekar and Shmueli, DEEM'17

As I'm sure some of you have figured out, I've started to work through a collection of papers from SIGMOD'17. Strictly speaking, this paper comes from the DEEM workshop held in conjunction with SIGMOD, but it sparked my imagination and I hope you'll enjoy it too. Plus, as a bonus it's only four pages long!

What do you get if you cross word embedding vectors with a relational database? The ability to ask a new class of queries, which the authors term cognitive intelligence (CI) queries, that ask about the semantic relationships between tokens in the database, rather than just the syntactic matching supported by current queries. It's a really interesting example of AI infusing everyday systems.

We begin with a simple observation: there is a large amount of untapped latent information within a database relation. This is intuitively clear for columns that contain unstructured text. But even columns that contain different types of data, e.g., strings, numerical values, images, dates, etc., possess significant latent information in the form of inter- and intra-column relationships.

If we understood the meaning of these tokens in the database (at least in some abstract way that was comparable), we could ask queries such as "show me all the rows similar to this." That's something you can't easily do with relational databases today – except perhaps for range queries on specific types such as dates. Where can we get comparable abstract representations of meaning though? The answer is already given away in the paper title of course – this is exactly what word embedding vectors do for us!

If you're not familiar with word embedding vectors, we covered word2vec and GloVe in The Morning Paper a while back. In fact, "The Amazing Power of Word Vectors" continues to be one of the most read pieces on this blog. The idea of word embedding is to fix a d-dimensional vector space and, for each word in a text corpus, associate a dimension-d vector of real numbers that encodes the meaning of that word… If two words have similar meaning, their word vectors point in very similar directions.

How do we get word embedding vectors for database content? The authors use word2vec in their work, though as they point out they could equally have used GloVe. One approach is to use word vectors that have been pre-trained from external sources. You can also learn directly from the database itself. Think of each row as corresponding to a sentence, and a relation as a document. Word embedding can then extract latent semantic information in terms of word (and in general, token) associations and co-occurrences and encode it in word vectors. Thus, these vectors capture first inter- and intra-attribute relationships within a row (sentence), and then aggregate these relationships across the relation (document) to compute the collective semantic relationships.

In their prototype implementation, the authors first textify (!) the data in a database table (e.g., using a view), and then use a modified version of word2vec to learn vectors for the words (database tokens) in the extracted text. This phase can also use an external source (e.g. …). We use word as a synonym to token, although some tokens may not be valid words in any natural language.
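To make the textification idea concrete, here's a minimal sketch of what that preprocessing and training step could look like. To be clear, this isn't the authors' pipeline – they use a modified version of word2vec – and the toy relation is invented for illustration; the sketch simply feeds the textified rows to gensim's off-the-shelf Word2Vec.

```python
from gensim.models import Word2Vec

# A toy 'customers' relation: each row is (custID, items purchased).
# Hypothetical data, purely for illustration.
rows = [
    ("cust_alice", ["espresso", "croissant", "orange_juice"]),
    ("cust_bob",   ["espresso", "bagel"]),
    ("cust_carol", ["green_tea", "croissant"]),
]

# Textify: each row becomes a 'sentence' of tokens, so tokens from the same
# row co-occur in the same training context (intra-row relationships).
sentences = [[cust_id] + items for cust_id, items in rows]

# Train word vectors over the textified relation (the 'document').
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, epochs=200)

# Tokens that co-occur across rows end up with nearby vectors, e.g.:
print(model.wv.most_similar("croissant", topn=3))
```

On real data you would textify the full relation (every column rendered as tokens), which is where the inter-column relationships come from.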
The key characteristic of CI queries is that they are executed, in part, using the vectors in the word embedding model. Following vector training, the resultant vectors are stored in a relational system table. At runtime, the system (built on Spark using Spark SQL and the DataFrames API) uses UDFs to fetch trained vectors from the system and answer CI queries. Broadly, there are two classes of cognitive intelligence queries: similarity and prediction queries…

If the word embedding model is generated using the database being queried, it captures meaning in the context of the associated relational table, as specified by the relational view. If a model is rebuilt using a different relational view, a CI query may return different results for the new model.

It's time to look at some concrete examples to make all this a bit clearer. Given a similarityUDF that can tell us how similar two sets of word vectors are, we can ask a query such as the one sketched below. In this case, the vector sets correspond to the items purchased by the corresponding customers.
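The query itself appeared as a figure in the original post and hasn't survived here, so what follows is only a hedged reconstruction of its shape: a hypothetical customers view with custID and items columns, a Python similarityUDF (averaging pairwise cosine similarity between two sets of item vectors) registered with Spark SQL, and a self-join keeping pairs above a threshold.

```python
import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("ci-query-sketch").getOrCreate()

# Hypothetical trained vectors -- in the prototype these are fetched from a
# relational system table; a plain dict keeps this sketch self-contained.
vectors = {
    "espresso":  np.array([0.9, 0.1]),
    "bagel":     np.array([0.8, 0.3]),
    "croissant": np.array([0.7, 0.4]),
    "green_tea": np.array([0.1, 0.9]),
}

def similarity(items_a, items_b):
    """Average pairwise cosine similarity between two sets of item vectors."""
    sims = [
        float(np.dot(vectors[a], vectors[b]) /
              (np.linalg.norm(vectors[a]) * np.linalg.norm(vectors[b])))
        for a in items_a for b in items_b
    ]
    return sum(sims) / len(sims)

# Register the UDF so it can be called from Spark SQL.
spark.udf.register("similarityUDF", similarity, DoubleType())

# A toy customers view: custID plus the set of items they purchased.
customers = spark.createDataFrame(
    [("alice", ["espresso", "croissant"]),
     ("bob",   ["espresso", "bagel"]),
     ("carol", ["green_tea"])],
    ["custID", "items"])
customers.createOrReplaceTempView("customers")

# "Which pairs of customers bought semantically similar sets of items?"
spark.sql("""
    SELECT c1.custID AS cust, c2.custID AS otherCust,
           similarityUDF(c1.items, c2.items) AS sim
    FROM customers c1 JOIN customers c2 ON c1.custID < c2.custID
    WHERE similarityUDF(c1.items, c2.items) > 0.5
""").show()
```

Note the design point this illustrates: the relational engine itself is untouched – all of the semantic machinery arrives through ordinary UDFs that consult the trained vectors.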