The Future of DAM and AI

This special feature was contributed by David Tenenbaum, CEO and owner of MerlinOne.


For the last 30 years we have had to use text terms (metadata attached to an object) to let us search for visual objects, which seems kind of ironic. What if you could search image content directly, with no metadata middleman? That would be disruptive! How could it be possible?

What if you went out to the Internet and scraped a lot (think millions) of pictures, along with the text near them, so you had this huge pile of image/caption pairs?

Then you take the first image/caption pair, and you feed the caption into an AI process that is optimized for text and uses something called “attention” to figure out the most important words. What if that text-optimized AI process had already been trained on a huge textual dataset, like all of Wikipedia, and had learned to pick up patterns of words and associate them with concepts?

We feed the first picture’s caption into that attention mechanism, and it spits out a long number that describes the caption, fittingly called a “descriptor”. To keep this simple, let’s pretend this number represents a dot in three-dimensional space, so it occupies a place in a cube of space, like the room you are in right now.

Next we take the corresponding image, and feed it into another AI process that is optimized for images, and it too spits out a number (“descriptor”) and that represents another dot in the space of your room. That image dot might be a distance away from the caption dot we just created.
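As a sketch of this idea, here are two hypothetical 3-D descriptors for one caption/image pair and the distance between them. The coordinates are made up purely for illustration; real encoders emit vectors with hundreds of dimensions.

```python
import numpy as np

# Hypothetical 3-D descriptors for ONE caption/image pair.
# Real encoders (a text model and an image model) produce vectors
# with hundreds of dimensions; 3-D is just for intuition.
caption_dot = np.array([0.8, 0.1, 0.2])   # where the caption landed
image_dot   = np.array([0.5, 0.4, 0.3])   # where the image landed

# Before any joint training, the two dots for the same pair
# can sit some distance apart in the room.
distance = np.linalg.norm(caption_dot - image_dot)
print(round(distance, 3))
```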

But wait: aren’t those two dots different representations of the exact same thing (one textual and one visual)? Sure they are: they should not be far from each other! So let’s train a third AI system to move those dots really close together.
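A crude sketch of that “move the dots together” training step: nudge each descriptor a little toward its partner, and repeat. (Real contrastive training also pushes non-matching pairs apart; this toy version, with assumed starting coordinates, leaves that out.)

```python
import numpy as np

caption_dot = np.array([0.8, 0.1, 0.2])
image_dot   = np.array([0.5, 0.4, 0.3])

lr = 0.1  # how far each nudge moves a dot
for _ in range(50):
    gap = caption_dot - image_dot
    caption_dot = caption_dot - lr * gap   # caption dot steps toward the image dot
    image_dot   = image_dot   + lr * gap   # image dot steps toward the caption dot

# After many nudges, the matched pair sits almost on top of itself.
final_gap = np.linalg.norm(caption_dot - image_dot)
print(final_gap)  # very close to 0
```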

Let’s do the same thing with the next caption/image pair, and then the next, and work our way through all of the millions of caption/image pairs.

What we end up with is a room full of clusters of dots. Let’s say one image/caption pair was of a German Shepherd: those dots end up in a “dogs” cluster. Within the dogs cluster you would expect to find image and caption pairs of German Shepherds in one lump, with golden retrievers nearby, and similarly their captions (even though they use different words) would sit close together, since they represent the same or similar objects. In this way the exact terms used in the captions no longer matter: similar concepts are all grouped together even if they use different words!

Nearby there is probably another cluster, this time of cats (also four-legged animals that look a bit like dogs, but with their own separate cluster). Further away, because they look so different, will be a cluster of car images and their captions, and further away still, airplane images and theirs, etc.
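With made-up 3-D coordinates for each cluster center, the geometry just described looks like this: cats close to dogs, cars much further away, airplanes further still.

```python
import numpy as np

# Hypothetical cluster centers in our 3-D room (coordinates invented
# for illustration only).
dogs      = np.array([1.0, 1.0, 0.0])
cats      = np.array([1.2, 0.8, 0.1])   # visually similar to dogs: nearby
cars      = np.array([4.0, 0.0, 2.0])   # very different: far away
airplanes = np.array([6.0, 3.0, 5.0])   # farther still

def dist(a, b):
    return np.linalg.norm(a - b)

print(dist(dogs, cats) < dist(dogs, cars) < dist(dogs, airplanes))  # True
```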

Since we scraped these image/caption pairs off the internet, there is sure to be some garbage in there (mislabeled, biased, or just plain wrong data), but when you are dealing with a huge number of objects (a huge “training set”), the bogus ones have minimal impact.

We end up, in addition to our room full of dots, with an AI engine that has been trained on a huge number of objects, that “understands” textual concepts, understands the visual content of images, and groups the right stuff together. It is this trained engine that was the end goal all along (not the training dots themselves)!

How do we make it useful for our DAM content? Well first, let’s clear all those training dots out of our room, and start fresh! Next we flow all the images from our system through the trained AI model we just built, and let our images become clusters of dots in our room, but this time the dots represent OUR images.

Now a user sits down and types “German Shepherd puppy jumping” into our search box. That goes to the same AI text “attention” engine we used before, and just like before a descriptor number gets spit out, and we find the point in our 3D space that descriptor represents, using the engine we just trained that has learned to “understand” the important concepts of “German Shepherd”, “puppy” and “jump”. Then we put a sphere around that point: we want to grab the “nearest neighbor” points that represent our photos that are near our text search dot, and return them to the user.
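Putting the indexing and search steps together as a sketch: assume each DAM image already has a descriptor (one row of `image_dots`), and assume the text engine has mapped the query “German Shepherd puppy jumping” to `query_dot`. Finding the nearest neighbors is then just a sort by distance. All values here are invented for illustration.

```python
import numpy as np

# Descriptors for OUR images, one row per image (toy 3-D values;
# in reality these come from the trained image encoder).
image_dots = np.array([
    [0.9, 0.1,  0.1 ],   # image 0: German Shepherd puppy jumping
    [0.8, 0.25, 0.15],   # image 1: golden retriever running
    [0.1, 0.9,  0.2 ],   # image 2: cat sleeping
    [0.0, 0.1,  0.9 ],   # image 3: red sports car
])

# Pretend the text engine mapped the user's query to this dot.
query_dot = np.array([0.85, 0.15, 0.1])

# Nearest neighbors: rank images by distance to the query dot.
distances = np.linalg.norm(image_dots - query_dot, axis=1)
ranked = np.argsort(distances)
print(ranked[:2])  # the two dog images come back first
```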

Guess what: they are all likely to be photos or drawings or graphics of German Shepherd puppies jumping! And we did all this with ZERO reference to any textual metadata attached to the image.

We have ourselves an AI engine that “understands” the world much like a teenager might, and a way to find images WITHOUT any dependency on textual metadata some person or automation had to attach to the image.
