AI In DAM: The Challenges And Opportunities
A month ago Ralph Windsor wrote an article describing his evaluation of Google Cloud Vision and what it could mean for the DAM industry. About the same time, and independent of Ralph, I worked on a project with a team to investigate the value that visual recognition APIs generally could bring to a DAM application.
Is autotagging useful yet?
Ralph and I came to similar conclusions: in most cases the auto-suggested keywords coming from APIs such as Google Cloud Vision are not yet good enough to be added directly to assets in a DAM application without human intervention.
The reason I wrote in most cases is because during our user testing we found exceptions to this. For example, one of our clients manages images for a tourist board that has a large number of photos of outdoor scenes. They found that, although the auto-suggested keywords were not 100% accurate, they were good enough – especially when they considered that, because of the volume of images they have to process on a daily basis, without the auto-suggested keywords their assets would not receive any keywords at all.
My summary is that, using the online APIs available at the time of writing, auto-suggested tags could add value to your DAM application if:
- Most of your images contain subjects that the APIs have learnt about. This is an obvious point that highlights a key issue: you and your DAM application are not in control of how these systems learn. Their learning process is opaque but a good guess is that most of them are learning using images from the web (as well as other sources). So if your images are mostly of generic subjects often found on web pages – for example, shots of nature or people – then your results are more likely to be accurate.
- You can tolerate some wrong keywords and some missing keywords. At present these APIs have not learned enough to get it right all the time, even if your images are “of the right sort”.
- The alternative is worse, e.g. you just don’t have the time or money to manually add keywords to every image.
Damned with faint praise? Perhaps, but let’s not give up on these technologies just yet. AI is an emerging, fast-evolving field that will inevitably have a huge impact on DAM at some point. Right now it is having some impact (mostly as a marketing tool, it has to be said) but no one really expects that humans will still be manually adding metadata to images in 25 years’ time.
When will this huge impact start? The obvious answer is when the APIs produce keywords that are as good as those entered by an experienced human, regardless of subject domain. But when will this be and can we, as DAM vendors and users, do anything other than wait passively and hope it is soon? Let’s start by looking at where the gaps are.
I’ve already mentioned the biggest problem – accuracy of the suggested keywords. Another issue we had when integrating the visual recognition APIs into our DAM application was that many of our clients work from a taxonomy. This is a hierarchical set of keywords, used to standardise terminology, handle synonyms and ensure searches for general terms pick up assets tagged with specific examples of those general terms. A keyword can only be added to an asset if it is on this master list. Naturally, the APIs know nothing about the master list, and so many of their suggested keywords were rejected.
This problem is not insurmountable – we could solve it in our own code for example by setting up configurable mappings between auto-suggested keywords and those in the master list. However, it does highlight another major problem with the current offerings from the online API providers – they are mostly generic and provide little in the way of client-specific customisation, whereas keyword subject domains are often very specific to clients. What organisations in the UK might call a “pavement” Google probably calls a “sidewalk”. This illustrates another issue the likes of Google have with image recognition – they open themselves up to accusations of cultural imperialism, not to mention the risks of other politically unacceptable errors such as those Ralph highlighted last year.
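As a sketch of the configurable mapping idea (not our actual implementation – the vocabulary and synonym entries here are illustrative), reconciling auto-suggested keywords with a client’s master list could look something like this:

```python
# Illustrative sketch: reconcile auto-suggested keywords with a client's
# controlled vocabulary via a configurable synonym mapping.
# The vocabulary and mappings below are made-up examples.

CONTROLLED_VOCABULARY = {"pavement", "building", "sky", "people"}

# Client-specific mapping from API terminology to taxonomy terminology,
# e.g. handling the US/UK "sidewalk" vs "pavement" difference.
SYNONYM_MAP = {
    "sidewalk": "pavement",
    "person": "people",
    "skyscraper": "building",
}

def reconcile(suggested_tags):
    """Keep tags on the master list, translate known synonyms,
    and reject everything else for human review."""
    accepted, rejected = [], []
    for tag in suggested_tags:
        tag = tag.lower()
        if tag in CONTROLLED_VOCABULARY:
            accepted.append(tag)
        elif tag in SYNONYM_MAP:
            accepted.append(SYNONYM_MAP[tag])
        else:
            rejected.append(tag)
    return accepted, rejected

accepted, rejected = reconcile(["Sidewalk", "sky", "frisbee"])
# accepted -> ["pavement", "sky"]; rejected -> ["frisbee"]
```

The rejected list is worth keeping: reviewing it periodically shows which API terms keep coming back and should be added to the mapping.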
You don’t need to know about everything
The approach that most of the API providers seem to be taking is to teach their systems to recognise as many subjects as possible and provide the same function to all clients (so if client A and client B both pass exactly the same image to the API they will see exactly the same results). This seems like a reasonable approach, but it leads directly to the two key problems (accuracy and different terminology) as they clearly don’t know enough yet to be sufficiently accurate for all clients’ images and client B might not want to see the same results as client A.
Is it the best approach anyway? Given that most, if not all, of the APIs use neural networks, which are inspired by how we think the human mind works, let’s consider how a human would add keywords to a large set of an organisation’s images. This is a simplification, but generally they will be adding keywords that fall into two categories:
- Generic subjects, which they will recognise as they learned about them from general life, such as “people”, “building”, “sky”.
- Specific subjects, such as the names of the people and buildings, and the name and code of product shots. These are subjects they will have learned about since working with this organisation, or may have to spend some time learning about before they start keywording.
Humans would certainly not be expected to know every subject ever seen in any image ever used by any organisation!
DAM applications typically contain hundreds of thousands of images that have been keyworded manually – why can’t the APIs learn from all this data (a bit like a human would, but more quickly) and then use it to provide keywords that are specific to an organisation? For example if my DAM application already contains hundreds of images of my company’s head office building, why can’t the API learn what my head office building looks like for the next time someone uploads an image of it?
The challenges for visual recognition API providers
Naively, I thought the APIs would support something like this, i.e. that they would have a generic component used by all clients, which learns from all the images it ever sees, and a client-specific component, that would learn from the specific subject domains of a client. Upon working with the APIs I realised this isn’t the case – most don’t even provide the means for learning feedback loops and those that do, for example Clarifai’s feedback resource, use the information only to teach the generic component that is used by all and so don’t give more weight to the learning done on the client’s own images. A notable exception to this is IBM Watson, which provides custom classifiers, but in our evaluation we didn’t find IBM Watson’s results to be accurate enough to use.
I asked Clarifai’s Lamar Potts about this. He said: “We provide a general feedback loop in our API to encourage users to contribute to the virtuous cycle of making visual recognition technology smarter for everyone. Our overarching belief is that our customers can benefit from the contributions of other users.”
This is a noble belief – I’m all in favour of win-win situations. But providing a general feedback loop doesn’t preclude adding client-specific feedback loops.
I looked into this further and discovered what I suspect is the real issue – it is very difficult to fully automate custom training. Most AI systems based on neural networks need babysitting by humans during their training, which makes it hard to provide real-time, scalable learning functionality. For example, after receiving feedback from its feedback resource, Clarifai’s learning process requires some manual steps in order to incorporate the feedback.
Another issue with learning from clients’ data for vendors, especially high profile ones, is allaying privacy and security concerns – most organisations are protective of their data and want to know how it is being stored and used.
Shall we just wait then?
As consumers of the visual recognition APIs should we just wait until they get more accurate and provide the feedback loops we need? That’s one option, as things move fast in this area. For example, although it doesn’t sound like Clarifai is going to support fully-automated learning (from a client’s images) any time soon, they are about to release a Custom Training Model, which will enable their clients’ users to provide the necessary human input. And IBM Watson’s API looks like it already provides what’s needed – it just needs to improve the results!
For the impatient, there are other options. One is to accept that while auto-tagging is not yet good enough to be used unsupervised, when combined with a human it could help speed things up. An obvious example is to add a user interface that makes it very quick for users to accept correct keywords and reject poor ones. Another approach could be to use the auto-suggested tags as a way of grouping similar images in an upload batch so they can be quickly tagged en masse by a human user. This removes the need for the auto-suggested tags to be accurate – they can be wrong as long as they are consistently wrong for all the images in the batch.
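To sketch the grouping idea (a simplified illustration, with made-up filenames and tags), images in an upload batch can be clustered on their top auto-suggested tags – the tags don’t need to be right, only consistent across similar images:

```python
# Illustrative sketch: group an upload batch by auto-suggested tags so a
# human can keyword each group en masse. Filenames and tags are made up.
from collections import defaultdict

def group_by_tags(batch, top_n=3):
    """Group images whose top-N auto-suggested tags match. Even wrong tags
    cluster similar images, as long as they are consistently wrong."""
    groups = defaultdict(list)
    for filename, tags in batch.items():
        signature = tuple(sorted(tags[:top_n]))
        groups[signature].append(filename)
    return dict(groups)

batch = {
    "img1.jpg": ["mountain", "snow", "sky"],
    "img2.jpg": ["snow", "sky", "mountain"],
    "img3.jpg": ["office", "desk", "laptop"],
}
groups = group_by_tags(batch)
# img1 and img2 share a signature and land in one group; img3 in another
```

A real implementation would want fuzzier matching (e.g. grouping on tag overlap rather than an exact signature), but the principle is the same.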
Then there’s the option of adding a layer of intelligence over the top of the results coming back from the visual recognition APIs. For example, we developed an application that uses probabilities to try to map the auto-generated tags to human tags. It processes all the assets already in a client’s DAM application and, for each image, compares the auto-suggested tags coming back from the API with the human-entered keywords. Once all the existing assets have been processed, this information is used to calculate the probability that a new image containing a particular auto-suggested tag should be given a particular human-entered tag. This works pretty well for clients with a large number of assets who tend to upload images with similar subjects, but it produces inaccurate results when new, specific subjects are introduced. For example, if the system has only ever seen dogs called Fido, it will tag a new image of any dog with Fido. The problem here is with volume – the hundreds of thousands, even millions of images in a large DAM implementation are not enough for a deep learning application on their own. And then there’s the cost – as these APIs charge on a per-use basis, having to pass millions of images to the API just for the learning stage is going to be expensive.
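The core of that probability mapping can be sketched as a simple co-occurrence counter (this is a simplified illustration of the approach described above, not our production code – the tags and threshold are made up):

```python
# Illustrative sketch: learn which human-entered keywords tend to accompany
# each auto-suggested tag, then propose human keywords for new images.
from collections import defaultdict

class TagMapper:
    def __init__(self):
        # auto tag -> human keyword -> co-occurrence count
        self.cooccur = defaultdict(lambda: defaultdict(int))
        self.auto_totals = defaultdict(int)

    def train(self, auto_tags, human_tags):
        """Record one existing, human-keyworded asset."""
        for a in auto_tags:
            self.auto_totals[a] += 1
            for h in human_tags:
                self.cooccur[a][h] += 1

    def suggest(self, auto_tags, threshold=0.5):
        """Propose human keywords whose conditional probability, given an
        auto tag on the new image, exceeds the threshold."""
        suggestions = set()
        for a in auto_tags:
            total = self.auto_totals[a]
            if total == 0:
                continue  # never seen this auto tag during training
            for h, count in self.cooccur[a].items():
                if count / total > threshold:
                    suggestions.add(h)
        return suggestions

mapper = TagMapper()
mapper.train(["dog", "grass"], ["Fido", "garden"])
mapper.train(["dog", "ball"], ["Fido", "toys"])
mapper.suggest(["dog"])  # {"Fido"} - and therein lies the Fido problem:
# any new dog image will be tagged Fido until other dogs enter the data
```

This also makes the volume problem concrete: with only a handful of training assets per subject, the conditional probabilities are dominated by coincidences.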
What about implementing a proper deep learning system to provide this custom intelligence layer, for example using the many open source neural network implementations? That would be fun, but I doubt whether many DAM vendors want to spend effort on becoming machine learning experts, especially as it could turn out to be wasted effort if the API providers introduce this functionality themselves.
My prediction is that although many DAM applications will soon start offering integration with auto-tagging APIs such as Google Cloud Vision, we won’t see high adoption of these technologies from users until the results improve. This could happen either as the API providers get wise to the potential of the DAM market and start listening to what it needs, or when smaller third-party machine-learning experts start filling the gaps. Either way, it’s going to happen – hopefully sooner rather than later!