Automated Image Descriptions – Why Combining Them With Human-Based Cataloguing Will Generate A Higher ROI
A story which has been doing the rounds recently is this post on the Google Research blog. The item describes some AI (Artificial Intelligence) techniques Google have used to automatically generate image captions. There is the de rigueur line about a picture being worth a thousand words (can anyone write an introductory photo metadata article without using this?). But it’s the following which is potentially more interesting for those who already grasp that point:
“Many efforts to construct computer-generated natural descriptions of images propose combining current state-of-the-art techniques in both computer vision and natural language processing to form a complete image description approach. But what if we instead merged recent computer vision and language models into a single jointly trained system, taking an image and directly producing a human readable sequence of words to describe it?” [Read More]
The article describes how they have taken automated language translation techniques and applied them to this problem. Like a lot of people, my knowledge of AI and neural networks is fairly incomplete; however, the basics seem to be that Google's system generates an abstract vector representation of the image and then analyses each key element to pattern-match it with associated language descriptions.
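For readers who want a more concrete picture, the basic encoder-decoder pattern can be sketched in a few lines of code. To be clear, this is purely an illustration of the general idea in PyTorch – the class name, layer sizes and framework choice are all my own assumptions, not anything Google has published about their actual system:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionSketch(nn.Module):
    """Illustrative encoder-decoder captioner: a CNN turns the image into a
    fixed-length vector, and an LSTM then generates the caption word by word.
    Sizes and structure are hypothetical, not Google's implementation."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Encoder: a pre-trained CNN with its classifier head replaced,
        # so it outputs an image embedding rather than class scores.
        cnn = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        cnn.fc = nn.Linear(cnn.fc.in_features, embed_dim)
        self.encoder = cnn
        # Decoder: a language model conditioned on that image embedding.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        img_vec = self.encoder(images).unsqueeze(1)   # (batch, 1, embed_dim)
        word_vecs = self.embed(captions)              # (batch, T, embed_dim)
        # The image vector is fed to the LSTM as if it were the first word.
        seq = torch.cat([img_vec, word_vecs], dim=1)
        hidden, _ = self.lstm(seq)
        return self.out(hidden)                       # scores over the vocabulary
```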
Assuming the results are genuine, one of the more impressive aspects for me was that they used openly available images, not a carefully selected set of their own. I have sat through a few demonstrations of automated metadata cataloguing systems before and they usually work great with photos provided by the vendor's own salesperson, but the quality of the results declines rapidly when given some random images that the prospective customer wants catalogued. I would still like to try that with the Google system, but I have to acknowledge they have understood this as a potential deal-breaker for any practical usage scenario and taken steps to address it. They show a series of examples, and it's encouraging that they also include some that haven't been successful – most metadata professionals are aware that this type of application isn't going to be 100% accurate; what they want to know is where it might be weak so they can compensate by increasing the level of human/manual cataloguing or quality control.
It is perhaps too early to discuss this being used in real-world Digital Asset Management operations such as photo libraries, marketing departments etc, but it is worth contemplating the issues to be addressed in taking it from interesting research demo to something that could be put to use for proper work. One aspect which has often occurred to me when looking at automating metadata cataloguing is how it seems to favour those who already have solid processes and rules for doing it manually using human beings. This is not discussed in the article above, but I suspect that many of the issues with the poor recognition examples could be corrected if the algorithm had access to some additional sources of metadata which could offer further hints to optimise the recognition. In other words, if there are additional modifiers and control variables which have been accrued over a period of time, they could be re-used to enhance the quality of the results. This raises the topic of legacy metadata that was discussed on DAM News a few weeks ago. By disposing of data which an AI-based system could utilise, you lose the opportunity to generate compound ROI returns from work already carried out (and paid for).

It isn't mentioned whether Google's image recognition system has (or will have) access to data collected from their search engine, but I would almost guarantee that someone in Mountain View thought of that idea a long time ago. I am also confident that they never throw away any of the search data they collect (even if privacy regulations require them to reluctantly obfuscate the source). If you want absolute proof that data is not a commodity like oil or gold, it is the fact that no other example shares its defining characteristic: the marginal utility actually increases the more of it you acquire.
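Returning to the point about re-using accrued metadata as hints for the recogniser: as a rough illustration (the scoring weights and data structures below are entirely my own invention, not a published technique from Google or any DAM vendor), legacy keyword data could be used to re-rank the tags an automated system suggests:

```python
from collections import Counter

def rerank_tags(candidate_tags, legacy_keywords):
    """Boost automatically suggested tags that already appear in the
    organisation's legacy keyword data. Purely illustrative scoring."""
    prior = Counter(legacy_keywords)           # how often each term was used historically
    total = sum(prior.values()) or 1
    reranked = []
    for tag, model_confidence in candidate_tags:
        # Blend the recogniser's confidence with a prior derived from
        # existing, human-created metadata (the 0.7/0.3 weights are arbitrary).
        prior_weight = prior[tag] / total
        reranked.append((tag, 0.7 * model_confidence + 0.3 * prior_weight))
    return sorted(reranked, key=lambda item: item[1], reverse=True)

# Example: 'headquarters' gets a lift because cataloguers have used it before.
suggestions = [("building", 0.61), ("headquarters", 0.42), ("beach", 0.05)]
history = ["headquarters", "headquarters", "annual report", "CEO"]
print(rerank_tags(suggestions, history))
```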
There is a paradox with automated cataloguing: the sort of people who are most interested in it tend to be the ones who have decided that manual cataloguing is too much hassle and expense for them. By contrast, those who have come to terms with the need to do it properly and invested the time and effort will have accumulated the additional expertise and configuration understanding needed to get the most from it. Where this really comes into play is with the type of highly subject-specific asset repositories that define most scenarios where DAM technology gets put to use. For example, if you put a photo of your CEO into a system like the one described by Google, it might guess the gender, age and perhaps racial origin of the subject. However, getting back a description like 'middle-aged white man standing next to a wall' isn't that useful for most DAM users (although you probably would want to retain some of it as keywords for anyone doing more general searches). It is the specific details, like the name of the person, their job title and the geographic context, which combine with the other elements to give a more useful description. Those can only be derived if a human being has already created the rules to catalogue that data (and selectively applied them).
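A hypothetical sketch of how that combination might look in practice – the field names, record structure and example values are illustrative only, and any real schema would be defined by the organisation's own metadata model:

```python
def enrich_caption(auto_caption, asset_record):
    """Combine a generic machine-generated caption with the specific,
    human-maintained facts a DAM already holds about the asset.
    Field names ('person_name', 'job_title', 'location') are hypothetical."""
    specifics = []
    name = asset_record.get("person_name")
    title = asset_record.get("job_title")
    if name:
        specifics.append(f"{name}, {title}" if title else name)
    if asset_record.get("location"):
        specifics.append(f"photographed at {asset_record['location']}")
    return {
        # Specific, human-sourced details become the headline description...
        "description": "; ".join(specifics) if specifics else auto_caption,
        # ...while the generic caption is retained as keywords for broader searches.
        "keywords": auto_caption.split(),
    }

record = {"person_name": "John Smith", "job_title": "Chief Executive",
          "location": "London head office"}
print(enrich_caption("middle-aged white man standing next to a wall", record))
```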
The synthesis of human design of solid metadata models, automated recognition and manual verification could offer a compelling ROI case and still meet most DAM users' minimum quality thresholds. In the findability article I wrote for DAM News last year, a key principle I described is the need for descriptions to contain both literal and subject-oriented elements, to reflect the full range of criteria that users will employ to describe the type of asset they need to find. I believe the subject or domain-related metadata cataloguing task is likely to be too complex for AI-based automation for a long time; however, some of the more generic captioning jobs that often get sub-contracted to crowdsourcing providers could potentially be replaced with techniques and algorithms of the type Google have described in their article. Those who have more comprehensively defined metadata schemas and considered this subject in sufficient depth might therefore be best placed to make use of it for their own metadata operations – far more so than those just hoping that the technology will catch up with their DAM ROI aspirations.
I think that both the Google Research blog post and this excellent article would benefit from the addition of a pinch of Semantic Web concepts. Image recognition via AI, complemented with Semantic Web resources – Linked [Open] Data, ontologies, RDF etc – and other sources, will inevitably be a relevant topic in the near future. And this concerns DAM too.
That’s a very good point, José; these other technologies could certainly enhance the potential quality of the results.