Google’s Visual Case Study Of The Perils And Politics Of Automated Metadata
About six weeks ago on DAM News, we reported some problems that Flickr ran into with their automated tag suggestion feature and noted that there were both perils and politics involved with this type of exercise. Last week, with the revelation that Google had catalogued the images of two people featured in a set of photos as ‘Gorillas’, more tangible examples emerged of the risks of depending exclusively on these automated methods. To make matters worse, the subjects in the photos were both black (that’s as in race or ethnicity, for the benefit of any of Google’s auto-classification robots which might subsequently parse the text of this article). There were some suitably contrite soundbites issued by senior Google personnel, such as:
“This is 100% not OK,” acknowledged Google executive Yonatan Zunger. “[It was] high on my list of bugs you ‘never’ want to see happen.” [Read More]
I don’t think I would care to argue with Yonatan’s assessment of their predicament and I would imagine that several meetings ‘without coffee’ were held in Mountain View involving the engineering team when this story first broke.
There was also some follow-up comment by Jacky Alcine, who both featured in the offending image and brought it to Google’s attention (especially with reference to the first sentence):
“I do have a few questions, like what kind of images and people were used in their initial priming that led to results like these. [Google has] mentioned a more intensified search into getting person of colour candidates through the door, but only time will tell if that’ll happen and help correct the image Silicon Valley companies have with intersectional diversity – the act of unifying multiple fronts of disadvantaged people so that their voices are heard and not muted.” [Read More]
An argument made by proponents of automated metadata is that human beings are not infallible and make mistakes when carrying out this task too. That is a fair comment, except it glosses over several key distinctions which I want to discuss in this article because they might be instructive for anyone considering using automated methods to generate metadata for assets held in their DAM.
When analysed from a completely rational and logical perspective, computer software cannot ever make mistakes. If a given system does something that the user or programmer did not intend, either one or both of those parties had to be at fault. The software is not a conscious entity, so it has no awareness of whether its performance of a given task is acceptable or not. Most people intuitively understand this; however, anthropomorphising and then blaming an algorithm is less confrontational than outright accusing a live human being of screwing up. In Google’s case, there cannot really be considered to be users, per se, so this has to come down to the limitations of the code used to classify the source material. The implication is that when mistakes like this occur, it is not a one-off, but exposes a flaw that will probably manifest itself again at some point. A comparison might be colour-blindness. If you cannot see hues of a given colour, then it is unlikely that you would use them as keywords when cataloguing. Until either the cataloguer’s colour-blindness is corrected or another person who is not similarly afflicted reviews and modifies their work, the same omissions will almost certainly recur on other occasions. An observation that could be made in this case is that computer software (which can detect colours) could fulfil this reviewing role instead of another human cataloguer. That is a point I will return to later and it does suggest how these technologies could be applied in a lower-risk and productivity-enhancing fashion.
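By way of a simple illustration of that lower-risk, suggestion-only mode, the minimal sketch below (using the Pillow imaging library and an illustrative colour palette of my own; it is not anything a specific vendor ships) has automated colour detection propose candidate keywords which a human cataloguer then confirms or rejects:

```python
from PIL import Image

# Small illustrative palette; a real DAM would map to its own controlled vocabulary.
PALETTE = {
    "red": (220, 40, 40),
    "green": (40, 160, 60),
    "blue": (50, 90, 200),
    "yellow": (230, 210, 60),
    "black": (20, 20, 20),
    "white": (240, 240, 240),
}

def dominant_colour_keywords(path, top_n=3):
    """Return candidate colour keywords for an image, as suggestions for human review."""
    img = Image.open(path).convert("RGB").resize((64, 64))
    counts = {}
    for pixel in img.getdata():
        # Assign each pixel to its nearest palette colour (simple squared-distance match).
        name = min(
            PALETTE,
            key=lambda n: sum((a - b) ** 2 for a, b in zip(PALETTE[n], pixel)),
        )
        counts[name] = counts.get(name, 0) + 1
    return sorted(counts, key=counts.get, reverse=True)[:top_n]

# The cataloguer sees these purely as suggestions and accepts or rejects each one:
# suggestions = dominant_colour_keywords("asset_0001.jpg")
```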
As should be clear, there is a big difference between a flaw or limitation and a one-off mistake. I have worked with a range of individuals responsible for applying metadata to assets in the past; some have been highly expert at it and others absolutely useless. It is fair to say, however, that none of them would have deemed that tagging a picture of two people with the word ‘gorillas’ was ever going to be a good idea, unless there really were some actual gorillas visible in it. The biggest source of human cataloguing failures (in my experience) is where industrialised techniques like batch tagging get used inappropriately, without the person involved properly checking all the results, on the assumption that the assets are more or less identical (or, in some cases, without caring about the differences between them). This is the scenario that occurred with Google: the algorithm was left to follow a prescribed method that was not sophisticated enough to deal with the material supplied. I am sure Google will adjust the series of rules which produced this controversial result, but there will be others, and in the process of muting the effects they might also generate further problems and potential for different kinds of misinterpretation. Yonatan Zunger offers some further indications about how they may start to address this issue later in the article:
“He added it was ‘also working on longer-term fixes around both linguistics – words to be careful about in photos of people – and image recognition itself – e.g. better recognition of dark-skinned faces’.” [Read More]
This reads (to me) like an acknowledgement that certain types of subject need more specialised handling. Where automated pattern recognition techniques are used, the results generally improve if the problem domain is more focussed and restricted to key areas; therefore OCR, facial recognition etc. generally work better than all-encompassing generic approaches. The issues that AI cataloguing techniques are encountering seem to have echoes of themes encountered with DAM software also: where the tools attempt to handle every conceivable task, they are unable to do it at sufficient depth for anything other than fairly simple requirements and their functional scope becomes too diluted and diffuse.
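As a rough illustration of what the ‘linguistic’ safeguard might amount to in practice (the tag list and function names below are my own assumptions, not Google’s implementation), tags known to be hazardous when people are present could simply never be applied without human sign-off:

```python
# Tags that must never be auto-applied when a person has been detected in the image.
SENSITIVE_WHEN_PEOPLE_PRESENT = {"gorilla", "ape", "monkey", "animal"}

def filter_suggestions(suggested_tags, person_detected):
    """Split suggested tags into auto-applicable and review-only lists."""
    auto, review = [], []
    for tag in suggested_tags:
        if person_detected and tag.lower() in SENSITIVE_WHEN_PEOPLE_PRESENT:
            review.append(tag)   # never published without a human decision
        else:
            auto.append(tag)
    return auto, review

# Example: a classifier suggests tags for a photo where a face was also detected.
auto, review = filter_suggestions(["portrait", "gorilla"], person_detected=True)
# auto -> ['portrait'], review -> ['gorilla']
```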
I suspect the developers of Google’s image recognition system have used statistical models where they estimated that approximately 80% of the suggestions would fall within an area that the majority of users would deem acceptable. For the remaining 20%, the results fall outside those thresholds, and a calculated risk has been taken that they will not be too serious and that a user feedback loop can be relied upon to help optimise the results and identify weak points. As is now apparent, while the volume of the mistakes is not that high, their impact certainly can be, and they do considerable damage to a user’s trust in any other suggestions the software has come up with. Yonatan’s comment above suggests this example is a ‘known problem’ that they were already aware of, but the implications were more serious than they anticipated.
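To make the feedback-loop idea concrete, a minimal sketch might look like the following; the data structures and tag names are hypothetical and not anything Google has described:

```python
from collections import defaultdict

feedback_log = []  # (tag, accepted) pairs accumulated from user interactions

def record_feedback(tag, accepted):
    """Log whether a user ultimately accepted or rejected a suggested tag."""
    feedback_log.append((tag, accepted))

def rejection_rates():
    """Return per-tag rejection rates so the worst-performing tags can be investigated."""
    totals, rejected = defaultdict(int), defaultdict(int)
    for tag, accepted in feedback_log:
        totals[tag] += 1
        if not accepted:
            rejected[tag] += 1
    return {tag: rejected[tag] / totals[tag] for tag in totals}

# Example: three suggestions, one of which a user rejected.
record_feedback("beach", True)
record_feedback("sunset", False)
record_feedback("beach", True)
# rejection_rates() -> {'beach': 0.0, 'sunset': 1.0}
```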
I have not seen many successful examples of these automated methods being applied to more specialist libraries when vendors have run demonstrations of them. This would include DAMs deployed for use by businesses or public sector organisations, where the assets are typically highly geared towards the cultural values, products and services of the sponsor. With that said, where they might have some potential benefit is in generating literal descriptions of visual assets like photos. As I have discussed in the past, this is often a findability issue with corporate DAMs. It arises when employees do the cataloguing and describe the material exclusively in terms of its business context, neglecting the more literal metadata that someone who lacked subject knowledge might depend on. In essence, this is similar to the colour-blindness problem that I described earlier, and automation could fulfil a subset of the role that an independent picture researcher, hired from outside the business, might carry out to optimise the metadata stored about each asset. There are some caveats to that, however, which I will describe at the end of this piece.
In my opinion, these solutions should not be let loose, unattended, on catalogues of assets, and all of the results should be checked by a human being. In addition, relying on just one person to verify this could be risky also, especially if most of the results appear to be reasonable and there are other pressing deadlines to attend to. One variation on the theme is an idea Google themselves borrowed a few years ago with their keywording game, where two people both enter keywords about an image and the matches are selected for use as descriptive tags. This is an old information science technique that was used a lot when most data had to be manually keyed in by operators: two different individuals entered each record and the inconsistencies between them were then analysed to locate inaccuracies. The method is not completely watertight, but the probability of achieving accuracy is quite high. In this case, one of the parties could be the automated tagging system and the other a human cataloguer. As should be apparent with these methods, the mistakes are arguably more useful than the successes, since they provide clues to help optimise the results and real cases that can be analysed with a view to implementing enhancements.
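A minimal sketch of that reconciliation step, with the automated tagger standing in for one of the two cataloguers (the function and tag names are hypothetical), might look like this:

```python
def reconcile_tags(machine_tags, human_tags):
    """Apply only the tags both parties agree on; keep the discrepancies for analysis."""
    machine = {t.lower().strip() for t in machine_tags}
    human = {t.lower().strip() for t in human_tags}
    agreed = machine & human     # intersection: applied as descriptive tags
    disputed = machine ^ human   # symmetric difference: flagged for review and analysis
    return agreed, disputed

agreed, disputed = reconcile_tags(
    ["Beach", "Sunset", "Gorilla"],        # automated suggestions
    ["beach", "sunset", "two friends"],    # human cataloguer's keywords
)
# agreed -> {'beach', 'sunset'}; disputed -> {'gorilla', 'two friends'}
```

As noted above, the disputed set is the more valuable output, since it is what reveals where the automated suggestions (or the human cataloguing) are going wrong.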
A point that has been made on DAM News many times before is that automated methods should be able to leverage existing cataloguing data to enhance their results. The context these technologies are being used in is also highly significant. Apart from the fact they mostly only work effectively for generalist repositories, the other consideration is the tasks that the library itself exists to fulfil. If a library of assets is only intended to hold personal photos and the volumes are relatively small (e.g. fewer than 1,000), then any working search facility is acceptable, as the users can manually filter the results by eye and usually ignore anything that appears random. On the other hand, if the library is a repository of key production assets that are in constant use or are to be employed for some important project, mistakes like these are not only a distraction, but they damage the value of the assets and may negatively affect the productivity of those who have to find them. Further, another area not yet discussed (but strongly hinted at in the Google article referred to) is the risk of litigation from subjects who have been incorrectly catalogued. As discussed, blaming the software is no defence, so this would come down to negligence on the part of the library operator and their suppliers (i.e. the vendor and any third-party tools they used). As well as publicly accessible repositories, this could become an HR issue for internal ones held by corporations also.
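To illustrate the first point in the paragraph above, here is a minimal sketch (with a hypothetical controlled vocabulary and synonym mapping of my own invention) of how automated suggestions could be aligned with a library’s existing classification data rather than generated in isolation:

```python
# Terms already approved in the library's metadata schema, plus a mapping from
# common free-text tags onto those approved terms. Both are illustrative.
CONTROLLED_VOCABULARY = {"coastline", "sunset", "staff portrait", "product shot"}
SYNONYMS = {"beach": "coastline", "shore": "coastline"}

def align_to_vocabulary(suggested_tags):
    """Keep only suggestions that resolve to an approved vocabulary term."""
    aligned, unmatched = [], []
    for tag in suggested_tags:
        term = SYNONYMS.get(tag.lower(), tag.lower())
        (aligned if term in CONTROLLED_VOCABULARY else unmatched).append(term)
    return aligned, unmatched

aligned, unmatched = align_to_vocabulary(["Beach", "sunset", "dog"])
# aligned -> ['coastline', 'sunset']; unmatched -> ['dog'] (left for a human to decide)
```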
There is a conflict of interest for anyone intending to use automated methods that has to be resolved to attain a benefit from it: the metadata cataloguing task could be carried out faster (and therefore at a lower cost), but the trade-off is lower-quality data, with all the other negative consequences that implies. If that same metadata is embedded into the assets, they could also leave the library with these mistakes contained within them, bound for destinations unknown. One approach might be to segment catalogues and enable more automation on the less important material, but some care is needed and an overhead will probably still be incurred manually correcting some of the worst cases.
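As a minimal sketch of that segmentation idea (with hypothetical segment names and a deliberately cautious default), the automation policy might simply depend on which part of the catalogue an asset belongs to:

```python
SEGMENT_POLICY = {
    "production": "review_required",   # key assets in constant use
    "campaign": "review_required",
    "personal": "auto_apply",          # lower-risk material
    "archive": "auto_apply",
}

def handle_suggestions(asset_segment, suggested_tags):
    """Apply tags directly or queue them for review, depending on the asset's segment."""
    policy = SEGMENT_POLICY.get(asset_segment, "review_required")  # default to the safe option
    if policy == "auto_apply":
        return {"applied": suggested_tags, "queued": []}
    return {"applied": [], "queued": suggested_tags}

# handle_suggestions("production", ["beach", "sunset"])
# -> {'applied': [], 'queued': ['beach', 'sunset']}
```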
Overall, I think these technologies are interesting and might eventually offer some benefits for real-world commercial Digital Asset Management use, but not just yet and still with the following caveats:
- They cannot be left unattended and need to be checked by human beings: even random sampling is still risky.
- The algorithm needs to be able to access existing classification rules and metadata schema to refine its results.
- For libraries with highly specific or technical subjects, they may offer only a limited benefit.
- The more specialised the focus of the automated techniques (e.g. facial recognition), the better the results; it may therefore be more suitable to apply them selectively rather than wholesale.
- While the productivity opportunity exists, there are legal and HR-related risks which are not straightforward to mitigate.
- They are currently more effective in a supporting role to help identify issues with human cataloguing rather than to replace a real person.