Clarifai vs Google Vision: Two Visual Recognition APIs Compared
As regular readers will be aware, I have been following various automated visual recognition tools for some time now and so far remain underwhelmed by the results (especially compared with the marketing hype that typically accompanies them). A few months back, I checked out the Google Vision API with some test images and the results were partially successful, but not to the extent that the technology could be trusted to take on keywording and tagging tasks without close supervision by a human being.
Clarifai is a competitor product to Google Vision which a number of vendors have been touting for use in DAM implementations as a method to avoid manually cataloguing assets. I decided to perform the same tests as used with the Google Vision system on Clarifai to see how it compares.
One advantage the Clarifai API has over Google Vision is that you don’t need to convert asset files into Base64-encoded text; you can just pass it the URL of an image. While the Base64 conversion isn’t a big deal in itself, if you have thousands of assets to process it could become more time-consuming, and it is another potential point of failure that has to be monitored.
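To put the difference in concrete terms, the sketch below shows the conversion step that sending raw image bytes entails. The function and the example URL are illustrative only; the point is that with Clarifai this step can be skipped entirely because the input can be a plain URL string.

```python
import base64

def image_to_base64(path):
    """Read an image file and return it as Base64 text, the extra
    step needed when an API expects image bytes embedded in JSON."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

# With Clarifai, by contrast, the input can simply be a URL string
# (placeholder address for illustration):
clarifai_input = "http://example.com/assets/blackfriars-bridge.jpg"
```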
For the test, I used the same set of images as for the Google Vision exercise. All the samples are scenes of London shot by my colleague and were taken using a commodity digital camera. The results are shown below.
Clarifai tags: bridge, water, travel, no person, architecture, river, outdoors, city, sky, transportation system, connection, traffic, building, urban, road, landmark, landscape, vehicle, street, construction
Google Vision tags: bridge, landform, river, vehicle, ocean liner, girder bridge, watercraft
The Clarifai tags are better than Google Vision’s and, as generic keywords go, they are mostly reasonable. Although the system states ‘no person’, there are actually two people visible towards the middle of the bridge (this is easier to see in the full-size image), but their faces are not identifiable as the shot is taken from behind them. I would be inclined to be lenient over this mis-classification, but if I needed an image that definitely had no people in it at all, it would be inadvisable to trust Clarifai to get this right by itself.
One point to note about Clarifai is that it uses less specific descriptions that are more easily attributable to a wider sample of images (where Google Vision tries to be more precise – with mixed success, it must be said). This is a point that occurs across the test results. For example, ‘urban’, ‘construction’ and ‘connection’ are reasonable, but don’t really add much in terms of enhancing findability.
For reference, in the article I wrote about findability for DAM News three years ago, this was the caption I provided for the above image which was used in that item also:
Blackfriars Road Bridge over the River Thames, painted red and white. Constructed in the Victorian era (1864) and designed by Joseph Cubitt. Shot from the North Bank (London, EC4) on a sunny, spring day
The obvious issue with my example is that writing it took far longer than Clarifai or Google Vision took to return keywords, although I believe mine would be more likely to produce a relevant search result (especially for more specific queries). This matters more in corporate DAM scenarios than it does for stock media.
Clarifai: building, architecture, no person, administration, travel, city, outdoors, house, sky, home, urban, street, tourism, tree, town, museum, facade, university, daylight, park
Google Vision: Blackfriars, transport, architecture, downtown, house, facade, MARY
The next photo starts to reveal some more of the problems. On the positive side, ‘building’, ‘no person’, ‘outdoors’, ‘tree’ and ‘town’ are all reasonable. However, the fact that the shot includes a red double-decker bus is missed entirely, and the ‘transport’ theme is also not recognised (which Google Vision has picked up). Clarifai appears to lack an OCR capability, unlike Google Vision, which has read the larger text on the side of the bus. It has recognised the presence of trees in the shot, but nothing specific about the location (i.e. Blackfriars in London on the north side of the Thames within the EC4 postcode). The building in question isn’t a university, nor a museum (and never has been). Again, there are quite a few keywords which are adequate but don’t really add much in terms of findability; if a human being had catalogued them, you might wonder if they were being paid by the word for this assignment. Overall, I think Google Vision edges this one over Clarifai, but it seems like both tools hold different pieces of the puzzle (along with a few other useless ones that should be part of a different jigsaw entirely).
Clarifai: no person, mist, nature, snow, water, fog, winter, outdoors, smoke, water, tree, dawn, art, cold, fall, travel, landscape, one, ice, wood
Google Vision: tree, plant, atmospheric phenomenon, winter, landscape, frost, wall
The Trafalgar Square fountain shot was the undoing of Google Vision in my test and apart from the correct identification of the water theme, Clarifai isn’t a great deal better. As with Google Vision, the original image was rotated and I re-tested the results with the correct orientation, but the keywords offered were the same. Apart from ‘no person’, ‘water’ and ‘outdoors’, the suggestions are fairly useless for finding the image with nothing about fountains appearing.
The generic terms feature once again with Clarifai and I get the impression that when the probability score of a match falls below a given threshold, it falls back on a carefully selected list of more conceptual keywords which are hard to argue against but don’t serve any significant descriptive purpose either; for example, terms like ‘landscape’, ‘one’ or ‘art’ could be used with practically anything. Clarifai offers probability scores for matches and the one for ‘wood’ was about 83%. That seems like a very high ranking for such a poor match to me; I believe most people would say this suggestion was simply incorrect. This suggests that the default behaviour of Clarifai is to generate keywords no matter what, even if a few are unsuitable.
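Since Clarifai does return a probability alongside each tag, one practical safeguard is to apply your own cut-off rather than accepting every suggestion. The sketch below is a minimal illustration, assuming the tags have already been extracted from the API response into keyword/probability pairs (the response structure itself is not reproduced here); note that even a fairly strict 90% threshold would not have caught ‘wood’ if its score had been a little higher.

```python
def filter_tags(tags, threshold=0.9):
    """Keep only keywords whose probability meets the threshold.
    `tags` is a list of (keyword, probability) pairs, as might be
    assembled from an API response."""
    return [kw for kw, prob in tags if prob >= threshold]

# Example using the 'wood' score mentioned above (~0.83):
suggestions = [("no person", 0.99), ("water", 0.97), ("wood", 0.83)]
print(filter_tags(suggestions, threshold=0.9))  # ['no person', 'water']
```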
It is noteworthy that both tools failed with this image and it does suggest that anything with subjects that are more diffuse in composition or lacking in contrast will cause problems for many of these systems. This is a point worth keeping in mind if you plan to use them.
Clarifai: glass, architecture, business, office, building, window, sky, skyscraper, no person, city, modern, facade, downtown, tallest, futuristic, finance, contemporary, urban, reflection, expression
Google Vision: blue, reflection, tower block, architecture, skyscraper, glass, building, office
In the Google Vision test, this photo produced better results than the preceding one, no doubt because the definition of the image is sharper, with lines that make visual recognition an easier task. Many of the terms suggested by Clarifai are also credible as keywords and both tools have done a reasonable job. There are still some unexpected omissions and strange inclusions, however. For example, ‘tallest’ is present but not ‘tall’, so an exact-match search for something like ‘tall building’ would not find any results.
Some of the conceptual suggestions, like ‘finance’, add weight to the theory that Clarifai is optimised towards stock photography use-cases rather than DAM. The name of the building above is ‘The Palestra’ and I gather it is currently occupied by Transport for London as well as some other tenants. There is no finance connection that I know of; I believe it has always been occupied by public sector organisations of one kind or another. Apart from the fact that many financial institutions have offices in tall glass-fronted skyscrapers, finance is not a particularly relevant keyword.
This has some implications for corporate DAM users who don’t just need any image, but who may require a very specific asset. For example, if you had a project that required photos showing the offices that Transport for London operates from, the keywords above would not help you find what you were looking for. It did occur to me that the test photo was too abstract to help Clarifai get enough cues, so I tried this one which shows a street-level shot of the whole of the front of the building from over the road, taken from here: http://www.buildington.co.uk/buildings/london_se1/197_blackfriars_road/palestra/id/2922 The keywords offered are as follows:
architecture, building, modern, city, business, urban, office, glass, sky, skyscraper, facade, no person, construction, expression, downtown, travel, window, contemporary, outdoors, cityscape
These still aren’t fully satisfactory and are more or less the same suggestions as for my test image. Like Google Vision, Clarifai also uses culturally specific terms like ‘downtown’, which isn’t conventional in the UK. I would have some concerns about the extent to which this is widespread and whether localisation issues might be encountered as a result.
Heading back over the river to Admiralty Arch yields these results:
Clarifai: architecture, building, travel, city, no person, tourism, sky, old, monument, landmark, sculpture, outdoors, ancient, art, urban, town, tourist, facade, house, castle
Google Vision: Admiralty Arch, architecture, tourism, building, plaza, cathedral, arch, palace, facade, basilica
Clarifai has not offered a location (in contrast to Google Vision) and in the previous test, the OCR module of Google Vision made an attempt at the Latin inscription as well (not shown above). Once again, it’s a mixed bag of vague concepts, some of which are reasonable (‘tourism’, ‘landmark’, ‘monument’, ‘tourist’) but others that seem fairly sketchy. I note the similarity of many of the keywords offered by Clarifai across nearly every single image supplied in this test. For example, ‘architecture’ features a lot, as does ‘art’. I suspect that with asset collections catalogued exclusively via Clarifai, users will get either an excessive number of results if they use generic terms, or nothing at all where specific keywords are supplied as search criteria. While Google Vision has made some relatively poor suggestions (e.g. ‘plaza’, ‘basilica’ and ‘cathedral’), you are likely to get fewer results. Where the algorithm has worked properly, its tags are more precise and accurate descriptions of the image subjects, the ‘Admiralty Arch’ tag being a case in point.
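The similarity of Clarifai’s keywords across images is easy to quantify from the tag lists reported above. The sketch below intersects the Clarifai results for four quite different subjects (the bridge, the Blackfriars Road building, the skyscraper and Admiralty Arch) and shows how many tags they all share, which is exactly the sort of overlap that inflates result counts for generic searches.

```python
# Clarifai tag lists for four of the test images, as reported above.
bridge = {"bridge", "water", "travel", "no person", "architecture", "river",
          "outdoors", "city", "sky", "transportation system", "connection",
          "traffic", "building", "urban", "road", "landmark", "landscape",
          "vehicle", "street", "construction"}
building = {"building", "architecture", "no person", "administration",
            "travel", "city", "outdoors", "house", "sky", "home", "urban",
            "street", "tourism", "tree", "town", "museum", "facade",
            "university", "daylight", "park"}
skyscraper = {"glass", "architecture", "business", "office", "building",
              "window", "sky", "skyscraper", "no person", "city", "modern",
              "facade", "downtown", "tallest", "futuristic", "finance",
              "contemporary", "urban", "reflection", "expression"}
arch = {"architecture", "building", "travel", "city", "no person", "tourism",
        "sky", "old", "monument", "landmark", "sculpture", "outdoors",
        "ancient", "art", "urban", "town", "tourist", "facade", "house",
        "castle"}

# Tags every one of these four very different images received:
common = sorted(bridge & building & skyscraper & arch)
print(common)
# ['architecture', 'building', 'city', 'no person', 'sky', 'urban']
```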
Clarifai: no person, sculpture, architecture, travel, statue, sky, city, outdoors, building, monument, art, castle, administration, people, landmark, tourism, museum, sightseeing, home, landscape
Google Vision: Trafalgar Square, Big Ben, from Trafalgar Square, blue, monument, sea, statue
This is more of the same. Clarifai makes some general, non-specific concept-oriented suggestions, a few of which are reasonable but only partially useful (‘sculpture’, ‘statue’, ‘monument’, ‘landmark’, ‘sightseeing’, ‘tourism’) and some which are weird or just wrong (‘home’, ‘castle’). Google Vision gets a point for being accurate about the location, but then loses it for proposing ‘sea’ and ‘Big Ben’. Neither product is capable of recognising the very large stone lion dominating the foreground of the image. Every single person I have shown this photo to comes up with ‘lion’ as their first keyword, but the AI systems which are supposed to replace them apparently do not see it.
Clarifai: no person, sculpture, sky, outdoors, statue, travel, architecture, city, classic, daylight, monument, marble, one, art, blue sky, museum, bird, religion, sightseeing, eagle
Google Vision: Trafalgar Square, statue, sky, sculpture, monument, flight, gargoyle
Again, there are some reasonable but non-specific keywords that could describe numerous other images, nothing about the location, and a couple of completely random suggestions like ‘bird’ and ‘eagle’. ‘Religion’ is somewhat inappropriate also, given that the photo depicts a monument to an admiral of the fleet killed in battle, but Clarifai is at least not seeing gargoyles, unlike Google Vision (‘seeing gargoyles’ seems at some danger of becoming a euphemistic term in its own right).
Clarifai: clock, time, no person, old, travel, sky, daylight, building, outdoors, city, vintage, architecture, tower, watch, tourism, analogue, dawn, wood, urban, sight
Google Vision: Big Ben, vehicle, jet aircraft
This image should be the easiest one of the lot and Clarifai has done a reasonable job. The ‘clock’, ‘tower’ and ‘vintage’ keywords are good, but ‘sight’ and ‘wood’ (once again) are suggestions of dubious merit. Unlike Google Vision, it hasn’t picked up that it’s Big Ben, although no phantom jet aircraft nor vehicles were detected on this occasion by Clarifai.
Some themes emerge from this test. Clarifai appears to offer superior keywords to Google Vision; however, they tend to be vaguer and of lower value from the descriptive perspective that is important for generating relevant search results. This is the trade-off: with Google Vision you get some more precise results, but accompanying those are other suggestions which are wildly inaccurate; Clarifai will propose keywords that are more reasonable, but they lack focus, especially any unique or defining characteristics of the image subject.
As discussed in the results review, Clarifai gives the impression of being heavily oriented towards stock photography use-cases. In these scenarios, the operators of photo libraries want to always generate results because there is a higher likelihood of users buying something, even if it is not exactly what they wanted. Although a lot of corporate DAM solutions borrow heavily from the interaction models already established by stock media libraries, the context they are used in is quite different. Asset substitution (i.e. using something else) is frequently not a feasible proposition. For example, if you require a picture of your female CEO outside a particular regional office, getting random shots of different women with buildings of all kinds in the background isn’t sufficient, swapping one for another won’t answer the brief. As such, you might find yourself manually sifting through hundreds of suggestions (with the distinct possibility that none will be suitable). This could quickly become very frustrating and negatively impact your productivity in a way that DAM solutions are supposed to help you avoid, especially if you had previously bought into the claims being made by some DAM vendors that these technologies do away with the need to catalogue assets using relevant keywords and descriptions.
The two products discussed (Clarifai especially) might make a cheap replacement for crowdsourcing and offshore cataloguing services, but that method is usually only suitable for very general digital asset repositories where common subjects like food, clothing etc are the major themes. I believe that what is on offer here is not really Artificial Intelligence, but more like some statistical confidence tricks designed to convince a casual observer that there is a conscious entity behind the suggestions. As with most magic tricks, once you get to inspect the props more closely and find the secret compartments etc, the illusion quickly fades and the act starts to become a lot less impressive than when you first saw it.
Two points in Clarifai’s favour are the presence of a human feedback loop (a feature I wanted to see in Google Vision) and their use of domain-specific modules. Google Vision has these too, but they cross multiple topics and are oriented towards recognition tasks as opposed to themes (e.g. OCR or facial recognition). Clarifai has modules for Travel, Weddings, “Not Safe For Work” and Food. This tacitly admits that the technology isn’t as sophisticated as billed, since you have to decide whether or not to use one of the optimised algorithms, which means some human-supplied critical judgement is required. If you are an AI purist, you might not like this method, but I would prefer that the firms responsible for these products acknowledge that they work best when the automation is blended with some human expertise, as the results seem to improve dramatically when they do.
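The human judgement involved can be as simple as routing each asset to an appropriate domain module. The sketch below is hypothetical throughout: the model identifiers and domain names are placeholders, not Clarifai’s actual model IDs, and the mapping itself is the decision a person has to supply.

```python
# Hypothetical routing of assets to domain-specific recognition models.
# The identifiers below are placeholders mirroring the modules mentioned
# above (Travel, Weddings, NSFW, Food); the mapping is the human judgement.
DOMAIN_MODELS = {
    "travel": "travel-model",
    "wedding": "wedding-model",
    "moderation": "nsfw-model",
    "food": "food-model",
}

def choose_model(asset_domain, default="general-model"):
    """Pick a recognition model based on a human-assigned asset domain."""
    return DOMAIN_MODELS.get(asset_domain, default)

print(choose_model("food"))       # food-model
print(choose_model("corporate"))  # falls back to general-model
```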
Based on these tests, it appears you need at least two different technologies plus some conventional controlled vocabularies – so another bag of tricks, as is usually the case with DAM implementations. Even with all that at your disposal, it is still not advisable to leave the decision about whether to use any of the suggestions entirely to the algorithms, because they aren’t as sophisticated as you might initially think and a real person needs to retain conscious responsibility for all the metadata that is applied.
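One way to combine the pieces is to constrain machine suggestions to a controlled vocabulary and route everything else to a person. The sketch below assumes the vocabulary is available as a simple set of terms; both the vocabulary and tag examples are illustrative, and even the ‘accepted’ list is only a candidate queue pending human sign-off.

```python
def triage_suggestions(machine_tags, controlled_vocab):
    """Split machine-suggested keywords into those found in the
    controlled vocabulary (candidates for acceptance, pending human
    review) and the remainder (to discard or inspect manually)."""
    accepted = sorted(set(machine_tags) & controlled_vocab)
    review = sorted(set(machine_tags) - controlled_vocab)
    return accepted, review

vocab = {"bridge", "river", "London", "clock tower"}
tags = ["bridge", "travel", "connection", "river"]
print(triage_suggestions(tags, vocab))
# (['bridge', 'river'], ['connection', 'travel'])
```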
I don’t believe these systems are yet ready for production use unless you have millions of images to deal with and operate in some highly conventional markets (like food, travel etc). Corporate DAM users should think carefully and fully analyse the brand risks (amongst others) before allowing vendors to persuade them to be used as guinea pigs for their R&D. As described in other articles about this topic on DAM News, there is some potential for combining the methods, but predominantly as a metadata advisory tool to help offer suggestions which a human being might not have considered otherwise.
I plan to test a few more automated cataloguing tools to see how they compare with Clarifai and Google Vision. I will report back on DAM News as and when I have test results to share. I would also be interested to hear feedback from DAM users (as opposed to vendors) about their experiences with AI, especially anyone who has been using these technologies for at least 12 months as they have been in active use by some for a reasonable period of time now.