The Same But Different – Understanding The Implications Of Duplicate Asset Detection Functionality In DAM Systems
Duplicate asset detection is one of those deceptively simple problems that illustrates why implementing a DAM initiative is rarely as straightforward as everyone involved hopes. It is an issue which crops up quite a lot when I am asked to help clients either review their existing DAM systems or choose new ones. In this article I will explain the basics of duplicate detection, the different variations DAM developers have added over the years and some tips for managing the complexity that often results.
Duplicate asset detection generally refers to the file or binary data rather than the metadata component of a digital asset, and the asset in question is usually an image such as a photo (although it can be required for other asset types as well). Solutions to duplicate detection appeared fairly early on in the development of DAM, in systems available around the mid-to-late 1990s. The approach used was typically a mathematical technique known as hashing. The maths is somewhat complex, but the layman's explanation is that a hash is like a 'fingerprint' for a file which is, for all practical purposes, unique to its contents. Even if the file date or the name gets changed, the duplicate detection will still be effective because the binary data itself has not been altered.
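As an illustration of the basic idea, the sketch below shows roughly how this kind of exact-match detection works. It is a minimal example only; the file names and the choice of SHA-256 are my own assumptions rather than anything a specific DAM vendor uses:

```python
import hashlib

def file_fingerprint(path: str) -> str:
    """Return a SHA-256 hash of the file's binary contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large assets do not have to fit in memory
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Two byte-for-byte identical files produce the same fingerprint,
# regardless of filename or file date.
if file_fingerprint("photo_original.jpg") == file_fingerprint("photo_copy.jpg"):
    print("Exact duplicate detected")
```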
There have been libraries of off-the-peg tools to help DAM developers implement this for decades. Most (but, somewhat surprisingly, still not all) DAM systems employ them to provide a basic form of duplicate detection. The issue with this technique is that it only detects exact duplicates, not close matches. For example, if someone were to change a single pixel, or crop a portion of the image away, the fingerprint would be different and the duplicate detection would not be triggered.
To work around the problem of near-duplicates, a number of solutions have been devised. One example is perceptual hashing, which will theoretically detect images that are similar, but not necessarily identical. There is a wide variety of different algorithms available to DAM developers, some more subtle than others (as well as further settings and options within each). To use the previous example, if an image had been cropped or one pixel changed, it would still probably be picked up as a duplicate.
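To give a flavour of how this differs from exact hashing, below is a minimal sketch of one of the simpler perceptual algorithms (an 'average hash'), assuming the Pillow imaging library is available. The hash size and distance threshold are illustrative assumptions on my part; production implementations typically use more sophisticated algorithms and tuning:

```python
from PIL import Image

def average_hash(path: str, hash_size: int = 8) -> list[int]:
    """Shrink to a tiny greyscale image, then record which pixels are above the mean."""
    img = Image.open(path).convert("L").resize((hash_size, hash_size))
    pixels = list(img.getdata())
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]

def hamming_distance(a: list[int], b: list[int]) -> int:
    """Number of bits that differ between two hashes."""
    return sum(x != y for x, y in zip(a, b))

# A small distance suggests the images are perceptually similar,
# even if a pixel was changed or the image was lightly cropped.
distance = hamming_distance(average_hash("shoe_front.jpg"),
                            average_hash("shoe_front_cropped.jpg"))
if distance <= 5:  # illustrative threshold
    print("Probable near-duplicate")
```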
As should be apparent by now, duplicate detection is not an entirely clear-cut problem. The issue is less with the technology than with the human beings involved, their interpretation of what 'duplicate' actually means to them and the context in which they are using their DAM solution. What makes it more complex is deciding exactly how sensitive the algorithms need to be for a given usage context. For example, if your business is strongly oriented towards a particular type of product (e.g. something like shoes or cars) then it is highly likely a lot of your photos will be very similar, and you may want the bar for what gets flagged as a duplicate set far higher than in another use-case scenario, such as travel imagery.
The fact that there are two photos of the same pair of trainers shot from slightly different angles is likely to be intentional because the marketing use-case is to allow prospective purchasers to get an experience which is as close to picking them up in a shop as possible. By contrast, you may not wish to store 25 snaps of a red London telephone box shot from multiple angles where only one is ever likely to be required by prospective asset users.
If your DAM supports both perceptual and exact techniques (and has configuration options to allow you to choose which is more suitable) it is possible to manage these problems. The snag many DAM users encounter is that their need for precision when it comes to duplicate detection also varies from one user to another. This is particularly the case where a single DAM holds both product imagery and more general marketing/brand assets.
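The kind of configuration flexibility I am describing might look something like the sketch below. This is purely illustrative (the collection names, settings structure and function names are my own assumptions, not any particular vendor's implementation); the point is simply that the matching technique and its sensitivity are chosen per context rather than set globally:

```python
from dataclasses import dataclass

@dataclass
class DuplicatePolicy:
    technique: str      # "exact" (cryptographic hash) or "perceptual"
    max_distance: int   # only used for perceptual matching; 0 means identical

# Hypothetical per-collection settings: product shots contain many near-identical
# images by design, so only exact duplicates are flagged; general brand imagery
# uses a looser perceptual match.
POLICIES = {
    "product_imagery": DuplicatePolicy(technique="exact", max_distance=0),
    "brand_library":   DuplicatePolicy(technique="perceptual", max_distance=8),
}

def is_duplicate(collection: str, perceptual_distance: int, exact_match: bool) -> bool:
    """Apply the collection's policy to the results of the two matching techniques."""
    policy = POLICIES[collection]
    if policy.technique == "exact":
        return exact_match
    return perceptual_distance <= policy.max_distance
```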
As I have described in other articles, context is a vitally important consideration when it comes to managing digital assets, and duplicate detection is yet another example of why this is the case. It also illustrates why DAM is a far more complex problem than either end users or vendors would wish. A further improvement would be the capability to define duplicate matching rules based on metadata, either of the asset or possibly even the user's membership of a particular permissions group. This kind of flexibility does exist in a very small number of DAMs, but they are very much the exception when compared with what the majority of vendors currently offer.
There are some other circumstances where users need to have two assets where the file is identical, but the metadata is markedly different. This often comes up where users have elected to make one asset available for a particular region or territory and included usage-rights metadata about it, then find they need to make it available for another region, but with a different licence and usage guidelines. The 'intentional duplicate' scenario happens more frequently than many DAM users expect (and, in my experience, regularly perplexes DAM vendors too). To an extent, if the rules about duplicate detection can be overridden by administrators then many of these special-case scenarios are solvable via manual means. This implies more work for the human software (i.e. Digital Asset Managers), however. It is yet another series of manual management activities that have to be undertaken by humans because the software still isn't smart enough to cope with the nuances of a typical enterprise DAM use-case.
I have seen some DAM vendors making reference to the capabilities of AI image recognition components they have integrated with their products for duplicate detection. From what I can tell of the way they have implemented them, there isn't any major advantage to using an AI component over less explicit methods like perceptual hashes and, in all likelihood, the former will use the latter internally anyway. As such, while AI might simplify the vendor's task because they can do more with a single component, it probably won't offer any substantial benefit to end users over what is already available. Where AI/ML capability would be more useful is in being able to infer the likelihood that a user will want to override duplicate detection (or change from a strict to a loose technique). Useful AI in DAM is less about industrial pixel number-crunching and more to do with analysing user behaviour and learning from it. To date, however, I haven't seen this in any DAM at all, and while vendors talk a lot about AI, I suspect that in reality they are wary of getting into the complexities of implementing anything that hasn't already been developed by someone else.
If you are looking to purchase a new DAM system, my main recommendation when it comes to duplicate detection is to find out what levels of flexibility a candidate DAM provides. Ignore promises about AI tech and assurances that it can do 'anything' and instead focus on some specific real-world scenarios to see how obstructive (or useful) the DAM will be with this issue. Ideally, the system will be advisory in nature: rather than blocking users from uploading duplicates (however strictly you want to apply that definition), it should make them aware that other, similar assets already exist so they can make their own decision. Duplicate detection is a very practical demonstration of the fine line that has to be walked when it comes to governance of your DAM solution.
A good overview of duplicate detection – thanks Ralph!
I’d like to add a couple of techniques, which come before ‘file hashing’ on the scale of increasing effectiveness (and complexity).
The first is simply to compare filenames, a form of metadata comparison (which you mentioned). Not particularly effective – files are copied and renamed all the time, and two different images often have the same filename – but this is probably the absolute minimal solution. You could just about get away with calling it “duplicate detection” in a sales situation – you’d feel embarrassed though.
Next up is simply to compare the byte size of the files. When files are fairly large (as are most images and videos) it’s very rare for two different files to contain exactly the same number of bytes. This will match two identical files with different filenames but obviously won’t spot resized or cropped versions. And you would occasionally get false positives.
For a first iteration of duplicate detection these have the advantage of using metadata (filename and size) that are already stored and available in the database.
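To make that concrete, here is a rough sketch (the table and column names are just assumptions on my part) of how those two checks could be run against metadata a DAM typically already stores:

```python
import sqlite3

# Minimal stand-in for metadata a DAM already stores about each asset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE assets (id INTEGER PRIMARY KEY, filename TEXT, size_bytes INTEGER)")
conn.execute("INSERT INTO assets (filename, size_bytes) VALUES ('IMG_0001.jpg', 4821344)")

def possible_duplicates(filename: str, size_bytes: int) -> list[tuple]:
    """Flag existing assets sharing the same filename OR the same byte size."""
    return conn.execute(
        "SELECT id, filename, size_bytes FROM assets WHERE filename = ? OR size_bytes = ?",
        (filename, size_bytes),
    ).fetchall()

# Anything returned is only a candidate duplicate: both checks can produce
# false positives, so it still needs hash-based or human confirmation.
print(possible_duplicates("IMG_0001.jpg", 4821344))
```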
However, they are not particularly credible in a fully-featured DAM solution these days – perceptual hashing isn’t hard and works well.
Thanks for your article and for your recommendation in the last paragraph.
I want to highlight that the technique for detecting duplicates and the policy for how to deal with them are distinct. Just as there are different detection mechanisms (each with its advantages and disadvantages), there are different policies suited to various use-cases.
The DAM system should allow the user to configure the policy for how detected duplicate assets are handled: delete the duplicate, associate/link the assets, or merge the duplicate assets into one, either automatically or manually with user interaction.
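A rough sketch of what that separation between detection and handling might look like (the names and structure are just my assumptions, not a reference to any actual product):

```python
from enum import Enum

class DuplicateAction(Enum):
    DELETE = "delete"   # discard the newly detected duplicate
    LINK = "link"       # keep both assets and associate them
    MERGE = "merge"     # combine metadata into a single asset

class DuplicateHandlingPolicy:
    def __init__(self, action: DuplicateAction, automatic: bool):
        self.action = action
        self.automatic = automatic  # apply silently, or ask the user first

    def handle(self, existing_id: str, incoming_id: str) -> str:
        if not self.automatic:
            return f"prompt user: {self.action.value} {incoming_id} against {existing_id}?"
        return f"auto-{self.action.value}: {incoming_id} -> {existing_id}"

# Detection (exact hash, perceptual hash, etc.) decides *whether* two assets
# match; the policy above decides *what happens next* - they are independent.
print(DuplicateHandlingPolicy(DuplicateAction.LINK, automatic=False).handle("A123", "A456"))
```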
Just for the record, Ralph. I’m repeating my LinkedIn comment here:
Important to note differences between "original" and copy, especially when considering archives as integral to the asset life cycle. I had an interesting situation where the DAM "copy" didn't have its metadata tags, something that occurs when the tags are stripped off the image in the process of duplication external to the DAM. So, Ralph, metadata was one of my routes to determining differences between duplicates. Later on, I looked at file properties more closely; but metadata was a red flag for me.