This is the next in our series of articles on managing metadata cataloguing operations.
In part five, the subject was using outsourcing to assist with metadata cataloguing projects. In this final item on human-based methods for efficiently organising metadata operations, I will consider crowdsourcing. As the term itself suggests, this has a degree of similarity with outsourcing.
Crowdsourcing is another contentious subject among those actively involved in asset cataloguing, with both vocal supporters and critics. As with many other metadata cataloguing subjects, the context and participants have an impact on whether it will work for you or not. One of the main challenges is that, like everything else in the world of metadata, there are no silver bullets: what you gain in cost and speed you may lose in management overhead.
Although the terminology sounds modern, the theory behind crowdsourcing is not new and most people reading this article will understand the idea: a larger task is delegated to many individuals (‘the crowd’) so that the effort involved is spread across a larger group of people. The main benefit is the reduced time required, since the cataloguing activity occurs simultaneously rather than sequentially: for example, 100 people cataloguing 10 assets each in one day, as opposed to ten people taking 10 days to achieve the same result. A further perceived advantage is reduced cost (for similar reasons to outsourcing).
Applying Crowdsourcing Approaches To Metadata Cataloguing
There is a range of activities which might be suitable for crowdsourcing. Here is a non-exhaustive list:
- Transcription: tasks like transcribing the spoken narrative from an audio or video asset to create searchable text.
- Correction of automated metadata: crowdsourcing might be required where some automated method has been used to generate source metadata, but the results are not of sufficient quality. An example would be OCR (optical character recognition) where recognition has been imperfect and human beings are required to optimise the results.
- Translation: translating large quantities of text into a chosen target language. Sometimes this is combined with machine translation and becomes an editing/correction crowdsourcing operation instead, where human beings verify the automated version and adjust it.
- Linking and association: specifying links to associated objects or assets (possibly including external links to other resources). This is usually a more intellectually demanding activity. Wikipedia is an example, where multiple authors add supporting contextual information to describe subjects.
- Classification: this is a more common crowdsourcing activity for metadata cataloguing operations. Workers are requested to assign assets to a given series of classifications. Concepts like Controlled Vocabularies, where metadata is normalised into a series of discrete choices, are essential for this to be an effective method.
There are other options too, including hybrids where multiple tactics are employed. As will be described in the following section, effective crowdsourcing depends on the nature of the activity being very clearly defined and communicated to workers.
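As a small illustration of the classification option described above, the sketch below validates a worker's submission against a controlled vocabulary so that only normalised, discrete choices are stored. The vocabulary terms, asset identifiers and function names are illustrative assumptions, not from any real system.

```python
# Hypothetical controlled vocabulary for an image collection
CONTROLLED_VOCABULARY = {"portrait", "landscape", "product", "event", "diagram"}

def record_classification(asset_id: str, chosen_terms: list[str]) -> dict:
    """Store a worker's classification only if every term is drawn from the
    controlled vocabulary; otherwise raise so the task can be re-issued."""
    invalid = [term for term in chosen_terms if term not in CONTROLLED_VOCABULARY]
    if invalid:
        raise ValueError(f"Terms not in controlled vocabulary: {invalid}")
    # Normalise ordering and remove duplicates so identical selections
    # compare equal during later quality checks
    return {"asset": asset_id, "categories": sorted(set(chosen_terms))}

record = record_classification("IMG-0042", ["event", "portrait", "event"])
```

Because free-text entry is rejected outright, every accepted record is directly comparable with any other worker's answer for the same asset, which is what makes the consistency checks discussed later feasible.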
Crowdsourcing Delivery Methods
Crowdsourcing relies on the availability of sufficient numbers of crowd workers who have the necessary skill to complete the cataloguing task. The wider the range of those who can participate, the easier it is to source a critical mass of suppliers and the lower the cost. Rather than relying on the competence of the worker alone, tasks are usually rationalised and made as simple as possible to improve consistency and increase throughput (i.e. reduce the time required to finish each task so more can be completed).
A comparison might be with a multiple-choice exam: rather than having candidates write out their answers in narrative long form, they are asked to choose the most suitable of several options. One risk is that the options offered become too blunt to generate metadata that is useful to asset searchers; I will discuss the risks of crowdsourcing later in the article.
Each of the tasks in a crowdsourcing operation needs to be delivered to workers. There are several options:
- Pay per task
- User feedback
- Gamification
- Volunteers
As with other examples, this article cannot hope to cover every conceivable method on offer, so you may find others which are more suitable for you.
Pay per task
This is a variation on the outsourced model: crowd workers are remunerated, usually on a per-task basis, getting paid for each item they deliver. This approach is popular with offshore and outsourced cataloguing providers. One big issue with any method where hard cash changes hands is the quality of the delivery, but there are methods for mitigating those risks which I will discuss later.
User feedback

This is often already present as a feature in many DAM solutions. Where an asset is shown, users are given the opportunity to suggest amendments to the cataloguing, which the application owners can then decide whether or not to apply. The issue with this method is that motivation to participate might not be high enough to generate sufficient volume to make it worthwhile. A further problem is that many staff in larger organisations might be wary of making negative observations because of the political consequences, or of getting co-opted into assisting with the task while they have their own work to get done. This method can be a good one, but the user community needs to be highly motivated and genuinely passionate about the subject matter.
Gamification

This is a crowdsourcing method where the task is converted into a game format, with the idea of making the tasks less like work and more of an entertaining activity. The idea is not unreasonable; however, I have not seen many metadata cataloguing games that remain engaging to play over a long period of time.
One method tested by Google a few years ago to capture cataloguing data for their image search was a variation on an old information science technique: two people are asked to enter descriptive keywords simultaneously during a controlled period (e.g. five minutes), and when both players enter the same word, they receive a point. I have a professional interest in metadata cataloguing and, while it was entertaining to play this game for 20-30 minutes, beyond that it became dull and repetitive. In part this was because I was not required to continue with it, but this is the essential problem: at the point where the data obtained by the game becomes useful, it starts to feel like work, and the motivation problem gamification was supposed to solve presents itself again. This could possibly be used as a method to trial workers and find those suitable for crowdsourcing tasks, but I remain to be convinced that it has much potential beyond an interesting academic exercise.
One other issue with gamification is that the cost of implementing the game might be prohibitive. Unlike a crowdsourcing method where workers get paid (or participate because they care about the subject), the presentation needs to be sufficiently professional to persuade potential participants to start playing (and to carry on).
Volunteers

In many ways this is a hybrid of the user feedback and pay per task methods. It generally works better where the volunteers are emotionally invested in the subject matter. Since the options for motivating participation are more restricted, a higher volume of volunteers might be required to gain the critical mass needed to complete the cataloguing work. Volunteer crowdsourcing initiatives seem to be more practical where the repository is smaller: a task that would be onerous for a small number of dedicated personnel can, with a larger pool of workers, be spread more evenly to a level that is less unappealing than it might otherwise have been.
Crowdsourcing Brokers & Platforms
Most paid cataloguing crowdsourcing is delivered through some kind of commercial broker who connects you with crowd workers. There are two key types:
- Crowdsourcing services providers
- Crowdsourcing platforms
These are not mutually exclusive, but a services provider will typically offer the whole end-to-end service: they will usually organise the delivery using an existing pool of workers and provide project management, task design and most of the other elements. The model is similar to outsourcing and there are definite parallels between the two.
Since crowdsourcing is currently a fashionable term, I have encountered some suppliers who once described themselves using terms like ‘Offshore Business Process Outsourcing’ but who have now switched to ‘crowdsourcing’ instead. Since the tasks might be completed through some web-based front end, tacking on ‘Cloud’ or employing portmanteau buzzwords like ‘Cloudsourcing’ appears to be popular too.
Although the terminology might have changed, the delivery requirements are no different and still need to be properly supervised and monitored by you. In the previous outsourcing part of this series, I gave a breakdown of key points to consider; they all apply equally here, and in fact there are additional concerns, since you also have to review how the supplier plans to implement the task design in a way that is efficient and will produce the cataloguing results you are looking for.
Crowdsourcing platforms are more oriented towards the software and providing the infrastructure. They will usually pay the workers and collect money from the clients, but you will have more responsibility for designing the tasks. The best known option is Amazon’s Mechanical Turk, but there are a number of others now. The website crowdsourcing.org has a directory of websites and service providers. One point to take some care over is that some of these appear to have ceased trading, and I suspect many may have difficulty obtaining the critical mass of workers required to properly service requirements. Getting a recommendation from someone who has done this already is probably advisable.
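To illustrate the extra task-design responsibility that platforms leave with you, the sketch below assembles the parameters a designer must decide on for a single cataloguing micro-task. The field names loosely follow Mechanical Turk conventions, but all values, the function name and the question layout are illustrative assumptions rather than a real platform submission.

```python
# Platform-neutral sketch of the decisions involved in publishing one
# classification micro-task to a crowdsourcing platform. Every value here
# is an assumption for illustration only.

def build_task_definition(asset_url: str, vocabulary: list[str]) -> dict:
    """Return a description of one classification task for a single asset."""
    return {
        "Title": "Classify one image against a fixed category list",
        "Description": "Pick the single best category for the image shown.",
        "Reward": "0.05",                    # payment per completed task, in USD
        "MaxAssignments": 2,                 # two workers per asset (double-keying)
        "AssignmentDurationInSeconds": 300,  # time allowed per worker
        "LifetimeInSeconds": 86400,          # task expires after one day
        "Keywords": "image, categorisation, metadata",
        "Question": {
            "asset": asset_url,
            "choices": vocabulary,           # controlled vocabulary, not free text
        },
    }

task = build_task_definition("https://example.com/asset/IMG-0042.jpg",
                             ["portrait", "landscape", "product", "event"])
```

Even in this reduced form, the designer has to commit to a reward level, a redundancy factor and a time budget per task, which is exactly the work a services provider would otherwise take off your hands.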
Crowdsourcing Quality Control
In a conventional metadata cataloguing or keywording task, digital asset managers become familiar with the capabilities of the different people involved and their strengths, weaknesses etc. This is usually impractical for crowdsourcing because of the larger scale of the workforce. That can make quality control more demanding, since you become reliant on briefing materials alone and/or the service provider’s project manager to communicate your objectives.
There are some tactics for dealing with this at an operational or aggregate level. One was discussed earlier: an information science technique that dates from an era when a lot of data had to be keyed manually. Two or more data entry staff are assigned to enter the same information. If there is a discrepancy, the records are manually checked to see which is accurate (or whether neither is without issue). Manual checking can still be a lot of work, so a further possibility is that unmatched pairs are added back into the list of pending items, with a rule applied that ensures the same two cataloguers will not be re-selected.
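The double-keying tactic above can be sketched as a short routine. Here `answer_fn(worker, item)` stands in for a real task submission and collection step, and the round limit is an illustrative assumption; in practice each answer would come back from a crowd platform rather than a function call.

```python
import itertools

def double_key(items, workers, answer_fn, max_rounds=5):
    """Accept an item's metadata only when two different workers agree;
    re-queue discrepancies with a rule that the same pair of cataloguers
    is never re-selected for that item."""
    accepted = {}
    tried_pairs = {item: set() for item in items}
    pending = list(items)
    for _ in range(max_rounds):
        still_pending = []
        for item in pending:
            # Choose a pair of workers not previously used for this item
            candidates = [pair for pair in itertools.combinations(workers, 2)
                          if pair not in tried_pairs[item]]
            if not candidates:
                # All pairs exhausted: escalate to manual checking instead
                still_pending.append(item)
                continue
            pair = candidates[0]
            tried_pairs[item].add(pair)
            first, second = (answer_fn(w, item) for w in pair)
            if first == second:
                accepted[item] = first      # both answers agree: accept
            else:
                still_pending.append(item)  # discrepancy: re-queue
        pending = still_pending
        if not pending:
            break
    return accepted, pending  # matched results, plus items for manual review
```

With workers who agree, every item is accepted on the first round; an item with persistent disagreement ends up in the manual-review list once the available pairs are exhausted, which mirrors the escalation described above.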
Those methods can be effective where it is important to get precise and accurate cataloguing information, but many asset cataloguing tasks are subjective and open to interpretation, such that answers are not so much right or wrong as good or bad (and to varying degrees). A further option with crowdsourcing is to delegate the quality assurance to different workers. With some larger tasks, it is not uncommon to use the services of multiple crowdsourcing providers, and one QA tactic is to get each provider to review a mix of their own work and the other provider’s. This adds some checks and balances into the process, but keep in mind that every verification method you apply adds more labour into the process and increases your marginal costs as a result.
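The mixed-review tactic can be sketched as a simple sampling routine. The provider labels, the 10% sample rate and the fixed seed below are illustrative assumptions; the point is only that the reviewer receives a blind mixture rather than their own provider's work in isolation.

```python
import random

def build_review_batch(provider_a_items, provider_b_items,
                       sample_rate=0.1, seed=42):
    """Assemble a QA batch for a reviewer from provider A containing a
    blind mix of their own provider's completed work and provider B's."""
    rng = random.Random(seed)  # fixed seed so batches are reproducible
    k_a = max(1, int(len(provider_a_items) * sample_rate))
    k_b = max(1, int(len(provider_b_items) * sample_rate))
    batch = rng.sample(provider_a_items, k_a) + rng.sample(provider_b_items, k_b)
    rng.shuffle(batch)  # hide which provider produced each item
    return batch

batch = build_review_batch([f"A-{n}" for n in range(50)],
                           [f"B-{n}" for n in range(30)])
```

Note how the marginal-cost point from the paragraph above shows up directly: every increase in `sample_rate` adds paid review labour in proportion to the extra assurance gained.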
The cataloguing task interface can have a major impact on productivity, quality and cost. Just as with DAM systems, task interfaces need to be carefully designed and should be neither confusing nor over-simplistic. This is another context-dependent factor that it is not possible to generalise about, yet it will be significant in terms of the results obtained. Depending on who will do the work, a further issue is the literacy of those involved, especially if the cataloguing data is supposed to be recorded in one language but those capturing it are not native speakers. This does not rule them out, but the subject matter needs to be reviewed, and someone needs to risk-assess batches of assets and try to spot potential cultural issues in advance before a lot of time and money is wasted.
In considering quality control, I have only examined the conventional bulk cataloguing operations that might be the subject of a commercial digital asset management operation. However, there are a number of specialist research and academic tasks which can benefit from similar methods, but where the ‘workers’ are educated to post-graduate level and might not usually take on this type of activity. These can present their own set of challenges: such workers will be more eager to directly question or challenge the procedures and mechanisms employed, even though the quality of the result might ultimately be higher. The point to grasp is that you cannot fully divorce the nature of the supply from the characteristics of the task: irrespective of how many people you decide to involve, the two must be compatible with each other if you want to get decent results and avoid wasting a lot of time. Although crowdsourcing is all about micro-tasks, none of them should end up becoming exercises in micro-management.
Metadata Model & Project Management Implications
One potential benefit of crowdsourcing is the rigour it imposes on a metadata model used to catalogue assets (i.e. the range and types of fields used to hold metadata about assets). Since the task has to be reduced to its essential elements, the metadata model has to be stable and not subject to frequent modification. This benefit becomes a disadvantage, however, if the metadata model is not yet adequately defined.
Some in DAM recommend so-called ‘agile’ metadata models, where you defer final decisions about how to classify your assets until you have decided what is important. While that does allow time for user feedback and refinement, you cannot start a large-scale cataloguing operation using techniques like crowdsourcing while your metadata model is still in a state of flux, because large sections of the work delivered might end up being thrown away or having to be re-done (with tasks potentially re-designed from scratch).
This hints at another characteristic of crowdsourced metadata: it requires the digital asset manager and other related personnel to be highly organised and to have very clear ideas about what they want to achieve. In my experience they usually do have those traits, as you would not last long in this type of profession otherwise; however, if the manager’s time is divided between too many other activities, they might not be able to devote sufficient resources to this as well as their other work.
One of the perceived benefits of crowdsourcing is that you can just plug a faceless mass of workers into your DAM and they will deliver you fully catalogued asset metadata so you can avoid dealing with the hassle of organising it yourself. That can only happen if the end-to-end cataloguing process has been designed, tested and proven to be fit for purpose first. It is likely you will need to catalogue a lot of assets initially and (depending on the volumes) possibly employ freelance cataloguing assistance to make sure the process is suitable for the type of industrialised methods that cost-effective crowdsourcing requires.
Undertaking a crowdsourced metadata cataloguing exercise is effectively a sub-project in its own right. This means it needs the same degree of project management control that you might need to apply for other non-routine implementation challenges, such as software systems.
Crowdsourcing And Specialist Archives
The case for crowdsourcing seems easier to justify for archives where the assets are either generic or cover subjects that are easily understood by a large cross-section of the workers involved. This is the same point as described in the previous outsourcing article. Where the archive has some highly specialised material that requires subject experts, the quality of the result tends to reduce. It is true that this is also an issue when using professional keyworders, but where they might be willing to invest a proportion of their time in learning about their clients’ organisation, the same is less likely to be the case with crowdsourcing.
As described, not all crowdsourcing projects depend on teams of unknown and unskilled personnel sitting in front of their computers waiting to be assigned tasks; it is possible to use a larger pool of employees whose jobs might not normally involve using a DAM. If that strategy is used, however, there are a finite number of workers who can be drafted in, and they might also require incentives to encourage them to participate (depending on the subject matter and the culture of the organisation). Crowdsourcing metadata cataloguing for specialist archives is undoubtedly more complex, and obtaining the critical mass required to make the additional investment worthwhile is harder.
Crowdsourcing Opportunities & Risks
The opportunities and risks of crowdsourcing are described below.

Opportunities:
- Simultaneously utilise a large number of workers to significantly reduce the time required to catalogue assets
- Save money by being able to use cheaper cataloguing personnel (including those without prior experience)
- Rationalise metadata models to make them more efficient for future cataloguing work (whether crowdsourced or not)
- Develop quality control guidelines which can also be re-used.
Risks:

- Limitations with the metadata model can generate large volumes of useless data if issues have not been resolved first
- If the tasks are poorly designed, the time required to deliver them may be higher than it needs to be, or the quality of the metadata produced may be unsatisfactory.
- In normalising the range of metadata choices, cataloguing precision can be lost. This will reduce the quality of search results obtained and transfer the cost saved down the digital supply chain to asset users rather than asset cataloguers. It is usually more efficient to solve these problems further up the chain where they may happen just once rather than on an indefinite number of occasions.
- Managing very large numbers of crowd workers is more demanding and requires effective and efficient monitoring. Too much checking can waste a lot of management time and generate resentment from the crowd workers, while too little can result in quality objectives not being met. This job can be left with service providers, but they will charge a fee and need to be monitored themselves to make sure they are adhering to the quality standards you have specified.
- If quality control methods are not properly defined and effective, the resulting metadata might be below the required standard or there could be confusion about how to handle tasks that did not meet the specified criteria.
- If the subject matter is not generic or widely understood by all of the workers involved in completing tasks, either the tasks need to be modified to compensate for that (which might require tasks being split into multiple stages) or workers will need to be trained.
- If training is involved, workers might not bother to pay attention to it and the metadata may be unusable as a result.
The fact that there are more risks than opportunities listed above does not mean that I think crowdsourcing is a bad idea, but as implied, it is a leveraged investment (i.e. small changes can have very wide-ranging consequences). This means the rewards can be very high if everything goes to plan, but the risks and negative consequences can be equally dramatic if they are not precisely calculated beforehand and managed during the project. As with any high risk/reward proposition, you must be constantly vigilant to ensure that your calculations about the ROI you expect to generate remain accurate across the duration of the project. There is a lot of positive PR about crowdsourcing, some of which is justified, but not all. The point of this section is to make sure you embark on a crowdsourcing initiative fully conscious of the risks and with detailed plans prepared for how you will mitigate them.
Crowdsourcing can be an effective method for cataloguing large volumes of digital assets, but anyone who has the idea that it is a quick fix option to dealing with a cataloguing problem should revise their opinion. Any cataloguing endeavour that uses this method needs to be well planned beforehand (including carrying out prior tests to collect data) and very carefully managed during implementation. Those who intend to use these methods will either need to research them thoroughly in advance and/or possibly call upon the services of someone who has real-world experience in delivering crowdsourced cataloguing projects. The leverage effect described in the risks and opportunities section needs to be respected and properly understood by anyone who uses this technique to populate their DAM with catalogued digital assets.
In the next article in this series about building efficient cataloguing operations, I will examine automated methods for generating and reusing metadata.