Whether you want to build a corpus to document an interesting linguistic phenomenon or simply want to have language data to train your AI models, the process of data annotation is crucial.
Language data annotation (or text data annotation) is the process of adding labels that carry relevant information or metadata to language data. Even though it might seem like a simple process, choosing the right labels for your data is not always easy, as ambiguous data is pretty common.
For this reason, data annotation guidelines are your best friend! Some people think of annotation guidelines as a document of rules that is written at the beginning of the process and must be followed when labeling data. In reality, developing annotation guidelines is an iterative process. In this blog post, I discuss the four key steps of that process.
The first step of the process is to define the purpose of data annotation. In other words, how will the data be used once it is annotated? This question will help you decide which phenomena should be annotated and how deep the annotations should go. For example, should a determiner simply be labeled as "determiner", or should you also specify its type (e.g., "article", "possessive")?
It's important to have this initial discussion with the whole team of linguists that will be annotating the data later on.
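To make the determiner example above concrete, here is a minimal sketch of what two annotation depths could look like for the same sentence. The sentence and the tag names are hypothetical illustrations, not taken from any standard tagset:

```python
# A minimal sketch of two possible annotation depths for the same sentence.
# The tag names below are hypothetical, not taken from any standard tagset.

# Coarse-grained: every determiner gets the same label.
coarse = [("my", "DET"), ("dog", "NOUN"), ("chased", "VERB"),
          ("the", "DET"), ("ball", "NOUN")]

# Fine-grained: the determiner label also encodes its type.
fine = [("my", "DET:possessive"), ("dog", "NOUN"), ("chased", "VERB"),
        ("the", "DET:article"), ("ball", "NOUN")]

# The purpose defined in this first step decides which depth is worth the extra effort.
```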
Once the purpose of the annotation process has been defined, it's time to do an initial round of annotations. In this first round, linguists should not work as a team but rather individually. Each person should annotate the data and take note of any doubts or questions that arise, providing examples whenever possible.
After the initial round of annotations, the team should meet again to compare how each person annotated the data and to discuss all the issues that came up. This discussion will probably be long and intense, but it is key before moving on to step 3.
Now, it's time to work on a guidelines document! This document will be the first version of the guidelines. It should include all the basic rules for data annotation agreed on in the meeting described in step 2. It is also super important that the guidelines include plenty of examples to illustrate each rule, since future rounds of annotation will draw from this document.
Now that you have some basic annotation guidelines, the team should go back to working individually. Each person should annotate the data following the guidelines in the document, and once this round is over, a quantitative measure of inter-annotator agreement should be calculated to determine how reliable the annotations are. There are many metrics for this, some simpler than others (e.g., percent agreement).
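As a minimal illustration, this is roughly what computing percent agreement between two annotators could look like; the label sequences below are hypothetical examples:

```python
# Sketch: percent agreement between two annotators over the same items.
# The label sequences below are hypothetical examples.

def percent_agreement(labels_a, labels_b):
    """Fraction of items on which both annotators chose the same label."""
    assert len(labels_a) == len(labels_b), "annotators must label the same items"
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

annotator_1 = ["DET:article", "NOUN", "VERB", "DET:possessive", "NOUN"]
annotator_2 = ["DET:article", "NOUN", "VERB", "DET:article", "NOUN"]

print(percent_agreement(annotator_1, annotator_2))  # 0.8
```

Percent agreement is easy to compute, but it does not correct for agreement expected by chance; metrics such as Cohen's kappa take that into account at the cost of being a bit more involved.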
Cases of disagreement should then be discussed as a group to determine whether the annotator made a mistake or the guidelines were unclear about how to annotate a particular phenomenon. If the latter, the guidelines should be revised and updated accordingly.
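To prepare that discussion, it can help to pull out exactly the items the annotators disagreed on, for instance with a small helper like this hypothetical one (tokens and labels are again made-up examples):

```python
# Sketch: collect the items two annotators labeled differently, so the team
# can walk through each disagreement case during the meeting.

def disagreements(items, labels_a, labels_b):
    """Return (item, label_a, label_b) tuples where the two annotators differ."""
    return [(item, a, b)
            for item, a, b in zip(items, labels_a, labels_b)
            if a != b]

tokens      = ["my", "dog", "chased", "the", "ball"]
annotator_1 = ["DET:possessive", "NOUN", "VERB", "DET:article", "NOUN"]
annotator_2 = ["DET:article", "NOUN", "VERB", "DET:article", "NOUN"]

for token, a, b in disagreements(tokens, annotator_1, annotator_2):
    print(f"{token!r}: annotator 1 chose {a}, annotator 2 chose {b}")
```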
This last step should be repeated as many times as needed, since iteration is crucial for developing a set of reliable data annotation guidelines. As the guidelines develop and improve, annotators also become more experienced at dealing with ambiguities.