Document Preservation: Why Precision and Recall Matter
March 28, 2018
When one preserves and collects electronic data for litigation, one typically casts a broad net. This, in turn, can result in the preservation and collection of a significant volume of documents that are not relevant to the dispute at hand. To identify the documents most likely to be relevant within that broadly preserved and collected cache, lawyers tend to use search terms and keywords. But, as anyone who has engaged in that process knows, due to the range of language used in everyday communications, even the most targeted search terms yield results that are not relevant (i.e., “false hits”). So how can a practitioner best gauge the overall effectiveness of their document collection and review process?
Enter PRECISION and RECALL — the two metrics that best assess effectiveness.
So what exactly are precision and recall?
Precision measures what share of the documents retrieved are actually relevant. For example, a 75 percent precision rate means that 75 percent of the documents retrieved are relevant, while the other 25 percent have been misidentified as relevant.
Recall measures what share of the relevant documents in a collection have actually been found. For example, a 60 percent recall rate means that 60 percent of all relevant documents in a collection have been found, and 40 percent have been overlooked.
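For readers who like to see the arithmetic, the two definitions above can be sketched in a few lines of Python. This is only an illustration: the function name precision_recall and the document IDs are hypothetical, and the numbers are chosen to reproduce the 75 percent and 60 percent examples above.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_hits = retrieved & relevant  # documents that are both retrieved and relevant
    precision = len(true_hits) / len(retrieved) if retrieved else 0.0
    recall = len(true_hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection: 100 relevant documents (IDs 0-99).
relevant = set(range(100))
# Hypothetical search result: 80 documents, of which 60 are relevant
# and 20 (IDs 1000-1019) are false hits.
retrieved = set(range(60)) | set(range(1000, 1020))

precision, recall = precision_recall(retrieved, relevant)
print(f"precision = {precision:.0%}, recall = {recall:.0%}")
```

In practice the full set of relevant documents is not known in advance, so recall usually has to be estimated, for instance from a reviewed sample.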
It is relatively easy to achieve high recall with low precision if you collect broadly. The downside is that you will also retrieve a lot of irrelevant information, which in turn will increase the cost of review. Similarly, high precision with low recall is easy to achieve. By keeping your keyword searches few and narrow, you will likely retrieve mostly relevant documents, and review costs will be contained because little of what you collect will be irrelevant. Many relevant documents, however, will also be overlooked.
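The trade-off can be seen on a toy example. The sketch below runs a broad and a narrow keyword search over a small, entirely made-up set of documents whose relevance is assumed to be known; the documents, the search terms, and the helper names search and precision_recall are all hypothetical.

```python
# Made-up documents: ID -> (text, is_relevant). Relevance is assumed known here.
docs = {
    1: ("Notice of breach of contract by the supplier", True),
    2: ("Draft contract amendment on delivery dates", True),
    3: ("Supplier agreement termination letter", True),
    4: ("Late delivery complaint regarding the supplier", True),
    5: ("Gym membership contract renewal", False),
    6: ("Lunch meeting agreement on venue", False),
    7: ("Office party planning", False),
    8: ("Lease agreement for the parking lot", False),
}


def search(docs, terms):
    """Return IDs of documents whose text contains any of the search terms."""
    return {doc_id for doc_id, (text, _) in docs.items()
            if any(term in text.lower() for term in terms)}


def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two sets of document IDs."""
    true_hits = retrieved & relevant
    precision = len(true_hits) / len(retrieved) if retrieved else 0.0
    recall = len(true_hits) / len(relevant) if relevant else 0.0
    return precision, recall


relevant = {doc_id for doc_id, (_, is_relevant) in docs.items() if is_relevant}

# Broad terms catch every relevant document but also several false hits.
broad = precision_recall(search(docs, ["contract", "agreement", "supplier"]), relevant)
# A single narrow phrase returns only a relevant document but misses the rest.
narrow = precision_recall(search(docs, ["breach of contract"]), relevant)

print("broad  (precision, recall):", broad)
print("narrow (precision, recall):", narrow)
```

On this toy set the broad search finds all four relevant documents along with three false hits, while the narrow search returns a single relevant document and overlooks the other three.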
The ideal result is to achieve high recall with high precision. But identifying only the necessary information and little else is a difficult task. To maximize your chance of achieving high recall with high precision, consider using a combination of temporal limitations, search terms that are vetted with the individuals most familiar with the intricacies of the case and its underlying facts, and early analytics to assess the validity of the terms chosen.
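One possible form of such early analytics, offered only as a sketch, is a per-term hit report run against a small sample that has already been reviewed by hand. The sample documents, the date range, the terms, and the function name term_report below are all hypothetical; the point is simply to show a date limit and a term-by-term precision check working together.

```python
from datetime import date

# Hypothetical hand-reviewed sample: (document date, text, reviewer marked relevant?)
sample = [
    (date(2016, 3, 1), "Notice of breach of contract by the supplier", True),
    (date(2016, 5, 9), "Supplier invoice dispute over late delivery", True),
    (date(2016, 6, 2), "Gym membership contract renewal", False),
    (date(2016, 7, 15), "Supplier holiday party invitation", False),
    (date(2012, 1, 20), "Old supplier contract from a prior vendor", False),
]


def term_report(sample, terms, start, end):
    """For each term, return (hit count, share of hits marked relevant) within the date range."""
    # Apply the temporal limitation first.
    in_range = [(text.lower(), relevant) for doc_date, text, relevant in sample
                if start <= doc_date <= end]
    report = {}
    for term in terms:
        hits = [relevant for text, relevant in in_range if term in text]
        share_relevant = sum(hits) / len(hits) if hits else 0.0
        report[term] = (len(hits), share_relevant)
    return report


report = term_report(sample, ["contract", "supplier", "breach"],
                     start=date(2016, 1, 1), end=date(2016, 12, 31))
for term, (hit_count, share_relevant) in report.items():
    print(f"{term}: {hit_count} hits, {share_relevant:.0%} relevant")
```

A term that produces many hits but a low share of relevant ones is a candidate for narrowing or for discussion with the people who know the facts of the case best.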