What is De-Duplication and How Do I Do It?
January 04, 2017
When collecting electronically stored information (“ESI”) from multiple custodians (i.e., various individuals or different sources), the collection will necessarily include duplicate documents. In a company-wide e-mail chain, for example, a message is sent to multiple recipients and stored within each recipient’s mailbox. Consider the following: I send an e-mail to two colleagues; a copy of the very same e-mail now exists in each colleague’s mailbox. Depending on the company’s data retention policies, copies of that same e-mail file may also reside on the employees’ hard drives, the company’s file server, and/or the company’s backup system.
For the attorney tasked with identifying, collecting and reviewing ESI in response to actual or threatened litigation, an exhaustive review of a “document set” that is replete with duplicates threatens the cost effectiveness and efficiency of the project. These concerns intensify during document review, where every duplicate adds to the overall review time. Duplicates also pose the added risk that the review team will make inconsistent privilege and responsiveness decisions on identical documents. As a result, it is wise to de-duplicate one’s collection of ESI at the processing stage.
Practitioners should therefore consult with their vendors and educate themselves about the various de-duplication technologies available. When used effectively, de-duplication reduces the number of documents to be reviewed by, on average, 30 to 40 percent.
To determine whether two files are identical, each file’s binary stream is run through a hash algorithm to produce a digest value. The most common algorithms used for this purpose are MD5, SHA-1 and SHA-256. The digest value is then treated as the file’s fingerprint. The resulting fingerprints are compared against one another, and files with matching fingerprints are treated as exact duplicates.
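As a rough illustration of the fingerprinting step, the following Python sketch hashes each file’s bytes and groups files that share a digest. It is not any vendor’s actual implementation: the function names (file_digest, find_exact_duplicates) and the sample file names are hypothetical, and the files are represented as in-memory byte strings for simplicity.

```python
import hashlib


def file_digest(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hex digest (the "fingerprint") of a file's raw bytes."""
    return hashlib.new(algorithm, data).hexdigest()


def find_exact_duplicates(files: dict[str, bytes],
                          algorithm: str = "sha256") -> dict[str, list[str]]:
    """Group file names by digest and keep only groups with more than one file."""
    groups: dict[str, list[str]] = {}
    for name, data in files.items():
        groups.setdefault(file_digest(data, algorithm), []).append(name)
    return {digest: names for digest, names in groups.items() if len(names) > 1}


# Hypothetical sample data: two byte-identical e-mails and one different file.
sample_files = {
    "mailbox_a/update.msg": b"Quarterly update attached.",
    "mailbox_b/update.msg": b"Quarterly update attached.",
    "server/agenda.msg": b"Agenda for Monday.",
}
duplicates = find_exact_duplicates(sample_files)
```

Because the comparison works on the binary stream, two files that look the same to a reader but differ by even one byte produce different fingerprints and are not treated as exact duplicates.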
What are my de-duplication options and the effect of each?
- No de-duplication: All documents, duplicates included, are provided for attorney review. This results in the largest number of documents for review and is strongly discouraged for cases involving voluminous amounts of data. Choosing not to de-duplicate also increases the likelihood of inconsistent coding among identical documents.
- Global or horizontal de-duplication: As each file is uploaded, it is compared to the entire data set for the project. Typically, you rank the custodians (e.g., Senior VP, VP, Junior Analysts, Admin). Then, only the first instance of each unique document is provided for review and categorization, resulting in the smallest number of documents for review. For example, if both the Senior VP and the Junior Analyst have the same document, the copy that resides in the Senior VP’s files will be the one reviewed.
- Per custodian or vertical de-duplication: Each file is uploaded and compared only to a limited set of documents, such as those from the same custodian, time period, or other data slice. Only the first instance of each unique document per custodian or data slice is provided for review. However, the same document may exist in the files of other custodians or in other data slices, and each of those copies may then be provided for independent review. This type of de-duplication is particularly useful when processing multiple sources for the same custodians over time.
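The difference between the global and per-custodian options can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular processing platform works: the Doc type, the function names, and the sample documents are all hypothetical, and SHA-256 is used as the fingerprint.

```python
import hashlib
from typing import NamedTuple


class Doc(NamedTuple):
    custodian: str
    name: str
    content: bytes


def digest(doc: Doc) -> str:
    return hashlib.sha256(doc.content).hexdigest()


def dedupe_global(docs: list[Doc], custodian_rank: list[str]) -> list[Doc]:
    """Keep one copy of each unique document across the whole data set,
    preferring the copy held by the highest-ranked custodian."""
    rank = {custodian: i for i, custodian in enumerate(custodian_rank)}
    ordered = sorted(docs, key=lambda d: rank[d.custodian])  # stable sort
    seen: set[str] = set()
    kept: list[Doc] = []
    for doc in ordered:
        fingerprint = digest(doc)
        if fingerprint not in seen:
            seen.add(fingerprint)
            kept.append(doc)
    return kept


def dedupe_per_custodian(docs: list[Doc]) -> list[Doc]:
    """Keep one copy of each unique document within each custodian's files."""
    seen: set[tuple[str, str]] = set()
    kept: list[Doc] = []
    for doc in docs:
        key = (doc.custodian, digest(doc))
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept


# Hypothetical collection: the same memo held three times by two custodians.
docs = [
    Doc("Junior Analyst", "memo.docx", b"Q3 forecast"),
    Doc("Senior VP", "memo_copy.docx", b"Q3 forecast"),
    Doc("Junior Analyst", "memo_backup.docx", b"Q3 forecast"),
]
global_kept = dedupe_global(docs, ["Senior VP", "Junior Analyst"])
per_custodian_kept = dedupe_per_custodian(docs)
```

With this sample data, global de-duplication keeps only the Senior VP’s copy, while per-custodian de-duplication keeps one copy for each of the two custodians.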
The de-duplication options above are applied to documents as they are processed. Additionally, as documents are reviewed, they can be analyzed for relative similarity. This analysis identifies “near dupes”: similar documents that differ only in simple formatting, document type or other minor respects. Near duplicates are often grouped around one document, the “core” of the group, and all related near-duplicate documents are compared to this core document. Near-duplicate identification can help the reviewer better understand the relationship between the documents, allowing for group coding decisions based upon observed similarities.
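The article does not say how near duplicates are scored, and review platforms use their own methods. Purely as an illustration of grouping documents around a “core,” the Python sketch below assumes one common textbook approach: Jaccard similarity over three-word shingles. The 0.8 threshold, the function names, and the sample texts are all hypothetical.

```python
import re


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Split text into lowercase words and return the set of k-word sequences."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Share of shingles the two sets have in common (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_near_duplicates(docs: dict[str, str],
                          threshold: float = 0.8) -> dict[str, list[str]]:
    """Map each core document to the documents that score at or above
    the threshold when compared against that core."""
    sets = {name: shingles(text) for name, text in docs.items()}
    groups: dict[str, list[str]] = {}
    assigned: set[str] = set()
    for core in docs:
        if core in assigned:
            continue
        assigned.add(core)
        groups[core] = []
        for other in docs:
            if other not in assigned and jaccard(sets[core], sets[other]) >= threshold:
                groups[core].append(other)
                assigned.add(other)
    return groups


# Hypothetical texts: the first two differ only in case, spacing and punctuation.
texts = {
    "report_v1": "The quarterly report is attached for your review.",
    "report_v2": "the quarterly report is attached for your  REVIEW",
    "lunch": "Lunch on Friday?",
}
groups = group_near_duplicates(texts)
```

In this sketch the first ungrouped document encountered becomes the core, and every other document is compared only to that core, mirroring the description above. A hash comparison would treat report_v1 and report_v2 as different files, which is why near-duplicate analysis is a separate step from exact de-duplication.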
Regardless of the method chosen, de-duplication can result in tremendous savings when properly leveraged to meet the needs of a project. However, it can also be fraught with complexity and pitfalls if improperly utilized.