De-Duplication without Metadata!

Feb 2021

About our Client: Based in Chicago, our client is a cutting-edge innovator in eDiscovery with over 15 years of experience in the legal support vertical. From forensic collections and paper discovery to enhanced use of analytics and AI to streamline attorney review, their solutions address every stage of the EDRM.


Challenge: Our client contacted us with an urgent plea for assistance with a unique document collection. They had received three productions of unstructured data and could tell after just a cursory glance that there was a high level of duplication in the population. In eDiscovery software, duplicate documents are normally easy to remove by comparing MD5 hash values or extracted text. In this case, the productions were scanned images with no usable metadata. Because of poor image quality and frequent marginalia, OCR text was completely unreliable for near-duplicate analysis. With a looming deadline and limited resources, a manual review by attorneys and paralegals was out of the question: the process is time-consuming and hard to carry out reliably. How could the client avoid sifting through thousands of duplicate pages while still spotting key differences in near duplicates that could be crucial to their case?
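To illustrate why hash-based de-duplication breaks down on scanned images, here is a minimal Python sketch of exact-duplicate grouping by MD5. The file names and byte contents are made-up sample data, not anything from the client's collection; the point is only that two scans of the same page almost never produce identical bytes, so their hashes differ.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def md5_of_bytes(data: bytes) -> str:
    """Return the hex MD5 digest of a file's raw bytes."""
    return hashlib.md5(data).hexdigest()


def exact_duplicates(files: Dict[str, bytes]) -> List[List[str]]:
    """Group file names whose contents hash to the same MD5 value."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name, content in files.items():
        groups[md5_of_bytes(content)].append(name)
    # Only groups with more than one member are duplicates.
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Hypothetical sample: two byte-identical files and one "rescan" whose
# bytes differ slightly even though the page content is the same.
sample = {
    "memo_a.pdf": b"Quarterly memo, page 1",
    "memo_b.pdf": b"Quarterly memo, page 1",
    "memo_scan.tif": b"Quarterly memo, page 1 + scanner noise",
}

duplicate_groups = exact_duplicates(sample)
print(duplicate_groups)
```

In this sample the two byte-identical files are grouped, while the rescanned copy is missed, which is the situation the client faced across the whole population.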


Solution: The first step in the process was to apply correct unitization through 247Digitize’s logical document determination (LDD) process. From 899 received records, 247Digitize was able to determine 4,522 true documents for review. The second step started with 247Digitize consulting with the client to determine four fields of data that could be manually coded from the record set to yield unique values. These coded values would be used to sort the records accurately and identify potential duplicates. The records identified would then undergo a final, 100% page-by-page analysis to confirm the presence and level of content duplication. Because this final step would be the most tedious and costly for the client, 247Digitize went a step further, developing its own classification methodology that resulted in 181 sub-categories of documents. By taking the client’s category list to a granular level, we were able to reduce the number of suspected duplicates, and with it the manual review time, significantly.
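The sort-and-identify step can be pictured as grouping records on a composite key built from the coded fields. The Python sketch below is an illustration only, not 247Digitize's actual tooling: the source does not name the four fields, so `doc_date`, `author`, `doc_type`, and `title` are hypothetical stand-ins, as are the sample records.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical field names; the case study does not say which four were coded.
KEY_FIELDS: Tuple[str, ...] = ("doc_date", "author", "doc_type", "title")


def normalize(value: str) -> str:
    """Lower-case and collapse whitespace so trivial coding differences match."""
    return " ".join(value.lower().split())


def candidate_duplicate_groups(records: List[Dict[str, str]]) -> List[List[str]]:
    """Return groups of doc IDs that share all coded key fields.

    These are only candidates; each group still needs page-by-page
    comparison to confirm full or partial duplication.
    """
    groups: Dict[Tuple[str, ...], List[str]] = defaultdict(list)
    for rec in records:
        key = tuple(normalize(rec[field]) for field in KEY_FIELDS)
        groups[key].append(rec["doc_id"])
    return [ids for ids in groups.values() if len(ids) > 1]


# Made-up sample records for illustration.
records = [
    {"doc_id": "DOC-0001", "doc_date": "2019-03-04", "author": "Smith, J.",
     "doc_type": "Letter", "title": "Re: Site Inspection"},
    {"doc_id": "DOC-0002", "doc_date": "2019-03-04", "author": "Smith, J.",
     "doc_type": "Memo", "title": "Re: Site Inspection"},
    {"doc_id": "DOC-0003", "doc_date": "2019-03-04", "author": "smith,  j.",
     "doc_type": "letter", "title": "RE: Site Inspection"},
]

candidates = candidate_duplicate_groups(records)
print(candidates)
```

Adding a finer-grained field to the key, such as a document sub-category, splits large groups into smaller ones, which is one way to understand how a more granular classification reduces the number of records sent to page-by-page comparison.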


Results: Over a period of 10 days, the project team executed every step of the project plan, identifying 80% of the 4,522 documents as suspected duplicates. Of these, 46% were confirmed as true duplicates; the balance contained partial duplication and was flagged for secondary attorney review. Both full and partial duplicates were presented to the client via Relativity’s Related Items pane, allowing efficient review and tagging for relevancy and responsiveness ahead of trial.


247Digitize received positive reviews from the client, not only for the execution of the project but also for the thorough consultation that helped them save effort and money. Our proactive approach helped the client achieve results in a fraction of the time, and at a fraction of the cost, that internal resources would have required, and it will serve as the blueprint for related matters in the future.