As a Data Scientist or Business Analyst, your job is to uncover insights from a massive, ever-growing pool of uncorrelated data. You've created data lakes, data warehouses, and data marts, and have even tried tools that access that data through virtualization. One thing you absolutely need is for any data you process to be clean and reliable.
Dirty data, in the form of unreliable, duplicate, or fraudulent information, may have an even larger impact: as much as 3 trillion dollars! Whether those numbers are accurate is debatable. Closer to home for businesses, Experian estimates that, on average, U.S. organizations believe 32 percent of their data is inaccurate.
And that’s just the perception and impact of basic data quality (DQ). Even more critical, business decisions are made every day on uncorrelated data that may not be “dirty” but is missing key information that might have led to a better decision and outcome.
So naturally, dirty data is a major concern for Data Scientists and Business Analysts alike. The Verge’s article titled “The biggest headache in machine learning? Cleaning dirty data off the spreadsheets” contains a humorous but no doubt close-to-home quote: