What’s Worse, Fake News or Dirty Data? Debate.

According to Wikipedia:

 “Fake news is a type of yellow journalism or propaganda that consists of deliberate misinformation or hoaxes spread via traditional print and broadcast news media or online social media. Fake news is written and published with the intent to mislead in order to damage an agency, entity, or person, and/or gain financially or politically, often with sensationalist, exaggerated, or patently false headlines that grab attention.”

“Dirty data, also known as rogue data, is inaccurate, incomplete or erroneous data, especially in a computer system or database. Dirty data can contain such mistakes as spelling or punctuation errors, incorrect data associated with a field, incomplete or outdated data, or even data that has been duplicated in the database. It can be cleaned through a process known as data cleansing.”
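To make that definition concrete, here is a minimal sketch of what a data cleansing pass might look like. It is only an illustration: the records, the field names, and the `VALID_STATES` reference list are all made up, and real cleansing tools do far more than this.

```python
# Hypothetical customer records; every name and field here is invented for illustration.
RAW_RECORDS = [
    {"name": "Ada Lovelace", "email": "ada@example.com", "state": "NY"},
    {"name": "ada lovelace ", "email": "ADA@example.com", "state": "ny"},  # duplicate with messy formatting
    {"name": "Grace Hopper", "email": "", "state": "VA"},  # incomplete: no email
    {"name": "Alan Turing", "email": "alan@example.com", "state": "New Yrok"},  # incorrect field value
]

VALID_STATES = {"NY", "VA"}  # assumed reference list for this example


def normalize(record):
    """Trim whitespace and standardize casing so equivalent values compare equal."""
    return {
        "name": " ".join(record["name"].split()).title(),
        "email": record["email"].strip().lower(),
        "state": record["state"].strip().upper(),
    }


def cleanse(records):
    """Return (clean, rejected): valid deduplicated records, and flagged ones with a reason."""
    clean, rejected, seen_emails = [], [], set()
    for raw in records:
        rec = normalize(raw)
        if not rec["email"]:
            rejected.append((rec, "missing email"))
        elif rec["state"] not in VALID_STATES:
            rejected.append((rec, "invalid state"))
        elif rec["email"] in seen_emails:
            rejected.append((rec, "duplicate"))
        else:
            seen_emails.add(rec["email"])
            clean.append(rec)
    return clean, rejected


clean, rejected = cleanse(RAW_RECORDS)
print(len(clean), "clean,", len(rejected), "rejected")
```

Flagging rejected records with a reason, rather than silently dropping them, is a deliberate choice in this sketch: someone still has to decide whether "New Yrok" is a typo worth fixing or a record worth discarding.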

While Fake News is well known to the general public because of its wide-reaching impact, including its arguable influence on a presidential election, Dirty Data in the form of unreliable, duplicate, or fraudulent information may have an even larger impact: as much as 3 trillion dollars! Whether those numbers are accurate is debatable. Closer to home for businesses, Experian reports that, on average, U.S. organizations believe 32 percent of their data is inaccurate.

And that’s just the perception and impact of basic data quality (DQ). Even more critical, business decisions are made every day on uncorrelated data that may not be “dirty” but is missing key information that might have led to a better decision and outcome.

So naturally, Dirty Data is a major concern for advanced analytics and machine learning. The Verge’s article titled “The biggest headache in machine learning? Cleaning dirty data off the spreadsheets” contains a humorous but no doubt close-to-home quote:

“There’s the joke that 80 percent of data science is cleaning the data and 20 percent is complaining about cleaning the data.”
— Kaggle founder and CEO Anthony Goldbloom (via The Verge)

Of course, the two are not mutually exclusive: Fake News can be used to promote Dirty Data!

For example, you've just read this blog and perhaps agreed with the data, which I sourced from other articles on the web. I readily admit that I did not have time to verify the accuracy of the data references in each story. In fact, the 3 trillion dollar number comes from this Saleshacker.com article, which cites this post, which dates back to 2011!

But you can trust me :-)

What’s your most egregious example of Fake News or Dirty Data? Please share your horror or funny stories in the comments below.