As Published in Information Management at http://www.information-management.com/blogs/big-data-analytics/machine-learning-delivers-quality-data-at-the-speed-of-the-business-10030617-1.html
Maintaining data reliability is a resource-intensive, uphill task for many organizations. Companies often spend too much effort on data reviews and cleanup, but seldom seem to catch up. Most of the time, teams don’t even know what the issues are, how to look for them or how to solve them. They just know that the data is dirty and, as if sitting on a ticking time bomb, wait for the disaster to happen.
The issues are often only illuminated when the data is put to operational use and trips up the end user or the customer with wrong information. The cost to the company can be enormous -- including loss of business, failed product launches, exposure to compliance risks and lack of responsiveness, to name a few.
As the volume of data is growing at a much faster rate than can be manually reviewed and rectified, companies are looking for advanced ways to solve the data quality problem. Modern data management leverages machine learning to monitor, manage and improve the data quality to stay ahead of data challenges. There are three key areas where integrating machine learning in your data management system can help you improve data quality and reliability.
Understanding and Quantifying Data Quality
Data quality is a fundamental aspect of operational data governance. Understanding and monitoring the quality of master data is crucial for downstream data usage. If business applications and analytics receive incomplete or inaccurate data, there is an adverse impact on the company.
The decisions that business users make based on the insights from inaccurate data will be incorrect as well. The quality of your customer experience, time to new product launch, and compliance are all at risk if the underlying data is of questionable quality.
Measuring the quality of master data requires significant analytical skillsets and complicated workflows across analytical, master and operational data stores. One sweep of data quality measurement can take weeks and may require multiple imports and exports to various systems with a fair amount of manual labor.
If the speed of data quality analysis lags the velocity of data generation, you will never have clean and reliable data. Moreover, due to the complexity involved in data quality measurements, you may not be able to afford to do this regularly, further adding to your data woes.
Machine learning and rules-based data quality checks and inspections can help. Such capabilities in modern data management systems continuously monitor data quality, completeness and formatting, and score the data accordingly. They can raise exceptions and send alerts to data owners and end users about the issues. They can even offer suggestions to fix the data and improve the quality scores.
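To make the idea of rules-based scoring concrete, here is a minimal sketch in Python. The field names, the three rules and the 0.7 alert threshold are all assumptions chosen for illustration, not the behavior of any particular data management product. Each record is scored by the fraction of rules it passes, and records that fall below the threshold are flagged along with the fields that failed.

```python
import re

# Deliberately loose, illustrative email pattern (an assumption, not a full validator).
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Hypothetical rules: field name -> check returning True when the value passes.
RULES = {
    "name": lambda v: bool(v and v.strip()),                # completeness
    "email": lambda v: bool(v and EMAIL_RE.match(v)),       # formatting
    "country": lambda v: bool(v and len(v) == 2 and v.isalpha() and v.isupper()),
}

def quality_score(record):
    """Return (score, failed_fields); score is the fraction of rules passed."""
    failed = [field for field, check in RULES.items() if not check(record.get(field))]
    score = (len(RULES) - len(failed)) / len(RULES)
    return score, failed

def alerts(records, threshold=0.7):
    """Yield (record_id, score, failed_fields) for records scoring below threshold."""
    for rec in records:
        score, failed = quality_score(rec)
        if score < threshold:
            yield rec.get("id"), score, failed
```

In a real system the list of failed fields is what would drive the suggested fixes, and the rules themselves could be learned or tuned rather than hand-written.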
User-friendly dashboards continuously display the performance charts and graphs with improvement recommendations. You can monitor the data quality trends of any segment of data or set of profiles.
For example, you can compare the data quality of U.S. consumer records to the records for China or do the same across product categories. You can search the data based on data quality scores. If you are running a campaign for a major product launch, you may want to eliminate profiles with low quality scores. Real-time recalculations of data quality using machine learning provide immediate insights into the quality and can also recommend actions to fix the data.
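The segment comparison and campaign filtering described here can be sketched in a few lines of Python. This assumes each profile already carries a precomputed `quality_score` between 0 and 1; that field name, the `country` key and the 0.8 cutoff are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def score_by_segment(profiles, key):
    """Average the quality score of profiles grouped by a segment key."""
    groups = defaultdict(list)
    for p in profiles:
        groups[p[key]].append(p["quality_score"])
    return {segment: mean(scores) for segment, scores in groups.items()}

def campaign_ready(profiles, min_score=0.8):
    """Keep only the profiles whose quality score meets the cutoff."""
    return [p for p in profiles if p["quality_score"] >= min_score]
```

Calling `score_by_segment(profiles, "country")` gives one average per country, and the same function works for product category or any other attribute used as the key.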
With machine learning capability built into your data management system, you can make sure that your operational and customer-facing teams are always working with accurate and reliable data. Integrated quality measurement of operational data provides better governance and decision management across all functional groups. It facilitates informed use of data and enables more data-driven decisions.
Machine Learning for Better Data Deduplication
Creating a true 360-degree view of anything (such as customers, employees, products and suppliers) requires you to bring together data from all internal, external, and third-party sources. This blending requires careful matching and merging of the data. The challenge is that defining matching rule sets takes time and a deep understanding of data profiles. As the number of sources increases and the formats and data types multiply, defining the rules becomes a complicated endeavor.
Data matching accuracy is often questionable. Matching is frequently not comprehensive and requires backup processes and manual intervention. The complexity of data matching lengthens the time to value for most master data management tools. Organizations spend months iteratively defining, testing and fine-tuning match rule sets. Setting match rules is an expensive and lengthy process.
Machine learning within modern data management platforms can help derive the matching rules automatically from the data and from active-learning feedback provided by data stewards. Data stewards can take a set or a sample of data and run it through the matching rule sets, then evaluate the matching quality to tell the system which matches were good and which were inadequate or inaccurate. With a single click, they can show the machine learning system how to treat the data and determine new match rules. The system adapts to the customer data and user behavior.
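As a rough illustration of learning a match rule from steward feedback, the Python sketch below reduces the problem to its simplest form: a single string-similarity score and a single threshold. This is a simplification made for the example, not how any specific platform works; production systems typically combine many attributes and far richer models. The similarity measure comes from the standard library's `difflib`, and the function names are hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive similarity ratio between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def learn_threshold(labeled_pairs):
    """Pick the similarity threshold that disagrees least with steward labels.

    labeled_pairs: list of ((a, b), is_match) as judged by a data steward.
    """
    scored = [(similarity(a, b), is_match) for (a, b), is_match in labeled_pairs]
    candidates = sorted({s for s, _ in scored})

    def errors(threshold):
        # Count pairs where the rule "score >= threshold" contradicts the label.
        return sum((s >= threshold) != is_match for s, is_match in scored)

    return min(candidates, key=errors)

def next_to_review(unlabeled_pairs, threshold):
    """Active-learning step: pick the pair the current rule is least sure about."""
    return min(unlabeled_pairs, key=lambda p: abs(similarity(*p) - threshold))
```

Each time a steward confirms or rejects the pair returned by `next_to_review`, the label is added to the training set and `learn_threshold` is run again, so the rule gradually adapts to the data.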
Machine Learning-Assisted Data Enrichment
Another valuable use of machine learning within data management is automated data enrichment. As you collect profile, transactional, and interaction data from all your sources and establish many-to-many relationships across people, places, products, and companies, you can use machine learning to enrich each profile with additional information, such as a data quality score or business value, without any user input.
You can even deduce and add segmentation attributes like low, medium, or high purchasing power and churn propensity based on attributes like address, purchases, interactions, or credit lines. You can understand the risk of using a data set for your analytics or campaigns before you go too deep into design and execution.
Using machine learning within a modern data management platform, you can calculate a quality score for the correctness of the profile as well as a confidence level for calculated attributes like churn propensity or channel preference.
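The shape of such an enrichment step can be sketched in Python as follows. Note that this example uses fixed cutoffs as a stand-in for a trained model, and a very crude confidence measure (the share of input attributes that are actually present). The spend cutoffs, field names and confidence formula are all assumptions made for illustration.

```python
# Hypothetical input attributes that the derived value depends on.
INPUT_FIELDS = ("address", "purchases", "credit_line")

def purchasing_power(total_spend):
    """Map total spend to a tier using illustrative, made-up cutoffs."""
    if total_spend >= 5000:
        return "high"
    if total_spend >= 1000:
        return "medium"
    return "low"

def enrich(profile):
    """Return a copy of the profile with a derived attribute and its confidence."""
    present = [f for f in INPUT_FIELDS if profile.get(f)]
    enriched = dict(profile)  # leave the original profile untouched
    total_spend = sum(profile.get("purchases") or [])
    enriched["purchasing_power"] = purchasing_power(total_spend)
    # Crude confidence: fraction of the input attributes that are populated.
    enriched["purchasing_power_confidence"] = len(present) / len(INPUT_FIELDS)
    return enriched
```

Storing the confidence next to the derived attribute lets downstream users decide whether a segment is trustworthy enough for a given analysis or campaign.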
The use of machine learning is growing, helping business users make better sense of their data and assisting them to make decisions by processing volumes of information coming from a large number of sources. Machine learning not only helps determine and improve data quality but also enriches the data with relevant insights and provides intelligent recommended actions for data quality and operational improvements.