Data Quality Management: how to implement a modular framework


Contents      

Introduction & scope of this article 


Data quality has been an issue for too long 


So, why are organizations not fixing Data Quality issues? 


The case for a comprehensive, yet lightweight data quality framework 


6 Steps to approach implementation of a modular Data Quality framework


How to nail technical details and boost your Data Quality Management?

Introduction & scope of this article

In recent years, as industries have shifted towards data-driven decision-making, the effects of poor data quality have become more impactful: if data is not good enough to train machine learning models, or if management dashboards are not accurate enough, fallacies increase and the quality of business decisions suffers. Quality issues and errors in data are plentiful, they are expensive to fix, and no one can really assess the damage they cause.

This article covers process design and how to implement frameworks.

If you would like to learn more about the technical aspects of Data Quality Frameworks, practical implementation details, and how to boost Data Quality Management with machine learning capabilities, you can have a look at Vera’s article on our partner brand website Positive Thinking Company: how Machine Learning enhances Data Quality Management.

Data quality has been an issue for too long

In recent years, as industries have shifted towards data-driven decision-making, the effects of poor data quality have become more impactful. If data is not good enough to train machine learning models, or if management dashboards are not accurate enough, fallacies increase and the quality of business decisions suffers.

Quality issues and errors in data are plentiful, they are expensive to fix, and no one can really assess the damage they cause. Data quality issues come in many forms. To name a few examples:

Similar information is stored in different data formats across various systems.
Customers have given wrong information due to misleading forms.
Employees are cleverly omitting necessary information because it would not affect them personally.
Automatic data consolidation processes do not cover a few special data constellations.

The root causes for all these errors vary greatly and almost always require an individual analysis that can be quite time-consuming. But this is no news to anyone, as the problems have been around long enough.

So, why are organizations not fixing Data Quality issues?

Here are 5 kinds of struggles regarding data quality that we have observed in the past:

There are rarely any direct monetary benefits for the organization. A lack of data quality prevents organizations from gaining information from their data, but fixing quality findings does not directly lead to increased sales or margins. It is more likely that increasing data quality will enable an organization to gain insights that lead to proper action, but it is nearly impossible to make that benefit completely transparent.
Cultural resistance to grassroots initiatives, especially from middle management, tends to quickly stall whatever momentum such an initiative has built. This is not a problem exclusive to data quality measures, but it can be difficult to allocate budget for appropriate measures if the overall initiative is not adequately backed.
A lack of internal data governance roles and assigned responsibility for data naturally limits any data quality activities. Identifying errors and aggregating quality KPIs (key performance indicators) is only one part of the process. The organizationally challenging part only starts there: now that the errors are known, how should we handle them? The mitigation, fixing, or acceptance of data errors are usually case-by-case decisions.
There are plenty of data quality tools on the market, but when introduced in a corporate environment, they are rarely applied to their full potential, or have significant functional overhead that presents a major obstacle for user acceptance.
Data quality rulesets tend to grow over time and become very difficult to manage. It is also not feasible to cover all quality requirements with explicit rules and to monitor the results of all these checks. That leads to increasing resource demands for operating the DQ (Data Quality) process, which can easily overwhelm organizations and lead to inefficiencies.

To summarize:

Tackling data quality professionally is difficult, costly, unattractive, and comes with a long string of consequences for the organization. That is why we propose a modular solution that can be started small and extended over time, scaled exactly to immediate needs.

The case for a comprehensive, yet lightweight data quality framework

These pain points have led us to develop a modular framework that can tackle the various challenges data quality poses to enterprises.

As a basis for all technical implementation, designing processes and a target operating model for data quality must be the first steps. That includes determining where and when to check data quality, who sets data quality requirements, how to present results, how to prioritize findings, and how operations should be funded.

In terms of operational processes, organizations should aim to continuously improve the quality of their data as well as the accuracy and completeness of their data quality checks. We propose the following high-level, iterative process as a means of orientation for developing a detailed version specifically tailored to the respective organization:

6 Steps to approach implementation of a modular Data Quality framework

Influential factors for the scope and extent of a DQ framework

There are several factors that influence the setup of a data quality management system. Data consumers, meaning users of data inside or outside the organization, may have specific requirements for the quality of their data, each with a different focus. (Note: They likely just want it to be “good” without being able to further detail what “good” means specifically.) Depending on the use case, requirements might differ a lot. For some, availability is most important; others cannot deal with missing values or need certain data to be extremely accurate.

Strategic initiatives might want an overall framework that makes data quality comparable across different business units, which brings in additional requirements. Strict regulations, such as those in the financial industry, and privacy laws like the GDPR also influence the requirements placed on the quality of the data.

These factors shape the perspective on data quality and determine the design of the necessary processes.

First things first: Laying a solid foundation

Before setting up an operational process that continuously improves the quality of corporate data, requirements need to be collected, and generic data quality metrics that suit the company’s needs must be defined. Metrics are specific to each organization; they are needed to aggregate checks and communicate results. Metrics can be percentages, traffic lights, or numeric or lettered levels with adjustable threshold values, or arithmetic figures like means and standard deviations. The specific implementation depends on the types of checks to be aggregated and the company’s management culture. Defining these metrics is a critical step, since all ensuing efforts will be based on the outcome.
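To make the idea of a metric with adjustable thresholds more concrete, here is a minimal Python sketch. The class name TrafficLightMetric, the summarize helper, and the threshold values are assumptions chosen for illustration; they are not part of any specific tool or of the framework described here.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass(frozen=True)
class TrafficLightMetric:
    """Hypothetical metric: maps a pass rate (0..1) to a traffic light."""
    name: str
    green_from: float   # pass rate at or above this value is "green"
    yellow_from: float  # pass rate at or above this value (below green) is "yellow"

    def rate(self, pass_rate: float) -> str:
        if not 0.0 <= pass_rate <= 1.0:
            raise ValueError("pass_rate must be between 0 and 1")
        if pass_rate >= self.green_from:
            return "green"
        if pass_rate >= self.yellow_from:
            return "yellow"
        return "red"


def summarize(pass_rates: list[float]) -> dict[str, float]:
    """Arithmetic figures aggregated over several check results."""
    return {
        "mean": mean(pass_rates),
        # The sample standard deviation needs at least two values.
        "stdev": stdev(pass_rates) if len(pass_rates) > 1 else 0.0,
    }


# Thresholds are illustrative; each organization would set its own.
completeness = TrafficLightMetric("completeness", green_from=0.98, yellow_from=0.90)
print(completeness.rate(0.95))
print(summarize([0.9, 1.0]))
```

Keeping the thresholds as plain configuration values means they can be adjusted later without touching the checks themselves.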

Modular structure enables flexibility

As a next step, requirements need to be translated into modular quality checks that each test exactly one criterion (atomicity). The checks must not overlap, and they should be generically applicable to your data while remaining configurable for specific needs, like templates. Checks can be simple yes/no tests, or more complex ones supported by statistics or machine learning.

Depending on the type, results can have:

a binary format,
percentage-based outcomes, or
categorized scores.

Additionally, data will be augmented by concrete values like currencies, a list of statistically outlying records, or otherwise noticeable data like unusual data drifts.
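The template idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: make_check, CheckResult, and the two example checks are hypothetical names, and the sample records are invented. Each check tests exactly one criterion and returns a binary outcome, a percentage-based outcome, and the concrete failing records.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

Record = dict[str, Any]


@dataclass
class CheckResult:
    check_name: str
    passed: bool       # binary outcome
    pass_rate: float   # percentage-based outcome (0..1)
    failing_records: list[Record] = field(default_factory=list)  # concrete findings


def make_check(name: str, predicate: Callable[[Record], bool],
               min_pass_rate: float = 1.0) -> Callable[[list[Record]], CheckResult]:
    """Template: turns one record-level criterion into a reusable check."""
    def run(records: list[Record]) -> CheckResult:
        failing = [r for r in records if not predicate(r)]
        rate = 1.0 if not records else 1 - len(failing) / len(records)
        return CheckResult(name, rate >= min_pass_rate, rate, failing)
    return run


# Two atomic checks configured from the same template (hypothetical criteria).
email_present = make_check("email_present", lambda r: bool(r.get("email")))
amount_non_negative = make_check("amount_non_negative",
                                 lambda r: r.get("amount", 0) >= 0,
                                 min_pass_rate=0.5)

rows = [
    {"email": "a@example.com", "amount": 10},
    {"email": "", "amount": -5},
]
for check in (email_present, amount_non_negative):
    result = check(rows)
    print(result.check_name, result.passed, result.pass_rate,
          len(result.failing_records))
```

Because every check has the same call signature and result type, checks can be added or removed without changing the code that aggregates them.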

Results mean nothing without context

The check results need to be interpreted in the next step, translating the outcome into a generic measurement for specific quality requirements, e.g. a traffic light system for specific external reporting. This translation should be owned by a dedicated role, e.g. a Data Quality Officer, who is familiar with how the data is used. The generalized, comparable score will help to prioritize your findings. Some of the results might be of greater value or impact for your company. Some might be expected and are just tracked for reporting purposes. Thus, we encourage establishing a Data Quality Council consisting of all stakeholders, which meets regularly to align on the priorities of DQ findings. This is especially important if the mitigation of different DQ problems will be executed by the same analytical or developer resources.
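One possible way to prepare findings for such a council meeting is to sort them by their interpreted status and an agreed impact weight. The sketch below is an assumption for illustration only: the Finding fields, the severity ranking, and the example weights are hypothetical, and a real council may well prioritize differently.

```python
from dataclasses import dataclass

# Hypothetical ranking of traffic-light statuses, worst first when sorted.
SEVERITY = {"red": 2, "yellow": 1, "green": 0}


@dataclass(frozen=True)
class Finding:
    check_name: str
    status: str   # interpreted result: "red", "yellow" or "green"
    impact: int   # hypothetical business impact weight set by the council


def prioritize(findings: list[Finding]) -> list[Finding]:
    """Order findings by severity first, then by business impact."""
    return sorted(findings,
                  key=lambda f: (SEVERITY[f.status], f.impact),
                  reverse=True)


findings = [
    Finding("iban_format", "red", 3),
    Finding("email_present", "yellow", 5),
    Finding("country_code", "red", 5),
    Finding("zip_code_length", "green", 1),
]
for f in prioritize(findings):
    print(f.check_name, f.status, f.impact)
```

The ordering rule is deliberately simple so that stakeholders can follow and challenge it; the weights are the part that needs regular alignment.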

“So what do we do about this?”

The mitigation of data quality issues ideally includes a root cause analysis. This activity is far easier if the organization has a clear view of its data lineage: the original source of the information and processing steps such as transformation and transfer operations must be transparent for an analysis to be efficient. Ideally, this information is available in an organization-wide data catalog, along with further information like data ownership and business definitions of attributes. Once the root cause is identified, business analysts and developers would coordinate a plan to fix the underlying issue or, if that is not possible, come up with a way to mitigate the impact.
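If lineage is available in machine-readable form, finding candidate root causes becomes a graph traversal. The following Python sketch assumes a very simple, hypothetical lineage structure (a dictionary mapping each dataset to the datasets it is derived from); real data catalogs expose lineage in their own formats.

```python
from collections import deque

# Hypothetical lineage: each dataset maps to the datasets it is derived from.
LINEAGE = {
    "sales_dashboard": ["sales_mart"],
    "sales_mart": ["crm_export", "erp_orders"],
    "crm_export": [],
    "erp_orders": [],
}


def upstream_datasets(dataset: str, lineage: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk from a dataset to everything it depends on."""
    seen: set[str] = set()
    order: list[str] = []
    queue = deque(lineage.get(dataset, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # a dataset can be reachable via several paths
        seen.add(node)
        order.append(node)
        queue.extend(lineage.get(node, []))
    return order


def original_sources(dataset: str, lineage: dict[str, list[str]]) -> list[str]:
    """Upstream datasets that have no parents of their own."""
    return [d for d in upstream_datasets(dataset, lineage) if not lineage.get(d)]


print(upstream_datasets("sales_dashboard", LINEAGE))
print(original_sources("sales_dashboard", LINEAGE))
```

Starting from the dataset where a finding was raised, the nearest upstream datasets are checked first, which tends to narrow down where an error was introduced.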

The actual implementation of a mitigation strategy can ideally be automated: rule-based, with heuristics, or even with machine learning models. But there will also be cases that need to be fixed manually by your data source owners using the regular IT service management process: tickets need to be written, and issues need to be fixed and tested along the way.
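The split between automated and manual mitigation can be sketched as a simple routing step: apply a rule where one matches, and hand everything else over to the ticket process. The names below (normalize_country, Ticket, mitigate) and the mapping table are assumptions for illustration; the Ticket class merely stands in for whatever IT service management tool is in use.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Record = dict[str, object]


@dataclass
class Ticket:
    """Stand-in for an entry in the regular IT service management process."""
    record: Record
    reason: str


def normalize_country(record: Record) -> Optional[Record]:
    """Rule-based fix: map known spellings to a country code, else None."""
    mapping = {"germany": "DE", "deutschland": "DE", "france": "FR"}
    raw = str(record.get("country", "")).strip().lower()
    if raw in mapping:
        return {**record, "country": mapping[raw]}
    return None


def mitigate(records: list[Record],
             fix: Callable[[Record], Optional[Record]]
             ) -> tuple[list[Record], list[Ticket]]:
    """Apply the automated fix where possible; route the rest to manual work."""
    fixed: list[Record] = []
    tickets: list[Ticket] = []
    for record in records:
        repaired = fix(record)
        if repaired is not None:
            fixed.append(repaired)
        else:
            tickets.append(Ticket(record, "no automated rule matched"))
    return fixed, tickets


fixed, tickets = mitigate(
    [{"id": 1, "country": "Deutschland"}, {"id": 2, "country": "Atlantis"}],
    normalize_country,
)
print(fixed)
print(tickets)
```

The same routing function would work with a heuristic or a model in place of the rule, as long as it returns None when it cannot propose a fix.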

Lifelong learning, also in Data Quality terms

From the analysis of prior findings, DQ officers can derive new data quality requirements and checks, or identify outdated or unnecessary checks. Additionally, statistical analyses or applied machine learning models (e.g. for monitoring) can yield new knowledge and useful new DQ checks by identifying unusual value combinations or statistical outliers. Regarding the interpretation of rules, automatic interpretation schemas can be adjusted if threshold values turn out to be too low or too high. This process step leads to gradual improvement of overall quality, in both the precision of the process and the data itself.
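As a small illustration of how a statistical analysis can point to candidates for new checks, the sketch below flags values that lie far from the mean using a z-score. This is a deliberately simple heuristic chosen for the example, not the method of the framework; the function name, the threshold of 2.0, and the sample row counts are assumptions.

```python
from statistics import mean, stdev


def zscore_outliers(values: list[float], threshold: float = 2.0) -> list[int]:
    """Return the positions of values that lie more than `threshold`
    sample standard deviations away from the mean."""
    if len(values) < 3:
        return []  # too few points for a meaningful spread
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical daily row counts of a data load; the last day looks unusual.
daily_row_counts = [100, 102, 98, 101, 99, 100, 180]
print(zscore_outliers(daily_row_counts))
```

A flagged position is only a hint: a DQ officer would still decide whether it reflects a real quality problem and whether it justifies a permanent check or an adjusted threshold.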

The essence of it all

The overall goal of a data quality framework, from our perspective, is to get rid of quality issues in a sustainable way. Many market tools and proposals focus on aggregated reporting of data quality KPIs and offer data cleansing mechanisms to correct data at a later stage in the process. Our approach focuses on improving the ‘quality at source’ instead, making data quality efforts sustainable and keeping the overall DQ system manageable. The scope of the framework, in terms of both the number of checks and the data covered, can also be adjusted flexibly as checks or data need to be added or removed. To make the solution work in the long run, it is vitally important to consider both rule-based checks and their rigorous upkeep, as well as more automatable, holistic approaches enabled by advanced analytics and machine learning.

How to nail technical details and boost your Data Quality Management?

Designing processes is just one side of the whole story.

Would you like to learn more about the technical design of Data Quality Frameworks, practical implementation details, and how to boost Data Quality Management with machine learning capabilities?

Vera Meinert from Positive Thinking Company, a partner brand from our ecosystem, wrote an article on how you can enhance your Data Quality Framework by:

Machine Learning based monitoring and detection of drastic data changes
Machine Learning based validation of input creation
Machine Learning supported automated mitigation

Take a look at Vera’s article on how Machine Learning enhances Data Quality Management.

Lennard Nobbe
Data Strategy & Governance Consultant

Brought to you by Lennard

Lennard Nobbe has vast experience in data architecture and data governance in the banking industry and, having experienced organizational mindsets change from a front-row seat, is convinced that organizations’ ability to manage and utilize data, or lack thereof, will determine their role in the very near future.
