With the rise of data governance programs across financial services organizations and fintech startups today, data lineage and pedigree are becoming more commonplace terms used throughout the industry. What are they? Are they one and the same? In this blog post, I’ll cover some of the similarities and differences between the two and explain how each fits into a data governance program for financial services organizations.
The short definition of data lineage is this: You can trace when and where a specific data element value entered your data ecosystem, where it goes within the ecosystem, and to whom it is sent if it leaves your ecosystem. This is a reasonable definition, but it omits a few other key components of validating the accuracy of data, which is essential to a data governance program. These include:
- The correctness of data. Data must follow security and integrity rules established by the data stewards of a financial services organization.
- The consistent application of rules. Rules must be applied consistently and accurately in every instance.
Pedigree, on the other hand, addresses the accuracy, correctness, completeness, and timeliness of a data element, and its compliance with established standards. Let me put it in more colloquial terms: Lineage tells me the “from” and “where,” while pedigree tells me whether the data is correct at each step in the lineage, meaning that it follows the edit and validation rules established by the data stewards.
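To make the distinction concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `LineageStep` record, the three system names, and the rule that a currency code must be three uppercase ASCII letters are hypothetical stand-ins for whatever your data stewards actually define. Lineage is the ordered trace of where a value has been; pedigree is the result of applying the steward's rule at every step of that trace.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, List, Optional, Tuple


@dataclass
class LineageStep:
    """One hop in a data element's lineage (hypothetical structure)."""
    system: str            # where the value was observed
    received_from: str     # the upstream source it came from
    observed_at: datetime  # when it arrived at this system
    value: str             # the value as seen at this step


def is_valid_currency_code(value: str) -> bool:
    """Hypothetical steward rule: exactly three uppercase ASCII letters."""
    return len(value) == 3 and value.isascii() and value.isalpha() and value.isupper()


def pedigree_report(
    steps: List[LineageStep], is_correct: Callable[[str], bool]
) -> List[Tuple[str, bool]]:
    """Apply the same correctness rule at every step in the lineage."""
    return [(step.system, is_correct(step.value)) for step in steps]


def first_break(
    steps: List[LineageStep], is_correct: Callable[[str], bool]
) -> Optional[str]:
    """Return the first system where the value stops following the rule."""
    for step in steps:
        if not is_correct(step.value):
            return step.system
    return None


# Lineage: the "from" and "where" of a single data element.
trace = [
    LineageStep("intake", "customer_form", datetime(2024, 1, 5, 9, 0), "USD"),
    LineageStep("ledger", "intake", datetime(2024, 1, 5, 9, 5), "USD"),
    LineageStep("reporting", "ledger", datetime(2024, 1, 6, 1, 0), "usd"),
]

# Pedigree: is the value still correct at each step?
report = pedigree_report(trace, is_valid_currency_code)
broken_at = first_break(trace, is_valid_currency_code)
```

In this made-up trace, the lineage is intact from intake to reporting, but the pedigree breaks at the reporting step, where the value no longer follows the rule. Lineage alone would not have flagged that.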
Of course, you cannot determine the pedigree of a data element if you do not have a way to define the correctness of the data element. This is a bit of a paradox, as data can only be determined to be correct if somebody defines what correctness is. And in most cases, the definition of correctness is the one provided by the data stewards of the organization. In other words, data is correct because this is how the data stewards, or an analogous function, defined it.
There are some instances where correctness of data has been defined by outside agencies. As an example, external organizations define identifiers such as CUSIP numbers, ZIP Codes, and SIC codes, but in many cases, and for most data elements, correctness is whatever the data steward says it is. To summarize, a data element has pedigree when it follows the “standard of the breed.” The desired state is that the syntax and semantics of the data element are consistent throughout the lineage of the data.
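As an illustration of correctness defined outside the organization, the sketch below checks the syntax of a CUSIP number using the publicly documented modulus-10 "double-add-double" check digit scheme. The function names are my own, and this covers only the check digit, not whether the identifier has actually been issued, so treat it as a sketch and not as a substitute for the official CUSIP specification.

```python
# Values assigned to the three special characters allowed in a CUSIP.
SPECIAL_VALUES = {"*": 36, "@": 37, "#": 38}


def cusip_check_digit(base: str) -> int:
    """Compute the check digit from the first 8 characters of a CUSIP."""
    if len(base) != 8:
        raise ValueError("expected the first 8 characters of a CUSIP")
    total = 0
    for position, char in enumerate(base.upper(), start=1):
        if "0" <= char <= "9":
            value = int(char)
        elif "A" <= char <= "Z":
            value = ord(char) - ord("A") + 10  # A=10, B=11, ..., Z=35
        elif char in SPECIAL_VALUES:
            value = SPECIAL_VALUES[char]
        else:
            raise ValueError(f"invalid CUSIP character: {char!r}")
        if position % 2 == 0:
            value *= 2  # double every second character's value
        total += value // 10 + value % 10  # add the digits of the value
    return (10 - total % 10) % 10


def is_valid_cusip(cusip: str) -> bool:
    """Check length, character set, and check digit of a 9-character CUSIP."""
    if len(cusip) != 9 or not ("0" <= cusip[-1] <= "9"):
        return False
    try:
        return cusip_check_digit(cusip[:8]) == int(cusip[-1])
    except ValueError:
        return False
```

A rule like `is_valid_cusip` is one the data steward adopts, not one the steward invents. Applying it at every step of the lineage is what keeps the syntax of the element consistent from entry to exit.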
In my next blog post, I’ll be covering another important topic in data governance: The Issue of Multiple Sources of Data Entry in a Complex Data Ecosystem.