C.2.4. Data Integration

Unifying or relating different information concepts.

The capability Data Integration (C.2.4) is part of the capability area Data Architecture in the Data Pillar.

Data integration is the process of combining data from different sources into a single unified view for business consumption and enhanced utility. The process begins with ingestion and may include activities such as data profiling, cleansing/remediation, cross-referencing, transformation, and field mapping.
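The steps named above can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the source systems (CRM, billing), the field names, and the mapping tables are all hypothetical.

```python
# Hypothetical records ingested from two source systems.
crm_records = [{"cust_id": "C1", "full_name": "Ada Lovelace ", "country": "UK"}]
billing_records = [{"customer": "C1", "balance": "120.50"}]

# Field mappings: align each source's fields to one unified schema.
CRM_MAP = {"cust_id": "customer_id", "full_name": "name", "country": "country"}
BILLING_MAP = {"customer": "customer_id", "balance": "balance"}

def cleanse(value):
    """Trivial cleansing/remediation step: trim whitespace on strings."""
    return value.strip() if isinstance(value, str) else value

def remap(record, mapping):
    """Apply field mapping and cleansing to one source record."""
    return {target: cleanse(record[source]) for source, target in mapping.items()}

def integrate(sources):
    """Cross-reference records from all sources on customer_id into one view."""
    unified = {}
    for records, mapping in sources:
        for record in records:
            row = remap(record, mapping)
            unified.setdefault(row["customer_id"], {}).update(row)
    return unified

view = integrate([(crm_records, CRM_MAP), (billing_records, BILLING_MAP)])
```

After integration, `view["C1"]` holds one record combining the cleansed CRM attributes with the billing balance — the "single unified view" the text describes.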

A dataset represents a coherent and distinct data asset, or data product, including unstructured data.

A dataset is a logical view on the EKG for a specific purpose and may not directly represent physical storage or a source of data. It may encompass subsetting (shapes, projection or selection) of a physical data store: e.g. for ontologies, specific classes and properties; for relational databases, specific rows and columns as well as tables. The same data element could be exposed through many datasets (grouped/filtered differently, subject to different access controls).
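The idea that one physical store can back several logical datasets with different projections, selections and access controls can be sketched as follows; the table, column names and rows are hypothetical.

```python
# One hypothetical physical table.
physical_store = [
    {"id": 1, "name": "Alice", "salary": 90000, "dept": "HR"},
    {"id": 2, "name": "Bob", "salary": 80000, "dept": "IT"},
]

def dataset_view(rows, columns, predicate=lambda r: True):
    """A logical dataset: a projection (columns) plus a selection (predicate)
    over physical rows -- no copy of the underlying store is implied."""
    return [{c: r[c] for c in columns} for r in rows if predicate(r)]

# Broadly shared dataset: salary column projected away (access control).
directory = dataset_view(physical_store, ["id", "name", "dept"])

# Restricted dataset: salary included, but only HR rows selected.
hr_pay = dataset_view(physical_store, ["id", "salary"], lambda r: r["dept"] == "HR")
```

Both views expose the same underlying data elements, grouped and filtered differently, which is exactly the multi-dataset exposure the paragraph describes.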

Datasets are in general usage outside specific enterprises:

  • governments maintain public data catalogs, e.g. US data.gov, with many datasets, including published statistics;
  • datasets are cataloged by cloud providers (Google, Amazon) and commercial data publishers (Bloomberg, FactSet etc.) and may be available by download or API;
  • Wikidata;
  • scientific datasets are submitted with papers to represent experimental data.

Dataset Metadata

The dataset metadata contains information about

  • what data exists,
  • what it means (ontologies for the dataset)
  • cross-references e.g. to related datasets
    • family of datasets
    • vocabularies
    • ontologies
  • where it resides (“data-at-rest”)
  • format(s)
  • how to access (UI, APIs, queries, reports etc)
  • usage permission e.g. approved authoritative source
  • link to data sharing agreements
  • responsible parties
  • lifecycle/maintenance/approval process
  • retention and records management: legal requirements to both retain (legal hold) and delete (to avoid discovery). Requirements are jurisdiction-specific.
  • licensing (how the data can be used and by whom; pricing)
  • data sharing agreements
  • Snapshot vs dynamically updated (snapshot possibly for legal reasons)
  • Compliance with FAIR principles (encompassed by EKG Principles)
  • accessibility/security
  • privacy (especially for personal data, need for masking/encryption)
  • sensitivity (e.g. financial)
  • upstream/downstream usage (lineage)
  • how it moves (“data in motion”)
  • classifications/tagging
  • quality metrics
  • usage metrics (frequency of access, update, concentration of access)
  • “data temperature” (frequency of access, change/volatility) - may determine storage media, in memory vs archive
  • availability and other useful metrics (volumetrics)
  • provenance e.g. source systems, lineage, derivation, processing, machine learning, simulation
  • datasets usable for ML training data (check semantic tagging for relational data)
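A concrete metadata record covering a subset of the fields above might look like the following. This is a hedged illustration loosely modeled on the W3C DCAT vocabulary; every value, URL and field name is hypothetical.

```python
# Illustrative dataset metadata record (DCAT-inspired field names).
dataset_metadata = {
    "identifier": "urn:example:dataset:customer-master",
    "title": "Customer Master",
    "description": "Unified customer view integrated from CRM and billing.",
    "ontologies": ["https://example.org/ontology/customer"],      # what it means
    "distribution": {                                             # format and access
        "format": "text/turtle",
        "accessURL": "https://example.org/sparql",
    },
    "accrualPeriodicity": "daily",          # snapshot vs dynamically updated
    "publisher": "Customer Data Office",    # responsible party
    "license": "internal-use-only",         # how the data can be used, by whom
    "provenance": ["crm-prod", "billing-prod"],           # source systems / lineage
    "classification": ["personal-data", "confidential"],  # privacy / sensitivity
    "retention": "7y",                      # jurisdiction-specific requirement
}
```

Publishing such a record alongside the data itself is one way to move toward the Self-describing Dataset (SDD) mentioned below.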

See also Self-describing Dataset (SDD).

Warn

Work in progress: these are just the results of an initial brainstorm session and still need to be worked out.

  1. Are all your markets well-defined?
    • Which products and services are sold per market?
  2. Is the geospatial size of your markets known?
  3. Is the demographic / psycho-graphic segmentation of your markets known?
  4. Is the value-proposition per market, per product or service well-defined, communicated and sold?
    • (Porter value chain)
  5. Are cohorts identified?
  6. Are trade-promotions optimized?
    • Optimizing “supply-chain to shelf”
    • Are opportunities for optimization known, and are there processes in place to continuously improve and optimize your value propositions?
  7. Are you leveraging any means available to segment your market into target groups and cohorts?
    • Do you have an interest-graph of all of your customers?
  8. Is your competition known, along with competitive advantages, USPs, opportunities and threats?

Warn

Work in progress, describe the 5 maturity levels of this capability

Contribution to the EKG

The EKG can bring together all available internal and external information about customers, products, services, competition, sales volumes, customer requirements and many other details, forming a holistic, realistic and near-real-time view of the company's position in its markets.

Contribution to the Enterprise

Having the proper Market Segments defined will help with the selection, definition and prioritization of the right use cases for the EKG.

Warn

Work in progress, describe how this capability is possibly being delivered today in a non-EKG context and optionally what the issues are that EKG could or should improve

Warn

Work in progress, describe how this capability would be delivered or supported using an EKG approach, making the link to the "how" i.e. the EKG/Method.

Warn

Work in progress, list examples of use cases that contribute to this capability, making the link to use cases in the catalog at https://catalog.ekgf.org/use-case/..
