C.2.4. Data Integration

Unifying or relating different information concepts.

The capability Data Integration (C.2.4) is part of the capability area Data Architecture in the Data Pillar.

Data integration is the process of combining data from different sources into a single unified view for business consumption and enhanced utility. The process begins with ingestion and may include activities such as data profiling, cleansing/remediation, cross-referencing, transformation, and field mapping.
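The steps named above can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the source systems (CRM, billing), the field names, and the mapping tables are all hypothetical.

```python
# Hypothetical records ingested from two source systems.
crm_records = [{"cust_id": "C1", "full_name": "Ada Lovelace ", "country": "UK"}]
billing_records = [{"customer": "C1", "balance": "120.50"}]

# Field mappings: align each source's fields to one unified schema.
CRM_MAP = {"cust_id": "customer_id", "full_name": "name", "country": "country"}
BILLING_MAP = {"customer": "customer_id", "balance": "balance"}

def cleanse(value):
    """Trivial cleansing/remediation step: trim whitespace on strings."""
    return value.strip() if isinstance(value, str) else value

def remap(record, mapping):
    """Apply field mapping and cleansing to one source record."""
    return {target: cleanse(record[source]) for source, target in mapping.items()}

def integrate(sources):
    """Cross-reference records from all sources on customer_id into one view."""
    unified = {}
    for records, mapping in sources:
        for record in records:
            row = remap(record, mapping)
            unified.setdefault(row["customer_id"], {}).update(row)
    return unified

view = integrate([(crm_records, CRM_MAP), (billing_records, BILLING_MAP)])
```

After integration, `view["C1"]` holds one record combining the cleansed CRM attributes with the billing balance — the "single unified view" the text describes.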

A dataset represents a coherent and distinct data asset, or data product, including unstructured data.

A dataset is a logical view on the EKG for a specific purpose and may not directly represent physical storage or a source of data. It may encompass subsetting (shapes, projection or selection) of a physical data store: e.g. for ontologies, specific classes and properties; for relational databases, specific rows and columns as well as tables. The same data element could be exposed through many datasets (grouped/filtered differently, subject to different access controls).
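The idea that one physical store can back several logical datasets with different projections, selections and access controls can be sketched as follows; the table, column names and rows are hypothetical.

```python
# One hypothetical physical table.
physical_store = [
    {"id": 1, "name": "Alice", "salary": 90000, "dept": "HR"},
    {"id": 2, "name": "Bob", "salary": 80000, "dept": "IT"},
]

def dataset_view(rows, columns, predicate=lambda r: True):
    """A logical dataset: a projection (columns) plus a selection (predicate)
    over physical rows -- no copy of the underlying store is implied."""
    return [{c: r[c] for c in columns} for r in rows if predicate(r)]

# Broadly shared dataset: salary column projected away (access control).
directory = dataset_view(physical_store, ["id", "name", "dept"])

# Restricted dataset: salary included, but only HR rows selected.
hr_pay = dataset_view(physical_store, ["id", "salary"], lambda r: r["dept"] == "HR")
```

Both views expose the same underlying data elements, grouped and filtered differently, which is exactly the multi-dataset exposure the paragraph describes.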

Datasets are in general usage outside specific enterprises:

  • governments maintain public data catalogs, e.g. US data.gov, with many datasets, including published statistics;
  • datasets are cataloged by cloud providers (Google, Amazon) and commercial data publishers (Bloomberg, FactSet etc.) and may be available by download or API;
  • Wikidata;
  • scientific datasets are submitted with papers to represent experimental data.

Dataset Metadata

The dataset metadata contains information about

  • what data exists,
  • what it means (ontologies for the dataset)
  • cross-references e.g. to related datasets
    • family of datasets
    • vocabularies
    • ontologies
  • where it resides (“data-at-rest”)
  • format(s)
  • how to access (UI, APIs, queries, reports etc)
  • usage permission e.g. approved authoritative source
  • link to data sharing agreements
  • responsible parties
  • lifecycle/maintenance/approval process
  • retention and records management: legal requirements to both retain (legal hold) and delete (to avoid discovery). Requirements are jurisdiction-specific.
  • licensing (how the data can be used and by whom; pricing)
  • data sharing agreements
  • Snapshot vs dynamically updated (snapshot possibly for legal reasons)
  • Compliance with FAIR principles (encompassed by EKG Principles)
  • accessibility/security
  • privacy (especially for personal data, need for masking/encryption)
  • sensitivity (e.g. financial)
  • upstream/downstream usage (lineage)
  • how it moves (“data in motion”)
  • classifications/tagging
  • quality metrics
  • usage metrics (frequency of access, update, concentration of access)
  • “data temperature” (frequency of access, change/volatility) - may determine storage media, in memory vs archive
  • availability and other useful metrics (volumetrics)
  • provenance e.g. source systems, lineage, derivation, processing, machine learning, simulation
  • datasets usable for ML training data (check semantic tagging for relational data)
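A concrete metadata record covering a subset of the fields above might look like the following. This is a hedged illustration loosely modeled on the W3C DCAT vocabulary; every value, URL and field name is hypothetical.

```python
# Illustrative dataset metadata record (DCAT-inspired field names).
dataset_metadata = {
    "identifier": "urn:example:dataset:customer-master",
    "title": "Customer Master",
    "description": "Unified customer view integrated from CRM and billing.",
    "ontologies": ["https://example.org/ontology/customer"],      # what it means
    "distribution": {                                             # format and access
        "format": "text/turtle",
        "accessURL": "https://example.org/sparql",
    },
    "accrualPeriodicity": "daily",          # snapshot vs dynamically updated
    "publisher": "Customer Data Office",    # responsible party
    "license": "internal-use-only",         # how the data can be used, by whom
    "provenance": ["crm-prod", "billing-prod"],           # source systems / lineage
    "classification": ["personal-data", "confidential"],  # privacy / sensitivity
    "retention": "7y",                      # jurisdiction-specific requirement
}
```

Publishing such a record alongside the data itself is one way to move toward the Self-describing Dataset (SDD) mentioned below.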

See also Self-describing Dataset (SDD).

Warn

Work in progress: these are just the results of an initial brainstorm session and still need to be worked out.

  1. Are all your markets well-defined?
    • Which products and services are sold per market?
  2. Is the geospatial size of your markets known?
  3. Is the demographic / psycho-graphic segmentation of your markets known?
  4. Is the value-proposition per market, per product or service well-defined, communicated and sold?
    • (Porter value chain)
  5. Are cohorts identified?
  6. Are trade-promotions optimized?
    • Optimizing “supply-chain to shelf”
    • Are opportunities for optimization known, and are there processes in place to continuously improve and optimize your value propositions?
  7. Are you leveraging any means available to segment your market into target groups and cohorts?
    • Do you have an interest-graph of all of your customers?
  8. Is your competition known, along with competitive advantages, USPs, opportunities and threats?

Warn

Work in progress, describe the 5 maturity levels of this capability

Contribution to the EKG

The EKG can bring together all available internal and external information about customers, products, services, competition, sales volumes, customer requirements and many other details, forming a holistic, realistic and near-real-time view of the company's position in its markets.

Contribution to the Enterprise

Having the proper Market Segments defined will help with the selection, definition and prioritization of the right use cases for the EKG.

Warn

Work in progress, describe how this capability is possibly being delivered today in a non-EKG context and optionally what the issues are that EKG could or should improve

Warn

Work in progress, describe how this capability would be delivered or supported using an EKG approach, making the link to the "how" i.e. the EKG/Method.

Warn

Work in progress, list examples of use cases that contribute to this capability, making the link to use cases in the catalog at https://catalog.ekgf.org/use-case/..
