
Background and intro

A dataset represents a coherent and distinct data asset, or data product, including unstructured data.

A dataset is a logical view on the EKG for a specific purpose and may not directly correspond to physical storage or a source of data. It may encompass a subset (a shape, projection, or selection) of a physical data store: for ontologies, specific classes and properties; for relational databases, specific tables, rows, and columns. The same data element can be exposed through many datasets, grouped and filtered differently and subject to different access controls.
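
To make the logical-view idea concrete, here is a minimal Python sketch, with all names invented for illustration, of a dataset defined by a projection (columns) and a selection (row predicate) over one physical table. The same physical rows are exposed through two datasets with different shapes and access controls:

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical sketch: a dataset as a logical view over one physical
# table, defined by a projection (columns) and a selection (row
# predicate), plus an access-control list. All names are invented.
Row = dict[str, object]

@dataclass
class DatasetView:
    name: str
    columns: list[str]                    # projection: which columns
    predicate: Callable[[Row], bool]      # selection: which rows
    allowed_roles: set[str] = field(default_factory=set)

    def materialize(self, source: list[Row], role: str) -> list[Row]:
        if role not in self.allowed_roles:
            raise PermissionError(f"role {role!r} may not read {self.name!r}")
        return [{c: row[c] for c in self.columns}
                for row in source if self.predicate(row)]

# One physical store ...
trades = [
    {"id": 1, "desk": "FX", "notional": 5_000_000, "trader": "a.smith"},
    {"id": 2, "desk": "Rates", "notional": 12_000_000, "trader": "b.jones"},
]

# ... exposed through two datasets with different shapes and controls.
fx_summary = DatasetView("fx-trades-summary", ["id", "desk"],
                         lambda r: r["desk"] == "FX", {"analyst", "auditor"})
risk_full = DatasetView("all-trades-risk", ["id", "desk", "notional"],
                        lambda r: True, {"risk"})

print(fx_summary.materialize(trades, "analyst"))  # [{'id': 1, 'desk': 'FX'}]
```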

Datasets are in general usage outside specific enterprises:

  • governments maintain public data catalogs, e.g. US data.gov, listing many datasets, including published statistics
  • cloud providers (Google, Amazon) and commercial data publishers (Bloomberg, FactSet, etc.) catalog datasets, which may be available by download or API
  • Wikidata
  • scientific datasets submitted with papers to represent experimental data
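
As a concrete illustration of programmatic access to one of these catalogs: data.gov is built on CKAN, so its dataset metadata can be searched via CKAN's standard action API. The endpoint and response shape below are standard CKAN, but this is a sketch; verify against the live catalog before depending on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Search catalog.data.gov (CKAN) for dataset metadata. The
# package_search action is part of CKAN's standard API; the exact
# metadata fields available per dataset vary by publisher.
query = urlencode({"q": "unemployment statistics", "rows": 3})
url = f"https://catalog.data.gov/api/3/action/package_search?{query}"

with urlopen(url) as resp:
    body = json.load(resp)

for pkg in body["result"]["results"]:
    print(pkg["title"], "-", pkg.get("notes", "")[:60])
```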

Dataset Metadata

The dataset metadata contains information about the following (a minimal sketch of such a record follows the list):

  • what data exists
  • what it means (ontologies for the dataset)
  • cross-references e.g. to related datasets
    • family of datasets
    • vocabularies
    • ontologies
  • where it resides (“data-at-rest”)
  • format(s)
  • how to access (UI, APIs, queries, reports etc)
  • usage permission e.g. approved authoritative source
  • link to data sharing agreements
  • responsible parties
  • lifecycle/maintenance/approval process
  • retention and records management: legal requirements to both retain (legal hold) and delete (to avoid discovery). Requirements are jurisdiction-specific.
  • licensing (how the data can be used and by whom; pricing)
  • snapshot vs dynamically updated (snapshot possibly for legal reasons)
  • compliance with FAIR principles (encompassed by EKG Principles)
  • accessibility/security
  • privacy (especially for personal data, need for masking/encryption)
  • sensitivity (e.g. financial)
  • upstream/downstream usage (lineage)
  • how it moves (“data in motion”)
  • classifications/tagging
  • quality metrics
  • usage metrics (frequency of access, update, concentration of access)
  • “data temperature” (frequency of access, change/volatility) - may determine storage media, in memory vs archive
  • availability and other useful metrics (volumetrics)
  • provenance e.g. source systems, lineage, derivation, processing, machine learning, simulation
  • usability as ML training data (check semantic tagging for relational data)
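
Pulling these facets together, below is a minimal, hypothetical sketch of a dataset metadata record as a Python dataclass. Field names and example values are invented for illustration; a production catalog would more likely express them in a standard vocabulary such as W3C DCAT (dcat:Dataset, dcterms:license, prov:wasDerivedFrom, and so on):

```python
from dataclasses import dataclass, field
from datetime import date

# Hypothetical record covering a subset of the metadata facets listed
# above. All field names and example values are invented; a real
# catalog would map these onto a standard vocabulary such as DCAT.
@dataclass
class DatasetMetadata:
    identifier: str
    description: str                      # what data exists
    ontologies: list[str]                 # what it means
    related_datasets: list[str]           # cross-references
    storage_location: str                 # where it resides (data-at-rest)
    formats: list[str]
    access_endpoints: dict[str, str]      # UI, API, query endpoints
    authoritative: bool                   # approved authoritative source?
    sharing_agreements: list[str]         # links to agreements
    owners: list[str]                     # responsible parties
    retention_until: date | None          # jurisdiction-specific
    license: str
    snapshot: bool                        # snapshot vs dynamically updated
    classifications: list[str] = field(default_factory=list)
    quality_score: float | None = None
    lineage: list[str] = field(default_factory=list)  # provenance

example = DatasetMetadata(
    identifier="ekg:dataset/trades-eod",
    description="End-of-day trade positions",
    ontologies=["https://spec.edmcouncil.org/fibo/"],
    related_datasets=["ekg:dataset/trades-intraday"],
    storage_location="s3://warehouse/trades/eod/",
    formats=["parquet", "ttl"],
    access_endpoints={"sparql": "https://example.org/sparql"},
    authoritative=True,
    sharing_agreements=["ekg:agreement/risk-2024"],
    owners=["trading-data-office"],
    retention_until=date(2031, 12, 31),
    license="internal-use-only",
    snapshot=True,
)
print(example.identifier, "->", example.storage_location)
```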

See also Self-describing Dataset (SDD).