PROVENANCE
Abdul Saboor
Department of Computer Science
Software Engineering Research Group, Berlin, Germany
Welcome to this Presentation
Agenda
- What is Provenance?
- Why is Provenance important, and what are its two major strands?
- Provenance and Linked Data
- Provenance Data Model
- Provenance Vocabularies
- The Open Provenance Model
- Provenance Data Quality Assessment
- Summary: Scientific and Technical Challenges of Provenance
What is Provenance?
Provenance
- Recording the history of data and its place of origin
Provenance Dictionary Definitions
- Merriam-Webster online dictionary – origin, source
- Oxford English Dictionary – the place of origin or earliest known history
of something; origin, derivation.
Provenance Definitions
1. Provenance refers to the source of information, such as the entities and
processes involved in producing or delivering an artifact. (Yolanda)
2. Provenance is a description of how things came to be, and how they came
to be in the state they are in today. Statements about provenance can
themselves be considered to have provenance. (Jim M)
Provenance
Working Definitions
- Provenance
of a resource is a record that describes entities and processes involved
in producing and delivering or otherwise influencing that resource.
Provenance provides a critical foundation for assessing authenticity,
enabling trust, and allowing reproducibility. Provenance assertions
are a form of contextual metadata and can themselves become important
records with their own provenance. (W3C)
Provenance
Web Definition
4. On
the web, provenance would include information about the creation and
publication of web resources as well as information about access of
those resources, and activities related to their discussion, linking,
and reuse.
Provenance
Definitions
5. Provenance is documentation of the set of artifacts, processes, and
agents that have caused an artifact to be, and of the contexts of these
entities. Provenance provides a critical foundation for assessing
authenticity, enabling trust, and allowing reproducibility; assertions of
provenance can themselves become important records with their own
provenance. (Jim M)
What kind of History?
- Data Creator / Data Publisher
- Data Creation Date
- Data Modifier & Modification Date
- Data Description
- Etc.
Why is Provenance Important?
The need for provenance in data integration and reuse
- Data comes from various diverse data sources
Two major strands of Provenance
Data and Workflow Provenance
Data Provenance
Information describing how data has moved through a network of databases
is referred to as "fine-grain" or "data" provenance. Fine-grain provenance
can be further categorized into where-, how- and why-provenance. A query
execution simply copies data elements from some source to some target
database; where-provenance identifies the source elements from which the
data in the target was copied. Why-provenance provides the justification
for the data elements appearing in the output, and how-provenance describes
how parts of the input influenced certain parts of the output.
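The three kinds of fine-grain provenance can be illustrated with a minimal Python sketch. The two tables, the row identifiers and the `join_with_provenance` helper are hypothetical and exist only for this illustration. Writing how-provenance as a product of source-row identifiers is a common convention for joins and is an assumption here, not something stated on the slide.

```python
from typing import NamedTuple

class Cell(NamedTuple):
    value: str
    where: str  # where-provenance: source location the value was copied from

# Hypothetical source tables, keyed by row identifier.
EMPLOYEES = {"e1": {"name": "Ada", "dept": "d1"},
             "e2": {"name": "Bob", "dept": "d2"}}
DEPARTMENTS = {"d1": {"title": "Research"},
               "d2": {"title": "Sales"}}

def join_with_provenance():
    """Join employees with departments and annotate every output row."""
    rows = []
    for eid, emp in EMPLOYEES.items():
        did = emp["dept"]
        dept = DEPARTMENTS.get(did)
        if dept is None:
            continue  # no matching department, so no output row
        rows.append({
            "name": Cell(emp["name"], f"employees/{eid}/name"),
            "title": Cell(dept["title"], f"departments/{did}/title"),
            # why-provenance: the source rows that justify this output row
            "why": frozenset({f"employees/{eid}", f"departments/{did}"}),
            # how-provenance: how the source rows were combined (join = product)
            "how": f"{eid} * {did}",
        })
    return rows

for row in join_with_provenance():
    print(row["name"].value, row["title"].value, "copied from", row["name"].where)
```

Each output cell carries its own where-provenance, while why- and how-provenance are kept per output row, which mirrors the distinction drawn in the paragraph above.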
Workflow Provenance
Information describing how derived data has been calculated from raw
observations is referred to as "coarse-grain" or "workflow" provenance.
The widespread use of workflow tools for processing scientific data
facilitates the capture of provenance information. The workflow process
describes all the steps involved in producing a given data set and hence
captures its provenance information.
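As a rough sketch of how a workflow tool can capture coarse-grain provenance, the Python snippet below runs a list of steps and logs what each step consumed and produced. The `run_workflow` function and the two example steps are hypothetical; real workflow systems record far more detail.

```python
def run_workflow(raw, steps):
    """Run each (name, function) step in order and log its input and output."""
    log = []
    data = raw
    for name, func in steps:
        result = func(data)
        log.append({"step": name, "input": data, "output": result})
        data = result
    return data, log

# Hypothetical two-step workflow over raw observations.
STEPS = [
    ("drop_missing", lambda xs: [x for x in xs if x is not None]),
    ("mean", lambda xs: sum(xs) / len(xs)),
]

result, provenance = run_workflow([1.0, None, 3.0], STEPS)
for entry in provenance:
    print(entry["step"], entry["input"], "->", entry["output"])
```

The log is the workflow provenance of `result`: it lists every step between the raw observations and the derived value.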
Provenance
Dimensions - 1
Content
of Provenance Information
- Attribution - provenance as the sources
or entities that were used to create a new result
- Responsibility
- knowing who endorses a particular piece of information or result
- Origin - recorded
vs reconstructed, verified vs non-verified, asserted vs inferred
- Process - provenance as the process
that yielded an artifact
- Reproducibility
(e.g. workflows, mashups, text extraction)
- Data Access
(e.g. access time, accessed server, party responsible for accessed server)
- Evolution
and versioning
- Republishing
(e.g. re-tweeting, re-blogging, re-publishing)
- Updates (e.g.
a document with content from various sources and that changes over time)
- Justification
for decisions –
Includes argumentation, hypotheses, why-not questions
- Entailment - given the results to a particular
query, what tuples led to those results
Provenance
Dimensions - 2
Management
of Provenance Information
- Publication - Making provenance information
available (expose, distribute)
- Access - Finding and querying provenance
information
- Dissemination
control – Track
policies specified by creator for when/how an artifact can be used
- Access
Control - incorporate
access control policies to access provenance information
- Licensing - stating what rights the object
creators and users have based on provenance
- Law
enforcement (e.g.
enforcing privacy policies on the use of personal information)
- Scale - how to operate with large
amounts of provenance information
Use of
Provenance Information
- Understanding - End user consumption of provenance
- abstraction,
multiple levels of description, summary
- presentation,
visualization
Provenance
Dimensions - 3
- Interoperability - combining provenance produced
by multiple different systems
- Comparison - finding what is common in
the provenance of two or more entities
- Accountability - the ability to check the provenance
of an object with respect to some expectation
- Verification
- of a set of requirements
- Compliance
- with a set of policies
- Trust - making trust judgments based
on provenance
- Information
quality - choosing among competing evidence from diverse sources (e.g.
linked data use cases)
- Incorporating
reputation and reliability ratings with attribution information
- Imperfections - reasoning about provenance
information that is not complete or correct
- Incomplete
provenance
- Uncertain/probabilistic
provenance
- Erroneous
provenance
- Fraudulent
provenance
- Debugging
Web of Data
Adapted from Cetinia,
iSOCO Innovation Lab, J.M.G Perez, Provenance: eScience to the Web of
Data, 11/09
The Linked
Data Paradigm
- How can we
exploit all the available data?
- Data can be reused and remixed
- Common flexible
and usable APIs
- Standard vocabularies
to describe interlinked datasets
- Understand
the Semantic Web vision
Provenance and Linked Data
- Provenance provides the ability to
- Trace the sources of various kinds of data
- Enable the exploration of relationships between datasets, their authors
and affiliations
- Provenance analysis provides insight into how data is produced and
exploited
- Provenance creates a notion of information quality
- Is a certain
dataset consistent and up to date?
- Is the connection
between two datasets meaningful?
- Is a given
dataset relevant for a particular domain?
- Provenance
to establish information trustworthiness
- Provenance
to provide data views relating to some criteria
The Provenance
Data Model
Institutional
Level
Experimental
Protocol Level
Data
Analysis and Significance Level
Dataset
Description Level
Metadata associated with the origin in terms of its data attributes (e.g.
AuthorName, Title, URL, etc.)
The origin of datasets (e.g. history area, region, organisation or
institution)
The statistical analysis methodology applied to datasets for selecting
relevant attributes (e.g. whether datasets are divided into parts, output
values, versions, etc.)
Who published the datasets. The vocabularies of interlinked datasets, such
as Dublin Core, VoID, PRV, etc.
The
Provenance Related Vocabularies
- DC – Dublin Core
- FOAF – Friend of a Friend
- SIOC – Semantically-Interlinked Online Communities
- WOT – Web of Trust Schema
- OMV – Ontology Metadata Vocabulary
- SWP – Semantic Web Publishing
- VoID – Vocabulary of Interlinked Datasets
- PRV – Provenance Vocabulary
- PML – Proof Markup Language
- PAV – SWAN provenance ontology
- OUZO – Provenance ontology
- CS – Changeset Vocabulary
- Etc.
Provenance
Related Metadata
Provenance-related metadata is either directly attached to a data item or
its host document, or it is available as additional data on the web.
For example, attached metadata includes RDF statements about the RDF graph
that contains those statements, the AuthorName and creation date of blog
entries added to a syndication feed, or information about an image.
Detached metadata can be represented in RDF using vocabularies.
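A minimal Python sketch of such RDF metadata, assuming a hypothetical blog entry and author under `example.org`. The `creator` and `created` properties are Dublin Core terms; the `to_ntriples` helper is an illustration only and handles just IRIs and plain string literals.

```python
DCTERMS = "http://purl.org/dc/terms/"
ENTRY = "http://example.org/blog/entry-1"  # hypothetical blog entry

def to_ntriples(triples):
    """Serialize (subject, predicate, object) triples as N-Triples lines.

    Objects that look like HTTP(S) IRIs are written as IRIs, everything
    else as a plain string literal (a simplification for this sketch).
    """
    lines = []
    for subject, predicate, obj in triples:
        if obj.startswith(("http://", "https://")):
            rendered = f"<{obj}>"
        else:
            escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
            rendered = f'"{escaped}"'
        lines.append(f"<{subject}> <{predicate}> {rendered} .")
    return "\n".join(lines)

metadata = [
    (ENTRY, DCTERMS + "creator", "http://example.org/people/alice"),
    (ENTRY, DCTERMS + "created", "2010-11-05"),
]

print(to_ntriples(metadata))
```

Whether these statements count as attached or detached depends only on where they are published: inside the document that describes the entry, or as separate data elsewhere on the web.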
A Provenance Architecture for the Web of Data
Authoritative agencies are required to certify data provenance and keep it
secure
Application Layer
Adapted from Cetinia,
iSOCO Innovation Lab, J.M.G Perez, Provenance: eScience to the Web of
Data, 11/09
Main
Action Points
Provenance
Vocabularies
Represent
and reason with trust and information quality
Extend
emerging Linked data vocabularies
VOiD
Awareness of
Data Providers
W3C
Provenance Incubator Group
Linked
Data Standards
(VOiD)
Tools for Data
Providers
Generalization
of Provenance Metadata
Provenance
Authoritative Agencies
Provenance
Visualization
Adapted from Cetinia,
iSOCO Innovation Lab, J.M.G Perez, Provenance: eScience to the Web of
Data, 11/09
The Open Provenance Model
- The Open Provenance Model (OPM) describes how data is produced or
transformed into a new state. It can also represent the transition of one
or more data items from an old state to a new state.
- OPM is a graph model for provenance: the edges of the graph denote the
relationships between the occurrences represented by the nodes.
- The main purpose of OPM is to support the assessment of various data
qualities such as reliability, accuracy and timeliness.
OPM classifies nodes into three types
Artifacts
Artifacts are pieces of data of fixed value and context that may represent
an entity in a given state. Edges can also carry annotations that provide
information on how one occurrence caused another.
Processes
Processes are performed on artifacts in order to produce other artifacts.
Agents
Agents are the entities that control a process, such as a user.
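The three node types and their causal edges can be sketched in a few lines of Python. The edge names `used`, `wasGeneratedBy` and `wasControlledBy` come from the OPM specification; the `Node` and `ProvenanceGraph` classes and the example file names are hypothetical and only meant as an illustration.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Node:
    kind: str  # "artifact", "process", or "agent"
    name: str

@dataclass
class ProvenanceGraph:
    edges: list = field(default_factory=list)  # (effect, relation, cause)

    def add(self, effect, relation, cause):
        self.edges.append((effect, relation, cause))

    def lineage(self, node):
        """Return every node that `node` causally depends on."""
        seen, stack = set(), [node]
        while stack:
            current = stack.pop()
            for effect, _, cause in self.edges:
                if effect == current and cause not in seen:
                    seen.add(cause)
                    stack.append(cause)
        return seen

raw = Node("artifact", "raw.csv")
clean = Node("process", "clean")
report = Node("artifact", "clean.csv")
alice = Node("agent", "alice")

graph = ProvenanceGraph()
graph.add(clean, "used", raw)
graph.add(report, "wasGeneratedBy", clean)
graph.add(clean, "wasControlledBy", alice)

print(sorted(n.name for n in graph.lineage(report)))
```

Edges point from effect to cause, so following them from an artifact walks back through the process that generated it, the artifacts that process used, and the agent that controlled it.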
Model of Web Data Provenance
Provenance Graph – describes the provenance of data items:
- Nodes: provenance elements (pieces of provenance information)
- Edges: relate provenance elements to each other
- Sub-graphs: related data items, if possible
Main
Focus of Provenance of Web Data
- Types of Provenance
elements (roles)
- Relationship
between those elements
Adapted from Olaf Hartig, Humboldt University Berlin, Provenance Information
in the Web of Data, 04/09
Provenance
Data Quality Assessment
The Quality of Information
- The main objective is assessing the quality of datasets
- Quality of datasets from a multidimensional perspective
- Relevance of criteria is determined by preferences and by the tasks
performed on the available datasets
Provenance
Data Quality
- Data Trustworthiness
- Data Authenticity
- Data Reliability
- Dimensions
of Believability
- Trustworthiness
of source
- Data Lineage
– The origin of data
- Related Artifacts
and actors
- Reasonableness
of data
- Possibility – The extent to which a data value is possible
- Consistency – The extent to which a data value is consistent with other
values of the same data
- Quality of Data Provenance has three dimensions:
Provenance
Data Quality
- Quality of
Datasets
- Timeliness
- Consistency
between datasets
- Consistency
over source – The
extent to which a data value is consistent with other values of the
same data
- Consistency
over time – The
extent to which the data value is consistent with past data values
- Stable and
meaningful data
- Temporality of data
- Transaction and valid times closeness – The extent to which a data value
is credible based on the proximity of transaction time to valid times.
- Transaction time overlap – The extent to which a data value is derived
from data values with overlapping valid times.
Trust Evaluation
Some questions need to be considered when evaluating trust in provenance
data:
- Who created the content (author or attribution)?
- Was the content manipulated? If yes, by what process or source?
- Who is providing the content (repositories)?
Quality
of Data Assessment
- Assign numeric
values to Quality Criteria of Datasets or Scoring/Rating Systems
- Proactive
Approach
- Precision
vs Practicality
Manual Approach
- Questionnaire-based system
Semi-Automatic Approach
- Rating-based system
- Reputation-based system
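One simple way to assign a numeric value to a dataset is a weighted mean of per-criterion ratings. The sketch below is an assumption for illustration: the criteria names, the weights and the `quality_score` function are hypothetical, and ratings are taken to lie in the range 0 to 1.

```python
def quality_score(ratings, weights):
    """Weighted mean of per-criterion ratings (each rating in [0, 1])."""
    total_weight = sum(weights[criterion] for criterion in ratings)
    if total_weight <= 0:
        raise ValueError("rated criteria need a positive total weight")
    weighted = sum(ratings[c] * weights[c] for c in ratings)
    return weighted / total_weight

# Hypothetical ratings collected from users or questionnaires.
ratings = {"timeliness": 0.5, "consistency": 1.0}
weights = {"timeliness": 1.0, "consistency": 1.0}

print(quality_score(ratings, weights))
```

The weights are where the trade-off between precision and practicality shows up: more criteria and finer weights give a more precise score but require more input from users.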
Reasons for Assessment
Main Reasons
- Provenance of assessed data on the web
- Identify methods / approaches to automatically assess the quality of
data on the web
- Or identify methods to automatically assess the quality criteria of web
data
A General Assessment Approach
Step 1: Generate a provenance graph for the data item
Step 2: Annotate the provenance graph with impact values
Step 3: Execute the assessment function/program (script)
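The three steps can be sketched as a small Python pipeline. Everything here is a simplifying assumption: the provenance graph is reduced to a flat list of elements taken from the item's metadata, the impact values are hypothetical trust scores, and taking the lowest known impact is just one possible assessment function.

```python
def generate_provenance_graph(item):
    """Step 1: collect provenance elements for the data item (from metadata)."""
    return [{"type": key, "value": value}
            for key, value in item["metadata"].items()]

def annotate(graph, impact_of):
    """Step 2: attach an impact value (or None if unknown) to each element."""
    return [dict(element, impact=impact_of(element)) for element in graph]

def assess(annotated):
    """Step 3: combine the impact values; here the lowest known impact wins."""
    known = [e["impact"] for e in annotated if e["impact"] is not None]
    return min(known) if known else None

# Hypothetical data item and trust scores.
item = {"id": "doc-1",
        "metadata": {"creator": "alice", "source": "http://example.org/feed"}}
TRUST = {("creator", "alice"): 0.9,
         ("source", "http://example.org/feed"): 0.6}

def impact_of(element):
    return TRUST.get((element["type"], element["value"]))

score = assess(annotate(generate_provenance_graph(item), impact_of))
print(score)
```

Keeping the steps as separate functions means the graph generation, the source of impact values and the assessment function can each be replaced independently.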
Generate a Provenance Graph
- What types of provenance elements are required?
- What level of detail (i.e. granularity) is required?
- Where and how do we get provenance information?
- Two complementary options
- Recording
- Analyzing the metadata
Annotation with Impact Values
- How might each provenance element influence the quality of data?
- Each type of element has to be analyzed systematically
- What kinds of impact values are necessary, and how can the influence be
represented through impact values?
- Impact values do not have to be numeric
- It also depends on the assessment function
- How do we determine the impact values?
Determine
the Impact Values
- From Provenance
Information
- From user
Input
-
Rating-based systems, or reputation-based systems
-
Configuration options
- Through Content
Analysis
-
Comparison of data contents
-
Adoption of information retrieval methods
-
Adoption of data cleansing techniques
- Through Context
Analysis
-
Further metadata
-
Domain knowledge
Annotation
with Impact Values
- How might each provenance element influence the quality of data?
Provenance
Element Type
Creation Date
Creation Guidelines
Source data items
Data creator
Impact Values
Creation time
Weights
Expiry time
Assessment Function(s)
- What does the assessment function look like?
- Develop the function together with the impact values
- Take incompleteness into consideration
- The provenance graph could be fragmentary
- Annotations could be missing
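As an illustration of an assessment function that tolerates missing annotations, the sketch below scores timeliness from a creation time and a lifetime (the span until expiry). The linear decay and the `timeliness` function itself are assumptions made for this example, not the method prescribed by the slides.

```python
from datetime import datetime, timedelta

def timeliness(created, lifetime, now):
    """Return a score in [0, 1]: 1 when fresh, 0 once the lifetime has elapsed.

    Returns None when an annotation is missing or unusable, so callers can
    tell "not assessable" apart from "outdated".
    """
    if created is None or lifetime is None or lifetime <= timedelta(0):
        return None
    age = now - created
    if age < timedelta(0):
        return None  # creation time lies in the future: annotation unusable
    return max(0.0, 1 - age / lifetime)

now = datetime(2010, 11, 30)
print(timeliness(datetime(2010, 11, 20), timedelta(days=40), now))
print(timeliness(None, timedelta(days=40), now))
```

Returning `None` instead of a default score keeps a fragmentary provenance graph from silently producing a misleading quality value.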
Scientific
and Technical Challenges of Provenance – 1
(SUMMARY)
Provenance information needs to be:
- Stored and secured, queried and reasoned about
Scientific
and Technical Challenges of Provenance - 2
- Vocabularies
for representation of provenance contents
- Need representation
of process (workflow), entities roles, data collections, meta-assertions,
etc.
- The open provenance
model (OPM)
- Granularity
of provenance records
- How much detail
is useful, manageable/scalable in practice?
- Size of provenance
can be orders of magnitude larger than base data.
- Provenance
evaluation for information quality and trust management
Scientific
and Technical Challenges of Provenance – 2a
- Determine
when data becomes obsolete based on provenance information
- Versioning
of data sources
- Relate updates
of data based on provenance information
- Provenance-aware
visualization, navigation and resource consumption
Scientific
and Technical Challenges of Provenance and Trust – 3
- Policies based
on Provenance information
- Association-based
policies
- Source is
cited in Spiegel
- Source is
cited in Wikipedia
- Policies may
be restricted to a context
- Topic of search,
topics of pages, tags of page
- Trust policies
may be shared across users
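Such policies can be pictured as small predicates over provenance information. In the Python sketch below, the `cited_in` and `restricted_to` helpers, the source record and the context dictionary are all hypothetical; a policy returns `True` or `False` when it applies and `None` when it abstains because the context does not match.

```python
def cited_in(site):
    """Association-based policy: trust a source that the given site cites."""
    def policy(source, context):
        return site in source["cited_in"]
    return policy

def restricted_to(topic, inner):
    """Apply `inner` only when the search topic matches; otherwise abstain."""
    def policy(source, context):
        if context.get("topic") != topic:
            return None
        return inner(source, context)
    return policy

# Hypothetical source record with provenance about who cites it.
source = {"url": "http://example.org/article", "cited_in": {"Wikipedia"}}

history_policy = restricted_to("history", cited_in("Wikipedia"))
print(history_policy(source, {"topic": "history"}))
print(history_policy(source, {"topic": "sports"}))
```

Because a policy is just a function, sharing it across users amounts to sharing its definition, while each user supplies their own context.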
Thanks for your attention!
Freie Universität Berlin
Computer Science Department
Software Engineering Research Group
Takustr. 9, Berlin, Germany.
Any Questions?
References
- W3C Website,
What is provenance? Modified at November 2010, http://www.w3.org/2005/Incubator/prov/wiki/What_Is_Provenance
- W3C Website,
A working Definition of Provenance, Modified at November 2010, http://www.w3.org/2005/Incubator/prov/wiki/What_Is_Provenance#A_Working_Definition_of_Provenance
- Hartig, O.
Provenance information in the Web of data. In Proceedings of LDOW 2009
(Madrid, Spain, April 2009).
- O. Hartig and J. Zhao. Using web data provenance for quality assessment.
Proceedings of the 1st Int. Workshop on the Role of Semantic Web in
Provenance
- D. Brickley
and L. Miller, FOAF Vocabulary Specification, November 2007. http://xmlns.com/foaf/spec
- U. Bojars
and J. G. Breslin. SIOC Core Ontology Specification, Revision 1.30,
Jan. 2009. http://rdfs.org/sioc/spec/
- Luc Moreau,
Juliana Freire, Joe Futrelle, Robert E. McGrath, Jim Myers, and Patrick
Paulson. The open provenance model: An overview. In IPAW, pages 323–326,
2008.
- L. L. Pipino, Y. W. Lee, and R. Y. Wang, "Data Quality Assessment,"
Communications of the ACM, vol. 45, Issue no. 4, p. 211-218, 2002.
- You-Wei Cheah, Beth Plale. Provenance Analysis: Towards quality
provenance. In Proceedings of the 8th IEEE International Conference on
eScience, Chicago, Illinois, Oct. 2012. http://www.ci.uchicago.edu/escience2012/pdf/Provenance_Analysis-Towards_Quality_Provenance.pdf
- Yogesh Simmhan,
Beth Plale, and Dennis Gannon. A survey of data provenance in e-science.
SIGMOD Record, 34(3):31–36, 2005.
- Prat, N.,
and Madnick, S. Evaluating and aggregating data believability across
quality sub-dimensions and data lineage. In Proceedings of WITS 2007
(Montreal, Canada, December 2007), p.169-174.
- P. Buneman,
S. Khanna, and W. C. Tan. Data Provenance: Some Basic Issues. In
Proceedings of the 20th Conference on Foundations of Software Technology
and Theoretical Computer Science (FST TCS), p87-93, Springer, Dec.
2000.
- Prat, N.,
and Madnick, S. Measuring data believability: A provenance approach.
Proceedings of HICSS-41 (Big Island, HI, January 2008), IEEE, p.1-10.
- Jose Manuel
Gomez-Perez, Invited Lectures on Programmable web and the web of data,
November 2009, URJC, Campus de Mostoles, Departmental II, Salon de grados,
Madrid, Spain, Website, http://www.cetinia.urjc.es/es/node/331
- Website :
http://www.w3.org/2005/Incubator/prov/wiki/images/0/02/Provenance-XG-Overview.pdf
- http://www.w3.org/2005/Incubator/prov/wiki/Provenance_Dimensions
- http://www.w3.org/2005/Incubator/prov/wiki/W3C_Provenance_Incubator_Group_Wiki