PROVENANCE


Abdul Saboor 

Department of Computer Science

Software Engineering Research Group, Berlin, Germany 

Welcome to this Presentation

Presentation Agenda 

  • What is Provenance?
  • Why is Provenance important, and what are its two major strands?
  • Provenance and Linked Data
  • Provenance Data Model
  • Provenance Vocabularies
  • The Open Provenance Model
  • Provenance Data Quality Assessment
  • Summary - Scientific and Technical Challenges of Provenance


What is Provenance? 


  • Recording the history of data and its place of origin

Provenance Dictionary Definitions

  1. Merriam-Webster Online Dictionary – origin, source
  2. Oxford English Dictionary – The place of origin or earliest known history of something; origin, derivation.

Provenance Definitions

1. Provenance refers to the source of information, such as the entities and processes involved in producing or delivering an artifact. (Yolanda)

2. Provenance is a description of how things came to be, and how they came to be in the state they are in today. Statements about the provenance can themselves be considered to have provenance. (Jim M) 



What is Provenance? 

Provenance Working Definitions

  3. Provenance of a resource is a record that describes entities and processes involved in producing and delivering or otherwise influencing that resource. Provenance provides a critical foundation for assessing authenticity, enabling trust, and allowing reproducibility. Provenance assertions are a form of contextual metadata and can themselves become important records with their own provenance. (W3C)

Provenance Web Definition

4. On the web, provenance would include information about the creation and publication of web resources as well as information about access of those resources, and activities related to their discussion, linking, and reuse. 



What is Provenance? 

Provenance Definitions  

5. Provenance is documentation of the set of artifacts, processes, and agents that have caused an artifact to be, and of the contexts of these entities. Provenance provides a critical foundation for assessing authenticity, enabling trust, and allowing reproducibility. Assertions of provenance can themselves become important records with their own provenance. (Jim M)


What kind of History? 

  • Data Creator/Data Publisher
  • Data Creation Date
  • Data Modifier & Modification Date
  • Data Description
  • Etc... 


Why is Provenance Important?

The need for provenance in data integration and reuse

  • Data comes from various diverse data sources
  • Varying Quality 
  • Different Scope  
  • Different Assumptions 


Two major strands of Provenance 


Data And Workflow Provenance 

Data Provenance 

Information describing how data has moved through a network of databases is referred to as "fine-grain" or "data" provenance. Fine-grain provenance can be further categorized into where-, how- and why-provenance. A query execution simply copies data elements from some source to some target database, and where-provenance identifies the source elements from which the data in the target was copied. Why-provenance provides the justification for the data elements appearing in the output, and how-provenance describes how some parts of the input influenced certain parts of the output.
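The distinction between where- and why-provenance can be illustrated with a small sketch. It assumes a toy in-memory table and a single selection/projection query; the names Cell, load and select_project, and the sample data, are hypothetical and not taken from any particular database system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cell:
    """A value tagged with the source location it was copied from."""
    value: object
    source: tuple  # (table name, row id, column name)

def load(table_name, rows):
    """Wrap every value of a source table with its own location."""
    return {
        row_id: {col: Cell(val, (table_name, row_id, col)) for col, val in row.items()}
        for row_id, row in rows.items()
    }

def select_project(table, predicate, columns):
    """Run a simple selection/projection and keep provenance.

    Returns (output_row, why) pairs. Each output cell still carries its
    where-provenance; `why` is the set of source rows that justify the
    presence of the output row.
    """
    results = []
    for row in table.values():
        plain = {col: cell.value for col, cell in row.items()}
        if predicate(plain):
            out = {col: row[col] for col in columns}
            why = {cell.source[:2] for cell in row.values()}
            results.append((out, why))
    return results

# Hypothetical sample data.
employees = load("employees", {
    1: {"name": "Ada", "dept": "R&D"},
    2: {"name": "Bob", "dept": "Sales"},
})
result = select_project(employees, lambda r: r["dept"] == "R&D", ["name"])
out_row, why = result[0]
print(out_row["name"].value)   # Ada
print(out_row["name"].source)  # where-provenance: ('employees', 1, 'name')
print(why)                     # why-provenance: {('employees', 1)}
```

How-provenance would additionally record the way the inputs were combined, which this sketch leaves out.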

Workflow Provenance 

Information describing how derived data has been calculated from raw observations is referred to as "coarse-grain" or "workflow" provenance. The widespread use of workflow tools for processing scientific data makes it easier to capture provenance information. The workflow process describes all the steps involved in producing a given data set and hence captures its provenance information.
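As a rough illustration of workflow provenance, the sketch below records every step of a tiny two-step pipeline. The step decorator, the provenance_log list and the calibrate/average steps are all assumptions made for this example; real workflow tools capture much richer records.

```python
import functools
from datetime import datetime, timezone

provenance_log = []  # one record per executed workflow step

def step(func):
    """Decorator that records the inputs and output of a workflow step."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        provenance_log.append({
            "step": func.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "output": result,
            "finished_at": datetime.now(timezone.utc).isoformat(),
        })
        return result
    return wrapper

@step
def calibrate(raw, offset):
    """Hypothetical step: subtract a sensor offset from raw observations."""
    return [x - offset for x in raw]

@step
def average(values):
    """Hypothetical step: reduce calibrated values to their mean."""
    return sum(values) / len(values)

mean = average(calibrate([10.5, 11.5, 12.5], offset=0.5))
print(mean)  # 11.0
for record in provenance_log:
    print(record["step"], "->", record["output"])
```

The log lists the steps in execution order, so the derivation of the final value can be replayed from the raw observations.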


Provenance Dimensions - 1 

Content of Provenance Information 

  • Attribution - provenance as the sources or entities that were used to create a  new result
    • Responsibility - knowing who endorses a particular piece of information or result
    • Origin - recorded vs reconstructed, verified vs non-verified, asserted vs inferred
  • Process - provenance as the process that yielded an artifact
    • Reproducibility (e.g. workflows, mashups, text extraction)
    • Data Access (e.g. access time, accessed server, party responsible for accessed server)
  • Evolution and versioning
    • Republishing (e.g. re-tweeting, re-blogging, re-publishing)
    • Updates (e.g. a document with content from various sources and that changes over time)
  • Justification for decisions – Includes argumentation, hypotheses, why-not  questions
  • Entailment - given the results to a particular query, what tuples led to those  results


Provenance Dimensions - 2 

Management of Provenance Information 

  • Publication - Making provenance information available (expose, distribute)
  • Access - Finding and querying provenance information
  • Dissemination control – Track policies specified by creator for when/how an  artifact can be used
    • Access Control - incorporate access control policies to access provenance information
    • Licensing - stating what rights the object creators and users have based on provenance
    • Law enforcement (e.g. enforcing privacy policies on the use of personal information)
  • Scale - how to operate with large amounts of provenance information

Use of Provenance Information

  • Understanding - End user consumption of provenance
    • abstraction, multiple levels of description, summary
    • presentation, visualization


Provenance Dimensions - 3 

  • Interoperability - combining provenance produced by multiple different systems
  • Comparison - finding what is common in the provenance of two or more entities
  • Accountability - the ability to check the provenance of an object with respect to some expectation
    • Verification - of a set of requirements
    • Compliance - with a set of policies
  • Trust - making trust judgments based on provenance
    • Information quality - choosing among competing evidence from diverse sources (e.g. linked data use cases)
    • Incorporating reputation and reliability ratings with attribution information
  • Imperfections - reasoning about provenance information that is not complete or correct
    • Incomplete provenance
    • Uncertain/probabilistic provenance
    • Erroneous provenance
    • Fraudulent provenance
  • Debugging


Web of Data 


Adapted from Cetinia, iSOCO Innovation Lab, J.M.G Perez, Provenance: eScience to the Web of Data, 11/09

The Linked Data Paradigm 

  • How can we exploit all the available data?
    • Data can be reused and remixed
    • Common flexible and usable APIs 
    • Standard vocabularies to describe interlinked datasets 
    • Various Tools 
    • Understand the Semantic Web vision  


Provenance and Linked Data 

  • Provenance provides the ability to:
    • Trace the sources of various kinds of data
    • Enable the exploration of relationships between datasets, their authors and affiliations
  • Provenance analysis provides insight into how data is produced and exploited 
  • Provenance creates a notion of information quality
    • Is a certain dataset consistent and up to date?
    • Is the connection between two datasets meaningful?
    • Is a given dataset relevant for a particular domain?
  • Provenance to establish information trustworthiness 
  • Provenance to provide data views relating to some criteria


The Provenance Data Model 

Institutional Level  

Experimental Protocol Level  

Data Analysis and Significance Level  

Dataset Description Level  

Metadata associated with origin in terms of its data attributes (e.g. AuthorName, Title, URL) 

The Origin of datasets (e.g. History area, region, organisation or institution) 

The statistical analysis methodology applied to the datasets for selecting relevant attributes (e.g. whether datasets are divided into parts, output values, versions) 

Who published the datasets, and the vocabularies used to describe interlinked datasets, such as Dublin Core, voiD, PRV. 


The Provenance Related Vocabularies 

  • DC – Dublin Core
  • FOAF – Friend of a Friend
  • SIOC – Semantically-Interlinked Online Communities
  • WOT – Web of Trust Schema
  • OMV – Ontology Metadata Vocabulary
  • SWP – Semantic Web Publishing
  • VoiD – Vocabulary of Interlinked Datasets
  • PRV – Provenance Vocabulary
  • PML – Proof Markup Language
  • PAV – SWAN provenance ontology
  • OUZO – Provenance ontology
  • CS – Changeset Vocabulary
  • Etc.


Provenance Related Metadata 

Provenance-related metadata is either directly attached to a data item or its host document, or it is available as additional data on the Web. 

Examples of attached metadata are RDF statements about the RDF graph that contains those statements, the author name and creation date of blog entries added to a syndication feed, or information about an image. Detached metadata can be represented in RDF using vocabularies. 
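A minimal sketch of such metadata, assuming a blog entry at a made-up example.org URL and using two Dublin Core properties (dc:creator from the Dublin Core Element Set and dcterms:created). The helper functions literal, triple and describe are hypothetical; they simply print N-Triples lines rather than using an RDF library.

```python
DC = "http://purl.org/dc/elements/1.1/"
DCTERMS = "http://purl.org/dc/terms/"
XSD_DATE = "http://www.w3.org/2001/XMLSchema#date"

def literal(value, datatype=None):
    """Format a plain or typed N-Triples literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"^^<{datatype}>' if datatype else f'"{escaped}"'

def triple(subject, predicate, obj):
    """Format one N-Triples statement with IRI subject and predicate."""
    return f"<{subject}> <{predicate}> {obj} ."

def describe(resource, author_name, created):
    """Return provenance statements (author name, creation date) about a resource."""
    return [
        triple(resource, DC + "creator", literal(author_name)),
        triple(resource, DCTERMS + "created", literal(created, XSD_DATE)),
    ]

# Hypothetical resource and values, for illustration only.
lines = describe("http://example.org/blog/entry-1", "Jane Doe", "2010-11-05")
print("\n".join(lines))
```

The same statements could live inside the described document (attached) or in a separate document that refers to it by URL (detached).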


A Provenance Architecture for the Web of Data 

Authoritative agencies are required to certify data provenance and keep it secure 

Application Layer 


Adapted from Cetinia, iSOCO Innovation Lab, J.M.G Perez, Provenance: eScience to the Web of Data, 11/09

Main Action Points 

Provenance Vocabularies 

Represent and reason with trust and information quality 

Extend emerging Linked data vocabularies 


Awareness of Data Providers 

W3C Provenance Incubator Group 

Linked Data Standards


Tools for Data Providers 

Generalization of Provenance Metadata  

Provenance Authoritative Agencies 

Provenance Visualization 


Adapted from Cetinia, iSOCO Innovation Lab, J.M.G Perez, Provenance: eScience to the Web of Data, 11/09

The Open Provenance Model 

  • The Open Provenance Model (OPM) describes how data is produced or transformed into a new state. It can also represent the transition of one or more data items from an old state to a new state.
  • OPM is a graph model for provenance: the edges of the graph denote relationships between the occurrences represented by the nodes. 
  • The main purpose of OPM is to support the assessment of various data qualities such as reliability, accuracy and timeliness. 


OPM classifies nodes into three types 


Artifacts are pieces of data of fixed value and context that may represent an entity in a given state. Edges can also carry annotations that provide information on how one occurrence caused another. 


Processes are performed on artifacts in order to produce other artifacts. 


Agents are the entities that control a process, such as a user. 
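The three node types and their causal edges can be sketched as a small graph structure. The edge names below (used, wasGeneratedBy, wasControlledBy, wasTriggeredBy, wasDerivedFrom) are the dependency types defined by OPM; the OPMGraph class, its methods and the sample nodes are assumptions made for this illustration, not an official API.

```python
from dataclasses import dataclass, field

NODE_KINDS = {"artifact", "process", "agent"}

# OPM edges point from effect to cause: (kind of source, kind of target).
EDGE_TYPES = {
    "used": ("process", "artifact"),
    "wasGeneratedBy": ("artifact", "process"),
    "wasControlledBy": ("process", "agent"),
    "wasTriggeredBy": ("process", "process"),
    "wasDerivedFrom": ("artifact", "artifact"),
}

@dataclass
class OPMGraph:
    nodes: dict = field(default_factory=dict)  # node id -> kind
    edges: list = field(default_factory=list)  # (source, edge type, target, annotations)

    def add_node(self, node_id, kind):
        if kind not in NODE_KINDS:
            raise ValueError(f"unknown node kind: {kind}")
        self.nodes[node_id] = kind

    def add_edge(self, source, edge_type, target, **annotations):
        """Add a causal edge after checking that the node kinds fit the edge type."""
        expected = EDGE_TYPES[edge_type]
        actual = (self.nodes[source], self.nodes[target])
        if actual != expected:
            raise ValueError(f"{edge_type} expects {expected}, got {actual}")
        self.edges.append((source, edge_type, target, annotations))

    def causes(self, node_id):
        """Return the direct causes of a node as (edge type, cause) pairs."""
        return [(e, t) for s, e, t, _ in self.edges if s == node_id]

# Hypothetical example: a user runs a cleaning process on a raw file.
g = OPMGraph()
g.add_node("raw.csv", "artifact")
g.add_node("clean.csv", "artifact")
g.add_node("clean", "process")
g.add_node("alice", "agent")
g.add_edge("clean", "used", "raw.csv", role="input")
g.add_edge("clean.csv", "wasGeneratedBy", "clean", role="output")
g.add_edge("clean", "wasControlledBy", "alice", role="operator")
g.add_edge("clean.csv", "wasDerivedFrom", "raw.csv")
print(g.causes("clean.csv"))
```

Keeping annotations on the edges, as in the role values here, mirrors the remark above that edges can describe how one occurrence caused another.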


Model of Web Data Provenance 

Provenance Graph – describes the provenance of data items. It consists of:

  • Provenance elements (pieces of provenance information)
  • Relationships between provenance elements
  • Related data items, if possible


Main Focus of Provenance of Web Data 

  • Provenance Models Define
    • Types of Provenance elements (roles) 
    • Relationship between those elements


Adapted from Olaf Hartig, Humboldt University Berlin, Provenance Information in the Web of Data, 04/09

Provenance Data Quality Assessment 

The Quality of Information 

  • The main objective is assessing the quality of datasets
  • The quality of datasets has multiple dimensions
  • The relevance of each criterion is determined by user preferences and by the tasks to be performed on the available datasets


Provenance Data Quality 

  • Data Trustworthiness
    • Data Authenticity
    • Data Reliability
  • Dimensions of Believability  
    • Trustworthiness of source
      • Data Lineage – The origin of data
      • Related Artifacts and actors
    • Reasonableness of data 
      • Possibility – The extent to which data value is possible
      • Consistency – The extent to which a data value is consistent with other values of the same data
  • Quality of Data Provenance has Three dimensions:
    • Correctness 
    • Completeness 
    • Relevancy 


Provenance Data Quality 

  • Quality of Datasets
    • Timeliness
    • Consistency between datasets
      • Consistency over source – The extent to which a data value is consistent with other values of the same data
      • Consistency over time – The extent to which the data value is consistent with past data values
    • Stable and meaningful data
  • Temporality of Data 
    • Transaction and valid times closeness – The extent to which a data value is credible based on proximity of transaction time to valid times.
    • Valid times overlap – The extent to which a data value is derived from data values with overlapping valid times. 


Trust Evaluation 

Some questions need to be considered when evaluating trust in provenance data:

  1. Who created the content (author or attribution)?
  2. Was the content manipulated? If so, by what process or source?
  3. Who is providing the content (repositories)?


Quality of Data Assessment 

  • Assign numeric values to the quality criteria of datasets (scoring/rating systems)
  • Proactive Approach 
    • Precision vs Practicality

Manual Approach 

  • Questionnaire-based system

Semi-Automatic Approach 

  • Rating-based system
  • Reputation-based system


Reasons for Assessment 

Main Reasons  

  • Provenance of assessed data on the web
  • Primary Objectives 
    • Identify methods and approaches to automatically assess the quality of data on the web 
    • Or identify methods to automatically assess individual quality criteria of web data 


A General Assessment Approach 

Step 1: Generate a provenance graph for the data item 

Step 2: Annotate the provenance graph with impact values 

Step 3: Execute the assessment function/program (script)  
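The three steps can be sketched end to end as follows. Everything here is a simplifying assumption for illustration: the provenance records are a plain dictionary, the impact values are user-supplied trust ratings between 0 and 1, and the assessment function takes the lowest known rating. The function names generate_graph, annotate and assess are hypothetical.

```python
def generate_graph(data_item, recorded):
    """Step 1: collect every provenance element reachable from the data item."""
    graph, todo = {}, [data_item]
    while todo:
        node = todo.pop()
        if node not in graph:
            graph[node] = list(recorded.get(node, []))
            todo.extend(graph[node])
    return graph

def annotate(graph, impact_values):
    """Step 2: attach an impact value to each element; None marks a missing one."""
    return {node: impact_values.get(node) for node in graph}

def assess(annotations):
    """Step 3: hypothetical assessment function, the weakest known rating wins."""
    known = [value for value in annotations.values() if value is not None]
    return min(known) if known else None

# Hypothetical recorded provenance: element -> elements it directly depends on.
recorded = {
    "report.pdf": ["alice", "dataset.csv"],
    "dataset.csv": ["sensor-feed", "bob"],
}
ratings = {"alice": 0.9, "bob": 0.6, "sensor-feed": 0.8}

graph = generate_graph("report.pdf", recorded)
score = assess(annotate(graph, ratings))
print(score)  # 0.6
```

Taking the minimum is only one possible choice; an average or a weighted combination would fit the same three-step structure.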


Generate a Provenance Graph 

  1. What types of provenance elements are required?
  2. What level of detail (i.e. granularity) is required? 
  3. Where and how do we get provenance information? 
    • Two complementary options 
      • Recordings
      • Analyzing the metadata


Annotation with Impact Values 

  1. How might each provenance element influence the quality of data?
    • Each type of element has to be analyzed systematically 
  2. What kinds of impact values are necessary, and how can the influence be represented through impact values? 
    • Impact values do not have to be numeric 
    • It also depends on the assessment functions 
  3. How do we determine the impact values?  


Determine the Impact Values 

  1. From provenance information
  2. From user input
    • Rating-based systems, or reputation-based systems
    • Configuration options
  3. Through content analysis 
    • Comparison of data contents
    • Adoption of information retrieval methods
    • Adoption of data cleansing techniques
  4. Through context analysis 
    • Further metadata
    • Domain knowledge 


Annotation with Impact Values 

  • How might each provenance element influence the quality of data?

Provenance Element Type 

  • Creation date
  • Creation guidelines
  • Source data items
  • Data creator

Impact Values 

  • Creation time
  • Expiry time


Assessment Function (s) 

  1. What does the assessment function look like?
    • Develop the function together with the impact values 
    •     Take incompleteness into consideration 
      •    Provenance graph could be fragmentary 
      •    Annotation could be missing 
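One possible shape for such a function is sketched below for the timeliness criterion, using creation time and expiry time as impact values. The linear decay formula is an assumption chosen for simplicity, not a formula given in this presentation, and the function name timeliness is hypothetical. Missing annotations yield None rather than a guessed score.

```python
from datetime import datetime

def timeliness(creation_time, expiry_time, now):
    """Score in [0, 1]: 1.0 at creation, decaying linearly to 0.0 at expiry.

    Returns None when an annotation is missing, so that a fragmentary
    provenance graph produces "unknown" instead of a misleading number.
    """
    if creation_time is None or expiry_time is None:
        return None
    lifetime = (expiry_time - creation_time).total_seconds()
    if lifetime <= 0:
        return 0.0  # already expired at creation: treat as worthless
    age = (now - creation_time).total_seconds()
    return min(1.0, max(0.0, 1.0 - age / lifetime))

# Hypothetical annotations for one data item.
created = datetime(2024, 1, 1)
expires = datetime(2024, 1, 11)

print(timeliness(created, expires, now=datetime(2024, 1, 6)))   # 0.5
print(timeliness(created, expires, now=datetime(2024, 2, 1)))   # 0.0
print(timeliness(created, None, now=datetime(2024, 1, 6)))      # None
```

Callers then decide how to treat None, for example by excluding the item from a ranking or by falling back to another criterion.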


Scientific and Technical Challenges of Provenance – 1 

Provenance information needs to be: 

  • Represented
  • Captured and recorded 
  • Stored and secured, queried and reasoned about 
  • Visualized and browsed 


Scientific and Technical Challenges of Provenance - 2 

  • Vocabularies for representation of provenance contents
    • Need to represent processes (workflows), entity roles, data collections, meta-assertions, etc. 
    • The open provenance model (OPM) 
  • Granularity of provenance records 
    • How much detail is useful, manageable/scalable in practice? 
      • Size of provenance can be orders of magnitude larger than base data. 
  • Provenance evaluation for information quality and trust management 


Scientific and Technical Challenges of Provenance – 2a 

  • Evaluation and updates
    • Shelf timeliness of data 
      • Determine when data becomes obsolete based on provenance information 
    • Versioning of data sources 
      • Relate updates of data based on provenance information 
  • Provenance-aware visualization, navigation and resource consumption 


Scientific and Technical Challenges of Provenance and Trust – 3 

  • Policies based on Provenance information
    • Association-based policies
      • Source is cited in Spiegel
      • Source is cited in Wikipedia
    • Bias-based policies 
      • Source is an Oil company
    • Distrust policies 
      • Source is a blog
  • Policies may be restricted to a context 
      • Topic of search, topics of pages, tags of page
  • Trust policies may be shared across users 


Thanks for your attention! 

Freie Universität Berlin

Computer Science Department

Software Engineering Research Group

Takustr. 9, Berlin, Germany. 

Any Questions? 



  1. W3C Website, What is provenance? Modified at November 2010, http://www.w3.org/2005/Incubator/prov/wiki/What_Is_Provenance
  2. W3C Website, A working Definition of Provenance, Modified at November 2010, http://www.w3.org/2005/Incubator/prov/wiki/What_Is_Provenance#A_Working_Definition_of_Provenance
  3. Hartig, O. Provenance information in the Web of data. In Proceedings of LDOW 2009 (Madrid, Spain, April 2009).
  4. O. Hartig and J. Zhao. Using web data provenance for quality assessment. Proceedings of the 1st Int. Workshop on the Role of Semantic Web in Provenance
  5. D. Brickley and L. Miller, FOAF Vocabulary Specification, November 2007. http://xmlns.com/foaf/spec
  6. U. Bojars and J. G. Breslin. SIOC Core Ontology Specification, Revision 1.30, Jan. 2009. http://rdfs.org/sioc/spec/
  7. Luc Moreau, Juliana Freire, Joe Futrelle, Robert E. McGrath, Jim Myers, and Patrick Paulson. The open provenance model: An overview. In IPAW, pages 323–326, 2008.
  8. L. L. Pipino, Y. W. Lee, and R. Y. Wang, "Data Quality Assessment," Communications of the ACM, vol. 45, no. 4, p. 211-218, 2002.
  9. You-Wei Cheah, Beth Plale. Provenance Analysis: Towards quality provenance. In Proceedings of the 8th IEEE International Conference on eScience, Chicago, Illinois, Oct. 2012. http://www.ci.uchicago.edu/escience2012/pdf/Provenance_Analysis-Towards_Quality_Provenance.pdf
  10. Yogesh Simmhan, Beth Plale, and Dennis Gannon. A survey of data provenance in e-science. SIGMOD Record, 34(3):31–36, 2005. 
  11. Prat, N., and Madnick, S. Evaluating and aggregating data believability across quality sub-dimensions and data lineage. In Proceedings of WITS 2007 (Montreal, Canada, December 2007), p.169-174.
  12. Y. Simmhan, B. Plale, and D. Gannon. A Survey of Data Provenance in e-Science. SIGMOD Record, Computer Science Department, Indiana University. Vol. 34, Issue No. 3, p31–36, ACM, Sept. 2005.
  13. P. Buneman, S. Khanna, and W. C. Tan. Data Provenance: Some Basic Issues. In Proceedings of the 20th Conference on Foundations of Software Technology and Theoretical Computer Science (FST TCS), p87-93, Springer, Dec. 2000.
  14. Prat, N., and Madnick, S. Measuring data believability: A provenance approach. Proceedings of HICSS-41 (Big Island, HI, January 2008), IEEE, p.1-10.
  15. Jose Manuel Gomez-Perez, Invited Lectures on Programmable web and the web of data, November 2009, URJC, Campus de Mostoles, Departmental II, Salon de grados, Madrid, Spain, Website, http://www.cetinia.urjc.es/es/node/331
  16. Website : http://www.w3.org/2005/Incubator/prov/wiki/images/0/02/Provenance-XG-Overview.pdf
  17. http://www.w3.org/2005/Incubator/prov/wiki/Provenance_Dimensions
  18. http://www.w3.org/2005/Incubator/prov/wiki/W3C_Provenance_Incubator_Group_Wiki
