Notes - Technologies for Informative System

Exam: Technologies for informative system

Faculty: Industrial Engineering

Exam notes

This file contains the slide notes, integrated with notes taken in class, for all the chapters of the Technologies for Informative System course (Master's degree in Computer Engineering) taught by Professor Tanca (academic year 2022/2023). It also integrates some notes taken from past exam questions.

Topics:
1. Intro to Big Data Integration [1,2,3]
1.1 Schema Reconstruction
1.1.1 Mapping Between The Global Logical Schema and single source schemata
1.2 Record Linkage and Data Fusion
1.2.1 String Matching
1.2.2 Entity Matching
1.2.3 Data Fusion
2. SemiStructured Data Integration [4,5]
2.1 TSIMMIS
2.2 Ontologies (Slide Part 2)
3. Data WareHouses [6]
4. Data WareHouse Design [7,8]
4.1 Conceptual Design
4.2 Logical Design
4.2.1 Views
4.3 ROLAP Logical Design
5. Other Architecture [9]
6. Frontiers in Data Integration [10]
6.1 Data Provenance
6.2 Lightweight Integration: Data Mashup
6.3 Pay as You Go Data Management
6.4 Data (Lake) Federation
6.4.1 Node Level
6.4.2 Federation Level
7. Ethics in Data Science
8. Data Quality


Inconsistency and Semi-Structured Data Integration

Inconsistency may have different causes: one or both of the sources are incorrect, or each source has only a partial view of the data. Often the correct value can be obtained as a function of the conflicting source values (at the data layer).
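As a minimal sketch of obtaining the correct value as a function of the conflicting ones, the snippet below applies one resolution function per attribute. The attribute names, the sample records and the chosen functions (longest string, average, most recent, first source) are illustrative assumptions, not taken from the course material.

```python
from statistics import mean

# Hypothetical records describing the same real-world entity in two sources.
source_a = {"name": "Mario Rossi", "salary": 30000, "updated": "2022-05-01", "city": "Milano"}
source_b = {"name": "M. Rossi", "salary": 32000, "updated": "2023-01-15", "city": "Milan"}

# One (assumed) resolution function per attribute; each receives the conflicting values.
resolvers = {
    "name": lambda values: max(values, key=len),  # keep the most complete string
    "salary": mean,                               # average the numeric values
    "updated": max,                               # keep the most recent ISO date
    "city": lambda values: values[0],             # trust the first source
}

def fuse(records, resolvers):
    """Merge records of one entity, resolving each conflict with its function."""
    fused = {}
    attributes = {attr for rec in records for attr in rec}
    for attr in sorted(attributes):
        values = [rec[attr] for rec in records if attr in rec]
        if len(set(values)) == 1:
            fused[attr] = values[0]          # sources agree: no conflict
        else:
            fused[attr] = resolvers[attr](values)
    return fused

fused = fuse([source_a, source_b], resolvers)
print(fused)
```

Which function is appropriate depends on the attribute and on how much each source is trusted, so the mapping above is only one possible choice.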

Data wrappers are components attached to a source: they convert incoming queries into queries that the specific data source understands, then translate the results back into the destination format. Because a wrapper is tied to its source, it has to be changed whenever the source changes. Hand-written wrappers are expensive, so automatic wrapper-generation techniques have been developed. Wrappers can also extend the query capabilities of a data source. They are difficult to use with semi-structured data.

We call data semistructured when it has some form of structure, but one that is not as prescriptive, regular and complete as in traditional databases (for example web data or XML). Semistructured data models are typically based on text, trees or graphs.

With this kind of data, an overall data representation should be built progressively, as we discover and explore new information sources. GAV and LAV are no longer sufficient, so we must use a mediator, which integrates complex systems.

The term mediation includes:

The processing needed to make interfaces work
The knowledge structures that drive the transformation needed to transform data to information
Any intermediate storage that is needed

The main problem is that each domain needs a mediator appropriately designed to understand its semantics, so a mediator must have access to domain metadata.

Mediators are interfaces specialized in a certain domain, placed between the application and the wrappers of the sources. A mediator receives the query written in terms of the application domain, decomposes it and forwards each piece to the appropriate wrapper. It is also responsible for merging the different responses. To do so it relies on the processing logic, the domain knowledge and an internal memory for intermediate data.
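The mediator/wrapper interplay can be sketched as follows. This is a toy illustration under strong assumptions: the class names `ListWrapper` and `Mediator`, the sample sources and the field mappings are all hypothetical, and query decomposition is reduced to forwarding the same selection to every wrapper.

```python
class ListWrapper:
    """Hypothetical wrapper around a source that uses its own field names."""

    def __init__(self, rows, field_map):
        self.rows = rows            # native data of the source
        self.field_map = field_map  # mediator field name -> source field name

    def query(self, field, value):
        native_field = self.field_map[field]  # translate the query
        hits = [row for row in self.rows if row[native_field] == value]
        # Translate the results back into the mediator's vocabulary.
        return [{f: row[n] for f, n in self.field_map.items()} for row in hits]


class Mediator:
    """Forwards a query to every wrapper and merges the answers."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, field, value):
        merged = []
        for wrapper in self.wrappers:
            for row in wrapper.query(field, value):
                if row not in merged:  # drop duplicates coming from several sources
                    merged.append(row)
        return merged


source_1 = [{"title": "Dune", "yr": 1965}, {"title": "Emma", "yr": 1815}]
source_2 = [{"name": "Dune", "year": 1965}, {"name": "Ulysses", "year": 1922}]

mediator = Mediator([
    ListWrapper(source_1, {"title": "title", "year": "yr"}),
    ListWrapper(source_2, {"title": "name", "year": "year"}),
])

books_1965 = mediator.query("year", 1965)
ulysses = mediator.query("title", "Ulysses")
print(books_1965, ulysses)
```

If a source changes its field names, only the corresponding `field_map` (the wrapper) has to be updated, while the mediator and the application stay the same.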

2.1 TSIMMIS

TSIMMIS is the first system based on the mediator/wrapper paradigm. Main features:

Unique, graph-based internal model: the Object Exchange Model (OEM), managed by the mediator. It is a self-descriptive model, since it represents data directly, with no schema at all, for example <temp-in-Celsius, int, 3>
Wrappers for model-to-model translation, one for each source
Queries posed to the mediator in LOREL, an object-oriented query language
The mediator knows the semantics of the application domain and manages the query and the model
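To make the self-descriptive nature of OEM concrete, the sketch below models an object as a (label, type, value) triple, as in the <temp-in-Celsius, int, 3> example. The `OEMObject` class, the "set" type name for nested objects and the sample data are assumptions made for illustration; object identifiers are left out.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class OEMObject:
    label: str                            # describes the meaning of the value
    type: str                             # atomic type name, or "set" for nested objects
    value: Union[int, float, str, list]   # atomic value, or a list of OEMObject

    def find(self, label):
        """Return every object in this subtree (self included) with the given label."""
        found = [self] if self.label == label else []
        if self.type == "set":
            for child in self.value:
                found.extend(child.find(label))
        return found


temperature = OEMObject("temp-in-Celsius", "int", 3)
city = OEMObject("city", "string", "Milano")
reading = OEMObject("reading", "set", [temperature, city])

print(reading.find("temp-in-Celsius")[0].value)
```

Because every object carries its own label and type, data can be explored without consulting a separate schema.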

Each mediator is specialized in a certain domain and must know the domain metadata (data semantics). If a data source changes, its wrapper has to be modified (automatic wrapper generators are useful).

LOREL (Lightweight Object Repository Language) is an object-based query language.

The TSIMMIS system introduced a DataGuide: a kind of a-posteriori schema, progressively built by the mediator while exploring the data source, to allow querying. This process is strictly tied to the application and helps the user understand the "data schema" in order to write queries.
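The idea of an a-posteriori schema can be sketched by collecting the label paths that actually occur in the data and extending the summary every time a new object is explored. This is a simplified approximation of a DataGuide; the functions `build_dataguide` and `label_paths` and the sample documents are hypothetical.

```python
def build_dataguide(data, guide=None):
    """Merge the label structure of one semistructured object into the guide."""
    if guide is None:
        guide = {}
    if isinstance(data, dict):
        for label, child in data.items():
            # Reuse the node for this label if it was already discovered.
            build_dataguide(child, guide.setdefault(label, {}))
    elif isinstance(data, list):
        for item in data:
            build_dataguide(item, guide)
    return guide  # atomic values add no further structure


def label_paths(guide, prefix=""):
    """List every label path recorded in the guide, in sorted order."""
    result = []
    for label in sorted(guide):
        path = f"{prefix}.{label}" if prefix else label
        result.append(path)
        result.extend(label_paths(guide[label], path))
    return result


documents = [
    {"restaurant": {"name": "Da Pino", "address": {"city": "Milano"}}},
    {"restaurant": {"name": "Blue Bar", "phone": "555-0100",
                    "address": {"street": "Via Roma"}}},
]

guide = {}
for doc in documents:  # the guide grows as new data is explored
    build_dataguide(doc, guide)

paths = label_paths(guide)
print(paths)
```

A user can inspect `paths` to learn which labels can be queried, even though no schema was declared in advance.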

2.2 Ontologies (Slide Part 2)

Ontologies are a way to solve the problem of automatic semantic matching. An ontology is a formal representation of knowledge in a specific domain: it defines the concepts and categories that exist within that domain and the relationships between them.

Its vocabulary is used to express queries and assertions. Ontologies are used when we have a large number of heterogeneous data sources, with different levels of data structure, in different operational contexts and with different terminologies. They are also used with time-variant data and transient data sources.

2 types of ontologies:

Taxonomic: definition of concepts through terms, their hierarchical organization and additional predefined relationships (synonyms and homonyms), to provide a reference vocabulary
Descriptive: not only a taxonomy but also relationships between objects. They provide information for "aligning" existing data structures or for designing new, specialized ontologies (domain ontologies)

An ontology consists of Ontology = (Concepts, Relations, Axioms, Instances):

- Concepts: generic or specific concepts that describe world categories or a specific domain
- Concept definition: via a formal or natural language
- Relationships between concepts: taxonomies (is-a), meronymies (part-of), synonymies and user-defined associations

An ontology is a knowledge base, composed of:

- T-box (Terminological Box): represents the concepts, classes and relationships that define the structure of the ontology. The T-box defines the types of entities and the relationships between them, and represents the general knowledge about the domain.
- A-box (Assertional Box): represents the specific instances of entities and the relationships between them. The A-box defines the actual individuals and their properties.

2.2.1 Semantic Interoperability

Semantic Interoperability (semantic Web): it makes it easier for machines to automatically process and integrate information available on the web.

RDF (Resource Description Framework) is a data model for objects ("resources") and the relations between them; it provides a simple semantics for this data model and can be represented in an XML syntax. At the core of RDF is the notion of a subject-predicate-object triple: a statement that represents two vertices connected by an edge, where the predicate is the relation.

The first level above RDF is OWL (Web Ontology Language), an ontology language that can formally describe the meaning of the terminology used in Web documents. OWL adds more vocabulary for describing properties and classes, and it is designed for use by applications that need to process the content of information instead of just presenting information to humans.
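The triple model can be illustrated without any RDF library by storing statements as plain tuples. The `ex:` resources below and the `match` helper are hypothetical; a real application would normally rely on an RDF store and a query language such as SPARQL.

```python
# Each statement is a (subject, predicate, object) triple, i.e. a labelled edge.
triples = {
    ("ex:Alice", "ex:teaches", "ex:Databases"),
    ("ex:Databases", "rdf:type", "ex:Course"),
    ("ex:Databases", "ex:heldAt", "ex:University"),
}


def match(graph, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


about_course = match(triples, s="ex:Databases")
teachers = [s for s, _, _ in match(triples, p="ex:teaches")]
print(about_course, teachers)
```

Each matched triple is one edge of the graph, so following the object of one triple as the subject of the next pattern navigates the graph.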

The two boxes provide different reasoning services.

T-Box Services:

- Subsumption: verifies whether a concept C is subsumed by (is a subconcept of) a concept D

- Consistency: verifies that there exists at least one interpretation which satisfies the T-Box

- Local Satisfiability: verifies that there exists an interpretation in which concept C is true

A-Box Services:

- Consistency: verifies that an A-Box is consistent with a given T-Box

- Instance Checking: verifies that an individual x belongs to a concept C

- Instance Retrieval: returns the extension (set of individuals) of a given concept C
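Some of these services can be sketched on a toy knowledge base in which the T-Box is only an is-a taxonomy. The concept names, the individuals and the three functions are illustrative assumptions; real description-logic reasoners handle far richer axioms than this.

```python
# T-Box: hypothetical taxonomy, each concept mapped to its direct superconcepts.
tbox = {
    "Person": set(),
    "Student": {"Person"},
    "Professor": {"Person"},
    "PhDStudent": {"Student"},
}

# A-Box: assertions of the form (individual, concept).
abox = {("anna", "PhDStudent"), ("luca", "Professor")}


def is_subsumed_by(c, d):
    """Subsumption: True if concept c is a subconcept of (or equal to) d."""
    if c == d:
        return True
    return any(is_subsumed_by(parent, d) for parent in tbox.get(c, ()))


def instance_check(x, c):
    """Instance checking: does individual x belong to concept c?"""
    return any(ind == x and is_subsumed_by(concept, c) for ind, concept in abox)


def instance_retrieval(c):
    """Instance retrieval: the extension of concept c, as a sorted list."""
    return sorted({ind for ind, concept in abox if is_subsumed_by(concept, c)})


print(is_subsumed_by("PhDStudent", "Person"))
print(instance_check("anna", "Student"))
print(instance_retrieval("Person"))
```

Note that `anna` is retrieved as a `Person` even though the A-Box only asserts `PhDStudent`: the answer is derived by combining the assertion with the T-Box.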

2.2.2 Ontology matching

Ontology matching is the process of finding pairs of resources coming from different ontologies which can be considered equal in meaning.

Again we need some sort of similarity measure, this time based on semantics (not on the structure of the words). More general notions of distance are used to define similarity.

Different problems can cause ontology mismatches.

At Definition Language Level:

- Syntax

- Availability of different constructs

- Semantics of the linguistic primitives

At the Ontology Level:

- Scope:

Ontologies are used to solve the problem of semantic matching, which arises when dealing with heterogeneous sources and different levels of data structure in time-variant environments. An ontology is a formal representation of knowledge in a specific domain. It consists of concepts, relationships, axioms and instances. A concept describes a world category or a specific domain. The elements of an ontology are placed in either an A-Box or a T-Box. The A-Box represents specific instances and the relationships among them, while the T-Box represents the concepts, classes and relationships that define the structure of the ontology.

We have 2 kinds of ontologies:

Taxonomic: definition of concepts using terms and their hierarchy, plus predefined relationships (synonyms and homonyms), to provide a vocabulary
Descriptive: can also include relationships defined by the user. More oriented to the database field.

Ontologies can be used in data integration in several ways:

As a global schema: the source ontologies are mapped into the global ontology, which can be queried using ontology query languages
To support automatic understanding of the semantics of the instances, for automatic entity resolution and data fusion

3. Data WareHouses [6]

Data should be integrated across the enterprise. Summary data provide real value to the organization. Historical data hold the key to understanding data over time.

An operational definition of a Data Warehouse (DW): a data warehouse is a

Subject-oriented
Integrated
Time-varying
Non-volatile

collection of data that is used primarily in organizational decision making.

The main purpose of a data warehouse is to allow systematic or ad-hoc data analysis and mining.

Conceptual design can be carried out with the Dimensional Fact Model (DFM), a graphical formalism that describes a set of fact schemata. The components of a fact schema are:

Facts: a fact is a concept that is relevant for the decision process
Measures: a measure is a numeric property of a fact, which describes one of its quantitative aspects
Dimensions: a dimension is a discrete property of a fact, which describes a possible perspective of analysis
Dimension hierarchies: a hierarchy is a tree of attributes rooted in a dimension, which defines the levels at which the fact can be aggregated (for example day, month, year)

An example of a fact schema is a "sale" fact, with dimensions time, object type and shop position, and measures quantity and income.

There are 3 types of measures:

- Flow measures: they refer to a time period and are evaluated cumulatively at the end of that period. (SUM, AVG, MIN, MAX over both temporal and non-temporal hierarchies.) (Examples: number of sales in a day, total income in a month, number of births in a year.)

- Level measures: they are evaluated at particular time instants. (AVG, MIN, MAX over any hierarchy, but SUM only over non-temporal hierarchies.) (Examples: number of products in stock, number of citizens in a city.)

- Unitary measures: they are evaluated at particular time instants, but they are relative measures. (Only AVG, MIN, MAX.) (Example: the unit price of an item at a particular instant, which cannot be summed over time, category or shop.)
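The aggregation rules above can be encoded as a small lookup table that rejects operators that are not meaningful for a given measure type. The table only restates the rules listed above; the function name `aggregate`, the hierarchy labels and the sample values are assumptions made for illustration.

```python
ALL_OPS = {"SUM", "AVG", "MIN", "MAX"}
NO_SUM = {"AVG", "MIN", "MAX"}

# Operators allowed for each measure type along each kind of hierarchy.
ALLOWED = {
    "flow": {"temporal": ALL_OPS, "non_temporal": ALL_OPS},
    "level": {"temporal": NO_SUM, "non_temporal": ALL_OPS},
    "unitary": {"temporal": NO_SUM, "non_temporal": NO_SUM},
}

OPERATORS = {
    "SUM": sum,
    "AVG": lambda values: sum(values) / len(values),
    "MIN": min,
    "MAX": max,
}


def aggregate(values, operator, measure_type, hierarchy):
    """Apply the operator only if it is meaningful for this measure type."""
    if operator not in ALLOWED[measure_type][hierarchy]:
        raise ValueError(
            f"{operator} is not meaningful for a {measure_type} measure "
            f"along a {hierarchy} hierarchy"
        )
    return OPERATORS[operator](values)


daily_sales = [10, 12, 8]     # flow measure: units sold on three days
stock_levels = [100, 90, 95]  # level measure: units in stock at the end of each day

total_sold = aggregate(daily_sales, "SUM", "flow", "temporal")
average_stock = aggregate(stock_levels, "AVG", "level", "temporal")
print(total_sold, average_stock)
```

Summing `stock_levels` over time would raise a `ValueError`, which reflects the fact that adding the stock of different days does not yield a meaningful quantity.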

OLAP operations:

Roll-up: aggregates data at a higher level of a hierarchy
Drill-down: disaggregates data to a lower level
Slice & Dice: applies selections and projections, which reduce data dimensionality
Pivoting: selects two dimensions along which to re-aggregate data (rotates the cube)
Ranking: sorts data according to predefined criteria
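The operations listed above can be tried on a tiny in-memory set of sale facts. The data, the field names and the helper functions are hypothetical, and SUM is the only aggregation used; an actual OLAP engine would work on a cube or on a star schema.

```python
from collections import defaultdict

# Hypothetical sale facts with a day -> month hierarchy on the time dimension.
sales = [
    {"month": "2023-01", "day": "2023-01-10", "type": "book", "shop": "Milano", "quantity": 3},
    {"month": "2023-01", "day": "2023-01-11", "type": "book", "shop": "Roma", "quantity": 2},
    {"month": "2023-01", "day": "2023-01-11", "type": "pen", "shop": "Milano", "quantity": 5},
    {"month": "2023-02", "day": "2023-02-01", "type": "pen", "shop": "Roma", "quantity": 4},
]


def group_sum(facts, dims, measure="quantity"):
    """Sum the measure for every combination of the given dimension levels."""
    totals = defaultdict(int)
    for fact in facts:
        totals[tuple(fact[d] for d in dims)] += fact[measure]
    return dict(totals)


def slice_facts(facts, dim, value):
    """Slice: fix one dimension to a single value."""
    return [fact for fact in facts if fact[dim] == value]


def rank(cells, descending=True):
    """Ranking: sort aggregated cells by their value."""
    return sorted(cells.items(), key=lambda item: item[1], reverse=descending)


by_day = group_sum(sales, ["day", "type"])      # detailed level (drill-down view)
by_month = group_sum(sales, ["month", "type"])  # roll-up from day to month
milano = group_sum(slice_facts(sales, "shop", "Milano"), ["type"])
pivoted = group_sum(sales, ["type", "month"])   # pivoting: same cells, axes swapped
ranking = rank(milano)
print(by_month, ranking)
```

Roll-up and drill-down differ only in the level chosen along the time hierarchy, which is why the same `group_sum` helper covers both.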

Details

Publisher: Joseph22ITA

Academic year 2022-2023

22 pages

SSD (scientific-disciplinary sector): Industrial and Information Engineering, ING-INF/03 Telecommunications

The contents of this page are personal reworkings by the Publisher Joseph22ITA of information learned by attending the lectures of Technologies for informative system and through independent study of any reference books, in preparation for the final exam or thesis. They should not be regarded as official material of the Politecnico di Milano university or of Prof. Letizia Tanca.