
Politecnico di Milano

Master of Computer Science and Engineering

Technologies For Information Systems

Notes

Contents

1 Data Integration
1.1 Integrating Database Systems
1.2 The Steps of Data Integration
1.2.1 Schema Reconciliation
1.2.2 GAV and LAV
1.2.3 Record Linkage and Data Fusion
1.2.4 Heterogeneous Data Sources
1.3 Semistructured Data
1.3.1 Mediators
1.3.2 Metamodel Data Integration
1.4 Ontologies
1.4.1 The Semantic Web
1.4.2 DB vs Ontology and Integration Support
1.5 Lightweight Integration
1.6 Future Trends in Data Integration
1.6.1 Uncertainty
1.6.2 Data Provenance
1.6.3 Crowdsourcing
1.6.4 Visualizing Integrated Data
1.6.5 Integrating Social Media
1.6.6 Cluster and Cloud-Based Solutions

2 Volume and Velocity
2.1 Big Data Storage
2.2 NoSQL Data Model
2.3 The CAP Theorem

3 Data Analysis and Exploration
3.1 Data Exploration
3.2 Data Mining

4 Data Warehouses
4.1 OLAP
4.1.1 OLAP Models and Operations
4.2 Data Warehouse Design
4.2.1 Conceptual Design
4.2.2 Logical Design

5 Temporal Databases
5.1 Timestamps
5.2 Time
5.2.1 Temporal Data Types
5.2.2 Predicates on Intervals
5.2.3 Relational and Aggregate Operators
5.2.4 Temporal Difference and Temporal Join

6 Data Personalization
6.1 Data Personalization and Context-Awareness
6.1.1 Context-Aware System Design

7 Data Quality
7.1 Data Quality Dimensions
7.2 Data Quality Assessment
7.3 Data Quality Improvement
7.3.1 Data Cleaning
7.4 Data Quality Improvement in Integration Activities
7.5 Data Quality Issues in Big Data

8 Data Management in Pervasive Systems
8.1 Wireless Sensor Networks
8.2 RFID
8.3 Data Stream Management
8.4 Mobile Systems
8.5 The PerLa Language
8.6 Context

1 Data Integration

Data integration consists in combining data coming from different sources in order to provide the user with a unified view. Big Data are characterized by the four Vs:

• Volume: each data source can contain a huge volume of information;

• Velocity: data are rapidly and continuously produced, and the data sources are very dynamic over time;

• Variety: data sources are extremely heterogeneous and contain a great variety of data. People and enterprises need to integrate data and the systems that handle them;

• Veracity: the quality of the data sources can vary, and this may determine whether information should be used in making decisions. In particular, veracity covers a number of quality aspects: completeness, validity, consistency, timeliness (not obsolete), accuracy...

The main challenge regarding Big Data is information overload, that is, the difficulty of understanding and making decisions when too much information is available.

Since there are various forms of autonomy (e.g. communication and execution) in deciding which data should be represented and how, heterogeneity is generated, and the problem of achieving interoperability of software applications, services and information is becoming bigger and bigger.

1.1 Integrating Database Systems

When database systems have to be integrated, several approaches exist:

• Data can be merged into a new database, a materialized database, by means of extract-transform-load (ETL) systems. Only the new database can be accessed by queries. ETL systems are used to build data warehouses (DWs), where ETL is performed periodically. The main purpose of DWs is ad-hoc data analysis and mining;

• Data can remain at their sources. In this case we take advantage of a virtual, non-materialized database, mapped onto the sources that are actually accessed by queries. This approach always returns a fresh and up-to-date answer to the query;

• Data Exchange. The integration system dispatches a reformulated query to a source database and the result is sent to a target database that replies to the query. The target instance is materialized using the source instance (e.g. IBM Clio).

When queries must not return out-of-date data, ETL and data warehouses are no longer the right solution: in this case, virtual data integration must be used. In general, queries are decomposed and evaluated by local databases according to an efficient strategy.
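The difference in freshness between the materialized and the virtual approach can be illustrated with a minimal Python sketch. The relation `source_orders` and the helpers `etl_load` and `virtual_query` are hypothetical names introduced only for this example; they do not refer to any real ETL tool.

```python
import copy

# Hypothetical source: rows owned by a local database.
source_orders = [{"id": 1, "status": "open"}]


def etl_load(source):
    """Materialized approach: copy the source rows into a new database."""
    return copy.deepcopy(source)


def virtual_query(source, predicate):
    """Virtual approach: evaluate the query on the source at query time."""
    return [row for row in source if predicate(row)]


def is_open(row):
    return row["status"] == "open"


warehouse = etl_load(source_orders)                  # periodic ETL run
source_orders.append({"id": 2, "status": "open"})    # the source changes afterwards

stale = [row for row in warehouse if is_open(row)]   # answer from the materialized copy
fresh = virtual_query(source_orders, is_open)        # answer from the live source
print(len(stale), len(fresh))
```

The materialized copy misses the row added after the load until the next ETL run, while the virtual query sees it immediately, at the price of accessing the source for every query.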

1.2 The Steps of Data Integration

Data integration problems arise even in the case of a single, centralized database. Given a data model, data need to be organized according to the chosen model to avoid inconsistencies and allow query optimization. Each datum must appear only once in order to eliminate redundancies and useless memory occupation.

In the case of a single database, integration can be achieved by means of view integration and a mixed strategy:

• All the involved departments (subsystems) must be identified;

• Design of the skeleton schema and view conceptual design for each department;

• View integration and restructuring to produce the global conceptual schema;

• Conceptual to logical translation (from E-R to relational) of the global schema and subschemas;

• Reconciliation of the global schema with the single schemata. The final views are produced.

In the case of distributed databases, the procedure is very similar and the design pattern is the same as in the centralized situation, because we still count on a homogeneous technology and data model.

On the other hand, the presence of various data sources complicates matters.

The integration system must provide a uniform view to the user and know how to

query the sources. The autonomy of design (which data), communication (which

services) and execution (which algorithms) causes heterogeneity: we might deal with

different platforms, data models, query languages, data schemas and different values

for the same information.

1.2.1 Schema Reconciliation

The problem consists in finding matching schemas in different data sources. Sometimes there is no need for this step, because the data do not have a schema (e.g. sensor data).

If sources have a schema, we start by translating the logical schema to produce a conceptual representation (E-R). On top of that, conflicts are resolved and the global conceptual schema is produced. It is then translated into a logical representation and reconciled with the single starting schemata by means of proper view definitions. A mapping between the global schema and the original data sources is produced.

Conflicts in the data sources can be of various types: name conflicts (synonyms/homonyms), type conflicts (at attribute/entity level), data semantics conflicts, structure conflicts, cardinality conflicts and key conflicts (different primary keys for the same entity).

We can distinguish between homogeneous and heterogeneous data integration. There exist different kinds of heterogeneity: in particular, the data sources may have different data models, but also different query languages, or they may deal with semistructured or unstructured data. The global schema will provide a reconciled, integrated and virtual view of the data sources.

The system that we want to build must support accesses to different data sources

and know their contents. The data sources are integrated by means of a global

schema, whose language is used to query the system. The key point is that queries

are rewritten to be understandable by the sources and the replies are combined to

produce a final answer.

1.2.2 GAV and LAV

A data integration system is basically a triple (G, S, M), where queries are posed in terms of the global schema G and mapped onto the source schemata S thanks to a set of mappings M. There are two basic approaches to defining these mappings: GAV and LAV.

GAV means Global As View and expresses the global schema in terms of the data source schemata. A GAV mapping is a set of assertions that specify, for each element g of the global schema, a query q_S on the data sources. This approach is effective only if sources are stable, because every time a new source is introduced the views must be modified.

The views are produced using not only the union, but also other complex operators such as outerjoin, outerunion and generalization.
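A GAV mapping can be illustrated with a minimal Python sketch in which one global relation is defined as a query over two sources. The relations `s1_employees` and `s2_staff`, their attribute names and the global relation Person(name, department) are all assumptions made up for this example.

```python
# Hypothetical source relations, each with its own schema.
s1_employees = [{"emp_name": "Ada", "dept": "R&D"}]
s2_staff = [
    {"full_name": "Alan", "unit": "Sales"},
    {"full_name": "Ada", "unit": "R&D"},
]


def global_person():
    """GAV assertion: the global relation Person(name, department) is
    defined as a query (here a duplicate-free union) over the sources."""
    rows = {(r["emp_name"], r["dept"]) for r in s1_employees}
    rows |= {(r["full_name"], r["unit"]) for r in s2_staff}
    return sorted(rows)


def query_department(department):
    """A query on the global schema is answered by unfolding the view."""
    return [name for name, dept in global_person() if dept == department]


print(query_department("R&D"))
```

Adding a third source would require editing `global_person` itself, which is why this style of mapping suits stable sources.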

LAV means Local As View. The global schema can be designed independently from the schemata of the data sources, and the mappings are obtained by defining each data source as a view over the global schema. This approach works well if sources are transient, but query processing is more complex than in GAV.
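The LAV idea can be sketched in Python under strong simplifying assumptions: a single global relation Movie(title, year, genre), and sources whose view definitions are limited to an optional selection on genre. The `Source` class and the helpers `relevant_sources` and `answer` are hypothetical; real LAV query processing (answering queries using views) is considerably harder than this selection-only case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Source:
    """A source described as a view over the global relation
    Movie(title, year, genre): it holds movies of `genre` (None = any)."""
    name: str
    genre: Optional[str]
    rows: list


SOURCES = [
    Source("S1", "comedy", [("Airplane!", 1980, "comedy")]),
    Source("S2", None, [("Alien", 1979, "horror"), ("Clue", 1985, "comedy")]),
]


def relevant_sources(query_genre):
    """Keep the sources whose view definition can contain answers."""
    return [s for s in SOURCES if s.genre in (None, query_genre)]


def answer(query_genre):
    """Answer the global query 'movies of a given genre' using the views."""
    found = set()
    for source in relevant_sources(query_genre):
        found |= {row for row in source.rows if row[2] == query_genre}
    return sorted(found)


# Adding a source only requires appending its description; nothing else changes.
SOURCES.append(Source("S3", "horror", [("Psycho", 1960, "horror")]))
print([s.name for s in relevant_sources("comedy")])
print(answer("horror"))
```

The global schema and the existing source descriptions are untouched when S3 is added, which is the property that makes LAV attractive for transient sources.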

Usually, sources are assumed to be incomplete with respect to the global schema. A mapping (view) is sound when it provides a subset of the data present in the data source it refers to. A mapping is complete if it provides a superset of the available data in the corresponding data source. A mapping is exact if it is both sound and complete.
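Following the definitions given above, the three cases can be read as set comparisons. The function `classify_mapping` below is a hypothetical helper written for illustration; it assumes both the data provided by the mapping and the data in the source can be enumerated as sets of tuples.

```python
def classify_mapping(provided, in_source):
    """Compare the tuples a mapping provides with those in its source."""
    provided, in_source = set(provided), set(in_source)
    if provided == in_source:
        return "exact"       # both sound and complete
    if provided < in_source:
        return "sound"       # proper subset of the source data
    if provided > in_source:
        return "complete"    # proper superset of the source data
    return "neither"


print(classify_mapping({1, 2}, {1, 2, 3}))
```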

GAV and LAV do not always provide exact mappings. In detail, GAV with integrity constraints can produce exact or sound mappings, while the LAV approach can produce exact or complete ones.

There is also a third approach to produce mappings: GLAV. The relationship (mapping) between sources and global schema is obtained by defining a set of views, some over the global schema and some over the data sources.

1.2.3 Record Linkage and Data Fusion

Sometimes inconsistencies in the data may occur. The process of record linkage consists in finding records that refer to the same entity across different data sources. Indeed, the same real-world object can be present in more than one data source with different values for some attributes. Data fusion is the process of integrating multiple data sources to produce more consistent, accurate and useful data.
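The two steps can be sketched in Python using the standard `difflib` module. The similarity threshold of 0.8, the sample records and the fusion policy (prefer the first non-missing value) are arbitrary assumptions chosen for illustration, not a recommended configuration.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.8):
    """Approximate string match on lower-cased values (assumed threshold)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def link(records_a, records_b):
    """Record linkage: pair up records whose names are similar enough."""
    return [(a, b) for a in records_a for b in records_b
            if similar(a["name"], b["name"])]


def fuse(a, b):
    """Data fusion: keep the value from `a` unless it is missing."""
    return {key: a.get(key) if a.get(key) is not None else b.get(key)
            for key in a.keys() | b.keys()}


# Hypothetical records from two sources describing overlapping people.
source_a = [{"name": "Jon Smith", "city": "Milan", "phone": None}]
source_b = [
    {"name": "John Smith", "city": "Milano", "phone": "555-0100"},
    {"name": "Mary Jones", "city": "Rome", "phone": "555-0199"},
]

pairs = link(source_a, source_b)
fused = [fuse(a, b) for a, b in pairs]
print(fused)
```

Here the conflict on `city` is resolved by trusting the first source; other policies (most recent value, most frequent value, most reliable source) fit the same structure.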

1.2.4 Heterogeneous Data Sources

Data sources can be characterized by different data models. A new element appears in this scenario: the wrapper. It is in charge of converting queries into queries that are understandable by the data sources, and of translating results into a format understandable by the application. Wrappers deal with both structured and semistructured data. Since building ad-hoc wrappers is very expensive, it is better to generate them automatically.
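The role of a wrapper can be sketched in Python with two hand-written wrappers that expose the same `query` interface over a CSV source and a JSON source. The class names, the sample data and the `field_map` parameter are assumptions introduced for this example; they only show the translation in both directions (query to source, results to common format).

```python
import csv
import io
import json


class CsvWrapper:
    """Wraps a CSV source whose column names already match the global ones."""

    def __init__(self, text):
        self.text = text

    def query(self, attribute, value):
        reader = csv.DictReader(io.StringIO(self.text))
        return [dict(row) for row in reader if row[attribute] == value]


class JsonWrapper:
    """Wraps a JSON source that uses different field names (field_map)."""

    def __init__(self, text, field_map):
        self.text = text
        self.field_map = field_map                      # global name -> local name
        self.inverse = {local: glob for glob, local in field_map.items()}

    def query(self, attribute, value):
        local = self.field_map[attribute]               # translate the query
        return [{self.inverse[k]: v for k, v in item.items()}   # translate results
                for item in json.loads(self.text) if item[local] == value]


people_csv = "name,city\nAda,Milan\nAlan,Rome\n"
people_json = '[{"n": "Grace", "c": "Milan"}, {"n": "Linus", "c": "Turin"}]'
wrappers = [CsvWrapper(people_csv),
            JsonWrapper(people_json, {"name": "n", "city": "c"})]


def integrated_query(attribute, value):
    """Send the same query to every wrapper and combine the replies."""
    results = []
    for wrapper in wrappers:
        results.extend(wrapper.query(attribute, value))
    return results


print(integrated_query("city", "Milan"))
```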

The design steps of data integration become:

• reverse engineering and production of the conceptual schema;

• conceptual schemata integration;

• choice of the target logical data model and translation of the global conceptual

schema;

• definition of the language translation (wrapping);

• definition of the data views.

1.3 Semistructured Data

Semistructured data have some form of structure, but it is not as prescriptive, regular and complete as in traditional DBMSs. Examples are XML, JSON and data coming from the integration of heterogeneous data sources. They are all different and cannot be easily integrated.

We would like to integrate, compare and query data with different structures, including semistructured data, as if they were structured (therefore an overall representation is built progressively as new information sources are discovered).

1.3.1 Mediators

In order to query semistructured data, a mediator-based approach is used: in fact, the user is not fully aware of the complete structure of data. Mediators play a key role: they are data structures used to accommodate data coming from different sources. They include the knowledge structures needed to turn data into information, the required intermediate data storage, and the processing needed to make interfaces work.

The first system that took advantage of mediators was Tsimmis. Each data source has a wrapper for query translation, and multiple mediators are in charge of understanding to which data source the query should be forwarded. They are aware of the semantics of the application domain.

Mediators progressively build a representation of the sources by means of queries and learn their content. Tsimmis used a dataguide: an a posteriori schema is progressively built by exploring the data sources. There is no need to solve conflicts at design time because they are solved on-line. Nevertheless, if data sources change, the wrapper has to be modified: this is why we would like to generate wrappers automatically.

The mediator is defined as the orchestrator of the integration: it knows domain meta-data, which convey data semantics. Lorel is a query language for semistructured data, especially used for OEM (Object Exchange Model) databases. In detail, OEM is a self-describing data model where information in the form of labels is intermixed with data. It can be thought of as a graph where a Lorel query is a specific path. OEM is used by Tsimmis' mediators to represent data within the overall system.
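The idea of a labeled graph queried by following a path can be sketched in Python. The encoding below (an object is either an atomic value or a list of `(label, child)` pairs), the sample data and the `follow` helper are assumptions made for illustration; this is a plain path traversal, not actual Lorel syntax or the Tsimmis implementation.

```python
# An OEM-like object is either an atomic value or a list of (label, child)
# pairs, so labels travel together with the data they describe.
guide = [
    ("restaurant", [("name", "Da Mario"), ("zipcode", 20133)]),
    ("restaurant", [
        ("name", "Il Faro"),
        ("address", [("street", "Via Roma"), ("city", "Milan")]),
    ]),
]


def follow(obj, path):
    """Return every object reachable from obj by following the labels in path."""
    current = [obj]
    for label in path:
        current = [child
                   for node in current if isinstance(node, list)
                   for edge, child in node if edge == label]
    return current


print(follow(guide, ["restaurant", "name"]))
print(follow(guide, ["restaurant", "address", "city"]))
```

The two restaurants have different substructures, yet the same path query works on both: objects lacking a label simply contribute nothing to the result.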

1.3.2 Metamodel Data Integration

When we deal with different data models, we can notice that constructs in the various models are similar. A metamodel can use basic constructs to create the constructs of the model we need to represent.

Basically, a metamodel is an abstract model for the specification of concrete models. There exist two types of metamodels:

• General abstract entities can be specialized and become objects in the target model;

• Constructs are used to build the objects of the target model.

Metamodels make it possible to translate different models into a single formalism and provide automatic translation into and from this common formalism. They have high expressive power. Metamodel-based data integration translates queries expressed in the metamodel formalism into the language of each data source.

The application context for metamodels consists of a large number of heterogeneous and transient data sources, with time-varying data.

1.4 Ontologies

Ontologies are a way to solve the problem of automatic semantic matching. They

are formal specifications of conceptualizations of a shared knowledge domain. In

other words, an ontology is a controlled vocabulary that describes objects and the

relationships between them in a formal way. It has a grammar to express something

meaningful within a specified domain of interest.

A mediator can query an appropriate ontology instead of being built with hard-

coded knowledge of the domain. The knowledge is built independently from the

application program.

There are two types of ontologies:

• Taxonomic ontologies define concepts through terms, their hierarchical organization and additional relationships, such as synonymy. They provide a reference vocabulary;

• Descriptive ontologies define concepts through data structures and interrelationships. They do not only deal with hierarchies, but also specify relationships among components.

The formal definition of an ontology is (C, R, I, A), where C stands for concepts, R for relationships among them, I for instances (objects belonging to classes) and, finally, A stands for a set of axioms. In more detail, an ontology consists of:

• concepts: generic concepts express general world categories. Specific concepts describe a particular application domain;

• concept definition: both formal and natural languages can be used;

• relationships among concepts: synonyms, homonyms, taxonomies (is_a), meronymies (part_of).
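The is_a and part_of relationships and the instances of an ontology can be sketched in Python with plain dictionaries. The concept names, the single-parent hierarchy and the helpers `is_subconcept` and `instance_of` are assumptions made for this example; a real ontology language allows multiple parents and axioms, which this sketch omits.

```python
# C and R: concepts linked by is_a (one parent each, assumed acyclic)
# and by part_of.
IS_A = {"father": "man", "man": "person", "woman": "person"}
PART_OF = {"wheel": "car", "engine": "car"}

# I: instances and the concept they directly belong to.
INSTANCES = {"Tom": "father", "Ann": "woman"}


def is_subconcept(concept, ancestor):
    """True if `ancestor` is reached from `concept` by following is_a links
    (every concept is considered a subconcept of itself)."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = IS_A.get(concept)
    return False


def instance_of(individual, concept):
    """True if the individual belongs to the concept or to one of its subconcepts."""
    return is_subconcept(INSTANCES[individual], concept)


print(is_subconcept("father", "person"), instance_of("Ann", "man"))
```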

An ontology is part of a knowledge base composed of a T-Box, which contains all the concept and role definitions as well as the axioms of our logical theory (e.g. "a father is a man with a child"), and an A-Box, which contains all the basic assertions of the logical theory (e.g. "Tom is a father").

Services for T-Box are:

• Subsumption: verifies if a concept is a subconcept of another one;

• Consistency: verifies if there exists at least one interpretation that satisfies the given T-Box;

• verifies, for a given C, that exists a


The contents of this page are personal reworkings by the publisher hardware994 of information learned by attending the lectures of Technologies For Information Systems and through independent study of any reference books, in preparation for the final exam or the thesis. They are not to be regarded as official material of the Politecnico di Milano university or of Prof. Tanca Letizia.