Politecnico di Milano
Master of Computer Science and Engineering
Technologies For Information Systems
Notes
Contents
1 Data Integration
1.1 Integrating Database Systems
1.2 The Steps of Data Integration
1.2.1 Schema Reconciliation
1.2.2 GAV and LAV
1.2.3 Record Linkage and Data Fusion
1.2.4 Heterogeneous Data Sources
1.3 Semistructured Data
1.3.1 Mediators
1.3.2 Metamodel Data Integration
1.4 Ontologies
1.4.1 The Semantic Web
1.4.2 DB vs Ontology and Integration Support
1.5 Lightweight Integration
1.6 Future Trends in Data Integration
1.6.1 Uncertainty
1.6.2 Data Provenance
1.6.3 Crowdsourcing
1.6.4 Visualizing Integrated Data
1.6.5 Integrating Social Media
1.6.6 Cluster and Cloud-Based Solutions
2 Volume and Velocity
2.1 Big Data Storage
2.2 NoSQL Data Model
2.3 The CAP Theorem
3 Data Analysis and Exploration
3.1 Data Exploration
3.2 Data Mining
4 Data Warehouses
4.1 OLAP
4.1.1 OLAP Models and Operations
4.2 Data Warehouse Design
4.2.1 Conceptual Design
4.2.2 Logical Design
5 Temporal Databases
5.1 Timestamps
5.2 Time
5.2.1 Temporal Data Types
5.2.2 Predicates on Intervals
5.2.3 Relational and Aggregate Operators
5.2.4 Temporal Difference and Temporal Join
6 Data Personalization
6.1 Data Personalization and Context-Awareness
6.1.1 Context-Aware System Design
7 Data Quality
7.1 Data Quality Dimensions
7.2 Data Quality Assessment
7.3 Data Quality Improvement
7.3.1 Data Cleaning
7.4 Data Quality Improvement in Integration Activities
7.5 Data Quality Issues in Big Data
8 Data Management in Pervasive Systems
8.1 Wireless Sensor Networks
8.2 RFID
8.3 Data Stream Management
8.4 Mobile Systems
8.5 The PerLa Language
8.6 Context
1 Data Integration
Data integration consists of combining data coming from different sources in order
to provide the user with a unified view. Big Data are characterized by the four Vs:
• Volume: each data source can contain a huge volume of information;
• Velocity: data are rapidly and continuously produced, and the data sources are
very dynamic over time;
• Variety: data sources are extremely heterogeneous and contain a great variety
of data. People and enterprises need to integrate data and the systems that
handle them;
• Veracity: the quality of the data sources can vary, and this may determine whether
information should be used in making decisions. In particular, veracity
covers a number of quality aspects: completeness, validity, consistency,
timeliness (not obsolete), accuracy...
The main challenge regarding Big Data is information overload, that is, the
difficulty of understanding and making decisions when too much information is available.
Since there are various forms of autonomy (e.g. communication and execution)
in deciding which data should be represented and how, heterogeneity is generated
and the problem of achieving interoperability of software applications, services and
information keeps growing.
1.1 Integrating Database Systems
When database systems have to be integrated, several approaches exist:
• Data can be merged in a new database, a materialized database, by means
of extract-transform-load (ETL) systems. Only the new database can be accessed
by queries. ETL systems are used to build data warehouses (DWs), where ETL runs
are performed periodically. The main purpose of DWs is ad-hoc data analysis and
mining;
• Data can remain at their sources. In this case we take advantage of a virtual,
non-materialized database, mapped on the sources that are actually accessed
by queries. This approach always returns a fresh and up-to-date answer
to the query;
• Data Exchange. The integration system dispatches a reformulated query to
a source database and the result is sent to a target database that replies to
the query. The target instance is materialized using the source instance (e.g.
IBM Clio).
Obviously, when queries must not return out-of-date data, ETL and
data warehouses are no longer the right solution: in this case, virtual data integration
must be used. In general, queries are decomposed and evaluated by local databases
according to an efficient strategy.
1.2 The Steps of Data Integration
Data integration problems arise even in the case of a single, centralized database.
Given a data model, data need to be organized according to the chosen model to
avoid inconsistencies and allow query optimization. Each datum must appear only
once in order to eliminate redundancies and wasted storage.
In the case of a single database, integration can be achieved by means of view
integration and the mixed strategy:
• All the involved departments (subsystems) must be identified;
• Design of the skeleton schema and view conceptual design for each department;
• View integration and restructuring to produce the global conceptual schema;
• Conceptual to logical translation (from E-R to relational) of the global schema
and subschemas;
• Reconciliation of the global schema with the single schemata. The final views
are produced.
In the case of distributed databases, the procedure is very similar and the design pattern
is the same as in the centralized situation, because we can still count on homogeneous
technology and a single data model.
On the other hand, the presence of various data sources complicates matters.
The integration system must provide a uniform view to the user and know how to
query the sources. The autonomy of design (which data), communication (which
services) and execution (which algorithms) causes heterogeneity: we might deal with
different platforms, data models, query languages, data schemas and different values
for the same information.
1.2.1 Schema Reconciliation
The problem consists in finding matching schemas in different data sources. Sometimes
this step is not needed because the data do not have a schema (e.g. sensor data).
If sources have a schema, we start by translating each logical schema to produce a
conceptual representation (E-R). On top of that, conflicts are resolved and the global
conceptual schema is produced. It is then translated into a logical representation
and reconciled with the single starting schemata by means of proper view definitions.
A mapping between the global schema and the original data sources is produced.
Conflicts in the data sources can be of various types: name conflicts
(synonyms/homonyms), type conflicts (at attribute/entity level), data semantics
conflicts, structure conflicts, cardinality conflicts and key conflicts (different primary
keys for the same entity).
We can distinguish between homogeneous and heterogeneous data integration. There
exist different kinds of heterogeneity: in particular, the data sources may have
different data models or different query languages, or they may deal with semistructured
or unstructured data. The global schema will provide a reconciled, integrated and
virtual view of the data sources.
The system that we want to build must support accesses to different data sources
and know their contents. The data sources are integrated by means of a global
schema, whose language is used to query the system. The key point is that queries
are rewritten to be understandable by the sources and the replies are combined to
produce a final answer.
1.2.2 GAV and LAV
A data integration system is basically a triple (G, S, M) where queries are posed in
terms of the global schema G and mapped on the source schemata S thanks to a set
of mappings M. There are two basic approaches to defining these mappings: GAV and
LAV.
GAV means Global As View and expresses the global schema in terms of the data
source schemata. A GAV mapping is a set of assertions that specify, for each element
g of the global schema, a query q_S over the data sources. This approach is effective
only if sources are stable, because every time a new source is introduced the views
must be modified.
The views are produced using not only the union, but also other complex
operators such as outerjoin, outerunion and generalization.
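As a minimal sketch of a GAV assertion, the global element can be written as a SQL view over the sources, so that a query on the global schema is answered by unfolding the view. The two source tables, their columns and their rows below are hypothetical and only illustrate the idea; the view uses a plain union.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Two hypothetical source relations with overlapping content.
    CREATE TABLE s1_movie (title TEXT, director TEXT);
    CREATE TABLE s2_film  (name TEXT, directed_by TEXT);
    INSERT INTO s1_movie VALUES ('Alien', 'Scott'), ('Heat', 'Mann');
    INSERT INTO s2_film  VALUES ('Heat', 'Mann'), ('Ran', 'Kurosawa');

    -- GAV assertion: the global element "movie" is defined by a query on the sources.
    CREATE VIEW movie AS
        SELECT title, director FROM s1_movie
        UNION
        SELECT name, directed_by FROM s2_film;
""")

# A query posed on the global schema; the DBMS unfolds the view over the sources.
rows = conn.execute("SELECT title FROM movie ORDER BY title").fetchall()
titles = [r[0] for r in rows]
print(titles)
```

Adding a third source would require rewriting the definition of the view, which is the maintenance cost of GAV when sources are not stable.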
LAV means Local As View. The global schema can be designed independently from
those of the data sources, and the mappings are obtained by defining each data
source as a view over the global schema. This approach works well if sources are
transient, but query processing is more complex than in GAV.
Usually, sources are assumed to be incomplete with respect to the global schema.
A mapping (view) is sound when it provides a subset of the data available in the data
source it refers to. A mapping is complete if it provides a superset of the available
data in the corresponding data source. A mapping is exact if it is both sound and
complete.
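Read literally, the three definitions are set comparisons between what the mapping provides and what is available. The sketch below only restates them on Python sets; the function name and the sample tuples are hypothetical.

```python
def classify_mapping(provided: set, available: set) -> str:
    """Classify a mapping by comparing the data it provides with the data available."""
    if provided == available:
        return "exact"      # both sound and complete
    if provided <= available:
        return "sound"      # provides a subset of the available data
    if provided >= available:
        return "complete"   # provides a superset of the available data
    return "neither"


available = {("Anna", "Milano"), ("Luca", "Roma")}
print(classify_mapping({("Anna", "Milano")}, available))
print(classify_mapping(available | {("Sara", "Torino")}, available))
print(classify_mapping(set(available), available))
```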
GAV and LAV do not always provide exact mappings. In detail, GAV with
integrity constraints can produce exact or sound mappings, while the LAV approach
can produce exact or complete ones.
There is also a third approach to produce mappings: GLAV. The relationship
(mapping) between sources and global schema is obtained by defining a set of views, some
over the global schema and some over the data sources.
1.2.3 Record Linkage and Data Fusion
Sometimes inconsistencies in the data may occur. Record linkage is the process of
finding records that refer to the same entity across different
data sources. Indeed, the same real-world object can be present in more than one
data source with different values for some attributes. Data fusion is the process of
integrating multiple data sources to produce more consistent, accurate and useful
data.
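The two steps can be sketched as follows, assuming a simple string-similarity rule for linkage and a "first non-missing value wins" rule for fusion. Both rules, the 0.8 threshold and the sample records are illustrative assumptions, not the method prescribed by the course.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link(records_a, records_b, threshold=0.8):
    """Record linkage: pair records whose names are similar enough."""
    return [(a, b) for a in records_a for b in records_b
            if similarity(a["name"], b["name"]) >= threshold]


def fuse(a, b):
    """Data fusion: keep the value from a unless it is missing, then fall back to b."""
    return {k: a.get(k) if a.get(k) is not None else b.get(k)
            for k in a.keys() | b.keys()}


source_a = [{"name": "Politecnico di Milano", "city": "Milano", "founded": None}]
source_b = [{"name": "Politecnico Milano", "city": "Milan", "founded": 1863},
            {"name": "Universita di Bologna", "city": "Bologna", "founded": 1088}]

matches = link(source_a, source_b)
fused = [fuse(a, b) for a, b in matches]
print(fused)
```

The conflicting values for the city are resolved here by trusting the first source; a real fusion policy would choose per attribute, for example by source reliability or recency.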
1.2.4 Heterogeneous Data Sources
Data sources can be characterized by different data models. A new element appears
in this scenario: the wrapper. It is in charge of converting queries into queries that are
understandable by the data sources and translating results into a format understandable
by the application. Wrappers deal with both structured and semistructured
data. Since building ad-hoc wrappers is very expensive, it is better to generate them
automatically.
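A hand-written sketch of two wrappers may help to fix the idea: each one translates the same global selection (attribute = value) into the access method of its source and returns results in a common format. The class names, the column mappings and the sample data are hypothetical.

```python
import csv
import io
import sqlite3


class SqlWrapper:
    """Wraps a relational source; column_map maps global names to source columns."""

    def __init__(self, conn, table, column_map):
        self.conn, self.table, self.column_map = conn, table, column_map

    def query(self, attribute, value):
        # Translate the global selection into SQL over the source's own column names.
        cols = ", ".join(self.column_map.values())
        sql = f"SELECT {cols} FROM {self.table} WHERE {self.column_map[attribute]} = ?"
        rows = self.conn.execute(sql, (value,)).fetchall()
        # Translate the result back: one dict per row, keyed by global names.
        return [dict(zip(self.column_map.keys(), row)) for row in rows]


class CsvWrapper:
    """Wraps a CSV source held in memory as text."""

    def __init__(self, text, column_map):
        self.rows = list(csv.DictReader(io.StringIO(text)))
        self.column_map = column_map

    def query(self, attribute, value):
        col = self.column_map[attribute]
        return [{g: row[c] for g, c in self.column_map.items()}
                for row in self.rows if row[col] == value]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (ename TEXT, town TEXT)")
conn.execute("INSERT INTO emp VALUES ('Anna', 'Milano'), ('Luca', 'Roma')")

wrappers = [
    SqlWrapper(conn, "emp", {"name": "ename", "city": "town"}),
    CsvWrapper("full_name,location\nMarco,Milano\nSara,Torino\n",
               {"name": "full_name", "city": "location"}),
]

# The same global query is sent to every wrapper and the answers are combined.
answer = [rec for w in wrappers for rec in w.query("city", "Milano")]
print(answer)
```

Each wrapper here is written by hand for one source, which is exactly the cost that automatic wrapper generation tries to avoid.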
The design steps of data integration become:
• reverse engineering and production of the conceptual schema;
• conceptual schemata integration;
• choice of the target logical data model and translation of the global conceptual
schema;
• definition of the language translation (wrapping);
• definition of the data views.
1.3 Semistructured Data
Semistructured data have some form of structure, but it is not as prescriptive,
regular and complete as in traditional DBMSs. Examples are XML, JSON and data
coming from the integration of heterogeneous data sources. They are all different and
cannot be easily integrated.
We would like to integrate, compare and query data with different structures,
including semistructured data, as if they were all structured; therefore an overall
representation is built progressively as new information sources are discovered.
1.3.1 Mediators
In order to query semistructured data, a mediator-based approach is used: in fact,
the user is not fully aware of the complete structure of data. Mediators play a
key role: they are data structures used to accommodate data coming from different
sources. They include knowledge structures to turn data into information, the required
intermediate data storage and the processing needed to make interfaces work.
The first system that took advantage of mediators was Tsimmis. Each data
source has a wrapper for query translation, and multiple mediators are in charge of
understanding to which data source the query should be forwarded. They are aware
of the semantics of the application domain.
Mediators progressively build a representation of the sources by means of queries
and learn their content. Tsimmis used a dataguide: an a posteriori schema that is
progressively built by exploring the data sources. There is no need to solve conflicts
at design time because they are solved on-line. Nevertheless, if data sources change,
the wrapper has to be modified: this is why we would like to automatically generate wrappers.
The mediator is defined as the orchestrator of the integration: it knows domain
metadata, which convey data semantics. Lorel is a query language for semistructured
data, used especially for OEM (Object Exchange Model) databases. In detail,
OEM is a self-describing data model where information in the form of labels is
intermixed with data. It can be thought of as a graph in which a Lorel query follows
a specific path. OEM is used by Tsimmis' mediators to represent data within the overall system.
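To make the graph view concrete, the sketch below encodes an OEM-like object as a list of (label, value) pairs, where a value is either atomic or another list of pairs, and evaluates a dotted path over it. This only mimics the idea of a path query; it is not Lorel syntax, and the labels and data are hypothetical.

```python
# Self-describing data: every value is reached through a label.
# The two "restaurant" objects do not share the same structure.
db = [
    ("restaurant", [("name", "Trattoria Uno"), ("zipcode", 20133)]),
    ("restaurant", [("name", "Osteria Due"),
                    ("address", [("street", "Via Roma"), ("city", "Milano")])]),
]


def follow(root, path):
    """Return every value reachable from root by following the labels in path."""
    current = [root]
    for label in path.split("."):
        current = [child
                   for node in current if isinstance(node, list)
                   for (lbl, child) in node if lbl == label]
    return current


print(follow(db, "restaurant.name"))
print(follow(db, "restaurant.address.city"))
```

Objects that lack a label along the path simply contribute nothing to the answer, which is how irregular structure is tolerated without a fixed schema.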
1.3.2 Metamodel Data Integration
When we deal with different data models, we can notice that constructs in the various
models are similar. A metamodel can use basic constructs to create constructs of
the model we need to represent.
Basically, a metamodel is an abstract model for the specification of concrete
models. There exist two types of metamodels:
• General abstract entities can be specialized and become objects in the target
model;
• Constructs are used to build the objects of the target model.
Metamodels make it possible to translate different models into a single formalism and
provide automatic translation into and from this common formalism. They have high
expressive power. Metamodel-based data integration translates queries expressed
in the metamodel formalism into the language of each data source.
The application context for metamodels consists of a large number of heterogeneous
and transient data sources, with time-varying data.
1.4 Ontologies
Ontologies are a way to solve the problem of automatic semantic matching. They
are formal specifications of conceptualizations of a shared knowledge domain. In
other words, an ontology is a controlled vocabulary that describes objects and the
relationships between them in a formal way. It has a grammar to express something
meaningful within a specified domain of interest.
A mediator can query an appropriate ontology instead of being built with hard-
coded knowledge of the domain. The knowledge is built independently from the
application program.
There are two types of ontologies:
• Taxonomic ontologies define concepts through terms, their hierarchical
organization and additional relationships, such as synonymy. They provide a
reference vocabulary;
• Descriptive ontologies define concepts through data structures and
interrelationships. They do not only deal with hierarchies, but also specify relationships
among components.
The formal definition of an ontology is a tuple (C, R, I, A), where C stands for concepts, R for
relationships among them, I for instances (objects belonging to classes) and, finally,
A stands for a set of axioms. In more detail, an ontology consists of:
• concepts: generic concepts express general world categories. Specific concepts
describe a particular application domain;
• concept definition: both formal and natural languages can be used;
• relationships among concepts: synonyms, homonyms, taxonomies (is_a), meronymies
(part_of).
An ontology is part of a knowledge base composed of a T-Box, which contains all
the concept and role definitions as well as the axioms of our logical theory (e.g. "a
father is a man with a child"), and an A-Box, which contains all the basic assertions
of the logical theory (e.g. "Tom is a father").
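A toy sketch of the T-Box/A-Box split, under a strong simplification: the T-Box is reduced to an acyclic is_a hierarchy between atomic concepts, so the "with a child" part of the father definition is approximated by a hypothetical Parent concept. Real description-logic reasoners handle roles and much richer axioms; the concept names besides Father and Man, and the helper functions, are assumptions made for illustration.

```python
# T-Box: terminological knowledge, here only direct is_a links between concepts.
tbox = {
    "Father": {"Man", "Parent"},
    "Man": {"Person"},
    "Parent": {"Person"},
}

# A-Box: assertions about individuals.
abox = {"Tom": {"Father"}}


def subsumed_by(concept, ancestor, tbox):
    """True if concept is a (possibly indirect) subconcept of ancestor."""
    if concept == ancestor:
        return True
    return any(subsumed_by(parent, ancestor, tbox)
               for parent in tbox.get(concept, ()))


def instance_of(individual, concept, tbox, abox):
    """True if some asserted concept of the individual is subsumed by concept."""
    return any(subsumed_by(c, concept, tbox) for c in abox.get(individual, ()))


print(subsumed_by("Father", "Person", tbox))
print(instance_of("Tom", "Man", tbox, abox))
```

The assertion that Tom is a man is never stored: it follows from the A-Box fact "Tom is a father" combined with the T-Box hierarchy.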
Services for T-Box are:
• Subsumption: verifies if a concept is a subconcept of another one;
• Consistency: verifies if there exists at least one interpretation that satisfies
the given T-Box;
• verifies, for a given C, that exists a