
Technologies For Information Systems - Complete Notes

Complete notes covering all the topics of the Technologies For Information Systems course at Politecnico di Milano (Polimi).

Exam: Technologies For Information Systems, taught by Prof. L. Tanca.


A mashup component is any piece of data, logic or user interface that can be reused and accessed either locally or remotely. The mashup logic is the internal logic of operation of a mashup component.

The development of such mashups is not trivial. The work of the developer is facilitated by suitable abstractions, component technologies and tools for computer-assisted mashup composition that provide a user-friendly interface.

Types of mashups are the following:

• Data mashups: fetch data from different resources, process them and return an integrated result set. They are a lightweight, web-based form of data integration;

• Logic mashups: integrate functionalities published by logic or data components. The output is a process that orchestrates such components and is, in its turn, a logic component;

• UI mashups: reuse and possibly synchronize the UIs of the involved components, mediating possible data mismatches. The output is a web application the users can interact with. It is appropriate when developing a UI from scratch is too costly;

• Hybrid mashups: span multiple layers of the application stack, bringing together different types of components inside one.

1.6 Future Trends in Data Integration

1.6.1 Uncertainty

Databases are supposed to represent certain data: any tuple in the database is true, any tuple that is not in the database is false (closed world assumption). Uncertain databases are able to deal with incomplete and uncertain information, taking advantage of fuzzy logic: a kind of logic in which the truth values may be any real number between zero and one.

Since data may be uncertain, data integration must deal with that. Mappings might be approximate, for example because of an automatic ontology matching; reconciliation and queries might be approximate as well.

An uncertain database, therefore, describes a set of possible worlds. Each tuple has a probability and the overall probability of a possible world is the product of the tuples' probabilities.
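A minimal sketch of this possible-worlds semantics, assuming the tuple-independence model that the product rule implies (the relation and the probabilities are hypothetical):

```python
from itertools import product

# Tuple-independent uncertain relation: each tuple carries the probability
# that it belongs to the "true" database (hypothetical example data).
uncertain_relation = [
    ("Alice", "Milan", 0.9),
    ("Bob", "Como", 0.6),
]

def possible_worlds(relation):
    """Enumerate all possible worlds with their probabilities.

    A world is a subset of the tuples; under tuple independence its
    probability is the product of p for the included tuples and (1 - p)
    for the excluded ones.
    """
    for choices in product([True, False], repeat=len(relation)):
        world, prob = [], 1.0
        for (name, city, p), included in zip(relation, choices):
            if included:
                world.append((name, city))
                prob *= p
            else:
                prob *= 1.0 - p
        yield world, prob

for world, prob in possible_worlds(uncertain_relation):
    print(world, round(prob, 3))
```

The probabilities of all the enumerated worlds sum to 1, which is what makes queries over uncertain data well defined.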

1.6.2 Data Provenance

Data provenance is also called data lineage or data pedigree. Sometimes, knowing where data come from and how they are produced is critical. Provenance provides information about who created the data, where and how they were produced. It is an index of data quality used to evaluate data reliability.

Provenance can be modeled as annotations describing how each data item was produced; these annotations are associated with tuples or values. Otherwise, provenance modeling is achieved through a graph of data relationships with tuples as vertices.

To conclude, provenance is used for explanations, as accessory information added to the result of a query, for giving a score to the sources, for determining data quality and for evaluating the influence of some sources on one another.
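A minimal sketch of the annotation-based view described above, where each tuple carries the set of source identifiers it was derived from and operations propagate these annotations (the relation names and data are hypothetical):

```python
# Each annotated tuple is a pair (row, lineage), with lineage a set of sources.
sales_eu = [({"product": "p1", "qty": 10}, {"src_eu"})]
sales_us = [({"product": "p1", "qty": 7},  {"src_us"})]

def annotated_union(*relations):
    """Union of annotated relations, keeping each tuple's lineage."""
    return [(row, set(lineage)) for rel in relations for row, lineage in rel]

def total_by_product(relation):
    """Aggregate qty per product; the result's lineage is the union of the
    lineages of all contributing tuples."""
    out = {}
    for row, lineage in relation:
        qty, srcs = out.get(row["product"], (0, set()))
        out[row["product"]] = (qty + row["qty"], srcs | lineage)
    return out

print(total_by_product(annotated_union(sales_eu, sales_us)))
# {'p1': (17, {'src_eu', 'src_us'})}
```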

1.6.3 Crowdsourcing

Some checks are very simple for humans, but very hard for a computer. Crowdsourcing is used to provide a powerful solution to traditionally hard data integration problems by exploiting humans to perform database matching tasks.

1.6.4 Visualizing Integrated Data

It can be very useful to visualize important patterns in the data instead of an endless number of rows. During the integration process, the system can show a subset of the data that has not been correctly reconciled. When browsing different collections of data to be integrated, we want the system to show the search results and evaluate their relevance to the specific integration task. Moreover, data provenance can also be visualized.

1.6.5 Integrating Social Media

Another challenge for data integration is integrating data coming from social media. Such data have a transient nature and identifying their quality could be very difficult. Data often arrive as high-speed streams that require very fast processing.

1.6.6 Cluster and Cloud-Based Solutions

Most query engines, schema matchers, storage systems and query optimizers have been developed to operate on a single server or a few machines. Most algorithms are based on the assumption of a limited-scale underlying machine. Therefore, we need to redesign them to efficiently exploit the power of a large cluster or the virtually infinite resources dynamically allocated by a cloud system.

2 Volume and Velocity

A transaction represents the elementary unit of work of a database server. It can be made of multiple operations.

Classical DBMSs are transactional systems: they provide a mechanism for the definition and execution of transactions. During the execution, the ACID properties must be guaranteed:

• Atomicity: a transaction is atomic, it cannot be split;

• Consistency: the integrity constraints defined on the database must not be violated;

• Isolation: the execution of a transaction does not affect the others;

• Durability: the effect of a committed transaction must be permanent.
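A minimal sketch of atomicity (and of durability via the commit), using Python's bundled sqlite3 module; the account table and the transfer amounts are hypothetical:

```python
import sqlite3

# Either both statements of the transfer are applied, or neither is.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

try:
    with conn:  # transaction: commit on success, rollback on exception
        conn.execute("UPDATE account SET balance = balance - 180 WHERE id = 1")
        (balance,) = conn.execute(
            "SELECT balance FROM account WHERE id = 1").fetchone()
        if balance < 0:  # a consistency check fails mid-transaction
            raise ValueError("insufficient funds")
        conn.execute("UPDATE account SET balance = balance + 180 WHERE id = 2")
except ValueError:
    pass  # the whole transaction has been rolled back

print(conn.execute("SELECT * FROM account ORDER BY id").fetchall())
# [(1, 100), (2, 50)] -> the partial update did not survive
```

The `with conn:` block opens a transaction and either commits it on success or rolls the whole unit of work back on error, so the first UPDATE never becomes visible on its own.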

There also exist non-transactional DBMSs: they are commonly called NoSQL DBMSs. In this case there is no need for all the ACID properties to be satisfied. Moreover, they provide flexible schemas, asynchronous updates, scalability and easier caching, but potential inconsistencies must be solved directly by the user.

2.1 Big Data Storage

Data clouds are on-demand storage services, very reliable and with easy access to a virtually infinite number of resources. Traditional business applications usually use centralized or distributed storage methods. Federated and multi-database methods are used by companies that share their data on the Internet.

On the other hand, Big Data are very well supported by cloud databases, which provide load sharing and data partitioning.

2.2 NoSQL Data Model

There are basically three categories of data models for NoSQL systems:

• Key-value: it is the classical reference model. The key can be single or compound, while the value can be accessed only through the key. Data dictionaries help find items. Scaling on multiple nodes is easier because we can take advantage of sharding (horizontal partitioning). This data model is used, for example, by map-reduce frameworks, in which mappers produce multiple key-value pairs given some input objects and reducers compute a final result acting on the values of the pairs with the same key (a word-count sketch of this pattern follows the list). Examples: Amazon DynamoDB;

• Document-based: it is designed to manage document-oriented information. Documents are addressed in the database via a unique key that is used to retrieve the document itself. Moreover, it offers support for document versioning. Examples: MongoDB, CouchDB;

• Column-family: the key is a compound (row-column-timestamp) and the model is strongly oriented to Big Data, offering maximum scalability and horizontal and vertical partitioning. Columns are indexed within each row by a row key. Column families contain a set of columns that are usually similar; each column has a name, is indexed by a column key and contains a value for each row. Examples: Google BigTable.
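A minimal sketch of the map-reduce pattern mentioned in the key-value item, with word count as the classic example (the input documents are hypothetical and the shuffle is done in memory):

```python
from collections import defaultdict
from itertools import chain

def mapper(document):
    """Emit a (word, 1) key-value pair for every word of the document."""
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    """Group the emitted values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Combine all values that share the same key."""
    return key, sum(values)

documents = ["big data big tables", "big queries"]           # hypothetical input
grouped = shuffle(chain.from_iterable(map(mapper, documents)))
print(dict(reducer(k, v) for k, v in grouped.items()))
# {'big': 3, 'data': 1, 'tables': 1, 'queries': 1}
```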

2.3 The CAP Theorem

The CAP theorem states that it is impossible for a distributed data store to simultaneously guarantee more than two out of the following properties:

• Consistency: every read receives the most recent version of the data. It is a strict subset of the ACID consistency;

• Availability: every request receives a response about whether it was successful or failed;

• Partition tolerance: the system continues to operate despite an arbitrary number of messages dropped or delayed by the network.

These kinds of systems are not suitable for traditional applications such as banking or accounting, but they are very useful for datasets that are rarely updated and for collecting data from sensors.

3 Data Analysis and Exploration

Data analysis is the process of transforming, inspecting, cleaning and modeling data with the goal of obtaining useful information for making decisions. Multiple techniques can be used, in particular:

• Data exploration: a preliminary exploration of the data to better figure out their characteristics;

• Data mining: a particular analysis technique that focuses on modeling and knowledge discovery for predictive purposes;

• Business intelligence: relies heavily on aggregation, focusing on business information.

Data are analyzed mainly for commercial and scientific purposes. Often information is hidden in the data and human analysis would take too long.

3.1 Data Exploration

Key motivations of data exploration are the help it provides in selecting the right tool for preprocessing or analysis and the possibility of recognizing patterns that are not captured by data analysis tools. Basic traditional techniques of data exploration are the following:

• Summary statistics: numbers that summarize data properties, such as mean, standard deviation, frequency and mode. The frequency of an attribute value is the percentage of times the value occurs in the dataset. The mode is the most frequent attribute value. For continuous data the notion of percentile is more useful: the α percentile is the value x_α such that P(X ≤ x_α) = α. Since the mean is very sensitive to outliers, the median gets rid of them to reduce their effect. The range is the difference between the maximum and the minimum values. The variance measures the spread of a set of points (a small computational sketch follows this list);

• Visualization: the conversion of data into a visual or tabular format in order to make the analysis easier. Data objects are translated into graphical elements such as points, lines, shapes and colors. Moreover, objects and attributes can be selected to rule out the aspects we are not interested in. Some visualization techniques are histograms and box plots;

• On-Line Analytical Processing: relational databases put data into tables, while OLAP uses a multidimensional array representation to make data analysis and exploration easier.
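A minimal sketch of the summary statistics above using Python's standard library; the sample values are hypothetical and the percentile follows the "smallest value x with P(X ≤ x) ≥ α" convention (other conventions interpolate):

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]           # hypothetical sample

mean = statistics.mean(data)               # sensitive to outliers
median = statistics.median(data)           # robust alternative to the mean
mode = statistics.mode(data)               # most frequent value
variance = statistics.pvariance(data)      # spread of the points
value_range = max(data) - min(data)        # max - min

def percentile(values, alpha):
    """Smallest value x such that P(X <= x) >= alpha."""
    ordered = sorted(values)
    k = max(0, math.ceil(alpha * len(ordered)) - 1)
    return ordered[k]

print(mean, median, mode, variance, value_range, percentile(data, 0.5))
```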

3.2 Data Mining

Data mining is based on ideas coming from machine learning, statistics and database systems. Methods divide into predictive, in which unknown or future values of variables are predicted, and descriptive, in which human-interpretable patterns describe the given data.

The challenges of data mining are: scalability, dimensionality, data quality, data streaming, data ownership and distribution.

In more detail, data mining methods are the following:

• Classification - predictive. Each record contains a set of attributes, one of which is the class. Given a collection of records (the training set), find a model for the class attribute as a function of the other attributes. In this way, new records are assigned a class as accurately as possible. The accuracy of the model is determined through a test set. Examples: fraud detection, customer attrition, sky cataloging;

• Clustering - descriptive. Given a set of data points, each having a set of attributes, find clusters such that data points in one cluster are similar and data points in different clusters are less similar. The system is in charge of identifying similarities within the data. Examples: market segmentation, document clustering;

• Association rule discovery - descriptive. Given a set of records, each of which contains a number of items from a given collection, the system produces dependency rules which predict the occurrence of an item based on the presence of other items. Examples: marketing and sales promotions, inventory management. Some terminology (a sketch computing the rule metrics follows this list):

– Itemset: a collection of one or more items;

– Support count (σ): frequency of occurrence of an itemset;

– Support: fraction of transactions that contain the same itemset;

– Frequent itemset: an itemset whose support is greater than or equal to a minsup threshold;

– Association rule: an implication of the form X → Y, where X and Y are itemsets (e.g. {Milk, Diaper} → {Beer});

– Rule evaluation metrics:

∗ Support (s): fraction of transactions that contain both X and Y: σ(X, Y)/|T|;

∗ Confidence (c): how often items in Y appear in transactions that contain X: σ(X, Y)/σ(X);

• Sequential pattern discovery - descriptive. There is an explicit concept of time. Given a database of sequences and a user-specified minimum support threshold, we want to find all subsequences whose support is at least minsup. A sequence is defined as an ordered list of elements s = <e1, e2, e3, ...>. Each element ei is a collection of events ei = {i1, i2, ..., ik} and is related to a specific time or location. A sequence <a1, a2, ..., an> is contained in another sequence <b1, b2, ..., bm> with m ≥ n if there exist integers i1 < i2 < ... < in such that a1 ⊆ b_i1, a2 ⊆ b_i2, ..., an ⊆ b_in;

• Regression - predictive. It predicts the value of a given continuous variable based on the values of other variables, assuming a linear or non-linear model of dependency;

• Anomaly detection - predictive. It detects significant deviations from normal behavior, also called anomalies. Examples: credit card fraud detection, network intrusion detection.
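A minimal sketch of the association-rule metrics defined in the list above (support count, support and confidence), computed on a hypothetical set of market-basket transactions:

```python
# Hypothetical transactions; each is the set of items bought together.
transactions = [
    {"Milk", "Diaper", "Beer"},
    {"Milk", "Diaper"},
    {"Milk", "Beer"},
    {"Diaper", "Beer"},
    {"Milk", "Diaper", "Beer", "Bread"},
]

def support_count(itemset):
    """sigma(itemset): number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset):
    """Fraction of transactions containing the itemset."""
    return support_count(itemset) / len(transactions)

def confidence(x, y):
    """How often items in Y appear in transactions that contain X."""
    return support_count(x | y) / support_count(x)

x, y = {"Milk", "Diaper"}, {"Beer"}
print(support(x | y))     # s of {Milk, Diaper} -> {Beer}: 2/5 = 0.4
print(confidence(x, y))   # c: 2/3 ≈ 0.67
```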

4 Data Warehouses

A data warehouse is a single, complete and consistent store of data obtained from a variety of different sources. It is the result of a process for transforming data into information, in order to help people understand how the business is going and make decisions.

A data warehouse is not only a decision support database maintained separately from the operational databases, but also a technique for assembling and managing data from various sources with the purpose of answering business questions.

A data warehouse is a subject-oriented, integrated, time-varying and non-volatile collection of data that is used primarily in organizational decision making.

Standard DB           | Data Warehouse
OLTP technology       | OLAP technology
Mostly updates        | Mostly reads
Small transactions    | Long and complex queries
Current snapshot      | All the data history
Raw data              | Reconciled data
MB - GB               | GB - TB
Thousands of users    | Hundreds of users

Table 1: Comparison between standard DBs and data warehouses.

4.1 OLAP

On-Line Analytical Processing supports sophisticated analysis and computations over different dimensions and hierarchies. It takes advantage of a particular data model, called the data cube. The cube dimensions are the search keys and each dimension can be hierarchical.

OLAP makes use of multidimensional models that are based on facts. They allow describing a set of fact schemata whose components are: facts, measures, dimensions and dimension hierarchies. In more detail:

• A fact is a concept that is relevant for the decisional process;

• A measure is a numerical property of a fact;

• A dimension is a fact property defined with respect to a finite domain. It represents an analysis coordinate for the fact.

Example: the sales cube has three dimensions: markets, time and products.

4.1.1 OLAP Models and Operations

OLAP logical models that are commonly used are:

• MOLAP: stores data by using a multidimensional data structure (a "physical" data cube). It is used when extra storage space is available on the server and the best query performance is desired;

• ROLAP: uses the relational data model to represent multidimensional data. It is used when there is limited space on the server and query performance is not so important;

• HOLAP: a combination of the two. It does not necessarily create a copy of the source data, but data aggregations are stored in a multidimensional structure on the server. This provides space saving and faster query processing.

OLAP provides the following operations to manipulate multidimensional data (a small sketch in terms of data frames follows the list):

• Roll-up: aggregates data at a higher level;

• Drill-down: de-aggregates data at a lower level;

• Slice and dice: applies selections and projections which reduce data dimensionality;

• Pivoting: selects two dimensions to re-aggregate data and performs a re-orientation of the cube;

• Ranking: sorts data according to predefined criteria;

• All the traditional OLTP operations are supported (select, join, project...).
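A minimal sketch of roll-up, slice/dice and pivoting expressed with pandas on a hypothetical sales cube (real OLAP engines evaluate these operations directly on the multidimensional model):

```python
import pandas as pd

sales = pd.DataFrame({
    "market":  ["IT", "IT", "FR", "FR"],
    "product": ["p1", "p2", "p1", "p2"],
    "month":   ["2019-01", "2019-01", "2019-02", "2019-02"],
    "amount":  [100, 150, 80, 120],
})

# Roll-up: aggregate from (market, product, month) up to the market level.
rollup = sales.groupby("market")["amount"].sum()

# Slice: fix one dimension value; dice: select a sub-cube.
slice_it = sales[sales["market"] == "IT"]
dice = sales[(sales["market"] == "IT") & (sales["product"] == "p1")]

# Pivoting: re-orient the cube on two chosen dimensions.
pivot = sales.pivot_table(index="market", columns="product",
                          values="amount", aggfunc="sum")

print(rollup, pivot, sep="\n\n")
```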

4.2 Data Warehouse Design

When the design process of a data warehouse starts, relevant data sources must be selected. DWs are based on a multidimensional model and E-R models cannot be used in the DW conceptual design; this is why one or more fact schemata are derived from them. They are characterized by fact name, measures and dimensions. The dimensions of the same fact must have distinct names.

A dimensional attribute must assume discrete values and can be organized into a hierarchy. In detail, a dimensional hierarchy is a directional tree whose nodes are dimensional attributes, whose edges describe n:1 associations between pairs of dimensional attributes and whose root is the considered dimension. Moreover, two dimensional attributes can be connected by more than two distinct direct edges. Hierarchy sharing is used not to duplicate portions of hierarchies; in this case, we have to give a different name to each arc.

It is possible to identify three different categories of measures: flow measures are related to a time period (e.g. number of sales per day), level measures are evaluated at a particular time instant (e.g. number of products in stock) and unitary measures are level measures that are also relative measures (e.g. interest rate, money exchange rate...).

A primary event is an occurrence of a fact. It is represented by means of a tuple of values. A hierarchy describes how it is possible to group and select primary events and its root represents the finest aggregation granularity.

A descriptive attribute contains additional information about a dimensional attribute. It is not used to aggregate data.

A cross-dimensional attribute is a dimensional or descriptive attribute whose value is obtained by combining the values of some dimensional attributes (e.g. VAT). Some attributes or dimensions may be related by a many-to-many relationship.

4.2.1 Conceptual Design

We start from the logical or conceptual schema of the source and apply a top-down methodology. First of all, we identify the facts; for each of them an attribute tree is defined, a fact schema is produced and a glossary is written.

A fact can be either an entity or a relationship of the source E-R schema and it corresponds to an event that dynamically happens in the organization. The fact becomes the root of a new fact schema. Furthermore, the attribute tree is composed of a root and nodes and can be obtained through a semi-automatic procedure. After that, the attribute tree is edited to rule out everything that is irrelevant to us by means of two techniques: pruning, where the subtree rooted in node n is deleted, and grafting, where the children of node n are directly connected to the father of n.
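A minimal sketch of pruning and grafting on a hypothetical attribute tree represented as an adjacency dictionary:

```python
# Hypothetical attribute tree: node -> list of children.
tree = {
    "sale": ["date", "product"],     # root: the fact
    "date": ["month"],
    "month": ["year"],
    "product": ["category", "supplier"],
    "category": [],
    "year": [],
    "supplier": ["supplier_city"],
    "supplier_city": [],
}

def prune(tree, node):
    """Delete the whole subtree rooted in `node`."""
    for child in tree.pop(node, []):
        prune(tree, child)
    for children in tree.values():
        if node in children:
            children.remove(node)

def graft(tree, node):
    """Remove `node`, connecting its children directly to its father."""
    children = tree.pop(node, [])
    for father_children in tree.values():
        if node in father_children:
            father_children.remove(node)
            father_children.extend(children)

prune(tree, "supplier")   # supplier and supplier_city disappear
graft(tree, "month")      # year becomes a direct child of date
print(tree)
```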

Pay attention when there are cycles in the E-R schema.

Dimensions are chosen among the children of the root, and time is always a good candidate. Numerical attributes that are children of the root are usually measures. Further measures are defined by applying aggregate functions (sum, avg, min, max...) to numerical attributes. It is possible that a fact has no measures: in that case it is an empty fact.

In the glossary, an expression is associated with each measure and describes how the measure itself is obtained, for each primary event, at different levels of aggregation starting from the attributes of the source schema.

Summing up, the steps of the conceptual design are:

1. fact definition;

2. for each fact:

• attribute tree generation;

• editing of the tree;

• dimensions definition;

• measures definition;

• fact schema creation.

4.2.2 Logical Design

There exists a great variety of data warehouse schemata. Starting from the conceptual schema, we want to obtain the logical schema for a specific data mart.

ROLAP is based on the star schema, that is, a set of dimension tables, each characterized by a primary key and a set of attributes, plus a fact table that imports all the primary keys of such dimension tables. Their denormalization introduces redundancy, but guarantees fewer joins. Cross-dimensional attributes require creating a new dimension table having as keys the associated dimensional attributes. Shared hierarchies and convergences should use the same dimension table without duplicating it.
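A minimal star-schema sketch with pandas on hypothetical dimension and fact tables: the fact table carries the dimension keys and the measure, and a typical query joins and aggregates:

```python
import pandas as pd

dim_product = pd.DataFrame({"product_id": [1, 2],
                            "category":   ["food", "drink"]})
dim_store   = pd.DataFrame({"store_id": [10, 20],
                            "city":     ["Milan", "Como"]})
fact_sales  = pd.DataFrame({"product_id": [1, 1, 2],
                            "store_id":   [10, 20, 10],
                            "amount":     [5.0, 3.0, 2.0]})

# Total amount per category and city: a typical star-join query.
result = (fact_sales
          .merge(dim_product, on="product_id")
          .merge(dim_store, on="store_id")
          .groupby(["category", "city"])["amount"].sum())
print(result)
```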

The snowflake schema reduces this denormalization, but also the available memory space. The advantage is the enhanced query execution.

An alternative to the star and snowflake schemata is the fact constellation schema: multiple fact tables share dimension tables and the schema is viewed as a collection of stars. It is used in sophisticated applications.

Since aggregation computation is very expensive, we can use materialized views to speed up frequent queries. Primary views correspond to primary aggregation levels, while secondary views are related to secondary events obtained by aggregating primary ones. It is useful to materialize a view when it directly solves a frequent query and reduces the costs of some of them.

Sometimes it is useful to introduce new measures in order to manage aggregation correctly. Derived measures are obtained by applying mathematical operators to two or more values of the same tuple. Aggregate operators are of various types: distributive operators aggregate data starting from partially aggregated data (sum, max, min), algebraic operators require further information to aggregate data (average), and holistic operators cannot aggregate data starting from partially aggregated ones.
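A minimal sketch of the distinction between distributive and algebraic operators, on two hypothetical partitions of the same measure:

```python
partition_a = [10, 20, 30]
partition_b = [40, 50]

# Distributive (sum): partial results are enough to get the global result.
total = sum([sum(partition_a), sum(partition_b)])                       # 150

# Algebraic (average): partial averages alone are not enough; each
# partition must also provide its count as auxiliary information.
partials = [(sum(p), len(p)) for p in (partition_a, partition_b)]
global_avg = sum(s for s, _ in partials) / sum(n for _, n in partials)  # 30.0

print(total, global_avg)
```

Holistic operators (e.g. the median) cannot be computed from any fixed-size summary of the partitions, which is why they are the hardest to pre-aggregate.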

Summing up, the steps of logical modeling are:

1. choice of the logical schema (star/snowflake);

2. conceptual schema translation;

3. choice of the materialized views;

4. optimizations.

5 Temporal Databases

Temporal databases are still a research area and take time into account in an unusual manner. They store data related to time instants by offering temporal data types and keeping information related to the past, the present and the future. Temporal databases are in contrast with current databases, which store only facts that are true at the current time.

Possible applications deal with data warehousing, finance, law, medical records, project scheduling, science...

Applications that manage temporal data would benefit from built-in, knowledge-independent temporal support. Their development would be more efficient and their performance increased.

Temporal databases are very useful when we want to keep the history of something. There are multiple alternatives to do that: it could be up to the user to determine the history by inspecting data, we can use SQL as much as possible, or we can take advantage of embedded SQL.

Some special constructs are required, for example the temporal join. Temporal DBMSs have not reached a satisfactory performance level, therefore they remain an open research problem.

Software changes are part of the software operational life: the application or the database schema are modified during their lifetime. A modification may be caused by the change of the reality of interest or an improved perception of the reality itself. Schema evolution and versioning deal with the need to retain current data and system software functionality when the database structure is changed.

In particular, schema evolution permits modifications of the schema without loss of extensional data, while schema versioning allows querying all data through appropriate version-based interfaces.

Basically, what we want from temporal databases is:

• capturing the semantics of time-varying information;

• retaining the simplicity of the relational model;

• presenting information concerning an object in a coherent fashion;

• ensuring ease of implementation and high performance.

5.1 Timestamps

A timestamp is a seven-part value (year, month, day, hour, minute, second and microsecond) that designates a date and time. The internal representation is a string of 10 bytes: 4 for the date, 3 for the time and 3 for the microseconds.

Time is already present in commercial DBMSs, but temporal DBMSs are able to manage time-referenced data: timestamps are associated with database entities, for example individual attributes or groups of attributes, individual tuples or sets of tuples, objects and schema items.

As far as the semantics of a timestamp is concerned, database facts have at least two relevant aspects:

• Valid time: the time when the fact is true in the modeled reality. It captures the time-varying states of the real world. It can be either in the past or in the future and can be changed frequently. All facts have a valid time, but it may not necessarily be recorded in the DB. If a database models different worlds, each fact might have more than one valid time, one for each world. A valid time table can be updated and supports historical queries;

• Transaction time: it specifies when a fact has been recorded in the database. It captures the time-varying states of the database. It cannot extend beyond the current time and cannot be changed. From the transaction time viewpoint, an entity has a duration, from insertion to deletion, where the deletion is only logical because the entity is not physically removed from the DB. Transaction time may be associated not only with real-world facts, but also with other DB concepts, such as attribute values that are updated at a given time. A transaction time table is append-only: it keeps the whole history of updates and supports rollback queries.

Valid time is controlled by the user, while transaction time is controlled by the database. We can have four different kinds of tables: snapshot, valid time, transaction time and bitemporal. Bitemporal tables are append-only and support both rollback and historical queries.
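A minimal sketch (hypothetical data) of an append-only transaction-time table and of a rollback ("as of") query over it; an update closes the old row's interval and inserts a new one instead of overwriting:

```python
from datetime import date

FOREVER = date.max
history = [
    # (employee, salary, tt_start, tt_end): transaction-time interval
    # [tt_start, tt_end) during which the row was current in the DB.
    ("Ann", 1000, date(2019, 1, 1), date(2019, 6, 1)),
    ("Ann", 1200, date(2019, 6, 1), FOREVER),   # update = close old + insert
]

def as_of(table, when):
    """Rollback query: the state the database recorded at time `when`."""
    return [(name, salary) for name, salary, start, end in table
            if start <= when < end]

print(as_of(history, date(2019, 3, 1)))   # [('Ann', 1000)]
print(as_of(history, date(2019, 7, 1)))   # [('Ann', 1200)]
```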

5.2 Time

The time structure can be of various types:

• Linear: a total order on instants is given;

• Hypothetical: specifies possible futures as a tree rooted on now and deals with branching time;

• DAG: directed acyclic graph;

• Periodic/Cyclic: used for recurrent processes.

22

Dealing with time boundedness, time can have no bounds or be bounded from the left (it is potentially infinite in the future). Furthermore, time can also have bounds on both ends.

Time is also characterized by a density. With a discrete time density, the time line is isomorphic to the integers and is composed of a sequence of non-decomposable time periods called chronons. With a dense time density, the time line is isomorphic to the rational numbers, while with a continuous time density it is isomorphic to the real numbers.

The concept of "now" separates the past from the future. It is always increasing and unique.

5.2.1 Temporal Data Types

Several temporal data types are used in temporal DBs:

• A time instant is a time point on the time line (a chronon);

• An event is an instantaneous fact occurring at an instant. The event occurrence time is the valid time instant at which the event occurs in the real world;

• An instant set is a set of instants;

• A time period or interval is the set of time instants between two instants (start and end time). It is an oriented duration of time: a positive interval denotes forward motion in time, a negative interval backward motion. An interval does not represent an infinite set because it has a finite number of chronons;

• A temporal element is a finite union of periods (it solves the problem that the union operation is not closed w.r.t. intervals).

5.2.2 Predicates on Intervals

5.2.3 Relational and Aggregate Operators

Relational operators are:

• DURATION(i): returns the number of time points in the interval i (e.g. DURATION([d03, d07]) returns 5);

• i1 UNION i2: returns [MIN(s1, s2), MAX(e1, e2)] if (i1 MERGES i2), otherwise it is undefined;

• i1 INTERSECT i2: returns [MAX(s1, s2), MIN(e1, e2)] if (i1 OVERLAPS i2), otherwise it is undefined.

Aggregate operators are unfold and coalesce. They are basic components of the temporal join.
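A minimal sketch of these relational operators on integer chronons, together with plausible definitions of the OVERLAPS and MERGES predicates from the previous subsection (the predicate definitions are assumptions, since the original predicate list is not reproduced here); an interval is an inclusive (start, end) pair:

```python
def overlaps(i1, i2):
    """True if the two intervals share at least one chronon."""
    return i1[0] <= i2[1] and i2[0] <= i1[1]

def merges(i1, i2):
    """True if the intervals overlap or are adjacent, so their union is an interval."""
    return i1[0] <= i2[1] + 1 and i2[0] <= i1[1] + 1

def duration(i):
    """Number of time points in the interval, e.g. DURATION((3, 7)) == 5."""
    return i[1] - i[0] + 1

def union(i1, i2):
    """[MIN(s1, s2), MAX(e1, e2)] if i1 MERGES i2, otherwise undefined."""
    if merges(i1, i2):
        return (min(i1[0], i2[0]), max(i1[1], i2[1]))
    return None

def intersect(i1, i2):
    """[MAX(s1, s2), MIN(e1, e2)] if i1 OVERLAPS i2, otherwise undefined."""
    if overlaps(i1, i2):
        return (max(i1[0], i2[0]), min(i1[1], i2[1]))
    return None

print(duration((3, 7)), union((3, 7), (6, 10)), intersect((3, 7), (6, 10)))
# 5 (3, 10) (6, 7)
```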

5.2.4 Temporal Difference and Temporal Join

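A minimal textual sketch of a valid-time temporal join (the relations, attributes and the vt_intersect helper are hypothetical): matching tuples must have overlapping validity intervals, and each result tuple is valid over the intersection of the two:

```python
def vt_intersect(i1, i2):
    """Intersection of two inclusive validity intervals, or None if disjoint."""
    start, end = max(i1[0], i2[0]), min(i1[1], i2[1])
    return (start, end) if start <= end else None

employees = [("Ann", "Sales", (1, 8))]       # (name, dept, valid interval)
managers  = [("Sales", "Bob", (5, 12))]      # (dept, manager, valid interval)

temporal_join = [
    (name, dept, mgr, vt_intersect(vi1, vi2))
    for name, dept, vi1 in employees
    for d, mgr, vi2 in managers
    if dept == d and vt_intersect(vi1, vi2) is not None
]
print(temporal_join)   # [('Ann', 'Sales', 'Bob', (5, 8))]
```

Temporal difference can be treated analogously, keeping for each tuple only the portions of its validity interval that are not covered by matching tuples in the other relation.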

6 Data Personalization

We are surrounded by data: the web has enabled people to access a huge amount of information and on-line services, hand-held electronic devices allow information access anywhere and anytime, and massive data generation by devices creates overload but also enables improved services.

Data personalization provides an overall customized, individualized user experience by taking into account the needs, preferences and characteristics of a user or group of users. It provides different views of a collection of items to different users. Personalization can be applied at distinct levels: the presentation level (the way in which the application is presented to the user), the interaction level and the data level.

At the base of every personalization effort there is a user model that defines how each user is represented at the system level. Personalization methods describe how the system adapts to a specific user: they can be based on personal features of the user, on the context and situation, or on the purpose of the personalization. A data personalization method may perform the following operations:

• re-ordering of the items in a collection to be shown to a user;

• setting the focus on the items of interest;

• recommending additional options or suggestions.
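A minimal sketch of the first two operations, using a hypothetical topic-weight user model to score, re-order and focus a small collection of items:

```python
user_profile = {"sport": 0.9, "politics": 0.1, "music": 0.5}   # hypothetical user model

items = [
    {"title": "Derby preview",  "topics": ["sport"]},
    {"title": "Budget debate",  "topics": ["politics"]},
    {"title": "Festival guide", "topics": ["music", "sport"]},
]

def score(item, profile):
    """Relevance of an item as the sum of the user's topic weights."""
    return sum(profile.get(topic, 0.0) for topic in item["topics"])

ranked = sorted(items, key=lambda it: score(it, user_profile), reverse=True)
in_focus = [it["title"] for it in ranked if score(it, user_profile) > 0.8]

print([it["title"] for it in ranked])   # personalized ordering
print(in_focus)                         # items of interest for this user
```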

Personalization is studied by several research communities. Information retrieval approaches focus on finding ways to improve the results of unstructured searches by taking into account user information (e.g. search engines). Data personalization efforts focus on information filtering and ordering, also considering their context. Machine learning and data mining are used to develop recommendation strategies.

