Technologies For Information Systems - Complete Notes

Complete notes about all the topics covered in the course of Technologies For Information Systems. This is a must-have if you want to excel at the theory part of the exam. Excellent results are …

Esame Technologies For Information Systems

Facoltà Ingegneria dell'informazione

Dal corso del Prof. Tanca Letizia

Università Politecnico di Milano

Publisher hardware994

A.A. 2017-2018

38 pagine

Appunto

Vota 5,0 / 5 (1)

Scarica

Estratto del documento

MOLAP:

ical" data cube). It is used when extra storage space is available on the server

and the best query performance is desired;

• it uses relational data model to represent multidimensional data. It

ROLAP:

is used when there is limited space on the server and the query performance

is not so important;

• it is a combination of the two. It does not necessarily create a copy

HOLAP:

of the source data, but data aggregations are stored in a multidimensional

structure on the server. This provides space saving and faster query processing.

OLAP provides the following operation to manipulate multidimensional data:

• aggregates data at a higher level;

Roll-up:

• de-aggregates data at a lower level;

Drill-down:

• applies selections and projections which reduce data dimen-

Slice and dice:

sionality; 17

• selects two dimensions to re-aggregate data and performs a re-

Pivoting:

orientation of the cube;

• sorts data according to predefined criteria;

Ranking:

• All the traditional OLTP operations are supported (select, join, project...).

4.2 Data Warehouse Design

When the design process of a data warehouse starts, relevant data sources must be

selected. DWs are based on a multidimensional model and E-R models cannot be

used in the DW conceptual design, this is why one or more fact schemata are derived

from them. They are characterized by fact name, measures and dimensions. The

dimensions of the same fact must have distinct names.

A must assume discrete values and can be organized into

dimensional attribute

a hierarchy. In details, a is a directional tree whose nodes are

dimensional hierarchy

dimensional attributes, the edges describe n:1 associations between pairs of dimen-

sional attributes and the root is the considered dimension. Moreover, two dimen-

sional attributes can be connected by more than two distinct direct edges. Hierarchy

is used not to duplicate portions of hierarchies. In this case, we have to give

sharing

a different name to each arc.

It is possible to identify three different categories of flow measures are

measures:

related to a time period (eg: number of sales per day), level measures are evaluated in

the particular time instant (eg: number of products in stock) and unitary measures

are level measures, but are also relative measures (eg: interest rate, money exchange

rate...).

A is an occurrence of a fact. It is represented by means of a tuple

primary event

of values. A hierarchy describes how it is possible to group and select primary events

and its root represents the finest aggregation granularity.

A contains additional information about a dimensional at-

descriptive attribute

tribute. It is not used to aggregate data.

A is a dimensional or descriptive attribute whose

cross-dimensional attribute

value is obtained by combining values of some dimensional attributes (eg: IVA).

Some attributes or dimensions may be related by a many-to-many relationship.

4.2.1 Conceptual Design

We start from the logical or conceptual schema of the source and apply a top-down

methodology. First of all, we identify the facts and for each of them an attribute

tree is defined, a fact schema is produced and a glossary is written.

A fact can be either an entity or a relationship of the source E-R schema and

it corresponds to an event that dynamically happens in the organization. The fact

becomes the root of a new fact schema. Furthermore, the is composed

attribute tree

by root and nodes and can be obtained through a semi-automatic procedure. After

that, the attribute tree is edited to rule out everything that is irrelevant to us by

means of two techniques: pruning, the subtree rooted in node n is deleted, and

grafting, the children of node n are directly connected to the father of n.

Pay attention when there are cycles in the E-R schema:

Dimensions are chosen among the children of the root and time is always a good

candidate. Numerical attributes that are children of the root are usually measures.

Further measures are defined by applying aggregate functions to numerical attributes

(sum, avg, min, max...). It is possible that a fact has no measures, it is an empty

fact.

In the an expression in associated with each measure and describes how

glossary,

the measure itself is obtained for each primary event at different levels of aggregation

starting from the attributes of the source schema.

Summing up, the steps of the conceptual design are:

1. fact definition;

2. for each fact:

• attribute tree generation;

• editing of the tree;

• dimensions definition;

• measures definition;

• fact schema creation. 19

4.2.2 Logical Design

There exists a great variety of data warehouse schemata. Starting from the concep-

tual schema, we want to obtain the logic schema for a specific data mart.

ROLAP is based on the that is a set of charac-

star schema, dimension tables

terized by a primary key and a set of attributes. Moreover, a fact table imports

all the primary keys of such dimension tables. Their denormalization introduces

redundancy, but guarantees fewer joints. Cross-dimensional attributes require to

create a new dimension table having as keys the associated dimensional attributes.

Shared hierarchies and convergences should use the same dimension table without

duplicating it.

The reduces this denormalization, but also the available mem-

snowflake schema

ory space. The advantage is the enhanced query execution.

An alternative to the star and the snowflake schemata is the constellation fact

multiple fact tables share dimension tables and the schema is viewed as a

schema:

collection of stars. It is used in sophisticated applications.

Since aggregation computation is very expensive, we can use to

materialized views

speed up frequent queries. Primary views correspond to primary aggregation levels

while secondary views are related to secondary events obtained aggregating primary

ones. It is useful to materialize a view when it directly solves a frequent query and

reduces the costs of some of them.

Sometimes it is useful to introduce new measures in order to manage aggrega-

tion correctly. are obtained by applying mathematical operators

Derived measures

to two or more values of the same tuple. Aggregate operators are of various types:

aggregate data starting from partially aggregated data (sum,

distributive operators

max, min), require further information to aggregate data (av-

algebraic operators

erage), cannot aggregate data starting from partially aggregated

holistic operators

ones.

Summing up, the steps of logical modeling are:

1. choice of the logical schema (star/snowflake);

2. conceptual schema translation;

3. choice of the materialized views;

4. optimizations. 20

5 Temporal Databases

Temporal databases are still a research area and take time into account in an unusual

manner. They store data related to time instances by offering temporal data types

and keeping information related to the past, the present and the future. Temporal

databases are in contrast to current databases which store only facts that are true

at the current time.

Possible applications deal with data warehousing, finance, law, medical records,

project scheduling, science...

Applications that manage temporal data would benefit from built-in, knowledge

independent temporal support. Their development would be more efficient and

performance increased.

Temporal databases are very useful when we want to keep the history of some-

thing. There are multiple alternatives to do that: it could be up to the user deter-

mining the history by inspecting data, we can use SQL as much as possible or we

can take advantage of embedded SQL.

Some special constructs are required, for example the Temporal

temporal join.

DBMSs have not reached a satisfactory performance level, therefore remain an open

research problem.

Software changes are part of the software operational life: the application or the

database schema are modified during their lifetime. A modification may be caused

by the change of the reality of interest or an improved perception of the reality

itself. Schema evolution and versioning deal with the need to retain current data

and system software functionality when database structure is changed.

In particular, schema evolution permits modifications of the schema without

loss of extensional data, while schema versioning allows querying all data through

appropriate version-based interfaces.

Basically, what we want from temporal databases is:

• capturing the semantics of time-varying information;

• retaining the simplicity of the relational model;

• presenting information concerning an object in a coherent fashion;

• ensuring ease of implementation and high performance.

5.1 Timestamps

A is a seven-part value (year, month, day, hour, minute, second and

timestamp

microsecond) that designates a date and time. The internal representation is a

string of 10 bytes: 4 for the date, 3 for the time and 3 for microseconds.

Time is already present in commercial DBMSs, but temporal DBMSs are able

to manage time-referenced data: timestamps are associated to database entities, for

example individual or groups of attributes, individual or sets of tuples, objects and

schema items.

As far as the semantics of a timestamp is concerned, database facts have at least

two relevant aspects:

• it is the time when the fact is true in the modeled reality. It

Valid time:

captures the time varying states of the real world. It can be either in the past

or in the future and can be changed frequently. All facts have a valid time and

it may not necessarily be recorded in the DB. If a database models different

world, each fact might have more than one valid time, one for each world.

A valid time table can be updated and supports historical queries.

• it specifies when a fact has been recorded in the database.

Transaction time:

It captures the time varying states of the database. It cannot extend beyond

the current time and cannot be changed. From the transaction time viewpoint,

an entity has a duration: from insertion to deletion, that is only logical because

it is not physically removed from the DB. Transaction time may be associated

not only to real world facts, but also with other DB concepts, such as attribute

values that are updated at a given time.

A transaction time table is appended only: it keeps the whole history of up-

dates. It supports rollback queries.

Valid time is controlled by the user while transaction time is controlled by the

database. We can have four different kinds of tables: snapshot, valid time, transac-

tion time, bitemporal.

are appended only and support both rollback and historical

Bitemporal tables

queries.

5.2 Time

The time structure can be of various types:

• Linear:

Anteprima

Vedrai una selezione di 9 pagine su 38