Scarica il documento per vederlo tutto.
Scarica il documento per vederlo tutto.
Scarica il documento per vederlo tutto.
Scarica il documento per vederlo tutto.
vuoi
o PayPal
tutte le volte che vuoi
Ex. SUM, MIN, MAX, COUNT
2. Algebraic
If it can be computed from a finite algebraic expression defined over distributive functions.
For example, the functions average (AVG), variance (VAR), and standard deviation (STDEV) are
algebraic aggregate functions.
3. Holistic
If it cannot be computed from other aggregate functions. For example, MEDIAN, MODE, RANK.
Chapter 3- Data warehouse design
The purpose of a data warehouse (DW) is not just to store data but rather to facilitate decision making.
The DW design process has 4 main steps:
1. Requirements Analysis. The goal is to produce a description of the business processes, the typical
information analysis activities with which users are involved, and the measures and dimensions of
interest. Typically, requirements at this stage are documented rather informally.
2. Conceptual Design. The goal is to produce a formal description of the data to be analysed in high-level-
term using a conceptual data model.
We will use the Dimensional Fact Model, DFM, to describe facts, dimensions, dimensional attributes
and attribute hierarchies.
3. Logical Design. The goal is to transform the conceptual design into the logical structures used for storing
the DW in a relational DBMS.
4. Physical Design. The goal is to define the data structures needed for storing the database tables created
by the logical design. The main issues are what indexes and materialized views to define to optimize the
overall performance of the system.
In general, the data to load in the DW are processed with two important and complex kinds of operations:
1. Transform. When the data come from different sources, their formats are revised to align them by
eliminating syntactic and semantic differences.
10
(a) Syntactic transformation. The same data can have both attributes with different names, and the
names are not those to be used in the DW, and different types. For example, a code is defined in some
cases of a type string, and in others a type integer; a gender is defined as (M, F), (m, f), (0, 1) or (male,
female); a value is defined with different units of measurement, and so on.
(b) Semantic transformation. The data in source databases may have been used with a different
meaning. For example, sales can be daily or weekly.
2. Cleaning. The data are analysed in order to eliminate errors of representation or to complete missing
information. For example, in the case of addresses the zip code can be wrong or the name of the town
can be written in different ways (Busto Arsizio also written as BustoArsizio or BArsizio)
The information generated during the design and implementation of a DW is organized and stored as metadata
using appropriate specialized tools or taking advantage of the capabilities of DW systems that provide. In the
case of DW metadata are about other aspects of the data and can be classified into the following main
categories:
– Business metadata. Concern the meaning of the terms used to define the logical structure of data in corporate
terminology. This type of metadata is usually used by users to understand the nature of the data available.
– Structural metadata. Concern the logical structure of facts and dimensions, types of attribute values,
hierarchies, dimensions and meaningful aggregations. This type of metadata is usually used by users to
understand what types of analysis can be performed on the data.
– Technical metadata. Are concerned with the physical data property, such as storage structures, data sources,
date of loading, transformations applied etc. This type of metadata is usually used by technicians for the
maintenance and development of a DW.
– Operational metadata. Concerns the types of predefined analysis reports and what parameters should be
used.
– Design metadata. Concern the results of the DW design phases.
3.2 DATA WAREHOUSE DESIGN APPROACHES
1. DATA-DRIVEN (top-down)
The goal is to design first an enterprise DW based on the data available in
the operational information system, then the data marts are created from
the DW. This is done by analysing a conceptual model of data, if one is
available, or the actual logical record layouts and selecting data elements
deemed to be of interest. This approach is the only possible when the
demand for information from a DW does not exist until the DW is
available.
An initial DW design based on the data available can help both users to discover new ways in which to use the
available data, and the designer to identify areas on which to focus data warehouse development efforts. The
disadvantage of this approach is that without user involvement there is the risk of producing a non-interesting
result.
2. ANALYSIS-DRIVEN (bottom-up)
The goal is to design first the data marts based on the data analysis that the users want to perform, and then
the data marts are integrated to build the DW. The major advantage to this approach is that the focus is on
providing what is really needed, rather than what is available. In general, this approach has a smaller scope than
the data-driven approach. Therefore, it generally produces useful data marts in a shorter time span. The
disadvantage of this approach is the risk that some of the data that the analysis needs are not available.
11
Moreover, if a user is too tightly focused, it is possible to miss useful data that is available in the operational
information system.
Design phases for each data mart of interest:
1. Requirements analysis
2. Initial analysis-driven data mart conceptual design
3. Candidate data-driven data mart conceptual design
4. Final data mart conceptual design
5. Data mart and DW logical design
REQUIREMENTS ANALYSIS
The requirements analysis phase is divided into two main sub-phases, characterized by the different language
used for the preparation of the documents they produce.
The first sub-phase, Requirements gathering, produces a natural language specification of requirements.
(a) Analysis of the problem domain for which the modeling will be done.
(b) Analysis of the business processes to select, with end-user interviews, those more interesting to consider for
designing the data warehouse.
(c) Business questions that end-users issue and try to answer in the course of their information analysis activities.
Examples of frequently encountered categories of business questions are:
– Existence checking analysis, such as “A given product has been sold to a particular customer.”
– Item comparison analysis, such as “Compare the value of purchases of two customers over the last six months,”
or “Compare the number of items sold for a given product category, by store, and by week.”
– Trend analysis, such as “The growth in item sales for a given set of products, over the last 12 months.”
– Queries to analyse ratios, rankings, and clusters, such as “Rank our best customers in terms of dollar sales
over the last year.”
– Statistical analysis, such as “The average item sales by product category, by sales region.”
Note that the business questions must usually be “interpreted” in order to express them in a form more useful
for designing the data warehouse. That is, an interpreted business question should be expressed with the types
of reports to be produced, or phrases that reveal the following information: the constraints on data to analyse,
the requested dimensional attributes and the metrics (aggregation operation) to compute, the coordinates
(dimensions) against which the fact must be analysed, the result sorting criteria and if metrics’ partial value are
needed. When business questions are expressed by means of phrases, the use of the following form is
suggested: “For a data subset to use, the metrics to compute, by dimension, ..., by dimension. How the result
should be presented”. For example, “For the year 2010 in Italy, the total of sale revenue, by region, by month,
by customer name. The result must be sorted by region, month, and customer name, and must include partial
totals for all regions.” It is important also to check that the information analysis requirements need data that
are currently available or can be obtained as external data that exist outside the enterprise. If there are multiple
data sources, the analysis is complicated by the need to reconcile the likely differences in representation of
information. In the following we will not consider this aspect.
The second sub-phase, Requirements specification, produces a description of the requirements for data analysis
outlining the salient features to be modeled then with the conceptual design.
Business Process Requirements: 12
Fact Description:
Dimensions:
Dimensional Attributes:
Dimensional Hierarchies:
Dimensional Attributes Changes: . It is important to understand how
the business wants to deal with the dimension attributes that can change over time. Consequently, for each of
them, besides the name, the type of strategy is specified to deal with them in the logical design phase and data
loading. The strategy to be specified depends on the analysis’ objectives, and for this reason it should be
documented in the requirements specification. We consider four options, of which the first three are considered
for slowly changing dimensions:
Type 1 (overwriting the history) If a dimensional attribute changes its value, only the latest value is
required to be held in the data mart. This means that there is no need to preserve the previous value.
For example, if a customer changes address, the new one replaces the present value of the dimension
Customer. It is the easiest solution, but the history of customer addresses is lost.
Type 2 (preserving the history) If a dimensional attribute changes its value, both the old and the new
value are required to be held in the data mart, but the structure of the dimension must not be changed.
It is the solution commonly used.
Type 3 (preserving one or more versions of history) If a dimensional attribute changes its value, the
structure of the dimension is extended with additional attributes to keep the tracking history with both
old and new values.
Type 4 (fast changing) The dimensional attributes change frequently, and must not be treated with one
of the previous solutions
Measures:
Descriptive attributes of the facts:
Summary of dimensions and Measures:
INITIAL ANALYSIS-DRIVEN DATA MART CONCEPTUAL DESIGN
An initial data mart conceptual design is defined from the analysis that the users perform (the design from what
the users want), without any claim to completeness but useful as a formal description of requirements. In the
conceptual design the dimensional hierarchies are modelled together with their type (balanced, incomplete,
recursive), degenerate dimensions and descriptive attributes of the facts.
13
CANDIDATE DATA-DRIVEN DATA MART CONCEPTUAL DESIGN
A method is described for developing a candidate data mart conceptual design from the operational database
relational schema (the design from what is availa