
THE ARCHITECTURE OF BIG DATA

The architecture of big data:

- Raw data (internal/operational, e.g. from ERP systems; external data).

- Raw data are then put through ETL tools (Extract, Transform and Load) into the warehouse (repository), for several reasons (a short ETL sketch follows this list):

  o In the past, raw data were deleted from time to time: not all past data were kept in digital form, but if you need to perform an analysis it is important to have them.

  o Analytical applications use the CPU of the system: we need separate systems for performing the analysis and for running the operational system.

  o Operational systems can be spread all over the world, with different currencies and different codings: it is important to transform the data to make them homogeneous.

  Warehouses are "write once, read many": data are loaded once and then read in many ways.

- Metadata: documentation of what has been done, of the origin and meaning of the records, and of what kind of transformations have been applied to the data; it is data describing data. Metadata are definitely unsatisfactory (or missing) in many companies, because we never document things when we are always rushing towards a deadline: we should prepare documentation from the beginning, as work-in-progress documentation.

- Data marts: after the data warehouse, we have smaller warehouses owned by single departments, e.g. the marketing department.

- Finally, OLAP: On-Line Analytical Processing.

- Query and reporting:

  o Dashboards: a way of expressing data visually (data-viz), e.g. from green to red according to the level of risk.

  o Alerts (alerting system): graphical hints to warn people.

  o Exploratory analysis.

  o Predictive modelling.

  o Data mining.

  o Optimization: to extract knowledge from data.
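A minimal ETL sketch in Python with pandas, assuming hypothetical source files (erp_orders.csv, web_orders.csv), made-up EUR conversion rates, and a local SQLite file standing in for the warehouse; it only illustrates the extract/transform/load flow described above:

```python
import pandas as pd
import sqlite3

# --- Extract: pull raw data from heterogeneous sources (hypothetical files) ---
erp_orders = pd.read_csv("erp_orders.csv")   # internal/operational source (ERP)
web_orders = pd.read_csv("web_orders.csv")   # external source

# --- Transform: homogenize currencies and codings across operational systems ---
eur_rates = {"EUR": 1.0, "USD": 0.92, "GBP": 1.17}   # made-up example rates
orders = pd.concat([erp_orders, web_orders], ignore_index=True)
orders["amount_eur"] = orders["amount"] * orders["currency"].map(eur_rates)
orders["country"] = orders["country"].str.upper()    # unify coding conventions

# --- Load: write once into the warehouse; analyses will read it many times ---
with sqlite3.connect("warehouse.db") as conn:
    orders.to_sql("fact_orders", conn, if_exists="append", index=False)
```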

CUBES AND MULTIDIMENSIONAL ANALYSIS

DWHs are based on relational databases; the design of data warehouses and data marts follows a multidimensional paradigm for data representation, based on a 'star schema'. We can distinguish two types of tables (a sketch follows this list):

- Dimension table/Master table: every dimension table has a main index, the key. Dimension tables are internally structured according to hierarchical relationships, e.g. the customer code, the point-of-sale code, etc., or time: day, week, year, etc. The main key is related to a field of the fact table, which has a higher level of detail.

- Fact table: contains references to the dimension tables and also measures, numbers, etc. Fact tables usually refer to transactions and contain two types of data: links to dimension tables and numerical values.

Time is a key aspect of a DWH, since it is crucial to it: there is always a time dimension. The schema of several fact tables interconnected with dimension tables is called the 'galaxy schema', since it is a set of connected stars.

Star schema → Snowflake schema → Galaxy schema
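A minimal star-schema sketch in pandas, with made-up product and time dimension tables and a fact table of sales keyed on them:

```python
import pandas as pd

# Dimension tables: one row per member, a key plus hierarchical attributes
dim_product = pd.DataFrame({
    "product_id": [1, 2],
    "product":    ["espresso", "latte"],
    "category":   ["coffee", "coffee"],
})
dim_time = pd.DataFrame({
    "time_id": [10, 11],
    "day":     ["2018-01-01", "2018-01-02"],
    "month":   ["2018-01", "2018-01"],
})

# Fact table: foreign keys into the dimensions plus numerical measures
fact_sales = pd.DataFrame({
    "product_id": [1, 1, 2],
    "time_id":    [10, 11, 11],
    "units":      [3, 5, 2],
    "revenue":    [4.5, 7.5, 5.0],
})

# A typical query joins the fact table with its dimensions and aggregates
sales = fact_sales.merge(dim_product, on="product_id").merge(dim_time, on="time_id")
print(sales.groupby(["category", "month"])["revenue"].sum())
```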

Data cube

A fact table connected with n dimension tables may be represented by an n-dimensional data cube, where each axis corresponds to a dimension. This representation is a generalization of the Excel pivot table (a multi-dimensional dataset): two dimensions are shown, with the third summarized; it is a natural extension of the two-dimensional spreadsheet, interpreted as a two-dimensional cube.

In the example we have three dimensions:

- the product

- the time

- the geographical region.

At this level of granularity there are 36 elementary data points.
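A sketch of the cube/pivot idea in pandas, on a made-up sales table over the three dimensions above; pivot_table shows two dimensions on the axes and summarizes the third:

```python
import pandas as pd

# Elementary data: one row per (product, time, region) cell of the cube
sales = pd.DataFrame({
    "product": ["A", "A", "B", "B", "A", "B"],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q1"],
    "region":  ["North", "North", "North", "South", "South", "South"],
    "units":   [10, 12, 7, 9, 4, 6],
})

# Two dimensions on the axes, the third (region) summarized by the aggregation
cube_view = sales.pivot_table(index="product", columns="quarter",
                              values="units", aggfunc="sum")
print(cube_view)
```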

Hierarchy of concepts/dimensions: some dimensions, like time, are hierarchical. E.g. for time we have days and weeks, but months cannot sit on top of weeks because a month is not exactly 28 days; for space: street, town, etc.

Actions to perform on data cubes (a sketch follows this list):

- Drill down or Roll down: an operation that leads to more detailed information, obtained by:

  o shifting down to a lower level, e.g. from province to city;

  o adding one dimension, e.g. time.

- Drill up or Roll up: an operation consisting of an aggregation of the data, obtained by:

  o proceeding upward to a higher level, e.g. from city to province;

  o removing one dimension, e.g. the time dimension.
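A sketch of roll-up and drill-down with pandas groupby, assuming a made-up table with a city → province hierarchy:

```python
import pandas as pd

sales = pd.DataFrame({
    "province": ["MI", "MI", "MI", "TO"],
    "city":     ["Milano", "Milano", "Legnano", "Torino"],
    "quarter":  ["Q1", "Q2", "Q1", "Q1"],
    "units":    [10, 12, 7, 9],
})

# Roll up: aggregate upward in the hierarchy (city -> province), dropping time
rolled_up = sales.groupby("province")["units"].sum()

# Drill down: shift to the lower level (city) and add back the time dimension
drilled_down = sales.groupby(["province", "city", "quarter"])["units"].sum()
print(rolled_up, drilled_down, sep="\n\n")
```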

Other operations to perform on data cubes (a sketch follows this list):

- Slicing: take a slice of the representation, e.g. by region; basically, the value of one dimension is fixed.

- Dicing: more than one dimension is fixed at the same time.

- Pivoting or rotation: the cube is rotated, exchanging the dimensions shown on the axes.
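A sketch of slicing and dicing on the same kind of made-up table; fixing the value of one dimension is a slice, fixing more than one is a dice:

```python
import pandas as pd

sales = pd.DataFrame({
    "product": ["A", "A", "B", "B"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "region":  ["North", "South", "North", "South"],
    "units":   [10, 12, 7, 9],
})

# Slice: fix the value of a single dimension (region)
slice_north = sales[sales["region"] == "North"]

# Dice: fix more than one dimension at the same time (region and quarter)
dice = sales[(sales["region"] == "North") & (sales["quarter"] == "Q1")]
print(slice_north, dice, sep="\n\n")
```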

MACHINE LEARNING AND DATA MINING

The inductive approach to artificial intelligence and machine learning is based on discovering rules and hidden information from past data and using them in a predictive way.

Predictive analytics means exploring data collected in the past (called examples, observations, etc.) to extract hidden patterns and correlations that are useful to decision makers and not obvious to humans. The aims of this machine learning approach and analysis are basically two:

- To better understand the root causes of the investigated system: improve the understanding of the system under investigation (see the sketch after this list).

  Example: a churner is a customer who is not loyal to the company and switches to a competitor, for example someone who moves from Vodafone to TIM. For each customer we collect a large set of information (socio-demographic data, age, region, etc., often more than a thousand variables, such as calling and receiving behaviour). We obtain a two-dimensional data set in which the rows are the customers and the columns are the information we know about them; the last column is the target variable: it tells whether the customer is still loyal or is a churner. We want to find the hidden relationship between the last column and the previous ones: we want to point out, through an algorithm, which variables are the most important to explain the target variable. Some of the variables are actionable by marketing people.

  Example: medical data with the target variable corresponding to heart attack. You want to know what the root causes are, e.g. medical team, age, etc.

- To derive accurate predictions and optimize future actions: this is a more pragmatic objective. There is a place called the 'data lake' where data are collected and stored. Data are then analysed by the algorithms to find hidden patterns. We have different tools to perform this kind of analysis. Moreover, we can hold back some of the past data to verify the quality of the predictions; then we predict the real future.
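A minimal sketch of both aims with scikit-learn, assuming a hypothetical churn dataset churn.csv (numeric columns, binary target column "churner"): a held-out part of the past data checks prediction quality, and feature importances hint at root causes.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical two-dimensional dataset: rows = customers, last column = target
data = pd.read_csv("churn.csv")
X, y = data.drop(columns="churner"), data["churner"]

# Hold back some past data to verify the quality of the predictions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Which variables best explain the target? (root-cause / interpretation aim)
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```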

The machine learning process

1- The first step is to define the objective of the project: why are we performing the analysis? For example, we want to reduce the churn rate. There are mainly three roles:

  o The data analyst.

  o The process owner, i.e. the domain expert.

  o The IT people: the people in the organization in charge of the data infrastructure, e.g. they know where the data are stored.

  In the objective-definition stage, all of them are involved.

2- The second step is the creation of a dedicated database: it costs a lot of effort and can be very time consuming, since collecting the data is not simple in some situations.

3- Exploratory data analysis is then performed. In the company's systems there might be incorrect data, which must be detected at this stage (it is important to isolate outliers so that they do not distort the averages; every data set has outliers, which must be discovered and explained by experts and analysts). If something goes wrong, you have to go back to the data mart. Once you have a good data mart and have re-performed the exploratory analysis, you move one step ahead.

4- The attribute (= columns, information) selection stage is the following step: you have to decide that some attributes are useless (non-informative columns) and throw them away. At the same time, we want to create new attributes from existing variables that may be useful, e.g. trends. This is possible with two approaches (a dimensionality-reduction sketch follows):

  a. Common sense coming from knowledge of the process; new attributes can be based on intuition (e.g. a ratio that helps us understand whether people are now calling more than before).

  b. Dimensionality reduction techniques, which can be linear or not. We are filtering the data to create new attributes and to delete the useless ones.

Exploratory analysis and attribute selection are very much human driven.
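A sketch of linear dimensionality reduction with scikit-learn's PCA, on made-up data standing in for the customer table; the components are new attributes built as linear combinations of the existing ones:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))   # stand-in for 200 customers x 30 attributes

# Standardize, then project onto the 5 directions of largest variance
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X_scaled)   # 5 new attributes replace the 30
print(X_reduced.shape, pca.explained_variance_ratio_)
```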

5- At this point we have the final data set, simplified and corrected. Now we have to choose the best model: it is a largely automatic activity in which we run thousands of experiments (different algorithms with different parameters; there are different techniques). We cannot tell in advance, given a dataset, which will be the best algorithm or model: it is very much empirical, and we have to find out experimentally which algorithms have the best performance. It is a realistic approach for choosing the best algorithm (a comparison sketch follows).
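A sketch of this empirical comparison with scikit-learn, assuming X and y from the earlier churn example; each candidate algorithm is scored by cross-validation and the winner is whichever scores best:

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree":       DecisionTreeClassifier(random_state=0),
    "random forest":       RandomForestClassifier(random_state=0),
}

# Score every candidate on the same folds; we cannot tell the winner in advance
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "-> best:", best)
```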

6- The last step is to discuss the results with the users.

Machine learning processes are based on interpretation and prediction. Machine learning models and methods:

- Select a class of models;

- Select an evaluation metric;

- Design algorithms and identify the model.

Applications of predictive analytics are several; the more traditional ones are:

- Profile prospects and customers

- Acquire new customers

- Cross-selling and up-selling

- Retention

- Market basket analysis

- Credit scoring

- Fraud detection

- Risk management

- Demand forecast

- Preventive maintenance

- Biolife (molecular biology)

- Medical diagnosis

While the more innovative ones are:

- Image recognition

- Web mining

- Social media analytics

Dataset

Datasets are usually represented as two-dimensional tables:

- Rows: the observations.

- Columns: variables, attributes, information about the rows; the data that characterize each observation.

Attribute types:

o Categorical, which assume a finite number of distinct values:

  § Binary (Boolean) variables: True/False or 0/1.

  § Nominal attributes: without a natural ordering, e.g. province of residence.

  § Ordinal attributes: with a natural ordering, but for which it makes no sense to calculate differences or ratios.

o Numerical, which assume a finite or infinite number of values and lend themselves to arithmetic operations:

  § Discrete attributes (a finite or countable number of values).

  § Continuous attributes (an uncountable infinity of values).

It is not always easy to reduce everything to two dimensions, but it is possible for all datasets (a bag-of-words sketch follows). E.g. a tweet is made of structured information (name, account, location, etc.) and unstructured information (the text): to analyse and represent the text in the two-dimensional structure, the columns are the words that you accept to recognize. E.g. images: considered as a set of pixels, the pixels are the columns and the rows are the pictures; each cell contains the number that expresses the colour of that pixel.
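A sketch of turning unstructured text into the two-dimensional structure with scikit-learn's CountVectorizer, on made-up tweets; the columns are the recognized words and the rows the tweets:

```python
from sklearn.feature_extraction.text import CountVectorizer

tweets = [
    "great coffee in Milano",
    "coffee and data mining",
    "data warehouses store data",
]

vectorizer = CountVectorizer()          # vocabulary = the words we accept
X = vectorizer.fit_transform(tweets)    # rows = tweets, columns = words

print(vectorizer.get_feature_names_out())
print(X.toarray())                      # word counts per tweet
```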

To represent a generic dataset D, we will denote by m the number of observations, or rows, of the two-dimensional table containing the data, and by n the number of attributes, or columns. Furthermore, we will denote the data matrix by

X = [x_ij], with i ∈ M = {1, 2, ..., m} and j ∈ N = {1, 2, ..., n}.
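The same notation in numpy, with a small made-up matrix; m and n are just the shape of X:

```python
import numpy as np

# X = [x_ij]: m observations (rows) by n attributes (columns)
X = np.array([[1.0, 0.5, 3.2],
              [0.8, 1.1, 2.9]])
m, n = X.shape          # m = 2 observations, n = 3 attributes
print(m, n, X[0, 2])    # x_13 in 1-based notation is X[0, 2] in numpy
```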
