THE ARCHITECTURE OF BIG DATA
The architecture of big data:
- Raw data (internal/operational, e.g. from the ERP; external data).
- Raw data are then put through ETL tools (Extract, Transform and Load) into the warehouse (repository); a minimal ETL sketch follows this list. This step is needed because:
o Raw data used to be deleted from time to time once put in digital form: not all past data were kept, but if you need to perform an analysis it is important to have them.
o Analytical applications use the CPU of the system: we need separate systems for performing analysis and for running the operational system.
o Operational systems can be spread all over the world, so they use different currencies and different codings: it is important to transform data to make them homogeneous.
- Warehouses are "write once, read many".
- Metadata: documentation of what has been done, of the origin and meaning of the records, and of what kind of transformations have been applied to the data. It is definitely unsatisfactory (often non-existent) in many companies, because we never document things since we are always rushing towards the deadline: we should prepare documentation from the beginning, as work-in-progress documentation. Metadata are data describing data.
- Data marts: after the data warehouse, we have small warehouses owned by single departments, e.g. the marketing department.
- Finally, OLAP – On-Line Analytical Processing:
o Query and reporting
o Dashboards: a way of expressing data visually (data-viz), e.g. from green to red according to the level of danger.
o Alerts – alerting systems: graphical hints to warn people.
o Exploratory analysis
o Predictive modelling
o Data mining
o Optimization: to extract knowledge from data
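As a minimal sketch of the ETL step mentioned in the list above, the snippet below extracts records from two hypothetical operational sources, harmonizes their currencies and loads them into a warehouse file. The file names, column names and exchange rate are assumptions for illustration, not part of the original notes.

```python
import pandas as pd

# Extract: read raw operational data from two hypothetical source systems
# (one European, one American, each with its own currency and coding).
orders_eu = pd.read_csv("orders_eu.csv")   # column "amount" in EUR (assumed)
orders_us = pd.read_csv("orders_us.csv")   # column "amount" in USD (assumed)

# Transform: convert everything to a single currency so data are homogeneous.
EUR_PER_USD = 0.92                         # illustrative fixed rate
orders_us["amount"] = orders_us["amount"] * EUR_PER_USD

orders = pd.concat([orders_eu, orders_us], ignore_index=True)
orders["load_date"] = pd.Timestamp.today().normalize()   # keep the history

# Load: write the cleaned records into the warehouse (write once, read many).
orders.to_csv("warehouse/orders.csv", index=False)
```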
CUBES AND MULTIDIMENSIONAL ANALYSIS
DWH are based on relational databases; the design of data warehouses and data marts follows a multidimensional paradigm for data representation, based on a 'star schema'. We can distinguish two types of tables:
- Dimension table / Master table: every dimension table has a main index, the key. Dimension tables are internally structured according to hierarchical relationships, e.g. the customer code, the point-of-sale code, etc., or time: day, week, year, etc. The main key is related to a field of the fact table, which has a higher level of detail.
- Fact table: contains references to the dimension tables and also measures, numbers, etc. Fact tables usually refer to transactions and contain two types of data: links to dimension tables and numerical values.
Time is a key aspect of a DWH: there is always a time dimension.
The schema made of several fact tables interconnected with dimension tables is called the 'galaxy schema', since it is a set of connected stars.
Star schema à Snowflake schema à Galaxy schema
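To make the star schema concrete, here is a minimal sketch in pandas with one fact table and two dimension tables; all table names, keys and values are invented for illustration.

```python
import pandas as pd

# Dimension tables: one row per key, with descriptive (hierarchical) attributes.
dim_product = pd.DataFrame({
    "product_id": [1, 2],
    "product":    ["espresso", "cappuccino"],
    "category":   ["coffee", "coffee"],
})
dim_store = pd.DataFrame({
    "store_id": [10, 20],
    "city":     ["Milan", "Rome"],
    "region":   ["Lombardy", "Lazio"],
})

# Fact table: foreign keys pointing to the dimensions plus numerical measures.
fact_sales = pd.DataFrame({
    "product_id": [1, 1, 2, 2],
    "store_id":   [10, 20, 10, 20],
    "day":        pd.to_datetime(["2024-01-01"] * 4),
    "units":      [5, 3, 7, 2],
    "revenue":    [10.0, 6.0, 21.0, 6.0],
})

# A query joins the fact table with its dimensions (the "star") and aggregates.
sales = (fact_sales
         .merge(dim_product, on="product_id")
         .merge(dim_store, on="store_id"))
print(sales.groupby("region")["revenue"].sum())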
Data cube
A fact table connected with n dimension tables may be represented by an n-dimensional data cube where each axis corresponds to a dimension.
It is a representation that generalizes the Excel pivot table (a multi-dimensional dataset): two dimensions on the axes with the third summarized; it is a natural extension of the two-dimensional spreadsheet, which can be interpreted as a two-dimensional cube.
In the example we have three dimensions:
- the product
- the time
- the geographical region.
The data at this level of granularity are 36 elementary data points.
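As a small sketch of the spreadsheet analogy, a product/time/region cube can be reproduced with a pandas pivot table, with two dimensions on the axes and the third summarized by the aggregation; the values are invented.

```python
import pandas as pd

# Elementary data: one row per (product, month, region) combination.
cube = pd.DataFrame({
    "product": ["beer", "beer", "wine", "wine"],
    "month":   ["Jan", "Feb", "Jan", "Feb"],
    "region":  ["North", "North", "South", "South"],
    "sales":   [100, 120, 80, 90],
})

# Two dimensions on the axes, the third (region) summarized by the sum.
print(pd.pivot_table(cube, values="sales", index="product",
                     columns="month", aggfunc="sum"))
```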
Hierarchy of concepts/dimensions: some dimensions, like time, are hierarchical. E.g. for time we have days and weeks, but weeks cannot be rolled up into months, because a month is not exactly four weeks (28 days); for space: street, town, etc.
Actions to perform on data cubes:
- Drill down or Roll down: operation that leads to more detailed information, obtained by:
o Shifting down to a lower level, e.g. from province to city
o Adding one dimension, e.g. time
- Drill up or Roll up: operation consisting of an aggregation of data, obtained by:
o Proceeding upward to a higher level, e.g. from city to province
o Removing one dimension, e.g. removing the time dimension
Other operations to perform on data cubes:
- Slicing: take a slice of the representation, e.g. by region; basically, the value of one dimension is fixed.
- Dicing: more than one dimension is fixed at the same time.
- Pivoting or Rotation: the cube is rotated, i.e. the dimensions shown on the axes are exchanged.
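A minimal sketch of these operations on a small invented sales cube, using pandas group-bys and filters as stand-ins for roll-up, drill-down, slice and dice.

```python
import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "city":    ["Milan", "Turin", "Rome", "Naples"],
    "product": ["beer", "wine", "beer", "wine"],
    "month":   ["Jan", "Jan", "Feb", "Feb"],
    "amount":  [100, 80, 60, 90],
})

# Drill up / roll up: aggregate from city up to region (fewer, coarser rows).
rollup = sales.groupby(["region", "product"])["amount"].sum()

# Drill down / roll down: go back to the finer city level and add a dimension.
drilldown = sales.groupby(["region", "city", "product", "month"])["amount"].sum()

# Slice: fix the value of one dimension.
slice_north = sales[sales["region"] == "North"]

# Dice: fix more than one dimension at the same time.
dice = sales[(sales["region"] == "North") & (sales["product"] == "beer")]
```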
MACHINE LEARNING AND DATA MINING
The inductive approach to artificial intelligence and machine learning is based on discovering rules and hidden information from past data and using them in a predictive way.
Predictive analytics means exploring data collected in the past (called examples, observations, etc.) to extract hidden patterns and correlations that are useful to decision makers and not obvious to humans. The aims of this machine learning approach and analysis are basically two:
- To better understand root causes: improve understanding of the investigated system.
Example: a churner is a customer who is not loyal to the company and switches to a competitor, for example someone who moves from Vodafone to TIM. For each customer we collect a large set of information – socio-demographic data, age, region, etc. – and, once we know the most important variables (more than one thousand), like calling and receiving behaviour, we have a two-dimensional data set in which rows are customers and columns are the information we know about them; the last column is the target variable: it tells whether the customer is still loyal or is a churner. We want to find a hidden relationship between the last column and the previous columns: we want to point out, through an algorithm, which variables are the most important to explain the target variable. Some of the variables are actionable by marketing people.
Example: medical data with the target variable corresponding to heart attack. You want to know what the root causes are, e.g. medical team, age, etc.
- To derive accurate predictions and optimize future actions: a more pragmatic objective. There is a place called the 'data lake' where data are collected and stored. Data are then analysed by the algorithm to find hidden patterns. We have different tools to perform this kind of analysis. Moreover, we can keep some of the past data aside to verify the quality of the prediction; then we predict the real future.
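A minimal sketch of keeping part of the past data aside to verify prediction quality, using scikit-learn on a hypothetical churn dataset; the file name, column names and choice of classifier are assumptions, and the attributes are assumed to be already numeric.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical churn dataset: one row per customer, last column is the target.
data = pd.read_csv("churn.csv")
X = data.drop(columns=["churner"])
y = data["churner"]

# Keep some of the past data aside (the holdout) to check prediction quality.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))
```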
The machine learning process
1- The first step is to define the objective of the project: why are we performing the analysis? For example, we want to reduce the churn rate. There are mainly three roles:
o Data analyst
o Process owner, the expert
o The IT people: people in the organization in charge of the data infrastructure, e.g. they know where the data are stored, etc.
In the objective definition stage, all of them are involved.
2- The second step is the creation of a dedicated database: it costs a lot of effort and can be very time consuming, since collecting data is not simple in some situations.
3- Exploratory data analysis is then performed. In the company's systems there might be incorrect data, which must be detected at this stage (it is important to isolate outliers in order not to distort the averages; every data set has outliers that must be discovered and explained by experts and analysts, and one common rule of thumb is sketched below). If something goes wrong, you have to go back to the data mart. When a good data mart is reached and the exploratory analysis has been re-performed, you go one step ahead.
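As a sketch of one common way to isolate candidate outliers during exploratory analysis, the snippet below applies the interquartile-range rule to an invented numeric column; real projects would combine this with the experts' judgement described above.

```python
import pandas as pd

def iqr_outliers(series: pd.Series) -> pd.Series:
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)

calls = pd.Series([12, 15, 14, 13, 400, 16, 11])  # 400 is a suspect value
print(calls[iqr_outliers(calls)])                  # the outlier to explain
```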
4- The attribute (= columns, information) selection stage is the following step: you have to decide that some attributes are useless (non-informative columns) and throw them away. At the same time, we want to create new attributes from existing variables that may be useful, e.g. trends. This is possible with two approaches (a small sketch follows this step):
a. Common sense coming from knowledge of the process; new attributes can be based on intuition (e.g. a ratio that helps us understand whether a person is now calling more than before).
b. Dimensionality reduction techniques, which can be linear or not. We are filtering the data to create new attributes and to delete the not useful ones.
Exploratory analysis and selection of attributes are very much human driven.
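A minimal sketch of the two approaches above on an invented customer table: a common-sense ratio attribute (calling more now than before) and a linear dimensionality reduction with PCA; all column names and values are assumptions.

```python
import pandas as pd
from sklearn.decomposition import PCA

customers = pd.DataFrame({
    "calls_last_month":    [30, 5, 60],
    "calls_prev_3_months": [90, 45, 30],
})

# (a) Common-sense attribute: recent vs historical calling behaviour.
# A value above 1 means the customer is now calling more than before.
customers["call_trend"] = (
    customers["calls_last_month"] /
    (customers["calls_prev_3_months"] / 3)
)

# (b) Dimensionality reduction (linear): project the attributes onto fewer components.
reduced = PCA(n_components=2).fit_transform(customers)
```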
5- At this point we have the final data set, simplified and corrected. Now we have to choose the best model: it is a largely automatic activity in which we run thousands of experiments (different algorithms with different parameters; there are different techniques).
We cannot tell in advance, given a dataset, which algorithm or model will be the best. It is very much empirical: we have to try and experiment to find which algorithm has the best performance. It is a realistic approach for choosing the best algorithm; a minimal sketch of such an experiment follows.
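This sketch illustrates the empirical model choice: try a couple of algorithms with several parameter settings via cross-validated grid search and keep the best performer. The stand-in dataset and the two candidate algorithms are assumptions; in practice X and y come from the final data mart.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Stand-in dataset; in a real project X, y come from the final data mart.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Try several algorithms and parameter settings, keep the best performer.
candidates = {
    "tree": GridSearchCV(DecisionTreeClassifier(random_state=0),
                         {"max_depth": [3, 5, 10]}, cv=5),
    "logistic": GridSearchCV(LogisticRegression(max_iter=1000),
                             {"C": [0.1, 1.0, 10.0]}, cv=5),
}
for name, search in candidates.items():
    search.fit(X, y)
    print(name, search.best_params_, round(search.best_score_, 3))
```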
6- The last step is to discuss the results with the users.
Machine learning processes are based on interpretation and prediction. Machine learning
models and methods:
- Select a class of models;
- Select an evaluation metric;
- Design algorithms and identify the model.
Applications of predictive analytics are numerous; the more traditional ones are:
- Profile prospects and customers
- Acquire new customers
- Cross-selling and up-selling
- Retention
- Market basket analysis
- Credit scoring
- Fraud detection
- Risk management
- Demand forecast
- Preventive maintenance
- Biolife – molecular biology
- Medical diagnosis
While the more innovative ones are:
- Image recognition
- Web mining
- Social media analytics
Dataset
Datasets are usually represented as two-dimensional tables:
- Rows: the observations
- Columns: variables, attributes, information about the rows; the data that characterize each observation.
Attribute types:
o Categorical, which assume a finite number of distinct values:
§ Counts: True/False or 0/1 (Boolean or binary variables).
§ Nominal attributes: without a natural ordering, e.g. province of residence.
§ Ordinal attributes: with a natural ordering, but for which it makes no sense to calculate differences or ratios.
o Numerical, which assume a finite or infinite number of values and lend themselves to arithmetic operations:
§ Discrete attributes (finite or countable number of values).
§ Continuous attributes (uncountable infinity of values).
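As a small sketch, the attribute types above might be encoded in a pandas DataFrame as follows; the column names and values are invented for illustration.

```python
import pandas as pd

customers = pd.DataFrame({
    "is_churner": [True, False, True],                    # Boolean / binary
    "province":   pd.Categorical(["MI", "RM", "TO"]),     # nominal (no order)
    "satisfaction": pd.Categorical(
        ["low", "high", "medium"],
        categories=["low", "medium", "high"],
        ordered=True),                                     # ordinal
    "n_calls":    [12, 40, 7],                             # discrete numerical
    "avg_duration_min": [3.2, 5.7, 1.1],                   # continuous numerical
})
print(customers.dtypes)
```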
It is not always easy to reduce data to two dimensions, but it is possible for all datasets. E.g. a tweet is made of structured information – name, account, location, etc. – and unstructured information, the text: to analyse and represent the text in the two-dimensional structure, the columns are the words that you accept to recognize. E.g. images: considered as a set of pixels, the pixels are the columns and the rows are the pictures; each intersection is the number that expresses the colour of that pixel.
To represent a generic dataset D, we will denote by m the number of observations, or rows, in the two-dimensional table containing the data and by n the number of attributes, or columns. Furthermore, we will denote the data matrix by X = [x_ij], with i ∈ M = {1, 2, ..., m} and j ∈ N = {1, 2, ..., n}.
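As a sketch of how unstructured text can be brought back to the m × n matrix X = [x_ij] described above, the snippet below builds a bag-of-words representation of a few invented tweets with scikit-learn: the rows are the tweets and the columns are the recognized words.

```python
from sklearn.feature_extraction.text import CountVectorizer

tweets = [
    "great service, very happy",
    "terrible service, switching provider",
    "happy with the new offer",
]

# Columns are the words we accept to recognize; rows are the tweets.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(tweets).toarray()   # shape (m, n)

print(vectorizer.get_feature_names_out())        # the n attribute names
print(X)                                         # the m x n matrix X = [x_ij]
```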