Business Intelligence
Prof. Carlo Vercellis
Detecting rare events
Detecting rare events is one of the most difficult tasks in data analysis. The world has changed dramatically: between 2005 and 2017 the number of mobile devices grew so much that they now outnumber people. Most people use mobile devices for social activities, to stay connected with each other. A further difference is that, compared with 8 years ago, we are now active producers of content.
Internet of people
The so-called 'internet of people' took off from 2007. Looking at the numbers, an incredible amount of activity (swipes, likes, …) takes place every 60 seconds on social media and the internet (e.g. Facebook, Snapchat, Tinder, …).
Digital platform economy
We are living in a digital platform economy, for example:
- Spotify (no one buys CDs anymore);
- The CV, which is nowadays represented by a LinkedIn profile rather than handed to companies on paper in order to be hired;
- Netflix instead of Blockbuster: the use of data let Netflix become the monopolist in a market from which the previous monopolist was thrown out; digital data were fundamental.
Internet of things
Behind the internet of people there is the 'internet of things': smart devices are now embedded in the objects around us, and there will be about 30 billion smart devices by 2020 (e.g. wearable fitness devices, smart clothing, …).
At the same time digital data are growing as well: the expected volume in 2020 is around 30 ZB. This growth goes hand in hand with exponential curves, both increasing and decreasing (e.g. the price of storage follows a decreasing exponential curve).
Big Data
By 2013 the digital universe had grown to more than 4 zettabytes, over 50% more than in 2012. The focus of big data is moving “up the stack”, towards advanced analytics and discovery. Overall expenditure was 10 billion $ in 2013 and rose to 22 billion $ by 2016. A huge amount of data has become available, from different sources of information:
- People to machine: interaction with an automatic system (the most traditional transactions, e.g. credit card payments); it did not require a particular form of storage.
- People to people: with web 2.0 we started interacting with other people (this required new technologies, shifting from a single-location database to access from different servers).
- World of things (machine to machine): many objects, e.g. smart personal fitness devices, smart clothing, etc., that constantly transmit data.
- Enterprise data: the data held in the company’s information system.
- Public administration: in 2008, Obama established that all public data (e.g. traffic, pollution, etc.) should be published in digital form. Europe aligned with open data one year later.
Why all this happened, why now?
- Always-on consumer: we all own mobile devices and use them to get information, search for products, make purchases, etc.
- Social media: they have spread incredibly fast.
- Technology costs: storage, computing, etc. are getting cheaper, so technology is an enabler.
- Data science: it keeps improving.
- We live in a platform economy: the digital disruption has been driven by new players that use data aggressively (e.g. Netflix). Almost every industry (apart from those protected by regulation) has been attacked by companies that use data as a competitive weapon.
- We live in a world where information is key: infonomics.
Big Data features
- Volume: very high, in the order of hundreds of PB.
- Velocity: data arrive and must be processed very quickly, e.g. from sensors.
- Variety: databases contain heterogeneous data (document databases, e.g. MongoDB). There is a large number of possible database schemas (e.g. JSON, a standard for exchanging content). The data always carry the description of their own format: the document describes itself.
These are the three traditional Vs, but we can identify others, for example:
- Value
- Validity: correctness of the data.
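The idea that "the document describes itself", mentioned under Variety, can be illustrated with a minimal sketch. The two records below are hypothetical and invented for illustration; the code uses Python's standard json module rather than any specific document database.

```python
import json

# Two hypothetical records with different structures (invented for illustration).
# Each JSON document carries its own field names, so no shared table schema is needed.
raw_documents = [
    '{"customer": "C001", "age": 34, "devices": ["phone", "watch"]}',
    '{"customer": "C002", "address": {"city": "Milan", "zip": "20133"}}',
]

def describe(document_text):
    """Parse one JSON document and return its field names with their value types."""
    record = json.loads(document_text)
    return {field: type(value).__name__ for field, value in record.items()}

# The structure of each document is recovered from the document itself.
schemas = [describe(text) for text in raw_documents]
for schema in schemas:
    print(schema)
```

The two documents live side by side even though they do not share the same fields, which is what makes this kind of storage suitable for heterogeneous data.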
Benefits
What benefits can we derive from big data? They concern both the internal and the external side of the company.
From the internal point of view, big data provide more efficiency:
- Reduced costs: for example, big data from production lines are used to build predictive models that look at data on past failures in order to get early signals of potential accidents in the future. This is preventive maintenance: maintenance carried out not after accidents but before them, so as to prevent them.
- Reduced time: for example, it is possible to respond more quickly to market needs, understanding in advance what people want by looking at their comments and feedback. Very demanding quality requirements can be met with big data analysis.
- Improved processes: both costs and time reductions are process improvements.
From the external point of view:
- Inventing new services/products: companies continuously generate new services by trying to address possible needs, e.g. Amazon, which uses automatic monitoring to make the customer’s life easier.
- Data monetization: data are not only used internally; the company collects them and, rather than selling the data themselves (which is against the law), sells other companies services based on the data (e.g. Facebook makes money by providing companies with advertising targeted according to people’s interests).
- Customizing services/products
Applications for Big Data Analytics
Applications include advertising and e-commerce. They can be:
- Positive
- Negative
Smarter healthcare: diagnosis support can help detect a pathology from its early stages (which also costs less). It is a form of prevention.
We as consumers behave in a complex way: when we are in a shop we start checking on our phones the products we need, so there is a very long search phase followed by the decision phase. Then come purchase and payment, with different methods. Finally, there is the post-purchase phase, in which we express opinions. The shopping journey is very complex.
Trading analytics for finance: e.g. cryptocurrencies (virtual currencies such as Bitcoin).
Fraud and risk: e.g. every one of our credit cards is monitored for fraud, so if you make a purchase that differs from your usual ones you are asked to prove that it is really you.
- Log analysis
- Homeland security
- Traffic control
- Telecom
- Search quality
- Manufacturing
- Retail: Churn, NBO
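The fraud-monitoring example above (a purchase that differs from the usual ones triggers a check) can be sketched as a simple rule. This is only an illustration under assumed data: the amounts, the function name is_unusual and the threshold of 3 standard deviations are hypothetical choices, not the method used by any real card issuer.

```python
from statistics import mean, stdev

def is_unusual(past_amounts, new_amount, threshold=3.0):
    """Flag a purchase whose amount is far from the customer's usual spending.

    Uses a z-score: the distance from the historical mean, measured in
    standard deviations. The threshold of 3.0 is an assumed, illustrative value.
    """
    mu = mean(past_amounts)
    sigma = stdev(past_amounts)
    if sigma == 0:
        # All past purchases were identical: anything different is unusual.
        return new_amount != mu
    return abs(new_amount - mu) / sigma > threshold

# Hypothetical purchase history (euros) for one card holder.
history = [22.0, 35.5, 18.0, 41.0, 27.5, 30.0, 25.0, 33.0]

print(is_unusual(history, 29.0))    # an amount in line with the history
print(is_unusual(history, 950.0))   # an amount far outside the usual range
```

Fraudulent transactions are rare events, so in practice a rule like this flags only a tiny fraction of purchases, and the flagged customer is then asked for confirmation.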
Data Infrastructure
Every corporation (e.g. Pirelli, Ferrero, etc.), in whichever industry, has a data science team (15-30 people). The first computers to be invented were analog; later they became digital (based on binary) and were used mainly for scientific applications until the late 50s, when companies started to use them to improve efficiency and replace human tasks.
MIS (Management Information Systems) were meant to provide information to managers in order to improve decisions, but they failed because technology was not an enabler but a wall.
At the end of the 90s, with databases, fast network connections and storage, we started to speak about business intelligence. Consider an analyst who can choose among different actions and tries to decide on the best one (e.g. which customers to reach with a loyalty campaign): business intelligence supports people who have to make effective and timely decisions, and helps them choose among a number of possible actions.
Through predictive models and with the help of technology, it is possible to reach more accurate conclusions, and the decision process can become more effective and quicker (markets are very demanding and competitors are very aggressive: we have to compete in the fast lane and respond quickly to the market). With business intelligence, it is possible to rank the alternatives from best to worst with a high level of accuracy. Our mind is not made for big data, so intuition cannot be trusted; moreover, today’s organizations are so complex and dynamic that intuitive methodologies and stagnant decision-making processes are inappropriate.
Business intelligence is the parent of big data analytics: we put big data into the business intelligence framework, where the data are analysed. Big data cannot be used directly for decision-making purposes: they need to be processed by appropriate extraction tools and analytical methods.
Data → Information → Knowledge
Many companies nowadays do not include big data in their database.
The Architecture of Big Data
The architecture of big data:
- Raw data (internal/operational data from the ERP; external data) are put through ETL (Extract, Transform and Load) tools into the warehouse (repository).
- Raw data were deleted from time to time and put in digital form: in the past, not all historical data were kept, but if you need to perform an analysis it is important to have them.
- Analytical applications use the CPU of the system they run on: we need separate systems for performing analyses and for running the operational system.
- Operational systems can be spread all over the world, with different currencies and different codings: it is important to transform the data to make them homogeneous.
- Warehouses are write once, read in many ways.
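The extract, transform and load steps described above can be sketched as follows, assuming two hypothetical operational sources that use different currencies and different country codings. The exchange rate, field names and code mappings are invented for illustration.

```python
# Hypothetical raw records from two operational systems (invented for illustration).
source_italy = [{"cust": "IT-01", "country": "ITA", "amount": 100.0, "currency": "EUR"}]
source_usa = [{"cust": "US-07", "country": "United States", "amount": 50.0, "currency": "USD"}]

# Assumed lookup tables used to make the data homogeneous.
RATE_TO_EUR = {"EUR": 1.0, "USD": 0.9}   # illustrative rate, not a real quote
COUNTRY_CODE = {"ITA": "IT", "United States": "US"}

def extract(*sources):
    """Extract: read every record from every operational source."""
    for source in sources:
        yield from source

def transform(record):
    """Transform: one currency and one country coding for all records."""
    return {
        "customer": record["cust"],
        "country": COUNTRY_CODE[record["country"]],
        "amount_eur": round(record["amount"] * RATE_TO_EUR[record["currency"]], 2),
    }

def load(records, warehouse):
    """Load: append only; rows already in the warehouse are never modified."""
    warehouse.extend(records)

warehouse = []
load((transform(r) for r in extract(source_italy, source_usa)), warehouse)
print(warehouse)
```

The append-only load function mirrors the "write once, read in many ways" principle: the operational sources keep changing, while the warehouse only accumulates homogeneous history.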
Metadata: data describing data. They document what has been done: the origin and meaning of the records and the kind of transformations applied to the data. Documentation is definitely unsatisfactory (or non-existent) in many companies, because we are always rushing towards a deadline and never document things: documentation should be prepared from the beginning, as work-in-progress documentation.
Data marts: downstream of the data warehouse there are small warehouses owned by single departments, e.g. the marketing department.
OLAP (On-Line Analytical Processing): finally, the analysis layer includes:
- Query and reporting, dashboards: a way of presenting data visually (data-viz), with colours from green to red according to the level of danger.
- Alerts (alerting system): graphical hints that warn people.
- Exploratory analysis
- Predictive modelling
- Data mining
- Optimization: to extract knowledge from data
Cubes and Multidimensional Analysis
DWHs are built on relational databases; the design of data warehouses and data marts follows a multidimensional paradigm for data representation, based on a ‘star schema’. We can distinguish two types of tables:
- Dimension table/Master table: every dimension table has a main index, the key. Dimension tables are internally structured according to hierarchical relationships, e.g. the customer code, the point of sale code, etc., or time: day, week, year, etc. The main key is linked to the corresponding field of the fact table, which has a higher level of detail.
- Fact table: contains references to the dimension tables together with measures. Fact tables usually refer to transactions and contain two types of data: links to dimension tables and numerical values.
Time is a key aspect of a DWH: there is always a time dimension. A schema in which several fact tables are interconnected through dimension tables is called a ‘galaxy schema’, since it is a set of connected stars.
Star schema → Snowflake schema → Galaxy schema
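A star schema can be sketched with Python's built-in sqlite3 module. The table names, columns and sample rows below are hypothetical, chosen only to show one fact table whose rows link to three dimension tables and carry numerical measures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one key each, plus hierarchical attributes.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT, region TEXT);
CREATE TABLE dim_time    (time_id    INTEGER PRIMARY KEY, day TEXT, year INTEGER);

-- Fact table: links to the dimension tables plus numerical measures.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    time_id    INTEGER REFERENCES dim_time(time_id),
    quantity   INTEGER,
    revenue    REAL
);

INSERT INTO dim_product VALUES (1, 'Espresso', 'Coffee'), (2, 'Green tea', 'Tea');
INSERT INTO dim_store   VALUES (1, 'Milan', 'North'), (2, 'Naples', 'South');
INSERT INTO dim_time    VALUES (1, '2017-01-10', 2017), (2, '2017-01-11', 2017);
INSERT INTO fact_sales  VALUES (1, 1, 1, 10, 15.0), (2, 1, 1, 4, 8.0),
                               (1, 2, 2, 6, 9.0),  (1, 1, 2, 5, 7.5);
""")

# Aggregate the measure along two dimension hierarchies (region and category).
rows = conn.execute("""
    SELECT s.region, p.category, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_store   s ON s.store_id   = f.store_id
    GROUP BY s.region, p.category
    ORDER BY s.region, p.category
""").fetchall()
print(rows)
```

The fact table sits at the centre and each join follows one ray of the star; a galaxy schema would add further fact tables sharing some of the same dimension tables.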
A fact table connected with n dimension tables may be represented by an n-dimensional cube where each axis corresponds to a dimension. This representation generalizes the Excel pivot table (a multi-dimensional dataset), which shows two dimensions with the third summarized; it is a natural extension of the two-dimensional spreadsheet, interpreted as a two-dimensional cube. In the example we have three dimensions:
- The product
- The time
- The geographical region.
At this level of granularity the cube contains 36 elementary data values.
Hierarchy of concepts/dimensions: some dimensions, like time, are hierarchical. E.g. for time, days roll up into weeks, but weeks do not roll up into months, because a month is not always 28 days; for space we have streets, towns, etc.
Actions to perform on data cubes:
- Drill down or Roll down: an operation that leads to more detailed information, obtained by:
  - Shifting down to a lower level, e.g. from province to city.
  - Adding one dimension, e.g. time.
- Drill up or Roll up: an operation consisting of an aggregation of data, obtained by:
  - Proceeding upward to a higher level, e.g. from city to province.
  - Removing one dimension, e.g. the time dimension.
Other operations on data cubes:
- Slicing: take a slice of the cube by fixing the value of one dimension, e.g. the region.
- Dicing: fix the values of more than one dimension at the same time.
- Pivoting or Rotation
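The cube operations listed above can be sketched in plain Python. The dimension values and the measure are hypothetical: 3 products, 4 quarters and 3 regions are assumed only because they give 36 cells, since the notes do not state the actual dimension sizes.

```python
from collections import defaultdict

# Hypothetical dimensions: 3 products x 4 quarters x 3 regions = 36 cells.
products = ["A", "B", "C"]
quarters = ["Q1", "Q2", "Q3", "Q4"]
regions = ["North", "Centre", "South"]

# The cube maps (product, quarter, region) to a measure (invented sales figures).
cube = {
    (p, q, r): 100 * (pi + 1) + 10 * (qi + 1) + (ri + 1)
    for pi, p in enumerate(products)
    for qi, q in enumerate(quarters)
    for ri, r in enumerate(regions)
}

def roll_up(cube, keep):
    """Roll up: aggregate, keeping only the dimensions at the positions in `keep`."""
    totals = defaultdict(int)
    for key, value in cube.items():
        totals[tuple(key[i] for i in keep)] += value
    return dict(totals)

def slice_cube(cube, axis, value):
    """Slicing: fix the value of one dimension."""
    return {k: v for k, v in cube.items() if k[axis] == value}

def dice(cube, fixed):
    """Dicing: fix the values of several dimensions at once ({axis: value})."""
    return {k: v for k, v in cube.items()
            if all(k[axis] == value for axis, value in fixed.items())}

def pivot(cube, order):
    """Pivoting: rotate the cube by reordering its axes."""
    return {tuple(k[i] for i in order): v for k, v in cube.items()}

by_product_region = roll_up(cube, keep=(0, 2))   # the time dimension is removed
north = slice_cube(cube, axis=2, value="North")
north_q1 = dice(cube, {1: "Q1", 2: "North"})
rotated = pivot(cube, order=(2, 1, 0))           # region, quarter, product
print(len(cube), len(by_product_region), len(north), len(north_q1))
```

Drilling down is the opposite of roll_up: it means going back from by_product_region to the full cube, where the time dimension is visible again.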
Machine Learning and Data Mining
The inductive approach to artificial intelligence and machine learning is based on discovering rules and hidden information in past data and using them in a predictive way. Predictive analytics means exploring data collected in the past (called examples, observations, etc.) to extract hidden patterns and correlations that are not obvious to humans and are useful to decision-makers. This machine learning approach has basically two aims:
- To better understand the root causes of the phenomena under study: improve understanding of the investigated system.
- To derive accurate predictions and optimize future actions: this is a more pragmatic objective. Data are collected and stored in a place called the ‘data lake’. They are then analysed by an algorithm to find hidden patterns, and different tools are available to perform this kind of analysis. Moreover, we can set aside some of the past data to verify the quality of the predictions; then we predict the real future.
Example: a churner is a customer who is not loyal to the company and switches to a competitor, for example someone who moves from Vodafone to Tim. For each customer we want a large set of information (socio-demographic data such as age and region, calling and receiving behaviour, etc.), possibly more than a thousand attributes. We obtain a two-dimensional dataset in which rows are customers and columns are the attributes we know; the last column is the target variable, which tells us whether the customer is still loyal or is a churner. We want to find a hidden relationship between the last column and the previous ones: through an algorithm, we want to point out which attributes are the most important in explaining the target variable. Some of these variables are actionable by marketing people.
Example: medical data with the target variable corresponding to heart attack. You want to know what the root causes are, e.g. medical team, age, etc.
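The churn example can be sketched as a small table whose last column is the target variable. The customers and attribute names below are invented, and ranking attributes by the absolute value of their Pearson correlation with the target is just one simple, assumed way of asking which columns are most related to churn; it is not necessarily the algorithm used in the course, and a strong correlation does not by itself prove a root cause.

```python
from math import sqrt

# Hypothetical customers: columns are attributes, the last column is the target.
columns = ["age", "monthly_calls", "complaints", "churner"]
rows = [
    [25, 120, 0, 0],
    [41,  15, 3, 1],
    [33,  90, 1, 0],
    [52,  10, 4, 1],
    [29, 100, 0, 0],
    [47,  20, 2, 1],
]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

target = [row[-1] for row in rows]

# Rank every attribute by how strongly it is associated with the target.
ranking = sorted(
    ((name, pearson([row[i] for row in rows], target))
     for i, name in enumerate(columns[:-1])),
    key=lambda item: abs(item[1]),
    reverse=True,
)
for name, score in ranking:
    print(f"{name}: {score:+.2f}")
```

In this toy data, customers who call less and complain more tend to be churners; attributes such as complaints are the kind of variable marketing people could act upon.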
The Machine Learning Process
- The first step is to define the objective of the project: why are we performing the analysis? For example, we want to reduce the churn rate. There are mainly three roles:
  - Data analyst.
  - Process owner, the expert.
  - IT people: the people in the organization in charge of the data infrastructure, e.g. they know where the data are stored, etc.
- The second step is the creation of a dedicated database (data mart); it takes a lot of effort and can be very time-consuming, since collecting data is not simple in some situations.
- Exploratory data analysis is then performed. The company’s systems may contain incorrect data, which must be detected at this stage (it is important to isolate outliers so that they do not distort the average; every dataset has outliers, which must be discovered and explained by experts and analysts). If something goes wrong, we have to go back to the data mart. Once a good data mart has been reached and the exploratory analysis has been performed, we move one step ahead.
- Attribute selection is the following step: we decide that some attributes are useless (non-informative columns) and throw them away. At the same time, we want to create new attributes from existing variables that may be useful, e.g. trends. There are two approaches:
  - Common sense coming from knowledge of the process; new attributes can be based on intuition (e.g. a ratio that helps us understand whether a person is now calling more than before).
  - Dimensionality reduction techniques, which can be linear or not. We are filtering the data to create new attributes and to delete useless ones.
- Exploratory analysis and attribute selection are very much human-driven. At this point we have the final dataset, simplified and corrected. Now we have to choose the best model: this is a largely automatic activity in which we run thousands of experiments (different algorithms with different parameters; there are different techniques). Given a dataset, we cannot tell in advance which algorithm or model will be the best. The choice is very much empirical: we have to try and see which algorithm has the best performance. It is a realistic approach to choosing the best algorithm.
- The last step is to discuss the results with the users.
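Two of the steps above, isolating outliers during exploratory analysis and creating a new attribute during attribute selection, can be sketched as follows. The data are invented, and the interquartile-range rule with a factor of 1.5 is a common convention assumed here for illustration, not a method prescribed by the notes.

```python
from statistics import mean, quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k interquartile ranges outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

def call_trend(previous_month_calls, current_month_calls):
    """Derived attribute: a ratio above 1 means the customer calls more than before."""
    if previous_month_calls == 0:
        return None  # the ratio is undefined without previous activity
    return current_month_calls / previous_month_calls

# Hypothetical monthly call minutes; the last value looks like a recording error.
monthly_minutes = [110, 95, 120, 105, 100, 115, 98, 2500]

outliers = iqr_outliers(monthly_minutes)
cleaned = [v for v in monthly_minutes if v not in outliers]
print(outliers)
print(mean(monthly_minutes), mean(cleaned))  # the outlier distorts the average
print(call_trend(80, 100))
```

The detected value still has to be explained by the experts: it may be an error to remove or a genuine rare event worth studying.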
Machine learning processes are based on interpretation and prediction. Machine learning models and methods:
- Select a class of models;
- Select an evaluation metric;
- Design algorithms and identify the model.
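The three points above can be sketched on a toy problem. Everything here is an assumption made for illustration: the data are invented, the class of models is a simple threshold rule on one attribute, the evaluation metric is accuracy, and the "algorithm" is an exhaustive try-out of a few candidate thresholds, echoing the empirical approach of running many experiments.

```python
# Hypothetical data: (monthly_calls, churner). Invented for illustration.
train = [(120, 0), (15, 1), (90, 0), (10, 1), (100, 0), (20, 1), (60, 0), (30, 1)]
holdout = [(110, 0), (25, 1), (70, 0), (12, 1)]   # past data kept aside for verification

# 1. Class of models: "predict churner when monthly_calls <= threshold".
def make_model(threshold):
    return lambda calls: 1 if calls <= threshold else 0

# 2. Evaluation metric: accuracy, the share of correct predictions.
def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

# 3. Identify the model: try every candidate parameter and keep the best one.
candidates = [5, 25, 45, 65, 85, 105]
best_threshold = max(candidates, key=lambda t: accuracy(make_model(t), train))
best_model = make_model(best_threshold)

print(best_threshold, accuracy(best_model, train), accuracy(best_model, holdout))
```

The accuracy on the held-out data, not on the training data, is the figure to trust when comparing models.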
Applications of Predictive Analytics
Predictive analytics has several applications; the more traditional ones are:
- Profile prospects and customers
- Fraud detection
- Acquire new customers
- Risk management
- Cross-selling and up-selling
- Demand forecast
- Retention
- Preventive maintenance
- Market basket analysis
- Biolife – molecular biology
- Credit scoring
- Medical diagnosis
The more innovative ones are:
- Image recognition
- Social media analytics
- Web mining