Report Wine Quality White

EXECUTION AND METHODS USED

The following paragraphs describe the data preparation procedures and, subsequently, the application of the Naïve Bayes classification technique, with the aim of predicting the "quality" target.

4.1 Data preparation

4.1.1 Missing data and outliers

The dataset considered has no missing values. By observing the distribution in Figure 1, it can be seen that most of the wines in the dataset are of normal quality, while excellent and "less good" wines are scarce. An analysis is therefore reported to identify the outliers (the excellent / "less good" wines) with the aim of eliminating them from our dataset, since their presence could compromise the result of the entire classification process. Box plots were used to identify the outliers. The reason for the choice is simple: compared to other measures (the z-score, for example), box plots are not likely to be influenced by the extreme values of the distribution, as they use the median and the quartiles. The box plot in Figure 3a confirms the presence of outliers: wines with quality 3, 8 and 9 are outliers for our dataset. Analyzing the dataset, the observations labeled as outliers and subsequently eliminated are 200.
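The box-plot rule above (flag values beyond 1.5 × IQR from the quartiles) can be sketched as follows. The class counts are those of the UCI winequality-white distribution, consistent with the report's figures (4898 observations, of which 200 with quality 3, 8 or 9):

```python
# Sketch of the box-plot (IQR) outlier rule applied to the "quality" target.
from statistics import quantiles

# class counts of the white-wine dataset (4898 observations in total)
counts = {3: 20, 4: 163, 5: 1457, 6: 2198, 7: 880, 8: 175, 9: 5}
quality = [q for q, n in counts.items() for _ in range(n)]

q1, _, q3 = quantiles(quality, n=4)           # quartiles of the target
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # whisker bounds

outliers = sorted({q for q in quality if q < low or q > high})
kept = [q for q in quality if low <= q <= high]
print(outliers, len(quality) - len(kept))     # → [3, 8, 9] 200
```

With Q1 = 5 and Q3 = 6, the whiskers fall at 3.5 and 7.5, so qualities 3, 8 and 9 are flagged while 4 is kept, exactly as in the report.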

Figure 3a: box plot with all the data

Figure 3b: box plot without outliers

From here on, therefore, the new dataset will consist of 4698 observations, with the target that can assume values equal to 4, 5, 6, 7. For verification, Figure 3b shows the box plot of the "quality" target after eliminating the outliers: the plot confirms the absence of outliers in the dataset. In Figure 4 the histogram of the "new" target "quality" is reported.

Figure 4

4.1.2 Data normalization

As already mentioned, the explanatory variables are characterized by different ranges and units of measure. For this reason it was decided to normalize the data in the interval [0,1] using the min-max normalization (in this case simplified, because we normalize between 0 and 1):

x'_{ij} = (x_{ij} - x_{min,j}) / (x_{max,j} - x_{min,j})

From now on, for practical reasons, the attribute names will be changed to "previousname.scaled", only to distinguish them from the original ones.
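A minimal sketch of this min-max scaling, applied column-wise (the values below are illustrative, not taken from the wine dataset):

```python
# Minimal sketch of the min-max normalization into [0, 1], applied per column.
def min_max_scale(column):
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]

fixed_acidity = [6.3, 8.1, 7.0, 9.0]                 # hypothetical raw values
fixed_acidity_scaled = min_max_scale(fixed_acidity)  # the ".scaled" attribute
print(fixed_acidity_scaled[0], fixed_acidity_scaled[-1])  # → 0.0 1.0
```

The column minimum maps exactly to 0 and the maximum to 1, which is what makes the box plots of the 11 scaled attributes directly comparable.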

With normalized data it is now easier to plot the box plots of each attribute in order to have a general overview (Figure 5). The number representing the attributes is the same as in the list at page 1 (for example: 1 = fixed acidity, 2 = volatile acidity, and so on).

Figure 5: box plots of all the 11 attributes (scaled and without outliers)

4.2 Classification: Naïve Bayes

A classification method is adopted to treat the chosen dataset with the aim of predicting the "quality" target. In particular, the Naïve Bayes has been applied to the dataset: it is an application of Bayesian methods (probabilistic classification methods), which calculate the posterior probability P(y | x), that is, the probability that an observation belongs to a specific target class, using Bayes' theorem, given the a priori probabilities P(y) and the probabilities P(x | y).

In the original dataset, the target is a numeric attribute. In order to apply the classification, it must therefore be transformed into a categorical attribute.

Two different methodologies are reported to address the problem:

  • Model A: in this case the "quality" target is transformed into a categorical attribute that can assume values equal to "low", "medium-low", "medium", "medium-high".
  • Model B: in this second model a discretization of the "quality" target is implemented. At the end of the process it can only assume two values: "low" and "medium".
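The two transformations can be sketched as follows. The score-to-label mapping for model A is an assumption inferred from the four remaining quality scores (4, 5, 6, 7) and the reported class frequencies; model B's grouping of scores {4, 5} and {6, 7} is the one stated in the report:

```python
# Sketch of the two target transformations. The model-A mapping
# (4->low, 5->medium-low, 6->medium, 7->medium-high) is assumed;
# model B groups scores {4,5}->"low" and {6,7}->"medium".
MODEL_A = {4: "low", 5: "medium-low", 6: "medium", 7: "medium-high"}

def model_b(score):
    """Discretize a quality score into the two model-B classes."""
    return "low" if score in (4, 5) else "medium"

labels_a = [MODEL_A[q] for q in (4, 5, 6, 7)]
labels_b = [model_b(q) for q in (4, 5, 6, 7)]
print(labels_a)  # ['low', 'medium-low', 'medium', 'medium-high']
print(labels_b)  # ['low', 'low', 'medium', 'medium']
```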

4.2.1 Model A

In this case, the target transformed into a categorical attribute can assume values equal to "low", "medium-low", "medium", "medium-high", and it is possible to view its distribution through Figure 6 and Figure 7.

Before applying the classification, the dataset was divided into training set and test set using the hold-out method. This method randomly partitions the data: precisely, 3000 observations were included in the training set and the remaining 1698 in the test set by sampling. Thanks to the application of the classification method and the prediction phase, we therefore obtain the confusion matrix shown in Table 2 (confusion matrix of a single random iteration).

                       Actual
Predicted      Low  Medium-low  Medium  Medium-high
Low             19          23      12            1
Medium-low      26         298     248           41
Medium          18         145     234           56
Medium-high      9          48     298          222

Table 2: confusion matrix model A

By applying the classification method, the a priori probabilities P(y) and the conditional probabilities P(x | y) are also obtained. In particular:

A priori probabilities:

  • P(y = low) = 0.03033333
  • P(y = medium-low) = 0.31433333
  • P(y = medium) = 0.46866667
  • P(y = medium-high) = 0.18666667
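The pipeline described above (hold-out split, per-class priors P(y), Gaussian class-conditional densities, argmax of the posterior) can be sketched with a minimal stdlib-only Gaussian Naïve Bayes. The data points are tiny synthetic examples, not the wine dataset:

```python
# Minimal Gaussian Naive Bayes sketch (stdlib only): fit priors P(y) and
# per-feature Gaussian parameters (mu, sigma) per class, then classify by
# the largest posterior. The data are illustrative, not the wine dataset.
import math
import random
from statistics import mean, stdev

def fit(X, y):
    """Return {class: (prior, [(mu, sigma) per feature])}."""
    model = {}
    for c in set(y):
        Xc = [x for x, lbl in zip(X, y) if lbl == c]
        prior = len(Xc) / len(X)                         # P(y = c)
        params = [(mean(col), stdev(col)) for col in zip(*Xc)]
        model[c] = (prior, params)
    return model

def gauss(x, mu, sigma):
    """Gaussian density N(x; mu, sigma)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def predict(model, x):
    """Argmax over classes of P(y) * prod_j P(x_j | y)."""
    def posterior(c):
        prior, params = model[c]
        p = prior
        for xj, (mu, s) in zip(x, params):
            p *= gauss(xj, mu, s)
        return p
    return max(model, key=posterior)

random.seed(0)
# two synthetic classes with well-separated feature means
X = [[random.gauss(0.3, 0.1), random.gauss(0.2, 0.1)] for _ in range(50)] + \
    [[random.gauss(0.6, 0.1), random.gauss(0.5, 0.1)] for _ in range(50)]
y = ["low"] * 50 + ["medium"] * 50
model = fit(X, y)
print(model["low"][0], predict(model, [0.3, 0.2]))  # → 0.5 low
```

The fitted `(mu, sigma)` pairs per class play exactly the role of the [,1] and [,2] columns in the conditional-probability tables reported below.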

Conditional probabilities:

In this case the predictive variables are numerical, so the conditional probabilities can be calculated assuming that they follow a certain probability distribution; in our case it is assumed to be Gaussian. Therefore the values reported by the classification correspond to the μ (mean) and the σ (standard deviation) of the Gaussian which represents the distribution of each predictive variable with respect to the target.

In this case, the classification output reports, for each scaled attribute, the mean [,1] and the standard deviation [,2] of the Gaussian for each target class. For example, for free.sulfur.dioxide.scaled:

free.sulfur.dioxide.scaled
Y              [,1]        [,2]
low          0.08115404  0.07915199
medium-low   0.11983218  0.06302392
medium       0.11612254  0.05455907
medium-high  0.11191202  0.04639108

Analogous tables are obtained for the other scaled attributes (fixed.acidity.scaled, volatile.acidity.scaled, citric.acid.scaled, residual.sugar.scaled, chlorides.scaled, pH.scaled, sulphates.scaled, alcohol.scaled).
Applying Bayes' theorem:

P(y | x) = P(x | y) P(y) / P(x)

where P(y) is the a priori probability and, under the Naïve assumption with Gaussian conditionals:

P(x | y = v_h) = ∏_{j=1}^{n} P(x_j | y = v_h),  with  P(x_j | y = v_h) = (1 / (√(2π) σ_{jh})) · e^{−(x_j − μ_{jh})² / (2 σ_{jh}²)}
Thanks to the confusion matrices it is possible to define some performance indices, such as accuracy and precision. The reported values were obtained by iterating the classification process 50 times, in order to obtain average performance indices that are as close as possible to the real ones. In this case:

  • the model A average accuracy is 46.62%
  • the precision is:
    - 23.3% for "low" quality wines
    - 51.34% for "medium-low" quality wines
    - 53.9% for "medium" quality wines
    - 37.85% for "medium-high" quality wines

From the data shown, it can be seen that the model is not very accurate in estimating the target.

Analyzing the values of the conditional probabilities obtained from the classification, we can see that the mean and standard deviation values of each variable are similar for adjacent quality levels. As an example, Figure 8 shows the Gaussian distributions of alcohol corresponding to the different quality levels: looking at the figure, it is in fact possible to see that, for example, the blue and red curves are very similar.

Therefore, to increase the accuracy and performance of the process, Model B is introduced in which a discretization of the "quality" target is applied.

Figure 8

4.2.2 Model B

In this second model, the classification process is preceded by a discretization of the target variable "quality". For the process, a discretization by "size" was chosen. The selected classes are:

  • "Low" class: includes observations rated with a score of 4 and 5 by the testers
  • "Medium" class: includes observations rated with a score of 6 and 7 by the testers

Analyzing the distribution of the new dataset (Figure 9 and Figure 10) and comparing it with the previous distribution in Figure 6, it is possible to see that the solution obtained can reasonably be used, since it does not present a massive loss of information.

Figure 9
Figure 10

As already reported for model A, the objective of the classification method is to predict the target given a new observation (P(y | x)).

Also in this case, the classification phase follows a phase of subdivision into training set and test set by sampling, as in the previous case (hold-out method): 3000 observations in the training set and 1698 in the test set.

At the end of the process the confusion matrix shown in Table 3 is obtained. The numerical values are shown as an example and refer to a single random iteration.

               Actual
Predicted    Low  Medium
Low          346     248
Medium       242     862
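A quick check on Table 3 shows the effect of the discretization: the single-iteration accuracy rises from roughly 45% (model A) to roughly 71%:

```python
# Overall accuracy from model B's single-iteration confusion matrix
# (Table 3; rows = predicted class, columns = actual class).
cm = [[346, 248],   # predicted "low":    actual low, actual medium
      [242, 862]]   # predicted "medium": actual low, actual medium
total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total
print(total, round(accuracy, 4))
```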

By applying the classification method, the a priori probabilities P(y) and the conditional probabilities P(x | y) are also obtained. In particular:

A priori probabilities:

  • P(y = low) = 0.344
  • P(y = medium) = 0.656

    Conditional probabilities:

As in the previous case (model A), the predictive variables are numerical, so the conditional probabilities can be calculated assuming that they follow a Gaussian. The mean (μ) and standard deviation (σ) values follow:

fixed.acidity.scaled
Y          [,1]       [,2]
low        0.3025073  0.08249329
medium     0.2902742  0.07908142

volatile.acidity.scaled
Y          [,1]       [,2]
low        0.
Details: A.Y. 2020-2021, 13 pages. SSD ING-INF/05 (Information processing systems). The contents of this page are the Publisher Valentina.Bonaccini's personal reworking of material from the Business Intelligence course at Università degli Studi di Siena; they are not official university material.