Report Wine Quality White

EXECUTION AND METHODS USED

The following paragraphs describe the data preparation procedures and, subsequently, the application of the Naïve Bayes classification technique, with the aim of predicting the "quality" target.

4.1 Data preparation

4.1.1 Missing data and outliers

The dataset considered has no missing values. By observing the distribution in Figure 1, it can be seen that most of the wines in the dataset are of normal quality, while excellent and "less good" wines are scarce. An analysis is therefore reported to identify the outliers (the excellent / "less good" wines) with the aim of eliminating them from our dataset, since their presence could compromise the result of the entire classification process. Box plots were used to identify the outliers. The reason for the choice is simple: compared to other measures (the z-score, for example), box plots are not likely to be influenced by the extreme values of the distribution, as they use the median and the quartiles. The box plot in Figure 3a confirms the presence of outliers: wines with quality 3, 8 and 9 are outliers for our dataset. Analyzing the dataset, the observations labeled as outliers and subsequently eliminated are 200.
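The box-plot rule above (flag values beyond 1.5 × IQR from the quartiles) can be sketched as follows. The class counts are those of the UCI winequality-white distribution, consistent with the report's figures (4898 observations, of which 200 with quality 3, 8 or 9):

```python
# Sketch of the box-plot (IQR) outlier rule applied to the "quality" target.
from statistics import quantiles

# class counts of the white-wine dataset (4898 observations in total)
counts = {3: 20, 4: 163, 5: 1457, 6: 2198, 7: 880, 8: 175, 9: 5}
quality = [q for q, n in counts.items() for _ in range(n)]

q1, _, q3 = quantiles(quality, n=4)           # quartiles of the target
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # whisker bounds

outliers = sorted({q for q in quality if q < low or q > high})
kept = [q for q in quality if low <= q <= high]
print(outliers, len(quality) - len(kept))     # → [3, 8, 9] 200
```

With Q1 = 5 and Q3 = 6, the whiskers fall at 3.5 and 7.5, so qualities 3, 8 and 9 are flagged while 4 is kept, exactly as in the report.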

Figure 3a: box plot with all the data

Figure 3b: box plot without outliers

From here on, therefore, the new dataset will consist of 4698 observations, with the target that can assume values equal to 4, 5, 6, 7. For verification, Figure 3b shows the box plot of the "quality" target after eliminating the outliers: the plot confirms the absence of outliers in the dataset. In Figure 4 the histogram of the "new" target "quality" is reported.

Figure 4

4.1.2 Data normalization

As already mentioned, the explanatory variables are characterized by different ranges and units of measure. For this reason it was decided to normalize the data in the interval [0,1] using the min-max normalization (in this case simplified, because we normalize between 0 and 1):

x'_{ij} = (x_{ij} - x_{min,j}) / (x_{max,j} - x_{min,j})

From now on, for practical reasons, the attribute names will be changed to "previousname.scaled", only to distinguish them from the original ones.
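A minimal sketch of this min-max scaling, applied column-wise (the values below are illustrative, not taken from the wine dataset):

```python
# Minimal sketch of the min-max normalization into [0, 1], applied per column.
def min_max_scale(column):
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]

fixed_acidity = [6.3, 8.1, 7.0, 9.0]                 # hypothetical raw values
fixed_acidity_scaled = min_max_scale(fixed_acidity)  # the ".scaled" attribute
print(fixed_acidity_scaled[0], fixed_acidity_scaled[-1])  # → 0.0 1.0
```

The column minimum maps exactly to 0 and the maximum to 1, which is what makes the box plots of the 11 scaled attributes directly comparable.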

With normalized data it is now easier to plot the box plots of each attribute in order to have a general overview (Figure 5). The number representing the attributes is the same as in the list at page 1 (for example: 1 = fixed acidity, 2 = volatile acidity, and so on).

Figure 5: box plots of all the 11 attributes (scaled and without outliers)

4.2 Classification: Naïve Bayes

A classification method is adopted to treat the chosen dataset with the aim of predicting the "quality" target. In particular, the Naïve Bayes has been applied to the dataset: it is an application of Bayesian methods (probabilistic classification methods), which calculate the posterior probability P(y | x), that is, the probability that an observation belongs to a specific target class, using Bayes' theorem, given the a priori probabilities P(y) and the probabilities P(x | y).

In the original dataset, the target is a numeric attribute. In order to apply the classification, it must therefore be transformed into a categorical attribute.

Two different methodologies are reported to address the problem:

  • Model A: in this case the "quality" target is transformed into a categorical attribute that can assume values equal to "low", "medium-low", "medium", "medium-high".
  • Model B: in this second model a discretization of the "quality" target is implemented. At the end of the process it can only assume two values: "low" and "medium".
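The two transformations can be sketched as follows. The score-to-label mapping for model A is an assumption inferred from the four remaining quality scores (4, 5, 6, 7) and the reported class frequencies; model B's grouping of scores {4, 5} and {6, 7} is the one stated in the report:

```python
# Sketch of the two target transformations. The model-A mapping
# (4->low, 5->medium-low, 6->medium, 7->medium-high) is assumed;
# model B groups scores {4,5}->"low" and {6,7}->"medium".
MODEL_A = {4: "low", 5: "medium-low", 6: "medium", 7: "medium-high"}

def model_b(score):
    """Discretize a quality score into the two model-B classes."""
    return "low" if score in (4, 5) else "medium"

labels_a = [MODEL_A[q] for q in (4, 5, 6, 7)]
labels_b = [model_b(q) for q in (4, 5, 6, 7)]
print(labels_a)  # ['low', 'medium-low', 'medium', 'medium-high']
print(labels_b)  # ['low', 'low', 'medium', 'medium']
```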

4.2.1 Model A

In this case, the target transformed into a categorical attribute can assume values equal to "low", "medium-low", "medium", "medium-high", and it is possible to view its distribution through Figure 6 and Figure 7.

Before applying the classification, the dataset was divided into training set and test set using the hold-out method. This method randomly partitions the data: precisely, 3000 observations were included in the training set and the remaining 1698 in the test set by sampling. Thanks to the application of the classification method and the prediction phase, we therefore obtain the confusion matrix shown in Table 2 (confusion matrix of a single random iteration).

                       Actual
Predicted      Low  Medium-low  Medium  Medium-high
Low             19          23      12            1
Medium-low      26         298     248           41
Medium          18         145     234           56
Medium-high      9          48     298          222

Table 2: confusion matrix model A

By applying the classification method, the a priori probabilities P(y) and the conditional probabilities P(x | y) are also obtained. In particular:

A priori probabilities:

  • P(y = low) = 0.03033333
  • P(y = medium-low) = 0.31433333
  • P(y = medium) = 0.46866667
  • P(y = medium-high) = 0.18666667
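The pipeline described above (hold-out split, per-class priors P(y), Gaussian class-conditional densities, argmax of the posterior) can be sketched with a minimal stdlib-only Gaussian Naïve Bayes. The data points are tiny synthetic examples, not the wine dataset:

```python
# Minimal Gaussian Naive Bayes sketch (stdlib only): fit priors P(y) and
# per-feature Gaussian parameters (mu, sigma) per class, then classify by
# the largest posterior. The data are illustrative, not the wine dataset.
import math
import random
from statistics import mean, stdev

def fit(X, y):
    """Return {class: (prior, [(mu, sigma) per feature])}."""
    model = {}
    for c in set(y):
        Xc = [x for x, lbl in zip(X, y) if lbl == c]
        prior = len(Xc) / len(X)                         # P(y = c)
        params = [(mean(col), stdev(col)) for col in zip(*Xc)]
        model[c] = (prior, params)
    return model

def gauss(x, mu, sigma):
    """Gaussian density N(x; mu, sigma)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def predict(model, x):
    """Argmax over classes of P(y) * prod_j P(x_j | y)."""
    def posterior(c):
        prior, params = model[c]
        p = prior
        for xj, (mu, s) in zip(x, params):
            p *= gauss(xj, mu, s)
        return p
    return max(model, key=posterior)

random.seed(0)
# two synthetic classes with well-separated feature means
X = [[random.gauss(0.3, 0.1), random.gauss(0.2, 0.1)] for _ in range(50)] + \
    [[random.gauss(0.6, 0.1), random.gauss(0.5, 0.1)] for _ in range(50)]
y = ["low"] * 50 + ["medium"] * 50
model = fit(X, y)
print(model["low"][0], predict(model, [0.3, 0.2]))  # → 0.5 low
```

The fitted `(mu, sigma)` pairs per class play exactly the role of the [,1] and [,2] columns in the conditional-probability tables reported below.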

Conditional probabilities:

In this case the predictive variables are numerical, so the conditional probabilities can be calculated assuming that they follow a certain probability distribution; in our case it is assumed to be Gaussian. Therefore the values reported by the classification correspond to the μ (mean) and the σ (standard deviation) of the Gaussian which represents the distribution of each predictive variable with respect to the target.

In this case, the classification output reports, for each scaled attribute, the mean [,1] and the standard deviation [,2] of the Gaussian for each target class. For example, for free.sulfur.dioxide.scaled:

free.sulfur.dioxide.scaled
Y              [,1]        [,2]
low          0.08115404  0.07915199
medium-low   0.11983218  0.06302392
medium       0.11612254  0.05455907
medium-high  0.11191202  0.04639108

Analogous tables are obtained for the other scaled attributes (fixed.acidity.scaled, volatile.acidity.scaled, citric.acid.scaled, residual.sugar.scaled, chlorides.scaled, pH.scaled, sulphates.scaled, alcohol.scaled).
Applying Bayes' theorem:

P(y | x) = P(x | y) P(y) / P(x)

where P(y) is the a priori probability and, under the Naïve assumption with Gaussian conditionals:

P(x | y = v_h) = ∏_{j=1}^{n} P(x_j | y = v_h),  with  P(x_j | y = v_h) = (1 / (√(2π) σ_{jh})) · e^{−(x_j − μ_{jh})² / (2 σ_{jh}²)}
Thanks to the confusion matrices it is possible to define some performance indices, such as accuracy and precision. The reported values were obtained by iterating the classification process 50 times, in order to obtain average performance indices that are as close as possible to the real ones. In this case:

  • the model A average accuracy is 46.62%
  • the precision is:
    - 23.3% for "low" quality wines
    - 51.34% for "medium-low" quality wines
    - 53.9% for "medium" quality wines
    - 37.85% for "medium-high" quality wines

From the data shown, it can be seen that the model is not very accurate in estimating the target.

Analyzing the values of the conditional probabilities obtained from the classification, we can see that the mean and standard deviation values of each variable are similar for adjacent quality levels. As an example, Figure 8 shows the Gaussian distributions of alcohol corresponding to the different quality levels: looking at the figure, it is in fact possible to see that, for example, the blue and red curves are very similar.

Therefore, to increase the accuracy and performance of the process, Model B is introduced in which a discretization of the "quality" target is applied.

Figure 8

4.2.2 Model B

In this second model, the classification process is preceded by a discretization of the target variable "quality". For the process, a discretization by "size" was chosen. The selected classes are:

  • "Low" class: includes observations rated with a score of 4 and 5 by the testers
  • "Medium" class: includes observations rated with a score of 6 and 7 by the testers

Analyzing the distribution of the new dataset (Figure 9 and Figure 10) and comparing it with the previous distribution in Figure 6, it is possible to see that the solution obtained can reasonably be used, since it does not present a massive loss of information.

Figure 9
Figure 10

As already reported for model A, the objective of the classification method is to predict the target given a new observation (P(y | x)).

Also in this case, the classification phase follows a phase of subdivision into training set and test set by sampling, as in the previous case (hold-out method): 3000 observations in the training set and 1698 in the test set.

At the end of the process the confusion matrix shown in Table 3 is obtained. The numerical values are shown as an example and refer to a single random iteration.

               Actual
Predicted    Low  Medium
Low          346     248
Medium       242     862
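A quick check on Table 3 shows the effect of the discretization: the single-iteration accuracy rises from roughly 45% (model A) to roughly 71%:

```python
# Overall accuracy from model B's single-iteration confusion matrix
# (Table 3; rows = predicted class, columns = actual class).
cm = [[346, 248],   # predicted "low":    actual low, actual medium
      [242, 862]]   # predicted "medium": actual low, actual medium
total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total
print(total, round(accuracy, 4))
```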

By applying the classification method, the a priori probabilities P(y) and the conditional probabilities P(x | y) are also obtained. In particular:

A priori probabilities:

  • P(y = low) = 0.344
  • P(y = medium) = 0.656

    Conditional probabilities:

As in the previous case (model A), the predictive variables are numerical, so the conditional probabilities can be calculated assuming that they follow a Gaussian. The mean (μ) and standard deviation (σ) values follow:

fixed.acidity.scaled
Y          [,1]       [,2]
low        0.3025073  0.08249329
medium     0.2902742  0.07908142

volatile.acidity.scaled
Y          [,1]       [,2]
low        0.
Details: A.Y. 2020-2021, 13 pages. SSD ING-INF/05 (Information processing systems). The contents of this page are the Publisher Valentina.Bonaccini's personal reworking of material from the Business Intelligence course at Università degli Studi di Siena; they are not official university material.