EXECUTION AND METHODS USED
The following paragraphs describe the procedures regarding the preparation of the data and, subsequently, the application of the Naïve Bayes classification technique with the aim of predicting the "quality" target.

4.1 Data preparation
4.1.1 Missing data and outliers
The dataset considered has no missing values. By observing the distribution in Figure 1, it can be seen that most of the wines in the dataset are of normal quality, while excellent and "less good" wines are scarce. An analysis is therefore carried out to identify the outliers (the excellent / "less good" wines) with the aim of eliminating them from the dataset, as their presence could compromise the result of the entire classification process. Box plots were used to identify the outliers. The reason for this choice is simple: compared to other measures (the z-score, for example), box plots are unlikely to be influenced by the extreme values of the distribution, as they use the median and the quartiles. The box plot in Figure 3a confirms the presence of outliers: wines with quality 3, 8 or 9 are outliers for our dataset. Analyzing the dataset, the observations labeled as outliers and subsequently eliminated are 200.
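The 1.5 × IQR rule that a box plot encodes can be reproduced directly. A minimal Python sketch, where the per-score counts are assumptions read off the histogram in Figure 1:

```python
from statistics import quantiles

# Per-score counts assumed from the histogram in Figure 1
counts = {3: 20, 4: 163, 5: 1457, 6: 2198, 7: 880, 8: 175, 9: 5}
quality = [q for q, c in counts.items() for _ in range(c)]

# 1.5 * IQR rule: the same criterion the box plot in Figure 3a visualizes
q1, _, q3 = quantiles(quality, n=4, method="inclusive")
iqr = q3 - q1
low_b, high_b = q1 - 1.5 * iqr, q3 + 1.5 * iqr

kept = [q for q in quality if low_b <= q <= high_b]
outlier_levels = sorted({q for q in quality if q < low_b or q > high_b})

print(outlier_levels)            # quality levels flagged as outliers: [3, 8, 9]
print(len(quality) - len(kept))  # observations removed: 200
print(len(kept))                 # observations kept: 4698
```

With these assumed counts the rule flags exactly the scores 3, 8 and 9 and removes 200 observations, matching the figures reported above.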
Figure 3a: box plot with all the data
Figure 3b: box plot without outliers
From here on, therefore, the new dataset will consist of 4698 observations, with the target that can assume values equal to 4, 5, 6, 7. For verification, Figure 3b shows the box plot of the "quality" target after eliminating the outliers: the plot confirms the absence of outliers in the dataset. In Figure 4 the histogram of the "new" target "quality" is reported.
Figure 4
4.1.2 Data normalization
As already mentioned before, the explanatory variables are characterized by different ranges and units of measure. For this reason it was decided to normalize the data in the interval [0,1] using the min-max normalization.
From now on, for practical reasons, the attribute names will be changed to "previousname.scaled", only to distinguish them from the original ones.
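Min-max normalization rescales each attribute as x' = (x − min) / (max − min). A minimal sketch; the fixed.acidity values below are made up for illustration:

```python
def min_max_scale(values):
    """Rescale a list of numbers into [0, 1] via min-max normalization."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical fixed.acidity column -> fixed.acidity.scaled
fixed_acidity = [6.3, 7.0, 8.1, 5.8, 9.2]
fixed_acidity_scaled = min_max_scale(fixed_acidity)

print(min(fixed_acidity_scaled), max(fixed_acidity_scaled))  # 0.0 1.0
```

The minimum of the column maps to 0, the maximum to 1, and every other value falls proportionally in between, which removes the effect of the different units of measure.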
With normalized data it is now easier to plot the box plots of each attribute in order to have a general overview (Figure 5). The number representing each attribute is the same as in the list at page 1 (for example: 1 = fixed acidity, 2 = volatile acidity, and so on).
Figure 5: box plots of all the 11 attributes (scaled and without outliers)
4.2 Classification: Naïve Bayes
A classification method is adopted to treat the chosen dataset with the aim of predicting the "quality" target. In particular, Naïve Bayes has been applied to the dataset: it is an application of Bayesian methods (probabilistic classification methods), which compute the posterior probability P(y | x), that is, the probability that an observation belongs to a specific target class, using Bayes' theorem, given the a priori probabilities P(y) and the conditional probabilities P(x | y).
In the original dataset, the target is a numeric attribute. In order to apply the classification, it must therefore be transformed into a categorical attribute.
Two different methodologies are reported to address the problem:
- Model A: in this case the "quality" target is transformed into a categorical attribute that can assume values equal to "low", "medium-low", "medium", "medium-high".
- Model B: in this second model a discretization of the "quality" target is implemented. At the end of the process it can only assume two values: "low" and "medium".
4.2.1 Model A
In this case, the target transformed into a categorical attribute can assume values equal to "low", "medium-low", "medium", "medium-high".
and it is possible to view its distribution through a histogram. By applying the classification method, the a priori probabilities P(y) and the conditional probabilities P(x | y) are obtained. Since the predictive variables are numerical, the conditional probabilities are assumed Gaussian, and for each scaled attribute the classifier reports, per class, the mean [,1] and the standard deviation [,2]. As an example, for free.solfur.dioxide.scaled:

Y | mean [,1] | sd [,2]
---|---|---
low | 0.08115404 | 0.07915199
medium-low | 0.11983218 | 0.06302392
medium | 0.11612254 | 0.05455907
medium-high | 0.11191202 | 0.04639108

[The analogous tables for the other scaled attributes (fixed.acidity.scaled, volatile.acidity.scaled, citric.acid.scaled, residual.sugar.scaled, chlorides.scaled, pH.scaled, sulphates.scaled, alcohol.scaled) are not reproduced here.]
Applying Bayes' theorem:

P(y | x) = P(x | y) P(y) / P(x)

where P(y) is the a priori probability of the class and, under the naive assumption of conditional independence of the attributes,

P(x | y) = ∏_{j=1}^{n} P(x_j | y),  with  P(x_j | y = v_h) = 1 / (√(2π) σ_jh) · exp( −(x_j − μ_jh)² / (2 σ_jh²) )
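The computation above can be sketched end to end. The priors and the per-class mean/standard deviation used below are illustrative placeholders, not the fitted values:

```python
from math import exp, pi, sqrt

def gaussian_pdf(x, mu, sigma):
    # P(x_j | y): class-conditional Gaussian likelihood
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

def posterior(x, priors, params):
    # Unnormalized P(y | x) = P(y) * prod_j P(x_j | y), then normalized;
    # the evidence P(x) cancels out in the normalization step.
    scores = {}
    for y, prior in priors.items():
        score = prior
        for xj, (mu, sigma) in zip(x, params[y]):
            score *= gaussian_pdf(xj, mu, sigma)
        scores[y] = score
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}

# Illustrative one-attribute example (alcohol.scaled), two classes
priors = {"low": 0.344, "medium": 0.656}
params = {"low": [(0.34, 0.15)], "medium": [(0.42, 0.19)]}
post = posterior([0.55], priors, params)
print(post)
```

The posterior probabilities sum to one, and the predicted class is simply the one with the largest posterior.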
Thanks to the confusion matrices it is possible to define some performance indices, such as accuracy and precision. The reported values were obtained by iterating the classification process 50 times, in order to obtain average performance indices as close as possible to the real ones. In this case:

- the Model A average accuracy is 46.62%
- the precision is:
  - 23.3% for "low" quality wines
  - 51.34% for "medium-low" quality wines
  - 53.9% for "medium" quality wines
  - 37.85% for "medium-high" quality wines

From the data shown, it can be seen that the model is not very accurate in estimating the target.
Analyzing the values of the conditional probabilities obtained from the classification, we can see how the mean and variance values of each variable are similar for adjacent quality levels. As an example, Figure 8 shows the Gaussian distributions of alcohol corresponding to the different quality levels. Looking at the figure, it is in fact possible to see how, for example, the blue and red curves are very similar.
Therefore, to increase the accuracy and performance of the process, Model B is introduced in which a discretization of the "quality" target is applied.
Figure 8
4.2.2 Model B
In this second model, the classification process is preceded by a discretization of the target variable "quality". For the process, a discretization by "size" was chosen. The selected classes are:
- "Low" class: includes the observations rated with a score of 4 or 5 by the testers
- "Medium" class: includes the observations rated with a score of 6 or 7 by the testers

The resulting a priori probabilities are:

- P(y = low) = 0.344
- P(y = medium) = 0.656
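The discretization amounts to a simple relabeling of the scores. A sketch, where the per-score counts are assumptions read off the histogram in Figure 1 and are used to recompute the priors:

```python
def discretize_quality(score):
    # Model B relabeling: scores 4-5 -> "low", scores 6-7 -> "medium"
    return "low" if score in (4, 5) else "medium"

# Per-score counts assumed from Figure 1 (after outlier removal)
counts = {4: 163, 5: 1457, 6: 2198, 7: 880}
n = sum(counts.values())
p_low = sum(c for q, c in counts.items() if discretize_quality(q) == "low") / n
p_medium = 1 - p_low

print(round(p_low, 3), round(p_medium, 3))
```

The recomputed values are close to the reported priors (0.344 and 0.656), which were estimated on a sample rather than on the full dataset.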
Analyzing the distribution of the new dataset, the two classes reflect the a priori probabilities reported above.
As already reported for Model A, the objective of the classification method is to predict the target given a new observation (P(y | x)).
Also in this case the classification phase follows a phase of subdivision into training set and test set by sampling, as in the previous case (hold-out method): 3000 observations in the training set and 1698 in the test set.
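The hold-out split can be sketched as follows; the observations are stand-ins for the wines, and the seed is an arbitrary choice for reproducibility:

```python
import random

def hold_out_split(data, n_train, seed=1):
    # Randomly partition data into training and test sets (hold-out method)
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    return [data[i] for i in idx[:n_train]], [data[i] for i in idx[n_train:]]

observations = list(range(4698))   # stand-ins for the 4698 wines
train, test = hold_out_split(observations, n_train=3000)

print(len(train), len(test))  # 3000 1698
```

Repeating this split with different seeds is what allows the classification process to be iterated and the performance indices to be averaged, as described for Model A.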
At the end of the process, the following confusion matrix is obtained:
Predicted \ Actual | Low | Medium
---|---|---
Low | 346 | 248
Medium | 242 | 862
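Accuracy and per-class precision follow directly from the matrix above; a minimal sketch:

```python
# Confusion matrix from the table above: rows = predicted, columns = actual
cm = {"Low":    {"Low": 346, "Medium": 248},
      "Medium": {"Low": 242, "Medium": 862}}

total = sum(sum(row.values()) for row in cm.values())   # 1698 test wines
correct = cm["Low"]["Low"] + cm["Medium"]["Medium"]     # diagonal entries
accuracy = correct / total

# Precision of a class = correct predictions / all predictions of that class
precision_low = cm["Low"]["Low"] / sum(cm["Low"].values())
precision_medium = cm["Medium"]["Medium"] / sum(cm["Medium"].values())

print(round(accuracy, 3))  # 0.711
```

On this single split the two-class model reaches roughly 71% accuracy, a clear improvement over the 46.62% average accuracy reported for Model A.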
By applying the classification method, the a priori probabilities P(y) (reported above) and the conditional probabilities P(x | y) are also obtained. As in the previous case (Model A), the predictive variables are numerical, so the conditional probabilities can be calculated assuming that they follow a Gaussian distribution. The mean μ and standard deviation σ values per class follow:
fixed.acidity.scaled:

Y | mean [,1] | sd [,2]
---|---|---
low | 0.3025073 | 0.08249329
medium | 0.2902742 | 0.07908142

[Analogous tables are obtained for the remaining scaled attributes (volatile.acidity.scaled, pH.scaled, sulphates.scaled, and so on) and are not reproduced here.]