
• Q = 1 indicates a perfect positive association between the variables.

• Q = −1 indicates a perfect negative association between the variables.

• Q = 0 indicates no association between the variables (independence).

A positive Yule’s Q suggests that when one variable is “true” the other variable is also more likely to be “true”, while a negative Yule’s Q suggests an inverse relationship, where one variable being “true” makes the other variable more likely to be “false”. In the case of ‘sex’ there is clearly no “true” or “false”: we simply mapped 1 to women and 0 to men, so the positive value indicates that women tend to get higher grades. Swapping the mapping would give the same value with the opposite sign. The following values are the Yule’s Q between the output and each categorical feature:

## $sex

## [1] 0.2609756

##

## $address

## [1] -0.3910294

##

## $schoolsup

## [1] -0.2096511

##

## $famsup

## [1] 0.08811437

##

## $paid

## [1] -0.08422838

##

## $higher

## [1] 0.9061045

##

## $activities

## [1] 0.2547114

##

## $internet

## [1] 0.4249261

##

## $romantic

## [1] -0.09853276
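The code that produced these values is not shown in this extract; the snippet below is a minimal sketch of how Yule's Q can be computed for a single binary feature (the function name yules_q and the direct use of table() are our own, not taken from the report):

#Yule's Q from the 2x2 contingency table of a binary feature against the output:
#Q = (ad - bc) / (ad + bc), where a, b, c, d are the four cell counts
yules_q <- function(x, y) {
  tab <- table(x, y)
  a <- tab[1, 1]; b <- tab[1, 2]
  c <- tab[2, 1]; d <- tab[2, 2]
  (a * d - b * c) / (a * d + b * c)
}

#Example (illustrative): Q between 'sex' and the output
#yules_q(new_data_train$sex, new_data_train$output)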

[Figure: bar plot of the Yule's Q values for sex, address, schoolsup, famsup, paid, higher, activities, internet and romantic; the y-axis ranges from −1 to 1.]

We add two threshold lines at −0.3 and 0.3 in order to highlight the features with the largest Yule’s Q in absolute value (a sketch of this plot is given after the list below). We can see that there are important relations between the output and these three attributes:

• ’address’: looking at the corresponding boxplot, we can observe that the percentage of people who live in the urban area is much higher among the students who passed the exam. It is reasonable to think that living near the school and having easy access to libraries and study rooms encourages students to improve their school performance.

• ’higher’: as we have already noted, students who do not want to pursue higher education are very unlikely to pass the exam, so this feature has a big influence on the final output.

• ’internet’: nowadays internet access is almost indispensable for school study, so students who do not have it are disadvantaged compared to those who do. It is therefore evident that people without internet access have fewer prospects of passing the exam.
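The bar plot itself is not reproduced in this extract; a possible way to draw it is sketched below (the vector q_values simply collects the rounded results listed above, and the plotting parameters are our own assumptions):

#Bar plot of the Yule's Q values with threshold lines at -0.3 and 0.3
q_values <- c(sex = 0.261, address = -0.391, schoolsup = -0.210, famsup = 0.088,
              paid = -0.084, higher = 0.906, activities = 0.255,
              internet = 0.425, romantic = -0.099)
barplot(q_values, ylim = c(-1, 1), main = "Yule's q values", las = 2)
abline(h = c(-0.3, 0.3), lty = 2, col = "red")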

Correlations between ordinal features

Correlations serve as statistical indicators, used to study the strength of the linear relationship between two

variables, along with understanding their mutual influence. With the following table we will obtain values

ranging between -1 and 1, with the following interpretations:

• A value of 1 indicates a perfect positive correlation, meaning the variables vary together in a positive

linear manner.

• A value of -1 indicates a perfect negative correlation, meaning the variables vary together in a negative

linear manner.

• A value of 0 indicates a complete absence of linear correlation between the variables, suggesting that

variations in one variable are not associated with variations in the other variable.


These numbers are called Pearson correlation coefficients and the mathematical formula for them is:

$$r_{XY} = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_i (X_i - \bar{X})^2 \, \sum_i (Y_i - \bar{Y})^2}}$$

where $X_i$, $Y_i$ are the data points and $\bar{X}$, $\bar{Y}$ the means of the two variables.
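As a quick illustration (toy data, not part of the original analysis), the manual formula agrees with R's built-in cor():

x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 5, 4)
manual_r <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
all.equal(manual_r, cor(x, y)) #TRUE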

#We defined a subset of the training dataset with only the ordinal features
new_data_train_ord <- subset(new_data_train, select = c('traveltime', 'studytime',
                                                        'failures', 'famrel', 'freetime',
                                                        'goout', 'Dalc', 'Walc', 'health'))

#We transform the column of the output using numerical values (0 and 1)
new_data_train_ord$output <- ifelse(new_data_train$output=='not passed', 0, 1)

corr_matrix <- cor(new_data_train_ord)
corrplot(corr_matrix, method = "number", diag=FALSE, tl.cex=0.7, number.cex = 0.8,
         type = "upper", tl.col='black')

[Figure: correlation plot (upper triangle, "number" method) of the ordinal features and the output. The values shown are:]

            studytime failures famrel freetime goout  Dalc  Walc health output
traveltime      -0.04     0.07   0.00     0.02  0.07  0.09  0.08  -0.06  -0.18
studytime                -0.15   0.01    -0.09 -0.06 -0.11 -0.19  -0.08   0.25
failures                        -0.09     0.13  0.06  0.13  0.10   0.03  -0.37
famrel                                    0.12  0.12 -0.09 -0.12   0.11   0.09
freetime                                        0.35  0.12  0.11   0.07  -0.13
goout                                                 0.21  0.37   0.00  -0.12
Dalc                                                        0.62   0.08  -0.19
Walc                                                               0.13  -0.16
health                                                                   -0.07

As we can see from the table, the feature that most influences the output is ‘failures’: a low number of past exam failures indicates a high probability of passing the final exam. The ‘studytime’ attribute is also relevant and has a positive association. Moreover, we want to point out that an increase in other features such as travel time and alcohol consumption implies a decrease in academic performance. On the other hand, we can observe that the remaining features have a very low impact on the final grade.

Correlations between numerical features

Our focus now shifts to the examination of correlations among the numerical attributes. We apply the same method used for the ordinal features.

#We defined a subset of the training dataset with only the numerical features
new_data_train_num <- subset(new_data_train, select = c('age', 'absences', 'G1', 'G2'))

#We transform the column of the output using numerical values (0 and 1)
new_data_train_num$output <- ifelse(new_data_train$output=='not passed', 0, 1)

corr_matrix <- cor(new_data_train_num)
corrplot(corr_matrix, method = "number", diag=FALSE, tl.cex=0.7, type = "upper",
         tl.col='black')

[Figure: correlation plot (upper triangle, "number" method) of the numerical features and the output. The values shown are:]

           absences    G1    G2 output
age            0.14 -0.19 -0.13  -0.12
absences            -0.16 -0.14  -0.16
G1                         0.87   0.73
G2                                0.75

As we can see, the feature ‘age’ does not show a significant correlation with the other variables. From the second row we notice that an increasing number of absences leads to lower grades on average, as was easily predictable. The grades of the first and second period, in addition to being strongly correlated with each other, have an important relationship with the output value. This suggested removing these two attributes for the next part of the project, because they are far more influential than the others.
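The code that drops these columns is not shown in this extract; a possible way to do it (illustrative only: the variable name new_data_train_reduced is our own, and the first model below still uses all the features) is:

#Drop the first- and second-period grades for the later part of the project
new_data_train_reduced <- subset(new_data_train, select = -c(G1, G2))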

Model processing

As our target variable has only two possible outcomes (binary nature), we chose logistic regression as our first model. Logistic regression falls under the category of binomial generalized linear models (GLMs).

The main assumption in logistic regression is that the natural logarithm of the odds of the response variable

belonging to a specific category can be expressed as a linear combination of the independent variables. The

link function used in this model is the logistic function (also known as the sigmoid function), which maps the

linear combination of predictors to a probability range from 0 to 1. Consequently, using academic performance

data, the model calculates the likelihood of a student passing or not passing the final exam.
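In symbols (standard logistic regression notation, not copied from the report), with $p$ the probability of passing and $x_1, \dots, x_k$ the predictors:

$$\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k, \qquad p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}$$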

Model with all the features
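The chunk that generated the following summary is hidden in this extract; based on the Call line reported below, the fit was along these lines. The accuracy computation is our assumption (a 0.5 threshold on predicted probabilities; new_data_test is a hypothetical name for a held-out set):

#Logistic regression with all the features (call reconstructed from the summary below)
model_all <- glm(new_data_train$output ~ ., family = binomial, data = new_data_train)
summary(model_all)

#Assumed accuracy computation: 0.5 threshold on predicted probabilities
prob <- predict(model_all, newdata = new_data_test, type = "response")
pred <- ifelse(prob > 0.5, 'passed', 'not passed')
print(paste("Accuracy:", mean(pred == new_data_test$output)))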

##

## Call:

## glm(formula = new_data_train$output ~ ., family = binomial, data = new_data_train)

##

## Coefficients:

## Estimate Std. Error z value Pr(>|z|)

## (Intercept) -51.28164 7.86250 -6.522 6.92e-11 ***

## sexM -0.36376 0.57372 -0.634 0.526058

## age 0.66734 0.24113 2.768 0.005648 **

## addressU 0.16191 0.54453 0.297 0.766206

## traveltime 0.28353 0.31650 0.896 0.370328

## studytime 0.53070 0.30587 1.735 0.082735 .

## failures -1.67140 0.94489 -1.769 0.076912 .

## schoolsupyes -0.72984 0.70104 -1.041 0.297843

## famsupyes 0.08083 0.51609 0.157 0.875539

## paidyes 0.44588 1.09067 0.409 0.682679

## activitiesyes 0.52783 0.49986 1.056 0.290993

## higheryes 0.71907 1.05966 0.679 0.497399

## internetyes 0.98705 0.59578 1.657 0.097575 .

## romanticyes 0.07131 0.49594 0.144 0.885665

## famrel 0.06744 0.27231 0.248 0.804392

## freetime -0.39518 0.25737 -1.535 0.124665

## goout -0.05599 0.23742 -0.236 0.813558

## Dalc -0.03843 0.36840 -0.104 0.916926

## Walc 0.14059 0.28319 0.496 0.619586

## health 0.09210 0.18411 0.500 0.616912

## absences -0.04235 0.05161 -0.821 0.411898

## G1 0.77962 0.22485 3.467 0.000526 ***

## G2 2.62396 0.39706 6.609 3.88e-11 ***

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##

## (Dispersion parameter for binomial family taken to be 1)

##

## Null deviance: 712.77 on 518 degrees of freedom

## Residual deviance: 135.61 on 496 degrees of freedom

## AIC: 181.61

##

## Number of Fisher Scoring iterations: 9

## [1] "Accuracy: 0.923076923076923"

The summary presented above contains essential information concerning the estimated coefficients, standard

errors, z-values, and p-values of the predictor variables within the model. These values are vital for evaluating

the magnitude and significance of the relationship between the predictor variables and the binary outcome.

• Estimated Coefficients: they represent the estimated effect of each predictor on the response variable (a worked reading of one coefficient is given after this list). Positive coefficients indicate a positive relationship with the response, while negative coefficients indicate a negative relationship. The similarity in absolute values among these coefficients indicates a well-balanced distribution of information across all predictor variables in the model. This desirable outcome suggests that no single variable significantly outweighs the others, and each predictor plays a valuable role in predicting academic performance.

• Standard Errors: this section shows the standard errors associated with each coefficient; they measure the uncertainty or variability in the estimated coefficients.

• Z-Values: this section shows the Z-values (also known as t-values), which are the coefficients divided by their respective standard errors. They measure the number of standard deviations the coefficients are away from zero; larger absolute Z-values indicate a more significant effect of the predictor on the response.

• P-Values: they indicate the probability of observing the estimated coefficient, or a more extreme one, under the null hypothesis that the true coefficient is zero.
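As an illustrative reading of one coefficient (our own example, not taken from the report), the estimate for 'age' is 0.667; on the odds scale this is

exp(0.66734) #approximately 1.95

so, holding the other predictors fixed, each additional year of age nearly doubles the estimated odds of passing.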
