The adjusted R squared is low in both the simple and the multiple regression model; in particular, it decreases from the first to the second. This suggests that the demographic variables, together with income, are not useful in explaining Y (97% of its variation is left unexplained!).
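As a minimal sketch of this check in Stata (the regressor names income, age and famsize are hypothetical, since the exercise does not list them), one could compare the two fits directly through the stored result e(r2_a):

    regress y income
    display "Adj. R2, simple model:   " e(r2_a)
    regress y income age famsize
    display "Adj. R2, multiple model: " e(r2_a)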
2. Possible answer on multicollinearity:
From the correlation matrix of the explanatory variables, we can see that between income and education there is a high (positive) correlation, equal to 46%. We can test these two variables jointly with an F test in order to verify the presence of multicollinearity (see the sketch below). The p-value of the F test is lower than 5%, so we must reject H0 in favor of H1. This result is not consistent with the t tests, which suggest the opposite: the two variables are jointly significant but individually insignificant, the classic symptom of multicollinearity.
So there is multicollinearity between income and education: we must drop one of these two variables in order to study the regression.
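A possible way to run this joint test in Stata (again with hypothetical variable names for the regressors not listed in the exercise) is:

    correlate income educ              // pairwise correlation, about 0.46
    regress y income educ age famsize  // full model
    test income educ                   // F test of H0: both coefficients = 0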
3. Given what we said above, the best model between 1 and 2 seems to be the simple one.
Indeed, in the multiple regression model we have multicollinearity and none of the coefficients is statistically different from zero.
In general, both the simple and the multiple model are poor, given what we noted about the adjusted R squared.
EXAM 2
Answers.
1. We are dealing with cross-sectional data. In particular, the dataset contains information about US working individuals in 1987. From the command descr, we obtain information about our variables: experience (years of full-time work experience), male (1 if male, 0 otherwise), school (years of schooling) and wage ($ per hour, 1980 prices). The number of observations is 3294.
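As a minimal sketch, these first steps amount to (assuming the dataset is already loaded):

    describe     // variable names, labels and the number of observations
    summarize    // quick look at means and ranges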
Through the command sum y, d we obtain some more information about the dependent variable. First, we see that it is not normally distributed: the skewness is different from zero and the kurtosis is higher than three. Since the mean of y is not much higher than the median, we set up a sktest in order to be sure about this statement: the p-value of the test is essentially zero, confirming that we were right. In particular, the skewness equals 1.97, which indicates a long right tail (positive outliers), with most of the values concentrated on the left of the distribution.
A kurtosis well above three (12.63) suggests fat tails, so the variable could be better approximated by a Student's t distribution than by a normal.
Moreover, from the correlate command in Stata we can see that all the explanatory variables are positively correlated with wage.
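These checks can be reproduced with something like the following (assuming the experience variable is stored as exper, consistent with the exper_sq used later):

    summarize wage, detail            // reports skewness and kurtosis
    sktest wage                       // joint normality test on skewness/kurtosis
    correlate wage school exper male  // pairwise correlations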
2. The coefficient of school is 0.56. This means that one extra year of schooling implies an increase in wage of 0.56 dollars per hour. The intercept, instead, has no meaningful interpretation, since it is not credible that anyone in the sample has zero years of schooling.
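A sketch of the simple regression behind this reading:

    regress wage school
    * _b[school] (about 0.56) is the extra hourly wage, in dollars,
    * associated with one more year of schooling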
3. The coefficient of schooling in this case equals 0.1052, so one extra year of schooling implies an increase in wage of about 10.52%.
Now we have two different dependent variables, wage and log(wage). So, if we want to compare the first and the second regression, we need to look at the ratio between the standard deviation of the dependent variable and the root MSE in each model.
In the first model this ratio (SD of y / root MSE) is 1.042, while in the second it is 1.04079. Consequently, the second model is better than the first at explaining y.
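A sketch of this comparison using Stata's stored results (r(sd) from summarize, e(rmse) from the preceding regress):

    generate lwage = ln(wage)
    quietly regress wage school
    quietly summarize wage
    display r(sd)/e(rmse)         // ratio for the level model
    quietly regress lwage school
    quietly summarize lwage
    display r(sd)/e(rmse)         // ratio for the log model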
4. In our new regression, all the coefficients are statistically different from zero (p-value = 0). In particular, the dummy variable male tells us that, ceteris paribus, there is a difference in wage between men and women equal to 1.344 (men earn more than women).
Note that there is a negative correlation between school and experience (-20%): indeed, the coefficient of schooling has increased (from 0.56 to 0.64) and, since the coefficient of experience is positive, in the simple regression model the estimator has a downward bias.
Moreover, the confidence interval for schooling in the multiple regression model does not contain 0.56, which says that the bias is significant. Consequently, we prefer the multiple regression model.
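The regression and the correlation used in this argument could be obtained as:

    regress wage school exper male
    correlate school exper        // about -0.20 in this sample
    * sign of the omitted-variable bias in the simple model:
    * sign(beta_exper) * sign(corr(school, exper)) < 0, hence downward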
When we add exper_sq, its p-value and that of experience become higher than alpha, so they are not individually statistically significant. The marginal effect of experience on wage can be derived by differentiating with respect to experience:

    Δwage/wage = (β1 + 2·β2·exper)·Δexper

where β1 is the coefficient of exper and β2 the coefficient of exper_sq.
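A sketch of the quadratic specification and of this marginal effect follows; the notes do not make fully explicit whether the dependent variable here is wage or log(wage), so the sketch uses lwage, consistent with the relative-change formula above, and evaluates the effect at an arbitrary 10 years of experience:

    generate exper_sq = exper^2
    regress lwage school exper exper_sq male
    display _b[exper] + 2*_b[exper_sq]*10   // marginal effect at exper = 10
    nlcom _b[exper] + 2*_b[exper_sq]*10     // same, with a delta-method s.e.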
EXAM 3
Answers.
1. The dataset is cross-sectional: we have 1990 data on countries, their life expectancy and their GDP. Through the command descr in Stata we find the number of observations, equal to 124, as well as a description of our variables:
• country = country identifier
• cid = country label
• lexp = life expectancy in years
• gdp = per capita GDP at purchasing power parity
Through a scatterplot we can see that there is a positive relationship between life expectancy and gdp:
[Scatterplot of lexp against gdp with the fitted regression line: lexp rises with gdp.]
This means that an increase in gdp is associated with an increase in life expectancy. From the graph there is also evidence of heteroskedasticity: for different levels of x we have a different variance of the errors. It seems that the variance is larger for small values of the explanatory variable and smaller for large values: for low levels of gdp, y is more volatile.
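The plot, and a formal check of this pattern that the notes do not run but which is a natural follow-up, can be sketched as:

    twoway (scatter lexp gdp) (lfit lexp gdp)   // scatter with fitted line
    quietly regress lexp gdp
    estat hettest                               // Breusch-Pagan test, H0: constant error variance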
Through the command sum lexp, d we can find some information about the distribution of our dependent variable. In particular, the skewness is different from zero (but close to it: 0.2) and the kurtosis is different from three (about 1). Consequently, the skewness tells us that the distribution is fairly symmetric, and the kurtosis that it has very thin tails. We can set up a test to be more confident about these statements: the command is sktest. The p-value is equal to zero, so at a 5% significance level we reject H0 in favour of H1: the distribution is not normal.