Data Analysis Summary

Complete and detailed summary based on personal reworking and notes from all the lectures, integrated with all the exercises solved in class and the related explanations by Prof. Nieddu. …

Exam: Data Analysis

Faculty: Economics

From the course of Prof. Nieddu Luciano

University: Università degli Studi Internazionali di Roma - UNINT

Publisher: giorgia2808

Academic year 2020-2021

223 pages


VARIABILITY FOR QUANTITATIVE CHARACTER

It should be redundant since we said variability is for quantitative characters, but let's stress this once more.

For a quantitative character, the Gini Index and the entropy can still be computed since they are indexes only based on relative frequencies.

Heterogeneity indexes do not take into consideration the values of the character.

In the 3 examples we have 3 distributions with the same Gini Index, since the 3 distributions all have the same frequencies.

The central tendency values (means) we have computed do not seem, judging from the heterogeneity, to be very good at representing the entire distribution. Looking at the data, it looks like 1.2 is better at representing the 1st distribution than 5.8 is at representing the 2nd distribution.

This conclusion is based solely on the values of the character, and therefore it cannot be drawn using the Gini Index. In the same fashion, 5.8 is better at representing the 2nd distribution than 500.3 is at representing the 3rd distribution.

Nonetheless, the 3 distributions all show the same heterogeneity, therefore we need a different measure of variability that can take into consideration also the values of the character.

Now we are going to solve this problem.

So, basically, we need a variability index which also takes into consideration how different the values are.

There are several variability indexes.

QUESTION:

If the Gini Index had been much lower, so with a lower heterogeneity, would what we are saying now still hold? Yes, because you can have a Gini Index of 0.5, so very small, meaning that almost all the units tend to show the same value of the character, but the other values may be completely different.

# OF MOBILES	FREQ	VALUE x FREQ	MEAN	GINI INDEX
0	1	0
1	10	10
2	1	2
total	12	12	1.00	0.438

We have an average of 1.0.
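The mean and the Gini Index of this table can be checked with a short Python sketch. It assumes that the reported 0.438 is the normalized Gini heterogeneity index, (1 - sum of squared relative frequencies) * k / (k - 1), with k the number of distinct values; the formula is not shown in this part of the notes, so the normalization and the function names are assumptions made for illustration.

```python
def weighted_mean(values, freqs):
    """Mean of a frequency distribution: sum(value * freq) / sum(freq)."""
    total = sum(freqs)
    return sum(v * f for v, f in zip(values, freqs)) / total


def normalized_gini(freqs):
    """Gini heterogeneity index rescaled to [0, 1] (assumed normalization)."""
    total = sum(freqs)
    k = len(freqs)  # number of distinct values of the character
    gini = 1 - sum((f / total) ** 2 for f in freqs)
    return gini * k / (k - 1)


values = [0, 1, 2]   # number of mobiles
freqs = [1, 10, 1]   # how many units show each value

mean = weighted_mean(values, freqs)
gini = normalized_gini(freqs)
print(f"mean = {mean:.2f}, Gini = {gini:.4f}")
```

Under this assumption the index comes out at about 0.4375, which is consistent with the 0.438 reported in the table.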

The Gini Index is 0.438, and 1 is actually very good at representing the entire distribution, because almost everybody owns 1 mobile. The Gini Index is very small, and this is the information we get from this distribution. But what if I change the values a little?

# OF MOBILES	FREQ	VALUE x FREQ	MEAN
0	1	0
5	10	50
10	1	10
total	12	60	5.00

The average is 5, and 5 is now going to be used to represent a population where the number of mobiles goes from 0 to 10. The Gini Index is the same. In the previous table we used 1 to represent a population where the number of mobiles goes from 0 to 2, so 1 is actually not that bad, but 5 is not as good at representing a population where the number of mobiles goes from 0 to 10. Yet the Gini Index gives exactly the same information you had before, because it is based only on the frequencies. And it is even worse here:

# OF MOBILES	FREQ	VALUE x FREQ	MEAN
0	1	0
20	10	200
100000	1	100000
total	12	100200	8350.00

We have a population where the number of mobiles goes from 0 to 100000, so you are using 8350 to represent the entire distribution.

This has the same Gini Index.

8350 is not that good at representing the entire distribution because we have values from 0 to 100000, so 8350 is very different from all the values we have.

In the 2nd table, 5 is different from the values we have, but not as different as 8350 is.

And 5 is much more different from the values of the 2nd table than 1 is from the values of the 1st table.

So, you have a Gini Index which is very small, and it is the same in the 3 distributions, but the values we are using shouldn't have the same informative power.

This is because in the 1st table, when you say 1, you are actually giving a value which is very close to the values you have; in the 2nd table, when you say 5, you are giving a value which is very frequent in the population but very different from the other values; in the 3rd case you say 8350, a value completely different from all the values you have, even though the Gini Index is the same.

This is valid regardless of the numbers we put in the 1st column, because the Gini Index is always going to consider only the frequencies, i.e. the values in the 2nd column. So, if the values you have are completely different, the average is not going to be good at representing the distribution, regardless of the frequencies. And I am using very small numbers here, but with larger numbers we would get the same result: the same Gini Index, but completely different distributions.

Gini Index and entropy are powerful tools to represent the variability of a qualitative character, because there we have no information coming from the values of the character. But when we have a quantitative character, there is information that comes with the values of the character, which the Gini Index cannot take into consideration. So, the Gini Index shouldn't be used to measure variability for a quantitative character: it doesn't give all the information, because it ignores the values of the character, and we should take those into consideration as well.
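The point can be made concrete with a small Python sketch over the three mobile tables. The heterogeneity functions below take only the frequencies as input, so the values of the character never enter the computation. The normalized form of the Gini Index and the natural logarithm for the entropy are assumptions, as are all function names.

```python
import math


def normalized_gini(freqs):
    """Gini heterogeneity index, rescaled by k / (k - 1) (assumed form)."""
    total = sum(freqs)
    k = len(freqs)
    return (1 - sum((f / total) ** 2 for f in freqs)) * k / (k - 1)


def entropy(freqs):
    """Entropy of the relative frequencies (natural log, an assumption)."""
    total = sum(freqs)
    return -sum((f / total) * math.log(f / total) for f in freqs if f > 0)


def weighted_mean(values, freqs):
    return sum(v * f for v, f in zip(values, freqs)) / sum(freqs)


freqs = [1, 10, 1]  # the same frequencies in all three tables
tables = {
    "1st": [0, 1, 2],
    "2nd": [0, 5, 10],
    "3rd": [0, 20, 100000],
}

# For each table: (mean, Gini Index, entropy).
summary = {
    name: (weighted_mean(values, freqs), normalized_gini(freqs), entropy(freqs))
    for name, values in tables.items()
}
for name, (m, g, h) in summary.items():
    print(f"{name} table: mean = {m:.2f}, Gini = {g:.4f}, entropy = {h:.4f}")
```

The means come out as 1, 5 and 8350, while the two heterogeneity indexes are identical across the three tables.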

We need variability indexes, which for a quantitative character give you an idea of how different the statistical units can be with respect to the values of the character. How can we come up with a variability index which takes into consideration the values of the character? There are several.

1- Range

A very easy measure of variability. It is given by the maximum value (remember the notation we use to represent the maximum value of the character) minus the minimum value. Why is it a measure of variability? Because, since you are computing the difference between the maximum and the minimum value, it gives you an idea of how different the statistical units are: you take the maximum value that you have observed in the sample and the minimum value. If they are very different, there is a strong tendency to assume different values. If they are not very different, then the values of the character tend to have pretty much the same magnitude.

The range is a measure of variability since it takes into consideration the maximum and the minimum value of the character. Therefore, if the character tends to assume different values in the population, the max and the min will be very different. If the variability is small, the units will tend to assume similar values in the distribution and therefore the max and the min will not be that different, so the range will not be very large.

For instance, going back to the first table of the previous example, what is the range of the character? The max value is 2, the min value is 0, so the range is 2.

In the 2nd distribution, the max is 10 and the min is 0, so the range is 10.

So, the difference between the smallest value and the largest value is 10, so 2 persons in this distribution can differ in the number of mobiles by up to 10 mobiles.

So, the character is more variable in the 2nd distribution than it is in the 1st distribution.

What is the range in the 3rd distribution? It is 100000.

So this basic and simple index is able to capture what the Gini Index can't: the difference between the values of the character.

So, if the range is 2, it means the max difference in the number of mobiles is 2, so 2 people taken at random can differ in the number of mobiles by at most 2.

It is smaller than 10, so the character in the 2nd distribution is more variable.

And 10 is smaller than 100000, so the character is more variable in the 3rd distribution than it is in the 2nd or in the 1st distribution.

It is actually easy to compute.
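As a quick check of the three ranges discussed above, here is a minimal Python sketch; the helper name `value_range` is just an illustrative choice.

```python
def value_range(values):
    """Range of a character: maximum observed value minus minimum observed value."""
    return max(values) - min(values)


# Observed values of the character (number of mobiles) in the three tables.
tables = {
    "1st": [0, 1, 2],
    "2nd": [0, 5, 10],
    "3rd": [0, 20, 100000],
}

ranges = {name: value_range(values) for name, values in tables.items()}
print(ranges)  # {'1st': 2, '2nd': 10, '3rd': 100000}
```

Unlike the Gini Index, the three results differ, because the range is computed on the values and not on the frequencies.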

What are the pros and the cons of this index?

Pros:

it is easy to compute;
it can be used to compare variability of the same character in different distributions.

Cons:

it doesn't take into consideration all the values of the character (all the information). This is one of the reasons why it is not used much: imagine you have data on 1000 statistical units and you just pick 2 values; that's not very smart;
max and min can be outliers, and so the index can be affected by outliers (this is the reason why it is less used than other indexes). Mistakes in the data entry process, for instance forgetting the decimal point in a number, adding an extra zero, or entering a negative sign, will probably end up being the max or the min value of the character in the distribution.

There is an alternative that addresses this problem (the fact that the range is based on the max and the min, which can be outliers). If you don't want to use the max and the min, you have to get rid of them, but what can we use instead?

2- Interquartile range

This is the interquartile range: the difference between the 3rd and the 1st quartile, that is, between the 75th percentile and the 25th percentile. This difference is not affected by outliers (they are on the tails/extremes of the distribution). So, this is an alternative to the range, and a valid solution to the problem we mentioned (outliers).

The interquartile range is a valid alternative to the range in the case where you don't completely trust the source of the data or think that the data may be affected by errors. In this case, instead of computing the difference between the max and the min, we compute the difference between the 3rd quartile and the 1st quartile.
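The effect of a data-entry error on the two indexes can be sketched in Python. The data below are hypothetical (they are not from the lecture), and the quartiles are computed with the standard-library `statistics.quantiles` using its "inclusive" method, which is only one of several quartile conventions and may differ from the one used in the course.

```python
import statistics


def value_range(data):
    """Maximum minus minimum."""
    return max(data) - min(data)


def interquartile_range(data):
    """Q3 - Q1, with quartiles from statistics.quantiles (inclusive method)."""
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return q3 - q1


# Hypothetical number of mobiles for 9 units.
clean = [0, 1, 1, 2, 2, 2, 3, 3, 4]
# Same data, but the last value was typed with an extra zero (4 -> 40).
with_typo = [0, 1, 1, 2, 2, 2, 3, 3, 40]

for label, data in (("clean", clean), ("with typo", with_typo)):
    print(f"{label}: range = {value_range(data)}, IQR = {interquartile_range(data)}")
```

In this example the range jumps from 4 to 40 because of the single wrong entry, while the interquartile range stays at 2.0 in both cases.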

What are the pros and cons of this index?

Pros:

It is still fast, but not as easy to compute as the range (but still easy btw).
It can be used to compare variability of the same character in different distributions.

Cons:

It doesn't take into consideration all the values of the character (all the information) (this is one of the reasons why it is not used: imagine you have data on 1000 statistical units, you just pick 2 values, and that's not very smart).
We still have the problem that this is an index based on 2 single values.
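The last point can be illustrated on the three mobile tables used earlier. This is a sketch under the same assumption as before about the quartile convention (`statistics.quantiles` with the "inclusive" method); the helper names are illustrative.

```python
import statistics


def expand(values, freqs):
    """Turn a frequency table into the list of individual observations."""
    data = []
    for value, freq in zip(values, freqs):
        data.extend([value] * freq)
    return data


def interquartile_range(data):
    """Q3 - Q1, with quartiles from statistics.quantiles (inclusive method)."""
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return q3 - q1


freqs = [1, 10, 1]
tables = {
    "1st": [0, 1, 2],
    "2nd": [0, 5, 10],
    "3rd": [0, 20, 100000],
}

iqrs = {
    name: interquartile_range(expand(values, freqs))
    for name, values in tables.items()
}
print(iqrs)
```

Because 10 units out of 12 share the same value, both quartiles fall on that value and the interquartile range is 0 in all three tables: being based on 2 single values, it ignores everything that happens in the tails.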

So we have 2 variability indexes, which do not take into consideration the frequencies.

Now we are only working on the values of the character (max - min / 3rd quartile - 1st quartile). These indexes give an idea of the span of the variable you are considering, that is, of how different the values are in the distribution you have.

QUESTION:

To calculate quartiles, do we need to calculate the cumulative frequencies? It is true, but it takes into consideration only 2 particular frequencies if you think about it: the min and the max (because the min is the lowest, so it is the 0th percentile, and the max is the largest, so it is the 100th percentile). So actually, the min and the max also take into consideration the frequencies, because it is like you are ordering all the values, and you take the 1st one, the
