Data analysis
Data analysis is a part of statistics. Statistics is the field of study in which we analyse collective phenomena. A collective phenomenon is any phenomenon that refers to more than one subject, so we collect information on more than one element. The goal is to describe the information collected: to give a description, summary, or synthesis of it. We try to describe a collective phenomenon using numbers.
Statistics
Statistics is the discipline that quantitatively analyses collective phenomena. We collect information on more than one subject.
Difference between statistics and data analysis
Data analysis is a branch of statistics. It deals with the description of phenomena; statistics as a whole is something more than just describing phenomena.
Main branches of statistics
- Data analysis: Description of the phenomena.
- Inference: Trying to draw conclusions about the whole population.
Data science and data mining
Data science and data mining are terms used to describe the application of statistics in engineering.
Approach in statistics
When you deal with statistics, you start with a question, you collect data, and you see whether the data agree with your idea. You have an idea or theory, so you have a question. According to statistics, you should collect data and see whether the data you collected agree with your theory: the theory might be right, or the data may not agree with it, in which case the theory would be wrong.
Approach in data science
In data science and data mining you don't start from a theory: you collect data and ask what the data tell you. You have a large amount of data and you want to see which questions the data can answer (this is what mining data means), so you don't start with the question.
In statistics, you have a scientific phenomenon (you have a theory and you try to confirm or falsify it), while in data science you don't have a theory. The techniques, by the way, are basically the same.
Key terms in statistics
- Population: The set of all the units on which I am trying to collect information. Statistics is about collective phenomena, so you must have a population. If you only have one element, you are not dealing with statistics.
- Unit / Statistical unit / Subject: The single element on which I collect information. Units are not necessarily persons. A unit can be a person, a country, a house.
Sometimes it is not possible to collect information on all the units of the population, for a number of reasons. Example: you want to know the average income of the population in Italy. There are 60 million people in Italy and you want to collect information on their income. Collecting information on the whole population would be time-consuming (time is important, time is money) and costly.
The only institution that collects information on the whole population of Italy, performing what is called a "census survey", is ISTAT (the only institution that has the money, time, and means to do so). Actually, not even ISTAT performs census surveys anymore. Now even the census of the entire population is no longer a census: it has become a sample survey.
Reasons why we cannot do census surveys anymore
- Time-consuming
- Costly
- Ethical reasons
- Sometimes observing the phenomenon implies the destruction of the statistical unit
- In some cases, it is impossible to collect data on the whole population because there is no list of all the units in the population
- Measurements are affected by errors (due to you, to the measurement tool you use, etc.)
- In some cases, the population is not finite (you have a coin and you are asked to find the probability of heads: the possible outcomes are heads or tails, but the number of possible tosses is unlimited)
Probability
Probability is the ratio between the number of cases favourable to the event and the number of all possible cases, when all cases are equally possible. For a coin, this means the coin must be fair. If the coin is not fair, we cannot say that the probability of heads is 0.5, because we cannot apply the classical definition of probability. We can use the subjective definition of probability or the frequentist definition instead.
In the frequentist definition, the probability of an event is the limit of the relative frequency of the event as the number of trials tends to infinity, provided all the trials are performed under the same conditions. We keep tossing the coin a number of times and check the relative frequency of heads. If the relative frequency of heads is around 0.5, we are willing to believe the coin is fair. But how many times can we toss the coin? We may toss it 1,000 times or a million times, but this is still a sample, because we could toss it an infinite number of times. So we cannot perform a census survey.
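The idea of watching the relative frequency of heads settle down as the number of tosses grows can be sketched with a short simulation. This is only an illustration: the function name `relative_frequency_of_heads`, the `p_heads` parameter, and the seed are assumptions made for the example, and a pseudo-random generator stands in for a real coin.

```python
import random


def relative_frequency_of_heads(n_tosses, p_heads=0.5, seed=0):
    """Simulate n_tosses coin tosses and return the share that came up heads.

    p_heads is the (hypothetical) true probability of heads; 0.5 is a fair coin.
    """
    rng = random.Random(seed)  # seeded so the run is reproducible
    heads = sum(1 for _ in range(n_tosses) if rng.random() < p_heads)
    return heads / n_tosses


# Every run is a sample: we can always toss more, but never infinitely many times.
for n in (10, 100, 1_000, 100_000):
    print(n, relative_frequency_of_heads(n))
```

With few tosses the relative frequency can be far from 0.5 even for a fair coin; with many tosses it tends to get close, which is why we are willing to believe the coin is fair.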
Sample survey vs census survey
Sample: A subset of the population. A typical sample survey is what you get at election time: the exit polls.
Exit polls: A very specific kind of sample. We do not ask people how they are going to vote; we wait for them outside the polling station and ask how they voted. It is a sample because only a small part of the population is asked.
Why do we just get a sample?
If we collected information on all the votes, it would take too much time to count them, and we wouldn't have information readily available. Exit polls are easy to analyze because you collect information on a subset of the population. Another example is the telephone sample survey, which tries to get information or opinions on something. There are many examples of sample surveys. We usually perform them when we want very fast answers to specific questions.
But the problem with a sample survey is that it doesn't collect information on the whole population, while the question we ask is usually about the whole population. What's the average income of an Italian resident? I cannot carry out a census survey, so I select 100 people from the entire population and ask them what their income is. Now I have the average income for the sample. But how can I answer questions about the entire population?
You can extend the result from the sample to the entire population only if the sample is representative of the entire population, that is, if it is a small-scale picture of the whole. For instance, I have over 100 students online. If I ask "what is your GPA?", we are a sample (a subset) of the students at UNINT. Can I say the GPA for all the students at UNINT is the one I found with these 100 students? No, because we are probably not representative of all UNINT students. There is no straightforward way to say whether a sample is representative of the whole population. I actually guess this one is not.
We cannot say whether a given sample is representative of the whole population; instead, we give insights and rules on how to select a sample that is likely to be representative. It is not guaranteed that it will be, but there are tools that let you be confident that the sample is representative.
But how can I ensure that the sample is representative of the entire population if I don’t know the population? To know it, you have to be able to say if the sample is a small picture of the population. The point is you don’t know the population, and that is why you are collecting a sample.
So the problem is this: you want a sample that is representative of the population, but to know whether it is representative you must know the population. The basic idea for getting a representative sample is that the way you decide whether a unit is included in the sample should be random. The solution is to draw a random sample from the entire population.
Since the sample is random, I can be fairly sure that it represents each kind of unit according to its frequency in the population. Imagine we want to collect information on a ratio, such as the gender ratio. To picture it, take a population of 20 balls, of which 15 are red and 5 are blue. The probability of drawing a blue one is 5/20 = 1/4. We want to collect information on the proportion of red and blue balls (for each blue ball, we have 3 red balls).
We draw a sample of 4 units, so we draw 4 balls from the population. We want a representative sample. Since in the population, red ones are more frequent, we would like to have more red balls in the sample than blue ones.
So, let’s consider what happens if I select the sample randomly (randomly: with no specific criteria, so each ball in the population has the same chance of being selected). Since the selection is random, what is the probability of a ball being blue? 5/20, so 1/4. The probability of a ball being red is 15/20, so 3/4. That is, 25% and 75%. On average, 1/4 of the balls drawn will be blue, so for a sample of 4 balls I can expect 3 red and 1 blue. This is because I selected the balls randomly, with no specific criteria. If a color is not very frequent in the population, it will not be frequent in the sample, and vice versa.
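The arithmetic of the urn example can be checked with a small sketch. It assumes the population described above (15 red and 5 blue balls) and samples of 4 drawn without replacement; the function name `average_blue_per_sample` and the seed are illustrative choices, not part of the lecture.

```python
import random
from fractions import Fraction

# The population from the example: 20 balls, 15 red and 5 blue.
population = ["red"] * 15 + ["blue"] * 5

# Probability that a randomly drawn ball is blue: 5/20 = 1/4.
p_blue = Fraction(population.count("blue"), len(population))

# Expected number of blue balls in a sample of 4: 4 * 1/4 = 1.
expected_blue_in_sample = 4 * p_blue


def average_blue_per_sample(n_samples, sample_size=4, seed=0):
    """Draw many random samples and return the average number of blue balls."""
    rng = random.Random(seed)
    total_blue = 0
    for _ in range(n_samples):
        # rng.sample draws without replacement; every ball is equally likely.
        sample = rng.sample(population, sample_size)
        total_blue += sample.count("blue")
    return total_blue / n_samples


print(p_blue)                   # 1/4
print(expected_blue_in_sample)  # 1
print(average_blue_per_sample(20_000))
```

A single sample of 4 may well contain 0 or 2 blue balls; the "3 red and 1 blue" result holds on average over many random samples.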
Imagine these balls are people: red ones are people from Rome and blue ones are people from Lazio. Imagine I do not draw a random sample; instead I pick one person at random and ask him, “do you know another person who would participate in the survey?” He gives me a name, and I do the same with that person, who gives me another person who can answer the questionnaire, and I keep going. For example, I randomly pick one of the 170 students and ask him/her to answer the questionnaire; at the end I ask for another person willing to participate, and so on. This is called snowball sampling. Is this random? People give you the names of people they know, who are very similar to them: they tend to have the same ideas, religion, way of thinking, income, and social status. So this is not random. If a sample is random, every person has the possibility of being selected.
We left off with the idea of considering a sample survey instead of a census survey. We came to the conclusion that most of the time, unless you are the official statistical institute of a country, you don’t have the resources, time, manpower, and money to collect information on the whole population. Most of the time we therefore consider a sample survey. A sample is just a subset of the population, with no particular strings attached and no adjective associated with it. The only way to get a sample that can be representative of the whole population is to select the units to be included in the sample randomly. By randomly we mean that all the units of the population have the same possibility (probability) of being included in the sample.
If we go back to the example from our previous lesson, we selected a sample of 4 units. We drew the sample randomly, which means we selected each unit without looking into the jar/container holding the population. The probability of taking a blue ball is 5/20, so 0.25, and the probability of taking a red one is 15/20, so 0.75: a 25% chance of blue and a 75% chance of red. On average (if you keep doing this), 1 ball will be blue and 3 will be red. This is because you select the balls randomly.
What if you look in the urn before selecting the balls? You may end up deliberately picking a red ball, or deliberately picking a blue one, and the selection would no longer be random. So it must be random: then the color of the ball is decided by how many blue or red balls there are in the population.
Let’s see examples of samples which are not random. Imagine you want to get information on the average income of the Italian population.
- You go out and start stopping people on the street. To stop a person you toss a coin. If it is heads, you select the person and you ask him how much he makes every 3 months.
- You pick up the phone directory for Rome and you open up a page at random and you call all the persons on that page and you ask them what their income is.
- You randomly generate phone numbers and you call the numbers and you ask the person that picks up what their income is.
In order for the sample to be random, all the units of the population must have the same probability of being included in the sample. This is our goal: we want to estimate the average income of the Italian population. Our population is the Italian residents in 2020.
- Stopping people on the street: if you go out, you tend to select people who live in the area, who tend to have similar incomes on average. If I go out in my neighborhood and select people who pass by, it does not meet the definition of a random sample, because if I live in Rome, a person who lives in Sardinia has zero probability of being included in the sample. So, this is not a random sample.
- The phone directory: can we at least use this to get the average income of the people who live in Rome? If someone is not listed, they have zero probability of being included. Even if I were listed, I might not be at home at that moment, so I couldn’t answer. Again, this is not a random sample, even if everyone were listed.
- Randomly generated phone numbers: if I hate mobile phones, for example, but I still have a job, I could not be included. Even if it looks random, it is not.
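The neighborhood problem can be illustrated with simulated data. Everything in this sketch is hypothetical: the two areas, the income figures, and the sample size are invented for illustration and are not estimates for Italy. It compares a sample taken only from one area with a random sample taken from the whole population.

```python
import random
import statistics

rng = random.Random(1)  # seeded so the run is reproducible

# Hypothetical population of 10,000 units living in two areas.
# The income figures are invented purely for illustration.
population = (
    [("my_neighborhood", rng.gauss(40_000, 5_000)) for _ in range(1_000)]
    + [("elsewhere", rng.gauss(25_000, 5_000)) for _ in range(9_000)]
)
true_mean = statistics.mean(income for _, income in population)

# Non-random sample: only people who pass by in my neighborhood.
# Units living elsewhere have zero probability of being included.
nearby = [income for area, income in population if area == "my_neighborhood"]
convenience_mean = statistics.mean(rng.sample(nearby, 100))

# Random sample: every unit of the population can be selected.
random_mean = statistics.mean(
    income for _, income in rng.sample(population, 100)
)

print(round(true_mean), round(convenience_mean), round(random_mean))
```

In this setup the neighborhood sample describes the neighborhood well but misses the population average by a wide margin, while the random sample lands much closer to it.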
There is just one way to make a random sample. To create a random sample, you must have a list of all the units in the population. Then, you randomly pick units from that list. This is a random sample because, if you have a list of all the units of the population, then every unit is in that list, and if you select randomly from that list, every unit in it is eligible to be selected.
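The procedure just described, a complete list followed by a random pick from that list, can be sketched as follows. The function name `simple_random_sample`, the placeholder unit labels, and the sizes are assumptions made for the example.

```python
import random


def simple_random_sample(frame, sample_size, seed=None):
    """Pick sample_size distinct units at random from the full list of units.

    frame is the list of ALL units in the population. Every unit has the
    same probability, sample_size / len(frame), of ending up in the sample.
    """
    rng = random.Random(seed)
    return rng.sample(frame, sample_size)  # without replacement


# Hypothetical list of a population of 1,000 units (labels are placeholders).
frame = [f"unit_{i}" for i in range(1, 1001)]
sample = simple_random_sample(frame, 100, seed=42)

print(len(sample), sample[:5])
```

The code is trivial once the list exists; the hard part in practice is having the list at all.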
The phone directory is not a list of all the units in the population, unless your population is the people with a listed phone number. If you want a random sample of the Italian population, you must have a list of all the people resident in Italy today. This can be inconvenient: you may end up selecting people from all over Italy, so if you want to interview them you could end up with one person in Rome, one in Naples, etc. But once you have a random sample, everything is easy from there: you can be fairly sure that the sample is representative of the whole population. It is a small representation of the whole population.
Most of the samples we have are not random: there are other types of samples. Sometimes it is impossible to get a random sample because we don’t have a list of the whole population. For instance: I want an estimate of the average age of people with HIV in Italy, but we don’t have a list of people with HIV in Italy, so it is impossible to draw randomly from that list. Another example: I want to get the gender ratio (M/F) of whales in the world, but there is no list of all the whales in the world. There are many examples where you don’t actually have a list.
Further examples: I want an estimate of the amount of taxes that were not paid in Italy in 2020, or some information on the number of illegal immigrants in Italy. When a random sample is not an option, we use convenience sampling. A convenience sample is not random by definition: it is convenient, meaning it is the only sample I can get. If you collected a sample without random selection, you can still use it, but you must keep in mind that the information can only be used to describe what is in the sample; it cannot be extended to the whole population.
For instance, if I want to collect information on drug use in Rome, I could try to find a drug user, interview him/her, and afterward ask him/her for the name of another drug user he/she knows, then go and interview that person, and so on. This is snowball sampling, so called because as you keep going you keep getting more units in the sample. But this is not random sampling, because all the units you select are somehow linked: they all know each other.
We will only be dealing with random sampling. The amount of information you can collect from a unit is infinite: imagine I am a unit of the population of teachers at UNINT; you can ask me many questions (age, height, gender, eye color). You can ask for all sorts of information. If the unit is an object, once again you can collect an infinite amount of information. We don’t have time to collect all the possible information from a unit, because it would take an infinite amount of time, so we have to select some characteristics of the units. For instance, if we are interested in the academic performance of students, we wouldn’t ask for irrelevant information. We would ask for information related to academic performance: how many exams passed, average mark, lowest mark, etc. So, we select the characters that are relevant for the study.
Characters
For each unit, we only collect those characteristics (characters) that are relevant for the study since the number of characters (features) that can be collected for each unit is infinite. For instance, if the unit is a student at UNINT and we are interested in academic performance.