Data Analysis
SLIDE 1 – Univariate Statistical modelling
Random variable
A random variable (rv) is a variable whose numerical values are determined by the
outcome of a random experiment. A rv can be: 1) discrete, if it can take no more than
a countable number of values; 2) continuous, if it can take any value in an interval.
1) Discrete random variables
Probability (mass) function: The probability mass function (pmf) of a discrete rv $X$ expresses the probability that $X$ takes the value $x$, as a function of $x$. The pmf $p : \mathbb{R} \to [0, 1]$ is defined as
$$p(x) = P(X = x).$$
Support: the support of a discrete rv $X$ is defined as the set of possible values of $X$, that is $S = \{x \in \mathbb{R} : p(x) > 0\}$.
Properties: The pmf satisfies the following properties:
1) $p(x) > 0$ for $x \in S$;
2) $p(x) = 0$ for $x \notin S$;
3) $\sum_{x \in S} p(x) = 1$.
Cumulative distribution function: The cumulative distribution function (cdf) of a discrete rv $X$ expresses the probability that $X$ does not exceed the value $x_0$, as a function of $x_0$. The cdf $F : \mathbb{R} \to [0, 1]$ is defined as
$$F(x_0) = P(X \le x_0) = \sum_{\{x \in S;\, x \le x_0\}} p(x),$$
where the notation $\{x \in S;\, x \le x_0\}$ indicates that the summation is over all possible values $x \in S$ that are less than or equal to $x_0$.
Properties: The cdf satisfies the following properties:
1) $0 \le F(x) \le 1$ for every $x \in \mathbb{R}$;
2) if $x_0$ and $x_1$ are two numbers such that $x_0 < x_1$, then $F(x_0) \le F(x_1)$.
Expectation: The expectation (also called mean or expected value) of a discrete rv $X$ is defined as
$$E(X) = \mu = \sum_{x \in S} x \, p(x),$$
where the summation is over all possible values $x \in S$.
Generalization: Let $X$ be a discrete rv with pmf $p(x)$. Moreover, let $g(X)$ be some function of $X$. The expected value of $g(X)$ is defined as
$$E[g(X)] = \sum_{x \in S} g(x) \, p(x).$$
Variance: The expectation of the squared discrepancy about the mean, $(X - \mu)^2$, is called the variance, commonly denoted by $\sigma^2$, and it is given by
$$E[(X - \mu)^2] = \sigma^2 = \sum_{x \in S} (x - \mu)^2 \, p(x).$$
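The definitions of pmf, cdf, expectation and variance for a discrete rv can be checked on a small example. The sketch below assumes a fair six-sided die (an illustrative choice, not taken from the notes) and uses exact fractions, so no rounding is involved.

```python
from fractions import Fraction

# Hypothetical example: a fair six-sided die, so S = {1, ..., 6} and p(x) = 1/6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def cdf(x0):
    """F(x0) = P(X <= x0): sum of p(x) over the support points x <= x0."""
    return sum(p for x, p in pmf.items() if x <= x0)

def expectation(g=lambda x: x):
    """E[g(X)]: sum over the support of g(x) * p(x)."""
    return sum(g(x) * p for x, p in pmf.items())

mu = expectation()                           # E(X)
var = expectation(lambda x: (x - mu) ** 2)   # E[(X - mu)^2]

print(sum(pmf.values()))  # 1
print(cdf(3))             # 1/2
print(mu, var)            # 7/2 35/12
```

Passing a different function `g` to `expectation` gives the generalized expectation $E[g(X)]$; the variance is simply the case $g(x) = (x - \mu)^2$.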
2) Continuous random variables
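Probability density function: Let $X$ be a continuous rv. The probability density function (pdf) of $X$ is a function $f : \mathbb{R} \to \mathbb{R}_0^{+}$ with the following properties:
1) $f(x) \ge 0$ for any $x \in \mathbb{R}$;
2) $P(a \le X \le b) = \int_a^b f(x)\,dx$ for any $a, b \in \mathbb{R}$ with $a \le b$;
3) $\int_{-\infty}^{\infty} f(x)\,dx = 1$.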
If $X$ is a continuous rv, then the probability of a single point $x_0$ is null, that is $P(X = x_0) = 0$.
Cumulative distribution function: The cumulative distribution function of a continuous rv $X$ is
$$F(x_0) = P(X \le x_0) = \int_{-\infty}^{x_0} f(x)\,dx.$$
It easily follows that
$$f(x) = \frac{dF(x)}{dx}.$$
Expectation: The expectation of a continuous rv $X$ is defined as
$$E(X) = \mu = \int_{-\infty}^{\infty} x \, f(x)\,dx = \int_{-\infty}^{\infty} x \, dF(x).$$
Generalization: Let $X$ be a continuous rv with pdf $f(x)$. Moreover, let $g(X)$ be some function of $X$. The expected value of $g(X)$ is defined as
$$E[g(X)] = \int_{-\infty}^{\infty} g(x) \, f(x)\,dx = \int_{-\infty}^{\infty} g(x)\,dF(x).$$
Variance: The variance of a continuous rv $X$ is defined as
$$E[(X - \mu)^2] = \sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx = \int_{-\infty}^{\infty} (x - \mu)^2\,dF(x).$$
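The integrals defining a continuous rv can be approximated numerically. The sketch below assumes an exponential density with rate 2 (a hypothetical choice made only for illustration) and a simple midpoint rule; it checks that the density integrates to 1 and approximates $E(X)$, the variance and $P(0.5 \le X \le 1)$.

```python
import math

RATE = 2.0  # hypothetical rate parameter, chosen only for illustration

def f(x):
    """Exponential pdf: f(x) = RATE * exp(-RATE * x) for x >= 0, else 0."""
    return RATE * math.exp(-RATE * x) if x >= 0 else 0.0

def integrate(h, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of h over [a, b]."""
    width = (b - a) / steps
    return sum(h(a + (i + 0.5) * width) for i in range(steps)) * width

UPPER = 40.0  # the tail of the density beyond this point is negligible here

total = integrate(f, 0.0, UPPER)                             # should be close to 1
mu = integrate(lambda x: x * f(x), 0.0, UPPER)               # E(X), exact value 1/RATE
var = integrate(lambda x: (x - mu) ** 2 * f(x), 0.0, UPPER)  # exact value 1/RATE^2
prob = integrate(f, 0.5, 1.0)                                # P(0.5 <= X <= 1)

print(total, mu, var, prob)
```

For this density the exact values are known ($E(X) = 1/2$, variance $1/4$, and $P(0.5 \le X \le 1) = e^{-1} - e^{-2}$), which makes it a convenient check on the numerical integration.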
The probability density or mass function of $X$ is more informative than summary indicators that evaluate specific aspects of $X$, such as location (mean, median, mode, and so on), variability (standard deviation, variance, and so on), skewness and kurtosis.
Statistical models for random variables
A statistical model for the pdf (or pmf) of $X$ is a mathematical function $f(x; \theta)$ characterized by a vector of parameters $\theta$. The model aims to represent, often in a considerably idealized form, the data-generating process.
Parsimony principle
A statistical model is a simplified representation of reality by analogy, derived from “field” observations and scientific reasoning. Reality is too complex a system, and statistical modelling aims to reproduce it with parsimony.
We have to find a trade-off between parsimony and accuracy.
The choice of a model for the available data depends first of all on the support of $X$.
Famous statistical models:
• Discrete support:
-> Binomial: number of successes in m independent trials (the Bernoulli distribution is the special case m = 1).
-> Poisson: suited for count data.
-> Negative Binomial: number of trials needed to reach a given number of successes.
• Continuous support:
-> Gaussian: useful for many real-valued phenomena.
-> Gamma: suited for positive data.
-> Log-Normal: suited for positive data.
-> Beta: suited for phenomena with support within a bounded interval (for instance, a rate).
-> Exponential: suited for describing the time between events (births, deaths, etc.).
-> Uniform: suited to describe phenomena with maximum uncertainty.
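Since the first criterion for choosing a model is the support, it can help to draw a few values from some of these models and look at their range. The sketch below uses Python's standard `random` module; all parameter values are arbitrary assumptions, and the binomial draw is built by hand as a sum of Bernoulli trials.

```python
import random

random.seed(42)

def binomial_draw(m, prob):
    """Number of successes in m independent Bernoulli(prob) trials."""
    return sum(random.random() < prob for _ in range(m))

N_DRAWS = 1000  # arbitrary number of simulated values per model

draws = {
    "binomial": [binomial_draw(10, 0.3) for _ in range(N_DRAWS)],
    "gaussian": [random.gauss(0.0, 1.0) for _ in range(N_DRAWS)],
    "gamma": [random.gammavariate(2.0, 1.5) for _ in range(N_DRAWS)],
    "lognormal": [random.lognormvariate(0.0, 0.5) for _ in range(N_DRAWS)],
    "beta": [random.betavariate(2.0, 5.0) for _ in range(N_DRAWS)],
    "exponential": [random.expovariate(1.2) for _ in range(N_DRAWS)],
    "uniform": [random.uniform(-1.0, 1.0) for _ in range(N_DRAWS)],
}

# The observed range of each set of draws reflects the support of the model.
for name, values in draws.items():
    print(f"{name:12s} min={min(values):8.3f} max={max(values):8.3f}")
```

The Gaussian draws take both signs, the Gamma, Log-Normal and Exponential draws stay on the positive half-line, and the Beta draws stay inside the unit interval.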
Random sample
A random sample of size $n$ is the set of rv’s $X_1, \dots, X_i, \dots, X_n$ associated to $n$ independent and identically distributed (iid) observations of the rv $X$.
A simple random sample is a set of $n$ objects drawn from a population of $N$ objects in such a way that all possible samples are equally likely.
Observed sample: an observed sample of size $n$ is the set $x_1, \dots, x_i, \dots, x_n$ constituting the realizations of the rv’s $X_1, \dots, X_i, \dots, X_n$ through the sample units.
Sample (joint) distribution: the sample (joint) distribution is the joint distribution (density $f$ or probability $p$) of $X_1, \dots, X_i, \dots, X_n$.
If $X$ is a continuous rv and has pdf $f(x; \theta)$, the joint density of the sample is given by
$$f(x_1, \dots, x_i, \dots, x_n; \theta) = \prod_{i=1}^{n} f(x_i; \theta).$$
If $X$ is a discrete rv and has pmf $p(x; \theta)$, the joint probability of the sample is given by
$$p(x_1, \dots, x_i, \dots, x_n; \theta) = \prod_{i=1}^{n} p(x_i; \theta).$$
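Because the observations are iid, the joint density factorizes into a product of marginal densities. A minimal sketch, assuming a normal model and a made-up sample of three values:

```python
import math

def normal_pdf(x, mu, sigma2):
    """Density of a normal rv with mean mu and variance sigma2."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def joint_density(sample, mu, sigma2):
    """Joint density of an iid sample: product of the marginal densities."""
    return math.prod(normal_pdf(x, mu, sigma2) for x in sample)

sample = [0.3, -1.1, 0.8]   # hypothetical observed sample
joint = joint_density(sample, mu=0.0, sigma2=1.0)
print(joint)
```

For the standard normal the product has the closed form $(2\pi)^{-n/2} \exp(-\tfrac{1}{2}\sum_i x_i^2)$, which can be used to check the result.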
Parametric inference
Parametric estimation: let $X_1, \dots, X_i, \dots, X_n$ be a random sample from a continuous rv $X$ having pdf $f(x; \theta)$, where $\theta = (\theta_1, \dots, \theta_j, \dots, \theta_k)'$.
The parameter vector $\theta$ is unknown and we wish to “estimate” it based on the random sample. This is the typical context for parametric estimation. In other words, the functional form $f$ is known (or at least assumed such!), but $\theta$ is not!
Example: Let $\mu$ be the (population) mean of $X$. Suppose $\mu$ is unknown and we want to estimate it using the random sample. Note that here no functional form is assumed for the pdf of $X$!
This easily extends to discrete rv’s.
Point estimation
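Let $X_1, \dots, X_i, \dots, X_n$ be a random sample from a population with density $f(x; \theta)$. We will indicate with $\tau(\theta)$ a function of the unknown parameters that we wish to estimate using the random sample.
Example: Let
$$f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}.$$
In this case $\theta = (\mu, \sigma^2)'$ and we might want to estimate: 1) the coefficient of variation $\tau(\mu, \sigma^2) = \sqrt{\sigma^2}/\mu$; 2) the mean $\tau(\mu, \sigma^2) = \mu$; 3) the variance $\tau(\mu, \sigma^2) = \sigma^2$.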
Estimator: An estimator of $\tau(\theta)$ is defined as some (hopefully appropriate) function $T = t(X_1, \dots, X_i, \dots, X_n)$ of the sample variables. Remark that the estimator $T$, being a function of random variables, is itself a random variable!
Estimation: The value taken by $T$ corresponding to an observed sample $x_1, \dots, x_i, \dots, x_n$ is denoted by $t = t(x_1, \dots, x_i, \dots, x_n)$, and it is referred to as the estimate of $\tau(\theta)$.
Example: $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ is the estimator of $\mu$ and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the estimate of $\mu$.
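The distinction between estimator (a random variable) and estimate (a number) can be made concrete by simulation. The sketch below assumes a normal population with made-up values for the mean, standard deviation and sample size; each new sample yields a different estimate, and the collection of estimates approximates the distribution of the estimator.

```python
import random
import statistics

random.seed(0)
MU, SIGMA, N = 10.0, 2.0, 25   # hypothetical population mean, sd and sample size

def draw_sample(n):
    """One observed sample x_1, ..., x_n from the assumed normal population."""
    return [random.gauss(MU, SIGMA) for _ in range(n)]

def t(sample):
    """The function t(.) that defines the estimator: here, the sample mean."""
    return sum(sample) / len(sample)

estimate = t(draw_sample(N))                           # a single estimate of MU
estimates = [t(draw_sample(N)) for _ in range(5000)]   # many realizations of the estimator

print(estimate)
print(statistics.mean(estimates), statistics.stdev(estimates))
```

The average of the simulated estimates should be close to the population mean, and their standard deviation close to $\sigma/\sqrt{n}$.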
Properties of point estimators: There are several ways of building estimators, and sometimes the resulting estimators do not coincide. Thus we have to characterize the properties that we want our estimators to have, as a guide to which estimator to choose. Before getting into technicalities, it is natural to require our estimator $T$ of $\tau(\theta)$ to produce estimates as close as possible to $\tau(\theta)$.
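Concentration: Let $T_1$ and $T_2$ be two estimators for $\tau(\theta)$. We say that $T_1$ is more concentrated than $T_2$ if
$$P(\tau(\theta) - \lambda < T_1 < \tau(\theta) + \lambda) > P(\tau(\theta) - \lambda < T_2 < \tau(\theta) + \lambda)$$
for every $\lambda > 0$ and $\theta$.
We say that $T^*$ is the most concentrated estimator for $\tau(\theta)$ if it is more concentrated than any other estimator of $\tau(\theta)$.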
Mean Squared Error
A most concentrated estimator $T^*$ is very hard to find. Comparing estimators based on
$$P(\tau(\theta) - \lambda < T < \tau(\theta) + \lambda)$$
for every $\lambda > 0$ is not feasible.
We prefer computing an “average error” between $T$ and $\tau(\theta)$ to make comparisons.
Mean Squared Error: A measure of the distance between $T$ and $\tau(\theta)$ is given by the mean squared error:
$$\mathrm{MSE}(T) = E[|T - \tau(\theta)|^2] = \dots = \mathrm{Var}(T) + [E(T) - \tau(\theta)]^2,$$
where $E(T) - \tau(\theta)$ indicates the bias of $T$.
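The decomposition of the MSE into variance plus squared bias can be checked by simulation. The sketch below assumes a deliberately biased, hypothetical estimator of the mean, $0.9\,\bar{X}$, and made-up population values; when all three quantities are computed on the same simulated values, the identity holds up to floating-point rounding.

```python
import random

random.seed(1)
MU, SIGMA, N, REPS = 5.0, 2.0, 20, 20000   # hypothetical settings

def shrunk_mean(sample):
    """A deliberately biased estimator of MU: 0.9 times the sample mean."""
    return 0.9 * sum(sample) / len(sample)

values = [shrunk_mean([random.gauss(MU, SIGMA) for _ in range(N)])
          for _ in range(REPS)]

mean_t = sum(values) / REPS
mse = sum((v - MU) ** 2 for v in values) / REPS        # average squared error
var_t = sum((v - mean_t) ** 2 for v in values) / REPS  # variance of the estimator
bias = mean_t - MU                                     # E(T) - tau(theta)

print(mse, var_t + bias ** 2)
```

In this setting the theoretical bias is $0.9\mu - \mu = -0.5$ and the theoretical variance is $0.81\,\sigma^2/n$, so the simulated values can also be compared with the exact ones.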
Unfortunately, an estimator with $\mathrm{MSE}(T)$ uniformly smaller, with respect to $\theta$, than that of any other estimator does not exist.
Our goal: set up “desirable” conditions that $T$ must fulfil, so as to restrict the range of admissible estimators; hopefully there will be one better than the others!
Unbiased estimator (the ones we will study)
Unbiased estimator: the estimator $T$ of $\tau(\theta)$ is unbiased if $E(T) = \tau(\theta)$, for every $\theta$.
The “central value” of $T$, that is $E(T)$, coincides with $\tau(\theta)$.
This property allows us to single out a subset of estimators of much interest, the unbiased estimators. On top of this, given that their MSE coincides with their variance, to choose among unbiased estimators it is enough to look for the lowest variance.
Bias of an estimator
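If $E(T) \ne \tau(\theta)$, we say that the estimator is biased.
Bias of an estimator: the bias of an estimator $T$ of $\tau(\theta)$ is the difference
$$\mathrm{Bias}(T) = E(T) - \tau(\theta).$$
If $E(T) > \tau(\theta)$, the estimator $T$ is positively biased, with bias $\mathrm{Bias}(T) = E(T) - \tau(\theta) > 0$: $T$ tends to overestimate $\tau(\theta)$.
If $E(T) < \tau(\theta)$, the estimator $T$ is negatively biased, with bias $\mathrm{Bias}(T) = E(T) - \tau(\theta) < 0$: $T$ tends to underestimate $\tau(\theta)$.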
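A classic instance of negative bias is the sample variance computed with divisor $n$, whose expectation is $\frac{n-1}{n}\sigma^2$. The simulation below is a sketch under assumed values for the population variance and sample size; it compares the divisor-$n$ estimator with the divisor-$(n-1)$ one.

```python
import random

random.seed(2)
SIGMA2, N, REPS = 4.0, 5, 40000   # hypothetical variance, sample size, replications

def var_divisor_n(sample):
    """Variance estimator with divisor n (negatively biased)."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / len(sample)

def var_divisor_n_minus_1(sample):
    """Variance estimator with divisor n - 1 (unbiased)."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / (len(sample) - 1)

samples = [[random.gauss(0.0, SIGMA2 ** 0.5) for _ in range(N)]
           for _ in range(REPS)]

# Simulated bias: average estimate minus the true variance.
bias_n = sum(var_divisor_n(s) for s in samples) / REPS - SIGMA2
bias_n_minus_1 = sum(var_divisor_n_minus_1(s) for s in samples) / REPS - SIGMA2

print(bias_n, bias_n_minus_1)
```

With these settings the theoretical bias of the divisor-$n$ estimator is $-\sigma^2/n = -0.8$, so it tends to underestimate the variance, while the divisor-$(n-1)$ estimator has bias 0.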
Cramér-Rao Lower Bound (related to unbiased estimators)
Within the class of unbiased estimators, the most important feature is variance.
Under certain regularity conditions it is possible to show that the variance of any
unbiased estimator is greater than, or equal to, a quantity, which is the lower bound
of the variance of unbiased estimators.
An unbiased estimator T is efficient if its variance reaches the CR lower bound.
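For reference, a common statement of the bound in the scalar-parameter case is sketched below. It assumes the usual regularity conditions and an unbiased estimator $T$ of a differentiable function $\tau(\theta)$; the symbol $\tau(\theta)$ and the normal example with known variance are assumptions made here for illustration.

```latex
\[
\operatorname{Var}(T) \;\ge\;
\frac{[\tau'(\theta)]^{2}}
     {n\, E\!\left[\left(\dfrac{\partial}{\partial\theta}\ln f(X;\theta)\right)^{2}\right]}
\]

% Example: X ~ N(mu, sigma^2) with sigma^2 known and tau(mu) = mu.
\[
\frac{\partial}{\partial\mu}\ln f(X;\mu) = \frac{X-\mu}{\sigma^{2}},
\qquad
E\!\left[\left(\frac{X-\mu}{\sigma^{2}}\right)^{2}\right] = \frac{1}{\sigma^{2}},
\qquad
\operatorname{Var}(T) \ge \frac{\sigma^{2}}{n} = \operatorname{Var}(\bar{X}).
\]
```

In this example the sample mean attains the bound, so it is an efficient estimator of $\mu$.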
Quadratic consistency (related to unbiased estimators)
Until now we have considered the sample size $n$ as fixed. It is also interesting to investigate how estimators behave as $n$ grows. In particular, we want large samples to deliver more accurate estimates than smaller ones: as the sample size grows, the quality of the estimator should improve and the MSE should tend to 0. This requirement is formalized by quadratic consistency.
Note: to stress that estimators depend on the sample size $n$, we will add a subscript and refer to the estimator as $T_n$.
Quadratic consistency: the estimator $T_n$ of $\tau(\theta)$ satisfies quadratic consistency if
$$\lim_{n \to \infty} \mathrm{MSE}(T_n) = 0,$$
for every $\theta$.
Remark 1: Quadratic consistency implies that:
1) $\lim_{n \to \infty} \mathrm{Var}(T_n) = 0$;
2) $\lim_{n \to \infty} \mathrm{Bias}(T_n) = 0$, which corresponds to asymptotic unbiasedness.
Both implications are derived from the decomposition of MSE.
Remark 2: Within the class of asymptotically unbiased estimators, looking for an asymptotically optimal estimator boils down to finding a $T_n$ whose variance tends to zero.
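As a worked example of quadratic consistency, consider the sample mean of $n$ iid observations with mean $\mu$ and finite variance $\sigma^2$ (the choice of estimator is an illustration, not part of the notes):

```latex
\begin{align*}
E(\bar{X}_n) &= \mu
  &&\Rightarrow\quad \operatorname{Bias}(\bar{X}_n) = 0, \\
\operatorname{Var}(\bar{X}_n) &= \frac{\sigma^{2}}{n}, \\
\operatorname{MSE}(\bar{X}_n) &= \operatorname{Var}(\bar{X}_n) + [\operatorname{Bias}(\bar{X}_n)]^{2}
  = \frac{\sigma^{2}}{n} \;\longrightarrow\; 0
  \quad\text{as } n \to \infty .
\end{align*}
```

Both the variance and the bias vanish in the limit, so the sample mean is a quadratically consistent estimator of $\mu$.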
Strategy for estimator construction (always unbiased)
1) Analogy: estimates are computed on the sample which is treated as a
population analog.
2) Method of moments: estimates are computed equating the population
moments (mean, variance, etc.) with the sample moments.
3) Method of maximum likelihood: in the maximum likelihood (ML) method, estimates are computed by maximizing the probability, i.e. the likelihood, of obtaining the available sample. That is, we look for the value of $\theta$ which maximizes the likelihood of obtaining the observed sample.
ML: pros and cons
✓ -> The ML method delivers asymptotically normal and efficient estimators.
-> ML is perhaps the most widespread method for building estimators, although arguably not the easiest one (see the method of moments).
Operationally, the ML method can be summarized in 3 steps.
Maximum likelihood method
-> Step 1
Keeping in mind that we are dealing with a random sample of iid observations, first we compute the likelihood function
$$L(\theta; x_1, \dots, x_n) = \prod_{i=1}^{n} f(x_i; \theta).$$
Remark: note that the formula is the same as the sample joint density $f(x_1, \dots, x_n; \theta)$, but this is now a function of $\theta$ given the sample.
Interpretation: $L(\theta; x_1, \dots, x_n)$ indicates the probability, or likelihood if you prefer, that the observed sample originates from $X$ having density $f(x; \theta)$.
Objective: our goal is to maximize $L(\theta; x_1, \dots, x_n)$ with respect to $\theta$.
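Step 1 can be sketched in code by treating the sample as fixed and the parameter as the argument. The example assumes an exponential model with unknown rate and a made-up observed sample; the likelihood is evaluated on a coarse grid of candidate rates.

```python
import math

sample = [0.8, 1.3, 0.4, 2.1, 0.9]   # hypothetical observed sample

def exp_pdf(x, rate):
    """Exponential density f(x; rate) = rate * exp(-rate * x), for x >= 0."""
    return rate * math.exp(-rate * x)

def likelihood(rate, data):
    """L(rate; x_1, ..., x_n): product of the densities, seen as a function of rate."""
    return math.prod(exp_pdf(x, rate) for x in data)

grid = [0.25 * k for k in range(1, 13)]   # candidate rates 0.25, 0.50, ..., 3.00
best = max(grid, key=lambda r: likelihood(r, sample))

for r in grid:
    print(f"rate={r:4.2f}  L={likelihood(r, sample):.6f}")
print("grid value with the largest likelihood:", best)
```

The grid is only a way to see the shape of the likelihood; for this model the exact maximizer is the reciprocal of the sample mean, which lies between two grid points.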
-> Step 2
From the likelihood function we get to the log-likelihood function $\ln L(\theta; x_1, \dots, x_n)$ for two main reasons:
1) $\ln L(\theta; x_1, \dots, x_n)$ does not alter the minima and the maxima of $L(\theta; x_1, \dots, x_n)$, since the logarithm is a monotonically increasing function;
2) the logarithmic function has some useful properties:
a) it easily handles the exponentials ($\ln e^{x} = x$ and $\ln x^{a} = a \ln x$) that most widespread distributions have (normal, gamma, etc.);
b) it easily handles products ($\ln \prod_{i=1}^{n} f(x_i) = \sum_{i=1}^{n} \ln f(x_i)$) like those of $L(\theta; x_1, \dots, x_n)$;
c) it accentuates differences (useful from a computational point of view).
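The computational advantage of the logarithm can be seen directly: the product of many densities underflows to zero in floating-point arithmetic, while the sum of log-densities stays finite. The sketch below assumes a normal model with unit variance and simulated data; it also checks on a small sample that the likelihood and the log-likelihood are maximized at the same grid point.

```python
import math
import random

random.seed(3)
data = [random.gauss(0.0, 1.0) for _ in range(2000)]   # simulated sample

def normal_pdf(x, mu):
    """Density of a normal rv with mean mu and variance 1."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def likelihood(mu, sample):
    """Product of the densities."""
    return math.prod(normal_pdf(x, mu) for x in sample)

def log_likelihood(mu, sample):
    """Sum of the log-densities, written out analytically."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (x - mu) ** 2 for x in sample)

lik = likelihood(0.0, data)          # underflows: every factor is below 0.4
loglik = log_likelihood(0.0, data)   # stays finite
print(lik, loglik)

# On a small sample both functions can be evaluated and share the same maximizer.
small = data[:10]
grid = [k / 100 for k in range(-200, 201)]
argmax_lik = max(grid, key=lambda m: likelihood(m, small))
argmax_loglik = max(grid, key=lambda m: log_likelihood(m, small))
print(argmax_lik, argmax_loglik)
```

This is why, in practice, numerical maximization is carried out on the log-likelihood rather than on the likelihood itself.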
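-> Step 3 (perhaps not too important)
The ML estimator of $\theta$ is found by maximizing $\ln L(\theta; x_1, \dots, x_n)$ with respect to $\theta = (\theta_1, \dots, \theta_j, \dots, \theta_k)'$, that is, by solving the system of $k$ equations obtained by equating the $k$ partial derivatives to 0:
$$\frac{\partial \ln L(\theta; x_1, \dots, x_n)}{\partial \theta_j} = 0, \qquad j = 1, \dots, k.$$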
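As a worked example with $k = 2$, take the normal model with $\theta = (\mu, \sigma^2)'$ (chosen here for illustration). Setting the two partial derivatives of the log-likelihood to zero gives:

```latex
\begin{align*}
\ln L(\mu,\sigma^{2}; x_1,\dots,x_n)
  &= -\frac{n}{2}\ln(2\pi\sigma^{2})
     - \frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(x_i-\mu)^{2}, \\[4pt]
\frac{\partial \ln L}{\partial \mu}
  &= \frac{1}{\sigma^{2}}\sum_{i=1}^{n}(x_i-\mu) = 0
  &&\Rightarrow\quad \hat{\mu} = \bar{x}, \\[4pt]
\frac{\partial \ln L}{\partial \sigma^{2}}
  &= -\frac{n}{2\sigma^{2}} + \frac{1}{2\sigma^{4}}\sum_{i=1}^{n}(x_i-\mu)^{2} = 0
  &&\Rightarrow\quad \hat{\sigma}^{2} = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^{2}.
\end{align*}
```

Note that the resulting variance estimate uses the divisor $n$, so the ML estimator of $\sigma^2$ is biased in finite samples, although the bias vanishes as $n$ grows.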
The invariance property of ML estimators
We are often interested in not just the model parameters, but some functions of
them.
Asymptotic properties of ML estimators
ML estimators are widely appreciated as they enjoy some very nice asymptotic
properties.
Goodness-of-fit tests
There are 3 main statistical tests to evaluate the goodness-of-fit of a statistical (theoretical) model (that we put under the null hypothesis $H_0$) to the empirical distribution (namely the distribution of the observed sample); that is, three types of test can be run to assess how well a data set is described by a hypothesized distribution:
1) Pearson’s chi-square test;
2) Kolmogorov-Smirnov test;
3) Likelihood-ratio test.
Note: The null hypothesis is a hypothesis which is assumed true until there is evidence to the contrary. The alternative hypothesis is a hypothesis opposite to the null one, which is accepted only if there is strong evidence in its favour.
1) Pearson’s chi-square test
In Pearson’s chi-square goodness-of-fit test the sample data of size $n$, if of a continuous type, are divided into intervals (or classes or bins). Then the numbers of poi…