R Cheat Sheet

Cheat sheet with useful code for the R software, intended for passing the Data Analysis and Forecasting exam. With examples and clear explanations, all written in English. Also includes some notes on co…

Exam: Data Analysis and Forecasting

Faculty: Economics

From the course of Prof. Bee Marco

University: Università degli Studi di Trento

Publisher: martiterragnolo

Academic year: 2024-2025

12 pages


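Trend is a long-term increase or decrease in the data.

Seasonality is a regular fluctuation with a fixed, known period (e.g., the time of year or the day of the week).

Cycle is a pattern of rises and falls without a fixed frequency, often linked to economic factors.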

gg_season() > graph used to visualize time series data, plotting the data against the individual season or cycle

gg_season(name, labels = "both") +

labs(y = "$ (millions)",

title = "....")

gg_season(name, Demand, period = "day") + # period can be "day", "week" or "year"

theme(legend.position = "none") # removes the legend (optional)

Setting period helps focus on one pattern at a time (daily, weekly or yearly). It makes it easier to see trends and changes for that specific seasonality.

gg_subseries() > creates subseries plots to visualize individual seasonal cycles over time, highlighting variations across periods.

gg_subseries(name)

The blue horizontal lines indicate the mean for each season (for monthly data, each month). This form of plot enables the underlying seasonal pattern to be seen clearly, and also shows the changes in seasonality over time. It is especially useful in identifying changes within particular seasons.

ggplot() > used to create plots by mapping data to visual elements.

ggplot(data, aes(x = …, y = …)) +

geom_point() # Add scatter points

ggplot(visitors, aes(x = Quarter, y = Trips)) +

geom_line() + # add line plots

facet_grid(vars(State), scales = "free_y")

geom_point(): For scatter plots.

geom_line(): For line plots.

geom_bar(): For bar charts.

geom_histogram(): For histograms.

cor() > calculates the correlation coefficient between two numerical variables

cor(name$column1, name$column2)

cor(vic_elec_2014$Temperature, vic_elec_2014$Demand)

A result of 0.85 would mean that there is a strong positive correlation between Temperature and Demand in the year 2014.

A result of -0.4 would mean that there is a weak negative correlation.

A result of 0.05 would mean that there is almost no linear relationship between the variables.
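As a minimal sketch of what cor() computes, the block below uses small made-up vectors (hypothetical values, not taken from vic_elec) and checks the built-in result against the textbook formula for the Pearson coefficient.

```r
# Hypothetical toy data: seven temperature and demand readings
temperature <- c(15, 18, 21, 24, 27, 30, 33)
demand      <- c(4.1, 4.0, 4.3, 4.9, 5.6, 6.4, 7.5)

# Pearson correlation with the built-in function
r <- cor(temperature, demand)

# Same quantity from the definition:
# sum of cross-deviations divided by the product of the root sums of squares
dx <- temperature - mean(temperature)
dy <- demand - mean(demand)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

print(c(cor = r, manual = r_manual))
```

Both numbers should agree; a value close to +1 here reflects that demand rises almost steadily with temperature in this toy sample.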

ggpairs() > creates a pairwise scatterplot matrix to visualize relationships between multiple numeric variables in a dataset.

library(GGally)

ggpairs(data, columns, mapping, ...)

ggpairs(pivot_wider(dataset, values_from = Trips, names_from = State), columns = 2:9)

ggpairs(data, columns = c(2, 3, 5))

Range of the correlation coefficient:

+1 = perfect positive correlation (as one increases, the other increases).

-1 = perfect negative correlation (as one increases, the other decreases).

0 = no linear correlation.

Stars indicate significance:

*** = highly significant

** = moderately significant

* = weakly significant

gg_lag() > used to visualize the relationship between a time series variable and its lagged versions (e.g., comparing the values of a variable with its previous values over time).

library(feasts) # gg_lag() comes from feasts, not GGally

gg_lag(name, column, geom = "point") # geom can be "point" or "path"

If the series has a strong seasonal effect, the points at the seasonal lags (e.g., lags 4 and 8 for quarterly data) should be very close to the bisector (the diagonal line).

If your data contains daily or monthly data, you may want to extract the quarterly information first (using functions like quarter() and year() from lubridate).

…. <- filter(name, year(Quarter) >= 2000)

ACF() > computes the autocorrelation of a time series at different lags.

ACF(name, column, lag_max = 9)

autoplot(ACF(recent_production, Beer))

The two blue lines mark approximate 95% bounds at ±1.96/√T, where T is the length of the series.

Autocorrelation values inside the bounds suggest no significant correlation at that lag.

Autocorrelation values outside the bounds suggest that the observed autocorrelation is statistically significant and unlikely to be due to random noise.

Spikes at the seasonal lags (e.g., lags 4, 8, 12 for quarterly data) are due to seasonality.

To find the percentage, divide the number of spikes that extend beyond the blue lines (above or below) by the total number of spikes.

< 5%: the data is likely random, with no meaningful autocorrelation.

> 5% but not much: the result might be inconclusive, potentially due to random variations.

> 20%: significant autocorrelation, suggesting that there is a real, non-random relationship in the data across those lags.

Time series that show no autocorrelation are called white noise. For a white noise series, we expect each autocorrelation to be close to zero, and about 95% of the spikes in the ACF to lie within the blue lines (so fewer than about 5% fall outside).
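The percentage rule above can be sketched in base R. This is an illustration under stated assumptions, not the feasts workflow of the cheat sheet: it uses stats::acf() instead of ACF(), simulated white noise instead of real data, and the usual ±1.96/√T bounds.

```r
set.seed(1)                     # reproducible simulated data
n <- 200
y <- rnorm(n)                   # white noise by construction

# Base R autocorrelations; unlike feasts::ACF(), the result includes lag 0
acf_out <- acf(y, lag.max = 20, plot = FALSE)
r <- as.numeric(acf_out$acf)[-1]   # drop lag 0, keep lags 1 to 20

# Approximate 95% bounds (the blue lines in the plot)
bound <- 1.96 / sqrt(n)

# Share of spikes that extend beyond the bounds
share_outside <- mean(abs(r) > bound)
print(share_outside)
```

For a white noise series this share should stay small (around 5% on average); a share well above that points to real autocorrelation.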

features() > designed to extract features (i.e., summary statistics or transformations) from a time series, such as trends, seasonality, or optimal parameters for transformations.

guer <- features(aus_production, Gas, features = guerrero) # the guerrero feature suggests a Box-Cox lambda

lambda <- pull(guer, lambda_guerrero)

print(lambda)

box_cox() > applies the Box-Cox transformation to the Gas time series using the λ value.

box_cox(aus_production$Gas, lambda)

autoplot(name, box_cox(Gas, lambda))
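To make the transformation concrete, here is a minimal sketch of the Box-Cox formula written by hand. The function name box_cox_manual is hypothetical and exists only for illustration; it assumes the common definition (log when λ = 0, a scaled power otherwise) and is not the package implementation.

```r
# Hand-written Box-Cox transformation (illustrative helper, not from a package)
box_cox_manual <- function(y, lambda) {
  if (lambda == 0) {
    log(y)                                   # lambda = 0: natural logarithm
  } else {
    (sign(y) * abs(y)^lambda - 1) / lambda   # otherwise: scaled power transform
  }
}

y <- c(1, 4, 9, 16)
print(box_cox_manual(y, 0))     # logarithms of y
print(box_cox_manual(y, 0.5))   # square-root-type transform
print(box_cox_manual(y, 1))     # lambda = 1 only shifts the data by -1
```

A λ close to 1 leaves the shape of the series essentially unchanged, while a λ near 0 compresses large values, which is why it helps when the variation grows with the level.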

model() > allows you to create and compare multiple models on your dataset efficiently

model(name, model_name = model_function(Employed))

model(name, classical_decomposition(Employed, type = "additive")) # or type = "multiplicative"

Additive > for example, retail employment may have fixed seasonal peaks, but the size of the peak doesn’t change with the level of employment.

Multiplicative > for example, if retail employment tends to grow over time and the seasonal peaks get larger as the trend increases, a multiplicative model might be more appropriate.
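The additive idea can be checked with base R's stats::decompose(), used here as a stand-in for classical_decomposition() (an assumption made so the sketch runs without extra packages). The quarterly series is made up: a linear trend plus a fixed seasonal pattern.

```r
# Hypothetical quarterly series: linear trend plus a fixed seasonal pattern
x <- ts(1:24 + rep(c(5, -2, -4, 1), times = 6),
        frequency = 4, start = c(2015, 1))

# Classical additive decomposition from base R
dec <- decompose(x, type = "additive")

# In an additive model the components sum back to the data:
# data = trend + seasonal + remainder
recon <- dec$trend + dec$seasonal + dec$random
ok <- !is.na(recon)   # the moving-average trend is NA at both ends

print(dec$figure)     # estimated seasonal effect for each quarter
print(all.equal(as.numeric(recon[ok]), as.numeric(x[ok])))
```

With type = "multiplicative" the components multiply instead of adding, which suits series whose seasonal swings grow with the level.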

components() > takes a fitted decomposition (here called dcmp) and extracts the individual components of the decomposition, such as the trend, seasonal, and remainder components.

components(dcmp)

Now, you can manually specify which variable (Employed) to plot and add a trend line:

components_tsibble <- as_tsibble(components(dcmp))

autoplot(components_tsibble, Employed) +

geom_line(aes(y = trend), colour = "orange") # adds the trend as an orange line (any colour works, e.g. "blue")

or you can automatically plot all components (trend, seasonal, remainder) in separate panels:

autoplot(components(dcmp))

The grey bars to the left of each panel show the relative scales of the components: every bar represents the same length, drawn on the scale of its own panel.

A longer bar means that the component in that panel varies little compared with the data. A long bar on the remainder panel therefore indicates that the decomposition has captured most of the structure in the data.

A shorter bar means that the component's variation is large relative to the data. A short bar on the remainder panel indicates significant random or irregular fluctuations that are not explained by the trend or seasonal components.

slide_dbl() > designed for sliding window calculations over a numeric vector. It is particularly useful for moving averages, rolling sums, and other calculations applied to subsets of data.

library(slider)

slide_dbl(.x, .f, ..., .before = 0, .after = 0, .complete = FALSE)

slide_dbl(data, mean, .before = 2, .after = 2, .complete = TRUE) # centred 5-term moving average; incomplete windows give NA

For the orange line (assuming the moving average is stored in a column named `5-MA`):

autoplot(aus_exports, Exports) +

geom_line(aes(y = `5-MA`), colour = "orange")
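To see what a centred 5-term moving average does to the numbers, the sketch below reproduces it in base R with stats::filter() (an assumption: this replaces slide_dbl() so the example needs no extra package). The input vector is made up.

```r
x <- c(1, 2, 3, 4, 5, 6, 7)

# Centred 5-term moving average: equal weights of 1/5 on
# the two values before, the current value, and the two values after
ma5 <- as.numeric(stats::filter(x, rep(1 / 5, 5), sides = 2))

print(ma5)
```

The first two and last two values are NA because a full window of five observations is not available there, which mirrors the behaviour of .complete = TRUE.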

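STL() > Seasonal and Trend decomposition using Loess is a powerful and flexible method for decomposing a time series

model(name,

STL(Employed ~ trend(window = xxx) + season(window = xxx), robust = TRUE))

The trend window is the number of consecutive observations used to estimate the trend-cycle. A larger window helps capture long-term patterns by averaging a wider range of values; a smaller window lets the trend change more rapidly.

The seasonal window is the number of consecutive years used to estimate each value of the seasonal component; a smaller window lets the seasonal pattern change more rapidly over time. Setting season(window = "periodic") forces the seasonal pattern to be identical across years. Both windows should be odd numbers.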

features() > A feature is any numerical summary of the data, such as the mean, median, maximum, minimum, or others.

features(tourism, Trips, list(mean = mean))

This line tells R to calculate the mean of the Trips column from the tourism dataset.

arrange() > After calculating the mean, you might want to sort the result in ascending or descending order. arrange() allows you to do this by a specified column.

arrange(name, mean)

This line takes the result from features() (which includes the mean) and sorts it by the mean values.

model() > build statistical models for time series data

model_fit <- model(data, MODEL_TYPE(y ~ x))

library(fable)

fit <- model(name, TSLM(Employed ~ trend()))

report(fit)

Supports different types of models, such as:

● ARIMA (Auto-Regressive Integrated Moving Average)

● ETS (Exponential Smoothing State Space Model)

● STL (Seasonal-Trend Decomposition)

● TSLM (Time Series Linear Model).

accuracy() > when we build a model, we need to check how well it predicts the data before trusting it for forecasting

fit <- model(....)

accuracy(fit) # accuracy on the training data (fitted values)

accTable <- accuracy(beer_fc, recent_production) # accuracy of the forecasts against the observed data

select(accTable, .model, RMSE, MAE, MAPE)

Smaller values are better

# Example Output:

# ME RMSE MAE MPE MAPE

# -12.34 15.67 10.45 -1.2% 3.4%

MAE → Easy to understand and good for general errors.

RMSE → Highlights larger errors (use if big mistakes are more important).

MAPE → Best for percentage-based comparisons across datasets with different scales.
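The measures above can be computed by hand, which makes the differences between them easier to see. This is a minimal sketch with made-up actual and predicted values (hypothetical numbers, not output of accuracy()), using the standard definitions with error = actual minus prediction.

```r
# Hypothetical observed values and forecasts
actual    <- c(100, 200, 300)
predicted <- c(110, 190, 300)

e <- actual - predicted          # forecast errors

metrics <- c(
  ME   = mean(e),                        # mean error (bias)
  RMSE = sqrt(mean(e^2)),                # root mean squared error
  MAE  = mean(abs(e)),                   # mean absolute error
  MPE  = mean(e / actual) * 100,         # mean percentage error
  MAPE = mean(abs(e / actual)) * 100     # mean absolute percentage error
)

print(round(metrics, 2))
```

Here the positive and negative errors cancel in ME, while MAE and RMSE still report the size of the misses; RMSE is the larger of the two because squaring weights big errors more.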

forecast() > used to predict future values based on a fitted time series model

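fc <- forecast(fit, h = 12) # forecast the next 12 periods

forecast(fit, h = "3 years") # the horizon can also be given as text

autoplot(fc, name) # plot the forecasts together with the original data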

filter_index() > extract a subset of data for analysis or visualization.

filter_index(name, "1970 Q1" ~ "2004 Q4") # select data between the first quarter of 1970 and the fourth quarter of 2004

filter_index(name, "2010" ~ "2012") # extracts data for the years 2010, 2011, and 2012

model() > MEAN, NAIVE, SEASONAL NAIVE and DRIFT

mean_fit <- model(bricks, MEAN(Bricks))

tidy(mean_fit)

results_list <- mean_fit$`MEAN(Bricks)`[[1]]

mean_results <- results_list$fit

or all together like this: mean_results <- mean_fit$`MEAN(Bricks)`[[1]]$fit

Then we have to forecast and autoplot:

mean_fc <- forecast(mean_fit, h = 12)

bricks_mean = mutate(bricks,hline = mean_fc$.mean[1]) # add a dashed line

autoplot(mean_fc, bricks, level = NULL) +

autolayer(bricks_mean,hline,linetype='dashed',color='blue')

naive_fit <- model(bricks,NAIVE(Bricks))

naive_fc <- forecast(naive_fit, h = 12)

autoplot(naive_fc, bricks, level = NULL)

For naive forecasting, the forecast line is flat: the forecast for every future point equals the last observed value.

snaive_fit <- model(bricks,SNAIVE(Bricks ~ lag("year")))

snaive_fc <- forecast(snaive_fit, h = 12)

autoplot(snaive_fc, bricks, level = NULL)

This is useful for series with no trend but high seasonality

drift_fit <- model(bricks,RW(Bricks ~ drift()))

drift_fc <- forecast(drift_fit, h = 12)

autoplot(drift_fc, bricks, level = NULL)

This is useful for series with no seasonal effect but with a trend

We can fit all of them together:

model(name,

Mean = MEAN(Beer),

Naive = NAIVE(Beer), # equivalent to RW(Beer)

Seasonal_naive = SNAIVE(Beer),

Drift = RW(Beer ~ drift())

)
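The four benchmark methods are simple enough to write out by hand. The sketch below is an illustration in base R, not the fable implementation: the helper benchmark_forecasts and the input series are hypothetical, and only point forecasts are produced (no prediction intervals).

```r
# Point forecasts from the four benchmark methods
# y: numeric series, h: forecast horizon, m: seasonal period
benchmark_forecasts <- function(y, h, m) {
  n <- length(y)
  steps <- seq_len(h)
  list(
    mean   = rep(mean(y), h),                        # average of all observations
    naive  = rep(y[n], h),                           # last observed value
    snaive = y[n - m + ((steps - 1) %% m) + 1],      # same season, last full cycle
    drift  = y[n] + steps * (y[n] - y[1]) / (n - 1)  # line through first and last point
  )
}

# Hypothetical quarterly series: two years of data
y <- c(10, 20, 30, 40, 12, 22, 32, 42)
fc <- benchmark_forecasts(y, h = 6, m = 4)
print(fc)
```

The mean and naive forecasts are flat, the seasonal naive forecast repeats the last observed year, and the drift forecast extends the straight line joining the first and last observations.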

augment() > retrieves details about the model, including actual values, fitted values (.fitted) and residuals (.resid).

mean_fitted <- augment(beer_fit1) # store the result so it can be plotted

For visualisation:

ggplot(mean_fitted, aes(x = Quarter)) +

geom_line(aes(y = Beer),color='black') +

geom_line(aes(y = .fitted),color='red')

autoplot(mean_fitted,.vars = Beer) + #
