Business Intelligence Techniques — Exercises

destination of the edge). The edge set is therefore represented by a collection of lines of text.

Given a collection of lines describing the edge set (either coming from a single file or split among several files), we want to design a MapReduce system to produce a list of vertices, each associated with a pair of integers representing their indegree and outdegree.

For example, consider the set of lines on the left. The resulting mapping is shown on the right.

    lorem ipsum        lorem -> (0, 2)
    amet ipsum         ipsum -> (2, 1)
    ipsum sit          dolor -> (1, 2)
    dolor sit          sit   -> (2, 0)
    lorem dolor        amet  -> (1, 1)
    dolor amet

The corresponding graph is the following: [figure: directed graph on the vertices lorem, ipsum, dolor, amet, sit]

6.1) What are the domain and codomain of the Map and Reduce functions?

6.2) Write a pseudo-code implementation of the relevant functions.
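A minimal sketch in plain Python (the in-memory driver simulating the shuffle phase, and all names, are illustrative, not tied to any actual MapReduce framework): the Map function emits two pairs per edge, one for the source's outdegree and one for the destination's indegree, and the Reduce function sums the contributions per vertex.

```python
from collections import defaultdict

def map_edge(line):
    """Map: an input line 'src dst' emits (src, (0, 1)) for the
    outgoing edge and (dst, (1, 0)) for the incoming one."""
    src, dst = line.split()
    yield src, (0, 1)
    yield dst, (1, 0)

def reduce_degrees(vertex, pairs):
    """Reduce: component-wise sum of all (indegree, outdegree) contributions."""
    indeg = sum(i for i, _ in pairs)
    outdeg = sum(o for _, o in pairs)
    return vertex, (indeg, outdeg)

def run(lines):
    """Tiny driver simulating the shuffle phase of a MapReduce run."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_edge(line):
            groups[key].append(value)
    return dict(reduce_degrees(k, v) for k, v in groups.items())

edges = ["lorem ipsum", "amet ipsum", "ipsum sit",
         "dolor sit", "lorem dolor", "dolor amet"]
print(run(edges))
# lorem -> (0, 2), ipsum -> (2, 1), dolor -> (1, 2), sit -> (2, 0), amet -> (1, 1)
```

The Map domain is a line of text; both functions produce (vertex, (indegree, outdegree)) pairs, so partial counts and final counts share the same type.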

Exercise 7

A distributed filesystem contains a large number of measurements, each measurement being a real number between 0 and 1.

Describe a MapReduce implementation of an algorithm that computes a 100-bin histogram of this measurement set.

Hint — In other words, divide the range [0, 1] into 100 equally sized bins and for each bin tell how many measurements it contains.
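A sketch of the two functions in plain Python (the in-memory driver simulating the shuffle is illustrative): the Map function maps a measurement x to the bin key min(floor(100x), 99), so that x = 1.0 falls into the last bin, and the Reduce function sums the counts per bin.

```python
from collections import defaultdict

def map_measurement(x):
    """Map: a measurement x in [0, 1] emits (bin_index, 1);
    min() keeps x == 1.0 inside the last bin."""
    yield min(int(x * 100), 99), 1

def reduce_bin(b, counts):
    """Reduce: total number of measurements falling into bin b."""
    return b, sum(counts)

def histogram(measurements):
    groups = defaultdict(list)
    for x in measurements:
        for k, v in map_measurement(x):
            groups[k].append(v)
    return dict(reduce_bin(k, v) for k, v in groups.items())

print(histogram([0.005, 0.008, 0.41, 0.999, 1.0]))
# {0: 2, 41: 1, 99: 2}
```

Only the non-empty bins appear in the output; emitting the full 100-entry vector is a trivial post-processing step.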

Exercise 8

Let us define a 2-gram as a sequence of two alphabetic characters in the same word, with no distinction on

capitalization. For example, the phrase “I love *YOU*!” contains the following 2-grams: “lo”, “ov”, “ve”,

“yo”, “ou”.

Describe a MapReduce implementation that lists every 2-gram appearing in a set of documents together with

the number of times it occurs.
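A sketch in plain Python, assuming "alphabetic" means the ASCII letters matched by the character class [A-Za-z]; the Map function emits (2-gram, 1) pairs and the Reduce function sums the counts, exactly as in word counting.

```python
import re
from collections import defaultdict

def map_document(text):
    """Map: emit (2-gram, 1) for every pair of adjacent alphabetic
    characters inside the same word, lowercased."""
    for word in re.findall(r"[A-Za-z]+", text):
        w = word.lower()
        for i in range(len(w) - 1):
            yield w[i:i + 2], 1

def reduce_gram(gram, counts):
    return gram, sum(counts)

def count_2grams(documents):
    groups = defaultdict(list)
    for doc in documents:
        for k, v in map_document(doc):
            groups[k].append(v)
    return dict(reduce_gram(k, v) for k, v in groups.items())

print(count_2grams(["I love *YOU*!"]))
# {'lo': 1, 'ov': 1, 've': 1, 'yo': 1, 'ou': 1}
```

The sample phrase of the exercise reproduces exactly the five 2-grams listed in the text; the one-letter word "I" contributes nothing.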

Exercise 9

A set of invoices is stored in a distributed file system. Every invoice contains the name of the buyer followed by a list of items, quantities and total prices, each on a line. Here are four sample invoices.

    Alice                Bob                Charles
    Pencils 2 3.50       Erasers 1 4.55     Computers 1 650.00
    Erasers 5 5.25       Pencils 3 3.75     Pencils 2 3.25
    Computers 1 500.00

    Alice
    Erasers 2 3.50

A corporation wants to identify the “best buyer” for every kind of item, i.e., a list of distinct items, each

associated to the buyer who bought it at the lowest unit price, and the unit price itself. Given the sample

invoices, the list would be:

    Pencils     Bob     1.25
    Erasers     Alice   1.05
    Computers   Alice   500.00

Describe a MapReduce implementation of the requested functionality (at least the pseudocode of the relevant

functions and a description of all the related key and value pairs).

Solution — See the example BestPurchase.py in the course web page.
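BestPurchase.py itself is not reproduced here; the following is a hypothetical sketch of the relevant functions. The third number of each invoice line is read as the line's total price, so that the unit price is total/quantity — the only reading consistent with the sample answer (5.25 for 5 Erasers gives Alice's unit price of 1.05). Keys are item names; values are (unit price, buyer) pairs.

```python
from collections import defaultdict

def map_invoice(invoice):
    """Map: the first line names the buyer; each following line is
    'item quantity total'. Emit (item, (unit_price, buyer)) with
    unit_price = total / quantity."""
    lines = invoice.strip().splitlines()
    buyer = lines[0].strip()
    for line in lines[1:]:
        item, qty, total = line.split()
        yield item, (float(total) / int(qty), buyer)

def reduce_item(item, offers):
    """Reduce: keep the minimum unit price and the buyer who obtained it."""
    price, buyer = min(offers)
    return item, (buyer, round(price, 2))

def best_buyers(invoices):
    groups = defaultdict(list)
    for inv in invoices:
        for k, v in map_invoice(inv):
            groups[k].append(v)
    return dict(reduce_item(k, v) for k, v in groups.items())

invoices = [
    "Alice\nPencils 2 3.50\nErasers 5 5.25\nComputers 1 500.00",
    "Bob\nErasers 1 4.55\nPencils 3 3.75",
    "Charles\nComputers 1 650.00\nPencils 2 3.25",
    "Alice\nErasers 2 3.50",
]
print(best_buyers(invoices))
# {'Pencils': ('Bob', 1.25), 'Erasers': ('Alice', 1.05), 'Computers': ('Alice', 500.0)}
```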

Exercise 10

The following table, taken from a social network, reports the attitude of five users (u1, ..., u5) with respect to five messages (m1, ..., m5) posted on the network. "Y" means "Likes", "N" means "Doesn't like"; an empty entry means that a user hasn't expressed his/her opinion about the message.

         m1   m2   m3   m4   m5
    u1   Y    Y         Y    N
    u2   N    Y         N    N
    u3   Y              Y    Y
    u4   N    N    Y    N    Y
    u5        Y         N    Y

What methods can be used to decide whether user u3 likes or dislikes message m3?

Nota bene — This is an open question, there is no “correct” answer.
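One possible approach (among many) is user-based collaborative filtering: score m3 by the opinions of the users who rated it, weighted by how much each of them agrees with u3 on commonly rated messages. A sketch, on a hypothetical opinion table — the function names, the agreement measure and the data below are illustrative only:

```python
def similarity(a, b):
    """Agreement similarity between two users: +1 for every message they
    rate the same way, -1 for every message they rate differently."""
    common = set(a) & set(b)
    return sum(1 if a[m] == b[m] else -1 for m in common)

def predict(ratings, user, message):
    """Similarity-weighted vote of the other users on the target message."""
    score = 0
    for other, r in ratings.items():
        if other != user and message in r:
            score += similarity(ratings[user], r) * (1 if r[message] == "Y" else -1)
    return "Y" if score > 0 else "N"

# Hypothetical opinions ('Y'/'N' per message), just to exercise the code:
ratings = {
    "u1": {"m1": "Y", "m2": "Y", "m4": "Y", "m5": "N"},
    "u2": {"m1": "N", "m2": "Y", "m4": "N", "m5": "N"},
    "u3": {"m1": "Y", "m4": "Y", "m5": "Y"},
    "u4": {"m1": "N", "m2": "N", "m3": "Y", "m4": "N", "m5": "Y"},
    "u5": {"m2": "Y", "m3": "N", "m4": "N", "m5": "Y"},
}
print(predict(ratings, "u3", "m3"))
```

Other defensible choices: item-based similarity, Jaccard similarity on the "Likes" sets, or a latent-factor model as in the next exercise.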

Exercise 11

A recommender system contains ratings provided by four users on two items (ratings range from 1 to 5; empty entries correspond to unknown ratings):

             Item 1   Item 2
    User 1     3
    User 2                5
    User 3                1
    User 4     4          2

The system bases its recommendations on the following rating model:

    r_ij = 3 + u_i · v_j,

where r_ij is the rating assigned by user i = 1, ..., 4 to item j = 1, 2; u_1, ..., u_4 are four 2-dimensional user feature vectors, and v_1, v_2 are two 2-dimensional item feature vectors:

    u_1 = (1, 0)^T,  u_2 = (.5, 1)^T,  u_3 = (-.5, 0)^T,  u_4 = (0, -1)^T;
    v_1 = (0, x)^T,  v_2 = (y, 1)^T,

while x and y are as yet unknown model parameters.

11.1) Compute the values of x and y that minimize the root mean square error of the model with respect to the known ratings (also known as the least-square estimates of the model parameters x and y).

11.2) Does it make sense to recommend item 2 to user 1? Should we recommend item 1 to user 2 or 3?

Solution —

11.1) We want the values of x and y that minimize the sum of the squared differences between the model's predictions and the actual ratings, i.e., the values for x and y that minimize the following:

    RMSE(x, y) = Σ_{(i,j) s.t. r_ij is known} (3 + u_i · v_j − r_ij)².

Replacing all known values (the five terms correspond to (i,j) = (1,1), (4,1), (2,2), (3,2), (4,2)), we get:

    RMSE(x, y) = (3 − 3)² + (3 − x − 4)² + (3 + .5y + 1 − 5)² + (3 − .5y − 1)² + (3 − 1 − 2)².

In order to find the values that minimize the RMSE, let us compute the partial derivatives and equate them to zero:

    0 = ∂RMSE(x, y)/∂x = 2x + 2                            (only the (4,1) term depends on x),
    0 = ∂RMSE(x, y)/∂y = (.5y − 1) + (.5y − 2) = y − 3     (from the (2,2) and (3,2) terms).

Therefore, x = −1, y = 3.

Note that the value for x could also be determined by the only rating in the matrix that depends on it, r_41. However, the two ratings that depend on y don't provide the same result, and we need to minimize the squared errors as requested by the exercise.

11.2) We obtain r_12 = 3 + u_1 · v_2 = 3 + y = 6; therefore suggesting item 2 to user 1 is extremely advisable. On the other hand, with the given value for x, r_21 = 2 and r_31 = 3, and therefore user 3 is slightly more likely to appreciate item 1.
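As a numerical sanity check (not part of the original solution; NumPy assumed available), the least-squares estimates can be recovered by brute force, since the sum of squared errors is separable in x and y:

```python
import numpy as np

# User feature vectors and the known (user, item) -> rating entries
u = {1: np.array([1.0, 0.0]), 2: np.array([0.5, 1.0]),
     3: np.array([-0.5, 0.0]), 4: np.array([0.0, -1.0])}
known = {(1, 1): 3.0, (4, 1): 4.0, (2, 2): 5.0, (3, 2): 1.0, (4, 2): 2.0}

def sse(x, y):
    """Sum of squared errors of the model r_ij = 3 + u_i . v_j."""
    v = {1: np.array([0.0, x]), 2: np.array([y, 1.0])}
    return sum((3.0 + u[i] @ v[j] - r) ** 2 for (i, j), r in known.items())

# The SSE is separable (x and y appear in disjoint terms), so two
# independent 1-D grid searches suffice
grid = np.linspace(-5.0, 5.0, 1001)
x_best = min(grid, key=lambda x: sse(x, 0.0))
y_best = min(grid, key=lambda y: sse(0.0, y))
print(round(x_best, 6), round(y_best, 6))   # -1.0 3.0

r12 = 3.0 + u[1] @ np.array([y_best, 1.0])  # predicted rating of item 2 by user 1
print(round(r12, 6))                         # 6.0
```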

Exercise 12

The columns of the following matrix represent the coordinates of a set of documents in a TFIDF space:

    A = (1/√6) · [  2   0   2 ]
                 [  0   1   0 ]
                 [  2   1   2 ]
                 [  2  -1   2 ]

Let document similarity be defined by the cosine measure (dot product).

12.1) Compute the rank of matrix A.

12.2) Let q = (1, 3, 0, -2)^T be a query. Find the document in the set that best satisfies the query.

12.3) Given the matrices

    U = α · [  1   0 ]        V = (1/√2) · [ 1  0 ]
            [  0   1 ]                     [ 0  1 ]
            [  1   1 ]                     [ 1  0 ]
            [  1  -1 ]

determine the coefficient α and the diagonal matrix Σ so that U is column-orthonormal and A = U Σ V^T.

12.4) Project the query q onto the LSI space defined by this decomposition and verify the answer to question 12.2.

12.5) Suppose that we want to reduce the LSI space to one dimension. Show how the new approximate document similarities to q are computed.

Solution —

12.1) Notice that A has two linearly dependent (actually equal) columns (thus rk A < 3), while the first two columns are independent (thus rk A ≥ 2); therefore rk A = 2.

12.2) Similarities are computed by dot products; let's do it in a single shot for all documents:

    A^T q = (1/√6) · (-2, 5, -2)^T.

The most similar is document 2.

12.3) The column normality condition for matrix U implies α = 1/√3. By explicitly computing some entries of matrix A, we obtain

    Σ = [ 2  0 ]
        [ 0  1 ].

12.4) Projection onto the document LSI space is achieved via U^T:

    q̂ = U^T q = (1/√3) · (-1, 5)^T.

Similarity to the documents is computed via the V Σ matrix. If all computations are right,

    V Σ q̂ = A^T q.
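A quick NumPy check of the decomposition, the query similarities and the projection identity (a sketch, not part of the original solution; the matrices are entered with one consistent sign convention):

```python
import numpy as np

U = np.array([[1, 0], [0, 1], [1, 1], [1, -1]]) / np.sqrt(3)
Sigma = np.diag([2.0, 1.0])
V = np.array([[1, 0], [0, 1], [1, 0]]) / np.sqrt(2)

A = U @ Sigma @ V.T                  # reconstruct the term-document matrix
q = np.array([1.0, 3.0, 0.0, -2.0])  # the query of question 12.2

sims = A.T @ q                       # document similarities (dot products)
print(np.round(sims * np.sqrt(6)))   # numerators over sqrt(6): [-2.  5. -2.]

q_hat = U.T @ q                      # query projected into the 2-D LSI space
print(np.allclose(V @ Sigma @ q_hat, sims))  # True: V.Sigma.q_hat == A^T.q
```

Reducing to one dimension (question 12.5) amounts to keeping only the first column of U and V and the singular value 2 in the same computation.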

Exercise 13

The singular value decomposition of a term-document matrix is

    A = U Σ V^T

with

    U = (1/√2) · [ 1   1   0 ]    Σ = [ 3  0  0 ]    V = (1/√3) · [ 1   0   1 ]
                 [ 1  -1   0 ]        [ 0  2  0 ]                 [ 1   1  -1 ]
                 [ 0   0  √2 ]        [ 0  0  1 ]                 [ 0   1   1 ]
                                                                  [ 1  -1   0 ]

13.1) What is the rank of the matrix A?

13.2) Perform a reduction of the LSI space by one dimension.

13.3) Given the new representation of matrix Â, apply an agglomerative clustering procedure to the collection. Merge the clusters at each step according to the self-similarity measure, by using as a measure of inter-document similarity simply the dot product ⟨d_1, d_2⟩.

13.4) Draw the resulting dendrogram. How many clusters can you find at a level of similarity of 2?

13.5) Check the clustering results you get by cutting across the dendrogram, by plotting them.

Solution —

13.1) Matrix A was originally a 3 × 4 matrix. The three elements in the diagonal matrix Σ are non-null, therefore matrix A^T A (hence, matrix A) has full rank (3).

13.2) Let us obtain Û, Σ̂ and V̂ by removing the third column from U and V, and the third row and column from Σ, corresponding to the smallest eigenvalue of A^T A:

    Û = (1/√2) · [ 1   1 ]    Σ̂ = [ 3  0 ]    V̂ = (1/√3) · [ 1   0 ]
                 [ 1  -1 ]         [ 0  2 ]                 [ 1   1 ]
                 [ 0   0 ]                                  [ 0   1 ]
                                                            [ 1  -1 ]

13.3) After rank reduction, we can compute the similarity by

    Ŝ = V̂ Σ̂² V̂^T = (1/3) · [ 1   0 ]  [ 9  0 ]  [ 1  1  0   1 ]
                            [ 1   1 ]  [ 0  4 ]  [ 0  1  1  -1 ]
                            [ 0   1 ]
                            [ 1  -1 ]

Therefore, we obtain the following table of unnormalized dot-product similarities:

           2      3      4
    1      3      0      3
    2            4/3    5/3
    3                  -4/3

13.4) {1, 2} and {1, 4} are both candidates as the first cluster. Let us choose the first pair. Therefore, at level 3 the first clustering step yields

               3       4
    {1, 2}   13/9    23/9
    3                -4/3

Now, the highest self-similarity value is achieved by cluster {1, 2, 4} at level 23/9, so that the similarity matrix becomes

                  3
    {1, 2, 4}   23/18

Therefore, at similarity level 2 we have two clusters: {1, 2, 4} and {3}.

13.5) The corresponding dendrogram is the following: [figure: dendrogram on the leaves 1, 2, 4, 3, with merges at levels 3, 23/9 and 23/18; cutting at similarity level 2 separates {1, 2, 4} from {3}]
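As a check (not part of the original solution), the merge levels can be reproduced in Python, using the rows of V̂ up to the 1/√3 factor and taking the self-similarity of a cluster as the average pairwise dot-product similarity over its distinct document pairs — the measure that reproduces the levels 3, 23/9 and 23/18:

```python
import numpy as np
from itertools import combinations

W = np.array([[1, 0], [1, 1], [0, 1], [1, -1]], dtype=float)  # sqrt(3) * V_hat
S = W @ np.diag([9.0, 4.0]) @ W.T / 3.0                       # V_hat Sigma^2 V_hat^T

def self_sim(cluster):
    """Self-similarity: average similarity over distinct document pairs."""
    pairs = list(combinations(cluster, 2))
    return sum(S[i, j] for i, j in pairs) / len(pairs)

clusters = [{0}, {1}, {2}, {3}]   # documents 1..4, zero-based
levels = []
while len(clusters) > 1:
    # merge the pair of clusters whose union has the highest self-similarity
    a, b = max(combinations(clusters, 2), key=lambda p: self_sim(p[0] | p[1]))
    levels.append(self_sim(a | b))
    clusters = [c for c in clusters if c not in (a, b)] + [a | b]

print([round(l, 4) for l in levels])   # [3.0, 2.5556, 1.2778]
```

The tie at the first step ({1,2} vs {1,4}, both at level 3) is broken by taking the first pair, as in the solution.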

Exercise 14

Consider the set of offices:

    T = {Post office, Bank, Police station, Students office, Library}.

The following table defines their opening hours:

Office Opens at Closes at

Post office 9am 5:30pm

Bank 8:30am 12:30pm

Police station 8am 7pm

Students office 9am 12pm

Library 10am 4:30pm

The similarity sim(a, b) between two offices a, b ∈ T is defined by the amount of time a and b are both open during the day (i.e., how much their opening hours overlap), while the similarity between two sets A, B ⊆ T is the minimum pairwise similarity (complete linkage):

    sim(A, B) = min_{a ∈ A, b ∈ B} sim(a, b).

14.1) Apply a hierarchical agglomerative clustering procedure to T based on the similarity defined above.

14.2) Draw the resulting dendrogram.
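A sketch of the procedure in Python, encoding the opening hours as fractional hours from midnight; the printed list gives the dendrogram's merge levels (the greedy merge order below is one consistent choice):

```python
from itertools import combinations

# Opening hours encoded as (open, close) in fractional hours from midnight
hours = {
    "Post office":     (9.0, 17.5),
    "Bank":            (8.5, 12.5),
    "Police station":  (8.0, 19.0),
    "Students office": (9.0, 12.0),
    "Library":         (10.0, 16.5),
}

def sim(a, b):
    """Overlap, in hours, of the opening intervals of offices a and b."""
    (o1, c1), (o2, c2) = hours[a], hours[b]
    return max(0.0, min(c1, c2) - max(o1, o2))

def complete_linkage(A, B):
    """Similarity between two clusters: the minimum pairwise similarity."""
    return min(sim(a, b) for a in A for b in B)

levels = []                           # merge levels, for the dendrogram
clusters = [frozenset([o]) for o in hours]
while len(clusters) > 1:
    x, y = max(combinations(clusters, 2),
               key=lambda p: complete_linkage(p[0], p[1]))
    levels.append(complete_linkage(x, y))
    clusters = [c for c in clusters if c not in (x, y)] + [x | y]

print(levels)   # [8.5, 6.5, 3.0, 2.0]
```

The first merge joins the Post office and the Police station (8.5 hours of overlap), then the Library joins them at 6.5, the Bank and the Students office merge at 3, and the two clusters finally meet at level 2.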

Exercise 15

The following matrix represents the number of occurrences of two terms in each of three documents:

    A = [ 1  3  2 ]
        [ 0  4  2 ]

We are interested in the term distribution and covariance.

15.1) Compute the mean count of each term and the term covariance matrix. What’s the rank of t
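A sketch of the computation with NumPy; dividing by n (population covariance) is an assumption — with n − 1 the entries scale but the rank conclusion is unchanged:

```python
import numpy as np

A = np.array([[1.0, 3.0, 2.0],
              [0.0, 4.0, 2.0]])     # rows = terms, columns = documents

means = A.mean(axis=1)              # mean count of each term
D = A - means[:, None]              # per-document deviations from the mean
C = D @ D.T / A.shape[1]            # population covariance over the 3 documents

print(means)                        # [2. 2.]
print(C)                            # [[2/3 4/3], [4/3 8/3]]
print(np.linalg.matrix_rank(C))     # 1: term-2 deviations are twice term-1's
```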

Details

Academic year 2015-2016 — 13 pages
SSD: Industrial and information engineering, ING-INF/05 (Information processing systems)

The contents of this page are the publisher tito1992's personal reworking of information learned by attending the Business Intelligence Techniques lectures, together with independent study of any reference books, in preparation for the final exam or thesis. They are not to be understood as official material of the Università degli Studi di Trento or of prof. Mauro Brunato.