destination of the edge). The edge set is therefore represented by a collection of lines of text.
Given a collection of lines describing the edge set (either coming from a single file or split among several files), we want to design a MapReduce system to produce a list of vertices, each associated with a pair of integers representing their indegree and outdegree.
For example, consider the set of lines on the left. The resulting mapping is shown on the right.

    lorem ipsum        lorem ↦ (0, 2)
    amet ipsum         ipsum ↦ (2, 1)
    ipsum sit          dolor ↦ (1, 2)
    dolor sit          sit   ↦ (2, 0)
    lorem dolor        amet  ↦ (1, 1)
    dolor amet
The corresponding graph is the following:

    [figure: the directed graph on the vertices lorem, ipsum, dolor, sit, amet with the six edges listed above]
6.1) What are the domain and codomain of the Map and Reduce functions?
6.2) Write a pseudo-code implementation of the relevant functions.
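A minimal sketch of one possible answer, assuming a generic framework in which Map receives one line of text and Reduce receives a key together with the list of all values emitted for that key:

    # Map: for each edge "src dst", emit one partial degree count per endpoint.
    # Key: vertex name; value: (indegree contribution, outdegree contribution).
    def map(line):
        src, dst = line.split()
        yield (src, (0, 1))   # src gains one outgoing edge
        yield (dst, (1, 0))   # dst gains one incoming edge

    # Reduce: sum the contributions received for each vertex.
    def reduce(vertex, pairs):
        indegree = sum(i for i, _ in pairs)
        outdegree = sum(o for _, o in pairs)
        yield (vertex, (indegree, outdegree))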
Exercise 7
A distributed filesystem contains a large number of measurements, each measurement being a real number between 0 and 1.
Describe a MapReduce implementation of an algorithm that computes a 100-bin histogram of this measurement
set.
Hint — In other words, divide the range [0, 1] into 100 equally sized bins and for each bin tell how many measurements it contains.
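A minimal sketch under the same assumptions as above (here each Map call is assumed to receive one measurement):

    # Map: send each measurement to the index of its bin.
    def map(measurement):
        x = float(measurement)
        b = min(int(x * 100), 99)   # clamp so that x == 1.0 falls in bin 99
        yield (b, 1)

    # Reduce: count the measurements that fell into each bin.
    def reduce(bin_index, counts):
        yield (bin_index, sum(counts))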
Exercise 8
Let us define a 2-gram as a sequence of two alphabetic characters in the same word, ignoring capitalization. For example, the phrase “I love *YOU*!” contains the following 2-grams: “lo”, “ov”, “ve”,
“yo”, “ou”.
Describe a MapReduce implementation that lists every 2-gram appearing in a set of documents together with
the number of times it occurs.
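A minimal sketch, again assuming that Map is called once per line of a document:

    import re

    # Map: extract lowercase 2-grams from every word of the line.
    def map(line):
        for word in re.findall(r"[a-z]+", line.lower()):
            for i in range(len(word) - 1):
                yield (word[i:i+2], 1)

    # Reduce: total the occurrences of each 2-gram.
    def reduce(gram, counts):
        yield (gram, sum(counts))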
Exercise 9
A set of invoices is stored in a distributed file system. Every invoice contains the name of the buyer followed by a list of items, quantities and total prices, each on a line. Here are four sample invoices.
    Alice                 Bob               Charles               Alice
    Pencils 2 3.50        Erasers 1 4.55    Computers 1 650.00    Erasers 2 3.50
    Erasers 5 5.25        Pencils 3 3.75    Pencils 2 3.25
    Computers 1 500.00
A corporation wants to identify the “best buyer” for every kind of item, i.e., a list of distinct items, each associated with the buyer who bought it at the lowest unit price, and the unit price itself. Given the sample invoices, the list would be:

    Pencils    Bob     1.25
    Erasers    Alice   1.05
    Computers  Alice   500.00
Describe a MapReduce implementation of the requested functionality (at least the pseudocode of the relevant
functions and a description of all the related key and value pairs).
Solution — See the example BestPurchase.py on the course web page.
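The idea, as a hedged sketch (this is not necessarily the code of BestPurchase.py; it assumes that each Map call receives one whole invoice as a string):

    # Map: key = item; value = (buyer, unit price) for every purchase line.
    def map(invoice):
        lines = invoice.splitlines()
        buyer = lines[0]
        for line in lines[1:]:
            item, quantity, price = line.split()
            yield (item, (buyer, float(price) / int(quantity)))

    # Reduce: keep, for each item, the offer with the lowest unit price.
    def reduce(item, offers):
        buyer, unit_price = min(offers, key=lambda offer: offer[1])
        yield (item, (buyer, unit_price))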
Exercise 10
The following table, taken from a social network, reports the attitude of five users ($u_1, \dots, u_5$) with respect to five messages ($m_1, \dots, m_5$) posted on the network. “Y” means “Likes”, “N” means “Doesn’t like”; an empty entry means that a user hasn’t expressed his/her opinion about the message.
          m1   m2   m3   m4   m5
    u1     Y    Y         Y    N
    u2     N    Y         N    N
    u3     Y    Y         Y
    u4     N    N    Y    N    Y
    u5     Y         N         Y
What methods can be used to decide whether user $u_3$ likes or dislikes message $m_3$?
Nota bene — This is an open question, there is no “correct” answer.
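One applicable family of methods is collaborative filtering. As a minimal sketch, user-based collaborative filtering with a simple agreement-count similarity; the ±1 encoding of Y/N and the similarity function are illustrative choices, and the ratings dictionary merely transcribes the table above:

    # Ratings: +1 = "Y", -1 = "N"; a missing key = no opinion expressed.
    ratings = {
        "u1": {"m1": 1, "m2": 1, "m4": 1, "m5": -1},
        "u2": {"m1": -1, "m2": 1, "m4": -1, "m5": -1},
        "u3": {"m1": 1, "m2": 1, "m4": 1},
        "u4": {"m1": -1, "m2": -1, "m3": 1, "m4": -1, "m5": 1},
        "u5": {"m1": 1, "m3": -1, "m5": 1},
    }

    def sim(u, v):
        # Agreement count on the messages both users rated.
        common = ratings[u].keys() & ratings[v].keys()
        return sum(ratings[u][m] * ratings[v][m] for m in common)

    # Predict u3's attitude towards m3 as a similarity-weighted vote
    # among the users who did rate m3.
    score = sum(sim("u3", v) * r["m3"]
                for v, r in ratings.items() if v != "u3" and "m3" in r)
    print("likes m3" if score > 0 else "dislikes m3", score)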
Exercise 11
A recommender system contains ratings provided by four users on two items (ratings range from 1 to 5; empty entries correspond to unknown ratings):

              Item 1   Item 2
    User 1      3
    User 2               5
    User 3               1
    User 4      4        2
The system bases its recommendations on the following rating model:

$$r_{ij} = 3 + u_i \cdot v_j,$$

where $r_{ij}$ is the rating assigned by user $i = 1, \dots, 4$ to item $j = 1, 2$; $u_1, \dots, u_4$ are four 2-dimensional user feature vectors, and $v_1, v_2$ are two 2-dimensional item feature vectors:

$$u_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix},\; u_2 = \begin{pmatrix} .5 \\ 1 \end{pmatrix},\; u_3 = \begin{pmatrix} -.5 \\ 0 \end{pmatrix},\; u_4 = \begin{pmatrix} 0 \\ -1 \end{pmatrix};\qquad v_1 = \begin{pmatrix} 0 \\ x \end{pmatrix},\; v_2 = \begin{pmatrix} y \\ 1 \end{pmatrix},$$
while $x$ and $y$ are as yet unknown model parameters.
11.1) Compute the values of $x$ and $y$ that minimize the root mean square error of the model with respect to the known ratings (also known as the least-square estimates of the model parameters $x$ and $y$).
11.2) Does it make sense to recommend item 2 to user 1? Should we recommend item 1 to user 2 or 3?
Solution —
11.1) We want the values of $x$ and $y$ that minimize the sum of the squared differences between the model's predictions and the actual ratings, i.e., the values of $x$ and $y$ that minimize the following:

$$\mathrm{RMSE} = \sum_{i,j \text{ s.t. } r_{ij} \text{ is known}} (3 + u_i \cdot v_j - r_{ij})^2.$$

Replacing all values, we get:

$$\mathrm{RMSE}(x, y) = \underbrace{(3 - 3)^2}_{i=1,\,j=1} + \underbrace{(3 - x - 4)^2}_{i=4,\,j=1} + \underbrace{(3 + .5y + 1 - 5)^2}_{i=2,\,j=2} + \underbrace{(3 - .5y - 1)^2}_{i=3,\,j=2} + \underbrace{(3 - 1 - 2)^2}_{i=4,\,j=2}.$$
In order to find the values that minimize the RMSE, let us compute the partial derivatives and equate them to zero:

$$0 = \frac{\partial\,\mathrm{RMSE}(x, y)}{\partial x} = \underbrace{2x + 2}_{i=4,\,j=1}$$

$$0 = \frac{\partial\,\mathrm{RMSE}(x, y)}{\partial y} = \underbrace{.5y - 1}_{i=2,\,j=2} + \underbrace{.5y - 2}_{i=3,\,j=2} = y - 3.$$

Therefore, $x = -1$, $y = 3$.
Note that the value of $x$ could also be determined by the only rating in the matrix that depends on it, $r_{41}$. However, the two ratings that depend on $y$ don't provide the same result, and we need to minimize the squared errors as requested by the exercise.
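As a quick numerical check of this result, a sketch using numpy (the vectors and known ratings are those defined above):

    import numpy as np

    # Known ratings as (i, j, r_ij), with 0-based indices into U and V.
    ratings = [(0, 0, 3), (3, 0, 4), (1, 1, 5), (2, 1, 1), (3, 1, 2)]

    def sse(x, y):
        U = np.array([[1, 0], [.5, 1], [-.5, 0], [0, -1]])
        V = np.array([[0, x], [y, 1]])
        return sum((3 + U[i] @ V[j] - r) ** 2 for i, j, r in ratings)

    # Scan a grid around the analytic minimum at x = -1, y = 3.
    grid = [(sse(x, y), x, y)
            for x in np.linspace(-3, 1, 41) for y in np.linspace(1, 5, 41)]
    print(min(grid))   # approximately (0.5, -1.0, 3.0): residual error 0.5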
11.2) We obtain $r_{12} = 6$, therefore suggesting item 2 to user 1 is extremely advisable. On the other hand, with the given value for $x$ we get $r_{21} = 2$ and $r_{31} = 3$, therefore user 3 is slightly more likely to appreciate item 1.
Exercise 12
The columns of the following matrix represent the coordinates of a set of documents in a TFIDF space:

$$A = \frac{1}{\sqrt{6}} \begin{pmatrix} 2 & 0 & 2 \\ 0 & 1 & 0 \\ 2 & 1 & 2 \\ 2 & -1 & 2 \end{pmatrix}$$
Let document similarity be defined by the cosine measure (dot product).
12.1) Compute the rank of matrix A.
12.2) Let $q = (1, 3, 0, -2)^T$ be a query. Find the document in the set that best satisfies the query.
12.3) Given the matrices

$$U = \alpha \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \\ 1 & -1 \end{pmatrix}, \qquad V = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \end{pmatrix}$$

determine the coefficient $\alpha$ and the diagonal matrix $\Sigma$ so that $U$ is column-orthonormal and $A = U \Sigma V^T$.
12.4) Project the query $q$ onto the LSI space defined by this decomposition and verify the answer to question 12.2.
12.5) Suppose that we want to reduce the LSI space to one dimension. Show how the new approximate document similarities to $q$ are computed.
Solution —
12.1) Notice that $A$ has two linearly dependent (actually equal) columns (thus $\mathrm{rk}\,A < 3$), while the first two columns are independent (thus $\mathrm{rk}\,A \ge 2$); therefore $\mathrm{rk}\,A = 2$.
12.2) Similarities are computed by dot products; let's do it in a single shot for all documents:

$$A^T q = \frac{1}{\sqrt{6}} \begin{pmatrix} -2 \\ 5 \\ -2 \end{pmatrix};$$

The most similar is document 2.
12.3) The column-normality condition for matrix $U$ implies $\alpha = 1/\sqrt{3}$. By writing out the computation of a few entries of matrix $A$, we obtain

$$\Sigma = \begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix}.$$
12.4) Projection onto the document LSI space is achieved via $U^T$:

$$\hat{q} = U^T q = \frac{1}{\sqrt{3}} \begin{pmatrix} -1 \\ 5 \end{pmatrix}.$$

Similarity to the documents is computed via the $V \Sigma$ matrix. If all computations are right,

$$V \Sigma \hat{q} = A^T q.$$
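A numerical check of these computations, as a sketch using numpy (with the matrices and query as reconstructed above):

    import numpy as np

    A = np.array([[2, 0, 2], [0, 1, 0], [2, 1, 2], [2, -1, 2]]) / np.sqrt(6)
    q = np.array([1, 3, 0, -2])

    print(np.linalg.matrix_rank(A))      # 2
    print(np.sqrt(6) * A.T @ q)          # approx. [-2. 5. -2.]: document 2 wins

    U = np.array([[1, 0], [0, 1], [1, 1], [1, -1]]) / np.sqrt(3)
    S = np.diag([2.0, 1.0])
    V = np.array([[1, 0], [0, 1], [1, 0]]) / np.sqrt(2)
    print(np.allclose(U @ S @ V.T, A))           # True: A = U Sigma V^T
    q_hat = U.T @ q                              # projected query
    print(np.allclose(V @ S @ q_hat, A.T @ q))   # True: V Sigma q_hat = A^T q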
Exercise 13
The singular value decomposition of a term-document matrix is $A = U \Sigma V^T$, with

$$U = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 & 0 \\ 1 & -1 & 0 \\ 0 & 0 & \sqrt{2} \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} 3 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 2 \end{pmatrix}, \qquad V = \frac{1}{\sqrt{3}} \begin{pmatrix} 1 & 0 & 1 \\ 1 & 1 & -1 \\ 0 & 1 & 1 \\ 1 & -1 & 0 \end{pmatrix}.$$
13.1) What is the rank of the matrix A?
13.2) Perform a reduction of the LSI space by one dimension.
13.3) Given the new representation $\hat{A}$ of the matrix, apply an agglomerative clustering procedure to the collection. Merge the clusters at each step according to the self-similarity measure, using as the measure of inter-document similarity simply the dot product $\langle d_1, d_2 \rangle$.
13.4) Draw the resulting dendrogram. How many clusters can you find at a level of similarity of 2?
13.5) Check the clustering results you get by cutting across the dendrogram, by plotting them.
Solution —
13.1) Matrix $A$ was originally a $3 \times 4$ matrix. The three elements in the diagonal matrix $\Sigma$ are non-null, therefore matrix $A^T A$ (hence, matrix $A$) has full rank (3).
13.2) Let us obtain $\hat{U}$, $\hat{V}$ and $\hat{\Sigma}$ by removing the third column from $U$ and $V$, and the third row and column from $\Sigma$, corresponding to the smallest eigenvalue of $A^T A$:

$$\hat{U} = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \\ 0 & 0 \end{pmatrix}, \qquad \hat{\Sigma} = \begin{pmatrix} 3 & 0 \\ 0 & 2 \end{pmatrix}, \qquad \hat{V} = \frac{1}{\sqrt{3}} \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 0 & 1 \\ 1 & -1 \end{pmatrix}.$$
13.3) After rank reduction, we can compute the similarities by

$$\hat{A}^T \hat{A} = \hat{V} \hat{\Sigma}^2 \hat{V}^T = \frac{1}{3} \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 0 & 1 \\ 1 & -1 \end{pmatrix} \begin{pmatrix} 9 & 0 \\ 0 & 4 \end{pmatrix} \begin{pmatrix} 1 & 1 & 0 & 1 \\ 0 & 1 & 1 & -1 \end{pmatrix} = \frac{1}{3} \begin{pmatrix} 9 & 9 & 0 & 9 \\ 9 & 13 & 4 & 5 \\ 0 & 4 & 4 & -4 \\ 9 & 5 & -4 & 13 \end{pmatrix}.$$

Therefore, we obtain the following table of unnormalized dot-product similarities:

           2      3      4
    1      3      0      3
    2            4/3    5/3
    3                  -4/3
13.4) $\{1, 2\}$ and $\{1, 4\}$ are both candidates as the first cluster. Let us choose the first pair. Therefore, at level 3 the first clustering step yields

                3       4
    {1, 2}    13/9    23/9
    3                 -4/3

Now, the highest self-similarity value is achieved by cluster $\{1, 2, 4\}$ at level $23/9$, so that the similarity matrix becomes

                   3
    {1, 2, 4}    23/18

Therefore, at similarity level 2 we have two clusters: $\{1, 2, 4\}$ and $\{3\}$.
13.5) The corresponding dendrogram is:

    [dendrogram: leaves 1, 2, 4, 3; merges at similarity levels 3, 23/9, and 23/18; cutting at level 2 yields the clusters {1, 2, 4} and {3}]
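As a cross-check of the merge sequence, a short sketch (numpy; documents are 0-based here, so documents 1–4 above are indices 0–3; self-similarity is taken as the mean pairwise similarity within a cluster, which reproduces the levels above):

    import numpy as np
    from itertools import combinations

    V_hat = np.array([[1, 0], [1, 1], [0, 1], [1, -1]]) / np.sqrt(3)
    S_hat = np.diag([3.0, 2.0])
    sim = V_hat @ S_hat @ S_hat @ V_hat.T   # reduced-rank dot products

    def self_sim(cluster):
        # Self-similarity: mean pairwise similarity within the cluster.
        pairs = list(combinations(sorted(cluster), 2))
        return sum(sim[a, b] for a, b in pairs) / len(pairs)

    clusters = [{0}, {1}, {2}, {3}]
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2),
                   key=lambda p: self_sim(p[0] | p[1]))
        clusters.remove(a); clusters.remove(b); clusters.append(a | b)
        print(sorted(a | b), round(self_sim(a | b), 3))
    # Merges: [0, 1] at 3.0, [0, 1, 3] at 2.556 (23/9), all at 1.278 (23/18)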
Exercise 14
Consider the set of offices:

    T = {Post office, Bank, Police station, Students office, Library}.
The following table defines their opening hours:
    Office            Opens at   Closes at
    Post office       9am        5:30pm
    Bank              8:30am     12:30pm
    Police station    8am        7pm
    Students office   9am        12pm
    Library           10am       4:30pm
The similarity $\mathrm{sim}(a, b)$ between two offices $a, b \in T$ is defined by the amount of time $a$ and $b$ are both open during the day (i.e., how much their opening hours overlap), while the similarity between two sets $A, B \subseteq T$ is the minimum pairwise similarity (complete linkage):

$$\mathrm{sim}(A, B) = \min_{a \in A,\, b \in B} \mathrm{sim}(a, b).$$
14.1) Apply a hierarchical agglomerative clustering procedure to $T$ based on the similarity defined above.
14.2) Draw the resulting dendrogram.
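A small sketch for computing the pairwise similarities (overlap in minutes; the clustering itself is what the exercise asks for):

    from itertools import combinations

    # Opening hours as (opens, closes), in minutes after midnight.
    hours = {
        "Post office":     (9 * 60,      17 * 60 + 30),
        "Bank":            (8 * 60 + 30, 12 * 60 + 30),
        "Police station":  (8 * 60,      19 * 60),
        "Students office": (9 * 60,      12 * 60),
        "Library":         (10 * 60,     16 * 60 + 30),
    }

    def sim(a, b):
        # Overlap of the two opening intervals, clamped at zero.
        (o1, c1), (o2, c2) = hours[a], hours[b]
        return max(0, min(c1, c2) - max(o1, o2))

    for a, b in combinations(hours, 2):
        print(f"{a} / {b}: {sim(a, b) / 60:.2f} h")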
Exercise 15
The following matrix represents the number of occurrences of two terms in each of three documents:

$$A = \begin{pmatrix} 1 & 3 & 2 \\ 0 & 4 & 2 \end{pmatrix}.$$
We are interested in the term distribution and covariance.
15.1) Compute the mean count of each term and the term covariance matrix. What’s the rank of t