Notes on Systems and Methods for Big and Unstructured Data

BEGIN BATCH
<insert_statement>;
<update_statement>;
<delete_statement>;
APPLY BATCH;

5.3.16 CAPTURE

When the amount of data within a database grows, it can be really tough to visualize it within a terminal. Fortunately, Cassandra provides us with a few commands to overcome this problem. The CAPTURE command takes the path of the folder in which to store the results and the name of the file, and redirects query output there.

CAPTURE '/Program Files/Cassandra/Outputs/output.txt'

To interrupt the CAPTURE you can run the following command.

CAPTURE off

5.3.17 EXPAND

The EXPAND command provides extended outputs within the console when performing queries.

It must be executed before the query to enable it.

EXPAND on

To interrupt the EXPAND you can run the following command.

EXPAND off

5.3.18 SOURCE

The SOURCE command allows you to run queries from textual files. The command accepts the path to the file with the query.

SOURCE 'D:/Program Files/Cassandra/Queries/query_1.txt'

5.3.19 Data Types

Cassandra supports many different data types, like text, varint, float, double, boolean, etc.

In particular, it supports two special data types:

• Collections

• User-defined data types

Collections are easy to define and update:

CREATE TABLE test(email list<text>, ...)

UPDATE test SET email = email + ['new@email.com'] WHERE ...

5.3.20 User-defined Data Types

User-defined data types require defining the type before use:

CREATE TYPE <type_name> (
    <column_name> <column_type>,
    ...
);

To verify the type creation, use the DESCRIBE TYPE <type_name> command. User-defined types support ALTER and DROP operations.

6 Elasticsearch

6.1 Information retrieval databases

In information retrieval databases we will talk about the so-called ELK stack. "ELK" is the acronym of Elasticsearch, Logstash and Kibana. The idea of the ELK stack is that of providing a technology for storing, searching and analysing data from a perspective that is not that of database technologies, but of search engines. So, these technologies are based on search-engine approaches instead of database approaches.

Elasticsearch historically came first, but then around it two other levels were created, Logstash and Kibana. Elasticsearch is the core engine, providing storage and search capabilities, and the two added levels fit in as follows:

1. The data comes in from Logstash.

2. It gets stored and managed in Elasticsearch.

3. The Kibana level is the output of the data: data visualization, exploration dashboards, diagrams, whatever comes out of the data. Basically, this is the level you use for building data visualization and data interaction over the data.

6.2 Elasticsearch

The idea of Elasticsearch is that of a search engine built for creating custom search engines for enterprise use. You deploy it internally in your company, you build the data storage for your documents and then you can search over that. You don't put this on the web.

This search engine slowly incorporated capabilities that are typical of databases: storage and querying data (not only as a search engine, but in a mixed strategy where you can write queries that are a little bit like a search engine and a little bit like an SQL query). So, you will be able to write mixed queries. The other thing they included in this technology is not only searching, but also the capability of running analytics. It's called Elasticsearch because it's very flexible: it's able to search, to run queries and to run analytics.

What is so different in a search-engine approach with respect to a query approach like the one databases implement? There are two main ingredients:

• Relevance

• Ranking

These two things do not exist in databases.

6.2.1 Relevance

In a database technology, when we write a query, we match the query on the data and the output is the exact answer. This means that, being exact, all the values we return in a query have the same level of importance and relevance, because they are exactly what you asked for. What happens in the search approach is quite different: we have somewhere a repository of documents, but when we send the query to it, the answer is not the exact match of the query. The answer we get from a search engine is what we may call the best match. This happens for different reasons:

1. There could be too many matching documents to return all of them.

2. It may not even be able to find them all.

"Best" is meant as the most relevant. It's important that the answer of the search engine is the most relevant to your information need. Relevance is finding the best answers that supposedly respond to your desire, and this is not available in a database. Since here you are going to get the best matches, the next idea is defining what is best and what is not. And here comes the other concept, the concept of ranking.

6.2.2 Ranking

The answers we get are not retrieved in a random order, but according to a rank approach. So, what we get is a ranking where supposedly the first element is the best, and so on. So, relevance is matched against the intent. This is not a sorting: clearly in SQL we have ORDER BY, a sorting approach, but that is not ordering the results by importance; ranking orders the results by importance, which is very different. So, relevance is the concept of matching your intent, and the order in which I decide which results match your intent better or worse creates a ranking.

How do we compute this relevance and this ranking? Depending on the kind of data we have, we may need to implement different ranking strategies; for instance, for textual data it is implemented with basic techniques like TFIDF or similar ones. In information retrieval-based approaches we have two concepts that are used for implementing those behaviours:

• Inverted Index

• Quantitative relevance

6.2.3 Inverted Index

What does it mean to have an inverted index? Usually in database technology an index is something like this: I have a key, so I have a quick way to access the data. The key-value approach is a typical example of an index: given the key, I give you the value. The inverted index works in the opposite direction, in the sense that it indexes the values and lets you find the IDs. Inverted indexes are typically used for text: text analysis, text match and text ranking. There you don't need an index that, starting from the key of the document, finds the value. You need the opposite: an index that, starting from the values of the words, finds the documents. That's why it's called inverted. So, the inverted index creates an index of every single word, and for every word it points to all the documents containing that word.

Figure 48: Inverted Index
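The idea above can be sketched in a few lines of Python. This is a minimal illustration, not Elasticsearch's actual implementation; the documents and IDs are made up for the example.

```python
from collections import defaultdict

# Toy document collection: document ID -> text (illustrative data)
docs = {
    1: "big data needs big storage",
    2: "search engines index documents",
    3: "an inverted index maps words to documents",
}

# Build the inverted index: every word points to the set of documents containing it
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        inverted[word].add(doc_id)

# Lookup works from value to ID: given a word, get the documents containing it
print(sorted(inverted["index"]))  # [2, 3]
print(sorted(inverted["big"]))    # [1]
```

Note how the lookup direction is reversed with respect to a key-value index: the word is the entry point, and the document IDs are the result.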

6.2.4 Quantitative Relevance

How do we decide what's important and what's not? Essentially, we need to think about some kind of relevance measure: a quantitative measure to assess what is relevant and what is not. Here we can open a huge set of possibilities on how to measure what is important. We point out only one possible realization of this concept of quantitative relevance match, implemented by a very basic approach called TFIDF.

6.2.5 TFIDF

This is based on two aspects:

1. You ask for something; I need to find something similar to what you ask. Clearly, the things that are more similar will be more relevant.

2. When searching for this similarity, I need to assess that not all the information has the same importance.

TFIDF stands for term frequency and inverse document frequency, and its purpose is to balance the two aspects.

The first thing we can do is to count how frequent the words are in a document, and this is what the first term of TFIDF does. The term frequency is the measure of how many times every word appears in every document, divided by the total number of words in the document. So, it's the frequency of a word inside the document. The idea of term frequency is that of giving high importance to documents that talk a lot about a given term.

Figure 49: TFIDF - tf

But then I need to consider the other aspect: not all words have the same importance. That means there are words that are very important and others that are not. We can say that the most important words should be the ones that are less frequent in the database, because if they are very rare, it means that they are also very specific. So, if a document contains these words, this is a significant indication of the relevance of the document. The inverse document frequency (IDF) basically counts how many documents contain a given word, divides it by the total number of documents in the collection, and then reverses the result. That's why it's called inverse document frequency. The logarithmic function is there just for numerical purposes, to decrease the weight of that factor.

Figure 50: TFIDF - idf

The key point is that overall, when you search for something, we will find first the documents that contain the word you search for many times, and we put first the documents that contain the words that are more important. In terms of the total ranking function we get, TFIDF altogether computes the term frequency, term by term, in every document, and the inverse document frequency, from the number of documents that contain that word, and multiplies them. Finally, the total score of a document j is the summation of the scores (tf-idf)i,j of every word i in the document j.

Figure 51: TFIDF - tfidf

For instance, all the words we use all the time in every sentence, like determiners, pronouns and very frequent verbs, are so frequent that they are totally non-informative. So, we could even drop them from the text completely, because in any case they would contribute zero.
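The tf and idf formulas above can be sketched in Python. This is a toy illustration, with made-up documents and a made-up query; here the ranking score of a document sums tf-idf over the query terms, which is one common way to apply the per-word scores described above.

```python
import math

# Toy collection: document ID -> text (illustrative data)
docs = {
    "d1": "big data systems store big data",
    "d2": "search engines rank documents by relevance",
    "d3": "data search uses an inverted index",
}
tokenized = {d: text.split() for d, text in docs.items()}
N = len(tokenized)

def tf(word, doc_id):
    # term frequency: occurrences of the word / total words in the document
    tokens = tokenized[doc_id]
    return tokens.count(word) / len(tokens)

def idf(word):
    # inverse document frequency: log(N / number of documents containing the word)
    df = sum(1 for tokens in tokenized.values() if word in tokens)
    return math.log(N / df) if df else 0.0

def score(query, doc_id):
    # sum of tf-idf contributions over the query words
    return sum(tf(w, doc_id) * idf(w) for w in query.split())

# Rank the collection for the query "big data": best match first
ranking = sorted(docs, key=lambda d: score("big data", d), reverse=True)
print(ranking)  # ['d1', 'd3', 'd2']
```

Note that "big" is rarer than "data" in this collection, so its idf is higher: d1, which contains it twice, comes out clearly first, while d2 scores zero because it contains neither query word.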

6.3 Elasticsearch - Query language

How do we implement that? In engines like Elasticsearch, what we need to do is create indexes. So, the main data structure for this kind of technology is the concept of index. And when we define an index, we define a mapping. The mapping is the set of rules that define the shape of an index; the index will be the way we access the data. So essentially the mapping is the specification of the index, where we say which fields of the documents or objects can be searchable and where we want to enable full-text search, and which features of the documents should be used as traditional fields in a database-like query. The mapping specifies all of that.

There are two types of mapping:

• Dynamic Mapping: Elasticsearch takes care of adding the new fields automatically when a document is indexed. This is the default setting.

Details
A.Y. 2024-2025
76 pages
SSD: Ingegneria industriale e dell'informazione ING-INF/05 Sistemi di elaborazione delle informazioni

The contents of this page are personal reworkings by the publisher AndreaMikhaiel of information learned by attending the lectures of Systems and methods for big and unstructured data, together with independent study of any reference books, in preparation for the final exam or thesis. They are not to be taken as official material of the university Politecnico di Milano or of prof. Marco Brambilla.