The BEGIN BATCH ... APPLY BATCH construct groups multiple INSERT, UPDATE and DELETE statements so that they are sent and applied together.
BEGIN BATCH
<insert_statement>;
<update_statement>;
<delete_statement>;
APPLY BATCH;
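For instance, a minimal sketch of a batch over a hypothetical users table (the table name and columns are illustrative, not from the original notes):
BEGIN BATCH
INSERT INTO users (id, email) VALUES (1, 'alice@example.com');
UPDATE users SET email = 'bob@example.com' WHERE id = 2;
DELETE FROM users WHERE id = 3;
APPLY BATCH;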
5.3.16 CAPTURE
When the amount of data within a database grows, it can be really tough to visualize it within
a terminal. Fortunately, Cassandra provides us with a few commands to overcome this problem.
The CAPTURE command is followed by the path of the folder in which to store the results and the name of the output file.
CAPTURE '/Program Files/Cassandra/Outputs/output.txt'
To interrupt the CAPTURE you can run the following command.
CAPTURE off
5.3.17 EXPAND
The EXPAND command provides extended outputs within the console when performing queries.
It must be executed before the query to enable it.
EXPAND on
To interrupt the EXPAND you can run the following command.
EXPAND off
5.3.18 SOURCE
The SOURCE command allows you to run queries from textual files. The command accepts the
path to the file with the query.
SOURCE 'D:/Program Files/Cassandra/Queries/query_1.txt'
5.3.19 Data Types
Cassandra supports many different data types, like text, varint, float, double, boolean, etc.
In particular, it supports two special data types:
• Collections
• User-defined data types
Collections are easy to define and update:
CREATE TABLE test(email list<text>, ...)
UPDATE test SET email = email + ['new@email.com'] WHERE ...
5.3.20 User-defined Data Types
User-defined data types require defining the type before use:
CREATE TYPE <type_name> (
<column_name> <column_type>,
...
);
To verify the type creation, use the DESCRIBE command. User-defined types support ALTER and DROP operations.
DESCRIBE TYPE <type_name>
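As a concrete sketch (the type, table and column names below are illustrative, not from the original notes), a user-defined type can be created and then used as a column type:
CREATE TYPE address (
  street text,
  city text,
  zip_code text
);
CREATE TABLE customers (
  id uuid PRIMARY KEY,
  home_address frozen<address>
);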
6 Elasticsearch
6.1 Information retrieval databases
In information retrieval databases we will talk about the so-called ELK stack. ELK is the acronym of Elasticsearch, Logstash and Kibana. The idea of the ELK stack is to provide a technology for storing, searching and analysing data from a perspective that is not that of database technologies, but that of search engines. So, these technologies are based on search engine approaches instead of database approaches.
Historically, Elasticsearch was the first part; Logstash and Kibana were created around it later. Elasticsearch is the core engine, providing storage and search capabilities, and the two other levels complement it:
1. The data comes in through Logstash
2. It gets stored and managed in Elasticsearch
3. Kibana is the output level of the data: data visualization, exploration dashboards and diagrams built on top of the data outputs. Basically, this is the level you use for building data visualization and data interaction over the data.
6.2 Elasticsearch
The idea of Elasticsearch is that of a search engine built for creating custom search engines for enterprise use. You deploy it internally in your company, you build the data storage for your documents and then you can search over them. You don't put this on the web.
This search engine gradually incorporated capabilities that are typical of databases: storage and data querying (not only as a search engine, but in a mixed strategy where you can write queries that are a little bit like a search-engine query and a little bit like an SQL query). So, you will be able to write mixed queries.
The other thing included in this technology is not only searching, but also the capability of running analytics. It's called Elasticsearch because it's very flexible: it's able to search, to answer queries and to run analytics.
What is so different in a search engine approach with respect to the query approach that a database implements? There are two key ingredients:
• Relevance
• Ranking
These two things do not exist in databases.
6.2.1 Relevance
In a database technology, when we write a query, we match the query against the data and the output is the exact answer. This means that, being exact, all the values returned by a query have the same level of importance and relevance, because they are exactly what you asked for. This is the database approach. In the search approach the behaviour is quite different: we have a repository of documents somewhere, but when we send a query to it, the answer is not the exact match of the query. The answer we get from a search engine is what we may call the best match. This happens for different reasons:
1. There could be too many results to return all of them.
2. It may not even be able to find them all.
Best here means most relevant: it's important that the answer of the search engine is the most relevant to your information need. Relevance is about finding the best answers that supposedly respond to your desire, and this is not available in a database. Since here you are going to get the best matches, the next step is defining what is best and what is not, and here comes the other concept: ranking.
6.2.2 Ranking
The answers we get are not retrieved in a random order, but according to a ranking approach: what we get is a ranking where supposedly the first element is the best, and so on. Relevance is matched against the intent. This is not a sorting: clearly in SQL we have ORDER BY, a sorting approach, but that is not ordering the results by importance. Ranking the results by importance is very different. So, relevance is the concept of matching your intent, and the order in which results are judged to match your intent better or worse creates a ranking.
How do we compute this relevance and this ranking? Depending on the kind of data we have, we may need to implement different ranking strategies; for instance, for textual data it is implemented with basic techniques like TFIDF or similar ones. In information retrieval-based approaches we have two concepts that are used for implementing those behaviours:
• Inverted Index
• Quantitative relevance
6.2.3 Inverted Index
What does it mean to have an inverted index? Usually in database technology an index is
something like this: I have a key, so I have a quick way to access the data. The key value
approach is a typical example of index. Given the key I give you the value. The inverted index
works in the opposite direction, in the sense that it indexes the values and lets you find the IDs.
Inverted indexes are typically used for text: text analysis, text matching and text ranking. Here you don't need an index that, starting from the key of a document, finds its value; you need the opposite: an index that, starting from the values (the words), finds the documents. That's why it's called inverted. So, the inverted index creates an entry for every single word, and for every word it points to all the documents containing that word.
Figure 48: Inverted Index
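As a minimal sketch of the idea (plain Python with a made-up toy corpus; this only illustrates the concept, not how Elasticsearch implements it internally):
from collections import defaultdict

docs = {
    1: "cassandra is a column database",
    2: "elasticsearch is a search engine",
    3: "a search engine ranks documents",
}

# Build the inverted index: word -> set of IDs of documents containing it
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        inverted_index[word].add(doc_id)

print(inverted_index["search"])   # {2, 3}
print(inverted_index["engine"])   # {2, 3}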
6.2.4 Quantitative Relevance
How do we decide what's important and what's not? Essentially, we need to think about some kind of relevance measure: a quantitative measure to assess what is relevant and what is not. Here we can open a huge set of possibilities on how to measure what is important. We point out only one possible realization of this concept of quantitative relevance, implemented by a very basic approach called TFIDF.
6.2.5 TFIDF
This is based on two aspects:
1. You ask for something, and I need to find something similar to what you asked for. Clearly, the things that are more similar will be more relevant.
2. When searching for this similarity, I need to take into account that not all the information has the same importance.
TFIDF stands for term frequency and inverse document frequency, and its purpose is to balance these two aspects.
The first thing we can do is to count how frequent the words are in a document, and this is what the first factor of TFIDF does. The term frequency measures how many times a word appears in a document, divided by the total number of words in that document. So, it's the frequency of a word inside the document. The idea of term frequency is to give high importance to documents that talk a lot about a given term.
Figure 49: TFIDF - tf
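A standard way to write this, consistent with the description above (the original figure is not reproduced here):
\[ \mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \]
where n_{i,j} is the number of times word i appears in document j and the denominator is the total number of words in document j.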
But then I need to consider the other aspect: not all the words have the same importance. There are words that are very important and others that are not. We can say that the most important words should be the ones that are less frequent in the database: if they are very rare, it means that they are also very specific, so if a document contains these words, this is a significant indication of the relevance of the document. The inverse document frequency (IDF) basically counts how many documents contain a given word, divides it by the total number of documents in the collection, and then inverts the result; that's why it's called inverse document frequency. The logarithmic function is just for numerical purposes, to dampen the magnitude of the value.
Figure 50: TFIDF - idf
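Again in a standard formulation, consistent with the text (the original figure is not reproduced here):
\[ \mathrm{idf}_{i} = \log \frac{N}{\mathrm{df}_{i}} \]
where N is the total number of documents in the collection and df_i is the number of documents that contain word i.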
The key point is that overall, when you search for something, we will rank first the documents that contain the words you search for many times, and among those, the documents that contain the words that are more important. Putting it all together, TFIDF computes, for every word in every document, the term frequency and the inverse document frequency, and multiplies them. Finally, the total score of document j is the summation of the scores (tf-idf)_{i,j} of every word i in the document j.
Figure 51: TFIDF - tfidf
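In formula (standard formulation, consistent with the description; the original figure is not reproduced here):
\[ (\text{tf-idf})_{i,j} = \mathrm{tf}_{i,j} \cdot \mathrm{idf}_{i}, \qquad \mathrm{score}(j) = \sum_{i} (\text{tf-idf})_{i,j} \]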
For instance, the words we use all the time in every sentence, like determiners, pronouns and very frequent verbs, are so frequent that they are totally non-informative. We could even drop them from the text completely, because in any case they would provide a contribution of zero.
6.3 Elasticsearch - Query language
How do we implement that? In engines like Elasticsearch what we need to do is create indexes. So, the main data structure for this kind of technology is the concept of index. When we define an index, we define a mapping: the set of rules that define the shape of an index. The index will be the way we access the data. So essentially the mapping is the specification of the index, where we say which fields and where we want to enable full-text search, which fields of the documents or objects can be searchable, and which features of the documents should be used as traditional fields in a database-like query. The mapping specifies all of that.
There are two types of mapping:
• Dynamic Mapping: Elasticsearch takes care of adding the new fields automatically when a document is indexed. This is the default behaviour.
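As a minimal sketch of what a mapping definition looks like when you specify it yourself (the index name and fields are illustrative, not from the original notes; the request format assumes a recent Elasticsearch version), an index can be created together with its mapping, marking which fields should be full-text searchable (text) and which behave like traditional database fields (keyword, integer, date):
PUT /articles
{
  "mappings": {
    "properties": {
      "title":        { "type": "text" },
      "body":         { "type": "text" },
      "author":       { "type": "keyword" },
      "published_on": { "type": "date" },
      "views":        { "type": "integer" }
    }
  }
}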