Notes - System and Method for Big and Unstructured Data

Moving from a Schema-on-Write to a Schema-on-Read paradigm improves flexibility.

This allows a single, shared, and fully available database for the enterprise (a data lake).

The data processing flow is the following:

  1. Data Ingestion: importing, transferring, and loading data for storage and later use
  2. Data Wrangling: cleaning raw data and transforming it into data that can be analyzed (understand, cleanse, augment, shape)
  3. Extract Transform Load (ETL): transforming the data to better fit the targeted queries
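The wrangling step can be sketched in a few lines. This is a minimal illustration; the records and field names below are invented for the example:

```python
# Toy raw records as they might arrive from ingestion (made-up data).
raw_records = [
    {"name": " Alice ", "age": "34", "city": "Milan"},
    {"name": "Bob", "age": None, "city": "milan"},
]

def wrangle(record):
    """Cleanse and shape one raw record into an analyzable row."""
    return {
        "name": record["name"].strip(),                                # cleanse whitespace
        "age": int(record["age"]) if record["age"] is not None else None,  # fix types
        "city": record["city"].lower(),                                # normalize for grouping
    }

clean = [wrangle(r) for r in raw_records]
```

Real wrangling pipelines add augmentation (joining in external data) on top of this kind of cleansing.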

2.1. Transactional Properties

A transaction is an elementary unit of work performed by an application.

A transactional system (OLTP) is a system capable of executing transactions.

ACID properties:

  • Atomicity: a transaction is an atomic transformation from the initial state to the final one
  • Consistency: the transaction satisfies the integrity constraints (the initial and final states are consistent)
  • Isolation: a transaction is not affected by the behavior of other concurrent transactions
  • Durability: the effect of a transaction that has successfully committed will last forever
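Atomicity can be illustrated with a toy in-memory "database": either all updates of a transaction apply, or none do. The `MiniTx` class and the transfer scenario are invented for this sketch; durability is out of scope here:

```python
class MiniTx:
    """Apply a list of updates atomically against a dict-based store."""

    def __init__(self, db):
        self.db = db

    def run(self, updates):
        snapshot = dict(self.db)        # remember the initial state
        try:
            for key, fn in updates:
                self.db[key] = fn(self.db.get(key, 0))
            return True                 # commit: all effects are kept
        except Exception:
            self.db.clear()
            self.db.update(snapshot)    # rollback: back to the initial state
            return False

db = {"a": 100, "b": 0}
tx = MiniTx(db)
# transfer 30 from a to b (succeeds), then a failing transaction (rolls back)
ok = tx.run([("a", lambda v: v - 30), ("b", lambda v: v + 30)])
bad = tx.run([("a", lambda v: v - 10), ("b", lambda v: v / 0)])  # raises, so it rolls back
```

After both runs, the store reflects only the committed transaction: the failed one left no partial effect.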

2.2 CAP Theorem

It is impossible for a distributed system to simultaneously provide all 3 of the following guarantees:

  • Consistency: all nodes see the same data at the same time
  • Availability: node failures do not prevent the surviving nodes from continuing to operate
  • Partition Tolerance: the system continues to operate despite arbitrary network partitions

A distributed system can satisfy any 2 of these guarantees at the same time, but not all 3.

In a networked system, partition tolerance is practically mandatory, so the real choice is between consistency and availability.
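The trade-off during a partition can be shown with a toy replica configured either to prefer consistency (CP) or availability (AP). The `Replica` class and its modes are invented for this sketch:

```python
class Replica:
    def __init__(self, mode):
        self.mode = mode            # "CP": prefer consistency, "AP": prefer availability
        self.value = 0
        self.partitioned = False    # True while the node cannot reach its peers

    def write(self, value):
        if self.partitioned and self.mode == "CP":
            return False            # CP: refuse writes it cannot replicate (unavailable)
        self.value = value          # AP: accept the write; replicas may diverge for now
        return True

cp, ap = Replica("CP"), Replica("AP")
cp.partitioned = ap.partitioned = True   # simulate a network partition
```

During the partition the CP replica rejects writes and stays consistent, while the AP replica keeps serving writes at the cost of possible divergence.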

2.3 NoSQL World

  • Key-Value Store: a key refers to a payload
  • Column Store: column data is saved together, as opposed to row data
  • Document/XML/Object Store: a key points to a serialized object
  • Graph Store: nodes are stored independently, and relations are stored with the data

2.4 ACID vs BASE Properties

ACID properties may not hold in NoSQL DBs. Instead, they guarantee BASE properties:

  • Basic Availability: fulfill requests, even with only partial consistency
  • Soft State: abandon the consistency requirements of the ACID model
  • Eventual Consistency: at some point in the future, data will converge to a consistent state

A fully ACID DB is the perfect fit for use cases where reliability and consistency are essential.

3. Data Ingestion: API

The 3 Ws of APIs:

  1. What: "an Application Program Interface is a set of routines, protocols, and tools for building software applications"; it allows programmatic access to data and platforms.
  2. Why: to separate the model from the presentation, regulate access to data, and avoid direct access to the platform.
  3. Who: every data provider.

7.1 How to Expose an API

  1. Each API exposes a set of HTTP(S) endpoints (URLs) to call in order to get data or perform actions on the platform. Most endpoints can be "tweaked" via one or more parameters.
  2. A REST architecture can be used to expose an API. RESTful APIs are resource-based Web APIs with a standard URL format.

Almost all APIs require some kind of user authentication: the user must register with the provider's developer portal to obtain the keys needed to access the API.

7.2 Some Problems

Crawling: the problem of getting a large number of data points from an API. "An API crawler is a piece of software that methodically interacts with a Web API to download data or to take some actions at predefined time intervals."

Pagination: most APIs support data pagination to split huge chunks of data into smaller sets.

Timeline: most APIs leverage the concept of a timeline. The way to always retrieve new data is a cursoring technique: instead of reading from the top of the timeline, we read the data relative to the already-processed IDs.
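Cursoring can be sketched against a fake timeline; `fetch_page` is a stand-in for a real paginated API call, and the IDs are made up:

```python
TIMELINE = list(range(100, 0, -1))   # newest id first, as in most timelines

def fetch_page(max_id=None, count=10):
    """Pretend API: return up to `count` items with id below the cursor."""
    items = [i for i in TIMELINE if max_id is None or i < max_id]
    return items[:count]

def crawl_all():
    seen, max_id = [], None
    while True:
        page = fetch_page(max_id=max_id)
        if not page:
            return seen
        seen.extend(page)
        max_id = page[-1]            # cursor: continue below the last processed id
```

Because the cursor moves strictly downward, items posted at the top of the timeline between requests are never re-read or skipped within a crawl.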

Parallelization and Multiple Accounts: when possible, make parallel requests to gather more data in less time. This can be enabled by handling multiple accounts.

Multiple accounts can be managed in different ways:

  • Request based:
    • Round Robin: a round-robin strategy among all the accounts for the requests
    • Account Pool: accounts are used sequentially
  • Account based:
    • Request Stack: the accounts, in parallel, take the next request from a shared pool
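The two families of strategies can be sketched as follows; account names and requests are placeholders:

```python
from itertools import cycle
from collections import deque

accounts = ["acct_a", "acct_b", "acct_c"]
requests = [f"req_{i}" for i in range(7)]

# Request based / Round Robin: rotate through the accounts, one per request
rr = cycle(accounts)
assignment = [(next(rr), req) for req in requests]

# Account based / Request Stack: each (parallel) account pops the next request
pool = deque(requests)
def take_next(account):
    return (account, pool.popleft()) if pool else None
```

In the round-robin case the schedule is fixed up front; with the request stack, faster accounts naturally consume more requests.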
When APIs are not available but data retrieval is needed, scraping can be a solution, but it should be the last resort.
Graph Databases - Neo4j (6)
  • A graph is a set of nodes joined by a set of lines or arrows.
  • We define the degree of a node as the number of edges entering or leaving the node.
  • In CS, a graph is an abstract data type that can be implemented in different ways:
    • Matrix representation: incidence matrix [E×V] or adjacency matrix [V×V]
    • List representation: edge list or adjacency list
  • Graph databases are used when the relationships between objects are more important than the objects themselves. They provide index-free adjacency, and queries are based on pattern-matching functionality.
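The matrix and list representations above can be sketched for a small directed graph; the edges here are arbitrary example data:

```python
edges = [(0, 1), (1, 2), (0, 2)]
n = 3

# Adjacency matrix [V x V]: matrix[u][v] == 1 iff there is an edge u -> v
matrix = [[0] * n for _ in range(n)]
for u, v in edges:
    matrix[u][v] = 1

# Adjacency list: for each node, the list of its successors
adjacency = {u: [] for u in range(n)}
for u, v in edges:
    adjacency[u].append(v)

# Degree of node 0: edges leaving it (its row) plus edges entering it (its column)
degree0 = sum(matrix[0]) + sum(row[0] for row in matrix)
```

The matrix costs O(V²) space regardless of edge count, while the adjacency list costs O(V+E), which is why sparse graphs usually use lists.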
Neo4j is a schemaless DB that maintains the ACID properties, typically employed as a non-analytic DB:

  • Highly performant reads and writes
  • Reliability for mission-critical production
  • Not efficient for whole-graph analysis

Relations have a direction, but you can navigate them in both directions.

In this case, indexing is useful to find an entry point into the graph.

Queries are expressed in a custom language called Cypher.

The PROFILE operation provides a complete analysis of the selected query.

Hints:

  • Use parameters instead of literals
  • Set an upper limit on variable-length patterns (*)
  • Return only the data you need
  • Use PROFILE/EXPLAIN to analyze the performance of your queries

Shortest Path

The shortest-path function works together with the WHERE condition: it returns the shortest path that satisfies that condition. There are 2 kinds of conditions:

  • Fast algorithms, where the condition can be checked "online" during the computation of the path
  • Exhaustive algorithms, where the whole path must be computed before the condition can be checked (e.g., a condition on the path length)

If we use a WITH operator between the computation of the shortest path and the WHERE condition, the computation of the shortest path will ignore those constraints.
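The "fast" case can be illustrated with a plain BFS shortest path where a per-node condition is checked online, pruning disallowed nodes during the search itself rather than after whole paths are built. This is a generic sketch of the idea, not Neo4j's internal algorithm; the graph is made up:

```python
from collections import deque

def shortest_path(graph, start, goal, allowed=lambda node: True):
    """BFS shortest path that checks a per-node condition online."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in graph.get(node, []):
            if nxt not in visited and allowed(nxt):   # checked during the computation
                visited.add(nxt)
                queue.append(path + [nxt])
    return None

graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
```

A condition on the total path length could not be checked this way: it needs the complete path, which is the exhaustive case.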

Key-Value Databases - Redis (8)

Key-value stores are typically used for performance goals. They are deployed in 3 ways:

  • Stand-alone (unusual)
  • Integrated with another DB
  • For caching purposes

Access is key-based: there are no SQL queries, foreign keys, indexes, or joins.

4.1 Caching Purpose - Memcached

Memcached is generic in nature, intended for use in speeding up dynamic web applications by alleviating DB load.

Technically, it is a pool of servers managed by the client.

Data stored in memcached is in high demand, expensive to compute, and common. Main features:

  • Fast network access
  • No persistence and no redundancy/failover
  • No authentication (a security concern)
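"A pool of servers managed by the client" means the client itself maps each key to a server by hashing. A simplified sketch, with invented server addresses (real clients typically use consistent hashing so that adding a server remaps fewer keys):

```python
import hashlib

servers = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]

def server_for(key):
    """Deterministically pick the server responsible for a key."""
    digest = hashlib.md5(key.encode()).digest()
    return servers[int.from_bytes(digest[:4], "big") % len(servers)]
```

Because the mapping is deterministic, every client that knows the server list agrees on where each key lives, with no coordination between the servers themselves.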

4.2 Redis

Redis is an advanced key-value store that supports a set of atomic operations.

It is best used for rapidly changing data with a foreseeable database size.

In Redis, values can be complex data types.

It is mostly used for datasets that fit in memory, when the dataset is not critical and high performance is needed. This means it is not a general-purpose DB.

Redis is a single-threaded server, which removes the problem of concurrency, and most commands execute in O(1) complexity.
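These properties can be sketched with a toy single-threaded store: one hash table, commands executed one at a time (so each command is atomic by construction), and values that can be complex types such as lists. This is an invented illustration, not the real Redis protocol or API:

```python
class MiniRedis:
    def __init__(self):
        self.data = {}

    def set(self, key, value):          # O(1) hash-table insert
        self.data[key] = value

    def get(self, key):                 # O(1) lookup
        return self.data.get(key)

    def incr(self, key):                # atomic: only one command runs at a time
        self.data[key] = int(self.data.get(key, 0)) + 1
        return self.data[key]

    def lpush(self, key, value):        # values can be complex types, e.g. lists
        self.data.setdefault(key, []).insert(0, value)

r = MiniRedis()
r.set("greeting", "hello")
n1 = r.incr("counter")
n2 = r.incr("counter")
r.lpush("queue", "a")
r.lpush("queue", "b")
```

In real Redis the single-threaded event loop gives the same guarantee: no two commands interleave, so read-modify-write commands like INCR need no locks.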

4.2.1 Redis Features

The main characteristics of Redis are:

  • Persistence: it provides 2 mechanisms, Redis snapshots and append-only files
  • Replication: a master/slave approach; slaves are read-only copies of the master, improving fault tolerance and performance
  • Partitioning: it can be done at different layers (client, proxy, query router) and is implemented for scalability and replication purposes
  • Failover: 3 ways: manual, with Redis Sentinel, or with Redis Cluster

4.2.2 Redis Topologies

1. Standalone:
  • The master data is optionally replicated to slaves
  • Slaves provide data redundancy, read offloading, and save-to-disk offloading
  • Slaves can have their own slaves
  • No automatic failover
2. Sentinel - Automatic Failover

  Redis Sentinel provides automatic failover in a master/slave topology, promoting a slave to master after a failure. Sentinels do not distribute data across nodes.

3. Twemproxy - Distributed Data

  Twemproxy works as a proxy that distributes data across multiple Redis instances.

Redis is an in-memory data structure store that can be used as a database, cache, and message broker. It is known for its high performance and scalability, and it supports data structures such as strings, hashes, lists, sets, and sorted sets.

4. Cluster - Automatic Failover and Distributed Data

  Redis Cluster distributes data across different Redis instances and performs automatic failover if any master instance fails. All nodes are directly connected via a service channel, and the key space is divided into hash slots.
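The hash-slot mapping used by Redis Cluster is simple enough to sketch: the slot of a key is CRC16 of the key modulo 16384, with a "hash tag" rule so that related keys (e.g. `{user1}:orders` and `{user1}:cart`) can be forced onto the same slot. A minimal reimplementation for illustration:

```python
def crc16(data: bytes) -> int:
    """CRC16-CCITT (XModem), the variant Redis Cluster specifies."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def hash_slot(key: str) -> int:
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end != start + 1:
            key = key[start + 1:end]   # hash only the non-empty {hash tag}
    return crc16(key.encode()) % 16384
```

Each master in the cluster owns a subset of the 16384 slots, so routing a command is just computing the slot and looking up its owner.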

The main advantages of Redis are performance, availability, fault tolerance, scalability, and portability.

Document databases, such as MongoDB, use data partitioning that is fully automatic, configurable, transparent, and based on BASE principles. MongoDB handles schema-less data, making development easier. In traditional databases a row of a table is a record, resulting in a small granularity; the goal of document databases is to have more flexible and larger records, achieved through the notion of a document. In MongoDB, data is stored in JSON-like documents. This type of database provides dynamic schemas and automatic data sharding.

MongoDB main features:

  • Better data locality, thanks to the document data structure
  • Support for both in-memory caching and disk storage
  • Support for distributed architectures, thanks to the JSON-like data model

The MongoDB data model is made of collections of documents in a schema-less approach. There are 2 ways to embed a sub-document into a document:

  1. Include it in the main document (preferred): the sub-document exists only as a sub-element of the main one, and duplication must be used to share the same element
  2. Use references, as SQL does: transactions and joins are present but strongly discouraged for performance reasons

The structure in MongoDB follows a polymorphic idea: documents in a collection are similar but not identical, so that each document can be shaped to better fit its data. Queries use an API-based language, not a declarative one: it is just a library with some operations on the database. Indexes can be created to speed up computation, but only one index (even a composed one) can be used at a time.
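The two embedding options can be sketched with plain dicts standing in for MongoDB documents; the collection and field names are made up:

```python
# Option 1: sub-document embedded in the main document (preferred)
post_embedded = {
    "_id": 1,
    "title": "Hello",
    "author": {"name": "Ada", "email": "ada@example.com"},  # lives only inside the post
}

# Option 2: a reference, as SQL would do (needs an extra lookup, i.e. a "join")
authors = {7: {"name": "Ada", "email": "ada@example.com"}}
post_referenced = {"_id": 1, "title": "Hello", "author_id": 7}

def resolve_author(post):
    """Follow the reference: a second read, which embedding avoids."""
    return authors[post["author_id"]]
```

Embedding gives the data locality mentioned above at the cost of duplicating shared sub-documents; referencing avoids duplication at the cost of extra lookups.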
Sharding

Sharding involves a shard key, defined by a data modeler, that describes the partition space of a data set.

Sharding is a means of partitioning data across servers to enable:

  • Scale and geo-locality
  • Hardware optimization
  • Lower recovery times

The shard key defines ranges of data. Data is partitioned into chunks by the shard key, and the chunks are distributed across the shards; MongoDB automatically splits and migrates chunks when their maximum size (64 MB) is reached.

Queries are routed to the specific shards.

We define:

  • mongod: the main daemon, a database instance that holds the data
  • mongos: the sharding router process, which has no local data
We have different sharding strategies:

  • Ranged: splits shards based on sub-ranges of a key
  • Hashed: keys are hashed before use; this ensures a more even distribution of data
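The two strategies can be sketched over a user-id shard key; the range boundaries and shard names below are invented:

```python
import hashlib

ranges = [(0, 1000, "shard0"), (1000, 2000, "shard1"), (2000, 10**9, "shard2")]

def ranged_shard(user_id):
    """Ranged: nearby keys land on the same shard (good locality, risk of hotspots)."""
    for lo, hi, shard in ranges:
        if lo <= user_id < hi:
            return shard

def hashed_shard(user_id, n_shards=3):
    """Hashed: even monotonically increasing keys spread evenly across shards."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return f"shard{int(digest, 16) % n_shards}"
```

Ranged sharding keeps range queries on few shards but can overload one shard when keys are monotonic (e.g. timestamps); hashed sharding balances writes but scatters range queries.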
Details

Publisher: Joseph22ITA
A.A. 2022-2023
25 pages
SSD: Ingegneria industriale e dell'informazione ING-INF/05 Sistemi di elaborazione delle informazioni

The contents of this page are personal reworkings by the publisher Joseph22ITA of information learned by attending the lectures of System and Method for Big and Unstructured Data and through independent study of reference books, in preparation for the final exam or thesis. They are not to be intended as official material of Politecnico di Milano or of Prof. Marco Brambilla.