Notes - System and Method for Big and Unstructured Data

Moving from a Schema-on-Write to a Schema-on-Read paradigm improves flexibility.

This allows a single, shared, and fully available database for the enterprise (a data lake).

The data processing flow is the following:

  1. Data Ingestion: importing, transferring, and loading data for storage and later use
  2. Data Wrangling: cleaning raw data and transforming it into data that can be analyzed (understand, cleanse, augment, shape)
  3. Extract Transform Load (ETL): transforming the data to better fit the targeted queries
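The wrangling step can be sketched in a few lines. This is a minimal illustration; the records and field names below are invented for the example:

```python
# Toy raw records as they might arrive from ingestion (made-up data).
raw_records = [
    {"name": " Alice ", "age": "34", "city": "Milan"},
    {"name": "Bob", "age": None, "city": "milan"},
]

def wrangle(record):
    """Cleanse and shape one raw record into an analyzable row."""
    return {
        "name": record["name"].strip(),                                # cleanse whitespace
        "age": int(record["age"]) if record["age"] is not None else None,  # fix types
        "city": record["city"].lower(),                                # normalize for grouping
    }

clean = [wrangle(r) for r in raw_records]
```

Real wrangling pipelines add augmentation (joining in external data) on top of this kind of cleansing.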

2.1. Transactional Properties

A transaction is an elementary unit of work performed by an application.

A transactional system (OLTP) is a system capable of executing transactions.

ACID properties:

  • Atomicity: a transaction is an atomic transformation from the initial state to the final one
  • Consistency: the transaction satisfies the integrity constraints (the initial and final states are consistent)
  • Isolation: a transaction is not affected by the behavior of other concurrent transactions
  • Durability: the effect of a transaction that has successfully committed will last forever
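Atomicity can be illustrated with a toy in-memory "database": either all updates of a transaction apply, or none do. The `MiniTx` class and the transfer scenario are invented for this sketch; durability is out of scope here:

```python
class MiniTx:
    """Apply a list of updates atomically against a dict-based store."""

    def __init__(self, db):
        self.db = db

    def run(self, updates):
        snapshot = dict(self.db)        # remember the initial state
        try:
            for key, fn in updates:
                self.db[key] = fn(self.db.get(key, 0))
            return True                 # commit: all effects are kept
        except Exception:
            self.db.clear()
            self.db.update(snapshot)    # rollback: back to the initial state
            return False

db = {"a": 100, "b": 0}
tx = MiniTx(db)
# transfer 30 from a to b (succeeds), then a failing transaction (rolls back)
ok = tx.run([("a", lambda v: v - 30), ("b", lambda v: v + 30)])
bad = tx.run([("a", lambda v: v - 10), ("b", lambda v: v / 0)])  # raises, so it rolls back
```

After both runs, the store reflects only the committed transaction: the failed one left no partial effect.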

2.2 CAP Theorem

It is impossible for a distributed system to simultaneously provide all 3 of the following guarantees:

  • Consistency: all nodes see the same data at the same time
  • Availability: node failures do not prevent the surviving nodes from continuing to operate
  • Partition Tolerance: the system continues to operate despite arbitrary network partitions

A distributed system can satisfy any 2 of these guarantees at the same time, but not all 3.

In a networked system, partition tolerance is practically mandatory, so the real choice is between consistency and availability.
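The trade-off during a partition can be shown with a toy replica configured either to prefer consistency (CP) or availability (AP). The `Replica` class and its modes are invented for this sketch:

```python
class Replica:
    def __init__(self, mode):
        self.mode = mode            # "CP": prefer consistency, "AP": prefer availability
        self.value = 0
        self.partitioned = False    # True while the node cannot reach its peers

    def write(self, value):
        if self.partitioned and self.mode == "CP":
            return False            # CP: refuse writes it cannot replicate (unavailable)
        self.value = value          # AP: accept the write; replicas may diverge for now
        return True

cp, ap = Replica("CP"), Replica("AP")
cp.partitioned = ap.partitioned = True   # simulate a network partition
```

During the partition the CP replica rejects writes and stays consistent, while the AP replica keeps serving writes at the cost of possible divergence.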

2.3 NoSQL World

  • Key-Value Store: a key refers to a payload
  • Column Store: column data is saved together, as opposed to row data
  • Document/XML/Object Store: a key points to a serialized object
  • Graph Store: nodes are stored independently, and relations are stored with the data

2.4 ACID vs BASE Properties

ACID properties may not hold in NoSQL DBs. Instead, they guarantee BASE properties:

  • Basic Availability: fulfill requests, even with only partial consistency
  • Soft State: abandon the consistency requirements of the ACID model
  • Eventual Consistency: at some point in the future, data will converge to a consistent state

A fully ACID DB is the perfect fit for use cases where reliability and consistency are essential.

3. Data Ingestion: API

The 3 Ws of APIs:

  1. What: "an Application Program Interface is a set of routines, protocols, and tools for building software applications"; it allows programmatic access to data and platforms.
  2. Why: to separate the model from the presentation, regulate access to data, and avoid direct access to the platform.
  3. Who: every data provider.

7.1 How to Expose an API

  1. Each API exposes a set of HTTP(S) endpoints (URLs) to call in order to get data or perform actions on the platform. Most endpoints can be "tweaked" via one or more parameters.
  2. A REST architecture can be used to expose an API. RESTful APIs are resource-based Web APIs with a standard URL format.

Almost all APIs require some kind of user authentication: the user must register with the provider's developer portal to obtain the keys needed to access the API.

7.2 Some Problems

Crawling: the problem of getting a large number of data points from an API. "An API crawler is a piece of software that methodically interacts with a Web API to download data or to take some actions at predefined time intervals."

Pagination: most APIs support data pagination to split huge chunks of data into smaller sets.

Timeline: most APIs leverage the concept of a timeline. The way to always retrieve new data is a cursoring technique: instead of reading from the top of the timeline, we read the data relative to the already-processed IDs.
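Cursoring can be sketched against a fake timeline; `fetch_page` is a stand-in for a real paginated API call, and the IDs are made up:

```python
TIMELINE = list(range(100, 0, -1))   # newest id first, as in most timelines

def fetch_page(max_id=None, count=10):
    """Pretend API: return up to `count` items with id below the cursor."""
    items = [i for i in TIMELINE if max_id is None or i < max_id]
    return items[:count]

def crawl_all():
    seen, max_id = [], None
    while True:
        page = fetch_page(max_id=max_id)
        if not page:
            return seen
        seen.extend(page)
        max_id = page[-1]            # cursor: continue below the last processed id
```

Because the cursor moves strictly downward, items posted at the top of the timeline between requests are never re-read or skipped within a crawl.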

Parallelization and Multiple Accounts: when possible, make parallel requests to gather more data in less time. This can be enabled by handling multiple accounts.

Multiple accounts can be managed in different ways:

  • Request based:
    • Round Robin: a round-robin strategy among all the accounts for the requests
    • Account Pool: accounts are used sequentially
  • Account based:
    • Request Stack: the accounts, in parallel, take the next request from a shared pool
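The two families of strategies can be sketched as follows; account names and requests are placeholders:

```python
from itertools import cycle
from collections import deque

accounts = ["acct_a", "acct_b", "acct_c"]
requests = [f"req_{i}" for i in range(7)]

# Request based / Round Robin: rotate through the accounts, one per request
rr = cycle(accounts)
assignment = [(next(rr), req) for req in requests]

# Account based / Request Stack: each (parallel) account pops the next request
pool = deque(requests)
def take_next(account):
    return (account, pool.popleft()) if pool else None
```

In the round-robin case the schedule is fixed up front; with the request stack, faster accounts naturally consume more requests.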
When APIs are not available but data retrieval is needed, scraping can be a solution, but it should be the last resort.
Graph Databases - Neo4j (6)
  • A graph is a set of nodes joined by a set of lines or arrows.
  • We define the degree of a node as the number of edges entering or leaving the node.
  • In CS, a graph is an abstract data type that can be implemented in different ways:
    • Matrix representation: incidence matrix [E×V] or adjacency matrix [V×V]
    • List representation: edge list or adjacency list
  • Graph databases are used when the relationships between objects are more important than the objects themselves. They provide index-free adjacency, and queries are based on pattern-matching functionality.
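The matrix and list representations above can be sketched for a small directed graph; the edges here are arbitrary example data:

```python
edges = [(0, 1), (1, 2), (0, 2)]
n = 3

# Adjacency matrix [V x V]: matrix[u][v] == 1 iff there is an edge u -> v
matrix = [[0] * n for _ in range(n)]
for u, v in edges:
    matrix[u][v] = 1

# Adjacency list: for each node, the list of its successors
adjacency = {u: [] for u in range(n)}
for u, v in edges:
    adjacency[u].append(v)

# Degree of node 0: edges leaving it (its row) plus edges entering it (its column)
degree0 = sum(matrix[0]) + sum(row[0] for row in matrix)
```

The matrix costs O(V²) space regardless of edge count, while the adjacency list costs O(V+E), which is why sparse graphs usually use lists.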
Neo4j is a schemaless DB that maintains the ACID properties, typically employed as a non-analytic DB:

  • Highly performant reads and writes
  • Reliability for mission-critical production
  • Not efficient for whole-graph analysis

Relations have a direction, but you can navigate them in both directions.

In this case, indexing is useful to find an entry point into the graph.

Queries are expressed in a custom language called Cypher.

The PROFILE operation provides a complete analysis of the selected query.

Hints:

  • Use parameters instead of literals
  • Set an upper limit on variable-length patterns (*)
  • Return only the data you need
  • Use PROFILE/EXPLAIN to analyze the performance of your queries

Shortest Path

The shortest-path function works together with the WHERE condition: it returns the shortest path that satisfies that condition. There are 2 kinds of conditions:

  • Fast algorithms, where the condition can be checked "online" during the computation of the path
  • Exhaustive algorithms, where the whole path must be computed before the condition can be checked (e.g., a condition on the path length)

If we use a WITH operator between the computation of the shortest path and the WHERE condition, the computation of the shortest path will ignore those constraints.
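The "fast" case can be illustrated with a plain BFS shortest path where a per-node condition is checked online, pruning disallowed nodes during the search itself rather than after whole paths are built. This is a generic sketch of the idea, not Neo4j's internal algorithm; the graph is made up:

```python
from collections import deque

def shortest_path(graph, start, goal, allowed=lambda node: True):
    """BFS shortest path that checks a per-node condition online."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in graph.get(node, []):
            if nxt not in visited and allowed(nxt):   # checked during the computation
                visited.add(nxt)
                queue.append(path + [nxt])
    return None

graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
```

A condition on the total path length could not be checked this way: it needs the complete path, which is the exhaustive case.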

Key-Value Databases - Redis (8)

Key-value stores are typically used for performance goals. They are deployed in 3 ways:

  • Stand-alone (unusual)
  • Integrated with another DB
  • For caching purposes

Access is key-based: there are no SQL queries, foreign keys, indexes, or joins.

4.1 Caching Purpose - Memcached

Memcached is generic in nature, intended for use in speeding up dynamic web applications by alleviating DB load.

Technically, it is a pool of servers managed by the client.

Data stored in memcached is in high demand, expensive to compute, and common. Main features:

  • Fast network access
  • No persistence and no redundancy/failover
  • No authentication (a security concern)
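"A pool of servers managed by the client" means the client itself maps each key to a server by hashing. A simplified sketch, with invented server addresses (real clients typically use consistent hashing so that adding a server remaps fewer keys):

```python
import hashlib

servers = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]

def server_for(key):
    """Deterministically pick the server responsible for a key."""
    digest = hashlib.md5(key.encode()).digest()
    return servers[int.from_bytes(digest[:4], "big") % len(servers)]
```

Because the mapping is deterministic, every client that knows the server list agrees on where each key lives, with no coordination between the servers themselves.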

4.2 Redis

Redis is an advanced key-value store that supports a set of atomic operations.

It is best used for rapidly changing data with a foreseeable database size.

In Redis, values can be complex data types.

It is mostly used for datasets that fit in memory, when the dataset is not critical and high performance is needed. This means it is not a general-purpose DB.

Redis is a single-threaded server, which removes the problem of concurrency, and most commands execute in O(1) complexity.
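These properties can be sketched with a toy single-threaded store: one hash table, commands executed one at a time (so each command is atomic by construction), and values that can be complex types such as lists. This is an invented illustration, not the real Redis protocol or API:

```python
class MiniRedis:
    def __init__(self):
        self.data = {}

    def set(self, key, value):          # O(1) hash-table insert
        self.data[key] = value

    def get(self, key):                 # O(1) lookup
        return self.data.get(key)

    def incr(self, key):                # atomic: only one command runs at a time
        self.data[key] = int(self.data.get(key, 0)) + 1
        return self.data[key]

    def lpush(self, key, value):        # values can be complex types, e.g. lists
        self.data.setdefault(key, []).insert(0, value)

r = MiniRedis()
r.set("greeting", "hello")
n1 = r.incr("counter")
n2 = r.incr("counter")
r.lpush("queue", "a")
r.lpush("queue", "b")
```

In real Redis the single-threaded event loop gives the same guarantee: no two commands interleave, so read-modify-write commands like INCR need no locks.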

4.2.1 Redis Features

The main characteristics of Redis are:

  • Persistence: it provides 2 mechanisms, Redis snapshots and append-only files
  • Replication: a master/slave approach; slaves are read-only copies of the master, improving fault tolerance and performance
  • Partitioning: it can be done at different layers (client, proxy, query router) and is implemented for scalability and replication purposes
  • Failover: 3 ways: manual, with Redis Sentinel, or with Redis Cluster

4.2.2 Redis Topologies

1. Standalone:
  • The master data is optionally replicated to slaves
  • Slaves provide data redundancy, read offloading, and save-to-disk offloading
  • Slaves can have their own slaves
  • No automatic failover
2. Sentinel - Automatic Failover

  Redis Sentinel provides automatic failover in a master/slave topology, promoting a slave to master after a failure. Sentinels do not distribute data across nodes.

3. Twemproxy - Distributed Data

  Twemproxy works as a proxy that distributes data across multiple Redis instances.

Redis is an in-memory data structure store that can be used as a database, cache, and message broker. It is known for its high performance and scalability, and it supports data structures such as strings, hashes, lists, sets, and sorted sets.

4. Cluster - Automatic Failover and Distributed Data

  Redis Cluster distributes data across different Redis instances and performs automatic failover if any master instance fails. All nodes are directly connected via a service channel, and the key space is divided into hash slots.
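The hash-slot mapping used by Redis Cluster is simple enough to sketch: the slot of a key is CRC16 of the key modulo 16384, with a "hash tag" rule so that related keys (e.g. `{user1}:orders` and `{user1}:cart`) can be forced onto the same slot. A minimal reimplementation for illustration:

```python
def crc16(data: bytes) -> int:
    """CRC16-CCITT (XModem), the variant Redis Cluster specifies."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def hash_slot(key: str) -> int:
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end != start + 1:
            key = key[start + 1:end]   # hash only the non-empty {hash tag}
    return crc16(key.encode()) % 16384
```

Each master in the cluster owns a subset of the 16384 slots, so routing a command is just computing the slot and looking up its owner.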

The main advantages of Redis are performance, availability, fault tolerance, scalability, and portability.

Document databases, such as MongoDB, use data partitioning that is fully automatic, configurable, transparent, and based on BASE principles. MongoDB handles schema-less data, making development easier. In traditional databases a row of a table is a record, resulting in a small granularity; the goal of document databases is to have more flexible and larger records, achieved through the notion of a document. In MongoDB, data is stored in JSON-like documents. This type of database provides dynamic schemas and automatic data sharding.

MongoDB main features:

  • Better data locality, thanks to the document data structure
  • Support for both in-memory caching and disk storage
  • Support for distributed architectures, thanks to the JSON-like data model

The MongoDB data model is made of collections of documents in a schema-less approach. There are 2 ways to embed a sub-document into a document:

  1. Include it in the main document (preferred): the sub-document exists only as a sub-element of the main one, and duplication must be used to share the same element
  2. Use references, as SQL does: transactions and joins are present but strongly discouraged for performance reasons

The structure in MongoDB follows a polymorphic idea: documents in a collection are similar but not identical, so that each document can be shaped to better fit its data. Queries use an API-based language, not a declarative one: it is just a library with some operations on the database. Indexes can be created to speed up computation, but only one index (even a composed one) can be used at a time.
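The two embedding options can be sketched with plain dicts standing in for MongoDB documents; the collection and field names are made up:

```python
# Option 1: sub-document embedded in the main document (preferred)
post_embedded = {
    "_id": 1,
    "title": "Hello",
    "author": {"name": "Ada", "email": "ada@example.com"},  # lives only inside the post
}

# Option 2: a reference, as SQL would do (needs an extra lookup, i.e. a "join")
authors = {7: {"name": "Ada", "email": "ada@example.com"}}
post_referenced = {"_id": 1, "title": "Hello", "author_id": 7}

def resolve_author(post):
    """Follow the reference: a second read, which embedding avoids."""
    return authors[post["author_id"]]
```

Embedding gives the data locality mentioned above at the cost of duplicating shared sub-documents; referencing avoids duplication at the cost of extra lookups.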
Sharding

Sharding involves a shard key, defined by a data modeler, that describes the partition space of a data set.

Sharding is a means of partitioning data across servers to enable:

  • Scale and geo-locality
  • Hardware optimization
  • Lower recovery times

The shard key defines ranges of data. Data is partitioned into chunks by the shard key, and the chunks are distributed across the shards; MongoDB automatically splits and migrates chunks when their maximum size (64 MB) is reached.

Queries are routed to the specific shards.

We define:

  • mongod: the main daemon, a database instance that holds the data
  • mongos: the sharding router process, which has no local data
We have different sharding strategies:

  • Ranged: splits shards based on sub-ranges of a key
  • Hashed: keys are hashed before use; this ensures a more even distribution of data
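The two strategies can be sketched over a user-id shard key; the range boundaries and shard names below are invented:

```python
import hashlib

ranges = [(0, 1000, "shard0"), (1000, 2000, "shard1"), (2000, 10**9, "shard2")]

def ranged_shard(user_id):
    """Ranged: nearby keys land on the same shard (good locality, risk of hotspots)."""
    for lo, hi, shard in ranges:
        if lo <= user_id < hi:
            return shard

def hashed_shard(user_id, n_shards=3):
    """Hashed: even monotonically increasing keys spread evenly across shards."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return f"shard{int(digest, 16) % n_shards}"
```

Ranged sharding keeps range queries on few shards but can overload one shard when keys are monotonic (e.g. timestamps); hashed sharding balances writes but scatters range queries.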
Details

Publisher: Joseph22ITA
A.A. 2022-2023
25 pages
SSD: Ingegneria industriale e dell'informazione ING-INF/05 Sistemi di elaborazione delle informazioni

The contents of this page are personal reworkings by the publisher Joseph22ITA of information learned by attending the lectures of System and Method for Big and Unstructured Data and through independent study of reference books, in preparation for the final exam or thesis. They are not to be intended as official material of Politecnico di Milano or of Prof. Marco Brambilla.