MSC Azure Data Fundamentals - 1 Core Concepts

  • Created by: funkyd101
  • Created on: 12-07-22 10:32

Types of Data

Structured data is data that adheres to a fixed schema, so all of the data has the same fields or properties.

Semi-structured data is information that has some structure, but which allows for some variation between entity instances.

Unstructured data is data that has no specific structure. Not all data is structured or even semi-structured: documents, images, audio and video data, and binary files are typical examples.
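
The difference between structured and semi-structured data can be checked mechanically. The sketch below is only an illustration: the records and the helper `has_fixed_schema` are hypothetical, and "schema" is simplified here to "the set of field names".

```python
# Hypothetical records; the field names are illustrative only.
structured = [
    {"id": 1, "firstName": "Joe", "lastName": "Jones"},
    {"id": 2, "firstName": "Samir", "lastName": "Nadoy"},
]
semi_structured = [
    {"firstName": "Joe", "contact": [{"type": "home"}, {"type": "email"}]},
    {"firstName": "Samir"},  # this entity instance has no "contact" field
]


def has_fixed_schema(records):
    """Return True when every record exposes exactly the same set of fields."""
    return len({frozenset(record) for record in records}) <= 1


print(has_fixed_schema(structured), has_fixed_schema(semi_structured))
```

Every structured record has the same fields, so the first check passes; the second fails because one record varies from the other.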

1 of 13

Common File Formats

Delimited text files
comma-separated values (CSV)
tab-separated values (TSV)
space-delimited
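
As a rough sketch of how delimited text is read, Python's standard `csv` module handles any single-character delimiter. The sample data and the helper `read_delimited` are assumptions made for illustration, with names and addresses borrowed from the XML card.

```python
import csv
import io

# Hypothetical sample data; the column names are assumptions.
CSV_TEXT = (
    "FirstName,LastName,Email\n"
    "Joe,Jones,joe@litware.com\n"
    "Samir,Nadoy,samir@northwind.com\n"
)
TSV_TEXT = CSV_TEXT.replace(",", "\t")  # same data, tab-separated


def read_delimited(text, delimiter):
    """Parse delimited text whose first line is a header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


csv_rows = read_delimited(CSV_TEXT, ",")
tsv_rows = read_delimited(TSV_TEXT, "\t")
print(csv_rows[0]["Email"])
```

Only the delimiter differs between the two calls; the parsed records are the same.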

JavaScript Object Notation (JSON)

Extensible Markup Language (XML)

Binary Large Object (BLOB)
- a general term used to describe many different unstructured file formats

2 of 13

JSON file format

JSON 
{
  "customers":
  [
    {
      "firstName": "Joe",
      "contact":
      [
        {"type": "home", … },
        {"type": "email", … }
      ]
    },
    {
      "firstName": "Samir",
      …
    }
  ]
}
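
The JSON on this card is abbreviated, so it cannot be parsed as written. The sketch below uses an assumed complete version: the values are borrowed from the XML card, and the key names `lastName`, `contact`, `number` and `address` are guesses, not taken from the card.

```python
import json

# Assumed complete document; key names other than "customers",
# "firstName" and "type" are hypothetical.
JSON_TEXT = """
{
  "customers": [
    {
      "firstName": "Joe",
      "lastName": "Jones",
      "contact": [
        {"type": "home", "number": "555 123-1234"},
        {"type": "email", "address": "joe@litware.com"}
      ]
    },
    {
      "firstName": "Samir",
      "lastName": "Nadoy",
      "contact": [
        {"type": "email", "address": "samir@northwind.com"}
      ]
    }
  ]
}
"""

data = json.loads(JSON_TEXT)

# Entity instances may vary: each customer has a different number of contacts.
contact_counts = {c["firstName"]: len(c["contact"]) for c in data["customers"]}
print(contact_counts)
```

Objects map to Python dictionaries and arrays to lists, which is why the nested structure can be walked with ordinary indexing.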

3 of 13

XML file format

<XML>

<Customers>
  <Customer name="Joe" lastName="Jones">
    <ContactDetails>
      <Contact type="home" number="555 123-1234"/>
      <Contact type="email" address="joe@litware.com"/>
    </ContactDetails>
  </Customer>
  <Customer name="Samir" lastName="Nadoy">
    <ContactDetails>
      <Contact type="email" address="samir@northwind.com"/>
```
    </ContactDetails>
  </Customer>
</Customers>

</XML>
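
A minimal sketch of reading this document with Python's standard `xml.etree.ElementTree` module. It assumes `Customers` is the root element (the outer wrapper lines are left out) and that attributes are separated by spaces, as well-formed XML requires.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """
<Customers>
  <Customer name="Joe" lastName="Jones">
    <ContactDetails>
      <Contact type="home" number="555 123-1234"/>
      <Contact type="email" address="joe@litware.com"/>
    </ContactDetails>
  </Customer>
  <Customer name="Samir" lastName="Nadoy">
    <ContactDetails>
      <Contact type="email" address="samir@northwind.com"/>
    </ContactDetails>
  </Customer>
</Customers>
"""

root = ET.fromstring(XML_TEXT)

# Collect the email address of each customer from the element attributes.
emails = {
    customer.get("name"): contact.get("address")
    for customer in root.findall("Customer")
    for contact in customer.findall("ContactDetails/Contact")
    if contact.get("type") == "email"
}
print(emails)
```

Elements form the hierarchy and attributes carry the values, so the query is a walk down the tree followed by attribute lookups.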

4 of 13

Optimized file formats

Avro is a row-based format. Each record contains a header that describes the structure of the data in the record. This header is stored as JSON. The data is stored as binary information.

ORC (Optimized Row Columnar format) organizes data into columns rather than rows. An ORC file contains stripes of data. Each stripe holds the data for a column or set of columns. A stripe contains an index into the rows in the stripe, the data for each row, and a footer that holds statistical information (count, sum, max, min, and so on) for each column.

Parquet is another columnar data format. A Parquet file contains row groups. Data for each column is stored together in the same row group. Each row group contains one or more chunks of data. A Parquet file includes metadata that describes the set of rows found in each chunk. It supports very efficient compression and encoding schemes.
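
The row-based versus columnar distinction can be pictured with plain Python structures. This is only a conceptual sketch with hypothetical records; it does not read or write real Avro, ORC or Parquet files, which are binary formats with their own libraries.

```python
# Hypothetical records; field names are illustrative only.
records = [
    {"id": 1, "product": "widget", "quantity": 3},
    {"id": 2, "product": "gadget", "quantity": 5},
    {"id": 3, "product": "widget", "quantity": 2},
]

# Row-oriented layout (the Avro idea): all fields of one record sit together.
row_layout = [tuple(record.values()) for record in records]

# Column-oriented layout (the ORC/Parquet idea): all values of one field sit together.
column_layout = {name: [record[name] for record in records] for name in records[0]}

# Per-column statistics, similar in spirit to those kept in an ORC stripe footer.
quantity = column_layout["quantity"]
stats = {
    "count": len(quantity),
    "sum": sum(quantity),
    "min": min(quantity),
    "max": max(quantity),
}
print(stats)
```

Reading one column touches a single list in the columnar layout, whereas the row layout requires visiting every record; that is the main reason columnar formats suit analytical queries.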

5 of 13

Relational Database

Image showing a relational database schema

6 of 13

Non-Relational Databases

Key-Value Database
Image showing a key-value database

Column Family Database
Image showing a column family database

Document Database
Image showing a document database

Graph Database
Image showing a graph database

7 of 13

Transactional Data processing

Online Transactional Processing (OLTP).

Transactional systems are often high-volume, sometimes handling many millions of transactions in a single day. The data being processed has to be accessible very quickly. The work performed by transactional systems is often referred to as Online Transactional Processing (OLTP). To keep the data reliable, OLTP systems use transactions that support the ACID properties:

Atomicity – each transaction is treated as a single unit, which succeeds completely or fails completely.

Consistency – transactions can only take the data in the database from one valid state to another.

Isolation – concurrent transactions cannot interfere with one another, and must result in a consistent database state.

Durability – when a transaction has been committed, it will remain committed.
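
Atomicity and consistency can be seen in a small sketch using Python's built-in `sqlite3` module. The `accounts` table, the `CHECK` rule and the `transfer` helper are hypothetical, chosen only to show a transaction that either commits completely or rolls back completely.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"  # consistency rule: no overdrafts
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move funds inside one transaction; return True if it committed."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


ok_small = transfer(conn, "A", "B", 30)   # both updates succeed and are committed
ok_large = transfer(conn, "A", "B", 500)  # debit violates the CHECK, credit is undone
print(ok_small, ok_large, balances(conn))
```

In the failed transfer the credit to B has already been applied when the debit from A is rejected, yet the rollback removes it, so the database never shows a half-finished transfer.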

8 of 13

Analytical data processing

Analytical data processing typically uses read-only (or read-mostly) systems that store vast volumes of historical data or business metrics.

Image showing an analytical database architecture with the numbered elements described below

1. Data files may be stored in a central data lake for analysis.
2. An extract, transform, and load (ETL) process copies data from files and OLTP databases into a data warehouse that is optimized for read activity.
3. Data in the data warehouse may be aggregated and loaded into an online analytical processing (OLAP) model, or cube.
4. The data in the data lake, data warehouse, and analytical model can be queried to produce reports, visualizations, and dashboards.
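
The numbered flow above can be sketched in miniature with the standard library. Everything here is a stand-in: a CSV string plays the data lake file, a list of tuples plays the OLTP source, an in-memory SQLite database plays the warehouse, and the `fact_sales` table and column names are hypothetical.

```python
import csv
import io
import sqlite3

# Step 1: a hypothetical file sitting in the data lake.
LAKE_FILE = "order_id,region,amount\n1,North,10.50\n2,South,4.00\n"
# Hypothetical rows exported from an OLTP database (note the inconsistent casing).
OLTP_ROWS = [(3, "north", "7.25"), (4, "South", "1.00")]


def extract():
    """Gather raw rows from both sources."""
    reader = csv.DictReader(io.StringIO(LAKE_FILE))
    file_rows = [(r["order_id"], r["region"], r["amount"]) for r in reader]
    return file_rows + OLTP_ROWS


def transform(rows):
    """Normalize types and region spelling."""
    return [
        (int(order_id), region.strip().title(), float(amount))
        for order_id, region, amount in rows
    ]


def load(rows):
    """Step 2: copy the cleaned rows into the warehouse table."""
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute(
        "CREATE TABLE fact_sales (order_id INTEGER, region TEXT, amount REAL)"
    )
    warehouse.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)
    warehouse.commit()
    return warehouse


warehouse = load(transform(extract()))

# Steps 3 and 4: aggregate the warehouse data and query it for a report.
totals = dict(
    warehouse.execute(
        "SELECT region, SUM(amount) FROM fact_sales GROUP BY region ORDER BY region"
    )
)
print(totals)
```

A real OLAP model pre-computes aggregates across many dimensions; the single `GROUP BY` here only hints at that idea.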

9 of 13

Data Roles

Database Administrator
Database design, Database maintenance, Database Availability, Backup and recovery, Security, User Admin

Data Engineer
Data workload design, Data store design, Management of data pipelines, Security, User Admin

Data Analyst
Exploring data to identify trends and relationships, designing and building analytical models, advanced analytics, reports and visualizations.

10 of 13

Azure Data Services - Storage

Azure SQL is the collective name for a family of relational database solutions based on the Microsoft SQL Server database engine.

Azure includes managed services for popular open-source relational database systems: MySQL, MariaDB, and PostgreSQL.

Azure Cosmos DB is a global-scale non-relational (NoSQL) database system that supports multiple application programming interfaces (APIs).

Azure Storage is a core Azure service that enables you to store data in blob containers, file shares, and tables.

11 of 13

Azure Data Services - integrated services

Azure Data Factory is an Azure service that enables you to define and schedule data pipelines to transfer and transform data.

Azure Synapse Analytics provides a single service interface for multiple analytical capabilities, including pipelines, SQL, Apache Spark, and Azure Synapse Data Explorer.

Azure Databricks combines the Apache Spark data processing platform with SQL database semantics and an integrated management interface to enable large-scale data analytics.

Azure HDInsight is an Azure service that provides Azure-hosted clusters for popular Apache open-source big data processing technologies.

12 of 13

Azure Data Services - Analytics

Azure Stream Analytics is a real-time stream processing engine that captures a stream of data from an input, applies a query, and writes the results to an output for analysis.

Azure Data Explorer is a service for querying log and telemetry data.

Microsoft Purview provides a solution for enterprise-wide data governance and discoverability.

Microsoft Power BI is a platform for analytical data modeling and reporting, used to create and share interactive data visualizations.
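
The input, query, output pattern described for Azure Stream Analytics can be pictured with a plain Python generator. This is an analogy only and does not use any Azure API; the telemetry events, the threshold and the helper `apply_query` are all hypothetical.

```python
def apply_query(events, predicate, projection):
    """Lazily filter and reshape events one at a time, as they arrive."""
    for event in events:
        if predicate(event):
            yield projection(event)


# Hypothetical telemetry arriving on the input.
input_stream = iter([
    {"device": "sensor-1", "temperature": 21.5},
    {"device": "sensor-2", "temperature": 38.0},
    {"device": "sensor-1", "temperature": 40.2},
])

# The output is just a list standing in for a real sink.
output_sink = []
for result in apply_query(
    input_stream,
    predicate=lambda e: e["temperature"] > 35,
    projection=lambda e: (e["device"], e["temperature"]),
):
    output_sink.append(result)

print(output_sink)
```

Each event is processed as soon as it is read rather than after the whole data set is available, which is the essential difference from batch processing.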

13 of 13
