MSC Azure Data Fundamentals - 1 Core Concepts
- Created by: funkyd101
- Created on: 12-07-22 10:32
Types of Data
Structured data is data that adheres to a fixed schema, so all of the data has the same fields or properties.
Semi-structured data is information that has some structure, but which allows for some variation between entity instances.
Unstructured data is data with no specific structure. Not all data is structured or even semi-structured: documents, images, audio and video data, and binary files might not have a specific structure.
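The three categories above can be illustrated with a minimal Python sketch. The Customer class, the customer_id field, and the sample byte string are assumptions made for illustration only.

```python
from dataclasses import dataclass, fields

# Structured: every record has exactly the same fields (a fixed schema).
@dataclass
class Customer:  # hypothetical schema, for illustration only
    customer_id: int
    first_name: str
    last_name: str

structured = [
    Customer(1, "Joe", "Jones"),
    Customer(2, "Samir", "Nadoy"),
]

# Semi-structured: records share some fields but may vary between instances.
semi_structured = [
    {"firstName": "Joe", "email": "joe@litware.com"},
    {"firstName": "Samir"},
]

# Unstructured: raw bytes with no schema that a program can rely on.
unstructured = b"\x89PNG\r\n\x1a\n"

# Collect the distinct field layouts found in each collection.
structured_fields = {tuple(f.name for f in fields(c)) for c in structured}
semi_structured_fields = {tuple(sorted(d)) for d in semi_structured}

print(len(structured_fields))       # one layout shared by all records
print(len(semi_structured_fields))  # more than one layout
```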
Common File Formats
Delimited text files
comma-separated values (CSV)
tab-separated values (TSV)
space-delimited
JavaScript Object Notation (JSON)
Extensible Markup Language (XML)
Binary Large Object (BLOB)
- A general term used to describe many different unstructured file formats.
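As a rough sketch of how delimited text files differ only in their separator, the following Python reads the same hypothetical records from CSV and TSV text. The column names and the read_delimited helper are assumptions for illustration.

```python
import csv
import io

# The same two records, once comma-separated and once tab-separated.
csv_text = "firstName,lastName\nJoe,Jones\nSamir,Nadoy\n"
tsv_text = "firstName\tlastName\nJoe\tJones\nSamir\tNadoy\n"

def read_delimited(text, delimiter):
    """Parse delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))

csv_rows = read_delimited(csv_text, ",")
tsv_rows = read_delimited(tsv_text, "\t")

# Only the delimiter differs; the parsed records are identical.
print(csv_rows == tsv_rows)
```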
JSON file format
JSON
{
  "customers":
  [
    {
      "firstName": "Joe",…
      [
        {"type": "home",… },
        {"type": "email",…}
      ]
    },
    {
      "firstName": "Samir",…
    }
  ]
}
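A minimal sketch of reading a JSON document of this general shape in Python. The document below is a hypothetical, complete stand-in rather than the elided sample itself: the "contact" and "value" keys are assumptions for illustration.

```python
import json

# Hypothetical document; "contact" and "value" are assumed key names.
document = """
{
  "customers": [
    {
      "firstName": "Joe",
      "contact": [
        {"type": "home", "value": "555 123-1234"},
        {"type": "email", "value": "joe@litware.com"}
      ]
    },
    {"firstName": "Samir"}
  ]
}
"""

data = json.loads(document)

# Semi-structured data: not every customer has a "contact" list,
# so fall back to an empty list when the key is missing.
for customer in data["customers"]:
    contact_types = [c["type"] for c in customer.get("contact", [])]
    print(customer["firstName"], contact_types)
```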
XML file format
<XML>
  <Customers>
    <Customer name="Joe" lastName="Jones">
      <ContactDetails>
        <Contact type="home" number="555 123-1234"/>
        <Contact type="email" address="joe@litware.com"/>
      </ContactDetails>
    </Customer>
    <Customer name="Samir" lastName="Nadoy">
      <ContactDetails>
        <Contact type="email" address="samir@northwind.com"/>
      </ContactDetails>
    </Customer>
  </Customers>
</XML>
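The customer data above could be read with Python's standard xml.etree.ElementTree module, roughly as follows. This sketch uses Customers as the root element and adds the whitespace between attributes that well-formed XML requires.

```python
import xml.etree.ElementTree as ET

xml_text = """
<Customers>
  <Customer name="Joe" lastName="Jones">
    <ContactDetails>
      <Contact type="home" number="555 123-1234"/>
      <Contact type="email" address="joe@litware.com"/>
    </ContactDetails>
  </Customer>
  <Customer name="Samir" lastName="Nadoy">
    <ContactDetails>
      <Contact type="email" address="samir@northwind.com"/>
    </ContactDetails>
  </Customer>
</Customers>
""".strip()

root = ET.fromstring(xml_text)

# Map each customer's name attribute to the types of their contact entries.
contacts = {
    customer.get("name"): [c.get("type") for c in customer.iter("Contact")]
    for customer in root.findall("Customer")
}
print(contacts)
```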
Optimized file formats
Avro is a row-based format. Each Avro file contains a header that describes the structure of the data in its records. This header is stored as JSON. The data is stored as binary information.
ORC (Optimized Row Columnar format) organizes data into columns rather than rows. An ORC file contains stripes of data. Each stripe holds the data for a column or set of columns. A stripe contains an index into the rows in the stripe, the data for each row, and a footer that holds statistical information (count, sum, max, min, and so on) for each column.
Parquet is another columnar data format. A Parquet file contains row groups. Data for each column is stored together in the same row group. Each row group contains one or more chunks of data. A Parquet file includes metadata that describes the set of rows found in each chunk. It supports very efficient compression and encoding schemes.
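The difference between row-based and columnar layouts can be sketched in plain Python. This is only a conceptual illustration using in-memory lists, not the actual Avro, ORC, or Parquet binary formats; the sample records and the column_stats helper are assumptions.

```python
# Row-based layout (as in Avro): the values of each record are kept together.
rows = [
    {"id": 1, "product": "widget", "quantity": 3},
    {"id": 2, "product": "gadget", "quantity": 5},
    {"id": 3, "product": "widget", "quantity": 2},
]

# Columnar layout (as in ORC and Parquet): the values of each column
# are kept together, so a query can read one column without the others.
columns = {name: [row[name] for row in rows] for name in rows[0]}

def column_stats(values):
    """Per-column statistics of the kind an ORC stripe footer holds."""
    return {
        "count": len(values),
        "sum": sum(values),
        "min": min(values),
        "max": max(values),
    }

quantity_stats = column_stats(columns["quantity"])
print(columns["quantity"])
print(quantity_stats)
```

Keeping statistics per column lets a reader skip data that cannot match a filter, for example a block whose maximum quantity is below the value being searched for.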
Relational Database
Non-relational Databases
Key-Value Database
Column Family Database
Document Database
Graph Database
Transactional Data processing
Online Transactional Processing (OLTP).
Transactional systems are often high-volume, sometimes handling many millions of transactions in a single day. The data being processed has to be accessible very quickly. The work performed by transactional systems is often referred to as Online Transactional Processing (OLTP).
Transactions in OLTP systems are expected to support ACID semantics:
Atomicity – each transaction is treated as a single unit, which succeeds completely or fails completely.
Consistency – transactions can only take the data in the database from one valid state to another.
Isolation – concurrent transactions cannot interfere with one another, and must result in a consistent database state.
Durability – when a transaction has been committed, it will remain committed.
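Atomicity and consistency can be illustrated with a small sketch using Python's built-in sqlite3 module. The accounts table, the account names, and the transfer helper are assumptions for illustration; a production OLTP system would differ in many ways.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint defines what counts as a valid state (consistency).
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 50)])
conn.commit()

def transfer(conn, source, target, amount):
    """Move funds as one transaction; return True if it was committed."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, target),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    """Return the current balances as a dict of name to balance."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))

ok_small = transfer(conn, "A", "B", 30)   # both updates succeed
ok_large = transfer(conn, "A", "B", 500)  # debit violates the CHECK constraint
final = balances(conn)
print(ok_small, ok_large, final)
```

In the failed transfer the credit to B has already been applied when the debit from A is rejected, so the rollback undoes both updates and the balances stay as they were after the first transfer.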
Analytical data processing
Analytical data processing typically uses read-only (or read-mostly) systems that store vast volumes of historical data or business metrics.
1. Data files are stored for analysis, for example in a central data lake.
2. An extract, transform, and load (ETL) process copies data from files and OLTP databases into a data warehouse that is optimized for read activity.
3. Data in the data warehouse may be aggregated and loaded into an online analytical processing (OLAP) model, or cube.
4. The data in the data lake, data warehouse, and analytical model can be queried to produce reports, visualizations, and dashboards.
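The steps above can be sketched end to end with Python's standard library. Everything here is a stand-in chosen for illustration: a CSV string plays the role of a file in the data lake, in-memory SQLite databases play the OLTP database and the data warehouse, and the table and column names are assumptions.

```python
import csv
import io
import sqlite3

# Step 1: a data file, standing in for a file held in a data lake.
lake_file = "order_id,region,amount\n1,North,120.0\n2,South,80.0\n"

# An operational (OLTP) database with inconsistently cased region names.
oltp = sqlite3.connect(":memory:")
oltp.execute("CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL)")
oltp.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(3, "north", 40.0), (4, "South", 60.0)],
)

# Step 2: extract, transform, and load (ETL) into the warehouse.
def extract():
    """Read rows from both the file and the OLTP database."""
    reader = csv.DictReader(io.StringIO(lake_file))
    file_rows = [(int(r["order_id"]), r["region"], float(r["amount"])) for r in reader]
    db_rows = oltp.execute("SELECT order_id, region, amount FROM orders").fetchall()
    return file_rows + db_rows

def transform(rows):
    """Normalize region names so that values from both sources match."""
    return [(oid, region.strip().title(), amount) for oid, region, amount in rows]

warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE fact_orders (order_id INTEGER, region TEXT, amount REAL)"
)

def load(rows):
    """Write the cleaned rows into the warehouse table."""
    warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    warehouse.commit()

load(transform(extract()))

# Steps 3 and 4: aggregate the warehouse data and query it for a report.
totals = dict(
    warehouse.execute("SELECT region, SUM(amount) FROM fact_orders GROUP BY region")
)
print(totals)
```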
Data Roles
Database Administrator
Database design, Database maintenance, Database Availability, Backup and recovery, Security, User Admin
Data Engineer
Data workload design, Data store design, Management of data pipelines, Security, User Admin
Data Analyst
Exploring data to identify trends and relationships, designing and building analytical models, enabling advanced analytics, and creating reports and visualizations.
Azure Data Services - Storage
Azure SQL is the collective name for a family of relational database solutions based on the Microsoft SQL Server database engine.
Azure includes managed services for popular open-source relational database systems, including MySQL, MariaDB, and PostgreSQL.
Azure Cosmos DB is a global-scale non-relational (NoSQL) database system that supports multiple application programming interfaces (APIs).
Azure Storage is a core Azure service that enables you to store data in Blob containers, File shares, and Tables.
Azure Data Services - Integrated Services
Azure Data Factory is an Azure service that enables you to define and schedule data pipelines to transfer and transform data.
Azure Synapse Analytics provides a single service interface for multiple analytical capabilities, including Pipelines, SQL, Apache Spark, and Azure Synapse Data Explorer.
Azure Databricks combines the Apache Spark data processing platform with SQL database semantics and an integrated management interface to enable large-scale data analytics.
Azure HDInsight is an Azure service that provides Azure-hosted clusters for popular Apache open-source big data processing technologies.
Azure Data Services - Analytics
Azure Stream Analytics is a real-time stream processing engine that captures a stream of data from an input, applies a query and writes the results to an output for analysis.
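The input, query, output pattern can be sketched in plain Python. This is not the Azure Stream Analytics query language or API; the tumbling_average function, the ten-second window, and the sample readings are assumptions for illustration, and the events are assumed to arrive in timestamp order.

```python
def tumbling_average(events, window_seconds):
    """Yield (window_start, average) as each fixed-size window closes.

    `events` is an iterable of (timestamp_seconds, value) pairs that is
    assumed to arrive in timestamp order.
    """
    current_start, total, count = None, 0.0, 0
    for timestamp, value in events:
        start = (timestamp // window_seconds) * window_seconds
        if current_start is not None and start != current_start:
            # The previous window is complete: emit its result downstream.
            yield current_start, total / count
            total, count = 0.0, 0
        current_start = start
        total += value
        count += 1
    if count:
        # Emit the final, possibly partial, window when the input ends.
        yield current_start, total / count

# Input: a stream of hypothetical sensor readings (timestamp, temperature).
readings = iter([(0, 20.0), (4, 22.0), (11, 30.0), (14, 34.0), (21, 25.0)])

# Query: average per 10-second window. Output: collected into a list here.
output = list(tumbling_average(readings, 10))
print(output)
```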
Azure Data Explorer offers querying of log and telemetry data.
Microsoft Purview provides a solution for enterprise-wide data governance and discoverability.
Microsoft Power BI is a platform for analytical data modeling and reporting used to create and share interactive data visualizations.