AWS big data ecosystem

Amazon's big data ecosystem has several software services that enable business insights from data. These services fall into four broad categories (Collect, Store, Analyze, and Orchestrate), as shown in the following diagram:

Figure 2.1: AWS big data ecosystem

Let's look at each category in detail.

Collect

The first step in any BI initiative is to collect data from external systems and bring it into AWS. AWS offers the following services for this:

  • Direct Connect: With Direct Connect, you can establish private connectivity between AWS and your enterprise data center, which provides an easy way to move data files from your applications to AWS for analysis
  • Snowball: Snowball (also known as AWS Import/Export Snowball) lets you import hundreds of terabytes of data quickly into AWS using secure, Amazon-provided appliances for transport
  • Kinesis and Kinesis Firehose: Kinesis services let you build custom applications that process or analyze streaming data, while Kinesis Firehose loads streaming data into AWS stores such as S3 and Redshift
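
As a rough illustration of the streaming option, the following sketch shows how an application event might be shaped for the Kinesis PutRecord call. The helper names, the stream name, and the event fields are hypothetical; the only AWS call assumed is boto3's `put_record` on a Kinesis client, and sending anything requires AWS credentials and an existing stream.

```python
import json


def to_kinesis_record(event: dict, stream_name: str, key_field: str) -> dict:
    """Build the keyword arguments for a Kinesis PutRecord call."""
    return {
        "StreamName": stream_name,
        # Kinesis carries an opaque byte payload; JSON is one common choice.
        "Data": json.dumps(event, sort_keys=True).encode("utf-8"),
        # Records that share a partition key are routed to the same shard.
        "PartitionKey": str(event[key_field]),
    }


def send_event(client, event: dict, stream_name: str, key_field: str):
    """Send one event; `client` is assumed to be boto3.client("kinesis")."""
    return client.put_record(**to_kinesis_record(event, stream_name, key_field))
```

With boto3 installed and configured, a call could look like `send_event(boto3.client("kinesis"), {"user_id": 42, "page": "/home"}, "clickstream", "user_id")`, where `clickstream` is a made-up stream name. Keeping the record-building step separate from the network call makes it easy to check the payload without touching AWS.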

Store

The collected data needs to be stored, and Amazon offers several options that you can choose from based on latency and budget requirements. The following is a summary:

  • S3: Amazon Simple Storage Service (S3) can be used to store and retrieve any amount of data. It is a highly reliable object store.
  • Glacier: Glacier is an extremely low-cost storage service that provides secure, durable, and flexible storage for data backup and archival (1 cent per GB per month).
  • RDS and Aurora: RDS enables easy setup of the most commonly used relational databases in AWS, including Oracle, MySQL, SQL Server, and PostgreSQL, and manages time-consuming administration tasks such as backups. Aurora is a MySQL-compatible engine offered through RDS at a fraction of the cost of commercial databases.
  • Redshift: Redshift provides a fast, fully managed data warehouse at a low cost ($1,000 per TB per year).
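
To put the two prices quoted above side by side, here is a small back-of-the-envelope calculation. It uses only the figures from the list (1 cent per GB per month for Glacier, $1,000 per TB per year for Redshift) and assumes decimal terabytes (1 TB = 1,000 GB). The function names are illustrative, and actual AWS pricing varies by region and changes over time.

```python
GLACIER_USD_PER_GB_MONTH = 0.01    # figure quoted in the list above
REDSHIFT_USD_PER_TB_YEAR = 1000.0  # figure quoted in the list above
GB_PER_TB = 1000                   # assumption: decimal terabytes


def glacier_cost_per_year(terabytes: float) -> float:
    """Yearly archival cost in USD at the quoted Glacier rate."""
    return terabytes * GB_PER_TB * GLACIER_USD_PER_GB_MONTH * 12


def redshift_cost_per_year(terabytes: float) -> float:
    """Yearly warehouse cost in USD at the quoted Redshift rate."""
    return terabytes * REDSHIFT_USD_PER_TB_YEAR


if __name__ == "__main__":
    for tb in (1, 10, 100):
        print(tb, "TB:",
              round(glacier_cost_per_year(tb), 2), "USD in Glacier vs",
              round(redshift_cost_per_year(tb), 2), "USD in Redshift")
```

At these rates, archiving 1 TB costs about $120 per year against $1,000 per year for keeping it queryable in the warehouse, which is the latency-versus-budget trade-off this section describes.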

Analyze

Once data is in AWS, there are several options for analyzing it. The following is a summary:

  • EMR: Amazon EMR provides a managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data at scale and on demand.
  • Machine Learning: Amazon Machine Learning provides visualization tools and wizards for creating machine learning models and executing them on your big data.
  • QuickSight: QuickSight is a fast, cloud-powered BI service and the subject of this book.
  • Athena: Athena is a query service that makes it easy to analyze data directly from files in S3 using standard SQL statements. Athena is serverless, so there is no infrastructure to provision, which makes it stand out.
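
To make the Athena description more concrete, the sketch below shows roughly how a SQL query over files in S3 could be submitted from Python. The database name, table name, and S3 output location are hypothetical placeholders; the only AWS call assumed is boto3's `start_query_execution` on an Athena client, which needs valid credentials and an existing table definition.

```python
def build_athena_request(sql: str, database: str, output_location: str) -> dict:
    """Build the keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        # The database that holds the table definitions over the S3 files.
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results as files under this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def start_query(client, request: dict) -> str:
    """Submit the query; `client` is assumed to be boto3.client("athena")."""
    response = client.start_query_execution(**request)
    return response["QueryExecutionId"]


# Hypothetical names, for illustration only.
EXAMPLE_REQUEST = build_athena_request(
    sql="SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page",
    database="analytics_demo",
    output_location="s3://example-bucket/athena-results/",
)
```

The query runs asynchronously: `start_query` returns an execution ID, and the results are later read from the output location or fetched through the Athena API. No cluster or server is created at any point, which is what serverless means in practice here.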

Orchestrate

To move, orchestrate, and integrate data between the various AWS stores, Amazon has two key products: Data Pipeline and Glue. The following is a summary of these products:

  • Data Pipeline: AWS Data Pipeline reliably moves data between different AWS compute and storage services, as well as on-premises data sources, at specified intervals.
  • Glue: Glue is a fully managed ETL service (announced in December 2016) with a data catalog. It crawls data sources, identifies data formats, lets you build transformations using an IDE, and schedules the resulting jobs.

This completes the AWS big data ecosystem overview. Next, let's look at how to onboard data to QuickSight in detail.
