
Store

In this section, we will discuss storing data that has been collected from various sources. Consider the example of crawling reviews of organizations for sentiment analysis: the crawler gathers data from different sites, and each site displays its data in its own way.

Traditionally, data was processed using the ETL (Extract, Transform, and Load) procedure, which gathers data from various sources, modifies it according to the requirements, and uploads it to the store for further processing or display. Tools often used for such scenarios were spreadsheets, relational databases, business intelligence tools, and so on, and sometimes manual effort was also a part of it.
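The ETL steps described above can be sketched in a few lines of Python. This is only an illustrative sketch: the two sample sites, their field names, the `reviews` table, and the helper functions are all hypothetical, and an in-memory SQLite database stands in for whatever store a real pipeline would use.

```python
import sqlite3

# Hypothetical raw records: each site exposes reviews in its own shape.
SITE_A = [{"org": "Acme", "stars": 4, "text": "Good support"}]
SITE_B = [{"company": "Acme", "rating_out_of_10": 6, "body": "Slow delivery"}]


def extract():
    """Yield (source, record) pairs; a real crawler would fetch these."""
    for rec in SITE_A:
        yield "site_a", rec
    for rec in SITE_B:
        yield "site_b", rec


def transform(source, rec):
    """Map each site's layout onto one common schema with a 0-1 rating."""
    if source == "site_a":
        return (rec["org"], rec["stars"] / 5, rec["text"])
    return (rec["company"], rec["rating_out_of_10"] / 10, rec["body"])


def load(rows, conn):
    """Write the normalized rows into the store."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reviews (org TEXT, rating REAL, text TEXT)"
    )
    conn.executemany("INSERT INTO reviews VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(conn):
    """Run extract, transform, and load in sequence."""
    load([transform(source, rec) for source, rec in extract()], conn)


etl_conn = sqlite3.connect(":memory:")
run_etl(etl_conn)
```

Keeping the site-specific logic inside `transform` means that adding a new source only requires one more mapping, while the load step stays unchanged.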

The most common storage layer used in Big Data platforms is HDFS. Apache Hive, which runs on top of HDFS, provides HQL (Hive Query Language), which helps us do many analytical tasks that are traditionally done in business intelligence tools. A few other storage options that can be considered are Redis and MongoDB, and a processing engine such as Apache Spark is often used alongside them for analysis. Each storage option has its own way of working in the backend; however, many of them expose SQL or SQL-like APIs, which can be used to do further data analysis.
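As a rough illustration of analysis through a SQL interface, the sketch below computes the average rating per organization. It assumes a hypothetical `reviews` table with `org` and `rating` columns, and it uses Python's built-in `sqlite3` module purely as a stand-in; a query against Hive or another store would go through that system's own client, and the exact SQL dialect may differ.

```python
import sqlite3


def average_rating_by_org(conn):
    """Return (org, average rating, review count) rows, ordered by org."""
    query = """
        SELECT org, AVG(rating) AS avg_rating, COUNT(*) AS review_count
        FROM reviews
        GROUP BY org
        ORDER BY org
    """
    return conn.execute(query).fetchall()


# Hypothetical sample data; a real store would already hold the loaded reviews.
store = sqlite3.connect(":memory:")
store.execute("CREATE TABLE reviews (org TEXT, rating REAL)")
store.executemany(
    "INSERT INTO reviews VALUES (?, ?)",
    [("Acme", 0.5), ("Acme", 1.0), ("Globex", 0.25)],
)

summary = average_rating_by_org(store)
```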

There might be cases where we need to gather real-time data and present the results in real time. In such cases, the data does not necessarily need to be stored for future use; real-time analytics can run directly on the incoming data and produce results as requests arrive.
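One way to picture this is a running aggregate that is updated as each event arrives and then discards the event itself. The class and method names below are hypothetical, and the sketch keeps only a count and a sum per organization in memory; a production system would more likely rely on a stream-processing framework.

```python
from collections import defaultdict


class RunningSentiment:
    """Keep a per-organization running average without storing raw events."""

    def __init__(self):
        self._count = defaultdict(int)
        self._total = defaultdict(float)

    def update(self, org, score):
        """Fold one incoming review score into the aggregate."""
        self._count[org] += 1
        self._total[org] += score

    def average(self, org):
        """Return the current average, or None if no reviews were seen."""
        count = self._count.get(org, 0)
        if count == 0:
            return None
        return self._total[org] / count


# Hypothetical event stream; in practice this would be a live feed.
events = [("Acme", 1.0), ("Acme", 0.5), ("Globex", 0.0)]

tracker = RunningSentiment()
for org, score in events:
    tracker.update(org, score)
```

A request for the current sentiment of an organization can then be answered with `tracker.average(org)` at any point, without reading anything back from a store.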
