
A structured life is a good life

When learning about the benefits of Spark and big data, you may have heard discussions about structured data versus semi-structured data versus unstructured data. Spark supports all three and provides a basis for treating them consistently; the only constraint is that the data should be record-based. Provided they are record-based, datasets can be transformed, enriched, and manipulated in the same way, regardless of their organization.
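To make the record-based idea concrete, here is a minimal sketch in plain Python rather than Spark itself. The sample inputs and the helper names (`csv_records`, `json_records`, `text_records`, `enrich`, `process`) are hypothetical and chosen only for illustration. The sketch assumes that each input can be cut into one record per line, which is the constraint described above.

```python
import csv
import io
import json

# Hypothetical sample inputs, one per organization style.
CSV_TEXT = "name,city\nada,london\nlin,paris\n"
JSON_LINES = '{"name": "ada", "city": "london"}\n{"name": "lin"}\n'
RAW_TEXT = "ada visited london\nlin stayed home\n"


def csv_records(text):
    """Structured: every record has the same named fields."""
    return list(csv.DictReader(io.StringIO(text)))


def json_records(text):
    """Semi-structured: each line is a record, but fields may vary."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def text_records(text):
    """Unstructured: each line is wrapped as a one-field record."""
    return [{"line": line} for line in text.splitlines() if line.strip()]


def enrich(record, source):
    """The same enrichment step, applied to any record."""
    enriched = dict(record)
    enriched["field_count"] = len(record)  # count of original fields only
    enriched["source"] = source
    return enriched


def process(records, source):
    """Transform records without caring how the input was organized."""
    return [enrich(record, source) for record in records]


structured = process(csv_records(CSV_TEXT), "csv")
semi_structured = process(json_records(JSON_LINES), "json")
unstructured = process(text_records(RAW_TEXT), "text")
```

Only the three reader functions know about the input's organization; once the data is a collection of records, `process` treats every dataset identically. In Spark the same separation would hold between the data source readers and the transformations applied afterwards.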

However, it is worth noting that having unstructured data does not necessitate taking an unstructured approach. Having identified techniques for exploring datasets in the previous chapter, it would be tempting to dive straight into stashing data somewhere accessible and immediately commencing simple profiling analytics. In real-life situations, this activity often takes precedence over due diligence. Once again, we would encourage you to consider several key areas of interest, for example, file integrity, data quality, schedule management, version management, security, and so on, before embarking on this exploration. These should not be ignored, and many are large topics in their own right.

Therefore, while we have already covered many of these concerns in Chapter 2, Data Acquisition, and will study more later, for example in Chapter 13, Secure Data, in this chapter we are going to focus on data input and output formats specifically, exploring some of the methods that we can employ to ensure better data handling and management.
