
Data locality

The key to good data processing performance is avoiding network transfers. A few years ago this was universally true; today it matters less for tasks with high CPU demand and low I/O demand, but for data processing algorithms with low CPU demand and high I/O demand, it still holds.

HDFS is therefore one of the best ways to achieve data locality: chunks of files are distributed across the cluster nodes, in most cases on hard drives directly attached to the server systems. Those chunks can then be processed in parallel using the CPUs of the machines where the individual data chunks reside, avoiding network transfer.
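The idea can be illustrated with a minimal pure-Python sketch (the node and block names are hypothetical, and real HDFS schedules this through the NameNode's block placement information rather than a plain dictionary): each node counts words only in the chunks stored on its own disks, and only the small per-node results cross the network to be merged.

```python
# Hypothetical placement of one file's chunks across three cluster nodes.
blocks_on_node = {
    "node1": ["to be or", "not to be"],
    "node2": ["that is the", "question"],
    "node3": ["whether tis nobler"],
}

def local_word_count(chunks):
    """Runs on the node holding the chunks: no raw data leaves the machine."""
    counts = {}
    for chunk in chunks:
        for word in chunk.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

# Each node processes its local chunks in parallel; only these small
# partial results are shipped over the network and merged.
partials = [local_word_count(chunks) for chunks in blocks_on_node.values()]
total = {}
for partial in partials:
    for word, n in partial.items():
        total[word] = total.get(word, 0) + n
```

The raw file chunks never move; only the compact partial counts do, which is exactly the traffic pattern data locality aims for.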

Another way to achieve data locality is to use Apache SparkSQL. Depending on the connector implementation, SparkSQL can make use of the data processing capabilities of the source engine. For example, when MongoDB is used in conjunction with SparkSQL, parts of the SQL statement are preprocessed by MongoDB before the data is sent upstream to Apache Spark.
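This pushdown idea can be sketched in a few lines of plain Python (this is a toy model, not the actual MongoDB connector, which negotiates pushdown through Spark's data source API rather than Python callables): the source engine applies the filter itself, so only matching rows travel upstream to Spark.

```python
# Toy "source engine" supporting predicate pushdown; all names hypothetical.
class SourceEngine:
    def __init__(self, rows):
        self.rows = rows

    def scan(self, predicate=None):
        # Without pushdown every row is shipped upstream; with pushdown the
        # source engine filters first and sends only the matching rows.
        if predicate is None:
            return list(self.rows)
        return [row for row in self.rows if predicate(row)]

engine = SourceEngine([
    {"city": "Berlin", "temp": 21},
    {"city": "Oslo", "temp": 9},
    {"city": "Rome", "temp": 28},
])

# No pushdown: all 3 rows cross the network, filtering happens downstream.
shipped_all = engine.scan()

# With pushdown: a clause like WHERE temp > 20 runs inside the source
# engine, so only 2 rows are sent upstream to the processing engine.
shipped_filtered = engine.scan(lambda row: row["temp"] > 20)
```

The benefit scales with selectivity: the more rows the source-side predicate eliminates, the less data has to cross the network.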
