Hadoop Distributed File System

You might consider using an alternative to HDFS, depending upon your cluster requirements. For instance, IBM offers GPFS (General Parallel File System) for improved performance.

GPFS might be a better choice because, coming from a high-performance computing background, it has full read-write capability, whereas HDFS is designed as a write-once, read-many filesystem. It also offers better performance than HDFS because it runs at the kernel level, whereas HDFS runs in a Java Virtual Machine (JVM), which in turn runs as an operating system process. GPFS also integrates with Hadoop and the Spark cluster tools. IBM runs setups with several hundred petabytes using GPFS.

Another commercial alternative is the MapR file system, which, besides performance improvements, supports mirroring, snapshots, and high availability.

Ceph is an open source alternative. Like HDFS, it is a distributed, fault-tolerant, and self-healing filesystem for commodity hard drives. It runs in the Linux kernel as well and addresses many of the performance issues that HDFS has. Other promising candidates in this space are Alluxio (formerly Tachyon), Quantcast File System (QFS), GlusterFS, and Lustre.

Finally, Cassandra is not a filesystem but a NoSQL key-value store. Because it is tightly integrated with Apache Spark, it is treated as a valid and powerful alternative to HDFS, or even to any other distributed filesystem. This is especially true because it supports predicate push-down using Apache Spark SQL and the Catalyst optimizer, which we will cover in the following chapters.
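To give a rough feel for why predicate push-down matters, here is a minimal sketch in plain Python. It is only a toy model and does not use the Spark or Cassandra APIs: the `ToyKeyValueSource` class, the `is_sensor_7` predicate, and the `sensor_id` and `value` fields are all hypothetical names invented for illustration. The sketch assumes that the cost we care about is the number of rows the store has to ship to the compute side.

```python
from typing import Callable, Dict, Iterable, List, Optional

Row = Dict[str, object]


class ToyKeyValueSource:
    """Hypothetical stand-in for a remote store; counts the rows it ships out."""

    def __init__(self, rows: Iterable[Row]) -> None:
        self._rows: List[Row] = list(rows)
        self.rows_transferred = 0

    def scan(self, predicate: Optional[Callable[[Row], bool]] = None) -> List[Row]:
        # When a predicate is supplied, filtering happens "inside" the source,
        # so only matching rows cross the (simulated) network boundary.
        result = [r for r in self._rows if predicate is None or predicate(r)]
        self.rows_transferred += len(result)
        return result


def is_sensor_7(row: Row) -> bool:
    """Hypothetical filter: keep only the readings of one sensor."""
    return row["sensor_id"] == 7


rows = [{"sensor_id": i % 10, "value": i} for i in range(1000)]

# Without push-down: fetch every row, then filter on the compute side.
source_a = ToyKeyValueSource(rows)
without_pushdown = [r for r in source_a.scan() if is_sensor_7(r)]

# With push-down: hand the predicate to the source and fetch only matches.
source_b = ToyKeyValueSource(rows)
with_pushdown = source_b.scan(predicate=is_sensor_7)

print(source_a.rows_transferred, source_b.rows_transferred)
```

Both approaches return the same rows, but in this toy model the second one moves only a tenth of the data. A real optimizer decides which filters a data source can evaluate itself, which this sketch does not attempt to model.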
