Using PyArrow's filesystem interface for HDFS

PyArrow has a C++-based interface for HDFS. By default, it uses libhdfs, a JNI-based interface to the Java Hadoop client. Alternatively, we can use libhdfs3, a C++ library for HDFS. We connect to the NameNode using pa.hdfs.connect:

import pyarrow as pa
hdfs = pa.hdfs.connect(host='hostname', port=8020, driver='libhdfs')

If we change the driver to libhdfs3, we will use the C++ HDFS library from Pivotal Labs instead. Once the connection to the NameNode is made, the filesystem is accessed using the same methods as with hdfs3.

HDFS is preferred when the data is extremely large. It allows us to read and write data in chunks; this is helpful for accessing and processing streaming data. A nice comparison of the three native RPC client interfaces is presented in the following blog post: http://wesmckinney.com/blog/python-hdfs-interfaces/.
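The chunked access pattern mentioned above can be sketched as follows. Here the file object is an io.BytesIO stand-in, but the file-like object returned by hdfs.open supports the same read(n) loop; the chunk size and data are illustrative:

```python
import io

def process_in_chunks(f, chunk_size=4):
    # Read a file-like object chunk by chunk instead of all at once,
    # so memory use stays bounded regardless of file size.
    # chunk_size=4 is only for this tiny demo; a real job would use
    # something like 64 * 1024 * 1024.
    total = 0
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)  # replace with real per-chunk processing
    return total

# Stand-in for a file opened via hdfs.open(path, 'rb'):
f = io.BytesIO(b'0123456789')
print(process_in_chunks(f))  # 10
```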