Implicit schema discovery

One important aspect of the DataSource API is implicit schema discovery, which is available for a subset of data sources. When such a source is loaded, the resulting DataFrame or Dataset exposes not only the individual columns but also their names and types.

Take a JSON file, for example. The column names are already explicitly present in the file. Because JSON objects have a dynamic schema, the complete JSON file is read by default to discover all possible column names. The column types are inferred during the same parsing pass.
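
To illustrate why the whole file has to be scanned, here is a minimal sketch in plain Python. It is not Spark's implementation; the function name `infer_schema` and the sample records are hypothetical and exist only for this illustration. The sketch merges the keys and value types seen in every record, so a column that first appears late in the data is still discovered.

```python
import json

def infer_schema(lines):
    """Scan every JSON record and collect column names with the types seen."""
    schema = {}
    for line in lines:
        record = json.loads(line)
        for name, value in record.items():
            # Record the Python type name of each value under its column name.
            schema.setdefault(name, set()).add(type(value).__name__)
    return schema

# Hypothetical sample data: the second record introduces a column
# ("score") that the first record does not have.
records = [
    '{"id": 1, "name": "alice"}',
    '{"id": 2, "name": "bob", "score": 9.5}',
]

schema = infer_schema(records)
for column in sorted(schema):
    print(column, sorted(schema[column]))
```

Had the scan stopped after the first record, the `score` column would have been missed, which is the reason a full pass is the default.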

If the JSON file is very large and you want to keep the lazy loading behavior that Apache Spark data objects usually support, you can specify a fraction of the data to be sampled. The column names and types are then inferred from that sample instead of from the complete file.
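
The trade-off of sampling can be sketched in plain Python as follows. This is an illustration under stated assumptions, not Spark code: `infer_columns`, its `fraction` and `seed` parameters, and the generated records are all hypothetical. In Spark itself, the JSON reader exposes a `samplingRatio` option for this purpose.

```python
import json
import random

def infer_columns(lines, fraction=1.0, seed=0):
    """Infer column names from a random sample of JSON records.

    Each record is parsed with probability `fraction`; skipped records
    contribute nothing to the inferred set of columns.
    """
    rng = random.Random(seed)
    columns = set()
    for line in lines:
        if rng.random() < fraction:
            columns.update(json.loads(line).keys())
    return columns

# Hypothetical data: 999 records with one column, plus a single record
# that carries an additional, rarely used column.
records = ['{"id": %d}' % i for i in range(999)]
records.append('{"id": 999, "rare": true}')

full = infer_columns(records, fraction=1.0)
sampled = infer_columns(records, fraction=0.1)

print("full scan:", sorted(full))
print("10% sample:", sorted(sampled))
print("columns missed by the sample:", sorted(full - sampled))
```

A sample parses fewer records, but a column that occurs in only a few records may not be part of the inferred schema, so the fraction should be chosen with the regularity of the data in mind.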

Another example is the Java Database Connectivity (JDBC) data source, where the schema doesn't need to be inferred at all because it is read directly from the source database.
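
The idea of reading a schema from the database rather than inferring it from the data can be sketched with Python's built-in `sqlite3` module. SQLite stands in for a JDBC source here purely so that the example is self-contained; the table `users` and the helper `read_schema` are hypothetical, and this is not how Spark's JDBC data source is implemented.

```python
import sqlite3

# An in-memory database with one table; no rows are needed, because the
# schema comes from the database catalog and not from the data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, score REAL)")

def read_schema(connection, table):
    """Return (column name, declared type) pairs for a table.

    The table name is interpolated into the statement, so this helper is
    only suitable for trusted, hard-coded table names.
    """
    # PRAGMA table_info yields rows of
    # (cid, name, type, notnull, dflt_value, pk).
    rows = connection.execute(f"PRAGMA table_info({table})").fetchall()
    return [(row[1], row[2]) for row in rows]

print(read_schema(conn, "users"))
```

Because the database already declares the name and type of each column, no records have to be parsed to obtain the schema, which is what distinguishes this case from the JSON example.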
