
Data processing libraries

The standard Java library is very rich and offers a lot of tools for data processing, such as collections, I/O tools, data streams, and means of parallel task execution. 
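As a small illustration of these standard tools, the following sketch combines collections with the stream API, including a parallel stream. The class name, method names, and sample words are made up for this example and are not taken from the book.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class WordLengths {

    // Groups words by their length; a TreeMap keeps the keys in sorted order.
    static Map<Integer, List<String>> groupByLength(List<String> words) {
        return words.stream()
                .collect(Collectors.groupingBy(String::length, TreeMap::new, Collectors.toList()));
    }

    // Sums the lengths of all words; the parallel stream may split the work across threads.
    static int totalLength(List<String> words) {
        return words.parallelStream().mapToInt(String::length).sum();
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("data", "science", "java", "stream");
        System.out.println(groupByLength(words)); // prints {4=[data, java], 6=[stream], 7=[science]}
        System.out.println(totalLength(words));   // prints 21
    }
}
```

For a list this small the parallel version brings no benefit; it is shown only to indicate that switching between sequential and parallel execution is a one-call change.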

There are also very powerful extensions to the standard library, such as Google Guava and Apache Commons IO.

We will cover both the standard API for data processing and its extensions in Chapter 2, Data Processing Toolbox. In this book, we will use Maven to include external libraries such as Google Guava or Apache Commons IO. Maven is a dependency management tool that lets us specify external dependencies with a few lines of XML. For example, to add Google Guava, it is enough to declare the following dependency in pom.xml:

<dependency>
  <groupId>com.google.guava</groupId>
  <artifactId>guava</artifactId>
  <version>19.0</version>
</dependency>

Once this is declared, Maven goes to the Maven Central repository and downloads the specified version of the dependency. The best way to find dependency snippets for pom.xml (such as the previous one) is to use the search at https://mvnrepository.com or your favorite search engine.

Java gives an easy way to access databases through Java Database Connectivity (JDBC), a unified database access API. JDBC makes it possible to connect to virtually any relational database that supports SQL, such as MySQL, MS SQL, Oracle, PostgreSQL, and many others. This allows us to move data manipulation from Java to the database side.
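To make the idea of pushing work to the database more concrete, here is a minimal JDBC sketch. The table name (users), its columns, the connection URL, and the credentials are all assumptions made for illustration; a matching JDBC driver would also have to be on the classpath for the connection to succeed.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.LinkedHashMap;
import java.util.Map;

class JdbcAggregation {

    // Hypothetical table and columns, used only for illustration.
    static final String QUERY =
            "SELECT country, COUNT(*) AS cnt FROM users WHERE age >= ? "
            + "GROUP BY country ORDER BY country";

    // The filtering and grouping run inside the database; Java only copies the small result.
    static Map<String, Long> usersPerCountry(Connection conn, int minAge) throws SQLException {
        Map<String, Long> result = new LinkedHashMap<>();
        try (PreparedStatement stmt = conn.prepareStatement(QUERY)) {
            stmt.setInt(1, minAge);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    result.put(rs.getString("country"), rs.getLong("cnt"));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed connection details; replace with a real URL, user, and password.
        String url = "jdbc:postgresql://localhost:5432/example";
        try (Connection conn = DriverManager.getConnection(url, "user", "password")) {
            System.out.println(usersPerCountry(conn, 18));
        } catch (SQLException e) {
            // Reached when no suitable driver is found or the database is unreachable.
            System.err.println("Could not query the database: " + e.getMessage());
        }
    }
}
```

Because the code depends only on the java.sql interfaces, switching to another relational database should mostly come down to changing the URL and the driver dependency.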

When it is not possible to use a database for handling tabular data, we can use DataFrame libraries to do it directly in Java. The DataFrame is a data structure that originally comes from R, and it makes it easy to manipulate tabular data in the program without resorting to an external database.

For example, with DataFrames it is possible to filter rows based on some condition, apply the same operation to each element of a column, group by some condition or join with another DataFrame. Additionally, some data frame libraries make it easy to convert tabular data to a matrix form so that the data can be used by machine learning algorithms. 
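The operations listed above can be sketched with plain Java streams over a list of row objects. This is not a DataFrame library, only an approximation of what such libraries provide; the Row class, its columns, and the sample values are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class TabularOps {

    // One row of a hypothetical table: a categorical column and two numeric columns.
    static final class Row {
        final String city;
        final double area;
        final double price;

        Row(String city, double area, double price) {
            this.city = city;
            this.area = area;
            this.price = price;
        }
    }

    // Filter: keep only the rows whose area is at least minArea.
    static List<Row> filterByArea(List<Row> rows, double minArea) {
        return rows.stream().filter(r -> r.area >= minArea).collect(Collectors.toList());
    }

    // Column-wise operation: price per unit of area for every row.
    static List<Double> pricePerArea(List<Row> rows) {
        return rows.stream().map(r -> r.price / r.area).collect(Collectors.toList());
    }

    // Group by: average price per city, with cities in sorted order.
    static Map<String, Double> meanPriceByCity(List<Row> rows) {
        return rows.stream().collect(Collectors.groupingBy(
                (Row r) -> r.city, TreeMap::new, Collectors.averagingDouble((Row r) -> r.price)));
    }

    // Matrix form: one double[] per row holding the numeric columns.
    static double[][] toMatrix(List<Row> rows) {
        return rows.stream().map(r -> new double[] {r.area, r.price}).toArray(double[][]::new);
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row("Berlin", 50.0, 200.0),
                new Row("Berlin", 80.0, 300.0),
                new Row("Paris", 40.0, 400.0));
        System.out.println(filterByArea(rows, 45.0).size()); // prints 2
        System.out.println(pricePerArea(rows));              // prints [4.0, 3.75, 10.0]
        System.out.println(meanPriceByCity(rows));           // prints {Berlin=250.0, Paris=400.0}
        System.out.println(toMatrix(rows).length);           // prints 3
    }
}
```

A DataFrame library typically wraps the same ideas behind column names and a tabular API, so that the row class does not have to be written by hand.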

There are a few data frame libraries available in Java.

We will also cover databases and data frames in Chapter 2, Data Processing Toolbox, and we will use DataFrames throughout the book.

There are also more complex data processing libraries, such as Spring Batch (http://projects.spring.io/spring-batch/). They allow us to create complex data pipelines (called ETLs, from Extract-Transform-Load) and manage their execution.

Additionally, there are libraries for distributed data processing.

We will talk about distributed data processing in Chapter 9, Scaling Data Science.
