Titanic survival revisited with DL4J
In the preceding chapter, we solved the Titanic survival prediction problem using a Spark-based MLP. We also saw that the Spark-based MLP gives the user very little transparency into the layering structure, and there was no explicit way to define hyperparameters and so on.
Therefore, what I have done is take the training dataset, perform some preprocessing and feature engineering, and then randomly split the preprocessed dataset into training and test sets (to be precise, 70% for training and 30% for testing). First, we create the Spark session as follows:
SparkSession spark = SparkSession.builder()
.master("local[*]")
.config("spark.sql.warehouse.dir", "temp/")// change accordingly
.appName("TitanicSurvivalPrediction")
.getOrCreate();
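For reference, the snippets in this section assume the following imports (they are not shown in the original text; the package locations are from the Spark 2.x API and may vary slightly between versions):
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;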
In this chapter, we have seen that there are two CSV files. However, test.csv does not provide any ground-truth labels. Therefore, I decided to use only train.csv, so that we can compare the model's performance against known outcomes. So let's read the training dataset using the Spark read() API:
Dataset<Row> df = spark.sqlContext()
.read()
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("data/train.csv");
We have seen in Chapter 1, Getting Started with Deep Learning, that the Age and Fare columns have many null values. So, instead of writing a UDF for each column, here I just replace the missing values of the Age and Fare columns with their respective means (approximately 30 and 32.2, computed beforehand):
Map<String, Object> m = new HashMap<String, Object>();
m.put("Age", 30);    // approximate mean of the Age column
m.put("Fare", 32.2); // approximate mean of the Fare column
Dataset<Row> trainingDF1 = df.na().fill(m);
To get more detailed insights into handling missing/null values in machine learning, interested readers can take a look at Boyan Angelov's blog post at https://towardsdatascience.com/working-with-missing-data-in-machine-learning-9c0a430df4ce.
For simplicity, we can drop a few more columns too, such as "PassengerId", "Name", "Ticket", and "Cabin":
Dataset<Row> trainingDF2 = trainingDF1.drop("PassengerId", "Name", "Ticket", "Cabin");
Now, here comes the tricky part. Similar to Spark ML-based estimators, DL4J-based networks also need the training data in numeric form. Therefore, we now have to convert the categorical features into numerics. For that, we can use the StringIndexer() transformer. We will create two of them, one each for the "Sex" and "Embarked" columns:
StringIndexer sexIndexer = new StringIndexer()
.setInputCol("Sex")
.setOutputCol("sexIndex")
.setHandleInvalid("skip");//// we skip column having nulls
StringIndexer embarkedIndexer = new StringIndexer()
.setInputCol("Embarked")
.setOutputCol("embarkedIndex")
.setHandleInvalid("skip");//// we skip column having nulls
Then we will chain them into a single pipeline. Next, we will perform the transformation operation:
Pipeline pipeline = new Pipeline().setStages(new PipelineStage[] {sexIndexer, embarkedIndexer});
Then we will fit the pipeline, transform, and drop both the "Sex" and "Embarked" columns to get the transformed dataset:
Dataset<Row> trainingDF3 = pipeline.fit(trainingDF2).transform(trainingDF2).drop("Sex", "Embarked");
Our final preprocessed dataset will then contain only numerical features. Note that DL4J considers the last column to be the label column, which means it will treat "Pclass", "Age", "SibSp", "Parch", "Fare", "sexIndex", and "embarkedIndex" as features. Therefore, I placed the "Survived" column last:
Dataset<Row> finalDF = trainingDF3.select("Pclass", "Age", "SibSp","Parch", "Fare",
"sexIndex","embarkedIndex", "Survived");
finalDF.show();
Then we randomly split the dataset into training and test sets in a 70/30 ratio, that is, 70% for training and the remaining 30% to evaluate the model:
Dataset<Row>[] splits = finalDF.randomSplit(new double[] {0.7, 0.3});
Dataset<Row> trainingData = splits[0];
Dataset<Row> testData = splits[1];
Finally, we write both DataFrames out as separate CSV files to be used by DL4J:
trainingData
.coalesce(1)// coalesce(1) writes DF in a single CSV
.write()
.format("com.databricks.spark.csv")
.option("header", "false") // don't write the header
.option("delimiter", ",") // comma separated
.save("data/Titanic_Train.csv"); // save location
testData
.coalesce(1)// coalesce(1) writes DF in a single CSV
.write()
.format("com.databricks.spark.csv")
.option("header", "false") // don't write the header
.option("delimiter", ",") // comma separated
.save("data/Titanic_Test.csv"); // save location
Additionally, DL4J does not expect header information in the training set, so I intentionally skipped writing the header. Note also that Spark's save() actually creates a directory at each path (for example, data/Titanic_Train.csv/) containing a single part file, rather than a plain file of that name.
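To close the loop, here is a minimal sketch of how DL4J could consume the headerless CSV we just wrote. This reading step is not part of this section, so treat it as an assumption: the batch size and the part-file name are hypothetical (check what Spark actually produced inside the output directory):
import java.io.File;
import org.datavec.api.records.reader.RecordReader;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.split.FileSplit;
import org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;

int labelIndex = 7;   // "Survived" is the 8th (last) column
int numClasses = 2;   // binary outcome: survived or not
int batchSize = 128;  // an assumed batch size

// Hypothetical part-file name: inspect the data/Titanic_Train.csv/
// directory to find the file Spark actually wrote.
File trainFile = new File("data/Titanic_Train.csv/part-00000");
RecordReader reader = new CSVRecordReader(); // skips 0 lines, comma-delimited
reader.initialize(new FileSplit(trainFile));
DataSetIterator trainIter = new RecordReaderDataSetIterator(
        reader, batchSize, labelIndex, numClasses);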