
Data preprocessing

Given the goals of data preparation, Scala (with Apache Spark) was chosen as an easy and interactive way to manipulate the data:

import org.apache.spark.sql.SparkSession

val priceDataFileName: String = "bitstampUSD_1-min_data_2012-01-01_to_2017-10-20.csv"

val spark = SparkSession
  .builder()
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "E:/Exp/")
  .appName("Bitcoin Preprocessing")
  .getOrCreate()

val data = spark.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .load(priceDataFileName)
data.show(10)
>>>
Figure 5: A glimpse of the Bitcoin historical price dataset
println((data.count(), data.columns.size))

>>>

(3045857, 8)
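Since we loaded the CSV with header set to true but without schema inference, all eight columns come in as strings; Spark casts them implicitly in the arithmetic that follows. If in doubt, an optional check of the schema and basic statistics (our addition, not part of the original pipeline) looks like this:

// Optional check: column names/types and summary statistics.
data.printSchema()
data.describe("Open", "Close").show()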

In the preceding code, we load the data from the file downloaded from Kaggle and look at what is inside. There are 3,045,857 rows in the dataset and 8 columns, described before. Then we create the Delta column, containing the difference between the closing and opening prices, so that we focus on the minutes in which meaningful trading occurred:

val dataWithDelta = data.withColumn("Delta", data("Close") - data("Open"))

The following code labels our data by assigning 1 to rows whose Delta value is positive, and 0 otherwise:

import org.apache.spark.sql.functions._
import spark.sqlContext.implicits._

val dataWithLabels = dataWithDelta.withColumn("label", when($"Close" - $"Open" > 0, 1).otherwise(0))
rollingWindow(dataWithLabels, 22, outputDataFilePath, outputLabelFilePath)
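As a side note, the class balance of dataWithLabels can be inspected with a quick check (our addition, not part of the original pipeline):

// Count how many rows fall into each class (0 or 1).
dataWithLabels.groupBy("label").count().show()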

The rollingWindow() call transforms the original dataset into time series data. It takes the Delta values of WINDOW_SIZE consecutive rows (22 in this experiment) and makes a new row out of them. In this way, the first row has the Delta values from t0 to t21, the second one from t1 to t22, and so on. For each window, we record the label (1 or 0) of the row that immediately follows it.
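To make the windowing concrete, here is a minimal, self-contained sketch with a toy sequence and a window size of 3 instead of 22 (the values are illustrative only):

// Toy example: deltas for 6 consecutive minutes and their labels.
val deltas = Array(0.5, -1.2, 0.3, 0.9, -0.4, 1.1)
val labels = deltas.map(d => if (d > 0) 1 else 0)
val window = 3

// Each training row is a window of deltas; its target is the next row's label.
for (i <- 0 until deltas.length - window) {
  val x = deltas.slice(i, i + window).mkString(",")
  val y = labels(i + window)
  println(s"x = [$x] -> y = $y")
}
// x = [0.5,-1.2,0.3] -> y = 1   (delta at t3 is 0.9, positive)
// x = [-1.2,0.3,0.9] -> y = 0   (delta at t4 is -0.4, negative)
// x = [0.3,0.9,-0.4] -> y = 1   (delta at t5 is 1.1, positive)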

Finally, we save X and Y into files, after cutting the first 612,000 rows off the original dataset (dropFirstCount in the following code); the rolling window size is 22, and the labels are binary, that is, 0 or 1:

import java.io.{BufferedWriter, File, FileWriter}
import org.apache.spark.sql.DataFrame

val dropFirstCount: Int = 612000

def rollingWindow(data: DataFrame, window: Int, xFilename: String, yFilename: String): Unit = {
  var i = 0
  val xWriter = new BufferedWriter(new FileWriter(new File(xFilename)))
  val yWriter = new BufferedWriter(new FileWriter(new File(yFilename)))

  // Collect the rows together with their indices on the driver
  val zippedData = data.rdd.zipWithIndex().collect()
  System.gc()
  val dataStratified = zippedData.drop(dropFirstCount) // slice 612K

  while (i < (dataStratified.length - window)) {
    // x: the Delta values of the current window of rows
    val x = dataStratified
      .slice(i, i + window)
      .map(r => r._1.getAs[Double]("Delta")).toList

    // y: the label of the row immediately following the window
    val y = dataStratified.apply(i + window)._1.getAs[Integer]("label")

    val stringToWrite = x.mkString(",")
    xWriter.write(stringToWrite + "\n")
    yWriter.write(y + "\n")
    i += 1

    if (i % 10 == 0) {
      xWriter.flush()
      yWriter.flush()
    }
  }
  xWriter.close()
  yWriter.close()
}

In the preceding code segment, the output file paths were defined as:

val outputDataFilePath: String = "output/scala_test_x.csv"
val outputLabelFilePath: String = "output/scala_test_y.csv"
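Once written, the two CSV files can be read back for model training. The following is a minimal sketch of loading them into in-memory arrays (our addition; the names xs and ys are ours, and the file paths match the ones defined above):

import scala.io.Source

// Each line of scala_test_x.csv holds 22 comma-separated Delta values;
// each line of scala_test_y.csv holds the matching binary label.
val xs: Array[Array[Double]] = Source.fromFile(outputDataFilePath)
  .getLines()
  .map(_.split(",").map(_.toDouble))
  .toArray
val ys: Array[Int] = Source.fromFile(outputLabelFilePath)
  .getLines()
  .map(_.trim.toInt)
  .toArray

println(s"Loaded ${xs.length} windows and ${ys.length} labels")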