Deep Learning with R for Beginners
Mark Hodnett, Joshua F. Wiley, Yuxi (Hayden) Liu, Pablo Maldonado
The unreasonable effectiveness of data
Our first deep learning models on the binary classification task had fewer than 4,000 records. We did this so you could run the example quickly. For deep learning, you really need a lot more data, so we built a more complex model on a much larger dataset, which gave us an increase in accuracy. This process demonstrated the following:
- Establishing a baseline with other machine learning algorithms provides a good benchmark before using a deep learning model (a minimal baseline sketch follows this list)
- We had to create a more complex model and adjust the hyper-parameters for our bigger dataset
- The Unreasonable Effectiveness of Data
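To make the first point above concrete, here is a minimal baseline sketch in R. The synthetic data frame, the `label` column name, the 80/20 split, and the choice of logistic regression and random forest as baselines are illustrative assumptions rather than the book's exact code; substitute your own prepared dataset.

```r
# Minimal baseline sketch: synthetic data stands in for the book's dataset.
library(randomForest)
set.seed(42)

# Hypothetical binary-classification data: 10 numeric features, 0/1 label.
n <- 4000
df <- data.frame(matrix(rnorm(n * 10), n, 10))
df$label <- as.integer(rowSums(df[, 1:3]) + rnorm(n) > 0)

# Simple 80/20 train/test split.
train_idx <- sample(n, 0.8 * n)
train <- df[train_idx, ]
test  <- df[-train_idx, ]

# Baseline 1: logistic regression.
lr_fit <- glm(label ~ ., data = train, family = binomial)
lr_acc <- mean((predict(lr_fit, test, type = "response") > 0.5) == test$label)

# Baseline 2: random forest.
rf_fit <- randomForest(factor(label) ~ ., data = train)
rf_acc <- mean(predict(rf_fit, test) == test$label)

# Accuracies to beat before reaching for a deep learning model.
c(logistic_regression = lr_acc, random_forest = rf_acc)
```

Whatever deep learning model you train afterwards should be judged against these numbers; if it cannot beat a well-tuned simple baseline, the extra complexity is not yet paying off.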
The last point here is borrowed from an article by Peter Norvig (written with Alon Halevy and Fernando Pereira), available at https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35179.pdf. There is also a YouTube video with the same name. One of the main points in the article is this: invariably, simple models and a lot of data trump more elaborate models based on less data.
We have increased the accuracy of our deep learning model by 0.38%. Considering that our dataset has highly correlated variables and that our domain is modelling human activities, this is not bad. People are, well, predictable, so when attempting to predict what they do next, a small dataset usually works. In other domains, adding more data has much more of an effect. Consider a complex image-recognition task with color images where the image quality and format are not consistent. In that case, increasing our training data by a factor of 10 would have a much greater effect than in the earlier example. For many deep learning projects, you should include tasks to acquire more data from the very beginning of the project. This can be done by manually labeling data, by outsourcing tasks (for example, to Amazon Mechanical Turk), or by building some form of feedback mechanism into your application.
While other machine learning algorithms may also see an improvement in performance with more data, eventually adding more data stops making a difference and performance stagnates. This is because these algorithms were never designed for large, high-dimensional data and so cannot model the complex patterns in very large datasets. However, you can build increasingly complex deep learning architectures that can model these complex patterns. The following plot illustrates how deep learning algorithms can continue to take advantage of more data, and how their performance can still improve after performance on other machine learning algorithms stagnates:
[Figure: model performance versus training set size; deep learning models continue to improve with more data while other machine learning algorithms plateau]
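One rough way to produce such a plot for your own task is to train the same models on increasingly large subsets of the training data and record test accuracy at each size. The sketch below is only illustrative: the synthetic nonlinear data, the small network built with the keras package (assumed installed and configured), and the subset sizes are assumptions standing in for whatever model and dataset you are actually using.

```r
# Learning-curve sketch: accuracy versus training set size for a linear
# baseline and a small neural network (synthetic data, illustrative only).
library(keras)
set.seed(42)

# Synthetic nonlinear binary task: 20 features with interactions a linear model misses.
n <- 20000; p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- as.integer(x[, 1] * x[, 2] + sin(2 * x[, 3]) - x[, 4]^2 + 1 + rnorm(n, sd = 0.5) > 0)

train_rows <- 1:15000
x_train <- x[train_rows, ]; y_train <- y[train_rows]
x_test  <- x[-train_rows, ]; y_test  <- y[-train_rows]
train_df <- data.frame(x_train, label = y_train)
test_df  <- data.frame(x_test,  label = y_test)

sizes   <- c(500, 1000, 2500, 5000, 10000, 15000)
glm_acc <- nn_acc <- numeric(length(sizes))

for (i in seq_along(sizes)) {
  idx <- sample(train_rows, sizes[i])

  # Baseline: logistic regression on the subset.
  lr <- glm(label ~ ., data = train_df[idx, ], family = binomial)
  glm_acc[i] <- mean((predict(lr, test_df, type = "response") > 0.5) == y_test)

  # Small neural network on the same subset (architecture is arbitrary).
  nn <- keras_model_sequential() %>%
    layer_dense(units = 64, activation = "relu", input_shape = p) %>%
    layer_dense(units = 32, activation = "relu") %>%
    layer_dense(units = 1, activation = "sigmoid")
  nn %>% compile(optimizer = "adam", loss = "binary_crossentropy", metrics = "accuracy")
  nn %>% fit(x_train[idx, ], y_train[idx], epochs = 20, batch_size = 32, verbose = 0)
  nn_acc[i] <- mean((predict(nn, x_test) > 0.5) == y_test)
}

# Plot both learning curves: accuracy against (log-scaled) training set size.
plot(sizes, nn_acc, type = "b", log = "x", ylim = range(c(glm_acc, nn_acc)),
     xlab = "Training set size", ylab = "Test accuracy")
lines(sizes, glm_acc, type = "b", lty = 2, pch = 2)
legend("bottomright", c("Neural network", "Logistic regression"),
       lty = c(1, 2), pch = c(1, 2))
```

On a task with strong nonlinear structure, the network's curve usually keeps climbing as the subsets grow while the linear baseline flattens out early, which is the pattern the plot above is meant to convey.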