
Direct source for data analysis

The least complicated database architecture uses the source data directly as the data source for further analysis. The following screenshot shows this scenario:

The single database in the preceding screenshot serves both the data manipulation performed by source applications and the reads performed by machine-learning models. This architecture is suitable only for a limited set of scenarios, and we have to consider its limitations. These include the following:

  • First of all, we must not block incoming work by reading data into our data science model. Source databases are usually designed either for DML operations or as a data warehouse. When the database is an online transactional processing (OLTP) database, such as one used by a library, an airline, or a bank, incoming transactions have priority over the range reads generated by machine-learning training. When the source database is a data warehouse, the situation is less complicated because data warehouses are designed for range reads.
  • We have very limited ability to adjust the database schema for our purposes (one or two datasets). In this case, almost the only way to transform data into the desired dataset is to create database views. More complex transformations require new tables, and at that point we are no longer using a direct source.
  • Furthermore, we have very limited ability to check data quality, so we usually have to trust the quality of the original data. As with the previous limitation, the only type of database object that is actually suitable here is the database view.
  • This architecture also assumes that we don't need to combine other data sources with the incoming data. Combining data from several sources in this direct model is difficult and inefficient because it requires distributed queries, which are likely to hurt performance and availability.
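The view-based approach described in the bullets above can be sketched as follows. This is a minimal illustration under stated assumptions, not a production setup: an in-memory SQLite database stands in for the source database, and the `orders` table, the `v_customer_features` view, and the `load_training_rows` helper are hypothetical names introduced only for this example. The view both reshapes the data into a model-ready dataset and filters out rows that fail a simple quality check, without touching the source table.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the real source database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (              -- hypothetical table owned by the source application
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER,
    amount      REAL,
    order_date  TEXT
);
INSERT INTO orders VALUES
    (1, 10, 25.0, '2024-01-05'),
    (2, 10, NULL, '2024-01-09'),   -- missing amount: excluded by the view
    (3, 11, 40.0, '2024-02-01'),
    (4, 10, 15.0, '2024-02-03');

-- The view is the only object we add; the source schema stays unchanged.
CREATE VIEW v_customer_features AS
SELECT customer_id,
       COUNT(*)    AS order_count,
       SUM(amount) AS total_amount
FROM orders
WHERE amount IS NOT NULL AND amount >= 0   -- simple data-quality filter
GROUP BY customer_id;
""")


def load_training_rows(connection):
    """Read the model-ready dataset through the view (hypothetical helper)."""
    cursor = connection.execute(
        "SELECT customer_id, order_count, total_amount "
        "FROM v_customer_features ORDER BY customer_id"
    )
    return cursor.fetchall()


training_rows = load_training_rows(conn)
print(training_rows)  # [(10, 2, 40.0), (11, 1, 40.0)]
```

Because the view is evaluated at query time, every training read still runs against the source tables, so the first limitation (competing with incoming transactions) continues to apply.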

Despite these limitations, this approach also has the following benefits:

  • Data for making predictions is accessible as soon as it arrives in the source database. Our machine-learning model can therefore read incoming data directly, without the extra effort of transforming or moving it first.
  • Data for training is likewise always accessible: whenever the source system is running, our data is available.
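The first benefit can be illustrated with a small sketch. It assumes an in-memory SQLite database as a stand-in for the source system; the `readings` table and the `fetch_new_rows` helper are hypothetical names used only for this example. Rows written by the source application become visible to the scoring code on the next query, with no intermediate load step.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the real source database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (reading_id INTEGER PRIMARY KEY, value REAL)"
)


def fetch_new_rows(connection, last_seen_id):
    """Return rows that arrived after the last row already scored (hypothetical helper)."""
    return connection.execute(
        "SELECT reading_id, value FROM readings "
        "WHERE reading_id > ? ORDER BY reading_id",
        (last_seen_id,),
    ).fetchall()


# The source application writes a row ...
conn.execute("INSERT INTO readings (value) VALUES (?)", (3.5,))
conn.commit()
# ... and the model side can read it immediately.
first_batch = fetch_new_rows(conn, 0)

# A later write is picked up by the next read, starting after the last seen id.
conn.execute("INSERT INTO readings (value) VALUES (?)", (7.25,))
conn.commit()
second_batch = fetch_new_rows(conn, first_batch[-1][0])

print(first_batch)   # [(1, 3.5)]
print(second_batch)  # [(2, 7.25)]
```

Tracking the last seen key keeps each read small, which matters when the source is an OLTP database whose transactions take priority.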