Chapter 2. Data Pipelines and Modeling

In the previous chapter, we looked at basic hands-on tools for exploring data, so we can now delve into more complex topics: statistical model building, optimal control, and science-driven tools and problems. I will say up front that we will only touch on some topics in optimal control, since this book is about ML in Scala and not the theory of data-driven business management, which might be an exciting topic for a book of its own.

In this chapter, I will stay away from specific implementations in Scala and discuss the problem of building a data-driven enterprise at a high level. Later chapters will address how to solve the smaller pieces of the puzzle. Special emphasis will be given to handling uncertainty. Uncertainty usually comes in several flavors. First, there can be noise in the information we are provided with. Second, the information can be incomplete: the system may have some degree of freedom in filling in the missing pieces, which results in uncertainty. Finally, there may be variations in the interpretation of the models and the resulting metrics. This last point is subtle, as most classic textbooks assume that we can measure things directly. Not only may the measurements be noisy, but the definition of the measure may also change over time; try measuring satisfaction or happiness. Certainly, we can avoid the ambiguity by saying that we can optimize only measurable metrics, as people usually do, but this significantly limits the application domain in practice. Nothing prevents the scientific machinery from taking the uncertainty in the interpretation into account as well.

Predictive models are often built just for data understanding. Etymologically, a model is a simplified representation of an actual complex building or process, made for exactly the purpose of making a point and convincing people, one way or another. The ultimate goal of predictive modeling, the kind of modeling I am concerned with in this book and in this chapter specifically, is to optimize business processes by taking the most important factors into account in order to make the world a better place. That was certainly a sentence with a lot of uncertainty embedded in it, but at least it looks like a much better goal than optimizing a click-through rate.

Let's look at a traditional business decision-making process. A traditional business might involve a set of C-level executives making decisions based on information that is usually obtained from a set of dashboards that graphically represent the data in one or several databases. The promise of an automated data-driven business is to make most of the decisions automatically, given the uncertainties, while eliminating human bias. This is not to say that we no longer need C-level executives, but they will be busy helping the machines make the decisions instead of the other way around.

In this chapter, we will cover the following topics:

  • Going through the basics of influence diagrams as a tool for decision making
  • Looking at variations of pure decision-making optimization in the context of adaptive Markov decision processes and the Kelly Criterion
  • Getting familiar with at least three different practical strategies for the exploration-exploitation trade-off
  • Describing the architecture of a data-driven enterprise
  • Discussing major architectural components of a decision-making pipeline
  • Getting familiar with standard tools for building data pipelines