
  • Learning Apache Apex
  • Thomas Weise, Munagala V. Ramanath, David Yan, Kenneth Knowles
  • 2021-07-02 22:38:39

Development process and methodology

Development of an Apex application starts with mapping the functional specification to operators (smaller functional building blocks), which can then be composed into a DAG to collectively provide the functionality required for the use case.
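
As a rough illustration of this decomposition, the following minimal sketch splits a tiny specification ("read comma-separated lines and collect the amount in the second field") into small steps and chains them into a linear flow. It uses plain Java only and is not the Apex DAG API; the names PipelineSketch, PARSE, EXTRACT_AMOUNT and run are hypothetical and chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for a pipeline. This is NOT the Apex API; it only
// illustrates splitting a specification into small, composable steps.
final class PipelineSketch {
    // "Operator" 1: split a comma-separated line into its fields.
    static final Function<String, String[]> PARSE = line -> line.split(",");

    // "Operator" 2: read the second field as an integer amount.
    static final Function<String[], Integer> EXTRACT_AMOUNT =
        fields -> Integer.parseInt(fields[1].trim());

    // Compose the steps into a linear flow and run it over an in-memory source.
    static List<Integer> run(List<String> source) {
        Function<String, Integer> flow = PARSE.andThen(EXTRACT_AMOUNT);
        List<Integer> sink = new ArrayList<>();
        for (String line : source) {
            sink.add(flow.apply(line));
        }
        return sink;
    }
}
```

Because each step has a single responsibility, any step can be replaced or tested without touching the others, which is the property the operator model relies on.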

This involves identifying the data sources, formats, transformations, and sinks for the application, and finding matching operators in the Apex library (which will be covered in the next chapter). In most cases, the library already provides the required connectors: it supports frequently used sources, such as files and Kafka, along with many other external systems that are part of the Apache Big Data ecosystem.

Because the operator library and its examples cover frequently used I/O cases and transformations, it is often possible to quickly assemble a preliminary end-to-end flow that covers a subset of the functionality, before building out the complete business logic in detail.

Examples that show how to work with frequently used library operators and accelerate the path to an initial running application can be found at https://github.com/apache/apex-malhar/tree/master/examples.

Having a basic pipeline working early on in the target environment (or at least close to it) allows important integration and operational requirements, such as security and access control, to be evaluated in parallel. It also establishes a baseline for iterative and parallel development, and for testing the full-featured operators. Experience from complex pipelines shows that an early basic pipeline reduces risk and provides better visibility into the progress of a larger project, especially one with many integration points and a larger development team. Essentially, development dependencies can follow the modular structure of the DAG, allowing the full pipeline to be built up gradually and functions further downstream to be developed in parallel with mocked input, when needed.
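
The idea of developing a downstream function against mocked input can be sketched as follows. This is plain Java under stated assumptions, not Apex code: the class MockedInputSketch, the method countByKey, and the hard-coded event names are hypothetical, and the Iterator merely stands in for whatever the real upstream operators would deliver.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical example; not part of the Apex library.
final class MockedInputSketch {
    // Downstream logic under development: count occurrences per key.
    // It depends only on the shape of its input, not on who produces it.
    static Map<String, Integer> countByKey(Iterator<String> upstream) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        while (upstream.hasNext()) {
            counts.merge(upstream.next(), 1, Integer::sum);
        }
        return counts;
    }

    // Mock that stands in for upstream operators that are not built yet.
    static Iterator<String> mockedUpstream() {
        return Arrays.asList("click", "view", "click").iterator();
    }
}
```

Once the real upstream part of the pipeline exists, only the input source changes; the downstream logic and its tests stay the same.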

A large project broken down into a series of smaller and more manageable milestones would roughly involve the following sequence of steps:

  1. Writing the Java code for a new or customized operator.
  2. Unit testing (in the IDE, no cluster environment needed).
  3. Integrating the operator into the DAG.
  4. Integration testing (testing the DAG with potentially mocked data, in the IDE).
  5. Configuring operator properties for the target environment (connector settings, and so on).
  6. End-to-end testing with a realistic data set in the target environment.
  7. Tuning (optimizing resource utilization and configuring appropriate platform attributes such as processing locality, memory and CPU allocation, scaling, and so on).
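
The first two steps, together with the notion of a configurable property from step 5, can be sketched in plain Java as follows. This is an illustration under stated assumptions rather than Apex code: ThresholdFilterSketch, its threshold property, and the runWith helper are hypothetical, and a real operator would build on the operator and port types provided by the platform.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical operator-like class; not an Apex operator.
final class ThresholdFilterSketch {
    // Property that would be configured per environment.
    private int threshold = 0;

    public void setThreshold(int threshold) {
        this.threshold = threshold;
    }

    public int getThreshold() {
        return threshold;
    }

    // Process one tuple: pass it to the output only if it reaches the threshold.
    public void process(int tuple, Consumer<Integer> output) {
        if (tuple >= threshold) {
            output.accept(tuple);
        }
    }

    // Small harness that can run in the IDE: feeds tuples in, collects what comes out.
    static List<Integer> runWith(int threshold, int... tuples) {
        ThresholdFilterSketch op = new ThresholdFilterSketch();
        op.setThreshold(threshold);
        List<Integer> emitted = new ArrayList<>();
        for (int t : tuples) {
            op.process(t, emitted::add);
        }
        return emitted;
    }
}
```

Keeping the processing logic free of cluster dependencies is what makes this kind of check possible without deploying anything.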

Following a similar sequence helps ensure that basic functional issues are discovered early on (ideally within the IDE, where they are far more efficient to debug and fix) before fully packaging and deploying the pipeline to a cluster.

In subsequent sections, we will look at each of these phases in more detail.
