
How to build robust ETL pipelines with AWS SQS

Scraping a large number of sites and a large volume of data can be a complicated and slow process, but it is one that can take great advantage of parallel processing, either locally with multiple processor threads, or by distributing scraping requests to remote scrapers using a message queue system. There may also be a need for multiple steps in a process, similar to an Extract, Transform, and Load (ETL) pipeline. These pipelines can also be easily built using a message queuing architecture in conjunction with the scraping.

Using a message queuing architecture gives our pipeline two advantages:

  • Robustness
  • Scalability

The processing becomes robust because, if processing of an individual message fails, the message can be re-queued and processed again. So if the scraper fails, we can restart it without losing the request to scrape the page, or the message queue system will deliver the request to another scraper.

It provides scalability, as multiple scrapers on the same or different systems can listen on the queue. Multiple messages can then be processed at the same time on different cores or, more importantly, on different systems. In a cloud-based scraper, you can scale the number of scraper instances up on demand to handle greater load.
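Before moving to a distributed queue, the two properties above can be sketched locally with Python's standard-library `queue` module: several workers consume from one shared queue (scalability), and a failed request is re-queued rather than lost (robustness). The `scrape` function, the URLs, and the retry limit here are hypothetical stand-ins, not part of any particular library:

```python
import queue
import threading

def scrape(url):
    # Hypothetical scrape step; a real scraper would fetch and parse the URL.
    if "bad" in url:
        raise RuntimeError("simulated scrape failure")
    return f"content of {url}"

def worker(tasks, results, max_retries=2):
    # Each worker pulls requests from the shared queue until it is empty.
    while True:
        try:
            url, attempts = tasks.get_nowait()
        except queue.Empty:
            return
        try:
            results.append(scrape(url))
        except Exception:
            # Robustness: re-queue the failed request instead of losing it,
            # up to a retry limit.
            if attempts < max_retries:
                tasks.put((url, attempts + 1))
        finally:
            tasks.task_done()

tasks = queue.Queue()
for url in ["http://example.com/a", "http://example.com/bad", "http://example.com/b"]:
    tasks.put((url, 0))

results = []
# Scalability: any number of workers can consume from the same queue.
threads = [threading.Thread(target=worker, args=(tasks, results)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After the workers finish, `results` holds the two successful pages; the failing URL has been retried and then dropped, without ever blocking the good requests.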

Common message queuing systems include Kafka, RabbitMQ, and Amazon SQS. Our example will utilize Amazon SQS, although both Kafka and RabbitMQ are excellent choices (we will see RabbitMQ in use later in the book). We use SQS to stay with the model of using AWS cloud-based services, as we did earlier in the chapter with S3.
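As a rough sketch of what this looks like with SQS and `boto3`, a producer pushes one message per scrape request, and a consumer receives a message, scrapes, and deletes the message only on success, so that a crashed or failing scraper leaves the message on the queue to be redelivered (after the visibility timeout) to another scraper. The `send_message`, `receive_message`, and `delete_message` calls are real boto3 SQS client operations; the queue URL, the JSON message format, and the `scrape` callable are assumptions for illustration:

```python
import json

def make_scrape_message(url):
    # Our own message format (a JSON body); SQS itself does not impose one.
    return json.dumps({"url": url})

def queue_scrape_requests(queue_url, urls):
    # Producer: push one scrape request per URL onto the SQS queue.
    import boto3  # requires AWS credentials to be configured
    sqs = boto3.client("sqs")
    for url in urls:
        sqs.send_message(QueueUrl=queue_url,
                         MessageBody=make_scrape_message(url))

def process_one(queue_url, scrape):
    # Consumer: receive up to one request and scrape it. The message is
    # deleted only after a successful scrape; if scrape() raises, the
    # message stays queued and is redelivered once its visibility
    # timeout expires.
    import boto3
    sqs = boto3.client("sqs")
    resp = sqs.receive_message(QueueUrl=queue_url,
                               MaxNumberOfMessages=1,
                               WaitTimeSeconds=10)  # long polling
    for msg in resp.get("Messages", []):
        url = json.loads(msg["Body"])["url"]
        scrape(url)
        sqs.delete_message(QueueUrl=queue_url,
                           ReceiptHandle=msg["ReceiptHandle"])
```

Multiple copies of the consumer, on any number of machines, can run this loop against the same queue URL, which is exactly the scalability property described above.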
