
Chapter 1. Introduction

Imagine you're working as an administrator of a large IT infrastructure. You have just started receiving emails saying that a web application has stopped working. When you try to access the same page, it just doesn't load. What are the possibilities? Is it the router? Or the firewall? Perhaps the machine hosting the page is down? Before you can even start thinking rationally about what to do, your boss calls about the critical situation and demands an explanation. In a panic, you'll probably start plugging things in and out of the network, rebooting the machine, and so on, none of which helps.

After hours of nervously digging into the issue, you finally find the cause: the web server was working properly, but was timing out while communicating with the database server. The database machine was not getting a correct IP address because yet another box had run out of memory, and the Dynamic Host Configuration Protocol (DHCP) server running on it had stopped working. Imagine how much time it would take to find all that out manually. It would be a nightmare if the database server was in another branch of the company, in a different time zone, and perhaps the people over there were still sleeping.

Now, what if you had Nagios up and running across your entire company? You would just need to go to the web interface and see that there are no problems with the web server or the machine it is running on. There would also be a list of what is wrong: the machine serving IP addresses to the entire company is not doing its job, and the database is down. If the setup also monitored the DHCP server, you would get a warning email that very little swap memory is available on it, or that too many processes are running. It might even have an event handler for such cases that kills or restarts noncritical processes. Nagios would also try to restart the DHCP server process over the network if it went down.

In the worst case, Nagios would cut hours of investigation down to 10 minutes. In the best case, you would just get an email saying that there was a problem, followed by another one saying that the problem has already been fixed. You would then disable a few services and increase the swap size on the DHCP machine, solving the problem once and for all. And nobody would even notice there had been a problem.
