
Weight optimization

Before training starts, the network parameters are set randomly. To optimize the network weights, an iterative algorithm called Gradient Descent (GD) is used. At each iteration, GD computes the gradient G of the error function E (the cost) over the training set and uses it to update the weights.

In the following graph, the gradient G of the error function E points in the direction in which E, evaluated at the current weights, has the steepest slope. Since the goal is to reduce the network error, GD takes small steps in the opposite direction, -G. This iterative process is repeated many times, so the error E moves down towards the global minimum. The ultimate target is to reach a point where G = 0, where no further optimization is possible:

Searching for the minimum of the error function E: we move in the direction opposite to the gradient G of E
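
To make the update rule concrete, here is a minimal sketch of batch GD in Python with NumPy. The quadratic (mean squared error) cost, the toy regression data, and names such as learning_rate are illustrative assumptions, not taken from any particular library:

```python
import numpy as np

def gradient(weights, X, y):
    # Gradient of the mean squared error E = (1/2n) * ||X @ weights - y||^2,
    # computed over the full training set (batch gradient descent).
    n = len(y)
    return X.T @ (X @ weights - y) / n

# Toy regression problem: 100 samples, 3 features, known target weights
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5])

weights = rng.normal(size=3)      # parameters set randomly before training
learning_rate = 0.1

for step in range(500):
    G = gradient(weights, X, y)   # gradient of E at the current weights
    weights -= learning_rate * G  # small step in the opposite direction, -G
    if np.linalg.norm(G) < 1e-6:  # stop once G is (numerically) zero
        break
```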

The downside of GD is that it converges slowly, because every iteration requires a pass over the entire training set; this makes it impractical for large-scale training data. Therefore, a faster variant called Stochastic Gradient Descent (SGD) was proposed, which is also a widely used optimizer in DNN training. In SGD, we use only one training sample per iteration to update the network parameters.
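
The following sketch shows the single-sample update on the same kind of toy problem; the uniform sampling scheme and the hyperparameter values are assumptions made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # toy regression problem, as before
y = X @ np.array([1.5, -2.0, 0.5])

weights = rng.normal(size=3)
learning_rate = 0.05

for step in range(5000):
    i = rng.integers(len(y))           # draw one training sample per iteration
    xi, yi = X[i], y[i]
    G = xi * (xi @ weights - yi)       # stochastic gradient from that single sample
    weights -= learning_rate * G       # same update rule, applied to a noisy estimate of G
```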

That is not to say SGD is the only available optimization algorithm; many more advanced optimizers exist nowadays, for example, Adam, RMSProp, AdaGrad, and Momentum. Most of them are, directly or indirectly, optimized variants of SGD.
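
As one example of such a variant, here is a hedged sketch of the classical Momentum update; the velocity term and the coefficient of 0.9 follow the common textbook formulation and are chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5])

weights = rng.normal(size=3)
velocity = np.zeros(3)                 # decaying sum of past gradients
learning_rate, momentum = 0.05, 0.9

for step in range(5000):
    i = rng.integers(len(y))
    xi, yi = X[i], y[i]
    G = xi * (xi @ weights - yi)       # same stochastic gradient as plain SGD
    velocity = momentum * velocity - learning_rate * G
    weights += velocity                # momentum smooths the noisy SGD steps
```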

By the way, the term stochastic comes from the fact that the gradient based on a single training sample per iteration is a stochastic approximation of the true cost gradient.
