
The optimizer parameter

Our implementation of neural networks used gradient descent. When researchers started creating more complicated multilayer neural network models, they found that these models took an extraordinarily long time to train. This is because the basic gradient-descent algorithm with no optimization is not very efficient; it takes small steps towards its goal in each epoch, regardless of what occurred in previous epochs. We can compare it to a guessing game: one person has to guess a number in a range and, for each guess, they are told to go higher or lower (assuming they do not guess the correct number!). The higher/lower instruction is similar to the derivative value: it indicates the direction we must travel. Now let's say that the range of possible numbers is 1 to 1,000,000 and the first guess is 1,000. The person is told to go higher. Which of the following should they do?

  • Try 1,001.
  • Take the difference between the guess and the max value and divide by 2. Add this value to the previous guess.

The second option is much better and should mean the person gets to the right answer in 20 guesses or fewer, because halving the range each time covers 1,000,000 possibilities in about 20 steps (2^20 is just over 1,000,000). If you have a background in computer science, you may recognize this as the binary search algorithm. The first option, guessing 1,001, 1,002, ..., 1,000,000, is a terrible choice and will probably fail because one party will give up! But this is similar to how gradient descent works: it moves incrementally towards the target. If you try increasing the learning rate to overcome this problem, you can overshoot the target and the model may fail to converge.
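As a rough sketch (not code from the book), the following R snippet plays out both strategies for a hypothetical hidden number and counts the guesses needed after the initial guess of 1,000:

```r
# Compare the two guessing strategies for a hidden number in 1..1,000,000.
target <- 756432          # hypothetical hidden number, chosen for illustration
lo <- 1; hi <- 1000000

# Strategy 1: increase the guess by one each time (like plain gradient
# descent with a tiny learning rate), starting from 1,000.
linear_guesses <- target - 1000

# Strategy 2: halve the remaining range each time (binary search).
binary_guesses <- 0
guess <- 1000
while (guess != target) {
  if (guess < target) lo <- guess + 1 else hi <- guess - 1
  guess <- floor((lo + hi) / 2)
  binary_guesses <- binary_guesses + 1
}

c(linear = linear_guesses, binary = binary_guesses)
# linear needs hundreds of thousands of guesses; binary needs roughly 20
```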

Researchers came up with some clever optimizations to speed up training. One of the first optimizers was called momentum, and it does exactly what its name suggests: it keeps track of previous gradients and takes bigger steps in each epoch if the previous steps were all in the same direction. This means the model should train much more quickly. There are other algorithms that build on this idea, such as RMSProp and Adam. You don't usually need to know how they work internally, just that, when you change the optimizer, you may also have to adjust other hyperparameters, such as the learning rate. In general, look for previous examples done by others and start from their hyperparameter settings.
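To make the momentum idea concrete, here is a minimal sketch (not the book's code) that minimizes the toy function f(w) = w^2, whose gradient is 2 * w, with and without a momentum term. The learning rate, momentum coefficient, and step count are illustrative choices:

```r
# Gradient of the toy function f(w) = w^2.
grad <- function(w) 2 * w

run <- function(use_momentum, lr = 0.01, mu = 0.9, steps = 200) {
  w <- 10   # starting point
  v <- 0    # accumulated velocity (only used with momentum)
  for (i in 1:steps) {
    g <- grad(w)
    if (use_momentum) {
      v <- mu * v + g        # accumulate a running direction from past gradients
      w <- w - lr * v        # bigger steps while the gradients keep agreeing
    } else {
      w <- w - lr * g        # plain gradient descent: small fixed-size steps
    }
  }
  w
}

c(plain = run(FALSE), momentum = run(TRUE))
# momentum ends much closer to the minimum at 0 in the same number of steps
```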

We actually used one of these optimizers in an example in the previous chapter. In that chapter, we had two models with a similar architecture (40 hidden nodes). The first model (digits.m3) used the nnet library and took 40 minutes to train. The second model used resilient backpropagation and took 3 minutes to train. This shows the benefit of using a good optimizer in neural networks and deep learning.
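The previous chapter's code is not repeated here, but as a hedged illustration of switching training algorithms in R, the neuralnet package exposes an algorithm argument that lets you choose between standard backpropagation and resilient backpropagation. The data and network size below are made up for the example:

```r
library(neuralnet)

# Small synthetic regression problem, purely for illustration.
set.seed(42)
df <- data.frame(x1 = runif(200), x2 = runif(200))
df$y <- df$x1 + df$x2 + rnorm(200, sd = 0.05)

# Standard backpropagation requires an explicit learning rate and may need
# far more iterations to converge (or fail to converge within stepmax).
nn_bp <- neuralnet(y ~ x1 + x2, data = df, hidden = 3,
                   algorithm = "backprop", learningrate = 0.01,
                   linear.output = TRUE)

# Resilient backpropagation ("rprop+", the package default) adapts its own
# step sizes and typically converges in far fewer iterations.
nn_rp <- neuralnet(y ~ x1 + x2, data = df, hidden = 3,
                   algorithm = "rprop+", linear.output = TRUE)

# Inspect the reported error and number of training steps for each model.
nn_bp$result.matrix
nn_rp$result.matrix
```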
