
Maximum-likelihood learning

We have defined likelihood as a filtering term in the Bayes formula. In general, it has the form of:
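
Using h to denote a generic hypothesis and X the dataset, the standard form is:

$$ L(h \mid X) = p(X \mid h) $$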

Here, the first term expresses the actual likelihood of a hypothesis, given a dataset X. As you can imagine, in this formula there are no more a priori probabilities, so maximizing it doesn't imply accepting a theoretically preferred hypothesis, nor excluding unlikely ones. A very common approach, known as expectation-maximization and used in many algorithms (we're going to see an example in logistic regression), is split into two main parts (a minimal sketch follows the list):

  • Determining a log-likelihood expression based on model parameters (they will be optimized accordingly)

  • Maximizing it until residual error is small enough
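
As a minimal sketch of this two-step scheme (assuming, for simplicity, a Gaussian model with known unit variance, so that the mean is the only parameter to optimize), a plain gradient-ascent loop on the log-likelihood could look like this:

import numpy as np

# Hypothetical toy dataset: samples from a Gaussian with unknown mean and unit variance
data = np.random.normal(loc=1.0, scale=1.0, size=500)

mu = 0.0                # initial guess for the only parameter
learning_rate = 0.001

for _ in range(1000):
    # Gradient of the log-likelihood with respect to mu (unit variance): sum(x - mu)
    gradient = np.sum(data - mu)
    mu += learning_rate * gradient

print(mu)               # should be very close to np.mean(data)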

Working with the log-likelihood (normally called L) is a useful trick that can simplify gradient calculations. A generic likelihood expression is:
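
Assuming the dataset X is made up of n independent samples x_1, x_2, ..., x_n and the hypothesis is h_i, the standard product form is:

$$ L(h_i \mid X) = \prod_{k=1}^{n} p(x_k \mid h_i) $$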

As all the parameters are inside h_i, the gradient is a complex expression that isn't very manageable. Our goal is to maximize the likelihood, but it's easier to minimize its reciprocal:
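
As the likelihood is a product of per-sample probabilities, its reciprocal is simply:

$$ \frac{1}{L(h_i \mid X)} = \prod_{k=1}^{n} \frac{1}{p(x_k \mid h_i)} $$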

This can be turned into a very simple expression by applying the natural logarithm (which is monotonic):
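
Taking the logarithm turns the product into a sum, with one term per sample:

$$ \log \frac{1}{L(h_i \mid X)} = -\log L(h_i \mid X) = -\sum_{k=1}^{n} \log p(x_k \mid h_i) $$

Each term depends on a single sample, so the gradient with respect to the parameter vector θ of the hypothesis decomposes in the same way:

$$ \nabla_{\theta} \left( -\log L \right) = -\sum_{k=1}^{n} \nabla_{\theta} \log p(x_k \mid \theta) $$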

The last term is a summation which can be easily differentiated and used in most of the optimization algorithms. At the end of this process, we can find a set of parameters that provides the maximum likelihood without any strong statement about prior distributions. This approach can seem very technical, but its logic is really simple and intuitive. To understand how it works, I propose a simple exercise, which is part of the Gaussian mixture technique also discussed in Russell S., Norvig P., Artificial Intelligence: A Modern Approach, Pearson.

Let's consider 100 points drawn from a Gaussian distribution with zero mean and a variance equal to 2.0 (quasi-white noise made of independent samples):

import numpy as np

nb_samples = 100
# scale is the standard deviation, so the variance of the samples is 2.0
X_data = np.random.normal(loc=0.0, scale=np.sqrt(2.0), size=nb_samples)

The plot is shown next:
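
A minimal sketch to reproduce a similar scatter plot (assuming Matplotlib is installed; it isn't required for the rest of the exercise) is:

import matplotlib.pyplot as plt

# Plot each sample against its index to show the quasi-white noise structure
plt.plot(X_data, 'o')
plt.xlabel('Sample index')
plt.ylabel('Value')
plt.show()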

In this case, there's no need for a deep exploration (we know how the points are generated); however, after restricting the hypothesis space to the Gaussian family (the most suitable one considering only the graph), we'd like to find the best values for the mean and variance. First of all, we need to compute the log-likelihood (which is rather simple thanks to the exponential function):
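
For a Gaussian with mean μ and variance σ², the log-likelihood of the n independent samples is:

$$ \log L(\mu, \sigma^2 \mid X) = \sum_{i=1}^{n} \log \left( \frac{1}{\sqrt{2 \pi \sigma^2}} \, e^{-\frac{(x_i - \mu)^2}{2 \sigma^2}} \right) = -\frac{n}{2} \log (2 \pi \sigma^2) - \frac{1}{2 \sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 $$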

A simple Python implementation is provided next (for ease of use, there's only a single array, which contains both the mean (index 0) and the variance (index 1)):

def negative_log_likelihood(v):
    # v[0] is the mean, v[1] is the variance
    l = 0.0
    f1 = 1.0 / np.sqrt(2.0 * np.pi * v[1])
    f2 = 2.0 * v[1]

    for x in X_data:
        # Log of the Gaussian density evaluated at x
        l += np.log(f1 * np.exp(-np.square(x - v[0]) / f2))

    return -l

Then, we need to find its minimum (in terms of mean and variance) with any of the available methods (gradient descent or another numerical optimization algorithm). For example, using the SciPy minimize function, we can easily get:

from scipy.optimize import minimize

>>> minimize(fun=negative_log_likelihood, x0=[0.0, 1.0])

fun: 172.33380423827057
hess_inv: array([[ 0.01571807, 0.02658017],
[ 0.02658017, 0.14686427]])
jac: array([ 0.00000000e+00, -1.90734863e-06])
message: 'Optimization terminated successfully.'
nfev: 52
nit: 9
njev: 13
status: 0
success: True
x: array([ 0.04088792, 1.83822255])
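
As a quick sanity check, for a Gaussian the maximum-likelihood estimates have a well-known closed form (the sample mean and the biased sample variance), so the optimizer's result can be verified directly (the exact numbers depend on the samples actually drawn):

# Both values should match the components of x up to numerical precision
print(np.mean(X_data), np.var(X_data))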

A graph of the negative log-likelihood function is plotted next. The global minimum of this function corresponds to an optimal likelihood given a certain distribution family. It doesn't mean that the problem has been completely solved, because the first step of this algorithm is determining an expectation, which must always be realistic. The likelihood function, however, is quite sensitive to wrong distributions because it can easily get close to zero when the probabilities are low. For this reason, maximum-likelihood (ML) learning is often preferable to MAP learning, which needs a priori distributions and can fail when they are not selected in the most appropriate way:
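
One possible way to produce such a graph is a contour plot of the function over a grid of mean and variance values (a minimal sketch; the grid ranges are arbitrary and Matplotlib is assumed to be available):

import matplotlib.pyplot as plt

# Evaluate the negative log-likelihood on a grid of (mean, variance) pairs
means = np.linspace(-1.0, 1.0, 50)
variances = np.linspace(0.5, 4.0, 50)
Z = np.array([[negative_log_likelihood([m, v]) for m in means] for v in variances])

M, V = np.meshgrid(means, variances)
plt.contour(M, V, Z, levels=30)
plt.xlabel('Mean')
plt.ylabel('Variance')
plt.show()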

This approach has been applied to a specific distribution family (which is indeed very easy to manage), but it also works perfectly when the model is more complex. Of course, it's always necessary to have an initial awareness of how the likelihood should be determined, because more than one feasible family can generate the same dataset. In all these cases, Occam's razor is the best way to proceed: the simplest hypothesis should be considered first. If it doesn't fit, an extra level of complexity can be added to our model. As we'll see, in many situations, the easiest solution is the winning one, and increasing the number of parameters or using a more detailed model can only add noise and a higher risk of overfitting.

SciPy (https://www.scipy.org) is a set of high-end scientific and data-oriented libraries available for Python. Together with NumPy, Pandas, and many other useful frameworks, it is part of the broader scientific Python ecosystem. If you want to read more about Python scientific computing, refer to Johansson R., Numerical Python, Apress, or Landau R. H., Pàez M. J., Bordeianu C. C., Computational Physics. Problem Solving with Python, Wiley-VCH.