Breeze is an extensive library providing fast and easy manipulation of arrays of data, routines for optimization, interpolation, linear algebra, signal processing, and numerical integration.
The basic linear algebra operations underlying Breeze rely on the netlib-java library, which can use system-optimized BLAS and LAPACK libraries, if present. Thus, linear algebra operations in Breeze are often extremely fast. Breeze is still undergoing rapid development and can, therefore, be somewhat unstable.
Vectors
Breeze makes manipulating one- and two-dimensional data structures easy. To start, open a Scala console through SBT and import Breeze:
scala> val v = DenseVector(1.0, 2.0, 3.0)breeze.linalg.DenseVector[Double] = DenseVector(1.0, 2.0, 3.0)
We have just defined a three-element vector, v. Vectors are just one-dimensional arrays of data exposing methods tailored to numerical uses. They can be indexed like other Scala collections:
scala> v(1)Double = 2.0
They support element-wise operations with a scalar:
scala> v :* 2.0 // :* is 'element-wise multiplication'breeze.linalg.DenseVector[Double] = DenseVector(2.0, 4.0, 6.0)
They also support element-wise operations with another vector:
scala> v :+ DenseVector(4.0, 5.0, 6.0) // :+ is 'element-wise addition'breeze.linalg.DenseVector[Double] = DenseVector(5.0, 7.0, 9.0)
Breeze makes writing vector operations intuitive and considerably more readable than the native Scala equivalent.
Note that Breeze will refuse (at compile time) to coerce operands to the correct type:
scala> v :* 2 // element-wise multiplication by integer<console>:15: error: could not find implicit value for parameter op:...
It will also refuse (at runtime) to add vectors together if they have different lengths:
scala> v :+ DenseVector(8.0, 9.0)java.lang.IllegalArgumentException: requirement failed: Vectors must have same length: 3 != 2...
Basic manipulation of vectors in Breeze will feel natural to anyone used to working with NumPy, MATLAB, or R.
So far, we have only looked at element-wise operators. These are all prefixed with a colon. All the usual suspects are present: :+, :*, :-, :/, :% (remainder), and :^ (power) as well as Boolean operators. To see the full list of operators, have a look at the API documentation for DenseVector or DenseMatrix (https://github.com/scalanlp/breeze/wiki/Linear-Algebra-Cheat-Sheet).
Besides element-wise operations, Breeze vectors support the operations you might expect of mathematical vectors, such as the dot product:
scala> val v2 = DenseVector(4.0, 5.0, 6.0)breeze.linalg.DenseVector[Double] = DenseVector(4.0, 5.0, 6.0)scala> v dot v2Double = 32.0
Tip
Pitfalls of element-wise operators
Besides the :+ and :- operators for element-wise addition and subtraction that we have seen so far, we can also use the more traditional + and - operators:
scala> v + v2breeze.linalg.DenseVector[Double] = DenseVector(5.0, 7.0, 9.0)
One must, however, be very careful with operator precedence rules when mixing :+ or :* with :+ operators. The :+ and :* operators have very low operator precedence, so they will be evaluated last. This can lead to some counter-intuitive behavior:
scala> 2.0 :* v + v2 // !! equivalent to 2.0 :* (v + v2)breeze.linalg.DenseVector[Double] = DenseVector(10.0, 14.0, 18.0)
By contrast, if we use :+ instead of +, the mathematical precedence of operators is respected:
scala> 2.0 :* v :+ v2 // equivalent to (2.0 :* v) :+ v2breeze.linalg.DenseVector[Double] = DenseVector(6.0, 9.0, 12.0)
In summary, one should avoid mixing the :+ style operators with the + style operators as much as possible.
Dense and sparse vectors and the vector trait
All the vectors we have looked at thus far have been dense vectors. Breeze also supports sparse vectors. When dealing with arrays of numbers that are mostly zero, it may be more computationally efficient to use sparse vectors. The point at which a vector has enough zeros to warrant switching to a sparse representation depends strongly on the type of operations, so you should run your own benchmarks to determine which type to use. Nevertheless, a good heuristic is that, if your vector is about 90% zero, you may benefit from using a sparse representation.
Sparse vectors are available in Breeze as the SparseVector and HashVector classes. Both these types support many of the same operations as DenseVector but use a different internal implementation. The SparseVector instances are very memory-efficient, but adding non-zero elements is slow. HashVector is more versatile, at the cost of an increase in memory footprint and computational time for iterating over non-zero elements. Unless you need to squeeze the last bits of memory out of your application, I recommend using HashVector. We will not discuss these further in this book, but the reader should find them straightforward to use if needed. DenseVector, SparseVector, and HashVector all implement the Vector trait, giving them a common interface.
Tip
Breeze remains very experimental and, as of this writing, somewhat unstable. I have found dealing with specific implementations of the Vector trait, such as DenseVector or SparseVector, to be more reliable than dealing with the Vector trait directly. In this chapter, we will explicitly type every vector as DenseVector.
Matrices
Breeze allows the construction and manipulation of two-dimensional arrays in a similar manner:
We have seen how to explicitly build vectors and matrices by passing their values to the constructor (or rather, to the companion object's apply method): DenseVector(1.0, 2.0, 3.0). Breeze offers several other powerful ways of building vectors and matrices:
The linspace method (available in the breeze.linalg package object) creates a Double vector of equally spaced values. For instance, to create a vector of 10 values distributed uniformly between 0 and 1, perform the following:
The first argument to DenseVector.tabulate is the size of the vector, and the second is a function returning the value of the vector at a particular position. This is useful for creating ranges of data, among other things.
The rand function lets us create random vectors and matrices:
To construct vectors from other Scala collections, you must use the splat operator, :_ *:
scala> val l = Seq(2, 3, 4)l: Seq[Int] = List(2, 3, 4)scala> DenseVector(l :_ *)breeze.linalg.DenseVector[Int] = DenseVector(2, 3, 4)
Advanced indexing and slicing
We have already seen how to select a particular element in a vector v by its index with, for instance, v(2). Breeze also offers several powerful methods for selecting parts of a vector.
Let's start by creating a vector to play around with:
scala> val v = DenseVector.tabulate(5) { _.toDouble }breeze.linalg.DenseVector[Double] = DenseVector(0.0, 1.0, 2.0, 3.0, 4.0)
Unlike native Scala collections, Breeze vectors support negative indexing:
scala> v(-1) // last elementDouble = 4.0
Breeze lets us slice the vector using a range:
scala> v(1 to 3)breeze.linalg.DenseVector[Double] = DenseVector(1.0, 2.0, 3.0)scala v(1 until 3) // equivalent to Python v[1:3]breeze.linalg.DenseVector[Double] = DenseVector(1.0, 2.0)scala> v(v.length-1 to 0 by -1) // reverse view of vbreeze.linalg.DenseVector[Double] = DenseVector(4.0, 3.0, 2.0, 1.0, 0.0)
Tip
Indexing by a range returns a view of the original vector: when running val v2 = v(1 to 3), no data is copied. This means that slicing is extremely efficient. Taking a slice of a huge vector does not increase the memory footprint at all. It also means that one should be careful updating a slice, since it will also update the original vector. We will discuss mutating vectors and matrices in a subsequent section in this chapter.
Breeze also lets us select an arbitrary set of elements from a vector:
scala> val vSlice = v(2, 4) // Select elements at index 2 and 4breeze.linalg.SliceVector[Int,Double] = breeze.linalg.SliceVector@9c04d22
This creates a SliceVector, which behaves like a DenseVector (both implement the Vector interface), but does not actually have memory allocated for values: it just knows how to map from its indices to values in its parent vector. One should think of vSlice as a specific view of v. We can materialize the view (give it its own data rather than acting as a lens through which v is viewed) by converting it to DenseVector:
Note that if an element of a slice is out of bounds, an exception will only be thrown when that element is accessed:
scala> val vSlice = v(2, 7) // there is no v(7)breeze.linalg.SliceVector[Int,Double] = breeze.linalg.SliceVector@2a83f9d1scala> vSlice(0) // valid since v(2) is still validDouble = 2.0scala> vSlice(1) // invalid since v(7) is out of boundsjava.lang.IndexOutOfBoundsException: 7 not in [-5,5) ...
Finally, one can index vectors using Boolean arrays. Let's start by defining an array:
This can be used as a way of filtering certain elements in a vector. For instance, to select the elements of v which are less than 3.0:
scala> val filtered = v(v :< 3.0) // :< is element-wise "less than"breeze.linalg.SliceVector[Int,Double] = breeze.linalg.SliceVector@2b1edef3scala> filtered.toDenseVectorbreeze.linalg.DenseVector[Double] = DenseVector(0.0, 1.0, 2.0)
Matrices can be indexed in much the same way as vectors. Matrix indexing functions take two arguments—the first argument selects the row(s) and the second one slices the column(s):
scala> val m = DenseMatrix((1.0, 2.0, 3.0), (5.0, 6.0, 7.0))m: breeze.linalg.DenseMatrix[Double] =1.0 2.0 3.05.0 6.0 7.0scala> m(1, 2)Double = 7.0scala> m(1, -1)Double = 7.0scala> m(0 until 2, 0 until 2)breeze.linalg.DenseMatrix[Double] =1.0 2.05.0 6.0
You can also mix different slicing types for rows and columns:
scala> m(0 until 2, 0)breeze.linalg.DenseVector[Double] = DenseVector(1.0, 5.0)
Note how, in this case, Breeze returns a vector. In general, slicing returns the following objects:
A scalar when single indices are passed as the row and column arguments
A vector when the row argument is a range and the column argument is a single index
A vector transpose when the column argument is a range and the row argument is a single index
A matrix otherwise
The symbol :: can be used to indicate every element along a particular direction. For instance, we can select the second column of m:
Breeze vectors and matrices are mutable. Most of the slicing operations described above can also be used to set elements of a vector or matrix:
scala> val v = DenseVector(1.0, 2.0, 3.0)v: breeze.linalg.DenseVector[Double] = DenseVector(1.0, 2.0, 3.0)scala> v(1) = 22.0 // v is now DenseVector(1.0, 22.0, 3.0)
We are not limited to mutating single elements. In fact, all the indexing operations outlined above can be used to set the elements of vectors or matrices. When mutating slices of vectors or matrices, use the element-wise assignment operator, :=:
scala> v(0 until 2) := DenseVector(50.0, 51.0) // set elements at position 0 and 1breeze.linalg.DenseVector[Double] = DenseVector(50.0, 51.0)scala> vbreeze.linalg.DenseVector[Double] = DenseVector(50.0, 51.0, 3.0)
The assignment operator, :=, works like other element-wise operators in Breeze. If the right-hand side is a scalar, it will automatically be broadcast to a vector of the given shape:
scala> v(0 until 2) := 0.0 // equivalent to v(0 until 2) := DenseVector(0.0, 0.0)breeze.linalg.DenseVector[Double] = DenseVector(0.0, 0.0)scala> vbreeze.linalg.DenseVector[Double] = DenseVector(0.0, 0.0, 3.0)
All element-wise operators have an update counterpart. For instance, the :+= operator acts like the element-wise addition operator :+, but also updates its left-hand operand:
scala> val v = DenseVector(1.0, 2.0, 3.0)v: breeze.linalg.DenseVector[Double] = DenseVector(1.0, 2.0, 3.0)scala> v :+= 4.0breeze.linalg.DenseVector[Double] = DenseVector(5.0, 6.0, 7.0)scala> vbreeze.linalg.DenseVector[Double] = DenseVector(5.0, 6.0, 7.0)
Notice how the update operator updates the vector in place and returns it.
We have learnt how to slice vectors and matrices in Breeze to create new views of the original data. These views are not independent of the vector they were created from—updating the view will update the underlying vector and vice-versa. This is best illustrated with an example:
scala> val v = DenseVector.tabulate(6) { _.toDouble }breeze.linalg.DenseVector[Double] = DenseVector(0.0, 1.0, 2.0, 3.0, 4.0, 5.0)scala> val viewEvens = v(0 until v.length by 2)breeze.linalg.DenseVector[Double] = DenseVector(0.0, 2.0, 4.0)scala> viewEvens := 10.0 // mutate viewEvensbreeze.linalg.DenseVector[Double] = DenseVector(10.0, 10.0, 10.0)scala> viewEvensbreeze.linalg.DenseVector[Double] = DenseVector(10.0, 10.0, 10.0)scala> v // v has also been mutated!breeze.linalg.DenseVector[Double] = DenseVector(10.0, 1.0, 10.0, 3.0, 10.0, 5.0)
This quickly becomes intuitive if we remember that, when we create a vector or matrix, we are creating a view of an underlying data array rather than creating the data itself:
A vector slice v(0 to 6 by 2) of the v vector is just a different view of the array underlying v. The view itself contains no data. It just contains pointers to the data in the original array. Internally, the view is just stored as a pointer to the underlying data and a recipe for iterating over that data: in the case of this slice, the recipe is just "start at the first element of the underlying data and go to the seventh element of the underlying data in steps of two".
Breeze offers a copy function for when we want to create independent copies of data. In the previous example, we can construct a copy of viewEvens as:
scala> val copyEvens = v(0 until v.length by 2).copybreeze.linalg.DenseVector[Double] = DenseVector(10.0, 10.0, 10.0)
We can now update copyEvens independently of v.
Matrix multiplication, transposition, and the orientation of vectors
So far, we have mostly looked at element-wise operations on vectors and matrices. Let's now look at matrix multiplication and related operations.
Besides matrix-matrix multiplication, we can use the matrix multiplication operator between matrices and vectors. All vectors in Breeze are column vectors. This means that, when multiplying matrices and vectors together, a vector should be viewed as an (n * 1) matrix. Let's walk through an example of matrix-vector multiplication. We want the following operation:
scala> val v = DenseVector(1.0, 2.0)breeze.linalg.DenseVector[Double] = DenseVector(1.0, 2.0)scala> m1 * v breeze.linalg.DenseVector[Double] = DenseVector(8.0, 17.0, 26.0)
By contrast, if we wanted:
We must convert v to a row vector. We can do this using the transpose operation:
Note that the type of v.t is Transpose[DenseVector[_]]. A Transpose[DenseVector[_]] behaves in much the same way as a DenseVector as far as element-wise operations are concerned, but it does not support mutation or slicing.
Data preprocessing and feature engineering
We have now discovered the basic components of Breeze. In the next few sections, we will apply them to real examples to understand how they fit together to form a robust base for data science.
An important part of data science involves preprocessing datasets to construct useful features. Let's walk through an example of this. To follow this example and access the data, you will need to download the code examples for the book (www.github.com/pbugnion/s4ds).
You will find, in directory chap02/data/ of the code attached to this book, a CSV file with true heights and weights as well as self-reported heights and weights for 181 men and women. The original dataset was collected as part of a study on body image. Refer to the following link for more information: http://vincentarelbundock.github.io/Rdatasets/doc/car/Davis.html.
There is a helper function in the package provided with the book to load the data into Breeze arrays:
scala> val data = HWData.loadHWData [ 181 rows ]scala> data.gendersbreeze.linalg.Vector[Char] = DenseVector(M, F, F, M, ... )
The data object contains five vectors, each 181 element long:
data.genders: A Char vector describing the gender of the participants
data.heights: A Double vector of the true height of the participants
data.weights: A Double vector of the true weight of the participants
data.reportedHeights: A Double vector of the self-reported height of the participants
data.reportedWeights: A Double vector of the self-reported weight of the participants
Let's start by counting the number of men and women in the study. We will define an array that contains just 'M' and do an element-wise comparison with data.genders:
scala> val maleVector = DenseVector.fill(data.genders.length)('M')breeze.linalg.DenseVector[Char] = DenseVector(M, M, M, M, M, M,... )scala> val isMale = (data.genders :== maleVector)breeze.linalg.DenseVector[Boolean] = DenseVector(true, false, false, true ...)
The isMale vector is the same length as data.genders. It is true where the participant is male, and false otherwise. We can use this Boolean array as a mask for the other arrays in the dataset (remember that vector(mask) selects the elements of vector where mask is true). Let's get the height of the men in our dataset:
To count the number of men in our dataset, we can use the indicator function. This transforms a Boolean array into an array of doubles, mapping false to 0.0 and true to 1.0:
Let's calculate the mean height of men and women in the experiment. We can calculate the mean of a vector using mean(v), which we can access by importing breeze.stats._:
To calculate the mean height of the men, we can use our isMale array to slice data.heights; data.heights(isMale) is a view of the data.heights array with all the height values for the men:
scala> mean(data.heights(isMale)) // mean male heightDouble = 178.0121951219512scala> mean(data.heights(!isMale)) // mean female heightDouble = 164.74747474747474
As a somewhat more involved example, let's look at the discrepancy between real and reported weight for both men and women in this experiment. We can get an array of the percentage difference between the reported weight and the true weight:
There are thus ten men who believe they are taller than they actually are. The element-wise AND operator :& returns a vector that is true for all indices for which both its arguments are true. The vector overReportMask :& isMale is thus true for all participants that are male and over-reported their height.
Breeze – function optimization
Having studied feature engineering, let's now look at the other end of the data science pipeline. Typically, a machine learning algorithm defines a loss function that is a function of a set of parameters. The value of the loss function represents how well the model fits the data. The parameters are then optimized to minimize (or maximize) the loss function.
In library that contains many well-known algorithms. Often, we don't need to worry about optimizing loss functions directly since we can rely on the machine learning algorithms provided by MLlib. It is nevertheless useful to have a basic knowledge of optimization.
Breeze has an optimize module that contains functions for finding a local minimum:
Most local optimizers also require the gradient of the function being optimized. The gradient is a vector of the same dimension as the arguments to the function. In our case, the gradient is:
We can represent the gradient in Breeze with a function that takes a vector argument and returns a vector of the same length:
Let's set up the optimization problem. Breeze's optimization methods require that we pass in an implementation of the DiffFunction trait with a single method, calculate. This method must return a tuple of the function and its gradient:
scala> val optTrait = new DiffFunction[DenseVector[Double]] { def calculate(xs:DenseVector[Double]) = (f(xs), gradf(xs))}breeze.optimize.DiffFunction[breeze.linalg.DenseVector[Double]] = <function1>
We are now ready to run the optimization. The optimize module provides a minimize function that does just what we want. We pass it optTrait and a starting point for the optimization:
The true minimum is at (0.0, 0.0, 0.0). The optimizer therefore correctly finds the minimum.
The minimize function uses the L-BFGS method to run the optimization by default. It takes several additional arguments to control the optimization. We will explore these in the next sections.
Numerical derivatives
In the previous example, we specified the gradient of f explicitly. While this is generally good practice, calculating the gradient of a function can often be tedious. Breeze provides a gradient approximation function using finite differences. Reusing the same objective function def f(xs:DenseVector[Double]) = sum(xs :^ 2.0) as in the previous section:
scala> val approxOptTrait = new ApproximateGradientFunction(f)breeze.optimize.ApproximateGradientFunction[Int,breeze.linalg.DenseVector[Double]] = <function1>
The trait approxOptTrait has a gradientAt method that returns an approximation to the gradient at a point:
Note that this can be quite inaccurate. The ApproximateGradientFunction constructor takes an epsilon optional argument that controls the size of the step taken when calculating the finite differences. Changing the value of epsilon can improve the accuracy of the finite difference algorithm.
The ApproximateGradientFunction instance implements the DiffFunction trait. It can therefore be passed to minimize directly:
This, again, gives a result close to zero, but somewhat further away than when we specified the gradient explicitly. In general, it will be significantly more efficient and more accurate to calculate the gradient of a function analytically than to rely on Breeze's numerical gradient. It is probably best to only use the numerical gradient during data exploration or to check analytical gradients.
Regularization
The minimize function takes many optional arguments relevant to machine learning algorithms. In particular, we can instruct the optimizer to use a regularization parameter when performing the optimization. Regularization introduces a penalty in the loss function to prevent the parameters from growing arbitrarily. This is useful to avoid overfitting. We will discuss regularization in greater detail in Chapter 12, Distributed Machine Learning with MLlib.
For instance, to use L2Regularization with a hyperparameter of 0.5: