
About Expression Templates Library for the Deep Learning Library


Learn about the updated Expression Templates Library for the Deep Learning Library, including new features, performance, and other changes.
It took me longer than I thought, but I’m glad to announce the release of version 1.1 of my Expression Templates Library (ETL) project. This is a major new release with many improvements and new features. It’s been almost one month since the last, and first, release (1.0). I should have done some minor releases in the meantime, but at least the library is now in good shape for a major version.
It may be interesting to note that my machine learning framework (DLL), based on the ETL library, has proven to be faster than all the tested popular frameworks (TensorFlow, Keras, Caffe, Torch, DeepLearning4J) for training various neural networks on CPU. I’ll post more details in another post soon, but this shows that special attention has been paid to performance in this library and that it is well suited to machine learning.
For those of you who don’t follow my blog, ETL is a library providing expression templates for computations on matrices and vectors. For instance, if you have three matrices A, B, and C, you could write C++ code like this:
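The original snippet did not survive extraction; the following is a minimal sketch of the kind of expression meant here, assuming ETL’s `etl::dyn_matrix` container and single-header include (these names are my assumptions, not the exact code from the original post):

```cpp
#include "etl/etl.hpp"   // assumed single-header include of the library

int main() {
    // Three dynamically-sized matrices (etl::dyn_matrix is assumed here)
    etl::dyn_matrix<double> A(256, 256), B(256, 256), C(256, 256);

    // The right-hand side is built as an expression template and only
    // evaluated on assignment, using vectorized (and possibly parallel) code.
    C = 2.0 * (A + B);

    return 0;
}
```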
Or given vectors b, v, h and a matrix W, you could write code like this:
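Again, the original snippet is missing; here is a plausible sketch, assuming `etl::dyn_vector`/`etl::dyn_matrix` containers and an `etl::sigmoid` helper (the sigmoid name is an assumption based on the activation functions mentioned later in the post):

```cpp
etl::dyn_vector<double> v(100), b(200), h(200);
etl::dyn_matrix<double> W(100, 200);

// The vector-matrix product is computed first (vectorized kernel or BLAS),
// then the addition and the element-wise sigmoid, all on assignment to h.
h = etl::sigmoid(b + v * W);
```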
The goal of such a library is twofold. First, it makes expressions more readable and as close to the math as possible. Second, it allows the library to compute the expressions as fast as possible. In the first case, the framework will compute the sum using a vectorized algorithm and then compute the overall expression using vectorized code again. The expression can also be computed in parallel if the matrices are big enough. In the second case, the vector-matrix multiplication will be computed first, using either a hand-optimized vectorized kernel or a BLAS routine (depending on configuration options). Then the whole expression will be evaluated using vectorized code.
Many new features have been integrated into the library.
The support for machine learning operations has been improved. There are now specific helpers for machine learning in the etl::ml namespace, with names that are standard in machine learning. A real transposed convolution has been implemented, with support for padding and stride. Batched outer product and batched bias averaging are also now supported. The activation function support has been improved and the derivatives have been reviewed. The pooling operators have also been improved with stride and padding support. Unrelated to machine learning, 2D and 3D pooling can now also be applied to higher-dimensional matrices.
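As an illustration only, a 4D convolution through such a helper might look like the sketch below; the function name, template parameters, and their order are hypothetical, since the post only states that the helpers live in etl::ml and follow machine-learning naming:

```cpp
// All names below (convolution_forward, the template parameter order) are hypothetical.
etl::dyn_matrix<float, 4> input(64, 3, 32, 32);     // batch of images (N, C, H, W)
etl::dyn_matrix<float, 4> kernels(16, 3, 5, 5);     // 16 filters of 5x5
etl::dyn_matrix<float, 4> output(64, 16, 32, 32);   // same spatial size thanks to padding 2

// stride 1, padding 2 (hypothetical parameter order)
output = etl::ml::convolution_forward<1, 1, 2, 2>(input, kernels);
```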
New functions are also available for matrices and vectors. The square root support has been extended with cubic root and inverse root. Support has also been added for floor and ceil. Moreover, comparison operators are now available, as well as global functions such as approx_equals.
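A small sketch of how these could be used; the exact function names for the cubic and inverse roots and the approx_equals signature are assumptions:

```cpp
etl::dyn_vector<double> x(100), y(100), r(100);

r = etl::sqrt(x);      // square root
r = etl::cbrt(x);      // cubic root (name assumed)
r = etl::invsqrt(x);   // inverse square root (name assumed)
r = etl::floor(x);     // floor
r = etl::ceil(x);      // ceil

// comparison operators and global helpers such as approx_equals
auto mask  = x >= y;                          // element-wise comparison
bool close = etl::approx_equals(x, y, 1e-8);  // tolerance-based comparison (signature assumed)
```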
New reductions have also been added, with support for absolute sum and mean (asum/amean) and for min_index and max_index, which return the index of the minimum and maximum element, respectively. Finally, argmax can now be used to get the index of the maximum in each sub-dimension of a matrix; argmax on a vector is equivalent to max_index.
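For instance (the names below come from the post, but the exact overloads are assumptions):

```cpp
etl::dyn_vector<double> v(100);

double s       = etl::asum(v);       // sum of absolute values
double m       = etl::mean(v);       // mean
std::size_t lo = etl::min_index(v);  // position of the smallest element
std::size_t hi = etl::max_index(v);  // position of the largest element

etl::dyn_matrix<double> scores(32, 10);
// index of the maximum in each sub-dimension (here: one index per row)
auto predictions = etl::argmax(scores);
```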
Support for shuffling has also been added. By default, shuffling a vector shuffles all its elements, and shuffling a matrix shuffles its sub-matrices (only the first dimension is permuted), but shuffling a matrix as if it were a flat vector is also possible. Two vectors or two matrices can also be shuffled in parallel; in that case, the same permutation is applied to both containers. As a side note, all operations using random generation are also available with an additional parameter for the random generator, which can help to improve reproducibility or simply to tune the generator.
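A sketch of the shuffling API, assuming free functions named `etl::shuffle` and `etl::parallel_shuffle` (the latter name is a guess for the two-container variant):

```cpp
#include <random>

etl::dyn_matrix<float> samples(1000, 784);
etl::dyn_vector<float> labels(1000);

etl::shuffle(labels);                    // shuffle one container
etl::parallel_shuffle(samples, labels);  // same permutation applied to both (name assumed)

std::mt19937_64 generator(42);           // explicit generator for reproducibility
etl::shuffle(labels, generator);         // overload taking the generator (per the post)
```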
I’ve also included support for matrix adapters. There are adapters for Hermitian matrices, symmetric matrices, and lower and upper triangular matrices. For now, the framework does not take advantage of this information (that will be done later), but it does guarantee the corresponding constraints on the content.
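The adapters could be used roughly like this; the type names are assumptions, since the post does not give them:

```cpp
// Hypothetical adapter names: symmetric_matrix, hermitian_matrix,
// lower_matrix, upper_matrix.
etl::symmetric_matrix<double> S(100);
etl::lower_matrix<double> L(100);

S(2, 3) = 1.5;    // the adapter keeps S(3, 2) == 1.5 so the matrix stays symmetric
// L(2, 3) = 1.5; // would be rejected: above the diagonal of a lower triangular matrix
```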
There are also a few other, more minor features. Maybe not so minor: matrices can now be sliced into sub-matrices. With that, a matrix can be divided into several sub-matrices, and modifying a sub-matrix modifies the source matrix. Sub-matrices are available in 2D, 3D, and 4D for now. There are also other ways of slicing matrices and vectors: it is possible to obtain a slice of the memory or a slice of the first dimension. Another new feature is that it is now possible to compute the cross product of vectors. Matrices can be decomposed into their Q/R decomposition rather than only their PA=LU decomposition. Special support has been added for matrices and vectors of booleans, which now support logical operators such as and, or and not.
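A sketch of these features, with assumed names (`sub`, `cross`, `qr`) standing in for whatever the library actually calls them:

```cpp
etl::dyn_matrix<double> A(4, 4);

// View on the first sub-matrix along the first dimension (name assumed: sub)
auto row0 = etl::sub(A, 0);
row0[0] = 1.0;                    // writes through the view, so A(0, 0) is modified too

etl::dyn_vector<double> u(3), w(3), c(3);
c = etl::cross(u, w);             // cross product of 3D vectors (name assumed)

etl::dyn_matrix<double> Q(4, 4), R(4, 4);
etl::qr(A, Q, R);                 // Q/R decomposition (name and signature assumed)
```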
I’ve always considered the performance of this library to be a feature in itself. I consider the library to be quite fast, especially its convolution support, even though there is still room for improvement. Therefore, many improvements have been made to the performance of the library since the last release. As said before, this library was used in a machine learning framework which then proved faster than the most popular neural network frameworks on CPU. I’ll present here the most important new performance improvements, in no particular order, every bit being important in my opinion.
First, several operations have been optimized to be faster.
Multiplication of two matrices, or of a matrix and a vector, is now much faster when one of the matrices is transposed. Instead of performing the slow transposition, different kernels are used in order to maximize performance without doing any transposition, although the transposition is still performed when that is faster. This leads to very significant improvements, up to 10 times faster in the best case. This is done for the vectorized kernels as well as for the BLAS and CUBLAS calls. These new kernels are also used directly when matrices of different storage order are mixed. For instance, multiplying a column-major matrix with a row-major matrix and storing the result in a column-major matrix is now much more efficient than before. Moreover, the transpose operation itself is also much faster than before.
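For example, an expression like the one below no longer needs to materialize the transposed matrix first; the `transpose` spelling is how I would write it, the exact name in ETL may differ:

```cpp
etl::dyn_matrix<double> A(128, 256), B(128, 512), C(256, 512);

// The library detects the transposed operand and dispatches to a dedicated
// kernel (vectorized, BLAS or CUBLAS) instead of computing the transpose first.
C = etl::transpose(A) * B;
```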
A lot of machine learning operations have also been highly optimized. All the pooling and upsample operators are now parallelized, and the most used kernel (2×2 pooling) is more optimized. The 4D convolution kernels (for machine learning) have been greatly improved. There are now very specialized vectorized kernels for classic kernel configurations (for instance 3×3 or 5×5), and the selection of implementations is smarter than before. The support for padding is also much better than before for small amounts of padding.
Moreover, for small kernels, the full convolution can now be evaluated using the valid convolution kernels directly, with some padding, for much better overall performance. The exponential operation is now vectorized, which makes operations such as sigmoid or softmax much faster.
Matrices and vectors now automatically use aligned memory. This means that vectorized code can use aligned operations, which may be slightly faster. Moreover, matrices and vectors are now padded to a multiple of the vector size. This allows removing the final non-vectorized remainder loop from the vectorized code. This is only done at the end of the matrices, when they are accessed in a flat way; contrary to some frameworks, the inner dimensions of the matrix are not padded. Finally, accesses to 3D and 4D matrices are now much faster than before.
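To illustrate the idea (this is a generic sketch of the technique, not ETL’s actual internals): when the flat size is padded up to a multiple of the vector width, the vectorized loop covers every element and no scalar tail loop is left.

```cpp
#include <cstddef>

constexpr std::size_t vec_size = 4;  // e.g. 4 doubles per AVX register

// Flat size rounded up to a multiple of the vector width
constexpr std::size_t padded(std::size_t n) {
    return (n + vec_size - 1) / vec_size * vec_size;
}

// With a and b allocated (and zero-padded) to padded(n) elements, the loop
// below never needs a scalar remainder: every iteration is a full vector step.
void axpy(double alpha, const double* a, double* b, std::size_t n) {
    for (std::size_t i = 0; i < padded(n); i += vec_size) {
        for (std::size_t j = 0; j < vec_size; ++j) {  // stands in for one SIMD operation
            b[i + j] += alpha * a[i + j];
        }
    }
}
```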
Then, the parallelization feature of ETL has been completely reworked. Before, there was a thread pool for each algorithm that was parallelized. Now, there is a global thread engine with a single thread pool. Since parallelization is not nested in ETL, this improves performance slightly by greatly reducing the number of threads created throughout an application. Another big difference in the parallel dispatching is that it can now detect a good split based on alignment, so that each split is aligned. This allows the vectorization process to use aligned loads and stores instead of unaligned ones, which may be faster on some processors.
Vectorization has also been greatly improved in ETL. Integer operations are now automatically vectorized on processors that support it.
