Performance
Improvements in performance and efficiency:
Elementary Functions
Multiple Memory Regions
Parallel Linear Algebra
Parallel Garbage Collector
Polynomial Arithmetic
Sparse Matrices and Vectors
With every new release, Maplesoft strives to improve the efficiency and speed of its mathematical computations. This involves making improvements in the most frequently called routines and algorithms, as well as in the low-level infrastructure.
Maple 17 features new algorithms for faster numerical calculations with complex numbers and new low-level routines that make sparse matrix and vector concatenation much more practical. Updates to optimized BLAS and LAPACK functions for floating-point matrices and vectors make matrix computations such as Cholesky decomposition and eigenvalues, much quicker on multi-core systems.
Memory usage in Maple is much improved with the combination of multiple memory regions, which reduce fragmentation and make much more efficient use of overall memory, and the new parallel garbage collector, which helps to efficiently release memory that is no longer used.
Lastly, numerous improvements in underlying routines combined with a new high performance data structure for distributed multivariate polynomials drastically improves the speed and scalability of most polynomial computations.
Polynomials with Integer Coefficients
See Also
Hardware algorithms have been introduced into Maple 17 that greatly speed up operations on complex floating-point numbers. Compared to Maple 16, the Maple 17 is almost 2000 times faster in some cases.
To demonstrate this, the following commands create a random vector, V, containing 10,000 complex floating-point numbers. The time (in seconds) needed to perform operations on all 10,000 elements is found using various functions.
Example
Benchmarks computed on an Intel Q8200 2.33GHz Quad-Core CPU running Linux.
V:=LinearAlgebra:-RandomVector104,datatype=complex8:
timerealln~V;
0.002
timereallog~V;
timereallog10~V;
0.005
timerealsqrt~V;
0.001
timerealarcsin~V;
0.003
timerealarcsinh~V;
0.027
timerealarctan~V;
timerealarctanh~V;
timerealarccos~V;
timerealarccosh~V;
The following lists, m16 and m17, contain the measured times to compute these operations using Maple 16 and 17, respectively. The column graph plots these lists to show the speed up between the two versions.
m16:=2.557,2.914,3.162,2.020,5.193,5.417,2.810,2.972,5.655,6.300:
m17:=0.002,0.002,0.005,0.001,0.003,0.0027,0.003,0.003,0.003,0.003:
Statistics:-ColumnGraph⁡m16,m17,datasetlabels=ln,log,log10,sqrt,arcsin,arcsinh,arctan,arctanh,arccos,arccosh,legend=Maple 16,Maple 17,labels=,time (s),title=Computation Times,titlefont=TIMES, BOLD, 18;
A significant improvement to Maple's memory allocator is the addition of multiple memory regions.
Benefits of multiple memory regions include:
More efficient allocation of memory, leading to more memory being available to solve bigger problems without the need to initially specify the size of the reserved region.
Better resource sharing between different processes running on the same machine, including multiple instances of Maple and multiple Maple kernels launched as parallel grid nodes.
Performance improvements due to less fragmentation and cache locality, as well as performance improvements when running parallel code.
Previously, Maple initially reserved a large contiguous heap of memory that was subsequently managed by additional allocations and garbage collection as required. Initially, only a small fraction of this reserved memory was allotted to handle memory allocation requests. Whenever a given allotment was exhausted, a garbage collection would occur to reclaim any unused memory. As the computation required more memory, the memory management system would progressively commit more of the heap to be allocated.
With the addition of memory regions, Maple is no longer constrained by the limitation imposed by a single large contiguous memory region (heap). Instead, Maple is able to incorporate additional regions upon demand. Various types of memory regions are provided: small allocations come from thread-local regions (512 words and less), medium-sized allocations reside in global thread-shared regions (less than 1MB), and finally large blocks of memory are individually allocated as distinct regions of their own. Overall, this allows Maple to more effectively and efficiently manage the available memory resources, and more specifically, this offers improvements to caching via better data locality and reduced fragmentation.
In earlier releases of Maple, when a chunk of memory was committed, it remained under Maple's control for the remainder of the session. Handling memory requests larger than 1MB by allocating an individually tailored memory region enables Maple 17 to return this large block of memory back to the operating system when it is no longer needed.
For example, the following code snippet allocates an array of matrices, does some work, and then clobbers the array thereby rendering the allocated memory garbage. Notice that memory usage drops after the explicit call to garbage collection.*
restart;
size ≔ 1024:elems ≔ 10:A ≔ Array 1.. elems, 1..elems :for i to elems do for j to elems do Ai, j ≔ Matrix size, size, datatype = float8 : end do: end do:
Report how many KB Maple is using.
kerneloptsbytesalloc1024;
850704
A ≔ 0:
gc;kerneloptsbytesalloc1024;
88476
* The explicit call to gc() is used here to illustrate a specific behavior of Maple's memory management system. In general, it is better to allow Maple to decide when to initiate a garbage collection cycle.
The access to optimized BLAS and LAPACK functions for floating-point Matrices and Vectors has been updated in Maple 17.
On Linux and OS X, this involves updates to the ATLAS (Automatically Tuned Linear Algebra Software) dynamic libraries which are shipped with Maple. While on 64-bit Windows, Maple 17 has been updated to use Intel's Math Kernel Library (MKL).
Performance enhancements for floating-point Linear Algebra operations include improved use of multiple cores and CPUs.
The operations referenced in the plots in this document were all performed on real, floating-point Matrices and Vectors created with the datatype=float[8] option.
Linux (32bit)
Linux (64bit)
OS X
Windows (64bit)
One of the core components of Maple's engine is the garbage collector. The garbage collector is responsible for finding and reclaiming memory that is no longer needed by the evaluation engine. In Maple 17, the garbage collector is capable of taking advantage of multiple processors to perform its job more quickly. As the garbage collector is used continuously as Maple runs, this speed up helps all users, not just those running parallel algorithms.
The garbage collector in Maple 17 is capable of running multiple threads to take advantage of multiple cores. This can speed up the collector significantly. Although the collector is only responsible for a fraction of Maple's total running time, this can still lead to running times reduced by up to 10%. These speed ups require no changes to user code.
There are 2 kernelopts that can control the parallel collector.
gcmaxthreads controls the maximum number of threads that the parallel garbage collector will use.
gcthreadmemorysize controls how many threads Maple will use for a collection. The number of bytes allocated is divided by the value assigned to gcthreadmemorysize to determine the number of threads.
The following example shows the speed up in the garbage collector from using multiple threads.
restart
if kerneloptswordsize = 32 then upperBound ≔ 2⋅10^7;else upperBound ≔ 10^8; end if:
data ≔ seqi, i = 1 .. upperBound:
p:=proc local i; for i to 100 do gc end do end proc:
kerneloptsgcmaxthreads=1: t11≔timerealp⁡:
kerneloptsgcmaxthreads=2:t12≔timerealp⁡:
kerneloptsgcmaxthreads=4:t14≔timerealp⁡:
Statistics:-ColumnGraph⁡t1[1],t1[2],t1[4],datasetlabels=One Thread,Two Threads,Four Threads,labels=,Seconds,gridlines
The garbage collection algorithm only comprises a small amount of the total Maple running time, so the speed up will not be this significant for real Maple computations. The following example shows a more realistic situation.
d:=seq⁡randpoly⁡x,y,z,dense,degree=12,i=1..10000:
f:=proc local i; mapi→int⁡i,x,d end proc:
kernelopts⁡gcmaxthreads=1:
t21≔timerealf⁡:
kernelopts⁡gcmaxthreads=2:
t22≔timerealf⁡:
kernelopts⁡gcmaxthreads=4:
t24≔timerealf⁡:
Statistics:-ColumnGraph⁡t21,t22,t24,datasetlabels=One Thread,Two Threads,Four Threads,labels=,Seconds,gridlines
Maple 17 automatically uses a new high performance data structure for distributed multivariate polynomials with integer coefficients. Together with new algorithms in the Maple kernel, this drastically improves the speed and scalability of most polynomial computations.
Benchmarks computed in real times on a 64-bit quad core Intel Core i7 920 2.66 GHz.
Overview
Expanding this polynomial with 170544 terms takes 31 milliseconds and 2.6 MB of memory in Maple 17, versus 250 ms and 29 MB in Maple 16. This speedup is entirely due to the reduction in overhead for the new data structure. The external C routine that computes this result has not changed.
f:=1+x+y+z+t+u+v+w15:
f≔CodeTools:-Usageexpand⁡f:
Maple 17:
memory used=2.67MiB, alloc change=2.61MiB, cpu time=30.00ms, real time=31.00ms
Maple 16:
memory used=23.40MiB, alloc change=29.16MiB, cpu time=242.00ms, real time=250.00ms
Mathematica® 9:
0.564 seconds
AbsoluteTiming[ f = Expand[ (1+x+y+z+t+u+v+w)^15 ]; ]
Common queries such as finding the variables, computing the total degree, and testing for a polynomial run in O(1) time for the new data structure. The effect on Maple library is profound.
X:=CodeTools:-Usage⁡indets⁡f:
d≔CodeTools:-Usagedegree⁡f:
memory used=0.59KiB, alloc change=0 bytes, cpu time=0ns, real time=0ns
memory used=432 bytes, alloc change=0 bytes, cpu time=0ns, real time=0ns
memory used=0.66KiB, alloc change=0 bytes, cpu time=76.00ms, real time=77.00ms
memory used=0.72KiB, alloc change=0 bytes, cpu time=119.00ms, real time=119.00ms
memory used=0.55KiB, alloc change=0 bytes, cpu time=53.00ms, real time=54.00ms
0.125 seconds, X = Variables[ f ]
0.166 seconds, d = Exponent[ f, x] (* degree in x, total degree not builtin *)
4.611 seconds, PolynomialQ[ f ]
New high performance algorithms were developed for the Maple kernel to exploit the new data structure. We improved many core operations including differentiation, extracting a coefficient, and evaluating a polynomial.
CodeTools:-Usage⁡∂∂x⁢f:
CodeTools:-Usagecoeffs⁡f,x:
CodeTools:-Usagefx=2,y=3,z=5|fx=2,y=3,z=5:
memory used=2.60MiB, alloc change=2.61MiB, cpu time=5.00ms, real time=5.00ms
memory used=5.21MiB, alloc change=2.61MiB, cpu time=21.00ms, real time=21.00ms
memory used=2.60MiB, alloc change=2.61MiB, cpu time=26.00ms, real time=26.00ms
memory used=12.36MiB, alloc change=9.72MiB, cpu time=174.00ms, real time=175.00ms
memory used=34.78MiB, alloc change=29.16MiB, cpu time=194.00ms, real time=194.00ms
memory used=48.63MiB, alloc change=34.02MiB, cpu time=476.00ms, real time=478.00ms
1.001 seconds, D[f,x];
1.030 seconds, CoefficientList[f, x];
1.304 seconds, f /. {x->2, y->3, z->5};
Dense methods have been added to multiply multivariate polynomials in subquadratic time. Maple 17 automatically selects a sparse or dense algorithm, balancing time and memory use. Dense algorithms are not parallelized in Maple 17.
g:=expandx5+x3+x2+x+120⁢y5+y3+y2+y+120:
p≔CodeTools:-Usageexpand⁡g+1⁢g+2:
memory used=2.55MiB, alloc change=0.61MiB, cpu time=148.00ms, real time=149.00ms
memory used=4.07MiB, alloc change=0 bytes, cpu time=4.55s, real time=1.19s
1.331 seconds
g = Expand[ (x^5+x^3+x^2+x+1)^20 (y^5+y^3+y^2+y+1)^20 ];
AbsoluteTiming[ p = Expand[ (g+1)*(g+2) ]; ]
Parallel speedup increases in Maple 17 due to the elimination of sequential overhead and Amdahl's Law. For the factorization below, Maple 17 uses 26.01 CPU seconds versus 62.24 for Maple 16, so it saves a little more than half the work. However, all of the time saved was sequential, so Maple 17 actually runs 3.7x faster on the quad core CPU, and parallel speedup increases from 1.3x to 2.0x.
f:=expand1+x+y+z+t20:
p:=expandf+1⁢f+2:
CodeTools:-Usagefactor⁡p:
memory used=1.43GiB, alloc change=82.04MiB, cpu time=26.01s, real time=12.90s
memory used=4.29GiB, alloc change=256.41MiB, cpu time=62.24s, real time=47.60s
952.857 seconds
f = Expand[ (1+x+y+z+t)^20 ];
p = Expand[ (f+1)*(f+2) ];
AbsoluteTiming[ Factor[ p ]; ]
Benchmarks
For each problem we expand the input f and g, then we time the multiplication p = f⋅g, the division q=pf, and the factorization of p.
For Maple we timed p := expand(f*g); divide(p,f,'q'); and factor(p);
For Mathematica® we timed p = Expand[f g]; PolynomialQuotient[p, f, x]; and Factor[p]; except for Problem #5 where Cancel[p/f]; was much faster than PolynomialQuotient.
All times are real times in seconds on a 64-bit Core i7 920 2.66GHz Linux system.
Problem #1: 1771 x 1771 = 12341 terms
f = x+y+z+120+1
g = f+1
Problem #2: 1771 x 1771 = 12341 terms
f = x2+y2+z2+120+1
Problem #3: 5456 x 5456 = 39711 terms
f = x+y+z+130+1
Problem #4: 10626 x 10626 = 135751 terms
f = x+y+z+t+120+1
Problem #5: 9261 x 9261 = 12552 terms
f = 1+x20⋅1+y20⋅1+z20+1
g = 1−x20⋅1−y20⋅1−z20+1
Problem #6: 3003 x 3003 = 417311 terms
f = 1+u2+v+w2+x−y10+1
g = 1+y+v2+w+x2+y10+1
The timing data is given in matrix form below. Mathematica® 9 is the first row, Maple 16 is the middle row, and Maple 17 is the last row.
Enhancements have been made in Maple 17 for the performance of stacking and augmenting sparse floating-point, hardware datatype Matrices and Vectors. Computing linear combinations of pairs of sparse floating-point, hardware datatype Matrices using the LinearAlgebra:-MatrixAdd command has also been improved.
In the examples below, the Matrix and Vector constructors are used for stacking or augmentation (concatenation).
Benchmarks computed in 64-bit Maple 17 for Linux on an Intel Core i5 CPU 760 @ 2.80GHz.
Sparse vector concatenation
In Maple 16, the concatenation of Vector V with itself three times would take approximately 28 seconds and would result in an output Vector with full rectangular storage. The allocated memory would increase by 1.27GiB.
In Maple 17, it takes less than 0.2 seconds and produces a Vector with sparse storage. Memory allocation increases by 30.5MiB.
restart:
with⁡LinearAlgebra:
V:=RandomVector⁡107,generator=0.0..1.0,density=0.05,storage=sparse,datatype=float8:
rtable_num_elems⁡V,NonZeroStored
500366
biggerV:=CodeTools:-Usage⁡Vector⁡V,V,V,V
memory used=76.37MiB, alloc change=30.54MiB, cpu time=100.00ms, real time=121.00ms
rtable_num_elems⁡biggerV,NonZeroStored
2001464
Sparse matrix concatenation
In Maple 16, the concatenation of Matrix M with itself as performed below would take approximately 20 seconds and would result in an output Matrix with full rectangular storage.
In Maple 17, it takes less than 0.1 seconds and produces a Matrix with sparse storage. Memory allocation increases by 48MiB.
withLinearAlgebra:
M:=RandomMatrix⁡103,103,generator=0.0..1.0,density=0.01,storage=sparse,datatype=float8:
rtable_num_elems⁡M,NonZeroStored
10053
biggerM:=CodeTools:-Usage⁡Matrix⁡M,M
memory used=0.76MiB, alloc change=48.00MiB, cpu time=30.00ms, real time=55.00ms
rtable_num_elems⁡biggerM,NonZeroStored
20106
Simple linear combinations of sparse matrices
In Maple 16, the following command would take 3 minutes and allocated memory would increase by 400MiB.
In Maple 17, it takes less than 1 second and allocated memory increases by 76MiB.
M:=RandomMatrix104,104,generator=0.0..1.0,density=0.01,storage=sparse,datatype=float8:
CodeTools:-Usage⁡MatrixAdd⁡M,M,2,−3.3
memory used=76.67MiB, alloc change=76.35MiB, cpu time=360.00ms, real time=417.00ms
Elementary Functions in Maple 17, Memory Regions in Maple 17, Parallel Linear Algebra, Garbage Collection, Sparse Matrices and Vectors
Download Help Document