Items tagged with performance performance Tagged Items Feed

Based on the title, once again I'll mention that the programm that I have been made by using

Maple14 could not run faster.

The function of Maple should be running effectively compared to others.

Hereby, I'll attach the programm for your further action.

That's all, thank you.

I have to solve a linear system

in Maple. A is a tridiagonal matrix.

My problem is that my matrix A is big and therefore, I have to wait too long. Is there some function in Maple that can help this problem?

I'm using Maple to simulate particle collision and interaction (absorption and scattering) and keep tracking its energy until each particle either get absorbed or leaked (Energy less than predefined minimum energy). I also use Maple statistical pacakge. I also need to call MATLAB link to generate histogram in a logarithmically spaced bin, a feature that I could not find anywhere in Maple and I think it is one of the reasons for calculation performance issue.

The problem...

Yesterday I wrote a post that began,

"I realized recently that, while 64bit Maple 15 on Windows (XP64, 7) is now using accelerated BLAS from Intel's MKL, the Operating System environment variable OMP_NUM_THREADS is not being set automatically."

But that first sentence is about where it stopped being correct, as far as how I was interpreting the performance on 64bit Maple on Windows. So I've rewritten the whole post, and this is the revision.

I concluded that, by setting the Windows operating system environment variable OMP_NUM_THREADS to 4, performance would double on a quad core i7. I even showed timings to help establish that. And since I know that memory management and dynamic linking can cause extra overhead, I re-ran all my examples in freshly launched GUI sessions, with the user-interface completely closed between examples. But I got caught out in a mistake, nonetheless. The problem was that there is extra real-time cost to having my machine's Windows operating system dynamically open the MKL dll the very first time after bootup.

So my examples done first after bootup were at a disadvantage. I knew that I could not look just at measured cpu time, since for such threaded applications that reports as some kind of sum of cycles for all threads. But I failed to notice the real-time measurements were being distorted by the cost of loading the dlls the first time. And that penalty is not necessarily paid for each freshly launched, completely new Maple session. So my measurements were not fair.

Here is some illustration of the extra real-time cost, which I was not taking into account. I'll do Matrix-Matrix multiplication for a 1x1 example, to try and show just how much this extra cost is unrelated to the actual computation. In these examples below, I've done a full reboot on Windows 7 where so annotated. The extra time cost for the very first load of the dynamic MKL libraries can be from 1 to over 3 seconds. That's about the same as the cpu time this i7 takes to do the full 3000x3000 Matrix multiplication! Hence the confusion.

Roman brought up hyperthreading in his comment on the original post. So part of redoing all these examples, with full restarts between them, is testing each case both with and without hyperthreading enabled (in the BIOS).

Quad core Intel i7. (four physical cores)

Hyperthreading disabled in BIOS
-------------------------------

> restart: # actual OS reboot
> getenv(OMP_NUM_THREADS);   # NULL, unset in OS

> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=217.18KiB, alloc change=127.98KiB, cpu time=219.00ms, real time=3.10s

> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ):
memory used=9.46KiB, alloc change=0 bytes, cpu time=0ns, real time=0ns


> restart: # actual OS reboot
> getenv(OMP_NUM_THREADS);
                              "4"

> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=216.91KiB, alloc change=127.98KiB, cpu time=140.00ms, real time=2.81s

> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ):
memory used=9.46KiB, alloc change=0 bytes, cpu time=0ns, real time=0ns


Hyperthreading enabled in BIOS
------------------------------

> restart: # actual OS reboot
> getenv(OMP_NUM_THREADS);    # NULL, unset in OS

> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=217.00KiB, alloc change=127.98KiB, cpu time=202.00ms, real time=2.84s

> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ):
memory used=9.46KiB, alloc change=0 bytes, cpu time=0ns, real time=0ns


> restart: # actual OS reboot
> getenv(OMP_NUM_THREADS);
                              "4"

> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=215.56KiB, alloc change=127.98KiB, cpu time=187.00ms, real time=1.12s

> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ):
memory used=9.46KiB, alloc change=0 bytes, cpu time=0ns, real time=0ns


Having established that the first use after reboot was incurring a real time penalty of a few seconds, I redid the timings in order to gauge the benefit of having OMP_NUM_THREADS set appropriately. These too were done with and without hyperthreading enabled. The timings below appear to indicate that slightly bettern performance can be had for this example in the case that hyperthreading is disabled. The timings also appear to indicate that having OMP_NUM_THREADS unset results in performance competitive with having it set to the number of physical cores.

Hyperthreading disabled in BIOS
-------------------------------

> restart:
> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=217.84KiB, alloc change=127.98KiB, cpu time=141.00ms, real time=142.00ms

> getenv(OMP_NUM_THREADS);  # NULL, unset in OS

> M:=LinearAlgebra:-RandomMatrix(3000,datatype=float[8]):
> CodeTools:-Usage( M . M ):
memory used=68.67MiB, alloc change=68.74MiB, cpu time=7.50s, real time=1.92s


> restart:
> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=217.84KiB, alloc change=127.98KiB, cpu time=141.00ms, real time=141.00ms

> getenv(OMP_NUM_THREADS);
                              "1"

> M:=LinearAlgebra:-RandomMatrix(3000,datatype=float[8]):
> CodeTools:-Usage( M . M ):
memory used=68.67MiB, alloc change=68.74MiB, cpu time=7.38s, real time=7.38s


> restart:
> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=217.11KiB, alloc change=127.98KiB, cpu time=125.00ms, real time=125.00ms

> getenv(OMP_NUM_THREADS);
                              "4"

> M:=LinearAlgebra:-RandomMatrix(3000,datatype=float[8]):
> CodeTools:-Usage( M . M ):
memory used=68.67MiB, alloc change=68.74MiB, cpu time=7.57s, real time=1.94s



Hyperthreading enabled in BIOS
------------------------------

> restart:
> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=216.57KiB, alloc change=127.98KiB, cpu time=125.00ms, real time=125.00ms

> getenv(OMP_NUM_THREADS);  # NULL, unset in OS

> M:=LinearAlgebra:-RandomMatrix(3000,datatype=float[8]):
> CodeTools:-Usage( M . M ):
memory used=68.67MiB, alloc change=68.74MiB, cpu time=8.46s, real time=2.15s


> restart:
> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=216.80KiB, alloc change=127.98KiB, cpu time=125.00ms, real time=125.00ms

> getenv(OMP_NUM_THREADS);
                              "1"

> M:=LinearAlgebra:-RandomMatrix(3000,datatype=float[8]):
> CodeTools:-Usage( M . M ):
memory used=68.67MiB, alloc change=68.74MiB, cpu time=7.35s, real time=7.35s


> restart:
> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=216.80KiB, alloc change=127.98KiB, cpu time=125.00ms, real time=125.00ms

> getenv(OMP_NUM_THREADS);  # NULL, unset in OS
                              "4"

> M:=LinearAlgebra:-RandomMatrix(3000,datatype=float[8]):
> CodeTools:-Usage( M . M ):
memory used=68.67MiB, alloc change=68.74MiB, cpu time=8.56s, real time=2.15s


> restart:
> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=216.80KiB, alloc change=127.98KiB, cpu time=125.00ms, real time=125.00ms

> getenv(OMP_NUM_THREADS);
                              "8"

> M:=LinearAlgebra:-RandomMatrix(3000,datatype=float[8]):
> CodeTools:-Usage( M . M ):
memory used=68.67MiB, alloc change=68.74MiB, cpu time=8.69s, real time=2.23s

With all those new timing measurements it appears that having to set the global environment variable OMP_NUM_THREADS to the number of physical cores may not be necessary. The performance is comparable, when that variable is left unset. So, while this post is now a non-story, it's interesting to know.

And the lesson about comparitive timings is also useful. Sometimes, even complete GUI/kernel relaunch is not enough to get a level and fair field for comparison.

I was recently looking at rotating a 3D plot, using plottools:-rotate, and noticed something inefficient.

In the past few releases of Maple, efficient float[8] datatype rtables (Arrays or hfarrays) can be used inside the plot data structure. This can save time and memory, both in terms of the users' creation and manipulation of them as well as in terms of the GUI's ability to use them for graphic rendering.

What I noticed is that, if one starts with a 3D plot data structure containing a float[8] Array in the MESH portion, then following application of plottools:-rotate a much less efficient list-of-lists is produced in the resulting structure.

Likewise, an effiecient float[8] Array or hfarray in the GRID portion of a 3D plot structure gets transformed by plottools:-rotate into an inefficient list-of-lists object in the MESH portion of the result. For example,

restart:

p:=plot3d(sin(x),x=-6..6,y=-6..6,numpoints=5000,style=patchnogrid,
          axes=box,labels=[x,y,z],view=[-6..6,-6..6,-6..6]):

seq(whattype(op(3,zz)), zz in indets(p,specfunc(anything,GRID)));
                            hfarray

pnew:=plottools:-rotate(p,Pi/3,0,0):

seq(whattype(op(1,zz)), zz in indets(pnew,specfunc(anything,MESH)));
                              list

The efficiency concern is not just a matter of the occupying space in memory. It also relates to the optimal attainable methods for subsequent manipulation of the data.

It may be nice and convenient for plottools to get as much mileage as it can out of plottools:-transform, internally. But it's suboptimal. And plotting is a topic where dedicated, optimized helper routines for some particular data format is justified and of merit. If we want plot manipulation to be fast, then both Library-side as well as GUI-side operations need more case-by-case-optimizated.

Here's an illustrative worksheet, using and comparing memory performance with a (new, alternative) procedure that does inplace rotation of a 3D MESH. plot3drotate.mw

Whenever I try to plot a simple/small function in 2d, Maple literally takes forever to calculate it whenever the frame rate is too high.

For instance, the following command (from a tutorial, so it should work):

animate( plot, [exp(-x/5)*sin(x),x=0..t], t=0..20, frames=100 );

Will not compute, unless i change the number of frames to 50 or below.

I tried this on a friends computer (he has a mac, but has way weaker performance) and it worked just fine.

Consider a function f depending on a parameter k and a variable y

f:=(k,y)->k*s(y);

where s:=y->add(sin(1/y^i),i=1..1000000) is just some nasty function.

Now, why does it take a long time to evaluate f(0,y)? And how could I speed this up?

 

More generally, I would like to plug (varying) parameters into a function, simplify and then do analysis on the resulting function. How could I do that?

I am calculating the dotproduct in maple 15 of two very large vectors. Are there any tricks/ways to speed up such a calculation? Maple is taking a very long time to execute the calculation and I am unsure whether it will actually return a result.

I'm using maple code of the form DotProduct(vector1,vector2);

 

Any advice ?

 

Most programs will not produce and assign to a large number of global "top-level" names. But it is interesting that the cost associated with such global name assignment is related to the number of entries in libname.

A possible cause of this cost is the need to check whether the name is protected, before assigning.

The following timings were made on 32bit Maple 15 running on Windows 7, on an Intel i7. The set of four timings is...

The complexplot3d command can color by using (complex) argument for the hue, and compute height z by magnitude. So, when rotated to view the x-y plane straight on, it can provide a nice coloring of the argument of whatever complex-valued expression is being plotted.

Another way to obtain a similar plot is to use densityplot (with appropriate values for its scaletorange option) and apply argument to the expression or function being plotted. For some kinds of complex-valued...

I'm working on a 3D program that plots stars, star clusters and other objects of interest against a model of the Milky Way. The guts of this program create a few hundred points (< 2000) that represent stars, and move them a little bit in random directions to create a cloud effect. Then the animate procedure rotates them about the center of the galaxy. I would like to add more stars and do more things with them, but I am encountering a very sluggish editor, suggesting that Maple is imposing a memory limit. The Activity Monitor says that Maple is using 140 MB, so gigabytes of RAM remain available. Furthermore, if I save the plot as an animated gif, I can run half a dozen copies simultaneously in separate browser windows without encountering noticeable slowdown. But I can't elaborate the program without better performance from the editor.

The program is written in the Document interface. I tried saving it to the classic interface, but that didn't seem to help, and I found it difficult to edit.

Ideas sought.

This is in response to a Question about the speed and memory use of an animated DEplot. The problems are that the example's animation was slow to create, and prohibitively expensive to save in a Document.

An alternative approach is to combine multiple calls to plots:-odeplot with a call to plots:-fieldplot to supply the background flow arrows. This is a lot faster. It takes less memory to create and run, but the GUI may still consume too much resources saving it. The good news is that it's so much faster that it's not inconvenient to re-run the entire thing from scratch. And so it's quite feasible to remove all the expensive output from the Document prior to saving and thus avoid the whole resources problem.

The original questioner also wanted to visualize with resect to two varying parameters. So I've also done an implementation of that using Embedded Components and two Sliders.

Here is the old DEplot animation. It takes about 40 sec to create it animation on an Intel i7.

Here is the new combined odeplot+fieldplot animation. It takes a second or two to create its animation on an Intel i7.

Here is the DEplot in Embedded Components. It's very slow, and the image doesn't change smoothly with the sliders.

Here is the new combined odeplot+fieldplot in Embedded Components. Its image changes pretty smoothly with the sliders.

I encourage completely quitting the GUI (not just restart, or close Document and re-Open) between comparison runs of these implementations, of you want to get a really good feel for the effects of both running them as well as saving them (with and without all output).

For the Embedded Component documents, the functioning code resides inside the "initialize" button. (right-click, go to Component Properties, Action When Clicked, only if you want to inspect it.) To run those two  Documents, execute all the commands (use the triple-exclam from the menubar if you like), and then press the "initialize" button, and then move the sliders.

The difference in performance is related partly to the use of hardware datatype Arrays in the PLOT structures generated by odeplot and fieldplot. (But an `arrow` primitive would help even more!)

And (I think) there is improvement by virtue of using dsolve/numeric/parameters in the use of `odeplot`. That saves overhead from repeated cold invocations of dsolve/numeric. And DEplot doesn't support that, since it expects as argument the system of DEs and ICs. The newer `odeplot` command accepts the procedure returned by dsolve/numeric, and thus allows for efficient repeated setting of parameter values. The `fieldplot` command doesn't need the solution of the DE system at all: it just needs the DEs.

I would have considered wrapping the whole combined approach up into a single command, but it might have to accept separate options for the view ranges, in order to always look its best. A smart version might be able to deduce the computed ranges from the odeplot output, and then create the background fieldplot based on that.

This page from the Online Help suggests that GMP 4.2.1 is still being used in Maple 15.

But the latest "stable" release is GMP 4.3.2, see release notes here (various fixes and improvements in the 4.3.xx series mentioned).

Perhaps this could be updated in a service release Maple 15.xx.

The goal of computing only a select number of eigenvectors of a real symmetric floating-point Matrix comes up now and then. For very large Matrices the memory requirements can be more restrictive than the timing.

The attached worksheet and code computes this, more quickly and with significantly less memory allocation than does the usual task of computing all eigenvectors. By using the supplied Matrix itself as a partial "workspace" the amount of additional workspace and memory allocation for the task is negligible.

For example, having created the very large Matrix in the first place,  essentially no significant further memory allocation is required to compute the largest eigenvalue and its associated eigenvector.

A little about this routine `SelectedEigenvectors` follows.

It only works in hardware double precision. It expects a float[8] datatype Matrix (because you are serious about using minimal memory!). It uses the CLAPACK function dsyevx, using the "wrapperless" version of Maple's external-calling mechanism. It seems to work fine in the systems I've tried so far: Maple 13 and 14 on both 32bit and 64bit Linux and Windows.

Whether it computes and returns the selected eigenvectors (alongside the selected eigenvalues, which are always returned) is controlled by the 'vectors=truefalse' optional argument. By default it uses the Matrix argument as partial workspace and so destroys the original data; but this can be overridden with the 'preserve=true' optional argument. The requested accuracy can be relaxed with the 'epsilon=float' optional argument, which might sometimes speed it up.

The input Matrix is presumed to be symmetric. By default it uses the data in the lower triangle, but this can be changed to be the upper triangle with the 'uplo' optional argument.

The choice of eigenvalues is controlled by the two integer arguments `il` and `iu`. If il=iu=n then only the nth largest eigenvalue is computed. If il=1 and iu=4 then the four smallest eigenvalues are computed.

It returns three things: a Vector of dimension n whose first m entries are the selected eigenvalues, a nxm Matrix whose columns are the m associated eigenvectors, and a Vector of dimension n whose entries indicate whether corresponding eigenvectors failed to converge.

I didn't enable float arguments such as `vl` and `vu` which in principle could allow one to supply a floating-point range in which to find eigenvalues.

I didn't make an optimization of having it do an initial "dummy" external call in which no calculation would be done, but which would instead query for and subsequently utilize the optimal-performance additonal float workspace size.

For reasons mysterious to me, on Windows the 64bit version runs almost exactly half as fast as the 32bit version.

Usually, the workspace for eigen-solving is implemented to be at least O(n^2) for an nxn Matrix. But this routine does only O(m+n) extra workspace allocation to compute the m eigenvectors. And that is linear. Which is the Big Deal.

A 5000x5000 datatype=float[8] (ie. hardware double precision) Matrix takes 200MB of memory. With the preserve=true option, this routine can compute just the largest eigenvector with only about 200MB of additional allocation. And if the original Matrix is no longer required then with the preserve=false option this routine can do that task with less than 1MB further allocation. In comparison, the regular LinearAlgebra:-Eigenvectors command would require about 600MB of additional memory allocation while computing all eigenvectors.

At size 5000x5000 this routine is only about four times faster than LinearAlgebra:-Eigenvectors. I suspect that is because it still has to compute in full the reduction to tridiagonal form.

Download dsyevx.mw

GMP version 5.0.1

December 08 2010 by acer 6926 Maple

Maple uses the LGPL licensed GMP as a support library for doing large integer and high precision float computations. There exists is a later version of GMP than was bundled and shipped in Maple 14.x. A natural question is whether it would improve Maple's performance. Another question is whether it can be used within Maple 14.

1 2 3 Page 1 of 3