acer

MaplePrimes Activity

These are replies submitted by acer

In your second try, when you call dzurflu(720), you're forgetting that the new unapplied function is actually named dzurflu14. Calling dzurflu14(...) repeatedly should work fine. I still think that using D for this is more straightforward. See the above reply, or the ?D help-page.
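Here is a minimal sketch, using a made-up function (not your dzurflu, whose definition I don't have at hand):

f := x -> sin(x^2) + x^3:
df := D(f):          # D(f) returns the derivative as a new operator
df(720), df(0.5);    # df can be applied repeatedly, like any function

acer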
I wasn't completely sure what you were after. It doesn't seem to have much to do with integrals per se; it looks more like a task of extracting coefficients and replacing each term by some function call.
split := proc(z::polynom(anything,[x,y]))
  local C, t, i;
  # coeffs returns the coefficients of z in x and y, and stores the
  # corresponding monomials in the third (name) argument, here t.
  C := coeffs(z, [x,y], 't');
  # Replace each monomial by INT(x-degree, y-degree), keeping its coefficient.
  seq(C[i]*INT(degree(t[i],x), degree(t[i],y)), i = 1..nops([C]));
end proc:

z := a + b/c*x + d/e*x*y^3 + f*y^5:
split(z);
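# Expected result (term order may vary):
#   a*INT(0,0) + b/c*INT(1,0) + d/e*INT(1,3) + f*INT(0,5)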

z := c1 + c2*x + c3*y + c4*x*y:
split(z);
acer
I think that there's probably more that can be done. The ArrayTools:-Copy calls might all be replaced by the lowest-level external calls to BLAS (just like the others that appear within matexp). There is a bit of non-external work done in matexp, most of it to do with the size n, Digits, and routine selection according to UseHardwareFloats.

It could be set up that a new routine, matexp_generator(), emits a version of matexp in which most of that work is hard-coded. (LinearAlgebra could do this, if someone went to the trouble.) This new routine could use a remember table so that it only had to generate one matexp() version per combination of n and Digits. The hardware external functions could be left in, and the software external functions removed, for your example. The matexp() version could be emitted by doing a subs on a template procedure -- a minor variant of the existing matexp() procedure. It'd make a decent blog post, if it worked well.

There may be savings in the other parts of Parametrize_SU_Euler(), but if I need more of the underlying code I'll send a private message with my email address and you can decide whether that'd be OK.

At some point, when the size n gets large, the cost of the external hardware computations involved with the exponential will dominate. Comparison with Matlab or pure C would, at that point, depend on the algorithm used. There are several poor ways to compute it, and it'd take a little research to figure out the state of the art. There are a few obvious places to start, such as this paper co-authored by Cleve Moler.
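For what it's worth, here is a bare-bones sketch of that generator pattern. The template below is a toy stand-in, and the placeholder names _N_ and _DGTS_ are invented for illustration; the real thing would substitute into a variant of matexp itself:

# Placeholders _N_ and _DGTS_ are replaced at generation time.
matexp_template := proc(A) local S, i;
    Digits := _DGTS_;                  # hard-coded working precision
    S := add(A[i,i], i = 1 .. _N_);    # stand-in for the real hard-coded work
    evalf(S);
end proc:

matexp_generator := proc(n::posint, dgts::posint)
    option remember;                   # one generated routine per (n,dgts) pair
    subs('_N_' = n, '_DGTS_' = dgts, eval(matexp_template));
end proc:

f8 := matexp_generator(8, 15):         # f8 has n=8 and Digits=15 baked in

acer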
Yes, I'm sure that one of the Maple plot devices is cps (not eps); the c in cps stands for color. But on Linux I get color encapsulated PostScript (.eps) when using the right-click context menu on a plot in Maple 11, so maybe that works for you too, in Windows or whatever you have. In that case you probably don't need to set the plot device to cps, as Export might work fine to produce color .eps.

If you look at the .eps file with Ghostview or GSview or similar, does it show in color? If so, then maybe you are OK. Aren't most dvi viewers only capable of displaying in greyscale, even when the embedded .eps contains color? If you run the .dvi through dvips (or whatever MiKTeX has for that), does the resulting .ps display or print in color? And if you produce a pdf from it, does that show the color?
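If you want to test from the command level rather than via the context menu, something like the following may be worth a try (the filename here is just an example):

plotsetup('cps', 'plotoutput' = "colortest.eps"):
plot(sin(x), x = -Pi .. Pi, colour = red);
plotsetup('default'):   # restore normal inline plotting

acer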
# matexp() is a revision of `MatrixExponential/complexfloat`
# which accepts all its Matrix and Vector workspaces as
# additional arguments.
# The input should be in A, and results appear in RESMAT.
# It doesn't matter whether all others are zeroed or not.
matexp := proc(A, a, temp1, temp2, s, workuaf, onevec, RESMAT)
local t, i, bk, p, z, n, m, j, k, l, M, N, tmax, tmin,
extlib, EXTCall, ii, jj, high, DatatypeA, OrderA, StorageA, ExtAdd,
ExtNorm, IndFnsA, small, OrigDigits, sol, ExtRealNorm;
option `Copyright (c) Maplesoft, a division of Waterloo Maple Inc. 2004`;
    m, n := op(1, A);
    if m <> n then error "Expecting a square Matrix" end if;
    if UseHardwareFloats = true then
        extlib :=
            ExternalCalling:-ExternalLibraryName("linalg", 'HWFloat');
        EXTCall := ExternalCalling:-DefineExternal('hw_f06zaf', extlib);
        ExtNorm := ExternalCalling:-DefineExternal('hw_f06uaf', extlib);
        ExtAdd := ExternalCalling:-DefineExternal('hw_f06gcf', extlib);
        ExtRealNorm :=
            ExternalCalling:-DefineExternal('hw_f06raf', extlib)
    elif UseHardwareFloats = false then
        extlib :=
            ExternalCalling:-ExternalLibraryName("linalg", 'SWFloat');
        MPFloat(extlib);
        EXTCall := ExternalCalling:-DefineExternal('sw_f06zaf', extlib);
        ExtNorm := ExternalCalling:-DefineExternal('sw_f06uaf', extlib);
        ExtAdd := ExternalCalling:-DefineExternal('sw_f06gcf', extlib);
        ExtRealNorm :=
            ExternalCalling:-DefineExternal('sw_f06raf', extlib)
    else error
        "expecting UseHardwareFloats to be true or false, yet received %1"
        , UseHardwareFloats
    end if;
    DatatypeA, StorageA, OrderA :=
        rtable_options(A, 'datatype', 'storage', 'order');
    IndFnsA := [rtable_indfns(A)];
    ArrayTools:-Copy(n*n, A, 0, 1, a, 0, 1);
    M := ExtNorm("I", n, n, a, n, workuaf);
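    # Choose N from the norm: scaling A by 2^(-N) brings its norm below 1
    # (the "scaling" half of scaling-and-squaring).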
    if M < 1 then N := 0 else N := trunc(evalf(ln(M)/ln(2) + 1, 9)) end if;
    if UseHardwareFloats = true then OrigDigits := trunc(evalhf(Digits))
    else OrigDigits := Digits
    end if;
    Digits := Digits + length(Digits + 9) + length(n) + N + 1;
    small := Float(1.0, -Digits);
    M := 0.5^N;
    LinearAlgebra:-LA_Main:-MatrixScalarMultiply(a, M, 'inplace' = 'true',
        'outputoptions' = []);
    ArrayTools:-Copy(n*n, a, 0, 1, s, 0, 1);
    ExtAdd(n, 1.0, onevec, 0, 1, s, 0, n + 1);
    ArrayTools:-Copy(n*n, a, 0, 1, temp2, 0, 1);
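    # Taylor series: s already holds I + A; accumulate the A^i/i! terms,
    # alternating temp1/temp2 as source/target, until a term's largest
    # entry falls below the tolerance `small`.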
    for i from 2 do
        if irem(i, 2) = 0 then
            EXTCall("N", "N", n, n, n, 1.0/i, temp2, n, a, n, 0., temp1, n);
            high := ExtRealNorm("M", 2*n, n, temp1, n, workuaf);
            tmax := high;
            if tmax < small then break
            else ExtAdd(n*n, 1.0, temp1, 0, 1, s, 0, 1)
            end if
        else
            EXTCall("N", "N", n, n, n, 1.0/i, temp1, n, a, n, 0., temp2, n);
            high := ExtRealNorm("M", 2*n, n, temp2, n, workuaf);
            tmax := high;
            if tmax < small then break
            else ExtAdd(n*n, 1.0, temp2, 0, 1, s, 0, 1)
            end if
        end if
    end do;
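    # Undo the initial scaling by repeatedly squaring the series result
    # (the "squaring" half of scaling-and-squaring).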
    if N = 0 then sol := s
    else
        EXTCall("N", "N", n, n, n, 1.0, s, n, s, n, 0., temp1, n);
        if N = 1 then sol := temp1
        else
            for i from 2 to N + 1 do
                if irem(i, 2) = 0 then EXTCall("N", "N", n, n, n, 1.0,
                    temp1, n, temp1, n, 0., temp2, n)
                else EXTCall("N", "N", n, n, n, 1.0, temp2, n, temp2, n,
                    0., temp1, n)
                end if
            end do;
            if irem(i, 2) = 1 then sol := temp1 else sol := temp2 end if
        end if
    end if;
    ArrayTools:-Copy(n*n,sol,0,1,RESMAT,0,1);
    NULL;
end proc:
                                                                                                                                           
# I didn't check whether things work for software floats.
UseHardwareFloats:=true:
Digits:=trunc(evalhf(Digits));
# I didn't experiment with gcfreq.
#kernelopts(gcfreq=10^7):

# Size
n:=8;
                                                                                                                                           
# number of times through the loop.
N:=1000;
                                                                                                                                           
# Just using lambda[1] and alpha[1] repeatedly, instead of lambda[i],alpha[i].
lambda[1]:=LinearAlgebra:-RandomMatrix(n,n,generator=-0.01..0.01,
                                       outputoptions=[datatype=complex[8]]):
alpha[1]:=3.0:
                                                                                                                                           
# Uncomment one of these separately for the most accurate measurement.
# Selecting both true misrepresents the bytesalloc increase for whichever
# method runs second, but does allow the two results to be compared.
#oldmethod,newmethod := true,false;
#oldmethod,newmethod := false,true;
oldmethod,newmethod := true,true;
                                                                                                                                           
gc():
if newmethod=true then
(st,ba,bu):=time(),kernelopts(bytesalloc),kernelopts(bytesused):
                                                                                                                                           
# Allocate all the rtables
a:=Matrix(n,n,datatype='complex'('float')):
temp1:=Matrix(n,n,datatype='complex'('float')):
temp2:=Matrix(n,n,datatype='complex'('float')):
s:=Matrix(n,n,datatype='complex'('float')):
RESMAT:=Matrix(n,n,datatype='complex'('float')):
workuaf:=Vector[row](n,'datatype'='float'):
onevec:=Vector(n,'fill'=1.0,'datatype'='complex'('float')):
PROD:=Matrix(n,n,datatype='complex'('float')):
U:=Matrix(n,n,datatype='complex'('float')):
temp := Matrix(n,n,datatype=complex[8]):
                                                                                                                                           
kernelopts(opaquemodules=false):
mmm:=LinearAlgebra:-LA_Main:-LA_External:-MatrixMatrixMultiply:
                                                                                                                                           
for i from 1 to n do U[i,i]:=1.0: od:
for i from 1 to N do
ArrayTools:-Copy(n*n,lambda[1],0,1,temp,0,1):
LinearAlgebra:-LA_Main:-MatrixScalarMultiply(
   temp, I*alpha[1], inplace=true,
   outputoptions=[datatype=complex[8]]):
matexp(temp,a,temp1,temp2,s,workuaf,onevec,RESMAT);
mmm(U,RESMAT,PROD,'inplace'=false):
ArrayTools:-Copy(n*n,PROD,0,1,U,0,1):
od:
                                                                                                                                           
print([time()-st,kernelopts(bytesalloc)-ba,kernelopts(bytesused)-bu]);
U1 := copy(U):
end if:
                                                                                                                                           
gc():
if oldmethod=true then
(st,ba,bu):=time(),kernelopts(bytesalloc),kernelopts(bytesused):
                                                                                                                                           
U := LinearAlgebra:-IdentityMatrix(n,n,compact=false,
                                    outputoptions=[datatype=complex[8]]):
for i from 1 to N do
Temp := LinearAlgebra:-LA_Main:-MatrixScalarMultiply(
           lambda[1], I*alpha[1],
           inplace=false, outputoptions=[datatype=complex[8]]):
U := LinearAlgebra:-LA_Main:-MatrixMatrixMultiply(
        U, LinearAlgebra:-LA_Main:-MatrixFunction(
           Temp, exp(dummy), dummy, outputoptions=[datatype=complex[8]]),
        inplace=false, outputoptions=[]):
od:
                                                                                                                                           
print([time()-st,kernelopts(bytesalloc)-ba,kernelopts(bytesused)-bu]);
U2 := copy(U):
end if:
                                                                                                                                           
if oldmethod=true and newmethod=true then
LinearAlgebra:-Norm(U1-U2);
end if;
                                                                                                                                           

acer
Some things about Maple are learned by experience. Finding out how the Library routines work internally can require digging through them: looking at one routine shows calls to others, and so one can build up a broader picture.

One thing that I've never done, but think should usually be possible, is to dump out all the routines in a package and then reconstruct it as source. It would be interesting to hear views on this, as it could lead to dispersal of code edits. Of course, copyright and the need for Maplesoft's permission would come into that. But it is one possible avenue to partial open sourcing.

I usually look at Library routines like this:

kernelopts(opaquemodules=false);
interface(verboseproc=3);
eval( .... );

One reason for doing it with eval() instead of showstat is that, very often, the routine can then be dumped to a file and modified (using writeto("filename"), then eval, then writeto(terminal) again).

Looking at your example has been very instructive for me. I've realized and confirmed some thoughts I've had for a while, and got some new ones too. To start, I looked at the following routine, because your profiling indicated it as key:

LinearAlgebra:-LA_Main:-`MatrixExponential/complexfloat`

That routine does a lot of Matrix-Matrix multiplications, but considerable care is taken to re-use workspace Matrices. Here's a reason why. Maple's memory management is not very good at reclaiming the large contiguous sections of memory that get allocated for hardware-datatype rtables (Matrices, Vectors, and Arrays). It was likely not originally designed for that, those objects having been introduced in Maple 6. So here is a key rule: avoid creating new rtables.

Now, there is a routine called ArrayTools:-Copy which copies the contents of one hardware rtable to another, doing its work externally in compiled C. It is so fast that one can often move data around at low cost. If the bulk of a computation is O(n^3), like Matrix-Matrix multiplication, then the O(n^2) copying is a secondary concern. That is why `MatrixExponential/complexfloat` uses just a few workspaces and re-uses them for a whole lot of multiplications. The upshot is that single calls to this routine are pretty efficient, but repeated calls entail repeated recreation of its own workspace Matrices.

Let's look a little closer at Matrix-Matrix multiplication. It uses a BLAS function, dgemm or zgemm for real or complex double-precision floats. It acts like this, in-place on array C:

         alpha*A*B + beta*C  ->  C

So the container C which will hold the result must be allocated up front. In practice, one can create a handful of Matrices, just once, and juggle them to do a series of operations that might "normally" be expected to use many more new Matrix objects. For example, the result in C can be copied into A and the operation repeated (with beta=0) to get a powering effect. You can compute a high power with only three Matrices ever allocated (see the sketch at the end of this message).

This leads to another internal routine:

LinearAlgebra:-LA_Main:-LA_External:-MatrixMatrixMultiply

You don't want to create a new rtable container for each result of an operation, if it can be avoided. This routine takes four arguments, (A, B, C, ip). The last argument, of the form 'inplace'=..., specifies whether to act in-place on A. Experience shows that acting in-place on A is not efficient (it forces the usual row-column products, while dgemm can use Strassen or what have you). But with inplace=false this routine acts in-place on Matrix C. It follows the BLAS behaviour of zgemm shown above, but without alpha and beta exposed.

So here's what I've done with your example. I've used that internal MatrixMatrixMultiply routine and shuffled contents around in a re-usable C. I dumped out `MatrixExponential/complexfloat` and edited it to accept all its workspaces as additional arguments. And the scaling of the lambda[i] can be done in-place, or in-place on a re-usable temp Matrix. In this way, all the Matrices are created once, outside of the final expensive loop. Despite the extra copying of contents from Matrix to Matrix, this gets about a three-times speedup at size 8x8. I'll post the code in a new message, but I want to say a little more here.

An important thing here is that the sizes of the hardware Matrices are all the same. Examination of the BLAS shows that they accept integers specifying the "leading dimension" of an array. For "Fortran order" this means the number of rows, since data is stored contiguously by column. So in theory you could use a much larger array and keep the data packed into its upper-left corner; one really large temp array could then serve for problems of many sizes.

I don't know how Matlab works, but one alternative to repeated allocation and collection of workspaces is making use of registered permanent workspace arrays. Of course, eventually one's work requires so much that the memory management has to become more sophisticated. But when the task is set and known in advance, workspace management by hand can win out. Also, parallelizing the task makes such by-hand memory management even more onerous.
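As a concrete illustration of that juggling, here is a minimal sketch (the size n and the power computed are arbitrary; it assumes only the (A, B, C, 'inplace') calling sequence described above):

kernelopts(opaquemodules=false):
mm := LinearAlgebra:-LA_Main:-LA_External:-MatrixMatrixMultiply:
n := 4:
A := LinearAlgebra:-RandomMatrix(n,n,generator=-1.0..1.0,
                                 outputoptions=[datatype=complex[8]]):
B := Matrix(n,n,datatype=complex[8]):   # re-usable source container
C := Matrix(n,n,datatype=complex[8]):   # re-usable target container
ArrayTools:-Copy(n*n, A, 0, 1, B, 0, 1):
for k to 3 do                       # each pass squares: A^2, A^4, A^8
  mm(B, B, C, 'inplace'=false):     # result lands in-place in C
  ArrayTools:-Copy(n*n, C, 0, 1, B, 0, 1):
od:
# B now holds A^8; no new rtables were created inside the loop.

acer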
I think that the timing for Parametrize_SU_Euler can be improved to roughly 35% of the timing of your version posted in this thread for size 8x8, and to about 85% for size 64x64. That's by focusing only on the last loop, since your profiling indicates that's where the expense lies. But before I post code, I have a question: are the lambda[i] Matrices already of datatype=float[8], or complex[8]? (Timing comparisons are on 64bit Linux, Maple 11.) I am curious how fast Matlab is for 10000 8x8 or 1000 64x64 double-precision complex matrix exponentials. acer
"Just curious... do the lambda[used_lambdas[k]] Matrices, as produced by Hermitian_basis(), happen to commute?" Ahh, I see now that this has been covered before. I never did put the internal float Matrix-exponential routine through the wringer. I think that there's a lot of temp Matrix garbage collection going on, for both that and the multiplications. (The "result" rtable in a BLAS Matrix-Matrix multiplication call can also be re-used, I think.) I'll give it a whirl. Food for thought: routine LinearAlgebra:-LA_Main:-LA_External:-MatrixMatrixMultiply acts in-place on 3rd rtable argument C, filling it with the result. I'll try to use that, as well as a hacked up version of LinearAlgebra:-LA_Main:-`MatrixExponential/complexfloat` which doesn't do its own Matrix allocations. BTW, are you using Linux/UNIX/OSX, or Windows? Some raw BLAS stuff is easier to do outside of Windows. It may not come to that. acer
"Just curious... do the lambda[used_lambdas[k]] Matrices, as produced by Hermitian_basis(), happen to commute?" Ahh, I see now that this has been covered before. I never did put the internal float Matrix-exponential routine through the wringer. I think that there's a lot of temp Matrix garbage collection going on, for both that and the multiplications. (The "result" rtable in a BLAS Matrix-Matrix multiplication call can also be re-used, I think.) I'll give it a whirl. Food for thought: routine LinearAlgebra:-LA_Main:-LA_External:-MatrixMatrixMultiply acts in-place on 3rd rtable argument C, filling it with the result. I'll try to use that, as well as a hacked up version of LinearAlgebra:-LA_Main:-`MatrixExponential/complexfloat` which doesn't do its own Matrix allocations. BTW, are you using Linux/UNIX/OSX, or Windows? Some raw BLAS stuff is easier to do outside of Windows. It may not come to that. acer