acer

MaplePrimes Activity

These are replies submitted by acer
I couldn't get the Document that you referenced above (which does performance comparisons) to run as quickly as its printed timings suggest. I also saw what might be a mistake in that Document. You claim that it's OK to replace evalf[n](Pi) with evalf(Pi) in key spots. But there are no special evaluation rules in action here, and the `convert/binary` (and related) procedures would just see 10 digits of Pi in that case, since Digits was not set at the top level. Are you sure that the methods you compare are really all doing the same thing? (DJ Keenan made some comments about his routines not needing Digits to be set at a higher level in order for a variant of his modifications to respect the optional precision parameter of `convert/binary`. But it must surely still matter to what accuracy the inbound float approximation of Pi is made.)

Perhaps as much accuracy as is provided by evalf[n](Pi) is not required. In order to have `convert/binary` produce n binary digits, something like evalf[trunc(n/log[2](10.0))+2](Pi) might suffice.

Also, it's not reasonable to compare, in the same session, methods which might internally compute evalf of some of the same numbers, since evalf remembers results. More accurate would be to clear evalf's remember tables between tests. Better still is to also measure the increase in bytes alloc, and to place each method in its own testing file.

The following seems to be a big improvement on the first method that I suggested. And it doesn't require edits to any of the existing `convert` routines.
Q := proc(n)
  # The n leading binary digits of Pi, most significant first: scale Pi
  # up to an n-bit integer, convert to base 2 (which yields the least
  # significant digit first), then reverse the digit order.
  op(ListTools:-Reverse(convert(trunc(evalf[trunc(n/log[2](10.0))+2](Pi*2^(n-2))),
                                base,2))):
end proc:
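As a quick sanity check, and as a sample of the timing methodology described above (my own illustration; forget(evalf) clears evalf's remember table between runs):

Q(8);    # should return the leading binary digits of Pi: 1, 1, 0, 0, 1, 0, 0, 1

forget(evalf):
(st,ba) := time(), kernelopts(bytesalloc):
Q(1000000):
print([time()-st, kernelopts(bytesalloc)-ba]);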
If it really is a big improvement, then one might wonder why, exactly. One reason is that Maple can scale a large floating-point number up by a power of 2 pretty quickly, presumably helped by its use of GMP. Another reason is that the irem techniques already used in `convert/base` aren't so bad. It seems to me that they act in the same general way as DJ Keenan's code (but used above with the smallest possible base, and hence no lookup-table benefits). Consider the number
        trunc(evalf[trunc(n/log[2](10.0))+2](Pi*2^(n-2)))
Maybe one could get the address of its DAG, offset by enough to get the address of its data portion (the number itself), and copy it into a hardware-datatype Vector. Entries of that Vector might then be taken and used in conjunction with a hard-coded lookup table. The copying might be done using an external call to a BLAS routine like scopy or dcopy.

To compute comparison timings, I grabbed the modified `convert/binary` routines directly from the original worksheet. Those do indeed provide a big speedup over the regular Maple 11 `convert/binary` routines. But for your particular task, it looks to me as if the simple routine Q above (which uses different, unchanged internal routines like `convert/base`) is still about 10 times faster and allocates about 10 times less memory when n=1000000. I'm just wondering whether I compared Q against the most efficient alternative provided by any incarnation of `convert/binary` and friends. acer
Ah, yes, sorry about that.
split := proc(z::polynom(anything,[x,y]))
  local C,t;
  # C is the list of coefficients; t gets the corresponding monomials.
  C:=[coeffs(z,[x,y],'t')];
  add(C[i]*INT(degree(t[i],x),degree(t[i],y)),i=1..nops(C));
end proc:
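Checking it against the earlier example (the expected output below is hand-derived, up to term ordering):

z := a + b/c*x + d/e*x*y^3 + f*y^5:
split(z);
#        a*INT(0,0) + b/c*INT(1,0) + d/e*INT(1,3) + f*INT(0,5)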
acer
Just replace the seq() with add(). acer
In your second try, when you call dzurflu(720), you're forgetting that the new unapplied function is actually named dzurflu14. Calling dzurflu14(...) repeatedly should work fine. I still think that using D for this is more straightforward. See the above reply, or the ?D help-page. acer
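For what it's worth, here is a minimal illustration of the D approach, using a made-up operator since dzurflu's actual definition isn't shown here:

f := x -> sin(x^2):
df := D(f):    # df is itself an operator, here x -> 2*x*cos(x^2)
df(720);       # the derivative evaluated at 720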
I wasn't completely sure what you were after. It doesn't seem to have anything especially to do with integrals. It looks more like a task of getting coefficients and replacing terms by some function call.
split := proc(z::polynom(anything,[x,y]))
  local C,t;
  C:=coeffs(z,[x,y],'t');
  seq(C[i]*INT(degree(t[i],x),degree(t[i],y)),i=1..nops([C]));
end proc:

z := a + b/c*x + d/e*x*y^3 + f*y^5:
split(z);

z := c1 + c2*x + c3*y + c4*x*y:
split(z);
acer
I think that there's probably more that can be done. The ArrayTools:-Copy calls might all be replaced by the lowest-level external calls to BLAS (just like those others that appear within matexp). There is a bit of non-external work done in matexp, most of it to do with size n, Digits, and routine selection according to UseHardwareFloats. It could be set up that a new routine, matexp_generator(), emits a version of matexp in which almost all of that work is hard-coded. (LinearAlgebra could do this, if someone went to the trouble.) This new routine could use a remember table so that it only had to generate one matexp() version per combination of n and Digits. The hardware external functions could be left in, and the software external functions removed, for your example. The matexp() version could be emitted by doing a subs on a template procedure -- a minor variant of the existing matexp() procedure. It'd make a decent blog post, if it worked well.

There may be savings in the other parts of Parametrize_SU_Euler(), but for that I'd need more of the underlying code. I'll send a private message with my email address, and you can decide whether that'd be OK.

At some point, when size n gets large, the cost of the external hardware computations involved with the exponential will dominate. Comparison with Matlab or pure C would, at that point, depend on the algorithm used. There are several poor ways to compute it. It'd take a little research to figure out the state of the art for it. There are a few obvious places to start, such as this paper co-authored by Cleve Moler. acer
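A rough sketch of the generator idea; matexp_template and the placeholder names _N and _DIGITS are my own inventions here, not existing library code:

# matexp_template would be a variant of matexp with _N and _DIGITS
# appearing wherever n and Digits are used.
matexp_generator := proc(n::posint, dgts::posint)
  option remember;  # cache one emitted routine per n/Digits combination
  subs(['_N' = n, '_DIGITS' = dgts], eval(matexp_template));
end proc: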
Yes, I'm sure that one of the Maple plot devices is cps (not eps). The c in cps stands for color. But on Linux I get color encapsulated PostScript (.eps) when using the right-click context menu on a plot in Maple 11. So maybe that works for you too, on Windows or what have you. In that case, you probably don't need to set the plot device to cps, as Export might work fine to produce color .eps. If you look at the .eps file with Ghostview or GSview or similar, does it show in color? If so, then maybe you are OK. Aren't most dvi viewers only capable of displaying in greyscale, even if the document contains embedded color .eps? If you run the .dvi through dvips (or whatever MiKTeX has for that), does the resulting final .ps display or print in color? What if you produce pdf from it -- does that show the color? acer
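If Export doesn't cooperate, the cps device can also be driven programmatically. A minimal sketch, with a placeholder filename:

plotsetup(cps, plotoutput="colorplot.eps"):
plot(sin(x), x=0..2*Pi, colour=red);
plotsetup(default):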
# matexp() is a revision of `MatrixExponential/complexfloat`
# which accepts all its Matrix and Vector workspaces as
# additional arguments.
# The input should be in A, and results appear in RESMAT.
# It doesn't matter whether all others are zeroed or not.
matexp := proc(A, a, temp1, temp2, s, workuaf, onevec, RESMAT)
local t, i, bk, p, z, n, m, j, k, l, M, N, tmax, tmin,
extlib, EXTCall, ii, jj, high, DatatypeA, OrderA, StorageA, ExtAdd,
ExtNorm, IndFnsA, small, OrigDigits, sol, ExtRealNorm;
option `Copyright (c) Maplesoft, a division of Waterloo Maple Inc. 2004`;
    m, n := op(1, A);
    if m <> n then error "Expecting a square Matrix" end if;
    if UseHardwareFloats = true then
        extlib :=
            ExternalCalling:-ExternalLibraryName("linalg", 'HWFloat');
        EXTCall := ExternalCalling:-DefineExternal('hw_f06zaf', extlib);
        ExtNorm := ExternalCalling:-DefineExternal('hw_f06uaf', extlib);
        ExtAdd := ExternalCalling:-DefineExternal('hw_f06gcf', extlib);
        ExtRealNorm :=
            ExternalCalling:-DefineExternal('hw_f06raf', extlib)
    elif UseHardwareFloats = false then
        extlib :=
            ExternalCalling:-ExternalLibraryName("linalg", 'SWFloat');
        MPFloat(extlib);
        EXTCall := ExternalCalling:-DefineExternal('sw_f06zaf', extlib);
        ExtNorm := ExternalCalling:-DefineExternal('sw_f06uaf', extlib);
        ExtAdd := ExternalCalling:-DefineExternal('sw_f06gcf', extlib);
        ExtRealNorm :=
            ExternalCalling:-DefineExternal('sw_f06raf', extlib)
    else error
        "expecting UseHardwareFloats to be true or false, yet received %1"
        , UseHardwareFloats
    end if;
    DatatypeA, StorageA, OrderA :=
        rtable_options(A, 'datatype', 'storage', 'order');
    IndFnsA := [rtable_indfns(A)];
    ArrayTools:-Copy(n*n, A, 0, 1, a, 0, 1);
    M := ExtNorm("I", n, n, a, n, workuaf);
    if M < 1 then N := 0 else N := trunc(evalf(ln(M)/ln(2) + 1, 9)) end if;
    if UseHardwareFloats = true then OrigDigits := trunc(evalhf(Digits))
    else OrigDigits := Digits
    end if;
    Digits := Digits + length(Digits + 9) + length(n) + N + 1;
    small := Float(1.0, -Digits);
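    # Scaling-and-squaring: a is scaled by 2^(-N) below, the Taylor series
    # for exp is accumulated in s, and the result is then repeatedly
    # squared near the end of the procedure.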
    M := 0.5^N;
    LinearAlgebra:-LA_Main:-MatrixScalarMultiply(a, M, 'inplace' = 'true',
        'outputoptions' = []);
    ArrayTools:-Copy(n*n, a, 0, 1, s, 0, 1);
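    # ExtAdd (f06gcf) performs an axpy; the stride n+1 into s walks its
    # diagonal, so this adds 1.0 to each diagonal entry: s becomes I + a.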
    ExtAdd(n, 1.0, onevec, 0, 1, s, 0, n + 1);
    ArrayTools:-Copy(n*n, a, 0, 1, temp2, 0, 1);
    for i from 2 do
        if irem(i, 2) = 0 then
            EXTCall("N", "N", n, n, n, 1.0/i, temp2, n, a, n, 0., temp1, n)
                ;
            high := ExtRealNorm("M", 2*n, n, temp1, n, workuaf);
            tmax := high;
            if tmax < small then break
            else ExtAdd(n*n, 1.0, temp1, 0, 1, s, 0, 1)
            end if
        else
            EXTCall("N", "N", n, n, n, 1.0/i, temp1, n, a, n, 0., temp2, n)
                ;
            high := ExtRealNorm("M", 2*n, n, temp2, n, workuaf);
            tmax := high;
            if tmax < small then break
            else ExtAdd(n*n, 1.0, temp2, 0, 1, s, 0, 1)
            end if
        end if
    end do;
    if N = 0 then sol := s
    else
        EXTCall("N", "N", n, n, n, 1.0, s, n, s, n, 0., temp1, n);
        if N = 1 then sol := temp1
        else
            for i from 2 to N + 1 do
                if irem(i, 2) = 0 then EXTCall("N", "N", n, n, n, 1.0,
                    temp1, n, temp1, n, 0., temp2, n)
                else EXTCall("N", "N", n, n, n, 1.0, temp2, n, temp2, n,
                    0., temp1, n)
                end if
            end do;
            if irem(i, 2) = 1 then sol := temp1 else sol := temp2 end if
        end if
    end if;
    ArrayTools:-Copy(n*n,sol,0,1,RESMAT,0,1);
    NULL;
end proc:
                                                                                                                                           
# I didn't check whether things work for software floats.
UseHardwareFloats:=true:
Digits:=trunc(evalhf(Digits));
# I didn't experiment with gcfreq.
#kernelopts(gcfreq=10^7):

# Size
n:=8;
                                                                                                                                           
# number of times through the loop.
N:=1000;
                                                                                                                                           
# Just using lambda[1] and alpha[1] repeatedly, instead of lambda[i],alpha[i].
lambda[1]:=LinearAlgebra:-RandomMatrix(n,n,generator=-0.01..0.01,
                                       outputoptions=[datatype=complex[8]]):
alpha[1]:=3.0:
                                                                                                                                           
# Uncomment one of the following lines, separately, for the most accurate
# measurement. Selecting both true misrepresents the bytesalloc increase
# for whichever method runs second. Select both true to compare the results.
#oldmethod,newmethod := true,false;
#oldmethod,newmethod := false,true;
oldmethod,newmethod := true,true;
                                                                                                                                           
gc():
if newmethod=true then
(st,ba,bu):=time(),kernelopts(bytesalloc),kernelopts(bytesused):
                                                                                                                                           
# Allocate all the rtables
a:=Matrix(n,n,datatype='complex'('float')):
temp1:=Matrix(n,n,datatype='complex'('float')):
temp2:=Matrix(n,n,datatype='complex'('float')):
s:=Matrix(n,n,datatype='complex'('float')):
RESMAT:=Matrix(n,n,datatype='complex'('float')):
workuaf:=Vector[row](n,'datatype'='float'):
onevec:=Vector(n,'fill'=1.0,'datatype'='complex'('float')):
PROD:=Matrix(n,n,datatype='complex'('float')):
U:=Matrix(n,n,datatype='complex'('float')):
temp := Matrix(n,n,datatype=complex[8]):
                                                                                                                                           
kernelopts(opaquemodules=false):
mmm:=LinearAlgebra:-LA_Main:-LA_External:-MatrixMatrixMultiply:
                                                                                                                                           
for i from 1 to n do U[i,i]:=1.0: od:
for i from 1 to N do
ArrayTools:-Copy(n*n,lambda[1],0,1,temp,0,1):
LinearAlgebra:-LA_Main:-MatrixScalarMultiply(
   temp, I*alpha[1], inplace=true,
   outputoptions=[datatype=complex[8]]):
matexp(temp,a,temp1,temp2,s,workuaf,onevec,RESMAT);
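# As noted in the accompanying reply, zgemm acts in-place on the third
# argument here: PROD <- U . RESMAT, with no new container allocated.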
mmm(U,RESMAT,PROD,'inplace'=false):
ArrayTools:-Copy(n*n,PROD,0,1,U,0,1):
od:
                                                                                                                                           
print([time()-st,kernelopts(bytesalloc)-ba,kernelopts(bytesused)-bu]);
U1 := copy(U):
end if:
                                                                                                                                           
gc():
if oldmethod=true then
(st,ba,bu):=time(),kernelopts(bytesalloc),kernelopts(bytesused):
                                                                                                                                           
U := LinearAlgebra:-IdentityMatrix(n,n,compact=false,
                                    outputoptions=[datatype=complex[8]]):
for i from 1 to N do
Temp := LinearAlgebra:-LA_Main:-MatrixScalarMultiply(
           lambda[1], I*alpha[1],
           inplace=false, outputoptions=[datatype=complex[8]]):
U := LinearAlgebra:-LA_Main:-MatrixMatrixMultiply(
        U, LinearAlgebra:-LA_Main:-MatrixFunction(
           Temp, exp(dummy), dummy, outputoptions=[datatype=complex[8]]),
        inplace=false, outputoptions=[]):
od:
                                                                                                                                           
print([time()-st,kernelopts(bytesalloc)-ba,kernelopts(bytesused)-bu]);
U2 := copy(U):
end if:
                                                                                                                                           
if oldmethod=true and newmethod=true then
LinearAlgebra:-Norm(U1-U2);
end if;
                                                                                                                                           

acer
Some things about Maple are learned by experience. Finding out how the Library routines work, internally, can require digging through them. If one looks at one routine, then it can show calls to others, and so one can build up a broader picture.

One thing that I've never done, but think should usually be possible, is to dump out all the routines in a package and then reconstruct it as source. It would be interesting to hear views on this, as it could lead to dispersal of code edits. Of course, copyright and the need for Maplesoft's permission would come into that. But it is one possible avenue to partial open sourcing.

I usually look at Library routines like this:

kernelopts(opaquemodules=false);
interface(verboseproc=3);
eval( .... );

One reason for doing it with eval() instead of showstat is that, very often, the routine can be dumped to a file and modified. (Using writeto("filename"), then eval, then writeto(terminal) again.)

Looking at your example has been very instructive for me. I've realized and confirmed some thoughts I've had for a while, and got some new ones too. To start, I looked at the following routine, because your profiling indicated it as key.

LinearAlgebra:-LA_Main:-`MatrixExponential/complexfloat`

That routine does a lot of Matrix-Matrix multiplications. But considerable care is taken to re-use workspace Matrices. Here's a reason why. Maple's memory management is not very good at reclaiming the large contiguous sections of memory that get allocated for hardware-datatype rtables (Matrices, Vectors, and Arrays). It was likely not originally designed for that, those objects having been introduced in Maple 6. So here is a key rule: avoid creating new rtables.

Now, there is a routine called ArrayTools:-Copy which acts externally to copy the contents of one hardware rtable to another. It does its work using compiled C. It is so fast that one can often copy data around at low cost. If the bulk of a computation is O(n^3), like Matrix-Matrix multiplication, then the O(n^2) copying is a secondary concern. And so that is why `MatrixExponential/complexfloat` uses just a few workspaces, and re-uses them for a whole lot of multiplications. The upshot is that single calls to this routine are pretty efficient, but repeated calls to it entail repeated recreation of its own workspace Matrices.

Let's look a little closer at Matrix-Matrix multiplication. It uses a BLAS function, dgemm or zgemm, for real or complex double-precision floats. It acts like this, in-place on array C:
         alpha*A*B + beta*C  ->  C
So, the container C which will hold the result must be allocated up front. In practice, one can create a handful of Matrices, just once, and juggle them to do a series of operations that might "normally" be expected to use many more new Matrix objects. For example, the result in C can be copied into A and the operation repeated (with beta=0) to get a powering effect. You can compute a high power, but with only three Matrices allocated.

This leads to considering another internal routine, which is

LinearAlgebra:-LA_Main:-LA_External:-MatrixMatrixMultiply

You don't want to create a new rtable container for each result of an operation, if it can be avoided. Now, this routine takes four arguments, (A, B, C, ip). The last argument, of the form 'inplace'=..., specifies whether to act in-place on A. Experience shows that doing it in-place on A is not efficient. (It forces the usual row-column products, while dgemm can use Strassen or what have you.) But with inplace=false this routine acts in-place on Matrix C. It follows the BLAS behaviour for zgemm as shown above, but without alpha and beta exposed.

So here's what I've done with your example. I've used that internal MatrixMatrixMultiply routine and shuffled contents around in a re-usable C. And I dumped out `MatrixExponential/complexfloat` and edited it to accept all its workspaces as additional arguments. And the scaling of the lambda[i] can be done in-place, or in-place on a re-usable temp Matrix. In this way, all the Matrices are created once, outside of the final expensive loop. Despite the extra copying of contents from Matrix to Matrix, this gets about a three-times speedup at size 8x8. I'll post the code in a new message, but I want to say a little more here.

An important thing here is that the sizes of the hardware Matrices are all the same. Examination of the BLAS shows that they accept integers to specify the "leading dimension" of an array. For "Fortran order", this means the number of rows, since data is stored contiguously by column. So in theory you could use a much larger array, and have the data packed in the upper-left corner. So a really large temp array could serve for problems of all sorts of sizes.

I don't know how Matlab works, but one alternative to repeated allocation/collection of workspaces is making use of registered permanent workspace arrays. Of course, eventually one's work requires so much that the memory management has to become more sophisticated. But when the task is set and known in advance, workspace management by hand can win out. Also, parallelizing the task makes such by-hand memory management even more onerous. acer
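To make the powering trick concrete, here is a minimal sketch (my own illustration, using the same internal routine and copy pattern as the code posted above):

kernelopts(opaquemodules=false):
mmm := LinearAlgebra:-LA_Main:-LA_External:-MatrixMatrixMultiply:
n := 4:
A := LinearAlgebra:-RandomMatrix(n, n, outputoptions=[datatype=float[8]]):
C := Matrix(n, n, datatype=float[8]):
mmm(A, A, C, 'inplace'=false):            # C <- A.A, in-place on C
ArrayTools:-Copy(n*n, C, 0, 1, A, 0, 1):  # A <- C
mmm(A, A, C, 'inplace'=false):            # C now holds the 4th power of the original A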
I think that the timing for Parametrize_SU_Euler can be improved to roughly 35% (of the timing of your version posted in this thread) for size 8x8, and to about 85% for size 64x64. That's by focusing on only the last loop. Your profiling indicates that that's where the expense lies. But before I post code, I have a question. Are the lambda[i] Matrices already of datatype=float[8], or complex[8]? (Timing comparisons on 64bit Linux, Maple 11.) I am curious about how fast Matlab is for 10000 8x8 or 1000 64x64 double precision complex matrix exponentials. acer