Fast Tensor Rotation With Numpy
Solution 1:
To use tensordot, compute the outer product of the g tensors:
import numpy as np

def rotT(T, g):
    gg = np.outer(g, g)
    gggg = np.outer(gg, gg).reshape(4 * g.shape)
    axes = ((0, 2, 4, 6), (0, 1, 2, 3))
    return np.tensordot(gggg, T, axes)
On my system, this is around seven times faster than Sven's solution. If the g tensor doesn't change often, you can also cache the gggg tensor. If you do that and apply some micro-optimizations (inlining the tensordot code, skipping the checks and the generic-shape handling), you can make it another two times faster; a short usage sketch for the cached variant follows the function:
def rotT(T, gggg):
    return np.dot(gggg.transpose((1, 3, 5, 7, 0, 2, 4, 6)).reshape((81, 81)),
                  T.reshape(81, 1)).reshape((3, 3, 3, 3))
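For the cached variant, the gggg tensor can be precomputed once and reused; a minimal usage sketch (the rotation matrix and test tensor below are illustrative placeholders):

import numpy as np

# Precompute gggg once for a fixed g, then reuse it across calls to the
# micro-optimized rotT above.
g = np.linalg.qr(np.random.rand(3, 3))[0]  # an arbitrary orthogonal matrix
gg = np.outer(g, g)
gggg = np.outer(gg, gg).reshape(4 * g.shape)

T = np.random.rand(3, 3, 3, 3)
Tprime = rotT(T, gggg)  # per call: one transpose/reshape and one matrix product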
Results of timeit on my home laptop (500 iterations):

Your original code:  19.471129179
Sven's code:          0.718412876129
My first code:        0.118047952652
My second code:       0.0690279006958
The numbers on my work machine are:

Your original code:   9.77922987938
Sven's code:          0.137110948563
My first code:        0.0569641590118
My second code:       0.0308079719543
Solution 2:
Here is how to do it with a single Python loop:
import numpy as np

def rotT(T, g):
    Tprime = T
    for i in range(4):
        # Index g so its two axes sit at positions i and i+4 of an 8-d
        # view; the None entries insert broadcastable length-1 axes.
        slices = [None] * 4
        slices[i] = slice(None)
        slices *= 2
        # The .T reverses the axis order so each g factor lines up with
        # the right pair of axes of Tprime.
        Tprime = g[tuple(slices)].T * Tprime
    # Contract the four summation axes all at once.
    return Tprime.sum(-1).sum(-1).sum(-1).sum(-1)
Admittedly, this is a bit hard to grasp at first glance, but it's quite a bit faster :)
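One way to unpack it: each pass multiplies one g factor into an eight-dimensional broadcast product, and the final sums then perform the whole contraction at once. The same idea can be written more explicitly with tensordot, contracting one axis per step (a sketch for clarity, not the code above):

import numpy as np

def rotT_stepwise(T, g):
    Tprime = T
    for _ in range(4):
        # Contract g's first (summed) axis with T's last axis; the rotated
        # axis lands in front, so four passes restore the original order.
        Tprime = np.tensordot(g, Tprime, axes=([0], [3]))
    return Tprime

Both versions should give the same result; the broadcasting variant trades the per-step contractions for a single large elementwise product.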
Solution 3:
Thanks to hard work by M. Wiebe, the next version of Numpy (which will probably be 1.6) is going to make this even easier:
>>> Trot = np.einsum('ai,bj,ck,dl,abcd->ijkl', g, g, g, g, T)
Philipp's approach is at the moment 3x faster, though there may be room for improvement. The speed difference is probably mostly due to tensordot being able to unroll the whole operation into a single matrix product that can be passed on to BLAS, thereby avoiding much of the overhead associated with small arrays. This is not possible for general Einstein summation, as not every operation expressible in that form resolves to a single matrix product.
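A quick consistency check (a sketch with illustrative random inputs) shows the einsum expression agrees with the tensordot formulation from Solution 1:

import numpy as np

T = np.random.rand(3, 3, 3, 3)
g = np.linalg.qr(np.random.rand(3, 3))[0]  # an arbitrary orthogonal matrix

gg = np.outer(g, g)
gggg = np.outer(gg, gg).reshape(4 * g.shape)
via_tensordot = np.tensordot(gggg, T, ((0, 2, 4, 6), (0, 1, 2, 3)))
via_einsum = np.einsum('ai,bj,ck,dl,abcd->ijkl', g, g, g, g, T)
assert np.allclose(via_tensordot, via_einsum)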
Solution 4:
Out of curiosity I've compared a Cython implementation of the naive code from the question with the numpy code from @Philipp's answer. The Cython code is four times faster on my machine:
#cython: boundscheck=False, wraparound=False
import numpy as np
cimport numpy as np

def rotT(np.ndarray[np.float64_t, ndim=4] T,
         np.ndarray[np.float64_t, ndim=2] g):
    cdef np.ndarray[np.float64_t, ndim=4] Tprime
    cdef Py_ssize_t i, j, k, l, ii, jj, kk, ll
    cdef np.float64_t gg

    Tprime = np.zeros((3, 3, 3, 3), dtype=T.dtype)
    for i in range(3):
        for j in range(3):
            for k in range(3):
                for l in range(3):
                    for ii in range(3):
                        for jj in range(3):
                            for kk in range(3):
                                for ll in range(3):
                                    gg = g[ii, i] * g[jj, j] * g[kk, k] * g[ll, l]
                                    Tprime[i, j, k, l] = Tprime[i, j, k, l] + \
                                        gg * T[ii, jj, kk, ll]
    return Tprime
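To run this, the extension has to be compiled first. A minimal build sketch, assuming the code above is saved as rot.pyx (the file name is illustrative):

# setup.py -- minimal build script for the Cython extension above.
from setuptools import setup
from Cython.Build import cythonize
import numpy as np

setup(
    ext_modules=cythonize("rot.pyx"),
    include_dirs=[np.get_include()],  # needed for "cimport numpy"
)

Build in place with python setup.py build_ext --inplace, then import it with from rot import rotT.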
Solution 5:
I thought I'd contribute a relatively new data point to these benchmarks, using parakeet, one of the numpy-aware JIT compilers that have sprung up in the past few months. (The other one I'm aware of is numba, but I didn't test it here.) After you make it through the somewhat labyrinthine installation process for LLVM, you can decorate many pure-numpy functions to (often) speed up their performance:
import numpy as np
import parakeet

@parakeet.jit
def rotT(T, g):
    # ...
I only tried applying the JIT to Andrew's code from the original question, but it does pretty well (> 10x speedup) for not having to write any new code whatsoever:
andrew      10 loops,   best of 3: 206 msec per loop
andrew_jit  10 loops,   best of 3: 13.3 msec per loop
sven        100 loops,  best of 3: 2.39 msec per loop
philipp     1000 loops, best of 3: 0.879 msec per loop
For these timings (on my laptop) I ran each function ten times, to give the JIT a chance to identify and optimize the hot code paths.
Interestingly, Sven's and Philipp's suggestions are still substantially faster, even compared to the JIT-compiled version!
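The warm-up-then-measure pattern described above might look like the following sketch (rotT is the decorated function from above; the data is illustrative):

import timeit
import numpy as np

T = np.random.rand(3, 3, 3, 3)
g = np.linalg.qr(np.random.rand(3, 3))[0]

# Call the JIT-ed function once outside the timed region so compilation
# and specialization happen before measurement.
rotT(T, g)

print(timeit.timeit(lambda: rotT(T, g), number=10))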