Fast Tensor Rotation With Numpy
Solution 1:
To use tensordot, compute the outer product of the g tensors:
import numpy as np

def rotT(T, g):
    gg = np.outer(g, g)
    gggg = np.outer(gg, gg).reshape(4 * g.shape)
    axes = ((0, 2, 4, 6), (0, 1, 2, 3))
    return np.tensordot(gggg, T, axes)
On my system, this is around seven times faster than Sven's solution. If the g tensor doesn't change often, you can also cache the gggg tensor. If you do that and apply some micro-optimizations (inlining the tensordot code, skipping the checks and the generic-shape handling), you can make it another two times faster; a short usage sketch for the cached variant follows the function:
def rotT(T, gggg):
    return np.dot(gggg.transpose((1, 3, 5, 7, 0, 2, 4, 6)).reshape((81, 81)),
                  T.reshape(81, 1)).reshape((3, 3, 3, 3))
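For the cached variant, the gggg tensor can be precomputed once and reused; a minimal usage sketch (the rotation matrix and test tensor below are illustrative placeholders):

import numpy as np

# Precompute gggg once for a fixed g, then reuse it across calls to the
# micro-optimized rotT above.
g = np.linalg.qr(np.random.rand(3, 3))[0]  # an arbitrary orthogonal matrix
gg = np.outer(g, g)
gggg = np.outer(gg, gg).reshape(4 * g.shape)

T = np.random.rand(3, 3, 3, 3)
Tprime = rotT(T, gggg)  # per call: one transpose/reshape and one matrix product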
Results of timeit on my home laptop (500 iterations):

Your original code:  19.471129179
Sven's code:          0.718412876129
My first code:        0.118047952652
My second code:       0.0690279006958
The numbers on my work machine are:

Your original code:   9.77922987938
Sven's code:          0.137110948563
My first code:        0.0569641590118
My second code:       0.0308079719543
Solution 2:
Here is how to do it with a single Python loop:
import numpy as np

def rotT(T, g):
    Tprime = T
    for i in range(4):
        # Index g so its two axes sit at positions i and i+4 of an 8-d
        # view; the None entries insert broadcastable length-1 axes.
        slices = [None] * 4
        slices[i] = slice(None)
        slices *= 2
        # The .T reverses the axis order so each g factor lines up with
        # the right pair of axes of Tprime.
        Tprime = g[tuple(slices)].T * Tprime
    # Contract the four summation axes all at once.
    return Tprime.sum(-1).sum(-1).sum(-1).sum(-1)
Admittedly, this is a bit hard to grasp at first glance, but it's quite a bit faster :)
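One way to unpack it: each pass multiplies one g factor into an eight-dimensional broadcast product, and the final sums then perform the whole contraction at once. The same idea can be written more explicitly with tensordot, contracting one axis per step (a sketch for clarity, not the code above):

import numpy as np

def rotT_stepwise(T, g):
    Tprime = T
    for _ in range(4):
        # Contract g's first (summed) axis with T's last axis; the rotated
        # axis lands in front, so four passes restore the original order.
        Tprime = np.tensordot(g, Tprime, axes=([0], [3]))
    return Tprime

Both versions should give the same result; the broadcasting variant trades the per-step contractions for a single large elementwise product.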
Solution 3:
Thanks to hard work by M. Wiebe, the next version of Numpy (which will probably be 1.6) is going to make this even easier:
>>> Trot = np.einsum('ai,bj,ck,dl,abcd->ijkl', g, g, g, g, T)
Philipp's approach is at the moment 3x faster, though there may be room for improvement. The speed difference is probably mostly due to tensordot being able to unroll the whole operation into a single matrix product that can be passed on to BLAS, thereby avoiding much of the overhead associated with small arrays. This is not possible for general Einstein summation, as not every operation expressible in that form resolves to a single matrix product.
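A quick consistency check (a sketch with illustrative random inputs) shows the einsum expression agrees with the tensordot formulation from Solution 1:

import numpy as np

T = np.random.rand(3, 3, 3, 3)
g = np.linalg.qr(np.random.rand(3, 3))[0]  # an arbitrary orthogonal matrix

gg = np.outer(g, g)
gggg = np.outer(gg, gg).reshape(4 * g.shape)
via_tensordot = np.tensordot(gggg, T, ((0, 2, 4, 6), (0, 1, 2, 3)))
via_einsum = np.einsum('ai,bj,ck,dl,abcd->ijkl', g, g, g, g, T)
assert np.allclose(via_tensordot, via_einsum)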
Solution 4:
Out of curiosity I've compared a Cython implementation of the naive code from the question with the numpy code from @Philipp's answer. The Cython code is four times faster on my machine:
#cython: boundscheck=False, wraparound=False
import numpy as np
cimport numpy as np

def rotT(np.ndarray[np.float64_t, ndim=4] T,
         np.ndarray[np.float64_t, ndim=2] g):
    cdef np.ndarray[np.float64_t, ndim=4] Tprime
    cdef Py_ssize_t i, j, k, l, ii, jj, kk, ll
    cdef np.float64_t gg

    Tprime = np.zeros((3, 3, 3, 3), dtype=T.dtype)
    for i in range(3):
        for j in range(3):
            for k in range(3):
                for l in range(3):
                    for ii in range(3):
                        for jj in range(3):
                            for kk in range(3):
                                for ll in range(3):
                                    gg = g[ii, i] * g[jj, j] * g[kk, k] * g[ll, l]
                                    Tprime[i, j, k, l] = Tprime[i, j, k, l] + \
                                        gg * T[ii, jj, kk, ll]
    return Tprime
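To run this, the extension has to be compiled first. A minimal build sketch, assuming the code above is saved as rot.pyx (the file name is illustrative):

# setup.py -- minimal build script for the Cython extension above.
from setuptools import setup
from Cython.Build import cythonize
import numpy as np

setup(
    ext_modules=cythonize("rot.pyx"),
    include_dirs=[np.get_include()],  # needed for "cimport numpy"
)

Build in place with python setup.py build_ext --inplace, then import it with from rot import rotT.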
Solution 5:
I thought I'd contribute a relatively new data point to these benchmarks, using parakeet, one of the numpy-aware JIT compilers that have sprung up in the past few months. (The other one I'm aware of is numba, but I didn't test it here.) After you make it through the somewhat labyrinthine installation process for LLVM, you can decorate many pure-numpy functions to (often) speed up their performance:
import numpy as np
import parakeet

@parakeet.jit
def rotT(T, g):
    # ...
I only tried applying the JIT to Andrew's code from the original question, but it does pretty well (> 10x speedup) for not having to write any new code whatsoever:
andrew      10 loops,   best of 3: 206 msec per loop
andrew_jit  10 loops,   best of 3: 13.3 msec per loop
sven        100 loops,  best of 3: 2.39 msec per loop
philipp     1000 loops, best of 3: 0.879 msec per loop
For these timings (on my laptop) I ran each function ten times, to give the JIT a chance to identify and optimize the hot code paths.
Interestingly, Sven's and Philipp's suggestions are still substantially faster, even compared to the JIT-compiled version!
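The warm-up-then-measure pattern described above might look like the following sketch (rotT is the decorated function from above; the data is illustrative):

import timeit
import numpy as np

T = np.random.rand(3, 3, 3, 3)
g = np.linalg.qr(np.random.rand(3, 3))[0]

# Call the JIT-ed function once outside the timed region so compilation
# and specialization happen before measurement.
rotT(T, g)

print(timeit.timeit(lambda: rotT(T, g), number=10))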