Animation Blend Spaces without Triangulation

19/02/2024

In my previous article I talked about how to compute the weighted average of a set of quaternions in a robust and accurate way. The most obvious application for this in animation and video games is in blending multiple animations together.

Most game engines provide some sort of blending functionality like this for sets of animations. Slerp or nlerp is usually used when blending two animations in a 1D parameter space, while in a 2D parameter space many engines triangulate the data points and then use the barycentric weights to do something like the weighted average of the three animations at each vertex. In a 3D parameter space (or greater) some engines also provide some kind of blend functionally (such as via tetrahedralization) or RBF interpolation, but many engines don't provide anything at all.

Unfortunately, triangulating the space and using the barycentric coordinates as blend weights has a couple of big problems.

The first is that very thin triangles can cause the resulting animation to change extremely quickly as you move through the parameter space, and so you need to be careful to place your animations at locations in the parameter space that do not construct anything degenerative.

Additionally, small tweaks in the parameter space can trigger entirely different triangulations, meaning that small adjustments can completely change how the results look even if they don't introduce degenerate triangulations.

In addition to that, although Delaunay triangulation in 2D is fairly straight forward, triangulation can be a relatively complex and expensive algorithm in higher dimensions, which gets exponentially more computationally expensive the more dimensions you add. And even if you can run it in higher dimensions, beyond three dimensions inspecting the results visually becomes almost impossible, which basically rules it out since you can't easily validate you are not getting thin or degenerate triangles (or in this case, tetrahedrons or simplices).

These two issues combined have an annoying knock-on effect: most of the parameters we wish to blend with respect to in these kind of blend spaces can be extracted automatically from the animation data itself (for example, the character's speed, turning rate, or aiming direction), but doing so often does not work in practice either because we need to place our data points carefully in the parameter space to avoid thin triangles and other degenerate cases in the triangulation, or because such parameters are higher than two dimensions.

For this reason, often it is more practical to build up a kind of higher dimensional blend space out of consecutive 1D blends (which you could see as being like bilinear interpolation). This generally works well if you can do it, but in some cases can require a lot of data as it needs a dense coverage of all the parameters at regular intervals for each parameter.

One of the first animation papers I studied during my PhD uses a method taken from another paper (which is over 20 years old now!) by Peter-Pike Sloan which avoids these issues, and allows you to find per-animation blend weights without relying on triangulation or consecutive 1D blends. This means we can easily have parameter spaces in any dimension we want, with parameters which are automatically extracted from the animation data itself, and we don't need to worry about degenerate triangles or small adjustments to the animation data or parameters causing different behavior.

It works like this: given some desired parameters, represented by a point in the parameter space \( \mathbf{p}_t \), and all the points in our parameter space associated with each of our \( n \) animations \( \mathbf{p}_0,\ \mathbf{p}_1,\ \ldots\ \mathbf{p}_n \), we start by computing the distances (in parameter space) to all these points, and stacking them into a vector:

\begin{align*} \mathbf{d}_t &= \begin{bmatrix} \| \mathbf{p}_t - \mathbf{p}_0 \| \\ \| \mathbf{p}_t - \mathbf{p}_1 \| \\ \vdots \\ \| \mathbf{p}_t - \mathbf{p}_n \| \end{bmatrix} \end{align*}

and then given this vector \( \mathbf{d}_t \) of distances we are going to put it through a matrix \( \mathbf{M} \) which is going to transform it into a vector of weight values \( \mathbf{w} \)...

\begin{align*} \mathbf{w} &= \mathbf{M}\ \mathbf{d} \end{align*}

where each value in the vector is the weight associated with each of our \( n \) animations in our set.

\begin{align*} \mathbf{w}_t &= \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_n \\ \end{bmatrix} \end{align*}

If we can find an appropriate matrix \( \mathbf{M} \) which performs this mapping in the way we want then it will be easy for us to compute the blend weights from the point in the parameter space.

But how do we find such an \( \mathbf{M} \)? Well, what we are going to do is set things up as a linear equation and solve for the \( \mathbf{M} \) which has the behavior we want. To do this we need to stack all of our examples of desired inputs as columns of a matrix to the right of \( \mathbf{M} \), and all the associated examples of desired outputs as columns of a matrix on the left.

So on the right we will have a matrix made up of the distance vectors computed at the parameters associated with each of the animations in the set, stacked into columns (in other words: the pair-wise distance matrix of our animation parameters):

\begin{align*} \mathbf{D} &= \begin{bmatrix} \mathbf{d}_0\ \mathbf{d}_1\ \ldots\ \mathbf{d}_n \end{bmatrix} \end{align*}

And on the left will be the weight vector we want to associate with the parameters for each of the animations in the set, stacked into columns.

\begin{align*} \mathbf{W} &= \begin{bmatrix} \mathbf{w}_0\ \mathbf{w}_1 \ldots\ \mathbf{w}_n \end{bmatrix} \end{align*}

In this case, we will say that we want the weight of the animation for the associated parameters to be one, and the weight of all the other animations to be zero.

\begin{align*} \mathbf{W} &= \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \\ \end{bmatrix} = \mathbf{I} \end{align*}

This gives us the following equation, where \( \mathbf{M} \) is the matrix we want to solve for:

\begin{align*} \mathbf{W} &= \mathbf{M}\ \mathbf{D} \end{align*}

But how do we actually solve this linear system? Algebraically it's simple: we solve this by multiplying by \( \mathbf{D}^{-1} \) on the right:

\begin{align*} \mathbf{W} &= \mathbf{M}\ \mathbf{D} \\ \mathbf{W}\ \mathbf{D}^{-1} &= \mathbf{M}\ \mathbf{D}\ \mathbf{D}^{-1} \\ \mathbf{W}\ \mathbf{D}^{-1} &= \mathbf{M} \end{align*}

But that brings up another question, how do we compute the inverse of \( \mathbf{D} \), or more to the point, how do we compute \( \mathbf{W}\ \mathbf{D}^{-1} \)?

The easiest way to compute this is to pull in some kind of linear algebra library such as Eigen which provides methods for such things out-of-the-box - and this is probably what most game developers do. However that's another dependency, which for a header-only library like Eigen can mean a big hit to compile times, and perhaps another 10MB or more added to the binary if you're not careful.

I know that linear algebra can be a bit scary, but before looking at libraries, I think it's worth considering taking a look at the code listings in books like Numerical Recipes in C. These are really not nearly as scary (or as many lines of code) as you might think. Yes, understanding and transcribing the code is a bit tedious (and annoyingly all the code in the listings uses one-based indexing), but, for example, the code for LU-decomposition (which is one of if not the most efficient method for matrix inversion), can be implement in just two functions and about 100 lines of code.

First, the following function computes a LU decomposition in place, writing it to matrix, and writing a row ordering to row_order and scaling factors to the temporary buffer row_scaling:

bool mat_lu_decompose_inplace(
    slice2d<float> matrix,     // Matrix to decompose in-place
    slice1d<int> row_order,    // Output row order
    slice1d<float> row_scale)  // Temp row scale
{
    assert(matrix.rows == matrix.cols);
    assert(matrix.rows == row_order.size);
    assert(matrix.rows == row_scale.size);
    
    int n = matrix.rows;
    
    // Compute scaling for each row
    
    for (int i = 0; i < n; i++)
    {
        float vmax = 0.0f;
        for (int j = 0; j < n; j++)
        {
            vmax = maxf(vmax, fabs(matrix(i, j)));
        }
        
        if (vmax == 0.0) { return false; }
        
        row_scale(i) = 1.0 / vmax;
    }
    
    // Loop over columns using Crout's method
    
    for (int j = 0; j < n; j++)
    {
        for (int i = 0; i < j; i++)
        {
            float sum = matrix(i, j);
            for (int k = 0; k < i; k++)
            {
                sum -= matrix(i, k) * matrix(k, j);
            }
            matrix(i, j) = sum;
        }
        
        // Search Largest Pivot
        
        float vmax = 0.0f;
        int imax = -1;
        for (int i = j; i < n; i++)
        {
            float sum = matrix(i, j);
            for (int k = 0; k < j; k++)
            {
                sum -= matrix(i, k) * matrix(k, j);
            }
            matrix(i, j) = sum;
            float val = row_scale(i) * fabs(sum);
            if (val >= vmax)
            {
                vmax = val;
                imax = i;
            }
        }
        
        if (vmax == 0.0) { return false; }
        
        // Interchange Rows
        
        if (j != imax)
        {
            for (int k = 0; k < n; k++)
            {
                float val = matrix(imax, k);
                matrix(imax, k) = matrix(j, k);
                matrix(j, k) = val;
            }
            row_scale(imax) = row_scale(j);
        }
        
        // Divide by Pivot
        
        row_order(j) = imax;
        
        if (matrix(j, j) == 0.0) { return false; }

        if (j != n - 1)
        {
            float val = 1.0 / matrix(j, j);
            for (int i = j + 1; i < n; i++)
            {
                matrix(i, j) *= val;
            }
        }
    }
    
    return true;
}

Then, given a LU decomposition, the following function can be used to compute the inverse of the decomposed matrix multiplied by a vector vector. It writes the result back into vector in-place.

void mat_lu_solve_inplace(
    slice1d<float> vector,
    const slice2d<float> decomp, 
    const slice1d<int> row_order)
{
    assert(decomp.rows == decomp.cols);
    assert(decomp.rows == row_order.size);
    assert(decomp.rows == vector.size);
    
    int n = decomp.rows;
    int ii = -1;
    
    // Forward Substitution
    
    for (int i = 0; i < n; i++)
    {
        float sum = vector(row_order(i));
        vector(row_order(i)) = vector(i);
        
        if (ii != -1)
        {
            for (int j = ii; j <= i-1; j++)
            {
                sum -= decomp(i, j) * vector(j);
            }
        }
        else if (sum)
        {
            ii = i;
        }
        
        vector(i) = sum;
    }
    
    // Backward Substitution
    
    for (int i = n - 1; i >= 0; i--)
    {
        float sum = vector(i);
        for (int j = i + 1; j < n; j++)
        {
            sum -= decomp(i, j) * vector(j);
        }
        vector(i) = sum / decomp(i, i);
    }
}

This is really what we want to do - we want to compute \( \mathbf{W}\ \mathbf{D}^{-1} \), not \( \mathbf{D}^{-1} \). This we can do by plugging columns of \( \mathbf{W} \) into vector in the above.

So to summarize: to get the matrix \( \mathbf{M} = \mathbf{W}\ \mathbf{D}^{-1} \), we need to apply LU decomposition to \( \mathbf{D} \), and then solve for each column of \( \mathbf{W} \) using the above method (which in this special case will just be columns of the identity matrix):

void fit_blend_matrix(
    slice2d<float> blend_matrix,
    const slice2d<float> animation_parameters)
{    
    int nanims = animation_parameters.rows;
    int nparams = animation_parameters.cols;
    
    // Compute Pairwise Distances

    array2d<float> distances(nanims, nanims);
    
    for (int i = 0; i < nanims; i++)
    {
        for (int j = 0; j < nanims; j++)
        {
            distances(i, j) = 0.0f;
            for (int k = 0; k < nparams; k++)
            {
                distances(i, j) += squaref(
                    animation_parameters(i, k) - 
                    animation_parameters(j, k));
            }
            distances(i, j) = sqrtf(distances(i, j));
        }
    }
    
    // Subtract epsilon from diagonal this helps the stability
    // of the decomposition and solve
    
    for (int i = 0; i < nanims; i++)
    {
        distances(i, i) -= 1e-4;
    }
    
    // Decompose in place
    
    array1d<int> row_order(nanims);
    array1d<float> row_scale(nanims);
    
    bool success = mat_lu_decompose_inplace(distances, row_order, row_scale);
    assert(success);
    
    // Write associated blend weights into blend matrix
    
    for (int i = 0; i < nanims; i++)
    {
        for (int j = 0; j < nanims; j++)
        {
            blend_matrix(i, j) = i == j ? 1.0f : 0.0f;
        }
    }
    
    // Solve for blend matrix in-place
    
    for (int i = 0; i < nanims; i++)
    {
        mat_lu_solve_inplace(blend_matrix(i), distances, row_order);
    }
}

(Note: Since \( \mathbf{W} \) is symmetric the code here actually solves for rows instead of columns, but the result is the same.)

Here, the output blend_matrix will contain our matrix \( \mathbf{M} \), animation_parameters is our matrix \( \mathbf{P} \), distances is our distance matrix \( \mathbf{D} \), and blend_matrix is also where we write in our weights matrix \( \mathbf{W} \) since this is solved in-place.

Now that we've computed \( \mathbf{M} \), given a new point in the parameter space \( \mathbf{p}_t \), we simply compute the distances to all the other points in the parameter space \( \mathbf{d}_t \), and multiply this vector of distances by \( \mathbf{M} \). This gives us a vector of weights \( \mathbf{w}_t \) to use in our weighted blend of animation data.

void compute_query_distances(
    slice1d<float> query_distances, 
    const slice2d<float> animation_parameters, 
    const slice1d<float> query_parameters)
{
    assert(query_distances.size == animation_parameters.rows);
  
    for (int i = 0; i < animation_parameters.rows; i++)
    {
        query_distances(i) = 0.0f;
        for (int j = 0; j < animation_parameters.cols; j++)
        {
            query_distances(i) += squaref(
                query_parameters(j) - animation_parameters(i, j));
        }
        query_distances(i) = sqrtf(query_distances(i));
    }
}

void compute_blend_weights(
    slice1d<float> blend_weights,
    const slice1d<float> query_distances,
    const slice2d<float> blend_matrix)
{
    mat_mul_vec(blend_weights, blend_matrix, query_distances);
}

We will want to set any weights that are less than zero back to zero, and normalize the result so that all the weights sum to one. This ensures we always have a convex combination of weights for our interpolation.

void clamp_normalize_blend_weights(slice1d<float> blend_weights)
{
    float total_blend_weight = 0.0f;
    for (int i = 0; i < blend_weights.size; i++)
    {
        blend_weights(i) = maxf(blend_weights(i), 0.0f);
        total_blend_weight += blend_weights(i);
    }
    
    for (int i = 0; i < blend_weights.size; i++)
    {
        blend_weights(i) /= total_blend_weight;
    }
}

Here is a little comparison between triangulation and this method (called "interpolation" in the video). Here we use a 2D parameter space where one axis is the animation's average speed and the other is its average turning rate.

We can get a better idea of the difference by visualizing the blend weight associated with each animation in the space. Here you can see that this method produces blend weights which change much more smoothly, without the artifacts you get from thin triangles and barycentric interpolation.

Because this method is based on linear algebra, it works almost exactly the same no matter how many dimensions the parameter space has. Here is another example where I use sample points from a trajectory of future positions and directions as the parameters (like you might use in Motion Matching). Given the 2D location and direction of three future points, this creates a 12-dimensional parameter space. This would be totally impossible with triangulation, but works fine with the proposed method (although we cannot visualize the space so easily).

Assuming you don't have very many animations in your space, computing \( \mathbf{M} \) is actually pretty fast, meaning it can potentially be done every frame. That allows for dynamic blend spaces where the parameters change according to the state of the animations in the set.

For example, here, instead of the parameter of the blend being the "average root velocity of the whole animation", I instead set it to the root velocity of the currently playing frame of each animation in the set.

This works for the trajectory-based parameterization too.

While re-triangulation is not expensive either and so could be done every frame if required, because it isn't stable, the results are jumpy:

Computationally this method can be more expensive than triangulation, simply because we are not limited to always blending three animations. But this is somewhat the price you have to pay for having that smoother interpolation.

There is one big potential problem with this method, which is revealed if we plot the point in the parameter space found by taking our predicted weights and doing the weighted sum of the parameters themselves (here shown in blue):

We can see that unlike with triangulation the weights produced by this method, when used as a weighted sum of our parameters, don't actually map exactly back to the point we asked for. And this reveals some other potential issues with it. When we move our target parameters far outside of where we have data we can see that the extrapolation behavior tends to spread out the weighting evenly across all animations, pulling the resulting animation toward the average.

We get a similar thing for the trajectory based parameterization:

Here the target trajectory is in green, while the actual trajectory we get (when using the predicted blend weights) is in blue.

This effect is both bad and good - good because it means this method extrapolates "safely" - it just goes toward the average when we ask for something far away from any of our data - bad because it introduces some pretty counter-intuitive effects. For example, if we ask for a speed faster than our fastest animation, it is likely we will actually get an animation which is slower than our fastest animation, as it will target something closer to the average.

One simple thing you can do (in 2D at least) is to use the triangulation to project onto the convex hull before computing the weights. This removes some of the weird behavior in terms of extrapolation but does not solve the issue of parameters not matching within the data distribution and does not scale to higher dimensions.

Now... trust me when I say that I spent a really long time trying to come up with a simple solution to this issue which did not somehow involve Neural Networks, but unfortunately at some point I had to concede.

The thing is, for better or worse, this problem is actually a pretty good fit for Neural Networks because we have a function we want to fit (the function mapping from from parameters to weights), which we need to be fast to evaluate, and we want to make subject to a number of relatively simple constraints:

For each animation in our set, if we input its parameters, the output weights must be one for that animation and zero everywhere else.
For any given input, the sum of the output weights must equal one.
For any given input, the output weights must be positive.
For any given input, doing the weighted sum of the animation parameters using the output weights must produce something close to the original input.

If we code up these constraints as loss functions, we can quite easily train a very small Neural Network to perform this task in less than 200 lines of PyTorch. The most important bit looks like this (full code is here):

# Sample random points in the parameter space
Prand = (Pmax - Pmin) * torch.rand(batchsize, nparams, dtype=torch.float32) + Pmin

# Compute the distances for those random points
Drand = torch.sqrt(torch.sum(torch.square(P[None,:] - Prand[:,None]), dim=-1))

# Concatenate to the points and distances from our animations
Pbatch = torch.cat([P, Prand], dim=0)
Dbatch = torch.cat([D, Drand], dim=0)

# Compute the weights by going through the neural network
W = (network((Dbatch - network_mean_in) / network_std_in) * 
    network_std_out + network_mean_out)

# Compute the normalized weights (i.e positive and sum to one)
Wnorm = torch.maximum(W, torch.zeros_like(W))
Wnorm = Wnorm / torch.maximum(
    Wnorm.sum(dim=-1)[...,None], 1e-4 * torch.ones_like(Wnorm[:,:1]))

# Loss to make weights at the animations 1 for that anim and 0 elsewhere
loss_data = torch.mean(torch.abs(W[:nanims] - torch.eye(nanims)))

# Loss to make the weights sum to one
loss_sum = 0.01 * torch.mean(torch.abs(W.sum(dim=-1) - 1))

# Loss to make the weights positive
loss_pos = 0.01 * torch.mean(torch.abs(torch.minimum(W, torch.zeros_like(W))))

# Loss to make the input match the output in parameter space
loss_proj = 0.1 * torch.mean(
    torch.sqrt(torch.sum(torch.square((Wnorm @ P) - Pbatch), dim=-1) + 1e-4))

# Add together losses
loss = loss_data + loss_sum + loss_pos + loss_proj

Just like before, the input to this network will be the distances to all the animations in the parameter space, and the output will be the weights. So it's a drop-in replacement to the matrix \( \mathbf{M} \) we used in the previous method.

The result is something that much more closely matches our expectations, and takes just a few microseconds to evaluate:

Interestingly, we can also control how the input projects onto the convex hull by changing the distance function used in constraint 4. In the above example I use the Euclidean distance between the original input and the reconstructed weighted sum of the animation parameters. But if I switch to the Manhattan distance...

loss_proj_l1 = 0.1 * torch.mean(
            torch.mean(torch.abs((Wnorm @ P) - Pbatch), dim=-1))

...we can see it will project in a way which minimizes the Manhattan distance instead:

This could be useful. For example, if we know one of our axes in parameter space represents an angle, we could make sure this axis uses the angular distance to get a projection which "wraps around" onto the other side when required.

This Neural Network based method works identically in higher dimensional parameter spaces, which we can see with our trajectory-based parameterization.

It might look like the result here is not really matching the input, but what I think is happening is actually that in higher dimensions it starts to be the case where it is increasingly unlikely for a point to be completely inside the convex hull. That means that we are almost always projecting onto the convex hull by at least some small amount whatever our input is - which would explain the fact we always have a difference.

The big downside of the Neural Network based method is that in my experiments it takes at least 15 minutes to train, which is kind of a disaster for iteration times, and also means that it is more difficult to get working with dynamic blend spaces where the animation's parameters change on a frame-by-frame basis.

But given the simplicity of the task and the small size of the network, I imagine that with some tweaking it should be possible to get the training time down to under a minute (in which case this might be a more viable option), but if anyone knows a better solution to this problem which does not require as much pre-computation I would love to hear what it is!

I think it's also worth noting that everything I've discussed today doesn't really have anything to do with animation! These are just ways of getting a set of blend weights when you have some data points in some parameter space. What you actually blend with those weights is entirely up to you!

You can find the full source code for this article here - and there is a web-demo you can play with here.

So that's everything I have for you today. I hope you found this interesting and maybe it stirred up some other ideas for how we might do animation blending and implement blend spaces in new ways. And as usual, thanks for reading.