## Abstract

In the last two decades, increased need for high-fidelity simulations of the time evolution and propagation of forces in granular media has spurred a renewed interest in the discrete element method (DEM) modeling of frictional contact. Force penalty methods, while economic and widely accessible, introduce artificial stiffness, requiring small time steps to retain numerical stability. Optimization-based methods, which enforce contacts geometrically through complementarity constraints leading to a differential variational inequality problem (DVI), allow for the use of larger time steps at the expense of solving a nonlinear complementarity problem (NCP) each time-step. We review the latest efforts to produce solvers for this NCP, focusing on its relaxation to a cone complementarity problem (CCP) and solution via an equivalent quadratic optimization problem with conic constraints. We distinguish between first-order methods, which use only gradient information and are thus linearly convergent and second-order methods, which rely on a Newton type step to gain quadratic convergence and are typically more robust and problem-independent. However, they require the approximate solution of large sparse linear systems, thus losing their competitive advantages in large scale problems due to computational cost. In this work, we propose a novel acceleration for the solution of Newton step linear systems in second-order methods using low-rank compression based fast direct solvers, leveraging on recent direct solver techniques for structured linear systems arising from differential and integral equations. We employ the quantized tensor train (QTT) decomposition to produce efficient approximate representations of the system matrix and its inverse. This provides a versatile and robust framework to accelerate its solution using this inverse in a direct or a preconditioned iterative method. We demonstrate compressibility of the Newton step matrices in primal dual interior point (PDIP) methods as applied to the multibody dynamics problem. Using a number of numerical tests, we demonstrate that this approach displays sublinear scaling of precomputation costs, may be efficiently updated across Newton iterations as well as across simulation time steps, and leads to a fast, optimal complexity solution of the Newton step. This allows our method to gain an order of magnitude speedups over state-of-the-art preconditioning techniques for moderate to large-scale systems, hence mitigating the computational bottleneck of second-order methods.

## 1 Introduction

The discrete element method (DEM) [1] is one of the most widely used approaches to simulate the multibody dynamics such as in granular materials. This method considers the granular medium as a collection of discrete particles; each responding to body forces such as gravity, inertia or drag, as well as repulsive or dissipative forces caused by contact. Materials in granular form are omnipresent in industry and understanding their dynamics is crucial for a broad range of application fields, including terramechanics, active media, additive manufacturing, nanoparticle self-assembly, avalanche dynamics, composite materials, pyroclastic flows, etc.

Distinct instances of the DEM are defined essentially by their modeling of contact forces, and thus in how collisions are handled. We distinguish two main types of discrete element methods in the literature: force penalty methods (DEM-P) and complementarity formulations (DEM-C). Penalty methods introduce one or multiple layers of spring-like forces between objects in contact, and may introduce additional fields to represent friction. These methods are widely used due to being computationally inexpensive and easy to implement. However, in many cases they introduce high stiffness, requiring extremely small time steps to retain stability during collisions. An in-depth and informative comparison between the two can be found in Ref. [2].

DEM-C methods such as in Refs. [3] and [4] enforce contacts using complementarity constraints, leading to a differential variational inequality problem upon discretization. This allows for the use of larger time steps in its integration, as contacts are enforced geometrically. Given a time-stepping scheme for this problem, a nonlinear complementarity problem (NCP) must be solved to compute the corresponding contact forces at each time-step. One common way to solve this optimization problem is via a linearization of the constraints, producing a linear complementarity problem [5,6]. An alternative relaxation method produces an equivalent, convex cone complementarity problem (CCP). Efficient quadratic cone programming techniques such as those in Refs. [7–11], and [4] have been proposed to solve the CCP.

In Ref. [12], the authors performed a comparison of these quadratic programming techniques focused on determining which class of methods performed best: second-order optimization methods, i.e., those that use Hessian information (Interior Point), or first-order methods, i.e., methods using only gradient information (Jacobi, Gauss-Seidel, Projected Gradient Descent). Second-order methods display quadratic convergence in a neighborhood of the solution, and thus their convergence is much faster and more problem-independent than that of linearly convergent first-order methods, requiring at least 1 to 2 orders of magnitude less iterations to reach a desired accuracy across experiments in this work. However, each iteration in second-order methods requires the solution of a (generally sparse) linear system, which can become costly as the dimensions of the many-body problem increase. For large problems, the added computational effort ultimately eclipses the gains obtained from the reduction in iteration counts.

The preferred solution method for large sparse systems is often a Krylov subspace iterative method. The iteration counts, and thus the performance of these methods are known to be directly affected by the eigenspectrum of the associated matrices. Preconditioning techniques can be used to cluster the eigenvalues away from zero and drastically reduce iteration counts; we refer the reader to Ref. [13] for a general review. Generic sparse preconditioners are most typically based on incomplete or sparsified factorizations, such as the well-known Incomplete LU and Cholesky methods [14]. In Ref. [12], a fast, parallel SaP (split and paralellize) [15] preconditioner was used to accelerate the primal dual interior point (PDIP) method, garnering reductions in iteration counts and execution times.

Most general-purpose preconditioners suffer a trade-off between precomputation costs and the resulting reduction in iteration counts, and their performance is often problem-dependent. Moreover, in the context of optimization problems such as the ones we are interested in, system matrices change every iteration, and it is often impossible or expensive to update the associated preconditioners. Finally, improving their scaling with the number of degrees-of-freedom and the level of sparsity is extremely important, as it is central to remaining competitive in large-scale problems.

In the last decades, a continued effort has been made to produce *direct* solvers for structured linear systems arising from differential and integral equations. These solvers entirely side-step the challenges related to convergence speed of iterative solvers. They can also lead to dramatic improvements in speed, in particular in situations where a large number of linear systems with coefficient matrices that stay fixed or can be updated via low rank modifications. Additionally, they also provide a methodology to produce robust preconditioners: low accuracy direct solvers may be used in conjunction with iterative refinement or the Krylov subspace method of choice; the use of an approximate low accuracy inverse requires less memory and precomputation time than a direct solver, at the expense of a slight increase in the number of iterations. These features have led to the adoption of these solvers in other areas of scientific computing and statistics.

In Ref. [16], an effective and memory-efficient solver based on the quantized tensor train decomposition (QTT) was presented. By recasting system matrices as tensors, the tensor train (TT) compression and inversion routines were used to produce direct solvers and robust preconditioners for integral equations in complex geometries in three dimensions. Key properties of this solver that differentiate it from other hierarchical matrix approaches feature sublinear computational costs and memory requirements with problem size *N*, as well as techniques to produce economic updates for a matrix and its inverse across time-steps, *even when matrix size changes*.

The main goal of this work is to employ a QTT-based approach to provide a radical speed up to second-order optimization methods. We demonstrate its application to interior point methods such as the PDIP in the context of many-body dynamics problems; however, we expect our discussion to apply with little to no modification to a general class of Newton and quasi-Newton type methods. We first study compressibilty of the system matrices and their inverses in TT format for a range of target accuracies. We then demonstrate how factorization re-use can provide significant speed-ups to precomputation costs, reducing costs by orders of magnitude. For three validation tests of common soil mechanics phenomena—sedimentation, blade drafting and direct shear experiments—we show that a TT-based preconditioner displays efficient and robust performance for problems with > 10^{4} bodies, garnering up to an order of magnitude speed-up and greatly improved iteration counts when compared with state-of-the-art ILU-preconditioned methods. We also confirm an extremely significant gain in scaling of precomputation costs: precomputation for the TT preconditioner is *sublinear* (for all practical purposes, constant) as the number of collisions and matrix size increase.

The QTT decomposition is one of several approaches for approximate solution of linear systems based in hierarchical compression. In the context of sparse structured systems and related factorizations, work has been done for a number of hierarchical matrix formats: *hierarchically semi separable* [17–23], $H$*matrices:* [24–26], and fast multipole methods [27,28]. Producing efficient, global factorization updates and dealing with high storage costs is an ongoing challenge in these alternate formats.

## 2 Mathematical Preliminaries: Dynamics of Rigid Bodies in the Presence of Friction

### 2.1 Notation and Glossary.

In order to ensure clarity of exposition, we include list of notation and terminology employed throughout this work. The tensor train acceleration proposed involves the manipulation of vectors, matrices and tensors of different dimensions, as well as the associated indexing and operators (Table 1).

We include a glossary and notation for important quantities and parameters used throughout the text in Table 2.

Objects | Notation | Description |
---|---|---|

b, q(⋅) | Scalars, scalar valued functions (default math) | |

$\gamma ,\u2009f(\xb7)$ | Vectors, vector valued functions (lowercase bold math) | |

$A,\u2009K(\xb7)$ | Matrices, matrix valued functions (uppercase math serif) | |

$b,\u2009A,\u2009F(\xb7)$ | Tensors, tensor valued functions (chancery calligraphic) | |

$A$ | Tensor Train decomposition (uppercase typewriter) | |

$T$ | Tree data structure (uppercase calligraphic) | |

Indexing | ||

$\gamma i$ | ith vector linked to ith object (e.g., rigid body, contact pair) | |

$\gamma (i)$ | ith entry of vector $\gamma $ | |

$A(i1,i2,\u2026,id)$ | $(i1,i2,\u2026,id)$ entry of tensor $A$ | |

$i=i1i2\cdots id\xaf$ | “flattened” (vectorized) multi-index by lexicographical order | |

Operators | ||

$vec(A)$ | Vectorize tensor $A$ according to flattened index $i=i1i2\cdots id\xaf$ | |

$reshape(\u2009\xb7\u2009,\delta )$ | Reshape array or index given compatible dimension vector $\delta $ | |

$permute(A,p)$ | Permute tensor dimensions given permutation vector p | |

$diag(\lambda )$ | Diagonal matrix generated by vector $\lambda $ |

Objects | Notation | Description |
---|---|---|

b, q(⋅) | Scalars, scalar valued functions (default math) | |

$\gamma ,\u2009f(\xb7)$ | Vectors, vector valued functions (lowercase bold math) | |

$A,\u2009K(\xb7)$ | Matrices, matrix valued functions (uppercase math serif) | |

$b,\u2009A,\u2009F(\xb7)$ | Tensors, tensor valued functions (chancery calligraphic) | |

$A$ | Tensor Train decomposition (uppercase typewriter) | |

$T$ | Tree data structure (uppercase calligraphic) | |

Indexing | ||

$\gamma i$ | ith vector linked to ith object (e.g., rigid body, contact pair) | |

$\gamma (i)$ | ith entry of vector $\gamma $ | |

$A(i1,i2,\u2026,id)$ | $(i1,i2,\u2026,id)$ entry of tensor $A$ | |

$i=i1i2\cdots id\xaf$ | “flattened” (vectorized) multi-index by lexicographical order | |

Operators | ||

$vec(A)$ | Vectorize tensor $A$ according to flattened index $i=i1i2\cdots id\xaf$ | |

$reshape(\u2009\xb7\u2009,\delta )$ | Reshape array or index given compatible dimension vector $\delta $ | |

$permute(A,p)$ | Permute tensor dimensions given permutation vector p | |

$diag(\lambda )$ | Diagonal matrix generated by vector $\lambda $ |

Note: Font type and specifications employed to distinguish classes of mathematical objects is indicated with emphasized text. Standard notation is used for matrix indices and operations (e.g., transpose, inverse).

M | Number of rigid particles in contact dynamics system. |

B_{i} | ith rigid body with location in generalized coordinates $qi$. |

$Fi,Ti$ | Applied force and torque vectors on B._{i} |

$ui,\omega i$ | Translational and rotational velocities of body B._{i} |

$vi$ | Rigid body velocity vector $vi=(ui,\omega i)$. |

N_{c} | Number of contacts, with locations $ci$ in $\mathbb{R}3$ |

$\Phi i(q)$ | Signed distance function for contact $ci$. |

$\gamma i$ | Contact impulse vector $\gamma i=(\gamma i,n,\gamma i,1,\gamma i,2)$ |

μ_{i} | Sliding contact friction coefficient |

d | Tensor or tensor train decomposition dimension |

r_{k} | kth tensor train rank; r denotes maximum rank. |

$\epsilon $ | Accuracy parameter; subindices used for different methods. |

M | Number of rigid particles in contact dynamics system. |

B_{i} | ith rigid body with location in generalized coordinates $qi$. |

$Fi,Ti$ | Applied force and torque vectors on B._{i} |

$ui,\omega i$ | Translational and rotational velocities of body B._{i} |

$vi$ | Rigid body velocity vector $vi=(ui,\omega i)$. |

N_{c} | Number of contacts, with locations $ci$ in $\mathbb{R}3$ |

$\Phi i(q)$ | Signed distance function for contact $ci$. |

$\gamma i$ | Contact impulse vector $\gamma i=(\gamma i,n,\gamma i,1,\gamma i,2)$ |

μ_{i} | Sliding contact friction coefficient |

d | Tensor or tensor train decomposition dimension |

r_{k} | kth tensor train rank; r denotes maximum rank. |

$\epsilon $ | Accuracy parameter; subindices used for different methods. |

### 2.2 Problem Formulation.

We consider a granular material comprised of *M* rigid particles *B _{i}* in $\mathbb{R}3$; the position of each body is uniquely described by the coordinates $xi$ for its center of mass and the rotation $Qi$ of a reference frame fixed to the body, represented by a unimodular quaternion. Quaternion representation of rotations in three dimensions is usually favored due to being compact and numerically stable; since we require the four component quaternions to be length one, it introduces only three independent degrees-of-freedom per particle. Let $qi=(xi,Qi)$ then be the six generalized coordinates for the

*i*th body, and $q=(q1,q2,\u2026,qM)\u2208\mathbb{R}6M$.

where $L(q)$ is a linear operator that relates velocities to the rate of change in generalized coordinates, $M$ is the mass matrix and $v,fB,fC\u2208\mathbb{R}6M$ contain each body's translational and rotational velocities, the total body forces and torques applied to them and the reaction forces and torques due to contact dynamics, respectively.

### 2.3 Complementarity Contact Model.

We impose the following contact constraints: no two bodies should penetrate, and if there is contact, a normal force and a tangential frictional force act at the interface. Consider two bodies $Bi1$ and $Bi2$, and let $\Phi i(q)$ be an unsigned distance function (also known as a gap function) for the pair $i=(i1,i2)$ satisfying $\Phi i(q)>0$ if the two bodies are separated, $\Phi i(q)=0$ if they are touching, $\Phi i(q)<0$ otherwise.

**,$t1,t2$ be unit normal and tangential vectors at the point of contact. Contact forces $fN=\gamma i,nn$ and $fT=\gamma i,1t1+\gamma i,2t2$ are then applied to each body in opposite directions. The complementarity constraint for the normal force (3) prevents penetration, enforcing that bodies move away from each other at contact. The Coulomb friction model ties the magnitudes of the normal and tangential forces. Using a maximum dissipation principle, the friction force is posed as the solution to an optimization problem**

*n*where *μ* is the static friction coefficient. The feasible set of forces $f=fN+fT$ for the minimization problem in Eq. (4) are known as a *friction cone*$\u03d2i={(\gamma i,n,\gamma i,1,\gamma i,2)\u2009|\u2009||(\gamma i,1,\gamma i,2)||\u2264\mu \gamma i,n}$. The complementarity condition in Eq. (3) is typically abbreviated using the notation $0\u2264\gamma i,n\u22a5\Phi i(q)\u22650$. We note that an alternate complementarity formulation for Eq. (3) allowing for partially elastic collisions can be obtained by replacing $\Phi i$ with the normal velocity as discussed in Refs. [5] and [29].

*δ*> 0

For a system with $Nc=|A(q,\delta )|$ contacts, this adds a number of constraints to the equations of motion proportional to *N _{c}*. We note that for dense granular flows,

*N*itself scales as

_{c}*O*(

*M*) for

*M*bodies.

*t*and step size Δ

^{k}*t*, velocity $vk+1$ and contact forces are solved via a NCP. The new velocity is used to evolve the position in time.

We note that Eq. (12) is obtained from Eq. (8) via a linearization, dividing by Δ*t* (which does not affect complementarity, but is numerically desirable). This makes the future velocity $vk+1$ the sole variable needed to enforce the complementarity condition. We also note that in this discretization, $(\gamma i,n,\gamma i,1,\gamma i,2)$ constitute contact *impulses*, i.e., force magnitudes multiplied by the step length Δ*t*.

### 2.4 Solving the Optimization Problem.

*t*goes to zero. Additionally, the CCP is equivalent to the Karush Kuhn Tucker (KKT) first-order optimality conditions for a quadratic optimization problem with conic constraints. We define a contact transformation matrix $D=[D1\u2009;D2\u2009;\u2026\u2009DNc]\u2208\mathbb{R}6M\xd73Nc$, a matrix $N$ and vector

*r*While this relaxed problem may introduce artifacts when step size Δ*t*, sliding velocity or friction are large, it enables the use of a wide range of quadratic programming methods for its solution:

Projected Jacobi and Gauss–Seidel methods [4].

Projected gradient descent methods like accelerated projected gradient descent [10], Barzilai and Borwein [30] and the Kucera and Preconditioned spectral projected gradient with fallback methods in Ref. [8].

Krylov subspace methods: Gradient projected minimum residual in Ref. [8].

Symmetric cone interior point [9].

Projected Jacobi and Gauss–Seidel methods, while requiring only fast matrix applies of $N$, have slow, linear convergence, often requiring thousands of iterations to obtain a significant reduction for the objective function. The projected gradient descent and Krylov subspace methods have since been proposed, providing considerable iteration count reductions while retaining cost-efficiency per time-step. However, they remain linearly convergent, and they require an increasing number of iterations as problem size increases for a variety of problems of interest.

Interior point methods are often based on a modified Newton or quasi-Newton step, displaying quadratic convergence near minima and iteration counts which are less dependent on problem size. However, they require the approximate solution of large linear systems in order to produce the Newton step, thus losing this competitive advantage due to computational cost.

### 2.5 Overview of Interior Point Methods.

The parameter *τ* controls the strength of the barrier, and as $\tau \u2192\u221e,\u2009B\tau (z)\u2192I(z)$ with *I*(*z*) the indicator function over (–*∞*, 0], and the problem in Eq. (21) becomes equivalent to the original constrained program defined by Eqs. (19) and (20). Given the set of problems defined by Eq. (21), interior point methods proceed by following along the corresponding *central path*${\gamma *(\tau ):\tau >0}$ of optima, which are located inside of the feasible set, and converge to the solution of the original problem as $\tau \u2192\u221e$.

#### 2.4.1 Primal-Dual Interior Point Methods.

We note that, as $\tau \u2192\u221e$, Eq. (26) becomes a strict complementarity condition.

#### 2.4.2 The PDIP Step.

*τ*until sufficient convergence to the solution of the original problem is satisfied.

#### 2.4.3 Application to the CCP.

*N*inequality constraints (Eq. (18)) given by $fi(\gamma )$ as

_{c}where $M\u0302$ is a diagonal matrix defined by $M\u0302=\u2211i=12Nc\lambda i\u22072fi(\gamma )=diag(m\u0302)$, with $m\u0302=[\mu 12\lambda 1,\lambda 1,\lambda 1,\u2026,\mu 12\lambda Nc,\lambda Nc,\lambda Nc]$. $B,C$ are banded rectangular matrices, and $E$ is diagonal. A typical sparsity pattern for a multibody dynamics problem is shown in Fig. 1.

*N*× 3

_{c}*N*for $\Delta \gamma $ can be obtained

_{c}We note that working with this Schur complement matrix $S$ is widely preferred, as it is symmetric positive definite and smaller in size. We additionally observe that its sparsity pattern across all PDIP iterations is always that of matrix $N$. That is due to the fact that $M\u0302$ is diagonal and $BE\u22121C$ is block-diagonal (3 × 3 blocks). Finally, in all of our experiments, $S$ was relatively more compressible in the Tensor train format. For these reasons, our discussion and acceleration efforts below center around solving the associated Schur complement system.

Most state-of-the-art methods for the direct solution and preconditioning of this system, the Tensor train decomposition included, will attempt to exploit sparsity structure in $S$. We recall that $N$ is the Hessian of the quadratic $q(\gamma )$ given by $DTM\u22121D$ (Eq. (15)), with $D\u2208\mathbb{R}6M\xd73Nc$ contact transformation matrix and $M$ diagonal mass matrix. In order to understand the sparsity pattern of $N$, we partition it into 3 × 3 blocks corresponding to each contact. The (*ℓ*, *i*)th block of $D$ is nonzero if the *i*th contact involves body *B _{ℓ}*. As a result, the (

*i*,

*j*)th block of $N$ is nonzero if the

*i*th and

*j*th contacts share a body in common.

Let us consider a toy contact problem involving three spherical objects lying on a flat surface as shown in Fig. 2(a). Given the network of bodies at contact in Fig. 2(b), the block sparsity pattern in $N$ corresponds to the edge-adjacency graph in Fig. 2(c); two edge nodes (indexed by body pairs) are connected if they share a body in common. The resulting pattern for this matrix (and thus for $S$) for this toy problem is shown in Fig. 3.

## 3 The Tensor Train Solver

In the context of multibody dynamics, we know the system matrices for the Newton step in Eq. (31) to be sparse and highly structured, and dependent on the iterates $(\gamma ,\lambda )$ as the PDIP iteration proceeds to the solution of problem (17). Further, from one time-step to the next, we expect the matrix required to obtain the Newton step to change in size as the active set of constraints (corresponding to pairs of objects in contact) evolves. It is because of these changes both within and between timesteps that producing an efficient and robust solution technique for the Newton system remains challenging.

Among currently available hierarchical compression techniques, the TT decomposition features compression and inversion algorithms that are applicable to a large set of structured matrices and that lend themselves to inexpensive global updates. In addition, they have shown to achieve sublinear precomputation times. We thus propose to use it as a framework for direct solution and preconditioning of iterative solvers for the linear systems in each PDIP iteration.

In this section, we give a cursory description of the QTT decomposition as a method to efficiently compress, invert and perform fast arithmetic with approximate representations of structured matrices. We then present a general discussion of its application to solving the linear systems associated with the PDIP for the CCP. This constitutes, to our knowledge, the first application of hierarchical compression solvers to the acceleration of second-order optimization methods. Although it is beyond the scope of this work, we expect the techniques laid out in this section to be readily applicable to a more general class of interior point and other Newton and quasi-Newton based methods for smooth convex problems.

### 3.1 The Tensor Train Decomposition.

The TT decomposition provides a powerful tensor compression technique [36] via low rank representation akin to that of a generalized singular value decomposition. We will focus on its application in the approximation of *tensorized* vectors and matrices given a hierarchical subdivision of their indices, known as QTT. Matrices in this setting are further interpreted as tensorized operators acting on such tensorized vectors. We outline how this interpretation allows us to effectively employ the TT as a tool for hierarchical compression and inversion of structured matrices.

#### 3.1.1 Motivating Example.

*n*

^{3}points $(xi1,yi2,zi3)$ in $[0,1]3$. We can use addition formulas to decompose $f(x,y,z)$ as a sum of separable functions

Evaluation of Eqs. (33) and (34) on the tensor $A$ is depicted in Fig. 4. The first step (33) produces a rank 2 decomposition $A(i1,i2,i3)=G1(i1,\alpha 1)V1(\alpha 1,i2,i3)$. In Eq. (34) we then decompose $V1$ to separate dependency of *y* and *z*; each of the two terms results in a rank 2 decomposition $V1(\alpha 1,i2,i3)=G2(\alpha 1,i2,\alpha 2)G3(\alpha 3,i3)$. We note that each term *G _{k}* (depicted in solid yellow, blue and red in Fig. 4) depends on one dimension

*i*of the original tensor $A$, and storage has been reduced from

_{k}*n*

^{3}to 2

*n*+ 4

*n*+ 2

*n*= 8

*n*.

What we have presented in this example is an exact tensor train decomposition of $A$. We present the analogous definition for the TT decomposition of a *d* dimensional tensor.

*For a d-dimensional tensor*$A(i1,i2,\u2026,id),ik\u2264nk$

*, sampled at*$N=\u220fk=1dnk$

*points, a TT decomposition is of the form*

*where, each two- or three-dimensional*$Gk$

*is known as a tensor core. The ranges of auxiliary indices*$\alpha k=1,\u2026,rk$

*determine the number of terms in the decomposition. We refer to r*th

_{k}as the k*TT-rank, analogous to matrix numerical rank.*

In order to understand the chain of “low rank” decompositions in the tensor train format in terms of matrix low rank decompositions, we introduce auxiliary objects known as unfolding matrices.

*For a tensor of dimension d, the kth unfolding matrix is defined as*

*where*$pk=i1\cdots ik\xaf$

*and*$qk=ik+1\cdots id\xaf$

*are two flattened indices. Using Matlab's notation*

*i*

_{2},

*i*

_{3}in $V1$. We may similarly show that $A2$ is of rank 2 (by applying addition formulas for $(x+y)$ and

*z*); however, the second rank 2 approximation in the chain is obtained by decomposing an unfolding matrix of $V1$

As is the case with low rank matrix decompositions, most often a tensor $A$ of interest will be approximately of low rank, in the sense that given a target accuracy *ε*, a TT decomposition with low TT ranks $A$ may be found such that $||A\u2212A||F<\epsilon $. This decomposition can be obtained by a sequence of low-rank approximations to $Ak$. A generic algorithm proceeds as in Algorithm 1.

Due to the cost of successive low rank factorizations, implementing Algorithm 1 will predictably lead to relatively high computational cost, *O*(*N*) or higher, which is exponential in the tensor dimension *d*. Instead, we employ TT rank revealing strategies based on the multipass alternating minimal energy (AMEN) cross algorithm of Ref. [36], in which a low-TT-rank approximation is initially computed with fixed ranks and is improved upon by a series of passes through all cores.

#### 3.1.2 TT Compression Update.

We note that, since these algorithms obtain the TT approximation based on an iterative, greedy rank detection procedure, they allow for inexpensive updates to a TT compressed representation of a tensor $A$. If we wish to produce the TT representation of $A\u0303=A+E$, the AMEN Cross algorithm may be started using the TT representation of $A$ as an initial guess. If $E$ has small TT ranks, this provides a significant speed-up for this compression algorithm, which often converges in a few iterations to the updated representation.

#### 3.1.3 Computational Complexity and Memory Requirements.

Because AMEN cross and related TT rank revealing approaches proceed by enriching low-TT-rank approximations, all computations are performed on matrices of size $rk\u22121nk\xd7rk$ or less. In Ref. [16], it is shown that the complexity for this algorithm is thus bounded by *O*(*r*^{3}*d*) or equivalently $O(r3\u2009log\u2009N)$, where $r=max(rk)$ is the maximal TT-rank that may be a function of sample size *N* and accuracy *ε*. For a large number of structured matrices as well as their inverses, *r* typically stays constant or grows logarithmically with *N* [16,37–39]. The overall complexity of computations is then *sublinear* in *N*.

### 3.2 Tensor Train for Hierarchically Structured Matrices.

Consider a block-sparse, structured matrix $S$ such as the ones obtained from the Newton step system in PDIP applied to contact dynamics. Following the motivating example presented above, our objective is to *tensorize* matrix $S$ in such a way that it is highly compressible in the TT format (i.e., the resulting TT ranks for a given accuracy *ε* are reliably small).

For this purpose, we propose a tensorization process obtained from a hierarchical partition of the unknowns corresponding to rows and columns of $S$ based on their location in space. These kind of partitions, usually encoded with tree data structures, have been extensively used in the last two decades to produce multilevel low rank factorizations of structured matrices and associated fast direct solvers. Following Ref. [16], we show that the Tensor train decomposition applied to this tensorized matrix yields an effective hierarchical low rank factorization technique.

#### 3.2.1 Contact Problem Example.

In order to demonstrate how this tensorization procedure works, we again consider the small contact problem involving three spherical objects lying on a flat surface presented in Fig. 2. As discussed in Sec. 2.4, the unknowns of the Schur complement matrix $S$ correspond to one normal and two tangent impulses $\gamma i=(\gamma i,n,\gamma i,1,\gamma i,2)$ for each contact point located in space at $ci$.

$S$ can thus be naturally partitioned as a $Nc\xd7Nc$ block-sparse matrix, with the (*i*, *j*)th 3 × 3 block encoding the relationship between the change in impulse $\Delta \gamma j$ and the resulting residual $r\gamma ,i$. We may additionally identify these 3 × 3 blocks as samples from a bivariate function of impulses at each contact point $S(i,j)=K(\gamma i,\gamma j)$; we recall that for interior point methods, $K$ is derived from the augmented Hessian of $f(\gamma )$.

In order to tensorize this matrix, we first subdivide the physical domain along each coordinate direction (two subdivisions in Figs. 5(b) and 5(c)) and upon each subdivision, we assign a binary index $ik\u2208{0,1}$ to contact points on each side. In this example, each contact is identified with the leaf node of a binary tree $T$ of depth 2 depicted in Fig. 5(d). A final index $i1\u2208{0,1,2}$ is required for each of the three coordinates of the impulse vector $\gamma i$. We note that indices are numbered from local to global, matching the order they would appear when tensorizing a column vector. Replacing a single index *i* with the multiindex $(i1,i2,i3)$ induces a natural tensorization for any vector of samples in our 12 unknowns $(\gamma 1,n,\gamma 1,1,\u2026,\gamma 4,n,\gamma 4,1,\gamma 4,2)$. Unknowns are ordered according to a depth-first (e.g., morton ordering) traversal of the tree, and then the vector is reshaped as a 3 × 2 × 2 tensor.

The Schur matrix $S$ in our example is 12 × 12, with rows and columns dependent on this very set of unknowns; it is depicted in the left-most matrix in Fig. 3. At first glance, it might be tempting to split row and column indices in the order they appear, obtaining tensor of dimension 6 with entries $S(i1,i2,i3,j1,j2,j3)$; this, however, will invariably produce suboptimal TT ranks. Instead, let $T\Pi $ denote the product tree $T\xd7T$, corresponding to hierarchical subdivision of both row and column indices. The first bisection produces 4 matrix 6 × 6 blocks (Fig. 3, center), indexed by a block index $b3=i3j3\xaf$. The second, 16 3 × 3 blocks indexed by $(b2,b3)=(i2j2\xaf,i3j3\xaf)$ (Fig. 3, left). Finally, entries on each individual block are identified with the local index $b1=i1j1\xaf$. This corresponds to ordering matrix entries according to the multiindex $(i1j1\xaf,i2j2\xaf,i3j3\xaf)$ (depth-first traversal on $T\Pi $) and reshaping the matrix as a 9 × 4 × 4 tensor.

#### 3.2.2 General Matrix Tensorization Process and TT Decomposition.

Given a set of *N _{c}* contact points with locations $ci\u2208\mathbb{R}3$, we set a box containing all of them as the root of the hierarchy and proceed to subdivide

*all nodes*along each coordinate direction until each leaf node has at most

*m*points. As was the case for our toy problem, each contact point $ci$ may now be encoded by a set of

*d*indices: a local index $i1\u22080,1,\u2026,3m\u22121$ for the 3

*m*unknowns and

*d*− 1 binary indices $(i2,i3,\u2026,id)$ corresponding to its location on the tree $T$ at each level. This process is shown on line 1 of Algorithm 2. We note that the binary encoding of points in

*n*dimensional space induced by this coordinate partition induces a so-called Morton ordering (also known as

*Z*-curve order). For efficient algorithms to construct and update such partitions, we refer the reader to Ref. [40].

In general, the tree $T$ will be a uniform binary tree of depth *d* − 1 with empty nodes corresponding to regions in space with no contacts, and nonempty leaf nodes will contain different numbers of contacts (unless *m *=* *1). Equivalently, every unknown will have an associated multiindex $(i1,\u2026,id)$, but not all multiindices will correspond to a valid unknown. In order to apply TT compression algorithms, we will thus work on an augmented matrix $S\u0302$ of size $2d\u22121n1\xd72d\u22121n1$ (with *n*_{1} = 3 *m*) such that $S\u0302(i,j)=S(i,j)$ if both *i* and *j* correspond to valid unknowns, and $S\u0302(i,j)=\delta i,j$ otherwise. In other words, $S\u0302$ is a permutation of a 2 × 2 block diagonal matrix with blocks $S$ and identity matrix $I$ corresponding to the original and dummy variables, respectively. This corresponds to the function call in line 2 of Algorithm 2.

*d*dimensional tensor by reshaping arrays and permuting their dimensions; however, the TT AMEN Cross compression algorithm used in this work will not in general form this array, and instead all we require is a way to evaluate the desired tensor $S$ given a set of block indices

and use this routine to obtain an approximate TT factorization S via a greedy, rank-revealing process. We note that each core of $S,\u2009Gk(\alpha k\u22121,ikjk\xaf,\alpha k)$ depends only on the pair of row and column indices at the corresponding level of the hierarchy. When performing matrix arithmetic, such as matrix-vector product or inversion, TT cores are often reshaped as *n _{k}* ×

*n*matrices parametrized by $\alpha k\u22121$ and

_{k}*α*.

_{k}In Fig. 3, we illustrate the application of Algorithm 1 to produce the TT decomposition for the Schur complement in our toy contact problem, corresponding to line 6 of Algorithm 2. This method produces a TT decomposition via a sequence of low rank decompositions of unfolding matrices of the tensorized $S$. Because of the way we have indexed $S$, the columns of the first unfolding matrix $S1$ are obtained by vectorizing each of the 16 leaf blocks in $T\Pi $. If we apply a row interpolatory decomposition to obtain the low rank factorization $G1V1$, this step of the algorithm finds a low rank interpolation basis of block entries. The algorithm then proceeds by merging children nodes, forming the next unfolding matrix and applying low rank decompositions in an upward pass through the matrix block tree.

#### 3.2.3 TT Compressibility of $S$.

As evidenced by the experimental results in Sec. 4, following this procedure reliably yields approximate TT decompositions for $S$ with low TT ranks. This is, first, because it exploits sparsity at different levels of resolution, corresponding to different levels of the tree $T\Pi $ (unfolding matrix columns are zero if the corresponding block is zero). As discussed in Sec. 2.4, block-sparsity of $S$ is tied to a network of contact points; a block of $S$ is nonzero if they share a body in common. Unless bodies are substantially different in size and diameter, this network is highly localized: two contacts sharing a body will generally be nearby in space. We note that an interesting avenue of future research might consider tensorizing $S$ following graph partition techniques based on the adjacency graph in Fig. 2(c).

Second, the TT automatically exploits smoothness and symmetries of the bivariate function $K(\gamma 1,\gamma 2)$. The resulting TT decomposition and rank bounds may thus be analyzed in terms of the approximation of $K$ by a sum of separable functions of the form $g1(x1)g2(x2)\cdots gd(xd)$ provided by this decomposition, as has been done for integral and differential operators and their inverses.

#### 3.2.4 Fast Arithmetic and TT Direct Solvers.

Fast methods for TT matrix arithmetic are available for a number of operations, including inversion and matrix-vector and matrix-matrix products. The TT solvers in this work find an approximate TT structure for the inverse and employ the corresponding TT matrix-vector product $\gamma =S\u22121b$. This mat-vec algorithm proceeds by contracting one dimension of the tensorized matrix $S$^{– 1} at a time, applying the *k*th TT core. Its complexity can be shown to be $O(r2N\u2009log\u2009N)$.

TT inversion methods compute an approximate TT decomposition of $S\u22121$ given the TT decomposition of $S$. In these algorithms, matrix equations for each tensor core of the inverse is solved iteratively. Following an alternating least squares algorithm, given an initial guess for the inverse in TT form, it proceeds by iteratively cycling through the cores (freezing all cores but one) and solving a linear system to update the *k*th core of S^{text--1}. TT ranks of the inverse are not known a priori (and are distinct to those for S), and so strategies to increase core ranks are needed to ensure convergence of the alternating least squares procedure to an accurate inverse representation. Further details about variants of this approach can be found in Refs. [41–43]. The complexity of this algorithm for a maximum TT rank *r* for both S and S^{−1} is bound by $O(r4\u2009log\u2009N)$, as shown in Ref. [16].

#### 3.2.5 Inverse Compression Update.

We note that, since they share the same iterative, greedy rank detection structure with the AMEN Cross compression algorithm, TT direct solver routines may also be significantly sped up using an approximate initial solution. In the context of the Newton step in interior point methods, we expect both forward and inverse operators to be obtainable via low TT rank updates.

### 3.3 Tensor Train Newton System Solver.

We now discuss how to adapt the tensor train decomposition framework to accelerate computation of the Newton steps within the PDIP method applied to the CCP. We first show how at a given timestep, the tensor train provides an easy-to-update approximate inverse for the Newton system's Schur complement matrix. We then describe a procedure to hot-start the TT compression and inversion algorithms reusing information from the previous timestep, even when the corresponding set of contacts (and thus, matrix size and structure) change. We summarize the entire PDIP step acceleration approach in Algorithm 3.

#### 3.3.1 Accelerated Newton Step Solve.

At a given timestep *t*, the *k*th PDIP iteration involves the solution of the Schur complement system, for 3*N _{c}* × 3

*N*matrix $Sk=N+M\u0302k\u2212BkEk\u22121Ck$. In order to accelerate the Newton system solve, we construct approximate TT decompositions S

_{c}*and $Sk\u22121$; we employ the AMEN cross and solve algorithms given user prescribed target accuracy and maximum TT rank parameters (lines 3–8, 14, and 15 in Algorithm 3). We then have the choice to employ $Sk\u22121$ as a direct solver or couple it with an iterative procedure (e.g., preconditioned Krylov method) if a more accurate solve is needed. For the numerical experiments presented in Sec. 4, we show the latter approach to have the best performance in terms of reducing overall solution costs.*

_{k}*k*to

*k*+

*1, we consider $Sk+1$ as a perturbation*

where *L _{k}* is block-diagonal. While

*L*is generally of matrix rank

_{k}*O*(

*N*), across all experiments in Sec. 4 we observe it to be of approximate low TT rank, and the same is found for $Sk+1\u22121$ as a perturbation of $Sk\u22121$. This fact allows us to use the TT decompositions $Sk,Sk\u22121$ to hot-start the corresponding AMEN compression and inversion algorithms, reducing precomputation times considerably (∼ 10 × faster in our experiments).

_{c}#### 3.3.2 Re-Using Information Across Timesteps.

Employing information from the solution of the CCP at a timestep *t* to hot-start the PDIP iteration at the next timestep *t* + Δ *t* is notoriously hard; even if it can be used to produce a feasible point that is close to the optimum (which is nontrivial due to changes in the set of contacts), efforts by the PDIP algorithm to preserve centrality might cause it to take small steps and waste time “returning” to the central path. For this reason, we initialize each PDIP iteration by making the tangential force impulses $\gamma i,1,\gamma i,2$ equal to zero, and the normal force $\gamma i,n$ equal to either a constant preset value (e.g., 1) or to the value computed in the previous timestep if the corresponding contact persists across timesteps. This ensures that our initial value is feasible and lies safely inside the friction cone.

Since pairs of bodies may phase in and out of contact, the set of contacts considered at each timestep (and thus the set of unknowns) as well as the corresponding sparsity pattern both change as the system evolves in time. In Fig. 6 we show an example based on our toy contact problem. While the matrix size for this problem remains the same (since one contact is added and one is removed), we note that the sparsity structure is different due to a change in graph connectivity. However, as long as Δ*t* and relative velocities are sufficiently small, it is likely that the set of persistent contacts (and the corresponding submatrix of $S$) from one timestep to the next will be relatively large. In all validation experiments in Sec. 4, in fact, well over 90% of contacts persist once objects have sedimented.

Our goal is then to re-use the TT decompositions for the *initial* Schur complement matrix $S0$ from one timestep to the next. From one timestep to the next, contact locations evolve, new active contacts may be introduced and inactive contacts need to be eliminated. We recall from Sec. 3.2 that the current set of active unknowns is considered as a subset of variables corresponding to a uniform binary tree with at most *m* contacts in each leaf node, and the matrix $S0$ is compressed within an augmented matrix $S\u03020$ of size $2d\u22121n1\xd72d\u22121n1$ with *n*_{1} = 3 *m*.

We must then update the indices for the new set of active contacts at time *t* + Δ*t*

Indices for contacts that are no longer active must be set to dummy variables and corresponding entries of $S\u03020$ set to the identity.

New contacts must be added to the hierarchy and encoded according to their location. Blocks for these contacts and their neighbors (sharing a rigid body) are now nonzero.

Since binary encoding of contact locations is computationally inexpensive, we may readily detect and update indices for persistently active contacts whose location has significantly changed.

In order to preserve as much structure as possible, we only change the encoding of persistent contacts if there is an indexing conflict, if their current leaf node has more than *m* contacts, or if their location has migrated significantly. In Fig. 7 we show a simple one-dimensional example of the evolution of tensor indices from one timestep to the next, tracking active nodes in the resulting binary tree structure and corresponding sparsity of matrix $S0$. We note that while *N _{c}* changes from 5 to 6, the size of the augmented matrix, depth of the binary tree and number of contacts per leaf (

*m*=

*1) for this example remain the same.*

If this re-indexing is compatible with the encompasing binary tree, we may use our TT decompositions for $S$_{0} and $S0\u22121$ as initial guesses for the corresponding initial matrix and inverse at the next timestep. Otherwise, we must compute a new hierarchical subdivision and compress the TT factorizations from scratch.

## 4 Numerical Results

We now demonstrate the performance of the TT-based solver when accelerating the solution of the Newton step system, and thus of second-order methods such as PDIP, in the context of dense, multiple rigid-body dynamics. As mentioned in Secs. 1 and 3, we know that given a desired target accuracy *ε*, if the associated TT ranks *r _{k}* are bounded or slowly growing as a function of problem size

*N*, factorization costs and storage for the TT solver are

*sublinear*in

*N*, and that applying this matrix to a given right-hand-side is $O(N\u2009log\u2009N)$. We wish to study if Newton system matrices are TT compressible in this sense, and to test their performance and scaling in their solution as problem size grows (determined by the number of collisions

*N*). By harnessing the ability of the TT to produce economic updates both from one Newton iteration to the next, as well as across timesteps, we demonstrate significant improvement for the solution of the complementarity problem using second-order methods.

_{c}The complementarity method for frictional contact, as well as the solution methods for its CCP relaxation have been thoroughly validated and contrasted with experimental data [12,44]. We setup three experiments based on standard phenomena in terramechanics, focusing on how TT-based linear solvers perform in the PDIP iteration. Models and contact dynamics simulations are performed using open-source library Project Chrono [45], PDIP solver comparisons are performed with a serial matlab implementation and all TT methods are based on the matlab TT-Toolbox [46]. All experiments are run on the serial queue of University of Michigan's Flux computing cluster ().

### 4.1 Tensor Train Compressibility and Information Re-Use.

We first carry out an assessment to determine how the general behavior of TT ranks for the Schur matrix and its inverse vary with target accuracy *ε* and maximum allowed TT ranks, varying these parameter from 10^{−2} to 10^{−4} and $r\u226410,100,1000$, respectively. This gives us a general idea of the compressibility of these matrices upon re-ordering their nodes according to a spatial hierarchy. Since algorithmic constants for TT compression, inversion and matrix-vector apply all depend on rank, this also informs out choice of *ε*. We present average results for 1000 PDIP iterations for a sedimentation experiment with *M *=* *35,939 rigid bodies in Sec. 4.2; we note these are largely replicated across experiments, and that ranks and compression times grow very slowly with the number of degrees-of-freedom.

In Table 3 we observe that the Schur matrices and their inverses are indeed highly compressible, and that as we increase the maximum allowed rank, individual and average TT ranks converge. This is replicated in all our experiments, regardless of number of particles and number of collisions. In Table 4 we record the combined matrix compression and inversion times; as indicated in Sec. 3.2, performance of these algorithms is bounded by terms of the form $rk\u2009log\u2009Nc$. Given that rank growth results in a substantial increase in precomputation times, in our experiments we find that a TT preconditioner approach with *ε* = 10^{−2} is most effective in reducing overall computation for the PDIP iterations. We note, however, that this parameter selection is generally problem and implementation dependent.

r ≤ 10 | r ≤ 100 | r ≤ 1000 | |
---|---|---|---|

ε = 10^{−2} | 4.6/4.4 | 10.5/6.7 | 10.5/6.7 |

ε = 10^{−3} | 8.7/8.5 | 59.3/45.2 | 65.2/46.4 |

ε = 10^{−4} | 9.1/9.0 | 68.6/58.1 | 113.3/91.2 |

r ≤ 10 | r ≤ 100 | r ≤ 1000 | |
---|---|---|---|

ε = 10^{−2} | 4.6/4.4 | 10.5/6.7 | 10.5/6.7 |

ε = 10^{−3} | 8.7/8.5 | 59.3/45.2 | 65.2/46.4 |

ε = 10^{−4} | 9.1/9.0 | 68.6/58.1 | 113.3/91.2 |

Note: This data is obtained from 1000 timesteps of a sedimentation simulation of *M *=* *35,939 rigid bodies.

r ≤ 10 | r ≤ 100 | r ≤ 1000 | |
---|---|---|---|

ε = 10^{−2} | 10 | 25 | 25 |

ε = 10^{−3} | 16 | 1244 | 1317 |

ε = 10^{−4} | 16 | 2740 | 8157 |

r ≤ 10 | r ≤ 100 | r ≤ 1000 | |
---|---|---|---|

ε = 10^{−2} | 10 | 25 | 25 |

ε = 10^{−3} | 16 | 1244 | 1317 |

ε = 10^{−4} | 16 | 2740 | 8157 |

We then wish to quantify the impact of the strategies described in Sec. 3.3 to re-use information in TT factorizations for the matrix and its inverse. For this purpose, we run two versions of the TT preconditioned PDIP iteration with and without factorization re-use for 100 timesteps for the same sedimentation problem described above. We set target accuracy to *ε* = 10^{−2} and bound maximum ranks at $r\u226410$.

In Fig. 8, we plot precomputation times for the initial Newton steps and for the entire PDIP iteration, in order to measure how effective our information re-use strategies for the TT are in bringing down compression and inversion costs. For the initial PDIP iteration, we see that except for the first timestep (for which both methods have no prior information), information re-use always provides a speedup, between 5 and 15× for most timesteps shown here. The accumulated effect of both re-use strategies can be seen in the second plot on the right; precomputation is improved for all timesteps, resulting in a 10–15× overall speedup. Across all experiments, we observe both an improvement by about an order of magnitude and a drastic reduction in the variance of precomputation times, making the TT approach faster and more robust. We also note that predictably, this speedup is larger for higher target accuracies, as the iterations through all TT cores become more computationally burdensome.

### 4.2 Performance Comparison Experiments for Primal Dual Interior Point Iteration.

#### 4.2.1 Sedimentation on Box With Rotating Mixer.

A randomly pertubed cubic lattice of (2*n* + 1)^{3} rigid particles of spherical shape of radius 0.1 and friction coefficient *μ* = 0.25 are dropped and sediment under gravity into a fixed box with a rotating mixer, as shown in Fig. 9. We then run simulations for *n *=* *8, 16, 32, with a total of rigid bodies *M *=* *4915, 35,939 and 274,627 (including the box and mixer), scaling up the box size to keep particle density roughly constant. We set a timestep size Δ*t* = 0.025 and target accuracy 1 × 10^{−4} for the PDIP solver, until such time as all objects are deposited in the box and undergoing mixing.

We compute the Newton step through the solution of the corresponding Schur complement system. In order to test performance of the TT-based preconditioner, we compare precomputation and solution times for a preconditioned biconjugate gradient stabilized (BICGSTAB) method against Incomplete LU and unpreconditioned versions of this iterative solver. We note that while use of the conjugate gradient method may be generally preferrable for these systems, we observe its performance may degrade due to ill-conditioning as iterates approach the feasible set boundary, and so for simplicity of presentation we exclude it from our comparison.

As discussed in Sec. 3.3, the Tensor train approach allows us to produce approximate direct and preconditioned iterative solvers by varying target accuracy and maximum TT rank in the compression and inversion processes. Following preliminary parametric studies, we find that setting target accuracy for TT inversion to 1 × 10^{−2} and capping maximum TT ranks at $r\u226410$ provides the best performance in terms of the trade-offs involved in precomputation and TT preconditioner apply costs for our experimental setup.

We note that for practical purposes, the maximum number of iterations for the unpreconditioned solve was set at 1000; this limit was often reached reducing the overall accuracy of the resulting linear system solution and thus the quality of the PDIP iteration. Through further testing, we consistently observe average BICG iteration counts of 3–5 × 10^{3} and a fivefold and sixfold increase in computational cost when removing this constraint. These estimates should be considered whenever comparing either of the preconditioned methods against unpreconditioned BICG.

In Fig. 10 we can observe how performance for each of these solvers scales with the number of collisions *N _{c}*. Two factors are contributing to increase problem complexity in this experiment: as objects sediment in the box, the number of collisions tends to increase and the Schur complement matrix becomes less sparse. From this plot, we can readily observe that the TT preconditioned solver generally shows superior scaling, outperforming the ILU preconditioner at about

*N*∼ 20,000, and gaining an order of magnitude speed-up against the ILU preconditioner by the end of experiments with

_{c}*M*=

*35,939, 274,267. We can observe that while the fraction of time spent in precomputation tends to a constant for ILU (indicating similar asymptotic scaling for precomputation and solve times, experimentally $O(Nc3)$, it quickly goes down to zero for the TT.*

In Fig. 11, we take a closer look at average precomputation times for each PDIP iteration for TT and ILU. We note that this involves computing roughly 50–100 factorizations, one per Newton iteration. In all experiments, we confirm that compression and inversion times for the Tensor train approach grow extremely slowly with *N _{c}*, staying on a range from 5 to 15 s.

Finally, we compare BICGSTAB average iteration counts in Fig. 12. We note that while iteration counts generally increase at the beginning of the experiment, those for the TT grow slower and settle sooner as particles sediment; for *M *=* *35,939, 274,627, the number of iterations for the ILU becomes up to eight times larger. For the unpreconditioned case, the maximum iteration count of 1000 is reached for a significant number of linear system solves, limiting the accuracy of the resulting Newton steps.

#### 4.2.2 Drafting Test With Rectangular Blade.

We follow the validation experiment in Ref. [12]; as shown in Fig. 13, we setup a drafting test involving a rectangular blade of width 0.1 moving through a container filled with spherical rigid particles of radius 0.1 and friction coefficient *μ* = 0.25. This test may be used to compute the force that the blade experiences as it moves through the granular flow, which eventually reaches a steady-state. As in the sedimentation test above, we setup a perturbed lattice of (2*n* + 1)^{3} spherical particles inside the box, and perform experiments for *n *=* *8, 16 for a total number of *M *=* *4915, 35,939 rigid bodies. The blade moves from one end of the box to the other with a prescribed sinusoidal velocity with period 4 s.

#### 4.2.3 Direct Shear Experiment.

*n*+ 1)(2

*n*+ 1)

^{2}spherical particles is setup inside the box, experiencing shear from the lower half of the box and a load from a ceiling press ten times denser than the granular material. We perform experiments for

*n*=

*8, 16, for a total number of*

*M*=

*7228, 53,364 rigid bodies. The movement of the lower half is again controlled prescribing a sinusoidal velocity. Simulation snapshots for 1023 bodies are shown in Fig. 14.*

In Fig. 15, we once again show a comparison of performance for each of the three solvers as it scales with number of collisions *N _{c}*. For all four experiments we observe essentially the same scaling for solution and precomputation times for the TT preconditioned solver, with TT precomputation times again staying roughly around 10 s per timestep. The overall speedups attained are slightly smaller (about 5–10×), due likely to the increase in problem complexity and the force and velocity magnitudes involved compared to the sedimentation case. However, it remains the case that the TT provides a significantly more robust and better acceleration to the linear system, with better scaling precomputation times and reduced iteration counts than the ILU sparse preconditioner. As is the case for the sedimentation tests, due to the limit of the maximum number of iterations for the unpreconditioned solve, average iteration counts and timings can be estimated to be about five times higher than presented in these plots for full accuracy.

## 5 Conclusions

In this work we have presented a robust and highly efficient acceleration technique for the solution of Newton step linear systems in second-order methods based on approximate hierarchical compression and inversion in the Tensor train format. In multiple experiments for common terramechanics phenomena modeled with frictional contact for dense, multiple rigid body systems, we have successfully applied a TT preconditioner to accelerate their solution, providing speed-ups of up to an order of magnitude against state-of-the-art sparse solvers for systems with $Nc\u227320000$. Across all our experiments, we observe that the Tensor train preconditioner provides lower and more reliable iteration count reductions, as well as practically constant precomputation costs and storage requirements, which are improved significantly by our proposed TT factorization re-use techniques.

As discussed in Sec. 1, the benefits of rapid, problem-independent convergence of second-order optimization solvers are often negated by expensive large sparse matrix solves. The application of sparse and structured linear algebra techniques has been thus far limited by unfavorably scaling precomputation costs, excessive memory storage and communication requirements and the absence of efficient global factorization updates. We have demonstrated that the Tensor train approach can successfully address these issues in DEM complementarity simulations of granular media. Based on its versatility and exploitation of a wide class of hierarchical low rank structure, we expect this to be true for a wide array of large scale optimization problems. We also anticipate that the low precomputation cost and storage requirements provided by the TT will be most impactful in high performance computing implementations; our ongoing work features a distributed memory implementation of the frictional contact CCP and the TT accelerated PDIP solver.

We note that the optimization algorithms and proposed accelerations in this work were tested for a complementarity formulation for perfectly plastic contact. As mentioned in Sec. 2, similar formulations exist for elastic and partially elastic (bouncing) contact, and we expect the algorithms presented here to apply readily to that case and to display similar performance.

## Acknowledgment

We acknowledge support from the Automotive Research Center (ARC) in accordance with Cooperative Agreement W56HZV-14-2-0001 with U.S. Army Tank Automotive Research, Development and Engineering Center (TARDEC). Corona and Veerapaneni were also supported by the NSF under grant DMS-1454010. This research was supported in part through computational resources and services provided by Advanced Research Computing Center at the University of Michigan, Ann Arbor. Corona would like to thank Daniel Negrut, Radu Serban, Luning Fang and Milad Rakhsha at the Simulation Based Engineering Laboratory (SBEL) at UW Madison for their assistance with this project and for many helpful discussions.

## Funding Data

National Science Foundation (Award ID: DMS-1454010; Funder ID: 10.13039/100000001).

Tank Automotive Research, Development and Engineering Center (Award ID: W56HZV-14-2-0001; Funder ID: 10.13039/100009922).

## References

_{2}-Matrix Compression, Algorithms and Analysis (EMS Tracts in Mathematics, Vol.

^{d}× 2

^{d}Matrices Using Tensor Decomposition