Benchmarks

Here are some simple benchmarks. Take them with a grain of salt: they are run automatically on cloud virtual machines during the documentation build, so the absolute timings can vary.
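If you want numbers that are representative of your own machine, you can install the packages used below and paste the code snippets into a Julia REPL. A minimal setup sketch (the package list simply mirrors the packages appearing in this section; adjust as needed):

using Pkg
# one-time installation of the packages used in the benchmarks below
Pkg.add(["SummationByPartsOperators", "BenchmarkTools", "DiffEqOperators",
         "BandedMatrices", "StaticArrays", "StructArrays"])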

First-derivative operators

Periodic domains

Let's set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, DiffEqOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = periodic_derivative_operator(derivative_order=1, accuracy_order=2,
                                     xmin=xmin, xmax=xmax, N=100)
x = grid(D_SBP)
D_DEO = CenteredDifference(derivative_order(D_SBP), accuracy_order(D_SBP),
                           step(x), length(x)) * PeriodicBC(eltype(D_SBP))

D_sparse = sparse(D_SBP)

u = randn(eltype(D_SBP), length(x)); du = similar(u);
@show D_SBP * u ≈ D_DEO * u ≈ D_sparse * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 993 evaluations.
 Range (min … max):  35.081 ns … 62.564 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     35.877 ns               GC (median):    0.00%
 Time  (mean ± σ):   36.062 ns ±  1.059 ns   GC (mean ± σ):  0.00% ± 0.00%

     ▃▆█▆▁                                                 
  ▂▄██████▇▆▅▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▂▂▁▁▁▁▁▁▁▂▁▁▂▁▁▂▂▂ ▃
  35.1 ns         Histogram: frequency by time        41.8 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare this to the runtime obtained using a sparse matrix representation of the derivative operator. Depending on the hardware and other factors, this can be an order of magnitude slower than the optimized implementation from SummationByPartsOperators.jl.
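One reason for the gap is that the sparse matrix stores every stencil coefficient explicitly and multiplies through the generic compressed-column kernel, whereas the specialized implementation just applies the constant stencil. As a quick inspection of the stored entries (nnz comes from SparseArrays, which is already loaded above):

nnz(D_sparse)                       # total number of stored coefficients
nnz(D_sparse) / size(D_sparse, 1)   # stored coefficients per row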

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 651 evaluations.
 Range (min … max):  190.602 ns … 267.813 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     201.668 ns                GC (median):    0.00%
 Time  (mean ± σ):   202.581 ns ±   4.391 ns   GC (mean ± σ):  0.00% ± 0.00%

                   ▂▄▆███▅▄▄▂▂▁▁▁            ▁               ▂
  ▄▃▅▅▅▆▄▆▆▅▆▇▇▄▆▅▇███████████████▇▇▇█▇█▇▇▇█████▇▇▆▆▇▆▅▆▅▆▆▅ █
  191 ns        Histogram: log(frequency) by time        220 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, we benchmark the implementation of the same derivative operator in DiffEqOperators.jl.

doit(D_DEO, "D_DEO:", du, u)
D_DEO:
┌ Warning: #= /home/runner/.julia/packages/DiffEqOperators/lHq9u/src/derivative_operators/convolutions.jl:412 =#:
│ `LoopVectorization.check_args` on your inputs failed; running fallback `@inbounds @fastmath` loop instead.
│ Use `warn_check_args=false`, e.g. `@turbo warn_check_args=false ...`, to disable this warning.
└ @ DiffEqOperators ~/.julia/packages/LoopVectorization/QgYWB/src/condense_loopset.jl:1166
┌ Warning: #= /home/runner/.julia/packages/DiffEqOperators/lHq9u/src/derivative_operators/convolutions.jl:460 =#:
│ `LoopVectorization.check_args` on your inputs failed; running fallback `@inbounds @fastmath` loop instead.
│ Use `warn_check_args=false`, e.g. `@turbo warn_check_args=false ...`, to disable this warning.
└ @ DiffEqOperators ~/.julia/packages/LoopVectorization/QgYWB/src/condense_loopset.jl:1166
BenchmarkTools.Trial: 10000 samples with 73 evaluations.
 Range (min … max):  837.329 ns … 106.266 μs   GC (min … max): 0.00% … 98.70%
 Time  (median):     859.685 ns                GC (median):    0.00%
 Time  (mean ± σ):   917.929 ns ±   2.076 μs   GC (mean ± σ):  4.50% ±  1.97%

   ▂▄▇███▆▅▄▃▃▂▂▂▂▂▃▄▃▂▂▁▂▂▂▁ ▁                               ▂
  ▆███████████████████████████████▇▇▇▆▇▇▇▇▇▇██████▆▇▇▇▆▆▆▆▆▅▅ █
  837 ns        Histogram: log(frequency) by time       1.04 μs <

 Memory estimate: 416 bytes, allocs estimate: 6.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "DiffEqOperators"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [9fdde737] DiffEqOperators v4.45.0
  [9f78cca6] SummationByPartsOperators v0.5.61 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Bounded domains

We start again by setting up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
D_sparse = sparse(D_SBP)
D_banded = BandedMatrix(D_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u ≈ D_banded * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 199 evaluations.
 Range (min … max):  420.482 ns … 594.432 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     425.523 ns                GC (median):    0.00%
 Time  (mean ± σ):   427.212 ns ±   8.815 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▄█                                                     
  ▂▂▃▇████▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▁▂▁▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  420 ns           Histogram: frequency by time          469 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Again, we compare this to a representation of the derivative operator as a sparse matrix. Unsurprisingly, it is again much slower, just as on periodic domains.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.453 μs … 10.268 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     4.478 μs                GC (median):    0.00%
 Time  (mean ± σ):   4.506 μs ± 203.226 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▇   ▂                                                     ▁
  █▅██▇▆▆▅▅▅▆▆▇▅▅▅▅▁▃▁▃▃▃▁▄▄▃▄▁▁▃▁▄▃▄▄▁▃▃▄▃▄▁▃▁▁▁▅▃▃▄▄▅▅▆▄ █
  4.45 μs      Histogram: log(frequency) by time       5.7 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, we compare it to a representation as a banded matrix. Disappointingly, this is still much slower than the optimized implementation from SummationByPartsOperators.jl.
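Part of the explanation is that the banded matrix-vector multiplication processes every entry inside the bands, and the boundary closures of the SBP operator typically widen the bands compared to the interior stencil. You can inspect the bands as follows (bandwidths is provided by BandedMatrices.jl, loaded above):

bandwidths(D_banded)   # (lower, upper) bandwidth of the banded representation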

doit(D_banded, "D_banded:", du, u)
D_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  6.638 μs … 16.094 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.658 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.835 μs ± 919.710 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▇█▆▄▅▄▄▄▄▁▃▄▄▆▇▆▆▄▅▄▄▇▄▃▄▄▄▁▁▃▁▃▄▃▃▃▁▃▃▃▅▇▆▆▆▆▅▆▆▄▅▅▅▄▆▆ █
  6.64 μs      Histogram: log(frequency) by time      12.4 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v0.17.18
  [9f78cca6] SummationByPartsOperators v0.5.61 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Dissipation operators

We follow the same structure as before and start by setting up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
Di_SBP  = dissipation_operator(MattssonSvärdNordström2004(), D_SBP)
Di_sparse = sparse(Di_SBP)
Di_banded = BandedMatrix(Di_SBP)
Di_full   = Matrix(Di_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show Di_SBP * u ≈ Di_sparse * u ≈ Di_banded * u ≈ Di_full * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, let us benchmark the derivative and dissipation operators implemented in SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
doit(Di_SBP, "Di_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 201 evaluations.
 Range (min … max):  391.433 ns … 716.164 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     396.617 ns                GC (median):    0.00%
 Time  (mean ± σ):   398.052 ns ±   9.498 ns   GC (mean ± σ):  0.00% ± 0.00%

       ▆█                                                     
  ▂▂▂▄███▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▁▁▂▁▁▂▁▂▁▂▁▂▂▂▁▁▂▁▂▂▁▁▁▂▂▂▂▂▂▂▂▂ ▃
  391 ns           Histogram: frequency by time          438 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_SBP:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.062 μs … 2.895 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.072 μs               GC (median):    0.00%
 Time  (mean ± σ):   1.077 μs ± 56.648 ns   GC (mean ± σ):  0.00% ± 0.00%

       █▁                                                   
  ▂▂▂▆▆██▅▅▅▃▃▂▂▂▂▁▁▁▂▁▁▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▂▁▂▁▂▁▁▂▁▁▁▁▂▁▂▂▂▂ ▃
  1.06 μs        Histogram: frequency by time        1.15 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare the results to the sparse and banded matrix representations of the dissipation operator. Unsurprisingly, these are again much slower (by around an order of magnitude).

doit(Di_sparse, "Di_sparse:", du, u)
doit(Di_banded, "Di_banded:", du, u)
Di_sparse:
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.135 μs … 8.910 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     5.176 μs                GC (median):    0.00%
 Time  (mean ± σ):   5.205 μs ± 207.102 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▄█   ▁▁                                                    ▁
  ██▇███▆▅▇▆▃▆▆▇▅▄▅▅▄▅▁▁▁▁▃▃▁▁▁▃▁▃▁▁▃▁▁▁▃▁▃▁▁▄▁▁▁▁▁▁▃▁▃▅▅▆▆ █
  5.13 μs      Histogram: log(frequency) by time      6.57 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  6.172 μs … 13.824 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.192 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.403 μs ± 396.694 ns   GC (mean ± σ):  0.00% ± 0.00%

  █              ▅▆    ▁                                ▁
  █▅▄▆█▇▅▆▆▄▆▅▅▆▆▄▅▃▅██▄▁▄▆█▄▄▅▄▅▅▄▅▄▄▄▁▁▃▃▁▃▄▁▁▁▄▁▃▄▄▅▅▅▅▆ █
  6.17 μs      Histogram: log(frequency) by time      7.92 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's benchmark the same computation when a full (dense) matrix is used to represent the dissipation operator. This is obviously a bad idea, but 🤷
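A rough storage comparison illustrates why (sketch only; the exact numbers depend on the operator and its accuracy order):

nnz(Di_sparse)     # stored coefficients of the sparse representation
length(Di_full)    # 10^3 * 10^3 = 10^6 entries stored by the dense representation
Base.summarysize(Di_full) / Base.summarysize(Di_sparse)   # rough memory ratio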

doit(Di_full, "Di_full:", du, u)
Di_full:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  136.816 μs … 333.224 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     138.159 μs                GC (median):    0.00%
 Time  (mean ± σ):   140.088 μs ±   5.678 μs   GC (mean ± σ):  0.00% ± 0.00%

   ▅█▆▅▃▂▁▁▁               ▁▁▁▁▁▁▂▁▁                          ▂
  ███████████▇▇▆▆▆▅▅▅▄▆▆▆▆███████████████▇▇▆▅▄▆▅▄▄▅▄▆▅▅▆▆▄▅▆▆ █
  137 μs        Histogram: log(frequency) by time        161 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v0.17.18
  [9f78cca6] SummationByPartsOperators v0.5.61 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Structure-of-Arrays (SoA) and Array-of-Structures (AoS)

SummationByPartsOperators.jl tries to provide efficient support of both data layouts, e.g., arrays of StaticVectors from StaticArrays.jl (array of structures) and StructArrays.jl (structure of arrays).
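To illustrate what these two layouts mean, here is a tiny, self-contained sketch; Vec2 is just a hypothetical two-component example type, analogous to the Vec5 used below.

using StaticArrays, StructArrays

struct Vec2{T} <: FieldVector{2, T}
  x1::T
  x2::T
end

v_aos = [Vec2(1.0, 2.0), Vec2(3.0, 4.0)]  # array of structures: fields interleaved in memory
v_soa = StructArray(v_aos)                # structure of arrays: one contiguous array per field
v_soa.x1                                  # [1.0, 3.0]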

To demonstrate this, let us set up some benchmark code.

using BenchmarkTools
using StaticArrays, StructArrays
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

struct Vec5{T} <: FieldVector{5,T}
  x1::T
  x2::T
  x3::T
  x4::T
  x5::T
end

# Apply `mul!` to each component of a plain array of structures one after another
function mul_aos!(du, D, u, args...)
  for i in 1:size(du, 1)
    mul!(view(du, i, :), D, view(u, i, :), args...)
  end
end

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=4, xmin=xmin, xmax=xmax, N=101)
D_sparse = sparse(D_SBP)
D_full   = Matrix(D_SBP)
101×101 Matrix{Float64}:
 -141.176    173.529   -23.5294   …    0.0         0.0       0.0
  -50.0        0.0      50.0           0.0         0.0       0.0
    9.30233  -68.6047    0.0           0.0         0.0       0.0
    3.06122    0.0     -60.2041        0.0         0.0       0.0
    0.0        0.0       8.33333       0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    ⋮                             ⋱                          ⋮
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0          -8.33333     0.0       0.0
    0.0        0.0       0.0          60.2041      0.0      -3.06122
    0.0        0.0       0.0           0.0        68.6047   -9.30233
    0.0        0.0       0.0         -50.0         0.0      50.0
    0.0        0.0       0.0      …   23.5294   -173.529   141.176

First, we benchmark the application of the operator implemented in SummationByPartsOperators.jl and of its representations as sparse and dense matrices in the scalar case. As before, the sparse matrix representation is around an order of magnitude slower, and the dense matrix representation is slower still by a large margin.

println("Scalar case")
u = randn(T, size(D_SBP, 1)); du = similar(u)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_SBP, $u))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_sparse, $u))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_full, $u))
Scalar case
D_SBP
BenchmarkTools.Trial: 10000 samples with 984 evaluations.
 Range (min … max):  56.213 ns … 100.462 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     56.803 ns                GC (median):    0.00%
 Time  (mean ± σ):   57.155 ns ±   1.883 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▂██                                                       
  ▄███▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  56.2 ns         Histogram: frequency by time         65.8 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 231 evaluations.
 Range (min … max):  319.385 ns … 529.130 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     328.017 ns                GC (median):    0.00%
 Time  (mean ± σ):   330.202 ns ±  10.649 ns   GC (mean ± σ):  0.00% ± 0.00%

       ▁▄▇█▇                                                 
  ▂▂▂▃▄██████▇▅▅▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  319 ns           Histogram: frequency by time          372 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.705 μs … 4.072 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.722 μs               GC (median):    0.00%
 Time  (mean ± σ):   1.734 μs ± 98.862 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▇                                                       ▂
  █▅▅▆██▇▆▄▅▅▅▄▅▅▄▅▅▄▁▄▄▃▃▁▃▁▁▃▃▁▁▁▁▁▃▁▁▁▁▁▁▁▁▃▁▃▃▁▃▃▃▄▁▅ █
  1.71 μs      Histogram: log(frequency) by time     2.38 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we use a plain array of structures (AoS) in the form of a two-dimensional array together with our custom mul_aos! implementation, which loops over the components and calls mul! on views. Here, the differences between the timings are less pronounced, partly because every mul! call now operates on strided, non-contiguous views.

println("Plain Array of Structures")
u_aos_plain = randn(T, 5, size(D_SBP, 1)); du_aos_plain = similar(u_aos_plain)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_SBP, $u_aos_plain))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_sparse, $u_aos_plain))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_full, $u_aos_plain))
Plain Array of Structures
D_SBP
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.380 μs … 3.083 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.394 μs               GC (median):    0.00%
 Time  (mean ± σ):   1.399 μs ± 63.109 ns   GC (mean ± σ):  0.00% ± 0.00%

     ▂▆█                                                  
  ▂▂▃████▇▄▃▂▂▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▁▂▂▂▂▂▂ ▃
  1.38 μs        Histogram: frequency by time        1.51 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.380 μs … 4.892 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     2.467 μs                GC (median):    0.00%
 Time  (mean ± σ):   2.484 μs ± 119.036 ns   GC (mean ± σ):  0.00% ± 0.00%

       ▅█                                                   
  ▂▂▂▃▇██▅▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▁▂▂▂▂▂▁▂▁▁▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂ ▃
  2.38 μs         Histogram: frequency by time        3.12 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
 Range (min … max):  8.913 μs … 17.109 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     8.983 μs                GC (median):    0.00%
 Time  (mean ± σ):   9.031 μs ± 368.772 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▅█   ▁▁                                                   ▁
  ██▆▆▇██▇▆▅▆▅▄▅▅▆▆▅▅▆▅▃▅▅▃▁▁▃▁▁▁▃▁▁▁▅▁▁▁▁▁▄▁▁▁▃▁▁▁▁▃▁▁▁▁▁▆ █
  8.91 μs      Histogram: log(frequency) by time      11.3 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Now, we use an array of structures (AoS) based on reinterpret and standard mul!. This is much more efficient for the implementation in SummationByPartsOperators.jl. In Julia v1.6, this is also more efficient for sparse matrices but less efficient for dense matrices (compared to the plain AoS approach with mul_aos! above).
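As a quick illustration of what the reinterpreted array looks like (a throwaway inspection, not part of the benchmark): the 5×N matrix of Float64 values is viewed as a length-N vector of Vec5 without copying any data.

u_tmp = reinterpret(reshape, Vec5{T}, u_aos_plain)
size(u_aos_plain)              # (5, 101)
size(u_tmp)                    # (101,)
u_tmp[1] == u_aos_plain[:, 1]  # true: element i is just the i-th column, viewed as a Vec5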

println("Array of Structures (reinterpreted array)")
u_aos_r = reinterpret(reshape, Vec5{T}, u_aos_plain); du_aos_r = similar(u_aos_r)
@show D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r
mul!(du_aos_r, D_SBP, u_aos_r)
@show reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_SBP, $u_aos_r))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_sparse, $u_aos_r))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_full, $u_aos_r))
Array of Structures (reinterpreted array)
D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r = true
reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 496 evaluations.
 Range (min … max):  218.857 ns … 292.351 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     221.099 ns                GC (median):    0.00%
 Time  (mean ± σ):   222.508 ns ±   6.029 ns   GC (mean ± σ):  0.00% ± 0.00%

    ▇█▁▁▁▁                ▁▁                                 ▂
  ▄▇███████▇▇▆▆▆▅▅▅▅▅▅▄▅▄███▆▇▆▅▅▄▃▄▃▄▄▅▁▃▄▅▄▅▅▄▅▅▅▅▄▅▆▆▆▆▆▆ █
  219 ns        Histogram: log(frequency) by time        258 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 185 evaluations.
 Range (min … max):  556.989 ns … 784.168 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     571.611 ns                GC (median):    0.00%
 Time  (mean ± σ):   573.906 ns ±  10.884 ns   GC (mean ± σ):  0.00% ± 0.00%

             ▄▇█                                             
  ▁▁▁▁▁▁▁▂▃▅████▇▅▄▃▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  557 ns           Histogram: frequency by time          620 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 4 evaluations.
 Range (min … max):  7.416 μs … 19.008 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.484 μs                GC (median):    0.00%
 Time  (mean ± σ):   7.650 μs ± 846.643 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                        ▁
  █▇▆▇▆▆▄▅▄▃▃▃▃▁▅▃▁▄▄▃▆▆▅▅▅▅▄▄▃▄▄▇▄▄▅▆▃▄▁▄▅▅▄▅▅▅▆▅▅▆▆▆▆▆▇▆ █
  7.42 μs      Histogram: log(frequency) by time      12.5 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we still use an array of structures (AoS), but copy the data into a plain Array instead of using the reinterpreted versions. In this case, there is no significant difference from the previous version.

println("Array of Structures")
u_aos = Array(u_aos_r); du_aos = similar(u_aos)
@show D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos
mul!(du_aos, D_SBP, u_aos)
@show du_aos ≈ du_aos_r
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_SBP, $u_aos))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_sparse, $u_aos))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_full, $u_aos))
Array of Structures
D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos = true
du_aos ≈ du_aos_r = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 487 evaluations.
 Range (min … max):  218.725 ns … 361.189 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     224.177 ns                GC (median):    0.00%
 Time  (mean ± σ):   225.129 ns ±   5.889 ns   GC (mean ± σ):  0.00% ± 0.00%

     ▄▆▅▅▇█▄▃▂▂▁                 ▁▁                          ▂
  ▅▆██████████████▇▆▆▆▅▅▅▄▃▆▇▆▆▇████▇▇▇▅▆▅▄▅▃▄▄▃▅▅▁▅▆▆▇▆▇▆▅▅ █
  219 ns        Histogram: log(frequency) by time        256 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 181 evaluations.
 Range (min … max):  583.635 ns … 853.972 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     598.691 ns                GC (median):    0.00%
 Time  (mean ± σ):   601.217 ns ±  13.691 ns   GC (mean ± σ):  0.00% ± 0.00%

          ▂▅▇█                                               
  ▂▂▂▂▃▃▄▇█████▆▅▄▃▃▂▃▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  584 ns           Histogram: frequency by time          660 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 4 evaluations.
 Range (min … max):  7.452 μs … 18.540 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.504 μs                GC (median):    0.00%
 Time  (mean ± σ):   7.546 μs ± 344.233 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▅█    ▁                                                    ▁
  ██▆▆██▇▄▆▆▅▅▅▆▆▅▅▅▃▃▁▅▃▃▃▃▁▁▁▃▃▃▁▃▁▁▁▁▁▁▁▃▁▁▁▁▁▃▁▃▁▁▄▁▃▃▆ █
  7.45 μs      Histogram: log(frequency) by time      9.48 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's look at a structure of arrays (SoA). Interestingly, this is slower than the array of structures used above. On Julia v1.6, the sparse matrix representation performs particularly badly in this case; the large allocation counts reported below indicate a fallback to a generic, allocating code path.

println("Structure of Arrays")
u_soa = StructArray(u_aos); du_soa = similar(u_soa)
@show D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa
mul!(du_soa, D_SBP, u_soa)
@show du_soa ≈ du_aos
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_SBP, $u_soa))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_sparse, $u_soa))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_full, $u_soa))
Structure of Arrays
D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa = true
du_soa ≈ du_aos = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 417 evaluations.
 Range (min … max):  237.254 ns … 303.974 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     239.249 ns                GC (median):    0.00%
 Time  (mean ± σ):   240.052 ns ±   4.297 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▃▆▇█▇▆▄▃▂▁                                                 ▂
  ████████████▇▇▇█▇▇▅▆▅▃▄▁▁▃▃▁▁▁▃▁▁▁▁▁▃▃▁▁▁▃▁▃▃▃▁▅▆▆▇████▇█▇▇ █
  237 ns        Histogram: log(frequency) by time        260 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  214.151 μs … 11.808 ms   GC (min … max):  0.00% … 61.65%
 Time  (median):     223.568 μs                GC (median):     0.00%
 Time  (mean ± σ):   279.165 μs ± 488.971 μs   GC (mean ± σ):  11.00% ±  6.14%

   ▃██▄▃▃▃▂▁                                        ▁▄▅▄▂▁▁    ▂
  ▆███████████▇▇▆▇▇▆▆▅▅▅▂▄▄▂▄▃▃▃▂▄▄▄▃▄▃▄▃▄▄▃▃▄▃▂▂▄▄████████▆▄ █
  214 μs        Histogram: log(frequency) by time        370 μs <

 Memory estimate: 328.25 KiB, allocs estimate: 10504.
D_full
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  176.491 μs … 13.073 ms   GC (min … max):  0.00% … 52.70%
 Time  (median):     183.583 μs                GC (median):     0.00%
 Time  (mean ± σ):   236.160 μs ± 464.883 μs   GC (mean ± σ):  12.21% ±  6.16%

  ▃▆█▅▃▃▃▂▁ ▁                                        ▄▄▄▃▂▁▁   ▂
  █████████████▇▇▆▆▅▄▅▄▃▅▅▃▃▅▆▅▄▅▃▄▄▃▄▃▁▁▁▃▁▃▃▄▄▃▁▁██████████ █
  176 μs        Histogram: log(frequency) by time        322 μs <

 Memory estimate: 328.25 KiB, allocs estimate: 10504.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "StaticArrays", "StructArrays"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [90137ffa] StaticArrays v1.9.3
  [09ab397b] StructArrays v0.6.18
  [9f78cca6] SummationByPartsOperators v0.5.61 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`