STRATA

Small World, Small Efforts
STratified Reachability And Topology Algebra
Young-Min Kang (Tongmyong University) & Sung-Soo Kim (ETRI)

Overview

STRATA (Dynamic Sparse Topology-aware Optimal Reachability Matrix) is a framework for all-pairs shortest path (APSP) computation. It extends the AORM framework (IEEE Access, 2021) with sparse matrix redesign, Cython fused pruning, CUDA GPU acceleration with a custom guard+CAS kernel, and comprehensive benchmarking across 12 methods and 6 graph topologies.

Key Results (Facebook, n=4,039)

  • GPU-PerSrc-BFS (BG1): 0.017s — 122× over SciPy, 920× over NetworkX
  • DAWN-SOVM (BG2): 0.019s — frontier-driven BFS (DAWN, ICS 2024)
  • STRATA-CUDA (TG2): 0.031s — 67× over SciPy (guard+CAS kernel)
  • STRATA-SpMM-Cython (TC1): 1.013s — 2.0× over SciPy (CPU best)
  • All 12 methods produce identical distance matrices (verified element-wise)

Why STRATA?

Real-world networks are overwhelmingly small-world: Facebook's 721 million users are connected by an average distance of just 4.74 hops. STRATA provides the fastest matrix-algebraic APSP on CPU (2.0× over SciPy C BFS) and competitive GPU methods, while uniquely supporting k-hop constrained queries, dynamic edge insertion/deletion, simultaneous shortest-path count (σ) computation, and matrix-algebraic Betweenness Centrality (Matrix Brandes).

Algorithm

STRATA preserves AORM's iterative reachability framework while introducing five structural changes:

IDAORMSTRATAEffect
C1Dense matmul O(n³)Sparse SpMM O(nnz·c̄)Sparse graph acceleration
C2Fixed O(n²)/iternnz-proportional costProgressive sparsity
C3Pre-prune check O(n²)Post-prune nnz=0, O(1)Correct + fast
C4heaviside O(n²)data[:]=1 O(nnz)nnz-only work
C5Sparse F add O(n²)Dense bool F, O(1)/entryFootprint acceleration

The Cython C-extension (_strata_core.pyx) fuses pruning, convergence checking, and footprint update into a single C loop.

Experimental Results

Environment: Windows 11 Pro, Python 3.14.3, NVIDIA RTX 5080 (16GB), CUDA 12.9, CuPy 14.0.1, SuiteSparse:GraphBLAS 10.3.1, Cython 3.2.4. All timings are best of 3 runs. GPU methods include 1 warmup run.

Facebook Social Network (n=4,039) — GPU Methods

#MethodIDTime (s)vs SciPyCategory
1GPU-PerSrc-BFSBG10.017122×Baseline
2DAWN-SOVMBG20.019109×Baseline
3STRATA-CUDATG20.03167×STRATA
4STRATA-DAWNiBFSTG10.2677.8×STRATA

Facebook Social Network (n=4,039) — CPU Methods

#MethodIDTime (s)vs SciPyCategory
1STRATA-SpMM-CythonTC11.0132.0×STRATA
2STRATA-NumpyBLASTC21.6921.2×STRATA
3STRATA-GraphBLASTC31.8971.1×STRATA
4SciPy (C BFS)BC22.0721.0×Baseline
5M-AORMBC42.8540.7×Baseline
6GB-bfsBC54.1940.5×Baseline
7I-AORMBC38.1370.3×Baseline
8NetworkXBC115.9340.1×Baseline

Performance Tiers

Tier 1  BG1/BG2/TG2  (0.003–0.031s)  GPU per-source BFS / CUDA direct expand
Tier 2  TG1          (0.006–0.267s)  GPU DAWNiBFS bit-compressed STRATA
Tier 3  TC1/BC2      (0.16–2.07s)   CPU SpMM+Cython / C BFS
Tier 4  TC2/TC3      (0.31–3.64s)   CPU dense BLAS / GraphBLAS
Tier 5  BC1–BC5      (0.40–15.9s)   Python BFS / edge-wise / GraphBLAS BFS

Topology Comparison — 12 Methods × 6 Graphs

GPU Methods

MethodFacebookBA (d=5)WS (d=8)ER (d=6)Grid (d=88)PLC (d=5)
BG1 GPU-PerSrc-BFS0.0170.0030.0030.0060.0030.003
BG2 DAWN-SOVM0.0190.0030.0030.0060.0030.003
TG2 STRATA-CUDA0.0310.0060.0050.0310.0140.005
TG1 STRATA-DAWNiBFS0.2670.0170.0060.0090.0260.017

CPU Methods

MethodFacebookBA (d=5)WS (d=8)ER (d=6)Grid (d=88)PLC (d=5)
TC1 STRATA-SpMM-Cython1.0130.2870.3060.3130.1590.306
TC2 STRATA-NumpyBLAS1.6920.3060.5170.3593.6430.336
TC3 STRATA-GraphBLAS1.8970.4640.4830.4840.5890.479
BC2 SciPy2.0720.3140.3010.3310.1640.330
BC4 M-AORM2.8540.4220.6720.4835.6270.424
BC5 GB-bfs4.1941.1791.3551.1733.6361.177
BC3 I-AORM8.1370.4030.5460.4424.3100.411
BC1 NetworkX15.9341.2221.2591.4211.0341.245

TC1 (STRATA-SpMM-Cython) is CPU-fastest on all 6 graphs, including Grid (d=88) where it beats SciPy by 1.0×.

Grand Summary (6 Graphs Total)

#IDMethodTotal (s)Mean (s)
1BG1GPU-PerSrc-BFS0.0350.006
2BG2DAWN-SOVM0.0370.006
3TG2STRATA-CUDA0.0920.015
4TG1STRATA-DAWNiBFS0.3410.057
5TC1STRATA-SpMM-Cython2.3840.397
6BC2SciPy3.5120.585
7TC3STRATA-GraphBLAS4.3960.733
8TC2STRATA-NumpyBLAS6.8541.142
9BC4M-AORM10.4821.747
10BC5GB-bfs12.7142.119
11BC3I-AORM14.2492.375
12BC1NetworkX22.1153.686

Scaling Analysis

GPU 4 methods were tested on Barabási–Albert graphs from n=1,000 to n=20,000 to evaluate large-scale behavior.

GPU Scaling (BA Graphs)

nBG1BG2TG1TG2
1,0000.0010.0010.0060.002
2,0000.0030.0030.0160.005
4,0000.0110.0100.0520.019
6,0000.0230.0250.1000.062
8,0000.0430.0450.1630.123
10,0000.0670.0720.2380.211
15,0000.1560.1670.4870.454
20,0000.2630.2810.80512.784

Key Findings

  • TG2 Memory Cliff: At n=20,000, TG2 explodes to 12.8s. Its dense buffers require 16n² = 6.4GB, exceeding GPU VRAM capacity and causing massive swap overhead.
  • TG1 Stable Scaling: Bit-compression reduces memory to 4.25n² (75% reduction). Runs at 0.805s at n=20,000 — stable and predictable.
  • BG1/BG2 Dominance: 4n² memory, single kernel launch per source — consistently fastest across all scales.
  • TG1 vs BG1: 2.7× gap at n=20K reflects STRATA's structural overhead (per-hop sync, atomicOr, batch management).

STRATA GPU Optimization Path (BA n=20,000)

StepTime (s)Improvement
TG2 STRATA-CUDA (memory cliff)12.7841.0×
TG1 STRATA-DAWNiBFS (bit-compress)0.80515.9×
BG1 GPU-PerSrc-BFS (remove framework)0.26348.6×

At Facebook scale (n=4,039), TG2 (0.031s) beats TG1 (0.267s). But at n=20,000, TG2 hits the memory cliff (12.8s) while TG1 remains stable (0.81s). This confirms that STRATA's dense buffers impose a scaling ceiling that bit-compression or framework removal can overcome.

Sigma Discovery & Matrix Brandes

Discovery: Matrix Multiply Encodes σ(s,t)

In STRATA's matrix product A · C(k-1), the integer values before Heaviside binarization exactly encode the number of shortest paths σ(s,t). This follows from the Brandes (2001) recurrence:

σ(i,j) = ∑ { σ(u,j) : u ∈ N(i) ∧ d(u,j) = d(i,j) - 1 }

STRATA's frontier pruning (F mask) ensures that only newly discovered pairs retain values, which is exactly "count shortest paths only." By simply removing the Heaviside step, STRATA simultaneously computes both D (distances) and σ (shortest-path counts) in a single forward pass.

σ(s,t) Computation: STRATA vs BFS

GraphndSTRATABFS (Brandes)Speedup
Facebook4,03984.53s155.2s34.2×
BA-20002,00050.49s10.0s20.3×
WS-20002,00080.73s6.2s8.5×
ER-20002,00060.58s10.0s17.2×
PLC-20002,00050.49s10.0s20.3×

STRATA σ and BFS σ are exactly identical on all graphs (np.allclose verified). The 8.5–34× speedup comes from replacing n Python loops with d BLAS calls — same O(nm) complexity, better work organization.

Matrix Brandes: Betweenness Centrality

The σ discovery enables Matrix Brandes — expressing both the forward (σ) and backward (δ) phases of Brandes' algorithm as matrix multiplications. This reorganizes n per-source BFS+backpropagation into d hop-level BLAS calls.

Implementations

IDMethodKernel
BB1Brandes-PythonPure Python per-source BFS
BB2Brandes-C (igraph)igraph C single-thread
TB1Matrix-Brandes (STRATA)SpMM + Numba JIT fused prune
TB2Matrix-Brandes-C (STRATA)C + OpenMP fused SpMM

Betweenness Centrality Performance

MethodFacebook (4,039)BA-2000WS-2000BA-500
TB2 Matrix-Brandes-C0.24s0.058s0.062s0.004s
TB1 Matrix-Brandes (Numba)0.42s0.088s0.093s0.005s
BB2 Brandes-C (igraph)0.91s0.188s0.146s0.011s
BB1 Brandes-Python179.1s16.4s11.5s0.958s

TB2 vs igraph C: 2.4–3.6× faster across all graphs. The speedup comes from (1) hop-level fused pruning that skips already-visited pairs, and (2) OpenMP parallelization that hop-level organization naturally enables. Both approaches share O(nm) asymptotic complexity.

Dynamic Edge Operations

STRATA's distance matrix D enables incremental updates for both edge insertion and deletion, without full APSP recomputation. These operations work with any APSP method that produces a complete distance matrix.

Edge Insertion: O(n²) Closed-Form Update

When undirected edge {a,b} is added:

D'[i,j] = min(D[i,j], D[i,a] + 1 + D[b,j], D[i,b] + 1 + D[a,j])

Only 0.008% of pairs are affected per insertion. NumPy vectorization (CPU) runs in ~0.155s, CuPy (GPU) in ~0.035s (4.4× speedup), yielding 22–33× speedup over full recomputation.

Edge Deletion: Level-Based Parent Replacement

Edge deletion is harder — distances can only increase. The proposed algorithm processes levels bottom-up:

  1. Phase 1 (Identify): For each affected pair (s,t) at level k, check if an alternative parent exists among neighbors at level k-1.
  2. Phase 2 (Propagate): Orphaned nodes cascade upward — their children may also lose shortest paths.
  3. Early Termination: In small-world graphs, cascade dies out quickly due to abundant alternative paths.

Deletion Performance (Facebook n=4,039)

CaseAffected SourcesCPU-BFSCPU-DLevelGPU-BFSGPU-DLevel
Low impact21.10s0.009s0.034s0.034s
High impact2,4071.10s0.091s0.034s0.031s

CPU DLevel Optimization History

StageChangeTimeSpeedup
BaselineFull D scan per level1.170s1.0×
+ Affected trackingScan only affected sources0.158s7.4×
+ CythonInner loop in C0.043s27.2×
+ Forward propagationPropagate from orphans only0.009s~130×

Fully Dynamic Framework

Combining insertion and deletion enables a fully dynamic APSP update framework:

  • Insertion: O(n²) closed-form, vectorizable
  • Deletion: Level-based parent replacement with early termination
  • Method-agnostic: Works with any APSP method that produces D and A

Related work: Even–Shiloach (1981) handles single-source decremental in O(m·n). Ramalingam–Reps (1996) updates affected vertices only. Demetrescu–Italiano (2004) achieves amortized O(n² log³ n) fully dynamic APSP. STRATA's framework is orthogonal to these — it operates on the distance matrix level, independent of the APSP method used.

STRATA Contributions

1. CPU: Fastest Matrix-Algebraic APSP

STRATA-SpMM-Cython (TC1) is the fastest CPU method across all 6 graph topologies, outperforming SciPy's native C BFS by up to 2.0×. The key is Cython fused pruning — collapsing three sparse operations (booleanize → prune → footprint update) into a single C pass over COO entries.

GraphTC1 (s)SciPy (s)Speedup
Facebook (n=4,039)1.0132.0722.0×
BA-20000.2870.3141.1×
WS-20000.3060.3011.0×
Grid-45×45 (d=88)0.1590.1641.0×

2. GPU: SpMM Elimination via Direct CSR Expansion

STRATA-CUDA (TG2) replaces cuSPARSE SpMM with a custom CUDA kernel that directly traverses CSR neighbors, eliminating three SpMM bottlenecks:

  1. Redundant products — SpMM computes all matrix products including already-visited pairs
  2. Intermediate matrices — SpMM output requires COO/CSR conversion overhead
  3. Separate prune step — footprint check is fused into expansion (1-pass)

The footprint check uses a guard+CAS pattern: non-atomic read guard skips already-visited cells, then atomicCAS ensures race-free 0→1 transitions.

if (F[idx] == 0) {                      // non-atomic guard
    if (atomicCAS(&F[idx], 0, 1) == 0) { // atomic CAS
        int pos = atomicAdd(out_count, 1);
        out_row[pos] = i;
        out_col[pos] = k;
    }
}

3. Structural Insight: Optimization Converges to Per-Source BFS

Progressive removal of STRATA's matrix-algebraic overhead converges to per-source BFS. At small scale (n=4K), TG2 is competitive. At large scale (n=20K), the framework overhead dominates, and BG1 wins by 48.6×. This reveals that for full APSP, matrix-algebraic indirection costs (shared footprint → atomics, per-hop kernel launch) accumulate at scale.

4. STRATA's Unique Capabilities

While BG1 is fastest for full APSP, STRATA provides capabilities that per-source BFS cannot:

  • k-hop constrained APSP — exact hop shells at each distance level
  • σ(s,t) simultaneous computation — shortest-path counts from the same matrix multiply, 8.5–34× over sequential BFS
  • Matrix Brandes — betweenness centrality via hop-level BLAS, TB2 is 2.4–3.6× faster than igraph C
  • Dynamic edge insertion — O(n²) incremental update, 22–33× faster than full recomputation
  • Dynamic edge deletion — level-based parent replacement with early termination, ~130× optimized
  • Algebraic analysis — matrix-based framework for theoretical convergence proofs

Discussion

Algorithm Selection Guide

ScenarioBest (GPU)Best (CPU only)
Small-world APSP (n < 4K)TG2 STRATA-CUDATC1 STRATA-SpMM-Cython
Large-scale APSP (n > 4K)BG1 GPU-PerSrc-BFSBC2 SciPy
High diameter APSP (d > 20)BG1 GPU-PerSrc-BFSTC1 / BC2
k-hop queryTG2TC1 STRATA
Betweenness CentralityTB2 Matrix-Brandes-C (2.4–3.6× over igraph)
Dynamic edge insertionSTRATA (22–33× vs recompute)
Dynamic edge deletionDLevel (130× optimized, early termination)

Honest Assessment

STRATA's APSP performance on GPU cannot beat per-source BFS at scale. The matrix-algebraic framework introduces structural overhead (per-hop sync, atomics, batch management) that becomes dominant as n grows. TG2's memory cliff at n=20K is a hard scaling limit.

However, STRATA's real value lies in its algebraic generality: the same matrix multiply that computes distances also yields σ(s,t), enabling Matrix Brandes (TB2) to outperform igraph C by 2.4–3.6×. The framework also naturally supports dynamic updates and k-hop queries — capabilities absent from per-source BFS.