STRATA (Dynamic Sparse Topology-aware Optimal Reachability Matrix) is a framework for all-pairs shortest path (APSP) computation. It extends the AORM framework (IEEE Access, 2021) with sparse matrix redesign, Cython fused pruning, CUDA GPU acceleration with a custom guard+CAS kernel, and comprehensive benchmarking across 12 methods and 6 graph topologies.
Real-world networks are overwhelmingly small-world: Facebook's 721 million users are connected by an average distance of just 4.74 hops. STRATA provides the fastest matrix-algebraic APSP on CPU (2.0× over SciPy C BFS) and competitive GPU methods, while uniquely supporting k-hop constrained queries, dynamic edge insertion/deletion, simultaneous shortest-path count (σ) computation, and matrix-algebraic Betweenness Centrality (Matrix Brandes).
STRATA preserves AORM's iterative reachability framework while introducing five structural changes:
| ID | AORM | STRATA | Effect |
|---|---|---|---|
| C1 | Dense matmul O(n³) | Sparse SpMM O(nnz·c̄) | Sparse graph acceleration |
| C2 | Fixed O(n²)/iter | nnz-proportional cost | Progressive sparsity |
| C3 | Pre-prune check O(n²) | Post-prune nnz=0, O(1) | Correct + fast |
| C4 | heaviside O(n²) | data[:]=1 O(nnz) | nnz-only work |
| C5 | Sparse F add O(n²) | Dense bool F, O(1)/entry | Footprint acceleration |
The Cython C-extension (_strata_core.pyx) fuses pruning, convergence checking, and footprint update into a single C loop.
Environment: Windows 11 Pro, Python 3.14.3, NVIDIA RTX 5080 (16GB), CUDA 12.9, CuPy 14.0.1, SuiteSparse:GraphBLAS 10.3.1, Cython 3.2.4. All timings are best of 3 runs. GPU methods include 1 warmup run.
| # | Method | ID | Time (s) | vs SciPy | Category |
|---|---|---|---|---|---|
| 1 | GPU-PerSrc-BFS | BG1 | 0.017 | 122× | Baseline |
| 2 | DAWN-SOVM | BG2 | 0.019 | 109× | Baseline |
| 3 | STRATA-CUDA | TG2 | 0.031 | 67× | STRATA |
| 4 | STRATA-DAWNiBFS | TG1 | 0.267 | 7.8× | STRATA |
| # | Method | ID | Time (s) | vs SciPy | Category |
|---|---|---|---|---|---|
| 1 | STRATA-SpMM-Cython | TC1 | 1.013 | 2.0× | STRATA |
| 2 | STRATA-NumpyBLAS | TC2 | 1.692 | 1.2× | STRATA |
| 3 | STRATA-GraphBLAS | TC3 | 1.897 | 1.1× | STRATA |
| 4 | SciPy (C BFS) | BC2 | 2.072 | 1.0× | Baseline |
| 5 | M-AORM | BC4 | 2.854 | 0.7× | Baseline |
| 6 | GB-bfs | BC5 | 4.194 | 0.5× | Baseline |
| 7 | I-AORM | BC3 | 8.137 | 0.3× | Baseline |
| 8 | NetworkX | BC1 | 15.934 | 0.1× | Baseline |
Tier 1 BG1/BG2/TG2 (0.003–0.031s) GPU per-source BFS / CUDA direct expand Tier 2 TG1 (0.006–0.267s) GPU DAWNiBFS bit-compressed STRATA Tier 3 TC1/BC2 (0.16–2.07s) CPU SpMM+Cython / C BFS Tier 4 TC2/TC3 (0.31–3.64s) CPU dense BLAS / GraphBLAS Tier 5 BC1–BC5 (0.40–15.9s) Python BFS / edge-wise / GraphBLAS BFS
| Method | BA (d=5) | WS (d=8) | ER (d=6) | Grid (d=88) | PLC (d=5) | |
|---|---|---|---|---|---|---|
| BG1 GPU-PerSrc-BFS | 0.017 | 0.003 | 0.003 | 0.006 | 0.003 | 0.003 |
| BG2 DAWN-SOVM | 0.019 | 0.003 | 0.003 | 0.006 | 0.003 | 0.003 |
| TG2 STRATA-CUDA | 0.031 | 0.006 | 0.005 | 0.031 | 0.014 | 0.005 |
| TG1 STRATA-DAWNiBFS | 0.267 | 0.017 | 0.006 | 0.009 | 0.026 | 0.017 |
| Method | BA (d=5) | WS (d=8) | ER (d=6) | Grid (d=88) | PLC (d=5) | |
|---|---|---|---|---|---|---|
| TC1 STRATA-SpMM-Cython | 1.013 | 0.287 | 0.306 | 0.313 | 0.159 | 0.306 |
| TC2 STRATA-NumpyBLAS | 1.692 | 0.306 | 0.517 | 0.359 | 3.643 | 0.336 |
| TC3 STRATA-GraphBLAS | 1.897 | 0.464 | 0.483 | 0.484 | 0.589 | 0.479 |
| BC2 SciPy | 2.072 | 0.314 | 0.301 | 0.331 | 0.164 | 0.330 |
| BC4 M-AORM | 2.854 | 0.422 | 0.672 | 0.483 | 5.627 | 0.424 |
| BC5 GB-bfs | 4.194 | 1.179 | 1.355 | 1.173 | 3.636 | 1.177 |
| BC3 I-AORM | 8.137 | 0.403 | 0.546 | 0.442 | 4.310 | 0.411 |
| BC1 NetworkX | 15.934 | 1.222 | 1.259 | 1.421 | 1.034 | 1.245 |
TC1 (STRATA-SpMM-Cython) is CPU-fastest on all 6 graphs, including Grid (d=88) where it beats SciPy by 1.0×.
| # | ID | Method | Total (s) | Mean (s) |
|---|---|---|---|---|
| 1 | BG1 | GPU-PerSrc-BFS | 0.035 | 0.006 |
| 2 | BG2 | DAWN-SOVM | 0.037 | 0.006 |
| 3 | TG2 | STRATA-CUDA | 0.092 | 0.015 |
| 4 | TG1 | STRATA-DAWNiBFS | 0.341 | 0.057 |
| 5 | TC1 | STRATA-SpMM-Cython | 2.384 | 0.397 |
| 6 | BC2 | SciPy | 3.512 | 0.585 |
| 7 | TC3 | STRATA-GraphBLAS | 4.396 | 0.733 |
| 8 | TC2 | STRATA-NumpyBLAS | 6.854 | 1.142 |
| 9 | BC4 | M-AORM | 10.482 | 1.747 |
| 10 | BC5 | GB-bfs | 12.714 | 2.119 |
| 11 | BC3 | I-AORM | 14.249 | 2.375 |
| 12 | BC1 | NetworkX | 22.115 | 3.686 |
GPU 4 methods were tested on Barabási–Albert graphs from n=1,000 to n=20,000 to evaluate large-scale behavior.
| n | BG1 | BG2 | TG1 | TG2 |
|---|---|---|---|---|
| 1,000 | 0.001 | 0.001 | 0.006 | 0.002 |
| 2,000 | 0.003 | 0.003 | 0.016 | 0.005 |
| 4,000 | 0.011 | 0.010 | 0.052 | 0.019 |
| 6,000 | 0.023 | 0.025 | 0.100 | 0.062 |
| 8,000 | 0.043 | 0.045 | 0.163 | 0.123 |
| 10,000 | 0.067 | 0.072 | 0.238 | 0.211 |
| 15,000 | 0.156 | 0.167 | 0.487 | 0.454 |
| 20,000 | 0.263 | 0.281 | 0.805 | 12.784 |
| Step | Time (s) | Improvement |
|---|---|---|
| TG2 STRATA-CUDA (memory cliff) | 12.784 | 1.0× |
| TG1 STRATA-DAWNiBFS (bit-compress) | 0.805 | 15.9× |
| BG1 GPU-PerSrc-BFS (remove framework) | 0.263 | 48.6× |
At Facebook scale (n=4,039), TG2 (0.031s) beats TG1 (0.267s). But at n=20,000, TG2 hits the memory cliff (12.8s) while TG1 remains stable (0.81s). This confirms that STRATA's dense buffers impose a scaling ceiling that bit-compression or framework removal can overcome.
In STRATA's matrix product A · C(k-1), the integer values before Heaviside binarization exactly encode the number of shortest paths σ(s,t). This follows from the Brandes (2001) recurrence:
σ(i,j) = ∑ { σ(u,j) : u ∈ N(i) ∧ d(u,j) = d(i,j) - 1 }
STRATA's frontier pruning (F mask) ensures that only newly discovered pairs retain values, which is exactly "count shortest paths only." By simply removing the Heaviside step, STRATA simultaneously computes both D (distances) and σ (shortest-path counts) in a single forward pass.
| Graph | n | d | STRATA | BFS (Brandes) | Speedup |
|---|---|---|---|---|---|
| 4,039 | 8 | 4.53s | 155.2s | 34.2× | |
| BA-2000 | 2,000 | 5 | 0.49s | 10.0s | 20.3× |
| WS-2000 | 2,000 | 8 | 0.73s | 6.2s | 8.5× |
| ER-2000 | 2,000 | 6 | 0.58s | 10.0s | 17.2× |
| PLC-2000 | 2,000 | 5 | 0.49s | 10.0s | 20.3× |
STRATA σ and BFS σ are exactly identical on all graphs (np.allclose verified). The 8.5–34× speedup comes from replacing n Python loops with d BLAS calls — same O(nm) complexity, better work organization.
The σ discovery enables Matrix Brandes — expressing both the forward (σ) and backward (δ) phases of Brandes' algorithm as matrix multiplications. This reorganizes n per-source BFS+backpropagation into d hop-level BLAS calls.
| ID | Method | Kernel |
|---|---|---|
| BB1 | Brandes-Python | Pure Python per-source BFS |
| BB2 | Brandes-C (igraph) | igraph C single-thread |
| TB1 | Matrix-Brandes (STRATA) | SpMM + Numba JIT fused prune |
| TB2 | Matrix-Brandes-C (STRATA) | C + OpenMP fused SpMM |
| Method | Facebook (4,039) | BA-2000 | WS-2000 | BA-500 |
|---|---|---|---|---|
| TB2 Matrix-Brandes-C | 0.24s | 0.058s | 0.062s | 0.004s |
| TB1 Matrix-Brandes (Numba) | 0.42s | 0.088s | 0.093s | 0.005s |
| BB2 Brandes-C (igraph) | 0.91s | 0.188s | 0.146s | 0.011s |
| BB1 Brandes-Python | 179.1s | 16.4s | 11.5s | 0.958s |
TB2 vs igraph C: 2.4–3.6× faster across all graphs. The speedup comes from (1) hop-level fused pruning that skips already-visited pairs, and (2) OpenMP parallelization that hop-level organization naturally enables. Both approaches share O(nm) asymptotic complexity.
STRATA's distance matrix D enables incremental updates for both edge insertion and deletion, without full APSP recomputation. These operations work with any APSP method that produces a complete distance matrix.
When undirected edge {a,b} is added:
D'[i,j] = min(D[i,j], D[i,a] + 1 + D[b,j], D[i,b] + 1 + D[a,j])
Only 0.008% of pairs are affected per insertion. NumPy vectorization (CPU) runs in ~0.155s, CuPy (GPU) in ~0.035s (4.4× speedup), yielding 22–33× speedup over full recomputation.
Edge deletion is harder — distances can only increase. The proposed algorithm processes levels bottom-up:
| Case | Affected Sources | CPU-BFS | CPU-DLevel | GPU-BFS | GPU-DLevel |
|---|---|---|---|---|---|
| Low impact | 2 | 1.10s | 0.009s | 0.034s | 0.034s |
| High impact | 2,407 | 1.10s | 0.091s | 0.034s | 0.031s |
| Stage | Change | Time | Speedup |
|---|---|---|---|
| Baseline | Full D scan per level | 1.170s | 1.0× |
| + Affected tracking | Scan only affected sources | 0.158s | 7.4× |
| + Cython | Inner loop in C | 0.043s | 27.2× |
| + Forward propagation | Propagate from orphans only | 0.009s | ~130× |
Combining insertion and deletion enables a fully dynamic APSP update framework:
Related work: Even–Shiloach (1981) handles single-source decremental in O(m·n). Ramalingam–Reps (1996) updates affected vertices only. Demetrescu–Italiano (2004) achieves amortized O(n² log³ n) fully dynamic APSP. STRATA's framework is orthogonal to these — it operates on the distance matrix level, independent of the APSP method used.
STRATA-SpMM-Cython (TC1) is the fastest CPU method across all 6 graph topologies, outperforming SciPy's native C BFS by up to 2.0×. The key is Cython fused pruning — collapsing three sparse operations (booleanize → prune → footprint update) into a single C pass over COO entries.
| Graph | TC1 (s) | SciPy (s) | Speedup |
|---|---|---|---|
| Facebook (n=4,039) | 1.013 | 2.072 | 2.0× |
| BA-2000 | 0.287 | 0.314 | 1.1× |
| WS-2000 | 0.306 | 0.301 | 1.0× |
| Grid-45×45 (d=88) | 0.159 | 0.164 | 1.0× |
STRATA-CUDA (TG2) replaces cuSPARSE SpMM with a custom CUDA kernel that directly traverses CSR neighbors, eliminating three SpMM bottlenecks:
The footprint check uses a guard+CAS pattern: non-atomic read guard skips already-visited cells, then atomicCAS ensures race-free 0→1 transitions.
if (F[idx] == 0) { // non-atomic guard
if (atomicCAS(&F[idx], 0, 1) == 0) { // atomic CAS
int pos = atomicAdd(out_count, 1);
out_row[pos] = i;
out_col[pos] = k;
}
}
Progressive removal of STRATA's matrix-algebraic overhead converges to per-source BFS. At small scale (n=4K), TG2 is competitive. At large scale (n=20K), the framework overhead dominates, and BG1 wins by 48.6×. This reveals that for full APSP, matrix-algebraic indirection costs (shared footprint → atomics, per-hop kernel launch) accumulate at scale.
While BG1 is fastest for full APSP, STRATA provides capabilities that per-source BFS cannot:
| Scenario | Best (GPU) | Best (CPU only) |
|---|---|---|
| Small-world APSP (n < 4K) | TG2 STRATA-CUDA | TC1 STRATA-SpMM-Cython |
| Large-scale APSP (n > 4K) | BG1 GPU-PerSrc-BFS | BC2 SciPy |
| High diameter APSP (d > 20) | BG1 GPU-PerSrc-BFS | TC1 / BC2 |
| k-hop query | TG2 | TC1 STRATA |
| Betweenness Centrality | TB2 Matrix-Brandes-C (2.4–3.6× over igraph) | |
| Dynamic edge insertion | STRATA (22–33× vs recompute) | |
| Dynamic edge deletion | DLevel (130× optimized, early termination) | |
STRATA's APSP performance on GPU cannot beat per-source BFS at scale. The matrix-algebraic framework introduces structural overhead (per-hop sync, atomics, batch management) that becomes dominant as n grows. TG2's memory cliff at n=20K is a hard scaling limit.
However, STRATA's real value lies in its algebraic generality: the same matrix multiply that computes distances also yields σ(s,t), enabling Matrix Brandes (TB2) to outperform igraph C by 2.4–3.6×. The framework also naturally supports dynamic updates and k-hop queries — capabilities absent from per-source BFS.