STRATA — STratified Reachability And Topology Algebra

Overview

STRATA (Dynamic Sparse Topology-aware Optimal Reachability Matrix) is a framework for all-pairs shortest path (APSP) computation. It extends the AORM framework (IEEE Access, 2021) with sparse matrix redesign, Cython fused pruning, CUDA GPU acceleration with a custom guard+CAS kernel, and comprehensive benchmarking across 12 methods and 6 graph topologies.

Key Results (Facebook, n=4,039)

GPU-PerSrc-BFS (BG1): 0.017s — 122× over SciPy, 920× over NetworkX
DAWN-SOVM (BG2): 0.019s — frontier-driven BFS (DAWN, ICS 2024)
STRATA-CUDA (TG2): 0.031s — 67× over SciPy (guard+CAS kernel)
STRATA-SpMM-Cython (TC1): 1.013s — 2.0× over SciPy (CPU best)
All 12 methods produce identical distance matrices (verified element-wise)

Why STRATA?

Real-world networks are overwhelmingly small-world: Facebook's 721 million users are connected by an average distance of just 4.74 hops. STRATA provides the fastest matrix-algebraic APSP on CPU (2.0× over SciPy C BFS) and competitive GPU methods, while uniquely supporting k-hop constrained queries, dynamic edge insertion/deletion, simultaneous shortest-path count (σ) computation, and matrix-algebraic Betweenness Centrality (Matrix Brandes).

Algorithm

STRATA preserves AORM's iterative reachability framework while introducing five structural changes:

ID	AORM	STRATA	Effect
C1	Dense matmul O(n³)	Sparse SpMM O(nnz·c̄)	Sparse graph acceleration
C2	Fixed O(n²)/iter	nnz-proportional cost	Progressive sparsity
C3	Pre-prune check O(n²)	Post-prune nnz=0, O(1)	Correct + fast
C4	heaviside O(n²)	data[:]=1 O(nnz)	nnz-only work
C5	Sparse F add O(n²)	Dense bool F, O(1)/entry	Footprint acceleration

The Cython C-extension (_strata_core.pyx) fuses pruning, convergence checking, and footprint update into a single C loop.

Experimental Results

Environment: Windows 11 Pro, Python 3.14.3, NVIDIA RTX 5080 (16GB), CUDA 12.9, CuPy 14.0.1, SuiteSparse:GraphBLAS 10.3.1, Cython 3.2.4. All timings are best of 3 runs. GPU methods include 1 warmup run.

Facebook Social Network (n=4,039) — GPU Methods

#	Method	ID	Time (s)	vs SciPy	Category
1	GPU-PerSrc-BFS	BG1	0.017	122×	Baseline
2	DAWN-SOVM	BG2	0.019	109×	Baseline
3	STRATA-CUDA	TG2	0.031	67×	STRATA
4	STRATA-DAWNiBFS	TG1	0.267	7.8×	STRATA

Facebook Social Network (n=4,039) — CPU Methods

#	Method	ID	Time (s)	vs SciPy	Category
1	STRATA-SpMM-Cython	TC1	1.013	2.0×	STRATA
2	STRATA-NumpyBLAS	TC2	1.692	1.2×	STRATA
3	STRATA-GraphBLAS	TC3	1.897	1.1×	STRATA
4	SciPy (C BFS)	BC2	2.072	1.0×	Baseline
5	M-AORM	BC4	2.854	0.7×	Baseline
6	GB-bfs	BC5	4.194	0.5×	Baseline
7	I-AORM	BC3	8.137	0.3×	Baseline
8	NetworkX	BC1	15.934	0.1×	Baseline

Performance Tiers

Tier 1  BG1/BG2/TG2  (0.003–0.031s)  GPU per-source BFS / CUDA direct expand
Tier 2  TG1          (0.006–0.267s)  GPU DAWNiBFS bit-compressed STRATA
Tier 3  TC1/BC2      (0.16–2.07s)   CPU SpMM+Cython / C BFS
Tier 4  TC2/TC3      (0.31–3.64s)   CPU dense BLAS / GraphBLAS
Tier 5  BC1–BC5      (0.40–15.9s)   Python BFS / edge-wise / GraphBLAS BFS

Topology Comparison — 12 Methods × 6 Graphs

GPU Methods

Method	Facebook	BA (d=5)	WS (d=8)	ER (d=6)	Grid (d=88)	PLC (d=5)
BG1 GPU-PerSrc-BFS	0.017	0.003	0.003	0.006	0.003	0.003
BG2 DAWN-SOVM	0.019	0.003	0.003	0.006	0.003	0.003
TG2 STRATA-CUDA	0.031	0.006	0.005	0.031	0.014	0.005
TG1 STRATA-DAWNiBFS	0.267	0.017	0.006	0.009	0.026	0.017

CPU Methods

Method	Facebook	BA (d=5)	WS (d=8)	ER (d=6)	Grid (d=88)	PLC (d=5)
TC1 STRATA-SpMM-Cython	1.013	0.287	0.306	0.313	0.159	0.306
TC2 STRATA-NumpyBLAS	1.692	0.306	0.517	0.359	3.643	0.336
TC3 STRATA-GraphBLAS	1.897	0.464	0.483	0.484	0.589	0.479
BC2 SciPy	2.072	0.314	0.301	0.331	0.164	0.330
BC4 M-AORM	2.854	0.422	0.672	0.483	5.627	0.424
BC5 GB-bfs	4.194	1.179	1.355	1.173	3.636	1.177
BC3 I-AORM	8.137	0.403	0.546	0.442	4.310	0.411
BC1 NetworkX	15.934	1.222	1.259	1.421	1.034	1.245

TC1 (STRATA-SpMM-Cython) is CPU-fastest on all 6 graphs, including Grid (d=88) where it beats SciPy by 1.0×.

Grand Summary (6 Graphs Total)

#	ID	Method	Total (s)	Mean (s)
1	BG1	GPU-PerSrc-BFS	0.035	0.006
2	BG2	DAWN-SOVM	0.037	0.006
3	TG2	STRATA-CUDA	0.092	0.015
4	TG1	STRATA-DAWNiBFS	0.341	0.057
5	TC1	STRATA-SpMM-Cython	2.384	0.397
6	BC2	SciPy	3.512	0.585
7	TC3	STRATA-GraphBLAS	4.396	0.733
8	TC2	STRATA-NumpyBLAS	6.854	1.142
9	BC4	M-AORM	10.482	1.747
10	BC5	GB-bfs	12.714	2.119
11	BC3	I-AORM	14.249	2.375
12	BC1	NetworkX	22.115	3.686

Scaling Analysis

GPU 4 methods were tested on Barabási–Albert graphs from n=1,000 to n=20,000 to evaluate large-scale behavior.

GPU Scaling (BA Graphs)

n	BG1	BG2	TG1	TG2
1,000	0.001	0.001	0.006	0.002
2,000	0.003	0.003	0.016	0.005
4,000	0.011	0.010	0.052	0.019
6,000	0.023	0.025	0.100	0.062
8,000	0.043	0.045	0.163	0.123
10,000	0.067	0.072	0.238	0.211
15,000	0.156	0.167	0.487	0.454
20,000	0.263	0.281	0.805	12.784

Key Findings

TG2 Memory Cliff: At n=20,000, TG2 explodes to 12.8s. Its dense buffers require 16n² = 6.4GB, exceeding GPU VRAM capacity and causing massive swap overhead.
TG1 Stable Scaling: Bit-compression reduces memory to 4.25n² (75% reduction). Runs at 0.805s at n=20,000 — stable and predictable.
BG1/BG2 Dominance: 4n² memory, single kernel launch per source — consistently fastest across all scales.
TG1 vs BG1: 2.7× gap at n=20K reflects STRATA's structural overhead (per-hop sync, atomicOr, batch management).

STRATA GPU Optimization Path (BA n=20,000)

Step	Time (s)	Improvement
TG2 STRATA-CUDA (memory cliff)	12.784	1.0×
TG1 STRATA-DAWNiBFS (bit-compress)	0.805	15.9×
BG1 GPU-PerSrc-BFS (remove framework)	0.263	48.6×

At Facebook scale (n=4,039), TG2 (0.031s) beats TG1 (0.267s). But at n=20,000, TG2 hits the memory cliff (12.8s) while TG1 remains stable (0.81s). This confirms that STRATA's dense buffers impose a scaling ceiling that bit-compression or framework removal can overcome.

Sigma Discovery & Matrix Brandes

Discovery: Matrix Multiply Encodes σ(s,t)

In STRATA's matrix product A · C^(k-1), the integer values before Heaviside binarization exactly encode the number of shortest paths σ(s,t). This follows from the Brandes (2001) recurrence:

σ(i,j) = ∑ { σ(u,j) : u ∈ N(i) ∧ d(u,j) = d(i,j) - 1 }

STRATA's frontier pruning (F mask) ensures that only newly discovered pairs retain values, which is exactly "count shortest paths only." By simply removing the Heaviside step, STRATA simultaneously computes both D (distances) and σ (shortest-path counts) in a single forward pass.

σ(s,t) Computation: STRATA vs BFS

Graph	n	d	STRATA	BFS (Brandes)	Speedup
Facebook	4,039	8	4.53s	155.2s	34.2×
BA-2000	2,000	5	0.49s	10.0s	20.3×
WS-2000	2,000	8	0.73s	6.2s	8.5×
ER-2000	2,000	6	0.58s	10.0s	17.2×
PLC-2000	2,000	5	0.49s	10.0s	20.3×

STRATA σ and BFS σ are exactly identical on all graphs (np.allclose verified). The 8.5–34× speedup comes from replacing n Python loops with d BLAS calls — same O(nm) complexity, better work organization.

Matrix Brandes: Betweenness Centrality

The σ discovery enables Matrix Brandes — expressing both the forward (σ) and backward (δ) phases of Brandes' algorithm as matrix multiplications. This reorganizes n per-source BFS+backpropagation into d hop-level BLAS calls.

Implementations

ID	Method	Kernel
BB1	Brandes-Python	Pure Python per-source BFS
BB2	Brandes-C (igraph)	igraph C single-thread
TB1	Matrix-Brandes (STRATA)	SpMM + Numba JIT fused prune
TB2	Matrix-Brandes-C (STRATA)	C + OpenMP fused SpMM

Betweenness Centrality Performance

Method	Facebook (4,039)	BA-2000	WS-2000	BA-500
TB2 Matrix-Brandes-C	0.24s	0.058s	0.062s	0.004s
TB1 Matrix-Brandes (Numba)	0.42s	0.088s	0.093s	0.005s
BB2 Brandes-C (igraph)	0.91s	0.188s	0.146s	0.011s
BB1 Brandes-Python	179.1s	16.4s	11.5s	0.958s

TB2 vs igraph C: 2.4–3.6× faster across all graphs. The speedup comes from (1) hop-level fused pruning that skips already-visited pairs, and (2) OpenMP parallelization that hop-level organization naturally enables. Both approaches share O(nm) asymptotic complexity.

Dynamic Edge Operations

STRATA's distance matrix D enables incremental updates for both edge insertion and deletion, without full APSP recomputation. These operations work with any APSP method that produces a complete distance matrix.

Edge Insertion: O(n²) Closed-Form Update

When undirected edge {a,b} is added:

D'[i,j] = min(D[i,j], D[i,a] + 1 + D[b,j], D[i,b] + 1 + D[a,j])

Only 0.008% of pairs are affected per insertion. NumPy vectorization (CPU) runs in ~0.155s, CuPy (GPU) in ~0.035s (4.4× speedup), yielding 22–33× speedup over full recomputation.

Edge Deletion: Level-Based Parent Replacement

Edge deletion is harder — distances can only increase. The proposed algorithm processes levels bottom-up:

Phase 1 (Identify): For each affected pair (s,t) at level k, check if an alternative parent exists among neighbors at level k-1.
Phase 2 (Propagate): Orphaned nodes cascade upward — their children may also lose shortest paths.
Early Termination: In small-world graphs, cascade dies out quickly due to abundant alternative paths.

Deletion Performance (Facebook n=4,039)

Case	Affected Sources	CPU-BFS	CPU-DLevel	GPU-BFS	GPU-DLevel
Low impact	2	1.10s	0.009s	0.034s	0.034s
High impact	2,407	1.10s	0.091s	0.034s	0.031s

CPU DLevel Optimization History

Stage	Change	Time	Speedup
Baseline	Full D scan per level	1.170s	1.0×
+ Affected tracking	Scan only affected sources	0.158s	7.4×
+ Cython	Inner loop in C	0.043s	27.2×
+ Forward propagation	Propagate from orphans only	0.009s	~130×

Fully Dynamic Framework

Combining insertion and deletion enables a fully dynamic APSP update framework:

Insertion: O(n²) closed-form, vectorizable
Deletion: Level-based parent replacement with early termination
Method-agnostic: Works with any APSP method that produces D and A

Related work: Even–Shiloach (1981) handles single-source decremental in O(m·n). Ramalingam–Reps (1996) updates affected vertices only. Demetrescu–Italiano (2004) achieves amortized O(n² log³ n) fully dynamic APSP. STRATA's framework is orthogonal to these — it operates on the distance matrix level, independent of the APSP method used.

STRATA Contributions

1. CPU: Fastest Matrix-Algebraic APSP

STRATA-SpMM-Cython (TC1) is the fastest CPU method across all 6 graph topologies, outperforming SciPy's native C BFS by up to 2.0×. The key is Cython fused pruning — collapsing three sparse operations (booleanize → prune → footprint update) into a single C pass over COO entries.

Graph	TC1 (s)	SciPy (s)	Speedup
Facebook (n=4,039)	1.013	2.072	2.0×
BA-2000	0.287	0.314	1.1×
WS-2000	0.306	0.301	1.0×
Grid-45×45 (d=88)	0.159	0.164	1.0×

2. GPU: SpMM Elimination via Direct CSR Expansion

STRATA-CUDA (TG2) replaces cuSPARSE SpMM with a custom CUDA kernel that directly traverses CSR neighbors, eliminating three SpMM bottlenecks:

Redundant products — SpMM computes all matrix products including already-visited pairs
Intermediate matrices — SpMM output requires COO/CSR conversion overhead
Separate prune step — footprint check is fused into expansion (1-pass)

The footprint check uses a guard+CAS pattern: non-atomic read guard skips already-visited cells, then atomicCAS ensures race-free 0→1 transitions.

if (F[idx] == 0) {                      // non-atomic guard
    if (atomicCAS(&F[idx], 0, 1) == 0) { // atomic CAS
        int pos = atomicAdd(out_count, 1);
        out_row[pos] = i;
        out_col[pos] = k;
    }
}

3. Structural Insight: Optimization Converges to Per-Source BFS

Progressive removal of STRATA's matrix-algebraic overhead converges to per-source BFS. At small scale (n=4K), TG2 is competitive. At large scale (n=20K), the framework overhead dominates, and BG1 wins by 48.6×. This reveals that for full APSP, matrix-algebraic indirection costs (shared footprint → atomics, per-hop kernel launch) accumulate at scale.

4. STRATA's Unique Capabilities

While BG1 is fastest for full APSP, STRATA provides capabilities that per-source BFS cannot:

k-hop constrained APSP — exact hop shells at each distance level
σ(s,t) simultaneous computation — shortest-path counts from the same matrix multiply, 8.5–34× over sequential BFS
Matrix Brandes — betweenness centrality via hop-level BLAS, TB2 is 2.4–3.6× faster than igraph C
Dynamic edge insertion — O(n²) incremental update, 22–33× faster than full recomputation
Dynamic edge deletion — level-based parent replacement with early termination, ~130× optimized
Algebraic analysis — matrix-based framework for theoretical convergence proofs

Discussion

Algorithm Selection Guide

Scenario	Best (GPU)	Best (CPU only)
Small-world APSP (n < 4K)	TG2 STRATA-CUDA	TC1 STRATA-SpMM-Cython
Large-scale APSP (n > 4K)	BG1 GPU-PerSrc-BFS	BC2 SciPy
High diameter APSP (d > 20)	BG1 GPU-PerSrc-BFS	TC1 / BC2
k-hop query	TG2	TC1 STRATA
Betweenness Centrality	TB2 Matrix-Brandes-C (2.4–3.6× over igraph)
Dynamic edge insertion	STRATA (22–33× vs recompute)
Dynamic edge deletion	DLevel (130× optimized, early termination)

Honest Assessment

STRATA's APSP performance on GPU cannot beat per-source BFS at scale. The matrix-algebraic framework introduces structural overhead (per-hop sync, atomics, batch management) that becomes dominant as n grows. TG2's memory cliff at n=20K is a hard scaling limit.

However, STRATA's real value lies in its algebraic generality: the same matrix multiply that computes distances also yields σ(s,t), enabling Matrix Brandes (TB2) to outperform igraph C by 2.4–3.6×. The framework also naturally supports dynamic updates and k-hop queries — capabilities absent from per-source BFS.

Links

📄

AORM Paper

IEEE Access (Open Access)

Brandes Implementations

4 Betweenness Centrality Methods

📊

Experiments

12 Methods × 6 Graphs + Scaling

🚀

How to Run

Installation & Usage