NeuraTensor SDK — Custom CUDA kernels for edge AI on NVIDIA Jetson AGX Orin. 23 ms inference, 111× speedup, 4× memory reduction.
CUDA SDK · Production-ready · Available under license

NeuraTensor Custom CUDA

Edge AI SDK

Hand-tuned CUDA kernels for hybrid neuromorphic workloads. 23 ms inference, 111× speedup and 4× memory reduction on a single Jetson AGX Orin — under a hard 60 W envelope.


Inference latency

23 ms

64 M params · warm

Speedup vs baseline

111×

vs PyTorch reference

Memory footprint

4× lower

fused SNN-SSM kernels

GPU utilization

98.25%

sustained · MAXN

Direct measurement · Jetson AGX Orin 64 GB · Power mode MAXN · CUDA 12.x

What it is

A CUDA SDK engineered for hybrid neuromorphic inference at the edge.

NeuraTensor is the GPU acceleration layer that powers the NeuratronLLM-Edge family. It replaces stock PyTorch and TensorFlow operators with hand-tuned CUDA kernels designed for spiking, state-space and fusion workloads — squeezing the full envelope of NVIDIA Ampere edge silicon.

Fused SNN-SSM kernels

Spiking and state-space layers compile into single kernels, which eliminates intermediate tensor traffic and cuts the memory footprint by 4×.
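The idea behind fusion can be sketched in a few lines of plain Python. This is an illustration only, not NeuraTensor's kernel code: the function names, the scalar recurrence `h = decay * h + x` and the simple threshold spike rule (no membrane reset) are assumptions chosen for brevity. The unfused path materializes every state value before thresholding; the fused path keeps the state in a local variable, the way a fused kernel would keep it in a register.

```python
def ssm_then_spike_unfused(inputs, decay=0.9, threshold=1.0):
    """Two passes: store every state-space value, then threshold them."""
    states = []                      # intermediate buffer written out in full
    h = 0.0
    for x in inputs:
        h = decay * h + x            # linear state-space recurrence
        states.append(h)
    # Second pass re-reads the whole intermediate buffer.
    spikes = [1 if s >= threshold else 0 for s in states]
    return spikes, len(states)       # len(states) = intermediate values stored


def ssm_then_spike_fused(inputs, decay=0.9, threshold=1.0):
    """One pass: the state never leaves the loop, so nothing is stored."""
    spikes = []
    h = 0.0
    for x in inputs:
        h = decay * h + x            # same recurrence as above
        spikes.append(1 if h >= threshold else 0)
    return spikes, 0                 # no intermediate values stored


if __name__ == "__main__":
    xs = [0.5, 0.5, 0.5, 0.0, 0.0]
    print(ssm_then_spike_unfused(xs))
    print(ssm_then_spike_fused(xs))
```

Both functions return the same spike train; only the number of stored intermediate values differs, which is the traffic a fused kernel avoids.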

Hardware-aware scheduling

Occupancy is tuned for SM 8.7 (Ampere edge). Tensor Cores are exploited for FP16 / INT8 mixed precision; warps are aligned to the spike-router topology.
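To make "occupancy" concrete, here is a simplified theoretical-occupancy estimate in Python. It is a sketch under stated assumptions, not NeuraTensor's scheduler: the per-SM limits below are the values we assume for compute capability 8.7 and should be checked against the CUDA programming guide, and the model ignores register and shared-memory allocation granularity.

```python
import math

# Assumed per-SM limits for compute capability 8.7 (verify before relying on them).
WARP_SIZE = 32
MAX_WARPS_PER_SM = 48
MAX_BLOCKS_PER_SM = 16
REGISTERS_PER_SM = 65536
SHARED_MEM_PER_SM = 164 * 1024   # bytes


def theoretical_occupancy(threads_per_block, regs_per_thread, smem_per_block=0):
    """Fraction of the SM's warp slots that one kernel configuration can fill."""
    warps_per_block = math.ceil(threads_per_block / WARP_SIZE)
    # Each resource caps how many blocks can be resident at once.
    limits = [
        MAX_BLOCKS_PER_SM,
        MAX_WARPS_PER_SM // warps_per_block,
        REGISTERS_PER_SM // (regs_per_thread * warps_per_block * WARP_SIZE),
    ]
    if smem_per_block > 0:
        limits.append(SHARED_MEM_PER_SM // smem_per_block)
    resident_blocks = min(limits)
    return resident_blocks * warps_per_block / MAX_WARPS_PER_SM


if __name__ == "__main__":
    print(theoretical_occupancy(256, 32))              # register-light kernel
    print(theoretical_occupancy(256, 64))              # register-heavy kernel
    print(theoretical_occupancy(128, 32, 48 * 1024))   # shared-memory-heavy kernel
```

Under these assumptions, doubling register use per thread from 32 to 64 drops a 256-thread block from full occupancy to two thirds, which is the kind of trade-off hand tuning has to balance.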

Zero-copy edge runtime

Unified-memory aware. Inputs and KV cache live in shared LPDDR5, with no host↔device shuttling, so none of the 204.8 GB/s bus bandwidth is spent on redundant copies.

Architecture

Seven layers. Silicon to API.

From the Ampere GPU all the way up to the industrial REST surface. Layer 3 — the custom CUDA kernel set — is where NeuraTensor extracts the order-of-magnitude speedup. Everything above it is regular Python.

Stack topology

L7 → L1

L7 · Application interface: REST API + WebSocket server · Python
L6 · Orchestration & control: Execution loop, monitoring, scheduling
L5 · Industrial integration: SLA monitor · SEMI/GEM protocols · Python/C++
L4 · Model architecture: Hybrid neural network implementation · Python
L3 · CUDA acceleration: Custom GPU kernels · CUDA C++ · the NeuraTensor core
L2 · Hardware abstraction: Runtime + device management · Python/Shell
L1 · Physical hardware: Jetson AGX Orin 64 GB · Ampere SM 8.7

Kernel inventory

On-chip · IP-safe
Fused spiking attention: custom
State-space recurrence: custom
Sparse spike router: custom
Mixed-precision GEMM: custom · FP16/INT8
Quantized weight update: custom · sub-KB
Tokenizer / embedding lookup: vendor + ours
RoPE rotation: fused
Layer norm (FP32-safe): fused

Measured performance

Real numbers. Single Jetson AGX Orin 64 GB.

Benchmarks run on the same SDK build that ships to customers. The PyTorch baseline executes the identical 64 M-parameter model with stock operators. NeuraTensor delivers the same numerics in roughly 1/111 of the wall-clock time.

Inference latency

23 ms

64 M params · warm

PyTorch baseline

0ms

same model · stock ops

Speedup

111×

kernel fusion + occupancy

Memory reduction

4×

vs reference path

GPU utilization

98.25%

sustained

Resident GPU memory

0GB

LLM + vision co-resident

Mean power

44.52 W

under 60 W TDP

Junction temp

40.92 °C

peak 42.06 °C

Cold start

0s

disk → GPU

Why 111×. The PyTorch reference path issues hundreds of small ops per token; NeuraTensor fuses spiking + state-space + attention into a handful of compiled kernels with hardware-aware occupancy, eliminating intermediate allocations and saturating the 204.8 GB/s memory bus. Same model, same outputs — two orders of magnitude faster.
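The headline figures can be cross-checked with simple arithmetic. The sketch below derives the baseline latency implied by the two published numbers (23 ms and 111×); because both are rounded, the result is an implied value, not a measurement, and the function names are ours.

```python
import math


def implied_baseline_ms(optimized_ms, speedup):
    """Baseline latency implied by an optimized latency and a speedup factor."""
    return optimized_ms * speedup


def orders_of_magnitude(speedup):
    """Base-10 logarithm of the speedup; 2.0 means exactly 100×."""
    return math.log10(speedup)


if __name__ == "__main__":
    print(implied_baseline_ms(23, 111))   # → 2553
    print(orders_of_magnitude(111))       # slightly above 2
```

A 111× speedup therefore corresponds to an implied baseline of roughly 2.5 seconds per inference, and sits just above the two-orders-of-magnitude mark.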

Hardware platform

Tuned to the silicon. Not the abstraction.

NeuraTensor targets the Ampere edge stack directly — SM 8.7, Tensor Cores, unified LPDDR5 memory. The kernel set is portable across the Jetson Orin family; the performance numbers below are anchored to the AGX Orin 64 GB validation host.

GPU subsystem

Ampere · SM 8.7
CUDA cores: 2,048 @ 1.3 GHz
Tensor Cores: 64 (FP16 / INT8)
Streaming multiprocessors: 16
Compute capability: 8.7
CUDA toolkit: 12.x
Driver: JetPack 6.x

Memory & I/O

Unified · LPDDR5
Unified RAM: 61.3 GB LPDDR5
Memory bandwidth: 204.8 GB/s
L2 cache (shared): 4 MB
CPU ↔ GPU access: Zero-copy
Storage: NVMe / eMMC
Power envelope: 60 W TDP (MAXN)
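The bandwidth figure sets a floor on how fast any kernel can stream model weights. The sketch below computes that floor for the 64 M-parameter benchmark model; the storage widths (2 bytes for FP16, 1 byte for INT8) are our assumption, since the page does not state how the weights are stored, and 1 GB is taken as 10^9 bytes.

```python
def min_stream_time_ms(n_params, bytes_per_param, bandwidth_gb_s=204.8):
    """Lower bound, in ms, on reading every weight once at peak bandwidth."""
    total_bytes = n_params * bytes_per_param
    seconds = total_bytes / (bandwidth_gb_s * 1e9)
    return seconds * 1e3


if __name__ == "__main__":
    n_params = 64_000_000
    print(min_stream_time_ms(n_params, 2))   # FP16 weights (assumed)
    print(min_stream_time_ms(n_params, 1))   # INT8 weights (assumed)
```

Under these assumptions one full pass over FP16 weights takes at least about 0.6 ms, so this bound alone leaves ample room inside a 23 ms inference; real kernels also move activations and state, which this estimate ignores.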

Where NeuraTensor matters

For workloads that can't wait on the cloud.

Anywhere latency, power and on-device autonomy are non-negotiable. NeuraTensor is the substrate beneath every Neuramorphic foundation model deployment — and is licensable standalone for partner workloads.

Foundation models on edge

Ships under NeuratronLLM-Edge (Caroline). Runs a 4 B-parameter LLM and a real-time vision pipeline concurrently on a single Jetson AGX Orin under 60 W.

LLM · Vision · Concurrent

Industrial control & robotics

Sub-25 ms inference closes the loop tight enough for motion control, line-side perception and adaptive process control on tool-mounted compute.

Control loop · Real-time · Tool-side

Defense & autonomous systems

Disconnected operation by design. The kernel set runs without internet, telemetry or driver phone-home — the entire compute path stays on the device.

Disconnected · Sovereign · Autonomous

Healthcare imaging

Bedside inference where PHI cannot leave the room. The 4× memory reduction lets larger diagnostic models fit on Orin-class compute next to the patient.

PHI-safe · Bedside · Imaging

Semiconductor fab tools

Tool-mounted reasoning and anomaly detection co-located with the equipment. NeuraTensor is the acceleration layer behind regulated industrial deployments.

Industrial · Process bay · Edge HPC

Energy & grid edge

Substations, wind turbines, remote pipelines. Inference budgets in single-digit watts; NeuraTensor keeps the kernels inside the envelope without sacrificing model size.

Substation · Low-power · Remote

Supported targets

Where NeuraTensor runs.

Edge HPC (active) · Jetson AGX Orin 64 GB: Primary validation target, full kernel set, full envelope
Edge · Jetson Orin Nano 8 GB: Distilled profile · same kernel API · reduced footprint
Workstation · Discrete Ampere/Ada GPU: Pre-deployment soak, kernel profiling, regression suite
Training · Internal cluster: Model factory, not customer-facing

FAQ

What integrators ask first.

NeuraTensor is a custom CUDA kernel set built for hybrid neuromorphic inference at the edge. It replaces stock PyTorch / TensorFlow operators with hand-tuned, fused kernels that target the NVIDIA Ampere edge stack (SM 8.7) and deliver 23 ms inference, 111× speedup and 4× memory reduction on Jetson AGX Orin.

cuDNN and TensorRT accelerate standard deep-learning operators, including those used by transformers. NeuraTensor accelerates hybrid neuromorphic operators (fused spiking + state-space + sparse attention paths) that have no production-grade equivalent in vendor libraries. The two are complementary; NeuraTensor sits at Layer 3 of the stack and falls back to cuBLAS / cuDNN where appropriate.

Primary target is the NVIDIA Jetson AGX Orin 64 GB (Ampere, SM 8.7, 2,048 CUDA cores, 64 Tensor Cores, 204.8 GB/s LPDDR5). The kernel set is portable across the Jetson Orin family (Nano, NX, AGX) and runs on discrete workstation Ampere/Ada GPUs for pre-deployment profiling.

The PyTorch reference path issues hundreds of small ops per token, each with allocation and launch overhead. NeuraTensor fuses spiking, state-space and attention into a handful of compiled kernels, aligns occupancy to SM 8.7, exploits Tensor Cores for FP16 / INT8 mixed precision and saturates the unified memory bus. Same numerics, two orders of magnitude faster wall-clock.

Yes, it runs fully air-gapped. The kernel set runs entirely on-device with no network, telemetry or driver phone-home. It is the substrate behind the Caroline air-gap deployment, which is launched inside a Linux network namespace and validated to block 4 of 4 external egress probes at the kernel level.
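An egress check of this kind can be approximated with a short Python script. This is a minimal sketch, not the validation suite used for the Caroline deployment: the function name and the probe targets are hypothetical (the addresses come from the reserved documentation range), and a probe counts as blocked if the connection attempt raises any `OSError`, which includes timeouts, refusals and DNS failures.

```python
import socket

# Hypothetical probe targets; replace with the endpoints your policy must block.
PROBE_TARGETS = [
    ("192.0.2.1", 443),
    ("192.0.2.2", 80),
    ("192.0.2.3", 53),
    ("192.0.2.4", 22),
]


def count_blocked_egress(targets, connect=socket.create_connection, timeout=2.0):
    """Return how many outbound TCP connection attempts fail."""
    blocked = 0
    for host, port in targets:
        try:
            conn = connect((host, port), timeout=timeout)
        except OSError:
            blocked += 1          # no route, refused, timed out, or DNS failed
        else:
            conn.close()          # connection succeeded: egress is NOT blocked
    return blocked


if __name__ == "__main__":
    blocked = count_blocked_egress(PROBE_TARGETS)
    print(f"blocked {blocked} of {len(PROBE_TARGETS)} egress probes")
```

The `connect` parameter exists so the check can be exercised with a stub in tests; inside an isolated network namespace every probe should fail, giving a count equal to the number of targets.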

On Jetson AGX Orin under sustained concurrent LLM + vision workload at 98.25% GPU utilization: estimated mean power 44.52 W, peak 45.08 W — inside the published 60 W platform envelope. Junction temperature mean 40.92 °C, peak 42.06 °C.
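The headroom these figures leave under the 60 W envelope follows directly from the numbers above. The helper below is an illustrative calculation only; its name and return shape are ours.

```python
def power_headroom(draw_w, envelope_w=60.0):
    """Return (watts of headroom, fraction of the envelope in use)."""
    return envelope_w - draw_w, draw_w / envelope_w


if __name__ == "__main__":
    for label, draw in (("mean", 44.52), ("peak", 45.08)):
        spare_w, used = power_headroom(draw)
        print(f"{label}: {spare_w:.2f} W spare, {used:.1%} of envelope used")
```

At both mean and peak draw, roughly a quarter of the platform envelope remains unused.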

NeuraTensor SDK is licensed standalone for partner workloads and ships embedded inside the NeuratronLLM-Edge foundation models. Source, kernel internals and per-target tuning profiles are released under NDA. Contact support@neuramorphic.ai for an evaluation build.

NeuraTensor exposes a Python API surface that mirrors PyTorch tensor operations for the supported neuromorphic ops, plus a C++ runtime for low-overhead embedded deployment. A reference Docker image for JetPack 6.x is available for partners under NDA.

Patent pending

The NeuraTensor kernel set, the fused SNN-SSM compiler path and the hardware-aware scheduler are proprietary to Neuramorphic, Inc. and are the subject of pending USPTO patent applications.

Implementation details, kernel source, occupancy maps and per-target tuning profiles are confidential and shared under NDA. Contact support@neuramorphic.ai for SDK access.