ION Monitoring Guide
Overview
ION provides three essential monitoring utilities to track system health and resource usage:
- ionwatch - Monitors daemon process status across all ION protocols
- sdrwatch - Monitors SDR (non-volatile data store) memory usage
- psmwatch - Monitors PSM (Personal Space Management) partition usage
These tools are critical for:
- Verifying that all required daemons are running
- Detecting memory leaks in SDR and PSM
- Troubleshooting performance issues
- Monitoring system health in production environments
- Diagnosing storage space problems
ionwatch - Daemon Status Monitoring
Purpose
ionwatch monitors the status of all ION daemon processes by checking their PIDs in the volatile database and verifying that the processes are actually running. It provides a comprehensive view of daemon health across all ION protocols.
Usage
# Display daemon status once
ionwatch
# Continuously monitor, refreshing every 5 seconds
ionwatch -w 5
# Monitor for 10 iterations with 2-second intervals
ionwatch -w 2 -c 10
# Show only running daemons
ionwatch -r
# Log output to ion.log instead of stdout
ionwatch -l
# Quiet mode: show full status initially, then only changes
ionwatch -w 10 -q
# Recommended watchdog mode: log only changes every 10 seconds
ionwatch -w 10 -q -l
Command-Line Options
| Option | Description |
|---|---|
| -w, --watch <interval> | Watch mode: refresh every interval seconds |
| -c, --count <count> | Number of refresh cycles (default: 1) |
| -r, --running-only | Show only running daemons |
| -l, --log | Output to ion.log instead of stdout |
| -q, --quiet | Show full status initially, then only changes |
| -h, --help | Show help message |
Output Format
Protocol  Daemon                PID   Status   Notes
--------------------------------------------------------------------------------
ICI       rfxclock              26    Running  Contact plan manager
LTP       ltpclock              36    Running  Event scheduler
LTP       ltpdeliv              37    Running  Delivery service
LTP       udplso n20:1113 [2]   39    Running  Link service output
LTP       ltpmeter [2]          38    Running  Meter
LTP       udplsi [::]:1113      40    Running  Link service input
BP        bpclock               47    Running  Event scheduler
BP        cpsd                  48    STALE    Contact plan sync
BP        bptransit             49    Running  Transit processor
BP        bpclm [ipn:2.0]       46    Running  CL manager
BP        ltpcli [1]            52    Running  CL input
BP        ltpclo [2]            53    Running  CL output
--------------------------------------------------------------------------------
Note: In this example, the cpsd daemon shows STALE status (PID 48), indicating it has crashed or been killed. This requires investigation and likely a daemon restart or full ION restart.
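For automated checks, the report can be scanned for STALE rows. The following is a minimal sketch, assuming the row layout shown in the sample above (protocol, daemon name with optional arguments, PID, status, notes). The function names parse_ionwatch and stale_daemons are hypothetical helpers, not part of ION.

```python
import re

# Assumed row layout, based on the sample above:
# protocol, daemon (may contain spaces), PID, status, notes.
ROW = re.compile(
    r"^(?P<protocol>\S+)\s+(?P<daemon>.+?)\s+(?P<pid>-?\d+)\s+"
    r"(?P<status>Running|STALE|Not Started)\s*(?P<notes>.*)$"
)


def parse_ionwatch(text):
    """Return one dict per daemon row; header and ruler lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            row = match.groupdict()
            row["pid"] = int(row["pid"])
            rows.append(row)
    return rows


def stale_daemons(text):
    """Return only the rows whose status is STALE."""
    return [row for row in parse_ionwatch(text) if row["status"] == "STALE"]


# A few rows copied from the sample report above.
SAMPLE = """\
Protocol Daemon PID Status Notes
--------------------------------
ICI rfxclock 26 Running Contact plan manager
LTP udplso n20:1113 [2] 39 Running Link service output
LTP udplsi [::]:1113 40 Running Link service input
BP cpsd 48 STALE Contact plan sync
BP bpclm [ipn:2.0] 46 Running CL manager
"""

for row in stale_daemons(SAMPLE):
    print(f"{row['protocol']} {row['daemon']} (PID {row['pid']}) is STALE")
```

The daemon field is matched lazily so that arguments such as n20:1113 [2] stay attached to the daemon name instead of being mistaken for the PID.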
Output Fields Explained
Protocol Column
Identifies which ION protocol layer the daemon belongs to:
- ICI - Interplanetary Communication Infrastructure (core layer)
- LTP - Licklider Transmission Protocol (convergence layer)
- BP - Bundle Protocol (application layer)
- CFDP - CCSDS File Delivery Protocol
- DTPC - Delay-Tolerant Payload Conditioning
- BSSP - Bundle Streaming Service Protocol
Daemon Column
The name of the daemon process. Common daemons include:
ICI Daemons:
- rfxclock - Contact plan manager that maintains the contact graph routing table
LTP Daemons:
- ltpclock - Event scheduler for timer-driven activities
- ltpdeliv - Delivery service that delivers received data blocks
- udplso [engineId] - Link Service Output for specific span
- ltpmeter [engineId] - Metering daemon for rate control
- udplsi - Link Service Input (per-seat daemon)
BP Daemons:
- bpclock - Event scheduler for Bundle Protocol events
- cpsd - Contact Plan Sync daemon
- bptransit - Transit processor for bundle forwarding
- bpclm [eid] - Convergence Layer Manager for specific endpoint
- udpcli [duct] - Convergence Layer Input for specific induct
- udpclo [duct] - Convergence Layer Output for specific outduct
CFDP Daemons:
- cfdpclock - Event scheduler for CFDP
- cfdp UT layer - Unitdata Transfer (UT) layer adapter
DTPC Daemons:
- dtpcclock - Event scheduler for DTPC
- dtpcd - Main DTPC daemon
BSSP Daemons:
- bsspclock - Event scheduler for BSSP
PID Column
The Process ID of the daemon as registered in the volatile database. A PID of -1 or 0 indicates the daemon was never started.
Status Column
The current operational status of the daemon:
- Running - The daemon's PID is registered and the process exists and is executing normally. This is the expected state for all operational daemons.
- Not Started - The daemon has never been started. The PID is -1 or unset in the volatile database. This may be intentional if the daemon is not needed for your configuration (e.g., optional protocols), or it may indicate a startup problem.
- STALE - The daemon's PID is registered in the database, but the process no longer exists. This indicates the daemon terminated abnormally without properly clearing its PID. In ION 4.1.5+, daemons implement self-cleanup, so STALE entries typically indicate the daemon crashed or was forcibly killed (kill -9).
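The three states can be illustrated with a small sketch. This is not ionwatch's actual implementation, only an illustration of the rule described above, assuming a POSIX system where signal 0 probes for process existence. The function name classify_daemon is hypothetical.

```python
import os


def classify_daemon(pid):
    """Classify a registered PID using the three states described above."""
    if pid is None or pid <= 0:
        return "Not Started"      # PID is -1, 0, or unset
    try:
        os.kill(pid, 0)           # signal 0: existence check only, nothing is sent
    except ProcessLookupError:
        return "STALE"            # PID is registered but no such process exists
    except PermissionError:
        return "Running"          # process exists but belongs to another user
    return "Running"


print(classify_daemon(os.getpid()))  # the current process certainly exists
print(classify_daemon(-1))
```

On a POSIX system this prints Running followed by Not Started.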
Notes Column
Brief description of the daemon's function to help understand its role in the ION stack.
Interpreting ionwatch Output
Healthy System: All required daemons show "Running" status. Optional protocol daemons (CFDP, DTPC, BSSP) may show "Not Started" if those protocols are not configured.
Daemon Failure: If a daemon shows "STALE" status, it has crashed or been killed. Check ion.log for error messages and restart the daemon or the entire ION node.
Configuration Issue: If a required daemon shows "Not Started" after running ionstart, there may be a configuration error or startup script issue. Check your configuration files and ion.log.
Performance Monitoring: Use watch mode (-w) to continuously monitor daemon health. This is especially useful during testing or when diagnosing intermittent failures.
sdrwatch - SDR Memory Monitoring
Purpose
sdrwatch monitors the SDR (Simple Data Recorder), ION's non-volatile data store that holds persistent protocol state, bundle payload data, and other critical information. It helps detect memory leaks and monitor heap space utilization.
Usage
# Display current SDR usage summary once
sdrwatch ion -t 0
# Print statistics for current transaction
sdrwatch ion -s
# Reset log length high-water mark and print stats
sdrwatch ion -r
# Print stats and ZCO status
sdrwatch ion -z
# Trace mode: monitor allocation/deallocation every 5 seconds
sdrwatch ion -t 5
# Continuous tracing with 10 iterations at 3-second intervals
sdrwatch ion -t 3 10
# Verbose trace showing all allocations (not just leaks)
sdrwatch ion -t 5 10 verbose
Operating Modes
| Mode | Description |
|---|---|
| -t (default) | Trace mode: reports on SDR space allocation and release activity |
| -s | Statistics mode: prints current transaction statistics |
| -r | Reset mode: resets max log length high-water mark, then prints stats |
| -z | ZCO mode: prints stats plus Zero-Copy Objects status |
Output Format (Trace Mode with interval=0)
-- sdr 'ion' usage report --
small pool free blocks:
8 of size 8
12 of size 16
5 of size 24
total avbl: 1245680
total unavbl: 234320
total size: 1480000
large pool free blocks:
3 of order 256
1 of order 512
total avbl: 8942560
total unavbl: 1057440
total size: 10000000
total heap size: 11480000
total unused: 2156320
max total used: 9323680
total now in use: 8031360
max xn log len: 45632
Output Fields Explained
Small Pool Section
The small pool manages small allocations efficiently using fixed-size blocks.
Small pool free blocks:
Lists the count and size of available free blocks in the small pool, grouped by size. Each size class represents a multiple of WORD_SIZE (typically 8 bytes). This shows the fragmentation level of small allocations.
- Format: count of size bytes
- Example: 12 of size 16 means there are 12 free blocks of 16 bytes each
total avbl (available): Total bytes available in the small pool's free block lists. This memory can be immediately allocated for small objects without fragmentation. Higher values indicate good availability of small blocks.
total unavbl (unavailable): Total bytes currently allocated from the small pool and in use by ION. This represents small objects currently holding data.
total size: Total size of the entire small pool (avbl + unavbl). This is the configured capacity for small allocations.
Large Pool Section
The large pool manages larger allocations using a buddy system with power-of-two sized blocks.
Large pool free blocks:
Lists the count and order (size) of available free blocks in the large pool. The buddy system organizes blocks by powers of two.
- Format: count of order bytes
- Example: 3 of order 256 means there are 3 free blocks of 256 bytes each
- Orders typically range from WORD_SIZE to large sizes (512, 1024, 2048, etc.)
total avbl (available): Total bytes available in the large pool's free block lists. This memory can be allocated for large objects using the buddy algorithm.
total unavbl (unavailable): Total bytes currently allocated from the large pool and in use. This represents large objects currently holding data.
total size: Total size of the entire large pool (avbl + unavbl). This is the configured capacity for large allocations.
Heap Summary
total heap size: The complete size of the SDR heap (small pool size + large pool size). This is the total configured SDR capacity from your ionconfig file.
total unused: Bytes in the heap that have never been allocated. This is "virgin" space that can be used for either pool as needed. As the system runs, this value decreases as more heap space is put into use.
max total used: The maximum amount of heap space that has ever been in use simultaneously since ION started. This is calculated as: heap size - unused size. This high-water mark indicates peak memory demand.
total now in use: Current amount of heap space actively allocated and in use. Calculated as: heap size - small pool free - large pool free - unused. This should fluctuate as bundles are created, forwarded, and delivered.
max xn log len (transaction log length): The maximum length of the transaction log that has been observed. The transaction log records all modifications during a transaction. If this value grows very large, it may indicate:
- Very large transactions that should be split up
- Inefficient use of transactions
- Potential memory pressure during transaction processing
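The two heap formulas given above can be written out as a short sketch. The function names are hypothetical. The first call uses the heap size and unused figures from the sample report, while the second uses made-up figures chosen only to illustrate the arithmetic.

```python
def max_total_used(heap_size, unused):
    """Peak usage: heap size minus space that was never allocated."""
    return heap_size - unused


def total_now_in_use(heap_size, small_free, large_free, unused):
    """Current usage: heap size minus both free lists and unused space."""
    return heap_size - small_free - large_free - unused


# Heap size and unused figures taken from the sample report above.
print(max_total_used(11480000, 2156320))  # 9323680

# Hypothetical figures, chosen only to illustrate the second formula.
print(total_now_in_use(1000000, 10000, 40000, 700000))  # 250000
```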
Detecting Memory Leaks with sdrwatch
Normal Operation:
- "total now in use" fluctuates as bundles arrive, are forwarded, and delivered
- "max total used" increases initially then stabilizes
- "total unused" decreases initially as pools are allocated, then stabilizes
- Free block counts remain reasonable
Memory Leak Indicators:
- "total now in use" continuously increases over time without decreasing
- "total unused" continuously decreases
- "max total used" keeps growing toward heap size
- Free block counts decrease toward zero
- Eventually: "Can't allocate heap space" errors in ion.log
How to Detect Leaks:
1. Run sdrwatch ion -t 30 100 to monitor every 30 seconds for 100 iterations
2. Observe the "total now in use" value over time
3. If it grows continuously without dropping, investigate recent code changes
4. Use trace mode with verbose output to see exactly what's being allocated
5. Check ion.log for "unfreed" allocation reports when the trace ends
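Step 2 above can be automated once the "total now in use" readings have been collected. The sketch below is a simple heuristic, not a feature of sdrwatch: it flags a series only when every reading exceeds the previous one. The function name looks_like_leak and the min_samples threshold are assumptions made for illustration.

```python
def looks_like_leak(samples, min_samples=5):
    """Return True when every reading is higher than the one before it.

    samples: "total now in use" values in the order they were observed.
    min_samples: hypothetical threshold below which no verdict is given.
    """
    if len(samples) < min_samples:
        return False
    return all(later > earlier for earlier, later in zip(samples, samples[1:]))


# Made-up readings for illustration.
steady_growth = [8000000, 8100000, 8250000, 8300000, 8420000]
normal_traffic = [8000000, 8400000, 7900000, 8300000, 7950000]

print(looks_like_leak(steady_growth))   # True
print(looks_like_leak(normal_traffic))  # False
```

A strictly increasing series is only a hint. Sustained traffic growth can produce the same pattern, so a positive result is a reason to move on to verbose trace output rather than proof of a leak.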
psmwatch - PSM Memory Monitoring
Purpose
psmwatch monitors PSM (Personal Space Management) partitions used by ION for working memory. PSM holds in-memory data structures such as lists, databases, and volatile state that does not need to be persistent.
Usage
# Display current usage for ionwm partition once (no tracing)
psmwatch 0xff01 5000000 ionwm 0 1
# Monitor with trace every 5 seconds for 10 iterations
psmwatch 0xff01 5000000 ionwm 5 10
# Verbose trace showing all allocations
psmwatch 0xff01 5000000 ionwm 5 10 verbose
# Poll without tracing (interval must be negative)
psmwatch 0xff01 5000000 ionwm -10 100
Parameters
| Parameter | Description |
|---|---|
| shared_memory_key | IPC key for the shared memory segment (hex or decimal, typically 0xff01 / 65281) |
| memory_size | Size of the shared memory segment (must match .ionconfig wmSize, typically 5000000) |
| partition_name | Name of the partition to monitor: ionwm (ION working memory) or sdrwm (SDR working memory) |
| interval | Polling interval in seconds. Use 0 for single poll. Use negative value to disable tracing and just show summaries |
| count | Number of polling iterations |
| verbose | Optional: enable verbose output showing all allocations |
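Because the parameters are positional, scripts that launch psmwatch may find it convenient to assemble the argument list in one place. The sketch below only builds the list in the order given in the table and does not run the command. The helper names psmwatch_args and polling_style are hypothetical.

```python
def psmwatch_args(key, size, partition, interval, count, verbose=False):
    """Build the positional argument list in the order given in the table."""
    args = ["psmwatch", hex(key), str(size), partition, str(interval), str(count)]
    if verbose:
        args.append("verbose")
    return args


def polling_style(interval):
    """Describe how the interval sign is interpreted, per the table above."""
    if interval > 0:
        return "trace"
    if interval < 0:
        return "poll without tracing"
    return "single poll"


# 65281 is the decimal form of 0xff01.
print(" ".join(psmwatch_args(65281, 5000000, "ionwm", 5, 10, verbose=True)))
print(polling_style(-10))
```

The first line printed is psmwatch 0xff01 5000000 ionwm 5 10 verbose, which matches the verbose example under Usage.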
Common Partition Names
- ionwm - ION working memory, used for volatile ION data structures
- sdrwm - SDR working memory, used internally by SDR for heap management
Output Format
-- partition 'ionwm' usage report --
small pool free blocks:
45 of size 8
38 of size 16
22 of size 24
15 of size 32
total avbl: 234560
total unavbl: 512440
total size: 747000
large pool free blocks:
5 of order 128
3 of order 256
1 of order 512
total avbl: 2456320
total unavbl: 1543680
total size: 4000000
total partition: 5000000
total unused: 253000
Output Fields Explained
Small Pool Section
Small pool free blocks:
Lists available free blocks in the small pool by size. Each entry shows the count and size of free blocks.
- Format: count of size bytes
- Sizes increment by WORD_SIZE (typically 8 bytes)
- Example: 45 of size 8 means 45 free 8-byte blocks available
total avbl (available): Total bytes in the small pool's free lists, immediately available for small allocations. Higher values indicate good availability.
total unavbl (unavailable): Total bytes allocated from the small pool and currently in use by ION data structures.
total size: Total capacity of the small pool (avbl + unavbl).
Large Pool Section
Large pool free blocks:
Lists available free blocks in the large pool organized by order (power-of-two sizes).
- Format: count of order bytes
- Uses buddy system allocation
- Example: 5 of order 128 means 5 free 128-byte blocks
total avbl (available): Total bytes in the large pool's free lists, available for large allocations.
total unavbl (unavailable): Total bytes allocated from the large pool and currently in use.
total size: Total capacity of the large pool (avbl + unavbl).
Partition Summary
total partition: The complete size of the PSM partition. This should match the memory_size parameter and the wmSize configuration in .ionconfig.
total unused: Bytes that have never been allocated from this partition. This "virgin" memory can be added to either pool as needed. Decreases over time as memory is first used.
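The totals in a usage report can be cross-checked against each other. The sketch below assumes the "total ...: value" line format shown under Output Format and uses the figures from that sample. The function name parse_usage_report is hypothetical.

```python
import re


def parse_usage_report(text):
    """Collect the 'total ...' lines of a usage report.

    Pool totals appear twice (small pool first, then large pool), so each
    label maps to a list of values in order of appearance.
    """
    totals = {}
    for line in text.splitlines():
        match = re.match(r"\s*(total [a-z ]+?):\s*(\d+)", line)
        if match:
            totals.setdefault(match.group(1), []).append(int(match.group(2)))
    return totals


# Totals copied from the sample report under Output Format.
SAMPLE = """\
-- partition 'ionwm' usage report --
small pool free blocks:
total avbl: 234560
total unavbl: 512440
total size: 747000
large pool free blocks:
total avbl: 2456320
total unavbl: 1543680
total size: 4000000
total partition: 5000000
total unused: 253000
"""

totals = parse_usage_report(SAMPLE)
small_avbl, large_avbl = totals["total avbl"]
small_unavbl, large_unavbl = totals["total unavbl"]
small_size, large_size = totals["total size"]
partition = totals["total partition"][0]
unused = totals["total unused"][0]

print(small_avbl + small_unavbl == small_size)             # True
print(large_avbl + large_unavbl == large_size)             # True
print(small_size + large_size + unused == partition)       # True
```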
Understanding Memory States: avbl, unavbl, and unused
The terminology can be confusing, especially "unavbl" (unavailable). Here's what each state means:
The Three Memory States
- "avbl" (available) - Memory in free block lists, ready to allocate immediately
- "unavbl" (unavailable) - Memory that has been allocated and is actively in use by ION data structures
- "unused" - Virgin memory that has never been allocated to either pool
Key Point: "unavbl" means ALLOCATED and IN USE, not "reserved but free". It's "unavailable" because it's already occupied by active data structures.
Memory Flow
[Unused Space] → [Allocated to Pool]  → [Given to Application]
               → [Small/Large Pool]   → [unavbl = in use]
                 (total size)           [avbl = free blocks]
When memory is allocated and freed:
- First allocation: Comes from "unused" → becomes part of pool "total size" → marked as "unavbl"
- Subsequent allocation: Comes from "avbl" (free blocks) → becomes "unavbl"
- When freed: Goes from "unavbl" → back to "avbl" (free blocks)
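The transitions above can be mimicked with a toy model. This is a deliberately simplified sketch that tracks byte counts only. It ignores block sizes, fragmentation, and the split between small and large pools, and it is not how PSM is implemented. The class name PoolModel is hypothetical.

```python
class PoolModel:
    """Toy model of one pool plus the partition's never-used space."""

    def __init__(self, partition_size):
        self.unused = partition_size  # virgin space, never allocated
        self.avbl = 0                 # bytes on free-block lists
        self.unavbl = 0               # bytes allocated and in use

    @property
    def total_size(self):
        """Pool size as reported: free plus in-use bytes."""
        return self.avbl + self.unavbl

    def allocate(self, nbytes):
        if self.avbl >= nbytes:       # reuse freed space first
            self.avbl -= nbytes
        elif self.unused >= nbytes:   # otherwise grow the pool from unused space
            self.unused -= nbytes
        else:
            raise MemoryError("no space left")
        self.unavbl += nbytes

    def free(self, nbytes):
        if nbytes > self.unavbl:
            raise ValueError("cannot free more than is in use")
        self.unavbl -= nbytes
        self.avbl += nbytes


model = PoolModel(1000)
model.allocate(100)  # first allocation: unused -> unavbl
model.free(100)      # freed: unavbl -> avbl
model.allocate(64)   # subsequent allocation: avbl -> unavbl
print(model.unused, model.avbl, model.unavbl, model.total_size)  # 900 36 64 100
```

Note that freeing memory never returns it to "unused" in this model: once space has joined a pool, it only moves between avbl and unavbl.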
Real Example Breakdown
-- partition 'ionwm' usage report --
small pool free blocks:
14 of size 16
4 of size 24
6 of size 32
2 of size 40
18 of size 48
3 of size 56
4 of size 80
1 of size 184
1 of size 192
4 of size 272
total avbl: 3408 ← Free blocks ready to allocate
total unavbl: 13232 ← Allocated and currently IN USE
total size: 16640 ← Total small pool (3408 + 13232)
large pool free blocks:
1 of order 1024
1 of order 2048
total avbl: 5168 ← Free blocks ready to allocate
total unavbl: 611184 ← Allocated and currently IN USE
total size: 616352 ← Total large pool (5168 + 611184)
total partition: 50000000 ← Total PSM partition size
total unused: 49364408 ← Never allocated to any pool yet
Interpretation:
- Small pool unavbl (13,232 bytes): Active ION data structures using small allocations
- Large pool unavbl (611,184 bytes): Active ION data structures using large allocations
- Unused (about 49.4 MB): Plenty of virgin space left to grow pools as needed
- This is a healthy system with lots of headroom!
Summary:
- total size = avbl + unavbl (all memory in this pool, whether free or in use)
- total partition = small pool size + large pool size + unused (approximately: in the example above the three sum to 49,997,400 of 50,000,000 bytes)
- Rising "unavbl" over time without ever dropping = likely memory leak
- Stable or fluctuating "unavbl" = normal operation
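The figures in the real example breakdown above can be checked by hand or with a few lines of arithmetic. The sketch below uses only numbers copied from that report. The variable names are chosen for illustration.

```python
# Small pool free blocks from the real example, as {block size: count}.
small_free_blocks = {
    16: 14, 24: 4, 32: 6, 40: 2, 48: 18,
    56: 3, 80: 4, 184: 1, 192: 1, 272: 4,
}

# Summing count * size reproduces the small pool's "total avbl".
free_bytes = sum(size * count for size, count in small_free_blocks.items())
print(free_bytes)  # 3408

small_size = 3408 + 13232     # avbl + unavbl
large_size = 5168 + 611184    # avbl + unavbl
unused = 49364408
partition = 50000000

accounted = small_size + large_size + unused
print(small_size, large_size)   # 16640 616352
print(accounted)                # 49997400
print(partition - accounted)    # 2600
```

The small pool listing accounts for its "total avbl" exactly. The two pools plus unused space fall 2,600 bytes short of the partition size, so the partition identity in the summary should be read as approximate.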
Trace Mode vs Polling Mode
Trace Mode (positive interval): Monitors memory allocations and deallocations, reporting on potential leaks and showing allocation activity. Useful for debugging memory issues. Due to limited tracing memory, it is best used for short-term monitoring only.
Polling Mode (negative interval): Only displays usage summaries at each interval without detailed allocation tracing. More efficient for long-term monitoring with less overhead.
Detecting PSM Memory Leaks
Normal Operation:
- "total unavbl" values fluctuate as data structures are created and freed
- "total unused" decreases initially then stabilizes
- Free block counts remain healthy
Memory Leak Indicators:
- "total unavbl" continuously increases without decreasing
- "total unused" steadily decreases toward zero
- Free block counts trending toward zero
- "Can't allocate space" errors in ion.log
Troubleshooting Steps:
1. Use trace mode to see allocation patterns: psmwatch 0xff01 5000000 ionwm 30 50
2. Compare "total unavbl" values over time
3. Check for "unfreed" allocations in trace output
4. Review recent code changes for missing psm_free() calls
5. Verify that cleanup routines are being called properly
See Also
- ionwatch(1) - Daemon status monitor man page
- sdrwatch(1) - SDR activity monitor man page
- psmwatch(1) - PSM activity monitor man page
- sdr(3) - SDR API documentation
- psm(3) - PSM API documentation
- ION-Watch-Characters.md - Watch character documentation for real-time monitoring
- ION-Utilities.md - Overview of all ION utility programs