[SVLS-7027] Add Step Functions Flare Command #1701

ryanstrat · 2025-06-17T20:53:41Z

Summary

This PR adds a new stepfunctions flare command to collect diagnostic information for AWS Step Functions, enabling efficient troubleshooting with Datadog support.

What it does

The flare command gathers comprehensive diagnostic data about your Step Functions state machines:

State machine configuration and tags
Recent execution history with detailed events
CloudWatch log subscription filters
CloudWatch logs (when --with-logs is specified)
Local project files for context

Key Features

🔐 Sensitive Data Protection: Automatically masks sensitive fields in ASL definitions and execution data
📊 Configuration Analysis: Detects common misconfigurations that prevent proper Datadog integration
🎯 Smart Diagnostics: Identifies missing log groups, incorrect log levels, and missing Datadog forwarder subscriptions
📁 Organized Output: Creates timestamped directories with all diagnostic files
🚀 Easy Sharing: Zips all data for simple upload to Datadog support

Configuration Analysis

The command generates an INSIGHTS.md file that automatically checks for:

Log level set to "ALL" (required for complete traces)
Include Execution Data enabled (required for detailed monitoring)
CloudWatch log group existence and accessibility
Datadog forwarder subscription on log groups
Proper Datadog tags (DD_TRACE_ENABLED, DD_TRACE_SAMPLE_RATE)
Potential conflicts with X-Ray tracing

Usage

# Basic flare collection
datadog-ci stepfunctions flare \
  --state-machine arn:aws:states:us-east-1:123456789012:stateMachine:MyStateMachine \
  --case-id 12345 \
  --email support@example.com

# Include CloudWatch logs
datadog-ci stepfunctions flare \
  --state-machine arn:aws:states:us-east-1:123456789012:stateMachine:MyStateMachine \
  --case-id 12345 \
  --email support@example.com \
  --with-logs \
  --start "2024-01-01" \
  --end "2024-01-07"

# Dry run to preview
datadog-ci stepfunctions flare \
  --state-machine arn:aws:states:us-east-1:123456789012:stateMachine:MyStateMachine \
  --case-id 12345 \
  --email support@example.com \
  --dry-run

Test Coverage

32 unit tests covering all functionality
Sensitive data masking verification
Configuration analysis validation
Error handling for missing resources
AWS API interaction testing

Sample Insights:

Full insights file omitted to avoid leaking DD Test account ids.

… red phase) - Created StepFunctionsFlareCommand class with 20 stubbed methods - Added comprehensive test suite covering all functionality - Created test fixtures for AWS API responses - All tests currently failing as expected (TDD red phase) - Ready for implementation of actual functionality Methods stubbed: - validateInputs: Validate required CLI inputs - getStateMachineConfiguration: Fetch state machine details - getStateMachineTags: Retrieve resource tags - getRecentExecutions: List recent execution history - getExecutionHistory: Get detailed execution events - getLogSubscriptions: Check CloudWatch log subscriptions - getCloudWatchLogs: Retrieve execution logs - maskStateMachineConfig: Redact sensitive configuration - maskExecutionData: Redact sensitive execution data - generateInsightsFile: Create troubleshooting insights - summarizeConfig: Generate configuration summary - getFramework: Detect deployment framework - createOutputDirectory: Set up output structure - writeOutputFiles: Save collected data - zipAnd 8000 Send: Package and send to Datadog - parseStateMachineArn: Extract ARN components - getLogGroupName: Extract log group from config - maskAslDefinition: Redact sensitive ASL fields - getExecutionDetails: Get execution metadata 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

- Implement comprehensive flare command following lambda flare pattern - Add methods to collect state machine config, executions, and logs - Include sensitive data masking for ASL definitions and execution data - Add framework detection (Serverless, SAM, CDK, Terraform) - Implement dry-run mode and CloudWatch logs collection (optional) - Add comprehensive test coverage with 35 passing tests - Update CLI registry and documentation The flare command helps users collect diagnostic information about their Step Functions for troubleshooting with Datadog support. 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>

…tion - Enhanced insights file with detailed state machine configuration - Improved dry-run mode messaging with clearer output - Created zip archive in both dry-run and normal modes - Added proper directory structure for output files - Removed unused zipAndSend method in favor of inline implementation - Updated tests to match new dry-run output messages 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

…real filesystem operations - Changed log subscription filters collection to always run (not gated by --with-logs flag) - Reverted to timestamped subdirectories (e.g., stepfunctions-StateMachineName-1234567890) - Added cleanup of old Step Functions flare directories and zip files before creating new ones - Updated to use FLARE_OUTPUT_DIRECTORY constant instead of hardcoded value - Removed filesystem mocking in tests - tests now use real filesystem operations - Updated tests to match new directory structure with timestamps - Added proper cleanup after tests - Fixed test expectations for new behavior 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

- Remove extra spaces after emojis in stdout messages - Change tag emoji from 🏷️ to 🔖 for better terminal alignment 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>

datadog-datadog-prod-us1 · 2025-06-17T20:56:06Z

Datadog Report

Branch report: ryan.strat/step-functions-flare
Commit report: 8dbec68
Test service: datadog-ci-tests

✅ 0 Failed, 2904 Passed, 0 Skipped, 3m 14.03s Total duration (5m 19.49s time saved)

- Add documentation link icon for flare command in README - Maintain consistency with other Step Functions commands 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

src/commands/stepfunctions/README.md

- Remove --region flag as it's automatically extracted from state machine ARN - Fix tests to properly set state machine ARN for region extraction 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

Replace spaces with tabs after all emoji in stdout output to ensure consistent alignment regardless of terminal emoji width rendering 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

- Revert to single space after emoji (matching lambda/cloud-run flare) - Replace problematic ℹ️ emoji with 📁 and 📦 for better alignment - Maintains consistency with existing flare commands in the codebase 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

- Add configuration analysis as the first section - Check for required settings (log level, execution data, etc.) - Validate Datadog integration and tags - Remove emoji from insights file for cleaner output - Move tags as subsection of state machine configuration - Remove encryption configuration section - Don't mention DD_TRACE_SAMPLE_RATE if absent (default is 1) 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

Fix linting issues identified by ESLint and Prettier 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

Replace 🌧️ with ☁️ for better terminal alignment 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

- Fix log group ARN parsing to handle :* suffix - Add debug output for log collection status - Logs are now properly collected when using --with-logs flag 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

- Remove --region flag from documentation example - Replace 'any' type with proper interface in test fixtures - Add practical examples to Step Functions README 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

- Handle "specified log group does not exist" error message - Gracefully handle missing log groups in both subscription filters and logs collection - Prevents flare from failing when log group doesn't exist 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

- Update getLogSubscriptions to return both filters and existence status - Distinguish between missing log groups vs missing subscription filters in configuration analysis - Add specific error message when log group doesn't exist - Add test coverage for log group existence detection 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

Copilot

Pull Request Overview

This PR introduces a new "stepfunctions flare" command to gather diagnostic information for AWS Step Functions, facilitating troubleshooting for Datadog support.

Updated CLI exports to include the new flare command.
Added test fixtures to support the flare command testing.
Expanded documentation in both the command-specific README and the main README to detail usage and parameters for the flare command.

Reviewed Changes

Copilot reviewed 4 out of 6 changed files in this pull request and generated no comments.

File	Description
src/commands/stepfunctions/cli.ts	Exports the new StepFunctionsFlareCommand along with existing commands.
src/commands/stepfunctions/tests/fixtures/stepfunctions-flare.ts	Provides fixtures for testing the flare command functionality.
src/commands/stepfunctions/README.md	Updates the documentation to include flare command usage and parameters.
README.md	Integrates the flare command into the main CLI documentation overview.

buraizu

Approving with a minor update requested

buraizu · 2025-06-18T21:19:08Z

src/commands/stepfunctions/README.md

 # stepfunctions commands

-You can use the `stepfunctions instrument` command to instrument your Step Functions with Datadog. This command enables instrumentation by subscribing Step Function logs to a [Datadog Forwarder](https://docs.datadoghq.com/logs/guide/forwarder/).
+The Step Functions commands allow you to manage Datadog instrumentation and troubleshooting for your AWS Step Functions:


Suggested change

The Step Functions commands allow you to manage Datadog instrumentation and troubleshooting for your AWS Step Functions:

The Step Function commands allow you to manage Datadog instrumentation and troubleshooting for your AWS Step Functions:

Minor suggestion for consistency -- as in line 5, Step Function is used when it's an attributive noun

ava-silver

given how long the flare command is, it feels like it should be split up into multiple modules if possible, i'll come back for more review later, once the initial comments have been addressed

ava-silver · 2025-06-20T18:35:12Z