8000 [SVLS-7027] Add Step Functions Flare Command by ryanstrat · Pull Request #1701 · DataDog/datadog-ci · GitHub
[go: up one dir, main page]

Skip to content

Conversation

@ryanstrat
Copy link
@ryanstrat ryanstrat commented Jun 17, 2025

Summary

This PR adds a new stepfunctions flare command to collect diagnostic information for AWS Step Functions, enabling efficient troubleshooting with Datadog support.

What it does

The flare command gathers comprehensive diagnostic data about your Step Functions state machines:

  • State machine configuration and tags
  • Recent execution history with detailed events
  • CloudWatch log subscription filters
  • CloudWatch logs (when --with-logs is specified)
  • Local project files for context

Key Features

  • 🔐 Sensitive Data Protection: Automatically masks sensitive fields in ASL definitions and execution data
  • 📊 Configuration Analysis: Detects common misconfigurations that prevent proper Datadog integration
  • 🎯 Smart Diagnostics: Identifies missing log groups, incorrect log levels, and missing Datadog forwarder subscriptions
  • 📁 Organized Output: Creates timestamped directories with all diagnostic files
  • 🚀 Easy Sharing: Zips all data for simple upload to Datadog support

Configuration Analysis

The command generates an INSIGHTS.md file that automatically checks for:

  • Log level set to "ALL" (required for complete traces)
  • Include Execution Data enabled (required for detailed monitoring)
  • CloudWatch log group existence and accessibility
  • Datadog forwarder subscription on log groups
  • Proper Datadog tags (DD_TRACE_ENABLED, DD_TRACE_SAMPLE_RATE)
  • Potential conflicts with X-Ray tracing

Usage

# Basic flare collection
datadog-ci stepfunctions flare \
  --state-machine arn:aws:states:us-east-1:123456789012:stateMachine:MyStateMachine \
  --case-id 12345 \
  --email support@example.com

# Include CloudWatch logs
datadog-ci stepfunctions flare \
  --state-machine arn:aws:states:us-east-1:123456789012:stateMachine:MyStateMachine \
  --case-id 12345 \
  --email support@example.com \
  --with-logs \
  --start "2024-01-01" \
  --end "2024-01-07"

# Dry run to preview
datadog-ci stepfunctions flare \
  --state-machine arn:aws:states:us-east-1:123456789012:stateMachine:MyStateMachine \
  --case-id 12345 \
  --email support@example.com \
  --dry-run

Test Coverage

  • 32 unit tests covering all functionality
  • Sensitive data masking verification
  • Configuration analysis validation
  • Error handling for missing resources
  • AWS API interaction testing

Sample Insights:

Full insights file omitted to avoid leaking DD Test account ids.
image

ryanstrat and others added 7 commits June 17, 2025 13:39
… red phase)

- Created StepFunctionsFlareCommand class with 20 stubbed methods
- Added comprehensive test suite covering all functionality
- Created test fixtures for AWS API responses
- All tests currently failing as expected (TDD red phase)
- Ready for implementation of actual functionality

Methods stubbed:
- validateInputs: Validate required CLI inputs
- getStateMachineConfiguration: Fetch state machine details
- getStateMachineTags: Retrieve resource tags
- getRecentExecutions: List recent execution history
- getExecutionHistory: Get detailed execution events
- getLogSubscriptions: Check CloudWatch log subscriptions
- getCloudWatchLogs: Retrieve execution logs
- maskStateMachineConfig: Redact sensitive configuration
- maskExecutionData: Redact sensitive execution data
- generateInsightsFile: Create troubleshooting insights
- summarizeConfig: Generate configuration summary
- getFramework: Detect deployment framework
- createOutputDirectory: Set up output structure
- writeOutputFiles: Save collected data
- zipAnd
8000
Send: Package and send to Datadog
- parseStateMachineArn: Extract ARN components
- getLogGroupName: Extract log group from config
- maskAslDefinition: Redact sensitive ASL fields
- getExecutionDetails: Get execution metadata

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Implement comprehensive flare command following lambda flare pattern
- Add methods to collect state machine config, executions, and logs
- Include sensitive data masking for ASL definitions and execution data
- Add framework detection (Serverless, SAM, CDK, Terraform)
- Implement dry-run mode and CloudWatch logs collection (optional)
- Add comprehensive test coverage with 35 passing tests
- Update CLI registry and documentation

The flare command helps users collect diagnostic information about their
Step Functions for troubleshooting with Datadog support.

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
…tion

- Enhanced insights file with detailed state machine configuration
- Improved dry-run mode messaging with clearer output
- Created zip archive in both dry-run and normal modes
- Added proper directory structure for output files
- Removed unused zipAndSend method in favor of inline implementation
- Updated tests to match new dry-run output messages

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…real filesystem operations

- Changed log subscription filters collection to always run (not gated by --with-logs flag)
- Reverted to timestamped subdirectories (e.g., stepfunctions-StateMachineName-1234567890)
- Added cleanup of old Step Functions flare directories and zip files before creating new ones
- Updated to use FLARE_OUTPUT_DIRECTORY constant instead of hardcoded value
- Removed filesystem mocking in tests - tests now use real filesystem operations
- Updated tests to match new directory structure with timestamps
- Added proper cleanup after tests
- Fixed test expectations for new behavior

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Remove extra spaces after emojis in stdout messages
- Change tag emoji from 🏷️ to 🔖 for better terminal alignment

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
@ryanstrat ryanstrat requested a review from Copilot June 17, 2025 20:54

This comment was marked as outdated.

@datadog-datadog-prod-us1
Copy link
datadog-datadog-prod-us1 bot commented Jun 17, 2025

Datadog Report

Branch report: ryan.strat/step-functions-flare
Commit report: 8dbec68
Test service: datadog-ci-tests

✅ 0 Failed, 2904 Passed, 0 Skipped, 3m 14.03s Total duration (5m 19.49s time saved)

- Add documentation link icon for flare command in README
- Maintain consistency with other Step Functions commands

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
ryanstrat and others added 2 commits June 17, 2025 17:16
- Remove --region flag as it's automatically extracted from state machine ARN
- Fix tests to properly set state machine ARN for region extraction

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Replace spaces with tabs after all emoji in stdout output to ensure
consistent alignment regardless of terminal emoji width rendering

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@ryanstrat ryanstrat changed the title [Hackathon] Add Step Functions flare command [Draft] Add Step Functions Flare command Jun 18, 2025
ryanstrat and others added 5 commits June 18, 2025 10:43
- Revert to single space after emoji (matching lambda/cloud-run flare)
- Replace problematic ℹ️ emoji with 📁 and 📦 for better alignment
- Maintains consistency with existing flare commands in the codebase

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add configuration analysis as the first section
- Check for required settings (log level, execution data, etc.)
- Validate Datadog integration and tags
- Remove emoji from insights file for cleaner output
- Move tags as subsection of state machine configuration
- Remove encryption configuration section
- Don't mention DD_TRACE_SAMPLE_RATE if absent (default is 1)

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Fix linting issues identified by ESLint and Prettier

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Replace 🌧️ with ☁️ for better terminal alignment

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Fix log group ARN parsing to handle :* suffix
- Add debug output for log collection status
- Logs are now properly collected when using --with-logs flag

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@ryanstrat ryanstrat requested a review from Copilot June 18, 2025 17:05
@ryanstrat ryanstrat added enhancement New feature or request serverless Related to [aas, cloud-run, lambda, stepfunctions] labels Jun 18, 2025

This comment was marked as outdated.

ryanstrat and others added 3 commits June 18, 2025 13:10
- Remove --region flag from documentation example
- Replace 'any' type with proper interface in test fixtures
- Add practical examples to Step Functions README

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Handle "specified log group does not exist" error message
- Gracefully handle missing log groups in both subscription filters and logs collection
- Prevents flare from failing when log group doesn't exist

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Update getLogSubscriptions to return both filters and existence status
- Distinguish between missing log groups vs missing subscription filters in configuration analysis
- Add specific error message when log group doesn't exist
- Add test coverage for log group existence detection

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@ryanstrat ryanstrat requested a review from Copilot June 18, 2025 17:42
Copy link
Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces a new "stepfunctions flare" command to gather diagnostic information for AWS Step Functions, facilitating troubleshooting for Datadog support.

  • Updated CLI exports to include the new flare command.
  • Added test fixtures to support the flare command testing.
  • Expanded documentation in both the command-specific README and the main README to detail usage and parameters for the flare command.

Reviewed Changes

Copilot reviewed 4 out of 6 changed files in this pull request and generated no comments.

File Description
src/commands/stepfunctions/cli.ts Exports the new StepFunctionsFlareCommand along with existing commands.
src/commands/stepfunctions/tests/fixtures/stepfunctions-flare.ts Provides fixtures for testing the flare command functionality.
src/commands/stepfunctions/README.md Updates the documentation to include flare command usage and parameters.
README.md Integrates the flare command into the main CLI documentation overview.

@ryanstrat ryanstrat changed the title [Draft] Add Step Functions Flare command [SVLS-7027] Add Step Functions Flare command Jun 18, 2025
@ryanstrat ryanstrat changed the title [SVLS-7027] Add Step Functions Flare command [SVLS-7027] Add Step Functions Flare Command Jun 18, 2025
@ryanstrat ryanstrat marked this pull request as ready for review June 18, 2025 17:55
@ryanstrat ryanstrat requested review from a team as code owners June 18, 2025 17:55
@ryanstrat ryanstrat requested a review from ava-silver June 18, 2025 17:55
Copy link
Contributor
@buraizu buraizu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving with a minor update requested

# stepfunctions commands

You can use the `stepfunctions instrument` command to instrument your Step Functions with Datadog. This command enables instrumentation by subscribing Step Function logs to a [Datadog Forwarder](https://docs.datadoghq.com/logs/guide/forwarder/).
The Step Functions commands allow you to manage Datadog instrumentation and troubleshooting for your AWS Step Functions:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The Step Functions commands allow you to manage Datadog instrumentation and troubleshooting for your AWS Step Functions:
The Step Function commands allow you to manage Datadog instrumentation and troubleshooting for your AWS Step Functions:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor suggestion for consistency -- as in line 5, Step Function is used when it's an attributive noun

Copy link
Contributor
@ava-silver ava-silver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given how long the flare command is, it feels like it should be split up into multiple modules if possible, i'll come back for more review later, once the initial comments have been addressed

9E88

// CLI Options
private isDryRun = Option.Boolean('-d,--dry,--dry-run', false)
private withLogs = Option.Boolean('--with-logs', false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naming nit: should this be isWithLogs or hasLogs or something to follow best practices?

private email = Option.String('-e,--email')
private start = Option.String('--start')
private end = Option.String('--end')
private maxExecutions = Option.String('--max-executions', '10')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we probably want to validate this as an integer (docs)

Comment on lines +58 to +59
private start = Option.String('--start')
private end = Option.String('--end')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and these should probably be validated as Dates as well

6D38
})

// CLI Options
private isDryRun = Option.Boolean('-d,--dry,--dry-run', false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe a stupid question -- does dry run even make sense for the flare command?

Comment on lines +256 to +258
if (this.caseId === undefined) {
errorMessages.push(helpersRenderer.renderError('No case ID specified. [-c,--case-id]'))
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be done directly in the Option.String(...) settings (docs)

(same goes for state machine arn and email)

this.credentials = await getAWSCredentials()
} catch (err) {
if (err instanceof Error) {
this.context.stderr.write(helpersRenderer.renderError(err.message))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe worth adding a prefix like "Unable to authenticate with AWS: err.message"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also if for some reason we don't get an Error, we will exit with no log, which seems not good


// Get execution history
const history = await this.getExecutionHistory(sfnClient, execution.executionArn)
;(execution as any).history = history
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is the any cast necessary?

Comment on lines +129 to +132
// 4. Get execution details and history for each execution
this.context.stdout.write('📜 Fetching execution details and history...\n')

for (const execution of executions.slice(0, 5)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feels like it should be a helper function

Comment on lines +146 to +169
// 5. Get log subscription filters (always collected)
let subscriptionFilters: SubscriptionFilter[] | undefined
let logGroupExists = true
const logGroupName = this.getLogGroupName(stateMachineConfig)
if (logGroupName) {
this.context.stdout.write('🔍 Getting log subscription filters...\n')
const result = await this.getLogSubscriptions(cloudWatchLogsClient, logGroupName)
subscriptionFilters = result.filters
logGroupExists = result.exists
}

// 6. Get CloudWatch logs if enabled
let logs: Map<string, OutputLogEvent[]> | undefined
if (this.withLogs && logGroupName) {
this.context.stdout.write('☁️ Getting CloudWatch logs...\n')
const startTime = this.start ? new Date(this.start).getTime() : undefined
const endTime = this.end ? new Date(this.end).getTime() : undefined
logs = await this.getCloudWatchLogs(cloudWatchLogsClient, logGroupName, startTime, endTime)
if (!logs || logs.size === 0) {
this.context.stdout.write(' No logs found in the specified time range\n')
} else {
this.context.stdout.write(` Found logs from ${logs.size} log streams\n`)
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feels like it should also be helper functions

Comment on lines +315 to +324
const tags: Record<string, string> = {}
if (response.tags) {
for (const tag of response.tags) {
if (tag.key && tag.value) {
tags[tag.key] = tag.value
}
}
}

return tags
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be simplified:

Suggested change
const tags: Record<string, string> = {}
if (response.tags) {
for (const tag of response.tags) {
if (tag.key && tag.value) {
tags[tag.key] = tag.value
}
}
}
return tags
return Object.fromEntries((response.tags ?? []).filter(({key, value}) => key && value).map(({key, value}) => [key, value]))

Comment on lines +989 to +995
private async getExecutionDetails(sfnClient: SFNClient, executionArn: string): Promise<any> {
const command = new DescribeExecutionCommand({
executionArn,
})

return sfnClient.send(command)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

most of these single method helper functions hurt readability, since they can easily be inlined and end up bloating the file. we're probably better off just inlining them

@clifordshelton clifordshelton self-requested a review June 20, 2025 20:38
Copy link
@clifordshelton clifordshelton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙌 This is an awesome hackathon project and it's great to see this kind of initiative!

To consider releasing this to production, the next step would be for someone from the ONE team to collaborate on the project. That will help make sure the implementation aligns with our team's existing patterns and we're positioned well to maintain and support it in the long term. Let's discuss how we can best pull this into one of our upcoming sprints.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request serverless Related to [aas, cloud-run, lambda, stepfunctions]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants

0