-
Notifications
You must be signed in to change notification settings - Fork 63
[SVLS-7027] Add Step Functions Flare Command #1701
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
… red phase) - Created StepFunctionsFlareCommand class with 20 stubbed methods - Added comprehensive test suite covering all functionality - Created test fixtures for AWS API responses - All tests currently failing as expected (TDD red phase) - Ready for implementation of actual functionality Methods stubbed: - validateInputs: Validate required CLI inputs - getStateMachineConfiguration: Fetch state machine details - getStateMachineTags: Retrieve resource tags - getRecentExecutions: List recent execution history - getExecutionHistory: Get detailed execution events - getLogSubscriptions: Check CloudWatch log subscriptions - getCloudWatchLogs: Retrieve execution logs - maskStateMachineConfig: Redact sensitive configuration - maskExecutionData: Redact sensitive execution data - generateInsightsFile: Create troubleshooting insights - summarizeConfig: Generate configuration summary - getFramework: Detect deployment framework - createOutputDirectory: Set up output structure - writeOutputFiles: Save collected data - zipAnd 8000 Send: Package and send to Datadog - parseStateMachineArn: Extract ARN components - getLogGroupName: Extract log group from config - maskAslDefinition: Redact sensitive ASL fields - getExecutionDetails: Get execution metadata 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
- Implement comprehensive flare command following lambda flare pattern - Add methods to collect state machine config, executions, and logs - Include sensitive data masking for ASL definitions and execution data - Add framework detection (Serverless, SAM, CDK, Terraform) - Implement dry-run mode and CloudWatch logs collection (optional) - Add comprehensive test coverage with 35 passing tests - Update CLI registry and documentation The flare command helps users collect diagnostic information about their Step Functions for troubleshooting with Datadog support. 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>
…tion - Enhanced insights file with detailed state machine configuration - Improved dry-run mode messaging with clearer output - Created zip archive in both dry-run and normal modes - Added proper directory structure for output files - Removed unused zipAndSend method in favor of inline implementation - Updated tests to match new dry-run output messages 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
…real filesystem operations - Changed log subscription filters collection to always run (not gated by --with-logs flag) - Reverted to timestamped subdirectories (e.g., stepfunctions-StateMachineName-1234567890) - Added cleanup of old Step Functions flare directories and zip files before creating new ones - Updated to use FLARE_OUTPUT_DIRECTORY constant instead of hardcoded value - Removed filesystem mocking in tests - tests now use real filesystem operations - Updated tests to match new directory structure with timestamps - Added proper cleanup after tests - Fixed test expectations for new behavior 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
- Remove extra spaces after emojis in stdout messages - Change tag emoji from 🏷️ to 🔖 for better terminal alignment 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>
Datadog ReportBranch report: ✅ 0 Failed, 2904 Passed, 0 Skipped, 3m 14.03s Total duration (5m 19.49s time saved) |
- Add documentation link icon for flare command in README - Maintain consistency with other Step Functions commands 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
- Remove --region flag as it's automatically extracted from state machine ARN - Fix tests to properly set state machine ARN for region extraction 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
Replace spaces with tabs after all emoji in stdout output to ensure consistent alignment regardless of terminal emoji width rendering 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
- Revert to single space after emoji (matching lambda/cloud-run flare) - Replace problematic ℹ️ emoji with 📁 and 📦 for better alignment - Maintains consistency with existing flare commands in the codebase 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
- Add configuration analysis as the first section - Check for required settings (log level, execution data, etc.) - Validate Datadog integration and tags - Remove emoji from insights file for cleaner output - Move tags as subsection of state machine configuration - Remove encryption configuration section - Don't mention DD_TRACE_SAMPLE_RATE if absent (default is 1) 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
Fix linting issues identified by ESLint and Prettier 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
Replace 🌧️ with ☁️ for better terminal alignment 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
- Fix log group ARN parsing to handle :* suffix - Add debug output for log collection status - Logs are now properly collected when using --with-logs flag 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
- Remove --region flag from documentation example - Replace 'any' type with proper interface in test fixtures - Add practical examples to Step Functions README 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
- Handle "specified log group does not exist" error message - Gracefully handle missing log groups in both subscription filters and logs collection - Prevents flare from failing when log group doesn't exist 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
- Update getLogSubscriptions to return both filters and existence status - Distinguish between missing log groups vs missing subscription filters in configuration analysis - Add specific error message when log group doesn't exist - Add test coverage for log group existence detection 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces a new "stepfunctions flare" command to gather diagnostic information for AWS Step Functions, facilitating troubleshooting for Datadog support.
- Updated CLI exports to include the new flare command.
- Added test fixtures to support the flare command testing.
- Expanded documentation in both the command-specific README and the main README to detail usage and parameters for the flare command.
Reviewed Changes
Copilot reviewed 4 out of 6 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/commands/stepfunctions/cli.ts | Exports the new StepFunctionsFlareCommand along with existing commands. |
| src/commands/stepfunctions/tests/fixtures/stepfunctions-flare.ts | Provides fixtures for testing the flare command functionality. |
| src/commands/stepfunctions/README.md | Updates the documentation to include flare command usage and parameters. |
| README.md | Integrates the flare command into the main CLI documentation overview. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving with a minor update requested
| # stepfunctions commands | ||
|
|
||
| You can use the `stepfunctions instrument` command to instrument your Step Functions with Datadog. This command enables instrumentation by subscribing Step Function logs to a [Datadog Forwarder](https://docs.datadoghq.com/logs/guide/forwarder/). | ||
| The Step Functions commands allow you to manage Datadog instrumentation and troubleshooting for your AWS Step Functions: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| The Step Functions commands allow you to manage Datadog instrumentation and troubleshooting for your AWS Step Functions: | |
| The Step Function commands allow you to manage Datadog instrumentation and troubleshooting for your AWS Step Functions: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor suggestion for consistency -- as in line 5, Step Function is used when it's an attributive noun
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
given how long the flare command is, it feels like it should be split up into multiple modules if possible, i'll come back for more review later, once the initial comments have been addressed
|
|
||
| // CLI Options | ||
| private isDryRun = Option.Boolean('-d,--dry,--dry-run', false) | ||
| private withLogs = Option.Boolean('--with-logs', false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
naming nit: should this be isWithLogs or hasLogs or something to follow best practices?
| private email = Option.String('-e,--email') | ||
| private start = Option.String('--start') | ||
| private end = Option.String('--end') | ||
| private maxExecutions = Option.String('--max-executions', '10') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we probably want to validate this as an integer (docs)
| private start = Option.String('--start') | ||
| private end = Option.String('--end') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and these should probably be validated as Dates as well
| }) | ||
|
|
||
| // CLI Options | ||
| private isDryRun = Option.Boolean('-d,--dry,--dry-run', false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe a stupid question -- does dry run even make sense for the flare command?
| if (this.caseId === undefined) { | ||
| errorMessages.push(helpersRenderer.renderError('No case ID specified. [-c,--case-id]')) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be done directly in the Option.String(...) settings (docs)
(same goes for state machine arn and email)
| this.credentials = await getAWSCredentials() | ||
| } catch (err) { | ||
| if (err instanceof Error) { | ||
| this.context.stderr.write(helpersRenderer.renderError(err.message)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe worth adding a prefix like "Unable to authenticate with AWS: err.message"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also if for some reason we don't get an Error, we will exit with no log, which seems not good
|
|
||
| // Get execution history | ||
| const history = await this.getExecutionHistory(sfnClient, execution.executionArn) | ||
| ;(execution as any).history = history |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is the any cast necessary?
| // 4. Get execution details and history for each execution | ||
| this.context.stdout.write('📜 Fetching execution details and history...\n') | ||
|
|
||
| for (const execution of executions.slice(0, 5)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
feels like it should be a helper function
| // 5. Get log subscription filters (always collected) | ||
| let subscriptionFilters: SubscriptionFilter[] | undefined | ||
| let logGroupExists = true | ||
| const logGroupName = this.getLogGroupName(stateMachineConfig) | ||
| if (logGroupName) { | ||
| this.context.stdout.write('🔍 Getting log subscription filters...\n') | ||
| const result = await this.getLogSubscriptions(cloudWatchLogsClient, logGroupName) | ||
| subscriptionFilters = result.filters | ||
| logGroupExists = result.exists | ||
| } | ||
|
|
||
| // 6. Get CloudWatch logs if enabled | ||
| let logs: Map<string, OutputLogEvent[]> | undefined | ||
| if (this.withLogs && logGroupName) { | ||
| this.context.stdout.write('☁️ Getting CloudWatch logs...\n') | ||
| const startTime = this.start ? new Date(this.start).getTime() : undefined | ||
| const endTime = this.end ? new Date(this.end).getTime() : undefined | ||
| logs = await this.getCloudWatchLogs(cloudWatchLogsClient, logGroupName, startTime, endTime) | ||
| if (!logs || logs.size === 0) { | ||
| this.context.stdout.write(' No logs found in the specified time range\n') | ||
| } else { | ||
| this.context.stdout.write(` Found logs from ${logs.size} log streams\n`) | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
feels like it should also be helper functions
| const tags: Record<string, string> = {} | ||
| if (response.tags) { | ||
| for (const tag of response.tags) { | ||
| if (tag.key && tag.value) { | ||
| tags[tag.key] = tag.value | ||
| } | ||
| } | ||
| } | ||
|
|
||
| return tags |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could be simplified:
| const tags: Record<string, string> = {} | |
| if (response.tags) { | |
| for (const tag of response.tags) { | |
| if (tag.key && tag.value) { | |
| tags[tag.key] = tag.value | |
| } | |
| } | |
| } | |
| return tags | |
| return Object.fromEntries((response.tags ?? []).filter(({key, value}) => key && value).map(({key, value}) => [key, value])) |
| private async getExecutionDetails(sfnClient: SFNClient, executionArn: string): Promise<any> { | ||
| const command = new DescribeExecutionCommand({ | ||
| executionArn, | ||
| }) | ||
|
|
||
| return sfnClient.send(command) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
most of these single method helper functions hurt readability, since they can easily be inlined and end up bloating the file. we're probably better off just inlining them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🙌 This is an awesome hackathon project and it's great to see this kind of initiative!
To consider releasing this to production, the next step would be for someone from the ONE team to collaborate on the project. That will help make sure the implementation aligns with our team's existing patterns and we're positioned well to maintain and support it in the long term. Let's discuss how we can best pull this into one of our upcoming sprints.
Summary
This PR adds a new
stepfunctions flarecommand to collect diagnostic information for AWS Step Functions, enabling efficient troubleshooting with Datadog support.What it does
The flare command gathers comprehensive diagnostic data about your Step Functions state machines:
Key Features
Configuration Analysis
The command generates an INSIGHTS.md file that automatically checks for:
Usage
Test Coverage
Sample Insights:
Full insights file omitted to avoid leaking DD Test account ids.
