8000 Refactor bbtools module to follow MultiQC conventions by ewels · Pull Request #3410 · MultiQC/MultiQC · GitHub
[go: up one dir, main page]

Skip to content

Conversation

@ewels
Copy link
Member
@ewels ewels commented Dec 3, 2025

This refactoring unifies the separate bbmap and bbduk modules into a single bbtools module following MultiQC conventions. The change addresses the issues raised in #3398 and #1146.

Equivalent PR to #3398 but with more significant code rewrite to try to match MultiQC coding style.

🤖 Written by Claude, with instruction and oversight by @ewels

Details

Key changes:

  • Merge bbmap and bbduk into unified bbtools module
  • Each subtool now has its own self-contained file with:
    • Independent parsing logic
    • Individual table/plot configuration
    • Separate data file output (multiqc_bbtools_)
  • Standard data structure: Dict[str, Dict] with sample names as keys
  • Consistent naming: bbtools/ in search patterns
  • Single entry point in pyproject.toml

Submodules included:

  • duk: BBDuk filtering statistics (from bbduk module)
  • stats: Adapter/contaminant filtering statistics
  • bbsplit: Reference genome alignment distribution
  • qchist: Quality value counts (+ Q30% general stats)
  • aqhist: Average read quality histogram
  • bhist: Base composition histogram
  • bqhist: Base quality boxplot data
  • covhist: Coverage depth histogram
  • ehist: Errors-per-read histogram
  • gchist: GC content histogram
  • idhist: Identity histogram
  • ihist: Insert size histogram
  • indelhist: Indel length histogram
  • lhist: Read length histogram
  • mhist: Match/sub/del/ins rates
  • qahist: Quality accuracy histogram
  • qhist: Quality by position histogram

Breaking changes:

  • Module name changed from bbmap/bbduk to bbtools
  • Search patterns renamed from bbmap/* to bbtools/*
  • Data files renamed from multiqc_bbmap.json to individual multiqc_bbtools_<subtool>.json files

Note: The previously unused file types (bincov, covstats, rpkm, statsfile, statsfile_machine) were not implemented as they were marked as not_implemented in the original module.

This refactoring unifies the separate `bbmap` and `bbduk` modules into a single
`bbtools` module following MultiQC conventions. The change addresses the issues
raised in #3398 and #1146.

Key changes:
- Merge bbmap and bbduk into unified bbtools module
- Each subtool now has its own self-contained file with:
  - Independent parsing logic
  - Individual table/plot configuration
  - Separate data file output (multiqc_bbtools_<subtool>)
- Standard data structure: Dict[str, Dict] with sample names as keys
- Consistent naming: bbtools/<subtool> in search patterns
- Single entry point in pyproject.toml

Submodules included:
- duk: BBDuk filtering statistics (from bbduk module)
- stats: Adapter/contaminant filtering statistics
- bbsplit: Reference genome alignment distribution
- qchist: Quality value counts (+ Q30% general stats)
- aqhist: Average read quality histogram
- bhist: Base composition histogram
- bqhist: Base quality boxplot data
- covhist: Coverage depth histogram
- ehist: Errors-per-read histogram
- gchist: GC content histogram
- idhist: Identity histogram
- ihist: Insert size histogram
- indelhist: Indel length histogram
- lhist: Read length histogram
- mhist: Match/sub/del/ins rates
- qahist: Quality accuracy histogram
- qhist: Quality by position histogram

Breaking changes:
- Module name changed from `bbmap`/`bbduk` to `bbtools`
- Search patterns renamed from `bbmap/*` to `bbtools/*`
- Data files renamed from `multiqc_bbmap.json` to individual
  `multiqc_bbtools_<subtool>.json` files

Note: The previously unused file types (bincov, covstats, rpkm, statsfile,
statsfile_machine) were not implemented as they were marked as not_implemented
in the original module.
@claude
Copy link
claude bot commented Dec 3, 2025

Code Review Summary

This is a well-executed refactoring that successfully consolidates the bbmap and bbduk modules into a unified bbtools module. The code follows MultiQC conventions, is well-structured, and appears functionally sound. A few minor issues and suggestions are noted below.

Detailed Review

✅ Strengths

  1. Excellent Architecture: The refactoring into self-contained submodule files is clean and maintainable. Each subtool has its own parse function following a consistent pattern.

  2. Proper MultiQC Conventions: The code follows CLAUDE.md guidelines:

    • ✅ Raises ModuleNoSamplesFound when appropriate
    • ✅ Calls self.add_software_version() (even with None for submodules without version info)
    • ✅ Calls self.write_data_file() at the end of each submodule
    • ✅ Entry point correctly added to pyproject.toml
    • ✅ Module class docstring instead of separate markdown
    • ✅ Info starts with capital letter
    • ✅ Uses f-strings and modern Python 3 syntax
  3. Consistent Data Structures: All submodules use Dict[str, Dict] with sample names as keys, making the code predictable and maintainable.

  4. Good Error Handling: Appropriate use of try-except blocks, logging for duplicates, and graceful handling of malformed data.

  5. Documentation< 8000 /strong>: Clear docstrings explain each submodule's purpose.

🔍 Issues & Suggestions

Minor Issues

  1. Namespace Consistency (duk.py:85, qchist.py:72, bbsplit.py:76)

    • Some submodules use namespace parameter in general_stats_addcols while the main module in old code did not
    • Recommendation: This is actually an improvement for namespace isolation, so keep it. Just ensure it's documented if users have custom configs referencing these stats.
  2. get_general_stats_headers Usage (qchist.py:70, bbsplit.py:74)

    • The code calls module.get_general_stats_headers(all_headers=...) but then checks if general_stats_headers: before adding
    • Question: What happens when get_general_stats_headers returns None or empty dict? This pattern assumes it can return falsy values to skip general stats addition
    • Recommendation: Verify this is the intended behavior or add a comment explaining it
  3. Inconsistent Version Tracking (duk.py:142 vs other modules)

    • Only duk.py actually parses and calls add_software_version with a real version
    • All other modules call module.add_software_version(None) with a comment "Superfluous function call"
    • Recommendation: This is acceptable per CLAUDE.md requirements, but consider whether version info could be extracted from other file types
  4. Magic Number in qchist (qchist.py:143)

    • cutoff = sumy * 0.999 - uses 99.9% threshold without explanation
    • Recommendation: Add a comment explaining why 99.9% is used for plot x-axis range determination
  5. Log Axis Zero Handling (qchist.py:158-161)

    • Adding 0.1 to zero counts is a workaround for log axis display
    • Comment in code mentions this, which is good
    • Recommendation: Consider if there's a better way to handle this in the plotting library itself
  6. Error Handling in BBSplit (bbsplit.py:211-213)

    • Catches (ValueError, IndexError) and logs warning
    • Good practice, but the error message could include the sample name for debugging
    • Recommendation: Include s_name in the warning message

Code Quality Observations

  1. Type Hints: ✅ Consistent use of type hints throughout
  2. Naming: ✅ Clear, descriptive variable and function names
  3. Separation of Concerns: ✅ Parse functions are separate from plotting logic
  4. DRY Principle: ✅ Minimal code duplication across submodules

Performance Considerations

  1. File Reading: Each submodule independently calls find_log_files, which is appropriate for flexibility
  2. Data Processing: No obvious performance issues - data structures are efficient
  3. Plot Data Preparation (qchist.py:136-163): The nested loops for calculating cutoff and preparing plot data could be combined for efficiency, but current implementation is clear and readable

Security Considerations

No security concerns identified:

  • Input validation with try-except blocks
  • No command execution or file writing to user-controlled paths
  • No SQL or external API calls
  • Regex patterns are simple and not vulnerable to ReDoS

Test Coverage

⚠️ Cannot assess test coverage without seeing test files:

  • Recommend verifying tests exist for all 18 submodules
  • Especially important for parsing edge cases (malformed files, missing columns, etc.)
  • Test data should cover single-end and paired-end scenarios

Breaking Changes

📋 Well-documented breaking changes:

  • Module name: bbmap/bbdukbbtools
  • Search patterns: bbmap/*bbtools/*
  • Data files: multiqc_bbmap.json → individual multiqc_bbtools_<subtool>.json

Recommendation: Consider adding a migration note in the changelog/release notes.

📝 Specific Line Comments

bbtools.py:68 - Using dict() instead of {} is less idiomatic but acceptable
duk.py:83 - Good use of get_general_stats_headers for user customization
qchist.py:159 - Adding 0.1 for log axis is creative but may confuse users seeing 0.1 for missing data
bbsplit.py:195 - Consider validating that parts[0] (ref_name) is not empty

🎯 Recommendations Summary

  1. Must Fix: None - code is production-ready
  2. Should Consider:
    • Add comment explaining 99.9% cutoff in qchist
    • Include sample names in BBSplit error messages
    • Verify test coverage for all submodules
  3. Nice to Have:
    • Document the namespace changes for users with custom configs
    • Consider version extraction from other file types

Final Verdict

LGTM with minor suggestions - This is high-quality code that successfully achieves its refactoring goals. The issues noted are minor and mostly suggestions for improvement rather than bugs. The code is ready to merge after considering the recommendations above.

Great work on maintaining code quality while performing a significant refactoring! 🎉

Update section names across all BBTools submodules to include the
subtool name (BBMap, BBDuk, BBSplit) for clarity without the overall
module name prefix.
@ewels ewels force-pushed the claude/refactor-bbtools-module-01GJTmkZRxwjh2bVTcx1RWdz branch from 5791f6e to 6e89180 Compare December 3, 2025 14:25
@ewels

This comment was marked as resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants

0