-
I am in favour of the modularity idea: it will make it easier to swap out components, test new ideas, and might make maintenance a little easier as well. Two things come to mind:
But of course, as you mentioned, making everything more modular is a lot more effort upfront. I still think there should be one single TOPP tool that can call these individual modular components, though.
-
Achieving both modularity and performance is ideal, but there is no straightforward solution. As you've already mentioned, we need to balance key trade-offs: development speed and simplicity, usability, and performance.

I'd like to share some experiences we gained from transitioning from multiple tools to a more unified, single-tool solution for ProteomicsLFQ (the TOPP tool). We found that achieving maintainable modularity at the TOPP tool level is possible if we ensure that the tools act as thin wrappers, forwarding all parameters directly from the algorithm to the tool. This approach is demonstrated in ProteomicsLFQ (see code snippets here and here). In ProteomicsLFQ, we use algorithms that expose param objects across multiple steps to implement different stages of the workflow. This helped us address several issues:
Our current solution separates components that are flexible and frequently changing (e.g., the choice of search engines) from those we want to keep static and robust (e.g., alignment, feature detection, inference, quantification). However, this approach isn't without challenges. One major issue is the parallelization of components that could benefit from it. For example, feature detection could see performance gains from file-based splitting. Currently, we process a large experimental design file sequentially, with OpenMP parallelization at the file level. One potential solution is to introduce an advanced parameter that allows processing files in batches, with a driver batch that merges the results from the different, completed processes (e.g., processed on different computers).

In ProteomicsLFQ, we made a conceptual separation where algorithms operate on data structures, and tools handle loading and storing algorithm results. This might not be fully achievable in OpenSWATH, as we may need to incorporate some file streaming within the algorithms themselves. Consequently, algorithms might need to receive file names instead of data structures like MSExperiment. I think this would still be fine and would keep a clear separation of algorithm and tool.

Arguing against fine-grained modularity at the tool level is the fact that usability decreases and there is increased overhead in writing files. If data are only temporary results, they could be written out solely for debugging purposes and otherwise streamed more efficiently within a comprehensive tool that bundles several steps. This realization led us to move from the pipeline approach of LFQ to the single-tool ProteomicsLFQ. Additionally, the transition to a single ProteomicsLFQ tool gave us more freedom to experiment with file formats and data structures, as constantly writing out results imposes significant overhead in maintaining or updating schemas.
Based on our experience with the ProteomicsLFQ tool, I believe that OpenSWATHWorkflow is already performing quite well, although I haven't had the opportunity to review the entire codebase. However, it could be enhanced by deriving more parameters directly from the data and utilizing an OpenMS experimental design file. Playing devil's advocate, I propose that we can:
@jpfeuffer and @cbielow, what are your thoughts on this approach?
-
Originally, OpenSwathWorkflow was split across several different tools, e.g. OpenSwathFileSplitter, OpenSwathChromatogramExtractor, OpenSwathRTNormalizer, etc.
Many of these TOPP tools have not been maintained (e.g., ion mobility support was never added to them), and I do not think they are used much in practice anymore.
On the other hand, perhaps modularity is desirable compared to the current workflow.
Pros of Modularity:
- python (e.g. with novel deep learning peak picking approaches)
- Rust
- etc.
Cons of Modularity:
Ideally, we would have both, but as things stand I do not know if we have the bandwidth to support both workflows. Inevitably, some code gets updated while another section remains out of date until it is no longer usable.
My opinion is to move back towards a more modular approach for DIA analysis. I know this will be more effort up front; however, I think the benefits outweigh the costs. Originally, I believe, the workflow was amalgamated to make things easier on users, but in practice the tool is still quite overwhelming to new users. Furthermore, there are commercially available DIA tools for those who just want a click-and-done approach and are not interested in how the analysis is done, and we can use workflow managers to create pipelines for the community instead of doing everything internally.