-
I am in favour of the modularity idea: it will make it easier to swap out components, test new ideas, and might make maintenance a little easier as well. Two things come to mind:
But of course, as you mentioned, making everything more modular is a lot more effort upfront. I still think there should be one single TOPP tool that can call these individual modular components, though.
-
Achieving both modularity and performance is ideal, but there is no straightforward solution. As you've already mentioned, we need to balance key trade-offs: development speed and simplicity, usability, and performance.

I'd like to share some experiences we gained from transitioning from multiple tools to a more unified, single-tool solution for ProteomicsLFQ (the TOPP tool). We found that achieving maintainable modularity at the TOPP tool level is possible if we ensure that the tools act as thin wrappers, forwarding all parameters directly from the algorithm to the tool. This approach is demonstrated in ProteomicsLFQ (see code snippets here and here). In ProteomicsLFQ, we use algorithms that expose param objects across multiple steps to implement different stages of the workflow. This helped us address several issues:
Our current solution separates components that are flexible and frequently changing (e.g., the choice of search engines) from those we want to keep static and robust (e.g., alignment, feature detection, inference, quantification). However, this approach isn't without challenges. One major issue is the parallelization of components that could benefit from it. For example, feature detection could see performance gains from file-based splitting. Currently, we process a large experimental design file sequentially, with OpenMP parallelization at the file level. One potential solution is to introduce an advanced parameter that allows processing files in batches, with a driver batch that merges the results from the different, completed processes (e.g., processed on different computers).

In ProteomicsLFQ, we made a conceptual separation where algorithms operate on data structures, and tools handle loading and storing algorithm results. This might not be fully achievable in OpenSWATH, as we may need to incorporate some file streaming within the algorithms themselves. Consequently, algorithms might need to receive file names instead of data structures like MSExperiment. I think this would still be fine and would keep a clear separation of algorithm and tool.

Arguing against fine-grained modularity at the tool level is the fact that usability decreases and there is increased overhead in writing files. If data are only temporary results, they could be written out solely for debugging purposes and otherwise streamed more efficiently within a comprehensive tool that bundles several steps. This realization led us to move from the pipeline approach of LFQ to the single-tool ProteomicsLFQ. Additionally, the transition to a single ProteomicsLFQ tool gave us more freedom to experiment with file formats and data structures, as constantly writing out results imposes significant overhead in maintaining or updating schemas.
Based on our experience with the ProteomicsLFQ tool, I believe that OpenSWATHWorkflow is already performing quite well, although I haven't had the opportunity to review the entire codebase. However, it could be enhanced by deriving more parameters directly from the data and utilizing an OpenMS experimental design file. Playing devil's advocate, I propose that we can:
@jpfeuffer and @cbielow, what are your thoughts on this approach?
-
Originally, OpenSwathWorkflow was split across several different tools, e.g. OpenSwathFileSplitter, OpenSwathChromatogramExtractor, OpenSwathRTNormalizer, etc.
Many of these TOPP tools have not been maintained (e.g., ion mobility support was never added to them), and I do not think they are used much in practice anymore.
On the other hand, perhaps modularity is desirable compared to the current workflow.
Pros of Modularity:
- python (e.g. with novel deep learning peak picking approaches)
- Rust
- etc.
Cons of Modularity:
Ideally, we would have both, but as things stand I do not know if we have the bandwidth to support both workflows. Inevitably, some code gets updated while another section remains out of date until it is no longer usable.
My opinion is to move back towards a more modular approach for DIA analysis. I know this will be more effort up front; however, I think the benefits outweigh the costs. Originally, I believe, the workflow was amalgamated to make things easier on users, but in practice the tool is still quite overwhelming to new users. Furthermore, there are commercially available DIA tools for those who just want a click-and-done approach and are not interested in how the analysis is done, and we can use workflow managers to create pipelines for the community instead of doing everything internally.