Abstract
Parton labeling methods are widely used when reconstructing collider events with top quarks or other massive particles. State-of-the-art techniques are based on machine learning and require training data with events that have been matched using simulations with truth information. In nature, there is no unique matching between partons and final state objects due to the properties of the strong force and due to acceptance effects. We propose a new approach to parton labeling that circumvents these challenges by recycling regression models. The final state objects that are most relevant for a regression model to predict the properties of a particular top quark are assigned to said parent particle without having any parton-matched training data. This approach is demonstrated using simulated events with top quarks and outperforms the widely-used \(\chi ^2\) method.
1 Introduction
A common task in collider event reconstruction is assigning final state objects to a branch of the hypothesized reaction that generated the event. For example, hard-scatter events with outgoing quarks and gluons produce jets that can be associated with their initiating partons. When there are many outgoing particles from the hard-scatter reaction, this is a complex combinatorial challenge. Events with multiple top quarks naturally result in such final states, since nearly all top quarks decay to a b-quark and a W boson, which subsequently decays to two quarks or leptons. A key challenge in many measurements and searches involving top quarks is the assignment of reconstructed objects to the top quark decay products. Classically, this assignment has used \(\chi ^2\) or related methods that enumerate all possibilities and pick the one that is most consistent with on-shell W boson and top quark intermediaries. The difficulty with these methods is that they do not take into account all available information and are computationally expensive.
A number of modern machine learning (ML) methods have been proposed to address these challenges. These techniques range from Boosted Decision Trees [1,2,3] and existing neural networks [4, 5] to custom, permutation invariant deep learning methods [6,7,8,9]. In all cases, object identification can make use of a variety of lepton-, jet- and event-level properties that were inaccessible with \(\chi ^2\) or likelihood methods [10]. This is possible because the ML approaches are trained on simulations, so whatever information is available and well-modeled (within uncertainty) can be used for object labeling.
Despite the success of these ML methods, they all share a common fundamental challenge with classical approaches. In particular, they all require matched objects for training. This may be problematic for two reasons (see e.g. Fig. 1). First, there is no unique match between a hard-scatter quark/gluon and a jet. A single quark/gluon can fragment into multiple jets, and a single jet can be composed of hadrons with energy flow originating from multiple quarks/gluons. This is particularly acute for top quarks, which carry color charge and thus must be color-connected to another quark/gluon in the event. The extent of the overlap also depends on the jet clustering algorithm - jets with a larger catchment area [11] are more likely to be due to the merger of multiple parton showers. Second, even if a parent object like a top quark could be uniquely associated with a set of decay products, acceptance effects will obscure the association. In particular, the finite geometric and energy acceptance of detectors results in missed final state objects.
Our philosophy is to circumvent the issues caused by object-parton matching by directly regressing onto the target particle properties. In Ref. [12], we designed the Covariant Particle Transformer (CPT), a partially Lorentz covariant point cloud transformer, to learn the four-vectors of top quarks given reconstructed jets, leptons, photons, and missing energy. In this paper, we show how one can reuse such a regression method to perform parton labeling. We explore two possibilities, one based on the attention mechanism within the CPT and one based on the gradient of predicted four-vectors with respect to the inputs. The latter approach is compatible with any regression-based top quark reconstruction method, even if it does not involve neural network attention. While we still advocate for regression in cases where the underlying top quark properties are needed, parton labeling is still widely used for determining these properties and no matter what approach is used, parton labels can be useful for diagnostic purposes.
This paper is organized as follows. Section 2 briefly reviews the CPT technique and then introduces our two approaches to extracting parton labels from the regression model. Numerical results are presented in Sect. 4 using a dataset that is briefly introduced in Sect. 3. The paper ends with conclusions and outlook in Sect. 5.
2 Methods
Our goal is to take final states with n top quarks that decay hadronically and assign jets to one of these quarks. In principle, one could simultaneously predict n and assign jets, but in practice, there is often a particular number of target quarks; if not, one could first run a multi-class classification procedure. We also restrict our approach to assigning three jets to each top quark. Both ML-based approaches described below could be modified to assign fewer or more jets by placing thresholds on the Jacobian values (Sect. 2.2) or the attention weights (Sect. 2.3), but we leave this to future work.
2.1 Covariant particle transformer
The Covariant Particle Transformer (CPT) is a Transformer-based [13] neural network tailored for collider physics applications and has demonstrated superior performance in predicting top quarks’ kinematics compared to classical approaches [12]. CPT takes as inputs the 4-vectors and particle identifications of all observed final state objects (jets, leptons, photons, etc.) and outputs predicted 4-vectors of a pre-specified number of top quarks. Compared to the standard transformer architecture, CPT is designed to respect important symmetries in collider physics: it is permutation invariant under reordering of the inputs and partially Lorentz covariant, meaning that if we apply a longitudinal boost and/or a transverse rotation to all the inputs, CPT’s outputs will be boosted and/or rotated accordingly, respecting Lorentz symmetry.
In each layer of the network, CPT additively updates the feature vector \(f_i\) of every object (which can be an input or an output) with \(\Delta f_i\), defined as a function of all the feature vectors \(\{f_k\}\):
\[\Delta f_i = \sum _k \alpha _{ik}\, \varphi (f_k), \qquad (1)\]
where \(\varphi \) is a learned linear transformation and \(\{\alpha _{ik}\}\) are positive attention weights, which are themselves non-linear functions of \(\{f_k\},\) such that \(\sum _k \alpha _{ik} = 1\) for each i. The output feature vectors are eventually transformed to the predicted 4-vectors of the top quarks. If i is an output index and k is an input index, then intuitively \(\alpha _{ik}\) measures the importance of the information in k for predicting the properties of i. The above procedure is named the covariant attention mechanism, which modifies the standard attention mechanism in a transformer to ensure partial Lorentz covariance. To capture complex correlations between the inputs and outputs, CPT uses \(L=6\) covariant attention layers and \(H=4\) attention heads per layer to decode the top quark 4-vectors, where each attention head performs separate learned updates according to Eq. (1) for added flexibility. We refer readers to the original CPT paper for a more comprehensive review of the architecture and implementation.
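To make the additive update concrete, the following is a minimal sketch of one attention step, assuming the update takes the weighted-sum form \(\Delta f_i = \sum _k \alpha _{ik} \varphi (f_k)\). The names `attention_update`, `W_phi`, and the dot-product scores are illustrative assumptions: the actual CPT computes its weights from partially Lorentz covariant quantities, which this toy version does not attempt to reproduce.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, v):
    """Apply the linear map W to the vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def attention_update(features, W_phi):
    """One additive update f_i <- f_i + sum_k alpha_ik * phi(f_k).

    W_phi is a square matrix standing in for the learned linear map phi.
    """
    phi = [matvec(W_phi, f) for f in features]
    updated, weights = [], []
    for f_i in features:
        # Assumption: plain dot-product scores, NOT the covariant CPT scores.
        scores = [sum(a * b for a, b in zip(f_i, f_k)) for f_k in features]
        alpha_i = softmax(scores)
        delta = [sum(a * p[d] for a, p in zip(alpha_i, phi))
                 for d in range(len(f_i))]
        updated.append([x + dx for x, dx in zip(f_i, delta)])
        weights.append(alpha_i)
    return updated, weights

features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_phi = [[0.5, 0.0], [0.0, 0.5]]
new_features, alpha = attention_update(features, W_phi)
```

Each row of `alpha` is positive and sums to one, which is the property that lets \(\alpha _{ik}\) be read as the relative importance of object k for object i.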
2.2 Gradient-based labeling
The idea of the gradient-based method is to assign a jet to a particular top quark if changes to the jet properties result in significant changes to the top quark properties. If the top quarks were produced independently of each other and of other radiation within the event, then only the jets they produce should be relevant for reconstructing their properties. In reality, this is not the case because top quarks and other objects are correlated through momentum conservation and other physics effects.
Strictly speaking, the term ‘gradient’ applies to the case of one-dimensional quantities (e.g. top quark \(p_T\)), but for regression methods that predict multiple top quark properties, a more accurate name would be ‘Jacobian-based’. For simplicity, we will henceforth always call this method ‘gradient-based’.
The gradient-based labeling scheme is compatible with any regression model (not just the CPT from Sect. 2.1) and is based on the following quantity:
where \(f_{i,x}\) is the predicted \(x\in \{p_T,y,\phi \}\) of top quark i and \(j_{k,x}\) is the observed x of jet k. Since \(f_i\) is a neural network, we can compute the derivatives in Eq. (2) using the same automatic differentiation (e.g. back propagation) that is used when training the network in the first place. We assign jet k to top quark i if \(\Delta _{ik}\) is one of the top three values across all k. The same jet could be assigned to multiple top quarks. Equation (2) is not the unique combination of elements from the Jacobian and it could be that other combinations could be more effective. We found that using the derivatives with respect to \(p_T\), y, and \(\phi \) was only slightly better than \(p_T\) alone. More complex schemes that weight the different entries separately are also possible.
When f is a CPT, then \(\Delta _{ik}\) is a partial Lorentz scalar and so the labeling is invariant under longitudinal boosts and rotations in the transverse plane.
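The gradient-based assignment can be illustrated with a short sketch. Several pieces here are assumptions rather than the authors' implementation: `toy_regressor` is a hypothetical linear stand-in for a trained model, the derivatives are taken with central finite differences instead of automatic differentiation, and the per-jet score is assumed to be the sum of absolute values of the matching Jacobian entries, which is only one possible combination.

```python
def toy_regressor(jets):
    """Hypothetical stand-in for a trained regression model.

    jets: list of six (pT, y, phi) tuples.
    Returns one predicted (pT, y, phi) per top quark.
    """
    # Fixed mixing weights: top 0 leans on jets 0-2, top 1 on jets 3-5.
    W = [[0.90, 0.80, 0.70, 0.10, 0.05, 0.02],
         [0.05, 0.10, 0.02, 0.70, 0.90, 0.80]]
    return [tuple(sum(wk * jet[x] for wk, jet in zip(w, jets))
                  for x in range(3)) for w in W]

def jacobian_entry(model, jets, i, k, x, eps=1e-4):
    """Central finite difference of output x of top i w.r.t. input x of jet k."""
    up = [list(j) for j in jets]
    down = [list(j) for j in jets]
    up[k][x] += eps
    down[k][x] -= eps
    return (model(up)[i][x] - model(down)[i][x]) / (2 * eps)

def gradient_labels(model, jets, n_tops, n_assign=3):
    """Assign to each top the n_assign jets with the largest score."""
    labels = []
    for i in range(n_tops):
        # Assumed score: sum over x in (pT, y, phi) of |d f_ix / d j_kx|.
        score = [sum(abs(jacobian_entry(model, jets, i, k, x)) for x in range(3))
                 for k in range(len(jets))]
        ranked = sorted(range(len(jets)), key=lambda k: score[k], reverse=True)
        labels.append(sorted(ranked[:n_assign]))
    return labels

jets = [(120.0, 0.4, 0.1), (90.0, 0.2, 0.5), (60.0, 0.7, -0.2),
        (110.0, -0.9, 2.8), (80.0, -1.1, 3.0), (50.0, -0.6, -2.9)]
labels = gradient_labels(toy_regressor, jets, n_tops=2)
```

Because each top quark is ranked independently, nothing in this procedure prevents the same jet from appearing in more than one list, as noted in the text.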
2.3 Attention-based labeling
In each covariant attention layer and attention head in CPT, the attention weight \(\alpha _{ik}\) can be interpreted as a measure of the importance of input k for predicting the properties of top i, locally in the network. By averaging \(\alpha _{ik}\) over all layers and attention heads, we obtain a measure of the overall importance of input k to top i:
\[\bar{\alpha }_{ik} = \frac{1}{LH} \sum _{\ell =1}^{L} \sum _{h=1}^{H} \alpha ^{\ell h}_{ik}, \qquad (3)\]
where \(\alpha ^{\ell h}_{ik}\) is the attention weight between top i and input k in the \(h\text {th}\) attention head in the \(\ell \text {th}\) layer. Similar to gradient-based labeling, we assign the jet with index k to top quark i if \(\bar{\alpha }_{ik}\) is one of the top three values across all jets.
Due to the design of CPT, all attention weights are partial Lorentz scalars and \(\bar{\alpha }_{ik}\) is again a partial Lorentz scalar, implying the labeling is invariant under longitudinal boosts and rotations in the transverse plane.
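The averaging and selection steps can be sketched as follows. The nested-list layout `alpha[layer][head][top][input]` and the toy weights are assumptions made for illustration; in practice the weights would be read out of the trained network. The inputs may include non-jet objects, so the ranking is restricted to an explicit list of jet indices.

```python
def average_attention(alpha):
    """Average alpha[layer][head][top][input] over layers and heads."""
    n_layers, n_heads = len(alpha), len(alpha[0])
    n_tops, n_inputs = len(alpha[0][0]), len(alpha[0][0][0])
    return [[sum(alpha[l][h][i][k]
                 for l in range(n_layers) for h in range(n_heads))
             / (n_layers * n_heads)
             for k in range(n_inputs)]
            for i in range(n_tops)]

def attention_labels(alpha_bar, jet_indices, n_assign=3):
    """For each top, keep the n_assign jets with the largest averaged weight."""
    labels = []
    for row in alpha_bar:
        ranked = sorted(jet_indices, key=lambda k: row[k], reverse=True)
        labels.append(sorted(ranked[:n_assign]))
    return labels

# Toy weights: 2 layers, 2 heads, 1 top quark, 5 inputs.
# Inputs 0-3 are jets; input 4 is a photon and is never assigned.
alpha = [
    [[[0.3, 0.2, 0.1, 0.1, 0.3]], [[0.2, 0.3, 0.1, 0.1, 0.3]]],
    [[[0.3, 0.1, 0.2, 0.1, 0.3]], [[0.2, 0.2, 0.2, 0.1, 0.3]]],
]
alpha_bar = average_attention(alpha)
labels = attention_labels(alpha_bar, jet_indices=[0, 1, 2, 3])
```

Since every per-head row sums to one, the averaged row does too, so \(\bar{\alpha }_{ik}\) can still be read as a share of the total attention.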
2.4 \(\chi ^2\)-based labeling
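The baseline parton labeling scheme that we use is a widely applied \(\chi ^2\) method. In particular, in events with at least two b-jets, the assignment of jets to top quarks is based on the combination that minimizes the following \(\chi ^2\):
\[\chi ^2 = \frac{(m_{b_1 j_1 j_2} - m_t)^2}{\sigma _{m_{bjj}}^2} + \frac{(m_{b_2 j_3 j_4} - m_t)^2}{\sigma _{m_{bjj}}^2} + \frac{(m_{j_1 j_2} - m_W)^2}{\sigma _{m_{jj}}^2} + \frac{(m_{j_3 j_4} - m_W)^2}{\sigma _{m_{jj}}^2}, \qquad (4)\]
where \(m_{b_1 j_1 j_2}\) and \(m_{b_2 j_3 j_4}\) are the invariant masses of the two candidate trijet systems, \(m_{j_1 j_2}\) and \(m_{j_3 j_4}\) are the invariant masses of the corresponding dijet systems, \(m_t\) and \(m_W\) are the top quark and W boson masses, respectively, and \(\sigma _{m_{bjj}}\) and \(\sigma _{m_{jj}}\) are the resolutions of truth-matched top and W events, respectively. Events without six jets, two of which are b-tagged, are not reconstructable with the \(\chi ^2\) method. It may be possible to recover some of the non-reconstructable cases using other approaches for the b-jets (e.g. taking the highest energy jet(s)), so we check that our results hold in cases where events have two b-jets.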
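A brute-force version of such a \(\chi ^2\) assignment might look like the sketch below. It assumes the \(\chi ^2\) sums the squared, resolution-normalized deviations of the two trijet masses from \(m_t\) and of the two dijet masses from \(m_W\). The mass and resolution constants are placeholder values chosen for the example, not the ones used in this study, and `chi2_assign` is a hypothetical helper name.

```python
import math
from itertools import combinations

M_T, M_W = 172.5, 80.4             # GeV; placeholder mass values
SIGMA_BJJ, SIGMA_JJ = 15.0, 10.0   # GeV; hypothetical resolutions

def inv_mass(*vectors):
    """Invariant mass of the sum of (E, px, py, pz) four-vectors."""
    E, px, py, pz = (sum(v[c] for v in vectors) for c in range(4))
    return math.sqrt(max(E * E - px * px - py * py - pz * pz, 0.0))

def chi2_assign(b_jets, light_jets):
    """Return (chi2, pair for b_jets[0], pair for b_jets[1]) minimizing chi2."""
    b1, b2 = b_jets
    best = None
    idx = range(len(light_jets))
    for pair1 in combinations(idx, 2):
        rest = [k for k in idx if k not in pair1]
        for pair2 in combinations(rest, 2):
            w1 = [light_jets[k] for k in pair1]
            w2 = [light_jets[k] for k in pair2]
            chi2 = (((inv_mass(b1, *w1) - M_T) / SIGMA_BJJ) ** 2
                    + ((inv_mass(b2, *w2) - M_T) / SIGMA_BJJ) ** 2
                    + ((inv_mass(*w1) - M_W) / SIGMA_JJ) ** 2
                    + ((inv_mass(*w2) - M_W) / SIGMA_JJ) ** 2)
            if best is None or chi2 < best[0]:
                best = (chi2, pair1, pair2)
    return best

# Toy event: two top quarks at rest, each decaying to a massless b and W -> qq.
p = (M_T ** 2 - M_W ** 2) / (2 * M_T)   # b-quark momentum in the top rest frame
e_w = M_T - p                           # W boson energy
b_jets = [(p, 0.0, 0.0, -p), (p, 0.0, 0.0, p)]
light_jets = [(e_w / 2, M_W / 2, 0.0, p / 2), (e_w / 2, -M_W / 2, 0.0, p / 2),
              (e_w / 2, 0.0, M_W / 2, -p / 2), (e_w / 2, 0.0, -M_W / 2, -p / 2)]
best_chi2, pair_for_b1, pair_for_b2 = chi2_assign(b_jets, light_jets)
```

With four light jets there are only six pairings to test, but the count grows quickly with jet multiplicity, which is one source of the computational cost mentioned in the introduction.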
3 Dataset
For numerical studies, we use the same dataset as in Ref. [12], which is briefly summarized below. Top quark pair production in association with a Higgs boson (see Note 1) in proton–proton collisions is generated with Madgraph@NLO 2.3.7 [14] at next-to-leading order (NLO) in Quantum Chromodynamics (QCD). The decays of the top quarks are simulated with MadSpin [15] and the rest of the particle-level generation is performed with Pythia 8.235 [16]. While this dataset does not emulate detector effects, the salient features of the problem are already present at particle level. Jets are clustered using the anti-\(k_t\) [17] algorithm with \(R=0.4\) as implemented in FastJet 3.3.2 [18, 19].
Jets are required to have \(|y| \le 2.5\) and \(p_T \ge 25\) GeV. Jets that are \(\Delta R\) matched (see Note 2) to b-quarks at the parton level are labeled as b-jets; this label is removed randomly for 30% of the b-jets, to mimic the inefficiency of realistic b-tagging [20, 21]. We further apply a preselection on the testing set of \(N_\textrm{bjet} > 0\) and \(N_\textrm{jet} \ge 3\) to mimic realistic data analysis requirements.
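The jet selection and preselection can be expressed in a few lines. This is a sketch only: the dictionary layout of a jet and the function names are hypothetical, and the 30% label removal is modeled as an independent random draw per b-jet.

```python
import random

def select_jets(jets, rng, btag_eff=0.7, pt_min=25.0, y_max=2.5):
    """Apply kinematic cuts and randomly drop b labels.

    jets: list of dicts with keys 'pt' (GeV), 'y', 'is_b' (parton-matched label).
    """
    selected = []
    for jet in jets:
        if jet["pt"] < pt_min or abs(jet["y"]) > y_max:
            continue
        # Keep the b label with probability btag_eff (70% by default).
        tagged = jet["is_b"] and rng.random() < btag_eff
        selected.append({**jet, "is_b": tagged})
    return selected

def passes_preselection(jets):
    """Require at least three jets, at least one of them b-tagged."""
    return len(jets) >= 3 and any(j["is_b"] for j in jets)

rng = random.Random(0)  # fixed seed so the example is reproducible
event = [
    {"pt": 80.0, "y": 0.3, "is_b": True},
    {"pt": 45.0, "y": -1.2, "is_b": False},
    {"pt": 30.0, "y": 2.1, "is_b": False},
    {"pt": 20.0, "y": 0.0, "is_b": False},  # fails the pT cut
    {"pt": 60.0, "y": 3.0, "is_b": True},   # fails the |y| cut
]
jets = select_jets(event, rng)
keep_event = passes_preselection(jets)
```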
When we need to refer to classical truth labels, we call a top quark ‘truth-matched’ when each of its three quark decay products is within \(\Delta R<0.4\) of exactly one jet.
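As an illustration of this criterion, the sketch below computes \(\Delta R\) with the azimuthal difference wrapped into \([-\pi , \pi )\) and checks the one-jet-per-quark condition. It implements the sentence literally (each quark has exactly one jet within the cone); whether the three matched jets must also be distinct is not stated, so that is left out. The function names are hypothetical.

```python
import math

def delta_r(y1, phi1, y2, phi2):
    """Angular distance sqrt(dy^2 + dphi^2) with dphi wrapped to [-pi, pi)."""
    dphi = (phi1 - phi2 + math.pi) % (2 * math.pi) - math.pi
    return math.hypot(y1 - y2, dphi)

def is_truth_matched(quarks, jets, r_max=0.4):
    """quarks, jets: lists of (y, phi).

    True if every quark has exactly one jet within r_max.
    """
    for y_q, phi_q in quarks:
        n_close = sum(delta_r(y_q, phi_q, y_j, phi_j) < r_max
                      for y_j, phi_j in jets)
        if n_close != 1:
            return False
    return True

quarks = [(0.1, 0.0), (1.0, 1.5), (-0.5, 3.1)]
jets = [(0.15, 0.05), (1.05, 1.45), (-0.45, -3.1), (2.0, -1.0)]
matched = is_truth_matched(quarks, jets)
```

The third quark and third jet sit on opposite sides of the \(\phi = \pm \pi \) boundary, which is why the wrapping matters.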
4 Results
First, we consider standard, non-unique metrics for evaluating performance. In particular, truth-matched top quarks are compared with each reconstruction method to see the fraction of the time that all three jets are the same. As noted earlier, the truth match labels are not unique, but this is a standard metric for quantifying performance. Figure 2 shows the frequency of an exact match for each method and for different jet multiplicities. The matching generally is harder the more jets there are in the event because there are more combinations and the truth label fidelity also degrades (see Fig. 1).
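The exact-match frequency per jet multiplicity can be computed as in the following sketch. The record layout (jet multiplicity, truth triplet, predicted triplet) is an assumption made for this example, and the ordering of jets within a triplet is ignored.

```python
from collections import defaultdict

def exact_match_by_multiplicity(records):
    """Fraction of truth-matched tops whose three jets are all recovered.

    records: iterable of (n_jets, truth_triplet, predicted_triplet).
    Returns {n_jets: fraction}, with keys in increasing order.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for n_jets, truth, predicted in records:
        totals[n_jets] += 1
        hits[n_jets] += set(truth) == set(predicted)
    return {n: hits[n] / totals[n] for n in sorted(totals)}

records = [
    (6, (0, 1, 2), (2, 1, 0)),  # same jets in a different order: a match
    (6, (3, 4, 5), (3, 4, 0)),  # one wrong jet: not a match
    (7, (0, 1, 2), (0, 1, 2)),
]
fractions = exact_match_by_multiplicity(records)
```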
Overall, the attention-based approach outperforms the other two methods across all configurations, often by a large margin (10% or more). Inclusively, the gradient-based method outperforms the classical \(\chi ^2\) assignment, but the two approaches are comparable after requiring two b-jets. Across all events and inclusively across jet multiplicities, the \(\chi ^2\) approach has a poor matching frequency (about 10%) in part because it requires two b-jets and at least six distinct jets. In contrast, the attention- and gradient-based methods are still effective when there are fewer jets. The numbers for the attention-based and \(\chi ^2\)-based approaches are similar to the ones found by Spa-Net [6], although there are a number of differences in the setup that prohibit a precise comparison.
Next, we study events in which there is no truth-match. Such events are not even part of the training for other ML-based labeling schemes, but our methods are still able to assign parton labels in these cases. One way to see if the assigned jets in such events are sensible is to examine their trijet invariant mass. Figure 3 presents histograms of this mass inclusively and for events without a truth match. There are roughly twice as many entries for the attention- and gradient-based histograms in the top plot of Fig. 3 because of events where there is no truth match. All five histograms in the figure look similar, with a peak near the top quark mass of about 175 GeV [22]. The peak is sharpest for the truth-matched events and is slightly sharper for the attention-based method than the gradient-based method. This may be expected from Fig. 2, which indicates that the attention-based approach has a higher fidelity of picking the ‘correct’ jets.
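For reference, the trijet invariant mass of an assigned triplet follows from the jet kinematics as sketched below, assuming each jet is given as \((p_T, y, \phi )\) with an optional mass; the helper names are illustrative.

```python
import math

def to_cartesian(pt, y, phi, m=0.0):
    """Convert (pT, rapidity, phi, mass) to (E, px, py, pz)."""
    m_transverse = math.sqrt(pt * pt + m * m)
    return (m_transverse * math.cosh(y), pt * math.cos(phi),
            pt * math.sin(phi), m_transverse * math.sinh(y))

def trijet_mass(jets):
    """Invariant mass of the summed four-vectors of the given jets."""
    E, px, py, pz = (sum(c) for c in zip(*(to_cartesian(*j) for j in jets)))
    return math.sqrt(max(E * E - px * px - py * py - pz * pz, 0.0))

# Three massless 50 GeV jets at y = 0, evenly spread in phi: momenta cancel,
# so the invariant mass equals the total energy.
jets = [(50.0, 0.0, 0.0),
        (50.0, 0.0, 2 * math.pi / 3),
        (50.0, 0.0, 4 * math.pi / 3)]
mass = trijet_mass(jets)
```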
Our last investigation is whether the trijet kinematic properties in unmatched events are close to those of the truth top quarks. One reasonable definition of a ‘good match’ would be that the reconstructed top properties are close to the truth properties, which does not require assigning quark identities to the jets. Since our methods are derived from a top quark property regressor, we would expect that the trijet properties align well with the truth top quark properties, but it is important to check. Figure 4 provides confirmation for the top quark \(p_T\) and \(y\).
5 Conclusions and outlook
Parton labeling continues to be an important task in collider event reconstruction even though such labels are not unique. We have proposed a set of tools based on regression methods that are able to assign parton labels without also needing unphysical parton matching for training. Our approaches are competitive even though they are not trained using trijet information and are much more flexible than other approaches, since we are able to accommodate events with fewer jets than expected from the lowest order decay Feynman diagrams. While our techniques are compatible with many regression approaches, the CPT model studied here is particularly useful because it is permutation invariant and partially Lorentz covariant. The corresponding labels inherit some of these properties.
There are a number of possible ways to further improve these approaches, including how to best combine the attention weights or Jacobian elements to assign parton labels. It may also be possible to combine approaches in the future, where a simpler model can be trained using the label information from a regression model.
6 Software
The code for this project is built on that of Ref. [12]. Updated software that also produces the gradients and makes the figures in this paper can be found at https://github.com/hep-lbdl/Covariant-Particle-Transformer.
Data Availability
The manuscript has associated data in a data repository. [Authors’ comment: The data can be downloaded from the Github link in the Software section.]
Notes
1. The Higgs boson decays to photons and is largely ignored and irrelevant for jet labeling. We use this sample because it was the main one used in Ref. [12], although it was also shown that the performance is similar in other top quark final states.
2. \(\Delta R\) is defined as \(\sqrt{\Delta y^2 + \Delta \phi ^2}\), where \(\Delta y\) is the difference of two particles in pseudorapidity and \(\Delta \phi \) is the difference in azimuthal angle.
References
[1] M. Aaboud et al. (ATLAS), Search for the standard model Higgs boson produced in association with top quarks and decaying into a \(b\bar{b}\) pair in \(pp\) collisions at \(\sqrt{s} = 13\) TeV with the ATLAS detector. Phys. Rev. D 97, 072016 (2018). https://doi.org/10.1103/PhysRevD.97.072016. arXiv:1712.08895 [hep-ex]
[2] A.M. Sirunyan et al. (CMS), Measurement of the \({\rm t}\bar{t} {\rm b}\bar{b} \) production cross section in the all-jet final state in pp collisions at \(\sqrt{s} =\) 13 TeV. Phys. Lett. B 803, 135285 (2020). https://doi.org/10.1016/j.physletb.2020.135285. arXiv:1909.05306 [hep-ex]
[3] G. Aad et al. (ATLAS), \(CP\) Properties of Higgs Boson Interactions with Top Quarks in the \(t\bar{t}H\) and \(tH\) Processes Using \(H \rightarrow \gamma \gamma \) with the ATLAS Detector. Phys. Rev. Lett. 125, 061802 (2020). https://doi.org/10.1103/PhysRevLett.125.061802. arXiv:2004.04545 [hep-ex]
[4] J. Erdmann, T. Kallage, K. Kröninger, O. Nackenhorst, From the bottom to the top—reconstruction of \(t\bar{t}\) events with deep learning. JINST 14(11), P11015 (2019). https://doi.org/10.1088/1748-0221/14/11/P11015. arXiv:1907.11181 [hep-ex]
[5] A. Badea, W.J. Fawcett, J. Huth, T.J. Khoo, R. Poggi, L. Lee, Solving combinatorial problems at particle colliders using machine learning (2022). arXiv:2201.02205 [hep-ph]
[6] M.J. Fenton, A. Shmakov, T.-W. Ho, S.-C. Hsu, D. Whiteson, P. Baldi, Permutationless many-jet event reconstruction with symmetry preserving attention networks (2020). arXiv:2010.09206 [hep-ex]
[7] J.S.H. Lee, I. Park, I.J. Watson, S. Yang, Zero-permutation jet-parton assignment using a self-attention network (2020). arXiv:2012.03542 [hep-ex]
[8] A. Shmakov, M.J. Fenton, T.-W. Ho, S.-C. Hsu, D. Whiteson, P. Baldi, SPANet: generalized permutationless set assignment for particle physics using symmetry preserving attention (2021). arXiv:2106.03898 [hep-ex]
[9] L. Ehrke, J.A. Raine, K. Zoch, M. Guth, T. Golling, Topological reconstruction of particle physics processes using graph neural networks (2023). arXiv:2303.13937 [hep-ph]
[10] J. Erdmann, S. Guindon, K. Kroeninger, B. Lemmer, O. Nackenhorst, A. Quadt, P. Stolte, A likelihood-based reconstruction algorithm for top-quark pairs and the KLFitter framework. Nucl. Instrum. Methods A 748, 18 (2014). https://doi.org/10.1016/j.nima.2014.02.029. arXiv:1312.5595 [hep-ex]
[11] M. Cacciari, G.P. Salam, G. Soyez, The catchment area of jets. JHEP 04, 005 (2008). https://doi.org/10.1088/1126-6708/2008/04/005. arXiv:0802.1188 [hep-ph]
[12] S. Qiu, S. Han, X. Ju, B. Nachman, H. Wang, A holistic approach to predicting top quark kinematic properties with the covariant particle transformer (2022). arXiv:2203.05687 [hep-ph]
[13] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in Advances in Neural Information Processing Systems (2017) p. 5998. arXiv:1706.03762
[14] J. Alwall, R. Frederix, S. Frixione, V. Hirschi, F. Maltoni, O. Mattelaer, H.S. Shao, T. Stelzer, P. Torrielli, M. Zaro, The automated computation of tree-level and next-to-leading order differential cross sections, and their matching to parton shower simulations. JHEP 07, 079 (2014). https://doi.org/10.1007/JHEP07(2014)079. arXiv:1405.0301 [hep-ph]
[15] P. Artoisenet, R. Frederix, O. Mattelaer, R. Rietkerk, Automatic spin-entangled decays of heavy resonances in Monte Carlo simulations. JHEP 03, 015 (2013). https://doi.org/10.1007/JHEP03(2013)015. arXiv:1212.3460 [hep-ph]
[16] T. Sjöstrand, S. Ask, J.R. Christiansen, R. Corke, N. Desai, P. Ilten, S. Mrenna, S. Prestel, C.O. Rasmussen, P.Z. Skands, An introduction to PYTHIA 8.2. Comput. Phys. Commun. 191, 159 (2015). https://doi.org/10.1016/j.cpc.2015.01.024. arXiv:1410.3012 [hep-ph]
[17] M. Cacciari, G.P. Salam, G. Soyez, The anti-\(k_t\) jet clustering algorithm. JHEP 04, 063 (2008). https://doi.org/10.1088/1126-6708/2008/04/063. arXiv:0802.1189 [hep-ph]
[18] M. Cacciari, G.P. Salam, G. Soyez, FastJet user manual. Eur. Phys. J. C 72, 1896 (2012). https://doi.org/10.1140/epjc/s10052-012-1896-2. arXiv:1111.6097 [hep-ph]
[19] M. Cacciari, G.P. Salam, Dispelling the \(N^{3}\) myth for the \(k_t\) jet-finder. Phys. Lett. B 641, 57 (2006). https://doi.org/10.1016/j.physletb.2006.08.037. arXiv:hep-ph/0512210
[20] G. Aad et al. (ATLAS), ATLAS b-jet identification performance and efficiency measurement with \(t{\bar{t}}\) events in pp collisions at \(\sqrt{s}=13\) TeV. Eur. Phys. J. C 79, 970 (2019). https://doi.org/10.1140/epjc/s10052-019-7450-8. arXiv:1907.05120 [hep-ex]
[21] A.M. Sirunyan et al. (CMS), Identification of heavy-flavour jets with the CMS detector in pp collisions at 13 TeV. JINST 13(05), P05011 (2018). https://doi.org/10.1088/1748-0221/13/05/P05011. arXiv:1712.07158 [physics.ins-det]
[22] Particle Data Group, Review of Particle Physics. Prog. Theor. Exp. Phys. 2020, 083C01 (2020). https://doi.org/10.1093/ptep/ptaa104
Acknowledgements
BN thanks Chase Shimmin for useful discussions. This work is supported by the U.S. Department of Energy, Office of Science under contract DE-AC02-05CH11231. H.W.’s work is partly supported by the U.S. National Science Foundation under the Award No. 2046280.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Funded by SCOAP3. SCOAP3 supports the goals of the International Year of Basic Sciences for Sustainable Development.
About this article
Cite this article
Qiu, S., Han, S., Ju, X. et al. Parton labeling without matching: unveiling emergent labelling capabilities in regression models. Eur. Phys. J. C 83, 622 (2023). https://doi.org/10.1140/epjc/s10052-023-11809-z
DOI: https://doi.org/10.1140/epjc/s10052-023-11809-z