********** Accept **********

STATISTICAL REVIEWER

1. While the results might be affected by the different initializations, combining the 5 runs of the test set inflates the sample size from 1558 to 7790. As a result, the actual precision is much lower than the inflated sample size suggests, because essentially the same sentences are entered 5 times and treated as if they were unrelated to each other. The authors can still use bootstrapping, except that each resample should be limited to 1558 observations, not 7790 (a brief sketch of this resampling follows these comments).
2. "P-values were zero" implies that the situation would never occur. The "P" in P-value stands for probability, and a probability of 0 means the event would never happen. Please modify the sentence appropriately.
3. Accuracy lies between 0 and 1, so a variance of accuracy equal to 2.28 is not possible. Please make sure the decimal points are correct or that a % sign has been added.
4. In the text, the macro-average is used for Task 2, but in C4 the micro-average is used. Please make sure the correct information is reported.
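As a minimal illustration of the resampling suggested in point 1 (placeholder data and hypothetical variable names, not the authors' code), each bootstrap draw would cover the 1558 unique sentences after the 5 runs are averaged per sentence:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data (hypothetical): gold labels for the 1558 unique test
# sentences and binary predictions from the 5 differently initialized runs.
y_true = rng.integers(0, 2, 1558)
y_pred_runs = rng.integers(0, 2, (5, 1558))

# Average over the 5 runs so each sentence contributes once, not five times
per_sentence_acc = (y_pred_runs == y_true).mean(axis=0)

# Resample the 1558 sentences (not the inflated 7790 rows) in each draw
boot_means = []
for _ in range(2000):
    idx = rng.integers(0, len(per_sentence_acc), len(per_sentence_acc))
    boot_means.append(per_sentence_acc[idx].mean())

ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"Accuracy, 95% bootstrap CI: [{ci_low:.3f}, {ci_high:.3f}]")
```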
********** Revise **********

STATISTICAL REVIEWER

Summary Statement:
1. The authors did not demonstrate that the models provide "considerable performance gains" over the baselines. In fact, it is not clear whether proper comparisons were made.

Abstract:
2. M&M: Provide a sentence or two on the statistical methods used to compare the models for the three tasks. In the main paper, a "three"-sentence summary, rather than a "two"-sentence summary, was specified for the report summary.
3. Results: The accuracy and F1-score quoted differ from those provided in Table 2. There is no measure of what is considered "better," and it is not clear whether any real differences in the point estimates remain after the variation is taken into account.

Materials and Methods:
4. Task 1: How were the F1-scores calculated (and why?) for each algorithm? The sensitivity/specificity of each algorithm might be of more interest. In addition, the procedure for evaluating/comparing the models should be detailed in M&M, not in Results. Furthermore, why did the authors inflate the test set by 5 times? Please provide detailed explanations. For each task, the sizes of the training/evaluation/test sets should be clearly provided. Multiple tests were performed, so the P-values/CI levels should be adjusted to accommodate the multiple tests. Were the different percentages of training data processed in order from smallest (5%) to largest (100%), or randomly?
5. Task 2: What did the authors mean by "average" accuracy and the "macro" average of the F1-score? For coding systems with more than 2 levels, did the labeling of the classes affect accuracy? As for Task 1, please clarify how the F1-score was calculated.
6. Task 3: How was this task evaluated? Specify the tests used to compare ROUGE-1, ROUGE-2, and ROUGE-L.

Results:
7. The bootstrap methods should be detailed in M&M, not in Results. See #4; in particular, what is the reason for repeating the test set 5 times? M&M specifies 5 levels, with the lowest training-set percentage being 5%, while in Results the lowest is 1%; in the tables, there are 7 levels in total. Please make sure the methods and results, including tables and figures, are all consistent. While the authors mention the bootstrapping methods used, no bootstrap results were presented; only the "mean" and "sd" were presented, based on the inflated test set.
8. The sizes of the training/validation/test sets should be specified clearly.
9. Low-scoring examples should be shown alongside their reference standard.

Tables and Figures:
10. Table 2: Why were the mean and standard deviation reported over five runs? Were the results based on test data? Provide sample sizes to avoid confusion.
11. Table 3: The sample size should be provided for each system.
12. Figure 2: Were the "numbered" categories used in the accuracy calculations? For example, "major abnormality" and "no attention needed" are labeled 1 and 2, but they are nowhere near each other in terms of abnormality, and mislabeling a 1 as a 2 is much more serious than mislabeling a 5 as a 6.
13. Figure 3: Were the histograms based on the reference standard or on model output? What is the purpose of Figure 3? How many words does a sentence usually have?

********** Revise **********

DEPUTY EDITOR

This manuscript investigates whether tailoring a transformer-based language model with a large corpus of 4.42M radiology reports from the VA system is beneficial for radiology NLP applications. The study was generally found to be well designed and executed. The strength of the work is its use of a large, national report corpus covering all radiology specialties. The study investigates the utility of pretraining with a variety of publicly available general and domain-specific documents. A few comments to consider:
1. The Results section seems to blend methods and results together. The approach for each of the three experiments should be presented in the Methods. The actual results are only briefly described and are mostly left to the reader to interpret from Tables 1, 3, and 4. Explicit statistical testing would help confirm the significance of the performance differences between RadBERT and the baseline comparisons (a sketch of one such paired test follows these comments).
2. The difference in performance for Task 1 shown in Table 1 seems marginal between the baseline and RadBERT models. The impact of the Open-I dataset in fine-tuning the classifier should be clarified.
3. Section 3.1 is better suited to the Materials and Methods section.

The reviewers have provided critiques that will help clarify the approach and results further.
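For example, because each RadBERT variant and each baseline are evaluated on the same test sentences, a paired test such as McNemar's test could be applied to the per-example correctness indicators. The sketch below is illustrative only and uses placeholder data; the choice of test is a suggestion rather than a description of the authors' methods, and any resulting P-values would still need adjustment for the multiple comparisons noted by the statistical reviewer:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)

# Placeholder per-sentence correctness indicators on the same test set
# (1 where a model's label matches the reference standard, else 0).
baseline_correct = rng.integers(0, 2, 1558)
radbert_correct = rng.integers(0, 2, 1558)

# 2x2 table of agreement/disagreement between the paired models
table = np.array([
    [np.sum((baseline_correct == 1) & (radbert_correct == 1)),
     np.sum((baseline_correct == 1) & (radbert_correct == 0))],
    [np.sum((baseline_correct == 0) & (radbert_correct == 1)),
     np.sum((baseline_correct == 0) & (radbert_correct == 0))],
])

# Exact McNemar test on the discordant pairs
result = mcnemar(table, exact=True)
print(f"McNemar P-value: {result.pvalue:.4f}")
```

A paired bootstrap of the accuracy or F1 difference over the 1558 test sentences would be an acceptable alternative.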
REVIEWER 1

The authors have reported the results of their study, in which they build a radiology-specific transformer-based NLP model, pre-trained on 4.42 million radiology reports from the VA's nationwide system, on top of four well-known initializations. These models are then compared with baseline models that are not radiology specific. The three tasks used to evaluate the models were abnormal sentence classification (abnormal vs. normal findings), report coding (classification of BI-RADS/Lung-RADS categories from the report), and report summarization (a two-sentence summary of the report). The authors found that their RadBERT models generally outperformed the baseline models.

The ability to use NLP models to extract useful information from the millions of radiology reports being generated is critical. Certainly, there is much to be gained from this: data, diagnosis, and, as the authors mention, notification of abnormal findings to the ordering providers. The importance of such a tool cannot be overstated. One of the strengths of this study is the broad dataset on which the model was trained. I would imagine the generalizability of such a model would be good, given that it came from 2.17 million unique patients across the country (albeit from VA hospitals, which may represent an older population). The methods are sound and represent standard methods of training an NLP model, applied here specifically to radiology reports.

1. While the results appear to be impressive, the authors do not make clear how they compare with prior studies, or whether prior studies even exist against which these results can be compared. In fact, no significance is attached to the accuracy/performance measures in any of the results; how are we to know that these are in fact much better than the baseline models? Additionally, the authors should show examples of the tasks: examples of good, middling, and poor performance on Task 1, and so on, would allow readers to understand exactly what is being measured. To a layperson looking at the raw numbers, it would appear that while the numbers for RadBERT are indeed better than those of the baseline models, they are not very far apart and are certainly not representative of the 2.17 million unique patients. This is why statistics and performance relative to other studies would be important in this particular manuscript.
2. The Discussion section is well written; however, without the above information in the Results, it seems incomplete. I agree with the limitations, and it would certainly be interesting to see how the approach performs with the BERT-large model, though I am unsure whether it would be able to outperform RoBERTa.

REVIEWER 2

1. One concern is the lack of discussion regarding the marginal improvement of the RadBERT variants over the BERT baseline. Can one really conclude that pretraining on radiology reports matters?
2. Page 1, line 29: BERT is new (2018), so "rarely been explored in the radiology domain" seems misleading to the reader.
3. Page 2, line 10: Change "adapting" to "adapted for."
4. Page 2, line 19: Why did you opt not to pretrain on BlueBERT?
5. Page 2, line 30: Why would RadBERT do better, particularly with less training? What does this say about the data?
6. Page 3, line 13: "Adversarial" seems a little harsh here.
7. Page 3, line 16: Change "progresses were made" to "progress has been made."
8. Page 3, line 22: The reader may benefit from a brief discussion of what "context" means and why context-based deep learning is so important for NLP.
9. Page 3, line 33: Add "then" before "what's."
10. Page 3, line 35: Is "classifies" the right word here?
11. Page 3, line 35: Delete "much" or change it to "significantly."
12. Page 3, line 46: Is "pervasive" the best word choice here?
13. Page 3, line 48: Replace the reference [11] with meaningful text.
14. Page 4, line 17: Please explain the choice, types, and number of reports (US, CXR, CT, etc.) and body parts.
15. Page 5, line 3: Please provide a couple of sentences explaining "cross-entropy loss."
16. Page 5, line 5: Explain why you chose 80%, 10%, 10%.
17. Page 5, line 7: What does "vocabulary" refer to here? The list of words from the pretraining corpus, or some other source from the general domain?
18. Page 5, section 2.3: Please discuss what was used as the ground truth.
19. Page 5, section 2.4: Please discuss what was used as the ground truth.
20. Page 5, line 56: A sentence briefly explaining sentence embeddings would help the flow.
21. Page 5, line 56: Consider a brief description of k-means to explain its unsupervised nature and why it is an appropriate choice for this exercise.
22. Page 7, line 32: Please provide some specific noteworthy examples.
23. Page 7, line 38: Is there a directional relationship between performance and the amount of training data used?
24. Page 8, line 5: Did you consider using cross-fold validation to increase the utility of your training data?
25. Page 8, line 20: Please offer suggestions as to why your model's performance with 4 million reports is not better than with 2 million.
26. Page 8, line 40: How well did your 1000 randomly chosen reports reflect the training reports in terms of modality and body part?
27. Page 9, first paragraph: It seems odd that the first paragraph focuses on follow-ups, whereas the primary focus of the paper is the utility of pretraining existing BERT models with radiology reports.
28. Page 9, line 40: How clinically significant is the "superiority" of RadBERT? The improvement seems marginal; please discuss further.
29. Page 10, line 30: Instead of focusing on an overall number of reports, should the type of radiology reports used for pretraining be matched to the type of NLP task? For example, do you think that training primarily on mammography reports (XR) would improve BI-RADS coding at the expense of LI-RADS (CT)? Please discuss. Is this a potential weakness of the research?
30. Figure 3: Please explain "density" on the y-axis and "length" on the x-axis. It would also be helpful to see the relationship between a report's length and/or density and its corresponding impression.
31. Table 4: Please provide some context for ROUGE for the readers: what is the range, and what is a "good" number? (A short illustrative sketch follows this list.)
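Regarding point 31: ROUGE-1, ROUGE-2, and ROUGE-L F-measures range from 0 to 1 (often reported on a 0-100 scale), with higher values indicating greater n-gram or subsequence overlap with the reference. A minimal, illustrative sketch using the open-source rouge-score package with made-up sentences (not the authors' evaluation code):

```python
from rouge_score import rouge_scorer

# Made-up reference impression and model-generated summary (illustrative only)
reference = "No acute cardiopulmonary abnormality."
candidate = "No acute abnormality of the chest."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)

for name, score in scores.items():
    # Each entry carries precision, recall, and F-measure, all in [0, 1]
    print(f"{name}: F1 = {score.fmeasure:.3f}")
```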
REVIEWER 3

Overall, this is a well-designed and well-reported study on a topic of importance to the Radiology AI community. I have a few suggestions to better support reader understanding and replicability of the work:

Methods:
1. It would be more appropriate to include the reference to the WordPiece tokenization scheme here, rather than in the Results.
2. Additionally, it would be helpful to elaborate on how the tokenization scheme handles out-of-vocabulary words, as some radiology-specific words may not be found in the original vocabulary of BERT-base (see the illustrative sketch at the end of this review).
3. The reference to the Supplemental Material for the training scheme and hyperparameters for masked language model pre-training should be made earlier in the Methods section. The code libraries used for this work should also be specified somewhere in the Methods section.
4. How were the sentence embeddings extracted from the pre-trained masked language models for the report summarization task? Was it the same method described for the other 2 downstream tasks? For readers less familiar with Transformers, how does the model's output representation of the first token in a sentence convey information about the remainder of the sentence? Please provide a brief explanation of how multi-head attention facilitates this (the sketch at the end of this review also touches on this).

Results:
5. What was the rationale for pre-training on different-sized subsets of the VA corpus?
6. It is mentioned that a labeled subset of the Open-I IU CXR dataset was previously annotated: how, and by whom? Please provide a little more detail or a reference to previously published work, if applicable.
7. How were the 5 diagnostic coding systems selected from the available options? It should be clarified that these are not ICD- or SNOMED-like universal standards for diagnostic coding/billing but rather internal VA coding systems for communication of critical findings.

Supplemental Materials:
8. An initial learning rate is mentioned for language model pre-training. What learning rate scheduler was used (e.g., cosine annealing, step-wise decay)?
9. A final question (plea) to the authors: Is there any plan to make the RadBERT language models publicly available?
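As an illustration for Methods points 2 and 4 above, the following minimal sketch uses the Hugging Face transformers library with a generic BERT-base checkpoint as a stand-in (not the authors' code; the term and sentence are made up). WordPiece backs off to subword pieces for words missing from the vocabulary, and the hidden state of the first ([CLS]) token, which attends to every other token through multi-head self-attention, is commonly taken as a sentence embedding:

```python
from transformers import AutoModel, AutoTokenizer

# Generic BERT-base checkpoint as a stand-in for the study's pretrained models
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# WordPiece backs off to subword pieces for out-of-vocabulary radiology terms
print(tokenizer.tokenize("pneumothorax"))  # typically '##'-prefixed subword pieces

# The hidden state of the first ([CLS]) token is commonly taken as the sentence
# embedding; through multi-head self-attention it aggregates information from
# every other token in the sentence.
inputs = tokenizer("No acute cardiopulmonary abnormality.", return_tensors="pt")
outputs = model(**inputs)
cls_embedding = outputs.last_hidden_state[:, 0, :]  # shape: (1, hidden_size)
```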