Haitao Chu

    In the survival analysis context, when an intervention either reduces a harmful exposure or introduces a beneficial treatment, it seems useful to quantify the gain in survival attributable to the intervention as an alternative to the reduction in risk. To accomplish this we introduce two new concepts, the attributable survival and attributable survival time, and study their properties. Our analysis includes comparison with the attributable risk function as well as hazard-based alternatives. We also extend the setting to the case where the intervention takes place at discrete points in time, and may either eliminate exposure or introduce a beneficial treatment in only a proportion of the available group. This generalization accommodates the more realistic situation where the treatment or exposure is dynamic. We apply these methods to assess the effect of introducing highly active antiretroviral therapy for the treatment of clinical AIDS at the population level.
    The conventional random effects model for meta-analysis of proportions approximates within-study variation using a normal distribution. Due to potential approximation bias, particularly for the estimation of rare events such as some adverse drug reactions, the conventional method is considered inferior to exact methods based on binomial distributions. In this paper, we compare two existing exact approaches, beta-binomial (B-B) and normal-binomial (N-B), through an extensive simulation study with a focus on the case of rare events, which are commonly encountered in medical research. In addition, we implement the empirical ("sandwich") estimator of variance in the two models to improve the robustness of the statistical inferences. To our knowledge, this is the first application of the sandwich estimator of variance to meta-analysis of proportions. The simulation study shows that the B-B approach tends to have substantially smaller bias and mean squared error than N-B for rare events with occurrences under five percent, while N-B outperforms B-B for relatively common events. Use of the sandwich estimator of variance improves the precision of estimation for both models. We illustrate the two approaches by applying them to two published meta-analyses from the fields of orthopedic surgery and prevention of adverse drug reactions.
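    As a concrete illustration of the B-B approach, the sketch below fits a beta-binomial model to event counts by maximum likelihood; it is a minimal stand-in for the models compared in the paper, not the authors' code, and the study counts are hypothetical.

```python
# A minimal sketch (not the paper's code) of the beta-binomial (B-B) approach:
# each study's event count y_i out of n_i is modeled as BetaBinomial(n_i, a, b),
# and the pooled proportion is a / (a + b). Study data below are hypothetical.
import numpy as np
from scipy.special import betaln, gammaln
from scipy.optimize import minimize

y = np.array([1, 0, 2, 1, 3])            # hypothetical event counts per study
n = np.array([120, 95, 210, 150, 300])   # hypothetical study sizes

def neg_loglik(log_ab):
    a, b = np.exp(log_ab)  # optimize on the log scale to keep a, b > 0
    # log BetaBinomial pmf: log C(n, y) + log B(y + a, n - y + b) - log B(a, b)
    log_choose = gammaln(n + 1) - gammaln(y + 1) - gammaln(n - y + 1)
    ll = log_choose + betaln(y + a, n - y + b) - betaln(a, b)
    return -ll.sum()

fit = minimize(neg_loglik, x0=np.log([1.0, 50.0]), method="Nelder-Mead")
a_hat, b_hat = np.exp(fit.x)
print(f"pooled proportion: {a_hat / (a_hat + b_hat):.4f}")
```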
    Systematic reviews of diagnostic tests often involve a mixture of case-control and cohort studies. The standard methods for evaluating diagnostic accuracy only focus on sensitivity and specificity and ignore the information on disease prevalence contained in cohort studies. Consequently, such methods cannot provide estimates of measures related to disease prevalence, such as population averaged or overall positive and negative predictive values, which reflect the clinical utility of a diagnostic test. In this paper, we propose a hybrid approach that jointly models the disease prevalence along with the diagnostic test sensitivity and specificity in cohort studies, and the sensitivity and specificity in case-control studies. In order to overcome the potential computational difficulties in the standard full likelihood inference of the proposed hybrid model, we propose an alternative inference procedure based on the composite likelihood. Such composite likelihood based inference does no...
    This paper describes the core features of the R package mmeta, which implements the exact posterior inference of odds ratio, relative risk, and risk difference given either a single 2 × 2 table or multiple 2 × 2 tables when the risks within the same study are independent or correlated.
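    The sketch below illustrates the underlying exact-posterior idea for a single 2 × 2 table using Monte Carlo draws from independent beta posteriors; it is not the mmeta API, and the table counts are hypothetical.

```python
# Not the mmeta API; a Monte Carlo sketch of the exact-posterior idea for a
# single 2x2 table with independent Beta(1, 1) priors on the two risks.
# Counts below are hypothetical: y1/n1 events in treatment, y0/n0 in control.
import numpy as np

rng = np.random.default_rng(0)
y1, n1, y0, n0 = 12, 100, 5, 100  # hypothetical 2x2 table

p1 = rng.beta(1 + y1, 1 + n1 - y1, size=100_000)  # posterior draws, treatment risk
p0 = rng.beta(1 + y0, 1 + n0 - y0, size=100_000)  # posterior draws, control risk

odds_ratio = (p1 / (1 - p1)) / (p0 / (1 - p0))
rel_risk = p1 / p0
risk_diff = p1 - p0

for name, draws in [("OR", odds_ratio), ("RR", rel_risk), ("RD", risk_diff)]:
    lo, med, hi = np.percentile(draws, [2.5, 50, 97.5])
    print(f"{name}: median {med:.3f}, 95% CrI ({lo:.3f}, {hi:.3f})")
```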
    Diagnostic systematic review is a vital step in the evaluation of diagnostic technologies. In many applications, it involves pooling pairs of sensitivity and specificity of a dichotomized diagnostic test from multiple studies. We propose a composite likelihood (CL) method for bivariate meta-analysis in diagnostic systematic reviews. This method provides an alternative way to make inference on diagnostic measures such as sensitivity, specificity, likelihood ratios, and diagnostic odds ratio. Its main advantages over the standard likelihood method are its computational simplicity, some robustness to model misspecification, and its avoidance of the nonconvergence problem, which is nontrivial when the number of studies is relatively small. Simulation studies show that the CL method maintains high relative efficiency compared with the standard likelihood method. We illustrate our method in a diagnostic review of the performance of contemporary diagnostic imaging technologies for d...
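    The sketch below conveys the independence-type composite likelihood idea under simplifying assumptions: the between-outcome correlation is dropped and normal random effects models are fitted separately to logit-sensitivity and logit-specificity, so no correlation parameter can cause nonconvergence. The normal approximation to the marginal likelihoods and the study counts are stand-ins, not the paper's exact formulation.

```python
# Independence (composite) likelihood sketch with hypothetical study counts:
# fit separate normal random-effects models to logit-sensitivity and
# logit-specificity, then combine the pooled logits into a diagnostic odds ratio.
import numpy as np
from scipy.optimize import minimize

def fit_random_effects(events, totals):
    """Normal-normal random-effects fit to logit proportions (delta method)."""
    y = np.log(events / (totals - events))        # study-level logits
    v = 1.0 / events + 1.0 / (totals - events)    # approx. within-study variances
    def neg_loglik(theta):
        mu, log_tau2 = theta
        s2 = v + np.exp(log_tau2)                 # total variance per study
        return 0.5 * np.sum(np.log(2 * np.pi * s2) + (y - mu) ** 2 / s2)
    fit = minimize(neg_loglik, x0=[y.mean(), 0.0], method="Nelder-Mead")
    return fit.x[0]                               # pooled logit

# hypothetical counts: true/false positives and negatives per study
tp, fn = np.array([45, 30, 60, 22]), np.array([5, 8, 12, 6])
tn, fp = np.array([80, 55, 95, 40]), np.array([10, 9, 20, 8])

logit_se = fit_random_effects(tp, tp + fn)
logit_sp = fit_random_effects(tn, tn + fp)
se, sp = 1 / (1 + np.exp(-logit_se)), 1 / (1 + np.exp(-logit_sp))
print(f"pooled sensitivity {se:.3f}, specificity {sp:.3f}, "
      f"DOR {np.exp(logit_se + logit_sp):.1f}")
```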
    We have developed a statistical method named IsoDOT to assess differential isoform expression (DIE) and differential isoform usage (DIU) using RNA-seq data. Here isoform usage refers to relative isoform expression given the total expression of the corresponding gene. IsoDOT performs two tasks that cannot be accomplished by existing methods: testing DIE/DIU with respect to a continuous covariate, and testing DIE/DIU for one case versus one control. The latter is not an uncommon situation in practice, e.g., comparing the paternal and maternal alleles of one individual or comparing tumor and normal samples of one cancer patient. Simulation studies demonstrate the high sensitivity and specificity of IsoDOT. We apply IsoDOT to study the effects of haloperidol treatment on the mouse transcriptome and identify a group of genes whose isoform usages respond to haloperidol treatment.
    In a meta-analysis of diagnostic accuracy studies, the sensitivities and specificities of a diagnostic test may depend on the disease prevalence, since the severity and definition of disease may differ from study to study due to the design and the population considered. In this paper, we extend the bivariate nonlinear random effects model on sensitivities and specificities to jointly model the disease prevalence, sensitivities, and specificities using trivariate nonlinear random effects models. Furthermore, as an alternative parameterization, we also propose jointly modeling the test prevalence and the predictive values, which reflect the clinical utility of a diagnostic test. These models allow investigators to study the complex relationships among the disease prevalence, sensitivities, and specificities, or among the test prevalence and the predictive values, which can reveal hidden information about test performance. We illustrate the two proposed approaches by reanalyzing data from a meta-analysis of radiological evaluation of lymph node metastases in patients with cervical cancer, and through a simulation study. The latter illustrates the importance of carefully choosing an appropriate normality assumption for the disease prevalence, sensitivities, and specificities, or the test prevalence and the predictive values. In practice, it is recommended to use model selection techniques to identify a best-fitting model for making statistical inference. In summary, the proposed trivariate random effects models are novel and can be very useful in practice for meta-analysis of diagnostic accuracy studies.
    The widely used Cox proportional hazards regression model for the analysis of censored survival data has limited utility when either hazard functions themselves are of primary interest, or when relative times instead of relative hazards are the relevant measures of association. Parametric regression models are an attractive option in such situations, although the choice of a particular model from the available families of distributions can be problematic. The generalized gamma (GG) distribution is an extensive family that contains nearly all of the most commonly used distributions, including the exponential, Weibull, log-normal, and gamma. More importantly, the GG family includes all four of the most common types of hazard function: monotonically increasing and decreasing, as well as bathtub and arc-shaped hazards. We present here a taxonomy of the hazard functions of the GG family, which includes various special distributions and allows depiction of the effects of exposures on hazard functions. We applied the proposed taxonomy to study survival after a diagnosis of clinical AIDS during different eras of HIV therapy, where proportionality of hazard functions was clearly not fulfilled and flexibility in estimating hazards with very different shapes was needed. Comparisons of survival after AIDS in different eras of therapy are presented in terms of both relative times and relative hazards. Standard errors for these and other derived quantities are computed using the delta method and checked using the bootstrap. A description of standard statistical software (Stata, SAS, and S-Plus) for the computations is included and available at http://statepi.jhsph.edu/software.
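    The four hazard shapes can be explored numerically; the sketch below uses scipy's gengamma parameterization (shape a, exponent c; a = c = 1 gives the exponential and a = 1 gives the Weibull), with parameter values picked by inspection to illustrate each shape rather than taken from the paper's taxonomy.

```python
# A numerical sketch of the four GG hazard shapes using scipy's gengamma
# parameterization (shape a, exponent c). The parameter values below were
# chosen by inspection to illustrate each shape, not taken from the paper.
import numpy as np
from scipy.stats import gengamma

t = np.linspace(0.05, 5, 200)

def hazard(a, c):
    dist = gengamma(a, c)
    return dist.pdf(t) / dist.sf(t)   # h(t) = f(t) / S(t)

def classify(h):
    d = np.sign(np.diff(h))
    if np.all(d >= 0): return "monotone increasing"
    if np.all(d <= 0): return "monotone decreasing"
    # one interior peak -> arc-shaped; one interior trough -> bathtub
    return "arc-shaped" if d[0] > 0 else "bathtub"

for a, c in [(1.0, 2.0), (1.0, 0.5), (2.0, 0.7), (0.2, 3.0)]:
    print(f"a={a}, c={c}: {classify(hazard(a, c))}")
```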
    Often in randomized clinical trials and observational cohort studies, a non-negative continuously distributed response variable is measured in treatment and control groups. In the presence of true zeros for the response variable, a two-part zero-inflated log-normal model (which assumes that the data have a probability mass at zero and a continuous response for values greater than zero) is usually recommended. However, in some environmental health and human immunodeficiency virus (HIV) studies, quantitative assays for metabolites of toxicants or quantitative HIV RNA measurements are subject to left-censoring due to values falling below the limit of detection (LD). Here, a zero-inflated log-normal mixture model is often suggested, since true zeros are indistinguishable from left-censored values due to the LD. When the probabilities of true zeros in the two groups are not restricted to be equal, the information contributed by values falling below the LD is used only to estimate the probability of true zeros in the context of mixture distributions. We derive the required sample size to assess the effect of a treatment in the context of mixture models with equal and unequal variances based on the left-truncated log-normal distribution. Methods for the calculation of statistical power are also presented. We calculate the required sample size and power for a recent study estimating the effect of oltipraz on reducing urinary levels of the hydroxylated metabolite aflatoxin M1 (AFM1) in a randomized, placebo-controlled, double-blind phase IIa chemoprevention trial in Qidong, China. A Monte Carlo simulation study is conducted to investigate the performance of the proposed methods.
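    A Monte Carlo power calculation for this setting can be sketched as below, under stated assumptions: zero-inflated log-normal responses are left-censored at the LD, and each simulated trial is analyzed with a Wilcoxon rank-sum test that ties all below-LD observations at the LD, a simple stand-in for the paper's likelihood-based mixture-model calculation.

```python
# Monte Carlo power sketch under stated assumptions, not the paper's formula:
# zero-inflated log-normal responses are left-censored at a detection limit LD,
# and each simulated trial is analyzed with a Wilcoxon rank-sum test that ties
# all below-LD observations at LD (a simple stand-in for the mixture-model test).
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)

def simulate_arm(n, p_zero, mu, sigma, ld):
    x = np.where(rng.random(n) < p_zero, 0.0, rng.lognormal(mu, sigma, n))
    return np.maximum(x, ld)   # true zeros and values < LD are indistinguishable

def power(n_per_arm, n_sims=2000, ld=0.1, alpha=0.05):
    hits = 0
    for _ in range(n_sims):
        control = simulate_arm(n_per_arm, p_zero=0.10, mu=0.0, sigma=1.0, ld=ld)
        treated = simulate_arm(n_per_arm, p_zero=0.25, mu=-0.5, sigma=1.0, ld=ld)
        if mannwhitneyu(control, treated).pvalue < alpha:
            hits += 1
    return hits / n_sims

print(f"estimated power with 60 per arm: {power(60):.2f}")
```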
    To evaluate the probabilities of a disease state, ideally all subjects in a study should be diagnosed by a definitive diagnostic or gold standard test. However, since definitive diagnostic tests are often invasive and expensive, it is generally unethical to apply them to subjects whose screening tests are negative. In this article, we consider latent class models for screening studies with two imperfect binary diagnostic tests and a definitive categorical disease status measured only for those with at least one positive screening test. Specifically, we discuss one conditionally independent and three homogeneous conditionally dependent latent class models, and assess the impact of misspecification of the dependence structure on the estimation of disease category probabilities using frequentist and Bayesian approaches. Interestingly, the three homogeneous-dependent models can provide identical goodness-of-fit but substantively different estimates for a given study. However, the parametric form of the assumed dependence structure itself is not 'testable' from the data, and thus the dependence structure modeling considered here can only be viewed as a sensitivity analysis concerning a more complicated non-identifiable model potentially involving a heterogeneous dependence structure. Furthermore, we discuss Bayesian model averaging, together with its limitations, as an alternative way to partially address this particularly challenging problem. The methods are applied to two cancer screening studies, and simulations are conducted to evaluate the performance of these methods. In summary, further research is needed to reduce the impact of model misspecification on the estimation of disease prevalence in such settings.
    Likelihood-based approaches, which naturally incorporate left censoring due to the limit of detection, are commonly utilized to analyze censored multivariate normal data. However, the maximum likelihood estimator (MLE) typically underestimates variance parameters. The restricted maximum likelihood estimator (REML), which corrects this underestimation, cannot be easily extended to censored multivariate normal data. In light of the connection between REML and a Bayesian approach discovered by Harville in 1974, this paper describes a Bayesian approach to censored multivariate normal data. This Bayesian approach is justified through its link to REML via Laplace's approximation, and its performance is evaluated through a simulation study. We consider the Bayesian approach a valuable alternative because it yields less biased variance parameter estimates than the MLE, and because a proper REML extension is technically difficult when data are left-censored.
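    The way a likelihood naturally incorporates left censoring can be seen in the univariate case: detected values contribute the normal density, and each below-LD value contributes the probability Phi((LD - mu) / sigma). The sketch below fits this censored likelihood to simulated data; it is an illustration, not the paper's multivariate Bayesian procedure.

```python
# Univariate sketch of likelihood-based handling of left censoring: observed
# values contribute the normal density, and values below the detection limit
# contribute Phi((LD - mu) / sigma). Data here are simulated for illustration.
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = rng.normal(1.0, 2.0, size=200)
ld = 0.0
observed, n_censored = x[x >= ld], np.sum(x < ld)

def neg_loglik(theta):
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)
    ll = norm.logpdf(observed, mu, sigma).sum()       # detected values
    ll += n_censored * norm.logcdf(ld, mu, sigma)     # below-LD values
    return -ll

fit = minimize(neg_loglik, x0=[observed.mean(), 0.0], method="Nelder-Mead")
mu_hat, sigma_hat = fit.x[0], np.exp(fit.x[1])
print(f"MLE: mu = {mu_hat:.3f}, sigma = {sigma_hat:.3f}")
```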
    To account for between-study heterogeneity in meta-analysis of diagnostic accuracy studies, bivariate random effects models have been recommended to jointly model the sensitivities and specificities. As study design and population vary, the definition of disease status or severity could differ across studies. Consequently, sensitivity and specificity may be correlated with disease prevalence. To account for this dependence, a trivariate random effects model has been proposed. However, that approach can only include cohort studies with information for estimating study-specific disease prevalence. In addition, some diagnostic accuracy studies only select a subset of samples to be verified by the reference test. It is known that ignoring unverified subjects may lead to partial verification bias in the estimation of prevalence, sensitivities, and specificities in a single study. However, the impact of this bias on a meta-analysis has not been investigated. In this paper, we propose a novel hybrid Bayesian hierarchical model that combines cohort and case-control studies while correcting partial verification bias at the same time. We investigate the performance of the proposed methods through a set of simulation studies. Two case studies, on assessing the diagnostic accuracy of gadolinium-enhanced magnetic resonance imaging in detecting lymph node metastases and of adrenal fluorine-18 fluorodeoxyglucose positron emission tomography in characterizing adrenal masses, are presented.
    Melanoma cell lines and normal human melanocytes (NHM) were assayed for p53-dependent G1 checkpoint response to ionizing radiation (IR)-induced DNA damage. Sixty-six percent of melanoma cell lines displayed a defective G1 checkpoint. Checkpoint function was correlated with sensitivity to IR with checkpoint-defective lines being radio-resistant. Microarray analysis identified 316 probes whose expression was correlated with G1 checkpoint function in melanoma lines (P≤0.007) including p53 transactivation targets CDKN1A, DDB2, and RRM2B. The 316 probe list predicted G1 checkpoint function of the melanoma lines with 86% accuracy using a binary analysis and 91% accuracy using a continuous analysis. When applied to microarray data from primary melanomas, the 316 probe list was prognostic of 4-yr distant metastasis-free survival. Thus, p53 function, radio-sensitivity, and metastatic spread may be estimated in melanomas from a signature of gene expression.
    This paper deals with the problem of estimating the Pearson correlation coefficient when one variable is subject to left or right censoring. In parallel to the classical results on the Pearson correlation coefficient, we derive, through tedious computation and intensive simplification, workable formulas for the asymptotic variances of the maximum likelihood estimators in two cases: (1) known means and variances and (2) unknown means and variances. We illustrate the usefulness of the asymptotic results in experimental designs.
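    For case (1), with known means and variances (here standard normal margins), the censored-data likelihood can be written down directly: uncensored pairs contribute the conditional density of Y given X, and censored pairs contribute Phi((LD - rho*x) / sqrt(1 - rho^2)). The sketch below, with simulated data, illustrates this likelihood rather than the paper's derivation.

```python
# Sketch of the MLE for the correlation when Y is left-censored at LD, in the
# paper's case (1): means and variances known (standard normal margins here).
# Data are simulated for illustration.
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
n, rho_true, ld = 500, 0.6, -0.5
x = rng.normal(size=n)
y = rho_true * x + np.sqrt(1 - rho_true**2) * rng.normal(size=n)
detected = y >= ld

def neg_loglik(rho):
    s = np.sqrt(1 - rho**2)
    # detected pairs: log f(y | x), with Y | X = x ~ N(rho * x, 1 - rho^2)
    ll = norm.logpdf(y[detected], rho * x[detected], s).sum()
    # censored pairs: log P(Y < ld | x); the marginal log f(x) terms are
    # constant in rho, so they are omitted from the optimization
    ll += norm.logcdf(ld, rho * x[~detected], s).sum()
    return -ll

fit = minimize_scalar(neg_loglik, bounds=(-0.99, 0.99), method="bounded")
print(f"true rho = {rho_true}, MLE = {fit.x:.3f}")
```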
    A marginal approach and a variance-component mixed effect model approach (here called a conditional approach) are commonly used to analyze variables that are subject to a limit of detection. We examine the theoretical relationship between these two approaches and investigate their numerical performance. Based on our results, we recommend the marginal approach for bivariate normal variables, whereas the variance-component mixed effect model is preferable for other multivariate analyses in most circumstances. The two approaches are illustrated through a case study from a preclinical experiment.
    The traditional fixed margin approach to evaluating an experimental treatment through an active-controlled noninferiority trial is simple and straightforward. However, its utility relies heavily on the constancy assumption of the experimental data. The recently developed covariate-adjustment method permits more flexibility and improved discriminatory capacity compared with the fixed margin approach. However, one major limitation of this covariate-adjustment methodology is its reliance on patient-level data, which may not be accessible to investigators in practice. In this article, under some assumptions, we examine the feasibility of a partial covariate-adjustment approach based on data typically available from journal publications or other public data when patient-level data are unavailable. We illustrate the usefulness of this approach through two real examples. We also provide design considerations on the efficiency of the partial covariate-adjustment approach.
    To examine urban and rural variation in walking patterns and pedestrian crashes, the rates of pedestrians being struck by motor vehicles were estimated according to miles walked and resident-years among 35 732 pedestrians struck by vehicles in New York State, USA, during 2001 through 2002. The outcome measures were the adjusted rate ratios (aRR) of pedestrian-vehicle crashes and pedestrian injuries, based on resident-years and miles walked, in urban versus rural areas. Compared with rural areas, the aRR for a pedestrian-vehicle collision, based on resident-years, was 2.0 (95% CI 1.7 to 2.3) in small urban areas, 1.8 (95% CI 1.5 to 2.3) in mid-size urban areas, and 4.2 (95% CI 3.6 to 4.8) in the large urban area. The aRR based on miles walked was 2.3 (95% CI 1.6 to 3.2) in small urban areas, 2.0 (95% CI 1.4 to 2.9) in mid-size urban areas, and 1.9 (95% CI 1.4 to 2.7) in the large urban area. The aRR for a fatal pedestrian injury, based on miles walked, was 2.1 (95% CI 1.3 to 3.6) in small urban areas, 1.9 (95% CI 1.3 to 2.9) in mid-size urban areas, and 0.9 (95% CI 0.6 to 1.3) in the large urban area. The rate of pedestrian crashes and injuries in small and mid-size urban areas was twice that in rural areas, whether based on resident-years or miles walked. The high rate of pedestrian crashes in the large urban area based on resident-years can be partly explained by the fact that residents there walk about twice as much as residents in rural areas. The rate of fatal pedestrian injury based on miles walked was similar in the large urban area and rural areas.
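    For readers who want the arithmetic behind a rate ratio, the sketch below computes a crude (unadjusted) rate ratio and its Wald confidence interval from hypothetical crash counts and person-miles; the aRRs above are adjusted estimates and are not reproduced by this calculation.

```python
# Back-of-envelope sketch of rate-ratio arithmetic (crude, not the adjusted
# aRR): a rate ratio compares crashes per unit of exposure, and its Wald CI
# uses sqrt(1/a + 1/b) on the log scale. Counts and miles are hypothetical.
import math

def rate_ratio_ci(a, t1, b, t0, z=1.96):
    rr = (a / t1) / (b / t0)
    se = math.sqrt(1 / a + 1 / b)
    return rr, rr * math.exp(-z * se), rr * math.exp(z * se)

# hypothetical: 900 crashes over 40M miles walked (urban) vs 150 over 14M (rural)
rr, lo, hi = rate_ratio_ci(900, 40e6, 150, 14e6)
print(f"RR = {rr:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```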
