Data Note

Whole-genome sequencing of nine esophageal adenocarcinoma cell lines

[version 1; peer review: 3 approved]
PUBLISHED 10 Jun 2016

This article is included in the Data: Use and Reuse collection.

Abstract

Esophageal adenocarcinoma (EAC) is highly mutated and molecularly heterogeneous. The number of cell lines available for study is limited and their genomes have been only partially characterized. An accurate annotation of their mutational landscape is crucial for sound experimental design and correct interpretation of genotype-phenotype findings. We performed high-coverage, paired-end whole genome sequencing on eight EAC cell lines (ESO26, ESO51, FLO-1, JH-EsoAd1, OACM5.1 C, OACP4 C, OE33, SK-GT-4), all verified against original patient material, and one esophageal high-grade dysplasia cell line, CP-D. We have made available the aligned sequence data and report single nucleotide variants (SNVs), small insertions and deletions (indels), and copy number alterations, identified by comparison with the human reference genome and known single nucleotide polymorphisms (SNPs). We compare these putative mutations to mutations found in primary tissue EAC samples, to inform the use of these cell lines as a model of EAC.

Keywords

Esophageal adenocarcinoma, whole genome sequencing, cell line, high-grade dysplasia, cancer genome, copy number alteration, single nucleotide variant

Introduction

Esophageal adenocarcinoma (EAC), including cancers of the gastro-esophageal junction, represents a substantial health concern in Western countries due to its increasing incidence and poor prognosis. To date, there are no widely accepted animal models for EAC, and only a limited number of cell lines are available for in vitro functional studies. Recent genome-wide sequencing projects have shown that EAC is one of the most highly mutated solid cancers, with a high degree of heterogeneity (Dulak et al., 2013; Weaver et al., 2014). In addition to point mutations, there are widespread copy number alterations, with evidence of catastrophic events such as chromothripsis and breakage-fusion-bridge cycles in about one-third of cases (Nones et al., 2014). An accurate annotation of the mutational landscape of available EAC cell lines is therefore crucial for optimal experimental design, interpretation of genotype-phenotype data and analysis of drug sensitivities. We selected eight EAC cell lines (ESO26, ESO51, FLO-1, JH-EsoAd1, OACM5.1 C, OACP4 C, OE33, SK-GT-4), the identities of which have been verified against the original tumors by short tandem repeat (STR) analysis, p53 mutation and xenograft histology (Boonstra et al., 2010), and one esophageal high-grade dysplasia cell line (CP-D). We performed high-coverage paired-end whole genome sequencing and aligned the sequence data to the human reference genome in order to detect single nucleotide variants, indels and copy number alterations.

Materials and methods

Ethics

Cell lines were obtained through commercially available repositories except JH-EsoAd1, which was a kind gift from Hector Alvarez (Table 1).

Table 1. Characteristics and clinico-pathological features of the EAC cell lines analysed.

Verified origin identifies cell lines whose pathological origin from EAC has been verified in Boonstra et al., 2010.

| Cell line | Alternative names | Age | Sex | Ethnicity | Histology | Date derived | Stage | Ploidy | Commercial availability | Verified origin | Ref |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CP-D | CP-18821 | Adult | M | | hTERT immortalized oesophageal HGD | 1995 | HGD | hypotetraploid | ATCC | | Palanca-Wessels et al., 1998 |
| ESO26 | | 56 | M | Caucasian | GOJ adenocarcinoma | 2000 | Stage IV | hypodiploid (1.8) | Public Health England – Culture Collection | YES | Boonstra et al., 2010 |
| ESO51 | | 74 | M | Caucasian | Distal oesophageal adenocarcinoma | 2000 | Stage IV | hypotriploid (2.75) | Public Health England – Culture Collection | YES | Boonstra et al., 2010 |
| FLO-1 | | 68 | M | Caucasian | Distal oesophageal adenocarcinoma | 1991 | | hypodiploid (1.9) | Public Health England – Culture Collection | YES | Hughes et al., 1997 |
| JH-EsoAd1 | JHAD1 | 66 | M | Caucasian | Moderately to poorly differentiated oesophageal adenocarcinoma | 1997 | Stage IIA (T3 N0 M0) | triploid | No, due to be deposited in ATCC | YES | Alvarez et al., 2008 |
| OACM5.1C | | 47 | F | Caucasian | Lymph node metastases of distal oesophageal adenocarcinoma | 2001 | Stage IV | hypodiploid | Public Health England – Culture Collection | YES | de Both et al., 2001 |
| OACP4 C | | 55 | M | Caucasian | Gastric cardia adenocarcinoma | 2001 | Stage IV | Aneuploidy (53–57 chromosomes) | Public Health England – Culture Collection | YES | de Both et al., 2001 |
| OE33 | JROECL33 | 73 | F | | Distal oesophageal adenocarcinoma | 1993 | Stage IIA | hypotetraploid (3.5) | Public Health England – Culture Collection | YES | Rockett et al., 1997 |
| SK-GT-4 | | 83 | M | | Distal oesophageal adenocarcinoma | 1989 | Stage IIB | Aneuploid (mode 59 chromosomes, SK | Public Health England – Culture Collection | YES | Altorki et al., 1993 |

Cell lines

All cell lines were from a certified source (Table 1) and verified in house for >90% match with publicly reported STR profiles. Cell lines were mycoplasma tested and grown in standard conditions reported in cell repositories indicated in Table 1. Matched germline DNA was not available.
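
The text does not state how the STR match percentage was computed. As a minimal sketch, assuming a Dice-style score (twice the shared alleles divided by the total alleles in both profiles), a profile comparison could look like the following. The function name, loci and allele calls are hypothetical and do not represent real profiles of these cell lines.

```python
def str_match_percent(query, reference):
    """Dice-style percent match between two STR profiles.

    Each profile maps a locus name to the set of alleles called at that locus.
    Only loci present in both profiles are compared. This scoring scheme is an
    assumption, not necessarily the one used in the study.
    """
    shared = 0
    total = 0
    for locus in query.keys() & reference.keys():
        q, r = set(query[locus]), set(reference[locus])
        shared += len(q & r)
        total += len(q) + len(r)
    if total == 0:
        raise ValueError("profiles have no loci in common")
    return 100.0 * 2 * shared / total


# Hypothetical allele calls, for illustration only.
in_house = {"D5S818": {11, 12}, "TH01": {6, 9.3}, "TPOX": {8}, "vWA": {16, 18}}
published = {"D5S818": {11, 12}, "TH01": {6, 9.3}, "TPOX": {8}, "vWA": {16, 17}}

score = str_match_percent(in_house, published)
print(f"{score:.1f}% match; passes >90% threshold: {score > 90}")
```

In this toy case one discordant allele at a single locus is enough to drop the score below the 90% threshold, because only four loci are compared; real STR panels use more loci.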

Library preparation, sequencing and QC

Genomic DNA was prepared from cultured cells with the AllPrep DNA/RNA Mini Kit (Qiagen) according to the manufacturer’s instructions. A single library was created for each sample, and 90-bp paired-end sequencing was performed at the Beijing Genomics Institute (BGI, Guangdong, China) according to Illumina (CA, USA) instructions to a typical depth of 30×, with 94% of the known genome being sequenced to at least 10× coverage and achieving a Phred quality of 30 for at least 80% of mapping bases. FastQC 0.11.2 (http://www.bioinformatics.babraham.ac.uk/projects/fastqc) was used to assess the quality of the sequence data. Additional alignment, duplication and insert-size quality metrics are reported in Supplementary material 7. Sequence reads were mapped to the human reference genome (Ensembl GRCh37, release 84) using BWA 0.5.9 (Li, 2009), sorted into genome coordinate order and duplicates marked using Picard 1.105 (FixMateInformation and MarkDuplicates tools respectively, http://broadinstitute.github.io/picard). Original BAM files are available in the European Bioinformatics Institute (EBI) repository (project: PRJEB14018; sample accessions: ERS1158075-ERS1158083).
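
To make the two QC figures above concrete, the sketch below shows what a Phred score of 30 means as an error probability and how a "fraction of positions covered to at least 10×" figure can be computed from per-position depths. The function names and the toy depth values are hypothetical; in practice such metrics come from dedicated tools run on the BAM files.

```python
def phred_to_error_probability(q):
    """Convert a Phred quality score Q to a base-call error probability, 10^(-Q/10)."""
    return 10 ** (-q / 10)


def breadth_of_coverage(depths, min_depth):
    """Fraction of positions whose read depth is at least min_depth."""
    if not depths:
        raise ValueError("depths must not be empty")
    return sum(d >= min_depth for d in depths) / len(depths)


# Toy per-position depths, for illustration only.
toy_depths = [0, 8, 12, 25, 31, 33, 29, 40, 10, 35]

print(phred_to_error_probability(30))       # error probability for Q30
print(breadth_of_coverage(toy_depths, 10))  # fraction of positions at >= 10x
```

A Phred score of 30 corresponds to one expected error per 1,000 base calls.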

Mutation calling

GATK v3.2.2 (Broad Institute, MA, USA) was used to call and filter single nucleotide and indel variants compared to the reference genome. In brief, the steps run were as follows: 1) local realignment of reads to correct misalignments around indels using the GATK RealignerTargetCreator and IndelRealigner tools; 2) recalibration of base quality scores using the GATK BaseRecalibrator tool; 3) SNV and indel calling using GATK HaplotypeCaller, which determines haplotypes by re-assembly within regions determined to be active, i.e. where there is evidence for a variation, and uses a Bayesian approach to assign genotypes. Hard filters were applied to the resulting call set using recommendations available from the GATK documentation (https://www.broadinstitute.org/gatk) to generate a high-confidence set of SNV and indel calls. These were analyzed with the Ensembl Variant Effect Predictor (release 75, http://www.ensembl.org/info/docs/tools/vep/index.html) to annotate them with genomic features and consequences for protein-coding regions (Supplementary material 4). For the purposes of the analysis, all variants with a global minor allele frequency (GMAF) >0.0014 described in the 1000 Genomes Project were separated out as likely germline polymorphisms (The 1000 Genomes Project Consortium et al., 2012), according to the criteria adopted in the COSMIC Cell Lines Project (Wellcome Trust Sanger Institute, Cambridge). Further, we removed all SNPs that have a minor allele frequency in dbSNP (Ensembl v.58) and variants with a frequency ≥0.00025 in the ESP6500 (NHLBI GO Exome Sequencing Project, released June 20th 2012). A full list of the filtered variants is available in Supplementary material 4 and Supplementary material 6.
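
The population-frequency filters described above can be sketched as a simple predicate. This is an illustrative reading of the stated thresholds, not the authors' pipeline: the `Variant` record, its field names and the example variants are hypothetical, and "has a minor allele frequency in dbSNP" is interpreted here as "any reported dbSNP MAF".

```python
from dataclasses import dataclass
from typing import Optional

GMAF_1000G_THRESHOLD = 0.0014   # variants above this GMAF are treated as germline
ESP6500_THRESHOLD = 0.00025     # variants at or above this frequency are removed


@dataclass
class Variant:
    name: str                             # hypothetical identifier
    gmaf_1000g: Optional[float] = None    # 1000 Genomes global MAF, if reported
    dbsnp_maf: Optional[float] = None     # dbSNP MAF, if reported
    esp6500_freq: Optional[float] = None  # ESP6500 frequency, if reported


def is_likely_germline(v):
    """Return True if any population-frequency criterion flags the variant."""
    if v.gmaf_1000g is not None and v.gmaf_1000g > GMAF_1000G_THRESHOLD:
        return True
    if v.dbsnp_maf is not None:  # assumption: any reported MAF counts
        return True
    if v.esp6500_freq is not None and v.esp6500_freq >= ESP6500_THRESHOLD:
        return True
    return False


# Hypothetical variants, for illustration only.
calls = [
    Variant("common_snp", gmaf_1000g=0.12, dbsnp_maf=0.12),
    Variant("rare_in_esp", esp6500_freq=0.0001),
    Variant("esp_at_threshold", esp6500_freq=0.00025),
    Variant("novel"),
]
putative_somatic = [v.name for v in calls if not is_likely_germline(v)]
print(putative_somatic)
```

Without a matched germline sample, variants that pass such filters remain only putative somatic calls, since rare private germline variants are absent from population databases.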

Copy number assessment

Copy number (CN) analysis was carried out using Control-FREEC (Boeva et al., 2012). Control-FREEC computes and segments CN profiles and is capable of characterizing over-diploid genomes, taking into consideration the GC-content and mappability profiles to normalize read count in the absence of a control sample. Ploidy in each cell line was assessed interactively with the Crambled app v.2.0 according to the methods described by Lynch (2015).
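
Because the cell lines differ widely in ploidy (Table 1), a copy number is most informative relative to the baseline ploidy of the line. The sketch below expresses a segment's copy number as a log2 ratio against ploidy and labels it as a gain or loss. It is not Control-FREEC's algorithm; the function names and the 0.5 threshold are assumptions chosen for illustration.

```python
import math


def cn_log2_ratio(copy_number, ploidy):
    """log2 of copy number relative to sample ploidy (-inf for a homozygous deletion)."""
    if ploidy <= 0:
        raise ValueError("ploidy must be positive")
    if copy_number < 0:
        raise ValueError("copy number cannot be negative")
    if copy_number == 0:
        return -math.inf
    return math.log2(copy_number / ploidy)


def classify_segment(copy_number, ploidy, threshold=0.5):
    """Label a segment as 'gain', 'loss' or 'neutral' (threshold is hypothetical)."""
    ratio = cn_log2_ratio(copy_number, ploidy)
    if ratio >= threshold:
        return "gain"
    if ratio <= -threshold:
        return "loss"
    return "neutral"


# The same absolute copy number reads differently at different ploidies.
print(classify_segment(4, 2.0))  # four copies in a diploid background
print(classify_segment(4, 3.5))  # four copies in a near-tetraploid background
```

Four copies count as a gain against a diploid baseline but fall within the neutral band for a line with ploidy 3.5.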

Dataset validation

Whole genome sequencing

We identified a median of 1.3×10^5 variants across all 9 cell lines (range 105,487–151,879; Figure 1a, Table 2, Supplementary material 3, Supplementary material 4). We found that 1.5% of the variants were in coding regions; additionally, 4% fell in surrounding gene regions (i.e. regulatory regions as defined in Zerbino et al. (2015), and upstream and downstream regions), 41% in introns and 23% in intergenic regions. Among the variants in the coding sequence, the majority (57.4%) were in the UTR regions, followed by exonic missense and synonymous variants (21% and 11%, respectively; Figure 1, Table 2, Supplementary material 3, Supplementary material 4). The number of variants identified in the high-grade dysplasia CP-D line was not significantly lower than the median of the EAC cell lines, consistent with the finding that such pre-malignant lesions have already accumulated many SNVs (Weaver et al., 2014). OACP4C and ESO26 showed the smallest and largest number of variants, respectively (Figure 1, Table 2).


Figure 1. Distribution of detected variants and coding sequence consequences (mean percentage value).

A) Bar chart showing the distribution of called variants across various regions of the genome as indicated; B) Details of the coding sequence variants identified by the Variant Effect Predictor (Ensembl) expressed as a mean percentage value of all cell lines (values were not statistically different among samples).

Table 2. Detailed distribution of identified variants for each cell line.

Absolute number, median, median absolute deviation and range interval are listed for each category of mutation according to Variant Effect Predictor classification (Ensembl).

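| Class | Group | Consequence | CP-D | ESO26 | ESO51 | FLO-1 | JH-EsoAD1 | OACM5.1 | OACP4C | OE33 | SK-GT-4 | Median | Median absolute deviation | Min | Max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Coding variants (type) | UTR | 5 prime UTR | 229 | 301 | 262 | 191 | 206 | 264 | 229 | 216 | 305 | 229 | 33 | 191 | 305 |
| | | 3 prime UTR | 979 | 1097 | 1002 | 926 | 929 | 1026 | 848 | 986 | 1113 | 986 | 57 | 848 | 1113 |
| | Start/Stop | initiator codon | 1 | 3 | 2 | 2 | 3 | 2 | 1 | 0 | 1 | 2 | 1 | 0 | 3 |
| | | stop lost | 2 | 2 | 4 | 2 | 2 | 2 | 3 | 3 | 2 | 2 | 0 | 2 | 4 |
| | | stop retained | 2 | 1 | 4 | 2 | 2 | 1 | 2 | 2 | 2 | 2 | 0 | 1 | 4 |
| | | stop gained | 10 | 14 | 17 | 16 | 14 | 17 | 9 | 14 | 24 | 14 | 3 | 9 | 24 |
| | Missense | missense | 385 | 496 | 497 | 436 | 435 | 481 | 431 | 446 | 454 | 446 | 15 | 385 | 497 |
| | Splice sites | splice acceptor | 4 | 11 | 7 | 8 | 11 | 11 | 9 | 7 | 7 | 8 | 1 | 4 | 11 |
| | | splice donor | 5 | 7 | 6 | 10 | 6 | 9 | 6 | 5 | 18 | 6 | 1 | 5 | 18 |
| | | splice region | 105 | 113 | 107 | 92 | 96 | 95 | 83 | 103 | 102 | 102 | 6 | 83 | 113 |
| | Frameshift INDEL | frameshift | 42 | 52 | 41 | 45 | 34 | 34 | 49 | 46 | 54 | 45 | 4 | 34 | 54 |
| | In-frame INDEL | inframe deletion | 11 | 10 | 15 | 18 | 15 | 14 | 10 | 15 | 20 | 15 | 3 | 10 | 20 |
| | | inframe insertion | 10 | 17 | 19 | 8 | 14 | 10 | 11 | 8 | 16 | 11 | 3 | 8 | 19 |
| | Synonymous | | 199 | 278 | 284 | 259 | 221 | 283 | 202 | 208 | 242 | 242 | 36 | 199 | 284 |
| | Other | | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |
| Non-coding variants (regions) | Gene boundaries | downstream | 19197 | 20411 | 18927 | 18009 | 17711 | 19363 | 16202 | 18463 | 20318 | 18927 | 918 | 16202 | 20411 |
| | | upstream | 19197 | 20761 | 19332 | 18122 | 18196 | 20182 | 16825 | 18944 | 21239 | 19197 | 1001 | 16825 | 21239 |
| | Intergenic | | 29694 | 38091 | 34040 | 31999 | 27269 | 31875 | 21550 | 32985 | 33380 | 31999 | 2041 | 21550 | 38091 |
| | Introns | | 55372 | 61682 | 56671 | 54869 | 51163 | 56193 | 43210 | 55945 | 61374 | 55945 | 1076 | 43210 | 61682 |
| | Non-coding transcripts | mature miRNA | 8 | 13 | 6 | 6 | 5 | 10 | 5 | 8 | 4 | 6 | 2 | 4 | 13 |
| | | non-coding transcript | 1 | 2 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 2 |
| | | non-coding transcript exon | 2149 | 2200 | 2116 | 1868 | 1920 | 2113 | 1811 | 2095 | 2310 | 2113 | 87 | 1811 | 2310 |
| | Regulatory regions | TF binding site | 404 | 453 | 469 | 431 | 413 | 500 | 408 | 440 | 486 | 440 | 29 | 404 | 500 |
| | | regulatory region | 4667 | 5863 | 5301 | 4686 | 4512 | 5011 | 3582 | 4778 | 6158 | 4778 | 266 | 3582 | 6158 |
| Total | | | 132674 | 151879 | 139131 | 132006 | 123179 | 137498 | 105487 | 135718 | 147631 | 135718 | 3712 | 105487 | 151879 |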
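
The summary columns of Table 2 (median, median absolute deviation, minimum and maximum) can be recomputed from the per-cell-line values. The sketch below does this for the total variant counts in the last row of the table; the function and variable names are illustrative, and the median absolute deviation is taken here as the unscaled median of absolute deviations from the median.

```python
from statistics import median


def median_absolute_deviation(values):
    """Unscaled MAD: the median of absolute deviations from the median."""
    values = list(values)
    centre = median(values)
    return median([abs(v - centre) for v in values])


# Total variants per cell line, taken from the last row of Table 2.
total_variants = {
    "CP-D": 132674, "ESO26": 151879, "ESO51": 139131,
    "FLO-1": 132006, "JH-EsoAD1": 123179, "OACM5.1": 137498,
    "OACP4C": 105487, "OE33": 135718, "SK-GT-4": 147631,
}

counts = list(total_variants.values())
print("median:", median(counts))
print("MAD:", median_absolute_deviation(counts))
print("range:", min(counts), "-", max(counts))
```

The same function applies to any row of the table, which gives a quick consistency check when the table is re-entered by hand.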

A limitation of this study is the lack of a matched normal counterpart for each cell line. To mitigate this problem, in addition to the GATK calling pipeline we applied a series of filters according to the criteria reported in the methods and derived from the 1000 Genomes Project (The 1000 Genomes Project Consortium et al., 2012), dbSNP (Ensembl v.58) and ESP6500 (released June 20th 2012). This approach reduced the number of variants by an order of magnitude relative to the original GATK pipeline (from a median of 4.1×10^6 to 1.3×10^5). Yet the abundance of called variants, compared to a range of 4.8×10^3 to 6×10^4 reported in human EAC (Weaver et al., 2014), may indicate that a proportion of the variants called in our final annotation are of germline origin. Additional mutations may also have accumulated in vitro. A comprehensive annotation of the coding sequence variants identified is reported in Supplementary material 3 and Supplementary material 4.

Analysis of putative EAC driver genes

In order to investigate how closely cell lines reflect the spectrum of mutations observed in human specimens, we analysed the mutational landscape of known cancer and putative EAC driver genes and compared it to the previously reported mutation rates (Dulak et al., 2013; Weaver et al., 2014; Figure 2b & 2c). TP53 is mutated in 69% of EACs (Weaver et al., 2014), whereas all cell lines carried at least one deleterious TP53 mutation. A SMAD4 mutation was present in 2 of 9 cell lines, ESO26 and JH-EsoAd1, consistent with the 13% observed in EAC (Weaver et al., 2014). We were not able to identify mutations in ARID1A, which is reportedly mutated in about 10% of EAC specimens; it was affected only by UTR variants in 1 of 9 cell lines. Only some of the missense variants in the genes shown in Figure 2b resulted in known pathogenic mutations (in TP53, PIK3CA and TLR4). Other genes harboured benign or likely benign variants and/or variants of uncertain functional significance.
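
With only nine cell lines, observed frequencies are coarse, so it can help to ask how likely a given count would be if each line were an independent draw at the rate reported for primary EAC. The sketch below uses a simple binomial model for that purpose. The independence assumption and the function names are ours for illustration, and this calculation is not part of the original analysis.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_at_least(k, n, p):
    """Probability of k or more successes in n independent trials."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))


n_lines = 9
# SMAD4: 2 of 9 lines mutated, against a reported rate of 13% in EAC.
p_smad4 = prob_at_least(2, n_lines, 0.13)
# TP53: 9 of 9 lines mutated, against a reported rate of 69% in EAC.
p_tp53 = prob_at_least(9, n_lines, 0.69)

print(f"P(>=2 of 9 | 13%) = {p_smad4:.3f}")
print(f"P(9 of 9 | 69%)  = {p_tp53:.3f}")
```

Under this toy model, seeing at least 2 of 9 lines with a SMAD4 mutation has a probability of about 0.33, while all 9 lines carrying a TP53 mutation has a probability of about 0.035.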


Figure 2. Analysis of SNVs and CNAs in putative EAC genes identified in Dulak et al. (2013) and Weaver et al. (2014).

A) Log ratio of copy number status of the selected genes computed with Control-FREEC (green indicates CN gain and red CN loss). Genome-wide CN for each line is available in Supplementary material 1 and Supplementary material 3. B) SNVs identified by our pipelines and annotated by Variant Effect Predictor analysis (Ensembl). When more than one variant was present in a single gene, the most deleterious was annotated according to the color-coded legend reported at the bottom of the figure. A complete annotation of the identified SNVs is available in Supplementary material 2. C) Blue and red bars indicate the mutation rate of EAC genes reported in Dulak et al., 2013, and Weaver et al., 2014, respectively.

We expanded our analysis to other cancer genes of potential relevance to EAC. We identified a pathogenic KRAS mutation in SK-GT-4, and missense mutations of uncertain significance in MET (OE33) and EGFR (CP-D, ESO26, JH-EsoAd1). Among DNA repair genes, all cell lines carry benign missense variants of ATM and missense variants of uncertain significance in BRCA2. MSH2 is affected by a missense variant in SK-GT-4, splice site variants in CP-D and JH-EsoAd1, and UTR variants in ESO51 and OACP4 C (Supplementary material 3, Supplementary material 4, Supplementary material 6). Copy number analysis (Supplementary material 1, Supplementary material 2) identified recurrent amplifications in ERBB2, MYC, MET and SEMA5A, and deletions in SMAD4, CDKN2A, CCDC102B and SMARCA4.

These sequencing data will enable the research community to undertake and interpret further analyses (reviewed in Supplementary material 5) and will inform the use of these cell lines as a model of EAC. Our data highlight the need to develop additional in vitro models that have a germline reference genome, so that somatic changes can be clearly identified (Gazdar et al., 1998). A larger number of cell lines might also more closely recapitulate the range of mutations observed in human disease.

Data availability

BAM files are available at the European Nucleotide Archive (ENA, EMBL-EBI, www.ebi.ac.uk/ena, Study PRJEB14018). Accession numbers: CP-D ERS1158083; SK-GT-4 ERS1158082; OE33 ERS1158081; OACP4 C ERS1158080; OACM5.1 ERS1158079; JH-EsoAd1 ERS1158078; FLO-1 ERS1158077; ESO51 ERS1158076; ESO26 ERS1158075.

Contino G, Eldridge MD, Secrier M et al. Whole-genome sequencing of nine esophageal adenocarcinoma cell lines [version 1; peer review: 3 approved]. F1000Research 2016, 5:1336 (https://doi.org/10.12688/f1000research.7033.1)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Reviewer Report 21 Jul 2016
Marnix Jansen, Barts Cancer Institute - a Cancer Research UK Centre of Excellence, Barts and The London School of Medicine and Dentistry, London, EC1M 6BQ, UK 
Approved
In this study Contino present their WGS analysis of 9 (verified) oesophageal adenocarcinoma cell lines. This is an adequate platform to present these data and the fact that the authors make all raw BAM files easily accessible to the community ...
Jansen M. Reviewer Report For: Whole-genome sequencing of nine esophageal adenocarcinoma cell lines [version 1; peer review: 3 approved]. F1000Research 2016, 5:1336 (https://doi.org/10.5256/f1000research.7571.r14746)
Reviewer Report 11 Jul 2016
Claire Palles, Oxford Centre for Cancer Gene Research, Wellcome Trust Centre for Human Genetics, Oxford, UK 
Laura Chegwidden, Oxford Centre for Cancer Gene Research, Wellcome Trust Centre for Human Genetics, Oxford, UK 
Approved
The authors have performed whole genome sequencing of eight esophageal adenocarcinoma cell lines and one esophageal high grade dysplasia cell line to an average depth of 30x. The authors have made the BAM and VCF files available through the EBI ...
Palles C and Chegwidden L. Reviewer Report For: Whole-genome sequencing of nine esophageal adenocarcinoma cell lines [version 1; peer review: 3 approved]. F1000Research 2016, 5:1336 (https://doi.org/10.5256/f1000research.7571.r14843)
Reviewer Report 05 Jul 2016
Ian Beales, School of Health Policy and Practice, University of East Anglia, Norwich, Norfolk, UK 
Approved
The authors have examined the DNA sequences of 8 oesophageal adenocarcinoma cell lines and one high-grade dysplasia cell line. The authors should be congratulated for tackling this important unmet need in oesophageal cancer research and publishing these important findings in ...
Beales I. Reviewer Report For: Whole-genome sequencing of nine esophageal adenocarcinoma cell lines [version 1; peer review: 3 approved]. F1000Research 2016, 5:1336 (https://doi.org/10.5256/f1000research.7571.r14325)