Gene Miner - an integrated platform for cancer data analysis

2018 Insight Data Engineering project

Table of Content

Overview
Pipeline
Downstream analysis example
Configuration
Dependencies
Run instructions
Demo

Overview

The project extracted and integrated cancer research data from GDC Data Portal of National Cancer Institute to patient-indexed tables in PostgreSQL on Amazon RDS via Spark, thus enabled cancer-wide, genome-wide queries on cancer data via SQL or other analytic tools.

Pipeline

ETL pipeline is described below:

The pipeline can be divided into 2 stages:

Generation of file look-up table
Individual file parsing and information extraction.

File lookup table

In data portal of National Cancer Institute, each patient may have multiple feature files, and each feature file contains one aspect of information from one individual patient. Moreover, it isn't possible to tell which patient one file belongs to from the files directly.

To index the files downloaded from NCI, I created a file look-up table in psql with two meta data files:

Manifest files: Tabular CSV files, from which I retrieved filename and where the files are stored after downloading.
MetaInfo files: Json files, from which I retrieved filename, file type, patient id, and project id (the project that collected the patient's information).

With PySpark, I joined the two tables on filenames, grouped the files by their types, and saved the information to tables in PostgreSQL database. Due to time limit of the project, xml files and txt files were saved in separate tables for further information extraction. All the other file types are saved in another table, for future use.

Schema of xml_list table:

path	filename	data_format	project_id	case_id
text	text	text	text	text

Meanwhile, with the patient id information extracted from meta info files, I created another table in psql, named patient_info. This table, with patient id as primary key, in psql.

patient_info table will also save information about the project the patient enrolled in, and later the clinical information, such as patient gender, disease type and disease stage, from clinical xml files.

Schema of patient_info table:

case_id	project_id	disease_type	disease_stage	gender
text PRIMARY KEY UNIQUE	text	text	text	text

Xml files processor

As mentioned above, clinical reports are saved as xml format files. From these files, I extracted information of patient gender, primary tumor site, and tumor stage (if applicable).

Because each xml file is only 100kb, and each file will update one row in table patient_info, I read in xml file list from xml_list table in psql with Spark, and split the further tasks: file reading and info extraction, to separate partitions. In each partition, I parsed the xml file with Python built-in xml reader.

Further, the extracte information was updated in batch (by partition) to patient_info table.

The optimization and design logic of xml file processor can be found in wikipage.

rnaseq_files_processor

For normalization, RNA-seq data are stored in a separate table, gene_expr_table.

gene_expr_table is indexed with both patient id and gene id.

To start with, an empty table with headers: case_id(patient id), gene_id and gene_expr is created:

case_id	gene_id	gene_expr
text	text	float

Genome squencing results are saved in tabular txt files. For each partition of files, the txt files are parsed and appended to gene_expr_table together with patient id.

After data is saved to database, the gene_expr_table is indexed with B-tree.

Downstream analysis example

Top 10 genes changed between Stage II and Stage I of breast cancer

(See SQL code here.)

Based on disease_type and disease_stage columns patient_info table, patient_id associated with specific cancer type and specific tumor stages are filtered out and grouped.

These patient_id are then used to filter out the genes associated with specific diseases and stages in gene_expr_table, groupby gene_id.

After having the avg gene expression levels for each stage, the fold change between stages were calculated by dividing the values.

Finally, the (gene_id, fold_change) table is joined with reference genome table.

Reference genome table hs_genome was generated by Python via parsing annotated genome files from Ensembl. Information of genes from in total 24 chromosomes plus mitochondrial and non-chromosomal were extracted and saved to psql.

hs_genome table has the schema:

full_id id name chromosome strand pos_start pos_end info

text text text text text integer integer text

After sorting by fold_change, the top10 table came with schema:

gene_id	gene_name	fold_change	chromosome	info
text	text	float	text	text

Which are visualized in bar chart and showed in demo .

Configuration

The jobs were run on 4 m4.large EC2 instances.

The PostgreSQL database was created in Amazon RDS under the same VPC as EC2 instances.

The setup for instances, Spark, Jupyter notebook and postgresql can be found in wikipage.

Dependencies

The Anaconda environment used to run the project can be found in environment.yml.

# Python versions
    python = 3.6.5
    pyspark = 2.2.0

# Python packages
    psycopg2 = 2.7.4 # To connect to PostgreSQL
    boto3 = 1.7.4  # To connect to S3

Tha packages above should be available on both mater and workers.

# Driver packages
# PosgreSQL
    postgresql-42.2.2.jar

# Redshift
# NOTE: This is the combination that won't cause any compatibility issue
    spark-redshift_2.11-3.0.0-preview1.jar
    RedshiftJDBC41-1.2.12.1017.jar
    spark-avro_2.11-4.0.0.jar
    minimal-json-0.9.5.jar

# Front end
    flask = 1.0.2
    dash = 0.21.0
    dash-core-components = 0.22.1
    dash-html-components = 0.10.1
    gunicorn = 19.8.1
    pandas = 0.22.0

Run instructions

Set the PYTHONPATH before running:

export PYTHONPATH=$PYTHONPATH:/home/ubuntu/GeneMiner:/home/ubuntu/GeneMiner/src
Python drivers, jars and sub-modules required manual distribution are set in python script.
- To generate reference genome:
  
  GeneMiner/src $ python ref_genome/genome_parser.py
- To generate file look-up table:
  
  GeneMiner/src $ python pipeline/metainfo_processor.py
- To process each types of files:
  
  GeneMiner/src $ python pipeline/xml_processor.py
  
  GeneMiner/src $ python pipeline/rnaseq_processor.py
- To run the front-end:
  
  GeneMiner/flask-dash-app $ python run.py
  
  GeneMiner/flask-dash-app $ gunicorn app:app -D (Run in background)

Demo

The demo is built by plotly/dash and made publicly available via flask, gunicorn and nginx. For more information about setting up the front end, refer to wikipage.

When applicable, the demo is running at site: http://www.oxphos.online.

First, the website allows table display and summary view of patient clinical reports of different types and cancer (clinical reports):

Then, the website showcases one application of the data. For example, we can compare the gene expression change between Stage II and Stage I of breast cancer, and identify the top 10 genes that changed most between stages (gene expression analysis).

Name		Name	Last commit message	Last commit date
Latest commit History 33 Commits
flask-dash-app		flask-dash-app
images		images
legacy		legacy
querys		querys
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
__init__.py		__init__.py
environment.yml		environment.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Gene Miner - an integrated platform for cancer data analysis

Table of Content

Overview

Pipeline

File lookup table

Xml files processor

rnaseq_files_processor

Downstream analysis example

Configuration

Dependencies

Run instructions

Demo

About

Releases

Packages

Languages

License

OXPHOS/GeneMiner

Folders and files

Latest commit

History

Repository files navigation

Gene Miner - an integrated platform for cancer data analysis

Table of Content

Overview

Pipeline

File lookup table

Xml files processor

rnaseq_files_processor

Downstream analysis example

Configuration

Dependencies

Run instructions

Demo

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages