strandex

strand-anchored regex for uniform sampling from FASTQ files (think spandex)

Why use this?

You want only a few reads from a large FASTQ file (downsampling)
You are constrained by I/O so that reading through the entire file is very slow
You want to avoid sampling only the beginning or end of the file
You want to expand a small FASTQ file to a specific number of reads (upsampling)
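The idea behind the name can be sketched roughly as follows. This is an illustration under stated assumptions, not strandex's actual code: the helper `sample_records`, its `chunk_size` parameter, and the exact regex are hypothetical. The sketch seeks to evenly spaced byte offsets, then uses a regex anchored on the `+` line to find the next complete record, so only a small chunk is read per sample.

```python
import os
import re

# A FASTQ record is four lines: '@' header, sequence, '+' separator, quality.
# Requiring the '+' line two lines after the header avoids mistaking a quality
# line that happens to start with '@' for a header.
RECORD = re.compile(rb"^@[^\n]*\n[A-Za-z]+\n\+[^\n]*\n[^\n]+\n", re.MULTILINE)


def sample_records(path, nreads, chunk_size=4096):
    """Return up to nreads records found at evenly spaced byte offsets.

    Hypothetical helper for illustration; assumes ASCII text and records
    shorter than chunk_size.
    """
    size = os.path.getsize(path)
    records = []
    with open(path, "rb") as handle:
        for i in range(nreads):
            offset = i * size // nreads
            handle.seek(offset)
            if offset:
                # We probably landed mid-line, so discard the partial line.
                handle.readline()
            match = RECORD.search(handle.read(chunk_size))
            if match:  # no match if we landed inside the last record
                records.append(match.group(0).decode("ascii"))
    return records
```

In this sketch, asking for more reads than the file holds makes several offsets fall on the same record, which is one way duplicate reads can arise when upsampling.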

Caveats

For paired-end sampling, reads in both files must be in the same order and have the same length
When the number of reads requested is close to the total available, some reads may be sampled more than once (sampling with replacement)
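Before sampling paired-end data, it may help to confirm the first caveat holds. The sketch below is not part of strandex; `read_records` and `pairs_in_sync` are hypothetical helpers. It assumes "same length" means that mates have equal sequence lengths, and that mate names differ only by an optional `/1` or `/2` suffix. Note that it reads both files in full, so it is a one-time check rather than something to run before every sample.

```python
from itertools import zip_longest


def read_records(path):
    """Yield (name, sequence) for each 4-line record in a FASTQ file."""
    with open(path) as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                return
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line
            handle.readline()  # quality line
            # Drop the leading '@', any description, and a /1 or /2 mate suffix.
            name = header[1:].partition(" ")[0]
            if name.endswith(("/1", "/2")):
                name = name[:-2]
            yield name, sequence


def pairs_in_sync(fastq1, fastq2):
    """Return True if both files hold the same reads, in order, with equal lengths."""
    for rec1, rec2 in zip_longest(read_records(fastq1), read_records(fastq2)):
        if rec1 is None or rec2 is None:
            return False  # one file has more records than the other
        if rec1[0] != rec2[0] or len(rec1[1]) != len(rec2[1]):
            return False
    return True
```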

Install

pip install strandex

Examples

from strandex import FastqSampler

# Paired-end: sample 100,000 read pairs from two files in the same order
sampler = FastqSampler('read1.fastq', fastq2='read2.fastq', nreads=100000, seed=42)
for read1, read2 in sampler:
    # read1 and read2 are 4-line strings sampled from the paired input
    pass

# Single-end: the sampler still yields pairs, but read2 is None
sampler = FastqSampler('read1.fastq', nreads=100000, seed=42)
for read1, read2 in sampler:
    # read1 is a 4-line string sampled from the input
    # read2 is None
    pass

Note that you may request more reads than your input file contains. In that case strandex samples the file with replacement, so the output will contain some duplicate reads.
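To see how many duplicates a given sample contains, a small helper along these lines could be used. It is a sketch, not part of strandex: `duplicate_reads` is a hypothetical name, and it relies only on the behaviour shown above, namely that iterating the sampler yields `(read1, read2)` pairs where `read1` is a 4-line string.

```python
from collections import Counter


def duplicate_reads(pairs):
    """Map each read1 header seen more than once to the number of times it was sampled."""
    # The first line of a 4-line FASTQ record is its '@' header.
    counts = Counter(read1.splitlines()[0] for read1, _ in pairs)
    return {header: n for header, n in counts.items() if n > 1}


# Intended usage, assuming strandex is installed:
#   from strandex import FastqSampler
#   dupes = duplicate_reads(FastqSampler('read1.fastq', nreads=100000, seed=42))
```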

CLI script

usage: strandex [-h] [-fq2 FASTQ2] [-o2 OUT2] [-n NREADS] [-s SEED] [-t TRIM] fastq1 out

sample uniformly without reading an entire fastq file

positional arguments:
  fastq1                input fastq file
  out                   output fastq file

optional arguments:
  -h, --help            show this help message and exit
  -fq2 FASTQ2, --fastq2 FASTQ2
                        input fastq file read pairs
  -o2 OUT2, --out2 OUT2
                        output fastq file read pairs
  -n NREADS, --nreads NREADS
                        number of reads to sample from input (default: 1)
  -s SEED, --seed SEED  seed for random number generator (default: None)
  -t TRIM, --trim TRIM  trim reads to length -t (default: None)
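The options above can also be driven from Python. The wrapper below is a sketch: `strandex_command` and `run_strandex` are hypothetical helpers, not part of the package. They use only the flags listed in the help text and assume the `strandex` script is on your `PATH`.

```python
import subprocess


def strandex_command(fastq1, out, fastq2=None, out2=None, nreads=1, seed=None, trim=None):
    """Build the argument list for the strandex CLI from the documented options."""
    if (fastq2 is None) != (out2 is None):
        raise ValueError("fastq2 and out2 must be given together")
    command = ["strandex", "-n", str(nreads)]
    if seed is not None:
        command += ["-s", str(seed)]
    if trim is not None:
        command += ["-t", str(trim)]
    if fastq2 is not None:
        command += ["-fq2", fastq2, "-o2", out2]
    # Positional arguments come last: input file, then output file.
    return command + [fastq1, out]


def run_strandex(*args, **kwargs):
    """Run the CLI and raise CalledProcessError if it exits with an error."""
    subprocess.run(strandex_command(*args, **kwargs), check=True)
```

For example, `run_strandex('read1.fastq', 'sample1.fastq', nreads=100000, seed=42)` would correspond to `strandex -n 100000 -s 42 read1.fastq sample1.fastq`.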
