Test data set

After installing IgDiscover, you should run it once on a small test data that we provide, both to test your installation and to familiarize yourself with running the program.

  1. Download und unpack the test data set (version 0.5). To do this from the command-line, use these commands:

    wget https://bitbucket.org/igdiscover/testdata/downloads/igdiscover-testdata-0.5.tar.gz
    tar xvf igdiscover-testdata-0.5.tar.gz
    

The test data set contains some paired-end reads from human IgM heavy chain dataset ERR1760498 and a database of IGHV, IGHD, IGHJ sequences based on Ensembl annotations. You should use a database of higher quality for your own experiments.

  1. Initialize the IgDiscover pipeline directory:

    igdiscover init --db igdiscover-testdata/database/ --reads igdiscover-testdata/reads.1.fastq.gz discovertest
    

    The name discovertest is the name of the pipeline directory that will be created. Note that only the path to the first reads file needs to be given. The second file is found automatically. There may be a couple of messages “Skipping ‘x’ because it contains the same sequence as ‘y’”, which you can ignore.

    The command will have printed a message telling you that the pipeline directory has been initialized, that you should edit the configuration file, and how to actually run IgDiscover after that.

  2. The generated igdiscover.yaml configuration file does not actually need to be edited for the test dataset, but you may still want to have a read through it as you will need to do so for you own data. You may want to do this while the pipeline is running in the next step. The configuration is in YAML format. When editing the file, just follow the way it is already structured.

  3. Run the analysis. To do so, change into the pipeline directory and run this command:

    cd discovertest && igdiscover run
    

    On this small dataset, running the pipeline should take not more than about 5 minutes.

  4. Finally, inspect the results in the discovertest/iteration-01 or discovertest/final directories. The discovered V genes and extra information are listed in discovertest/iteration-01/new_V_germline.tsv. Discovered J genes are in discovertest/iteration-01/new_J.tsv. There are also corresponding .fasta files with the sequences only.

    See the explanation of final result files.

Test data set 2

An example analysis of MTPX (multiplex) library with barcode on 3’ end.

Paired-end Illumna raw runs can be downloaded from ENA:

wget http://ftp.sra.ebi.ac.uk/vol1/run/ERR300/ERR3004229/A007_IgM_MTPX_S1_L001_R1_001.fastq.gz
wget http://ftp.sra.ebi.ac.uk/vol1/run/ERR300/ERR3004229/A007_IgM_MTPX_S1_L001_R2_001.fastq.gz

Next you need to initialize IgDiscover:

igdiscover init --database your_database/ --reads1 A007_IgM_MTPX_S1_L001_R1_001.fastq.gz your_run_name
cd your_run_name

Edit igdiscover.yaml file to change barcode length on 3’ end:

barcode_length_3prime: 21

IgDiscover uses two optional input files which should be placed in the database folder before starting the analysis.

  • Tabular separated file containing correct gene lengths (V.tsv)

gene

length

IGHV1-18

296

IGHV1-2

296

Lengths are adjusted after consensus sequence is computed.

  • Tabular separated expected frequency table (expected.tsv)

gene

frequency

IGHV1-8

0.002

IGHV3-9

0.002

Where frequency is the lower limit for barcodes_exact frequencies, used in the post filtering step to remove alleles which are too low expressed.

Finally, run the analysis:

igdiscover run

Other test data sets

ENA project PRJEB15295 contains the data for our Nature Communications paper from 2016, in particular ERR1760498, which is the data for the human “H1” sample (multiplex PCR, IgM heavy chain).

Data used for testing TCR detection (human, RACE): SRR2905677 and SRR2905710.