Questions and Answers¶

How many sequences are needed to discover germline V gene sequences?¶

Library sizes of several hundred thousand sequences are required for V gene discovery, with even higher numbers necessary for full database production. For example, IgM library sizes of 750,000 to 1,000,000 sequences for heavy chain databases and 1.5 to 2 million sequences for light chain databases.

Can IgDiscover analyze IgG libraries?¶

IgDiscover has been developed to identify germline databases from libraries that contain substantial fractions of unswitched antibody sequences. We recommend IgM libraries for heavy chain V gene identification and IgKappa and IgLambda libraries for light chain identification. IgDiscover can identify a proportion of gemline sequences in IgG libraries but the process is much more efficient with IgM libraries, enabling the full set of germline sequences to be discovered.

Can IgDiscover analyze a previously sequenced library?¶

Yes, IgDiscover accepts both unpaired FASTQ files and paired FASTA files but the program should be made aware which is being used, see input requirements.

Do the positions of the PCR primers make a difference to the output?¶

Yes. For accurate V gene discovery, all primer sequences must be external to the V gene sequences. For example, forward multiplex amplification primers should be present in the leader sequence or 5’ UTR, and reverse amplification primers should be located in the constant region, preferably close to the 5’ border of the constant region. Primers that are present in framework 1 region or J segments are not recommended for library production.

What are the advantages to 5’-RACE compared to multiplex PCR for IgDiscover analysis?¶

Both 5’-RACE and multiplex PCR have their own advantages.

5’-RACE will enable library production from species where the upstream V gene sequence is unknown. The output of the upstream subcommand in IgDiscovery enables the identification of consensus leader and 5’-UTR sequences for each of the identified germline V genes, that can subsequenctly be used for primer design for either multiplex PCR or for monoclonal antibody amplification sets.

Multiplex PCR is recommended for species where the upstream sequences are well characterized. Multiplex amplification products are shorter than 5’-RACE products and therefore will be easier to pair and will have less length associated sequence errors.

What is meant by ‘starting database’?¶

The starting database refers to the folder that contains the three FASTA files necessary for the process of iterative V gene discovery to begin. IgDiscover uses the standalone IgBLAST program for comparative assignment of sequences to the starting database. Because IgBlast requires three files (for example V.fasta, D.fasta, J.fasta), three FASTA files should be included in the database folder for each analysis to proceed.

In the case of light chains (that do not contain D segments), a dummy D segment file should be included as IgBLAST will not proceed if it does not see three files in the database folder. It is sufficient to save the following sequence as a fasta file and rename it D.fasta, for example, for it to function as the dummy D.fasta file for human light chain analysis:

>D_ummy
GGGGGGGGGG

How can I use the IMGT database as a starting database?¶

Since we do not have permission to distribute IMGT database files with IgDiscover, you need to download them directly from IMGT. See the section about obtaining a V/D/J database.

How do I change the parameters of the program?¶

By editing the configuration file.

Where do I find the individualized database produced by IgDiscover?¶

The final germline database in FASTA format is in your analysis directory in the subdirectory final/database/. The V.fasta file contains the new list of V genes. The D.fasta and J.fasta files are unchanged from the starting database.

A phylogenetic tree of the V sequences can be found in final/dendrogram_V.pdf.

For more details of how that database was created, you need to inspect the files created in the last iteration of the discovery process, located in iteration-xx, where xx is the number of iterations configured in the igdiscover.yaml configuration file. For example, if three iterations were used, look into iteration-03/.

Most interesting in that folder are likely

the linkage cluster analysis plots in iteration-03/clusterplots/,
the error histograms in iteration-03/errorhistograms.pdf, which contain the windowed cluster analysis figures.
Details about the individualized database in new_V_germline.tsv in tab-separated-value format

The new_V_germline.fasta file is identical to the one in final/database/V.fasta

What does the _S1234 at the end of same gene names mean?¶

Please see the Section on gene names.