Fix copying of yaml file when multiple runs are initialized by batch command.
Fix augment to use safe_divide for the SHM calculation and report NaN; otherwise the program crashes on heavily truncated IMGT alleles.
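A minimal sketch of the idea behind this fix. The real safe_divide signature in IgDiscover may differ, and shm_percent is a hypothetical name used only for illustration:

```python
import math


def safe_divide(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    if denominator == 0:
        return math.nan
    return numerator / denominator


def shm_percent(mismatches, aligned_length):
    """Hypothetical SHM rate in percent; NaN if nothing could be aligned."""
    return 100 * safe_divide(mismatches, aligned_length)
```

With a heavily truncated allele the aligned length can be zero, in which case the sketch reports NaN instead of raising ZeroDivisionError.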
Fix corecount expected.tsv file format to be the same as in postfilter rule.
Rename _count column produced by corecount to _gene_count to clarify that this is the number of assignments from IgBLAST.
Color warnings in corecount module, so they stand out.
Update documentation on manual installation and add requirements file for pip.
Added sra module for downloading libraries from SRA.
Added corecount module which automatically trims database query sequences up to maximum requested truncation or just before alleles are non-unique, then searches assignments for exact matches.
Added triender module, which builds a prefix tree of a query V sequence and uses a breadth-first search (BFS) algorithm to conditionally descend this tree in an attempt to find valid end variants.
Added batch module which allows analysis of multiple libraries that share the same yaml configuration and database.
Added separate germline filter for low-expressed genes, which require more liberal filtering.
Added post filter for high-expressed genes which at low barcodes_exact frequency are often false positives.
Added option to compute allelic groups between known database sequences based on edit distance, single-linkage hierarchical clustering and cutting the tree at a requested height. The option prefixes database names with the inferred allelic group and is useful for databases that do not separate gene names with a star symbol in their identifiers. IgDiscover relies on gene annotation in various modules and filtering steps.
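The grouping can be sketched as follows. This is an illustration under stated assumptions, not IgDiscover's implementation: allelic_groups and edit_distance are hypothetical helpers. It relies on the fact that cutting a single-linkage tree at a given height yields the connected components of the graph that links all sequence pairs whose distance is at most that height.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def allelic_groups(sequences, height):
    """Group sequences (a dict of name -> nucleotides) by single linkage."""
    parent = {name: name for name in sequences}

    def find(name):
        # Follow parent links to the representative, shortening the path
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(sequences, 2):
        if edit_distance(sequences[a], sequences[b]) <= height:
            parent[find(a)] = find(b)

    groups = {}
    for name in sequences:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())
```

Because linkage is single, two sequences end up in the same group as soon as a chain of close neighbours connects them, even if they are themselves further apart than the cut height.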
Added aux file option to igdiscover.yaml to use IgBLAST generated CDR3s. This is helpful for TCRs and light chains or a new species in which currently built-in CDR3 detection may not work.
Added columns to candidates.tsv: Js_maxfreq, Ds_maxfreq, CDR3_len_maxfreq, CDR3_seq_maxfreq. The latter two can now be adjusted as germlinefilters.
Added an optional file which if present in database folder can be used to trim too long V genes during discovery to their known maximum length.
Added option for pooled Vs when haplotyping Js in haplotype module as well as independent v and j errors. Make expressed and heterozygous ratios gene group specific.
Added option for phred score based filtering using expected number of errors as threshold from vsearch.
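The expected number of errors of a read is the sum of the per-base error probabilities implied by its Phred scores, that is, the sum of 10^(-Q/10) over all bases. A minimal sketch, assuming Phred+33 encoded quality strings; the function names are hypothetical and the filtering itself is delegated to vsearch in the pipeline:

```python
def expected_errors(quality_string, phred_offset=33):
    """Sum of per-base error probabilities for an ASCII-encoded quality string."""
    return sum(
        10 ** (-(ord(char) - phred_offset) / 10)
        for char in quality_string
    )


def passes_filter(quality_string, max_expected_errors):
    """Keep a read only if its expected error count is within the threshold."""
    return expected_errors(quality_string) <= max_expected_errors
```

For instance, four bases of quality 10 (character "+") each contribute an error probability of 0.1, so the read has about 0.4 expected errors.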
Added option for reverse complementing of reads that occurs before barcode extraction.
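Reverse complementing itself is a small operation; a sketch with a hypothetical helper name, shown only to make explicit what happens to a read before barcode extraction:

```python
# Translation table mapping each nucleotide to its complement
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Return the reverse complement of a nucleotide sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]
```

Applying the function twice returns the original sequence.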
Added option for indexing on either the 5’ or 3’ end. The index can be extracted before the UMI barcode, and the most frequent index is used to select sequences for downstream analysis. This can be used to reduce potential spillover between libraries that are sequenced simultaneously.
Added option in upstream module for splitting variants in leader according to set threshold and counting coexpressed anchoring Js.
Change the algorithm used for describing how the discovered V gene differs from the germline gene (the database_changes column). This gives more sensible descriptions when the V gene is truncated at one end.
Faster startup time (mostly noticeable when using
Ensure candidates get a unique name even if the hashes (
#108 Print a sensible error message when the GUI cannot be started.
Fix a crash (KeyError) during “igdiscover augment” when region info for a database sequence could not be obtained.
IgDiscover now uses AIRR-formatted files: See the AIRR rearrangement schema
IgBLAST is run with the appropriate parameters to produce AIRR-compliant files
filtered.tab.gz contain this IgBLAST output plus extra columns that IgDiscover needs (the AIRR schema allows extra columns)
filtered.tab.gz are now called
.tsv extension is required by the AIRR specification)
One downside is that, because there are more columns than before, the “assigned” and “filtered” files are larger than before.
The upside is that these files can be used with other tools that accept AIRR-compliant files.
Old “assigned” and “filtered” files can still be read by most IgDiscover commands. Output will always use new column names.
The VDJ_nt column was removed to reduce file size somewhat. It is now recomputed when necessary from the appropriate offsets.
Update to IgBLAST 1.17
The discoverj command was renamed to discoverjd to reflect that it also supports D gene discovery.
The why_filtered column would show a generic is_duplicate reason for filters that compare candidates to each other. Now each filter criterion can be distinguished.
The somewhat vague “too similar sequence” germline filter criterion incorrectly removed some candidates that have a mutation close to the 3’ end. This was replaced with a simpler filter that only ensures that there are no two candidates with the same sequence.
Use IgBLAST 1.10
Get rid of some unnecessary dependencies by no longer requiring the unmaintained sqt library. Installation with Conda is now faster and requires half the disk space.
Add a full_exact column to
The IgBLAST cache is now disabled by default. We assume that, in most cases, datasets will not be re-run with the exact same parameters, and then it only fills up the disk. Delete your cache with rm -r ~/.cache/igdiscover to reclaim the space. To enable the cache, create a file ~/.config/igdiscover.conf with the contents
If you choose to enable the cache, results from the PEAR merging step will now also be cached. See also the caching documentation.
Added detection of chimeras to the (pre-)germline filters. Any novel allele that can be explained as a chimera of two unmodified reference alleles is marked in the new_V_germline.tab file. This is a bit sensitive, so the candidate is currently not discarded.
Two additional files
annotated_V_pregermline.tab are created in each iteration during the germline filtering step. These are identical to the candidates.tab file, except that they contain a why_filtered column that describes why a sequence was filtered. See the documentation for this feature.
A more realistic test dataset (v0.5), now based on human instead of rhesus data, was prepared. The testing instructions have been updated accordingly.
J discovery has been tuned to give fewer truncated sequences.
Statistics are written to
V SHM distribution plots are created automatically and written to v-shm-distributions.pdf in each iteration folder.
An igdiscover dbdiff subcommand was added that can compare two FASTA files.
When computing a consensus sequence, allow some sequences to be truncated in the 3’ end. Many of the discovered novel V alleles were truncated by one nucleotide in the 3’ end because IgBLAST does not always extend the alignment to the end of the V sequence. If these slightly too short V sequences were in the majority, their consensus would lead to a truncated sequence as well. The new consensus algorithm allows for this effect at the 3’ end and can therefore more often than previously find the full sequence. Example:
TACTGTGCGAGAGA (seq 1)
TACTGTGCGAGAGA (seq 2)
TACTGTGCGAGAG- (seq 3)
TACTGTGCGAG--- (seq 4)
TACTGTGCGAG--- (seq 5)
TACTGTGCGAGAG (previous consensus)
TACTGTGCGAGAGA (new consensus)
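The effect in this example can be reproduced with a simple per-column majority vote. This is a sketch under the assumption that the sequences are already aligned and gap-padded to equal length; it is not necessarily how IgDiscover computes its consensus, and the ignore_trailing_gaps parameter is hypothetical:

```python
from collections import Counter


def consensus(aligned, ignore_trailing_gaps=True):
    """Majority-vote consensus of equal-length, gap-padded sequences."""
    result = []
    for column in range(len(aligned[0])):
        votes = Counter()
        for seq in aligned:
            char = seq[column]
            # A gap is part of a 3' truncation if only gaps follow it
            trailing = char == "-" and set(seq[column:]) == {"-"}
            if trailing and ignore_trailing_gaps:
                continue  # truncated sequences do not vote in this column
            votes[char] += 1
        if votes:
            result.append(votes.most_common(1)[0][0])
    # Columns won by a gap are dropped from the final sequence
    return "".join(result).replace("-", "")
```

When trailing gaps vote like ordinary characters, the last column of the example is won by the gap and the consensus loses its final nucleotide; when they abstain, the two full-length sequences decide that column.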
Add a column
new_V_germline.tab file that describes how the novel sequence differs from the database sequence. Example:
Allow filtering by CDR3_shared_ratio and do so by default (needs documentation)
Cache the edit distance when computing the distance matrix. Speeds up the
discover: Use more than six CPU cores if available
igblast: Print progress every minute
Implemented allele ratio filtering for J gene discovery
J genes are discovered as part of the pipeline (previously, one needed to run the
In each iteration, dendrograms are now created not only for V genes, but also for D and J genes. The file names are
The V dendrograms are now in
V_dendrogram.pdf). This puts all the dendrograms together when looking at the files in the iteration directory.
V_usage.pdf files are no longer created. Instead,
expressed_V.pdf are created. These contain similar information, but an allele-ratio filter is used to filter out artifacts.
parse subcommand (functionality is in the
New CDR3 detection method (only heavy chain sequences): CDR3 start/end coordinates are pre-computed using the database V and J sequences. Increases detection rate to 99% (previously less than 90%).
Remove the ability to check discovered genes for required motifs. This has never worked well.
Add a column
candidates.tab that tries to count how many clonotypes are associated with a single candidate (using only exact occurrences). This is intended to replace the
exact_ratio to the germline filtering options. This checks the ratio between the exact V occurrence counts (exact column) between alleles.
Germline filtering option allele_ratio was renamed to
Implement a cache for IgBLAST results. When the same dataset is re-analyzed, possibly with different parameters, the cached results are used instead of re-running IgBLAST, which saves a lot of time. If the V/D/J database or the IgBLAST version has changed, results are not re-used.
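One way to obtain this invalidation behavior is to derive the cache key from everything that influences the IgBLAST result. The sketch below is an assumption about the general technique, not a description of IgDiscover's cache; cache_key and ResultCache are hypothetical names, and an in-memory dictionary stands in for on-disk storage:

```python
import hashlib


def cache_key(reads, database, igblast_version):
    """Hypothetical key that changes when reads, database or version change."""
    digest = hashlib.sha256()
    for part in (igblast_version, database, reads):
        data = part.encode()
        # The length prefix keeps ("ab", "c") distinct from ("a", "bc")
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


class ResultCache:
    """Minimal in-memory stand-in for a result cache."""

    def __init__(self):
        self._results = {}

    def get_or_compute(self, key, compute):
        """Return the cached result for key, calling compute() only on a miss."""
        if key not in self._results:
            self._results[key] = compute()
        return self._results[key]
```

Settings that only affect later pipeline steps are deliberately left out of the key, so re-analysis with different downstream parameters still hits the cache.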
Add a barcodes_exact column to the candidates table. It gives the number of unique barcode sequences that were used by the sequences in the set of exact sequences. Also, add a configuration setting barcode_consensus that can turn off consensus taking of barcode groups, which needs to be set to
Add a Ds_exact column to the candidates table.
The pre-processing filtering step no longer reads in the full table of IgBLAST assignments, but filters the table piece by piece. Memory usage for this step therefore does not depend anymore on the dataset size and should always be below 1 GB.
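The piecewise approach can be sketched with the standard csv module, which reads one row at a time so that memory use stays constant regardless of table size. The function name and the column used in the usage note are hypothetical:

```python
import csv


def filter_table(source, sink, keep):
    """Copy tab-separated rows from source to sink, one at a time.

    Only rows for which keep(row) is true are written. Returns the number
    of rows kept.
    """
    reader = csv.DictReader(source, delimiter="\t")
    writer = csv.DictWriter(
        sink, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    kept = 0
    for row in reader:  # only the current row is held in memory
        if keep(row):
            writer.writerow(row)
            kept += 1
    return kept
```

A call such as filter_table(infile, outfile, lambda row: float(row["V_covered"]) >= 90) would keep only rows above a coverage threshold, assuming the table has such a column.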
The functionality of the parse subcommand has been integrated into the igblast subcommand. This means that igdiscover igblast now directly outputs a result table (assigned.tab). This makes it easier to use that subcommand directly instead of only via the workflow.
The igblast subcommand now always runs makeblastdb by itself and deletes the BLAST database afterwards. This reduces clutter and ensures the database is always up to date.
library_name configuration setting. Instead, the library_name is now always the same as the name of the analysis directory.
Add an “allele ratio” criterion to the germline filter to further reduce the number of false positives. The filter is activated by default and can be configured through the allele_ratio setting in the configuration file. See the documentation for how it works.
Ignore the CDR3-encoding bases whenever comparing two V gene sequences.
Avoid finding 5’-truncated V genes by extending found hits towards the 5’ end.
By default, candidate sequences are no longer merged if they are nearly identical. That is, the differences setting within the two germline filter configuration sections is now set to zero by default. Previously, we believed the merging would remove some false positives, but it turns out we also miss true positives. It also seems that with the other changes in this version we also no longer get the particular false positives the setting was supposed to catch.
Implement an experimental discoverj script for J gene discovery. It is currently not run automatically as part of igdiscover run. See igdiscover discoverj --help for how to run it manually.
config subcommand, which can be used to change the configuration file from the command-line.
Add a V_CDR3_start column to the
filtered.tab tables. It describes where the CDR3 starts within the V sequence.
Similarly, add a CDR3_start column to the new_V_germline.tab file describing where the CDR3 starts within a discovered V sequence. It is computed by using the most common CDR3 start of the sequences within the cluster.
The init subcommand automatically fixes certain problems in the input database (duplicate sequences, empty records, duplicate sequence names). Previously, it would complain, but the user would have to fix the problems themselves.
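The three kinds of repair can be sketched as below. How IgDiscover actually resolves each problem (for example, how duplicate names are renamed) is not stated here, so the numeric suffix scheme and the function name clean_database are assumptions made for illustration:

```python
def clean_database(records):
    """Clean an iterable of (name, sequence) records.

    Drops empty records and repeated sequences, and makes repeated names
    unique by appending a numeric suffix (an assumed naming scheme).
    """
    seen_sequences = set()
    used_names = set()
    cleaned = []
    for name, sequence in records:
        sequence = sequence.upper()
        if not sequence or sequence in seen_sequences:
            continue  # empty record or duplicate sequence
        seen_sequences.add(sequence)
        unique_name, suffix = name, 1
        while unique_name in used_names:
            suffix += 1
            unique_name = f"{name}_{suffix}"
        used_names.add(unique_name)
        cleaned.append((unique_name, sequence))
    return cleaned
```

In this sketch the first occurrence of a sequence or name always wins, so the result does not depend on anything but the input order.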
Move source code to GitHub
Set up automatic code testing (continuous integration) via Travis
Many documentation improvements
The FASTA files of the input V/D/J gene lists now need to be named
J.fasta. The species name is no longer part of the file name. This should reduce confusion when working with species not supported by IgBLAST.
The species: configuration setting in the configuration can (and should) now be left empty. Its only use was that it is passed to IgBLAST, but since IgDiscover provides IgBLAST with its own V/D/J sequences anyway, it does not seem to make a difference.
A “cross-mapping” detection has been added, which should reduce the number of false positives. See the documentation for an explanation.
Novel sequences identical to a database sequence no longer get the
No longer trim the initial G run in sequences (due to RACE) by default. It is now a configuration setting.
cdr3_location configuration setting: It determines whether to use a CDR3 in addition to the barcode for grouping sequences.
groups.tab.gz file by default (describing the de-barcoded groups)
The pre-processing filter is now configurable. See the preprocessing_filter section in the configuration file.
Many improvements to the documentation
Extended and fixed unit tests. These are now run via a CI system.
Statistics in JSON format are written to
IgBLAST 1.5.0 output can now be parsed. Parsing is also faster by 25%.
More helpful warning message when no sequences were discovered in an iteration.
Drop support for Python 3.3.
V sequences of the input database are now whitelisted by default. The meaning of the whitelist configuration option has changed: If set to false, those sequences are no longer whitelisted. To whitelist additional sequences, create a whitelist.fasta file as before.
Sequences with stop codons are now filtered out by default.
Use more stringent germline filtering parameters by default.
It is now possible to install and run IgDiscover on OS X. Appropriate Conda packages are available on bioconda.
candidates.tab, which indicates whether the candidate sequence contains a stop codon.
Add a configuration option that makes it possible to disable the 5’ motif check by setting
looks_like_V column is ignored in this case).
Make it possible to whitelist known sequences: If a found gene candidate appears in that list, the sequence is included in the list of discovered sequences even when it would otherwise not pass filtering criteria. To enable this, just add a whitelist.fasta file to the project directory before starting the analysis.
The criteria for germline filter and pre-germline filter are now configurable: See
pre_germline_filter sections in the configuration file.
Different runs of IgDiscover with the same parameters on the same input files will now give the same results. See the seed parameter in the configuration, which also explains how to get non-reproducible results as before.
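The general technique behind such a setting is to route all random choices through a generator initialized with a fixed seed. A minimal sketch with a hypothetical subsample function; which steps of IgDiscover draw random numbers, and how the seed setting is interpreted, is described in the configuration file rather than here:

```python
import random


def subsample(items, size, seed=1):
    """Draw a random subsample that is reproducible for a given seed.

    Passing seed=None seeds the generator from system entropy, so repeated
    calls are then not guaranteed to return the same result.
    """
    rng = random.Random(seed)  # private generator; global state is untouched
    return rng.sample(items, size)
```

Using a private random.Random instance instead of the module-level functions keeps the result independent of any other code that consumes random numbers.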
Both the germline and pre-germline filter are now applied in each iteration. Instead of the new_V_database.fasta file, two files named
The compose subcommand now outputs a filtered version of the candidates.tab file in addition to a FASTA file. The table contains columns closest_whitelist, which is the name of the closest whitelist sequence, and whitelist_diff, which is the number of differences to that whitelist sequence.
Optionally, sequences are not renamed in the assigned.tab file, but retain their original name as in the FASTA or FASTQ file. Set rename: false in the configuration file to get this behavior.
Started an “advanced” section in the manual.
IgDiscover can now also detect kappa and lambda light chain V genes (VK, VL)