Haploid Variant Calling

SeqCenter offers variant calling for haploid genomes using the Breseq software package. Breseq predicts mutations relative to an assembled reference, which the customer must provide in advance. Reference files may be in fasta format (nucleotide sequence only) or in annotated GenBank / RefSeq format; annotated files are preferred because they return the most informative results. Reference files are often retrieved from online repositories, such as the NCBI website. Samples must share at least 99.5% sequence identity with the provided reference. (For bacterial genomes, 0.5% divergence can correspond to over 10,000 predicted mutations.)

The variant call output is provided as an annotated HTML file in which mutations are categorized as single nucleotide polymorphisms (SNPs), point insertions and deletions (indels), and new junctions (for internal rearrangements). The algorithm cannot detect larger insertions of outside material that is absent from the reference, and it can struggle with larger structural rearrangements, including duplications. A machine-readable VCF file is also provided.

We do not offer variant calling for diploid or higher-ploidy genomes as a standard service at this time, but we are happy to discuss options such as copy number variation (CNV) analysis as a custom offering. Please contact us for more information.

Requirements:

  • Data must be sequenced through a DNA pipeline.
  • An assembly that is less than 0.5% divergent from the sample is required. Accepted formats include:
    • Files in fasta or GenBank format (.gb, .gbk, .gbff). Other formats are not accepted.
    • Assemblies uploaded to NCBI, referenced by accession number. Please verify that the correct version is specified (GenBank vs RefSeq) and that it is referenced by the correct accession number (GCA vs GCF).
    • Assemblies of other samples from the same order. The assembly quality will affect the variant calling, so Nanopore Combo samples are recommended.
  • Breseq assumes that samples are homogeneous and haploid.
    • If a sample is expected to be polymorphic, such as a mixed population derived from a single strain, please indicate this when placing the order so that the parameters can be adjusted.
    • Please contact us about variant calling for non-haploid samples.
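
As a rough illustration of the reference requirements above, the following Python sketch classifies a reference as an NCBI assembly accession, a GenBank file, or a fasta file. It is only a pre-submission sanity check under stated assumptions, not part of SeqCenter's pipeline: the function name `classify_reference` is hypothetical, the fasta extensions are a guess at common usage (the list above only says "fasta"), and the accession pattern assumes the usual NCBI form of GCA_ or GCF_, nine digits, and a version suffix.

```python
import re
from pathlib import Path

# Extensions named in the requirements list.
GENBANK_EXTENSIONS = {".gb", ".gbk", ".gbff"}
# Assumption: common fasta extensions; the requirements only say "fasta".
FASTA_EXTENSIONS = {".fasta", ".fa", ".fna"}

# Assumed accession shape: GCA_ (GenBank) or GCF_ (RefSeq), nine digits, ".", version.
ACCESSION_PATTERN = re.compile(r"^(GCA|GCF)_(\d{9})\.(\d+)$")


def classify_reference(reference: str) -> str:
    """Describe a reference given as an accession or a file name.

    Raises ValueError when the input matches neither form.
    """
    match = ACCESSION_PATTERN.match(reference)
    if match:
        source = "GenBank" if match.group(1) == "GCA" else "RefSeq"
        return f"{source} assembly accession, version {match.group(3)}"

    suffix = Path(reference).suffix.lower()
    if suffix in GENBANK_EXTENSIONS:
        return "GenBank annotated file"
    if suffix in FASTA_EXTENSIONS:
        return "fasta sequence file"
    raise ValueError(f"unrecognized reference: {reference!r}")


if __name__ == "__main__":
    # "ancestor.gbk" is a made-up file name used only for illustration.
    for candidate in ("GCF_000005845.2", "GCA_000005845.2", "ancestor.gbk"):
        print(candidate, "->", classify_reference(candidate))
```

Requiring the version suffix is a deliberate choice in this sketch, since the GenBank and RefSeq copies of an assembly, and successive versions of either, can differ.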

Notes on the Reference

Choosing a reference can be daunting. Ideally, the reference assembly will be very closely related to the sample, so that the list of predicted mutations is manageable. To give an idea of scale, a bacterial sample that shares 99.5% sequence identity with its reference can produce over 10,000 predicted mutations.
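
To put the 99.5% figure in perspective, the short Python sketch below estimates how many positions would differ for a given genome length. It is back-of-the-envelope arithmetic only: it assumes every divergent position is an independent single-base difference, which will not match Breseq's actual predictions exactly, and the genome lengths are illustrative values rather than data from any particular organism.

```python
def expected_differences(genome_length_bp: int, percent_identity: float) -> int:
    """Approximate number of differing positions at a given percent identity."""
    divergence = (100.0 - percent_identity) / 100.0
    return round(genome_length_bp * divergence)


if __name__ == "__main__":
    # Illustrative genome lengths in base pairs (assumed, not from the text).
    for length in (2_000_000, 5_000_000):
        count = expected_differences(length, 99.5)
        print(f"{length:>9,} bp at 99.5% identity: ~{count:,} differences")
```

Under these assumptions, a 2 Mb genome at 99.5% identity already reaches 10,000 differences, and larger genomes scale proportionally.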

When comparing a derived organism to its ancestor, it may seem obvious to choose the ancestor as the reference. This is straightforward if an assembly exists for the ancestor. If it does not, there are a few options.

  1. If the ancestor is also closely related to a published reference genome: mutations between the derived and ancestral organisms can be inferred by performing WGS and variant calling on both organisms against the same reference. Any mutation predicted identically for the ancestor can then be removed from the list for the derived organism.
  2. If the ancestor is not closely related to published genomes and is a simple microbe: SeqCenter offers services intended to produce closed, reference-quality de novo assemblies for haploid samples, namely Nanopore Combos and PacBio WGS with Assembly.
  3. If the ancestor is not closely related to published genomes and is not a simple microbe: WGS options are limited. SeqCenter does offer collapsed assembly services for more complex organisms, but these types of genomes tend to require iterative sequencing and more fine-tuned assembly methods to produce closed reference genomes.
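
The subtraction described in option 1 can be sketched in Python using the VCF output. This is a minimal illustration, not SeqCenter's method: it assumes both VCF files were called against the same reference, it treats two mutations as identical only when chromosome, position, reference allele, and alternate allele all match exactly, and the function names and example records are made up for the demonstration.

```python
from typing import Iterable, List, Set, Tuple

Variant = Tuple[str, int, str, str]  # (CHROM, POS, REF, ALT)


def read_variants(vcf_lines: Iterable[str]) -> Set[Variant]:
    """Collect (CHROM, POS, REF, ALT) keys from the lines of a VCF file."""
    variants: Set[Variant] = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):  # one key per alternate allele
            variants.add((chrom, int(pos), ref, allele))
    return variants


def derived_only(derived: Set[Variant], ancestor: Set[Variant]) -> List[Variant]:
    """Return variants seen in the derived sample but not in the ancestor."""
    return sorted(derived - ancestor)


if __name__ == "__main__":
    # Made-up records for illustration; real input would be the VCF files.
    header = [
        "##fileformat=VCFv4.1",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    ancestor_vcf = header + [
        "chr1\t1200\t.\tA\tG\t.\tPASS\t.",
        "chr1\t5300\t.\tCT\tC\t.\tPASS\t.",
    ]
    derived_vcf = ancestor_vcf + ["chr1\t8800\t.\tG\tT\t.\tPASS\t."]

    for variant in derived_only(read_variants(derived_vcf), read_variants(ancestor_vcf)):
        print(variant)
```

Because `read_variants` accepts any iterable of lines, an open file handle can be passed in place of the in-memory lists. Exact matching is conservative: the same underlying event reported at a slightly different position or with a different allele representation would not be subtracted, so the remaining list may still need manual review.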

We do not provide assembly-to-assembly comparisons or multiple sequence alignments, but many commonly used GUI-based tools can perform these, such as Galaxy, Geneious, and packages on KBase.

Additional Resources:
