Information

Can I simulate a previous and next sequence of my fasta files?


I have fasta sequences of a virus protein that had suffered a mutation in 2008 and this mutant had increase it fitness when compared to wild strain (according to literature).I want to simulate if my sequences were submitted in 2009(I have sequences from 2009 and 2010) or before to mutation and simulate next sequences to verify if the population of mutants really turns majority. How can I do that?My virus is a RNA -ss. Any tool to use that i can represent it graphically? Or simplifying… I want to simulate the past and the future sequences of my dataset.


From the description of the model you are interested in, you probably don't need numerical simulations. A simple analytical expectation might be enough. But anyway, if you want to perform numerical simulations (maybe you don't want to assume panmixia), then there are a number of choices to you.

The simulator that you will need needs to be able to implement selection. It hence need to be a forward-in-time simulator and not a coalescent simulator. I personally have experienced with 4 different forward-in-time simulators; SimBit, SLiM, SFS_code and Nemo.

If I am not mistaken, SFS_code won't be able to simulate haploid individuals and will probably be slower than SimBit and SLiM. In SimBit, SLiM and Nemo you cannot do haploid either, however, you can set the cloning rate to 100%, which will essentially be the same as simulating haploids.

To my experience and generally speaking, SLiM and SimBit are faster than Nemo. Because your mutation rate is high and because your recombination rate will be very low (will probably be zero), SimBit should be very very fast and SLiM relatively slow. Also, SimBit will offer much nicer outputs for the statistic you are interested in than SLiM does.

Please note that I am the author of SimBit and I therefore might have a conflict of interest here. Here is the link to SimBit.


Can I simulate a previous and next sequence of my fasta files? - Biology

NOTE -- THIS IS DEPRECATED AND BEING REWRITTEN IN C++

The python version is pretty buggy and unstable, so I've been tackling a complete rebuild. You can find the new version here.

(Only the latest release has binaries here at the moment)

After downloading, there are some tips on using it located further down

Quick note -- this program is still being actively developed. Although progress has slowed with the reopening of the world, I am going to keep trying to improve on it. It's not quite feature complete but it works pretty well for what I originally intended to make I will admit it's not very stable though. If you like it but find something is missing or broken, please leave a note in the issues tab.

Linnaeo is a python program I made mostly as an exercise in learning how to code a GUI. but also to solve a basic problem I had: nothing out there is very good for making, storing and viewing protein alignments.

Yes, out there in the world are countless amino acid sequences, and their sequences (and small differences) are connected to protein functions. but it's often hard to see that. This program is intended to help with this by meeting a few basic criteria:

  1. It should be fast.
  2. It should store any sequence you want. (but right now, just FASTA protein sequences, sorry)
  3. It should be easy to create, store, and export alignments.

That's a basic idea. Most likely those criteria will change.

Here is a screenshot as of Jun 14, 2020:

Here's another more recent screenshot from the Windows version, showing the "Conservation" theme and the tooltip feature for seeing different sequence residue numbers of the different aligned sequences:

An explanation of the themes I've designed (thus far) is available here.

New for v0.4.0 (devel branch only):

  • ClustalO now returns the alignment order, rather than as the input order. This required a custom Clustalo-Python extension (also here on my github), so it is not a part of master right now.
  • Added a mini-terminal for MacOS, since it runs windowed and without a terminal as a .app
  • greatly improved memory usages (so many dels!)
  • Changed over to an installer for Windows
  • added an icon and cleaned it up
  • Can now load multiple sequences and alignments at once.
  • Secondary structure search and display thanks to DSSP
  • Reference sequence selection for only highlighting residues that are the same.
  • Lots of other small improvements.

Previous additions for v0.1.6:

  • Import/export of protein sequences and alignments
  • Save/Load workspaces to keep your work
  • Combine sequences as needed to make new alignments using ClustalO
  • Sequences and alignments can be renamed and organized in subfolders as needed
  • Alignments wrap dynamically within the size of the window
  • Includes a ruler and a tooltip popup on click that shows the currently highlighted residue (and the corresponding numbers for the other sequences)
  • Preliminary residue theming, multiple themes available (more to come, subject to change)
  • Font and point size selection for the window
  • Can export the alignment window to PNG.

There is a lot I still want to do! Check out the "Issues" tab for stuff I know is broken. A quick list of features I want to add:

  • Currently only supports the Clustal Omega algorithm, and calls it even for just two sequences
  • I want to add better drag-and-drop functionality, such as dragging a new sequence onto an alignment window, deleting sequences from an alignment, or rerranging the order (the former 2 would be non-destructive)
  • I want to improve the color themes
  • I want to make a dark theme for the app itself
  • Residue annotations for marking comments and observations.

NOTE: The latest version (v0.4.0-pre) is based on the DEVEL branch, and uses a modified clustalo-python extension. This is so I can return the alignment order from clustalo (instead of just outputting the input order!) but makes it tougher to standardize. Please be aware that to run the devel branch you'll need to clone and build my clustalo-python repository (available here too) to get that to work!

I've attempted to build portable binaries for all Mac and Windows. Hopefully they work, but I'm still learning this part.

If you want to try building it yourself, or they don't work, here are some instructions:

If you're on linux it's a good bet you've done something like this before. Here's the basic process:

  • Create a new virtualenv
  • Clone the github into a folder and start up your virtualenv
  • Install everything required to run and fire it up.

Try my prebuilt ClustalO for python wheel, but if it doesn't work you'll have to go through the steps to compile ClustalO and the wrapper yourself (see the Mac section for tips).

Let's say you want to try and build it from source. maybe the latest devel version or something. Here's how. Install homebrew if you don't have it. it will make your life significantly easier. Then follow these steps:

Move to where you want the app to live and clone Linnaeo, and prep your env:

Install PyClustalo into it first. This is the trickiest one, and I hope it works for you.

Finally, install linnaeo, and run the program.

Note that if you want to use the DSSP feature, you'll also have to build and install a DSSP binary to your PATH. Unfortunately I don't think DSSP is included in Homebrew anymore. However, the binaries I uploaded should have a working binary and don't need anything externally.

Install Anaconda3 and create a new environment:

On Windows, I have a pre-compiled clustalo wheel file that seems to usually work. Please let me know if it doesn't.

Once opened you'll be greeted with the main window. Try opening up some previously-downloaded sequence files or other alignments you have made (only clustal format--.clustal or .aln--or fasta files though, sorry). You can also copy a fasta-formatted sequence and paste it directly into the program.

Sequences and alignments can be renamed or organized in folders. I recommend getting your sequences from UniProt as I've integrated it into the DSSP search function (it uses the sequence ID).

Renaming a sequence won't change the underlying ID, so feel free to call it whatever you like! You can access the ID by copying out a sequence or exporting it. Exporting an alignment uses the displayed names instead.

Highlight a few sequences (hold ctrl) and either double click or hit the align button to create a new alignment.

There is also an options panel that allows for choosing a font, theme, increase the font size, etc.

Finally, try holding the mouse button down on a residue within the alignment window--you'll see the residue ID (and equivalent ids for the aligned sequences) in a tooltip popup!

Enjoy, and hope it works for you, the intrepid early tester! I appreciate your feedback!

Repositories I am eternally grateful for -- they helped me get this onto windows -- and need to cite:


Genomes FAQ

What do I need to do if I have assignment information?

If you choose the single genome option, you will be prompted in the forms to provide information on which sequences belong to chromosomes, plasmids or organelles. If you choose the batch option, you should include this information in the FASTA headers. Additional requirements: https://www.ncbi.nlm.nih.gov/genbank/genomesubmit/#batch_assignment

What information should I be prepared to provide about gaps in my Genome submission?

You will need to confirm that you did not randomly merge the sequences into a single sequence and provide details on the approach you took. The default in the submission form is that 10 Ns in a row represent a gap and that “paired-ends” is the evidence that the sequences on either side of the gap are linked. If this is incorrect, then make the appropriate selections for your genome.


MetaPASSAGE – Automated Simulations and Analysis of Gene Families

:: DESCRIPTION

MetaPASSAGE, a software pipeline to generate in silico bacterial communities, simulates a sample of shotgun reads from a gene family represented in the community, orient or translate reads, and produce a profile-based alignment of the reads from which a gene-family phylogenetic tree can be built.

:: SCREENSHOTS

:: REQUIREMENTS

:: MORE INFORMATION


Comment

The Living DNA mtDNA file has limited use, since it contains only a particular selection of markers that they've chosen to predict your mtDNA haplogroup. I suppose you could use it to see which markers they tested. Their prediction may or may not be for your specific, final twig on the mtDNA Phylotree. In order to really find out that "twig" (if you are interested), you would have to take the FTDNA mtDNAFullSequence test, which tests all the 16,569 base pairs, and is currently on sale. Only you can know if that is something you want to do. With the Full Sequence, you can do a few things with it, including use the websites I mentioned in my last post, and others posted below in this post.

I will say that 23andMe predicts mtDNA haplogroups the same way as Living DNA (by a selection of markers), and they predicted my mtDNA haplogroup as K1a3. This was the same haplogroup as what National Geographic's Geno 2 predicted for me (again, testing only a selection of markers). When I did the FTDNA mtDNA Full Sequence, I thought I might get a bit further out on the mtDNA tree, but it turned out that I was still K1a3. I do have one unique mutation, for what it's worth. This is just my experience others have done the Full Sequence and their results did get them further out on their original branch.

Your other Living DNA information is what is of use in finding autosomal matches there. Perhaps one day they will provide mtDNA matches. You are correct about checking things out before testing, but consider it a lesson learned. Usually people test their DNA at one of the "big three" (FTDNA, 23andMe, or Ancestry), but now that there are several other companies, it complicates things. Always read the companies' privacy policies and terms of use before consenting to anything.

Besides using mtDNA Full Sequence results at the other websites I previously mentioned, another thing you can do with an FTDNA Full Sequence mtDNA Fasta file (the raw data file that irfu mentioned), is first to learn about Genbank, and then if you wish, upload your mtDNA Fasta file to Genbank to help science. You can read about how to do the upload at this page on Ian Logan's website: mtDNA Sequences at 'GenBank' (you can also read an interview with Ian Logan, although there is not a great amount of information about Genbank there). It is possible that your file might be helpful in adding another subclade to your haplogroup. Only the person whose mtDNA was tested can give permission to Genbank to use the file.

With your other mtDNA information:
mtDNA can contain mutations about medical conditions. If you wish, you can get a report from Ann Turner, an ISOGG member (interview here) and medical doctor who provides this service. She does not use the mtDNA Fasta file, but would need the .pdf of your mtDNA certificate from FTDNA. Note that FTDNA does not address medical issues for mtDNA. This ISOGG page, "Custom mitochondrial DNA reports", has a description of her service, and also contains a link to a copy of the type of report she provides.[/LIST]


Can I simulate a previous and next sequence of my fasta files? - Biology

I created this repo of biogrinder mainly for reading the README.md in a better formatted and easier way.

The source was originally downloaded from

A newer version is avialable now at

This repo is not for properly tracking development history of biogrinder, its code history is at

Please see the end of this document for how to clone it.

A versatile omics shotgun and amplicon sequencing read simulator

Grinder is a versatile program to create random shotgun and amplicon sequence libraries based on DNA, RNA or proteic reference sequences provided in a FASTA file.

Grinder can produce genomic, metagenomic, transcriptomic, metatranscriptomic, proteomic, metaproteomic shotgun and amplicon datasets from current sequencing technologies such as Sanger, 454, Illumina. These simulated datasets can be used to test the accuracy of bioinformatic tools under specific hypothesis, e.g. with or without sequencing errors, or with low or high community diversity. Grinder may also be used to help decide between alternative sequencing methods for a sequence-based project, e.g. should the library be paired-end or not, how many reads should be sequenced.

  • shotgun or amplicon read libraries
  • omics support to generate genomic, transcriptomic, proteomic, metagenomic, metatranscriptomic or metaproteomic datasets
  • arbitrary read length distribution and number of reads
  • simulation of PCR and sequencing errors (chimeras, point mutations, homopolymers)
  • support for paired-end (mate pair) datasets
  • specific rank-abundance settings or manually given abundance for each genome, gene or protein
  • creation of datasets with a given richness (alpha diversity)
  • independent datasets can share a variable number of genomes (beta diversity)
  • modeling of the bias created by varying genome lengths or gene copy number
  • profile mechanism to store preferred options
  • available to biologists or power users through multiple interfaces: GUI, CLI and API

Briefly, given a FASTA file containing reference sequence (genomes, genes, transcripts or proteins), Grinder performs the following steps:

Read the reference sequences, and for amplicon datasets, extracts full-length reference PCR amplicons using the provided degenerate PCR primers.

Determine the community structure based on the provided alpha diversity (number of reference sequences in the library), beta diversity (number of reference sequences in common between several independent libraries) and specified rank- abundance model.

Take shotgun reads from the reference sequences or amplicon reads from the full- length reference PCR amplicons. The reads may be paired-end reads when an insert size distribution is specified. The length of the reads depends on the provided read length distribution and their abundance depends on the relative abundance in the community structure. Genome length may also biases the number of reads to take for shotgun datasets at this step. Similarly, for amplicon datasets, the number of copies of the target gene in the reference genomes may bias the number of reads to take.

Alter reads by inserting sequencing errors (indels, substitutions and homopolymer errors) following a position-specific model to simulate reads created by current sequencing technologies (Sanger, 454, Illumina). Write the reads and their quality scores in FASTA, QUAL and FASTQ files.


File Formats Tutorial

This section explains some of the commonly used file formats in bioinformatics. The information provided here is basic and designed to help users to distinguish the difference between different formats. Please refer user manual or other information resources on web for more details.

FASTA

File extensions : file.fa, file.fasta, file.fsa

Fasta format is a simple way of representing nucleotide or amino acid sequences of nucleic acids and proteins. This is a very basic format with two minimum lines. First line referred as comment line starts with ‘>’ and gives basic information about sequence. There is no set format for comment line. Any other line that starts with ‘’ will be ignored. Lines with ‘’ are not a common feature of fasta files. After comment line, sequence of nucleic acid or protein is included in standard one letter code. Any tabulators, spaces, asterisks etc in sequence will be ignored.

FASTQ

File extensions : file.fastq, file.sanfastq, file.fq

Fastq format was developed by Sanger institute in order to group together sequence and its quality scores (Q: phred quality score). In fastq files each entry is associated with 4 lines.

  • Line 1 begins with a ‘ @ ‘ character and is a sequence identifier and an optional description.
  • Line 2 Sequence in standard one letter code.
  • Line 3 begins with a ‘ + ‘ character and is optionally followed by the same sequence identifier (and any additional description) again.
  • Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.

Extended description on the fastq format :

Line 1: @K00188:208:HFLNGBBXX:3:1101:1428:1508 2:N:0:CTTGTA

Line 2: ATAATAGGATCCCTTTTCCTGGAGCTGCCTTTAGGTAATGTAGTATCTNATNGACTGNCNCCANANGGCTAAAGT

Line 4: AAAFFJJJJJJJJJJJJJJJJJFJJFJJJJJFJJJJJJJJJJJJJJJJ#FJ#JJJJF#F#FJJ#F#JJJFJJJJJ

A quality score ( PHRED scale) for each base pair. It indicates how confident we can be that the base was sequenced and identified correctly.

where p is the probability that the corresponding base call is incorrect.

Phred quality score Probability that the base is called wrong Accuracy of the base call
10 1 in 10 90%
20 1 in 100 99%
30 1 in 1000 99.90%
40 1 in 10000 99.99%
50 1 in 100000 100.00%

Fastq-sanger holds PHRED score from 0-93 whereas fastq-Illumina provides PHRED scores from 0-62. Rather than giving numeric values of PHRED score they are provided in ASCII character codes from 33 to 126. Why 33 to 126? Because 33 to 126 codes for single characters, so the score can be represented by a single character. Refer the table below.

Based on the base character ( character that represents zero PHRED score ), PHRED scale is often referred as FHRED+33 (ASCII character !) or FHRED+64 (ASCII Character ?). The figure below illustrates the PHRED usage across different sequencing notations.

It is critical to figure out the PHRED score type used in .fastq files before processing them. Current Illumina .fastq files have +33 PHRED score. Please refer https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2847217/ for more details.

SAM (Sequence Alignment Map)

The SAM Format is a text format for storing sequence data in a series of tab delimited ASCII columns. Most often it is generated as a human readable version of its sister BAM format, which stores the same data in a compressed, indexed, binary form.

SAM format files are generated following mapping of the reads to reference sequence. It is TAB-delimited text format with header and a body. Header lines start with ‘@’ while alignment lines do not. Header hold generic information on SAM file along with version information, if the file is sorted, information on reference sequence, etc. The alignment records constitute the body of the file. Each alignment line/record has 11 mandatory fields describing essential alignment information.

Some terminology used in SAM manual:

Template : The DNA fragment that was measured

Reads : Depending on the methodology a template may produce one or more reads. These reads may cover the entire template or just a subsection of it. Reads originating from the same template typically cover different parts of the template, and, may represent the template itself or the reverse complement of it.

Segments : Each read may produce one or more alignments that in turn will have aligned regions called segments. From these segments it may be possible the infer the size of the original template.


source: http://samtools.github.io/hts-specs/SAMv1.pdf

Col. 1 QNAME:

Query NAME. Reads/segments having identical QNAME are regarded to come from the same template. A QNAME ‘*’ indicates the information is unavailable. In a SAM file, a read may occupy multiple alignment lines, when its alignment is chimeric .

Col. 2 FLAG:

Combination of bitwise flags.

BIT Description
1 0x1 template having multiple segments in sequencing
2 0x2 each segment properly aligned according to the aligner
4 0x4 segment unmapped
8 0x8 next segment in the template unmapped
16 0x10 SEQ being reverse complemented
32 0x20 SEQ of the next segment in the template being reverse complemented the first segment in the template
64 0x40 the first segment in the template
128 0x80 the last segment in the template
256 0x100 secondary alignment
512 0x200 not passing filters, such as platform/vendor quality controls
1024 0x400 PCR or optical duplicate
2048 0x800 supplementary alignment

Col. 3 RNAME:

Name of reference sequence. It generally refers to chromosome number.

Col. 4 POS:

Leftmost mapping position of the first matching base in read. It has 1-based indexing. If pos is set as 0, it represents a unmapped read. For a read pair READ1/1 and READ1/2 and single Read2

Col. 5 MAPQ:

It indicates MAPpping Quality. MAPQ= -10log10(Probability of mapping position being wrong). MAPQ=255 indicates mapping quality is unavailable.

Col. 6 CIGAR:

A string that describes alignment.

OP BAM Description
M 0 alignment match (can be a sequence match or mismatch)
I 1 insertion to the reference
D 2 deletion from the reference
N 3 skipped region from the reference
S 4 soft clipping (clipped sequences present in SEQ)
H 5 hard clipping (clipped sequences NOT present in SEQ)
P 6 padding (silent deletion from padded reference
= 7 sequence match
X 8 sequence mismatch

Difference between H and S is that if the mismatch sequence is reported as part of read sequence in alignment file it is a soft clipping. Often mismatch region matches somewhere else in reference sequence and in that case the mismatch region is removed from reported read sequence in alignment and is referred as Hard clipping.

CIGAR value of Read2 in POS example will be:

cccc does not match anywhere else in reference so soft clipping. , 5 ( GATAC) Matches, 2 (TA) Insertion, 4(GTAA) Matches, 1(*) Deletion, 3 (GAT) Matches, 2(..)Skipped region from reference(N), 4(GTCT) Matches

Col. 7 RNEXT , Col. 8 PNEXT:

RNEXT and PNEXT is to know the reference and position of a paired end read’s partner for visualisation tools. RNEXT is the name of the chromosome or contig to which the next template in a pair aligns. RNEXT of value ‘=‘ means align to same reference and ‘*’ represent no information available (single end sequencing). PNEXT where the other read of the pair aligns (Information unavailable =0, Otherwise POS value of pair).

Read1/1 and Read1/2 are pair and Read 3 is unpaired. So the RNEX and PNEXT values will be

Col. 9 TLEN : Observed Template LENgth

It represents the length of reference that is covered by pair end reads. The distance between leftmost mapped base to rightmost mapped base in paired reads. For unpaired reads it is 0.

Col. 10 SEQ: Sequence of the read or Segment.

Col. 11 QUAL: PHRED score values of read. If ‘*’ no values are stored.

A BAM (Binary Alignment/Map) file is the compressed binary version of the Sequence Alignment/Map (SAM), a compact and indexable representation of nucleotide sequence alignments. The data between SAM and BAM is exactly same. Being Binary BAM files are small in size and ideal to store alignment files. Require samtools to view the file.

VCF (Variant Calling Format/File)

VCF is a text file format with a header (information VCF version, sample etc) and data lines constitute the body of file.

This contains meta-information and is included after ‘##’ string. It is recommended to include INFO, FILTER and FORMAT entries for a better explanation of the data field.

Other information like alternate allele, assembly field, Contig field, sample field, pedigree field can also be included.

Data lines have 8 mandatory columns.

#CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO.

VCF format has very well explained manual available at https://samtools.github.io/hts-specs/VCFv4.2.pdf .

GFF (General Feature Format or Gene Finding Format)

File extensions : file.gff2, file. gff3, file.gff

It has first 8 fields like GFF2 but differs in field 9 in assigning attributes. 2 are highlighted here.
(a) GFF3 has better nesting feature. Links features to parent tag

(b) The most general way of representing a protein-coding gene is the so-called “three-level gene.” The top level is a feature of type “gene” which bundles up the gene’s transcripts and regulatory elements. Beneath this level are one or more transcripts of type “mRNA”. This level can also accommodate promoters and other cis-regulatory elements. At the third level are the components of the mRNA transcripts, most commonly CDS coding segments and UTRs. This example shows how to represent a gene named “EDEN” which has three alternatively-spliced mRNA transcripts:

Source: http://gmod.org/wiki/GFF3
GFF (General Feature Format or Gene Finding Format). GFF can be used for any kind of feature (Transcripts, exon, intron, promoter, 3’ UTR, repeatitive elements etc) associated with the sequence, whereas GTF is primarily for genes/transcripts. GFF3 is the latest version and an improvement over GFF2 format. However, many databases are still not equipped to handle GFF3 version. The differences will be explained later in text.

The GFF format has 9 mandatory columns and they are TAB separated. The 9 columns are as follows.

Col. 1 Reference Sequence:

This is the ID of reference sequence used to establish coordinate system for annotation. Usually chromosome name or number.

Col. 2 Source:

This explains how the feature annotation is derived. The source is a free text qualifier intended to describe the algorithm or operating procedure that generated this feature. Typically this is the name of a piece of software, such as “Genescan” or a database name, such as “Genbank.” In effect, the source is used to extend the feature ontology by adding a qualifier to the type creating a new composite type that is a subclass of the type in the type column. It is not necessary to specify a source. If there is no source, put a “.” (a period) in this field.

Col. 3 Feature:

The feature type name, like “gene” or “exon”. In a well-structured GFF file, all the children(exons, introns etc) features always follow their parents(Transcript) feature line. This way they are part of a single block

Col. 4 Start:

Genomic Start of the feature.

Genomic Start of the feature

Col. 6 Score:

Numeric value that generally indicates the confidence of the source on the annotated feature. A value of “.” (a dot) is used to define a null value. The semantics of the score are ill-defined. In GFF3 format,It is strongly recommended that E-values be used for sequence similarity features, and that P-values be used for ab initio gene prediction features. If there is no score, put a “.” (a period) in this field.

Col. 7 Strand:

Field that indicates the sense strand of the feature. “+’ :Watson strand and ‘-‘: crick strand. ‘?’ can be used for features whose strandedness is relevant, but unknown.

Col. 8 Frame (GFF2 and GTF) or Phase (GFF3):

For features of type “CDS”, the phase indicates where the feature begins with reference to the reading frame. The phase is one of the integers 0, 1, or 2, indicating the number of bases that should be removed from the beginning of this feature to reach the first base of the next codon. In other words, a phase of 𔄘” indicates that the next codon begins at the first base of the region described by the current line, a phase of 𔄙” indicates that the next codon begins at the second base of this region, and a phase of 𔄚” indicates that the codon begins at the third base of this region. This is NOT to be confused with the frame, which is simply start modulo 3. If there is no phase, put a “.” (a period) in this field.

For forward strand features, phase is counted from the start field. For reverse strand features, phase is counted from the end field.

The phase is required for all CDS features.

Explained let say ### and *** represent consecutive exons.

CTG C is first base (0), T is second base (1), G is third base(2)

Col. 9 Attribute or Group field:

All lines with the same group are linked together into a single item. The group field is a challenge. It is used in several distinct ways:

  • to group together a single sequence feature that spans a discontinuous range, such as a gapped alignment.
  • to name a feature, allowing it to be retrieved by name.
  • to add one or more notes to the annotation.
  • to add an alternative name

One of GFF2’s problems is that it is only able to represent one level of nesting of features. This is mainly a problem when dealing with genes that have multiple alternatively-spliced transcripts. GFF2 is unable to deal with the three-level hierarchy of gene transcript exon. Most people get around this by declaring a series of transcripts and giving them similar names to indicate that they come from the same gene.

The second limitation is that while GFF2 allows you to create two-level hierarchies, such as transcript exon, it doesn’t have any concept of the direction of the hierarchy. So it doesn’t know whether the exon is a subfeature of the transcript, or vice-versa. This means you have to use “aggregators” to sort out the relationships. This is a major pain in the neck. For this reason, GFF2 format has been deprecated in favor of GFF3 format databases.

GTF (Gene Transfer format)

GTF has the same format as GFF files. It has the same 9 fields that describe the gene/ transcript related features. The group/attribute field has been expanded into a list of attributes. Each attribute consists of a type/value pair. Attributes must end in a semi-colon, and be separated from any following attribute by exactly one space. The attribute list must begin with the two mandatory attributes:

gene_id value: A globally unique identifier for the genomic source of the sequence.

transcript_id value: A globally unique identifier for the predicted transcript.


7. Correct the assembly¶

Sequences from PacBio can have more errors than those from Illumina. Therefore, although it is useful to use the long PacBio reads to assemble the genome, we can also use the shorter and more accurate Illumina reads to correct errors in the PacBio assembly.

Make an alignment file¶

  • Aligns Illumina R1.fq and R2.fq to the PacBio assembly genome.fasta .
  • This produces a .bam file
  • | pipes the output to samtools to sort (required for downstream processing)
  • > pilon_aln . bam redirects the sorted bam to this file

Run Pilon¶

  • --genome is the name of the input assembly to be corrected
  • --frags is the alignment of the reads against the assembly
  • --output is the name of the output prefix
  • --fix is an option for types of corrections
  • --mindepth gives a minimum read depth to use
  • --changes produces an output file of the changes made
  • --verbose prints information to the screen during the run
  • --threads : number of cores

  • We can see lots of cases where a deletion (represented by a dot) has been corrected to a base.
  • Type q to exit.

We now have the corrected genome assembly of Staphylococcus aureus in .fasta format, containing a chromosome and a small plasmid.

Questions¶

Why don&rsquot we correct earlier in the assembly process?

Answer (click to reveal) We need to circularise the contigs and trim overhangs first.

Why can we use some reads (Illumina) to correct other reads (PacBio) ?

Answer (click to reveal) Illumina reads have higher accuracy.

Could we just use PacBio reads to assemble the genome?

Answer (click to reveal) Yes, if accuracy adequate.


Sequences and Alphabets¶

The alphabet object is perhaps the important thing that makes the Seq object more than just a string. The currently available alphabets for Biopython are defined in the Bio.Alphabet module. We’ll use the IUPAC alphabets here to deal with some of our favorite objects: DNA, RNA and Proteins.

Bio.Alphabet.IUPAC provides basic definitions for proteins, DNA and RNA, but additionally provides the ability to extend and customize the basic definitions. For instance, for proteins, there is a basic IUPACProtein class, but there is an additional ExtendedIUPACProtein class providing for the additional elements “U” (or “Sec” for selenocysteine) and “O” (or “Pyl” for pyrrolysine), plus the ambiguous symbols “B” (or “Asx” for asparagine or aspartic acid), “Z” (or “Glx” for glutamine or glutamic acid), “J” (or “Xle” for leucine isoleucine) and “X” (or “Xxx” for an unknown amino acid). For DNA you’ve got choices of IUPACUnambiguousDNA, which provides for just the basic letters, IUPACAmbiguousDNA (which provides for ambiguity letters for every possible situation) and ExtendedIUPACDNA, which allows letters for modified bases. Similarly, RNA can be represented by IUPACAmbiguousRNA or IUPACUnambiguousRNA.

The advantages of having an alphabet class are two fold. First, this gives an idea of the type of information the Seq object contains. Secondly, this provides a means of constraining the information, as a means of type checking.

Now that we know what we are dealing with, let’s look at how to utilize this class to do interesting work. You can create an ambiguous sequence with the default generic alphabet like this:

However, where possible you should specify the alphabet explicitly when creating your sequence objects - in this case an unambiguous DNA alphabet object:


Reproducibility with Scripts

It is highly unlikely that an analysis of this type is performed only once. More often than not, we’ll wish to adjust or replace the input files, compare to different protein sets, or adjust the parameters for the programs ran. Thus it make sense to capture the analysis we just performed as an executable script, perhaps called runhmmer.sh .

Note in the above that we’ve broken the long hmmsearch line into two by ending it midway with a backslash and continuing it on the next line. The backslash lets bash know that more of the command is to be specified on later lines. (The backslash should be the last character on the line, with no spaces or tabs following.) After making this script executable with chmod , we could then rerun the analysis by navigating to this directory and running ./runhmmer.sh .

What if we wanted to change the input file, say, to argonase-1s.fasta instead of p450s.fasta ? We could create a new project directory to work in, copy this script there, and then change all instances of p450s.fasta in the script to argonase-1s.fasta .

Alternatively, we could use the power of environment variables to architect our script in such a way that this process is easier.

Now the file names of interest are specified only once, near the top of the script, and from then on the script uses its own identifiers (as environment variables) to refer to them. Reusing this script would be as simple as changing the file names specified in three lines.

We can go a step further. It turns out that shell scripts can take parameters from the command line. The first parameter given to a script on the command line will be automatically stored in a variable accessible to the script called $1 , the second parameter will be stored in $2 , and so on. We can thus further generalize our script:

We could have replaced all instances of $query with $1 , but this organization makes our script easier to read in the future, an important consideration when programming. Now we can run a full analysis by specifying the three relevant file names on the command line, as in: ./runhmmer.sh p450s.fasta dmel-all-translation-r6.02.fasta p450s_hmmsearch_dmel.txt .

This runhmmer.sh is a good candidate for inclusion in our $HOME/local/bin so that we can run it from anywhere, though we may want to add lines immediately following the #! line, to provide some help text for anyone who attempts to run the script without providing the correct inputs:

The “if block” above will only execute if the number of parameters given ( $# ) is not equal to 3. Although languages like Python provide much nicer facilities for this sort of logic-based execution, the ability to conditionally provide usage information for scripts is important. As usual for bash , the interpreter ignores lines that start with # .


Watch the video: Introducing the Biopython SeqIO Module: Reading FASTA files (November 2021).