BpForms: tools for modified DNA, RNA, and proteins

BpForms is a toolkit for unambiguously representing the primary sequences of biopolymer forms. By representing these sequences concretely, BpForms aims to facilitate precise discussion of DNA modification, post-transcriptional processing, and post-translational processing; facilitate the determination of the structures of biopolymer forms; facilitate the integration of data about these processes; and enable whole-cell models that represent these processes and the functions of modified DNA, RNA, and proteins.

BpForms includes a notation for describing biopolymer forms, as well as this website, a JSON REST API, a command line interface, and a Python API for calculating properties of biopolymer forms. These tools are available open-source under the MIT license.

BpForms notation

Overview

The BpForms notation represents biopolymers as FASTA sequences augmented with (a) multiple-letter alphabet-defined monomers delimited by curly brackets and (b) user-defined monomers described in square brackets by one or more attributes separated by "|". The structure, monomer-bond-atom, monomer-displaced-atom, left-bond-atom, left-displaced-atom, right-bond-atom, and right-displaced-atom attributes are required to calculate the chemical formula, molecular weight, and charge. All other attributes are optional.

BpForms has several pre-built alphabets.

Examples

  • [id: "dI" | name: "deoxyinosine"]ACGC: represents deoxyinosine at the first position
  • AC[id: "dI" | name: "deoxyinosine"]GC: represents deoxyinosine at the third position
  • AC{2mG}C[id: "dI" | name: "deoxyinosine"]: represents guanosine methylation at the third position and deoxyinosine at the last position
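
To make the position counting in these examples concrete, the following minimal sketch splits a form into one token per monomer. It is an illustration only, not the BpForms parser: the `tokenize` function and `TOKEN_PATTERN` are hypothetical names, and the sketch assumes that square-bracket monomers contain no nested square brackets (which a SMILES structure may).

```python
import re

# Hypothetical tokenizer for illustration; this is not the BpForms parser.
TOKEN_PATTERN = re.compile(
    r"\{(?P<alphabet>[^{}]+)\}"   # multiple-letter alphabet monomer, e.g. {2mG}
    r"|\[(?P<inline>[^\[\]]*)\]"  # user-defined monomer, e.g. [id: "dI"]
    r"|(?P<single>[A-Za-z])"      # single-letter monomer, e.g. A
)


def tokenize(form):
    """Split a form into (kind, text) pairs, one pair per monomer."""
    tokens = []
    pos = 0
    while pos < len(form):
        match = TOKEN_PATTERN.match(form, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}: {form[pos]!r}")
        kind = match.lastgroup  # name of the alternative that matched
        tokens.append((kind, match.group(kind).strip()))
        pos = match.end()
    return tokens


tokens = tokenize('AC{2mG}C[id: "dI" | name: "deoxyinosine"]')
for index, (kind, text) in enumerate(tokens, start=1):
    print(index, kind, text)
```

Each bracketed monomer counts as a single position, so in this form `{2mG}` is the third monomer and the user-defined monomer is the fifth.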

Structures of monomers

The structure attribute describes the chemical structure of the inline monomer. This attribute is a SMILES-encoded string. Each monomer can only have one structure. This attribute is required to calculate the structure of the BpForm.

Examples

[id: "dI" | structure: "O=C1NC=NC2=C1N=CN2" ]

Linkages between monomers

The monomer-bond-atom, monomer-displaced-atom, left-bond-atom, left-displaced-atom, right-bond-atom, and right-displaced-atom attributes describe the linkages between monomers and their backbone and between successive monomers. Each monomer can have multiple bonds and multiple displaced atoms. These attributes are required to calculate the structure of the BpForm.

Examples

[id: "dI" | structure: "O=C1NC=NC2=C1N=CN2" | monomer-bond-atom: Monomer / N / 10 / 0 | monomer-displaced-atom: Monomer / H / 10 / 0 ]
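
As a rough illustration of how a value such as `Monomer / N / 10 / 0` could be read, the sketch below splits it into four fields. The field meanings (molecule, element, atom position, charge) are an assumption inferred from the example above, not something this page states, and `AtomRef` and `parse_atom_ref` are hypothetical names.

```python
from typing import NamedTuple


class AtomRef(NamedTuple):
    # Field meanings are assumed from the example, not confirmed by the text.
    molecule: str
    element: str
    position: int
    charge: int


def parse_atom_ref(value):
    """Parse a slash-separated atom description into an AtomRef."""
    parts = [part.strip() for part in value.split("/")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 slash-separated fields, got {len(parts)}")
    molecule, element, position, charge = parts
    return AtomRef(molecule, element, int(position), int(charge))


print(parse_atom_ref("Monomer / N / 10 / 0"))
```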

Uncertainty about the primary sequence

BpForms can represent two types of uncertainty in the primary sequences of biopolymer forms.
  • The delta-mass and delta-charge attributes describe uncertainty in the chemical identity of the monomer.
  • The position attribute describes uncertainty in the position of the monomer within the sequence.

Examples

  • [id: "dAMP" | delta-mass: 1 | delta-charge: 1]: indicates the presence of an additional proton whose exact location is not known.
  • [id: "dI" | position: 2-3]: indicates that deoxyinosine may occur anywhere between the second and third positions.
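
The position attribute can be pictured as a range of candidate positions. The sketch below assumes that a value such as `2-3` is an inclusive, 1-based range and that a single number denotes one position; both are assumptions for illustration, and `parse_position` and `candidate_positions` are hypothetical helpers, not part of BpForms.

```python
def parse_position(value):
    """Parse "2-3" (or "2") into an inclusive, 1-based (start, end) pair."""
    start_text, sep, end_text = value.strip().partition("-")
    start = int(start_text)
    end = int(end_text) if sep else start
    if start < 1 or end < start:
        raise ValueError(f"invalid position range: {value!r}")
    return start, end


def candidate_positions(value):
    """List every position at which the monomer might occur."""
    start, end = parse_position(value)
    return list(range(start, end + 1))


print(candidate_positions("2-3"))
```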

Metadata about monomers

BpForms can represent several types of metadata about monomers.
  • The id and name attributes are human-readable labels for monomers. Only one id and one name are allowed per monomer.
  • The synonym attribute is an additional human-readable label. Monomers can have multiple synonyms.
  • The identifier attribute indicates entries in databases and ontologies which are equivalent to the monomer. Monomers can have multiple identifiers. The namespace and id of each identifier must be separated by a "/".
  • The comments attribute provides additional information about the monomer. Each monomer can have only one comments attribute.

Examples

  • [id: "dI" | name: "deoxyinosine"]
  • [id: "dI" | synonym: "deoxyinosine" | synonym: "2'-deoxyinosine"]
  • [id: "dI" | identifier: "chebi" / "CHEBI:28997" | identifier: "pubchem.compound" / "65058"]
  • [id: "dI" | comments: "A purine 2'-deoxyribonucleoside that is inosine ..."]
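
The cardinality rules above (one id, name, and comments; any number of synonyms and identifiers) can be sketched as a small parser for the text between the square brackets. This is a simplified illustration rather than the BpForms grammar: `parse_metadata` is a hypothetical function, it handles only the metadata attributes listed in this section, and it assumes that quoted values contain no double quotes.

```python
import re

SINGLE_VALUED = {"id", "name", "comments"}
MULTI_VALUED = {"synonym", "identifier"}

# Match one attribute at a time: "|" ends an attribute only outside quotes.
ATTRIBUTE_PATTERN = re.compile(r'(?:"[^"]*"|[^|"])+')


def parse_metadata(body):
    """Parse the inside of a square-bracket monomer into a dict."""
    metadata = {}
    for chunk in ATTRIBUTE_PATTERN.findall(body):
        key, sep, value = chunk.partition(":")  # split on the first ":" only
        key, value = key.strip(), value.strip()
        if not sep:
            raise ValueError(f"attribute without a value: {chunk!r}")
        if key == "identifier":
            namespace, slash, ident = value.partition("/")
            if not slash:
                raise ValueError('identifier needs a namespace and an id separated by "/"')
            value = (namespace.strip().strip('"'), ident.strip().strip('"'))
        else:
            value = value.strip('"')
        if key in MULTI_VALUED:
            metadata.setdefault(key, []).append(value)
        elif key in SINGLE_VALUED:
            if key in metadata:
                raise ValueError(f"only one {key} is allowed per monomer")
            metadata[key] = value
        else:
            raise ValueError(f"attribute not handled by this sketch: {key}")
    return metadata


print(parse_metadata('id: "dI" | identifier: "chebi" / "CHEBI:28997"'))
```

Splitting each attribute on its first ":" keeps values such as "CHEBI:28997" intact.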

Resources for reconstructing biopolymer forms

Resources for DNA forms

  • DNAMod: Database of non-canonical DNA nucleobases
  • MethDB: Database of non-canonical DNA
  • MethSMRT: Database of non-canonical DNA

Resources for RNA forms

  • MODOMICS: Database of non-canonical RNA nucleosides
  • RMBase: Database of modified RNA
  • RNA Modification Database: Database of modified RNA

Resources for protein forms

  • dbPTM: Database of non-canonical amino acids
  • Delta Mass: Database of modified amino acids
  • FindMod: Database of post-translational modifications
  • ProForma: Notation for protein forms. Note that this notation is ambiguous, which limits its ability to facilitate data integration and the calculation of properties of protein forms.
  • PDB Chemical Components: Database of modified amino acids
  • PDB in Europe Chemical Components: Database of modified amino acids
  • Protein Ontology: Database of modified proteins
  • PSIMOD: Ontology of non-canonical amino acids
  • RESID: Database of non-canonical protein residues
  • UniMod: Database of non-canonical amino acids
  • UniProt Controlled Vocabulary of Posttranslational Modifications: Database of modified amino acids

Integrating BpForms into pathway databases, models, and in silico designs

Pathway databases (BioPAX)

We are exploring a recommended best practice for encoding BpForms in BioPAX.

Example

Coming soon.

Models (SBML)

The annotation attribute of species can be used to describe species with BpForms.

Example

<species>
    <annotation>
        <bpforms:bpform xmlns:bpforms="https://www.bpforms.org/ns">
            <bpforms:alphabet>dna</bpforms:alphabet>
            <bpforms:sequence>A{m2C}GT</bpforms:sequence>
        </bpforms:bpform>
    </annotation>
</species>
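
A consumer of such a model could recover the alphabet and sequence from the annotation with a standard XML parser. The sketch below uses Python's built-in `xml.etree.ElementTree`; the fragment is reproduced with matching `bpform` opening and closing tags so that it parses, `read_bpform` is a hypothetical helper, and a real SBML file would additionally place `species` in the SBML namespace, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Namespace URI as it appears in the annotation example.
BPFORMS_NS = {"bpforms": "https://www.bpforms.org/ns"}

SBML_FRAGMENT = """
<species>
  <annotation>
    <bpforms:bpform xmlns:bpforms="https://www.bpforms.org/ns">
      <bpforms:alphabet>dna</bpforms:alphabet>
      <bpforms:sequence>A{m2C}GT</bpforms:sequence>
    </bpforms:bpform>
  </annotation>
</species>
"""


def read_bpform(species_xml):
    """Return (alphabet, sequence) from a species annotation, or None if absent."""
    species = ET.fromstring(species_xml.strip())
    bpform = species.find("annotation/bpforms:bpform", BPFORMS_NS)
    if bpform is None:
        return None
    alphabet = bpform.findtext("bpforms:alphabet", namespaces=BPFORMS_NS)
    sequence = bpform.findtext("bpforms:sequence", namespaces=BPFORMS_NS)
    return alphabet, sequence


print(read_bpform(SBML_FRAGMENT))
```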

In silico designs (SBOL)

These encoding URIs can be used to describe components in in silico designs:

  • DNA: http://edamontology.org/format_XXXX
  • RNA: http://edamontology.org/format_YYYY
  • Protein: http://edamontology.org/format_ZZZZ

Example

<sbol:Sequence>
    <sbol:encoding rdf:resource="http://edamontology.org/format_XXXX"/>
    <sbol:elements>A{m2C}GT</sbol:elements>
</sbol:Sequence>

BpForms interfaces

Python package

The BpForms Python package is available from PyPI.

Command line interface

The BpForms command line interface is available from PyPI.

JSON REST API

The BpForms JSON REST API is available at https://bpforms.org/api.

Source code

BpForms is available open-source from GitHub.

Documentation

Please see the BpForms documentation.

License

BpForms is released under the MIT license.

About BpForms

Citing BpForms

Lang PF, Chebaro Y & Karr JR. BpForms: a toolkit for concretely describing modified DNA, RNA and proteins. arXiv:1903.10042

Team

BpForms was developed by Jonathan Karr, Yassmine Chebaro, and Paul Lang in the Karr Lab at the Icahn School of Medicine at Mount Sinai in New York, USA.

Acknowledgements

BpForms was supported by a National Institutes of Health P41 award, a National Institutes of Health MIRA R35 award, and a National Science Foundation INSPIRE award.

Questions/comments

Please contact Jonathan Karr with any questions or comments.