BpForms: tools for modified DNA, RNA, and proteins

BpForms is a toolkit for unambiguously representing the primary sequences of biopolymer forms. By representing primary sequences concretely, BpForms aims to (a) facilitate precise discussion about DNA modification, post-transcriptional processing, and post-translational processing; (b) facilitate the determination of the structures of biopolymer forms; (c) facilitate the integration of data about these processes; and (d) enable whole-cell models that represent these processes and the functions of modified DNA, RNA, and proteins.

BpForms includes a notation for describing biopolymer forms, as well as this website, a JSON REST API, a command line interface, and a Python API for calculating properties of biopolymer forms. These tools are available open-source under the MIT license.

BpForms notation

Overview

The BpForms notation represents biopolymers as FASTA sequences augmented with (a) multiple-letter alphabet-defined monomers delimited by curly brackets and (b) user-defined monomers described in square brackets by one or more attributes separated by "|". The structure attribute is required to calculate the chemical formula, molecular weight, and charge. All other attributes are optional.

BpForms has several pre-built alphabets.

Examples

  • [id: "dI" | name: "deoxyinosine"]ACGC: represents deoxyinosine at the first position
  • AC[id: "dI" | name: "deoxyinosine"]GC: represents deoxyinosine at the third position
  • AC{m2G}C[id: "dI" | name: "deoxyinosine"]: represents a methylated guanosine (m2G) at the third position and deoxyinosine at the last position
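
The three kinds of monomer tokens described above can be told apart with a small scanner. The following is a minimal sketch, not the BpForms parser: the names TOKEN_PATTERN and tokenize are hypothetical, and it assumes that square and curly brackets never appear inside quoted attribute values.

```python
import re

# Hypothetical tokenizer for illustration only (not the BpForms grammar).
# Assumption: "]" and "}" never occur inside quoted attribute values.
TOKEN_PATTERN = re.compile(
    r"\[(?P<inline>[^\]]*)\]"      # user-defined monomer in square brackets
    r"|\{(?P<alphabet>[^}]*)\}"    # multiple-letter alphabet monomer in curly brackets
    r"|(?P<letter>[A-Za-z])"       # single-letter alphabet monomer
)


def tokenize(form):
    """Split a form into (kind, text) tuples, one per monomer, in sequence order."""
    tokens = []
    pos = 0
    while pos < len(form):
        if form[pos].isspace():
            pos += 1  # ignore whitespace between monomers
            continue
        match = TOKEN_PATTERN.match(form, pos)
        if match is None:
            raise ValueError(f"unexpected character {form[pos]!r} at offset {pos}")
        kind = match.lastgroup  # name of the alternative that matched
        tokens.append((kind, match.group(kind).strip()))
        pos = match.end()
    return tokens


tokens = tokenize('AC{m2G}C[id: "dI" | name: "deoxyinosine"]')
for position, (kind, text) in enumerate(tokens, start=1):
    print(position, kind, text)
```

Counting tokens this way gives each monomer's position: in the last example, {m2G} is the third monomer and the user-defined deoxyinosine is the fifth.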

The structures of monomers

The structure attribute describes the chemical structure of the inline monomer. This attribute is an InChI-encoded string. Each monomer can only have one structure.

Examples

[id: "dI" | structure: InChI=1S
    /C10H12N4O4
    /c15-2-6-5(16)1-7(18-6)14-4-13-8-9(14)11-3-12-10(8)17
    /h3-7,15-16H,1-2H2,(H,11,12,17)
    /t5-,6+,7+
    /m0
    /s1
]
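
To illustrate why the structure attribute is enough to derive a formula and a molecular weight, the sketch below reads the formula layer (the second "/"-separated field) of the InChI string above and sums rounded standard atomic weights. It is an illustration under stated assumptions, not how BpForms itself computes properties: the helper names are hypothetical, only a single uncharged component is handled, and the result describes the isolated monomer rather than a monomer bonded into a polymer.

```python
import re

# Rounded standard atomic weights for a few common elements (assumed values).
ATOMIC_WEIGHTS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999,
                  "P": 30.974, "S": 32.06}


def inchi_formula(inchi):
    """Return the formula layer of an InChI string, e.g. 'C10H12N4O4'."""
    compact = "".join(inchi.split())  # drop whitespace used to wrap the string
    return compact.split("/")[1]


def parse_formula(formula):
    """Convert a simple formula such as 'C10H12N4O4' into element counts."""
    counts = {}
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + int(count or 1)
    return counts


def molecular_weight(counts):
    """Sum atomic weights over the element counts."""
    return sum(ATOMIC_WEIGHTS[element] * n for element, n in counts.items())


inchi = """InChI=1S
    /C10H12N4O4
    /c15-2-6-5(16)1-7(18-6)14-4-13-8-9(14)11-3-12-10(8)17
    /h3-7,15-16H,1-2H2,(H,11,12,17)
    /t5-,6+,7+
    /m0
    /s1"""
counts = parse_formula(inchi_formula(inchi))
print(counts, round(molecular_weight(counts), 2))
```

For the deoxyinosine example, the formula layer is C10H12N4O4, which corresponds to a molecular weight of about 252.23.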

Uncertainty about the primary sequence

BpForms can represent two types of uncertainty in the primary sequences of biopolymer forms.
  • The delta-mass and delta-charge attributes describe uncertainty in the chemical identity of the monomer.
  • The position attribute describes uncertainty in the position of the monomer within the sequence.

Examples

  • [id: "dAMP" | delta-mass: 1 | delta-charge: 1]: indicates the presence of an additional proton whose exact location is not known.
  • [id: "dI" | position: 2-3]: indicates that deoxyinosine may occur anywhere between the second and third position.
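
A position range such as 2-3 can be expanded into the individual positions it allows. The sketch below assumes that positions are 1-based, that both bounds are given, and that the range is inclusive. These are assumptions made for illustration, and the function names are hypothetical rather than part of BpForms.

```python
def parse_position(value):
    """Parse a position value such as '2-3' into an inclusive (start, end) pair.

    Assumptions: 1-based positions, both bounds present, start <= end.
    """
    start_text, sep, end_text = value.partition("-")
    if not sep:
        raise ValueError(f"expected a range like '2-3', got {value!r}")
    start, end = int(start_text), int(end_text)
    if start < 1 or end < start:
        raise ValueError(f"invalid position range {value!r}")
    return start, end


def candidate_positions(value):
    """List every position at which the monomer may occur."""
    start, end = parse_position(value)
    return list(range(start, end + 1))


print(candidate_positions("2-3"))  # [2, 3]
```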

Metadata about monomers

BpForms can represent several types of metadata about monomers.
  • The id and name attributes are human-readable labels for monomers. Each monomer can have only one id and one name.
  • The synonym attribute is an additional human-readable label. Monomers can have multiple synonyms.
  • The identifier attribute indicates entries in databases and ontologies which are equivalent to the monomer. Monomers can have multiple identifiers. The namespace and id of each identifier must be separated by a "/".
  • The comments attribute describes additional information about the monomer. Each monomer can have only one comments attribute.

Examples

  • [id: "dI" | name: "deoxyinosine"]
  • [id: "dI" | synonym: "deoxyinosine" | synonym: "2'-deoxyinosine"]
  • [id: "dI" | identifier: "chebi" / "CHEBI:28997" | identifier: "pubchem.compound" / "65058"]
  • [id: "dI" | comments: "A purine 2'-deoxyribonucleoside that is inosine ..."]
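
The cardinality rules above (one id, name, and comments per monomer; any number of synonyms and identifiers) can be sketched as a small attribute parser. This is a hypothetical illustration, not the BpForms implementation: it covers only the metadata attributes listed here and assumes that quoted values contain no escaped quotes.

```python
import re

SINGLE_VALUED = {"id", "name", "comments"}   # at most one per monomer
MULTI_VALUED = {"synonym", "identifier"}     # may be repeated
IDENTIFIER_PATTERN = re.compile(r'"([^"]*)"\s*/\s*"([^"]*)"')


def split_outside_quotes(text, sep="|"):
    """Split text on sep, ignoring separators inside double quotes."""
    parts, current, in_quotes = [], [], False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes
        if ch == sep and not in_quotes:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def parse_metadata(body):
    """Parse the text between the square brackets into a dict of attributes."""
    result = {"synonym": [], "identifier": []}
    for part in split_outside_quotes(body):
        key, _, value = part.partition(":")  # split at the first colon only
        key, value = key.strip(), value.strip()
        if key == "identifier":
            match = IDENTIFIER_PATTERN.fullmatch(value)
            if match is None:
                raise ValueError(f"malformed identifier: {value!r}")
            result["identifier"].append(match.groups())  # (namespace, id)
        elif key in MULTI_VALUED:
            result[key].append(value.strip('"'))
        elif key in SINGLE_VALUED:
            if key in result:
                raise ValueError(f"only one {key} is allowed per monomer")
            result[key] = value.strip('"')
        else:
            raise ValueError(f"unsupported attribute: {key!r}")
    return result


print(parse_metadata(
    'id: "dI" | identifier: "chebi" / "CHEBI:28997" '
    '| identifier: "pubchem.compound" / "65058"'
))
```

Splitting each attribute at its first colon keeps identifiers such as "CHEBI:28997" intact, and tracking quotes keeps a "|" inside a quoted value from being treated as a separator.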

Resources for reconstructing biopolymer forms

Resources for DNA forms

  • DNAMod: Database of non-canonical DNA nucleobases
  • MethDB: Database of non-canonical DNA
  • MethSMRT: Database of non-canonical DNA

Resources for RNA forms

  • MODOMICS: Database of non-canonical RNA nucleosides
  • RMBase: Database of modified RNA
  • RNA Modification Database: Database of modified RNA

Resources for protein forms

  • dbPTM: Database of non-canonical amino acids
  • ProForma: Notation for protein forms. Note that this notation is ambiguous, which limits its ability to facilitate data integration and the calculation of properties of protein forms.
  • Protein Ontology: Database of modified proteins
  • PSIMOD: Ontology of non-canonical amino acids
  • RESID: Database of non-canonical amino acids
  • UniMod: Database of non-canonical amino acids

BpForms interfaces

Python package

The BpForms Python package is available from PyPI.

Command line interface

The BpForms command line interface is available from PyPI.

JSON REST API

The BpForms JSON REST API is available at https://bpforms.org/api.

Source code

BpForms is available open-source from GitHub.

Documentation

Please see the BpForms documentation.

License

BpForms is released under the MIT license.

About BpForms

Team

BpForms was developed by Jonathan Karr, Yassmine Chebaro, and Paul Lang in the Karr Lab at the Icahn School of Medicine at Mount Sinai in New York, USA.

Acknowledgements

BpForms was supported by a National Institutes of Health P41 award, a National Institutes of Health MIRA R35 award, and a National Science Foundation INSPIRE award.

Questions/comments

Please contact Jonathan Karr with any questions or comments.