Sunday, November 3, 2013

SMILES

Simplified Molecular Input Line Entry System (SMILES)




What is SMILES?

  • A specification for unambiguously describing the structure of chemical molecules using ASCII strings.
  • Widely used and computationally efficient.
  • Uses atomic symbols and a set of intuitive rules.
  • Uses hydrogen-suppressed molecular graphs (HSMG).
Canonical SMILES

  • A type of SMILES specification.
  • Includes rules for ensuring that each distinct chemical molecule has a single unique SMILES representation.
  • Common application: indexing and ensuring uniqueness of molecules in a database.

Isomeric SMILES

  • A type of SMILES specification.
  • Includes extensions to support the specification of isotopes, chirality, and configuration about double bonds.
  • Allows rigorous partial specification of chirality.

Graph-Based Definition

  • A SMILES string is obtained by printing the symbol nodes encountered in a depth-first tree traversal of a chemical graph.
  • The chemical graph is prepared as follows:
    1. Hydrogen atoms are trimmed, and cycles are broken to turn the graph into a spanning tree.
    2. Numeric suffix labels are added to mark the connected nodes where cycles have been broken.
    3. Parentheses are used to indicate points of branching on the tree.
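
The traversal described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the graph is already hydrogen-suppressed and acyclic (no ring-closure labels are needed), every bond is a single bond, and the function name tree_to_smiles and its atoms/bonds input format are hypothetical, not part of any SMILES toolkit.

```python
def tree_to_smiles(atoms, bonds, start=0):
    """Write a SMILES string for an acyclic, hydrogen-suppressed graph.

    atoms: list of atom symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j) index pairs, all treated as single bonds
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    def visit(node, parent):
        children = [n for n in neighbors[node] if n != parent]
        text = atoms[node]
        # Every child except the last is a branch and goes in parentheses;
        # the last child continues the main chain.
        for child in children[:-1]:
            text += "(" + visit(child, node) + ")"
        if children:
            text += visit(children[-1], node)
        return text

    return visit(start, None)


# 2-ethyl-1-butanol, with atoms numbered along the string CCC(CC)CO
atoms = ["C", "C", "C", "C", "C", "C", "O"]
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (2, 5), (5, 6)]
print(tree_to_smiles(atoms, bonds))  # CCC(CC)CO
```

Starting the traversal from a different atom gives a different but equally valid string for the same molecule, which is why canonical SMILES needs extra rules.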

SMILES Bonds


  • Single*: -
  • Double: =
  • Triple: #
  • Aromatic*: :
  • Bonds marked * (single and aromatic) may be omitted and usually are.

SMILES Branches

  • Represented by enclosure in parentheses:

    • CCC(CC)CO 2-ethyl-1-butanol

  • Branches can be nested or "stacked" to any depth:

    • CC(C)C(=O)C(C)C 2,4-dimethyl-3-pentanone
    • OCC(CCC)C(C(C)C)CCC 2-propyl-3-isopropyl-1-hexanol
    • OS(=O)(=S)O thiosulfate

  • The SMILES branch/chain rules allow nested parenthetical expressions (branches) to an arbitrary depth.
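
To make "nested to any depth" concrete, here is a small sketch that checks that the parentheses of a SMILES string are balanced and reports how deeply its branches are nested. The helper name max_branch_depth is made up for this example, and it looks only at parentheses, not at whether the string is otherwise valid.

```python
def max_branch_depth(smiles):
    """Return the deepest level of nested branches in a SMILES string."""
    depth = 0
    deepest = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("closing parenthesis without an opening one")
    if depth != 0:
        raise ValueError("branch opened but never closed")
    return deepest


print(max_branch_depth("CC(C)C(=O)C(C)C"))      # 1
print(max_branch_depth("OCC(CCC)C(C(C)C)CCC"))  # 2
```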

SMILES Symbols/Atoms

  • A SMILES is a string of alphanumeric characters and certain punctuation symbols.
  • It terminates at the first space encountered when read left to right.
  • The ORGANIC SUBSET:
    • B, C, N, O, P, S, F, Cl, Br, I
  • Aliphatic or nonaromatic carbon: C
  • Atom in an aromatic ring: lowercase letter (for example, c for aromatic carbon)
  • Designate ring closure with pairs of matching digits, e.g.:
    • c1ccccc1 is benzene
    • C1CCCCC1 is cyclohexane
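
As a rough illustration of these rules, the sketch below picks out the organic-subset atoms of a SMILES string and flags the aromatic (lowercase) ones. The function name organic_atoms is hypothetical, the set of lowercase symbols (b, c, n, o, p, s) is an assumption of this example, and atoms written in square brackets are simply skipped.

```python
import re

# Two-letter symbols must be tried first so "Cl" is not read as C followed by l.
ATOM_PATTERN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")


def organic_atoms(smiles):
    """Return (symbol, is_aromatic) for each unbracketed organic-subset atom."""
    smiles = smiles.split(" ", 1)[0]            # the string ends at the first space
    smiles = re.sub(r"\[[^\]]*\]", "", smiles)  # bracket atoms are ignored here
    return [(m.group(), m.group().islower()) for m in ATOM_PATTERN.finditer(smiles)]


print(organic_atoms("ClCCBr"))
# [('Cl', False), ('C', False), ('C', False), ('Br', False)]
print(len(organic_atoms("C1CCCCC1 cyclohexane")))  # 6
```

The second call shows the "terminates at the first space" rule: the trailing name is never scanned for atom symbols.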

SMILES Charges

  • Charge is specified by "+n" or "-n", where "n" is a number; if the number is missing, it means +1 or -1 as appropriate.
  • Attached hydrogens and charges are specified inside square brackets.
  • The number of attached hydrogens is given by the symbol H followed by an optional digit.
    • [H+] proton
    • [OH-] hydroxyl anion
    • [OH3+] hydronium cation
    • [Fe++] iron(II) cation
    • [Cl-] chloride anion
    • [Cu+2] copper(II) cation
    • [Cu++] copper(II) cation
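
The bracket notation above is regular enough to read with a short parser. The following is a minimal sketch, assuming only the form [symbol, optional H count, optional charge]; the name parse_bracket_atom is invented for this example, and isotopes, chirality marks, and aromatic symbols are deliberately not handled.

```python
import re

BRACKET_ATOM = re.compile(r"^\[([A-Z][a-z]?)(H\d*)?([+-]+\d*)?\]$")


def parse_bracket_atom(text):
    """Return (symbol, attached_hydrogens, charge) for a simple bracket atom."""
    m = BRACKET_ATOM.match(text)
    if m is None:
        raise ValueError(f"not a simple bracket atom: {text!r}")
    symbol, h_part, charge_part = m.groups()

    if h_part is None:
        hydrogens = 0
    elif h_part == "H":
        hydrogens = 1                 # H with no digit means one hydrogen
    else:
        hydrogens = int(h_part[1:])

    if charge_part is None:
        charge = 0
    else:
        sign = 1 if charge_part[0] == "+" else -1
        digits = charge_part.lstrip("+-")
        # "+2" uses the digit; "++" counts the repeated signs; "+" alone is 1.
        charge = sign * (int(digits) if digits else len(charge_part))

    return symbol, hydrogens, charge


print(parse_bracket_atom("[OH3+]"))  # ('O', 3, 1)
print(parse_bracket_atom("[Cu+2]"))  # ('Cu', 0, 2)
print(parse_bracket_atom("[Cu++]"))  # ('Cu', 0, 2)
```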

SMILES Cyclic Structures

  • Break one single or one aromatic bond in each ring.
  • Only the numbers 1-9 are used.
  • For a given ring, its number appears exactly twice, once at each end of the broken bond.
  • For example:
    • Naphthalene: c12ccccc1cccc2
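
A quick way to check the pairing rule is to walk the string and open or close a ring each time a digit is seen. This sketch assumes single-digit labels 1-9 only, as stated above; the function name count_ring_closures is hypothetical, and digits inside square brackets (hydrogen counts and charges) are removed first because they are not ring labels.

```python
import re


def count_ring_closures(smiles):
    """Count ring-closure pairs; raise ValueError if a label is left open."""
    bare = re.sub(r"\[[^\]]*\]", "", smiles)  # drop bracket atoms such as [NH4+]
    open_labels = set()
    closures = 0
    for ch in bare:
        if ch in "123456789":
            if ch in open_labels:
                open_labels.remove(ch)  # second occurrence closes the ring
                closures += 1
            else:
                open_labels.add(ch)     # first occurrence opens it
    if open_labels:
        raise ValueError(f"unclosed ring labels: {sorted(open_labels)}")
    return closures


print(count_ring_closures("c12ccccc1cccc2"))  # 2
print(count_ring_closures("C1CCCCC1"))        # 1
```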

SMILES Fragments

  • Nitro: N(=O)(=O)
  • Nitrate: ON(=O)(=O)
  • Nitrite: ON(=O)
  • Sulfonic acid: S(=O)(=O)O
  • Cyanide/Nitrile: C#N
  • Azide: N=N#N
  • Azido: N=[N+]=[N-]

SMILES Conventions

  • Avoid two consecutive left parentheses if possible.
  • Strive for the fewest possible branches.
  • Tautomeric bonds are not designated; enter the appropriate form.

Other Restrictions

  • A branch cannot begin a SMILES notation.
  • A branch cannot immediately follow a double-bond or triple-bond symbol.
  • Example: C=(CC)=C is invalid, but C(=CC)C and C(CC)=C are valid SMILES.
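
These two restrictions are simple enough to test with plain string checks. The sketch below does only that; branch_rule_problems is a made-up helper name, and passing this check does not mean a string is valid SMILES in any other respect.

```python
def branch_rule_problems(smiles):
    """Return a list of violations of the two branch restrictions above."""
    problems = []
    if smiles.startswith("("):
        problems.append("a branch cannot begin a SMILES notation")
    for bond in ("=", "#"):
        if bond + "(" in smiles:
            problems.append(f"a branch cannot immediately follow '{bond}'")
    return problems


print(branch_rule_problems("C=(CC)=C"))  # one problem reported
print(branch_rule_problems("C(=CC)C"))   # []
print(branch_rule_problems("C(CC)=C"))   # []
```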

Disconnected Structures

  • The dot '.' symbol (also called a "dot bond") is legal in most places where a bond symbol would occur, but indicates that the atoms are not bonded.
  • The most common use of the dot-bond symbol is to represent disconnected and ionic compounds:

    • [Na+].[Cl-] sodium chloride
    • Oc1ccccc1.NCCO phenol, 2-aminoethanol
    • [NH4+].[NH4+].[O-]S(=O)(=O)[S-] diammonium thiosulfate

  • Although dot-bonds are commonly used to represent compounds with disconnected parts, a dot-bond does not itself mean that there are disconnected parts in the compound. For example, in C1.C1 the matching ring-closure digits bond the two carbons, giving ethane.
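
Splitting on the dot is the obvious way to list the parts of a record, but the last point above means the pieces are not always separate molecules. The following sketch, with hypothetical helper names dot_parts and dots_separate_parts, splits the string and then checks whether every piece closes its own ring labels. If a label is left open in one piece (as in C1.C1, where the digit 1 joins the two carbons into ethane), the dot does not mark a real disconnection. Only single-digit ring labels are considered.

```python
import re


def dot_parts(smiles):
    """Split a SMILES string at its dot bonds."""
    return smiles.split(".")


def dots_separate_parts(smiles):
    """True if every dot-delimited piece closes all of its own ring labels."""
    for part in dot_parts(smiles):
        bare = re.sub(r"\[[^\]]*\]", "", part)  # digits in brackets are not ring labels
        for digit in "123456789":
            if bare.count(digit) % 2:           # odd count: label bonds to another piece
                return False
    return True


print(dot_parts("[Na+].[Cl-]"))               # ['[Na+]', '[Cl-]']
print(dots_separate_parts("Oc1ccccc1.NCCO"))  # True
print(dots_separate_parts("C1.C1"))           # False
```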

Isomeric and Chiral SMILES

  • Configuration about double bonds is indicated by forward and backward slashes: / and \
  • Examples:
    • trans-1,2-dibromoethene: Br/C=C/Br
    • cis-1,2-dibromoethene: Br/C=C\Br
  • Chirality is indicated by the "@" symbol.
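
For the simple linear pattern used in the two examples (one substituent, a slash, the double bond, a slash, one substituent), the geometry can be read directly from the slashes: matching slashes mean trans, opposite slashes mean cis. The sketch below covers only that pattern. The name double_bond_geometry is invented here, and any string containing parentheses or more than one double bond returns None, because substituents written as branches change how the slashes must be read.

```python
import re

# substituent, slash, atom, "=", atom, slash, substituent; no branches allowed
SIMPLE_ALKENE = re.compile(
    r"^[^/\\()=]+([/\\])[^/\\()=]+=[^/\\()=]+([/\\])[^/\\()=]+$"
)


def double_bond_geometry(smiles):
    """Return 'cis', 'trans', or None if the string is not the simple pattern."""
    m = SIMPLE_ALKENE.match(smiles)
    if m is None:
        return None
    return "trans" if m.group(1) == m.group(2) else "cis"


# Raw strings keep Python from treating the backslash as an escape character.
print(double_bond_geometry(r"Br/C=C/Br"))  # trans
print(double_bond_geometry(r"Br/C=C\Br"))  # cis
print(double_bond_geometry("BrC=CBr"))     # None
```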


