Simplified Molecular Input Line Entry System (SMILES)
What is SMILES?
- A specification for unambiguously describing the structure of chemical molecules using ASCII strings.
- Widely used and computationally efficient.
- Uses atomic symbols and a set of intuitive rules.
- Uses hydrogen-suppressed molecular graphs (HSMG).
Canonical SMILES
- A type of SMILES specification.
- Includes rules for ensuring that each distinct chemical molecule has a single unique SMILES representation.
- Common application: indexing and ensuring uniqueness of molecules in a database.
Isomeric SMILES
- A type of SMILES specification.
- Includes extensions to support the specification of isotopes, chirality, and configuration about double bonds.
- Allows rigorous partial specification of chirality.
Graph-Based Definition
- A SMILES string is obtained by printing the symbol nodes encountered in a depth-first tree traversal of a chemical graph.
- The chemical graph is first trimmed to remove hydrogen atoms, and cycles are broken to turn it into a spanning tree.
- Numeric suffix labels are added to indicate the connected nodes where the cycles have been broken.
- Parentheses are used to indicate points of branching on the tree.
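The traversal described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name `write_smiles` and its input format (a list of atom symbols plus a list of index pairs) are invented here, every bond is treated as a single bond, hydrogens are assumed to be already removed, and fewer than ten cycles are expected. It does not produce canonical SMILES.

```python
def write_smiles(symbols, bonds, start=0):
    """Write a SMILES string for a hydrogen-suppressed graph.

    symbols: list of atom symbols, e.g. ["C", "C", "O"]
    bonds:   list of (i, j) atom-index pairs, all treated as single bonds
    """
    neighbors = {i: [] for i in range(len(symbols))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    visited = set()
    children = {i: [] for i in neighbors}   # spanning-tree edges
    digits = {i: [] for i in neighbors}     # ring-closure labels per atom
    closed = set()                          # bonds broken to open cycles

    def explore(atom, parent):
        # Depth-first search: unvisited neighbors become tree children,
        # already-visited neighbors mark a cycle that has to be broken.
        visited.add(atom)
        for nbr in neighbors[atom]:
            if nbr == parent:
                continue
            edge = frozenset((atom, nbr))
            if nbr not in visited:
                children[atom].append(nbr)
                explore(nbr, atom)
            elif edge not in closed:
                closed.add(edge)
                label = str(len(closed))
                digits[atom].append(label)
                digits[nbr].append(label)

    def emit(atom):
        # Print the atom, its ring-closure labels, then its subtrees.
        # Every child except the last is written as a parenthesised branch.
        text = symbols[atom] + "".join(digits[atom])
        kids = children[atom]
        for kid in kids[:-1]:
            text += "(" + emit(kid) + ")"
        if kids:
            text += emit(kids[-1])
        return text

    explore(start, None)
    return emit(start)


ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
print(write_smiles(["C"] * 6, ring))                                # C1CCCCC1
print(write_smiles(["C", "C", "C", "O"], [(0, 1), (1, 2), (1, 3)]))  # CC(C)O
```

Starting the traversal from a different atom gives a different but equally valid string, which is why canonical SMILES needs additional rules.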
SMILES Bonds
Bond | Symbol |
Single* | - |
Double | = |
Triple | # |
Aromatic* | : |
* Single and aromatic bond symbols may be omitted.
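As a small illustration of the bond symbols, the sketch below lists the bonds that are written explicitly in a SMILES string. The helper name `explicit_bonds` is an assumption made for this example. Characters inside square brackets are skipped, because there a "-" is a charge rather than a bond.

```python
BOND_NAMES = {"-": "single", "=": "double", "#": "triple", ":": "aromatic"}


def explicit_bonds(smiles):
    """Return the names of the bond symbols written out in a SMILES string."""
    names = []
    in_bracket = False
    for ch in smiles:
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif not in_bracket and ch in BOND_NAMES:
            names.append(BOND_NAMES[ch])
    return names


print(explicit_bonds("C=C"))       # ['double']
print(explicit_bonds("C#N"))       # ['triple']
print(explicit_bonds("CC"))        # [] because the single bond is implied
print(explicit_bonds("[O-]C=O"))   # ['double']; the '-' is a charge here
```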
SMILES Branches
- Branches are represented by enclosure in parentheses:
CCC(CC)CO | 2-ethyl-1-butanol |
- Branches can be nested or "stacked" to any depth:
CC(C)C(=O)C(C)C | 2,4-dimethyl-3-pentanone |
OCC(CCC)C(C(C)C)CCC | 2-propyl-3-isopropyl-1-propanol |
OS(=O)(=S)O | thiosulfate |
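Because branches are plain nested parentheses, their nesting depth can be measured with a counter. The sketch below is one way to do that; the function name `branch_depth` is made up for this example, and the check covers only parenthesis balance, not chemical validity.

```python
def branch_depth(smiles):
    """Return the deepest level of branch nesting in a SMILES string."""
    depth = 0
    deepest = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unmatched ')'")
    if depth != 0:
        raise ValueError("unmatched '('")
    return deepest


print(branch_depth("CCC(CC)CO"))             # 1
print(branch_depth("OCC(CCC)C(C(C)C)CCC"))   # 2
print(branch_depth("OS(=O)(=S)O"))           # 1, two stacked branches
```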
SMILES Symbols/Atoms
- A SMILES is a string of alphanumeric characters and certain punctuation symbols.
- It terminates at the first space encountered when read left to right.
- The organic subset:
- B, C, N, O, P, S, F, Cl, Br, I
- Aliphatic (nonaromatic) carbon: C
- Atom in an aromatic ring: lowercase letter, for example c for aromatic carbon
- Ring closure is designated with pairs of matching digits, e.g.:
- c1ccccc1 is benzene
- C1CCCCC1 is cyclohexane
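The atom rules above can be illustrated with a small tokenizer. This is a sketch only: the name `classify_atoms` is invented here, it recognises just the organic subset and its lowercase aromatic forms, and it does not handle atoms written in square brackets.

```python
import re

# Two-letter symbols are listed first so that "Cl" is not read as "C".
ATOM = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")


def classify_atoms(smiles):
    """Return (symbol, kind) pairs for the organic-subset atoms in a SMILES."""
    # The SMILES ends at the first space; anything after it is ignored.
    record = smiles.split(" ", 1)[0]
    result = []
    for match in ATOM.finditer(record):
        symbol = match.group()
        kind = "aromatic" if symbol.islower() else "aliphatic"
        result.append((symbol, kind))
    return result


print(classify_atoms("c1ccccc1"))     # six ('c', 'aromatic') entries
print(classify_atoms("C1CCCCC1"))     # six ('C', 'aliphatic') entries
print(classify_atoms("ClCCBr"))       # Cl, C, C, Br, all aliphatic
print(classify_atoms("CCO ethanol"))  # only CCO is read
```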
SMILES Charges
- Attached hydrogens and charges are specified inside square brackets.
- The number of attached hydrogens is given by the symbol H followed by an optional digit.
- Charge is specified by "+n" or "-n", where "n" is a number; if the number is missing, the charge is +1 or -1 as appropriate. A repeated sign is also accepted, so [Cu++] and [Cu+2] are equivalent.
SMILES | Name |
[H+] | proton |
[OH-] | hydroxide anion |
[OH3+] | hydronium cation |
[Fe++] | iron(II) cation |
[Cl-] | chloride anion |
[Cu+2] | copper(II) cation |
[Cu++] | copper(II) cation |
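To make the bracket rules concrete, the sketch below reads a single bracket atom and returns its symbol, attached hydrogen count, and charge. The function name `parse_bracket_atom` and the regular expression are assumptions for illustration; isotopes, chirality marks, and other bracket features are not handled.

```python
import re

# [ symbol, optional H with optional digit, optional charge ]
BRACKET = re.compile(r"\[([A-Z][a-z]?)(?:H(\d?))?((?:\++|-+)\d*)?\]")


def parse_bracket_atom(text):
    """Return (symbol, hydrogens, charge) for a bracket atom such as '[OH3+]'."""
    match = BRACKET.fullmatch(text)
    if match is None:
        raise ValueError(f"not a recognised bracket atom: {text!r}")
    symbol, h_digits, charge_text = match.groups()

    if h_digits is None:
        hydrogens = 0                     # no H written
    else:
        hydrogens = int(h_digits) if h_digits else 1

    charge = 0
    if charge_text:
        sign = 1 if charge_text[0] == "+" else -1
        digits = charge_text.lstrip("+-")
        # "+2" carries an explicit number; "++" is counted sign by sign.
        magnitude = int(digits) if digits else len(charge_text)
        charge = sign * magnitude
    return symbol, hydrogens, charge


for atom in ["[H+]", "[OH-]", "[OH3+]", "[Fe++]", "[Cl-]", "[Cu+2]"]:
    print(atom, parse_bracket_atom(atom))
```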
SMILES Cyclic Structures
- Break one single or one aromatic bond in each ring.
- Only the digits 1-9 are used as ring-closure labels.
- Each label should appear exactly twice, once at each end of the broken bond.
- For example:
- Naphthalene: c12ccccc1cccc2
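The ring-closure rules listed above lend themselves to a quick check. The sketch below counts ring-closure digits and reports any that break the rules as stated here (digits 1-9, each used exactly twice). The name `check_ring_digits` is hypothetical, and digits inside square brackets are ignored because they are hydrogen counts or charges.

```python
from collections import Counter


def check_ring_digits(smiles):
    """Return (ring_count, offending_digits) for a SMILES string."""
    counts = Counter()
    in_bracket = False
    for ch in smiles:
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif ch in "0123456789" and not in_bracket:
            counts[ch] += 1
    # A label is a problem if it is 0 or does not appear exactly twice.
    problems = [d for d, n in sorted(counts.items()) if d == "0" or n != 2]
    return len(counts), problems


print(check_ring_digits("c12ccccc1cccc2"))  # (2, [])
print(check_ring_digits("C1CCCCC1"))        # (1, [])
print(check_ring_digits("C1CCCC"))          # (1, ['1']), ring never closed
```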
SMILES Fragments
Fragment | SMILES |
Nitro | N(=O)(=O) |
Nitrate | ON(=O)(=O) |
Nitrite | ON(=O) |
Sulfonic acid | S(=O)(=O)O |
Cyanide/Nitrile | C#N |
Azide | N=N#N |
Azido | [N+]=[N-] |
SMILES Conventions
- Avoid two consecutive left parentheses if possible.
- Strive for the fewest possible branches.
- Tautomeric bonds are not designated; enter the appropriate form.
Other Restrictions
- A branch cannot begin a SMILES notation.
- A branch cannot immediately follow a double-bond or triple-bond symbol.
- Example: C=(CC)=C is invalid, but C(=CC)C and C(CC)=C are valid SMILES.
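Both restrictions are simple textual patterns, so they can be checked without parsing the whole string. The sketch below does that; `restriction_errors` is a name chosen for this example, and passing the check does not make a string valid SMILES in any other respect.

```python
def restriction_errors(smiles):
    """Return a list of messages for the two restrictions described above."""
    errors = []
    if smiles.startswith("("):
        errors.append("a branch cannot begin a SMILES notation")
    for bond in ("=", "#"):
        if bond + "(" in smiles:
            errors.append(f"a branch cannot immediately follow '{bond}'")
    return errors


print(restriction_errors("C=(CC)=C"))  # one error: branch after '='
print(restriction_errors("C(=CC)C"))   # []
print(restriction_errors("C(CC)=C"))   # []
```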
Disconnected Structures
- The dot "." symbol (also called a "dot bond") is legal in most places where a bond symbol would occur, but indicates that the atoms are not bonded.
- The most common use of the dot-bond symbol is to represent disconnected and ionic compounds:
[Na+].[Cl-] | sodium chloride |
Oc1ccccc1.NCCO | phenol, 2-aminoethanol |
[NH4+].[NH4+].[O-]S(=O)(=O)[S-] | diammonium thiosulfate |
- Although dot bonds are commonly used to represent compounds with disconnected parts, a dot bond does not by itself mean that the compound has disconnected parts. Ring-closure digits can still join atoms on either side of a dot; for example, C1.C1 is ethane.
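A minimal sketch of working with dot bonds follows. Splitting on "." gives the fragments as written, and a second helper flags the case where a ring-closure digit is opened in one fragment and closed in another, so the fragments are not really separate. Both function names are assumptions made for this example.

```python
def split_on_dots(smiles):
    """Return the dot-separated fragments exactly as written."""
    return smiles.split(".")


def has_cross_dot_ring_closure(smiles):
    """True if some ring-closure digit is left unpaired within a fragment."""
    for fragment in split_on_dots(smiles):
        open_digits = set()
        in_bracket = False
        for ch in fragment:
            if ch == "[":
                in_bracket = True
            elif ch == "]":
                in_bracket = False
            elif ch in "123456789" and not in_bracket:
                open_digits ^= {ch}   # toggle: opened, then closed
        if open_digits:
            return True
    return False


print(split_on_dots("[Na+].[Cl-]"))                 # ['[Na+]', '[Cl-]']
print(has_cross_dot_ring_closure("[Na+].[Cl-]"))    # False
print(has_cross_dot_ring_closure("Oc1ccccc1.NCCO")) # False
print(has_cross_dot_ring_closure("C1.C1"))          # True, the parts are bonded
```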
Isomeric and Chiral SMILES
- Configuration about a double bond is indicated by forward and backward slashes: / and \
- Examples:
- trans-1,2-dibromoethene: Br/C=C/Br
- cis-1,2-dibromoethene: Br/C=C\Br
- Chirality is indicated by the "@" symbol.
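The two examples above suggest a simple reading rule for strings of the form X/C=C/Y: matching slashes mean trans, opposite slashes mean cis. The sketch below applies that rule to this one pattern only. The name `double_bond_geometry` is invented here, and the function returns None for anything it does not recognise, including branched forms where the rule would need more care.

```python
import re

# substituent, slash, C=C, slash, substituent (simple linear form only)
PATTERN = re.compile(r"^[A-Za-z]+([/\\])C=C([/\\])[A-Za-z]+$")


def double_bond_geometry(smiles):
    """Return 'cis', 'trans', or None for a string written as X/C=C/Y."""
    match = PATTERN.match(smiles)
    if match is None:
        return None
    first, second = match.groups()
    return "trans" if first == second else "cis"


print(double_bond_geometry(r"Br/C=C/Br"))   # trans
print(double_bond_geometry(r"Br/C=C\Br"))   # cis
print(double_bond_geometry("BrC=CBr"))      # None, no configuration given
```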
*You can also refer to other websites about the Simplified Molecular Input Line Entry Specification (SMILES):