Chemoinformatics Curiosities: A Chemical Educator’s Perspective on InChI

Organizations supporting machine-readable molecular formats.

When it comes to machine-readable representations of molecules, I grew up with SMILES. A SMILES string reflects the connectivity and stereochemistry of a molecular structure, and may be generated from any number of other machine-readable formats, such as MOL and CML. SMILES strings are becoming ubiquitous on the web, thanks to giants like Wikipedia. They’re nice because they’re fairly short and readable—the SMILES string for ethane is simply “CC,” for example.

That said, the limitations of SMILES are difficult to ignore. The same readability that makes SMILES appealing to human eyes limits its scope significantly. The innards of the SMILES algorithm(s) are fairly simple from a chemist’s perspective, and do not take into account spontaneous structural changes like tautomerization (or even the structural equivalence of resonance forms). There are multiple canonicalization algorithms, and the same structure can be written as several valid strings (ethanol is “CCO,” “OCC,” or “C(O)C”), meaning there is not, strictly speaking, a one-to-one relationship between structure and SMILES string. Finally, SMILES is a proprietary format whose algorithms are kept under lock and key, with the notable exception of the OpenSMILES project.

IUPAC, chemistry’s own group of nerds with a nomenclature fetish, has been working to remedy this situation for over a decade. Their machine-readable format, the International Chemical Identifier or “InChI” (en-chee), reflects a completely different philosophy from the SMILES approach. The goal of InChI is not to fully represent molecular structure, but to generate a unique identifier for a particular compound, given a structural representation. The InChI folks recognized that molecules can be represented with varying levels of detail, and that we may not necessarily need all the details to uniquely identify a particular compound. Many species, for example, can be singled out by their molecular formulas and connectivity alone. H2 is a nice example: to uniquely identify H2, all we really need is its molecular formula and knowledge that the H’s are bound together. More complex compounds, such as those that may possess stereoisomers, need more details in their identifier.

Chemoinformatics Curiosities: The Morgan Algorithm

Apologies to one of my favorite chemistry blogs for the title of this series—it just fit too well!

I’ve become very interested in the field of chemoinformatics lately. It’s mind-boggling to think about how chemoinformatics could influence education, as student responses are digitized. It’s a young field with a lot of potential! A series of upcoming posts will investigate some of the interesting aspects of chemoinformatics in a general sense (divorced from its most common bedfellow, chemical biology).

Here’s an interesting problem: how can one systematically and uniquely number the atoms of a molecular graph? We might need to do so, for instance, to compare two structures to see if they’re identical.

One solution would be to assign, systematically, a unique number to each atom in a structure based on connectivity. Atoms with identical connectivity are in identical chemical environments anyway, [1] so this procedure would provide us with a nice way to uniquely assign numbers to the atoms of a molecular graph. The toughest aspect of this solution is that little word “systematically.” Procedures that assign unique numbers to atoms must be designed so that the same numbering scheme results every time, irrespective of how the molecule is drawn. In a nutshell, the numbering must depend only on intrinsic properties of the molecular graph itself, and not at all on how it is represented.

While working for Chemical Abstracts, Morgan devised an ingenious algorithm that meets this criterion. Let’s begin by numbering each non-hydrogen atom according to its “non-hydrogen degree,” that is, the number of heavy atoms to which it is attached. Ignore multiple bonds for now.

Morgan Algorithm: Step 1
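As a rough illustration of this first step, here is a minimal Python sketch that counts heavy-atom neighbors. The function name, the atom labels, and the choice of ethanol as input are all assumptions made for the example, not part of Morgan's description.

```python
def heavy_atom_degrees(elements, bonds):
    """Count, for each non-hydrogen atom, how many non-hydrogen atoms it is bonded to.

    elements: dict mapping an atom label to its element symbol
    bonds:    list of (label, label) pairs, one entry per bond regardless of bond order
    """
    degrees = {atom: 0 for atom, element in elements.items() if element != "H"}
    for a, b in bonds:
        # Only bonds between two heavy atoms count; each counts once,
        # so double and triple bonds are treated like single bonds.
        if a in degrees and b in degrees:
            degrees[a] += 1
            degrees[b] += 1
    return degrees


# Ethanol with explicit hydrogens (labels are made up for this example).
ETHANOL_ELEMENTS = {
    "C1": "C", "C2": "C", "O": "O",
    "H1": "H", "H2": "H", "H3": "H", "H4": "H", "H5": "H", "H6": "H",
}
ETHANOL_BONDS = [
    ("C1", "C2"), ("C2", "O"),
    ("C1", "H1"), ("C1", "H2"), ("C1", "H3"),
    ("C2", "H4"), ("C2", "H5"), ("O", "H6"),
]

print(heavy_atom_degrees(ETHANOL_ELEMENTS, ETHANOL_BONDS))
# {'C1': 1, 'C2': 2, 'O': 1}
```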

Next, a weird, iterative addition trick assigns unique numbers to atoms based on their connectivity. For each atom, sum the current numbers of its neighbors (on the first pass, these are their degrees), and give that sum to the atom. Rinse and repeat this process until the numbers are as unique as possible. For our tyrosine example, this happens after five iterations…I’ll spare you the details, and show only the final result. Suffice it to say, if we repeated this, we wouldn’t introduce any more uniqueness in the numbers.

Morgan Algorithm: Step 2
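The summation trick is easy to try out in code. The sketch below is one possible Python version, not a definitive implementation: the function name, the PDB-style atom labels for tyrosine, and the stopping rule (run one round per atom and keep the earliest labelling with the most distinct values) are all assumptions. Depending on where you stop iterating, the exact numbers need not match the figure.

```python
def morgan_values(neighbors, rounds=None):
    """Iteratively sum neighbor values on a heavy-atom graph.

    neighbors: dict mapping each heavy atom to a list of its heavy-atom neighbors
    rounds:    number of summation rounds to try (assumed default: one per atom)
    """
    # Start from the non-hydrogen degree of each atom.
    values = {atom: len(nbrs) for atom, nbrs in neighbors.items()}
    best = values
    if rounds is None:
        rounds = len(neighbors)
    for _ in range(rounds):
        # Each atom's new value is the sum of its neighbors' current values.
        values = {atom: sum(values[n] for n in nbrs)
                  for atom, nbrs in neighbors.items()}
        # Keep a labelling only if it separates the atoms into more distinct values.
        if len(set(values.values())) > len(set(best.values())):
            best = values
    return best


# Heavy-atom graph of tyrosine (labels are an assumption for this example).
TYROSINE = {
    "N": ["CA"],
    "CA": ["N", "C", "CB"],
    "C": ["CA", "O1", "O2"],
    "O1": ["C"],
    "O2": ["C"],
    "CB": ["CA", "CG"],
    "CG": ["CB", "CD1", "CD2"],
    "CD1": ["CG", "CE1"],
    "CD2": ["CG", "CE2"],
    "CE1": ["CD1", "CZ"],
    "CE2": ["CD2", "CZ"],
    "CZ": ["CE1", "CE2", "OH"],
    "OH": ["CZ"],
}

values = morgan_values(TYROSINE)
print(values)
print(len(set(values.values())), "distinct values for", len(TYROSINE), "atoms")
```

Because bond orders and elements are ignored at this stage, the two carboxyl oxygens and the mirror-image pairs of ring carbons always end up with the same value, which is why the count of distinct values stays below the number of atoms.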

At this point, most of the atoms have different labels. Begin at the atom with the highest number, and assign it as “1.” Look at atom 1’s neighbors, and assign the highest as 2, second highest as 3, etc. Then move to atom 2 and repeat for any unassigned atoms attached to it, then on to atom 3, and so on. Where ties emerge, assign the lower number to the atom attached through the higher-order bond. When all is said and done, we get…

Morgan Algorithm: End Result!
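The numbering walk described above can be sketched in Python as a breadth-first traversal. Everything named here is an assumption for illustration: the function name, the atom labels, the hand-computed connectivity values in `VALUES` (four rounds of neighbor summation on the tyrosine heavy-atom graph), and the particular Kekulé structure chosen for the ring in `DOUBLE_BONDS`. The resulting numbers therefore need not match the figure.

```python
from collections import deque


def morgan_numbering(neighbors, values, bond_order):
    """Number atoms outward from the atom with the highest connectivity value.

    neighbors:  dict mapping each heavy atom to a list of its heavy-atom neighbors
    values:     dict mapping each heavy atom to its connectivity value
    bond_order: dict mapping frozenset({a, b}) to a bond order; missing bonds count as 1
    """
    start = max(values, key=values.get)
    numbering = {start: 1}
    queue = deque([start])
    while queue:
        atom = queue.popleft()  # atoms are visited in the order they were numbered
        unassigned = [n for n in neighbors[atom] if n not in numbering]
        # Highest value first; on a tie, the higher-order bond to `atom` wins.
        # Any tie that remains keeps the input order (a simplification).
        unassigned.sort(
            key=lambda n: (-values[n], -bond_order.get(frozenset((atom, n)), 1))
        )
        for n in unassigned:
            numbering[n] = len(numbering) + 1
            queue.append(n)
    return numbering


# Tyrosine heavy-atom graph, with assumed labels.
TYROSINE = {
    "N": ["CA"], "CA": ["N", "C", "CB"], "C": ["CA", "O1", "O2"],
    "O1": ["C"], "O2": ["C"], "CB": ["CA", "CG"],
    "CG": ["CB", "CD1", "CD2"], "CD1": ["CG", "CE1"], "CD2": ["CG", "CE2"],
    "CE1": ["CD1", "CZ"], "CE2": ["CD2", "CZ"],
    "CZ": ["CE1", "CE2", "OH"], "OH": ["CZ"],
}
# Hand-computed connectivity values (an assumption; see the lead-in).
VALUES = {
    "N": 30, "CA": 68, "C": 54, "O1": 24, "O2": 24, "CB": 64, "CG": 82,
    "CD1": 58, "CD2": 58, "CE1": 51, "CE2": 51, "CZ": 61, "OH": 25,
}
# One arbitrary Kekule structure plus the carboxyl C=O.
DOUBLE_BONDS = {
    frozenset(("C", "O1")): 2,
    frozenset(("CG", "CD1")): 2,
    frozenset(("CE1", "CZ")): 2,
    frozenset(("CE2", "CD2")): 2,
}

numbering = morgan_numbering(TYROSINE, VALUES, DOUBLE_BONDS)
for atom, number in sorted(numbering.items(), key=lambda item: item[1]):
    print(number, atom)
```

In this sketch the bond-order rule settles two ties: it picks between the two ring carbons next to atom 1, and between the two carboxyl oxygens.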

In an ideal world, Morgan’s and related “relaxation” algorithms (which iteratively examine the neighbors of atoms) would assign identical numbers to symmetry-equivalent atoms and different numbers to symmetry-inequivalent atoms in all cases. However, there are known examples of molecules with symmetry-inequivalent atoms that cannot be distinguished by Morgan’s algorithm. For some applications of Morgan’s algorithm in the chemical literature, check these out. The alternative proposals to the Cahn-Ingold-Prelog system are particularly intriguing!

[1] Let’s ignore enantiotopic and diastereotopic groups for now… 😀