When it comes to machine-readable representations of molecules, I grew up with SMILES. A SMILES string reflects the connectivity and stereochemistry of a molecular structure, and may be generated from any number of other machine-readable formats, such as MOL and CML. SMILES strings are becoming ubiquitous on the web, thanks to giants like Wikipedia. They’re nice because they’re fairly short and readable—the SMILES string for ethane is simply “CC,” for example.
That said, the limitations of SMILES are difficult to ignore. The same readability that makes SMILES appealing to human eyes limits its scope significantly. The innards of the SMILES algorithm(s) are fairly simple from a chemist’s perspective, and do not take into account spontaneous structural changes like tautomerization (or even the structural equivalence of resonance forms). A single structure can be written as many different valid SMILES strings, and the canonicalization algorithms that pick one of them differ between toolkits, so there is not, strictly speaking, a one-to-one relationship between structure and SMILES string. Finally, SMILES grew up as a proprietary format, and some vendors’ canonicalization algorithms are still kept under lock and key; the OpenSMILES project is the notable open exception.
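To make the "no one-to-one relationship" point concrete, here is a rough Python sketch. It is not a real SMILES parser: it assumes a toy subset of the language (the atoms C, N and O, implicit single bonds, and parentheses for branches), and the function names `parse_smiles` and `same_structure` are made up for this post. It compares two strings by brute-force graph matching, which is only sensible for very small molecules.

```python
from itertools import permutations


def parse_smiles(smiles):
    """Parse a toy SMILES subset into (atoms, bonds).

    Assumed subset: atoms C, N, O; implicit single bonds; parentheses
    for branches. No rings, charges, aromaticity, or stereochemistry.
    """
    atoms, bonds = [], set()
    stack, prev = [], None
    for ch in smiles:
        if ch == "(":
            stack.append(prev)          # remember the branch point
        elif ch == ")":
            prev = stack.pop()          # return to the branch point
        elif ch in "CNO":
            atoms.append(ch)
            idx = len(atoms) - 1
            if prev is not None:
                bonds.add(frozenset((prev, idx)))
            prev = idx
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return atoms, bonds


def same_structure(smiles_a, smiles_b):
    """Return True if both strings describe the same heavy-atom graph."""
    atoms_a, bonds_a = parse_smiles(smiles_a)
    atoms_b, bonds_b = parse_smiles(smiles_b)
    if sorted(atoms_a) != sorted(atoms_b) or len(bonds_a) != len(bonds_b):
        return False
    n = len(atoms_a)
    # Try every relabelling of atoms; fine for a handful of atoms only.
    for perm in permutations(range(n)):
        if all(atoms_a[i] == atoms_b[perm[i]] for i in range(n)):
            mapped = {frozenset(perm[i] for i in bond) for bond in bonds_a}
            if mapped == bonds_b:
                return True
    return False


print(same_structure("CCO", "OCC"))    # ethanol written two ways
print(same_structure("CCO", "C(O)C"))  # ethanol again, with a branch
print(same_structure("CCO", "COC"))    # ethanol vs. dimethyl ether
```

The first two comparisons come out `True` and the third `False`: three different strings, one molecule. Real toolkits get around this by canonicalizing, but since each toolkit may canonicalize differently, the canonical string is only unique within that toolkit.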
IUPAC, chemistry’s own group of nerds with a nomenclature fetish, has been working to remedy this situation for over a decade. Their machine-readable format, the International Chemical Identifier or “InChI” (in-chee), reflects a completely different philosophy from the SMILES approach. The goal of InChI is not to fully represent molecular structure, but to generate a unique identifier for a particular compound, given a structural representation. The InChI folks recognized that molecules can be represented with varying levels of detail, and that we may not necessarily need all the details to uniquely identify a particular compound. Many species, for example, can be singled out by their molecular formulas and connectivity alone. H2 is a nice example: to uniquely identify H2, all we really need is its molecular formula and knowledge that the H’s are bound together. More complex compounds, such as those that may possess stereoisomers, need more details in their identifier.
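The "varying levels of detail" idea shows up directly in the identifier, which is a series of slash-separated layers. The Python sketch below simply splits a standard InChI string into those layers. The helper name `split_inchi` and the informal layer labels are my own, and the sketch assumes standard InChI only (no fixed-hydrogen or reconnected layers, where prefixes can repeat).

```python
# Single-letter layer prefixes used in standard InChI; labels are informal.
LAYER_NAMES = {
    "c": "connectivity",
    "h": "hydrogens",
    "q": "charge",
    "p": "protonation",
    "b": "double-bond stereo",
    "t": "tetrahedral stereo",
    "m": "stereo mirror flag",
    "s": "stereo type",
    "i": "isotopes",
}


def split_inchi(inchi):
    """Split a standard InChI string into a dict of named layers."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    version, *layers = inchi[len(prefix):].split("/")
    result = {"version": version}
    # The formula layer has no prefix letter and starts with an element symbol.
    if layers and layers[0][:1].isupper():
        result["formula"] = layers.pop(0)
    for layer in layers:
        if not layer:
            continue
        name = LAYER_NAMES.get(layer[0], f"other ({layer[0]})")
        result[name] = layer[1:]
    return result


examples = {
    "dihydrogen": "InChI=1S/H2/h1H",
    "ethanol": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    "L-alanine": "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1",
}
for name, inchi in examples.items():
    print(name, split_inchi(inchi))
```

H2 gets by with a formula and a hydrogen layer, ethanol adds a connectivity layer, and L-alanine needs the extra stereo layers (`t`, `m`, `s`) to tell it apart from its mirror image.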