Chemoinformatics Curiosities: A Chemical Educator’s Perspective on InChI
When it comes to machine-readable representations of molecules, I grew up with SMILES. A SMILES string reflects the connectivity and stereochemistry of a molecular structure, and may be generated from any number of other machine-readable formats, such as MOL and CML. SMILES strings are becoming ubiquitous on the web, thanks to giants like Wikipedia. They’re nice because they’re fairly short and readable—the SMILES string for ethane is simply “CC,” for example.
That said, the limitations of SMILES are difficult to ignore. The same readability that makes SMILES appealing to human eyes limits its scope significantly. The innards of the SMILES algorithm(s) are fairly simple from a chemist’s perspective, and do not take into account spontaneous structural changes like tautomerization (or even the structural equivalence of resonance forms). A single structure can be written as many different valid SMILES strings, and there are multiple algorithms for picking a “canonical” one, meaning there is not, strictly speaking, a one-to-one relationship between structure and SMILES string. Finally, SMILES is a proprietary format whose algorithms are kept under lock and key, with the notable exception of the OpenSMILES project.
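To make the one-to-one problem concrete, here is a tiny Python sketch. The three strings are all valid SMILES for ethanol, written from different starting atoms. The helper name `naive_same_molecule` is my own invention for illustration and is not part of any chemistry toolkit.

```python
# Three valid SMILES strings for ethanol, each written from a different
# starting atom or with a branch.
ETHANOL_SMILES = ["CCO", "OCC", "C(O)C"]


def naive_same_molecule(smiles_a: str, smiles_b: str) -> bool:
    """Hypothetical helper: compare two SMILES strings character by character."""
    return smiles_a == smiles_b


# All three strings describe one compound, yet plain string comparison
# treats every one of them as distinct.
distinct_strings = set(ETHANOL_SMILES)
print(len(distinct_strings))
print(naive_same_molecule("CCO", "OCC"))
```

Getting a reliable answer out of SMILES therefore requires a canonicalization step from a cheminformatics toolkit, and two toolkits are not guaranteed to agree on the canonical string.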
IUPAC, chemistry’s own group of nerds with a nomenclature fetish, has been working to remedy this situation for over a decade. Their machine-readable format, the International Chemical Identifier or “InChI” (en-chee), reflects a completely different philosophy from the SMILES approach. The goal of InChI is not to fully represent molecular structure, but to generate a unique identifier for a particular compound, given a structural representation. The InChI folks recognized that molecules can be represented with varying levels of detail, and that we may not necessarily need all the details to uniquely identify a particular compound. Many species, for example, can be singled out by their molecular formulas and connectivity alone. H2 is a nice example—to uniquely identify H2, all we really need is its molecular formula and knowledge that the H’s are bound together. More complex compounds, such as those that may possess stereoisomers, need more details in their identifier.
InChI uses “layers” to represent the various levels of complexity associated with molecular compounds. Layers include things like charge, connectivity, and stereochemistry. The InChI algorithm generates layers associated with a compound until its description is unique (but, again, not necessarily until its structure is fully specified). Mirroring how the human understanding of chemistry has evolved over the years, very basic aspects of the compound such as its formula and connectivity are listed first, followed by more complicated features such as stereochemistry and isotopes. The layers of information are unique for a particular compound, but may not permit reconstruction of a precise structure. This is particularly true where resonance forms and tautomers may be involved. For example, all four of the following structures fall under the banner of the same InChI.
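The layered layout is easy to see by pulling an InChI string apart at its slashes. The sketch below is a rough illustration rather than a real parser: the function name `split_inchi_layers` is hypothetical, and it assumes the usual textual shape of an InChI (an `InChI=` prefix, a version, an unprefixed formula layer, then layers that each start with a lowercase letter such as `c` for connectivity or `h` for hydrogens).

```python
# Standard InChI strings for ethanol and L-alanine, used as sample input.
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"


def split_inchi_layers(inchi: str) -> list[tuple[str, str]]:
    """Split an InChI string into (layer name, layer content) pairs.

    Hypothetical helper for illustration. The formula layer carries no
    prefix letter, so it is recognized by position and by not starting
    with a lowercase letter. Pairs are kept in a list (not a dict)
    because a prefix letter may appear more than once in some InChIs.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError(f"not an InChI string: {inchi!r}")

    version, *segments = inchi[len(prefix):].split("/")
    layers = [("version", version)]
    for position, segment in enumerate(segments):
        if position == 0 and not segment[:1].islower():
            layers.append(("formula", segment))
        else:
            # The first character names the layer, the rest is its content.
            layers.append((segment[:1], segment[1:]))
    return layers


for name, content in split_inchi_layers(L_ALANINE):
    print(name, content)
```

Ethanol needs only the formula, connectivity, and hydrogen layers, while L-alanine picks up the additional stereochemical layers (`t`, `m`, `s`) at the end of the string.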
The InChI crew has recognized that it is often desirable to pass molecular structures around in the form of machine-readable strings, and that this requires a string that is not only unique, but also detailed enough to encode a single Lewis structure. Optional “AuxInfo” output, generated alongside the InChI, can provide additional details so that a single Lewis structure can be reliably generated. For more details about the InChI specification, check out the InChI technical manual. It’s somewhat dry, but there are some very interesting examples peppered throughout. The InChI FAQ is great if you’re not interested in the grisly innards of the specification.
From an educational research perspective, machine-readable molecular representations play an important role in the constant push to analyze more student data more quickly. Clever solutions, such as Cooper’s OrganicPad, have been devised to convert structures in various input formats into machine-readable representations. Comparing a student’s input to a correct response supplied by an instructor (or to the input of other students) is necessary to extract useful information from students’ data, but the logic of making these comparisons can get complicated quickly. Here, algorithms like the one developed for InChI can help—the InChI group worries about things like normalization and standardization so the educational researcher doesn’t have to. InChI in particular is nice because simple comparisons (e.g., “do molecules a and b have the same molecular formula?”) can be made using simple string manipulation with no chemical logic. InChIs are also inherently flexible: as future generations learn more about relationships between structure and chemical properties, additional layers may be added without gutting the original algorithm and all the old InChIs made with it.
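Here is a minimal sketch of that kind of string-only comparison, using ethanol as the instructor’s answer and dimethyl ether as a student’s response. The function names (`formula_layer`, `same_formula`, `grade`) and the feedback messages are hypothetical, and the code assumes both InChI strings were generated with the same software and options.

```python
# Standard InChI strings for two isomers with the formula C2H6O.
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
DIMETHYL_ETHER = "InChI=1S/C2H6O/c1-3-2/h1-2H3"


def formula_layer(inchi: str) -> str:
    """Return the formula layer, or "" if the string has none.

    The formula is the first slash-separated segment after the version
    and, unlike the other layers, does not start with a lowercase letter.
    """
    segments = inchi.split("/")
    if len(segments) > 1 and not segments[1][:1].islower():
        return segments[1]
    return ""


def same_formula(inchi_a: str, inchi_b: str) -> bool:
    """Compare molecular formulas with plain string equality."""
    return formula_layer(inchi_a) == formula_layer(inchi_b)


def grade(student_inchi: str, key_inchi: str) -> str:
    """Hypothetical feedback rule built only on string comparisons."""
    if student_inchi == key_inchi:
        return "match"
    if same_formula(student_inchi, key_inchi):
        return "isomer of the expected answer"
    return "different formula"


print(grade(DIMETHYL_ETHER, ETHANOL))
```

No chemical logic appears anywhere in the code. The normalization work was already done when the InChI strings were generated, so the comparison reduces to checking whether two pieces of text are equal.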
InChI is shackled, however, by its status as an identifier rather than a structural encoding. A single InChI string can correspond to several drawn structures (different tautomers or resonance forms, for instance), so there is not a one-to-one correspondence of InChI string to structure. As a result, the inferences we can make about the structure corresponding to a particular InChI string are limited (without AuxInfo and/or the FixedH layer). SMILES also has InChI beat on brevity, as InChI strings are generally much longer than the SMILES string for the same compound.
InChI does what IUPAC does best: it provides a unique name for a particular compound. I like to think of InChI as “IUPAC Nomenclature 2.0,” or “Nomenclature for Machines.” Its educational utility comes from the various levels of abstraction provided by the different layers, although for me, the exact value of this abstraction is still unclear.