For long surveys with many items, it’s very tempting to group items together according to the constructs that they appear to measure in common. For instance, asking someone about their height, level of athleticism, sense of coordination, and ball-handling ability seems to be appropriate for a section of a survey measuring basketball skill. But how can we know, based on survey data alone, whether these four items taken together measure what we think they’re measuring?

The implicit assumption here is that a person with significant basketball skill will rate highly on all of the scales defined by the survey items (and vice versa, for people with low basketball skill). In stats terms, this means that all the items have large positive correlations with one another: when one goes up, the others go up, because they’re all controlled by basketball skill. The root cause of the correlations is an underlying construct that is common to all the survey items.

An instrument with multiple categories like “basketball skill” achieves **internal consistency** when it actually measures the constructs it purports to measure based on the structure of its categories*. In an ideal world, all the items within a category would correlate perfectly with each other and not at all with items from any of the other categories. While this is impossible in real life, it is possible to look for groupings that make the correlations between items within a category as large as possible; surveys whose categories match such groupings are internally consistent. The statistical methods that find categories that maximize internal correlations while minimizing external correlations are lumped under the name **factor analysis**. If the results of a factor analysis are consistent with existing survey categories, that is good evidence of internal consistency. Factor analysis wins the award for my favorite statistical method. For all you physical science-y types out there, principal component analysis (PCA) is a close relative of factor analysis, and the two are often lumped together.

I needed to determine internal consistency recently for our pre-semester survey, which asks students about their level of comfort with our course objectives, attitudes towards our course and education in general, and how often they make use of general learning skills. The survey is broken up into four categories: Understanding, Skills, Attitudes, and (Mental) Integration. When I ran factor analysis to confirm these categories, the result was surprising. For the most part, the resulting factors grouped together items in the same category, but they also included items from other categories. The cool thing about this result is that it suggests the survey is actually measuring something different, and deeper, than the four basic categories it purports to measure.

You can do some pretty neat things with the results of a factor analysis. Unlike items in a physical survey, factor-analyzed survey items need not be dumped into a single category: a single item can contribute to different factors in different amounts (the fancy-schmancy stats types call these amounts **loadings**). Using the factor loadings (or, more carefully, weights derived from them) as coefficients, one can construct a linear combination of a single student’s responses to each survey item to give a **factor score**. The factor score is essentially the measured value of that factor for the student. Why is this cool? Because you can hit the student back with her factor scores, which may appear at first glance to be only marginally related to the survey items, but are nonetheless grounded in the statistics of the responses. When new data is added to the survey, the factor analysis can be rerun to produce an updated set of factors and/or loadings.
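Here is a minimal sketch of the loading-weighted sum described above. The loadings and the responses are invented for illustration and do not come from our survey. Standardizing each item first is my own assumption, so that items on different scales contribute comparably. Statistical packages usually estimate scores with weights derived from the loadings (the regression method, for example) rather than the raw loadings used here.

```python
from statistics import mean, stdev

# Hypothetical loadings of four items on a single factor (not fitted values).
LOADINGS = {
    "athleticism": 0.8,
    "coordination": 0.7,
    "ball_handling": 0.6,
    "note_taking": 0.3,
}

# Made-up responses on a 1-5 scale, one dict per student.
RESPONSES = [
    {"athleticism": 5, "coordination": 4, "ball_handling": 5, "note_taking": 2},
    {"athleticism": 2, "coordination": 2, "ball_handling": 1, "note_taking": 4},
    {"athleticism": 3, "coordination": 3, "ball_handling": 3, "note_taking": 3},
    {"athleticism": 4, "coordination": 5, "ball_handling": 4, "note_taking": 5},
]

def standardize(rows, items):
    """Convert each item to z-scores using the sample mean and standard deviation."""
    stats = {}
    for item in items:
        values = [row[item] for row in rows]
        stats[item] = (mean(values), stdev(values))
    return [
        {item: (row[item] - m) / s for item, (m, s) in stats.items()}
        for row in rows
    ]

def factor_score(z_row, loadings):
    """Loading-weighted sum of one student's standardized responses."""
    return sum(loadings[item] * z_row[item] for item in loadings)

z_rows = standardize(RESPONSES, LOADINGS)
scores = [factor_score(z, LOADINGS) for z in z_rows]

for student, score in enumerate(scores):
    print(f"student {student}: factor score = {score:+.2f}")
```

Note that `note_taking` still nudges the score a little through its small loading, which is exactly the "different amounts" idea: no item is forced entirely into or out of a factor.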

The biggest caveat here is in naming and interpretation of the factors. You often hear, and justifiably so, that factor analyses should be interpreted “in the light of theory.” In other words, existing theories should help frame one’s thinking about how to name and define factors. Should, for instance, an item with a (relatively low) loading of 0.3 be ignored in the interpretation of a factor? If doing so makes sense in the light of theory, then heck yeah. If there’s no good reason to leave it out, then no.

* Or at least when all the items within each category measure *something* in common…it’s a whole new can of worms to validate the name of the category itself.