Chemical Education Roundup, 4-23-13

“It was the best of times; it was the worst of times.” This sentiment nicely sums up the state of chemical education right now. While sequestration threatens the largest sources of funding for chemical education researchers in the US, the literature has been on fire in the past few weeks with some intriguing studies. There’s a lot to talk about, so let’s get right into it!

First, the bad news. STEM education takes a painful hit in the President’s budget for FY 2014.

The single biggest consolidation proposed this year is in the area of science, technology, engineering, and mathematics (STEM) education, where the Administration is proposing a bold restructuring of STEM education programs—consolidating 90 programs and realigning ongoing STEM education activities to improve the delivery, impact, and visibility of these efforts.

Don’t be fooled by the rhetoric–this is almost certainly bad news for American chem ed researchers. It will be interesting to see how existing NSF-funded programs respond to these changes, but it’s almost certain to hurt the proliferation of new programs. It’s worth noting also that this is only a proposed budget, but if President Obama is throwing STEM education under the bus, I don’t see Congress fighting back.

Enough with the bad news! The bright side is that a lot of interesting research is happening these days. I’ve been digging into the general chemistry literature lately for professional reasons, and a very recent study out of Middle Tennessee State University caught my eye. The research addressed student conceptions of gases, focusing on a question that asks about the effects of a temperature change on the particulate nature of helium gas (originally studied by Nurrenbern and Pickering). The conclusion of the research is typical: scaffolding and schema-activating designs for assessments improve performance on conceptual problems relative to more vague designs, but the authors were unable to track down the exact source of the performance boost (despite a few controls).

One clue is provided by another recent study: that of Behmke and Atwood on the implementation of problems sensitive to cognitive load theory in an electronic homework system. The authors converted single, multi-step problems into sequences of related problems that “fade” from nearly complete when given to fully incomplete. Using an analytical approach based on item response theory, the authors observed that students exposed to the “statically fading” questions were very likely to perform better on subsequent related problems. The act of breaking a multi-step problem down and exposing its process over multiple problems can improve performance.

Jennifer Lewis and colleagues at USF have written a very important summary of the state of the art in psychometric measurement for chemistry education research. In addition to pointing out the typical methods researchers use to argue for the validity and reliability of survey results, Lewis et al. note that chemistry education research is becoming more interdisciplinary as evidence mounts for theoretical overlap between sub-fields of science education. They also draw attention to the need for qualitative research to complement quantitative efforts (see the MTSU study for a nice recent example of this idea). A nice read right after Lewis’s review is Barbera’s recent psychometric analysis of the Chemical Concepts Inventory.

In other news: a simple approach to assessing general chemistry laboratories; an investigation of apprenticeship in research groups; differential item functioning in science assessments; the evolution of online video in an organic chemistry course; teaching gas laws to blind students. Mouse over the links for full article titles!


Factor Me: Internal Consistency in Survey Instruments

For exhaustive surveys with many items, it’s very tempting to group items together according to the constructs that they appear to measure mutually. For instance, asking someone about their height, level of athleticism, sense of coordination, and ball-handling ability seems to be appropriate for a section of a survey measuring basketball skill. But how can we know, based on survey data alone, whether these four items taken together measure what we think they’re measuring?

The implicit assumption here is that a person with significant basketball skill will rate highly on all of the scales defined by the survey items (and vice versa, for people with low basketball skill). In stats terms, this means that all the items have large positive correlations with one another—when one goes up, the other goes up, because they’re both controlled by basketball skill. The root cause of the correlations is an underlying construct that is common to all the survey items.
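To see this in action, here is a minimal Python sketch using simulated data rather than real survey responses. It assumes each of the four item scores is just a hidden "skill" value plus independent random noise. The function names, the noise level, and the sample size are all made up for illustration.

```python
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5


def simulate_items(n_people=500, n_items=4, noise=0.7, seed=0):
    """Assumed toy model: every item = latent skill + independent noise."""
    rng = random.Random(seed)
    skill = [rng.gauss(0, 1) for _ in range(n_people)]
    items = [[s + rng.gauss(0, noise) for s in skill] for _ in range(n_items)]
    return skill, items


names = ["height", "athleticism", "coordination", "ball_handling"]
skill, items = simulate_items()

# Correlation for every pair of items (6 pairs for 4 items).
correlations = {
    (names[i], names[j]): pearson(items[i], items[j])
    for i, j in combinations(range(len(names)), 2)
}

for pair, r in correlations.items():
    print(pair, round(r, 2))
```

Every pair comes out positively correlated even though no item "knows" about any other. The only thing they share is the hidden skill value, which is the point of the paragraph above.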

An instrument with multiple categories like “basketball skill” achieves internal consistency when it actually measures the constructs it purports to measure based on the structure of its categories*. In an ideal world, all the items within a category would correlate perfectly with each other and not at all with items from any of the other categories. While this is impossible in real life, it is possible to maximize the correlations between items within a category; surveys that do this are internally consistent. The statistical methods that find groupings of items that maximize internal correlations while minimizing external correlations are lumped under the name factor analysis. If the results of a factor analysis are consistent with the existing survey categories, that is good evidence of internal consistency. Factor analysis wins the award for my favorite statistical method. For all you physical science-y types out there, principal component analysis (PCA) is a close relative of factor analysis.
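A crude way to get a feel for the "high inside, low outside" idea is to compare the average correlation within categories to the average correlation between them. The sketch below is not a factor analysis, just a sanity check on the same intuition. The item labels, the two categories, and every number in the correlation matrix are hypothetical values invented for the example.

```python
from itertools import combinations

# Hypothetical items and category assignments (not from any real survey).
items = ["U1", "U2", "S1", "S2"]
category = {"U1": "Understanding", "U2": "Understanding",
            "S1": "Skills", "S2": "Skills"}

# Made-up correlation matrix, rows and columns in the order of `items`.
corr = [
    [1.00, 0.70, 0.20, 0.10],
    [0.70, 1.00, 0.15, 0.25],
    [0.20, 0.15, 1.00, 0.60],
    [0.10, 0.25, 0.60, 1.00],
]


def mean_within_between(items, category, corr):
    """Average correlation for same-category pairs and for cross-category pairs."""
    within, between = [], []
    for i, j in combinations(range(len(items)), 2):
        same = category[items[i]] == category[items[j]]
        (within if same else between).append(corr[i][j])
    return sum(within) / len(within), sum(between) / len(between)


within, between = mean_within_between(items, category, corr)
print("mean within-category r: ", round(within, 3))
print("mean between-category r:", round(between, 3))
```

A large gap between the two averages is what an internally consistent category structure should look like. A real factor analysis goes further by finding the groupings from the data instead of taking them as given.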

I needed to determine internal consistency recently for our pre-semester survey, which asks students about their level of comfort with our course objectives, attitudes towards our course and education in general, and how often they make use of general learning skills. The survey is broken up into four categories: Understanding, Skills, Attitudes, and (Mental) Integration. When I ran factor analysis to confirm these categories, the result was surprising. The resulting factors grouped together items in the same category for the most part, but also included items from other categories. The cool thing about this result is that it implies that the survey is actually measuring something different, and deeper, than the four basic categories it purports to measure.

You can do some pretty neat things with the results of a factor analysis. Unlike a physical survey, factor-analyzed survey items need not be dumped into a single category—a single item can contribute to different factors in different amounts (the fancy-schmancy stats types call these amounts loadings). Using the factor loadings as coefficients, one can construct a linear combination of a single student’s responses to each survey item to give a factor score. The factor score is essentially the measured value of that factor for the student. Why is this cool? Because you can hit the student back with her factor scores, which appear at first glance to be marginally related to the survey items, but are nonetheless statistically validated. When new data is added to the survey, the factor analysis can be run again and the results applied to an updated set of factors and/or loadings.
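Here is a minimal sketch of that linear combination in Python. It follows the simple recipe described above (loadings used directly as weights), whereas stats packages typically compute scores with more refined weighting schemes. The item names, factor names, loadings, and responses are all hypothetical, and the responses are assumed to be standardized already.

```python
# Hypothetical loadings: each item contributes to both factors in different amounts.
loadings = {
    "factor_A": {"Q1": 0.8, "Q2": 0.6, "Q3": 0.1},
    "factor_B": {"Q1": 0.1, "Q2": 0.3, "Q3": 0.7},
}

# One student's responses, assumed to be standardized (z-scores).
responses = {"Q1": 1.0, "Q2": -0.5, "Q3": 2.0}


def factor_score(responses, factor_loadings):
    """Weighted sum of item responses, using the loadings as coefficients."""
    return sum(factor_loadings[item] * value for item, value in responses.items())


scores = {name: factor_score(responses, l) for name, l in loadings.items()}

for name, score in scores.items():
    print(name, round(score, 2))
```

Notice that Q2 counts toward both factors, just with different weights. That is the flexibility a paper survey with fixed sections can't give you.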

The biggest caveat here is in naming and interpretation of the factors. You often hear, and justifiably so, that factor analyses should be interpreted “in the light of theory.” In other words, existing theories should help frame one’s thinking about how to name and define factors. Should, for instance, an item with a (relatively low) loading of 0.3 be ignored in the interpretation of a factor? If doing so makes sense in the light of theory, then heck yeah. If there’s no good reason to leave it out, then no.

* Or at least when all the items within each category measure something in common…it’s a whole new can of worms to validate the name of the category itself.

How Much is Too Much? Survey Instruments in Education

Surveys: God's gift to educational research

A while back, while racking his brain for ideas on how to conduct educational research, some guy came to the glorious realization that the easiest way to get information from people is to just ask for it. Since that fateful day, surveys have been an indispensable tool for research in education. As critical as surveys are, though, it’s easy to lose sight of their pros and (especially) cons in the heat of a research project or teaching semester. Gathering and presenting information from surveys has limitations like any other method.

The freedom offered by survey methods to the researcher can be both liberating and dangerous. Consider the following scenario: you want to measure the physical activity of a student by surveying her about her workout schedule. So, you ask her how often she visits the gym. What you don’t (and indeed, probably can’t) know is that the student works at the gym, and is there to work just as much as she is to work out. How will she answer? If you’re lucky, she actually keeps track of when she visits the gym to work out. More than likely, you’ll get an answer that has been heavily fudged by the student. Your instrument doesn’t measure what you’d like it to; it lacks validity. In educational research, constructs derived from survey items are very often used to measure “grander” concepts, such as intelligence, on the strength of a theory that links the two. To the extent that the theory is incorrect, such survey instruments lack validity.

Survey instruments must also be reliable; that is, they must be designed and implemented to measure the same thing the same way over time, across different questions aimed at the same construct, and across different administrators. A perfectly reliable instrument possesses no random error; generally, the condition of reliability argues for very specific survey items which cover all or nearly all variables that may be at play in every item. On the other hand, increasing the number of items in a survey increases the potential for misread or ill-considered items, and may dilute the impact of any one item from the perspective of analysis. There is a subtle balance there. The workout survey, for instance, would clearly benefit from an item about other activities at the gym besides working out. An item asking for the precise nature of these activities, however, is probably overkill, and would likely confuse the large number of survey respondents who just go to the gym to exercise.
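One common statistic for the "two different questions" flavor of reliability is Cronbach's alpha, which compares the variance of individual items to the variance of the total score. The post doesn't depend on it, so treat the sketch below as one possible way to put a number on reliability. The response data are invented for the example.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(rows[0])                      # number of items
    item_columns = list(zip(*rows))       # one tuple of scores per item
    item_variance_sum = sum(variance(col) for col in item_columns)
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical responses: 5 students, 3 items on a 1-5 scale.
responses = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [1, 2, 2],
    [3, 3, 3],
]

alpha = cronbach_alpha(responses)
print(round(alpha, 3))  # about 0.963 for this made-up data
```

Values close to 1 mean the items rise and fall together. Keep in mind that alpha tends to creep upward as items are added, which ties into the "how much is too much" balance described above.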

Thus the question I pose in the title of this post: how much is too much? How much information is too much to collect? Too much to process? Too much to respond to? In my experience, there really is no reason for an educator to broach subjects or theories beyond his/her field, if the research is solely for course or curriculum development. Students will invest earnestly in a survey that may benefit a course they’re currently taking, and exploring “grander” ideas about the relationship between external variables and an (in the scheme of things unimportant) university course just confuses and alienates students (I can vouch for this one personally). Accept from the outset that students won’t care if you discover a hidden connection between a student’s appreciation for 80’s hip hop and their performance in your course, so it’s not even worth going there. Really, in 2010, there is no reason for educators in the trenches of teaching to design their own survey instruments except, perhaps, in bleeding-edge fields. The nice thing about adapting or borrowing an existing instrument is that you don’t have to sweat validity and reliability, which are built into the design of existing instruments. Actually measuring what you set out to measure can be a pretty cool experience in and of itself. Trust me: the survey designers want educators to do this!

Honest teachers will accept that all they really want and need to do with educational research is improve their own teaching, and that the easiest way to improve is to simply ask students how well this or that intervention worked. My advice here is very simple: don’t neglect factors that could significantly affect the results of an intervention survey, such as the student’s performance on said intervention. Strive for completeness, but avoid irrelevant questions or bringing things out of left field in the middle of a survey (“I will be holding a surprise pop quiz tomorrow. How much do you know about the Wittig reaction?”). In the broader context of the learning goals, planned activities, and assessment methods of a course, survey instruments can serve as valuable formative (pre-, mid-, or post-semester!) probing tools…but find an existing, adaptable framework that works! Here are some of my favorites from chem ed:

Chemistry Self-concept Inventory
Student Assessment of Learning Gains
Groundwater Pollution Survey – If this one seems out of place, check out their “exploratory factor analysis.” This is one of the better papers I’ve seen that actually describes the statistics behind instrument validity. The content is a little esoteric, but hey, they’re asking for what they want to know! Kudos!