Problems with interpretations of ordinal interactions when the underlying scale is unknown


In this blog post I will look in particular at the scaling problems that arise in the design of computer algorithms that determine at what level the participant is training. These scaling problems are not trivial, since they strongly affect the conclusions that can be drawn from data. I will show that, at the moment, conclusions drawn from ordinal interactions should not be taken seriously as long as the underlying measurement scale is unknown. It should be in the interest of the psychology profession to be wary of research that not only plays upon parents' good will and desire to give their child the best possible start in life, but also makes a profit from these parents. Hence, psychologists should have great ethical concerns about profiting from parents' good will without being certain of actually delivering on their expectations. In the long term, failing to properly manage the public's high esteem for science will not damage the charlatans profiting from concerned parents, but those few researchers who actually conduct proper science. Once the legacy built by hundreds of years of research in the physical sciences and engineering is depleted, society will be in a dangerous position where truth and fact are regarded as relative. It is therefore the moral obligation of scientists to strike down fraudulent research aimed at the public, wherever and however it is framed.



Many experiments in applied psychology aim to examine the effect of some intervention on participants who differ in some respect, such as a diagnosis, brain lesion, genotype or personality. Unfortunately, the whole prospect of interpreting the results from such a study is compromised by the non-random assignment of participants: pre-existing differences between the groups confound the interpretation of the data. However, even when participants are randomly assigned, interpretation of the results is not trivial unless the exact nature of the psychological scale is known.

In order not to compromise pedagogical clarity with the many technical issues that can be raised when discussing working memory training, I will start with a classical example from educational science that is free from the emotional attachment people may have to the idea that working memory can be improved through training. The idea is that, by taking an example from another but similar field, the critique by analogy should be much clearer. Starting from this simple example, I will then discuss the problems of working memory training in more detail.

The scale problem

As an example from educational research, we'll take Winter & McClelland's (1978) study, in which they developed a measure of how well students could analyze and articulate complex concepts; the test is a form of thematic apperception test. Winter & McClelland used their measure in three different schools ("Ivy League", "Teachers College" and "Community College"), in both the freshman year and the senior year. As is common in most psychological research, Winter & McClelland (1978) only took the raw difference score between the freshman measurement and the senior-year measurement. The results can be seen in Figure 1.

Figure 1. Thematic analysis scores of freshmen and seniors at three colleges, adapted from Winter & McClelland (1978)

Winter & McClelland interpret their results to mean that:

“liberal education of Ivy College improved the ability to form and articulate concepts, sharpened the accuracy of concepts, and tended to fuse these two component skills together”
(Winter & McClelland, 1978)

This may be a nice conclusion, especially to faculty members of prestigious Ivy League institutions (McClelland just happened to be on the faculty of Harvard University at the time the article was published). Students not enrolled in expensive Ivy League schools may conclude that all the results are saying is that "the rich are getting richer". However, neither of these conclusions can actually be drawn from the data. It may come as a surprise, but in fact one cannot even be certain that the liberal arts students of the Ivy League school learn faster than those at the other schools, even though their results in senior year are higher. Why this is so I will explain very soon.

But before going into the details, we need to take a small detour and go over some basic concepts of Analysis of Variance (ANOVA). In Figure 2 we can see examples of (A) two significant main effects (School and Time) but no interaction effect (School x Time); (B) two significant main effects as well as a significant interaction (School x Time), referred to as an ordinal interaction, in contrast to (C), a disordinal interaction effect where no significant main effects are present. The bottom line for this blog post is that there is really no way to interpret an ordinal interaction without additional experiments. I will illustrate this using Winter & McClelland's study as an example.

Figure 2. (A) Two significant main effects (School and Time) but no interaction effect (School x Time). (B) Two significant main effects as well as a significant interaction (School x Time); this is referred to as an ordinal interaction. (C) No main effects, only an interaction effect; this is referred to as a disordinal interaction effect and is evidence of non-linearities in the measurement structure.
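The three patterns in Figure 2 can also be told apart mechanically from a 2 x 2 table of cell means. The following is only a minimal sketch: the function name `classify_interaction` and all cell means are hypothetical, and the check works on means alone, so it says nothing about statistical significance. It calls an interaction ordinal when the groups keep the same rank order at both time points and both groups change in the same direction, and disordinal otherwise.

```python
def classify_interaction(a_pre, a_post, b_pre, b_post, tol=1e-9):
    """Classify a 2 x 2 table of cell means (group A/B by time pre/post).

    Hypothetical helper: it inspects the means only and ignores sampling error.
    """
    # Interaction contrast: the difference between the two groups' gains.
    contrast = (a_post - a_pre) - (b_post - b_pre)
    if abs(contrast) <= tol:
        return "none"
    # Is the ordering of the groups the same at both time points?
    same_group_order = (a_pre - b_pre) * (a_post - b_post) > 0
    # Is the direction of change the same in both groups?
    same_time_order = (a_post - a_pre) * (b_post - b_pre) > 0
    return "ordinal" if same_group_order and same_time_order else "disordinal"


# Hypothetical cell means, loosely following panels A, B and C of Figure 2.
panel_a = classify_interaction(50, 70, 40, 60)  # parallel lines
panel_b = classify_interaction(50, 80, 40, 50)  # diverging lines, no crossing
panel_c = classify_interaction(40, 60, 60, 40)  # crossing lines
print(panel_a, panel_b, panel_c)
```

Only the pattern in panel B depends on how far apart the scale places equal steps; the crossing in panel C survives any order-preserving rescaling.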

Imagine that we replicated the experiment of Winter & McClelland (1978), but instead of using a thematic apperception test to measure academic ability we used a traditional test of mathematical reasoning. We measure performance on the test in the three different schools as a function of year of study and obtain the following results:

Figure 3. Results from a hypothetical replication using a test of mathematical reasoning

Again, the Ivy League faculty attribute the steeper progress among the Ivy League students to their own teaching skills, and the non-Ivy League students attribute it to pre-existing differences among students that are magnified. However, what if we also included a test of reading comprehension in the test battery and obtained the following results:

Figure 4. Results from a hypothetical replication using a test of reading comprehension. Most people will want to point out that there is a "clear ceiling effect" in these data. But what makes them claim ceiling effects for these scores, when they would not claim that the scores obtained in mathematical reasoning (Figure 3) are an artifact of "floor effects"?

Here it looks like the Teachers College and the Community College show a larger increase from baseline than the Ivy League college. As you present these data to the faculty of the Ivy League college, a faculty member who thinks he is a real whiz when it comes to statistics raises his hand and points out: "You obviously have a ceiling effect in your data; if the scale were able to measure beyond 100% correct, you would probably see a much higher increase among the Ivy League students". This sort of reasoning is very strange, and in fact it says more about the preconceived ideas and expectations of the person who claims that these results indicate "ceiling effects" than about his statistical skills.

In fact, the ceiling-effect argument is made in one of the few Randomized Controlled Trials (RCTs) on adaptive cognitive training for children with ADHD problems (Klingberg, et al. 2004). The investigators found training-related transfer to Spanboard but not to Raven's complex matrices or the Stroop task. It was argued that the absence of transfer for Raven and Stroop could be explained by ceiling effects:

"The ability to detect the remaining significant effect for the Raven’s task and for accuracy in the Stroop task was limited by the ceiling effects, which were more pronounced for the treatment group."

(Klingberg, et al. 2004)

However, nowhere in the paper is it mentioned that the significant transfer effects they did obtain could just as well be explained by a limited range (i.e. floor effects or inhomogeneous variance). This sword cuts both ways. To be fair, it should be clear that this problem is in no way specific to this paper, or even to the WM-training literature; it applies to most studies that use psychological measurement where the underlying structure is unknown.

The problem is that most applied research in psychology sets out to measure things with the naive assumption that the measurement structure under examination has properties as simple as those of, for example, mass, length and duration. However, psychological properties have a much richer structure than the ones in physics, which is why psychological research is very difficult to do.

Now, using our hypothetical example, let us see just why floor and ceiling effects are not trivial unless the exact structure of the measurement is known. In our hypothetical example with mathematical reasoning and reading comprehension, both results can be explained by a single underlying monotonic logistic function, with two free parameters, that relates the probability of a correct item response to time in school.


The logistic function fitted to the hypothetical data can be seen in the top panels of the following figure. As can be seen, the groups differ only with respect to the intercept and not in slope; that is, their gain from education is similar, and the groups differ only at baseline. This logistic function figures in the psychometric literature as the Guttman scale. According to Guttman, the probability of getting a correct response on an item is a function of the participant's ability and the item's difficulty. When the ability is greater than the item's difficulty, the participant is able to get a correct response above guessing levels.

The interpretation according to Guttman's scale is that the students at the different schools differ in their initial ability (which of course is increased during college) and the mathematical test is more difficult than the reading comprehension, which is not surprising given that the students are liberal arts majors.
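This interpretation can be checked numerically. The sketch below assumes a logistic response function whose argument is ability minus difficulty; that parametrization, the function name `p_correct` and every number in it are assumptions chosen for illustration, not values fitted to any data. Every school receives the same latent gain, yet the hard test makes the Ivy League students appear to gain most and the easy test makes them appear to gain least.

```python
import math


def p_correct(ability, difficulty):
    """Logistic item response: probability of a correct answer (assumed form)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Hypothetical values on the latent (logit) scale; none come from real data.
freshman_ability = {"Community": -1.0, "Teachers": 0.0, "Ivy": 1.0}
GAIN = 1.5  # identical latent gain from freshman to senior year for every school
tests = {"math (hard)": 3.0, "reading (easy)": -2.0}  # item difficulties

raw_gain = {}
for test, difficulty in tests.items():
    for school, ability in freshman_ability.items():
        pre = p_correct(ability, difficulty)
        post = p_correct(ability + GAIN, difficulty)
        raw_gain[(test, school)] = post - pre
        print(f"{test:15s} {school:10s} pre={pre:.2f} post={post:.2f} "
              f"gain={post - pre:.2f}")
```

The direction of the apparent "interaction" is decided entirely by where the test's difficulty places the groups on the curve, not by any difference in how much they learned.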

Now the problem with interpreting difference scores between baseline and post-treatment is that the scale is not at an interval or ratio level. For example, conclusions based on temperature measurements obtained on a Celsius scale are equally valid when the data are transformed to the Fahrenheit scale, because that transformation is linear. However, consider what happens if we apply a logit transform to our hypothetical data:

We can clearly see that the difference in learning rate across school years between the colleges is completely gone after the logit transform. The groups only seem to differ in the intercept, which rather reflects differences already present before enrollment in the freshman year. The conclusion that the groups differ in their learning rate is not invariant under a monotonic transformation. Therefore the data structure cannot be on an interval scale, and conclusions drawn from difference scores, such as the difference before and after treatment, cannot be rendered meaningful without some necessary steps to obtain a well-behaved psychological scale.
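The contrast between a linear rescaling and a merely monotonic one can be made concrete with a few lines of code. This is a minimal sketch under stated assumptions: the proportions are hypothetical and generated from a logistic curve with equal latent gains, and `fahrenheit_like` is just an illustrative affine map borrowed from the temperature example.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p / (1.0 - p))


def fahrenheit_like(x):
    """An affine rescaling, the kind of change an interval scale permits."""
    return 1.8 * x + 32.0


# Hypothetical proportions correct (freshman, senior); both groups gain
# 1.5 units on the latent scale but start from different baselines.
scores = {
    "Community": (logistic(-4.0), logistic(-2.5)),
    "Ivy": (logistic(-2.0), logistic(-0.5)),
}


def gain(group, transform=lambda p: p):
    """Senior minus freshman score after applying a transform to both."""
    pre, post = scores[group]
    return transform(post) - transform(pre)


for name, transform in [("raw", lambda p: p),
                        ("affine", fahrenheit_like),
                        ("logit", logit)]:
    ivy, community = gain("Ivy", transform), gain("Community", transform)
    print(f"{name:7s} Ivy gain={ivy:.3f}  Community gain={community:.3f}")
```

Under the affine map the Ivy gain stays larger than the Community gain, as it is on the raw scale; under the logit map the two gains coincide, so the "difference in learning rate" is a property of the scale rather than of the students.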

Bottom line: be very careful about interpreting ordinal interactions, or any effect that can go away with a monotonic transformation, and look for disordinal interactions or effects that remain even after extreme but monotonic transformations.

Factorial designs and measurement inequalities

A common technique for obtaining measurements within the physical sciences is an experimental design referred to as constructive design, which is also quite common in psychophysics. In a constructive design one starts by establishing standard sequences. For a conjoint measurement with two factors and (e.g. in the case of learning , and ) one starts by choosing some origin, , and some unit then one searches for a value which solves the equation (notice that we use the similarity relation, ∼, rather than the equality relation, =, since we make a stark contrast between empirical relations and numerical relations, and = belongs to relations between numbers). When is found one can then continue to search for that satisfies and for that satisfies . Note that can be obtained by two solutions and . The two solutions must therefore coincide, and consequently this provides an empirical test of the validity of the psychological scale.

Figure 3. Otto Ludwig Hölder (1859 – 1937), one of the greatest mathematicians of the last century. Famous for the Hölder inequality, which holds between integrals and is a vital tool for the study of Lp spaces, and through which the Uncertainty Principle of signal processing can be derived. Hölder was also famous for his contribution to the Jordan-Hölder theorem in group theory, which states that any two composition series of a given group are equivalent. The Jordan-Hölder theorem is indispensable in the study of difference measurement in psychology and differential equations in general. Given that so many psychologists are interested in obtaining difference measurements, such as before versus after treatment, surprisingly few psychologists are familiar with Hölder's work.

Nevertheless, for practical reasons constructive designs cannot be performed in most social sciences because of biological and social variability between subjects: a different scale construction is required for every participant. A second reason is that the construction requires a great deal of care, since errors in measurement could easily cause the two solutions to fail to coincide. Errors are also compounded by systematic biases such as time and order effects. In learning experiments this is trivially the case, since we cannot brainwash the subject and redo the learning experiment with a different set of parameters.

For these reasons factorial designs are much more common, and most methods (e.g. ANOVA) have been developed for these types of designs. But how can we be sure that a psychological scale really exists when we are using a factorial design? Of course one can test necessary axioms such as independence (i.e. that the order obtained on one factor is independent of the level of the other factor; for example, 2 weeks of training is more than 1 week of training regardless of what performance is obtained, since that is the other factor), and failure of any of these would exclude the model. However, testing existential axioms such as the solvability axiom would be quite difficult. The solvability axiom tells us that the levels within factors can actually

Even if solvability is assumed to be true for the underlying data-generation process for a particular data set, it could very well be the case that a measurement scale cannot be obtained even though a testable and necessary axiom such as independence is true. To see that this is the case, consider a hypothetical example from learning theory. Consider a three-factor experiment (3 x 3 x 2) where we study the impact of previous experience (H), motivational drive (D), and incentive (K) on a measure of performance. The performance measure is simply ordered between the experimental groups, so that 1 is the group that learnt slowest and 18 is the group that learnt fastest.

For the first level of incentive we obtain the following order:

Likewise, for the second level of incentive (K) we obtain:

Hence when we pool data across one factor as we would do when we examine the effect of incentive (K) and previous experience (H) only we get:

It is easy to see from the above table that these data do not violate the independence axioms. For example, additive independence tells us that for all a, b, p, q, where a, b are elements of A and p, q are elements of P, it should be the case that if one empirically observes that ap has better or equal performance compared to bq (i.e. ) then the numerical representation should follow .

This is perhaps most easily inspected visually:

Figure 4. Demonstration that independence and solvability in the data generation process does not necessarily lead to a data set that can be represented.

We may also see in the above graph that the measurement structure in fact does not obey the condition of double cancellation. A psychological scale is said to obey double cancellation if and and if independence and solvability are both true it follows that must be the case. However, as is clear from the data from the example, and clearly indicated by the vectors in Figure 6, the last inequality is reversed from that predicted. We already saw that independence was true; hence it must be the case that solvability is not true for these data.

Figure 5. Demonstration that independence and solvability in the data generation process does not necessarily lead to a data set that can be represented.
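Both conditions can also be checked by brute force on a table of ranks. The sketch below uses a hypothetical two-factor 3 x 3 table, not the three-factor data of the example above, and it assumes the usual textbook statement of double cancellation: if (a, x) is at least (f, q) and (f, p) is at least (b, x), then (a, p) must be at least (b, q). The function names and both tables are illustrative assumptions.

```python
from itertools import product


def is_independent(m):
    """True if the order of rows is the same in every column and vice versa."""
    rows, cols = range(len(m)), range(len(m[0]))
    pairs = [(a, b, p, q)
             for a, b in product(rows, repeat=2)
             for p, q in product(cols, repeat=2)]
    row_ok = all((m[a][p] >= m[b][p]) == (m[a][q] >= m[b][q])
                 for a, b, p, q in pairs)
    col_ok = all((m[a][p] >= m[a][q]) == (m[b][p] >= m[b][q])
                 for a, b, p, q in pairs)
    return row_ok and col_ok


def double_cancellation_violations(m):
    """List every (a, b, f, p, q, x) where both premises hold but the
    conclusion (a, p) >= (b, q) fails."""
    rows, cols = range(len(m)), range(len(m[0]))
    violations = []
    for a, b, f in product(rows, repeat=3):
        for p, q, x in product(cols, repeat=3):
            premises = m[a][x] >= m[f][q] and m[f][p] >= m[b][x]
            if premises and not m[a][p] >= m[b][q]:
                violations.append((a, b, f, p, q, x))
    return violations


# Hypothetical rank tables (rows: levels of one factor, columns: the other).
additive = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]      # rank = 3 * row + col + 1
non_additive = [[1, 2, 6], [3, 4, 7], [5, 8, 9]]  # monotone, but not additive

for name, table in [("additive", additive), ("non-additive", non_additive)]:
    print(name, is_independent(table), len(double_cancellation_violations(table)))
```

Both tables pass the independence check, but only the first is free of double-cancellation violations, which is the situation described above: independence alone does not guarantee that an additive representation exists.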

So if we still want an additive representation, how can we obtain one without a solvability axiom? This section is already getting too technical, and I will return to the topic in a future blog post. The solution is to construct either a finite linear structure or a polynomial structure (interested readers are referred to Krantz et al. 1978, Chapter 6).

Applications to ADHD and WM-training


Winter, D.G. and McClelland, D.C. (1978). “Thematic analysis: An empirically derived measure of the effects of liberal arts education.” Journal of Educational Psychology, 70, pp. 8-16.

