Speaker by Various Artists


‘Kiwimeter’ is a methodological car crash and I still can’t look away

by Tze Ming Mok

At the level of survey methodology, it just looks worse the more I find out. Nerd rage to follow, but briefly up front: increasingly, I think my original disagreement with the Human Rights Commission is more semantic than substantive.

I still believe it’s very important to test responses to that specific bellwether statement about Maori receiving so-called ‘special treatment’ in surveys, even if the ‘agree/disagree’ answer options don’t work properly in this one. (I’ve noted elsewhere that a better scale would be ‘how acceptable or unacceptable do you find this statement?’)

But because the overall effect of ‘Kiwimeter’ is to cause emotional distress and feelings of marginalisation for Maori, even if through incompetence, it’s ultimately a racist effect. And as we know from our Human Rights Act harassment definitions, intent doesn’t matter; effect does. I’d like to thank folks on Twitter for sharing their experiences with me on this.

Marama Fox is also right: in terms of research ethics, there was an ethical requirement to assess whether the survey would do ‘harm’ to respondents. I think there is good evidence that it has caused emotional distress and harm (one Twitter commenter described seeing that question as a “punch to the puku”). Those responsible seemingly did NOT do their due diligence to assess whether there was a risk of this. The academics who were involved need to take a close look at the part they played.

How do you assess that risk? You test the survey. Did they do this?  Not properly.  This is what I have figured out so far:

Some people have reported taking part in the original survey that developed the archetypes, and we now know it was not a representative random population sample survey. It was a weighted selection from the self-selected group of people who had previously filled out ‘Vote Compass’. From appearances, Vox Labs seems sort of confident that their approach to non-random selection is basically awesome, as if it’s as good as a YouGov panel, and that upweighting or downweighting certain demographics to match their population proportions is, indeed, magic.

[SPOILER: WEIGHTING IS NOT MAGIC AND VOX LABS IS NOT YOUGOV].
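For the non-nerds, here’s the problem in miniature – a toy sketch of demographic weighting, with made-up numbers that are not Vox Labs’ actual figures or method:

```python
# Demographic weighting in miniature -- hypothetical numbers only,
# not Vox Labs' actual figures or method.

# Share of each age group in a self-selected sample vs. the census.
sample_share = {"18-29": 0.10, "30-49": 0.45, "50+": 0.45}
population_share = {"18-29": 0.22, "30-49": 0.35, "50+": 0.43}

# Everyone in a group gets weight = population share / sample share.
weights = {g: population_share[g] / sample_share[g] for g in sample_share}
print(weights)  # approx: {'18-29': 2.2, '30-49': 0.78, '50+': 0.96}

# The catch: weights can only correct for the demographics you measured.
# If the people who opted in differ in attitude from the people who
# didn't -- within the same age group -- no reweighting can fix that.
```

Upweighting the young people you did get doesn’t conjure up the opinions of the young people you didn’t.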

So yes, absolutely no part of this survey even originated as a representative ‘probability sample’ survey. For the moment, let’s leave behind speculation about whether Vox Labs is as good at constructing a representative-ish online panel as YouGov, as we have nothing to go on other than my own mean instincts.

Instead, let’s look at the failure of the survey testing stage and questionnaire development.  These are very big nerd-problems.

Essentially, the original, more carefully-but-not-randomly selected survey was the pilot for Kiwimeter. From folks like Stephen Judd and Stephanie Rogers who remember taking part in that pilot, the questions in the original survey weren’t substantially different from those in the current one, and the question that has attracted all the controversy was exactly the same. So if they got negative feedback at that stage, they didn’t care enough to change it.

The Founder/Director of Vox Labs, Cliff Van Der Linden, has been tweeting somewhat piteously from Canada in defence of the methodology, including a couple of tweets to me so far. I asked him whether there had been cognitive testing carried out on the survey, as this would have prevented the main problems I wrote about earlier.

As you can see from the screen grab, his completely off-topic response about factor analysis (a data-crunching method applied to results) seemed to indicate that he did not understand my question, possibly because he did not know what cognitive testing was.
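For the record, factor analysis looks something like this – a toy sketch on fake data (scikit-learn here, not whatever Vox actually ran) – and note what it operates on: answers you have already collected.

```python
# Toy factor analysis on fake Likert data -- illustrative only,
# not Vox Labs' actual model or data.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
# 500 hypothetical respondents x 8 survey items, answered 1-5.
responses = rng.integers(1, 6, size=(500, 8)).astype(float)

# Extract two latent dimensions underlying the item correlations.
fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(responses)  # (500, 2): respondent scores
print(fa.components_.shape)           # (2, 8): item loadings

# This can tell you which items cluster together after fielding.
# It cannot tell you that an item reads as racist before you field
# it. That's what cognitive testing is for.
```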

NEEERRRRRDRRAAAAAAAGE

I thought I was pretty restrained though, right?

What is cognitive testing? It’s a kind of interviewing technique where people talk about what is going through their minds as they fill out a survey – it would have picked up the ‘this seems racist’ problem immediately.  It’s a standard step that credible survey research organisations build into the development phase. And it is not complicated or expensive shit to do.

Most data nerds, programmers and political scientists don’t need to know what cognitive testing is, and this seems to be Van Der Linden’s background. But any nerd who works in survey research damn well better know. I look at the culture of an online ‘engagement’ outfit like Vox Labs, and I don’t see a depth of knowledge about traditional survey research and its implementation. That’s a problem for credibility on a project whose credibility is already compromised because it’s being carried out by, well, TVNZ.

On his Twitter stream, however, Van Der Linden pleads that Vox was not responsible for questionnaire development, only technical delivery and analysis. I have some sympathy. He states that the questionnaire was developed by a panel of New Zealanders that included academics and Maori – why would Vox have doubted their expertise? Fair enough! What happened here? I have no freakin’ idea.

But not all academics are necessarily going to have a professional survey research background in the nuts and bolts of delivering a questionnaire that works, even if they are great at analysing psychological constructs from data.

If the New Zealand panel did any cognitive piloting, they obviously didn’t sample widely enough. It’s possible that they viewed ‘piloting’ as Vox’s area. But Vox was in Canada: how could it carry out decent qualitative research with New Zealanders? The blame does not lie with the Canadians. This is a very disappointing day for New Zealand academia. The comments so far from those involved have not been illuminating.

When the jobs are portioned out like this – questions here, implementation there – a meaningful on-the-ground pilot to test whether the questions actually worked gets lost. This whole project looks like a classic failure of research expertise and oversight of the whole enterprise, from development to delivery, start to finish. Instead of nose-to-tail dining, we’ve got something half-assed.

I hope at least it was as cheap as it looks.

Tze Ming Mok apologises for an entire blog about survey methodology 
