I must say I am not aware of the Kiwimeter, and I am generally pretty unhappy with the way in which the media are using so-called surveys (particularly those that are just write-in and online). But there is a problem if ethics committees start saying that you cannot ask questions about controversial and/or sensitive issues. You immediately get a whole lot of subjects taken off the agenda, and of course a respondent can always refuse to answer a particular question (which they often do) or refuse the survey altogether (which over 50 per cent routinely do). This is not necessarily to defend the methodology used here.

But to give you an example of a very useful item in a sensitive area: a reputable and widely cited polling organisation in the US (or maybe the UK; sorry to be vague, I cannot recall which one) has over the years asked a question along the lines of whether the respondent would be happy if their daughter/son married a person of another race (or maybe they specified black). On the face of it, that is racist. However, over the years this polling organisation has shown that public attitudes have changed, and that an increasing proportion of people asked this question say they would be quite happy for their son/daughter to marry a person of a different race. Now that tells you something about changing social norms, and is a thoroughly reassuring statistic about declining racial intolerance. Worth knowing. And we would not want a busy-body ethics committee disallowing a question of this kind, or requiring the interviewers to put up all sorts of caveats and spoiler alerts for their respondents.

None of this is to defend poor pilot testing, lack of cognitive interviewing, or shoddy survey practice, of course!
Stephanie Rogers, a respondent to the precursor survey, has written an extremely revealing post about where the questions came from – and what questions were picked up from the precursor survey for the ‘Kiwimeter’.
I’m still taking it in, but it looks like something very weird and wrong has gone on here. Sheeeeit.
I think there is a big difference between a university researcher asking these questions, and a national broadcaster doing it essentially for entertainment, without seemingly any safeguards.
…apologises for an entire blog about survey methodology
No apology necessary. Survey geeks represent!
On the cognitive testing/ factor analysis confusion:
Maybe the “special treatment” item does correlate reliably with many other items, all of them mutually measuring some underlying dimension of belief (accepting or rejecting the dogwhistle probably correlates quite highly with social conservatism/liberalism, voting National/Left, etc.). So a statistician looking at the results (possibly without even reading the wording of the items) will see the item as psychometrically valid: it “works” in the sense that it helps measure something, for most of the respondents in the dataset.
I gather this is where van der Linden is coming from.
However, (i) it is a bad idea to include an item that is ambiguous or stressful to answer, even if that item scores as statistically “valid”, as the reaction to that item may affect responses to, or completion of, subsequent items too; and (ii) it is a really bad idea to include an item that is systematically more ambiguous or more stressful for one particular subset of respondents. If that subset is a minority, you won’t be able to notice any problem by looking at overall factor loadings in a pilot survey; you can only identify the problem by cognitive testing (piloting with respondents self-reporting their interpretations and emotional responses to items).
N.B. for this purpose the pilot group doesn't have to be representative, but it does have to be diverse; a stratified sample (i.e. with specified quotas for groups whose responses will be compared) would be better than a random sample. A self-selected sample is the worst of all possible starting points for a pilot of a survey intended for the general population.
After the full survey is performed, if you then check the factor loadings of each item for each subset group separately, you might notice differences in completion rates, and/or in factor structure. But by then, it’s too late: your results are already suspect, and you’ve damaged your credibility among the survey takers.
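To make the subgroup problem concrete, here is a minimal sketch, using simulated data only (the group sizes, item counts, and noise levels are all invented), of how an item can look statistically fine in the pooled pilot results while being pure noise for a minority subgroup. It uses the item-rest correlation as a simple stand-in for a factor loading:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated 1-item-per-column responses: a latent attitude drives all
# items for the majority group, but one item is ambiguous for the
# minority group, so their answers to it are mostly noise.
# All numbers here are invented for illustration.
n_major, n_minor, n_items = 900, 100, 5

def simulate(n, ambiguous_item=None):
    latent = rng.normal(size=(n, 1))
    resp = latent + rng.normal(scale=0.7, size=(n, n_items))
    if ambiguous_item is not None:
        resp[:, ambiguous_item] = rng.normal(size=n)  # noise, not signal
    return resp

major = simulate(n_major)
minor = simulate(n_minor, ambiguous_item=2)
pooled = np.vstack([major, minor])

def item_rest_corr(data, item):
    """Correlation of one item with the sum of the remaining items."""
    rest = np.delete(data, item, axis=1).sum(axis=1)
    return np.corrcoef(data[:, item], rest)[0, 1]

# The pooled "validity" of item 3 looks fine; within the minority
# subgroup it collapses to roughly zero.
print("pooled:  ", round(item_rest_corr(pooled, 2), 2))
print("minority:", round(item_rest_corr(minor, 2), 2))
```

The pooled figure looks healthy only because the majority dominates the sample; the problem becomes visible only when the subgroup is checked separately, which is exactly why pilot diversity matters.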
It would be interesting to know how many people failed to complete the online survey after encountering that question...
a question along the lines of whether the respondent would be happy if their daughter/son married a person of another race (or maybe they specified black). On the face of it, that is racist.
Perhaps. Seems to me that “How would you feel if your daughter/son married someone of another race” is a question which gauges racism, while “How would you feel if your daughter/son married someone who wasn’t a real American/Briton” is a racist question, and comes much closer to the insidious effect of “special treatment”.
ETA not that I can say how a POC might feel when confronted with either form, but this would be my starting assumption.
I’m still taking it in, but it looks like something very weird and wrong has gone on here.
On a preliminary look, I’d say they took the data from the bigger set of questions (used factor analysis, as Cliff from Vox suggested) and worked out the smallest number of questions they could ask that yielded the same information.
While this process is designed to decrease the boredom factor and make the scale shorter, removing the positively worded questions could have changed people’s reactions to the scale e.g.:
- It should be compulsory to teach Māori language in school.
- The government should compensate Māori for past injustices.
- Māori have fewer opportunities in life than do other New Zealanders.
From a statistical “information” point of view, these questions are probably strongly (negatively) associated with the punch in the guts question, so they were removed to make it shorter. However, without those questions, the context for the questions changes.
On a related nerdy point, “Māori should not receive any special treatment.” is almost certainly reverse scored (and certainly the reverse of the other examples I’ve given above that were removed), and there is a trend away from using reverse scored items.
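For anyone unfamiliar with reverse scoring, it is just a relabelling of the response scale so that agreement with a negatively worded item counts the same way as disagreement with a positively worded one. A minimal sketch on a standard 1–5 Likert scale (responses invented for illustration):

```python
# Reverse-scoring on a 1-5 Likert scale: "strongly disagree" with a
# negatively worded item ("Maori should not receive any special
# treatment") is meant to count like "strongly agree" with its
# positively worded counterparts. Responses here are invented.

def reverse_score(response, scale_min=1, scale_max=5):
    """Map 1<->5, 2<->4, 3<->3 on a standard Likert scale."""
    return scale_max + scale_min - response

responses = [1, 2, 3, 4, 5]
print([reverse_score(r) for r in responses])  # [5, 4, 3, 2, 1]
```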
it is a really bad idea to include an item that is systematically more ambiguous or more stressful for one particular subset of respondents.... identify the problem by cognitive testing
I now have a vision of a lab with a white-coated researcher holding a clipboard while someone sits in front of a computer filling out the survey. And every time the subject says "oh, f*** you" the researcher ticks a box. I know the power relationships make it unlikely that that would actually happen, and I fear that a subject who didn't complete the survey might not be counted, but I can easily imagine swearing at the researcher in that situation.
The folks involved in the New Zealand Attitudes and Values study (me included) have written this open letter about the measurement of racism and prejudice.
This is not about the Kiwimeter survey. These issues come up from time to time on other research projects, so we thought it would be useful to release this position statement.
Don't know about the UK, but the US one is probably Gallup's "Do you approve or disapprove of marriage between blacks and whites?", asked since 1959.
That question has been changed over time from referring to "white and colored people" and more recently (1968-1978) "whites and non-whites"; I'm guessing because of these very issues.
The whole purpose of a pilot is to improve the survey design so as to remove item ambiguity, maximise completion rates, and maximise interpretability of the results. (“Making the test shorter” is not an aim in itself: it is only useful if it leads to higher completion rates, without harming interpretability.) So those are things you definitely need to measure for the pilot: non-completion has to be recorded.
Serious ambiguity should first be eliminated by testing items on small focus groups, who are also interviewed to check their understanding of and reactions to each item and response option. Then a second stage pilots the entire draft survey by asking respondents to complete a comment log as they fill out the survey (possibly also with follow-up interviews). This can be used to address problems in item sequence and timing, as well as remaining ambiguities and difficulties in the individual items. All of this is supposed to happen before doing any testing on the general public.
But, for “Kiwimeter”, the results of the public “pilot” survey were instead used only to narrow the range of items to those most strongly characterising the groups of respondents (“clusters”) with similar response patterns in the pilot results. Removing item redundancy in this way reduces survey length, but it also reduces item interpretability, as you lose the ability to crosscheck item responses to confirm the meaning inferred by the respondent. Which is important because, with an increased diversity of respondents, there is a risk that some items retained will fail for some groups underrepresented in the pilot stages – and so the overall results will be less robust.
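A toy sketch of the item-thinning step being described (all data is simulated, and the real Kiwimeter procedure is not public, so this only illustrates the general idea of keeping the items that load most strongly on a first principal component):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy version of item thinning: keep only the k items that load most
# strongly on the first principal component of the pilot responses.
# Everything here is simulated; the point is that the selection is
# driven entirely by the pilot sample's correlation structure.
n, n_items, k = 500, 10, 4
latent = rng.normal(size=(n, 1))
strength = rng.uniform(0.3, 1.0, size=n_items)  # how much each item tracks the latent attitude
data = latent * strength + rng.normal(scale=0.8, size=(n, n_items))

# First principal component of the standardised pilot responses
z = (data - data.mean(0)) / data.std(0)
_, _, vt = np.linalg.svd(z, full_matrices=False)
loadings = vt[0]

# Keep the k items with the largest absolute loadings
keep = np.argsort(np.abs(loadings))[::-1][:k]
print("items kept:", sorted(keep.tolist()))
```

Because the selection depends entirely on the pilot sample's correlation structure, items that work differently for groups underrepresented in the pilot get thinned on the majority's terms.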
As a fellow but unrelated van der Linden, the 'v's and 'd's should be lower case.
It’s a standard step that credible survey research organisations build into the development phase. And it is not complicated or expensive shit to do.
I’ve spent many years in the market research biz. Still do occasionally. There’s no way cognitive testing is standard in market research (which 10:1 is how TVNZ would have approached this mess as a client). There is a little bit of piloting but seldom for respondent welfare; more for flow logic etc.
Was cost a reason? Sure. Was it expensive? Enough to slow down a competitive pitch budget. It just wasn’t on the radar. I wouldn’t take that as an indictment of the industry. It’s just not a problem… unless someone tries to pull something as monumentally misguided (i.e. loaded) as Kiwimeter.
Fair assumption that TVNZ is probably not an experienced social research client. Which matters because in social research the ethical stakes are usually higher, because the topics are usually more provocative.
I am doing a lecture on Ethics in Research in my Media Research course tomorrow and will certainly use this.
There are a lot of assumptions in your post.
1. Kiwimeter isn’t market research.
2. Cognitive testing is not standard in market research, but it is fairly common in robust social research.
3. Cognitive testing is not necessarily a suitable technique for constructing these sorts of psychometric measures (and psychometric measures are not typically used in market research).
4. You’re assuming TVNZ is the client. I have no knowledge at all of the arrangement between TVNZ, the academics at Auckland Uni, and the Vote Compass folk, but I’m guessing (assuming) there’s some sort of partnership arrangement rather than the typical client-supplier arrangement that you might see in market research. It seems they each get something out of this. TVNZ get news coverage and the academics get a lot of data and findings they can publish in academic journals (see, for example, Vote Compass in the 2014 NZ election: Hearing the voice of New Zealanders, in Political Science, Volume 67).
my own mean instincts
They're not mean. They're definitely well above average.
I'm imagining TVNZ's advertisers are a beneficiary of this study. Psychometric targeting etc.
And every time the subject says “oh, f*** you” the researcher ticks a box.
so that's what f-score means :)
Is Kiwimeter “research” at all?
Results of an online self-selected survey (which may be expected to be biased towards middle-class Pakeha) show, at best, groupings of beliefs that co-occur in subsets of that dominant set of respondents.
Use of factor analysis based on that dataset will further consolidate dimensions based on Pakeha opinion, while combinations of views representing minorities will be lost in the noise. In case it’s not obvious: this step of the analysis assumes that the population has one consistent factor structure for its beliefs, and that groups of respondents merely differ quantitatively on those factors. If minority groups have qualitatively different belief structures, that is not allowed to emerge: instead, what is constructed is a factor structure for the dominant group. (This remains true even if responses for underrepresented minorities are weighted to more accurately reflect their share of the overall population.)
Clusters are constructed from scores on the overall dimensions, so retain and reify the bias towards the dominant factor structure.
Subsequent thinning of items for the final survey further reduces robustness of results for minorities.
Hence extrapolation of findings to “New Zealand” as a whole is dubious in the extreme.
Media organisations will want a survey with results reflecting their target disposable-income demographic, and biases in that direction are fine with them.
University researchers, however, should know better.
Not unrelated: methodological and ethical problems in a survey carried out by the Island Bay Residents Association on the Island Bay cycleway.
Fair enough! What happened here? I have no freakin’ idea.
TVNZ is subject to the Official Information Act. Perhaps there's room to find out some of the detail.
I’d say they took the data from the bigger set of questions, (used factor analysis as Cliff from Vox suggested) and worked out what the smallest number of questions they could ask was that yielded the same information. While this process is designed to decrease the boredom factor and make the scale shorter, removing the positively worded questions could have changed people’s reactions to the scale
Exactly. And from Andrew’s NZAVS statement (PDF):
"you need statements that are worded in each direction, for example,
‘People from group X cannot be trusted’ and
‘People from group X are trustworthy’.
This is to control for something called agreement bias, where people may tend to agree with things a bit more than they disagree. Some people may read the statements in the negatively worded direction ‘People from group X cannot be trusted’ and argue that the statement itself is offensive and hence the survey itself is racist"
Getting your ruler straight is a lot more complex than just having statements worded in both directions.
What we and other researchers do is use a series of statistical models to identify the statements that best fit together to measure an attitude, such as negative or positive attitudes toward a particular ethnic group. We use these techniques to identify a set of statements that when all used in the same survey, fit together to provide a good measure of the underlying attitude.
At the same time we also work hard to develop scales that use ‘natural language’; we endeavour to express things in ways that people in New Zealand talk about them. Some of our statements are adapted from interviews with people in New Zealand, and some others are adapted from political speeches, blogs, etc.
The last sentence worries me, given the increasingly spun and framed nature of political communications. Last thing we-the-public need is for that to be injected into supposedly-reliable research to justify the next merry-go-round of distortion.
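For what it’s worth, the balanced-wording control quoted above can be sketched in a few lines (item wordings and responses invented for illustration): with a positively and a negatively worded item scored together, an acquiescent respondent who agrees with both contradictory statements lands at the scale midpoint rather than at an extreme.

```python
# A balanced pair of items partially cancels agreement bias.
# Scale: 1 = strongly disagree ... 5 = strongly agree.
# Item wordings and responses are invented for illustration.

def reverse(r, lo=1, hi=5):
    """Reverse-score a response on a lo-hi Likert scale."""
    return lo + hi - r

def trust_score(agree_trustworthy, agree_not_trusted):
    """Mean of the positive item and the reverse-scored negative item."""
    return (agree_trustworthy + reverse(agree_not_trusted)) / 2

# Genuinely positive respondent: agrees with "People from group X are
# trustworthy", disagrees with "People from group X cannot be trusted".
print(trust_score(5, 1))  # 5.0

# Acquiescent respondent: agrees with both contradictory statements,
# so the bias partly cancels and they land at the midpoint.
print(trust_score(5, 5))  # 3.0
```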
So this is like a mathemagical technomological version of Paul Henry on talkback?
But ... There is a problem if ethics committees start saying that you cannot ask questions about controversial and/or sensitive issues. You immediately get a whole lot of subjects taken off the agenda
I think it's premature to start suggesting that type of research couldn't take place under more ethical processes, though. Surely part of what an ethics committee will be considering is what's being learned from the research compared with the potential for harm, as well as measures taken to mitigate that potential harm.
Considering the "Maori special treatment" question, what is being learned from it?
One class of respondents will say they agree. Some may just respond and be unaffected. Another class of respondents will be offended and antagonised by the question, because they find it impossible to answer clearly without compromising the integrity of what they think. (We know this because it's happened.)
Is the intent to find out what people think? If so, the question's unlikely to produce meaningful information, because allowable answers don't fit what people want to say. At the very least if people answer honestly, it's clear to them that their answer is likely to be misinterpreted, hence the frustration.
Or, is the intent to find out how people react to being asked the question? If so, why? How is that reaction being measured to record useful data, if at all? How is the potential for harm to the subjects being mitigated as part of the research process?
In the end, it's almost certainly a poorly worded question which reveals little to no useful information, and deeply offends many people in the process of asking it. Reporting results of this survey as "news" just increases the likely hurt.