Office of the Vice President for Global Communications

Wednesday, September 9, 2009

Technology poses new challenges for protecting the integrity of research data

Download a PDF version of the executive summary of the report, “Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the Digital Age.”

As evolving digital technologies increase researchers’ abilities to analyze and share data, they also leave that data more susceptible to distortion. Just last month, a major U.S. university announced an internal investigation after a magazine said it found what appeared to be “duplicated and manipulated images” in a scientist’s published studies, the third time he had been subjected to such a review.

Recognizing the difficult questions being raised by new technologies, the National Academy of Sciences, National Academy of Engineering, and Institute of Medicine recently recommended new principles to guide those who generate, share and maintain scientific data. Provost Teresa Sullivan, who served on the 17-member committee that issued the report, sat down recently with the Record Update to discuss the challenges that researchers face in the digital age.

Record Update: If you could, summarize the issues that prompted the committee report. How has digital technology changed the way that researchers collect and analyze data, and what concerns have grown out of that?

Provost Teresa Sullivan: The basic method that scientists and other researchers use to ensure quality is the peer review system. What’s happened is that technology has outstripped the ability of the peer review system to cope with it. There were some specific cases that were brought to this committee as examples of the problem.  In one there was a digital figure that had been inappropriately altered, and the editor of the journal was unable to detect that the figure had been altered. This altering rather substantially changed the conclusions. That kind of thing leads you back into the issue of how can you peer review that figure unless you can get all the way back into the database and then to the software that was used to analyze data?

A second problem, which was ancillary to this, is the databases are now so large. In astronomy, for example, they can basically take a picture of the stars every night, and you can have billions of pieces of data. Even if we know how to put all these data in some kind of usable format and we can figure out a technique to use them, how long should you store them? There's not a lot of agreement among scientists on how long you're responsible for keeping the data on which you have based a conclusion. Do I have to keep it for five years? Do I have to keep it until I retire? And if I have to keep it until I retire, what would I do if I kept it in, say, the equivalent of Word 6.0 and now we’re at 15.0? What's my responsibility to keep refreshing it?

Or suppose I'm not an information scientist and I don’t really know about this, I'm just an ordinary middle user of the technology, and my expertise is really in geology or biology or some other field. These are issues that we considered.  One of the most important conclusions is that there is a distributed set of responsibilities and a lot of people are going to have to step up to the plate to these responsibilities: the person who does the investigation to start with, their disciplinary association, the editors of the journals in that field, librarians, universities themselves. Everybody is going to have a role to play in this.

The first step is recognizing how much the technology has changed the process we had in place, how it affects the way we need to do our work.

RU:  Computers have been around for a while. Is this an issue that has been ongoing or is it relatively new, and why is it coming up now?

TS: It's relatively new and has come up now because, as with any NRC (National Research Council) report, there were people who were sufficiently concerned about the issues to ask the NRC to look at them. I don’t know if there was any particular precipitating incident; there had been a number of things recently that helped to bring it to the fore. One of them was the access to data produced for NIH (National Institutes of Health) or NSF (National Science Foundation) grants. There are now requirements that the data be shared. This raises issues: How do you share it, when do you share it? What do you do about keeping confidential the names of subjects? There have been some moves in the direction of more open access to data, but the technological barriers to that have not always been addressed. This is an effort to address that.

I think the best we can do with this report is to get more conversation going about this set of issues. Our recommendation was not hard and fast. The recommendation to professional associations is to sit down and think hard about these things. The American Psychological Association has done that, and they came up with a guideline that you should keep your data for five years, which seems reasonable. But why five years and not 10 years?  And if psychology says five years, should history require 100? These are the kinds of questions that need to be considered.

Another concern is that there are special issues for fields like anthropology, because if you make your data available to someone else you may inadvertently compromise the confidentiality you promised your respondents. How do anthropologists deal with that kind of issue?  It's obviously something the American Anthropological Association will have to take up.

RU: Would it be fair to characterize this report, at this stage of the process, as being essentially a consciousness-raising effort, pointing out problems and figuring out where to go?

TS: I think that’s some of it, and I think some of it also is recognizing that different people have responsibility. I'm not sure everybody recognized there is a role for journal editors, there's a role for peer reviewers, and so forth. It's not a matter just of saying, “The person who writes the article needs to do this.” There are a lot of other things to be considered.

RU: Is the concern here that data and analysis will be manipulated, and there need to be some safeguards to prevent this? Or is it more a concern that without these safeguards in place research in general might be considered harder to verify and less credible?

TS: It’s a little bit of both those things. The issue we were presented with about the article in which the figure had been altered — that poses an issue that is really hard to detect. How would you have detected that? That question led to the discussion of what is it reasonable for a peer reviewer to want to look at? Does the peer reviewer have to go and redraw the diagram? That doesn’t seem reasonable. So what are the levels of transparency needed for this?

Anyone who has collected data in original format knows that it's not just the way you collected data that’s important, it's also what you did to it later on. Did you transform variables; did you collapse variables together; did you create indexes, and so on? A person who follows you and wants to replicate your work needs to know how you did all those things. It’s a whole lot more documentation than most researchers are used to doing.

I do not think this commission was trying to say there is a vast amount of cheating and fabrication going on. I don’t think anyone would claim that was the case.  But I think it is the case that the verification is getting harder to do.

RU: You said this report doesn’t put into place hard and fast solutions. It would seem like it’s difficult to do that, given the evolving state of technology.

TS: It is hard to do that and one reason is that different fields face different problems. What should be required in sociology probably doesn’t make sense in physics. That’s another point that needed to be made: the disciplines need to be part of this. They have an important role to play.

RU:  So, what was the general approach that the committee took in addressing its charge with this?

TS: There was a real effort to come up with recommendations that were broadly applicable, even if the specifics with this journal or that journal differed, or if one field required archiving for 10 years and another archiving for five years. There was an effort to try to find some almost generic recommendation that would be valuable to a lot of people. I don’t think we could have actually done much more than that with the group we had assembled.

RU: Essentially does it come down to the human element? You're still putting the trust in the people that are involved, right?

TS: Trust and verify. The problem is, how do we do the verifying? That’s what we haven’t figured out yet.

Here at U-M, we have the Inter-University Consortium for Political and Social Research (ICPSR) at the Institute for Social Research. It is a great archive of social science data. So what a lot of investigators have done, when they finish collecting the data, is to archive it at ICPSR and then they don’t have to worry about: Is it being refreshed? Can other investigators get to it? Other investigators who want to look at those data can go to ICPSR and get the data. This kind of clearinghouse model is one that at least some fields may be able to use.

RU: Are there other things that U-M has been doing specifically to head this problem off?

TS:  In our library we have something called Deep Blue.  It is a very detailed archiving program for research projects and other things. What happens in Deep Blue is that a faculty member can archive not just the paper that got published, but also maybe the longer version that they had to edit down that had more detail in it, and maybe an earlier version of the paper and so on. What they can do is archive a history of the project, which may be valuable to future researchers who come back and say, “I can't reconstruct this variable,” and you say, “Well, you had to go back to the earlier paper to see how we constructed the variable.”

Deep Blue is also valuable for faculty members who are getting ready to retire. It's a way for them to archive their research, at least to the extent the data are electronically available. I think that will be useful in the future when people want to verify earlier research. An anthropologist, for example, can archive some of his or her observations, edited to take out anything identifying the subjects who were promised confidentiality. Those interviews and field notes are a resource that we lose when a faculty member retires, and I’d like to see us make more use of Deep Blue so that we don’t lose that important work.

Another thing we have to do is to think about these issues when we’re training graduate students in research techniques.  We tend to train them to what we consider the state of the art now. We’ve got to anticipate what the state of the art will be a few years out, so that we can make sure students at least understand the issue. They may not understand how it is exactly that you're going to save the work that is in this current version of software, and how you will refresh it to the next version of the software, but at a minimum, we need to make them aware that it is an issue. And of course we do provide considerable training to students. We’re required by NSF and NIH to give training to students in research conduct, but we probably need to look at the content of that from time to time to be sure we’re handling all these new issues.

RU: Do you anticipate that this will lead to any specific procedural changes in the way professors or researchers do things at U-M, going forward?

TS: I'm not sure that it will. I do know that some journals are beginning to change their standards, and so people in the fields that publish in those journals will certainly be affected by them. Some journals are now saying, “We won't publish it unless you make the data available to other authors,” or “We’ll publish the regression equation but you have to file the correlation matrix with us.”  There are all kinds of new rules that people are considering. The conversation is further along in some disciplines than in others.

RU: Does this affect different universities differently? Does a large research university like U-M feel more pressure to deal with these issues?

TS: Yes. One of the reasons is that at U-M we are data collectors. In a lot of fields now, faculty at smaller universities are doing secondary analysis, work in which they make use of data collected by others and do their own analysis of that data. They don’t have the resources to go out and do the big data collection that we and other research universities are able to do. We have clinical trials going on over at the hospital. We’ve got the School of Public Health out collecting original data. At ISR we’re collecting big surveys like the Health and Retirement Study and the Panel Study of Income Dynamics. We are data collectors and so that puts a greater emphasis on our being able to archive the data so that they are usable in the future.

RU: Would that mean a large school like U-M would have to have more of those controls and processes instituted early on in the process?

TS: Probably, and it probably also means we’re going to have to think about it more. I'm pretty sure that if you went over to ISR you would find that they think about this a lot.

RU: We’ve been talking about research at the graduate level and by professors. Does this whole idea have any bearing on the average work that an undergraduate will do? The integrity of where they would go to look for research? Are there things being done at the university to keep them abreast of how to find credible information with everything on the Web?

TS:  I think one of the most important challenges we face with undergraduate students is getting them to ask the question, “How do I know this is true?” The fact that something is on the Internet is not enough. I think in the classes that the library teaches — the library actually teaches quite a few classes to undergraduates — one of the things that they are very careful about is the provenance of data.  How do you know it’s reliable, and how can you tell? It's an issue that I think a lot of people who teach, particularly those who teach freshmen, are trying to hammer hard: just because it's on the Internet doesn’t make it true.

Certainly students in the social science fields here often get exposure to what it is like to collect data, so they begin thinking about these issues for themselves.  And we have a new major in informatics just getting started and I think those students will face a lot of the interesting issues about how you archive data, how much you archive, and so on.

RU:  So going forward, what will be the next step from this report? Is there a next step that has been identified?

TS: The report itself is pretty new. I think there needs to be some discussion of it among some of the relevant scientific communities first. I expect that will be the next step.

RU:  It seems like this is a pretty fundamental issue in academics. When you look at all other things people are dealing with in academics, where does this rate?

TS:  It's pretty high because we’ve trusted the peer review system to tell us what's good work and what's not good work, and this may change the way the peer review system works. Now, peer reviewers, instead of just asking for the paper, may be asking for extracts of data, re-running some of the analysis themselves. I think that’s already happened in a few specialized areas.  That will be a real change in the way we do peer review.