The National Academies recently released a report addressing integrity in scientific research, including the social and behavioral sciences. One of the recommendations is that after publication researchers share with others the data on which an article is based. This supports research transparency and should lead to greater reproducibility of research. I think sharing data is generally a good thing, and I have done so. But I feel that the authors of the report have failed to address some important issues related to such data sharing. This is obviously a topic much broader than the subject of this blog. But at two points in my comments I will give examples that relate directly to things discussed here.
In discussing the recommendation on data sharing, the report points favorably to the policies of some journals requiring that authors make the data for an article available to others on request. But further discussion in the report strongly implies that data should be made available in a repository from which anyone can download it. The difference is significant because in making the data available online, the person(s) who created the data then lose all control over how it might be used.
But first, a simple, practical issue. Making data available for others to use entails a significant amount of work relating to formatting, documentation, and so forth. I am very careful about documenting my data as I do research, but that original documentation is completely meaningful only to me. Just sharing data with co-authors requires some additional effort. Sharing it publicly would require more. I suspect that the majority of datasets from articles would never be used by others. So it is inefficient to put the work in for every dataset to get it into a form in which it can be shared. It makes much more sense to put in this effort when someone makes the request to use the data. At that point, I am happy to do so.
The authors of the report (many from the natural sciences) seem to most often view datasets as the products of experiments, to be reported in a paper, which then is the end of the story. Indeed, they actually see as a problem “the temptation to publish multiple papers on just one experiment or dataset.” (p. 17) They fail to realize that for certain types of research, datasets are developed, often with a great deal of effort, to support the investigation of multiple research questions. Those creating the data have a reasonable expectation of being able to carry out their research without having it preempted by others using their data.
My urban patterns dataset with data on housing units by census tract for 59 large urban areas from 1950 to 2010 is an example of this. I spent at least a year-and-a-half building the dataset. I have a long list of research questions I intend to address using this dataset. The papers currently on the Research page represent just a start. I feel that it is reasonable that I should be able to be the first to use this data to address these questions. I certainly would not have put in the effort I did in creating the dataset only for one or two papers. This does not mean that I would be unwilling to share the data with others before I have completed this program. I’m finished with all of the questions I have intended to address relating to the negative exponential model. If someone wants to do more, I could be willing to share the data. Or if someone wants to combine my data with some other dataset, sharing could be appropriate. But that’s why I believe I need to have control over the sharing.
I was surprised that the authors of the report failed to address reputational risks that could be associated with data sharing (and by this, I am not including risks associated with others finding out about problems with the original research). Putting data on an archive for anyone to use can result in uses that can negatively impact the reputation of the data creator.
The first (and least significant) reputational risk comes from someone taking the data and producing and publishing a very crappy piece of work. While most such efforts are justifiably ignored, occasionally they will achieve notoriety for their sheer absence of quality. Assuming the author of the crappy research appropriately cites the creator of the data, the creator will forever be linked with the work. While everyone should understand that the data creator is not responsible, just being associated would not be the most pleasant thing.
For certain types of data, the reputational risk can be much greater. For example, suppose researchers post data dealing with a social problem that includes information on race. A white supremacist could obtain the data, improperly manipulate it, and falsely claim that the results supported their racist views. And they might well prominently note that the creators of the data were respected researchers at a major university. Such a nightmare scenario is why researchers have a legitimate interest in controlling the sharing of their data.
For researchers working in a field involving contentious positions with extremely strong partisans on both sides, risks can extend to the use of the data by others in that field. Getting back to the subject of this blog, urban sprawl and its effects represent just such a field. A study is published indicating that sprawl or compact cities do or do not have some effect, and those whose position has not been supported can be vociferous in their attacks and arguments against it. This has happened, in both directions. I have no doubt that if the data from such a study were made freely available for download, someone whose position had not been supported might reanalyze the data, making the assumptions necessary to reach the opposite conclusion, in an attempt to discredit the original study and its author.