Category Archives: Research

Some urban researchers are careless…and wrong

I have read a number of scholarly articles whose authors used census Urbanized Area data from 2000 or later and described those areas as consisting of territory with a population density of 1,000 or more persons per square mile. That is incorrect. The density threshold for adding blocks or other small areas to an Urbanized Area (or Urban Cluster) is 500 persons per square mile. I’m not into naming and shaming and won’t. But come on! If you can’t even describe the data you are using accurately, why should anyone trust anything else you are saying?

I know where the error comes from. Starting with the 2000 census, the Census Bureau dramatically changed how they defined the notion of “urban” and Urbanized Areas (for the most part greatly improving the definition). Under the old definition, it was the case that a small area had to have a population density of at least 1,000 persons per square mile to be included in an Urbanized Area. An excellent summary of how the census definition of “urban” has evolved can be found here.

I assume that a researcher making this error had read earlier articles that described Urbanized Areas as consisting of areas with densities of 1,000 or more (either correctly, if referring to pre–2000 Urbanized Areas or incorrectly, if referring to the later areas). I expect this would be the source, not the census definition of the earlier Urbanized Areas, for if these authors were too careless and lazy to look up the definition for their current work, they likely would not have done so in the past either.

The current Urbanized Area density minimum plays a key role in the definition of urban areas for my urban patterns research. And of course I continue to read newly published articles that deal with urban patterns, including those using Urbanized Area data. The first few times I read articles referring to the 1,000-person-per-square-mile cutoff for 2000 or 2010 Urbanized Areas, I panicked. Did I make a mistake in understanding the definition and get it wrong? (It is a complex definition.) Each of those times I went back and re-read the formal notices on urban area criteria for 2000 and 2010 in the Federal Register. After having assured myself several times that I was correct, I no longer have to repeat this.

Technical note

The 2000 and 2010 urban area criteria do make use of a population density minimum of 1,000 persons per square mile in the first stage of the delineation process. An urban area core is defined that includes small areas with population densities of 1,000 or more. Then additional areas are added with densities of 500 persons per square mile and above. The existence of an initial urban area core meeting the higher density threshold will not be an issue for Urbanized Areas.
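
The two-stage idea in this note can be sketched in a few lines of Python. This is only a rough illustration, not the Census Bureau procedure: the actual criteria involve many more rules (contiguity of the core, hops and jumps, population minimums, and so on), and the small-area ids, densities, and adjacency lists below are made up for the example.

```python
from collections import deque

CORE_MIN = 1000  # persons per square mile for the initial core (from the note above)
ADD_MIN = 500    # persons per square mile for areas added afterwards

def delineate(density, neighbors):
    """Return the ids of small areas in the sketch urban area.

    density:   dict mapping id -> persons per square mile
    neighbors: dict mapping id -> list of adjacent ids (hypothetical adjacency)
    """
    # Stage 1: the core is every small area at or above the higher threshold.
    urban = {area for area, d in density.items() if d >= CORE_MIN}
    # Stage 2: grow outward, adding adjacent areas at or above the lower threshold.
    queue = deque(urban)
    while queue:
        current = queue.popleft()
        for nb in neighbors.get(current, []):
            if nb not in urban and density[nb] >= ADD_MIN:
                urban.add(nb)
                queue.append(nb)
    return urban

# Made-up example: six small areas in a row, A next to B next to C, and so on.
density = {"A": 2400, "B": 1100, "C": 700, "D": 520, "E": 300, "F": 650}
neighbors = {
    "A": ["B"], "B": ["A", "C"], "C": ["B", "D"],
    "D": ["C", "E"], "E": ["D", "F"], "F": ["E"],
}
urban = delineate(density, neighbors)
print(sorted(urban))
```

In this toy case the result includes C and D, both below 1,000 persons per square mile, which is why describing the finished area as territory at 1,000 or more is wrong. F is left out here only because this sketch requires direct adjacency.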

On the sharing of data from research

The National Academies recently released a report addressing integrity in scientific research, including the social and behavioral sciences. One of the recommendations is that after publication researchers share with others the data on which an article is based. This supports research transparency and should lead to greater reproducibility of research. I think sharing data is generally a good thing, and I have done so. But I feel that the authors of the report have failed to address some important issues related to such data sharing. This is obviously a topic much broader than the subject of this blog. But at two points in my comments I will give examples that relate directly to things discussed here.

In discussing the recommendation on data sharing, the report points favorably to the policies of some journals requiring that authors make the data for an article available to others on request. But further discussion in the report strongly implies that data should be made available in a repository from which anyone can download it. The difference is significant because in making the data available online, the person(s) who created the data then lose all control over how it might be used.

But first, a simple, practical issue. Making data available for others to use entails a significant amount of work relating to formatting, documentation, and so forth. I am very careful about documenting my data as I do research, but that original documentation is completely meaningful only to me. Just sharing data with co-authors requires some additional effort. Sharing it publicly would require more. I suspect that the majority of datasets from articles would never be used by others. So it is inefficient to put the work in for every dataset to get it into a form in which it can be shared. It makes much more sense to put in this effort when someone makes the request to use the data. At that point, I am happy to do so.

The authors of the report (many from the natural sciences) seem to most often view datasets as the products of experiments, to be reported in a paper, which then is the end of the story. Indeed, they actually see as a problem “the temptation to publish multiple papers on just one experiment or dataset.” (p. 17) They fail to realize that for certain types of research, datasets are developed, often with a great deal of effort, to support the investigation of multiple research questions. Those creating the data have a reasonable expectation of being able to carry out their research without having it preempted by others using their data.

My urban patterns dataset with data on housing units by census tract for 59 large urban areas from 1950 to 2010 is an example of this. I spent at least a year-and-a-half building the dataset. I have a long list of research questions I intend to address using this dataset. The papers currently on the Research page represent just a start. I feel that it is reasonable that I should be able to be the first to use this data to address these questions. I certainly would not have put in the effort I did in creating the dataset only for one or two papers. This does not mean that I would be unwilling to share the data with others before I have completed this program. I’m finished with all of the questions I have intended to address relating to the negative exponential model. If someone wants to do more, I could be willing to share the data. Or if someone wants to combine my data with some other dataset, sharing could be appropriate. But that’s why I believe I need to have control over the sharing.

I was surprised that the authors of the report failed to address reputational risks that could be associated with data sharing (and by this, I am not including risks associated with others finding out about problems with the original research). Putting data on an archive for anyone to use can result in uses that can negatively impact the reputation of the data creator.

The first (and least significant) reputational risk comes from someone taking the data and producing and publishing a very crappy piece of work. While most such efforts are justifiably ignored, occasionally they will achieve notoriety for their sheer absence of quality. Assuming the author of the crappy research appropriately cites the creator of the data, the creator will forever be linked with the work. While everyone should understand that the data creator is not responsible, just being associated would not be the most pleasant thing.

For certain types of data, the reputational risk can be much greater. For example, suppose researchers post data dealing with a social problem that includes information on race. A white supremacist could obtain the data, improperly manipulate it, and falsely claim that the results supported their racist views. And they might well prominently note that the creators of the data were respected researchers at a major university. Such a nightmare scenario is why researchers have a legitimate interest in controlling the sharing of their data.

For researchers working in a field involving contentious positions with extremely strong partisans on both sides, risks can extend to the use of the data by others in that field. Getting back to the subject of this blog, urban sprawl and its effects represent just such a field. A study is published indicating that sprawl or compact cities do or do not have some effect, and those whose position has not been supported can be vociferous in their attacks and arguments against it. This has happened–in both directions. I have no doubt that if the data from such a study were made freely available for download, someone whose position had not been supported might reanalyze the data, making the assumptions necessary to reach the opposite conclusion, in an attempt to discredit the original study and its author.

On the choice of Combined Statistical Areas

Last year, I wrote a post discussing why I chose to use the larger Combined Statistical Areas (CSAs) for my urban patterns research rather than the commonly used Metropolitan Statistical Areas (MSAs). I followed this up with a second post giving examples of how the sharing of transportation infrastructure–commuter rail and airports–could be an indicator of the integration of areas that should be considered together as a single, larger metropolitan area.

This decision to use the CSAs is of such fundamental importance to my research that I felt it deserved more extended, formal treatment. I prepared the paper “On the Choice of Combined Statistical Areas” that provides greater background, covers the topics addressed in those blog posts in more detail, and addresses some other implications of the choice of CSAs over MSAs. It also shows how the CSAs are comparable in extent to MSAs as they had been defined earlier for the 2000 census. This last topic was also addressed in an earlier post.

The paper is posted on the Research page of the website and can also be downloaded here.

The negative exponential model and the size of cities

Researchers have long noted the tendency for densities to decline as a negative exponential function of distance from the center. They have looked at declines in the density gradient over time as a measure of decentralization in urban areas. They have noted the relationships of the estimated parameters of the model–the density gradient and the density at the center–to a variety of characteristics of urban areas, including, naturally, the size of the area. The consistent finding has been that the gradients tend to be smaller for larger urban areas, while the central densities tend to be larger.

Consider the relationships among the three–the gradient, the central density, and the size of the urban area. If density declines with distance following the negative exponential model, these three values must necessarily be mathematically related. But what affects what? It seems reasonable to believe that the size of the urban area is primarily affected by factors other than the parameters of the negative exponential model.

But what about the model parameters? Housing is long lasting and once established, the patterns in developed areas can remain remarkably stable for many decades. The density of urban development was much higher before widespread use of the automobile. And it turns out that the central densities are very strongly related to the sizes of urban areas in 1910. So it may not be unreasonable to conclude that, at least to some extent, the density gradient is determined by the central density and the size of the urban area.

Solving for the mathematical relationship between the gradient, central density, and size yields a somewhat complex expression. However, a simplified approximation can be used. This approximation has the density gradient being directly proportional to the square root of the central density and inversely proportional to the square root of the size of the urban area.
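
One way to see where an approximation of this form can come from is the following sketch. It rests on assumptions that are mine, not necessarily those of the paper: the urban area is treated as a full circle, density follows the model out to an unlimited distance, and size is measured as the total number of housing units. The symbols D_0 (central density), b (density gradient), r (distance from the center), and N (size) are labels chosen for this illustration.

```latex
D(r) = D_0 e^{-b r}, \qquad
N = \int_0^{\infty} D_0 e^{-b r} \, 2\pi r \, dr = \frac{2\pi D_0}{b^2}
\quad\Longrightarrow\quad
b = \sqrt{\frac{2\pi D_0}{N}}
```

Under these assumptions the gradient is proportional to the square root of the central density and inversely proportional to the square root of the size. A finite outer boundary, or an area that is not a full circle, adds further terms, which is presumably part of what makes the exact expression more complex.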

As described in an earlier post and in a paper, I had used my urban patterns data to estimate the parameters of the negative exponential model for large urban areas in the United States from 1950 to 2010. It was straightforward to test for conformity with the expected relationships among the density gradient, central density, and the size of the urban area. The gradient was indeed approximately inversely proportional to the square root of the size of the area, as expected. And the gradient did increase with the central density, though the proportionality was closer to the density itself rather than the square root. This may result from the fact that the census tract densities in my data (and used by most other researchers) are measures of gross density including nonresidential uses, streets, and vacant land and are therefore lower than the net residential densities within the residential areas alone.

More information on this analysis, including the mathematical derivation of the relationship among the 3 values, is in the paper “Negative Exponential Model Parameters and the Size of Large Urban Areas in the U.S., 1950–2010,” which can be downloaded here.

The negative exponential density gradient and decentralization

Many researchers have used the density gradient from the negative exponential model to study the decentralization of population and housing units in urban areas. The density gradient is the rate of decline of density with distance from the center of the city. A decrease or flattening of the density gradient has been considered to be evidence of the decentralization of population or housing. And the density gradient has been used as a measure of the amount of centralization in an urban area that could be used to compare levels of centralization with other urban areas.

I have estimated the density gradients for 43 large urban areas for each of the census years from 1950 to 2010. And I have developed a separate, “pure” measure of centralization of housing units which I described in the previous post. I am calling this measure the centralization ratio. So this gave me the means of actually looking at the extent to which the density gradient was a good measure of centralization and decentralization.

First, I looked at changes in the density gradient over time and compared it to changes in the centralization ratio. The relationship was reasonably strong. It is appropriate to use the change in the density gradient as a measure of decentralization.

Then I looked at the relationship between the magnitudes of the density gradient and the centralization ratio at single points in time. This time, virtually no relationship. The density gradient does not work as a measure of the level of centralization in an urban area that could be used to make comparisons with other urban areas.

What gives? Why such different findings? The key lay in the fact that the density gradient is strongly inversely related to the size of an urban area. Using the density gradient to predict the centralization ratio resulted in no relationship. But add number of housing units in the urban area to the model, controlling for the size of the area, and a strong relationship emerged. And this is why the change in the density gradient works as a measure of change in centralization over time. The size of the urban area is being subtracted out when you look at the change (with the exception of any change in size over the period).

Someone committed to the idea that the density gradient is a good measure of centralization might object that I have only shown that the centralization ratio and the density gradient are different, not that one is a better measure of centralization. I think I make a good case for the use of the centralization ratio. Also, in developing the measure, I calculated other measures of centralization for a sample of a dozen areas and they were all highly correlated. And an anecdotal point: The three urban areas in my study with the highest centralization ratios were New York, Chicago, and Philadelphia. And all three had density gradients that were below the mean for the 43 large urban areas I looked at.

Centralization in large urban areas

Many have examined the decentralization of population and housing units over time. A common approach has been to use the density gradient from the exponential model as a measure of centralization. I have estimated the parameters for the model for large urban areas since 1950. I wanted to consider how well the density gradient actually performed as a centralization measure (which will be the subject of the next post). But to do so, I needed a separate, good measure of the centralization of housing units.

I reviewed a variety of centralization measures in the literature and was not satisfied with any of them, so I developed my own. I wanted a measure that made maximum use of the data on the distribution of housing units by census tract. And I wanted the measure to be interpretable, to have meaning beyond a larger value indicating housing is more centralized. The measure involves calculating two values: One is the mean distance of housing units in the urban area from the center. The other is the mean distance they would be from the center if housing units were uniformly distributed in the area, densities everywhere the same, no centralization. The ratio of the actual to the uniform distance would, of course, be 1 if housing were uniformly distributed and would decline with decreasing mean distance to the center and greater centralization. The minimum value would be 0 if all housing were located at the center. I wanted a measure of centralization that would increase with greater centralization, so this ratio is subtracted from 1. This measure, which I am calling the centralization ratio, is the proportional reduction in the mean distance of housing units from the center compared with a uniform distribution. So a centralization ratio of 0.25, for example, would mean that the mean distance to the center is a quarter less than for an even distribution.
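
The calculation described above can be sketched as follows. This is a minimal illustration, not the code used for the paper. It assumes each tract is represented by its housing units, its land area, and the distance from its centroid to the center, and it approximates the uniform-distribution mean distance by weighting tract distances by land area. The tract values are made up.

```python
def centralization_ratio(tracts):
    """Centralization ratio for a list of (housing_units, land_area, distance) tuples.

    distance is the distance from the tract to the center of the urban area.
    """
    total_units = sum(units for units, _, _ in tracts)
    total_area = sum(area for _, area, _ in tracts)
    # Mean distance of housing units from the center as actually distributed.
    actual = sum(units * dist for units, _, dist in tracts) / total_units
    # Mean distance if housing were spread evenly over the land area (assumed proxy).
    uniform = sum(area * dist for _, area, dist in tracts) / total_area
    return 1 - actual / uniform

# Made-up tracts: dense near the center, sparse farther out.
tracts = [(4000, 1.0, 1.0), (3000, 2.0, 3.0), (2000, 3.0, 6.0), (1000, 4.0, 9.0)]
ratio = centralization_ratio(tracts)

# Housing exactly proportional to land area gives a ratio of 0 (no centralization).
even_ratio = centralization_ratio([(100, 1.0, 1.0), (200, 2.0, 3.0)])
print(round(ratio, 3), round(even_ratio, 3))
```

Weighting by land area is only one way to stand in for the uniform case. An analytical mean distance for the shape of the urban area would be another.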

I calculated the centralization ratio for 59 large urban areas for each census year from 1950 to 2010. The widely expected decentralization did occur, on average, with the mean value dropping from 0.25 to 0.18 over this period. But decentralization was far from universal; 14 areas saw increases.

Levels of centralization varied greatly across the urban areas. The highest and lowest values in 2010, for example, were 0.46 and 0.08. New York, Chicago, and Philadelphia were the areas with the highest levels of centralization, not surprisingly. Tampa-St. Petersburg, El Paso, and Jacksonville were the lowest. Urban areas in the Northeast had the highest mean centralization in 2010, followed by those in the Midwest. Urban areas in the South had the lowest levels of centralization (and would have been even lower if Washington-Baltimore, more like other large urban areas in the Northeast Corridor, had been excluded). The very largest urban areas also tended to have higher levels of centralization.

More detail on this analysis using the centralization ratio is in the paper “The Degree of Centralization in Large Urban Areas in the U.S., 1950–2010,” which can be downloaded here.

Accessibility to employment will always decline with distance from the center

The previous post on why the negative exponential model still works made the argument that average densities in rings around the CBD would only be modestly affected by the presence of outlying employment centers. Another approach to thinking about these issues focuses on accessibility to employment throughout the urban area.

Accessibility to employment varies, of course, across an urban area and can be determined for every location in the area. It is a measure of how many jobs are located close to a given location. A measure can range from something as simple as the number of jobs within some distance to a weighted sum over all jobs in the urban area, with the weight given to each job decreasing with distance. (Some form of the latter is much better.)

It has been shown that accessibility to employment is a better predictor of densities in census tracts than distance to the center. Accessibility is also more closely related to housing prices than distance, as it affects land rents (which is the way in which densities are affected).

Now to the question of why the negative exponential model still works for urban areas with more and more employment outside the CBD. For most plausible distributions of employment in an urban area, accessibility to employment will still decline with distance from the center. In fact, it is easy to show that accessibility will decline in that way even if employment were uniformly distributed across the area. Consider a circular urban area with a radius of 10 miles in which employment is evenly distributed and no employment is located outside. At the center of the urban area, the most remote job is 10 miles away. At a point on the edge of the area, the most remote job is 20 miles away. Consider the simple measure of the number of jobs within 5 miles of a location. For locations out to 5 miles from the center, the number will remain constant. Moving farther out, the number will decline steadily as one moves toward the periphery, as an increasing portion of the 5-mile circle around the location falls outside the urban area, the area with no jobs. More complex measures of accessibility will also decline with distance from the center, in a more uniform manner starting at the center.
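
The circular example can be checked with a short calculation. The sketch below uses the standard formula for the area of overlap of two circles to find the share of the 5-mile circle around a location that falls inside the 10-mile urban area. With evenly distributed employment, the number of jobs within 5 miles is proportional to that share. The function name and parameters are my own.

```python
import math

def share_of_circle_inside(d, area_radius=10.0, reach=5.0):
    """Share of the circle of radius `reach`, centered d miles from the center,
    that lies inside the circular urban area of radius `area_radius`."""
    R, r = area_radius, reach
    if d <= R - r:
        return 1.0  # the whole circle is inside the urban area
    if d >= R + r:
        return 0.0  # the circle is entirely outside
    # Area of overlap (lens) of two circles whose centers are d apart.
    part_small = r * r * math.acos((d * d + r * r - R * R) / (2 * d * r))
    part_large = R * R * math.acos((d * d + R * R - r * r) / (2 * d * R))
    triangle = 0.5 * math.sqrt((-d + r + R) * (d + r - R) * (d - r + R) * (d + r + R))
    return (part_small + part_large - triangle) / (math.pi * r * r)

# Share of the 5-mile circle with jobs, at each mile out from the center.
shares = [share_of_circle_inside(float(d)) for d in range(0, 11)]
for d, s in enumerate(shares):
    print(d, round(s, 3))
```

The share stays at 1 out to 5 miles and then falls steadily. At the edge of the area, somewhat less than half of the 5-mile circle still contains jobs.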

For an area with multiple employment centers, employment accessibility will be varying with distances to those centers as well as to the center of the entire urban area. Accessibility will not be closely related to distance from the CBD. But the average employment accessibilities for concentric rings around the center will continue to decrease in a fairly steady fashion with distance from the CBD.

This provides a basis for observing one other difference in the results from tract- versus ring-based estimates of the negative exponential model. The model is estimated using distance to the center as the independent variable in a regression to predict density (using the log of density to form a linear expression). But what if accessibility to employment is the correct predictor of density, with distance being used only as a proxy? Then the tract-based model, with accessibility more weakly related to distance, will have greater error in the independent variable. (The error is not in the measurement of distance, of course, but in the use of distance to approximate accessibility.) And error in an independent variable in a regression will generally have the effect of attenuating the estimate of the regression coefficient, in this case, the density gradient.
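
The attenuation effect can be illustrated with a small simulation. This is a sketch using made-up numbers, not my data. Log density is generated from a "true" predictor with a gradient of -0.15. The regression is then run once on that predictor and once on a noisy proxy for it, which plays the role of distance standing in for accessibility. All names and parameter values are assumptions for the illustration.

```python
import random

def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

random.seed(1)
TRUE_GRADIENT = -0.15  # assumed value, for illustration only
n = 5000

# The "true" predictor of log density, expressed in distance-like units.
true_predictor = [random.uniform(0, 20) for _ in range(n)]
log_density = [9.0 + TRUE_GRADIENT * x + random.gauss(0, 0.3) for x in true_predictor]

# The proxy actually used in the regression: the true predictor plus error.
noisy_proxy = [x + random.gauss(0, 5) for x in true_predictor]

slope_true = ols_slope(true_predictor, log_density)
slope_proxy = ols_slope(noisy_proxy, log_density)
print(round(slope_true, 3), round(slope_proxy, 3))
```

The slope estimated from the noisy proxy is closer to zero than the slope estimated from the true predictor, which is the pattern expected if the tract-based gradient is attenuated.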

Comparing the results of estimating the negative exponential model using both tract and ring density data shows this to be the case. For the earlier years, 1950 to 1970, the mean estimates of the density gradients were quite similar when using the tract and ring data. But after 1970, the mean estimates for the gradients become lower for the tract estimates as compared with the ring estimates. So the attenuation appears to exist in later decades. This is consistent with distance from the center becoming an increasingly poorer predictor of density at the tract level over time.

More detail on this ring-based analysis of the negative exponential decline of density is in the paper “The Monocentric Model with Polycentric Employment: Ring versus Tract Estimates of the Negative Exponential Decline of Density,” which can be downloaded here.