
Defining exurban areas

For the urban patterns research, in addition to delineating the urban areas for each year, I wanted to delineate exurban areas beyond the urban areas that could reasonably be considered to be parts of the metropolitan area related to the urban core. Unlike the census Urbanized Areas, however, there is no accepted standard definition for exurban areas. Fortunately, a thorough review of past studies of exurban areas and how they were defined has been provided by Berube and others (Finding Exurbia, Brookings, 2006).

A minimum population or housing unit density–obviously much lower than the urban density threshold–was the most common criterion used in defining exurban areas. Other factors were also considered, especially commuting to the urban area. Data are not available over the entire period of the urban patterns dataset to allow the use of commuting. However, the maximum extent of the exurban area would be limited to the area of the Combined Statistical Area (CSA) or Metropolitan Statistical Area (MSA), which at a minimum guarantees interaction with the urban area for 2010 for the counties as a whole, if not individual tracts.

I decided to define exurban areas as the sets of contiguous tracts that were adjacent to the urban areas and had housing unit densities greater than some value. The minimum density levels used to define exurban areas in various studies varied widely, from 40 acres per housing unit down to about 10 acres per unit. (For studies using the lowest densities, the extent of the exurban areas was most often limited by the commuting criterion rather than density.) I approached the problem by mapping the tracts meeting different minima in 2010 to make a judgment as to what looked reasonable.

The very low minimum density thresholds of 30 or 40 acres per unit frequently resulted in all or most of the CSA or MSA being considered exurban, with the tracts meeting these levels extending far beyond those areas, especially in the eastern U.S. On the other hand, a density minimum of 10 acres per unit produced much smaller exurban areas than seemed reasonable and consistent with personal observation.

The choice came down to thresholds of either 15 acres per unit or 20 acres per unit. The resulting exurban areas generally looked appropriate for most areas. The final choice of 15 acres per unit came down to a number of specific situations where the lower density level of 20 acres per unit produced areas that seemed too large. I’ll give two examples: The exurban area for Indianapolis in 2010 would have extended south at least halfway to Louisville, through areas I would never consider exurban. And the Portland exurban area would have encompassed a large portion of the Willamette Valley.
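The definition described above (contiguous tracts adjacent to the urban area that meet a minimum housing unit density) can be sketched as a simple graph search. This is only an illustration, not the procedure actually used for the dataset: the tract IDs, the adjacency dictionary, the function name, and the choice to treat the threshold as inclusive are all assumptions made here, and the limit to the CSA or MSA is left out.

```python
from collections import deque

# Exurban threshold from the text: no more than 15 acres per housing unit.
MAX_ACRES_PER_UNIT = 15.0


def exurban_tracts(urban, adjacency, acres, units,
                   max_acres_per_unit=MAX_ACRES_PER_UNIT):
    """Return the tracts reachable from the urban area through an unbroken
    chain of neighboring tracts that meet the density minimum.

    urban      -- set of tract IDs already classified as urban
    adjacency  -- dict mapping a tract ID to its neighboring tract IDs
    acres      -- dict mapping a tract ID to its land area in acres
    units      -- dict mapping a tract ID to its housing unit count
    """
    def dense_enough(tract):
        return units[tract] > 0 and acres[tract] / units[tract] <= max_acres_per_unit

    exurban = set()
    queue = deque(urban)
    while queue:
        current = queue.popleft()
        for neighbor in adjacency.get(current, ()):
            if neighbor in urban or neighbor in exurban:
                continue
            if dense_enough(neighbor):
                exurban.add(neighbor)
                queue.append(neighbor)
    return exurban


# Hypothetical chain of tracts running outward from urban tract A.
adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"],
             "D": ["C", "E"], "E": ["D"]}
acres = {"A": 100, "B": 1000, "C": 1400, "D": 3000, "E": 500}
units = {"A": 400, "B": 100, "C": 100, "D": 100, "E": 100}
result = exurban_tracts({"A"}, adjacency, acres, units)
```

In this made-up example, tracts B and C (10 and 14 acres per unit) are exurban. Tract D (30 acres per unit) breaks the chain, so tract E is excluded even though it is dense enough on its own.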

A further check reinforced my decision on the minimum density for exurban areas of 15 acres per unit, which is one-fifth of the urban density threshold. For CSAs or MSAs adjacent to other CSAs or MSAs, it was not uncommon for both exurban areas to extend to the common boundary. But for areas not adjacent to others, the extent of contiguous exurban density tracts was generally either confined within the boundaries of the CSA or MSA or extended beyond the boundary at only one or two points, with a string of exurban density tracts along a highway. (This is much like census Urbanized Areas, which frequently have such tendrils of urban development extending outward.) So the density threshold for exurban areas seems consistent with the areas of significant metropolitan interaction as indicated by the CSA and MSA boundaries.

This process of defining the exurban areas is treated in greater detail in the paper, “Defining Exurban Areas for the Analysis of Urban Patterns Over Time,” which can be downloaded here.

On the sharing of data from research

The National Academies recently released a report addressing integrity in scientific research, including the social and behavioral sciences. One of the recommendations is that after publication researchers share with others the data on which an article is based. This supports research transparency and should lead to greater reproducibility of research. I think sharing data is generally a good thing, and I have done so. But I feel that the authors of the report have failed to address some important issues related to such data sharing. This is obviously a topic much broader than the subject of this blog. But at two points in my comments I will give examples that relate directly to things discussed here.

In discussing the recommendation on data sharing, the report points favorably to the policies of some journals requiring that authors make the data for an article available to others on request. But further discussion in the report strongly implies that data should be made available in a repository from which anyone can download it. The difference is significant because in making the data available online, the person(s) who created the data then lose all control over how it might be used.

But first, a simple, practical issue. Making data available for others to use entails a significant amount of work relating to formatting, documentation, and so forth. I am very careful about documenting my data as I do research, but that original documentation is completely meaningful only to me. Just sharing data with co-authors requires some additional effort. Sharing it publicly would require more. I suspect that the majority of datasets from articles would never be used by others. So it is inefficient to put the work in for every dataset to get it into a form in which it can be shared. It makes much more sense to put in this effort when someone makes the request to use the data. At that point, I am happy to do so.

The authors of the report (many from the natural sciences) seem to most often view datasets as the products of experiments, to be reported in a paper, which then is the end of the story. Indeed, they actually see as a problem “the temptation to publish multiple papers on just one experiment or dataset.” (p. 17) They fail to realize that for certain types of research, datasets are developed, often with a great deal of effort, to support the investigation of multiple research questions. Those creating the data have a reasonable expectation of being able to carry out their research without having it preempted by others using their data.

My urban patterns dataset with data on housing units by census tract for 59 large urban areas from 1950 to 2010 is an example of this. I spent at least a year-and-a-half building the dataset. I have a long list of research questions I intend to address using this dataset. The papers currently on the Research page represent just a start. I feel that it is reasonable that I should be able to be the first to use this data to address these questions. I certainly would not have put in the effort I did in creating the dataset only for one or two papers. This does not mean that I would be unwilling to share the data with others before I have completed this program. I’m finished with all of the questions I have intended to address relating to the negative exponential model. If someone wants to do more, I could be willing to share the data. Or if someone wants to combine my data with some other dataset, sharing could be appropriate. But that’s why I believe I need to have control over the sharing.

I was surprised that the authors of the report failed to address reputational risks that could be associated with data sharing (and by this, I am not including risks associated with others finding out about problems with the original research). Putting data on an archive for anyone to use can result in uses that can negatively impact the reputation of the data creator.

The first (and least significant) reputational risk comes from someone taking the data and producing and publishing a very crappy piece of work. While most such efforts are justifiably ignored, occasionally they will achieve notoriety for their sheer absence of quality. Assuming the author of the crappy research appropriately cites the creator of the data, the creator will forever be linked with the work. While everyone should understand that the data creator is not responsible, just being associated would not be the most pleasant thing.

For certain types of data, the reputational risk can be much greater. For example, suppose researchers post data dealing with a social problem that includes information on race. A white supremacist could obtain the data, improperly manipulate it, and falsely claim that the results supported their racist views. And they might well prominently note that the creators of the data were respected researchers at a major university. Such a nightmare scenario is why researchers have a legitimate interest in controlling the sharing of their data.

For researchers working in a field involving contentious positions with extremely strong partisans on both sides, risks can extend to the use of the data by others in that field. Getting back to the subject of this blog, urban sprawl and its effects represent just such a field. A study is published indicating that sprawl or compact cities do or do not have some effect, and those whose position has not been supported can be vociferous in their attacks and arguments against it. This has happened–in both directions. I have no doubt that if the data from such a study were made freely available for download, someone whose position had not been supported might reanalyze the data making the assumptions necessary to reach the opposite conclusion in an attempt to discredit the original study and its author.

On the choice of Combined Statistical Areas

Last year, I wrote a post discussing why I chose to use the larger Combined Statistical Areas (CSAs) for my urban patterns research rather than the commonly used Metropolitan Statistical Areas (MSAs). I followed this up with a second post giving examples of how the sharing of transportation infrastructure–commuter rail and airports–could be an indicator of the integration of areas that should be considered together as a single, larger metropolitan area.

This decision to use the CSAs is of such fundamental importance to my research that I felt it deserved more extended, formal treatment. I prepared the paper “On the Choice of Combined Statistical Areas” that provides greater background, covers the topics addressed in those blog posts in more detail, and addresses some other implications of the choice of CSAs over MSAs. It also shows how the CSAs are comparable in extent to MSAs as they had been defined earlier for the 2000 census. This last topic was also addressed in an earlier post.

The paper is posted on the Research page of the website and can also be downloaded here.

The negative exponential model and the size of cities

Researchers have long noted the tendency for densities to decline as a negative exponential function of distance from the center. They have looked at declines in the density gradient over time as a measure of decentralization in urban areas. They have noted the relationships of the estimated parameters of the model–the density gradient and the density at the center–to a variety of characteristics of urban areas, including, naturally, the size of the area. The consistent finding has been that the gradients tend to be smaller for larger urban areas, while the central densities tend to be larger.

Consider the relationships among the three–the gradient, the central density, and the size of the urban area. If density declines with distance following the negative exponential model, these three values must necessarily be mathematically related. But what affects what? It seems reasonable to believe that the size of the urban area is primarily affected by factors other than the parameters of the negative exponential model.

But what about the model parameters? Housing is long lasting and once established, the patterns in developed areas can remain remarkably stable for many decades. The density of urban development was much higher before widespread use of the automobile. And it turns out that the central densities are very strongly related to the sizes of urban areas in 1910. So it may not be unreasonable to conclude that, at least to some extent, the density gradient is determined by the central density and the size of the urban area.

Solving for the mathematical relationship between the gradient, central density, and size yields a somewhat complex expression. However, a simplified approximation can be used. This approximation has the density gradient being directly proportional to the square root of the central density and inversely proportional to the square root of the size of the urban area.
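One way to see where a square-root relationship of this kind can come from is to assume, purely for illustration, that density follows the negative exponential model in all directions out to unlimited distance. The symbols are chosen here and are not taken from the paper, whose derivation may differ: D(r) is the density at distance r from the center, D_0 is the central density, b is the density gradient, and N is the total number of housing units.

```latex
D(r) = D_0 e^{-br}, \qquad
N = \int_0^{\infty} 2\pi r \, D_0 e^{-br} \, dr = \frac{2\pi D_0}{b^2}
\quad\Longrightarrow\quad
b = \sqrt{\frac{2\pi D_0}{N}}
```

Under that assumption the gradient is proportional to the square root of the central density and inversely proportional to the square root of the size. A real urban area is neither a full circle nor unlimited in extent, which is presumably part of why the exact expression is more complex.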

As described in an earlier post and in a paper, I had used my urban patterns data to estimate the parameters of the negative exponential model for large urban areas in the United States from 1950 to 2010. It was straightforward to test for the conformity with the expected relationships among the density gradient, central density, and the size of the urban area. The gradient was indeed approximately inversely proportional to the square root of the size of the area, as expected. And the gradient did increase with the central density, though the proportionality was closer to the density itself rather than the square root. This may result from the fact that the census tract densities in my data (and used by most other researchers) are measures of gross density including nonresidential uses, streets, and vacant land and are therefore lower than the net residential densities within the residential areas alone.

More information on this analysis, including the mathematical derivation of the relationship among the three values, is in the paper “Negative Exponential Model Parameters and the Size of Large Urban Areas in the U.S., 1950–2010,” which can be downloaded here.

The negative exponential density gradient and decentralization

Many researchers have used the density gradient from the negative exponential model to study the decentralization of population and housing units in urban areas. The density gradient is the rate of decline of density with distance from the center of the city. A decrease or flattening of the density gradient has been considered to be evidence of the decentralization of population or housing. And the density gradient has been used as a measure of the amount of centralization in an urban area that could be used to compare levels of centralization with other urban areas.
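For readers who want to see how a density gradient is obtained, here is a minimal sketch of the usual approach: regress the natural log of density on distance from the center, so that the slope (with its sign reversed) is the gradient and the exponentiated intercept is the central density. The function name and the synthetic numbers are made up for illustration and are not values from the urban patterns data.

```python
import math


def fit_negative_exponential(distances, densities):
    """Ordinary least squares fit of ln(density) = ln(D0) - b * distance.

    Returns (central_density, gradient), that is, (D0, b).
    """
    logs = [math.log(d) for d in densities]
    n = len(distances)
    mean_x = sum(distances) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in distances)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(distances, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic densities generated exactly from the model, so the fit
# should recover the parameters used to create them.
distances = [1.0, 2.0, 4.0, 8.0, 12.0]
densities = [2000.0 * math.exp(-0.25 * r) for r in distances]
central_density, gradient = fit_negative_exponential(distances, densities)
```

A flatter gradient in a later year than in an earlier one is what has been read as decentralization.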

I have estimated the density gradients for 43 large urban areas for each of the census years from 1950 to 2010. And I have developed a separate, “pure” measure of centralization of housing units which I described in the previous post. I am calling this measure the centralization ratio. So this gave me the means of actually looking at the extent to which the density gradient was a good measure of centralization and decentralization.

First, I looked at changes in the density gradient over time and compared them to changes in the centralization ratio. The relationship was reasonably strong. So it is appropriate to use the change in the density gradient as a measure of decentralization.

Then I looked at the relationship between the magnitudes of the density gradient and the centralization ratio at single points in time. This time, virtually no relationship. The density gradient does not work as a measure of the level of centralization in an urban area that could be used to make comparisons with other urban areas.

What gives? Why such different findings? The key lay in the fact that the density gradient is strongly inversely related to the size of an urban area. Using the density gradient alone to predict the centralization ratio resulted in no relationship. But add the number of housing units in the urban area to the model, controlling for the size of the area, and a strong relationship emerged. And this is why the change in the density gradient works as a measure of change in centralization over time. The size of the urban area is being subtracted out when you look at the change (with the exception of any change in size over the period).

Someone committed to the idea that the density gradient is a good measure of centralization might object that I have only shown that the centralization ratio and the density gradient are different, not that one is a better measure of centralization. I think I make a good case for the use of the centralization ratio. Also, in developing the measure, I calculated other measures of centralization for a sample of a dozen areas and they were all highly correlated. And an anecdotal point: The three urban areas in my study with the highest centralization ratios were New York, Chicago, and Philadelphia. And all three had density gradients that were below the mean for the 43 large urban areas I looked at.

Centralization in large urban areas

Many have examined the decentralization of population and housing units over time. A common approach has been to use the density gradient from the exponential model as a measure of centralization. I have estimated the parameters for the model for large urban areas since 1950. I wanted to consider how well the density gradient actually performed as a centralization measure (which will be the subject of the next post). But to do so, I needed a separate, good measure of the centralization of housing units.

I reviewed a variety of centralization measures in the literature and was not satisfied with any of them, so I developed my own. I wanted a measure that made maximum use of the data on the distribution of housing units by census tract. And I wanted the measure to be interpretable, to have meaning beyond a larger value indicating housing is more centralized. The measure involves calculating two values: One is the mean distance of housing units in the urban area from the center. The other is the mean distance they would be from the center if housing units were uniformly distributed in the area, densities everywhere the same, no centralization. The ratio of the actual to the uniform distance would, of course, be 1 if housing were uniformly distributed and would decline with decreasing mean distance to the center and greater centralization. The minimum value would be 0 if all housing were located at the center. I wanted a measure of centralization that would increase with greater centralization, so this ratio is subtracted from 1. This measure, which I am calling the centralization ratio, is the proportional reduction in the mean distance of housing units from the center compared with a uniform distribution. So a centralization ratio of 0.25, for example, would mean that the mean distance to the center is a quarter less than for an even distribution.
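The calculation can be sketched in a few lines. This is a rough illustration under assumptions made here, not the exact procedure in the paper: each tract is represented by a single distance to the center, and the uniform-distribution mean distance is approximated by weighting those same distances by tract land area. The function name and the sample tracts are hypothetical.

```python
def centralization_ratio(tracts):
    """Compute 1 - (actual mean distance / uniform mean distance).

    tracts -- list of (distance_to_center, housing_units, land_area) tuples
    """
    total_units = sum(units for _, units, _ in tracts)
    total_area = sum(area for _, _, area in tracts)
    # Mean distance of housing units from the center as actually located.
    actual = sum(dist * units for dist, units, _ in tracts) / total_units
    # Mean distance if units were spread evenly, in proportion to land area.
    uniform = sum(dist * area for dist, _, area in tracts) / total_area
    return 1.0 - actual / uniform


# Three hypothetical tracts of equal area, with most housing near the center.
centralized = [(1.0, 600, 1.0), (3.0, 300, 1.0), (5.0, 100, 1.0)]
# The same tracts with housing spread evenly across them.
even = [(1.0, 100, 1.0), (3.0, 100, 1.0), (5.0, 100, 1.0)]

ratio_centralized = centralization_ratio(centralized)
ratio_even = centralization_ratio(even)
```

In the first case the actual mean distance is 2 and the uniform mean distance is 3, so the ratio is one-third. In the second case the two distances are equal and the ratio is 0.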

I calculated the centralization ratio for 59 large urban areas for each census year from 1950 to 2010. The widely expected decentralization did occur, on average, with the mean value dropping from 0.25 to 0.18 over this period. But decentralization was far from universal; 14 areas saw increases.

Levels of centralization varied greatly across the urban areas. The highest and lowest values in 2010, for example, were 0.46 and 0.08. New York, Chicago, and Philadelphia were the areas with the highest levels of centralization, not surprisingly. Tampa-St. Petersburg, El Paso, and Jacksonville were the lowest. Urban areas in the Northeast had the highest mean centralization in 2010, followed by those in the Midwest. Urban areas in the South had the lowest levels of centralization (and would have been even lower if Washington-Baltimore, more like other large urban areas in the Northeast Corridor, had been excluded). The very largest urban areas also tended to have higher levels of centralization.

More detail on this analysis using the centralization ratio is in the paper “The Degree of Centralization in Large Urban Areas in the U.S., 1950–2010,” which can be downloaded here.

Accessibility to employment will always decline with distance from the center

The previous post on why the negative exponential model still works made the argument that average densities in rings around the CBD would only be modestly affected by the presence of outlying employment centers. Another approach to thinking about these issues focuses on accessibility to employment throughout the urban area.

Accessibility to employment varies, of course, across an urban area and can be determined for every location in the area. It is a measure of how many jobs are located close to a given location. A measure can be as simple as the number of jobs within some distance, or it can be a weighted sum over all jobs in the urban area, with the weight given to the jobs decreasing with distance. (Some form of the latter is much better.)
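The two kinds of measure can be sketched as follows for a single location. The exponential form of the distance weight and its decay value are assumptions chosen here for illustration (other decreasing functions are also used), and the job sites are made up.

```python
import math


def jobs_within(job_sites, radius):
    """Simple measure: total jobs located no farther than radius away.

    job_sites -- list of (distance_from_location, jobs) tuples
    """
    return sum(jobs for dist, jobs in job_sites if dist <= radius)


def weighted_accessibility(job_sites, decay=0.2):
    """Weighted measure: every job counts, with weight exp(-decay * distance)."""
    return sum(jobs * math.exp(-decay * dist) for dist, jobs in job_sites)


# Hypothetical job sites at 1, 4, and 9 miles from the location of interest.
job_sites = [(1.0, 500), (4.0, 300), (9.0, 200)]
nearby_jobs = jobs_within(job_sites, 5.0)
accessibility = weighted_accessibility(job_sites)
```

The simple count treats a job at 4.9 miles the same as a job next door and ignores one at 5.1 miles entirely. The weighted sum avoids that sharp cutoff, which is one reason to prefer it.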

It has been shown that accessibility to employment is a better predictor of densities in census tracts than distance to the center. Accessibility is also more closely related to housing prices than distance, as it affects land rents (which is the way in which densities are affected).

Now turning to the question of why the negative exponential model still works for urban areas with increasingly more employment outside the CBD. For most plausible distributions of employment in an urban area, accessibility to employment will still decline with distance from the center. In fact, it is easy to show that accessibility will decline in that way even if employment were uniformly distributed across the area. Consider a circular urban area with a radius of 10 miles in which employment is evenly distributed and no employment is located outside. At the center of the urban area, the most remote job is 10 miles away. At a point on the edge of the area, the most remote job is 20 miles away. Consider the simple measure of the number of jobs within 5 miles of a location. For locations out to 5 miles from the center, the number will remain constant. Moving farther out, the number will decline steadily as one moves toward the periphery, as an increasing portion of the 5-mile circle around the location falls outside the urban area, the area with no jobs. More complex measures of accessibility will also decline with distance from the center, in a more uniform manner starting at the center.
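The circular example can be checked with a little geometry. With jobs spread evenly, the number of jobs within 5 miles of a location is proportional to the part of the 5-mile circle that lies inside the 10-mile urban area. The sketch below uses the standard formula for the overlap of two circles; the function name is made up here.

```python
import math


def overlap_area(d, city_radius=10.0, reach=5.0):
    """Area of a circle of radius reach, centered d miles from the city
    center, that falls inside a circular city of radius city_radius.

    Assumes reach is no larger than city_radius.
    """
    R, r = city_radius, reach
    if d <= R - r:
        return math.pi * r * r  # the whole circle is inside the city
    if d >= R + r:
        return 0.0              # the circle is entirely outside the city
    # Standard two-circle intersection (lens) area.
    part_r = r * r * math.acos((d * d + r * r - R * R) / (2 * d * r))
    part_R = R * R * math.acos((d * d + R * R - r * r) / (2 * d * R))
    kite = 0.5 * math.sqrt((-d + r + R) * (d + r - R) * (d - r + R) * (d + r + R))
    return part_r + part_R - kite


full_circle = math.pi * 5.0 ** 2
# Share of the 5-mile circle inside the city at several distances from the center.
shares = {d: overlap_area(d) / full_circle for d in (0.0, 2.5, 5.0, 7.5, 10.0)}
```

The share stays at 1 out to 5 miles from the center and then falls steadily, ending below one-half at the edge of the area because the boundary curves away from the location.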

For an area with multiple employment centers, employment accessibility will be varying with distances to those centers as well as to the center of the entire urban area. Accessibility will not be closely related to distance from the CBD. But the average employment accessibilities for concentric rings around the center will continue to decrease in a fairly steady fashion with distance from the CBD.

This provides a basis for observing one other difference in the results from tract- versus ring-based estimates of the negative exponential model. The model is estimated using distance to the center as the independent variable in a regression to predict density (using the log of density to form a linear expression). But what if accessibility to employment is the correct predictor of density, with distance being used only as a proxy? Then the tract-based model, with accessibility more weakly related to distance, will have greater error in the independent variable. (The error is not in the measurement of distance, of course, but in the use of distance to approximate accessibility.) And error in an independent variable in a regression will generally have the effect of attenuating the estimate of the regression coefficient, in this case, the density gradient.
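The attenuation effect is easy to demonstrate with simulated data. In this sketch, which uses entirely made-up numbers, the outcome depends on a true predictor, and a noisy proxy is then substituted for that predictor in the regression. The roles are generic: the simulated proxy error is simply random noise and is not meant to reproduce how distance relates to accessibility in real urban areas.

```python
import random


def ols_slope(xs, ys):
    """Slope from an ordinary least squares regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


random.seed(0)
TRUE_SLOPE = -0.3

# The true predictor, and an outcome that depends on it plus a little noise.
predictor = [random.uniform(0.0, 20.0) for _ in range(5000)]
outcome = [TRUE_SLOPE * x + random.gauss(0.0, 0.5) for x in predictor]

# A proxy that equals the true predictor plus substantial random error.
proxy = [x + random.gauss(0.0, 5.0) for x in predictor]

slope_with_predictor = ols_slope(predictor, outcome)
slope_with_proxy = ols_slope(proxy, outcome)
```

The slope estimated with the true predictor comes out close to the value used to generate the data, while the slope estimated with the proxy is noticeably closer to zero. The larger the error in the proxy relative to the variation in the true predictor, the stronger the attenuation.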

Comparing the results of estimating the negative exponential model using both tract and ring density data shows this to be the case. For the earlier years, 1950 to 1970, the mean estimates of the density gradients were quite similar when using the tract and ring data. But after 1970, the mean estimates for the gradients become lower for the tract estimates as compared with the ring estimates. So the attenuation appears to exist in later decades. This is consistent with distance from the center becoming an increasingly poorer predictor of density at the tract level over time.

More detail on this ring-based analysis of the negative exponential decline of density is in the paper “The Monocentric Model with Polycentric Employment: Ring versus Tract Estimates of the Negative Exponential Decline of Density,” which can be downloaded here.