The Hispanic or Latino category in the census and other federal statistics

The Office of Management and Budget has offered a proposal for updating OMB’s race and ethnicity statistical standards, to be followed by all federal agencies including the census. It was released in January of last year but I only recently happened across it.

The proposal suggests two major changes: One is moving the Hispanic and Latino query from a separate question to being one option in the race (now race and ethnicity?) question. The second is adding a new Middle Eastern and North African option. I think both are good ideas.

One thing I looked at was how they proposed to describe the Hispanic and Latino category. Aside from a minor correction, it would remain the same as in the current standards. And that is a problem. That statement is as follows:

Hispanic or Latino: A person of Cuban, Mexican, Puerto Rican, Cuban, South or Central American, or other Spanish culture or origin, regardless of race. The term, ‘‘Spanish origin,’’ can be used in addition to ‘‘Hispanic or Latino.’’

I have always been confused by this statement. Does it refer to persons from Latin-American cultures or from Spanish-speaking cultures? These are obviously different. Portuguese-speaking Brazilians might not necessarily consider themselves as having Spanish culture or origin.

The statement refers to “A person of Cuban, Mexican, … culture or origin.” Cuban, Mexican, and so forth are adjectives so they have to be modifying something, so this presumably refers to “culture or origin.” This gives “Cuban culture or origin,” “Mexican culture or origin,” etc. The final adjective is “or other Spanish,” which is also obviously modifying culture or origin. But in the list is “South or Central American,” so this would be including persons of “South or Central American culture or origin.” And there is an “or” between “South or Central American” and “other Spanish” so the statement clearly is not requiring that a person from Latin America be other Spanish.

Yet the description is emphasizing the Spanish part, including the statement “The term, ‘Spanish origin,’ can be used in addition to ‘Hispanic or Latino.’” And “Spaniard” is listed as one of the example identities.

The primary persons who are not South or Central American who would be of “other Spanish culture or origin” would be people from Spain. What is the rationale for including them as Hispanic or Latino? Are persons from Spain expected to identify with Latin Americans as being Hispanic or Latino? I would guess they would be more likely to identify as Spanish and then, more generally, as European.

What about Latin Americans who are not Spanish-speaking? (Brazilians would be the largest population, of course.) It has always seemed to me that some would choose to identify as Latin American and choose Latino. Has this been examined with the available census data? And how are non-Spanish-speaking persons of color in Latin America who do not consider themselves white or black or Native American expected to respond?

People are expected to self-identify in response to the Hispanic or Latino question. Obviously extremely few will have read the OMB statement about the question. But the very presentation of the option as Hispanic or Latino presents the ambiguity, only heightened with the more detailed question mentioning the option “Spaniard.”

The diversity of urban neighborhoods has been increasing

I looked at neighborhood racial and ethnic diversity in 56 large urban areas from 1980 to 2020. This starts with the percentages of the populations of the census tracts in these urban areas that are white, black, Hispanic, and Asian and Pacific Islander. These are used to calculate an index of tract diversity that ranges from zero when the tract includes only persons from one group to 100 for tracts with 25 percent in each group, maximum diversity.

Mean diversity for all of the tracts in the urban areas jumped from 32 in 1980 to 59 in 2020. To give an idea of what this means, a tract with 86 percent of the population in one group, 14 percent in a second group, and none in the other two would have a diversity index of 32, the mean for 1980. The 2020 value of 59 could be for a tract with 67 percent in one group and 33 percent in a second.
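The post does not spell out the formula for the index, but the two worked examples above are consistent with a Simpson-type index (one minus the sum of squared group shares), rescaled so that four equal groups score 100. Here is a minimal Python sketch under that assumption; the function name is only illustrative.

```python
def diversity_index(shares):
    """Rescaled Simpson-type diversity index for a list of group shares.

    Assumption: the index is 1 - sum(p_i^2), divided by its maximum
    (1 - 1/k for k groups) and scaled to 0-100. The post does not state
    its formula; this one reproduces its worked examples.
    """
    total = sum(shares)
    proportions = [s / total for s in shares]
    k = len(proportions)
    simpson = 1 - sum(p * p for p in proportions)
    return 100 * simpson / (1 - 1 / k)


# The two example tracts from the text, plus the two extremes.
print(round(diversity_index([86, 14, 0, 0])))    # 32
print(round(diversity_index([67, 33, 0, 0])))    # 59
print(round(diversity_index([100, 0, 0, 0])))    # 0
print(round(diversity_index([25, 25, 25, 25])))  # 100
```

Shares can be given as percentages or counts, since they are normalized before the index is computed.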

In each decade over this period, about three-quarters of the tracts had increases in diversity and only a quarter saw declines.

Increases in diversity are associated with declines in the percent of a tract’s population white and increases in the shares for the other three groups. Growth in both tract population and the population of the entire urban area are related to greater increases in diversity.

A surprising pair of findings: Tract diversity at the start of a decade is negatively related to change in diversity (more opportunity for increase?) while urban area diversity is positively related (greater tolerance for diversity?).

A lot more detail is in my paper, “Changes in racial and ethnic diversity in neighborhoods in large urban areas in the U.S., 1980-2020,” which can be downloaded here.

Neighborhood diversity

A huge amount of research has been devoted to residential segregation by race. Since I have been looking at racial and ethnic diversity, I decided to look at diversity at the neighborhood level. Such residential diversity is just the opposite of segregation: If all neighborhoods were occupied by members of a single racial or ethnic group, this would be maximum segregation and zero diversity.

Racial and ethnic diversity (which can range from 0 to 100) was calculated for individual neighborhoods (census tracts). These values were combined to produce two measures of neighborhood diversity for each of 56 large urban areas from 1980 to 2020. The two measures are related to different approaches to the measurement of segregation.

Average neighborhood diversity across the areas increased dramatically over the forty-year period, from 28 to 57 on one measure and from 58 to 78 on the second. On the first measure, Las Vegas, Sacramento, and San Francisco-Oakland-San Jose were the most diverse in 2020 while El Paso, Pittsburgh, and Cleveland-Akron were the least diverse. (El Paso is an outlier as it is overwhelmingly Latino.) Mean diversity in 2020 was highest in the West and the South, lowest in the Midwest and Northeast.

Urban areas that grew more rapidly from 1980 to 2020 were more diverse and saw greater increases in residential diversity. Decreases in the urban area percent white and increases in percent black, Latino, and Asian were associated with greater growth in diversity.

For more on this, see my paper “Racial and Ethnic Diversity in Neighborhoods in Large Urban Areas in the U.S., 1980-2020,” which can be downloaded here.

Racial and ethnic diversity in urban areas is rapidly increasing

In 1980, whites were in the minority in only three of 56 of the largest urban areas. By 2020, one-third, 19 of the areas, were less than half white. The average share of the population white dropped from about three-quarters to just over half during this period. The share of the population Latino and Asian increased from around 10 percent to nearly 30 percent. The changing shares of the population in the four major racial and ethnic groups are illustrated in this graph:

Mean share of population in racial and ethnic groups

An index of diversity summarizes the racial and ethnic distribution of a population with a single number. The index varies from zero when the entire population is in a single group to 100 with equal proportions in each of the groups. From 1980 to 2020, the average diversity index for the 56 urban areas increased from 49 to 74. The most diverse urban areas in 2020 were San Francisco-Oakland-San Jose and Houston, both with index values of 93. At the other extreme, the least diverse areas were El Paso at 36 and Pittsburgh at 43. El Paso is overwhelmingly Latino, and Pittsburgh is largely white.

Studies have frequently reported how the suburbs were predominantly white. This research has generally used the standard central city-suburb division. But those cities encompass widely varying portions of their urban areas. This research defines the suburban periphery in a consistent manner as the areas added to the urban areas since 1940. Mean diversity in the suburbs shot up from 36 to 70 from 1980 to 2020, reducing the difference from the urban core, as seen in this graph:

Mean diversity in the suburban periphery and urban core

By 2020, diversity in the suburbs was actually greater than in the urban core in nearly half of the urban areas.

For more on this, see my paper, “Racial and Ethnic Diversity in Large Urban Areas in the U.S., 1980-2020,” which can be downloaded here.

Down with legends on graphs

In creating graphs of data for a paper, I have come to the conclusion that legends should be avoided whenever possible. Graphs for multiple categories of data including line graphs, bar graphs, or area graphs will designate the categories using different colors or line or fill styles. It is then typical to identify the categories using a legend showing the colors or styles associated with each category as in this graph:

This is an area graph showing the percentage distribution over time of the population of the average large urban area in the four major racial and ethnic groups. Each group is displayed using a different color, with the groups identified in the legend.

That this is the expected way of presenting the information for an area graph is reinforced by the instructions for creating such a graph in the software I am using, which conclude with the instruction to insert a legend.

I realized there is a much better alternative for identifying the areas. Simply place the name of the group on the area, as I have done here:

It is much easier to just read the name of the group on the graph than to look at a legend, identify the color associated with a group, and then look at the graph.

Placing labels on the graph is not limited to area graphs. Here is a line graph of the same data, with the lines in different colors and labels on the graph, not in a legend. Note that I also did the labels in the same colors as the lines to reinforce the association.

Not all graphs can be easily labeled in lieu of including a legend. Consider this graph, again using the same data:

It would be extremely difficult to put labels on the graph, so a legend would be required. But I don’t think this is a very good way to graph the data, so at least here this would be a moot point.

Urban area centralization and decentralization

I compared the mean distance housing units were located from the center of the urban area to the mean distance had they been evenly distributed at the same density across the urban area in 56 large urban areas. This is used as a measure of the centralization of the housing units. From 1980 to 2010, average centralization across the urban areas declined steadily, continuing a pattern of decentralization within urban areas that has been taking place for many decades.
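To make the comparison concrete, here is a rough Python sketch. It assumes tract-level data (housing units, land area, and distance from the center) and assumes the two mean distances are combined as one minus their ratio; the paper may combine them differently, and the function names are hypothetical.

```python
def weighted_mean_distance(weights, distances):
    """Mean distance from the center, weighting each tract by `weights`."""
    return sum(w * d for w, d in zip(weights, distances)) / sum(weights)


def centralization(housing_units, land_areas, distances):
    """Illustrative centralization measure (an assumption, not the paper's formula).

    actual  : mean distance of housing units from the center
    uniform : mean distance if units were spread at one even density,
              so each tract's share of units equals its share of land area
    Returns 0 when units are evenly spread, positive when they sit closer
    to the center than an even spread would put them.
    """
    actual = weighted_mean_distance(housing_units, distances)
    uniform = weighted_mean_distance(land_areas, distances)
    return 1 - actual / uniform


# Three hypothetical tracts of equal area at 1, 2, and 3 miles from the center.
print(centralization([300, 200, 100], [1, 1, 1], [1, 2, 3]))
```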

But from 2010 to 2020, the average level of centralization showed a modest increase. Was this real or a fluke? I calculated several alternative measures of centralization. Their averages likewise initially declined, but again the trend reversed, with increases over the last decade or two.

So has the era of decentralization of urban areas in the U.S. come to an end? Or to go even further, are urban areas entering a period of recentralization? I am not willing to make such a pronouncement. I remember too many instances of observers seeing a reversal in a trend and concluding a major change had taken place, only to find out this was a temporary aberration. In the early 1970s, for the first time the population outside of metropolitan areas grew more rapidly than the metro population. Many proclaimed the end of centuries of urbanization, naming the new pattern counterurbanization. . .only to see metro populations soon resume their faster growth.

For more on this study of centralization, see my paper “The Extent of Centralization of Housing Units in Large American Cities, 1970-2020,” which can be downloaded here.

Employers and their employees are not obligated to save downtowns

Numerous stories have described the conflicts between employees who want to work from home, at least part of the time, and employers wanting them to return to the office. At least these two groups have legitimate interests in the resolution of this conflict. My own take on this (with no skin in the game, as I am retired) is that allowing some degree of hybrid work should be a reasonable compromise with complete remote work an option as well. Employers insisting on full-time return to the office are doing so in the face of evidence that employees performed very well when working remotely 100 percent of the time during the COVID lockdown. I think the employers are basing their requests on old-fashioned ideas…and on the desire of at least some managers to exert maximum control over their employees.

But then there are the others campaigning for return to the office: Downtown business interests responding to the reduced traffic and business because employees have not fully returned. Property owners of office buildings with high vacancies and declining rents. And general local boosters, including newspapers campaigning for workers to return. Every so often the Washington Post editorial board has addressed this, suggesting that the federal government should do more to get workers to return to the office for the sake of downtown Washington.

I understand the problems faced by downtowns in many cities. Downtown businesses have reduced profits and fewer employees, and those are the businesses that have not been forced to close. Owners of the office buildings may be forced to sell at a loss. Local governments face the prospect of sharply reduced tax collections. Numerous articles have posed the possibility of an “urban doom loop” leading to greater problems. These problems are real and cities and businesses will have to work to cope with them.

Pleas are being made to businesses to require their employees to return to the office and to employees suggesting that it is their obligation to return to the office. The health of downtowns is not their responsibility. By coming downtown, employees provided a market for the businesses there. They should not now be expected to endure tedious and expensive commutes downtown simply to support those businesses when that is not necessary. Likewise, employers deciding that some form of hybrid work is in the best interests of both their company and their employees should not be expected to change simply to support downtown interests.

Density decline over time

The distribution of housing unit and population densities is frequently described as declining as a negative exponential function of distance from the Central Business District (CBD). I calculated the density gradient, central density, and goodness-of-fit for this pattern of decline of housing unit density for 40 large urban areas in the United States from 1970 to 2020. A few of the highlights from the results:

These measures vary widely across the urban areas. Larger and older urban areas, especially in the Northeast, generally have the highest values on all three measures. Areas in the Sunbelt are often the lowest.

Values have generally declined over the period from 1970 to 2020, but with urban areas following very different trajectories.

The explanation for the negative exponential decline of density is that with employment concentrated in the CBD, people prefer to be closer to the center and are willing to pay more for those locations. The decentralization of employment that has been occurring in urban areas raises the question as to whether the negative exponential model will do less well over time in accounting for the distribution of densities. From 1970 to 2020 the negative exponential model describes density patterns less well, with the mean R² value falling from 0.41 to 0.26.

One more tantalizing finding: The mean central density for 2020 is indeed lower than for 1970. But it did rebound somewhat in the final two decades. Does this indicate the end of the long-term pattern of decline, or even the reversal? We probably need more time and data to get a better sense of the meaning of this.
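For readers who want to see what the three measures are, the negative exponential model says density at distance r from the CBD is D(r) = D0 · exp(−b · r), where D0 is the central density and b is the density gradient. A common way to estimate it is ordinary least squares on the logarithm of density. The Python sketch below takes that route; it is an assumption about the estimation method rather than a description of the paper's exact procedure, and the function name is illustrative.

```python
import math


def fit_negative_exponential(distances, densities):
    """Fit ln(density) = ln(D0) - b * distance by ordinary least squares.

    Returns (gradient b, central density D0, R-squared of the log-linear fit).
    Assumes every density is positive; tracts with zero density would have
    to be dropped or handled separately before calling this.
    """
    logs = [math.log(d) for d in densities]
    n = len(distances)
    mean_x = sum(distances) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in distances)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(distances, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(distances, logs))
    ss_tot = sum((y - mean_y) ** 2 for y in logs)
    r_squared = 1 - ss_res / ss_tot
    return -slope, math.exp(intercept), r_squared


# Synthetic, noise-free data generated from the model itself (not real tracts).
distances = list(range(1, 11))
densities = [4000 * math.exp(-0.2 * r) for r in distances]
gradient, central_density, r_squared = fit_negative_exponential(distances, densities)
print(gradient, central_density, r_squared)
```

With real tract data the points scatter around the fitted line, and the R² reported here is for the log of density, which may differ from an R² computed on density itself.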

For more, see the paper, “The Negative Exponential Decline of Density in Large Urban Areas in the U.S., 1970-2020,” which can be downloaded here.

Hypothesis testing versus exploratory analysis

I have always understood that the preferred form of data analysis is hypothesis testing, where you begin with a specific hypothesis, collect the data and test to see whether you can confirm the hypothesis (by rejecting the null hypothesis). This is seen as the better approach compared to exploratory analysis where you search through data looking for possible relationships. The hypothesis testing is seen as providing a rigorous framework that minimizes the likelihood of drawing incorrect conclusions.

At issue is the possibility of drawing a false conclusion from a relationship present in the data only by chance. And the danger is that a researcher might be tempted to search to find statistically significant relationships that can form the basis for publication (since the publication of negative findings is quite rare). This is often referred to as p-hacking.

I am starting to rethink the question of hypothesis testing versus exploratory analysis with respect to the likelihood of false conclusions. To be sure, if you state an unambiguous hypothesis that includes equally clear statements about how the variables are to be measured and the level of significance to be used and then perform the hypothesis test, the probability of a false conclusion should be close to the significance level chosen for the test. For example, using the ubiquitous level of p < 0.05, the probability of drawing an incorrect conclusion if you reject the null hypothesis should be close to one chance out of twenty. (The probability of error must be considered to be at least somewhat larger because there always are other potential sources of error in any analysis beyond random variation in the data.)

But this is not how researchers often work. In social science research, how various concepts are to be measured is often not unambiguously specified. And the mathematical form of relationships is likewise frequently not known. So the researcher examines different variables and tries various transformations and combinations of the variables to see if a relationship exists. This becomes essentially a form of exploratory analysis within the context of the hypothesis testing. And to some extent this is reasonable, even as it increases the probability of drawing an incorrect conclusion.
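A back-of-the-envelope illustration of how quickly that probability grows: if a researcher tries several specifications and each one is treated as an independent test at the 0.05 level with no true relationship present, the chance that at least one comes up "significant" is 1 − (1 − 0.05)^k for k tries. Independence is a simplifying assumption here; alternative measures of the same concept are usually correlated, so the real inflation is typically smaller, but the direction is the same.

```python
def chance_of_false_positive(alpha, n_tests):
    """Probability that at least one of n independent tests is significant
    at level alpha when there is no real relationship in any of them."""
    return 1 - (1 - alpha) ** n_tests


for k in (1, 5, 10, 20):
    print(k, round(chance_of_false_positive(0.05, k), 3))
```

Under these assumptions, a single test has the nominal one-in-twenty chance, while twenty tries push the chance of at least one spurious finding to nearly two in three.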

But a real problem arises when the researcher is firmly committed to the truth of the hypothesis, either as a result of self-confidence in his or her ability to develop it or because the hypothesis arises from a strongly held set of beliefs that the researcher seeks to promote. In such cases, the research can easily move beyond good-faith exploration of different variables and forms of the relationship to full-on p-hacking to come up with something, anything that leads to the desired conclusion in support of the hypothesis. This can include variables and forms of variables that most likely would have been either rejected or not even considered if pursuing the exploration in good faith. Anything to prove the hypothesis.

I have seen this. I have been involved in collaborative research in which another researcher did this, in a manner that seemed to me very blatant, and which forced me to withdraw from the collaboration.

I think it is better to view the distinction between hypothesis testing and exploratory research as a continuum, ranging from completely rigorous hypothesis testing to exploration in the context of considering a hypothesis to completely exploratory research in the absence of any hypothesis. Any research that involves some degree of exploration increases the probability of drawing an incorrect conclusion beyond any level of statistical significance employed. What is most important in those instances which involve some level of exploration, including in the context of hypothesis testing, is that the researcher pursue the investigation in good faith. P-hacking can occur in either context.

And no, I have no idea of how to draw a line between good-faith exploration and p-hacking. But if the researcher begins the exploration by identifying a limited set of variables considered to be possibly related to the value predicted, this would lead to confidence in the validity of the exploratory analysis.

Supreme Court ends affirmative action

The decision said race could not be used in considering applicants for college admission. I have seen a suggestion that colleges might use the income of an applicant’s zip code as a weak surrogate for race to increase diversity.

They can do much better! Geocode the applicant’s residence down to the census tract. Identify characteristics of census tracts associated with disadvantage, such as income, poverty, low educational attainment, unemployment, and single-parent households. This would identify not only applicants having those characteristics but those disadvantaged by living in a neighborhood with those characteristics. These would all be legitimate factors to consider in admissions to achieve socioeconomic diversity without using race. Create a measure combining these characteristics and use that as a factor in admissions.
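One simple way to build such a combined measure, offered only as a sketch, is to standardize each tract characteristic and average the standardized values, flipping the sign of any characteristic where a higher value means less disadvantage (such as income). The field names, the equal weighting, and the sample numbers below are all hypothetical; a real index would need careful choices about which characteristics to include and how to weight them.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s if s else 0.0 for v in values]


def disadvantage_index(tracts, indicators):
    """Average of standardized tract characteristics (equal weights assumed).

    tracts     : list of dicts, one per census tract
    indicators : dict mapping field name to +1 if a higher value means more
                 disadvantage, or -1 if a higher value means less
    """
    columns = {name: z_scores([t[name] for t in tracts]) for name in indicators}
    return [
        mean(sign * columns[name][i] for name, sign in indicators.items())
        for i in range(len(tracts))
    ]


# Made-up tracts for illustration only.
tracts = [
    {"median_income": 95000, "poverty_rate": 5, "unemployment_rate": 3},
    {"median_income": 60000, "poverty_rate": 12, "unemployment_rate": 6},
    {"median_income": 30000, "poverty_rate": 30, "unemployment_rate": 12},
]
indicators = {"median_income": -1, "poverty_rate": 1, "unemployment_rate": 1}
print(disadvantage_index(tracts, indicators))
```

Higher scores indicate tracts that are more disadvantaged relative to the others in the list.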

Given the higher level of disadvantage of minority populations combined with the high levels of residential segregation by race, this could go far in achieving the objective of racial diversity in college admissions. The Supreme Court decision presumably precludes colleges and universities from investigating the extent to which such a procedure would be successful in predicting an applicant’s race. It would not prevent others.