Comparing Cities with Data Mining Tools

If you are in a mid-sized city and you want to compare your spending or performance to one or more other mid-sized cities, you have 286 cities to choose from. Knowing which cities are appropriate for comparison can be difficult and mysterious.

At GovEx, we wanted to simplify the process of finding comparable cities by using a standard set of demographic, economic, and geographic variables from the U.S. Census to group cities based on their similarity to one another.

Using RStudio, we designed a hierarchical clustering model to group a set of 286 mid-sized cities according to their population size, population growth, median income, percentage of population in poverty, percentage of population nonwhite, land area, population density, state capital status, and geographic region.

Hierarchical clustering works by calculating the “distance” between groups of cities, then repeatedly merging the two groups that have the smallest distance between them until every city belongs to a single tree.
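To make the merging idea concrete, here is a minimal sketch in Python (not the R script behind this model). It assumes Euclidean distance and average linkage, neither of which is specified above, and the four cities and their values are made up for illustration.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length tuples of numbers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(group_a, group_b, points):
    """Mean pairwise distance between the members of two groups (an assumed linkage rule)."""
    total = sum(euclidean(points[i], points[j]) for i in group_a for j in group_b)
    return total / (len(group_a) * len(group_b))

def agglomerate(points):
    """Merge the two closest groups until one remains.

    Returns the merges in order as (group_a, group_b, distance) tuples.
    """
    clusters = [(name,) for name in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j], points)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical, already-standardized values for four made-up cities.
toy_cities = {
    "City A": (0.0, 0.0),
    "City B": (0.1, 0.0),
    "City C": (5.0, 5.0),
    "City D": (5.0, 5.3),
}

merges = agglomerate(toy_cities)
for group_a, group_b, distance in merges:
    print(group_a, "+", group_b, "at distance", round(distance, 3))
```

The order of the merges is what a dendrogram draws: the earliest merges join the most similar cities, and the final merge joins the two most different groups.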

We tuned this particular model to group cities around their “size and shape,” which means we applied higher weights to the population size, density, and land area variables.

Our goal for this model was to compare budget expenditures across similar cities. We based the weights on the assumption that the “size and shape” variables affect city budgets in similar ways.

Users can tweak the model to cluster cities around other combinations of variables and weights. Are population size and density important to your question? Or are economic factors like median income and poverty more important? Our model lets the user apply “high,” “medium,” and “low” weights to any of the Census variables to suit a variety of research questions.
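One way to picture how weights change the result is a weighted distance on standardized variables. The sketch below is in Python, not the R used for the model, and it rests on assumptions: the numeric values behind “high,” “medium,” and “low” (3, 2, and 1), the use of z-scores, and the three made-up cities are all hypothetical.

```python
import math
from statistics import mean, pstdev

# Assumed numeric values for the three weight levels; the real model may differ.
WEIGHT_VALUES = {"high": 3.0, "medium": 2.0, "low": 1.0}

def standardize(rows):
    """Convert every variable to z-scores so that large units do not dominate."""
    variables = list(next(iter(rows.values())))
    stats = {}
    for v in variables:
        values = [row[v] for row in rows.values()]
        # Fall back to 1.0 when a variable is constant, to avoid dividing by zero.
        stats[v] = (mean(values), pstdev(values) or 1.0)
    return {
        city: {v: (row[v] - stats[v][0]) / stats[v][1] for v in variables}
        for city, row in rows.items()
    }

def weighted_distance(a, b, weights):
    """Euclidean distance in which each squared difference is scaled by its weight."""
    return math.sqrt(
        sum(WEIGHT_VALUES[level] * (a[v] - b[v]) ** 2 for v, level in weights.items())
    )

def closest_peer(city, scaled, weights):
    """Return the other city with the smallest weighted distance to `city`."""
    others = [c for c in scaled if c != city]
    return min(others, key=lambda c: weighted_distance(scaled[city], scaled[c], weights))

# Made-up figures for three hypothetical cities.
cities = {
    "City X": {"population": 100_000, "median_income": 40_000},
    "City Y": {"population": 105_000, "median_income": 70_000},
    "City Z": {"population": 300_000, "median_income": 42_000},
}

scaled = standardize(cities)
size_peer = closest_peer("City X", scaled, {"population": "high", "median_income": "low"})
income_peer = closest_peer("City X", scaled, {"population": "low", "median_income": "high"})
print("Closest by size:", size_peer)
print("Closest by income:", income_peer)
```

With these made-up numbers, weighting population heavily pairs City X with City Y, while weighting income heavily pairs it with City Z, which is why the choice of weights should follow from the question being asked.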

There is no perfect model, and knowing which variables to pick and how to weight them depends on what your city plans to do with a list of similar cities.

The output of this model is a “dendrogram” that shows city clusters as “leaves” stemming from “branches.” In general, when two or more cities appear on the same branch, they are more similar to each other than to cities that appear on different branches.
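Reading a dendrogram amounts to finding the smallest branch that holds your city along with at least one other. The Python sketch below illustrates that lookup on a tiny tree written as nested tuples; the tree, the city names, and the helper functions are hypothetical and are not part of the R script.

```python
def leaves(node):
    """List every city (leaf) under a node; a leaf is a plain string."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]

def nearest_branch(node, city):
    """Return the cities on the smallest branch that contains `city` and at least one other city.

    Returns None if the city is not in the tree.
    """
    if isinstance(node, str):
        return None
    for child in node:
        if city in leaves(child):
            deeper = nearest_branch(child, city)
            return deeper if deeper is not None else leaves(node)
    return None

# A made-up dendrogram: City D and City E merge first, then City C joins them.
tree = (("City A", "City B"), ("City C", ("City D", "City E")))

print(nearest_branch(tree, "City D"))
print(nearest_branch(tree, "City C"))
```

A city that merges late, like City C here, has a larger nearest branch, which signals that its closest matches are less similar to it than City D and City E are to each other.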

[Figure: Full dendrogram based on the Size and Shape model]

[Figure: Sample cluster]

With this model, you can compare the expenditures and revenues of similar cities to understand if your city is spending a “normal” amount for a given department, or receiving a “normal” amount for a given fund.

Similarly, you can compare your performance metrics with those of similar cities to understand what “normal” performance might look like for a given operational area.
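As a rough illustration of such a comparison, the Python sketch below measures one city's per-resident spending against the median of the other cities in its cluster. Using the median as the “normal” value is an assumption, and the city names and dollar figures are made up.

```python
from statistics import median

def compare_to_peers(city, cluster, per_capita_spending):
    """Compare one city's value with the median of the other cities in its cluster."""
    peer_values = [per_capita_spending[c] for c in cluster if c != city]
    peer_median = median(peer_values)
    own_value = per_capita_spending[city]
    return {
        "city_value": own_value,
        "peer_median": peer_median,
        "ratio": own_value / peer_median,
    }

# Hypothetical parks spending in dollars per resident for one cluster.
parks_spending = {"City A": 120.0, "City B": 100.0, "City C": 90.0, "City D": 110.0}
cluster = ["City A", "City B", "City C", "City D"]

result = compare_to_peers("City A", cluster, parks_spending)
print(result)
```

A ratio well above or below 1 does not prove that anything is wrong; it flags a department or metric that may deserve a closer look.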

Below are some suggested indicators and weights for clustering cities around various themes.

Indicators with High Weights | Cluster Theme
Population Size, Population Growth, Population Density, Land Area | Size and Shape
Median Income, Poverty, Nonwhite, State Capital, Region | People and Place

After you have tuned the model to group cities in a manner that is appropriate for your goals, you can begin to compare higher-level characteristics across cities on a relatively “apples-to-apples” basis.

Right now this model is a basic R script that requires the user to manually search through a large image to find clusters of interest. The next step we envision is to build a user-friendly application that makes model tweaking and cluster identification much simpler.

We invite the community to contribute to this project, and to provide potential use cases for comparison between cities.

If you have any questions about this work, or about using this model to find cities similar to yours, please contact me at nghadji@jhu.edu or on Twitter @nghadji.

Special thanks to Richard Dunks for his expertise and guidance in helping design this model.