Hack for Change mapping project

4th June 2016 was Code for America’s “National Day of Civic Hacking” (or “Hack for Change”, which is more speakable). Tucson’s local event was held in the University of Arizona’s Science and Engineering Library. The temperature in Tucson had just popped over 100F and that day in particular was forecast to reach 113F, so there was never a better day for staying indoors hacking with plenty of air conditioning (and pizza).

Looking through the list of suggested projects, I found the Opportunity Project particularly interesting because it involved taking advantage of federal and local open data for social good. It also gave us the chance to investigate the CitySDK tool, which works as a wrapper around the various APIs required to grab the different data sets available (census, FEMA, farmers’ markets, etc.).

We formed a team of 3 (Jon Eckel, Pete Lowe and myself) called JustMapIt! (chosen to reflect our dedication to producing something by the end of the day). We decided that a visualisation mapping the population income and/or poverty index across Tucson, along with access to grocery stores, might yield something interesting and useful. We began by checking out the available data, making sure it actually covered the Tucson area!

The first major issue was that the CitySDK tool didn’t appear to be working. In the interest of time we decided to directly grab our own data sets instead.

Data sets

INCOME IN THE PAST 12 MONTHS (IN 2014 INFLATION-ADJUSTED DOLLARS) from the 2014 American Community Survey 1-Year Estimates data for Arizona
Latitude and Longitude positions of grocery stores in Tucson scraped from venues with categoryID=’Grocery Store’ in Foursquare using its API (and then cleaned a little)

In [40]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import folium

%matplotlib inline

The income data

This data set gives the number of households and median income per census tract. Census tracts are small-ish subdivisions of a county (or similar). They provide a stable set of geographic units for the presentation of statistical census data. Generally they contain between 1,200 and 8,000 people, with an optimum size of 4,000 people.

In the data below the columns Id and Id2 contain the census tract IDs.
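As a rough illustration of what an Id2 value encodes: an 11-digit tract ID is laid out as a 2-digit state FIPS code, a 3-digit county FIPS code and a 6-digit tract code. The sketch below splits one apart (the helper name `split_geoid` and the `TractId` tuple are made up for this example). It also shows why the ID columns need to be read as strings: converting to an integer drops the leading zero.

```python
from collections import namedtuple

# Hypothetical container for the three parts of a tract ID
TractId = namedtuple("TractId", ["state_fips", "county_fips", "tract_code"])


def split_geoid(geoid):
    """Split an 11-digit tract ID into state (2), county (3) and tract (6) parts."""
    if len(geoid) != 11 or not geoid.isdigit():
        raise ValueError("expected an 11-digit tract ID, got %r" % geoid)
    return TractId(geoid[:2], geoid[2:5], geoid[5:])


# Id2 of "Census Tract 1, Pima County, Arizona" from the table below
parts = split_geoid("04019000100")
print(parts)

# Reading the ID as a number loses the leading zero, so it no longer
# matches the ID stored in the map data
as_number = int("04019000100")
print(as_number)
```

Here `04` is Arizona and `019` is Pima County, which matches the Geography column in the data.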

In [41]:

# household income census data
income_df = pd.read_csv("income_census_data.csv", header=[0,1], dtype={0:str, 1:str})

Munging

There were two levels of column labels, so the dataframe columns were multi-indexed. Since the upper level of labels gave no useful information, we removed it for ease of use.

In [42]:

# keep only the lower level of column labels (drop the uninformative upper level)
income_df.columns = income_df.columns.get_level_values(1)

# quick look at data
income_df.head()

	Id	Id2	Geography	Households; Estimate; Total	Households; Margin of Error; Total	Families; Estimate; Total	Families; Margin of Error; Total	Married-couple families; Estimate; Total	Married-couple families; Margin of Error; Total	Nonfamily households; Estimate; Total	...	Nonfamily households; Estimate; PERCENT IMPUTED - Family income in the past 12 months	Nonfamily households; Margin of Error; PERCENT IMPUTED - Family income in the past 12 months	Households; Estimate; PERCENT IMPUTED - Nonfamily income in the past 12 months	Households; Margin of Error; PERCENT IMPUTED - Nonfamily income in the past 12 months	Families; Estimate; PERCENT IMPUTED - Nonfamily income in the past 12 months	Families; Margin of Error; PERCENT IMPUTED - Nonfamily income in the past 12 months	Married-couple families; Estimate; PERCENT IMPUTED - Nonfamily income in the past 12 months	Married-couple families; Margin of Error; PERCENT IMPUTED - Nonfamily income in the past 12 months	Nonfamily households; Estimate; PERCENT IMPUTED - Nonfamily income in the past 12 months	Nonfamily households; Margin of Error; PERCENT IMPUTED - Nonfamily income in the past 12 months
0	1400000US04019000100	04019000100	Census Tract 1, Pima County, Arizona	319	50	48	38	34	31	271	...	(X)	(X)	(X)	(X)	(X)	(X)	(X)	(X)	13.7	(X)
1	1400000US04019000200	04019000200	Census Tract 2, Pima County, Arizona	1916	189	914	182	452	145	1002	...	(X)	(X)	(X)	(X)	(X)	(X)	(X)	(X)	26.7	(X)
2	1400000US04019000300	04019000300	Census Tract 3, Pima County, Arizona	680	86	244	54	109	54	436	...	(X)	(X)	(X)	(X)	(X)	(X)	(X)	(X)	22.5	(X)
3	1400000US04019000400	04019000400	Census Tract 4, Pima County, Arizona	1719	97	395	101	253	78	1324	...	(X)	(X)	(X)	(X)	(X)	(X)	(X)	(X)	27.5	(X)
4	1400000US04019000500	04019000500	Census Tract 5, Pima County, Arizona	1544	119	309	98	158	64	1235	...	(X)	(X)	(X)	(X)	(X)	(X)	(X)	(X)	30.8	(X)

5 rows × 131 columns
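To see the header flattening in isolation, here is a small self-contained sketch. The CSV text is a toy stand-in for the real file: the upper-level labels (`code_a` etc.) are invented placeholders, and only the lower-level labels and the first two rows of values are taken from the data above.

```python
import io

import pandas as pd

# Toy CSV with two header rows; the first row holds placeholder codes
raw = io.StringIO(
    "code_a,code_b,code_c\n"
    "Id2,Households; Estimate; Total,Households; Estimate; Median income (dollars)\n"
    "04019000100,319,24861\n"
    "04019000200,1916,24856\n"
)

# header=[0, 1] makes pandas build two-level (MultiIndex) column labels;
# dtype=str keeps the leading zero in the tract ID
toy_df = pd.read_csv(raw, header=[0, 1], dtype=str)
print(toy_df.columns.nlevels)

# Keep only the second (lower) level of labels
toy_df.columns = toy_df.columns.get_level_values(1)
print(list(toy_df.columns))
```

After flattening, columns can be selected with a single label, e.g. `toy_df["Id2"]`.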

In [43]:

print("Name of columns with income data:")
for col in income_df.columns:
    if "income" in col:
        print(col)

Name of columns with income data:
Households; Estimate; Median income (dollars)
Households; Margin of Error; Median income (dollars)
Families; Estimate; Median income (dollars)
Families; Margin of Error; Median income (dollars)
Married-couple families; Estimate; Median income (dollars)
Married-couple families; Margin of Error; Median income (dollars)
Nonfamily households; Estimate; Median income (dollars)
Nonfamily households; Margin of Error; Median income (dollars)
Households; Estimate; Mean income (dollars)
Households; Margin of Error; Mean income (dollars)
Families; Estimate; Mean income (dollars)
Families; Margin of Error; Mean income (dollars)
Married-couple families; Estimate; Mean income (dollars)
Married-couple families; Margin of Error; Mean income (dollars)
Nonfamily households; Estimate; Mean income (dollars)
Nonfamily households; Margin of Error; Mean income (dollars)
Households; Estimate; PERCENT IMPUTED - Household income in the past 12 months
Households; Margin of Error; PERCENT IMPUTED - Household income in the past 12 months
Families; Estimate; PERCENT IMPUTED - Household income in the past 12 months
Families; Margin of Error; PERCENT IMPUTED - Household income in the past 12 months
Married-couple families; Estimate; PERCENT IMPUTED - Household income in the past 12 months
Married-couple families; Margin of Error; PERCENT IMPUTED - Household income in the past 12 months
Nonfamily households; Estimate; PERCENT IMPUTED - Household income in the past 12 months
Nonfamily households; Margin of Error; PERCENT IMPUTED - Household income in the past 12 months
Households; Estimate; PERCENT IMPUTED - Family income in the past 12 months
Households; Margin of Error; PERCENT IMPUTED - Family income in the past 12 months
Families; Estimate; PERCENT IMPUTED - Family income in the past 12 months
Families; Margin of Error; PERCENT IMPUTED - Family income in the past 12 months
Married-couple families; Estimate; PERCENT IMPUTED - Family income in the past 12 months
Married-couple families; Margin of Error; PERCENT IMPUTED - Family income in the past 12 months
Nonfamily households; Estimate; PERCENT IMPUTED - Family income in the past 12 months
Nonfamily households; Margin of Error; PERCENT IMPUTED - Family income in the past 12 months
Households; Estimate; PERCENT IMPUTED - Nonfamily income in the past 12 months
Households; Margin of Error; PERCENT IMPUTED - Nonfamily income in the past 12 months
Families; Estimate; PERCENT IMPUTED - Nonfamily income in the past 12 months
Families; Margin of Error; PERCENT IMPUTED - Nonfamily income in the past 12 months
Married-couple families; Estimate; PERCENT IMPUTED - Nonfamily income in the past 12 months
Married-couple families; Margin of Error; PERCENT IMPUTED - Nonfamily income in the past 12 months
Nonfamily households; Estimate; PERCENT IMPUTED - Nonfamily income in the past 12 months
Nonfamily households; Margin of Error; PERCENT IMPUTED - Nonfamily income in the past 12 months

It seems the column we want to look at is “Households; Estimate; Median income (dollars)”.

In [44]:

median_income = income_df["Households; Estimate; Median income (dollars)"]
print(median_income.describe())

print(median_income[:30])

# plotting at this point raises an error (see below)
#fig = plt.figure(figsize=(10,10))
#ax = fig.add_subplot(111)
#ax.hist(median_income)

count       241
unique      240
top       27472
freq          2
Name: Households; Estimate; Median income (dollars), dtype: object
   24861
   24856
   30739
   18792
   23188
   51667
   25805
   44250
   34492
   39145
  26983
  30441
  13193
  22955
  14940
  21599
  27573
  50387
  41507
  28874
  33947
  48258
  33380
  28247
  32857
  28084
  23778
  24214
  29292
  30878
Name: Households; Estimate; Median income (dollars), dtype: object

The output from describe looks odd (count, unique, top and freq instead of numeric statistics such as the mean), though the data itself looks OK. Also, an error is raised on trying to plot it, with hist:

TypeError: len() of unsized object

or with plot:

ValueError: could not convert string to float:

In [45]:

# check by eye
#for val in median_income:
#    print val, type(val)

    
# Sample output:
# 84410 <type 'str'>
# 65284 <type 'str'>
# 53460 <type 'str'>
# 39798 <type 'str'>
# - <type 'str'>
# 29923 <type 'str'>
# 27664 <type 'str'>
# 34726 <type 'str'>
# 34000 <type 'str'>

# print all entries with "-" for Households; Estimate; Median income (dollars)
for index, row in income_df.iterrows():
    if row["Households; Estimate; Median income (dollars)"] == '-':
        print("Median income = {} Number of households = {}".format(
            row["Households; Estimate; Median income (dollars)"],
            row["Households; Estimate; Total"]))

Median income = - Number of households = 0

One entry has a null value (-) for the median income, and it corresponds to a census tract with 0 households (this seems to be because the tract is a state prison complex). So we need to ignore the tract where the number of households is 0, and also convert the remaining values to floats (they were read in as strings).

In [46]:

median_income = income_df.loc[(income_df["Households; Estimate; Total"] > 0),
                              "Households; Estimate; Median income (dollars)"].astype(float)


fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(111)
ax.hist(median_income, 50)
ax.set_xlabel('median income in AZ per tract ($)', fontsize=24)

[Figure: histogram of median household income per census tract]
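Filtering on the household count works here because the only non-numeric value sits in the empty tract. A more general alternative, which is not what we did on the day, is to let pandas turn anything it cannot parse into NaN with `pd.to_numeric(..., errors="coerce")`. A minimal sketch on a few values taken from the sample output above:

```python
import pandas as pd

# A handful of raw values as they come out of the CSV (all strings)
raw_income = pd.Series(["84410", "65284", "-", "29923"], name="median_income")

# errors="coerce" converts unparseable entries such as "-" to NaN
clean_income = pd.to_numeric(raw_income, errors="coerce")
print(clean_income.dtype)

# Drop the missing entries before plotting or summarising
valid_income = clean_income.dropna()
print(valid_income.tolist())
```

The trade-off is that coercion silently hides any other unexpected value, so it is worth counting the NaNs afterwards.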

Read in the Tucson grocery store data

Scraped from Foursquare into a simple CSV file.

In [47]:

supermarkets = pd.read_csv("grocery_stores.csv")
supermarkets.head()

	lat	lon	name	addr
0	32.229253	-110.873651	Kimpo Market	5595 E 5th St
1	32.220195	-110.807966	Walmart Neighborhood Market	8640 E Broadway Blvd
2	32.118384	-110.798278	Safeway	9050 E Valencia Rd
3	32.256930	-110.943687	India Dukaan	2754 N Campbell Ave
4	32.193137	-110.841855	Walmart Neighborhood Market	2550 S Kolb Rd
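The notebook only plots these positions, but the lat/lon columns are also enough to estimate how far a given point is from its nearest store. The sketch below is an illustration rather than part of the original analysis: it uses the haversine great-circle formula, a hard-coded list holding the first three stores from the table, and hypothetical helper names `haversine_km` and `nearest_store`.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    earth_radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


# First three rows of the grocery store table: (name, lat, lon)
stores = [
    ("Kimpo Market", 32.229253, -110.873651),
    ("Walmart Neighborhood Market", 32.220195, -110.807966),
    ("Safeway", 32.118384, -110.798278),
]


def nearest_store(lat, lon):
    """Return (distance_km, name) of the closest store in the list."""
    return min((haversine_km(lat, lon, s_lat, s_lon), name)
               for name, s_lat, s_lon in stores)


# Distance from the map centre used later in the notebook
distance_km, store_name = nearest_store(32.2, -110.94)
print(store_name, round(distance_km, 1))
```

Straight-line distance ignores roads and transport, so it is only a first approximation of access.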

Folium for mapping

Folium is a Python wrapper for the Leaflet JavaScript library, which renders interactive maps.

We need a way to convert each census tract ID to its equivalent area on the map. The census website provides this data in the form of ESRI Shapefiles.

Folium works with GeoJSON files, so we need to convert. Handily, we can do this using an online converter.
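For orientation, a GeoJSON file is a collection of features, each with a geometry and a dictionary of properties. The join between shapes and census data relies on each feature carrying the tract ID as a property. The sketch below shows that structure with a tiny hand-made collection: the coordinates are placeholders rather than real tract outlines, and the `GEOID` property name is assumed from the `key_on` argument used in the mapping code.

```python
import json

# Tiny hand-made FeatureCollection; coordinates are placeholders
geojson_text = """
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "properties": {"GEOID": "04019000100"},
     "geometry": {"type": "Polygon",
                  "coordinates": [[[-110.98, 32.22], [-110.96, 32.22],
                                   [-110.96, 32.23], [-110.98, 32.22]]]}},
    {"type": "Feature",
     "properties": {"GEOID": "04019000200"},
     "geometry": {"type": "Polygon",
                  "coordinates": [[[-110.96, 32.22], [-110.94, 32.22],
                                   [-110.94, 32.23], [-110.96, 32.22]]]}}
  ]
}
"""
collection = json.loads(geojson_text)

# Median income keyed by tract ID (first two tracts of the income data)
income_by_tract = {"04019000100": 24861.0, "04019000200": 24856.0}

# Attach the income value to the feature with the matching GEOID
for feature in collection["features"]:
    geoid = feature["properties"]["GEOID"]
    feature["properties"]["median_income"] = income_by_tract.get(geoid)

joined = [(f["properties"]["GEOID"], f["properties"]["median_income"])
          for f in collection["features"]]
print(joined)
```

If the IDs in the two sources differ in any way (for example a lost leading zero), the lookup returns nothing and the tract would have no value to colour by.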

In [48]:

import folium
import json

# GeoJSON file of Arizona census tracts
state_geo = "arizona.json"

# initialize map
tucson_coords = [32.2,-110.94]
mp = folium.Map(location=tucson_coords, zoom_start=11)

# map data to geo_json
# (geo_json, circle_marker and create_map belong to the folium release we used;
#  newer releases provide folium.Choropleth, folium.CircleMarker and Map.save instead)
mp.geo_json(geo_path=state_geo,
            data=income_df.loc[income_df["Households; Estimate; Total"] > 0],
            data_out="median_income.json",
            columns=["Id2", "Households; Estimate; Median income (dollars)"],
            key_on="feature.properties.GEOID",
            fill_color='YlGn',
            fill_opacity=0.7,
            line_opacity=0.2,
            threshold_scale=np.logspace(np.log10(15000), np.log10(125000), 6).tolist(),
            legend_name='Median Income')

# plot the supermarkets on the map
for i, row in supermarkets.iterrows():
    name = row["name"].decode("utf8")
    mp.circle_marker(location=[str(row["lat"]), str(row["lon"])], popup=name,
                     radius=100, fill_color="red")

# generate the HTML/Javascript
mp.create_map(path='tucson.html', plugin_data_out=False)
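A note on the colour scale: `np.logspace(np.log10(15000), np.log10(125000), 6)` produces six bin edges between $15,000 and $125,000 that are evenly spaced on a log scale, so each edge is the same multiple of the previous one. That gives more colour resolution at the low-income end, where most tracts sit in the histogram. The sketch below reproduces the calculation with only the standard library (the helper name `log_spaced` is made up for this example).

```python
import math


def log_spaced(low, high, n):
    """Return n values from low to high, evenly spaced in log10."""
    log_low, log_high = math.log10(low), math.log10(high)
    step = (log_high - log_low) / (n - 1)
    return [10 ** (log_low + i * step) for i in range(n)]


edges = log_spaced(15000, 125000, 6)
print([int(round(edge)) for edge in edges])

# Each edge is a constant multiple of the one before it
ratios = [upper / lower for lower, upper in zip(edges, edges[1:])]
print([round(ratio, 3) for ratio in ratios])
```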

Map!

Here is a link to the map. Unfortunately we ran out of time before being able to add a control to switch between median income and another data set (e.g. population density). This particular visualisation would also be better served by income data at a higher resolution than census tracts, but it was a great start: we learned a lot and finished something!