Scraping the Tucson Weekly events
Web scraping in the mire
I wanted to write a fairly simple web app that would provide a “value added” search of Tucson event listings in the Tucson Weekly. A simple search of events occurring over a weekend returns hundreds of results to potentially sift through, and personally I’d like to filter out certain things like weekly drum circles and listings of an ongoing months-long exhibition in a generic art gallery.
Obviously the website allows you to do some simple filtering, but I’m going to see if I can tune it up.
There is this great blog post that demos exactly how to start this kind of project with BeautifulSoup, via an example of scraping the concert listings from NYC’s Village Voice.
First task: simple scrape of event data into a table
Later on, when this is approaching an initial version of the actual app, this will be turned into a weekly scheduled job that grabs the event data from the current date until some future date (maybe a month?) and stores it.
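For example, building the list of date strings for a month-long window could look something like this (a sketch, assuming a 30-day window):
# sketch: date strings from today for the next 30 days (assumed "month" window)
import datetime
dates = [(datetime.date.today() + datetime.timedelta(days=d)).strftime("%Y-%m-%d")
         for d in range(30)]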
Crawling over the pages returned for a particular date, starting from the current date and going up to some N days later, is straightforward. All that was needed were minor modifications to the Python code from the blog post.
However, unlike the Village Voice music listings, the HTML div tags used by the Tucson Weekly website are often a lot more obscure (in my HTML-newbie opinion). After sifting through the HTML I found that the div tag labelled “EventListing clearfix” was the one corresponding to a single event.
In [17]:
import datetime
import requests
from bs4 import BeautifulSoup

# just do the current day for illustration
numDays = 1
for j in range(0, numDays):
    nEvents = 0
    # get string with the date
    date = datetime.date.today() + datetime.timedelta(days=j)
    dateString = date.strftime("%Y-%m-%d")
    print "DATE:"
    print dateString
    baseUrl = 'http://www.tucsonweekly.com/tucson/EventSearch?narrowByDate='
    # loop to crawl over pages
    # set the maximum possible number of pages of events:
    # in practice this would be a huge number, and the code should break before reaching it
    npageMax = 2  # for illustration, we'll stop after one page
    # base URL for this date
    url = baseUrl + dateString
    for pageNum in xrange(1, npageMax):
        print "On page =", pageNum
        # crawl over pages by appending the page number to the initial date url
        if (pageNum > 1):
            url = baseUrl + dateString + "&page=" + str(pageNum)
        # parse webpage with BeautifulSoup
        text = requests.get(url).text  # get raw html
        soup = BeautifulSoup(text)  # feed it into BeautifulSoup to parse it
        # get all "div" tags that correspond to a single event
        eventTags = soup.find_all('div', 'EventListing clearfix')
        print "There are", len(eventTags), "events on this page"
        nEvents += len(eventTags)
DATE:
2016-02-07
On page = 1
There are 15 events on this page
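As an aside, the loop above just runs up to a fixed npageMax. Here is a minimal sketch of how the crawl could instead terminate automatically, assuming (an assumption on my part, not verified against the site) that a page past the last one simply returns zero “EventListing clearfix” divs:
# sketch: keep requesting pages until one comes back with no events
pageNum = 1
while True:
    url = baseUrl + dateString + ("&page=" + str(pageNum) if pageNum > 1 else "")
    soup = BeautifulSoup(requests.get(url).text)
    eventTags = soup.find_all('div', 'EventListing clearfix')
    if len(eventTags) == 0:
        break  # assumed sentinel: an empty page means no more events for this date
    nEvents += len(eventTags)
    pageNum += 1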
So now that we’ve found the events, we need to grab their data!
And this is where it got more irritating. Within each:
<div class="EventListing clearfix" ...
I needed to find the URL of the event, and unfortunately, unlike the Village Voice, the Tucson Weekly wasn’t nice enough to just let me grab all anchor tags with the href attribute like this:
artistLinks = [tag.a.attrs['href'] for tag in artistTags]
Each event had multiple anchor tags, many with an href attribute that was not the URL I was after.
I managed to find the following fix: iterate over the anchor tags looking for the first one that has an href attribute but does NOT have a class attribute. This seems to return the correct anchor tag, whose href attribute gives the event’s URL.
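(As an aside, BeautifulSoup also accepts a function as a filter, so the same condition could be written more compactly; this is an equivalent alternative, not what the code below uses:)
# equivalent alternative: first <a> tag with an href but no class attribute
first_link = event.find(lambda tag: tag.name == 'a' and tag.has_attr('href')
                                    and not tag.has_attr('class'))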
In [18]:
# loop over each event and attempt to extract the html link for the event's page
for i, event in enumerate(eventTags):
    print "EVENT", i, "START:"
    # find all the anchor ('a') tags within the 'div' tag for the event
    anchor_tags = event.find_all('a')
    # iterate over each anchor tag looking for the FIRST one that has an 'href' attribute
    # but does NOT have a 'class' attribute
    isFound = False
    for j, mb in enumerate(anchor_tags):
        if (mb.has_attr('href') and (not isFound) and (not mb.has_attr('class'))):
            event_name = mb.get_text().strip()  # this should be the event name! (type unicode)
            event_link = mb.attrs['href']  # this should be the event webpage! (type string)
            print "Event name =", event_name
            print "Event link =", event_link
            isFound = True
EVENT 0 START:
Event name = 43rd Annual President’s Concert
Event link = http://www.tucsonweekly.com/tucson/43rd-annual-presidents-concert/Event?oid=6015139
EVENT 1 START:
Event name = Special Liturgical Music: Arvo Pärt's Berliner Messe
Event link = http://www.tucsonweekly.com/tucson/special-liturgical-music-arvo-parts-berliner-messe/Event?oid=6022644
EVENT 2 START:
Event name = How to Live in Happiness
Event link = http://www.tucsonweekly.com/tucson/how-to-live-in-happiness/Event?oid=6021947
EVENT 3 START:
Event name = Big Game Viewing Party
Event link = http://www.tucsonweekly.com/tucson/big-game-viewing-party/Event?oid=6025673
EVENT 4 START:
Event name = Community Forum for LGBT Seniors and Friends
Event link = http://www.tucsonweekly.com/tucson/community-forum-for-lgbt-seniors-and-friends/Event?oid=6021664
EVENT 5 START:
Event name = Reel in the Closet
Event link = http://www.tucsonweekly.com/tucson/reel-in-the-closet/Event?oid=6020267
EVENT 6 START:
Event name = Your Wellness Journey-WellWays Workshop
Event link = http://www.tucsonweekly.com/tucson/your-wellness-journey-wellways-workshop/Event?oid=6019458
EVENT 7 START:
Event name = Tucson Ukulele Meetup
Event link = http://www.tucsonweekly.com/tucson/tucson-ukulele-meetup/Event?oid=6017963
EVENT 8 START:
Event name = Ron DeVous
Event link = http://www.tucsonweekly.com/tucson/ron-devous/Event?oid=6022296
EVENT 9 START:
Event name = Ron Doering & RonDeVous Revue
Event link = http://www.tucsonweekly.com/tucson/ron-doering-and-rondevous-revue/Event?oid=6014090
EVENT 10 START:
Event name = Art Walk Sundays
Event link = http://www.tucsonweekly.com/tucson/art-walk-sundays/Event?oid=6008150
EVENT 11 START:
Event name = Art & Crafts Festival
Event link = http://www.tucsonweekly.com/tucson/art-and-crafts-festival/Event?oid=6015383
EVENT 12 START:
Event name = Rhythms of the Americas
Event link = http://www.tucsonweekly.com/tucson/rhythms-of-the-americas/Event?oid=5998815
EVENT 13 START:
Event name = Sorne - live performances & vocal workshop
Event link = http://www.tucsonweekly.com/tucson/sorne-live-performances-and-vocal-workshop/Event?oid=6011617
EVENT 14 START:
Event name = "Desert Schemes"
Event link = http://www.tucsonweekly.com/tucson/desert-schemes/Event?oid=5981354
Now all that is left is to go to each individual event URL and scrape the key information. Following the blog post, I define my own scrape function and modify it to fit the Tucson Weekly event properties.
Again, this is much tougher than the Village Voice case. Not all events have the same properties, and even the ones that do show some variation in the format of the data.
Also, the div tags were not as straightforward. For example, I couldn’t just call:
find('div', 'when')
find('div', 'price')
find('div', 'neighborhood')
on the BeautifulSoup object; instead I had to look in the metadata and in a div tag with class="MainColumn Event" and id="EventMetaData".
Then it gets even more fudgey: I have to use the span tag, which here always has class="label" no matter which event property it contains. This means I have to find the property type from the actual text enclosed by the span.
In [20]:
import re

# function to scrape info from a given event link on the site
def scrape(link):
    # parse webpage with BeautifulSoup
    text = requests.get(link).text
    soup = BeautifulSoup(text)
    # extract name, venue and genre(s) from <title>stuff | stuff | stuff</title>
    # at the top of the html
    title = soup.title.string
    title = title.split(' | ')
    name = title[0]
    venue = title[1]
    genre = title[2]
    print "Event name =", name
    print "Venue =", venue
    print "Kind =", genre, "\n"

    ### Get description
    description = soup.find_all("meta", {"name": "description"})[0]["content"]
    print "Description =", description, "\n"

    ### Get address
    address = ["og:street-address", "og:latitude", "og:longitude", "og:locality",
               "og:region", "og:postal-code"]
    addr_stuff = []
    for ad in address:
        result = soup.find_all("meta", {"property": ad})
        if (len(result) > 0):
            addr_stuff.append(result[0]["content"])
        else:
            addr_stuff.append(None)
    street = addr_stuff[0]
    lat = addr_stuff[1]
    lon = addr_stuff[2]
    locale = addr_stuff[3]
    region = addr_stuff[4]
    zipc = addr_stuff[5]
    print "Address =", street, locale, region, zipc
    print "Lat, lon =", lat, lon, "\n"

    ### Get event meta data
    metaData = soup.find_all("div", {"class": "MainColumn Event ", "id": "EventMetaData"})
    spans = metaData[0].find_all("span")
    ### Look for text indicating an event property within each span tag
    foundWhen = False
    foundPrice = False
    textPrice = 'none'
    for i, cont in enumerate(spans):
        if (cont.text == "When:"):
            when = cont.next_sibling.strip()
            foundWhen = True
        if (cont.text == "Price:"):
            textPrice = cont.next_sibling.strip()
            foundPrice = True
    ### Some events don't have a 'when' property,
    # e.g. they just say 'every 4th Friday' within the event description
    if (not foundWhen):
        print "returning none"
        return None

    ### Parse the price value
    # use regular expressions (re) to scrape multiple prices
    # ()   capture the group matched by the sequence inside the brackets
    # \$?  match an optional dollar sign at the front (needs escape character \)
    # \d   match a decimal digit [0-9]
    # +    greedy, match one or more of the preceding RE (longest sequence possible)
    # \.?  match 0 or 1 of '.' (needs escape character \)
    # ?    match 0 or 1 repetitions of the preceding RE,
    #      e.g. ab? will match either 'a' or 'ab'
    # e.g. '$5-$10 suggested donation' -> ['$5', '$10']
    dollarValues = re.findall('(\$?\d+\.?\d?\d?)', textPrice)
    # check if the price text says the event is free
    if (textPrice.count('free') != 0) or (textPrice.count('Free') != 0) or (textPrice.count('FREE') != 0):
        minprice = maxprice = 0.00
    elif len(dollarValues) == 0:
        minprice = maxprice = None
    else:
        for i in range(0, len(dollarValues)):
            dollarValues[i] = float(dollarValues[i].strip('$'))
        minprice = min(dollarValues)
        maxprice = max(dollarValues)
    if (maxprice != minprice):
        print "Price range =", minprice, "to", maxprice
        price = str(minprice) + " to " + str(maxprice)
    else:
        price = minprice
        print "Price =", minprice
    print "\nWhen =", when, "\n"
    # return a tuple of (name, description, address, price, when)
    return name, description, street + locale + region + zipc, price, when

print scrape(event_link)
Event name = "Desert Schemes"
Venue = Desert Artisans' Gallery
Kind = Art
Description = “Desert Schemes” new art exhibit through Feb 7th works by Margaret Aden, Gail Brynildsen, Denyse Fenelon, Pamela Howe, Tom Kolt and Jan Thompson.
Address = 6536 E. Tanque Verde Road. Tucson Arizona
Lat, lon = 32.24576 -110.85340
Price = 0.0
When = First Monday-Sunday of every month, 10 a.m.-5 p.m. Continues through Feb. 7
(u'"Desert Schemes"', u'\u201cDesert Schemes\u201d new art exhibit through Feb 7th works by Margaret Aden, Gail Brynildsen, Denyse Fenelon, Pamela Howe, Tom Kolt and Jan Thompson.', '6536 E. Tanque Verde Road.TucsonArizona', 0.0, u'First Monday-Sunday of every month, 10 a.m.-5 p.m. Continues through Feb. 7')
In short, we won!
In long, there are still a bunch of problems:
- the returned data is not in a consistent format (a sketch of one possible fix follows below)
- if the event has no “When:” text it is not included at all, even though the “when” is probably given in the event description
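One possible direction for the first problem (a sketch, not part of the code above): wrap whatever scrape returns in a dict with fixed keys, so every event has the same shape even when scrape bails out:
# hypothetical helper: coerce scrape() output into a consistent dict
def normalize(scraped):
    keys = ("name", "description", "address", "price", "when")
    if scraped is None:
        return dict.fromkeys(keys)  # all fields None for a skipped event
    return dict(zip(keys, scraped))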