Scraping the Tucson Weekly events
Web scraping in the mire
I wanted to write a fairly simple web app that would provide a “value added” search of Tucson event listings in the Tucson Weekly. A simple search of events occurring over a weekend returns hundreds of results to sift through, and personally I’d like to filter out certain things, like weekly drum circles and listings for an ongoing months-long exhibition in a generic art gallery.
Obviously the website allows you to do some simple filtering, but I’m going to see if I can tune it up.
There is this great blog post that demos exactly how to start this kind of project using BeautifulSoup, via an example of scraping the concert listings from NYC’s Village Voice.
First task: simple scrape of event data into a table
Later on, when this is approaching an initial version of the actual app, this will be turned into a weekly scheduled job that grabs the event data from the current date until some future date (maybe a month?) and stores it.
Crawling over the pages returned for a particular date, starting from the current date and going up to some N days later, is straightforward. All that was needed were minor modifications to the Python code from the blog post.
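The date loop can be sketched roughly as below. This is only an illustration: the URL template is a placeholder I made up (the real Tucson Weekly query string is not shown in this post), and the helper names dates_from and listing_url are hypothetical.

```python
from datetime import date, timedelta

# Placeholder URL pattern (an assumption, NOT the real Tucson Weekly search URL).
URL_TEMPLATE = "http://example.com/events?date={day}&page={page}"


def dates_from(start, n_days):
    """Return the n_days consecutive dates beginning at start."""
    return [start + timedelta(days=i) for i in range(n_days)]


def listing_url(day, page=1):
    """Build the (assumed) listing URL for one date and one results page."""
    return URL_TEMPLATE.format(day=day.isoformat(), page=page)


# Walk three days forward from a fixed start date; a scheduled job would
# start from date.today() instead.
for day in dates_from(date(2016, 2, 7), 3):
    print("DATE:", day.isoformat(), "->", listing_url(day))
```

Each URL would then be fetched and handed to BeautifulSoup, one results page at a time.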
However, unlike the Village Voice music listings, the html div tags used by the Tucson Weekly website are often a lot more obscure (in my html-newbie opinion).
After sifting through the html I found that the div tag with class “EventListing clearfix” was the one that corresponds to a single event.
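A minimal sketch of picking out those divs with BeautifulSoup follows. The sample HTML is invented for illustration (only the class name “EventListing clearfix” comes from the real page), and find_event_divs is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for one page of search results.
SAMPLE_PAGE = """
<div id="results">
  <div class="EventListing clearfix" id="e1"><h3>First event</h3></div>
  <div class="EventListing clearfix" id="e2"><h3>Second event</h3></div>
  <div class="Sidebar clearfix">Not an event</div>
</div>
"""


def find_event_divs(html):
    """Return every div whose class list includes 'EventListing'."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches against each individual class on the tag, so the
    # trailing 'clearfix' class does not need to be spelled out.
    return soup.find_all("div", class_="EventListing")


events = find_event_divs(SAMPLE_PAGE)
print("There are %d events on this page" % len(events))
```

Matching on the single class EventListing rather than the full string is a design choice: it should keep working if the site reorders or adds helper classes.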
In [17]:
DATE:
2016-02-07
On page = 1
There are 15 events on this page
So now we’ve found the events, we need to grab their data!
And this is where it got more irritating. Within each:
<div class="EventListing clearfix" ...
I needed to find the url of the event, and unfortunately, unlike the Village Voice, Tucson Weekly wasn’t nice enough to let me just grab all anchor tags with the attribute href like this:
artistLinks = [tag.a.attrs['href'] for tag in artistTags]
Each event had multiple anchor tags, many with an href attribute that was not the url I was after.
I managed to find the following fix: iterate over each anchor tag looking for the first one that has an href attribute but does NOT have a class attribute. This seems to return the correct anchor tag, with the href attribute giving the event’s url.
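That rule can be sketched as below. The event markup here is invented to exercise the rule (the real listing html is not reproduced in this post), and event_link is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Invented markup: several anchors, only one with an href and no class.
SAMPLE_EVENT = """
<div class="EventListing clearfix">
  <a class="thumb" href="/images/thumb.jpg"><img src="t.jpg"></a>
  <a name="top"></a>
  <a href="/tucson/some-event/Event?oid=1">Some Event</a>
  <a class="more" href="/tucson/some-venue/Location?oid=2">Some Venue</a>
</div>
"""


def event_link(event_div):
    """Return (name, url) from the first anchor with an href and no class."""
    for anchor in event_div.find_all("a"):
        if anchor.has_attr("href") and not anchor.has_attr("class"):
            return anchor.get_text(strip=True), anchor["href"]
    # No anchor matched the rule.
    return None, None


soup = BeautifulSoup(SAMPLE_EVENT, "html.parser")
name, url = event_link(soup.find("div", class_="EventListing"))
print("Event name =", name)
print("Event link =", url)
```

The rule is a heuristic that happens to fit the current page layout, so returning (None, None) when nothing matches makes a layout change easier to notice.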
In [18]:
EVENT 0 START:
Event name = 43rd Annual President’s Concert
Event link = http://www.tucsonweekly.com/tucson/43rd-annual-presidents-concert/Event?oid=6015139
EVENT 1 START:
Event name = Special Liturgical Music: Arvo Pärt's Berliner Messe
Event link = http://www.tucsonweekly.com/tucson/special-liturgical-music-arvo-parts-berliner-messe/Event?oid=6022644
EVENT 2 START:
Event name = How to Live in Happiness
Event link = http://www.tucsonweekly.com/tucson/how-to-live-in-happiness/Event?oid=6021947
EVENT 3 START:
Event name = Big Game Viewing Party
Event link = http://www.tucsonweekly.com/tucson/big-game-viewing-party/Event?oid=6025673
EVENT 4 START:
Event name = Community Forum for LGBT Seniors and Friends
Event link = http://www.tucsonweekly.com/tucson/community-forum-for-lgbt-seniors-and-friends/Event?oid=6021664
EVENT 5 START:
Event name = Reel in the Closet
Event link = http://www.tucsonweekly.com/tucson/reel-in-the-closet/Event?oid=6020267
EVENT 6 START:
Event name = Your Wellness Journey-WellWays Workshop
Event link = http://www.tucsonweekly.com/tucson/your-wellness-journey-wellways-workshop/Event?oid=6019458
EVENT 7 START:
Event name = Tucson Ukulele Meetup
Event link = http://www.tucsonweekly.com/tucson/tucson-ukulele-meetup/Event?oid=6017963
EVENT 8 START:
Event name = Ron DeVous
Event link = http://www.tucsonweekly.com/tucson/ron-devous/Event?oid=6022296
EVENT 9 START:
Event name = Ron Doering & RonDeVous Revue
Event link = http://www.tucsonweekly.com/tucson/ron-doering-and-rondevous-revue/Event?oid=6014090
EVENT 10 START:
Event name = Art Walk Sundays
Event link = http://www.tucsonweekly.com/tucson/art-walk-sundays/Event?oid=6008150
EVENT 11 START:
Event name = Art & Crafts Festival
Event link = http://www.tucsonweekly.com/tucson/art-and-crafts-festival/Event?oid=6015383
EVENT 12 START:
Event name = Rhythms of the Americas
Event link = http://www.tucsonweekly.com/tucson/rhythms-of-the-americas/Event?oid=5998815
EVENT 13 START:
Event name = Sorne - live performances & vocal workshop
Event link = http://www.tucsonweekly.com/tucson/sorne-live-performances-and-vocal-workshop/Event?oid=6011617
EVENT 14 START:
Event name = "Desert Schemes"
Event link = http://www.tucsonweekly.com/tucson/desert-schemes/Event?oid=5981354
Now all that is left is to go to each individual event url and scrape the key information.
Following the blog post, I define my own scrape function and modify it to fit the Tucson Weekly event properties.
Again this is much tougher than the Village Voice case. Not all events have the same properties, and even the ones that do show some variation in the format of the data.
Also the div tags were not as straightforward. For example, I couldn’t just call:
find('div', 'when')
find('div', 'price')
find('div', 'neighborhood')
on the BeautifulSoup object, but instead had to look in the meta data and in a div tag with class="MainColumn Event" and id="EventMetaData".
Then it gets even more fudgey where I have to use the span tag, which here always has class="label" no matter the event property it contains. This means I have to find the property type from the actual text covered by the span.
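One way this label-text lookup might look is sketched below. The markup is a guess at the general shape (only class="MainColumn Event", id="EventMetaData" and class="label" come from the real page), it assumes each value is the text node right after its span, and read_labelled_fields is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Guessed markup: every property sits behind an identical span.label.
SAMPLE_META = """
<div class="MainColumn Event">
  <div id="EventMetaData">
    <span class="label">When:</span> Sundays, 10 a.m.-5 p.m.<br>
    <span class="label">Price:</span> Free<br>
  </div>
</div>
"""


def read_labelled_fields(html):
    """Map each span.label text (e.g. 'when') to the text that follows it."""
    soup = BeautifulSoup(html, "html.parser")
    meta = soup.find("div", id="EventMetaData")
    fields = {}
    if meta is None:
        return fields
    for span in meta.find_all("span", class_="label"):
        # The span's own text is the only clue to which property this is.
        key = span.get_text(strip=True).rstrip(":").lower()
        value = span.next_sibling
        # Assumption: the value is the text node immediately after the span.
        fields[key] = value.strip() if isinstance(value, str) else None
    return fields


fields = read_labelled_fields(SAMPLE_META)
print("When =", fields.get("when"))
print("Price =", fields.get("price"))
```

Using fields.get(...) means a missing property comes back as None instead of raising an error, which matters because not every event lists every property.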
In [20]:
Event name = "Desert Schemes"
Venue = Desert Artisans' Gallery
Kind = Art
Description = “Desert Schemes” new art exhibit through Feb 7th works by Margaret Aden, Gail Brynildsen, Denyse Fenelon, Pamela Howe, Tom Kolt and Jan Thompson.
Address = 6536 E. Tanque Verde Road. Tucson Arizona
Lat, lon = 32.24576 -110.85340
Price = 0.0
When = First Monday-Sunday of every month, 10 a.m.-5 p.m. Continues through Feb. 7
(u'"Desert Schemes"', u'\u201cDesert Schemes\u201d new art exhibit through Feb 7th works by Margaret Aden, Gail Brynildsen, Denyse Fenelon, Pamela Howe, Tom Kolt and Jan Thompson.', '6536 E. Tanque Verde Road.TucsonArizona', 0.0, u'First Monday-Sunday of every month, 10 a.m.-5 p.m. Continues through Feb. 7')
In short, we won!
In long, there are still a bunch of problems:
- the returned data is not in a consistent format
- if the event has no “When: ” text it is not included, even though the “when” is probably given in the event description