As a Melbourne resident and daily commuter on our Myki public transport fare system (no comment), I was intrigued when I heard the dataset for the Melbourne Datathon 2018 was to be large scale, real world Myki usage data. What cool insights can we glean on how our bustling city uses its public transport network? Let’s find out! Best of all, we’ll check it out without transforming it from CSV, or even moving it out of S3.
NOTE
This article is a reposting of my 2018 Medium article.
INFO
At the request of Data Victoria, links to the data discussed in this article have been removed.
Here are a couple of quick stats I gleaned from this 1.8 billion row dataset, with SQL queries that run in seconds, for much less than the cost of a cup of coffee:
- Fares consisted of 65.65% full-fare, 33.65% paying concession fares, and a measly 0.69% free-riding concessions
- The favourite touch-off location for Victorian Seniors Concession card holders on a Sunday is Box Hill Railway Station
AWS Athena
AWS Athena is a fully-managed, serverless query service that allows you to run SQL queries against data stored in S3 buckets. It's a bit like magic. Built on Apache Hive and Facebook's Presto, Athena lets you define (or even discover) the schema of your data, then start running queries instantly. And all you pay for is the data that Athena has to riffle through to run your query, and the cost of storing it in S3.
So first let’s get our data, pop it into S3, and prepare our schema.
Preparing our Myki data in S3
To have a first look and explore the data files, there was originally a public S3 link here (bucket contents are 25.4 GB); as noted above, it has been removed at the request of Data Victoria.
---
Once again, many thanks to Data Science Melbourne for organising the Melbourne Datathon and providing the data. Thanks also to PTV for releasing the dataset.
http://www.datasciencemelbourne.com/datathon/
https://www.ptv.vic.gov.au/
Here is a brief description of the files we're interested in for today:
- events: the Myki touch-on and touch-off transactions themselves, split into Samp_X folders and further divided by scan type, year, and week
- calendar: a date dimension table (day type, month, day of week, and so on)
- cardtypes: a lookup of Myki fare types, including whether they are paid or free and full fare or concession
- stops: stop names, suburbs, postcodes, and coordinates
Create yourself a bucket and copy in the data! (Feel free to just copy one of the `Samp_X` directories if you want to have a leaner look at the data.)
Easy! Now, we tell AWS Athena what our data looks like so it can query it — we do this using our Data Catalog in AWS Glue, which integrates with Athena.
Preparing our data schema in AWS Glue Data Catalogue
AWS Glue is Amazon's fully-managed ETL (extract, transform, load) service, designed to make it easy to prepare and load data from various sources for analytics and batch processing. Today we're just interested in using Glue for the Data Catalogue, as that will allow us to define a schema on the Myki data we just dumped into S3. AWS Glue makes this easy by providing data "crawlers" that can peek at our data and discover a lot of our schema automatically.
- Navigate to AWS Glue
- Select `Crawlers` from the menu
- Select `Add crawler`
- Name our crawler `myki_crawler`
- Click `Next`, we don't need any of the optional configuration here
- Configure `S3` as the data store, with `Specified path in my account`
- Include path for me is `s3://tomwwright-myki-data/`, replace with whatever you called your bucket earlier when you copied the data
- Leave `Exclude patterns` blank, for this dataset we'll just clean up any junk the crawler finds that we don't want
- Click `Next`, select `No` to adding another data store, click `Next` again
- Select `Create an IAM role` (unless you have roles in place for Glue)
- Enter `MykiCrawler` for the name, or whatever you want; I'm a Medium post, not your mother
- Click `Next`
- Select `Run on Demand` as the frequency, we just want to fire this off once
- Click `Next`
- Select `Add database`, let's just call it `myki`, don't need any of the optional config, click `Create`
- Leave all the optional config as is
- Click `Next`
- Click `Finish`!
You should have been returned to the Crawlers screen of AWS Glue, so select `myki_crawler` and hit `Run crawler`. It'll take about 7 minutes to run, in my experience, so maybe grab yourself a coffee or take a quick walk.
Once the crawler has completed we'll be able to view the discovered tables under `Databases` -> `Tables`. The first thing we'll do is delete some junk tables that the crawler has created from examining the non-data files in our S3 bucket: delete all the tables with the classification `Unknown`, as in the screenshot below. We could have saved ourselves this hassle by not copying those files into the bucket in the first place, or by adding them to the crawler's exclude patterns configuration. Oh well, it's easy enough to just delete them.
Opening up the calendar table, we can see all the great stuff the crawler has discovered for us: data format, delimiters, record counts, column types, and so on. The one thing missing is the column names, because that information isn't present in the Myki data files. Below you'll find some column labels (not necessarily all of them) that we need to apply in order to write readable queries against our tables. Select a table and click `Edit schema` in the top right to update the columns.
table: calendar
1 col0 bigint dateid
2 col1 string date
3 col2 bigint year
6 col5 string month
13 col12 string daytype # 'Weekday', 'Weekend', 'Public Holiday'
16 col15 string dayofweek
---
table: cardtypes
1 col0 bigint typeid
2 col1 string typedesc
3 col2 string paidorfree # 'Paid', 'Free'
4 col3 string concession # 'Full Fare', 'Concession'
---
table: events
1 col0 bigint mode
2 col1 string date
3 col2 string datetime
4 col3 bigint cardid
5 col4 bigint cardtypeid
6 col5 bigint vehicleid
8 col7 string routeid
9 col8 string stopid
10 partition_0 samplenum
11 partition_1 onoroff # 'ScanOffTransaction', 'ScanOnTransaction'
12 partition_2 year
13 partition_3 week
---
table: stops
1 col0 bigint stopid
2 col1 string name
3 col2 string longname
4 col3 string stoptype
5 col4 string suburb
6 col5 bigint postcode
10 col9 double lat
11 col10 double long
With our column names in place, we’re finally ready to do some querying!
Querying our schema with AWS Athena
Clicking through to the Athena service we find ourselves first on the Query Editor screen. Cool, let's select our `myki` database and plug in a simple count query!
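With the `myki` database selected, a minimal sanity check looks something like this (only the table name from our crawler run is assumed):

```sql
-- Roughly how many touch-on/touch-off events are we dealing with?
SELECT COUNT(*) AS total_events
FROM events;
```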
When we've got a query we're really happy with we can hit `Save as` and give it a name, then we can easily find it and run it again under `Saved Queries`. Athena also allows us to access our query history under `History`, where we can download any query results.
Partitions
An important metric to notice here for your Athena queries is the data scanned, as pricing for Athena is directly (and only) related to the amount of data read by Athena in the process of running your query:
“You are charged for the number of bytes scanned by Amazon Athena, rounded up to the nearest megabyte, with a 10MB minimum per query. … $5 per TB of data scanned”
As it operates on plain flat files, it is difficult for Athena to be intelligent about the data that it scans, and generally it will need to read every file that your schema specifies as making up the table, unless partitions are specified when creating the schema: "columns" that map to directories, which Athena can use to work out what data actually needs to be scanned for a given query. To illustrate, recall the folder structure of our data.
Our events folder is structured as a series of subfolders that neatly divide the events data by a couple of key columns: a numbered sample folder, whether it is a touch-on or touch-off event, the year, and the week (think keys shaped something like `Samp_0/ScanOffTransaction/2016/<week>/...`). Luckily, our AWS Glue crawler was smart enough to notice all of that, and has already captured it in our schema as the `partition_0` to `partition_3` columns. Nice!
To illustrate the importance of these partitions, I've counted the number of unique Myki cards used in the year 2016 (about 7.4 million, by the way) with two different queries (shown below):
- one using a `LIKE` operator on the date column in our data,
- and one using our `year` partitioning column.
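Sketched out with the column labels we applied earlier, those two queries look roughly like this:

```sql
-- Query 1: filter on the raw date column, which forces a scan of the whole table
SELECT COUNT(DISTINCT cardid) AS cards_used_2016
FROM events
WHERE date LIKE '2016%';

-- Query 2: filter on the year partition column, so Athena only reads
-- the S3 prefixes belonging to 2016
SELECT COUNT(DISTINCT cardid) AS cards_used_2016
FROM events
WHERE year = '2016';
```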
Our first query, using `date LIKE '2016%'`, has to scan the entire events table, all 25 gigabytes of it, because Athena is unable to determine which files might have matching lines in them. Cost for this query is about $0.12.
Our second query, using `year = '2016'`, is able to retrieve our results while scanning only 8 gigabytes of our events data! This is because `year` is one of the partition columns that maps to particular key patterns in our S3 data. This query costs about $0.04 and runs in under one-third of the time!
Intelligently organising your data with partitions is key to being able to run quick and cost-effective queries on Athena. Alright, let's move on!
Popular stops for touch-off
So, my first thought: where do people go using their Myki? Seems like an easy enough question to answer with SQL; let's start by getting a list of the most popular stops.
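A sketch of that query, assuming the column labels from earlier and using the `onoroff` partition column to keep to touch-off events:

```sql
-- Most popular touch-off stops, by raw stop id
SELECT stopid,
       COUNT(*) AS touch_offs
FROM events
WHERE onoroff = 'ScanOffTransaction'
GROUP BY stopid
ORDER BY touch_offs DESC
LIMIT 20;
```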
Cool, we can join this with our `stops` table for a more human-readable representation. Let's do that.
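Roughly like so; note that `stopid` is a string in `events` but a bigint in `stops`, so the join needs a cast (a sketch, not gospel):

```sql
-- Most popular touch-off stops, with readable names from the stops table
SELECT s.longname,
       s.suburb,
       COUNT(*) AS touch_offs
FROM events e
LEFT JOIN stops s
  ON TRY_CAST(e.stopid AS bigint) = s.stopid
WHERE e.onoroff = 'ScanOffTransaction'
GROUP BY s.longname, s.suburb
ORDER BY touch_offs DESC
LIMIT 20;
```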
Hmm. There are a bunch of `stopid` rows that are missing `stops` information… Given they make up most of the top stops and there are the right number of them, my guess is that the `64xxx` stops are the City Loop railway stops, and maybe `24001` is Richmond station?
Apart from that, these results make sense to me; it's not surprising at all to see that the top stops are all railway stops.
Breakdown of fare types
There always seems to be a bit of chatter about the spending on providing concession public transport travel to various groups. What is the breakdown of full-fare vs. concession? Again, seems simple enough. Let's start by aggregating on the card type as present in our `events` table.
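Something along these lines, assuming only the column labels we applied earlier:

```sql
-- Number of events recorded against each card type id
SELECT cardtypeid,
       COUNT(*) AS event_count
FROM events
GROUP BY cardtypeid
ORDER BY event_count DESC;
```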
Nice, so same as last time, let's join to our `cardtypes` table to deliver some more human-readable information:
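A rough version of that join, assuming `cardtypes.typeid` is the key that `events.cardtypeid` points at (which is what the schema suggests):

```sql
-- Events broken down by fare type description
SELECT c.typedesc,
       c.concession,
       c.paidorfree,
       COUNT(*) AS event_count
FROM events e
JOIN cardtypes c ON e.cardtypeid = c.typeid
GROUP BY c.typedesc, c.concession, c.paidorfree
ORDER BY event_count DESC
LIMIT 14;
```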
Full-fare passengers look like an overwhelming majority, at more than 4 times the most common (but vaguely-labelled) `General Concession` fare type. But then the rest of the list is made up of the various distinct concession types, so it might not be so simple. The only free-riding fare type to crack this Top 14 is the `Employee Travel Pass`, which makes sense.
Let's go for an even higher-level look by aggregating again; I'm curious about the overall breakdown of full-fare vs. concession vs. free.
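One way to get that split is to group on the `concession` and `paidorfree` columns and let a window function work out the percentages; a sketch:

```sql
-- Overall split: full fare vs paying concession vs free-riding concession
SELECT c.concession,
       c.paidorfree,
       COUNT(*) AS event_count,
       ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS pct_of_events
FROM events e
JOIN cardtypes c ON e.cardtypeid = c.typeid
GROUP BY c.concession, c.paidorfree
ORDER BY event_count DESC;
```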
Alright, there is the full picture: 65.65% full-fare, 33.65% paying concession fares, and a measly 0.69% free-riding concessions.
Top Sunday destinations for Seniors
Finally, let's try something a little more specific. On a Sunday, where do our senior public transport users like to go?
Joining to our `calendar` table will allow us to determine which dates are a `Sunday`, and our query on fare types before shows us that `cardtypeid = 9` is the `Victorian Seniors Concession` fare type we need to be looking for. Awesome!
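Putting that together, assuming the date strings in `events` and `calendar` line up for the join and that `dayofweek` spells out 'Sunday':

```sql
-- Sunday touch-offs by Victorian Seniors Concession cards, per stop id
SELECT e.stopid,
       COUNT(*) AS touch_offs
FROM events e
JOIN calendar c ON e.date = c.date
WHERE e.cardtypeid = 9
  AND e.onoroff = 'ScanOffTransaction'
  AND c.dayofweek = 'Sunday'
GROUP BY e.stopid
ORDER BY touch_offs DESC
LIMIT 20;
```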
Now, just the same as in our first query, we join to `stops` to give us some human-readable information, remembering again that some stops seem to be missing their information…
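The same query again, joined to `stops` for names, with the same string-to-bigint cast as before:

```sql
-- Favourite Sunday touch-off stops for Victorian Seniors, with readable names
SELECT s.longname,
       s.suburb,
       COUNT(*) AS touch_offs
FROM events e
JOIN calendar c ON e.date = c.date
LEFT JOIN stops s ON TRY_CAST(e.stopid AS bigint) = s.stopid
WHERE e.cardtypeid = 9
  AND e.onoroff = 'ScanOffTransaction'
  AND c.dayofweek = 'Sunday'
GROUP BY s.longname, s.suburb
ORDER BY touch_offs DESC
LIMIT 20;
```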
Our mysterious unnamed stops (almost certainly the City Loop stops) remain at the head of the pack, but apparently our Myki-wielding Seniors are fond of spending time in Box Hill, Caulfield, and Footscray. Who knew.
Wrapping up
It’s good to take a moment and think about how cool this sort of technology is, because it is easy to forget once you dive in. The ability to be running SQL queries, with joins, against nothing more than CSV files sitting in S3. That’s impressive.
This Myki data is also not a trivial amount to be analysing. Over 1.8 billion rows, and we're able to crunch it in less than 10 seconds. And it costs absolute pennies. The queries listed in this article all-up scan about 145 gigabytes of S3 data, costing me a whopping total of…
$0.75
Happy querying!
–Tom