Analyzing Your Email with SQLAnalyzing Your Email with SQL
We talk about data a lot at the Periscope office. Recently, the dicussion turned to how much time we spend managing our email inboxes. Austin, one of our engineers, decided to analyze his own email usage for answers.
Ever since I first heard the phrase Inbox Zero I have been aggressively pursuing it with careful inbox management and quick email response times. I wanted to see how I was doing, so I downloaded a data dump of my gmail data.
Getting Your Data Download from Gmail
To analyze how much time I was spending on email, I decided to pull out the subject, from, to, and date fields.
Analyzing the Data
The first thing I do with any dataset is to look for a date column and plot the growth over time. I have been using this email account for about one year, so I decided to look at a weekly overview.
I quickly encountered dates that came with different timezones and varying formats, for example Wed, 02 Mar 2016 14:43:34 +0200 versus Tue, 1 Mar 2016 19:25:02 -0600 (CST). Redshift had some trouble with the variable timezone formats, and because I was interested in weekly granularity, I chopped off the timestamp by removing anything that came after + and -.
(regexp_replace(date, ' [+\-].*'))::timestamp
Here are my results:
Immediately an elbow in the growth popped out at me. In January we changed how we were logging information from our production servers, which led to a much greater volume of emails.
The engineering team recognized that while having the alerts was great, the email flood was not. We switched to using Rollbar to aggregate emails coming from the servers, and the volume tapers off at the end of January. This led me to ask who was sending me all of these emails, so I drew a pie chart of the top ten accounts sending me email.
As expected, most of my volume comes from email lists that handle things like our server emails, signups from people interested in trying out Periscope, and messages to our all-purpose email, [email protected]
I saw my coworkers Andreas and Sunny impressively show up in seventh and eighth place, respectively, and wanted to take a second look while ignoring anything coming from an automated address.
I decided to use regexp_substr to parse the sender out of the email address. Some of the emails come from our previous periscope.io domain, some use the named noreply <[email protected]> format, while others do not.
To do a consistent comparison I only wanted to consider any characters up to a space or @ symbol.
regexp_substr("from", '[^@ ]*')
This regex says to record any characters that are not @ or space, and then to stop when it sees one of those characters. I can then use the extracted name to quickly filter out lists:
regexp_substr("from", '[^@ ]*') not in (
Re-running my weekly cumulative graph with this filter dropped my total from 255,000 to 45,600. Progress! Knowing now that my analysis would not be dominated by email lists, I wanted to ask a more involved question: whose emails do I respond to most quickly, and whose sit in my inbox the longest?
To start answering this question, I decided to do a self-join on my email. Since different email servers make different modifications to the subject line, I first I stripped the Re: out of the subjects so that I could match emails by subject:
replace(subject, 'Re: ', '') as subject
I bundled my data cleaning into a sql view clean_email, and next looked at emails with the same subject matching the from sender of one email to the to sender of the next.
I built a list of matching emails that I had responded to within one week:
with all_response_times as (
, e1.created_at as recv_time
, e2.created_at as send_time
clean_email e1, clean_email e2
e1.subject = e2.subject
and e1."from" = e2."to"
and e1.created_at < e2.created_at
and e1."to" like '%[email protected]%'
and datediff(minute, recv_time, send_time) <= (168*60)
And simplified the problem by only looking at emails with a single response:
replied_once as (
) where count = 2
select * from response_times, replied_once
where response_times.subject = replied_once.subject
Then I plotted the average over time:
, avg(datediff(minute, recv_time, send_time))
My response times have been pretty dynamic! I spent a lot of time in August, September, and October building charts for new Periscope trials, and would have to spend time running queries before I could respond.
In February, I stepped up my recruiting efforts (we’re hiring!) and my email response times have definitely seen the effects of coordinating schedules since.
Finally I wanted to know how much time I was spending communicating with Periscope teams over email. We use Slack for most of our internal communications, so I wanted to see where we were using email to communicate.
I mapped each of the internal email addresses to their Periscope team, and looked at the sum and average number of hours it takes me to reply:
I am clearly on top of my marketing email response time, but it looks like I have some work to do with the sales team. If you want to see where you are spending your time with email, download your own Gmail archive and start exploring!