SQL Murder Mystery

Working through an interactive method for learning SQL syntax

Posted by Asa Hess-Matsumoto on Monday, November 1, 2021

During a recent trip, I had a go with a training site I discovered: A SQL Murder Mystery. This website is a clever method for getting oriented to the syntax used for querying SQL databases.

The premise of the website is that you are tasked with (re)discovering a lost case file and identifying a murder suspect. All of the information needed to accomplish this is available within the SQLite database, but it’s incumbent upon you to google and apply the necessary methods in order to make this happen.

The first command(s) provided are

SELECT name 
  FROM sqlite_master
 where type = 'table'

which dump the tables contained on the sqlite_master database. These include the following:

  • crime_scene_report
  • drivers_license
  • person
  • facebook_event_checkin
  • interview
  • get_fit_now_member
  • get_fit_now_check_in
  • income
  • solution

From the overview we know that the function of the solution table is to merely affirm that we have the right answer. When we have our answer, we can confirm it’s correct by running the pre-defined query against the table.

Beyond that, we also know that our starting clue is that the murder occurred 15 Jan 2018 in SQL City. Taking all this into account, we can start by querying the crime_scene_report for information. This is doable either by getting information on the table column types, or by running a basic SELECT * FROM

. The latter simply dumps all the entries from the respective table.

In order to discover the appropriate casefile, we can run:

SELECT * from crime_scene_report where date = '20180115'and type = 'murder' and city = 'SQL City'

This command pulls all entries from the crime_scene_report table whose entries include the date 20180115, type murder, and city SQL City. In our case, this returns a single entry with the following description:

“Security footage shows that there were 2 witnesses. The first witness lives at the last house on “Northwestern Dr”. The second witness, named Annabel, lives somewhere on “Franklin Ave”.”

With this information, we can perform additional queries to the person table in order to help identify who the two witnesses are:

select * from person where name like 'Annabel%' 
and address_street_name = 'Franklin Ave';

select * from person 
where address_street_name = 'Northwestern Dr' 
order by address_number;

The first query leverages the LIKE operator in order to pull partial matches from the person table; if we had used a SELECT * from person where name = ‘Annabel’ instead, we would get no results (since the “=” operator would require exact matches, and we don’t know Annabel’s last name). We further whittle down the potential number of Annabels we might find (same first name, different last name) by adding the and address_street_name qualifier.

The second query sorts all of the returned entries by the address number using the ORDER BY operator; we want this in order to determine which house is “the last house” on Northwestern Dr.

In running these queries, we turn up Annabel Miller and Morty Shapiro. With their names and id numbers, we can query the interview table in order to learn what they saw:

select * from interview where person_id = '16371' or person_id = '14887'

This query returns all entries from the interview table that include matches with the person_id values we pulled for Annabel and Morty.

Annabel Miller: I heard a gunshot and then saw a man run out. He had a “Get Fit Now Gym” bag. The membership number on the bag started with “48Z”. Only gold members have those bags. The man got into a car with a plate that included “H42W”.

Morty Schapiro: I saw the murder happen, and I recognized the killer from my gym when I was working out last week on January the 9th.

There’s a lot of information for us to work with here, but from Annabel’s description we might think to query the get_fit_now_member and (perhaps) the get_fit_now_check_in tables.

select * from get_fit_now_member 
where membership_status = 'gold' 
and id like '48Z%';

select * from get_fit_now_check_in 
where membership_id like "48Z%" 
and check_in_date = "20180109";

Nothing new in concept with either query; the first builds on Annabel’s interview and the second on Morty’s. Either way, they both return the same potential suspects: Joe Germuska and Jeremy Bowers.

In order for us to determine which of the two suspects it could possibly be, I took their respective person_id values and passed them back through the interview table:

select * from interview where person_id = '28819' or person_id = '67318'

This returns only one interview, and its pretty damning:

Jeremy Bowers: “I was hired by a woman with a lot of money. I don’t know her name but I know she’s around 5’5” (65”) or 5’7” (67”). She has red hair and she drives a Tesla Model S. I know that she attended the SQL Symphony Concert 3 times in December 2017.”

Well, it looks like Jeremy was the murderer, and we can confirm this by running it in the solutions predefined query. However, both the output from solutions and the interview discovered above suggest there was someone else orchestrating the whole thing.

Moreover, since this is a SQL learning challenge at its heart, the solutions output challenges us to try and solve this in as few individual queries as possible. After some fumbling about, I ended up performing the following single query:

select * from person 
where license_id in (
  select id from drivers_license where id in (
    select license_id from person where id in (
      select person_id from facebook_event_checkin 
      where event_name = "SQL Symphony Concert" 
      and date like "201712%" 
      group by person_id having count(*) = 3)))

There’s a lot to unpack here, and I’ll admit upfront that this isn’t the prettiest way to go about it.

What’s happening in this single query is actually a series of nested queries:

  • The innermost query is pulling person_id values from the facebook_event_checkin table whose descriptions match Jeremy’s interview (there are two that are returned); more specifically, we identify any entry that appears only three times having a matching date value in the range of December 2017.
  • These person_id values are passed to the person table in order to identify their license_ids.
  • These license_id values then used to determine ids from the drivers_license table.
  • These are then passed again back to the person table in order to identify the applicable person.

From step 1 to 2, we come out with two potential suspects. From steps 2 to 3, we prune the two suspects down to one, since only one of the suspects has a license in the drivers_license table. Step 3 to 4 pipes that output back to the person table in order to reveal the mastermind as: Miranda Priestly.

All in all this was a fun exercise. I learned a couple of things about SQL queries that I hadn’t before; I’m also aware that I could clean up my final answer some (namely through the use of the EXISTS operator, which would save the crude piping back and forth to the person table). I think that this is a wonderful educational tool to introduce someone to SQL programming.