At first, the data set in BigQuery might seem confusing to work with. If you've
worked with any of our public BigQuery data sets in the past, you're probably
used to seeing tables where every row is a straightforward set of simple values.
The truth of the matter is that BigQuery can get much more sophisticated than
that. The rows of a BigQuery table don't just have to be straightforward
key-value pairs. They can look more like rows of JSON objects, containing some
simple data (like strings, integers, and floats), but also more complex data
like arrays, structs, or even arrays of structs. Something a little more like
this:
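(To make this concrete, here's a made-up little table in standard SQL whose events column is an array of structs. The names here are just for illustration and aren't the real Firebase Analytics fields.)
#standardSQL
-- A hypothetical row whose "events" column is an ARRAY of STRUCTs
WITH app_data AS (
  SELECT
    "user_1234" AS user_id,
    [STRUCT("round_completed" AS name, 1465251600000000 AS timestamp_micros),
     STRUCT("app_exception" AS name, 1465251700000000 AS timestamp_micros)] AS events
)
SELECT *
FROM app_data
-- Note that a filter like "WHERE events.name = ..." would fail here,
-- because events is an ARRAY of STRUCTs, not a single STRUCT.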
Okay, so the error message BigQuery gives you when you try that won't win any awards for "Best Error Message of 2017"2, but if you think about it, the reason it's barfing makes sense. You're trying to compare a string value to "an element of a struct that's buried inside of an array". Sure, that element ends up being a string, but they're fairly different objects.
So to fix this, you can use the UNNEST function. The UNNEST function will take an array and break it out into each of its individual elements. Let's start with a simple example. Calling:
#standardSQL
WITH data AS (
  SELECT "primes under 15" AS description,
    [1,2,3,5,7,11,13] AS primes_array)
SELECT *
FROM data
will give you back a single row consisting of the description string and that whole array of primes, side by side. If you want each prime in its own row instead, try something like this:
#standardSQL
WITH data AS (
  SELECT "primes under 15" AS description,
    [1,2,3,5,7,11,13] AS primes_array)
SELECT description, prime
FROM data CROSS JOIN UNNEST(primes_array) AS prime
What you're basically saying is, "Hey, BigQuery, please break up that primes_array into its individual members. Then join each of these members with a clone of the original row." So you end up with results similar to before, but now each prime gets its own row, paired with a copy of the description.
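If you're curious what that expanded structure looks like, you can swap in a SELECT * for a moment (a quick sketch, with the rows you should expect back shown as comments):
#standardSQL
WITH data AS (
  SELECT "primes under 15" AS description,
    [1,2,3,5,7,11,13] AS primes_array)
SELECT *
FROM data CROSS JOIN UNNEST(primes_array) AS prime
-- description        primes_array       prime
-- "primes under 15"  [1,2,3,5,7,11,13]  1
-- "primes under 15"  [1,2,3,5,7,11,13]  2
-- ... (one row per array element) ...
-- "primes under 15"  [1,2,3,5,7,11,13]  13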
You'll notice that the original primes_array is still included in the data structure. In some cases (as you'll see below), this can be useful. In this particular case, I found it was a little confusing, which is why I only asked for the individual fields of description and prime instead of SELECT *.3
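As an aside, if you like the convenience of SELECT * but want to leave the array column out of the results, the EXCEPT modifier from that footnote looks something like this:
#standardSQL
WITH data AS (
  SELECT "primes under 15" AS description,
    [1,2,3,5,7,11,13] AS primes_array)
SELECT * EXCEPT (primes_array)
FROM data CROSS JOIN UNNEST(primes_array) AS prime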
It's also common convention to replace that CROSS JOIN syntax with a comma, so you get a query that looks like this:
#standardSQL
WITH data AS (
  SELECT "primes under 15" AS description,
    [1,2,3,5,7,11,13] AS primes_array)
SELECT description, prime
FROM data, UNNEST(primes_array) AS prime
It's the exact same query as the previous one; it's just a little more readable.
Plus, I can now stand by my original statement that this data format means you don't have to perform any JOINs. :)
And the nice thing here is that I now have one piece of "prime" data per row that I can interact with. So I can start to do comparisons like this:
#standardSQL
WITH data AS (
  SELECT "primes under 15" AS description,
    [1,2,3,5,7,11,13] AS primes_array)
SELECT description, prime
FROM data, UNNEST(primes_array) AS prime
WHERE prime > 8
This gives me back just the primes between 8 and 15: namely, 11 and 13.
So going back to our Firebase Analytics data, I can now use the UNNEST function to look for events that have a specific name.
#standardSQL
SELECT event.name, event.timestamp_micros
FROM `firebase-analytics-sample-data.android_dataset.app_events_20160607`,
  UNNEST(event_dim) AS event
WHERE event.name = "round_completed"
As you'll recall, events have their own params array, which contains all of the event parameters. If I were to UNNEST those as well, I'd be able to query for specific events that contain specific event parameter values:
#standardSQL
SELECT event, event.name, event.timestamp_micros
FROM `firebase-analytics-sample-data.android_dataset.app_events_20160607`,
  UNNEST(event_dim) AS event,
  UNNEST(event.params) AS event_param
WHERE event.name = "round_completed"
  AND event_param.key = "score"
  AND event_param.value.int_value > 10000
Note that in this case, I am selecting "event" as one of the fields in my query, which gives me the original array of all my event parameters nicely grouped together in my table results.
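If you'd rather pull out just the matching parameter value instead of the whole event struct, a small variation like this should also work (a sketch that reuses the same sample table and fields from the query above):
#standardSQL
SELECT event.name, event.timestamp_micros, event_param.value.int_value AS score
FROM `firebase-analytics-sample-data.android_dataset.app_events_20160607`,
  UNNEST(event_dim) AS event,
  UNNEST(event.params) AS event_param
WHERE event.name = "round_completed"
  AND event_param.key = "score"
  AND event_param.value.int_value > 10000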
Querying against user properties works in a similar manner. Let's say I'm curious as to what language my users prefer using for my app, something our app is tracking in a "language" user property. First, I'll use the UNNEST function to get just a list of each user and their preferred language.
#standardSQL
SELECT
  user_dim.app_info.app_instance_id AS unique_id,
  MAX(user_prop.key) AS keyname,
  MAX(user_prop.value.value.string_value) AS keyvalue
FROM `firebase-analytics-sample-data.android_dataset.app_events_20160607`,
  UNNEST(user_dim.user_properties) AS user_prop
WHERE user_prop.key = "language"
GROUP BY unique_id
And then I can use that as my inner selection to grab the total number of users4 that fit into each group:
#standardSQL
SELECT keyvalue, count(*) AS count
FROM (
  SELECT
    user_dim.app_info.app_instance_id AS unique_id,
    MAX(user_prop.key) AS keyname,
    MAX(user_prop.value.value.string_value) AS keyvalue
  FROM `firebase-analytics-sample-data.android_dataset.app_events_20160607`,
    UNNEST(user_dim.user_properties) AS user_prop
  WHERE user_prop.key = "language"
  GROUP BY unique_id
)
GROUP BY keyvalue
ORDER BY count DESC
I can also UNNEST both my event parameters and my user properties if I want to create one great big query (no pun intended) that looks at events with a specific name where an event parameter matches particular criteria, while also filtering by users who meet certain criteria:
#standardSQL
SELECT user_dim, event, event.name, event.timestamp_micros
FROM `firebase-analytics-sample-data.android_dataset.app_events_20160607`,
  UNNEST(event_dim) AS event,
  UNNEST(event.params) AS event_param,
  UNNEST(user_dim.user_properties) AS user_prop
WHERE event.name = "round_completed"
  AND event_param.key = "squares_daubed"
  AND event_param.value.int_value > 20
  AND user_prop.key = "elite_powers"
  AND (CAST(user_prop.value.value.string_value AS int64)) > 1
Once you start playing around with the UNNEST function, you'll find that it's really powerful and it can make working with Firebase Analytics data a lot more fun. If you want to find out more, you can check out the Working with Arrays section of BigQuery's standard SQL documentation.
And don't forget, you can query a full terabyte of data for free every month with BigQuery, so don't be afraid to play around with it. Go crazy, you array expander, you!
1 The BigQuery team has asked me to inform you that this is really because standard SQL is the preferred SQL dialect for querying data stored in BigQuery. But I'm pretty sure they're just saying that so they get invited to all the good parties.
2 Yet another year the Messies have slipped from our grasp!
3 I could have also done this by saying "SELECT * EXCEPT (primes_array)", which can be pretty convenient sometimes.
4 Okay, technically, each "App Instance" -- a user interacting with my app from multiple devices would get counted multiple times here.