How to Extract UTM Parameters in GA4 BigQuery: 3 Methods (and Which to Use When)
If you're working with GA4 data in BigQuery, you've come across URLs with query parameters, UTM tags like utm_source and utm_medium, click IDs like gclid, plus all the custom parameters your marketing team has invented over the years. Pulling them out cleanly is one of those tasks that looks simple, then quietly turns into a regex headache.
The good news: as of early 2024, the typical case got much easier. GA4 added a collected_traffic_source struct to its BigQuery export that gives you UTM values already parsed, per event, with no regex needed.
The interesting bit: you still need the older methods for non-standard parameters, custom query strings, historic data from before 2024, or when you want to extract something collected_traffic_source doesn't cover.
This post covers all three approaches, the modern way first, then the regex method for single parameters, then the array method for multiple or unknown ones. Plus the gotchas that catch most people out.
Why query parameters still matter
A quick reminder of what these are for:
Campaign attribution — utm_source and utm_campaign tie sessions back to where you actually spent the money
Audience segmentation — separate paid social from organic, email from referral, by their entry point
Content performance — see which landing page or content piece each visitor came from
Click IDs — gclid, fbclid, msclkid, ttclid, li_fat_id connect ad clicks to platform-specific reporting and unlock closed-loop attribution
Once these are in their own columns in BigQuery, you can join them to revenue, build proper attribution models, and stop guessing.
The modern way: use collected_traffic_source
Since early 2024, GA4 has populated a collected_traffic_source struct on every event in the BigQuery export. It contains pre-parsed UTM values and Google's click IDs at the event level, no regex, no array unnesting, no string splitting required.
| Field | What it contains | Example |
|---|---|---|
collected_traffic_source.manual_source | utm_source | |
collected_traffic_source.manual_medium | utm_medium | cpc |
collected_traffic_source.manual_campaign_name | utm_campaign | spring_sale |
collected_traffic_source.manual_campaign_id | utm_id | 123 |
collected_traffic_source.manual_term | utm_term | running_shoes |
collected_traffic_source.manual_content | utm_content | ad_a |
collected_traffic_source.manual_source_platform | utm_source_platform | google_ads |
collected_traffic_source.gclid | Google Ads click ID | EAIaIQ... |
collected_traffic_source.dclid | Display & Video 360 click ID | (varies) |
collected_traffic_source.srsltid | Google Shopping click ID | (varies) |
The query is now a single statement:
select event_date, collected_traffic_source.manual_source as utm_source, collected_traffic_source.manual_medium as utm_medium, collected_traffic_source.manual_campaign_name as utm_campaign, collected_traffic_source.gclid from `<your-project>.<ga4-dataset>.<ga4-table>` where event_name = 'page_view'
A few things worth knowing:
These values are captured at the event level - they reflect whatever was on the URL for that specific event
For session-level last-click attribution, use session_traffic_source_last_click instead - added more recently and very useful
For user-level first-touch attribution, use the traffic_source struct (this has existed for longer)
collected_traffic_source is not backfilled - older data from before it was introduced still needs URL extraction
When you still need to extract from the URL
The collected fields cover standard UTMs and Google's click IDs. You'll still need URL extraction for:
Custom parameters — ?ref=, ?aff=, ?promo=, ?utm_custom_id=, or anything else your marketing team has invented
Click IDs not in collected_traffic_source — fbclid (Meta), ttclid (TikTok), msclkid (Microsoft), li_fat_id (LinkedIn). At time of writing, only Google's own click IDs have been added to the collected fields
Historic data from before early 2024
Edge cases where GA4 has stripped or transformed the standard parameters in some way
For these, the two methods below are still the answer.
Method 1: single parameter with regexp_extract
If you need one specific parameter, regexp_extract is the cleanest tool:
select
regexp_extract(
(select value.string_value from unnest(event_params) where key = 'page_location'),
r'[?&]utm_source=([^]+)'
) as utm_source
from
`<your-project>.<ga4-dataset>.<ga4-table>`
where
event_name = 'page_view'
A couple of changes worth flagging from older versions of this query:
The regex is now [?&]utm_source=([^&#]+) — [?&] is cleaner than the non-capturing group, and [^&#]+ stops at both & and #, so URL fragments don't leak into the value
Added a where event_name = 'page_view' filter, because page_location is only set on certain events. Without this, you'll get a lot of NULLs in your output
Example Output
| URL | utm_source |
|---|---|
example.com/landing?utm_source=facebook&utm_medium=social | |
example.com/product?utm_source=google&utm_medium=cpc | |
example.com/contact?utm_source=linkedin |
Use this method when:
You only need one or two specific parameters
You're extracting a non-standard parameter (fbclid, ttclid, your own custom tag)
You want minimum query complexity
Method 2: all parameters into an array
If you want everything in the query string, or you don't know which parameters will be present, split the URL into an array of key-value pairs:
with url_params as (
select
event_date,
array(
select as struct
split(keyval, '=')[safe_offset(0)] as key,
split(keyval, '=')[safe_offset(1)] as value
from unnest(
split(
split(
(select value.string_value from unnest(event_params) where key = 'page_location'),
'?'
)[safe_offset(1)],
'&'
)
) as keyval
) as url_query
from `<your-project>.<ga4-dataset>.<ga4-table>`
where event_name = 'page_view'
)
select
event_date,
url_query,
(select value from unnest(url_query) where key = 'utm_source') as utm_source,
(select value from unnest(url_query) where key = 'utm_medium') as utm_medium,
(select value from unnest(url_query) where key = 'utm_campaign') as utm_campaign,
(select value from unnest(url_query) where key = 'fbclid') as fbclid
from url_params
This builds a url_query array column where each element is a {key, value} struct. From there you can pull any parameter by name without writing a new regex for each one.
Use this when:
You want multiple parameters at once
You're exploring data and don't know which parameters are present
You're tracking custom or evolving parameters
Common gotchas
Things that will catch you out regardless of method:
URL-encoded values stay encoded. A URL like ?utm_campaign=spring%20sale&utm_content=ad%201 will return spring%20sale and ad%201 - the percent encoding doesn't get reversed automatically. Wrap your extraction in safe.url_decode() (newer BigQuery function) to decode them, or do a replace() for the common cases.
Case sensitivity. UTM_SOURCE is not the same as utm_source to either the regex or the array key lookup. If you suspect mixed casing, normalise with lower() — either on the source string for Method 1, or on the key comparison for Method 2.
Missing query strings. URLs with no ? at all return NULL from both methods. Fine in itself, but in Method 2 the entire url_query array will be empty for those events — worth being aware of when you write downstream queries.
Fragments. URLs like example.com/page?utm_source=email#section-2 shouldn't cause issues with the updated regex ([^&#]+ excludes #), but if you're using older regex patterns that don't, you'll end up with email#section-2 as your utm_source. Easy thing to miss in legacy queries.
Multiple equals signs in values. A redirect-style URL like ?redirect=https%3A//example.com/?foo=bar will mangle in Method 2, because the split on = only takes index 0 and 1. For URLs that carry other URLs as values, encode them properly at the source (%3D for =) - or fall back to Method 1 with a regex tailored to the specific parameter.
Which method should you use?
| Use case | Recommended method |
|---|---|
| Standard UTMs from data after early 2024 | collected_traffic_source |
| Google click IDs (gclid, dclid, srsltid) | collected_traffic_source |
| Last-click session attribution | session_traffic_source_last_click |
| First-touch user attribution | traffic_source (user-level struct) |
| One specific non-standard parameter | Method 1 (regex) |
fbclid, ttclid, msclkid, li_fat_id | Method 1 (regex) |
| Multiple parameters at once | Method 2 (array) |
| Exploring data — unknown parameters | Method 2 (array) |
| Historic data from before early 2024 | Method 1 or 2 |
Putting query parameters to work
Once the parameters are in columns, the analysis options open up:
Campaign performance — join UTMs to sessions and conversions, slice by source / medium / campaign
Audience behaviour — segment by entry-point parameter to see how users from different sources behave on site
Customer journey mapping — combine event-level UTMs with session-level last-click to see how attribution shifts across touchpoints
Funnel analysis — build Looker Studio funnels grouped by acquisition source
Cross-platform reconciliation — extract fbclid, ttclid, msclkid and gclid then join them to the equivalent records in Meta Business Manager, TikTok Ads Manager, Microsoft Ads or Google Ads. This is the foundation of closed-loop attribution
CRM enrichment — pull the parameters into your CRM workflow so Sales sees which campaign brought each lead in
Final thoughts
The methods in this post haven't really changed because BigQuery has changed, they've changed because GA4 has. The lift-and-go answer for the most common cases is now collected_traffic_source, and that's a good thing. URL extraction is still genuinely useful, just not as universally required as it used to be.
If you're starting fresh on a GA4 BigQuery setup, use collected_traffic_source for everything it covers, and only reach for regex or array methods for the gaps — custom parameters, non-Google click IDs, historic data. If you're working with older data, the regex and array methods still hold up — just be aware of the gotchas above.
If you've got a specific extraction challenge, or want help on the pipelines that depend on this, like BigQuery → Looker Studio reporting or BigQuery → CRM for offline conversions, get in touch.