How to Extract UTM Parameters in GA4 BigQuery: 3 Methods (and Which to Use When)

If you're working with GA4 data in BigQuery, you've come across URLs with query parameters, UTM tags like utm_source and utm_medium, click IDs like gclid, plus all the custom parameters your marketing team has invented over the years. Pulling them out cleanly is one of those tasks that looks simple, then quietly turns into a regex headache.

The good news: as of early 2024, the typical case got much easier. GA4 added a collected_traffic_source struct to its BigQuery export that gives you UTM values already parsed, per event, with no regex needed.

The interesting bit: you still need the older methods for non-standard parameters, custom query strings, historic data from before 2024, or when you want to extract something collected_traffic_source doesn't cover.

This post covers all three approaches, the modern way first, then the regex method for single parameters, then the array method for multiple or unknown ones. Plus the gotchas that catch most people out.

Why query parameters still matter

A quick reminder of what these are for:

  • Campaign attribution — utm_source and utm_campaign tie sessions back to where you actually spent the money

  • Audience segmentation — separate paid social from organic, email from referral, by their entry point

  • Content performance — see which landing page or content piece each visitor came from

  • Click IDs — gclid, fbclid, msclkid, ttclid, li_fat_id connect ad clicks to platform-specific reporting and unlock closed-loop attribution

Once these are in their own columns in BigQuery, you can join them to revenue, build proper attribution models, and stop guessing.

The modern way: use collected_traffic_source

Since early 2024, GA4 has populated a collected_traffic_source struct on every event in the BigQuery export. It contains pre-parsed UTM values and Google's click IDs at the event level, no regex, no array unnesting, no string splitting required.

Field What it contains Example
collected_traffic_source.manual_sourceutm_sourcefacebook
collected_traffic_source.manual_mediumutm_mediumcpc
collected_traffic_source.manual_campaign_nameutm_campaignspring_sale
collected_traffic_source.manual_campaign_idutm_id123
collected_traffic_source.manual_termutm_termrunning_shoes
collected_traffic_source.manual_contentutm_contentad_a
collected_traffic_source.manual_source_platformutm_source_platformgoogle_ads
collected_traffic_source.gclidGoogle Ads click IDEAIaIQ...
collected_traffic_source.dclidDisplay & Video 360 click ID(varies)
collected_traffic_source.srsltidGoogle Shopping click ID(varies)

The query is now a single statement:

select
  event_date,
  collected_traffic_source.manual_source as utm_source,
  collected_traffic_source.manual_medium as utm_medium,
  collected_traffic_source.manual_campaign_name as utm_campaign,
  collected_traffic_source.gclid
from
  `<your-project>.<ga4-dataset>.<ga4-table>`
where
  event_name = 'page_view'

A few things worth knowing:

  • These values are captured at the event level - they reflect whatever was on the URL for that specific event

  • For session-level last-click attribution, use session_traffic_source_last_click instead - added more recently and very useful

  • For user-level first-touch attribution, use the traffic_source struct (this has existed for longer)

  • collected_traffic_source is not backfilled - older data from before it was introduced still needs URL extraction

When you still need to extract from the URL

The collected fields cover standard UTMs and Google's click IDs. You'll still need URL extraction for:

  • Custom parameters — ?ref=, ?aff=, ?promo=, ?utm_custom_id=, or anything else your marketing team has invented

  • Click IDs not in collected_traffic_source — fbclid (Meta), ttclid (TikTok), msclkid (Microsoft), li_fat_id (LinkedIn). At time of writing, only Google's own click IDs have been added to the collected fields

  • Historic data from before early 2024

  • Edge cases where GA4 has stripped or transformed the standard parameters in some way

For these, the two methods below are still the answer.

Method 1: single parameter with regexp_extract

If you need one specific parameter, regexp_extract is the cleanest tool:

select
  regexp_extract(
    (select value.string_value from unnest(event_params) where key = 'page_location'),
    r'[?&]utm_source=([^&#]+)'
  ) as utm_source
from
  `<your-project>.<ga4-dataset>.<ga4-table>`
where
  event_name = 'page_view'

A couple of changes worth flagging from older versions of this query:

  • The regex is now [?&]utm_source=([^&#]+) — [?&] is cleaner than the non-capturing group, and [^&#]+ stops at both & and #, so URL fragments don't leak into the value

  • Added a where event_name = 'page_view' filter, because page_location is only set on certain events. Without this, you'll get a lot of NULLs in your output

Example Output

URL utm_source
example.com/landing?utm_source=facebook&utm_medium=socialfacebook
example.com/product?utm_source=google&utm_medium=cpcgoogle
example.com/contact?utm_source=linkedinlinkedin

Use this method when:

  • You only need one or two specific parameters

  • You're extracting a non-standard parameter (fbclid, ttclid, your own custom tag)

  • You want minimum query complexity

Method 2: all parameters into an array

If you want everything in the query string, or you don't know which parameters will be present, split the URL into an array of key-value pairs:

with url_params as (
  select
    event_date,
    array(
      select as struct
        split(keyval, '=')[safe_offset(0)] as key,
        split(keyval, '=')[safe_offset(1)] as value
      from unnest(
        split(
          split(
            (select value.string_value from unnest(event_params) where key = 'page_location'),
            '?'
          )[safe_offset(1)],
          '&'
        )
      ) as keyval
    ) as url_query
  from `<your-project>.<ga4-dataset>.<ga4-table>`
  where event_name = 'page_view'
)
select
  event_date,
  url_query,
  (select value from unnest(url_query) where key = 'utm_source') as utm_source,
  (select value from unnest(url_query) where key = 'utm_medium') as utm_medium,
  (select value from unnest(url_query) where key = 'utm_campaign') as utm_campaign,
  (select value from unnest(url_query) where key = 'fbclid') as fbclid
from url_params

This builds a url_query array column where each element is a {key, value} struct. From there you can pull any parameter by name without writing a new regex for each one.

Use this when:

  • You want multiple parameters at once

  • You're exploring data and don't know which parameters are present

  • You're tracking custom or evolving parameters

Common gotchas

Things that will catch you out regardless of method:

URL-encoded values stay encoded. A URL like ?utm_campaign=spring%20sale&utm_content=ad%201 will return spring%20sale and ad%201 - the percent encoding doesn't get reversed automatically. Wrap your extraction in safe.url_decode() (newer BigQuery function) to decode them, or do a replace() for the common cases.

Case sensitivity. UTM_SOURCE is not the same as utm_source to either the regex or the array key lookup. If you suspect mixed casing, normalise with lower() — either on the source string for Method 1, or on the key comparison for Method 2.

Missing query strings. URLs with no ? at all return NULL from both methods. Fine in itself, but in Method 2 the entire url_query array will be empty for those events — worth being aware of when you write downstream queries.

Fragments. URLs like example.com/page?utm_source=email#section-2 shouldn't cause issues with the updated regex ([^&#]+ excludes #), but if you're using older regex patterns that don't, you'll end up with email#section-2 as your utm_source. Easy thing to miss in legacy queries.

Multiple equals signs in values. A redirect-style URL like ?redirect=https%3A//example.com/?foo=bar will mangle in Method 2, because the split on = only takes index 0 and 1. For URLs that carry other URLs as values, encode them properly at the source (%3D for =) - or fall back to Method 1 with a regex tailored to the specific parameter.

Which method should you use?

Use case Recommended method
Standard UTMs from data after early 2024collected_traffic_source
Google click IDs (gclid, dclid, srsltid)collected_traffic_source
Last-click session attributionsession_traffic_source_last_click
First-touch user attributiontraffic_source (user-level struct)
One specific non-standard parameterMethod 1 (regex)
fbclid, ttclid, msclkid, li_fat_idMethod 1 (regex)
Multiple parameters at onceMethod 2 (array)
Exploring data — unknown parametersMethod 2 (array)
Historic data from before early 2024Method 1 or 2

Putting query parameters to work

Once the parameters are in columns, the analysis options open up:

  • Campaign performance — join UTMs to sessions and conversions, slice by source / medium / campaign

  • Audience behaviour — segment by entry-point parameter to see how users from different sources behave on site

  • Customer journey mapping — combine event-level UTMs with session-level last-click to see how attribution shifts across touchpoints

  • Funnel analysis — build Looker Studio funnels grouped by acquisition source

  • Cross-platform reconciliation — extract fbclid, ttclid, msclkid and gclid then join them to the equivalent records in Meta Business Manager, TikTok Ads Manager, Microsoft Ads or Google Ads. This is the foundation of closed-loop attribution

  • CRM enrichment — pull the parameters into your CRM workflow so Sales sees which campaign brought each lead in

Final thoughts

The methods in this post haven't really changed because BigQuery has changed, they've changed because GA4 has. The lift-and-go answer for the most common cases is now collected_traffic_source, and that's a good thing. URL extraction is still genuinely useful, just not as universally required as it used to be.

If you're starting fresh on a GA4 BigQuery setup, use collected_traffic_source for everything it covers, and only reach for regex or array methods for the gaps — custom parameters, non-Google click IDs, historic data. If you're working with older data, the regex and array methods still hold up — just be aware of the gotchas above.

If you've got a specific extraction challenge, or want help on the pipelines that depend on this, like BigQuery → Looker Studio reporting or BigQuery → CRM for offline conversions, get in touch.

Previous
Previous

Bounce Rate: The Old vs. New Definition in GA4