Loading Data exports from Google Cloud Storage into BigQuery

Follow the steps below to load Marfeel Data exports from Google Cloud Storage into BigQuery.

Create table in dataset

In the dataset into which the data will be imported, create a table with “Empty table” as the source. The schema, edited as text, should be compatible with the following:

[
  {
    "mode": "REQUIRED",
    "name": "user_id",
    "type": "STRING"
  },
  {
    "mode": "REQUIRED",
    "name": "registered_user_id",
    "type": "STRING"
  },
  {
    "mode": "REQUIRED",
    "name": "user_status",
    "type": "INTEGER"
  },
  {
    "mode": "REQUIRED",
    "name": "rfv",
    "type": "INTEGER"
  },
  {
    "mode": "REQUIRED",
    "name": "has_consent",
    "type": "BOOLEAN"
  },
  {
    "mode": "REQUIRED",
    "name": "session_id",
    "type": "STRING"
  },
  {
    "mode": "REQUIRED",
    "name": "page_id",
    "type": "STRING"
  },
  {
    "mode": "REQUIRED",
    "name": "site_id",
    "type": "INTEGER"
  },
  {
    "mode": "REQUIRED",
    "name": "editorial_id",
    "type": "INTEGER"
  },
  {
    "mode": "REQUIRED",
    "name": "editorial_title",
    "type": "STRING"
  },
  {
    "mode": "REQUIRED",
    "name": "editorial_publish_time",
    "type": "TIMESTAMP"
  },
  {
    "mode": "REQUIRED",
    "name": "editorial_update_time",
    "type": "TIMESTAMP"
  },
  {
    "mode": "REQUIRED",
    "name": "sections_array",
    "type": "STRING"
  },
  {
    "mode": "REQUIRED",
    "name": "authors_array",
    "type": "STRING"
  },
  {
    "mode": "REQUIRED",
    "name": "entities_array",
    "type": "STRING"
  },
  {
    "mode": "REQUIRED",
    "name": "tags_array",
    "type": "STRING"
  },
  {
    "mode": "REQUIRED",
    "name": "page_type",
    "type": "INTEGER"
  },
  {
    "mode": "REQUIRED",
    "name": "start_time",
    "type": "TIMESTAMP"
  },
  {
    "mode": "REQUIRED",
    "name": "end_time",
    "type": "TIMESTAMP"
  },
  {
    "mode": "REQUIRED",
    "name": "event_date",
    "type": "DATE"
  },
  {
    "mode": "REQUIRED",
    "name": "event_local_time",
    "type": "TIMESTAMP"
  },
  {
    "mode": "REQUIRED",
    "name": "ip_v4",
    "type": "STRING"
  },
  {
    "mode": "REQUIRED",
    "name": "ip_v6",
    "type": "STRING"
  },
  {
    "mode": "REQUIRED",
    "name": "user_country",
    "type": "STRING"
  },
  {
    "mode": "REQUIRED",
    "name": "user_region",
    "type": "STRING"
  },
  {
    "mode": "REQUIRED",
    "name": "user_city",
    "type": "STRING"
  },
  {
    "mode": "REQUIRED",
    "name": "operating_system",
    "type": "STRING"
  },
  {
    "mode": "REQUIRED",
    "name": "user_agent",
    "type": "STRING"
  },
  {
    "mode": "REQUIRED",
    "name": "full_url",
    "type": "STRING"
  },
  {
    "mode": "REQUIRED",
    "name": "canonical_url",
    "type": "STRING"
  },
  {
    "mode": "REQUIRED",
    "name": "referrer",
    "type": "STRING"
  },
  {
    "mode": "REQUIRED",
    "name": "browser",
    "type": "STRING"
  },
  {
    "mode": "REQUIRED",
    "name": "browser_version",
    "type": "STRING"
  },
  {
    "mode": "REQUIRED",
    "name": "device_category",
    "type": "INTEGER"
  },
  {
    "mode": "REQUIRED",
    "name": "url_params",
    "type": "STRING"
  },
  {
    "mode": "REQUIRED",
    "name": "search_phrase",
    "type": "STRING"
  },
  {
    "mode": "REQUIRED",
    "name": "client_event_time",
    "type": "TIMESTAMP"
  },
  {
    "mode": "REQUIRED",
    "name": "utm_source",
    "type": "STRING"
  },
  {
    "mode": "REQUIRED",
    "name": "utm_medium",
    "type": "STRING"
  },
  {
    "mode": "REQUIRED",
    "name": "utm_campaign",
    "type": "STRING"
  },
  {
    "mode": "REQUIRED",
    "name": "utm_content",
    "type": "STRING"
  },
  {
    "mode": "REQUIRED",
    "name": "utm_term",
    "type": "STRING"
  },
  {
    "mode": "REQUIRED",
    "name": "duration",
    "type": "INTEGER"
  },
  {
    "mode": "REQUIRED",
    "name": "scroll",
    "type": "INTEGER"
  },
  {
    "mode": "REQUIRED",
    "name": "dns_timing",
    "type": "INTEGER"
  },
  {
    "mode": "REQUIRED",
    "name": "connect_timing",
    "type": "INTEGER"
  },
  {
    "mode": "REQUIRED",
    "name": "response_start_timing",
    "type": "INTEGER"
  },
  {
    "mode": "REQUIRED",
    "name": "fetch_timing",
    "type": "INTEGER"
  },
  {
    "mode": "REQUIRED",
    "name": "dom_interactive_timing",
    "type": "INTEGER"
  },
  {
    "mode": "REQUIRED",
    "name": "dom_content_loaded_timing",
    "type": "INTEGER"
  },
  {
    "mode": "REQUIRED",
    "name": "dom_complete_timing",
    "type": "INTEGER"
  },
  {
    "mode": "REQUIRED",
    "name": "fcp_timing",
    "type": "INTEGER"
  },
  {
    "mode": "REQUIRED",
    "name": "PageVars",
    "type": "STRING"
  },
  {
    "mode": "REQUIRED",
    "name": "SessionVars",
    "type": "STRING"
  },
  {
    "mode": "REQUIRED",
    "name": "UserVars",
    "type": "STRING"
  },
  {
    "mode": "REQUIRED",
    "name": "has_web_vitals",
    "type": "BOOLEAN"
  },
  {
    "mode": "REQUIRED",
    "name": "lcp",
    "type": "INTEGER"
  },
  {
    "mode": "REQUIRED",
    "name": "fid",
    "type": "INTEGER"
  },
  {
    "mode": "REQUIRED",
    "name": "cls",
    "type": "INTEGER"
  },
  {
    "mode": "REQUIRED",
    "name": "recirculation_source",
    "type": "STRING"
  },
  {
    "mode": "REQUIRED",
    "name": "r",
    "type": "INTEGER"
  },
  {
    "mode": "REQUIRED",
    "name": "f",
    "type": "INTEGER"
  },
  {
    "mode": "REQUIRED",
    "name": "v",
    "type": "INTEGER"
  },
  {
    "mode": "REQUIRED",
    "name": "engagement_time",
    "type": "INTEGER"
  }
]
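
Before pasting the schema as text, it can help to sanity-check it. The following stdlib-only Python sketch is illustrative, not an official tool; it simply verifies that every field has a name, a REQUIRED mode, and one of the types used above:

```python
import json

# BigQuery types used by the export schema above.
KNOWN_TYPES = {"STRING", "INTEGER", "BOOLEAN", "TIMESTAMP", "DATE"}

def check_schema(fields):
    """Return a list of problems found in a schema given as a list of field dicts."""
    problems = []
    for field in fields:
        name = field.get("name", "<unnamed>")
        if field.get("mode") != "REQUIRED":
            problems.append(f"{name}: mode is not REQUIRED")
        if field.get("type") not in KNOWN_TYPES:
            problems.append(f"{name}: unexpected type {field.get('type')}")
    return problems

# Example: check a one-field fragment of the schema above.
sample = json.loads('[{"mode": "REQUIRED", "name": "user_id", "type": "STRING"}]')
assert check_schema(sample) == []
```

To check the full schema, load the JSON you intend to paste into the console with `json.load` and pass it to `check_schema`.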

DDL

The corresponding DDL (with the table name suppressed) is:

CREATE TABLE `...`
(
  user_id STRING NOT NULL,
  registered_user_id STRING NOT NULL,
  user_status INT64 NOT NULL,
  rfv INT64 NOT NULL,
  has_consent BOOL NOT NULL,
  session_id STRING NOT NULL,
  page_id STRING NOT NULL,
  site_id INT64 NOT NULL,
  editorial_id INT64 NOT NULL,
  editorial_title STRING NOT NULL,
  editorial_publish_time TIMESTAMP NOT NULL,
  editorial_update_time TIMESTAMP NOT NULL,
  sections_array STRING NOT NULL,
  authors_array STRING NOT NULL,
  entities_array STRING NOT NULL,
  tags_array STRING NOT NULL,
  page_type INT64 NOT NULL,
  start_time TIMESTAMP NOT NULL,
  end_time TIMESTAMP NOT NULL,
  event_date DATE NOT NULL,
  event_local_time TIMESTAMP NOT NULL,
  ip_v4 STRING NOT NULL,
  ip_v6 STRING NOT NULL,
  user_country STRING NOT NULL,
  user_region STRING NOT NULL,
  user_city STRING NOT NULL,
  operating_system STRING NOT NULL,
  user_agent STRING NOT NULL,
  full_url STRING NOT NULL,
  canonical_url STRING NOT NULL,
  referrer STRING NOT NULL,
  browser STRING NOT NULL,
  browser_version STRING NOT NULL,
  device_category INT64 NOT NULL,
  url_params STRING NOT NULL,
  search_phrase STRING NOT NULL,
  client_event_time TIMESTAMP NOT NULL,
  utm_source STRING NOT NULL,
  utm_medium STRING NOT NULL,
  utm_campaign STRING NOT NULL,
  utm_content STRING NOT NULL,
  utm_term STRING NOT NULL,
  duration INT64 NOT NULL,
  scroll INT64 NOT NULL,
  dns_timing INT64 NOT NULL,
  connect_timing INT64 NOT NULL,
  response_start_timing INT64 NOT NULL,
  fetch_timing INT64 NOT NULL,
  dom_interactive_timing INT64 NOT NULL,
  dom_content_loaded_timing INT64 NOT NULL,
  dom_complete_timing INT64 NOT NULL,
  fcp_timing INT64 NOT NULL,
  PageVars STRING NOT NULL,
  SessionVars STRING NOT NULL,
  UserVars STRING NOT NULL,
  has_web_vitals BOOL NOT NULL,
  lcp INT64 NOT NULL,
  fid INT64 NOT NULL,
  cls INT64 NOT NULL,
  recirculation_source STRING NOT NULL,
  r INT64 NOT NULL,
  f INT64 NOT NULL,
  v INT64 NOT NULL,
  engagement_time INT64 NOT NULL
);
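
The DDL maps mechanically onto the JSON schema: INTEGER becomes INT64, BOOLEAN becomes BOOL, and REQUIRED becomes NOT NULL. A stdlib-only Python sketch of that derivation (illustrative only, useful for regenerating the statement if the schema changes):

```python
# Map JSON-schema type names to their GoogleSQL DDL equivalents.
TYPE_MAP = {"INTEGER": "INT64", "BOOLEAN": "BOOL",
            "STRING": "STRING", "TIMESTAMP": "TIMESTAMP", "DATE": "DATE"}

def schema_to_ddl(fields, table_name="`...`"):
    """Render a CREATE TABLE statement from a BigQuery JSON schema."""
    cols = ",\n".join(
        f"  {f['name']} {TYPE_MAP[f['type']]}"
        + (" NOT NULL" if f.get("mode") == "REQUIRED" else "")
        for f in fields
    )
    return f"CREATE TABLE {table_name}\n(\n{cols}\n);"

print(schema_to_ddl([{"mode": "REQUIRED", "name": "user_id", "type": "STRING"}]))
```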

Set up transfer

Recurring loads of data from Cloud Storage to BigQuery can be configured in several ways as described in the documentation.

This generally only has to be done once, so programmatic approaches (the API or a client library such as Java) are recommended only for users already familiar with them.

The bq command does not allow configuring the transfer frequency; transfers are forced to run daily. The procedure also requires opening another window to authorize access and then copying and pasting a code, which can be confusing for non-technical users.

The recommended way is to create the transfer through the console:

  1. In the BigQuery menu, click on “Data transfers”.
  2. Click on “+ CREATE TRANSFER”.
  3. Select “Google Cloud Storage” as source.
  4. Enter a descriptive name for the transfer.
  5. In the schedule options, choose how often the data should be transferred. For executions more frequent than daily, choose “Custom” in the “Repeats” field; for example, to run every half hour, write “every 30 minutes” in the “Custom Schedule” field.
  6. Select the destination dataset and table.
  7. In the Cloud Storage URI, click on “BROWSE”, select the bucket where the CSV files are being exported, and write “*.csv” in the “Filename” field. Click on “SELECT”.
  8. Choose whether to delete or keep the files after they have been transferred.
  9. Make sure that the file format is “CSV”.
  10. In the transfer options for CSV, set “Header rows to skip” to 1 and check “Allow quoted newlines” (in case some data, such as titles, contains line feeds).
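
For reference, the CSV options chosen in the steps above correspond to keys in the transfer configuration's params object. The sketch below only assembles that object; the bucket path is a placeholder, and the key names are an assumption based on the documented Cloud Storage transfer parameters:

```python
import json

# Placeholder bucket path: substitute your own export bucket.
params = {
    "data_path_template": "gs://your-bucket/*.csv",  # step 7: source URI
    "file_format": "CSV",                            # step 9: file format
    "skip_leading_rows": "1",                        # step 10: header rows to skip
    "allow_quoted_newlines": "true",                 # step 10: quoted newlines
    "delete_source_files": "false",                  # step 8: keep files after transfer
}
print(json.dumps(params, indent=2))
```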