DEV Community

iamtodor
iamtodor

Posted on

Python's BigQuery External CVS tables and null_marker challenge

Testing data pipelines is crucial for ensuring their efficiency and reliability. When dealing with data sources like BigQuery external tables, it becomes necessary to emulate these data sources to perform thorough pipeline testing.

So idea is to create external CSV table using Python's BigQuery library.

For additional context, we treat empty value as "" and null as \N.

Unfortunately, this library does not allow you to specify null_marker for CSV files while creating an external CSV table using ExternalConfig

The null_marker option is not presented in CsvOptions, in contrast to, for example, skip_leading_rows.

external_config = ExternalConfig(ExternalSourceFormat.CSV)
external_config.options.skip_leading_rows = 1
external_config.options.null_marker = '\\N' # does not work
Enter fullscreen mode Exit fullscreen mode

Simultaneously, it's important to note that while a Data Definition Language (DDL) statement that includes the null_marker option may work successfully when executed directly in the BigQuery console, it might not function as expected when submitted using the Python client.

CREATE
OR REPLACE EXTERNAL TABLE project_name.dataset_name.table_name (
    `field1` STRING,
    `field2` STRING,
) OPTIONS (
    uris = ['gs://bucket_name/prefix_name/file_name.csv'],
    format = CSV,
    skip_leading_rows = 1,
    null_marker = '\\N'
)
Enter fullscreen mode Exit fullscreen mode

If you execute this query in the BigQuery console, your table will be created successfully and ready for querying. However, if you attempt to submit the same query using the Python library, you will encounter the following exception:

>>> bq_client.query(query=create_table_ddl).result()
google.api_core.exceptions.BadRequest: 400 Syntax error: Illegal escape sequence: \N
Enter fullscreen mode Exit fullscreen mode

Even when attempting to shield the \\\\N sequence, it may not produce the expected outcome. The actual null values are not processed as intended; instead, they remain as \\N rather than being interpreted as null.

If you have insights or solutions regarding the challenge of handling null values in emulated tables, kindly share them in the comments section. This will help improve the guide and provide a comprehensive solution for others facing a similar issue.

Top comments (0)