How do I call a REST API from a notebook in Snowflake, when the API call generates output files that need to be stored in Snowflake itself?
You do not have to call the API directly from Snowflake. You can load the files from your Python notebook through a connection to your Snowflake database:
SQL code:
-- Create destination table to store your API queries results data
create or replace table public.tmp_table
(page_id int,id int,status varchar,provider_status varchar,ts_created timestamp);
-- create a new format for your csv files
create or replace file format my_new_format type = 'csv' field_delimiter = ';' field_optionally_enclosed_by = '"' skip_header = 1;
-- Put your local file to the Snowflake's temporary storage
put file:///Users/Admin/Downloads/your_csv_file_name.csv @~/staged;
-- Copying data from storage into table
copy into public.tmp_table from @~/staged/your_csv_file_name.csv.gz file_format = (format_name = 'my_new_format') ON_ERROR=CONTINUE;
select * from public.tmp_table;
-- Delete temporary data
remove @~/staged/your_csv_file_name.csv.gz;
You can do the same with Python:
https://docs.snowflake.com/en/user-guide/python-connector-example.html#loading-data
import snowflake.connector

target_table = 'public.tmp_table'
filename = 'your_csv_file_name'
filepath = f'/home/Users/Admin/Downloads/{filename}.csv'

# Connection parameters (fill in your own credentials)
conn = snowflake.connector.connect(
    user=USER,
    password=PASSWORD,
    account=ACCOUNT,
    warehouse=WAREHOUSE,
    database=DATABASE,
    schema=SCHEMA
)

# Put the local file into the user stage (PUT gzips it on upload)
conn.cursor().execute(f'put file://{filepath} @~/staged;')

# Copy the staged data into the target table
result = conn.cursor().execute(f'''
    COPY INTO {target_table}
    FROM @~/staged/{filename}.csv.gz
    file_format = (format_name = 'my_new_format'
                   FIELD_OPTIONALLY_ENCLOSED_BY = '"'
                   ESCAPE_UNENCLOSED_FIELD = NONE) ON_ERROR=CONTINUE;
''')

# Remove the staged file
conn.cursor().execute(f'REMOVE @~/staged/{filename}.csv.gz;')
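To connect this back to the original question: if the notebook itself makes the REST call, you can write the API response to a local CSV and then stage and copy it exactly as above. A rough sketch (the endpoint URL, the JSON shape, and the column names are illustrative assumptions, not part of the answer):

import csv
import requests
import snowflake.connector

# Hypothetical endpoint; replace with the API your notebook actually calls.
API_URL = 'https://api.example.com/report'

response = requests.get(API_URL, timeout=60)
response.raise_for_status()

# Write the API output to a local CSV that matches the file format above
# (';' delimiter, '"' quoting, one header row).
local_path = '/tmp/api_output.csv'
with open(local_path, 'w', newline='') as f:
    writer = csv.writer(f, delimiter=';')
    writer.writerow(['page_id', 'id', 'status', 'provider_status', 'ts_created'])
    for row in response.json():
        writer.writerow([row['page_id'], row['id'], row['status'],
                         row['provider_status'], row['ts_created']])

# Stage the file and copy it into the table, as in the SQL above.
conn = snowflake.connector.connect(user=USER, password=PASSWORD, account=ACCOUNT,
                                    warehouse=WAREHOUSE, database=DATABASE, schema=SCHEMA)
conn.cursor().execute(f"put file://{local_path} @~/staged;")
conn.cursor().execute("""
    copy into public.tmp_table
    from @~/staged/api_output.csv.gz
    file_format = (format_name = 'my_new_format') ON_ERROR = CONTINUE;
""")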
Below is the code to get a tsv.gz file from GCS, unzip it, and convert it into a comma-separated CSV file in order to load the CSV data into BigQuery.
import gzip

import gcsfs
import pandas as pd
from google.cloud import storage

storage_client = storage.Client(project=project_id)
blobs_list = list(storage_client.list_blobs(bucket_name))
gcs_file_system = gcsfs.GCSFileSystem(project=project_id)

for blob in blobs_list:
    if blob.name.endswith(".tsv.gz"):
        source_file = blob.name
        uri = "gs://{}/{}".format(bucket_name, source_file)
        with gcs_file_system.open(uri) as f:
            gzf = gzip.GzipFile(mode="rb", fileobj=f)
            # Tab-separated input, re-written as a comma-separated CSV
            csv_table = pd.read_table(gzf)
            csv_table.to_csv('GfG.csv', index=False)
This code doesn't seem to be an effective way to load the data into BigQuery, as I'm running into many issues. I think I'm doing something wrong with the file conversion. Can you point out where it went wrong?
If your file is gzip (not zip, I mean gzip) and already in Cloud Storage, there's no need to download it, unzip it, and stream-load it.
You can load it directly, as is, into BigQuery. It's magic!! Here is a sample:
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

job_config = bigquery.LoadJobConfig(
    autodetect=True,          # Automatic schema detection
    field_delimiter=",",      # Use "\t" if the separator in your TSV file is a tab
    skip_leading_rows=1,      # Skip the header row (but keep it for the column naming)
    # The source format defaults to CSV, so the line below is optional.
    source_format=bigquery.SourceFormat.CSV,
)

uri = "gs://{}/{}".format(bucket_name, source_file)

load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.

load_job.result()  # Waits for the job to complete.
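If you want to confirm what was loaded, you can fetch the destination table once the job finishes (assuming table_id was set as in the TODO above):

destination_table = client.get_table(table_id)  # Make an API request.
print("Loaded {} rows.".format(destination_table.num_rows))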
I'm using a Synapse serverless SQL pool and get the following error when trying to use CETAS:
Msg 15860, Level 16, State 5, Line 3
External table location path is not valid. Location provided: 'https://accountName.blob.core.windows.net/ontainerName/test/'
My workspace managed identity should have all the correct ACL and RBAC roles on the storage account. I'm able to query the files I have there, but I am unable to execute the CETAS command.
CREATE DATABASE SCOPED CREDENTIAL WorkspaceIdentity WITH IDENTITY = 'Managed Identity'
GO
CREATE EXTERNAL DATA SOURCE MyASDL
WITH ( LOCATION = 'https://accountName.blob.core.windows.net/containerName'
,CREDENTIAL = WorkspaceIdentity)
GO
CREATE EXTERNAL FILE FORMAT CustomCSV
WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (ENCODING = 'UTF8')
);
GO
CREATE EXTERNAL TABLE Test.dbo.TestTable
WITH (
LOCATION = 'test/',
DATA_SOURCE = MyASDL,
FILE_FORMAT = CustomCSV
) AS
WITH source AS
(
SELECT
jsonContent
, JSON_VALUE (jsonContent, '$.zipCode') AS ZipCode
FROM
OPENROWSET(
BULK '/customer-001-100MB.json',
FORMAT = 'CSV',
FIELDQUOTE = '0x00',
FIELDTERMINATOR ='0x0b',
ROWTERMINATOR = '\n',
DATA_SOURCE = 'MyASDL'
)
WITH (
jsonContent varchar(1000) COLLATE Latin1_General_100_BIN2_UTF8
) AS [result]
)
SELECT ZipCode, COUNT(*) as Count
FROM source
GROUP BY ZipCode
;
I've tried everything in the LOCATION parameter of the CETAS command, but nothing seems to work: folder paths, file paths, with and without leading/trailing /, etc.
The CTE SELECT statement works on its own without the CETAS.
Can't I use the same data source for both reading and writing, or is it something else?
The issue was with my data source definition.
I had used https:// in the LOCATION; once I changed it to wasbs:// as described in CREATE EXTERNAL DATA SOURCE (Transact-SQL), it worked (for a blob container the location takes the form wasbs://containerName@accountName.blob.core.windows.net).
That documentation describes how you have to use wasbs, abfs, or adl as the prefix depending on whether your data source is a V2 storage account, a V2 data lake, or a V1 data lake.
I am trying to auto-ingest data from GCS into BigQuery using a bucket-triggered Cloud Function. The files are gzipped JSON files with a maximum size of 2 GB. The Cloud Function works fine with small files; however, it tends to time out when I give it large files in the 1 to 2 GB range. Is there a way to further optimize my function? Here is the code:
from google.cloud import bigquery, storage

def bigquery_job_trigger(data, context):
    # Set up our GCS and BigQuery clients
    storage_client = storage.Client()
    client = bigquery.Client()

    file_data = data
    file_name = file_data["name"]
    bucket_name = file_data["bucket"]

    table_id = 'BqJsonIngest'
    dataset_id = 'dataDelivery'
    dataset_ref = client.dataset(dataset_id)
    table_ref = dataset_ref.table(table_id)

    job_config = bigquery.LoadJobConfig()
    job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
    job_config.autodetect = True

    # Stream the blob's contents through the function into BigQuery
    blob = storage_client.bucket(bucket_name).get_blob(file_name)
    file = blob.open("rb")
    client.load_table_from_file(
        file,
        table_ref,
        location="US",  # Must match the destination dataset location.
        job_config=job_config,
    )
If the file's already in GCS, there's no need to open the blob inside your function (or at least the need to do so is not apparent from the snippet provided).
See client.load_table_from_uri, or just check out one of the existing code samples like https://cloud.google.com/bigquery/docs/samples/bigquery-load-table-gcs-csv#bigquery_load_table_gcs_csv-python
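For example, here is a minimal sketch of the same trigger using client.load_table_from_uri (the dataset and table names mirror your snippet; BigQuery reads the gzipped NDJSON object straight from GCS, so nothing is downloaded into the function):

from google.cloud import bigquery

def bigquery_job_trigger(data, context):
    client = bigquery.Client()

    # Destination mirrors the question: dataset "dataDelivery", table "BqJsonIngest".
    table_id = "dataDelivery.BqJsonIngest"

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,
    )

    # The function only submits the load job; BigQuery pulls the gzipped
    # NDJSON directly from the bucket.
    uri = "gs://{}/{}".format(data["bucket"], data["name"])
    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)

    # Waiting here keeps the sample simple; for very large loads you may
    # prefer not to block and instead check the job status elsewhere.
    load_job.result()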
I am trying to load data into a PolyBase table from a CSV flat file that contains ", |, and ^ characters (as well as commas) in the data.
I have created a file format with (STRING_DELIMITER = '"'):
CREATE EXTERNAL FILE FORMAT StringDelimiter WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (
FIELD_TERMINATOR = ',',
STRING_DELIMITER = '"',
FIRST_ROW = 2,
ENCODING = 'UTF8'
) );
But I got an error while fetching from blob storage:
HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: HadoopExecutionException: Could not find a delimiter after string delimiter.
Unfortunately, escape character support is not yet available in Synapse when loading with PolyBase.
You can convert the CSV flat file to Parquet format in Data Factory.
Then use the following query to CREATE EXTERNAL FILE FORMAT:
CREATE EXTERNAL FILE FORMAT table_name
WITH
(
FORMAT_TYPE = PARQUET,
DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
)
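If you would rather do the CSV-to-Parquet conversion in code instead of a Data Factory copy activity, here is a minimal pandas sketch (the file names are placeholders and the pyarrow dependency is an assumption):

import pandas as pd

# Read the delimited flat file; quoted fields may contain ",", "|" or "^".
df = pd.read_csv('input_flat_file.csv', sep=',', quotechar='"')

# Write snappy-compressed Parquet to match the SnappyCodec setting above
# (requires pyarrow or fastparquet to be installed).
df.to_parquet('input_flat_file.parquet', compression='snappy', index=False)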
Reference link1- https://learn.microsoft.com/en-us/answers/questions/118102/polybase-load-csv-file-that-contains-text-column-w.html
Reference link2- https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop#arguments-for-create-external-file-format
I want to generate tables automatically in BigQuery whenever a file is uploaded to a storage bucket, using a Cloud Function in Python.
For example, if a sample1.csv file is uploaded to the bucket, then a sample1 table should be created in BigQuery.
How do I automate this with a Cloud Function in Python? I tried the code below, but it only generated one table and all the data got appended to it. How should I proceed?
def hello_gcs(event, context):
    from google.cloud import bigquery

    # Construct a BigQuery client object.
    client = bigquery.Client()

    # TODO(developer): Set table_id to the ID of the table to create.
    table_id = "test_project.test_dataset.test_Table"

    job_config = bigquery.LoadJobConfig(
        autodetect=True,
        skip_leading_rows=1,
        # The source format defaults to CSV, so the line below is optional.
        source_format=bigquery.SourceFormat.CSV,
    )

    uri = "gs://test_bucket/*.csv"

    load_job = client.load_table_from_uri(
        uri, table_id, job_config=job_config
    )  # Make an API request.

    load_job.result()  # Waits for the job to complete.

    destination_table = client.get_table(table_id)  # Make an API request.
    print(f"Processing file: {event['name']}.")
Sounds like you need to do three things:
1. Extract the name of the CSV file/object from the notification event you're receiving to fire your function.
2. Update the table_id in your example code to set the table name based on the filename you extracted in the first step.
3. Update the uri in your example code to only use the single file as the input. As written, your example attempts to load data from all matching CSV objects in GCS to the table.
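Putting those three steps together, a minimal sketch of the function (the project and dataset names mirror your snippet, and deriving the table name from the file name minus its extension is my assumption of what you want):

import os
from google.cloud import bigquery

def hello_gcs(event, context):
    client = bigquery.Client()

    # 1. The trigger event carries the bucket and object name.
    bucket_name = event["bucket"]
    file_name = event["name"]

    # 2. Derive the table name from the file name (sample1.csv -> sample1).
    #    Note: file names containing characters that are invalid in table
    #    names would need extra sanitising.
    table_name = os.path.splitext(os.path.basename(file_name))[0]
    table_id = f"test_project.test_dataset.{table_name}"

    job_config = bigquery.LoadJobConfig(
        autodetect=True,
        skip_leading_rows=1,
        source_format=bigquery.SourceFormat.CSV,
    )

    # 3. Load only the file that fired the trigger, not every CSV in the bucket.
    uri = f"gs://{bucket_name}/{file_name}"
    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()  # Waits for the job to complete.

    print(f"Loaded {file_name} into {table_id}.")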