According to https://cloud.google.com/dataproc/docs/concepts/connectors/bigquery the connector uses BigQuery Storage API to read data using gRPC. However, I couldn't find any Storage API/gRPC usage in the source code here: https://github.com/GoogleCloudDataproc/spark-bigquery-connector/tree/master/connector/src/main/scala
My questions are:
1. Could anyone show me where in the source code the Storage API & gRPC calls are made?
2. Does Dataset<Row> df = session.read().format("bigquery").load() work through the BigQuery Storage API? If not, how can I read from BigQuery into Spark using the BigQuery Storage API?
The Spark BigQuery Connector uses only the BigQuery Storage API for reads; you can see it here, for example.
Yes, Dataset<Row> df = session.read().format("bigquery").load() works through the BigQuery Storage API.
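For reference, a minimal read sketch in Java, assuming the connector is on the classpath; the app name is illustrative and the public sample table is just an example, any table you can access works the same way:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class BigQueryReadExample {
        public static void main(String[] args) {
            // Illustrative app name; any SparkSession with the connector available works.
            SparkSession session = SparkSession.builder()
                    .appName("bigquery-storage-read")
                    .getOrCreate();

            // The "bigquery" format is provided by the spark-bigquery-connector;
            // reads are served through the BigQuery Storage API (gRPC read streams).
            Dataset<Row> df = session.read()
                    .format("bigquery")
                    .load("bigquery-public-data.samples.shakespeare"); // public sample table, for illustration

            df.printSchema();
            df.show(10);
        }
    }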
Does Azure Databricks use the query acceleration feature of Azure Data Lake Storage Gen2? The documentation says that Spark can benefit from this functionality.
I'm wondering whether, in the case where I only use the Delta format, I benefit from this functionality, and whether I should include it in the pricing in the Azure Calculator under the Storage Account section.
From the docs
Query acceleration supports CSV and JSON formatted data as input to each request.
So it doesn't work with Parquet or Delta: query acceleration is fundamentally a row-based accelerator, and Parquet is a columnar format.
I want to know what security protocol is used for the interaction between Databricks and BigQuery when using the Spark BigQuery connector:
https://github.com/GoogleCloudDataproc/spark-bigquery-connector
For example: the interaction between Azure Databricks and Salesforce Marketing Cloud is via SFTP (SSH File Transfer Protocol).
We have a Data Factory pipeline in Azure to move an on-premises SQL table to Azure Blob Storage Gen2 in Parquet format. I think the majority of the cost would come from the Azure storage, right?
Now we want to move that data to BigQuery. Due to our security policy, we still need the Data Factory pipeline to read from the SQL table, so we created a Databricks notebook to read the Parquet files and move them to BigQuery using the Spark BigQuery connector. Now I need to estimate the total cost. On top of the Azure storage, do we have to pay some kind of egress cost to move data out of Azure Storage? And would Google charge us some kind of ingress cost to move the data into BigQuery?
All inbound or ingress data transfers to Azure data centers from on-premises environments are free. However, outbound data transfers incur charges.
Data migration from other platforms into BigQuery is free.
To estimate the cost of Google Cloud Platform services, you can use the Google Cloud Pricing Calculator.
Complementing @Ismail's answer:
The migration from other platforms is free when the BigQuery Data Transfer service is used; however, this is not the case if the data is moved to BigQuery using the Spark BigQuery connector.
The connector writes data to BigQuery by writing it first to Cloud Storage (GCS) and then loading it into BigQuery, as mentioned here:
Notice that the process writes the data first to GCS and then loads it to BigQuery, a GCS bucket must be configured to indicate the temporary data location.
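As an illustration of that configuration, a write from the Databricks notebook might look like the sketch below; the bucket and table names are placeholders, while "bigquery" and the temporaryGcsBucket option come from the connector documentation:

    import org.apache.spark.sql.SaveMode;

    // Assuming `df` is the Dataset<Row> read from the Parquet files in Azure storage.
    df.write()
      .format("bigquery")
      .option("temporaryGcsBucket", "my-temp-bucket")    // GCS staging bucket (placeholder name)
      .option("table", "my-project.my_dataset.my_table") // destination table (placeholder name)
      .mode(SaveMode.Append)
      .save();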
Cloud Storage pricing depends on the storage class used and the location of the bucket; so, assuming the Standard class, your migration process will generate charges for:
Data storage; and
Operations
Loading the data from Cloud Storage to BigQuery is free; however, there might be network egress fees if the bucket location is not in the same region/multi-region as the dataset.
Finally, once your data is in BigQuery it will be subject to the BigQuery Storage pricing.
I suggest checking the complete pricing documentation for both Cloud Storage and BigQuery for details, limitations, and some examples of how the pricing works.
I am working on a Spark project where the storage sink is Azure Blob Storage and I write data in Parquet format. I need some metrics around storage, e.g. numberOfFilesCreated, writtenBytes, etc. While searching online I came across a metrics source in the hadoop-azure package called AzureFileSystemInstrumentation. I am not sure how to access it from Spark and can't find any resources on it. How would one access this instrumentation for a given Spark job?
Based on my experience, I think there are three solutions that can be used in your current scenario, as below.
Directly use the Hadoop API for HDFS to get HDFS metrics data in Spark, because hadoop-azure just implements the HDFS APIs on top of Azure Blob Storage. Please see the official Hadoop Metrics documentation to find the particular metrics you want, such as CreateFileOps or FilesCreated, to get numberOfFilesCreated. There is also a similar SO thread, How do I get HDFS bytes read and write for Spark applications?, which you can refer to; a sketch of this Spark-side approach follows after this list.
Directly use the Azure Storage SDK for Java (or another language you use) to write a program that computes statistics for the files stored in Azure Blob Storage as blobs, ordered by creation timestamp or other attributes. Please refer to the official document Quickstart: Azure Blob storage client library v8 for Java to learn how to use the SDK.
Use an Azure Function with a Blob Trigger to monitor the events of files created in Azure Blob Storage, then write your statistics code for every blob-created event. Please refer to the official document Create a function triggered by Azure Blob storage to learn how to use the Blob Trigger. You can even send the metrics you want from the Blob Trigger function to Azure Table Storage, Azure SQL Database, or another service for later analysis.
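A rough sketch of the Spark-side part of the first option: Spark's listener API exposes per-task output metrics (bytes and records written), which for a hadoop-azure sink reflect what the job wrote to Blob Storage. This uses standard Spark APIs rather than AzureFileSystemInstrumentation itself, and the output URI in the usage comment is a placeholder:

    import org.apache.spark.scheduler.SparkListener;
    import org.apache.spark.scheduler.SparkListenerTaskEnd;
    import java.util.concurrent.atomic.AtomicLong;

    // Sketch: aggregate bytes/records written by all tasks of the job.
    // This reports Spark's OutputMetrics, not AzureFileSystemInstrumentation directly.
    public class OutputMetricsListener extends SparkListener {
        public final AtomicLong bytesWritten = new AtomicLong();
        public final AtomicLong recordsWritten = new AtomicLong();

        @Override
        public void onTaskEnd(SparkListenerTaskEnd taskEnd) {
            if (taskEnd.taskMetrics() != null) {
                bytesWritten.addAndGet(taskEnd.taskMetrics().outputMetrics().bytesWritten());
                recordsWritten.addAndGet(taskEnd.taskMetrics().outputMetrics().recordsWritten());
            }
        }
    }

    // Usage (the output URI is a placeholder):
    // OutputMetricsListener listener = new OutputMetricsListener();
    // spark.sparkContext().addSparkListener(listener);
    // df.write().parquet("wasbs://container@account.blob.core.windows.net/out");
    // System.out.println(listener.bytesWritten.get() + " bytes, " + listener.recordsWritten.get() + " records written");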
I am using Flink streaming to read data from a file in Azure Data Lake Store. Is there any connector available to read the data from a file stored in Azure Data Lake continuously as the file is updated? How can I do it?
Azure Data Lake Store (ADLS) provides a REST API interface that is compatible with HDFS and is documented here: https://learn.microsoft.com/en-us/rest/api/datalakestore/webhdfs-filesystem-apis.
Currently there are no APIs or connectors available that poll ADLS and notify or read data as the files/folders are updated. This is something that you could implement in a custom connector using the APIs provided above. Your connector would need to poll the ADLS account/folder on a recurring basis to identify changes; a rough sketch of such a polling source is shown after this answer.
Thanks,
Sachin Sheth
Program Manager
Azure Data Lake
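To illustrate the custom-connector approach described in the answer above, here is a minimal polling source sketch, assuming the file is reachable through the Hadoop filesystem layer for ADLS. The adl:// URI and the polling interval are placeholders, and the sketch is not fault-tolerant and not an official connector:

    import org.apache.flink.streaming.api.functions.source.SourceFunction;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    // Sketch of a custom source that polls a single ADLS file (via the Hadoop
    // filesystem layer) and emits lines appended since the last poll.
    // Note: not exactly-once; lines appended during a read may be re-emitted.
    public class PollingAdlsSource implements SourceFunction<String> {
        private volatile boolean running = true;
        private final String uri;          // e.g. "adl://myaccount.azuredatalakestore.net/logs/input.txt" (placeholder)
        private final long pollIntervalMs; // polling interval (placeholder)

        public PollingAdlsSource(String uri, long pollIntervalMs) {
            this.uri = uri;
            this.pollIntervalMs = pollIntervalMs;
        }

        @Override
        public void run(SourceContext<String> ctx) throws Exception {
            Path path = new Path(uri);
            FileSystem fs = path.getFileSystem(new Configuration());
            long offset = 0L;
            while (running) {
                long length = fs.getFileStatus(path).getLen();
                if (length > offset) {
                    try (FSDataInputStream in = fs.open(path)) {
                        in.seek(offset); // skip the part already emitted
                        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
                        String line;
                        while ((line = reader.readLine()) != null) {
                            ctx.collect(line);
                        }
                    }
                    offset = length;
                }
                Thread.sleep(pollIntervalMs);
            }
        }

        @Override
        public void cancel() {
            running = false;
        }
    }

    // Usage:
    // StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // env.addSource(new PollingAdlsSource("adl://myaccount.azuredatalakestore.net/logs/input.txt", 10_000))
    //    .print();
    // env.execute("adls-polling-source");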