BigQuery connector for pyspark via Hadoop Input Format example - apache-spark

I have a large dataset stored into a BigQuery table and I would like to load it into a pypark RDD for ETL data processing.
I realized that BigQuery supports the Hadoop Input / Output format
https://cloud.google.com/hadoop/writing-with-bigquery-connector
and pyspark should be able to use this interface in order to create an RDD by using the method "newAPIHadoopRDD".
http://spark.apache.org/docs/latest/api/python/pyspark.html
Unfortunately, the documentation on both ends seems scarce and goes beyond my knowledge of Hadoop/Spark/BigQuery. Is there anybody who has figured out how to do this?

Google now has an example on how to use the BigQuery connector with Spark.
There does seem to be a problem using the GsonBigQueryInputFormat, but I got a simple Shakespeare word counting example working
import json
import pyspark
sc = pyspark.SparkContext()
hadoopConf=sc._jsc.hadoopConfiguration()
hadoopConf.get("fs.gs.system.bucket")
conf = {"mapred.bq.project.id": "<project_id>", "mapred.bq.gcs.bucket": "<bucket>", "mapred.bq.input.project.id": "publicdata", "mapred.bq.input.dataset.id":"samples", "mapred.bq.input.table.id": "shakespeare" }
tableData = sc.newAPIHadoopRDD("com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat", "org.apache.hadoop.io.LongWritable", "com.google.gson.JsonObject", conf=conf).map(lambda k: json.loads(k[1])).map(lambda x: (x["word"], int(x["word_count"]))).reduceByKey(lambda x,y: x+y)
print tableData.take(10)

Related

Getting duplicate values for each key while querying parquet file using PySpark

We are getting duplicated values while querying data from the parquet file using PySpark.
While getting correct data after querying from presto.
Spark Version: 3.1
Configuration setup so far:
from scbuilder.kubernetes import Kubernetes
kobj = Kubernetes(kubernetes = True)
kobj.setExecutorCores(5)
kobj.setExecutorMemory("5g")
kobj.addAdditionalConf("spark.driver.memory", "8g")
kobj.setNumberOfExecutor(2)
sc = kobj.buildSparkSession()
sc.getActiveSession()
sc.conf.set('sc.hadoopConfiguration.setClass',"mapreduce.input.pathFilter.class")
sc.conf.set("hive.convertMetastoreParquet",False)
sc.conf.set("hive.input.format","org.apache.hadoop.hive.ql.io.HiveInputFormat")
Actual data count: 17722
After querying from parquet file: 1036320
Need help to understand why parquet file is showing such behavior and how we can fix it?

Loading Data from Azure Synapse Database into a DataFrame with Notebook

I am attempting to load data from Azure Synapse DW into a dataframe as shown in the image.
However, I'm getting the following error:
AttributeError: 'DataFrameReader' object has no attribute 'sqlanalytics'
Traceback (most recent call last):
AttributeError: 'DataFrameReader' object has no attribute 'sqlanalytics'
Any thoughts on what I'm doing wrong?
That particular method has changed its name to synapsesql (as per the notes here) and is Scala only currently as I understand it. The correct syntax would therefore be:
%%spark
val df = spark.read.synapsesql("yourDb.yourSchema.yourTable")
It is possible to share the Scala dataframe with Python via the createOrReplaceTempView method, but I'm not sure how efficient that is. Mixing and matching is described here. So for your example you could mix and match Scala and Python like this:
Cell 1
%%spark
// Get table from dedicated SQL pool and assign it to a dataframe with Scala
val df = spark.read.synapsesql("yourDb.yourSchema.yourTable")
// Save the dataframe as a temp view so it's accessible from PySpark
df.createOrReplaceTempView("someTable")
Cell 2
%%pyspark
## Scala dataframe is now accessible from PySpark
df = spark.sql("select * from someTable")
## !!TODO do some work in PySpark
## ...
The above linked example shows how to write the dataframe back to the dedicated SQL pool too if required.
This is a good article for importing / export data with Synpase notebooks and the limitation is described in the Constraints section:
https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/synapse-spark-sql-pool-import-export#constraints

Spark "modifiedBefore" option while reading data from files

I am using Spark-2.4 to read files from hadoop.
The requirement is to read the files whose modified time is before some provided value.
I came across the spark documentation that mentions about the option modifiedBefore, please refer to the following spark doc Modification Time Path Filters, but I am not sure if it's available in spark 2.4, if not how can I achieve this?
The options modifiedBefore and modifiedAfter are available since Spark 3+ and can only be used in batch not streaming. For Spark 2.4, you can use Hadoop FileSystem globStatus method and filter files using getModificationTime.
Here is an example of a function that takes a path and a threshold and returns list of file paths filtered using the threshold:
import org.apache.hadoop.fs.Path
def getFilesModifiedBefore(path: Path, modifiedBefore: String) = {
val format = new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss")
val thresHoldTime = format.parse(modifiedBefore).getTime()
val files = path.getFileSystem(sc.hadoopConfiguration).globStatus(path)
files.filter(_.getModificationTime < thresHoldTime).map(_.getPath.toString)
}
Then using it with spark.read.csv :
val df = spark.read.csv(getFilesModifiedBefore(new Path("/mypath"), "2021-03-17T10:46:12"):_*)

Writing data into snowflake using Python

Can we write data directly into snowflake table without using Snowflake internal stage using Python????
It seems auxiliary task to write in stage first and then transform it and then load it into table. can it done in one step only just like JDBC connection in RDBMS.
The absolute fastest way to load data into Snowflake is from a file on either internal or external stage. Period. All connectors have the ability to insert the data with standard insert commands, but this will not perform as well. That said, many of the Snowflake drivers are now transparently using PUT/COPY commands to load large data to Snowflake via internal stage. If this is what you are after, then you can leverage the pandas write_pandas command to load data from a pandas dataframe to Snowflake in a single command. Behind the scenes, it will execute the PUT and COPY INTO for you.
https://docs.snowflake.com/en/user-guide/python-connector-api.html#label-python-connector-api-write-pandas
I highly recommend this pattern over INSERT commands in any driver. And I would also recommend transforms be done AFTER loading to Snowflake, not before.
If someone is having issues with large datasets. Try Using dask instead and generate your dataframe partitioned into chunks. Then you can use dask.delayed with sqlalchemy. Here we are using snowflake's native connector method i.e. pd_writer, which under the hood uses write_pandas and eventually using PUT COPY with compressed parquet file. At the end it comes down to your I/O bandwidth to be honest. The more throughput you have, the faster it gets loaded in Snowflake Table. But this snippet provides decent amount of parallelism overall.
import functools
from dask.diagnostics import ProgressBar
from snowflake.connector.pandas_tools import pd_writer
import dask.dataframe as dd
df = dd.read_csv(csv_file_path, blocksize='64MB')
ddf_delayed = df.to_sql(
table_name.lower(),
uri=str(engine.url),
schema=schema_name,
if_exists=if_exists,
index=False,
method=functools.partial(
pd_writer,quote_identifiers=False),
compute=False,
parallel=True
)
with ProgressBar():
dask.compute(ddf_delayed, scheduler='threads', retries=3)
Java:
Load Driver Class:
Class.forName("net.snowflake.client.jdbc.SnowflakeDriver")
Maven:
Add following code block as a dependency
<dependency>
<groupId>net.snowflake</groupId>
<artifactId>snowflake-jdbc</artifactId>
<version>{version}</version>
Spring :
application.yml:
spring:
datasource
hikari:
maximumPoolSize: 4 # Specify maximum pool size
minimumIdle: 1 # Specify minimum pool size
driver-class-name: com.snowflake.client.jdbc.SnowflakeDriver
Python :
import pyodbc
# pyodbc connection string
conn = pyodbc.connect("Driver={SnowflakeDSIIDriver}; Server=XXX.us-east-2.snowflakecomputing.com; Database=VAQUARKHAN_DB; schema=public; UID=username; PWD=password")
# Cursor
cus=conn.cursor()
# Execute SQL statement to get current datetime and store result in cursor
cus.execute("select current_date;")
# Display the content of cursor
row = cus.fetchone()
print(row)
How to insert json response data in snowflake database more efficiently?
Apache Spark:
<dependency>
<groupId>net.snowflake</groupId>
<artifactId>spark-snowflake_2.11</artifactId>
<version>2.5.9-spark_2.4</version>
</dependency>
Code
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.DataFrame
/ Use secrets DBUtil to get Snowflake credentials.
val user = dbutils.secrets.get("data-warehouse", "<snowflake-user>")
val password = dbutils.secrets.get("data-warehouse", "<snowflake-password>")
val options = Map(
"sfUrl" -> "<snowflake-url>",
"sfUser" -> user,
"sfPassword" -> password,
"sfDatabase" -> "<snowflake-database>",
"sfSchema" -> "<snowflake-schema>",
"sfWarehouse" -> "<snowflake-cluster>"
)
// Generate a simple dataset containing five values and write the dataset to Snowflake.
spark.range(5).write
.format("snowflake")
.options(options)
.option("dbtable", "<snowflake-database>")
.save()
// Read the data written by the previous cell back.
val df: DataFrame = spark.read
.format("snowflake")
.options(options)
.option("dbtable", "<snowflake-database>")
.load()
display(df)
Fastest way to load data into Snowflake is from a file
https://community.snowflake.com/s/article/How-to-Load-Terabytes-Into-Snowflake-Speeds-Feeds-and-Techniques
https://bryteflow.com/how-to-load-terabytes-of-data-to-snowflake-fast/
https://www.snowflake.com/blog/ability-to-connect-to-snowflake-with-jdbc/
https://docs.snowflake.com/en/user-guide/jdbc-using.html
https://www.persistent.com/blogs/json-processing-in-spark-snowflake-a-comparison/

Pyspark : GET row from hbase using row-key

I have a use-case to read from HBase inside a pyspark job and is currently doing a scan on the HBase table like this,
conf = {"hbase.zookeeper.quorum": host, "hbase.cluster.distributed": "true", "hbase.mapreduce.inputtable": "table_name", "hbase.mapreduce.scan.row.start": start, "hbase.mapreduce.scan.row.stop": stop}
rdd = sc.newAPIHadoopRDD("org.apache.hadoop.hbase.mapreduce.TableInputFormat", "org.apache.hadoop.hbase.io.ImmutableBytesWritable","org.apache.hadoop.hbase.client.Result", keyConverter=keyConv, valueConverter=valueConv,conf=cmdata_conf)
I am unable to find the conf to do a GET on the HBase table. Can someone help me? I could find that filters are not supported with pyspark. But is it not possible to do a simple GET?
Thanks!

Resources