How to query from Cloud SQL with PySpark? - apache-spark

I'm setting up a Dataproc job to query some tables from BigQuery. I can retrieve data from BigQuery itself, but the same syntax does not work for retrieving data from an External Connection within my BigQuery project.
More specifically, I'm using the query below to retrieve event data from the analytics of my project:
PROJECT = ... # my project name
NUMBER = ... # my project's analytics number
DATE = ... # day of the events in the format YYYYMMDD
analytics_table = spark.read \
.format('com.google.cloud.spark.bigquery') \
.option('table', f'{PROJECT}.analytics_{NUMBER}.events_{DATE}') \
.load()
While the query above works perfectly, I am unable to query an external connection of my project. I'd like to be able to do something like:
DB_NAME = ... # my database name, considering that my Connection ID is
# projects/<PROJECT_NAME>/locations/us-central1/connections/<DB_NAME>
my_table = spark.read \
.format('com.google.cloud.spark.bigquery') \
.option('table', f'{PROJECT}.{DB_NAME}.my_table') \
.load()
Or even like this:
query = 'SELECT * FROM my_table'
my_table = spark.read \
.format('com.google.cloud.spark.bigquery') \
.option('query', query) \
.load()
How can I retrieve this data?
Thanks in advance :)
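One possible direction, sketched under assumptions rather than confirmed by the question: BigQuery reaches Cloud SQL connections through federated queries written with EXTERNAL_QUERY, and the connector's query option can run such a statement when viewsEnabled is on and a materializationDataset is set. The helper name external_query_sql, the project and connection names, and the dataset tmp_dataset are all hypothetical. The Spark part is left as comments because it needs a live cluster.

```python
def external_query_sql(connection_id: str, inner_sql: str) -> str:
    """Wrap a Cloud SQL statement in a BigQuery EXTERNAL_QUERY call.

    Deliberately simple: quoting inside the arguments is rejected, not escaped.
    """
    if "'''" in inner_sql or '"' in connection_id:
        raise ValueError("unsupported quoting in arguments")
    template = "SELECT * FROM EXTERNAL_QUERY(\"{conn}\", '''{sql}''')"
    return template.format(conn=connection_id, sql=inner_sql)


# Placeholder values; substitute your own project and connection names.
connection_id = "projects/my-project/locations/us-central1/connections/my-db"
sql = external_query_sql(connection_id, "SELECT * FROM my_table")
print(sql)

# Assumed usage on Dataproc (not executed here):
# my_table = (
#     spark.read.format("com.google.cloud.spark.bigquery")
#     .option("viewsEnabled", "true")
#     .option("materializationDataset", "tmp_dataset")  # hypothetical dataset
#     .option("query", sql)
#     .load()
# )
```

The inner statement runs in the Cloud SQL database's own dialect, so table names inside it are the external database's names, not BigQuery ones.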

Related

ClickHouse housepower driver with spark

I'm new to Stack and Spark so please forgive me my simplicity and mistakes!
I have a problem with Clickhouse and spark (2.4.7), I work on a jupyter notebook.
Basically, I want to insert dataframe with Array column to Clickhouse with Array(String) column. Using yandex driver this is impossible, because jdbc doesn't support Arrays, right? ;)
So I wanted to run Spark Session with housepower jar: clickhouse-native-jdbc-shaded-2.6.4.jar, because I read that they added handling Arrays - correct me if I'm wrong.
And I want to get a query from Clickhouse via jdbc.
spark = SparkSession\
.builder\
.enableHiveSupport()\
.appName('custom-events-test')\
.config("spark.jars", "drivers/clickhouse-native-jdbc-shaded-2.6.4.jar")\
.getOrCreate()
My query:
query = """
select date as date,
partnerID as partnerID,
sessionID as sessionID,
toString(mapKeys(cast((JSONExtractKeysAndValues(ec.custom, 'String')), 'Map(String, String)'))) as keys,
toString(mapValues(cast((JSONExtractKeysAndValues(ec.custom, 'String')), 'Map(String, String)'))) as values
from audience.uber_all
array join eventContents as ec
PREWHERE date = ('2022-07-22')
WHERE partnerID = 'XXX'
and ec.custom != '{}'
order by date, partnerID
"""
and my code:
df_tab = spark.read \
.format("jdbc") \
.option("driver", "com.github.housepower.jdbc.ClickHouseDriver") \
.option("url", f"jdbc:clickhouse://{ch_host}:9000/{ch_db}") \
.option("query", query) \
.option("user", ch_user) \
.option("password", ch_pass) \
.load()
But there I get an error:
housepower_error
BUT when I run above query with yandex driver: ru.yandex.clickhouse.ClickHouseDriver
everything works fine. (even with housepower jar)
This error also appears when I want to import column like this:
JSONExtractKeysAndValues(ec.custom, 'String')
or
toString(JSONExtractKeysAndValues(ec.custom, 'String'))
What am I doing wrong?
Also, how can I insert a DataFrame with an Array column into a ClickHouse table that has an Array(String) column using Spark JDBC? I looked everywhere but couldn't find a solution...
Thank you in advance!

Handling Duplicates in Databricks autoloader

I am new to Databricks Autoloader. We have a requirement to process data from AWS S3 into a Delta table via Databricks Autoloader. While testing it I ran into a duplicate issue: if I upload a file named, say, emp_09282021.csv with the same data as emp_09272021.csv, no duplicate is detected and the rows are simply inserted again. So if emp_09272021.csv had 5 rows, the table has 10 rows after I upload emp_09282021.csv.
Below is the code that I tried:
spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", "csv") \
.option("header",True) \
.schema("id string,name string, age string,city string") \
.load("s3://some-s3-path/source/") \
.writeStream.format("delta") \
.option("mergeSchema", "true") \
.option("checkpointLocation", "s3://some-s3-path/tgt_checkpoint_0928/") \
.start("s3://some-s3-path/spark_stream_processing/target/")
Any guidance on how to handle this, please?
Detecting duplicates is not Autoloader's job: it lets you ingest data, but you need to handle duplicates yourself. There are several approaches:
Use the built-in dropDuplicates function. It's recommended to use it with watermarking to avoid building up a huge state, but you need a column that serves as event time, and it should be part of the dropDuplicates column list (see docs for more details):
streamingDf \
.withWatermark("eventTime", "10 seconds") \
.dropDuplicates(["col1", "eventTime"])
Use Delta's merge capability: insert only the data that isn't already in the Delta table. You need to use foreachBatch for that. Something like this (note that the table should already exist, or you need to add handling for a non-existent table):
from delta.tables import *
def drop_duplicates(df, epoch):
    table = DeltaTable.forPath(spark,
        "s3://some-s3-path/spark_stream_processing/target/")
    dname = "destination"
    uname = "updates"
    dup_columns = ["col1", "col2"]
    merge_condition = " AND ".join([f"{dname}.{col} = {uname}.{col}"
        for col in dup_columns])
    table.alias(dname).merge(df.alias(uname), merge_condition)\
        .whenNotMatchedInsertAll().execute()
# ....
spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", "csv") \
.option("header",True) \
.schema("id string,name string, age string,city string") \
.load("s3://some-s3-path/source/") \
.writeStream.foreachBatch(drop_duplicates)\
.option("checkpointLocation", "s3://some-s3-path/tgt_checkpoint_0928/") \
.start()
In this code you need to change the dup_columns variable to specify columns that are used to detect duplicates.
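As a small illustration of what changing dup_columns does, the sketch below builds the same kind of merge condition in plain Python so it can be inspected without a cluster. The function name build_merge_condition is hypothetical, and picking id and name as the duplicate key is only an assumption based on the schema in the question.

```python
def build_merge_condition(columns, dest="destination", updates="updates"):
    """Return an AND-joined equality condition over the given key columns."""
    if not columns:
        raise ValueError("at least one column is needed to detect duplicates")
    return " AND ".join(f"{dest}.{c} = {updates}.{c}" for c in columns)


# Assumed key columns, taken from the schema used in the question.
condition = build_merge_condition(["id", "name"])
print(condition)
```

Rows whose key columns all match an existing row are skipped by whenNotMatchedInsertAll, so the fewer columns listed, the more aggressively rows are treated as duplicates.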

Spark SQL Transformation returns no data (Structured Streaming)

I have a Kafka stream through which I am getting JSON-based IoT device logs. I'm using PySpark to process the stream, analyze it, and create a transformed output.
My device json looks like this:
{"messageid":"1209a714-811d-4ad6-82b7-5797511d159f",
"mdsversion":"1.0",
"timestamp":"2020-01-20 19:04:32 +0530",
"sensor_id":"CAM_009",
"location":"General Assembly Area",
"detection_class":"10"}
{"messageid":"4d119126-2d12-412c-99c2-c159381bee5c",
"mdsversion":"1.0",
"timestamp":"2020-01-20 19:04:32 +0530",
"sensor_id":"CAM_009",
"location":"General Assembly Area",
"detection_class":"10"}
I'm trying to transform the logs so that I get a count of entries for each device, grouped by timestamp and sensor id. The resulting JSON would look like this:
{
"sensor_id":"CAM_009",
"timestamp":"2020-01-20 19:04:32 +0530",
"location":"General Assembly Area",
"count":2
}
Full code that I'm trying - pyspark-kafka.py
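from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, to_timestamp
from pyspark.sql.types import StructType, StringType
spark = SparkSession.builder.appName('analytics').getOrCreate()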
spark.sparkContext.setLogLevel('ERROR')
brokers='kafka-mybroker-url-host:9092'
readTopic = 'DetectionEntry'
outTopic = 'DetectionResults'
df = spark.readStream.format("kafka").option("kafka.bootstrap.servers",brokers).option("subscribe",readTopic).load()
transaction_detail_df1 = df.selectExpr("CAST(value AS STRING)", "timestamp")
alert_schema = StructType() \
.add("message_id", StringType()) \
.add("mdsversion", StringType()) \
.add("timestamp", StringType()) \
.add("sensor_id", StringType()) \
.add("location", StringType()) \
.add("detection_class", StringType())
transaction_detail_df2 = transaction_detail_df1\
.select(from_json(col("value"), alert_schema).alias("alerts"))
transaction_detail_df3 = transaction_detail_df2.select("alerts.*")
transaction_detail_df3 = transaction_detail_df3.withColumn("timestamp",to_timestamp(col("timestamp"),"YYYY-MM-DD HH:mm:ss SSSS")).withWatermark("timestamp", "500 milliseconds")
tempView = transaction_detail_df3.createOrReplaceTempView("alertsview")
results = spark.sql("select sensor_id, timestamp, location, count(*) as count from alertsview group by sensor_id, timestamp, location")
results.printSchema()
results_kakfa_output = results
results_kakfa_output.writeStream \
.format("console") \
.outputMode("append") \
.trigger(processingTime='3 seconds') \
.start().awaitTermination()
When I run this code, I get the following output. The overall objective is to process all device logs at an interval of 3 seconds and find the unique counts for each timestamp entry of a device within that interval. I tried the SQL query on a MySQL database with the same schema and it works fine. However, I'm getting no results in the output here to process further, and I can't figure out what I'm missing.
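For reference, here is a plain-Python sketch (no Spark involved) of the aggregation the query is meant to perform on the two sample messages, which can serve as the expected result when debugging the stream. It is only an illustration of the intended grouping. Separately, and as a guess rather than a confirmed fix, the pattern passed to to_timestamp may be worth checking: in Java-style date patterns YYYY is the week-based year and DD the day of year, so something like yyyy-MM-dd HH:mm:ss Z might be closer to the sample data.

```python
import json
from collections import Counter
from datetime import datetime

# The two sample messages from the question.
raw_messages = [
    '{"messageid":"1209a714-811d-4ad6-82b7-5797511d159f","mdsversion":"1.0",'
    '"timestamp":"2020-01-20 19:04:32 +0530","sensor_id":"CAM_009",'
    '"location":"General Assembly Area","detection_class":"10"}',
    '{"messageid":"4d119126-2d12-412c-99c2-c159381bee5c","mdsversion":"1.0",'
    '"timestamp":"2020-01-20 19:04:32 +0530","sensor_id":"CAM_009",'
    '"location":"General Assembly Area","detection_class":"10"}',
]

counts = Counter()
for raw in raw_messages:
    msg = json.loads(raw)
    # Parse only to validate the format; %z accepts offsets such as +0530.
    datetime.strptime(msg["timestamp"], "%Y-%m-%d %H:%M:%S %z")
    counts[(msg["sensor_id"], msg["timestamp"], msg["location"])] += 1

# One output record per (sensor_id, timestamp, location) group.
results = [
    {"sensor_id": s, "timestamp": t, "location": loc, "count": n}
    for (s, t, loc), n in counts.items()
]
print(json.dumps(results, indent=2))
```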

How to specify column data type when writing Spark DataFrame to Oracle

I want to write a Spark DataFrame to an Oracle table by using Oracle JDBC driver. My code is listed below:
url = "jdbc:oracle:thin:@servername:sid"
mydf.write \
.mode("overwrite") \
.option("truncate", "true") \
.format("jdbc") \
.option("url", url) \
.option("driver", "oracle.jdbc.OracleDriver") \
.option("createTableColumnTypes", "desc clob, price double") \
.option("user", "Steven") \
.option("password", "123456") \
.option("dbtable", "table1").save()
What I want is to set the desc column to the clob type and the price column to the double precision type, but Spark tells me that the clob type is not supported. The desc strings are about 30K characters long. I really need your help. Thanks
According to this note, some data types are not supported. If the target table has already been created with a CLOB column, createTableColumnTypes may be redundant. You can check whether writing to a CLOB column is possible with Spark JDBC once the table exists.
Create the table in the target database beforehand with the required schema, then use mode='append' to save the records.
mode='append' only inserts records without modifying the table schema.
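A minimal sketch of the append approach, assuming the table (with its CLOB column) has already been created in the database. The helper name jdbc_append_options is hypothetical; it only collects the connection options, and the actual write is shown as a comment because it needs a Spark session and a reachable database.

```python
def jdbc_append_options(url, table, user, password,
                        driver="oracle.jdbc.OracleDriver"):
    """Collect JDBC options for appending to a table that already exists.

    createTableColumnTypes is left out on purpose: Spark does not create
    the table in this flow, so the existing column types are kept.
    """
    return {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": driver,
    }


opts = jdbc_append_options(
    "jdbc:oracle:thin:@servername:sid", "table1", "Steven", "123456"
)
print(sorted(opts))

# Assumed usage (not executed here):
# mydf.write.format("jdbc").options(**opts).mode("append").save()
```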

Running custom Apache Phoenix SQL query in PySpark

Could someone provide an example using pyspark on how to run a custom Apache Phoenix SQL query and store the result of that query in a RDD or DF. Note: I am looking for a custom query and not an entire table to be read into a RDD.
From Phoenix Documentation, to load an entire table I can use this:
table = sqlContext.read \
.format("org.apache.phoenix.spark") \
.option("table", "<TABLENAME>") \
.option("zkUrl", "<hostname>:<port>") \
.load()
I want to know the equivalent for a custom SQL query, something like:
sqlResult = sqlContext.read \
.format("org.apache.phoenix.spark") \
.option("sql", "select * from <TABLENAME> where <CONDITION>") \
.option("zkUrl", "<HOSTNAME>:<PORT>") \
.load()
Thanks.
This can be done using Phoenix as a JDBC data source as given below:
sql = '(select COL1, COL2 from TABLE where COL3 = 5) as TEMP_TABLE'
df = sqlContext.read.format('jdbc')\
.options(driver="org.apache.phoenix.jdbc.PhoenixDriver", url='jdbc:phoenix:<HOSTNAME>:<PORT>', dbtable=sql).load()
df.show()
However, note that if the SQL statement contains column aliases, .show() will throw an exception (it works if you use .select() to pick the columns that are not aliased). This is possibly a bug in Phoenix.
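The dbtable value in this approach is a parenthesized subquery followed by an alias. A small sketch of a helper that produces that shape from a plain query is shown here; the name as_dbtable is hypothetical, and the default alias simply mirrors the TEMP_TABLE used in the answer.

```python
def as_dbtable(query: str, alias: str = "TEMP_TABLE") -> str:
    """Wrap a SELECT statement so it can be passed as the JDBC dbtable option."""
    # Drop surrounding whitespace and a trailing semicolon, which would
    # otherwise end up inside the parentheses.
    stripped = query.strip().rstrip(";").rstrip()
    if not stripped:
        raise ValueError("query must not be empty")
    return f"({stripped}) as {alias}"


dbtable = as_dbtable("select COL1, COL2 from TABLE where COL3 = 5;")
print(dbtable)
```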
You need to use .sql to work with custom queries. Here is the syntax:
dataframe = sqlContext.sql("select * from <table> where <condition>")
dataframe.show()
With Spark 2, I had no problem with the .show() function, and I did not need .select() to print all values of a DataFrame coming from Phoenix.
So make sure your SQL query is wrapped in parentheses. See my example:
val sql = " (SELECT P.PERSON_ID as PERSON_ID, P.LAST_NAME as LAST_NAME, C.STATUS as STATUS FROM PERSON P INNER JOIN CLIENT C ON C.CLIENT_ID = P.PERSON_ID) "
val dft = dfPerson.sparkSession.read.format("jdbc")
.option("driver", "org.apache.phoenix.jdbc.PhoenixDriver")
.option("url", "jdbc:phoenix:<HOSTNAME>:<PORT>")
.option("useUnicode", "true")
.option("continueBatchOnError", "true")
.option("dbtable", sql)
.load()
dft.show();
It shows me:
+---------+--------------------+------+
|PERSON_ID|           LAST_NAME|STATUS|
+---------+--------------------+------+
|     1005|             PerDiem|Active|
|     1008|NAMEEEEEEEEEEEEEE...|Active|
|     1009|           Admission|Active|
|     1010|            Facility|Active|
|     1011|                MeUP|Active|
+---------+--------------------+------+
