How to create Kudu table from pyspark dataframe

I am trying a simple approach to write a DataFrame from PySpark into a Kudu table that does not exist yet:
df.write.format('org.apache.kudu.spark.kudu') \
.option('kudu.master', kudu_master) \
.option('kudu.table', kudu_table) \
.mode("Append") \
.save()
but I get the exception
py4j.protocol.Py4JJavaError: An error occurred while calling o92.save.
: org.apache.kudu.client.NonRecoverableException: the table does not exist: table_name: "kudu_table"
I expected the table to be created, as happens with other database types. Am I missing something, or do Kudu tables need to be pre-created?
After some searching, I tried calling the underlying JVM functions directly. I could create a KuduContext, but to create the table I have to wrap all the needed objects (schema, schema columns, etc.), and I could not find much information online about how to do that:
kc = sc._jvm.org.apache.kudu.spark.kudu.KuduContext(kudu_master, sc._jsc.sc()) # working
print(kc.tableExists("test_table")) #working
kc.createTable("test_table", sc._jvm.org.apache.kudu.Schema(data.schema), sc._jvm.org.apache.kudu.client.CreateTableOptions().addHashPartitions(list("myKey"), 3)) #not working
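The error above suggests that the Kudu writer does not create missing tables, so the table has to exist before the append. One possible workaround, assuming Impala is available in front of Kudu, is to generate the DDL from the DataFrame's column types and run it through an Impala client before writing. The helper name, the type mapping, and the idea of feeding it df.dtypes are assumptions made for this sketch, not part of the Kudu API.

```python
# Hypothetical, deliberately incomplete mapping from Spark SQL type names
# (as returned by df.dtypes) to Impala column types.
SPARK_TO_IMPALA = {
    "string": "STRING",
    "int": "INT",
    "bigint": "BIGINT",
    "float": "FLOAT",
    "double": "DOUBLE",
    "boolean": "BOOLEAN",
    "timestamp": "TIMESTAMP",
}


def kudu_create_table_ddl(table, dtypes, key_columns, hash_partitions):
    """Build an Impala CREATE TABLE ... STORED AS KUDU statement.

    dtypes is a list of (column_name, spark_type_name) pairs, the shape
    that df.dtypes returns. key_columns must be a list of column names.
    """
    if isinstance(key_columns, str):
        # list("myKey") would give ['m', 'y', 'K', 'e', 'y'], so a bare
        # string is rejected instead of being split into characters.
        raise TypeError("key_columns must be a list such as ['myKey']")

    by_name = dict(dtypes)
    missing = [k for k in key_columns if k not in by_name]
    if missing:
        raise ValueError(f"key columns not in schema: {missing}")

    # Primary key columns are listed first, followed by the remaining columns.
    ordered = [(k, by_name[k]) for k in key_columns]
    ordered += [(n, t) for n, t in dtypes if n not in key_columns]

    # Unmapped Spark types (for example decimals) raise KeyError here.
    cols = ", ".join(f"{n} {SPARK_TO_IMPALA[t]}" for n, t in ordered)
    keys = ", ".join(key_columns)
    return (
        f"CREATE TABLE {table} ({cols}, PRIMARY KEY ({keys})) "
        f"PARTITION BY HASH ({keys}) PARTITIONS {hash_partitions} "
        f"STORED AS KUDU"
    )
```

Note that list("myKey"), as used in the failing createTable call, produces a list of single characters rather than ["myKey"]. Depending on the deployment, a table created through Impala may be visible to Kudu under a prefixed name, so the kudu.table option may need adjusting.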

Related

AWS Glue reading data from Sybase table

While loading data from a Sybase DB in AWS Glue, I encounter an error:
Py4JJavaError: An error occurred while calling o261.load.
: java.sql.SQLException: The identifier that starts with '__SPARK_GEN_JDBC_SUBQUERY_NAME' is too long. Maximum length is 30.
The code I use is:
(
    spark.read.format("jdbc")
    .option("driver", "net.sourceforge.jtds.jdbc.Driver")
    .option("url", jdbc_url)
    .option("query", query)
    .option("user", db_username)
    .option("password", db_password)
    .load()
)
Is there any way to set a custom, shorter identifier? Interestingly, I can load all the data from a particular table by replacing the query option with option("dbtable", table), but running a custom query fails.
Best Regards
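Since the dbtable option works here, one thing worth trying is passing the custom query through dbtable as a parenthesized subquery with a short alias of your own. Spark's JDBC source accepts anything valid in a FROM clause for dbtable, so the generated __SPARK_GEN_JDBC_SUBQUERY_NAME alias is not used. The helper below is a hypothetical sketch; it assumes the Sybase server accepts derived tables with an alias.

```python
def as_dbtable(query, alias="q"):
    """Wrap a SQL query so it can be passed to the JDBC `dbtable` option.

    The alias is chosen by the caller and kept within the 30-character
    identifier limit reported in the error message.
    """
    if not alias.isidentifier() or len(alias) > 30:
        raise ValueError("alias must be a plain identifier of at most 30 characters")
    # A trailing semicolon would break the enclosing FROM clause.
    cleaned = query.strip().rstrip(";").strip()
    return f"({cleaned}) {alias}"
```

In the read call, .option("query", query) would then become .option("dbtable", as_dbtable(query)).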

Error writing data to Bigquery using Databricks Pyspark

I run a daily job that writes data to BigQuery using Databricks PySpark. A recent configuration update for Databricks (https://docs.databricks.com/data/data-sources/google/bigquery.html) caused the job to fail. I followed all the steps in the docs. Reading data works again, but writing throws the following error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS not found
I also tried adding the configuration directly in the code (as advised for similar errors in Spark), but it did not help:
spark._jsc.hadoopConfiguration().set('fs.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem')
spark._jsc.hadoopConfiguration().set('fs.gs.auth.service.account.enable', 'true')
spark._jsc.hadoopConfiguration().set('google.cloud.auth.service.account.json.keyfile', "<path-to-key.json>")
My code is:
upload_table_dataset = 'testing_dataset'
upload_table_name = 'testing_table'
upload_table = upload_table_dataset + '.' + upload_table_name
(import_df.write.format('bigquery')
.mode('overwrite')
.option('project', 'xxxxx-test-project')
.option('parentProject', 'xxxxx-test-project')
.option('temporaryGcsBucket', 'xxxxx-testing-bucket')
.option('table', upload_table)
.save()
)

You need to install the GCS connector on your cluster first.

Cosmos DB spatial query using Spark

I would like to query a Cosmos DB collection using a spatial query, specifically ST_DISTANCE. This query works as intended with the azure-cosmos Python SDK.
I want to run the same query from Apache Spark as part of a more complex query pattern. However, using ST_DISTANCE in a SQL cell in a notebook results in the following error:
Error in SQL statement: AnalysisException: Undefined function: 'ST_DISTANCE'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.
The notebook is initialized as follows.
# Configure Catalog Api to be used
spark.conf.set("spark.sql.catalog.cosmosCatalog", "com.azure.cosmos.spark.CosmosCatalog")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountEndpoint", cosmosEndpoint)
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountKey", cosmosMasterKey)
from pyspark.sql.functions import col
df = spark.read.format("cosmos.oltp").options(**cfg)\
.option("spark.cosmos.read.inferSchema.enabled", "true")\
.load()
df.createOrReplaceTempView("outlets")
_______________________________________________________________________
%sql
SELECT * FROM outlets f WHERE ST_DISTANCE(f.boundary, POINT(0,0)) < 600

Based on what I understand from the Cosmos DB Spark connector GitHub repo [1], not all Cosmos DB filter queries are supported by the connector (yet?). ST_DISTANCE and the other spatial functions will not work, because they are not predicates that Spark natively supports and can push down to the database.
I found something that gets past this issue, at least temporarily. The query config [2] allows sending a custom query directly to Cosmos DB, and a temporary view can then be built over the result and queried. This will not work for all use cases, but it solved mine, where I need a single view with the distance filtering already done. The rest can be handled via Spark SQL.
See spark.cosmos.read.customQuery [2] in the sample below.
outlets_cfg = {
"spark.cosmos.accountEndpoint" : cosmosEndpoint,
"spark.cosmos.accountKey" : cosmosMasterKey,
"spark.cosmos.database" : cosmosDatabaseName,
"spark.cosmos.container" : cosmosContainerName,
"spark.cosmos.read.customQuery" : "SELECT * FROM c WHERE ST_DISTANCE(c.location,{\"type\":\"Point\",\"coordinates\": [12.832489, 18.9553242]}) < 1000"
}
df = spark.read.format("cosmos.oltp").options(**outlets_cfg)\
.option("spark.cosmos.read.inferSchema.enabled", "true")\
.load()
df.createOrReplaceTempView("outlets")
[1] https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3-1_2-12/
[2] https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3-1_2-12/docs/configuration-reference.md#query-config
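The hand-escaped GeoJSON in the customQuery string above is easy to get wrong. As a small sketch, the query text can be built with json.dumps instead. The function and parameter names are hypothetical; only the resulting string is passed to spark.cosmos.read.customQuery.

```python
import json


def st_distance_query(field, lon, lat, max_distance_m, alias="c"):
    """Build a Cosmos DB SQL query that filters by distance from a point.

    GeoJSON coordinates are ordered [longitude, latitude].
    """
    point = json.dumps({"type": "Point", "coordinates": [lon, lat]})
    return (
        f"SELECT * FROM {alias} "
        f"WHERE ST_DISTANCE({alias}.{field}, {point}) < {max_distance_m}"
    )
```

The returned string can replace the literal value of "spark.cosmos.read.customQuery" in outlets_cfg, for example st_distance_query("location", 12.832489, 18.9553242, 1000).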

How to create a table with primary key using jdbc spark connector (to ignite)

I'm trying to save a Spark DataFrame to the Ignite cache using the Spark connector (PySpark) like this:
df.write.format("jdbc") \
.option("url", "jdbc:ignite:thin://<ignite ip>") \
.option("driver", "org.apache.ignite.IgniteJdbcThinDriver") \
.option("primaryKeyFields", 'id') \
.option("dbtable", "ignite") \
.mode("overwrite") \
.save()
# .option("createTableOptions", "primary key (id)") \
# .option("customSchema", 'id BIGINT PRIMARY KEY, txt TEXT') \
I have an error:
java.sql.SQLException: No PRIMARY KEY defined for CREATE TABLE
The library org.apache.ignite:ignite-spark-2.4:2.9.0 is installed. I can't use the ignite format because Azure Databricks uses a Spring Framework version that conflicts with the one in org.apache.ignite:ignite-spark-2.4:2.9.0, so I'm trying to use the JDBC thin client. But with it I can only read from or append to an existing cache.
I can't use the overwrite mode because I can't choose a primary key. The ignite format has a primaryKeyFields option, but it doesn't work with jdbc. The jdbc customSchema option is ignored. The createTableOptions option appends the primary key clause after the closing parenthesis of the schema, which causes a SQL syntax error.
Is there a way to specify a primary key with the JDBC Spark connector?

Here's an example with correct syntax that should work fine:
DataFrameWriter<Row> writer = resultDF
    .write()
    .format(IgniteDataFrameSettings.FORMAT_IGNITE())
    .option(IgniteDataFrameSettings.OPTION_CONFIG_FILE(), configPath)
    .option(IgniteDataFrameSettings.OPTION_TABLE(), "Person")
    .option(IgniteDataFrameSettings.OPTION_CREATE_TABLE_PRIMARY_KEY_FIELDS(), "id, city_id")
    .option(IgniteDataFrameSettings.OPTION_CREATE_TABLE_PARAMETERS(), "template=partitioned,backups=1")
    .mode(SaveMode.Append);
// The writer is only configured above; save() triggers the actual write.
writer.save();
Please let me know if something is wrong here.
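For the JDBC thin client route described in the question, where appending to an existing table already works, a possible workaround is to create the table yourself with an explicit primary key and then write in append mode. The sketch below only builds the Ignite CREATE TABLE statement; the helper name is hypothetical, and it assumes the statement is executed beforehand through any Ignite SQL client.

```python
def ignite_create_table_ddl(table, columns, primary_key, params=None):
    """Build an Ignite CREATE TABLE statement with an explicit primary key.

    columns is a list of (name, sql_type) pairs, primary_key a list of
    column names, and params the optional content of the WITH clause.
    """
    cols = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    keys = ", ".join(primary_key)
    ddl = f"CREATE TABLE IF NOT EXISTS {table} ({cols}, PRIMARY KEY ({keys}))"
    if params:
        # Ignite expects the WITH parameters as one double-quoted string.
        ddl += f' WITH "{params}"'
    return ddl
```

For the table in the question this would be called as ignite_create_table_ddl("ignite", [("id", "BIGINT"), ("txt", "VARCHAR")], ["id"]), followed by the original write with .mode("append").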

How to work with temporary tables in foreachBatch?

We are building a streaming platform where it is essential to run SQL queries on each batch.
val query = streamingDataSet.writeStream.option("checkpointLocation", checkPointLocation).foreachBatch { (df, batchId) => {
df.createOrReplaceTempView("events")
val df1 = ExecutionContext.getSparkSession.sql("select * from events")
df1.limit(5).show()
// More complex processing on dataframes
}}.trigger(trigger).outputMode(outputMode).start()
query.awaitTermination()
The error thrown is:
org.apache.spark.sql.streaming.StreamingQueryException: Table or view not found: events
Caused by: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'events' not found in database 'default';
The streaming source is Kafka with watermarking, and without Spark SQL we are able to execute DataFrame transformations. The Spark version is 2.4.0 and Scala is 2.11.7. The trigger is ProcessingTime every 1 minute and the OutputMode is Append.
Is there any other approach that allows using Spark SQL within foreachBatch? Would it work with an upgraded version of Spark, and if so, which version should we upgrade to?
Kindly help. Thank you.

tl;dr Replace ExecutionContext.getSparkSession with df.sparkSession.
The reason for the StreamingQueryException is that the streaming query tries to access the events temporary view through a SparkSession that knows nothing about it, i.e. ExecutionContext.getSparkSession.
The only SparkSession that has the events temporary view registered is the one the df DataFrame was created in, i.e. df.sparkSession.
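The same idea can be sketched for PySpark, assuming version 3.3 or later, where DataFrame exposes a sparkSession property. The function names and the view name are illustrative only; the point is that the SQL runs on the session that owns the batch DataFrame.

```python
def query_batch(batch_df, sql, view_name="events"):
    """Register the micro-batch as a temp view and query it.

    The query is issued through batch_df.sparkSession, the session in
    which the view was registered, rather than a globally held session.
    """
    batch_df.createOrReplaceTempView(view_name)
    return batch_df.sparkSession.sql(sql)


def process_batch(batch_df, batch_id):
    """Callback with the (DataFrame, batch id) signature foreachBatch expects."""
    result = query_batch(batch_df, "SELECT * FROM events")
    result.limit(5).show()
    # More complex processing on DataFrames would go here.
```

The callback would be attached with writeStream.foreachBatch(process_batch) on the streaming DataFrame.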

Please check the code snippet below. Here, I have created two separate DataFrames, responseDF1 and responseDF2, from resultDF and shown the output in the console. responseDF2 is created using a temporary view. You can try the same.
resultDF.writeStream.foreachBatch {(batchDF: DataFrame, batchId: Long) =>
batchDF.persist()
val responseDF1 = batchDF.selectExpr("ResponseObj.type","ResponseObj.key", "ResponseObj.activity", "ResponseObj.price")
responseDF1.show()
responseDF1.createTempView("responseTbl1")
val responseDF2 = batchDF.sparkSession.sql("select activity, key from responseTbl1")
responseDF2.show()
batchDF.sparkSession.catalog.dropTempView("responseTbl1")
batchDF.unpersist()
()}.start().awaitTermination()