How to read data from DB2 using a JDBC connection? - apache-spark

I'm trying to fetch data from DB2 using:
df = spark.read.format("jdbc").option("user", "user").option("password", "password")\
    .option("driver", "com.ibm.db2.jcc.DB2Driver")\
    .option("url", "jdbc:db2://url:<port>/<DB>")\
    .option("query", query)\
    .load()
Locally, the query option works, but on the server it asks me to use dbtable instead.
When I use dbtable I get a SQL syntax error (SQLCODE=-104, SQLSTATE=42601) and it picks up the wrong columns.
Can someone help me with this?

You can use the AS400 driver to fetch DB2 data using Spark.
Your DB2 URL will look something like this: jdbc:as400://<DBIPAddress>
val query = "(select * from db.temptable) temp"
val df = spark.read.format("jdbc")
  .option("url", <YourURL>)
  .option("driver", "com.ibm.as400.access.AS400JDBCDriver")
  .option("dbtable", query)
  .option("user", <Username>)
  .option("password", <Password>)
  .load()
Please note that you need to keep the query in the format shown above, i.e. wrap it in parentheses and give it an alias. Hope this resolves your issue.
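As a side note, the same "wrap the query in parentheses and give it an alias" pattern also works with the standard DB2 jcc driver from the question if you switch from the query option to dbtable. A minimal PySpark sketch, reusing the URL and credentials from the question (the column and table names are just placeholders):
# dbtable accepts a subquery only when it is parenthesised and aliased
query = "(SELECT col1, col2 FROM myschema.mytable) AS tmp"
df = spark.read.format("jdbc")\
    .option("url", "jdbc:db2://url:<port>/<DB>")\
    .option("driver", "com.ibm.db2.jcc.DB2Driver")\
    .option("user", "user")\
    .option("password", "password")\
    .option("dbtable", query)\
    .load()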

Related

Spark SQL persistent view over jdbc data source

I want to create a persistent (global) view in Spark SQL that gets its data from an underlying JDBC database connection. It works fine when I use a temporary (session-scoped) view, as shown below, but fails when I try to create a regular (persistent and global) view.
I don't understand why the latter should not work, and I couldn't find any docs or hints, as all the examples use temporary views. Technically I cannot see why it shouldn't work: the data is properly retrieved from the JDBC source in the temporary view, so it should not matter if I "store" the query in a persistent view so that every call to the view retrieves the data directly from the JDBC source.
Config.
tbl_in = myjdbctable
tbl_out = myview
db_user = 'myuser'
db_pw = 'mypw'
jdbc_url = 'jdbc:sqlserver://myserver.domain:1433;database=mydb'
This works.
query = f"""
create or replace temporary view {tbl_out}
using jdbc
options(
dbtable '{tbl_in}',
user '{db_user}',
password '{db_pw}',
url '{jdbc_url}'
)
"""
spark.sql(query)
> DataFrame[]
This does not work.
query = f"""
create or replace view {tbl_out}
using jdbc
options(
dbtable '{tbl_in}',
user '{db_user}',
password '{db_pw}',
url '{jdbc_url}'
)
"""
spark.sql(query)
> ParseException:
Error.
ParseException:
mismatched input 'using' expecting {'(', 'UP_TO_DATE', 'AS', 'COMMENT', 'PARTITIONED', 'TBLPROPERTIES'}(line 3, pos 0)
== SQL ==
create or replace view myview
using jdbc
^^^
options(
dbtable 'myjdbctable',
user 'myuser',
password '[REDACTED]',
url 'jdbc:sqlserver://myserver.domain:1433;database=mydb'
)
TL;DR: A Spark SQL table over a JDBC source behaves like a view and so can be used like one.
It seems my assumptions about JDBC tables in Spark SQL were flawed. It turns out that a SQL table with a JDBC source (i.e. created via using jdbc) is actually a live query against the JDBC source, and not a one-off JDBC query at table-creation time as I had assumed. In my mind it therefore behaves like a view. That means that if the underlying JDBC source changes (e.g. new entries in a column), this is reflected in the Spark SQL table on read (e.g. a select from it) without having to re-create the table.
It follows that a Spark SQL table over a JDBC source satisfies my requirement of having an always up-to-date reflection of the underlying table/SQL object in the JDBC source. Usually I would use a view for that. Maybe this is the reason why there is no persistent view over a JDBC source, but only temporary views (which of course still make sense, as they are session-scoped). It should be noted that the Spark SQL JDBC table behaves like a view in ways that may be surprising (a small sketch of creating such a table follows the list below); in particular:
if you add a column to the underlying JDBC table, it will not show up in the Spark SQL table
if you remove a column from the underlying JDBC table, an error will occur when the Spark SQL table is accessed (assuming the removed column was present when the Spark SQL table was created)
if you drop the underlying JDBC table, an error will occur when the Spark SQL table is accessed
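For reference, here is a minimal sketch of creating such a persistent table (rather than a view) over the same JDBC source, reusing the configuration variables from the question; the _tbl suffix is just an illustrative name:
# a persistent Spark SQL table backed by the JDBC source; it acts as a
# live query against the source, as described above
query = f"""
create table if not exists {tbl_out}_tbl
using jdbc
options(
dbtable '{tbl_in}',
user '{db_user}',
password '{db_pw}',
url '{jdbc_url}'
)
"""
spark.sql(query)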
The input of spark.sql should be DML (Data Manipulation Language); its output is a DataFrame.
In terms of best practices, you should avoid using DDL (Data Definition Language) with spark.sql. Even if some statements happen to work, it is not meant to be used this way.
If you want to run DDL, simply connect to your DB using Python packages.
If you want to create a temp view in Spark, do it using the Spark API method createTempView (or createOrReplaceTempView), as sketched below.
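A minimal sketch of that approach, reusing the connection settings from the question:
# read the JDBC table into a DataFrame, then register it as a
# session-scoped temporary view instead of going through DDL
df = spark.read.format("jdbc")\
    .option("url", jdbc_url)\
    .option("dbtable", tbl_in)\
    .option("user", db_user)\
    .option("password", db_pw)\
    .load()
df.createOrReplaceTempView(tbl_out)
spark.sql(f"select * from {tbl_out}").show()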

Writing data to timestreamDb from AWS Glue

I'm trying to use Glue streaming to write data to AWS Timestream, but I'm having a hard time configuring the JDBC connection.
The steps I'm following are listed below, along with the documentation link: https://docs.aws.amazon.com/timestream/latest/developerguide/JDBC.configuring.html
I'm uploading the jar to S3. There are multiple jars at https://github.com/awslabs/amazon-timestream-driver-jdbc/releases and I tried each one of them.
In the Glue job I'm pointing the jar lib path to the above S3 location.
In the job script I'm trying to read from Timestream using both Spark and Glue with the code below, but it's not working. Can someone explain what I'm doing wrong here?
This is my code:
url = "jdbc:timestream://AccessKeyId=<myAccessKeyId>;SecretAccessKey=<mySecretAccessKey>;SessionToken=<mySessionToken>;Region=us-east-1"
source_df = sparkSession.read.format("jdbc").option("url",url).option("dbtable","IoT").option("driver","software.amazon.timestream.jdbc.TimestreamDriver").load()
datasink1 = glueContext.write_dynamic_frame.from_options(frame = applymapping0, connection_type = "jdbc", connection_options = {"url": url, "driver": "software.amazon.timestream.jdbc.TimestreamDriver", "database": "CovidTestDb", "dbtable": "CovidTestTable"}, transformation_ctx = "datasink1")
As of this date (April 2022) there is no support for write operations in Timestream's JDBC driver (I reviewed the code and saw a bunch of "no write support" exceptions). It is possible to read data from Timestream using Glue, though. The following steps worked for me:
Upload the timestream-query and timestream-jdbc jars to an S3 bucket that you can reference in your Glue script
Ensure that the IAM role for the script has read access to the Timestream database and table
You don't need to use the access key and secret parameters in the JDBC URL; something like jdbc:timestream://Region=<timestream-db-region> should be enough
Specify the driver and fetchsize options: option("driver", "software.amazon.timestream.jdbc.TimestreamDriver") and option("fetchsize", "100") (tweak the fetchsize according to your needs)
Following is a complete example of reading a dataframe from timestream:
val df = sparkSession.read.format("jdbc")
.option("url", "jdbc:timestream://Region=us-east-1")
.option("driver","software.amazon.timestream.jdbc.TimestreamDriver")
// optionally add a query to narrow the data to fetch
.option("query", "select * from db.tbl where time between ago(15m) and now()")
.option("fetchsize", "100")
.load()
df.write.format("console").save()
Hope this helps

Cosmos DB spatial query using Spark

I would like to query a Cosmos DB collection using a spatial query, specifically the ST_DISTANCE query. This query works as intended using the azure-cosmos Python SDK.
I am looking to use this query via Apache Spark for a more complex query pattern. However, using the ST_DISTANCE query in a SQL cell in a notebook results in the following error.
Error in SQL statement: AnalysisException: Undefined function: 'ST_DISTANCE'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.
The notebook is initialized as follows.
# Configure Catalog Api to be used
spark.conf.set("spark.sql.catalog.cosmosCatalog", "com.azure.cosmos.spark.CosmosCatalog")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountEndpoint", cosmosEndpoint)
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountKey", cosmosMasterKey)
from pyspark.sql.functions import col
df = spark.read.format("cosmos.oltp").options(**cfg)\
.option("spark.cosmos.read.inferSchema.enabled", "true")\
.load()
df.createOrReplaceTempView("outlets")
_______________________________________________________________________
%sql
SELECT * FROM outlets f WHERE ST_DISTANCE(f.boundary, POINT(0,0)) < 600
Based on what I understand from the Cosmos DB Spark connector GitHub repo[1], not all Cosmos DB filter queries are supported via the connector (yet?). So ST_DISTANCE and the other filter functions in the spatial family aren't going to work, as those aren't predicates that Spark natively supports pushing down to the database.
I found something that helps sail past this issue, at least temporarily. The query config[2] allows sending a custom query directly to Cosmos DB; a temporary view can then be built over the result and queried. This will not work for all use cases, but it solved my issue, where I need a single view with the distance filtering already done; the rest can be handled via Spark SQL.
Refer to spark.cosmos.read.customQuery[2] in the sample below.
outlets_cfg = {
"spark.cosmos.accountEndpoint" : cosmosEndpoint,
"spark.cosmos.accountKey" : cosmosMasterKey,
"spark.cosmos.database" : cosmosDatabaseName,
"spark.cosmos.container" : cosmosContainerName,
"spark.cosmos.read.customQuery" : "SELECT * FROM c WHERE ST_DISTANCE(c.location,{\"type\":\"Point\",\"coordinates\": [12.832489, 18.9553242]}) < 1000"
}
df = spark.read.format("cosmos.oltp").options(**outlets_cfg)\
.option("spark.cosmos.read.inferSchema.enabled", "true")\
.load()
df.createOrReplaceTempView("outlets")
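With the distance filter already pushed into the custom query, the temporary view can then be queried with plain Spark SQL, for example:
# the spatial filtering happened in Cosmos DB; any further filtering
# or aggregation runs in Spark
spark.sql("SELECT * FROM outlets").show()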
[1] https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3-1_2-12/
[2] https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3-1_2-12/docs/configuration-reference.md#query-config

Read mariaDB4J table into spark DataSet using JDBC

I am trying to read a table from MariaDB4J via JDBC using the following command:
Dataset<Row> jdbcDF = spark.read()
.format("jdbc")
.option("url", url)
.option("dbtable", String.format("SELECT userID FROM %s;", TableNAME))
.load();
I get the following error:
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'SELECT userID FROM MYTABLE; WHERE 1=0' at line 1
I am not sure where the WHERE comes from and why I get this error...
Thanks
The value supplied to the dbtable option is not correct. Instead of specifying a plain SQL query, you should use either a table name qualified with the schema or a parenthesised SQL query with an alias.
Dataset<Row> jdbcDF = spark.read()
.format("jdbc")
.option("url", url)
.option("dbtable", String.format("(SELECT userID FROM %s) as table_alias", tableNAME))
.load();
Your question is similar to this one, so forgive me if I quote myself:
The reason why you see the strange looking query WHERE 1=0 is that Spark tries to infer the schema of your data frame without loading any actual data. This query is guaranteed never to deliver any results, but the query result's metadata can be used by Spark to get the schema information.
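In PySpark terms (a sketch; url is assumed to hold the same JDBC URL as in the question), load() only triggers that zero-row schema probe, and rows are fetched once an action runs:
# load() resolves the schema via the WHERE 1=0 probe; data is only
# read when an action such as show() is executed
df = spark.read.format("jdbc")\
    .option("url", url)\
    .option("dbtable", "(SELECT userID FROM MYTABLE) as table_alias")\
    .load()
df.printSchema()  # derived from the probe query's metadata
df.show(5)        # this is when rows are actually fetched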

spark-sql Table or view not found error

I'm trying to run a basic Java program using Spark SQL and JDBC, and I'm running into the following error. I'm not sure what's wrong here; most of the material I have read does not discuss what needs to be done to fix this problem.
It would also be great if someone could point me to some good material to read on Spark SQL (Spark 2.1.1). I'm planning to use Spark to implement ETLs, connecting to MySQL and other data sources.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or view not found: myschema.mytable; line 1 pos 21;
String MYSQL_CONNECTION_URL = "jdbc:mysql://localhost:3306/myschema";
String MYSQL_USERNAME = "root";
String MYSQL_PWD = "root";
Properties connectionProperties = new Properties();
connectionProperties.put("user", MYSQL_USERNAME);
connectionProperties.put("password", MYSQL_PWD);
Dataset<Row> jdbcDF2 = spark.read()
.jdbc(MYSQL_CONNECTION_URL, "myschema.mytable", connectionProperties);
spark.sql("SELECT COUNT(*) FROM myschema.mytable").show();
This is because, by default, Spark does not register any tables from the JDBC connection's schemas in the Spark SQL context. You must register the table yourself:
jdbcDF2.createOrReplaceTempView("mytable");
spark.sql("select count(*) from mytable");
Your jdbcDF2 has its source in myschema.mytable in MySQL and will load data from that table on some action.
Remember that a MySQL table is not the same as a Spark table or view. You are telling Spark to read data from MySQL, but you must register this DataFrame or Dataset as a table or view in the current Spark SQL context or Spark session.
