Writing data to Timestream DB from AWS Glue - apache-spark

I'm trying to use Glue streaming to write data to AWS Timestream, but I'm having a hard time configuring the JDBC connection.
The steps I'm following are below, along with the documentation link: https://docs.aws.amazon.com/timestream/latest/developerguide/JDBC.configuring.html
I'm uploading the jar to S3. There are multiple jars on the releases page and I tried each one of them: https://github.com/awslabs/amazon-timestream-driver-jdbc/releases
In the Glue job I'm pointing the jar lib path to the above S3 location.
In the job script I'm trying to read from Timestream using both Spark and Glue with the code below, but it's not working. Can someone explain what I'm doing wrong here?
This is my code:
url = "jdbc:timestream://AccessKeyId=<myAccessKeyId>;SecretAccessKey=<mySecretAccessKey>;SessionToken=<mySessionToken>;Region=us-east-1"
source_df = sparkSession.read.format("jdbc").option("url", url).option("dbtable", "IoT").option("driver", "software.amazon.timestream.jdbc.TimestreamDriver").load()
datasink1 = glueContext.write_dynamic_frame.from_options(frame = applymapping0, connection_type = "jdbc", connection_options = {"url": url, "driver": "software.amazon.timestream.jdbc.TimestreamDriver", "database": "CovidTestDb", "dbtable": "CovidTestTable"}, transformation_ctx = "datasink1")

As of this writing (April 2022) there is no support for write operations in Timestream's JDBC driver (I reviewed the code and saw a bunch of "no write support" exceptions). It is possible to read data from Timestream using Glue, though. The following steps worked for me:
Upload the timestream-query and timestream-jdbc jars to an S3 bucket that you can reference in your Glue script.
Ensure that the IAM role for the script has read access to the Timestream database and table.
You don't need the access key and secret parameters in the JDBC URL; something like jdbc:timestream://Region=<timestream-db-region> should be enough.
Specify the driver and fetchsize options: option("driver","software.amazon.timestream.jdbc.TimestreamDriver") and option("fetchsize", "100") (tweak fetchsize according to your needs).
The following is a complete example of reading a dataframe from Timestream:
val df = sparkSession.read.format("jdbc")
.option("url", "jdbc:timestream://Region=us-east-1")
.option("driver","software.amazon.timestream.jdbc.TimestreamDriver")
// optionally add a query to narrow the data to fetch
.option("query", "select * from db.tbl where time between ago(15m) and now()")
.option("fetchsize", "100")
.load()
df.write.format("console").save()
Hope this helps

Related

How to read data from DB2 using a JDBC connection?

I'm trying to fetch the data from DB2 using:
df = spark.read.format("jdbc").option("user", "user").option("password", "password") \
    .option("driver", "com.ibm.db2.jcc.DB2Driver") \
    .option("url", "jdbc:db2://url:<port>/<DB>") \
    .option("query", query) \
    .load()
Locally the query option works, but on the server it asks me to use dbtable instead.
When I use dbtable I get a SQL syntax error (SQLCODE=-104, SQLSTATE=42601) and it picks up the wrong columns.
Can someone help me with this?
You can use the AS400 driver to fetch DB2 data using Spark.
Your DB2 URL will look something like this: jdbc:as400://<DBIPAddress>
val query = "(select * from db.temptable) temp"
val df = spark.read.format("jdbc")
  .option("url", <YourURL>)
  .option("driver", "com.ibm.as400.access.AS400JDBCDriver")
  .option("dbtable", query)
  .option("user", <Username>)
  .option("password", <Password>)
  .load()
Please note that you will need to keep the query format as shown above (i.e. give an alias to the query). Hope this resolves your issue.
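Since the snippet in the question is PySpark, a rough equivalent would look like the following. This is only a sketch: the URL, credentials, and table are placeholders, and it assumes the AS400 JDBC jar is on the Spark classpath.
query = "(select * from db.temptable) temp"
df = (spark.read.format("jdbc")
      .option("url", "jdbc:as400://<DBIPAddress>")
      .option("driver", "com.ibm.as400.access.AS400JDBCDriver")
      .option("dbtable", query)  # the alias around the subquery is required
      .option("user", "<Username>")
      .option("password", "<Password>")
      .load())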

Incremental data load from Redshift to S3 using PySpark and Glue Jobs

I have created a pipeline where the data ingestion takes place between Redshift and S3. I was able to do the complete load using the below method:
def readFromRedShift(spark: SparkSession, schema, tablename):
    table = str(schema) + str(".") + str(tablename)
    (url, Properties, host, port, db) = con.getConnection("REDSHIFT")
    df = spark.read.jdbc(url=url, table=table, properties=Properties)
    return df
Here getConnection is a method on a separate class that handles all the Redshift-related details. Later on, I used this method to create a data frame and wrote the results to S3, which worked like a charm.
Now, I want to load the incremental data. Will enabling the Glue Job Bookmarks option help me, or is there another way to do it? I followed this official documentation, but it was of no help for my problem statement. If I run the job for the first time it will load the complete data; if I rerun it, will it load only the newly arrived records?
You are right, it can be achieved with job bookmarks, but at the same time it can be a bit tricky.
Please refer to this doc https://aws.amazon.com/blogs/big-data/load-data-incrementally-and-optimized-parquet-writer-with-aws-glue/
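For illustration, here is a minimal sketch of a bookmark-enabled Glue job reading Redshift through the Data Catalog and writing Parquet to S3. Job bookmarks must also be enabled in the job properties; the catalog database, table, key column, and S3 paths below are placeholders.
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Bookmark state is keyed on transformation_ctx; for JDBC sources such as
# Redshift, an increasing key column (assumed here to be "id") tells Glue
# which rows are new since the last successful run.
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_catalog_db",
    table_name="my_redshift_table",
    redshift_tmp_dir="s3://my-bucket/tmp/",
    additional_options={"jobBookmarkKeys": ["id"], "jobBookmarkKeysSortOrder": "asc"},
    transformation_ctx="read_redshift")

glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},
    format="parquet",
    transformation_ctx="write_s3")

# Committing the job records the bookmark, so the next run only picks up new rows.
job.commit()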

Moving data from Kinesis -> RDS using Spark with AWS Glue implementation locally

I have a Spark project with an AWS Glue implementation running locally.
I listen to a Kinesis stream, and when data arrives in JSON format I can store it to S3 correctly.
Instead of storing it in S3, I want to store it in AWS RDS.
I have tried to use:
dataFrame.write
.format("jdbc")
.option("url","jdbc:mysql://aurora.cluster.region.rds.amazonaws.com:3306/database")
.option("user","user")
.option("password","password")
.option("dbtable","test-table")
.option("driver","com.mysql.jdbc.Driver")
.save()
The Spark project gets data from a Kinesis stream using an AWS Glue job.
I want to add the data to an Aurora database.
It fails with this error:
Caused by: java.sql.SQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '-glue-table (`label2` TEXT , `customerid` TEXT , `sales` TEXT , `name` TEXT )' at line 1
This is the test dataFrame I'm using (dataFrame.show()):
+------+----------+-----+--------------------+
|label2|customerid|sales| name|
+------+----------+-----+--------------------+
| test6| test| test|streamingtesttest...|
+------+----------+-----+--------------------+
Use a Glue DynamicFrame instead of a plain DataFrame and use the glueContext sink to publish to Aurora.
So the final code could be:
lazy val mysqlJsonOption = jsonOptions(MYSQL_AURORA_URI)
//Write to Aurora
val dynamicFrame = DynamicFrame(joined, glueContext)
glueContext.getSink("mysql", mysqlJsonOption).writeDynamicFrame(dynamicFrame)
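If you are writing the job in Python rather than Scala, a roughly equivalent sketch would be the following, assuming the dataFrame and glueContext from the question are in scope; the connection options and table name are placeholders.
from awsglue.dynamicframe import DynamicFrame

# Convert the Spark DataFrame to a Glue DynamicFrame
dynamic_frame = DynamicFrame.fromDF(dataFrame, glueContext, "dynamic_frame")

# Publish to Aurora MySQL through the Glue JDBC sink
glueContext.write_dynamic_frame.from_options(
    frame=dynamic_frame,
    connection_type="mysql",
    connection_options={
        "url": "jdbc:mysql://aurora.cluster.region.rds.amazonaws.com:3306/database",
        "dbtable": "test_table",
        "user": "user",
        "password": "password",
    })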

Spark Cassandra Connector Issue

I am trying to integrate Cassandra with Spark and am facing the issue below.
Issue:
com.datastax.spark.connector.util.ConfigCheck$ConnectorConfigurationException: Invalid Config Variables
Only known spark.cassandra.* variables are allowed when using the Spark Cassandra Connector.
spark.cassandra.keyspace is not a valid Spark Cassandra Connector variable.
Possible matches:
spark.cassandra.sql.keyspace
spark.cassandra.output.batch.grouping.key
at com.datastax.spark.connector.util.ConfigCheck$.checkConfig(ConfigCheck.scala:50)
at com.datastax.spark.connector.cql.CassandraConnectorConf$.apply(CassandraConnectorConf.scala:253)
at org.apache.spark.sql.cassandra.CassandraSourceRelation$.apply(CassandraSourceRelation.scala:263)
at org.apache.spark.sql.cassandra.CassandraCatalog.org$apache$spark$sql$cassandra$CassandraCatalog$$buildRelation(CasandraCatalog.scala:41)
at org.apache.spark.sql.cassandra.CassandraCatalog$$anon$1.load(CassandraCatalog.scala:26)
at org.apache.spark.sql.cassandra.CassandraCatalog$$anon$1.load(CassandraCatalog.scala:23)
Please find below the versions of Spark, Cassandra, and the connector I am using.
Spark : 1.6.0
Cassandra : 2.1.17
Connector Used : spark-cassandra-connector_2.10-1.6.0-M1.jar
Below is the code snippet I am using to connect to Cassandra from Spark.
val conf: org.apache.spark.SparkConf = new SparkConf(true)
  .setAppName("Spark Cassandra")
  .set("spark.cassandra.connection.host", "abc.efg.lkh")
  .set("spark.cassandra.auth.username", "xyz")
  .set("spark.cassandra.auth.password", "1234")
  .set("spark.cassandra.keyspace", "abcded")
val sc = new SparkContext("local[*]", "Spark Cassandra",conf)
val csc = new CassandraSQLContext(sc)
csc.setKeyspace("abcded")
val my_df = csc.sql("select * from table")
When I try to create the DataFrame here, I get the error posted above. I tried without passing the keyspace in the conf, but then it tries to access the default keyspace, to which the mentioned user doesn't have access.
A JIRA for this was already opened and closed:
https://datastax-oss.atlassian.net/browse/SPARKC-102
yet I am still getting this issue. Please let me know whether I need to use the latest connector to resolve it.
Thanks in advance.
The important information is in the error message you posted [formatted for readability]:
Invalid Config Variables
Only known spark.cassandra.* variables are allowed when using the Spark Cassandra Connector.
spark.cassandra.keyspace is not a valid Spark Cassandra Connector variable.
Possible matches: spark.cassandra.sql.keyspace
spark.cassandra.keyspace is not an available property for the connector. A full list of the available properties can be found here: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md
You may have some luck using the suggested spark.cassandra.sql.keyspace; otherwise you may just need to explicitly specify the keyspace for every Cassandra interaction you perform using the connector.
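To illustrate the second suggestion, here is a minimal PySpark sketch that specifies the keyspace per read at the DataFrame level rather than through a global conf setting. The host, credentials, keyspace, and table names are the question's placeholders, and the same options apply from Scala.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = (SparkConf()
        .setAppName("Spark Cassandra")
        .set("spark.cassandra.connection.host", "abc.efg.lkh")
        .set("spark.cassandra.auth.username", "xyz")
        .set("spark.cassandra.auth.password", "1234"))

sc = SparkContext("local[*]", "Spark Cassandra", conf=conf)
sqlContext = SQLContext(sc)

# Pass the keyspace explicitly on each read instead of setting the
# unsupported spark.cassandra.keyspace property in the conf.
my_df = (sqlContext.read
         .format("org.apache.spark.sql.cassandra")
         .options(keyspace="abcded", table="table")
         .load())
my_df.show()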

MemSQL Spark Job

I am trying to read a CSV file in a Spark job using the MemSQL Extractor, do some enrichment using a Transformer, and load it into a MemSQL database, all in Java.
I see there is a memsql-spark interface jar, but I'm not finding any useful Java API documentation or examples.
I have started writing the extractor to read from CSV, but I don't know how to move further.
public Option<RDD<byte[]>> nextRDD(SparkContext sparkContext, UserExtractConfig config, long batchInterval, PhaseLogger logger) {
RDD<String> inputFile = sparkContext.textFile(filePath, minPartitions);
RDD<byte[]> bytes = inputFile.map(ByteUtils.utf8StringToBytes(filePath), String.class); //compilation error
return bytes; //compilation error
}
Would appreciate it if someone could point me in the right direction to get started...
Thanks...
First, configure the Spark connector in Java using the following code:
SparkConf conf = new SparkConf();
conf.set("spark.datasource.singlestore.clientEndpoint", "singlestore-host");
conf.set("spark.datasource.singlestore.user", "admin");
conf.set("spark.datasource.singlestore.password", "s3cur3-pa$$word");
Once Spark is connected, you can read the CSV into a Spark DataFrame, transform and manipulate the data according to your requirements, and then write the DataFrame to the database table.
Also attaching a link for your reference:
spark-singlestore.
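As a rough end-to-end sketch of that flow (shown here in PySpark for brevity; the same options apply from the Java API): the endpoint, credentials, file path, and table name are placeholders, and it assumes the singlestore-spark-connector package is on the classpath.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("csv-to-singlestore")
         .config("spark.datasource.singlestore.clientEndpoint", "singlestore-host")
         .config("spark.datasource.singlestore.user", "admin")
         .config("spark.datasource.singlestore.password", "s3cur3-pa$$word")
         .getOrCreate())

# Read the CSV into a DataFrame (placeholder path)
df = spark.read.option("header", "true").csv("s3://my-bucket/input.csv")

# Example enrichment step: add a load timestamp column
df = df.withColumn("load_ts", F.current_timestamp())

# Write the enriched DataFrame to a SingleStore/MemSQL table (placeholder name)
df.write.format("singlestore").mode("append").save("mydb.mytable")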
