Incremental data load from Redshift to S3 using PySpark and Glue Jobs - apache-spark

I have created a pipeline where the data ingestion takes place between Redshift and S3. I was able to do the complete load using the below method:
def readFromRedShift(spark: SparkSession, schema, tablename):
    table = str(schema) + "." + str(tablename)
    (url, Properties, host, port, db) = con.getConnection("REDSHIFT")
    df = spark.read.jdbc(url=url, table=table, properties=Properties)
    return df
Here, getConnection is a separate method in another class that handles all the Redshift connection details. I then used this method to create a DataFrame and wrote the results to S3, which worked like a charm.
Now I want to load only the incremental data. Will enabling the Glue Job Bookmarks option help me, or is there another way to do it? I followed the official documentation, but it did not help with my problem. If I run the job for the first time it will load the complete data; if I rerun it, will it pick up only the newly arrived records?

You are right: this can be achieved with job bookmarks, but at the same time it can be a bit tricky.
Please refer to this doc https://aws.amazon.com/blogs/big-data/load-data-incrementally-and-optimized-parquet-writer-with-aws-glue/
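To make that concrete, here is a rough sketch (not a drop-in implementation) of what a bookmark-enabled Glue job could look like, assuming bookmarks are enabled in the job settings, the Redshift table is registered in the Glue Data Catalog, and the table has a monotonically increasing key such as updated_at; every name below is a placeholder:

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# transformation_ctx is what lets the bookmark remember which rows were already
# processed; jobBookmarkKeys tells Glue which column to track between runs.
source = glueContext.create_dynamic_frame.from_catalog(
    database="my_catalog_db",               # placeholder catalog database
    table_name="schema_tablename",          # placeholder catalog table
    additional_options={
        "jobBookmarkKeys": ["updated_at"],  # placeholder incremental key column
        "jobBookmarkKeysSortOrder": "asc",
    },
    transformation_ctx="redshift_source",
)

# Write only the newly picked-up rows to S3 as Parquet.
glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/redshift-export/"},
    format="parquet",
    transformation_ctx="s3_sink",
)

# Committing the job persists the bookmark state for the next run.
job.commit()

On the first run this loads everything; on later runs, with bookmarks enabled, only rows beyond the last recorded bookmark key are read.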

Related

Writing data to timestreamDb from AWS Glue

I'm trying to use Glue streaming to write data to AWS Timestream, but I'm having a hard time configuring the JDBC connection.
The steps I'm following are listed below, along with the documentation link: https://docs.aws.amazon.com/timestream/latest/developerguide/JDBC.configuring.html
I'm uploading the jar to S3. There are multiple jars here and I tried each one of them: https://github.com/awslabs/amazon-timestream-driver-jdbc/releases
In the Glue job I'm pointing the jar lib path to the above S3 location.
In the job script I'm trying to read from Timestream using both Spark and Glue with the code below, but it's not working. Can someone explain what I'm doing wrong here?
This is my code:
url = "jdbc:timestream://AccessKeyId=<myAccessKeyId>;SecretAccessKey=<mySecretAccessKey>;SessionToken=<mySessionToken>;Region=us-east-1"
source_df = sparkSession.read.format("jdbc").option("url",url).option("dbtable","IoT").option("driver","software.amazon.timestream.jdbc.TimestreamDriver").load()
datasink1 = glueContext.write_dynamic_frame.from_options(frame = applymapping0, connection_type = "jdbc", connection_options = {"url": url, "driver": "software.amazon.timestream.jdbc.TimestreamDriver", "database": "CovidTestDb", "dbtable": "CovidTestTable"}, transformation_ctx = "datasink1")
As of this date (April 2022) there is no support for write operations in Timestream's JDBC driver (I reviewed the code and saw a bunch of "no write support" exceptions). It is possible to read data from Timestream using Glue, though. The following steps worked for me:
Upload the timestream-query and timestream-jdbc jars to an S3 bucket that you can reference in your Glue script.
Ensure that the IAM role for the script has read access to the Timestream database and table.
You don't need to put the access key and secret parameters in the JDBC URL; something like jdbc:timestream://Region=<timestream-db-region> should be enough.
Specify the driver and fetchsize options: option("driver", "software.amazon.timestream.jdbc.TimestreamDriver") and option("fetchsize", "100") (tweak the fetchsize according to your needs).
Following is a complete example of reading a dataframe from timestream:
val df = sparkSession.read.format("jdbc")
  .option("url", "jdbc:timestream://Region=us-east-1")
  .option("driver", "software.amazon.timestream.jdbc.TimestreamDriver")
  // optionally add a query to narrow the data to fetch
  .option("query", "select * from db.tbl where time between ago(15m) and now()")
  .option("fetchsize", "100")
  .load()
df.write.format("console").save()
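Since the Glue job in the question is written in Python, a PySpark equivalent of the snippet above might look like this (untested sketch; the Timestream JDBC driver jar still has to be attached to the job):

source_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:timestream://Region=us-east-1")
    .option("driver", "software.amazon.timestream.jdbc.TimestreamDriver")
    # optionally add a query to narrow the data to fetch
    .option("query", "select * from db.tbl where time between ago(15m) and now()")
    .option("fetchsize", "100")
    .load()
)
source_df.show()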
Hope this helps

Chaining Delta Streams programmatically raising AnalysisException

Situation: I am producing a Delta folder with data from a previous streaming query A, and later reading from it into another DataFrame, as shown here:
DF_OUT.writeStream.format("delta").(...).start("path")
(...)
DF_IN = spark.readStream.format("delta").load("path")
1 - When I try to read it this way in a subsequent readStream (chaining queries for an ETL pipeline) from the same program, I end up with the exception below.
2 - When I run it in the Scala REPL, however, it runs smoothly.
I'm not sure what is happening there, but it sure is puzzling.
org.apache.spark.sql.AnalysisException: Table schema is not set. Write data into it or use CREATE TABLE to set the schema.;
at org.apache.spark.sql.delta.DeltaErrors$.schemaNotSetException(DeltaErrors.scala:365)
at org.apache.spark.sql.delta.sources.DeltaDataSource.sourceSchema(DeltaDataSource.scala:74)
at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:209)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:95)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:95)
at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:33)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:171)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:225)
at org.apache.spark.ui.DeltaPipeline$.main(DeltaPipeline.scala:114)
From the Delta Lake Quick Guide - Troubleshooting:
Table schema is not set error
Problem:
When the path of Delta table is not existing, and try to stream data from it, you will get the following error.
org.apache.spark.sql.AnalysisException: Table schema is not set. Write data into it or use CREATE TABLE to set the schema.;
Solution:
Make sure the path of a Delta table is created.
After reading the error message, I did try to be a good boy and follow the advice, so I tried to make sure there actually IS valid data in the delta folder I am trying to read from BEFORE calling the readStream, and voila !
import java.io.File

// returns true once the directory exists and contains at least one file
def hasFiles(dir: String): Boolean = {
  val d = new File(dir)
  if (d.exists && d.isDirectory) {
    d.listFiles.filter(_.isFile).size > 0
  } else false
}

DF_OUT.writeStream.format("delta").(...).start(DELTA_DIR)

while (!hasFiles(DELTA_DIR)) {
  print("DELTA FOLDER STILL EMPTY")
  Thread.sleep(10000)
}
print("FOUND DATA ON DELTA A - WAITING 30 SEC")
Thread.sleep(30000)

DF_IN = spark.readStream.format("delta").load(DELTA_DIR)
It ended up working, but I had to wait long enough for "something to happen" (I don't know exactly what, TBH, but it seems that reading from Delta needs some writes to be complete, maybe the metadata?).
However, this is still a hack. I wish it were possible to start reading from an empty Delta folder and wait for content to start pouring into it.
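One way to avoid the polling hack, assuming the schema of the stream is known up front, is to initialize the Delta path with an empty batch write before starting the readStream, so the table schema already exists in the Delta log. A rough sketch (in PySpark, although the question is in Scala; the schema and path below are placeholders):

from pyspark.sql.types import StructField, StructType, StringType, TimestampType

# Placeholder schema matching whatever DF_OUT produces.
schema = StructType([
    StructField("id", StringType()),
    StructField("event_time", TimestampType()),
])

# Writing an empty DataFrame commits the schema to the Delta transaction log,
# so a later readStream no longer fails with "Table schema is not set".
spark.createDataFrame([], schema).write.format("delta").mode("append").save(DELTA_DIR)

df_in = spark.readStream.format("delta").load(DELTA_DIR)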
In my case, since I couldn't find the absolute path, a simple solution was to use this alternative:
spark.readStream.format("delta").table("tableName")

How to create an index in ElasticSearch and push data from streaming query?

I am not able to create an index and push data to Elasticsearch using the Elasticsearch sink:
df.writeStream
.outputMode("append")
.format("org.elasticsearch.spark.sql")
.option("es.nodes","localhost:9200")
.option("checkpointLocation","/tmp/")
.option("es.resource","index/type")
.start
There are no errors, but unfortunately it is not working.
At times (1 out of 10) the above snippet creates a new index, but it doesn't push the data of the DataFrame/Dataset to the created index. The remaining times it doesn't even create an index. It seems to be something with the Elasticsearch configuration properties.
Try the below code
df.writeStream
.outputMode("append")
.format("es")
.option("es.nodes","localhost")
.option("es.port","9200")
.option("checkpointLocation","/tmp/")
.start("index/type")
Also enable automatic index creation in the Spark configuration:
SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
conf.set("es.index.auto.create", "true");
Refer to the official documentation for more help:
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html
From one of the OP's comments:
Solved. The time column used with the withWatermark option should be a real, current streaming time value. My test data was a dump taken from production with old timestamp values; after correcting the time column to the current timestamp, it worked as expected.
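To illustrate that comment: with a watermark, rows whose event-time column lags far behind the latest event time seen are dropped as late data, which is why a replayed production dump with old timestamps produced no output. A hedged sketch of re-stamping test data (stream_df and the column/interval names are just examples):

from pyspark.sql import functions as F

# For testing only: overwrite the stale event-time column with the current
# timestamp so the rows are not discarded as late by the watermark.
fresh_df = stream_df.withColumn("event_time", F.current_timestamp())

windowed = (
    fresh_df
    .withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "5 minutes"))
    .count()
)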
You need to add the elasticsearch-hadoop library from the Maven repository in order to do that.
After that, use this:
df.writeStream
.outputMode("append")
.format("org.elasticsearch.spark.sql")
.option("checkpointLocation","GiveCheckPointingName")
.option("es.port","9200")
.option("es.nodes","LocalHost")
.start("indexname/indextype")
.awaitTermination()

Google Cloud Bigquery Library Error

I am receiving this error:
Cannot set destination table in jobs with DDL statements
when I try to resubmit a job from the job.build_resource() function in the google.cloud.bigquery library.
It seems that the destination table is set to something like this after that function call.
'destinationTable': {'projectId': 'xxx', 'datasetId': 'xxx', 'tableId': 'xxx'},
Am I doing something wrong here? Thanks to anyone who can give me any guidance.
EDIT:
The job is initially being triggered by this
query = bq.query(sql_rendered)
We store the job id and use it later to check the status.
We get the job like this
job = bq.get_job(job_id=job_id)
If it meets a condition (in this case, it failed due to rate limiting), we retry the job like this:
di = job._build_resource()
jo = bigquery.Client(project=self.project_client).job_from_resource(di)
jo._begin()
I think that's pretty much all of the code you need, but happy to provide more if needed.
You are seeing this error because you have a DDL statement in your query. What is happening is that the job_config changes some values after the execution of the first query, particularly job_config.destination. To overcome this issue, you could reset job_config.destination to None after each job submission, or use a different job_config for every query.
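As a hedged sketch of that suggestion (the client setup and names below are placeholders, not the OP's exact code), you could build a fresh QueryJobConfig for every submission, or clear the destination before retrying:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project id

def run_query(sql: str) -> bigquery.QueryJob:
    # A fresh QueryJobConfig per submission means a destination set by an earlier
    # run (which is not allowed together with DDL statements) cannot leak in.
    job_config = bigquery.QueryJobConfig()
    return client.query(sql, job_config=job_config)

# Alternatively, when reusing a config, clear the destination before retrying:
# job_config.destination = None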

Spark DataFrame Filter using Binary (Array[Bytes]) data

I have a DataFrame from a JDBC table hitting MySQL, and I need to filter it using a UUID. The data is stored in MySQL using binary(16), and when queried out in Spark it is converted to Array[Byte] as expected.
I'm new to Spark and have been trying various ways to pass a variable of type UUID into the DataFrame's filter method.
I've tried statements like:
val id: UUID = // other logic that looks this up
df.filter(s"id = $id")
df.filter("id = " convertToByteArray(id))
df.filter("id = " convertToHexString(id))
All of these fail with different error messages.
I just need to somehow pass in Binary types but can't seem to put my finger on how to do so properly.
Any help is greatly appreciated.
After reviewing even more sources online, I found a way to accomplish this without using the filter method.
When reading from my sparkSession, I just use an ad hoc subquery instead of the table name, as follows:
sparkSession.read.jdbc(connectionString, s"(SELECT id, (other cols omitted) FROM MyTable WHERE id = 0x$id) AS MyTable", props)
This pre-filters the results for me and then I just work with the data frame as I need.
If anyone knows of a solution using filter, I'd still love to know it as that would be useful in some cases.
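For what it's worth, one possible filter-based approach (untested sketch, and in PySpark rather than the Scala used in the question) is to compare the binary column against a 16-byte literal built from the UUID:

import uuid

from pyspark.sql import functions as F

some_id = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")  # placeholder value

# lit() turns the 16-byte bytearray into a BinaryType literal, so the comparison
# happens against the binary(16) column rather than an interpolated string.
filtered = df.filter(F.col("id") == F.lit(bytearray(some_id.bytes)))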
