How to run 2 processes in parallel in Scala and Spark? - multithreading

Env: Spark 1.6, Scala
Hi
I need to run two processes in parallel. The first one receives data, and the second one transforms the data and saves it to a Hive table. I want to repeat the first process at an interval of 1 minute and the second process at an interval of 2 minutes.
==========First Process=== executes once per minute=============
val DFService = hivecontext.read
  .format("jdbc")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  .option("url", "jdbc:sqlserver://xx.x.x.xx:xxxx;database=myDB")
  .option("dbtable", "(select Service_ID, ServiceIdentifier from myTable) tmp")
  .option("user", "userName")
  .option("password", "myPassword")
  .load()

DFService.registerTempTable("memTBLService")
DFService.write.mode("append").saveAsTable("hiveTable")
============= Second Process === executes once every 2 minutes =========
var DF2 = hivecontext.sql("select * from hiveTable")
var data=DF2.select(DF2("Service_ID")).distinct
data.show()
How can I run these two processes in parallel, each at its desired interval, in Scala?
Thanks
Hossain

Write two separate Spark applications.
Then use cron to execute each app on its desired schedule. Alternatively, you can use Apache Airflow to schedule your Spark apps.
Refer to the following question for how to use cron with Spark: How to schedule my Apache Spark application to run everyday at 00.30 AM(night) in IBM Bluemix?
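If everything has to stay inside a single long-running driver instead of two cron-scheduled apps, a rough alternative is to schedule the two blocks from the question on a ScheduledExecutorService. This is only a sketch: runIngest and runTransform are hypothetical wrappers around the two snippets above, and error handling is omitted.
import java.util.concurrent.{Executors, TimeUnit}

// Hypothetical wrappers around the two snippets from the question.
def runIngest(): Unit = {
  // first process: JDBC read from SQL Server, then DFService.write.mode("append").saveAsTable("hiveTable")
}
def runTransform(): Unit = {
  // second process: hivecontext.sql("select * from hiveTable"), distinct Service_ID, show()
}

val scheduler = Executors.newScheduledThreadPool(2)

// run the first process every minute and the second every 2 minutes
scheduler.scheduleAtFixedRate(new Runnable { def run(): Unit = runIngest() }, 0, 1, TimeUnit.MINUTES)
scheduler.scheduleAtFixedRate(new Runnable { def run(): Unit = runTransform() }, 0, 2, TimeUnit.MINUTES)

// call scheduler.shutdown() when the application should stop
Note that both tasks submit jobs to the same SparkContext, so they only overlap if the cluster has enough resources for both.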

Related

Spark MySQL JDBC hangs without errors

I have a simple application ingesting MySQL tables into S3 running on Spark 3.0.0 (EMR 6.1).
The MySQL table is loaded using a single large executor with 48G of memory as follows:
spark.read \
    .option("url", jdbc_url) \
    .option("dbtable", query) \
    .option("driver", driver_class) \
    .option("user", secret['username']) \
    .option("password", secret['password']) \
    .format("jdbc").load()
The Spark jobs work without any problem on small tables, where the MySQL query takes less than ~6 minutes to complete. However, in the two jobs where the query goes beyond this time, the Spark job gets stuck in RUNNING and never raises an error. The stderr and stdout logs don't show any progress, and the executor is completely healthy.
The DAG is very simple.
In MySQL (Aurora RDS) the query appears to have completed, but the connection is still open; checking the thread state shows 'cleaned up'.
I have tried MySQL connector versions 5 and 8, but both show the same behaviour. I guess this could be related to a Spark default timeout configuration, but I would like some guidance.
This is a thread dump of the single executor: [thread dump screenshot]

Spark Structured Streaming Batch Query

I am new to Kafka and Spark Structured Streaming. I want to know how Spark in batch mode knows which offsets to read from. If I specify "startingOffsets" as "earliest", I only get the latest records, not all the records in the partition. I ran the same code on 2 different clusters: Cluster A (local machine) fetched 6 records, Cluster B (TST cluster, very first run) fetched 1 record.
df = spark \
    .read \
    .format("kafka") \
    .option("kafka.bootstrap.servers", broker) \
    .option("subscribe", topic) \
    .option("startingOffsets", "earliest") \
    .option("endingOffsets", "latest") \
    .load()
I am planning to run my batch once a day; will I get all the records from yesterday's run up to the current run? Where do I see offsets and commits for batch queries?
According to the Structured Streaming + Kafka Integration Guide, your offsets are stored in the checkpoint location that you provide in the write part of your batch job.
If you do not delete the checkpoint files, the job will continue reading from Kafka where it left off. If you delete the checkpoint files, or if you run the job for the very first time, the job will consume messages according to the startingOffsets option.
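As a rough illustration of that behaviour (sketched in Scala, with placeholder broker, topic, and paths), a checkpointed "once a day" batch is usually set up as a streaming query with Trigger.Once, so the checkpoint location on the write side records how far the previous run got:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("daily-kafka-batch").getOrCreate()

// startingOffsets only matters for the very first run, when no checkpoint exists yet
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // placeholder
  .option("subscribe", "my_topic")                    // placeholder
  .option("startingOffsets", "earliest")
  .load()

// Trigger.Once processes everything new since the checkpointed offsets, then stops
df.writeStream
  .format("parquet")
  .option("path", "/data/kafka_dump")                              // placeholder
  .option("checkpointLocation", "/checkpoints/daily-kafka-batch")  // offsets are stored here
  .trigger(Trigger.Once())
  .start()
  .awaitTermination()
Deleting that checkpoint directory resets the job to the startingOffsets behaviour described above.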

Parallel execution of read and write API calls in PySpark SQL

I need to load the incremental records from a set of tables in MySQL to Amazon S3 in Parquet format. These tables are common across several databases/schemas in the AWS managed MySQL instance. The code should copy data from each of the schemas (each of which has the same set of common tables) in parallel.
I'm using the PySpark SQL read API to connect to the MySQL instance and read the data of each table for a schema, and I'm writing the resulting dataframe to S3 as a Parquet file using the write API. I'm running this in a loop for each table in a database, as shown in the code below:
def load_data_to_s3(databases_df):
    db_query_properties = config['mysql-query']
    auto_id_values = config['mysql-auto-id-values']
    for row in databases_df.collect():
        for table in db_query_properties.keys():
            last_recorded_id_value = auto_id_values[table]
            select_sql = "select * from {}.{} where id>{}".format(row.database_name, table, last_recorded_id_value)
            df = spark.read.format("jdbc") \
                .option("driver", mysql_db_properties['driver']) \
                .option("url", row.database_connection_url) \
                .option("dbtable", select_sql) \
                .option("user", username) \
                .option("password", password) \
                .load()
            s3_path = 's3a://{}/{}/{}'.format(s3_bucket, database_dir, table)
            df.write.parquet(s3_path, mode="append")
I would like to know how I can scale this code to process multiple databases in parallel on an EMR cluster. Please suggest a suitable approach. Let me know if any more details are required.
I can propose two solutions:
1. Easy way
Submit multiple jobs to your EMR cluster at once (one job per DB). If monitoring is a problem, just have the logs for the failed jobs written to S3 or HDFS.
2. Bit of code change required
You could try using threading to parallelize the data pulls from each DB. I can show a sample of how to do it, but you might need to make further changes to suit your use case.
Sample implementation:
import threading

def load_data_to_s3(row):
    # loads every configured table for a single database (one row of databases_df)
    db_query_properties = config['mysql-query']
    auto_id_values = config['mysql-auto-id-values']
    for table in db_query_properties.keys():
        last_recorded_id_value = auto_id_values[table]
        select_sql = "select * from {}.{} where id>{}".format(row.database_name, table, last_recorded_id_value)
        df = spark.read.format("jdbc") \
            .option("driver", mysql_db_properties['driver']) \
            .option("url", row.database_connection_url) \
            .option("dbtable", select_sql) \
            .option("user", username) \
            .option("password", password) \
            .load()
        s3_path = 's3a://{}/{}/{}'.format(s3_bucket, database_dir, table)
        df.write.parquet(s3_path, mode="append")

# one thread per database row
threads = [threading.Thread(target=load_data_to_s3, args=(db,)) for db in databases_df.collect()]
for t in threads:
    t.start()
for t in threads:
    t.join()
Also, please make sure to switch the scheduler to FAIR by setting the spark.scheduler.mode property to FAIR. The code above creates one thread per DB; if you want to cap the number of threads running in parallel, modify the loop accordingly.
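For reference, a minimal way to set that property is at session-creation time (sketched in Scala; the app name is just a placeholder, and the same builder.config call is available in PySpark):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parallel-db-loads")              // placeholder
  .config("spark.scheduler.mode", "FAIR")    // jobs submitted from different threads share executors fairly
  .getOrCreate()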
Additionally, if you want to create new jobs from within the program, pass your SparkSession along with the arguments.
Your list_of_databases is not parallelized. To do the parallel processing, you should parallelize the list and run the job in parallel using foreach or another mechanism that Spark provides.
Turn on the step-concurrency option in EMR and submit an EMR step for each table, or use Spark's FAIR scheduler, which can run the jobs in parallel internally with a small modification of your code.

How and where JDBC connection(s) are created in Spark Structured Streaming job's forEachBatch loop?

I have a Spark Structured Streaming job which reads from Kafka and writes the dataframe to Oracle inside a foreachBatch loop. My code is below. I understand that the number of parallel connections will depend on the numPartitions configuration, but I am confused about how connections are reused across executors, tasks, and micro-batches.
If a connection is made once on each executor, will it remain open for future micro-batches as well, or will a new connection be established for each iteration?
If a connection is made for each task inside an executor (e.g. 10 tasks, then 10 connections), does that mean new connections will be established for every loop iteration and its tasks?
StreamingDF.writeStream
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .foreachBatch { (microBatchDF: DataFrame, batchID: Long) =>
    microBatchDF.write
      .format("jdbc")
      .option("url", jdbcUrl)
      .option("dbtable", "my_schema.my_table")
      .option("user", username)
      .option("password", password)
      .option("batchsize", 10000)
      .mode(SaveMode.Overwrite)
      .save()
  }
What is the best way to reuse the same connections to minimise batch execution time?

save spark dataframe to multiple targets parallel

I need to write my final dataframe to HDFS and to an Oracle database.
Currently, once saving to HDFS is done, it starts writing to the RDBMS. Is there any way to use Java threads to save the same dataframe to HDFS and to the RDBMS in parallel?
finalDF.write().option("numPartitions", "10").jdbc(url, exatable, jdbcProp);
finalDF.write().mode("OverWrite").insertInto(hiveDBWithTable);
Thanks.
Cache finalDF before writing to HDFS and the RDBMS. Then make sure that enough executors are available for writing simultaneously: if the number of partitions in finalDF is p and the number of cores per executor is c, you need a minimum of ceil(p/c) + ceil(10/c) executors (the 10 comes from the numPartitions option used for the JDBC write).
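For example, with purely illustrative numbers: if finalDF has p = 20 partitions and each executor has c = 4 cores, that is ceil(20/4) + ceil(10/4) = 5 + 3 = 8 executors at a minimum.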
df.show and df.write are actions, and actions submitted from the same thread run sequentially in Spark. So the answer is no, it is not possible out of the box unless threads are used.
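A minimal sketch of that threaded approach in Scala (finalDF, url, exatable, jdbcProp and hiveDBWithTable are the names from the question; the Future-based wiring here is just one way to do it):
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// separate threads so the two write jobs can be submitted concurrently
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(2))

finalDF.cache()   // avoid recomputing the lineage once per target

val jdbcWrite = Future {
  finalDF.write.option("numPartitions", "10").jdbc(url, exatable, jdbcProp)
}
val hiveWrite = Future {
  finalDF.write.mode("overwrite").insertInto(hiveDBWithTable)
}

// wait for both writes; this fails if either write fails
Await.result(Future.sequence(Seq(jdbcWrite, hiveWrite)), Duration.Inf)
With the default FIFO scheduler the two jobs may still contend for the same executors; setting spark.scheduler.mode to FAIR lets them share the cluster more evenly.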
We can use the code below to append dataframe values to the table:
DF.write
  .mode("append")
  .format("jdbc")
  .option("driver", driverProp)
  .option("url", urlDbRawdata)
  .option("dbtable", TABLE_NAME)
  .option("user", userName)
  .option("password", password)
  .option("numPartitions", maxNumberDBPartitions)
  .option("batchsize", batchSize)
  .save()
