Spark: connect two database tables to produce a third dataset - apache-spark

DataFrameLoadedFromLeftDatabase = data loaded with DataFrameReader from the first database, say LeftDB.
I need to
iterate through each row in this dataframe,
connect to a second database, say RightDB,
find some matching record in RightDB,
and do some business logic.
Because this is an iterative, per-record operation, it is not simply doable with a JOIN between LeftDB and RightDB to derive some new fields. The goal is to create a new DataFrame targetDF and write it into a third database, say ThirdDB, using DataFrameWriter.
I know that I can use
val targetDF = DataFrameLoadedFromLeftDatabase.mapPartitions(
  partition => {
    val rightDBconnection = new DbConnection // establish a connection to RightDB
    val result = partition.map(record => {
      readMatchingFromRightDBandDoBusinessLogicTransformationAndReturnAList(record, rightDBconnection)
    }).toList
    rightDBconnection.close()
    result.iterator
  }
).toDF()
targetDF.write
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "table3")
  .option("user", "username")
  .option("password", "password")
  .save()
I am wondering whether Apache Spark is suitable for this type of chatty data processing application.
I am also wondering whether iterating through each record in RightDB will be too chatty in this approach.
I am looking for advice on improving this design to make better use of Spark's capabilities. I also want to make sure the processing does not cause too many shuffle operations, for performance reasons.
Ref: Related SO Post

In this kind of situation we usually prefer spark.sql. Basically, define two different DFs, join them based on a query, and then apply your business logic afterwards.
For example:
import org.apache.spark.sql.{DataFrame, SparkSession}
// Add your columns here
case class MyResult(ID: String, NAME: String)
// Create a SparkSession
val spark = SparkSession.builder()
  .appName("Join Tables and Add Prefix to ID Column")
  .config("spark.master", "local[*]")
  .getOrCreate()
import spark.implicits._ // needed below for .as[MyResult] and the typed map
// Read the first table from DB1
val firstTable: DataFrame = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/DB1")
  .option("dbtable", "FIRST_TABLE")
  .option("user", "your_username")
  .option("password", "your_password")
  .load()
firstTable.createOrReplaceTempView("firstTable")
// Read the second table from DB2
val secondTable: DataFrame = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/DB2")
  .option("dbtable", "SECOND_TABLE")
  .option("user", "your_username")
  .option("password", "your_password")
  .load()
secondTable.createOrReplaceTempView("secondTable")
// Apply your filtering here; select unambiguous columns so the result maps onto MyResult
val result: DataFrame = spark.sql(
  "SELECT f.ID, s.NAME FROM firstTable AS f LEFT JOIN secondTable AS s ON f.ID = s.ID")
val finalData = result.as[MyResult]
  .map { record =>
    // Do your business logic
    businessLogic(record)
  }
// Write the result to the third table in DB3
finalData.write
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/DB3")
  .option("dbtable", "THIRD_TABLE")
  .option("user", "your_username")
  .option("password", "your_password")
  .save()
If your tables are big, you can push a query down and read only its result. Doing so lets you reduce the input size by filtering on dates etc.:
val myQuery = """
  (select * from table
   where -- do your filtering here
  ) foo
"""
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/DB")
  .option("user", "your_username")
  .option("password", "your_password")
  .option("dbtable", myQuery)
  .load()
Other than this, it is hard to do record-specific operations directly via Spark; you have to maintain your client connections etc. as custom logic. Spark is designed to read and write huge amounts of data, and it creates pipelines for this purpose, so simple per-record operations are an overhead for it. Always do your API calls (or single DB calls) inside your map functions, and if you use a cache layer there it can be a life saver in terms of performance. Always try to use a connection pool in your custom database calls; otherwise Spark will try to execute all of the mapping operations with different connections, which may put pressure on your database and cause connection failures.
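A rough sketch of that per-partition connection pattern, reusing the firstTable DataFrame from above; the lookup query, column names, and credentials are placeholders, and a real pool such as HikariCP could stand in for the single DriverManager connection:
import java.sql.DriverManager
val enriched = firstTable.mapPartitions { partition =>
  // One connection and one prepared statement per partition, not per record
  val conn = DriverManager.getConnection(
    "jdbc:postgresql://localhost/DB2", "your_username", "your_password")
  val stmt = conn.prepareStatement("SELECT NAME FROM SECOND_TABLE WHERE ID = ?")
  // Materialize the results before closing the connection
  val out = partition.map { row =>
    val id = row.getAs[String]("ID")
    stmt.setString(1, id)
    val rs = stmt.executeQuery()
    val name = if (rs.next()) rs.getString("NAME") else null
    rs.close()
    (id, name) // your business logic would go here instead
  }.toList
  stmt.close()
  conn.close()
  out.iterator
}.toDF("ID", "NAME")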

I can think of a lot of improvements, but in general all of them are going to depend on having the data pre-distributed in HDFS, HBase, a Hive database, MongoDB, ...
I mean: you are thinking "relational data" with a "distributed processing" mindset ... I thought we were already beyond that XD

Related

Predicate in Pyspark JDBC does not do a partitioned read

I am trying to read a Mysql table in PySpark using JDBC read. The tricky part here is that the table is considerably big, and therefore causes our Spark executor to crash when it does a non-partitioned vanilla read of the table.
Hence, the objective is basically that we want to do a partitioned read of the table. A couple of things that we have been trying:
We looked at the "numPartitions-partitionColumn-lowerBound-upperBound" combo. This does not work for us since the indexing key of the original table is a string, and this combo only works with integral types.
The other alternative suggested in the docs is the predicates option. This does not seem to work for us, in the sense that the number of partitions still seems to be 1, instead of the number of predicates that we are sending.
The code snippet that we are using is as follows -
input_df = self._Flow__spark.read \
    .format("jdbc") \
    .option("url", url) \
    .option("user", config.user) \
    .option("password", config.password) \
    .option("driver", "com.mysql.cj.jdbc.Driver") \
    .option("dbtable", "({}) as query ".format(get_route_surge_details_query(start_date, end_date))) \
    .option("predicates", ["recommendation_date = '2020-11-14'",
                           "recommendation_date = '2020-11-15'",
                           "recommendation_date = '2020-11-16'",
                           "recommendation_date = '2020-11-17'",
                           ]) \
    .load()
It seems to be doing a full table scan (non-partitioned), whilst completely ignoring the passed predicates. Would be great to get some help on this.
Try the following:
df = spark_session \
    .read \
    .jdbc(url=url,
          table="({}) as query ".format(get_route_surge_details_query(start_date, end_date)),
          predicates=["recommendation_date = '2020-11-14'",
                      "recommendation_date = '2020-11-15'",
                      "recommendation_date = '2020-11-16'",
                      "recommendation_date = '2020-11-17'"],
          properties={
              "user": config.user,
              "password": config.password,
              "driver": "com.mysql.cj.jdbc.Driver"
          })
Verify the partitions by
df.rdd.getNumPartitions()  # Should be 4
I found this after digging the docs at https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=jdbc#pyspark.sql.DataFrameReader.jdbc

Query table names

I have been given access to a database. I am querying the data from a spark cluster. How do I check all the databases/tables I have access to?
jdbcUrl = "jdbc:sqlserver://{0}:{1};database={2};user='{3}';password='{4}'".format(jdbcHostname, jdbcPort, jdbcDatabase, jdbcUsername, jdbcPassword)
connectionProperties = {
    "user": jdbcUsername,
    "password": jdbcPassword,
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}
df = spark.read.jdbc(url=jdbcUrl, properties=connectionProperties)
Access to the database has been authenticated.
In SQL Server itself:
select *
from sys.tables
Not sure if you use a synonym or not as the way into the sys schema.
val tables = spark.read.jdbc(jdbc_url, "sys.tables", connectionProperties)
tables.select(...
If you have a synonym, replace sys.tables with that. There are different ways of writing this: you can go down the tables approach or your own SQL query approach. The above is the tables approach. Under the SQL query approach, here is an example:
val dataframe_mysql = spark.read.jdbc(jdbcUrl, "(select k, v from sample order by k DESC) e", connectionProperties)
That was the Scala version, I just realized.
For pyspark specifically, see: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
It is just the same approach, but in this case for Postgres:
jdbcDF = spark.read \
.format("jdbc") \
.option("url", "jdbc:postgresql:dbserver") \
.option("dbtable", "schema.tablename") \
.option("user", "username") \
.option("password", "password") \
.load()
JDBC loading and saving can be achieved via either the load/save or jdbc methods, see the guide.
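Since the question also asks about databases, a similar read against SQL Server's sys.databases catalog view would list them; this is a sketch along the same lines as the sys.tables call above, not something from the original answers:
// sys.databases is a standard SQL Server catalog view; "name" is its database-name column
val databases = spark.read.jdbc(jdbc_url, "sys.databases", connectionProperties)
databases.select("name").show()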

Is there a way to define "partitionColumn" in option("partitionColumn", "colname") in Spark-JDBC if the column is of datatype String?

I am trying to load data from an RDBMS to a Hive table on HDFS. I am reading the RDBMS table in the below way:
val mydata = spark.read
  .format("jdbc")
  .option("url", connection)
  .option("dbtable", "select * from dev.userlocations")
  .option("user", usrname)
  .option("password", pwd)
  .option("numPartitions", 20)
  .load()
I see in the executor logs that the option("numPartitions", 20) is not applied and the entire data is dumped into a single executor.
Now there are options to provide the partition column, lower bound & upper bound as below:
val mydata = spark.read
  .format("jdbc")
  .option("url", connection)
  .option("dbtable", "select * from dev.userlocations")
  .option("user", usrname)
  .option("password", pwd)
  .option("partitionColumn", "columnName")
  .option("lowerBound", "x")
  .option("upperBound", "y")
  .option("numPartitions", 20)
  .load()
The above only works if the partition column is of a numeric datatype. The table I am reading is partitioned based on a column, location; it is about 5 GB overall and there are 20 different partitions in the table, one per location (I have 20 distinct locations in the table). Is there any way I can read the table in partitions based on that partition column, location?
Could anyone let me know if it can be implemented at all ?
You can use the predicates option for this. It takes an array of strings, and each item in the array is a condition for partitioning the source table. The total number of partitions is determined by those conditions.
val preds = Array[String]("location = 'LOC1'", "location = 'LOC2' OR location = 'LOC3'")
val df = spark.read.jdbc(
  url = databaseUrl,
  table = tableName,
  predicates = preds,
  connectionProperties = properties
)
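If you would rather not hard-code the locations, a possible sketch (assuming the same databaseUrl and properties as above) reads the distinct values first and builds one predicate per location, at the cost of one extra small query against the source:
// Read only the distinct partition keys, then build one predicate per location
val locations: Array[String] = spark.read
  .jdbc(databaseUrl, "(select distinct location from dev.userlocations) t", properties)
  .collect()
  .map(row => s"location = '${row.getString(0)}'")
val mydata = spark.read.jdbc(
  url = databaseUrl,
  table = "dev.userlocations",
  predicates = locations,
  connectionProperties = properties
)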

How to write Partitions to Postgres using foreachPartition (pySpark)

I am new to Spark and trying to write df partitions to Postgres.
here is my code:
# csv_new is a DF with nearly 40 million rows and 6 columns
def callback(iterator):
    print(iterator)  # the print gives me an itertools.chain object

csv_new.foreachPartition(callback)  # there are 19204 partitions
but when writing to the DB with the following code:
iterator.write.option("numPartitions", count).option("batchsize",
1000000).jdbc(url=url, table="table_name", mode=mode,
properties=properties)
it gives an error:
AttributeError: 'itertools.chain' object has no attribute 'write'
The mode is append and the properties are set.
Any leads on how to write the df partitions to the DB?
You don't need to do that.
The documentation states it along these lines and it occurs in parallel:
df.write.format("jdbc")
  .option("dbtable", "T1")
  .option("url", url1)
  .option("user", "User")
  .option("password", "Passwd")
  .option("numPartitions", "5") // to define parallelism
  .save()
There are some performance aspects to consider, but those can be googled.
Many thanks to @thebluephantom. Just a little add-on: in case the table already exists, the save mode also needs to be defined.
The following was my implementation, which worked:
mode = "append"
url = "jdbc:postgresql://DatabaseIp:port/DB Name"
properties = {"user": "username", "password": "password"}
# numPartitions: your partition count; batchsize: rows per insert batch (default is 1000)
df.write \
    .option("numPartitions", num_partitions) \
    .option("batchsize", batch_size) \
    .jdbc(url=url, table="tablename", mode=mode, properties=properties)

Spark thinks I'm reading DataFrame from a Parquet file

Spark 2.x here. My code:
val query = "SELECT * FROM some_big_table WHERE something > 1"
val df : DataFrame = spark.read
  .option("url",
    s"""jdbc:postgresql://${redshiftInfo.hostnameAndPort}/${redshiftInfo.database}?currentSchema=${redshiftInfo.schema}"""
  )
  .option("user", redshiftInfo.username)
  .option("password", redshiftInfo.password)
  .option("dbtable", query)
  .load()
Produces:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:183)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:183)
at scala.Option.getOrElse(Option.scala:121)
I'm not reading anything from a Parquet file, I'm reading from a Redshift (RDBMS) table. So why am I getting this error?
If you use the generic load function you should include the format as well:
// The query has to be a subquery with an alias
val query = "(SELECT * FROM some_big_table WHERE something > 1) as tmp"
...
  .format("jdbc")
  .option("dbtable", query)
  .load()
Otherwise Spark assumes that you are using the default format, which, in the absence of specific configuration, is Parquet.
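For reference, that fallback is controlled by the spark.sql.sources.default setting; a quick sanity check:
// Prints "parquet" unless spark.sql.sources.default has been overridden
println(spark.conf.get("spark.sql.sources.default"))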
Also nothing forces you to use dbtable.
spark.read.jdbc(
  s"jdbc:postgresql://${hostnameAndPort}/${database}?currentSchema=${schema}",
  query,
  props
)
variant is also valid.
And of course, with such a simple query, all of that is not needed:
spark.read.jdbc(
  s"jdbc:postgresql://${hostnameAndPort}/${database}?currentSchema=${schema}",
  "some_big_table",
  props
).where("something > 1")
will work the same way, and if you want to improve performance you should consider parallel queries (a sketch follows below); see for example:
How to improve performance for slow Spark jobs using DataFrame and JDBC connection?
What is the meaning of the partitionColumn, lowerBound, upperBound, numPartitions parameters?
Spark 2.1 hangs while reading a huge dataset
Partitioning in Spark while reading from RDBMS via JDBC
Or, even better, try the Redshift connector.
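For the parallel-query route, here is a sketch of a partitioned JDBC read; the partition column id and its bounds are made-up placeholders, since the original post does not name a numeric column:
val parallelDF = spark.read
  .format("jdbc")
  .option("url", s"jdbc:postgresql://${redshiftInfo.hostnameAndPort}/${redshiftInfo.database}?currentSchema=${redshiftInfo.schema}")
  .option("user", redshiftInfo.username)
  .option("password", redshiftInfo.password)
  .option("dbtable", "some_big_table")
  .option("partitionColumn", "id")  // numeric column to split the read on
  .option("lowerBound", "1")        // min of that column (an estimate is fine)
  .option("upperBound", "1000000")  // max of that column (an estimate is fine)
  .option("numPartitions", "10")    // number of concurrent JDBC queries
  .load()
  .where("something > 1")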
