How to write Partitions to Postgres using foreachPartition (pySpark) - apache-spark

I am new to Spark and am trying to write df partitions to Postgres.
Here is my code:
# csv_new is a DF with nearly 40 million rows and 6 columns
def callback(iterator):
    print(iterator)  # the print gives me an itertools.chain object

csv_new.foreachPartition(callback)  # there are 19204 partitions
but when writing to the DB with the following code:
iterator.write.option("numPartitions", count) \
    .option("batchsize", 1000000) \
    .jdbc(url=url, table="table_name", mode=mode, properties=properties)
it gives the error:
AttributeError: 'itertools.chain' object has no attribute 'write'
The mode is append and the properties are set. Any leads on how to write the df partitions to the DB?

You don't need to do that.
The documentation describes it along these lines, and the write occurs in parallel:
df.write.format("jdbc")
  .option("dbtable", "T1")
  .option("url", url1)
  .option("user", "User")
  .option("password", "Passwd")
  .option("numPartitions", "5") // to define parallelism
  .save()
There are some performance aspects to consider, but those can be googled.

Many thanks to @thebluephantom. Just a little add-on: in case the table already exists, the save mode also needs to be defined.
The following was my implementation, which worked:
mode = "append"
url = "jdbc:postgresql://DatabaseIp:port/DBName"
properties = {"user": "username", "password": "password"}

df.write \
    .option("numPartitions", num_partitions) \
    .option("batchsize", batch_size) \
    .jdbc(url=url, table="tablename", mode=mode, properties=properties)
# num_partitions and batch_size are placeholders; batchsize defaults to 1000
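A side note (a sketch, not part of the original answer): with roughly 19,204 input partitions, it can help to cap the number of concurrent Postgres connections before the write, either by repartitioning or via the numPartitions write option (which coalesces the DataFrame down to that count). The partition count and batch size below are purely illustrative.

# Sketch: limit concurrent JDBC connections for a heavily partitioned DataFrame
csv_new.repartition(16).write \
    .option("batchsize", 100000) \
    .jdbc(url=url, table="tablename", mode=mode, properties=properties)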

Related

Spark: connect two database tables to produce a third dataset

DataFrameLoadedFromLeftDatabase = data loaded using DataFrameReader from the first database, say LeftDB.
I need to:
iterate through each row in this dataframe,
connect to a second database, say RightDB,
find some matching record from RightDB,
and do some business logic.
This is an iterative operation, so it is not simply doable with a JOIN between LeftDB and RightDB. The goal is to find some new fields, create a new DataFrame targetDF, and write it into a third database, say ThirdDB, using DataFrameWriter.
I know that I can use
val targetDF = DataFrameLoadedFromLeftDatabase.mapPartitions(
  partition => {
    val rightDBconnection = new DbConnection // establish a connection to RightDB
    val result = partition.map(record => {
      readMatchingFromRightDBandDoBusinessLogicTransformationAndReturnAList(record, rightDBconnection)
    }).toList
    rightDBconnection.close()
    result.iterator
  }
).toDF()
targetDF.write
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "table3")
  .option("user", "username")
  .option("password", "password")
  .save()
I am wondering whether Apache Spark is suitable for this type of chatty data processing application, and whether iterating through each record in RightDB will be too chatty in this approach.
I am looking forward to some advice to improve this design to make use of Spark's capabilities. I also want to make sure the processing does not cause too many shuffle operations, for performance reasons.
Ref: Related SO post
In this kind of situation we always prefer spark.sql. Basically, define two different DataFrames, join them based on a query, and then apply your business logic afterwards.
For example:
import org.apache.spark.sql.{DataFrame, SparkSession}

// Add your columns here
case class MyResult(ID: String, NAME: String)

// Create a SparkSession
val spark = SparkSession.builder()
  .appName("Join Tables and Add Prefix to ID Column")
  .config("spark.master", "local[*]")
  .getOrCreate()

import spark.implicits._ // needed for .as[MyResult] and .map below

// Read the first table from DB1
val firstTable: DataFrame = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/DB1")
  .option("dbtable", "FIRST_TABLE")
  .option("user", "your_username")
  .option("password", "your_password")
  .load()
firstTable.createOrReplaceTempView("firstTable")

// Read the second table from DB2
val secondTable: DataFrame = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/DB2")
  .option("dbtable", "SECOND_TABLE")
  .option("user", "your_username")
  .option("password", "your_password")
  .load()
secondTable.createOrReplaceTempView("secondTable")

// Apply your filtering here
val result: DataFrame = spark.sql("SELECT f.*, s.* FROM firstTable as f left join secondTable as s on f.ID = s.ID")

val finalData = result.as[MyResult]
  .map { record =>
    // Do your business logic
    businessLogic(record)
  }

// Write the result to the third table in DB3
finalData.write
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/DB3")
  .option("dbtable", "THIRD_TABLE")
  .option("user", "your_username")
  .option("password", "your_password")
  .save()
If your tables are big, you can execute a query and read its results directly. If you do this, you can reduce your input sizes by filtering on dates, etc.:
val myQuery = """
  (select * from table
   where ... -- do your filtering here
  ) foo
"""

val df = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://localhost/DB")
  .option("user", "your_username")
  .option("password", "your_password")
  .option("dbtable", myQuery)
  .load()
Other than this, it is hard to do record-specific operations directly via Spark. You have to maintain your client connections etc. as custom logic. Spark is designed to read/write huge amounts of data; it creates pipelines for this purpose, and simple per-record operations will be an overhead for it. Always do your API calls (or single DB calls) in your map functions. If you use a cache layer in there, it can be life-saving in terms of performance. Always try to use a connection pool in your custom database calls; otherwise Spark will try to execute all of the mapping operations with different connections, which may put pressure on your database and cause connection failures. A sketch of this is shown below.
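As a rough illustration of the connection-per-partition idea (a PySpark sketch under assumptions, not part of the original answer): left_df stands for the DataFrame loaded from LeftDB, and the psycopg2 usage, the right_table name, and the column names are hypothetical.

# Sketch: one RightDB connection per partition instead of one per record
import psycopg2

def enrich_partition(rows):
    # One connection per partition; a pooled connection would be even better
    conn = psycopg2.connect(host="rightdb-host", dbname="RightDB",
                            user="username", password="password")
    cur = conn.cursor()
    try:
        for row in rows:
            cur.execute("SELECT name FROM right_table WHERE id = %s", (row["ID"],))
            match = cur.fetchone()
            # business logic would go here
            yield (row["ID"], match[0] if match else None)
    finally:
        cur.close()
        conn.close()

targetDF = left_df.rdd.mapPartitions(enrich_partition).toDF(["ID", "NAME"])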
I can think of a lot of improvements, but in general all of them are going to depend on having the data pre-distributed in HDFS, HBase, a Hive database, MongoDB, ...
I mean: you are thinking "relational data with a distributed processing mindset"... I thought we were already beyond that XD

ClickHouse housepower driver with Spark

I'm new to Stack and Spark, so please forgive my simplicity and mistakes!
I have a problem with ClickHouse and Spark (2.4.7); I work in a Jupyter notebook.
Basically, I want to insert a dataframe with an Array column into a ClickHouse table with an Array(String) column. Using the yandex driver this is impossible, because JDBC doesn't support Arrays, right? ;)
So I wanted to run the Spark session with the housepower jar clickhouse-native-jdbc-shaded-2.6.4.jar, because I read that they added handling of Arrays - correct me if I'm wrong.
And I want to get a query from ClickHouse via JDBC.
spark = SparkSession \
    .builder \
    .enableHiveSupport() \
    .appName('custom-events-test') \
    .config("spark.jars", "drivers/clickhouse-native-jdbc-shaded-2.6.4.jar") \
    .getOrCreate()
My query:
query = """
select date as date,
partnerID as partnerID,
sessionID as sessionID,
toString(mapKeys(cast((JSONExtractKeysAndValues(ec.custom, 'String')), 'Map(String, String)'))) as keys,
toString(mapValues(cast((JSONExtractKeysAndValues(ec.custom, 'String')), 'Map(String, String)'))) as values
from audience.uber_all
array join eventContents as ec
PREWHERE date = ('2022-07-22')
WHERE partnerID = 'XXX'
and ec.custom != '{}'
order by date, partnerID
"""
and my code:
df_tab = spark.read \
    .format("jdbc") \
    .option("driver", "com.github.housepower.jdbc.ClickHouseDriver") \
    .option("url", f"jdbc:clickhouse://{ch_host}:9000/{ch_db}") \
    .option("query", query) \
    .option("user", ch_user) \
    .option("password", ch_pass) \
    .load()
But there I get an error (see the housepower_error screenshot).
BUT when I run the above query with the yandex driver ru.yandex.clickhouse.ClickHouseDriver, everything works fine (even with the housepower jar).
This error also appears when I want to import a column like this:
JSONExtractKeysAndValues(ec.custom, 'String')
or
toString(JSONExtractKeysAndValues(ec.custom, 'String'))
What am I doing wrong?
And can you tell me how to insert a DF with an Array column using Spark JDBC into a ClickHouse table that also has an Array(String) column? I was looking everywhere but couldn't find a solution...
Thank you in advance!

Predicate in PySpark JDBC does not do a partitioned read

I am trying to read a MySQL table in PySpark using a JDBC read. The tricky part here is that the table is considerably big, and therefore causes our Spark executor to crash when it does a non-partitioned vanilla read of the table.
Hence, the objective is basically to do a partitioned read of the table. A couple of things that we have been trying:
We looked at the "numPartitions-partitionColumn-lowerBound-upperBound" combo. This does not work for us, since the indexing key of the original table is a string, and this only works with integral types.
The other alternative suggested in the docs is the predicates option. This does not seem to work for us, in the sense that the number of partitions still seems to be 1, instead of the number of predicates that we are sending.
The code snippet that we are using is as follows -
input_df = self._Flow__spark.read \
    .format("jdbc") \
    .option("url", url) \
    .option("user", config.user) \
    .option("password", config.password) \
    .option("driver", "com.mysql.cj.jdbc.Driver") \
    .option("dbtable", "({}) as query ".format(get_route_surge_details_query(start_date, end_date))) \
    .option("predicates", ["recommendation_date = '2020-11-14'",
                           "recommendation_date = '2020-11-15'",
                           "recommendation_date = '2020-11-16'",
                           "recommendation_date = '2020-11-17'",
                           ]) \
    .load()
It seems to be doing a full table scan (non-partitioned), while completely ignoring the passed predicates. It would be great to get some help on this.
Try the following:
df = spark_session \
    .read \
    .jdbc(url=url,
          table="({}) as query ".format(get_route_surge_details_query(start_date, end_date)),
          predicates=["recommendation_date = '2020-11-14'",
                      "recommendation_date = '2020-11-15'",
                      "recommendation_date = '2020-11-16'",
                      "recommendation_date = '2020-11-17'"],
          properties={
              "user": config.user,
              "password": config.password,
              "driver": "com.mysql.cj.jdbc.Driver"
          })
Verify the partitions with:
df.rdd.getNumPartitions() # Should be 4
I found this after digging through the docs at https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=jdbc#pyspark.sql.DataFrameReader.jdbc

Defining a schema for a large DataFrame in Spark

I am trying to load a large dataset from a txt file (1000 columns, > 1M rows) into the Spark environment. My dataset has no headers, and as a consequence I have run into this error:
TypeError: Can not infer schema for type:
The challenge: looking at the examples given in the documentation, both examples show how to infer the schema using reflection and programmatically, and demonstrate the idea with a few (two) columns that can be easily typed. No special column names are needed, since the data represents a matrix.
How would I go about inferring the schema for a larger set of columns, hopefully without typing them out? Or can the data be loaded in an alternative way that does not require these definitions?
PS: Spark newbie, using PySpark.
EDITED (added information)
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

dataset = "./data.txt"

conf = (SparkConf()
        .setAppName("myApp")
        .setMaster("host")
        .set("spark.cores.max", "15")
        .set("spark.rdd.compress", "true")
        .set("spark.broadcast.compress", "true"))
sc = SparkContext(conf=conf)

spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config(conf=SparkConf()) \
    .getOrCreate()

data = sc.textFile(dataset)
df = spark.createDataFrame(data)
data.txt contains 1M rows and 1000 columns, similar to what would be obtained, for example, by the following code:
np.random.randint(20, size=(1000000, 1000))
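Since the question is about avoiding typing out 1000 column definitions, here is a minimal sketch (not from the original post) that builds the schema programmatically. It assumes all values are integers, that data.txt is space-delimited, and the column names c0..c999 are arbitrary.

from pyspark.sql.types import StructType, StructField, IntegerType

# Build 1000 integer columns without typing them out
schema = StructType([StructField("c{}".format(i), IntegerType(), True) for i in range(1000)])

df = spark.read \
    .schema(schema) \
    .option("sep", " ") \
    .csv(dataset)  # assumption: space-delimited values in data.txt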

Spark thinks I'm reading DataFrame from a Parquet file

Spark 2.x here. My code:
val query = "SELECT * FROM some_big_table WHERE something > 1"

val df: DataFrame = spark.read
  .option("url",
    s"""jdbc:postgresql://${redshiftInfo.hostnameAndPort}/${redshiftInfo.database}?currentSchema=${redshiftInfo.schema}"""
  )
  .option("user", redshiftInfo.username)
  .option("password", redshiftInfo.password)
  .option("dbtable", query)
  .load()
Produces:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:183)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:183)
at scala.Option.getOrElse(Option.scala:121)
I'm not reading anything from a Parquet file, I'm reading from a Redshift (RDBMS) table. So why am I getting this error?
If you use the generic load function you should include the format as well:
// The query has to be a subquery
val query = "(SELECT * FROM some_big_table WHERE something > 1) as tmp"
...
  .format("jdbc")
  .option("dbtable", query)
  .load()
Otherwise Spark assumes that you use the default format, which, in the absence of any specific configuration, is Parquet.
Also, nothing forces you to use dbtable.
spark.read.jdbc(
  s"jdbc:postgresql://${hostnameAndPort}/${database}?currentSchema=${schema}",
  query,
  props
)
variant is also valid.
And of course, with such a simple query, all of that is not needed:
spark.read.jdbc(
  s"jdbc:postgresql://${hostnameAndPort}/${database}?currentSchema=${schema}",
  "some_big_table",
  props
).where("something > 1")
will work the same way, and if you want to improve performance you should consider parallel queries:
How to improve performance for slow Spark jobs using DataFrame and JDBC connection?
Whats meaning of partitionColumn, lowerBound, upperBound, numPartitions parameters?
Spark 2.1 Hangs while reading a huge datasets
Partitioning in spark while reading from RDBMS via JDBC
Or, even better, try the Redshift connector.
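For reference, here is a minimal PySpark sketch of the parallel-read options mentioned above (not from the original answer); the partition column "id", its bounds, and the connection details are assumptions:

# Sketch: partitioned JDBC read; "id", the bounds, and the URL are illustrative
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://host:5439/database?currentSchema=schema") \
    .option("dbtable", "some_big_table") \
    .option("user", "username") \
    .option("password", "password") \
    .option("partitionColumn", "id") \
    .option("lowerBound", "1") \
    .option("upperBound", "1000000") \
    .option("numPartitions", "8") \
    .load() \
    .where("something > 1")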
