Predicate in Pyspark JDBC does not do a partitioned read - apache-spark

I am trying to read a MySQL table in PySpark using a JDBC read. The tricky part is that the table is considerably big, and therefore causes our Spark executor to crash when it does a non-partitioned vanilla read of the table.
Hence, the objective is basically to do a partitioned read of the table. A couple of things that we have been trying:
We looked at the numPartitions / partitionColumn / lowerBound / upperBound combo. This does not work for us, since the indexing key of the original table is a string, and this mechanism only works with numeric, date, or timestamp columns.
The other alternative suggested in the docs is the predicates option. This does not seem to work for us, in the sense that the number of partitions still seems to be 1, instead of the number of predicates that we are passing.
The code snippet that we are using is as follows:
input_df = self._Flow__spark.read \
    .format("jdbc") \
    .option("url", url) \
    .option("user", config.user) \
    .option("password", config.password) \
    .option("driver", "com.mysql.cj.jdbc.Driver") \
    .option("dbtable", "({}) as query ".format(get_route_surge_details_query(start_date, end_date))) \
    .option("predicates", ["recommendation_date = '2020-11-14'",
                           "recommendation_date = '2020-11-15'",
                           "recommendation_date = '2020-11-16'",
                           "recommendation_date = '2020-11-17'",
                           ]) \
    .load()
It seems to be doing a full, non-partitioned table scan, completely ignoring the passed predicates. It would be great to get some help on this.

The predicates list is a parameter of the DataFrameReader.jdbc() method, not a data source option; when it is passed through .option(), the JDBC source simply ignores it. Try the following:
spark_session \
    .read \
    .jdbc(url=url,
          table="({}) as query ".format(get_route_surge_details_query(start_date, end_date)),
          predicates=["recommendation_date = '2020-11-14'",
                      "recommendation_date = '2020-11-15'",
                      "recommendation_date = '2020-11-16'",
                      "recommendation_date = '2020-11-17'"],
          properties={
              "user": config.user,
              "password": config.password,
              "driver": "com.mysql.cj.jdbc.Driver"
          })
Verify the number of partitions with:
df.rdd.getNumPartitions() # Should be 4
I found this after digging through the docs at https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=jdbc#pyspark.sql.DataFrameReader.jdbc
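As a side note, the predicates list itself can be generated rather than hard-coded. Below is a minimal sketch (my own helper, not from the question or answer), assuming one partition per day on recommendation_date:
from datetime import date, timedelta

def daily_predicates(start, end, column="recommendation_date"):
    # One predicate per day in [start, end]; each predicate becomes one JDBC partition.
    days = (end - start).days
    return ["{} = '{}'".format(column, start + timedelta(days=i)) for i in range(days + 1)]

predicates = daily_predicates(date(2020, 11, 14), date(2020, 11, 17))
# ["recommendation_date = '2020-11-14'", ..., "recommendation_date = '2020-11-17'"]
Each predicate maps to exactly one partition, so keep the list sized sensibly relative to the parallelism the MySQL instance can handle.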

Related

How to filter data using spark.read in place?

I'm trying to read data in Delta format from ADLS. I want to read some portion of that data using a filter in place. The same approach worked for me when reading the JDBC format:
query = f"""
select * from {table_name}
where
createdate < to_date('{createdate}','YYYY-MM-DD HH24:MI:SS') or
modifieddate < to_date('{modifieddate}','YYYY-MM-DD HH24:MI:SS')
"""
return spark.read \
    .format("jdbc") \
    .option("url", url) \
    .option("query", query) \
    .option("user", username) \
    .option("password", password) \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .load()
So I tried to read the Delta data in a similar way using a query, but it reads the whole table.
return spark.read \
    .format("delta") \
    .option("query", query) \
    .load(path)
How could I solve this issue without reading the full DataFrame and then filtering it?
Thanks in advance!
Spark uses a functionality called predicate pushdown to optimize queries.
In the first case, the filters can be passed on to the Oracle database.
Delta does not work that way. There can be optimisations through data skipping and Z-ordering, but since you are essentially querying Parquet files, you have to read all of them into memory and filter afterwards.
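For completeness, the Delta reader has no query option, so the option above is ignored. A minimal sketch of the usual pattern (assuming createdate and modifieddate are timestamp columns and the filter values are strings such as '2023-01-01 00:00:00') is to filter right after load and let data skipping prune what it can:
from pyspark.sql import functions as F

df = (
    spark.read.format("delta").load(path)
    .where(
        (F.col("createdate") < F.lit(createdate).cast("timestamp")) |
        (F.col("modifieddate") < F.lit(modifieddate).cast("timestamp"))
    )
)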

ClickHouse housepower driver with spark

I'm new to Stack Overflow and Spark, so please forgive my simplicity and mistakes!
I have a problem with ClickHouse and Spark (2.4.7); I work in a Jupyter notebook.
Basically, I want to insert a DataFrame with an Array column into a ClickHouse table with an Array(String) column. Using the Yandex driver this is impossible, because JDBC doesn't support Arrays, right? ;)
So I wanted to run the Spark session with the housepower jar, clickhouse-native-jdbc-shaded-2.6.4.jar, because I read that they added handling for Arrays; correct me if I'm wrong.
And I want to run a query against ClickHouse via JDBC.
spark = SparkSession \
    .builder \
    .enableHiveSupport() \
    .appName('custom-events-test') \
    .config("spark.jars", "drivers/clickhouse-native-jdbc-shaded-2.6.4.jar") \
    .getOrCreate()
My query:
query = """
select date as date,
partnerID as partnerID,
sessionID as sessionID,
toString(mapKeys(cast((JSONExtractKeysAndValues(ec.custom, 'String')), 'Map(String, String)'))) as keys,
toString(mapValues(cast((JSONExtractKeysAndValues(ec.custom, 'String')), 'Map(String, String)'))) as values
from audience.uber_all
array join eventContents as ec
PREWHERE date = ('2022-07-22')
WHERE partnerID = 'XXX'
and ec.custom != '{}'
order by date, partnerID
"""
and my code:
df_tab = spark.read \
    .format("jdbc") \
    .option("driver", "com.github.housepower.jdbc.ClickHouseDriver") \
    .option("url", f"jdbc:clickhouse://{ch_host}:9000/{ch_db}") \
    .option("query", query) \
    .option("user", ch_user) \
    .option("password", ch_pass) \
    .load()
But there I get an error:
[error screenshot: housepower_error]
BUT when I run the above query with the Yandex driver, ru.yandex.clickhouse.ClickHouseDriver,
everything works fine (even with the housepower jar).
This error also appears when I want to select a column like this:
JSONExtractKeysAndValues(ec.custom, 'String')
or
toString(JSONExtractKeysAndValues(ec.custom, 'String'))
What am I doing wrong?
Also, how do I insert a DF with an Array column into a ClickHouse table with an Array(String) column using Spark JDBC? I was looking everywhere but couldn't find a solution...
Thank you in advance!

Difference between using spark SQL and SQL

My main question is about performance. Looking at the code below:
query = """
SELECT Name, Id FROM Customers WHERE Id <> 1 ORDER BY Id
"""
df = spark.read.format(jdbc) \
.option("url", "connectionString") \
.option("user", user) \
.option("password", password) \
.option("numPartitions", 10) \
.option("partitionColumn", "Id") \
.option("lowerBound", lowerBound) \
.option("upperBound", upperBound) \
.option("dbtable", query) \
.load()
As far as I understand, this query will be sent to the DB, which processes it and returns the result to Spark.
Now consider the code below:
from pyspark.sql import functions as f

df = spark.read.jdbc(
        url=mssqlconnection,
        table="dbo.Customers",
        properties=mssql_prop
    ).select(
        f.col("Id"),
        f.col("Name")
    ).where("Id <> 1").orderBy(f.col("Id"))
I know that spark will load the entire table into memory and then execute the filters on the dataframe.
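(As an aside, one way to check that assumption is to inspect the physical plan; for JDBC sources, simple comparison filters usually appear as pushed filters on the scan node, meaning the database executes them rather than Spark:)
df.explain()
# Look for something like "PushedFilters: [IsNotNull(Id), Not(EqualTo(Id,1))]"
# in the JDBC scan node of the printed plan.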
Finally, the last code snippet:
df = spark.read.jdbc(
    url=mssqlconnection,
    table="dbo.Customers",
    properties=mssql_prop
)

final_df = spark_session.sql("""
    SELECT Name, Id FROM Customers WHERE Id <> 1 ORDER BY Id
""")
I have 3 questions:
Among the 3 snippets, which one is the most correct? I always use the second approach; is this correct?
What is the difference between using spark.sql and using the commands directly on the DataFrame, as in the second code snippet?
What is the ideal number of rows for me to start using Spark? Is it worth using for queries that return less than 1 million rows?

Spark JDBC read API: Determining the number of partitions dynamically for a column of type datetime

I'm trying to read a table from an RDS MySQL instance using PySpark. It's a huge table, hence I want to parallelize the read operation by making use of the partitioning concept. The table doesn't have a numeric column to find the number of partitions. Instead, it has a timestamp column (i.e. datetime type).
I found the lower and upper bounds by retrieving the min and max values of the timestamp column. However, I'm not sure if there's a standard formula to find the number of partitions dynamically. Here is what I'm doing currently (hardcoding the value of the numPartitions parameter):
select_sql = "SELECT {} FROM {}".format(columns, table)
partition_info = {'partition_column': 'col1',
'lower_bound': '<result of min(col1)>',
'upper_bound': '<result of max(col1)>',
'num_partitions': '10'}
read_df = spark.read.format("jdbc") \
.option("driver", driver) \
.option("url", url) \
.option("dbtable", select_sql) \
.option("user", user) \
.option("password", password) \
.option("useSSL", False) \
.option("partitionColumn", partition_info['partition_column']) \
.option("lowerBound", partition_info['lower_bound'])) \
.option("upperBound", partition_info['upper_bound'])) \
.option("numPartitions", partition_info['num_partitions']) \
.load()
Please suggest a solution or an approach that works. Thanks.
How to set numPartitions depends on your cluster's definition. There is no right, wrong, or automatic setting here. As long as you understand the logic behind partitionColumn, lowerBound, upperBound, and numPartitions, and do a fair amount of benchmarking, you can decide on the right number.
Pyspark - df.cache().count() taking forever to run
What is the meaning of partitionColumn, lowerBound, upperBound, numPartitions parameters?
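If you want a starting point, one possible heuristic (my own sketch, not a Spark API) is to aim for roughly one partition per day of data, capped by the parallelism the cluster and the MySQL instance can actually handle; it assumes the bounds are strings formatted as 'YYYY-MM-DD HH:MM:SS':
from datetime import datetime
from math import ceil

def pick_num_partitions(lower_bound, upper_bound, max_partitions=64):
    # Roughly one partition per day in the timestamp range, capped at max_partitions.
    fmt = "%Y-%m-%d %H:%M:%S"
    lo = datetime.strptime(lower_bound, fmt)
    hi = datetime.strptime(upper_bound, fmt)
    days = max(1, ceil((hi - lo).total_seconds() / 86400.0))
    return min(days, max_partitions)

num_partitions = pick_num_partitions('2020-01-01 00:00:00', '2020-03-01 00:00:00')
# 60 partitions for this two-month range (capped at max_partitions)
The right cap depends on the executor-core count and on how many concurrent connections the database can tolerate, so benchmark before settling on a value.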

How to write Partitions to Postgres using foreachPartition (pySpark)

I am new to Spark and trying to write df partitions to Postgres.
Here is my code:
# csv_new is a DF with nearly 40 million rows and 6 columns
csv_new.foreachPartition(callback)  # there are 19204 partitions

def callback(iterator):
    print(iterator)
    # the print gives me an itertools.chain object
But when writing to the DB with the following code:
iterator.write.option("numPartitions", count).option("batchsize",
1000000).jdbc(url=url, table="table_name", mode=mode,
properties=properties)
it gives this error (mode is append and the properties are set):
AttributeError: 'itertools.chain' object has no attribute 'write'
Any leads on how to write the df partitions to the DB?
You don't need to do that.
The documentation shows it along these lines, and the write already occurs in parallel:
df.write.format("jdbc")
.option("dbtable", "T1")
.option("url", url1)
.option("user", "User")
.option("password", "Passwd")
.option("numPartitions", "5") // to define parallelism
.save()
There are some performance aspects to consider, but those can be googled.
Many thanks to #thebluephantom. Just a little add-on: in case the table already exists, the save mode also needs to be defined.
The following was my implementation, which worked:
mode = "Append"
url = "jdbc:postgresql://DatabaseIp:port/DB Name"
properties = {"user": "username", "password": "password"}
df.write
.option("numPartitions",partitions here)
.option("batchsize",your batch size default is 1000)
.jdbc(url=url, table="tablename", mode=mode, properties=properties)
