JDBC read with partitioning - first/last partition skewed - apache-spark

I am reading a dataframe from a JDBC source using partitioning as described here, with numPartitions, partitionColumn, upperBound, lowerBound. I've used this quite often, but this time I noticed something odd: with numPartitions = 32 and 124 distinct partition column values, the data was split into 30 small chunks and 2 large ones.
Task 1 - partitions 1 .. 17 (17 values!)
Task 2 - partitions 18 .. 20 (3 values)
Task 3 - partitions 21 .. 23 (3 values)
Task 4 - partitions 24 .. 26 (3 values)
...
Task 30 - partitions 102 .. 104 (3 values)
Task 31 - partitions 105 .. 107 (3 values)
Task 32 - partitions 108 .. 124 (17 values!)
I'm wondering whether this actually worked as expected, and what I can do to make it split into even chunks, apart from experimenting with different values of numPartitions (note that the number of values can vary and I'm not always able to predict it).

I looked through the source code of JDBCRelation.scala and found that this is exactly how it's implemented. It first calculates the stride as (upperBound - lowerBound) / numPartitions, which in my case is 124 / 32 = 3, and the remaining values are then allocated evenly to the first and last partitions.
I was a bit unlucky with the number of values, because if I had 4 more, then 128 / 32 = 4 and it would align nicely into 32 partitions of 4 values each.
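To make that concrete, a small model of the split described above (illustrative arithmetic only, not the actual JDBCRelation code):

// Model of the split described above: 124 distinct values, 32 partitions.
val totalValues = 124
val numPartitions = 32

val stride = totalValues / numPartitions             // 3
val leftover = totalValues - stride * numPartitions  // 28 values not covered by the stride
val extraPerEnd = leftover / 2                       // 14 extra values on each end

// The middle partitions get `stride` = 3 values each; the first and last get
// stride + extraPerEnd = 17 values each, matching the 17 / 3 / ... / 3 / 17
// split observed in the question.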
I ended up pre-querying the table for the exact range and then providing the predicates manually:
import spark.implicits._   // needed for .as[(Int, Int)]

val partRangeSql = "SELECT min(x), max(x) FROM table"
val (partMin, partMax) =
  spark.read.jdbc(jdbcUrl, s"($partRangeSql) _", props).as[(Int, Int)].head
val predicates = (partMin to partMax).map(p => s"x = $p").toArray
spark.read.jdbc(jdbcUrl, "table", predicates, props)
That makes it 124 partitions (one per value), so you need to be careful not to overload the database server (but I'm limiting the number of executors so that no more than 32 concurrent sessions run at once).
I guess adjusting lowerBound/upperBound so that upperBound - lowerBound is a multiple of numPartitions would also do the trick.
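For example, a sketch of that last idea, reusing partMin/partMax from the pre-query above (the column x, the table name, and the padding arithmetic are illustrative):

// Make (upperBound - lowerBound) an exact multiple of numPartitions, as suggested above.
val numPartitions = 32
val span = partMax - partMin                                                      // e.g. 123 for values 1..124
val paddedSpan = math.ceil(span.toDouble / numPartitions).toInt * numPartitions   // e.g. 128
val df = spark.read.jdbc(
  jdbcUrl, "table", "x",
  partMin.toLong, (partMin + paddedSpan).toLong,
  numPartitions, props)

Values between partMax and the padded upper bound simply match no rows, so the partitions stay close to evenly sized.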

Related

How to estimate a Spark DataFrame's row count after joining two or more tables?

I'm developing a feature that takes dynamic SQL as input and then uses that input to submit a Spark job. But the inputs are unpredictable; some inputs may exceed the limits, which is a danger for me. I want to check the SQL's cost before submitting the job. Is there a way I can estimate the cost accurately?
My Spark conf is:
Spark version: 3.3.1
conf:
spark.sql.cbo.enabled: true
spark.sql.statistics.histogram.enabled: true
Example:
I have a dataFrame df1 like this:
n x y z
'A' 1 2 3
'A' 4 5 6
'A' 7 8 9
'A' 10 11 12
'A' 13 14 15
'A' 16 17 18
'A' 19 20 21
'A' 22 23 24
'A' 25 26 27
'A' 28 29 30
row count of df1.join(df1, "n", "left") should be 100
row count of df1.join(df1, "n", "left").join(df1, "n", "left") should be 1000
but the result of dataFrame.queryExecution.optimizedPlan.stats is always 1000 for the examples above.
I've tried a few approaches:
dataFrame.queryExecution.optimizedPlan.stats, but the estimated row count is much bigger than the actual row count, especially when joins are involved.
dataFrame.rdd.countApprox. The problem is that it needs a lot of time to produce a usable result when the dataFrame is big.
I also tried org.apache.spark.sql.execution.command.CommandUtils#calculateMultipleLocationSizesInParallel; it's better than dataFrame.rdd.countApprox, but in some extreme scenarios it still costs tens of minutes.
First let's calculate the number of rows in each table
df1_count = df1.count()
df2_count = df2.count()
df3_count = df3.count()
Then add up the row counts to get a (very rough) estimate of the total number of rows involved in the join
estimated_row_count = df1_count + df2_count + df3_count
Finally, when you actually perform the join, you can get the exact row count:
joined_df = df1.join(df2, on=..., how=...).join(df3, on=..., how=...)
exact_row_count = joined_df.count()
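Given that the question already enables spark.sql.cbo.enabled, one more hedged idea, assuming the inputs are catalog tables (the table name t1 and column n below are illustrative): compute statistics first so that optimizedPlan.stats has real numbers to work with.

// Sketch only: assumes "t1" is a catalog table and "n" is the join key.
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS")                // table-level stats (row count)
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS n")  // column-level stats for the CBO

val joined = spark.table("t1").join(spark.table("t1"), Seq("n"), "left")

// rowCount is an Option[BigInt]; it tends to be populated (and more useful)
// only when statistics exist and spark.sql.cbo.enabled is true.
println(joined.queryExecution.optimizedPlan.stats.rowCount)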

How to handle data skew in Spark window functions?

I have a data set that I'm trying to process in PySpark. The data (on disk as Parquet) contains user IDs, session IDs, and metadata related to each session. I'm adding a number of columns to my dataframe that are the result of aggregating over a window. The issue I'm running into is that all but 4-6 executors will complete quickly and the rest run forever without completing. My code looks like this:
import pyspark.sql.functions as f
from pyspark.sql.window import Window

empty_col_a_cond = ((f.col("col_A").isNull()) |
                    (f.col("col_A") == ""))

session_window = Window.partitionBy("user_id", "session_id") \
    .orderBy(f.col("step_id").asc())

output_df = (
    input_df
    .withColumn("col_A_val", f
                .when(empty_col_a_cond, f.lit("NA"))
                .otherwise(f.col("col_A")))
    # ... 10 more added columns replacing nulls/empty strings
    .repartition("user_id", "session_id")
    .withColumn("s_user_id", f.first("user_id", True).over(session_window))
    .withColumn("s_col_B", f.collect_list("col_B").over(session_window))
    .withColumn("s_col_C", f.min("col_C").over(session_window))
    .withColumn("s_col_D", f.max("col_D").over(session_window))
    # ... 16 more added columns aggregating over session_window
    .where(f.col("session_flag") == 1)
    .where(f.array_contains(f.col("s_col_B"), "some_val"))
)
In my logs, I see this over and over:
INFO ExternalAppendOnlyUnsafeRowArray: Reached spill threshold of 4096 rows, switching to org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
INFO UnsafeExternalSorter: Thread 92 spilling sort data of 9.2 GB to disk (2 times so far)
INFO UnsafeExternalSorter: Thread 91 spilling sort data of 19.3 GB to disk (0 time so far)
This suggests that Spark can't hold all the windowed data in memory. I tried increasing the internal settings spark.sql.windowExec.buffer.in.memory.threshold and spark.sql.windowExec.buffer.spill.threshold, which helped a little, but there are still executors that never complete.
I believe this is all caused by some skew in the data. Grouping by both user_id and session_id, there are 5 entries with a count >= 10,000, 100 records with a count between 1,000 and 10,000, and 150,000 entries with a count less than 1,000 (usually count = 1).
input_df \
    .groupBy(f.col("user_id"), f.col("session_id")) \
    .count() \
    .filter("count < 1000") \
    .count()
# >= 10k, 6
# < 10k and >= 1k, 108
# < 1k, 150k
This is the resulting job DAG:
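One common mitigation for this kind of skew, sketched here in Scala with the question's column names (the equivalent functions exist in pyspark.sql.functions), is to compute the session-level aggregates with a groupBy, which can do partial map-side aggregation, and join them back onto the rows instead of sorting every session inside a window:

import org.apache.spark.sql.functions._

// Rough sketch: aggregate once per (user_id, session_id) with groupBy, then join back.
// Note that collect_list here does not guarantee step_id order, which is fine for an
// array_contains check but not for order-sensitive logic. inputDf stands in for input_df.
val sessionAggs = inputDf
  .groupBy("user_id", "session_id")
  .agg(
    collect_list("col_B").as("s_col_B"),
    min("col_C").as("s_col_C"),
    max("col_D").as("s_col_D")
    // ... plus the remaining session-level aggregates
  )

val outputDf = inputDf
  .join(sessionAggs, Seq("user_id", "session_id"))
  .where(col("session_flag") === 1)
  .where(array_contains(col("s_col_B"), "some_val"))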

Identifying decreases in values in Spark (outliers)

I have a large data set with millions of records which is something like
Movie Likes Comments Shares Views
A 100 10 20 30
A 102 11 22 35
A 104 12 25 45
A *103* 13 *24* 50
B 200 10 20 30
B 205 *9* 21 35
B *203* 12 29 42
B 210 13 *23* *39*
Likes, comments, etc. are rolling totals and are supposed to increase. If there is a drop in any of them for a movie, then it's bad data and needs to be identified.
My initial thought was to group by movie and then sort within each group. I am using dataframes in Spark 1.6 for processing, and this does not seem achievable, as there is no sorting within grouped data in a dataframe.
Building something for outlier detection could be another approach, but because of time constraints I have not explored it yet.
Is there any way I can achieve this?
Thanks!
You can use the lag window function to bring the previous values into scope:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag
// (plus spark.implicits._ / sqlContext.implicits._ for the Symbol column syntax)

val windowSpec = Window.partitionBy('Movie).orderBy('maybesometemporalfield)

dataset
  .withColumn("lag_likes", lag('Likes, 1) over windowSpec)
  .withColumn("lag_comments", lag('Comments, 1) over windowSpec)
  .show
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-functions.html#lag
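On top of those lagged columns, the comparison step could look something like this (a sketch; the bad_likes flag and checking only the Likes column are illustrative):

import org.apache.spark.sql.functions.{col, lag}

// Building on the snippet above (same dataset and windowSpec).
// lag() yields null on the first row of each movie, so that row is simply not flagged.
val flagged = dataset
  .withColumn("lag_likes", lag('Likes, 1) over windowSpec)
  .withColumn("bad_likes", col("Likes") < col("lag_likes"))

flagged.filter(col("bad_likes")).show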
Another approach would be to assign a row number (if there isn't one already), lag that column, then join each row to its previous row to allow you to do the comparison.
HTH

More jobs than expected running in Apache Spark

I am trying to learn apache-spark. This is the code I am trying to run, using the pyspark API.
data = xrange(1, 10000)
xrangeRDD = sc.parallelize(data, 8)

def ten(value):
    """Return whether value is below ten.

    Args:
        value (int): A number.

    Returns:
        bool: Whether `value` is less than ten.
    """
    if (value < 10):
        return True
    else:
        return False

filtered = xrangeRDD.filter(ten)

print filtered.collect()
print filtered.take(8)
print filtered.collect() gives [1, 2, 3, 4, 5, 6, 7, 8, 9] as output.
As per my understanding, filtered.take(n) will take n elements from the RDD and print them.
I am trying two cases:
1) Giving a value of n less than or equal to the number of elements in the RDD
2) Giving a value of n greater than the number of elements in the RDD
I use the pyspark application UI to see the number of jobs that run in each case. In the first case only one job runs, but in the second, five jobs run.
I am not able to understand why this is happening. Thanks in advance.
RDD.take tries to evaluate as few partitions as possible.
If you take(9), it will fetch partition 0 (job 1), find 9 items, and happily terminate.
If you take(10) it will fetch partition 0 (job 1) and find 9 items. It needs one more. Since partition 0 had 9, it thinks partition 1 will probably have at least one more (job 2). But it doesn't! In 2 partitions it has found 9 items. So 4.5 items per partition so far. The formula divides it by 1.5 for pessimism and decides 10 / (4.5 / 1.5) = 3 partitions will do it. So it fetches partition 2 (job 3). Still nothing. So 3 items per partition so far, divided by 1.5 means we need 10 / (3 / 1.5) = 5 partitions. It fetches partitions 3 and 4 (job 4). Nothing. We have 1.8 items per partition, 10 / (1.8 / 1.5) = 8. It fetches the last 3 partitions (job 5) and that's it.
The code for this algorithm is in RDD.scala. As you can see, it's nothing but heuristics. It usually saves some work, but it can lead to unnecessarily many jobs in degenerate cases.
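To make that arithmetic concrete, here is a toy model of the estimate (illustrative only; the real code in RDD.scala differs in detail):

// Toy model of the partition-count estimate described above.
// 1.5 is the pessimism factor mentioned in the answer.
def estimatePartitionsNeeded(wanted: Int, itemsFound: Int, partitionsScanned: Int): Int = {
  val itemsPerPartition = itemsFound.toDouble / partitionsScanned
  (wanted / (itemsPerPartition / 1.5)).toInt
}

estimatePartitionsNeeded(10, 9, 2) // 3 -> triggers job 3
estimatePartitionsNeeded(10, 9, 3) // 5 -> triggers job 4 (partitions 3 and 4)
estimatePartitionsNeeded(10, 9, 5) // 8 -> triggers job 5 (the last 3 partitions)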

Cassandra and wide row disk size estimate?

I am trying to estimate the amount of space required for each column in a Cassandra wide row, but the numbers that I get are wildly conflicting.
I have a pretty standard wide row table to store some time series data:
CREATE TABLE raw_data (
    id uuid,
    time timestamp,
    data list<float>,
    PRIMARY KEY (id, time)
);
In my case, I store 20 floats in the data list.
Datastax provides some formulas for estimating user data size.
regular_total_column_size = column_name_size + column_value_size + 15
row_size = key_size + 23
primary_key_index = number_of_rows * ( 32 + average_key_size )
For this table, we get the following values:
regular_total_column_size = 8 + 80 + 15 = 103 bytes
row_size = 16 + 23 = 39 bytes
primary_key_index = 276 * ( 32 + 16 ) = 13248 bytes
I'm mostly interested in how the row grows, so the 103 bytes per column is of interest. I counted all the samples in my database and ended up with 29,241,289 unique samples. Multiplying it out I get an estimated raw_data table size of 3GB.
In reality, I have 4GB of compressed data as measured by nodetool cfstats right after compaction. It reports a compression ratio of 0.117. It averages out to 137 bytes per sample, on disk, after compression. That seems very high, considering:
only 88 bytes of that is user data
it's 34 bytes more per sample than the 103-byte estimate
this is after deflate compression
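For reference, a quick back-of-the-envelope check of those figures (illustrative Scala arithmetic only):

// Back-of-the-envelope check of the figures quoted above.
val samples = 29241289L
val estimatedPerSample = 8 + 80 + 15              // 103 bytes, from the Datastax formula
val estimatedTotal = samples * estimatedPerSample // ~3.0 GB, matching the estimate
val observedTotal = samples * 137L                // ~4.0 GB, matching nodetool cfstats
val overheadPerSample = 137 - estimatedPerSample  // 34 extra bytes per sample on disk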
So, my question is: how do I accurately forecast how much disk space Cassandra wide rows consume, and how can I minimize the total disk space?
I'm running a single node with no replication for these tests.
This may be due to compaction strategies. With size-tiered compaction, the SSTables will build up to double the required space during compaction. For levelled compaction, around 10% extra space will be needed. Depending on the compaction strategy, you need to take the additional disk space used into account.
