I have a data set that I'm trying to process in PySpark. The data (on disk as Parquet) contains user IDs, session IDs, and metadata related to each session. I'm adding a number of columns to my dataframe that are the result of aggregating over a window. The issue I'm running into is that all but 4-6 executors will complete quickly and the rest run forever without completing. My code looks like this:
import pyspark.sql.functions as f
from pyspark.sql.window import Window
empty_col_a_cond = ((f.col("col_A").isNull()) |
(f.col("col_A") == ""))
session_window = Window.partitionBy("user_id", "session_id") \
.orderBy(f.col("step_id").asc())
output_df = (
input_df
.withColumn("col_A_val", f
.when(empty_col_a_cond, f.lit("NA"))
.otherwise(f.col("col_A")))
# ... 10 more added columns replacing nulls/empty strings
.repartition("user_id", "session_id")
.withColumn("s_user_id", f.first("user_id", True).over(session_window))
.withColumn("s_col_B", f.collect_list("col_B").over(session_window))
.withColumn("s_col_C", f.min("col_C").over(session_window))
.withColumn("s_col_D", f.max("col_D").over(session_window))
# ... 16 more added columns aggregating over session_window
.where(f.col("session_flag") == 1)
.where(f.array_contains(f.col("s_col_B"), "some_val"))
)
In my logs, I see this over and over:
INFO ExternalAppendOnlyUnsafeRowArray: Reached spill threshold of 4096 rows, switching to org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
INFO UnsafeExternalSorter: Thread 92 spilling sort data of 9.2 GB to disk (2 times so far)
INFO UnsafeExternalSorter: Thread 91 spilling sort data of 19.3 GB to disk (0 time so far)
This suggests that Spark can't hold all the windowed data in memory. I tried increasing the internal settings spark.sql.windowExec.buffer.in.memory.threshold and spark.sql.windowExec.buffer.spill.threshold, which helped a little, but some executors still never finish.
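For reference, this is roughly how those thresholds can be raised at the session level; the values below are purely illustrative, and the 4096 in the spill message appears to correspond to the default of the in-memory threshold:
# Illustrative values only; both settings are internal and undocumented.
spark.conf.set("spark.sql.windowExec.buffer.in.memory.threshold", 1000000)
spark.conf.set("spark.sql.windowExec.buffer.spill.threshold", 1000000)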
I believe this is all caused by some skew in the data. Grouping by both user_id and session_id, there are 6 groups with a count >= 10,000, 108 groups with a count between 1,000 and 10,000, and roughly 150,000 groups with a count below 1,000 (usually count = 1).
input_df \
    .groupBy(f.col("user_id"), f.col("session_id")) \
    .count() \
    .filter("count < 1000") \
    .count()
# Repeating this with the filter adjusted for each bucket:
#   count >= 10,000           -> 6 groups
#   1,000 <= count < 10,000   -> 108 groups
#   count < 1,000             -> ~150,000 groups
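For completeness, a sketch of how the same skew can be checked at the Spark-partition level after the repartition (spark_partition_id is the built-in partition-id function):
# Row counts per Spark partition after repartitioning by the window keys.
import pyspark.sql.functions as f

(input_df
 .repartition("user_id", "session_id")
 .withColumn("pid", f.spark_partition_id())
 .groupBy("pid")
 .count()
 .orderBy(f.desc("count"))
 .show(10))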
This is the resulting job DAG (screenshot omitted).
Related
I have a (very) large dataset partitioned by year, month and day. The partition columns were derived from an updated_at column during ingestion. This is what it looks like:
id      user  updated_at  year  month  day
1       a     1992-01-19  1992  1      19
2       c     1992-01-20  1992  1      20
3       a     1992-01-21  1992  1      21
...     ...   ...         ...   ...    ...
720987  c     2012-07-20  2012  7      20
720988  a     2012-07-21  2012  7      21
...     ...   ...         ...   ...    ...
I need to use Apache Spark to find the 5th earliest event per user.
A simple window function like the one below isn't feasible: I'm on a shared cluster and, given the size of the dataset, I won't have enough resources for in-memory processing at any given time.
from pyspark.sql import functions as F, Window

window = Window.partitionBy("user").orderBy(F.asc("updated_at"))
ranked = df.withColumn("rank", F.dense_rank().over(window))   # df: the dataset above
fifth_events = ranked.filter(F.col("rank") == 5)
I am considering looping through partitions, processing and persisting data to disk, and then merging them back. How would you solve it? Thanks!
I think the code below will be faster, because the data is partitioned by these columns and Spark can benefit from data locality.
Window.partitionBy("user").orderBy(F.asc("year"), F.asc("month"), F.asc("day"))
I am reading a dataframe from a JDBC source using partitioning as described here, with numPartitions, partitionColumn, upperBound, and lowerBound. I've been using this quite often, but this time I noticed something weird: with numPartitions = 32 and 124 distinct partition-column values, the data was split into 30 smaller chunks and 2 large ones.
Task 1 - partitions 1 .. 17 (17 values!)
Task 2 - partitions 18 .. 20 (3 values)
Task 3 - partitions 21 .. 23 (3 values)
Task 4 - partitions 24 .. 26 (3 values)
...
Task 30 - partitions 102 .. 104 (3 values)
Task 31 - partitions 105 .. 107 (3 values)
Task 32 - partitions 108 .. 124 (17 values!)
I'm just wondering whether this actually worked as expected, and what I can do to make it split into even chunks, apart from maybe experimenting with different values of numPartitions (note that the number of values can vary and I'm not always able to predict it).
I looked through the source code of JDBCRelation.scala and found that this is exactly how it's implemented. It first calculates the stride as (upperBound - lowerBound) / numPartitions, which in my case is 124 / 32 = 3, and then the remaining values are allocated evenly to the first and last partitions.
I was a bit unlucky with the number of values, because if I had 4 more, then 128 / 32 = 4 and it would align nicely into 32 partitions of 4 values each.
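A rough back-of-the-envelope check of that arithmetic (my own sketch, not the actual JDBCRelation code):
values, num_partitions = 124, 32

stride = values // num_partitions        # 124 // 32 = 3
middle = (num_partitions - 2) * stride   # 30 middle tasks x 3 values = 90
leftover = values - middle               # 124 - 90 = 34 values left over

print(stride, leftover // 2)             # 3 per middle task, 17 each for the first and last tasks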
I ended up pre-querying the table for the exact range and then manually providing predicates:
import spark.implicits._   // needed for .as[(Int, Int)]

val partRangeSql = "SELECT min(x), max(x) FROM table"
val (partMin, partMax) =
  spark.read.jdbc(jdbcUrl, s"($partRangeSql) _", props).as[(Int, Int)].head

val predicates = (partMin to partMax).map(p => s"x = $p").toArray
spark.read.jdbc(jdbcUrl, "table", predicates, props)
That makes 124 partitions (one per value), so you need to be careful not to overload the database server (I'm limiting the number of executors so that no more than 32 concurrent sessions run).
I guess adjusting lowerBound/upperBound so that upperBound - lowerBound is a multiple of numPartitions would also do the trick.
I have two data frames. Dataframe A is of shape (1269345,5) and dataframe B is of shape (18583586, 3).
Dataframe A looks like:
Name. gender start_coordinate end_coordinate ID
Peter M 30 150 1
Hugo M 4500 6000 2
Jennie F 300 700 3
Dataframe B looks like:
ID_sim. position string
1 89 aa
4 568 bb
5 938437 cc
I want to extract rows and build two data frames for which the position column in dataframe B falls in the interval specified by the start_coordinate and end_coordinate columns in dataframe A. So the resulting dataframes would look like:
###Final dataframe A
Name. gender start_coordinate end_coordinate ID
Peter M 30 150 1
Jennie F 300 700 3
###Final dataframe B
ID_sim. position string
1 89 aa
4 568 bb
I tried using numpy broadcasting like this:
s, e = dfA[['start_coordinate', 'end_coordinate']].to_numpy().T
p = dfB['position'].to_numpy()[:, None]
dfB[((p >= s) & (p <= e)).any(1)]
But this gave me the following error:
MemoryError: Unable to allocate 2.72 TiB for an array with shape (18583586, 160711) and data type bool
I think it's because the intermediate numpy array becomes very large when I broadcast. How can I achieve this without numpy broadcasting, considering that my dataframes are very large? Any insights will be appreciated.
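One way to keep the interval test while avoiding the full boolean matrix is to run the same comparison over dfB in chunks, so only a (chunk_size, len(dfA)) mask exists at any time. A sketch only, reusing the names from the question:
import pandas as pd

# Chunked version of the broadcast above: each mask is only
# chunk_size x len(dfA) bytes instead of the full matrix from the error above.
s, e = dfA[['start_coordinate', 'end_coordinate']].to_numpy().T

chunk_size = 1_000
matched = []
for start in range(0, len(dfB), chunk_size):
    chunk = dfB.iloc[start:start + chunk_size]
    p = chunk['position'].to_numpy()[:, None]
    matched.append(chunk[((p >= s) & (p <= e)).any(axis=1)])

dfB_matched = pd.concat(matched)   # rows of B that fall inside some interval of A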
This is likely due to your system's overcommit mode, which is 0 by default:
Heuristic overcommit handling. Obvious overcommits of address space
are refused. Used for a typical system. It ensures a seriously wild
allocation fails while allowing overcommit to reduce swap usage. The
root is allowed to allocate slightly more memory in this mode. This is
the default.
Run the command below to check your current overcommit mode:
$ cat /proc/sys/vm/overcommit_memory
0
In this case, you're allocating
>>> 156816 * 36 * 53806 / 1024.0**3
282.8939827680588
~282 GB, and the kernel is saying there's no way it can commit that many physical pages, so it refuses the allocation.
If (as root) you run:
$ echo 1 > /proc/sys/vm/overcommit_memory
This will enable the "always overcommit" mode, and you'll find that indeed the system will allow you to make the allocation no matter how large it is (within 64-bit memory addressing at least).
I tested this myself on a machine with 32 GB of RAM. With overcommit mode 0 I also got a MemoryError, but after changing it to 1 it worked:
>>> import numpy as np
>>> a = np.zeros((156816, 36, 53806), dtype='uint8')
>>> a.nbytes
303755101056
You can then go ahead and write to any location within the array, and the system will only allocate physical pages when you explicitly write to that page. So you can use this, with care, for sparse arrays.
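A small illustration of that behavior, assuming overcommit mode 1 is enabled as above: the array consumes almost no physical memory until individual pages are actually written.
import numpy as np

# 303,755,101,056 bytes (~282 GiB) of virtual address space; nothing is
# physically committed yet.
a = np.zeros((156816, 36, 53806), dtype='uint8')

# Only the pages that are written to get backed by physical memory.
a[100, 10, 20000] = 7
a[156000, 35, 50000] = 9
print(a[100, 10, 20000], a.nbytes)   # 7 303755101056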
I am trying to learn Apache Spark. This is the code I am trying to run, using the PySpark API.
data = xrange(1, 10000)
xrangeRDD = sc.parallelize(data, 8)
def ten(value):
"""Return whether value is below ten.
Args:
value (int): A number.
Returns:
bool: Whether `value` is less than ten.
"""
if (value < 10):
return True
else:
return False
filtered = xrangeRDD.filter(ten)
print filtered.collect()
print filtered.take(8)
print filtered.collect() gives this output: [1, 2, 3, 4, 5, 6, 7, 8, 9].
As I understand it, filtered.take(n) will take n elements from the RDD and print them.
I am trying two cases:
1) giving a value of n less than or equal to the number of elements in the RDD
2) giving a value of n greater than the number of elements in the RDD
I used the PySpark application UI to see the number of jobs that run in each case. In the first case only one job runs, but in the second, five jobs run.
I am not able to understand why this is happening. Thanks in advance.
RDD.take tries to evaluate as few partitions as possible.
If you take(9) it will fetch partition 0 (job 1) find 9 items and happily terminate.
If you take(10) it will fetch partition 0 (job 1) and find 9 items. It needs one more.
Since partition 0 had 9, it thinks partition 1 will probably have at least one more (job 2). But it doesn't! In 2 partitions it has found 9 items, so 4.5 items per partition so far. The formula divides it by 1.5 for pessimism and decides 10 / (4.5 / 1.5) = 3 partitions will do it. So it fetches partition 2 (job 3).
Still nothing. So 3 items per partition so far, divided by 1.5 means we need 10 / (3 / 1.5) = 5 partitions. It fetches partitions 3 and 4 (job 4).
Nothing. We have 1.8 items per partition, 10 / (1.8 / 1.5) = 8. It fetches the last 3 partitions (job 5) and that's it.
The code for this algorithm is in RDD.scala. As you can see it's nothing but heuristics. It saves some work usually, but it can lead to unnecessarily many jobs in degenerate cases.
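A back-of-the-envelope model of that scale-up rule (my own sketch, not the actual RDD.take source) reproduces the partition counts from the walkthrough above:
# Rough model of the scale-up heuristic described above.
def parts_needed(n, found, parts_scanned, total_parts):
    """Estimate how many partitions are needed to collect n items."""
    if found == 0:
        return total_parts
    per_part = found / parts_scanned                   # items seen per partition so far
    return min(total_parts, int(n / (per_part / 1.5)))

print(parts_needed(10, 9, 2, 8))   # 3 -> fetch partition 2 (job 3)
print(parts_needed(10, 9, 3, 8))   # 5 -> fetch partitions 3 and 4 (job 4)
print(parts_needed(10, 9, 5, 8))   # 8 -> fetch the last 3 partitions (job 5)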
I am trying to estimate the amount of space required for each column in a Cassandra wide row, but the numbers that I get are wildly conflicting.
I have a pretty standard wide row table to store some time series data:
CREATE TABLE raw_data (
id uuid,
time timestamp,
data list<float>,
PRIMARY KEY (id, time)
);
In my case, I store 20 floats in the data list.
Datastax provides some formulas for estimating user data size.
regular_total_column_size = column_name_size + column_value_size + 15
row_size = key_size + 23
primary_key_index = number_of_rows * ( 32 + average_key_size )
For this table, we get the following values:
regular_total_column_size = 8 + 80 + 15 = 103 bytes
row_size = 16 + 23 = 39 bytes
primary_key_index = 276 * ( 32 + 16 ) = 13248 bytes
I'm mostly interested in how the row grows, so the 103 bytes per column is of interest. I counted all the samples in my database and ended up with 29,241,289 unique samples. Multiplying it out I get an estimated raw_data table size of 3GB.
In reality, I have 4GB of compressed data as measured by nodetool cfstats right after compaction. It reports a compression ratio of 0.117. It averages out to 137 bytes per sample, on disk, after compression. That seems very high, considering:
only 88 bytes of that is user data
It's 34 bytes more per sample than the 103-byte estimate
This is after deflate compression.
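Putting the numbers above together (a quick arithmetic check; the 4 GB figure is taken as decimal gigabytes):
samples = 29_241_289

estimated = samples * 103      # 3,011,852,767 bytes, the ~3 GB estimate from the formula
actual = 4e9                   # ~4 GB on disk per nodetool cfstats

print(estimated / 1e9)         # ~3.0
print(actual / samples)        # ~137 bytes per sample on disk
print(actual / samples - 103)  # ~34 bytes per sample above the estimate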
So, my question is: how do I accurately forecast how much disk space Cassandra wide rows consume, and how can I minimize the total disk space?
I'm running a single node with no replication for these tests.
This may be due to compaction strategies. With size-tiered compaction, the SSTables can build up to double the required space during compaction. With levelled compaction, around 10% extra space is needed. Depending on the compaction strategy, you need to take the additional disk space used into account.
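For a rough sense of what that means for the ~4 GB table above (my own arithmetic, based on the figures in this answer):
table_gb = 4                 # compressed size reported by nodetool cfstats

print(table_gb * 2)          # size-tiered: up to ~8 GB occupied during compaction
print(table_gb * 1.10)       # levelled: ~10% extra, around 4.4 GB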