tl;dr
My issue here is that I'm stuck at calculating how many rows to anticipate on each part of a full outer merge when using Pandas DataFrames as part of a combinatorics graph.
Questions (repeated below).
1. The ideal solution would be to not require the merge at all and to query the panel objects instead. Given that there isn't a query method on the panel, is there a cleaner solution which would solve this problem without hitting the memory ceiling?
2. If the answer to 1 is no, how can I calculate the size of the required merge table for each combination of sets without carrying out the merge? This might be a sub-optimal approach, but in this instance it would be acceptable for the purpose of the application.
3. Is Python the right language for this, or should I be looking at a more statistical language such as R, or write it at a lower level (C, Cython)? Databases are out of the question.
The problem
Recently I re-wrote the py-upset graphing library to make it more efficient in terms of time when calculating combinations across DataFrames. I'm not looking for a review of this code, it works perfectly well in most instances and I'm happy with the approach. What I am looking for now is the answer to a very specific problem; uncovered when working with large data-sets.
The approach I took with the re-write was to formulate an in-memory merge of all provided dataframes on a full outer join as seen on lines 480 - 502 of pyupset.resources
for index, key in enumerate(keys):
    frame = self._frames[key]

    # rename non-key columns so they remain distinguishable after the merge
    frame.columns = [
        '{0}_{1}'.format(column, key)
        if column not in self._unique_keys
        else column
        for column in self._frames[key].columns
    ]

    if index == 0:
        # the first frame seeds the merge table
        self._merge = frame
    else:
        # every subsequent frame is full-outer joined onto the accumulated merge
        suffixes = (
            '_{0}'.format(keys[index-1]),
            '_{0}'.format(keys[index]),
        )
        self._merge = self._merge.merge(
            frame,
            on=self._unique_keys,
            how='outer',
            copy=False,
            suffixes=suffixes
        )
For small to medium dataframes, using joins works incredibly well. In fact, recent performance tests have shown that it will handle 5 or 6 datasets containing tens of thousands of lines each in less than a minute, which is more than ample for the application structure I require.
The problem now moves from time based to memory based.
Given datasets of potentially 100s of thousands of records, the library very quickly runs out of memory even on a large server.
To put this in perspective, my test machine for this application is an 8-core VMWare box with 128GiB RAM running Centos7.
Given the following dataset sizes, when adding the 5th dataframe, memory usage spirals exponentially. This was pretty much anticipated but underlines the heart of the problem I am facing.
Rows | Dataframe
------------------------
13963 | dataframe_one
48346 | dataframe_two
52356 | dataframe_three
337292 | dataframe_four
49936 | dataframe_five
24542 | dataframe_six
258093 | dataframe_seven
16337 | dataframe_eight
These are not "small" dataframes in terms of the number of rows, although the column count for each is limited to one unique key plus 4 non-unique columns. In pandas, the structure of each dataframe is:
column | type | unique
--------------------------
X | object | Y
id | int64 | N
A | float64 | N
B | float64 | N
C | float64 | N
This merge can cause problems as memory is eaten up. Occasionally it aborts with a MemoryError (great, I can catch and handle those), other times the kernel takes over and simply kills the application before the system becomes unstable, and occasionally, the system just hangs and becomes unresponsive / unstable until finally the kernel kills the application and frees the memory.
Sample output (memory sizes approximate):
[INFO] Creating merge table
[INFO] Merging table dataframe_one
[INFO] Data index length = 13963 # approx memory <500MiB
[INFO] Merging table dataframe_two
[INFO] Data index length = 98165 # approx memory <1.8GiB
[INFO] Merging table dataframe_three
[INFO] Data index length = 1296665 # approx memory <3.0GiB
[INFO] Merging table dataframe_four
[INFO] Data index length = 244776542 # approx memory ~13GiB
[INFO] Merging table dataframe_five
Killed # > 128GiB
When the merge table has been produced, it is queried in set combinations to produce graphs similar to https://github.com/mproffitt/py-upset/blob/feature/ISSUE-7-Severe-Performance-Degradation/tests/generated/extra_additional_pickle.png
The approach I am trying to build for solving the memory issue is to look at the sets being offered for the merge, pre-determine how much memory the merge will require, and, if that combination requires too much, split it into smaller combinations, calculate each of those separately, then put the final dataframe back together (divide and conquer).
My issue here is that I'm stuck at calculating how many rows to anticipate on each part of the merge.
Questions (repeated from above)
1. The ideal solution would be to not require the merge at all and to query the panel objects instead. Given that there isn't a query method on the panel, is there a cleaner solution which would solve this problem without hitting the memory ceiling?
2. If the answer to 1 is no, how can I calculate the size of the required merge table for each combination of sets without carrying out the merge? This might be a sub-optimal approach, but in this instance it would be acceptable for the purpose of the application.
3. Is Python the right language for this, or should I be looking at a more statistical language such as R, or write it at a lower level (C, Cython)?
Apologies for the lengthy question. I'm happy to provide more information if required or possible.
Can anybody shed some light on what might be the reason for this?
Thank you.
Question 1.
Dask shows a lot of promise in being able to calculate the merge table "out of memory" by using hdf5 files as a temporary store.
By using multi-processing to create the merges, dask also offers a performance increase over pandas. Unfortunately this is not carried through to the query method so performance gains made on the merge are lost on querying.
It is still not a completely viable solution as dask may still run out of memory on large, complex merges.
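For illustration, here is a minimal sketch of what the dask approach looks like. The dataframes df_a, df_b, the key column X and the file merge_store.h5 are all assumptions for the example, not the library's actual code:

import dask
import dask.dataframe as dd

# optionally use the multiprocessing scheduler for the merge work
dask.config.set(scheduler='processes')

# wrap the in-memory pandas frames as partitioned dask dataframes
ddf_a = dd.from_pandas(df_a, npartitions=8)
ddf_b = dd.from_pandas(df_b, npartitions=8)

# the full outer merge is lazy; nothing is computed yet
merged = dd.merge(ddf_a, ddf_b, on='X', how='outer')

# spill the result to HDF5 so the whole merge never has to sit in memory at once
merged.to_hdf('merge_store.h5', key='/merged')

# later, read it back lazily and compute only the slices that are needed
result = dd.read_hdf('merge_store.h5', key='/merged')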
Question 2.
Pre-calculating the size of the merge is entirely possible using the following method:
1. Group each dataframe by its unique key and calculate the size of each group.
2. Create a set of key names for each dataframe.
3. Create the intersection of the sets from step 2.
4. Create the set difference for each dataframe's keys against that intersection (keys only in the left frame, keys only in the right frame).
5. To accommodate np.nan stored in the unique key, count all NaN values; if one frame contains NaN and the other doesn't, treat the other frame's count as 1.
6. For keys in the intersection, multiply the counts from each groupby('...').size().
7. Add the counts from the set differences.
8. Add the product of the np.nan counts.
In Python this could be written as:
def merge_size(left_frame, right_frame, group_by):
    left_groups = left_frame.groupby(group_by).size()
    right_groups = right_frame.groupby(group_by).size()
    left_keys = set(left_groups.index)
    right_keys = set(right_groups.index)

    intersection = right_keys & left_keys
    left_sub_right = left_keys - intersection
    right_sub_left = right_keys - intersection

    # a NaN key never equals itself, so 'key != key' selects the NaN rows
    left_nan = len(left_frame.query('{0} != {0}'.format(group_by)))
    right_nan = len(right_frame.query('{0} != {0}'.format(group_by)))
    left_nan = 1 if left_nan == 0 and right_nan != 0 else left_nan
    right_nan = 1 if right_nan == 0 and left_nan != 0 else right_nan

    sizes = [(left_groups[group_name] * right_groups[group_name]) for group_name in intersection]
    sizes += [left_groups[group_name] for group_name in left_sub_right]
    sizes += [right_groups[group_name] for group_name in right_sub_left]
    sizes += [left_nan * right_nan]

    return sum(sizes)
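As a sanity check, here is a hypothetical usage sketch (the frames and the key column X below are made up). It relies on pandas matching NaN keys against each other in a merge, which is what the NaN handling above assumes:

import numpy as np
import pandas as pd

left = pd.DataFrame({'X': ['a', 'a', 'b', np.nan], 'A': [1.0, 2.0, 3.0, 4.0]})
right = pd.DataFrame({'X': ['a', 'c', np.nan, np.nan], 'B': [5.0, 6.0, 7.0, 8.0]})

predicted = merge_size(left, right, 'X')
actual = len(left.merge(right, on='X', how='outer'))

# the two numbers should agree, without ever building the large merge table
print(predicted, actual)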
Question 3
This method is fairly computation-heavy and would be better written in Cython for performance gains.
Related
Recently I've encountered an issue running one of our PySpark jobs. While analyzing the stages in the Spark UI, I noticed that the longest-running stage takes 1.2 hours out of the 2.5 hours that the entire process takes to run.
Once I took a look at the stage details it was clear that I'm facing a severe data skew, causing a single task to run for the entire 1.2 hours while all other tasks finish within 23 seconds.
The DAG showed that this stage involves window functions, which helped me quickly narrow the problematic area down to a few queries and find the root cause: the account column used in Window.partitionBy("account") had 25% null values.
I have no interest in calculating the sum for the null accounts, but I do need those rows for further calculations, so I can't filter them out prior to the window function.
Here is my window function query:
problematic_account_window = Window.partitionBy("account")
sales_with_account_total_df = sales_df.withColumn("sum_sales_per_account", sum(col("price")).over(problematic_account_window))
So we found the one to blame - What can we do now? How can we resolve the skew and the performance issue?
We basically have 2 solutions for this issue:
1. Break the initial dataframe into 2 different dataframes: one that filters out the null values and calculates the sum, and one that contains only the null values and is not part of the calculation. Lastly, union the two together.
2. Apply a salting technique to the null values in order to spread the nulls across all partitions and bring stability to the stage.
Solution 1:
from pyspark.sql.functions import col, sum
from pyspark.sql.window import Window

account_window = Window.partitionBy("account")

# split into null and non-null accounts
non_null_accounts_df = sales_df.where(col("account").isNotNull())
only_null_accounts_df = sales_df.where(col("account").isNull())

# calculate the sum for the non-null accounts only
sales_with_non_null_accounts_df = non_null_accounts_df.withColumn(
    "sum_sales_per_account", sum(col("price")).over(account_window))

# union the calculated result and the null-account df into the final result
sales_with_account_total_df = sales_with_non_null_accounts_df.unionByName(
    only_null_accounts_df, allowMissingColumns=True)
Solution 2:
from pyspark.sql.functions import ceil, coalesce, col, lit, rand, sum
from pyspark.sql.window import Window

# conf values come back as strings, so cast before doing arithmetic with them
SPARK_SHUFFLE_PARTITIONS = int(spark.conf.get("spark.sql.shuffle.partitions"))

modified_sales_df = (sales_df
    # create a random partition value that spans as much as the number of shuffle partitions
    .withColumn("random_salt_partition", lit(ceil(rand() * SPARK_SHUFFLE_PARTITIONS)))
    # use the random partition value only when the account value is null
    .withColumn("salted_account", coalesce(col("account"), col("random_salt_partition")))
)

# modify the partition to use the salted account
salted_account_window = Window.partitionBy("salted_account")

# use the salted account window to calculate the sum of sales
sales_with_account_total_df = modified_sales_df.withColumn(
    "sum_sales_per_account", sum(col("price")).over(salted_account_window))
In my solution I decided to use solution 2, since it didn't force me to create more dataframes for the sake of the calculation. Here is the result:
The salting technique resolved the skew: the exact same stage now runs for a total of 5.5 minutes instead of 1.2 hours. The only modification in the code was the salting column in the partitionBy. The comparison is based on the exact same cluster, number of nodes and cluster config.
I've got a problem with Spark, its driver and an OoM issue.
Currently I have a dataframe which is built from several joined sources (different tables in parquet format), and it contains thousands of tuples. They have a date which represents the creation date of the record, and there are only a few distinct dates.
I do the following:
from pyspark.sql.functions import year, month
# ...
selectionRows = inputDataframe.select(year('registration_date').alias('year'), month('registration_date').alias('month')).distinct()
selectionRows.show() # correctly shows 8 tuples
selectionRows = selectionRows.collect() # goes heap space OoM
print(selectionRows)
Reading the memory consumption statistics shows that the driver does not exceed ~60%. I thought that the driver should load only the distinct subset, not the entire dataframe.
Am I missing something? Is it possible to collect those few rows in a smarter way? I need them as a pushdown predicate to load a secondary dataframe.
Thank you very much!
EDIT / SOLUTION
After reading the comments and working through my needs, I now cache the dataframe at every "join/elaborate" step, so that the timeline looks like this:
1. Join with the loaded table.
2. Queue the required transformations.
3. Apply the cache transformation.
4. Print the count to keep track of cardinality (mainly for tracking / debugging purposes), which materialises all transformations plus the cache.
5. Unpersist the cache of the previous sibling step, if available (tick/tock paradigm).
This reduced some complex ETL jobs down to 20% of the original time (as previously it was applying the transformations of each previous step at each count).
Lesson learned :)
After reading the comments, I elaborated the solution for my use case.
As mentioned in the question, I join several tables one with each other in a "target dataframe", and at each iteration I do some transformations, like so:
# n-th table work
target = target.join(other, how='left')
target = target.filter(...)
target = target.withColumn('a', 'b')
target = target.select(...)
print(f'Target after table "other": {target.count()}')
The problem of slowness / OoM was that Spark was forced to redo all the transformations from start to finish for each table because of the trailing count, making it slower and slower at each table / iteration.
The solution I found is to cache the dataframe at each iteration, like so:
cache = None  # DataFrame holding the previously cached step

# ...

# n-th table work
target = target.join(other, how='left')
target = target.filter(...)
target = target.withColumn('a', 'b')
target = target.select(...)

target = target.cache()
target_count = target.count()  # actually materialises the cache

if cache is not None:
    cache.unpersist()  # free the memory held by the previous cached version
cache = target

print(f'Target after table "other": {target_count}')
I have a Spark DataFrame where all fields are integer type. I need to count how many individual cells are greater than 0.
I am running locally and have a DataFrame with 17,000 rows and 450 columns.
I have tried two methods, both yielding slow results:
Version 1:
(for (c <- df.columns) yield df.where(s"$c > 0").count).sum
Version 2:
df.columns.map(c => df.filter(df(c) > 0).count)
This calculation takes 80 seconds of wall-clock time. With Python pandas, it takes a fraction of a second. I am aware that for small data sets and local operation Python may perform better, but this seems extreme.
Trying to make a Spark-to-Spark comparison, I find that running MLlib's PCA algorithm on the same data (converted to a RowMatrix) takes less than 2 seconds!
Is there a more efficient implementation I should be using?
If not, how is the seemingly much more complex PCA calculation so much faster?
What to do
import org.apache.spark.sql.functions.{col, count, when}
df.select(df.columns map (c => count(when(col(c) > 0, 1)) as c): _*)
Why
Both of your attempts create a number of jobs proportional to the number of columns. Computing the execution plan and scheduling the job alone is expensive and adds significant overhead depending on the amount of data.
Furthermore, the data might be loaded from disk and/or parsed each time a job is executed, unless the data is fully cached with a significant memory safety margin that ensures the cached data will not be evicted.
This means that in the worst-case scenario the nested-loop-like structure you use can be roughly quadratic in terms of the number of columns.
The code shown above handles all columns at the same time, requiring only a single data scan.
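For reference, here is a rough PySpark sketch of the same single-scan idea; the DataFrame name df is a placeholder, not taken from the question:

from pyspark.sql.functions import col, count, when

# one select, one scan: count the positive cells of every column at once
per_column = df.select([count(when(col(c) > 0, 1)).alias(c) for c in df.columns])

# first() returns a single Row of per-column counts; summing it gives the grand total
total = sum(per_column.first())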
The problem with your approach is that the file is scanned for every column (unless you have cached it in memory). The fastest way, with a single FileScan, should be:
import org.apache.spark.sql.functions.{explode,array}
val cnt: Long = df
.select(
explode(
array(df.columns.head,df.columns.tail:_*)
).as("cell")
)
.where($"cell">0).count
Still, I think it will be slower than with pandas, as Spark has a certain overhead due to the parallelization engine.
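A rough PySpark equivalent of this explode/array approach, again assuming a DataFrame df whose columns all share the same integer type (as in the question):

from pyspark.sql.functions import array, col, explode

cnt = (df
       # melt every cell into a single 'cell' column, still in one scan
       .select(explode(array(*[col(c) for c in df.columns])).alias("cell"))
       .where(col("cell") > 0)
       .count())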
I'm joining 2 datasets using Apache Spark ML LSH's approxSimilarityJoin method, but I'm seeing some strange behaviour.
After the (inner) join the dataset is a bit skewed; however, every time, one or more tasks take an inordinate amount of time to complete.
The median is 6 ms per task (I'm running it on a smaller source dataset to test), but one task takes 10 minutes. It's hardly using any CPU cycles; it does actually join data, but so, so slowly.
The next slowest task runs in 14s, has 4x more records & actually spills to disk.
The join itself is an inner join between the two datasets on pos & hashValue (minhash), in accordance with the minhash specification, plus a UDF to calculate the Jaccard distance between match pairs.
Explode the hashtables:
modelDataset.select(
struct(col("*")).as(inputName), posexplode(col($(outputCol))).as(explodeCols))
Jaccard distance function:
override protected[ml] def keyDistance(x: Vector, y: Vector): Double = {
  val xSet = x.toSparse.indices.toSet
  val ySet = y.toSparse.indices.toSet
  val intersectionSize = xSet.intersect(ySet).size.toDouble
  val unionSize = xSet.size + ySet.size - intersectionSize
  assert(unionSize > 0, "The union of two input sets must have at least 1 elements")
  1 - intersectionSize / unionSize
}
Join of the processed datasets:
// Do a hash join on where the exploded hash values are equal.
val joinedDataset = explodedA.join(explodedB, explodeCols)
.drop(explodeCols: _*).distinct()
// Add a new column to store the distance of the two rows.
val distUDF = udf((x: Vector, y: Vector) => keyDistance(x, y), DataTypes.DoubleType)
val joinedDatasetWithDist = joinedDataset.select(col("*"),
distUDF(col(s"$leftColName.${$(inputCol)}"), col(s"$rightColName.${$(inputCol)}")).as(distCol)
)
// Filter the joined datasets where the distance are smaller than the threshold.
joinedDatasetWithDist.filter(col(distCol) < threshold)
I've tried combinations of caching, repartitioning and even enabling spark.speculation, all to no avail.
The data consists of shingled address text that has to be matched:
53536, Evansville, WI => 53, 35, 36, ev, va, an, ns, vi, il, ll, le, wi
These will have a short distance to records where there is a typo in the city or zip.
This gives pretty accurate results, but may be the cause of the join skew.
My questions are:
1. What may cause this discrepancy? (One task taking very, very long even though it has fewer records.)
2. How can I prevent this skew in minhash without losing accuracy?
3. Is there a better way to do this at scale? (I can't do a Jaro-Winkler / Levenshtein comparison of millions of records against all records in the location dataset.)
It might be a bit late but I will post my answer here anyways to help others out. I recently had similar issues with matching misspelled company names (All executors dead MinHash LSH PySpark approxSimilarityJoin self-join on EMR cluster). Someone helped me out by suggesting to take NGrams to reduce the data skew. It helped me a lot. You could also try using e.g. 3-grams or 4-grams.
I don’t know how dirty the data is, but you could try to make use of states. It reduces the number of possible matches substantially already.
What really helped me improve the accuracy of the matches is to post-process the connected components (groups of connected matches made by the MinHashLSH) by running a label propagation algorithm on each component. This also allows you to increase N (of the NGrams), thereby mitigating the skewed-data problem, set the Jaccard distance parameter in approxSimilarityJoin less tightly, and post-process using label propagation.
Finally, I am currently looking into using skipgrams to match it. I found that in some cases it works better and reduces the data skew somewhat.
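As an illustration of the n-gram suggestion, here is a minimal sketch using pyspark.ml.feature.NGram; the DataFrame addresses_df and its address column are assumptions for the example:

from pyspark.ml.feature import NGram
from pyspark.sql.functions import col, split

# split each address into single characters, then build character 3-grams
# to replace the 2-character shingles
chars_df = addresses_df.withColumn("chars", split(col("address"), ""))
ngram = NGram(n=3, inputCol="chars", outputCol="ngrams")
shingled_df = ngram.transform(chars_df)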
I have some data in a tab-delimited file on HDFS that looks like this:
label | user_id | feature
------------------------------
pos | 111 | www.abc.com
pos | 111 | www.xyz.com
pos | 111 | Firefox
pos | 222 | www.example.com
pos | 222 | www.xyz.com
pos | 222 | IE
neg | 333 | www.jkl.com
neg | 333 | www.xyz.com
neg | 333 | Chrome
I need to transform it to create a feature vector for each user_id to train a org.apache.spark.ml.classification.NaiveBayes model.
My current approach is essentially the following:
1. Load the raw data into a DataFrame.
2. Index the features with StringIndexer.
3. Go down to the RDD, group by user_id, and map the feature indices into a sparse Vector.
The kicker is this... the data is already pre-sorted by user_id. What's the best way to take advantage of that? It pains me to think about how much needless work may be occurring.
In case a little code is helpful to understand my current approach, here is the essence of the map:
val featurization = (vals: (String, Iterable[Row])) => {
  // create a Seq of all the feature indices
  // Note: the indexing was done in a previous step not shown
  val seq = vals._2.map(x => (x.getDouble(1).toInt, 1.0D)).toSeq

  // create the sparse vector
  val featureVector = Vectors.sparse(maxIndex, seq)

  // convert the string label into a Double
  val label = if (vals._2.head.getString(2) == "pos") 1.0 else 0.0

  (label, vals._1, featureVector)
}

d.rdd
  .groupBy(_.getString(1))
  .map(featurization)
  .toDF("label", "user_id", "features")
Let's start with your other question:
If my data on disk is guaranteed to be pre-sorted by the key which will be used for a group aggregation or reduce, is there any way for Spark to take advantage of that?
It depends. If the operation you apply can benefit from map-side aggregation, then you can gain quite a lot from presorted data without any further intervention in your code. Data sharing the same key should be located on the same partitions and can be aggregated locally before the shuffle.
Unfortunately it won't help much in this particular scenario. Even if you enable map-side aggregation (groupBy(Key) doesn't use it, so you'd need a custom implementation) or aggregate over feature vectors (you'll find some examples in my answer to How to define a custom aggregation function to sum a column of Vectors?), there is not much to gain. You can save some work here and there, but you still have to transfer all indices between nodes.
If you want to gain more, you'll have to do a little more work. I can see two basic ways to leverage the existing order:

1. Use a custom Hadoop input format to yield only complete records (label, id, all features) instead of reading the data line by line. If your data has a fixed number of lines per id you could even try to use NLineInputFormat and apply mapPartitions to aggregate records afterwards.

   This is definitely a more verbose solution, but it requires no additional shuffling in Spark.

2. Read the data as usual, but use a custom partitioner for groupBy. As far as I can tell, a rangePartitioner should work just fine, but to be sure you can try the following procedure:
   - use mapPartitionsWithIndex to find the minimum / maximum id per partition,
   - create a partitioner which keeps minimum <= ids < maximum on the current (i-th) partition and pushes the maximum to partition i + 1,
   - use this partitioner for groupBy(Key).

   This is probably a friendlier solution, but it requires at least some shuffling. If the expected number of records to move is low (<< #records-per-partition) you can even handle this without a shuffle, using mapPartitions and a broadcast*, although having the data partitioned can be more useful and cheaper to obtain in practice. A rough sketch of the range-partitioner procedure is given below, after the footnote.
* You can use an approach similar to this: https://stackoverflow.com/a/33072089/1560062
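For what it's worth, here is a rough PySpark RDD sketch of the range-partitioner procedure from option 2 above. The keyed_rdd of (user_id, row) pairs and the integer-like ids are assumptions for illustration, not code from the question:

import bisect

# 1. find the maximum id per partition (empty partitions are skipped)
bounds = (keyed_rdd
          .mapPartitionsWithIndex(
              lambda i, it: [(i, max((k for k, _ in it), default=None))])
          .collect())
cutoffs = sorted(m for _, m in bounds if m is not None)

# 2. partitioner: place each key on the first partition whose maximum covers it
#    (boundary ids stay on the earlier partition in this sketch)
def range_partitioner(key):
    return bisect.bisect_left(cutoffs, key)

# 3. group with the custom partitioner so records for one id never straddle partitions
grouped = keyed_rdd.groupByKey(numPartitions=len(cutoffs),
                               partitionFunc=range_partitioner)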