Does writing a dataframe to HDFS affect its sorting - apache-spark

I'm running code on Apache Spark in a multi-node environment (one master and two slave nodes) in which I'm manipulating a dataframe and then performing logistic regression on it. In between, I'm also writing out the interim transformed files. I have witnessed a peculiar observation (and yes, I've double-checked and triple-checked) which I'm not able to explain, and I want to confirm whether this could be because of my code or whether there might be other factors in play.
I have a dataframe like
df
uid rank text
a   1    najn
b   2    dak
c   1    kksa
c   3    alkw
b   1    bdsj
c   2    asma
I sort it with the following code:
sdf = df.orderBy("uid", "rank")
sdf.show()
uid rank text
a   1    najn
b   1    bdsj
b   2    dak
c   1    kksa
c   2    asma
c   3    alkw
and write the transformed df to HDFS using
sdf.repartition(1)
.write.format("com.databricks.spark.csv")
.option("header", "true")
.save("/someLocation")
Now when I again try to view the data, it seems to have lost its sorting:
sdf.show()
uid rank text
a   1    najn
c   2    asma
b   2    dak
c   1    kksa
c   3    alkw
b   1    bdsj
When I skip the writing step, it works fine.
Does anyone have pointers on whether this is a valid case and whether we can do something to resolve it?
P.S. I tried various variations of the writing code: increasing the number of partitions, removing the partitioning altogether, and saving to other formats.

The problem is not writing to HDFS but rather the repartition as stated in the comments by zero323.
If you are planning to write everything down to a single file you should do it like this:
sdf.coalesce(1).orderBy("uid", "rank").write...
coalesce avoids a full repartition (it just concatenates the existing partitions one after the other instead of shuffling every row by its hash), which means your data would still be ordered within the original partitions and is therefore faster to sort (of course, you can always lose the original ordering, as it won't help much here).
Note that this is not scalable, as you are pulling everything into a single partition. If you had written without any repartitioning, you would get a number of files equal to the original number of partitions of sdf. Each file would be ordered internally, so you could easily combine them.
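To see why it is the repartition, not the write, that breaks the ordering, here is a toy illustration in plain Python (not Spark; a simple modulo stands in for Spark's hash partitioner):

```python
# Toy model: hash-partitioning a sorted sequence, then reading the
# partitions back in partition order, loses the global sort.
rows = [1, 2, 3, 4, 5, 6]  # globally sorted input

num_partitions = 3
partitions = [[] for _ in range(num_partitions)]
for r in rows:
    partitions[r % num_partitions].append(r)  # stand-in for hash(key) % n

read_back = [r for p in partitions for r in p]
print(read_back)  # [3, 6, 1, 4, 2, 5] -- no longer globally sorted
```

Note that each partition is still sorted internally, which is exactly why writing without repartitioning produces files that are each ordered inside.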

Related

Number of files saved by parquet writer in pyspark

How many files does a pyspark parquet write generate? I have read that the output is one file per in memory partition. However, this does not seem to always be true.
I am running a 6-executor cluster with 6G of executor memory per executor. All the rest (pyspark, overhead, offheap) is 2G.
using the following data:
import numpy as np
import pandas as pd

dummy_data = spark.createDataFrame(pd.DataFrame({'a': np.random.choice([1,2,3,4,5,6,7,8,9,10], 100000)}))
The following code where I repartition without specifying a column to repartition by, always produces the number of files equal to the number of memory partitions:
df_dummy = dummy_data.repartition(200)
df_dummy.rdd.getNumPartitions()
df_dummy.write.format("parquet").save("gs://monsoon-credittech.appspot.com/spark_datasets/test_writes/df_dummy_repart_wo_id")
#files generated 200
However, the following code, where I do specify the column to repartition the data by, produces some random number of files:
df_dummy = dummy_data.repartition(200,'a')
df_dummy.rdd.getNumPartitions()
df_dummy.write.format("parquet").save("gs://monsoon-credittech.appspot.com/spark_datasets/test_writes/df_dummy_repart_w_id")
#files generated 11
Can you help me understand the number of output files that get generated by the pyspark parquet writer?
This is an answer that does not explain everything you're noticing, but probably contains useful enough information that it would be a pity not to share it.
The reason why you're seeing a different number of output files is the ordering of your data after those two repartition operations.
dummy_data.repartition(200) repartitions your individual rows using round robin partitioning
the result is that your data has a random ordering, because your input data has random ordering
dummy_data.repartition(200,'a') uses hash partitioning according to the column a's values
the result is that your data is chopped up in a very specific order: hashing the column values will put values where a == 1 always in the same partition
since the number of distinct values is smaller than your number of partitions, each partition will contain at most one distinct a value (and most partitions will be empty).
Now, there is a pattern in the amount of output part-files you receive:
In the case of dummy_data.repartition(200), you simply get the same number of part-files as partitions. 200 in your example.
In the other case, you get 11 part-files. If you have a look at the content of those part-files, you will see that there is 1 empty file + 10 filled files. 1 for each distinct value of your original dataset. So this leads to the conclusion that while writing your files, something is being smart and merging those minuscule and identical files. I'm not sure whether this is Spark, or the PARQUET_OUTPUT_COMMITTER_CLASS, or something else.
Conclusion
In general, you get the same amount of part-files as the amount of partitions.
In your specific case, when you're repartitioning by the column (which is the only value in the Row), your parquet part-files will contain a bunch of the same values. It seems that something (I don't know what) is being smart and merging files with the same values.
In your case, you got 11 part-files because there is 1 empty file and 10 files, one for each distinct value in your dataframe. Try changing np.random.choice([1,2,3,4,5,6,7,8,9,10], 100000) to np.random.choice([1,2,3,4,5,6,7,8], 100000) and you will see you'll get 9 part-files (8 + 1).
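The "one non-empty partition per distinct value" effect can be sketched in plain Python (a toy model; modulo stands in for Spark's Murmur3 hash, so the real partition ids differ, but the count of non-empty partitions behaves the same way):

```python
import random

# 100,000 rows drawn from 10 distinct values, like np.random.choice([1..10], 100000)
values = [random.choice(range(1, 11)) for _ in range(100_000)]
num_partitions = 200

# Hash partitioning sends every row with the same key to the same partition,
# so at most 10 of the 200 partitions can ever be non-empty.
non_empty = {v % num_partitions for v in values}
print(len(non_empty))  # at most 10 -- one non-empty partition per distinct value
```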
Most likely, the reason you see 11 files being written after you do a .repartition(200,'a') is because your first partition (with partition id = 0) becomes empty. Spark allows the task working on that empty partition to proceed with the write, but will suppress writing all other empty parquet files for all other partitions. This behavior can be tracked down to the changes made for JIRA SPARK-21435 "Empty files should be skipped while write to file", and corresponding code in FileFormatWriter.scala:
:
val dataWriter =
  if (sparkPartitionId != 0 && !iterator.hasNext) {
    // In case of empty job, leave first partition to save meta for file format like parquet.
    new EmptyDirectoryDataWriter(description, taskAttemptContext, committer)
  } else if (description.partitionColumns.isEmpty && description.bucketSpec.isEmpty) {
:
So, if you repartition your dataset such that partition 0 becomes non-empty, you would not see any empty files written.

Spark Error - Max iterations (100) reached for batch Resolution

I am working on Spark SQL where I need to find the diff between two large CSVs.
The diff should give:
Inserted rows, i.e. new records // comparing only IDs
Changed rows (not including inserted ones) - comparing all column values
Deleted rows // comparing only IDs
Spark 2.4.4 + Java
I am using Databricks to Read/Write CSV
Dataset<Row> insertedDf = newDf_temp.join(oldDf_temp,oldDf_temp.col(key)
.equalTo(newDf_temp.col(key)),"left_anti");
Long insertedCount = insertedDf.count();
logger.info("Inserted File Count == "+insertedCount);
Dataset<Row> deletedDf = oldDf_temp.join(newDf_temp,oldDf_temp.col(key)
.equalTo(newDf_temp.col(key)),"left_anti")
.select(oldDf_temp.col(key));
Long deletedCount = deletedDf.count();
logger.info("deleted File Count == "+deletedCount);
Dataset<Row> changedDf = newDf_temp.exceptAll(oldDf_temp); // This gives rows (New +changed Records)
Dataset<Row> changedDfTemp = changedDf.join(insertedDf, changedDf.col(key)
.equalTo(insertedDf.col(key)),"left_anti"); // This gives only changed record
Long changedCount = changedDfTemp.count();
logger.info("Changed File Count == "+changedCount);
This works well for CSVs with up to 50 or so columns.
The above code fails even for a single row in a CSV with 300+ columns, so I am sure this is not a file size problem.
But if I have a CSV with 300+ columns, then it fails with the exception:
Max iterations (100) reached for batch Resolution – Spark Error
If I set the below property in Spark, It Works!!!
sparkConf.set("spark.sql.optimizer.maxIterations", "500");
But my question is why do I have to set this?
Is there something wrong which I am doing?
Or is this behaviour expected for CSVs that have many columns?
Can I optimize it in any way to handle CSVs with large numbers of columns?
The issue you are running into is related to how Spark takes the instructions you give it and transforms them into the actual work it is going to do. It first needs to understand your instructions by running the Analyzer, then it tries to improve them by running its optimizer. The setting appears to apply to both.
Specifically, your code is bombing out during a step in the Analyzer. The analyzer is responsible for figuring out, when you refer to things, what things you are actually referring to: for example, mapping function names to implementations, or mapping column names across renames and different transforms. It does this in multiple passes, resolving additional things each pass, then checking again to see if it can resolve more.
I think what is happening for your case is each pass probably resolves one column, but 100 passes isn't enough to resolve all of the columns. By increasing it you are giving it enough passes to be able to get entirely through your plan. This is definitely a red flag for a potential performance issue, but if your code is working then you can probably just increase the value and not worry about it.
If it isn't working, then you will probably need to try to do something to reduce the number of columns used in your plan. Maybe combining all the columns into one encoded string column as the key. You might benefit from checkpointing the data before doing the join so you can shorten your plan.
EDIT:
Also, I would refactor your above code so you could do it all with only one join. This should be a lot faster, and might solve your other problem.
Each join leads to a shuffle (data being sent between compute nodes), which adds time to your job. Instead of computing adds, deletes, and changes independently, you can just do them all at once. Something like the code below. It's in Scala pseudocode because I'm more familiar with that than the Java APIs.
import org.apache.spark.sql.functions._

var oldDf = ..
var newDf = ..
val changeCols = newDf.columns.filter(_ != "id").map(col)

// Make the columns you want to compare into a single struct column for easier comparison
newDf = newDf.select($"id", struct(changeCols: _*) as "compare_new")
oldDf = oldDf.select($"id", struct(changeCols: _*) as "compare_old")

// Outer join on ID
val combined = oldDf.join(newDf, Seq("id"), "outer")

// Figure out the status of each row based upon the presence of old/new:
// IF the old side is missing, it must be an ADD
// IF the new side is missing, it must be a DELETE
// IF both sides are present but different, it's a CHANGE
// ELSE it's NOCHANGE
val status = when($"compare_new".isNull, lit("add")).
  when($"compare_old".isNull, lit("delete")).
  when($"compare_new" =!= $"compare_old", lit("change")).
  otherwise(lit("nochange"))

val labeled = combined.select($"id", status as "status")
At this point, we have every ID labeled ADD/DELETE/CHANGE/NOCHANGE, so we can just do a groupBy/count. This agg can be done almost entirely map-side, so it will be a lot faster than a join.
labeled.groupBy("status").count.show
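For anyone working in Python, the same single-join labeling can be sketched in pandas (a hedged illustration with made-up data, not the Spark code above; `indicator=True` plays the role of the outer join's null checks):

```python
import pandas as pd

# Toy old/new snapshots keyed on 'id'
old_df = pd.DataFrame({'id': [1, 2, 3], 'val': ['a', 'b', 'c']})
new_df = pd.DataFrame({'id': [2, 3, 4], 'val': ['b', 'x', 'd']})

# One outer join; _merge tells us which side(s) each id came from
combined = old_df.merge(new_df, on='id', how='outer',
                        suffixes=('_old', '_new'), indicator=True)

def label(row):
    if row['_merge'] == 'right_only':
        return 'add'         # only in the new file
    if row['_merge'] == 'left_only':
        return 'delete'      # only in the old file
    return 'change' if row['val_old'] != row['val_new'] else 'nochange'

combined['status'] = combined.apply(label, axis=1)
counts = combined['status'].value_counts().to_dict()
print(counts)  # each status appears once for this toy data
```

From here, the value_counts gives the add/delete/change/nochange totals in one pass, mirroring the groupBy/count above.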

Spark Join optimization

let's say I have two dataframes that I want to join using "inner join": A and B, each one has 100 columns and billions of rows.
If in my use case I'm only interested in 10 columns of A and 4 columns of B, does Spark do the optimization for me and shuffle only those 14 columns, or will it shuffle everything and then select the 14 columns?
Query 1 :
A_select = A.select("{10 columns}").as("A")
B_select = B.select("{4 columns}").as("B")
result = A_select.join(B_select, $"A.id" === $"B.id")
Query 2 :
A.join(B, $"A.id" === $"B.id").select("{14 columns}")
Are Query 1 and Query 2 equivalent in terms of behavior, execution time, and data shuffling?
Thanks in advance for your answers!
Yes, Spark will handle the optimization for you. Thanks to Catalyst's column-pruning optimization, only the required attributes will be selected from the dataframes (A and B).
You can use the explain function to view the logical/physical plan:
result.explain()
Both queries will produce the same physical plan; hence execution time and data shuffling will be the same.
Reference - Pyspark documentation for explain function.
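As a rough toy model in plain Python (not Spark) of why pruning before the join matters: shuffle volume scales with the bytes each row actually carries, so projecting 14 of 100 columns first shrinks what crosses the network:

```python
import json

# One row with 100 columns vs. the 14 columns the query actually needs.
full_row = {f'col_{i}': f'some_value_{i}' for i in range(100)}
pruned_row = {k: full_row[k] for k in list(full_row)[:14]}

# Serialized size as a crude proxy for per-row shuffle bytes
full_bytes = len(json.dumps(full_row))
pruned_bytes = len(json.dumps(pruned_row))
print(pruned_bytes < full_bytes)  # True -- far fewer bytes shuffled per row
```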

Calculating the size of a full outer join in pandas

tl;dr
My issue here is that I'm stuck at calculating how many rows to anticipate on each part of a full outer merge when using Pandas DataFrames as part of a combinatorics graph.
Questions (repeated below).
1. The ideal solution would be to not require the merge and to query panel objects. Given that there isn't a query method on the panel, is there a cleaner solution which would solve this problem without hitting the memory ceiling?
2. If the answer to 1 is no, how can I calculate the size of the required merge table for each combination of sets without carrying out the merge? This might be a sub-optimal approach, but in this instance it would be acceptable for the purpose of the application.
3. Is Python the right language for this, or should I be looking at a more statistical language such as R, or write it at a lower level (C, Cython)? Databases are out of the question.
The problem
Recently I re-wrote the py-upset graphing library to make it more efficient in terms of time when calculating combinations across DataFrames. I'm not looking for a review of this code, it works perfectly well in most instances and I'm happy with the approach. What I am looking for now is the answer to a very specific problem; uncovered when working with large data-sets.
The approach I took with the re-write was to formulate an in-memory merge of all provided dataframes on a full outer join as seen on lines 480 - 502 of pyupset.resources
for index, key in enumerate(keys):
    frame = self._frames[key]
    frame.columns = [
        '{0}_{1}'.format(column, key)
        if column not in self._unique_keys
        else column
        for column in self._frames[key].columns
    ]
    if index == 0:
        self._merge = frame
    else:
        suffixes = (
            '_{0}'.format(keys[index - 1]),
            '_{0}'.format(keys[index]),
        )
        self._merge = self._merge.merge(
            frame,
            on=self._unique_keys,
            how='outer',
            copy=False,
            suffixes=suffixes,
        )
For small to medium dataframes, using joins works incredibly well. In fact, recent performance tests have shown that it'll handle 5 or 6 datasets containing tens of thousands of lines each in less than a minute, which is more than ample for the application structure I require.
The problem now moves from time based to memory based.
Given datasets of potentially 100s of thousands of records, the library very quickly runs out of memory even on a large server.
To put this in perspective, my test machine for this application is an 8-core VMWare box with 128GiB RAM running Centos7.
Given the following dataset sizes, when adding the 5th dataframe, memory usage spirals exponentially. This was pretty much anticipated but underlines the heart of the problem I am facing.
Rows | Dataframe
------------------------
13963 | dataframe_one
48346 | dataframe_two
52356 | dataframe_three
337292 | dataframe_four
49936 | dataframe_five
24542 | dataframe_six
258093 | dataframe_seven
16337 | dataframe_eight
These are not "small" dataframes in terms of the number of rows although the column count for each is limited to one unique key + 4 non-unique columns. The size of each column in pandas is
column | type | unique
--------------------------
X | object | Y
id | int64 | N
A | float64 | N
B | float64 | N
C | float64 | N
This merge can cause problems as memory is eaten up. Occasionally it aborts with a MemoryError (great, I can catch and handle those), other times the kernel takes over and simply kills the application before the system becomes unstable, and occasionally, the system just hangs and becomes unresponsive / unstable until finally the kernel kills the application and frees the memory.
Sample output (memory sizes approximate):
[INFO] Creating merge table
[INFO] Merging table dataframe_one
[INFO] Data index length = 13963 # approx memory <500MiB
[INFO] Merging table dataframe_two
[INFO] Data index length = 98165 # approx memory <1.8GiB
[INFO] Merging table dataframe_three
[INFO] Data index length = 1296665 # approx memory <3.0GiB
[INFO] Merging table dataframe_four
[INFO] Data index length = 244776542 # approx memory ~13GiB
[INFO] Merging table dataframe_five
Killed # > 128GiB
When the merge table has been produced, it is queried in set combinations to produce graphs similar to https://github.com/mproffitt/py-upset/blob/feature/ISSUE-7-Severe-Performance-Degradation/tests/generated/extra_additional_pickle.png
The approach I am trying to build for solving the memory issue is to look at the sets being offered for merge, pre-determine how much memory the merge will require, then if that combination requires too much, split it into smaller combinations, calculate each of those separately, then put the final dataframe back together (divide and conquer).
My issue here is that I'm stuck at calculating how many rows to anticipate on each part of the merge.
Questions (repeated from above)
1. The ideal solution would be to not require the merge and to query panel objects. Given that there isn't a query method on the panel, is there a cleaner solution which would solve this problem without hitting the memory ceiling?
2. If the answer to 1 is no, how can I calculate the size of the required merge table for each combination of sets without carrying out the merge? This might be a sub-optimal approach, but in this instance it would be acceptable for the purpose of the application.
3. Is Python the right language for this, or should I be looking at a more statistical language such as R, or write it at a lower level (C, Cython)?
Apologies for the lengthy question. I'm happy to provide more information if required or possible.
Can anybody shed some light on what might be the reason for this?
Thank you.
Question 1.
Dask shows a lot of promise in being able to calculate the merge table "out of memory" by using hdf5 files as a temporary store.
By using multi-processing to create the merges, dask also offers a performance increase over pandas. Unfortunately this is not carried through to the query method so performance gains made on the merge are lost on querying.
It is still not a completely viable solution as dask may still run out of memory on large, complex merges.
Question 2.
Pre-calculating the size of the merge is entirely possible using the following method.
Group each dataframe by the unique key and calculate the size of each group.
Create a set of key names for each dataframe.
Create the intersection of the sets from step 2.
Create the set difference for the left set and for the right set.
To accommodate np.nan stored in the unique key, count all NaN values. If one frame contains NaN and the other doesn't, write the other's count as 1.
For keys in the intersection, multiply the counts from each groupby('...').size().
Add the counts from the set differences.
Add the product of the np.nan counts.
In Python this could be written as:
def merge_size(left_frame, right_frame, group_by):
    left_groups = left_frame.groupby(group_by).size()
    right_groups = right_frame.groupby(group_by).size()
    left_keys = set(left_groups.index)
    right_keys = set(right_groups.index)

    intersection = right_keys & left_keys
    left_sub_right = left_keys - intersection
    right_sub_left = right_keys - intersection

    left_nan = len(left_frame.query('{0} != {0}'.format(group_by)))
    right_nan = len(right_frame.query('{0} != {0}'.format(group_by)))
    left_nan = 1 if left_nan == 0 and right_nan != 0 else left_nan
    right_nan = 1 if right_nan == 0 and left_nan != 0 else right_nan

    sizes = [(left_groups[group_name] * right_groups[group_name]) for group_name in intersection]
    sizes += [left_groups[group_name] for group_name in left_sub_right]
    sizes += [right_groups[group_name] for group_name in right_sub_left]
    sizes += [left_nan * right_nan]

    return sum(sizes)
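As a worked check of the formula on a pair of hypothetical toy frames (the arithmetic follows the steps above, spelled out inline rather than calling the function):

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'key': ['a', 'a', 'b', np.nan], 'x': range(4)})
right = pd.DataFrame({'key': ['a', 'b', 'b', 'c'], 'y': range(4)})

lg = left.groupby('key').size()    # {'a': 2, 'b': 1} (NaN keys are excluded)
rg = right.groupby('key').size()   # {'a': 1, 'b': 2, 'c': 1}
shared = set(lg.index) & set(rg.index)

predicted = sum(lg[k] * rg[k] for k in shared)            # 2*1 + 1*2 = 4
predicted += sum(lg[k] for k in set(lg.index) - shared)   # 0 (no left-only keys)
predicted += sum(rg[k] for k in set(rg.index) - shared)   # 1 ('c' is right-only)
ln, rn = left['key'].isna().sum(), right['key'].isna().sum()
ln = 1 if ln == 0 and rn != 0 else ln
rn = 1 if rn == 0 and ln != 0 else rn
predicted += ln * rn                                      # 1*1 = 1

actual = len(left.merge(right, on='key', how='outer'))
print(predicted, actual)  # 6 6 -- the prediction matches the real merge
```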
Question 3
This method is computationally heavy and would be better written in Cython for performance.

Find sub-sequence of events from a stream of events

I am giving a miniature version of my issue below
I have 2 different sensors sending 1/0 values as a stream. I am able to consume the stream using Kafka and bring it to spark for processing. Please note a sample stream I have given below.
Time --------------> 1 2 3 4 5 6 7 8 9 10
Sensor Name --> A A B B B B A B A A
Sensor Value ---> 1 0 1 0 1 0 0 1 1 0
I want to identify a sub-sequence pattern occurring in this stream. For example, if A=0 and the very next value (based on time) in the stream is B=1, then I want to push an alert. In the stream above there are 2 such places (time 2 followed by time 3, and time 7 followed by time 8) where I want to give an alert. In general it will be like
“If a set of sensor-event combination happens within a time interval,
raise an alert”.
I am new to Spark and don't know Scala. I am currently doing my coding using Python.
My actual problem contains more sensors, and each sensor can have different value combinations, meaning my sub-sequences and event streams are more complex than this example.
I have tried a couple of options without success:
Window functions – can be useful for moving averages, cumulative sums, etc., but not for this use case.
Bringing Spark DataFrames/RDDs into local Python structures like lists and pandas DataFrames and doing the sub-sequencing there – it takes lots of shuffles, and the Spark event streams queue up after some iterations.
updateStateByKey – tried a couple of ways; I was not able to fully understand how this works and whether it is applicable for this use case.
Anyone looking for a solution to this question can use my solution:
1- To keep the events connected, you need to gather them with collect_list.
2- It's best to sort the collect_list; be aware that array_sort arranges the structs by their first field, so it's important to put the DateTime in that column.
3- I dropped the DateTime from the result, as in the example below.
4- Finally, you should concatenate all elements so you can search the result with string functions like contains to find your subsequence.
Assuming the columns are named Time and SensorValue (names with spaces won't parse inside expr), the aggregation looks like:
.agg(expr("array_join(transform(array_sort(collect_list(struct(Time, SensorValue))), a -> a.SensorValue), '')").alias("MySequence"))
After this agg function, you can use any regular expression or string function to detect your pattern.
check this link for more information about collect_list:
collect list
check this link for more information about sorting a collect_list:
sort a collect list
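The same idea can be sketched in plain Python (toy data taken from the question's stream; not Spark code): sort by time, concatenate the sensor/value pairs into one string, and count the pattern.

```python
# (time, sensor, value) events, as in the example stream
events = [
    (1, 'A', 1), (2, 'A', 0), (3, 'B', 1), (4, 'B', 0), (5, 'B', 1),
    (6, 'B', 0), (7, 'A', 0), (8, 'B', 1), (9, 'A', 1), (10, 'A', 0),
]

events.sort(key=lambda e: e[0])  # order by time, like array_sort on the first field
sequence = ''.join(f'{s}{v}' for _, s, v in events)  # "A1A0B1B0B1B0A0B1A1A0"

# "A=0 immediately followed by B=1" becomes the substring "A0B1"
alerts = sequence.count('A0B1')
print(alerts)  # 2 -- the two alert points from the question
```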
