pyspark code in loop acts different from single executions - apache-spark

The following code marks some jobs as 'skipped' after a few times through the loop, and the df that myfunc reads on that iteration comes back with 0 rows (but, surprisingly, with the correct number of columns):
for i in range(len(dates)-1):
    date1, date2 = dates[i], dates[i+1]
    params['file_path'] = ['s3a://path/to/files{}.json'.format(date1), 's3a://path/to/files{}.json'.format(date2)]
    df = myfunc(params)
However, when I run it 'by hand' several times, all is well: no skipped jobs, and the dfs come back full.
date1,date2=dates[0],dates[1]
params['file_path'] = ['s3a://path/to/files{}.json'.format(date1),'s3a://path/to/files{}.json'.format(date2)]
df = myfunc(params)
The above runs fine, and it is also OK when I change to date1,date2=dates[1],dates[2], and so on. There aren't very many files, and I've already finished them all by hand as above, but I would like to know what's going on. The filenames generated in the for loop work fine when I copy-paste them into my params. I am far from a Spark expert, so let me know if there's something obvious to check.

Without knowing the code of myfunc, I can only guess at your problem.
The 0-rows issue probably originates from the assignment df = myfunc(params), which overwrites df on every iteration and does not append to the previous df. For the last two dates it is probably just empty.
Skipped jobs usually come from caching. Are you using caching anywhere?
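If the intent is to keep the data from every date pair rather than only the last one, a minimal sketch (assuming myfunc returns a DataFrame with the same schema on every call) would collect each result and union them instead of reassigning df:

from functools import reduce

dfs = []
for i in range(len(dates) - 1):
    date1, date2 = dates[i], dates[i + 1]
    params['file_path'] = ['s3a://path/to/files{}.json'.format(date1),
                           's3a://path/to/files{}.json'.format(date2)]
    dfs.append(myfunc(params))  # keep each iteration's result instead of overwriting df

# combine the per-iteration DataFrames into a single DataFrame
df_all = reduce(lambda a, b: a.unionByName(b), dfs)

unionByName is used here so the union matches columns by name rather than by position.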

Related

Python pandas - processing

I appreciate any help, as I'm a bit desperate!
I'm trying to filter a 110k-row, 46-column dataframe and apply some summarization. It's a whole pipeline that extracts the data from Postgres, stores it in a DF, and goes through a series of filters, everything in memory.
My first execution runs smoothly, everything goes as expected, but as soon as I execute the script again (within a short period of time), I'm finding duplicates or missing ids when filtering.
Let's put it this way:
Here, sometimes the last two transaction_id values (from top to bottom) get duplicated as in the picture, and sometimes I get only the first two transaction_id values (36739792, 36740873).
The source is fine, no questions about it.
After some time (5 min at least) I execute the script and get the expected results; it just works fine, which in this case would be only the first four transaction_id values. The issue comes when re-executing within a short period.
When debugging line by line, I actually get everything right, so this makes me think the logic is not the issue.
Could this be a memory issue? Maybe somehow the script just keeps holding data even after execution?
I'm using Python 3.9, pandas and VS Code.
Again, I appreciate any help. Regards.

How do I print statements inside a function that has been applied to a Spark RDD?

I'm applying a function to a Spark RDD, like so:
data_2 = sqlContext.createDataFrame(pandas_df,data_schema)
data_3 = data_2.rdd.map(lambda x: parallelized_func(x, *args)).collect()
Now, the function parallelized_func looks something like this:
def parallelized_func(a, b, c):
    #### FUNCTION BODY ####
    print("unique identifier for each row in pandas_df")
    return {'df1': df1, 'df2': df2}
The issue I'm facing is this: when I run the "data_3 = ..." statement above in a Databricks notebook, I want the unique identifier that I'm printing inside parallelized_func to show up somewhere, on some console, because that would make it easier to debug when there's an issue with any row in the pandas_df dataframe.
I tried checking the stdout and stderr consoles for every executor running the jobs, but there's always a whole load of other statements occupying most of the console (all Spark statements related to the various tasks being executed, I assume). I can sometimes find my print statement in this vast sea of other statements, but it's a really inefficient and ineffective way to debug.
Is there a better way I can go about printing a statement like this? Or a better way of finding it? Can I for instance suppress all other execution-related statements that Spark keeps throwing up on the console?
Attaching a snapshot of the other statements that get printed on the console.
Print is not really a good solution because, as you said, there are tons of logs that Spark writes (and printing for debugging is not great in general).
You can set up a logger that writes your logs somewhere else (that way only your logs end up there), such as NFS or wherever else you can write to (even locally on the executors, and then check there).
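For example, a minimal sketch of that idea, assuming you are free to write to a local path on each executor (the logger name and file path below are just placeholders):

import logging

def get_row_logger():
    # configured lazily in whatever process calls it; on an executor it writes to a local file there
    logger = logging.getLogger("row_debug")                   # placeholder logger name
    if not logger.handlers:
        handler = logging.FileHandler("/tmp/row_debug.log")   # placeholder local path
        handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger

# inside parallelized_func, replace the print with:
#     get_row_logger().info("unique identifier for each row in pandas_df")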
If you are trying to find the "corrupted" rows, then, just for debugging, filter only the corrupted rows and collect them to the driver; then you can check the rows locally in the notebook.
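And a minimal sketch of that "collect only the suspect rows" idea, where is_corrupted is a hypothetical predicate standing in for whatever check identifies a bad row:

# collect only the suspect rows back to the driver for local inspection
suspect_rows = (data_2.rdd
                      .filter(lambda x: is_corrupted(x))  # is_corrupted: your own check
                      .collect())
for row in suspect_rows:
    print(row)  # runs on the driver, so it shows up directly in the notebook output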

Spark Error - Max iterations (100) reached for batch Resolution

I am working with Spark SQL, where I need to find the diff between two large CSVs.
The diff should give:
Inserted rows or new records // comparing only ids
Changed rows (not including inserted ones) - comparing all column values
Deleted rows // comparing only ids
Spark 2.4.4 + Java
I am using Databricks to read/write the CSVs.
Dataset<Row> insertedDf = newDf_temp.join(oldDf_temp,oldDf_temp.col(key)
.equalTo(newDf_temp.col(key)),"left_anti");
Long insertedCount = insertedDf.count();
logger.info("Inserted File Count == "+insertedCount);
Dataset<Row> deletedDf = oldDf_temp.join(newDf_temp,oldDf_temp.col(key)
.equalTo(newDf_temp.col(key)),"left_anti")
.select(oldDf_temp.col(key));
Long deletedCount = deletedDf.count();
logger.info("deleted File Count == "+deletedCount);
Dataset<Row> changedDf = newDf_temp.exceptAll(oldDf_temp); // This gives rows (New +changed Records)
Dataset<Row> changedDfTemp = changedDf.join(insertedDf, changedDf.col(key)
.equalTo(insertedDf.col(key)),"left_anti"); // This gives only changed record
Long changedCount = changedDfTemp.count();
logger.info("Changed File Count == "+changedCount);
This works well for CSVs with up to 50 or so columns.
The above code fails even for a single row of a CSV with 300+ columns, so I am sure this is not a file-size problem.
If I have a CSV with 300+ columns, it fails with the exception:
Max iterations (100) reached for batch Resolution – Spark Error
If I set the property below in Spark, it works:
sparkConf.set("spark.sql.optimizer.maxIterations", "500");
But my question is: why do I have to set this?
Is there something wrong that I am doing?
Or is this behaviour expected for CSVs that have many columns?
Can I optimize it in any way to handle CSVs with many columns?
The issue you are running into is related to how Spark takes the instructions you give it and transforms them into the actual things it's going to do. It first needs to understand your instructions by running the analyzer, then it tries to improve them by running its optimizer. The setting appears to apply to both.
Specifically, your code is failing during a step in the analyzer. The analyzer is responsible for figuring out, when you refer to things, what things you are actually referring to, for example mapping function names to implementations, or mapping column names across renames and different transforms. It does this in multiple passes, resolving additional things each pass, then checking again to see if it can resolve more.
I think what is happening in your case is that each pass probably resolves one column, but 100 passes aren't enough to resolve all of the columns. By increasing the limit you are giving it enough passes to get entirely through your plan. This is definitely a red flag for a potential performance issue, but if your code is working then you can probably just increase the value and not worry about it.
If it isn't working, then you will probably need to do something to reduce the number of columns used in your plan, such as combining all the columns into one encoded string column as the key. You might also benefit from checkpointing the data before doing the join so you can shorten your plan.
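For example, a minimal sketch of the checkpointing idea, shown in PySpark rather than Java for brevity (spark is assumed to be an existing SparkSession, the checkpoint directory is a placeholder, and the variable names mirror those in the question):

# a checkpoint materializes the data and truncates the logical plan,
# so the analyzer/optimizer has far less to resolve before the join
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")  # placeholder location

new_cp = newDf_temp.checkpoint()
old_cp = oldDf_temp.checkpoint()

inserted_df = new_cp.join(old_cp, new_cp[key] == old_cp[key], "left_anti")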
EDIT:
Also, I would refactor your above code so you could do it all with only one join. This should be a lot faster, and might solve your other problem.
Each join leads to a shuffle (data being sent between compute nodes), which adds time to your job. Instead of computing adds, deletes and changes independently, you can do them all at once. Something like the code below; it's in Scala pseudo-code because I'm more familiar with that than with the Java APIs.
import org.apache.spark.sql.functions._
import spark.implicits._  // needed for the $"col" column syntax

var oldDf = ..
var newDf = ..
val changeCols = newDf.columns.filter(_ != "id").map(col)
// Make the columns you want to compare into a single struct column for easier comparison
newDf = newDf.select($"id", struct(changeCols: _*) as "compare_new")
oldDf = oldDf.select($"id", struct(changeCols: _*) as "compare_old")
// Outer join on id
val combined = oldDf.join(newDf, Seq("id"), "outer")
// Figure out the status of each row based upon the presence of old/new:
// IF the old side is missing, it must be an ADD
// IF the new side is missing, it must be a DELETE
// IF both sides are present but different, it's a CHANGE
// ELSE it's NOCHANGE
val status = when($"compare_old".isNull, lit("add")).
  when($"compare_new".isNull, lit("delete")).
  when($"compare_new" =!= $"compare_old", lit("change")).
  otherwise(lit("nochange"))
val labeled = combined.select($"id", status as "status")
At this point we have every id labeled ADD/DELETE/CHANGE/NOCHANGE, so we can just do a groupBy/count. This aggregation can be done almost entirely map-side, so it will be a lot faster than a join.
labeled.groupBy("status").count.show

Excel Get & Transform (Power Query) M Code Style and Performance

I've created a few reasonably complex M queries and have started running into some severe performance issues. I'm wondering if it has to do with how I sometimes organize my code.
The issues I've been having are:
1) Power Query constantly uses all of several CPU cores, calculating something, even if I'm not waiting for a result.
2) In Task Manager I can sometimes see that the Power Query threads ("Microsoft.mashup.Container.NetFX40.exe") are nearly idle, while Excel.exe is using 100% of one core for tens of minutes, even though at most I'm looking up values in a few parameter tables that don't contain more than a couple dozen cells.
3) Some steps take extremely long to calculate, even though the operations involved are trivial. For example, I have a list of 10 text values taken from an Excel table. This list appears as one of my query steps when I 'preview' it. Then I want to remove a single value, so the next step = List.RemoveItems(myList, {"val"}). It still hadn't computed after 30 minutes, even though I could see the list was correctly loaded in a previous step.
4) The UI sometimes becomes unresponsive for several minutes after changing code. I can still right-click on Queries at the left-hand side to enter the advanced editor, and click the red X at top right and choose to keep changes, but everything else is unresponsive. Not greyed out, just unresponsive.
Anyway, I just wanted to ask if anyone's had similar trouble, and if anyone knows what triggers particularly bad performance in PQ.
I'll often use something like the following pattern to keep the total number of queries down while still being able to easily inspect individual steps:
let
    ThisWB = Excel.CurrentWorkbook(),
    CfgTbl = ThisWB{[Name="myCfgTbl"]}[Content],
    x = aFn(CfgTbl),
    y = bFn(CfgTbl),
    output = [ThisWB=ThisWB, CfgTbl=CfgTbl, x=x, y=y]
in
    output
Is this likely to lead to any issues? I just thought it might because at one point, after waiting a very long time for a simple function result, I created a new query = Excel.CurrentWorkbook(){[Name="myCfgTbl"]}[Content], referenced it from the other query, and my result calculated immediately. No idea why.
It calculates previews. Turn off automatic preview generation.
I fiddled with something like this in cases with formula-heavy tables.
The rest probably requires code examples, especially your last case.
BTW, is your version of Power Query (or Excel 2016) up to date?

How to design spark program to process 300 most recent files?

Situation
New small files come in periodically. I need to do a calculation on the most recent 300 files, so basically there is a window moving forward. The size of the window is 300, and I need to do the calculation on the window.
Something important to know is that this is not Spark stream computing, because in Spark Streaming the unit/scope of the window is time. Here the unit/scope is the number of files.
Solution 1
I will maintain a dict whose size is 300. Each time a new file comes in, I turn it into a Spark DataFrame and put it into the dict, then make sure the oldest file is popped out if the length of the dict goes over 300.
After this I merge all the DataFrames in the dict into a bigger one and do the calculation.
The above process runs in a loop; every time a new file comes in, we go through the loop.
pseudo code for solution 1
for file in file_list:
    data_frame = get_data_frame(file)
    my_dict[timestamp] = data_frame

    for timestamp in my_dict.keys():
        if timestamp older than 24 hours:
            # not only unpersist, but also delete to make sure the memory is released
            my_dict[timestamp].unpersist()
            del my_dict[timestamp]

    # pop one data frame from the dict
    big_data_frame = my_dict.popitem()
    for timestamp in my_dict.keys():
        df = my_dict.get(timestamp)
        big_data_frame = big_data_frame.unionAll(df)
    # Then we run SQL on the big_data_frame to get the report
Problem with solution 1
It always hits Out of memory or "GC overhead limit exceeded".
Questions
Do you see anything inappropriate in solution 1?
Is there a better solution?
Is this the right kind of situation in which to use Spark?
One observation: you probably don't want to use popitem, since the keys of a Python dictionary are not sorted, so you can't guarantee that you're popping the earliest item. Instead, I would recreate the dictionary each time from a sorted list of timestamps. Assuming your filenames are just timestamps:
my_dict = {file:get_dataframe(file) for file in sorted(file_list)[-300:]}
I'm not sure this will fix your problem; can you paste the full stack trace of your error into the question? It's possible that your problem is happening in the Spark merge/join (not included in your question).
My suggestion for this is streaming, but not with respect to time; I mean you will still have a window and sliding interval set, but say it is 60 secs.
So every 60 secs you get a DStream of the file contents, in 'x' partitions. These 'x' partitions represent the files you drop onto HDFS or the file system.
This way you can keep track of how many files/partitions have been read; if there are fewer than 300, wait until they reach 300. Once the count hits 300, you can start processing.
If it's possible to keep track of the most recent files, or to just discover them once in a while, then I'd suggest doing something like
sc.textFile(','.join(files));
or, if it's possible to identify a specific pattern to get those 300 files, then
sc.textFile("*pattern*");
It's even possible to have comma-separated patterns, but it might happen that files matching more than one pattern would be read more than once.
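Putting the two ideas together, a minimal sketch (assuming the files are JSON, that filename order matches time order, and that spark is an existing SparkSession; spark.read.json also accepts a list of paths) that reads the 300 most recent files in one pass instead of building and union-ing 300 separate DataFrames:

# pick the 300 most recent files (assumes lexicographic name order matches time order)
recent_files = sorted(file_list)[-300:]

# read them all as one DataFrame in a single pass
big_data_frame = spark.read.json(recent_files)

# run the report query over the combined data
big_data_frame.createOrReplaceTempView("recent")
report = spark.sql("SELECT COUNT(*) AS row_count FROM recent")  # stand-in for the real report query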
