PySpark groupby taking a very long time - apache-spark

I've searched around quite a bit and asked my professor, but I'm still not sure what the problem is. I have a test dataset of 20,000 points. A groupby to get the sum takes maybe 30 seconds on it, which is reasonable since this runs in a Jupyter notebook.
However, when I use the larger test dataframe of 1.5 million data points, it takes hours. Everything else happens quickly, even on the large dataset (multiple conditional joins, etc.). My professor thinks that one key occurs very frequently and that this could cause the problem, but I can't even check this.
When I run
df.groupby('ID').count().sort('ID', ascending=False).show()
on the small dataset, it runs very fast and shows that one value has 25 points while all others have fewer than 5. So maybe there is a key explosion. On the larger dataframe, however, I have now been waiting half an hour.
Any help would be appreciated. Thanks.
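One way to test the professor's hypothesis without waiting for the full groupby is to count keys on a small random sample first. The sketch below assumes df is a PySpark DataFrame with an ID column; the helper names top_key_counts and heavy_keys, the 1% sample fraction, and the factor of 5 are illustrative assumptions, not anything from the original post. Because Spark evaluates lazily, the sample still triggers whatever joins produced df, so if this is also slow, the groupby itself may not be the bottleneck.

```python
from statistics import median


def top_key_counts(df, key="ID", fraction=0.01, n=20, seed=42):
    """Return (key, count) pairs for the n most frequent keys in a random sample.

    `df` is assumed to be a pyspark.sql.DataFrame. Only DataFrame methods
    are called, so this function needs no pyspark import of its own.
    """
    rows = (
        df.sample(fraction=fraction, seed=seed)  # work on a small random sample
        .groupBy(key)
        .count()                                 # adds a column named "count"
        .orderBy("count", ascending=False)       # most frequent keys first
        .limit(n)
        .collect()
    )
    return [(row[key], row["count"]) for row in rows]


def heavy_keys(key_counts, factor=5):
    """Return keys whose count exceeds `factor` times the median listed count.

    The threshold is a hypothetical rule of thumb for spotting skew.
    """
    if not key_counts:
        return []
    typical = median(count for _, count in key_counts)
    return [key for key, count in key_counts if count > factor * typical]
```

With the numbers from the small dataset (one key with 25 rows, the rest below 5), heavy_keys would flag only the 25-row key. Calling heavy_keys(top_key_counts(df)) on the large dataframe would show whether a similar outlier exists there.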

Related

PySpark groupby strange behaviour

I am querying a large (2 trillion records) parquet file using PySpark, partitioned by two columns, month and day.
If I run a simple query such as:
SELECT month, day, count(*) FROM mytable
WHERE month >= 201801 and month< 202301 -- two years data
GROUP BY month, day
ORDER BY month, day
the query executes in 5 minutes or less. Super good performance!
If I remove the WHERE condition, it reads the whole data lake (4 years). This query takes 1.5 hours to execute.
This behaviour is far from normal. I guess it might be related to the large amount of data being processed on the worker nodes, leading to GC or shuffle pressure, but that is just a guess.
How can I debug this situation?
My understanding is that Spark should be clever enough to compute per partition (since it is a distributed environment) and take around 5 * 2 minutes (double the years), not so much longer.
Edit1: Adding information from SparkUI
I will put the screenshots of the two runs: 4 years of data (1.7 hours) and 3 years of data (7.5 min). The 4-year run always comes first.
[Screenshots: General overview; Job page; Stage 1 (heavy stage); Stage 2; SQL]
Edit 2 - New findings - Scheduler delay
In the heavy task, I found a scheduler delay.
If this is the case, what is the approach?
Thanks a lot!
I found what the problem was.
By increasing the memory and cores (not really important) of the driver, the problem was solved.
How did I reach this conclusion?
First, I knew my data was not very skewed (as pointed out by @samkart and @Leonid Vasilev), but I checked again.
Second, all metrics were very similar to each other, without large differences, so it had to be something else.
Third and lastly, I opened the stage Event Timeline and found a very interesting issue; see Edit 2.
After further investigating why my scheduler was so delayed, I didn't find the real reason, but this sentence gave me the hint that the problem was in the driver:
Scheduler delay (blue) is the time spent waiting. There is something that the executors are waiting for - often this is waiting for the driver that controls and coordinates the jobs.
source: enter link description here
In that post, the author also mentions something very important that I wish to add:
See all that red and blue? This is a sure sign that something is up. What we really want to see is lots of green - the proportion of time spent doing work - I mean real work - the part where Spark does the number crunching.
TL;DR:
The biggest problem came from scheduler delay, which is closely related to the driver. Increasing the driver's memory (and vCPUs) solved the issue.
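As a rough illustration of the fix described above, driver resources can be raised in spark-defaults.conf (or with the equivalent spark-submit flags --driver-memory and --driver-cores). The values below are placeholders, not the ones used in this case, and would need tuning for the actual cluster. Note that spark.driver.cores only takes effect in cluster mode.

```properties
# Hypothetical values; adjust to what the driver node can actually provide.
spark.driver.memory  8g
spark.driver.cores   4
```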

Python pandas - processing

I appreciate any help as I'm a bit desperate!
I'm trying to filter a dataframe of 110k rows and 46 columns and apply some summarization. It's a whole pipeline that extracts the data from Postgres, stores it in a DataFrame, and goes through a series of filters, everything in memory.
My first execution runs smoothly and everything goes as expected, but as soon as I execute the script again (after a short period of time), I find duplicates or missing IDs when filtering.
Let's put it this way:
Here, sometimes the last two transaction_id (from top to bottom) get duplicated as in the picture, and sometimes I get only the first two transaction_id (36739792, 36740873).
The source is fine no questions about it.
After some time (at least 5 min) I execute the script and get the expected results; it just works fine, which in this case means only the first four transaction_id values. The issue appears when re-executing within a short period.
When debugging line by line, I actually get everything right, which makes me think the logic is not the issue.
Could this be a memory issue? Maybe the script somehow keeps holding data even after execution?
I'm using Python 3.9, pandas and VS Code.
Again, I appreciate any help. Regards.

Invisible Delays between Spark Jobs

The application has 4 major actions (JDBC writes) and a few counts, which in total take around 4-5 minutes to complete.
But the total uptime of the application is around 12-13 minutes.
I see certain jobs named run at ThreadPoolExecutor.java:1149. The long invisible delays occur just before such a job appears in the Spark UI.
I want to know the possible causes of these delays.
My application reads 8-10 CSV files and 5-6 views from tables. There are around 59 joins, a few groupBy operations with agg(sum), and 3 unions.
I am not able to reproduce the issue in the DEV/UAT environments since there is not that much data there.
It only happens in production, where the application is run by my manager.
If anyone has come across such delays in their jobs, please share what the potential cause could be. Currently I am working on the unions, i.e. caching the associated dataframes and calling count so that the following union benefits from the cache (I have yet to test whether the unions are the reason for the delays).
Similarly, I tried to break the long chain of transformations with cache and count in between, to shorten the lineage.
The time dropped from the initial 18 minutes to 12 minutes, but the invisible delays still persist.
Thanks in advance.
I assume you don't have CPU- or IO-heavy code between your Spark jobs.
So it really is Spark, and 99% it is query planning delay.
You can use
spark.listenerManager.register(QueryExecutionListener) to check different metrics of query planning performance.

How to avoid selecting too many data

What we are doing is pretty much:
1) putting time series data into Cassandra
2) running a Spark aggregation job every hour and putting the aggregated data back into Cassandra
One of the problems we found: if the hourly job fails several times in a row, for example for 1 AM ~ 2 AM, 2 AM ~ 3 AM, 3 AM ~ 4 AM (or more), then the next run aggregates the data from 1 AM to 5 AM (the last success time is recorded in Cassandra). The issue arises at this point, because it is now 4 (or more) hours of data, far more than one hour of data, and selecting that much data from Cassandra into a dataframe results in an OutOfMemory exception.
Well, adding memory to the Spark executors is one way of fixing this. However, considering it's an edge case, I'm wondering if there is any mature pattern or architecture to deal with this issue.
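One possible pattern, sketched here under assumptions rather than taken from the original post, is to split the backlog into fixed one-hour windows and advance the recorded success time after each window, so a catch-up run never loads more than one hour of data at once. The names hourly_windows and catch_up are made up for illustration, and aggregate and save_checkpoint are hypothetical callables standing in for the Spark job and the Cassandra write of the last success time.

```python
from datetime import datetime, timedelta


def hourly_windows(last_success, now, step=timedelta(hours=1)):
    """Yield (start, end) pairs covering [last_success, now) in complete steps.

    A trailing partial window is skipped and left for the next run.
    """
    start = last_success
    while start + step <= now:
        yield start, start + step
        start += step


def catch_up(last_success, now, aggregate, save_checkpoint):
    """Process the backlog one window at a time and return the window count.

    `aggregate(start, end)` and `save_checkpoint(end)` are hypothetical
    callables; the checkpoint is saved only after a window has succeeded.
    """
    done = 0
    for start, end in hourly_windows(last_success, now):
        aggregate(start, end)   # assumed to raise on failure
        save_checkpoint(end)    # the next run resumes from here
        done += 1
    return done
```

If one window fails partway through a catch-up run, the windows already processed stay recorded, and the next run resumes from the last saved checkpoint instead of starting over from 1 AM.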

Excel Get & Transform (Power Query) M Code Style and Performance

I've created a few reasonably complex M queries and have started running into some severe performance issues. I'm wondering if it has to do with how I sometimes organize my code.
The issues I've been having are:
1) Power Query constantly uses all of several CPU cores, calculating something, even if I'm not waiting for a result.
2) In Task Manager I can sometimes see that the Power Query threads ("Microsoft.mashup.Container.NetFX40.exe") are nearly idle, while Excel.exe is using 100% of one core for tens of minutes - even though at most I'm looking up values in a few parameter tables that don't contain more than a couple dozen cells.
3) Some steps take extremely long to calculate, even though the operations involved are trivial. For example, I have a list of 10 text values taken from an Excel table. This list appears as one of my query steps when I 'preview' it. Then I want to remove a single value, so the next step = List.RemoveItems(myList, {"val"}). It didn't compute after 30 minutes, even though I could see the list was correctly loaded in a previous step.
4) UI sometimes becomes unresponsive for several minutes after changing code. Can still right-click on Queries at left hand side to enter advanced editor, and click the red X at top right and choose to keep changes, but all the rest is unresponsive. Not greyed out, just unresponsive.
Anyway, I just wanted to ask if anyone's had similar trouble, and if anyone knows what triggers particularly bad performance in PQ.
I'll often use something like the following pattern to keep the total number of queries down while still being able to easily inspect individual steps:
let
    ThisWB = Excel.CurrentWorkbook(),
    CfgTbl = ThisWB{[Name="myCfgTbl"]}[Content],
    x = aFn(CfgTbl),
    y = bFn(CfgTbl),
    output = [ThisWB=ThisWB, CfgTbl=CfgTbl, x=x, y=y]
in
    output
Is this likely to lead to any issues? Just thought it might because at one point after waiting a very long time for a simple function result, I created a new query = Excel.CurrentWorkbook(){[Name="myCfgTbl"]}[Content], referenced it from the other query, and my result calculated immediately. No idea why.
It calculates previews. Turn off auto preview generation.
I messed with something like this in cases with formula-heavy tables.
The rest probably requires code examples, especially your last case.
BTW, is your version of power query (or Excel 2016) up-to-date?
