Excel Get & Transform (Power Query) M Code Style and Performance - excel

I've created a few reasonably complex M queries and have started running into some severe performance issues. I'm wondering if has to do with how I sometime organize my code.
The issues I've been having are:
1) Power Query constantly uses all of several CPU cores, calculating something, even if I'm not waiting for a result.
2) In task manager I can sometimes see that the Power Query threads ("Microsoft.mashup.Container.NetFX40.exe") are nearly idle, while Excel.exe is using 100% of one core for tens of minutes - even though at most I'm looking values in a few parameter tables that don't contain more than a couple dozen cells.
3) Some steps take extremely long to calculate, even though the operations involved are trivial. For example, I have a list of 10 text values taken from an Excel table. This list appears as one of my query steps when I 'preview' it. Then I want to remove a single value, so the next step = List.RemoveItems(myList, {"val"}). It didn't compute after 30 minutes, even though I could see the list was correctly loaded in a previous step.
4) UI sometimes becomes unresponsive for several minutes after changing code. Can still right-click on Queries at left hand side to enter advanced editor, and click the red X at top right and choose to keep changes, but all the rest is unresponsive. Not greyed out, just unresponsive.
Anyway, I just wanted to ask if anyone's had similar trouble, and if anyone knows what triggers particularly bad performance in PQ.
I'll often use something like the following pattern to keep the total number of queries down while still being able to easily inspect individual steps:
let
ThisWB = Excel.CurrentWorkbook(),
CfgTbl = ThisWB{[Name="myCfgTbl"]}[Content],
x = aFn(CfgTbl),
y = bFn(CfgTbl),
output = [ThisWB=ThisWB, CfgTbl=CfgTbl, x=x, y=y]
in
output
Is this likely to lead to any issues? Just thought it might because at one point after waiting a very long time for a simple function result, I created a new query = Excel.CurrentWorkbook(){[Name="myCfgTbl"]}[Content], referenced it from the other query, and my result calculated immediately. No idea why.

It calculates previews. Turn off auto preview generation.
I messed with something like this in cases with formula-heavy tables.
The rest probably requires code examples, especially your last case.
BTW, is your version of power query (or Excel 2016) up-to-date?

Related

Invisible Delays between Spark Jobs

There are 4 major actions(jdbc write) with respect to application and few counts which in total takes around 4-5 minutes for completion.
But the total uptime of Application is around 12-13minutes.
I see there are certain jobs by name run at ThreadPoolExecutor.java : 1149. Just before this job being reflected on Spark UI, the invisible long delays occur.
I want to know what are the possible causes for these delays.
My application is reading 8-10 CSV files, 5-6 VIEWs from table. Number of joins are around 59, few groupBy with agg(sum) are there and 3 unions are there.
I am not able to reproduce the issue in DEV/UAT env since the data is not that much.
It's in the production where I get the app. executed run by my Manager.
If anyone has come across such delays in their job, please share your experience what could be the potential cause for this, currently I am working around the unions, i.e. caching the associated dataframes and calling count so as to get the benefit of cache in the coming union(yet to test, if union is the reason for delays)
Similarly, I tried the break the long chain of transformations with cache and count in between to break the long lineage.
The time reduced from initial 18 minutes to 12 minutes but the issue with invisible delays still persist.
Thanks in advance
I assume you don't have a CPU or IO heavy code between your spark jobs.
So it really sparks, 99% it is QueryPlaning delay.
You can use
spark.listenerManager.register(QueryExecutionListener) to check different metrics of query planing performance.

Pyspark grouby taking a very long time

I've searched around quite a bit, and asked my professor, but not sure what the problem is. I have a test data set of 20,000 points. I perform a groupby on this to get the sum and it takes maybe 30 seconds, which is reasonable since this is on a jupyter notebook.
However, when I put in the larger test dataframe of 1.5 Million data points, it takes hours. Everything else, even on the large dataset happens quickly (multiple conditional joins, etc). My professor thinks that one key occurs very frequently and this could cause an issue. But I cant even check this.
When I run
df = df.groupby('ID').count().sort('ID').desc()).show()
on the small data set it works very fast and says that one value has 25 points, while all others are below 5. So maybe there is a key explosion. However, on the larger dataframe, well ive been waiting half an hour now.
Any help would be appreciated, Thanks

Spark optimize "DataFrame.explain" / Catalyst

I've got a complex software which performs really complex SQL queries (well not queries, Spark plans you know). <-- The plans are dynamic, they change based on user input so I can't "cache" them.
I've got a phase in which spark takes 1.5-2min building the plan. Just to make sure, I added "logXXX", then explain(true), then "logYYY" and it takes 1minute 20 seconds for the explain to execute.
I've trying breaking the lineage but this seems to cause worse performance because the actual execution time becomes longer.
I can't parallelize driver work (already did, but this task can't be overlapped with anything else).
Any ideas/guide on how to improve the plan builder in Spark? (like for example, flags to try enabling/disabling and such...)
Is there a way to cache plans in Spark? (so I can run that in parallel and then execute it)
I've tried disabling all possible optimizer rules, setting min iterations to 30... but nothing seems to affect that concrete point :S
I tried disabling wholeStageCodegen and it helped a little, but the execution is longer so :).
Thanks!,
PS: The plan does contain multiple unions (<20, but quite complex plans inside each union) which are the cause for the time, but splitting them apart also affects execution time.
Just in case it helps someone (and if no-one provides more insights).
As I couldn't manage to reduce optimizer times (and well, not sure if reducing optimizer times would be good, as I may lose execution time).
One of the latest parts of my plan was scanning two big tables and getting one column from each one of them (using windows, aggregations etc...).
So I splitted my code in two parts:
1- The big plan (cached)
2- The small plan which scans and aggregates two big tables (cached)
And added one more part:
3- Left Join/enrich the big plan with the output of "2" (this takes like 10seconds, the dataset is not so big) and finish the remainder computation.
Now I launch both actions (1,2) in parallel (using driver-level parallelism/threads), cache the resulting DataFrames and then wait+ afterwards perform 3.
With this, while Spark driver (thread 1) is calculating the big plan (~2minutes) the executors will be executing part "2" (which has a small plan, but big scans/shuffles) and then both get "mixed" in like 10-15seconds, which a good improvement in execution time over the 1:30 I save while calculating the plan.
Comparing times:
Before I would have
1:30 Spark optimizing time + 6 minutes execution time
Now I have
max
(
1:30 Spark Optimizing time + 4 minutes execution time,
0:02 Spark Optimizing time + 2 minutes execution time
)
+ 15 seconds joining both parts
Not so much, but quite a few "expensive" people will be waiting for it to finish :)

Maximum number of activewidth object allowed

My program runs fine with limited data but when I put in all four databases activewidth won't work.
Database 1 has 29990 entries.
Database 2 has around 27000 entries.
Database 3 has roughly 17000 entries.
Database 4 has 430 entries.
Each database, grouped by its kind and includes business type, name, address, city, state, phone number, longitude, latitude, sales tax info, and daily hours of operation.
In total 12.1Mb of data.
With database 1 only in the program it works fine and I can scroll over a point on the map and activewidth will increase the size of the dot and the program will bring up the underlying data on the left hand side of the screen just like it is suppose to do.
Now that I have added in all four maps and can click them on and off separately, with only #1 turned on activewidth won't work and the underlying data won't show on the left. The points on the map are there and I can click through all four checkbuttons and turn on and off the points. I currently don't have the code in yet for the underlying data on database 2-4, just the ability to turn them on and off. Only activewidth isn't working now that I've got it so I can view the points for all 4 databases.
I decided to try commenting out all code for databases 2-4 and see what would happen and it went back to working fine again. Then I went and added in database 2 back into the mix and it went back to not working again. Then I tried database 2 only and it was activewidthing fine as long as database 1 was commented out. With database 1 active the activewidth was very slow to work/not working at all.
Is there a feasible maximum number of entries I can use. Hopefully not because I still have several more databases to get finished off and added into the program that will take the total number of entries up over 100K before all is said and done.
Nothing else makes sense since I'm just changing self.alocation to self.blocation, when I go to add in database 2-4. I'm just changing the identifier to show which database is being worked with and copying the rest of the code over between routines since everything is the same...just different business separated into appropriately grouped databases. It seems it's in the amount of data that is being used and not in the anything doing with the way the program is written.
I figured by splitting up the files, not only for my benefit but also to make the files smaller it would help alleviate the problem but so far it hasn't. Is there any other way to work around data overload?
self.alocation = []
for x in range(0, len(self.atype)):
pix1x = round((float(self.along[x])+(-self.longitudecenter+(self.p/2)))/(self.p/714),0)
pix1y = round((((self.p/2) + self.latitudecenter-(float(self.alat[x])))/(self.p/714)),0)
z = self.canvas.create_line(pix1x, pix1y, pix1x+4, pix1y+4, activewidth="10.0", fill = '', width = 5)
self.alocation.append((z,x))

Excel Data Table Repeating Results

I am working with a model that utilizes the Data Table functionality in Excel to process many scenarios and collect the output. However, I will occasionally find that the data table will repeat results for sequential scenarios, which would be impossible.
The data table is generated with the following VBA code:
Sheets("ProjSheet").Range(Range("DTAnchor").Offset(-1, -1),Range("DTAnchor").Offset(NumScns - 1, 1))
.Table ColumnInput:=Range("CurrScen")
The calculation mode is semiautomatic throughout the code, and the data table is updated using Calculate.
The erroneous output might look something like this:
Scn | Result
------------
1 | 341.5
2 | 0
3 | 861.4
4 | 861.4 <- Wrong!
5 | 861.4 <- Wrong!
6 | 10.5
7 | 64.9
...
The workbook is not small at ~22MB, and the data table typically takes a minute or so to churn through 1,000 scenarios.
I can verify the result of any particular scenario by running it individually, so we know that the results in this case should not be identical.
These models are typically saved with different model settings and run simultaneously in different instances of Excel. They are also generally run on remote computers where multiple users could log in, kick off some models, and then disconnect and leave them running in the background. These computers have 8 cores so they could run up to 8 models simultaneously. Sometimes, the models are opened and kicked off in different Excel instances using a macro that writes a VB Script which is then kicked off by a .bat file.
I've had difficulty replicating the error reliably. One theory I have is that when more Excel models are running than there are cores in the remote computer, Excel freezes up during the data table process and cannot always complete its calculation for some scenarios, and therefore it spits out the same results it currently has stored. I don't know enough about Excel's inner workings to decide if this makes sense, though.
Has anyone else come across data tables that repeat results before? If you know more about the mechanics of data tables, do you know what might cause this error and how to prevent it in the future?
Let me know if you need any more information about the error or things that I've tried to identify the cause. Thanks!

Resources