I am working with a model that uses Excel's Data Table functionality to process many scenarios and collect the output. However, I occasionally find that the data table repeats results for sequential scenarios, which should be impossible.
The data table is generated with the following VBA code:
Sheets("ProjSheet").Range(Range("DTAnchor").Offset(-1, -1), Range("DTAnchor").Offset(NumScns - 1, 1)) _
    .Table ColumnInput:=Range("CurrScen")
The calculation mode is semiautomatic throughout the code, and the data table is updated using Calculate.
The erroneous output might look something like this:
Scn | Result
------------
1 | 341.5
2 | 0
3 | 861.4
4 | 861.4 <- Wrong!
5 | 861.4 <- Wrong!
6 | 10.5
7 | 64.9
...
The workbook is not small at ~22MB, and the data table typically takes a minute or so to churn through 1,000 scenarios.
I can verify the result of any particular scenario by running it individually, so we know that the results in this case should not be identical.
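One way to spot suspect rows before re-running scenarios individually is to scan the collected output for runs of identical consecutive results. Below is a minimal sketch in Python, assuming the results have been exported from the workbook as a plain list. The function name is illustrative and the sample values are the ones from the table above. A repeated value is only a hint, since two scenarios could legitimately produce the same result.

```python
def find_repeated_runs(results, min_run=2):
    """Return (start_index, length) for each run of identical consecutive values."""
    runs = []
    start = 0
    for i in range(1, len(results) + 1):
        # A run ends at the end of the list or when the value changes.
        if i == len(results) or results[i] != results[start]:
            if i - start >= min_run:
                runs.append((start, i - start))
            start = i
    return runs


# Sample values from the erroneous output; scenario numbers are index + 1.
results = [341.5, 0, 861.4, 861.4, 861.4, 10.5, 64.9]
for start, length in find_repeated_runs(results):
    print(f"Scenarios {start + 1}-{start + length} share the value {results[start]}")
```

Exact equality is used on purpose here: a stale data table cell would carry over the previous value unchanged, not an approximately equal one.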
These models are typically saved with different model settings and run simultaneously in different instances of Excel. They are also generally run on remote computers where multiple users could log in, kick off some models, and then disconnect and leave them running in the background. These computers have 8 cores so they could run up to 8 models simultaneously. Sometimes, the models are opened and kicked off in different Excel instances using a macro that writes a VB Script which is then kicked off by a .bat file.
I've had difficulty replicating the error reliably. One theory I have is that when more Excel models are running than there are cores in the remote computer, Excel freezes up during the data table process and cannot always complete its calculation for some scenarios, and therefore it spits out the same results it currently has stored. I don't know enough about Excel's inner workings to decide if this makes sense, though.
Has anyone else come across data tables that repeat results before? If you know more about the mechanics of data tables, do you know what might cause this error and how to prevent it in the future?
Let me know if you need any more information about the error or things that I've tried to identify the cause. Thanks!
If I use two dimensions in a crosstab, the report starts to process (no errors) but never finishes loading (I gave up after 30 minutes). When I use either dimension individually, the report is ready after 30 to 60 seconds. Both dimensions have a 1..1 (dimension) to 1..n (fact table) relationship modelled in Cognos Framework Manager. If I run the generated MDX/SQL directly in the database, the query runs indefinitely as well...
Does anyone have an idea why this could happen?
Any help appreciated!
Share the generated MDX/SQL with your data architect and/or database administrator. They should be able to tell you what is wrong with the relationships in your model.
My program runs fine with limited data but when I put in all four databases activewidth won't work.
Database 1 has 29990 entries.
Database 2 has around 27000 entries.
Database 3 has roughly 17000 entries.
Database 4 has 430 entries.
Each database is grouped by kind and includes business type, name, address, city, state, phone number, longitude, latitude, sales tax info, and daily hours of operation.
In total, 12.1 MB of data.
With only database 1 in the program, it works fine: I can hover over a point on the map, activewidth increases the size of the dot, and the program brings up the underlying data on the left-hand side of the screen, just like it is supposed to.
Now that I have added in all four maps and can click them on and off separately, with only #1 turned on activewidth won't work and the underlying data won't show on the left. The points on the map are there and I can click through all four checkbuttons and turn on and off the points. I currently don't have the code in yet for the underlying data on database 2-4, just the ability to turn them on and off. Only activewidth isn't working now that I've got it so I can view the points for all 4 databases.
I decided to try commenting out all the code for databases 2-4 to see what would happen, and it went back to working fine. Then I added database 2 back into the mix, and it stopped working again. Then I tried database 2 only, and activewidth worked fine as long as database 1 was commented out. With database 1 active, activewidth was very slow or did not work at all.
Is there a feasible maximum number of entries I can use? Hopefully not, because I still have several more databases to finish and add to the program, which will take the total number of entries over 100K before all is said and done.
Nothing else makes sense, since I'm just changing self.alocation to self.blocation when I add databases 2-4. I'm only changing the identifier to show which database is being worked with and copying the rest of the code between routines, since everything is the same, just different businesses separated into appropriately grouped databases. The problem seems to lie in the amount of data being used and not in the way the program is written.
I figured by splitting up the files, not only for my benefit but also to make the files smaller it would help alleviate the problem but so far it hasn't. Is there any other way to work around data overload?
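# Canvas item ids paired with the index of their database row
self.alocation = []
for x in range(len(self.atype)):
    # Convert longitude/latitude to canvas pixel coordinates
    pix1x = round((float(self.along[x]) + (-self.longitudecenter + (self.p / 2))) / (self.p / 714), 0)
    pix1y = round(((self.p / 2) + self.latitudecenter - float(self.alat[x])) / (self.p / 714), 0)
    # Short line used as a map point; activewidth enlarges it on hover
    z = self.canvas.create_line(pix1x, pix1y, pix1x + 4, pix1y + 4, activewidth="10.0", fill='', width=5)
    self.alocation.append((z, x))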
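The coordinate conversion in the loop above can be pulled out into a standalone function, which makes it possible to check the arithmetic without a canvas or any data loaded. This is only a sketch: the function and parameter names are hypothetical, and reading self.p as the visible map span in degrees and 714 as the canvas size in pixels is an assumption inferred from the formula, not something the post states.

```python
def to_pixel(lon, lat, lon_center, lat_center, span, size=714):
    """Map longitude/latitude to canvas pixels, mirroring the loop above.

    span: assumed to be the visible map width in degrees (self.p in the post).
    size: assumed to be the canvas size in pixels (714 in the post).
    """
    scale = span / size  # degrees per pixel
    x = round((lon + (-lon_center + span / 2)) / scale, 0)
    y = round((span / 2 + lat_center - lat) / scale, 0)
    return x, y


# A point at the map centre should land in the middle of the canvas.
print(to_pixel(-90.0, 40.0, lon_center=-90.0, lat_center=40.0, span=10.0))
```

With the conversion isolated like this, the same function could be reused for every database instead of copying the formula between routines.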
I have data in the format { host | metric | value | time-stamp }. We have hosts all around the world reporting metrics.
I'm a little confused about using window operations (say, 1 hour) to process data like this.
Can I tell my window when to start, or does it just start when the application starts? I want to ensure I'm aggregating all data from hour 11 of the day, for example. If my window starts at 10:50, I'll just get 10:50-11:50 and miss 10 minutes.
Even if the window is perfect, data may arrive late.
How do people handle this kind of issue? Do they make windows far bigger than needed and just grab the data they care about on every batch cycle (kind of sliding)?
In the past, I worked on a large-scale IoT platform and solved that problem by treating each window as only a partial calculation. I modeled the backend (Cassandra) to receive more than one record for each window. The actual value of any given window would be the sum of all (potentially partial) records found for that window.
So, a perfect window would be 1 record, a split window would be 2 records, late-arrivals are naturally supported but only accepted up to a certain 'age' threshold. Reconciliation was done at read time. As this platform was orders of magnitude heavier in terms of writes vs reads, it made for a good compromise.
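The partial-record idea can be sketched in a few lines of Python. This is not the original platform's code: an in-memory dictionary stands in for the Cassandra table, the class and function names are hypothetical, the six-hour age threshold is an arbitrary placeholder, and keying each record by the hour of its event timestamp (rather than its arrival time) is an assumption about how the windows were aligned.

```python
from collections import defaultdict

WINDOW_SECONDS = 3600        # one-hour windows
MAX_AGE_SECONDS = 6 * 3600   # hypothetical cut-off for late arrivals


def window_start(timestamp):
    """Align an event timestamp (epoch seconds) to the start of its hour."""
    return timestamp - timestamp % WINDOW_SECONDS


class PartialWindowStore:
    """In-memory stand-in for the backend: many partial records per window."""

    def __init__(self):
        self._records = defaultdict(list)

    def write_partial(self, host, metric, event_ts, value, now):
        """Append one partial record; reject data older than the age threshold."""
        if now - event_ts > MAX_AGE_SECONDS:
            return False
        self._records[(host, metric, window_start(event_ts))].append(value)
        return True

    def read_window(self, host, metric, start):
        """Reconcile at read time by summing every partial record for the window."""
        return sum(self._records.get((host, metric, start), []))


store = PartialWindowStore()
hour_11 = 11 * 3600
store.write_partial("host-a", "cpu", hour_11 + 60, 5.0, now=hour_11 + 120)     # on time
store.write_partial("host-a", "cpu", hour_11 + 3000, 2.5, now=hour_11 + 4000)  # late, still accepted
print(store.read_window("host-a", "cpu", hour_11))
```

Because records are keyed by event time, it does not matter when the streaming job started or which batch a value arrived in; both writes above end up in the hour-11 window.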
After speaking with people in depth on MapR forums, the consensus seems to be that hourly and daily aggregations should not be done in a stream, but rather in a separate batch job once the data is ready.
When doing streaming you should stick to small batches with windows that are relatively small multiples of the streaming interval. Sliding windows can be useful for, say, trends over the last 50 batches. Using them for tasks as large as an hour or a day doesn't seem sensible though.
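As a rough illustration of a small sliding window over recent batches, here is a minimal Python sketch that keeps a running view of the last N batch totals. It is framework-independent and does not use any streaming library's API; the class name and the choice of a simple sum as the aggregate are assumptions made for the example.

```python
from collections import deque


class SlidingBatchWindow:
    """Keep per-batch totals for the most recent `size` batches."""

    def __init__(self, size=50):
        # deque with maxlen drops the oldest batch automatically
        self._totals = deque(maxlen=size)

    def add_batch(self, values):
        """Record the total of one batch of metric values."""
        self._totals.append(sum(values))

    def total(self):
        """Sum over the batches currently inside the window."""
        return sum(self._totals)


window = SlidingBatchWindow(size=2)
for batch in ([1, 2], [3], [4, 5]):
    window.add_batch(batch)
print(window.total())  # covers only the last two batches
```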
Also, I don't believe you can tell your batches when to start/stop, etc.
I've created a few reasonably complex M queries and have started running into some severe performance issues. I'm wondering if it has to do with how I sometimes organize my code.
The issues I've been having are:
1) Power Query constantly uses all of several CPU cores, calculating something, even if I'm not waiting for a result.
2) In Task Manager I can sometimes see that the Power Query threads ("Microsoft.mashup.Container.NetFX40.exe") are nearly idle, while Excel.exe is using 100% of one core for tens of minutes, even though at most I'm looking up values in a few parameter tables that don't contain more than a couple dozen cells.
3) Some steps take extremely long to calculate, even though the operations involved are trivial. For example, I have a list of 10 text values taken from an Excel table. This list appears as one of my query steps when I 'preview' it. Then I want to remove a single value, so the next step = List.RemoveItems(myList, {"val"}). It didn't compute after 30 minutes, even though I could see the list was correctly loaded in a previous step.
4) UI sometimes becomes unresponsive for several minutes after changing code. Can still right-click on Queries at left hand side to enter advanced editor, and click the red X at top right and choose to keep changes, but all the rest is unresponsive. Not greyed out, just unresponsive.
Anyway, I just wanted to ask if anyone's had similar trouble, and if anyone knows what triggers particularly bad performance in PQ.
I'll often use something like the following pattern to keep the total number of queries down while still being able to easily inspect individual steps:
let
    ThisWB = Excel.CurrentWorkbook(),
    CfgTbl = ThisWB{[Name="myCfgTbl"]}[Content],
    x = aFn(CfgTbl),
    y = bFn(CfgTbl),
    output = [ThisWB=ThisWB, CfgTbl=CfgTbl, x=x, y=y]
in
    output
Is this likely to lead to any issues? Just thought it might because at one point after waiting a very long time for a simple function result, I created a new query = Excel.CurrentWorkbook(){[Name="myCfgTbl"]}[Content], referenced it from the other query, and my result calculated immediately. No idea why.
It calculates previews. Turn off auto preview generation.
I messed with something like this in cases with formula-heavy tables.
The rest probably requires code examples, especially your last case.
BTW, is your version of Power Query (or Excel 2016) up to date?
This is our situation:
We store user messages in Table Storage. The PartitionKey is the UserId and the RowKey is used as a message id.
When a user opens their message panel, we just want to .Take(x) messages; we don't care about the sort order. But we have noticed that the time it takes to get the messages varies greatly with the number of messages we take.
We did some small tests:
We did 50 * .Take(X) and compared the differences:
So we did .Take(1) 50 times and .Take(100) 50 times etc.
To make an extra check we did the same test 5 times.
Here are the results:
As you can see there are some HUGE differences. The difference between 1 and 2 is very strange. The same for 199-200.
Does anybody have any clue how this is happening? The Table Storage is on a live server btw, not development storage.
Many thanks.
Chart axes: X = number of rows taken (.Take(X)), Y = test number.
Update
The problem only seems to occur when I'm using a wireless network. When I'm using a cable, the times are normal.
Possibly the data is collected in batches of a certain number x. When you request x+1 rows, it would have to take two batches and then drop a certain number.
Try running your test with increments of 1 as the Take() parameter, to confirm or dismiss this assumption.
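A test along those lines could be structured as in the sketch below. It is written in Python purely for illustration, and `fake_fetch` is a hypothetical stand-in for the real Table Storage query with .Take(count); swap in the actual call when running it. The median is used instead of the mean so that a single slow network round trip does not dominate a size's result.

```python
import statistics
import time


def time_take(fetch, sizes, repeats=50):
    """Return {size: median seconds} for calling fetch(size) `repeats` times."""
    timings = {}
    for size in sizes:
        samples = []
        for _ in range(repeats):
            started = time.perf_counter()
            fetch(size)
            samples.append(time.perf_counter() - started)
        timings[size] = statistics.median(samples)
    return timings


def fake_fetch(count):
    """Hypothetical stand-in for the real query that takes `count` rows."""
    return list(range(count))


# Step through every size from 1 to 10 to look for sudden jumps.
for size, seconds in time_take(fake_fetch, range(1, 11), repeats=5).items():
    print(f"Take({size}): {seconds * 1000:.3f} ms")
```

If the batching assumption holds, the timings should stay flat within a batch and jump at each batch boundary, rather than varying erratically from one size to the next.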