I appreciate any help as I'm a bit desperate!
I'm trying to filter a 110k-row, 46-column dataframe and apply some summarization. It's a whole pipeline that extracts the data from Postgres, stores it in a DataFrame and runs it through a series of filters, everything in memory.
My first execution runs smoothly and everything goes as expected, but as soon as I execute the script again (within a short period of time), I'm finding duplicates or missing ids after filtering.
Let's put it this way:
Sometimes the last two transaction_id values (from top to bottom) get duplicated, as in the picture, and sometimes I get only the first two transaction_id values (36739792, 36740873).
The source data is fine, no question about it.
After some time (5 minutes at least) I execute the script and get the expected results, which in this case would be only the first four transaction_id values; it just works fine. The issue only appears when re-executing within a short period.
When debugging line by line, I actually get everything right, which makes me think the logic is not the issue.
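For reference, this is roughly the check I've been adding to see whether the duplicates already exist right after the extraction or only appear after filtering (a simplified sketch; transaction_id is the real column, while the connection string, query and filter are placeholders):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@host/db")  # placeholder connection

# read the raw data straight from Postgres
raw = pd.read_sql("SELECT * FROM transactions", engine)  # placeholder query

# duplicates immediately after extraction
print("duplicated ids after extraction:", raw["transaction_id"].duplicated().sum())

# placeholder for the real filter chain
filtered = raw[raw["status"] == "completed"]

# duplicates again after filtering
print("duplicated ids after filtering:", filtered["transaction_id"].duplicated().sum())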
Could this be a memory issue? Maybe the script somehow keeps holding data even after execution?
I'm using Python 3.9, pandas and VS Code.
Again, I appreciate any help. Regards.
I've searched around quite a bit and asked my professor, but I'm not sure what the problem is. I have a test data set of 20,000 points. I perform a groupby on this to get the sum, and it takes maybe 30 seconds, which is reasonable since this is in a Jupyter notebook.
However, when I put in the larger test dataframe of 1.5 million data points, it takes hours. Everything else, even on the large dataset, happens quickly (multiple conditional joins, etc.). My professor thinks that one key occurs very frequently and that this could cause an issue, but I can't even check this.
When I run
df.groupBy('ID').count().sort('count', ascending=False).show()
on the small data set, it runs very fast and shows that one value has 25 points while all others are below 5, so maybe there is a key explosion. However, on the larger dataframe, well, I've been waiting half an hour now.
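One thing I was thinking of trying is checking the key frequencies on a small sample first instead of the full dataframe, something like this (just a sketch; ID is the key column and the 1% fraction is arbitrary):

from pyspark.sql import functions as F

# count key frequencies on a 1% sample instead of the full 1.5M rows
sample_counts = (df.sample(fraction=0.01, seed=42)
                   .groupBy('ID')
                   .count()
                   .sort(F.desc('count')))

sample_counts.show(20)  # most frequent keys in the sample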
Any help would be appreciated, thanks.
The following code hits some jobs with 'job skipped' after a few times through the loop, and the df that is read on that iteration by 'myfunc' comes back with 0 rows (but, surprisingly, with the correct number of columns):
for i in range(len(dates)-1):
    date1, date2 = dates[i], dates[i+1]
    params['file_path'] = ['s3a://path/to/files{}.json'.format(date1), 's3a://path/to/files{}.json'.format(date2)]
    df = myfunc(params)
However, when I run it 'by hand' several times, all is well: no skipped jobs, and the df's come back full.
date1, date2 = dates[0], dates[1]
params['file_path'] = ['s3a://path/to/files{}.json'.format(date1), 's3a://path/to/files{}.json'.format(date2)]
df = myfunc(params)
The above runs fine, and when I change it to date1, date2 = dates[1], dates[2] it is also OK, etc. There aren't very many files, and I've already finished them all by hand as above, but I would like to know what's going on. The filenames generated in the for loop work fine when I copy-paste them into my params. I am far from an expert in Spark, so let me know if there's something obvious to check.
Without knowing the code of myfunc I can only guess at your problem.
The 0-rows issue probably originates from the assignment df = myfunc(params), which overwrites df on every iteration and does not append to the previous df. For the last two dates it is probably just empty.
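If the intent is to keep all of the data, one option is to collect the per-iteration results and union them at the end. A rough sketch, assuming myfunc returns a Spark DataFrame with a consistent schema:

from functools import reduce
from pyspark.sql import DataFrame

dfs = []
for i in range(len(dates)-1):
    date1, date2 = dates[i], dates[i+1]
    params['file_path'] = ['s3a://path/to/files{}.json'.format(date1), 's3a://path/to/files{}.json'.format(date2)]
    dfs.append(myfunc(params))

# union all per-iteration DataFrames into one
full_df = reduce(DataFrame.unionByName, dfs)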
Skipping jobs usually comes from caching. Are you using caching anywhere?
I've created a few reasonably complex M queries and have started running into some severe performance issues. I'm wondering if it has to do with how I sometimes organize my code.
The issues I've been having are:
1) Power Query constantly uses all of several CPU cores, calculating something, even if I'm not waiting for a result.
2) In Task Manager I can sometimes see that the Power Query threads ("Microsoft.mashup.Container.NetFX40.exe") are nearly idle, while Excel.exe is using 100% of one core for tens of minutes - even though at most I'm looking up values in a few parameter tables that don't contain more than a couple dozen cells.
3) Some steps take extremely long to calculate, even though the operations involved are trivial. For example, I have a list of 10 text values taken from an Excel table. This list appears as one of my query steps when I preview it. Then I want to remove a single value, so the next step is = List.RemoveItems(myList, {"val"}). It still hadn't computed after 30 minutes, even though I could see the list was correctly loaded in a previous step.
4) The UI sometimes becomes unresponsive for several minutes after changing code. I can still right-click on Queries at the left-hand side to enter the Advanced Editor, and click the red X at the top right and choose to keep changes, but everything else is unresponsive. Not greyed out, just unresponsive.
Anyway, I just wanted to ask if anyone's had similar trouble, and if anyone knows what triggers particularly bad performance in PQ.
I'll often use something like the following pattern to keep the total number of queries down while still being able to easily inspect individual steps:
let
    ThisWB = Excel.CurrentWorkbook(),
    CfgTbl = ThisWB{[Name="myCfgTbl"]}[Content],
    x = aFn(CfgTbl),
    y = bFn(CfgTbl),
    output = [ThisWB=ThisWB, CfgTbl=CfgTbl, x=x, y=y]
in
    output
Is this likely to lead to any issues? I only ask because at one point, after waiting a very long time for a simple function result, I created a new query = Excel.CurrentWorkbook(){[Name="myCfgTbl"]}[Content], referenced it from the other query, and my result calculated immediately. No idea why.
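One variation I've been experimenting with is buffering the config table so downstream steps reuse it instead of re-evaluating the workbook read; just a sketch of the same pattern with Table.Buffer, and I don't know whether it addresses the underlying cause:

let
    ThisWB = Excel.CurrentWorkbook(),
    // buffer the config table once so later steps don't re-read it from the workbook
    CfgTbl = Table.Buffer(ThisWB{[Name="myCfgTbl"]}[Content]),
    x = aFn(CfgTbl),
    y = bFn(CfgTbl),
    output = [ThisWB=ThisWB, CfgTbl=CfgTbl, x=x, y=y]
in
    output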
It calculates previews. Turn off auto preview generation.
I messed with something like this in cases with formula-heavy tables.
The rest probably requires code examples, especially your last case.
BTW, is your version of Power Query (or Excel 2016) up to date?
I have a problem with my Stream Analytics job. I'm pulling events from an IoT Hub and grouping them into time windows based on their custom timestamps; I've already written a query that does this correctly. The problem is that it just doesn't write anything into my output table (a NoSQL table in my Storage Account).
The query runs without problems in the query editor (when testing with a sample input file) and produces the correct output, but when running 'for real' it doesn't output anything (the output table remains empty). I've even tried renaming the table and outputting to blob storage, but no dice. Here's the query:
SELECT
    'general' AS partitionKey,
    MIN(ID_frame) AS rowKey,
    DATEADD(second, 1, DATEADD(hour, -3, System.TimeStamp)) AS window_start,
    System.TimeStamp AS window_end,
    COUNT(ID_frame) AS device_count
INTO
    [IoT-Hub-output-table]
FROM
    [IoT-Hub-input] TIMESTAMP BY custom_timestamp
GROUP BY TumblingWindow(Duration(hour, 3), Offset(second, -1))
The interesting part is that, if I omit any windowing in my query, then the table output works just fine.
I've been beating my head against the wall about this for a few days now, so I think I've already tried most of the obvious things.
As you are using a TumblingWindow of 3 hours, you will get a single output every 3 hours, containing an aggregate of all the events within that period.
So did you already wait for 3 hours for the first output to be generated?
I would set the window smaller and try again to see if the output works correctly.
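For example, purely as a test, the same query with a 5-minute window should start producing rows much sooner (the 5 minutes is an arbitrary test value, with the DATEADD offset adjusted to match):

SELECT
    'general' AS partitionKey,
    MIN(ID_frame) AS rowKey,
    DATEADD(second, 1, DATEADD(minute, -5, System.TimeStamp)) AS window_start,
    System.TimeStamp AS window_end,
    COUNT(ID_frame) AS device_count
INTO
    [IoT-Hub-output-table]
FROM
    [IoT-Hub-input] TIMESTAMP BY custom_timestamp
GROUP BY TumblingWindow(Duration(minute, 5), Offset(second, -1))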
It turns out the query did output into my table, but with more delay than I expected; I was waiting 20-30 minutes at most, while the first insertions only began a little later than half an hour in. Thus I was cancelling the Analytics job before any output was produced and falsely assuming it just wouldn't output anything.
I found this to be the case after I noticed that 'sometimes' (when the job had been running for long enough) there was some output after all. And in those output records I noticed the big delay between my custom timestamp field and the general timestamp field (which the engine uses to record when the entity was last updated).
A really strange thing happened to me with Connector/Python, and I couldn't find any explanation on the Internet.
I finished and closed the first part of my program: database analysis. I had spent a lot of time profiling it to bring down the required time, and it worked. Then I started on the second part of the program; after several days I had to execute the first part again to get the data processed, but it went very slowly. I knew I hadn't made any important changes to that part.
So I spent several hours going through the git log, checking out previous versions, and found the last commit with fast analysis.
Output of the diff:
- insertq = "INSERT INTO `sp_domains` (domain) VALUES (%s) ON DUPLICATE KEY UPDATE domain=domain"
+ insertq = """
+ INSERT
+ INTO `sp_domains` (domain)
+ VALUES (%s)
+ ON DUPLICATE KEY UPDATE domain=domain
+ """
This is the only change I made in the shared class, and it really is the reason for the speed difference. I just can't figure out what happens when using triple-quote notation. Is it something to do with the executemany(...) method, which is used to execute the query?
Thank you for any explanation.
I think it has to do with turning your one-line query with 4 inline parts into a query spread over multiple lines. executemany(...) may have to do additional processing to strip the whitespace, newlines and tabs to make sure it reduces to the original statement (more than just rearranging the string; possibly extra checks, I don't know). If you want to write it that way, do the string processing yourself beforehand with split and join. Or,
From here: Use implicit continuation, it's more elegant:
def f():
    s = ('123'
         '456')
    return s
....you can see if this method is any faster.
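Applied to your query, both options would look roughly like this (a sketch; cursor and the row data are assumed to exist):

# Option 1: keep the triple-quoted string but collapse the whitespace yourself
insertq = """
    INSERT
    INTO `sp_domains` (domain)
    VALUES (%s)
    ON DUPLICATE KEY UPDATE domain=domain
"""
insertq = " ".join(insertq.split())  # back to a single-line statement

# Option 2: implicit string continuation, so no newlines enter the statement at all
insertq = ("INSERT INTO `sp_domains` (domain) "
           "VALUES (%s) "
           "ON DUPLICATE KEY UPDATE domain=domain")

cursor.executemany(insertq, [('example.com',), ('example.org',)])  # hypothetical data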