Connector/Python slow insertion with triple quotes - python-3.x

A really strange thing happened to me with Connector/Python, and I couldn't find any explanation on the Internet.
I finished and closed the first part of my program, the database analysis. I spent a lot of time profiling to bring the run time down to what was required, and it worked. Then I started on the second part of the program. After several days I had to run the first part again to get data processed, but it ran very slowly, even though I knew I hadn't made any important changes to that part.
So I spent several hours going through git log, checking out previous versions, and found the last commit where the analysis was still fast.
Output of the diff:
- insertq = "INSERT INTO `sp_domains` (domain) VALUES (%s) ON DUPLICATE KEY UPDATE domain=domain"
+ insertq = """
+ INSERT
+ INTO `sp_domains` (domain)
+ VALUES (%s)
+ ON DUPLICATE KEY UPDATE domain=domain
+ """
This is the only change I made in the shared class, and it really is the reason for the speed difference. I just can't figure out what happens when I use the triple-quote notation. Is it something to do with the executemany(...) method, which is used to execute the query?
Thank you for any explanation.

I think it has to do with turning your single-line query into a multi-line string, which now contains newlines and, depending on how the code is indented, leading spaces or tabs. executemany(...) may have to do additional processing to strip whitespace, newlines and tabs so that the statement reduces to the original one (there may be more to it than rearranging the string, possibly extra safety checks; I don't know the details). If you want to write the query that way, do the string processing yourself beforehand with split and join.
Alternatively, quoting another answer: use implicit continuation, it's more elegant:
def f():
    s = ('123'
         '456')
    return s
...and you can see whether this method is any faster.
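Whatever executemany(...) does internally, collapsing the whitespace yourself hands it the same single-line statement that was fast before. A minimal sketch of the split/join idea follows; normalize_sql is a hypothetical helper name, not part of Connector/Python, and it assumes the statement contains no quoted string literals whose internal whitespace matters:

```python
def normalize_sql(statement: str) -> str:
    """Collapse every run of whitespace (newlines, tabs, spaces) to one space.

    Assumption: the statement has no quoted literals whose inner
    whitespace must be preserved, because this would collapse that too.
    """
    # str.split() with no arguments splits on any whitespace run and
    # drops leading and trailing whitespace.
    return " ".join(statement.split())


insertq = normalize_sql("""
    INSERT
    INTO `sp_domains` (domain)
    VALUES (%s)
    ON DUPLICATE KEY UPDATE domain=domain
""")

print(insertq)
```

The readable multi-line layout stays in the source file, while the string that reaches the cursor is identical to the original one-liner, so timing the two variants again should show whether the whitespace alone is responsible.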

Related

Python pandas - processing

I'd appreciate any help, as I'm a bit desperate!
I'm trying to filter a DataFrame of 110k rows and 46 columns and apply some summarization. It's a whole pipeline that extracts the data from Postgres, stores it in a DataFrame and runs it through a series of filters, everything in memory.
My first execution runs smoothly and everything goes as expected, but as soon as I execute the script again within a short period of time, I find duplicates or missing ids after filtering.
Let's put it this way:
Here, sometimes the last two transaction_id values (from top to bottom) get duplicated as in the picture, and sometimes I get only the first two transaction_id values (36739792, 36740873).
The source is fine, no question about it.
After some time (at least 5 minutes) I execute the script again and get the expected results, which in this case would be only the first four transaction_id values. The issue only appears when re-executing within a short period.
When debugging line by line, I actually get everything right, which makes me think the logic is not the issue.
Could this be a memory issue? Maybe the script somehow keeps holding data even after execution?
I'm using Python 3.9, pandas and VS Code.
Again, I appreciate any help. Regards.

Get an incrementing number in Logic App Select

I am using a Logic App to transform some data for an integration. I am trying to avoid using For Each loops as the amount of data I am working with is high, and these incur a cost for each action and iteration of the for each loop.
However, the integration I am working with requires a unique incrementing number for each line. The numbers don't have to be sequential, or even start at 1, but the order should be kept the same.
So with the above, the first one would get LineNumber 1, the second LineNumber 2, etc. (or, like I said, it could be 67829, 67835, etc.).
I tried to set a variable with ticks(utcNow()) before the start of the mapping, and then use sub(ticks(utcNow()), variables('startTicks')), but this is evaluated once and the same number is applied to all lines.
My next thought is to use an Azure Function or inline JavaScript to go through afterward and assign them, but I'm wondering if there is a way to accomplish this in the Select.
"or like I said, it could be 67829, 67835, etc."
To address this requirement, use the following expression inside the Select action:
indexOf(string(variables('<DATA Variable>')),string(item()))
Explanation:
item() is the current item in the Select. The expression converts it to a string and searches for that string in the stringified version of the entire data; the index at which it is found is returned.
OUTPUT
Please note:
I did not get a chance to check this on a very large dataset.
This may fail if a specific row (all values in the row) is repeated. I assume this is not your case (the order number might be unique).
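The idea behind the expression can be illustrated outside Logic Apps. The following is only a Python analogy, with json.dumps standing in for string() and str.find standing in for indexOf; the row data and the line_numbers helper are made up for the illustration:

```python
import json


def line_numbers(rows):
    """Return, for each row, the offset of its serialized form
    inside the serialized form of the whole list."""
    # Compact separators so each row serializes the same way on its own
    # as it does inside the list.
    blob = json.dumps(rows, separators=(",", ":"))
    return [blob.find(json.dumps(row, separators=(",", ":"))) for row in rows]


# Made-up sample data; each row is unique.
rows = [
    {"sku": "A", "qty": 1},
    {"sku": "B", "qty": 2},
    {"sku": "C", "qty": 3},
]
numbers = line_numbers(rows)
print(numbers)

# A repeated row gets the offset of its first occurrence both times,
# which is the failure case mentioned in the note above.
print(line_numbers([rows[0], rows[0]]))
```

Because later rows start further into the serialized text, the offsets increase in the original order as long as every row is unique. The search runs over the whole text for every row, so the cost grows roughly with the square of the data size, which may matter for a very large dataset.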

pyspark code in loop acts different from single executions

The following code reports some jobs as 'job skipped' after a few passes through the loop, and the df read on that iteration by 'myfunc' comes back with 0 rows (but, surprisingly, with the correct number of columns):
for i in range(len(dates)-1):
    date1,date2=dates[i],dates[i+1]
    params['file_path'] = ['s3a://path/to/files{}.json'.format(date1),'s3a://path/to/files{}.json'.format(date2)]
    df = myfunc(params)
However when I run it 'by hand' several times, all is well - no skipped jobs and df's come back full.
date1,date2=dates[0],dates[1]
params['file_path'] = ['s3a://path/to/files{}.json'.format(date1),'s3a://path/to/files{}.json'.format(date2)]
df = myfunc(params)
The above runs fine, and when I change to date1,date2=dates[1],dates[2] also ok, etc. There aren't very many files and I've already finished them all by hand as above but would like to know what's going on. The filenames generated in the for loop work fine when I copy-paste them into my params. I am far from expert in spark so let me know if there's something obvious to check.
Without knowing the code of myfunc, I can only guess at your problem.
The 0 rows issue probably originates from the assignment df = myfunc(params), which overwrites df on every iteration instead of appending to the previous df. Probably it is simply empty for the last two dates.
Skipped jobs usually come from caching. Are you using caching anywhere?
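One way to test the overwriting guess is to keep every iteration's result instead of rebinding df, and then look at which date pairs come back empty. A minimal sketch in plain Python: the myfunc below is a hypothetical stand-in (the real one is not shown in the question) and the dates are made up:

```python
def myfunc(params):
    # Hypothetical stand-in for the real myfunc: one "row" per input path.
    return [{"path": p} for p in params["file_path"]]


dates = ["2020-01-01", "2020-01-02", "2020-01-03"]  # made-up values
params = {}
results = {}  # (date1, date2) -> result of that iteration

# zip(dates, dates[1:]) yields the same consecutive pairs as the index loop.
for date1, date2 in zip(dates, dates[1:]):
    params["file_path"] = [
        "s3a://path/to/files{}.json".format(date1),
        "s3a://path/to/files{}.json".format(date2),
    ]
    results[(date1, date2)] = myfunc(params)

# Row count per date pair, so an empty iteration is easy to spot.
row_counts = {pair: len(rows) for pair, rows in results.items()}
print(row_counts)
```

With real Spark DataFrames, df.count() would take the place of len(rows). Since count() is an action, calling it inside the loop also forces each read to be evaluated in that iteration.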

Excel Get & Transform (Power Query) M Code Style and Performance

I've created a few reasonably complex M queries and have started running into some severe performance issues. I'm wondering if it has to do with how I sometimes organize my code.
The issues I've been having are:
1) Power Query constantly uses all of several CPU cores, calculating something, even if I'm not waiting for a result.
2) In Task Manager I can sometimes see that the Power Query threads ("Microsoft.mashup.Container.NetFX40.exe") are nearly idle, while Excel.exe is using 100% of one core for tens of minutes, even though at most I'm looking up values in a few parameter tables that don't contain more than a couple dozen cells.
3) Some steps take extremely long to calculate, even though the operations involved are trivial. For example, I have a list of 10 text values taken from an Excel table. This list appears as one of my query steps when I 'preview' it. Then I want to remove a single value, so the next step = List.RemoveItems(myList, {"val"}). It hadn't computed after 30 minutes, even though I could see the list was correctly loaded in a previous step.
4) The UI sometimes becomes unresponsive for several minutes after changing code. I can still right-click on Queries at the left-hand side to enter the advanced editor, and click the red X at top right and choose to keep changes, but all the rest is unresponsive. Not greyed out, just unresponsive.
Anyway, I just wanted to ask if anyone has had similar trouble, and if anyone knows what triggers particularly bad performance in PQ.
I'll often use something like the following pattern to keep the total number of queries down while still being able to easily inspect individual steps:
let
    ThisWB = Excel.CurrentWorkbook(),
    CfgTbl = ThisWB{[Name="myCfgTbl"]}[Content],
    x = aFn(CfgTbl),
    y = bFn(CfgTbl),
    output = [ThisWB=ThisWB, CfgTbl=CfgTbl, x=x, y=y]
in
    output
Is this likely to lead to any issues? Just thought it might because at one point after waiting a very long time for a simple function result, I created a new query = Excel.CurrentWorkbook(){[Name="myCfgTbl"]}[Content], referenced it from the other query, and my result calculated immediately. No idea why.
It is calculating previews. Turn off automatic preview generation.
I ran into something like this with formula-heavy tables.
The rest probably requires code examples, especially your last case.
BTW, is your version of Power Query (or Excel 2016) up to date?

Cassandra - Write doesn't fail, but values aren't inserted

I have a cluster of 3 Cassandra 2.0 nodes. For my application I wrote a test which writes some data to Cassandra and reads it back. In general this works fine.
The curious thing is that after I restart my computer, this test fails: after writing, I read back the value I have just written and get null instead of the value, although there was no exception while writing.
If I manually truncate the column family in use, the test passes. After that I can execute the test as often as I want and it passes again and again. Furthermore, it doesn't matter whether there are values in Cassandra or not. The result is always the same.
If I look at the CLI and the CQL shell, there are two different views:
Does anyone have an idea what is going wrong? The timestamp in the CLI is updated after re-execution, so it seems to be a read problem?
Part of my code:
For inserts I tried
Insert.Options insert = QueryBuilder.insertInto(KEYSPACE_NAME,TABLENAME)
        .value(ID, id)
        .value(JAHR, zonedDateTime.getYear())
        .value(MONAT, zonedDateTime.getMonthValue())
        .value(ZEITPUNKT, date)
        .value(WERT, entry.getValue())
        .using(timestamp(System.nanoTime() / 1000));
and
Insert insert = QueryBuilder.insertInto(KEYSPACE_NAME,TABLENAME)
        .value(ID, id)
        .value(JAHR, zonedDateTime.getYear())
        .value(MONAT, zonedDateTime.getMonthValue())
        .value(ZEITPUNKT, date)
        .value(WERT, entry.getValue());
My select looks like
Select.Where select = QueryBuilder.select(WERT)
        .from(KEYSPACE_NAME,TABLENAME)
        .where(eq(ID, id))
        .and(eq(JAHR, zonedDateTime.getYear()))
        .and(eq(MONAT, zonedDateTime.getMonthValue()))
        .and(eq(ZEITPUNKT, Date.from(instant)));
The consistency level is QUORUM (for both) and the replication factor is 3.
I'd say this is a problem with timestamps, since a truncate solves it. In Cassandra the last write wins, and that can go wrong here because of the use of System.nanoTime(), whose documentation says:
This method can only be used to measure elapsed time and is not related to any other notion of system or wall-clock time.
...
The values returned by this method become meaningful only when the difference between two such values, obtained within the same instance of a Java virtual machine, is computed.
http://docs.oracle.com/javase/7/docs/api/java/lang/System.html#nanoTime()
This means that the write that occurred before the restart could have been performed "in the future" compared to the write after the restart. The later query would not fail, but the value it writes would simply not be visible, because a "newer" value is already stored.
Do you have a requirement to use sub-millisecond precision for the insert timestamps? If not, I would recommend using System.currentTimeMillis() instead of nanoTime().
http://docs.oracle.com/javase/7/docs/api/java/lang/System.html#currentTimeMillis()
If you do have a requirement for sub-millisecond precision, it would be possible to combine System.currentTimeMillis() with some kind of atomic counter that ranges from 0 to 999 and use the result as the timestamp. This would, however, break if multiple clients insert the same row at the same time.
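The counter idea can be sketched as follows. This is a Python illustration of the principle rather than Java driver code: the class and method names are hypothetical, time.time_ns() plays the role of System.currentTimeMillis(), and instead of a counter that wraps at 999 it uses a small variant that never hands out a timestamp lower than the previous one:

```python
import threading
import time


class MicrosTimestampGenerator:
    """Microsecond write timestamps derived from the wall clock.

    Hypothetical helper: wall-clock milliseconds are scaled to microseconds,
    and calls that land in the same millisecond get consecutive values.
    """

    def __init__(self, clock_ms=lambda: time.time_ns() // 1_000_000):
        self._clock_ms = clock_ms      # injectable so it can be tested
        self._lock = threading.Lock()  # makes next_timestamp() thread-safe
        self._last = 0

    def next_timestamp(self):
        with self._lock:
            candidate = self._clock_ms() * 1000
            # Never go backwards within this process: if the clock has not
            # advanced past the last value, take the next microsecond.
            self._last = max(candidate, self._last + 1)
            return self._last


generator = MicrosTimestampGenerator()
first = generator.next_timestamp()
second = generator.next_timestamp()
print(first, second)
```

Because the values are tied to wall-clock time, a timestamp taken after a restart is later than one taken before it, as long as the system clock itself is not set back. The ordering guarantee only holds within one process, so the caveat about multiple clients writing the same row still applies.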

Resources