Python iterator outputs in an abnormal order - python-3.5

In Python 3.5.x I executed the code below:
for key in {'six': -99, 'one': 1, '1.5': 1.5, 'two': 2, 'five': 5}:
    print(key)
I was expecting the output in this order:
six, one, 1.5, two, five
but got this output instead:
1.5, six, five, two, one
Why this weird behaviour, when I never asked for the order to change, and why this seemingly random order?
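(For reference, a minimal sketch, not from the original question: in Python 3.5 a plain dict makes no ordering guarantee, so iteration order can look arbitrary; collections.OrderedDict preserves insertion order, and from CPython 3.7 onward plain dicts do as well.)
from collections import OrderedDict

# OrderedDict preserves insertion order on Python 3.5; a plain dict only
# guarantees this from Python 3.7 onward.
d = OrderedDict([('six', -99), ('one', 1), ('1.5', 1.5), ('two', 2), ('five', 5)])
for key in d:
    print(key)  # six, one, 1.5, two, five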

Related

Another problem with: PerformanceWarning: DataFrame is highly fragmented

Since I am still learning Python, I am running into some optimisation problems here.
I keep getting the error
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
and it takes quite a while to run for what I am doing now.
Here is my code:
import numpy as np
import pandas as pd

def Monte_Carlo_for_Tracking_Error(N, S, K, Ru, Rd, r, I, a):
    ldv = []
    lhp = []
    lsp = []
    lod = []
    Tracking_Error_df = pd.DataFrame()
    # Go through different time steps of rebalancing
    for y in range(1, I + 1):
        i = 0
        # repeat the same step a times
        while i < a:
            Sample_Stock_Prices = []
            Sample_Hedging_Portfolio = []
            Hedging_Portfolio_Value = np.zeros(N)  # Initialize hedging PF
            New_Path = Portfolio_specification(N, S, K, Ru, Rd, r)  # Get a new sample path
            Sample_Stock_Prices.append(New_Path[0])
            Sample_Hedging_Portfolio.append(Changing_Rebalancing_Rythm(New_Path, y))
            Call_Option_Value = []
            Call_Option_Value.append(New_Path[1])
            Differences = np.zeros(N)
            for x in range(N):
                Hedging_Portfolio_Value[x] = Sample_Stock_Prices[0][x] * Sample_Hedging_Portfolio[0][x]
            for z in range(N):
                Differences[z] = Call_Option_Value[0][z] - Hedging_Portfolio_Value[z]
            lhp.append(Hedging_Portfolio_Value)
            lsp.append(np.asarray(Sample_Stock_Prices))
            ldv.append(np.asarray(Sample_Hedging_Portfolio))
            lod.append(np.asarray(Differences))
            Tracking_Error_df[f'Index{i + (y - 1) * 200}'] = Differences
            i = i + 1
    return (Tracking_Error_df, lod, lsp, lhp, ldv)
Code starts to give me warnings when I try to run:
Simulation=MCTE(100,100,104,1.05,0.95,0,10,200)
Small part of the warning:
C:\Users\xxx\AppData\Local\Temp\ipykernel_1560\440260239.py:30: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
Tracking_Error_df[f'Index{i+(y-1)*200}']=Differences
C:\Users\xxx\AppData\Local\Temp\ipykernel_1560\440260239.py:30: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
Tracking_Error_df[f'Index{i+(y-1)*200}']=Differences
C:\Users\xxx\AppData\Local\Temp\ipykernel_1560\440260239.py:30: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
Tracking_Error_df[f'Index{i+(y-1)*200}']=Differences
I am using a Jupyter notebook for this. If somebody could help me optimise it, I would appreciate it.
I have tested the code and I am hoping for a more performance-oriented version of it.
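One way to avoid the fragmentation warning, as the message itself suggests, is to build all the columns first and create the DataFrame with a single pd.concat(axis=1) instead of inserting columns one at a time. A minimal sketch of that pattern follows; N, I and a stand in for the question's parameters, and np.random.randn is only a placeholder for the real tracking-error computation:
import numpy as np
import pandas as pd

# Collect every column in a plain dict keyed by its name, then build the
# DataFrame once; this avoids the repeated frame.insert calls that trigger
# the PerformanceWarning.
N, I, a = 100, 10, 200
columns = {}
for y in range(1, I + 1):
    for i in range(a):
        differences = np.random.randn(N)  # placeholder for the real Differences array
        columns[f'Index{i + (y - 1) * 200}'] = pd.Series(differences)

Tracking_Error_df = pd.concat(columns, axis=1)  # one concat instead of I * a inserts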

Python pandas - processing

I appreciate any help as I'm a bit desperate!
I'm trying to filter a 110k-row, 46-column dataframe and apply some summarization. It's a whole pipeline that extracts the data from Postgres, stores it in a DataFrame, and goes through a series of filters, everything in memory.
My first execution runs smoothly and everything goes as expected, but as soon as I execute the script again (within a short period of time), I find duplicates or missing ids when filtering.
Let's put it this way:
Sometimes the last two transaction_id values (from top to bottom) get duplicated, as in the picture, and sometimes I get only the first two transaction_id values (36739792, 36740873).
The source is fine, no question about it.
If I execute the script after some time (at least 5 minutes), I get the expected results, in this case only the first four transaction_id values. The issue only appears when re-executing within a short period.
When debugging line by line, I actually get everything right, which makes me think the logic is not the issue.
Could this be a memory issue? Maybe somehow the script keeps holding data even after execution?
I'm using Python 3.9, pandas and VS Code.
Again, I appreciate any help. Regards.
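Without seeing the code it is hard to pinpoint, but one quick way to narrow it down is to log the row count and any duplicated ids right after the Postgres extract and again after each filter, to see at which stage the duplicates or missing rows appear. A rough sketch; df and the transaction_id column name are assumptions based on the description:
import pandas as pd

def check_stage(df: pd.DataFrame, stage: str) -> None:
    # Report row count and duplicated transaction_ids at a pipeline stage.
    dupes = df[df.duplicated(subset='transaction_id', keep=False)]
    print(f"{stage}: {len(df)} rows, {dupes['transaction_id'].nunique()} duplicated ids")

# Example usage, right after the extract and after each filter:
# check_stage(df, 'after extract')
# check_stage(df, 'after filter X')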

pyspark code in a loop acts differently from single executions

The following code hits some jobs with 'job skipped' after a few times through the loop, and the df that is read on that iteration by 'myfunc' comes back with 0 rows (but surprisingly, with the correct number of columns):
for i in range(len(dates) - 1):
    date1, date2 = dates[i], dates[i + 1]
    params['file_path'] = ['s3a://path/to/files{}.json'.format(date1),
                           's3a://path/to/files{}.json'.format(date2)]
    df = myfunc(params)
However, when I run it 'by hand' several times, all is well: no skipped jobs, and the df's come back full.
date1, date2 = dates[0], dates[1]
params['file_path'] = ['s3a://path/to/files{}.json'.format(date1),
                       's3a://path/to/files{}.json'.format(date2)]
df = myfunc(params)
The above runs fine, and when I change to date1,date2=dates[1],dates[2] also ok, etc. There aren't very many files and I've already finished them all by hand as above but would like to know what's going on. The filenames generated in the for loop work fine when I copy-paste them into my params. I am far from expert in spark so let me know if there's something obvious to check.
Without knowing the code of myfunc I can only guess at your problem.
The 0-rows issue probably originates from the assignment df = myfunc(params), which overwrites df on every iteration rather than appending to the previous df. For the last two dates it is probably just empty.
Skipping jobs usually comes from caching. Are you using caching anywhere?
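If the goal is to keep the data from every iteration rather than only the last one, a hedged sketch of one approach (assuming myfunc, dates and params are as in the question and every returned DataFrame has the same schema) is to collect the per-iteration results and union them:
from functools import reduce
from pyspark.sql import DataFrame

# Keep every iteration's result instead of overwriting df each time through
# the loop, then combine them into a single DataFrame at the end.
frames = []
for i in range(len(dates) - 1):
    date1, date2 = dates[i], dates[i + 1]
    params['file_path'] = ['s3a://path/to/files{}.json'.format(date1),
                           's3a://path/to/files{}.json'.format(date2)]
    frames.append(myfunc(params))

df_all = reduce(DataFrame.unionByName, frames)  # schemas must match by column name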

In Weka 3.8.3 I get different results when setting probabilityEstimates to true or false

I use the same training data set and testing data set.
I choose Weka classifiers -> functions -> LibSVM, and use the default parameters.
With the default parameters I get this result:
https://imgur.com/aIq90wP
When I set the parameter probabilityEstimates to true, I get this result:
https://imgur.com/NGVY5No
The default parameter settings are like this:
https://imgur.com/GOfLnVd
Why am I getting different results?
Maybe it's a silly question but I'll be grateful if someone can answer this.
Thanks!
This seems to be related to the random number process.
I used the same libSVM, all defaults, with diabetes.arff (comes with the software).
Run 1: no probabilityEstimates, 500 correct
Run 2: same, 500 correct
Run 3: probabilityEstimates, 498 correct
Run 4: same, 498 correct (so, with identical parameters, the process replicates)
Run 5: probabilityEstimates, but change seed from 1 to 55, 500 correct.
Run 6: probabilityEstimates, but change seed from 55 to 666, 498 correct.
Run 7: probabilityEstimates, but change seed from 666 to 1492, 499 correct.
For whatever reason, the algorithm needs a different number of random draws, or uses them in a different order, when probabilityEstimates is requested, resulting in slight perturbations in the number correct. We get the same effect if we change the random number seed (which tells the random number generator where to start).

Are Spark DataFrames ever implicitly cached?

I have recently understood that Spark DAGs get executed lazily, and intermediate results are never cached unless you explicitly call DF.cache().
Based on that, I've now run an experiment that should give me different random numbers every time:
from pyspark.sql.functions import rand
df = spark.range(0, 3)
df = df.select("id", rand().alias('rand'))
df.show()
Executing these lines multiple times gives me different random numbers each time, as expected. But if the computed values (rand() in this case) are never stored, then calling just df.show() repeatedly should give me new random numbers every time, because the 'rand' column is not cached, right?
df.show()
Calling this command a second time gives me the same random numbers as before, though. So the values are stored somewhere, which I thought did not happen.
Where is my thinking wrong? And could you give me a minimal example of non-caching that results in new random numbers every time?
The random seed parameter of rand() is set when rand().alias('rand') is called inside the select method and does not change afterwards. Therefore, calling show multiple times always uses the same random seed, and hence the result is the same.
You can see it more clearly when you return the result of rand().alias('rand') by itself, which also shows the random seed parameter:
>>> rand().alias('rand')
Column<b'rand(166937772096155366) AS `rand`'>
When providing the seed directly, it will show up accordingly:
>>> rand(seed=22).alias('rand')
Column<b'rand(22) AS `rand`'>
The random seed is set when rand() is called and is stored as part of the column expression inside select. Therefore the result is the same. You will get different results if you re-evaluate rand() every time, e.g. df.select("id", rand().alias('rand')).show().
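To make the point concrete, a small sketch: re-showing the same DataFrame reuses the seed stored in its column expression, while rebuilding the rand() column produces a fresh seed and therefore fresh numbers (spark here is an existing SparkSession, as in the question):
from pyspark.sql import SparkSession
from pyspark.sql.functions import rand

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 3)

# The seed is baked into the column expression, so both calls print
# identical values.
fixed = df.select("id", rand().alias('rand'))
fixed.show()
fixed.show()

# rand() is re-evaluated on each call, so each select gets a new seed and
# the two outputs differ.
df.select("id", rand().alias('rand')).show()
df.select("id", rand().alias('rand')).show()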
