I am trying to calculate the DeMarker Indicator in Python. The following describes how to calculate it:
Choose a predetermined period “X” (Standard value is “14”, although a value of “8” or “9” tends to be more sensitive);
Calculate DeMax = High – Previous High if >0, otherwise DeMax = 0;
Calculate DeMin = Previous Low – Low if >0, otherwise DeMin = 0;
DeM = MA of DeMax/(MA of DeMax + MA of DeMin).
Following is my attempt; df is a DataFrame containing open/high/low/close prices with the date as index:
df['DeMax'] = df['High'] - df['High'].shift(1)
df['DeMin'] = df['Low'].shift(1)-df['Low']
# Method 1: df['DeMax'][df['DeMax'] < 0] = 0.0
# Method 2: df[df['DeMax']< 0]['DeMax'] = 0.0
If I use Method 1, it works fine. But if I use Method 2, I get a SettingWithCopyWarning; even using the copy method, like df['DeMax'] = df['High'].copy() - df['High'].copy().shift(1), does not solve the issue.
I have also checked that df['DeMax'][df['DeMax'] < 0] and df[df['DeMax'] < 0]['DeMax'] are the same pandas Series, so why do they behave differently when I try to assign values?
Also, if I do something like this
df['DeMax'] = df['High'] - df['High'].shift(1)
df['DeMin'] = df['Low'].shift(1)-df['Low']
a = df['DeMax'][df['DeMax'] < 0]
a = 0
Then a will be 0 instead of a pandas Series, but I also expected df['DeMax'][df['DeMax'] < 0] to become 0, which does not happen. Could anyone help? Thanks.
You are looking for .loc
df.loc[df['DeMax'] < 0, 'DeMax'] = 0  # it will change all values less than 0 to 0
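For completeness, here is a minimal sketch of the full DeMarker calculation built on that idea. It uses .clip(lower=0) as an equivalent to the .loc assignment and a simple rolling mean as the MA; the period, the column names, and the choice of rolling mean are assumptions on my part, not something stated in the question.
import pandas as pd

def demarker(df, period=14):
    # DeMax = High - previous High where positive, else 0
    demax = (df['High'] - df['High'].shift(1)).clip(lower=0)
    # DeMin = previous Low - Low where positive, else 0
    demin = (df['Low'].shift(1) - df['Low']).clip(lower=0)
    # DeM = MA(DeMax) / (MA(DeMax) + MA(DeMin))
    ma_max = demax.rolling(period).mean()
    ma_min = demin.rolling(period).mean()
    return ma_max / (ma_max + ma_min)

df['DeM'] = demarker(df, period=14)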
I am aiming to create a sum of ranks in a ring, similar to an allreduce program, but using one-sided communication.
For example, if there are four processes in the system, the output would be:
PE0: Sum = 6
PE2: Sum = 6
PE3: Sum = 6
PE1: Sum = 6
However, with my current one-sided-communication solution, all the sums are 0.
My code so far:
#!/usr/bin/env python3
from mpi4py import MPI
import numpy as np
rcv_buf = np.empty((), dtype=np.intc) # uninitialized 0 dimensional integer array
status = MPI.Status()
comm_world = MPI.COMM_WORLD
my_rank = comm_world.Get_rank()
size = comm_world.Get_size()
right = (my_rank+1) % size;
left = (my_rank-1+size) % size;
snd_buf = np.array(my_rank, dtype=np.intc) # 0 dimensional integer array with 1 element initialized with the value of my_rank
sum = 0
copy = 0
# create a window
win = MPI.Win.Create(snd_buf, 1, MPI.INFO_NULL, comm=comm_world)
# we need a master process
# sync remote get call
for i in range(size):
    win.Fence(0)
    win.Get(snd_buf, left, copy)
    win.Fence(0)
    sum += copy
win.Free()
print(f"PE{my_rank}:\tSum = {copy}")
I'm not sure how to check that the Get call is working properly, and if it is, whether there is any other way to load and store.
I was using the win.Get call incorrectly. In the documentation, the first parameter of Get is specified as origin (BufSpec), which I mistook for the origin value that was in the window (snd_buf); it is actually the buffer where you want the result to be stored. I also had to include a Put call to send the value of this rank to the next process. This makes the final code:
#!/usr/bin/env python3
from mpi4py import MPI
import numpy as np
rcv_buf = np.empty((), dtype=np.intc) # uninitialized 0 dimensional integer array
status = MPI.Status()
comm_world = MPI.COMM_WORLD
my_rank = comm_world.Get_rank()
size = comm_world.Get_size()
right = (my_rank+1) % size;
left = (my_rank-1+size) % size;
snd_buf = np.array(my_rank, dtype=np.intc) # 0 dimensional integer array with 1 element initialized with the value of my_rank
sum = 0
# create a window
win = MPI.Win.Create(snd_buf, 1, MPI.INFO_NULL, comm=comm_world)
# sync remote get call
for i in range(size):
    win.Fence(0)
    win.Put(snd_buf, left)
    win.Fence(0)
    win.Get(rcv_buf, right)
    win.Fence(0)
    sum += rcv_buf
print(f"PE{my_rank}:\tSum = {sum}")
win.Free()
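To check the result, one process per rank can be launched with a standard MPI launcher; assuming the script is saved as ring_sum.py (a file name I am making up for the example):
mpiexec -n 4 python3 ring_sum.py
With four processes, each rank should print Sum = 6 (0+1+2+3), matching the expected output shown above.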
I have a nested loop that has to loop through a huge amount of data.
Assume a data frame of random values with 1,000,000 rows, each with an X,Y location in 2D space. A window of length 10 goes through all 1M data rows one by one until all the calculations are done.
Explaining what the code is supposed to do:
Each row represents a coordinate in the X-Y plane.
r_test contains the diameters of the different circles of investigation in our 2D (X-Y) plane.
For each window of 10 points/rows and for every single diameter in r_test, we compare the distance between every point and the remaining 9 points, and if the distance is less than the investigation diameter we add 2 to H. Then we calculate H/(N**2) and store it in c_10 at the index corresponding to that diameter of investigation.
For these first 10 points, once the loop has gone through all the diameters in r_test, we read the slope of the fitted line and save it to S_wind[ii]. The first 9 data points therefore have no value calculated for them, so they are given np.inf to be distinguished later.
Then the window moves one point down the rows and this process repeats until S_wind is complete.
What's a potentially better algorithm to solve this than the one I'm using, in Python 3.x?
Many thanks in advance!
import numpy as np
import pandas as pd
####generating input data frame
df = pd.DataFrame(data = np.random.randint(2000, 6000, (1000000, 2)))
df.columns= ['X','Y']
####====creating upper and lower bound for the diameter of the investigation circles
x_range =max(df['X']) - min(df['X'])
y_range = max(df['Y']) - min(df['Y'])
R = max(x_range,y_range)/20
d = 2
N = 10 #### Number of points in each window
#r1 = 2*R*(1/N)**(1/d)
#r2 = (R)/(1+d)
#r_test = np.arange(r1, r2, 0.05)
##===avoiding generation of empty r_test
r1 = 80
r2= 800
r_test = np.arange(r1, r2, 5)
S_wind = np.zeros(len(df['X'])) + np.inf
for ii in range(10, len(df['X'])): #### maybe the code run slower because of using len() function instead of a number
    c_10 = np.zeros(len(r_test)) + np.inf
    H = 0
    C = 0
    N = 10 ##### maybe I should also remove this
    for ind in range(len(r_test)):
        for i in range(ii-10, ii):
            for j in range(ii-10, ii):
                dd = r_test[ind] - np.sqrt((df['X'][i] - df['X'][j])**2 + (df['Y'][i] - df['Y'][j])**2)
                if dd > 0:
                    H += 1
        c_10[ind] = (H/(N**2))
    S_wind[ii] = np.polyfit(np.log10(r_test), np.log10(c_10), 1)[0]
You can use numpy broadcasting to eliminate all of the inner loops. I'm not sure if there's an easy way to get rid of the outermost loop, but the others are not too hard to avoid.
The inner loops are comparing ten 2D points against each other in pairs. That's just dying for using a 10x10x2 numpy array:
# replacing the `for ind` loop and its contents:
points = np.hstack((np.asarray(df['X'])[ii-10:ii, None], np.asarray(df['Y'])[ii-10:ii, None]))
differences = np.subtract(points[None, :, :], points[:, None, :]) # broadcast to 10x10x2
squared_distances = (differences * differences).sum(axis=2)
within_range = squared_distances[None,:,:] < (r_test*r_test)[:, None, None] # compare squares
c_10 = within_range.sum(axis=(1,2)).cumsum() * 2 / (N**2)
S_wind[ii] = np.polyfit(np.log10(r_test), np.log10(c_10), 1)[0] # this is unchanged...
I'm not very pandas savvy, so there's probably a better way to get the X and Y values into a single 2-dimensional numpy array. You generated the random data in the format that I'd find most useful, then converted it into something less immediately useful for numeric operations!
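For reference, one common pandas idiom for that step (a sketch only, not benchmarked against the line above) would be:
points = df[['X', 'Y']].iloc[ii-10:ii].to_numpy()   # shape (10, 2)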
Note that this code matches the output of your loop code. I'm not sure that's actually doing what you want it to do, as there are several slightly strange things in your current code. For example, you may not want the cumsum in my code, which corresponds to only re-initializing H to zero in the outermost loop. If you don't want the matches for smaller values of r_test to be counted again for the larger values, you can skip that sum (or equivalently, move the H = 0 line to in between the for ind and the for i loops in your original code).
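A sketch of that non-cumulative variant, keeping everything else from the snippet above unchanged, would be:
# count the matches for each r_test value independently instead of accumulating them
c_10 = within_range.sum(axis=(1, 2)) * 2 / (N**2)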
In the code supplied below I am trying to iterate over a 2D numpy array [i][k].
Originally this is code that was written in Fortran 77, which is older than my grandfather. I am trying to adapt it to Python.
(For those interested in what it is about: it is a simple hydraulic transients event solver.)
Bear in mind that all variables are introduced in my code, which I don't paste here.
H = np.zeros((NS,50))
Q = np.zeros((NS,50))
Here I am assigning the first row values:
for i in range(NS):
    H[0][i] = HR-i*R*Q0**2
    Q[0][i] = Q0
CVP = .5*Q0**2/H[N]
T = 0
k = 0
TAU = 1
#Interior points:
HP = np.zeros((NS,50))
QP = np.zeros((NS,50))
while T<=Tmax:
    T += dt
    k += 1
    for i in range(1,N):
        CP = H[k][i-1]+Q[k][i-1]*(B-R*abs(Q[k][i-1]))
        CM = H[k][i+1]-Q[k][i+1]*(B-R*abs(Q[k][i+1]))
        HP[k][i-1] = 0.5*(CP+CM)
        QP[k][i-1] = (HP[k][i-1]-CM)/B
    #Boundary Conditions:
    HP[k][0] = HR
    QP[k][0] = Q[k][1]+(HP[k][0]-H[k][1]-R*Q[k][1]*abs(Q[k][1]))/B
    if T == Tc:
        TAU = 0
        CV = 0
    else:
        TAU = (1.-T/Tc)**Em
        CV = CVP*TAU**2
    CP = H[k][N-1]+Q[k][N-1]*(B-R*abs(Q[k][N-1]))
    QP[k][N] = -CV*B+np.sqrt(CV**2*(B**2)+2*CV*CP)
    HP[k][N] = CP-B*QP[k][N]
    for i in range(NS):
        H[k][i] = HP[k][i]
        Q[k][i] = QP[k][i]
Remember i is for rows and k is for columns
What I expect is that for all k columns the values are calculated for as long as the T<=Tmax condition holds. I cannot figure out what my mistake is; I am getting the following errors:
RuntimeWarning: divide by zero encountered in true_divide
CVP = .5*Q0**2/H[N]
RuntimeWarning: invalid value encountered in multiply
QP[N][k] = -CV*B+np.sqrt(CV**2*(B**2)+2*CV*CP)
QP[N][k] = -CV*B+np.sqrt(CV**2*(B**2)+2*CV*CP)
ValueError: setting an array element with a sequence.
Looking at your first iteration:
H = np.zeros((NS,50))
Q = np.zeros((NS,50))
for i in range(NS):
    H[0][i] = HR-i*R*Q0**2
    Q[0][i] = Q0
The shape of H is (NS,50), but when you iterate over a range(NS) you apply that index to the 2nd dimension. Why? Shouldn't it apply to the dimension with size NS?
In numpy, arrays have 'C' order by default: the last dimension is the innermost. They can have 'F' (Fortran) order, but let's not go there. Thinking of a 2D array as a table, we typically talk of rows and columns, though these don't have a formal definition in numpy.
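A minimal illustration of that indexing (not taken from the original code, just a toy array):
import numpy as np
A = np.zeros((3, 5))    # 3 rows, 5 columns; 'C' order means the last axis is the innermost
A[1, 2] = 7.0           # element in row 1, column 2 (same element as A[1][2])
A[:, 0] = 2.0           # assign the whole first column at once
A[0, :] = np.arange(5)  # assign the whole first row at once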
Let's assume you want to set the first column to these values:
for i in range(NS):
    H[i, 0] = HR - i*R*Q0**2
    Q[i, 0] = Q0
But we can do the assignment a whole row or column at a time. I believe newer versions of Fortran also have these 'whole-array' operations.
Q[:, 0] = Q0
H[:, 0] = HR - np.arange(NS) * R * Q0**2
One point of caution when translating to Python: indexing starts at 0, and so do range and np.arange(...).
H[0][i] is functionally the same as H[0,i]. But when using slices you have to use the H[:,i] format.
I suspect your other iterations have similar problems, but I'll stop here for now.
Regarding the errors:
The first:
RuntimeWarning: divide by zero encountered in true_divide
CVP = .5*Q0**2/H[N]
You initialize H as zeros so it is normal that it complains of division by zero. Maybe you should add a conditional.
The third:
QP[N][k] = -CV*B+np.sqrt(CV**2*(B**2)+2*CV*CP)
ValueError: setting an array element with a sequence.
You define CVP = .5*Q0**2/H[N] and then CV = CVP*TAU**2, which is a sequence (H[N] is a whole row of H, so CVP and CV are arrays, not scalars). Then you try to assign something derived from it to QP[N][k], which is a single element. You are trying to insert an array into a single value.
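A minimal reproduction of that kind of error (illustrative only, with made-up array sizes, not your actual variables):
import numpy as np
QP = np.zeros((4, 5))
QP[0, 0] = np.ones(5)   # raises ValueError: an array cannot be stored in a single element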
For the second error, I think it might be related to the third. If you could provide more information, I would like to try to understand what is happening.
Hope this has helped.
I want to calculate the maximised value for a particular user based on his Interest, Popularity, or both Interest and Popularity, using the following Linear Programming Problem (LPP) equation and the pulp package in Python 3.7.
I have 4 lists
INTEREST = [5,10,15,20,25]
POPULARITY = [4,8,12,16,20]
USER = [1,2,3,4,5]
cost = [2,4,6,8,10]
and 2 variable values as
e=0.5 ; e may take (0 or 1 or 0.5)
budget=20
and
i = 0 to n, where n is the length of the list, meaning the summation is performed over all the list values.
Here, e==0 means Interest is 0; e==1 means Popularity is 0; e==0.5 means both Interest and Popularity are considered for the max value.
Also, xi takes 0 or 1; if xi==1 the user is considered, and if xi==0 the user is not considered.
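In other words (this is my reading of the description above, since the equation itself is not reproduced here), the intended model appears to be:
\max \sum_{i} x_i \bigl( e \cdot \text{INTEREST}_i + (1 - e) \cdot \text{POPULARITY}_i \bigr)
\quad \text{subject to} \quad \sum_{i} x_i \cdot \text{cost}_i \le \text{budget}, \qquad x_i \in \{0, 1\}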
My pulp code is as below:
from pulp import *
INTEREST = [5,10,15,20,25]
POPULARITY = [4,8,12,16,20]
USER = [1,2,3,4,5]
cost = [2,4,6,8,10]
e=0.5
budget=10
#PROBLEM VARIABLE
prob = LpProblem("MaxValue", LpMaximize)
# DECISION VARIABLE
int_vars = LpVariable.dicts("Interest", INTEREST,0,4,LpContinuous)
pop_vars = LpVariable.dicts("Popularity",
POPULARITY,0,4,LpContinuous)
user_vars = LpVariable.dicts("User",
USER,0,4,LpBinary)
#OBJECTIVE fUNCTION
prob += lpSum(USER(i)((INTEREST[i]*e for i in INTEREST) +
(POPULARITY[i]*(1-e) for i in POPULARITY)))
# CONSTRAINTS
prob += USER(i)cost(i) <= budget
#SOLVE
prob.solve()
print("Status : ",LpStatus[prob.status])
# PRINT OPTIMAL SOLUTION
print("The Max Value = ",value(prob.objective))
Now I am getting 2 errors:
1) line 714, in addInPlace
       for e in other:
2) line 23, in <module>
       prob += lpSum(INTEREST[i]*e for i in INTEREST) + lpSum(POPULARITY[i]*(1-e) for i in POPULARITY)
IndexError: list index out of range
What did I do wrong in my code? Guide me to resolve this problem. Thanks in advance.
I think I finally understand what you are trying to achieve. I think the problem with your description is to do with terminology. In a linear program we reserve the term variable for those variables which we want to be selected or chosen as part of the optimisation.
If I understand your needs correctly your python variables e and budget would be considered parameters or constants of the linear program.
I believe this does what you want:
from pulp import *
import numpy as np
INTEREST = [5,10,15,20,25]
POPULARITY = [4,8,12,16,20]
COST = [2,4,6,8,10]
N = len(COST)
set_user = range(N)
e=0.5
budget=10
#PROBLEM VARIABLE
prob = LpProblem("MaxValue", LpMaximize)
# DECISION VARIABLE
x = LpVariable.dicts("user_selected", set_user, 0, 1, LpBinary)
# OBJECTIVE fUNCTION
prob += lpSum([x[i]*(INTEREST[i]*e + POPULARITY[i]*(1-e)) for i in set_user])
# CONSTRAINTS
prob += lpSum([x[i]*COST[i] for i in set_user]) <= budget
#SOLVE
prob.solve()
print("Status : ",LpStatus[prob.status])
# PRINT OPTIMAL SOLUTION
print("The Max Value = ",value(prob.objective))
# Show which users selected
x_soln = np.array([x[i].varValue for i in set_user])
print("user_vars: ")
print(x_soln)
This should return the following, i.e. with these particular parameters only the last user is selected for inclusion, but this decision will change; for example, if you increase the budget to 100, all users will be selected.
Status : Optimal
The Max Value = 22.5
user_vars:
[0. 0. 0. 0. 1.]
To calculate the Pearson coefficient between two arrays I use the following:
double[] arr1 = new double[4];
arr1[0] = 1;
arr1[1] = 1;
arr1[2] = 1;
arr1[3] = 1;
double[] arr2 = new double[4];
arr2[0] = 1;
arr2[1] = 1;
arr2[2] = 1;
arr2[3] = 1;
PearsonsCorrelation pc = new PearsonsCorrelation();
println("Correlation is "+pc.correlation(arr1, arr2));
For output I receive: Correlation is NaN
The PearsonsCorrelation class is contained in the apache commons API : http://commons.apache.org/proper/commons-math/userguide/stat.html
The values in each of the arrays are based on whether or not a user's dataset contains a word. Shouldn't the above arrays be perfectly correlated?
This question is related to How to set a value's for calculating Eucludeian distance and correlation
Someone had a similar issue here [link]. Apparently, the issue is related to having a 0 standard deviation in your arrays.
You attempt to compute the correlation between two vectors of length four. As all the values in each vector are the same (all 1s in this case), this is equivalent to attempting to compute the correlation coefficient from a single pair of values.
It is perhaps obvious that there is no such thing; you need at least two distinct pairs, just as you cannot draw a meaningful regression line if you only have one pair of values.
If only one of the vectors had some variation, the result would still be NaN, but in that case it would be reasonable to set it to zero.
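For reference, this follows directly from the definition of the sample correlation coefficient (the standard formula, nothing specific to Apache Commons), where s_X and s_Y are the sample standard deviations:
r_{XY} = \frac{\operatorname{cov}(X, Y)}{s_X \, s_Y}
When a vector is constant its sample standard deviation is 0, so the denominator is 0 and the coefficient is undefined, which the library reports as NaN.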