Boolean, Flatnonzero, Selecting a certain range in numpy in python - python-3.x

I have a data file in .txt that consists of 2 columns. The first one is my x values and the second column contains my y values.
What I am trying to do is quite simple. I want to identify where my x values are >= 1700 and <= 1735 so that I can get the respective y values within that x range, and at the end sum those y values.
The following is the code I wrote.
import numpy as np
data = np.loadtxt('NI2_2.txt')
x_all= data[:,0]
y_all= data[:,1]
x_selected= np.flatnonzero(np.logical_and(x_all<=1700),(x_all=>1735))
y_selected= y_all[x_selected]
y_final= np.sum(y_selected)
I get an error message for my x_selected, saying that the syntax is not correct. Does someone see what is wrong with it?
Thanks!
Cece

Try using np.where:
y_selected = y_all[np.where((x_all >= 1700) & (x_all <= 1735))]
y_final = np.sum(y_selected)
EDIT:
Also you cannot write => in python. Use >=.
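For reference, the original flatnonzero/logical_and approach also works once both conditions are passed to logical_and and the comparisons are flipped so they select the inside of the range; a minimal sketch, assuming the same NI2_2.txt layout as above:
import numpy as np
data = np.loadtxt('NI2_2.txt')
x_all = data[:, 0]
y_all = data[:, 1]
# indices where 1700 <= x <= 1735; both conditions go inside logical_and
x_selected = np.flatnonzero(np.logical_and(x_all >= 1700, x_all <= 1735))
y_final = np.sum(y_all[x_selected])
A plain boolean mask works just as well: y_all[(x_all >= 1700) & (x_all <= 1735)].sum().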

It may simply be because the comparison operator is >= and not =>, but I can't test it any further, sorry.

Related

Trying to merge 2 dataframes but receiving value error of merging object and int32 columns

I have been trying to address an issue mentioned here. I had been trying to use a list of dates to filter a dataframe, and a very gracious person was helping me, but with the current code I am now receiving these errors.
# Assign a sequential number to each trading day
df_melt_test_percent = df_melt_test_percent.sort_index().assign(DayNumber=lambda x: range(len(x)))
# Find the indices of the FOMC_dates
tmp = pd.merge(
    df_FOMC_dates, df_melt_test_percent[['DayNumber']],
    left_on='FOMC_date', right_on='DayNumber'
)
# For each row, get the FOMC_dates ± 3 days
tmp['delta'] = tmp.apply(lambda _: range(-3, 4), axis=1)
tmp = tmp.explode('delta')
tmp['DayNumber'] += tmp['delta']
# Assemble the result
result = pd.merge(tmp, df_melt_test_percent, on='DayNumber')
Screenshots of dataframes:
If anyone has any advice on how to fix this, it would be greatly appreciated.
EDIT #1:
The columns on which you want to merge do not have the same type in both dataframes. Likely one is a string and the other an integer. You should convert them to the same type before merging. Judging from the little you showed, run this before your merge:
tmp['DayNumber'] = tmp['DayNumber'].astype(int)
Alternatively:
df_melt_test_percent['DayNumber'] = df_melt_test_percent['DayNumber'].astype(str)
NB: this might not work, as you did not provide a full example. Either work out the right types yourself or provide a reproducible example.
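To see what is going on first, here is a minimal sketch of checking and aligning the key dtypes before merging (names assumed from the snippet above):
# inspect the dtypes of the merge keys on both sides
print(tmp['DayNumber'].dtype)
print(df_melt_test_percent['DayNumber'].dtype)
# cast the key in tmp to int so both sides match, then merge
tmp['DayNumber'] = tmp['DayNumber'].astype(int)
result = pd.merge(tmp, df_melt_test_percent, on='DayNumber')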

Iterating throughput dataframe columns and using .apply() gives KeyError

So I'm trying to normalize my features by using .apply() iteratively on all columns of the dataframe, but it gives a KeyError. Can someone help?
I've tried the code below, but it doesn't work:
for x in df.columns:
    df[x+'_norm'] = df[x].apply(lambda x:(x-df[x].mean())/df[x].std())
I don't think it's a good idea to call the mean and std functions inside the apply: they are recalculated for every single row that gets its new value. Instead, compute them once at the start of each loop iteration and use those values in the apply function, like below:
for x in df.columns:
    mean = df[x].mean()
    std = df[x].std()
    df[x+'_norm'] = df[x].apply(lambda y:(y-mean)/std)
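As a side note, the apply is not strictly needed here: pandas can normalize a whole column at once with vectorized arithmetic, which is usually faster. A minimal sketch, assuming all columns are numeric:
for x in df.columns:
    df[x + '_norm'] = (df[x] - df[x].mean()) / df[x].std()
Here mean() and std() are each evaluated once per column rather than once per row.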

Yet another Pandas SettingWithCopyWarning question

Yes this question has been asked many times! No, I have still not been able to figure out how to run this boolean filter without generating the Pandas SettingWithCopyWarning warning.
for x in range(len(df_A)):
    df_C = df_A.loc[(df_A['age'] >= df_B['age_limits'].iloc[x][0]) &
                    (df_A['age'] <= df_B['age_limits'].iloc[x][1])]
    df_D['count'].iloc[x] = len(df_C)  # triggers warning
I've tried:
Copying df_A and df_B in every possible place
Using a mask
Using a query
I know I can suppress the warning, but I don't want to do that.
What am I missing? I know it's probably something obvious.
Many thanks!
For more details on why you get a SettingWithCopyWarning, I would suggest you read this answer. It is mostly because selecting the column df_D['count'] and then using iloc[x] on it is a "chained assignment", which is what gets flagged.
To prevent it, get the position of the column you want in df_D and then use iloc for both the row and the column inside the for loop:
pos_col_D = df_D.columns.get_loc('count')
for x in range(len(df_A)):
    df_C = df_A.loc[(df_A['age'] >= df_B['age_limits'].iloc[x][0]) &
                    (df_A['age'] <= df_B['age_limits'].iloc[x][1])]
    df_D.iloc[x, pos_col_D] = len(df_C)  # no more warning
Also, because you compare all the values of df_A.age with the bounds of df_B.age_limits, I think you could improve the speed of your code using numpy.ufunc.outer, with the ufuncs being greater_equal and less_equal, and then summing over axis=0.
#Setup
import numpy as np
import pandas as pd
df_A = pd.DataFrame({'age': [12,25,32]})
df_B = pd.DataFrame({'age_limits':[[3,99], [20,45], [15,30]]})
#your result
for x in range(len(df_A)):
    df_C = df_A.loc[(df_A['age'] >= df_B['age_limits'].iloc[x][0]) &
                    (df_A['age'] <= df_B['age_limits'].iloc[x][1])]
    print(len(df_C))
3
2
1
#with numpy
print( (np.greater_equal.outer(df_A.age, df_B.age_limits.str[0])
        & np.less_equal.outer(df_A.age, df_B.age_limits.str[1]))
       .sum(0) )
array([3, 2, 1])
So you can assign the result of the previous snippet directly to df_D['count'] without any for loop, as shown below. Hope this works for you.
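A sketch of that final assignment, based on the setup above and assuming df_D has one row per entry of df_B.age_limits:
counts = (np.greater_equal.outer(df_A.age, df_B.age_limits.str[0])
          & np.less_equal.outer(df_A.age, df_B.age_limits.str[1])).sum(0)
df_D['count'] = counts  # plain column assignment, no chained indexing, no warning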

TypeError: "'numpy.float64' object is not iterable" using python 3.x possibly os module issue?

I am running into a seemingly simple issue, but I am still scratching my head, not sure why it isn't working. If you could please provide your feedback, it would be greatly appreciated!
I have a txt file with x and y values that look like the following (x and y are tab separated):
x y
1500 1
2000 0.5
2500 2
3000 6
In my code I determine my precursor and products to be within a certain x range. Then I want to determine the fraction of my precursor.
The following is my code
import numpy as np
import os #reading files using os module
myfiles = sorted(os.listdir('input_102417apo'))
my_ratio=[]
for file in myfiles:
    with open('input_102417apo/'+file, 'r') as f:  #determining x, y in my txt files
        data = np.loadtxt(f, delimiter='\t')
    data_filtered_both = data[data[:,1] != 0.000]
    x_array = data_filtered_both[:,0]
    y_array = data_filtered_both[:,1]
    y_norm = y_array/np.max(y_array)
    x_and_y = []
    row = np.array([list(i) for i in zip(x_array, y_norm)])
    for x, y in row:
        if y > 0:
            x_and_y.append((x, y))
    precursor_x = []
    precursor_y = []
    for x, y in x_and_y:
        if x > 2260 and x < 2280:
            precursor_x.append(x)
            precursor_y.append(y)
    precursor_y_sum = np.sum(precursor_y)
    product6_x = []
    product6_y = []
    for x, y in x_and_y:
        if x > 1685 and x < 1722:
            product6_x.append(x)
            product6_y.append(y)
    product6_y_sum = np.sum(product6_y)
    product5_x = []
    product5_y = []
    for x, y in x_and_y:
        if x > 2035 and x < 2080:
            product5_x.append(x)
            product5_y.append(y)
    product5_y_sum = np.sum(product5_y)
    my_ratio.extend((precursor_y_sum)/(precursor_y_sum+monomer6_y_sum+ monomer5_y_sum))
    with open('output/'+file, 'w') as f:
        f.write('{0:f}\n'.format(my_ratio))
I am batch processing many files organized in order (by number), so I want one list that shows the precursor fraction from all of my files.
That is why I created my_ratio.
But I am running into the following error message:
TypeError: 'numpy.float64' object is not iterable
I am not quite sure what causes it to be not iterable and how I could fix it. Thanks!
The issue is here: my_ratio.extend((precursor_y_sum)/(precursor_y_sum+monomer6_y_sum+ monomer5_y_sum))
You can't call extend with a bare scalar; extend expects an iterable (such as a list). I would use append there unless there is some reason you really need extend; see the sketch below.
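A minimal illustration of the difference, with hypothetical values:
my_ratio = []
my_ratio.append(0.5)            # append adds the scalar itself -> [0.5]
my_ratio.extend([0.25, 0.75])   # extend needs an iterable      -> [0.5, 0.25, 0.75]
# my_ratio.extend(0.5)          # TypeError: 'float' object is not iterable
So in your loop, calling my_ratio.append(...) with the computed ratio is the simplest fix.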

Openpyxl: Manipulation of cell values

I'm trying to pull cell values from an Excel sheet, do math with them, and write the output to a new sheet. I keep getting a TypeError. I've run the code successfully before, but just added this aspect of it, so the code has been distilled to the following:
import openpyxl
#set up ws from file, and ws_out write to new file
def get_data():
    first = 0
    second = 0
    for x in range(1, 1000):
        if ws.cell(row=x, column=1).value == 'string':
            for y in range(1, 10):  #Only need next ten rows after 'string'
                ws_out.cell(row=y, column=1).value = ws.cell(row=x+y, column=1).value
                second = first  #displaces first -> second
                first = ws.cell(row=x+y, column=1).value/100  #new value for first
                difference = first - second
                ws_out.cell(row=x+y+1, column=1).value = difference  #add to output
            break
Throws a TypeError message:
first = ws.cell(row=x+y, column=1).value/100
TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'
I assume this is referring to the ws.cell value and 100, respectively, so I've also tried:
first = int(ws.cell(row=x, column=1))/100 #also tried with float
Which raises:
TypeError: int() argument must be a string or a number
I've confirmed that every cell in the column is made up of numbers only. Additionally, openpyxl's cell.data_type returns 'n' (presumably for number as far as I can tell by the documentation).
I've also tested more simple math, and have the same error.
All of my searching seems to point to openpyxl normally behaving like this. Am I doing something wrong, or is this simply a limitation of the module? If so, are there any programmatic workarounds?
As a bonus, advice on writing code more succinctly would be much appreciated. I'm just beginning, and feel there must be a cleaner way to write ideas like this.
Python 3.3, openpyxl-1.6.2, Windows 7
Summary
cfi's answer helped me figure it out, although I used a slightly different workaround. On inspection of the originating file, there was one empty cell (which I had missed earlier). Since I will be re-using this code later on columns with more sporadic empty cells, I used:
if ws.cell(row=x+y, column=40).data_type == 'n':
    second = first  #displaces first -> second
    first = ws.cell(row=x+y, column=1).value/100  #new value for first
    difference = first - second
    ws_out.cell(row=x+y+1, column=1).value = difference  #add to output
Thus, if a specified cell was empty, it was ignored and skipped.
Are you 100% sure (=have verified) that all the cells you are accessing actually hold a value? (Edit: Do a print("dbg> cell value of {}, {} is {}".format(row, 1, ws.cell(row=row, column=1).value)) to verify content)
Instead of going through a fixed range(1, 1000) I'd recommend using openpyxl introspection methods to iterate over the existing rows. E.g.:
wb = load_workbook(inputfile)
for ws in wb.worksheets:
    for row in ws.rows:
        for cell in row:
            value = cell.value
When getting the values do not forget to extract the .value attribute:
first = ws.cell(row=x+y, column=1).value/100 #new value for first
As a general note: x and y are useful variable names for 2D coordinates. Don't use them both for rows; it will mislead others who have to read the code. Instead of x you could use start_row or row_offset or something similar. Instead of y you could just use row, and you could let it start with the first index being start_row+1.
Some example code (untested):
def get_data():
    first = 0
    second = 0
    for start_row in range(1, ws.max_row):
        if ws.cell(row=start_row, column=1).value == 'string':
            for row in range(start_row+1, start_row+10):
                ws_out.cell(row=start_row, column=1).value = ws.cell(row=row, column=1).value
                second = first
                first = ws.cell(row=row, column=1).value/100
                difference = first - second
                ws_out.cell(row=row+1, column=1).value = difference
            break
Now, even with this code I still don't understand what you are trying to achieve. Is the break indented correctly? If yes, the first time you match 'string', the break will quit the outer loop. Then, what is the point of the variables first and second?
Edit: Also ensure that you are reading from and writing into cell().value, not just cell().
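If empty cells are the culprit (as they turned out to be here), a minimal sketch of a None-guard before the division, reusing the variable names from the snippets above:
raw = ws.cell(row=x+y, column=1).value
if raw is not None:  # skip empty cells instead of dividing None by 100
    second = first
    first = raw / 100
    difference = first - second
    ws_out.cell(row=x+y+1, column=1).value = difference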
