Python Pandas data frame setting copy of slice working sometimes but not always, despite nearly identical code - python-3.x

I have one data frame called patient_df that is made like this:
PATIENT_COLS = ['Origin', 'Status', 'Team', 'Bed', 'Admit_Time', 'First_Consult', 'Decant_Time', 'Ward_Time', 'Discharge_Order', 'Discharged'] # data to track for each patient
patient_df = pd.DataFrame(columns=PATIENT_COLS)
Then, at multiple points in my code I will access a row of this data frame and update fields associated with it (the row at patient_ID doesn't exist prior to me creating it in the first line):
patient_df.loc[patient_ID] = [None for i in range(NUM_PATIENT_COLS)]
record = patient_df.loc[patient_ID]
record.Origin = ORIGIN()
record.Admit_Time = sim_time
This code runs perfectly with no errors or warnings and the output is as expected (the actual data frame is updated).
However, I have another data frame called ip_df:
ip_df = pd.read_csv(PATH + 'Clean_IP.csv')
Now, when I try to access the rows in the same way (this time the rows already exist):
for patient in ALC_patients:
    record = ip_df.loc[patient]
    orig_end = record.IP_Discharge_DT
    record.IP_LOS = MAX_STAY
    record.IP_Discharge_DT = record.N_Left_DT + timedelta(days=MAX_STAY)
I get
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
Now, I realize what's happening is I'm actually accessing a copy of the data frame and thus not updating the actual one, and I can fix this by using
ip_df.loc[patient, 'IP_LOS'] = MAX_STAY
However, I find the first code much cleaner, plus I don't have to make the data frame search for the row again every time. Why is this working with patient_df but not for ip_df, and is there anything I can change to use code more like what I am using for patient_df?

pd.options.mode.chained_assignment = None # default='warn'
According to this link, setting this in your code will turn off the warning flag.
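If you would rather fix the assignments than silence the warning, a minimal sketch (using the names from the question, untested against the real data) is to write through .loc on ip_df itself rather than through the intermediate record, so pandas is never asked to write into something that may be a copy:

# write straight into the frame; no intermediate row Series involved
for patient in ALC_patients:
    orig_end = ip_df.loc[patient, 'IP_Discharge_DT']
    ip_df.loc[patient, 'IP_LOS'] = MAX_STAY
    ip_df.loc[patient, 'IP_Discharge_DT'] = ip_df.loc[patient, 'N_Left_DT'] + timedelta(days=MAX_STAY)

As for why patient_df behaves differently: patient_df is built with a single (object) dtype, so patient_df.loc[patient_ID] can come back as a view onto the underlying data, while ip_df read from a CSV has mixed dtypes, so the same expression has to materialize a copy; that difference is exactly what the warning is flagging.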

Related

Accessing columns in a csv using python

I am trying to access data from a CSV using python. I am able to access entire columns of data values; however, I also want to access rows, using an indexed coordinate system where (0,1) is column 0, row 1. So far I have this:
#Lukas Robin
#25.07.2021
import csv

with open("sun_data.csv") as sun_data:
    sunData = csv.reader(sun_data, delimiter=',')
    global data
    for data in sunData:
        print(data)
I don't normally use data tables or CSV, so this is a new area for me.
As mentioned in the comment, you could make the jump to using pandas and spend a little time learning that. It would be a good investment of time if you plan to do much data analysis or work with data tables regularly.
If you just want to pull in a table of numbers and access it as you describe, you are perfectly fine using the csv package and doing that. Below is an example...
If your .csv file has a header in it, you can simply add in next(sun_data) before starting the inner loop to advance the iterator and let that data fall on the floor...
import csv

f_in = 'data_table.csv'
data = []  # a container to hold the results

with open(f_in, 'r') as source:
    sun_data = csv.reader(source, delimiter=',')
    for row in sun_data:
        # convert the read-in values to float data types (or ints or ...)
        row = [float(t) for t in row]
        # append it to the data table
        data.append(row)

print(data[1][0])
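If you do decide to make the jump to pandas mentioned above, the equivalent positional (row, column) access is short. A minimal sketch, assuming the same data_table.csv with no header row:

import pandas as pd

# header=None because the file is assumed to contain only numbers, no column names
df = pd.read_csv('data_table.csv', header=None)

# .iloc is purely positional: [row, column], so this matches data[1][0] above
print(df.iloc[1, 0])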

Can a python for loop affect the final dataframe result

I have found that these two pseudo code scripts produce two different results.
Script 1:
# Load dataframe
Df_1 = read_csv(path to file.csv)

# Start iteration through list of dates
For date in range(250):
    Df_1 = Function_that_calculates_stuff(Df_1)

    # Grab data I’m interested in and save to text file
    row = pd.DataFrame([[str(Df_1.iloc[1,1]), ]])
    Txt_file = Txt_file.append(row, ignore_index = True)

# After loop, save dataframes
Df_1.to_csv(path to file.csv)
Script 2:
For number in range(250):
    For date in range(1):
        # Load dataframe
        Df_1 = read_csv(path to file.csv)
        Df_1 = Function_that_calculates_stuff(Df_1)

        # Grab data I’m interested in and save to text file
        row = pd.DataFrame([[str(Df_1.iloc[1,1]), ]])
        Txt_file = Txt_file.append(row, ignore_index = True)
        Df_1.to_csv(path to file.csv)
I’m stumped as to why this is. I have tried walking through the code one line at a time, but have been unable to find anything that could explain it.
Can a nested loop alter the data loaded or saved in it?
And is there a way to create a sort of dam that prevents unwanted data from “leaking” into the start of the loop?
The two code snippets you provided don't do the same thing. Essentially, in the first snippet you read the data once from the file; we will call this the initial state. Then for each iteration of the loop you call the function, passing it the data, and store the result back into the same variable, thus overwriting the data.
So when the next iteration starts you will be using the data from the last iteration of the loop, not the initial data loaded from the file. This continues like a filtering or reduction process: each iteration may reduce or enrich the data from the previous one.
In your second snippet you are always loading the data from the file, so each iteration of the loop is applied to the original data, not to the data from the previous iteration.
As you don't give much in your example to reproduce the issue, I have constructed a simple example to demonstrate this:
def divides_by_i(num_data, i):
    return [num for num in num_data if num % i == 0]

data = ("0,1,2,3,4,5,6,7,8,9,"
        "10,11,12,13,14,15,16,17,18,19,20,"
        "20,21,22,23,24,25,26,27,28,29,30")

nums_data = [int(num) for num in data.split(",")]
for i in range(1, 4):
    nums_data = divides_by_i(nums_data, i)
with open("out1.txt", "w") as out1:
    out1.write(",".join(str(num) for num in nums_data))

for i in range(1, 4):
    for _ in range(1):
        nums_data = [int(num) for num in data.split(",")]
        nums_data = divides_by_i(nums_data, i)
with open("out2.txt", "w") as out2:
    out2.write(",".join(str(num) for num in nums_data))
OUTPUT
First loop: 0,6,12,18,24,30
Second loop: 0,3,6,9,12,15,18,21,24,27,30
In the first loop it takes all the data and overwrites it with the data divisible by the number of that loop iteration. The next iteration acts only on the numbers from the previous iteration, and we write the data to a file at the end.
In the second loop it always reads from the original numbers, applies the function, and then saves the result of that iteration over any previous output. So essentially in the second loop the function acts on the original data each time, meaning you lose the output from the previous iteration.
So the issue is not the fact that you have a nested loop; it's the fact that you're opening your file only once in the first example and opening it on each iteration in the second. If you move that open back outside the loops it will act the same.

Pandas dropped row showing in plot

I am trying to make a heatmap.
I get my data out of a pipeline that classes some rows as noisy; I decided to make one plot including them and one plot without them.
The problem I have: in the plot without the noisy rows, blank lines appear (the same number of lines as rows removed).
Roughly, the code looks like this (I can expand parts if required; I am trying to keep it short).
If needed I can provide a link with similar data publicly available.
data_frame = load_df_fromh5(file)  # load a data frame from the hdf5 output
noisy = [..]  # a list which indicates which rows are noisy

# I believe the problem is here:
noisy = [i for (i, v) in enumerate(noisy) if v == 1]  # make a vector of the indices to remove

# drop the corresponding indices
df_cells_noisy = df_cells[~df_cells.index.isin(noisy)].dropna(how="any")

# I tried an alternative method:
not_noisy = [0 if e == 1 else 1 for e in noisy]
df = df[np.array(not_noisy, dtype=bool)]

# then I made a clustering using scipy
Z = hierarchy.linkage(df, method="average", metric="canberra", optimal_ordering=True)
df = df.reindex(hierarchy.leaves_list(Z))

# then I plot using the df variable
# (quite a long function; I believe the problem is upstream)
plot(df)
The plotting function is quite long but I believe it works well, because the problem only shows up with the non-noisy data frame.
IMO pandas somehow keeps information about the deleted rows and they are plotted as blank lines. Any help is welcome.
Context:
These are single-cell data of copy number anomalies (abnormalities in the number of copies of a genomic segment).
Rows represent individuals (here, individual cells); columns represent, for each genomic interval, the copy number (2 for the normal case, except for the sex chromosomes).
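One thing worth checking (just a guess, since the plotting code is not shown): hierarchy.leaves_list(Z) returns positional indices 0..n-1, but after dropping rows the DataFrame keeps its original index labels with gaps, so df.reindex(...) on those positions can introduce all-NaN rows, which would render as blank lines. A minimal sketch of reordering by position instead:

# reorder rows by position rather than by index label
order = hierarchy.leaves_list(Z)
df = df.iloc[order]

# or, equivalently, line the labels up with positions before reindexing
# df = df.reset_index(drop=True).reindex(order)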

Slow loop aggregating rows and columns

I have a DataFrame with a column named 'UserNbr' and a column named 'Spclty', which is composed of elements like this:
[['104', '2010-01-31'], ['215', '2014-11-21'], ['352', '2016-07-13']]
where there can be 0 or more elements in the list.
Some UserNbr keys appear in multiple rows, and I wish to collapse each such group into a single row such that 'Spclty' contains all the unique dicts like those in the list shown above.
To save overhead on appending to a DataFrame, I'm appending each output row to a list instead of to the DataFrame.
My code is working, but it's taking hours to run on 0.7M rows of input. (Actually, I've never been able to keep my laptop open long enough for it to finish executing.)
Is there a better way to aggregate into a structure like this, maybe using a library that provides more data reshaping options instead of looping over UserNbr? (In R, I'd use the data.table and dplyr libraries.)
# loop over all UserNbr:
#   consolidate specialty fields into dict-like sets (to remove redundant codes);
#   output one row per user to new data frame
out_rows = list()
spcltycol = df_tmp.columns.get_loc('Spclty')
all_UserNbr = df_tmp['UserNbr'].unique()

for user in all_UserNbr:
    df_user = df_tmp.loc[df_tmp['UserNbr'] == user]
    if df_user.shape[0] > 0:
        open_combined = df_user.iloc[0, spcltycol]  # capture 1st row
        for row in range(1, df_user.shape[0]):  # union with any subsequent rows
            open_combined = open_combined.union(df_user.iloc[row, spcltycol])
        new_row = df_user.drop(['Spclty', 'StartDt'], axis=1).iloc[0].tolist()
        new_row.append(open_combined)
        out_rows.append(new_row)

# construct new dataframe with no redundant UserID rows:
df_out = pd.DataFrame(out_rows, columns=['UserNbr', 'Spclty'])

# convert Spclty sets to dicts:
df_out['Spclty'] = [dict(df_out['Spclty'][row]) for row in range(df_out.shape[0])]
The conversion to dict gets rid of specialties that are repeated between rows. In the output, a Spclty value should look like this:
{'104': '2010-01-31', '215': '2014-11-21', '352': '2016-07-13'}
except that there may be more key-value pairs than in any corresponding input row (resulting from aggregation over UserNbr).
I withdraw this question.
I had hoped there was an efficient way to use groupby with something else, but I haven't found any examples with a complex data structure like this one and have received no guidance.
For anyone who gets similarly stuck with very slow aggregation problems in Python, I suggest stepping up to PySpark. I am now tackling this problem with a Databricks notebook and am making headway with the pyspark.sql.window Window functions. (Now, it only takes minutes to run a test instead of hours!)
A partial solution is in the answer here:
PySpark list() in withColumn() only works once, then AssertionError: col should be Column
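For anyone landing here before making that jump, a plain-pandas groupby sketch along the lines the question was hoping for might look like this (the input below is made up for illustration; it assumes 'Spclty' holds lists of [code, date] pairs as in the question):

import pandas as pd

# illustrative input, not the original data
df_tmp = pd.DataFrame({
    'UserNbr': [1, 1, 2],
    'Spclty': [[['104', '2010-01-31']],
               [['104', '2010-01-31'], ['215', '2014-11-21']],
               [['352', '2016-07-13']]],
})

out_rows = []
for user, grp in df_tmp.groupby('UserNbr'):
    merged = {}                       # a dict deduplicates codes across the group
    for pairs in grp['Spclty']:
        for code, date in pairs:
            merged[code] = date
    out_rows.append({'UserNbr': user, 'Spclty': merged})

df_out = pd.DataFrame(out_rows)
print(df_out)

This makes a single pass over the grouped rows instead of one boolean filter of the whole frame per user, which is where the original loop spends most of its time.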

Getting array of Keyframe points in Blender

Hi, I've got a script I'm working on and it's not working out as well as I want it to.
This is what I have so far:
import bpy

def Key_Frame_Points():  # Gets the key-frame values as an array.
    fcurves = bpy.context.active_object.animation_data.action.fcurves
    for curve in fcurves:
        keyframePoints = fcurves[4].keyframe_points  # selects Action channel's axis / attribute
        for keyframe in keyframePoints:
            print('KEY FRAME POINTS ARE #T ', keyframe.co[0])
            KEYFRAME_POINTS_ARRAY = keyframe.co[0]
    print(KEYFRAME_POINTS_ARRAY)

Key_Frame_Points()
When I run this it prints out all the keyframes on the selected object, as I wanted it to. But the problem is that I can't, for some reason, get the values it's printing into a variable. If you run it and check the System console, it's acting oddly: it prints out the values of the keyframed object, but when I ask it to get those values as an array, it just gives the last frame.
Here is briefly how it looks.
I think what you want to do is add each keyframe.co[1] to an array, which means you want to use KEYFRAME_POINTS_ARRAY.append(keyframe.co[1]), and for that to work you will need to define it as an empty array outside the loop with KEYFRAME_POINTS_ARRAY = [].
Note that keyframe.co[0] is the frame that is keyed, while keyframe.co[1] is the keyed value at that frame.
Also of note is that you are looping through fcurves but not using each curve:
for curve in fcurves:
    keyframePoints = fcurves[4].keyframe_points
By using fcurves[4] here you are reading the same fcurve every time; you probably meant to use keyframePoints = curve.keyframe_points.
So I expect you want to have -
import bpy

def Key_Frame_Points():  # Gets the key-frame values as an array.
    KEYFRAME_POINTS_ARRAY = []
    fcurves = bpy.context.active_object.animation_data.action.fcurves
    for curve in fcurves:
        keyframePoints = curve.keyframe_points
        for keyframe in keyframePoints:
            print('KEY FRAME POINTS ARE frame:{} value:{}'.format(keyframe.co[0], keyframe.co[1]))
            KEYFRAME_POINTS_ARRAY.append(keyframe.co[1])
    return KEYFRAME_POINTS_ARRAY

print(Key_Frame_Points())
You may also be interested in fcurves.find(data_path) to find a specific curve by its path.
There is also fcurve.evaluate(frame), which will give you the curve value at any frame, not just at the keyed values.
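For example, a small sketch of both together (the data path and frame number below are just placeholders):

import bpy

fcurves = bpy.context.active_object.animation_data.action.fcurves

# look up the Z location curve: data_path 'location', array index 2
fcu = fcurves.find('location', index=2)
if fcu is not None:
    # sample the curve at frame 25, whether or not a key exists there
    print(fcu.evaluate(25))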
