Update Pandas DF during while loop (Python 3, Pandas)

Some background: My code takes user input and applies it to my DF to remove certain rows. This process repeats as many times as the user would like. Unfortunately, I am not sure how to update my DF within the while loop I have created so that it keeps the changes being made:
import pandas as pd

data = {'hello': ['the man', 'is a', 'good guy']}
df = pd.DataFrame(data)

def func():
    while True:
        n = input('Words: ')
        if n == "Done":
            break
        else:
            pattern = '^' + ''.join('(?=.*{})'.format(word) for word in n.split())
            df[df['hello'].str.contains(pattern) == False]
How do I update the DF at the end of each loop so the changes being made stay put?

Ok, I reevaluated your problem and my old answer was of course totally wrong.
What you want is the DataFrame.drop method, which can work in place. Note that drop takes index labels, not a boolean mask, so convert the mask to labels first:
mask = df['hello'].str.contains(pattern)
df.drop(df[mask].index, inplace=True)
This will update your DataFrame without reassignment.

Looks to me like you've already done all the hard work, but there are two problems.
Your last line doesn't store the result anywhere. Most Pandas operations are not "in-place", which means you have to store the result somewhere to be able to use it later.
df is a global variable, and setting its value inside a function doesn't work, unless you explicitly have a line stating global df. See the good answers to this question for more detail.
So I think you just need to do:
df = df[df['hello'].str.contains(pattern)==False]
to fix problem one.
For problem two, give func a parameter and a return value: define it as def func(df):, put return df at the end, and then call it like:
df = func(df)
OR, start func with the line
global df
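Putting both fixes together, a minimal runnable sketch of the parameter-and-return variant (the interactive loop is split from a small filtering helper so the filtering logic is easy to test on its own):

```python
import pandas as pd

def filter_rows(df, words):
    """Drop rows whose 'hello' column contains all of the given words."""
    pattern = '^' + ''.join('(?=.*{})'.format(w) for w in words.split())
    return df[~df['hello'].str.contains(pattern)]

def func(df):
    """Repeatedly filter until the user types 'Done'; returns the result."""
    while True:
        n = input('Words: ')
        if n == "Done":
            return df
        df = filter_rows(df, n)  # reassign so the change survives the next loop
```

Usage: `df = func(df)` — the reassignment inside the loop is what makes each filter stick.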

Related

How can I drop a row using pandas?

I know, many similar questions have been asked many times. But I can't figure out how to do this in my case. The drop does not take any effect and I don't know why.
code:
import os
import pandas as pd

default_path = os.path.dirname(os.path.abspath(__file__))
start_urls = []
if os.path.exists(f'{default_path}/amazon_permalink_error.csv'):
    df = pd.read_csv(f'{default_path}/amazon_permalink_error.csv')
    if len(df) > 0:
        all_completedISBN = pd.read_csv(f'{default_path}/amazon_permalink.csv')['ISBN'].to_list()
        for i in range(len(df)):
            if df.iloc[i]['sku'].split('/')[-1] not in all_completedISBN:
                start_urls.append(df.iloc[i]['sku'])
            else:
                df.drop(i)
    else:
        os.remove(f'{default_path}/amazon_permalink_error.csv')
amazon_permalink_error.csv:
sku
https://www.amazon.com/dp/B085K647FM
https://www.amazon.com/dp/B07MTMCNLX
https://www.amazon.com/dp/B07WSK5W7V
https://www.amazon.com/dp/B089T73ZB9
amazon_permalink.csv
ISBN,PERMALINK,Main Link,Brand,Price
B085K647FM,Razer-Raptor-Gaming-Monitor-Compatible,https://www.amazon.com/Razer-Raptor-Gaming-Monitor-Compatible/dp/B085K647FM,Razer,$619.95
B085K647FM,Razer-Raptor-Gaming-Monitor-Compatible,https://www.amazon.com/Razer-Raptor-Gaming-Monitor-Compatible/dp/B085K647FM,Razer,$619.95
B0959Y663R,Razer-Raptor-Gaming-Monitor-Compatible,https://www.amazon.com/Razer-Raptor-Gaming-Monitor-Compatible/dp/B0959Y663R,Razer,$797.49
B087N4LQPN,ALIENWARE-AW2521HF-24-5-Gaming-Monitor,https://www.amazon.com/ALIENWARE-AW2521HF-24-5-Gaming-Monitor/dp/B087N4LQPN,Alienware,
When I print print(len(start_urls)), it shows 943, and the total length of the error file is 1134, so the filtering is working. But when I drop the rows that are already completed and then print(len(df)) at the end, it shows 1133 when it should show 943.
You are dropping the row with the drop method but not updating the DataFrame: by default, drop returns a new DataFrame and leaves the original untouched. Pass inplace=True to make the drop stick:
df.drop(i, inplace=True)
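Row-by-row inplace drops work, but the more idiomatic pandas approach is to build one boolean mask and filter once. A minimal sketch of the same logic, with hypothetical stand-in data in place of the two CSVs from the question:

```python
import pandas as pd

# Hypothetical stand-ins for the error file and the completed-ISBN list
df = pd.DataFrame({'sku': ['https://www.amazon.com/dp/B085K647FM',
                           'https://www.amazon.com/dp/B07MTMCNLX']})
completed_isbn = {'B085K647FM'}  # ISBNs already processed

# Keep only rows whose trailing ISBN is not in the completed set
mask = ~df['sku'].str.split('/').str[-1].isin(completed_isbn)
df = df[mask]
start_urls = df['sku'].to_list()
```

With this, len(df) and len(start_urls) agree by construction, which avoids the mismatch in the question.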

split time series dataframe when value change

I have a DataFrame that holds the lat/long of a moving object.
The object goes from one place to another, and I created a column that records which place it is at, every second.
I want to split that DataFrame, so that when the object enters one place and then leaves for another, I get two separate DataFrames.
'None' means it is between places.
My current code:
def cut_df2(df):
    df_copy = df.copy()
    # Check if the place changed
    df_copy['changed'] = df_copy['place'].ne(df_copy['place'].shift().bfill()).astype(int)
    last = 0
    dfs = []
    for num, line in df_copy.iterrows():
        if line.changed:
            dfs.append(df.iloc[last:num, :])
            last = num
    # Check if the last line was in a place
    if line.place != 'None':
        dfs.append(df.iloc[last:, :])
    df_outs = []
    # Delete empty dataframes
    for num, dataframe in enumerate(dfs):
        if not dataframe.empty:
            if dataframe.reset_index().place.iloc[0] != 'None':
                df_outs.append(dataframe)
    return df_outs
It works on simple examples but not on a big dataset, and I have no idea why. Can anyone help me?
Try using this instead:
https://www.geeksforgeeks.org/split-pandas-dataframe-by-rows/
iloc can be a good way to split a dataframe by position. Note that iloc[:, :72] slices columns; to split by rows, slice the first axis:
df1 = datasX.iloc[:72, :]
df2 = datasX.iloc[72:, :]
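For splitting on value changes specifically, a common pandas pattern is to label consecutive runs with a cumulative sum of a change flag and then group on that label. A minimal sketch with hypothetical data (the 'place' column and the 'None' marker follow the question):

```python
import pandas as pd

df = pd.DataFrame({'place': ['A', 'A', 'None', 'B', 'B'],
                   'lat':   [1, 2, 3, 4, 5]})

# A new run starts wherever 'place' differs from the previous row
run_id = (df['place'] != df['place'].shift()).cumsum()

# One sub-dataframe per consecutive run, skipping the 'None' gaps
dfs = [g for _, g in df.groupby(run_id) if g['place'].iloc[0] != 'None']
```

Because this never mixes positional iloc slices with index labels, it behaves the same regardless of how the dataframe is indexed, which is a likely culprit for code that works on toy examples but not on real data.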

Delete empty dataframe with loop in Python

I have a series of different DataFrames in Python. I want to check whether each of them is empty and delete the ones that are. I am trying with a loop, but none of the dataframes are actually deleted, not even the ones that really are empty. Here is the example, where df_A, df_B, df_C, and df_D are my dataframes and the last one (df_D) is empty.
df_names = [df_A, df_B, df_C, df_D]
for df_ in df_names:
    if df_.empty:
        del df_
For sure I am missing something quite simple, I hope you can help me with this (probably a bit silly) question.
You can use the python locals() function to do this (note this only works reliably at module scope, where locals() is the real namespace; inside a function, locals() returns a snapshot and the del has no effect on the actual variables). I would first save the dataframe names in a list as strings:
Code
df_names = ['df_A', 'df_B', 'df_C', 'df_D']
for df_ in df_names:
    if locals()[df_].empty:
        del locals()[df_]
You can also check if your dataframe has been deleted using the below code:
alldfs = [var for var in dir() if isinstance(eval(var), pd.core.frame.DataFrame)]
for i in alldfs:
    if i[:1] != '_':
        print(i)
The above snippet will return all the existing dataframes (excluding the ones defined by python internally)
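A more robust alternative to juggling variable names through locals() is to keep the dataframes in a dict and filter it; the empty ones simply drop out of the dict. A minimal sketch with hypothetical contents:

```python
import pandas as pd

dfs = {
    'df_A': pd.DataFrame({'x': [1]}),
    'df_B': pd.DataFrame({'x': [2]}),
    'df_C': pd.DataFrame({'x': [3]}),
    'df_D': pd.DataFrame(),  # the empty one
}

# Keep only the non-empty dataframes
dfs = {name: d for name, d in dfs.items() if not d.empty}
```

This works the same at any scope and makes it trivial to iterate over whatever survives.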

How to save tuples output form for loop to DataFrame Python

I have some data, 33k rows x 57 columns.
Some columns contain data that I want to translate with a dictionary.
I have done the translation, but now I want to write the translated data back into my data set.
My problem is saving the tuple output from the for loop.
I am using tuples to get a good translation; .join and .append are not working in my case. I have tried many things without any success.
Looking for any advice.
data = pd.read_csv(filepath, engine="python", sep=";", keep_default_na=False)
for index, row in data.iterrows():
    row["translated"] = tuple(slownik.get(znak) for znak in row["1st_service"])
I just want print(data["1st_service"]) to show the translated data, not the data from before the for loop.
First of all, if your csv doesn't already have a 'translated' column, you'll have to add it:
import numpy as np
data['translated'] = np.nan
The problem is that the row object you're writing to is only a copy of the dataframe's data, not the dataframe itself, so assignments to it are silently lost. Write back through the dataframe with .loc instead:
data.loc[index, "translated"] = tuple(slownik.get(znak) for znak in row["1st_service"])
and you'll get a tuple written into that one cell.
In future, posting the exact error message you're getting is very helpful!
I have managed it; below is the working code:
data = pd.read_csv(filepath, engine="python", sep=";", keep_default_na=False)
data.columns = []
slownik = dict([ ])
trans = ' '
for index, row in data.iterrows():
    trans += str(tuple([slownik.get(znak) for znak in row["1st_service"]]))
data['1st_service'] = trans.split(')(')
data.to_csv("out.csv", index=False)
Can you tell me if it is well done?
Maybe there is a faster way to do it?
I am doing it for 12 columns in one for loop, as shown above.
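A faster alternative to building one big string and splitting it back apart is to map the translation over each column directly with Series.apply, which writes one tuple per cell and never leaves pandas. A minimal sketch (slownik and its contents here are hypothetical stand-ins for the question's translation dictionary):

```python
import pandas as pd

slownik = {'a': 'A', 'b': 'B'}  # hypothetical translation dictionary
data = pd.DataFrame({'1st_service': ['ab', 'ba']})

# One tuple of translated characters per cell, written to a new column
data['translated'] = data['1st_service'].apply(
    lambda s: tuple(slownik.get(znak) for znak in s))
```

For 12 columns, the same apply can run in a short loop over the column names, with no string round-trip.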

Outer Join Two Pandas Dataframes [duplicate]

This question already has answers here:
How to convert index of a pandas dataframe into a column
(9 answers)
Closed 1 year ago.
I'm not sure where I am astray but I cannot seem to reset the index on a dataframe.
When I run test.head(), I get the output below:
As you can see, the dataframe is a slice, so the index is out of bounds.
What I'd like to do is to reset the index for this dataframe. So I run test.reset_index(drop=True). This outputs the following:
That looks like a new index, but it's not: running test.head() again shows the index is still the same. Attempting to use apply with a lambda or iterrows() then creates problems with the dataframe.
How can I really reset the index?
reset_index by default does not modify the DataFrame; it returns a new DataFrame with the reset index. If you want to modify the original, use the inplace argument: df.reset_index(drop=True, inplace=True). Alternatively, assign the result of reset_index by doing df = df.reset_index(drop=True).
BrenBarn's answer works.
The following also worked, via this thread, which is not so much troubleshooting as an articulation of how to reset the index:
test = test.reset_index(drop=True)
As an extension of in code veritas's answer... instead of doing del at the end:
test = test.reset_index()
del test['index']
You can set drop to True.
test = test.reset_index(drop=True)
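To see the return-versus-reassign distinction concretely, a tiny sketch with hypothetical data:

```python
import pandas as pd

test = pd.DataFrame({'v': [10, 20, 30]}, index=[5, 7, 9])
sliced = test[test['v'] > 10]           # a slice: index is [7, 9]

sliced.reset_index(drop=True)           # returns a new frame; 'sliced' is unchanged
sliced = sliced.reset_index(drop=True)  # reassign to actually keep the 0..n index
```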
I would add to in code veritas's answer:
If you already have an index column specified, then you can save the del, of course. In my hypothetical example:
df_total_sales_customers = pd.DataFrame(
    {'Sales': total_sales_customers['Sales'],
     'Customers': total_sales_customers['Customers']},
    index=total_sales_customers.index)
df_total_sales_customers = df_total_sales_customers.reset_index()
