How to remove rows in pandas dataframe column that contain the hyphen character? - python-3.x

I have a DataFrame given as follows:
new_dict = {'Area_sqfeet': '[1002, 322, 420-500,300,1.25acres,100-250,3.45 acres]'}
df = pd.DataFrame([new_dict])
df.head()
I want to remove hyphen values and change acres to sqfeet in this dataframe.
How may I do it efficiently?

Use list comprehension:
mylist = ["1002", "322", "420-500","300","1.25acres","100-250","3.45 acres"]
# ['1002', '322', '420-500', '300', '1.25acres', '100-250', '3.45 acres']
Step 1: Remove hyphens
filtered_list = [i for i in mylist if "-" not in i] # remove hyphens
Step 2: Convert acres to sqfeet:
final_list = [i if 'acres' not in i else eval(i.split('acres')[0])*43560 for i in filtered_list] # convert to sq foot
#['1002', '322', '300', 54450.0, 150282.0]
Also, if you want to keep the "sqfeet" next tot he converted values use this:
final_list = [i if 'acres' not in i else "{} sqfeet".format(eval(i.split('acres')[0])*43560) for i in filtered_list]
# ['1002', '322', '300', '54450.0 sqfeet', '150282.0 sqfeet']

It's not clear if this is homework, and you haven't shown us what you have already tried per https://stackoverflow.com/help/how-to-ask
Here's something that might get you going in the right direction:
import pandas as pd
col_name = 'Area_sqfeet'
# per comment on your question, you need to make a dataframe with more
# than one row, your original question only had one row
new_list = ["1002", "322", "420-500","300","1.25acres","100-250","3.45 acres"]
df = pd.DataFrame(new_list)
df.columns = ["Area_sqfeet"]
# once you have the df as strings, here's how to remove the ones with hyphens
df = df[df["Area_sqfeet"].str.contains("-")==False]
print(df.head())

Related

Set decimal values to 2 points in list under list pandas

I am trying to set max decimal values upto 2 digit for result of a nested list. I have already tried to set precision and tried other things but can not find a way.
r_ij_matrix = variables[1]
print(type(r_ij_matrix))
print(type(r_ij_matrix[0]))
pd.set_option('display.expand_frame_repr', False)
pd.set_option("display.precision", 2)
data = pd.DataFrame(r_ij_matrix, columns= Attributes, index= Names)
df = data.style.set_table_styles([dict(selector='th', props=[('text-align', 'center')])])
df.set_properties(**{'text-align': 'center'})
df.set_caption('Table: Combined Decision Matrix')
You can solve your problem with the apply() method of the dataframe. You can do something like that :
df.apply(lambda x: [[round(elt, 2) for elt in list_] for list_ in x])
Solved it by copying the list to another with the desired decimal points. Thanks everyone.
rij_matrix = variables[1]
rij_nparray = np.empty([8, 6, 3])
for i in range(8):
for j in range(6):
for k in range(3):
rij_nparray[i][j][k] = round(rij_matrix[i][j][k], 2)
rij_list = rij_nparray.tolist()
pd.set_option('display.expand_frame_repr', False)
data = pd.DataFrame(rij_list, columns= Attributes, index= Names)
df = data.style.set_table_styles([dict(selector='th', props=[('text-align', 'center')])])
df.set_properties(**{'text-align': 'center'})
df.set_caption('Table: Normalized Fuzzy Decision Matrix (r_ij)')
applymap seems to be good here:
but there is a BUT: be aware that it is propably not the best idea to store lists as values of a df, you just give up the functionality of pandas. and also after formatting them like this, there are stored as strings. This (if really wanted) should only be for presentation.
df.applymap(lambda lst: list(map("{:.2f}".format, lst)))
Output:
A B
0 [2.05, 2.28, 2.49] [3.11, 3.27, 3.42]
1 [2.05, 2.28, 2.49] [3.11, 3.27, 3.42]
2 [2.05, 2.28, 2.49] [3.11, 3.27, 3.42]
Used Input:
df = pd.DataFrame({
'A': [[2.04939015319192, 2.280350850198276, 2.4899799195977463],
[2.04939015319192, 2.280350850198276, 2.4899799195977463],
[2.04939015319192, 2.280350850198276, 2.4899799195977463]],
'B': [[3.1144823004794873, 3.271085446759225, 3.420526275297414],
[3.1144823004794873, 3.271085446759225, 3.420526275297414],
[3.1144823004794873, 3.271085446759225, 3.420526275297414]]})

Pandas: Remove characters from cell in a FOR loop

I have a dataframe with a column labeled Amount which is a dollar amount. For some reason, some of the cells in this column are enclosed in quotation marks (ex: "$47.25").
I'm running this for loop and was wondering what is the best approach to remove the quotes.
for f in files:
print(f)
df = pd.read_csv(f, header = None, nrows=1)
print(df)
je = df.iloc[0,1]
df2 = pd.read_csv(f,header = 6, dtype = {'Amount':float})
df2.to_excel(w, sheet_name = je, index = False)
I have attempted to strip the " from the value using a for loop:
for cell in df2['Amount']:
cell = cell.strip('"')
df2['Amount']=pd.to_numeric(df2['Amount'])
But I am getting:
ValueError: Unable to parse string "$-167.97" at position 0
Thank you in advance!
Given this toy dataframe:
import pandas as pd
df = pd.DataFrame(
{"Transaction": ["t1", "t2"], "Amount": ["$47.25", "'$-167.97'"]}
)
print(df)
# Outputs
Transaction Amount
0 t1 $47.25
1 t2 '$-167.97'
Instead of using a for loop, which should generally be avoided with dataframes, you could simply remove the quotation marks from the Amountcolumn like this:
df["Amount"] = df["Amount"].str.replace("\'", "")
print(df)
# Outputs
Transaction Amount
0 t1 $47.25
1 t2 $-167.97

Better way to swap column values and then append them in a pandas dataframe?

here is my dataframe
import pandas as pd
data = {'from':['Frida', 'Frida', 'Frida', 'Pablo','Pablo'], 'to':['Vincent','Pablo','Andy','Vincent','Andy'],
'score':[2, 2, 1, 1, 1]}
df = pd.DataFrame(data)
df
I want to swap the values in columns 'from' and 'to' and add them on because these scores work both ways.. here is what I have tried.
df_copy = df.copy()
df_copy.rename(columns={"from":"to","to":"from"}, inplace=True)
df_final = df.append(df_copy)
which works but is there a shorter way to do the same?
One line could be :
df_final = df.append(df.rename(columns={"from":"to","to":"from"}))
On the right track. However, introduce deep=True to make a true copy, otherwise your df.copy will just update df and you will be up in a circle.
df_copy = df.copy(deep=True)
df_copy.rename(columns={"from":"to","to":"from"}, inplace=True)
df_final = df.append(df_copy)

Pandas checks with prefix and more checksum if searched prefix exists or no data

I have below code snippet which works fine.
import pandas as pd
import numpy as np
prefixes = ['sj00', 'sj12', 'cr00', 'cr08', 'eu00', 'eu50']
df = pd.read_csv('new_hosts', index_col=False, header=None)
df['prefix'] = df[0].str[:4]
df['grp'] = df.groupby('prefix').cumcount()
df = df.pivot(index='grp', columns='prefix', values=0)
df['sj12'] = df['sj12'].str.extract('(\w{2}\d{2}\w\*)', expand=True)
df = df[ prefixes ].dropna(axis=0, how='all').replace(np.nan, '', regex=True)
df = df.rename_axis(None)
Example File new_hosts
sj000001
sj000002
sj000003
sj000004
sj124000
sj125000
sj126000
sj127000
sj128000
sj129000
sj130000
sj131000
sj132000
cr000011
cr000012
cr000013
cr000014
crn00001
crn00002
crn00003
crn00004
euk000011
eu0000012
eu0000013
eu0000014
eu5000011
eu5000013
eu5000014
eu5000015
Current output:
sj00 sj12 cr00 cr08 eu00 eu50
sj000001 cr000011 crn00001 euk000011 eu5000011
sj000002 cr000012 crn00002 eu0000012 eu5000013
sj000003 cr000013 crn00003 eu0000013 eu5000014
sj000004 cr000014 crn00004 eu0000014 eu5000015
What's expected:
1) As code works fine but as you see the current output the second column don't have any values but still appearing So, how could i have a checksum if a particular column don't have any values then remove that from display.
2) Can we place a check for the prefixes if they exists in the dataframe before processing to avoid the error.
Appreciate any help.
IIUC, before
df = df[ prefixes ].dropna(axis=0, how='all').replace(np.nan, '', regex=True)
you can do:
# remove all empty columns
df = df.dropna(axis=1, how='all')
That would solve your first part. Second part can be reindex?
# select prefixes:
prefixes = ['sj00', 'sj12', 'cr00', 'cr08', 'eu00', 'eu50', 'sh00', 'dt00', 'sh00', 'dt00']
df = df.reindex(prefixes, axis=1).dropna(axis=1, how='all').replace(np.nan, '', regex=True)
Note the axis=1, not axis=0 is identical to what I propose for question 1.
Much thanks to Quang Hoang for the hints on the post, Just for the workaround, i got it working as follows until i get a better answer:
# Select prefixes
prefixes = ['sj00', 'sj12', 'cr00', 'cr08', 'eu00', 'eu50']
df = pd.read_csv('new_hosts', index_col=False, header=None)
df['prefix'] = df[0].str[:4]
df['grp'] = df.groupby('prefix').cumcount()
df = df.pivot(index='grp', columns='prefix', values=0)
df = df[prefixes]
# For column `sj12` only extract the values having `sj12` and a should be a word immediately after that like `sj12[a-z]`
df['sj12'] = df['sj12'].str.extract('(\w{2}\d{2}\w\*)', expand=True)
df.replace('', np.nan, inplace=True)
# Remove the empty columns
df = df.dropna(axis=1, how='all')
# again drop if all values in the row are nan and replace nan to empty for live columns
df = df.dropna(axis=0, how='all').replace(np.nan, '', regex=True)
# drop the index field
df = df.rename_axis(None)
print(df)

Subtract a single value from columns in pandas

I have two data frames, df and df_test. I am trying to create a new dataframe for each df_test row that will include the difference between x coordinates and the y coordinates. I wold also like to create a new column that gives the magnitude of this distance between objects. Below is my code.
import pandas as pd
import numpy as np
# Create Dataframe
index_numbers = np.linspace(0, 10, 11, dtype=np.int)
index_ = ['OP_%s' % number for number in index_numbers]
header = ['X', 'Y', 'D']
# print(index_)
data = np.round_(np.random.uniform(low=0, high=10, size=(len(index_), 3)), decimals=0)
# print(data)
df = pd.DataFrame(data=data, index=index_, columns=header)
df_test = df.sample(3)
# print(df)
# print(df_test)
for index, row in df_test.iterrows():
print(index)
print(row)
df_(index) = df
df_(index)['X'] = df['X'] - df_test['X'][row]
df_(index)['Y'] = df['Y'] - df_test['Y'][row]
df_(index)['Dist'] = np.sqrt(df_(index)['X']**2 + df_(index)['Y']**2)
print(df_(index))
Better For Loop
for index, row in df_test.iterrows():
# print(index)
# print(row)
# print("df_{0}".format(index))
df_temp = df.copy()
df_temp['X'] = df_temp['X'] - df_test['X'][index]
df_temp['Y'] = df_temp['Y'] - df_test['Y'][index]
df_temp['Dist'] = np.sqrt(df_temp['X']**2 + df_temp['Y']**2)
print(df_temp)
I have written a for loop to run through each row of the df_test dataframe and "try" to create the columns. The (index) in each loop is the name of the new data frame based on test row used. Once the dataframe is created with the modified and new columns I would need to save the data frames to a dictionary. The new loop produces the each of the new dataframes I need but what is the best way to save each new dataframe? Any help in creating these columns would be greatly appreciated.
Please comment with any questions so that I can make it easier to understand, if need be.

Resources