Using a function to replace cell values in a column - python-3.x

I have a fairly large DataFrame (22000 x 29). I want to clean up one particular column for data aggregation. A number of cells can be replaced by one column value. I would like to write a function to accomplish this using the replace function. How do I pass the column name to the function?
I tried passing the column name as a variable to the function.
Of course, I could do this variable by variable, but that would be tedious.
# replace in df from list
def replaceCell(mylist, myval, mycol, mydf):
    for i in range(len(mylist)):
        mydf.mycol.replace(to_replace=mylist[i], value=myval, inplace=True)
    return mydf

replaceCell((c1, c2, c3, c4, c5, c6, c7), c0, 'SCity', cimsBid)
cimsBid is the DataFrame; SCity is the column in which I want values to be changed.
Error message:
AttributeError: 'DataFrame' object has no attribute 'mycol'

Try accessing your column as:
mydf[mycol]
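For illustration, a minimal sketch of the difference (the frame and values here are made up):
import pandas as pd

df = pd.DataFrame({'SCity': ['A', 'B'], 'value': [1, 2]})
mycol = 'SCity'    # column name held in a variable
df[mycol]          # works: bracket indexing resolves the variable
# df.mycol         # fails: pandas looks for a column literally named 'mycol'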

On this command:
mydf.mycol.replace(to_replace=mylist[i],value=myval,inplace=True)
Attribute-style column access in pandas does not work with a variable holding the name; mydf.mycol looks for a column literally named mycol. You need to access it through the indexing operator [] instead:
mydf[mycol].replace(to_replace=mylist[i],value=myval,inplace=True)
The pandas documentation lists a few more caveats:
Warning
You can use this access only if the index element is a valid Python identifier, e.g. s.1 is not allowed. See here for an explanation of valid identifiers.
The attribute will not be available if it conflicts with an existing method name, e.g. s.min is not allowed.
Similarly, the attribute will not be available if it conflicts with any of the following list: index, major_axis, minor_axis, items.
In any of these cases, standard indexing will still work, e.g. s['1'], s['min'], and s['index'] will access the corresponding element or column.
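A quick illustration of those caveats (example data made up):
df = pd.DataFrame({'min': [3, 1, 2], '1': [4, 5, 6]})
df.min       # the DataFrame.min method, not the 'min' column
df['min']    # the 'min' column
df['1']      # works, while df.1 would be a syntax error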

Try this function; hopefully it works:
def replace_values(replace_dict, mycol, mydf):
    mydf = mydf.replace({mycol: replace_dict})
    return mydf
Pass the replacement values as a dictionary mapping old values to new ones.
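For example, to reproduce the question's setup with this function (assuming c0 through c7 are defined as in the question):
replace_dict = {c: c0 for c in (c1, c2, c3, c4, c5, c6, c7)}
cimsBid = replace_values(replace_dict, 'SCity', cimsBid)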

Address the column as a string.
You should pass the whole list of values you want to replace (to_replace) and a list of new values (value). (Don't use tuples.)
If you want to replace all values with the same new value, it is simplest to pass the lists in a single call:
def replaceCell(mylist, myval, mycol, mydf):
    mydf[mycol].replace(to_replace=mylist, value=myval, inplace=True)
    return mydf
# example dataframe
df = pd.DataFrame({'SCity': ['A', 'D', 'D', 'B', 'C', 'A', 'B', 'D'],
                   'value': [23, 42, 76, 34, 87, 1, 52, 94]})
# replace the 'SCity' column with a new value
mylist = list(df['SCity'])
myval = ['c0'] * len(mylist)
df = replaceCell(mylist, myval, 'SCity', df)
# the output
df
SCity value
0 c0 23
1 c0 42
2 c0 76
3 c0 34
4 c0 87
5 c0 1
6 c0 52
7 c0 94
This returns the df with the replaced values.
If you intend to change only a few values, you can do this in a loop.
def replaceCell2(mylist, myval, mycol, mydf):
    for i in range(len(mylist)):
        mydf[mycol].replace(to_replace=mylist[i], value=myval, inplace=True)
    return mydf
# example dataframe
df = pd.DataFrame({'SCity': ['A', 'D', 'D', 'B', 'C', 'A', 'B', 'D'],
                   'value': [23, 42, 76, 34, 87, 1, 52, 94]})
# Only entries with value 'A' or 'B' will be replaced by 'c0'
mylist = ['A','B']
myval = 'c0'
df = replaceCell2(mylist,myval,'SCity',df)
# the output
df
SCity value
0 c0 23
1 D 42
2 D 76
3 c0 34
4 C 87
5 c0 1
6 c0 52
7 D 94
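As a side note, the loop in replaceCell2 can be collapsed into a single replace call by mapping every old value to the same new one; a sketch on the same example data:
# {'A': 'c0', 'B': 'c0'} maps both old values to 'c0' in one pass
df['SCity'] = df['SCity'].replace(dict.fromkeys(['A', 'B'], 'c0'))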

Related

PerformanceWarning: DataFrame is highly fragmented. How to convert in to a more efficient way via pd.concat with designated column name

I got the following warning while running under Python 3.8 with the newest pandas:
PerformanceWarning: DataFrame is highly fragmented.
This is the place where I compile my data into one single DataFrame, and also where the problem pops up:
def get_all_score():
    df = pd.DataFrame()
    for name, code in get_code().items():
        global count
        count += 1
        print("ticker:" + name, "trade_code:" + code, "The {} data updated".format(count))
        try:
            df[name] = indicator_score(code)['total']
            time.sleep(0.33334)
        except:
            continue
    return df
I tried to look this up in the forum, but I can't figure out how to handle the two variables: df[name] is my column name, and indicator_score(code)['total'] is my column's output data. All the fragmented DataFrames are added horizontally, as shown below:
    a   b   c  ...  zz
1  30  40  10       21
2  41  50  11       33
3  44  66  20       29
4  51  71  19       10
5  31  88  31       60
6  60  95  40       70
...
What would be a neat way to use pd.concat() to solve my issue? Thanks.
This is my workaround, but it doesn't seem very reliable; one little glitch can totally ruin the process. Here is my code:
def get_all_score():
    df = pd.DataFrame()
    name_list = []
    for name, code in get_code().items():
        global count
        count += 1
        print("ticker:" + name, "trade_code:" + code, "The {} data updated".format(count))
        try:
            name_list.append(name)
            df = pd.concat([df, indicator_score(code)['总分']], axis=1)
            # df[name] = indicator_score(code)['总分']
            # time.sleep(0.33334)
        except:
            name_list.remove(name)
            continue
    df.columns = name_list
    return df
I tried to set name as the column name before the concat step, but I failed to do so; I only figured out how to rename the columns after the concat. This is such a pain. Does anyone have a better way?
df[name] = indicator_score(code)['总分'].copy()
should solve your poor performance issue, I suppose; give it a try, mate.
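If you would rather avoid growing the DataFrame one column at a time (which is what triggers the fragmentation warning), one possible sketch is to collect the Series in a dict and concatenate once at the end; get_code and indicator_score are the question's own functions:
def get_all_score():
    scores = {}
    for name, code in get_code().items():
        try:
            # collect each column first; a single concat avoids fragmentation
            scores[name] = indicator_score(code)['总分']
        except Exception:
            continue
    # dict keys become the column names
    return pd.concat(scores, axis=1)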

How to change Pandas Column Values in List Format

I'm trying to multiply each value in a column by 0.01 but the column values are in list format. How do I apply it to each element of the list in each row? For example, my data looks like this:
ID Amount
156 [14587, 38581, 55669]
798 [67178, 98635]
And I'm trying to multiply each element in the lists by 0.01.
ID Amount
156 [145.87, 385.81, 556.69]
798 [671.78, 986.35]
I've tried the following code but got an error message saying "can't multiply sequence by non-int of type 'float'".
df['Amount'] = df3['Amount'].apply(lambda x: x*0.00000001 in x)
You need another loop / list comprehension in apply:
df['Amount'] = df.Amount.apply(lambda lst: [x * 0.01 for x in lst])
df
ID Amount
0 156 [145.87, 385.81, 556.69]
1 798 [671.78, 986.35]
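If some rows could hold NaN instead of a list, a slightly defensive variant (a sketch, not required by the data shown) is:
df['Amount'] = df.Amount.apply(
    lambda lst: [x * 0.01 for x in lst] if isinstance(lst, list) else lst)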

Create duplicate column in pandas dataframe

I want to duplicate a column whose name starts with a numerical character, i.e. 1stfloor.
In simple terms, I want to copy the column 1stfloor to FirstFloor.
df
1stfloor
456
784
746
44
9984
I tried using the below code:
df['FirstFloor'] = df['1stfloor']
but encountered the below error message:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Expected output:
df
FirstFloor
456
784
746
44
9984
df['FirstFloor'] = df['1stfloor']
df['FirstFloor'] = df.loc[:, '1stfloor']
Both worked!
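That warning usually means df was itself created as a slice of another DataFrame. If it reappears, making df an explicit copy first is a common fix (a sketch, assuming df was sliced from some larger frame):
df = df.copy()                       # detach df from the frame it was sliced from
df['FirstFloor'] = df['1stfloor']    # assignment on an independent copy is safe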

ValueError: DataFrame constructor not properly called when convert Json data to Dataframe

I hit a problem when trying to convert JSON data to a DataFrame using the pandas and json packages.
My raw data from the JSON file looks like:
{"Number":41,"Type":["A1","A2","A3","A4","A5"],"Percent":{"Very Good":1.2,"Good":2.1,"OK":1.1,"Bad":1.3,"Very Bad":1.7}}
And my code is:
import pandas as pd
import json

with open('Test.json', 'r') as filename:
    json_file = json.load(filename)

df = pd.DataFrame(json_file['Type'], columns=['Type'])
When I read only Type from the JSON file, it gives me the correct result, which looks like:
Type
0 A1
1 A2
2 A3
3 A4
4 A5
However, when I read only Number from the JSON file:
df =pd.DataFrame(json_file['Number'],columns=['Number'])
It gives me error: ValueError: DataFrame constructor not properly called!
Also, if I use:
df = pd.DataFrame.from_dict(json_file)
I get the error:
ValueError: Mixing dicts with non-Series may lead to ambiguous ordering.
I have done some research on Google but still cannot figure out why.
My goal is to break this JSON data into two DataFrames. The first combines Number and Type:
Number Type
0 41 A1
1 41 A2
2 41 A3
3 41 A4
4 41 A5
The other DataFrame I want is the data in Percent, which may look like:
Very Good 1.2
Good 2.1
OK 1.1
Bad 1.3
Very Bad 1.7
This should give your desired output:
with open('Test.json', 'r') as filename:
    json_file = json.load(filename)

df = pd.DataFrame({'Number': json_file.get('Number'), 'Type': json_file.get('Type')})
df2 = pd.DataFrame({'Percent': json_file.get('Percent')})
Number Type
0 41 A1
1 41 A2
2 41 A3
3 41 A4
4 41 A5
Percent
Bad 1.3
Good 2.1
OK 1.1
Very Bad 1.7
Very Good 1.2
You can generalize this into a function:
def json_to_df(d, ks):
    return pd.DataFrame({k: d.get(k) for k in ks})
df = json_to_df(json_file, ['Number', 'Type'])
If you wish to avoid using the json package, you can do it directly in pandas:
_df = pd.read_json('Test.json', typ='series')
df = pd.DataFrame.from_dict(dict(_df[['Number', 'Type']]))
df2 = pd.DataFrame.from_dict(_df['Percent'], orient='index', columns=['Percent'])
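As an aside on the original error: the DataFrame constructor rejects a bare scalar like 41 because it cannot infer an index for it; wrapping the scalar in a list (or passing an index) is enough:
df = pd.DataFrame([json_file['Number']], columns=['Number'])     # one-row frame
# or: pd.DataFrame({'Number': json_file['Number']}, index=[0])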

Merge data frames based on column with different rows

I have multiple csv files that I read into individual data frames based on their name in the directory, like so
# ask user for path
path = input('Enter the path for the csv files: ')
os.chdir(path)

# loop over filenames and read into individual dataframes
for fname in os.listdir(path):
    if fname.endswith('Demo.csv'):
        demoRaw = pd.read_csv(fname, encoding='utf-8')
    if fname.endswith('Key2.csv'):
        keyRaw = pd.read_csv(fname, encoding='utf-8')
Then I filter to only keep certain columns
# filter to keep desired columns only
demo = demoRaw.filter(['Key', 'Sex', 'Race', 'Age'], axis=1)
key = keyRaw.filter(['Key', 'Key', 'Age'], axis=1)
Then I create a list of the above dataframes and use reduce to merge them on Key
# create list of data frames for combined sheet
dfs = [demo, key]
# merge the list of data frames on the Key
combined = reduce(lambda left,right: pd.merge(left,right,on='Key'), dfs)
Then I drop the auto generated column, create an Excel writer and write to a csv
# drop the auto generated index column
combined.set_index('RecordKey', inplace=True)
# create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('final.xlsx', engine='xlsxwriter')
# write to csv
combined.to_excel(writer, sheet_name='Combined')
meds.to_excel(writer, sheet_name='Meds')
# Close the Pandas Excel writer and output the Excel file.
writer.save()
The problem is some files have keys that aren't in others. For example
Demo file
Key Sex Race Age
1 M W 52
2 F B 25
3 M L 78
Key file
Key Key2 Age
1 7325 52
2 4783 25
3 1367 78
4 9435 21
5 7247 65
Right now, it only includes rows if there is a matching key in each (in other words, it leaves out rows whose keys are not in the other file). How can I combine all rows from all files, even if keys don't match? The end result would look like this:
Key Sex Race Age Key2 Age
1   M   W    52  7325 52
2   F   B    25  4783 25
3   M   L    78  1367 78
4                9435 21
5                7247 65
I don't care if the empty cells are blanks, NaN, #N/A, etc. Just as long as I can identify them.
Replace
combined = reduce(lambda left,right: pd.merge(left,right,on='Key'), dfs)
with:
combined = pd.merge(demo, key, how='outer', on='Key')
You have to specify how='outer' to keep the full set of rows from both Key and Demo.
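If dfs ever holds more than two frames, the same how='outer' option slots straight into the original reduce call:
from functools import reduce
combined = reduce(lambda left, right: pd.merge(left, right, on='Key', how='outer'), dfs)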
