Pandas Drop an Entire Column if All of the Values equal a Certain Value - python-3.x

Let's say I have dataframes that look like this:
df_one
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 Ian Org Paid
df_two:
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
The logic I am trying to implement is something like this:
If all of column C = "NaN" then drop the entire column
Else if all of column C = "Member" drop the entire column
else do nothing
Any suggestions?
Edit: Added Expected Output
Expected Output if using on both data frames:
df_one
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 Ian Org Paid
df_two:
a b
0 dave blue
1 bill red
2 sally green
Edit #2: Why am I doing this in the first place?
I am ripping text from a PDF file and placing it into CSV files using the Tabula library.
The data is not coming out in the way that I am hoping it would, so I am applying ETL concepts to move the data around.
The final outcome would be for management to be able to open the final result into a nicely formatted Excel file.
Some of the columns have part of the headers put into a separate row and things got shifted around for some reason while ripping the data out of the PDF.
The headers look something like this:
Team Type Member Contact
Count
What I am doing is checking an entire column for certain header values. If the entire column has a header value, I'm dropping the entire column.
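As a generic sketch of that clean-up (hedged: the header_values set below is a hypothetical example drawn from the headers shown above), a column can be dropped when every non-null entry is one of those tokens, which also covers all-NaN columns:
import pandas as pd

# Hypothetical stray header tokens; adjust to whatever Tabula leaves behind.
header_values = {'Member', 'Count'}

def drop_header_columns(df: pd.DataFrame) -> pd.DataFrame:
    # Keep a column only if it has at least one non-null value outside
    # header_values; an all-NaN column has none, so it is dropped too.
    keep = [col for col in df.columns
            if not df[col].dropna().isin(header_values).all()]
    return df[keep]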

The idea is to replace 'Member' with missing values first, then test whether at least one value is non-missing using notna with any, and set True for all the remaining columns in the mask via Series.reindex:
mask = (df[['c']].replace('Member', np.nan)
                 .notna()
                 .any()
                 .reindex(df.columns, fill_value=True))
print(mask)
Another idea is to chain both masks with & for bitwise AND:
mask = ((df[['c']].notna() & df[['c']].ne('Member'))
        .any()
        .reindex(df.columns, fill_value=True))
print(mask)
Last, filter the columns with DataFrame.loc:
df = df.loc[:, mask]
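Putting it together on the sample frames, a minimal sketch (df_two loses column c because it holds only NaN and 'Member'; df_one would keep c because of 'Paid'):
import numpy as np
import pandas as pd

df_two = pd.DataFrame({'a': ['dave', 'bill', 'sally'],
                       'b': ['blue', 'red', 'green'],
                       'c': [np.nan, np.nan, 'Member']})

mask = (df_two[['c']].replace('Member', np.nan)
                     .notna()
                     .any()
                     .reindex(df_two.columns, fill_value=True))
print(df_two.loc[:, mask])
#        a      b
# 0   dave   blue
# 1   bill    red
# 2  sally  green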

Here's an alternate approach to do this.
import pandas as pd
import numpy as np

c = ['a', 'b', 'c']
d = [['dave', 'blue', np.nan],
     ['bill', 'red', np.nan],
     ['sally', 'green', 'Member'],
     ['Ian', 'Org', 'Paid']]
df1 = pd.DataFrame(d, columns=c)
df2 = df1.loc[df1['a'] != 'Ian']
print(df1)
print(df2)

if df1.c.replace('Member', np.nan).isnull().all():
    df1 = df1[df1.columns.drop(['c'])]
print(df1)

if df2.c.replace('Member', np.nan).isnull().all():
    df2 = df2[df2.columns.drop(['c'])]
print(df2)
Output of this is:
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 Ian Org Paid
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 Ian Org Paid
a b
0 dave blue
1 bill red
2 sally green

My idea is simple; maybe it will help you. I want to confirm the requirement first: drop the whole column if it contains only NaN or 'Member', otherwise do nothing.
So we need to check the column first (does it contain only NaN or 'Member'?). We change 'Member' to NaN and test whether everything is null.
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['dave', 'bill', 'sally', 'ian'],
                   'B': ['blue', 'red', 'green', 'org'],
                   'C': [np.nan, np.nan, 'Member', 'Paid']})
df2 = df.drop(index=[3], axis=0)
print(df)
print(df2)

# df
col = pd.Series([np.nan if x == 'Member' else x for x in df['C'].tolist()])
if col.isnull().all():
    df = df.drop(columns='C')

# df2
col = pd.Series([np.nan if x == 'Member' else x for x in df2['C'].tolist()])
if col.isnull().all():
    df2 = df2.drop(columns='C')

print(df)
print(df2)
A B C
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 ian org Paid
A B
0 dave blue
1 bill red
2 sally green

Related

Convert columns with large numbers and float64 as int type to suppress scientific formatting

I tried many SO answers but nothing has worked so far. I have a column in a df with large values like 89898989898989898, and whatever I do, they are not displayed as plain numbers. All the columns are of float64 dtype, and I do not have any fractional values in my df.
After creating a pivot I get a dataframe df; I tried converting to int and then writing to Excel, but that does not seem to make any difference and the values still display in scientific formatting (I can see the value when I click on the cell in the value bar). I could not convert to int directly because there are NaN values in the column:
ID CATEG LEVEL COLS VALUE COMMENTS
1 A 2 Apple 1e+13 comment1
1 A 3 Apple 1e+13 comment1
1 C 1 Apple 1e+13 comment1
1 C 2 Apple 345 comment1
1 C 3 Apple 289 comment1
1 B 1 Apple 712 comment1
1 B 2 Apple 1e+13 comment1
2 B 3 Apple 376 comment1
2 C None Orange 1e+13 comment1
2 B None Orange 135 comment1
2 D None Orange 423 comment1
2 A None Orange 866 comment1
2 None Orange 496 comment2
After pivot the Apple column looks like this (providing just sample values to show the scientific notation values) :-
index Apple
1655 1e+13
1656 1e+13
1657 1e+13
1658 NaN
1659 NaN
df = pd.pivot_table(dfe, index=['ID', 'CATEG', 'LEVEL'], columns=['COLS'], values=['VALUE'])
df = df.fillna(0).astype(np.int64)
with pd.ExcelWriter('file.xlsx', options={'nan_inf_to_errors': True}) as writer:
    df.groupby('ID').apply(lambda x: x.dropna(how='all', axis=1)
                             .to_excel(writer, sheet_name=str(x.name), na_rep=0, index=True))
    writer.save()
What should I do to get rid of the scientific formatting and have the values displayed as plain numbers in Excel?
Also, is there a way to autofit the columns in Excel while writing from Python? I'm using pd.ExcelWriter to write to Excel.
You can use set_option to avoid scientific formatting when pandas prints the frame (note that this only affects console display, not the Excel output):
pd.set_option('display.float_format', lambda x: '%.3f' % x)
Also you can change column type with this command:
df['col'] = df['col'].fillna(0)
df['col'] = df['col'].astype(int)
The following code generates a DataFrame with two columns, one with text and one with numbers, writes it to Excel, and changes the format so that Excel displays the numbers as dot-separated integers rather than in scientific notation. If you prefer a different format, I am sure you can adapt the format string in the code. I hope that's what you're looking for :).
import random
import numpy as np
import pandas as pd

col1 = [float("nan"), *np.random.randint(10**15, size=5)]
col2 = [random.choice(["apple", "orange"]) for _ in range(6)]

with pd.ExcelWriter("pandas_column_formats.xlsx", engine='xlsxwriter') as writer:
    pd.DataFrame(data={"numbers": col1, "strings": col2}).to_excel(writer)
    workbook = writer.book
    worksheet = writer.sheets['Sheet1']
    format_ = workbook.add_format({'num_format': '#,##'})
    worksheet.set_column('B:B', None, format_)
For your specific example try:
with pd.ExcelWriter('file.xlsx', options={'nan_inf_to_errors': True}) as writer:
    df.groupby('ID').apply(lambda x: x.dropna(how='all', axis=1)
                             .to_excel(writer, sheet_name=str(x.name), na_rep=0, index=True))
    workbook = writer.book
    format_ = workbook.add_format({'num_format': '#,##'})
    # One sheet is written per ID group, so apply the format to every sheet.
    for worksheet in writer.sheets.values():
        worksheet.set_column('C:C', None, format_)
        worksheet.set_column('E:E', None, format_)
Notice that I am guessing from the other thread that your columns with numbers will be C and E. You will probably have to widen the columns for the huge numbers; Excel will display ######### otherwise.
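On the autofit question: recent xlsxwriter releases provide worksheet.autofit(), but if your version lacks it, a common approximation is to size each column to its longest rendered value. A sketch (first_col is an assumption that depends on how many index levels to_excel writes before the data):
# Approximate autofit: measure the longest string per column (+2 padding).
first_col = 1  # assumed offset for the index column; adjust to your layout
for i, col in enumerate(df.columns):
    width = max(df[col].astype(str).map(len).max(), len(str(col))) + 2
    worksheet.set_column(first_col + i, first_col + i, width)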

Text data massaging to conduct distance calculations in python

I am trying to get the text data from dataframe "A" converted to columns, and the text data from dataframe "B" into rows, in a new dataframe "C", in order to run distance calculations.
Data in dataframe "A" looks like this
Unique -> header
'Amy'
'little'
'sheep'
'dead'
Data in dataframe "B" looks like this
common_words -> header
'Amy'
'George'
'Barbara'
I want the output in dataframe C as:
Amy George Barbara
Amy
little
sheep
dead
Can anyone help me with this?
What should the actual content of dataframe C be? Do you only want to initialise it to some value (e.g. 0) in the first step and then fill it with the distance calculations?
You could initialise C in the following way:
import pandas as pd
A = pd.DataFrame(['Amy', 'little', 'sheep', 'dead'])
B = pd.DataFrame(['Amy', 'George', 'Barbara'])
C = pd.DataFrame([[0] * len(B)] * len(A), index=A[0], columns=B[0])
C will then look like:
Amy George Barbara
0
Amy 0 0 0
little 0 0 0
sheep 0 0 0
dead 0 0 0
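The question leaves the distance metric open; as a hedged illustration using only the standard library, here is one way to fill C with difflib.SequenceMatcher similarity ratios (swap in whichever distance you actually need):
from difflib import SequenceMatcher

C = C.astype(float)  # the ratios are floats; avoid writing them into an int frame
for row_word in C.index:
    for col_word in C.columns:
        C.loc[row_word, col_word] = SequenceMatcher(None, row_word, col_word).ratio()
print(C)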
Use pd.DataFrame(index=[list], columns=[list]).
Extract the relevant lists using list(df.columnname.values).
Dummy data
print(dfA)
Header
0 Amy
1 little
2 sheep
3 dead
print(dfB)
Header
0 Amy
1 George
2 Barbara
dfC=pd.DataFrame(index=list(dfA.Header.values), columns=list(dfB.Header.values))
Amy George Barbara
Amy NaN NaN NaN
little NaN NaN NaN
sheep NaN NaN NaN
dead NaN NaN NaN
If you want dfC without NaNs, use:
dfC=pd.DataFrame(index=list(dfA.Header.values), columns=list(dfB.Header.values)).fillna(' ')
Amy George Barbara
Amy
little
sheep
dead

How to split a Dataframe column whose data is not unique

I have a column called users in dataframe which doesn't have a unique format. I am doing a data cleanup project as the data looks unreadable.
company Users
A [{"Name":"Martin","Email":"name_1#email.com","EmpType":"Full"},{"Name":"Rick","Email":"name_2#email.com","Dept":"HR"}]
B [{"Name":"John","Email":"name_2#email.com","EmpType":"Full","Dept":"Sales" }]
I used the query below, and it broke the dataframe down as follows:
df2 = df
df2 = df2.join(df['Users_config'].str.split('},{', expand=True).add_prefix('Users'))
company Users0 Users1
A "Name":"Martin","Email":"name_1#email.com","EmpType":"Full" "Name":"Rick","Email":"name_2#email.com","Dept":"HR"
B "Name":"John","Email":"name_2#email.com","EmpType":"Full","Dept":"Sales"
Further breaking the above df on "," with the same query, I got this output:
Company Users01 Users02 Users03 Users10 Users11 Users12
1 "Name":"Martin" "Email":"name_1#email.com" "EmpType":"Full" "Name":"Rick" "Email":"name_2#email.com" "Dept":"HR"
2 "Name":"John" "Email":"name_2#email.com" "EmpType":"Full" "Dept":"Sales"
As this dataframe looks messy, I want the output shown below. I feel the best way to name the columns is to take the keys from the "Name":"Martin" pairs themselves; if we hardcode names with df.rename, the column names will get mismatched.
Company Name_1 Email_1 EmpType_1 Dept_1 Name_2 Email_2 Dept_2
1 Martin name_1#email.com Full Rick name_2#email.com "HR"
2 John name_2#email.com" Full Sales
Is there any way I can get the above output from the original dataframe?
Use:
import ast

df['Users'] = df['Users'].apply(ast.literal_eval)
d = df.explode('Users').reset_index(drop=True)
d = d.join(pd.DataFrame(d.pop('Users').tolist()))
d = d.set_index(['company', d.groupby('company').cumcount().add(1).astype(str)]).unstack()
d.columns = d.columns.map('_'.join)
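For reference, a minimal hedged setup to run the snippet above (the Users column holds list-of-dict strings, mirroring the question's data with its '#'-obfuscated emails):
import pandas as pd

df = pd.DataFrame({
    'company': ['A', 'B'],
    'Users': [
        '[{"Name":"Martin","Email":"name_1#email.com","EmpType":"Full"},'
        '{"Name":"Rick","Email":"name_2#email.com","Dept":"HR"}]',
        '[{"Name":"John","Email":"name_2#email.com","EmpType":"Full","Dept":"Sales"}]',
    ],
})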
Details:
First we use ast.literal_eval to evaluate the strings in the Users column, then use DataFrame.explode on Users to create a dataframe d.
print(d)
company Users
0 A {'Name': 'Martin', 'Email': 'name_1#email.com', 'EmpType': 'Full'}
1 A {'Name': 'Rick', 'Email': 'name_2#email.com', 'Dept': 'HR'}
2 B {'Name': 'John', 'Email': 'name_2#email.com', 'EmpType': 'Full', 'Dept': 'Sales'}
Create a new dataframe from the Users column in d and use DataFrame.join to join this new dataframe with d.
print(d)
company Name Email EmpType Dept
0 A Martin name_1#email.com Full NaN
1 A Rick name_2#email.com NaN HR
2 B John name_2#email.com Full Sales
Use DataFrame.groupby on column company then use groupby.cumcount to create a counter for each group, then use DataFrame.set_index to set the index of d as company + counter. Then use DataFrame.unstack to reshape the dataframe creating MultiIndex columns.
print(d)
Name Email EmpType Dept
1 2 1 2 1 2 1 2
company
A Martin Rick name_1#email.com name_2#email.com Full NaN NaN HR
B John NaN name_2#email.com NaN Full NaN Sales NaN
Finally use map along with .join to flatten the MultiIndex columns.
print(d)
Name_1 Name_2 Email_1 Email_2 EmpType_1 EmpType_2 Dept_1 Dept_2
company
A Martin Rick name_1#email.com name_2#email.com Full NaN NaN HR
B John NaN name_2#email.com NaN Full NaN Sales NaN

How to detect value variables prior to reshape wide to long?

Consider the below df. Notice how the user Paul has two colors vs his name.
import numpy as np
import pandas as pd

df = pd.DataFrame({'names': ['Stacey', 'John', 'Paul'],
                   'blue': ['blue', np.nan, np.nan],
                   'yellow': [np.nan, 'yellow', np.nan],
                   'green': [np.nan, np.nan, 'green'],
                   'purple': [np.nan, np.nan, 'purple']})
print(df)
names blue yellow green purple
0 Stacey blue NaN NaN NaN
1 John NaN yellow NaN NaN
2 Paul NaN NaN green purple
If I am to reshape this df from wide to long, with pd.melt, I will expect the id 'Paul' entry to be duplicated.
df.melt(id_vars='names',
        value_vars=['blue', 'yellow', 'green', 'purple'],
        value_name='color').dropna().drop('variable', axis=1)
names color
0 Stacey blue
4 John yellow
8 Paul green
11 Paul purple
How would one isolate/detect the repeated entries in the initial df so that the output would be the following?
names blue yellow green purple
2 Paul NaN NaN green purple
Thank you in advance.
pandas 0.23.4
python 3.7.1
You can use count, which excludes missing values, and filter by boolean indexing:
df = df[df[['blue','yellow','green','purple']].count(axis=1) > 1]
print (df)
names blue yellow green purple
2 Paul NaN NaN green purple
Details:
print (df[['blue','yellow','green','purple']].count(axis=1))
0 1
1 1
2 2
dtype: int64

Append Two Dataframes Together (Pandas, Python3)

I am trying to append/join(?) two different dataframes together that don't share any overlapping data.
DF1 looks like
Teams Points
Red 2
Green 1
Orange 3
Yellow 4
....
Brown 6
and DF2 looks like
Area Miles
2 3
1 2
....
7 12
I am trying to append these together using
bigdata = df1.append(df2,ignore_index = True).reset_index()
but I get this
Teams Points
Red 2
Green 1
Orange 3
Yellow 4
Area Miles
2 3
1 2
How do I get something like this?
Teams Points Area Miles
Red 2 2 3
Green 1 1 2
Orange 3
Yellow 4
EDIT: in regard to EdChum's answer, I have tried merge and join, but each creates a somewhat strange table. Instead of what I am looking for (as listed above), it returns something like this:
Teams Points Area Miles
Red 2 2 3
Green 1
Orange 3 1 2
Yellow 4
Use concat and pass param axis=1:
In [4]:
pd.concat([df1,df2], axis=1)
Out[4]:
Teams Points Area Miles
0 Red 2 2 3
1 Green 1 1 2
2 Orange 3 NaN NaN
3 Yellow 4 NaN NaN
join also works:
In [8]:
df1.join(df2)
Out[8]:
Teams Points Area Miles
0 Red 2 2 3
1 Green 1 1 2
2 Orange 3 NaN NaN
3 Yellow 4 NaN NaN
As does merge:
In [11]:
df1.merge(df2,left_index=True, right_index=True, how='left')
Out[11]:
Teams Points Area Miles
0 Red 2 2 3
1 Green 1 1 2
2 Orange 3 NaN NaN
3 Yellow 4 NaN NaN
EDIT
In the case where the indices do not align, for example where your first df has index [0,1,2,3] and your second df has index [0,2], the above operations will naturally align against the first df's index, producing a NaN row at index 1. To fix this, reindex the second df, either by calling reset_index() or by assigning directly: df2.index = [0,1].
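A quick sketch of that fix with hypothetical misaligned frames:
import pandas as pd

df1 = pd.DataFrame({'Teams': ['Red', 'Green', 'Orange', 'Yellow'],
                    'Points': [2, 1, 3, 4]})
df2 = pd.DataFrame({'Area': [2, 1], 'Miles': [3, 2]}, index=[0, 2])

# Misaligned: df2's second row lands on index 2, leaving NaN at index 1.
print(pd.concat([df1, df2], axis=1))

# Fixed: reset df2's index so the rows align positionally.
print(pd.concat([df1, df2.reset_index(drop=True)], axis=1))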
