Pandas Conditional Statement and Replace - python-3.x

I have a dataframe that looks like this
  col_1              col_2          col_3              col_4
0 1                  security team  steve#example.com
1 james#example.com  chef           testing.csv
2 megan#example.com
I want to implement this logic:
If column 1 contains a value with a # sign, then set column 4 to that value;
else check column 2 and apply the same logic;
else check column 3 and apply the same logic;
else column 4 remains an empty string.
My output will look like this:
  col_1              col_2          col_3              col_4
0 1                  security team  steve#example.com  steve#example.com
1 james#example.com  chef           testing.csv        james#example.com
2 megan#example.com                                    megan#example.com
I thought about trying to use a Numpy select statement somehow, but I believe that may be too difficult.
Any suggestions?

Solution (if some rows have multiple values containing #, the emails are joined by ,):
s = df.stack(dropna=False)
df['col_4'] = s[s.str.contains('#', na=False)].groupby(level=0).agg(','.join)
print (df)
  col_1              col_2          col_3              col_4
0 1                  security team  steve#example.com  steve#example.com
1 james#example.com  chef           testing.csv        james#example.com
2 megan#example.com  NaN            NaN                megan#example.com
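Note that a row containing no # at all would end up as NaN in col_4 rather than the empty string the question asks for; a small hedged follow-up, assuming the frame above:
df['col_4'] = df['col_4'].fillna('')  # rows without any '#' become '' instead of NaN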
Another idea with list comprehension:
df['col_4'] = [','.join(y for y in x if '#' in y) for x in df.fillna('').to_numpy()]
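For completeness, the np.select approach the asker mentioned also works here. A minimal sketch, assuming the column names from the example and that non-string cells are handled by na=False:
import numpy as np

conditions = [df['col_1'].str.contains('#', na=False),
              df['col_2'].str.contains('#', na=False),
              df['col_3'].str.contains('#', na=False)]
choices = [df['col_1'], df['col_2'], df['col_3']]
# the first condition that is True wins, mirroring the col_1 -> col_2 -> col_3 priority
df['col_4'] = np.select(conditions, choices, default='')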


Pandas Drop an Entire Column if All of the Values equal a Certain Value

Let's say I have dataframes that look like this:
df_one
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 Ian Org Paid
df_two:
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
The logic I am trying to implement is something like this:
If all of column C = "NaN" then drop the entire column
Else if all of column C = "Member" drop the entire column
else do nothing
Any suggestions?
Edit: Added Expected Output
Expected Output if using on both data frames:
df_one
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 Ian Org Paid
df_two:
a b
0 dave blue
1 bill red
2 sally green
Edit #2: Why am I doing this in the first place?
I am ripping text from a PDF file and placing it into CSV files using the Tabula library.
The data is not coming out in the way that I am hoping it would, so I am applying ETL concepts to move the data around.
The final outcome would be for management to be able to open the final result into a nicely formatted Excel file.
Some of the columns have part of the headers put into a separate row and things got shifted around for some reason while ripping the data out of the PDF.
The headers look something like this:
Team Type Member Contact
Count
What I am doing is checking an entire column for certain header values. If the entire column has a header value, I'm dropping the entire column.
The idea is to replace Member with missing values first, then test whether there is at least one non-missing value with notna and any, and add True for all the other columns to the mask with Series.reindex:
mask = (df[['c']].replace('Member', np.nan)
                 .notna()
                 .any()
                 .reindex(df.columns, axis=1, fill_value=True))
print (mask)
Another idea is to chain both masks with & for bitwise AND:
mask = ((df[['c']].notna() & df[['c']].ne('Member'))
        .any()
        .reindex(df.columns, axis=1, fill_value=True))
print (mask)
Last, filter the columns with DataFrame.loc:
df = df.loc[:, mask]
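Since the real goal (per Edit #2) is to drop any column that only contains leftover header values, a more general sketch may help; the helper name drop_header_columns and the header_values argument are hypothetical, not part of the original answer:
def drop_header_columns(df, header_values):
    # drop every column whose non-null values are all in header_values
    # (an all-NaN column is dropped too, because the check on an empty Series is vacuously True)
    drop_cols = [c for c in df.columns
                 if df[c].dropna().isin(header_values).all()]
    return df.drop(columns=drop_cols)

# usage, e.g.: df_two = drop_header_columns(df_two, {'Member'})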
Here's an alternate approach to do this.
import pandas as pd
import numpy as np
c = ['a','b','c']
d = [['dave', 'blue', np.nan],
     ['bill', 'red', np.nan],
     ['sally', 'green', 'Member'],
     ['Ian', 'Org', 'Paid']]
df1 = pd.DataFrame(d, columns=c)
df2 = df1.loc[df1['a'] != 'Ian']
print (df1)
print (df2)
if df1.c.replace('Member', np.nan).isnull().all():
    df1 = df1[df1.columns.drop(['c'])]
print (df1)
if df2.c.replace('Member', np.nan).isnull().all():
    df2 = df2[df2.columns.drop(['c'])]
print (df2)
Output of this is:
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 Ian Org Paid
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 Ian Org Paid
a b
0 dave blue
1 bill red
2 sally green
My idea is simple; maybe it will help you. I want to make sure this is what you want: drop the whole column if it contains only NaN or 'Member', else do nothing.
So I need to check the column first (does it contain only NaN or 'Member'?). We change 'Member' to NaN and use an is-null test (or something else).
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':['dave','bill','sally','ian'],'B':['blue','red','green','org'],'C':[np.nan,np.nan,'Member','Paid']})
df2 = df.drop(index=[3],axis=0)
print(df)
print(df2)
# df1
col = pd.Series([np.nan if x == 'Member' else x for x in df['C'].tolist()])
if col.isnull().all():
    df = df.drop(columns='C')
# df2
col = pd.Series([np.nan if x == 'Member' else x for x in df2['C'].tolist()])
if col.isnull().all():
    df2 = df2.drop(columns='C')
print(df)
print(df2)
A B C
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 ian org Paid
A B
0 dave blue
1 bill red
2 sally green

How to check and delete pandas column if all rows contain just punctuation character?

I have a dataframe with multiple columns and I want to delete the columns where all rows contain just a punctuation character, e.g.:
col_1 col_2 col_3 col_4
0 1 _ ab 1,235
1 2 ? cd 8,900
2 3 _ ef 1,235
3 4 - gh 8,900
Here I just want to delete col_2. How can I achieve that?
The idea is to test whether all values of a column contain a digit or a word character with Series.str.contains inside DataFrame.apply and DataFrame.all, and last filter with DataFrame.loc:
df = df.loc[:, df.astype(str).apply(lambda x: x.str.contains(r'\d|\w')).all()]
Or:
df = df.loc[:, df.astype(str).apply(lambda x: x.str.contains(r'\d|[a-zA-Z]')).all()]
print (df)
col_1 col_3 col_4
0 1 ab 1,235
1 2 cd 8,900
2 3 ef 1,235
3 4 gh 8,900
If it is possible to get all the values to remove into one string, you can add ^ for start of string and $ for end of string and then invert the mask with ~:
p = """[!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ ]"""
df = df.loc[:, ~df.astype(str).apply(lambda x: x.str.contains('^' + p + '$')).all()]
print (df)
col_1 col_3 col_4
0 1 ab 1,235
1 2 cd 8,900
2 3 ef 1,235
3 4 gh 8,900
You could use Python's string.punctuation constant and filter out the columns that contain only punctuation with pandas' all method:
import string
#create a list of individual punctuations
punctuation_list = list(string.punctuation)
#filter out unwanted column
df.loc[:,~df.isin(punctuation_list).all()]
col_1 col_3 col_4
0 1 ab 1,235
1 2 cd 8,900
2 3 ef 1,235
3 4 gh 8,900
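Another hedged variation builds the pattern from string.punctuation itself and uses Series.str.fullmatch (available from pandas 1.1), so cells made of several punctuation characters are caught as well; the frame below just recreates the example:
import re
import string
import pandas as pd

df = pd.DataFrame({'col_1': [1, 2, 3, 4],
                   'col_2': ['_', '?', '_', '-'],
                   'col_3': ['ab', 'cd', 'ef', 'gh'],
                   'col_4': ['1,235', '8,900', '1,235', '8,900']})

# character class of punctuation, escaped so regex metacharacters stay literal
punct = '[' + re.escape(string.punctuation) + ']+'

# keep only columns where not every cell consists purely of punctuation
df = df.loc[:, ~df.astype(str).apply(lambda x: x.str.fullmatch(punct)).all()]
print(df)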

Pandas: How to increment a new column based on increment and consecutive properties of 2 other columns?

I'm currently working on a bulk data pre-processing framework in pandas and since I'm relatively new to pandas, I can't seem to solve this problem:
Given: a dataset with 2 columns: col_1, col_2
Required: a new column req_col such that its value is incremented if
a. the values in col_1 are not consecutive, OR
b. the value of col_2 is incremented
NOTE:
col_2 always starts from 1 and always increases in value and values are never missing (always consecutive), eg:1,1,2,2,3,3,4,5,6,6,6,7,8,8,9.....
col_1 always starts from 0 and always increases in value, but some
values can be missing (need not be consecutive), eg:0,1,2,2,3,6,6,6,10,10,10...
EXPECTED ANSWER:
col_1 col_2 req_col #Changes in req_col explained below
0 1 1
0 1 1
0 2 2 #because col_2 value has incremented
1 2 2
1 2 2
3 2 3 #because '3' is not consecutive to '1' in col_1
3 3 4 #because of increment in col_2
5 3 5 #because '5' is not consecutive to '3' in col_1
6 4 6 #because of increment in col_2 and so on...
6 4 6
Try:
df['req_col'] = (df['col_1'].diff().gt(1) |   # col_1 is not consecutive
                 df['col_2'].diff().ne(0)     # col_2 has a jump
                ).cumsum()
Output:
0 1
1 1
2 2
3 2
4 2
5 3
6 4
7 5
8 6
9 6
dtype: int32
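A minimal self-contained sketch that rebuilds the sample data from the question and applies the same cumsum logic, for anyone who wants to reproduce the output:
import pandas as pd

df = pd.DataFrame({'col_1': [0, 0, 0, 1, 1, 3, 3, 5, 6, 6],
                   'col_2': [1, 1, 2, 2, 2, 2, 3, 3, 4, 4]})

df['req_col'] = (df['col_1'].diff().gt(1) |   # jump of more than 1 in col_1
                 df['col_2'].diff().ne(0)     # any change in col_2 (first row counts, since NaN != 0)
                ).cumsum()
print(df)   # req_col: 1 1 2 2 2 3 4 5 6 6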

Prevent column name from disappearing after using replace on dataframe

So I have a real dataframe that somewhat follows the next structure:
d = {'col1':['1_ABC','2_DEF','3 GHI']}
df = pd.DataFrame(data=d)
Basically, some entries have the " _ ", others have " ".
My goal is to split that first number into a new column and keep the rest. For this, I thought I'd first replace the '_' by ' ' to normalize everything, and then simply split by ' ' to get the new column.
#Replace the '_' for ' '
new_df['Name'] = df['col1'].str.replace('_',' ')
My problem is that my new_df has now lost its column name:
0 1 ABC
1 2 DEF
Any way to prevent this from happening?
Thanks!
The function str.replace returns a Series, so there is no column name, only a Series name.
s = df['col1'].str.replace('_',' ')
print (s)
0 1 ABC
1 2 DEF
2 3 GHI
Name: col1, dtype: object
print (type(s))
<class 'pandas.core.series.Series'>
print (s.name)
col1
If you need a new column, assign it to the same DataFrame as df['Name']:
df['Name'] = df['col1'].str.replace('_',' ')
print (df)
col1 Name
0 1_ABC 1 ABC
1 2_DEF 2 DEF
2 3 GHI 3 GHI
Or overwrite the values of the original column:
df['col1'] = df['col1'].str.replace('_',' ')
print (df)
col1
0 1 ABC
1 2 DEF
2 3 GHI
If you need a new one-column DataFrame, use Series.to_frame to convert the Series to a DataFrame:
df2 = df['col1'].str.replace('_',' ').to_frame()
print (df2)
col1
0 1 ABC
1 2 DEF
2 3 GHI
It is also possible to define a new column name:
df1 = df['col1'].str.replace('_',' ').to_frame('New')
print (df1)
New
0 1 ABC
1 2 DEF
2 3 GHI
Like @anky_91 commented, if you need 2 new columns, add str.split:
df1 = df['col1'].str.replace('_',' ').str.split(expand=True)
df1.columns = ['A','B']
print (df1)
A B
0 1 ABC
1 2 DEF
2 3 GHI
If you need to add the columns to the existing DataFrame:
df[['A','B']] = df['col1'].str.replace('_',' ').str.split(expand=True)
print (df)
col1 A B
0 1_ABC 1 ABC
1 2_DEF 2 DEF
2 3 GHI 3 GHI
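If the end goal is just the leading number plus the remainder, a single str.extract pass is another hedged option; the column names num and rest are arbitrary and not part of the original answer:
df[['num', 'rest']] = df['col1'].str.extract(r'^(\d+)[_ ](.+)$')
print(df)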

How do I make a pandas DataFrame's values across multiple columns into its columns?

I have the following dataframe loaded up in Pandas.
print(pandaDf)
id col1 col2 col3
12a a b d
22b d a b
33c c a b
I am trying to convert the values spread across multiple columns into columns of their own, so the output would look like this:
Desired output:
id a b c d
12a 1 1 0 1
22b 1 1 0 1
33c 1 1 1 0
I have tried adding in a value column where the value = 1 and using a pivot table
pandaDf['value'] = 1
column = ['col1', 'col2', 'col3']
pandaDf.pivot_table(index = 'id', values = 'value', columns = column)
However, the resulting data frame is a multilevel index and the pandaDf.pivot() method does not allow multiple column values.
Please advise about how I could do this with an output of a single level index.
Thanks for taking the time to read this and I apologize if I have made any formatting errors in posting the question. I am still learning the proper stackoverflow syntax.
You can use One-Hot Encoding to solve this problem.
Here is one way to do this with pd.get_dummies plus some MultiIndex flattening and summing:
df1 = df.set_index('id')
df_out = pd.get_dummies(df1)
df_out.columns = df_out.columns.str.split('_', expand=True)
df_out = df_out.sum(level=1, axis=1).reset_index()
print(df_out)
Output:
id a c d b
0 12a 1 0 1 1
1 22b 1 0 1 1
2 33c 1 1 0 1
Using get_dummies
pd.get_dummies(df.set_index('id'),prefix='', prefix_sep='').sum(level=0,axis=1)
Out[81]:
a c d b
id
12a 1 0 1 1
22b 1 0 1 1
33c 1 1 0 1
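A further hedged alternative avoids get_dummies altogether: melt to long form and count occurrences per id with pd.crosstab (the frame below just recreates the sample data):
import pandas as pd

pandaDf = pd.DataFrame({'id': ['12a', '22b', '33c'],
                        'col1': ['a', 'd', 'c'],
                        'col2': ['b', 'a', 'a'],
                        'col3': ['d', 'b', 'b']})

long = pandaDf.melt(id_vars='id', value_name='val')
out = pd.crosstab(long['id'], long['val']).reset_index()
print(out)   # columns: id, a, b, c, d with 0/1 counts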
