How to detect value variables prior to reshape wide to long? - python-3.x

Consider the df below. Notice how the user Paul has two colors against his name.
df = pd.DataFrame({'names': ['Stacey', 'John', 'Paul'],
                   'blue': ['blue', np.nan, np.nan],
                   'yellow': [np.nan, 'yellow', np.nan],
                   'green': [np.nan, np.nan, 'green'],
                   'purple': [np.nan, np.nan, 'purple']})
print(df)
names blue yellow green purple
0 Stacey blue NaN NaN NaN
1 John NaN yellow NaN NaN
2 Paul NaN NaN green purple
If I reshape this df from wide to long with pd.melt, I expect the 'Paul' entry to be duplicated:
df.melt(id_vars='names',
        value_vars=['blue', 'yellow', 'green', 'purple'],
        value_name='color').dropna().drop('variable', axis=1)
names color
0 Stacey blue
4 John yellow
8 Paul green
11 Paul purple
How would one isolate/detect the repeated entries in the initial df, so that the output would be:
names blue yellow green purple
2 Paul NaN NaN green purple
Thank you in advance.
pandas 0.23.4
python 3.7.1

You can use count, which excludes missing values, and filter with boolean indexing:
df = df[df[['blue','yellow','green','purple']].count(axis=1) > 1]
print (df)
names blue yellow green purple
2 Paul NaN NaN green purple
Details:
print (df[['blue','yellow','green','purple']].count(axis=1))
0 1
1 1
2 2
dtype: int64
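If you would rather not hardcode the list of color columns, a minimal variation (assuming every column other than names holds a color) is:
value_cols = df.columns.drop('names')        # every non-id column
df = df[df[value_cols].count(axis=1) > 1]    # keep rows with 2+ non-missing colors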

Related

Convert one dataframe's format and check if each row exists in another dataframe in Python

Given a small dataset df1 as follow:
city year quarter
0 sh 2019 q4
1 bj 2020 q3
2 bj 2020 q2
3 sh 2020 q4
4 sh 2020 q1
5 bj 2021 q1
I would like to create a quarterly date range from 2019-q2 to 2021-q1 as the column names of df2, then check, for each city, whether each year-quarter pair from df1 appears.
If it does, return 'y' for that cell; otherwise, return NaN.
The final result will like:
city 2019-q2 2019-q3 2019-q4 2020-q1 2020-q2 2020-q3 2020-q4 2021-q1
0 bj NaN NaN NaN NaN y y NaN y
1 sh NaN NaN y y NaN NaN y NaN
To create column names for df2:
pd.date_range('2019-04-01', '2021-04-01', freq = 'Q').to_period('Q')
How could I achieve this in Python? Thanks.
We can use crosstab on city and the string concatenation of the year and quarter columns:
new_df = pd.crosstab(df['city'], df['year'].astype(str) + '-' + df['quarter'])
new_df:
col_0 2019-q4 2020-q1 2020-q2 2020-q3 2020-q4 2021-q1
city
bj 0 0 1 1 0 1
sh 1 1 0 0 1 0
We can convert the counts to bool, replace False and True with the desired values, reindex to add the missing columns, and clean up the axis name and index to get the exact output:
col_names = pd.date_range('2019-01-01', '2021-04-01', freq='Q').to_period('Q')
new_df = (
    pd.crosstab(df['city'], df['year'].astype(str) + '-' + df['quarter'])
      .astype(bool)                                    # counts to boolean
      .replace({False: np.NaN, True: 'y'})             # fill values
      .reindex(columns=col_names.strftime('%Y-q%q'))   # add missing columns
      .rename_axis(columns=None)                       # clean up the axis name
      .reset_index()                                   # reset the index
)
new_df:
city 2019-q1 2019-q2 2019-q3 2019-q4 2020-q1 2020-q2 2020-q3 2020-q4 2021-q1
0 bj NaN NaN NaN NaN NaN y y NaN y
1 sh NaN NaN NaN y y NaN NaN y NaN
DataFrame and imports:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'city': ['sh', 'bj', 'bj', 'sh', 'sh', 'bj'],
    'year': [2019, 2020, 2020, 2020, 2020, 2021],
    'quarter': ['q4', 'q3', 'q2', 'q4', 'q1', 'q1']
})
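A pivot_table route gives the same shape. This is only a sketch of an alternative, reusing the col_names PeriodIndex defined above; the helper column q and marker flag are hypothetical names, and aggfunc='first' simply keeps the 'y' marker wherever a (city, quarter) pair exists:
out = (
    df.assign(q=df['year'].astype(str) + '-' + df['quarter'], flag='y')
      .pivot_table(index='city', columns='q', values='flag', aggfunc='first')
      .reindex(columns=col_names.strftime('%Y-q%q'))   # add missing quarters as NaN
      .rename_axis(columns=None)
      .reset_index()
)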

Pandas Drop an Entire Column if All of the Values equal a Certain Value

Let's say I have dataframes that look like this:
df_one
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 Ian Org Paid
df_two:
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
The logic I am trying to implement is something like this:
If all of column C = "NaN" then drop the entire column
Else if all of column C = "Member" drop the entire column
else do nothing
Any suggestions?
Edit: Added Expected Output
Expected Output if using on both data frames:
df_one
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 Ian Org Paid
df_two:
a b
0 dave blue
1 bill red
2 sally green
Edit #2: Why am I doing this in the first place?
I am ripping text from a PDF file and placing it into CSV files using the Tabula library.
The data is not coming out the way I hoped it would, so I am applying ETL concepts to move it around.
The final outcome would be for management to be able to open the final result into a nicely formatted Excel file.
Some of the columns have part of the headers put into a separate row and things got shifted around for some reason while ripping the data out of the PDF.
The headers look something like this:
Team Type Member Contact
Count
What I am doing is checking an entire column for certain header values. If the entire column has a header value, I'm dropping the entire column.
The idea is to replace 'Member' with missing values first, then test whether at least one value is non-missing using notna with any, and mark all the remaining columns True in the mask via Series.reindex:
mask = (df[['c']].replace('Member', np.nan)
                 .notna()
                 .any()
                 .reindex(df.columns, fill_value=True))
print (mask)
Another idea is to chain both masks with & for bitwise AND:
mask = ((df[['c']].notna() & df[['c']].ne('Member'))
        .any()
        .reindex(df.columns, fill_value=True))
print (mask)
Finally, filter the columns with DataFrame.loc:
df = df.loc[:, mask]
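For df_two, where column c holds only NaN and 'Member', the printed mask comes out as:
a     True
b     True
c    False
dtype: bool
so df.loc[:, mask] keeps only columns a and b, matching the expected output.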
Here's an alternate approach to do this.
import pandas as pd
import numpy as np

c = ['a', 'b', 'c']
d = [['dave', 'blue', np.NaN],
     ['bill', 'red', np.NaN],
     ['sally', 'green', 'Member'],
     ['Ian', 'Org', 'Paid']]
df1 = pd.DataFrame(d, columns=c)
df2 = df1.loc[df1['a'] != 'Ian']
print(df1)
print(df2)

if df1.c.replace('Member', np.NaN).isnull().all():
    df1 = df1[df1.columns.drop(['c'])]
print(df1)

if df2.c.replace('Member', np.NaN).isnull().all():
    df2 = df2[df2.columns.drop(['c'])]
print(df2)
Output of this is:
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 Ian Org Paid
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 Ian Org Paid
a b
0 dave blue
1 bill red
2 sally green
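Since the same check runs for both frames, one way to avoid the duplication is a small helper; drop_if_all_equal is a hypothetical name, and this is just a sketch of the same logic:
def drop_if_all_equal(frame, col, value):
    # drop col when every entry is NaN or equals value
    if frame[col].replace(value, np.NaN).isnull().all():
        return frame.drop(columns=col)
    return frame

df1 = drop_if_all_equal(df1, 'c', 'Member')
df2 = drop_if_all_equal(df2, 'c', 'Member')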
My idea is simple; maybe it will help you. To confirm the requirement: drop the whole column if it contains only NaN or 'Member', else do nothing.
So we need to check the column first (does it contain only NaN or 'Member'?). We change 'Member' to NaN and test whether everything is null.
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['dave', 'bill', 'sally', 'ian'],
                   'B': ['blue', 'red', 'green', 'org'],
                   'C': [np.nan, np.nan, 'Member', 'Paid']})
df2 = df.drop(index=[3])
print(df)
print(df2)

# df
col = pd.Series([np.nan if x == 'Member' else x for x in df['C'].tolist()])
if col.isnull().all():
    df = df.drop(columns='C')

# df2
col = pd.Series([np.nan if x == 'Member' else x for x in df2['C'].tolist()])
if col.isnull().all():
    df2 = df2.drop(columns='C')
print(df)
print(df2)
A B C
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 ian org Paid
A B
0 dave blue
1 bill red
2 sally green
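Given Edit #2, where several columns may consist entirely of stray header values, a sketch that scans every column could look like the following (header_values is an assumed list of the strings to treat as empty, and df is any of the frames above):
import numpy as np

header_values = ['Member']  # assumed set of stray header strings
to_drop = [col for col in df.columns
           if df[col].replace(header_values, np.nan).isna().all()]
df = df.drop(columns=to_drop)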

Splitting Column Lists in Pandas DataFrame

I'm looking for a good way to solve the following problem. My current fix is not particularly clean, and I'm hoping to learn from your insight.
Suppose I have a pandas DataFrame whose entries look like this:
>>> df=pd.DataFrame(index=[1,2,3],columns=['Color','Texture','IsGlass'])
>>> df['Color']=[np.nan,['Red','Blue'],['Blue', 'Green', 'Purple']]
>>> df['Texture']=[['Rough'],np.nan,['Silky', 'Shiny', 'Fuzzy']]
>>> df['IsGlass']=[1,0,1]
>>> df
Color Texture IsGlass
1 NaN ['Rough'] 1
2 ['Red', 'Blue'] NaN 0
3 ['Blue', 'Green', 'Purple'] ['Silky','Shiny','Fuzzy'] 1
So each observation in the index corresponds to something I measured about its color, texture, and whether it's glass or not. What I'd like to do is turn this into a new "indicator" DataFrame, by creating a column for each observed value, and changing the corresponding entry to a one if I observed it, and NaN if I have no information.
>>> df
Red Blue Green Purple Rough Silky Shiny Fuzzy IsGlass
1 NaN NaN NaN NaN 1 NaN NaN NaN 1
2 1 1 NaN NaN NaN NaN NaN NaN 0
3 NaN 1 1 1 NaN 1 1 1 1
I have a solution that loops over each column, looks at its values, and, through a series of try/excepts for non-NaN values, splits the lists, creates a new column, etc., and concatenates.
This is my first post to StackOverflow - I hope this post conforms to the posting guidelines. Thanks.
Stacking Hacks!
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

# stack drops the NaN cells; unstack(fill_value=[]) restores them as empty
# lists, so every cell now holds an iterable that mlb can consume
df = df.stack().unstack(fill_value=[])

def b(c):
    d = mlb.fit_transform(c)
    return pd.DataFrame(d, c.index, mlb.classes_)

pd.concat([b(df[c]) for c in ['Color', 'Texture']], axis=1).join(df.IsGlass)
Blue Green Purple Red Fuzzy Rough Shiny Silky IsGlass
1 0 0 0 0 0 1 0 0 1
2 1 0 0 1 0 0 0 0 0
3 1 1 1 0 1 0 1 1 1
I am just using pandas get_dummies:
l = [pd.get_dummies(df[x].apply(pd.Series).stack(dropna=False)).sum(level=0)
     for x in ['Color', 'Texture']]
pd.concat(l, axis=1).assign(IsGlass=df.IsGlass)
Out[662]:
Blue Green Purple Red Fuzzy Rough Shiny Silky IsGlass
1 0 0 0 0 0 1 0 0 1
2 1 0 0 1 0 0 0 0 0
3 1 1 1 0 1 0 1 1 1
For each row, I check whether the Color/Texture value is null. If not, I add each list element as a column set to 1 for that row.
import numpy as np
import pandas as pd

df = pd.DataFrame(index=[1, 2, 3], columns=['Color', 'Texture', 'IsGlass'])
df['Color'] = [np.nan, ['Red', 'Blue'], ['Blue', 'Green', 'Purple']]
df['Texture'] = [['Rough'], np.nan, ['Silky', 'Shiny', 'Fuzzy']]
df['IsGlass'] = [1, 0, 1]

for row in df.itertuples():
    if not np.all(pd.isnull(row.Color)):
        for val in row.Color:
            df.loc[row.Index, val] = 1
    if not np.all(pd.isnull(row.Texture)):
        for val in row.Texture:
            df.loc[row.Index, val] = 1
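On newer pandas (0.25+ adds Series.explode), a similar indicator table can be built without an explicit loop. This is a sketch against the original df from the question, not any of the answers' code:
import pandas as pd

dummies = pd.concat(
    [pd.get_dummies(df[c].explode()).groupby(level=0).max()  # collapse back to one row per index
     for c in ['Color', 'Texture']],
    axis=1)
result = dummies.join(df['IsGlass'])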

Replace values in pandas column based on nan in another column

For pairs of columns, I want to replace the values in the second column with NaN wherever the value in the first column is NaN.
I have tried the following without success:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['r', np.nan, np.nan, 's'], 'b': [0.5, 0.5, 0.2, 0.02],
                   'c': ['n', 'r', np.nan, 's'], 'd': [1, 0.5, 0.2, 0.05]})

listA = ['a', 'c']
listB = ['b', 'd']
for color, ratio in zip(listA, listB):
    df.loc[df[color].isnull(), ratio] == np.nan
df remains unchanged.
Another attempt using a function (also failed):
def Test(df):
    if df[color] == np.nan:
        return df[ratio] == np.nan
    else:
        return

for color, ratio in zip(listA, listB):
    df[ratio] = df.apply(Test, axis=1)
Thanks
It seems you have a typo; change == to =:
for color, ratio in zip(listA, listB):
    df.loc[df[color].isnull(), ratio] = np.nan
print (df)
a b c d
0 r 0.50 n 1.00
1 NaN NaN r 0.50
2 NaN NaN NaN NaN
3 s 0.02 s 0.05
Another solution uses mask, which by default replaces values where the condition is True with NaN:
for color, ratio in zip(listA, listB):
    df[ratio] = df[ratio].mask(df[color].isnull())
print (df)
a b c d
0 r 0.50 n 1.00
1 NaN NaN r 0.50
2 NaN NaN NaN NaN
3 s 0.02 s 0.05
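As an aside, the function-based attempt can never work because NaN compares unequal to everything, including itself, so df[color] == np.nan is always False; use isnull()/isna() instead. A quick check:
import numpy as np
import pandas as pd

print(np.nan == np.nan)   # False: NaN is unequal even to itself
print(pd.isna(np.nan))    # True: the correct test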

Append Two Dataframes Together (Pandas, Python3)

I am trying to append/join(?) two different dataframes together that don't share any overlapping data.
DF1 looks like
Teams Points
Red 2
Green 1
Orange 3
Yellow 4
....
Brown 6
and DF2 looks like
Area Miles
2 3
1 2
....
7 12
I am trying to append these together using
bigdata = df1.append(df2,ignore_index = True).reset_index()
but I get this
Teams Points
Red 2
Green 1
Orange 3
Yellow 4
Area Miles
2 3
1 2
How do I get something like this?
Teams Points Area Miles
Red 2 2 3
Green 1 1 2
Orange 3
Yellow 4
EDIT: In regard to EdChum's answer, I have tried merge and join, but each creates a somewhat strange table. Instead of what I am looking for (as listed above), it returns something like this:
Teams Points Area Miles
Red 2 2 3
Green 1
Orange 3 1 2
Yellow 4
Use concat and pass param axis=1:
In [4]:
pd.concat([df1,df2], axis=1)
Out[4]:
Teams Points Area Miles
0 Red 2 2 3
1 Green 1 1 2
2 Orange 3 NaN NaN
3 Yellow 4 NaN NaN
join also works:
In [8]:
df1.join(df2)
Out[8]:
Teams Points Area Miles
0 Red 2 2 3
1 Green 1 1 2
2 Orange 3 NaN NaN
3 Yellow 4 NaN NaN
As does merge:
In [11]:
df1.merge(df2,left_index=True, right_index=True, how='left')
Out[11]:
Teams Points Area Miles
0 Red 2 2 3
1 Green 1 1 2
2 Orange 3 NaN NaN
3 Yellow 4 NaN NaN
EDIT
In the case where the indices do not align, for example when your first df has index [0, 1, 2, 3] and your second df has index [0, 2], the above operations will naturally align against the first df's index, resulting in NaN values for index rows 1 and 3. To fix this, reindex the second df first, either by calling reset_index() or by assigning directly: df2.index = [0, 1].
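A minimal sketch of that fix, assuming the frames from this question:
df2 = df2.reset_index(drop=True)   # index becomes [0, 1]
pd.concat([df1, df2], axis=1)      # now aligns row-for-row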
