Manipulating a dataframe conditionally - python-3.x

I have the following data and am attempting to do the following:
If elements in tag_3 & tag_4 are 'NaN' then return an intermediate df with the following columns: tag_0, tag_1 & tag_2.
If elements in tag_4 only are 'NaN' then return another intermediate df with the following columns: tag_0, tag_2, tag_3.
Finally if ALL columns have non-NaN values then return an intermediate df with the following columns: tag_0, tag_3, tag_4.
DATA:
import pandas as pd

data = {'tag_0': ['1', '2', '3'],
        'tag_1': ['4', '5', '6'],
        'tag_2': ['7', '8', '9'],
        'tag_3': ['NaN', '10', '11'],
        'tag_4': ['NaN', 'NaN', '12']}
df_1 = pd.DataFrame(data, columns=['tag_0', 'tag_1', 'tag_2', 'tag_3', 'tag_4'])
dummy data

I like to use bool masks for this sort of task in pandas because I think they are easy to read, but there are other ways to go about it.
What is a bool mask?
A bool mask is essentially a Series of True/False values that is applied to a DataFrame to filter it.
Step 1: create the Series of True/False values.
tag_3_is_nan = df['tag_3'].isna()
tag_4_is_nan = df['tag_4'].isna()
Step 2: apply them to the DataFrame
df[bool_mask]
In your case this would be applied using the following logic.
Case 1: If elements in tag_3 & tag_4 are 'NaN' then return an intermediate df with the following columns: tag_0, tag_1 & tag_2.
df[tag_3_is_nan & tag_4_is_nan][['tag_0', 'tag_1', 'tag_2']]
Case 2: If elements in tag_4 only are 'NaN' then return another intermediate df with the following columns: tag_0, tag_2, tag_3.
df[tag_4_is_nan & ~tag_3_is_nan][['tag_0', 'tag_2', 'tag_3']]
The ~ is a logical NOT, so ~tag_3_is_nan means tag_3 is not NaN.
Case 3: Finally if ALL columns have non-NaN values then return an intermediate df with the following columns: tag_0, tag_3, tag_4.
Dropping all rows that contain at least one NaN value is simple in pandas - just use the method dropna()
df.dropna()[['tag_0', 'tag_3', 'tag_4']]
To avoid a SettingWithCopyWarning down the line you should call .copy() on the filtered df.
The above relies on isna(), which detects real missing values (None / np.nan), but your example stores 'NaN' as a string. You can use the same approach if your data contains 'NaN' strings rather than actual missing values by comparing against the string:
tag_3_is_nan_string = df['tag_3'] == 'NaN'
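Putting this together on your sample frame, a minimal sketch (my assumption: the 'NaN' strings are meant to be real missing values, so I convert them first so that isna() and dropna() behave as above):

import numpy as np
import pandas as pd

data = {'tag_0': ['1', '2', '3'],
        'tag_1': ['4', '5', '6'],
        'tag_2': ['7', '8', '9'],
        'tag_3': ['NaN', '10', '11'],
        'tag_4': ['NaN', 'NaN', '12']}
df = pd.DataFrame(data)

# Assumption: treat the 'NaN' strings as real missing values.
df = df.replace('NaN', np.nan)

tag_3_is_nan = df['tag_3'].isna()
tag_4_is_nan = df['tag_4'].isna()

# Case 1: tag_3 and tag_4 both missing -> keep tag_0, tag_1, tag_2
df_case_1 = df[tag_3_is_nan & tag_4_is_nan][['tag_0', 'tag_1', 'tag_2']].copy()

# Case 2: only tag_4 missing -> keep tag_0, tag_2, tag_3
df_case_2 = df[tag_4_is_nan & ~tag_3_is_nan][['tag_0', 'tag_2', 'tag_3']].copy()

# Case 3: no missing values in the row -> keep tag_0, tag_3, tag_4
df_case_3 = df.dropna()[['tag_0', 'tag_3', 'tag_4']].copy()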

Related

concat_ws and coalesce in pyspark

In PySpark, I want to combine concat_ws and coalesce whilst using the list method. For example, I know this works:
from pyspark.sql.functions import concat_ws, coalesce, col, lit
df = spark.createDataFrame([["A", "B"], ["C", None], [None, "D"]]).toDF("Type", "Segment")
#display(df)
df = df.withColumn("concat_ws2", concat_ws(':', coalesce('Type', lit("")), coalesce('Segment', lit(""))))
display(df)
But I want to be able to utilise the *[list] method so I don't have to list out all the columns within that bit of code, i.e. something like this instead:
from pyspark.sql.functions import concat_ws, col
df = spark.createDataFrame([["A", "B"], ["C", None], [None, "D"]]).toDF("Type", "Segment")
list = ["Type", "Segment"]
df = df.withColumn("almost_desired_output", concat_ws(':', *list))
display(df)
However, as you can see, I want to be able to coalesce NULL with a blank, but I'm not sure if that's possible using the *[list] method, or do I really have to list out all the columns?
This would work:
Iterate over the list of column names:
df=df.withColumn("almost_desired_output", concat_ws(':', *[coalesce(name, lit('')).alias(name) for name in df.schema.names]))
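For reference, a minimal end-to-end sketch on the sample frame (assuming an active SparkSession; the expected result is sketched as comments, not copied from a real run):

from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce, concat_ws, lit

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([["A", "B"], ["C", None], [None, "D"]]).toDF("Type", "Segment")

# Coalesce every column with an empty string, then concat them all.
df = df.withColumn("almost_desired_output",
                   concat_ws(':', *[coalesce(name, lit('')).alias(name) for name in df.schema.names]))
df.show()
# Expected rows (roughly):
#   Type=A,    Segment=B    -> almost_desired_output = "A:B"
#   Type=C,    Segment=null -> almost_desired_output = "C:"
#   Type=null, Segment=D    -> almost_desired_output = ":D"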
Or, use na.fill - it'll fill all the null values across all columns of the DataFrame (but this changes the actual columns, which may break some use-cases):
df.na.fill("").withColumn("almost_desired_output", concat_ws(':', *list))
Or, use selectExpr (again, this changes the actual columns, which may break some use-cases):
list = ["Type", "Segment"] # or just use df.schema.names
list2 = ["coalesce(type,' ') as Type", "coalesce(Segment,' ') as Segment"]
df=df.selectExpr(list2).withColumn("almost_desired_output", concat_ws(':', *list))

How to divide multilevel columns in Python

I have a df like this:
import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz'],
          ['one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(3, 4), index=['A', 'B', 'C'], columns=index)
df.head()
returning a DataFrame with MultiIndex columns (bar and baz on the first level, one and two on the second).
I want to add some columns where all second level dimensions are divided by each other - bar one is divided by baz one, and bar two is divided by baz two, etc.
df[["bar"]]/df[["baz"]]
and
df[["bar"]].div(df[["baz"]])
both return NaNs.
You can avoid this by selecting each first level with a single [] instead of [[]], which drops that level so the second-level columns line up:
df1 = df["bar"]/df["baz"]
print (df1)
second        one        two
A        1.564478  -0.115979
B       14.604267 -19.749265
C       -0.511788  -0.436637
If you want to add back a MultiIndex level, use MultiIndex.from_product:
df1.columns = pd.MultiIndex.from_product([['new'], df1.columns], names=df.columns.names)
print (df1)
first               new
second        one        two
A        1.564478  -0.115979
B       14.604267 -19.749265
C       -0.511788  -0.436637
Another idea for getting a MultiIndex in the output is to use your solution, but rename the first-level columns to the same name, here new:
df2 = df[["bar"]].rename(columns={'bar':'new'})/df[["baz"]].rename(columns={'baz':'new'})
print (df2)
first               new
second        one        two
A        1.564478  -0.115979
B       14.604267 -19.749265
C       -0.511788  -0.436637
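If the goal is to add these ratio columns next to the original ones (as the question suggests), a minimal sketch, assuming the MultiIndex version of df1 from above:

# Concatenate the 'new' ratio columns onto the original frame.
df_out = pd.concat([df, df1], axis=1)
print(df_out)
# df_out now has bar, baz and new on the first column level.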

Categorizing a data based on string in each row

I have the following dataframe:
import pandas as pd

raw_data = {'name': ['Willard', 'Nan', 'Omar', 'Spencer'],
            'Last_Name': ['Smith', 'Nan', 'Sheng', 'Poursafar'],
            'favorite_color': ['blue', 'red', 'Nan', 'green'],
            'Statues': ['Match', 'Mis-Match', 'Match', 'Mis_match']}
df = pd.DataFrame(raw_data, columns=['name', 'Last_Name', 'favorite_color', 'Statues'])
df
I want to do the following tasks:
Separate the rows that contain Match and Mis-match
Make a category that only contains people whose first name and last name are Nan and who love a color (any color except for Nan).
Can you guys help me?
Use boolean indexing:
df1 = df[df['Statues'] == 'Match']
df2 = df[df['Statues'] =='Mis-Match']
If missing values are not strings, use Series.isna and Series.notna:
df3 = df[df['name'].isna() & df['Last_Name'].isna() & df['favorite_color'].notna()]
If the Nans are strings, compare against 'Nan':
df3 = df[(df['name'] == 'Nan') &
         (df['Last_Name'] == 'Nan') &
         (df['favorite_color'] != 'Nan')]
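On the sample data above (where the Nans are strings), a minimal sketch of the full split. One labelled assumption: both spellings 'Mis-Match' and 'Mis_match' are treated as mismatches, which the equality test alone would not do.

import pandas as pd

raw_data = {'name': ['Willard', 'Nan', 'Omar', 'Spencer'],
            'Last_Name': ['Smith', 'Nan', 'Sheng', 'Poursafar'],
            'favorite_color': ['blue', 'red', 'Nan', 'green'],
            'Statues': ['Match', 'Mis-Match', 'Match', 'Mis_match']}
df = pd.DataFrame(raw_data)

# Assumption: 'Mis-Match' and 'Mis_match' should both count as mismatches.
df_match = df[df['Statues'] == 'Match']
df_mismatch = df[df['Statues'].isin(['Mis-Match', 'Mis_match'])]

# People whose first and last name are the string 'Nan' but who do have a color.
df_nan_names = df[(df['name'] == 'Nan') &
                  (df['Last_Name'] == 'Nan') &
                  (df['favorite_color'] != 'Nan')]

print(df_match)      # rows for Willard and Omar
print(df_mismatch)   # rows for Nan and Spencer
print(df_nan_names)  # only the 'Nan Nan' row (favorite_color 'red')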

Remove consecutive duplicate entries from pandas in each cell

I have a data frame that looks like
import pandas as pd

d = {'col1': ['a,a,b', 'a,c,c,b'], 'col2': ['a,a,b', 'a,b,b,a']}
df = pd.DataFrame(data=d)
expected output
d={'col1':['a,b','a,c,b'],'col2':['a,b','a,b,a']}
I have tried this:
from itertools import groupby

arr = ['a', 'a', 'b', 'a', 'a', 'c', 'c']
print([x[0] for x in groupby(arr)])
How do I remove the consecutive duplicate entries in each row and column of the dataframe?
a,a,b,c should be a,b,c
From what I understand, you don't want to include values which repeat consecutively; you can try this custom function:
def myfunc(x):
    s = pd.Series(x.split(','))
    res = s[s.ne(s.shift())]
    return ','.join(res.values)

print(df.applymap(myfunc))
    col1   col2
0    a,b    a,b
1  a,c,b  a,b,a
Another function can be created with itertools.groupby, such as:
from itertools import groupby

def myfunc(x):
    l = [k for k, _ in groupby(x.split(','))]
    return ','.join(l)
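Usage is the same as with the first function, e.g. a quick sketch (assuming df is built from the dictionary d shown in the question):

print(df.applymap(myfunc))
#     col1   col2
# 0    a,b    a,b
# 1  a,c,b  a,b,a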
You could define a function to help with this, then use .applymap to apply it to all columns (or .apply one column at a time):
d = {'col1': ['a,a,b', 'a,c,c,b'], 'col2': ['a,a,b', 'a,b,b,a']}
df = pd.DataFrame(data=d)

def remove_dups(string):
    split = string.split(',')  # split string into a list
    uniques = set(split)       # remove duplicate list elements
    return ','.join(uniques)   # rejoin the list elements into a string

result = df.applymap(remove_dups)
This returns:
    col1 col2
0    a,b  a,b
1  a,c,b  a,b
Edit: This looks slightly different to your expected output, why do you expect a,b,a for the second row in col2?
Edit2: to preserve the original order, you can replace the set() function with unique_everseen()
from more_itertools import unique_everseen
.
.
.
uniques = unique_everseen(split)
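For completeness, a minimal sketch of the order-preserving variant (assuming more_itertools is installed and df is the frame built above):

from more_itertools import unique_everseen

def remove_dups(string):
    split = string.split(',')         # split the string into a list
    uniques = unique_everseen(split)  # drop duplicates, keeping first-seen order
    return ','.join(uniques)          # rejoin into a string

result = df.applymap(remove_dups)
#     col1 col2
# 0    a,b  a,b
# 1  a,c,b  a,b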

Merge and then sort columns of a dataframe based on the columns of the merging dataframe

I have two dataframes, both indexed with timestamps. I would like to preserve the order of the columns in the first dataframe that is merged.
For example:
#required packages
import pandas as pd
import numpy as np
# defining stuff
num_periods_1 = 11
num_periods_2 = 4
# create sample time series
dates1 = pd.date_range('1/1/2000 00:00:00', periods=num_periods_1, freq='10min')
dates2 = pd.date_range('1/1/2000 01:30:00', periods=num_periods_2, freq='10min')
column_names_1 = ['C', 'B', 'A']
column_names_2 = ['B', 'C', 'D']
df1 = pd.DataFrame(np.random.randn(num_periods_1, len(column_names_1)), index=dates1, columns=column_names_1)
df2 = pd.DataFrame(np.random.randn(num_periods_2, len(column_names_2)), index=dates2, columns=column_names_2)
df3 = df1.merge(df2, how='outer', left_index=True, right_index=True, suffixes=['_1', '_2'])
print("\nData Frame Three:\n", df3)
The above code generates two data frames, the first with columns C, B, and A. The second dataframe has columns B, C, and D. The current output has the columns in the following order: C_1, B_1, A, B_2, C_2, D. I want the columns from the output of the merge to be C_1, C_2, B_1, B_2, A_1, D_2: the order of the columns is preserved from the first data frame, and any column shared with the second data frame is placed next to its counterpart.
Could there be a setting in merge or can I use sort_index to do this?
EDIT: Maybe a better way to phrase the sorting process would be to call it uncollated, where columns with the same prefix are placed together, and so on.
Using an OrderedDict, as you suggested.
from collections import OrderedDict
from itertools import chain

c = df3.columns.tolist()

o = OrderedDict()
for x in c:
    o.setdefault(x.split('_')[0], []).append(x)

c = list(chain.from_iterable(o.values()))
df3 = df3[c]
An alternative involves extracting the prefixes and then calling sorted on the column names:
# https://stackoverflow.com/a/46839182/4909087
p = [s[0] for s in c]
c = sorted(c, key=lambda x: (p.index(x[0]), x))
df3 = df3[c]
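With the sample frames above, both approaches should leave the merged frame with the column order the question is after; note that A and D carry no suffix because only B and C appear in both frames:

print(df3.columns.tolist())
# Expected: ['C_1', 'C_2', 'B_1', 'B_2', 'A', 'D']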
