I have a dataframe with 2 columns like the following:
ColA    ColB
ABC     Null
Null    a
Null    b
DEF     Null
Null    c
Null    d
Null    e
GHI     Null
IJK     f
I want to group the "ColB" values based on "ColA" so that the final output looks like:
ColA    ColB
ABC     a,b
DEF     c,d,e
GHI     Empty
IJK     f
How can I do this using pandas?
Let's start by creating the DataFrame:
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'ColA': ['ABC', np.nan, np.nan, 'DEF', np.nan, np.nan, np.nan, 'GHI', 'IJK'],
                    'ColB': [np.nan, 'a', 'b', np.nan, 'c', 'd', 'e', np.nan, 'f']})
Next we fill the NaN values in ColA with the previous occurrence (forward fill):
df1['ColA'] = df1['ColA'].ffill()
Then we identify the ColA groups whose ColB is entirely empty:
t1 = df1.groupby('ColA').count()
fill_list = t1[t1['ColB'] == 0].index
df1.loc[df1.ColA.isin(fill_list),'ColB'] = 'Empty'
Finally, group by ColA and join the ColB values:
df1 = df1.dropna()
df1.groupby('ColA').apply(lambda x: ','.join(x.ColB))
Output:
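ColA
ABC      a,b
DEF    c,d,e
GHI    Empty
IJK        f
dtype: object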
Use a for loop to fill in ColA, then group by.
(I assume your null values are the string 'Null'. If they are not, you can first replace them with that string value using the DataFrame's replace method.)
import pandas as pd

# carry the last seen ColA value forward over the 'Null' rows
for i in range(1, len(df)):
    if df.loc[i, 'ColA'] == 'Null':
        df.loc[i, 'ColA'] = df.loc[i - 1, 'ColA']

df = df.groupby(by=['ColA']).aggregate({'ColB': lambda x: ','.join(x)})
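If the nulls are real NaN values rather than the string 'Null', a minimal conversion before the loop might look like this (using the replace method mentioned above; np.nan is assumed here):

import numpy as np

# assumption: turn real missing values into the 'Null' placeholder the loop expects
df = df.replace(np.nan, 'Null')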
Related
I have a CSV whose data is arranged in an unusual manner.
The data set is given below:
import pandas as pd

data = [[12, 'abc#xyz.com', 'NaN', 'NaN'], [12, 'abc#xyz.com', 'NaN', 'NaN'],
        ['NaN', 'NaN', 'x', 'y'], ['NaN', 'NaN', 'a', 'b'],
        ['13', 'qwer#123.com', 'NaN', 'NaN'], ['NaN', 'NaN', 'x', 'r']]
df = pd.DataFrame(data, columns=['id', 'email', 'notes_key', 'notes_value'])
df
Ideally, the third and fourth columns should have the same id as the first column.
The columns notes_key and notes_value represent a key:value pair, i.e. the key is in notes_key and its corresponding value is in notes_value.
I have to manipulate the dataframe so that the output turns out as:
data = [[12, 'abc#xyz.com', 'x', 'y'], [12, 'abc#xyz.com', 'a', 'b']]
df = pd.DataFrame(data, columns=['id', 'email', 'notes_key', 'notes_value'])
I tried dropping the null values.
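(Dropping the nulls alone removes every row here, since each row is missing either the id/email pair or the notes pair; a quick check, assuming the 'NaN' strings are first converted to real missing values:)

import numpy as np

# assumption: the 'NaN' entries are strings and must be converted before dropna
print(len(df.replace('NaN', np.nan).dropna()))  # 0 -- every row is dropped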
You can forward fill the missing values in id and email, and then remove rows with missing values in both notes_key and notes_value:
import numpy as np

# if necessary, convert the 'NaN' strings to real missing values first
df = df.replace('NaN', np.nan)
df[['id','email']] = df[['id','email']].ffill()
df = df.dropna(subset=['notes_key','notes_value'], how='all')
print(df)
id email notes_key notes_value
2 12 abc#xyz.com x y
3 12 abc#xyz.com a b
5 13 qwer#123.com x r
I have a Spark Dataframe as below
ID    Col A         Col B
1     null          Some Value
2     Some Value    null
I need to add a new column which contains the column name (among Col A and Col B) which is not null.
So the expected dataframe should look like,
ID    Col A         Col B         result
1     null          Some Value    Col B
2     Some Value    null          Col A
Any help would be much appreciated.
Thank you!
After creating a temp view from your dataframe, e.g.
df.createOrReplaceTempView("my_data")
you may run the following on your Spark session using newdf = sparkSession.sql("query here"):
SELECT
    ID,
    `Col A`,
    `Col B`,
    CASE
        WHEN `Col A` IS NULL AND `Col B` IS NULL THEN NULL
        WHEN `Col B` IS NULL THEN 'Col A'
        WHEN `Col A` IS NULL THEN 'Col B'
        ELSE 'Col A Col B'
    END AS result
FROM my_data
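Tying it together, a minimal sketch (assuming sparkSession is your active SparkSession and query holds the SQL text above):

df.createOrReplaceTempView("my_data")  # register the dataframe as a temp view
newdf = sparkSession.sql(query)        # assumption: query is the SELECT ... CASE statement above
newdf.show()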
Or just using Python:
from pyspark.sql.functions import when, col

df = df.withColumn(
    "result",
    when(col("Col A").isNull() & col("Col B").isNull(), None)
    .when(col("Col B").isNull(), 'Col A')
    .when(col("Col A").isNull(), 'Col B')
    .otherwise('Col A Col B')
)
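With the sample data above, df.show() should then give roughly:

+---+----------+----------+------+
| ID|     Col A|     Col B|result|
+---+----------+----------+------+
|  1|      null|Some Value| Col B|
|  2|Some Value|      null| Col A|
+---+----------+----------+------+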
In my dataframe, I have multiple columns whose values I would like to merge into one column. For instance, I would like the NaN values in the MEDICATIONS: column to be replaced by a value if one exists in any other column besides MEDICATIONS:
Input:
Expected Output:
df['MEDICATIONS'].combine_first(df["Rest of the columns besides MEDICATIONS:"])
Link of the dataset:
https://drive.google.com/file/d/1cyZ_OWrGNvJyc8ZPNFVe543UAI9snHDT/view?usp=sharing
Something like this?
import pandas as pd

df = pd.read_csv('data - data.csv')
del df['Unnamed: 0']

# concatenate each row's values into a single string, then strip the 'nan' placeholders
df['Combined_Meds'] = df.astype(str).values.sum(axis=1)
df['Combined_Meds'] = df['Combined_Meds'].str.replace('nan', '', regex=False)

# move the combined column to the front
cols = list(df.columns)
cols = [cols[-1]] + cols[:-1]
df = df[cols]
df.sample(10)
col_exclusions = ['numerator', 'Numerator', 'Denominator', 'denominator']
DataFrame:
id prim_numerator sec_Numerator tern_Numerator tern_Denominator final_denominator Result
1 12 23 45 54 56 Fail
The final output should contain only id and Result.
Using regex:
import re
pat = re.compile('|'.join(col_exclusions),flags=re.IGNORECASE)
final_cols = [c for c in df.columns if not re.search(pat,c)]
#out:
['id', 'Result']
print(df[final_cols])
id Result
0 1 Fail
If you want to drop the columns instead:
df = df.drop([c for c in df.columns if re.search(pat,c)],axis=1)
Or the pure pandas approach, thanks to @Anky_91:
df.loc[:,~df.columns.str.contains('|'.join(col_exclusions),case=False)]
You can be explicit and use del for columns that end with the suffixes in your input list:
for column in list(df.columns):
    if any(column.endswith(suffix) for suffix in col_exclusions):
        del df[column]
You can also use the following approach, where the column names are split and then matched with col_exclusions:
df.drop(columns=[i for i in df.columns if i.split("_")[-1] in col_exclusions], inplace=True)
print(df.head())
I have two dataframes like the df1 and df2 examples below. I would like to compare values between the dataframes and return the names of the columns where they have different values. So in the example below it would return column B. Any tips are greatly appreciated.
df1
A B C
1 2 3
1 1 1
df2
A B C
1 1 3
1 1 1
Comparing dataframes using != or ne() returns a boolean dataframe, on which you can look for any True values using any(). This returns a boolean series which you can index with itself.
s = (df1 != df2).any()
s[s].index
Index(['B'], dtype='object')
In your example above, using eq with all:
df1.eq(df2).all().loc[lambda x : ~x].index
Out[720]: Index(['B'], dtype='object')