How to get the column name which is not null - apache-spark

I have a Spark DataFrame as below:

ID | Col A      | Col B
1  | null       | Some Value
2  | Some Value | null
I need to add a new column which contains the column name (among Col A and Col B) which is not null.
So the expected dataframe should look like:

ID | Col A      | Col B      | result
1  | null       | Some Value | Col B
2  | Some Value | null       | Col A
Any help would be much appreciated.
Thank you!

After creating a temp view from your dataframe, e.g.
df.createOrReplaceTempView("my_data")
you can run the following query on your Spark session using newdf = sparkSession.sql("query here"):
SELECT
  ID,
  `Col A`,
  `Col B`,
  CASE
    WHEN `Col A` IS NULL AND `Col B` IS NULL THEN NULL
    WHEN `Col B` IS NULL THEN 'Col A'
    WHEN `Col A` IS NULL THEN 'Col B'
    ELSE 'Col A Col B'
  END AS result
FROM my_data
Or, using the PySpark DataFrame API directly:
from pyspark.sql.functions import when, col

df = df.withColumn(
    "result",
    when(col("Col A").isNull() & col("Col B").isNull(), None)
    .when(col("Col B").isNull(), "Col A")
    .when(col("Col A").isNull(), "Col B")
    .otherwise("Col A Col B"),
)
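
To sanity-check this on the sample data from the question, here is a minimal, self-contained sketch (the SparkSession setup and createDataFrame call are only for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the question; Python None becomes null in the DataFrame
df = spark.createDataFrame(
    [(1, None, "Some Value"), (2, "Some Value", None)],
    ["ID", "Col A", "Col B"],
)

# After applying the withColumn snippet above, df.show() gives (roughly):
#  ID | Col A      | Col B      | result
#  1  | null       | Some Value | Col B
#  2  | Some Value | null       | Col A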

Related

Compare blank string and null spark sql

I am writing an SQL query that joins two tables. The problem that I am facing is that the column on which I am joining is blank (""," ") on one table and null on the other.
Table A

id | col
1  | ""
2  | ""
3  | SG

Table B

id | col
a  | null
b  | null
c  | SG
source_alleg = spark.sql("""
SELECT A.*,B.COL as COLB FROM TABLEA A LEFT JOIN TABLEB B
ON A.COL = B.COL
""")
For my use case, blank values and null are the same. I want to do something like Trim(a.col) that converts blank values to null and hence finds all the matches in the join.
Output:
id | col                  | colb
1  | either null or blank | either null or blank
2  | either null or blank | either null or blank
3  | SG                   | SG
In SQL, NULL values never match an equality join condition (NULL = NULL does not evaluate to true), so rows with a NULL join key produce no match; an outer or full join only keeps the unmatched rows rather than matching them.
More information: https://www.geeksforgeeks.org/difference-between-left-right-and-full-outer-join/
If you want to convert the nulls (and blanks) to a string so they can be compared, you can just use an if:
SELECT
  IF(ISNULL(col1) OR TRIM(col1) = '', 'yourstring', col1),
  IF(ISNULL(col2) OR TRIM(col2) = '', 'yourstring', col2)
FROM T;
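
If the goal is for blank and NULL keys to be treated as the same value during the join, one option (a sketch under that assumption, reusing the 'yourstring' placeholder from above) is to normalize both sides to a common placeholder before joining:

source_alleg = spark.sql("""
SELECT A.*, B.col AS colb
FROM (SELECT id, COALESCE(NULLIF(TRIM(col), ''), 'yourstring') AS col FROM TABLEA) A
LEFT JOIN (SELECT id, COALESCE(NULLIF(TRIM(col), ''), 'yourstring') AS col FROM TABLEB) B
ON A.col = B.col
""")

Note that if several rows on both sides share the placeholder key, the join fans out to every matching pair, so the row count can grow.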

How to categorize one column value based on another column value

I have a dataframe with 2 columns like the following:
ColA | COLB
ABC  | Null
Null | a
Null | b
DEF  | Null
Null | c
Null | d
Null | e
GHI  | Null
IJK  | f
I want to categorize COLB based on ColA so that the final output looks like:

ColA | COLB
ABC  | a,b
DEF  | c,d,e
GHI  | Empty
IJK  | f

How can I do this using pandas?
Let's start by creating the DataFrame:
df1 = pd.DataFrame({'ColA':['ABC',np.NaN,np.NaN,'DEF',np.NaN,np.NaN,np.NaN,'GHI','IJK'],'ColB':[np.NaN,'a','b',np.NaN,'c','d','e',np.NaN,'f']})
Next, we forward-fill the NaN values in ColA with the previous occurrence:
df1.ColA.fillna(method='ffill',inplace=True)
Then we identify the ColA groups whose ColB is entirely empty:
t1 = df1.groupby('ColA').count()
fill_list = t1[t1['ColB'] == 0].index
df1.loc[df1.ColA.isin(fill_list),'ColB'] = 'Empty'
Finally, drop the remaining NaN rows, then group by ColA and join ColB:
df1 = df1.dropna()
df1.groupby('ColA').apply(lambda x: ','.join(x.ColB))
Output:

ColA
ABC    a,b
DEF    c,d,e
GHI    Empty
IJK    f
Use a for loop to fill ColA, then group by.
(I assume your null values are the string 'Null'; if not, you can first replace them with a string value using the DataFrame's replace method.)
import pandas as pd

for i in range(1, len(df)):
    if df.ColA.loc[i] == 'Null':
        df.ColA.loc[i] = df.ColA.loc[i-1]
df = df.groupby(by=['ColA']).aggregate({'ColB': lambda x: ','.join(x)})

Check if one column value is in another column in pandas

I want to compare one column with another column in the same dataframe. Not just comparing adjacent values, but iterating through every value in Col1 and checking whether it exists anywhere among the Col2 values.
Col1 Col2 exists
cat pig true
a cat false
pig b true
mat axe false
Thanks.
Col2_values = set(df['Col2'].unique())
df['exists'] = df['Col1'].map(lambda x: x in Col2_values)
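
For reference, a minimal self-contained sketch that reproduces the expected output from the question; isin does the membership test in one step (the sample data is copied from the question):

import pandas as pd

df = pd.DataFrame({'Col1': ['cat', 'a', 'pig', 'mat'],
                   'Col2': ['pig', 'cat', 'b', 'axe']})

# True where the Col1 value appears anywhere in Col2
df['exists'] = df['Col1'].isin(df['Col2'])

print(df)
# prints (roughly):
#   Col1 Col2  exists
# 0  cat  pig    True
# 1    a  cat   False
# 2  pig    b    True
# 3  mat  axe   False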

For loop for dropping a string pattern from a column name

I am attempting to drop '_Adj' from a column name in a 'df_merged' data frame if the column name contains 'eTIV' or 'eTIV1'.
for col in df_merged.columns:
    if 'eTIV1' in col or 'eTIV' in col:
        df_merged.columns.str.replace('_Adj', '')
This code seems to be producing the following error:
KeyError: '[] not found in axis'
Here are two options:
Option 1
df_merged.columns = [col.replace('_Adj','') if 'eTIV' in col else col for col in list(df_merged.columns)]
Option 2
df_merged = df_merged.rename(columns={col: col.replace('_Adj','') if 'eTIV' in col else col for col in df_merged.columns})
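
A small sketch of Option 2 on made-up column names (the names below are hypothetical; only the 'eTIV'/'_Adj' pattern comes from the question):

import pandas as pd

# Hypothetical columns for illustration
df_merged = pd.DataFrame(columns=['eTIV_Adj', 'eTIV1_Adj', 'Age_Adj', 'Sex'])

df_merged = df_merged.rename(
    columns={col: col.replace('_Adj', '') if 'eTIV' in col else col
             for col in df_merged.columns}
)

print(list(df_merged.columns))
# ['eTIV', 'eTIV1', 'Age_Adj', 'Sex']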

Pivoting in Pig

This is related to the question in Pivot table with Apache Pig.
I have the input data as
Id Name Value
1 Column1 Row11
1 Column2 Row12
1 Column3 Row13
2 Column1 Row21
2 Column2 Row22
2 Column3 Row23
and want to pivot and get the output as
Id Column1 Column2 Column3
1 Row11 Row12 Row13
2 Row21 Row22 Row23
Please let me know how to do it in Pig.
The simplest way to do it without a UDF is to group on Id, then in a nested foreach select the rows for each of the column names, and finally join them in the generate. See the script:
inpt = load '~/rows_to_cols.txt' as (Id : chararray, Name : chararray, Value: chararray);
grp = group inpt by Id;
maps = foreach grp {
    col1 = filter inpt by Name == 'Column1';
    col2 = filter inpt by Name == 'Column2';
    col3 = filter inpt by Name == 'Column3';
    generate flatten(group) as Id, flatten(col1.Value) as Column1, flatten(col2.Value) as Column2, flatten(col3.Value) as Column3;
};
Output:
(1,Row11,Row12,Row13)
(2,Row21,Row22,Row23)
Another option would be to write a UDF which converts a bag{name, value} into a map[], then get the values by using the column names as keys (e.g. vals#'Column1').
Not sure about Pig, but in Spark you could do this with a one-line command:
df.groupBy("Id").pivot("Name").agg(first("Value"))
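
For completeness, a self-contained PySpark sketch of that one-liner on the sample data (note that first comes from pyspark.sql.functions):

from pyspark.sql import SparkSession
from pyspark.sql.functions import first

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "Column1", "Row11"), (1, "Column2", "Row12"), (1, "Column3", "Row13"),
     (2, "Column1", "Row21"), (2, "Column2", "Row22"), (2, "Column3", "Row23")],
    ["Id", "Name", "Value"],
)

df.groupBy("Id").pivot("Name").agg(first("Value")).show()
# yields one row per Id, matching the Pig output above:
#  Id | Column1 | Column2 | Column3
#  1  | Row11   | Row12   | Row13
#  2  | Row21   | Row22   | Row23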
