Check if one column value is in another column in pandas

I want to compare one column with another column in the same dataframe: not just the value on the same row, but checking, for every value in Col1, whether it appears anywhere in Col2.
Col1  Col2  exists
cat   pig   true
a     cat   false
pig   b     true
mat   axe   false
Thanks.

col2_values = set(df['Col2'].unique())
df['exists'] = df['Col1'].map(lambda x: x in col2_values)
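A more idiomatic equivalent uses Series.isin, which performs the same membership test without the explicit set:
# True where the Col1 value appears anywhere in Col2
df['exists'] = df['Col1'].isin(df['Col2'])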

Related

How to categorize one column value based on another column value

I have a dataframe with 2 columns like the following:
ColA  COLB
ABC   Null
Null  a
Null  b
DEF   Null
Null  c
Null  d
Null  e
GHI   Null
IJK   f
I want to categorize "COLB" based on "ColA" so that the final output looks like:
ColA  COLB
ABC   a,b
DEF   c,d,e
GHI   Empty
IJK   f
How can I do this using pandas?
Let's start by creating the DataFrame:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ColA': ['ABC', np.nan, np.nan, 'DEF', np.nan, np.nan, np.nan, 'GHI', 'IJK'],
                    'ColB': [np.nan, 'a', 'b', np.nan, 'c', 'd', 'e', np.nan, 'f']})
Next we fill each NaN in ColA with the previous occurrence:
df1['ColA'] = df1['ColA'].ffill()
Then we identify the ColA groups whose ColB is entirely empty:
t1 = df1.groupby('ColA').count()
fill_list = t1[t1['ColB'] == 0].index
df1.loc[df1.ColA.isin(fill_list),'ColB'] = 'Empty'
Finally, drop the remaining NaN rows, group by ColA, and join ColB:
df1 = df1.dropna()
df1.groupby('ColA').apply(lambda x: ','.join(x.ColB))
Output:
ColA
ABC      a,b
DEF    c,d,e
GHI    Empty
IJK        f
dtype: object
Use a for loop to fill ColA, then group by.
(I assume your null values are the string 'Null'; if they are real NaN, you can first replace them with a string using the dataframe's replace method.)
import pandas as pd

# propagate each non-'Null' ColA value down to the rows below it
for i in range(1, len(df)):
    if df.loc[i, 'ColA'] == 'Null':
        df.loc[i, 'ColA'] = df.loc[i - 1, 'ColA']

df = df.groupby(by=['ColA']).aggregate({'ColB': lambda x: ','.join(x)})
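If the blanks are real NaN rather than the literal string 'Null' (the assumption in the parenthetical above), one way to normalize them before the loop:
# pre-step: turn NaN into the string 'Null' that the loop checks for
df['ColA'] = df['ColA'].fillna('Null')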

How to get the column name which is not null

I have a Spark Dataframe as below
ID  Col A       Col B
1   null        Some Value
2   Some Value  null
I need to add a new column which contains the column name (among Col A and Col B) which is not null.
So the expected dataframe should look like,
ID  Col A       Col B       result
1   null        Some Value  Col B
2   Some Value  null        Col A
Any help would be much appreciated.
Thank you!
After creating a temp view from your dataframe, e.g.
df.createOrReplaceTempView("my_data")
you may run the following on your Spark session using newdf = sparkSession.sql("query here"):
SELECT
    ID,
    `Col A`,
    `Col B`,
    CASE
        WHEN `Col A` IS NULL AND `Col B` IS NULL THEN NULL
        WHEN `Col B` IS NULL THEN 'Col A'
        WHEN `Col A` IS NULL THEN 'Col B'
        ELSE 'Col A Col B'
    END AS result
FROM my_data
Or just using Python:
from pyspark.sql.functions import when, col

df = df.withColumn(
    "result",
    when(col("Col A").isNull() & col("Col B").isNull(), None)
    .when(col("Col B").isNull(), "Col A")
    .when(col("Col A").isNull(), "Col B")
    .otherwise("Col A Col B"),
)
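Equivalently (a sketch, not part of the original answer), the same CASE expression can be applied without registering a temp view by passing it to pyspark.sql.functions.expr; the backticks quote the spaced column names:
from pyspark.sql.functions import expr

df = df.withColumn("result", expr("""
    CASE
        WHEN `Col A` IS NULL AND `Col B` IS NULL THEN NULL
        WHEN `Col B` IS NULL THEN 'Col A'
        WHEN `Col A` IS NULL THEN 'Col B'
        ELSE 'Col A Col B'
    END
"""))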

Pandas - df['A'].isnull() vs df['A']=='' difference

As the title says, I'm a bit confused about the difference between isnull() and == ''. Sometimes when empty columns are added to a dataframe, isnull() does not work.
FDF = pd.DataFrame()
FDF['A'] = ''
print (FDF.loc[FDF['A'].isnull()])
But in the same case, the following works:
print (FDF.loc[FDF['A']==''])
Is it because of the way I added a blank column to the dataframe? If so, what is the correct way to add an empty column?
In pandas, '' is not equal to np.nan:
'' == np.nan
Out[51]: False
That is why the isnull check returns False for an empty string.
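A minimal illustration on a non-empty Series (the values here are made up for demonstration):
import numpy as np
import pandas as pd

s = pd.Series(['', np.nan, 'x'])
print(s.isnull())  # False, True, False -> only the real NaN is null
print(s == '')     # True, False, False -> only the empty string matches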
Also, when you assign '' to an empty dataframe, you end up assigning an empty Series:
FDF.A
Out[54]: Series([], Name: A, dtype: object)
The correct way to assign the value:
FDF['A'] = ['']
FDF
Out[59]:
A
0
All of the above is due to the empty-dataframe assignment; once the dataframe has a non-empty index, we can do:
FDF['A'] = ['']
FDF['B'] = ''
FDF
Out[64]:
A B
0
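Note that the pitfall only bites when the frame has no rows: on a non-empty frame, assigning a scalar broadcasts it to every row. A small sketch (not from the original answer):
import pandas as pd

df = pd.DataFrame({'x': [1, 2]})  # non-empty frame
df['A'] = ''                      # the scalar '' broadcasts to both rows
print(df['A'].eq('').all())       # True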

Adding n new columns to pandas Data Frame

Given a Data Frame like the following:
import pandas as pd

df = pd.DataFrame({'term': ['analys', 'applic', 'architectur', 'assess', 'item', 'methodolog', 'research',
                            'rs', 'studi', 'suggest', 'test', 'tool', 'viewer', 'work'],
                   'newValue': [0.810419, 0.631963, 0.687348, 0.810554, 0.725366, 0.742715, 0.799152,
                                0.599030, 0.652112, 0.683228, 0.711307, 0.625563, 0.604190, 0.724763]})
df = df.set_index('term')
print(df)
newValue
term
analys 0.810419
applic 0.631963
architectur 0.687348
assess 0.810554
item 0.725366
methodolog 0.742715
research 0.799152
rs 0.599030
studi 0.652112
suggest 0.683228
test 0.711307
tool 0.625563
viewer 0.604190
work 0.724763
I want to add n new empty columns, each filled with the empty string "". The number of required new columns is stored in the variable n:
n = 5
Thanks for your help in advance!
According to this answer, every non-empty DataFrame has columns, an index, and some values, so your dataframe cannot have unnamed columns anyway.
This is the shortest way that I know of to achieve your goal:
n = 5
for i in range(n):
    df[len(df.columns)] = ""
newValue 1 2 3 4 5
term
analys 0.810419
applic 0.631963
architectur 0.687348
assess 0.810554
item 0.725366
methodolog 0.742715
research 0.799152
rs 0.599030
studi 0.652112
suggest 0.683228
test 0.711307
tool 0.625563
viewer 0.604190
work 0.724763
IIUC, you can use:
n = 5
df = (pd.concat([df, pd.DataFrame(columns=['col' + str(i) for i in range(n)])],
                axis=1, sort=False)
        .fillna(''))
print(df)
print(df)
            newValue col0 col1 col2 col3 col4
analys 0.810419
applic 0.631963
architectur 0.687348
assess 0.810554
item 0.725366
methodolog 0.742715
research 0.799152
rs 0.599030
studi 0.652112
suggest 0.683228
test 0.711307
tool 0.625563
viewer 0.604190
work 0.724763
Note: You can remove the fillna() if you want NaN.
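As a further alternative (a sketch, assuming string names like col0..col4 are acceptable), DataFrame.assign adds all n columns in one call by broadcasting the scalar:
n = 5
df = df.assign(**{'col' + str(i): '' for i in range(n)})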

How to slice out column names based on column into row of new dataframe?

I have a df that looks like this
data.answers.1542213647002.subItemType  data.answers.1542213647002.value.1542213647003
thank you for the response              TRUE
How do I slice out the names of only those columns that contain the string .value. and hold the value TRUE, into a new df like so?
new_df
old_column_names
data.answers.1542213647002.value.1542213647003
I have roughly 100 more columns with .value. in them, but not all of them have TRUE as values.
Assume this sample df:
import pandas as pd

df = pd.DataFrame({'col': [1, 2] * 5,
                   'col2.value.something': [True, False] * 5,
                   'col3.value.something': [5] * 10,
                   'col4': [True] * 10})
Then:
# boolean indexing with stack: keep only the cells that are True in a '.value.' column
# (regex=False so the dots are matched literally, not as regex wildcards)
mask = (df == True) & df.columns.str.contains('.value.', regex=False)
new = pd.DataFrame(list(df[mask].stack().index))
# drop the row-index level and the duplicates, keeping the column names
new = new.drop(columns=0).drop_duplicates()
                      1
0  col2.value.something
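A simpler alternative (not from the original answer) skips the stack entirely and builds the list of qualifying column names directly:
# columns whose name contains '.value.' and that hold at least one True
cols = [c for c in df.columns
        if '.value.' in c and df[c].eq(True).any()]
new_df = pd.DataFrame({'old_column_names': cols})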
