compare columns with NaN or <NA> values pandas - python-3.x

I have the dataframe with NaN and values, now I want to compare two columns in the same dataframe whether each row values in null or not null. For examples,
if the column a_1 have null values, column a_2 have not null values, then for that particular
row, the result should be 1 in the new column a_12.
If the values in both a_1(value is 123) & a_2(value is 345) is not null, and the values are
not equal, then the result should be 3 in column a_12.
below is the code snippet I have used for comparison, for the scenario 1, I am getting the result as 3 instead of 1. Please guide me to get the correct output.
try:
if (x[cols[0]]==x[cols[1]]) & (~np.isnan(x[cols[0]])) & (~np.isnan(x[cols[1]])):
return 0
elif (np.isnan(x[cols[0]])) & (np.isnan(x[cols[1]])):
return 0
elif (~np.isnan(x[cols[0]])) & (np.isnan(x[cols[1]])):
return 1
elif (np.isnan(x[cols[0]])) & (~np.isnan(x[cols[1]])):
return 2
elif (x[cols[0]]!=x[cols[1]]) & (~np.isnan(x[cols[0]])) & (~np.isnan(x[cols[1]])):
return 3
else:
pass
except Exception as exc:
if (x[cols[0]]==x[cols[1]]) & (pd.notna(x[cols[0]])) & (pd.notna(x[cols[1]])):
return 0
elif (pd.isna(x[cols[0]])) & (pd.isna(x[cols[1]])):
return 0
elif (pd.notna(x[cols[0]])) & (pd.isna(x[cols[1]])):
return 1
elif (pd.isna(x[cols[0]])) & (pd.notna(x[cols[1]])):
return 2
elif (x[cols[0]]!=x[cols[1]]) & (pd.notna(x[cols[0]])) & (pd.notna(x[cols[1]])):
return 3
else:
pass
I have used pd.isna() and pd.notna(), also np.isnan() and ~np.isnan(), because for some columns the second method (np.isnan()) is working, for some columns its just throwing an error.
Please guide me to achieve the result as excepted.
Expected Output:
| a_1 | a_2 | result |
|-----------|---------|--------|
| gssfwe | gssfwe | 0 |
| <NA> | <NA> | 0 |
| fsfsfw | <NA> | 1 |
| <NA> | qweweqw | 2 |
| adsadgsgd | wwuwquq | 3 |
Output Got with the above code:
| a_1 | a_2 | result |
|-----------|---------|--------|
| gssfwe | gssfwe | 0 |
| <NA> | <NA> | 0 |
| fsfsfw | <NA> | 3 |
| <NA> | qweweqw | 3 |
| adsadgsgd | wwuwquq | 3 |

Going by the logic in your code, you'd want to define a function and apply it across your DataFrame.
df = pd.DataFrame({'a_1': [1, 2, np.nan, np.nan, 1], 'a_2': [2, np.nan, 1, np.nan, 1]})
The categories you want map neatly to binary numbers, which you can use to write a short function like -
def nan_check(row):
x, y = row
if x != y:
return int(f'{int(pd.notna(y))}{int(pd.notna(x))}', base=2)
return 0
df['flag'] = df.apply(nan_check, axis=1)
Output
a_1 a_2 flag
0 1.0 2.0 3
1 2.0 NaN 1
2 NaN 1.0 2
3 NaN NaN 0
4 1.0 1.0 0

You can try np.select, but I think you need to rethink the condition and the expected output
Condition 1: if the column a_1 have null values, column a_2 have not null values, then for that particular row, the result should be 1 in the new column a_12.
Condition 2: If the values in both a_1 & a_2 is not null, and the values are not equal, then the result should be 3 in column a_12.
df['a_12'] = np.select(
[df['a_1'].isna() & df['a_2'].notna(),
df['a_1'].notna() & df['a_2'].notna() & df['a_1'].ne(df['a_2'])],
[1, 3],
default=0
)
print(df)
a_1 a_2 result a_12
0 gssfwe gssfwe 0 0
1 NaN NaN 0 0
2 fsfsfw NaN 1 0 # Shouldn't be Condition 1 since a_1 is not NaN
3 NaN qweweqw 2 1 # Condition 1
4 adsadgsgd wwuwquq 3 3

Related

Create new column and calculate values to the column in python row wise

I need to create a new column as Billing and Non-Billing based on the Billable column. If the Billable is 'Yes' then i should create a new column as Billing and if its 'No' then need to create a new column as 'Non-Billable' and need to calculate it. Calculation should be in row axis.
Calculation for Billing in row:
Billing = df[Billing] * sum/168 * 100
Calculation for Non-Billing in row:
Non-Billing = df[Non-Billing] * sum/ 168 * 100
Data
Employee Name | Java | Python| .Net | React | Billable|
----------------------------------------------------------------
|Priya | 10 | | 5 | | Yes |
|Krithi | | 10 | 20 | | No |
|Surthi | | 5 | | | yes |
|Meena | | 20 | | 10 | No |
|Manju | 20 | 10 | 10 | | Yes |
Output
I have tried using insert statement but i cannot keep on inserting it. I tried append also but its not working.
Bill_amt = []
Non_Bill_amt = []
for i in df['Billable']:
if i == "Yes" or i == None:
Bill_amt = (df[Bill_amt].sum(axis=1)/168 * 100).round(2)
df.insert (len( df.columns ), column='Billable Amount', value=Bill_amt )#inserting the column and it name
#CANNOT INSERT ROW AFTER IT AND CANNOT APPEND IT TOO
else:
Non_Bill_amt = (DF[Non_Bill_amt].sum ( axis=1 ) / 168 * 100).round ( 2 )
df.insert ( len ( df.columns ), column='Non Billable Amount', value=Non_Bill_amt ) #inserting the column and its name
#CANNOT INSERT ROW AFTER IT.
Use .sum(axis=1) and then np.where() to put the values in respective columns. For example:
x = df.loc[:, "Java":"React"].sum(axis=1) / 168 * 100
df["Bill"] = np.where(df["Billable"].str.lower() == "yes", x, "")
df["Non_Bill"] = np.where(df["Billable"].str.lower() == "no", x, "")
print(df)
Prints:
Employee_Name Java Python .Net React Billable Bill Non_Bill
0 Priya 10.0 NaN 5.0 NaN Yes 8.928571428571429
1 Krithi NaN 10.0 20.0 NaN No 17.857142857142858
2 Surthi NaN 5.0 NaN NaN yes 2.976190476190476
3 Meena NaN 20.0 NaN 10.0 No 17.857142857142858
4 Manju 20.0 10.0 10.0 NaN Yes 23.809523809523807

I have pandas dataframe with 3 columns and want output like this

DataFrame of 3 Column
a b c
1 2 4
1 2 4
1 2 4
Want Output like this
a b c a+b a+c b+c a+b+c
1 2 4 3 5 6 7
1 2 4 3 5 6 7
1 2 4 3 5 6 7
Create all combinations with length 2 or more by columns and then assign sum:
from itertools import chain, combinations
#https://stackoverflow.com/a/5898031
comb = chain(*map(lambda x: combinations(df.columns, x), range(2, len(df.columns)+1)))
for c in comb:
df[f'{"+".join(c)}'] = df.loc[:, c].sum(axis=1)
print (df)
a b c a+b a+c b+c a+b+c
0 1 2 4 3 5 6 7
1 1 2 4 3 5 6 7
2 1 2 4 3 5 6 7
You should always post your approach while asking a question. However, here it goes. This the easiest but probably not the most elegant way to solve it. For a more elegant approach, you should follow jezrael's answer.
Make your pandas dataframe here:
import pandas as pd
df = pd.DataFrame({"a": [1, 1, 1], "b": [2, 2, 2], "c": [4, 4, 4]})
Now make your desired dataframe like this:
df["a+b"] = df["a"] + df["b"]
df["a+c"] = df["a"] + df["c"]
df["b+c"] = df["b"] + df["c"]
df["a" + "b" + "c"] = df["a"] + df["b"] + df["c"]
This gives you:
| | a | b | c | a+b | a+c | b+c | abc |
|---:|----:|----:|----:|------:|------:|------:|------:|
| 0 | 1 | 2 | 4 | 3 | 5 | 6 | 7 |
| 1 | 1 | 2 | 4 | 3 | 5 | 6 | 7 |
| 2 | 1 | 2 | 4 | 3 | 5 | 6 | 7 |

Group rows based on the current occurance of a variable

I am trying to group a dataframe based on the occurrence a variable. For example take this dataframe
| col_1 | col_2
---------------------
0 | 1 | 1
1 | 0 | 1
2 | 0 | 1
3 | 0 | -1
4 | 0 | -1
5 | 0 | -1
6 | 0 | NaN
7 | -1 | NaN
8 | 0 | NaN
9 | 0 | -1
10| 0 | -1
11| 0 | -1
I want to group variable based on the current occurrence of a variable in column_2 to a dataframe and get the next sequence into another dataframe and likewise till the end of dataframe while also ignoring NaN.
So the final output would be like:
ones_1 =
| col_1 | col_2
---------------------
0 | 1 | 1
1 | 0 | 1
2 | 0 | 1
mones_1 =
3 | 0 | -1
4 | 0 | -1
5 | 0 | -1
mones_2 =
9 | 0 | -1
10| 0 | -1
11| 0 | -1
I suggest create dictionary of DataFrames:
#only non missing rows
mask = df['col_2'].notna()
#create unique groups
g = df['col_2'].ne(df['col_2'].shift()).cumsum()
#create counter of filtered g
g = g[mask].groupby(df['col_2']).transform(lambda x:pd.factorize(x)[0]) + 1
#map positive and negative values to strings and add counter values
g = df.loc[mask, 'col_2'].map({-1:'mones_',1:'ones_'}) + g.astype(str)
#generally groups
#g = 'val' + df.loc[mask, 'col_2'].astype(str) + ' no' + g.astype(str)
print (g)
0 ones_1
1 ones_1
2 ones_1
3 mones_1
4 mones_1
5 mones_1
9 mones_2
10 mones_2
11 mones_2
Name: col_2, dtype: object
#create dictionary of DataFrames
dfs = dict(tuple(df.groupby(g)))
print (dfs)
{'mones_1': col_1 col_2
3 0 -1.0
4 0 -1.0
5 0 -1.0, 'mones_2': col_1 col_2
9 0 -1.0
10 0 -1.0
11 0 -1.0, 'ones_1': col_1 col_2
0 1 1.0
1 0 1.0
2 0 1.0}
#select by keys
print (dfs['ones_1'])
col_1 col_2
0 1 1.0
1 0 1.0
2 0 1.0
It is not recommended, but possible create DataFrames by groups with variable names:
for i, g in df.groupby(g):
globals()[i] = g
print (ones_1)
col_1 col_2
0 1 1.0
1 0 1.0
2 0 1.0
here is another logic (keeping them in dictionary is the idea again):
m=df[df.col_2.notna()] #filter out the NaN rows
#check if the index are in sequence along with that check if values changes per row
s=m.col_2.ne(m.col_2.shift())|m.index.to_series().diff().fillna(1).gt(1)
dfs={f'df_{int(i)}':g for i , g in df.groupby(s.cumsum())} #groupby and store in dict
Access the dataframes by accessing the keys:
print(dfs['df_1'])
print('---------------------------------')
print(dfs['df_2'])
print('---------------------------------')
print(dfs['df_3'])
col_1 col_2
0 1 1.0
1 0 1.0
2 0 1.0
---------------------------------
col_1 col_2
3 0 -1.0
4 0 -1.0
5 0 -1.0
---------------------------------
col_1 col_2
9 0 -1.0
10 0 -1.0
11 0 -1.0

Replace specific value in pandas dataframe column, else convert column to numeric

Given the following pandas dataframe
+----+------------------+-------------------------------------+--------------------------------+
| | AgeAt_X | AgeAt_Y | AgeAt_Z |
|----+------------------+-------------------------------------+--------------------------------+
| 0 | Older than 100 | Older than 100 | 74.13 |
| 1 | nan | nan | 58.46 |
| 2 | nan | 8.4 | 54.15 |
| 3 | nan | nan | 57.04 |
| 4 | nan | 57.04 | nan |
+----+------------------+-------------------------------------+--------------------------------+
how can I replace values in specific columns which equal Older than 100 with nan
+----+------------------+-------------------------------------+--------------------------------+
| | AgeAt_X | AgeAt_Y | AgeAt_Z |
|----+------------------+-------------------------------------+--------------------------------+
| 0 | nan | nan | 74.13 |
| 1 | nan | nan | 58.46 |
| 2 | nan | 8.4 | 54.15 |
| 3 | nan | nan | 57.04 |
| 4 | nan | 57.04 | nan |
+----+------------------+-------------------------------------+--------------------------------+
Notes
After removing the Older than 100 string from the desired columns, I convert the columns to numeric in order to perform calculations on said columns.
There are other columns in this dataframe (that I have excluded from this example), which will not be converted to numeric, so the conversion to numeric must be done one column at a time.
What I've tried
Attempt 1
if df.isin('Older than 100'):
df.loc[df['AgeAt_X']] = ''
else:
df['AgeAt_X'] = pd.to_numeric(df["AgeAt_X"])
Attempt 2
if df.loc[df['AgeAt_X']] == 'Older than 100r':
df.loc[df['AgeAt_X']] = ''
elif df.loc[df['AgeAt_X']] == '':
df['AgeAt_X'] = pd.to_numeric(df["AgeAt_X"])
Attempt 3
df['AgeAt_X'] = ['' if ele == 'Older than 100' else df.loc[df['AgeAt_X']] for ele in df['AgeAt_X']]
Attempts 1, 2 and 3 return the following error:
KeyError: 'None of [0 NaN\n1 NaN\n2 NaN\n3 NaN\n4 NaN\n5 NaN\n6 NaN\n7 NaN\n8 NaN\n9 NaN\n10 NaN\n11 NaN\n12 NaN\n13 NaN\n14 NaN\n15 NaN\n16 NaN\n17 NaN\n18 NaN\n19 NaN\n20 NaN\n21 NaN\n22 NaN\n23 NaN\n24 NaN\n25 NaN\n26 NaN\n27 NaN\n28 NaN\n29 NaN\n ..\n6332 NaN\n6333 NaN\n6334 NaN\n6335 NaN\n6336 NaN\n6337 NaN\n6338 NaN\n6339 NaN\n6340 NaN\n6341 NaN\n6342 NaN\n6343 NaN\n6344 NaN\n6345 NaN\n6346 NaN\n6347 NaN\n6348 NaN\n6349 NaN\n6350 NaN\n6351 NaN\n6352 NaN\n6353 NaN\n6354 NaN\n6355 NaN\n6356 NaN\n6357 NaN\n6358 NaN\n6359 NaN\n6360 NaN\n6361 NaN\nName: AgeAt_X, Length: 6362, dtype: float64] are in the [index]'
Attempt 4
df['AgeAt_X'] = df['AgeAt_X'].replace({'Older than 100': ''})
Attempt 4 returns the following error:
TypeError: Cannot compare types 'ndarray(dtype=float64)' and 'str'
I've also looked at a few posts. The two below do not actually replace the value but create a new column derived from others
Replace specific values in Pandas DataFrame
Pandas replace DataFrame values
We can loop through each column and check if the sentence is present. If we get a hit, we replace the sentence with NaN with Series.str.replace and right after convert it to numeric with Series.astype, in this case float:
df.dtypes
AgeAt_X object
AgeAt_Y object
AgeAt_Z float64
dtype: object
sent = 'Older than 100'
for col in df.columns:
if sent in df[col].values:
df[col] = df[col].str.replace(sent, 'NaN')
df[col] = df[col].astype(float)
print(df)
AgeAt_X AgeAt_Y AgeAt_Z
0 NaN NaN 74.13
1 NaN NaN 58.46
2 NaN 8.40 54.15
3 NaN NaN 57.04
4 NaN 57.04 NaN
df.dtypes
AgeAt_X float64
AgeAt_Y float64
AgeAt_Z float64
dtype: object
If I understand you correctly, you can replace all occurrences of Older than 100 with np.nan with a single call to DataFrame.replace. If all remaining values are numeric, then the replace will implicitly change the data type of the column to numeric:
# Minimal example DataFrame
df = pd.DataFrame({'AgeAt_X': ['Older than 100', np.nan, np.nan],
'AgeAt_Y': ['Older than 100', np.nan, 8.4],
'AgeAt_Z': [74.13, 58.46, 54.15]})
df
AgeAt_X AgeAt_Y AgeAt_Z
0 Older than 100 Older than 100 74.13
1 NaN NaN 58.46
2 NaN 8.4 54.15
df.dtypes
AgeAt_X object
AgeAt_Y object
AgeAt_Z float64
dtype: object
# Replace occurrences of 'Older than 100' with np.nan in any column
df.replace('Older than 100', np.nan, inplace=True)
df
AgeAt_X AgeAt_Y AgeAt_Z
0 NaN NaN 74.13
1 NaN NaN 58.46
2 NaN 8.4 54.15
df.dtypes
AgeAt_X float64
AgeAt_Y float64
AgeAt_Z float64
dtype: object

Write 1s faster to col-rows based on positions in a list

I'm new to pandas. I'm using a dataframe to tally how many times two positions match.
Here is the code in question...right at the start. The "what am I trying to accomplish" below...
def crossovers(df, index):
# Duplicate the dataframe passed in
_dfcopy = df.copy(deep=True)
# Set all values to 0
_dfcopy[:] = 0.0
# change the value of any col/row where there's a shared SNP
for i in index:
for j in index:
if i == j: continue # Don't include self as a shared SNP
_dfcopy[i][j] = 1
# Return the DataFrame.
# Should only contain 0s (no shared SNP) or 1s ( a shared SNP)
return _dfcopy
QUESTION:*
The data is flipping all the 0s in a dataframe to 1s, for all the intersections of rows/columns in a list (see details below).
I.e. if the list is
_indices = [0,2,3]
...all the locations at (0,2); (0,3); (2,0); (2,3); (3,0); and (3,2) get flipped to 1s.
Currently I do this by iterating through the list recursively onto itself. But this is painfully slow...and I'm passing in 16 million lines of data (16 mil indices).
How can I speed up this overall process?
LONGER DESCRIPTION
I start with a dataframe called sharedby_BOTH similar to below, except much larger (70 cols x 70 rows)- I'm using it to tally occurrences of shared data intersections.
Rows (index) are labeled 0,1,2,3 & 4...70 - as are the columns. Each location contains a 0.
sharedby_BOTH
0 1 2 3 4 (more)
------------------
0 | 0 | 0 | 0 | 0 | 0
1 | 0 | 0 | 0 | 0 | 0
2 | 0 | 0 | 0 | 0 | 0
3 | 0 | 0 | 0 | 0 | 0
4 | 0 | 0 | 0 | 0 | 0
(more)
Then I have a list, which contains intersecting data.
_indices = [0,2,3 (more)] # for example
This means that 0, 2, & 3 all contain shared data. So, I pass it to crossovers which returns a dataframe with a "1" at the intersection places, obtaining this...
0 1 2 3 4 (more)
------------------
0 | 0 | 0 | 1 | 1 | 0
1 | 0 | 0 | 0 | 0 | 0
2 | 1 | 0 | 0 | 1 | 0
3 | 1 | 0 | 1 | 0 | 0
4 | 0 | 0 | 0 | 0 | 0
(more)
...where the shared data locations are (0,2),(0,3),(2,0),(2,3),(3,0),(3,2).
*Notice that self is not recognized [(0,0), (2,2), and (3,3) DO NOT have 1s] *
Then I add this to the original dataframe with this code (inside a loop)...
sharedby_BOTH = sharedby_BOTH.add(crossovers(sharedby_BOTH, _indices)
I repeat this in a loop...
for pos, pos_val in chrom_val.items(): # pos_val is a dict
_indices = [i for i, x in enumerate(pos_val["sharedby"]) if (x == "HET")]
sharedby_BOTH = sharedby_BOTH.add(crossovers(sharedby_BOTH, _indices))
The end result is that sharedby_BOTH will look like the following, if I added the three example _indices
sharedby_BOTH = sharedby_BOTH.add(crossovers(sharedby_BOTH, [0,2,3] ))
sharedby_BOTH = sharedby_BOTH.add(crossovers(sharedby_BOTH, [0,2,4] ))
sharedby_BOTH = sharedby_BOTH.add(crossovers(sharedby_BOTH, [0,2,3] ))
0 1 2 3 4 (more)
------------------
0 | 0 | 0 | 3 | 2 | 1
1 | 0 | 0 | 0 | 0 | 0
2 | 3 | 0 | 0 | 2 | 1
3 | 2 | 0 | 2 | 0 | 0
4 | 1 | 0 | 1 | 0 | 0
(more)
...where, amongst the three indices passed in...
0shared data with 2 a total of three times so (0,2) and (2,0) totaled three.
0shared data with 3 twice so (0,3) and (3,0) total two.
0shared data with 4 only once, so (0,4) and (4,0) total one.
I hope this makes sense :)
EDIT
I did try the following...
addit = pd.DataFrame(1, index=_indices, columns=_indices)
sharedby_BOTH = sharedby_BOTH.add(addit)
BUT...then any locations within sharedby_BOTH that DID NOT HAVE SHARED DATA ended up as NAN
I.e...
sharedby_BOTH = pd.DataFrame(0, index=[x for x in range(4)], columns=[x for x in range(4)])
_indices = [0,2,3 (more)] # for example
addit = pd.DataFrame(1, index=_indices, columns=_indices)
sharedby_BOTH = sharedby_BOTH.add(addit)
0 1 2 3 4 (more)
------------------
0 | NAN | NAN | 1 | 1 | NAN
1 | NAN | NAN | NAN | NAN | NAN
2 | 1 | NAN | NAN | 1 | NAN
3 | 1 | NAN | 1 | NAN | NAN
4 | NAN | NAN | NAN | NAN | NAN
(more)
I'd organize it with numpy slice assignment and the handy np.triu_indices function. It returns the row and column indices of the upper triangle. I make sure to pass k=1 to ensure I skip the diagonal. When I slice assign, I make sure to use both i, j and j, i to get upper and lower
triangles.
def xover(n, idx):
idx = np.asarray(idx)
a = np.zeros((n, n))
i_, j_ = np.triu_indices(len(idx), 1)
i = idx[i_]
j = idx[j_]
a[i, j] = 1
a[j, i] = 1
return a
pd.DataFrame(xover(len(df), [0, 2, 3]), df.index, df.columns)
0 1 2 3
0 0.0 0.0 1.0 1.0
1 0.0 0.0 0.0 0.0
2 1.0 0.0 0.0 1.0
3 1.0 0.0 1.0 0.0
Timings
%timeit pd.DataFrame(xover(len(df), [0, 2, 3]), df.index, df.columns)
10000 loops, best of 3: 192 µs per loop
%%timeit
for i,j in product(li,repeat=2):
if i != j:
ndf.loc[i,j] = 1
100 loops, best of 3: 6.8 ms per loop
You can use itertools product and loc for assignment i.e
from itertools import product
li = [ 0,2,3]
ndf = df.copy()
for i,j in product(li,repeat=2):
if i != j:
ndf.loc[i,j] = 1
0 1 2 3 4
0 0 0 1 1 0
1 0 0 0 0 0
2 1 0 0 1 0
3 1 0 1 0 0
4 0 0 0 0 0

Resources