pandas create a Boolean column for a df based on one condition on a column of another df - python-3.x

I have two dfs, A and B. A is like,
date id
2017-10-31 1
2017-11-01 2
2017-08-01 3
B is like,
type id
1 1
2 2
3 3
I'd like to create a new boolean column has_b on A: set it to True if the corresponding row in B (joining A and B on id) does not have type == 1 and the date is more than 90 days older than datetime.utcnow(), and False otherwise. Here is my solution:
B = B[B['type'] != 1]
A['has_b'] = A.merge(B[['id', 'type']], how='left', on='id')['date'].apply(lambda x: datetime.utcnow().day - x.day > 90)
A['has_b'].fillna(value=False, inplace=True)
I expect A to result in:
date id has_b
2017-10-31 1 False
2017-11-01 2 False
2017-08-01 3 True
I am wondering if there is a better way to do this, in terms of more concise and efficient code.

First merge A and B on id -
i = A.merge(B, on='id')
Now, compute has_b -
x = i.type.ne(1)
y = (pd.to_datetime('today') - i.date).dt.days.gt(90)
i['has_b'] = (x & y)
Merge back i and A -
C = A.merge(i[['id', 'has_b']], on='id')
C
date id has_b
0 2017-10-31 1 False
1 2017-11-01 2 False
2 2017-08-01 3 True
Details
x will return a boolean mask for the first condition.
i.type.ne(1)
0 False
1 True
2 True
Name: type, dtype: bool
y will return a boolean mask for the second condition. Use to_datetime('today') to get the current date, subtract the date column from it, and access the days component with dt.days.
(pd.to_datetime('today') - i.date).dt.days.gt(90)
0 False
1 False
2 True
Name: date, dtype: bool
In case A's and B's IDs do not align, you may need a left merge instead of an inner merge for the last step -
C = A.merge(i[['id', 'has_b']], on='id', how='left')
C's has_b column will contain NaNs in this case.
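If those unmatched rows should count as False (as in the expected output above), a small follow-up sketch:
C = A.merge(i[['id', 'has_b']], on='id', how='left')
C['has_b'] = C['has_b'].fillna(False)  # ids with no qualifying row in B are treated as False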

Related

In Python Pandas, how do I combine two columns containing strings using if/else statement or similar?

I have created a pandas dataframe from an excel file where first two columns are:
df = pd.DataFrame({'0':['','','Location Code','pH','Ag','Alkalinity'], '1':['Lab Id','Collection Date','','','µg/L','mg/L']})
which looks like this:
df[0]            df[1]
                 Lab Id
                 Collection Date
Location Code
pH
Ag               µg/L
Alkalinity       mg/L
I want to merge these columns into one that looks like this:
df[0]
Lab Id
Collection Date
Location Code
pH
Ag (µg/L)
Alkalinity (mg/L)
I believe I need a control statement before combining df[0] and df[1] which would appear like this:
if there is a blank value in either column:
    df[0] = df[0].astype(str)+df[1].astype(str)
else:
    df[0] = df[0].astype(str)+' ('+df[1].astype(str)+')'
but I am not sure how to write the if statement. Could anyone please guide me here?
Thank you very much.
We can try np.select:
import numpy as np
cond = [(df['0']=='') & (df['1']!=''), (df['0']!='') & (df['1']==''), (df['0']!='') & (df['1']!='')]
val = [df['1'], df['0'], df['0'] + '(' + df['1'] + ')']
df['new'] = np.select(cond, val)
df
0 1 new
0 Lab Id Lab Id
1 Collection Date Collection Date
2 Location Code Location Code
3 pH pH
4 Ag µg/L Ag(µg/L)
5 Alkalinity mg/L Alkalinity(mg/L)
If the value is NaN, maybe:
df['result'] = df[0].fillna(df[1])
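A minimal sketch of that idea against the sample data, which uses empty strings rather than NaN, so the blanks need converting first (this only covers rows where one of the two columns is blank; rows with both filled still need the parentheses handling from the other answers):
import numpy as np
tmp = df.replace('', np.nan)               # turn the blank strings into real NaN
df['result'] = tmp['0'].fillna(tmp['1'])   # take column '0', falling back to column '1'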
This works using numpy.where; the string concatenation assumption (that units always end with "/L") is based on the data shared:
df.assign(
    merger=np.where(
        df["1"].str.endswith("/L"),
        df["0"].str.cat(df["1"], "(").add(")"),
        df["0"].str.cat(df["1"], ""),
    )
)
0 1 merger
0 Lab Id Lab Id
1 Collection Date Collection Date
2 Location Code Location Code
3 pH pH
4 Ag µg/L Ag(µg/L)
5 Alkalinity mg/L Alkalinity(mg/L)
Or, you could just assign it to "0", if that is what you are after:
df["0"] = np.where(
    df["1"].str.endswith("/L"),
    df["0"].str.cat(df["1"], "(").add(")"),
    df["0"].str.cat(df["1"], ""),
)
Here is another way:
First, for rows where both columns have a value, wrap the second column's value in parentheses:
df.loc[df.replace('', np.nan).notnull().all(axis=1), '1'] = '(' + df['1'] + ')'
Now we fill in the missing values with bfill and ffill:
df = df.replace('', np.nan).bfill(axis=1).ffill(axis=1)
The only thing remaining is to merge the values wherever we have brackets:
df.loc[:, 'merge'] = np.where(df['1'].str.endswith(')'), df['0'] + df['1'], df['1'])
Test whether there is an empty value in at least one of columns 0 and 1 with DataFrame.eq and DataFrame.any, and then join both columns (like in your own attempt) inside numpy.where:
df = pd.DataFrame({0:['','','Location Code','pH','Ag','Alkalinity'],
                   1:['Lab Id','Collection Date','','',u'µg/L','mg/L']})
print (df[[0,1]].eq(''))
0 1
0 True False
1 True False
2 False True
3 False True
4 False False
5 False False
print (df[[0,1]].eq('').any(axis=1))
0 True
1 True
2 True
3 True
4 False
5 False
dtype: bool
df[0] = np.where(df[[0,1]].eq('').any(axis=1),
                 df[0].astype(str)+df[1].astype(str),
                 df[0].astype(str)+' ('+df[1].astype(str)+')')
print (df)
0 1
0 Lab Id Lab Id
1 Collection Date Collection Date
2 Location Code
3 pH
4 Ag (µg/L) µg/L
5 Alkalinity (mg/L) mg/L

Looping in rows in pandas

My data frame has first columns as IDs as follows:
ID
A123
A234
A456
A123
A234
Now I need to create a new column Indicator which puts a 1 next to each ID that is repeated.
Desired Output:
ID Indicator
A123 1
A234 1
A456 0
A123 1
A234 1
This is a pretty simple operation in Pandas once you get the hang of it, so you may want to invest some time in a tutorial. What you need to do is call the convenient function duplicated() on the ID column, which is an instance of pandas.core.series.Series. So:
import pandas as pd
df = pd.DataFrame(["A123", "A234", "A456", "A123", "A234"], columns=["ID"])
df.ID.duplicated()
0 False
1 False
2 False
3 True
4 True
Name: ID, dtype: bool
It returns a Series with boolean values. You can take that new Series and call its apply function, which returns a new Series built from whatever the applied function returns. So to turn each boolean into 0 or 1, all you need to do is apply int:
df.ID.duplicated().apply(int)  # or df["ID"].duplicated().apply(int)
0 0
1 0
2 0
3 1
4 1
Name: ID, dtype: int64
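Note that with the default keep='first', the first occurrence of each ID stays 0, which differs from the desired output above (where every repeated ID gets a 1). If that output is really the intent, passing keep=False marks all occurrences of a duplicate:
df.ID.duplicated(keep=False).apply(int)  # 1 1 0 1 1 - every repeated ID is flagged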
There are lots of other convenience functions in Series. If you need to do something more complicated, you can apply() a custom function, e.g.
def custom_function(value):
    return str(int(value))
df.ID.duplicated().apply(custom_function)
0 0
1 0
2 0
3 1
4 1
Name: ID, dtype: object
You can also use the apply() of the DataFrame itself to call functions across all rows or columns, specified using axis.
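For instance, a minimal sketch against the same df (the lambdas are purely illustrative):
df.apply(lambda col: col.nunique())              # axis=0 (default): each column is passed as a Series
df.apply(lambda row: row['ID'].lower(), axis=1)  # axis=1: each row is passed as a Series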

Identify and count alternating parts of a column in a (timeseries) dataframe

I am analyzing trades done in a futures contract, based on a csv file with a list of trades (columns are Side, Qty, Price, Date).
I have imported the file and sorted the trades chronologically by time. The column "Side" (BUY/SELL) is now:
B
S
S
B
B
S
S
B
B
B
B
I want to give each run of consecutive B's and each run of consecutive S's a unique number, so that I can group the individual runs of B's and S's for further analysis. For example, I want to find out what the average price of each run of B's and each run of S's is.
In the example above there are 5 runs/parts in total, 3 of B's and 2 of S's. The first run of B's should be 1, the second run of B's should be 3, and the last run of B's should be 5. Basically I want to add a column with this output:
1
2
2
3
3
4
4
5
5
5
5
Now I should be able to find the average price of the four B's in run number 5 using groupby with the new column as argument and mean().
But how can I make the counter needed for this new column? I am able to identify each change using something like np.where(), diff(), abs() + cumsum() and 1 and -1, but I don't see how I can add +1 at each alternation.
Use Series.shift, compare for inequality with Series.ne, and take the cumulative sum with Series.cumsum:
df['new'] = df['Side'].ne(df['Side'].shift()).cumsum()
How it works:
df = df.assign(shifted = df['Side'].shift(),
               mask = df['Side'].ne(df['Side'].shift()),
               new = df['Side'].ne(df['Side'].shift()).cumsum())
print (df)
Side shifted mask new
0 B NaN True 1
1 S B True 2
2 S S False 2
3 B S True 3
4 B B False 3
5 S B True 4
6 S S False 4
7 B S True 5
8 B B False 5
9 B B False 5
10 B B False 5
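To then get the average price per run, which was the original goal, a short follow-up sketch assuming the Price column from the CSV is present:
df.groupby('new')['Price'].mean()  # average price for each run of consecutive B's or S's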

Looking for NaN values in a specific column in df [duplicate]

I know how to check the dataframe for specific values across multiple columns. However, I can't seem to work out how to carry out an if statement based on a boolean response.
For example:
Walk directories using os.walk and read in a specific file into a dataframe.
import os
import fnmatch
import pandas as pd

for root, dirs, files in os.walk(main):
    filters = '*specificfile.csv'
    for filename in fnmatch.filter(files, filters):
        df = pd.read_csv(os.path.join(root, filename), error_bad_lines=False)
Now I check that dataframe across multiple columns. The first value is the column name (column1), and the next value is the specific value I am looking for in that column (banana). I am then checking another column (column2) for a specific value (green). If both of these are true, I want to carry out a specific task; if not, I want to do something else.
so something like:
if (df['column1']=='banana') & (df['colour']=='green'):
    do something
else:
    do something
If you want to check whether any row of the DataFrame meets your conditions, you can use .any() along with your condition. Example -
if ((df['column1']=='banana') & (df['colour']=='green')).any():
Example -
In [16]: df
Out[16]:
A B
0 1 2
1 3 4
2 5 6
In [17]: ((df['A']==1) & (df['B'] == 2)).any()
Out[17]: True
This is because your condition - ((df['column1']=='banana') & (df['colour']=='green')) - returns a Series of True/False values.
This is because in pandas, when you compare a series against a scalar value, each row of the series is compared against that scalar, and the result is a series of True/False values indicating the outcome of each row's comparison. Example -
In [19]: (df['A']==1)
Out[19]:
0 True
1 False
2 False
Name: A, dtype: bool
In [20]: (df['B'] == 2)
Out[20]:
0 True
1 False
2 False
Name: B, dtype: bool
And & performs an element-wise (row-wise) logical AND of the two series. Example -
In [18]: ((df['A']==1) & (df['B'] == 2))
Out[18]:
0 True
1 False
2 False
dtype: bool
Now, to check if any of the values in this series is True, you can use .any(); to check if all the values in the series are True, you can use .all().
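For instance, against the same toy df from above:
((df['A']==1) & (df['B']==2)).any()  # True  - at least one row satisfies both conditions
((df['A']==1) & (df['B']==2)).all()  # False - not every row satisfies both conditions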

Comparing equality of groupby objects

Say we have dataframe one df1 and dataframe two df2.
import pandas as pd
dict1= {'group':['A','A','B','C','C','C'],'col2':[1,7,4,2,1,0],'col3':[1,1,3,4,5,3]}
df1 = pd.DataFrame(data=dict1).set_index('group')
dict2 = {'group':['A','A','B','C','C','C'],'col2':[1,7,400,2,1,0],'col3':[1,1,3,4,5,3500]}
df2 = pd.DataFrame(data=dict2).set_index('group')
df1
col2 col3
group
A 1 1
A 7 1
B 4 3
C 2 4
C 1 5
C 0 3
df2
col2 col3
group
A 1 1
A 7 1
B 400 3
C 2 4
C 1 5
C 0 3500
In pandas it is easy to compare the equality of these two dataframes with df1.equals(df2). In this case False.
However, we can see that some of the groups (A in the given toy example) are equal and some are not (groups B and C). I want to check for equality between these groups; in other words, check the equality between the sub-dataframes with index A, B, etc.
Here is my attempt. We wish to group the data
g1 = df1.groupby('group')
g2 = df2.groupby('group')
Naively trying g1.equals(g2) gives the error Cannot access callable attribute 'equals' of 'DataFrameGroupBy' objects, try using the 'apply' method.
However, if we try
g1.apply(lambda x: x.equals(g2))
We get a series
group
A False
B False
C False
dtype: bool
However, the first entry should be True, since group A is equal between the two dataframes.
I can see that I could laboriously construct nested loops to do this, but that's slow. I feel there is a way to do this in pandas without using loops. Am I misusing the apply method?
You can call get_group on g2 to retrieve the group to compare, you can access the group name using the attribute .name:
In[316]:
g1.apply(lambda x: x.equals(g2.get_group(x.name)))
Out[316]:
group
A True
B False
C False
dtype: bool
EDIT
To handle non-existent groups:
In[320]:
g1.apply(lambda x: x.equals(g2.get_group(x.name)) if x.name in g2.groups else False)
Out[320]:
group
A True
B False
C False
dtype: bool
Example:
In[323]:
dict1 = {'group':['A','A','B','C','C','C','D'],'col2':[1,7,4,2,1,0,-1],
         'col3':[1,1,3,4,5,3,-1]}
df1 = pd.DataFrame(data=dict1).set_index('group')
g1 = df1.groupby('group')
g1.apply(lambda x: x.equals(g2.get_group(x.name)) if x.name in g2.groups else False)
Out[323]:
group
A True
B False
C False
D False
dtype: bool
Here, .groups returns a dict of the groups whose keys are the group names/labels, so we can test for existence using x.name in g2.groups and modify the lambda to handle non-existent groups.
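If the two frames are known to have identical shape and row order within each group, a loop-free alternative sketch (not from the original answer) is to compare them elementwise and reduce per group; note that unlike equals, this treats NaN as unequal to NaN:
(df1 == df2).groupby('group').all().all(axis=1)  # A True, B False, C False for the toy data above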
