How to combine conditional formatting and str.contains in pandas dataframe to create new column? - python-3.x

I am trying to add new columns to a pandas dataframe based on the text in an existing column. For example, this is my data:
>>> data
No Description
1 Extention Slack 1 Month
2 Extention Slack 1 Year
3 Slack 6 Month
4 Slack 3 Month
What I need is
No Description M M+1 M+2 M+3 M+4 M+5 M+6 ... M+11
1 Extention Slack 1 Month 1 0 0 0 0 0 0 0
2 Extention Slack 1 Year 1 1 1 1 1 1 1 1
3 Slack 6 Month 1 1 1 1 1 1 0 0
4 Slack 3 Month 1 1 1 0 0 0 0 0
What I did is
import numpy as np
data['M'] = np.where(data['Description'].str.contains('1 Year'), 1, 0)
How am I supposed to do this?

From the Description column, you want to infer, based on the trailing part {time} {time_label} like 1 Year or 1 Month, where to fill with ones or zeros over a period of 12 months.
Here's one way to do what you want:
# create two temporary columns
# time: the numeric value, time_label: its unit (Month or Year)
df['time'], df['time_label'] = df.Description.str.split().apply(lambda x: pd.Series(x[-2:])).values.T
# define the numeric equivalent (in months) of Month and Year
mapping = {"Month": 1, "Year": 12}
for month in range(12):
    # the if is only here to pretty print M, M+1, M+2, ...
    # you can remove it if you accept M+0, M+1, ...
    if month == 0:
        df["M"] = np.where(df.time.astype(int)*df.time_label.map(mapping) >= month+1, 1, 0)
    else:
        df["M+" + str(month)] = np.where(df.time.astype(int)*df.time_label.map(mapping) >= month+1, 1, 0)
A fully reproducible example:
import pandas as pd
import numpy as np
from io import StringIO
data = """
No Description
1 "Extention Slack 1 Month"
2 "Extention Slack 1 Year"
3 "Slack 6 Month"
4 "Slack 3 Month"
"""
# StringIO(data): to simulate reading the data
# replace df with your dataframe
df = pd.read_csv(StringIO(data), sep=r"\s+")
# create two temporary columns
# time: the numeric value, time_label: its unit (Month or Year)
df['time'], df['time_label'] = df.Description.str.split().apply(lambda x: pd.Series(x[-2:])).values.T
# define the numeric equivalent (in months) of Month and Year
mapping = {"Month": 1, "Year": 12}
for month in range(12):
    # the if is only here to pretty print M, M+1, M+2, ...
    if month == 0:
        df["M"] = np.where(df.time.astype(int)*df.time_label.map(mapping) >= month+1, 1, 0)
    else:
        df["M+" + str(month)] = np.where(df.time.astype(int)*df.time_label.map(mapping) >= month+1, 1, 0)
# remove the temporary columns
df.drop(['time', 'time_label'], axis=1, inplace=True)
print(df)
output:
No Description M M+1 M+2 M+3 M+4 M+5 M+6 M+7 M+8 \
0 1 Extention Slack 1 Month 1 0 0 0 0 0 0 0 0
1 2 Extention Slack 1 Year 1 1 1 1 1 1 1 1 1
2 3 Slack 6 Month 1 1 1 1 1 1 0 0 0
3 4 Slack 3 Month 1 1 1 0 0 0 0 0 0
M+9 M+10 M+11
0 0 0 0
1 1 1 1
2 0 0 0
3 0 0 0
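For reference, the same fill can also be produced without a Python loop by broadcasting a month index against each row's duration. This is a vectorized sketch of my own (the regex and intermediate names are not from the answer above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "No": [1, 2, 3, 4],
    "Description": ["Extention Slack 1 Month", "Extention Slack 1 Year",
                    "Slack 6 Month", "Slack 3 Month"],
})

# Pull the trailing "<number> <unit>" pair out of each description
parts = df["Description"].str.extract(r"(?P<time>\d+)\s+(?P<unit>Month|Year)$")
months = parts["time"].astype(int) * parts["unit"].map({"Month": 1, "Year": 12})

# Broadcast a 0..11 month index against each row's duration in months:
# cell (i, m) is 1 while month m is still inside row i's subscription
cols = ["M"] + [f"M+{m}" for m in range(1, 12)]
flags = (np.arange(12) < months.to_numpy()[:, None]).astype(int)
df = df.join(pd.DataFrame(flags, columns=cols, index=df.index))
```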

Related

groupby and trim some rows based on condition

I have a data frame something like this:
df = pd.DataFrame({"ID": [1,1,2,2,2,3,3,3,3,3],
                   "IF_car": [1,0,0,1,0,0,0,1,0,1],
                   "IF_car_history": [0,0,0,1,0,0,0,1,0,1],
                   "observation": [0,0,0,1,0,0,0,2,0,3]})
I want to trim rows within each ID group based on the condition "IF_car_history" == 1. I tried:
tried_df = df.groupby(['ID']).apply(lambda x: x.loc[:(x['IF_car_history'] == 1).idxmax(), :]).reset_index(drop=True)
I want to drop the rows in each group that come after the last row where ['IF_car_history'] == 1.
expected output:
Thanks
First create the mask m by comparing values with Series.eq. Because it is necessary to remove everything after each group's last 1, the values are processed in reverse by slicing with [::-1]: a GroupBy.cumsum over the reversed mask stays 0 only on the rows after the last 1, so compare it with 0, reverse back, and filter by boolean indexing.
m = df['IF_car_history'].eq(1).iloc[::-1]
df1 = df[m.groupby(df['ID']).cumsum().ne(0).iloc[::-1]]
print (df1)
ID IF_car IF_car_history observation
2 2 0 0 0
3 2 1 1 1
5 3 0 0 0
6 3 0 0 0
7 3 1 1 2
8 3 0 0 0
9 3 1 1 3
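Putting the question's data and this mask together gives a runnable example (a sketch for checking the result against the output above):

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
                   "IF_car": [1, 0, 0, 1, 0, 0, 0, 1, 0, 1],
                   "IF_car_history": [0, 0, 0, 1, 0, 0, 0, 1, 0, 1],
                   "observation": [0, 0, 0, 1, 0, 0, 0, 2, 0, 3]})

# Reversed mask: a per-group cumsum over it is 0 exactly on the rows
# that come after the group's last IF_car_history == 1
m = df['IF_car_history'].eq(1).iloc[::-1]
df1 = df[m.groupby(df['ID']).cumsum().ne(0).iloc[::-1]]
```

Groups with no 1 at all (ID 1 here) are dropped entirely, since their cumsum never leaves 0.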

I have a DataFrame's columns, and the data in lists; I want to put the relevant data into the relevant column

Suppose you are given a list of all the items you can have, and separately a list of records whose shape is not fixed (each record may contain any number of items). You want to create a DataFrame from it, putting each item into the right column.
for example
columns = ['shirt','shoe','tie','hat']
data = [['hat','tie'],
['shoe', 'tie', 'shirt'],
['tie', 'shirt',]]
# and from this I want to create dummy variables like this
shirt shoe tie hat
0 0 0 1 1
1 1 1 1 0
2 1 0 1 0
If you want indicator columns filled with 0 and 1 only, use MultiLabelBinarizer, with DataFrame.reindex to change the ordering of the columns to match the list and, if some value does not exist in the data, to add an all-zero column for it:
columns = ['shirt','shoe','tie','hat']
data = [['hat','tie'],
['shoe', 'tie', 'shirt'],
['tie', 'shirt',]]
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = (pd.DataFrame(mlb.fit_transform(data), columns=mlb.classes_)
        .reindex(columns, axis=1, fill_value=0))
print (df)
shirt shoe tie hat
0 0 0 1 1
1 1 1 1 0
2 1 0 1 0
Or Series.str.get_dummies:
df = pd.Series(data).str.join('|').str.get_dummies().reindex(columns, axis=1, fill_value=0)
print (df)
shirt shoe tie hat
0 0 0 1 1
1 1 1 1 0
2 1 0 1 0
This is one approach using collections.Counter.
Ex:
from collections import Counter
columns = ['shirt','shoe','tie','hat']
data = [['hat','tie'],
['shoe', 'tie', 'shirt'],
['tie', 'shirt']]
data = map(Counter, data)
df = pd.DataFrame(data, columns=columns).fillna(0).astype(int)
print(df)
Output:
shirt shoe tie hat
0 0 0 1 1
1 1 1 1 0
2 1 0 1 0
You can try converting data to a dataframe:
data = [['hat','tie'],
['shoe', 'tie', 'shirt'],
['tie', 'shirt',]]
df = pd.DataFrame(data)
df
0 1 2
0 hat tie None
1 shoe tie shirt
2 tie shirt None
Then use:
pd.get_dummies(df.stack()).groupby(level=0).agg('sum')
hat shirt shoe tie
0 1 0 0 1
1 0 1 1 1
2 0 1 0 1
Explanation:
df.stack() returns a MultiIndex Series:
0  0      hat
   1      tie
1  0     shoe
   1      tie
   2    shirt
2  0      tie
   1    shirt
dtype: object
If we get the dummy values of this series we get:
     hat  shirt  shoe  tie
0 0    1      0     0    0
  1    0      0     0    1
1 0    0      0     1    0
  1    0      0     0    1
  2    0      1     0    0
2 0    0      0     0    1
  1    0      1     0    0
Then you just have to group by the first index level and merge the rows using sum (which is safe because get_dummies only produces zeros and ones):
df = pd.get_dummies(df.stack()).groupby(level=0).agg('sum')
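A further sketch of my own (not from the answers above) reaches the same table in one step with pd.crosstab on the stacked Series, cross-tabulating row number against item:

```python
import pandas as pd

columns = ['shirt', 'shoe', 'tie', 'hat']
data = [['hat', 'tie'],
        ['shoe', 'tie', 'shirt'],
        ['tie', 'shirt']]

# Stack to one item per row, then cross-tabulate row number vs. item;
# reindex restores the requested column order (missing items become 0)
s = pd.DataFrame(data).stack()
df = (pd.crosstab(s.index.get_level_values(0), s)
        .reindex(columns, axis=1, fill_value=0))
```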

Faster way to count number of timestamps before another timestamp

I have two DataFrames, "train" and "log". "log" has a datetime column "time1" while "train" has a datetime column "time2". For every row in "train" I want to count the values of "time1" (for the same user_id) that come before "time2".
I already tried the apply method on the DataFrame.
def log_count(row):
    return sum((log['user_id'] == row['user_id']) & (log['time1'] < row['time2']))
train.apply(log_count, axis=1)
It is taking very long with this approach.
Since you want to do this once for each (paired) user_id group, you could do the following:
Create a column called is_log which is 1 in log and 0 in train:
log['is_log'] = 1
train['is_log'] = 0
The is_log column will be used to keep track of whether or not a row comes from log or train.
Concatenate the log and train DataFrames:
combined = pd.concat(
    [log.rename(columns=dict(time1="time")), train.rename(columns=dict(time2="time"))],
    axis=0,
    ignore_index=True,
    sort=False,
)
Sort the combined DataFrame by user_id and time:
combined = combined.sort_values(by=["user_id", "time"])
So now combined looks something like this:
time user_id is_log
6 2000-01-17 0 0
0 2000-03-13 0 1
1 2000-06-08 0 1
7 2000-06-25 0 0
4 2000-07-09 0 1
8 2000-07-18 0 0
10 2000-03-13 1 0
5 2000-04-16 1 0
3 2000-08-04 1 1
9 2000-08-17 1 0
2 2000-10-20 1 1
Now the count that you are looking for can be expressed as a cumulative sum of the is_log column, grouped by user_id:
combined["count"] = combined.groupby("user_id")["is_log"].cumsum()
train = combined.loc[combined["is_log"] == 0]
This is the main idea: Counting the number of 1s in the is_log column is equivalent to counting the number of times in log which come before each time in train.
For example,
import numpy as np
import pandas as pd
np.random.seed(2019)
def random_dates(N):
    return np.datetime64("2000-01-01") + np.random.randint(
        365, size=N
    ) * np.timedelta64(1, "D")
N = 5
log = pd.DataFrame({"time1": random_dates(N), "user_id": np.random.randint(2, size=N)})
train = pd.DataFrame(
    {
        "time2": np.r_[random_dates(N), log.loc[0, "time1"]],
        "user_id": np.random.randint(2, size=N + 1),
    }
)
log["is_log"] = 1
train["is_log"] = 0
combined = pd.concat(
    [log.rename(columns=dict(time1="time")), train.rename(columns=dict(time2="time"))],
    axis=0,
    ignore_index=True,
    sort=False,
)
combined = combined.sort_values(by=["user_id", "time"])
combined["count"] = combined.groupby("user_id")["is_log"].cumsum()
train = combined.loc[combined["is_log"] == 0]
print(log)
# time1 user_id is_log
# 0 2000-03-13 0 1
# 1 2000-06-08 0 1
# 2 2000-10-20 1 1
# 3 2000-08-04 1 1
# 4 2000-07-09 0 1
print(train)
yields
time user_id is_log count
6 2000-01-17 0 0 0
7 2000-06-25 0 0 2
8 2000-07-18 0 0 3
10 2000-03-13 1 0 0
5 2000-04-16 1 0 0
9 2000-08-17 1 0 1
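If sorting the concatenated frame ever becomes the bottleneck, a per-user binary search is another option. This sketch is my own (same column names as above, data hard-coded for illustration) and uses np.searchsorted against each user's sorted log times:

```python
import numpy as np
import pandas as pd

log = pd.DataFrame({"user_id": [0, 0, 0, 1, 1],
                    "time1": pd.to_datetime(["2000-03-13", "2000-06-08",
                                             "2000-07-09", "2000-08-04",
                                             "2000-10-20"])})
train = pd.DataFrame({"user_id": [0, 0, 1],
                      "time2": pd.to_datetime(["2000-01-17", "2000-06-25",
                                               "2000-08-17"])})

# For each user, binary-search the train times against that user's
# sorted log times; side="left" counts strictly earlier timestamps
counts = np.empty(len(train), dtype=int)
for uid, grp in train.groupby("user_id"):
    times = np.sort(log.loc[log["user_id"] == uid, "time1"].to_numpy())
    counts[grp.index] = np.searchsorted(times, grp["time2"].to_numpy(), side="left")
train["count"] = counts
```

The loop runs once per user rather than once per row, so it stays cheap when there are few users relative to rows.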

Comparing two different-sized pandas DataFrames to find the row indices with equal values

I need some help with comparing two pandas DataFrames.
I have two dataframes
The first dataframe is
df1 =
a b c d
0 1 1 1 1
1 0 1 0 1
2 0 0 0 1
3 1 1 1 1
4 1 0 1 0
5 1 1 1 0
6 0 0 1 0
7 0 1 0 1
and the second dataframe is
df2 =
a b c d
0 1 1 1 1
1 1 0 1 0
2 0 0 1 0
I want to find the row indices of dataframe 1 (df1) where the entire row is the same as a row in dataframe 2 (df2). My expected result would be
0
3
4
6
The above indices do not need to be in any particular order; all I want is the indices of dataframe 1 (df1).
Is there a way without using for loop?
Thanks
Tommy
You can use merge:
df1.merge(df2,indicator=True,how='left').loc[lambda x : x['_merge']=='both'].index
Out[459]: Int64Index([0, 3, 4, 6], dtype='int64')
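The one-liner can be checked end-to-end with the question's data. One caveat of my own, not from the answer: the positions line up with df1's index only because a left merge preserves df1's row order and each row here matches at most one row of df2; duplicate rows in df2 would repeat left rows and shift the index.

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 0, 0, 1, 1, 1, 0, 0],
                    'b': [1, 1, 0, 1, 0, 1, 0, 1],
                    'c': [1, 0, 0, 1, 1, 1, 1, 0],
                    'd': [1, 1, 1, 1, 0, 0, 0, 1]})
df2 = pd.DataFrame({'a': [1, 1, 0],
                    'b': [1, 0, 0],
                    'c': [1, 1, 1],
                    'd': [1, 0, 0]})

# Left merge on all columns; '_merge' == 'both' marks rows also found in df2
idx = (df1.merge(df2, indicator=True, how='left')
          .loc[lambda x: x['_merge'] == 'both'].index)
```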

How to apply a function to a data frame column to create an iterated column

I have IDs with system event times. I have grouped the event times by ID (individual systems) and made a new column whose value is 1 if eventtimes.diff() is greater than 1 day, else 0. Now that I have the flag, I am trying to write a function to apply with groupby('ID') so that the new column starts at 1 and keeps returning 1 for each row until the flag shows 1; then the new column goes up by 1, to 2, and keeps returning 2 until the flag shows 1 again.
I will apply this along with groupby('ID') since I need the new column to start over again at 1 for each ID.
I have tried the following:
def flag_count(x):
    y = 1
    if x['flag'] == 0:
        y = y
    else:
        y += y + 1
df['NewCol'] = df.groupby('ID')['flag'].apply(flag_count)
I have tried differing variations of the above to no avail. Thanks in advance for any help you may provide.
Also, feel free to let me know if I messed up posting the question. Not sure if my title is great either.
Use boolean indexing for filtering + cumcount + reindex, which is a much faster solution than a loopy apply.
I think you need to count only the 1s per group, with 1 filled into the output where there is no 1:
df = pd.DataFrame({
    'ID': ['a','a','a','a','b','b','b','b','b'],
    'flag': [0,0,1,1,0,0,1,1,1]
})
df['new'] = (df[df['flag'] == 1].groupby('ID')['flag']
               .cumcount()
               .add(1)
               .reindex(df.index, fill_value=1))
print (df)
ID flag new
0 a 0 1
1 a 0 1
2 a 1 1
3 a 1 2
4 b 0 1
5 b 0 1
6 b 1 1
7 b 1 2
8 b 1 3
Detail:
#filter by condition
print (df[df['flag'] == 1])
ID flag
2 a 1
3 a 1
6 b 1
7 b 1
8 b 1
#count per group
print (df[df['flag'] == 1].groupby('ID')['flag'].cumcount())
2 0
3 1
6 0
7 1
8 2
dtype: int64
#add 1 for count from 1
print (df[df['flag'] == 1].groupby('ID')['flag'].cumcount().add(1))
2 1
3 2
6 1
7 2
8 3
dtype: int64
If you need to count the 0s instead, with -1 filled in where there is no 0:
df['new'] = (df[df['flag'] == 0].groupby('ID')['flag']
               .cumcount()
               .add(1)
               .reindex(df.index, fill_value=-1))
print (df)
ID flag new
0 a 0 1
1 a 0 2
2 a 1 -1
3 a 1 -1
4 b 0 1
5 b 0 2
6 b 1 -1
7 b 1 -1
8 b 1 -1
Another 2 step solution:
df['new'] = df[df['flag'] == 1].groupby('ID')['flag'].cumcount().add(1)
df['new'] = df['new'].fillna(1).astype(int)
print (df)
ID flag new
0 a 0 1
1 a 0 1
2 a 1 1
3 a 1 2
4 b 0 1
5 b 0 1
6 b 1 1
7 b 1 2
8 b 1 3
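Under another reading of the question, where the new column should go up by 1 at every flag == 1 row rather than restarting the count of 1s, a per-group cumulative sum gives it directly. This sketch is my own interpretation, not part of the answer above:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b'],
    'flag': [0, 0, 1, 1, 0, 0, 1, 1, 1]
})

# Start at 1 in each ID group and add 1 at every flag == 1 row
df['new'] = df.groupby('ID')['flag'].cumsum().add(1)
```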
