Re-formatting a dataframe to show sequence number and time difference after a groupby - python-3.x

I have a pandas dataframe that has an identifier, a sequence number, and a timestamp.
For example:
MyIndex seq_no timestamp
1 181 7:56
1 182 7:57
1 183 7:59
2 184 8:01
2 185 8:04
3 186 8:05
3 187 8:08
3 188 8:10
I want to reformat by showing a sequence number for each index and with the time difference, something like:
MyIndex seq_no timediff
1 1 0
1 2 1
1 3 2
2 1 0
2 2 3
3 1 0
3 2 3
3 3 2
I know I can get the seq_no by doing
df.groupby("MyIndex")["seq_no"].rank(method="first", ascending=True)
but how do I get the time difference? Bonus points if you show me how to do the time difference between steps, or total timediff from the start.

I think the simplest way to get the difference is to convert the timestamp to a single unit. You can then calculate the difference with groupby and shift.
import pandas as pd
from io import StringIO
data = """Index seq_no timestamp
1 181 7:56
1 182 7:57
1 183 7:59
2 184 8:01
2 185 8:04
3 186 8:05
3 187 8:08
3 188 8:10"""
df = pd.read_csv(StringIO(data), sep=r'\s+')
# use cumcount to get new seq_no
df['seq_no_new'] = df.groupby('Index').cumcount() + 1
# can convert timestamp by splitting string
# and then casting to int
time = df['timestamp'].str.split(':', expand=True).astype(int)
df['time'] = time.iloc[:, 0] * 60 + time.iloc[:, 1]
# you then calculate the difference with groupby/shift
# fillna values with 0 and cast to int
df['timediff'] = (df['time'] - df.groupby('Index')['time'].shift(1)).fillna(0).astype(int)
# pick columns you want at the end
df = df.loc[:, ['Index', 'seq_no_new', 'timediff']]
Output
>>> df
Index seq_no_new timediff
0 1 1 0
1 1 2 1
2 1 3 2
3 2 1 0
4 2 2 3
5 3 1 0
6 3 2 3
7 3 3 2
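For the bonus part of the question (difference between steps and total time from the start of each group), a minimal sketch along the same lines could look like this. It assumes the timestamps are "H:MM" strings within a single day; the column names minutes, step_diff and since_start are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Index": [1, 1, 1, 2, 2, 3, 3, 3],
    "timestamp": ["7:56", "7:57", "7:59", "8:01", "8:04", "8:05", "8:08", "8:10"],
})

# Assumption: timestamps are "H:MM" strings that all fall on the same day.
parts = df["timestamp"].str.split(":", expand=True).astype(int)
df["minutes"] = parts[0] * 60 + parts[1]

grouped = df.groupby("Index")["minutes"]

# Minutes since the previous row of the same group (0 for the first row).
df["step_diff"] = grouped.diff().fillna(0).astype(int)

# Minutes elapsed since the first row of the group.
df["since_start"] = df["minutes"] - grouped.transform("first")

print(df[["Index", "step_diff", "since_start"]])
```

Here diff() does the same job as subtracting shift(1), and transform("first") broadcasts each group's first value back to every row of that group.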

Related

loops application in dataframe to find output

I have the following data:
dict={'A':[1,2,3,4,5],'B':[10,20,233,29,2],'C':[10,20,3040,230,238]...................}
and
df = pd.DataFrame(dict)
In this manner I have 20 columns with 5 numerical entries in each column.
I want to have a new column whose values follow this logic:
0 A[0]*B[0] + A[0]*C[0] + A[0]*D[0].......
1 A[1]*B[1] + A[1]*C[1] + A[1]*D[1].......
2 A[2]*B[2] + A[2]*C[2] + A[2]*D[2].......
I tried it in the following manner, but I cannot type out 20 columns manually, so I would like to know how to apply a loop to get the desired output:
lst = []
for i in range(0, 5):
    j = df.A[i]*df.B[i] + df.A[i]*df.C[i]+.......
    lst.append(j)
A potential solution is the following. I am only using the example you posted, but it works for more columns as well. Your data is df:
A B C
0 1 10 10
1 2 20 20
2 3 233 3040
3 4 29 230
4 5 2 238
You can create a new column, D, by first selecting every column except A
add = df.loc[:, df.columns != 'A']
and then multiplying column A by the row-wise sum of the columns in add, since A*B + A*C + ... equals A*(B + C + ...):
df['D'] = df['A']*add.sum(axis=1)
which returns
A B C D
0 1 10 10 20
1 2 20 20 80
2 3 233 3040 9819
3 4 29 230 1036
4 5 2 238 1200
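As a quick sanity check, here is a small sketch that compares the vectorised version with an explicit loop over the columns, using only the three columns from the question. The names others, vectorised, looped and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [10, 20, 233, 29, 2],
    "C": [10, 20, 3040, 230, 238],
})

# Vectorised: A * (B + C + ...) over every column except A.
others = df.drop(columns="A")
vectorised = df["A"] * others.sum(axis=1)

# Explicit loop over the other columns: A*B + A*C + ...
looped = sum(df["A"] * df[col] for col in others.columns)

df["result"] = vectorised
print(df)
```

Both give the same values, and neither needs the column names typed out by hand, so the same code should carry over to 20 columns.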

How to remove Initial rows in a dataframe in python

I have 4 dataframes with weekly sales values for a year for 4 products. Some of the initial rows are 0 because there were no sales. There are some other 0 values in between the weeks as well.
I want to remove those initial 0 values, keeping the in-between 0s.
For example:
Week Sales(prod 1)
1 0
2 0
3 100
4 120
5 55
6 0
7 60
Week Sales(prod 2)
1 0
2 0
3 0
4 120
5 0
6 30
7 60
I want to remove rows 1 and 2 from the first table and rows 1, 2 and 3 from the second.
A few assumptions based on your example dataframe:
- The DataFrame is created using pandas.
- Week always starts with 1.
- Only the leading weeks that have 0 sales are removed.
Solution:
Python libraries required:
- pandas, more_itertools
Example DataFrame (df):
Week Sales
1 0
2 0
3 0
4 120
5 0
6 30
7 60
Python Code:
import pandas as pd
import more_itertools as mit
filter_col = 'Sales'
filter_val = 0
## function which returns the index labels to be removed
def return_initial_week_index_with_zero_sales(df, filter_col, filter_val):
    index_wzs = [False]
    # only look for a leading run if the first row matches filter_val
    if df[filter_col].iloc[0] == filter_val:
        index_list = df[df[filter_col] == filter_val].index.tolist()
        # the first group of consecutive week numbers is the leading run
        index_wzs = [list(group) for group in mit.consecutive_groups(index_list)]
    return index_wzs[0]
## calling the function above and removing those index labels from the dataframe
df = df.set_index('Week')
weeks_to_be_removed = return_initial_week_index_with_zero_sales(df, filter_col, filter_val)
if weeks_to_be_removed:
    print('Initial weeks with 0 sales are {}'.format(weeks_to_be_removed))
    df = df.drop(index=weeks_to_be_removed)
else:
    print('No initial week has 0 sales')
df.reset_index(inplace=True)
Result: df
Week Sales
4 120
5 0
6 30
7 60
I hope it helps, you can modify the function as per your requirement.
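If you would rather avoid the extra dependency, here is a pandas-only sketch of a different technique: build a boolean mask that turns True at the first non-zero sale and stays True afterwards. It assumes the rows are already sorted by week; the names seen_sale and trimmed are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Week": [1, 2, 3, 4, 5, 6, 7],
    "Sales": [0, 0, 0, 120, 0, 30, 60],
})

# ne(0) marks non-zero sales; cummax() keeps the mask True from the
# first non-zero sale onwards, so only the leading zeros stay False.
seen_sale = df["Sales"].ne(0).cummax()

# Keep the rows from the first sale onwards; in-between zeros survive.
trimmed = df[seen_sale].reset_index(drop=True)
print(trimmed)
```

The same two lines can be applied to each of the four dataframes in turn.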

Selective multiplication of a pandas dataframe

I have a pandas Dataframe and Series of the form
df = pd.DataFrame({'Key':[2345,2542,5436,2468,7463],
'Segment':[0] * 5,
'Values':[2,4,6,6,4]})
print (df)
Key Segment Values
0 2345 0 2
1 2542 0 4
2 5436 0 6
3 2468 0 6
4 7463 0 4
s = pd.Series([5436, 2345])
print (s)
0 5436
1 2345
dtype: int64
In the original df, I want to multiply the third column (Values) by 7, except for the rows whose keys are present in the series.
What would be the best way to achieve this in Python 3.x?
Use DataFrame.loc with Series.isin: invert the membership condition with ~ to select the rows whose Key is not in the series, then multiply their Values column by the scalar:
df.loc[~df['Key'].isin(s), 'Values'] *= 7
print (df)
Key Segment Values
0 2345 0 2
1 2542 0 28
2 5436 0 6
3 2468 0 42
4 7463 0 28
Another method is numpy.where() (this needs import numpy as np):
df['Values'] *= np.where(~df['Key'].isin([5436, 2345]), 7, 1)
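Both approaches above modify df in place. If you need to keep the original frame, a non-mutating variant can be sketched with Series.where and DataFrame.assign; the names keep and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Key": [2345, 2542, 5436, 2468, 7463],
                   "Segment": [0] * 5,
                   "Values": [2, 4, 6, 6, 4]})
s = pd.Series([5436, 2345])

# True for the rows whose Key is in s, i.e. rows that must stay unchanged.
keep = df["Key"].isin(s)

# where() keeps the original value where keep is True, else takes Values * 7.
result = df.assign(Values=df["Values"].where(keep, df["Values"] * 7))
print(result)
```

assign returns a new DataFrame, so df itself still holds the original values afterwards.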

Subset and Loop to create a new column [duplicate]

With the DataFrame below as an example,
In [83]:
df = pd.DataFrame({'A':[1,1,2,2],'B':[1,2,1,2],'values':np.arange(10,30,5)})
df
Out[83]:
A B values
0 1 1 10
1 1 2 15
2 2 1 20
3 2 2 25
What would be a simple way to generate a new column containing some aggregation of the data over one of the columns?
For example, if I sum values over items in A
In [84]:
df.groupby('A').sum()['values']
Out[84]:
A
1 25
2 45
Name: values
How can I get
A B values sum_values_A
0 1 1 10 25
1 1 2 15 25
2 2 1 20 45
3 2 2 25 45
In [20]: df = pd.DataFrame({'A':[1,1,2,2],'B':[1,2,1,2],'values':np.arange(10,30,5)})
In [21]: df
Out[21]:
A B values
0 1 1 10
1 1 2 15
2 2 1 20
3 2 2 25
In [22]: df['sum_values_A'] = df.groupby('A')['values'].transform('sum')
In [23]: df
Out[23]:
A B values sum_values_A
0 1 1 10 25
1 1 2 15 25
2 2 1 20 45
3 2 2 25 45
I found a way using join:
In [101]:
aggregated = df.groupby('A').sum()['values']
aggregated.name = 'sum_values_A'
df.join(aggregated,on='A')
Out[101]:
A B values sum_values_A
0 1 1 10 25
1 1 2 15 25
2 2 1 20 45
3 2 2 25 45
Does anyone have a simpler way to do it?
This is not so direct but I found it very intuitive (the use of map to create new columns from another column) and can be applied to many other cases:
gb = df.groupby('A').sum()['values']
def getvalue(x):
    return gb[x]
df['sum'] = df['A'].map(getvalue)
df
In [15]: def sum_col(df, col, new_col):
   ....:     df[new_col] = df[col].sum()
   ....:     return df
In [16]: df.groupby("A").apply(sum_col, 'values', 'sum_values_A')
Out[16]:
A B values sum_values_A
0 1 1 10 25
1 1 2 15 25
2 2 1 20 45
3 2 2 25 45

Pandas assign value of one column based on another

Given the following data frame:
import pandas as pd
df = pd.DataFrame(
{'A':[10,20,30,40,50,60],
'B':[1,2,1,4,5,4]
})
df
A B
0 10 1
1 20 2
2 30 1
3 40 4
4 50 5
5 60 4
I would like a new column 'C' whose values equal those in 'A' where the corresponding value in 'B' is less than 3, and 0 otherwise.
The desired result is as follows:
A B C
0 10 1 10
1 20 2 20
2 30 1 30
3 40 4 0
4 50 5 0
5 60 4 0
Thanks in advance!
Use np.where (after import numpy as np):
df['C'] = np.where(df['B'] < 3, df['A'], 0)
>>> df
A B C
0 10 1 10
1 20 2 20
2 30 1 30
3 40 4 0
4 50 5 0
5 60 4 0
Here you can also use the pandas where method directly on the column:
In [3]:
df['C'] = df['A'].where(df['B'] < 3,0)
df
Out[3]:
A B C
0 10 1 10
1 20 2 20
2 30 1 30
3 40 4 0
4 50 5 0
5 60 4 0
Timings
In [4]:
%timeit df['A'].where(df['B'] < 3,0)
%timeit np.where(df['B'] < 3, df['A'], 0)
1000 loops, best of 3: 1.4 ms per loop
1000 loops, best of 3: 407 µs per loop
np.where is faster here, but the pandas where method does more checking and has more options, so the better choice depends on the use case.
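One practical difference between the two, sketched below with the question's data: np.where returns a plain NumPy array, while the pandas method returns a Series that keeps the index. The names mask, via_numpy and via_pandas are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30, 40, 50, 60],
                   "B": [1, 2, 1, 4, 5, 4]})

mask = df["B"] < 3

# NumPy version: the result is a plain ndarray, the index is dropped.
via_numpy = np.where(mask, df["A"], 0)

# pandas version: the result is a Series that keeps the original index.
via_pandas = df["A"].where(mask, 0)

print(type(via_numpy).__name__, type(via_pandas).__name__)
```

Assigning either result to df['C'] gives the same column here, because the array has the same length and order as the frame.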
