Group pandas elements according to a column - python-3.x

I have the following pandas dataframe:
import pandas as pd
data = {'Sentences':['Sentence1', 'Sentence2', 'Sentence3', 'Sentences4', 'Sentences5', 'Sentences6','Sentences7', 'Sentences8'],'Time':[1,0,0,1,0,0,1,0]}
df = pd.DataFrame(data)
print(df)
I was wondering how to extract all the "Sentences" according to the "Time" column. I want to gather all the "sentences" from the first "1" to the last "0".
Maybe the expected output explains it better:
[[Sentences1,Sentences2,Sentences3],[Sentences4,Sentences5,Sentences6],[Sentences7,Sentences8]]
Is this somehow possible? Sorry, I am very new to pandas.

Try this:
s = df['Time'].cumsum()
df.set_index([s, df.groupby(s).cumcount()])['Sentences'].unstack().to_numpy().tolist()
Output:
[['Sentence1', 'Sentence2', 'Sentence3'],
['Sentences4', 'Sentences5', 'Sentences6'],
['Sentences7', 'Sentences8', nan]]
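Details:
Use cumsum on Time to build a group id: the running total increases at each 1, so every 1 and the 0s that follow it share the same id.
Next, use groupby with cumcount to number the rows within each group.
Lastly, use set_index and unstack to reshape the DataFrame. Groups shorter than the longest one are padded with NaN, as in the last row of the output above.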
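If the NaN padding is unwanted, a possible variation is to keep the same cumsum group id but collect each group into a plain list. This is only a sketch of that idea, reusing the question's data; the names group_id and result are mine:

```python
import pandas as pd

df = pd.DataFrame({
    'Sentences': ['Sentence1', 'Sentence2', 'Sentence3', 'Sentences4',
                  'Sentences5', 'Sentences6', 'Sentences7', 'Sentences8'],
    'Time': [1, 0, 0, 1, 0, 0, 1, 0],
})

# Running total of Time: it increases at every 1, so each 1 and the 0s
# after it get the same group id.
group_id = df['Time'].cumsum()

# Collect the sentences of each group into a list (groups may differ in length).
result = df.groupby(group_id)['Sentences'].agg(list).tolist()
print(result)
```

Since the lists are built per group, the last one simply has two elements instead of a trailing NaN.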

Related

Dealing with duplicates in a pandas query

I have the following DataFrame:
data = {'Customer_ID': ['123','2','1010','123'],
'Date_Create': ['12/08/2010','04/10/1998','27/05/2010','12/08/2010'],
'Purchase':[1,1,0,1]
}
df = pd.DataFrame(data, columns = ['Customer_ID', 'Date_Create','Purchase'])
I want to perform this query:
df_2 = df[['Customer_ID','Date_Create','Purchase']].groupby(['Customer_ID'],
as_index=False).sum().sort_values(by='Purchase', ascending=False)
The objective of this query is to sum all purchases (a boolean field) and output a DataFrame with 3 columns: 'Customer_ID', 'Date_Create', 'Purchase'.
The problem is that the field Date_Create is not in the query result. It contains duplicates because the creation date of an account does not change.
How can I solve it?
Thanks
If I'm understanding it correctly and your source data has some duplicates,
there's a function specifically for this: DataFrame.drop_duplicates()
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
To only consider some columns in the duplicate check, use subset:
df2 = df.drop_duplicates(subset=['Customer_ID','Date_Create'])
You can add the Date_Create column to the groupby if its values are the same per Customer_ID:
(df.groupby(['Customer_ID','Date_Create'], as_index=False)['Purchase']
.sum()
.sort_values(by='Purchase', ascending=False))
If not, use an aggregation function, e.g. GroupBy.first for the first date per group:
(df.groupby('Customer_ID')
.agg(Purchase = ('Purchase', 'sum'), Date_Create= ('Date_Create', 'first'))
.reset_index()
.sort_values(by='Purchase', ascending=False))
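To decide between the two approaches, one option is to check first whether every customer really has a single creation date. A rough sketch using the question's data; the variable names one_date_each and out are my own:

```python
import pandas as pd

df = pd.DataFrame({
    'Customer_ID': ['123', '2', '1010', '123'],
    'Date_Create': ['12/08/2010', '04/10/1998', '27/05/2010', '12/08/2010'],
    'Purchase': [1, 1, 0, 1],
})

# True when every customer has exactly one distinct creation date.
one_date_each = df.groupby('Customer_ID')['Date_Create'].nunique().eq(1).all()

if one_date_each:
    # Safe to group by both columns: the date adds no extra groups.
    out = df.groupby(['Customer_ID', 'Date_Create'], as_index=False)['Purchase'].sum()
else:
    # Otherwise pick one date per customer explicitly.
    out = df.groupby('Customer_ID', as_index=False).agg(
        Date_Create=('Date_Create', 'first'),
        Purchase=('Purchase', 'sum'),
    )

out = out.sort_values(by='Purchase', ascending=False)
print(out)
```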

how to use lambda function to select larger values from two python dataframes whilst comparing them by date?

I want to map through the rows of df1 and compare them with the values of df2, by month and day, across every year in df2, leaving only the values in df1 which are larger than those in df2, to add into a new column, 'New'. df1 and df2 are the same size and are indexed by 'Month' and 'Day'. What would be the best way to do this?
df1=pd.DataFrame({'Date':['2015-01-01','2015-01-02','2015-01-03','2015-01-04','2005-01-05'],'Values':[-5.6,-5.6,0,3.9,9.4]})
df1.Date=pd.to_datetime(df1.Date)
df1['Day']=pd.DatetimeIndex(df1['Date']).day
df1['Month']=pd.DatetimeIndex(df1['Date']).month
df1.set_index(['Month','Day'],inplace=True)
df1
df2 = pd.DataFrame({'Date':['2005-01-01','2005-01-02','2005-01-03','2005-01-04','2005-01-05'],'Values':[-13.3,-12.2,6.7,8.8,15.5]})
df2.Date=pd.to_datetime(df2.Date)
df2['Day']=pd.DatetimeIndex(df2['Date']).day
df2['Month']=pd.DatetimeIndex(df2['Date']).month
df2.set_index(['Month','Day'],inplace=True)
df2
df1 and df2
df2['New']=df2[df2['Values']<df1['Values']]
gives
ValueError: Can only compare identically-labeled Series objects
I have also tried
df2['New']=df2[df2['Values'].apply(lambda x: x < df1['Values'].values)]
The best way to handle your problem is to use NumPy. NumPy has a function called "where" that helps a lot in cases like this.
This is how the statement works:
df1['new column that will contain the comparison results'] = np.where(condition,'value if true','value if false').
First import the library:
import numpy as np
Using the condition provided by you:
df2['New'] = np.where(df2['Values'] > df1['Values'], df2['Values'],'')
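So, I think that solves your problem... You can change the value passed for the False case to anything you want; this is only an example. Note that mixing numbers with '' makes the resulting column hold strings, so np.nan may be a better filler if you need to keep it numeric.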
Tell us if it worked!
Let's try two possible solutions:
The first solution is to sort the index first.
df1.sort_index(inplace=True)
df2.sort_index(inplace=True)
Perform a simple test to see if it works!
df1 == df2
This may still raise an error. If that happens, try sorting the columns as well:
df1.sort_index(inplace=True, axis=1)
df2.sort_index(inplace=True, axis=1)
The second solution is to drop the indexes and reset them:
df1.reset_index(drop=True, inplace=True)
df2.reset_index(drop=True, inplace=True)
Perform a simple test to see if it works!
df1 == df2
See if it works and tell us the result.
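If the two frames carry the same (Month, Day) labels but in a different row order, another option is to align one to the other with reindex before comparing. A minimal sketch with made-up data; the helper by_month_day and the variable other are hypothetical names, not from the question:

```python
import numpy as np
import pandas as pd

def by_month_day(dates, values):
    """Build a frame indexed by (Month, Day), like the frames in the question."""
    d = pd.to_datetime(pd.Series(dates))
    idx = pd.MultiIndex.from_arrays([d.dt.month, d.dt.day], names=['Month', 'Day'])
    return pd.DataFrame({'Values': values}, index=idx)

df1 = by_month_day(['2015-01-01', '2015-01-02', '2015-01-03'], [-5.6, 8.0, 0.0])
# Same days as df1, deliberately in a different row order.
df2 = by_month_day(['2005-01-03', '2005-01-01', '2005-01-02'], [6.7, -13.3, -12.2])

# Reorder df2's values to follow df1's labels so the comparison is row for row.
other = df2['Values'].reindex(df1.index)

# Keep df1's value where it is larger; NaN elsewhere keeps the column numeric.
df1['New'] = np.where(df1['Values'] > other, df1['Values'], np.nan)
print(df1)
```

This assumes each (Month, Day) pair appears only once per frame; with several years in df2 you would have to reduce it to one value per day first.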

pandas dataframe manipulation without using loop

Please find the input and output below. For each store id and period id, 11 items should be present; if any item is missing, add it and fill that row with 0,
without using a loop.
Any help is highly appreciated.
input
Expected Output
You can do this:
Sample df:
df = pd.DataFrame({'store_id':[1160962,1160962,1160962,1160962,1160962,1160962,1160962,1160962,1160962,1160962, 1160962],
'period_id':[1025,1025,1025,1025,1025,1025,1026,1026,1026,1026,1026],
'item_x':[1,4,5,6,7,8,1,2,5,6,7],
'z':[1,4,5,6,7,8,1,2,5,6,7]})
Solution:
num = range(1,12)
def f(x):
    return x.reindex(num, fill_value=0)\
            .assign(store_id=x['store_id'].mode()[0], period_id = x['period_id'].mode()[0])
df.set_index('item_x').groupby(['store_id','period_id'], group_keys=False).apply(f).reset_index()
You can do:
from itertools import product
pdindex=product(df.groupby(["store_id", "period_id"]).groups, range(1,12))
pdindex=pd.MultiIndex.from_tuples(map(lambda x: (*x[0], x[1]), pdindex), names=["store_id", "period_id", "Item"])
df=df.set_index(["store_id", "period_id", "Item"])
res=pd.DataFrame(index=pdindex, columns=df.columns)
res.loc[df.index, df.columns]=df
res=res.fillna(0).reset_index()
Note that this will work only if you don't have any Item outside the range [1, 11].
This is a simplification of @GrzegorzSkibinski's correct answer.
This answer does not modify the original DataFrame. It uses fewer variables to store intermediate data structures and employs a list comprehension in place of the map call.
I'm also using reindex() rather than creating a new DataFrame using the generated index and populating it with the original data.
import pandas as pd
import itertools
df.set_index(
["store_id", "period_id", "Item_x"]
).reindex(
pd.MultiIndex.from_tuples([
group + (item,)
for group, item in itertools.product(
df.groupby(["store_id", "period_id"]).groups,
range(1, 12),
)],
names=["store_id", "period_id", "Item_x"]
),
fill_value=0,
).reset_index()
In testing, output matched what you listed as expected.
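For comparison, the same filling can be sketched with merges instead of reindex. This is a different technique from the answers above, not a restatement of them: it assumes pandas 1.2 or later for how="cross", and it uses a small made-up frame with a column named item and only 3 expected items to keep the output short:

```python
import pandas as pd

df = pd.DataFrame({
    'store_id': [1, 1, 1, 1],
    'period_id': [10, 10, 11, 11],
    'item': [1, 3, 2, 3],
    'z': [5, 6, 7, 8],
})
keys = ['store_id', 'period_id']
all_items = pd.DataFrame({'item': range(1, 4)})  # use range(1, 12) for 11 items

# Every existing (store, period) pair crossed with every expected item.
pairs = df[keys].drop_duplicates()
full = pairs.merge(all_items, how='cross')

# Left-join the real rows; items that were missing get NaN, then 0.
out = (
    full.merge(df, on=keys + ['item'], how='left')
        .fillna({'z': 0})
        .astype({'z': int})
        .sort_values(keys + ['item'])
        .reset_index(drop=True)
)
print(out)
```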

Python - Filtering Pandas Timestamp Index

Given Timestamp indices with many per day, how can I get a list containing only the last Timestamp of a day? So in case I have such:
import pandas as pd
all = [pd.Timestamp('2016-05-01 10:23:45'),
pd.Timestamp('2016-05-01 18:56:34'),
pd.Timestamp('2016-05-01 23:56:37'),
pd.Timestamp('2016-05-02 03:54:24'),
pd.Timestamp('2016-05-02 14:32:45'),
pd.Timestamp('2016-05-02 15:38:55')]
I would like to get:
# End of Day:
EoD = [pd.Timestamp('2016-05-01 23:56:37'),
pd.Timestamp('2016-05-02 15:38:55')]
Thx in advance!
Try pandas groupby:
all = pd.Series(all)
all.groupby([all.dt.year, all.dt.month, all.dt.day]).max()
You get
2016  5  1   2016-05-01 23:56:37
         2   2016-05-02 15:38:55
I've created an example dataframe.
import pandas as pd
all = [pd.Timestamp('2016-05-01 10:23:45'),
pd.Timestamp('2016-05-01 18:56:34'),
pd.Timestamp('2016-05-01 23:56:37'),
pd.Timestamp('2016-05-02 03:54:24'),
pd.Timestamp('2016-05-02 14:32:45'),
pd.Timestamp('2016-05-02 15:38:55')]
df = pd.DataFrame({'values':0}, index = all)
Assuming your DataFrame is structured like the example and, most importantly, is sorted by index, the code below should help:
for date in set(df.index.date):
    print(df[df.index.date == date].iloc[-1,:])
For each unique date in your DataFrame, this code returns the last row of that date's slice, so when the index is sorted it returns your last record for the day. And hey, it's pythonic. (I believe so at least)
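To get exactly the list asked for in the question, without a loop and without relying on sort order, one possibility is to group by the calendar day and take the maximum. A short sketch; stamps and eod are my own names, chosen to avoid shadowing the built-in all:

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime([
    '2016-05-01 10:23:45', '2016-05-01 18:56:34', '2016-05-01 23:56:37',
    '2016-05-02 03:54:24', '2016-05-02 14:32:45', '2016-05-02 15:38:55',
]))

# normalize() sets the time part to midnight, giving one key per day;
# max() then picks the latest timestamp within each day.
eod = stamps.groupby(stamps.dt.normalize()).max().tolist()
print(eod)
```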

Pandas iterate over group with SeriesGroupBy objects to Burst Data in MatPlotLib

I am attempting to iterate through an index of a grouped-by dataframe (see comments in code).
import pandas as pd
import numpy as np
df = pd.DataFrame({'COL1':['A','A','B','B'], 'COL2': [1,1,2,2,], 'COL3': [2,3,4,6]})
#Here, I'm creating a copy of 'COL2' so that it's still a column after I assign it to
#the index.
#I'm guessing there's a better way to do this in the groupby line.
df['COL2_copy'] = df['COL2']
df = df.groupby(['COL2_copy'], as_index=True)
#This will actually be a more complex function (creating a chart in MatPlotLib based on the
#data frame)
#slice(group) per iteration through the index ('COL2')).
#I'll just keep it simple for now.
def pandfun(df):
    #Here's the real issue:
    df['COL4'] = np.trunc(np.round(df['COL3']*100,0))
    return df['COL4']
pandfun(df)
TypeError: unsupported operand type(s) for *: 'SeriesGroupBy' and 'int'
The desired results are: 200 and 300 for the first group and 400 and 600 for the second group. To summarize, what I believe to be the main problem here is that I want to select individual groups of rows by index (i.e. 'COL2 == 1') and, within each group, refer to individual rows for a calculation.
I am taking this approach because I'll actually be using this with a MatPlotLib function that I created and I want to "burst" the data into one chart for each group in the dataframe, where each chart refers to individual row data for a given group.
I did this instead:
1. Get a unique list of values from COL2 and create a copy of df:
ulist = df['COL2'].unique()
df2 = df.copy()
2. Iterate over that list, selecting the rows where COL2 matches:
for i in ulist:
    df = df2.loc[df2['COL2']==i]
3. Within each iteration, apply the function.
4. Within the MatPlotLib function, enter the following code at the top (reset_index returns a new DataFrame, so assign the result):
df = df.reset_index()
This served to reset the selection of rows after each iteration.
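For what it's worth, a groupby object can also be iterated directly, which yields each key together with its rows as an ordinary DataFrame. A sketch of that pattern, with pandfun standing in for the real charting function and results being a name I made up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'COL1': ['A', 'A', 'B', 'B'],
                   'COL2': [1, 1, 2, 2],
                   'COL3': [2, 3, 4, 6]})

def pandfun(group):
    """Stand-in for the charting function: works on one group's rows."""
    group = group.copy()  # avoid modifying a slice of the original frame
    group['COL4'] = np.trunc(np.round(group['COL3'] * 100, 0))
    return group['COL4']

# Each iteration gives the group key and a regular DataFrame for that group,
# so COL2 stays available as a column and row-wise arithmetic works.
results = {key: pandfun(group).tolist() for key, group in df.groupby('COL2')}
print(results)
```

Inside the loop you could call the plotting function once per group instead of collecting values.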
