How to group by and sum based on values in another column - pandas-groupby

I have a dataframe for example:
Date        Amount   Type
1/2/2011    200      S
1/2/2011    300      R
1/3/2011    400      S
1/3/2011    300      S
I need :
Date        Result   Flow
1/2/2011    S-R      -100
1/3/2011    S+S      700
Wherever R occurs I need to subtract the amount for that date, and wherever S occurs I need to add it.
I tried: df.groupby([df.Date.values, df.Type])["Amount"]
But this didn’t give me what I want. Any help would be appreciated.

Convert all R amounts to negative values:
df['Amount'] = [row['Amount'] if row['Type'] == 'S' else -row['Amount'] for _, row in df.iterrows()]
Group by 'Date' and sum:
df2 = df.groupby(df['Date']).sum()
Result:
            Amount
Date
2011-01-02    -100
2011-01-03     700
If you want to add columns or change names you can do so from there.
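For reference, here is a vectorized sketch that also builds the Result column from the question. The '+'.join/replace trick for the label is just one illustrative way to get S-R / S+S, and it assumes R never comes first within a date group:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Date': ['1/2/2011', '1/2/2011', '1/3/2011', '1/3/2011'],
                   'Amount': [200, 300, 400, 300],
                   'Type': ['S', 'R', 'S', 'S']})

# Signed amount: S adds, R subtracts.
signed = np.where(df['Type'] == 'S', df['Amount'], -df['Amount'])

out = (df.assign(signed=signed)
         .groupby('Date')
         .agg(Result=('Type', lambda t: '+'.join(t).replace('+R', '-R')),
              Flow=('signed', 'sum'))
         .reset_index())
print(out)
#        Date Result  Flow
# 0  1/2/2011    S-R  -100
# 1  1/3/2011    S+S   700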

Related

get rows by date regardless of format of date in pandas

I have data as follows:
Col1,ColDate
a,2020-09-11 08:43:00
b,2020-09-12 09:43:00
c,13-09-2020 09:43:00
d,09/16/2020 10:43:00
e,09/19/2020 12:43:00
f,09/12/2020 15:43:00
The intention is to get all rows between 11 Sep and 13 Sep, regardless of the format, in pandas.
I am trying the following:
df[df["ColDate"].between('11-09-2020','13-09-2020')]
I get an empty dataframe.
You can try this:
df[pd.to_datetime(df['ColDate']).dt.strftime('%d-%m-%Y').between('11-09-2020','13-09-2020')]
Col1 ColDate
0 a 2020-09-11 08:43:00
1 b 2020-09-12 09:43:00
2 c 13-09-2020 09:43:00
5 f 09/12/2020 15:43:00
but it's really hard to say which part will be treated as the month and which as the day, because the date formats are jumbled.
Please check the snippet below. You can first convert ColDate with pd.to_datetime and then apply a mask over it, like this:
df['ColDate'] = pd.to_datetime(df['ColDate'])
mask = (df['ColDate'] > '2020-09-11') & (df['ColDate'] <= '2020-09-13')
df = df.loc[mask]
Output
Col1 ColDate
0 a 2020-09-11 08:43:00
1 b 2020-09-12 09:43:00
5 f 2020-09-12 15:43:00
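Note that recent pandas versions (2.0+) refuse to guess element-by-element formats unless asked; here is a minimal sketch using format='mixed' (the dayfirst=False choice is an assumption about which slash-dates are month-first, and the ambiguity mentioned in the first answer still applies):

import pandas as pd

df = pd.DataFrame({'Col1': list('abcdef'),
                   'ColDate': ['2020-09-11 08:43:00', '2020-09-12 09:43:00',
                               '13-09-2020 09:43:00', '09/16/2020 10:43:00',
                               '09/19/2020 12:43:00', '09/12/2020 15:43:00']})

# pandas >= 2.0: format='mixed' parses each element individually.
parsed = pd.to_datetime(df['ColDate'], format='mixed', dayfirst=False)

# between() includes both endpoints, but '2020-09-13' means midnight,
# so the 09:43 row on the 13th is excluded here.
print(df[parsed.between('2020-09-11', '2020-09-13')])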

calculate percentage of occurrences in column pandas

I have a column with thousands of rows and I want to select the top significant ones: say, all the rows that together represent 90% of my sample. How would I do that?
I have a dataframe with 2 columns, one for product_id and one showing whether it was purchased or not (value is 0 or 1):
product_id purchased
a 1
b 0
c 0
d 1
a 1
. .
. .
with df['product_id'].value_counts() I can have all my product-ids ranked by number of occurrences.
Let's say now I want to get the number of product_ids I should consider in my future analysis to cover 90% of the total occurrences.
Is there a way to do that?
If you want all product_id values whose cumulative share of counts stays under 0.9, use:
s = df['product_id'].value_counts(normalize=True).cumsum()
df1 = df[df['product_id'].isin(s.index[s < 0.9])]
Or, if you want all rows sorted by count and then the first 90% of them:
s1 = df['product_id'].map(df['product_id'].value_counts()).sort_values(ascending=False)
df2 = df.loc[s1.index[:int(len(df) * 0.9)]]
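A minimal end-to-end sketch of the first approach, on toy data so the cumulative shares are easy to check by hand:

import pandas as pd

df = pd.DataFrame({'product_id': ['a']*5 + ['b']*3 + ['c'] + ['d'],
                   'purchased':  [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]})

# Share of rows per product, most frequent first, accumulated.
s = df['product_id'].value_counts(normalize=True).cumsum()
# a 0.5, b 0.8, c 0.9, d 1.0

# Keep products whose cumulative share stays under 0.9: a and b.
df1 = df[df['product_id'].isin(s.index[s < 0.9])]
print(df1['product_id'].unique())   # ['a' 'b']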

How to apply formula in pandas

I am trying to apply a formula to a column but am not able to.
I have data in dataframe:
Date 2018-04-16 00:00:00
Quantity 8317.000
Total Value (Lacs) 259962.50
I want to apply a formula in Total Value (Lacs) column
The formula is: [Total Value (Lacs) multiplied by 100000] divided by [Quantity (000's) multiplied by 100], using pandas.
I have tried something
a = df['Total Value (Lacs)']
b = df['Quantity']
c = (a * 100000 / b * 100)
print (c)
or
df['Price'] = ((df['Total Value (Lacs)']) * 100000 / (df['Quantity']) * 100)
print (df)
error:
TypeError: unsupported operand type(s) for /: 'str' and 'str'
Edit
I have tried below code:
df['Price'] = float((float(df['Total Value (Lacs)'])) * 100000 / float((df['Quantity'])) * 100)
but getting the wrong value
price 312567632.6
expecting
price 31256.76326
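The factor-of-10,000 gap between these two numbers is operator precedence, not a conversion problem: * and / evaluate left to right, so a * 100000 / b * 100 means (a * 100000 / b) * 100. A sketch of the corrected expression:

# Parenthesize the divisor so the 100 divides rather than multiplies.
a = 259962.50   # Total Value (Lacs)
b = 8317.000    # Quantity
price = (a * 100000) / (b * 100)
print(price)    # 31256.76326...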
Edit 1
The TypeError means that you've tried to apply the / operator to two strings. That operator isn't defined for the str type in Python, so you should convert your data to some numeric type, float in your case.
I didn't understand exactly what your data looks like. But if it's like this:
df
Out:
Date Quantity Total Value (Lacs)
2018-04-16 00:00:00 8317.000 259962.50
2018-04-17 00:00:00 7823.000 234004.50
You can convert all the columns to the correct numeric type (I suppose the Date column is the index column):
df_float = df.apply(pd.to_numeric)
df_float.dtypes
Out:
Quantity float64
Total Value (Lacs) float64
dtype: object
After all, you can just deal with columns:
df['Price'] = (df_float['Total Value (Lacs)'] * 100000
               / df_float['Quantity'] * 100)
df['Price']
Out:
2018-04-16 00:00:00 319930.7592441217
2018-04-17 00:00:00 334309.8102814262
Another approach is to define a function and apply it to each row with pd.DataFrame.apply:
def get_price(row):
    try:
        price = (float(row['Total Value (Lacs)']) * 100000
                 / float(row['Quantity']) * 100)
    except (TypeError, ValueError):  # Bad data in this row; can't convert to float
        price = None
    return price
df['Price'] = df.apply(get_price, axis=1)
df['Price']
Out:
2018-04-16 00:00:00 319930.7592441217
2018-04-17 00:00:00 334309.8102814262
axis=1 means "apply to each row".
If you have transposed data, as in your example, you should transpose it back, or apply the function to each column using axis=0.
Edit 2:
Looks like your data is just a single column with dtype pd.Series. So if you select a row with data['Quantity'], you'll get something like '8317.000' of type str, which supports no arithmetic and has no apply method. In that case you may act this way:
index_to_convert = ['Quantity', 'Total Value (Lacs)']
data[index_to_convert] = pd.to_numeric(data[index_to_convert])
so only the numeric entries are converted. Then just apply the formula:
data['Price'] = (data['Total Value (Lacs)'] * 100000
                 / data['Quantity'] * 100)
data
Out:
Date 2018-04-16 00:00:00
Quantity 8317
Total Value (Lacs) 259962
Price 3.12568e+08
But in most cases this solution is not so handy; I strongly advise converting your data to a DataFrame and working with that, because a DataFrame gives you more flexibility and capabilities.
Converting process:
df = data.to_frame().T.set_index('Date')
There are three consecutive actions:
Convert your data into a DataFrame
Transpose it (so the entries become columns)
Set 'Date' as the index column
Results:
df
Out:
Quantity Total Value (Lacs)
Date
2018-04-16 00:00:00 8317.00 259962.50
After the previous steps you can apply the Edit 1 code to your data. It also works when there is more than one row of data.
One more case:
If your data has more than one value for each index entry, i.e. multiple quantities etc.:
data
Out:
Date 2018-04-16 00:00:00
Quantity 8317.00
Total Value (Lacs) 259962.50
Date 2018-04-17 00:00:00
Quantity 6434.00
Total Value (Lacs) 230002.50
You also can convert it into pd.DataFrame, step-by-step.
Group your data by the index entries and aggregate each group into a list:
data.groupby(level=0).apply(list)
Out:
Date [2018-04-16 00:00:00, 2018-04-17 00:00:00]
Quantity [8317.00, 6434.00]
Total Value (Lacs) [259962.50, 230002.50]
Then apply pd.Series to each row:
data.groupby(level=0).apply(list).apply(pd.Series)
Out: 0 1
Date 2018-04-16 00:00:00 2018-04-17 00:00:00
Quantity 8317.00 6434.00
Total Value (Lacs) 259962.50 230002.50
Transpose the returned DataFrame and set the 'Date' column as the index:
data.groupby(level=0).apply(list).apply(pd.Series).T.set_index('Date')
Out:
Quantity Total Value (Lacs)
Date
2018-04-16 00:00:00 8317.00 259962.50
2018-04-17 00:00:00 6434.00 230002.50
Apply the solution from Edit 1.
Hope it helps!
You are getting this error because the data extracted from the dataframe are strings, as shown in your error; you will need to convert the strings into floats.
Convert your dataframe to values instead of strings. You can achieve that with:
values = df.values
Then you can extract the values from this array.
Alternatively, after extracting data from the dataframe, convert it to float using:
b=float(df['Quantity'])
Use this (note the parentheses around the divisor, per the formula):
df['price'] = (df['Total Value (Lacs)'].apply(pd.to_numeric) * 100000) / (df['Quantity'].apply(pd.to_numeric) * 100)
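Putting the pieces together, a sketch assuming the single-Series layout from the question, with the correctly parenthesized formula:

import pandas as pd

data = pd.Series({'Date': '2018-04-16 00:00:00',
                  'Quantity': '8317.000',
                  'Total Value (Lacs)': '259962.50'})

# Convert only the numeric entries, leaving Date as a string.
idx = ['Quantity', 'Total Value (Lacs)']
data[idx] = pd.to_numeric(data[idx])

# Parentheses around the divisor (see the precedence note above).
data['Price'] = (data['Total Value (Lacs)'] * 100000) / (data['Quantity'] * 100)
print(data['Price'])   # 31256.76326...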

Slicing specific rows of a column in pandas Dataframe

In the following data frame in pandas, I want to extract the rows with dates between '03/01' and '06/01'. I don't want to use the index at all, as my input would be a start and an end date. How could I do so?
A B
0 01/01 56
1 02/01 54
2 03/01 66
3 04/01 77
4 05/01 66
5 06/01 72
6 07/01 132
7 08/01 127
First create a list of the dates you need using pd.date_range. I'm adding the year 2000 since you need to supply a year for this to work; I'm then cutting it off to get the desired strings. In real life you might want to pay attention to the actual year, due to things like leap days.
date_start = '03/01'
date_end = '06/01'
dates = [x.strftime('%m/%d') for x in pd.date_range('2000/{}'.format(date_start),
                                                    '2000/{}'.format(date_end), freq='D')]
dates is now equal to:
['03/01',
'03/02',
'03/03',
'03/04',
.....
'05/29',
'05/30',
'05/31',
'06/01']
Then simply use the isin method and you are done:
df = df.loc[df.A.isin(dates)]
df
If your column is a datetime column, I guess you can skip the strftime part in the list comprehension and still get the right result.
You are welcome to use boolean masking, i.e.:
df[(df.A >= start_date) & (df.A <= end_date)]
Inside the brackets is a boolean array of True and False values; only rows that fulfill the given condition (evaluate to True) are returned. This is a great tool to have, and it works well with pandas and numpy.
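Since column A holds zero-padded 'MM/DD' strings, lexicographic comparison happens to match chronological order within a single year, so the mask works on the raw strings; a quick sketch:

import pandas as pd

df = pd.DataFrame({'A': ['01/01', '02/01', '03/01', '04/01',
                         '05/01', '06/01', '07/01', '08/01'],
                   'B': [56, 54, 66, 77, 66, 72, 132, 127]})

start_date, end_date = '03/01', '06/01'

# Plain string comparison: '03/01' <= 'MM/DD' <= '06/01' sorts the
# same way the dates do because both parts are zero-padded.
print(df[(df.A >= start_date) & (df.A <= end_date)])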

How can I count categorical columns by month in Pandas?

I have time series data with a column which can take a value A, B, or C.
An example of my data looks like this:
date,category
2017-01-01,A
2017-01-15,B
2017-01-20,A
2017-02-02,C
2017-02-03,A
2017-02-05,C
2017-02-08,C
I want to group my data by month and store both the sum of the count of A and the count of B in column a_or_b_count and the count of C in c_count.
I've tried several things, but the closest I've been able to do is to preprocess the data with the following function:
def preprocess(df):
# Strip a possible BOM from the date strings, then parse them as dates.
df['date'] = pd.to_datetime(df['date'].apply(lambda t: t.replace('\ufeff', '')), format="%Y-%m-%d")
# Set the date column as the index, then drop the now-redundant column in place.
df = df.set_index(df.date)
df.drop('date', inplace=True, axis=1)
# Group all events by (year, month) and count category by values.
counted_events = df.groupby([(df.index.year), (df.index.month)], as_index=True).category.value_counts()
counted_events.index.names = ["year", "month", "category"]
return counted_events
which gives me the following:
year month category
2017 1 A 2
B 1
2 C 3
A 1
The process to sum up all A's and B's would be quite manual since category becomes a part of the index in this case.
I'm an absolute pandas menace, so I'm likely making this much harder than it actually is. Can anyone give tips for how to achieve this grouping in pandas?
I tried this, so I'm posting it, though I like @Scott Boston's solution better; I combined the A and B values earlier.
df.date = pd.to_datetime(df.date, format = '%Y-%m-%d')
df.loc[(df.category == 'A')|(df.category == 'B'), 'category'] = 'AB'
new_df = df.groupby([df.date.dt.year,df.date.dt.month]).category.value_counts().unstack().fillna(0)
new_df.columns = ['a_or_b_count', 'c_count']
new_df.index.names = ['Year', 'Month']
a_or_b_count c_count
Year Month
2017 1 3.0 0.0
2 1.0 3.0
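For reference, a sketch of an alternative that skips the manual column renaming by mapping categories to the desired column names before counting (assuming the same CSV layout as the question):

import pandas as pd
from io import StringIO

csv = '''date,category
2017-01-01,A
2017-01-15,B
2017-01-20,A
2017-02-02,C
2017-02-03,A
2017-02-05,C
2017-02-08,C'''

df = pd.read_csv(StringIO(csv), parse_dates=['date'])

# Collapse A and B into one bucket, then count per (year, month).
bucket = df['category'].map({'A': 'a_or_b_count', 'B': 'a_or_b_count', 'C': 'c_count'})
out = (df.assign(bucket=bucket)
         .groupby([df.date.dt.year.rename('Year'), df.date.dt.month.rename('Month')])['bucket']
         .value_counts()
         .unstack(fill_value=0))
print(out)
#             a_or_b_count  c_count
# Year Month
# 2017 1                 3        0
#      2                 1        3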
