Pandas groupby dataframe then return single value result (sum, total) - python-3.x

Dears,
Please help me, I am stuck. I guess it should not be difficult, but I feel overwhelmed.
I need to do an ageing of receivables, so they must be separated into different buckets.
Suppose we have only 3 groups: current, above_10Days and above_20Days and the following table:
d = {'Cust': ['Dfg', 'Ers', 'Dac', 'Vds', 'Mhf', 'Kld', 'Xsd', 'Hun'],
     'Amount': [10000, 100000, 4000, 5411, 756000, 524058, 4444785, 54788],
     'Days': [150, 21, 30, 231, 48, 15, -4, -14]}
I need to group the amounts into a total sum per ageing group.
Example:
Current: 4499573, etc.
For that purpose, I tried to group the receivables with such code:
above_10Days = df.groupby((df['Days'] > 0) & (df['Days'] <= 10))
above10sum = above_10Days.Amount.sum().iloc[1]
It works perfectly, but only when there are actual amounts in that group. When there is no such A/R it throws an exception and stops executing. I tried using a function, and tried turning the 'None' value into 0, but with no success.
Hopefully someone could know the solution.
Thanks in advance

IIUC:
import numpy as np
import pandas as pd

d = {'Cust': ['Dfg', 'Ers', 'Dac', 'Vds', 'Mhf', 'Kld', 'Xsd', 'Hun'],
     'Amount': [10000, 100000, 4000, 5411, 756000, 524058, 4444785, 54788],
     'Days': [150, 21, 30, 231, 48, 15, -4, -14]}
df = pd.DataFrame(d)
#Updated to assign to output dataframe
df_out = (df.groupby(pd.cut(df.Days,
                            [-np.inf, 10, 20, np.inf],
                            labels=['Current', 'Above 10 Days', 'Above 20 Days']))['Amount']
          .sum())
Output:
Days
Current 4499573
Above 10 Days 524058
Above 20 Days 875411
Name: Amount, dtype: int64
Variable assignment using .loc:
varCurrent = df_out.loc['Current']
var10 = df_out.loc['Above 10 Days']
var20 = df_out.loc['Above 20 Days']
print(varCurrent,var10,var20)
Output:
4499573 524058 875411
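Note: because pd.cut makes the grouping key categorical, empty buckets should still appear in df_out, which avoids the exception from the original approach. To be defensive about it (behaviour around empty groups has shifted between pandas versions), a reindex over the label list forces every bucket to show up with a 0 total. A minimal sketch, reusing df_out from above:
labels = ['Current', 'Above 10 Days', 'Above 20 Days']
df_out = df_out.reindex(labels, fill_value=0)  # empty buckets become 0 instead of missing/NaN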

Related

In Django, how do I write a query that filters a date column by a specific day of the week?

I'm using Python 3.9, Django 3.1, and Postgres 10. I have the following query to give me the articles created in a particular date range ...
qset = Article.objects.filter(
    created_day__gte=start_date,
    created_day__lte=end_date
)
What I would like to add is a clause counting the articles created in that date range that were also created on a specific day of the week (e.g. Monday), where days of the week are represented by integers (0 = Monday, 1 = Tuesday, ... 6 = Sunday). How do I add a clause that also filters by the day of the week?
After the block of code presented in the question, you can further filter qset by doing the following:
monday_qset = qset.filter(created_day__week_day=2)
The week_day lookup counts days of the week from 1 (Sunday) to 7 (Saturday), so Monday is 2.
You can find more details about this lookup in the Django docs.
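Since the question numbers days with Python's 0 = Monday ... 6 = Sunday convention while Django's week_day lookup runs from 1 (Sunday) to 7 (Saturday), a small conversion helper may save confusion. A sketch; the helper name is mine, not part of Django:
def to_django_week_day(python_weekday):
    # python_weekday: 0=Monday .. 6=Sunday (datetime.weekday() convention)
    # returns: 1=Sunday .. 7=Saturday (Django's week_day convention)
    return (python_weekday + 2) % 7 or 7

monday_qset = qset.filter(created_day__week_day=to_django_week_day(0))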
We can use ExtractWeekDay() (see the Django docs) and Count():
from django.db.models.functions import ExtractWeekDay
from django.db.models import Count
qset = Article.objects.filter(
    created_day__gte=start_date,
    created_day__lte=end_date
).annotate(
    week_day=ExtractWeekDay("created_day")
).values(
    "week_day"
).annotate(
    count=Count("pk")
).order_by("week_day")
The output will be something like this:
<QuerySet [
{'count': 1098, 'week_day': 2},
{'count': 55, 'week_day': 3},
{'count': 29, 'week_day': 4},
{'count': 41, 'week_day': 5},
{'count': 25, 'week_day': 6}
]>
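If you only need the count for one specific day (e.g. Monday) rather than the full breakdown, you can also filter on the annotated value instead of grouping; a sketch reusing the imports and names above:
monday_count = Article.objects.filter(
    created_day__gte=start_date,
    created_day__lte=end_date
).annotate(
    week_day=ExtractWeekDay("created_day")
).filter(week_day=2).count()  # 2 = Monday in the 1 (Sunday) .. 7 (Saturday) numbering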

Structural Question Regarding pandas .drop method

df2=df.drop(df[df['issue']=="prob"].index)
df2.head()
The code immediately above works fine.
But why is there a need to type df[df[ rather than the below?
df2 = df.drop((df['issue'] == "prob").index)
df2.head()
I know that the version immediately above won't work while the former does. I would like to understand why, or know what exactly I should google.
Also ~ any advice on a more relevant title would be appreciated.
Thanks!
Option 1: df[df['issue']=="prob"] produces a DataFrame containing only the matching rows.
Option 2: df['issue']=="prob" produces a pandas.Series with a Boolean for every row of df.
.drop works with Option 1 because its .index holds only the labels of the selected rows, whereas Option 2's .index is the full index of the original frame, so dropping it removes everything.
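You can see the difference in miniature before the fuller demonstration below; a tiny hypothetical frame of my own, not the asker's data:
import pandas as pd

df = pd.DataFrame({'issue': ['prob', 'ok', 'prob']})
print(df[df['issue'] == "prob"].index)  # Index([0, 2]) - only the matching rows
print((df['issue'] == "prob").index)    # RangeIndex(start=0, stop=3, step=1) - every row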
I would use the following methods to remove rows.
Use ~ (not) to select the opposite of the Boolean selection.
df = df[~(df.treatment == 'Yes')]
Select rows with only the desired value
df = df[(df.treatment == 'No')]
import random
from datetime import datetime

import numpy as np
import pandas as pd

# sample dataframe
np.random.seed(365)
random.seed(365)
rows = 25
data = {'a': np.random.randint(10, size=(rows)),
        'groups': [random.choice(['1-5', '6-25', '26-100', '100-500', '500-1000', '>1000']) for _ in range(rows)],
        'treatment': [random.choice(['Yes', 'No']) for _ in range(rows)],
        'date': pd.bdate_range(datetime.today(), freq='d', periods=rows).tolist()}
df = pd.DataFrame(data)
df[df.treatment == 'Yes'].index produces just the indices where treatment is 'Yes', therefore df.drop(df[df.treatment == 'Yes'].index) only drops those rows:
df[df.treatment == 'Yes'].index
[out]:
Int64Index([0, 1, 2, 4, 6, 7, 8, 11, 12, 13, 14, 15, 19, 21], dtype='int64')
df.drop(df[df.treatment == 'Yes'].index)
[out]:
   a   groups treatment       date
 3 5     6-25        No 2020-08-15
 5 2 500-1000        No 2020-08-17
 9 0 500-1000        No 2020-08-21
10 3  100-500        No 2020-08-22
16 8      1-5        No 2020-08-28
17 4      1-5        No 2020-08-29
18 3      1-5        No 2020-08-30
20 6 500-1000        No 2020-09-01
22 6     6-25        No 2020-09-03
23 8  100-500        No 2020-09-04
24 9   26-100        No 2020-09-05
(df.treatment == 'Yes').index produces all of the indices (a Boolean Series keeps the full index of df), therefore df.drop((df.treatment == 'Yes').index) drops every row, leaving an empty dataframe:
(df.treatment == 'Yes').index
[out]:
RangeIndex(start=0, stop=25, step=1)
df.drop((df.treatment == 'Yes').index)
[out]:
Empty DataFrame
Columns: [a, groups, treatment, date]
Index: []

Retrieve one value from default dict on python 3

I have a dict which I populate with data using the setdefault method as follows:
if date not in call_dict:
    call_dict.setdefault(date, [0, 0]).append(0)
else:
    call_dict[date][0] += 1
    if x[12] != 'ANSWERED':
        call_dict[date][1] += 1
    call_dict[date][2] = 100 * (call_dict[date][1] / call_dict[date][0])
At the end of this process I have a dict which is structured like this:
{'key': [value0, value1, value2]}
Then I have to plot only key and value2 (as value2 is a function of the key) but I can't find the way to access this value.
I tried myDict.values() but it did not work (as expected). I would really like to avoid creating another list/dict/anything for performance reasons, and even if I did, I would still have to reach for value2 inside it, which is the part I cannot solve.
Any ideas how to solve this problem?
Sample values of dict:
{'08:23': [45, 17, 37.77777777777778],
 '08:24': [44, 15, 34.090909090909086],
 '08:25': [46, 24, 52.17391304347826],
 '08:48': [49, 19, 38.775510204081634]}
You can get them from the dictionary with a list comprehension:
data = { "2018-07-01": [45, 17, 37.77777777777778],
"2018-07-02": [44, 15, 34.090909090909086],
"2018-07-03": [46, 24, 52.17391304347826],
"2018-07-04": [49, 19, 38.775510204081634]}
xy = [(x,data[x][2]) for x in data.keys()] # extract tuples of (date-key, 3rd value)
print(xy)
Output:
[('2018-07-01', 37.77777777777778), ('2018-07-02', 34.090909090909086),
('2018-07-03', 52.17391304347826), ('2018-07-04', 38.775510204081634)]
If you need them for plotting you might want to do:
x,y = zip(*xy)
print(x)
print(y)
Output:
('2018-07-01', '2018-07-02', '2018-07-03', '2018-07-04') # x
(37.77777777777778, 34.090909090909086, 52.17391304347826, 38.775510204081634) # y
and supply those to your plotting library as x and y data.
Docs: zip(*iterables)
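If the tuples are never needed on their own, the keys and third values can also be pulled out directly, skipping the intermediate pairing; a minimal sketch over the same data dict:
x = list(data)                     # the keys, in insertion order
y = [v[2] for v in data.values()]  # the third element of each value list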

Data frame group by and find a value in a new data frame

I have a data frame and I have to see whether there exists a marketplace whose maximum (most recent) date entry is last_saturday.
import pandas as pd

data = {
    'marketplace': [3, 3, 4, 4, 5, 3, 4],
    'date': ['2017-11-11', '2017-11-10', '2017-11-07', '2017-11-08', '2017-11-10', '2017-11-09', '2017-11-10']
}
last_saturday = '2017-11-11'
df = pd.DataFrame(data, columns=['marketplace', 'date'])
df_sub = df.groupby(['marketplace'])['date'].max()
print(df_sub)
I get df_sub =
marketplace
3 2017-11-11
4 2017-11-10
5 2017-11-10
Name: date, dtype: object
How can I iterate through df_sub to see if the date for a marketplace matches last_saturday?
When I try to print out the dates with print(df_sub['date']) I get the following error:
File "pandas/_libs/index.pyx", line 83, in pandas._libs.index.IndexEngine.get_value
File "pandas/_libs/index.pyx", line 91, in pandas._libs.index.IndexEngine.get_value
File "pandas/_libs/index.pyx", line 141, in pandas._libs.index.IndexEngine.get_loc
KeyError: 'date'
I assume that in order to access data in df_sub I have to use iloc or loc but not sure how.
I believe you only need to compare the Series with the value. That gives a boolean mask, and any then checks whether at least one value is True:
print ((df_sub == last_saturday).any())
True
print (df_sub == last_saturday)
3 True
4 False
5 False
Name: date, dtype: bool
Or create a DataFrame first, either with the parameter as_index=False or with reset_index:
df_sub = df.groupby(['marketplace'], as_index=False)['date'].max()
#df_sub = df.groupby(['marketplace'])['date'].max().reset_index()
print(df_sub)
marketplace date
0 3 2017-11-11
1 4 2017-11-10
2 5 2017-11-10
And compare the column:
print ((df_sub['date'] == last_saturday).any())
True
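And if you also want to know which marketplaces match, not just whether any does, the same mask works for boolean indexing on the as_index=False version; a small sketch:
matching = df_sub.loc[df_sub['date'] == last_saturday, 'marketplace']
print(matching.tolist())
[3]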

Create a function that converts time to a binary response variable

I currently have an RDD where the rows have two columns:
Row(pickup_time=datetime.datetime(2014, 2, 9, 14, 51),
    dropoff_time=datetime.datetime(2014, 2, 9, 14, 58))
I want to transform these into a binary response variable where 1 will indicate night time and 0 will indicate day time.
I know that we can use UserDefinedFunction to create a function that changes these to the desired format.
For example, I have another column, a string that specifies the payment type as either 'CSH' or 'CRD', and I was able to solve that by doing this:
pay_map = {'CRD':1.0, 'CSH':0.0}
pay_bin = UserDefinedFunction(lambda z: pay_map[z], DoubleType())
df = df.withColumn('payment_type', pay_bin(df['payment_type']))
How would I apply this same logic to the question I am asking? If it helps, I am trying to transform these variables because I will be running a decision tree.
There is no need for a UDF here. You can use between and type casting:
from pyspark.sql.functions import hour

def in_range(colname, lower_bound=6, upper_bound=17):
    """
    :param colname: Input column name (str)
    :param lower_bound: Lower bound for day hour (int, 0-23)
    :param upper_bound: Upper bound for day hour (int, 0-23)
    """
    assert 0 <= lower_bound <= 23
    assert 0 <= upper_bound <= 23
    if lower_bound < upper_bound:
        # Day window does not wrap around midnight
        return hour(colname).between(lower_bound, upper_bound).cast("integer")
    else:
        # Window wraps around midnight (e.g. 18 to 5)
        return (
            (hour(colname) >= lower_bound) |
            (hour(colname) <= upper_bound)
        ).cast("integer")
Example usage:
import datetime

from pyspark.sql import Row

df = sc.parallelize([
    Row(
        pickup_time=datetime.datetime(2014, 2, 9, 14, 51),
        dropoff_time=datetime.datetime(2014, 2, 9, 14, 58)
    ),
    Row(
        pickup_time=datetime.datetime(2014, 2, 9, 19, 51),
        dropoff_time=datetime.datetime(2014, 2, 9, 1, 58)
    )
]).toDF()

(df
    .withColumn("dropoff_during_day", in_range("dropoff_time"))
    # night window: between 6pm and 5am
    .withColumn("pickup_during_night", in_range("pickup_time", 18, 5)))
+--------------------+--------------------+------------------+-------------------+
|        dropoff_time|         pickup_time|dropoff_during_day|pickup_during_night|
+--------------------+--------------------+------------------+-------------------+
|2014-02-09 14:58:...|2014-02-09 14:51:...|                 1|                  0|
|2014-02-09 01:58:...|2014-02-09 19:51:...|                 0|                  1|
+--------------------+--------------------+------------------+-------------------+
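The same flag can also be written with when/otherwise if an explicit 1.0/0.0 (matching the pay_bin style from the question) is preferred; a sketch under the same assumptions:
from pyspark.sql.functions import hour, when

# night window: pickup hour >= 18 or <= 5, same as in_range("pickup_time", 18, 5)
df = df.withColumn(
    "pickup_is_night",
    when((hour("pickup_time") >= 18) | (hour("pickup_time") <= 5), 1.0).otherwise(0.0)
)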
