I have a large data frame with transaction data. What I am trying to do is use python to aggregate the data starting with zip codes, then a year and month, finally the total number of transactions for that month.
My Df:
Date        VAR1  VAR2  ZipCode  Transactions
YYYY-MM-DD  X     Y     12345    1
So the first thing I did was convert the Date column to datetime:
df['Date'] = pd.to_datetime(df['Date'])
df.info()
# Date datetime64[ns]
Then I split the data into year-month and number of transactions:
# grouping the data by year and month
per = df.Date.dt.to_period("M")
g = df.groupby(per)
g.sum() # so now that this works, we need to break it up into zip codes
Which gives an output of:
Date     Transactions
YYYY-MM  X
YYYY-MM  Y
My question is, what am I missing to get the zip codes in front:
ZipCode  Date     Transactions
12345    YYYY-MM  sum()
Any and all help is greatly appreciated.
I believe you need to add the ZipCode column to the groupby if you need grouping per zip and per month:
per = df.Date.dt.to_period("M")
df1 = df.groupby(['ZipCode',per])['Transactions'].sum().reset_index()
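For illustration, here is a minimal self-contained sketch with made-up sample values (the zip codes and counts below are hypothetical) showing the shape of the result:
import pandas as pd
# hypothetical sample data mirroring the question's layout
df = pd.DataFrame({
    "Date": ["2020-01-05", "2020-01-20", "2020-02-03", "2020-01-10"],
    "ZipCode": [12345, 12345, 12345, 67890],
    "Transactions": [1, 2, 3, 4],
})
df["Date"] = pd.to_datetime(df["Date"])
per = df.Date.dt.to_period("M")
df1 = df.groupby(["ZipCode", per])["Transactions"].sum().reset_index()
print(df1)
#    ZipCode     Date  Transactions
# 0    12345  2020-01             3
# 1    12345  2020-02             3
# 2    67890  2020-01             4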
Suppose we have a very large table for which we'd like to compute statistics incrementally.
Date        Amount  Customer
2022-12-20  30      Mary
2022-12-21  12      Mary
2022-12-20  12      Bob
2022-12-21  15      Bob
2022-12-22  15      Alice
We'd like to be able to calculate incrementally how much we made per distinct customer for a date range. So from 12-20 to 12-22 (inclusive), we'd have 3 distinct customers, but from 12-20 to 12-21 there are 2 distinct customers.
If we want to run this pipeline once a day and there are many customers, how can we keep a rolling count of distinct customers for an arbitrary date range? Is there a way to do this without storing a huge list of customer names for each day?
We'd like to support a frontend that has a date range filter and can quickly calculate results for that date range. For example:
Start Date  End Date    Average Income Per Customer
2022-12-20  2022-12-21  (30+12+12+15)/2 = 34.5
2022-12-20  2022-12-22  (30+12+12+15+15)/3 = 28
The only approach I can think of is to store a set of customer names for each day, and when viewing the results calculate the size of the joined set of sets to calculate distinct customers. This seems inefficient. In this case we'd store the following table, with the customer column being extremely large.
Date        Total Income  Customers
2022-12-20  42            set(Mary, Bob)
2022-12-21  27            set(Mary, Bob)
2022-12-22  15            set(Alice)
For me the best solution is to do some pre-calculations for the existing data; then, for the new data that comes in every day, do the calculation only on that new data and add the results to the previously calculated data. Also partition on the date column, since we filter on dates; this triggers Spark's push-down filters and accelerates your queries.
There are two approaches: one to get the sum of amounts between two dates, and another for the distinct customers between two dates.
For the amount, use a prefix sum by adding the sum of all previous days to the last day; then, to get the total between two dates, you can just subtract those two days without looping over all the dates in between.
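For example, using the sample data above, the daily totals are 42, 27 and 15, so the running sums are 42, 69, 84. The amount from 2022-12-20 through 2022-12-22 is the running sum at 12-22 minus the running sum just before 12-20: 84 - 0 = 84 (= 30+12+12+15+15), with no need to touch the days in between.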
For distinct customers, the best approach I can think of is to save the Date and Customer columns in a new file partitioned by date, which helps optimize the queries, and then use the fast approx_count_distinct.
Here's some code:
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, sum, lag, lit, approx_count_distinct
spark = SparkSession.builder.master("local[*]").getOrCreate()
data = [
    ["2022-12-20", 30, "Mary"],
    ["2022-12-21", 12, "Mary"],
    ["2022-12-20", 12, "Bob"],
    ["2022-12-21", 15, "Bob"],
    ["2022-12-22", 15, "Alice"],
]
df = spark.createDataFrame(data).toDF("Date", "Amount", "Customer")
def init_amount_data(df):
    w = Window.orderBy(col("Date"))
    amount_sum_df = df.groupby("Date").agg(sum("Amount").alias("Amount")) \
        .withColumn("amount_sum", sum(col("Amount")).over(w)) \
        .withColumn("prev_amount_sum", lag("amount_sum", 1, 0).over(w)) \
        .select("Date", "amount_sum", "prev_amount_sum")
    amount_sum_df.write.mode("overwrite").partitionBy("Date").parquet("./path/amount_data_df")
    amount_sum_df.show(truncate=False)
# keep only customer data to avoid unnecessary data when querying; partitioning by Date makes queries faster thanks to Spark's filter push-down mechanism
def init_customers_data(df):
    df.select("Date", "Customer").write.mode("overwrite").partitionBy("Date").parquet("./path/customers_data_df")
# each day (for example at midnight), update the amount data with only yesterday's data: take the last amount_sum and add to it the amount of the last day
def update_amount_data(last_partition):
    amountDataDf = spark.read.parquet("./path/amount_data_df")
    maxDate = getMaxDate("./path/amount_data_df")  # implement a hadoop method to get the last partition date
    lastMaxPartition = amountDataDf.filter(col("Date") == maxDate)
    lastPartitionAmountSum = lastMaxPartition.select("amount_sum").first()[0]
    yesterday_amount_sum = last_partition.groupby("Date").agg(sum("Amount").alias("amount_sum"))
    newPartition = yesterday_amount_sum.withColumn("amount_sum", col("amount_sum") + lastPartitionAmountSum) \
        .withColumn("prev_amount_sum", lit(lastPartitionAmountSum))
    newPartition.write.mode("append").partitionBy("Date").parquet("./path/amount_data_df")
def update_customers_data(last_partition):
    last_partition.write.mode("append").partitionBy("Date").parquet("./path/customers_data_df")
def query_amount_date(beginDate, endDate):
    amountDataDf = spark.read.parquet("./path/amount_data_df")
    endDateAmount = amountDataDf.filter(col("Date") == endDate).select("amount_sum").first()[0]
    beginDateAmount = amountDataDf.filter(col("Date") == beginDate).select("prev_amount_sum").first()[0]
    diff_amount = endDateAmount - beginDateAmount
    return diff_amount
def query_customers_date(beginDate, endDate):
    customersDataDf = spark.read.parquet("./path/customers_data_df")
    distinct_customers_nb = customersDataDf.filter(col("Date").between(lit(beginDate), lit(endDate))) \
        .agg(approx_count_distinct("Customer").alias("distinct_customers")).first()[0]
    return distinct_customers_nb
# This should be executed the first time only
init_amount_data(df)
init_customers_data(df)
# This should be executed every day at midnight, with data of the last day only
last_day_partition = df.filter(col("Date") == yesterday_date)  # yesterday_date: the previous day's date string
update_amount_data(last_day_partition)
update_customers_data(last_day_partition)
# Optimized queries, executed with a begin and end date
beginDate = "2022-12-20"
endDate = "2022-12-22"
answer = query_amount_date(beginDate, endDate) / query_customers_date(beginDate, endDate)
print(answer)
If calculating the distinct customers is not fast enough, there's another approach: keep the same kind of pre-computed running count of all distinct customers, plus a second table of already-seen customers. Each day, if there's a new customer, increment the counter and add that customer to the second table; if not, do nothing.
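A minimal sketch of that alternative, assuming a hypothetical ./path/seen_customers_df parquet file that holds every customer counted so far (the path and function name are illustrative, not from the original answer):
def update_distinct_customer_count(last_partition):
    # customers counted so far (hypothetical table, initialised once from the historical data)
    seen = spark.read.parquet("./path/seen_customers_df")
    # keep only customers never seen before
    new_customers = last_partition.select("Customer").distinct() \
        .join(seen, on="Customer", how="left_anti")
    added = new_customers.count()
    if added > 0:
        new_customers.write.mode("append").parquet("./path/seen_customers_df")
    return added  # add this to the running distinct-customer counter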
Finally, there are some tricks for optimizing the groupBy or window functions using salting or extended partitioning.
You can achieve this by filtering rows with dates between start_date and end_date, then grouping by customer and calculating the sum of amounts, and finally taking the average of those sums. This approach works for only one start_date and end_date, so you would run this code with different parameters for different date ranges:
from pyspark.sql import functions as F
start_date = '2022-12-20'
end_date = '2022-12-21'
(
    df
    .withColumn('isInRange', F.col('date').between(start_date, end_date))
    .filter(F.col('isInRange'))
    .groupby('customer')
    .agg(F.sum('amount').alias('sum'))
    .agg(F.avg('sum').alias('avg income'))
).show()
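As a check against the expected table above: for the 2022-12-20 to 2022-12-21 range, the per-customer sums are 30+12 = 42 for Mary and 12+15 = 27 for Bob, and the average of those sums is (42+27)/2 = 34.5, matching the expected Average Income Per Customer.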
I have the following data, with one row per ID and DATE. A person with the same ID can occupy multiple rows, hence multiple dates. I want to aggregate it into one person (or ID) per row, with the dates aggregated into a list of dates.
From this
ID DATE
1 2012-03-04
1 2013-04-15
1 2019-01-09
2 2013-04-09
2 2016-01-01
2 2018-05-09
To this
ID DATE
1 [2012-03-04, 2013-04-15, 2019-01-09]
2 [2013-04-09, 2016-01-01, 2018-05-09]
Here is my attempt
df.sort_values(by=['ID', 'DATE'], ascending=True, inplace=True)
df = df[['ID', 'DATE']]
df_pivot = df.groupby('ID').aggregate(lambda tdf: tdf.unique().tolist())
df_pivot = pd.DataFrame(df_pivot.to_records())
The problem is it returns something like this
ID DATE
1 [1375228800000000000, 1411948800000000000, 1484524800000000000]
2 [1524528000000000000, 1529539200000000000, 1529542200000000000]
What kind of date format is this? I can't seem to find the right function to convert it back to the typical date format.
If you need unique values in the lists, use DataFrame.drop_duplicates before aggregating to lists:
df = (df.sort_values(by=['ID', 'DATE'], ascending=True)
        .drop_duplicates(['ID', 'DATE'])
        .groupby('ID')['DATE']
        .agg(list))
Your solution should also work, but it is slow:
df_pivot = df.groupby('ID')['DATE'].aggregate(lambda tdf: tdf.drop_duplicates().tolist())
What kind of date format is this?
Those are native datetimes, also called Unix time, in nanoseconds.
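For reference, a minimal sketch of converting such integers back to timestamps (using the values from the output above):
import pandas as pd
# nanoseconds since the Unix epoch, as shown in the question's output
ns_values = [1375228800000000000, 1411948800000000000, 1484524800000000000]
print(pd.to_datetime(ns_values))  # pd.to_datetime treats integer input as nanoseconds by default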
There are many ways; agg is preferred because apply can be very slow:
df.groupby('ID')['DATE'].agg(list)
Or
df.groupby('ID')['DATE'].apply(lambda x: x.to_list())
Simply use the groupby() and apply() methods:
result=df.groupby('ID')['DATE'].apply(list)
OR
result=df.groupby('ID')['DATE'].agg(list)
Now if you print result, you will get your desired output:
ID
1 [ 2012-03-04, 2013-04-15, 2019-01-09]
2 [ 2013-04-09, 2016-01-01, 2018-05-09]
Name: DATE, dtype: object
The above code gives you a Series. If you want a DataFrame, then use:
result=df.groupby('ID')['DATE'].apply(list).reset_index()
I am trying to calculate basic statistics using pandas. I have precipitation values for the whole year of 1956. I created a "Date" column that holds dates for the entire year using pd.date_range. Then I calculated the maximum value for the year and the date of the maximum value. The date of the maximum value shows Timestamp('1956-06-19 00:00:00') as the output. How do I extract just the date? I do not need the Timestamp or the 00:00:00 time.
#Create Date Column
year = 1956
start_date = datetime.date(year,1,1)
end_date = datetime.date(year,12,31)
precip_file["Date"] = pd.date_range(start=start_date,end=end_date,freq="D")
#Yearly maximum value and date of maximum value
yearly_max = precip_file["Precip (mm)"].max(skipna=True)
max_index = precip_file["Precip (mm)"].idxmax()
yearly_max_date = precip_file.iat[max_index, 2]
Image of output dictionary I am trying to create
May be a duplicate of this question, although I can't tell whether you are trying to convert one DateTime or a column of DateTimes.
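Either way, a minimal sketch (the variable names here are illustrative, not from the question):
import pandas as pd
ts = pd.Timestamp("1956-06-19 00:00:00")
print(ts.date())          # 1956-06-19 -> a plain datetime.date for a single Timestamp
# for a whole datetime column, use the .dt accessor
dates = pd.Series(pd.date_range("1956-01-01", periods=3, freq="D"))
print(dates.dt.date)      # datetime.date objects without the 00:00:00 time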
I have the following dataframe:
I need to resample the data to calculate the weekly pct_change(). How can I get the weekly change?
Something like data['pct_week'] = data['Adj Close'].resample('W').ffill().pct_change(), but the data needs to be grouped with data.groupby(['month', 'week']).
This way every month would yield 4 values for the weekly change, which I can then graph.
What I did was df['pct_week'] = data['Adj Close'].groupby(['week', 'day']).pct_change(), but I got this error: TypeError: 'type' object does not support item assignment
If you want grouping with resample, you first need a DatetimeIndex only, so add DataFrame.reset_index for all levels except the first, then group and resample with a custom function, because pct_change is not implemented for resample:
def percent_change(x):
    return pd.Series(x).pct_change()
Another idea is to use a numpy solution for pct_change:
def percent_change(x):
    return x / np.concatenate(([np.nan], x[:-1])) - 1
df1 = (df.reset_index(level=[1,2,3])
         .groupby(['month', 'week'])['Adj Close']
         .resample('W')
         .apply(percent_change))
this way every month would yield 4 values for weekly change
So it seems no groupby is necessary, only a downsample like sum, chained with Series.pct_change:
df2 = (df.reset_index(level=[1,2,3])
         .resample('W')['Adj Close']
         .sum()
         .pct_change())
Drop the unwanted index levels; the datetime index is enough for re-sampling / grouping:
df.index = df.index.droplevel(['month', 'week', 'day'])
Re-sample by week, select the column needed, add an aggregation function, and then calculate the percentage change:
df.resample('W')['Adj Close'].mean().pct_change()
Given a Spark dataframe that I have:
val df = Seq(
("2019-01-01",100),
("2019-01-02",101),
("2019-01-03",102),
("2019-01-04",103),
("2019-01-05",102),
("2019-01-06",99),
("2019-01-07",98),
("2019-01-08",100),
("2019-01-09",47)
).toDF("day","records")
I want to add a new column so that I get the average value of the last N records as of a given day. For example, if N=3, then on a given day that value should be the average of the last 3 values EXCLUDING the current record.
For example, for day 2019-01-05, it would be (103+102+101)/3
How can I efficiently use the over() clause to do this in Spark?
PySpark solution.
The window definition should be ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING, which translates to positions (-3, -1) with both boundaries included.
from pyspark.sql import Window
from pyspark.sql.functions import avg
w = Window.orderBy(df.day).rowsBetween(-3, -1)
df_with_rsum = df.withColumn("rsum_prev_3_days", avg(df.records).over(w))
df_with_rsum.show()
The solution assumes there is one row per date in the dataframe without missing dates in between. If not, aggregate the rows by date before applying the window function.
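If there can be several rows per day, here is a minimal sketch of that pre-aggregation before applying the same window (summing records per day is an assumption; use whatever aggregate fits the data):
from pyspark.sql import Window
from pyspark.sql.functions import avg, sum as sum_
# collapse to one row per day first (here: summing each day's records)
daily = df.groupBy("day").agg(sum_("records").alias("records"))
w = Window.orderBy("day").rowsBetween(-3, -1)
daily.withColumn("avg_prev_3_days", avg("records").over(w)).show()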