I found a way to get the number I need, but I was hoping to get some input on how to accomplish it in a less cumbersome way. I need a running total of transactions to date in order to make it into a plotly plot. The data I have only includes a few columns: id, date, and amount. Here's the code I have so far:
import pandas as pd

fy20 = pd.read_excel('./data/transactions.xlsx', parse_dates=['date'])

def daily_money(df):
    df = df.groupby('date').amount.sum()
    df = df.groupby(df.index.day).cumsum()
    df = df.cumsum().to_frame().reset_index()
    return df

fy20 = daily_money(fy20)
This appears to accomplish the goal, but it seems like there must be a simpler way. Please let me know if you have any suggestions on how to simplify this.
It looks to me like this should work:
df.groupby('date')['amount'].sum().cumsum()
This works because DataFrame.groupby sorts by the group keys by default, so the cumulative sum already runs over the daily totals in date order.
If you want a DataFrame with a fresh index instead of a Series, call Series.reset_index, which converts the Series to a DataFrame in the process; unless you need the date as a regular column (rather than as the index) later, you can skip that step.
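Putting it together, a minimal sketch of the simplified function, assuming the same file and column names as above (the reset_index is only there so that date becomes an ordinary column for the plotly plot):
import pandas as pd

def daily_money(df):
    # Sum per day, then accumulate the daily totals in date order
    running = df.groupby('date')['amount'].sum().cumsum()
    # Optional: turn the date index back into a regular column for plotting
    return running.reset_index()

fy20 = daily_money(pd.read_excel('./data/transactions.xlsx', parse_dates=['date']))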
I have this data and I need to sum the total ['Tons'] per ['KeyCode'], and also show all the ['Origin'] values that make up each total.
I did a groupby by KeyCode, but I can't figure out how to proceed.
df = data.groupby("KeyCode", group_keys=False).apply(lambda x: x)
df.head()
You can use any of the aggregation functions on a groupby object to get the values you are interested in.
To get a total, the function is sum(), whereas to get the unique values we use unique(). To use unique(), you must first select the Series (a column, in Excel terms) whose unique values you want for each value of the groupby key.
See the code and result below:
data = pd.read_excel("total cargo.xlsx")
# get total of 'Tons'
res = data.groupby("KeyCode", group_keys=False).sum()
# get all the 'Origin' values for each keycode
res['Origins'] = data.groupby("KeyCode", group_keys=False)['Origin'].unique()
res.head()
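If you prefer a single call, named aggregation can produce both columns in one agg. A sketch assuming the same column names as above (the result column names total_tons and origins are just illustrative):
res = data.groupby("KeyCode").agg(
    total_tons=("Tons", "sum"),
    origins=("Origin", "unique"),
)
res.head()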
I have two DataFrames:
facts:
columns: data, start_date and end_date
holidays:
column: holiday_date
What I want is a way to produce another DataFrame with the columns:
data, start_date, end_date and num_holidays
where num_holidays is the number of days between start_date and end_date that are neither weekends nor holidays (as listed in the holidays table).
There is an existing solution for this in T-SQL; the crux is this part of the code:
--Calculate and return the number of workdays using the input parameters.
--This is the meat of the function.
--This is really just one formula with a couple of parts that are listed on separate lines for documentation purposes.
RETURN (
    SELECT
        --Start with the total number of days, including weekends
        (DATEDIFF(dd, @StartDate, @EndDate) + 1)
        --Subtract 2 days for each full weekend
        - (DATEDIFF(wk, @StartDate, @EndDate) * 2)
        --If StartDate is a Sunday, subtract 1
        - (CASE WHEN DATENAME(dw, @StartDate) = 'Sunday'
                THEN 1
                ELSE 0
           END)
        --If EndDate is a Saturday, subtract 1
        - (CASE WHEN DATENAME(dw, @EndDate) = 'Saturday'
                THEN 1
                ELSE 0
           END)
        --Subtract all holidays
        - (SELECT COUNT(*) FROM [dbo].[tblHolidays]
           WHERE [HolDate] BETWEEN @StartDate AND @EndDate)
)
END
I'm new to PySpark and was wondering what the efficient way to do this is. I can post the UDF I'm writing if it helps, though I'm going slowly because I feel it's the wrong approach:
Is there a better way than creating a UDF that reads the holidays table into a DataFrame and joins with it to count the holidays? Can I even join inside a UDF?
Is there a way to write a pandas_udf instead? Would it be fast enough?
Are there optimizations I can apply, like caching the holidays table on every worker?
Something like this may work:
from pyspark.sql import functions as F

df_facts = spark.createDataFrame(
    [('data1', '2022-05-08', '2022-05-14'),
     ('data1', '2022-05-08', '2022-05-21')],
    ['data', 'start_date', 'end_date']
)
df_holidays = spark.createDataFrame([('2022-05-10',)], ['holiday_date'])

# One row per calendar day between start_date and end_date (inclusive)
df = df_facts.withColumn('exploded', F.explode(F.sequence(F.to_date('start_date'), F.to_date('end_date'))))
# Drop weekends (in Spark, dayofweek returns 1 for Sunday and 7 for Saturday)
df = df.filter(~F.dayofweek('exploded').isin([1, 7]))
# Drop holidays with an anti join against the (broadcast) holidays table
df = df.join(F.broadcast(df_holidays), df.exploded == df_holidays.holiday_date, 'anti')
# Count the remaining days per original row
df = df.groupBy('data', 'start_date', 'end_date').agg(F.count('exploded').alias('business_days'))
df.show()
# +-----+----------+----------+-------------+
# | data|start_date| end_date|business_days|
# +-----+----------+----------+-------------+
# |data1|2022-05-08|2022-05-14| 4|
# |data1|2022-05-08|2022-05-21| 9|
# +-----+----------+----------+-------------+
Answers:
Is there a better way than creating a UDF...?
This method does not use a UDF, so it should perform better.
Is there a way to write a pandas_udf instead? Would it be fast enough?
A pandas_udf generally performs better than a regular UDF, but an approach with no UDF at all should be better still (a rough pandas_udf sketch is shown after these answers).
Are there optimizations I can apply, like caching the holidays table on every worker?
The Spark engine performs optimizations itself, but there are relatively rare cases where you can help it. In the answer above, I used F.broadcast(df_holidays), which sends the DataFrame to all of the workers. In this case, though, the table would almost certainly be broadcast automatically anyway, since it is very small.
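For the pandas_udf question, a rough sketch of what that version might look like (Spark 3.x type-hint style, using NumPy's busday_count; it assumes the holiday table is small enough to collect to the driver, and reuses the df_facts and df_holidays names from above):
import numpy as np
import pandas as pd
from pyspark.sql import functions as F

# Assumption: df_holidays is small enough to collect to the driver
holidays = np.array(
    [row['holiday_date'] for row in df_holidays.collect()],
    dtype='datetime64[D]'
)

@F.pandas_udf('long')
def business_days(start: pd.Series, end: pd.Series) -> pd.Series:
    s = pd.to_datetime(start).values.astype('datetime64[D]')
    # busday_count excludes the end date, so shift it by one day to include it
    e = pd.to_datetime(end).values.astype('datetime64[D]') + np.timedelta64(1, 'D')
    return pd.Series(np.busday_count(s, e, holidays=holidays))

df_facts.withColumn('business_days', business_days('start_date', 'end_date')).show()
Even so, the explode-and-anti-join version above keeps everything inside the Spark engine, which is usually preferable.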
I have a train_df and a test_df, which come from the same original dataframe, but were split up in some proportion to form the training and test datasets, respectively.
Both train and test dataframes have identical structure:
A PeriodIndex with daily buckets
n number of columns that represent observed values in those time buckets e.g. Sales, Price, etc.
I now want to construct a yhat_df, which stores predicted values for each of the columns. In the "naive" case, each column of yhat_df simply takes the last observed value of that column in the training dataset.
So I go about constructing yhat_df as below:
import pandas as pd
yhat_df = pd.DataFrame().reindex_like(test_df)
yhat_df[train_df.columns[0]].fillna(train_df.tail(1).values[0][0], inplace=True)
yhat_df[train_df.columns[1]].fillna(train_df.tail(1).values[0][1], inplace=True)
This appears to work, and since I have only two columns, the extra typing is bearable.
I was wondering if there is simpler way, especially one that does not need me to go column by column.
I tried the following, but it only populates the column values where the PeriodIndex values match. It seems fillna() attempts an index alignment (a join of sorts) internally:
yhat_df.fillna(train_df.tail(1), inplace=True)
If I could figure out a way for fillna() to ignore index, maybe this would work?
You can use fillna with a dictionary to fill each column with a different value, so I think:
yhat_df = yhat_df.fillna(train_df.tail(1).to_dict('records')[0])
should work. But if I understand what you are doing correctly, you could even create the DataFrame directly with:
yhat_df = pd.DataFrame(train_df.tail(1).to_dict('records')[0],
                       index=test_df.index, columns=test_df.columns)
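Another option that avoids the dict round trip is to tile the last training row with NumPy; a sketch assuming train_df and test_df share the same columns as described above (note that .values will upcast mixed dtypes):
import numpy as np
import pandas as pd

yhat_df = pd.DataFrame(
    np.repeat(train_df.tail(1).values, len(test_df), axis=0),
    index=test_df.index,
    columns=test_df.columns,
)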
I have a pandas dataframe df with the column Name. I did:
for name in df['Name'].unique():
    X = df[df['Name'] == name]
    print(X.head())
but then X contains all kinds of different Name values, not the single name I want.
What did I do wrong?
Thanks a lot
Your loop overwrites X on every iteration, so after the loop X only holds the DataFrame for the last value of df['Name'].unique(), which is probably not what you want.
Depending on your data and goal, you might want to use groupby as jezrael suggests (a minimal sketch is below), or maybe do something like df[~df['Name'].duplicated()].
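A minimal sketch of the groupby approach: each iteration yields the name and the sub-DataFrame of rows with that name, so nothing gets overwritten.
for name, X in df.groupby('Name'):
    print(name)
    print(X.head())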
I want to iterate over the rows of a dataframe, but keep each row as a dataframe that has the exact same format as the parent dataframe, except with only one row. I know about calling DataFrame() and passing in the index and columns, but for some reason this doesn't always give me the same format as the parent dataframe. Calling to_frame() on the series (i.e. the row) does cast it back to a dataframe, but often transposed or otherwise different from the parent dataframe's format. Isn't there some easy way to do this that guarantees it will always have the same format for each row?
Here is what I came up with as my best solution so far:
def transact(self, orders):
    # Buy or Sell
    if len(orders) > 1:
        empty_order = orders.iloc[0:0]
        for index, order in orders.iterrows():
            empty_order.loc[index] = order
            # empty_order.append(order)
            self.sub_transact(empty_order)
    else:
        self.sub_transact(orders)
In essence, I empty the dataframe and then insert each series from the for loop back into it. This works correctly, but gives the following warning:
C:\Users\BNielson\Google Drive\My Files\machine-learning\Python-Machine-Learning\ML4T_Ex2_1.py:57: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
empty_order.loc[index] = order
C:\Users\BNielson\Anaconda3\envs\PythonMachineLearning\lib\site-packages\pandas\core\indexing.py:477: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self.obj[item] = s
So it's this line giving the warning:
empty_order.loc[index] = order
This is particularly strange because I am using .loc already, when normally you get this error when you don't use .loc.
There is a much much easier way to do what I want.
order.to_frame().T
So...
if len(orders) > 1:
    for index, order in orders.iterrows():
        self.sub_transact(order.to_frame().T)
else:
    self.sub_transact(orders)
What this actually does is translate the series (which still contains the necessary column and index information) back into a dataframe. But for some moronic (though I'm sure Pythonic) reason it transposes it, so that the previous row becomes a column and the previous columns become multiple rows! So you just transpose it back.
Use groupby with a key that is unique per row. groupby does exactly what you are asking for: it iterates over each group, and each group is a dataframe. So, if you group by a value that is unique for each and every row, you'll get a single-row dataframe when you iterate over the groups.
import numpy as np

for n, group in df.groupby(np.arange(len(df))):
    # do stuff with the single-row dataframe `group`
    pass
If I can suggest an alternative way, it would be like this:
for index, order in orders.iterrows():
    orders.loc[index:index]
orders.loc[index:index] is exactly a one-row dataframe slice with the same structure, including the index and column names.
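A small self-contained demo of this (the order data is made up for illustration). Unlike order.to_frame().T, the slice also preserves the original column dtypes, because iterrows converts each row to a Series and can upcast mixed dtypes; one caveat is that if the index has duplicate labels, the slice can return more than one row.
import pandas as pd

orders = pd.DataFrame({'symbol': ['AAPL', 'GOOG'], 'shares': [10, -5]},
                      index=['2011-01-10', '2011-01-13'])

for index, _ in orders.iterrows():
    one_row = orders.loc[index:index]   # one-row DataFrame: same columns, index label and dtypes
    print(type(one_row).__name__, one_row.shape)
# DataFrame (1, 2)
# DataFrame (1, 2)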