Pandas: Sliding window, summing app 14 day data - python-3.x

I do wonder how it is possible to make sliding windows in Pandas.
I have a dataframe with three columns.
Country | Number | DayOfTheYear
===================================
No | 50 | 0
No | 20 | 1
No | 37 | 2
I would love to see 14 day chunks for every country and day combination.
The country think can be ignored for the moment, since I can filter those manually in some way. But imagine there is only one country, is there a smart way to get some sort of summed up sliding window, resulting in something like the following?
Country | Sum | DatesOftheYear
===================================
No | 504 | 0-13
No | 207 | 1-14
No | 337 | 2-15
I would also accept if if they where disjunct, being only 0-13, 14-27, etc.
But I just cannot come along with Pandas. I know an old SQL solution, but is there anybody having a nice idea for Pandas?

If you want a rolling windows of your dataframe, you can simply use the .rolling function of pandas : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html
In your case : df["Number"].rolling(14).sum()

Related

Getting multiple readings from .txt into excel

I'm not sure if this is the correct place to ask this, but basically I have a .txt file containing values that came from 2 separate sensors.
Example of some data:
{"t":3838202,"s":0,"n":"x1","v":-1052}
{"t":3838203,"s":0,"n":"y1","v":44}
{"t":3838204,"s":0,"n":"z1","v":-84}
{"t":3838435,"s":0,"n":"x1","v":-1052}
{"t":3838436,"s":0,"n":"y1","v":36}
{"t":3838437,"s":0,"n":"z1","v":-80}
{"t":3838670,"s":0,"n":"x1","v":-1056}
{"t":3838671,"s":0,"n":"y1","v":52}
{"t":3838672,"s":0,"n":"z1","v":-88}
{"t":3838902,"s":0,"n":"x1","v":-1052}
{"t":3838903,"s":0,"n":"y1","v":48}
{"t":3838904,"s":0,"n":"z1","v":-80}
{"t":3839136,"s":0,"n":"x1","v":-1056}
{"t":3839137,"s":0,"n":"y1","v":40}
{"t":3839138,"s":0,"n":"z1","v":-80}
x2:-944
y2:108
z2:-380
{"t":3839841,"s":0,"n":"x1","v":-1052}
{"t":3839842,"s":0,"n":"y1","v":44}
{"t":3839843,"s":0,"n":"z1","v":-80}
x2:-948
y2:100
z2:-380
{"t":3840541,"s":0,"n":"x1","v":-1052}
{"t":3840542,"s":0,"n":"y1","v":40}
{"t":3840543,"s":0,"n":"z1","v":-84}
{"t":3840774,"s":0,"n":"x1","v":-1052}
{"t":3840775,"s":0,"n":"y1","v":40}
{"t":3840776,"s":0,"n":"z1","v":-84}
x2:-948
y2:108
z2:-368
I'm trying to get the data into excel, so that for each "chunk" of data in the x1y1z1 section, I take the last set of recorded data and discard the rest and "pair" it with the next set of x2y2z2 data. I don't think I'm explaining it very well, but I basically want to take that text file and get this in excel:
+---------+-------+----+-----+------+-----+------+
| t | x1 | y1 | z1 | x2 | y2 | z2 |
+---------+-------+----+-----+------+-----+------+
| 3839138 | -1056 | 40 | -80 | -944 | 100 | -380 |
| 3839843 | -1052 | 44 | -80 | -948 | 100 | -380 |
| 3840776 | -1052 | 40 | -84 | -948 | 108 | -368 |
+---------+-------+----+-----+------+-----+------+
I'm really stuck as to where I should even start
I think like a programmer, so I would approach this problem in steps. If you are not a programmer, this might not be so helpful to you, and I am sorry for that.
First, define the data. How does each line of data get read and understood.
Second, write a parsing utility. A piece of code which interprets the data as it is read in and stores it in the form you want for your output
Third, import data into Excel.
So, based on the limited data you provided, I am not sure how you are able to determine the x1,y1,z1,x2,y2,z2 for each t, but I assume that the values enclosed in curly braces have something to do with that based on the values for s, n, and v I'm seeing in there. So, first of all you need to clearly determine the way you read the data. Take it one line at a time, and determine how you would build your output table based on each line of data. I assume you would treat the lines enclosed in curly braces differently from the lines with standalone x/y/z values for example.
I hope this points you in the right direction.

Using multiple parent IDs for cutoff times in deep feature synthesis

My data looks like: People <-- Events <--Activities. The parent is People, of which the only variable is the person_id. Events and Activities both have a time index, along with event_id and activity_id, both which have a few features.
Members of the 'People' entity visit places at all different times. I am trying to generate deep features for people. If people is something like [1,2,3], how do I pass cut off times that create deep features for something like (Person,cutofftime): [1,January2], [1, January3]
If I have only 3 People, it seems like I can't pass a cutoff_time dataframe that has 10 rows (for example, person 1 with 10 possible cutoff times). Trying this gives me the error "Duplicated rows in cutoff time dataframe", despite dropping duplicates from my cutoff_times dataframe.
Must I include time index in the People Entity? This would leave my parent entity with multiple people in the index, although they would have different time index. My instinct is that the people entity should not include any datetime column. I would like to give cut off times to the DFS function.
My cutoff_times df.head looks like this, and has multiple instances of some people_id:
+-------------------------------------------+
| person_id time label |
+-------------------------------------------+
| 0 f_GZSVLYU 2019-12-06 0.0 |
| 1 f_ATBJEQS 2019-12-06 1.0 |
| 2 f_GLFYVAY 2019-12-06 0.5 |
| 3 f_DIHPTPA 2019-12-06 0.5 |
| 4 f_GZSVLYU 2019-12-02 1.0 |
+-------------------------------------------+
The Parent People Entity is like this:
+-------------------+
| person_id |
+-------------------+
| 0 f_GZSVLYU |
| 1 f_ATBJEQS |
| 2 f_GLFYVAY |
| 3 f_DIHPTPA |
| 4 f_DVOYHRQ |
+-------------------+
How can I make featuretools understand what I'm trying to do?
'Duplicated rows in cutoff time dataframe.' I have explored my cutoff_times df and there are no duplicate rows. Person_id, times, and labels all have multiple occurrences each but no 2 rows are the same. Could these duplicates the error is referring to be somewhere else in the EntitySet?
The answer is one row of the cutoff_df had the same ID and time but with different labels. That's a problem.

How to cycle a Pandas dataframe grouping by hierarchical multiindex from top to bottom and store results

I'm trying to create a forecasting process using hierarchical time series. My problem is that I can't find a way to create a for loop that hierarchically extracts daily time series from a pandas dataframe grouping the sum of quantities by date. The resulting daily time series should be passed to a function inside the loop, and the results stored in some other object.
Dataset
The initial dataset is a table that represents the daily sales data of 3 hierarchical levels: city, shop, product. The initial table has this structure:
+============+============+============+============+==========+
| Id_Level_1 | Id_Level_2 | Id_Level_3 | Date | Quantity |
+============+============+============+============+==========+
| Rome | Shop1 | Prod1 | 01/01/2015 | 50 |
+------------+------------+------------+------------+----------+
| Rome | Shop1 | Prod1 | 02/01/2015 | 25 |
+------------+------------+------------+------------+----------+
| Rome | Shop1 | Prod1 | 03/01/2015 | 73 |
+------------+------------+------------+------------+----------+
| Rome | Shop1 | Prod1 | 04/01/2015 | 62 |
+------------+------------+------------+------------+----------+
| ... | ... | ... | ... | ... |
+------------+------------+------------+------------+----------+
| Milan | Shop3 | Prod9 | 31/12/2018 | 185 |
+------------+------------+------------+------------+----------+
| Milan | Shop3 | Prod9 | 31/12/2018 | 147 |
+------------+------------+------------+------------+----------+
| Milan | Shop3 | Prod9 | 31/12/2018 | 206 |
+------------+------------+------------+------------+----------+
Each City (Id_Level_1) has many Shops (Id_Level_2), and each one has some Products (Id_Level_3). Each shop has a different mix of products (maybe shop1 and shop3 have product7, which is not available in other shops). All data are daily and the measure of interest is the quantity.
Hierarchical Index (MultiIndex)
I need to create a tree structure (hierarchical structure) to extract a time series for each "node" of the structure. I call a "node" a cobination of the hierarchical keys, i.e. "Rome" and "Milan" are nodes of Level 1, while "Rome|Shop1" and "Milan|Shop9" are nodes of level 2. In particulare, I need this on level 3, because each product (Id_Level_3) has different sales in each shop of each city. Here is the strict hierarchy.
Nodes of level 3 are "Rome, Shop1, Prod1", "Rome, Shop1, Prod2", "Rome, Shop2, Prod1", and so on. The key of the nodes is logically the concatenation of the ids.
For each node, the time series is composed by two columns: Date and Quantity.
# MultiIndex dataframe
Liv_Labels = ['Id_Level_1', 'Id_Level_2', 'Id_Level_3', 'Date']
df.set_index(Liv_Labels, drop=False, inplace=True)
The I need to extract the aggregated time series in order but keeping the hierarchical nodes.
Level 0:
Level_0 = df.groupby(level=['Data'])['Qta'].sum()
Level 1:
# Node Level 1 "Rome"
Level_1['Rome'] = df.loc[idx[['Rome'],:,:]].groupby(level=['Data']).sum()
# Node Level 1 "Milan"
Level_1['Milan'] = df.loc[idx[['Milan'],:,:]].groupby(level=['Data']).sum()
Level 2:
# Node Level 2 "Rome, Shop1"
Level_2['Rome',] = df.loc[idx[['Rome'],['Shop1'],:]].groupby(level=['Data']).sum()
... repeat for each level 2 node ...
# Node Level 2 "Milan, Shop9"
Level_2['Milan'] = df.loc[idx[['Milan'],['Shop9'],:]].groupby(level=['Data']).sum()
Attempts
I already tried creating dictionaries and multiindex, but my problem is that I can't get a proper "node" use inside the loop. I can't even extract the unique level nodes keys, so I can't collect a specific node time series.
# Get level labels
Level_Labels = ['Id_Liv'+str(n) for n in range(1, Liv_Num+1)]+['Data']
# Initialize dictionary
TimeSeries = {}
# Get Level 0 time series
TimeSeries["Level_0"] = df.groupby(level=['Data'])['Qta'].sum()
# Get othe levels time series from 1 to Level_Num
for i in range(1, Liv_Num+1):
TimeSeries["Level_"+str(i)] = df.groupby(level=Level_Labels[0:i]+['Data'])['Qta'].sum()
Desired result
I would like a loop the cycles my dataset with these actions:
Creates a structure of all the unique node keys
Extracts the node time series grouped by Date and Quantity
Store the time series in a structure for later use
Thanks in advance for any suggestion! Best regards.
FR
I'm currently working on a switch dataset that I polled from an sql database where each port on the respective switch has a data frame which has a time series. So to access this time series information for each specific port I represented the switches by their IP addresses and the various number of ports on the switch, and to make sure I don't re-query what I already queried before I used the .unique() method to get unique queries of each.
I set my index to be the IP and Port indices and accessed the port information like so:
def yield_df(df):
for ip in df.index.get_level_values('ip').unique():
for port in df.loc[ip].index.get_level_values('port').unique():
yield df.loc[ip].loc[port]
Then I cycled the port data frames with a for loop like so:
for port_df in yield_df(adb_df):
I'm sure there are faster ways to carry out these procedures in pandas but I hope this helps you start solving your problem

Spark: count events based on two columns

I have a table with events which are grouped by a uid. All rows have the columns uid, visit_num and event_num.
visit_num is an arbitrary counter that occasionally increases. event_num is the counter of interactions within the visit.
I want to merge these two counters into a single interaction counter that keeps increasing by 1 for each event and continues to increase when then next visit has started.
As I only look at the relative distance between events, it's fine if I don't start the counter at 1.
|uid |visit_num|event_num|interaction_num|
| 1 | 1 | 1 | 1 |
| 1 | 1 | 2 | 2 |
| 1 | 2 | 1 | 3 |
| 1 | 2 | 2 | 4 |
| 2 | 1 | 1 | 500 |
| 2 | 2 | 1 | 501 |
| 2 | 2 | 2 | 502 |
I can achieve this by repartitioning the data and using the monotonically_increasing_id like this:
df.repartition("uid")\
.sort("visit_num", "event_num")\
.withColumn("iid", fn.monotonically_increasing_id())
However the documentation states:
The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.
As the id seems to be monotonically increasing by partition this seems fine. However:
I am close to reaching the 1 billion partition/uid threshold.
I don't want to rely on the current implementation not changing.
Is there a way I can start each uid with 1 as the first interaction num?
Edit
After testing this some more, I notice that some of the users don't seem to have consecutive iid values using the approach described above.
Edit 2: Windowing
Unfortunately there are some (rare) cases where more thanone row has the samevisit_numandevent_num`. I've tried using the windowing function as below, but due to this assigning the same rank to two identical columns, this is not really an option.
iid_window = Window.partitionBy("uid").orderBy("visit_num", "event_num")
df_sample_iid=df_sample.withColumn("iid", fn.rank().over(iid_window))
The best solution is the Windowing function with rank, as suggested by Jacek Laskowski.
iid_window = Window.partitionBy("uid").orderBy("visit_num", "event_num")
df_sample_iid=df_sample.withColumn("iid", fn.rank().over(iid_window))
In my specific case some more data cleaning was required but generally, this should work.

Pivot table calculate field based on grouped values

after years of quietly learning from this site I've finally hit a question who's answer I cannot seem to find on StackOverflow...
I have a pivot table that needs to calculate Net Promoter Score from several groups within a population.
Net promoter score is calculated like so:
[% of Population that give 9 or 10/10] - [% of Population that give 1 to 6/10]
Each individual record in my source data can only have a single Score of between 1 and 10:
RAW DATA:
Date (dd/mm) Country Type Score (1-10) NPS Category
01/05 US Order enq. 9 Promoter
13/05 US Check-out 5 Detractor
28/05 US Order enq. 7 Passive
So, with help from the answers below I've added a column to categorise each individual into the Promoter (9 or 10), Passive (7 or 8) and Detractor (1 to 6) groups based on that score: screenshot of raw data (with sensitive items hidden).
All that remains now is:
How can I create a calculated 'NPS' column like the one shown in my (rudimentary) representation of a pivot table below that takes the Detractor value from the Promoter value?
D = Detractor group
Pa = Passive group
Pr = Promoter group
| Order enquiry | Check-out |
| D Pa Pr NPS | D Pa Pr NPS |
-------------------------------------------------- |
GB | | |
May | 0 0 100 100 | 30 20 50 20 |
Jun | 10 30 60 50 | 35 35 60 25 |
Jul | 30 20 50 20 | 40 10 40 0 |
US | | |
May | 45 15 40 - 5 | 50 10 40 -10 |
Jun | 40 30 30 -10 | 40 30 30 -10 |
Jul | 5 35 60 55 | 20 40 40 20 |
My attempt at a calculated column can be seen in this screenshot. This results in an error and of course I haven't managed to convert the NPS counts into percentages yet.
It would be my suggestion to create a new column in the source that calculates D, Pa,Pr by a formula.
You can now create the % for these values in the pivot. The NPS column can either be calculated after pivoting the output field, or by creating a pivot-column formula in Excel.
It's not clear from your question how your data is laid out, or what exactly you're asking. From what I can see, you need to add a column in your raw data table, which says something like
=COUNTIFS(UniqueID,MyUniqueID,Score,">=9")-COUNTIFS(UniqueID,MyUniqueID,Score,"<=6")
Then another column that says
=IF(NetPromoter>=9,"Pr",IF(NetPromoter>=7,"Pa","D"))
And then in your pivot table you add the Classification as a new subcolumn, and add the NPS as the Average of your NPS column, or something like that.
Please show your data if you want the formulas changed to meet your actual range/variable terms.

Resources