I do wonder how it is possible to make sliding windows in Pandas.
I have a dataframe with three columns.
Country | Number | DayOfTheYear
===================================
No | 50 | 0
No | 20 | 1
No | 37 | 2
I would love to see 14 day chunks for every country and day combination.
The country thing can be ignored for the moment, since I can filter by country manually. But imagine there is only one country: is there a smart way to get some sort of summed sliding window, resulting in something like the following?
Country | Sum | DatesOftheYear
===================================
No | 504 | 0-13
No | 207 | 1-14
No | 337 | 2-15
I would also accept it if the windows were disjoint, i.e. only 0-13, 14-27, etc.
But I just cannot get this to work in Pandas. I know an old SQL solution, but does anybody have a nice idea for Pandas?
If you want a rolling window over your dataframe, you can simply use the .rolling method of pandas: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html
In your case: df["Number"].rolling(14).sum()
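To make the rolling call concrete, here is a minimal sketch that also covers the per-country case and the disjoint 14-day chunks mentioned in the question. The sample data (two countries, 28 days, every Number equal to 1) and the column names Sum14 and Chunk are made up for illustration. Note that rolling(14) looks backwards, so the value on day 13 covers days 0-13.

```python
import pandas as pd

# Hypothetical sample data: two countries, 28 days each, every Number equal to 1
# so that the expected sums are easy to check by eye.
df = pd.DataFrame({
    "Country": ["No"] * 28 + ["Se"] * 28,
    "DayOfTheYear": list(range(28)) * 2,
    "Number": [1] * 56,
})
df = df.sort_values(["Country", "DayOfTheYear"]).reset_index(drop=True)

# Sliding windows: sum of the current day and the 13 days before it, per country.
# The first 13 rows of each country are NaN because the window is not full yet.
df["Sum14"] = (
    df.groupby("Country")["Number"]
      .rolling(14)
      .sum()
      .reset_index(level=0, drop=True)  # drop the Country level to align with df
)

# Disjoint chunks (0-13, 14-27, ...): integer-divide the day by 14 and group.
chunks = (
    df.assign(Chunk=df["DayOfTheYear"] // 14)
      .groupby(["Country", "Chunk"], as_index=False)["Number"]
      .sum()
)

print(df.loc[[12, 13, 27], ["Country", "DayOfTheYear", "Sum14"]])
print(chunks)
```

Grouping by Country before rolling keeps a window from spilling over from one country into the next.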
I'm not sure if this is the correct place to ask this, but basically I have a .txt file containing values that came from 2 separate sensors.
Example of some data:
{"t":3838202,"s":0,"n":"x1","v":-1052}
{"t":3838203,"s":0,"n":"y1","v":44}
{"t":3838204,"s":0,"n":"z1","v":-84}
{"t":3838435,"s":0,"n":"x1","v":-1052}
{"t":3838436,"s":0,"n":"y1","v":36}
{"t":3838437,"s":0,"n":"z1","v":-80}
{"t":3838670,"s":0,"n":"x1","v":-1056}
{"t":3838671,"s":0,"n":"y1","v":52}
{"t":3838672,"s":0,"n":"z1","v":-88}
{"t":3838902,"s":0,"n":"x1","v":-1052}
{"t":3838903,"s":0,"n":"y1","v":48}
{"t":3838904,"s":0,"n":"z1","v":-80}
{"t":3839136,"s":0,"n":"x1","v":-1056}
{"t":3839137,"s":0,"n":"y1","v":40}
{"t":3839138,"s":0,"n":"z1","v":-80}
x2:-944
y2:108
z2:-380
{"t":3839841,"s":0,"n":"x1","v":-1052}
{"t":3839842,"s":0,"n":"y1","v":44}
{"t":3839843,"s":0,"n":"z1","v":-80}
x2:-948
y2:100
z2:-380
{"t":3840541,"s":0,"n":"x1","v":-1052}
{"t":3840542,"s":0,"n":"y1","v":40}
{"t":3840543,"s":0,"n":"z1","v":-84}
{"t":3840774,"s":0,"n":"x1","v":-1052}
{"t":3840775,"s":0,"n":"y1","v":40}
{"t":3840776,"s":0,"n":"z1","v":-84}
x2:-948
y2:108
z2:-368
I'm trying to get the data into Excel so that, for each "chunk" of data in the x1/y1/z1 section, I keep only the last set of recorded data, discard the rest, and pair it with the next set of x2/y2/z2 data. I don't think I'm explaining it very well, but I basically want to take that text file and get this in Excel:
+---------+-------+----+-----+------+-----+------+
| t | x1 | y1 | z1 | x2 | y2 | z2 |
+---------+-------+----+-----+------+-----+------+
| 3839138 | -1056 | 40 | -80 | -944 | 108 | -380 |
| 3839843 | -1052 | 44 | -80 | -948 | 100 | -380 |
| 3840776 | -1052 | 40 | -84 | -948 | 108 | -368 |
+---------+-------+----+-----+------+-----+------+
I'm really stuck as to where I should even start.
I think like a programmer, so I would approach this problem in steps. If you are not a programmer, this might not be so helpful to you, and I am sorry for that.
First, define the data: how does each line of data get read and understood?
Second, write a parsing utility: a piece of code that interprets the data as it is read in and stores it in the form you want for your output.
Third, import data into Excel.
Based on the limited data you provided, I am not sure how you determine x1, y1, z1, x2, y2 and z2 for each t, but I assume the values enclosed in curly braces have something to do with it, judging by the s, n and v fields I'm seeing in there. So first of all, you need to clearly define how you read the data. Take it one line at a time, and work out how each line contributes to your output table. I assume, for example, that you would treat the lines enclosed in curly braces differently from the lines with standalone x/y/z values.
I hope this points you in the right direction.
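As one possible starting point, the three steps could be sketched in Python roughly as follows. This rests on assumptions not confirmed in the question: that lines in curly braces are JSON records whose "n" field names the axis and whose "v" field is the value, that each second-sensor block ends with its z2 line, and that t should be the timestamp of the last JSON record before that block. The function names are made up for illustration.

```python
import csv
import io
import json

HEADER = ["t", "x1", "y1", "z1", "x2", "y2", "z2"]

def pair_readings(lines):
    """Yield one row per x2/y2/z2 block, paired with the latest x1/y1/z1 values."""
    latest = {}    # most recent value seen for each name (x1, y1, ..., z2)
    last_t = None  # timestamp of the most recent JSON record
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("{"):
            # First-sensor line, e.g. {"t":3838202,"s":0,"n":"x1","v":-1052}
            record = json.loads(line)
            latest[record["n"]] = record["v"]
            last_t = record["t"]
        else:
            # Second-sensor line, e.g. x2:-944
            name, value = line.split(":", 1)
            latest[name.strip()] = int(value)
            if name.strip() == "z2":  # assumption: z2 closes each block
                yield [last_t] + [latest.get(key) for key in HEADER[1:]]

def to_csv(rows):
    """Render the rows as CSV text, which Excel can open directly."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()

# A shortened excerpt of the data from the question.
sample = """{"t":3838902,"s":0,"n":"x1","v":-1052}
{"t":3838903,"s":0,"n":"y1","v":48}
{"t":3838904,"s":0,"n":"z1","v":-80}
{"t":3839136,"s":0,"n":"x1","v":-1056}
{"t":3839137,"s":0,"n":"y1","v":40}
{"t":3839138,"s":0,"n":"z1","v":-80}
x2:-944
y2:108
z2:-380
{"t":3839841,"s":0,"n":"x1","v":-1052}
{"t":3839842,"s":0,"n":"y1","v":44}
{"t":3839843,"s":0,"n":"z1","v":-80}
x2:-948
y2:100
z2:-380
"""

rows = list(pair_readings(sample.splitlines()))
print(to_csv(rows))
```

To run it on the real file, the same function can be fed an open file object instead of sample.splitlines(), and the CSV text written to disk for Excel to open.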
I am working on an Excel sheet that has a large amount of data with dates. To simplify it, I want to group the numbers by week. Allow me to explain:
The actual rows are like:
29-11-2018 | 49 | 1 | 4 |7 | 2
30-11-2018 | 49 | 4 | 0 |2 | 1
Where "49" is the week number of the date. I'm trying to make Excel combine those rows by week and add up the other columns, like this:
49 | 5 | 4 | 9 | 3
And this for all the weeks, so I can know the exact number of data for every week.
Is there a way of doing this?
Thanks!
Regards,
Assuming your data is located at A2:F3, put
H2 ---> =B2
I2 ---> =IF($B1<>$B2,C2,I1+C2)
and drag I2 across to L2. Then put
N2 ---> =IF($B3<>$B2,H2,"")
and drag it across to R2. Select H2:R2 and drag down to the last row of data.
You'll see your intended result in columns N to R.
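If doing this outside Excel is an option, the same per-week totals can be sketched in a few lines of Python. This is only an illustration of the grouping logic, not a replacement for the formulas above: the rows are typed in by hand from the question, and the function name sum_by_week is made up.

```python
from collections import defaultdict

# Rows copied from the question: (date, week number, four counts).
rows = [
    ("29-11-2018", 49, 1, 4, 7, 2),
    ("30-11-2018", 49, 4, 0, 2, 1),
]

def sum_by_week(rows):
    """Add up the four count columns for every week number."""
    totals = defaultdict(lambda: [0, 0, 0, 0])
    for _date, week, *counts in rows:
        for i, count in enumerate(counts):
            totals[week][i] += count
    return dict(totals)

weekly = sum_by_week(rows)
print(weekly)  # {49: [5, 4, 9, 3]}
```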
I have a table with events which are grouped by a uid. All rows have the columns uid, visit_num and event_num.
visit_num is an arbitrary counter that occasionally increases. event_num is the counter of interactions within the visit.
I want to merge these two counters into a single interaction counter that increases by 1 for each event and keeps increasing when the next visit starts.
As I only look at the relative distance between events, it's fine if I don't start the counter at 1.
|uid |visit_num|event_num|interaction_num|
| 1 | 1 | 1 | 1 |
| 1 | 1 | 2 | 2 |
| 1 | 2 | 1 | 3 |
| 1 | 2 | 2 | 4 |
| 2 | 1 | 1 | 500 |
| 2 | 2 | 1 | 501 |
| 2 | 2 | 2 | 502 |
I can achieve this by repartitioning the data and using the monotonically_increasing_id like this:
df.repartition("uid")\
.sort("visit_num", "event_num")\
.withColumn("iid", fn.monotonically_increasing_id())
However the documentation states:
The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.
As the id seems to be monotonically increasing by partition this seems fine. However:
I am close to reaching the 1 billion partition/uid threshold.
I don't want to rely on the current implementation not changing.
Is there a way I can start each uid with 1 as the first interaction num?
Edit
After testing this some more, I notice that some of the users don't seem to have consecutive iid values using the approach described above.
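The gaps follow from the layout described in the quoted documentation: the record number only counts up within one partition, and a new partition starts again at a multiple of 2^33. A small plain-Python sketch of that arithmetic (the function names are made up, and it only mirrors the documented current layout, which may change):

```python
RECORD_BITS = 33  # the lower 33 bits hold the record number within a partition

def compose_id(partition_id, record_number):
    """Combine partition ID and record number as the quoted documentation describes."""
    return (partition_id << RECORD_BITS) | record_number

def decompose_id(generated_id):
    """Split a generated ID back into (partition ID, record number)."""
    return generated_id >> RECORD_BITS, generated_id & ((1 << RECORD_BITS) - 1)

ids = [compose_id(0, 0), compose_id(0, 1), compose_id(1, 0), compose_id(1, 1)]
print(ids)  # [0, 1, 8589934592, 8589934593]
print(decompose_id(8589934593))  # (1, 1)
```

So the rows of one uid only get consecutive values if they sit next to each other in the same partition, which repartitioning by uid alone does not guarantee when several uids share a partition.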
Edit 2: Windowing
Unfortunately, there are some (rare) cases where more than one row has the same visit_num and event_num. I've tried using a window function as below, but because rank assigns the same value to two identical rows, this is not really an option.
iid_window = Window.partitionBy("uid").orderBy("visit_num", "event_num")
df_sample_iid=df_sample.withColumn("iid", fn.rank().over(iid_window))
The best solution is the Windowing function with rank, as suggested by Jacek Laskowski.
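from pyspark.sql import Window, functions as fn
# Number the events of each uid in visit/event order.
iid_window = Window.partitionBy("uid").orderBy("visit_num", "event_num")
df_sample_iid = df_sample.withColumn("iid", fn.rank().over(iid_window))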
In my specific case some more data cleaning was required but generally, this should work.
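For the duplicate (visit_num, event_num) rows mentioned in the question, it may help to see how a rank differs from a plain row number. The following is a plain-Python emulation of the two semantics, not Spark code; the helper names are made up. In PySpark, the counterpart of the second function is fn.row_number(), which can be used over the same window in place of fn.rank(), although the order among tied rows is then arbitrary unless a tie-breaking column is added to orderBy.

```python
def rank(sorted_keys):
    """SQL-style RANK: tied keys share a rank and the following rank is skipped."""
    ranks = []
    previous_key = None
    for position, key in enumerate(sorted_keys, start=1):
        if ranks and key == previous_key:
            ranks.append(ranks[-1])   # tie: repeat the previous rank
        else:
            ranks.append(position)    # new key: rank equals the position
        previous_key = key
    return ranks

def row_number(sorted_keys):
    """SQL-style ROW_NUMBER: always 1, 2, 3, ... regardless of ties."""
    return list(range(1, len(sorted_keys) + 1))

# (visit_num, event_num) pairs for one uid, with one duplicated pair.
events = sorted([(1, 1), (1, 2), (1, 2), (2, 1)])
print(rank(events))        # [1, 2, 2, 4]
print(row_number(events))  # [1, 2, 3, 4]
```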
Slightly wordy title but here goes
I have a grid in Excel which includes 3 columns (media spend, marginal revenue returns and media channel invested in), and I want to create the column below called desired cumulative spend.
The grid is structured this way because it represents an optimised spend laydown, ordered by how much of each media channel's budget should be invested until the marginal returns diminish to the point where it should be substituted for another media channel.
It is possible that this substitution can then be reversed back to the original channel if the new channel has a sharply diminishing curve, such that all marginal benefit associated to the new channel diminishes and the total spend level still means it is mathematically sensible to switch back to the original curve (maybe it has a lower base level but reduces less sharply). It is also possible that at the point in which the marginal benefit associated to the new channel diminishes, the best next step is to invest in a third channel.
The desired new spend column has two elements to it:
- It is a simple accumulation of spend from row to row when the media channel is constant from row to row.
- It is a slightly more tricky accumulation of spend when the media channel changes: then it needs to be able to reference back to the last spend level associated with the channel which has been substituted in. For row 4, the logic I am struggling with would need to take the running total from row 3, plus the new spend level associated with row 4, minus the spend level the last time this channel was used (row 2).
|spend | mar return | media | desired cumulative spend |
|------ |----------- |-------| ----------------------------------------- |
1 | £580 | 128 | chan1 | 580 |
2 | £620 | 121 | chan1 | 580+(620-580) |
3 | £900 | 115.8 | chan2 | 580+(620-580)+900 |
4 | £660 | 115.1 | chan1 | 580+(620-580)+900+(660-620) |
5 | £920 | 114 | chan2 | 580+(620-580)+900+(660-620)+(920-900) |
6 | £940 | 112 | chan2 | 580+(620-580)+900+(660-620)+(920-900)+(940-920) |
If my comment is the correct suggestion, then something like this should do it (£580 is at A2, so the first output is D2):
D2 =A2
D3 =D2+A3-IF(COUNTIF($C$2:C2,C3),INDEX(A:A,MAX(IF($C$2:C2=C3,ROW($A$2:A2)))))
D3 contains an array formula and must be confirmed with Ctrl+Shift+Enter.
Now you can simply copy down from D3.
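To check the logic against the table in the question, the same rule can be written out in a few lines of Python: each row adds its spend minus the spend recorded the last time the same channel appeared (or minus nothing if the channel is new). This is only a sketch of the arithmetic, with a made-up function name, not a translation of the array formula.

```python
def cumulative_spend(rows):
    """Running total where each row adds only its increase over the channel's last spend."""
    last_spend = {}  # last spend level seen for each channel
    total = 0
    result = []
    for spend, channel in rows:
        total += spend - last_spend.get(channel, 0)
        last_spend[channel] = spend
        result.append(total)
    return result

# (spend, media channel) pairs from the question's table.
rows = [
    (580, "chan1"),
    (620, "chan1"),
    (900, "chan2"),
    (660, "chan1"),
    (920, "chan2"),
    (940, "chan2"),
]
print(cumulative_spend(rows))  # [580, 620, 1520, 1560, 1580, 1600]
```

The printed values match the expressions in the desired cumulative spend column, e.g. row 4 is 580+(620-580)+900+(660-620) = 1560.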