Pandas groupby timestamp and increase count - python-3.x

I am a beginner in pandas and I would like some help with a problem I have.
I have a csv file structured as follows:
#timestamp. message. name. ID
2021-07-10 14:01:00 user 0001 has logged out. User Log Off. 0001
2021-07-10 14:01:10 user 0002 has logged out. User Log Off. 0002
2021-07-10 14:01:15 user 0003 has logged out. User Log Off. 0003
2021-07-10 14:08:20 user 0001 has logged out. User Log Off. 0001
What I would like to do is go through all the rows and check whether any are duplicates. If the same event occurs more than once within a time span of 10 minutes (based on the timestamp), I want to add a column with the number of times the event was counted.
For example, this is what I would like to have as output:
#timestamp. message. name. ID. count
2021-07-10 14:01:00 user 0001 has logged out. User Log Off. 0001. 2
2021-07-10 14:01:10 user 0002 has logged out. User Log Off. 0002. 1
2021-07-10 14:01:15 user 0003 has logged out. User Log Off. 0003. 1
Basically, group the duplicate events into a single row with the number of events counted in that time span.
Is this something achievable with pandas?
Thank you so much for any help

Here's an outline that you can follow:
# 0. parse the timestamps and sort by them if not already sorted
df['#timestamp.'] = pd.to_datetime(df['#timestamp.'])
df = df.sort_values('#timestamp.')
# columns that identify "the same event"
keys = ['message.', 'name.', 'ID']
# 1. compute the time difference to the previous identical event and compare it to the threshold
df['timediff'] = df.groupby(keys)['#timestamp.'].diff() > pd.Timedelta('10min')
# 2. find the blocks with cumsum (the counter increases each time a gap exceeds 10 minutes)
df['block'] = df.groupby(keys)['timediff'].cumsum()
# 3. group by the blocks
out = (df.groupby(['block'] + keys)
.agg({'#timestamp.':'first', 'timediff':'count'})
)
Note that each event is compared with the previous one, not with the first event of its block, so this will group 00:00:00, 00:09:00, and 00:18:00 together.
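To make the outline above concrete, here is a minimal sketch run on the four sample rows from the question. The column names (timestamp, message, name, ID) are simplified assumptions, since the exact headers of the CSV are not clear from the post, and the helper column names new_block and block are made up for illustration.

```python
import pandas as pd

# Sample rows from the question; column names are simplified assumptions.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2021-07-10 14:01:00",
        "2021-07-10 14:01:10",
        "2021-07-10 14:01:15",
        "2021-07-10 14:08:20",
    ]),
    "message": [
        "user 0001 has logged out",
        "user 0002 has logged out",
        "user 0003 has logged out",
        "user 0001 has logged out",
    ],
    "name": ["User Log Off"] * 4,
    "ID": ["0001", "0002", "0003", "0001"],
})

keys = ["message", "name", "ID"]
df = df.sort_values("timestamp")

# True where the gap to the previous identical event exceeds 10 minutes
# (the first event of each group has no predecessor, so it is False).
df["new_block"] = df.groupby(keys)["timestamp"].diff() > pd.Timedelta("10min")

# Running count of such gaps gives a block number per event.
df["block"] = df.groupby(keys)["new_block"].cumsum()

# One row per block: first timestamp plus the number of events in the block.
out = (
    df.groupby(keys + ["block"], as_index=False)
      .agg(timestamp=("timestamp", "first"), count=("timestamp", "size"))
)
print(out[["timestamp", "message", "name", "ID", "count"]])
```

On the sample data, the two logouts of user 0001 are 7 minutes 20 seconds apart, so they fall into the same block and are reported once with a count of 2.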

Related

How to analyze data in Excel

I have an Excel document that consists of huge data in need of analysis.
The data is basically objects with corresponding error messages. Typical output is:
**REC NO07/121007163**
Valuation for 0001 IFRS16 Balance sheet valuation
Asset transactions already posted need to be reversed
The valuation could not be completed
**REC NO07/121007165**
Valuation for 0001 IFRS16 Balance sheet valuation
Asset transactions already posted need to be reversed
The valuation could not be completed
**REC NO07/121007220**
Valuation for 0001 IFRS16 Balance sheet valuation
Closing balance 5 070,00 NOK liability available
Difference 5 070,00- NOK between clearing and expense available
**REC NO07/121007221**
Valuation for 0001 IFRS16 Balance sheet valuation
Closing balance 5 070,00 NOK liability available
Difference 5 070,00- NOK between clearing and expense available
What you see in bold above, is the object. This is not in bold in Excel, but I have made it bold here to explain. Everything in-between is the error message for that object.
The length (number of lines) of the error message could vary between objects.
What I would like to do, is basically convert the above to this:
REC NO07/121007163 Valuation for 0001 IFRS16 Balance sheet valuation. Asset transactions already posted need to be reversed. The valuation could not be completed
REC NO07/121007165 Valuation for 0001 IFRS16 Balance sheet valuation. Asset transactions already posted need to be reversed. The valuation could not be completed
REC NO07/121007220 Valuation for 0001 IFRS16 Balance sheet valuation. Closing balance 5 070,00 NOK liability available. Difference 5 070,00- NOK between clearing and expense available
REC NO07/121007221 Valuation for 0001 IFRS16 Balance sheet valuation. Closing balance 5 070,00 NOK liability available. Difference 5 070,00- NOK between clearing and expense available
I am adding a tab between the object and the error message.
I am combining all lines of the error message with ". "
Is this possible in Excel and if yes, is there anyone that could help me with that?
Thank you
Best regards
Antonis
I have tried to do this with formulas in Excel but as the number of lines for each error message varies, I was not able to solve it.
Assuming all the error codes start with REC, and given that the tags on the question list no Excel version constraints, you can use the following formula in cell B1:
=LET(A, A1:A16, m, ROWS(A), seq, SEQUENCE(m), idx, FILTER(seq, (LEFT(A,3)="REC")),
start, idx+1, end, VSTACK(DROP(idx-1,1), m), MAP(start, end,
LAMBDA(s,e, INDEX(A,s-1)&" "&TEXTJOIN(". ",, FILTER(A, (seq>=s) * (seq<=e))))))
Here is the output:
Basically, it first finds the index positions of the error codes (idx) and from those derives the start and end rows of each error message. MAP then builds each result: on every iteration FILTER selects the rows of that message, TEXTJOIN concatenates them, and the error code (INDEX(A,s-1)) is added as a prefix.
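For readers who find the grouping logic easier to follow outside a spreadsheet, here is a rough Python sketch of the same idea: start a new record at every line beginning with REC and attach the following lines to it. The function name join_messages is made up for illustration. Unlike the formula above, which separates the object and the message with a space, this sketch uses a tab, as requested in the question.

```python
def join_messages(lines, prefix="REC"):
    """Group message lines under the preceding line that starts with `prefix`."""
    records = []
    for line in lines:
        if line.startswith(prefix):
            records.append([line])      # a new object starts here
        elif records:
            records[-1].append(line)    # message line for the current object
    # Object, then a tab, then the message lines joined with ". "
    return [head + "\t" + ". ".join(body) for head, *body in records]


lines = [
    "REC NO07/121007163",
    "Valuation for 0001 IFRS16 Balance sheet valuation",
    "Asset transactions already posted need to be reversed",
    "The valuation could not be completed",
    "REC NO07/121007220",
    "Valuation for 0001 IFRS16 Balance sheet valuation",
    "Closing balance 5 070,00 NOK liability available",
    "Difference 5 070,00- NOK between clearing and expense available",
]

for row in join_messages(lines):
    print(row)
```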

Distinct sum in excel/sheets pivot table

Let's say I have some data that I generated through a window function (lifetime_clicks).
I want to know whether an Excel or Google Sheets pivot table can give me the percentage of lifetime clicks that come from today_clicks (hour_clicks / lifetime_clicks).
user_id date hour_clicks lifetime_clicks
1 02 30 90000
1 02 2 90000
1 02 200 90000
1 03 4544 90000
I would like to group the data on date and sum(hour_clicks) and divide that by 90000, but every time I add lifetime_clicks to the calculated field, it sums the data.
Is there a way to take a distinct sum(lifetime_clicks) to prevent this from happening?
I would like to group the data on date and sum(hour_clicks) and divide that by 90000
The easiest option is adding a helper column, like my column E in the image above. The formula is just =C2/D2
The second option, if you are always going to divide by 90000, is to use a calculated field.
To be honest, I think the helper column is the easiest way.
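Outside a pivot table, the same calculation can be sketched in pandas. This is only an illustration under the assumption that lifetime_clicks is constant per user (as in the sample), so taking its maximum per date avoids summing it; the column names day_clicks and share are made up.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "user_id": [1, 1, 1, 1],
    "date": ["02", "02", "02", "03"],
    "hour_clicks": [30, 2, 200, 4544],
    "lifetime_clicks": [90000] * 4,
})

per_date = df.groupby("date").agg(
    day_clicks=("hour_clicks", "sum"),          # clicks summed per date
    lifetime_clicks=("lifetime_clicks", "max"), # taken once, not summed
)
per_date["share"] = per_date["day_clicks"] / per_date["lifetime_clicks"]
print(per_date)
```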

Automatically restart formula given that parameter is met?

I am in need of some assistance. I have a list of multiple IDs in column A, and column B contains the number of items linked to each ID. I want to generate a list of every page pertaining to each ID. So in column C (for example) I want "a - Page 0001" all the way to "a - Page 1000", given that a has 1000 pages, and once it reaches 1000 I want it to restart with b, as follows:
Column A Column B
a 1000
b 2000
c 1500
d 1200
e 700
a - Page 0001
a - Page 0002
a - Page 0003
a - Page 0004
…
a - Page 1000
b - Page 0001
b - Page 0002
b - Page 0003
b - Page 0004
…
b - Page 0001
…
b - Page 2000
c - Page 0001
I have tried using the following formula:
=IF(ROW(C1)< B1+1,CONCATENATE($A$1," - Page ",TEXT(ROW(C1),"0000"),""))
The problem is that once it reaches 1000 I get errors (#VALUE!). Firstly, I believe I have to anchor the reference as $A$1; otherwise, when I drag the formula down, it will just refer to the cell to the left and I'll get a - Page 0001, b - Page 0002, etc. Secondly, I am using the ROW function to generate the page numbers, but I don't understand how I can force it to restart from 1 once it reaches the maximum (i.e. 1000 for a).
This formula will generate your list of individual pages:
=IFERROR(INDEX($A$1:$A$5,IFERROR(MATCH(ROW(C1)-1,$C$1:$C$5,1)+1,1))&" - Page "&RIGHT("0000"&ROW(C1)-IFERROR(INDEX($C$1:$C$5,MATCH(ROW(C1)-1,$C$1:$C$5,1)),0),4),"")
The key to making it work is column C, which holds a helper formula: a running total of the number of pages. In C1 use:
=SUM($B$1:$B1)
Note the missing $ in the last address; it's important that it not be there. Copy that down for the length of your table.
Note the hidden rows
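As a cross-check of what the formula is meant to produce, here is a small Python sketch that generates the same labels from the ID and page-count pairs in the question. The function name page_labels is made up for illustration.

```python
def page_labels(counts):
    """Yield 'id - Page NNNN' for every page of every ID, restarting at 1 per ID."""
    for item_id, pages in counts:
        for n in range(1, pages + 1):
            yield f"{item_id} - Page {n:04d}"


# Columns A and B from the question.
counts = [("a", 1000), ("b", 2000), ("c", 1500), ("d", 1200), ("e", 700)]
labels = list(page_labels(counts))

print(labels[0])      # first page of a
print(labels[999])    # last page of a
print(labels[1000])   # numbering restarts for b
```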

Excel VBA to find non unique values with multiple conditions

I am looking for some help trying to create an Excel macro. I have a very large sheet that looks a bit like this:
Account NAME Address Dealer
68687 Sara 11 Wood 1111
68687 Sara 11 Wood 1111
68687 Sara 11 Wood 1111
12345 Tom 10 Main 7878
12345 Tom 10 Main 7878
54321 Tom 10 Main 7878
10101 John 25 Lake 3232
10101 25 Lake 3232
11111 John 25 Lake 3232
What I need to do is to highlight all the rows on the sheet where each Dealer has more than one unique value in the Account column, but it must also have some value in the name column.
So in the above example I would only want to highlight all the rows for dealer 7878.
I am not certain if I should look at loops or arrays, they might take a long time as the sheet is quite large.
Looking for some help.
Thanks.
James - Dirk gave you a good answer in his comment. It looks like this ...
The format formula is also put into Column F, so you can see the results of the calculation.
If you feel you should still have a VBA solution, this gives you a good starting point for how to lay out your code ...
1. Ignore rows with empty name
2. Count rows where the dealer is the same as the dealer in the current row, and the account is NOT the same as the account in the current row
3. If the count found in Step 2 is greater than 0, highlight the current row.
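The three steps above can be sketched in Python before being translated to VBA. This is a literal reading of the steps, not a tested macro: the function name rows_to_highlight is made up, and the sample uses only the first six rows from the question. Instead of recounting rows for every row, it collects the distinct accounts per dealer once, which gives the same answer to "is the count greater than 0" and avoids a nested loop on a large sheet.

```python
from collections import defaultdict

# (Account, Name, Address, Dealer): the first six rows from the question.
rows = [
    ("68687", "Sara", "11 Wood", "1111"),
    ("68687", "Sara", "11 Wood", "1111"),
    ("68687", "Sara", "11 Wood", "1111"),
    ("12345", "Tom", "10 Main", "7878"),
    ("12345", "Tom", "10 Main", "7878"),
    ("54321", "Tom", "10 Main", "7878"),
]


def rows_to_highlight(rows):
    """Return one True/False flag per row, following the three steps."""
    accounts_by_dealer = defaultdict(set)
    for account, _name, _address, dealer in rows:
        accounts_by_dealer[dealer].add(account)

    flags = []
    for account, name, _address, dealer in rows:
        if not name.strip():
            flags.append(False)  # step 1: ignore rows with an empty name
            continue
        # Step 2: is there any row with the same dealer but another account?
        other_accounts = accounts_by_dealer[dealer] - {account}
        # Step 3: highlight when at least one such row exists.
        flags.append(len(other_accounts) > 0)
    return flags


flags = rows_to_highlight(rows)
print(flags)
```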

Test data on next record

I would like to know if it is possible to read the next record when we are using SyncSORT (SyncTool) based on a certain condition.
Example of the input
Sort key will be account nbr + descending record type + amount
account nbr amount record type
11111111111 10 reversal not in the output
11111111111 10 deposit not in the output
33333333333 20 deposit in the output
44444444444 15 deposit in the output
55555555555 20 reversal in the output
55555555555 10 deposit in the output
66666666666 30 reversal in the output no match
When a reversal record is read, a deposit with the same amount should follow; in that case, both the reversal and the deposit should be left out of the output file. It is possible that the amounts of the reversal and the deposit differ; in that case, both records should be in the output file.
output
33333333333 20 deposit
44444444444 15 deposit
55555555555 20 reversal
55555555555 10 deposit
66666666666 30 reversal
Yes. As long as your SyncSORT is up-to-date enough.
You need to use JOINKEYS. Specify the same DSN for both input datasets, and indicate that they are SORTED. There is an undocumented feature which allows the use of JNFnCNTL files, like DFSORT.
In JNF1CNTL (which is a "preprocessor" for the first JOINKEYS dataset) temporarily add a sequence number to each record. The default is that the sequence starts at one. Here it is useful to be explicit...
Because, in JNF2CNTL you want to do the same thing, but start the sequence at zero (START=0).
The key for each of the JOINKEYS is the sequence number.
Use JOIN UNPAIRED,F1. Define a REFORMAT with all the data from the first file, and data for comparison from the second file.
This is what a four-record dataset would look like if you imagine the join (F1 record and its sequence number on the left, F2 record and its sequence number on the right):
- - A 0
A 1 B 1
B 2 C 2
C 3 D 3
D 4 - -
Because you specify JOIN UNPAIRED,F1 you won't actually see the mismatched A 0 (because that is on F2) but you will see the mismatched D 4.
If you look at your REFORMAT record, you now have data from the "current" record, and data from the "next" record.
Then there's a little more work to select only the records you want. But, dinner first...
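The "current record plus next record" idea can be illustrated in Python. This is not SyncSORT syntax and not the remaining selection step the answer leaves open; it is only a sketch of the rule stated in the question (drop a reversal together with the deposit that follows it when account and amount match), with a hypothetical Record type and function name, and it assumes the input is already in the sort order described.

```python
from typing import NamedTuple


class Record(NamedTuple):
    account: str
    amount: int
    kind: str  # "reversal" or "deposit"


def drop_matched_pairs(records):
    """Drop each reversal that is immediately followed by a deposit
    with the same account and amount, together with that deposit."""
    out = []
    i = 0
    while i < len(records):
        cur = records[i]
        nxt = records[i + 1] if i + 1 < len(records) else None  # the "next" record
        if (
            cur.kind == "reversal"
            and nxt is not None
            and nxt.kind == "deposit"
            and nxt.account == cur.account
            and nxt.amount == cur.amount
        ):
            i += 2  # skip both the reversal and its matching deposit
            continue
        out.append(cur)
        i += 1
    return out


# Input from the question, already in sort order.
records = [
    Record("11111111111", 10, "reversal"),
    Record("11111111111", 10, "deposit"),
    Record("33333333333", 20, "deposit"),
    Record("44444444444", 15, "deposit"),
    Record("55555555555", 20, "reversal"),
    Record("55555555555", 10, "deposit"),
    Record("66666666666", 30, "reversal"),
]

for rec in drop_matched_pairs(records):
    print(rec.account, rec.amount, rec.kind)
```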
