Python - Group by with condition - python-3.x

I am trying to use the groupby function with a condition, and I am not sure how to get this to work.
Here is what my data looks like:
   generated_id      timestamp direction                              date  hour
0             1  1590394859141   forward  2020-05-25 04:20:59.141000-04:00     4
1             2  1599758616945   forward  2020-09-10 13:23:36.945000-04:00    13
2             3  1599759625509  backward  2020-09-10 13:40:25.509000-04:00    13
I need to get the count of "forward" values in the direction column for each hour. Based on the data above, I should get one "forward" value for hour 4 and one "forward" value for hour 13.
I am trying to use this:
daily_sum = daily_df.groupby("hour")['direction'].count().reset_index()
Direction can also be "backward", so I need to count only the "forward" rows.
How do I do this?

daily_sum = daily_df[daily_df['direction'] == 'forward']\
.groupby("hour")['direction'].count().reset_index()
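
As a quick check, here is a minimal, self-contained sketch of the same filter-then-group approach. It rebuilds the three sample rows from the question, keeping only the columns needed here; the column name forward_count is a hypothetical label chosen for illustration.

```python
import pandas as pd

# Sample rows from the question (timestamp and date columns omitted for brevity).
daily_df = pd.DataFrame({
    "generated_id": [1, 2, 3],
    "direction": ["forward", "forward", "backward"],
    "hour": [4, 13, 13],
})

# Keep only the "forward" rows, then count them per hour.
forward_only = daily_df[daily_df["direction"] == "forward"]
daily_sum = (
    forward_only.groupby("hour")["direction"]
    .count()
    .reset_index(name="forward_count")  # hypothetical column name
)
print(daily_sum)
```

Note that an hour with no "forward" rows at all will not appear in the result, because its rows are filtered out before grouping.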

Related

Not to show axis label if value is zero EXCEL

Is there any way to not show an axis label if the value against it is zero?
Suppose the table looks like this:
Vehicles Sold per Brand  jun-21  jul-21  ago-21  sept-21
Opel                          2       4       3        5
Renoult                       6       3       8        1
Ferrari                       0       0       0        0
Mercedes                      1       1       6        4
Seat                          2       0       4        2
Others                       12      11      15       16
If I do not want Ferrari to appear on the chart axis, what should I do?
I know that I can hide that row so it is not shown in the chart. I cannot use that approach, since the data is highly dynamic and I don't want to go and hide it every time.
Could somebody help?
Many thanks in advance.
So, quick and dirty:
But I would instead produce a table of numbers in which any row that should not be included is removed, then build the chart from the remaining 5 rows only, so there is no gap. I will let you work on that.
So, I did that as well, but I will let you figure out how to control the legend:
The trick is to use LARGE(), but you may need to wrap it in IF() to handle 0 better...

Average data points in a range while condition is met in a Pandas DataFrame

I have a very large dataset with over 400,000 rows and growing. I understand that you are not supposed to use iterrows to modify a pandas data frame. However, I'm a little lost on what I should do in this case, since I'm not sure I could use .loc or some rolling filter to modify a data frame in the way I need to. I'm trying to figure out if I can take a data frame and average each range of rows over which a condition is met. For example:
Condition  Temp.  Pressure
        1      8        20
        1      7        23
        1      8        22
        1      9        21
        0      4        33
        0      3        35
        1      9        21
        1     11        20
        1     10        22
For the ranges where the condition is == 1, the output dataframe would look like this:
Condition  Avg. Temp.  Avg. Pressure
        1           8           21.5
        1          10           21
Has anyone attempted something similar that can put me on the right path? I was thinking of using something like this:
df = pd.read_csv(csv_file)
for index, row in df.iterrows():
    if row['condition'] == 1:
        pass  # start index = first value that equals 1
    else:
        # end index & calculate rolling average of range
        window = end - start
        new_df = df.rolling(window).mean()
I know that my code isn't great, and I also know I could brute-force it with something similar to what I have shown above, but as I said, the dataset has a lot of rows and continues to grow, so I need to be efficient.
TRY:
result = df.groupby((df.Condition != df.Condition.shift()).cumsum()).apply(
    lambda x: x.rolling(len(x)).mean().dropna()).reset_index(drop=True)
print(result.loc[result.Condition.eq(1)]) # filter by required condition
OUTPUT:
Condition Temp. Pressure
0 1.0 8.0 21.5
2 1.0 10.0 21.0
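
A rolling window as long as the whole group amounts to a plain per-group mean, so the same result can probably be reached without apply. The following is a minimal sketch under that assumption, using the sample data from the question; the names run_id and run_means are hypothetical and chosen for illustration.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Condition": [1, 1, 1, 1, 0, 0, 1, 1, 1],
    "Temp.": [8, 7, 8, 9, 4, 3, 9, 11, 10],
    "Pressure": [20, 23, 22, 21, 33, 35, 21, 20, 22],
})

# A new run starts whenever Condition differs from the previous row;
# the cumulative sum of those change flags labels each consecutive run.
run_id = (df["Condition"] != df["Condition"].shift()).cumsum().rename("run")

# Average every column within each run, then keep only the Condition == 1 runs.
run_means = df.groupby(run_id).mean()
result = run_means[run_means["Condition"] == 1].reset_index(drop=True)
print(result)
```

Because this relies on the built-in groupby mean rather than a Python lambda per group, it may scale better on a frame with hundreds of thousands of rows, though that is worth measuring on the real data.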

How can I find the highest value between rows every time that they met a certain condition?

I have been struggling with a problem with my data frame, built in pandas, which currently looks like this:
MyDataFrame:
Index Status Value
0 A 10
1 A 8
2 A 5
3 B 9
4 B 5
5 A 1
6 B 2
7 A 3
8 A 5
9 A 1
The desired output would be:
Index Status Value
0 A 10
1 B 9
2 A 1
3 B 2
4 A 5
So far I have tried to use range and while conditions to filter; however, if I write a conditional like:
for i in range(len(Status)):
    if Status[i] == "A":
        print(Value[i])
    if Status[i] == "B":
        break
(The code above is more of an example of what I have been trying in order to reach my goal. I tried to use .iloc and range with while, but maybe in the wrong way.)
The desired output isn't printed.
One thing that complicates this filtering process is that MyDataFrame changes every time I run the script, since it uses another data source to create this DataFrame.
I believe that I'm missing something simple, but it has been almost a week and I can't figure it out.
Thanks in advance for all your answers and support.
Let us try using shift with cumsum to create the groupby key; then it is just groupby + agg:
out = df.groupby(df.Status.ne(df.Status.shift()).cumsum()).agg({'Status':'first','Value':'max'})
Out[14]:
Status Value
Status
1 A 10
2 B 9
3 A 1
4 B 2
5 A 5
Very close to @BEN_YO:
grp = (df['Status'] != df['Status'].shift()).cumsum()
df.loc[df.groupby(grp)['Value'].idxmax()]
Output:
Status Value
Index
0 A 10
3 B 9
5 A 1
6 B 2
8 A 5
Create groups using shift and inequality with cumsum, then group by them, find the index of the maximum 'Value' in each group with idxmax, and filter the dataframe using loc.
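
To try either answer end to end, here is a minimal sketch that rebuilds the sample frame from the question and applies the shift/cumsum grouping. The names run_id and out are hypothetical, and the final reset_index is an assumption made so that the index matches the 0 to 4 numbering in the desired output.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Status": list("AAABBABAAA"),
    "Value": [10, 8, 5, 9, 5, 1, 2, 3, 5, 1],
})

# Label each consecutive run of identical Status values.
run_id = df["Status"].ne(df["Status"].shift()).cumsum().rename("run")

# For each run, keep its Status and its largest Value.
out = (
    df.groupby(run_id)
    .agg(Status=("Status", "first"), Value=("Value", "max"))
    .reset_index(drop=True)
)
print(out)
```

Since the run labels are recomputed from whatever data is loaded, this should keep working when the frame changes between script runs.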

Average ifs with or in excel

So, I have this problem: I would like to find the average of a column, using OR logic to check criteria in adjacent columns. I tried putting OR into the AVERAGEIF function, which failed, and I also tried "AVERAGE(IF(OR(", which again did not return the correct result. I thought it was a simple thing that could be done easily, but I don't know why it doesn't work. My table is something like this:
ID: Rate Check 1 Check 2 Check 3
1 5 1 1 1
2 3 1 1
3 2 1
4 4
5 5 1 1
6 3
7 4 1
I would like to find the average of the Rate column for rows that have any value in the Check 1, Check 2, or Check 3 columns, so in the above case I would get the average of all rows except those with ID 4 and 6. Is this possible without using a helper column?
You can use SUMPRODUCT()
=SUMPRODUCT(((C2:C8<>"")+(D2:D8<>"")+(E2:E8<>"")>0)*(B2:B8<>"")*B2:B8)/SUMPRODUCT(--((C2:C8<>"")+(D2:D8<>"")+(E2:E8<>"")>0)*(B2:B8<>""))
If your first ID starts in A2, use this formula (edited to handle empty values in the "Rate" column):
=AVERAGE(IF(MMULT(LEN(C2:E8)*LEN(B2:B8),ROW(INDIRECT("1:"&COLUMNS($C$1:$E$1)))),B2:B8))

Excel Cluster Bar(or Column) Chart

My data looks something like this:
Name Event Result
Bob 1 0
Mary 1 1
Sue 2 0
Tom 1 0
Dick 2 1
Harry 1 1
Mary 2 0
Sue 2 1
Dick 1 1
etc...
Names repeat, Event is the Event type, and Result is whether the event was successful or not (0, 1). What I want to end up with is a cluster bar chart with four bars to each name:
Event 1 # of Success
Event 1 # of Fail
Event 2 # of Success
Event 2 # of Fail
I figure I'll probably want to put this in a clustered stacked bar in the future, but if I can get the simple cluster going I can figure it out. A link to a good tutorial on event based charts would be appreciated. I'll keep searching and post back what I find. Thanks in advance!
Not sure if this will fit your needs completely, but it might be the quickest way to visualize the data:
Select your posted data, go to the Insert tab, select PivotChart (it hides behind PivotTable), and insert it as a new sheet.
Then put the Event and Result columns in the row field, put the Event column in the value field as well, and set it to use count instead of sum. Then you get the result beneath.
