Grouping by substring of a column value in Pandas - python-3.x

While grouping a pandas dataframe, I've found noise in the data that keeps it from grouping effectively, and my grouping now looks like this -
challenge count mean
['acsc1', '[object Object]'] 1 0.000000
['acsc1', 'undefined'] 1 0.000000
['acsc1', 'wind-for'] 99 379.284146
['acsc1'] 47 19.340045
['acsc10', 'wind-for'] 73 370.148354
['acsc10'] 22 143.580856
How can I group the rows starting with acsc1 into one row (summing the other column values), acsc10 into one row, and so on? The desired result should look something like -
challenge category count mean
acsc1 wind-for 148 398.62
acsc10 wind-for 95 513.72
But I know the category column might be a stretch with the noise in this column.

This should get you the result you requested initially (without the category column)
df.groupby(df.challenge.apply(lambda x: x.split(",")[0].strip("[']"))).sum().reset_index()
Output
challenge count mean
0 acsc1 148 398.624191
1 acsc10 95 513.729210
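For reference, here is a self-contained version of the same idea using the data from the question, assuming the challenge column holds the bracketed strings shown above (selecting the numeric columns explicitly keeps sum() away from the string column on newer pandas):
import pandas as pd

# The question's rows, with challenge stored as strings such as "['acsc1', 'wind-for']"
df = pd.DataFrame({
    "challenge": ["['acsc1', '[object Object]']", "['acsc1', 'undefined']",
                  "['acsc1', 'wind-for']", "['acsc1']",
                  "['acsc10', 'wind-for']", "['acsc10']"],
    "count": [1, 1, 99, 47, 73, 22],
    "mean": [0.0, 0.0, 379.284146, 19.340045, 370.148354, 143.580856],
})

# Take the text before the first comma, strip the list punctuation,
# then group on that key and sum the numeric columns
key = df.challenge.apply(lambda x: x.split(",")[0].strip("[']"))
print(df.groupby(key)[["count", "mean"]].sum().reset_index())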

We can do
s = pd.DataFrame(df['challenge'].tolist(), index=df.index, columns=['challenge', 'cate'])
d = {'cate': 'last', 'count': 'count', 'mean': 'sum'}
df = pd.concat([df.drop('challenge', axis=1), s], axis=1).\
     groupby('challenge').agg(d).reset_index()
Update: fix for the string-type list
import ast
df.challenge=df.challenge.apply(ast.literal_eval)
df.groupby(df.challenge.str[0]).sum()
count mean
challenge
acsc1 148 398.624191
acsc10 95 513.729210
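Putting the two answers together, here is a sketch that also recovers the category column from the desired output; it assumes the challenge column holds string-type lists as in the update above, sums count (matching the 148 and 95 in the question), and keeps the last non-null category per group:
import ast
import pandas as pd

# The original frame again, with challenge stored as string-type lists
df = pd.DataFrame({
    "challenge": ["['acsc1', '[object Object]']", "['acsc1', 'undefined']",
                  "['acsc1', 'wind-for']", "['acsc1']",
                  "['acsc10', 'wind-for']", "['acsc10']"],
    "count": [1, 1, 99, 47, 73, 22],
    "mean": [0.0, 0.0, 379.284146, 19.340045, 370.148354, 143.580856],
})

# Parse the strings into real lists, then split them into challenge/category
parsed = df["challenge"].apply(ast.literal_eval)
split = pd.DataFrame({"challenge": parsed.str[0], "category": parsed.str[1]})

# 'last' keeps the final non-null category per group; count and mean are summed
out = (pd.concat([df[["count", "mean"]], split], axis=1)
         .groupby("challenge")
         .agg({"category": "last", "count": "sum", "mean": "sum"})
         .reset_index())
print(out)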

Related

How do I find the largest numbers in an area, and then find the value of a number in the same row?

I cannot find a way to do this; is it possible (with Excel functions or VBS)?
For example, these would be the initial values:
Number  Value
101     234
102     324
103     345
104     325
105     437
106     443
107     806
108     476
109     538
110     546
And after taking the three highest numbers, this would be the output:
Number  Value
107     806
110     546
109     538
The data is constantly updating, so that might cause some issues.
You can use FILTER in combination with the LARGE function to achieve this:
Columns A and B represent sample data. Cell D2 can contain this formula:
=FILTER($A$2:$B$9,$B$2:$B$9>=LARGE($B$2:$B$9,3))
If the data is constantly updating, it is better to use an Excel Table (I named it TB_NumVal), so the range references get updated automatically.
In cell J2:
=SORT(FILTER(SORT(TB_NumVal,2),(ROW(TB_NumVal[Number])-1)>ROWS(TB_NumVal[Number])-3),2,-1)
Here is the output:
Explanation
We sort the data, then, since we start on row 2 (row 1 is the header), we subtract 1. So
ROW(TB_NumVal[Number])-1
will provide the row number starting from one.
ROWS(TB_NumVal[Number])
is the total number of rows, in our case 10.
Using a filter condition like this:
(ROW(TB_NumVal[Number])-1) > ROWS(TB_NumVal[Number])-3
ensures that only the last three rows of the sorted data are selected; the filtered result is then sorted by value in descending order to match the screenshot in the question.
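As a side-by-side check of what these formulas compute, here is the same "three highest values" logic sketched in Python/pandas (purely illustrative and not part of the Excel answers; the column names follow the sample data):
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Number": [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    "Value":  [234, 324, 345, 325, 437, 443, 806, 476, 538, 546],
})

# Keep the three rows with the largest Value, already sorted descending
print(df.nlargest(3, "Value"))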

How to get multiple aggregation in a dataframe? cumsum and count columns

I need one column aggregated with the count() function and another with the cumsum() function in a dataframe.
I would like to group only once, and the cumsum should be grouped by Site just like the count. How can I do this?
#I get the count by grouping site and arrived
df_arrived_gby = df.groupby(['Site','Arrived']).size().reset_index(name='Count_X')
#I do the cumsum but it should be groupby Site and Arrived same as above
#How can I do this?
df_arrived_gby['Cumsum_X'] = df_arrived_gby['Count_X'].cumsum()
print(df_arrived_gby)
Data example (the cumsum is not grouped by Site, so it keeps adding across sites):
Site Arrived Count Cumsum
198 T 30/06/2020 146 22368
199 T 31/05/2020 76 22444
200 V 05/01/2020 77 22521
201 V 05/02/2020 57 22578
First you need to get the values from the Count_X column, then you can cumsum():
df_arrived_gby['Cumsum_X'] = df_arrived_gby.Count_X.values.cumsum()
Let me know if that helps
I was able to do it using groupby on a new dataframe column as shown below:
df_arrived_gby['Cumsum'] = df_arrived_gby.groupby(['Site'])['Count_X'].apply(lambda x: x.cumsum())
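A minimal end-to-end sketch of that grouped cumulative sum (the sample rows below are made up just to keep it short):
import pandas as pd

# Made-up example rows: one record per arrival event
df = pd.DataFrame({
    "Site":    ["T", "T", "T", "V", "V"],
    "Arrived": ["31/05/2020", "30/06/2020", "30/06/2020", "05/01/2020", "05/02/2020"],
})

# Count per (Site, Arrived), then a cumulative sum that restarts for each Site
df_arrived_gby = df.groupby(["Site", "Arrived"]).size().reset_index(name="Count_X")
df_arrived_gby["Cumsum_X"] = df_arrived_gby.groupby("Site")["Count_X"].cumsum()
print(df_arrived_gby)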

Reverse MATCH with a non existing value

I have data in Excel in the following format:
Column A Column B
20/03/2018 300
21/03/2018 200
22/03/2018 100
23/03/2018 90
24/03/2018 300
25/03/2018 200
26/03/2018 100
27/03/2018 50
28/03/2018 90
29/03/2018 100
30/03/2018 110
31/03/2018 120
I would like to get the first date after which B never drops below 99 again, chronologically. In the example above, that would be the 29th of March.
If I try to get it with: =INDEX(A:A,MATCH(99,B1:B12,-1)) the value returned is 22/03/2018 as it is the first occurrence found, searched from top to bottom.
In this case it would be perfect to do a reverse match (i.e. a match that searches from the bottom of the range to the top), but this option is not available. I have seen that it is possible to do reverse matches with the LOOKUP function, but then I need to provide a value that actually exists in my data set (99 would not work).
The workaround I have found is to add a third column like the following (the running minimum of the remaining values of B, going down) and INDEX/MATCH on top of it.
Column A Column B Column C
20/03/2018 300 50
21/03/2018 200 50
22/03/2018 100 50
23/03/2018 90 50
24/03/2018 300 50
25/03/2018 200 50
26/03/2018 100 50
27/03/2018 50 50
28/03/2018 90 90
29/03/2018 100 100
30/03/2018 110 110
31/03/2018 120 120
Is there a way of achieving this without a third column?
The AGGREGATE function is great for problems like these:
=AGGREGATE(14,4,(B2:B13<99)*A2:A13,1)+1
What are those numeric arguments?
14 tells the function to replicate the LARGE function
4 tells it to ignore nothing (the function can ignore error values and other things)
More info here. I checked it works below:
If your dates aren't always consecutive, you'll need to add a bit more to the function:
=INDEX(A1:A12,MATCH(AGGREGATE(14,6,(B1:B12<99)*A1:A12,1),A1:A12,0)+1)
=INDEX(A1:A12,LARGE(IF(B1:B12<=99,ROW(B1:B12)+1),1))
This is an array formula (Ctrl+Shift+Enter while still in the formula bar)
It builds an array of the row numbers one below the results that are less than or equal to 99. LARGE then returns the biggest of those row numbers for INDEX.
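For comparison only, the same "first date after the last sub-99 value" logic sketched in Python/pandas, mirroring the helper column C from the question (not part of the Excel answers; the threshold 99 and column names follow the question):
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "A": pd.to_datetime([f"{d:02d}/03/2018" for d in range(20, 32)], dayfirst=True),
    "B": [300, 200, 100, 90, 300, 200, 100, 50, 90, 100, 110, 120],
})

# Running minimum of the remaining values of B (the question's column C),
# then the first date where that minimum is at least 99
remaining_min = df["B"][::-1].cummin()[::-1]
print(df.loc[remaining_min >= 99, "A"].iloc[0])  # 2018-03-29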

Slicing specific rows of a column in pandas Dataframe

In the following data frame in pandas, I want to extract the rows whose dates fall between '03/01' and '06/01'. I don't want to use the index at all, as my input would be a start and an end date. How could I do so?
A B
0 01/01 56
1 02/01 54
2 03/01 66
3 04/01 77
4 05/01 66
5 06/01 72
6 07/01 132
7 08/01 127
First create a list of the dates you need using date_range. I'm adding the year 2000 since you need to supply a year for this to work, and then cutting it off to get the desired strings. In real life you might want to pay attention to the actual year because of things like leap days.
date_start = '03/01'
date_end = '06/01'
dates = [x.strftime('%m/%d') for x in pd.date_range('2000/{}'.format(date_start),
                                                    '2000/{}'.format(date_end), freq='D')]
dates is now equal to:
['03/01',
'03/02',
'03/03',
'03/04',
.....
'05/29',
'05/30',
'05/31',
'06/01']
Then simply use isin and you are done:
df = df.loc[df.A.isin(dates)]
df
If your column is a datetime column, I guess you can skip the strftime part in the list comprehension to get the right result.
You are welcome to use boolean masking, i.e.:
df[(df.A >= start_date) & (df.A <= end_date)]
Inside the bracket is a boolean array of True and False. Only rows that fulfill your given condition (evaluates to True) will be returned. This is a great tool to have and it works well with pandas and numpy.
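A small self-contained check of the masking approach on the question's data (the string comparison only works here because all the dates share the 'MM/DD' format):
import pandas as pd

# Sample frame from the question; A holds 'MM/DD' strings
df = pd.DataFrame({
    "A": ["01/01", "02/01", "03/01", "04/01", "05/01", "06/01", "07/01", "08/01"],
    "B": [56, 54, 66, 77, 66, 72, 132, 127],
})

start_date, end_date = "03/01", "06/01"

# Lexicographic comparison of 'MM/DD' strings matches chronological order here
print(df[(df.A >= start_date) & (df.A <= end_date)])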

identifying decrease in values in spark (outliers)

I have a large data set with millions of records which is something like
Movie Likes Comments Shares Views
A 100 10 20 30
A 102 11 22 35
A 104 12 25 45
A *103* 13 *24* 50
B 200 10 20 30
B 205 *9* 21 35
B *203* 12 29 42
B 210 13 *23* *39*
Likes, comments, etc. are rolling totals and they are supposed to increase. If any of them drops for a movie, then it is bad data and needs to be identified.
My initial thought was to group by movie and then sort within the group. I am using dataframes in Spark 1.6 for processing, and this does not seem achievable, as there is no sorting within the grouped data in a dataframe.
Building something for outlier detection could be another approach, but because of time constraints I have not explored it yet.
Is there any way I can achieve this?
Thanks!
You can use the lag window function to bring the previous values into scope:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

val windowSpec = Window.partitionBy('Movie).orderBy('maybesometemporalfield)

dataset.withColumn("lag_likes", lag('Likes, 1) over windowSpec)
       .withColumn("lag_comments", lag('Comments, 1) over windowSpec)
       .show
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-functions.html#lag
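If you are working from Python instead, a rough PySpark sketch of the same lag-and-compare idea could look like the following; it assumes a DataFrame df with the columns from the question plus some ordering column (called "Day" here, which is hypothetical since the question does not show one):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Assumes df already exists with Movie, Likes, ... plus a hypothetical "Day"
# column that defines the row order within each movie.
w = Window.partitionBy("Movie").orderBy("Day")

flagged = (df
    .withColumn("prev_likes", F.lag("Likes", 1).over(w))
    .withColumn("bad_likes",
                F.col("prev_likes").isNotNull() & (F.col("Likes") < F.col("prev_likes"))))

# Rows where the rolling total went down, i.e. the bad data
flagged.filter(F.col("bad_likes")).show()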
Another approach would be to assign a row number (if there isn't one already), lag that column, then join each row to its previous row, to allow you to do the comparison.
HTH
