spark dataframe sum of column based on condition - apache-spark

I want to calculate each row's portion of the total value, using only two partitions (where type == red and where type != red).
ID | type | value
-----------------------------
1 | red | 10
2 | blue | 20
3 | yellow | 30
result should be :
ID | type | value | portion
-----------------------------
1 | red | 10 | 1
2 | blue | 20 | 0.4
3 | yellow | 30 | 0.6
The normal window function in Spark only supports partitionBy on a whole column, but I need "blue" and "yellow" to be treated together as the "non-red" type.
Any idea?

First add a column is_red to easier differentiate between the two groups. Then you can groupBy this new column and get the sums for each of the two groups respectively.
To get the fraction (portion), simply divide each row's value by the correct sum, taking into account if the type is red or not. This part can be done using when and otherwise in Spark.
Below is the Scala code to do this. There is a sortBy since when using groupBy the order of results is not guaranteed. With the sort, sum1 below will contain the total sum for all non-red types while sum2 is the sum for red types.
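import org.apache.spark.sql.functions.{lit, sum, when}
import spark.implicits._  // assumes a SparkSession named spark

// Keep the flagged DataFrame so is_red is still available for the final division.
val flagged = df.withColumn("is_red", $"type" === lit("red"))
val sum1 :: sum2 :: _ = flagged
  .groupBy($"is_red")
  .agg(sum($"value"))
  .collect()
  .map(row => (row.getAs[Boolean](0), row.getAs[Long](1)))
  .toList
  .sortBy(_._1)  // false (non-red) sorts before true (red)
  .map(_._2)
val df2 = flagged.withColumn("portion",
  when($"is_red", $"value" / lit(sum2)).otherwise($"value" / lit(sum1)))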
The extra is_red column can be removed using drop.

Inspired by Shaido, I used an extra column is_red and the spark window function. But I'm not sure which one is better in performance.
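import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum, when}

// Label every row as "red" or "not red", then sum within each of the two partitions.
df.withColumn("is_red", when(col("type").equalTo("red"), "red").otherwise("not red"))
  .withColumn("portion", col("value") / sum("value").over(Window.partitionBy(col("is_red"))))
  .drop("is_red")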
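Both answers come down to the same arithmetic: one total per partition, then each value divided by its own partition's total. A minimal plain-Python sketch of that arithmetic, using the sample rows from the question (no Spark involved; the names rows, totals and portions are only illustrative):

```python
# Plain-Python check of the two-partition arithmetic (this is not Spark code).
rows = [(1, "red", 10), (2, "blue", 20), (3, "yellow", 30)]

# One running total per partition: True for red, False for everything else.
totals = {True: 0, False: 0}
for _id, colour, value in rows:
    totals[colour == "red"] += value

# Divide each value by the total of the partition it belongs to.
portions = [
    (_id, colour, value, value / totals[colour == "red"])
    for _id, colour, value in rows
]

for row in portions:
    print(row)
```

This reproduces the portions 1, 0.4 and 0.6 from the expected result, which is a quick way to sanity-check either Spark version on a small sample.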

Related

To find the minimum and maximum corresponding value in between a range of clubbed values

I am having a small issue with writing a formula. I have my data in this format :
NAME | EXP | SALARY
A | 0.3 | 40000
B | 4.7 | 490000
C | 2.6 | 220000
D | 3.9 | 34000
E | 1.3 | 150000
F | 3.2 | 300000
G | 0.8 | 90000
H | 1.9 | 170000
I | 2.1 | 260000
J | 4.1 | 390000
This is what I want in my output:
EXP-RANGE | MIN SALARY | MAX SALARY
0-1
1-2
2-3
3-4
4-5
I want to find the minimum and maximum salary of the people in each experience range.
I tried using MIN(IF(<&>)) but it returns #VALUE!.
I could also push all this data into a database and query it, but I would greatly appreciate a formula so that I can work in Excel itself. The data size is 20000+ rows, so I would prefer not to use filters.
Thanks in advance.
Use MAXIFS and MINIFS:
=MINIFS(C:C,B:B,">="&LEFT(G2,FIND("-",G2)-1),B:B,"<"&MID(G2,FIND("-",G2)+1,LEN(G2)))
=MAXIFS(C:C,B:B,">="&LEFT(G2,FIND("-",G2)-1),B:B,"<"&MID(G2,FIND("-",G2)+1,LEN(G2)))
If one does not have MAXIFS or MINIFS we can use AGGREGATE:
=AGGREGATE(15,7,$C$2:$C$11/(($B$2:$B$11>=--LEFT(G2,FIND("-",G2)-1))*($B$2:$B$11<--MID(G2,FIND("-",G2)+1,LEN(G2)))),1)
=AGGREGATE(14,7,$C$2:$C$11/(($B$2:$B$11>=--LEFT(G2,FIND("-",G2)-1))*($B$2:$B$11<--MID(G2,FIND("-",G2)+1,LEN(G2)))),1)
AGGREGATE is an array type formula and the references should be limited to the data set.
Assume the lookup table is in A1:C11 and the criteria, with a header, are in G1:G6.
In H2 (minimum salary), enter this formula, copy it right to I2 (maximum salary), then copy both down:
=AGGREGATE(14+(COLUMN(A1)=1),6,$C$2:$C$11/($B$2:$B$11>=IMREAL($G2&"i"))/($B$2:$B$11<-IMAGINARY($G2&"i")),1)
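Edit: appending "i" converts the criterion "0-1" into the complex number "0-1i"; IMREAL() then extracts the left-hand number "0", and IMAGINARY() extracts the right-hand number as -1, which the leading minus sign turns into "1".
14+(COLUMN(A1)=1) evaluates to 15 in the first formula column and to 14 once copied one column to the right, so it is equivalent to entering =AGGREGATE(15,6,….) in the minimum column and =AGGREGATE(14,6,….) in the maximum column.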
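All of the formulas above apply the same rule: split the range label at the hyphen, keep the rows where lower <= EXP < upper, and take the minimum and maximum salary of what is left. A minimal Python sketch of that rule on the sample data from the question (the function name salary_bounds and the variable names are only illustrative):

```python
# Sample rows from the question: (name, experience in years, salary).
people = [
    ("A", 0.3, 40000), ("B", 4.7, 490000), ("C", 2.6, 220000),
    ("D", 3.9, 34000), ("E", 1.3, 150000), ("F", 3.2, 300000),
    ("G", 0.8, 90000), ("H", 1.9, 170000), ("I", 2.1, 260000),
    ("J", 4.1, 390000),
]
ranges = ["0-1", "1-2", "2-3", "3-4", "4-5"]


def salary_bounds(label, rows):
    """Return (min, max) salary for lower <= EXP < upper, or None if no row matches."""
    lower, upper = (float(part) for part in label.split("-"))
    salaries = [salary for _name, exp, salary in rows if lower <= exp < upper]
    return (min(salaries), max(salaries)) if salaries else None


result = {label: salary_bounds(label, people) for label in ranges}
for label, bounds in result.items():
    print(label, bounds)
```

Like the formulas, the sketch includes the lower bound and excludes the upper bound, so a person with exactly 1.0 years of experience falls into "1-2" rather than "0-1".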

Extract a substring new column based on a substring based on conditions ideally with Pandas

I got a data set (Excel) with hundreds of entries. In one string column there is most of the information. The information is divided by '_' and typed in by humans. Therefore, it is not possible to work with index positions.
To create a usable data basis it's mandatory to extract information from this column in another column.
The search pattern = '*v*' is alone not enough. But combined with the condition that the first item has to be a digit it works.
I tried to get it to work with iterrows, iteritems, str.strip, str.extract and many more, but the best result I got was with a for-loop.
pattern = '_*v*_'
test = []
for i in df['col']:
    # Split the string in substrings
    i = i.split('_')
    for c in i:
        if c.find('x') == 1:
            if c[0].isdigit():
                # print(c)
                test.append(c)
            else:
                # To be able to fix a few rows manually
                test.append(0)
[4]: test =[22v3, 33v55, 4v2]
#Input
+-----------+-----------+
| col | targetcol |
+-----------+-----------+
| as_22v3 | |
| 33v55_bdd | |
| Ave_4v2 | |
+-----------+-----------+
#Output
+-----------+-----------+--+
| col | targetcol | |
+-----------+-----------+--+
| as_22v3 | 22v3 | |
| 33v55_bdd | 33v55 | |
| Ave_4v2 | 4v2 | |
+-----------+-----------+--+
My code works, but only for the first few rows: it stops after 36 values and I can't figure out why. There is no error message, other than that the list cannot be assigned to a DataFrame column because it does not have the same length.
pandas.Series.str.extract should help:
>>> df['col'].str.extract(r'(\d+v+\d+)')
0
0 22v3
1 33v55
2 4v2
import pandas as pd

df = pd.DataFrame({
    'col': ['as_22v3', '33v55_bdd', 'Ave_4v2']
})
# expand=False returns a Series for a single capture group
df['targetcol'] = df['col'].str.extract(r'(\d+v+\d+)', expand=False)
EDIT
df = pd.DataFrame({
'col': ['as_22v3', '33v55_bdd', 'Ave_4v2', '_22 v3', 'space 2,2v3', '2.v3',
'2.111v999', 'asd.123v77', '1 v7', '123 v 8135']
})
pattern = r'(\d+(\,[0-9]+)?(\s+)?v\d+)'
df['result'] = df['col'].str.extract(pattern)[0]
col result
0 as_22v3 22v3
1 33v55_bdd 33v55
2 Ave_4v2 4v2
3 _22 v3 22 v3
4 space 2,2v3 2,2v3
5 2.v3 NaN
6 2.111v999 111v999
7 asd.123v77 123v77
8 1 v7 1 v7
9 123 v 8135 NaN
You say it stops after 36 values, and that you are processing an Excel file. One thing you could try is to save the data set as a .csv file and read it in with pd.read_csv. Excel files sometimes contain extra characters that are not easily visible.
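One reason str.extract is easier to assign back than a hand-built list is that it returns exactly one result per row, with a missing value where nothing matches. A minimal standard-library sketch of the same idea, using the pattern from the first answer (the function name extract_codes and the fourth sample string are assumptions added for illustration):

```python
import re

# Same pattern as in the answer above: digits, one or more "v", digits.
PATTERN = re.compile(r"\d+v+\d+")


def extract_codes(values):
    """Return one entry per input string: the first match, or None if there is none."""
    results = []
    for text in values:
        match = PATTERN.search(text)
        results.append(match.group(0) if match else None)
    return results


col = ["as_22v3", "33v55_bdd", "Ave_4v2", "no_match_here"]
targetcol = extract_codes(col)
print(targetcol)
```

Because a placeholder is appended even when nothing matches, the result always has the same length as the input, so it can be assigned to a new column without a size mismatch.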

Excel: Sum cells if they share an identical unknown string

I have 154,901 rows of data that look like this:
Text String | 340
Where "Text String" represents a variable string that has no other pattern or order to it and cannot be predicted in any mathematical way, and 340 represents a random integer. How can I find the sum of all of the values sharing an identical string, and organize this data based on total per unique string?
For example, say I have the dataset
Alpha | 3
Alpha | 6
Beta | 4
Gamma | 1
Gamma | 3
Gamma | 8
Omega | 10
I'm looking for some way to present the data as:
Alpha | 9
Beta | 4
Gamma | 12
Omega | 10
The point of this being that I have a dataset so large that I cannot enumerate this manually, and I have a finite yet unknown amount of strings that I cannot reliably predict what they are.
Consider using a pivot table, and then aggregate the numbers by string. This is probably the least ugly option. – Tim Biegeleisen
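For readers working outside Excel, the aggregation a pivot table performs here is a sum per distinct key. A minimal Python sketch on the example data from the question (the names rows, totals and ranked are only illustrative):

```python
from collections import defaultdict

# Example rows from the question: (text string, integer).
rows = [
    ("Alpha", 3), ("Alpha", 6), ("Beta", 4), ("Gamma", 1),
    ("Gamma", 3), ("Gamma", 8), ("Omega", 10),
]

# Accumulate one total per distinct string; the strings need not be known in advance.
totals = defaultdict(int)
for label, amount in rows:
    totals[label] += amount

# Order the unique strings by their total, largest first.
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
for label, total in ranked:
    print(label, total)
```

The dictionary discovers the set of strings while it reads the rows, which matches the situation described in the question, where the strings are finite but unknown.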

Display Value in cell Y based on greater than, less than of cell X

Here's the scenario. I have a large spreadsheet of candidates for NHS at my school that are given a score by several teachers, community members, etc. I average out their score and then based on that number they are given a score/value from a rubric. I am looking for a formula that will read the value of cell X (their average score) and display a specific value in cell Y(their rubric score). The following is the criteria:
value<2.0, display 0
value>2.0 value<3.0, display 1
value>3.0 value<3.5, display 2
value>3.5 value<3.75, display 3
value>3.75, display 4
I tried looking this up and the closest I found was a formula that I modified to look like this:
=IF(I10="AVERAGE_CHARACTER",IF(I10<2,0,IF(AND(I10>2,I11<3),1,IF(AND(I10>3,I11<3.5),2,IF(AND(I10>3.5,I11<3,75),3,IF(I11>3.75,4,0))))))
All it says is FALSE in the cell. Not sure if I'm using the wrong formula or have a typo in the formula. Thoughts? If there is an alternate or easier method, I'm open for suggestions.
Thanks!
source: http://www.excelforum.com/excel-formulas-and-functions/575953-greater-than-x-but-less-than-y.html
It's easy if you keep the thresholds and the rubric in separate arrays:
=LOOKUP(A1,{0,2,3,3.5,3.75},{0,1,2,3,4})
You might use something like: (value to be changed in A1)
=VLOOKUP(A1,{0,0;2,1;3,2;3.5,3;3.75,4},2)
or having a table like this: (value to be changed in C1)
| A | B |
1 | 0 | 0 |
2 | 2 | 1 |
3 | 3 | 2 |
4 | 3.5 | 3 |
5 | 3.75 | 4 |
=VLOOKUP(C1,A1:B5,2)
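LOOKUP and the approximate-match VLOOKUP both pick the largest threshold that is less than or equal to the value and return the score stored next to it. A minimal Python sketch of that rule, using the thresholds from the answers (the function name rubric_score is only illustrative):

```python
from bisect import bisect_right

# Thresholds and scores taken from the LOOKUP formula above.
THRESHOLDS = [0, 2, 3, 3.5, 3.75]
SCORES = [0, 1, 2, 3, 4]


def rubric_score(average):
    """Return the score of the largest threshold that is <= average."""
    index = bisect_right(THRESHOLDS, average) - 1
    if index < 0:
        raise ValueError("average is below the lowest threshold")
    return SCORES[index]


for average in (1.9, 2.0, 3.2, 3.5, 3.75):
    print(average, rubric_score(average))
```

Note that a value exactly on a boundary, such as 2.0 or 3.5, takes the higher score, a case the criteria in the question leave undefined.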

Excel formula to get ranking position

I have a table of people with points. The more points, the higher your position. If you have the same points you are equal first, second etc.
| A | B | C
1 | name | position | points
2 | person1 | 1 | 10
3 | person2 | 2 | 9
4 | person3 | 2 | 9
5 | person4 | 2 | 9
6 | person5 | 5 | 8
7 | person6 | 6 | 7
Using an Excel formula, how can I automatically determine the position? I'm currently using an IF statement that works fine for 5 or 6 matching positions, but I can't add 30+ if statements because there's a limit to the formula.
=IF(C7=C2,B2,IF(C7=C3,B2+5,IF(C7=C4,B3+4,....
So if the points column is the same as the position above then it's the same position value. If the points are less than above then it drops a position so the previous row position +1. But if the row above that is the same then it's the previous position +2 and so on.
You could also use the RANK function
=RANK(C2,$C$2:$C$7,0)
It would return data like your example:
| A | B | C
1 | name | position | points
2 | person1 | 1 | 10
3 | person2 | 2 | 9
4 | person3 | 2 | 9
5 | person4 | 2 | 9
6 | person5 | 5 | 8
7 | person6 | 6 | 7
The 'Points' column needs to be sorted into descending order.
Type this to B3, and then pull it to the rest of the rows:
=IF(C3=C2,B2,B2+COUNTIF($C$1:$C3,C2))
What it does is:
If my points equals the previous points, I have the same position.
Otherwise, count the players with the same score as the previous one, and add that count to the previous player's position.
You can use the RANK function in Excel without necessarily sorting the data. Type =RANK(C2,$C$2:$C$7) in B2. Excel will find the relative position of the value in C2 and display the answer. Copy the formula down to B7 by dragging the fill handle at the bottom-right corner of the cell.
Try this in your fourth column
=COUNTIF(B:B; ">" & B2) + 1
Replace B2 with B3 for next row and so on.
This counts how many records have more points than the current one and then adds 1 to give the current record's position.
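The COUNTIF approach, like RANK in descending order, gives each person a position of one plus the number of people with strictly more points. A minimal Python sketch of that rule on the points from the question (the function name competition_rank is only illustrative):

```python
# Points from the question, in table order.
points = [10, 9, 9, 9, 8, 7]


def competition_rank(scores):
    """Position = 1 + number of scores strictly greater than this one."""
    return [1 + sum(other > score for other in scores) for score in scores]


positions = competition_rank(points)
print(positions)
```

Because the count looks at all scores, the input does not have to be sorted, and tied scores share a position while the following position is skipped accordingly.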
If your C-column is sorted, you can check whether the current row is equal to your last row. If not, use the current row number as the ranking-position, otherwise use the value from above (value for b3):
=IF(C3=C2, B2, ROW()-1)
You can use the LARGE function to get the n-th highest value in case your C-column is not sorted:
=LARGE(C2:C7,3)
The way I've done this, which is a bit convoluted, is as follows:
Sort rows by the points in descending order
Create an additional column (D), starting at D2, with the numbers 1, 2, 3, ... up to the total number of positions
In the first cell of the positions column (B2, using the layout from the question) use the formula =IF(C2=C1, B1, D2). This checks whether the points in this row are the same as the points in the previous row. If they are, it gives you the position of the previous row; otherwise it uses the value from column D, and thus handles people with equal positions.
Copy this formula down the entire column
Copy the positions column, then Paste Special >> Values to overwrite the formulas with the positions
Re-sort the rows into their original order
That's worked for me! If there's a better way I'd love to know it!
