Count number of rows in a section until the blank line - google-docs

I've got a spreadsheet in Google Docs that looks like this:
2010 <X>
Row1
Row2
Row3
Row4
1999 <Y>
Row1
Row2
Row3
Row4
Row5
Row6
Here X is 4, as there are 4 rows in the 2010 section followed by a blank line, and Y is 6, as there are 6 rows until the end of the document.
I did try
=filter(row(A:A), A:A="",row(A:A)<=max(filter(row(A:A),A:A<>"")))
and
=filter(row($A:$A),(len($A:$A)=0)*(row($A:$A)>row()))-row()-1
They both return
error: Mismatched range lengths
What can be done in this case? Using named ranges is really time-consuming, as there will be up to 100 sections.
Thanks

It would be best to store your data like this instead:
A B C
--------------------------
1| 2010 Row1 <X>
2| 2010 Row2
3| 2010 Row3
4| 2010 Row4
5|
6| 1999 Row1 <Y>
7| 1999 Row2
8| 1999 Row3
9| 1999 Row4
10| 1999 Row5
11| 1999 Row6
because then X and Y can be calculated simply as =countif(A:A,A1) and =countif(A:A,A6).
If you have to persist with your current setup:
A B
--------------------------
1| 2010 <X>
2| Row1
3| Row2
4| Row3
5| Row4
6|
7| 1999 <Y>
8| Row1
9| Row2
10| Row3
11| Row4
12| Row5
13| Row6
then you can use:
=ArrayFormula(match(1,--(ISBLANK(A1:A13)),0)-2)
(as long as there is always an empty row between the sets of "Years").
The ISBLANK(A1:A13) returns an array result {FALSE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE,...}
The -- converts that to {0,0,0,0,0,1,0,...}
The match() returns the POSITION or ROW of the first 1 in that list: 6
We then subtract 2: one for the empty row and one for the header row.
We have to run the whole thing as an ArrayFormula because we need ISBLANK() to work on each cell in turn.
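For illustration only, here is the same counting logic sketched in Python (this is not part of the spreadsheet formula; the cell values are assumed to already be in a plain list):

# Illustrative sketch: count data rows after the header until the first blank
# cell (or until the list ends, as in the last section of the sheet).
def count_section_rows(column):
    count = 0
    for value in column[1:]:            # skip the header/year row
        if value in ("", None):         # a blank cell ends the section
            break
        count += 1
    return count

print(count_section_rows(["2010", "Row1", "Row2", "Row3", "Row4", ""]))  # 4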

Related

How to subtract the row-wise mean from each column value and get the row-wise max after subtracting the mean (PySpark)

I want to calculate the row-wise mean, subtract the mean from each value of the row, and get the maximum at the end.
Here is my dataframe:
col1 | col2 | col3
0 | 2 | 3
4 | 2 | 3
1 | 0 | 3
0 | 0 | 0
df = df.withColumn("mean_value", (sum(col(x) for x in df.columns[0:3]) / 3).alias("mean"))
I can calculate the row-wise mean with this line of code, but I want to subtract the mean value from each value of the row and get the maximum value of the row after the subtraction.
Required results:
col1 | col2 | col3 | mean_Value | Max_difference_Value
0    | 2    | 3    | 1.66       | 1.34
4    | 2    | 3    | 3.0        | 1.0
1    | 0    | 3    | 1.33       | 1.67
1    | 0    | 1    | 0.66       | 0.66
Note: the main formula is abs(mean - column value).max()
Using greatest and a list comprehension:
from pyspark.sql import functions as func

# sample data matching the question's dataframe
data_ls = [(0, 2, 3), (4, 2, 3), (1, 0, 3), (0, 0, 0)]

spark.sparkContext.parallelize(data_ls).toDF(['col1', 'col2', 'col3']). \
    withColumn('mean_value', (sum(func.col(x) for x in ['col1', 'col2', 'col3']) / 3)). \
    withColumn('max_diff_val',
               func.greatest(*[func.abs(func.col(x) - func.col('mean_value')) for x in ['col1', 'col2', 'col3']])
               ). \
    show()
# +----+----+----+------------------+------------------+
# |col1|col2|col3| mean_value| max_diff_val|
# +----+----+----+------------------+------------------+
# | 0| 2| 3|1.6666666666666667|1.6666666666666667|
# | 4| 2| 3| 3.0| 1.0|
# | 1| 0| 3|1.3333333333333333|1.6666666666666667|
# | 0| 0| 0| 0.0| 0.0|
# +----+----+----+------------------+------------------+
Have you tried UDFs?
from pyspark.sql.types import FloatType
from pyspark.sql.functions import udf
import numpy as np

@udf(returnType=FloatType())
def udf_mean(col1, col2, col3):
    # np.mean returns a NumPy scalar, so cast to a plain Python float for Spark
    return float(np.mean([col1, col2, col3]))

df = df.withColumn("mean_value", udf_mean("col1", "col2", "col3"))
Similarly, you can try a UDF for the max difference value, as sketched below.
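A minimal sketch of such a UDF (the function and column names here are illustrative, not from the original answer):

from pyspark.sql.types import FloatType
from pyspark.sql.functions import udf
import numpy as np

# Hypothetical companion UDF: largest absolute deviation from the row mean
@udf(returnType=FloatType())
def udf_max_diff(col1, col2, col3):
    values = [col1, col2, col3]
    mean = np.mean(values)
    return float(max(abs(v - mean) for v in values))

df = df.withColumn("max_diff_value", udf_max_diff("col1", "col2", "col3"))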

add rows in pyspark dataframe and adjust the column sequence accordingly

We have a dataframe like below, say DF1:
col_name   | col_seq | Hash_enc_ind
abc        | 1       | 0
first_name | 2       | 1
last_name  | 3       | 1
full_name  | 4       | 1
XYZ        | 5       | 0
sal        | 6       | 1
AAA        | 7       | 0
Now I want to add 2 rows for each row where Hash_enc_ind = 1 and adjust col_seq accordingly, so that the output would look like:
DF1:
col_name     | col_seq | Hash_enc_ind
abc          | 1       | 0
first_name_h | 2       | 1
first_name_e | 3       | 1
last_name_h  | 4       | 1
last_name_e  | 5       | 1
full_name_h  | 6       | 1
full_name_e  | 7       | 1
XYZ          | 8       | 0
sal_h        | 9       | 1
sal_e        | 10      | 1
AAA          | 11      | 0
You can explode an array constructed with a case when expression:
import pyspark.sql.functions as F

df2 = df.withColumn(
    'col_name',
    F.expr("explode(transform(case when Hash_enc_ind = 1 then array('_h', '_e') else array('') end, x -> col_name || x))")
)
df2.show()
+------------+-------+------------+
| col_name|col_seq|Hash_enc_ind|
+------------+-------+------------+
| abc| 1| 0|
|first_name_h| 2| 1|
|first_name_e| 2| 1|
| last_name_h| 3| 1|
| last_name_e| 3| 1|
| full_name_h| 4| 1|
| full_name_e| 4| 1|
| XYZ| 5| 0|
| sal_h| 6| 1|
| sal_e| 6| 1|
| AAA| 7| 0|
+------------+-------+------------+
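Note that this keeps the original col_seq value on both exploded rows. If you also need col_seq renumbered as in the expected output, one possible sketch (not part of the original answer; it swaps explode for posexplode to keep the '_h'/'_e' order) is:

from pyspark.sql.window import Window
import pyspark.sql.functions as F

# Hypothetical renumbering: posexplode keeps the position of '_h' vs '_e',
# so col_seq can be reassigned following the original order.
df3 = df.selectExpr(
    "posexplode(transform(case when Hash_enc_ind = 1 then array('_h', '_e') "
    "else array('') end, x -> col_name || x)) as (pos, col_name)",
    "col_seq as old_seq",
    "Hash_enc_ind",
)
df3 = (df3
       .withColumn("col_seq", F.row_number().over(Window.orderBy("old_seq", "pos")))
       .select("col_name", "col_seq", "Hash_enc_ind"))
df3.show()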

number of zero days in a row field

I have a spark dataframe like the input column below. It has a date column "dates" and a int column "qty". I would like to create a new column "daysout" that has the difference in days between the current date value and the first consecutive date where qty=0. I've provided example input and output below. Any tips are greatly appreciated.
input df:
dates qty
2020-04-01 1
2020-04-02 0
2020-04-03 0
2020-04-04 3
2020-04-05 0
2020-04-06 7
output:
dates qty daysout
2020-04-01 1 0
2020-04-02 0 0
2020-04-03 0 1
2020-04-04 3 2
2020-04-05 0 0
2020-04-06 7 1
Here is a possible approach: flag rows where the current qty is 0 and the lagged qty is not 0, take a running sum of that flag (which groups each run of zeros), then assign a row number within each group and subtract 1 to get the desired result:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

w = Window.partitionBy().orderBy(F.col("dates"))
w1 = F.sum(F.when((F.col("qty") == 0) & (F.lag("qty").over(w) != 0), 1).otherwise(0)).over(w)
w2 = Window.partitionBy(w1).orderBy("dates")
df.withColumn("daysout", F.row_number().over(w2) - 1).show()
+----------+---+-------+
| dates|qty|daysout|
+----------+---+-------+
|2020-04-01| 1| 0|
|2020-04-02| 0| 0|
|2020-04-03| 0| 1|
|2020-04-04| 3| 2|
|2020-04-05| 0| 0|
|2020-04-06| 7| 1|
+----------+---+-------+
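If you want to reproduce this locally, one way to build the example dataframe (assuming an existing SparkSession named spark) is:

# sample data from the question; dates parsed so the window ordering is by date
data = [("2020-04-01", 1), ("2020-04-02", 0), ("2020-04-03", 0),
        ("2020-04-04", 3), ("2020-04-05", 0), ("2020-04-06", 7)]
df = spark.createDataFrame(data, ["dates", "qty"]).withColumn("dates", F.to_date("dates"))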

Pyspark, Group by count unique values in a column for a certain value in other column [duplicate]

This question already has answers here:
Reshaping/Pivoting data in Spark RDD and/or Spark DataFrames
(6 answers)
Closed 5 years ago.
That title is horrible, sorry. Here's what I mean:
Here's the starting dataset
C1 C2
AA H
AB M
AC M
AA H
AA L
AC L
Then it would turn into a new dataset with 4 columns:
C1 CH CM CL
AA 2 0 1
AB 0 1 0
AC 0 1 1
You can use the pivot API with groupBy and agg as follows:
from pyspark.sql import functions as F
finaldf = df.groupBy("C1").pivot("C2").agg(F.count("C2").alias("count")).na.fill(0)
and finaldf should look like this:
+---+---+---+---+
| C1| H| L| M|
+---+---+---+---+
| AA| 2| 1| 0|
| AB| 0| 0| 1|
| AC| 0| 1| 1|
+---+---+---+---+
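The pivoted column names are the raw C2 values (H, L, M). If you want the CH/CM/CL headers from the question, a possible extra step (names here are illustrative) is to rename the pivoted columns:

# Hypothetical renaming step to match the CH/CM/CL headers in the question
for c in ["H", "M", "L"]:
    finaldf = finaldf.withColumnRenamed(c, "C" + c)
finaldf.select("C1", "CH", "CM", "CL").show()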

How to sum data within the same month?

I have a set of data which looks like this:
A B C
1| Date | Value | Sum
2| 2014/08/03 10
3| 2014/08/02 7
4| 2014/08/01 5 todo
5| 2014/07/03 6
6| 2014/07/02 2
7| 2014/07/01 6 todo
8| 2014/06/03 1
9| 2014/06/02 4
10|2014/06/01 3 todo
Each cell in the Sum column has this formula (filled down by dragging): IF(MONTH($A2)=MONTH($A3);"";"todo")
This displays text on each month's last line. The "todo" part should be replaced with a formula that sums all values within that month on each such last line. What would be the best way to do that?
In C2 enter:
=IF(MONTH(A2)=MONTH(A3),"",SUM($B$1:B2)-SUM($C$1:C1))
and copy down
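The idea behind SUM($B$1:B2)-SUM($C$1:C1) is a running total: add up everything in column B down to the current row, then subtract whatever subtotals have already been written higher up in column C. For illustration only, the same logic in Python (not part of the spreadsheet answer):

# Illustrative only: running total minus subtotals already reported above
values = [10, 7, 5, 6, 2, 6, 1, 4, 3]                # column B, top to bottom
is_month_end = [False, False, True, False, False, True, False, False, True]

sums = []                                            # column C as it fills in
for i, (value, last_of_month) in enumerate(zip(values, is_month_end)):
    if last_of_month:
        reported = sum(s for s in sums if s is not None)
        sums.append(sum(values[:i + 1]) - reported)
    else:
        sums.append(None)
print(sums)  # [None, None, 22, None, None, 14, None, None, 8]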
