Is it possible to get a sample of n rows from a query with a where clause?
I tried to use TABLESAMPLE as shown below, but I ended up only getting records from the first partition, '2021-09-14'.
select * from (select * from table where ts in ('2021-09-14', '2021-09-15')) tablesample (100 rows)
You can use monotonically_increasing_id or rand to generate an additional column, which can then be used to order your dataset and produce the sampling field you need.
Both of these functions can be used in conjunction or individually.
Furthermore, you can use the LIMIT clause to sample your required N records.
NOTE - orderBy would be a costly operation.
Data Preparation
input_str = """
1 2/12/2019 114 2
2 3/5/2019 116 1
3 3/3/2019 120 6
4 3/4/2019 321 10
6 6/5/2019 116 1
7 6/3/2019 116 1
8 10/1/2019 120 3
9 10/1/2019 120 3
10 10/1/2020 120 3
11 10/1/2020 120 3
12 10/1/2020 120 3
13 10/1/2022 120 3
14 10/1/2021 120 3
15 10/6/2019 120 3
""".split()
input_values = list(map(lambda x: x.strip() if x.strip() != 'null' else None, input_str))
cols = list(map(lambda x: x.strip() if x.strip() != 'null' else None, "shipment_id ship_date customer_id quantity".split()))
n = len(input_values)
input_list = [tuple(input_values[i:i+4]) for i in range(0,n,4)]
sparkDF = sql.createDataFrame(input_list, cols)
sparkDF = sparkDF.withColumn('ship_date',F.to_date(F.col('ship_date'),'d/M/yyyy'))
sparkDF.show()
+-----------+----------+-----------+--------+
|shipment_id| ship_date|customer_id|quantity|
+-----------+----------+-----------+--------+
| 1|2019-12-02| 114| 2|
| 2|2019-05-03| 116| 1|
| 3|2019-03-03| 120| 6|
| 4|2019-04-03| 321| 10|
| 6|2019-05-06| 116| 1|
| 7|2019-03-06| 116| 1|
| 8|2019-01-10| 120| 3|
| 9|2019-01-10| 120| 3|
| 10|2020-01-10| 120| 3|
| 11|2020-01-10| 120| 3|
| 12|2020-01-10| 120| 3|
| 13|2022-01-10| 120| 3|
| 14|2021-01-10| 120| 3|
| 15|2019-06-10| 120| 3|
+-----------+----------+-----------+--------+
Order By - Monotonically Increasing ID & Rand
sparkDF.createOrReplaceTempView("shipment_table")
sql.sql("""
SELECT
*
FROM (
SELECT
*
,monotonically_increasing_id() as increasing_id
,RAND(10) as random_order
FROM shipment_table
WHERE ship_date BETWEEN '2019-01-01' AND '2019-12-31'
ORDER BY monotonically_increasing_id() DESC ,RAND(10) DESC
LIMIT 5
)
""").show()
+-----------+----------+-----------+--------+-------------+-------------------+
|shipment_id| ship_date|customer_id|quantity|increasing_id| random_order|
+-----------+----------+-----------+--------+-------------+-------------------+
| 15|2019-06-10| 120| 3| 8589934593|0.11682250456449328|
| 9|2019-01-10| 120| 3| 8589934592|0.03422639313807285|
| 8|2019-01-10| 120| 3| 6| 0.8078688178371882|
| 7|2019-03-06| 116| 1| 5|0.36664222617947817|
| 6|2019-05-06| 116| 1| 4| 0.2093704977577|
+-----------+----------+-----------+--------+-------------+-------------------+
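The same sampling can also be expressed with the DataFrame API instead of SQL. A minimal sketch, assuming the sparkDF built in the data preparation above:
from pyspark.sql import functions as F
# filter first, then order by the generated columns and take the top N rows
sampled = (sparkDF
    .filter(F.col('ship_date').between('2019-01-01', '2019-12-31'))
    .withColumn('increasing_id', F.monotonically_increasing_id())
    .withColumn('random_order', F.rand(10))
    .orderBy(F.col('increasing_id').desc(), F.col('random_order').desc())
    .limit(5))
sampled.show()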
If you are using a Dataset there is built-in functionality for this, as outlined in the documentation:
sample(withReplacement: Boolean, fraction: Double): Dataset[T]
Returns a new Dataset by sampling a fraction of rows, using a random seed.
withReplacement: Sample with replacement or not.
fraction: Fraction of rows to generate, range [0.0, 1.0].
Since: 1.6.0
Note: This is NOT guaranteed to provide exactly the fraction of the total count of the given Dataset.
To use this you'd filter your dataset against whatever criteria you're looking for, then sample the result. If you need an exact number of rows rather than a fraction you can follow the call to sample with limit(n) where n is the number of rows to return.
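For example, a minimal PySpark sketch of that filter-then-sample-then-limit approach, assuming a DataFrame df with the ts column from the original question:
from pyspark.sql import functions as F
# keep only the partitions of interest, take an approximate 10% sample,
# and cap the result at an exact row count if required
sampled = (df
    .filter(F.col('ts').isin('2021-09-14', '2021-09-15'))
    .sample(withReplacement=False, fraction=0.1, seed=42)
    .limit(100))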
Related
I have a table structure like this:
unique_id | group | value_1 | value_2 | value_3
abc_xxx   | 1     | 200     | null    | 100
def_xxx   | 1     | 0       | 3       | 40
ghi_xxx   | 2     | 300     | 1       | 2
that I need to extract the following information from:
Total number of rows per group
Count number of rows per group that contain null values.
Count number of rows per group with zero values.
I can do the first one with a simple groupBy and count:
df.groupBy('group').count()
I'm not so sure how to approach the next two, which I need in order to compute the null and zero rates out of the total rows per group.
from pyspark.sql.functions import col, count, isnull, lit, sum, when

data = [
('abc_xxx', 1, 200, None, 100),
('def_xxx', 1, 0, 3, 40 ),
('ghi_xxx', 2, 300, 1, 2 ),
]
df = spark.createDataFrame(data, ['unique_id','group','value_1','value_2','value_3'])
# new edit
df = df\
.withColumn('contains_null', when(isnull(col('value_1')) | isnull(col('value_2')) | isnull(col('value_3')), lit(1)).otherwise(lit(0)))\
.withColumn('contains_zero', when((col('value_1')==0) | (col('value_2')==0) | (col('value_3')==0), lit(1)).otherwise(lit(0)))
df.groupBy('group')\
.agg(count('unique_id').alias('total_rows'), sum('contains_null').alias('null_value_rows'), sum('contains_zero').alias('zero_value_rows')).show()
+-----+----------+---------------+---------------+
|group|total_rows|null_value_rows|zero_value_rows|
+-----+----------+---------------+---------------+
| 1| 2| 1| 1|
| 2| 1| 0| 0|
+-----+----------+---------------+---------------+
# total_count = (count('value_1') + count('value_2') + count('value_3'))
# null_count = (sum(when(isnull(col('value_1')), lit(1)).otherwise(lit(0)) + when(isnull(col('value_2')), lit(1)).otherwise(lit(0)) + when(isnull(col('value_3')), lit(1)).otherwise(lit(0))))
# zero_count = (sum(when(col('value_1')==0, lit(1)).otherwise(lit(0)) + when(col('value_2')==0, lit(1)).otherwise(lit(0)) + when(col('value_3')==0, lit(1)).otherwise(lit(0))))
# df.groupBy('group')\
# .agg(total_count.alias('total_numbers'), null_count.alias('null_values'), zero_count.alias('zero_values')).show()
#+-----+-------------+-----------+-----------+
#|group|total_numbers|null_values|zero_values|
#+-----+-------------+-----------+-----------+
#| 1| 5| 1| 1|
#| 2| 3| 0| 0|
#+-----+-------------+-----------+-----------+
I have the following Spark DataFrame:
id month column_1 column_2
A  1     100      0
A  2     200      1
A  3     800      2
A  4     1500     3
A  5     1200     0
A  6     1600     1
A  7     2500     2
A  8     2800     3
A  9     3000     4
I would like to create a new column, let's call it 'dif_column1' based on a dynamic lag which is given by column_2. The desired output would be:
id month column_1 column_2 dif_column1
A  1     100      0        0
A  2     200      1        100
A  3     800      2        700
A  4     1500     3        1400
A  5     1200     0        0
A  6     1600     1        400
A  7     2500     2        1300
A  8     2800     3        1600
A  9     3000     4        1800
I have tried to use the lag function, but apparently lag only accepts an integer offset, so the following does not work:
w = Window.partitionBy("id")
sdf = sdf.withColumn("dif_column1", F.col("column_1") - F.lag("column_1",F.col("column_2")).over(w))
You can add a row number column, and do a self join based on the row number and the lag defined in column_2:
from pyspark.sql import functions as F, Window
w = Window.partitionBy("id").orderBy("month")
df1 = df.withColumn('rn', F.row_number().over(w))
df2 = df1.alias('t1').join(
df1.alias('t2'),
F.expr('(t1.id = t2.id) and (t1.rn = t2.rn + t1.column_2)'),
'left'
).selectExpr(
't1.*',
't1.column_1 - t2.column_1 as dif_column1'
).drop('rn')
df2.show()
+---+-----+--------+--------+-----------+
| id|month|column_1|column_2|dif_column1|
+---+-----+--------+--------+-----------+
| A| 1| 100| 0| 0|
| A| 2| 200| 1| 100|
| A| 3| 800| 2| 700|
| A| 4| 1500| 3| 1400|
| A| 5| 1200| 0| 0|
| A| 6| 1600| 1| 400|
| A| 7| 2500| 2| 1300|
| A| 8| 2800| 3| 1600|
| A| 9| 3000| 4| 1800|
+---+-----+--------+--------+-----------+
Create a spark dataframe from a pandas dataframe
import pandas as pd
df = pd.DataFrame({"b": ['A','A','A','A','B', 'B','B','C','C','D','D', 'D','D','D','D','D','D','D','D','D'],"Sno": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],"a": [3,-4,2, -1, -3, -1,-7,-6, 1, 1, -1, 1,4,5,-3,2,3,4, -1, -2]})
df2=spark.createDataFrame(df)
Next I use the window partition on the field 'b'
from pyspark.sql import window
win_spec = (window.Window.partitionBy(['b']).orderBy("Sno").rowsBetween(window.Window.unboundedPreceding, 0))
Added a field pos_neg (positive/negative) based on the values, and created a lambda function:
df2 = df2.withColumn("pos_neg",col("a") <0)
pos_neg_func =udf(lambda x: ((x) & (x != x.shift())).cumsum())
I tried creating a new column which is a counter for negative values, but within variable 'b', so the counter restarts when the value in 'b' changes. Also, if there are consecutive negative values, they should be treated as a single run, i.e. the counter changes by 1.
df3 = (df2.select('pos_neg',pos_neg_func('pos_neg').alias('val')))
I want the output to be:
   b  Sno  a  pos_neg  val
0 A 1 3 False 0
1 A 2 -4 True 1
2 A 3 2 False 1
3 A 4 -1 True 2
4 B 5 -3 True 1
5 B 6 -1 True 1
6 B 7 -7 True 1
7 C 8 -6 True 1
8 C 9 1 False 1
9 D 10 1 False 0
10 D 11 -1 True 1
11 D 12 1 False 1
12 D 13 4 False 1
13 D 14 5 False 1
14 D 15 -3 True 2
15 D 16 2 False 2
16 D 17 3 False 2
17 D 18 4 False 2
18 D 19 -1 True 3
19 D 20 -2 True 3
In Python, a simple function like the following works:
df['val'] = df.groupby('b')['pos_neg'].transform(lambda x: ((x) & (x != x.shift())).cumsum())
josh-friedlander provided support in the above code
Pyspark doesn't have a shift function, but you could work with the lag window function which gives you the row before the current row. The first window (called w) sets the value of the val column to 1 if the value of the pos_neg column is True and the value of the previous pos_neg is False and to 0 otherwise.
With the second window (called w2) we calculate the cumulative sum to get your desired output.
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql import Window
df = pd.DataFrame({"b": ['A','A','A','A','B', 'B','B','C','C','D','D', 'D','D','D','D','D','D','D','D','D'],"Sno": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],"a": [3,-4,2, -1, -3, -1,-7,-6, 1, 1, -1, 1,4,5,-3,2,3,4, -1, -2]})
df2=spark.createDataFrame(df)
w = Window.partitionBy('b').orderBy('Sno')
w2 = Window.partitionBy('b').rowsBetween(Window.unboundedPreceding, 0).orderBy('Sno')
df2 = df2.withColumn("pos_neg",col("a") <0)
df2 = df2.withColumn('val', F.when((df2.pos_neg == True) & (F.lag('pos_neg', default=False).over(w) == False), 1).otherwise(0))
df2 = df2.withColumn('val', F.sum('val').over(w2))
df2.show()
Output:
+---+---+---+-------+---+
|Sno| a| b|pos_neg|val|
+---+---+---+-------+---+
| 5| -3| B| true| 1|
| 6| -1| B| true| 1|
| 7| -7| B| true| 1|
| 10| 1| D| false| 0|
| 11| -1| D| true| 1|
| 12| 1| D| false| 1|
| 13| 4| D| false| 1|
| 14| 5| D| false| 1|
| 15| -3| D| true| 2|
| 16| 2| D| false| 2|
| 17| 3| D| false| 2|
| 18| 4| D| false| 2|
| 19| -1| D| true| 3|
| 20| -2| D| true| 3|
| 8| -6| C| true| 1|
| 9| 1| C| false| 1|
| 1| 3| A| false| 0|
| 2| -4| A| true| 1|
| 3| 2| A| false| 1|
| 4| -1| A| true| 2|
+---+---+---+-------+---+
You may wonder why it was necessary to have a column which allows us to order the dataset. Let me try to explain this with an example. The following data was read by pandas and got an index assigned (left column). You want to count the occurrences of True in pos_neg without counting consecutive True's. This logic leads to the val_2 column as shown below:
b Sno a pos_neg val_2
0 A 1 3 False 0
1 A 2 -4 True 1
2 A 3 2 False 1
3 A 4 -1 True 2
4 A 5 -5 True 2
...but it depends on the index it got from pandas (i.e. the order of the rows). When you change the order of the rows (and the corresponding pandas index), you will get a different result when you apply your logic to the same rows, just because the order is different:
b Sno a pos_neg val_2
0 A 1 3 False 0
1 A 3 2 False 0
2 A 2 -4 True 1
3 A 4 -1 True 1
4 A 5 -5 True 1
You see that the order of the rows is important. You might wonder now why pyspark doesn't create an index like pandas does. That is because spark keeps your data in several partitions which are distributed over your cluster, and, depending on your data source, it is even able to read the data in a distributed way. An index therefore can't be added while the data is read. You can add one after the data has been read with the monotonically_increasing_id function, but your data could already have a different order compared to your data source due to the read process.
Your Sno column avoids this problem and guarantees that you will always get the same result for the same data (deterministic).
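For illustration, a minimal sketch of adding such an id after the data has been read (with the caveat above that it reflects the current partition order, not necessarily the order in the source):
from pyspark.sql import functions as F
# ids are increasing and unique, but not consecutive, and they only reflect
# the order of the data as it currently sits in the partitions
df2 = df2.withColumn('row_id', F.monotonically_increasing_id())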
I have a table like below
id week count
A100 201008 2
A100 201009 9
A100 201010 16
A100 201011 23
A100 201012 30
A100 201013 36
A100 201015 43
A100 201017 50
A100 201018 57
A100 201019 63
A100 201023 70
A100 201024 82
A100 201025 88
A100 201026 95
A100 201027 102
Here, we can see that the following weeks are missing:
First, 201014 is missing.
Second, 201016 is missing.
Third, weeks 201020, 201021 and 201022 are missing.
My requirement is that whenever a week is missing, we need to show the count of the previous week.
In this case output should be :
id week count
A100 201008 2
A100 201009 9
A100 201010 16
A100 201011 23
A100 201012 30
A100 201013 36
A100 201014 36
A100 201015 43
A100 201016 43
A100 201017 50
A100 201018 57
A100 201019 63
A100 201020 63
A100 201021 63
A100 201022 63
A100 201023 70
A100 201024 82
A100 201025 88
A100 201026 95
A100 201027 102
How can I achieve this using Hive/PySpark?
Although this answer is in Scala, the Python version will look almost the same and can be easily converted; a rough PySpark sketch is included after the Scala output below.
Step 1:
Find the rows that are followed by missing week(s).
Sample Input:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
//sample input
val input = sc.parallelize(List(("A100",201008,2), ("A100",201009,9),("A100",201014,4), ("A100",201016,45))).toDF("id","week","count")
scala> input.show
+----+------+-----+
| id| week|count|
+----+------+-----+
|A100|201008| 2|
|A100|201009| 9|
|A100|201014| 4| //missing 4 rows
|A100|201016| 45| //missing 1 row
+----+------+-----+
To find them, we can use the lead() function on week and compute the difference between leadWeek and week. If that difference is greater than 1, there are missing rows after the current row.
val diffDF = input
.withColumn("leadWeek", lead($"week", 1).over(Window.partitionBy($"id").orderBy($"week"))) // partitioning by id & computing lead()
.withColumn("diff", ($"leadWeek" - $"week") -1) // finding difference between leadWeek & week
scala> diffDF.show
+----+------+-----+--------+----+
| id| week|count|leadWeek|diff|
+----+------+-----+--------+----+
|A100|201008| 2| 201009| 0| // diff -> 0 represents that no rows needs to be added
|A100|201009| 9| 201014| 4| // diff -> 4 represents 4 rows are to be added after this row.
|A100|201014| 4| 201016| 1| // diff -> 1 represents 1 row to be added after this row.
|A100|201016| 45| null|null|
+----+------+-----+--------+----+
Step 2:
If the diff is >= 1: create and add n rows (InputWithDiff, see the case class below) as specified by diff, incrementing the week value accordingly, and return the newly created rows along with the original row.
If the diff is 0, no additional computation is required; return the original row as it is.
Convert diffDF to a Dataset for ease of computation.
case class InputWithDiff(id: Option[String], week: Option[Int], count: Option[Int], leadWeek: Option[Int], diff: Option[Int])
val diffDS = diffDF.as[InputWithDiff]
val output = diffDS.flatMap(x => {
val diff = x.diff.getOrElse(0)
diff match {
case n if n >= 1 => x :: (1 to diff).map(y => InputWithDiff(x.id, Some(x.week.get + y), x.count,x.leadWeek, x.diff)).toList // create and append new Rows
case _ => List(x) // return as it is
}
}).drop("leadWeek", "diff").toDF // drop unnecessary columns & convert to DF
final output:
scala> output.show
+----+------+-----+
| id| week|count|
+----+------+-----+
|A100|201008| 2|
|A100|201009| 9|
|A100|201010| 9|
|A100|201011| 9|
|A100|201012| 9|
|A100|201013| 9|
|A100|201014| 4|
|A100|201015| 4|
|A100|201016| 45|
+----+------+-----+
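For reference, a rough PySpark translation of the same lead-and-expand idea (a sketch only, separate from the cross-join solution below; it assumes a DataFrame input_df with the same id/week/count columns and uses sequence/explode in place of the Scala flatMap):
from pyspark.sql import functions as F, Window
w = Window.partitionBy('id').orderBy('week')
# gap to the next week, as in the Scala diffDF (0 when nothing is missing)
diff_df = (input_df
    .withColumn('leadWeek', F.lead('week', 1).over(w))
    .withColumn('diff', F.coalesce(F.col('leadWeek') - F.col('week') - F.lit(1), F.lit(0))))
# sequence + explode plays the role of the Scala flatMap: each row is emitted
# once per offset 0..diff, and its week is shifted by that offset
filled = (diff_df
    .withColumn('offset', F.explode(F.sequence(F.lit(0), F.col('diff'))))
    .withColumn('week', F.col('week') + F.col('offset'))
    .select('id', 'week', 'count'))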
PySpark solution
Sample Data
df = spark.createDataFrame([(1,201901,10),
(1,201903,9),
(1,201904,21),
(1,201906,42),
(1,201909,3),
(1,201912,56)
],['id','weeknum','val'])
df.show()
+---+-------+---+
| id|weeknum|val|
+---+-------+---+
| 1| 201901| 10|
| 1| 201903| 9|
| 1| 201904| 21|
| 1| 201906| 42|
| 1| 201909| 3|
| 1| 201912| 56|
+---+-------+---+
1) The basic idea is to create a combination of all id's and weeks (starting from the minimum possible value to the maximum) with a cross join.
from pyspark.sql.functions import min,max,sum,when
from pyspark.sql import Window
min_max_week = df.agg(min(df.weeknum),max(df.weeknum)).collect()
#Generate all weeks using range
all_weeks = spark.range(min_max_week[0][0],min_max_week[0][1]+1)
all_weeks = all_weeks.withColumnRenamed('id','weekno')
#all_weeks.show()
id_all_weeks = df.select(df.id).distinct().crossJoin(all_weeks).withColumnRenamed('id','aid')
#id_all_weeks.show()
2) Thereafter, left joining the original dataframe onto these combinations helps identify the missing values.
res = id_all_weeks.join(df,(df.id == id_all_weeks.aid) & (df.weeknum == id_all_weeks.weekno),'left')
res.show()
+---+------+----+-------+----+
|aid|weekno| id|weeknum| val|
+---+------+----+-------+----+
| 1|201911|null| null|null|
| 1|201905|null| null|null|
| 1|201903| 1| 201903| 9|
| 1|201904| 1| 201904| 21|
| 1|201901| 1| 201901| 10|
| 1|201906| 1| 201906| 42|
| 1|201908|null| null|null|
| 1|201910|null| null|null|
| 1|201912| 1| 201912| 56|
| 1|201907|null| null|null|
| 1|201902|null| null|null|
| 1|201909| 1| 201909| 3|
+---+------+----+-------+----+
3) Then, use a combination of window functions: sum to assign groups, and max to fill in the missing values once the groups are classified.
w1 = Window.partitionBy(res.aid).orderBy(res.weekno)
groups = res.withColumn("grp",sum(when(res.id.isNull(),0).otherwise(1)).over(w1))
w2 = Window.partitionBy(groups.aid,groups.grp)
missing_values_filled = groups.withColumn('filled',max(groups.val).over(w2)) #select required columns as needed
missing_values_filled.show()
+---+------+----+-------+----+---+------+
|aid|weekno| id|weeknum| val|grp|filled|
+---+------+----+-------+----+---+------+
| 1|201901| 1| 201901| 10| 1| 10|
| 1|201902|null| null|null| 1| 10|
| 1|201903| 1| 201903| 9| 2| 9|
| 1|201904| 1| 201904| 21| 3| 21|
| 1|201905|null| null|null| 3| 21|
| 1|201906| 1| 201906| 42| 4| 42|
| 1|201907|null| null|null| 4| 42|
| 1|201908|null| null|null| 4| 42|
| 1|201909| 1| 201909| 3| 5| 3|
| 1|201910|null| null|null| 5| 3|
| 1|201911|null| null|null| 5| 3|
| 1|201912| 1| 201912| 56| 6| 56|
+---+------+----+-------+----+---+------+
Hive Query with the same logic as described above (assuming a table with all weeks can be created)
select id,weeknum,max(val) over(partition by id,grp) as val
from (select i.id
,w.weeknum
,t.val
,sum(case when t.id is null then 0 else 1 end) over(partition by i.id order by w.weeknum) as grp
from (select distinct id from tbl) i
cross join weeks_table w
left join tbl t on t.id = i.id and w.weeknum = t.weeknum
) t
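For the assumption above, one way to materialize weeks_table from Spark is a simple range (a sketch; it works for the sample data because the week numbers are consecutive integers within a single year):
# generate weeknum values 201901..201912 and expose them to SQL as weeks_table
spark.range(201901, 201913) \
     .withColumnRenamed('id', 'weeknum') \
     .createOrReplaceTempView('weeks_table')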
I have a dataframe where I need to compare a few values and deduce a few things out of it.
For instance, my DF:
CITY DAY MONTH TAG RANGE VALUE RANK
A 1 01 A [50, 90] 55 1
A 2 02 B [30, 40] 34 3
A 1 03 A [05, 10] 15 20
A 1 04 B [50, 60] 11 10
A 1 05 B [50, 60] 54 4
I have to check, for every row, whether the value of "VALUE" lies within the "RANGE". Here, arr[0] is the lower limit and arr[1] is the upper limit.
I need to create a new DF such that,
NEW-DF
TAG Positive Negative
A 1 1
B 2 1
If the "value" lies between the given range and the rank < 5 then I would add it to "positive"
If the value doesnt lie in the given range , then it is a negative
If the value lies in the given range, but the rank > 5, then I would count it as negative
"Positive" and "Negative" is nothing but the count of the values fulfilling either conditions.
We can use element_at to get the elements at each position and compare them to the corresponding value in each row, along with the rank condition, and then perform a groupby with sum on the tag:
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType
range_df = df.withColumn('in_range', (F.element_at('range', 1).cast(IntegerType()) < F.col('value')) &
(F.col('value') < F.element_at('range', 2).cast(IntegerType())) &
(F.col('rank') < 5))
range_df.show()
grouped_df = range_df.groupby('tag').agg(F.sum(F.col('in_range').cast(IntegerType())).alias('total_positive'),
F.sum((~F.col('in_range')).cast(IntegerType())).alias('total_negative'))
grouped_df.show()
Output:
+---+--------+-----+----+--------+
|tag| range|value|rank|in_range|
+---+--------+-----+----+--------+
| A|[50, 90]| 55| 1| true|
| B|[30, 40]| 34| 3| true|
| A|[05, 10]| 15| 20| false|
| B|[50, 60]| 11| 10| false|
| B|[50, 60]| 54| 4| true|
+---+--------+-----+----+--------+
+---+--------------+--------------+
|tag|total_positive|total_negative|
+---+--------------+--------------+
| B| 2| 1|
| A| 1| 1|
+---+--------------+--------------+
You have to use a UDF first to process the range:
val df = Seq(("A","1","01","A","[50,90]","55","1")).toDF("city","day","month","tag","range","value","rank")
+----+---+-----+---+-------+-----+----+
|city|day|month|tag| range|value|rank|
+----+---+-----+---+-------+-----+----+
| A| 1| 01| A|[50,90]| 55| 1|
+----+---+-----+---+-------+-----+----+
def checkRange(range : String,rank : String, value : String) : String = {
val rangeProcess = range.dropRight(1).drop(1).split(",")
if (rank.toInt > 5){
"negative"
} else {
if (value.toInt > rangeProcess(0).toInt && value.toInt < rangeProcess(1).toInt){
"positive"
} else {
"negative"
}
}
}
val checkRangeUdf = udf(checkRange _)
df.withColumn("Result",checkRangeUdf(col("range"),col("rank"),col("value"))).show()
+----+---+-----+---+-------+-----+----+--------+
|city|day|month|tag| range|value|rank| Result|
+----+---+-----+---+-------+-----+----+--------+
| A| 1| 01| A|[50,90]| 55| 1|positive|
+----+---+-----+---+-------+-----+----+--------+
val result = df.withColumn("Result",checkRangeUdf(col("range"),col("rank"),col("value"))).groupBy("city","Result").count.show
+----+--------+-----+
|city| Result|count|
+----+--------+-----+
| A|positive| 1|
+----+--------+-----+
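If you prefer PySpark, a minimal sketch of the same UDF idea (assuming range arrives as a string such as '[50,90]', as in the Scala example above):
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
def check_range(rng, rank, value):
    # strip the brackets, split into lower/upper bounds, then apply the rank/range rules
    low, high = rng.strip('[]').split(',')
    if int(rank) > 5:
        return 'negative'
    return 'positive' if int(low) < int(value) < int(high) else 'negative'
check_range_udf = F.udf(check_range, StringType())
result = (df.withColumn('Result', check_range_udf('range', 'rank', 'value'))
            .groupBy('tag', 'Result')
            .count())
result.show()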