I'm testing the NTILE function on a simple dataset like this:
(id: string, value: double)
A 10
B 3
C 4
D 4
E 4
F 30
C 30
D 10
A 4
H 4
Running the following query against Hive (on MapReduce)
SELECT tmp.id, tmp.sum_val, NTILE(4) OVER (ORDER BY tmp.sum_val) AS quartile
FROM (SELECT id, sum(value) AS sum_val FROM testntile GROUP BY id) AS tmp
works fine with the following result:
(id, sum_val, quartile)
B 3 1
H 4 1
E 4 2
D 14 2
A 14 3
F 30 3
C 34 4
Running the same query against Hive on Spark (v 1.5) still works fine.
Running the same query against Spark SQL 1.5 (CDH 5.5.1)
val result = sqlContext.sql("SELECT tmp.id, tmp.sum_val, NTILE(4) OVER (ORDER BY tmp.sum_val) AS quartile FROM (SELECT id, sum(value) AS sum_val FROM testntile GROUP BY id) AS tmp")
result.collect().foreach(println)
I get the following wrong result:
[B,3.0,0]
[E,4.0,0]
[H,4.0,0]
[A,14.0,0]
[D,14.0,0]
[F,30.0,0]
[C,34.0,0]
IMPORTANT: the result is NOT deterministic; sometimes the correct values are returned.
Running the same algorithm directly on the DataFrame
val x = sqlContext.sql("select id, sum(value) as sum_val from testntile group by id")
val w = Window.partitionBy("id").orderBy("sum_val")
val resultDF = x.select(x("id"), x("sum_val"), ntile(4).over(w))
still returns a wrong result.
Am I doing something wrong? Any ideas? Thanks in advance for your answers.
If you use Window.partitionBy("id").orderBy("sum_val"), you are partitioning by id and then applying the ntile function within each partition. This way every partition has exactly one element, so ntile assigns the same bucket to every id.
In order to achieve your first result, you need to remove partitionBy("id") and use only Window.orderBy("sum_val").
This is how I would modify your code:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.ntile

val w = Window.orderBy("sum_val")
val resultDF = x.orderBy("sum_val").select(x("id"), x("sum_val"), ntile(4).over(w).alias("ntile"))
And this is the output of resultDF.show():
+---+-------+-----+
| id|sum_val|ntile|
+---+-------+-----+
| B| 3| 1|
| E| 4| 1|
| H| 4| 2|
| D| 14| 2|
| A| 14| 3|
| F| 30| 3|
| C| 34| 4|
+---+-------+-----+
I have a specific requirement where I need to query a dataframe based on a range condition.
The range bounds come from the rows of another dataframe, so I will have as many queries as there are rows in that dataframe.
Using collect() in my scenario seems to be the bottleneck because it brings every row to the driver.
Example:
I need to execute a query on table 2 for every row in table 1
Table 1:
ID1 | Num1 | Num2
1   | 10   | 3
2   | 40   | 4
Table 2:
ID2 | Num3
1   | 9
2   | 39
3   | 22
4   | 12
For the first row in Table 1, I create the range [10-3, 10+3] = [7, 13] => this becomes the range for the first query.
For the second row in Table 1, I create the range [40-4, 40+4] = [36, 44] => this becomes the range for the second query.
I am currently doing collect() and iterating over the rows to get the values. I use these values as ranges in my queries for Table 2.
Output of Query 1:
ID2 | Num3
1   | 9
4   | 12
Output of Query 2:
ID2 | Num3
2   | 39
Since the number of rows in Table 1 is very large, doing a collect() operation is costly.
And since the values are numeric, I assume a join won't work.
Any help in optimizing this task is appreciated.
Depending on what you want your output to look like, you could solve this with a join. Consider the following code:
case class FirstType(id1: Int, num1: Int, num2: Int)
case class Bounds(id1: Int, lowerBound: Int, upperBound: Int)
case class SecondType(id2: Int, num3: Int)
val df = Seq((1, 10, 3), (2, 40, 4)).toDF("id1", "num1", "num2").as[FirstType]
df.show
+---+----+----+
|id1|num1|num2|
+---+----+----+
| 1| 10| 3|
| 2| 40| 4|
+---+----+----+
val df2 = Seq((1, 9), (2, 39), (3, 22), (4, 12)).toDF("id2", "num3").as[SecondType]
df2.show
+---+----+
|id2|num3|
+---+----+
| 1| 9|
| 2| 39|
| 3| 22|
| 4| 12|
+---+----+
val bounds = df.map(x => Bounds(x.id1, x.num1 - x.num2, x.num1 + x.num2))
bounds.show
+---+----------+----------+
|id1|lowerBound|upperBound|
+---+----------+----------+
| 1| 7| 13|
| 2| 36| 44|
+---+----------+----------+
val test = bounds.join(df2, df2("num3") >= bounds("lowerBound") && df2("num3") <= bounds("upperBound"))
test.show
+---+----------+----------+---+----+
|id1|lowerBound|upperBound|id2|num3|
+---+----------+----------+---+----+
| 1| 7| 13| 1| 9|
| 2| 36| 44| 2| 39|
| 1| 7| 13| 4| 12|
+---+----------+----------+---+----+
Here, I do the following:
Create 3 case classes to be able to use typed datasets later on
Create the 2 dataframes
Create an auxiliary dataframe called bounds, which contains the lower/upper bounds
Join the second dataframe onto that auxiliary one
As you can see, the test dataframe contains the result. For each unique combination of the id1, lowerBound and upperBound columns, the id2 and num3 columns give you one of the per-row query results you wanted.
You could, for example, use a groupBy operation to group by these 3 columns (something like test.groupBy("id1", "lowerBound", "upperBound")) and aggregate from there. If you want to apply an arbitrary operation to the group for each of the bounds, use groupByKey instead, which returns a KeyValueGroupedDataset with methods such as mapValues and mapGroups.
Hope this helps!
I have a dataset which consists of two columns, C1 and C2. The columns are associated in a many-to-many relation.
What I would like to do is find, for each C2 value, the C1 value which has the most associations with C2 values overall.
For example:
C1 | C2
1 | 2
1 | 5
1 | 9
2 | 9
2 | 8
We can see here that 1 is matched to 3 values of C2 while 2 is matched to 2, so I would like as output:
Out1 | Out2 | matches
2    | 1    | 3
5    | 1    | 3
9    | 1    | 3    (1 wins because 3 > 2)
8    | 2    | 2
What I have done so far is:
dataset = sc.textFile("...") \
    .map(lambda line: (line.split(",")[0], [line.split(",")[1]])) \
    .reduceByKey(lambda x, y: x + y)
What this does is gather, for each C1 value, all of its C2 matches; the length of this list is our desired matches column. What I would like now is to somehow use each value in this list as a new key and have a mapping like:
(key, value_list[value1, value2, ...]) --> (value1, key), (value2, key), ...
How could this be done using Spark? Any advice would be really helpful.
Thanks in advance!
The DataFrame API is perhaps easier for this kind of task. You can group by C1 to get the counts, join them back to the original rows, then group by C2 and keep the C1 value that corresponds to the highest number of matches.
import pyspark.sql.functions as F

df = spark.read.csv('file.csv', header=True, inferSchema=True)

df2 = (df.groupBy('C1')
       .count()
       .join(df, 'C1')
       .groupBy(F.col('C2').alias('Out1'))
       .agg(
           F.max(
               F.struct(F.col('count').alias('matches'), F.col('C1').alias('Out2'))
           ).alias('c')
       )
       .select('Out1', 'c.Out2', 'c.matches')
       .orderBy('Out1'))
df2.show()
+----+----+-------+
|Out1|Out2|matches|
+----+----+-------+
| 2| 1| 3|
| 5| 1| 3|
| 8| 2| 2|
| 9| 1| 3|
+----+----+-------+
We can get the desired result easily using the DataFrame API.
from pyspark.sql import SparkSession
import pyspark.sql.functions as fun
from pyspark.sql.window import Window

spark = SparkSession.builder.master("local[*]").getOrCreate()

# preparing sample dataframe
data = [(1, 2), (1, 5), (1, 9), (2, 9), (2, 8)]
schema = ["c1", "c2"]
df = spark.createDataFrame(data, schema)

output = df.withColumn("matches", fun.count("c1").over(Window.partitionBy("c1"))) \
    .groupby(fun.col("c2").alias("out1")) \
    .agg(fun.first(fun.col("c1")).alias("out2"), fun.max("matches").alias("matches"))
output.show()
# output
+----+----+-------+
|out1|out2|matches|
+----+----+-------+
| 9| 1| 3|
| 5| 1| 3|
| 8| 2| 2|
| 2| 1| 3|
+----+----+-------+
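If you would rather stay with the RDD code from the question, the (key, value_list[...]) --> (value, key) expansion described there can be done with flatMap. A rough sketch building on that code (the file path is left elided as in the question):

# gather the C2 values per C1 as before, then expand each (c1, [c2, ...]) pair
# into (c2, (c1, matches)) rows and keep, for every c2, the c1 with the most matches
pairs = sc.textFile("...") \
    .map(lambda line: (line.split(",")[0], [line.split(",")[1]])) \
    .reduceByKey(lambda x, y: x + y)
expanded = pairs.flatMap(lambda kv: [(c2, (kv[0], len(kv[1]))) for c2 in kv[1]])
best = expanded.reduceByKey(lambda a, b: a if a[1] >= b[1] else b)
best.collect()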
I have data similar to this:
FieldA FieldB ExplodedField
1 A 1
1 A 2
1 A 3
2 B 3
2 B 5
I would like to join the data so the output will look in the following way:
FieldA FieldB ExplodedField
1 A 1
1 A 1,2
1 A 1,2,3
2 B 3
2 B 3,5
How would you implement it in Spark? Note that the input dataset is very large.
Try a window with partitionBy + orderBy, combined with the collect_list and concat_ws functions.
Example:
val df=Seq((1,"A",1),(1,"A",2),(1,"A",3)).toDF("FieldA","FieldB","ExplodedField").withColumn("mid",monotonically)
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
val win=Window.partitionBy("FieldA","FieldB").orderBy("mid")
df.withColumn("ExplodedField",concat_ws(",",collect_list(col("ExplodedField")).over(win))).
drop("mid").
show()
/*
+------+------+-------------+
|FieldA|FieldB|ExplodedField|
+------+------+-------------+
| 1| A| 1|
| 1| A| 1,2|
| 1| A| 1,2,3|
+------+------+-------------+
*/
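If you are on the Python API, the same approach looks roughly like this (a sketch only; it assumes an existing SparkSession named spark, uses all five sample rows from the question, and casts the exploded values to strings before concat_ws):

from pyspark.sql import functions as F, Window

# sample rows from the question, plus a surrogate ordering column
df = spark.createDataFrame(
    [(1, "A", 1), (1, "A", 2), (1, "A", 3), (2, "B", 3), (2, "B", 5)],
    ["FieldA", "FieldB", "ExplodedField"],
).withColumn("mid", F.monotonically_increasing_id())

# running collect_list per (FieldA, FieldB), ordered by the surrogate id
win = Window.partitionBy("FieldA", "FieldB").orderBy("mid")
result = df.withColumn(
    "ExplodedField",
    F.concat_ws(",", F.collect_list(F.col("ExplodedField").cast("string")).over(win)),
).drop("mid")
result.show()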
I have data like the example data below. I’m trying to create a new column in my data using PySpark that would be the category of the first event for a customer based on the timestamp. Like the example output data below.
I have an example below of what I think would accomplish it using a window function in SQL.
I'm pretty new to PySpark. I understand you can run SQL inside of PySpark, and I'm wondering if the code below is correct for running the SQL window function in PySpark. That is, I'm wondering if I can just paste the SQL code inside of spark.sql, as I have below.
Input:
eventid customerid category timestamp
1 3 a 1/1/12
2 3 b 2/3/14
4 2 c 4/1/12
Output:
eventid customerid category timestamp first_event
1 3 a 1/1/12 a
2 3 b 2/3/14 a
4 2 c 4/1/12 c
window function example:
select eventid, customerid, category, timestamp
FIRST_VALUE(catgegory) over(partition by customerid order by timestamp) first_event
from table
# implementing window function example with pyspark
PySpark:
# Note: assume df is dataframe with structure of table above
# (df is table)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Operations").getOrCreate()
# Register the DataFrame as a SQL temporary view
df.createOrReplaceView("Table")
sql_results = spark.sql("select eventid, customerid, category, timestamp
FIRST_VALUE(catgegory) over(partition by customerid order by timestamp) first_event
from table")
# display results
sql_results.show()
You can use window functions in PySpark as well:
>>> import pyspark.sql.functions as F
>>> from pyspark.sql.window import Window
>>>
>>> df.show()
+-------+----------+--------+---------+
|eventid|customerid|category|timestamp|
+-------+----------+--------+---------+
| 1| 3| a| 1/1/12|
| 2| 3| b| 2/3/14|
| 4| 2| c| 4/1/12|
+-------+----------+--------+---------+
>>> window = Window.partitionBy('customerid')
>>> df = df.withColumn('first_event', F.first('category').over(window))
>>>
>>> df.show()
+-------+----------+--------+---------+-----------+
|eventid|customerid|category|timestamp|first_event|
+-------+----------+--------+---------+-----------+
| 1| 3| a| 1/1/12| a|
| 2| 3| b| 2/3/14| a|
| 4| 2| c| 4/1/12| c|
+-------+----------+--------+---------+-----------+
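On the spark.sql part of the question: yes, you can paste the SQL in once the DataFrame is registered as a temp view, but the snippet in the question needs a few fixes. The method is createOrReplaceTempView (there is no createOrReplaceView), the multi-line query needs a triple-quoted string, a comma is missing before FIRST_VALUE, and category is misspelled. A sketch, with the view name picked here:

# register the DataFrame under a temporary view name (chosen as "events" here)
df.createOrReplaceTempView("events")

sql_results = spark.sql("""
    SELECT eventid, customerid, category, timestamp,
           FIRST_VALUE(category) OVER (PARTITION BY customerid ORDER BY timestamp) AS first_event
    FROM events
""")
sql_results.show()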
I have two dataframes that I need to join on one column, keeping just the rows from the first dataframe whose id is contained in the same column of the second dataframe:
df1:
id a b
2 1 1
3 0.5 1
4 1 2
5 2 1
df2:
id c d
2 fs a
5 fa f
Desired output:
df:
id a b
2 1 1
5 2 1
I have tried df1.join(df2("id"), "left"), but it gives me the error: 'DataFrame' object is not callable.
df2("id") is not valid Python syntax for selecting columns; you'd need either df2[["id"]] or df2.select("id"). For your example, you can do:
df1.join(df2.select("id"), "id").show()
+---+---+---+
| id| a| b|
+---+---+---+
| 5|2.0| 1|
| 2|1.0| 1|
+---+---+---+
or:
df1.join(df2[["id"]], "id").show()
+---+---+---+
| id| a| b|
+---+---+---+
| 5|2.0| 1|
| 2|1.0| 1|
+---+---+---+
If you only need to check whether id exists in df2 and don't need any columns from df2 in your output, then isin() is a more efficient solution (this is similar to EXISTS and IN in SQL).
df1 = spark.createDataFrame([(2, 1, 1), (3, 5, 1), (4, 1, 2), (5, 2, 1)], "id: int, a: int, b: int")
df2 = spark.createDataFrame([(2, 'fs', 'a'), (5, 'fa', 'f')], ['id', 'c', 'd'])
Collect df2.id as a list and pass it to df1 under isin():
from pyspark.sql.functions import col
df2_list = df2.select('id').rdd.map(lambda row : row[0]).collect()
df1.where(col('id').isin(df2_list)).show()
#+---+---+---+
#| id| a| b|
#+---+---+---+
#| 2| 1| 1|
#| 5| 2| 1|
#+---+---+---+
It is recommended to use isin() IF -
You don't need to return data from the reference dataframe/table
You have duplicates in the reference dataframe/table (a JOIN can produce duplicate rows if values are repeated)
You just want to check the existence of a particular value
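Depending on the size of df2, a left semi join is another way to express the same existence check without collecting anything to the driver; a minimal sketch with the same dataframes:

# keep the rows of df1 whose id also appears in df2; no columns from df2 are returned,
# and duplicate ids in df2 do not duplicate df1 rows
df1.join(df2, "id", "leftsemi").show()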