I have a dataframe that looks like this
val df = Seq(
(1,"a,b,c"),
(2,"b,c")
).toDF("id","page_path")
df.createOrReplaceTempView("df")
df.show()
+---+---------+
| id|page_path|
+---+---------+
| 1| a,b,c|
| 2| b,c|
+---+---------+
I want to perform one hot encoding on this page_path column such that the output would look like -
Can I do this using one-hot encoding in Spark?
Column "page_path" can be split, and then values exploded, and pivoted:
df
.withColumn("splitted", split($"page_path",","))
.withColumn("exploded", explode($"splitted"))
.groupBy("id")
.pivot("exploded")
.count()
// replace nulls with 0
.na.fill(0)
Output:
+---+---+---+---+
|id |a |b |c |
+---+---+---+---+
|1 |1 |1 |1 |
|2 |0 |1 |1 |
+---+---+---+---+
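To make the split / explode / pivot logic concrete, here is a minimal plain-Python sketch of the same transformation (no Spark involved). The function name `multi_hot` and the alphabetical column order are my own assumptions for illustration. Unlike `pivot(...).count()`, it only records presence, so a token repeated within one row still yields 1.

```python
def multi_hot(rows, sep=","):
    """rows: iterable of (id, "a,b,c") pairs. Returns (columns, {id: [0/1, ...]})."""
    # "split": turn each delimited string into a set of tokens
    split_rows = [(row_id, set(path.split(sep))) for row_id, path in rows]
    # "pivot": the distinct tokens across all rows become the output columns
    columns = sorted(set().union(*(tokens for _, tokens in split_rows)))
    # presence flag per column; absent tokens become 0 (the na.fill(0) step)
    encoded = {
        row_id: [int(col in tokens) for col in columns]
        for row_id, tokens in split_rows
    }
    return columns, encoded


columns, encoded = multi_hot([(1, "a,b,c"), (2, "b,c")])
print(columns)   # ['a', 'b', 'c']
print(encoded)   # {1: [1, 1, 1], 2: [0, 1, 1]}
```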
Since the question mentions df.createOrReplaceTempView("df"), here is a SQL version of what pasha's answer does.
The Databricks documentation covers many use cases for PIVOT...
Below is the SQL version, for SQL lovers.
Unlike the DataFrame approach, PIVOT in SQL groups implicitly, so no separate GROUP BY clause is needed.
val df: DataFrame = Seq((1, "a,b,c"),(2, "b,c")).toDF("id", "page_path")
df.createOrReplaceTempView("df")
spark.sql(
"""
|Select * from
|( select id, explode(split( page_path ,',')) as exploded from df )
|pivot(count(exploded) for exploded in ('a' as is_a, 'b' as is_b, 'c' as is_c)
|)
""".stripMargin).na.fill(0).show
Result :
+---+----+----+----+
| id|is_a|is_b|is_c|
+---+----+----+----+
| 1| 1| 1| 1|
| 2| 0| 1| 1|
+---+----+----+----+
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object Solution {
def main(args: Array[String]): Unit = {
val sparkConf = new SparkConf().setMaster("local[4]").setAppName("SparkClusterApp")
val sparkSession = SparkSession.builder.config(sparkConf).getOrCreate
import sparkSession.implicits._
val df = Seq((1, "a,b,c"),(2, "b,c")).toDF("id", "page_path")
df.createOrReplaceTempView("df")
df.withColumn("_tmp", split($"page_path", "\\,")).select( $"id",
when(array_contains($"_tmp","a"),"1").otherwise("0").as("is_a"),
when(array_contains($"_tmp","b"),"1").otherwise("0").as("is_b"),
when(array_contains($"_tmp","c"),"1").otherwise("0").as("is_c")).show()
}
}
Given two spark dataframes A and B with the same number of columns and rows, I want to compute the numerical difference between the two dataframes and store it into another dataframe (or another data structure optionally).
For instance let us have the following datasets
DataFrame A:
+----+---+
| A | B |
+----+---+
| 1| 0|
| 1| 0|
+----+---+
DataFrame B:
+----+---+
| A | B |
+----+---+
| 1| 0 |
| 0| 0 |
+----+---+
How to obtain B-A, i.e
+----+---+
| c1 | c2|
+----+---+
| 0| 0 |
| -1| 0 |
+----+---+
In practice the real dataframes have a large number of rows and 50+ columns for which the difference needs to be computed. What is the Spark/Scala way of doing it?
I was able to solve this by using the approach below. This code can work with any number of columns. You just have to change the input DFs accordingly.
import org.apache.spark.sql.Row
val df0 = Seq((1, 5), (1, 4)).toDF("a", "b")
val df1 = Seq((1, 0), (3, 2)).toDF("a", "b")
val columns = df0.columns
val rdd = df0.rdd.zip(df1.rdd).map {
x =>
val arr = columns.map(column =>
x._2.getAs[Int](column) - x._1.getAs[Int](column))
Row(arr: _*)
}
spark.createDataFrame(rdd, df0.schema).show(false)
Output generated:
df0=>
+---+---+
|a |b |
+---+---+
|1 |5 |
|1 |4 |
+---+---+
df1=>
+---+---+
|a |b |
+---+---+
|1 |0 |
|3 |2 |
+---+---+
Output=>
+---+---+
|a |b |
+---+---+
|0 |-5 |
|2 |-2 |
+---+---+
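Note that this approach pairs rows purely by position, and RDD.zip assumes both RDDs have the same number of partitions and the same number of elements in each partition. The positional idea itself looks like this in plain Python; it is only a sketch, and the function name `diff_rows` is hypothetical.

```python
def diff_rows(left, right):
    """Element-wise right - left for two equally shaped lists of row tuples."""
    if len(left) != len(right):
        raise ValueError("both inputs need the same number of rows")
    # outer zip pairs rows by position, inner zip pairs the columns of each row
    return [
        tuple(r - l for l, r in zip(left_row, right_row))
        for left_row, right_row in zip(left, right)
    ]


df0 = [(1, 5), (1, 4)]
df1 = [(1, 0), (3, 2)]
print(diff_rows(df0, df1))  # [(0, -5), (2, -2)]
```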
If df A has the same shape as df B, you can try the approach below. I don't know whether it will work correctly for large datasets; it would be better to join on an id that already exists rather than one created with monotonically_increasing_id().
import spark.implicits._
import org.apache.spark.sql.functions._
val df0 = Seq((1, 0), (1, 0)).toDF("a", "b")
val df1 = Seq((1, 0), (0, 0)).toDF("a", "b")
// new cols names
val colNamesA = df0.columns.map("A_" + _)
val colNamesB = df0.columns.map("B_" + _)
// rename cols and add id
val dfA = df0.toDF(colNamesA: _*)
.withColumn("id", monotonically_increasing_id())
val dfB = df1.toDF(colNamesB: _*)
.withColumn("id", monotonically_increasing_id())
dfA.show()
dfB.show()
// get columns without id
val dfACols = dfA.columns.dropRight(1).map(dfA(_))
val dfBCols = dfB.columns.dropRight(1).map(dfB(_))
// diff between cols
val calcCols = (dfACols zip dfBCols).map(s=>s._2-s._1)
// join dfs
val joined = dfA.join(dfB, "id")
joined.show()
calcCols.foreach(_.explain(true))
joined.select(calcCols:_*).show()
+---+---+---+
|A_a|A_b| id|
+---+---+---+
| 1| 0| 0|
| 1| 0| 1|
+---+---+---+
+---+---+---+
|B_a|B_b| id|
+---+---+---+
| 1| 0| 0|
| 0| 0| 1|
+---+---+---+
+---+---+---+---+---+
| id|A_a|A_b|B_a|B_b|
+---+---+---+---+---+
| 0| 1| 0| 1| 0|
| 1| 1| 0| 0| 0|
+---+---+---+---+---+
(B_a#26 - A_a#18)
(B_b#27 - A_b#19)
+-----------+-----------+
|(B_a - A_a)|(B_b - A_b)|
+-----------+-----------+
| 0| 0|
| -1| 0|
+-----------+-----------+
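If the rows carry a real key, the result no longer depends on row order at all. Here is a minimal plain-Python sketch of that join-then-subtract idea, assuming each side is a dict keyed by a hypothetical id; `diff_by_key` is an illustrative name, not a Spark function.

```python
def diff_by_key(a_rows, b_rows):
    """a_rows / b_rows: dict mapping id -> tuple of numbers. Returns id -> (b - a)."""
    # inner join: keep only the ids present on both sides
    shared = a_rows.keys() & b_rows.keys()
    return {
        key: tuple(b - a for a, b in zip(a_rows[key], b_rows[key]))
        for key in sorted(shared)
    }


a = {0: (1, 0), 1: (1, 0)}
b = {1: (0, 0), 0: (1, 0)}   # insertion order differs on purpose
print(diff_by_key(a, b))     # {0: (0, 0), 1: (-1, 0)}
```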
I have a dataframe (df) and within the dataframe I have a column user_id
df = sc.parallelize([(1, "not_set"),
(2, "user_001"),
(3, "user_002"),
(4, "n/a"),
(5, "N/A"),
(6, "userid_not_set"),
(7, "user_003"),
(8, "user_004")]).toDF(["key", "user_id"])
df:
+---+--------------+
|key| user_id|
+---+--------------+
| 1| not_set|
| 2| user_001|
| 3| user_002|
| 4| n/a|
| 5| N/A|
| 6|userid_not_set|
| 7| user_003|
| 8| user_004|
+---+--------------+
I would like to replace the following values: not_set, n/a, N/A and userid_not_set with null.
It would be good if I could add new values to a list so that they too would be replaced.
I am currently using a CASE statement within spark.sql to perform this and would like to change it to pyspark.
None inside the when() function corresponds to null. If you wish to fill in anything else instead of null, put it in its place.
from pyspark.sql.functions import col, when
df = df.withColumn(
"user_id",
when(
col("user_id").isin('not_set', 'n/a', 'N/A', 'userid_not_set'),
None
).otherwise(col("user_id"))
)
df.show()
+---+--------+
|key| user_id|
+---+--------+
| 1| null|
| 2|user_001|
| 3|user_002|
| 4| null|
| 5| null|
| 6| null|
| 7|user_003|
| 8|user_004|
+---+--------+
You can use the in-built when function, which is the equivalent of a case expression.
from pyspark.sql import functions as f
df.select(df.key, f.when(df.user_id.isin(['not_set', 'n/a', 'N/A', 'userid_not_set']), None).otherwise(df.user_id).alias('user_id')).show()
The values can also be stored in a list and referenced from there.
val_list = ['not_set', 'n/a', 'N/A', 'userid_not_set']
df.select(df.key, f.when(df.user_id.isin(val_list), None).otherwise(df.user_id).alias('user_id')).show()
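The list-driven lookup is easy to reason about outside Spark too. Below is a minimal plain-Python sketch of the same rule; the case-insensitive comparison is my own assumption (it lets one lowercase entry cover both n/a and N/A), and the names `NULL_MARKERS` and `to_null` are hypothetical.

```python
NULL_MARKERS = ["not_set", "n/a", "userid_not_set"]  # lowercase; append new markers here


def to_null(value, markers=NULL_MARKERS):
    """Return None when value is a known placeholder, otherwise the value itself."""
    if value is not None and value.strip().lower() in markers:
        return None
    return value


rows = [(1, "not_set"), (2, "user_001"), (4, "n/a"), (5, "N/A"), (6, "userid_not_set")]
cleaned = [(key, to_null(user_id)) for key, user_id in rows]
print(cleaned)
# [(1, None), (2, 'user_001'), (4, None), (5, None), (6, None)]
```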
Please find below a few approaches. I am assuming that all legitimate user IDs start with "user_". Please try the code below.
from pyspark.sql.functions import *
df.withColumn(
"user_id",
when(col("user_id").startswith("user_"),col("user_id")).otherwise(None)
).show()
Another One.
cond = """case when user_id in ('not_set', 'n/a', 'N/A', 'userid_not_set') then null
else user_id
end"""
df.withColumn("ID", expr(cond)).show()
Another One.
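# In LIKE, "_" is a single-character wildcard, so it is escaped here to match a literal underscore
cond = r"""case when user_id like 'user\_%' then user_id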
else null
end"""
df.withColumn("ID", expr(cond)).show()
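Keep in mind that in SQL LIKE patterns `_` matches any single character and `%` matches any sequence, so an unescaped pattern `user_%` also accepts `userid_not_set`. The sketch below mimics LIKE with a small regex translation in plain Python to show the difference; `like_to_regex` and `sql_like` are hypothetical helpers, and backslash as the escape character is an assumption of this sketch.

```python
import re


def like_to_regex(pattern, escape="\\"):
    """Translate a SQL LIKE pattern into a compiled Python regex (sketch)."""
    out, i = [], 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == escape and i + 1 < len(pattern):
            out.append(re.escape(pattern[i + 1]))  # escaped character is literal
            i += 2
            continue
        if ch == "%":
            out.append(".*")   # any run of characters
        elif ch == "_":
            out.append(".")    # exactly one character
        else:
            out.append(re.escape(ch))
        i += 1
    return re.compile("".join(out), re.DOTALL)


def sql_like(value, pattern):
    """True when the whole value matches the LIKE pattern."""
    return like_to_regex(pattern).fullmatch(value) is not None


print(sql_like("userid_not_set", "user_%"))    # True: "_" matched the "i"
print(sql_like("userid_not_set", "user\\_%"))  # False: escaped "_" is literal
print(sql_like("user_001", "user\\_%"))        # True
```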
Another one.
df.withColumn(
"user_id",
when(col("user_id").rlike("user_"),col("user_id")).otherwise(None)
).show()
I have a data frame with two columns,
+---+-------+
| id| fruit|
+---+-------+
| 0| apple|
| 1| banana|
| 2|coconut|
| 1| banana|
| 2|coconut|
+---+-------+
I also have a universal list with all the items,
fruitList: Seq[String] = WrappedArray(apple, coconut, banana)
Now I want to create a new column in the dataframe holding an array of 1s and 0s, where 1 means the item is present in that row and 0 means it is not.
Desired Output
+---+-----------+
| id| fruitlist|
+---+-----------+
| 0| [1,0,0] |
| 1| [0,1,0] |
| 2|[0,0,1] |
| 1| [0,1,0] |
| 2|[0,0,1] |
+---+-----------+
This is something I tried,
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
val df = spark.createDataFrame(Seq(
(0, "apple"),
(1, "banana"),
(2, "coconut"),
(1, "banana"),
(2, "coconut")
)).toDF("id", "fruit")
df.show
import org.apache.spark.sql.functions._
val fruitList = df.select(collect_set("fruit")).first().getAs[Seq[String]](0)
print(fruitList)
I tried to solve this with OneHotEncoder but the result was something like this after converting to dense vector, which is not what I needed.
+---+-------+----------+-------------+---------+
| id| fruit|fruitIndex| fruitVec| vd|
+---+-------+----------+-------------+---------+
| 0| apple| 2.0| (2,[],[])|[0.0,0.0]|
| 1| banana| 1.0|(2,[1],[1.0])|[0.0,1.0]|
| 2|coconut| 0.0|(2,[0],[1.0])|[1.0,0.0]|
| 1| banana| 1.0|(2,[1],[1.0])|[0.0,1.0]|
| 2|coconut| 0.0|(2,[0],[1.0])|[1.0,0.0]|
+---+-------+----------+-------------+---------+
If you have a collection such as
val fruitList: Seq[String] = Seq("apple", "coconut", "banana")
then you can do it either with built-in functions or with a udf function.
Built-in functions (array, when and lit)
import org.apache.spark.sql.functions._
df.withColumn("fruitList", array(fruitList.map(x => when(lit(x) === col("fruit"),1).otherwise(0)): _*)).show(false)
udf function
import org.apache.spark.sql.functions._
def containedUdf = udf((fruit: String) => fruitList.map(x => if(x == fruit) 1 else 0))
df.withColumn("fruitList", containedUdf(col("fruit"))).show(false)
which should give you
+---+-------+---------+
|id |fruit |fruitList|
+---+-------+---------+
|0 |apple |[1, 0, 0]|
|1 |banana |[0, 0, 1]|
|2 |coconut|[0, 1, 0]|
|1 |banana |[0, 0, 1]|
|2 |coconut|[0, 1, 0]|
+---+-------+---------+
udf functions are easy to understand and straightforward when dealing with primitive datatypes, but they should be avoided if optimized and fast built-in functions are available for the same task.
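Either way, the position of the 1 follows the order of fruitList, which is why banana comes out as [0, 0, 1] for the list (apple, coconut, banana). A minimal plain-Python sketch of that per-row mapping (the name `one_hot` is hypothetical):

```python
def one_hot(value, vocabulary):
    """Return a 0/1 list with a 1 at each position where vocabulary equals value."""
    return [1 if item == value else 0 for item in vocabulary]


fruit_list = ["apple", "coconut", "banana"]
rows = [(0, "apple"), (1, "banana"), (2, "coconut")]
encoded = [(row_id, fruit, one_hot(fruit, fruit_list)) for row_id, fruit in rows]
for row in encoded:
    print(row)
```

Since collect_set does not guarantee any particular order, it is probably safer to sort or hard-code the list if stable positions matter.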
I hope the answer is helpful
I have a table of visit times stored as Unix timestamps, like this
ID, time
1, 1493596800
1, 1493596900
1, 1493432800
2, 1493596800
2, 1493596850
2, 1493432800
I use Spark SQL and I need the longest sequence of consecutive dates for each ID, like
ID, longest_seq (days)
1, 2
2, 5
3, 1
I tried to adapt the answer to "Detect consecutive dates ranges using SQL" to my case, but I didn't manage to get what I expect.
SELECT ID, MIN (d), MAX(d)
FROM (
SELECT ID, cast(from_utc_timestamp(cast(time as timestamp), 'CEST') as date) AS d,
ROW_NUMBER() OVER(
PARTITION BY ID ORDER BY cast(from_utc_timestamp(cast(time as timestamp), 'CEST')
as date)) rn
FROM purchase
where ID is not null
GROUP BY ID, cast(from_utc_timestamp(cast(time as timestamp), 'CEST') as date)
)
GROUP BY ID, rn
ORDER BY ID
If someone has a clue on how to fix this query, or what's wrong with it, I would appreciate the help
Thanks
[EDIT] A more explicit input /output
ID, time
1, 1
1, 2
1, 3
2, 1
2, 3
2, 4
2, 5
2, 10
2, 11
3, 1
3, 4
3, 9
3, 11
The result would be :
ID, MaxSeq (in days)
1,3
2,3
3,1
All the visits are timestamps, but I need consecutive days, so multiple visits on the same day count only once.
My answer below is adapted from https://dzone.com/articles/how-to-find-the-longest-consecutive-series-of-even for use in Spark SQL. You'll have to wrap the SQL queries with:
spark.sql("""
SQL_QUERY
""")
So, for the first query:
CREATE TABLE intermediate_1 AS
SELECT
id,
time,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY time) AS rn,
time - ROW_NUMBER() OVER (PARTITION BY id ORDER BY time) AS grp
FROM purchase
This will give you:
id, time, rn, grp
1, 1, 1, 0
1, 2, 2, 0
1, 3, 3, 0
2, 1, 1, 0
2, 3, 2, 1
2, 4, 3, 1
2, 5, 4, 1
2, 10, 5, 5
2, 11, 6, 5
3, 1, 1, 0
3, 4, 2, 2
3, 9, 3, 6
3, 11, 4, 7
We can see that consecutive rows have the same grp value. Then we use GROUP BY and COUNT to get the number of consecutive times in each group.
CREATE TABLE intermediate_2 AS
SELECT
id,
grp,
COUNT(*) AS num_consecutive
FROM intermediate_1
GROUP BY id, grp
This will return:
id, grp, num_consecutive
1, 0, 3
2, 0, 1
2, 1, 3
2, 5, 2
3, 0, 1
3, 2, 1
3, 6, 1
3, 7, 1
Now we just use MAX and GROUP BY to get the maximum number of consecutive times per id.
CREATE TABLE final AS
SELECT
id,
MAX(num_consecutive) as max_consecutive
FROM intermediate_2
GROUP BY id
Which will give you:
id, max_consecutive
1, 3
2, 3
3, 1
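The three queries boil down to one trick: within a run of consecutive days, day minus row number stays constant. Here is a minimal plain-Python sketch of the same steps, under the assumption that time is already an integer day number. The `set()` call is my own addition so that several visits on the same day count once; the queries above do not include that step. The name `longest_streak` is hypothetical.

```python
from itertools import groupby


def longest_streak(days):
    """Length of the longest run of consecutive integers in days."""
    unique_days = sorted(set(days))  # one entry per day, in order
    # day - position is constant inside a run of consecutive days (the grp column)
    keyed = [(day - pos, day) for pos, day in enumerate(unique_days, start=1)]
    # equal grp values are adjacent because unique_days is sorted
    runs = (len(list(group)) for _, group in groupby(keyed, key=lambda kv: kv[0]))
    return max(runs, default=0)


visits = {1: [1, 2, 3], 2: [1, 3, 4, 5, 10, 11], 3: [1, 4, 9, 11]}
result = {visitor: longest_streak(days) for visitor, days in visits.items()}
print(result)  # {1: 3, 2: 3, 3: 1}
```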
Hope this helps!
That's the case for my beloved window aggregate functions!
I think the following example could help you out (at least to get started).
The following is the dataset I use. I translated your time (in longs) to a numeric day number (to avoid messing around with timestamps in Spark SQL, which could make the solution harder to comprehend...possibly).
In the visits dataset below, the time column holds a day number, so values that differ by 1 are consecutive days.
scala> visits.show
+---+----+
| ID|time|
+---+----+
| 1| 1|
| 1| 1|
| 1| 2|
| 1| 3|
| 1| 3|
| 1| 3|
| 2| 1|
| 3| 1|
| 3| 2|
| 3| 2|
+---+----+
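For reference, the translation from epoch seconds to day numbers could look roughly like this in plain Python. This is only a sketch: the helper name `day_number` is hypothetical, and the fixed `utc_offset_hours` parameter is an assumption that ignores daylight saving changes, so a proper time zone conversion would be needed for exact local dates.

```python
SECONDS_PER_DAY = 86_400


def day_number(epoch_seconds, utc_offset_hours=0):
    """Days since 1970-01-01 for a Unix timestamp, after a fixed hour offset."""
    return (epoch_seconds + utc_offset_hours * 3600) // SECONDS_PER_DAY


# the three visits of ID 1 from the question
raw = [(1, 1493596800), (1, 1493596900), (1, 1493432800)]
days = sorted({day_number(ts) for _, ts in raw})  # the set keeps one entry per day
print(days)  # [17285, 17287]
```

Two of the three timestamps fall on the same day, so they collapse into a single day number.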
Let's define the window specification to group id rows together.
import org.apache.spark.sql.expressions.Window
val idsSortedByTime = Window.
partitionBy("id").
orderBy("time")
With that you rank the rows and count rows with the same rank.
val answer = visits.
select($"id", $"time", rank over idsSortedByTime as "rank").
groupBy("id", "time", "rank").
agg(count("*") as "count")
scala> answer.show
+---+----+----+-----+
| id|time|rank|count|
+---+----+----+-----+
| 1| 1| 1| 2|
| 1| 2| 3| 1|
| 1| 3| 4| 3|
| 3| 1| 1| 1|
| 3| 2| 2| 2|
| 2| 1| 1| 1|
+---+----+----+-----+
That appears (very close?) to a solution. You seem done!
Using spark.sql and with intermediate tables
scala> val df = Seq((1, 1),(1, 2),(1, 3),(2, 1),(2, 3),(2, 4),(2, 5),(2, 10),(2, 11),(3, 1),(3, 4),(3, 9),(3, 11)).toDF("id","time")
df: org.apache.spark.sql.DataFrame = [id: int, time: int]
scala> df.createOrReplaceTempView("tb1")
scala> spark.sql(""" with tb2(select id,time, time-row_number() over(partition by id order by time) rw1 from tb1), tb3(select id,count(rw1) rw2 from tb2 group by id,rw1) select id, rw2 from tb3 where (id,rw2) in (select id,max(rw2) from tb3 group by id) group by id, rw2 """).show(false)
+---+---+
|id |rw2|
+---+---+
|1 |3 |
|3 |1 |
|2 |3 |
+---+---+
scala>
Solution using DataFrame API:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
val df1 = Seq((1, 1),(1, 2),(1, 3),(2, 1),(2, 3),(2, 4),(2, 5),(2, 10),(2, 11),(3, 1),(3, 4),(3, 9),(3, 11)).toDF("ID","time")
df1.show(false)
df1.printSchema()
val w = Window.partitionBy("ID").orderBy("time")
val df2 = df1.withColumn("rank", col("time") - row_number().over(w))
.groupBy("ID", "rank")
.agg(count("rank").alias("count"))
.groupBy("ID")
.agg(max("count").alias("time"))
.orderBy("ID")
df2.show(false)
Console output:
+---+----+
|ID |time|
+---+----+
|1 |1 |
|1 |2 |
|1 |3 |
|2 |1 |
|2 |3 |
|2 |4 |
|2 |5 |
|2 |10 |
|2 |11 |
|3 |1 |
|3 |4 |
|3 |9 |
|3 |11 |
+---+----+
root
|-- ID: integer (nullable = false)
|-- time: integer (nullable = false)
+---+----+
|ID |time|
+---+----+
|1 |3 |
|2 |3 |
|3 |1 |
+---+----+