I am using spark-sql-2.4.1v with Java 8.
I have a scenario where I need to copy the current row and create another row with a few columns' data modified. How can this be achieved in spark-sql?
Example:
Given
val data = List(
  ("20", "score", "school", 14, 12),
  ("21", "score", "school", 13, 13),
  ("22", "rate", "school", 11, 14)
)
val df = data.toDF("id", "code", "entity", "value1", "value2")
Current Output
+---+-----+------+------+------+
| id| code|entity|value1|value2|
+---+-----+------+------+------+
| 20|score|school| 14| 12|
| 21|score|school| 13| 13|
| 22| rate|school| 11| 14|
+---+-----+------+------+------+
When column "code" is "rate" copy it as two rows i.e. one is
original , second it is another row with new code "old_ rate" like
below
Expected output :
+---+--------+------+------+------+
| id| code|entity|value1|value2|
+---+--------+------+------+------+
| 20| score|school| 14| 12|
| 21| score|school| 13| 13|
| 22| rate|school| 11| 14|
| 22|new_rate|school| 11| 14|
+---+--------+------+------+------+
How can I achieve this?
You can use this approach for your scenario:
df.union(df.filter($"code"==="rate").withColumn("code",concat(lit("new_"), $"code"))).show()
/*
+---+--------+------+------+------+
| id| code|entity|value1|value2|
+---+--------+------+------+------+
| 20| score|school| 14| 12|
| 21| score|school| 13| 13|
| 22| rate|school| 11| 14|
| 22|new_rate|school| 11| 14|
+---+--------+------+------+------+
*/
Use when to check whether code === "rate": if it matches, replace the column value with array(lit("rate"), lit("new_rate")); if not, use array($"code"). Then explode the code column.
Check the code below.
scala> df.show(false)
+---+-----+------+------+------+
|id |code |entity|value1|value2|
+---+-----+------+------+------+
|20 |score|school|14 |12 |
|21 |score|school|13 |13 |
|22 |rate |school|11 |14 |
+---+-----+------+------+------+
val colExpr = explode(
  when(
    $"code" === "rate",
    array(
      lit("rate"),
      lit("new_rate")
    )
  )
  .otherwise(array($"code"))
)
scala> df.withColumn("code",colExpr).show(false)
+---+--------+------+------+------+
|id |code |entity|value1|value2|
+---+--------+------+------+------+
|20 |score |school|14 |12 |
|21 |score |school|13 |13 |
|22 |rate |school|11 |14 |
|22 |new_rate|school|11 |14 |
+---+--------+------+------+------+
Related
I am using spark-sql-2.4.1v. Depending on the value of a column, I need to get the mapped look-up values for the given value columns, as shown below.
Sample data:
val data = List(
  ("20", "score", "school", "2018-03-31", 14, 12),
  ("21", "score", "school", "2018-03-31", 13, 13),
  ("22", "rate", "school", "2018-03-31", 11, 14),
  ("21", "rate", "school", "2018-03-31", 13, 12)
)
val df = data.toDF("id", "code", "entity", "date", "value1", "value2")
df.show
+---+-----+------+----------+------+------+
| id| code|entity| date|value1|value2|
+---+-----+------+----------+------+------+
| 20|score|school|2018-03-31| 14| 12|
| 21|score|school|2018-03-31| 13| 13|
| 22| rate|school|2018-03-31| 11| 14|
| 21| rate|school|2018-03-31| 13| 12|
+---+-----+------+----------+------+------+
val resultDs = df
  .withColumn("value1",
    when(col("code").isin("rate"), functions.callUDF("udfFunc", col("value1")))
      .otherwise(col("value1").cast(DoubleType))
  )
The udfFunc maps values as follows:
11->a
12->b
13->c
14->d
Expected output
+---+-----+------+----------+------+------+
| id| code|entity| date|value1|value2|
+---+-----+------+----------+------+------+
| 20|score|school|2018-03-31| 14| 12|
| 21|score|school|2018-03-31| 13| 13|
| 22| rate|school|2018-03-31| a | 14|
| 21| rate|school|2018-03-31| c | 12|
+---+-----+------+----------+------+------+
But it is giving output as
+---+-----+------+----------+------+------+
| id| code|entity| date|value1|value2|
+---+-----+------+----------+------+------+
| 20|score|school|2018-03-31| null| 12|
| 21|score|school|2018-03-31| null| 13|
| 22| rate|school|2018-03-31| a | 14|
| 21| rate|school|2018-03-31| c | 12|
+---+-----+------+----------+------+------+
why "otherwise" condition is not working as expected. any idea what is wrong here ??
A column must contain a single datatype.
Note - DoubleType can not store StringType data, so you need to change DoubleType to StringType.
val resultDs = df
  .withColumn("value1",
    when(col("code") === lit("rate"), functions.callUDF("udfFunc", col("value1")))
      .otherwise(col("value1").cast(StringType)) // Should be StringType
  )
Or
val resultDs = df
  .withColumn("value1",
    when(col("code").isin("rate"), functions.callUDF("udfFunc", col("value1")))
      .otherwise(col("value1").cast(StringType)) // Modified to StringType
  )
I would suggest modifying it to:
df
  .withColumn("value1",
    when(col("code") === lit("rate"), functions.callUDF("udfFunc", col("value1")))
      .otherwise(col("value1").cast(StringType))
  )
and check again.
Is it possible to apply many expressions in the same selectExpr?
For example, if I have this DataFrame:
+---+
| i|
+---+
| 10|
| 15|
| 11|
| 56|
+---+
I know how to multiply by 2 and rename the column like this:
df.selectExpr("i*2 as multiplication")
def selectExpr(exprs: String*): org.apache.spark.sql.DataFrame
If you have many expressions, you pass them as comma-separated strings. Please check the code below.
scala> val df = (1 to 10).toDF("id")
df: org.apache.spark.sql.DataFrame = [id: int]
scala> df.selectExpr("id*2 as twotimes", "id * 3 as threetimes").show
+--------+----------+
|twotimes|threetimes|
+--------+----------+
| 2| 3|
| 4| 6|
| 6| 9|
| 8| 12|
| 10| 15|
| 12| 18|
| 14| 21|
| 16| 24|
| 18| 27|
| 20| 30|
+--------+----------+
Yes, you can pass multiple expressions inside the df.selectExpr. https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.Dataset#selectExpr(exprs:String*):org.apache.spark.sql.DataFrame
scala> case class Person(name: String, age: Int)
scala> val personDS = Seq(Person("Max", 1), Person("Adam", 2), Person("Muller", 3)).toDS()
scala> personDS.show(false)
+------+---+
|name |age|
+------+---+
|Max |1 |
|Adam |2 |
|Muller|3 |
+------+---+
scala> personDS.selectExpr("age*2 as multiple","name").show(false)
+--------+------+
|multiple|name |
+--------+------+
|2 |Max |
|4 |Adam |
|6 |Muller|
+--------+------+
Alternatively, you can use withColumn to achieve the same result:
scala> personDS.withColumn("multiple",$"age"*2).select($"multiple",$"name").show(false)
+--------+------+
|multiple|name |
+--------+------+
|2 |Max |
|4 |Adam |
|6 |Muller|
+--------+------+
I have a data frame in pyspark like below.
df.show()
+---+-------+----+
| id| type|s_id|
+---+-------+----+
| 1| ios| 11|
| 1| ios| 12|
| 1| ios| 13|
| 1| ios| 14|
| 1|android| 15|
| 1|android| 16|
| 1|android| 17|
| 2| ios| 21|
| 2|android| 18|
+---+-------+----+
Now from this data frame I want to create another data frame by pivoting it.
df1.show()
+---+-----+-----+-----+---------+---------+---------+
| id| ios1| ios2| ios3| android1| android2| android3|
+---+-----+-----+-----+---------+---------+---------+
| 1| 11| 12| 13| 15| 16| 17|
| 2| 21| Null| Null| 18| Null| Null|
+---+-----+-----+-----+---------+---------+---------+
Here I need to consider a condition: for each id, even if there are more than 3 rows of a type, I want to keep only 3 or fewer.
How can I do that?
Edit
new_df.show()
+---+-------+----+
| id| type|s_id|
+---+-------+----+
| 1| ios| 11|
| 1| ios| 12|
| 1| | 13|
| 1| | 14|
| 1|andriod| 15|
| 1| | 16|
| 1| | 17|
| 2|andriod| 18|
| 2| ios| 21|
+---+-------+----+
The result I am getting is below
+---+----+----+----+--------+----+----+
| id| 1| 2| 3|andriod1|ios1|ios2|
+---+----+----+----+--------+----+----+
| 1| 13| 14| 16| 15| 11| 12|
| 2|null|null|null| 18| 21|null|
+---+----+----+----+--------+----+----+
What I want is
+---+--------+--------+--------+----+----+----+
|id |android1|android2|android3|ios1|ios2|ios3|
+---+--------+--------+--------+----+----+----+
|1  |15      |null    |null    |11  |12  |null|
|2  |18      |null    |null    |21  |null|null|
+---+--------+--------+--------+----+----+----+
Using the following logic should get you your desired result.
A window function is used to generate a row number for each group of id and type, ordered by s_id. The generated row number is used to filter and is concatenated with type. Finally, grouping and pivoting gives you your desired output.
from pyspark.sql import Window
from pyspark.sql import functions as f

windowSpec = Window.partitionBy("id", "type").orderBy("s_id")

df.withColumn("ranks", f.row_number().over(windowSpec))\
    .filter(f.col("ranks") < 4)\
    .withColumn("type", f.concat(f.col("type"), f.col("ranks")))\
    .drop("ranks")\
    .groupBy("id")\
    .pivot("type")\
    .agg(f.first("s_id"))\
    .show(truncate=False)
which should give you
+---+--------+--------+--------+----+----+----+
|id |android1|android2|android3|ios1|ios2|ios3|
+---+--------+--------+--------+----+----+----+
|1 |15 |16 |17 |11 |12 |13 |
|2 |18 |null |null |21 |null|null|
+---+--------+--------+--------+----+----+----+
Answer for the edited part:
You just need an additional filter:
df.withColumn("ranks", f.row_number().over(windowSpec)) \
.filter(f.col("ranks") < 4) \
.filter(f.col("type") != "") \
.withColumn("type", f.concat(f.col("type"), f.col("ranks"))) \
.drop("ranks") \
.groupBy("id") \
.pivot("type") \
.agg(f.first("s_id")) \
.show(truncate=False)
which would give you
+---+--------+----+----+
|id |andriod1|ios1|ios2|
+---+--------+----+----+
|1 |15 |11 |12 |
|2 |18 |21 |null|
+---+--------+----+----+
Now this dataframe lacks the android2, android3 and ios3 columns because they are not present in your updated input data. You can add them using the withColumn API and populate them with null values, as in the sketch below.
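A minimal sketch, assuming the pivoted result above is captured in a variable (the name pivoted is illustrative) instead of ending in .show(), and that the expected pivot column names follow the spelling in your input data:
from pyspark.sql import functions as f

# Expected pivot columns (an assumption based on the desired output; spelling follows the input data)
expected_cols = ["andriod1", "andriod2", "andriod3", "ios1", "ios2", "ios3"]

# Add any missing pivot column, filled with nulls
for c in expected_cols:
    if c not in pivoted.columns:
        pivoted = pivoted.withColumn(c, f.lit(None).cast("long"))

pivoted.select("id", *expected_cols).show(truncate=False)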
I have created two data frames in pyspark like below. In these data frames I have column id. I want to perform a full outer join on these two data frames.
valuesA = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
a = sqlContext.createDataFrame(valuesA,['name','id'])
a.show()
+---------+---+
| name| id|
+---------+---+
| Pirate| 1|
| Monkey| 2|
| Ninja| 3|
|Spaghetti| 4|
+---------+---+
valuesB = [('dave',1),('Thor',2),('face',3), ('test',5)]
b = sqlContext.createDataFrame(valuesB,['Movie','id'])
b.show()
+-----+---+
|Movie| id|
+-----+---+
| dave| 1|
| Thor| 2|
| face| 3|
| test| 5|
+-----+---+
full_outer_join = a.join(b, a.id == b.id,how='full')
full_outer_join.show()
+---------+----+-----+----+
| name| id|Movie| id|
+---------+----+-----+----+
| Pirate| 1| dave| 1|
| Monkey| 2| Thor| 2|
| Ninja| 3| face| 3|
|Spaghetti| 4| null|null|
| null|null| test| 5|
+---------+----+-----+----+
I want to have a result like below when I do a full_outer_join
+---------+-----+----+
| name|Movie| id|
+---------+-----+----+
| Pirate| dave| 1|
| Monkey| Thor| 2|
| Ninja| face| 3|
|Spaghetti| null| 4|
| null| test| 5|
+---------+-----+----+
I have done it like below but am getting a different result
full_outer_join = a.join(b, a.id == b.id,how='full').select(a.id, a.name, b.Movie)
full_outer_join.show()
+---------+----+-----+
| name| id|Movie|
+---------+----+-----+
| Pirate| 1| dave|
| Monkey| 2| Thor|
| Ninja| 3| face|
|Spaghetti| 4| null|
| null|null| test|
+---------+----+-----+
As you can see, I am missing id 5 in my result data frame.
How can I achieve what I want?
Since the join columns have the same name, you can specify the join columns as a list:
a.join(b, ['id'], how='full').show()
+---+---------+-----+
| id| name|Movie|
+---+---------+-----+
| 5| null| test|
| 1| Pirate| dave|
| 3| Ninja| face|
| 2| Monkey| Thor|
| 4|Spaghetti| null|
+---+---------+-----+
Or coalesce the two id columns:
import pyspark.sql.functions as F
a.join(b, a.id == b.id, how='full').select(
F.coalesce(a.id, b.id).alias('id'), a.name, b.Movie
).show()
+---+---------+-----+
| id| name|Movie|
+---+---------+-----+
| 5| null| test|
| 1| Pirate| dave|
| 3| Ninja| face|
| 2| Monkey| Thor|
| 4|Spaghetti| null|
+---+---------+-----+
You can either rename the id column of dataframe b and drop it later (a sketch of that variant follows the output below), or use a list in the join condition:
a.join(b, ['id'], how='full')
Output:
+---+---------+-----+
|id |name |Movie|
+---+---------+-----+
|1 |Pirate |dave |
|3 |Ninja |face |
|5 |null |test |
|4 |Spaghetti|null |
|2 |Monkey |Thor |
+---+---------+-----+
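For reference, a minimal sketch of the rename-and-drop variant mentioned above (the column name b_id is illustrative). Note that after a full outer join you still need to coalesce the two id columns before dropping the renamed one, otherwise ids present only in b, such as 5, would come out null:
import pyspark.sql.functions as F

# Rename b's id so the joined result has no duplicate column name
b2 = b.withColumnRenamed('id', 'b_id')

a.join(b2, a.id == b2.b_id, how='full') \
    .withColumn('id', F.coalesce(a.id, b2.b_id)) \
    .drop('b_id') \
    .show()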
I have a PySpark df:
+---+---+---+---+---+---+---+---+
| id| a1| b1| c1| d1| e1| f1|ref|
+---+---+---+---+---+---+---+---+
| 0| 1| 23| 4| 8| 9| 5| b1|
| 1| 2| 43| 8| 10| 20| 43| e1|
| 2| 3| 15| 0| 1| 23| 7| b1|
| 3| 4| 2| 6| 11| 5| 8| d1|
| 4| 5| 6| 7| 2| 8| 1| f1|
+---+---+---+---+---+---+---+---+
I eventually want to create another column "out" whose values are based on the "ref" column. For example, in the first row the ref column has b1 as its value, so in the "out" column I would like to see column "b1"'s value, i.e., 23.
Here is the expected output:
+---+---+---+---+---+---+---+---+---+
| id| a1| b1| c1| d1| e1| f1|ref|out|
+---+---+---+---+---+---+---+---+---+
| 0| 1| 23| 4| 8| 9| 5| b1| 23|
| 1| 2| 43| 8| 10| 20| 43| e1| 20|
| 2| 3| 15| 0| 1| 23| 7| b1| 15|
| 3| 4| 2| 6| 11| 5| 8| d1| 11|
| 4| 5| 6| 7| 2| 8| 1| f1| 1|
+---+---+---+---+---+---+---+---+---+
Please advise on how to achieve the "out" column. I'm using Spark 1.6. Thanks.
Independent of version you can convert to RDD, map, and convert back to DataFrame:
df = spark.createDataFrame(
[(0, 1, 23, 4, 8, 9, 5, "b1"), (1, 2, 43, 8, 10, 20, 43, "e1")],
("id", "a1", "b1", "c1", "d1", "e1", "f1", "ref")
)
df.rdd.map(lambda row: row + (row[row.ref], )).toDF(df.columns + ["out"])
+---+---+---+---+---+---+---+---+---+
| id| a1| b1| c1| d1| e1| f1|ref|out|
+---+---+---+---+---+---+---+---+---+
| 0| 1| 23| 4| 8| 9| 5| b1| 23|
| 1| 2| 43| 8| 10| 20| 43| e1| 20|
+---+---+---+---+---+---+---+---+---+
You could also preserve the schema:
from pyspark.sql.types import LongType, StructField
spark.createDataFrame(
df.rdd.map(lambda row: row + (row[row.ref], )),
df.schema.add(StructField("out", LongType())))
With DataFrames you can compose complex Columns. In 1.6:
from pyspark.sql.functions import array, col, udf
from pyspark.sql.types import LongType, MapType, StringType
data_cols = [x for x in df.columns if x not in {"id", "ref"}]
# Literal map from column name to index
name_to_index = udf(
lambda: {x: i for i, x in enumerate(data_cols)},
MapType(StringType(), LongType())
)()
# Array of data
data_array = array(*[col(c) for c in data_cols])
df.withColumn("out", data_array[name_to_index[col("ref")]])
+---+---+---+---+---+---+---+---+---+
| id| a1| b1| c1| d1| e1| f1|ref|out|
+---+---+---+---+---+---+---+---+---+
| 0| 1| 23| 4| 8| 9| 5| b1| 23|
| 1| 2| 43| 8| 10| 20| 43| e1| 20|
+---+---+---+---+---+---+---+---+---+
In 2.x you can skip intermediate objects:
from pyspark.sql.functions import create_map, lit, col
from itertools import chain
# Map from column name to column value
name_to_value = create_map(*chain.from_iterable(
(lit(c), col(c)) for c in data_cols
))
df.withColumn("out", name_to_value[col("ref")])
+---+---+---+---+---+---+---+---+---+
| id| a1| b1| c1| d1| e1| f1|ref|out|
+---+---+---+---+---+---+---+---+---+
| 0| 1| 23| 4| 8| 9| 5| b1| 23|
| 1| 2| 43| 8| 10| 20| 43| e1| 20|
+---+---+---+---+---+---+---+---+---+
Finally you can use when:
from pyspark.sql.functions import col, lit, when
from functools import reduce
out = reduce(
    lambda acc, x: when(col("ref") == x, col(x)).otherwise(acc),
    data_cols,
    lit(None)
)
df.withColumn("out", out).show()
+---+---+---+---+---+---+---+---+---+
| id| a1| b1| c1| d1| e1| f1|ref|out|
+---+---+---+---+---+---+---+---+---+
| 0| 1| 23| 4| 8| 9| 5| b1| 23|
| 1| 2| 43| 8| 10| 20| 43| e1| 20|
+---+---+---+---+---+---+---+---+---+
The OP has asked for a Python solution. I'm answering the same in Spark Scala 2.x for reference. Hope it helps somebody.
scala> val df = Seq((0, 1, 23, 4, 8, 9, 5, "b1"), (1, 2, 43, 8, 10, 20, 43, "e1"), (2, 3, 15, 0, 1, 23, 7, "b1"),(3, 4, 2, 6, 11, 5, 8, "d1"),(4, 5, 6, 7, 2, 8, 1, "f1")).toDF("id", "a1", "b1", "c1", "d1", "e1", "f1", "ref")
df: org.apache.spark.sql.DataFrame = [id: int, a1: int ... 6 more fields]
scala> df.show(false)
+---+---+---+---+---+---+---+---+
|id |a1 |b1 |c1 |d1 |e1 |f1 |ref|
+---+---+---+---+---+---+---+---+
|0 |1 |23 |4 |8 |9 |5 |b1 |
|1 |2 |43 |8 |10 |20 |43 |e1 |
|2 |3 |15 |0 |1 |23 |7 |b1 |
|3 |4 |2 |6 |11 |5 |8 |d1 |
|4 |5 |6 |7 |2 |8 |1 |f1 |
+---+---+---+---+---+---+---+---+
scala> val colx = df.columns.filter(x=>x!="ref").filter(x=>x!="id")
colx: Array[String] = Array(a1, b1, c1, d1, e1, f1)
scala> val colm = colx.map( x=> when(col("ref")===lit(x),col(x)) )
colm: Array[org.apache.spark.sql.Column] = Array(CASE WHEN (ref = a1) THEN a1 END, CASE WHEN (ref = b1) THEN b1 END, CASE WHEN (ref = c1) THEN c1 END, CASE WHEN (ref = d1) THEN d1 END, CASE WHEN (ref = e1) THEN e1 END, CASE WHEN (ref = f1) THEN f1 END)
scala> df.select(col("*"),concat_ws("",array(colm:_*)).as("res1")).show(false)
+---+---+---+---+---+---+---+---+----+
|id |a1 |b1 |c1 |d1 |e1 |f1 |ref|res1|
+---+---+---+---+---+---+---+---+----+
|0 |1 |23 |4 |8 |9 |5 |b1 |23 |
|1 |2 |43 |8 |10 |20 |43 |e1 |20 |
|2 |3 |15 |0 |1 |23 |7 |b1 |15 |
|3 |4 |2 |6 |11 |5 |8 |d1 |11 |
|4 |5 |6 |7 |2 |8 |1 |f1 |1 |
+---+---+---+---+---+---+---+---+----+