Convert Dataframe into Array of Json - apache-spark

I have created a Spark DataFrame in the following manner:
+----+-------+
| age| number|
+----+-------+
| 16| 12|
| 16| 13|
| 16| 14|
| 17| 15|
| 17| 16|
| 17| 17|
+----+-------+
I want to convert it into the following JSON format:
[{
'age' : 16,
'name' : [12,13,14]
},{
'age' : 17,
'name' : [15,16,17]
}]
How can I achieve this?

You can try the to_json function. Something like this:
import spark.implicits._
import org.apache.spark.sql.functions._
val list = List((16, 12), (16, 13), (16, 14), (17, 15), (17, 16), (17, 17))
val df = spark.sparkContext.parallelize(list).toDF("age", "number")
val jsondf = df.groupBy($"age").agg(collect_list($"number").as("name"))
  .withColumn("json", to_json(struct($"age", $"name")))
  .drop("age", "name")
  .agg(collect_list($"json").as("json"))
The results are below. I hope it helps.
+------------------------------------------------------------+
|json |
+------------------------------------------------------------+
|[{"age":16,"name":[12,13,14]}, {"age":17,"name":[15,16,17]}]|
+------------------------------------------------------------+
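If you need that array as a single JSON string on the driver (to hand off to another system, say), here is a minimal sketch reusing the jsondf built above; the variable name jsonArrayString is just illustrative.
// The aggregated "json" column is an array of JSON strings, so collect the
// single row and join the strings into one JSON array by hand.
val jsonArrayString: String = jsondf
  .first()
  .getSeq[String](0)
  .mkString("[", ",", "]")
println(jsonArrayString)
// [{"age":16,"name":[12,13,14]},{"age":17,"name":[15,16,17]}]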

Related

otherwise clause not working as expected, what's wrong here?

I am using spark-sql 2.4.1. Depending on the value of the code column, I need to get looked-up values for the given value columns, as shown below.
Sample data:
import spark.implicits._
val data = List(
  ("20", "score", "school", "2018-03-31", 14, 12),
  ("21", "score", "school", "2018-03-31", 13, 13),
  ("22", "rate", "school", "2018-03-31", 11, 14),
  ("21", "rate", "school", "2018-03-31", 13, 12)
)
val df = data.toDF("id", "code", "entity", "date", "value1", "value2")
df.show
+---+-----+------+----------+------+------+
| id| code|entity| date|value1|value2|
+---+-----+------+----------+------+------+
| 20|score|school|2018-03-31| 14| 12|
| 21|score|school|2018-03-31| 13| 13|
| 22| rate|school|2018-03-31| 11| 14|
| 21| rate|school|2018-03-31| 13| 12|
+---+-----+------+----------+------+------+
val resultDs = df
  .withColumn("value1",
    when(col("code").isin("rate"), functions.callUDF("udfFunc", col("value1")))
      .otherwise(col("value1").cast(DoubleType))
  )
udfFunc maps values as follows:
11->a
12->b
13->c
14->d
Expected output
+---+-----+------+----------+------+------+
| id| code|entity| date|value1|value2|
+---+-----+------+----------+------+------+
| 20|score|school|2018-03-31| 14| 12|
| 21|score|school|2018-03-31| 13| 13|
| 22| rate|school|2018-03-31| a | 14|
| 21| rate|school|2018-03-31| c | 12|
+---+-----+------+----------+------+------+
But it is giving output as
+---+-----+------+----------+------+------+
| id| code|entity| date|value1|value2|
+---+-----+------+----------+------+------+
| 20|score|school|2018-03-31| null| 12|
| 21|score|school|2018-03-31| null| 13|
| 22| rate|school|2018-03-31| a | 14|
| 21| rate|school|2018-03-31| c | 12|
+---+-----+------+----------+------+------+
why "otherwise" condition is not working as expected. any idea what is wrong here ??
A column must contain a single datatype.
Note: DoubleType cannot hold StringType data (the UDF returns strings), so you need to cast to StringType instead of DoubleType.
val resultDs = df
  .withColumn("value1",
    when(col("code") === lit("rate"), functions.callUDF("udfFunc", col("value1")))
      .otherwise(col("value1").cast(StringType)) // should be StringType
  )
Or
val resultDs = df
  .withColumn("value1",
    when(col("code").isin("rate"), functions.callUDF("udfFunc", col("value1")))
      .otherwise(col("value1").cast(StringType)) // modified to StringType
  )
I would suggest modifying it to
df
  .withColumn("value1",
    when(col("code") === lit("rate"), functions.callUDF("udfFunc", col("value1")))
      .otherwise(col("value1").cast(StringType))
  )
and checking again.
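For reference, a minimal end-to-end sketch of the fix. The udfFunc registered here is a stand-in built from the 11->a, 12->b, 13->c, 14->d mapping in the question; the real UDF may differ.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType
// Hypothetical stand-in for the question's udfFunc.
val lookup = Map(11 -> "a", 12 -> "b", 13 -> "c", 14 -> "d")
spark.udf.register("udfFunc", (v: Int) => lookup.getOrElse(v, null))
val resultDs = df
  .withColumn("value1",
    when(col("code") === lit("rate"), callUDF("udfFunc", col("value1")))
      .otherwise(col("value1").cast(StringType)) // both branches are now strings
  )
resultDs.show()
// rate rows become "a"/"c"; score rows keep "14"/"13" (as strings)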

Copy current row, modify it and add a new row in Spark

I am using spark-sql 2.4.1 with Java 8.
I have a scenario where I need to copy the current row and create another row with a few columns modified. How can this be achieved in Spark SQL?
Example:
Given
import spark.implicits._
val data = List(
  ("20", "score", "school", 14, 12),
  ("21", "score", "school", 13, 13),
  ("22", "rate", "school", 11, 14)
)
val df = data.toDF("id", "code", "entity", "value1", "value2")
Current Output
+---+-----+------+------+------+
| id| code|entity|value1|value2|
+---+-----+------+------+------+
| 20|score|school| 14| 12|
| 21|score|school| 13| 13|
| 22| rate|school| 11| 14|
+---+-----+------+------+------+
When column "code" is "rate" copy it as two rows i.e. one is
original , second it is another row with new code "old_ rate" like
below
Expected output :
+---+--------+------+------+------+
| id| code|entity|value1|value2|
+---+--------+------+------+------+
| 20| score|school| 14| 12|
| 21| score|school| 13| 13|
| 22| rate|school| 11| 14|
| 22|new_rate|school| 11| 14|
+---+--------+------+------+------+
How can this be achieved?
You can use this approach for your scenario:
df.union(df.filter($"code" === "rate").withColumn("code", concat(lit("new_"), $"code"))).show()
/*
+---+--------+------+------+------+
| id| code|entity|value1|value2|
+---+--------+------+------+------+
| 20| score|school| 14| 12|
| 21| score|school| 13| 13|
| 22| rate|school| 11| 14|
| 22|new_rate|school| 11| 14|
+---+--------+------+------+------+
*/
Use when to check whether code === "rate"; if it matches, replace the column value with array(lit("rate"), lit("new_rate")), otherwise use array($"code"), then explode the code column.
Check below code.
scala> df.show(false)
+---+-----+------+------+------+
|id |code |entity|value1|value2|
+---+-----+------+------+------+
|20 |score|school|14 |12 |
|21 |score|school|13 |13 |
|22 |rate |school|11 |14 |
+---+-----+------+------+------+
val colExpr = explode(
  when($"code" === "rate", array(lit("rate"), lit("new_rate")))
    .otherwise(array($"code"))
)
scala> df.withColumn("code",colExpr).show(false)
+---+--------+------+------+------+
|id |code |entity|value1|value2|
+---+--------+------+------+------+
|20 |score |school|14 |12 |
|21 |score |school|13 |13 |
|22 |rate |school|11 |14 |
|22 |new_rate|school|11 |14 |
+---+--------+------+------+------+

What does withReplacement do, if specified for sample against a Spark Dataframe

Reading the Spark documentation: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sample
there is a boolean parameter withReplacement without much explanation.
sample(withReplacement, fraction, seed=None)
What is it and how do we use it?
The parameter withReplacement controls the uniqueness of the sample result. If we treat a Dataset as a bucket of balls, withReplacement=true means taking a random ball out of the bucket and placing it back into it, so the same ball can be picked up again.
Assuming all elements in a Dataset are unique:
withReplacement=true: the same element can be produced more than once as the result of sample.
withReplacement=false: each element of the dataset will be sampled at most once.
import spark.implicits._
val df = Seq(1, 2, 3, 5, 6, 7, 8, 9, 10).toDF("ids")
df.show()
df.sample(true, 0.5, 5).show()
df.sample(false, 0.5, 5).show()
Result
+---+
|ids|
+---+
| 1|
| 2|
| 3|
| 5|
| 6|
| 7|
| 8|
| 9|
| 10|
+---+
+---+
|ids|
+---+
| 6|
| 7|
| 7|
| 9|
| 10|
+---+
+---+
|ids|
+---+
| 1|
| 3|
| 7|
| 8|
| 9|
+---+
This is actually mentioned in the Spark docs as of version 2.3:
https://spark.apache.org/docs/2.3.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sample
withReplacement – Sample with replacement
case class Member(id: Int, name: String, role: String)
val member1 = Member(1, "User1", "Data Engineer")
val member2 = Member(2, "User2", "Software Engineer")
val member3 = Member(3, "User3", "DevOps Engineer")
val memberDF = Seq(member1, member2, member3).toDF()
memberDF.sample(true, 0.4).show
+---+-----+-----------------+
| id| name| role|
+---+-----+-----------------+
| 1|User1| Data Engineer|
| 2|User2|Software Engineer|
+---+-----+-----------------+
memberDF.sample(true, 0.4).show
+---+-----+---------------+
| id| name| role|
+---+-----+---------------+
| 3|User3|DevOps Engineer|
+---+-----+---------------+
memberDF.sample(true, 0.4).show
+---+-----+-----------------+
| id| name| role|
+---+-----+-----------------+
| 2|User2|Software Engineer|
| 3|User3| DevOps Engineer|
+---+-----+-----------------+
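To see the "replacement" part more directly, note that with withReplacement=true the fraction is the expected number of times each row is drawn, so it may exceed 1 and the same row can come back several times; without replacement the fraction must stay within [0, 1] and no row is repeated. A small sketch reusing memberDF; the seed 7 is arbitrary and the exact rows returned will vary.
// Each member is expected to appear about twice, but may show up 0, 1, 2, ... times.
memberDF.sample(true, 2.0, 7).show()
// Every member appears at most once; with fraction 1.0 that is all of them.
memberDF.sample(false, 1.0, 7).show()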

Reading Hive Tables in Spark Dataframe without header

I have the following Hive table:
select * from employee;
OK
abc 19 da
xyz 25 sa
pqr 30 er
suv 45 dr
When I read this in Spark (PySpark):
df = hiveCtx.sql('select* from spark_hive.employee')
df.show()
+----+----+-----+
|name| age| role|
+----+----+-----+
|name|null| role|
| abc| 19| da|
| xyz| 25| sa|
| pqr| 30| er|
| suv| 45| dr|
+----+----+-----+
I end up getting the header in my Spark DataFrame. Is there a simple way to remove it?
Also, am I missing something while reading the table into the DataFrame? (Ideally I shouldn't be getting the header, right?)
You have to remove the header row from the result. You can do it like this:
scala> val df = sql("select * from employee")
df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]
scala> df.show
+----+----+----+
| id|name| age|
+----+----+----+
|null|name|null|
| 1| abc| 19|
| 2| xyz| 25|
| 3| pqr| 30|
| 4| suv| 45|
+----+----+----+
scala> val header = df.first()
header: org.apache.spark.sql.Row = [null,name,null]
scala> val data = df.filter(row => row != header)
data: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, name: string ... 1 more field]
scala> data.show
+---+----+---+
| id|name|age|
+---+----+---+
| 1| abc| 19|
| 2| xyz| 25|
| 3| pqr| 30|
| 4| suv| 45|
+---+----+---+
Thanks.
You can use skip.header.line.count to skip this header. You can also specify it while creating the table. For example:
create external table testtable (id int, name string, age int)
row format delimited .............
tblproperties ("skip.header.line.count"="1");
After that, load the data and then check your query; I hope you will get the expected output.
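If the table already exists, you may be able to add the same property afterwards with an ALTER TABLE statement; a sketch is below. The table name spark_hive.employee is taken from the question, and whether the property is honored when reading depends on your Spark/Hive versions and the table's storage format (it typically applies to text-format tables).
// Add the header-skip property to the existing Hive table.
spark.sql(
  """ALTER TABLE spark_hive.employee
    |SET TBLPROPERTIES ("skip.header.line.count" = "1")""".stripMargin)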
Not the most elegant way, but this worked with PySpark:
header = dfemp.first()  # the header row that ended up in the data
rddWithoutHeader = dfemp.rdd.filter(lambda row: row != header)
dfnew = sqlContext.createDataFrame(rddWithoutHeader)
