Distinct counts using Apache Spark DataFrame or SQL

My Schema looks like below:
scala> airing.printSchema()
root
|-- program: struct (nullable = true)
| |-- detail: struct (nullable = true)
| | |-- contributors: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- contributorId: string (nullable = true)
| | | | |-- name: string (nullable = true)
| | | | |-- order: long (nullable = true)
I need to count appearances per unique actor to find the most popular actors.
My code is:
val castCounts = airing.groupBy("program.detail.contributors.name").count().sort(desc("count")).take(10)
To my surprise, I get duplicates, as shown in the results below. I expected each actor to appear once, with a distinct count:
[WrappedArray(),4344]
[WrappedArray(Matt Smith),16]
[WrappedArray(Phil Keoghan),15]
[WrappedArray(Don Adams, Barbara Feldon, Edward Platt),10]
[WrappedArray(Edward Platt, Don Adams, Barbara Feldon),10]

There are two steps.
First, use the explode function to flatten the data so that each row has only one contributor.
import org.apache.spark.sql.functions.{col, desc, explode}
val df = airing.withColumn("contributor", explode(col("program.detail.contributors")))
Then run the aggregation on the new DataFrame, in which contributor has been exploded.
val castCounts = df.groupBy("contributor.name").count().sort(desc("count")).take(10)
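To see why the flattening matters, here is a minimal plain-Python sketch (no Spark required) of the two groupings. The sample casts are made up for illustration, and Counter only models what groupBy(...).count() does: it is not the Spark API.

```python
from collections import Counter

# Hypothetical sample: each airing carries a list of contributor names.
airings = [
    ["Don Adams", "Barbara Feldon", "Edward Platt"],
    ["Edward Platt", "Don Adams", "Barbara Feldon"],
    ["Matt Smith"],
    [],
]

# Grouping on the whole array: two orderings of the same cast are different keys,
# and the empty array becomes a key of its own.
by_array = Counter(tuple(names) for names in airings)

# "Exploding" first: one entry per (airing, contributor), then count per name.
by_actor = Counter(name for names in airings for name in names)

print(by_array.most_common())
print(by_actor.most_common(3))
```

Grouping by the array treats every distinct list as its own key, which is where the WrappedArray(...) rows in the question come from. After flattening, each name is counted once per airing it appears in.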

Related

How do I transpose all columns to rows in Pyspark?

I am trying to transpose the columns to rows and load the result into a database. My input is a JSON file:
{"09087":{"values": ["76573433","2222322323","768346865"],"values1": ["7686548898","33256768","09864324567"],"values2": ["234523723","64238793333333","75478393333"],"values3": ["87765","46389333","9234689677"]},"090881": {"values": ["76573443433","22276762322323","7683878746865"],"values1": ["768637676548898","3398776256768","0986456834324567"],"values2": ["23877644523723","64238867658793333333","754788776393333"],"values3": ["87765","46389333","9234689677"]}}
Pyspark:
df = spark.read.option("multiline", "true").format("json").load("testfile.json")
Schema:
root
|-- 09087: struct (nullable = true)
| |-- values: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- values1: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- values2: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- values3: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- 090881: struct (nullable = true)
| |-- values: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- values1: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- values2: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- values3: array (nullable = true)
| | |-- element: string (containsNull = true)
Data:
df.show()
+--------------------+--------------------+
| 09087| 090881|
+--------------------+--------------------+
|{[76573433, 22223...|{[76573443433, 22...|
+--------------------+--------------------+
OUTPUT:
Name    values          values1           values2               values3
09087   76573433        7686548898        234523723             87765
09087   2222322323      33256768          64238793333333        46389333
09087   768346865       09864324567       75478393333           9234689677
090881  76573443433     768637676548898   23877644523723        87765
090881  22276762322323  3398776256768     64238867658793333333  46389333
090881  7683878746865   0986456834324567  754788776393333       9234689677
I only gave 2 columns as input here, but I actually have a lot of them. I have been trying to get this working; could someone please help? Thanks in advance.
PySpark translation of my Scala solution:
from pyspark.sql.functions import array, arrays_zip, col, explode, lit, struct
rdd = spark.sparkContext.parallelize([("""{"09087":{"values": ["76573433","2222322323","768346865"],"values1": ["7686548898","33256768","09864324567"],"values2": ["234523723","64238793333333","75478393333"],"values3": ["87765","46389333","9234689677"]},"090881": {"values": ["76573443433","22276762322323","7683878746865"],"values1": ["768637676548898","3398776256768","0986456834324567"],"values2": ["23877644523723","64238867658793333333","754788776393333"],"values3": ["87765","46389333","9234689677"]}}""" )])
df = spark.read.json(rdd)
(
    df.select(
        explode(  # explode the array into one row per top-level column
            array(
                *[
                    struct(  # make a struct from the column name and its values
                        lit(col_name).alias("Name"),
                        col(col_name + ".*"),
                    )
                    for col_name in df.columns
                ]
            )
        )
    )
    .select(
        col("col.Name").alias("Name"),
        explode(
            # arrays_zip makes an array of structs from multiple arrays; each
            # struct field is named after the index of its original array.
            arrays_zip(
                col("col.values"),
                col("col.values1"),
                col("col.values2"),
                col("col.values3"),
            )
        ).alias("columns"),
    )
    .select(col("Name"), col("columns.*"))  # '.*' turns struct fields into table columns
    .show()
)
+------+--------------+----------------+--------------------+----------+
| Name| 0| 1| 2| 3|
+------+--------------+----------------+--------------------+----------+
| 09087| 76573433| 7686548898| 234523723| 87765|
| 09087| 2222322323| 33256768| 64238793333333| 46389333|
| 09087| 768346865| 09864324567| 75478393333|9234689677|
|090881| 76573443433| 768637676548898| 23877644523723| 87765|
|090881|22276762322323| 3398776256768|64238867658793333333| 46389333|
|090881| 7683878746865|0986456834324567| 754788776393333|9234689677|
+------+--------------+----------------+--------------------+----------+
The original Scala version:
import org.apache.spark.sql.functions._
// make dummy data
val rdd = spark.sparkContext.parallelize(Seq(("""{"09087":{"values": ["76573433","2222322323","768346865"],"values1": ["7686548898","33256768","09864324567"],"values2": ["234523723","64238793333333","75478393333"],"values3": ["87765","46389333","9234689677"]},"090881": {"values": ["76573443433","22276762322323","7683878746865"],"values1": ["768637676548898","3398776256768","0986456834324567"],"values2": ["23877644523723","64238867658793333333","754788776393333"],"values3": ["87765","46389333","9234689677"]}}""" )))
val df = spark.sqlContext.read.json(rdd)
df.select(
  explode( // explode an array into rows
    array( // make an array
      (for (col_name <- df.columns)
        yield struct( // create a struct whose fields can be used as columns
          lit(s"$col_name").as("Name"),
          col(s"$col_name.*")
        )
      ).toSeq: _* // pass the sequence as varargs
    ).as("rows")
  )
).select(
  col("col.Name"),
  // arrays_zip merges several equal-length arrays into one array of structs;
  // each struct field is named after the index of its source array.
  expr("""explode(
    arrays_zip(
      col.values,
      col.values1,
      col.values2,
      col.values3))""").as("columns")
).select(
  col("Name"),
  col("columns.*") // rename as required
).show()
+------+--------------+----------------+--------------------+----------+
| Name| 0| 1| 2| 3|
+------+--------------+----------------+--------------------+----------+
| 09087| 76573433| 7686548898| 234523723| 87765|
| 09087| 2222322323| 33256768| 64238793333333| 46389333|
| 09087| 768346865| 09864324567| 75478393333|9234689677|
|090881| 76573443433| 768637676548898| 23877644523723| 87765|
|090881|22276762322323| 3398776256768|64238867658793333333| 46389333|
|090881| 7683878746865|0986456834324567| 754788776393333|9234689677|
+------+--------------+----------------+--------------------+----------+
For more information on arrays_zip, see the Spark SQL functions documentation.
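For anyone who finds the double explode hard to follow, here is a plain-Python sketch of the same reshaping, with no Spark involved. The miniature input is hypothetical, and zip_longest stands in for arrays_zip on the assumption that arrays_zip pads shorter arrays with nulls rather than truncating.

```python
from itertools import zip_longest

# Hypothetical miniature of the input: top-level key -> named arrays.
data = {
    "09087": {"values": ["a1", "a2"], "values1": ["b1", "b2"]},
    "090881": {"values": ["c1", "c2"], "values1": ["d1", "d2"]},
}

rows = []
for name, arrays in data.items():                 # first explode: one struct per top-level key
    for zipped in zip_longest(*arrays.values()):  # arrays_zip + second explode: one row per index
        rows.append((name, *zipped))

for row in rows:
    print(row)
```

The outer loop corresponds to exploding the array of (Name, struct) pairs, and the inner loop corresponds to exploding the zipped arrays.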

How to print nested data structure in a presentable way

I have the sample data and structure below, and I am playing around with them to better understand Spark SQL and PySpark commands.
schemaTest = ("`id` BIGINT NOT NULL,`name` STRING,`address` STRUCT<`number`: INT, `road`: STRING, "
              "`city`: STRUCT<`name`: STRING, `postcode`: BIGINT>>,`numbers` ARRAY<INT>")
data = [(1,"Smith",(1200,"North Custer RD",("Sugar Land TX",75034)),[2815,2133])]
This is what I get from printSchema:
root
|-- id: long (nullable = false)
|-- name: string (nullable = true)
|-- address: struct (nullable = true)
| |-- number: integer (nullable = true)
| |-- road: string (nullable = true)
| |-- city: struct (nullable = true)
| | |-- name: string (nullable = true)
| | |-- postcode: long (nullable = true)
|-- numbers: array (nullable = true)
| |-- element: integer (containsNull = true)
When I query the df, this is how it is displayed. I am trying to reformat the "address" column for a more readable representation:
+---+-----+-----------------------------------------------+------------+
|id |name |address |numbers |
+---+-----+-----------------------------------------------+------------+
|1 |Smith|{1200, North Custer RD, {Sugar Land TX, 75034}}|[2815, 2133]|
+---+-----+-----------------------------------------------+------------+
I want it to be more like this:
+---+-----+------------------------------------------+------------+
|id |name |address |numbers |
+---+-----+------------------------------------------+------------+
|1 |Smith|1200 North Custer RD, Sugar Land TX, 75034|[2815, 2133]|
+---+-----+------------------------------------------+------------+
I tried explode to see if I could extract the fields, but it reports a type mismatch (I assume explode cannot be applied to a StructType).
Can someone give me an example of reformatting the "address" column using withColumn, or suggest another approach?
You can use the built-in concat function to create a string from several columns, as follows:
from pyspark.sql import functions as F

result = input_df.withColumn(
    'address',
    F.concat(
        F.col('address.number'),
        F.lit(' '),
        F.col('address.road'),
        F.lit(', '),
        F.col('address.city.name'),
        F.lit(', '),
        F.col('address.city.postcode')
    )
)
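One thing to keep in mind with this approach: as far as I know, Spark's concat returns null as soon as any of its inputs is null. The plain-Python sketch below only models that behaviour on the sample row from the question; the concat helper here is a hypothetical stand-in, not the Spark function.

```python
def concat(*parts):
    """Model of Spark's concat: the result is None if any input is None."""
    if any(p is None for p in parts):
        return None
    return "".join(str(p) for p in parts)

# The sample row from the question, as a nested dict.
address = {
    "number": 1200,
    "road": "North Custer RD",
    "city": {"name": "Sugar Land TX", "postcode": 75034},
}

formatted = concat(
    address["number"], " ", address["road"], ", ",
    address["city"]["name"], ", ", address["city"]["postcode"],
)
print(formatted)

# A missing field wipes out the whole string in this model.
print(concat(address["number"], " ", None))
```

If some address fields can be null in your data, you may want to wrap them in coalesce with a default value before concatenating.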

Way to concatenate Array of structs

I have a column that contains array of structs. It looks like this:
|-- Network: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Code: string (nullable = true)
| | |-- Signal: string (nullable = true)
This is just a small sample; the struct has many more fields than this. Is there a way to take the array in this column for each row and concatenate its elements into one string? For example, given something like this:
[["example", 2], ["example2", 3]]
is there a way to turn it into:
"example2example3"?
Assuming you have a DataFrame df with the following schema:
df.printSchema
df with sample data:
df.show(false)
First, explode the Network array so that the struct fields Code and Signal can be selected.
import org.apache.spark.sql.functions._
import spark.implicits._
var myDf = df.select(explode($"Network").as("Network"))
Next, concatenate the two fields with concat() and pass the result to collect_list(), which aggregates all rows into a single row of type array<string>.
myDf = myDf.select(collect_list(concat($"Network.Code", $"Network.Signal")).as("data"))
Finally, use concat_ws() to join the array into one string. It takes two arguments: the separator to place between strings, and a column of type array<string>, which here is the output of the previous step. This use case needs no separator between the concatenated strings, so the separator is an empty string.
myDf = myDf.select(concat_ws("", $"data").as("data"))
All of the above can also be done in one expression, starting from the original df:
df.select(explode($"Network").as("Network")).select(concat_ws("", collect_list(concat($"Network.Code", $"Network.Signal"))).as("data")).show(false)
If you want the output in a String variable, use:
val myStr = myDf.first.get(0).toString
print(myStr)
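Here is a plain-Python sketch of what the three steps do to the data, without Spark. The single sample row is hypothetical and the list operations only model explode, concat, collect_list and concat_ws.

```python
# Hypothetical sample: each row's Network column is a list of {Code, Signal} structs.
rows = [
    {"Network": [{"Code": "example", "Signal": "2"},
                 {"Code": "example2", "Signal": "3"}]},
]

# explode: one entry per array element, across all rows
exploded = [net for row in rows for net in row["Network"]]

# concat per element, gathered into one list (the role of collect_list)
pieces = [net["Code"] + net["Signal"] for net in exploded]

# concat_ws with an empty separator
data = "".join(pieces)
print(data)
```

Note that, as in the Scala steps, there is no grouping key, so every row of the DataFrame ends up in one string. If you need one string per input row, you would have to group by some row identifier before collecting. I also would not rely on collect_list preserving element order after a shuffle.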
There is a library called spark-hats (Github, small article) that you might find very useful in these situations.
With its use, you can map the array easily and output the concatenation next to the elements or even somewhere else if you provide a fully qualified name.
Setup
import org.apache.spark.sql.functions._
import za.co.absa.spark.hats.Extensions._
scala> df.printSchema
root
|-- info: struct (nullable = true)
| |-- drivers: struct (nullable = true)
| | |-- carName: string (nullable = true)
| | |-- carNumbers: string (nullable = true)
| | |-- driver: string (nullable = true)
|-- teamName: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- team1: string (nullable = true)
| | |-- team2: string (nullable = true)
scala> df.show(false)
+---------------------------+------------------------------+
|info |teamName |
+---------------------------+------------------------------+
|[[RB7, 33, Max Verstappen]]|[[Redbull, rb], [Monster, mt]]|
+---------------------------+------------------------------+
Command you are looking for
scala> val dfOut = df.nestedMapColumn(inputColumnName = "teamName", outputColumnName = "nextElementInArray", expression = a => concat(a.getField("team1"), a.getField("team2")) )
dfOut: org.apache.spark.sql.DataFrame = [info: struct<drivers: struct<carName: string, carNumbers: string ... 1 more field>>, teamName: array<struct<team1:string,team2:string,nextElementInArray:string>>]
Output
scala> dfOut.printSchema
root
|-- info: struct (nullable = true)
| |-- drivers: struct (nullable = true)
| | |-- carName: string (nullable = true)
| | |-- carNumbers: string (nullable = true)
| | |-- driver: string (nullable = true)
|-- teamName: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- team1: string (nullable = true)
| | |-- team2: string (nullable = true)
| | |-- nextElementInArray: string (nullable = true)
scala> dfOut.show(false)
+---------------------------+----------------------------------------------------+
|info |teamName |
+---------------------------+----------------------------------------------------+
|[[RB7, 33, Max Verstappen]]|[[Redbull, rb, Redbullrb], [Monster, mt, Monstermt]]|
+---------------------------+----------------------------------------------------+

Pyspark issue loading xml files with com.databricks:spark-xml

I'm trying to get an academic POC working that relies on PySpark with com.databricks:spark-xml. The goal is to load the Stack Exchange Data Dump XML format (https://archive.org/details/stackexchange) into a PySpark DataFrame.
It works like a charm with XML that uses proper element tags, but fails with the Stack Exchange dump, which looks like this:
<users>
<row Id="-1" Reputation="1" CreationDate="2014-07-30T18:05:25.020" DisplayName="Community" LastAccessDate="2014-07-30T18:05:25.020" Location="on the server farm" AboutMe=" I feel pretty, Oh, so pretty" Views="0" UpVotes="26" DownVotes="701" AccountId="-1" />
</users>
Depending on the root tag and row tag, I get either an empty schema or... something:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.xml').option("rowTag", "users").load('./tmp/test/Users.xml')
df.printSchema()
df.show()
root
|-- row: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _AboutMe: string (nullable = true)
| | |-- _AccountId: long (nullable = true)
| | |-- _CreationDate: string (nullable = true)
| | |-- _DisplayName: string (nullable = true)
| | |-- _DownVotes: long (nullable = true)
| | |-- _Id: long (nullable = true)
| | |-- _LastAccessDate: string (nullable = true)
| | |-- _Location: string (nullable = true)
| | |-- _ProfileImageUrl: string (nullable = true)
| | |-- _Reputation: long (nullable = true)
| | |-- _UpVotes: long (nullable = true)
| | |-- _VALUE: string (nullable = true)
| | |-- _Views: long (nullable = true)
| | |-- _WebsiteUrl: string (nullable = true)
+--------------------+
| row|
+--------------------+
|[[Hi, I'm not ......|
+--------------------+
Spark : 1.6.0
Python : 2.7.15
Com.databricks : spark-xml_2.10:0.4.1
I would be extremely grateful for any advice.
Kind Regards,
P.
I tried the same approach (spark-xml on Stack Overflow dump files) some time ago and failed, mostly because the DataFrame is seen as an array of structs and the processing performance was really bad. Instead, I recommend using the standard text reader and mapping the Key="Value" pairs in every line with a UDF like this:
import re
pattern = re.compile(' ([A-Za-z]+)="([^"]*)"')
parse_line = lambda line: {key:value for key,value in pattern.findall(line)}
You can also use my code to get the proper data types: https://github.com/szczeles/pyspark-notebooks/blob/master/stackoverflow/stackexchange-convert.ipynb (the schema matches dumps for March 2017).
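To illustrate the parsing step on its own, here is a small self-contained sketch that runs without Spark. The sample lines are shortened from the dump excerpt in the question, and the "<row " filter is my own addition, not part of the snippet above.

```python
import re

pattern = re.compile(r' ([A-Za-z]+)="([^"]*)"')

def parse_line(line):
    """Return the attributes of one <row .../> line as a dict of strings."""
    return dict(pattern.findall(line))

lines = [
    '<?xml version="1.0" encoding="utf-8"?>',
    '<users>',
    '  <row Id="-1" Reputation="1" DisplayName="Community" Views="0" />',
    '</users>',
]

# Keep only the row lines: the XML declaration also carries key="value" pairs
# and would otherwise be parsed as a bogus record.
records = [parse_line(l) for l in lines if l.lstrip().startswith("<row ")]
print(records)
```

All values come back as strings, so numeric and date columns still need casting afterwards, and XML entities in the values are left as they are. In Spark, the same filter and map could presumably be applied to the lines returned by a text reader such as sc.textFile.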

How to create a single Array Structure column for "multiple" individual DF/DS columns with Spark SCALA

Say I have two tables, order_table and room_table.
order_table
+----------+---------+
| order_id | info |
+----------+---------+
| order1 | infos |
+----------+---------+
room_table with many columns
+----------+---------+-----+
| order_id | room_id | ... |
+----------+---------+-----+
| order1 | room1 | ... |
| order1 | room2 | ... |
+----------+---------+-----+
I want to take the result of select * from room_table, grouped by order_id and collected as a list, and add it to order_table as a new column rooms.
Output table should keep the schema:
-order_id string,
-info string,
-room array<struct>
--room_id string,
--room_price int,
--room_name string
-- ....
import org.apache.spark.sql.functions.{collect_list, struct}
import spark.implicits._
val df1 = Seq(("order_1", "order_1_info"),
("order_2", "order_2_info")).toDF("order_id", "info")
val df2 = Seq(("order_1", "room_1", 100, "palace_1"),
("order_2", "room_2", 200, "palace_2"),
("order_1", "room_3", 100, "palace_3"),
("order_2", "room_8", 200, "palace_x"))
.toDF("order_id", "room_id", "room_price", "room_name")
val cols: Array[String] = df2.columns
val df3 = df2.groupBy("order_id").agg(collect_list(struct(cols.head, cols.tail:_*)) as "room")
val df4 = df1.join(df3, Seq("order_id"))
df4.show()
df4.printSchema()
In the snippet above, I just created some sample DataFrames to work with.
Output:
+--------+------------+--------------------+
|order_id| info| room|
+--------+------------+--------------------+
| order_1|order_1_info|[[order_1,room_1,...|
| order_2|order_2_info|[[order_2,room_2,...|
+--------+------------+--------------------+
Schema:
root
|-- order_id: string (nullable = true)
|-- info: string (nullable = true)
|-- room: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- order_id: string (nullable = true)
| | |-- room_id: string (nullable = true)
| | |-- room_price: integer (nullable = false)
| | |-- room_name: string (nullable = true)
I hope this is helpful
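If it helps to see the logic outside Spark, here is a plain-Python sketch of the same two steps: group the rooms by order_id, then inner-join the groups onto the orders. It reuses the sample values from the snippet above; the dict and list operations only model groupBy, collect_list(struct(...)) and join.

```python
from collections import defaultdict

orders = [("order_1", "order_1_info"), ("order_2", "order_2_info")]
rooms = [
    ("order_1", "room_1", 100, "palace_1"),
    ("order_2", "room_2", 200, "palace_2"),
    ("order_1", "room_3", 100, "palace_3"),
    ("order_2", "room_8", 200, "palace_x"),
]

# groupBy("order_id") + collect_list(struct(all columns)):
# each whole room row is kept as one "struct" in the list for its order.
rooms_by_order = defaultdict(list)
for room in rooms:
    rooms_by_order[room[0]].append(room)

# Inner join on order_id: orders without any room are dropped.
joined = [
    (order_id, info, rooms_by_order[order_id])
    for order_id, info in orders
    if order_id in rooms_by_order
]

for row in joined:
    print(row)
```

Because the join in the Scala snippet is an inner join by default, an order with no rooms disappears from the result. Passing "left" as the join type, as in df1.join(df3, Seq("order_id"), "left"), should keep such orders with a null room column.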
