I am trying to transpose columns to rows and load the result into a database. My input is the JSON file below.
{"09087":{"values": ["76573433","2222322323","768346865"],"values1": ["7686548898","33256768","09864324567"],"values2": ["234523723","64238793333333","75478393333"],"values3": ["87765","46389333","9234689677"]},"090881": {"values": ["76573443433","22276762322323","7683878746865"],"values1": ["768637676548898","3398776256768","0986456834324567"],"values2": ["23877644523723","64238867658793333333","754788776393333"],"values3": ["87765","46389333","9234689677"]}}
PySpark:
df = spark.read.option("multiline", "true").format("json").load("testfile.json")
Schema:
root
|-- 09087: struct (nullable = true)
| |-- values: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- values1: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- values2: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- values3: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- 090881: struct (nullable = true)
| |-- values: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- values1: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- values2: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- values3: array (nullable = true)
| | |-- element: string (containsNull = true)
Data:
df.show()
+--------------------+--------------------+
| 09087| 090881|
+--------------------+--------------------+
|{[76573433, 22223...|{[76573443433, 22...|
+--------------------+--------------------+
OUTPUT:
Name    values          values1           values2               values3
09087   76573433        7686548898        234523723             87765
09087   2222322323      33256768          64238793333333        9234689677
09087   768346865       09864324567       75478393333           46389333
090881  76573443433     768637676548898   23877644523723        87765
090881  22276762322323  3398776256768     64238867658793333333  46389333
090881  7683878746865   0986456834324567  754788776393333       9234689677
Actually I gave just 2 columns as input here, but I have a lot of them. I have been trying to do this myself; could someone please help me with it? Thanks in advance.
PySpark translation of my Scala solution:
from pyspark.sql.functions import explode, array, struct, lit, col, arrays_zip
rdd = spark.sparkContext.parallelize([("""{"09087":{"values": ["76573433","2222322323","768346865"],"values1": ["7686548898","33256768","09864324567"],"values2": ["234523723","64238793333333","75478393333"],"values3": ["87765","46389333","9234689677"]},"090881": {"values": ["76573443433","22276762322323","7683878746865"],"values1": ["768637676548898","3398776256768","0986456834324567"],"values2": ["23877644523723","64238867658793333333","754788776393333"],"values3": ["87765","46389333","9234689677"]}}""" )])
df = spark.read.json(rdd)
df.select(
    explode(                      # explode the array into rows
        array(
            *[struct(             # make a struct from the column name and its values
                lit(col_name).alias("Name"),
                col(col_name + ".*")
            ) for col_name in df.columns]
        )
    )
).select(
    col("col.Name").alias("Name"),
    explode(
        arrays_zip(               # zip several equal-length arrays into one array of structs;
            col("col.values"),    # each struct field is named by its index in the original arrays
            col("col.values1"),
            col("col.values2"),
            col("col.values3")
        )
    ).alias("columns")
).select(
    col("Name"),
    col("columns.*")              # use '.*' to turn struct fields into top-level columns
).show()
+------+--------------+----------------+--------------------+----------+
| Name| 0| 1| 2| 3|
+------+--------------+----------------+--------------------+----------+
| 09087| 76573433| 7686548898| 234523723| 87765|
| 09087| 2222322323| 33256768| 64238793333333| 46389333|
| 09087| 768346865| 09864324567| 75478393333|9234689677|
|090881| 76573443433| 768637676548898| 23877644523723| 87765|
|090881|22276762322323| 3398776256768|64238867658793333333| 46389333|
|090881| 7683878746865|0986456834324567| 754788776393333|9234689677|
+------+--------------+----------------+--------------------+----------+
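Since you mention having many more columns than the two in the sample, here is a minimal sketch that avoids hardcoding values..values3. It is untested against your real data and assumes every top-level struct has the same inner array fields (as in your sample); the final toDF renames the positional columns back to the original array names.
from pyspark.sql.functions import explode, array, struct, lit, col, arrays_zip

# field names of the inner struct, e.g. ['values', 'values1', 'values2', 'values3']
inner_fields = df.schema[df.columns[0]].dataType.fieldNames()

exploded = df.select(
    explode(array(*[
        struct(lit(c).alias("Name"), col(c + ".*")) for c in df.columns
    ]))
)

result = (exploded
    .select(
        col("col.Name").alias("Name"),
        explode(arrays_zip(*[col("col." + f) for f in inner_fields])).alias("columns")
    )
    .select("Name", "columns.*")
    .toDF("Name", *inner_fields))  # rename the positional columns back to values, values1, ...

result.show(truncate=False)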
The original Scala solution:
// make dummy data
import org.apache.spark.sql.functions._

val rdd = spark.sparkContext.parallelize(Seq(("""{"09087":{"values": ["76573433","2222322323","768346865"],"values1": ["7686548898","33256768","09864324567"],"values2": ["234523723","64238793333333","75478393333"],"values3": ["87765","46389333","9234689677"]},"090881": {"values": ["76573443433","22276762322323","7683878746865"],"values1": ["768637676548898","3398776256768","0986456834324567"],"values2": ["23877644523723","64238867658793333333","754788776393333"],"values3": ["87765","46389333","9234689677"]}}""")))
val df = spark.sqlContext.read.json(rdd)
df.select(
    explode(                              // explode an array into rows
      array(                              // make an array
        (for (col_name <- df.columns)
          yield
            struct(                       // create a struct with names that can be used as columns
              lit(col_name).as("Name"),
              col(s"$col_name.*")
            )
        ).toSeq: _*                       // turn the sequence into varargs
      ).as("rows")
    )
  ).select(
    col("col.Name"),
    explode(
      arrays_zip(                         // arrays_zip stitches several equal-length arrays into one
        col("col.values"),                // array of structs, with struct fields named by their index
        col("col.values1"),               // in the original arrays
        col("col.values2"),
        col("col.values3")
      )
    ).as("columns")
  ).select(
    col("Name"),
    col("columns.*")                      // rename as required
  ).show()
+------+--------------+----------------+--------------------+----------+
| Name| 0| 1| 2| 3|
+------+--------------+----------------+--------------------+----------+
| 09087| 76573433| 7686548898| 234523723| 87765|
| 09087| 2222322323| 33256768| 64238793333333| 46389333|
| 09087| 768346865| 09864324567| 75478393333|9234689677|
|090881| 76573443433| 768637676548898| 23877644523723| 87765|
|090881|22276762322323| 3398776256768|64238867658793333333| 46389333|
|090881| 7683878746865|0986456834324567| 754788776393333|9234689677|
+------+--------------+----------------+--------------------+----------+
For more information, see the documentation for arrays_zip.
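As a tiny illustration of what arrays_zip does (toy data, not from the question): zipping two equal-length arrays yields a single array of structs in which element i pairs the i-th entries of the inputs.
from pyspark.sql.functions import arrays_zip, col

toy = spark.createDataFrame([(["a", "b"], ["x", "y"])], ["arr1", "arr2"])
toy.select(arrays_zip(col("arr1"), col("arr2")).alias("zipped")).show(truncate=False)
# Each element of "zipped" pairs arr1[i] with arr2[i]. Depending on the Spark version,
# the struct fields are named by position (0, 1, ...), as in the tables above, or after
# the source columns.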
Related
I have a column that contains array of structs. It looks like this:
|-- Network: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Code: string (nullable = true)
| | |-- Signal: string (nullable = true)
This is just a small sample, there are many more columns inside the struct than this. Is there a way to take the arrays in the column for each row, concatenate them and make them into one string? For example, we could have something like this:
[["example", 2], ["example2", 3]]
Is there a way to make into:
"example2example3"?
Assuming a dataframe df with the following schema:
df.printSchema
df with sample data:
df.show(false)
You first need to explode the Network array so that you can select the struct elements Code and Signal.
import spark.implicits._
import org.apache.spark.sql.functions._

var myDf = df.select(explode($"Network").as("Network"))
Then concatenate the two columns with the concat() function and pass the result to collect_list(), which aggregates all rows into a single row of type array<string>:
myDf = myDf.select(collect_list(concat($"Network.code",$"Network.signal")).as("data"))
Finally, you need to concatenate into the required format, which can be done with the concat_ws() function. It takes two arguments: the separator to place between two strings, and a column of type array<string>, which is the output of the previous step. Since your use case needs no separator between the concatenated strings, we pass an empty string as the separator.
myDf = myDf.select(concat_ws("",$"data").as("data"))
All of the above steps can be done in one line:
df.select(explode($"Network").as("Network")).select(concat_ws("", collect_list(concat($"Network.code", $"Network.signal"))).as("data")).show(false)
If you want the output directly into a String variable then use:
val myStr = myDf.first.get(0).toString
print(myStr)
There is a library called spark-hats (on GitHub, with a short introductory article) that you might find very useful in these situations.
With it you can map over the array easily and output the concatenation next to the existing elements, or even somewhere else if you provide a fully qualified name.
Setup
import org.apache.spark.sql.functions._
import za.co.absa.spark.hats.Extensions._
scala> df.printSchema
root
|-- info: struct (nullable = true)
| |-- drivers: struct (nullable = true)
| | |-- carName: string (nullable = true)
| | |-- carNumbers: string (nullable = true)
| | |-- driver: string (nullable = true)
|-- teamName: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- team1: string (nullable = true)
| | |-- team2: string (nullable = true)
scala> df.show(false)
+---------------------------+------------------------------+
|info |teamName |
+---------------------------+------------------------------+
|[[RB7, 33, Max Verstappen]]|[[Redbull, rb], [Monster, mt]]|
+---------------------------+------------------------------+
Command you are looking for
scala> val dfOut = df.nestedMapColumn(inputColumnName = "teamName", outputColumnName = "nextElementInArray", expression = a => concat(a.getField("team1"), a.getField("team2")) )
dfOut: org.apache.spark.sql.DataFrame = [info: struct<drivers: struct<carName: string, carNumbers: string ... 1 more field>>, teamName: array<struct<team1:string,team2:string,nextElementInArray:string>>]
Output
scala> dfOut.printSchema
root
|-- info: struct (nullable = true)
| |-- drivers: struct (nullable = true)
| | |-- carName: string (nullable = true)
| | |-- carNumbers: string (nullable = true)
| | |-- driver: string (nullable = true)
|-- teamName: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- team1: string (nullable = true)
| | |-- team2: string (nullable = true)
| | |-- nextElementInArray: string (nullable = true)
scala> dfOut.show(false)
+---------------------------+----------------------------------------------------+
|info |teamName |
+---------------------------+----------------------------------------------------+
|[[RB7, 33, Max Verstappen]]|[[Redbull, rb, Redbullrb], [Monster, mt, Monstermt]]|
+---------------------------+----------------------------------------------------+
I have a spark dataframe as shown below with a struct field.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StringType, IntegerType}

val arrayStructData = Seq(
  Row("James", Row("Java", "XX", 120)),
  Row("Michael", Row("Java", "", 200)),
  Row("Robert", Row("Java", "XZ", null)),
  Row("Washington", Row("", "XX", 120))
)
val arrayStructSchema = new StructType().add("name", StringType)
  .add("my_struct", new StructType().add("name", StringType).add("author", StringType).add("pages", IntegerType))
val df = spark.createDataFrame(spark.sparkContext.parallelize(arrayStructData), arrayStructSchema)
df.printSchema()
root
|-- name: string (nullable = true)
|-- my_struct: struct (nullable = true)
| |-- name: string (nullable = true)
| |-- author: string (nullable = true)
| |-- pages: integer (nullable = true)
df.show(false)
+----------+---------------+
|name |my_struct |
+----------+---------------+
|James |[Java, XX, 120]|
|Michael |[Java, , 200] |
|Robert |[Java, XZ,] |
|Washington|[, XX, 120] |
+----------+---------------+
I want to construct an output column called final_list which shows the absence or presence of each element in the struct. The problem is that the struct has only 3 elements in this example, but in the actual data there are about 1,000 elements in the struct, and every record may or may not contain a value for each element.
Here is how I want to construct the column -
val cleaned_df = spark.sql(s"""select name, case when my_struct.name = "" then "" else "name" end as name_present
, case when my_struct.author = "" then "" else "author" end as author_present
, case when my_struct.pages = "" then "" else "pages" end as pages_present
from df""")
cleaned_df.createOrReplaceTempView("cleaned_df")
cleaned_df.show(false)
+----------+------------+--------------+-------------+
|name |name_present|author_present|pages_present|
+----------+------------+--------------+-------------+
|James |name |author |pages |
|Michael |name | |pages |
|Robert |name |author |pages |
|Washington| |author |pages |
+----------+------------+--------------+-------------+
So I write a case statement for every column to capture its presence or absence, and then concatenate as below to get the final output:
val final_df = spark.sql(s"""
select name, concat_ws("," , name_present, author_present, pages_present) as final_list
from cleaned_df
""")
final_df.show(false)
+----------+-----------------+
|name |final_list |
+----------+-----------------+
|James |name,author,pages|
|Michael |name,,pages |
|Robert |name,author,pages|
|Washington|,author,pages |
+----------+-----------------+
I cannot write a giant case statement to capture this for a 1,000-element struct. Is there a smarter way to do this? Perhaps a UDF?
I am using Spark 2.4.3. I don't know whether any higher-order functions support this. The schema of my real dataframe looks like this:
|-- name: string (nullable = true)
|-- my_struct: struct (nullable = true)
| |-- name: string (nullable = true)
| |-- author: string (nullable = true)
| |-- element3: integer (nullable = true)
| |-- element4: string (nullable = true)
| |-- element5: double (nullable = true)
.....
.....
| |-- element1000: string (nullable = true)
You already mentioned a UDF. With a UDF you can iterate over all fields of my_struct and collect the flags:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

// collect the names of all struct fields that are neither null nor empty
def availableFields = (in: Row) => {
  val ret = scala.collection.mutable.ListBuffer.empty[String]
  for (i <- Range(0, in.size)) {
    if (!in.isNullAt(i) && in.get(i) != "") {
      ret += in.schema.fields(i).name
    }
  }
  ret.mkString(",")
}
val availableFieldsUdf = udf(availableFields)

df.withColumn("final_list", availableFieldsUdf(col("my_struct"))).show(false)
prints
+----------+---------------+-----------------+
|name |my_struct |final_list |
+----------+---------------+-----------------+
|James |[Java, XX, 120]|name,author,pages|
|Michael |[Java, , 200] |name,pages |
|Robert |[Java, XZ,] |name,author |
|Washington|[, XX, 120] |author,pages |
+----------+---------------+-----------------+
Without a UDF.
Schema
scala> df.printSchema
root
|-- name: string (nullable = true)
|-- my_struct: struct (nullable = true)
| |-- name: string (nullable = true)
| |-- author: string (nullable = true)
| |-- pages: integer (nullable = true)
Constructing Expression
scala> import org.apache.spark.sql.functions._

scala> val expr = df
  .select("my_struct.*")                                       // extract the struct's columns
  .columns
  .map(c => (c, trim(col(s"my_struct.${c}"))))                 // build tuples ("name", trim(col("my_struct.name")))
  .map(c => when(c._2.isNotNull and c._2 =!= "", lit(c._1)))   // keep the field name only when it is neither null nor empty
expr: Array[org.apache.spark.sql.Column] = Array(CASE WHEN ((trim(my_struct.name) IS NOT NULL) AND (NOT (trim(my_struct.name) = ))) THEN name END, CASE WHEN ((trim(my_struct.author) IS NOT NULL) AND (NOT (trim(my_struct.author) = ))) THEN author END, CASE WHEN ((trim(my_struct.pages) IS NOT NULL) AND (NOT (trim(my_struct.pages) = ))) THEN pages END)
Applying Expression to DataFrame
scala> df.withColumn("final_list",concat_ws(",",expr:_*)).show
+----------+---------------+-----------------+
| name| my_struct| final_list|
+----------+---------------+-----------------+
| James|[Java, XX, 120]|name,author,pages|
| Michael| [Java, , 200]| name,pages|
| Robert| [Java, XZ,]| name,author|
|Washington| [, XX, 120]| author,pages|
+----------+---------------+-----------------+
I'm trying to get an academic POC working that relies on PySpark with com.databricks:spark-xml. The goal is to load the Stack Exchange Data Dump XML format (https://archive.org/details/stackexchange) into a PySpark df.
It works like a charm with correctly formatted XML with proper tags, but it fails with the Stack Exchange dump, which looks like this:
<users>
<row Id="-1" Reputation="1" CreationDate="2014-07-30T18:05:25.020" DisplayName="Community" LastAccessDate="2014-07-30T18:05:25.020" Location="on the server farm" AboutMe=" I feel pretty, Oh, so pretty" Views="0" UpVotes="26" DownVotes="701" AccountId="-1" />
</users>
Depending on the root tag and row tag, I'm getting either an empty schema or... something:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.xml').option("rowTag", "users").load('./tmp/test/Users.xml')
df.printSchema()
df.show()
root
|-- row: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _AboutMe: string (nullable = true)
| | |-- _AccountId: long (nullable = true)
| | |-- _CreationDate: string (nullable = true)
| | |-- _DisplayName: string (nullable = true)
| | |-- _DownVotes: long (nullable = true)
| | |-- _Id: long (nullable = true)
| | |-- _LastAccessDate: string (nullable = true)
| | |-- _Location: string (nullable = true)
| | |-- _ProfileImageUrl: string (nullable = true)
| | |-- _Reputation: long (nullable = true)
| | |-- _UpVotes: long (nullable = true)
| | |-- _VALUE: string (nullable = true)
| | |-- _Views: long (nullable = true)
| | |-- _WebsiteUrl: string (nullable = true)
+--------------------+
| row|
+--------------------+
|[[Hi, I'm not ......|
+--------------------+
Spark : 1.6.0
Python : 2.7.15
Com.databricks : spark-xml_2.10:0.4.1
I would be extremely grateful for any advice.
Kind Regards,
P.
I tried the same method (spark-xml on the Stack Overflow dump files) some time ago and failed, mostly because the DF is seen as an array of structures and the processing performance was really bad. Instead, I recommend using the standard text reader and mapping the Key="Value" pairs in every line with a parsing function like this:
import re

pattern = re.compile(' ([A-Za-z]+)="([^"]*)"')
parse_line = lambda line: {key: value for key, value in pattern.findall(line)}
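A rough sketch of how this parser might be wired up end to end, reusing the sc and sqlContext from the question and the parse_line defined above; the column list is only an illustrative subset of the Users.xml attributes, and every column comes back as a string until you cast it (the notebook linked below handles the proper types).
from pyspark.sql.types import StructType, StructField, StringType

columns = ["Id", "Reputation", "CreationDate", "DisplayName",
           "Location", "Views", "UpVotes", "DownVotes", "AccountId"]  # illustrative subset
schema = StructType([StructField(c, StringType(), True) for c in columns])

rows = (sc.textFile('./tmp/test/Users.xml')
          .map(parse_line)                      # dict of attributes for each line
          .filter(lambda d: d)                  # drop the <users> / </users> wrapper lines (no attributes)
          .map(lambda d: [d.get(c) for c in columns]))

users_df = sqlContext.createDataFrame(rows, schema)
users_df.printSchema()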
You can also use my code to get the proper data types: https://github.com/szczeles/pyspark-notebooks/blob/master/stackoverflow/stackexchange-convert.ipynb (the schema matches dumps for March 2017).
Say I have two tables, order_table and room_table.
order_table
+----------+---------+
| order_id | info |
+----------+---------+
| order1 | infos |
+----------+---------+
room_table with many columns
+----------+---------+-----+
| order_id | room_id | ... |
+----------+---------+-----+
| order1 | room1 | ... |
| order1 | room2 | ... |
+----------+---------+-----+
I want to group room_table by order_id, collect each group's rows into a list, and add that list to order_table as a new column rooms.
The output table should keep this schema:
-order_id string,
-info string,
-room array<struct>
--room_id string,
--room_price int,
--room_name string
-- ....
import spark.implicits._
import org.apache.spark.sql.functions.{collect_list, struct}

val df1 = Seq(("order_1", "order_1_info"),
              ("order_2", "order_2_info")).toDF("order_id", "info")
val df2 = Seq(("order_1", "room_1", 100, "palace_1"),
              ("order_2", "room_2", 200, "palace_2"),
              ("order_1", "room_3", 100, "palace_3"),
              ("order_2", "room_8", 200, "palace_x"))
              .toDF("order_id", "room_id", "room_price", "room_name")

val cols: Array[String] = df2.columns
val df3 = df2.groupBy("order_id").agg(collect_list(struct(cols.head, cols.tail: _*)) as "room")
val df4 = df1.join(df3, Seq("order_id"))
df4.show()
df4.printSchema()
In the above snippet, I just made some sample dataframes to use.
Output:
+--------+------------+--------------------+
|order_id| info| room|
+--------+------------+--------------------+
| order_1|order_1_info|[[order_1,room_1,...|
| order_2|order_2_info|[[order_2,room_2,...|
+--------+------------+--------------------+
Schema:
root
|-- order_id: string (nullable = true)
|-- info: string (nullable = true)
|-- room: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- order_id: string (nullable = true)
| | |-- room_id: string (nullable = true)
| | |-- room_price: integer (nullable = false)
| | |-- room_name: string (nullable = true)
I hope this is helpful
My Schema looks like below:
scala> airing.printSchema()
root
|-- program: struct (nullable = true)
| |-- detail: struct (nullable = true)
| | |-- contributors: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- contributorId: string (nullable = true)
| | | | |-- name: string (nullable = true)
| | | | |-- order: long (nullable = true)
I need to count based on the unique Actors, to find the most popular actors.
My code is:
val castCounts = airing.groupBy("program.detail.contributors.name").count().sort(desc("count")).take(10)
To my shock, I am getting duplicates, as shown in the results printed below. I expected each individual actor to occur once, with a distinct count:
[WrappedArray(),4344]
[WrappedArray(Matt Smith),16]
[WrappedArray(Phil Keoghan),15]
[WrappedArray(Don Adams, Barbara Feldon, Edward Platt),10]
[WrappedArray(Edward Platt, Don Adams, Barbara Feldon),10]
There are 2 steps.
First, use the explode function to flatten your data so that each row has only one contributor:
val df = airing.withColumn("contributor", explode(col("program.detail.contributors")))
Then get the result from the new df, in which contributor has been exploded:
val castCounts = df.groupBy("contributor.name").count().sort(desc("count")).take(10)