Combine multiple columns into single column in SPARK - apache-spark

I have flattened incoming data in the below format in my parquet file:
I want to convert it into the below format, where I un-flatten the structure:
I tried the following:
Dataset<Row> rows = df.select(col("id"), col("country_cd"),
    explode(array("fullname_1", "fullname_2")).as("fullname"),
    explode(array("firstname_1", "firstname_2")).as("firstname"));
But it gives the below error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Only one generator allowed per select clause but found 2: explode(array(fullname_1, fullname_2)), explode(array(firstname_1, firstname_2));
I understand this is because you cannot use more than one explode per select clause.
I am looking for options to do the above in Spark Java.

This type of problem is most easily solved with a .flatMap(). A .flatMap() is like a .map() except that it allows you to output n records for each input record, as opposed to a 1:1 ratio.
val df = Seq(
(1, "USA", "Lee M", "Lee", "Dan A White", "Dan"),
(2, "CAN", "Pate Poland", "Pate", "Don Derheim", "Don")
).toDF("id", "country_code", "fullname_1", "firstname_1", "fullname_2", "firstname_2")
// flatMap on a DataFrame needs an Encoder for the output tuples, provided by spark.implicits._
df.flatMap(row => {
  val id = row.getAs[Int]("id")
  val cc = row.getAs[String]("country_code")
  Seq(
    (id, cc, row.getAs[String]("fullname_1"), row.getAs[String]("firstname_1")),
    (id, cc, row.getAs[String]("fullname_2"), row.getAs[String]("firstname_2"))
  )
}).toDF("id", "country_code", "fullname", "firstname").show()
This results in the following:
+---+------------+-----------+---------+
| id|country_code|   fullname|firstname|
+---+------------+-----------+---------+
|  1|         USA|      Lee M|      Lee|
|  1|         USA|Dan A White|      Dan|
|  2|         CAN|Pate Poland|     Pate|
|  2|         CAN|Don Derheim|      Don|
+---+------------+-----------+---------+

You need to wrap the first names and full names into an array of structs, which you then explode:
Dataset<Row> rows = df.select(col("id"), col("country_cd"),
    explode(
        array(
            struct(
                col("firstname_1").as("firstname"), col("fullname_1").as("fullname")),
            struct(
                col("firstname_2").as("firstname"), col("fullname_2").as("fullname"))
        )
    ).as("name")
);
This way you get a fast, narrow transformation, keep Scala/Python/R portability, and it should run quicker than the df.flatMap solution, which turns the DataFrame into an RDD that the Query Optimizer cannot improve. There might also be additional pressure on the Java garbage collector because of the copying from unsafe byte arrays to Java objects.
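For reference, a minimal sketch of the same approach in Scala (column names follow the question's schema; the second select flattens the exploded struct back into plain columns):
import org.apache.spark.sql.functions.{array, col, explode, struct}

val rows = df.select(col("id"), col("country_cd"),
  explode(array(
    struct(col("firstname_1").as("firstname"), col("fullname_1").as("fullname")),
    struct(col("firstname_2").as("firstname"), col("fullname_2").as("fullname"))
  )).as("name"))  // one struct per person, exploded into one row per person

rows.select(col("id"), col("country_cd"), col("name.firstname"), col("name.fullname")).show()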

As a database person, I like to use set-based operations for things like this, e.g. union:
val df = Seq(
("1", "USA", "Lee M", "Lee", "Dan A White", "Dan"),
("2", "CAN", "Pate Poland", "Pate", "Don Derheim", "Don")
).toDF("id", "country_code", "fullname_1", "firstname_1", "fullname_2", "firstname_2")
val df_new = df
.select("id", "country_code", "fullname_1", "firstname_1").union(df.select("id", "country_code", "fullname_2", "firstname_2"))
.orderBy("id")
df_new.show
df.createOrReplaceTempView("tmp")
Or the equivalent SQL (note that SQL UNION removes duplicate rows; use UNION ALL to match the DataFrame union above exactly):
%sql
SELECT id, country_code, fullname_1 AS fullname, firstname_1 AS firstname
FROM tmp
UNION
SELECT id, country_code, fullname_2, firstname_2
FROM tmp
My results match the four rows shown for the flatMap approach above.
I suppose one advantage over the flatMap technique is that you don't have to specify the datatypes, and it appears simpler on the face of it. It's up to you of course.
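If the incoming file ever carries more than two name pairs, the same union idea generalizes with a simple map/reduce over the pairs; a minimal sketch (the list of suffix pairs is an assumption):
import org.apache.spark.sql.functions.col

// (fullname, firstname) column pairs to unpivot -- extend as new pairs appear
val pairs = Seq(("fullname_1", "firstname_1"), ("fullname_2", "firstname_2"))

val dfAll = pairs
  .map { case (full, first) =>
    df.select(col("id"), col("country_code"), col(full).as("fullname"), col(first).as("firstname"))
  }
  .reduce(_ union _)   // DataFrame union keeps duplicates, like UNION ALL
  .orderBy("id")

dfAll.show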

Related

Custom output file format write with Spark

I have a requirement to write the following output format.
primary_key_value^attribute1:value1;attribute2:value2;attribute3:value3;attribute4:value4
The output will be written to a file. I can concatenate the values manually and make a string out of it, but are there any best practices I can follow to get Spark to write this output?
You could add the name of each column with concat or concat_ws and write semicolons as separators. In Scala, it would look like this:
import org.apache.spark.sql.functions.{col, concat_ws, lit}

val df = Seq((0, "val1", "val2", "val3")).toDF("id", "col1", "col2", "col3")
val res = df
  .select(df.columns.map(c => concat_ws(":", lit(c), col(c)).alias(c)): _*)
res.show()
+----+---------+---------+---------+
|  id|     col1|     col2|     col3|
+----+---------+---------+---------+
|id:0|col1:val1|col2:val2|col3:val3|
+----+---------+---------+---------+
And then:
res.write.option("sep", ";").csv("...")
In PySpark, for each column you can use the concat function to concatenate the column name and its value, and apply all of this in the select operator.
Then write the result with the csv function:
df.select(*[f.concat(f.lit(col), f.lit(":"), f.col(col)) for col in df.columns]).write.option("header", "false").option("delimiter", ";").csv("../path")
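If you need the exact primary_key^attribute:value;attribute:value layout from the question rather than a csv with ";" separators, here is a minimal Scala sketch (it assumes "id" is the primary key column and writes plain text, one line per row):
import org.apache.spark.sql.functions.{col, concat, concat_ws, lit}

val attrs = df.columns.filterNot(_ == "id")                    // every column except the key
val attrCols = attrs.map(c => concat_ws(":", lit(c), col(c)))  // "attribute:value" per column

df.select(concat(col("id"), lit("^"), concat_ws(";", attrCols: _*)).as("value"))
  .write.text("...")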

Assigning columns to another columns in a Spark Dataframe using Scala

I was looking at this excellent question and its answer so as to improve my Scala skills: Extract a column value and assign it to another column as an array in spark dataframe
I created my modified code as follows which works, but am left with a few questions:
import spark.implicits._
import org.apache.spark.sql.functions._
val df = sc.parallelize(Seq(
("r1", 1, 1),
("r2", 6, 4),
("r3", 4, 1),
("r4", 1, 2)
)).toDF("ID", "a", "b")
val uniqueVal = df.select("b").distinct().map(x => x.getAs[Int](0)).collect.toList
def myfun: Int => List[Int] = _ => uniqueVal
def myfun_udf = udf(myfun)
df.withColumn("X", myfun_udf( col("b") )).show
+---+---+---+---------+
| ID|  a|  b|        X|
+---+---+---+---------+
| r1|  1|  1|[1, 4, 2]|
| r2|  6|  4|[1, 4, 2]|
| r3|  4|  1|[1, 4, 2]|
| r4|  1|  2|[1, 4, 2]|
+---+---+---+---------+
It works, but:
I note that column b is passed in twice.
I can also pass in column a in the second statement and I get the same result, e.g. so what is the point of that then?
df.withColumn("X", myfun_udf( col("a") )).show
If I pass in col ID then I get null.
So, I am wondering why the second col is passed in at all?
And how could this be made to work generically for all columns?
So, this was code that I looked at elsewhere, but I am missing something.
The code you've shown doesn't make much sense:
It is not scalable - in the worst case scenario the size of each row is proportional to the size of the whole dataset (when every value is distinct).
As you've already figured out, it doesn't need an argument at all.
It doesn't need (and, importantly, didn't need) a udf at the time it was written (on 2016-12-23, Spark 1.6 and 2.0 were already released).
If you still wanted to use a udf, a nullary variant would suffice.
Overall it is just another convoluted and misleading answer that served the OP at the time. I'd ignore it (or vote accordingly) and move on.
So how could this be done:
If you have a local list and you really want to use a udf, then for a single sequence use a udf with a nullary function:
val uniqueBVal: Seq[Int] = ???
val addUniqueBValCol = udf(() => uniqueBVal)
df.withColumn("X", addUniqueBValCol())
Generalize to:
import scala.reflect.runtime.universe.TypeTag
def addLiteral[T : TypeTag](xs: Seq[T]) = udf(() => xs)
val x = addLiteral[Int](uniqueBVal)
df.withColumn("X", x())
Better, don't use a udf at all:
import org.apache.spark.sql.functions._
df.withColumn("x", array(uniquBVal map lit: _*))
As for:
And how could this be made to work generically for all columns?
As mentioned at the beginning, the whole concept is hard to defend. You could use either window functions (completely not scalable):
import org.apache.spark.sql.expressions.Window
val w = Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.select($"*" +: df.columns.map(c => collect_set(c).over(w).alias(s"${c}_unique")): _*)
or a cross join with an aggregate (most of the time not scalable):
val uniqueValues = df.select(
df.columns map (c => collect_set(col(c)).alias(s"${c}_unique")):_*
)
df.crossJoin(uniqueValues)
In general though, you'll have to rethink your approach if this comes anywhere near actual applications, unless you know for sure that the cardinalities of the columns are small and have strict upper bounds.
The take-away message is: don't trust random code that random people post on the Internet. This one included.

Spark-Scala Try Select Statement

I'm trying to incorporate a Try().getOrElse() statement in my select statement for a Spark DataFrame. The project I'm working on is going to be applied to multiple environments. However, each environment is a little different in terms of the naming of the raw data for ONLY one field. I do not want to write several different functions to handle each different field. Is there an elegant way to handle exceptions, like the one below, in a DataFrame select statement?
val dfFilter = dfRaw
.select(
Try($"some.field.nameOption1).getOrElse($"some.field.nameOption2"),
$"some.field.abc",
$"some.field.def"
)
dfFilter.show(33, false)
However, I keep getting the following error, which makes sense because the field does not exist in this environment's raw data, but I'd expect the getOrElse statement to catch that exception.
org.apache.spark.sql.AnalysisException: No such struct field nameOption1 in...
Is there a good way to handle exceptions in Scala Spark for select statements? Or will I need to code up different functions for each case?
val selectedColumns = if (dfRaw.columns.contains("some.field.nameOption1")) $"some.field.nameOption1" else $"some.field.nameOption2"
val dfFilter = dfRaw
.select(selectedColumns, ...)
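Alternatively, since dfRaw("...") resolves the column eagerly (unlike the lazy $"..." syntax), wrapping it in Try also works for nested struct fields; a minimal sketch reusing the question's names:
import scala.util.Try

// picks nameOption1 when it resolves, falls back to nameOption2 otherwise;
// aliased so the output column name is stable
val nameCol = Try(dfRaw("some.field.nameOption1"))
  .getOrElse(dfRaw("some.field.nameOption2"))
  .as("name")

val dfFilter = dfRaw.select(nameCol, $"some.field.abc", $"some.field.def")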
So I'm revisiting this question after a year. I believe this solution is much more elegant to implement. Please let me know what anyone else thinks:
import scala.util.Try

// Generate a fake DataFrame
val df = Seq(
("1234", "A", "AAA"),
("1134", "B", "BBB"),
("2353", "C", "CCC")
).toDF("id", "name", "nameAlt")
// Extract the column names
val columns = df.columns
// Add a "new" column name that is NOT present in the above DataFrame
val columnsAdd = columns ++ Array("someNewColumn")
// Let's then "try" to select all of the columns
df.select(columnsAdd.flatMap(c => Try(df(c)).toOption): _*).show(false)
// Let's reduce the DF again...should yield the same results
val dfNew = df.select("id", "name")
dfNew.select(columnsAdd.flatMap(c => Try(dfNew(c)).toOption): _*).show(false)
// Results
columns: Array[String] = Array(id, name, nameAlt)
columnsAdd: Array[String] = Array(id, name, nameAlt, someNewColumn)
+----+----+-------+
|id  |name|nameAlt|
+----+----+-------+
|1234|A   |AAA    |
|1134|B   |BBB    |
|2353|C   |CCC    |
+----+----+-------+
dfNew: org.apache.spark.sql.DataFrame = [id: string, name: string]
+----+----+
|id  |name|
+----+----+
|1234|A   |
|1134|B   |
|2353|C   |
+----+----+

What is the equivalent to Hive's find_in_set function (without registering a temp view)?

The below dataframe has 2 columns:
user_id
user_id_list (array)
The requirement is to find the position of user_id in the user_id_list.
Sample record:
user_id = x1
user_id_list = ('X2','X1','X3','X6')
Result:
position = 2
I need the dataframe with a 3rd column which has the position of user_id in the list.
Result dataframe columns:
user_id
user_id_list
position
I can achieve this using find_in_set() hive function after registering the dataframe as view using createOrReplaceTempView.
Is there a sql function available in spark to get this done without registering the view?
My advice would be to implement a UDF, just as Yura mentioned. Here is a short example of what it can look like:
val spark = SparkSession.builder.getOrCreate()
import spark.implicits._
val df = List((1, Array(2, 3, 1)), (2, Array(1, 2, 3))).toDF("user_id", "user_id_list")
df.show
+-------+------------+
|user_id|user_id_list|
+-------+------------+
|      1|   [2, 3, 1]|
|      2|   [1, 2, 3]|
+-------+------------+
val findPosition = udf((user_id: Int, user_id_list: Seq[Int]) => {
user_id_list.indexOf(user_id)
})
val df2 = df.withColumn("position", findPosition($"user_id", $"user_id_list"))
df2.show
+-------+------------+--------+
|user_id|user_id_list|position|
+-------+------------+--------+
|      1|   [2, 3, 1]|       2|
|      2|   [1, 2, 3]|       1|
+-------+------------+--------+
Is there a sql function available in spark to get this done without registering the view?
No, but you don't have to register a DataFrame to use find_in_set either.
expr function (with find_in_set)
You can (temporarily) switch to SQL mode using expr function instead (see functions object):
Parses the expression string into the column that it represents
val users = Seq(("x1", Array("X2","X1","X3","X6"))).toDF("user_id", "user_id_list")
val positions = users.
as[(String, Array[String])].
map { case (uid, ids) => (uid, ids, ids.mkString(",")) }.
toDF("user_id", "user_id_list", "ids"). // only for nicer column names
withColumn("position", expr("find_in_set(upper(user_id), ids)")).
select("user_id", "user_id_list", "position")
scala> positions.show
+-------+----------------+--------+
|user_id|    user_id_list|position|
+-------+----------------+--------+
|     x1|[X2, X1, X3, X6]|       2|
+-------+----------------+--------+
posexplode function
You could also use the posexplode function (from the functions object), which saves you some custom Scala coding and is better optimized than UDFs (which force deserialization of internal binary rows into JVM objects). Note that pos is 0-based, so add 1 if you want the 1-based position that find_in_set returns.
scala> users.
select('*, posexplode($"user_id_list")).
filter(lower($"user_id") === lower($"col")).
select($"user_id", $"user_id_list", $"pos" as "position").
show
+-------+----------------+--------+
|user_id|    user_id_list|position|
+-------+----------------+--------+
|     x1|[X2, X1, X3, X6]|       1|
+-------+----------------+--------+
I'm not aware of such a function in the Spark SQL API. There's a function to check whether an array contains a value (array_contains), but that's not what you need.
You could use posexplode to explode the array into rows with a position and then filter by it, like this: dataframe.select($"id", posexplode($"ids")).filter($"id" === $"col").select($"id", $"pos"). Still, it may not be the optimal solution depending on the length of the user id list. Currently (as of version 2.1.1) Spark doesn't optimize the above code into a direct array lookup - it will generate rows and filter them.
Also take into account that this approach will filter out any rows where user_id is not in user_id_list, so you may want to take extra steps to handle this.
I would advise implementing a UDF which does exactly what you need. On the downside: Spark can't look into the UDF, so it will have to deserialize the data to Java objects and back.
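For what it's worth, if you are on Spark 2.4 or later there is a built-in that does exactly this: array_position returns the 1-based position of the first matching element (0 when absent), so you need neither a UDF nor a temp view. A minimal sketch against the users DataFrame above, with upper() to handle the case mismatch:
import org.apache.spark.sql.functions.{array_position, upper}

users.withColumn("position", array_position($"user_id_list", upper($"user_id"))).show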

How to create a dataframe from a string key=value delimited by ";"

I have a Hive table with the structure:
I need to read the string field, break out the keys and turn them into Hive table columns; the final table should look like this:
Very important: the number of keys in the string is dynamic, and the names of the keys are also dynamic.
An attempt would be to read the string with Spark SQL, create a dataframe with a schema based on all the strings, and use the saveAsTable() function to turn the dataframe into the final Hive table, but I do not know how to do this.
Any suggestions?
A naive solution (assuming unique (code, date) combinations and no embedded = or ; in the string) can look like this:
import org.apache.spark.sql.functions.{explode, first, split}
val df = Seq(
(1, 1, "key1=value11;key2=value12;key3=value13;key4=value14"),
(1, 2, "key1=value21;key2=value22;key3=value23;key4=value24"),
(2, 4, "key3=value33;key4=value34;key5=value35")
).toDF("code", "date", "string")
val bits = split($"string", ";")
val kv = split($"pair", "=")
df
.withColumn("bits", bits) // Split column by `;`
.withColumn("pair", explode($"bits")) // Explode into multiple rows
.withColumn("key", kv(0)) // Extract key
.withColumn("val", kv(1)) // Extract value
// Pivot to wide format
.groupBy("code", "date")
.pivot("key")
.agg(first("val"))
// +----+----+-------+-------+-------+-------+-------+
// |code|date|   key1|   key2|   key3|   key4|   key5|
// +----+----+-------+-------+-------+-------+-------+
// |   1|   2|value21|value22|value23|value24|   null|
// |   1|   1|value11|value12|value13|value14|   null|
// |   2|   4|   null|   null|value33|value34|value35|
// +----+----+-------+-------+-------+-------+-------+
This can be easily adjusted to handle the case when (code, date) are not unique, and you can process more complex string patterns using a UDF.
Depending on the language you use and the number of columns, you may be better off using an RDD or Dataset. It is also worth considering dropping the full explode / pivot in favor of a UDF.
val parse = udf((text: String) => text.split(";").map(_.split("=")).collect {
case Array(k, v) => (k, v)
}.toMap)
val getKeys = udf((pairs: Map[String, String]) => pairs.keys.toList)
// Parse strings to Map[String, String]
val withKVs = df.withColumn("kvs", parse($"string"))
val keys = withKVs
.select(explode(getKeys($"kvs"))).distinct // Get unique keys
.as[String]
.collect.sorted.toList // Collect and sort
// Build a list of expressions for subsequent select
val exprs = keys.map(key => $"kvs".getItem(key).alias(key))
withKVs.select($"code" :: $"date" :: exprs: _*)
In Spark 1.5 you can try:
val keys = withKVs.select($"kvs").rdd
.flatMap(_.getAs[Map[String, String]]("kvs").keys)
.distinct
.collect.sorted.toList
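As a side note, Spark SQL (2.0.1 and later) also ships a str_to_map function that can replace the hand-written parse UDF above; a minimal sketch via expr, assuming the same df:
import org.apache.spark.sql.functions.expr

// str_to_map(text, pairDelim, keyValueDelim) parses "k1=v1;k2=v2" into a map column
val withKVs = df.withColumn("kvs", expr("str_to_map(string, ';', '=')"))
The key collection and getItem-based select above then work unchanged.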
