How to create a dataframe from a key=value string delimited by ";" - apache-spark

I have a Hive table with the structure:
I need to read the string field, break out the keys, and turn them into columns of a Hive table; the final table should look like this:
Very important: both the number of keys in the string and the key names are dynamic.
One approach would be to read the string with Spark SQL, create a dataframe whose schema is based on all the strings, and use the saveAsTable() function to turn the dataframe into the final Hive table, but I do not know how to do this.
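Loading the source table would look roughly like this (the database, table, and column names here are only placeholders, since the real schema is not shown above):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
import spark.implicits._

// Placeholder names for the Hive database/table and its columns
val source = spark.table("source_db.source_table")
  .select($"code", $"date", $"string")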
Any suggestions?

A naive solution (assuming unique (code, date) combinations and no embedded = or ; in the keys and values) can look like this:
import org.apache.spark.sql.functions.{explode, first, split}

val df = Seq(
  (1, 1, "key1=value11;key2=value12;key3=value13;key4=value14"),
  (1, 2, "key1=value21;key2=value22;key3=value23;key4=value24"),
  (2, 4, "key3=value33;key4=value34;key5=value35")
).toDF("code", "date", "string")

val bits = split($"string", ";")
val kv = split($"pair", "=")

df
  .withColumn("bits", bits)              // Split the string by `;`
  .withColumn("pair", explode($"bits"))  // Explode into one row per key=value pair
  .withColumn("key", kv(0))              // Extract the key
  .withColumn("val", kv(1))              // Extract the value
  // Pivot to wide format
  .groupBy("code", "date")
  .pivot("key")
  .agg(first("val"))
  .show()
// +----+----+-------+-------+-------+-------+-------+
// |code|date| key1| key2| key3| key4| key5|
// +----+----+-------+-------+-------+-------+-------+
// | 1| 2|value21|value22|value23|value24| null|
// | 1| 1|value11|value12|value13|value14| null|
// | 2| 4| null| null|value33|value34|value35|
// +----+----+-------+-------+-------+-------+-------+
This can easily be adjusted to handle the case where (code, date) is not unique, and you can process more complex string patterns using a UDF.
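For example, if (code, date) combinations can repeat, one possible adjustment (just a sketch, not the only option) is to collect every value per cell instead of taking the first one:
import org.apache.spark.sql.functions.collect_list

df
  .withColumn("pair", explode(split($"string", ";")))
  .withColumn("key", split($"pair", "=")(0))
  .withColumn("val", split($"pair", "=")(1))
  .groupBy("code", "date")
  .pivot("key")
  .agg(collect_list("val"))   // each cell becomes an array of all values seen for that key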
Depending on the language you use and the number of columns, you may be better off using an RDD or Dataset. It is also worth considering dropping the full explode / pivot in favor of a UDF.
import org.apache.spark.sql.functions.udf

// Parse "k1=v1;k2=v2;..." into a Map[String, String]
val parse = udf((text: String) => text.split(";").map(_.split("=")).collect {
  case Array(k, v) => (k, v)
}.toMap)

// UDF returning the keys of such a map
val mapKeys = udf((pairs: Map[String, String]) => pairs.keys.toList)

// Parse strings to Map[String, String]
val withKVs = df.withColumn("kvs", parse($"string"))

val keys = withKVs
  .select(explode(mapKeys($"kvs"))).distinct // Get unique keys
  .as[String]
  .collect.sorted.toList // Collect and sort

// Build a list of expressions for the subsequent select
val exprs = keys.map(key => $"kvs".getItem(key).alias(key))

withKVs.select($"code" :: $"date" :: exprs: _*)
In Spark 1.5 you can try:
val keys = withKVs.select($"kvs").rdd
  .flatMap(_.getAs[Map[String, String]]("kvs").keys)
  .distinct
  .collect.sorted.toList
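Finally, to land the result back in a Hive table, as asked in the question, something along these lines should work; pivoted here is a stand-in for the wide dataframe built above, and the table name is a placeholder:
// Hypothetical: pivoted is the wide dataframe produced by the pivot above,
// "db.final_table" is a placeholder Hive table name.
pivoted.write.mode("overwrite").saveAsTable("db.final_table")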

Related

How to query nested Array type of a json file using Spark?

How can I query a nested array type using joins with a Spark Dataset?
Currently I'm exploding the array type and doing a join on the dataset from which I need to remove the matched data. But is there a way in which I can query it directly without exploding?
{
  "id": 525,
  "arrayRecords": [
    { "field1": 525, "field2": 0 },
    { "field1": 537, "field2": 1 }
  ]
}
The code:
val df = sqlContext.read.json("jsonfile")
val someDF = Seq(("1"), ("525"), ("3")).toDF("FIELDIDS")

val withSRCRec = df.select($"*", explode($"arrayRecords") as "exploded_arrayRecords")

val fieldIdMatchedDF = withSRCRec.as("table1")
  .join(someDF.as("table2"), $"table1.exploded_arrayRecords.field1" === $"table2.FIELDIDS")
  .select($"table1.exploded_arrayRecords.field1")

val finalDf = df.as("table1")
  .join(fieldIdMatchedDF.as("table2"), $"table1.id" === $"table2.id", "leftanti")
Records whose field1 values match the FIELDIDS need to be removed.
You could use array_except instead:
array_except(col1: Column, col2: Column): Column
Returns an array of the elements in the first array but not in the second array, without duplicates. The order of the elements in the result is not determined.
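A minimal, self-contained illustration of array_except (available since Spark 2.4):
import org.apache.spark.sql.functions.{array, array_except, lit}

spark.range(1)
  .select(array_except(array(lit(1), lit(2), lit(3)), array(lit(2), lit(4))) as "diff")
  .show()
// +------+
// |  diff|
// +------+
// |[1, 3]|
// +------+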
A solution could be as follows:
val input = spark.read.option("multiLine", true).json("input.json")
scala> input.show(false)
+--------------------+---+
|arrayRecords |id |
+--------------------+---+
|[[525, 0], [537, 1]]|525|
+--------------------+---+
// Since field1 is of type int, let's convert the ids to ints
// You could do this in Scala directly or in Spark SQL's select
val fieldIds = Seq("1", "525", "3").toDF("FIELDIDS").select($"FIELDIDS" cast "int")
// Collect the ids for array_except
val ids = fieldIds.select(collect_set("FIELDIDS") as "ids")
// The trick is to crossJoin (it is cheap given 1-row ids dataset)
val solution = input
.crossJoin(ids)
.select(array_except($"arrayRecords.field1", $"ids") as "unmatched")
scala> solution.show
+---------+
|unmatched|
+---------+
| [537]|
+---------+
You can register a temporary table based on your dataset and query it with SQL. It would be something like this:
someDs.registerTempTable("sometable")
sql("SELECT arrayRecords.field1 FROM sometable")

Spark-Scala Try Select Statement

I'm trying to incorporate a Try().getOrElse() statement in my select statement for a Spark DataFrame. The project I'm working on is going to be applied to multiple environments. However, each environment is a little different in terms of the naming of the raw data for ONLY one field. I do not want to write several different functions to handle each different field. Is there an elegant way to handle exceptions, like the one below, in a DataFrame select statement?
val dfFilter = dfRaw
  .select(
    Try($"some.field.nameOption1").getOrElse($"some.field.nameOption2"),
    $"some.field.abc",
    $"some.field.def"
  )

dfFilter.show(33, false)
However, I keep getting the following error, which makes sense because the field does not exist in this environment's raw data, but I'd expect the getOrElse statement to catch that exception.
org.apache.spark.sql.AnalysisException: No such struct field nameOption1 in...
Is there a good way to handle exceptions in Scala Spark for select statements? Or will I need to code up different functions for each case?
val selectedColumns =
  if (dfRaw.columns.contains("some.field.nameOption1")) $"some.field.nameOption1"
  else $"some.field.nameOption2"

val dfFilter = dfRaw
  .select(selectedColumns, ...)
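A related sketch: Dataset.apply / Dataset.col resolves the column name eagerly, unlike $"...", which is only resolved when the query is analyzed, so wrapping the former in Try does catch the missing-field case. The column names below are just the question's examples:
import scala.util.Try

// dfRaw("...") throws AnalysisException immediately if the nested field is absent,
// so Try can fall back to the alternative name.
val nameCol = Try(dfRaw("some.field.nameOption1"))
  .getOrElse(dfRaw("some.field.nameOption2"))

val dfFilter = dfRaw.select(nameCol, $"some.field.abc", $"some.field.def")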
So I'm revisiting this question after a year. I believe this solution is much more elegant to implement. Please let me know what others think:
import scala.util.Try

// Generate a fake DataFrame
val df = Seq(
  ("1234", "A", "AAA"),
  ("1134", "B", "BBB"),
  ("2353", "C", "CCC")
).toDF("id", "name", "nameAlt")

// Extract the column names
val columns = df.columns

// Add a "new" column name that is NOT present in the above DataFrame
val columnsAdd = columns ++ Array("someNewColumn")

// "Try" to select all of the columns, silently dropping any that fail to resolve
df.select(columnsAdd.flatMap(c => Try(df(c)).toOption): _*).show(false)

// Reduce the DF again... should yield the same results
val dfNew = df.select("id", "name")
dfNew.select(columnsAdd.flatMap(c => Try(dfNew(c)).toOption): _*).show(false)
// Results
columns: Array[String] = Array(id, name, nameAlt)
columnsAdd: Array[String] = Array(id, name, nameAlt, someNewColumn)
+----+----+-------+
|id |name|nameAlt|
+----+----+-------+
|1234|A |AAA |
|1134|B |BBB |
|2353|C |CCC |
+----+----+-------+
dfNew: org.apache.spark.sql.DataFrame = [id: string, name: string]
+----+----+
|id |name|
+----+----+
|1234|A |
|1134|B |
|2353|C |
+----+----+

Get examples for rows that are removed by a filter from a spark dataframe

Suppose I have a spark dataframe df with some columns (id,...) and a string sqlFilter with a SQL filter, e.g. "id is not null".
I want to filter the dataframe df based on sqlFilter, i.e.
val filtered = df.filter(sqlFilter)
Now, I want to have a list of 10 ids from df that were removed by the filter.
Currently, I'm using a "leftanti" join to achieve this, i.e.
val examples = df.select("id").join(filtered.select("id"), Seq("id"), "leftanti")
.take(10)
.map(row => Option(row.get(0)) match { case None => "null" case Some(x) => x.toString})
However, this is really slow.
My guess is that this can be implemented faster, because Spark only has to keep a list for every partition
and add an id to the list whenever the filter removes a row and the list contains fewer than 10 elements. Once the action after
the filter finishes, Spark has to collect the lists from the partitions until it has 10 ids.
I wanted to use accumulators as described here,
but I failed because I could not find out how to parse and use sqlFilter.
Does anybody have an idea how I can improve the performance?
Update
Ramesh Maharjan suggested in the comments to invert the SQL query, i.e.
df.filter(s"NOT ($filterString)")
.select(key)
.take(10)
.map(row => Option(row.get(0)) match { case None => "null" case Some(x) => x.toString})
This indeed improves the performance, but it is not 100% equivalent.
If there are multiple rows with the same id, the id will end up in the examples if one of those rows is removed by the filter. With the leftanti join it does not end up in the examples, because the id is still present in filtered.
However, that is fine with me.
I'm still interested if it is possible to create the list "on the fly" with accumulators or something similar.
Update 2
Another issue with inverting the filter is the logical value UNKNOWN in SQL: NOT UNKNOWN = UNKNOWN, i.e. NOT(null <> 1) <=> UNKNOWN, and hence such a row shows up in neither the filtered dataframe nor the inverted one.
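A tiny sketch of that effect (assuming a nullable id column):
val demo = Seq((Some(1), "a"), (Some(2), "b"), (None, "c")).toDF("id", "name")
val filterString = "id <> 1"

demo.filter(filterString).show()            // keeps only the row with id = 2
demo.filter(s"NOT ($filterString)").show()  // keeps only the row with id = 1
// The row with id = null evaluates to UNKNOWN under both predicates,
// so it appears in neither result.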
You can use a custom accumulator (a longAccumulator won't help you, as all the ids will be null), and you must formulate your filter statement as a function:
Suppose you have a dataframe :
+----+--------+
| id| name|
+----+--------+
| 1|record 1|
|null|record 2|
| 3|record 3|
+----+--------+
Then you could do :
import org.apache.spark.sql.Row
import org.apache.spark.util.AccumulatorV2

// Accumulates the rows rejected by the filter
class RowAccumulator(var value: Seq[Row]) extends AccumulatorV2[Row, Seq[Row]] {
  def this() = this(Seq.empty[Row])
  override def isZero: Boolean = value.isEmpty
  override def copy(): AccumulatorV2[Row, Seq[Row]] = new RowAccumulator(value)
  override def reset(): Unit = value = Seq.empty[Row]
  override def add(v: Row): Unit = value = value :+ v
  override def merge(other: AccumulatorV2[Row, Seq[Row]]): Unit = value = value ++ other.value
}

val filteredAccum = new RowAccumulator()
ss.sparkContext.register(filteredAccum, "Filter Accum") // ss is your SparkSession

val filterIdIsNotNull = (r: Row) => {
  if (r.isNullAt(r.fieldIndex("id"))) {
    filteredAccum.add(r)
    false
  } else {
    true
  }
}

df
  .filter(filterIdIsNotNull)
  .show()

println(filteredAccum.value)
gives
+---+--------+
| id| name|
+---+--------+
| 1|record 1|
| 3|record 3|
+---+--------+
List([null,record 2])
But personally I would not do this; I would rather do something like what you've already suggested:
val dfWithFilter = df
  .withColumn("keep", expr("id is not null"))
  .cache() // check whether caching is feasible

// show 10 records which we do not keep
dfWithFilter.filter(!$"keep").drop($"keep").show(10) // or use take(10)
+----+--------+
| id| name|
+----+--------+
|null|record 2|
+----+--------+
// rows that we keep
val filteredDf = dfWithFilter.filter($"keep").drop($"keep")

Spark dataframe after join null check for integer type columns

I am implementing a solution for removing duplicate elements from two dataframes using a left outer join. After performing the join, I have to check for null columns on the right table.
import org.apache.spark.sql.types.StringType

val llist = Seq(("bob", "2015-01-13", 4), ("alice", "2015-04-23", 10))
val left = llist.toDF("name", "date", "duration")
val right = Seq(("alice", "2015-04-23", 10), ("bob", "2015-04-23", 23)).toDF("name", "date", "duration")

val df = left.join(right,
    left("name") === right("name") &&
    left("date") === right("date") &&
    left("duration").cast(StringType) === right("duration").cast(StringType),
  "left_outer").filter(right("duration").isNull)
But I am unable to filter out integer columns with null values. How can we do a null check for integers after the join?
It's rather unclear what you want to achieve. The way you do it creates ambiguous column names. In addition, you reference the original (source) dataframe (right) in the filter condition, not the joined dataframe.
If you want to join them, you can do:
val df = left
.join(right , Seq("name","date","duration"),"left_outer")
But this will not result in any "null" columns, because the duplicated columns are removed.
+-----+----------+--------+
| name| date|duration|
+-----+----------+--------+
| bob|2015-01-13| 4|
|alice|2015-04-23| 10|
+-----+----------+--------+
Otherwise, you can try this:
val df = left.as('left)
  .join(right.as('right),
    $"left.name" === $"right.name"
      and $"left.date" === $"right.date"
      and $"left.duration" === $"right.duration",
    "left_outer")
  .filter($"right.duration".isNull)
this will result in
+----+----------+--------+----+----+--------+
|name| date|duration|name|date|duration|
+----+----------+--------+----+----+--------+
| bob|2015-01-13| 4|null|null| null|
+----+----------+--------+----+----+--------+
EDIT:
If you just want to remove duplicates, you could do this:
val df = left.unionAll(right).distinct()
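Note that unionAll is deprecated since Spark 2.0; a sketch with the current API, plus except in case the goal is rows from one side only:
// Spark 2.0+: union replaces unionAll, distinct removes duplicate rows
val deduped = left.union(right).distinct()

// Rows of `left` that have no exact match in `right`
val leftOnly = left.except(right)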

What is the most efficient way to repartition a data frame with the following constraints?

I have a time series dataframe stored in one partition:
+-------------+------+----+-------+
| TimeStamp| X| Y| Z|
+-------------+------+----+-------+
|1448949705421|-35888|4969|3491754|
|1448949705423|-35081|2795|3489177|
|1448949705425|-35976|5830|3488618|
|1448949705426|-36927|4729|3491807|
|1448949705428|-36416|6246|3490364|
|1448949705429|-36073|7067|3491556|
|1448949705431|-38553|3714|3489545|
|1448949705433|-39008|3034|3490230|
|1448949705434|-35295|4005|3489426|
|1448949705436|-36397|5076|3490941|
+-------------+------+----+-------+
I want to repartition this dataframe into 10 partitions, such that the first partition has roughly the first 1/10 rows, the second partition has roughly the second 1/10 rows, and so on.
One way I can think of is:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.{LongType, StructField, StructType}

var df = ???
// add an index column to df
val rdd = df.rdd.zipWithIndex().map(indexedRow =>
  Row.fromSeq(indexedRow._2.toLong +: indexedRow._1.toSeq))
val newstructure = StructType(Seq(StructField("rn", LongType, true)).++(df.schema.fields))
val dfWithIndex = sqlContext.createDataFrame(rdd, newstructure)

// create a group number using the index
val udfToInt = udf[Int, Double](_.toInt)
val dfWithGrp = dfWithIndex.withColumn("group", udfToInt($"rn" / (df.count / 10)))

// repartition by the "group" column
val partitionedDF = dfWithGrp.repartition(10, $"group")
Another way I can think of is by using a partitioner:
import org.apache.spark.HashPartitioner

// After creating a group number
val grpIndex = dfWithGrp.schema.fields.size - 1
val partitionedRDD = dfWithGrp.rdd.map(r => (r.getInt(grpIndex), r))
  .partitionBy(new HashPartitioner(10))
  .values
But neither seems efficient, because we need to add an index first and then create a group number from it. Is there a way to do this without adding an extra group column?
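For reference, a sketch of the same idea done purely at the RDD level, without adding a DataFrame column (it still relies on zipWithIndex and a count):
import org.apache.spark.HashPartitioner

val numParts = 10
val n = df.count()

// Key each row by its block number (0..9 in row order), partition on that key,
// then drop the key. HashPartitioner maps key k to partition k % 10 = k,
// so the original row order is preserved across partitions.
val partitioned = df.rdd
  .zipWithIndex()
  .map { case (row, idx) => ((idx * numParts / n).toInt, row) }
  .partitionBy(new HashPartitioner(numParts))
  .values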
