case insensitive match in spark dataframe MapType - apache-spark

Using Spark 2.4.1, I'm trying to look up a key in a MapType column in a case-insensitive fashion, but Spark does not seem to honor spark.sql.caseSensitive=false.
Starting spark with:
spark-shell --conf spark.sql.caseSensitive=false
Given dataframe:
val df = List(Map("a" -> 1), Map("A" -> 2)).toDF("m")
+--------+
| m|
+--------+
|[a -> 1]|
|[A -> 2]|
+--------+
Executing any of these returns only one row (the map key lookup is case sensitive, while the column name resolution is case insensitive):
df.filter($"M.A".isNotNull).count
df.filter($"M"("A").isNotNull).count
df.filter($"M".getField("A").isNotNull).count
Is there a way to get the field resolution to be case insensitive when resolving a key in a map?
Update:
I dug into the Spark code and found that this is probably a bug/feature. It looks like GetMapValue (complexTypeExtractors.scala) is called with plain StringType ordering instead of the case-insensitive Resolver that GetStructField uses.
I filed a JIRA for it: SPARK-27820

Not exactly pretty but should do the trick:
import org.apache.spark.sql.functions._

df.select(
  // Re-create the map
  map_from_arrays(
    // Convert keys to uppercase
    expr("transform(map_keys(m), x -> upper(x))"),
    // Values
    map_values($"m")
  )("A".toUpperCase)
)
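If you are on Spark 3.0 or later (not the 2.4.1 of the question), the transform_keys higher-order function makes the same idea a bit more direct. A minimal sketch, assuming the df defined above; with the example data both rows should match, so the count is 2:
import org.apache.spark.sql.functions.expr

// Uppercase every map key, then look the value up under the uppercased key.
df.filter(expr("transform_keys(m, (k, v) -> upper(k))['A']").isNotNull).count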

Related

Merge / concatenate an array of maps into one map in spark SQL with built-ins

Consider the following DataFrame. Here I want the array of maps merged into one map without using UDFs.
+---+------------------------------------+
|id |greek |
+---+------------------------------------+
|1 |[{alpha -> beta}, {gamma -> delta}] |
|2 |[{epsilon -> zeta}, {etha -> theta}]|
+---+------------------------------------+
I think I've tried all the map functions in the PySpark 3 docs. I thought map_from_entries would do it, but it throws an exception saying it requires an array of key-value pair entries rather than an array of maps.
Although I'm aware this is easily done with a UDF, I find it hard to believe there is no simpler built-in way.
Runnable Python code:
from pyspark.sql import SparkSession
spark = (
    SparkSession
    .builder
    .getOrCreate()
)

df = spark.createDataFrame(
    [
        (1, [{"alpha": "beta"}, {"gamma": "delta"}]),
        (2, [{"epsilon": "zeta"}, {"etha": "theta"}])
    ],
    schema=["id", "greek"]
)
Another version using higher-order functions:
import pyspark.sql.functions as F

map_schema = df.selectExpr('greek[0]').dtypes[0][1]
expr = "REDUCE(greek, cast(map() as {schema}), (acc, el) -> map_concat(acc, el))".format(schema=map_schema)
df = df.withColumn("Concated", F.expr(expr))
Output:
+---+------------------------------------+--------------------------------+
|id |greek |Concated |
+---+------------------------------------+--------------------------------+
|1 |[{alpha -> beta}, {gamma -> delta}] |{alpha -> beta, gamma -> delta} |
|2 |[{epsilon -> zeta}, {etha -> theta}]|{epsilon -> zeta, etha -> theta}|
+---+------------------------------------+--------------------------------+
I figured out one approach that uses the aggregate built-in:
import pyspark.sql.functions as F
## Aggregate needs a column with the array to be iterated,
## an initial value and a merge function.
## For the initial value, we need an empty map with corresponding map schema
## which evaluates to (map<string,string>) in this case
map_schema = df.selectExpr('greek[0]').dtypes[0][1]
## F.create_map() creates a 'map<null,null>' type.
empty_map = F.create_map().cast(map_schema)
df.withColumn("Concated",
F.aggregate(
# Values to iterate
col=F.col("greek"),
# Initial value
initialValue=empty_map,
merge = lambda acc, el: F.map_concat(acc, el)
)
)
Edit
As pointed out by @kafels, duplicate keys need to be addressed. According to the Spark configuration docs, an exception is thrown by default when map keys are duplicated. To avoid this and let the last key win, set the following Spark SQL option:
spark.conf.set('spark.sql.mapKeyDedupPolicy', 'LAST_WIN')
My approach is to explode the parent array, explode the keys, explode the values, and then merge them all back together:
(df
    .withColumn('g', F.explode('greek'))
    .withColumn('k', F.explode(F.map_keys('g')))
    .withColumn('v', F.explode(F.map_values('g')))
    .groupBy('id')
    .agg(
        F.collect_list('k').alias('key'),
        F.collect_list('v').alias('value')
    )
    .withColumn('single_map', F.map_from_arrays('key', 'value'))
    .show(10, False)
)
# +---+---------------+-------------+--------------------------------+
# |id |key |value |single_map |
# +---+---------------+-------------+--------------------------------+
# |1 |[alpha, gamma] |[beta, delta]|{alpha -> beta, gamma -> delta} |
# |2 |[epsilon, etha]|[zeta, theta]|{epsilon -> zeta, etha -> theta}|
# +---+---------------+-------------+--------------------------------+

Spark-Scala Try Select Statement

I'm trying to incorporate a Try().getOrElse() statement in the select statement for a Spark DataFrame. The project I'm working on will be applied to multiple environments, but each environment names the raw data slightly differently for ONLY one field. I do not want to write several different functions to handle each variation. Is there an elegant way to handle exceptions, like the one below, in a DataFrame select statement?
val dfFilter = dfRaw
  .select(
    Try($"some.field.nameOption1").getOrElse($"some.field.nameOption2"),
    $"some.field.abc",
    $"some.field.def"
  )
dfFilter.show(33, false)
However, I keep getting the following error, which makes sense because the field does not exist in this environment's raw data; I'd expect the getOrElse statement to catch that exception, but it doesn't.
org.apache.spark.sql.AnalysisException: No such struct field nameOption1 in...
Is there a good way to handle exceptions in Scala Spark for select statements? Or will I need to code up different functions for each case?
val selectedColumns = if (dfRaw.columns.contains("some.field.nameOption1")) $"some.field.nameOption1" else $"some.field.nameOption2"
val dfFilter = dfRaw
.select(selectedColumns, ...)
So I'm revisiting this question after a year. I believe this solution is much more elegant to implement. Please let me know your thoughts:
import scala.util.Try

// Generate a fake DataFrame
val df = Seq(
  ("1234", "A", "AAA"),
  ("1134", "B", "BBB"),
  ("2353", "C", "CCC")
).toDF("id", "name", "nameAlt")

// Extract the column names
val columns = df.columns

// Add a "new" column name that is NOT present in the above DataFrame
val columnsAdd = columns ++ Array("someNewColumn")

// Let's then "try" to select all of the columns
df.select(columnsAdd.flatMap(c => Try(df(c)).toOption): _*).show(false)

// Let's reduce the DF again...should yield the same results
val dfNew = df.select("id", "name")
dfNew.select(columnsAdd.flatMap(c => Try(dfNew(c)).toOption): _*).show(false)
// Results
columns: Array[String] = Array(id, name, nameAlt)
columnsAdd: Array[String] = Array(id, name, nameAlt, someNewColumn)
+----+----+-------+
|id |name|nameAlt|
+----+----+-------+
|1234|A |AAA |
|1134|B |BBB |
|2353|C |CCC |
+----+----+-------+
dfNew: org.apache.spark.sql.DataFrame = [id: string, name: string]
+----+----+
|id |name|
+----+----+
|1234|A |
|1134|B |
|2353|C |
+----+----+
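The same pattern can be wrapped in a small helper so it is reusable across environments. A minimal sketch based on the code above; the name selectExisting is just illustrative:
import scala.util.Try
import org.apache.spark.sql.DataFrame

// Select only the columns that actually resolve on this DataFrame,
// silently dropping the ones that do not exist.
def selectExisting(df: DataFrame, names: Seq[String]): DataFrame =
  df.select(names.flatMap(c => Try(df(c)).toOption): _*)

// Usage with the DataFrame and column list from above:
selectExisting(df, columnsAdd).show(false)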

How to swap minus sign from last position in a string to first position in hive?

How can I move a negative sign from the last position of a string or integer to the first position in Hive and/or Spark?
example: 22-
required: -22
My code is:
val Validation1 = spark.sql("Select case when substr(YTTLSVAL-,-1,1)='-' then cast(concat('-',substr(YTTLSVAL-,1,length(YTTLSVAL-)-1)) as int) else cast(YTTLSVAL- as int) end as column_name")
scala> Seq("-abcd", "def", "23-", "we").toDF("value").createOrReplaceTempView("values")
scala> val f = (x: String) => if(x.endsWith("-")) s"-${x.dropRight(1)}" else x
scala> spark.udf.register("myudf", f)
scala> spark.sql("select *, myudf(*) as custval from values").show
+-----+-------+
|value|custval|
+-----+-------+
|-abcd| -abcd|
| def| def|
| 23-| -23|
| we| we|
+-----+-------+
EDIT
On second thought: since UDFs are discouraged unless you absolutely need them (they are a black box for Spark's optimization engine), prefer the approach below, which uses regexp_replace instead. I have tested this and it works:
scala> spark.sql("select REGEXP_REPLACE ( value, '^(\\.+)(-)$','-$1') as custval from values").show
You could try REGEXP_REPLACE. This pattern searches for a number followed by - at the end and puts it before the number if found.
SELECT REGEXP_REPLACE ( val, '^(\\d+)-$','-$1')
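For completeness, a minimal sketch of the same idea through the DataFrame API (using the regexp_replace column function on the example data from the spark-shell session above), which sidesteps the extra layer of SQL string escaping around the backslash:
import org.apache.spark.sql.functions.{col, regexp_replace}

// Move a trailing minus sign to the front; non-matching values pass through unchanged.
Seq("-abcd", "def", "23-", "we").toDF("value")
  .withColumn("custval", regexp_replace(col("value"), "^(\\d+)-$", "-$1"))
  .show()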

How to convert column of MapType(StringType, StringType) into StringType?

So I have this streaming dataframe and I'm trying to cast this 'customer_ids' column to a simple string.
from pyspark.sql.types import StructType, MapType, StringType, TimestampType

schema = StructType()\
    .add("customer_ids", MapType(StringType(), StringType()))\
    .add("date", TimestampType())

original_sdf = spark.readStream.option("maxFilesPerTrigger", 800)\
    .load(path=source, format="parquet", schema=schema)\
    .select('customer_ids', 'date')
The intent of this conversion is to group by this column and aggregate by max(date), like this:
from pyspark.sql.functions import max

original_sdf.groupBy('customer_ids')\
    .agg(max('date'))\
    .writeStream\
    .trigger(once=True)\
    .format("memory")\
    .queryName('query')\
    .outputMode("complete")\
    .start()
but I got this exception
AnalysisException: u'expression `customer_ids` cannot be used as a grouping expression because its data type map<string,string> is not an orderable data type.
How can I cast this kind of streaming DataFrame column or any other way to groupBy this column?
TL;DR Use getItem method to access the values per key in a MapType column.
The real question is what key(s) you want to groupBy since a MapType column can have a variety of keys. Every key can be a column with values from the map column.
You can access keys using Column.getItem method (or a similar python voodoo):
getItem(key: Any): Column
An expression that gets an item at position ordinal out of an array, or gets a value by key key in a MapType.
(I use Scala and am leaving converting it to pyspark as a home exercise)
val ds = Seq(Map("hello" -> "world")).toDF("m")
scala> ds.show(false)
+-------------------+
|m |
+-------------------+
|Map(hello -> world)|
+-------------------+
scala> ds.select($"m".getItem("hello") as "hello").show
+-----+
|hello|
+-----+
|world|
+-----+
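Tying this back to the original question: the groupBy can then be done on the extracted value rather than on the map column itself. A minimal sketch on the toy DataFrame above, where the key "hello" stands in for whatever customer id key is actually needed:
// Group on the value extracted for a single key instead of the whole MapType column.
ds.groupBy($"m".getItem("hello") as "hello")
  .count()
  .show()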

How to read a nested collection in Spark

I have a parquet table with one of the columns being
, array<struct<col1,col2,..colN>>
I can run queries against this table in Hive using the LATERAL VIEW syntax.
How do I read this table into an RDD, and more importantly, how do I filter, map, etc. over this nested collection in Spark?
Could not find any references to this in the Spark documentation. Thanks in advance for any information!
P.S. I felt it might be helpful to give some stats on the table.
Number of columns in main table ~600. Number of rows ~200m.
Number of "columns" in nested collection ~10. Avg number of records in nested collection ~35.
There is no magic in the case of a nested collection. Spark will handle an RDD[(String, String)] and an RDD[(String, Seq[String])] in the same way.
Reading such nested collection from Parquet files can be tricky, though.
Let's take an example from the spark-shell (1.3.1):
scala> import sqlContext.implicits._
import sqlContext.implicits._
scala> case class Inner(a: String, b: String)
defined class Inner
scala> case class Outer(key: String, inners: Seq[Inner])
defined class Outer
Write the parquet file:
scala> val outers = sc.parallelize(List(Outer("k1", List(Inner("a", "b")))))
outers: org.apache.spark.rdd.RDD[Outer] = ParallelCollectionRDD[0] at parallelize at <console>:25
scala> outers.toDF.saveAsParquetFile("outers.parquet")
Read the parquet file:
scala> import org.apache.spark.sql.catalyst.expressions.Row
import org.apache.spark.sql.catalyst.expressions.Row
scala> val dataFrame = sqlContext.parquetFile("outers.parquet")
dataFrame: org.apache.spark.sql.DataFrame = [key: string, inners: array<struct<a:string,b:string>>]
scala> val outers = dataFrame.map { row =>
| val key = row.getString(0)
| val inners = row.getAs[Seq[Row]](1).map(r => Inner(r.getString(0), r.getString(1)))
| Outer(key, inners)
| }
outers: org.apache.spark.rdd.RDD[Outer] = MapPartitionsRDD[8] at map at DataFrame.scala:848
The important part is row.getAs[Seq[Row]](1). The internal representation of a nested sequence of structs is ArrayBuffer[Row]; you could use any super-type of it instead of Seq[Row]. The 1 is the column index in the outer row. I used the method getAs here, but there are alternatives in the latest versions of Spark. See the source code of the Row trait.
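For instance, a minimal sketch of one such alternative using Row.getSeq, assuming a Spark version where it is available and the same dataFrame and case classes as above:
// Same mapping as before, but with getSeq instead of getAs.
val outersAlt = dataFrame.map { row =>
  val inners = row.getSeq[Row](1).map(r => Inner(r.getString(0), r.getString(1)))
  Outer(row.getString(0), inners)
}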
Now that you have an RDD[Outer], you can apply any transformation or action you want.
// Filter the outers
outers.filter(_.inners.nonEmpty)
// Filter the inners
outers.map(outer => outer.copy(inners = outer.inners.filter(_.a == "a")))
Note that we used the Spark SQL library only to read the parquet file. You could, for example, select only the wanted columns directly on the DataFrame before mapping it to an RDD.
dataFrame.select('col1, 'col2).map { row => ... }
I'll give a Python-based answer since that's what I'm using. I think Scala has something similar.
The explode function was added in Spark 1.4.0 to handle nested arrays in DataFrames, according to the Python API docs.
Create a test dataframe:
from pyspark.sql import Row
df = sqlContext.createDataFrame([Row(a=1, intlist=[1,2,3]), Row(a=2, intlist=[4,5,6])])
df.show()
## +-+--------------------+
## |a| intlist|
## +-+--------------------+
## |1|ArrayBuffer(1, 2, 3)|
## |2|ArrayBuffer(4, 5, 6)|
## +-+--------------------+
Use explode to flatten the list column:
from pyspark.sql.functions import explode
df.select(df.a, explode(df.intlist)).show()
## +-+---+
## |a|_c0|
## +-+---+
## |1| 1|
## |1| 2|
## |1| 3|
## |2| 4|
## |2| 5|
## |2| 6|
## +-+---+
Another approach would be using pattern matching like this:
val rdd: RDD[(String, List[(String, String)])] = dataFrame.map(_.toSeq.toList match {
  case List(key: String, inners: Seq[Row]) => key -> inners.map(_.toSeq.toList match {
    case List(a: String, b: String) => (a, b)
  }).toList
})
You can pattern match directly on Row but it is likely to fail for a few reasons.
The answers above are all great and tackle this question from different sides; Spark SQL is also a quite useful way to access nested data.
Here's an example of how to use explode() directly in SQL to query a nested collection.
SELECT hholdid, tsp.person_seq_no
FROM ( SELECT hholdid, explode(tsp_ids) as tsp
FROM disc_mrt.unified_fact uf
)
tsp_ids is a nested array of structs with many attributes, including person_seq_no, which I'm selecting in the outer query above.
The above was tested in Spark 2.0. I did a small test and it does not work in Spark 1.6. This question was asked before Spark 2 was around, so this answer adds nicely to the list of available options for dealing with nested structures.
Also have a look at the following JIRA for a Hive-compatible way to query nested data using the LATERAL VIEW OUTER syntax, since Spark 2.2 also supports OUTER explode (e.g., when a nested collection is empty but you still want the attributes from the parent record):
SPARK-13721: Add support for LATERAL VIEW OUTER explode()
A notable unresolved JIRA on explode() for SQL access:
SPARK-7549: Support aggregating over nested fields
