Spark: extract value from map based on another column - apache-spark

I have the following data frame
+--------------------+-----------------------------------------------------------------------------------------------------+-----------------+
|user_id |map_data |key_field |
+--------------------+-----------------------------------------------------------------------------------------------------+-----------------+
|VG1uTie2pzg5E89148k9|[2.0 -> [11.0, another_val_for_key_2], 1.0 -> [22.0, another_val_for_key_1]] |1 |
+--------------------+-----------------------------------------------------------------------------------------------------+-----------------+
and the following case class
case class A(d:Double, str: String)
map_data is a column of type Map[Double, A]
I am trying to create a new column that is based on the map_data column and the key_field column.
Something in the form of
df
  .withColumn("value_from_map",
    col("map_data").getItem(col("key_field").cast(DoubleType)).getItem("str"))
When I use a hardcoded key it works, for example:
df
  .withColumn("value_from_map",
    col("map_data").getItem(2).getItem("str"))
so I'm not sure what I am missing.

Managed to solve it with a UDF:
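import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.functions.{col, udf}

// Look up the struct stored under `key` and return its "str" field.
val extract = udf( (key: Int, map: Map[Double, GenericRowWithSchema]) =>
  map(key).getAs[String]("str")
)
...
  .withColumn("value_from_map", extract(col("key_field"), col("map_data")))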
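The UDF above indexes the map directly, so a row whose key_field is missing from map_data would make it raise an exception. As a rough illustration of the same per-row logic with that case handled, here is a plain-Python sketch that runs without Spark. The name `extract_str`, the tuple standing in for case class A, and the sample values are assumptions made up for this example, not part of the original code.

```python
from typing import Mapping, Optional, Tuple

# Hypothetical stand-in for case class A(d: Double, str: String).
A = Tuple[float, str]


def extract_str(key: Optional[int], map_data: Optional[Mapping[float, A]]) -> Optional[str]:
    """Return the `str` field of the entry stored under `key`, or None."""
    if key is None or map_data is None:
        return None
    # Cast the integer key to the map's double key type before the lookup.
    entry = map_data.get(float(key))
    return None if entry is None else entry[1]


row_map = {2.0: (11.0, "another_val_for_key_2"), 1.0: (22.0, "another_val_for_key_1")}
print(extract_str(1, row_map))  # key present
print(extract_str(3, row_map))  # key absent, yields None instead of raising
```

Returning None for a missing key is a design choice of this sketch; the original UDF does not do this.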

Related

PySpark - convert RDD to pair key value RDD

I created rdd from CSV
lines = sc.textFile(data)
Now I need to convert lines to a key-value RDD, where the value is the string obtained after splitting and the key is the column number in the CSV.
For example, for this CSV:
Col1,Col2
73,230666
55,149610
I want to get rdd.take(1):
[(1,73), (2, 230666)]
I create an RDD of lists:
lines_of_list = lines_data.map(lambda line : line.split(','))
Then I wrote a function that takes a list and returns a list of (key, value) tuples:
def list_of_tuple (l):
    list_tup = []
    for i in range(len(l[0])):
        list_tup.append((l[0][i],i))
    return(list_tup)
But I can't get the correct result when I try to map this function over the RDD.

You can use PySpark's create_map function to do so:
from pyspark.sql.functions import create_map, col, lit
df = spark.createDataFrame([(73, 230666), (55, 149610)], "Col1: int, Col2: int")
mapped_df = df.select(create_map(lit(1), col("Col1")).alias("mappedCol1"), create_map(lit(2), col("Col2")).alias("mappedCol2"))
mapped_df.show()
+----------+-------------+
|mappedCol1| mappedCol2|
+----------+-------------+
| {1 -> 73}|{2 -> 230666}|
| {1 -> 55}|{2 -> 149610}|
+----------+-------------+
If you still want to use the RDD API, it is available as a property of the DataFrame, so you can use it like so:
mapped_df.rdd.take(1)
Out[32]: [Row(mappedCol1={1: 73}, mappedCol2={2: 230666})]

I fixed the problem in this way:
def list_of_tuple (line_rdd):
    l = line_rdd.split(',')
    list_tup = []
    for i in range(len(l)):
        list_tup.append((l[i],i))
    return(list_tup)
pairs_rdd = lines_data.map(lambda line: list_of_tuple(line))
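Note that the function above emits (value, zero-based index) pairs, whereas the question asked for (column number, value) pairs starting at 1. Below is a minimal sketch of that variant, assuming simple comma-separated lines with no quoted fields. The name `line_to_pairs` is hypothetical; the function is plain Python, so it could be passed to map on the RDD in the same way as list_of_tuple.

```python
def line_to_pairs(line, sep=","):
    """Split one CSV line and pair each value with its 1-based column number."""
    return [(col_no, value) for col_no, value in enumerate(line.split(sep), start=1)]


# Local check on sample lines; on an RDD this would be lines_data.map(line_to_pairs).
sample_lines = ["73,230666", "55,149610"]
pairs = [line_to_pairs(line) for line in sample_lines]
print(pairs[0])
```

The values stay strings, as the question requested. A header line, if present, would need to be filtered out before mapping.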

Expand array of map type in pyspark

I have a column of type ArrayType(MapType(StringType, StringType)) and I want to expand it so that the keys in the maps become column names and the corresponding values become the column values.
Here is an example:
[[version -> HTTP/1.1], [code -> 400], [reason -> Illegal character VCHAR='.'], [Content-Type -> text/html;charset=iso-8859-1], [Content-Length -> 70], [Connection -> close], [Server -> Jetty(9.4.24.v20191120)], [body -> 3c68313e426164204d65737361676520]]
This is a single row, which holds an array of maps.

Skip the ArrayType and use a UDF directly on the JSON:
import json

from pyspark.sql.functions import udf
from pyspark.sql.types import MapType, StringType

@udf(returnType=MapType(StringType(), StringType()))
def http_flatten(s):
    # Merge the list of single-entry maps under http[0].out into one dict.
    if s is None:
        return None
    out = json.loads(s)["http"][0]["out"]
    data = dict()
    for e in out:
        data.update(e)
    return data
Then use something like this:
kafka_df.select(
    http_flatten(
        kafka_df.value.cast("string").alias("value")
    ).alias("headers")
)
Then select("headers.*") to get those as top-level columns.
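The core of the UDF is merging a list of single-entry maps into one dictionary. Here is a plain-Python sketch of just that step, runnable without Spark. The sample payload is made up to mirror the row shown in the question, the `["http"][0]["out"]` path is taken from the answer's code, and `flatten_maps` is a hypothetical helper name.

```python
import json


def flatten_maps(maps):
    """Merge a list of dicts into one dict; later keys overwrite earlier ones."""
    merged = {}
    for m in maps:
        merged.update(m)
    return merged


# Hypothetical payload shaped like the answer's JSON: {"http": [{"out": [...]}]}.
payload = json.dumps({"http": [{"out": [
    {"version": "HTTP/1.1"},
    {"code": "400"},
    {"Connection": "close"},
]}]})

headers = flatten_maps(json.loads(payload)["http"][0]["out"])
print(headers)
```

If the same key appears in more than one map, the last occurrence wins.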

How to Flatten spark dataframe Row to multiple Dataframe Rows

Hi, I have a Spark data frame which prints like this (single row):
[abc,WrappedArray(11918,1233),WrappedArray(46734,1234),1487530800317]
So inside a row I have wrapped arrays. I want to flatten them and create a data frame that has a single value for each array. For example, the above row should be transformed into something like this:
[abc,11918,46734,1487530800317]
[abc,1233,1234,1487530800317]
So I get a data frame with 2 rows instead of 1, where each corresponding element from the wrapped arrays goes into a new row.
Edit 1 after 1st answer:
What if I have 3 arrays in my input?
[abc,WrappedArray(11918,1233),WrappedArray(46734,1234),WrappedArray(1,2),1487530800317]
My output should be
[abc,11918,46734,1,1487530800317]
[abc,1233,1234,2,1487530800317]

Definitely not the best solution, but this would work:
import org.apache.spark.sql.functions.{explode, udf}
import spark.implicits._

case class TestFormat(a: String, b: Seq[String], c: Seq[String], d: String)
val data = Seq(TestFormat("abc", Seq("11918","1233"),
  Seq("46734","1234"), "1487530800317")).toDS
// Pair up the elements of b and c by position.
val zipThem: (Seq[String], Seq[String]) => Seq[(String, String)] = _.zip(_)
val udfZip = udf(zipThem)
data.select($"a", explode(udfZip($"b", $"c")) as "tmp", $"d")
  .select($"a", $"tmp._1" as "b", $"tmp._2" as "c", $"d")
  .show
The problem is that by default you cannot be sure that both Sequences are of equal length.
A probably better solution would be to restructure the whole data frame so that its schema models the data, e.g.
root
 |-- a
 |-- d
 |-- records
 |    |-- b
 |    |-- c

Thanks for answering, @swebbo. Your answer helped me get this done.
I did this:
import org.apache.spark.sql.functions.{explode, udf}
import sqlContext.implicits._
val zipColumns = udf((x: Seq[Long], y: Seq[Long], z: Seq[Long]) => (x.zip(y).zip(z)) map {
  case ((a,b),c) => (a,b,c)
})
val flattened = subDf.withColumn("columns", explode(zipColumns($"col3", $"col4", $"col5"))).select(
  $"col1", $"col2",
  $"columns._1".alias("col3"), $"columns._2".alias("col4"), $"columns._3".alias("col5"))
flattened.show
Hope that is understandable :)
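To make the zip-then-explode step concrete, here is a plain-Python sketch of what happens to one row, with no Spark involved. The helper name `explode_row` and its `pad` flag are hypothetical. The flag shows one possible way to handle the unequal-length concern raised above: padding with None via itertools.zip_longest instead of truncating as zip does.

```python
from itertools import zip_longest


def explode_row(key, arrays, tail, pad=False):
    """Turn one row holding parallel arrays into one row per array position."""
    zipper = zip_longest if pad else zip  # plain zip truncates to the shortest array
    return [(key, *values, tail) for values in zipper(*arrays)]


row = ("abc", [[11918, 1233], [46734, 1234], [1, 2]], 1487530800317)
for out in explode_row(*row):
    print(out)
```

With equal-length arrays both modes give the same rows; they differ only when one array is shorter than the others.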

spark: row to element

New to Spark.
I'd like to do some transformation on the "wordList" column of a spark DataFrame, df, of the type org.apache.spark.sql.DataFrame = [id: string, wordList: array<string>].
I use Databricks. df looks like:
+--------------------+--------------------+
| id| wordList|
+--------------------+--------------------+
|08b0a9b6-3b9a-47a...| [a]|
|23c2ef79-8dce-4ad...|[ag, adfg, asdfgg...|
|26a7682f-2ce6-4eb...|[ghe, gener, ghee...|
|2ab530b5-04bc-463...|[bap, pemm, pava,...|
+--------------------+--------------------+
More specifically, I have defined a function shrinkList(ol: List[String]): List[String] that takes a list and returns a shorter list, and would like to apply it on the wordList column. The question is, how do I convert the row to a list?
df.select("wordList").map(t => shrinkList(t(1))) gives the error:
type mismatch;
found : Any
required: List[String]
Also, I'm not sure about "t(1)" here. I'd rather use the column name instead of the index, in case the order of the columns changes in the future. But I can't seem to make t$"wordList" or t.wordList or t("wordList") work. So instead of using t(1), what selector can I use to select the "wordList" column?

Try:
df.select("wordList").map(t => shrinkList(t.getSeq[String](0).toList))
or
df.select("wordList").map(t => shrinkList(t.getAs[Seq[String]]("wordList").toList))

Spark Dataframe groupBy and sort results into a list

I have a Spark DataFrame and I would like to group the elements by a key and have the results as a sorted list.
Currently I am using:
df.groupBy("columnA").agg(collect_list("columnB"))
How do I sort the items in the list in ascending order?

You could try the function sort_array available in the functions package:
import org.apache.spark.sql.functions._
df.groupBy("columnA").agg(sort_array(collect_list("columnB")))

Just wanted to add another hint to Daniel de Paula's answer regarding the sort_array solution.
If you want to sort elements according to a different column, you can form a struct of two fields: the sort-by field and the result field.
Since structs are sorted field by field, you'll get the order you want; all that remains is to get rid of the sort-by column in each element of the resulting list.
The same approach can be applied with several sort-by columns when needed.
Here's an example that can be run in local spark-shell (use :paste mode):
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import spark.implicits._
case class Employee(name: String, department: String, salary: Double)
val employees = Seq(
  Employee("JSMITH", "A", 20.0),
  Employee("AJOHNSON", "A", 650.0),
  Employee("CBAKER", "A", 650.2),
  Employee("TGREEN", "A", 13.0),
  Employee("CHORTON", "B", 111.0),
  Employee("AIVANOV", "B", 233.0),
  Employee("VSMIRNOV", "B", 11.0)
)
val employeesDF = spark.createDataFrame(employees)
// Drop the sort-by field (salary) and keep only the name.
val getNames = udf { salaryNames: Seq[Row] =>
  salaryNames.map { case Row(_: Double, name: String) => name }
}
employeesDF
  .groupBy($"department")
  .agg(collect_list(struct($"salary", $"name")).as("salaryNames"))
  .withColumn("namesSortedBySalary", getNames(sort_array($"salaryNames", asc = false)))
  .show(truncate = false)
The result:
+----------+--------------------------------------------------------------------+----------------------------------+
|department|salaryNames |namesSortedBySalary |
+----------+--------------------------------------------------------------------+----------------------------------+
|B |[[111.0, CHORTON], [233.0, AIVANOV], [11.0, VSMIRNOV]] |[AIVANOV, CHORTON, VSMIRNOV] |
|A |[[20.0, JSMITH], [650.0, AJOHNSON], [650.2, CBAKER], [13.0, TGREEN]]|[CBAKER, AJOHNSON, JSMITH, TGREEN]|
+----------+--------------------------------------------------------------------+----------------------------------+
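The struct trick works because structs compare field by field, much like tuples. Here is a plain-Python sketch of the same idea on the sample data above, with no Spark required; `names_sorted_by_salary` is a hypothetical helper name.

```python
from collections import defaultdict

employees = [
    ("JSMITH", "A", 20.0), ("AJOHNSON", "A", 650.0), ("CBAKER", "A", 650.2),
    ("TGREEN", "A", 13.0), ("CHORTON", "B", 111.0), ("AIVANOV", "B", 233.0),
    ("VSMIRNOV", "B", 11.0),
]


def names_sorted_by_salary(rows):
    """Group (salary, name) pairs by department, sort descending, keep names."""
    grouped = defaultdict(list)
    for name, department, salary in rows:
        grouped[department].append((salary, name))  # the sort-by field goes first
    return {
        dept: [name for _, name in sorted(pairs, reverse=True)]
        for dept, pairs in grouped.items()
    }


result = names_sorted_by_salary(employees)
print(result["A"])
print(result["B"])
```

Putting the salary first in each tuple plays the role of the leading struct field: it decides the order, and the name is only a tie-breaker.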
