How do I merge rows of maps into one map in Spark? - apache-spark

I currently have a dataframe:
df
.groupBy($"letters")
.agg(collect_list($"numbers").as("numbers"))
.select(map($"letters",$"numbers").as("data"))
.agg(collect_list($"data").as("data"))
.select(to_json($"data").as("output"))
.show(false)
+------------------------------------------+
|output |
+------------------------------------------+
|[{"abc":["123","456"]},{"def":["123"]}] |
+------------------------------------------+
How can I get it into this format?
+------------------------------------------+
|output |
+------------------------------------------+
|{"abc":["123","456"],"def":["123"]} |
+------------------------------------------+
So that, basically, it is one map with no [] brackets at the ends.
In other words, it is currently:
res34: org.apache.spark.sql.DataFrame = [data: array<map<string,array<string>>>]

Assuming your dataset is this:
+-------------------------------------+
|output |
+-------------------------------------+
|[{abc -> [123, 456]}, {def -> [123]}]|
+-------------------------------------+
That has been created through:
import spark.implicits._
import org.apache.spark.sql.functions.{col, expr, map_from_arrays}

var ds = spark.sparkContext.parallelize(Seq(
  List(Map("abc" -> Array("123", "456")), Map("def" -> Array("123")))
)).toDF("output")
We get:
+-------------------------------------+----------+-------------------+---------------------------------+
|output |keys |values |map |
+-------------------------------------+----------+-------------------+---------------------------------+
|[{abc -> [123, 456]}, {def -> [123]}]|[abc, def]|[[123, 456], [123]]|{abc -> [123, 456], def -> [123]}|
+-------------------------------------+----------+-------------------+---------------------------------+
Through:
ds = ds
.withColumn("keys", expr("transform(output, x -> map_keys(x)[0])"))
.withColumn("values", expr("transform(output, x -> flatten(map_values(x)))"))
.withColumn("map", map_from_arrays(col("keys"), col("values")))
I could not think of a better solution; the code is fairly self-explanatory, so I don't think there is a need for comments. Hope it works for you, good luck!
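If you then need the single JSON string the question asks for, to_json can be applied to the map column built above. A small follow-up sketch, assuming the ds from this answer and a Spark version whose to_json accepts map columns (recent versions do):
import org.apache.spark.sql.functions.{col, to_json}
ds.select(to_json(col("map")).as("output")).show(false)
+------------------------------------------+
|output |
+------------------------------------------+
|{"abc":["123","456"],"def":["123"]} |
+------------------------------------------+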

Try pyspark.sql.functions.map_concat.
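map_concat merges two or more map columns into one map (it exists in both PySpark and the Scala API). For this question the maps sit inside a single array column, so a closely related trick is to flatten every map's entries and rebuild one map from them. A hedged Scala sketch, assuming df is the question's DataFrame with its data: array<map<string,array<string>>> column and a reasonably recent Spark (2.4+ for these functions; map display shown in Spark 3 style):
import org.apache.spark.sql.functions.expr
// map_concat itself combines separate map columns, e.g. map_concat(col("m1"), col("m2")).
// For one array of maps, collect all entries and build a single map:
val merged = df.select(
  expr("map_from_entries(flatten(transform(data, m -> map_entries(m))))").as("merged")
)
merged.show(false)
+---------------------------------+
|merged                           |
+---------------------------------+
|{abc -> [123, 456], def -> [123]}|
+---------------------------------+
Applying to_json to the merged column then gives the single JSON object the question asks for.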

Related

Dataframe column is list of strings: how to apply transformation to each element?

Assuming a dataframe where the content of a column is a list of 0 to n strings:
import pandas as pd

df = pd.DataFrame({'col_w_list': [['c/100/a/111', 'c/100/a/584', 'c/100/a/324'],
                                  ['c/100/a/327'],
                                  ['c/100/a/324', 'c/100/a/327'],
                                  ['c/100/a/111', 'c/100/a/584', 'c/100/a/999'],
                                  ['c/100/a/584', 'c/100/a/327', 'c/100/a/999']]})
How would I go about transforming the column (either the same or a new one) if all I wanted was the last set of digits, meaning
| | target_still_list |
|--|-----------------------|
|0 | ['111', '584', '324'] |
|1 | ['327'] |
|2 | ['324', '327'] |
|3 | ['111', '584', '999'] |
|4 | ['584', '327', '999'] |
I know how to handle this one list at a time
from os import path
ls = ['c/100/a/111','c/100/a/584','c/100/a/324']
new_ls = [path.split(x)[1] for x in ls]
# or, alternatively
new_ls = [x.split('/')[3] for x in ls]
But I have failed at doing the same over a dataframe. For instance
df['target_still_list'] = df['col_w_list'].apply([lambda x: x.split('/')[3] for x in df['col_w_list']])
Throws an AttributeError at me.
How to apply transformation to each element?
For a data frame, you can use pandas.DataFrame.applymap.
For a series, you can use pandas.Series.map or pandas.Series.apply, which is your posted solution.
Your error is caused by the lambda expression: apply passes each element x to it, and since x is already a list, you can iterate over its items directly.
The correct code should be:
df['target_still_list'] = df['col_w_list'].apply(lambda x: [item.split('/')[-1] for item in x])
# or
# df['target_still_list'] = df['col_w_list'].map(lambda x: [item.split('/')[-1] for item in x])
# or (NOTE: This assignment works only if df has only one column.)
# df['target_still_list'] = df.applymap(lambda x: [item.split('/')[-1] for item in x])

How to change value in a Map Datatype

I have a dataframe having a column of type MapType<StringType, StringType>.
|-- identity: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
The identity column contains the key "update":
+-------------+
|identity     |
+-------------+
|[update -> Y]|
|[update -> Y]|
|[update -> Y]|
|[update -> Y]|
+-------------+
How do I change the value of key "update" from "Y" to "N"?
I'm using Spark version 2.3.
Any help will be appreciated. Thank you!
AFAIK, in Spark 2.3 there are no built-in functions to handle maps. The only way is probably to design a UDF:
import spark.implicits._
import org.apache.spark.sql.functions.udf

val df = Seq(Map(1 -> 2, 3 -> 4), Map(7 -> 8, 1 -> 6)).toDF("m")

// a function that sets the value to "new" for every key equal to "1"
val fun = udf((m: Map[String, String]) =>
  m.map { case (key, value) => (key, if (key == "1") "new" else value) }
)
df.withColumn("m", fun('m)).show(false)
+------------------+
|m |
+------------------+
|{1 -> new, 3 -> 4}|
|{7 -> 8, 1 -> new}|
+------------------+
JSON solution
One alternative is to explode the map, make the updates and re-aggregate it. Unfortunately, there is no way in Spark 2.3 to create a map from a dynamic number of items. You could, however, aggregate the entries into a JSON dictionary and then use the from_json function. I am pretty sure the first solution is more efficient, but who knows. In PySpark, this approach might be faster than the UDF, though.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{MapType, StringType}

df
  .withColumn("id", monotonically_increasing_id())
  .select($"id", explode('m))
  .withColumn("value", when('key === "1", lit("new")).otherwise('value))
  .withColumn("entry", concat(lit("\""), 'key, lit("\" : \""), 'value, lit("\"")))
  .groupBy("id").agg(collect_list('entry) as "list")
  .withColumn("json", concat(lit("{"), concat_ws(",", 'list), lit("}")))
  .withColumn("m", from_json('json, MapType(StringType, StringType)))
  .show(false)
Which yields the same result as before.
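As a side note for readers on newer versions (not an option on the asker's 2.3): Spark 3.0+ has a built-in transform_values higher-order function, so the same update needs neither a UDF nor the JSON round trip. A minimal sketch against the df defined above:
import org.apache.spark.sql.functions.expr

// Spark 3.0+ only: rewrite the value for key "1", leave everything else untouched
df.withColumn("m", expr("transform_values(m, (k, v) -> IF(k = '1', 'new', v))"))
  .show(false)
+------------------+
|m                 |
+------------------+
|{1 -> new, 3 -> 4}|
|{7 -> 8, 1 -> new}|
+------------------+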

PySpark cosine-similarity Transformer

I have a DataFrame with two columns, each containing vectors, e.g.
+-------------+------------+
| v1 | v2 |
+-------------+------------+
| [1,1.2,0.4] | [2,0.4,5] |
| [1,.2,0.6] | [2,.2,5] |
| . | . |
| . | . |
| . | . |
| [0,1.2,.6] | [2,.2,0.4] |
+-------------+------------+
I would like to add another column to this DataFrame that contains the cosine similarity between the two vectors in each row.
Is there a Transformer for this?
Is Transformer the right approach for this task?
If it is the right approach and there is no such Transformer, could you give me a pointer on how to write one myself?
Not aware of any transformation that can directly compute cosine similarity here.
You can write your own UDF for such functionality:
from pyspark.ml.linalg import Vectors, DenseVector
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.types import *
v = [(DenseVector([1, 1.2, 0.4]), DenseVector([2, 0.4, 5])),
     (DenseVector([1, 2, 0.6]), DenseVector([2, 0.2, 5])),
     (DenseVector([0, 1.2, 0.6]), DenseVector([2, 0.2, 0.4]))]
dfv1 = spark.createDataFrame(v, ['v1', 'v2'])
dfv1 = dfv1.withColumn('v1v2', F.struct([F.col('v1'), F.col('v2')]))
dfv1.show(truncate=False)
Here's the DataFrame with combined vectors:
+-------------+-------------+------------------------------+
|v1 |v2 |v1v2 |
+-------------+-------------+------------------------------+
|[1.0,1.2,0.4]|[2.0,0.4,5.0]|[[1.0,1.2,0.4], [2.0,0.4,5.0]]|
|[1.0,2.0,0.6]|[2.0,0.2,5.0]|[[1.0,2.0,0.6], [2.0,0.2,5.0]]|
|[0.0,1.2,0.6]|[2.0,0.2,0.4]|[[0.0,1.2,0.6], [2.0,0.2,0.4]]|
+-------------+-------------+------------------------------+
Now we can define our udf for cosine similarity:
dot_prod_udf = F.udf(lambda v: float(v[0].dot(v[1])/v[0].norm(None)/v[1].norm(None)), FloatType())
dfv1 = dfv1.withColumn('cosine_similarity', dot_prod_udf(dfv1['v1v2']))
dfv1.show(truncate=False)
The last column shows the cosine similarity:
+-------------+-------------+------------------------------+-----------------+
|v1 |v2 |v1v2 |cosine_similarity|
+-------------+-------------+------------------------------+-----------------+
|[1.0,1.2,0.4]|[2.0,0.4,5.0]|[[1.0,1.2,0.4], [2.0,0.4,5.0]]|0.51451445 |
|[1.0,2.0,0.6]|[2.0,0.2,5.0]|[[1.0,2.0,0.6], [2.0,0.2,5.0]]|0.4328257 |
|[0.0,1.2,0.6]|[2.0,0.2,0.4]|[[0.0,1.2,0.6], [2.0,0.2,0.4]]|0.17457432 |
+-------------+-------------+------------------------------+-----------------+

Spark-java : Exception in thread "main" org.apache.spark.sql.AnalysisException [duplicate]

I am new to Spark SQL.
In MS SQL, we have the LEFT keyword: LEFT(Columnname, 1) IN ('D','A') THEN 1 ELSE 0.
How do I implement the same in Spark SQL?
You can use the substring function with a positive pos to take from the left:
import org.apache.spark.sql.functions.substring
substring(column, 0, 1)
and a negative pos to take from the right:
substring(column, -1, 1)
So in Scala you can define
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.substring
def left(col: Column, n: Int) = {
  assert(n >= 0)
  substring(col, 0, n)
}

def right(col: Column, n: Int) = {
  assert(n >= 0)
  substring(col, -n, n)
}

val df = Seq("foobar").toDF("str")

df.select(
  Seq(left _, right _).flatMap(f => (1 to 3).map(i => f($"str", i))): _*
).show
+--------------------+--------------------+--------------------+---------------------+---------------------+---------------------+
|substring(str, 0, 1)|substring(str, 0, 2)|substring(str, 0, 3)|substring(str, -1, 1)|substring(str, -2, 2)|substring(str, -3, 3)|
+--------------------+--------------------+--------------------+---------------------+---------------------+---------------------+
| f| fo| foo| r| ar| bar|
+--------------------+--------------------+--------------------+---------------------+---------------------+---------------------+
Similarly in Python:
from pyspark.sql.functions import substring
from pyspark.sql.column import Column
def left(col, n):
    assert isinstance(col, (Column, str))
    assert isinstance(n, int) and n >= 0
    return substring(col, 0, n)

def right(col, n):
    assert isinstance(col, (Column, str))
    assert isinstance(n, int) and n >= 0
    return substring(col, -n, n)
import org.apache.spark.sql.functions._
Use substring(column, 0, 1) instead of the LEFT function, where:
0 : starting position in the string (Spark's substring is 1-based; a pos of 0 behaves the same as 1)
1 : number of characters to be selected
Example: consider the LEFT function
LEFT(upper(SKU), 2)
The corresponding Spark SQL statement would be:
substring(upper(SKU), 1, 2)
To build upon user6910411's answer, you can also use isin to build a new column with the result of your character comparison.
The final code would look something like this:
import org.apache.spark.sql.functions._
df.select(substring($"Columnname", 0, 1) as "ch")
.withColumn("result", when($"ch".isin("D", "A"), 1).otherwise(0))
There are Spark SQL right and left functions as of Spark 2.3
Suppose you have the following DataFrame.
+----------------------+
|some_string |
+----------------------+
|this 23 has 44 numbers|
|no numbers |
|null |
+----------------------+
Here's how to get the leftmost two elements using the SQL left function:
df.select(expr("left(some_string, 2)").as("left_two")).show(false)
+--------+
|left_two|
+--------+
|th |
|no |
|null |
+--------+
Passing in SQL strings to expr() isn't ideal. Scala API users don't want to deal with SQL string formatting.
I created a library called bebe that provides easy access to the left function:
df.select(bebe_left(col("some_string"), lit(2)).as("left_two")).show()
+--------+
|left_two|
+--------+
|th |
|no |
|null |
+--------+
The Spark SQL right and bebe_right functions work in a similar manner.
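For reference, the SQL right function can be called the same way through expr (a quick sketch on the same hypothetical DataFrame):
import org.apache.spark.sql.functions.expr
df.select(expr("right(some_string, 2)").as("right_two")).show(false)
+---------+
|right_two|
+---------+
|rs       |
|rs       |
|null     |
+---------+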
You can use the Spark SQL functions with the expr hack, but it's better to use the bebe functions, which are more flexible and type-safe.

extract or filter MapType of Spark DataFrame

I have a DataFrame that contains various columns.
One column contains a Map[Integer,Integer[]].
It looks like { 2345 -> [1,34,2]; 543 -> [12,3,2,5]; 2 -> [3,4]}
Now what I need to do is filter out some keys.
I have a Set of Integers (javaIntSet) in Java with which I should filter such that
col(x).keySet.isin(javaIntSet)
i.e. the above map should only contain the keys 2 and 543, but not the others, and should look like {543 -> [12,3,2,5]; 2 -> [3,4]} after filtering.
Documentation of how to use the Java Column Class is sparse.
How do I extract the col(x) such that I can just filter it in Java and then replace the cell data with a filtered map? Or are there any useful Column functions I am overlooking?
Can I write a UDF2<Map<Integer, Integer[]>, Set<Integer>, Map<Integer, Integer[]>>?
I can write a UDF1<String, String>, but I am not so sure how it works with more complex parameters.
Generally the javaIntSet has only a dozen and usually fewer than 100 values. The map usually also has only a handful of entries (usually 0-5).
I have to do this in Java (unfortunately), but I am familiar with Scala. A Scala answer that I can translate to Java myself would already be very helpful.
You don't need a UDF. Might be cleaner with one, but you could just as easily do it with DataFrame.explode:
case class MapTest(id: Int, map: Map[Int, Int])

val mapDf = Seq(
  MapTest(1, Map((1, 3), (2, 10), (3, 2))),
  MapTest(2, Map((1, 12), (2, 333), (3, 543)))
).toDF("id", "map")
mapDf.show
+---+--------------------+
| id| map|
+---+--------------------+
| 1|Map(1 -> 3, 2 -> ...|
| 2|Map(1 -> 12, 2 ->...|
+---+--------------------+
Then you can use explode:
mapDf.explode($"map") {
  case Row(map: Map[Int, Int] @unchecked) => {
    val newMap = map.filter(m => m._1 != 1) // <-- do filtering here
    Seq(Tuple1(newMap))
  }
}.show
+---+--------------------+--------------------+
| id| map| _1|
+---+--------------------+--------------------+
| 1|Map(1 -> 3, 2 -> ...|Map(2 -> 10, 3 -> 2)|
| 2|Map(1 -> 12, 2 ->...|Map(2 -> 333, 3 -...|
+---+--------------------+--------------------+
If you did want to do the UDF, it would look like this:
val mapFilter = udf[Map[Int, Int], Map[Int, Int]](map => {
  val newMap = map.filter(m => m._1 != 1) // <-- do filtering here
  newMap
})
mapDf.withColumn("newMap", mapFilter($"map")).show
+---+--------------------+--------------------+
| id| map| newMap|
+---+--------------------+--------------------+
| 1|Map(1 -> 3, 2 -> ...|Map(2 -> 10, 3 -> 2)|
| 2|Map(1 -> 12, 2 ->...|Map(2 -> 333, 3 -...|
+---+--------------------+--------------------+
DataFrame.explode is a little more complicated, but ultimately more flexible. For example, you could divide the original row into two rows -- one containing the map with the elements filtered out, the other a map with the reverse -- the elements that were filtered.
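A side note for readers on newer Spark versions (these built-ins did not exist when this answer was written): Spark 3.0+ has a map_filter higher-order function that drops unwanted keys without a UDF or explode. A hedged sketch against the mapDf above, with a hard-coded key list standing in for the javaIntSet:
import org.apache.spark.sql.functions.expr

// Spark 3.0+ only: keep only the keys 2 and 3 (substitute the contents of javaIntSet here)
mapDf.withColumn("newMap", expr("map_filter(`map`, (k, v) -> k IN (2, 3))")).show(false)
This keeps {2 -> 10, 3 -> 2} in the first row and {2 -> 333, 3 -> 543} in the second.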
