I have a DataFrame called Link with a dynamic number of fields/columns in a Row.
Some fields, however, have the structure [ClassName]Id and contain an id.
[ClassName]Id fields are always of type String.
I have a couple of Datasets, each of a different type [ClassName].
Each Dataset has at least the fields id (String) and typeName (String); typeName is always filled with the String value of the [ClassName].
For example, if I have 3 Datasets of type A, B and C:
Link:
+----+-----+-----+-----+
| id | AId | BId | CId |
+----+-----+-----+-----+
| XX | A01 | B02 | C04 |
| XY | null| B05 | C07 |
A:
+-----+----------+-----+-----+
| id | typeName | ... | ... |
+-----+----------+-----+-----+
| A01 | A | ... | ... |
B:
+-----+----------+-----+-----+
| id | typeName | ... | ... |
+-----+----------+-----+-----+
| B02 | B | ... | ... |
The preferred end result would be the Link DataFrame where each Id is either replaced by or appended with a field called [ClassName] containing the original object.
Result:
+----+----------------+----------------+----------------+
| id | A | B | C |
+----+----------------+----------------+----------------+
| XX | A(A01, A, ...) | B(B02, B, ...) | C(C04, C, ...) |
| XY | null | B(B05, B, ...) | C(C07, C, ...) |
Things I've tried
A recursive call on joinWith.
The first call succeeds, returning a tuple/Row where the first element is the original Row and the second the matched [ClassName].
However, the second iteration starts nesting these results (see the sketch below).
Trying to 'unnest' these results using map either ends in encoder hell (since the resulting Row is not a fixed type) or the encoder becomes so complex that it triggers a Catalyst error.
Joining as RDDs. I can't work this one out yet.
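For illustration, roughly what that first attempt looked like (a sketch; link, aDataset and bDataset stand for the Link DataFrame and two of the source Datasets):
// first joinWith: fine, it yields a Dataset[(Row, A)]
val step1 = link.joinWith(aDataset, link("AId") === aDataset("id"), "left_outer")
// second joinWith: the tuples start nesting, giving a Dataset[((Row, A), B)]
val step2 = step1.joinWith(bDataset, step1("_1.BId") === bDataset("id"), "left_outer")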
Any ideas are welcome.
So I figured out how I could do what I want.
I made some changes for it to work for my case.
For reference purposes I will show my steps; maybe it can be useful for someone in the future.
First I declare a datatype that shares all the properties of A, B, C, etc. that I'm interested in, and make the classes extend a common super type:
// Scala forbids case-to-case inheritance, so the shared fields live in a trait;
// Base is the concrete case class the Spark encoders below can work with
trait BaseFields { def id: String; def typeName: String }
case class Base(id: String, typeName: String) extends BaseFields
case class A(id: String, typeName: String) extends BaseFields // plus A's own fields
Next I load the link DataFrame:
val linkDataFrame = spark.read.parquet("[path]")
I want to convert this DataFrame into something joinable. That means creating a placeholder for the joined sources and a way to convert all the single Id fields (AId, BId, etc.) into one Map of source -> id. Spark SQL has a map function that is useful for this. We also need to convert the Base class to a StructType for use in the encoder. I tried multiple ways, but couldn't get around declaring it explicitly (otherwise I got casting errors):
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.catalyst.ScalaReflection
import spark.implicits._
case class LinkReformatted(ids: Map[String, String], sources: Map[String, Base])
// For every column ending in Id, produce a pair of (column name without the Id suffix, column value) to feed into map()
val mapper = linkDataFrame.columns.toList
.filter(
_.matches("(?i).*Id$")
)
.flatMap(
c => List(lit(c.replaceAll("(?i)Id$", "")), col(c))
)
val baseStructType = ScalaReflection.schemaFor[Base].dataType.asInstanceOf[StructType]
All these parts made it possible to create a new Dataset with all the ids together in one field called ids and a placeholder for the sources as a (for now null) Map[String, Base]:
val linkDatasetReformatted = linkDataFrame.select(
map(mapper: _*).alias("ids")
)
.withColumn("sources", lit(null).cast(MapType(StringType, baseStructType)))
.as[LinkReformatted]
The next step was to join all the source Datasets (A, B, etc.) to this reformatted Link Dataset. A lot happens in this tail-recursive method:
import scala.annotation.tailrec
@tailrec
def recursiveJoinBases(sourceDataset: Dataset[LinkReformatted], datasets: List[Dataset[Base]]): Dataset[LinkReformatted] = datasets match {
case Nil => sourceDataset // Nothing left to join, return it
case baseDataset :: remainingDatasets => {
val typeName = baseDataset.head.typeName // extract the type name from the base Dataset (every row has the same value)
val masterName = "source" // something to name the source
val joinedDataset = sourceDataset.as(masterName) // joining source
.joinWith(
baseDataset.as(typeName), // with a base A,B, etc
col(s"$typeName.id") === col(s"$masterName.ids.$typeName"), // join on source.ids.[typeName]
"left_outer"
)
.map {
case (source, base) => {
val newSources = if (source.sources == null) Map(typeName -> base) else source.sources + (typeName -> base) // append or create map of sources
source.copy(sources = newSources)
}
}
.as[LinkReformatted]
recursiveJoinBases(joinedDataset, remainingDatasets)
}
}
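For completeness, this is roughly how I invoke it (a sketch; the [path] placeholders and the select are assumptions, adjust them to your own sources):
// each source Dataset is narrowed to the shared Base fields so it fits Dataset[Base]
val aBase = spark.read.parquet("[pathA]").select("id", "typeName").as[Base]
val bBase = spark.read.parquet("[pathB]").select("id", "typeName").as[Base]
val cBase = spark.read.parquet("[pathC]").select("id", "typeName").as[Base]
val fullyJoined = recursiveJoinBases(linkDatasetReformatted, List(aBase, bBase, cBase))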
You now end up with a Dataset of LinkReformatted records where, for every typeName -> id entry in the ids field, there is a corresponding typeName -> Base entry in the sources field.
For me that was enough; I could extract everything I needed using a map function over this final Dataset.
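Something along these lines, for example (just a sketch; which fields you pull out of sources depends on your own needs):
// the sources map can hold null for ids that had no match (left outer join),
// hence the Option(...) wrapping
val extracted = fullyJoined.map { link =>
  val a = Option(link.sources.getOrElse("A", null))
  (link.ids.getOrElse("A", null), a.map(_.typeName).orNull)
}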
I hope this somewhat helps. I understand it's not the exact solution I was asking about, nor is it all very straightforward.
Related
Assuming a DataFrame where the content of a column is a list of 0 to n strings:
import pandas as pd
df = pd.DataFrame({'col_w_list':[['c/100/a/111','c/100/a/584','c/100/a/324'],
['c/100/a/327'],
['c/100/a/324','c/100/a/327'],
['c/100/a/111','c/100/a/584','c/100/a/999'],
['c/100/a/584','c/100/a/327','c/100/a/999']
]})
How would I go about transforming the column (either the same or a new one) if all I wanted was the last set of digits, meaning
| | target_still_list |
|--|-----------------------|
|0 | ['111', '584', '324'] |
|1 | ['327'] |
|2 | ['324', '327'] |
|3 | ['111', '584', '999'] |
|4 | ['584', '327', '999'] |
I know how to handle this one list at a time
from os import path
ls = ['c/100/a/111','c/100/a/584','c/100/a/324']
new_ls = [path.split(x)[1] for x in ls]
# or, alternatively
new_ls = [x.split('/')[3] for x in ls]
But I have failed at doing the same over a dataframe. For instance
df['target_still_list'] = df['col_w_list'].apply([lambda x: x.split('/')[3] for x in df['col_w_list']])
Throws an AttributeError at me.
How to apply transformation to each element?
For a data frame, you can use pandas.DataFrame.applymap.
For a series, you can use pandas.Series.map or pandas.Series.apply, which is your posted solution.
Your error is caused by the lambda expression: it receives an element x whose type is list, so you can iterate directly over its items.
The correct code should be,
df['target_still_list'] = df['col_w_list'].apply(lambda x: [item.split('/')[-1] for item in x])
# or
# df['target_still_list'] = df['col_w_list'].map(lambda x: [item.split('/')[-1] for item in x])
# or (NOTE: This assignment works only if df has only one column.)
# df['target_still_list'] = df.applymap(lambda x: [item.split('/')[-1] for item in x])
I have a dataframe having a column of type MapType<StringType, StringType>.
|-- identity: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
The identity column contains a key "update".
+-------------+
|     identity|
+-------------+
|[update -> Y]|
|[update -> Y]|
|[update -> Y]|
|[update -> Y]|
+-------------+
How do I change the value of key "update" from "Y" to "N"?
I'm using spark version 2.3
Any help will be appreciated. Thank you!
AFAIK, in Spark 2.3 there are no built-in functions to manipulate maps. The only way is probably to design a UDF:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._
val df = Seq(Map("1" -> "2", "3" -> "4"), Map("7" -> "8", "1" -> "6")).toDF("m")
// a function that sets the value to "new" for every key equal to "1"
val fun = udf((m : Map[String, String]) =>
  m.map{ case (key, value) => (key, if (key == "1") "new" else value) }
)
df.withColumn("m", fun('m)).show(false)
+------------------+
|m |
+------------------+
|{1 -> new, 3 -> 4}|
|{7 -> 8, 1 -> new}|
+------------------+
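Applied to the question's identity column it is the same idea, only the key and the replacement value change (a sketch; identityDf stands for your dataframe with the map<string,string> identity column):
val setUpdateToN = udf((m: Map[String, String]) =>
  if (m == null) null
  else m.map { case (k, v) => (k, if (k == "update") "N" else v) }
)
identityDf.withColumn("identity", setUpdateToN(col("identity")))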
JSON solution
One alternative is to explode the map, make the updates and re-aggregate it. Unfortunately, there is no way in Spark 2.3 to build a map column back from a dynamic number of entries. You could, however, aggregate the entries into a JSON dictionary and then use the from_json function. I am pretty sure the first solution is more efficient, but who knows; in PySpark this approach might be faster than the UDF.
df
.withColumn("id", monotonically_increasing_id)
.select($"id", explode('m))
.withColumn("value", when('key === "1" ,lit("new")).otherwise('value))
.withColumn("entry", concat(lit("\""), 'key, lit("\" : \""), 'value, lit("\"")))
.groupBy("id").agg( collect_list('entry) as "list")
.withColumn("json", concat(lit("{"), concat_ws(",", 'list), lit("}")))
.withColumn("m", from_json('json, MapType(StringType, StringType)))
.show(false)
Which yields the same result as before.
Here is my data :-
CREATE TABLE collect_things(k int PRIMARY KEY,n set<frozen<tuple<text, text>>>);
INSERT INTO collect_things (k, n) VALUES(1, {('hello', 'cassandra')});
CREATE INDEX n_index ON collect_things (n);
Now I have to query like this :-
SELECT * FROM collect_things WHERE n contains ('cassandra') ALLOW FILTERING ;
Output :-
k | n
---+---------
Expected output :-
k | n
---+---------
1 | {('hello', 'cassandra')}
I want to fetch the rows whose n contains the value 'cassandra'. Is this possible?
A tuple inside a collection must be defined as frozen.
A frozen value serializes multiple components into a single value. Non-frozen types allow updates to individual fields. Cassandra treats the value of a frozen type as a blob. The entire value must be overwritten.
You must treat a frozen value as a single unit and can't match on its individual components. So when querying, provide the complete frozen tuple ('hello', 'cassandra'):
SELECT * FROM collect_things WHERE n CONTAINS ('hello', 'cassandra');
If you have the data :
k | n
---+---------------------------------------------
1 | {('hello', 'cassandra'), ('test', 'search')}
2 | {('test', 'search')}
Output :
k | n
---+---------------------------------------------
1 | {('hello', 'cassandra'), ('test', 'search')}
Source : https://docs.datastax.com/en/cql/3.1/cql/cql_reference/collection_type_r.html
I have a DataFrame that contains various columns.
One column contains a Map[Integer,Integer[]].
It looks like { 2345 -> [1,34,2]; 543 -> [12,3,2,5]; 2 -> [3,4]}
Now what I need to do is filter out some keys.
I have a Set of Integers (javaIntSet) in Java with which I should filter such that
col(x).keySet.isin(javaIntSet)
i.e. the above map should only contain the keys 2 and 543, not the other two, and should look like {543 -> [12,3,2,5]; 2 -> [3,4]} after filtering.
Documentation of how to use the Java Column Class is sparse.
How do I extract col(x) so that I can just filter it in Java and then replace the cell data with a filtered map? Or are there any useful Column functions I am overlooking?
Can I write a UDF2<Map<Integer, Integer[]>, Set<Integer>, Map<Integer, Integer[]>>?
I can write a UDF1<String,String>, but I am not so sure how it works with more complex parameters.
Generally the javaIntSet has only a dozen, and usually fewer than 100, values. The Map usually also has only a handful of entries (typically 0-5).
I have to do this in Java (unfortunately) but I am familiar with Scala. A Scala answer that I translate myself to Java would already be very helpful.
You don't need a UDF. It might be cleaner with one, but you could just as easily do it with DataFrame.explode:
import spark.implicits._
case class MapTest(id: Int, map: Map[Int,Int])
val mapDf = Seq(
MapTest(1, Map((1,3),(2,10),(3,2)) ),
MapTest(2, Map((1,12),(2,333),(3,543)) )
).toDF("id", "map")
mapDf.show
+---+--------------------+
| id| map|
+---+--------------------+
| 1|Map(1 -> 3, 2 -> ...|
| 2|Map(1 -> 12, 2 ->...|
+---+--------------------+
Then you can use explode:
mapDf.explode($"map"){
case Row(map: Map[Int,Int] @unchecked) => {
val newMap = map.filter(m => m._1 != 1) // <-- do filtering here
Seq(Tuple1(newMap))
}
}.show
+---+--------------------+--------------------+
| id| map| _1|
+---+--------------------+--------------------+
| 1|Map(1 -> 3, 2 -> ...|Map(2 -> 10, 3 -> 2)|
| 2|Map(1 -> 12, 2 ->...|Map(2 -> 333, 3 -...|
+---+--------------------+--------------------+
If you did want to do the UDF, it would look like this:
val mapFilter = udf[Map[Int,Int],Map[Int,Int]](map => {
val newMap = map.filter(m => m._1 != 1) // <-- do filtering here
newMap
})
mapDf.withColumn("newMap", mapFilter($"map")).show
+---+--------------------+--------------------+
| id| map| newMap|
+---+--------------------+--------------------+
| 1|Map(1 -> 3, 2 -> ...|Map(2 -> 10, 3 -> 2)|
| 2|Map(1 -> 12, 2 ->...|Map(2 -> 333, 3 -...|
+---+--------------------+--------------------+
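To plug in the asker's set of keys, close over it instead of hard-coding the predicate (a sketch that follows the demo's Map[Int,Int]; intSet stands for the question's javaIntSet converted to a Scala Set[Int]):
val intSet: Set[Int] = Set(2, 543) // e.g. the javaIntSet converted with JavaConverters
val keepKeys = udf[Map[Int,Int], Map[Int,Int]](map => map.filter { case (k, _) => intSet.contains(k) })
mapDf.withColumn("filteredMap", keepKeys($"map")).show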
DataFrame.explode is a little more complicated, but ultimately more flexible. For example, you could divide the original row into two rows -- one containing the map with the elements filtered out, the other a map with the reverse -- the elements that were filtered.
I have a list that is formatted like the lists below:
List(List(21, Georgetown, Male),List(29, Medford, Male),List(18, Manchester, Male),List(27, Georgetown, Female))
And I need to count the occurrences of each unique town name, then return the town name and the number of times it was counted. But I only want to return the one town that had the most occurrences. So if I applied the function to the list above, I would get:
(Georgetown, 2)
I'm coming from Java, so I know how to do this process in a longer way, but I want to utilize some of Scala's built-in methods.
scala> val towns = List(
| List(21, "Georgetown", "Male"),
| List(29, "Medford", "Male"),
| List(18, "Manchester", "Male"),
| List(27, "Georgetown", "Female"))
towns: List[List[Any]] = ...
scala> towns.map({ case List(a, b, c) => (b, c) }).groupBy(_._1).mapValues(_.length).maxBy(_._2)
res0: (Any, Int) = (Georgetown,2)
This is a pretty weird structure, but a way to do it would be with:
val items : List[List[Any]] = List(
List(List(21, "Georgetown", "Male")),
List(List(29, "Medford", "Male")),
List(List(18, "Manchester", "Male")),
List(List(27, "Georgetown", "Female"))).map(_.flatten)
val results = items.foldLeft(Map[String,Int]()) {
(acc,item) =>
val key = item(1).asInstanceOf[String]
val count = acc.getOrElse(key, 0 )
acc + (key -> (count + 1))
}
println(results)
Which produces:
Map(Georgetown -> 2, Medford -> 1, Manchester -> 1)
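And since only the most frequent town is wanted, one more maxBy on that map gives the requested pair:
println(results.maxBy(_._2)) // (Georgetown,2)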