spark find which persons often go to the same countries - apache-spark

The data source is:
val spark = SparkSession.builder()
  .master("local[1,1]")
  .config("spark.sql.shuffle.partitions", "1")
  .config("spark.sql.crossJoin.enabled", "true")
  .getOrCreate()
spark.sparkContext.setLogLevel("error")
import spark.implicits._
val df=Seq(
("tom","America","2019"),
("jim","America","2019"),
("jack","America","2019"),
("tom","Russia","2019"),
("jim","Russia","2019"),
("jack","Russia","2019"),
("alex","Russia","2019"),
("tom","America","2018"),
("jim","America","2018"),
("tom","Germany","2018"),
("jim","England","2018")
).toDF("person","country","year")
I want to find which persons often go to the same countries each year, and where they went together, so what I expect is JSON like this:
[{
"year": "2019",
"items": [{
"persons": ["tom", "jim", "jack"],
"common": ["America", "Russia"],
"times": 2
}, {
"persons": ["tom", "jack"],
"common": ["America", "Russia"],
"times": 2
}, {
"persons": ["tom", "jim"],
"common": ["America", "Russia"],
"times": 2
}, {
"persons": ["jack", "jim"],
"common": ["America", "Russia"],
"times": 2
}]
},
{
"year": "2018",
"items": [{
"persons": ["tom", "jim"],
"common": ["America"],
"times": 1
}]
}
]
Then I am not sure what model I should use.
I tried frequent pattern mining (FPGrowth):
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val df2019 = df.where('year === 2019)
val rdd2019 = df2019.groupBy("country").agg(collect_set('person))
  .drop("country")
  .as[Array[String]].rdd
val fpg = new FPGrowth()
  .setMinSupport(0.3)
  .setNumPartitions(10)
val schema = new StructType()
  .add("items", ArrayType(StringType))
  .add("freq", LongType)
val model2019 = fpg.run(rdd2019)
val rows2019 = model2019.freqItemsets.map(itemset => Row(itemset.items, itemset.freq))
spark.createDataFrame(rows2019, schema).where(size('items) > 1)
  .show()
// loop for every year
val df2018 = df.where('year === 2018)
val rdd2018 = df2018.groupBy("country").agg(collect_set('person))
  .drop("country")
  .as[Array[String]].rdd
....
val model2018 = fpg.run(rdd2018)
....
The result is:
For 2019:
+----------------+----+
| items|freq|
+----------------+----+
| [jack, tom]| 2|
|[jack, tom, jim]| 2|
| [jack, jim]| 2|
| [tom, jim]| 2|
+----------------+----+
For 2018:
+----------+----+
| items|freq|
+----------+----+
|[tom, jim]| 1|
+----------+----+
But I cannot get when and where they went together, because the RDD I pass to FPGrowth must be an RDD[Array[String]]; no extra columns are allowed.
Is there a better model? How can I achieve this?
I also want to know how many times each group of persons went together.
Maybe I should use collaborative filtering?

Just self-join and aggregate
import org.apache.spark.sql.functions._
df.alias("left")
.join(df.alias("right"), Seq("country", "year"))
.where($"left.person" < $"right.person")
.groupBy(array($"left.person", $"right.person").alias("persons"))
.agg(collect_set(struct($"country", $"year")).alias("common"))
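If the per-pair count ("times") and the year are also needed, the same self-join can carry them; a minimal sketch assuming the df from the question (the extra columns and names are mine, not from this answer):
import org.apache.spark.sql.functions._

// Keep the year as a join/grouping key and count the countries each pair shared,
// which is the "times" value from the question.
df.alias("l")
  .join(df.alias("r"), Seq("country", "year"))
  .where($"l.person" < $"r.person")
  .groupBy($"year", array($"l.person", $"r.person").alias("persons"))
  .agg(collect_set($"country").alias("common"))
  .withColumn("times", size($"common"))
  .show(false)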

Try this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val window = Window.partitionBy("country", "year")
df
.withColumn("persons", collect_set('person) over window)
.drop('person)
.distinct()
.groupBy('persons)
.agg(collect_set(struct('country, 'year)).alias("common"))
Output (tested):
+----------+----------------------------------+
|persons |common |
+----------+----------------------------------+
|[jim, tom]|[[America, 2019], [Russia, 2019]] |
|[tom] |[[Germany, 2018], [America, 2018]]|
|[jim] |[[Russia, 2018], [England, 2018]] |
+----------+----------------------------------+
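To get one document per year, as in the expected JSON, a variation of the same window approach can keep the year as a grouping key before nesting; a rough sketch assuming the df and implicits from the question (it keeps each country/year group exactly as collected and does not enumerate sub-groups the way FPGrowth does):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Group persons per (country, year), collect the countries each group shares
// within a year, count them as "times", then nest the items per year.
val window = Window.partitionBy("country", "year")

df.withColumn("persons", collect_set('person) over window)
  .select('year, 'country, 'persons)
  .distinct()
  .groupBy('year, 'persons)
  .agg(collect_set('country).alias("common"))
  .withColumn("times", size('common))
  .groupBy('year)
  .agg(collect_list(struct('persons, 'common, 'times)).alias("items"))
  .toJSON
  .show(false)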

Related

How to (dynamically) join array with a struct, to get a value from the struct for each element in the array?

I am trying to parse/flatten JSON data containing an array and a struct.
For every "Id" in the "data_array" column, I need to get the "EstValue" from the "data_struct" column. The column name in "data_struct" is the actual id (from "data_array"). I tried my best to use a dynamic join, but I get the error "Column is not iterable". Can't we use dynamic join conditions in PySpark, like we can in SQL? Is there a better way to achieve this?
JSON Input file:
{
"data_array": [
{
"id": 1,
"name": "ABC"
},
{
"id": 2,
"name": "DEF"
}
],
"data_struct": {
"1": {
"estimated": {
"value": 123
},
"completed": {
"value": 1234
}
},
"2": {
"estimated": {
"value": 456
},
"completed": {
"value": 4567
}
}
}
}
Desired output:
Id Name EstValue CompValue
1 ABC 123 1234
2 DEF 456 4567
My PySpark code:
from pyspark.sql.functions import *
rawDF = spark.read.json([f"abfss://{pADLSContainer}@{pADLSGen2}.dfs.core.windows.net/{pADLSDirectory}/InputFile.json"], multiLine = "true")
idDF = rawDF.select(explode("data_array").alias("data_array")) \
.select(col("data_array.id").alias("id"))
idDF.show(n=2,vertical=True,truncate=150)
finalDF = idDF.join(rawDF, (idDF.id == rawDF.select(col("data_struct." + idDF.Id))) )
finalDF.show(n=2,vertical=True,truncate=150)
Error:
def __iter__(self): raise TypeError("Column is not iterable")
Self-joins create problems. In this case, you can avoid the join.
You could make arrays from both columns, zip them together and use inline to extract them into columns. The most difficult part is creating an array from the "data_struct" column. Maybe there's a better way, but I could only think of first transforming it into a map type.
Input:
s = """
{
"data_array": [
{
"id": 1,
"name": "ABC"
},
{
"id": 2,
"name": "DEF"
}
],
"data_struct": {
"1": {
"estimated": {
"value": 123
},
"completed": {
"value": 1234
}
},
"2": {
"estimated": {
"value": 456
},
"completed": {
"value": 4567
}
}
}
}
"""
rawDF = spark.read.json(sc.parallelize([s]), multiLine = "true")
Script:
from pyspark.sql import functions as F

id = F.transform('data_array', lambda x: x.id).alias('Id')
name = F.transform('data_array', lambda x: x['name']).alias('Name')
map = F.from_json(F.to_json("data_struct"), 'map<string, struct<estimated:struct<value:long>,completed:struct<value:long>>>')
est_val = F.transform(id, lambda x: map[x].estimated.value).alias('EstValue')
comp_val = F.transform(id, lambda x: map[x].completed.value).alias('CompValue')
df = rawDF.withColumn('y', F.arrays_zip(id, name, est_val, comp_val))
df = df.selectExpr("inline(y)")
df.show()
# +---+----+--------+---------+
# | Id|Name|EstValue|CompValue|
# +---+----+--------+---------+
# | 1| ABC| 123| 1234|
# | 2| DEF| 456| 4567|
# +---+----+--------+---------+

How can I do such a transformation?

env: spark2.4.5
source.json:
{
"a_key": "1",
"a_pro": "2",
"a_con": "3",
"b_key": "4",
"b_pro": "5",
"b_con": "6",
"c_key": "7",
"c_pro": "8",
"c_con": "9",
...
}
target.json:
{
"factors": [
{
"name": "a",
"key": "1",
"pros": "2",
"cons": "3"
},
{
"name": "b",
"key": "4",
"pros": "5",
"cons": "6"
},
{
"name": "c",
"key": "7",
"pros": "8",
"cons": "9"
},
...
]
}
As you can see, the target 'name' is part of the source keys. For instance, 'a' is the 'name' of 'a_key', 'a_pro' and 'a_con'. I really don't know how to extract a value from a key and do this kind of 'group by' transformation. Can anybody give me a suggestion?
IIUC, first create the DataFrame from the input JSON:
json_data = {
"a_key": "1",
"a_pro": "2",
"a_con": "3",
"b_key": "4",
"b_pro": "5",
"b_con": "6",
"c_key": "7",
"c_pro": "8",
"c_con": "9"
}
df=spark.createDataFrame(list(map(list,json_data.items())),['key','value'])
df.show()
+-----+-----+
| key|value|
+-----+-----+
|a_key| 1|
|a_pro| 2|
|a_con| 3|
|b_key| 4|
|b_pro| 5|
|b_con| 6|
|c_key| 7|
|c_pro| 8|
|c_con| 9|
+-----+-----+
Now create the required columns from the existing column:
import pyspark.sql.functions as f
df2 = df.withColumn('Name', f.substring('key',1,1)).\
withColumn('Attributes', f.concat(f.split('key','_')[1],f.lit('s')))
df2.show()
+-----+-----+----+----------+
| key|value|Name|Attributes|
+-----+-----+----+----------+
|a_key| 1| a| keys|
|a_pro| 2| a| pros|
|a_con| 3| a| cons|
|b_key| 4| b| keys|
|b_pro| 5| b| pros|
|b_con| 6| b| cons|
|c_key| 7| c| keys|
|c_pro| 8| c| pros|
|c_con| 9| c| cons|
+-----+-----+----+----------+
Now pivot the DataFrame and collect the result as a JSON object:
output_json = df2.groupBy('Name').\
pivot('Attributes').\
agg(f.min('value')).\
select(f.collect_list(f.struct('Name','keys','cons','pros')).alias('factors')).\
toJSON().collect()
import json
print(json.dumps(json.loads(output_json[0]),indent=4))
{
"factors": [
{
"Name": "c",
"keys": "7",
"cons": "9",
"pros": "8"
},
{
"Name": "b",
"keys": "4",
"cons": "6",
"pros": "5"
},
{
"Name": "a",
"keys": "1",
"cons": "3",
"pros": "2"
}
]
}
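In case the Scala API is preferred, the same pivot idea translates directly; a sketch of mine (not from the answer above), assuming the flat JSON has been loaded as key/value pairs and that it runs where spark and its implicits are in scope (e.g. spark-shell):
import org.apache.spark.sql.functions._
import spark.implicits._

// Assumed input: the flat source as a Map[String, String]
val source = Map(
  "a_key" -> "1", "a_pro" -> "2", "a_con" -> "3",
  "b_key" -> "4", "b_pro" -> "5", "b_con" -> "6",
  "c_key" -> "7", "c_pro" -> "8", "c_con" -> "9")

val kv = source.toSeq.toDF("key", "value")

// Split the key into name and attribute, pivot on the attribute,
// then collect one struct per name into a "factors" array.
val factors = kv
  .withColumn("name", split($"key", "_")(0))
  .withColumn("attr", split($"key", "_")(1))
  .groupBy("name")
  .pivot("attr", Seq("key", "pro", "con"))
  .agg(first($"value"))
  .withColumnRenamed("pro", "pros")
  .withColumnRenamed("con", "cons")
  .select(collect_list(struct($"name", $"key", $"pros", $"cons")).alias("factors"))

factors.toJSON.show(false)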
No need to involve dataframes for this, some simple string and dictionary manipulation will do:
import json
source = {
"a_key": "1",
"a_pro": "2",
"a_con": "3",
"b_key": "4",
"b_pro": "5",
"b_con": "6",
"c_key": "7",
"c_pro": "8",
"c_con": "9",
}
factors = {}
# Prepare each factor dictionary
for k, v in source.items():
factor, item = k.split('_')
d = factors.get(factor, {})
d[item] = v
factors[factor] = d
# Prepare result dictionary
target = {
'factors': []
}
# Move name attribute into dictionary & append
for k, v in factors.items():
d = v
d['name'] = k
target['factors'].append(d)
result = json.dumps(target)
Your data is strange, but the following code can help you solve it:
source.json:
{
"a_key": "1",
"a_pro": "2",
"a_con": "3",
"b_key": "4",
"b_pro": "5",
"b_con": "6",
"c_key": "7",
"c_pro": "8",
"c_con": "9"
}
code:
// Assumes Person and Factors are simple case classes, e.g.:
// case class Person(name: String, key: String, pros: String, cons: String)
// case class Factors(factors: java.util.List[Person])
import java.util
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import scala.collection.mutable.ArrayBuffer

val sparkSession = SparkSession.builder()
  .appName("readAndWriteJsonTest")
  .master("local[*]").getOrCreate()
val dataFrame = sparkSession.read.format("json").load("R:\\data\\source.json")
// println(dataFrame.rdd.count())
val mapRdd: RDD[(String, (String, String))] = dataFrame.rdd.map(_.getString(0))
.filter(_.split("\\:").length == 2)
.map(line => {
val Array(key1, value1) = line.split("\\:")
val Array(name, key2) = key1.replace("\"", "").trim.split("\\_")
val value2 = value1.replace("\"", "").replace(",", "").trim
(name, (key2, value2))
})
// mapRdd.collect().foreach(println)
val initVale = new ArrayBuffer[(String, String)]
val function1 = (buffer1: ArrayBuffer[(String, String)], t1: (String, String)) => buffer1.+=(t1)
val function2 = (buffer1: ArrayBuffer[(String, String)], buffer2: ArrayBuffer[(String, String)]) => buffer1.++(buffer2)
val aggRdd: RDD[(String, ArrayBuffer[(String, String)])] = mapRdd.aggregateByKey(initVale)(function1, function2)
// aggRdd.collect().foreach(println)
import scala.collection.JavaConverters._
val persons: util.List[Person] = aggRdd.map(line => {
val name = line._1
val keyValue = line._2(0)._2
val prosValue = line._2(1)._2
val consvalue = line._2(2)._2
Person(name, keyValue, prosValue, consvalue)
}).collect().toList.asJava
import com.google.gson.GsonBuilder
val gson = new GsonBuilder().create
val factors = Factors(persons)
val targetJsonStr = gson.toJson(factors)
println(targetJsonStr)
target.json:
{
"factors": [
{
"name": "a",
"key": "1",
"pros": "2",
"cons": "3"
},
{
"name": "b",
"key": "4",
"pros": "5",
"cons": "6"
},
{
"name": "c",
"key": "7",
"pros": "8",
"cons": "9"
}
]
}
You can put the above code into the test method below and run it to see the result you want.
@Test
def readAndSaveJsonTest(): Unit = {
  // paste the code above here
}
Hope it can help you.

Spark returns an array of null values when selecting sub-entities

I have an entity:
{
"id": "123",
"col_1": null,
"sub_entities": [
{ "sub_entity_id": "s-1", "col_2": null },
{ "sub_entity_id": "s-2", "col_2": null }
]
}
and I loaded it into Spark: val entities = spark.read.json("...").
entities.filter(size($"sub_entities.col_2") === 0) returns nothing. The behavior seems weird because all the col_2 values are null, yet they are still counted.
I then tried selecting col_2 and noticed it returns an array of null values (2 null values in this case).
entities.select($"col_1", $"sub_entities.col_2").show(false)
+--------+------------------+
|col_1 |sub_entities.col_2|
+--------+------------------+
|null |[,] |
+--------+------------------+
How to write a query that returns only objects from the array where col_2 is not null?
To query array objects, we first need to flatten out the array using the explode function and then query the DataFrame.
Example:
import org.apache.spark.sql.functions._

val df = spark.read.json(Seq("""{"id": "123","col_1": null,"sub_entities": [ { "sub_entity_id": "s-1", "col_2": null }, { "sub_entity_id": "s-2", "col_2": null }]}""").toDS)
df.selectExpr("explode(sub_entities)","*").select("col.*","id","col_1").show()
//+-----+-------------+---+-----+
//|col_2|sub_entity_id| id|col_1|
//+-----+-------------+---+-----+
//| null| s-1|123| null|
//| null| s-2|123| null|
//+-----+-------------+---+-----+
df.selectExpr("explode(sub_entities)","*").select("col.*","id","col_1").filter(col("col_2").isNull).show()
//+-----+-------------+---+-----+
//|col_2|sub_entity_id| id|col_1|
//+-----+-------------+---+-----+
//| null| s-1|123| null|
//| null| s-2|123| null|
//+-----+-------------+---+-----+
This filters only the col_2 array, as you mentioned. If you need different output when you do df.select($"col_1", $"sub_entities").show, I can update the answer:
val json =
"""
{
"id": "123",
"col_1": null,
"sub_entities": [
{ "sub_entity_id": "s-1", "col_2": null },
{ "sub_entity_id": "s-2", "col_2": null }
]
}
"""
val df = spark.read.json(Seq(json).toDS)
val removeNulls = udf((arr : Seq[String]) => arr.filter((x: String) => x != null))
df.select($"col_1", removeNulls($"sub_entities.col_2").as("sub_entities.col_2")).show(false)
+-----+------------------+
|col_1|sub_entities.col_2|
+-----+------------------+
|null |[] |
+-----+------------------+
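As an aside, on Spark 2.4+ the same null-filtering can be done without a UDF using the built-in filter higher-order function; a small sketch against the same df (not part of the original answer, column alias is mine):
import org.apache.spark.sql.functions.expr

// Drop entries whose col_2 is null inside the array of structs
df.select(
  $"col_1",
  expr("filter(sub_entities, x -> x.col_2 is not null)").alias("sub_entities_non_null")
).show(false)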

How to "where" based on the last StructType of a list

Suppose I have a DataFrame with a column named 'arr' that is a list of StructType, which can be described by the following JSON:
{
"otherAttribute": "blabla...",
"arr": [
{
"domain": "books",
"others": "blabla..."
},
{
"domain": "music",
"others": "blabla..."
}
]
}
{
"otherAttribute": "blabla...",
"arr": [
{
"domain": "music",
"others": "blabla..."
},
{
"domain": "furniture",
"others": "blabla..."
}
]
}
... ...
We want to keep records whose last struct in "arr" has its "domain" attribute equal to "music". In the above example, we need to keep the first record but discard the second. I need help writing such a "where" clause.
The answer is based on this data:
+---------------+----------------------------------------------+
|other_attribute|arr |
+---------------+----------------------------------------------+
|first |[[books, ...], [music, ...]] |
|second |[[books, ...], [music, ...], [furniture, ...]]|
|third |[[football, ...], [soccer, ...]] |
+---------------+----------------------------------------------+
arr here is an array of structs.
Each element of arr has attributes domain and others (filled with ... here).
DataFrame API approach (F is pyspark.sql.functions):
df.filter(
F.col("arr")[F.size(F.col("arr")) - 1]["domain"] == "music"
)
The SQL way:
SELECT
other_attribute,
arr
FROM df
WHERE arr[size(arr) - 1]['domain'] = 'music'
The output table will look like this:
+---------------+----------------------------+
|other_attribute|arr |
+---------------+----------------------------+
|first |[[books, ...], [music, ...]]|
+---------------+----------------------------+
Full code (suggested to run in the PySpark console):
import pyspark.sql.types as T
import pyspark.sql.functions as F
schema = T.StructType()\
.add("other_attribute", T.StringType())\
.add("arr", T.ArrayType(
T.StructType()
.add("domain", T.StringType())
.add("others", T.StringType())
)
)
df = spark.createDataFrame([
["first", [["books", "..."], ["music", "..."]]],
["second", [["books", "..."], ["music", "..."], ["furniture", "..."]]],
["third", [["football", "..."], ["soccer", "..."]]]
], schema)
filtered = df.filter(
F.col("arr")[F.size(F.col("arr")) - 1]["domain"] == "music"
)
filtered.show(100, False)
df.createOrReplaceTempView("df")
filtered_with_sql = spark.sql("""
SELECT
other_attribute,
arr
FROM df
WHERE arr[size(arr) - 1]['domain'] = 'music'
""")
filtered_with_sql.show(100, False)
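A note in passing: Spark 2.4+ also has element_at, which accepts a negative index, so the "last element" lookup can be written without size(); a Scala sketch of the same filter (hypothetical, not from the answer above, assuming a DataFrame df with the same schema):
import org.apache.spark.sql.functions.{col, element_at}

// element_at(arr, -1) returns the last element of the array (1-based; negative counts from the end)
df.filter(element_at(col("arr"), -1).getField("domain") === "music").show(false)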

How to filter by elements in array field in JSON format?

I have a data set with one of the fields containing an array, as below:
{ "name" : "James", "subjects" : [ "english", "french", "botany" ] },
{ "name" : "neo", "subjects" : [ "english", "physics" ] },
{ "name" : "john", "subjects" : [ "spanish", "mathematics" ] }
Now I want to filter using the Dataset.filter function by passing a Column object. I tried the isin function of Column and the array_contains function from functions, but they did not work.
Is there a way to create a Column object that will filter the dataset where an array field contains one of the given values?
There are multiple ways to do this, once you've imported the implicit encoders:
import sparkSession.implicits._
First, you can turn your DataFrame, which is a DataSet[Row], into a strongly typed DataSet[Student], which allows you to use familiar (at least if you know Scala) Scala idioms:
case class Student(name: String, subjects: Seq[String])
sparkSession.read.json("my.json")
.as[Student]
.filter(_.subjects.contains("english"))
You can also use a pure-Column based approach in your DataFrame with array_contains from the helpful Spark functions library:
sparkSession.read.json("my.json").filter(array_contains($"subjects", "english"))
Finally, although it may not be helpful to you here, keep in mind that you can also use explode from the same functions library to give each subject its own row in the column:
sparkSession.read.json("my.json")
.select($"name", explode($"subjects").as("subjects"))
.filter($"subjects" === "english")
Spark SQL's DataFrameReader supports so-called JSON Lines text format (aka newline-delimited JSON) where:
Each Line is a Valid JSON Value
You can use json operator to read the dataset.
// on command line
$ cat subjects.jsonl
{ "name" : "James", "subjects" : [ "english", "french", "botany" ] }
{ "name" : "neo", "subjects" : [ "english", "physics" ] }
{ "name" : "john", "subjects" : [ "spanish", "mathematics" ] }
// in spark-shell
scala> val subjects = spark.read.json("subjects.jsonl")
subjects: org.apache.spark.sql.DataFrame = [name: string, subjects: array<string>]
scala> subjects.show(truncate = false)
+-----+-------------------------+
|name |subjects |
+-----+-------------------------+
|James|[english, french, botany]|
|neo |[english, physics] |
|john |[spanish, mathematics] |
+-----+-------------------------+
scala> subjects.printSchema
root
|-- name: string (nullable = true)
|-- subjects: array (nullable = true)
| |-- element: string (containsNull = true)
With that, you should have a look at functions library when you can find Collection functions that deal with array-based inputs, e.g. array_contains or explode.
That's what you can find in the answer from @Vidya.
What is missing is my beloved Dataset.flatMap that, given the subjects Dataset, could be used as follows:
scala> subjects
.as[(String, Seq[String])] // convert to Dataset[(String, Seq[String])] for more type-safety
.flatMap { case (student, subjects) => subjects.map(s => (student, s)) } // typed expand
.filter(_._2.toLowerCase == "english") // filter out non-english subjects
.show
+-----+-------+
| _1| _2|
+-----+-------+
|James|english|
| neo|english|
+-----+-------+
That, however, doesn't look as nice as its for-comprehension version.
val subjectsDF = subjects.as[(String, Seq[String])]
val englishStudents = for {
(student, ss) <- subjectsDF // flatMap
subject <- ss // map
if subject.toLowerCase == "english"
} yield (student, subject)
scala> englishStudents.show
+-----+-------+
| _1| _2|
+-----+-------+
|James|english|
| neo|english|
+-----+-------+
Moreover, as of Spark 2.2 (soon to be released), you've got the DataFrameReader.json operator that you can use to read a Dataset[String].
scala> spark.version
res0: String = 2.3.0-SNAPSHOT
import org.apache.spark.sql.Dataset
val subjects: Dataset[String] = Seq(
"""{ "name" : "James", "subjects" : [ "english", "french", "botany" ] }""",
"""{ "name" : "neo", "subjects" : [ "english", "physics" ] }""",
"""{ "name" : "john", "subjects" : [ "spanish", "mathematics" ]}""").toDS
scala> spark.read.option("inferSchema", true).json(subjects).show(truncate = false)
+-----+-------------------------+
|name |subjects |
+-----+-------------------------+
|James|[english, french, botany]|
|neo |[english, physics] |
|john |[spanish, mathematics] |
+-----+-------------------------+
As per my understanding, you are trying to find the records in the DataFrame whose array column contains a particular string. For example, in this case, you are trying to find the records which contain the particular subject, say "english".
Let's first create a sample DataFrame:
import org.apache.spark.sql.functions._
val json_data = """[{ "name" : "James", "subjects" : [ "english", "french", "botany" ] },
{ "name" : "neo", "subjects" : [ "english", "physics" ] },
{ "name" : "john", "subjects" : [ "spanish", "mathematics" ] }]"""
val df = spark.read.json(Seq(json_data).toDS).toDF
Now let's try to find the records which contain a given subject, say "english". Here we can use the array_contains function from the Spark SQL functions library.
df.filter(array_contains($"subjects", "english")).show(truncate=false)
// Output
+-----+-------------------------+
|name |subjects                 |
+-----+-------------------------+
|James|[english, french, botany]|
|neo  |[english, physics]       |
+-----+-------------------------+
You can find more details about these functions in the Spark SQL functions API documentation (Scala and Python).
I hope this helps.
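For completeness, on Spark 2.4+ the exists higher-order function allows a predicate-based match, for example a case-insensitive one; a small Scala sketch against the same df (not from the answers above):
import org.apache.spark.sql.functions.expr

// Keep rows where at least one subject matches, ignoring case
df.filter(expr("exists(subjects, s -> lower(s) = 'english')")).show(truncate = false)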
