I have a series of expressions used to map raw JSON data to normalized column data. I'm trying to think of a way to efficiently apply this to every row as there are multiple schemas to consider.
Right now, I have one massive CASE statement (built dynamically) that gets interpreted to SQL like this:
SELECT
CASE
WHEN schema = 'A' THEN CONCAT(get_json_object(payload, '$.FirstName'), ' ', get_json_object(payload, '$.LastName'))
WHEN schema = 'B' THEN get_json_object(payload, '$.Name')
END as name,
CASE
WHEN schema = 'A' THEN get_json_object(payload, '$.Telephone')
WHEN schema = 'B' THEN get_json_object(payload, '$.PhoneNumber')
END as phone_number
This works, I just worry about performance as the number of schemas and columns increases. I want to see if there's another way and here is my idea.
I have a DataFrame expressions_df of valid SparkSQL expressions.
| schema | column       | column_expression |
| ------ | ------------ | ----------------- |
| A      | name         | CONCAT(get_json_object(payload, '$.FirstName'), ' ', get_json_object(payload, '$.LastName')) |
| A      | phone_number | get_json_object(payload, '$.Telephone') |
| B      | name         | get_json_object(payload, '$.Name') |
| B      | phone_number | get_json_object(payload, '$.PhoneNumber') |
This DataFrame is used as a lookup table of sorts against a DataFrame raw_df:
| schema | payload |
| ------ | ------- |
| A      | {"FirstName": "John", "LastName": "Doe", "Telephone": "123-456-7890"} |
| B      | {"Name": "Jane Doe", "PhoneNumber": "123-567-1234"} |
I'd like to do something like this where column_expression is passed to F.expr and used to interpret the SQL and return the appropriate value.
from pyspark.sql import functions as F
(
raw_df
.join(expressions_df, 'schema')
.select(
F.expr(column_expression)
)
.dropDuplicates()
)
The desired end result would be something like this so that no matter what the original schema is, the data is transformed to the same standard using the expressions as shown in the SQL or expressions_df.
| name | phone_number |
| -------- | ------------ |
| John Doe | 123-456-7890 |
| Jane Doe | 123-567-1234 |
You can't use a DataFrame column value directly as an expression with the expr function. You'll have to collect all the expressions into a Python object so that you can pass them as parameters to expr.
Here's one way to do it where the expressions are collected into a dict then for each schema we apply a different select expression. Finally, union all the dataframes to get the desired output:
from collections import defaultdict
from functools import reduce
import pyspark.sql.functions as F
from pyspark.sql import DataFrame
# Collect the expressions to the driver, grouped by schema
exprs = defaultdict(list)
for r in expressions_df.collect():
    exprs[r.schema].append(F.expr(r.column_expression).alias(r.column))
# Apply each schema's expressions to its own rows, then union the results
schemas = [r.schema for r in raw_df.select("schema").distinct().collect()]
final_df = reduce(DataFrame.union, [raw_df.filter(f"schema='{s}'").select(*exprs[s]) for s in schemas])
final_df.show()
#+--------+------------+
#| name|phone_number|
#+--------+------------+
#|Jane Doe|123-567-1234|
#|John Doe|123-456-7890|
#+--------+------------+
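The grouping step above can also be sketched without Spark, which makes it easy to check what would be sent to each per-schema select. This is a minimal sketch: the `rows` list is a hypothetical stand-in for `expressions_df.collect()`, and `build_select_exprs` is a helper name introduced here for illustration only.

```python
from collections import defaultdict

# Hypothetical stand-in for expressions_df.collect():
# one (schema, column, column_expression) tuple per row.
rows = [
    ("A", "name", "CONCAT(get_json_object(payload, '$.FirstName'), ' ', get_json_object(payload, '$.LastName'))"),
    ("A", "phone_number", "get_json_object(payload, '$.Telephone')"),
    ("B", "name", "get_json_object(payload, '$.Name')"),
    ("B", "phone_number", "get_json_object(payload, '$.PhoneNumber')"),
]


def build_select_exprs(rows):
    """Group 'expression AS column' strings by schema, preserving row order."""
    exprs = defaultdict(list)
    for schema, column, expression in rows:
        exprs[schema].append(f"{expression} AS {column}")
    return dict(exprs)


select_exprs = build_select_exprs(rows)
for schema, exprs in select_exprs.items():
    print(schema, exprs)
```

Assuming the same raw_df as above, each list could then be passed to something like raw_df.filter(...).selectExpr(*select_exprs[s]) in place of the F.expr(...).alias(...) columns.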
I asked a similar question before, but unfortunately I now have to reimplement the solution in PySpark.
For example,
app col1
app1 anybody love me?
app2 I hate u
app3 this hat is good
app4 I don't like this one
app5 oh my god
app6 damn you.
app7 such nice girl
app8 xxxxx
app9 pretty prefect
app10 don't love me.
app11 xxx anybody?
I want to match against a keyword list like ['anybody', 'love', 'you', 'xxx', "don't"] and return the matched keywords in a new column named keyword, as follows:
app keyword
app1 [anybody, love]
app4 [don't]
app6 [you]
app8 [xxx]
app10 [don't, love]
app11 [xxx]
Following the accepted answer, the approach that worked for me was to create a temporary table from the list of keyword strings and inner join it with the main table,
then select the app and keyword of the rows that satisfy the join condition.
-- Hiveql implementation
select t.app, k.keyword
from mytable t
inner join (values ('anybody'), ('you'), ('xxx'), ('don''t')) as k(keyword)
on t.col1 like concat('%', k.keyword, '%')
But I am not familiar with PySpark and am finding it awkward to reimplement this.
Could anyone help me?
Thanks in advance.
Please find below two possible approaches:
Option 1
The first option is to use the dataframe API to implement the analogous join as in your previous question. Here we convert the keywords list into a dataframe and then join it with the large dataframe (notice that we broadcast the small dataframe to ensure better performance):
from pyspark.sql.functions import broadcast
df = spark.createDataFrame([
["app1", "anybody love me?"],
["app4", "I don't like this one"],
["app5", "oh my god"],
["app6", "damn you."],
["app7", "such nice girl"],
["app8", "xxxxx"],
["app10", "don't love me."]
]).toDF("app", "col1")
# create keywords dataframe
keywords = ["xxx", "don't"]
kdf = spark.createDataFrame([(k,) for k in keywords], "key string")
# +-----+
# | key|
# +-----+
# | xxx|
# |don't|
# +-----+
df.join(broadcast(kdf), df["col1"].contains(kdf["key"]), "inner")
# +-----+---------------------+-----+
# |app |col1 |key |
# +-----+---------------------+-----+
# |app4 |I don't like this one|don't|
# |app8 |xxxxx |xxx |
# |app10|don't love me. |don't|
# +-----+---------------------+-----+
The join condition is based on contains function of the Column class.
Option 2
You can also use the Spark SQL higher-order function filter in combination with rlike within an expr:
from pyspark.sql.functions import lit, expr, array
df = spark.createDataFrame([
["app1", "anybody love me?"],
["app4", "I don't like this one"],
["app5", "oh my god"],
["app6", "damn you."],
["app7", "such nice girl"],
["app8", "xxxxx"],
["app10", "don't love me."]
]).toDF("app", "col1")
keywords = ["xxx", "don't"]
df.withColumn("keywords", array([lit(k) for k in keywords])) \
.withColumn("keywords", expr("filter(keywords, k -> col1 rlike k)")) \
.where("size(keywords) > 0") \
.show(10, False)
# +-----+---------------------+--------+
# |app |col1 |keywords|
# +-----+---------------------+--------+
# |app4 |I don't like this one|[don't] |
# |app8 |xxxxx |[xxx] |
# |app10|don't love me. |[don't] |
# +-----+---------------------+--------+
Explanation
First, with array([lit(k) for k in keywords]) we generate an array containing the keywords that the search is based on, and we append it to the existing dataframe using withColumn.
Next, with expr("filter(keywords, k -> col1 rlike k)") we go through the items of keywords and keep only those that are present in the col1 text. If any keyword matches, filter returns one or more items, so size(keywords) is greater than 0, which is the where condition used to retrieve the matching records. Note that rlike interprets each keyword as a regular expression.
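The filter-then-size logic described above can be illustrated in plain Python, without Spark. This is only an analogy under stated assumptions: `matched_keywords` is a hypothetical helper, the rows are the sample data from the question, and `re.escape` is used so that each keyword is matched literally (Spark's rlike would treat the keyword as a regular expression instead).

```python
import re

keywords = ["anybody", "love", "you", "xxx", "don't"]

# Sample rows from the question: (app, col1)
rows = [
    ("app1", "anybody love me?"),
    ("app2", "I hate u"),
    ("app4", "I don't like this one"),
    ("app6", "damn you."),
    ("app8", "xxxxx"),
    ("app10", "don't love me."),
]


def matched_keywords(text, keywords):
    """Return the keywords found in text, in keyword-list order."""
    return [k for k in keywords if re.search(re.escape(k), text)]


# Keep only the rows where at least one keyword matched,
# mirroring where("size(keywords) > 0").
result = []
for app, text in rows:
    found = matched_keywords(text, keywords)
    if found:
        result.append((app, found))

for app, found in result:
    print(app, found)
```

The matched keywords come back in the order of the keyword list, not in the order they appear in the text.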
I'd like to filter a dataframe using an external file.
This is how I use the filter now:
val Insert = Append_Ot.filter(
col("Name2").equalTo("brazil") ||
col("Name2").equalTo("france") ||
col("Name2").equalTo("algeria") ||
col("Name2").equalTo("tunisia") ||
col("Name2").equalTo("egypte"))
Instead of using hardcoded string literals, I'd like to create an external file with the values to filter by.
So I created the file and read it like this:
val filter_numfile = sc.textFile("/user/zh/worskspace/filter_nmb.txt")
.map(_.split(" ")(1))
.collect
This gives me:
filter_numfile: Array[String] = Array(brazil, france, algeria, tunisia, egypte)
And then, I use isin function on Name2 column.
val Insert = Append_Ot.where($"Name2".isin(filter_numfile: _*))
But this gives me an empty dataframe. Why?
I am just adding some information to philantrovert's answer to "filter dataframe from external file".
His answer is perfect, but the values might differ in case, so you will have to handle case mismatches as well.
tl;dr Make sure that the letters use consistent case, i.e. they are all in upper or lower case. Simply use upper or lower standard functions.
Let's say you have an input file like this:
1 Algeria
2 tunisia
3 brazil
4 Egypt
You read the text file and convert all the countries to lowercase:
val countries = sc.textFile("path to input file").map(_.split(" ")(1).trim)
.collect.toSeq
val array = Array(countries.map(_.toLowerCase) : _*)
Then you have your dataframe:
val Append_Ot = sc.parallelize(Seq(("brazil"),("tunisia"),("algeria"),("name"))).toDF("Name2")
to which you apply the following condition:
import org.apache.spark.sql.functions._
val Insert = Append_Ot.where(lower($"Name2").isin(array : _* ))
You should get the following output:
+-------+
|Name2 |
+-------+
|brazil |
|tunisia|
|algeria|
+-------+
The empty dataframe might be due to spelling mismatch too.
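The normalization idea can be checked in plain Python before running it on a cluster. This is a minimal sketch rather than Spark code: the file contents are inlined as a list, and `allowed` and `kept` are names introduced here for illustration.

```python
# Lines as they would appear in the input file from the answer above.
lines = ["1 Algeria", "2 tunisia", "3 brazil", "4 Egypt"]

# Take the second field, strip stray whitespace, and lowercase it,
# mirroring split(" ")(1).trim followed by toLowerCase.
allowed = {line.split(" ")[1].strip().lower() for line in lines}

# Values of the Name2 column from the answer's example dataframe.
names = ["brazil", "tunisia", "algeria", "name"]

# Lowercase both sides before comparing, mirroring lower($"Name2").isin(...).
kept = [n for n in names if n.strip().lower() in allowed]
print(kept)
```

Here "Algeria" in the file matches "algeria" in the data only because both sides are lowercased. A spelling difference such as "egypte" versus "Egypt" would still not match.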
I have some json data where one of the elements is an array. Here is a sample dataset:
{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, {"sname":"berkeley", "year":2012}, {"sname":"mit", "year":2016}]}
{"name":"Andy", "schools":[{"sname":"ucsb", "year":2011}, {"sname":"ucsd", "year":2015}]}
I want to use name as key and for a given name, I want to combine all the school names in the order they are present in the array.
Here is the desired output:
Michael "stanford berkeley mit"
Andy "ucsb ucsd"
Here is my code:
val people = sqlContext.read.json("test.json")
val flattened = people.select($"name", explode($"schools").as("schools_flat"))
val schools = flattened.select("name", "schools_flat.sname")
scala> schools.show()
+-------+--------+
| name| sname|
+-------+--------+
|Michael|stanford|
|Michael|berkeley|
|Michael| mit|
+-------+--------+
Unfortunately, when I group this by key, I am not sure whether the order will be retained (most likely not). I don't want the school names for Michael to be reordered; they should appear in the same order as in the original JSON array. Any help with this would be great.
Why explode and group instead of select?
people.select("name", "schools.sname")
It will preserve order as you want.
The following code does what was asked in the question.
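import org.apache.spark.sql.functions.{col, udf}
val people = sqlContext.read.json("test.json")
// Selecting the nested field yields an array of school names per person
val test = people.select("name", "schools.sname")
// Join the array elements into one space-separated string
val getConcatenated = udf( (first: Seq[String]) => { first.mkString(" ") } )
val test_cat = test.withColumn("sname_concat", getConcatenated(col("sname"))).select("name", "sname_concat")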
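As a plain-Python illustration of the same idea (no explode, no grouping, just joining each record's array in place), the sketch below parses the two sample records with the standard json module. It is an analogy rather than Spark code, and the `combined` name is introduced here for illustration.

```python
import json

# The two sample records from the question, one JSON document per line.
lines = [
    '{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, '
    '{"sname":"berkeley", "year":2012}, {"sname":"mit", "year":2016}]}',
    '{"name":"Andy", "schools":[{"sname":"ucsb", "year":2011}, '
    '{"sname":"ucsd", "year":2015}]}',
]

combined = []
for line in lines:
    record = json.loads(line)
    # The array is read in document order, so joining it keeps that order.
    snames = " ".join(school["sname"] for school in record["schools"])
    combined.append((record["name"], snames))

for name, snames in combined:
    print(name, snames)
```

Because each record's array is never split apart and regrouped, there is no step at which the original order could be lost.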
I have a Spark DataFrame and I would like to group the elements by a key and get the results as a sorted list.
Currently I am using:
df.groupBy("columnA").agg(collect_list("columnB"))
How do I sort the items in the list in ascending order?
You could try the function sort_array available in the functions package:
import org.apache.spark.sql.functions._
df.groupBy("columnA").agg(sort_array(collect_list("columnB")))
Just wanted to add another hint to the answer of Daniel de Paula regarding sort_array solution.
If you want to sort elements according to a different column, you can form a struct of two fields:
the sort by field
the result field
Since structs are sorted field by field, you'll get the order you want. All that remains is to drop the sort-by field from each element of the resulting list.
The same approach can be applied with several sort by columns when needed.
Here's an example that can be run in local spark-shell (use :paste mode):
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import spark.implicits._
case class Employee(name: String, department: String, salary: Double)
val employees = Seq(
Employee("JSMITH", "A", 20.0),
Employee("AJOHNSON", "A", 650.0),
Employee("CBAKER", "A", 650.2),
Employee("TGREEN", "A", 13.0),
Employee("CHORTON", "B", 111.0),
Employee("AIVANOV", "B", 233.0),
Employee("VSMIRNOV", "B", 11.0)
)
val employeesDF = spark.createDataFrame(employees)
val getNames = udf { salaryNames: Seq[Row] =>
salaryNames.map { case Row(_: Double, name: String) => name }
}
employeesDF
.groupBy($"department")
.agg(collect_list(struct($"salary", $"name")).as("salaryNames"))
.withColumn("namesSortedBySalary", getNames(sort_array($"salaryNames", asc = false)))
.show(truncate = false)
The result:
+----------+--------------------------------------------------------------------+----------------------------------+
|department|salaryNames |namesSortedBySalary |
+----------+--------------------------------------------------------------------+----------------------------------+
|B |[[111.0, CHORTON], [233.0, AIVANOV], [11.0, VSMIRNOV]] |[AIVANOV, CHORTON, VSMIRNOV] |
|A |[[20.0, JSMITH], [650.0, AJOHNSON], [650.2, CBAKER], [13.0, TGREEN]]|[CBAKER, AJOHNSON, JSMITH, TGREEN]|
+----------+--------------------------------------------------------------------+----------------------------------+
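The field-by-field ordering that this relies on can be illustrated in plain Python, where tuples compare the same way. The sketch below is an analogy rather than Spark code: it reuses the employee data from the example, and the `grouped` and `names_sorted` names are introduced here for illustration.

```python
from collections import defaultdict

# (name, department, salary), same data as the Scala example above.
employees = [
    ("JSMITH", "A", 20.0),
    ("AJOHNSON", "A", 650.0),
    ("CBAKER", "A", 650.2),
    ("TGREEN", "A", 13.0),
    ("CHORTON", "B", 111.0),
    ("AIVANOV", "B", 233.0),
    ("VSMIRNOV", "B", 11.0),
]

# Collect (salary, name) pairs per department; the sort key comes first.
grouped = defaultdict(list)
for name, department, salary in employees:
    grouped[department].append((salary, name))

# Tuples compare element by element, so sorting the pairs orders by salary
# first; afterwards the salary is dropped and only the name is kept.
names_sorted = {
    department: [name for _salary, name in sorted(pairs, reverse=True)]
    for department, pairs in grouped.items()
}

for department, names in sorted(names_sorted.items()):
    print(department, names)
```

Putting the sort key first in the tuple plays the same role as putting salary first in the struct.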
For example:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
sample = sqlContext.sql("select Name, age, city from user")
sample.show()
The above statement prints the entire table to the terminal. But I want to access each row of that table using a for or while loop to perform further calculations.
You simply cannot. DataFrames, like other distributed data structures, are not iterable and can be accessed only through dedicated higher-order functions and/or SQL methods.
You can of course collect
for row in df.rdd.collect():
    do_something(row)
or convert toLocalIterator
for row in df.rdd.toLocalIterator():
    do_something(row)
and iterate locally as shown above, but that defeats the purpose of using Spark.
To "loop" and take advantage of Spark's parallel computation framework, you could define a custom function and use map.
def customFunction(row):
    return (row.name, row.age, row.city)
sample2 = sample.rdd.map(customFunction)
or
sample2 = sample.rdd.map(lambda x: (x.name, x.age, x.city))
The custom function would then be applied to every row of the dataframe. Note that sample2 will be an RDD, not a dataframe.
map may be needed if you are going to perform more complex computations. If you just need to add a simple derived column, you can use withColumn, which returns a dataframe.
sample3 = sample.withColumn('age2', sample.age + 2)
Using list comprehensions in python, you can collect an entire column of values into a list using just two lines:
df = sqlContext.sql("show tables in default")
tableList = [x["tableName"] for x in df.rdd.collect()]
In the above example, we return a list of tables in database 'default', but the same can be adapted by replacing the query used in sql().
Or more abbreviated:
tableList = [x["tableName"] for x in sqlContext.sql("show tables in default").rdd.collect()]
And for your example of three columns, we can create a list of dictionaries, and then iterate through them in a for loop.
sql_text = "select name, age, city from user"
tupleList = [{name:x["name"], age:x["age"], city:x["city"]}
for x in sqlContext.sql(sql_text).rdd.collect()]
for row in tupleList:
    print("{} is a {} year old from {}".format(
        row["name"],
        row["age"],
        row["city"]))
It might not be the best practice, but you can simply target a specific column using collect(), export it as a list of Rows, and loop through the list.
Assume this is your df:
+----------+----------+-------------------+-----------+-----------+------------------+
| Date| New_Date| New_Timestamp|date_sub_10|date_add_10|time_diff_from_now|
+----------+----------+-------------------+-----------+-----------+------------------+
|2020-09-23|2020-09-23|2020-09-23 00:00:00| 2020-09-13| 2020-10-03| 51148 |
|2020-09-24|2020-09-24|2020-09-24 00:00:00| 2020-09-14| 2020-10-04| -35252 |
|2020-01-25|2020-01-25|2020-01-25 00:00:00| 2020-01-15| 2020-02-04| 20963548 |
|2020-01-11|2020-01-11|2020-01-11 00:00:00| 2020-01-01| 2020-01-21| 22173148 |
+----------+----------+-------------------+-----------+-----------+------------------+
To loop through the rows in the Date column:
rows = df.select('Date').collect()
final_list = []
for i in rows:
    # each element is a Row; index 0 is the Date value
    final_list.append(i[0])
print(final_list)
Try it like this:
result = spark.createDataFrame([('SpeciesId','int'), ('SpeciesName','string')],["col_name", "data_type"])
for f in result.collect():
    print(f.col_name)
If you want to do something to each row in a DataFrame object, use map on the underlying RDD (df.rdd.map). This will allow you to perform further calculations on each row. It's the equivalent of looping across the entire dataset from 0 to len(dataset)-1.
Note that this will return a PipelinedRDD, not a DataFrame.
In the answer above, the line
tupleList = [{name:x["name"], age:x["age"], city:x["city"]}
should be
tupleList = [{'name':x["name"], 'age':x["age"], 'city':x["city"]}
because name, age, and city are not variables but simply keys of the dictionary.
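Putting the correction together, a minimal sketch of the whole loop might look like the following. To keep it runnable without a Spark session, `collected` is a hypothetical stand-in for the result of sqlContext.sql(sql_text).rdd.collect(), using plain dicts with made-up values; real Row objects support the same x["name"] lookups.

```python
# Hypothetical stand-in for sqlContext.sql(sql_text).rdd.collect();
# the values are made up purely for illustration.
collected = [
    {"name": "Alice", "age": 30, "city": "Paris"},
    {"name": "Bob", "age": 25, "city": "Tunis"},
]

# Quoted keys: 'name', 'age' and 'city' are dictionary keys, not variables.
tupleList = [{'name': x["name"], 'age': x["age"], 'city': x["city"]}
             for x in collected]

messages = []
for row in tupleList:
    messages.append("{} is a {} year old from {}".format(
        row["name"],
        row["age"],
        row["city"]))

for message in messages:
    print(message)
```

With unquoted keys, Python would look up variables called name, age, and city and raise a NameError if they are not defined.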