Endless execution with spark udf - apache-spark

I want to get the country from latitude and longitude, so I used geopy and created a sample dataframe:
data = [{"latitude": -23.558111, "longitude": -46.64439},
{"latitude": 41.877445, "longitude": -87.723846},
{"latitude": 29.986801, "longitude": -90.166314}
]
then created a UDF:
from geopy.geocoders import Nominatim
import pyspark.sql.functions as F

@F.udf("string")
def city_state_country(lat, lng):
    # reverse-geocode the coordinates and return only the country name
    geolocator = Nominatim(user_agent="geoap")
    coord = f"{lat},{lng}"
    location = geolocator.reverse(coord, exactly_one=True)
    address = location.raw['address']
    country = address.get('country', '')
    return country
and it works; this is the result:
df2 = df.withColumn("contr",city_state_country("latitude","longitude"))
+----------+----------+-------------+
| latitude| longitude| contr|
+----------+----------+-------------+
|-23.558111| -46.64439| Brasil|
| 41.877445|-87.723846|United States|
| 29.986801|-90.166314|United States|
+----------+----------+-------------+
But when I want to use my own data, with the schema
root
|-- id: integer (nullable = true)
|-- open_time: string (nullable = true)
|-- starting_lng: float (nullable = true)
|-- starting_lat: float (nullable = true)
|-- user_id: string (nullable = true)
|-- date: string (nullable = true)
|-- lat/long: string (nullable = false)
and 4 million rows. So I use limit and select:
df_open_app3= df_open_app2.select("starting_lng","starting_lat").limit(10)
Finally, I use the same UDF:
df_open_app4= df_open_app3.withColumn('con', city_state_country("starting_lat","starting_lng"))
The problem is that when I execute a display, the process runs endlessly. I don't know why, since theoretically it should process only 10 rows.

I tried a similar scenario in my environment and it works perfectly fine with around a million records.
I created a sample UDF and a dataframe with around a million records, selected the particular columns, and executed the function on them.
As Derek O suggested, try using .cache() when creating the dataframe so that you don't need to re-read it and can reuse the cached dataframe when you have billions of records. Since an action triggers the transformations, and display is the first action, it triggers the execution of all the dataframe creations above it, which might be causing the abnormal behavior.
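A minimal sketch of that suggestion, reusing the dataframe and column names from the question (the names are the asker's; the .cache() placement is the assumption here):
# cache the source dataframe so it is materialized once and reused by later actions
df_open_app2 = df_open_app2.cache()
df_open_app3 = df_open_app2.select("starting_lng", "starting_lat").limit(10)
df_open_app4 = df_open_app3.withColumn("con", city_state_country("starting_lat", "starting_lng"))
df_open_app4.show()  # the first action triggers the work; later actions reuse the cached data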

Related

Spark - convert array of JSON Strings to Struct array, filter and concat with root

I am totally new to Spark and I'm writing a pipeline to perform some transformations on a list of audits.
Example of my data:
{
"id": 932522712299,
"ticket_id": 12,
"created_at": "2020-02-14T19:05:16Z",
"author_id": 392401450482,
"events": ["{\"id\": 932522713292, \"type\": \"VoiceComment\", \"public\": false, \"data\": {\"from\": \"11987654321\", \"to\": \"+1987644\"}"],
}
My schema is basically:
root
|-- id: long (nullable = true)
|-- ticket_id: long (nullable = true)
|-- created_at: string (nullable = true)
|-- author_id: long (nullable = true)
|-- events: array (nullable = true)
| |-- element: string (containsNull = true)
My transformations have a few steps:
Split events by type: comments, tags, change or update;
For each event found, I must add ticket_id, author_id and created_at from root;
It must have one output for each event type.
Basically, each object inside the events array is a JSON string, because each type has a different structure; the only attribute they all have in common is the type.
I have reached my goal by doing some terrible work: converting my dataframe to dicts using the following code:
import json

audits = list(map(lambda row: row.asDict(), df.collect()))
comments = []
for audit in audits:
    base_info = {'ticket_id': audit['ticket_id'], 'created_at': audit['created_at'], 'author_id': audit['author_id']}
    audit['events'] = [json.loads(x) for x in audit['events']]
    audit_comments = [
        {**x, **base_info}
        for x in audit['events']
        if x['type'] == "Comment" or x['type'] == "VoiceComment"
    ]
    comments.extend(audit_comments)
Maybe this question sounds lame or lazy, but I'm really stuck on simple things like:
how to parse the 'events' items into structs?
how to select events by type and add information from the root? Maybe using select syntax?
Any help is appreciated.
Since the events array elements don't have the same structure for all rows, what you can do is convert them to a Map(String, String).
Using the from_json function and the schema MapType(StringType(), StringType()):
df = df.withColumn("events", explode("events"))\
.withColumn("events", from_json(col("events"), MapType(StringType(), StringType())))
Then, using element_at (Spark 2.4+), you can get the type like this:
df = df.withColumn("event_type", element_at(col("events"), "type"))
df.printSchema()
#root
#|-- author_id: long (nullable = true)
#|-- created_at: string (nullable = true)
#|-- events: map (nullable = true)
#| |-- key: string
#| |-- value: string (valueContainsNull = true)
#|-- id: long (nullable = true)
#|-- ticket_id: long (nullable = true)
#|-- event_type: string (nullable = true)
Now, you can filter and select as normal columns:
df.filter(col("event_type") == lit("VoiceComment")).show(truncate=False)
#+------------+--------------------+-----------------------------------------------------------------------------------------------------------+------------+---------+------------+
#|author_id |created_at |events |id |ticket_id|event_type |
#+------------+--------------------+-----------------------------------------------------------------------------------------------------------+------------+---------+------------+
#|392401450482|2020-02-14T19:05:16Z|[id -> 932522713292, type -> VoiceComment, public -> false, data -> {"from":"11987654321","to":"+1987644"}]|932522712299|12 |VoiceComment|
#+------------+--------------------+-----------------------------------------------------------------------------------------------------------+------------+---------+------------+
Your code will load the complete events data onto the master (driver) node that submitted the job. The Spark way of processing data is to create a map-reduce job. There are multiple APIs for this; they build a DAG plan for the job, and the plan is only materialized when you call specific actions such as head or show.
A job like this will be distributed to all machines in the cluster.
When working with the dataframe API, a lot can be done with pyspark.sql.functions.
Below are the same transformations with the Spark SQL dataframe API:
import pyspark.sql.functions as F
df = df.withColumn('event', F.explode(df.events)).drop(df.events)
df = df.withColumn('event', F.from_json(df.event, 'STRUCT <id: INT, type: STRING, public: Boolean, data: STRUCT<from: STRING, to: STRING>>'))
events = df.where('event.type = "Comment" OR event.type == "VoiceComment"')
events.printSchema()
events.head(100)
When data cannot be processed with SQL expressions, you can implement a plain user-defined function (UDF) or a Pandas UDF.
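For example, a rough sketch of a pandas UDF in the same spirit (the function name and the normalisation rule are purely illustrative, not part of the original answer; the type-hint style requires Spark 3.x):
import pandas as pd
import pyspark.sql.functions as F

@F.pandas_udf('string')
def normalize_type(event_type: pd.Series) -> pd.Series:
    # runs batch-by-batch on the executors as vectorised pandas operations
    return event_type.str.replace('VoiceComment', 'Comment')

df = df.withColumn('normalized_type', normalize_type(F.col('event.type')))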

spark read orc with specific columns

I have an ORC file; when read with the option below, it reads all the columns.
val df = spark.read.orc("/some/path/")
df.printSchema
root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
|-- value: string (nullable = true)
|-- all: string (nullable = true)
|-- next: string (nullable = true)
|-- action: string (nullable = true)
but I want to read only two columns (id, name) from that file. Is there any way to read only those two columns while loading the ORC file?
is there any way to read only two columns (id,name) while loading orc file?
Yes, all you need is a subsequent select. Spark will take care of the rest for you:
val df = spark.read.orc("/some/path/").select("id", "name")
Spark has a lazy execution model, so you can do any data transformation in your code without any immediate effect. Only after an action call does Spark start doing the job, and Spark is smart enough not to do extra work.
So you can write it like this:
val inDF: DataFrame = spark.read.orc("/some/path/")
import spark.implicits._
val filteredDF: DataFrame = inDF.select($"id", $"name")
// any additional transformations
// real work starts after this action
val result: Array[Row] = filteredDF.collect()

Spark inner joins results in empty records

I'm performing an inner join between dataframes to only keep the sales for specific days:
val days_df = ss.createDataFrame(days_array.map(Tuple1(_))).toDF("DAY_ID")
val filtered_sales = sales.join(days_df, Seq("DAY_ID"))
filtered_sales.show()
This results in an empty filtered_sales dataframe (0 records); both DAY_ID columns have the same type (string).
root
|-- DAY_ID: string (nullable = true)
root
|-- SKU: string (nullable = true)
|-- DAY_ID: string (nullable = true)
|-- STORE_ID: string (nullable = true)
|-- SALES_UNIT: integer (nullable = true)
|-- SALES_REVENUE: decimal(20,5) (nullable = true)
The sales df is populated from a 20GB file.
Using the same code with a small file of a few KB, the join works fine and I can see the results. The empty result dataframe occurs only with the bigger dataset.
If I change the code and use the following one, it works fine even with the 20GB sales file:
sales.filter(sales("DAY_ID").isin(days_array:_*))
.show()
What is wrong with the inner join?
Try broadcasting days_df and then applying the inner join. As days_df is tiny compared to the other table, broadcasting will help.
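A short sketch of that broadcast-join idea, shown here in PySpark for illustration (the question itself is in Scala, where the equivalent broadcast function lives in org.apache.spark.sql.functions):
from pyspark.sql.functions import broadcast

# hint Spark to broadcast the small dataframe so the join is executed map-side
filtered_sales = sales.join(broadcast(days_df), ["DAY_ID"])
filtered_sales.show()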

Multiple aggregations on nested structure in a single Spark statement

I have a json structure like this:
{
"a":5,
"b":10,
"c":{
"c1": 3,
"c4": 5
}
}
I have a dataframe created from this structure with several million rows. What I need are aggregations over several keys, like this:
df.agg(count($"b") as "cntB", sum($"c.c4") as "sumC")
Am I just missing the syntax? Or is there a different way to do it? Most importantly, Spark should only scan the data once for all aggregations.
It is possible, but your JSON must be on one line.
Each line = a new JSON object.
val json = sc.parallelize(
"{\"a\":5,\"b\":10,\"c\":{\"c1\": 3,\"c4\": 5}}" :: Nil)
val jsons = sqlContext.read.json(json)
jsons.agg(count($"b") as "cntB", sum($"c.c4") as "sumC").show
It works fine; note that the JSON is formatted to be on one line.
jsons.printSchema() prints:
root
|-- a: long (nullable = true)
|-- b: long (nullable = true)
|-- c: struct (nullable = true)
| |-- c1: long (nullable = true)
| |-- c4: long (nullable = true)

PySpark : Use of aggregate function for complex query

Suppose you have a dataframe with 3 columns which are numeric - like:
>>> df.show()
+-----------+---------------+------------------+--------------+---------------------+
| IP| URL|num_rebuf_sessions|first_rebuffer|num_backward_sessions|
+-----------+---------------+------------------+--------------+---------------------+
|10.45.12.13| ww.tre.com/ada| 1261| 764| 2043|
|10.54.12.34|www.rwr.com/yuy| 1126| 295| 1376|
|10.44.23.09|www.453.kig/827| 2725| 678| 1036|
|10.23.43.14|www.res.poe/skh| 2438| 224| 1455|
|10.32.22.10|www.res.poe/skh| 3655| 157| 1838|
|10.45.12.13|www.453.kig/827| 7578| 63| 1754|
|10.45.12.13| ww.tre.com/ada| 3854| 448| 1224|
|10.34.22.10|www.rwr.com/yuy| 1029| 758| 1275|
|10.54.12.34| ww.tre.com/ada| 7341| 10| 856|
|10.34.22.10| ww.tre.com/ada| 4150| 455| 1372|
+-----------+---------------+------------------+--------------+---------------------+
With the schema being:
>>> df.printSchema()
root
|-- IP: string (nullable = true)
|-- URL: string (nullable = true)
|-- num_rebuf_sessions: long (nullable = false)
|-- first_rebuffer: long (nullable = false)
|-- num_backward_sessions: long (nullable = false)
Question
I am interested in computing a complex aggregation, like, say, (sum(num_rebuf_sessions) - sum(num_backward_sessions)) * 100 / sum(first_rebuffer).
How do I do it programmatically?
The query aggregation can be anything that is provided as an input (assume an XML file or a JSON file).
Note:
1. In the interpreter, I can run the complete statement, like:
>>> df.groupBy(keyList).agg((((func.sum('num_rebuf_sessions') - func.sum('first_rebuffer')) * 100)/func.sum('num_backward_sessions')).alias('Result')).show()
+-----------+---------------+------------------+
| IP| URL| Result|
+-----------+---------------+------------------+
|10.54.12.34|www.rwr.com/yuy|263.70753561548884|
|10.23.43.14|www.453.kig/827| 278.7099317601011|
|10.34.22.10| ww.tre.com/ada|187.53939800299088|
+-----------+---------------+------------------+
But programmatically, the agg call would take only a dict or a list of columns, which makes the above functionality tough to achieve.
So is pyspark.sql.context.SQLContext.sql the only option left? Or am I missing something obvious?
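One possible approach (not from the original thread, just a sketch) is to keep the whole aggregation as a string, e.g. loaded from the JSON/XML input, and hand it to pyspark.sql.functions.expr, which accepts full Spark SQL syntax:
import pyspark.sql.functions as F

# agg_expr could just as well come from a JSON/XML config file
agg_expr = "(sum(num_rebuf_sessions) - sum(first_rebuffer)) * 100 / sum(num_backward_sessions)"
key_list = ["IP", "URL"]

result = df.groupBy(key_list).agg(F.expr(agg_expr).alias("Result"))
result.show()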
