Spark inner join results in empty records - apache-spark

I'm performing an inner join between dataframes to only keep the sales for specific days:
val days_df = ss.createDataFrame(days_array.map(Tuple1(_))).toDF("DAY_ID")
val filtered_sales = sales.join(days_df, Seq("DAY_ID"))
filtered_sales.show()
This results in an empty filtered_sales dataframe (0 records), even though both DAY_ID columns have the same type (string):
root
|-- DAY_ID: string (nullable = true)
root
|-- SKU: string (nullable = true)
|-- DAY_ID: string (nullable = true)
|-- STORE_ID: string (nullable = true)
|-- SALES_UNIT: integer (nullable = true)
|-- SALES_REVENUE: decimal(20,5) (nullable = true)
The sales df is populated from a 20GB file.
The same code works fine with a small file of a few KB: the join succeeds and I can see the results. The empty result dataframe occurs only with the bigger dataset.
If I change the code and use the following one, it works fine even with the 20GB sales file:
sales.filter(sales("DAY_ID").isin(days_array:_*))
.show()
What is wrong with the inner join?

Try broadcasting days_df and then applying the inner join. Since days_df is tiny compared to the sales table, a broadcast join will help.
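For reference, here is a minimal sketch of that suggestion, assuming the same sales and days_df dataframes as in the question:
// Broadcasting the tiny days_df hints Spark to ship it to every executor,
// so the inner join avoids a full shuffle of the 20GB sales data.
import org.apache.spark.sql.functions.broadcast
val filtered_sales = sales.join(broadcast(days_df), Seq("DAY_ID"))
filtered_sales.show()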

Related

Endless execution with spark udf

I want to get the country from the latitude and longitude, so I used geopy and created a sample dataframe:
data = [{"latitude": -23.558111, "longitude": -46.64439},
{"latitude": 41.877445, "longitude": -87.723846},
{"latitude": 29.986801, "longitude": -90.166314}
]
then created a UDF:
from geopy.geocoders import Nominatim
from pyspark.sql import functions as F

@F.udf("string")
def city_state_country(lat, lng):
    # Reverse-geocode the coordinates and return the country name.
    geolocator = Nominatim(user_agent="geoap")
    coord = f"{lat},{lng}"
    location = geolocator.reverse(coord, exactly_one=True)
    address = location.raw['address']
    country = address.get('country', '')
    return country
and it works; this is the result:
df2 = df.withColumn("contr",city_state_country("latitude","longitude"))
+----------+----------+-------------+
| latitude| longitude| contr|
+----------+----------+-------------+
|-23.558111| -46.64439| Brasil|
| 41.877445|-87.723846|United States|
| 29.986801|-90.166314|United States|
+----------+----------+-------------+
But when I want to use my own data, which has this schema:
root
|-- id: integer (nullable = true)
|-- open_time: string (nullable = true)
|-- starting_lng: float (nullable = true)
|-- starting_lat: float (nullable = true)
|-- user_id: string (nullable = true)
|-- date: string (nullable = true)
|-- lat/long: string (nullable = false)
and 4 million rows, I use limit and select:
df_open_app3= df_open_app2.select("starting_lng","starting_lat").limit(10)
Finally, I use the same UDF:
df_open_app4= df_open_app3.withColumn('con', city_state_country("starting_lat","starting_lng"))
The problem is that when I execute a display, the process never ends. I don't know why; in theory it should process only 10 rows.
I tried a similar scenario in my environment and it works fine with around a million records: I created a sample UDF and a dataframe with about a million records, selected the relevant columns, and executed the function on them.
As Derek O suggested, try using .cache() when creating the dataframe so you don't have to re-read it and can reuse the cached dataframe; this matters when you have billions of records. Since actions trigger the transformations and display is the first action here, it triggers the execution of all the dataframe creations above it, which may be what causes the apparently abnormal behavior.
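A minimal sketch of that suggestion, reusing the dataframe and column names from the question (the exact placement of .cache() is an assumption):
# Cache the source dataframe so the limited sample is not recomputed
# from the full 4-million-row input on every action.
df_open_app2 = df_open_app2.cache()
df_open_app3 = df_open_app2.select("starting_lng", "starting_lat").limit(10)
df_open_app4 = df_open_app3.withColumn("con", city_state_country("starting_lat", "starting_lng"))
df_open_app4.show()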

Loading Parquet Files with Different Column Ordering

I have two Parquet directories that are being loaded into Spark. They have all the same columns, but the columns are in a different order.
val df = spark.read.parquet("url1").printSchema()
root
|-- a: string (nullable = true)
|-- b: string (nullable = true)
val df = spark.read.parquet("url2").printSchema()
root
|-- b: string (nullable = true)
|-- a: string (nullable = true)
val urls = Array("url1", "url2")
val df = spark.read.parquet(urls: _*).printSchema()
root
|-- a: string (nullable = true)
|-- b: string (nullable = true)
When I load the files together they always seem to take on the ordering of url1. I'm worried that having the parquet files in url1 and url2 saved in a different order will have unintended consequences, such as a and b swapping values. Can someone explain how parquet loads columns stored in a different order, with links to official documentation, if possible?
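Not an authoritative answer, but one way to take ordering out of the equation is to pin the column order yourself after reading. This sketch assumes Spark resolves parquet columns by name, which is why a and b keep their values rather than swapping:
val urls = Array("url1", "url2")
// Selecting the columns explicitly makes the resulting schema order
// independent of which directory Spark happens to read first.
val df = spark.read.parquet(urls: _*).select("a", "b")
df.printSchema()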

pyspark with hive - can't properly create with partition and save a table from a dataframe

I'm trying to convert JSON files to parquet with very few transformations (adding a date column), but I then need to partition this data before saving it as parquet.
I'm hitting a wall here.
Here is the creation process of the table:
df_temp = spark.read.json(data_location) \
    .filter(cond3)
df_temp = df_temp.withColumn("date", fn.to_date(fn.lit(today.strftime("%Y-%m-%d"))))
df_temp.createOrReplaceTempView("{}_tmp".format("duration_small"))
spark.sql("CREATE TABLE IF NOT EXISTS {1} LIKE {0}_tmp LOCATION '{2}/{1}'".format("duration_small","duration", warehouse_location))
spark.sql("DESC {}".format("duration"))
Then, regarding the save step of the conversion:
df_final.write.mode("append").format("parquet").partitionBy("customer_id", "date").saveAsTable('duration')
but this generates the following error:
pyspark.sql.utils.AnalysisException: '\nSpecified partitioning does not match that of the existing table default.duration.\nSpecified partition columns: [customer_id, date]\nExisting partition columns: []\n ;'
the schema being:
root
|-- action_id: string (nullable = true)
|-- customer_id: string (nullable = true)
|-- duration: long (nullable = true)
|-- initial_value: string (nullable = true)
|-- item_class: string (nullable = true)
|-- set_value: string (nullable = true)
|-- start_time: string (nullable = true)
|-- stop_time: string (nullable = true)
|-- undo_event: string (nullable = true)
|-- year: integer (nullable = true)
|-- month: integer (nullable = true)
|-- day: integer (nullable = true)
|-- date: date (nullable = true)
Thus I tried to change the create table to:
spark.sql("CREATE TABLE IF NOT EXISTS {1} LIKE {0}_tmp PARTITIONED BY (customer_id, date) LOCATION '{2}/{1}'".format("duration_small","duration", warehouse_location))
But this creates an error like:
...mismatched input 'PARTITIONED' expecting ...
So I discovered that PARTITIONED BY doesn't work with LIKE, but I'm running out of ideas.
If I use USING instead of LIKE, I get the error:
pyspark.sql.utils.AnalysisException: 'It is not allowed to specify partition columns when the table schema is not defined. When the table schema is not provided, schema and partition columns will be inferred.;'
How am I supposed to add a partition when creating the table?
PS: once the schema of the table is defined with the partitions, I want to simply use:
df_final.write.format("parquet").insertInto('duration')
I finally figured out how to do it with Spark.
df_temp = spark.read.json(...)
df_temp.createOrReplaceTempView("{}_tmp".format("duration_small"))
spark.sql("""
CREATE TABLE IF NOT EXISTS {1}
USING PARQUET
PARTITIONED BY (customer_id, date)
LOCATION '{2}/{1}' AS SELECT * FROM {0}_tmp
""".format("duration_small","duration", warehouse_location))
spark.sql("DESC {}".format("duration"))
df_temp.write.mode("append").partitionBy("customer_id", "date").saveAsTable('duration')
I don't know why, but I can't use insertInto here: it picks up a weird customer_id out of nowhere and doesn't append the different dates.
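A hedged note on the insertInto issue: DataFrameWriter.insertInto resolves columns by position, not by name, and in the created table the partition columns (customer_id, date) sit at the end of the schema, so the dataframe usually has to be reordered to the table's column order first. A minimal sketch:
# insertInto is positional: align the dataframe with the table's column order
# (partition columns last) before inserting.
table_cols = spark.table("duration").columns
df_final.select(*table_cols).write.insertInto("duration")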

Is there any better way to convert Array<int> to Array<String> in pyspark

I have a very large DataFrame with this schema:
root
|-- id: string (nullable = true)
|-- ext: array (nullable = true)
| |-- element: integer (containsNull = true)
So far I have tried exploding the data, then collect_list:
select
id,
collect_list(cast(item as string))
from default.dual
lateral view explode(ext) t as item
group by
id
But this approach is too expensive.
You can simply cast the ext column to a string array:
df = source.withColumn("ext", source.ext.cast("array<string>"))
df.printSchema()
df.show()
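Since the original attempt was in SQL, the same cast also works there directly; a sketch against the default.dual table from the question:
# Cast the array<int> column to array<string> in Spark SQL, no explode needed.
df = spark.sql("select id, cast(ext as array<string>) as ext from default.dual")
df.printSchema()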

Multiple aggregations on nested structure in a single Spark statement

I have a json structure like this:
{
  "a": 5,
  "b": 10,
  "c": {
    "c1": 3,
    "c4": 5
  }
}
I have a dataframe created from this structure with several million rows. What I need are aggregations over several keys, like this:
df.agg(count($"b") as "cntB", sum($"c.c4") as "sumC")
Am I just missing the syntax? Or is there a different way to do it? Most importantly, Spark should only scan the data once for all the aggregations.
It is possible, but your JSON must be on one line: each line is a new JSON object.
val json = sc.parallelize(
"{\"a\":5,\"b\":10,\"c\":{\"c1\": 3,\"c4\": 5}}" :: Nil)
val jsons = sqlContext.read.json(json)
jsons.agg(count($"b") as "cntB", sum($"c.c4") as "sumC").show
This works fine; note that the JSON is formatted to be on one line.
jsons.printSchema() prints:
root
|-- a: long (nullable = true)
|-- b: long (nullable = true)
|-- c: struct (nullable = true)
| |-- c1: long (nullable = true)
| |-- c4: long (nullable = true)
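A hedged aside: if the source file is pretty-printed JSON rather than one object per line, Spark 2.2+ can read it with the multiLine option instead of reformatting the data; the file path below is a placeholder:
import org.apache.spark.sql.functions.{count, sum}
import spark.implicits._
// multiLine allows a single JSON object to span several lines.
val jsons = spark.read.option("multiLine", true).json("/path/to/input.json")
jsons.agg(count($"b") as "cntB", sum($"c.c4") as "sumC").show()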
