Multiple aggregations on nested structure in a single Spark statement - apache-spark

I have a json structure like this:
{
  "a": 5,
  "b": 10,
  "c": {
    "c1": 3,
    "c4": 5
  }
}
I have a dataframe created from this structure with several million rows. What I need are aggregations on several keys, like this:
df.agg(count($"b") as "cntB", sum($"c.c4") as "sumC")
Am I just missing the syntax? Or is there a different way to do it? Most importantly, Spark should scan the data only once for all aggregations.

It is possible, but your JSON must be on a single line: each line is one JSON object.
val json = sc.parallelize(
"{\"a\":5,\"b\":10,\"c\":{\"c1\": 3,\"c4\": 5}}" :: Nil)
val jsons = sqlContext.read.json(json)
jsons.agg(count($"b") as "cntB", sum($"c.c4") as "sumC").show
This works fine; note that the JSON is formatted to be on one line.
jsons.printSchema() prints:
root
|-- a: long (nullable = true)
|-- b: long (nullable = true)
|-- c: struct (nullable = true)
| |-- c1: long (nullable = true)
| |-- c4: long (nullable = true)
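Since both aggregates are expressed in a single agg call, Spark computes them in one aggregation over a single scan of the data. If you prefer SQL, a rough equivalent (a sketch, assuming the same jsons DataFrame and a Spark version that has createOrReplaceTempView) would be:
jsons.createOrReplaceTempView("jsons")
sqlContext.sql("SELECT count(b) AS cntB, sum(c.c4) AS sumC FROM jsons").show()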

Related

Apache Spark's from_json not working as expected

In my Spark application, I am trying to read incoming JSON data sent through a socket. The data is in string format, e.g. {"deviceId": "1", "temperature":4.5}.
I created a schema as shown below:
StructType dataSchema = new StructType()
.add("deviceId", "string")
.add("temperature", "double");
I wrote the code below to extract the fields and turn them into columns so I can use them in SQL queries.
Dataset<Row> normalizedStream = stream.select(functions.from_json(new Column("value"),dataSchema)).as("json");
Dataset<Data> test = normalizedStream.select("json.*").as(Encoders.bean(Data.class));
test.printSchema();
Data.class
public class Data {
private String deviceId;
private double temperature;
}
But when I submit the Spark app, the output schema is as below.
root
|-- from_json(value): struct (nullable = true)
| |-- deviceId: string (nullable = true)
| |-- temperature: double (nullable = true)
The from_json function is showing up as the column name.
What I expect is:
root
|-- deviceId: string (nullable = true)
|-- temperature: double (nullable = true)
How do I fix this? Please let me know what I am doing wrong.
The problem is the placement of the alias. Right now, you are applying the alias to the select, not to from_json where it belongs.
Because of that, json.* does not work: the renaming does not happen as intended, so there is no column called json to be found, nor any children inside it.
So, if you move the closing parenthesis from this:
...(new Column("value"),dataSchema)).as("json");
to this:
...(new Column("value"),dataSchema).as("json"));
your final data and schema will look like this:
+--------+-----------+
|deviceId|temperature|
+--------+-----------+
|1 |4.5 |
+--------+-----------+
root
|-- deviceId: string (nullable = true)
|-- temperature: double (nullable = true)
which is what you intend to do. Hope this helps, good luck!
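For reference, here is a minimal sketch of the same principle in Scala syntax (the stream and dataSchema names mirror the question and are assumed here):
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.StructType

val dataSchema = new StructType()
  .add("deviceId", "string")
  .add("temperature", "double")

// Alias the from_json(...) column itself, then expand its fields.
val normalized = stream
  .select(from_json(col("value"), dataSchema).as("json"))
  .select("json.*")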

Pyspark: Write CSV from JSON file with struct column

I'm reading a .json file that contains the structure below, and I need to generate a CSV with this data in column form. I know I can't write an array-type column directly to a CSV, so I used the explode function to pull out the fields I need and leave them in columnar form. But I'm getting an error when using explode; from what I understand, it's not possible to use it on two variables in the same select. Can someone suggest an alternative?
from pyspark.sql.functions import col, explode
from pyspark.sql import SparkSession
spark = (SparkSession.builder
.master("local[1]")
.appName("sample")
.getOrCreate())
df = (spark.read.option("multiline", "true")
.json("data/origin/crops.json"))
df2 = (df.select(explode('history').alias('history'), explode('trial').alias('trial'))
       .select('history.started_at', 'history.finished_at', col('id'),
               col('trial.is_trial'), col('trial.ws10_max')))
(df2.write.format('com.databricks.spark.csv')
.mode('overwrite')
.option("header","true")
.save('data/output/'))
root
|-- history: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- finished_at: string (nullable = true)
| | |-- started_at: string (nullable = true)
|-- id: long (nullable = true)
|-- trial: struct (nullable = true)
| |-- is_trial: boolean (nullable = true)
| |-- ws10_max: double (nullable = true)
I'm trying to return something like this: one column each for started_at, finished_at, is_trial, and ws10_max, with one row per record.
Thank you!
Use explode on array and select("struct.*") on struct.
df.select("trial", "id", explode('history').alias('history')),
.select('id', 'history.*', 'trial.*'))

Loading Parquet Files with Different Column Ordering

I have two Parquet directories that are being loaded into Spark. They have all the same columns, but the columns are in a different order.
val df = spark.read.parquet("url1").printSchema()
root
|-- a: string (nullable = true)
|-- b: string (nullable = true)
val df = spark.read.parquet("url2").printSchema()
root
|-- b: string (nullable = true)
|-- a: string (nullable = true)
val urls = Array("url1", "url2")
val df = spark.read.parquet(urls: _*).printSchema()
root
|-- a: string (nullable = true)
|-- b: string (nullable = true)
When I load the files together, they always seem to take on the ordering of url1. I'm worried that having the Parquet files in url1 and url2 saved with their columns in a different order will have unintended consequences, such as a and b swapping values. Can someone explain how Parquet loads columns stored in a different order, with links to official documentation if possible?
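As a sketch of one way to take the ordering question out of the picture, you can select the columns by name after reading, so the output column order is explicit regardless of how each directory stored them (the names a and b come from the schemas above):
import org.apache.spark.sql.functions.col

val cols = Seq("a", "b")
// Selecting by name fixes the output column order regardless of file layout.
val df = spark.read.parquet("url1", "url2").select(cols.map(col): _*)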

How to find out the schema of tons of messy structured data?

I have a huge dataset with a messy, inconsistent schema.
The same field can hold different types of data; for example, data.tags can be a list of strings or a list of objects.
I tried to load the JSON data from HDFS and print the schema, but I get the error below.
TypeError: Can not merge type <class 'pyspark.sql.types.ArrayType'> and <class 'pyspark.sql.types.StringType'>
Here is the code
data_json = sc.textFile(data_path)
data_dataset = data_json.map(json.loads)
data_dataset_df = data_dataset.toDF()
data_dataset_df.printSchema()
Is it possible to figure out a schema, something like
root
|-- children: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: boolean (valueContainsNull = true)
| |-- element: string
|-- first_name: string (nullable = true)
|-- last_name: string (nullable = true)
|-- occupation: string (nullable = true)
in this case?
If I understand correctly, you're looking for a way to infer the schema of a JSON file. You should take a look at reading the JSON into a DataFrame straightaway, instead of through a Python mapping function.
Also, I'm referring you to How to infer schema of JSON files?, as I think it answers your question.
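For example, a minimal sketch (shown in Scala; the PySpark call is analogous), assuming data_path points to line-delimited JSON:
// Let Spark infer the schema directly from the JSON instead of mapping with json.loads.
val df = spark.read.json(data_path)
df.printSchema()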

Spark inner joins results in empty records

I'm performing an inner join between dataframes to only keep the sales for specific days:
val days_df = ss.createDataFrame(days_array.map(Tuple1(_))).toDF("DAY_ID")
val filtered_sales = sales.join(days_df, Seq("DAY_ID"))
filtered_sales.show()
This results in an empty filtered_sales dataframe (0 records); both DAY_ID columns have the same type (string).
root
|-- DAY_ID: string (nullable = true)
root
|-- SKU: string (nullable = true)
|-- DAY_ID: string (nullable = true)
|-- STORE_ID: string (nullable = true)
|-- SALES_UNIT: integer (nullable = true)
|-- SALES_REVENUE: decimal(20,5) (nullable = true)
The sales df is populated from a 20 GB file.
Using the same code with a small file of a few KB, the join works fine and I can see the results. The empty result dataframe occurs only with the bigger dataset.
If I change the code and use the following one, it works fine even with the 20GB sales file:
sales.filter(sales("DAY_ID").isin(days_array:_*))
.show()
What is wrong with the inner join?
Try broadcasting the small days_df and then applying the inner join. Since it is tiny compared to the other table, broadcasting will help.
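A minimal sketch of what that looks like, reusing days_df and sales from the question:
import org.apache.spark.sql.functions.broadcast

// Broadcasting the small side ships it to every executor, so the join is done locally.
val filtered_sales = sales.join(broadcast(days_df), Seq("DAY_ID"))
filtered_sales.show()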
