Apache Spark's from_json not working as expected - apache-spark

In my Spark application, I am trying to read the incoming JSON data, sent through the socket. The data is in string format. eg. {"deviceId": "1", "temperature":4.5}.
I created a schema as shown below:
StructType dataSchema = new StructType()
.add("deviceId", "string")
.add("temperature", "double");
I wrote the below code to extract the fields, turn them into a column, so I can use that in SQL queries.
Dataset<Row> normalizedStream = stream.select(functions.from_json(new Column("value"),dataSchema)).as("json");
Dataset<Data> test = normalizedStream.select("json.*").as(Encoders.bean(Data.class));
test.printSchema();
Data.class
public class Data {
private String deviceId;
private double temperature;
}
But when I submit the Spark app, the output schema is as below.
root
|-- from_json(value): struct (nullable = true)
| |-- deviceId: string (nullable = true)
| |-- temperature: double (nullable = true)
the from_json function is coming as a column name.
What I expect is:
root
|-- deviceId: string (nullable = true)
|-- temperature: double (nullable = true)
How to fix the above? Please, let me know what I doing wrong.

The problem is the placement of alias. Right now, you are placing an alias to select, and not to from_json where it is supposed to be.
Right now, json.* does not work because the renaming is not working as intended, therefore no column called json can be found, nor any children inside of it.
So, if you move the brackets from this:
...(new Column("value"),dataSchema)).as("json");
to this:
...(new Column("value"),dataSchema).as("json"));
your final data and schema will look as:
+--------+-----------+
|deviceId|temperature|
+--------+-----------+
|1 |4.5 |
+--------+-----------+
root
|-- deviceId: string (nullable = true)
|-- temperature: double (nullable = true)
which is what you intend to do. Hope this helps, good luck!

Related

Pyspark: Write CSV from JSON file with struct column

I'm reading a .json file that contains the structure below, and I need to generate a csv with this data in column form, I know that I can't directly write an array-type object in a csv, I used the explode function to remove the fields I need , being able to leave them in a columnar form, but when writing the data frame in csv, I'm getting an error when using the explode function, from what I understand it's not possible to do this with two variables in the same select, can someone help me with something alternative?
from pyspark.sql.functions import col, explode
from pyspark.sql import SparkSession
spark = (SparkSession.builder
.master("local[1]")
.appName("sample")
.getOrCreate())
df = (spark.read.option("multiline", "true")
.json("data/origin/crops.json"))
df2 = (explode('history').alias('history'), explode('trial').alias('trial'))
.select('history.started_at', 'history.finished_at', col('id'), trial.is_trial, trial.ws10_max))
(df2.write.format('com.databricks.spark.csv')
.mode('overwrite')
.option("header","true")
.save('data/output/'))
root
|-- history: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- finished_at: string (nullable = true)
| | |-- started_at: string (nullable = true)
|-- id: long (nullable = true)
|-- trial: struct (nullable = true)
| |-- is_trial: boolean (nullable = true)
| |-- ws10_max: double (nullable = true)
I'm trying to return something like this
started_at
finished_at
is_trial
ws10_max
First
row
row
Second
row
row
Thank you!
Use explode on array and select("struct.*") on struct.
df.select("trial", "id", explode('history').alias('history')),
.select('id', 'history.*', 'trial.*'))

How to find out the schema from a tons of messy structured data?

I have a huge dataset with messy structured schema.
Say, the same data fields can have different data type of data, for example, data.tags can be a list of string or a list of object
I tried to load the JSON data from hdfs and print the schema but it occurs the error below.
TypeError: Can not merge type <class 'pyspark.sql.types.ArrayType'> and <class 'pyspark.sql.types.StringType'>
Here is the code
data_json = sc.textFile(data_path)
data_dataset = data_json.map(json.loads)
data_dataset_df = data_dataset.toDF()
data_dataset_df.printSchema()
Is it possible to figure out the schema something like
root
|-- children: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: boolean (valueContainsNull = true)
| |-- element: string
|-- first_name: string (nullable = true)
|-- last_name: string (nullable = true)
|-- occupation: string (nullable = true)
in this case?
If I understand correctly, you're looking to find how to infer the schema of a JSON file. You should take a look at reading the JSON into a DataFrame straightaway, instead of through a Python mapping function.
Also, I'm referring you to How to infer schema of JSON files?, as I think it answers your question.

Create dataframe on printschema output

I have created a dataframe on top of parquet file and now able to see the dataframe schema.Now I want to create dataframe on top of the printschema output
df = spark.read.parquet("s3/location")
df.printschema()
the output looks like [(cola , string) , (colb,string)]
Now I want to create dataframe on the output of printschema .
What would be the best way to do that
Adding more inputs on what has been achieved so far -
df1 = sqlContext.read.parquet("s3://t1")
df1.printSchema()
We got the below result -
root
|-- Atp: string (nullable = true)
|-- Ccetp: string (nullable = true)
|-- Ccref: string (nullable = true)
|-- Ccbbn: string (nullable = true)
|-- Ccsdt: string (nullable = true)
|-- Ccedt: string (nullable = true)
|-- Ccfdt: string (nullable = true)
|-- Ccddt: string (nullable = true)
|-- Ccamt: string (nullable = true)
We want to create dataframe with two columns - 1) colname , 2) datatype
But if we run the below code -
schemaRDD = spark.sparkContext.parallelize([df1.schema.json()])
schema_df = spark.read.json(schemaRDD)
schema_df.show()
We are getting below output where we are getting the entire column names and datatype in a single row -
+--------------------+------+
| fields| type|
+--------------------+------+
|[[Atp,true,str...|struct|
+--------------------+------+
Looking for a output like
Atp| string
Ccetp| string
Ccref| string
Ccbbn| string
Ccsdt| string
Ccedt| string
Ccfdt| string
Ccddt| string
Ccamt| string
Not sure what language your are using but on pyspark I would do it like this:
schemaRDD = spark.sparkContext.parallelize([df.schema.json()])
schema_df = spark.read.json(schemaRDD)
schema_df = sqlContext.createDataFrame(zip([col[0] for col in df1.dtypes], [col[1] for col in df1.dtypes]), schema=['colname', 'datatype'])

JSON Struct to Map[String,String] using sqlContext

I am trying to read json data in spark streaming job.
By default sqlContext.read.json(rdd) is converting all map types to struct types.
|-- legal_name: struct (nullable = true)
| |-- first_name: string (nullable = true)
| |-- last_name: string (nullable = true)
| |-- middle_name: string (nullable = true)
But when i read from hive table using sqlContext
val a = sqlContext.sql("select * from student_record")
below is the schema.
|-- leagalname: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
Is there any way we can read data using read.json(rdd) and get Map data type?
Is there any option like
spark.sql.schema.convertStructToMap?
Any help is appreciated.
You need to explicitly define your schema, when calling read.json.
You can read about the details in Programmatically specifying the schema in the Spark SQL Documentation.
For example in your specific case it would be
import org.apache.spark.sql.types._
val schema = StructType(List(StructField("legal_name",MapType(StringType,StringType,true))))
That would be one column legal_name being a map.
When you have defined you schema you can call
sqlContext.read.json(rdd, schema) to create your data frame from your JSON dataset with the desired schema.

Multiple aggregations on nested structure in a single Spark statement

I have a json structure like this:
{
"a":5,
"b":10,
"c":{
"c1": 3,
"c4": 5
}
}
I have a dataframe created from this structure with several million rows. What I need are aggregation in several keys like this:
df.agg(count($"b") as "cntB", sum($"c.c4") as "sumC")
Do I just miss the syntax? Or is there a different way to do it? Most important Spark should only scan the data once for all aggregations.
It is possible, but your JSON must be in one line.
Each line = new JSON object.
val json = sc.parallelize(
"{\"a\":5,\"b\":10,\"c\":{\"c1\": 3,\"c4\": 5}}" :: Nil)
val jsons = sqlContext.read.json(json)
jsons.agg(count($"b") as "cntB", sum($"c.c4") as "sumC").show
Works fine - please see that json is formatted to be in one line.
jsons.printSchema() is printing:
root
|-- a: long (nullable = true)
|-- b: long (nullable = true)
|-- c: struct (nullable = true)
| |-- c1: long (nullable = true)
| |-- c4: long (nullable = true)

Resources