JSON Struct to Map[String,String] using sqlContext - apache-spark

I am trying to read JSON data in a Spark Streaming job.
By default, sqlContext.read.json(rdd) converts all map types to struct types.
|-- legal_name: struct (nullable = true)
| |-- first_name: string (nullable = true)
| |-- last_name: string (nullable = true)
| |-- middle_name: string (nullable = true)
But when I read from the Hive table using sqlContext,
val a = sqlContext.sql("select * from student_record")
below is the schema.
|-- leagalname: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
Is there any way to read the data using read.json(rdd) and get a map data type?
Is there any option like
spark.sql.schema.convertStructToMap?
Any help is appreciated.

You need to explicitly define your schema when calling read.json.
You can read about the details under Programmatically Specifying the Schema in the Spark SQL documentation.
For example, in your specific case it would be:
import org.apache.spark.sql.types._
val schema = StructType(List(StructField("legal_name",MapType(StringType,StringType,true))))
That gives you a single column legal_name of type map<string,string>.
Once you have defined your schema, you can call
sqlContext.read.schema(schema).json(rdd) to create your DataFrame from your JSON dataset with the desired schema.

Related

Apache Spark's from_json not working as expected

In my Spark application, I am trying to read the incoming JSON data sent through a socket. The data is in string format, e.g. {"deviceId": "1", "temperature": 4.5}.
I created a schema as shown below:
StructType dataSchema = new StructType()
.add("deviceId", "string")
.add("temperature", "double");
I wrote the code below to extract the fields and turn them into columns so I can use them in SQL queries.
Dataset<Row> normalizedStream = stream.select(functions.from_json(new Column("value"),dataSchema)).as("json");
Dataset<Data> test = normalizedStream.select("json.*").as(Encoders.bean(Data.class));
test.printSchema();
Data.class
public class Data {
private String deviceId;
private double temperature;
}
But when I submit the Spark app, the output schema is as below.
root
|-- from_json(value): struct (nullable = true)
| |-- deviceId: string (nullable = true)
| |-- temperature: double (nullable = true)
The from_json call shows up as the column name.
What I expect is:
root
|-- deviceId: string (nullable = true)
|-- temperature: double (nullable = true)
How can I fix this? Please let me know what I am doing wrong.
The problem is the placement of the alias. Right now, you are applying the alias to the result of select, not to from_json where it is supposed to go.
As a result, json.* does not work: the renaming does not happen as intended, so no column called json can be found, nor any children inside it.
So, if you move the brackets from this:
...(new Column("value"),dataSchema)).as("json");
to this:
...(new Column("value"),dataSchema).as("json"));
your final data and schema will look as:
+--------+-----------+
|deviceId|temperature|
+--------+-----------+
|1 |4.5 |
+--------+-----------+
root
|-- deviceId: string (nullable = true)
|-- temperature: double (nullable = true)
which is what you intend to do. Hope this helps, good luck!

Pyspark: Write CSV from JSON file with struct column

I'm reading a .json file that contains the structure below, and I need to generate a CSV with this data in column form. I know that I can't write an array-type object directly to a CSV, so I used the explode function to pull out the fields I need and leave them in columnar form. But when writing the data frame to CSV I get an error from the explode calls; from what I understand it's not possible to use two of them in the same select. Can someone help me with an alternative?
from pyspark.sql.functions import col, explode
from pyspark.sql import SparkSession
spark = (SparkSession.builder
.master("local[1]")
.appName("sample")
.getOrCreate())
df = (spark.read.option("multiline", "true")
.json("data/origin/crops.json"))
df2 = (df.select(explode('history').alias('history'), explode('trial').alias('trial'))
       .select('history.started_at', 'history.finished_at', col('id'), 'trial.is_trial', 'trial.ws10_max'))
(df2.write.format('com.databricks.spark.csv')
.mode('overwrite')
.option("header","true")
.save('data/output/'))
root
|-- history: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- finished_at: string (nullable = true)
| | |-- started_at: string (nullable = true)
|-- id: long (nullable = true)
|-- trial: struct (nullable = true)
| |-- is_trial: boolean (nullable = true)
| |-- ws10_max: double (nullable = true)
I'm trying to return something like this
started_at | finished_at | is_trial | ws10_max
-----------|-------------|----------|---------
First      | row         | row      |
Second     | row         | row      |
Thank you!
Use explode on the array and select("struct.*") on the struct:
df2 = (df.select("trial", "id", explode("history").alias("history"))
       .select("id", "history.*", "trial.*"))
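For reference, a minimal end-to-end sketch putting that answer together with the write step from the question (same input path and schema as above; the built-in csv format is used here in place of com.databricks.spark.csv):
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = (SparkSession.builder
         .master("local[1]")
         .appName("sample")
         .getOrCreate())

df = (spark.read.option("multiline", "true")
      .json("data/origin/crops.json"))

# Explode only the array column; the struct columns are flattened with ".*"
df2 = (df.select("trial", "id", explode("history").alias("history"))
         .select("id", "history.*", "trial.*"))

(df2.write.format("csv")
    .mode("overwrite")
    .option("header", "true")
    .save("data/output/"))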

Pyspark add empty literal map of type string

Similar to this question I want to add a column to my pyspark DataFrame containing nothing but an empty map. If I use the suggested answer from that question, however, the type of the map is <null,null>, unlike in the answer posted there.
from pyspark.sql.functions import create_map
spark.range(1).withColumn("test", create_map()).printSchema()
root
|-- test: map (nullable = false)
| |-- key: null
| |-- value: null (valueContainsNull = false)
I need an empty <string,string> map. I can do it in Scala like so:
import org.apache.spark.sql.functions.typedLit
spark.range(1).withColumn("test", typedLit(Map[String, String]())).printSchema()
root
|-- test: map (nullable = false)
| |-- key: string
| |-- value: string (valueContainsNull = true)
How can I do it in PySpark? I am using Spark 3.0.1 with underlying Scala 2.12 on Databricks Runtime 7.3 LTS. I need the <string,string> map because otherwise I can't save my DataFrame to Parquet:
AnalysisException: Parquet data source does not support map<null,null> data type.;
You can create the map with create_map and then cast it to the appropriate type:
from pyspark.sql.functions import create_map
spark.range(1).withColumn("test", create_map().cast("map<string,string>")).printSchema()
root
|-- id: long (nullable = false)
|-- test: map (nullable = false)
| |-- key: string
| |-- value: string (valueContainsNull = true)
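If the goal is the Parquet write mentioned in the question, a minimal sketch (the output path is just a placeholder) showing that the cast column no longer triggers the map<null,null> error:
from pyspark.sql.functions import create_map

df = spark.range(1).withColumn("test", create_map().cast("map<string,string>"))

# With a map<string,string> column the Parquet writer accepts the schema;
# the output path below is a placeholder.
df.write.mode("overwrite").parquet("/tmp/empty_map_test")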

Reading orc file of Hive managed tables in pyspark

I am trying to read the ORC files of a managed Hive table using the PySpark code below.
spark.read.format('orc').load('hive managed table path')
When I print the schema of the fetched dataframe, it is as follows:
root
|-- operation: integer (nullable = true)
|-- originalTransaction: long (nullable = true)
|-- bucket: integer (nullable = true)
|-- rowId: long (nullable = true)
|-- currentTransaction: long (nullable = true)
|-- row: struct (nullable = true)
| |-- col1: float (nullable = true)
| |-- col2: integer (nullable = true)
|-- partition_by_column: date (nullable = true)
Now I am not able to parse this data or do any manipulation on the data frame. When applying an action like show(), I get an error saying
java.lang.IllegalArgumentException: Include vector the wrong length
Did someone face the same issue? If yes, can you please suggest how to resolve it?
It's a known issue.
You get that error because you're trying to read a Hive ACID (transactional) table, but Spark does not yet support this.
You could export your Hive table to plain ORC files and then read them with Spark, or try alternatives such as Hive JDBC, as described here.
As I am not sure about the versions, you can try other ways to load the ORC file.
Using SQLContext:
val df = sqlContext.read.format("orc").load(orcfile)
Or:
val df = spark.read.option("inferSchema", true).orc("filepath")
Or Spark SQL (recommended):
import spark.sql
sql("SELECT * FROM table_name").show()

How to find out the schema from a tons of messy structured data?

I have a huge dataset with a messy, inconsistent schema.
The same field can hold different types of data; for example, data.tags can be a list of strings or a list of objects.
I tried to load the JSON data from HDFS and print the schema, but it raises the error below.
TypeError: Can not merge type <class 'pyspark.sql.types.ArrayType'> and <class 'pyspark.sql.types.StringType'>
Here is the code
import json

data_json = sc.textFile(data_path)
data_dataset = data_json.map(json.loads)
data_dataset_df = data_dataset.toDF()
data_dataset_df.printSchema()
Is it possible to end up with a schema something like
root
|-- children: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: boolean (valueContainsNull = true)
| |-- element: string
|-- first_name: string (nullable = true)
|-- last_name: string (nullable = true)
|-- occupation: string (nullable = true)
in this case?
If I understand correctly, you're looking to find how to infer the schema of a JSON file. You should take a look at reading the JSON into a DataFrame straightaway, instead of through a Python mapping function.
Also, I'm referring you to How to infer schema of JSON files?, as I think it answers your question.
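As a sketch of that suggestion, here is how the schema can be inferred by Spark's JSON reader directly (the path is a placeholder for the data_path used in the question):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("infer_json_schema").getOrCreate()

# Placeholder for the same HDFS path the question assigns to data_path.
data_path = "hdfs:///path/to/data"

# spark.read.json infers one schema across the whole dataset and typically
# falls back to string where record-level types conflict, instead of failing
# like the RDD-of-dicts merge does.
data_dataset_df = spark.read.json(data_path)
data_dataset_df.printSchema()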
