I am new to PySpark. Can anyone help me with how to read JSON data using PySpark?
What we have done:
(1) main.py
import os.path
from pyspark.sql import SparkSession


def fileNameInput(filename, spark):
    try:
        if os.path.isfile(filename):
            loadFileIntoHdfs(filename, spark)
        else:
            print("File does not exist")
    except OSError:
        print("Error while finding file")


def loadFileIntoHdfs(fileName, spark):
    df = spark.read.json(fileName)
    df.show()


if __name__ == '__main__':
    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()

    file_name = input("Enter file location : ")
    fileNameInput(file_name, spark)
When I run the above code it throws this error message:
File "/opt/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/opt/spark/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py", line 320, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o41.showString.
: org.apache.spark.sql.AnalysisException: Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
Thanks in advance
Your JSON works in my PySpark. I can get a similar error when the record text spans multiple lines. Please ensure that each record fits on one line.
Alternatively, tell it to support multi-line records:
spark.read.json(filename, multiLine=True)
What works:
{ "employees": [{ "firstName": "John", "lastName": "Doe" }, { "firstName": "Anna", "lastName": "Smith" }, { "firstName": "Peter", "lastName": "Jones" } ] }
That outputs:
spark.read.json('/home/ernest/Desktop/brokenjson.json').printSchema()
root
|-- employees: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- firstName: string (nullable = true)
| | |-- lastName: string (nullable = true)
When I try some input like this:
{
"employees": [{ "firstName": "John", "lastName": "Doe" }, { "firstName": "Anna", "lastName": "Smith" }, { "firstName": "Peter", "lastName": "Jones" } ] }
Then I get the corrupt record column in the schema:
root
|-- _corrupt_record: string (nullable = true)
But when read with the multiLine option, the latter input works too.
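For completeness, here is a minimal sketch of the same fix applied to the loadFileIntoHdfs() function from the question (this assumes Spark 2.2+, where the JSON reader's multiLine option is available):

def loadFileIntoHdfs(fileName, spark):
    # multiLine=True lets Spark parse records that span several lines
    # instead of routing the whole record to _corrupt_record
    df = spark.read.json(fileName, multiLine=True)
    df.show()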
Related
I have a structured stream reading from Kafka and am trying to convert the JSON payload using a Struct schema.
{
"fields": [
{
"metadata": {},
"name": "test",
"nullable": true,
"type": {
"containsNull": true,
"elementType": {
"fields": [
{
"metadata": {},
"name": "message",
"nullable": false,
"type": "string"
},
{
"metadata": {},
"name": "recipient_id",
"nullable": true,
"type": "long"
}
],
"type": "struct"
},
"type": "array"
}
},
{
"metadata": {},
"name": "user_id",
"nullable": true,
"type": "long"
}
],
"type": "struct"
}
Converting the JSON schema to a StructType with StructType.fromJson(jsonSchema) results in the following:
StructType([StructField('test', ArrayType(StructType([StructField('message', StringType(), False), StructField('recipient_id', LongType(), True)]), True), True), StructField('user_id', LongType(), True)])
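For reference, the StructType above can be built from a schema file roughly like this (a sketch, assuming the JSON schema is stored in a hypothetical file named schema.json):

import json
from pyspark.sql.types import StructType

# Load the JSON schema definition and turn it into a Spark StructType
with open("schema.json") as f:
    schemaNew = StructType.fromJson(json.load(f))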
Converting the payload using this schema results in a DataFrame schema where nullable is set to true even though it is false in the schema above, and passing a null value to the field does not raise any error.
from pyspark.sql.functions import col, from_json

spark_df = spark_df.selectExpr('timestamp', "CAST(value AS STRING)")
spark_df = spark_df.withColumn("value", from_json(col("value"), schemaNew, {"mode": "FAILFAST"}))
spark_df.printSchema()
root
|-- timestamp: timestamp (nullable = true)
|-- value: struct (nullable = true)
| |-- test: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- message: string (nullable = true)
| | | |-- recipient_id: long (nullable = true)
| |-- user_id: long (nullable = true)
How else can we read the schema from a file, apply the nullable property, and convert the JSON data to a proper DataFrame?
Since from_json() ignores nullability information (see the explanation below), you could try adding a filter() right after it that drops such data (rows where message is null). Or you could try createDataFrame(), which does check nullability, and merge back if that's an option.
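For illustration, a minimal sketch of the filter() approach, assuming schemaNew is the StructType parsed from the JSON schema above and spark_df carries the raw JSON string in value. Because message sits inside the test array, the filter uses the exists higher-order function (available since Spark 2.4):

from pyspark.sql.functions import col, expr, from_json

parsed = spark_df.withColumn("value", from_json(col("value"), schemaNew, {"mode": "FAILFAST"}))

# Drop rows where any element of value.test has a null message.
# Note: rows where value.test itself is null are dropped too, since
# exists() returns null for a null array and the filter treats null as false.
cleaned = parsed.filter(~expr("exists(value.test, x -> x.message IS NULL)"))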
Why from_json() ignores nullability info
Prior to Spark 2.3.1, this happened unexpectedly because the Jackson parser produced null regardless of nullability. There were two options to resolve this (from this SPARK issue):
Ignore nullability information in the schema, and assume all fields are nullable. Hence keeping the same behaviour as before due to the limitation of Jackson parsers.
Validate the data, and fail during execution if the data has null.
They went with option 1 because it's "the more performant option and a lot easier to do" and "less invasive" too. Hence, from_json() ignores nullability information set in the schema.
I'm using the code below to read data from an API whose payload is in JSON format, using PySpark in Azure Databricks. All the fields are defined as strings, but I keep running into a "json_tuple requires that all arguments are strings" error.
Schema:
root
|-- Payload: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ActiveDate: string (nullable = true)
| | |-- BusinessId: string (nullable = true)
| | |-- BusinessName: string (nullable = true)
JSON:
{
"Payload":
[
{
"ActiveDate": "2008-11-25",
"BusinessId": "5678",
"BusinessName": "ACL"
},
{
"ActiveDate": "2009-03-22",
"BusinessId": "6789",
"BusinessName": "BCL"
}
]
}
PySpark:
from pyspark.sql import functions as F
df = df.select(F.col('Payload'),
               F.json_tuple(F.col('Payload'), 'ActiveDate', 'BusinessId', 'BusinessName')
                .alias('ActiveDate', 'BusinessId', 'BusinessName'))
df.write.format("delta").mode("overwrite").saveAsTable("delta_payload")
Error:
AnalysisException: cannot resolve 'json_tuple(`Payload`, 'ActiveDate', 'BusinessId', 'BusinessName')' due to data type mismatch: json_tuple requires that all arguments are strings;
From your schema it looks like the JSON is already parsed, so Payload is of ArrayType rather than StringType containing JSON, hence the error.
You probably need explode instead of json_tuple:
>>> from pyspark.sql.functions import explode
>>> df = spark.createDataFrame([{
... "Payload":
... [
... {
... "ActiveDate": "2008-11-25",
... "BusinessId": "5678",
... "BusinessName": "ACL"
... },
... {
... "ActiveDate": "2009-03-22",
... "BusinessId": "6789",
... "BusinessName": "BCL"
... }
... ]
... }])
>>> df.schema
StructType(List(StructField(Payload,ArrayType(MapType(StringType,StringType,true),true),true)))
>>> df.select(explode("Payload").alias("x")).select("x.ActiveDate", "x.BusinessName", "x.BusinessId").show()
+----------+------------+----------+
|ActiveDate|BusinessName|BusinessId|
+----------+------------+----------+
|2008-11-25| ACL| 5678|
|2009-03-22| BCL| 6789|
+----------+------------+----------+
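Applied to the DataFrame from the question, that could look roughly like this (a sketch, assuming df already holds the parsed Payload array column):

from pyspark.sql import functions as F

flat_df = df.select(F.explode("Payload").alias("x")) \
    .select("x.ActiveDate", "x.BusinessId", "x.BusinessName")

flat_df.write.format("delta").mode("overwrite").saveAsTable("delta_payload")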
I need some help on my first attempt at parsing JSON coming from Kafka into Spark Structured Streaming.
I am struggling to convert the incoming JSON into a flat DataFrame for further processing.
My input JSON is:
[
{ "siteId": "30:47:47:BE:16:8F", "siteData":
[
{ "dataseries": "trend-255", "values":
[
{"ts": 1502715600, "value": 35.74 },
{"ts": 1502715660, "value": 35.65 },
{"ts": 1502715720, "value": 35.58 },
{"ts": 1502715780, "value": 35.55 }
]
},
{ "dataseries": "trend-256", "values":
[
{"ts": 1502715840, "value": 18.45 },
{"ts": 1502715900, "value": 18.35 },
{"ts": 1502715960, "value": 18.32 }
]
}
]
},
{ "siteId": "30:47:47:BE:16:FF", "siteData":
[
{ "dataseries": "trend-255", "values":
[
{"ts": 1502715600, "value": 35.74 },
{"ts": 1502715660, "value": 35.65 },
{"ts": 1502715720, "value": 35.58 },
{"ts": 1502715780, "value": 35.55 }
]
},
{ "dataseries": "trend-256", "values":
[
{"ts": 1502715840, "value": 18.45 },
{"ts": 1502715900, "value": 18.35 },
{"ts": 1502715960, "value": 18.32 }
]
}
]
}
]
The Spark schema is:
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType

data1_spark_schema = ArrayType(
    StructType([
        StructField("siteId", StringType(), False),
        StructField("siteData", ArrayType(StructType([
            StructField("dataseries", StringType(), False),
            StructField("values", ArrayType(StructType([
                StructField("ts", IntegerType(), False),
                StructField("value", StringType(), False)
            ]), False), False)
        ]), False), False)
    ]), False
)
My very simple code is:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

from config.general import kafka_instance
from config.general import topic
from schemas.schema import data1_spark_schema

spark = SparkSession \
    .builder \
    .appName("Structured_BMS_Feed") \
    .getOrCreate()

stream = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafka_instance) \
    .option("subscribe", topic) \
    .option("startingOffsets", "latest") \
    .option("max.poll.records", 100) \
    .option("failOnDataLoss", False) \
    .load()

stream_records = stream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING) as bms_data1") \
    .select(from_json("bms_data1", data1_spark_schema).alias("bms_data1"))

sites = stream_records.select(explode("bms_data1").alias("site")) \
    .select("site.*")

sites.printSchema()

stream_debug = sites.writeStream \
    .outputMode("append") \
    .format("console") \
    .option("numRows", 20) \
    .option("truncate", False) \
    .start()

stream_debug.awaitTermination()
When I run this code, the schema prints like this:
root
|-- siteId: string (nullable = false)
|-- siteData: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- dataseries: string (nullable = false)
| | |-- values: array (nullable = false)
| | | |-- element: struct (containsNull = false)
| | | | |-- ts: integer (nullable = false)
| | | | |-- value: string (nullable = false)
Is it possible to get all fields in a flat DataFrame instead of nested JSON? That is, for every ts and value it should give one row together with its parent dataseries and siteId.
Answering my own question. I managed to flatten it using the following lines:
sites_flat = stream_records.select(explode("bms_data1").alias("site")) \
    .select("site.siteId", explode("site.siteData").alias("siteData")) \
    .select("siteId", "siteData.dataseries", explode("siteData.values").alias("values")) \
    .select("siteId", "dataseries", "values.*")
I am facing an issue for which I am seeking your help. I have a task to convert a JSON file to a Dataset so that it can be loaded into Hive.
Code 1
SparkSession spark1 = SparkSession
        .builder()
        .appName("File_Validation")
        .config("spark.some.config.option", "some-value")
        .getOrCreate();

Dataset<Row> df = spark1.read().json("input/sample.json");
df.show();
The above code throws a NullPointerException.
I tried another way:
Code 2
JavaRDD<String> jsonFile = context.textFile("input/sample.json");
Dataset<Row> df2 = spark1.read().json(jsonFile);
df2.show();
I created an RDD and passed it to spark1 (the SparkSession).
Code 2 reads the JSON into a different format, with the header as:
+--------------------+
| _corrupt_record|
+--------------------+
with schema as - |-- _corrupt_record: string (nullable = true)
Please help in fixing it.
Sample JSON
{
"user": "gT35Hhhre9m",
"dates": ["2016-01-29", "2016-01-28"],
"status": "OK",
"reason": "some reason",
"content": [{
"foo": 123,
"bar": "val1"
}, {
"foo": 456,
"bar": "val2"
}, {
"foo": 789,
"bar": "val3"
}, {
"foo": 124,
"bar": "val4"
}, {
"foo": 126,
"bar": "val5"
}]
}
Your JSON should be on one line: one JSON object per line, with each line being one record.
For example:
{ "property1": 1 }
{ "property1": 2 }
It will be read as a Dataset with 2 rows and one column.
From documentation:
Note that the file that is offered as a json file is not a typical
JSON file. Each line must contain a separate, self-contained valid
JSON object. As a consequence, a regular multi-line JSON file will
most often fail.
Of course, read the data with SparkSession, as it will infer the schema.
You can't read pretty-printed (formatted) JSON in Spark; your JSON should be on a single line, like this:
{"user": "gT35Hhhre9m","dates": ["2016-01-29", "2016-01-28"],"status": "OK","reason": "some reason","content": [{"foo": 123,"bar": "val1"}, {"foo": 456,"bar": "val2"}, {"foo": 789,"bar": "val3"}, {"foo": 124,"bar": "val4"}, {"foo": 126,"bar": "val5"}]}
Or it could be line-delimited JSON, with one record per line, like this:
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
I am using the new Apache Spark 1.4.0 DataFrames API to extract information from Twitter's Status JSON, mostly focused on the Entities object. The relevant part for this question is shown below:
{
...
...
"entities": {
"hashtags": [],
"trends": [],
"urls": [],
"user_mentions": [
{
"screen_name": "linobocchini",
"name": "Lino Bocchini",
"id": 187356243,
"id_str": "187356243",
"indices": [ 3, 16 ]
},
{
"screen_name": "jeanwyllys_real",
"name": "Jean Wyllys",
"id": 111123176,
"id_str": "111123176",
"indices": [ 79, 95 ]
}
],
"symbols": []
},
...
...
}
There are several examples of how to extract information from primitive types such as string and integer, but I couldn't find anything on how to process this kind of complex structure.
I tried the code below, but it still doesn't work; it throws an exception.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val tweets = sqlContext.read.json("tweets.json")
// this function is just to filter empty entities.user_mentions[] nodes
// some tweets doesn't contains any mentions
import org.apache.spark.sql.functions.udf
val isEmpty = udf((value: List[Any]) => value.isEmpty)
import org.apache.spark.sql._
import sqlContext.implicits._
case class UserMention(id: Long, idStr: String, indices: Array[Long], name: String, screenName: String)
val mentions = tweets.select("entities.user_mentions").
  filter(!isEmpty($"user_mentions")).
  explode($"user_mentions") {
    case Row(arr: Array[Row]) => arr.map { elem =>
      UserMention(
        elem.getAs[Long]("id"),
        elem.getAs[String]("is_str"),
        elem.getAs[Array[Long]]("indices"),
        elem.getAs[String]("name"),
        elem.getAs[String]("screen_name"))
    }
  }
mentions.first
Exception when I try to call mentions.first:
scala> mentions.first
15/06/23 22:15:06 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 8)
scala.MatchError: [List([187356243,187356243,List(3, 16),Lino Bocchini,linobocchini], [111123176,111123176,List(79, 95),Jean Wyllys,jeanwyllys_real])] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
at $line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:34)
at $line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:34)
at scala.Function1$$anonfun$andThen$1.apply(Function1.scala:55)
at org.apache.spark.sql.catalyst.expressions.UserDefinedGenerator.eval(generators.scala:81)
What is wrong here? I understand it is related to the types, but I couldn't figure it out yet.
As additional context, the structure mapped automatically is:
scala> mentions.printSchema
root
|-- user_mentions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = true)
| | |-- id_str: string (nullable = true)
| | |-- indices: array (nullable = true)
| | | |-- element: long (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- screen_name: string (nullable = true)
NOTE 1: I know it is possible to solve this using HiveQL, but I would like to use DataFrames since there is so much momentum around them.
SELECT explode(entities.user_mentions) as mentions
FROM tweets
NOTE 2: the UDF val isEmpty = udf((value: List[Any]) => value.isEmpty) is an ugly hack and I'm probably missing something here, but it was the only way I came up with to avoid an NPE.
Here is a solution that works, with just one small hack.
The main idea is to work around the type problem by declaring a List[String] rather than List[Row]:
val mentions = tweets.explode("entities.user_mentions", "mention"){m: List[String] => m}
This creates a second column called "mention" of type "Struct":
| entities| mention|
+--------------------+--------------------+
|[List(),List(),Li...|[187356243,187356...|
|[List(),List(),Li...|[111123176,111123...|
Now do a map() to extract the fields inside mention. The getStruct(1) call gets the value in column 1 of each row:
case class Mention(id: Long, id_str: String, indices: Seq[Int], name: String, screen_name: String)

val mentionsRdd = mentions.map(
  row => {
    val mention = row.getStruct(1)
    Mention(mention.getLong(0), mention.getString(1), mention.getSeq[Int](2), mention.getString(3), mention.getString(4))
  }
)
And convert the RDD back into a DataFrame:
val mentionsDf = mentionsRdd.toDF()
There you go!
| id| id_str| indices| name| screen_name|
+---------+---------+------------+-------------+---------------+
|187356243|187356243| List(3, 16)|Lino Bocchini| linobocchini|
|111123176|111123176|List(79, 95)| Jean Wyllys|jeanwyllys_real|
Try matching on Seq[Row] instead of Array[Row] (at runtime the value is a Scala Seq, which is why the pattern above does not match):
case Row(arr: Seq[Row]) => arr.map { elem =>