How to write to a JSON Column in BigQuery using Spark Scala - apache-spark

I have a dataframe which contains a JSON string as a StringType column. How can this be written to a BigQuery JSON column (a new feature)? An example of writing Spark JSON to BigQuery is not available at https://github.com/GoogleCloudDataproc/spark-bigquery-connector
When we try to write the string in append mode to a JSON column we get the error
Field source has changed type from JSON to STRING
According to the README (Data Types -> JSON) this should be possible. How can this be implemented? An example would help.

Related

How to write a dataframe column with json data (STRING type) to BigQuery table as JSON type using pyspark?

I have a PySpark dataframe with a column containing a JSON string (the column type is string). I would like to write this dataframe to a BigQuery table with the column typed as JSON. I got the information below from this link:
https://github.com/GoogleCloudDataproc/spark-bigquery-connector
Spark has no JSON type. The values are read as String. In order to write JSON back to BigQuery, the following conditions are REQUIRED:
Use the INDIRECT write method
Use the AVRO intermediate format
The DataFrame field MUST be of type String and has an entry of sqlType=JSON in its metadata
I am not sure how to set an entry of sqlType=JSON in the dataframe field metadata. Can someone please help?
I am using the code below to write the dataframe to the BigQuery table:
df.write \
    .format("bigquery") \
    .option("temporaryGcsBucket", "some-bucket") \
    .save("dataset.table")
Have a look at the withMetadata method:
pyspark.sql.DataFrame.withMetadata
df_meta = df.withMetadata('age', {'foo': 'bar'})
df_meta.schema['age'].metadata
{'foo': 'bar'}
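Putting the README conditions and withMetadata together, a minimal PySpark sketch of the full write could look like the outline below. Treat it as untested: json_col, some-bucket and dataset.table are placeholders, the writeMethod and intermediateFormat option names are taken from the connector README, and withMetadata requires Spark 3.3+ (on older versions the same metadata can be attached with col("json_col").alias("json_col", metadata={"sqlType": "JSON"})).
# Tag the string column as JSON in its metadata (placeholder column name),
# then write with the INDIRECT method and AVRO intermediate format as the
# README requires.
df_json = df.withMetadata("json_col", {"sqlType": "JSON"})
df_json.write \
    .format("bigquery") \
    .option("writeMethod", "indirect") \
    .option("intermediateFormat", "avro") \
    .option("temporaryGcsBucket", "some-bucket") \
    .mode("append") \
    .save("dataset.table")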

DateTime datatype in BigQuery

I have a partitioned table where one of the columns is of type DateTime and the table is partitioned on that same column. According to the spark-bigquery connector documentation, the corresponding Spark SQL type is String.
https://github.com/GoogleCloudDataproc/spark-bigquery-connector
I tried doing the same but I am getting a datatype mismatch issue.
Code Snippet:
ZonedDateTime nowPST = ZonedDateTime.ofInstant(Instant.now(), TimeZone.getTimeZone("PST").toZoneId());
df = df.withColumn("createdDate", lit(nowPST.toLocalDateTime().toString()));
Error:
Caused by: com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: Failed to load to <PROJECT_ID>:<DATASET_NAME>.<TABLE_NAME> in job JobId{project=<PROJECT_ID>, job=<JOB_ID>, location=US}. BigQuery error was Provided Schema does not match Table <PROJECT_ID>:<DATASET_NAME>.<TABLE_NAME>. Field createdDate has changed type from DATETIME to STRING
at com.google.cloud.spark.bigquery.BigQueryWriteHelper.loadDataToBigQuery(BigQueryWriteHelper.scala:156)
at com.google.cloud.spark.bigquery.BigQueryWriteHelper.writeDataFrameToBigQuery(BigQueryWriteHelper.scala:89)
... 36 more
As Spark has no support for DateTime, the BigQuery connector does not support writing DateTime - there is no equivalent Spark data type that can be used. We are exploring ways to augment the DataFrame's metadata in order to support the types which are supported by BigQuery and not by Spark (DateTime, Time, Geography).
At the moment please have this field as String, and have the conversion on the BigQuery side.
I am running into this issue now as well, with both Geography (https://community.databricks.com/s/question/0D58Y000099mPyDSAU/does-databricks-support-writing-geographygeometry-data-into-bigquery)
and DateTime types. The only way I could get the table from Databricks to BigQuery (without creating a temporary table and inserting the data, which would still be costly due to the size of the table) was to write the table out as CSV to a GCS bucket
results_df.write.format("csv").mode("overwrite").save("gs://<bucket-name>/ancillary_test")
And then load the data from the bucket into the BigQuery table, specifying the schema:
LOAD DATA INTO <dataset>.<tablename>(
PRICENODEID INTEGER,
ISONAME STRING,
PRICENODENAME STRING,
MARKETTYPE STRING,
GMTDATETIME TIMESTAMP,
TIMEZONE STRING,
LOCALDATETIME DATETIME,
ANCILLARY STRING,
PRICE FLOAT64,
CHANGE_DATE TIMESTAMP
)
FROM FILES (
format = 'CSV',
uris = ['gs://<bucket-name>/ancillary_test/*.csv']
);

How to query data from a PySpark SQL context if a key is not present in the JSON file, and how to catch the SQL analysis exception

I am using PySpark to transform JSON in a DataFrame, and I am successfully able to transform it. But the problem I am facing is that a key will be present in some JSON files and not in others. When I flatten the JSON with the PySpark SQL context and the key is not present in a file, it fails to create my PySpark data frame, throwing an SQL AnalysisException.
For example, my sample JSON:
{
"_id" : ObjectId("5eba227a0bce34b401e7899a"),
"origin" : "inbound",
"converse" : "72412952",
"Start" : "2020-04-20T06:12:20.89Z",
"End" : "2020-04-20T06:12:53.919Z",
"ConversationMos" : 4.88228940963745,
"ConversationRFactor" : 92.4383773803711,
"participantId" : "bbe4de4c-7b3e-49f1-8",
}
The participantId above will be available in some JSON files and not in others.
My PySpark code snippet:
fetchFile = spark.read.format(file_type)\
    .option("inferSchema", "true")\
    .option("header", "true")\
    .load(generated_FileLocation)
fetchFile.registerTempTable("CreateDataFrame")
tempData = sqlContext.sql("select origin,converse,start,end,participantId from CreateDataFrame")
When participantId is not present in a JSON file, an exception is thrown. How can I handle that kind of exception so that, if the key is not present, the column contains null? Or is there any other way to handle it?
You can simply check whether the column is there; if it is not, add it with empty values.
The code for that goes like this:
from pyspark.sql import functions as f
fetchFile = spark.read.format(file_type)\
    .option("inferSchema", "true")\
    .option("header", "true")\
    .load(generated_FileLocation)
if 'participantId' not in fetchFile.columns:
    fetchFile = fetchFile.withColumn('participantId', f.lit(''))
fetchFile.registerTempTable("CreateDataFrame")
tempData = sqlContext.sql("select origin,converse,start,end,participantId from CreateDataFrame")
I think you're calling Spark to read one file at a time and inferring the schema at the same time.
What Spark is telling you with the SQL analysis exception is that your file, and therefore your inferred schema, doesn't have the key you're looking for. What you have to do is get a good schema and apply it to all of the files you want to process, ideally processing all of your files at once.
There are three strategies:
1. Infer your schema from lots of files. You should get the aggregate of all of the keys. Spark will run two passes over the data.
df = spark.read.json('/path/to/your/directory/full/of/json/files')
schema = df.schema
print(schema)
2. Create a schema object. I find this tedious to do, but it will speed up your code. Here is a reference: https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.types.StructType
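As a sketch, such a schema object based on the sample document in the question might look like the following (field names and types are guessed from that one record and may need adjusting):
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Explicit schema guessed from the sample JSON above; when a file lacks a
# field such as participantId, that column simply comes back as null.
schema = StructType([
    StructField("origin", StringType(), True),
    StructField("converse", StringType(), True),
    StructField("Start", StringType(), True),
    StructField("End", StringType(), True),
    StructField("ConversationMos", DoubleType(), True),
    StructField("ConversationRFactor", DoubleType(), True),
    StructField("participantId", StringType(), True),
])

df = spark.read.schema(schema).json('/path/to/your/directory/full/of/json/files')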
3. Read the schema from a well-formed file, then use that to read your whole directory. Also, by printing the schema object, you can copy-paste it back into your code for option #2.
schema = spark.read.json('path/to/well/formed/file.json').schema
print(schema)
my_df = spark.read.schema(schema).json('path/to/entire/folder/full/of/json')

BigQuery exports NUMERIC data type as binary data type in AVRO

I am exporting data from a BigQuery table which has a column named prop12 defined as the NUMERIC data type. Please note that the destination format is AVRO and can't be changed.
bq extract --destination_format AVRO datasetName.myTableName /path/to/file-1-*.avro
When I read the Avro data using Spark, it is not able to convert this NUMERIC data type to Integer.
--prop12: binary (nullable = true)
cannot resolve 'CAST(`prop12` AS INT)' due to data type mismatch: cannot cast BinaryType to IntegerType
Is there any way I can specify that prop12 should be exported as Integer while doing bq extract?
OR
If that is not possible during bq export, am I left with the only option of reading the binary data in Spark?
Is there any way I can specify that prop12 should be exported as Integer while doing bq extract?
In the extract command you can't do it. You can create a new temporary table and then extract it:
bq query --nouse_legacy_sql '
CREATE TABLE `my_dataset.my_temp_table`
OPTIONS(
expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 10 MINUTE)
) AS
SELECT * REPLACE (CAST(prop12 AS INT64) AS prop12)
FROM `my_dataset.my_table`;
' && bq extract --destination_format AVRO my_dataset.my_temp_table /path/to/file-1-*.avro
Consider that this will generate additional cost.
If that is not possible during bq export, am I left with the only option of reading the binary data in Spark?
NUMERIC values in BigQuery are 16-byte decimals, so it could be possible to work with them as decimals. You can try converting them to decimal instead of casting to integer.
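Since Spark will not cast BinaryType straight to a numeric type, one possible workaround is a small UDF that decodes the Avro decimal encoding itself. This is only a sketch and assumes the bytes hold BigQuery's default NUMERIC encoding (a big-endian two's-complement unscaled integer with precision 38 and scale 9):
from decimal import Context, Decimal
from pyspark.sql import functions as F
from pyspark.sql.types import DecimalType

# Decode the Avro decimal bytes into a Decimal(38, 9) value.
@F.udf(returnType=DecimalType(38, 9))
def numeric_from_bytes(raw):
    if raw is None:
        return None
    unscaled = int.from_bytes(bytes(raw), byteorder="big", signed=True)
    return Decimal(unscaled).scaleb(-9, Context(prec=38))

df = df.withColumn("prop12", numeric_from_bytes(F.col("prop12")))
# If the values are known to be whole numbers, they can then be cast down:
# df = df.withColumn("prop12", F.col("prop12").cast("long"))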

Reading Avro file in Spark and extracting column values

I want to read an Avro file using Spark (I am using Spark 1.3.0, so I don't have data frames).
I read the Avro file using this piece of code:
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable
import org.apache.spark.SparkContext
private def readAvro(sparkContext: SparkContext, path: String) = {
  sparkContext.newAPIHadoopFile[
    AvroKey[GenericRecord],
    NullWritable,
    AvroKeyInputFormat[GenericRecord]
  ](path)
}
I execute this and get an RDD. Now, from the RDD, how do I extract the value of specific columns? For example, how do I loop through all records and get the value for a given column name?
[edit] As suggested by Justin below, I tried:
val rdd = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](input)
rdd.map(record=> record._1.get("accountId")).toArray().foreach(println)
but I get an error
<console>:34: error: value get is not a member of org.apache.avro.mapred.AvroKey[org.apache.avro.generic.GenericRecord]
rdd.map(record=> record._1.get("accountId")).toArray().foreach(println)
AvroKey has a datum method to extract the wrapped value, and GenericRecord has a get method that accepts the column name as a string. So you can just extract the columns using a map:
rdd.map(record=>record._1.datum.get("COLNAME"))
