How to convert a Spark DataFrame to nested JSON dynamically using Spark Scala - Azure

I want to convert the DataFrame to nested JSON. Source data:
The DataFrame has values like:
Expected output:
I need to convert the DataFrame values to nested JSON like:
Appreciate your help!

If you want to persist the data, save the DataFrame in JSON format:
df.write.json("path")
You can use the toJSON function, which converts the DataFrame to a Dataset[String]:
df.toJSON
If there is only one element, you can take it to get a single String:
df.toJSON.take(1).head
Thanks.

Related

Write each row of a dataframe to a separate json file in s3 with pyspark

In one of my projects, I need to write each row of a DataFrame into a separate S3 file in JSON format. In the actual implementation, map/foreach's input is a Row, but I can't find any member function on Row that transforms a row into JSON.
I'm using a Spark DataFrame and don't want to convert it to pandas (as that involves sending everything to the driver?), so I can't use the to_json function. Is there any other way to do it? I can definitely write my own JSON converter based on my specific DataFrame schema, but I'm wondering if there is a readily available module.
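One possible approach (just a sketch, not a tested solution): a Row can be turned into a plain dict with asDict(), and json.dumps can serialize it; writing the objects from the executors would need an S3 client such as boto3. The bucket name, key pattern, and the id column used below are all made-up assumptions:
import json
import boto3  # assumption: boto3 is available on the executors

def write_partition(rows):
    # One S3 object per row; the bucket, key pattern and "id" column are illustrative
    s3 = boto3.client("s3")
    for row in rows:
        payload = json.dumps(row.asDict(recursive=True), default=str)
        s3.put_object(Bucket="my-bucket", Key="rows/{}.json".format(row["id"]), Body=payload)

df.foreachPartition(write_partition)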

Prevent pyspark/spark from transforming timestamp when creating a dataframe from a parquet file

I am reading a parquet file into a dataframe. My goal is to verify that my time data (column type in parquet : timestamp) are ISO 8601.
The dates in time column look like this : 2021-03-13T05:34:27.100Z or 2021-03-13T05:34:27.100+0000
But when I read my dataframe, pyspark transforms 2021-03-13T05:34:27.100Z into 2021-03-13 05:34:27.100
I want to keep the original format, but I can't figure out how to stop pyspark from doing this. I tried to use a custom schema with string for dates but I get this error: Parquet column cannot be converted in file file.snappy.parquet. Column: [time], Expected: string, Found: INT96
I also tried using conf parameters, but that didn't work for me.
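If the goal is only to get the ISO 8601 text back out, one option (a sketch, assuming the column is named time and the session time zone is UTC) is to let Spark parse the timestamp and re-render it with date_format:
from pyspark.sql import functions as F

spark.conf.set("spark.sql.session.timeZone", "UTC")  # so the rendered offset is UTC
df = spark.read.parquet("path/to/file.parquet")      # path is illustrative

# Re-render the parsed timestamp as an ISO 8601 string; "time" is the column from the question
df = df.withColumn("time_iso", F.date_format("time", "yyyy-MM-dd'T'HH:mm:ss.SSSXXX"))
df.select("time", "time_iso").show(truncate=False)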

Iterating a complex dataframe with an array of StructField

I have data in one of my DataFrame's columns with the following schema:
<type 'list'>: [StructField(data,StructType(List(StructField(account,StructType(List(StructField(Id,StringType,true),StructField(Name,StringType,true),StructField(books,ArrayType(StructType(List(StructField(bookTile,StringType,true),StructField(bookId,StringType,true),StructField(bookName,StringType,true))),true),true)))))))]
I want to iterate over it, extract each value, and create a new DataFrame. Are there any built-in functions in PySpark that support this, or should I iterate manually? Is there an efficient way?
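A sketch of one common approach, assuming the schema above (a data struct with an account struct holding Id, Name and an array books): select the nested fields directly and explode the array instead of iterating row by row:
from pyspark.sql import functions as F

# Flatten the nested struct: pull out the scalar account fields and explode the books array.
# Field names, including the original "bookTile" spelling, are taken from the schema above.
flat = (
    df.select(
        F.col("data.account.Id").alias("Id"),
        F.col("data.account.Name").alias("Name"),
        F.explode("data.account.books").alias("book"),
    )
    .select("Id", "Name", "book.bookTile", "book.bookId", "book.bookName")
)
flat.show(truncate=False)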

Is there a way to get the column data type in pyspark?

It has been discussed that the way to find a column's datatype in PySpark is to use df.dtypes ("get datatype of column using pyspark"). The problem with this is that for datatypes like an array or a struct you get something like array<string> or array<integer>.
Question: Is there a native way to get the PySpark data type, like ArrayType(StringType,true)?
Just use schema:
df.schema[column_name].dataType
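For example (a quick illustrative check; the DataFrame and column names are made up):
df = spark.createDataFrame([(1, ["a", "b"])], ["id", "tags"])

print(df.schema["tags"].dataType)              # e.g. ArrayType(StringType, True)
print(df.schema["tags"].dataType.elementType)  # StringType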

How to add an incremental ID column to a table in Spark SQL

I'm working on a Spark MLlib algorithm. The dataset I have is in this form:
Company":"XXXX","CurrentTitle":"XYZ","Edu_Title":"ABC","Exp_mnth":.(there are more values similar to these)
I'm trying to encode the String values as numeric values, so I tried using zipWithUniqueId to get a unique value for each of the string values. For some reason I'm not able to save the modified dataset to disk. Can I do this in any way using Spark SQL, or what would be a better approach?
Scala
import org.apache.spark.sql.functions.monotonically_increasing_id
val dataFrame1 = dataFrame0.withColumn("index",monotonically_increasing_id())
Java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;
Dataset<Row> dataFrame1 = dataFrame0.withColumn("index", functions.monotonically_increasing_id());
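For reference, the same function is available in PySpark; note that monotonically_increasing_id() produces unique, increasing IDs but not necessarily consecutive ones. A minimal sketch (the DataFrame names are illustrative):
Python
from pyspark.sql import functions as F
dataFrame1 = dataFrame0.withColumn("index", F.monotonically_increasing_id())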
