Extracting value from nested array and struct in Spark - apache-spark

I have a schema like the one below. What is the best way in Spark to select the elements seat and drive and cast them to strings? I am reading this into a dataframe with Spark 1.6.
|-- cars: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- carId: string (nullable = true)
| | |-- carCode: string (nullable = true)
| | |-- carNumber: string (nullable = true)
| | |-- features: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- seat: string (nullable = true)
| | | | |-- drive: string (nullable = true)
The output of cars.features as cars_features, in JSON:
"cars_features":[[{"seat":"Auto","drive":"Manual"}]]
I am trying to select "Auto" into one dataframe column and "Manual" into another.
My current attempt returns the whole structure instead:
+-------------------+
|car_features |
+-------------------+
| [[Auto,Manual]] |
+-------------------+
col("car.features").getItem(0).as("car_features_seat")

I had to drill into the array twice:
col("cars.features").getItem(0).getItem(0).getItem("seat").cast("String").as("car_features_seat")
This extracts "Auto".
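The same pattern extracts both fields in one select. A minimal PySpark sketch, assuming the dataframe is named df and follows the schema above:

from pyspark.sql import functions as F

# cars.features is an array of arrays of structs, so it has to be
# indexed twice (first car, first feature) before the fields are reachable.
feature = F.col("cars.features").getItem(0).getItem(0)
result = df.select(
    feature.getItem("seat").cast("string").alias("car_features_seat"),
    feature.getItem("drive").cast("string").alias("car_features_drive"),
)
result.show()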

Related

Extract Nested JSON columns using Explode function

I have a data structure like the one below:
|-- data: struct (nullable = true)
| |-- event_Id: long (nullable = true)
| |-- data_nested: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- item1: string (nullable = true)
| | | |-- item2: string (nullable = true)
| | | |-- item3: string (nullable = true)
| | | | |-- item3_item1: long (nullable = true)
| |-- other_elemets: array (nullable = true)
I want to extract data like this:
+-----+-------+-----+
|item1|item2  |item3|
+-----+-------+-----+
|A    |Android|null |
|B    |Android|null |
|C    |iOS    |null |
|D    |iOS    |null |
+-----+-------+-----+
So far I have got the data using:
from pyspark.sql.functions import col, explode

df = spark.read.json(['s3://<data_location>']).select("data.data_nested").persist()
df.select(explode(col('data_nested')).alias('item')).select('item.item1')
which gives the item1 column. Now I want to know how to use explode to get item2 and item3 as efficiently as possible.
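One way to avoid exploding per field (a sketch, assuming df is the dataframe built above) is to explode the array once and project all three items from the resulting struct:

from pyspark.sql.functions import col, explode

# Explode once; each element of data_nested becomes a row holding a struct.
exploded = df.select(explode(col('data_nested')).alias('rec'))
result = exploded.select(
    col('rec.item1').alias('item1'),
    col('rec.item2').alias('item2'),
    col('rec.item3').alias('item3'),
)
result.show()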

Reading nested Json structure in PySpark

I am new to PySpark. I am trying to read the values of one of the nested columns in my JSON data. Here is my JSON structure:
|-- _index: string (nullable = true)
|-- _score: string (nullable = true)
|-- _source: struct (nullable = true)
| |-- layers: struct (nullable = true)
| | |-- R1.TEST6: struct (nullable = true)
| | | |-- R1.TEST1: struct (nullable = true)
| | | | |-- R1.TEST1.idx: string (nullable = true)
| | | | |-- R1.TEST1.ide: string (nullable = true)
| | | |-- R1.TEST3: struct (nullable = true)
| | | | |-- R1.TEST3.PDU: string (nullable = true)
| | | | |-- R1.TEST3.pdu: string (nullable = true)
| | | | |-- R1.TEST4: struct (nullable = true)
| | | | | |-- R1.TEST2: struct (nullable = true)
| | | | | | |-- R1.TEST2.agg: string (nullable = true)
| | | | | | |-- R1.TEST2.size: string (nullable = true)
| | | | | | |-- R1.TEST2.start: string (nullable = true)
| | | | | | |-- R1.TEST2.beam: string (nullable = true)
| | | | | | |-- R1.TEST2.startIndex: string (nullable = true)
| | | | | | |-- R1.TEST2.regType: string (nullable = true)
| | | | | | |-- R1.TEST2.coreSetType: string (nullable = true)
| | | | | | |-- R1.TEST2.cpType: string (nullable = true)
| | | | | | |-- R1.TEST2.column1: string (nullable = true)
| | | | | | |-- R1.TEST2.column3: string (nullable = true)
As mentioned in https://stackoverflow.com/questions/57811415/reading-a-nested-json-file-in-pyspark, I tried the following:
df2 = df.select(F.array(F.expr("_source.*")).alias("Source"))
Now my requirement is to access the values under the R1.TEST6 tag, but the code below is not working:
df2.withColumn("source_data", F.explode(F.arrays_zip("Source"))).select("source_data.Source.R1.TEST6.R1.TEST1.idx").show()
Can someone please help me access all the fields of this nested JSON under _source.R1.TEST6 and create a table from them, given the multiple levels of nesting? How should explode be used across this many levels?
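One likely culprit (an assumption worth verifying) is that the field names themselves contain literal dots, e.g. R1.TEST6, so Spark parses each dot as a struct separator unless the name segment is quoted with backticks. A sketch against the original df:

from pyspark.sql import functions as F

# Backtick-quote each dotted name segment so Spark does not split on the inner dots.
df.select(
    F.col("_source.layers.`R1.TEST6`.`R1.TEST1`.`R1.TEST1.idx`").alias("idx")
).show()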

PySpark - get value of array type from dataframe

My data frame is shown below. I need to extract the values from the input array-type column. How can I achieve this in PySpark?
root
|-- input: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: map (valueContainsNull = true)
| | | |-- key: string
| | | |-- value: double (valueContainsNull = true)
|-- A: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: map (valueContainsNull = true)
| | | |-- key: string
| | | |-- value: double (valueContainsNull = true)
|-- B: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: map (valueContainsNull = true)
| | | |-- key: string
| | | |-- value: double (valueContainsNull = true)
|-- C: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: map (valueContainsNull = true)
| | | |-- key: string
| | | |-- value: double (valueContainsNull = true)
|-- D: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: map (valueContainsNull = true)
| | | |-- key: string
| | | |-- value: double (valueContainsNull = true)
|-- E: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: map (valueContainsNull = true)
| | | |-- key: string
| | | |-- value: double (valueContainsNull = true)
|-- timestamp: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: map (valueContainsNull = true)
| | | |-- key: string
| | | |-- value: double (valueContainsNull = true)
from itertools import chain

# My attempt: flatten the arrays of maps on the RDD side and collect the inner values.
df.select('input').rdd.flatMap(lambda x: chain(*(x))).map(lambda x: x.values()).collect()
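A DataFrame-only alternative (a sketch, assuming Spark 2.3+ for map_values) is to explode the array and then unwrap the two map layers:

from pyspark.sql import functions as F

# One outer map per row after exploding the array column.
maps = df.select(F.explode('input').alias('outer'))
# Unwrap the outer map's values (inner maps), then the inner doubles.
inner = maps.select(F.explode(F.map_values('outer')).alias('inner'))
values = inner.select(F.explode(F.map_values('inner')).alias('value'))
values.show()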

Spark for Json Data

I am processing a complex nested JSON, and below is its schema.
root
|-- businessEntity: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- payGroup: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- reportingPeriod: struct (nullable = true)
| | | | | |-- worker: array (nullable = true)
| | | | | | |-- element: struct (containsNull = true)
| | | | | | | |-- category: string (nullable = true)
| | | | | | | |-- person: struct (nullable = true)
| | | | | | | |-- tax: array (nullable = true)
| | | | | | | | |-- element: struct (containsNull = true)
| | | | | | | | | |-- code: string (nullable = true)
| | | | | | | | | |-- qtdAmount: double (nullable = true)
| | | | | | | | | |-- ytdAmount: double (nullable = true)
My requirement is to create a hashmap with code concatenated with "qtdAmount" as the key and the value of qtdAmount as the value, i.e. Map.put(code + "qtdAmount", qtdAmount). How can I do this with Spark?
I tried the shell commands below.
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val sqlcontext = new SQLContext(sc)
val cdm = sqlcontext.read.json("/user/edureka/CDM/cdm.json")
val spark = SparkSession.builder().appName("SQL").config("spark.some.config.option", "some-value").getOrCreate()
cdm.createOrReplaceTempView("CDM")
val sqlDF = spark.sql("SELECT businessEntity[0].payGroup[0] from CDM").show()
val address = spark.sql("SELECT businessEntity[0].payGroup[0].reportingPeriod.worker[0].person.address from CDM as address")
val worker = spark.sql("SELECT businessEntity[0].payGroup[0].reportingPeriod.worker from CDM")
val tax = spark.sql("SELECT businessEntity[0].payGroup[0].reportingPeriod.worker[0].tax from CDM")
val tax = sqlcontext.sql("SELECT businessEntity[0].payGroup[0].reportingPeriod.worker[0].tax from CDM")
tax.select("tax.code")
val codes = tax.select(explode(tax("code")))
val codes = tax.withColumn("code", explode(tax("tax.code"))).withColumn("qtdAmount", explode(tax("tax.qtdAmount"))).withColumn("ytdAmount", explode(tax("tax.ytdAmount")))
I am trying to get all the codes and qtdAmounts into a map, but I am not getting it. Using multiple explode statements on a single DataFrame produces a Cartesian product of the elements.
Could someone please help with how to parse a JSON this complex in Spark?
You can get code and qtdAmount in this way.
import sqlcontext.implicits._
cdm.select(
$"businessEntity.element.payGroup.element.reportingPeriod.worker.element.tax.element.code".as("code"),
$"businessEntity.element.payGroup.element.reportingPeriod.worker.element.tax.element.qtdAmount".as("qtdAmount")
).show
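If the end goal is the literal code + "qtdAmount" -> qtdAmount map, chained explodes (one per select) avoid the Cartesian product that multiple explodes in a single step produce. A PySpark sketch of that flattening, assuming cdm was read with spark.read.json; the intermediate names here are illustrative:

from pyspark.sql import functions as F

# Flatten one array level per select to keep the explodes independent.
taxes = (cdm
    .select(F.explode('businessEntity').alias('be'))
    .select(F.explode('be.payGroup').alias('pg'))
    .select(F.explode('pg.reportingPeriod.worker').alias('w'))
    .select(F.explode('w.tax').alias('t')))
# Build the requested code+"qtdAmount" -> qtdAmount map per tax entry.
result = taxes.select(
    F.create_map(
        F.concat(F.col('t.code'), F.lit('qtdAmount')),
        F.col('t.qtdAmount'),
    ).alias('code_to_qtd')
)
result.show(truncate=False)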

Spark-SQL : Access array elements stored within a cell in a data frame

root
|
|-- dogs: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: struct (nullable = true)
| | | |-- name: string (nullable = true)
| | | |-- color: string (nullable = true)
| | | |-- sources: array (nullable = true)
| | | | |-- element: string (containsNull = true)
| | |-- _2: age (nullable = true)
which shows the following with data.select("dogs").show(2, False):
+---------------------------------------------------------------------------------+
|dogs                                                                             |
+---------------------------------------------------------------------------------+
|[[[Max,White,WrappedArray(SanDiego)],3], [[Spot,Black,WrappedArray(SanDiego)],2]]|
|[[[Michael,Black,WrappedArray(SanJose)],1]] |
+---------------------------------------------------------------------------------+
only showing top 2 rows
I am wondering if it is possible to access the array elements in each cell. For example, I want to retrieve (Max, White), (Spot, Black), and (Michael, Black) from the dogs column.
In addition, I would like to expand a row with n elements into n rows, if possible.
Thanks!
You can use explode as below to get access to a dataframe with each row being a record from the array.
data.registerTempTable("data")
dataExplode = sqlContext.sql("select explode(dogs) as dog from data")
dataExplode.show()
Then, you can use select to obtain just the columns you are interested in.
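For example (a sketch based on the schema above, where the name/color struct is the _1 field of each array element):

# Project just the fields of interest from the exploded struct column.
dogs = dataExplode.select("dog._1.name", "dog._1.color")
dogs.show()
# Each input row with n array elements now appears as n rows.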
