PySpark Schema structure to read nested data

I am getting the following error:
ValueError: field col4: Length of object (1) does not match with length of fields (2)
The data is in this format:
[
["N","S","3",null,null],
["N","P","4",[{"key1":"val1","key2":"val2"}],null],
["N","I","5",null,[{"key1":"val1","key2":"val2"}]],
["N","S","3",null,null]
]
The schema I have defined is the following:
schema = StructType(
    StructField("col1", StringType(), True),
    StructField("col2", StringType(), True),
    StructField("col3", StringType(), True),
    StructField("col4",
        StructType(
            StructField("key1", StringType(), True),
            StructField("key2", StringType(), True)
        )
    ),
    StructField("col5",
        StructType(
            StructField("key1", StringType(), True),
            StructField("key2", StringType(), True)
        )
    )
)
Please help me identify how I can read data in this format.

Welcome to the Stack Overflow community.
Coming to your question: first you need to replace null with None, as null is not a keyword in either Python or PySpark (unless you are using Spark SQL).
Now regarding your schema: you need to use ArrayType wherever a column holds a list or other complex structure, and inside it a StructType, because each list contains a dictionary of key-value pairs.
See the structure below to visualize it better:
data = [["N","S","3",None,None], ["N","P","4",[{"key1":"val1","key2":"val2"}],None], ["N","I","5",None,[{"key1":"val1","key2":"val2"}]], ["N","S","3",None, None] ]
You need to convert this to an RDD as below:
data_rdd = sc.parallelize(data)
Once your RDD is created, create your dataframe using the schema explained above:
from pyspark.sql.types import *
schema = StructType([
    StructField("col1", StringType(), True),
    StructField("col2", StringType(), True),
    StructField("col3", StringType(), True),
    StructField("col4",
        ArrayType(
            StructType([StructField("key1", StringType(), True),
                        StructField("key2", StringType(), True)])
        )
    ),
    StructField("col5",
        ArrayType(
            StructType([StructField("key1", StringType(), True),
                        StructField("key2", StringType(), True)])
        )
    )
])
df = spark.createDataFrame(data=data_rdd, schema=schema)
Output
df.show()
+----+----+----+--------------+--------------+
|col1|col2|col3| col4| col5|
+----+----+----+--------------+--------------+
| N| S| 3| null| null|
| N| P| 4|[{val1, val2}]| null|
| N| I| 5| null|[{val1, val2}]|
| N| S| 3| null| null|
+----+----+----+--------------+--------------+
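With this schema you can also pull individual nested values out of col4 and col5 directly, without exploding, by indexing into the array and then into the struct. A minimal sketch (the alias names col4_key1/col5_key2 are just illustrative):
from pyspark.sql.functions import col

# col4/col5 are array<struct<key1,key2>>: take the first array element, then the struct field
df.select(
    "col1",
    col("col4").getItem(0).getField("key1").alias("col4_key1"),
    col("col5").getItem(0).getField("key2").alias("col5_key2")
).show()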
Additionally, if you need the key and value as separate columns for both col4 and col5, you can instead define the schema with MapType as below:
schema = StructType([
    StructField("col1", StringType(), True),
    StructField("col2", StringType(), True),
    StructField("col3", StringType(), True),
    StructField("col4",
        ArrayType(
            MapType(StringType(), StringType())
        )
    ),
    StructField("col5",
        ArrayType(
            MapType(StringType(), StringType())
        )
    )
])
from pyspark.sql.functions import *
df = spark.createDataFrame(data=sc.parallelize(data), schema=schema)
df.show(truncate=False)
#Input dataframe output -
+----+----+----+------------------------------+------------------------------+
|col1|col2|col3|col4 |col5 |
+----+----+----+------------------------------+------------------------------+
|N |S |3 |null |null |
|N |P |4 |[{key1 -> val1, key2 -> val2}]|null |
|N |I |5 |null |[{key1 -> val1, key2 -> val2}]|
|N |S |3 |null |null |
+----+----+----+------------------------------+------------------------------+
Finally, explode columns col4 and col5 as below:
(df.withColumn('explode_col4', explode_outer(col('col4')))
   .withColumn('explode_col5', explode_outer(col('col5')))
   .select("col1", "col2", "col3",
           explode_outer(col('explode_col4')).alias('col4_key', 'col4_value'),
           "explode_col5")
   .select("col1", "col2", "col3", "col4_key", "col4_value",
           explode_outer(col('explode_col5')).alias('col5_key', 'col5_value'))
).show(truncate=False)
Output
+----+----+----+--------+----------+--------+----------+
|col1|col2|col3|col4_key|col4_value|col5_key|col5_value|
+----+----+----+--------+----------+--------+----------+
|N |S |3 |null |null |null |null |
|N |P |4 |key1 |val1 |null |null |
|N |P |4 |key2 |val2 |null |null |
|N |I |5 |null |null |key1 |val1 |
|N |I |5 |null |null |key2 |val2 |
|N |S |3 |null |null |null |null |
+----+----+----+--------+----------+--------+----------+

Related

How to create a Spark dataframe from one of the columns in an existing dataframe

Requirements:
I want to create a dataframe out of one column of an existing dataframe. That column's value is a list of JSON objects.
Problem:
Since the JSON does not have a fixed schema, I wasn't able to use the from_json function, since it needs a schema up front to parse the column.
Example
| Column A | Column B |
| 1 | [{"id":"123","phone":"124"}] |
| 3 | [{"id":"456","phone":"741"}] |
Expected output:
| id | phone|
| 123 | 124 |
| 456 | 741 |
Any thoughts on this?
Try using Spark SQL to explode the "Column B" array:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ArrayType

spark = SparkSession.builder.appName("Test_app").getOrCreate()

input_data = [
    (1, [{"id":"123","phone":"124"}]),
    (3, [{"id":"456","phone":"741"}])
]

schema = StructType([
    StructField("Column A", IntegerType(), True),
    StructField("Column B", ArrayType(StructType([
        StructField("id", StringType(), True),
        StructField("phone", StringType(), True)
    ])), True)
])

df = spark.createDataFrame(input_data, schema)

# Backticks are needed in the SQL expressions because the column names contain spaces
df_exploded = df.selectExpr("`Column A`", "explode(`Column B`) as e") \
    .select("e.id", "e.phone")

df_exploded.show()
Output is below:
+---+-----+
| id|phone|
+---+-----+
|123| 124|
|456| 741|
+---+-----+
Convert it into an RDD and then read it as JSON. For testing, I have removed the id element in the second row.
input_data = [
    (1, [{"id":"123","phone":"124"}]),
    (3, [{"phone":"741"}])
]

df = spark.createDataFrame(input_data, ["ColA", "ColB"])
spark.read.json(df.rdd.map(lambda r: r.ColB)).show()
+----+-----+
| id|phone|
+----+-----+
| 123| 124|
|null| 741|
+----+-----+
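If Column B is stored as a JSON string (as in the original table) rather than as a Python list, the same approach should work, since spark.read.json accepts an RDD of JSON strings and expands a top-level array into one row per object. A minimal sketch with assumed string data:
input_data_str = [
    (1, '[{"id":"123","phone":"124"}]'),
    (3, '[{"phone":"741"}]')
]
df_str = spark.createDataFrame(input_data_str, ["ColA", "ColB"])

# Each RDD element is a JSON array string; every object in it becomes a row
spark.read.json(df_str.rdd.map(lambda r: r.ColB)).show()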

Spark DateType column returning null

I would like to read in a file with the following structure with Apache Spark:
192 242 3 881250949 (the columns are tab-separated)
From IMDb I saw the note below regarding the date column: (unix seconds since 1/1/1970 UTC). Not sure if this has anything to do with my issue.
This is how I defined the schema:
from pyspark.sql.types import *
path_tocsv="dbfs:/tmp/data.csv"
schema = (StructType([
StructField("user_id", IntegerType(), True),
StructField("movie_id", IntegerType(), True),
StructField("rating", IntegerType(), True),
StructField("date", DateType(), True)]))
DataDF =spark.read.csv(path_tocsv, header=False,dateFormat='yyyy-MM-dd',
schema=schema,sep='\t')
but I am getting null for dates:
+-------+--------+------+----+
|user_id|movie_id|rating|date|
+-------+--------+------+----+
| 196| 242| 3|null|
| 186| 302| 3|null|
| 22| 377| 1|null|
| 244| 51| 2|null|
| 166| 346| 1|null|
| 298| 474| 4|null|
| 115| 265| 2|null|
| 253| 465| 5|null|
| 305| 451| 3|null|
| 6| 86| 3|null|
+-------+--------+------+----+
Any suggestions?
You need to change your DateType column to LongType, since the values are unix epoch seconds rather than formatted date strings. See below:
from pyspark.sql.types import *
path_tocsv="dbfs:/tmp/data.csv"
schema = (StructType([
StructField("user_id", IntegerType(), True),
StructField("movie_id", IntegerType(), True),
StructField("rating", IntegerType(), True),
StructField("date", LongType(), True)]))
DataDF =spark.read.csv(path_tocsv, header=False,dateFormat='yyyy-MM-dd',
schema=schema,sep='\t')
Once your dataframe is created, you can convert the date column as below:
from pyspark.sql.functions import *
df = DataDF.withColumn("date_modified", to_date(from_unixtime(col("date"))))  # .drop("date") to remove the original column
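from_unixtime turns the epoch seconds into a timestamp string, and to_date then truncates it to a calendar date. A quick sanity check of the resulting types (just a sketch):
# "date" should still be a long, while "date_modified" is a proper date column
df.select("date", "date_modified").printSchema()
df.select("date", "date_modified").show(5)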

Transform a list into a dataframe (same row, different columns) in PySpark

I got one list from a dataframe's column:
list_recs = [row[0] for row in df_recs.select("name").collect()]
The list looks like this:
Out[243]: ['COL-4560', 'D65-2242', 'D18-4751', 'D68-3303']
I want to transform it into a new dataframe, with each value in a different column. I tried doing this:
from pyspark.sql import Row
rdd = sc.parallelize(list_recs)
recs = rdd.map(lambda x: Row(SKU=str(x[0]), REC_01=str(x[1]), REC_02=str(x[2]), REC_03=str(x[3])))#, REC_04=str(x[4]), REC_0=str(x[5])))
schemaRecs = sqlContext.createDataFrame(recs)
But the outcome I'm getting is:
+---+------+------+------+
|SKU|REC_01|REC_02|REC_03|
+---+------+------+------+
| C| O| L| -|
| D| 6| 5| -|
| D| 1| 8| -|
| D| 6| 8| -|
+---+------+------+------+
What I wanted:
+----------+-------------+-------------+-------------+
|SKU |REC_01 |REC_02 |REC_03 |
+----------+-------------+-------------+-------------+
| COL-4560| D65-2242| D18-4751| D68-3303|
+----------+-------------+-------------+-------------+
I've also tried spark.createDataFrame(list_recs, StringType()) but got all the items in the same column.
Thank you in advance.
Define a schema and use spark.createDataFrame(). Your rdd.map produced one Row per list element, and because each element is a string, x[0], x[1], ... picked out individual characters. Instead, wrap the whole list as a single row:
list_recs = ['COL-4560', 'D65-2242', 'D18-4751', 'D68-3303']
from pyspark.sql.functions import *
from pyspark.sql.types import *
schema = StructType([StructField("SKU", StringType(), True),
                     StructField("REC_01", StringType(), True),
                     StructField("REC_02", StringType(), True),
                     StructField("REC_03", StringType(), True)])
spark.createDataFrame([list_recs], schema).show()
#+--------+--------+--------+--------+
#| SKU| REC_01| REC_02| REC_03|
#+--------+--------+--------+--------+
#|COL-4560|D65-2242|D18-4751|D68-3303|
#+--------+--------+--------+--------+
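If the number of recommendations can vary, a small variant (just a sketch; the naming scheme is made up) is to derive the column names from the list length and pass them directly as the schema:
# Generates SKU, REC_01, REC_02, ... based on how many values the list holds
cols = ["SKU"] + ["REC_%02d" % i for i in range(1, len(list_recs))]
spark.createDataFrame([list_recs], cols).show()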

How to extract a value from JSON when doing a PySpark query

This is how the table looks, which I extract using the following command:
query="""
select
distinct
userid,
region,
json_data
from mytable
where
operation = 'myvalue'
"""
table=spark.sql(query)
Now, I wish to extract only the value of msg_id from the json_data column (which is a string column), with msg_id as its own column in the output.
How should I change the query in the above code to extract msg_id from json_data?
Note:
The JSON format is not fixed (i.e., it may contain other fields), but the value I want to extract is always under msg_id.
I want to do this during retrieval for efficiency reasons, though I could retrieve json_data and parse it afterwards.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("a", StringType(), True),
    StructField("b", StringType(), True),
    StructField("json", StringType(), True)
])

data = [("a", "b", '{"msg_id":"123","msg":"test"}'),
        ("c", "d", '{"msg_id":"456","column1":"test"}')]
df = spark.createDataFrame(data, schema)

# Infer the JSON schema from the data itself, then parse the string column
json_schema = spark.read.json(df.rdd.map(lambda row: row.json)).schema
df2 = df.withColumn('parsed', from_json(col('json'), json_schema))

df2.createOrReplaceTempView("test")
spark.sql("select a, b, parsed.msg_id from test").show()
Output
+---+---+------+
| a| b|msg_id|
+---+---+------+
| a| b| 123|
| c| d| 456|
+---+---+------+
Instead of reading the data to infer the schema, you can specify the schema yourself using StructType/StructField syntax, the DDL struct<...> syntax, or schema_of_json, as shown below:
df.show()  # sample dataframe
#+------+------+-----------------------------------------+
#|userid|region|json_data |
#+------+------+-----------------------------------------+
#|1 |US |{"msg_id":123} |
#|2 |US |{"msg_id":123} |
#|3 |US |{"msg_id":123} |
#|4 |US |{"msg_id":123,"is_ads":true,"location":2}|
#|5 |US |{"msg_id":456} |
#+------+------+-----------------------------------------+
from pyspark.sql import functions as F
from pyspark.sql.types import *

# Option 1: StructType/StructField
schema = StructType([StructField("msg_id", LongType(), True),
                     StructField("is_ads", BooleanType(), True),
                     StructField("location", LongType(), True)])
# Option 2: DDL string
schema = 'struct<is_ads:boolean,location:bigint,msg_id:bigint>'
# Option 3: derive it from a sample JSON value
schema = df.select(F.schema_of_json("""{"msg_id":123,"is_ads":true,"location":2}""")).collect()[0][0]

df.withColumn("json_data", F.from_json("json_data", schema))\
  .select("userid", "region", "json_data.msg_id").show()
#+------+------+------+
#|userid|region|msg_id|
#+------+------+------+
#| 1| US| 123|
#| 2| US| 123|
#| 3| US| 123|
#| 4| US| 123|
#| 5| US| 456|
#+------+------+------+
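Since the question asks how to change the SQL query itself, it's worth noting that Spark SQL also has get_json_object, which pulls out a single field by JSON path without declaring any schema (the value comes back as a string). A minimal sketch reusing the table and column names from the question:
result = spark.sql("""
    select distinct
        userid,
        region,
        get_json_object(json_data, '$.msg_id') as msg_id
    from mytable
    where operation = 'myvalue'
""")
result.show()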

Spark doesn't read columns with null values in first row

Below is the content of my csv file:
A1,B1,C1
A2,B2,C2,D1
A3,B3,C3,D2,E1
A4,B4,C4,D3
A5,B5,C5,,E2
So, there are 5 columns but only 3 values in the first row.
I read it using the following command:
val csvDF: DataFrame = spark.read
  .option("header", "false")
  .option("delimiter", ",")
  .option("inferSchema", "false")
  .csv("file.csv")
And the following is what I get using csvDF.show():
+---+---+---+
|_c0|_c1|_c2|
+---+---+---+
| A1| B1| C1|
| A2| B2| C2|
| A3| B3| C3|
| A4| B4| C4|
| A5| B5| C5|
+---+---+---+
How can I read all the data in all the columns?
Basically your csv file isn't properly formatted, in the sense that it doesn't have an equal number of columns in each row, which is required if you want to read it with spark.read.csv. However, you can instead read it with spark.read.textFile and then parse each row yourself.
As I understand it, you do not know the number of columns beforehand, so you want your code to handle an arbitrary number of columns. To do this you need to establish the maximum number of columns, which requires two passes over the data set.
For this particular problem, I would actually go with RDDs instead of DataFrames or Datasets, like this:
val data = spark.read.textFile("file.csv").rdd

// Pair each line with its column count; cache because the RDD is traversed twice
val rdd = data.map(s => (s, s.split(",").length)).cache

// First pass: find the maximum number of columns in the data set
val maxColumns = rdd.map(_._2).max()

// Second pass: pad every row with nulls up to maxColumns and build Rows
val x = rdd
  .map(row => {
    val rowData = row._1.split(",")
    val extraColumns = Array.ofDim[String](maxColumns - rowData.length)
    Row((rowData ++ extraColumns).toList:_*)
  })
Hope that helps :)
You can read it as a dataset with only one column (for example by using another delimiter):
var df = spark.read.format("csv").option("delimiter",";").load("test.csv")
df.show()
+--------------+
| _c0|
+--------------+
| A1,B1,C1|
| A2,B2,C2,D1|
|A3,B3,C3,D2,E1|
| A4,B4,C4,D3|
| A5,B5,C5,,E2|
+--------------+
Then you can manually split your column into five; this will add null values when an element does not exist:
var csvDF = df.withColumn("_tmp", split($"_c0", ",")).select(
  $"_tmp".getItem(0).as("col1"),
  $"_tmp".getItem(1).as("col2"),
  $"_tmp".getItem(2).as("col3"),
  $"_tmp".getItem(3).as("col4"),
  $"_tmp".getItem(4).as("col5")
)
csvDF.show()
+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
| A1| B1| C1|null|null|
| A2| B2| C2| D1|null|
| A3| B3| C3| D2| E1|
| A4| B4| C4| D3|null|
| A5| B5| C5| | E2|
+----+----+----+----+----+
If the column data types and the number of columns are known, you can define a schema and apply it while reading the csv file as a dataframe. Below I have defined all five columns as StringType:
val schema = StructType(Seq(
  StructField("col1", StringType, true),
  StructField("col2", StringType, true),
  StructField("col3", StringType, true),
  StructField("col4", StringType, true),
  StructField("col5", StringType, true)))

val csvDF: DataFrame = sqlContext.read
  .option("header", "false")
  .option("delimiter", ",")
  .option("inferSchema", "false")
  .schema(schema)
  .csv("file.csv")
You should get a dataframe like this:
+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
|A1 |B1 |C1 |null|null|
|A2 |B2 |C2 |D1 |null|
|A3 |B3 |C3 |D2 |E1 |
|A4 |B4 |C4 |D3 |null|
|A5 |B5 |C5 |null|E2 |
+----+----+----+----+----+
