Related
I am reading xml using databricks spark xml with below schema. the subelement X_PAT can occur more than one time, to handle
this I have used arraytype(structtype),ne xt transformation is to create multiple columns out of this single column.
<root_tag>
<id>fff9</id>
<X1000>
<X_PAT>
<X_PAT01>IC</X_PAT01>
<X_PAT02>EDISUPPORT</X_PAT02>
<X_PAT03>TE</X_PAT03>
</X_PAT>
<X_PAT>
<X_PAT01>IC1</X_PAT01>
<X_PAT02>EDISUPPORT1</X_PAT02>
<X_PAT03>TE1</X_PAT03>
</X_PAT>
</X1000>
</root_tag>
from pyspark.sql import SparkSession
from pyspark.sql.types import *
jar_path = "/Users/nsrinivas/com.databricks_spark-xml_2.10-0.4.1.jar"
spark = SparkSession.builder.appName("Spark - XML read").master("local[*]") \
.config("spark.jars", jar_path) \
.config("spark.executor.extraClassPath", jar_path) \
.config("spark.executor.extraLibrary", jar_path) \
.config("spark.driver.extraClassPath", jar_path) \
.getOrCreate()
xml_schema = StructType()
xml_schema.add("id", StringType(), True)
x1000 = StructType([
StructField("X_PAT",
ArrayType(StructType([
StructField("X_PAT01", StringType()),
StructField("X_PAT02", StringType()),
StructField("X_PAT03", StringType())]))),
])
xml_schema.add("X1000", x1000, True)
df = spark.read.format("xml").option("rowTag", "root_tag").option("valueTag", False) \
.load("root_tag.xml", schema=xml_schema)
df.select("id", "X1000.X_PAT").show(truncate=False)
I get the output as below:
+------------+--------------------------------------------+
|id |X_PAT |
+------------+--------------------------------------------+
|fff9 |[[IC1, SUPPORT1, TE1], [IC2, SUPPORT2, TE2]]|
+------------+--------------------------------------------+
but I want the X_PAT to be flatten and create multiple columns like below then I will rename the colums.
+-----+-------+-------+-------+-------+-------+-------+
|id |X_PAT01|X_PAT02|X_PAT03|X_PAT01|X_PAT02|X_PAT03|
+-----+-------+-------+-------+-------+-------+-------+
|fff9 |IC1 |SUPPORT1|TE1 |IC2 |SUPPORT2|TE2 |
+-----+-------+-------+-------+-------+-------+-------+
then i would rename the new columns as below
id|XPAT_1_01|XPAT_1_02|XPAT_1_03|XPAT_2_01|XPAT_2_02|XPAT_2_03|
I tried using X1000.X_PAT.* but it is throwing below error
pyspark.sql.utils.AnalysisException: 'Can only star expand struct data types. Attribute: ArrayBuffer(L_1000A, S_PER);'
Any ideas please?
Try this:
df = spark.createDataFrame([('1',[['IC1', 'SUPPORT1', 'TE1'],['IC2', 'SUPPORT2', 'TE2']]),('2',[['IC1', 'SUPPORT1', 'TE1'],['IC2','SUPPORT2', 'TE2']])],['id','X_PAT01'])
Define a function to parse the data
def create_column(df):
data = df.select('X_PAT01').collect()[0][0]
for each_list in range(len(data)):
for each_item in range(len(data[each_list])):
df = df.withColumn('X_PAT_'+str(each_list)+'_0'+str(each_item), F.lit(data[each_list][each_item]))
return df
calling
df = create_column(df)
output
This is a simple approach to horizontally explode array elements as per your requirement:
df2=(df1
.select('id',
*(col('X_PAT')
.getItem(i) #Fetch the nested array elements
.getItem(j) #Fetch the individual string elements from each nested array element
.alias(f'X_PAT_{i+1}_{str(j+1).zfill(2)}') #Format the column alias
for i in range(2) #outer loop
for j in range(3) #inner loop
)
)
)
Input vs Output:
Input(df1):
+----+--------------------------------------------+
|id |X_PAT |
+----+--------------------------------------------+
|fff9|[[IC1, SUPPORT1, TE1], [IC2, SUPPORT2, TE2]]|
+----+--------------------------------------------+
Output(df2):
+----+----------+----------+----------+----------+----------+----------+
| id|X_PAT_1_01|X_PAT_1_02|X_PAT_1_03|X_PAT_2_01|X_PAT_2_02|X_PAT_2_03|
+----+----------+----------+----------+----------+----------+----------+
|fff9| IC1| SUPPORT1| TE1| IC2| SUPPORT2| TE2|
+----+----------+----------+----------+----------+----------+----------+
Although this involves for loops, as the operations are directly performed on the dataframe (without collecting/converting to RDD), you should not encounter any issue.
I have a dataframe that has a column that is a JSON string
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as F
sc = SparkSession.builder.getOrCreate()
l = [
(1, """{"key1": true, "nested_key": {"mylist": ["foo", "bar"], "mybool": true}})"""),
(2, """{"key1": true, "nested_key": {"mylist": "", "mybool": true}})"""),
]
df = sc.createDataFrame(l, ["id", "json_str"])
and want to parse the json_str column with from_json using a schema
schema = StructType([
StructField("key1", BooleanType(), False),
StructField("nested_key", StructType([
StructField("mylist", ArrayType(StringType()), False),
StructField("mybool", BooleanType(), False)
]))
])
df = df.withColumn("data", F.from_json(F.col("json_str"), schema))
df.show(truncate=False)
+---+--------------------------+
|id |data |
+---+--------------------------+
|1 |[true, [[foo, bar], true]]|
|2 |[true, [, true]] |
+---+--------------------------+
As one can see, the second row didn't conform to the schema in schema so it's null even though I passed False to nullable in the StructField. It's important to my pipeline that if there's data that doesn't conform to the schema defined that an alert get raised somehow, but I'm not sure about the best way to do this in Pyspark. The real data has many, many keys, some of them nested so checking each one with some form of isNan isn't feasable and since we already defined the schema it feels like there should be away to leverage that.
If it matters, I don't necessarily need to check the schema of the whole dataframe, I'm really after checking the schema of the StructType column
Check out the options parameter:
https://spark.apache.org/docs/2.3.1/api/python/pyspark.sql.html?highlight=from_json#pyspark.sql.functions.from_json
It's a little vague, but it allows you to pass a dict to the underlying method here:
https://spark.apache.org/docs/2.3.1/api/python/pyspark.sql.html?highlight=from_json#pyspark.sql.DataFrameReader.json
You might have success passing something like options={'mode' : 'FAILFAST'}.
I am working off Kafka 2.3.0 and Spark 2.3.4. I have already built a Kafka Connector which reads off a CSV file and posts a line from the CSV to the relevant Kafka topic. The line is like so:
"201310,XYZ001,Sup,XYZ,A,0,Presales,6,Callout,0,0,1,N,Prospect".
The CSV contains 1000s of such lines. The Connector is able to successfully post them on the topic and I am also able to get the message in Spark. I am not sure how can I deserialize that message to my schema? Note that the messages are headerless so the key part in the kafka message is null. The value part includes the complete CSV string as above. My code is below.
I looked at this - How to deserialize records from Kafka using Structured Streaming in Java? but was unable to port it to my csv case. In addition I've tried other spark sql mechanisms to try and retrieve the individual row from the 'value' column but to no avail. If I do manage to get a compiling version (e.g. a map over the indivValues Dataset or dsRawData) I get errors similar to: "org.apache.spark.sql.AnalysisException: cannot resolve 'IC' given input columns: [value];". If I understand correctly, it is because value is a comma separated string and spark isn't really going to magically map it for me without me doing 'something'.
//build the spark session
SparkSession sparkSession = SparkSession.builder()
.appName(seCfg.arg0AppName)
.config("spark.cassandra.connection.host",config.arg2CassandraIp)
.getOrCreate();
...
//my target schema is this:
StructType schema = DataTypes.createStructType(new StructField[] {
DataTypes.createStructField("timeOfOrigin", DataTypes.TimestampType, true),
DataTypes.createStructField("cName", DataTypes.StringType, true),
DataTypes.createStructField("cRole", DataTypes.StringType, true),
DataTypes.createStructField("bName", DataTypes.StringType, true),
DataTypes.createStructField("stage", DataTypes.StringType, true),
DataTypes.createStructField("intId", DataTypes.IntegerType, true),
DataTypes.createStructField("intName", DataTypes.StringType, true),
DataTypes.createStructField("intCatId", DataTypes.IntegerType, true),
DataTypes.createStructField("catName", DataTypes.StringType, true),
DataTypes.createStructField("are_vval", DataTypes.IntegerType, true),
DataTypes.createStructField("isee_vval", DataTypes.IntegerType, true),
DataTypes.createStructField("opCode", DataTypes.IntegerType, true),
DataTypes.createStructField("opType", DataTypes.StringType, true),
DataTypes.createStructField("opName", DataTypes.StringType, true)
});
...
Dataset<Row> dsRawData = sparkSession
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", config.arg3Kafkabootstrapurl)
.option("subscribe", config.arg1TopicName)
.option("failOnDataLoss", "false")
.load();
//getting individual terms like '201310', 'XYZ001'.. from "values"
Dataset<String> indivValues = dsRawData
.selectExpr("CAST(value AS STRING)")
.as(Encoders.STRING())
.flatMap((FlatMapFunction<String, String>) x -> Arrays.asList(x.split(",")).iterator(), Encoders.STRING());
//indivValues when printed to console looks like below which confirms that //I receive the data correctly and completely
/*
When printed on console, looks like this:
+--------------------+
| value|
+--------------------+
| 201310|
| XYZ001|
| Sup|
| XYZ|
| A|
| 0|
| Presales|
| 6|
| Callout|
| 0|
| 0|
| 1|
| N|
| Prospect|
+--------------------+
*/
StreamingQuery sq = indivValues.writeStream()
.outputMode("append")
.format("console")
.start();
//await termination
sq.awaitTermination();
I require the data to be typed as my custom schema shown above since I would be running mathematical calculations over it (for every new row combined with some older rows).
Is it better to synthesize headers in the Kafka Connector source task before pushing them onto the topic? Will having headers make this issue resolution simpler?
Thanks!
Given your existing code, the easiest way to parse your input from your dsRawData is to convert it to a Dataset<String> and then use the native csv reader api
//dsRawData has raw incoming data from Kafka...
Dataset<String> indivValues = dsRawData
.selectExpr("CAST(value AS STRING)")
.as(Encoders.STRING());
Dataset<Row> finalValues = sparkSession.read()
.schema(schema)
.option("delimiter",",")
.csv(indivValues);
With such a construct you can use exactly the same CSV parsing options that are available when directly reading a CSV file from Spark.
I have been able to resolve this now. Via use of spark sql. The code to the solution is below.
//dsRawData has raw incoming data from Kafka...
Dataset<String> indivValues = dsRawData
.selectExpr("CAST(value AS STRING)")
.as(Encoders.STRING());
//create new columns, parse out the orig message and fill column with the values
Dataset<Row> dataAsSchema2 = indivValues
.selectExpr("value",
"split(value,',')[0] as time",
"split(value,',')[1] as cname",
"split(value,',')[2] as crole",
"split(value,',')[3] as bname",
"split(value,',')[4] as stage",
"split(value,',')[5] as intid",
"split(value,',')[6] as intname",
"split(value,',')[7] as intcatid",
"split(value,',')[8] as catname",
"split(value,',')[9] as are_vval",
"split(value,',')[10] as isee_vval",
"split(value,',')[11] as opcode",
"split(value,',')[12] as optype",
"split(value,',')[13] as opname")
.drop("value");
//remove any whitespaces as they interfere with data type conversions
dataAsSchema2 = dataAsSchema2
.withColumn("intid", functions.regexp_replace(functions.col("int_id"),
" ", ""))
.withColumn("intcatid", functions.regexp_replace(functions.col("intcatid"),
" ", ""))
.withColumn("are_vval", functions.regexp_replace(functions.col("are_vval"),
" ", ""))
.withColumn("isee_vval", functions.regexp_replace(functions.col("isee_vval"),
" ", ""))
.withColumn("opcode", functions.regexp_replace(functions.col("opcode"),
" ", ""));
//change types to ready for calc
dataAsSchema2 = dataAsSchema2
.withColumn("intcatid",functions.col("intcatid").cast(DataTypes.IntegerType))
.withColumn("intid",functions.col("intid").cast(DataTypes.IntegerType))
.withColumn("are_vval",functions.col("are_vval").cast(DataTypes.IntegerType))
.withColumn("isee_vval",functions.col("isee_vval").cast(DataTypes.IntegerType))
.withColumn("opcode",functions.col("opcode").cast(DataTypes.IntegerType));
//build a POJO dataset
Encoder<Pojoclass2> encoder = Encoders.bean(Pojoclass2.class);
Dataset<Pojoclass2> pjClass = new Dataset<Pojoclass2>(sparkSession, dataAsSchema2.logicalPlan(), encoder);
In PySpark it you can define a schema and read data sources with this pre-defined schema, e. g.:
Schema = StructType([ StructField("temperature", DoubleType(), True),
StructField("temperature_unit", StringType(), True),
StructField("humidity", DoubleType(), True),
StructField("humidity_unit", StringType(), True),
StructField("pressure", DoubleType(), True),
StructField("pressure_unit", StringType(), True)
])
For some datasources it is possible to infer the schema from the data-source and get a dataframe with this schema definition.
Is it possible to get the schema definition (in the form described above) from a dataframe, where the data has been inferred before?
df.printSchema() prints the schema as a tree, but I need to reuse the schema, having it defined as above,so I can read a data-source with this schema that has been inferred before from another data-source.
Yes it is possible. Use DataFrame.schema property
schema
Returns the schema of this DataFrame as a pyspark.sql.types.StructType.
>>> df.schema
StructType(List(StructField(age,IntegerType,true),StructField(name,StringType,true)))
New in version 1.3.
Schema can be also exported to JSON and imported back if needed.
The code below will give you a well formatted tabular schema definition of the known dataframe. Quite useful when you have very huge number of columns & where editing is cumbersome. You can then now apply it to your new dataframe & hand-edit any columns you may want to accordingly.
from pyspark.sql.types import StructType
schema = [i for i in df.schema]
And then from here, you have your new schema:
NewSchema = StructType(schema)
If you are looking for a DDL string from PySpark:
df: DataFrame = spark.read.load('LOCATION')
schema_json = df.schema.json()
ddl = spark.sparkContext._jvm.org.apache.spark.sql.types.DataType.fromJson(schema_json).toDDL()
You could re-use schema for existing Dataframe
l = [('Ankita',25,'F'),('Jalfaizy',22,'M'),('saurabh',20,'M'),('Bala',26,None)]
people_rdd=spark.sparkContext.parallelize(l)
schemaPeople = people_rdd.toDF(['name','age','gender'])
schemaPeople.show()
+--------+---+------+
| name|age|gender|
+--------+---+------+
| Ankita| 25| F|
|Jalfaizy| 22| M|
| saurabh| 20| M|
| Bala| 26| null|
+--------+---+------+
spark.createDataFrame(people_rdd,schemaPeople.schema).show()
+--------+---+------+
| name|age|gender|
+--------+---+------+
| Ankita| 25| F|
|Jalfaizy| 22| M|
| saurabh| 20| M|
| Bala| 26| null|
+--------+---+------+
Just use df.schema to get the underlying schema of dataframe
schemaPeople.schema
StructType(List(StructField(name,StringType,true),StructField(age,LongType,true),StructField(gender,StringType,true)))
Pyspark since version 3.3.0 return df.schema in python-way https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.schema.html#pyspark.sql.DataFrame.schema
>>> df.schema
StructType([StructField('age', IntegerType(), True),
StructField('name', StringType(), True)])
I am new to spark and I saw that there are two ways to create a data frame's schema.
I have an RDD: empRDD with data(split by ",")
+---+-------+------+-----+
| 1| Mark| 1000| HR|
| 2| Peter| 1200|SALES|
| 3| Henry| 1500| HR|
| 4| Adam| 2000| IT|
| 5| Steve| 2500| IT|
| 6| Brian| 2700| IT|
| 7|Michael| 3000| HR|
| 8| Steve| 10000|SALES|
| 9| Peter| 7000| HR|
| 10| Dan| 6000| BS|
+---+-------+------+-----+
val empFile = sc.textFile("emp")
val empData = empFile.map(e => e.split(","))
First way to create schema is using a case class:
case class employee(id:Int, name:String, salary:Int, dept:String)
val empRDD = empData.map(e => employee(e(0).toInt, e(1), e(2).toInt, e(3)))
val empDF = empRDD.toDF()
Second way is using StructType:
val empSchema = StructType(Array(StructField("id", IntegerType, true),
StructField("name", StringType, true),
StructField("salary", IntegerType, true),
StructField("dept", StringType, true)))
val empRDD = empdata.map(e => Row(e(0).toInt, e(1), e(2).toInt, e(3)))
val empDF = sqlContext.createDataFrame(empRDD, empSchema)
Personally I prefer to code using StructType. But I don't know which way is recommended in the actual industry projects. Could anyone let me know the preferred way ?
You can use spark-csv library to read a csv files, This library have lots of options as per our requirement.
You can read a csv file as
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("data.csv")
However you can also provide schema manually.
I think the best way is to read a csv with spark-csv as a dataset as
val cities = spark.read
.option("header", "true")
.csv(location)
.as[employee]
Read the advantage of dataset over rdd and dataframe here.
You can also generate the schema from case class if you have it already.
import org.apache.spark.sql.Encoders
val empSchema = Encoders.product[Employee].schema
Hope this helps
In the case when you are creating your RDD's from a CSV file(or any delimited file) you can infer schema automatically as #Shankar Koirala mentioned.
In case you are creating your RDD's from a different source then:
A. When you have less number of fields(less than 22) you can create it using case classes.
B. When you have more than 22 fields you need to create schema programmatically
Link to Spark Programming Guide
If your input file is delimited file, you can use databrick's spark-csv library.
Use this way:
// For spark < 2.0
DataFrame df = sqlContext.read()
.format("com.databricks.spark.csv")
.option("header", "true")
.option("nullValue", "")
.load("./data.csv");
df.show();
For spark 2.0;
DataFrame df = sqlContext.read()
.format("csv")
.option("header", "true")
.option("nullValue", "")
.load("./data.csv");
df.show();
There are lots of customization possible using option in the command.
Such as:
.option("inferSchema", "true") to infer data types of each column automatically.
.option("codec", "org.apache.hadoop.io.compress.GzipCodec") to define compression codec
.option("delimiter", ",") to specify delimiter as ','
Databrick's spark-csv library is ported in to spark 2.0.
Using this library will give you freedom from the difficulties parsing various use cases of delimited files.
Refer: https://github.com/databricks/spark-csv