How to get the schema definition from a dataframe in PySpark? - apache-spark

In PySpark you can define a schema and read data sources with this pre-defined schema, e.g.:
from pyspark.sql.types import StructType, StructField, DoubleType, StringType

Schema = StructType([
    StructField("temperature", DoubleType(), True),
    StructField("temperature_unit", StringType(), True),
    StructField("humidity", DoubleType(), True),
    StructField("humidity_unit", StringType(), True),
    StructField("pressure", DoubleType(), True),
    StructField("pressure_unit", StringType(), True)
])
For some data sources it is possible to infer the schema from the data source and get a dataframe with this schema definition.
Is it possible to get the schema definition (in the form described above) from a dataframe, where the data has been inferred before?
df.printSchema() prints the schema as a tree, but I need to reuse the schema, having it defined as above, so I can read a data source with this schema that was previously inferred from another data source.

Yes, it is possible. Use the DataFrame.schema property:
schema
Returns the schema of this DataFrame as a pyspark.sql.types.StructType.
>>> df.schema
StructType(List(StructField(age,IntegerType,true),StructField(name,StringType,true)))
New in version 1.3.
The schema can also be exported to JSON and imported back if needed, as sketched below.
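A minimal sketch of that round trip, assuming an existing dataframe df (the path 'ANOTHER_LOCATION' is a placeholder):
import json
from pyspark.sql.types import StructType

schema_json = df.schema.json()                           # export the schema as a JSON string
restored = StructType.fromJson(json.loads(schema_json))  # rebuild a StructType from it
df2 = spark.read.schema(restored).json('ANOTHER_LOCATION')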

The code below will give you a well-formatted, tabular schema definition of a known dataframe. This is quite useful when you have a very large number of columns and writing the schema by hand is cumbersome. You can then apply it to your new dataframe and hand-edit any columns you want accordingly.
from pyspark.sql.types import StructType

# Collect the StructFields from the existing dataframe's schema into a plain list
schema = [i for i in df.schema]
And then from here, you have your new schema:
NewSchema = StructType(schema)
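For example (the path 'NEW_LOCATION' below is a placeholder), the hand-edited schema can then be applied when reading a new source:
new_df = spark.read.schema(NewSchema).csv('NEW_LOCATION', header=True)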

If you are looking for a DDL string from PySpark:
df: DataFrame = spark.read.load('LOCATION')
schema_json = df.schema.json()
# toDDL() lives on the JVM StructType, so call it through the Py4J gateway:
ddl = spark.sparkContext._jvm.org.apache.spark.sql.types.DataType.fromJson(schema_json).toDDL()
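The DDL string can then be fed straight back into a reader, since DataFrameReader.schema accepts a DDL-formatted string from Spark 2.3 onwards (the path below is a placeholder):
df2 = spark.read.schema(ddl).csv('ANOTHER_LOCATION', header=True)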

You can re-use the schema of an existing dataframe:
l = [('Ankita', 25, 'F'), ('Jalfaizy', 22, 'M'), ('saurabh', 20, 'M'), ('Bala', 26, None)]
people_rdd = spark.sparkContext.parallelize(l)
schemaPeople = people_rdd.toDF(['name', 'age', 'gender'])
schemaPeople.show()
+--------+---+------+
| name|age|gender|
+--------+---+------+
| Ankita| 25| F|
|Jalfaizy| 22| M|
| saurabh| 20| M|
| Bala| 26| null|
+--------+---+------+
spark.createDataFrame(people_rdd, schemaPeople.schema).show()
+--------+---+------+
| name|age|gender|
+--------+---+------+
| Ankita| 25| F|
|Jalfaizy| 22| M|
| saurabh| 20| M|
| Bala| 26| null|
+--------+---+------+
Just use df.schema to get the underlying schema of the dataframe:
schemaPeople.schema
StructType(List(StructField(name,StringType,true),StructField(age,LongType,true),StructField(gender,StringType,true)))

Since version 3.3.0, PySpark returns df.schema in a Python-friendly form: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.schema.html#pyspark.sql.DataFrame.schema
>>> df.schema
StructType([StructField('age', IntegerType(), True),
StructField('name', StringType(), True)])
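Because this repr is valid Python, it can be copied straight back into code as a schema definition, in exactly the form the question asks for:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([StructField('age', IntegerType(), True),
                     StructField('name', StringType(), True)])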

Related

pyspark.createDataFrame(rdd, schema) returns just null values

My RDD (from Elasticsearch) looks like this:
[
('rty456ui', {'#timestamp': '2022-10-10T24:56:10.000259+0000', 'host': {'id': 'test-host-id-1'}, 'watchlists': {'ioc': {'summary': '127.0.0.1', 'tags': ('Dummy Tag',)}}, 'source': {'ip': '127.0.0.1'}, 'event': {'created': '2022-10-10T13:56:10+00:00', 'id': 'rty456ui'}, 'tags': ('Mon',)}),
('cxs980qw', {'#timestamp': '2022-10-10T13:56:10.000259+0000', 'host': {'id': 'test-host-id-2'}, 'watchlists': {'ioc': {'summary': '0.0.0.1', 'tags': ('Dummy Tag',)}}, 'source': {'ip': '0.0.0.1'}, 'event': {'created': '2022-10-10T24:56:10+00:00', 'id': 'cxs980qw'}, 'tags': ('Mon', 'Tue')})
]
(What I find interesting is that lists in ES are converted to tuples in the RDD.)
I am trying to convert it into something like this.
+---------------+-----------+-----------+---------------------------+-----------------------+-----------------------+---------------+
|host.id |event.id |source.ip |event.created |watchlists.ioc.summary |watchlists.ioc.tags |tags |
+---------------+-----------+-----------+---------------------------+-----------------------+-----------------------+---------------+
|test-host-id-1 |rty456ui |127.0.0.1 |2022-10-10T13:56:10+00:00 |127.0.0.1 |[Dummy Tag] |[Mon] |
|test-host-id-2 |cxs980qw |0.0.0.1 |2022-10-10T24:56:10+00:00 |127.0.0.1 |[Dummy Tag] |[Mon, Tue] |
+---------------+-----------+-----------+---------------------------+-----------------------+-----------------------+---------------+
However, I am getting this:
+-------+--------+---------+-------------+----------------------+-------------------+-------------------------------+
|host.id|event.id|source.ip|event.created|watchlists.ioc.summary|watchlists.ioc.tags|tags |
+-------+--------+---------+-------------+----------------------+-------------------+-------------------------------+
|null |null |null |null |null |null |[Ljava.lang.Object;#6c704e6e |
|null |null |null |null |null |null |[Ljava.lang.Object;#701ea4c8 |
+-------+--------+---------+-------------+----------------------+-------------------+-------------------------------+
Code
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("host.id", StringType(), True),
    StructField("event.id", StringType(), True),
    StructField("source.ip", StringType(), True),
    StructField("event.created", StringType(), True),
    StructField("watchlists.ioc.summary", StringType(), True),
    StructField("watchlists.ioc.tags", StringType(), True),
    StructField("tags", StringType(), True)
])
df = spark.createDataFrame(es_rdd.map(lambda x: x[1]), schema)
df.show(truncate=False)
I'm trying to convert an RDD into a DataFrame and define the schema for it. However, pyspark.createDataFrame(rdd, schema) returns only null values, even though the RDD has data. Further, I get [Ljava.lang.Object;#701ea4c8 in the output too. So what am I missing here?
Your post covers 2 questions:
Why all columns are null even though I declare the schema when transforming the RDD to a dataframe: in your schema you use the dotted StructType.StructField form (e.g. host.id) to pull the value out of the RDD. However, this kind of selection only works in a Spark SQL select statement, and no such parsing happens here. To achieve your goal, you have to update the lambda function inside map to extract the exact elements, like
rdd_trans = rdd.map(lambda x: (x[1]['host']['id'], x[1]['event']['id'], ))
Why the output of the tags column is not shown as expected: it's because you declare the tags column as a string column; you should use ArrayType instead. See the sketch below.
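Putting both points together, a minimal sketch of the corrected code, assuming the RDD structure shown in the question (note that field names containing dots need backticks when selected later):
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

schema = StructType([
    StructField("host.id", StringType(), True),
    StructField("event.id", StringType(), True),
    StructField("source.ip", StringType(), True),
    StructField("event.created", StringType(), True),
    StructField("watchlists.ioc.summary", StringType(), True),
    StructField("watchlists.ioc.tags", ArrayType(StringType()), True),
    StructField("tags", ArrayType(StringType()), True),
])

# Extract each element explicitly; the ES tuples are converted to lists for the ArrayType columns.
rdd_trans = es_rdd.map(lambda x: (
    x[1]['host']['id'],
    x[1]['event']['id'],
    x[1]['source']['ip'],
    x[1]['event']['created'],
    x[1]['watchlists']['ioc']['summary'],
    list(x[1]['watchlists']['ioc']['tags']),
    list(x[1]['tags']),
))

df = spark.createDataFrame(rdd_trans, schema)
df.show(truncate=False)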

How to add comments to Glue on an AWS EMR using PySpark

I have a problem: I cannot find a way to save column comments in the Glue metadata with PySpark.
Currently I create new tables using:
df.write \
.saveAsTable(
'db_temp.tb_temp',
format='parquet',
path='s3://datalake-123/table/df/',
mode='overwrite'
)
So, if possible, I would like to add the comments in Glue using code, just like the picture below shows:
You need to modify the existing schema of the dataframe by adding the required comments. After modifying the schema, create a new dataframe using the modified schema and write it as a table.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

df = spark.createDataFrame([(1, 'abc'), (2, 'def')], ["id", "name"])
schema = StructType([StructField("id", IntegerType(), False, {"comment": "This is ID"}),
                     StructField("name", StringType(), True, {"comment": "This is name"})])
df_with_comment = spark.createDataFrame(df.rdd, schema)
df_with_comment.write.format('parquet').saveAsTable('mytable')
spark.sql('describe mytable').show()
+--------+---------+------------+
|col_name|data_type| comment|
+--------+---------+------------+
| id| int| This is ID|
| name| string|This is name|
+--------+---------+------------+
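A possible alternative sketch, assuming Spark 2.2+ where Column.alias accepts a metadata argument: attach the comment as column metadata instead of rebuilding the dataframe from its RDD. The column names follow the example above.
from pyspark.sql.functions import col

# The "comment" metadata key is what ends up as the column comment in the table definition.
df_with_comment = df.select(
    col("id").alias("id", metadata={"comment": "This is ID"}),
    col("name").alias("name", metadata={"comment": "This is name"}),
)
df_with_comment.write.format('parquet').saveAsTable('mytable')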

Spark SQL error AnalysisException: cannot resolve column_name

This is a common error in Spark SQL; I tried all the other answers but nothing made a difference!
I want to read the following small CSV file from HDFS (or even local filesystem).
+----+-----------+----------+-------------------+-----------+------+------+------+------+------+
| id| first_name| last_name| ssn| test1| test2| test3| test4| final| grade|
+----+-----------+----------+-------------------+-----------+------+------+------+------+------+
| 4.0| Dandy| Jim| 087-75-4321| 47.0| 1.0| 23.0| 36.0| 45.0| C+|
|13.0| Elephant| Ima| 456-71-9012| 45.0| 1.0| 78.0| 88.0| 77.0| B-|
|14.0| Franklin| Benny| 234-56-2890| 50.0| 1.0| 90.0| 80.0| 90.0| B-|
|15.0| George| Boy| 345-67-3901| 40.0| 1.0| 11.0| -1.0| 4.0| B|
|16.0| Heffalump| Harvey| 632-79-9439| 30.0| 1.0| 20.0| 30.0| 40.0| C|
+----+-----------+----------+-------------------+-----------+------+------+------+------+------+
Here is the code:
List<String> cols = new ArrayList<>();
Collections.addAll(cols, "id, first_name".replaceAll("\\s+", "").split(","));
Dataset<Row> temp = spark.read()
.format("org.apache.spark.csv")
.option("header", true)
.option("inferSchema", true)
.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
.csv(path)
.selectExpr(JavaConverters.asScalaIteratorConverter(cols.iterator()).asScala().toSeq());
but it errors:
Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved attribute(s) 'first_name missing from last_name#14, test1#16,id#12, test4#19, ssn#15, test3#18, grade#21, test2#17, final#20, first_name#13 in operator 'Project [id#12, 'first_name];;
'Project [id#12, 'first_name]
+- Relation[id#12, first_name#13, last_name#14, ssn#15, test1#16, test2#17, test3#18, test4#19, final#20, grade#21] csv
In some cases it works without error:
If I don't select anything, it successfully gets all the data.
If I just select the column "id", it also works.
I even tried using a view and the SQL method:
df.createOrReplaceTempView("csvFile");
spark.sql("SELECT id, first_name FROM csvFile").show()
but I got the same error!
I did the same with the same data read from a database, and there was no error.
I use Spark 2.2.1.
No need to convert String[] --> List<String> --> Seq<String>.
Simply pass the array to the selectExpr method, because selectExpr supports varargs:
Dataset<Row> temp = spark.read()
.format("org.apache.spark.csv")
.option("header", true)
.option("inferSchema", true)
.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
.csv(path)
.selectExpr("id, first_name".replaceAll("\\s+", "").split(","));
It was because of the incorrect structure of the CSV file! I removed the whitespace from the CSV file and now it works!
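A PySpark sketch of handling such stray whitespace programmatically instead of editing the file by hand (assuming the spaces live in the header row; path is the same placeholder used in the question):
# Read the file as before, then trim every column name right after loading.
df = spark.read.option("header", True).option("inferSchema", True).csv(path)
df = df.toDF(*[c.strip() for c in df.columns])
df.select("id", "first_name").show()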

What is the efficient way to create schema for a dataframe?

I am new to Spark and I saw that there are two ways to create a dataframe's schema.
I have an RDD empRDD with data (split by ","):
+---+-------+------+-----+
| 1| Mark| 1000| HR|
| 2| Peter| 1200|SALES|
| 3| Henry| 1500| HR|
| 4| Adam| 2000| IT|
| 5| Steve| 2500| IT|
| 6| Brian| 2700| IT|
| 7|Michael| 3000| HR|
| 8| Steve| 10000|SALES|
| 9| Peter| 7000| HR|
| 10| Dan| 6000| BS|
+---+-------+------+-----+
val empFile = sc.textFile("emp")
val empData = empFile.map(e => e.split(","))
The first way to create the schema is using a case class:
case class employee(id:Int, name:String, salary:Int, dept:String)
val empRDD = empData.map(e => employee(e(0).toInt, e(1), e(2).toInt, e(3)))
val empDF = empRDD.toDF()
The second way is using StructType:
val empSchema = StructType(Array(StructField("id", IntegerType, true),
StructField("name", StringType, true),
StructField("salary", IntegerType, true),
StructField("dept", StringType, true)))
val empRDD = empdata.map(e => Row(e(0).toInt, e(1), e(2).toInt, e(3)))
val empDF = sqlContext.createDataFrame(empRDD, empSchema)
Personally I prefer to code using StructType, but I don't know which way is recommended in actual industry projects. Could anyone let me know the preferred way?
You can use the spark-csv library to read CSV files; this library has lots of options to cover different requirements.
You can read a CSV file as:
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("data.csv")
However, you can also provide the schema manually.
I think the best way is to read a CSV with spark-csv as a Dataset:
val cities = spark.read
.option("header", "true")
.csv(location)
.as[employee]
Read the advantage of dataset over rdd and dataframe here.
You can also generate the schema from the case class if you already have it:
import org.apache.spark.sql.Encoders
val empSchema = Encoders.product[employee].schema
Hope this helps
In the case when you are creating your RDDs from a CSV file (or any delimited file), you can infer the schema automatically, as #Shankar Koirala mentioned.
In case you are creating your RDDs from a different source, then:
A. When you have fewer than 22 fields, you can create the schema using case classes. (The 22-field limit applies to Scala 2.10; later Scala versions allow larger case classes.)
B. When you have more than 22 fields you need to create the schema programmatically; see the sketch below.
Link to Spark Programming Guide
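For the programmatic route, a schema can be generated from a plain list of (name, type) pairs instead of being written out field by field. Below is a minimal PySpark sketch of the idea; the field names are just examples, and the question's Scala StructType code follows the same pattern.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Hypothetical field list; in practice it could come from a config file or a header line.
fields = [("id", IntegerType()), ("name", StringType()),
          ("salary", IntegerType()), ("dept", StringType())]

# Build the schema with a comprehension instead of spelling out every StructField.
empSchema = StructType([StructField(name, dtype, True) for name, dtype in fields])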
If your input file is a delimited file, you can use Databricks' spark-csv library.
Use it this way:
// For spark < 2.0
DataFrame df = sqlContext.read()
.format("com.databricks.spark.csv")
.option("header", "true")
.option("nullValue", "")
.load("./data.csv");
df.show();
For Spark 2.0+ (where the CSV reader is built in):
Dataset<Row> df = spark.read()
    .format("csv")
    .option("header", "true")
    .option("nullValue", "")
    .load("./data.csv");
df.show();
There are lots of customizations possible using option in the command, such as:
.option("inferSchema", "true") to infer data types of each column automatically.
.option("codec", "org.apache.hadoop.io.compress.GzipCodec") to define compression codec
.option("delimiter", ",") to specify delimiter as ','
Databricks' spark-csv library was ported into Spark 2.0.
Using this library frees you from the difficulties of parsing the various flavours of delimited files.
Refer: https://github.com/databricks/spark-csv

Spark csv to dataframe skip first row

I am loading a CSV into a dataframe using:
sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", ",")
  .load("file.csv")
but my input file contains a date in the first row and the header in the second row.
example
20160612
id,name,age
1,abc,12
2,bcd,33
How can I skip this first row while converting the CSV to a dataframe?
Here are several options that I can think of, since the Databricks module doesn't seem to provide a skip-line option:
Option one: Add a "#" character in front of the first line, and the line will automatically be treated as a comment and ignored by the Databricks csv module;
Option two: Create your customized schema and specify the mode option as DROPMALFORMED, which will drop the first line since it contains fewer tokens than expected by the customSchema:
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val customSchema = StructType(Array(StructField("id", IntegerType, true),
                                    StructField("name", StringType, true),
                                    StructField("age", IntegerType, true)))
val df = sqlContext.read.format("com.databricks.spark.csv").
  option("header", "true").
  option("mode", "DROPMALFORMED").
  schema(customSchema).load("test.txt")
df.show
16/06/12 21:24:05 WARN CsvRelation$: Number format exception. Dropping
malformed line: id,name,age
+---+----+---+
| id|name|age|
+---+----+---+
| 1| abc| 12|
| 2| bcd| 33|
+---+----+---+
Note the warning message here, which says it dropped the malformed line.
Option three: Write your own parser to drop the lines that don't have a length of three:
val file = sc.textFile("pathToYourCsvFile")
val df = file.map(line => line.split(",")).
  filter(lines => lines.length == 3 && lines(0) != "id").
  map(row => (row(0), row(1), row(2))).
  toDF("id", "name", "age")
df.show
+---+----+---+
| id|name|age|
+---+----+---+
| 1| abc| 12|
| 2| bcd| 33|
+---+----+---+
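A further sketch of the same idea in PySpark, assuming a Spark version where csv() also accepts an RDD of strings (roughly 2.2+): drop the first physical line with zipWithIndex and let the CSV reader parse the header from what remains.
# Skip the first physical line (the date), keep everything else.
lines = spark.sparkContext.textFile("file.csv") \
    .zipWithIndex() \
    .filter(lambda pair: pair[1] > 0) \
    .map(lambda pair: pair[0])

# Parse the remaining lines; the first of them is now the real header.
df = spark.read.option("header", "true").option("delimiter", ",").csv(lines)
df.show()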
