Attach description of columns in Apache Spark using parquet format - apache-spark

I read a parquet file with:
df = spark.read.parquet(file_name)
And get the columns with:
df.columns
which returns a list of column names: ['col1', 'col2', 'col3']
I read that parquet format is able to store some metadata in the file.
Is there a way to store and read extra metadata, for example, to attach a human-readable description of what each column is?
Thanks.

There is no way to read or store arbitrary additional metadata in a Parquet file.
When metadata in a Parquet file is mentioned, it refers to the technical metadata associated with each field, including the number of nested fields, type information, length information, etc. If you look at the SchemaElement class in the Parquet documentation (https://static.javadoc.io/org.apache.parquet/parquet-format/2.6.0/org/apache/parquet/format/SchemaElement.html) you will find all of the metadata available for each field in a schema. This does not include any human-readable description beyond the field name.
A good overview of the Parquet metadata can be found in the "File Format" section here: https://parquet.apache.org/documentation/latest/
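For illustration, a minimal PySpark sketch of inspecting that technical metadata, assuming the spark session and file_name from the question:
# Minimal sketch: the per-column information Spark surfaces from a Parquet file
# is the technical schema (names, types, nullability), not free-form descriptions.
df = spark.read.parquet(file_name)
df.printSchema()
for field in df.schema.fields:
    print(field.name, field.dataType, field.nullable)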

Related

Databricks: Incompatible format detected (temp view)

I am trying to create a temp view from a number of parquet files, but it does not work so far. As a first step, I am trying to create a dataframe by reading parquets from a path. I want to load all parquet files into the df, but so far I don't even manage to load a single one, as you can see in the screenshot below. Can anyone help me out here? Thanks
Info: batch_source_path is the string in column "path", row 1
Your data is in Delta format, and this is how you must read it:
data = spark.read.load('your_path_here', format='delta')
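A hedged follow-up sketch (the view name is a placeholder): once the load above succeeds, the temp view the question asks about can be registered from the DataFrame.
data = spark.read.load('your_path_here', format='delta')
data.createOrReplaceTempView('my_temp_view')  # placeholder view name
spark.sql('SELECT * FROM my_temp_view LIMIT 10').show()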

How to find out whether Spark table is parquet or delta?

I have a database with some tables in Parquet format and others in Delta format. If I want to append data to a table, I need to specify the format when the table is in Delta format (the default is Parquet).
How can I determine a table's format?
I tried show tblproperties <tbl_name> but this gives an empty result.
According to the Delta Lake API docs you can check:
from delta.tables import DeltaTable
DeltaTable.isDeltaTable(spark, "path")
Please see the note in the documentation
This uses the active SparkSession in the current thread to read the table data. Hence, this throws an error if an active SparkSession has not been set, that is, if SparkSession.getActiveSession() is empty.
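A hedged sketch of how this check could drive the append described in the question; table_path and new_df are placeholders, and an active SparkSession is assumed per the note above:
from delta.tables import DeltaTable

# Pick the write format based on what already exists at the path (placeholder names).
fmt = 'delta' if DeltaTable.isDeltaTable(spark, table_path) else 'parquet'
new_df.write.format(fmt).mode('append').save(table_path)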

create different dataframe based on field value in Spark/Scala

I have a dataframe in the format below with 2 fields. One of the fields contains a code and the other field contains XML.
EventCd|XML_VALUE
1.3.6.10|<nt:SNMP>
<nt:var id="1.3.0" type="STRING"> MESSAGE </nt:var>
<nt:var id="1.3.9" type="STRING">AB-CD-EF</nt:var>
</nt:SNMP>
1.3.6.11|<nt:SNMP>
<nt:var id="1.3.1" type="STRING"> CALL </nt:var>
<nt:var id="1.3.2" type="STRING">XX-AC-EF</nt:var>
</nt:SNMP>
Based on the value in the code field, I want to conditionally create different dataframes and place the data in the corresponding HDFS folder.
If the code is 1.3.6.10, it should create a message dataframe and place files under the ../message/ HDFS folder, and if the code is 1.3.6.11, it should create a call dataframe and write the data into the call HDFS folder, e.g. ../call/.
I am able to create the dataframes using multiple filter operations, but is there any option to use only one dataframe and a corresponding HDFS write command?
Can someone suggest how I can do this in Spark/Scala, please?
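A hedged PySpark sketch of the filter-based routing the asker describes (the same pattern translates directly to Scala); the output format is an assumption:
routes = {
    '1.3.6.10': '../message/',  # message events
    '1.3.6.11': '../call/',     # call events
}
for event_cd, hdfs_path in routes.items():
    df.filter(df.EventCd == event_cd).write.mode('append').parquet(hdfs_path)  # format assumed
Alternatively, df.write.partitionBy('EventCd').parquet(base_path) writes every code into its own subdirectory in a single pass, though the folders are then named EventCd=1.3.6.10 rather than message/ and call/.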

Is DataFrame schema saved when using parquet format?

If one calls df.write.parquet(destination), is the DataFrame schema (i.e. StructType information) saved along with the data?
If the parquet files are generated by programs other than Spark, how does sqlContext.read.parquet figure out the schema of the DataFrame?
Parquet automatically preserves the schema of the original data when saving, so there will be no difference whether it is Spark or another system that writes/reads the data.
If one or more columns are used to partition the data when saving, the data types of these columns are lost (since that information is stored in the directory structure). Their types can be automatically inferred by Spark when reading (currently only numeric data types and strings are supported).
This automatic inference can be turned off by setting spark.sql.sources.partitionColumnTypeInference.enabled to false, which will make these columns be read as strings. For more information, see the Partition Discovery section of the Spark SQL guide.
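A short sketch of the behaviour described above (paths and column names are illustrative):
# Writing partitioned by a column moves its values into the directory structure.
df.write.partitionBy('year').parquet('/tmp/events')

# On read, Spark infers the partition column's type (numeric and string only).
spark.read.parquet('/tmp/events').printSchema()

# Turn inference off so partition columns come back as strings.
spark.conf.set('spark.sql.sources.partitionColumnTypeInference.enabled', 'false')
spark.read.parquet('/tmp/events').printSchema()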

Should columns generated during ETL be added to schema?

I read a CSV, then do some transformations, adding a few columns using .withColumn(...), then write to Parquet format. Would it help performance to add those added columns to the read() schema?
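For context, a sketch of the pipeline described (the schema, column and path names are illustrative, not from the question):
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Explicit schema covering only the columns that exist in the CSV itself.
csv_schema = StructType([
    StructField('id', StringType()),
    StructField('amount', DoubleType()),
])

df = (spark.read.schema(csv_schema).csv('/data/input.csv')
        .withColumn('amount_with_tax', F.col('amount') * 1.2))  # derived column, not in the CSV

df.write.parquet('/data/output')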
