SPARK-HIVE-key differences between Hive and Parquet from the perspective of table schema processing - apache-spark

I am new in spark and hive. I do not understand the statement
"Hive considers all columns nullable, while nullability in Parquet is significant"
If any one explain the statement with example it will better for me. Thank your.

In standard SQL syntax, when you create a table, you can state that a specific column is "nullable" (i.e. may contain a Null value) or not (i.e. trying to insert/update a Null value will throw an error).Nullable is the default.
Parquet schema syntax supports the same concept, although when using AVRO serialization, not-nullable is the default.
Caveat -- when you use Spark to read multiple Parquet files, these files may have different schemas. Imagine that the schema definition has changed over time, and newer files have 2 more Nullable columns at the end. Then you have to request "schema merging" so that Spark reads the schema from all files (not just one at random) to make sure that all these schemas are compatible, then at read-time the "undefined" columns are defaulted to Null for older files.
Hive HQL syntax does not support the standard SQL feature; every column is, and must be, nullable -- simply because Hive does not have total control on its data files!
Imagine a Hive partitioned table with 2 partitions...
one partition uses TextFile format and contains CSV dumps from
different sources, some showing up all expected columns, some missing
the last 2 columns because they use an older definition
the second partition uses Parquet format for history, created by Hive INSERT-SELECT queries, but older
Parquet files are missing the last 2 columns also, because they were created using the older table definition
For the Parquet-based partition, Hive does "schema merging", but instead of merging the file schemas together (like Spark), it merges each file schema with the table schema -- ignoring columns that are not defined in the table, and defaulting to Null all table columns that are not in the file.
Note that for the CSV-based partition, it's much more brutal, because the CSV files don't have a "schema" -- they just have a list of values that are mapped to the table columns, in order. On reaching EOL all missing columns are set to Null; on reaching the value for the last column, any extra value on the line is ignored.

Related

Spark partitioning of related data into row groups

With Apache Spark we can partition a dataframe into separate files when saving into Parquet format.
In the way Parquet files are written, each partition contains multiple row groups each of include column statistics pertaining to each group (e.g., min/max values, as well as number of NULL values).
Now, it would seem ideal in some situations to organize the Parquet file such that related data appears together in one or more row groups. This would be a secondary level of partitioning within each partition file (which constitutes the first level).
This is possible using for example pyarrow, but how can we do this with a distributed SQL engine such as Spark?
Besides partitioning you can order your data to group related data together in a limited set of partitions. Statement from Databricks:
Z-Ordering is a technique to colocate related information in the same
set of files
(
df
.write.option("header", True)
.orderBy(df.col_1.desc())
.partitionBy("col_2")
)

spark parquet partitioning which remove the partition column

If am using df.write.partitionby(col1).parquet(path) .
the data will remove the partition column on the data.
how to avoid it ?
You can duplicate col1 before writing:
df.withColumn("partition_col", col("col1")).write.partitionBy("partition_col").parquet(path)
Note that this step is not really necessary, because whenever you read a Parquet file in a partitioned directory structure, Spark will automatically add that as a new column to the dataframe.
Actually spark does not remove the column but it uses that column in a way to organize the files so that when you read the files it adds that as a column and display that to you in a table format. If you check the schema of the table or the schema of the dataframe you would still see that as a column in the table.
Also you are partitioning your data so you know how that data from table is queried frequently and based on that information you might have decided to partition the data so that your reads becomes faster and more efficient.

PySpark: how to read in partitioning columns when reading parquet

I have data stored in a parquet files and hive table partitioned by year, month, day. Thus, each parquet file is stored in /table_name/year/month/day/ folder.
I want to read in data for only some of the partitions. I have list of paths to individual partitions as follows:
paths_to_files = ['hdfs://data/table_name/2018/10/29',
'hdfs://data/table_name/2018/10/30']
And then try to do something like:
df = sqlContext.read.format("parquet").load(paths_to_files)
However, then my data does not include the information about year, month and day, as this is not part of the data per se, rather the information is stored in the path to the file.
I could use sql context and a send hive query with some select statement with where on the year, month and day columns to select only data from partitions i am interested in. However, i'd rather avoid constructing SQL query in python as I am very lazy and don't like reading SQL.
I have two questions:
what is the optimal way (performance-wise) to read in the data stored as parquet, where information about year, month, day is not present in the parquet file, but is only included in the path to the file? (either send hive query using sqlContext.sql('...'), or use read.parquet,... anything really.
Can i somehow extract the partitioning columns when using the
approach i outlined above?
Reading the direct file paths to the parent directory of the year partitions should be enough for a dataframe to determine there's partitions under it. However, it wouldn't know what to name the partitions without the directory structure /year=2018/month=10, for example.
Therefore, if you have Hive, then going via the metastore would be better because the partitions are named there, Hive stores extra useful information about your table, and then you're not reliant on knowing the direct path to the files on disk from the Spark code.
Not sure why you think you need to read/write SQL, though.
Use the Dataframe API instead, e.g
df = spark.table("table_name")
df_2018 = df.filter(df['year'] == 2018)
df_2018.show()
Your data isn't stored in a way optimal for parquet so you'd have to load files one by one and add the dates
Alternatively, you can move the files to a directory structure fit for parquet
( e.g. .../table/year=2018/month=10/day=29/file.parquet)
then you can read the parent directory (table) and filter on year, month, and day (and spark will only read the relevant directories) also you'd get these as attributes in your dataframe

spark reading missing columns in parquet

I have parquet files which I need to read from spark. Some files have few columns missing which are present in new files.
Since I do not know which files have column missing, I need to read all the files in spark. I have list of columns that I need to read. It may also be the case that all the files may have some column missing. I need to put a null in those columns which are missing.
When I try to do a
sqlContext.sql('query') it gives me error saying that columns are missing
If I define the schema and do a
sqlContext.read.parquet('s3://....').schema(parquet_schema)
It gives me the same error.
Help me here
You need to use parquet schema evolution strategy to address this situation.
As defined in the spark documentation
Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to automatically detect this case and merge schemas of all these files.
All you need to do is
val mergedDF = spark.read.option("mergeSchema", "true").parquet("'s3://....'")
This will give you parquet data with complete schema.
Pain point
In case your schema is non compatible for example one parquet file has col1 DataType as String and another parquet file has col1 DataType as Long.
Then the merge schema will fail.

Partition column is moved to end of row when saving a file to Parquet

For a given DataFrame just before being save'd to parquet here is the schema: notice that the centroid0 is the first column and is StringType:
However when saving the file using:
df.write.partitionBy(dfHolder.metadata.partitionCols: _*).format("parquet").mode("overwrite").save(fpath)
and with the partitionCols as centroid0:
then there is a (to me) surprising result:
the centroid0 partition column has been moved to the end of the Row
the data type has been changed to Integer
I confirmed the output path via println :
path=/git/block/target/scala-2.11/test-classes/data/output/blocking/out//level1/clusters
And here is the schema upon reading back from the saved parquet:
Why are those two modifications to the input schema occurring - and how can they be avoided - while still maintaining the centroid0 as a partitioning column?
Update A preferred answer should mention why /when the partitions were added to the end (vs the beginning) of the columns list. We need an understanding of the deterministic ordering.
In addition - is there any way to cause spark to "change it's mind" on the inferred column types? I have had to change the partitions from 0, 1 etc to c0, c1 etc in order to get the inference to map to StringType. Maybe that were required .. but if there were some spark setting to change the behavior that would make for an excellent answer.
When you write.partitionBy(...) Spark saves the partition field(s) as folder(s)
This is can be beneficial for reading data later as (with some file types, parquet included) it can optimize to read data just from partitions that you use (i.e. if you'd read and filter for centroid0==1 spark wouldn't read the other partitions
The effect of this is that the partition fields (centroid0 in your case) are not written into the parquet file only as folder names (centroid0=1, centroid0=2, etc.)
The side effect of these are 1. the type of the partition is inferred at run time (since the schema is not saved in the parquet) and in your case it happened that you only had integer values so it was inferred to integer.
The other side effect is that the partition field is added at the end/beginning of the schema as it reads the schema from the parquet files as one chunk and then it adds to that the partition field(s) as another (again, it is no longer part of the schema that is stored in the parquet)
You can actually pretty easily make use of ordering of the columns of a case class that holds the schema of your partitioned data. You will need to read the data from the path, inside which the partitioning columns are stored underneath to make Spark infer the values of these columns. Then simply apply re-ordering by using the case class schema with a statement like:
val encoder: Encoder[RecordType] = Encoders.product[RecordType]
spark.read
.schema(encoder.schema)
.format("parquet")
.option("mergeSchema", "true")
.load(myPath)
// reorder columns, since reading from partitioned data, the partitioning columns are put to end
.select(encoder.schema.fieldNames.head, encoder.schema.fieldNames.tail: _*)
.as[RecordType]
The reason is in fact pretty simple. When you partition by a column, each partition can only contain one value of the said column. Therefore it is useless to actually write the same value everywhere in the file, and this is why Spark does not. When the file is read, Spark uses the information contained in the names of the files to reconstruct the partitioning column and it is put at the end of the schema. The type of the column is not stored, it is inferred when reading, hence the integer type in your case.
NB: There is no particular reason as to why the column is added at the end. It could have been at the beginning. I guess it is just an arbitrary choice of implementation.
To avoid losing the type and the order of the columns, you could duplicate the partitioning column like this df.withColumn("X", 'YOUR_COLUMN).write.partitionBy("X").parquet("...").
You will waste space though. Also, spark uses the partitioning to optimize filters for instance. Don't forget to use the X column for filters after reading the data and not your column or Spark won't be able to perform any optimizations.

Resources