PySpark Incorrectly Inferring Parquet Date - apache-spark

I have an AWS Glue (2.0, Spark 2.4, Python 3) job that brings in S3-stored Parquet files via the create_dynamic_frame.from_options() function. The source data for the Parquet files is a MySQL table that houses infrequently populated user-defined fields (UDFs), where out of 600 rows only two might have a value for a given column. Spark is inferring these mostly NULL columns to be integers, when the true data type is date (stored in Parquet as the number of days since the Unix epoch).
What I am wondering is: is there a way to force Spark to look at all the data in the column, something like the ratio sampling you can do for an RDD, with either a dynamic frame or a data frame? Is there an alternate but good way to go about this? This is a dynamic situation where I need to use the same script for migrating 100+ databases, and the UDFs will differ by database, so it is not feasible to hardcode the data types.
Here is the code for creating the dynamic frame:
dyf_full = glueContext.create_dynamic_frame.from_options(
    connection_type='s3',
    connection_options={'path': path, 'recurse': True, 'exclusions': exclusion_string},
    format='parquet',
    transformation_ctx='df_full')
and here is the inferred schema, where DATE_FIELD_2 is infrequently populated and DATE_FIELD is more than 95% populated:
root
|-- CASE_ID: decimal
|-- DATE_FIELD: date
|-- DATE_FIELD_2: int
Please let me know if there is something else I can provide that would be beneficial. I am pretty stumped on this one.
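For reference, if the column really does hold days since the Unix epoch, the mis-typed values can at least be cast back to dates after the read. A minimal sketch only, assuming the dyf_full frame from above:
from pyspark.sql import functions as F

# Sketch: convert the int 'days since 1970-01-01' column back to a proper date.
df = dyf_full.toDF()  # DynamicFrame -> DataFrame
df = df.withColumn(
    'DATE_FIELD_2',
    F.expr("date_add(to_date('1970-01-01'), DATE_FIELD_2)"))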

Related

PySpark: how to read in partitioning columns when reading parquet

I have data stored in Parquet files and a Hive table partitioned by year, month, and day. Thus, each Parquet file is stored in a /table_name/year/month/day/ folder.
I want to read in data for only some of the partitions. I have list of paths to individual partitions as follows:
paths_to_files = ['hdfs://data/table_name/2018/10/29',
                  'hdfs://data/table_name/2018/10/30']
And then try to do something like:
df = sqlContext.read.format("parquet").load(paths_to_files)
However, then my data does not include the information about year, month and day, as this is not part of the data per se, rather the information is stored in the path to the file.
I could use the SQL context and send a Hive query with a SELECT statement with WHERE on the year, month, and day columns to select only data from the partitions I am interested in. However, I'd rather avoid constructing SQL queries in Python, as I am very lazy and don't like reading SQL.
I have two questions:
What is the optimal way (performance-wise) to read in the data stored as Parquet, where information about year, month, and day is not present in the Parquet file but is only included in the path to the file? (Either send a Hive query using sqlContext.sql('...'), or use read.parquet, ... anything really.)
Can I somehow extract the partitioning columns when using the approach I outlined above?
Reading the direct file paths to the parent directory of the year partitions should be enough for a dataframe to determine there are partitions under it. However, it wouldn't know what to name the partitions without a directory structure like /year=2018/month=10, for example.
Therefore, if you have Hive, then going via the metastore would be better because the partitions are named there, Hive stores extra useful information about your table, and then you're not reliant on knowing the direct path to the files on disk from the Spark code.
Not sure why you think you need to read/write SQL, though.
Use the DataFrame API instead, e.g.
df = spark.table("table_name")
df_2018 = df.filter(df['year'] == 2018)
df_2018.show()
Your data isn't stored in a layout that Parquet partition discovery understands, so you'd have to load the files one by one and add the date columns yourself (see the sketch below this answer).
Alternatively, you can move the files into a directory structure fit for Parquet partition discovery (e.g. .../table/year=2018/month=10/day=29/file.parquet); then you can read the parent directory (table) and filter on year, month, and day (Spark will only read the relevant directories), and you'd also get these as attributes in your dataframe.
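A minimal sketch of the load-and-tag approach, assuming a SparkSession named spark and the paths_to_files list from the question:
from functools import reduce
from pyspark.sql import DataFrame, functions as F

# Each path ends in .../year/month/day, so the partition values can be pulled from the path itself.
parts = []
for p in paths_to_files:
    year, month, day = p.rstrip('/').split('/')[-3:]
    parts.append(spark.read.parquet(p)
                      .withColumn('year', F.lit(int(year)))
                      .withColumn('month', F.lit(int(month)))
                      .withColumn('day', F.lit(int(day))))

df = reduce(DataFrame.unionByName, parts)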

spark reading missing columns in parquet

I have Parquet files that I need to read in Spark. Some files are missing a few columns that are present in newer files.
Since I do not know which files have columns missing, I need to read all the files in Spark. I have a list of columns that I need to read. It may also be the case that all the files are missing some column. I need to put a null in those columns that are missing.
When I try to do a sqlContext.sql('query'), it gives me an error saying that columns are missing.
If I define the schema and do a
sqlContext.read.schema(parquet_schema).parquet('s3://....')
it gives me the same error.
Help me here
You need to use the Parquet schema evolution strategy to address this situation.
As defined in the Spark documentation:
Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to automatically detect this case and merge schemas of all these files.
All you need to do is
val mergedDF = spark.read.option("mergeSchema", "true").parquet("s3://....")
This will give you parquet data with complete schema.
Pain point: in case your schemas are incompatible, for example one Parquet file has col1 with DataType String and another Parquet file has col1 with DataType Long, the schema merge will fail.
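For PySpark users, a rough equivalent of the snippet above, plus one way to backfill columns that are absent from every file (mergeSchema can only merge columns that exist in at least one file); the expected_columns mapping and its types are hypothetical placeholders:
from pyspark.sql import functions as F

merged_df = (spark.read
                  .option('mergeSchema', 'true')
                  .parquet('s3://....'))

# Columns missing from every file have to be added by hand as nulls.
expected_columns = {'col1': 'string', 'col2': 'bigint'}  # hypothetical names/types
for name, dtype in expected_columns.items():
    if name not in merged_df.columns:
        merged_df = merged_df.withColumn(name, F.lit(None).cast(dtype))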

Partition column is moved to end of row when saving a file to Parquet

For a given DataFrame, just before it is saved to Parquet, here is the schema: notice that centroid0 is the first column and is StringType:
However when saving the file using:
df.write.partitionBy(dfHolder.metadata.partitionCols: _*).format("parquet").mode("overwrite").save(fpath)
and with the partitionCols as centroid0:
then there is a (to me) surprising result:
the centroid0 partition column has been moved to the end of the Row
the data type has been changed to Integer
I confirmed the output path via println:
path=/git/block/target/scala-2.11/test-classes/data/output/blocking/out//level1/clusters
And here is the schema upon reading back from the saved parquet:
Why are those two modifications to the input schema occurring - and how can they be avoided - while still maintaining the centroid0 as a partitioning column?
Update: A preferred answer should mention why/when the partitions were added to the end (vs. the beginning) of the columns list. We need an understanding of the deterministic ordering.
In addition, is there any way to cause Spark to "change its mind" on the inferred column types? I have had to change the partition values from 0, 1, etc. to c0, c1, etc. in order to get the inference to map to StringType. Maybe that was required, but if there were some Spark setting to change the behavior, that would make for an excellent answer.
When you write.partitionBy(...), Spark saves the partition field(s) as folder(s).
This can be beneficial for reading data later, as (with some file types, Parquet included) it can optimize to read data just from the partitions that you use (i.e. if you read and filter for centroid0==1, Spark won't read the other partitions).
The effect of this is that the partition fields (centroid0 in your case) are not written into the Parquet files, only as folder names (centroid0=1, centroid0=2, etc.).
The side effects of this are: 1. the type of the partition column is inferred at run time (since the schema is not saved in the Parquet files), and in your case it happened that you only had integer values, so it was inferred as integer.
2. The other side effect is that the partition field is added at the end/beginning of the schema, as Spark reads the schema from the Parquet files as one chunk and then adds the partition field(s) as another (again, it is no longer part of the schema that is stored in the Parquet files).
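As an aside (not mentioned in either answer), Spark also exposes a configuration flag that disables partition-column type inference altogether, in which case partition values stay strings; a sketch with a placeholder path:
spark.conf.set('spark.sql.sources.partitionColumnTypeInference.enabled', 'false')

df = spark.read.parquet('/path/to/clusters')  # placeholder path
df.printSchema()  # centroid0 now comes back as string rather than int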
You can actually pretty easily make use of the ordering of the columns of a case class that holds the schema of your partitioned data. You will need to read the data from the path inside which the partitioning columns are stored, so that Spark infers the values of these columns. Then simply apply the re-ordering by using the case class schema with a statement like:
val encoder: Encoder[RecordType] = Encoders.product[RecordType]
spark.read
.schema(encoder.schema)
.format("parquet")
.option("mergeSchema", "true")
.load(myPath)
// reorder columns, since reading from partitioned data, the partitioning columns are put to end
.select(encoder.schema.fieldNames.head, encoder.schema.fieldNames.tail: _*)
.as[RecordType]
The reason is in fact pretty simple. When you partition by a column, each partition can only contain one value of the said column. Therefore it is useless to actually write the same value everywhere in the file, and this is why Spark does not. When the file is read, Spark uses the information contained in the names of the files to reconstruct the partitioning column and it is put at the end of the schema. The type of the column is not stored, it is inferred when reading, hence the integer type in your case.
NB: There is no particular reason as to why the column is added at the end. It could have been at the beginning. I guess it is just an arbitrary choice of implementation.
To avoid losing the type and the order of the columns, you could duplicate the partitioning column like this: df.withColumn("X", 'YOUR_COLUMN).write.partitionBy("X").parquet("...").
You will waste space, though. Also, Spark uses the partitioning to optimize filters, for instance. Don't forget to use the X column for filters after reading the data, and not your original column, or Spark won't be able to perform any optimizations.
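A PySpark rendering of that duplicate-column workaround, with placeholder column and path names:
from pyspark.sql import functions as F

(df.withColumn('centroid0_part', F.col('centroid0'))  # copy used only for partitioning
   .write.partitionBy('centroid0_part')
   .mode('overwrite')
   .parquet('/tmp/clusters'))  # placeholder path

# centroid0 keeps its StringType and position inside the files;
# filter on centroid0_part so Spark can still prune partitions.
df_back = spark.read.parquet('/tmp/clusters').filter(F.col('centroid0_part') == 'c1')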

Enum equivalent in Spark Dataframe/Parquet

I have a table with hundreds of millions of rows, that I want to store in a dataframe in Spark and persist to disk as a parquet file.
The size of my Parquet file(s) is now in excess of 2TB and I want to make sure I have optimized this.
A large proportion of these columns are string values that can be lengthy but often have very few distinct values. For example, I have a column with only two distinct values (a 20-character and a 30-character string), and I have another column with a string that is on average 400 characters long but only has about 400 distinct values across all entries.
In a relational database I would usually normalize those values out into a different table with references, or at least define my table with some sort of enum type.
I cannot see anything that matches that pattern in DataFrames or Parquet files. Is the columnar storage handling this efficiently? Or should I look into something to optimize this further?
Parquet doesn't have a mechanism for automatically generating enum-like types, but you can use the page dictionary. The page dictionary stores a list of values per parquet page to allow the rows to just reference back to the dictionary instead of rewriting the data. To enable the dictionary for the parquet writer in spark:
spark.conf.set("parquet.dictionary.enabled", "true")
spark.conf.set("parquet.dictionary.page.size", 2 * 1024 * 1024)
Note that you have to write the file with these options enabled, or it won't be used.
To enable filtering for existence using the dictionary, you can enable
spark.conf.set("parquet.filter.dictionary.enabled", "true")
Source: Parquet performance tuning: The missing guide

SPARK-HIVE-key differences between Hive and Parquet from the perspective of table schema processing

I am new to Spark and Hive. I do not understand the statement:
"Hive considers all columns nullable, while nullability in Parquet is significant"
If anyone could explain the statement with an example, it would be better for me. Thank you.
In standard SQL syntax, when you create a table, you can state that a specific column is "nullable" (i.e. may contain a Null value) or not (i.e. trying to insert/update a Null value will throw an error). Nullable is the default.
Parquet schema syntax supports the same concept, although when using AVRO serialization, not-nullable is the default.
Caveat -- when you use Spark to read multiple Parquet files, these files may have different schemas. Imagine that the schema definition has changed over time, and newer files have 2 more Nullable columns at the end. Then you have to request "schema merging" so that Spark reads the schema from all files (not just one at random) to make sure that all these schemas are compatible, then at read-time the "undefined" columns are defaulted to Null for older files.
Hive HQL syntax does not support that standard SQL feature; every column is, and must be, nullable -- simply because Hive does not have total control over its data files!
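A small illustration of what "nullability is significant" means on the Spark/Parquet side; the schema, data, and path below are made up for the sketch:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField('id', IntegerType(), nullable=False),  # nullability is recorded per column
    StructField('name', StringType(), nullable=True),
])
df = spark.createDataFrame([(1, 'a')], schema)
df.printSchema()  # id shows nullable = false here

df.write.mode('overwrite').parquet('/tmp/nullability_demo')  # placeholder path
# Hive, by contrast, treats every column of a table over these files as nullable,
# and Spark itself relaxes columns back to nullable when reading them from files.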
Imagine a Hive partitioned table with 2 partitions...
- one partition uses TextFile format and contains CSV dumps from different sources, some showing all expected columns, some missing the last 2 columns because they use an older definition
- the second partition uses Parquet format for history, created by Hive INSERT-SELECT queries, but older Parquet files are also missing the last 2 columns, because they were created using the older table definition
For the Parquet-based partition, Hive does "schema merging", but instead of merging the file schemas together (like Spark), it merges each file schema with the table schema -- ignoring columns that are not defined in the table, and defaulting to Null all table columns that are not in the file.
Note that for the CSV-based partition, it's much more brutal, because the CSV files don't have a "schema" -- they just have a list of values that are mapped to the table columns, in order. On reaching EOL all missing columns are set to Null; on reaching the value for the last column, any extra value on the line is ignored.
