Auto infer schema from parquet / selectively convert string to float - apache-spark

I have a Parquet file with 400+ columns. When I read it, many of the columns come back as String (possibly because of the schema someone else specified when writing it).
I was not able to find a parameter similar to
inferSchema=True  # available for spark.read.csv, but not for spark.read.parquet
I tried setting
mergeSchema=True  # but it doesn't improve the results
To manually cast columns as float, I used
df_temp.select(*(col(c).cast("float").alias(c) for c in df_temp.columns))
This runs without error, but it also converts the values of the genuinely string-valued columns to null. I can't wrap it in a try/except block because no error is thrown.
Is there a way to check whether a column contains only integer/float values and selectively cast just those columns to float?

Parquet columns are typed, so there is no such thing as schema inference when loading Parquet files.
Is there a way to check whether a column contains only integer/float values and selectively cast just those columns to float?
You can use the same logic Spark itself uses: define a preferred type hierarchy and attempt casts until you reach the most specific type that parses every value in the column (a PySpark sketch of this approach follows the links below).
How to force inferSchema for CSV to consider integers as dates (with "dateFormat" option)?
Spark data type guesser UDAF
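A minimal PySpark sketch of that cast-and-check idea, assuming df_temp is the DataFrame from the question: a string column is promoted to float only when casting it introduces no new nulls, and all columns are checked in a single aggregation pass.
from pyspark.sql import functions as F

string_cols = [c for c, t in df_temp.dtypes if t == "string"]

# one pass over the data: for every string column, count values that are
# non-null but fail to parse as float (cast returns null on parse failure)
fail_counts = df_temp.agg(*[
    F.count(F.when(F.col(c).isNotNull() & F.col(c).cast("float").isNull(), 1)).alias(c)
    for c in string_cols
]).first().asDict()

float_cols = [c for c, fails in fail_counts.items() if fails == 0]

df_casted = df_temp.select(*[
    F.col(c).cast("float").alias(c) if c in float_cols else F.col(c)
    for c in df_temp.columns
])
Note that a column containing only nulls passes this check trivially, so you may want an extra guard for that case.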

There's no easy way to do this currently. There is an existing GitHub issue that can be referred to:
https://github.com/databricks/spark-csv/issues/264
Something like https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala exists for Scala; the same could be created for PySpark.
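As a rough sketch of what such a helper could look like in PySpark, here is a simplified, illustrative version that walks a preferred type hierarchy per column (the hierarchy and the helper name are made up, not part of Spark):
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType, LongType, DoubleType, StringType

# illustrative hierarchy: try the most specific type first, fall back to string
TYPE_HIERARCHY = [IntegerType(), LongType(), DoubleType()]

def infer_column_type(df, colname):
    """Return the first type in the hierarchy that parses every value in the column."""
    for dtype in TYPE_HIERARCHY:
        casted = F.col(colname).cast(dtype)
        # caveat: depending on the Spark version, string-to-integer casts may
        # truncate decimals instead of returning null, so a stricter check
        # (e.g. a regex) may be needed for the integral types
        failed = df.filter(F.col(colname).isNotNull() & casted.isNull()).limit(1).count()
        if failed == 0:
            return dtype
    return StringType()

inferred = {c: infer_column_type(df, c) for c, t in df.dtypes if t == "string"}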

Related

Converting Oracle RAW types with Spark

I have a table in an Oracle DB that contains a column stored as a RAW type. I'm making a JDBC connection to read that column and, when I print the schema of the resulting dataframe, I notice that I have a column with a binary data type. This was what I was expecting to happen.
The thing is that I need to be able to read that column as a String so I thought that a simple data type conversion would solve it.
df.select("COLUMN").withColumn("COL_AS_STRING", col("COLUMN").cast(StringType)).show
But what I got was a bunch of random characters. As I'm dealing with a RAW type, it's possible that a string representation of this data doesn't exist, so, just to be safe, I did a simple select to get the first rows from the source (using sqoop-eval), and somehow sqoop can display this column as a string.
I then thought that this could be an encoding problem so I tried this:
df.selectExpr("decode(COLUMN,'utf-8')").show
With utf-8 and a bunch of other encodings. But again all I got was random characters.
Does anyone know how can I do this data type conversion?
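A hedged PySpark sketch of the conversions described above, with a hex rendering added for comparison, since Oracle clients typically display RAW values as hex and that may be what sqoop-eval showed (COLUMN is a placeholder name):
from pyspark.sql import functions as F

df.select(
    F.col("COLUMN").cast("string").alias("cast_to_string"),    # raw bytes reinterpreted as text
    F.decode(F.col("COLUMN"), "utf-8").alias("decoded_utf8"),  # charset decode attempt
    F.hex(F.col("COLUMN")).alias("as_hex"),                    # hex representation of the bytes
).show(truncate=False)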

Pyspark most reliable way to verify column type

If I read data from a CSV, all the columns will be of "String" type by default. Generally I inspect the data using the following functions, which give an overview of the data and its types:
df.dtypes
df.show()
df.printSchema()
df.distinct().count()
df.describe().show()
But if there is a column that I believe is of a particular type, e.g. Double, I cannot be sure that all the values really are doubles if I don't have business knowledge, because:
1- I cannot see all the values (millions of unique values)
2- If I explicitly cast it to double type, Spark quietly converts the type without throwing any exception, and the values which are not doubles are converted to "null" - for example
from pyspark.sql.types import DoubleType
changedTypedf = df_original.withColumn('label', df_original['id'].cast(DoubleType()))
What could be the best way to confirm the type of column then?
In Scala, a DataFrame has a schema field; I would guess Python has the same:
df.schema.fields.find( _.name=="label").get.dataType
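A sketch of the same lookup in PySpark, using the label column from the question, plus a check that casting to double would not silently turn any existing values into nulls:
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

field_type = df.schema["label"].dataType   # declared type of the column
# equivalently: dict(df.dtypes)["label"]

# values that are non-null but fail to parse as double would become null on cast
bad_rows = df.filter(
    F.col("label").isNotNull() & F.col("label").cast(DoubleType()).isNull()
).count()

print(field_type, bad_rows)   # bad_rows == 0 means every value parses as a double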

Partition column is moved to end of row when saving a file to Parquet

For a given DataFrame, just before it is saved to Parquet, here is the schema: notice that centroid0 is the first column and is StringType:
However when saving the file using:
df.write.partitionBy(dfHolder.metadata.partitionCols: _*).format("parquet").mode("overwrite").save(fpath)
and with the partitionCols as centroid0:
then there is a (to me) surprising result:
the centroid0 partition column has been moved to the end of the Row
the data type has been changed to Integer
I confirmed the output path via println :
path=/git/block/target/scala-2.11/test-classes/data/output/blocking/out//level1/clusters
And here is the schema upon reading back from the saved parquet:
Why are those two modifications to the input schema occurring - and how can they be avoided - while still maintaining the centroid0 as a partitioning column?
Update: A preferred answer should mention why/when the partition columns are added to the end (vs. the beginning) of the columns list. We need an understanding of the deterministic ordering.
In addition, is there any way to cause Spark to "change its mind" on the inferred column types? I have had to change the partition values from 0, 1 etc. to c0, c1 etc. in order to get the inference to map to StringType. Maybe that was required, but if there were some Spark setting to change the behavior, that would make for an excellent answer.
When you write.partitionBy(...), Spark saves the partition field(s) as folder(s).
This can be beneficial for reading the data later, as (with some file types, Parquet included) it can optimize to read data only from the partitions that you use (i.e. if you read and filter for centroid0==1, Spark wouldn't read the other partitions).
The effect of this is that the partition fields (centroid0 in your case) are not written into the Parquet files themselves, only as folder names (centroid0=1, centroid0=2, etc.).
One side effect of this is that the type of the partition column is inferred at run time (since it is not saved in the Parquet schema), and in your case it happened that you only had integer values, so it was inferred to be an integer.
The other side effect is that the partition field is added at the end of the schema: Spark reads the schema from the Parquet files as one chunk and then appends the partition field(s) to it as another (again, they are no longer part of the schema stored in the Parquet files).
You can actually make use of the column ordering of a case class that holds the schema of your partitioned data fairly easily. You will need to read the data from the base path, underneath which the partitioning columns are stored, so that Spark infers the values of those columns. Then simply reorder the columns using the case class schema with a statement like:
val encoder: Encoder[RecordType] = Encoders.product[RecordType]
spark.read
.schema(encoder.schema)
.format("parquet")
.option("mergeSchema", "true")
.load(myPath)
// reorder columns, since reading from partitioned data, the partitioning columns are put to end
.select(encoder.schema.fieldNames.head, encoder.schema.fieldNames.tail: _*)
.as[RecordType]
The reason is in fact pretty simple. When you partition by a column, each partition can only contain one value of that column. Therefore it is useless to actually write the same value everywhere in the file, and this is why Spark does not. When the data is read back, Spark uses the information contained in the directory names to reconstruct the partitioning column, and it is put at the end of the schema. The type of the column is not stored; it is inferred when reading, hence the integer type in your case.
NB: There is no particular reason as to why the column is added at the end. It could have been at the beginning. I guess it is just an arbitrary choice of implementation.
To avoid losing the type and the order of the columns, you could duplicate the partitioning column like this: df.withColumn("X", 'YOUR_COLUMN).write.partitionBy("X").parquet("...").
You will waste some space, though. Also, Spark uses the partitioning to optimize filters, for instance; don't forget to use the X column in filters after reading the data (not your original column), or Spark won't be able to perform any of those optimizations.
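A hedged PySpark version of that duplicate-column workaround (out_path is a placeholder, and the filter value is illustrative):
from pyspark.sql import functions as F

# partition on a copy so the original centroid0 keeps its type and position
(df.withColumn("centroid0_part", F.col("centroid0"))
   .write
   .partitionBy("centroid0_part")
   .mode("overwrite")
   .parquet(out_path))

# on read, centroid0 is still a StringType column inside the files;
# filter on centroid0_part so partition pruning still applies
df_back = spark.read.parquet(out_path).filter(F.col("centroid0_part") == "c0")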

Dynamic Query Item Used for Sorting

I'm using Cognos Framework Manager and I'm creating a Data Item for a dynamic sort. I'm creating the Data Item using a CASE WHEN, here's my sample code:
CASE #prompt('SortOrder', 'string')#
WHEN 'Date' THEN <Date Column>
WHEN 'ID' THEN <String Column>
END
I'm getting this error QE-DEF-0405 Incompatible data types in case statement. Although I can cast the date column into a string wouldn't that make sort go wrong for the 'date' option? Should I cast the date column in a different way, cast the whole case, or am I barking at the wrong tree? In line with my question, should there be a general rule when creating dynamic columns via CASE with multiple column data types?
A column in Framework Manager must have a single data type.
So you need to cast your date column to a correctly sortable string,
e.g. in 'yyyy-mm-dd' format.
You are using two different data types in the CASE, so in the prompt function use token instead of string (#prompt('SortOrder', 'token')#).

SQLite 3 CSV Import to Table

I am using this as a resource to get me started - http://www.pantz.org/software/sqlite/sqlite_commands_and_general_usage.html
Currently I am working on creating an AIR program making use of the built in SQLite database. I could be considered a complete noob in making SQL queries.
table column types
I have a rather large Excel file (14K rows) that I have exported to a CSV file. It has 65 columns of varying data types (mostly ints, floats and short strings, MAYBE a few bools). I have no idea about the proper way to import it so as to preserve the column structure, nor do I know the best data formats to choose per DB column. I could use some input on this.
table creation utils
Is there a util that can read an XLS file and, based on the column headers, generate a quick query statement to ease the pain of writing the query manually? I saw this post, but it seems geared towards a preexisting CSV file and makes use of Python (something I am also a noob at).
Thank you in advance for your time.
J
SQLite3's column types basically boil down to:
TEXT
NUMERIC (REAL, FLOAT)
INTEGER (the various lengths of integer; but INT will normally do)
BLOB (binary objects)
Generally in a CSV file you will encounter strings (TEXT), decimal numbers (FLOAT), and integers (INT). If performance isn't critical, those are pretty much the only three column types you need. (CHAR(80) is smaller on disk than TEXT but for a few thousand rows it's not so much of an issue.)
As far as putting data into the columns is concerned, SQLite3 uses type coercion to convert the input data type to the column type wherever the conversion makes sense. So all you have to do is specify the correct column type, and SQLite will take care of storing it in the correct way.
For example the number -1230.00, the string "-1230.00", and the string "-1.23e3" will all coerce to the number -1230.0 when stored in a FLOAT column.
Note that if SQLite3 can't apply a meaningful type conversion, it will just store the original data without attempting to convert it at all. SQLite3 is quite happy to insert "Hello World!" into a FLOAT column. This is usually a Bad Thing.
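A small demonstration of that coercion using Python's built-in sqlite3 module (the table and values are made up for illustration):
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (name TEXT, amount FLOAT, qty INT)")

rows = [
    ("widget", "-1230.00", "7"),      # numeric strings coerce to REAL / INTEGER
    ("gadget", "-1.23e3", 2),
    ("oops", "Hello World!", "x"),    # no meaningful conversion: stored as-is
]
con.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)

for row in con.execute("SELECT name, amount, typeof(amount), qty, typeof(qty) FROM t"):
    print(row)
# ('widget', -1230.0, 'real', 7, 'integer')
# ('gadget', -1230.0, 'real', 2, 'integer')
# ('oops', 'Hello World!', 'text', 'x', 'text')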
See the SQLite3 documentation on column types and conversion for gems such as:
Type Affinity
In order to maximize compatibility between SQLite and other database engines, SQLite supports the concept of "type affinity" on columns. The type affinity of a column is the recommended type for data stored in that column. The important idea here is that the type is recommended, not required. Any column can still store any type of data. It is just that some columns, given the choice, will prefer to use one storage class over another. The preferred storage class for a column is called its "affinity".
