Pyspark most reliable way to verify column type - apache-spark

If I read data from a CSV, all the columns will be of "String" type by default. Generally I inspect the data using the following functions, which give an overview of the data and its types:
df.dtypes
df.show()
df.printSchema()
df.distinct().count()
df.describe().show()
But if there is a column that I believe is of a particular type, e.g. Double, I cannot be sure that all the values are doubles without business knowledge, because:
1- I cannot see all the values (there are millions of unique values)
2- If I explicitly cast it to double type, Spark quietly converts the type without throwing any exception, and the values which are not doubles become null - for example
from pyspark.sql.types import DoubleType
changedTypedf = df_original.withColumn('label', df_original['id'].cast(DoubleType()))
What could be the best way to confirm the type of column then?

In Scala a DataFrame has a schema field; the same is available in Python:
df.schema.fields.find(_.name == "label").get.dataType
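In PySpark the schema can be accessed the same way via df.schema. A minimal sketch (assuming the column is named 'label') that reads the declared type and also checks whether every non-null value actually parses as a double - since a non-numeric string casts to null, any row that is non-null before the cast but null after it failed to parse:
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Declared type of the column (PySpark counterpart of the Scala snippet above)
print(df.schema["label"].dataType)

# Rows whose value is non-null as a string but null after the cast did not parse as a double
not_double = df.filter(
    F.col("label").isNotNull() & F.col("label").cast(DoubleType()).isNull()
)
print("values that are not valid doubles:", not_double.count())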

Related

Oracle Column type Number is showing Decimal value in spark

Using the Spark JDBC read option, I am reading an Oracle table, and one of the columns is of type 'Number'. After reading and writing into an S3 bucket, the dataframe's printSchema shows decimal(38,10). I know a cast to int type can help, but the issue is that we created the Redshift table with an Integer type, and the decimal value (the dataframe value) does not allow the COPY command to succeed. Is there any solution other than casting?
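One sketch of an option other than casting after the read, assuming a Spark version that supports the JDBC customSchema option (the URL, table, and column names below are placeholders), is to override the inferred decimal(38,10) mapping at read time:
# Sketch: override the default NUMBER -> decimal(38,10) mapping when reading over JDBC
df_oracle = (spark.read
    .format("jdbc")
    .option("url", oracle_jdbc_url)            # placeholder JDBC URL
    .option("dbtable", "MY_SCHEMA.MY_TABLE")   # placeholder table name
    .option("user", user)
    .option("password", password)
    .option("customSchema", "ID INT")          # read the NUMBER column as an integer
    .load())
df_oracle.printSchema()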

Is there a way to get the column data type in pyspark?

It has been discussed that the way to find a column's datatype in PySpark is using df.dtypes (get datatype of column using pyspark). The problem with this is that for datatypes like an array or struct you get something like array<string> or array<integer>.
Question: Is there a native way to get the PySpark data type, like ArrayType(StringType,true)?
Just use schema:
df.schema[column_name].dataType
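For example (the column name 'tags' below is just illustrative), the two calls differ like this:
from pyspark.sql.types import ArrayType
df.dtypes                    # e.g. [('tags', 'array<string>'), ...]
df.schema['tags'].dataType   # e.g. ArrayType(StringType(), True)
isinstance(df.schema['tags'].dataType, ArrayType)   # the type object can be checked directly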

Auto infer schema from parquet/ selectively convert string to float

I have a parquet file with 400+ columns; when I read it, the default datatype attached to a lot of the columns is String (maybe due to the schema specified by someone else).
I was not able to find a parameter similar to
inferSchema=True  # present for spark.read.csv, but not for spark.read.parquet
I tried
mergeSchema=True  # but it doesn't improve the results
To manually cast columns to float, I used
from pyspark.sql.functions import col
df_temp.select(*(col(c).cast("float").alias(c) for c in df_temp.columns))
This runs without error, but converts all the actual string column values to null. I can't wrap this in a try/except block since it's not throwing any error.
Is there a way I can check whether a column contains only integer/float values and selectively cast those columns to float?
Parquet columns are typed, so there is no such thing as schema inference when loading Parquet files.
Is there a way I can check whether a column contains only integer/float values and selectively cast those columns to float?
You can use the same logic as Spark: define a preferred type hierarchy and attempt to cast until you find the most selective type that parses all values in the column.
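A minimal PySpark sketch of that idea (the float-only check and the helper name are illustrative, not Spark's internal implementation): cast each string column to float and keep the cast only if it introduces no new nulls, i.e. every non-null value parsed.
from pyspark.sql import functions as F

def casts_cleanly_to_float(df, column):
    # A value fails the cast if it is non-null as a string but null as a float
    failed = df.filter(
        F.col(column).isNotNull() & F.col(column).cast("float").isNull()
    )
    return failed.limit(1).count() == 0

string_cols = [c for c, t in df_temp.dtypes if t == "string"]
float_cols = [c for c in string_cols if casts_cleanly_to_float(df_temp, c)]

# Selectively cast only the columns whose values all parse as floats
df_casted = df_temp.select(
    *[F.col(c).cast("float").alias(c) if c in float_cols else F.col(c)
      for c in df_temp.columns]
)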
How to force inferSchema for CSV to consider integers as dates (with "dateFormat" option)?
Spark data type guesser UDAF
There's no easy way currently; there's an existing GitHub issue that can be referred to:
https://github.com/databricks/spark-csv/issues/264
Something like https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala exists for Scala; the same could be created for PySpark.

How to avoid dataset from renaming columns to value while mapping?

While mapping a dataset I keep having the problem that columns are being renamed from _1, _2 etc. to value, value.
What is it which is causing the rename?
That's because map on a Dataset causes the query to be serialized and deserialized in Spark.
To serialize it, Spark must know the Encoder. That's why there is an object ExpressionEncoder with an apply method. Its Javadoc says:
A factory for constructing encoders that convert objects and primitives to and from the
internal row format using catalyst expressions and code generation. By default, the
expressions used to retrieve values from an input row when producing an object will be created as
follows:
- Classes will have their sub fields extracted by name using [[UnresolvedAttribute]] expressions
and [[UnresolvedExtractValue]] expressions.
- Tuples will have their subfields extracted by position using [[BoundReference]] expressions.
- Primitives will have their values extracted from the first ordinal with a schema that defaults
to the name `value`.
Please look at the last point. Your query maps to primitives, so Catalyst uses the name "value".
If you add .select('value.as("MyPropertyName")).as[CaseClass], the field names will be correct.
Types that will get the column name "value":
- Option[_]
- Array
- Collection types like Seq, Map
- Types like String, Timestamp, Date, BigDecimal

Dynamic Query Item Used for Sorting

I'm using Cognos Framework Manager and I'm creating a Data Item for a dynamic sort, built with a CASE WHEN; here's my sample code:
CASE #prompt('SortOrder', 'string')#
WHEN 'Date' THEN <Date Column>
WHEN 'ID' THEN <String Column>
END
I'm getting this error: QE-DEF-0405 Incompatible data types in case statement. Although I can cast the date column to a string, wouldn't that make the sort wrong for the 'Date' option? Should I cast the date column in a different way, cast the whole CASE, or am I barking up the wrong tree? In line with my question, should there be a general rule when creating dynamic columns via CASE with multiple column data types?
A column in Framework Manager should have a datatype, and only one datatype.
So you need to cast your date column to a correctly sortable string,
e.g. in 'yyyy-mm-dd' format.
You are using two different data formats, so in the prompt function use token instead of string: #prompt('SortOrder', 'token')#