Is there a way to get the column data type in pyspark? - apache-spark

It has been discussed that the way to find a column's data type in PySpark is to use df.dtypes (get datatype of column using pyspark). The problem with this is that for data types like arrays or structs you only get a string such as array<string> or array<integer>.
Question: Is there a native way to get the PySpark data type, like ArrayType(StringType,true)?

Just use schema:
df.schema[column_name].dataType
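For example, a minimal sketch (the column name tags and the toy dataframe are made up for illustration, assuming a SparkSession is available) showing the difference between df.dtypes and the schema lookup:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy dataframe with an array<string> column
df = spark.createDataFrame([(["a", "b"],)], ["tags"])

df.dtypes                    # [('tags', 'array<string>')] - plain string form
df.schema["tags"].dataType   # ArrayType(StringType, True) - the full pyspark type object
                             # (exact repr varies by Spark version)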

Related

How to Convert spark dataframe to nested json using spark scala dynamically

I want to convert the DataFrame to nested JSON. Source data:
The DataFrame has values like:
Expected output:
I have to convert the DataFrame values to nested JSON like:
Appreciate your help!
If you want to persist the data, save the dataframe in JSON format:
df.write.json("path")
You can use the toJSON function, which converts the dataframe to a Dataset[String]:
df.toJSON
If there is only one element, you can go further and get a single string:
df.toJSON.take(1).head
Thanks.
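For reference, a rough PySpark equivalent of the Scala snippets above (the output path is a placeholder); in Python, toJSON() returns an RDD of JSON strings:

# Persist the dataframe as JSON files (nested structs/arrays become nested JSON objects)
df.write.json("/tmp/nested_json_out")   # placeholder path

# Or convert rows to JSON strings in memory
json_rdd = df.toJSON()            # RDD of JSON strings, one per row
first_json = json_rdd.first()     # equivalent of Scala's df.toJSON.take(1).head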

iterating complex dataframe with array of structfield

I have data in one of the dataframe's columns with the following schema:
<type 'list'>: [StructField(data,StructType(List(StructField(account,StructType(List(StructField(Id,StringType,true),StructField(Name,StringType,true),StructField(books,ArrayType(StructType(List(StructField(bookTile,StringType,true),StructField(bookId,StringType,true),StructField(bookName,StringType,true))),true),true)))))))]
I want to iterate over it, extract each value, and create a new dataframe. Are there any built-in functions in PySpark that support this, or should I iterate manually? Is there an efficient way?
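A possible sketch, assuming the schema above (field names taken from it, everything else hypothetical): select the nested fields with dot notation and explode the books array so each book becomes its own row:

from pyspark.sql.functions import col, explode

# One row per book, with the account fields repeated alongside
flat_df = df.select(
    col("data.account.Id").alias("accountId"),
    col("data.account.Name").alias("accountName"),
    explode(col("data.account.books")).alias("book"),
).select(
    "accountId",
    "accountName",
    col("book.bookId").alias("bookId"),
    col("book.bookName").alias("bookName"),
    col("book.bookTile").alias("bookTile"),
)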

Pyspark most reliable way to verify column type

If I read data from a CSV, all the columns will be of "String" type by default. Generally I inspect the data using the following functions, which give an overview of the data and its types:
df.dtypes
df.show()
df.printSchema()
df.distinct().count()
df.describe().show()
But if there is a column that I believe is of a particular type, e.g. Double, I cannot be sure that all the values are doubles, because I don't have business knowledge and because:
1- I cannot see all the values (millions of unique values)
2- If I explicitly cast it to double type, Spark quietly converts the type without throwing any exception, and the values which are not doubles become "null" - for example
from pyspark.sql.types import DoubleType
changedTypedf = df_original.withColumn('label', df_original['id'].cast(DoubleType()))
What could be the best way to confirm the type of column then?
In Scala, a DataFrame has a schema field; I guess it is the same in Python:
df.schema.fields.find(_.name == "label").get.dataType
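In PySpark the same lookup can be written against df.schema; a small sketch using the label column from the snippet above:

from pyspark.sql.types import DoubleType

dtype = df.schema["label"].dataType   # e.g. DoubleType()
isinstance(dtype, DoubleType)         # True if the column is declared as double

Note that this only confirms the declared type; whether every string value actually parses as a double still requires casting and checking for nulls, as discussed in the next question.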

Auto infer schema from parquet/ selectively convert string to float

I have a parquet file with 400+ columns. When I read it, the default data type attached to a lot of columns is String (maybe due to the schema specified by someone else).
I was not able to find a parameter similar to
inferSchema=True  # present for spark.read.csv, but not for spark.read.parquet
I tried setting
mergeSchema=True  # but it doesn't improve the results
To manually cast columns to float, I used
from pyspark.sql.functions import col
df_temp.select(*(col(c).cast("float").alias(c) for c in df_temp.columns))
This runs without error, but it converts all the genuinely string column values to null. I can't wrap this in a try/except block as it isn't throwing any error.
Is there a way I can check whether a column contains only integer/float values and selectively cast those columns to float?
Parquet columns are typed, so there is no such thing as schema inference when loading Parquet files.
Is there a way I can check whether a column contains only integer/float values and selectively cast those columns to float?
You can use the same logic as Spark: define a preferred type hierarchy and attempt to cast until you find the most selective type that parses all values in the column.
How to force inferSchema for CSV to consider integers as dates (with "dateFormat" option)?
Spark data type guesser UDAF
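A minimal sketch of that cast-and-check idea for a single candidate type (float), using the df_temp dataframe from the question (the helper name is made up): a column is treated as safely castable only if no non-null value becomes null after the cast.

from pyspark.sql.functions import col

def castable_to_float(df, c):
    # Values that are non-null as strings but null after the cast do not parse as float
    bad = df.filter(col(c).isNotNull() & col(c).cast("float").isNull()).count()
    return bad == 0

float_cols = [c for c in df_temp.columns if castable_to_float(df_temp, c)]
df_cast = df_temp.select(
    *[col(c).cast("float").alias(c) if c in float_cols else col(c) for c in df_temp.columns]
)

This issues one count per column, so for 400+ columns it may be worth combining the checks into a single aggregation.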
There's no easy way currently. There is an existing GitHub issue which can be referred to:
https://github.com/databricks/spark-csv/issues/264
Something like
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala
exists for Scala; the same could be created for PySpark.

Unable to read timestamp value in pyspark from Hive (spark 1.6.1)

I am trying to read a Hive table that has a date column with data type timestamp, length=9.
My code looks something like the following:
df = hc.sql("select * from schema.table")
It can read all other columns (datatype = varchar).
It either reads null or gives None in the date column.
I printed df.dtypes and found that the dataframe schema was inferred correctly and the date columns have the timestamp data type.
Surprisingly, the same code works in a notebook and only fails in the spark-shell environment.
Can someone tell me what could be wrong, or what limitation causes this error, and how I could rectify it?
I have seen this problem in Spark, where it displays null when the data type is timestamp; it's a bug. There is a way to get around it: read that date column as a string, using something like to_char(Column_name, 'YYYY-MM-DD HH-MM-SS') as column_name, and then cast it to timestamp. If you can tell me the source type and the tool you used to pull the data (like sqoop), or whether you are getting the data in some form of files, I can help you better.
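A hedged sketch of that workaround in PySpark, assuming the date has already been extracted as a string column (here called date_str, a made-up name) as described above:

from pyspark.sql.functions import col

# Cast the string representation back to a proper timestamp column
df = df.withColumn("date_col", col("date_str").cast("timestamp"))
df.select("date_str", "date_col").show(5, False)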
