Prevent pyspark/spark from transforming timestamp when creating a dataframe from a parquet file - apache-spark

I am reading a parquet file into a dataframe. My goal is to verify that my time data (column type in parquet: timestamp) is in ISO 8601 format.
The dates in the time column look like this: 2021-03-13T05:34:27.100Z or 2021-03-13T05:34:27.100+0000
But when I read the dataframe, pyspark transforms 2021-03-13T05:34:27.100Z into 2021-03-13 05:34:27.100
I want to keep the original format, but I can't figure out how to stop pyspark from doing this. I tried using a custom schema with string for the date column, but I get this error: Parquet column cannot be converted in file file.snappy.parquet. Column: [time], Expected: string, Found: INT96
I also tried various conf parameters, but they didn't work for me.
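A minimal sketch of one way to check this, assuming the column is named time and the file name from the error message: parquet stores the timestamp as a binary instant (here INT96), not as text, so the original string form is not kept and cannot be read back with a string schema. One option is to re-render the timestamp as an ISO 8601 string with date_format and compare; the path, pattern, and output column name below are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import date_format

spark = SparkSession.builder.getOrCreate()

# the 'time' column comes back as a native timestamp type
df = spark.read.parquet("file.snappy.parquet")

# re-render the timestamp as an ISO 8601 string for verification
df_iso = df.withColumn(
    "time_iso",
    date_format("time", "yyyy-MM-dd'T'HH:mm:ss.SSSXXX")
)
df_iso.select("time", "time_iso").show(truncate=False)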

Related

Pandas df.to_parquet write() got an unexpected keyword argument 'index' when ignoring index column

I am trying to export a pandas dataframe to parquet format using the following:
df.to_parquet("codeset.parquet", index=False)
I don't want the index column in the parquet file. Is this done automatically by to_parquet, or how can I get around it so that no index column is included in the exported parquet?
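The "unexpected keyword argument 'index'" error suggests the installed pandas predates support for the index keyword on to_parquet. If upgrading is not an option, one workaround is to build the Arrow table yourself with preserve_index=False and write it with pyarrow. A sketch, assuming pyarrow is installed and df is the dataframe above:
import pyarrow as pa
import pyarrow.parquet as pq

# build an Arrow table without the pandas index, then write it to parquet
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_table(table, "codeset.parquet")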

parquet to spark dataframe with location column

I am trying to read a parquet file into a DataFrame using Spark. My requirement is to create another column in the dataframe from the path of the parquet file.
Eg: I have parquet files in the following path in hdfs:
/event/2018-01-01/abc/xyz=10/parquet1.parquet
/event/2018-01-01/abc/xyz=10/parquet2.parquet
I want to read all the files in /event/2018-01-01/abc and create a column "dt" in the dataframe which specifies the date in the path. How do I extract the date from the path and create it as a column in the spark dataframe?
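A sketch of one way to do this, assuming the date always appears as a yyyy-MM-dd segment in the path: input_file_name() exposes the source file path for each row, and regexp_extract can pull the date out of it. The path and regex below are assumptions based on the example layout.
from pyspark.sql.functions import input_file_name, regexp_extract

df = spark.read.parquet("/event/2018-01-01/abc")

# input_file_name() returns the full path of the file each row was read from;
# extract the yyyy-MM-dd segment and expose it as the "dt" column
df_with_dt = df.withColumn(
    "dt",
    regexp_extract(input_file_name(), r"/event/(\d{4}-\d{2}-\d{2})/", 1)
)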

Spark: read from parquet an int column as long

I have a parquet file that is read by spark as an external table.
One of the columns is defined as int both in the parquet schema and in the spark table.
Recently, I've discovered int is too small for my needs, so I changed the column type to long in new parquet files.
I changed also the type in the spark table to bigint.
However, when I try to read an old parquet file (with int) in Spark as an external table (with bigint), I get the following error:
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
One possible solution is altering the column type in the old parquet to long, which I asked about here: How can I change parquet column type from int to long?, but it is very expensive since I have a lot of data.
Another possible solution is to read each parquet file according to its schema to a different spark table and create a union view of the old and new tables, which is very ugly.
Is there another way to read from parquet an int column as long in spark?
Using pyspark, couldn't you just do
from pyspark.sql.functions import col
from pyspark.sql.types import LongType
df = spark.read.parquet('path to parquet files')
then just cast the column type in the dataframe
new_df = (df
    .withColumn('col_name', col('col_name').cast(LongType()))
)
and then save the new dataframe to the same location with overwrite mode?
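A sketch of that last write step, using the same placeholder path; note that some Spark versions refuse to overwrite a path that is being read in the same job, in which case writing to a different location first is the safer option.
# write the casted dataframe back out as parquet
new_df.write.mode('overwrite').parquet('path to parquet files')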

apache-spark - Unparseable number Issue while writing data to parquet file from Spark DataFrame

I created a dataframe in Spark and loaded it with data from AWS S3. When I write the data from the dataframe to parquet, it throws the following error:
Caused by: java.text.ParseException: Unparseable number: "$250.00-$254.99"
    at java.text.NumberFormat.parse(NumberFormat.java:385)
    at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$2.apply$mcF$sp(CSVInferSchema.scala:261)
    at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$2.apply(CSVInferSchema.scala:261)
    at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$2.apply(CSVInferSchema.scala:261)
I do have a column in my dataset which has the value "$250.00-$254.99" and this column is defined as StringType in my dataframe.
Any help will be appreciated.
Thanks,
Vivek
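The stack trace points at the CSV reader's type casting (CSVTypeCast), which suggests the failure happens while the source data is read with a numeric type for that column, not while parquet is written. A sketch of reading the CSV with an explicit schema that keeps the price-range column as a string; the column name, file paths, and options are placeholders.
from pyspark.sql.types import StructType, StructField, StringType

# keep the price-range column as a plain string so no numeric cast is attempted
schema = StructType([
    StructField("price_range", StringType(), True),
    # ... define the remaining columns here ...
])

df = spark.read.csv("s3://bucket/input/data.csv", schema=schema, header=True)
df.write.parquet("s3://bucket/output/data_parquet")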

Unable to read timestamp value in pyspark from Hive (spark 1.6.1)

I am trying to read a Hive table that has a date column with datatype timestamp, length=9.
My code looks something like the following:
df = hc.sql("select * from schema.table")
It can read all other columns (datatype = varchar).
In the date column, however, it reads either null or None.
I printed df.dtypes and found that the dataframe schema was inferred correctly and the date columns have the timestamp datatype.
Surprisingly, the same code works in a notebook and only fails in the spark-shell environment.
Can someone tell me what could be wrong, or what limitation causes this error, and how I could fix it?
I have seen this problem in Spark, where a timestamp column will display as null; it's a bug. There is a way to get around it: read that date column as a string, using something like to_char(Column_name, 'YYYY-MM-DD HH-MM-SS') as column_name, and then cast it to timestamp. If you can tell me the source type and the tool you used to pull the data, like Sqoop, or whether you are getting the data in some form of files, I can help you better.
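A sketch of the workaround described above, adapted to pyspark: cast(... as string) in the Hive query plays the role of the to_char call, and the string is then cast back to timestamp on the dataframe side. The column name date_col is a placeholder, hc is the HiveContext from the question, and only the problem column is selected here for brevity.
# read the timestamp column as a string in the query itself,
# then cast it back to timestamp in the dataframe
df = hc.sql("select cast(date_col as string) as date_col from schema.table")
df = df.withColumn("date_col", df["date_col"].cast("timestamp"))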
