apache-spark - Unparseable number issue while writing data to parquet file from Spark DataFrame

I created a dataframe in Spark and loaded data into it from AWS S3. When I write the data from the dataframe to parquet, it throws the following error:
Caused by: java.text.ParseException: Unparseable number: "$250.00-$254.99"
    at java.text.NumberFormat.parse(NumberFormat.java:385)
    at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$2.apply$mcF$sp(CSVInferSchema.scala:261)
    at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$2.apply(CSVInferSchema.scala:261)
    at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$2.apply(CSVInferSchema.scala:261)
I do have a column in my dataset which has the value "$250.00-$254.99" and this column is defined as StringType in my dataframe.
Any help will be appreciated.
Thanks,
Vivek
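The stack trace comes from CSVTypeCast, so the failure happens when the CSV is actually read (writing to parquet just triggers the lazy read), which usually means the schema Spark applies at read time treats that column as numeric. A minimal sketch of one way around it, assuming hypothetical column names and S3 paths, is to pass an explicit schema so the price-range column stays a string:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Explicit schema: keep the price-range column as StringType so the CSV
# reader never tries to parse "$250.00-$254.99" as a number.
schema = StructType([
    StructField("product_id", StringType(), True),   # hypothetical column
    StructField("price_range", StringType(), True),  # the "$250.00-$254.99" column
    StructField("quantity", DoubleType(), True),     # hypothetical numeric column
])

df = spark.read.csv("s3://my-bucket/input/", header=True, schema=schema)
df.write.parquet("s3://my-bucket/output/", mode="overwrite")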

Related

Prevent pyspark/spark from transforming timestamp when creating a dataframe from a parquet file

I am reading a parquet file into a dataframe. My goal is to verify that my time data (column type in parquet: timestamp) is in ISO 8601.
The dates in the time column look like this: 2021-03-13T05:34:27.100Z or 2021-03-13T05:34:27.100+0000
But when I read my dataframe, pyspark transforms 2021-03-13T05:34:27.100Z into 2021-03-13 05:34:27.100
I want to keep the original format, but I can't figure out how to stop pyspark from doing this. I tried to use a custom schema with string for the dates, but I get this error: Parquet column cannot be converted in file file.snappy.parquet. Column: [time], Expected: string, Found: INT96
I also tried using conf parameters, but that didn't work for me.
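Parquet stores a timestamp value (INT96 here) rather than the original text, so the exact source string cannot be recovered, but the instant can be rendered back as an ISO 8601 string after the read. A minimal sketch, assuming a UTC session timezone and the time column from the question:

from pyspark.sql import functions as F

# Render timestamps in UTC so the formatted string carries the expected offset.
spark.conf.set("spark.sql.session.timeZone", "UTC")

df = spark.read.parquet("file.snappy.parquet")

# date_format reproduces the ISO 8601 layout; it does not restore the original bytes.
df_iso = df.withColumn(
    "time_iso",
    F.date_format(F.col("time"), "yyyy-MM-dd'T'HH:mm:ss.SSSXXX"),
)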

Error while querying parquet table in presto

A Hive parquet table was created over a Spark dataframe saved in parquet format.
I am able to query the parquet data in my parquet table.
But when querying it in Presto, it shows an error: "Query 20200817_061959_00150_nztin failed: Can not read value at 0 in block 0 in file"
I am not using any decimal fields. Most of my fields are of string type, and some of them are of date and timestamp type.
Can someone help?
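This kind of Presto failure often comes down to a mismatch between the types Spark actually wrote into the parquet files and the types declared on the Hive table (INT96 timestamps are a frequent culprit). A small diagnostic sketch, assuming a hypothetical table name and warehouse location, is to compare the two schemas:

# Schema as stored in the parquet files themselves
parquet_df = spark.read.parquet("/user/hive/warehouse/my_table")  # hypothetical location
parquet_df.printSchema()

# Schema as declared on the Hive table
spark.sql("DESCRIBE FORMATTED my_table").show(truncate=False)  # hypothetical table name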

PySpark parquet datatypes

I am using PySpark to read a relatively large csv file (~10GB):
ddf = spark.read.csv('directory/my_file.csv')
All the columns have the datatype string.
After changing the datatype of, for example, column_a, I can see the datatype changed to an integer. If I write the ddf to a parquet file and read the parquet file back, I notice that all columns have the datatype string again. Question: how can I make sure the parquet file contains the correct datatypes, so that I do not have to change them again when reading the parquet file?
Notes:
I write the ddf as a parquet file as follows:
ddf.repartition(10).write.parquet('directory/my_parquet_file', mode='overwrite')
I use:
PySpark version 2.0.0.2
Python 3.x
I read my large files with pandas and do not have this problem. Try using pandas.
http://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.read_csv.html
In [1]: import pandas as pd
In [2]: df = pd.read_csv('directory/my_file.csv')
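On the Spark side, parquet files carry their own schema, so the usual fix is to make sure the cast happens before the write; the types should then survive the round trip. A minimal sketch, assuming column_a and a header row:

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

ddf = spark.read.csv('directory/my_file.csv', header=True)  # header is an assumption

# Cast before writing; parquet stores the schema, so the type is kept on read.
ddf = ddf.withColumn('column_a', F.col('column_a').cast(IntegerType()))
ddf.repartition(10).write.parquet('directory/my_parquet_file', mode='overwrite')

# column_a should now come back as int, not string
spark.read.parquet('directory/my_parquet_file').printSchema()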

parquet to spark dataframe with location column

I am trying to read a parquet file into a DataFrame using Spark. My requirement is to create another column in the dataframe using the path of the parquet file.
Eg: I have parquet files in the following path in hdfs:
/event/2018-01-01/abc/xyz=10/parquet1.parquet
/event/2018-01-01/abc/xyz=10/parquet2.parquet
I want to read all the files in /event/2018-01-01/abc and create a column "dt" in the dataframe which specifies the date in the path. How do I extract the date from the path and create it as a column in the spark dataframe?
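One possible approach (a sketch, assuming the path layout shown above) is to capture each row's source file with input_file_name() and pull the date out of the path with a regex:

from pyspark.sql import functions as F

df = spark.read.parquet("/event/2018-01-01/abc")

# input_file_name() gives the full path of the file each row came from;
# the regex extracts the yyyy-MM-dd segment of that path into "dt".
df = df.withColumn(
    "dt",
    F.regexp_extract(F.input_file_name(), r"/event/(\d{4}-\d{2}-\d{2})/", 1),
)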

Spark: read from parquet an int column as long

I have a parquet file that is read by spark as an external table.
One of the columns is defined as int both in the parquet schema and in the spark table.
Recently, I discovered that int is too small for my needs, so I changed the column type to long in the new parquet files.
I also changed the type in the Spark table to bigint.
However, when I try to read an old parquet file (with int) through the Spark external table (with bigint), I get the following error:
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
One possible solution is altering the column type in the old parquet to long, which I asked about here: How can I change parquet column type from int to long?, but it is very expensive since I have a lot of data.
Another possible solution is to read each parquet file according to its schema to a different spark table and create a union view of the old and new tables, which is very ugly.
Is there another way to read from parquet an int column as long in spark?
Using pyspark, couldn't you just do
from pyspark.sql.functions import col
from pyspark.sql.types import LongType

df = spark.read.parquet('path to parquet files')
then just cast the column type in the dataframe
new_df = (df
    .withColumn('col_name', col('col_name').cast(LongType()))
)
and then just save the new dataframe to the same location with overwrite mode
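One caveat with this approach: the read is lazy, so overwriting the same location the dataframe is being read from can fail or lose data mid-write. A safer variant is to write the casted data to a temporary path first and swap it in (or repoint the external table) once the write succeeds.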
