PySpark parquet datatypes - python-3.x

I am using PySpark to read a relatively large CSV file (~10 GB):
ddf = spark.read.csv('directory/my_file.csv')
All of the columns have the datatype string.
After changing the datatype of, for example, column_a, I can see that the datatype has changed to integer. But if I write the ddf to a parquet file and then read the parquet file back, I notice that all columns have the datatype string again. Question: how can I make sure the parquet file contains the correct datatypes, so that I do not have to change them again when reading the parquet file?
Notes:
I write the ddf as a parquet file as follows:
ddf.repartition(10).write.parquet('directory/my_parquet_file', mode='overwrite')
I use:
PySpark version 2.0.0.2
Python 3.x
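For reference, a minimal sketch of the cast-then-write round trip described in the question. The column name and paths are taken from the question; header=True and the integer cast target are assumptions:
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

ddf = spark.read.csv('directory/my_file.csv', header=True)

# Cast the column before writing; parquet stores the schema alongside
# the data, so the cast should survive the round trip.
ddf = ddf.withColumn('column_a', col('column_a').cast(IntegerType()))
ddf.repartition(10).write.parquet('directory/my_parquet_file', mode='overwrite')

# Reading the parquet file back should show column_a as integer.
spark.read.parquet('directory/my_parquet_file').printSchema()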

I read my large files with pandas and do not have this problem. Try using pandas:
http://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.read_csv.html
In[1]: import pandas as pd
In[2]: df = pd.read_csv('directory/my_file.csv')
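If you do go the pandas route, a sketch that also pins the column type at read time (the dtype mapping is an assumption based on the question; bear in mind a ~10 GB CSV has to fit in memory with this approach):
import pandas as pd

# dtype fixes the column's type while reading, so no post-hoc cast is needed.
df = pd.read_csv('directory/my_file.csv', dtype={'column_a': 'int64'})
# Writing to parquet preserves these dtypes (requires pyarrow or fastparquet).
df.to_parquet('directory/my_parquet_file.parquet')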

Related

Prevent pyspark/spark from transforming timestamp when creating a dataframe from a parquet file

I am reading a parquet file into a dataframe. My goal is to verify that my time data (column type in parquet: timestamp) is in ISO 8601 format.
The dates in the time column look like this: 2021-03-13T05:34:27.100Z or 2021-03-13T05:34:27.100+0000
But when I read my dataframe, PySpark transforms 2021-03-13T05:34:27.100Z into 2021-03-13 05:34:27.100
I want to keep the original format, but I can't figure out how to stop PySpark from doing this. I tried to use a custom schema with string for the dates, but I get this error: Parquet column cannot be converted in file file.snappy.parquet. Column: [time], Expected: string, Found: INT96
I also tried using conf parameters, but that didn't work for me.
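One possible workaround, sketched here as an assumption rather than a confirmed fix: read the timestamp column as-is, then render it back to an ISO 8601 string with date_format for verification. The output pattern is an assumption about the desired format:
from pyspark.sql.functions import col, date_format

df = spark.read.parquet('file.snappy.parquet')

# Format the timestamp back into an ISO 8601 string; the original
# parquet value itself stays a timestamp.
df = df.withColumn('time_iso', date_format(col('time'), "yyyy-MM-dd'T'HH:mm:ss.SSSXXX"))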

Pandas df.to_parquet write() got an unexpected keyword argument 'index' when ignoring index column

I am trying to export a pandas dataframe to parquet format using the following:
df.to_parquet("codeset.parquet", index=False)
I don't want an index column in the parquet file. Is this handled automatically by the to_parquet command, or how can I get around it so that no index column is included in the exported parquet?
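A sketch of two ways this is commonly handled; the second is useful if an older pandas rejects the index keyword, which would explain the error in the title. The sample dataframe is a stand-in:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'code': ['A', 'B'], 'value': [1, 2]})  # stand-in data

# pandas >= 0.24 forwards index=False to the parquet engine:
df.to_parquet('codeset.parquet', index=False)

# On older versions, going through pyarrow directly achieves the same:
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_table(table, 'codeset.parquet')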

parquet to spark dataframe with location column

I am trying to read a parquet file into a DataFrame using Spark. My requirement is to create another column in the dataframe derived from the path of the parquet file.
For example, I have parquet files at the following paths in HDFS:
/event/2018-01-01/abc/xyz=10/parquet1.parquet
/event/2018-01-01/abc/xyz=10/parquet2.parquet
I want to read all the files in /event/2018-01-01/abc and create a column "dt" in the dataframe that holds the date from the path. How do I extract the date from the path and create it as a column in the Spark dataframe?
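One common approach (not from the post itself): read the whole directory and derive dt from input_file_name(). The regex assumes the path layout shown above:
from pyspark.sql.functions import input_file_name, regexp_extract

df = spark.read.parquet('/event/2018-01-01/abc')

# input_file_name() exposes each row's source file path; regexp_extract
# pulls the date segment out of it.
df = df.withColumn('dt', regexp_extract(input_file_name(), r'/event/(\d{4}-\d{2}-\d{2})/', 1))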

apache-spark - Unparseable number Issue while writing data to parquet file from Spark DataFrame

I created a dataframe in Spark and loaded the data into the dataframe from AWS S3. When I write the data from the dataframe to parquet, it throws the following error:
Caused by: java.text.ParseException: Unparseable number: "$250.00-$254.99"
    at java.text.NumberFormat.parse(NumberFormat.java:385)
    at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$2.apply$mcF$sp(CSVInferSchema.scala:261)
    at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$2.apply(CSVInferSchema.scala:261)
    at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$2.apply(CSVInferSchema.scala:261)
I do have a column in my dataset with the value "$250.00-$254.99", and this column is defined as StringType in my dataframe.
Any help will be appreciated.
Thanks,
Vivek
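Judging from the trace, the failure happens during CSV type casting, not during the parquet write itself. A hedged sketch of one way to rule out schema inference; the column name, field list, and S3 paths below are hypothetical placeholders:
from pyspark.sql.types import StructType, StructField, StringType

# Declare the price-range column explicitly as a string so Spark never
# tries to parse "$250.00-$254.99" as a number. Field names are assumptions.
schema = StructType([
    StructField('price_range', StringType(), True),
    # ... remaining fields
])

df = spark.read.csv('s3://my-bucket/my-data', schema=schema, header=True)
df.write.parquet('s3://my-bucket/my-output')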

How to write summary of spark sql dataframe to excel file

I have a very large DataFrame with 8000 columns and 50000 rows.
I want to write its summary statistics to an Excel file.
I think we can use the describe() method, but how do I write the result to Excel in a good format? Thanks.
The return type of describe() is a PySpark DataFrame. The easiest way to get the describe output into an Excel-readable format is to convert it to a pandas DataFrame and then write that out as a CSV file, as below:
df.describe().toPandas().to_csv('fileOutput.csv')
If you want it in Excel format, you can try the following:
df.describe().toPandas().to_excel('fileOutput.xls', sheet_name='Sheet1', index=False)
Note: the above requires the xlwt package to be installed (pip install xlwt on the command line).
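One caveat worth adding for this particular case: with 8000 columns, the describe() output is itself 8000 columns wide, and the legacy .xls format caps out at 256 columns. Transposing before export sidesteps that (a sketch, assuming the openpyxl engine for .xlsx):
# describe() returns a 'summary' column plus one column per input column;
# transposing turns the 8000 columns into rows, which Excel handles fine.
summary = df.describe().toPandas().set_index('summary').T
summary.to_excel('fileOutput.xlsx', sheet_name='Sheet1')  # needs openpyxl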
