Read in an Excel file as a CSV in PySpark

First question here, so I apologise if something isn't clear.
I am new to PySpark, and using Databricks I was trying to read in an Excel file saved as a CSV with the following code:
df = spark.read.csv('/FileStore/tables/file.csv',
    sep=";",
    inferSchema="true",
    header="true")
This works fine, except that some of the observations get null values, while in the Excel file there are no null values; the actual values end up in other rows.
Maybe this is better explained with an example:
If the Excel file has the row A B C D
then for some rows the table becomes:
A B null null
C D null null
My question is: how could I fix this? Thanks in advance.

Right now you are setting your delimiter to ;, however in a CSV file the delimiter is usually a , (Comma-Separated Values). If you use the Spark CSV reader, the delimiter is set to a comma by default:
spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("/FileStore/tables/file.csv")

Related

Reading Excel files in PySpark with the 3rd row as header

I want to read Excel files as a Spark dataframe with the 3rd row as the header. The syntax to read Excel files as a Spark dataframe with the 1st row as the header is:
s_df = spark.read.format("com.crealytics.spark.excel") \
.option("header", "true") \
.option("inferSchema", "true") \
.load(path + 'Sales.xlsx')
and the equivalent syntax to read it as a pandas dataframe with the 3rd row as the header is:
p_df = pd.read_excel(path + 'Sales.xlsx',header=3)
I want to do the same thing in PySpark, that is, read Excel files as a Spark dataframe with the 3rd row as the header.
Use the dataAddress option to specify the cell/row where the data is located. As you need to skip two rows, your data (including the header) starts from cell A3.
s_df = spark.read.format("com.crealytics.spark.excel") \
.option("header", "true") \
.option("inferSchema","true") \
.option("dataAddress", "'Sheet1'!A3") \
.load("yourfilepath")
Also, note that if your first two rows are empty, then dataAddress does not have to be specified; the leading null rows will be skipped by default.
Check the spark-excel documentation for more details.
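As a further illustration (the range and path below are assumptions, not from the question), dataAddress can also name a sheet and bound the range explicitly:
s_df = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("dataAddress", "'Sheet1'!A3:D100") \
    .load(path + 'Sales.xlsx')   # assumed range and file path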

Write a dataframe and strings in one single CSV file

I want to export a dataframe (500 rows, 2 columns) from Python to a CSV file.
However, I need to ensure that the first 20 rows have some text/strings written, and the dataframe (500 rows, 2 columns) should then start from the 21st row onwards.
I referred to the following link: Skip first rows when writing csv (pandas.DataFrame.to_csv). However, it does not satisfy my requirements.
Can somebody please let me know how to do this?
Get the first 20 rows and save them in another dataframe
Check if there are any null values
If there are no null values, remove the first 20 rows
Save df as a CSV file
df2 = df.head(20)                # first 20 rows
df2 = df2.isnull().values.any()  # True if any of those rows contain nulls
if not df2:
    df = df[20:]                 # remove the first 20 rows
df.to_csv('updated.csv')
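The question itself asks for 20 lines of text followed by the dataframe from the 21st row onwards. A minimal sketch of that, assuming pandas, a hypothetical notes list of 20 strings, and a placeholder dataframe:
import pandas as pd

notes = [f"note line {i}" for i in range(1, 21)]        # hypothetical 20 lines of text
df = pd.DataFrame({"a": range(500), "b": range(500)})   # placeholder 500-row, 2-column dataframe

with open("updated.csv", "w", newline="") as f:
    for line in notes:
        f.write(line + "\n")                            # rows 1-20: plain text
    df.to_csv(f, index=False)                           # dataframe (header included) starts at row 21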

How to preserve spaces in data (4 spaces) for a column while writing to a CSV file in PySpark

I have an input CSV file with one record. When I read the file in PySpark, the dataframe has three columns a, b, c respectively. a and c have data and b has data that is 4 spaces. While writing the file to CSV, the 4-spaces data is lost and it is written to the file as an empty string.
Input file:
aaaa, , bbbb
Output file:
aaaa,"", bbbb
How can I preserve the 4-spaces data as is?
When writing, you need to set these options:
df.write \
    .option("ignoreLeadingWhiteSpace", "false") \
    .option("ignoreTrailingWhiteSpace", "false") \
    .csv(path)
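A round-trip sketch (paths are assumptions): reading keeps the whitespace by default, so only the write side needs the options; the length check just confirms the spaces survived the read:
from pyspark.sql.functions import col, length

df = spark.read.csv("input.csv")                      # assumed input path; no header, so columns are _c0, _c1, _c2
df.select(length(col("_c1")).alias("len_b")).show()   # expect 4 (per the question) for the all-spaces column

df.write \
    .option("ignoreLeadingWhiteSpace", "false") \
    .option("ignoreTrailingWhiteSpace", "false") \
    .csv("output_dir")                                # assumed output directory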

Reading an Excel file in Spark with an integer column

I have a group of Excel sheets that I am trying to read in Spark through the com.crealytics.spark.excel package.
In my Excel sheet I have a column Survey ID that contains integer IDs.
When I read the data through Spark I see the values are converted to double values.
How can I retain the format of the integer values while reading from the Excel sheet?
This is what I tried:
val df = spark.read.format("com.crealytics.spark.excel")
.option("location", <somelocation>)
.option("useHeader", "true")
.option("treatEmptyValuesAsNulls", "true")
.option("inferSchema", "true")
.option("addColorColumns","False")
.load()
Actual value: plain integer IDs, e.g. 17632889, 17632934, 17633233, ...
Value read via Spark:
+-----------+
|  Survey ID|
+-----------+
|1.7632889E7|
|1.7632889E7|
|1.7632934E7|
|1.7633233E7|
|1.7633534E7|
|1.7655812E7|
|1.7656079E7|
|1.7930478E7|
|1.7944498E7|
|1.8071246E7|
+-----------+
If I cast the column to integer I get the required formatted data. But is there a better way to do this?
val finalDf=df.withColumn("Survey ID", col("Survey ID").cast(sql.types.IntegerType))
There is a bug (or rather a missing setting) in the Excel library which renders columns with large numbers in scientific notation. See https://github.com/crealytics/spark-excel/issues/126
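Until that is addressed upstream, casting after the read is the usual workaround. A PySpark equivalent of the cast above (a sketch, assuming the same column name):
from pyspark.sql.functions import col

final_df = df.withColumn("Survey ID", col("Survey ID").cast("integer"))
final_df.select("Survey ID").show(5)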

Why are columns renamed as c0, c1 in the Spark partitioned data?

Following is my source data,
+-----+----------+
|Name |Date      |
+-----+----------+
|Azure|2018-07-26|
|AWS  |2018-07-27|
|GCP  |2018-07-28|
|GCP  |2018-07-28|
+-----+----------+
I have partitioned the data using the Date column:
udl_file_df_read.write.format("csv").partitionBy("Date").mode("append").save(outputPath)
val events = spark.read.format("com.databricks.spark.csv").option("inferSchema","true").load(outputPath)
events.show()
The output column names are (c0, Date). I am not sure why the original column name is missing; how do I retain the column names?
Note: this is not a duplicate question, because here columns other than the partition columns are renamed as c0, and specifying the base path in options doesn't work.
You get column names like c0 because the CSV format, as used in the question, doesn't preserve column names.
You can try writing with
udl_file_df_read
.write
.option("header", "true")
...
and similarly reading with
spark
.read
.option("header", "true")
I was able to retain the schema by setting the header option to true when I write my file; I had earlier thought this option could only be used to read the data.
udl_file_df_read.write.option("header", "true").format("csv").partitionBy("Date").mode("append").save(outputPath)
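For reference, the full round trip in PySpark looks roughly like this (a sketch; the dataframe and output path come from the question, the rest is standard DataFrameReader/Writer API):
# write with header so the non-partition column names survive
udl_file_df_read.write \
    .option("header", "true") \
    .format("csv") \
    .partitionBy("Date") \
    .mode("append") \
    .save(outputPath)

# read back with header and schema inference
events = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv(outputPath)
events.show()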
