I have a group of Excel sheets that I am trying to read via Spark through the com.crealytics.spark.excel package.
In my Excel sheet I have a column Survey ID that contains integer IDs.
When I read the data through Spark, I see the values are converted to doubles.
How can I retain the integer values while reading from the Excel sheet?
This is what I tried:
val df = spark.read.format("com.crealytics.spark.excel")
.option("location", <somelocation>)
.option("useHeader", "true")
.option("treatEmptyValuesAsNulls", "true")
.option("inferSchema", "true")
.option("addColorColumns","False")
.load()
Actual values in the Excel sheet: plain integers such as 17632889, 17632934, 17633233, ...
Values read via Spark:
+-----------+
|  Survey ID|
+-----------+
|1.7632889E7|
|1.7632889E7|
|1.7632934E7|
|1.7633233E7|
|1.7633534E7|
|1.7655812E7|
|1.7656079E7|
|1.7930478E7|
|1.7944498E7|
|1.8071246E7|
+-----------+
If I cast the column to integer I get the correctly formatted data, but is there a better way to do this?
val finalDf=df.withColumn("Survey ID", col("Survey ID").cast(sql.types.IntegerType))
There is a bug (or rather a missing setting) in the Excel library that renders columns containing large numbers in scientific notation. See https://github.com/crealytics/spark-excel/issues/126
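Until that is addressed in the library, one workaround is to skip schema inference and supply an explicit schema so the column is read as an integer from the start. Below is a minimal PySpark sketch of that idea, assuming the connector honours a schema passed via the standard .schema() reader method; <somelocation> is a placeholder from the question, and the rest of the sheet's columns would need to be added to the schema:
from pyspark.sql.types import IntegerType, StructField, StructType

# Explicit schema so "Survey ID" is read as an integer instead of being
# inferred as a double; add the rest of the sheet's columns here as well.
schema = StructType([
    StructField("Survey ID", IntegerType(), True),
])

df = spark.read.format("com.crealytics.spark.excel") \
    .option("location", "<somelocation>") \
    .option("useHeader", "true") \
    .option("treatEmptyValuesAsNulls", "true") \
    .schema(schema) \
    .load()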
I believe this is a basic question about data processing in Spark.
Let's assume there is a data frame:
PartitionColumn | ColumnB | ColumnC
First           | value1  | ...
First           | value2  | ...
Second          | row     | ...
...             | ...     | ...
I am going to process this data in parallel using the PartitionColumn, so all rows with the First value go to the First table, the rows with the Second value go to the Second table, etc.
Could I ask for a tip on how to achieve this in PySpark (2.x)?
Please refer to the partitionBy() section in the documentation:
df.write \
.partitionBy("PartitionColumn") \
.mode("overwrite") \
.parquet("/path")
Your partitioned data will be saved under folders named after the partition column and value:
/path/PartitionColumn=First
/path/PartitionColumn=Second
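If you then want to process each group on its own, you can read the dataset back and filter on the partition column; Spark only scans the matching folder. A minimal sketch, reusing the path and column names from the snippet above:
# Read the partitioned dataset back; filtering on the partition column
# prunes the read down to the matching /path/PartitionColumn=... folder.
df = spark.read.parquet("/path")

first_df = df.filter(df["PartitionColumn"] == "First")
second_df = df.filter(df["PartitionColumn"] == "Second")

first_df.show()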
I have a PySpark data frame which I created from a table in SQL Server, and I did some transformations on it. Now I am going to convert it to a dynamic data frame in order to be able to save it as a text file in an S3 bucket. When I write the data frame to a text file, I want to add another header to that file.
This is my dynamic data frame that will be saved as a file:
AT_DATE    | AMG_INS | MONTHLY_AVG
2021-03-21 | MT.0000 | 234.543
2021_02_12 | MT.1002 | 34.567
I want to add another header on top of that while saving my text file; I need to add another row like this:
HDR,FTP,PC
AT_DATE,AMG_INS,MONTHLY_AVG
2021-03-21,MT.0000,234.543
2021_02_12,MT.1002,34.567
This is a separate row that I need to add on top of my text file.
To save your dataframe as a text file with additional header lines, you have to perform the following steps:
Prepare your data dataframe:
As you can only write single-column dataframes to text, you first concatenate all values into one value column, using the concat_ws Spark SQL function.
Then you drop all columns but the value column, using the select dataframe method.
You add an order column with the literal value 2; it will be used later to ensure that the headers end up at the top of the output text file.
Prepare your headers dataframe:
You create a headers dataframe containing one row per desired header. Each row has two columns:
a value column containing the header as a string,
an order column containing the header order as an int (0 for the first header and 1 for the second header).
Write the union of the headers and data dataframes:
You union your data dataframe with the headers dataframe using the union dataframe method.
You use the coalesce(1) dataframe method to have only one text file as output.
You order your dataframe by the order column using the orderBy dataframe method.
You drop the order column.
And you write the resulting dataframe.
Complete code
Translated into code, this gives the snippet below. I call your dynamic dataframe output_dataframe and your Spark session spark, and I write to /tmp/to_text_file:
from pyspark.sql import functions as F

# Data rows: concatenate all columns into a single "value" column,
# and add an "order" column so they sort after the two header rows.
data = output_dataframe \
    .select(F.concat_ws(',', F.col("AT_DATE"), F.col("AMG_INS"), F.col("MONTHLY_AVG")).alias('value')) \
    .withColumn('order', F.lit(2))

# Header rows, with their position at the top of the file.
headers = spark.createDataFrame([('HDR,FTP,PC', 0), ('AT_DATE,AMG_INS,MONTHLY_AVG', 1)], ['value', 'order'])

headers.union(data) \
    .coalesce(1) \
    .orderBy('order') \
    .drop('order') \
    .write.text("/tmp/to_text_file")
I want to read Excel files as a Spark dataframe with the 3rd row as the header. The syntax to read Excel files as a Spark dataframe with the 1st row as the header is:
s_df = spark.read.format("com.crealytics.spark.excel") \
.option("header", "true") \
.option("inferSchema", "true") \
.load(path + 'Sales.xlsx')
and the equivalent syntax to read as a pandas dataframe with the 3rd row as the header is:
p_df = pd.read_excel(path + 'Sales.xlsx', header=3)
I want to do the same thing in PySpark, that is, to read Excel files as a Spark dataframe with the 3rd row as the header.
Use the dataAddress option to specify the cell/row where the data is located. As you need to skip two rows, your data (including the header) starts from cell A3.
s_df = spark.read.format("com.crealytics.spark.excel") \
.option("header", "true") \
.option("inferSchema","true") \
.option("dataAddress", "'Sheet1'!A3") \
.load("yourfilepath")
Also, note that if your first two rows are empty, then dataAddress does not have to be specified; the leading null rows will be skipped by default.
Check the documentation here
First question here, so I apologise if something isn't clear.
I am new to PySpark, and using Databricks I was trying to read in an Excel file saved as a CSV with the following code:
df = spark.read.csv('/FileStore/tables/file.csv',
sep = ";",
inferSchema = "true",
header = "true")
This works fine, except that some of the observations get null values, while in the Excel file there are no null values; the actual values can be found in other rows.
Maybe this is better explained with an example:
if the Excel file has the row A B C D,
then in the table it becomes (for some rows):
A B null null
C D null null
My question is: how can I fix this? Thanks in advance.
Right now you are setting your delimiter to be a ;, however in a CSV file the delimiter is usually a , (Comma-Separated Values). If you use the Spark CSV reader without specifying a separator, the delimiter defaults to a comma:
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("/FileStore/tables/file.csv")
The following is my source data:
+-----+----------+
| Name|      Date|
+-----+----------+
|Azure|2018-07-26|
|  AWS|2018-07-27|
|  GCP|2018-07-28|
|  GCP|2018-07-28|
+-----+----------+
I have partitioned the data using the Date column:
udl_file_df_read.write.format("csv").partitionBy("Date").mode("append").save(outputPath)
val events = spark.read.format("com.databricks.spark.csv").option("inferSchema","true").load(outputPath)
events.show()
The output column names are (c0, Date). I am not sure why the original column name is missing, and how can I retain the column names?
Note: this is not a duplicate question, because here columns other than the partition column are renamed to c0, and specifying the base path in options doesn't work.
You get column names like c0 because the CSV format, as used in the question, doesn't preserve column names (no header row is written).
You can try writing with
udl_file_df_read
    .write
    .option("header", "true")
    ...
and similarly read with
spark
    .read
    .option("header", "true")
I was able to retain the schema by setting the header option to true when writing my file; I had earlier thought I could use this option only when reading data.
udl_file_df_read.write.option("header", "true").format("csv").partitionBy("Date").mode("append").save(outputPath)
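For completeness, a PySpark sketch of the same round trip (outputPath is a placeholder, as in the question); the header option is needed on both the write and the read:
# Write with a header row so the column names survive the CSV round trip.
udl_file_df_read.write \
    .option("header", "true") \
    .format("csv") \
    .partitionBy("Date") \
    .mode("append") \
    .save(outputPath)

# Read the partitioned CSVs back, using the header row for the column names.
events = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(outputPath)

events.show()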