Following is my source data:
+-----+----------+
|Name |Date      |
+-----+----------+
|Azure|2018-07-26|
|AWS  |2018-07-27|
|GCP  |2018-07-28|
|GCP  |2018-07-28|
+-----+----------+
I have partitioned the data using the Date column:
udl_file_df_read.write.format("csv").partitionBy("Date").mode("append").save(outputPath)
val events = spark.read.format("com.databricks.spark.csv").option("inferSchema","true").load(outputPath)
events.show()
The output column names are (c0, Date). I am not sure why the original column names are missing. How do I retain the column names?
Note: this is not a duplicate question, for the following reasons: here, columns other than the partition column are renamed to c0, and specifying the base path in the options doesn't work.
You get column names like c0 because the CSV format, as used in the question, doesn't preserve column names.
You can try writing with
udl_file_df_read
  .write
  .option("header", "true")
  ...
and similarly read
spark
  .read
  .option("header", "true")
I was able to retain the schema by setting the header option to true when writing my file; I had earlier thought this option could only be used when reading the data.
udl_file_df_read.write.option("header", "true").format("csv").partitionBy("Date").mode("append").save(outputPath)
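For completeness, a minimal sketch of the full round trip, with the header option on both the write and the read (assuming udl_file_df_read, outputPath and spark are defined as in the question):

// Write the CSV with a header row so column names are preserved on disk
udl_file_df_read
  .write
  .format("csv")
  .option("header", "true")
  .partitionBy("Date")
  .mode("append")
  .save(outputPath)

// Read it back, telling Spark to use the header row for column names
val events = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(outputPath)

events.show()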
I believe it is a basic question about data processing in Spark.
Let's assume there is a data frame:
PartitionColumn | ColumnB | ColumnC
First           | value1  |
First           | value2  |
Second          | row     |
...             | ...     | ...
I am going to process this data in parallel using the PartitionColumn, so all rows with the First value go to the First table, rows with the Second value go to the Second table, etc.
Could I ask for a tip on how to achieve this in PySpark (2.x)?
Please refer to the partitionBy() section in this documentation:
df.write \
.partitionBy("PartitionColumn") \
.mode("overwrite") \
.parquet("/path")
Your partitioned data will be saved under folders named after the partition values:
/path/PartitionColumn=First
/path/PartitionColumn=Second
I have a PySpark data frame which I created from a table in SQL Server, and I did some transformations on it. Now I am going to convert it to a dynamic data frame in order to be able to save it as a text file in an S3 bucket. When writing the data frame to the text file, I am going to add another header to that file.
This is my dynamic data frame that will be saved as a file:
AT_DATE    | AMG_INS | MONTHLY_AVG
2021-03-21 | MT.0000 | 234.543
2021_02_12 | MT.1002 | 34.567
I want to add another header on top of that while saving my text file; I need to add another row like this:
HDR,FTP,PC
AT_DATE,AMG_INS,MONTHLY_AVG
2021-03-21,MT.0000,234.543
2021_02_12,MT.1002,34.567
This is a separate row that I need to add at the top of my text file.
To save your dataframe as a text file with additional header lines, you have to perform the following steps:
Prepare your data dataframe
as you can only write one-column dataframes to text, you first concatenate all values into one value column, using the concat_ws Spark SQL function
then you drop all columns but the value column, using the select dataframe method
you add an order column with the literal value 2; it will be used later to ensure that the headers end up at the top of the output text file
Prepare your header dataframe
You create a headers dataframe containing one row per desired header. Each row has two columns:
a value column containing the header as a string
an order column containing the header order as an int (0 for the first header and 1 for the second header)
Write the union of headers and data dataframes
you union your data dataframe with the headers dataframe using the union dataframe method
you use the coalesce(1) dataframe method to get only one output text file
you order your dataframe by your order column using the orderBy dataframe method
you drop your order column
and you write the resulting dataframe
Complete code
Translated into code, it gives you the code snippet below. I call your dynamic dataframe output_dataframe and your Spark session spark, and I write to /tmp/to_text_file:
from pyspark.sql import functions as F
data = output_dataframe \
.select(F.concat_ws(',', F.col("AT_DATE"), F.col("AMG_INS"), F.col("MONTHLY_AVG")).alias('value')) \
.withColumn('order', F.lit(2))
headers = spark.createDataFrame([('HDR,FTP,PC', 0), ('AT_DATE,AMG_INS,MONTHLY_AVG', 1)], ['value', 'order'])
headers.union(data) \
.coalesce(1) \
.orderBy('order')\
.drop('order') \
.write.text("/tmp/to_text_file")
First question here, so I apologise if something isn't clear.
I am new to PySpark; using Databricks, I was trying to read in an Excel file saved as a CSV with the following code:
df = spark.read.csv('/FileStore/tables/file.csv',
sep = ";",
inferSchema = "true",
header = "true")
This works fine, except that some of the observations get null values, while in the Excel file there are no null values. The actual values can be found in other rows.
Maybe better explained with an example:
If the Excel file has the row A B C D
then (for some rows) it appears in the table as:
A B null null
C D null null
My question is how could I fix this? Thanks in advance.
Right now you are setting your delimiter to be a ;, however in a CSV file the delimiter is usually a , (Comma-Separated Values). If you use the Spark CSV reader, the delimiter is automatically set to a comma:
spark.read.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("/FileStore/tables/file.csv")
I have a group of Excel sheets that I am trying to read via Spark through the com.crealytics.spark.excel package.
In my Excel sheet I have a column Survey ID that contains integer IDs.
When I read the data through Spark, I see the values are converted to double values.
How can I retain the integer format of the values while reading from the Excel sheet?
This is what I tried:
val df = spark.read.format("com.crealytics.spark.excel")
.option("location", <somelocation>)
.option("useHeader", "true")
.option("treatEmptyValuesAsNulls", "true")
.option("inferSchema", "true")
.option("addColorColumns","False")
.load()
The actual values are plain integers; this is what is read via Spark:
+-----------+
| Survey ID|
+-----------+
|1.7632889E7|
|1.7632889E7|
|1.7632934E7|
|1.7633233E7|
|1.7633534E7|
|1.7655812E7|
|1.7656079E7|
|1.7930478E7|
|1.7944498E7|
|1.8071246E7|
If I cast the column to integer I get the required formatted data. But is there a better way to do this?
val finalDf=df.withColumn("Survey ID", col("Survey ID").cast(sql.types.IntegerType))
There is a bug (or rather a missing setting) in the Excel library which renders columns with large numbers in scientific notation. See https://github.com/crealytics/spark-excel/issues/126
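Until that is addressed, the cast in the question is a reasonable workaround. A minimal sketch that generalizes it to several columns after reading through spark-excel (the idColumns list is an assumption; extend it with whichever columns are affected):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.LongType

// Cast each affected ID column back to a whole-number type right after reading;
// "Survey ID" is taken from the question, any other names are up to you.
val idColumns = Seq("Survey ID")

val fixedDf = idColumns.foldLeft(df) { (acc, c) =>
  acc.withColumn(c, col(c).cast(LongType))
}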
I am trying to join two dataframes with the same column names and compute some new values. After that I need to drop all the columns of the second table. The number of columns is huge. How can I do it in an easier way? I tried .drop("table2.*"), but this doesn't work.
You can use select with aliases:
df1.alias("df1")
.join(df2.alias("df2"), Seq("someJoinColumn"))
.select($"df1.*", $"someComputedColumn", ...)
or reference the columns through the parent DataFrame:
df1.join(df2, Seq("someJoinColumn")).select(df1("*"), $"someComputedColumn", ...)
Instead of dropping, you can select all the necessary columns that you want to hold on to for further operations, something like below:
val newDataFrame = joinedDataFrame.select($"col1", $"col4", $"col6")
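If the column list is too large to write out by hand, another option is to drop the second dataframe's columns by reference after the join. A sketch using the placeholder names df1, df2 and someJoinColumn from the answers above:

// df2.columns is an Array[String], so this scales to any number of columns.
// Dropping by the Column reference df2(c) removes only the copy that came from df2,
// even when df1 has a column with the same name.
val joined = df1.join(df2, Seq("someJoinColumn"))
val cleaned = df2.columns
  .filterNot(_ == "someJoinColumn") // the join key is shared, so keep it
  .foldLeft(joined)((df, c) => df.drop(df2(c)))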