Is there a good (immutable) way to pre-define a column for an RDD, or remove a column from an RDD? - apache-spark

I was trying to add columns to a Spark RDD that I loaded from a CSV file, and when I call withColumn() it returns a new RDD. I don't want to force the creation of a new RDD, so can I somehow adjust the RDD schema (the best way I can imagine is to add the column to the schema and then map over the rows, adding a value for the new column)? The same question goes for removing a column from an RDD when the schema is already defined by the CSV file.

Related

spark parquet partitioning which removes the partition column

If I am using df.write.partitionBy(col1).parquet(path), the partition column is removed from the written data.
How do I avoid that?
You can duplicate col1 before writing:
df.withColumn("partition_col", col("col1")).write.partitionBy("partition_col").parquet(path)
Note that this step is not really necessary, because whenever you read a Parquet file in a partitioned directory structure, Spark will automatically add that as a new column to the dataframe.
Actually, Spark does not remove the column; it uses that column to organize the files, so when you read them back it adds it as a column again and displays it in the table. If you check the schema of the table or of the dataframe, you will still see it as a column.
Also, you presumably chose the partition column because you know how the table is queried most often; partitioning on that column makes those reads faster and more efficient.
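To make the round trip concrete, here is a minimal PySpark sketch (the output path and column names are made up for the example):
from pyspark.sql.functions import col

# Duplicate the partition column so it also survives inside the data files.
df.withColumn("partition_col", col("col1")) \
    .write.partitionBy("partition_col").parquet("/tmp/partitioned_output")

# When reading the partitioned directory back, Spark infers partition_col
# from the directory names, so it appears in the schema again.
spark.read.parquet("/tmp/partitioned_output").printSchema()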

How does table data get loaded into a dataframe in Databricks? Row by row or in bulk?

I am new to Databricks notebooks and dataframes. I have a requirement to load a few columns (out of many) from a table of around 14 million records into a dataframe. Once the table is loaded, I need to create a new column based on the values present in two existing columns.
I want to write the logic for the new column along with the select command while loading the table into the dataframe.
Ex:
df = (spark.read.table(tableName)
        .select(columnsList)
        .withColumn('newColumnName', 'logic'))
Will it have any performance impact? Is it better to first load only the few columns of the table into the df and then perform the column manipulation on the loaded df?
Does the table data get loaded all at once or row by row into the df? If row by row, then by including the column manipulation logic while reading the table, am I causing any performance degradation?
Thanks in advance!!
This really depends on the underlying format of the table - is it backed by Parquet or Delta, or is it an interface to an actual database, etc. In general, Spark tries to read only the necessary data, and if, for example, Parquet (or Delta) is used, then this is easier because it is a column-oriented file format, so the data for each column is stored together.
Regarding the question on reading - Spark is lazy by default, so even if you put df = spark.read.table(....) into a separate variable, then add .select, and then add .withColumn, it won't do anything until you call some action, for example .count, or write your results. Until that time, Spark will just check that the table exists, that your operations are correct, etc. You can always call .explain on the resulting dataframe to see how Spark will perform the operations.
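As a rough illustration of that laziness (the table and column names here are invented for the example, and spark is the session provided by the notebook):
from pyspark.sql import functions as F

df = (spark.read.table("my_database.my_table")     # nothing is read yet
        .select("col_a", "col_b")                  # still only a plan
        .withColumn("col_sum", F.col("col_a") + F.col("col_b")))

df.explain()         # prints the physical plan; still no data is read
result = df.count()  # only this action actually triggers reading the table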
P.S. I recommend grabbing a free copy of Learning Spark, 2nd edition, which is provided by Databricks - it will give you a foundation for developing code for Spark/Databricks.

What's the difference between RDD and Dataframe in Spark? [duplicate]

This question already has answers here:
Difference between DataFrame, Dataset, and RDD in Spark
(14 answers)
Closed 3 years ago.
Hi, I am relatively new to Apache Spark. I want to understand the difference between RDDs, dataframes, and datasets.
For example, I am pulling data from s3 bucket.
df=spark.read.parquet("s3://output/unattributedunattributed*")
In this case, when I am loading data from S3, what would be the RDD? Also, since an RDD is immutable but I can change the value of df, df couldn't be an RDD.
I'd appreciate it if someone could explain the difference between RDDs, dataframes, and datasets.
df=spark.read.parquet("s3://output/unattributedunattributed*")
With this statement, you are creating a dataframe.
To create an RDD, use
rdd = spark.sparkContext.textFile("s3://output/unattributedunattributed*")
RDD stands for Resilient Distributed Dataset. It is a read-only, partitioned collection of records and is the fundamental data structure of Spark. It allows a programmer to perform in-memory computations.
In a DataFrame, data is organized into named columns, like a table in a relational database. It is an immutable, distributed collection of data. A DataFrame in Spark allows developers to impose a structure onto a distributed collection of data, allowing a higher-level abstraction.
If you want to apply a map or filter to the whole dataset, use an RDD.
If you want to work on an individual column or perform operations/calculations on a column, then use a DataFrame.
For example, if you want to replace 'A' with 'B' across the whole data, then an RDD is useful.
rdd = rdd.map(lambda x: x.replace('A', 'B'))
If you want to update the data type of a column, then use a DataFrame.
dff = dff.withColumn("LastmodifiedTime_timestamp", col('LastmodifiedTime_time').cast('timestamp'))
An RDD can be converted into a DataFrame and vice versa.
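A quick sketch of the conversion in both directions, reusing the dff dataframe and the spark session from the context above:
# DataFrame -> RDD of Row objects
rdd_of_rows = dff.rdd

# RDD of Row objects -> DataFrame, either via toDF() ...
df_again = rdd_of_rows.toDF()
# ... or explicitly through the SparkSession
df_again = spark.createDataFrame(rdd_of_rows)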

Avoid duplicate partitions in the data lake

When I write a parquet file, I'm passing one of the column values as the partition, but when the dataframe is empty it doesn't create the partition (which is expected) and does nothing. To overcome this, if I pass
df.partitionOf("department=One").write(df)
and the dataframe is NOT empty, it creates two levels of partitions:
location/department=One/department=One
Is there any way to skip one if the partition already exists to avoid duplicates?
What is the path you are passing while writing the dataframe? I didn't find a partitionOf function for a Spark dataframe.
I think this should work for your case
df.write.mode("append").partitionBy("department").parquet("location/")
If you don't want to append data for the partitions which are already there, find the partition keys from the existing parquet, drop the rows with those partition keys, and write the remaining data in append mode.
Scala code:
val dfi = spark.read.parquet(pathPrefix + finalFile).select(col("department"))
val finalDf = df.join(dfi, df.col("department") === dfi.col("department"), "left_outer")
  .where(dfi.col("department").isNull)
  .select(df.columns.map(df(_)): _*)
finalDf.write.mode("append").partitionBy("department").parquet("location/")
You can optimize the first step (creating dfi) by finding the partition keys in your dataframe and keeping only those partition keys for which a path already exists.
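For readers working in PySpark rather than Scala, a rough sketch of the same idea (the target path and the department column are carried over from the example above, so treat them as assumptions); the existing keys are collected first so the write does not lazily re-read the path it is writing to:
from pyspark.sql import functions as F

# Partition keys that already exist in the target location
# (assumes "location/" already contains data).
existing_keys = [
    row["department"]
    for row in spark.read.parquet("location/").select("department").distinct().collect()
]

# Keep only rows whose department is not already present, then append them.
new_rows = df.filter(~F.col("department").isin(existing_keys))
new_rows.write.mode("append").partitionBy("department").parquet("location/")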

Is the first row of a Dataset<Row> created from a CSV file equal to the first row in the file?

I'm trying to remove the header from a Dataset<Row> which is created from the data in a CSV file. There are a bunch of ways to do it.
So, I'm wondering whether the first row in the Dataset<Row> is always equal to the first row in the file (from which the Dataset<Row> is created)?
When you read the files, the records in the RDD/Dataframe/Dataset are in the same order as they were in the files. But if you perform any operation that requires a shuffle, the order changes.
So you can remove the first row right after reading the file and before any operation that requires shuffling.
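A common sketch of that manual approach at the RDD level (the path is just a placeholder; note this also drops any later line that happens to be identical to the header):
# Read the raw lines, capture the header, and filter it out before any shuffle.
lines = spark.sparkContext.textFile("/path/to/file.csv")
header = lines.first()
data = lines.filter(lambda line: line != header)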
The best option would be to use the csv data source:
spark.read.option("header", true).csv(path)
This will take the first row as the header and use it for the column names.

Resources