Is the first row of a Dataset<Row> created from a csv file equal to the first row in the file? - apache-spark

I'm trying to remove the header from a Dataset<Row> that is created from the data in a csv file. There are a bunch of ways to do it.
So I'm wondering: is the first row in the Dataset<Row> always equal to the first row in the file (from which the Dataset<Row> is created)?

When you read the files, the records in the RDD/Dataframe/Dataset are in the same order as they were in the files. But if you perform any operation that requires shuffling, the order changes.
So you can remove the first row right after reading the file, before any operation that requires shuffling.
The best option, however, is to use the csv data source with the header option:
spark.read.option("header", "true").csv(path)
This takes the first row as the header and uses it for the column names.
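For illustration, a minimal PySpark sketch of both options (the path is hypothetical, and option 2 assumes no data row repeats the header value in the first column):

# Option 1: let the csv reader consume the header row directly.
df = spark.read.option("header", "true").csv("/data/input.csv")

# Option 2: read without the header option and drop the header row
# before any shuffling operation.
raw = spark.read.csv("/data/input.csv")
first = raw.first()
no_header = raw.filter(raw["_c0"] != first["_c0"])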

Related

How to ingest multiple csv files into a Spark dataframe?

I am trying to ingest 2 csv files into a single spark dataframe. However, the schemas of these 2 datasets are very different, and when I perform the operation below, I get back only the schema of the second csv, as if the first one didn't exist. How can I solve this? My final goal is to count the total number of words.
paths = ["abfss://lmne.dfs.core.windows.net/csvs/MachineLearning_reddit.csv", "abfss://test1#lmne.dfs.core.windows.net/csvs/bbc_news.csv"]
df0_spark=spark.read.format("csv").option("header","false").load(paths)
df0_spark.write.mode("overwrite").saveAsTable("ML_reddit2")
df0_spark.show()
I tried to load both of the files into a single spark dataframe, but it only gives me back one of the tables.
I have reproduced the above and got the results below.
For example, I have two csv files in DBFS with different schemas. When I execute the above code, I get the same result.
To get the desired schema, enable mergeSchema and header while reading the files.
Code:
df0_spark=spark.read.format("csv").option("mergeSchema","true").option("header","true").load(paths)
df0_spark.show()
If you want to combine the two files without nulls, you need a common identity column; read the files individually and join them with an inner join on that column.
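A minimal sketch of that approach, reusing the paths list from the question and assuming (hypothetically) that both files share an "id" column to join on:

df_reddit = spark.read.option("header", "true").csv(paths[0])
df_news = spark.read.option("header", "true").csv(paths[1])
# An inner join keeps only rows present in both files, so no null-filled columns.
joined = df_reddit.join(df_news, on="id", how="inner")
joined.show()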
The solution that has worked best for me in such cases is to read the distinct files separately and then union them after they have been put into DataFrames. So your code could look something like this:
paths = ["abfss://lmne.dfs.core.windows.net/csvs/MachineLearning_reddit.csv", "abfss://test1#lmne.dfs.core.windows.net/csvs/bbc_news.csv"]
# Load all distinct CSV files
df1 = spark.read.option("header", "false").csv(paths[0])
df2 = spark.read.option("header", "false").csv(paths[1])
# Union DataFrames
combined_df = df1.unionByName(df2, allowMissingColumns=True)
Note: if the column names differ between the files, then every column from the first file that is not present in the second one (and vice versa) will be filled with null values. If the schemas should match, you can always rename the columns before the unionByName step, as sketched below.
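For example, a hypothetical rename before the union, assuming the second file's "text" column corresponds to "body" in the first file:

# Align the differing column name, then union by name.
df2_aligned = df2.withColumnRenamed("text", "body")
combined_df = df1.unionByName(df2_aligned, allowMissingColumns=True)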

spark parquet partitioning which removes the partition column

If I am using df.write.partitionBy("col1").parquet(path),
the partition column is removed from the data files that are written out.
How can I avoid that?
You can duplicate col1 before writing:
from pyspark.sql.functions import col
df.withColumn("partition_col", col("col1")).write.partitionBy("partition_col").parquet(path)
Note that this step is not really necessary: whenever you read Parquet files from a partitioned directory structure, Spark automatically adds the partition column back to the dataframe.
Actually, Spark does not remove the column; it uses that column to organize the files, so that when you read the files back it adds it as a column again and displays it to you in the table. If you check the schema of the table or of the dataframe, you will still see it as a column.
Also, you presumably chose the partition column based on how the data is queried most frequently, so that those reads become faster and more efficient. A short sketch of this round trip follows.
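A minimal sketch of the write/read round trip (the path and column name are hypothetical):

# Write partitioned by col1; the data files themselves no longer store col1,
# it is encoded in the directory names (col1=value/...).
df.write.partitionBy("col1").parquet("/tmp/partitioned_output")

# Reading the partitioned directory back, Spark re-adds col1 to the schema.
restored = spark.read.parquet("/tmp/partitioned_output")
restored.printSchema()  # col1 shows up as a column again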

Parquet Format - split columns in different files

The Parquet documentation explicitly mentions that the design supports splitting the metadata and data into different files, including the possibility that different column groups can be stored in different files.
However, I could not find any instructions on how to achieve that. In my use case I would like to store the metadata in one file, store the data for columns 1-100 in one file, and the data for columns 101-200 in a second file.
Any idea how to achieve this?
If you are using PySpark, it's as easy as this:
df = spark.createDataFrame(...)
df.write.parquet('file_name.parquet')
and it will create a folder called file_name.parquet in the default location in HDFS. You can just create two dataframes, one with columns 1-100 and the other with columns 101-200, and save them separately. The metadata, if you mean the dataframe schema, is saved automatically.
You can select a range of columns like this:
df_first_hundred = df.select(df.columns[:100])
df_second_hundred = df.select(df.columns[100:])
Save them as separate files:
df_first_hundred.write.parquet('df_first_hundred')
df_second_hundred.write.parquet('df_second_hundred')

how does table data get loaded into a dataframe in databricks? row by row or in bulk?

I am new to databricks notebooks and dataframes. I have a requirement to load a few columns (out of many) from a table of around 14 million records into a dataframe. Once the table is loaded, I need to create a new column based on the values present in two columns.
I want to write the logic for the new column along with the select command while loading the table into the dataframe.
Ex:
df = (spark.read.table(tableName)
        .select(columnsList)
        .withColumn('newColumnName', 'logic'))
Will it have any performance impact? Is it better to first load the few columns from the table into the df and then perform the column manipulation on the loaded df?
Does the table data get loaded all at once or row by row into the df? If row by row, then by including the column manipulation logic while reading the table, am I causing any performance degradation?
Thanks in advance!!
This really depends on the underlying format of the table - is it backed by Parquet or Delta, is it an interface to an actual database, etc. In general, Spark tries to read only the necessary data, and if, for example, Parquet (or Delta) is used, this is easier because it's a column-oriented file format, so the data for each column is placed together.
Regarding the question on reading - Spark is lazy by default, so even if you assign df = spark.read.table(....) to a separate variable, then add .select, and then add .withColumn, it won't do anything until you call some action, for example .count, or write your results. Until that time, Spark will just check that the table exists, that your operations are correct, etc. You can always call .explain on the resulting dataframe to see how Spark will perform the operations.
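A minimal sketch of that laziness (the table and column names are hypothetical):

from pyspark.sql.functions import concat_ws

df = (spark.read.table("my_table")                     # nothing is read yet
        .select("col_a", "col_b")                      # still only a plan
        .withColumn("new_col", concat_ws("_", "col_a", "col_b")))

df.explain()  # inspect the plan Spark will run
df.count()    # this action is what actually triggers reading the data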
P.S. I recommend grabbing the free copy of Learning Spark, 2nd Edition that Databricks provides - it will give you a foundation for developing code for Spark/Databricks.

Is there a good (immutable) way to pre-define a column for an RDD, or remove a column from an RDD?

I was trying to add columns to a Spark RDD that I loaded from a csv file, and when I call withColumn() it returns a new RDD. I don't want to force the creation of a new RDD, so can I somehow adjust the RDD schema (the best way I can imagine is to add the column to the schema and then do a map over the rows, adding a value for the new column)? The same question goes for removing a column from an RDD if the schema is already defined by the CSV file.
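For reference, a minimal sketch of the usual immutable pattern, assuming you work with a DataFrame rather than a raw RDD (the path and column names are hypothetical); every add or drop returns a new DataFrame and the loaded one is never modified in place:

from pyspark.sql.functions import lit

df = spark.read.option("header", "true").csv("/data/input.csv")

# "Adding" a column yields a new DataFrame; df itself is unchanged.
with_extra = df.withColumn("new_col", lit(0))

# "Removing" a column likewise returns a new DataFrame.
without_col = df.drop("some_col")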
