Combine ‘n’ data files to make a single Spark Dataframe [duplicate] - apache-spark

This question already has answers here:
How to perform union on two DataFrames with different amounts of columns in Spark?
(22 answers)
Closed 4 years ago.
I have ‘n’ number of delimited data sets, CSVs may be. But one of them might have a few extra columns. I am trying to read all of them as dataframes and put them in one. How can I merge them as an unionAll and make them a single dataframe ?
P.S: I can do this when I know what is ‘n’. And, it’s a simple unionAll when the column counts are equal.

There is another approach other than the solutions mentioned in first two comments.
Read all CSV files to a single RDD producing RDD[String].
Map to create Rdd[Row] with appropriate length while filling missing values with null or any suitable values.
Create dataFrame schema.
Create DataFrame from RDD[Row] using created Schema.
This may not be a good approach if the CSVs has large number of columns.
Hope this helps

Related

What's the difference between RDD and Dataframe in Spark? [duplicate]

This question already has answers here:
Difference between DataFrame, Dataset, and RDD in Spark
(14 answers)
Closed 3 years ago.
Hi I am relatively new to apache spark. I wanted to understand the difference between RDD,dataframe and datasets.
For example, I am pulling data from s3 bucket.
df=spark.read.parquet("s3://output/unattributedunattributed*")
In this case when I am loading data from s3, what would be RDD? Also since RDD is immutable , I can change value for df so df couldn't be rdd.
Appreciate if someone can explain the difference between RDD,dataframe and datasets.
df=spark.read.parquet("s3://output/unattributedunattributed*")
With this statement, you are creating a data frame.
To create RDD use
df=spark.textFile("s3://output/unattributedunattributed*")
RDD stands for Resilient Distributed Datasets. It is Read-only partition collection of records. RDD is the fundamental data structure of Spark. It allows a programmer to perform in-memory computations
In Dataframe, data organized into named columns. For example a table in a relational database. It is an immutable distributed collection of data. DataFrame in Spark allows developers to impose a structure onto a distributed collection of data, allowing higher-level abstraction.
If you want to apply a map or filter to the whole dataset, use RDD
If you want to work on an individual column or want to perform operations/calculations on a column then use Dataframe.
for example, if you want to replace 'A' in whole data with 'B'
then RDD is useful.
rdd = rdd.map(lambda x: x.replace('A','B')
if you want to update the data type of the column, then use Dataframe.
dff = dff.withColumn("LastmodifiedTime_timestamp", col('LastmodifiedTime_time').cast('timestamp')
RDD can be converted into Dataframe and vice versa.

Spark DataFrame how to change permutation of one column without join [duplicate]

This question already has answers here:
Updating a dataframe column in spark
(5 answers)
Closed 3 years ago.
I am trying to use Pyspark to permute a column in a dataframe, aka shuffle all values for a single column across rows.
I am trying to avoid the solution where the column gets split and assigned an index column before being joined back to the original dataframe which also has an added index column. Primarily because of my understanding (which could be very wrong) that joins are bad in terms of runtime for a large dataset (millions of rows).
# for some dataframe spark_df
new_df = spark_df.select(colname).sort(colname)
new_df.show() # column values sorted nicely
spark_df.withColumn("ha", new_df[colname]).show()
# column "ha" no longer sorted and has same permutation as spark_df.colname
Thanks for any guidance in helping me understand this, I am a complete beginner with this :)
Edit: Sorry if I was being unclear in the question, I just wanted to replace a column with the sorted version of it without doing join. Thank you for pointing out that dfs are not mutable, but even doing spark_df.withColumn("ha", spark_df.select(colname).sort(colname)[colname]).show() shows column 'ha' as having the same permutation as 'colname' when doing sort on the column itself shows a different permutation. The question is mainly about why the permutation stays the same in the new column 'ha', not about how to replace a column. Thanks again! (Also changed the title to better reflect the question)
Spark dataframes and RDDs are immutable. Every time you make a transformation, a new one is created. Therefore, when you do new_df = spark_df.select(colname).sort(colname), spark_df remains unchanged. Only new_df is sorted. This is why spark_df.withColumn("ha", new_df[colname]) returns an unsorted dataframe.
Try new_df.withColumn("ha", new_df[colname]) instead.

Pyspark dataframe partition number [duplicate]

This question already has answers here:
Spark dataframe write method writing many small files
(6 answers)
Closed 4 years ago.
I have a data frame df, I want to partition it by date (a column in the df).
I have the code below:
df.write.partitionBy('date').mode(overwrite').orc('path')
Then under the path above, there are bunch folders, e.g. date=2018-10-08 etc...
But under the folder date=2018-10-08, there are 5 files, what I want is to reduce to only one file inside the date=2018-10-08 folder. How to do that? I still want it partitioned by date.
Thank you in advance!
In order to have 1 file per partition folder you will need to repartition the data by the partition column before writing. This will shuffle the data so the dates are in the same DataFrame/RDD partitions:
df.repartition('date').write.partitionBy('date').mode(overwrite').orc('path')

How to use Spark dataset GroupBy() [duplicate]

This question already has answers here:
How to select the first row of each group?
(9 answers)
Closed 4 years ago.
I have a Hive table with the schema:
id bigint
name string
updated_dt bigint
There are many records having same id, but different name and updated_dt. For each id, I want to return the record (whole row) with the largest updated_dt.
My current approach is:
After reading data from Hive, I can use case class to convert data to RDD, and then use groupBy() to group by all the records with the same id together, and later picks the one with the largest updated_dt. Something like:
dataRdd.groupBy(_.id).map(x => x._2.toSeq.maxBy(_.updated_dt))
However, since I use Spark 2.1, it first convert data to dataset using case class, and then the above approach coverts data to RDD in order to use groupBy(). There may be some overhead converting dataset to RDD. So I was wondering if I can achieve this at the dataset level without converting to RDD?
Thanks a lot
Here is how you can do it using Dataset:
data.groupBy($"id").agg(max($"updated_dt") as "Max")
There is not much overhead if you convert it to RDD. If you choose to do using RDD, It can be more optimized by using .reduceByKey() instead of using .groupBy():
dataRdd.keyBy(_.id).reduceByKey((a,b) => if(a.updated_dt > b.updated_dt) a else b).values

More efficient way to Iterate & compute over columns [duplicate]

This question already has answers here:
Spark columnar performance
(2 answers)
Closed 5 years ago.
I have a very wide dataframe > 10,000 columns and I need to compute the percent of nulls in each. Right now I am doing:
threshold=0.9
for c in df_a.columns[:]:
if df_a[df_a[c].isNull()].count() >= (df_a.count()*threshold):
# print(c)
df_a=df_a.drop(c)
Of course this is a slow process and crashes on occasion. Is there a more efficient method I am missing?
Thanks!
There are few strategies you can take depending upon the size of the dataframe. The code looks good to me. You need to go through each column and count the number of null values.
One strategy is to cache the input dataframe. That will enable faster filtering. This however works if the dataframe is not huge
Also
df_a=df_a.drop(c)
I am little skeptical with this as this is changing the dataframe in the loop. Better to keep the null column names and drop from the dataframe later in a separate loop.
If the dataframe is huge and you can't cache it completely you can partition the dataframe in to some finite manageable columns. Like take 100 column each and cache that smaller dataframe and run the analysis 100 times in a loop.
Now you might want to keep track of the analyzed column list separate from the yet to be analyzed columns in this case. That way even if the job fails you can start the analysis from the rest of the columns.
You should avoid iterating when using pyspark, since it does not distribute the computations anymore.
Using count on a column will compute the count of non-null elements.
threshold = 0.9
import pyspark.sql.functions as psf
count_df = df_a\
.agg(*[psf.count("*").alias("count")]+ [psf.count(c).alias(c) for c in df_a.columns])\
.toPandas().transpose()
The first element is the number of lines in the dataframe:
total_count = count_df.iloc[0, 0]
kept_cols = count_df[count_df[0] > (1 - threshold)*total_count].iloc[1:,:]
df_a.select(list(kept_cols.index))

Resources