Writing Spark Dataframe directly to HIVE is taking too much time - apache-spark

I am writing 2 dataframes from Spark directly to Hive using PySpark. The first df has only one row and 7 columns. The second df has 20M rows and 20 columns. It took 10 mins to write the 1 df(1row) and around 30Mins to write 1M rows in the second DF. I dont know how long it will take to write the entire 20M, I killed the code before it can complete.
I have tried two approaches to write the df. I also cached the df to see if it would make the write faster but didn't seem to have any effect:
df_log.write.mode("append").insertInto("project_alpha.sends_log_test")
2nd Method
#df_log.registerTempTable("temp2")
#df_log.createOrReplaceTempView("temp2")
sqlContext.sql("insert into table project_alpha.sends_log_test select * from temp2")
In the 2nd approach I tried using both registerTempTable() as well as createOrReplaceTempView() but there was no difference in the run time.
Is there a way to write it faster or more efficiently. Thanks.

Are you sure the final tables are cached? It might be the issue that before writing the data it calculates the whole pipeline. You can check that in terminal/console where Spark runs.
Also, please check if the table you append to on Hive is not a temporary view - then it could be the issue of recalculating the view before appending new rows.
When I write data to Hive I always use:
df.write.saveAsTable('schema.table', mode='overwrite')
Please try:
df.write.saveAsTable('schema.table', mode='append')

Its bad idea(or design) to do insert into hive table. You have to save it as file and create a table on top of it or add as a partition to existing table.
Can you please try that route.

try repartition to small number of files lets say like .repartition(2000) and then write to hive. Large number of partitions in spark sometimes takes time to write.

Related

Spark dataframe distinct write is increasing the output size by almost 10 fold

I have a case where i am trying to write some results using dataframe write into S3 using the below query with input_table_1 size is 13 Gb and input_table_2 as 1 Mb
input_table_1 has columns account, membership and
input_table_2 has columns role, id , membership_id, quantity, start_date
SELECT
/*+ BROADCASTJOIN(input_table_2) */
account,
role,
id,
quantity,
cast(start_date AS string) AS start_date
FROM
input_table_1
INNER JOIN
input_table_2
ON array_contains(input_table_1.membership, input_table_2.membership_id)
where membership array contains list of member_ids
This dataset write using Spark dataframe is generating around 1.1TiB of data in S3 with around 700 billion records.
We identified that there are duplicates and used dataframe.distinct.write.parquet("s3path") to remove the duplicates . The record count is reduced to almost 1/3rd of the previous total count with around 200 billion rows but we observed that the output size in S3 is now 17.2 TiB .
I am very confused how this can happen.
I have used the following spark conf settings
spark.sql.shuffle.partitions=20000
I have tried to do a coalesce and write to s3 but it did not work.
Please suggest if this is expected and when can be done ?
There's two sides to this:
1) Physical translation of distinct in Spark
The Spark catalyst optimiser turns a distinct operation into an aggregation by means of the ReplaceDeduplicateWithAggregate rule (Note: in the execution plan distinct is named Deduplicate).
This basically means df.distinct() on all columns is translated into a groupBy on all columns with an empty aggregation:
df.groupBy(df.columns:_*).agg(Map.empty).
Spark uses a HashPartitioner when shuffling data for a groupBy on respective columns. Since the groupBy clause in your case contains all columns (well, implicitly, but it does), you're more or less randomly shuffling data to different nodes in the cluster.
Increasing spark.sql.shuffle.partitions in this case is not going to help.
Now on to the 2nd side, why does this affect the size of your parquet files so much?
2) Compression in parquet files
Parquet is a columnar format, will say your data is organised in columns rather than row by row. This allows for powerful compression if data is adequately laid-out & ordered. E.g. if a column contains the same value for a number of consecutive rows, it is enough to write that value just once and make a note of the number of repetitions (a strategy called run length encoding). But Parquet also uses various other compression strategies.
Unfortunately, data ends up pretty randomly in your case after shuffling to remove duplicates. The original partitioning of input_table_1 was much better fitted.
Solutions
There's no single answer how to solve this, but here's a few pointers I'd suggest doing next:
What's causing the duplicates? Could these be removed upstream? Or is there a problem with the join condition causing duplicates?
A simple solution is to just repartition the dataset after distinct to match the partitioning of your input data. Adding a secondary sorting (sortWithinPartition) is likely going to give you even better compression. However, this comes at the cost of an additional shuffle!
As #matt-andruff pointed out below, you can also achieve this in SQL using cluster by. Obviously, that also requires you to move the distinct keyword into your SQL statement.
Write your own deduplication algorithm as Spark Aggregator and group / shuffle the data just once in a meaningful way.

how a table data gets loaded into a dataframe in databricks? row by row or bulk?

I am new to databricks notebooks and dataframes. I have a requirement to load few columns(out of many) in a table of around 14million records into a dataframe. once the table is loaded, I need to create a new column based on values present in two columns.
I want to write the logic for the new column along with the select command while loading the table into dataframe.
Ex:
df = spark.read.table(tableName)
.select(columnsList)
.withColumn('newColumnName', 'logic')
will it have any performance impact? is it better to first load the table for the few columns into the df and then perform the column manipulation on the loaded df?
does the table data gets loaded all at once or row by row into the df? if row by row, then by including column manipulation logic while reading the table, am I causing any performance degradation?
Thanks in advance!!
This really depends on the underlying format of the table - is it backed by Parquet or Delta, or it's an interface to the actual database, etc. In general, Spark is trying to read only necessary data, and if, for example, Parquet is used (or Delta), then it's easier because it's column-oriented file format, so data for each column is placed together.
Regarding the question on the reading - Spark is lazy by default, so even if you put df = spark.read.table(....) as separate variable, then add .select, and then add .withColumn, it won't do anything until you call some action, for example .count, or write your results. Until that time, Spark will just check that table exists, your operations are correct, etc. You can always call .explain on the resulting dataframe to see how Spark will perform operations.
P.S. I recommend to grab a free copy of the Learning Spark, 2ed that is provided by Databricks - it will provide you a foundation for development of the code for Spark/Databricks

Whats the fastest way to loop through sorted dask dataframe?

I'm new to Pandas and Dask, Dask dataframes wrap pandas dataframes and share most of the same function calls.
I using Dask to sort(set_index) a largeish csv file ~1,000,000 rows ~100columns.
Once it's sorted I use itertuples() to grab each dataframe row, to compare with a row from a database with ~1,000,000 rows ~100 columns.
But it's running slowly (takes around 8 hours), is there a faster way to do this?
I used dask because it can sort very large csv files and has a flexible csv parsing engine. It'll also let me run more advanced operations on the dataset, and parse more data formats in the future
I could presort the csv but I want to see if Dask can be fast enough for my use case, it would make things alot more hands off in the long run.
By using iter_tuples, you are bringing each row back to the client, one by one. Please read up on map_partitions or map to see how you can apply function to rows or blocks of the dataframe without pulling data to the client.
Note that each worker should write to a different file, since they operate in parallel.

Most optimal way to removing Duplicates in pySpark

I am trying to remove duplicates in spark dataframes by using dropDuplicates() on couple of columns. But job is getting hung due to lots of shuffling involved and data skew. I have used 5 cores and 30GB of memory to do this. Data on which I am performing dropDuplicates() is about 12 million rows.
Please suggest me the most optimal way to remove duplicates in spark, considering data skew and shuffling involved.
Delete duplicate operations is an expensive operation as it compare values from one RDD to all other RDDs and tries to consolidate the results. Considering the size of your data results can time consuming.
I would recommend groupby transformation on the columns of your dataframe followed by commit action. This way only the consolidated results from your RDD will be compared with other RDD that too lazily and then you can request the result through any of the action like commit / show etc
transactions.groupBy("col1”,”col2").count.sort($"count".desc).show
distance():
df.select(['id', 'name']).distinct().show()
dropDuplicates()
df.dropDuplicates(['id', 'name']).show()
dropDuplicates() is the way to go if you want to drop duplicates over a subset of columns, but at the same time you want to keep all the columns of the original structure.

How do I create index on pyspark df?

I have bunch of hive tables.
I want to:
Pull the tables into a pyspark DF.
Do a UDF on them.
Join 4 tables based on customer id.
Is there a concept of indexing in spark to speed up the operation?
If so whats the command?
How do I create index on dataframe?
I understand your problem but the thing is, you acquire the data at the same time you process them. Therefore, calculating an index before joining is useless as it will take take more time to first create the index.
If you have several write operation, you may want to cache your data to speed up but otherwise, the index is not the solution to investigate.
There is maybe another thing you can try : df.repartition.
This will create partition on your df according to one column. But I have no idea if it can help.

Resources