Apache Spark page results or view results on large datasets - apache-spark

I am using Hive with Spark 1.6.3
I have a large dataset (40000 rows, 20 columns or so and each column contains maybe 500 Bytes - 3KB of data)
The query is a join to 3 datasets
I wish to be able to page through the final joined dataset, and I have found that I can use row_number() OVER (ORDER BY 1) to generate a unique row number for each row in the dataset.
After this I can do
SELECT * FROM dataset WHERE row between 1 AND 100
However, there are resources which advise against using ORDER BY because it puts all of the data into one partition (I can see this happening in the logs, where the shuffle moves the data to a single partition), and when this happens I get out of memory exceptions.
How would I go about paging through the dataset in a more efficient way?
I have enabled persist with MEMORY_AND_DISK so that if a partition is too large it will spill to disk (and for some of the transformations I can see that at least some of the data spills to disk when I am not using row_number()).

One strategy could be to select only the unique key of the dataset first and apply the row_number function to that dataset alone. Since you are selecting a single column from a large dataset, the chances are higher that it will fit in a single partition.
val dfKey = df.select("uniqueKey")
dfKey.createOrReplaceTempView("dfKey")
val dfWithRowNum = spark.sql("select uniqueKey, row_number() over (order by 1) as row_number from dfKey")
// save dfWithRowNum
After completing the row_number operation on the uniqueKey column, save that dataframe. In the next stage, join this dataframe with the bigger dataframe and append the row_number column to it.
dfOriginal.createOrReplaceTempView("dfOriginal")
dfWithRowNum.createOrReplaceTempView("dfWithRowNum")
val joined = spark.sql("select dfOriginal.*, dfWithRowNum.row_number from dfOriginal join dfWithRowNum on dfOriginal.uniqueKey = dfWithRowNum.uniqueKey")
// save joined
Now you can query
SELECT * FROM joineddataset WHERE row_number BETWEEN 1 AND 100
Regarding persisting with MEMORY_AND_DISK, I found that it occasionally fails with insufficient memory. I would rather use DISK_ONLY, where performance takes a penalty but execution is guaranteed.
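A minimal sketch of switching the storage level, assuming joined is the dataframe produced by the join above:
import org.apache.spark.storage.StorageLevel

// Spill every partition straight to disk instead of trying to keep it in memory first.
joined.persist(StorageLevel.DISK_ONLY)
joined.count()  // materialise the persisted data before running the paging queries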

Well, you can apply this method to your final join dataframe.
You should also persist the dataframe to a file to guarantee the ordering, as re-evaluation could create a different order.
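A minimal sketch of pinning the ordering by writing to a file and reading it back before paging (the path is illustrative):
// Writing freezes the row_number assignment; subsequent reads see a stable order.
joined.write.mode("overwrite").parquet("/tmp/joined_with_rownum")
val stable = spark.read.parquet("/tmp/joined_with_rownum")
stable.createOrReplaceTempView("joineddataset")  // page against this stable copy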

Related

Spark dataframe distinct write is increasing the output size by almost 10 fold

I have a case where I am trying to write some results into S3 using a dataframe write with the query below; input_table_1 is 13 GB and input_table_2 is 1 MB.
input_table_1 has columns account, membership and
input_table_2 has columns role, id, membership_id, quantity, start_date
SELECT
/*+ BROADCASTJOIN(input_table_2) */
account,
role,
id,
quantity,
cast(start_date AS string) AS start_date
FROM
input_table_1
INNER JOIN
input_table_2
ON array_contains(input_table_1.membership, input_table_2.membership_id)
where the membership array contains a list of membership_ids
This dataset write using a Spark dataframe generates around 1.1 TiB of data in S3, with around 700 billion records.
We identified that there were duplicates and used dataframe.distinct.write.parquet("s3path") to remove them. The record count is reduced to almost a third of the previous total, around 200 billion rows, but we observed that the output size in S3 is now 17.2 TiB.
I am very confused how this can happen.
I have used the following spark conf settings
spark.sql.shuffle.partitions=20000
I have tried to do a coalesce and write to s3 but it did not work.
Please suggest whether this is expected and what can be done?
There are two sides to this:
1) Physical translation of distinct in Spark
The Spark catalyst optimiser turns a distinct operation into an aggregation by means of the ReplaceDeduplicateWithAggregate rule (Note: in the execution plan distinct is named Deduplicate).
This basically means df.distinct() on all columns is translated into a groupBy on all columns with an empty aggregation:
df.groupBy(df.columns.head, df.columns.tail: _*).agg(Map.empty[String, String])
Spark uses a HashPartitioner when shuffling data for a groupBy on respective columns. Since the groupBy clause in your case contains all columns (well, implicitly, but it does), you're more or less randomly shuffling data to different nodes in the cluster.
Increasing spark.sql.shuffle.partitions in this case is not going to help.
Now on to the 2nd side, why does this affect the size of your parquet files so much?
2) Compression in parquet files
Parquet is a columnar format, that is to say, your data is organised in columns rather than row by row. This allows for powerful compression if the data is adequately laid out and ordered. E.g. if a column contains the same value for a number of consecutive rows, it is enough to write that value just once and note the number of repetitions (a strategy called run length encoding). But Parquet also uses various other compression strategies.
Unfortunately, in your case the data ends up distributed pretty randomly after the shuffle that removes the duplicates. The original partitioning of input_table_1 was a much better fit.
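As a rough illustration of this effect (the column name and paths below are made up, not from the question), writing the same low-cardinality data with and without a sort can produce very different parquet sizes, because run length and dictionary encoding need ordered runs to be effective:
import org.apache.spark.sql.functions._

// 10M rows with a 10-value column, written once in random order and once sorted.
val demo = spark.range(10000000L).withColumn("category", (rand() * 10).cast("int"))
demo.write.mode("overwrite").parquet("/tmp/unsorted")                                  // values interleaved randomly
demo.sortWithinPartitions("category").write.mode("overwrite").parquet("/tmp/sorted")   // long runs per value, compresses far better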
Solutions
There's no single answer for how to solve this, but here are a few pointers I'd suggest looking at next:
What's causing the duplicates? Could these be removed upstream? Or is there a problem with the join condition causing duplicates?
A simple solution is to just repartition the dataset after the distinct to match the partitioning of your input data (see the sketch after this list). Adding a secondary sort (sortWithinPartitions) is likely going to give you even better compression. However, this comes at the cost of an additional shuffle!
As #matt-andruff pointed out below, you can also achieve this in SQL using cluster by. Obviously, that also requires you to move the distinct keyword into your SQL statement.
Write your own deduplication algorithm as a Spark Aggregator and group/shuffle the data just once in a meaningful way.
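A minimal sketch of the repartition-plus-sort pointer, assuming the joined dataframe is called joinedDf and that account is a sensible partitioning and sort key (both names are assumptions, not taken from the question):
import org.apache.spark.sql.functions.col

// Re-cluster the deduplicated data so similar rows land in the same files,
// then sort within each partition so parquet's encodings can kick in.
joinedDf.distinct()
  .repartition(col("account"))              // shuffle back to a meaningful layout
  .sortWithinPartitions("account", "role")
  .write
  .parquet("s3://bucket/deduped/")          // output path is illustrative

The CLUSTER BY variant mentioned above achieves roughly the same layout directly in the SQL statement.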

Spark: sorting and assigning ids to dataset which has no unique id

I have a spark dataset which I get from some Hive table via SQL:
Dataset<Row> dataset = session.sql("select * from mytable order by myDate");
I need to assign unique, increasing, but not necessarily sequential ids to the rows of my dataset sorted by the myDate field, i.e. assign ids like
1,4,6,7,8,9,16 etc
First thing I tried was row_number() function.
Dataset<Row> dataset = session.sql("select *,row_number() over () as rn from mytable order by myDate");
But I failed because myDate is not a unique key (and there is no unique key in my dataset!), and I ran into a very interesting bug. It turned out that each time I modify my dataset...
dataset.drop("redundantColumn");
dataset.join(..with something..);
dataset.select("rn","myDate");
... the dataset is recalculated and thus the sequence of row number assignments is different! In other words, because my SQL query is non-deterministic, each time I do something with the dataset generated by that query I get a different order of rows and thus different row-to-row_number matches.
Questions:
(1) is it possible to force Spark not to recalculate my dataset each time I do something with it? Indeed, why can't I drop columns or join the dataset without re-running the initial query?
(2) any other solutions to this problem? It looks like the only option is to combine the monotonically_increasing_id function with row_number (a rough sketch of what I mean follows question (3) below), and it looks quite verbose:
get dataset
add a column with monotonically_increasing_id
create temp view from dataset
add row_number while selecting from that temp view
And again, I am not sure Spark will not reassign monotonically_increasing_id during some operations on my dataset, like joining and adding columns.
(3) instead of option (2) - is there any way to assign the monotonically_increasing_id function aligned with some sorted column?
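For reference, a minimal sketch of that monotonically_increasing_id plus row_number combination (written in Scala; the table and column names just mirror the question, and this does not by itself answer whether Spark may re-evaluate the id):
import org.apache.spark.sql.functions.monotonically_increasing_id

// Steps 1-2: read the table and tag each row with a non-sequential, increasing id.
val withMonoId = session.sql("select * from mytable")
  .withColumn("mono_id", monotonically_increasing_id())

// Steps 3-4: expose it as a view and assign row_number, using mono_id to break ties on myDate.
withMonoId.createOrReplaceTempView("with_mono_id")
val withRowNum = session.sql(
  "select *, row_number() over (order by myDate, mono_id) as rn from with_mono_id")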

Spark Generate A Lot Of Tasks Although Partition Number is 1 Pyspark

My code:
df = self.sql_context.sql(f"select max(id) as id from {table}")
return df.collect()[0][0]
My table is partitioned by id - it has 100M records but only 3 distinct id's.
I expected this query to work with 1 task and scan just the partition column (id).
I don't understand how I end up with 691 tasks for the collect line when there are just 3 partitions.
I guess the query is executing a full scan of the table, but I can't figure out why it doesn't just scan the metadata.
Your df contains the result of an aggregation over the entire table; it has only one row (with a single field, the max(id)), which is why it has only 1 partition.
But the original table DataFrame may have many partitions (or only 1 partition whose computation needs ~600 stages, triggering 1 task per stage, which is not that common).
Without details on your parallelism configuration, input source type and transformations, it is not easy to help more!
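As a quick way to see the distinction being drawn here, a sketch in Scala (assuming the Hive table is registered as my_table - the name is illustrative):
val table = spark.table("my_table")
println(table.rdd.getNumPartitions)   // partitions of the underlying scan: this drives the task count

val agg = spark.sql("select max(id) as id from my_table")
println(agg.rdd.getNumPartitions)     // 1: the global aggregation collapses to a single-row result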

How to extract a Dataset content n rows by n rows?

I have to output the results of a Dataset into a Postgis (spatial) database. Spark doesn't handle it, and I had to write specific code that cannot be serialized. This means that I can't use the dataset.foreach(...) method, and I have to execute my database insertions outside of Spark tasks.
But a whole
List<Row> rows = ds.collectAsList()
will produce an out of memory error.
And a
List<Row> rows = ds.takeAsList(n);
only returns the n first rows of the dataset.
Is there a way to read the dataset sequentially, so that I can read its whole content from beginning to end, extracting only a fixed number of rows each time?
You can try the randomSplit method to split your dataframe into multiple smaller dataframes.
For example, to split into 3:
ds.randomSplit(Array(1.0, 1.0, 1.0))
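A minimal sketch of using the splits to handle a bounded number of rows at a time (the number of parts and the insertion step are placeholders for your own Postgis code):
// Split into ~10 equally weighted parts, then collect and insert each part from the driver.
val parts = ds.randomSplit(Array.fill(10)(1.0))
parts.foreach { part =>
  val rows = part.collectAsList()   // each part should now be small enough to collect
  // run the non-serializable Postgis insertions over `rows` here
}

Note that randomSplit does not preserve any ordering, so this only helps if the rows can be inserted in any order.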

Divide operation in spark using RDD or dataframe

Suppose there is a dataset with some number of rows.
I need to find the heterogeneity, i.e.
the number of distinct rows divided by the total number of rows.
Please help me with a Spark query to do this.
Dataset and DataFrame support the distinct function, which finds the distinct rows in the dataset.
So essentially you need to do
val heterogeneity = dataset.distinct.count.toDouble / dataset.count
The only thing is that if the dataset is big, the distinct can be expensive and you might need to set the Spark shuffle partitions appropriately.
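A minimal sketch putting both pieces together (the partition count of 400 is just a placeholder to tune for your data size):
// More shuffle partitions help the distinct (a hash aggregation) cope with large inputs.
spark.conf.set("spark.sql.shuffle.partitions", 400)

// .toDouble avoids integer division, so the ratio lands between 0 and 1.
val heterogeneity = dataset.distinct().count().toDouble / dataset.count()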
