How to paginate a Hive table in Spark? - apache-spark

I want to paginate a Hive table with ~1.5 billion rows using PySpark. I came across a solution using ROW_NUMBER(), but when I tried it I ran out of memory. I'm not sure whether Spark tries to bring the complete table into its memory before paginating.
After that, I came across the LIMIT clause in Hive SQL (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select#LanguageManualSelect-LIMITClause) and tried it, but it failed in Spark. The reason, as far as I could figure out, is that HiveQL is not completely supported in spark.sql(): the Spark SQL LIMIT clause does not accept the two-argument form with an offset -> https://spark.apache.org/docs/3.0.0/sql-ref-syntax-qry-select-limit.html
Is there a good approach for doing pagination in Spark?
PS: The Hive table does not have an ID column that I could sort on to paginate. :)

Basic use of Spark:
# Extract the data
df = spark.read.table("my_table")
# Transform the data
df = df.withColumn("new_col", some_transformation())
# Load the data
df.write ... # write wherever you want
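If the goal is to work through the table in bounded chunks rather than to show exact, ordered pages, one option that needs no ID column is to bucket rows by a hash of some columns. This is a minimal sketch, assuming Spark's built-in SQL functions hash and pmod; the helper names page_predicate and read_page, the column names, and the page count are made up for illustration:

```python
def page_predicate(columns, num_pages, page):
    """Build a Spark SQL predicate that selects one hash-based 'page' of rows."""
    if not columns:
        raise ValueError("need at least one column to hash")
    if not 0 <= page < num_pages:
        raise ValueError("page must be in [0, num_pages)")
    cols = ", ".join(f"`{c}`" for c in columns)
    # hash() and pmod() are Spark SQL built-ins; pmod keeps the bucket non-negative.
    return f"pmod(hash({cols}), {num_pages}) = {page}"


def read_page(spark, table, columns, num_pages, page):
    """Return a DataFrame holding only the rows that fall into the given page."""
    # DataFrame.where accepts a SQL expression string.
    return spark.read.table(table).where(page_predicate(columns, num_pages, page))
```

Usage would look like read_page(spark, "my_table", ["col_a", "col_b"], 1000, 0). Pages are only roughly equal in size, identical rows always land on the same page, and each page is still a full scan of the table, so this bounds the memory needed per page rather than the total I/O.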

Related

Reading Big Query using Spark BigQueryConnector

I want to read a BigQuery table using the Spark BigQuery connector and pass partition information to it. The read works, but it reads the full table. I want to filter the data on a partition value. How can I do that? I don't want to read the full table and then apply a filter on the Spark dataset; I want to pass the partition information at read time. Is that even possible?
Dataset<Row> testDS = session.read().format("bigquery")
.option("table", <TABLE>)
//.option("partition",<PARTITION>)
.option("project", <PROJECT_ID>)
.option("parentProject", <PROJECT_ID>)
.load();
Filtering works this way: .option("filter", "_PARTITIONTIME = '2020-11-23 13:00:00'")
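For reference, the same read with the filter option might look like this in PySpark. This is only a sketch: the helper names partition_filter and read_partition are hypothetical, and it assumes the connector is on the classpath and accepts the "filter" option exactly as in the snippet above.

```python
from datetime import datetime


def partition_filter(partition_time):
    """Build the predicate string passed to the connector's "filter" option."""
    stamp = partition_time.strftime("%Y-%m-%d %H:%M:%S")
    return f"_PARTITIONTIME = '{stamp}'"


def read_partition(spark, table, partition_time):
    """Load only the rows of `table` that match the partition predicate."""
    return (
        spark.read.format("bigquery")
        .option("table", table)
        .option("filter", partition_filter(partition_time))
        .load()
    )
```

Calling partition_filter(datetime(2020, 11, 23, 13)) produces the same predicate string as the one used above.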

Spark-Cassandra connector: how to change collections write behavior

In Java, I have a Spark dataset (Spark Structured Streaming) with a column of type java.util.ArrayList<Short>, and I want to write the dataset to a Cassandra table that has a corresponding list<smallint> column.
Each time I write a row to Cassandra it updates an existing row, and I want to customize the write behavior of the list in order to control whether
the written list overwrites the existing list, or
the content of the written list is appended to the content of the list already saved in Cassandra.
I found in the spark-cassandra-connector source code a class CollectionBehavior, which is extended by both CollectionAppend and CollectionOverwrite. It seems to be exactly what I am looking for, but I didn't find a way to use it while writing to Cassandra.
The dataset is written to Cassandra using:
dataset.write()
.format("org.apache.spark.sql.cassandra")
.option("table", table)
.option("keyspace", keyspace)
.mode(SaveMode.Append)
.save();
Is it possible to change this behavior?
To control how a collection is saved to Cassandra, use the RDD API; the Dataset API seems to be missing this so far. Converting the dataset to an RDD and using the RDD methods to save to Cassandra should give you the behaviour you want.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md

read/write bucketed tables in Spark

I have a number of tables (with 100 million-ish rows) that are stored as external Hive tables using Parquet format. The Spark job needs to join several of them together, using a single column, with almost no filtering. The join column has unique values about 2/3X fewer than the number of rows.
I can see that there are shuffles happening by the join key; and I have been trying to utilize bucketing/partitioning to improve join performance. My thought is that if Spark can be made aware that each of these tables has been bucketed using the same column, it can load the dataframes and join them without shuffling. I have tried using Hive bucketing, but the shuffles don't go away. (From Spark's documentation it looks like Hive bucketing is not supported as of Spark 2.3.0 at least, which I found out later.) Can I use Spark's bucketing feature to do this? If yes, would I have to disable Hive support and just read the files directly? Or could I rewrite the tables once using Spark's bucketing scheme and still be able to read them as Hive tables?
EDIT: For writing out the Hive bucketed tables I was using something like:
customerDF
  .write
  .option("path", "/some/path")
  .mode("overwrite")
  .format("parquet")
  .bucketBy(200, "customer_key")
  .sortBy("customer_key")
  .saveAsTable("table_name")
The writing part seems to work. However, reading from two tables written that way and joining them didn't work as I expected. That is, Spark was repartitioning both tables again into 200 partitions.
I don't have code for doing Spark bucketing right now but will update if I figure it out.
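One way to check whether a join between two bucketed tables still shuffles is to look for a hash-partitioning Exchange in the physical plan. This is a rough PySpark sketch, not a definitive test: the helper names are hypothetical, it assumes both tables were written with bucketBy on the join key with the same bucket count and are read back through the catalog with spark.table, and it relies on explain() printing the plan to standard output.

```python
import contextlib
import io


def plan_text(df):
    """Capture what df.explain() prints, as a string."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        df.explain()
    return buf.getvalue()


def has_shuffle(plan):
    """True if the plan text contains a hash-partitioning shuffle (Exchange)."""
    return "Exchange hashpartitioning" in plan


def join_shuffles(spark, left_table, right_table, key):
    """Join two catalog tables on `key` and report whether the plan shuffles."""
    joined = spark.table(left_table).join(spark.table(right_table), key)
    return has_shuffle(plan_text(joined))
```

For example, join_shuffles(spark, "table_name", "other_table", "customer_key"), where other_table is a made-up second table. If one side is small enough to be broadcast, the plan will not contain a hash-partitioning Exchange either, so when testing it may help to set spark.sql.autoBroadcastJoinThreshold to -1.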

Ignite Spark Dataframe slow performance

I was trying to improve the performance of an existing Spark dataframe by adding Ignite on top of it. The following code is how we currently read the dataframe:
val df = sparksession.read.parquet(path).cache()
I managed to save and load a Spark dataframe from Ignite by following the example here: https://apacheignite-fs.readme.io/docs/ignite-data-frame. The following code is how I do it now with Ignite:
import org.apache.ignite.spark.IgniteDataFrameSettings

val df = spark.read
  .format(IgniteDataFrameSettings.FORMAT_IGNITE) // Data source
  .option(IgniteDataFrameSettings.OPTION_TABLE, "person") // Table to read
  .option(IgniteDataFrameSettings.OPTION_CONFIG_FILE, CONFIG) // Ignite config
  .load()
df.createOrReplaceTempView("person")
SQL queries (like select a, b, c from table where x) on the Ignite dataframe work, but performance is much slower than with Spark alone (i.e. querying the Spark DF directly, without Ignite). A query often takes 5 to 30 seconds, and it is common for it to be 2 or 3 times slower than Spark alone. I noticed that a lot of data (100MB+) is exchanged between the Ignite container and the Spark container for every query. A query with the same "where" but a smaller result is processed faster. Overall, the Ignite dataframe support seems to be a simple wrapper on top of Spark, hence in most cases it is slower than Spark alone. Is my understanding correct?
Also, following the code example, when the cache is created in Ignite it automatically gets a name like "SQL_PUBLIC_name_of_table_in_spark". So I couldn't change any cache configuration in XML (I need to specify the cache name in XML/code to configure it, and Ignite complains that it already exists). Is this expected?
Thanks
First of all, your test doesn't seem fair. In the first case you prefetch the Parquet data, cache it locally in Spark, and only then execute the query. With the Ignite DF you don't use caching, so data is fetched during query execution. Typically you will not be able to cache all your data, so performance with Parquet will go down significantly once some of the data needs to be fetched during execution.
However, with Ignite you can use indexing to improve performance. For this particular case, you should create an index on the x field to avoid scanning all the data every time the query is executed. Here is the information on how to create an index: https://apacheignite-sql.readme.io/docs/create-index

Optimized hive data aggregation using hive

I have a Hive table (80 million records) with the following schema (event_id, country, unit_id, date), and I need to export this data to a text file with the following requirements:
1- Rows are aggregated (combined) by event_id.
2- Aggregated rows must be sorted by date.
For example, rows with the same event_id must be combined into a list of lists, ordered by date.
What is the best-performing way to do this job using Spark?
Note: This is expected to be a batch job.
Performance-wise, I think the best solution is to write a Spark program (Scala or Python) that reads the files underlying the Hive table, does your transformations, and then writes the output as a file.
I've found that it's much quicker to read the files directly in Spark than to query Hive through Spark and pull the result into a dataframe.
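The transformation itself could be sketched in PySpark roughly as follows, assuming the column names from the question's schema and a DataFrame df that has already been read. The function names are hypothetical. aggregate_locally is only a plain-Python reference for the intended output shape, handy for checking a small sample; aggregate_events is the Spark version, which relies on sort_array ordering structs by their first field (date).

```python
from collections import defaultdict


def aggregate_locally(rows):
    """Plain-Python reference: rows are (event_id, country, unit_id, date) tuples."""
    grouped = defaultdict(list)
    for event_id, country, unit_id, date in rows:
        # Put date first so that sorting each list orders it by date.
        grouped[event_id].append([date, country, unit_id])
    return {event_id: sorted(items) for event_id, items in grouped.items()}


def aggregate_events(df):
    """Spark version: one row per event_id with a date-ordered array of structs."""
    # Imported here so that aggregate_locally above runs without Spark installed.
    from pyspark.sql import functions as F

    entry = F.struct("date", "country", "unit_id")
    return df.groupBy("event_id").agg(
        F.sort_array(F.collect_list(entry)).alias("events")
    )
```

To get a text file out of this you would still need to serialize the events column into a string, for example with to_json, before writing.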
