Spark HiveContext vs HbaseContext? - apache-spark

I have a data-set of size 10 Petabytes. My current data is in HBase where I am using Spark HbaseContext but it is not performing well.
Will it be useful to move data from HbaseContext to HiveContext on Spark?

HiveContext is used to read data from Hive. so, if you switch to HiveContext the data has to be in Hive. I don't think what you are trying will work.

In my use case, I use mapPartition with a HBase connection inside. The key is just to know how to split.
For scan, you can create your own scanner, with prefix, etc...
For get it's even easier.
For puts, you can create a list of puts to do then batch insertion.
I don't use any HBaseContext and I have quite good performances on database of 1,2 billion rows.

Related

Can Apache Spark be used in place of Sqoop

I have tried connecting spark with JDBC connections to fetch data from MySQL / Teradata or similar RDBMS and was able analyse the data.
Can spark be used to store the data to HDFS?
Is there any possibility for spark outperforming
the activities of Sqoop.
Looking for you valuable answers and explanations.
There are two main things about Sqoop and Spark. The main difference is Sqoop will read the data from your RDMS doesn't matter what you have and you don't need to worry much about how you table is configured.
With Spark using JDBC connection is a little bit different how you need to load the data. If your database doesn't have any column like numeric ID or timestamp Spark will load ALL the data in one single partition. And then will try to process and save. If you have one column to use as partition than Spark sometimes can be even faster than Sqoop.
I would recommend you to take a look in this doc.enter link description here
The conclusion is, if you are going to do a simple export and that need to be done daily with no transformation I would recommend Sqoop to be simple to use and will not impact your database that much. Using Spark will work well IF your table is ready for that, besides that goes with Sqoop

read/write bucketed tables in Spark

I have a number of tables (with 100 million-ish rows) that are stored as external Hive tables using Parquet format. The Spark job needs to join several of them together, using a single column, with almost no filtering. The join column has unique values about 2/3X fewer than the number of rows.
I can see that there are shuffles happening by the join key; and I have been trying to utilize bucketing/partitioning to improve join performance. My thought is that if Spark can be made aware that each of these tables has been bucketed using the same column, it can load the dataframes and join them without shuffling. I have tried using Hive bucketing, but the shuffles don't go away. (From Spark's documentation it looks like Hive bucketing is not supported as of Spark 2.3.0 at least, which I found out later.) Can I use Spark's bucketing feature to do this? If yes, would I have to disable Hive support and just read the files directly? Or could I rewrite the tables once using Spark's bucketing scheme and still be able to read them as Hive tables?
EDIT: For writing out the Hive bucketed tables I was using something like:
customerDF
.write
.option("path", "/some/path")
.mode("overwrite")
.format("parquet")
.bucketBy(200, "customer_key")
.sortBy("customer_key")
.saveAsTable("table_name")
The writing part seems to work. However, reading from two tables written that way and joining them didn't work as I expected. That is, Spark was repartitioning both tables again into 200 partitions.
I don't have code for doing Spark bucketing right now but will update if I figure it out.

Ignite Spark Dataframe slow performance

I was trying to improve the performance of some existing spark dataframe by adding ignite on top of it. Following code is how we currently read dataframe
val df = sparksession.read.parquet(path).cache()
I managed to save and load spark dataframe from ignite by the example here: https://apacheignite-fs.readme.io/docs/ignite-data-frame. Following code is how I do it now with ignite
val df = spark.read()
.format(IgniteDataFrameSettings.FORMAT_IGNITE()) //Data source
.option(IgniteDataFrameSettings.OPTION_TABLE(), "person") //Table to read.
.option(IgniteDataFrameSettings.OPTION_CONFIG_FILE(), CONFIG) //Ignite config.
.load();
df.createOrReplaceTempView("person");
SQL Query(like select a, b, c from table where x) on ignite dataframe is working but the performance is much slower than spark alone(i.e without ignite, query spark DF directly), an SQL query often take 5 to 30 seconds, and it's common to be 2 or 3 times slower spark alone. I noticed many data(100MB+) are exchanged between ignite container and spark container for every query. Query with same "where" but smaller result is processed faster. Overall I feel ignite dataframe support seems to be a simple wrapper on top of spark. Hence most of the case it is slower than spark alone. Is my understanding correct?
Also by following the code example when the cache is created in ignite it automatically has a name like "SQL_PUBLIC_name_of_table_in_spark". So I could't change any cache configuration in xml (Because I need to specify cache name in xml/code to configure it and ignite will complain it already exists) Is this expected?
Thanks
First of all, it doesn't seem that your test is fair. In the first case you prefetch Parquet data, cache it locally in Spark, and only then execute the query. In case of Ignite DF you don't use caching, so data is fetched during query execution. Typically you will not be able to cache all your data, so performance with Parquet will go down significantly once some of the data needs to be fetched during execution.
However, with Ignite you can use indexing to improve the performance. For this particular case, you should create index on the x field to avoid scanning all the data every time query is executed. Here is the information on how to create an index: https://apacheignite-sql.readme.io/docs/create-index

Large Query or mutate Dataframe?

I am using a SparkSession to connect to a hive database. I'm trying to decide what is the best way to enrichment the data. I was using Spark Sql but I am weary to use it.
Does the SparkSql just call Hive Sql? So would that mean there is no improved performance from using Spark?
If not, should I just create a large sql query to spark, or should I grab a table I wan't convert it to a data frame and manipulate it using sparks functions?
No, Spark will read the data from Hive, but use its own execution engine. Performance and capabilities will differ. How much depends on the execution engine you are using for Hive. (M/R, Tez, Spark, LLAP?)
That's the same thing. I would stick to SQL queries, and A-B-test against Hive in the beginning, but SQL is notoriously difficult to maintain, where Scala/Python code using Spark's DataSet API is more user friendly in the long term.

Writing small amount of data to cassandra table in Spark

I need to write some small amount of data to a cassandra table in a spark application. The data is not an RDD, and it is just a double value. How to do this in a Spark application using Java API?
Thanks for any hint.
As a quick solution you can use sc.parallelize and save RDD to Cassandra as usual.
If you need to run a query you can use CassandraConnector pool like in doc:
val connector = CassandraConnector(sc.getConf)
connector.withSessionDo(session => ...)
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/1_connecting.md

Resources