Spark HBase bulk load generates more than 15X data - apache-spark

I have a Spark DataFrame with just 2 columns, like {Key | Value}, and it has 10 million records. I am inserting it into an HBase table (with 10 pre-split regions) using the bulk load approach from Spark. This works fine and loads the data successfully. When I checked the table size it was about 151 GB (453 GB with 3x Hadoop replication). I then ran a major compaction on that table, and the table size dropped to 35 GB (105 GB with 3x replication).
I am trying to run the same code with the same data on a different cluster, but there I have a quota limit of 2 TB on my namespace. My process fails while loading the HFiles into HBase, saying the quota limit is exceeded.
I would like to know whether Spark creates many more data files than the required 151 GB during the bulk load. If so, how can I avoid that? Or is there a better approach to load the same data?
The question is: if the actual data is around 151 GB (before major compaction), why isn't a 2 TB quota enough?
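If the load writes HFiles straight from DataFrame partitions, two things worth checking are whether each Spark task emits HFiles for many regions and whether those HFiles carry the column family's compression; both inflate the on-disk size until a major compaction rewrites the files. Below is a minimal sketch, not the actual job (the input path, column names, and region split points are hypothetical placeholders), of grouping and sorting the rows by target region before handing them to the bulk-load writer:

import bisect
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.appName("hbase-bulkload-prep").getOrCreate()

df = spark.read.parquet("/tmp/key_value")          # assumed input: columns key, value

# Hypothetical split points of the 10 pre-split regions.
REGION_SPLITS = ["1000000", "2000000", "3000000", "4000000",
                 "5000000", "6000000", "7000000", "8000000", "9000000"]

@F.udf(T.IntegerType())
def region_of(key):
    # Index of the pre-split region that will own this row key.
    return bisect.bisect_right(REGION_SPLITS, key)

prepared = (df
            .withColumn("region", region_of("key"))
            .repartition(10, "region")             # group rows by their target region
            .sortWithinPartitions("key"))          # HFile writers need sorted keys

# `prepared` can then be passed to whichever bulk-load writer is in use
# (e.g. HFileOutputFormat2 via the hbase-spark module), so a task no longer
# emits HFiles for every region it happens to touch.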

Related

PySpark insert overwrite with dynamic partition is very slow

I am reading a 60 GB CSV file using PySpark, doing a few basic transformations, and loading it into a Hive dynamic-partition table. The HDFS block size is 128 MB, so 400+ partitions are created in Spark. The transformations complete in a few minutes, but the load takes nearly an hour. The Hive execution engine is Tez. Loading an unpartitioned table instead takes less than 4 minutes. How can I improve the performance in this scenario?
I'm using the Hive Warehouse Connector.
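One common mitigation, shown as a minimal sketch below (the table, partition column, and path are hypothetical, and it uses the plain Spark writer rather than the Hive Warehouse Connector), is to repartition by the Hive partition column before the insert so each task writes to only a few partitions instead of all of them:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("dynamic-partition-load")
         .enableHiveSupport()
         .getOrCreate())

# Allow dynamic-partition overwrite for the insert below.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

df = spark.read.option("header", "true").csv("/data/input_60gb.csv")   # hypothetical path
# ... basic transformations ...

(df.repartition("load_date")                       # hypothetical Hive partition column
   .write
   .insertInto("db.target_partitioned_table", overwrite=True))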

Transferring a large table with a small amount of memory using pyspark

I am trying to transfer the data of multiple tables using PySpark (one table at a time). The problem is that two of my tables are a lot larger than my memory (Table 1 - 30GB, Table 2 - 12GB).
Unfortunately, I only have 6GB of memory (for driver + executor). All of my attempts to optimize the transfer process have failed. Here's my SparkSession configuration:
from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .config('spark.sql.shuffle.partitions', '300')\
    .config('spark.driver.maxResultSize', '0')\
    .config('spark.driver.memoryOverhead', '0')\
    .config('spark.memory.offHeap.enabled', 'false')\
    .config('spark.memory.fraction', '300')\
    .master('local[*]')\
    .appName('stackoverflow')\
    .getOrCreate()
For reading and writing I'm using the fetchsize and batchsize parameters and a simple JDBC connection to a PostgreSQL DB. Parameters like numPartitions are not usable in this case - the script should stay generic for about 70 tables.
I ran tons of tests and tuned all the parameters, but none of them worked. Besides that, I noticed that there are memory spills, but I can't understand why or how to disable them. Spark should be holding some rows at a time, writing them to my destination table, and then dropping them from memory.
I'd be happy to get any tips from anyone who faced a similar challenge.
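For reference, a minimal sketch of the fetchsize/batchsize read-and-write pattern described above (connection details and table names are hypothetical): with a single-partition JDBC read, fetchsize controls how many rows are streamed from PostgreSQL per round trip, and batchsize how many rows go into each insert batch on the write side.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("jdbc-transfer").getOrCreate()

df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://source-host:5432/sourcedb")   # hypothetical
      .option("driver", "org.postgresql.Driver")
      .option("dbtable", "public.big_table")                          # hypothetical table
      .option("user", "spark")
      .option("password", "secret")
      .option("fetchsize", "10000")        # rows pulled from Postgres per round trip
      .load())

(df.write.format("jdbc")
   .option("url", "jdbc:postgresql://dest-host:5432/destdb")          # hypothetical
   .option("driver", "org.postgresql.Driver")
   .option("dbtable", "public.big_table")
   .option("user", "spark")
   .option("password", "secret")
   .option("batchsize", "10000")           # rows per INSERT batch on the destination
   .mode("append")
   .save())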

Data Processing in Parallel using Apache Spark with Pyspark

I have a daily-level transaction dataset covering three months, around 17 GB combined. I have a server with 16 cores, 64 GB of RAM, and 1 TB of hard disk space. The transaction data is broken into 90 files, each with the same format, and there is a set of queries to run over the entire dataset; the query for each day's data is the same for all 90 files. The results of the queries are appended together to produce the final summary. Before I start on this endeavour, I was wondering whether Apache Spark with PySpark can be used to solve this. I tried R, but it was very slow and I ultimately ran into out-of-memory issues.
So my question has two parts:
1. How should I create my RDD? Should I pass my entire dataset as one RDD, or is there a way to tell Spark to work on these 90 datasets in parallel? (See the sketch after these two questions.)
2. Can I expect a significant speed improvement if I am not working with Hadoop?
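As a rough illustration of part 1, here is a minimal sketch (paths, schema, and column names are hypothetical) that points Spark at all 90 daily files with a glob, so the work is parallelized across the 16 cores without Hadoop:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[16]")                 # use the server's 16 cores
         .config("spark.driver.memory", "48g")
         .appName("daily-transactions")
         .getOrCreate())

# One DataFrame over all 90 daily files; Spark splits the work across cores.
daily = spark.read.csv("/data/transactions/day_*.csv",
                       header=True, inferSchema=True)
daily.createOrReplaceTempView("transactions")

# Run the (identical) daily query once over the whole dataset and let Spark
# aggregate across files in parallel; the columns here are placeholders.
summary = spark.sql("""
    SELECT txn_date, SUM(amount) AS total_amount
    FROM transactions
    GROUP BY txn_date
""")
summary.write.mode("overwrite").csv("/data/output/summary")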

Spark Poor Query performance: How to improve query performance on Spark?

There is a lot of hype about how good and fast Spark is at processing large amounts of data.
So we wanted to investigate its query performance.
Machine configuration:
4 worker nodes, r3.2xlarge instances
Data
Our input data is stored in 12 split gzip files in S3.
What we did
We created a table using Spark SQL for the aforementioned input data set.
Then we cached the table. We found from the Spark UI that Spark did not load all the data into memory; rather, it loaded some data into memory and spilled some to disk.
UPDATE: We also tested with Parquet files. In this case, all the data was loaded into memory. We then executed the same queries as below. Performance is still not good enough.
Query Performance
Let's assume the table name is Fact_data. We executed the following query on that cached table:
select date_key,sum(value) from Fact_data where date_key between 201401 and 201412 group by date_key order by 1
The query takes 1268.93 seconds to complete. This is huge compared to the execution time in Redshift (dc1.large cluster), which takes only 9.23 seconds.
I also tested some other queries, e.g. count, join, etc. Spark gives me really poor performance for each of these queries.
Questions
Could you suggest anything that might improve the performance of the query? Maybe I am missing some optimization techniques. Any suggestion will be highly appreciated.
How can I compel Spark to load all the data into memory? Currently it stores some data in memory and some on disk.
Is there any performance difference between using a DataFrame and an SQL table? I think not, because under the hood they use the same optimizer.
I suggest you use Parquet as your file format instead of gzipped files.
You can try increasing your --num-executors, --executor-memory and --executor-cores.
If you're using YARN and your instance type is r3.2xlarge, make sure your container size (yarn.nodemanager.resource.memory-mb) is larger than your --executor-memory (maybe around 55G). You also need to set yarn.nodemanager.resource.cpu-vcores to 15. A rough sketch of these settings follows.
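A minimal sketch of those suggestions expressed through a SparkSession (the executor counts, sizes, and S3 path below are illustrative assumptions, not tested values):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("yarn")
         .config("spark.executor.instances", "8")    # equivalent of --num-executors
         .config("spark.executor.memory", "12g")     # equivalent of --executor-memory
         .config("spark.executor.cores", "4")        # equivalent of --executor-cores
         .appName("fact-data-queries")
         .getOrCreate())

# Read the input as Parquet instead of gzipped text, as suggested above.
fact = spark.read.parquet("s3a://my-bucket/fact_data/")   # hypothetical path
fact.createOrReplaceTempView("Fact_data")
spark.sql("CACHE TABLE Fact_data")

result = spark.sql("""
    SELECT date_key, SUM(value)
    FROM Fact_data
    WHERE date_key BETWEEN 201401 AND 201412
    GROUP BY date_key
    ORDER BY 1
""")
result.show()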

Cassandra running out of memory for cql queries

We have a 32-node Cassandra cluster with around 100 GB per node, using the Murmur3 partitioner. It holds time series data, and we have built secondary indexes on two columns to perform range queries. Currently the cluster is stable, with all the data bulk loaded and all the secondary indexes rebuilt. The issue occurs when we perform range queries using the CQL client or Hector: even a query for the count of rows takes a huge amount of time, and in most cases it causes nodes to fail due to memory issues. The nodes have 8 GB of memory, and the Cassandra max heap is set to 4 GB. Has anyone else faced such an issue? Is there a better way to do count queries?
I've had similar issues, and most often this can be solved by redesigning the schema with the queries you plan to execute against the data in Cassandra in mind. For time series data it is better to have wide tables with a granularity that matches your queries. If your query requires data at a granularity of 1 hour, then it is best to have a wide table where all the timestamped data points for an hour are stored in a single row, so you can get all the required data for that hour by reading just 1 row.
Since you say the data is bulk loaded, I am assuming that you may have put all the data into a single table, which is why the get_count query is taking an enormous amount of time. We have a cluster with 8 GB of RAM, but we set the heap size to 3 GB because at 4 GB the RAM utilization was almost always at 8 GB [full utilization].
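As a rough illustration of the wide-row model described above, here is a minimal sketch using the DataStax Python driver (the contact point, keyspace, table, and column names are hypothetical): each (sensor, hour) pair becomes one partition, so an hour of data, or its count, is served from a single row instead of a secondary-index scan across the cluster.

from datetime import datetime
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("metrics")    # hypothetical node and keyspace

session.execute("""
    CREATE TABLE IF NOT EXISTS readings_by_hour (
        sensor_id   text,
        hour_bucket timestamp,      -- event time truncated to the hour
        event_time  timestamp,
        value       double,
        PRIMARY KEY ((sensor_id, hour_bucket), event_time)
    )
""")

# Counting rows for one hour now touches a single partition.
row = session.execute(
    "SELECT count(*) FROM readings_by_hour "
    "WHERE sensor_id = %s AND hour_bucket = %s",
    ("sensor-42", datetime(2014, 1, 1, 10, 0)),
).one()
print(row[0])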
