Spark "distribute by" explodes size of original data - apache-spark

I'm trying to figure out why my 15 GB table balloons to 182 GB when I run a simple query on it.
First I read the table into Spark from Redshift. When I tell Spark to do a simple count on the table, it works fine. However, when I try to create a table I get all kinds of YARN failures and ultimately some of my tasks have shuffle spill memory of 182 GB.
Here is the problematic query (I've changed some of the names):
CREATE TABLE output
SELECT
*
FROM inputs
DISTRIBUTE BY id
SORT BY snapshot_date
What is going on? How could the shuffle spill exceed the total size of the input? I'm not doing a cartesian join, or anything like that. This is a super simple query.
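For reference, `DISTRIBUTE BY` and `SORT BY` correspond to `repartition` and `sortWithinPartitions` in the DataFrame API, which makes the shuffle step explicit. Below is a minimal sketch of that mapping, assuming `df` is the `inputs` table loaded as a DataFrame. The `distribute_and_sort` helper and its `num_partitions` parameter are hypothetical names used only for illustration, and this is not a confirmed fix.

```python
def distribute_and_sort(df, num_partitions=None):
    """DataFrame-API counterpart of DISTRIBUTE BY id SORT BY snapshot_date.

    `df` is expected to be a pyspark.sql.DataFrame holding the `inputs` table.
    `num_partitions` is a hypothetical knob for the number of shuffle partitions.
    """
    # DISTRIBUTE BY id -> hash-partition the rows by `id` (this is the shuffle).
    if num_partitions is None:
        shuffled = df.repartition("id")
    else:
        shuffled = df.repartition(num_partitions, "id")
    # SORT BY snapshot_date -> sort inside each partition only, no global order.
    return shuffled.sortWithinPartitions("snapshot_date")
```

Something like `distribute_and_sort(spark.table("inputs"), 2000).write.saveAsTable("output")` would then write the result; the value 2000 is an arbitrary placeholder. Note also that the "Shuffle spill (memory)" figure in the Spark UI counts spilled data in its deserialized in-memory form, so it can be much larger than the same data serialized and compressed on disk.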
I'm aware that Red Hat Linux (I use EMR on AWS) has virtual memory issues, since I came across that topic here, but I've added the recommended config classification=yarn-site,properties=[yarn.nodemanager.vmem-check-enabled=false] to my EMR properties and the issue persists.
Here is a screenshot from the Spark UI, if it helps:

Related

Transferring a large table with a small amount of memory using pyspark

I am trying to transfer multiple tables' data using pyspark (one table at a time). The problem is that two of my tables are a lot larger than my memory (Table 1 - 30GB, Table 2 - 12GB).
Unfortunately, I only have 6GB of memory (for driver + executor). All of my attempts to optimize the transfer process have failed. Here's my SparkSession Configuration:
spark = SparkSession.builder\
.config('spark.sql.shuffle.partitions', '300')\
.config('spark.driver.maxResultSize', '0')\
.config('spark.driver.memoryOverhead', '0')\
.config('spark.memory.offHeap.enabled', 'false')\
.config('spark.memory.fraction', '300')\
.master('local[*]')\
.appName('stackoverflow')\
.getOrCreate()
For reading and writing I'm using the fetchsize and batchsize parameters and a simple connection to a PostgreSQL DB. Parameters like numPartitions are not available in this case, because the script should be generic for about 70 tables.
I ran tons of tests and tuned all the parameters, but none of them worked. Besides that, I noticed that there are memory spills, but I can't understand why they happen or how to disable them. Spark should be holding some rows at a time, writing them to my destination table, then deleting them from memory.
I'd be happy to get any tips from anyone who faced a similar challenge.
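One detail in the configuration above: `spark.memory.fraction` is documented as a fraction of the heap (default 0.6), so a value like 300 is outside its intended range. Here is a rough sketch of a per-table JDBC copy using `fetchsize` and `batchsize`. The helper names `session_conf` and `copy_table` are hypothetical, the connection URLs are placeholders, and the range check is an illustrative guard rather than something Spark is known to enforce.

```python
def session_conf(shuffle_partitions=300, memory_fraction=0.6):
    """Return Spark settings as a plain dict (hypothetical helper).

    The range check is an illustrative guard added for this sketch.
    """
    if not 0.0 < memory_fraction < 1.0:
        raise ValueError("spark.memory.fraction is expected to be between 0 and 1")
    return {
        "spark.sql.shuffle.partitions": str(shuffle_partitions),
        "spark.memory.fraction": str(memory_fraction),
    }


def copy_table(spark, src_url, dst_url, table, fetchsize=10000, batchsize=10000):
    """Copy one table over JDBC; `spark` is an existing SparkSession.

    `src_url` and `dst_url` are placeholder JDBC URLs supplied by the caller.
    """
    df = (
        spark.read.format("jdbc")
        .option("url", src_url)
        .option("dbtable", table)
        .option("fetchsize", fetchsize)  # rows fetched per round trip
        .load()
    )
    (
        df.write.format("jdbc")
        .option("url", dst_url)
        .option("dbtable", table)
        .option("batchsize", batchsize)  # rows sent per insert batch
        .mode("append")
        .save()
    )
```

The settings dict could be applied with a loop such as `for k, v in session_conf().items(): builder = builder.config(k, v)` before calling `getOrCreate()`.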

Spark hbase bulk load generates more than 15X data

I have a Spark dataframe with just 2 columns, like {Key | Value}, and it has 10 million records. I am inserting this into an HBase table (with 10 pre-split regions) using the bulk load approach from Spark. This works fine and loads the data successfully. When I checked the table size, it was about 151 GB (453 GB with 3x Hadoop replication). I ran a major compaction on that table, and the table size dropped to 35 GB (105 GB with 3x replication).
I am trying to run the same code on the same data in a different cluster, but there I have a quota limit of 2 TB on my namespace. My process fails while loading HFiles into HBase, saying the quota limit was exceeded.
I would like to know whether Spark creates many more data files than the required 151 GB during the bulk load. If so, how can I avoid that? Or is there a better approach to load the same data?
In short: if the actual data is around 151 GB (before major_compact), why is 2 TB not enough?
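To make the numbers in the question concrete, here is a small arithmetic sketch. It assumes decimal units (1 GB = 10^9 bytes) and that the quota counts replicated bytes; both are assumptions, and the variable names are illustrative.

```python
GB = 10**9   # assumption: decimal gigabytes
TB = 10**12  # assumption: decimal terabytes


def replicated_bytes(logical_bytes, replication=3):
    """Bytes stored once every block is replicated `replication` times."""
    return logical_bytes * replication


records = 10_000_000
before_compaction = 151 * GB
after_compaction = 35 * GB
quota = 2 * TB

print(replicated_bytes(before_compaction) // GB)  # 453
print(replicated_bytes(after_compaction) // GB)   # 105
print(before_compaction // records)               # 15100 bytes per record

# Share of the 2 TB quota taken by the loaded table before compaction.
quota_share = replicated_bytes(before_compaction) / quota
print(f"{quota_share:.0%} of the quota")          # 23% of the quota
```

Under these assumptions the loaded table accounts for under a quarter of the quota, so whatever exhausts the remaining space must come from data written in addition to the final table.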

MySQL or Spark Processing of 400 GB data

Would Spark be useful in my case, given how it works with blocks and cores?
I have 400 GB of data in a single MySQL table, User_events, with multiple columns. This table stores all user events from the application, and there are indexes on the required columns. I have a user interface where the user can try different permutations and combinations of fields from user_events.
Currently I am facing performance issues: a query either takes 15 to 20 seconds or longer, or times out.
I have gone through a couple of Spark tutorials, but I am not sure whether Spark can help here. As I understand it:
1. First, Spark has to bring all the data into memory. Bringing 100 M records over the network will be a costly operation, and I will need a lot of memory for it. Is that right?
2. Once the data is in memory, Spark can distribute it among partitions based on the cores and the input data size, then filter the data on each partition in parallel. Here Spark can be beneficial because it runs the operation in parallel, while MySQL would be sequential. Is that correct?
Is my understanding correct?
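On the first point, a JDBC read in Spark does not have to hold the whole table in memory at once: it can be split into range-based partitions fetched by parallel tasks, and simple filters are pushed down to the database by default. A minimal sketch, assuming a numeric column to split on. The names `event_id` and `event_type`, the bounds, the URL, and both helper names are hypothetical.

```python
def partitioned_read_options(url, table, column, lower, upper, num_partitions):
    """Build JDBC options that split the read into `num_partitions` ranges."""
    return {
        "url": url,
        "dbtable": table,
        "partitionColumn": column,       # numeric, date, or timestamp column
        "lowerBound": str(lower),        # used to compute the range stride
        "upperBound": str(upper),
        "numPartitions": str(num_partitions),
    }


def load_events(spark, options, predicate):
    """Read through JDBC and filter; `spark` is an existing SparkSession."""
    df = spark.read.format("jdbc").options(**options).load()
    # Simple comparisons like this one are candidates for pushdown to MySQL.
    return df.where(predicate)


# Hypothetical usage:
# opts = partitioned_read_options(
#     "jdbc:mysql://host:3306/app", "user_events", "event_id", 1, 100_000_000, 16)
# clicks = load_events(spark, opts, "event_type = 'click'")
```

Whether this beats well-indexed MySQL queries for an interactive UI depends on the workload; the sketch only shows how the parallel read is expressed.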

Any way to see the size of the broadcast variable?

I have set spark.sql.autoBroadcastJoinThreshold to a very high value of 20 GB. I am joining a table that I am sure is below this threshold, yet Spark is doing a SortMergeJoin. If I set a broadcast hint, Spark does a broadcast join and the job finishes much faster. However, when run in production for some large tables, I run into errors. Is there a way to see the actual size of the table being broadcast? I wrote the table being broadcast to disk and it took only 32 MB in Parquet. I tried to cache this table in Zeppelin and run a table.count() operation, but nothing is shown on the Storage tab of the Spark History Server. spark.util.SizeEstimator doesn't seem to give accurate numbers for this table either. Is there any way to figure out the size of the table being broadcast?
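The planner compares `spark.sql.autoBroadcastJoinThreshold` with its own size estimate for the relation, not with the Parquet size on disk, so it can help to look at that estimate directly. A small sketch, assuming Spark 3.0 or later, where `DataFrame.explain` accepts `mode="cost"`; the helper names are made up for illustration.

```python
def show_plan_statistics(df):
    """Print the optimized plan together with the planner's size estimates.

    Look for `Statistics(sizeInBytes=...)` next to the relation in the output.
    """
    df.explain(mode="cost")


def force_broadcast(large_df, small_df, key):
    """Join with an explicit broadcast hint on the smaller side."""
    # Imported lazily so the first helper works without this import.
    from pyspark.sql.functions import broadcast

    return large_df.join(broadcast(small_df), on=key)
```

If the reported `sizeInBytes` is far above the 32 MB seen on disk, that gap would be one possible reason the automatic broadcast does not trigger.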

Data Processing in Parallel using Apache Spark with Pyspark

I have a daily-level transaction dataset for three months, around 17 GB combined. I have a server with 16 cores, 64 GB of RAM, and 1 TB of hard disk space. The transaction data is split into 90 files, each with the same format, and I have a set of queries to run over the entire dataset; the query for each day's data is the same for all 90 files. The results of each query are appended, and then we get the resulting summary back. Before I start on this endeavour, I was wondering whether Apache Spark with PySpark can be used to solve this. I tried R, but it was very slow and ultimately I ran out of memory.
So my question has two parts:
1. How should I create my RDD? Should I pass my entire dataset as one RDD, or is there a way to tell Spark to work in parallel on these 90 datasets?
2. Can I expect a significant speed improvement if I am not working with Hadoop?
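As a rough illustration of the first part, all 90 files can be read through one path pattern so that Spark splits the work across the local cores, and the same query runs once over the whole set. This sketch assumes the files are CSV with a header row, which the question does not state; the view name `transactions`, the helper names, and the paths are hypothetical.

```python
def build_local_session(cores=16):
    """Start Spark in local mode, with no Hadoop cluster involved."""
    # Imported lazily so the helper below can be used with any session object.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(f"local[{cores}]")   # one worker thread per core
        .appName("daily-summary")
        .getOrCreate()
    )


def summarise_all(spark, path_pattern, query):
    """Load every file matching `path_pattern` and run one SQL query over them."""
    # Assumption: CSV with a header row; swap the reader for the real format.
    df = spark.read.csv(path_pattern, header=True, inferSchema=True)
    df.createOrReplaceTempView("transactions")
    return spark.sql(query)


# Hypothetical usage:
# spark = build_local_session()
# summary = summarise_all(spark, "data/transactions_*.csv",
#                         "SELECT COUNT(*) AS n FROM transactions")
```

Reading through the DataFrame reader rather than building an RDD by hand leaves the partitioning of the 90 files to Spark.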
