Transferring a large table with a small amount of memory using pyspark - apache-spark

I am trying to transfer multiple tables' data using pyspark (one table at a time). The problem is that two of my tables are a lot larger than my memory (Table 1 - 30GB, Table 2 - 12GB).
Unfortunately, I only have 6GB of memory (for driver + executor). All of my attempts to optimize the transfer process have failed. Here's my SparkSession Configuration:
from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .config('spark.sql.shuffle.partitions', '300')\
    .config('spark.driver.maxResultSize', '0')\
    .config('spark.driver.memoryOverhead', '0')\
    .config('spark.memory.offHeap.enabled', 'false')\
    .config('spark.memory.fraction', '300')\
    .master('local[*]')\
    .appName('stackoverflow')\
    .getOrCreate()
For reading and writing I'm using the fetchsize and batchsize parameters and a simple connection to a PostgreSQL DB. Parameters like numPartitions are not an option in this case, because the script should be generic for about 70 tables.
I ran tons of tests and tuned all the parameters, but none of them worked. Besides that, I noticed that there are memory spills, but I can't understand why they happen or how to disable them. Spark should be holding only some rows at a time, writing them to my destination table, and then deleting them from memory.
I'd be happy to get any tips from anyone who faced a similar challenge.
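One generic way to keep each read small without hand-tuning numPartitions per table is to split the read into key ranges and pass them as predicates to spark.read.jdbc, which creates one partition per predicate. The sketch below only builds the predicate strings. The column name `id`, the bounds, and the chunk count are hypothetical and would have to be looked up per table (for example with a MIN/MAX query on an integer key). It is an illustration under those assumptions, not a tested fix for the setup above.

```python
def build_range_predicates(column, lower, upper, chunks):
    """Split the closed integer range [lower, upper] into SQL range predicates.

    Rows where `column` is NULL match none of the predicates and would
    need a separate `column IS NULL` predicate.
    """
    if chunks < 1 or upper < lower:
        raise ValueError("need chunks >= 1 and upper >= lower")
    total = upper - lower + 1
    chunks = min(chunks, total)          # never create empty ranges
    step, extra = divmod(total, chunks)  # spread the remainder over the first ranges
    predicates = []
    start = lower
    for i in range(chunks):
        size = step + (1 if i < extra else 0)
        end = start + size - 1
        predicates.append(f"{column} >= {start} AND {column} <= {end}")
        start = end + 1
    return predicates


predicates = build_range_predicates("id", 1, 10, 3)
print(predicates)
# ['id >= 1 AND id <= 4', 'id >= 5 AND id <= 7', 'id >= 8 AND id <= 10']

# Hypothetical usage with PySpark (url, dest_url, table name and props are placeholders):
# df = spark.read.jdbc(url, "some_table", predicates=predicates, properties=props)
# df.write.jdbc(dest_url, "some_table", mode="append", properties=props)
```

With more ranges than cores, each task only streams its own slice of the table, so the number of rows in flight is bounded by the range size and the fetchsize rather than by the table size.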

Related

Spark hbase bulk load generates more than 15X data

I have a Spark dataframe with just 2 columns, like {Key | Value}, and 10 million records. I am inserting it into an HBase table (which has 10 pre-split regions) using the bulk load approach from Spark. This works fine and loads the data successfully. When I checked the table size, it was about 151GB (453GB with 3x Hadoop replication). I ran a major compaction on that table, and the table size was reduced to 35GB (105GB with 3x replication).
I am trying to run the same code with the same data on a different cluster, but there I have a quota limit of 2TB on my namespace. My process fails while loading the HFiles into HBase, saying the quota limit has been exceeded.
I would like to know whether Spark creates many more data files than the required 151GB during the bulk load. If so, how can I avoid that? Or is there a better approach to load the same data?
The question is: if the actual data is around 151GB (before major_compact), why is 2TB not enough?
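For reference, the arithmetic behind the numbers above can be written out as follows. This is only a sketch: it assumes that the quota counts replicated bytes and that 2TB means 2048GB, neither of which is stated in the question.

```python
REPLICATION = 3        # replication factor stated in the question
QUOTA_GB = 2 * 1024    # assumption: 2TB quota expressed in GB


def replicated_gb(logical_gb, replication=REPLICATION):
    """Raw storage used once every block is stored `replication` times."""
    return logical_gb * replication


before_compaction = replicated_gb(151)
after_compaction = replicated_gb(35)
headroom = QUOTA_GB - before_compaction

print(before_compaction, after_compaction, headroom)  # 453 105 1595
```

Under these assumptions the loaded table alone leaves roughly 1.5TB of headroom, so whatever exhausts the quota must be additional data written during the load.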

Write to databricks table from spark worker node

Can someone let me know if I can write to a Databricks table directly from a worker node in Spark? Please provide code snippets. I am partitioning a large dataset of around 100 million records, and it fails with memory issues when I issue a collect statement to bring the data back to the driver node.
In general you are always writing from a worker node to a Databricks table. As you have seen, collect should be avoided at all costs because it leads to driver OOM.
To avoid OOM issues, do what most people do: repartition your records so that they fit within the allowable partition size limit (2GB, or now 4GB with newer Spark releases) on your worker nodes, and all will be fine. E.g.:
val repartitionedWikiDF = wikiDF.repartition(16)
val targetPath = s"$workingDir/wiki.parquet"
repartitionedWikiDF.write.mode("OVERWRITE").parquet(targetPath)
display(dbutils.fs.ls(targetPath))
You can also perform df.repartition(N, col). There is also range partitioning.
Best approach is like this imo:
import org.apache.spark.sql.functions._
df.repartition(col("country"))
  .write.partitionBy("country")
  .parquet("repartitionedPartitionedBy.parquet")
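To pick the number passed to repartition instead of hard-coding 16, a common rule of thumb is to divide the estimated data size by a target partition size. The helper below is a minimal sketch of that calculation in Python. The 50GB dataset and 128MB target are hypothetical values chosen for illustration, not figures from the answer above.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def partitions_for(total_bytes, target_partition_bytes):
    """Smallest partition count that keeps each partition at or below the target size."""
    if total_bytes < 0 or target_partition_bytes <= 0:
        raise ValueError("sizes must be non-negative and the target must be positive")
    # Integer ceiling division, with at least one partition.
    return max(1, -(-total_bytes // target_partition_bytes))


n = partitions_for(50 * GB, 128 * MB)
print(n)  # 400

# Hypothetical usage: df.repartition(n).write.mode("overwrite").parquet(target_path)
```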

Spark "distribute by" explodes size of original data

I'm trying to figure out why my 15 GB table balloons to 182 GB when I run a simple query on it.
First I read the table into Spark from Redshift. When I tell Spark to do a simple count on the table, it works fine. However, when I try to create a table I get all kinds of YARN failures and ultimately some of my tasks have shuffle spill memory of 182 GB.
Here is the problematic query (I've changed some of the names):
CREATE TABLE output
SELECT
*
FROM inputs
DISTRIBUTE BY id
SORT BY snapshot_date
What is going on? How could the shuffle spill exceed the total size of the input? I'm not doing a cartesian join, or anything like that. This is a super simple query.
I'm aware that Red Hat Linux (I use EMR on AWS) has virtual memory issues, since I came across that topic here, but I've added the recommended config classification=yarn-site,properties=[yarn.nodemanager.vmem-check-enabled=false] to my EMR properties and the issue persists.
Here is a screenshot from the Spark UI, if it helps:
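One possible contributor, offered as an assumption since the screenshot is not available: the Spark UI reports "Shuffle spill (memory)" as the size of the deserialized rows in memory at the time they are spilled, while a table size such as 15 GB usually describes compressed, serialized storage. The two are not directly comparable. The snippet below is a rough Python analogy of that gap, not a reproduction of Spark internals; the row shape is made up.

```python
import pickle
import sys
import zlib

# Made-up rows standing in for (id, snapshot_date) records.
rows = [(i, "2020-01-01") for i in range(10_000)]

# Rough size of the live objects: each tuple plus its two fields.
in_memory = sum(sys.getsizeof(r) + sum(sys.getsizeof(f) for f in r) for r in rows)

# Size of the same rows once serialized and compressed.
serialized = len(zlib.compress(pickle.dumps(rows)))

print(in_memory, serialized, round(in_memory / serialized, 1))
```

The exact ratio depends on the data and the Python version, but the in-memory figure is many times larger than the serialized one, which is the same kind of gap that can make a spill metric exceed the input size.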

Writing a dataframe to disk taking an unrealistically long time in Pyspark (Spark 2.1.1)

I'm running Pyspark on a single server with multiple CPUs. All other operations (reading, joining, filtering, custom UDFs) are executed quickly except for writing to disk. The dataframe I'm trying to save is around 400 GB, with 200 partitions.
sc.getConf().getAll()
The driver memory is 16g, and working directory has enough space (> 10TB)
I'm trying to save using the following command:
df.repartition(1).write.csv("out.csv")
Wondering if anyone has run into the same issue. Also will changing any of the config parameters before pyspark is invoked help solve the issue?
Edits (a few clarifications):
When I said the other operations were executed quickly, I meant that there was always an action after the transformation (in my case, row counts), so all the operations really were executed very fast. I still haven't figured out why writing takes such a ridiculous amount of time.
One of my colleagues brought up the fact that the disks in our server might have a limit on concurrent writing which might be slowing things down, still investigating on this. Interested in knowing if others are seeing slow write times on a Spark cluster too. I have confirmation from one user regarding this on AWS cluster.
All other operations (reading, joining, filtering, custom UDFs)
That is because these are transformations - they don't do anything until data has to be saved.
The dataframe I'm trying to save is around 400 GB
(...)
I'm trying to save using the following command:
df.repartition(1).write.csv("out.csv")
That just cannot work well. Even ignoring the part where you use a single machine, saving 400GB with a single thread (!) is just hopeless. Even if it succeeds, it is no better than using a plain bash script.
Leaving Spark aside, sequential writes of 400GB will take a substantial amount of time, even on an average-size disk. And given multiple shuffles (join, repartition), the data will be written to disk multiple times.
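If a single output file is really required, one alternative to repartition(1) is to let Spark write its existing partitions in parallel (for example df.write.csv("out_dir")) and concatenate the part files afterwards. The sketch below shows only the concatenation step in plain Python. It assumes the files were written without a header row and follow Spark's part-* naming; the fake files in the self-check stand in for real Spark output.

```python
import glob
import os
import shutil
import tempfile


def concat_part_files(part_dir, out_path, pattern="part-*"):
    """Append every part file in `part_dir` to `out_path`, in file-name order."""
    parts = sorted(glob.glob(os.path.join(part_dir, pattern)))
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)  # streams in chunks, so memory use stays small
    return len(parts)


# Self-check with two fake part files.
with tempfile.TemporaryDirectory() as tmp:
    for i, text in enumerate(["a,1\n", "b,2\n"]):
        with open(os.path.join(tmp, f"part-{i:05d}.csv"), "w") as f:
            f.write(text)
    merged = os.path.join(tmp, "merged.csv")
    count = concat_part_files(tmp, merged)
    with open(merged) as f:
        merged_text = f.read()

print(count, repr(merged_text))  # 2 'a,1\nb,2\n'
```

The final copy is still a sequential write, but the expensive part (computing and serializing the rows) keeps its parallelism, and no extra shuffle into a single partition is needed.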
After a lot of trial and error, I realized that the issue was due to the method I used to read the file from disk. I was using the built-in read.csv function, and when I switched over to the read function in the databricks-csv package the problem went away. I'm now able to write files to disk in a reasonable time. It's really strange; maybe it's a bug in 2.1.1, or the databricks-csv package is really well optimized.
1. read.csv method
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("model") \
    .config("spark.worker.dir", "xxxx") \
    .getOrCreate()
df = spark.read.load("file.csv", format="csv", header=True)
# ... processing steps omitted in the original post ...
df.write.csv("file_after_processing.csv")
2. Using the databricks-csv package
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)  # assumes sc is the SparkContext provided by the pyspark shell
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('file.csv')
# ... processing steps omitted in the original post ...
df.write.format('com.databricks.spark.csv').save('file_after_processing.csv')

Spark DataStax Cassandra connector slow to read from heavy Cassandra table

I am new to Spark and the Spark Cassandra Connector. We are trying Spark for the first time in our team, and we are using the Spark Cassandra Connector to connect to a Cassandra database.
I wrote a query that uses a heavy table in the database, and I saw that the Spark task didn't start until the query had fetched all the records from the table.
It takes more than 3 hours just to fetch all the records from the database.
To get the data from the DB we use:
CassandraJavaUtil.javaFunctions(sparkContextManager.getJavaSparkContext(SOURCE).sc())
.cassandraTable(keyspaceName, tableName);
Is there a way to tell Spark to start working before all the data has finished downloading?
Is there an option to tell spark-cassandra-connector to use more threads for the fetch?
thanks,
kokou.
If you look at the Spark UI, how many partitions is your table scan creating? I just did something like this and I found that Spark was creating too many partitions for the scan and it was taking much longer as a result. The way I decreased the time on my job was by setting the configuration parameter spark.cassandra.input.split.size_in_mb to a value higher than the default. In my case it took a 20 minute job down to about four minutes. There are also a couple more Cassandra read specific Spark variables that you can set found here.
These stackoverflow questions are what I referenced originally, I hope they help you out as well.
Iterate large Cassandra table in small chunks
Set number of tasks on Cassandra table scan
EDIT:
After doing some performance testing and fiddling with some Spark configuration parameters, I found that Spark was creating far too many table partitions when I wasn't giving the Spark executors enough memory. In my case, upping the memory by a gigabyte was enough to render the input split size parameter unnecessary. If you can't give the executors more memory, you may still need to set spark.cassandra.input.split.size_in_mb higher as a workaround.
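To see why the split size matters, the relationship can be approximated as table size divided by split size. The sketch below uses a hypothetical 100GB table and two hypothetical split sizes. The connector works from its own size estimates, so real partition counts will differ, and the property name is the one used in the answer above and may vary between connector versions.

```python
def estimated_scan_partitions(table_size_mb, split_size_mb):
    """Approximate number of Spark partitions for a full table scan."""
    if split_size_mb <= 0:
        raise ValueError("split size must be positive")
    # Integer ceiling division, with at least one partition.
    return max(1, -(-table_size_mb // split_size_mb))


table_size_mb = 100 * 1024  # hypothetical 100GB table

for split_size_mb in (64, 256):
    print(split_size_mb, estimated_scan_partitions(table_size_mb, split_size_mb))
# 64 1600
# 256 400

# Hypothetical override, to be passed to the Spark configuration before the scan:
conf_overrides = {"spark.cassandra.input.split.size_in_mb": "256"}
```

Fewer, larger partitions reduce per-task overhead, at the cost of more memory needed per task, which matches the memory observation in the edit above.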
