Apache Spark OutOfMemoryError (HeapSpace)

Apache Spark OutOfMemoryError (HeapSpace) - apache-spark

I have a dataset with ~5M rows x 20 columns, containing a groupID and a rowID. My goal is to check whether (some) columns contain more than a fixed fraction (say, 50%) of missing (null) values within a group. If this is found, the entire column is set to missing (null), for that group.
df = spark.read.parquet('path/to/parquet/')
check_columns = {'col1': ..., 'col2': ..., ...} # currently len(check_columns) = 8
for col, _ in check_columns.items():
total = (df
.groupBy('groupID').count()
.toDF('groupID', 'n_total')
)
missing = (df
.where(F.col(col).isNull())
.groupBy('groupID').count()
.toDF('groupID', 'n_missing')
)
# count_missing = count_missing.persist() # PERSIST TRY 1
# print('col {} found {} missing'.format(col, missing.count())) # missing.count() is b/w 1k-5k
poor_df = (total
.join(missing, 'groupID')
.withColumn('freq', F.col('n_missing') / F.col('n_total'))
.where(F.col('freq') > 0.5)
.select('groupID')
.toDF('poor_groupID')
)
df = (df
.join(poor_df, df['groupID'] == poor_df['poor_groupID'], 'left_outer')
.withColumn(col, (F.when(F.col('poor_groupID').isNotNull(), None)
.otherwise(df[col])
)
)
.select(df.columns)
)
stats = (missing
.withColumnRenamed('n_missing', 'cnt')
.collect() # FAIL 1
)
# df = df.persist() # PERSIST TRY 2
print(df.count()) # FAIL 2
I initially assigned 1G of spark.driver.memory and 4G of spark.executor.memory, eventually increasing the spark.driver.memory up to 10G.
Problem(s):
The loop runs like a charm during the first iterations, but towards the end,
around the 6th or 7th iteration I see my CPU utilization dropping (using 1
instead of 6 cores). Along with that, execution time for one iteration
increases significantly.
At some point, I get an OutOfMemory Error:
spark.driver.memory < 4G: at collect() (FAIL 1)
4G <= spark.driver.memory < 10G: at the count() step (FAIL 2)
Stack Trace for FAIL 1 case (relevant part):
[...]
py4j.protocol.Py4JJavaError: An error occurred while calling o1061.collectToPython.
: java.lang.OutOfMemoryError: Java heap space
[...]
The executor UI does not reflect excessive memory usage (it shows a <50k used
memory for the driver and <1G for the executor). The Spark metrics system
(app-XXX.driver.BlockManager.memory.memUsed_MB) does not either: it shows
600M to 1200M of used memory, but always >300M remaining memory.
(This would suggest that 2G driver memory should do it, but it doesn't.)
It also does not matter which column is processed first (as it is a loop over
a dict(), it can be in arbitrary order).
My questions thus:
What causes the OutOfMemory Error and why are not all available CPU cores
used towards the end?
And why do I need 10G spark.driver.memory when I am transferring only a few kB from the executors to the driver?
A few (general) questions to make sure I understand things properly:
If I get an OOM error, the right place to look at is almost always the driver
(b/c the executor spills to disk)?
Why would count() cause an OOM error - I thought this action would only
consume resources on the exector(s) (delivering a few bytes to the driver)?
Are the memory metrics (metrics system, UI) mentioned above the correct
places to look at?
BTW: I run Spark 2.1.0 in standalone mode.
UPDATE 2017-04-28
To drill down further, I enabled a heap dump for the driver:
cfg = SparkConfig()
cfg.set('spark.driver.extraJavaOptions', '-XX:+HeapDumpOnOutOfMemoryError')
I ran it with 8G of spark.driver.memory and I analyzed the heap dump with
Eclipse MAT. It turns out there are two classes of considerable size (~4G each):
java.lang.Thread
- char (2G)
- scala.collection.IndexedSeqLike
- scala.collection.mutable.WrappedArray (1G)
- java.lang.String (1G)
org.apache.spark.sql.execution.ui.SQLListener
- org.apache.spark.sql.execution.ui.SQLExecutionUIData
(various of up to 1G in size)
- java.lang.String
- ...
I tried to turn off the UI, using
cfg.set('spark.ui.enabled', 'false')
which made the UI unavailable, but didn't help on the OOM error. Also, I tried
to have the UI to keep less history, using
cfg.set('spark.ui.retainedJobs', '1')
cfg.set('spark.ui.retainedStages', '1')
cfg.set('spark.ui.retainedTasks', '1')
cfg.set('spark.sql.ui.retainedExecutions', '1')
cfg.set('spark.ui.retainedDeadExecutors', '1')
This also did not help.
UPDATE 2017-05-18
I found out about Spark's pyspark.sql.DataFrame.checkpoint method. This is like persist but gets rid of the dataframe's lineage. Thus it helps to circumvent the above mentioned issues.

Related

Spark Partitionby doesn't scale as expected

INPUT:
The input data set contains 10 million transactions in multiple files stored as parquet. The size of the entire data set including all files ranges from 6 to 8GB.
PROBLEM STATEMENT:
Partition the transactions based on customer id's which would create one folder per customer id and each folder containing all the transactions done by that particular customer.
HDFS has a hard limit of 6.4 million on the number of sub directories within a root directory that can be created so using the last two digits of the customer id ranging from 00,01,02...to 99 to create top level directories and each top level directory would contain all the customer id's ending with that specific two digits.
Sample output directory structure:
00/cust_id=100900/part1.csv
00/cust_id=100800/part33.csv
01/cust_id=100801/part1.csv
03/cust_id=100803/part1.csv
CODE:
// Reading input file and storing in cache
val parquetReader = sparksession.read
.parquet("/inputs")
.persist(StorageLevel.MEMORY_ONLY) //No spill will occur has enough memory
// Logic to partition
var customerIdEndingPattern = 0
while (cardAccountEndingPattern < 100) {
var idEndPattern = customerIdEndingPattern + ""
if (customerIdEndingPattern < 10) {
idEndPattern = "0" + customerIdEndingPattern
}
parquetReader
.filter(col("customer_id").endsWith(idEndPattern))
.repartition(945, col("customer_id"))
.write
.partitionBy("customer_id")
.option("header", "true")
.mode("append")
.csv("/" + idEndPattern)
customerIdEndingPattern = customerIdEndingPattern + 1
}
Spark Configuration:
Amazon EMR 5.29.0 (Spark 2.4.4 & Hadoop 2.8.5)
1 master and 10 slaves and each of them has 96 vCores and 768GB RAM(Amazon AWS R5.24xlarge instance). Hard disks are EBS with bust of 3000 IOPS for 30 mins.
'spark.hadoop.dfs.replication': '3',
'spark.driver.cores':'5',
'spark.driver.memory':'32g',
'spark.executor.instances': '189',
'spark.executor.memory': '32g',
'spark.executor.cores': '5',
'spark.executor.memoryOverhead':'8192',
'spark.driver.memoryOverhead':'8192',
'spark.default.parallelism':'945',
'spark.sql.shuffle.partitions' :'945',
'spark.serializer':'org.apache.spark.serializer.KryoSerializer',
'spark.dynamicAllocation.enabled': 'false',
'spark.memory.fraction':'0.8',
'spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version':'2',
'spark.memory.storageFraction':'0.2',
'spark.task.maxFailures': '6',
'spark.driver.extraJavaOptions': '-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError="kill -9 %p"
'spark.executor.extraJavaOptions': '-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError="kill -9 %p"
SCALING ISSUES:
Experimented from 10 to all the way upto 40 slaves(adjusting the spark configs accordingly) but still the same results the job takes more than 2hrs to complete(as shown in the first pic each job takes more than a minute and the while loop runs 99 times). Also the reads from remote executors are almost non existent(which is good) most are process local.
Partition seems to work fine(refer second pic) got 5 RDD blocks per instance and 5 tasks running at all times(each instance has 5 cores and 19 instances per slave node). GC is optimized too.
Each partitionby task as written in the while loop takes a minute or more to complete.
METRICS:
Sample duration of a few jobs we have 99 jobs in total
Partition seems okay
Summary from 1 job basically one partitionby execution
Summary of a few instances after full job completion hence RDD blocks is zero and the first row is driver.
So the question is how to optimize it more and why it's not scaling up? Is there a better way to go about it? Have I reached the max performance already? Assuming I have access to more resources in terms of hardware is there anything I could do better? Any suggestions are welcome.

Touching every record 100 times is very inefficient, even if data can be cached in memory and not be evicted downstream. Not to mention persisting alone is expensive
Instead you could add a virtual column
import org.apache.spark.sql.functions.substring
val df = sparksession.read
.parquet("/inputs")
.withColumn("partition_id", substring($"customer_id", -2, 2))
and use it later for partitioning
df
.write
.partitionBy("partition_id", "customer_id")
.option("header", "true")
.mode("append")
.csv("/")
To avoid to many small files you can repartition first using longer suffix
val nParts: Int = ???
val suffixLength: Int = ??? // >= suffix length used for write partitions
df
.repartitionByRange(
nParts,
substring($"customer_id", -suffixLength, suffixLength)
.write
.partitionBy("partition_id", "customer_id")
.option("header", "true")
.mode("append")
.csv("/")
Such changes will allow you to process all data in a single pass without any explicit caching.

Optimization Spark job - Spark 2.1

my spark job currently runs in 59 mins. I want to optimize it so that I it takes less time. I have noticed that the last step of the job takes a lot of time (55 mins) (see the screenshots of the spark job in Spark UI below).
I need to join a big dataset with a smaller one, apply transformations on this joined dataset (creating a new column).
At the end, I should have a dataset repartitioned based on the column PSP (see snippet of the code below). I also perform a sort at the end (sort each partition based on 3 columns).
All the details (infrastructure, configuration, code) can be found below.
Snippet of my code :
spark.conf.set("spark.sql.shuffle.partitions", 4158)
val uh = uh_months
.withColumn("UHDIN", datediff(to_date(unix_timestamp(col("UHDIN_YYYYMMDD"), "yyyyMMdd").cast(TimestampType)),
to_date(unix_timestamp(col("january"), "yyyy-MM-dd").cast(TimestampType))))
"ddMMMyyyy")).cast(TimestampType)))
.withColumn("DVA_1", date_format(col("DVA"), "dd/MM/yyyy"))
.drop("UHDIN_YYYYMMDD")
.drop("january")
.drop("DVA")
.persist()
val uh_flag_comment = new TransactionType().transform(uh)
uh.unpersist()
val uh_joined = uh_flag_comment.join(broadcast(smallDF), "NO_NUM")
.select(
uh.col("*"),
smallDF.col("PSP"),
smallDF.col("minrel"),
smallDF.col("Label"),
smallDF.col("StartDate"))
.withColumnRenamed("DVA_1", "DVA")
smallDF.unpersist()
val uh_to_be_sorted = uh_joined.repartition(4158, col("PSP"))
val uh_final = uh_joined.sortWithinPartitions(col("NO_NUM"), col("UHDIN"), col("HOURMV"))
uh_final
EDITED - Repartition logic
val sqlContext = spark.sqlContext
sqlContext.udf.register("randomUDF", (partitionCount: Int) => {
val r = new scala.util.Random
r.nextInt(partitionCount)
// Also tried with r.nextInt(partitionCount) + col("PSP")
})
val uh_to_be_sorted = uh_joined
.withColumn("tmp", callUDF("RandomUDF", lit("4158"))
.repartition(4158, col("tmp"))
.drop(col("tmp"))
val uh_final = uh_to_be_sorted.sortWithinPartitions(col("NO_NUM"), col("UHDIN"), col("HOURMV"))
uh_final
smallDF is a small dataset (535MB) that I broadcast.
TransactionType is a class where I add a new column of string elements to my uh dataframe based on the value of 3 columns (MMED, DEBCRED, NMTGP), checking the values of those columns using regex.
I previously faced a lot of issues (job failing) because of shuffle blocks that were not found. I discovered that I was spilling to disk and had a lot of GC memory issues so I increased the "spark.sql.shuffle.partitions" to 4158.
WHY 4158 ?
Partition_count = (stage input data) / (target size of your partition)
so Shuffle partition_count = (shuffle stage input data) / 200 MB = 860000/200=4300
I have 16*24 - 6 =378 cores availaible. So if I want to run every tasks in one go, I should divide 4300 by 378 which is approximately 11. Then 11*378=4158
Spark Version: 2.1
Cluster configuration:
24 compute nodes (workers)
16 vcores each
90 GB RAM per node
6 cores are already being used by other processes/jobs
Current Spark configuration:
-master: yarn
-executor-memory: 26G
-executor-cores: 5
-driver memory: 70G
-num-executors: 70
-spark.kryoserializer.buffer.max=512
-spark.driver.cores=5
-spark.driver.maxResultSize=500m
-spark.memory.storageFraction=0.4
-spark.memory.fraction=0.9
-spark.hadoop.fs.permissions.umask-mode=007
How is the job executed:
We build an artifact (jar) with IntelliJ and then send it to a server. Then a bash script is executed. This script:
export some environment variables (SPARK_HOME, HADOOP_CONF_DIR, PATH and SPARK_LOCAL_DIRS)
launch the spark-submit command with all the parameters defined in the spark configuration above
retrieves the yarn logs of the application
Spark UI screenshots
DAG

#Ali
From the Summary Metrics we can say that your data is Skewed ( Max Duration : 49 min and Max Shuffle Read Size/Records : 2.5 GB/ 23,947,440 where as on an average it's taking about 4-5 mins and processing less than 200 MB/1.2 MM rows)
Now that we know the problem might be skew of data in few partition(s) , I think we can fix this by changing repartition logic val uh_to_be_sorted = uh_joined.repartition(4158, col("PSP")) by chosing something (like some other column or adding any other column to PSP)
few links to refer on data skew and fix
https://dzone.com/articles/optimize-spark-with-distribute-by-cluster-by
https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/
Hope this helps

What is the best strategy to load huge datasets/data into Hive tables using Spark? [duplicate]

I am trying to move data from a table in PostgreSQL table to a Hive table on HDFS. To do that, I came up with the following code:
val conf = new SparkConf().setAppName("Spark-JDBC").set("spark.executor.heartbeatInterval","120s").set("spark.network.timeout","12000s").set("spark.sql.inMemoryColumnarStorage.compressed", "true").set("spark.sql.orc.filterPushdown","true").set("spark.serializer", "org.apache.spark.serializer.KryoSerializer").set("spark.kryoserializer.buffer.max","512m").set("spark.serializer", classOf[org.apache.spark.serializer.KryoSerializer].getName).set("spark.streaming.stopGracefullyOnShutdown","true").set("spark.yarn.driver.memoryOverhead","7168").set("spark.yarn.executor.memoryOverhead","7168").set("spark.sql.shuffle.partitions", "61").set("spark.default.parallelism", "60").set("spark.memory.storageFraction","0.5").set("spark.memory.fraction","0.6").set("spark.memory.offHeap.enabled","true").set("spark.memory.offHeap.size","16g").set("spark.dynamicAllocation.enabled", "false").set("spark.dynamicAllocation.enabled","true").set("spark.shuffle.service.enabled","true")
val spark = SparkSession.builder().config(conf).master("yarn").enableHiveSupport().config("hive.exec.dynamic.partition", "true").config("hive.exec.dynamic.partition.mode", "nonstrict").getOrCreate()
def prepareFinalDF(splitColumns:List[String], textList: ListBuffer[String], allColumns:String, dataMapper:Map[String, String], partition_columns:Array[String], spark:SparkSession): DataFrame = {
val colList = allColumns.split(",").toList
val (partCols, npartCols) = colList.partition(p => partition_columns.contains(p.takeWhile(x => x != ' ')))
val queryCols = npartCols.mkString(",") + ", 0 as " + flagCol + "," + partCols.reverse.mkString(",")
val execQuery = s"select ${allColumns}, 0 as ${flagCol} from schema.tablename where period_year='2017' and period_num='12'"
val yearDF = spark.read.format("jdbc").option("url", connectionUrl).option("dbtable", s"(${execQuery}) as year2017")
.option("user", devUserName).option("password", devPassword)
.option("partitionColumn","cast_id")
.option("lowerBound", 1).option("upperBound", 100000)
.option("numPartitions",70).load()
val totalCols:List[String] = splitColumns ++ textList
val cdt = new ChangeDataTypes(totalCols, dataMapper)
hiveDataTypes = cdt.gpDetails()
val fc = prepareHiveTableSchema(hiveDataTypes, partition_columns)
val allColsOrdered = yearDF.columns.diff(partition_columns) ++ partition_columns
val allCols = allColsOrdered.map(colname => org.apache.spark.sql.functions.col(colname))
val resultDF = yearDF.select(allCols:_*)
val stringColumns = resultDF.schema.fields.filter(x => x.dataType == StringType).map(s => s.name)
val finalDF = stringColumns.foldLeft(resultDF) {
(tempDF, colName) => tempDF.withColumn(colName, regexp_replace(regexp_replace(col(colName), "[\r\n]+", " "), "[\t]+"," "))
}
finalDF
}
val dataDF = prepareFinalDF(splitColumns, textList, allColumns, dataMapper, partition_columns, spark)
val dataDFPart = dataDF.repartition(30)
dataDFPart.createOrReplaceTempView("preparedDF")
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("set hive.exec.dynamic.partition=true")
spark.sql(s"INSERT OVERWRITE TABLE schema.hivetable PARTITION(${prtn_String_columns}) select * from preparedDF")
The data is inserted into the hive table dynamically partitioned based on prtn_String_columns: source_system_name, period_year, period_num
Spark-submit used:
SPARK_MAJOR_VERSION=2 spark-submit --conf spark.ui.port=4090 --driver-class-path /home/fdlhdpetl/jars/postgresql-42.1.4.jar --jars /home/fdlhdpetl/jars/postgresql-42.1.4.jar --num-executors 80 --executor-cores 5 --executor-memory 50G --driver-memory 20G --driver-cores 3 --class com.partition.source.YearPartition splinter_2.11-0.1.jar --master=yarn --deploy-mode=cluster --keytab /home/fdlhdpetl/fdlhdpetl.keytab --principal fdlhdpetl#FDLDEV.COM --files /usr/hdp/current/spark2-client/conf/hive-site.xml,testconnection.properties --name Splinter --conf spark.executor.extraClassPath=/home/fdlhdpetl/jars/postgresql-42.1.4.jar
The following error messages are generated in the executor logs:
Container exited with a non-zero exit code 143.
Killed by external signal
18/10/03 15:37:24 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[SIGTERM handler,9,system]
java.lang.OutOfMemoryError: Java heap space
at java.util.zip.InflaterInputStream.<init>(InflaterInputStream.java:88)
at java.util.zip.ZipFile$ZipFileInflaterInputStream.<init>(ZipFile.java:393)
at java.util.zip.ZipFile.getInputStream(ZipFile.java:374)
at java.util.jar.JarFile.getManifestFromReference(JarFile.java:199)
at java.util.jar.JarFile.getManifest(JarFile.java:180)
at sun.misc.URLClassPath$JarLoader$2.getManifest(URLClassPath.java:944)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:450)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.util.SignalUtils$ActionHandler.handle(SignalUtils.scala:99)
at sun.misc.Signal$1.run(Signal.java:212)
at java.lang.Thread.run(Thread.java:745)
I see in the logs that the read is being executed properly with the given number of partitions as below:
Scan JDBCRelation((select column_names from schema.tablename where period_year='2017' and period_num='12') as year2017) [numPartitions=50]
Below is the state of executors in stages:
The data is not being partitioned properly. One partition is smaller while the other one becomes huge. There is a skew problem here.
While inserting the data into Hive table the job fails at the line:spark.sql(s"INSERT OVERWRITE TABLE schema.hivetable PARTITION(${prtn_String_columns}) select * from preparedDF") but I understand this is happening because of the data skew problem.
I tried to increase number of executors, increasing the executor memory, driver memory, tried to just save as csv file instead of saving the dataframe into a Hive table but nothing affects the execution from giving the exception:
java.lang.OutOfMemoryError: GC overhead limit exceeded
Is there anything in the code that I need to correct ? Could anyone let me know how can I fix this problem ?

Determine how many partitions you need given the amount of input data and your cluster resources. As a rule of thumb it is better to keep partition input under 1GB unless strictly necessary. and strictly smaller than the block size limit.
You've previously stated that you migrate 1TB of data values you use in different posts (5 - 70) are likely way to low to ensure smooth process.
Try to use value which won't require further repartitioning.
Know your data.
Analyze the columns available in the the dataset to determine if there any columns with high cardinality and uniform distribution to be distributed among desired number of partitions. These are good candidates for an import process. Additionally you should determine an exact range of values.
Aggregations with different centrality and skewness measure as well as histograms and basic counts-by-key are good exploration tools. For this part it is better to analyze data directly in the database, instead of fetching it to Spark.
Depending on the RDBMS you might be able to use width_bucket (PostgreSQL, Oracle) or equivalent function to get a decent idea how data will be distributed in Spark after loading with partitionColumn, lowerBound, upperBound, numPartitons.
s"""(SELECT width_bucket($partitionColum, $lowerBound, $upperBound, $numPartitons) AS bucket, COUNT(*)
FROM t
GROUP BY bucket) as tmp)"""
If there are no columns which satisfy above criteria consider:
Creating a custom one and exposing it via. a view. Hashes over multiple independent columns are usually good candidates. Please consult your database manual to determine functions that can be used here (DBMS_CRYPTO in Oracle, pgcrypto in PostgreSQL)*.
Using a set of independent columns which taken together provide high enough cardinality.
Optionally, if you're going to write to a partitioned Hive table, you should consider including Hive partitioning columns. It might limit the number of files generated later.
Prepare partitioning arguments
If column selected or created in the previous steps is numeric (or date / timestamp in Spark >= 2.4) provide it directly as the partitionColumn and use range values determined before to fill lowerBound and upperBound.
If bound values don't reflect the properties of data (min(col) for lowerBound, max(col) for upperBound) it can result in a significant data skew so thread carefully. In the worst case scenario, when bounds don't cover the range of data, all records will be fetched by a single machine, making it no better than no partitioning at all.
If column selected in the previous steps is categorical or is a set of columns generate a list of mutually exclusive predicates that fully cover the data, in a form that can be used in a SQL where clause.
For example if you have a column A with values {a1, a2, a3} and column B with values {b1, b2, b3}:
val predicates = for {
a <- Seq("a1", "a2", "a3")
b <- Seq("b1", "b2", "b3")
} yield s"A = $a AND B = $b"
Double check that conditions don't overlap and all combinations are covered. If these conditions are not satisfied you end up with duplicates or missing records respectively.
Pass data as predicates argument to jdbc call. Note that the number of partitions will be equal exactly to the number of predicates.
Put database in a read-only mode (any ongoing writes can cause data inconsistency. If possible you should lock database before you start the whole process, but if might be not possible, in your organization).
If the number of partitions matches the desired output load data without repartition and dump directly to the sink, if not you can try to repartition following the same rules as in the step 1.
If you still experience any problems make sure that you've properly configured Spark memory and GC options.
If none of the above works:
Consider dumping your data to a network / distributes storage using tools like COPY TO and read it directly from there.
Note that or standard database utilities you will typically need a POSIX compliant file system, so HDFS usually won't do.
The advantage of this approach is that you don't need to worry about the column properties, and there is no need for putting data in a read-only mode, to ensure consistency.
Using dedicated bulk transfer tools, like Apache Sqoop, and reshaping data afterwards.
* Don't use pseudocolumns - Pseudocolumn in Spark JDBC.

In my experience there are 4 kinds of memory settings which make a difference:
A) [1] Memory for storing data for processing reasons VS [2] Heap Space for holding the program stack
B) [1] Driver VS [2] executor memory
Up to now, I was always able to get my Spark jobs running successfully by increasing the appropriate kind of memory:
A2-B1 would therefor be the memory available on the driver to hold the program stack. Etc.
The property names are as follows:
A1-B1) executor-memory
A1-B2) driver-memory
A2-B1) spark.yarn.executor.memoryOverhead
A2-B2) spark.yarn.driver.memoryOverhead
Keep in mind that the sum of all *-B1 must be less than the available memory on your workers and the sum of all *-B2 must be less than the memory on your driver node.
My bet would be, that the culprit is one of the boldly marked heap settings.

There was an another question of yours routed here as duplicate
'How to avoid data skewing while reading huge datasets or tables into spark?
The data is not being partitioned properly. One partition is smaller while the
other one becomes huge on read.
I observed that one of the partition has nearly 2million rows and
while inserting there is a skew in partition. '
if the problem is to deal with data that is partitioned in a dataframe after read, Have you played around increasing the "numPartitions" value ?
.option("numPartitions",50)
lowerBound, upperBound form partition strides for generated WHERE clause expressions and numpartitions determines the number of split.
say for example, sometable has column - ID (we choose that as partitionColumn) ; value range we see in table for column-ID is from 1 to 1000 and we want to get all the records by running select * from sometable,
so we going with lowerbound = 1 & upperbound = 1000 and numpartition = 4
this will produce a dataframe of 4 partition with result of each Query by building sql based on our feed (lowerbound = 1 & upperbound = 1000 and numpartition = 4)
select * from sometable where ID < 250
select * from sometable where ID >= 250 and ID < 500
select * from sometable where ID >= 500 and ID < 750
select * from sometable where ID >= 750
what if most of the records in our table fall within the range of ID(500,750). that's the situation you are in to.
when we increase numpartition , the split happens even further and that reduce the volume of records in the same partition but this
is not a fine shot.
Instead of spark splitting the partitioncolumn based on boundaries we provide, if you think of feeding the split by yourself so, data can be evenly
splitted. you need to switch over to another JDBC method where instead of (lowerbound,upperbound & numpartition) we can provide
predicates directly.
def jdbc(url: String, table: String, predicates: Array[String], connectionProperties: Properties): DataFrame
Link

Spark driver memory for rdd.saveAsNewAPIHadoopFile and workarounds

I'm having issues with a particular spark method, saveAsNewAPIHadoopFile. The context is that I'm using pyspark, indexing RDDs with 1k, 10k, 50k, 500k, 1m records into ElasticSearch (ES).
For a variety of reasons, the Spark context is quite underpowered with a 2gb driver, and single 2gb executor.
I've had no problem until about 500k, when I'm getting java heap size problems. Increasing the spark.driver.memory to about 4gb, and I'm able to index more. However, there is a limit to how long this will work, and we would like to index in upwards of 500k, 1m, 5m, 20m records.
Also constrained to using pyspark, for a variety of reasons. The bottleneck and breakpoint seems to be a spark stage called take at SerDeUtil.scala:233, that no matter how many partitions the RDD has going in, it drops down to one, which I'm assuming is the driver collecting the partitions and preparing for indexing.
Now - I'm wondering if there is an efficient way to still use an approach like the following, given that constraint:
to_index_rdd.saveAsNewAPIHadoopFile(
path='-',
outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
keyClass="org.apache.hadoop.io.NullWritable",
valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
conf={
"es.resource":"%s/record" % index_name,
"es.nodes":"192.168.45.10:9200",
"es.mapping.exclude":"temp_id",
"es.mapping.id":"temp_id",
}
)
In pursuit of a good solution, I might as well air some dirty laundry. I've got a terribly inefficient workaround that uses zipWithIndex to chunk an RDD, and send those subsets to the indexing function above. Looks a bit like this:
def index_chunks_to_es(spark=None, job=None, kwargs=None, rdd=None, chunk_size_limit=10000):
# zip with index
zrdd = rdd.zipWithIndex()
# get count
job.update_record_count(save=False)
count = job.record_count
# determine number of chunks
steps = count / chunk_size_limit
if steps % 1 != 0:
steps = int(steps) + 1
# evenly distribute chunks, while not exceeding chunk_limit
dist_chunk_size = int(count / steps) + 1
# loop through steps, appending subset to list for return
for step in range(0, steps):
# determine bounds
lower_bound = step * dist_chunk_size
upper_bound = (step + 1) * dist_chunk_size
print(lower_bound, upper_bound)
# select subset
rdd_subset = zrdd.filter(lambda x: x[1] >= lower_bound and x[1] < upper_bound).map(lambda x: x[0])
# index to ElasticSearch
ESIndex.index_job_to_es_spark(
spark,
job=job,
records_df=rdd_subset.toDF(),
index_mapper=kwargs['index_mapper']
)
It's slow, if I'm understanding correctly, because that zipWithIndex, filter, and map are evaluated for each resulting RDD subset. However, it's also memory efficient in that 500k, 1m, 5m, etc. records are never sent to saveAsNewAPIHadoopFile, instead, these smaller RDDs that a relatively small spark driver can handle.
Any suggestions for different approaches would be greatly appreciated. Perhaps that means now using the Elasticsearch-Hadoop connector, but instead sending raw JSON to ES?
Update:
Looks like I'm still getting java heap space errors with this workaround, but leaving here to demonstrate thinking for a possible workaround. Had not anticipated that zipWithIndex would collect everything on the driver (which I'm assuming is the case here)
Update #2
Here is a debug string of the RDD I'ma attempting to run through saveAsNewAPIHadoopFile:
(32) PythonRDD[6] at RDD at PythonRDD.scala:48 []
| MapPartitionsRDD[5] at javaToPython at NativeMethodAccessorImpl.java:-2 []
| MapPartitionsRDD[4] at javaToPython at NativeMethodAccessorImpl.java:-2 []
| ShuffledRowRDD[3] at javaToPython at NativeMethodAccessorImpl.java:-2 []
+-(1) MapPartitionsRDD[2] at javaToPython at NativeMethodAccessorImpl.java:-2 []
| MapPartitionsRDD[1] at javaToPython at NativeMethodAccessorImpl.java:-2 []
| JDBCRDD[0] at javaToPython at NativeMethodAccessorImpl.java:-2 []
Update #3
Below is a DAG visualization for the take at SerDeUtil.scala:233 that appears to run on driver/localhost:
And a DAG for the saveAsNewAPIHadoopFile for a much smaller job (around 1k rows), as the 500k rows attempts never actually fire as the SerDeUtil stage above is what appears to trigger the java heap size problem for larger RDDs:

I'm still a bit confused as to why this addresses the problem, but it does. When reading rows from MySQL with spark.jdbc.read, by passing bounds, the resulting RDD appears to be partitioned in such a way that saveAsNewAPIHadoopFile is successful for large RDDs.
Have a Django model for the DB rows, so can get first and last column IDs:
records = records.order_by('id')
start_id = records.first().id
end_id = records.last().id
Then, pass those to spark.read.jdbc:
sqldf = spark.read.jdbc(
settings.COMBINE_DATABASE['jdbc_url'],
'core_record',
properties=settings.COMBINE_DATABASE,
column='id',
lowerBound=bounds['lowerBound'],
upperBound=bounds['upperBound'],
numPartitions=settings.SPARK_REPARTITION
)
The debug string for the RDD shows that the originating RDD now has 10 partitions:
(32) PythonRDD[11] at RDD at PythonRDD.scala:48 []
| MapPartitionsRDD[10] at javaToPython at NativeMethodAccessorImpl.java:-2 []
| MapPartitionsRDD[9] at javaToPython at NativeMethodAccessorImpl.java:-2 []
| ShuffledRowRDD[8] at javaToPython at NativeMethodAccessorImpl.java:-2 []
+-(10) MapPartitionsRDD[7] at javaToPython at NativeMethodAccessorImpl.java:-2 []
| MapPartitionsRDD[6] at javaToPython at NativeMethodAccessorImpl.java:-2 []
| JDBCRDD[5] at javaToPython at NativeMethodAccessorImpl.java:-2 []
Where my understanding breaks down, is that you can see there is a manual/explicit repartitioning to 32, both in the debug string from the question, and this one above, which I thought would be enough to ease memory pressure on the saveAsNewAPIHadoopFile call, but apparently the Dataframe (turned into an RDD) from the original spark.jdbc.read matters even downstream.

Running large dataset causes timeout

I am building a Spark application that is relatively simple. Generally, the logic looks like this:
val file1 = sc.textFile("s3://file1/*")
val file2 = sc.textFile("s3://file2/*")
// map over files
val file1Map = file1.map(word => (word, "val1"))
val file2Map = file2.map(differentword => (differentword, "val2"))
val unionRdd = file1Map.union(file2Map)
val groupedUnion = unionRdd.groupByKey()
val output = groupedUnion.map(tuple => {
// do something that requires all the values, return new object
if(oneThingIsTrue) tuple._1 else "null"
}).filter(line => line != "null")
output.saveAsTextFile("s3://newfile/")
The question has to do with this not working when I run it with larger datasets. I can run it without errors when the Dataset is around 700GB. When I double it to 1.6TB, the job will get halfway before timing out. Here is the Err log:
INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 0, fetching them
INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker#172.31.4.36:39743)
ERROR MapOutputTrackerWorker: Error communicating with MapOutputTracker
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [800 seconds]. This timeout is controlled by spark.network.timeout
I have tried increasing the network timeout to both 800 seconds and 1600 seconds but all this does is delay the error for longer. I am running the code on 10r4.2xl which have 8 cores each and 62gb RAM. I have EBS setup to have 3TB storage. I am running this code via Zeppelin in Amazon EMR.
Can anyone help me debug this? The CPU usage of the cluster will be close to 90% the whole time until it gets halfway and it drops back to 0 completely. The other interesting thing is that it looks like it fails in the second stage when it is shuffling. As you can see from the trace, it is doing the fetch and never gets it.
Here is a photo from Ganglia.

I'm still not sure what caused this but I was able to get around it by coalescing the unionRdd and then grouping that result. Changing the above code to:
...
// union rdd is 30k partitions, coalesce into 8k
val unionRdd = file1Map.union(file2Map)
val col = unionRdd.coalesce(8000)
val groupedUnion = col.groupByKey()
...
It might not be efficient, but it works.

replace groupbykey with reduceByKey or aggregateByKey or combineByKey.
groupByKey must bring all like keys onto the same worker and this can cause an out of memory error. Not sure why there isn't a warning on using this function

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string