Spark Kafka Producer throwing Too many open files Exception - apache-spark

I am trying to run a Spark Kafka job written in Java that produces around 10K records per batch to a Kafka topic. It is a Spark batch job that reads 100 HDFS part files (1 million records in total) sequentially in a loop and produces each part file of 10K records as one batch.
I am using the org.apache.kafka.clients.producer.KafkaProducer API.
I am getting the below exception:
org.apache.kafka.common.KafkaException: Failed to construct kafka producer
....
Caused by: org.apache.kafka.common.KafkaException: java.io.IOException: Too many open files
....
Caused by: java.io.IOException: Too many open files
Below are the configurations:
Cluster Resource availability:
---------------------------------
The cluster has more than 500 nodes, 150 terabytes of total memory, and more than 30K cores
Spark Application configuration:
------------------------------------
Driver_memory: 24GB
--executor-cores: 5
--num-executors: 24
--executor-memory: 24GB
Topic Configuration:
--------------------
Partitions: 16
Replication: 3
Data size:
----------
Each part file has 10K records
Total records: 1 million
Each batch produces 10K records
Please suggest some solutions, as this is a very critical issue.
Thanks in advance

In Kafka, every topic is (optionally) split into many partitions. For each partition, the broker maintains several files (for the index and the actual data).
kafka-topics --zookeeper localhost:2181 --describe --topic topic_name
will give you the number of partitions for topic topic_name. The default number of partitions per topic, num.partitions, is defined in /etc/kafka/server.properties.
The total number of open files could be huge if the broker hosts many partitions and a particular partition has many log segment files.
You can see the current file descriptor limit by running
ulimit -n
You can also check the number of open files using lsof:
lsof | wc -l
To solve the issue, you either need to raise the limit of open file descriptors:
ulimit -n <noOfFiles>
or somehow reduce the number of open files (for example, reduce the number of partitions per topic).
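As a rough worked example of how quickly broker-side descriptors add up (only the 16 partitions and replication factor 3 come from the question; the broker count and segment counts below are assumptions for illustration):

# Rough estimate of broker-side file descriptors held open for log files alone.
partitions = 16
replication = 3
brokers = 3                      # assumed broker count
segments_per_partition = 50      # assumed; depends on segment size and retention
files_per_segment = 3            # .log, .index, .timeindex

replicas_per_broker = partitions * replication // brokers
open_files = replicas_per_broker * segments_per_partition * files_per_segment
print(open_files)                # 2400 descriptors for this one topic alone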

Related

Huge Multiline Json file is being processed by single Executor

I have a huge JSON file, 35-40 GB in size; it is a MULTILINE JSON on HDFS. I have made use of
spark.read.option('multiline', 'true').json('MULTILINE_JSONFILE_.json').repartition(50)
with PySpark.
I have bumped it up to 60 executors, 16 cores, 16 GB memory and set the memory overhead parameters.
On every run the executors were being lost.
It works perfectly for smaller files, but not with files > 15 GB.
I have enough cluster resources.
From the Spark UI, what I have seen is that every time, the data is processed by a single executor while all the other executors stay idle.
I have seen the Stages (0/2) and Tasks (0/51).
I have re-partitioned the data as well.
Code:
df = spark.read.option('multiline', 'true').json('MULTILINE_JSONFILE_.json').repartition(50)
df.count()
df.rdd.glom().map(len).collect()
df.write.... (HDFSLOCATION, format='csv')
Goal: My goal is to apply a UDF to each column, clean the data, and write the result out in CSV format.
The dataframe has 8 million rows and 210 columns.
As a rule of thumb, Spark's parallelism is based on the number of input files. But you specified only one file (MULTILINE_JSONFILE_.json), so Spark will use one CPU to process the following code
spark.read.option('multiline', 'true').json('MULTILINE_JSONFILE_.json')
even if you have 16 cores.
I would recommend that you split the JSON file into many files.
More precisely, parallelism is based on the number of file blocks if the files are stored on HDFS. If MULTILINE_JSONFILE_.json is 40 GB, it has around 320 blocks with a 128 MB block size, so Spark tasks should run in parallel when the file is located on HDFS. If you are still stuck without parallelism, I think this is because option("multiline", true) is specified.
In the Databricks documentation, you can see the following sentence:
Files will be loaded as a whole entity and cannot be split.
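One workaround consistent with this (the paths and partition count below are placeholders, not values from the question): accept that the multiline read runs as a single task, then repartition and persist to a splittable format such as Parquet so that the UDF/cleaning stage runs in parallel.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The multiline read cannot be split, so this first step runs as one task.
df = (spark.read
      .option('multiline', 'true')
      .json('hdfs:///path/to/MULTILINE_JSONFILE_.json')
      .repartition(200))

# Persisting to a splittable columnar format lets later jobs (UDFs, CSV export)
# read the data with many parallel tasks instead of one.
df.write.mode('overwrite').parquet('hdfs:///path/to/intermediate_parquet')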

Structured Spark Streaming Throws OOM exception

My structured Spark Streaming Job fails with the following exception after running for more than 24 hrs.
Exception in thread "spark-listener-group-eventLog" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.math.BigInteger.<init>(BigInteger.java:1114)
at java.math.BigInteger.valueOf(BigInteger.java:1098)
at scala.math.BigInt$.apply(BigInt.scala:49)
at scala.math.BigInt$.long2bigInt(BigInt.scala:101)
at org.json4s.Implicits$class.long2jvalue(JsonDSL.scala:45)
at org.json4s.JsonDSL$.long2jvalue(JsonDSL.scala:61)
Quick background:
My structured Spark streaming job ingests events received as new files (Parquet) into a Solr collection. So the sources are 8 different Hive tables (8 different HDFS locations) receiving events, and the sink is one Solr collection.
Configuration:
Number of executors: 30
Executor memory: 20 G
Driver memory: 20 G
Cores: 5
I generated an hprof dump file and loaded it into MAT to understand the cause (the dump screenshot is not reproduced here). This is a test environment and the data stream throughput (transactions per minute) is very low, sometimes with no transactions at all.
Any clue on what is causing this? Unfortunately, I'm unable to share the code snippet. Sorry about that.

How to optimize Hadoop MapReduce compressing Spark output in Google Dataproc?

The goal: Millions of rows in Cassandra need to be extracted and compressed into a single file as quickly and efficiently as possible (on a daily basis).
The current setup uses a Google Dataproc cluster to run a Spark job that extracts the data directly into a Google Cloud Storage bucket. I've tried two approaches:
Using the (now deprecated) FileUtil.copyMerge() to combine the roughly 9000 Spark partition files into a single uncompressed file, then submitting a Hadoop MapReduce job to compress that single file.
Leaving the roughly 9000 Spark partition files as the raw output, and submitting a Hadoop MapReduce job to merge and compress those files into a single file.
Some job details:
About 800 Million rows.
About 9000 Spark partition files outputted by the Spark job.
Spark job takes about an hour to complete running on a 1 Master, 4 Worker (4vCPU, 15GB each) Dataproc cluster.
Default Dataproc Hadoop block size, which is, I think 128MB.
Some Spark configuration details:
spark.task.maxFailures=10
spark.executor.cores=4
spark.cassandra.input.consistency.level=LOCAL_ONE
spark.cassandra.input.reads_per_sec=100
spark.cassandra.input.fetch.size_in_rows=1000
spark.cassandra.input.split.size_in_mb=64
The Hadoop job:
hadoop jar file://usr/lib/hadoop-mapreduce/hadoop-streaming-2.8.4.jar
-Dmapred.reduce.tasks=1
-Dmapred.output.compress=true
-Dmapred.compress.map.output=true
-Dstream.map.output.field.separator=,
-Dmapred.textoutputformat.separator=,
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
-input gs://bucket/with/either/single/uncompressed/csv/or/many/spark/partition/file/csvs
-output gs://output/bucket
-mapper /bin/cat
-reducer /bin/cat
-inputformat org.apache.hadoop.mapred.TextInputFormat
-outputformat org.apache.hadoop.mapred.TextOutputFormat
The Spark job took about 1 hour to extract the Cassandra data to the GCS bucket. Using FileUtil.copyMerge() added about 45 minutes to that; it was performed by the Dataproc cluster but underutilized resources, as it only seems to use one node. The Hadoop job to compress that single file took an additional 50 minutes. This is not an optimal approach, as the cluster has to stay up longer even though it is not using its full resources.
The info output from that job:
INFO mapreduce.Job: Counters: 55
File System Counters
FILE: Number of bytes read=5072098452
FILE: Number of bytes written=7896333915
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
GS: Number of bytes read=47132294405
GS: Number of bytes written=2641672054
GS: Number of read operations=0
GS: Number of large read operations=0
GS: Number of write operations=0
HDFS: Number of bytes read=57024
HDFS: Number of bytes written=0
HDFS: Number of read operations=352
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Killed map tasks=1
Launched map tasks=353
Launched reduce tasks=1
Rack-local map tasks=353
Total time spent by all maps in occupied slots (ms)=18495825
Total time spent by all reduces in occupied slots (ms)=7412208
Total time spent by all map tasks (ms)=6165275
Total time spent by all reduce tasks (ms)=2470736
Total vcore-milliseconds taken by all map tasks=6165275
Total vcore-milliseconds taken by all reduce tasks=2470736
Total megabyte-milliseconds taken by all map tasks=18939724800
Total megabyte-milliseconds taken by all reduce tasks=7590100992
Map-Reduce Framework
Map input records=775533855
Map output records=775533855
Map output bytes=47130856709
Map output materialized bytes=2765069653
Input split bytes=57024
Combine input records=0
Combine output records=0
Reduce input groups=2539721
Reduce shuffle bytes=2765069653
Reduce input records=775533855
Reduce output records=775533855
Spilled Records=2204752220
Shuffled Maps =352
Failed Shuffles=0
Merged Map outputs=352
GC time elapsed (ms)=87201
CPU time spent (ms)=7599340
Physical memory (bytes) snapshot=204676702208
Virtual memory (bytes) snapshot=1552881852416
Total committed heap usage (bytes)=193017675776
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=47132294405
File Output Format Counters
Bytes Written=2641672054
I expected this to perform as well as or better than the other approach, but it performed much worse. The Spark job remained unchanged. Skipping the FileUtil.copyMerge() and jumping straight into the Hadoop MapReduce job... the map portion of the job was only at about 50% after an hour and a half. The job was cancelled at that point, as it was clear it was not going to be viable.
I have complete control over the Spark job and the Hadoop job. I know we could create a bigger cluster, but I'd rather do that only after making sure the job itself is optimized. Any help is appreciated. Thanks.
Can you provide some more details of your Spark job? Which Spark API are you using, RDD or DataFrame?
Why not perform the merge phase completely in Spark (with repartition().write()) and avoid chaining Spark and MR jobs?
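A minimal PySpark sketch of that suggestion (the output path is a placeholder and df stands for the DataFrame already extracted from Cassandra): write the compressed single file straight from Spark instead of chaining an MR job. Note that coalesce(1) funnels the final write through a single task, trading parallelism for a single output file.

# df is assumed to be the DataFrame produced by the Cassandra extraction step.
(df
 .coalesce(1)                      # one partition, hence one output file
 .write
 .mode('overwrite')
 .option('compression', 'gzip')    # gzip-compressed CSV, as in the MR job
 .csv('gs://output/bucket/daily_extract'))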

Too many open files Kafka Exception on running for long

I have Kafka producer code in Java that watches a directory for new files using the java.nio WatchService API and pushes any new file to a Kafka topic. A Spark Streaming consumer reads from the Kafka topic. I am getting the following error after the Kafka producer job has been running for a day. The producer pushes about 500 files every 2 minutes. My Kafka topic has 1 partition and a replication factor of 2. Can someone please help?
org.apache.kafka.common.KafkaException: Failed to construct kafka producer
at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:342)
at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:166)
at com.hp.hawkeye.HawkeyeKafkaProducer.Sender.createProducer(Sender.java:60)
at com.hp.hawkeye.HawkeyeKafkaProducer.Sender.<init>(Sender.java:38)
at com.hp.hawkeye.HawkeyeKafkaProducer.HawkeyeKafkaProducer.<init>(HawkeyeKafkaProducer.java:54)
at com.hp.hawkeye.HawkeyeKafkaProducer.myKafkaTestJob.main(myKafkaTestJob.java:81)
Caused by: org.apache.kafka.common.KafkaException: java.io.IOException: Too many open files
at org.apache.kafka.common.network.Selector.<init>(Selector.java:125)
at org.apache.kafka.common.network.Selector.<init>(Selector.java:147)
at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:306)
... 7 more
Caused by: java.io.IOException: Too many open files
at sun.nio.ch.EPollArrayWrapper.epollCreate(Native Method)
at sun.nio.ch.EPollArrayWrapper.<init>(EPollArrayWrapper.java:130)
at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:69)
at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:36)
at java.nio.channels.Selector.open(Selector.java:227)
at org.apache.kafka.common.network.Selector.<init>(Selector.java:123)
... 9 more
Check ulimit -aH.
Check with your admin and increase the open files limit, e.g.:
open files (-n) 655536
Otherwise, I suspect there might be file-descriptor leaks in your code; refer to:
http://mail-archives.apache.org/mod_mbox/spark-user/201504.mbox/%3CCAKWX9VVJZObU9omOVCfPaJ_bPAJWiHcxeE7RyeqxUHPWvfj7WA#mail.gmail.com%3E
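A common source of such a leak is constructing a new producer for every file without ever closing it; each producer instance holds sockets and an epoll selector. The sketch below only illustrates that reuse pattern, using the Python kafka-python client rather than the Java API from the question; the broker address, topic name, and helper function are placeholders.

from kafka import KafkaProducer  # kafka-python, for illustration only

# Create ONE producer for the lifetime of the directory-watching loop,
# instead of a new producer per detected file.
producer = KafkaProducer(bootstrap_servers='broker:9092')

def push_file(path, topic='my_topic'):     # hypothetical helper
    with open(path, 'rb') as f:
        producer.send(topic, f.read())

# ... invoke push_file() for each new file reported by the watcher ...

producer.flush()
producer.close()   # release the sockets/selector when the job shuts down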

Spark partition by files

I have several thousand compressed CSV files in an S3 bucket, each approximately 30 MB in size (around 120-160 MB after decompression), which I want to process using Spark.
In my spark job, I am doing simple filter select queries on each row.
While partitioning, Spark is dividing the files into two or more parts and then creating tasks for each partition. Each task takes around 1 minute to complete just to process 125K records. I want to avoid this partitioning of a single file across many tasks.
Is there a way to fetch files and partition the data such that each task works on one complete file, that is, number of tasks = number of input files?
As well as playing with Spark options, you can configure the s3a filesystem client to tell Spark that the "block size" of a file in S3 is 128 MB. The default is 32 MB, which is close enough to your "approximately 30 MB" number that Spark could be splitting the files in two:
spark.hadoop.fs.s3a.block.size 134217728
Using the wholeTextFiles() operation is safer, though.
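A minimal PySpark sketch of both suggestions (the bucket and paths are placeholders): raise the s3a block size so a ~30 MB object is no longer split into two tasks, or use wholeTextFiles(), which reads each file as a single unsplit (path, content) record.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config('spark.hadoop.fs.s3a.block.size', '134217728')   # 128 MB
         .getOrCreate())

# Option 1: a normal read; with the larger reported block size each ~30 MB
# object maps to a single input split, hence a single task.
df = spark.read.csv('s3a://my-bucket/data/')

# Option 2: each file becomes one (path, content) record and is never split;
# note that several small files may still share a partition unless you pass
# a larger minPartitions value.
rdd = spark.sparkContext.wholeTextFiles('s3a://my-bucket/data/')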
