Spark Job Reading Parquet Data - apache-spark

I have a Spark job which reads Parquet data files.
Each Parquet file has a block size of 32 MB and 13 blocks.
I started the Spark shell with 2 executors and 10 cores each, which means 20 cores are available.
The job reads 10 Parquet files and executes a count operation.
I assumed that since there are 13 blocks per file (10 * 13 = 130), 130 tasks should run in that stage, but I can only see 36 tasks being executed in that stage.
Also, while reading the 10 files I can only see 20 tasks doing the work. As there are 130 blocks, won't 130 tasks be spawned, with each task reading one block?
Is there anything wrong with my understanding?
The commands I am running are below.
Spark shell command:
spark-shell --master yarn-client --num-executors 2 --executor-cores 10 --executor-memory 420G --driver-memory 2g --conf spark.yarn.executor.memoryOverhead=4096
Scala code:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val parqfile = sqlContext.read.parquet("location.parquet")
parqfile.registerTempTable("temptable")
val p1 = sqlContext.sql("select count(*) from temptable")
val p2 = p1.show(100, false)
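To see how many tasks Spark will actually schedule for the scan, you can check the DataFrame's partition count directly. This is only a hedged sketch against the same API as above; note that in Spark 2.x input blocks are packed into read partitions up to spark.sql.files.maxPartitionBytes (128 MB by default), so several 32 MB Parquet blocks can end up in a single task.
// Each read partition becomes one task in the scan stage.
println("Scan partitions / tasks: " + parqfile.rdd.getNumPartitions)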

Related

How should one choose the number of executors and cores for a spark-submit job?

I have a Spark Structured Streaming job that does the following:
Streams files from an S3 folder, each containing JSON (many JSON lines... around 12 million)
Filters them to exclude a couple of million
Calls an external HTTP API with each JSON record (using concurrency)
Writes the response data to a Kafka topic
My source S3 folder can have 48 or more files, so I am using:
.option("maxFilesPerTrigger", 1)
My EMR cluster is: (1 Master + 2 Slave Nodes) (each is of type: m5.2xlarge)
Each equipped with 8 cores and 32GB of memory.
In my spark job, I want to know what these options should be?
spark-submit \
--master yarn \
--conf spark.dynamicAllocation.enabled=false \
--executor-memory ??g \
--driver-memory ??g \
--executor-cores ?? \
--num-executors ?? \
--queue default \
--deploy-mode cluster \
....
I want to distribute the load evenly, because I've been playing around with it and the transactions per second that I see on the HTTP endpoint go up and down, which I think is a direct result of my parameters. I also don't want to take over the WHOLE cluster. Any ideas?
The graph shows transactions per minute on the HTTP endpoint being called.
It depends on your time requirements, other jobs, and so on.
First, you could try with the full cluster.
1 master + 2 slaves = 3 nodes.
cores = 3 * 8 = 24
memory = 3 * 32 = 96 GB
The commonly recommended number of cores per executor is 5; we decrease it to 4 so that no cores are left over.
--executor-cores 4
number of executors = 24 / 4 = 6; reserving one slot for the driver/AM leaves 5.
--num-executors 5
executor-memory / driver-memory: 96 GB / 6 = 16 GB, minus ~10% overhead ≈ 14 GB
Final parameters:
spark-submit \
--master yarn \
--conf spark.dynamicAllocation.enabled=false \
--executor-memory 14g \
--driver-memory 14g \
--executor-cores 4 \
--num-executors 5 \
--queue default \
--deploy-mode cluster \
....
You can easily take a few GB from the driver and give them to the executors.
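If it helps to verify what YARN actually granted before tuning further, the running application can report its own executor and core counts. A hedged sketch using the Spark status tracker API (the app name is just a placeholder):
import org.apache.spark.sql.SparkSession

// Hedged sketch: print how many executors and task slots the application received,
// so the --num-executors / --executor-cores settings can be cross-checked.
val spark = SparkSession.builder().appName("resource-check").getOrCreate()
val sc = spark.sparkContext
// getExecutorInfos typically lists the driver alongside the executors.
println("Known executors (incl. driver): " + sc.statusTracker.getExecutorInfos.length)
println("Default parallelism (total task slots): " + sc.defaultParallelism)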

How are cores on SLURM allocated in relation to memory and thread allocation with Snakemake?

I am having trouble understanding how the threads and resources allocated to a Snakemake job translate into the number of cores allocated per job on my SLURM partition. I have set the --cores flag to 46 in the .sh that runs my Snakefile, yet 5 Snakemake jobs are running concurrently, with 16 cores provided to each of them. Does a rule-specific thread number supersede the --cores flag for Snakemake? I thought it was the maximum number of cores that all my jobs together had to work with.
Also, are cores allocated based on memory, and does that scale with the number of threads specified? For example, my jobs were allocated 10 GB of memory apiece, but one thread only. Each job was given two cores according to my SLURM outputs. When I specified 8 threads with 10 GB of memory, I was provided 16 cores instead. Does that have to do with the amount of memory I gave to my job, or is an additional core provided for each thread for memory purposes? Any help would be appreciated.
Here is one of the Snakemake job outputs:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 16
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 index_genome
1
[Tue Feb 2 10:53:59 2021]
rule index_genome:
input: /mypath/genome/genomex.fna
output: /mypath/genome/genomex.fna.ann
jobid: 0
wildcards: bwa_extension=.ann
threads: 8
resources: mem_mb=10000
Here is my bash command:
module load snakemake/5.6.0
snakemake -s snake_make_x --cluster-config cluster.yaml --default-resources --cores 48 --jobs 47 \
--cluster "sbatch -n {threads} -M {cluster.cluster} -A {cluster.account} -p {cluster.partition}" \
--latency-wait 10
When you use SLURM together with Snakemake, the --cores flag unfortunately does not mean cores anymore; it means jobs. So when you set --cores 48, you are actually telling Snakemake to run at most 48 parallel jobs.
Related question:
Behaviour of "--cores" when using Snakemake with the slurm profile

Why, after creating 5 partitions and 10 buckets, is the number of data files created at the backend so high?

Below is my code:
spark.range(1,10000).withColumn("hashId",col("id")%5).write.partitionBy("hashId").bucketBy(10,"id").saveAsTable("learning.test_table")
Spark Configuration:
./spark-shell --master yarn --num-executors 10 --executor-cores 3 --executor-memory 5
There are 5 partitions and inside each partition, there are 61 files:
hdfs dfs -ls /apps/hive/warehouse/learning.db/test_table/hashId=0 | wc -l
61
After creating this table, when I checked the backend, I found that it had created 305 files plus 1 _SUCCESS file.
Could someone please explain why it is creating 305 files?
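For context, a bucketed write can emit up to one file per (writing task, bucket) pair in every partition directory, so the total file count depends on how many tasks feed the write. A hedged sketch to check that task count for the same DataFrame (assuming a Spark 2.x session named spark):
import org.apache.spark.sql.functions.col

// Each writing task may produce up to one file per bucket in every hashId directory
// it touches, so the upper bound on files is tasks x buckets x partition values.
val df = spark.range(1, 10000).withColumn("hashId", col("id") % 5)
println("Writing tasks: " + df.rdd.getNumPartitions)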

How to avoid small file problem while writing to hdfs & s3 from spark-sql-streaming

I am using spark-sql-2.3.1v and Kafka with Java 8 in my project.
With:
--driver-memory 4g \
--driver-cores 2 \
--num-executors 120 \
--executor-cores 1 \
--executor-memory 768m \
On the consumer side, I am trying to write the files to HDFS.
I am using something like the code below:
dataSet.writeStream()
.format("parquet")
.option("path", parqetFileName)
.option("mergeSchema", true)
.outputMode("Append")
.partitionBy("company_id","date")
.option("checkpointLocation", checkPtLocation)
.trigger(Trigger.ProcessingTime("25 seconds"))
.start();
When I store into the HDFS folder it looks like the listing below, i.e. each file is around 1.5 KB, just a few KB.
$ hdfs dfs -du -h /transactions/company_id=24779/date=2014-06-24/
1.5 K /transactions/company_id=24779/date=2014-06-24/part-00026-1027fff9-5745-4250-961a-fd56508b7ea3.c000.snappy.parquet
1.5 K /transactions/company_id=24779/date=2014-06-24/part-00057-6604f6cc-5b8d-41f4-8fc0-14f6e13b4a37.c000.snappy.parquet
1.5 K /transactions/company_id=24779/date=2014-06-24/part-00098-754e6929-9c75-430f-b6bb-3457a216aae3.c000.snappy.parquet
1.5 K /transactions/company_id=24779/date=2014-06-24/part-00099-1d62cbd5-7409-4259-b4f3-d0f0e5a93da3.c000.snappy.parquet
1.5 K /transactions/company_id=24779/date=2014-06-24/part-00109-1965b052-c7a6-47a8-ae15-dea301010cf5.c000.snappy.parquet
Due to these small files it takes a lot of processing time, and when I read back a larger set of data from HDFS and count the number of rows, it results in the heap space error below.
2020-02-12 07:07:57,475 [Driver] ERROR org.apache.spark.deploy.yarn.ApplicationMaster - User class threw exception: java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3664)
at java.lang.String.<init>(String.java:207)
at java.lang.String.substring(String.java:1969)
at java.net.URI$Parser.substring(URI.java:2869)
at java.net.URI$Parser.parse(URI.java:3049)
at java.net.URI.<init>(URI.java:588)
at org.apache.spark.sql.execution.streaming.SinkFileStatus.toFileStatus(FileStreamSinkLog.scala:52)
at org.apache.spark.sql.execution.streaming.MetadataLogFileIndex$$anonfun$2.apply(MetadataLogFileIndex.scala:46)
at org.apache.spark.sql.execution.streaming.MetadataLogFileIndex$$anonfun$2.apply(MetadataLogFileIndex.scala:46)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at org.apache.spark.sql.execution.streaming.MetadataLogFileIndex.<init>(MetadataLogFileIndex.scala:46)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:336)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
at com.spgmi.ca.prescore.utils.DbUtils.loadFromHdfs(DbUtils.java:129)
at com.spgmi.ca.prescore.spark.CountRecords.main(CountRecords.java:84)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:684)
2020-02-12 07:07:57,533 [Reporter] WARN org.apache.spark.deploy.yarn.ApplicationMaster - Reporter thread fails 1 time(s) in a row.
java.io.IOException: Failed on local exception: java.nio.channels.ClosedByInterruptException; Host Details : local host is: dev1-dev.com":8030;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:805)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1497)
at org.apache.hadoop.ipc.Client.call(Client.java:1439)
at org.apache.hadoop.ipc.Client.call(Client.java:1349)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy22.allocate(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
at sun.reflect.GeneratedMethodAccessor32.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy23.allocate(Unknown Source)
at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:296)
at org.apache.spark.deploy.yarn.YarnAllocator.allocateResources(YarnAllocator.scala:249)
at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$allocationThreadImpl(ApplicationMaster.scala:540)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$1.run(ApplicationMaster.scala:606)
Caused by: java.nio.channels.ClosedByInterruptException
at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:753)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:687)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:790)
at org.apache.hadoop.ipc.Client$Connection.access$3500(Client.java:411)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1554)
Questions:
Are these small files going to result in the "small file problem" in Spark processing?
If so, how do I deal with this scenario?
If I want to count the total number of records in a given HDFS folder, how do I do it?
How do I know how much heap space is necessary to handle this kind of data?
After the new changes:
--driver-memory 16g \
--driver-cores 1 \
--num-executors 120 \
--executor-cores 1 \
--executor-memory 768m \
The run succeeded; the results are:
2020-02-12 20:28:56,188 [Driver] WARN com.spark.mypackage.CountRecords - NUMBER OF PARTITIONS AFTER HDFS READ : 77926
+--------------------+-----+
|SPARK_PARTITION_ID()|count|
+--------------------+-----+
| 24354| 94|
| 26425| 96|
| 32414| 64|
| 76143| 32|
| 16861| 32|
| 30903| 64|
| 40335| 32|
| 64121| 64|
| 69042| 32|
| 32539| 64|
| 34759| 32|
| 41575| 32|
| 1591| 64|
| 3050| 98|
| 51772| 32|
+--------------------+-----+
2020-02-12 20:50:32,301 [Driver] WARN com.spark.mypackage.CountRecords - RECORD COUNT: 3999708
Yes. Small files are not only a Spark problem; they cause unnecessary load on your NameNode. You should spend more time compacting and uploading larger files than worrying about OOM when processing small files (see the compaction sketch below). The fact that your files are less than 64 MB / 128 MB is a sign you're using Hadoop poorly.
Something like spark.read.parquet("hdfs://path").count() would read all the files in the path and then count the rows in the DataFrame.
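Spelled out a bit more, since the files in question are Parquet (a hedged sketch; the path is a placeholder):
// Count every row under an HDFS folder written by the streaming job.
val rows = spark.read.parquet("hdfs://namenode:8020/transactions").count()
println("RECORD COUNT: " + rows)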
There is no hard-set number. You need to enable JMX monitoring on your jobs and see what heap size they actually reach. Otherwise, arbitrarily double the memory you are currently giving the job until it stops hitting OOM. If you start approaching more than 8 GB, then you need to consider reading less data in each job by adding more parallelization.
FWIW, Kafka Connect can also be used to output partitioned HDFS/S3 paths.
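Following the compaction advice above, a common pattern is a periodic batch job that reads one partition's small files and rewrites them as a handful of larger ones. A hedged sketch (the output path and target file count are illustrative, not taken from the original job):
import org.apache.spark.sql.SparkSession

// Hedged sketch: compact one partition of small Parquet files into a few larger ones.
val spark = SparkSession.builder().appName("compact-transactions").getOrCreate()

val src = "/transactions/company_id=24779/date=2014-06-24"
val dst = "/transactions_compacted/company_id=24779/date=2014-06-24"   // placeholder

spark.read.parquet(src)
  .coalesce(4)                    // aim for a few files close to the HDFS block size
  .write
  .mode("overwrite")
  .parquet(dst)
// After validating the output, swap the compacted directory in for the original.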

Spark GraphX Out of memory error

I am running GraphX on Spark with an input file size of around 100 GB on AWS EMR.
My cluster configuration is as follows
Nodes - 10
Memory - 122GB each
HDD - 320GB each
No matter what I do, I get an out-of-memory error when I run the Spark job as:
spark-submit --deploy-mode cluster \
--class com.news.ncg.report.graph.NcgGraphx \
ncgaka-graphx-assembly-1.0.jar true s3://<bkt>/<folder>/run=2016-08-19-02-06-20/part* output
Error
AM Container for appattempt_1474446853388_0001_000001 exited with exitCode: -104
For more detailed output, check application tracking page:http://ip-172-27-111-41.ap-southeast-2.compute.internal:8088/cluster/app/application_1474446853388_0001Then, click on links to logs of each attempt.
Diagnostics: Container [pid=7902,containerID=container_1474446853388_0001_01_000001] is running beyond physical memory limits. Current usage: 1.4 GB of 1.4 GB physical memory used; 3.4 GB of 6.9 GB virtual memory used. Killing container.
Dump of the process-tree for container_1474446853388_0001_01_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 7907 7902 7902 7902 (java) 36828 2081 3522265088 359788 /usr/lib/jvm/java-openjdk/bin/java -server -Xmx1024m -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1474446853388_0001/container_1474446853388_0001_01_000001/tmp -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError=kill -9 %p -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1474446853388_0001/container_1474446853388_0001_01_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class com.news.ncg.report.graph.NcgGraphx --jar s3://discover-pixeltoucher/jar/ncgaka-graphx-assembly-1.0.jar --arg true --arg s3://discover-pixeltoucher/ncgus/run=2016-08-19-02-06-20/part* --arg s3://discover-pixeltoucher/output/20160819/ --properties-file /mnt/yarn/usercache/hadoop/appcache/application_1474446853388_0001/container_1474446853388_0001_01_000001/__spark_conf__/__spark_conf__.properties
|- 7902 7900 7902 7902 (bash) 0 0 115810304 687 /bin/bash -c LD_LIBRARY_PATH=/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native /usr/lib/jvm/java-openjdk/bin/java -server -Xmx1024m -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1474446853388_0001/container_1474446853388_0001_01_000001/tmp '-XX:+UseConcMarkSweepGC' '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1474446853388_0001/container_1474446853388_0001_01_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class 'com.news.ncg.report.graph.NcgGraphx' --jar s3://discover-pixeltoucher/jar/ncgaka-graphx-assembly-1.0.jar --arg 'true' --arg 's3://discover-pixeltoucher/ncgus/run=2016-08-19-02-06-20/part*' --arg 's3://discover-pixeltoucher/output/20160819/' --properties-file /mnt/yarn/usercache/hadoop/appcache/application_1474446853388_0001/container_1474446853388_0001_01_000001/__spark_conf__/__spark_conf__.properties 1> /var/log/hadoop-yarn/containers/application_1474446853388_0001/container_1474446853388_0001_01_000001/stdout 2> /var/log/hadoop-yarn/containers/application_1474446853388_0001/container_1474446853388_0001_01_000001/stderr
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Failing this attempt
Any idea how I can stop getting this error?
I created the SparkSession as below:
val spark = SparkSession
.builder()
.master(mode)
.config("spark.hadoop.validateOutputSpecs", "false")
.config("spark.driver.cores", "1")
.config("spark.driver.memory", "30g")
.config("spark.executor.memory", "19g")
.config("spark.executor.cores", "5")
.config("spark.yarn.executor.memoryOverhead","2g")
.config("spark.yarn.driver.memoryOverhead ","1g")
.config("spark.shuffle.compress","true")
.config("spark.shuffle.service.enabled","true")
.config("spark.scheduler.mode","FAIR")
.config("spark.speculation","true")
.appName("NcgGraphX")
.getOrCreate()
It seems like you want to deploy your Spark application on YARN. If that is the case, you should not set application properties in code, but rather pass them via spark-submit:
$ ./bin/spark-submit --class com.news.ncg.report.graph.NcgGraphx \
--master yarn \
--deploy-mode cluster \
--driver-memory 30g \
--executor-memory 19g \
--executor-cores 5 \
<other options>
ncgaka-graphx-assembly-1.0.jar true s3://<bkt>/<folder>/run=2016-08-19-02-06-20/part* output
In client mode the JVM has already been set up by the time your application code runs, so I would personally use the CLI to pass those options.
After passing the memory options via spark-submit, change your code to pick up the configuration dynamically: SparkSession.builder().getOrCreate()
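In other words, the builder can be trimmed down so that resource sizing comes entirely from spark-submit. A hedged sketch that keeps only the non-resource options from the original builder:
import org.apache.spark.sql.SparkSession

// Hedged sketch: driver/executor memory, cores and overhead now come from spark-submit;
// only job-level behaviour stays in code.
val spark = SparkSession
  .builder()
  .config("spark.hadoop.validateOutputSpecs", "false")
  .config("spark.shuffle.compress", "true")
  .config("spark.shuffle.service.enabled", "true")
  .config("spark.scheduler.mode", "FAIR")
  .config("spark.speculation", "true")
  .appName("NcgGraphX")
  .getOrCreate()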
PS. You might also want to increase the memory for the AM via the spark.yarn.am.memory property.
