I am using Zeppelin to read Avro files that are several GB in size and contain billions of records. I have tried 2 instances and 7 instances on AWS EMR, but the performance is about the same; even with 7 instances it still takes a lot of time. The code is:
val snowball = spark.read.avro(snowBallUrl + folder + prefix + "*.avro")
val prod = spark.read.avro(prodUrl + folder + prefix + "*.avro")
snowball.persist()
prod.persist()
val snowballCount = snowball.count()
val prodCount = prod.count()
val u = snowball.union(prod)
Output:
snowballCount: Long = 13537690
prodCount: Long = 193885314
The available cluster resources can be seen in the attached screenshot.
spark.executor.cores is set to 1. If I try to change this number, Zeppelin stops working and the Spark context shuts down. It would be great if someone could give a hint on how to improve the performance.
Edit:
I checked how many partitions it created:
snowball.rdd.partitions.size   // res21: Int = 55
prod.rdd.partitions.size       // res22: Int = 737
u.rdd.partitions.size          // res23: Int = 792
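For reference, a minimal sketch (assuming spark is the active session and u is the unioned DataFrame above) of how the partition count can be compared with the parallelism the cluster actually offers:

// Illustrative only: defaultParallelism reflects the total task slots available.
val slots = spark.sparkContext.defaultParallelism
println(s"defaultParallelism = $slots")

// A common rule of thumb is 2-3 partitions per core; with spark.executor.cores = 1
// and a handful of executors, 792 partitions mostly add scheduling overhead.
// coalesce reduces the partition count without a full shuffle.
val balanced = u.coalesce(slots * 3)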
I have a job that loads some data from Hive, does some processing, and finishes by writing data to Cassandra. At some point it was working fine, but then all of a sudden one of the Spark operations hit a bottleneck where only 2 cores are used, even though the partition count is set to 2000 across the pipeline. I am running Spark version spark-core_2.11-2.0.0.
My Spark configuration is as follows:
spark.executor.instances = "50"
spark.executor.cores = "4"
spark.executor.memory = "6g"
spark.driver.memory = "8g"
spark.memory.offHeap.enabled = "true"
spark.memory.offHeap.size = "4g"
spark.yarn.executor.memoryOverhead = "6096"
hive.exec.dynamic.partition.mode = "nonstrict"
spark.sql.shuffle.partitions = "3000"
spark.unsafe.sorter.spill.reader.buffer.size = "1m"
spark.file.transferTo = "false"
spark.shuffle.file.buffer = "1m"
spark.shuffle.unsafe.file.output.buffer = "5m"
When I do a thread dump of the executor that is running I see:
com.*.MapToSalaryRow.buildSalaryRow(SalaryTransformer.java:110)
com.*.MapToSalaryRow.call(SalaryTransformer.java:126)
com.*.MapToSalaryRow.call(SalaryTransformer.java:88)
org.apache.spark.sql.KeyValueGroupedDataset$$anonfun$mapGroups$1.apply(KeyValueGroupedDataset.scala:220)
A simplified version of the code that is having the problem is:
sourceDs.createOrReplaceTempView("salary_ds");
sourceDs.repartition(2000);
System.out.println("sourceDs dataset partition count = " + sourceDs.rdd().getNumPartitions());

Dataset<Row> salaryDs = sourceDs
    .groupByKey(keyByUserIdFunction, Encoders.LONG())
    .mapGroups(new MapToSalaryRow(props), RowEncoder.apply(getSalarySchema()))
    .filter((FilterFunction<Row>) (row -> row != null));

salaryDs.persist(StorageLevel.MEMORY_ONLY_SER());
salaryDs.repartition(2000);
System.out.println("salaryDs dataset partition count = " + salaryDs.rdd().getNumPartitions());
Both of the above print statements show the partition count being 2000
The relevant code of the mapGroups function is:
class MapToSalaryRow implements MapGroupsFunction<Long, Row, Row> {
    private final Properties props;

    @Override
    public Row call(Long userId, Iterator<Row> iterator) throws Exception {
        return buildSalaryRow(userId, iterator, props);
    }
}
If anybody can point out where the problem might be, it would be highly appreciated.
Thanks
The problem was that one of the columns was an array type, and for a few rows this array was enormous. Even though the partitions had roughly the same number of rows, two of them were about 40 times larger in size, and the tasks that got those rows took significantly longer.
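A hedged sketch (Scala; salaryDs stands in for the Dataset produced by mapGroups above) of how per-partition byte sizes can be estimated to spot this kind of skew, since row counts alone would not reveal it:

import org.apache.spark.util.SizeEstimator

// Rough estimate of the in-memory size of each partition.
val bytesPerPartition = salaryDs.rdd
  .mapPartitionsWithIndex { (idx, rows) =>
    Iterator((idx, rows.map(r => SizeEstimator.estimate(r)).sum))
  }
  .collect()
  .sortBy { case (_, bytes) => -bytes }

// The heaviest partitions show up first; a 40x outlier indicates skewed rows.
bytesPerPartition.take(5).foreach { case (idx, bytes) =>
  println(s"partition $idx ~ ${bytes / (1024 * 1024)} MB")
}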
I have Spark job code as below, which works fine with the configuration below on the cluster.
String path = "/tmp/one.txt";
JavaRDD<SomeClass> jRDD = spark.read()
    .textFile(path)
    .javaRDD()
    .map(line -> new SomeClass(line));

Dataset<Row> responseSet = spark.createDataFrame(jRDD, SomeClass.class);
responseSet.write()
    .format("text")
    .save(path + "processed");
Whereas, if I read a binary file of the same size, it takes much more time.
String path = "/tmp/one.txt";
JavaRDD<SomeClass> jRDD = sparkContext
    .binaryRecords(path, 10000, new Configuration())
    .toJavaRDD()
    .map(line -> new SomeClass(line));

Dataset<Row> responseSet = spark.createDataFrame(jRDD, SomeClass.class);
responseSet.write()
    .format("text")
    .save(path + "processed");
Below is my configuration.
driver-memory 8g
executor-memory 6g
num-executors 16
Time taken by first code with 150 MB file is 1.30 mins.
Time taken by second code with 150 MB file is 4 mins.
Also, the first code ran on all 16 executors, whereas the second used only one.
Any suggestions why it is slow?
I found the issue. The textFile() method was creating 16 partitions (you can check the number of partitions using getNumPartitions() on the RDD), whereas binaryRecords() created only 1 (the Java binaryRecords() doesn't provide an overload that specifies the number of partitions to create).
I increased the number of partitions on the RDD created by binaryRecords() by calling repartition(NUM_OF_PARTITIONS) on it.
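A minimal sketch of that fix (written in Scala for brevity; SomeClass and the record length come from the question, and a constructor taking Array[Byte] is assumed):

// binaryRecords yields a single partition here, so spread the records out
// before the expensive map; 16 matches the number of executors above.
val records = spark.sparkContext
  .binaryRecords("/tmp/one.txt", 10000)   // RDD[Array[Byte]], typically 1 partition
  .repartition(16)
  .map(bytes => new SomeClass(bytes))     // assumes an Array[Byte] constructor

println(s"partitions after repartition = ${records.getNumPartitions}")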
We have a 300-node cluster, each node having 132 GB of memory and 20 cores. The task is to remove from table A the data that is also in table B, then merge B with A and push A to Teradata.
Below is the code:
val ofitemp = sqlContext.sql("select * from B")
val ofifinal = sqlContext.sql("select * from A")
val selectfromfinal = sqlContext.sql("select A.a,A.b,A.c...A.x from A where A.x=B.x")
val takefromfinal = ofifinal.except(selectfromfinal)
val tempfinal = takefromfinal.unionAll(ofitemp)
tempfinal.write.mode("overwrite").saveAsTable("C")

val tempTableFinal = sqlContext.table("C")
tempTableFinal.write.mode("overwrite").insertInto("A")
The configuration used to run Spark is:
EXECUTOR_MEM="16G"
HIVE_MAPPER_HEAP=2048 ## MB
NUMBER_OF_EXECUTORS="25"
DRIVER_MEM="5G"
EXECUTOR_CORES="3"
With A and B each having a few million records, the job takes several hours to run.
As I am very new to Spark, I don't understand whether it is a code issue or an environment/configuration issue.
I would be obliged if you could share your thoughts on overcoming the performance issues.
In your code, except could be a bottleneck because it compares all columns for equality. Is this really what you need? (I'm confused by the condition A.x = B.x in the line before, since B is not joined there.)
If you only need to check one attribute, the fastest way would be a "leftanti" join:
val takefromfinal = ofifinal.join(ofitemp, $"A.x" === $"B.x", "leftanti")
Besides that, study the Spark UI and identify the bottleneck.
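A hedged sketch of how the whole pipeline could look with the left-anti join (table and column names taken from the question; A and B are assumed to share the same schema):

val a = sqlContext.table("A")
val b = sqlContext.table("B")

// Keep only the rows of A whose key x has no match in B, then append all of B.
val cleanedA = a.join(b, a("x") === b("x"), "leftanti")
val merged   = cleanedA.unionAll(b)

merged.write.mode("overwrite").saveAsTable("C")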
I am using SparkR (Spark 2.0.0, YARN) on a cluster with the following configuration: 5 machines (24 cores + 200 GB RAM each). I wanted to run sparkR.session() with additional arguments to assign only a percentage of the total resources to my job:
if(Sys.getenv("SPARK_HOME") == "") Sys.setenv(SPARK_HOME = "/...")
library(SparkR, lib.loc = file.path(Sys.getenv('SPARK_HOME'), "R", "lib"))
sparkR.session(master = "spark://host:7077",
               appName = "SparkR",
               sparkHome = Sys.getenv("SPARK_HOME"),
               sparkConfig = list(spark.driver.memory = "2g",
                                  spark.executor.memory = "20g",
                                  spark.executor.cores = "4",
                                  spark.executor.instances = "10"),
               enableHiveSupport = TRUE)
The weird thing is that the parameters seem to be passed to the Spark context, but at the same time I end up with a number of x-core executors that use 100% of the resources (in this example, 5 * 24 = 120 cores available; 120 / 4 = 30 executors).
I tried creating another spark-defaults.conf with no default parameters assigned (so the only defaults are those in the Spark documentation, which should be easy to override):
if(Sys.getenv("SPARK_CONF_DIR") == "") Sys.setenv(SPARK_CONF_DIR = "/...")
Again, when I looked at the Spark UI on http://driver-node:4040, the total number of executors was not correct (the "Executors" tab), but at the same time all the config parameters in the "Environment" tab were exactly the same as those provided in my R script.
Does anyone know what might be the reason? Is the problem with the R API or some infrastructural, cluster-specific issue (like YARN settings)?
I found you have to use the spark.driver.extraJavaOptions, e.g.
spark <- sparkR.session(master = "yarn",
sparkConfig = list(
spark.driver.memory = "2g",
spark.driver.extraJavaOptions =
paste("-Dhive.metastore.uris=",
Sys.getenv("HIVE_METASTORE_URIS"),
" -Dspark.executor.instances=",
Sys.getenv("SPARK_EXECUTORS"),
" -Dspark.executor.cores=",
Sys.getenv("SPARK_CORES"),
sep = "")
))
Alternatively, you can change the spark-submit args, e.g.:
Sys.setenv("SPARKR_SUBMIT_ARGS"="--master yarn --driver-memory 10g sparkr-shell")
I'm following the Spark KafkaWordCount.scala example to process a Kafka stream.
To get an easy way to write the calculation logic, I'm also using Spark SQL.
The issue is that I found each SQL query takes 300-400 ms, even for an empty stream!
When I have a 2-second time window, this cost is too much.
The same logic written in plain Scala code takes only 10-12 ms.
The Spark-SQL version:
def processBySQL(persons: RDD[Person], sqc: SQLContext) = {
val ts = System.currentTimeMillis
import sqc.implicits._
val df = persons.toDF()
df.registerTempTable("tb_person")
sqc.cacheTable("tb_person")
sqc.sql("SELECT count(1), age FROM tb_person GROUP BY age").collect().foreach(println)
sqc.uncacheTable("tb_person")
println("[SQL]: " + (System.currentTimeMillis - ts) + " ms") //300 - 400ms
}
The Scala version:
def processByCode(persons: RDD[Person]) = {
val ts = System.currentTimeMillis
persons.groupBy(_.age)
.map(group => {
val (age, items) = group
val size = items.size
(items.last.name, items.size)
}).collect().foreach(println)
println("[CODE]: " + (System.currentTimeMillis - ts) + " ms") // 10 - 12ms
}
Full test code here: https://gist.github.com/nonlyli/e247a576b275cd7b3d88
Any idea for this issue?
Update 2015.09.21: Upgraded to Spark 1.5.0; the test results show no big difference.
The SQL and non-SQL versions are not doing the same thing. For instance, in the SQL version you're calling cacheTable and uncacheTable, which the plain Scala version does not do.
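For a fairer comparison, a sketch of the SQL path with the cacheTable/uncacheTable calls removed (it mirrors the processBySQL function above; Person, RDD and SQLContext come from the original test code):

def processBySQLNoCache(persons: RDD[Person], sqc: SQLContext): Unit = {
  val ts = System.currentTimeMillis
  import sqc.implicits._
  // Same aggregation, but without caching the temp table, so the timing
  // covers only the query itself.
  persons.toDF().registerTempTable("tb_person")
  sqc.sql("SELECT count(1), age FROM tb_person GROUP BY age")
    .collect()
    .foreach(println)
  println("[SQL, no cache]: " + (System.currentTimeMillis - ts) + " ms")
}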