I want to load a properties config file when I submit a Spark job, so I can load the proper config for different environments, such as a test environment or a production environment. But I don't know where to put the properties file. Here is the code that loads it:
import java.io.FileInputStream
import java.util.Properties

import scala.util.{Failure, Success, Try}

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object HbaseRDD {

  val QUORUM_DEFAULT = "172.16.1.10,172.16.1.11,172.16.1.12"
  val TIMEOUT_DEFAULT = "120000"

  val config = Try {
    val prop = new Properties()
    prop.load(new FileInputStream("hbase.properties"))
    (
      prop.getProperty("hbase.zookeeper.quorum", QUORUM_DEFAULT),
      prop.getProperty("timeout", TIMEOUT_DEFAULT)
    )
  }

  def getHbaseRDD(tableName: String, appName: String = "test", master: String = "spark://node0:7077") = {
    val sparkConf = new SparkConf().setAppName(appName).setMaster(master)
    val sc = new SparkContext(sparkConf)
    val conf = HBaseConfiguration.create()
    config match {
      case Success((quorum, timeout)) =>
        conf.set("hbase.zookeeper.quorum", quorum)
        conf.set("timeout", timeout)
      case Failure(ex) =>
        ex.printStackTrace()
        conf.set("hbase.zookeeper.quorum", QUORUM_DEFAULT)
        conf.set("timeout", TIMEOUT_DEFAULT)
    }
    conf.set(TableInputFormat.INPUT_TABLE, tableName)
    val hbaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
    hbaseRDD
  }
}
The question is: where do I put the hbase.properties file so that Spark can find and load it? Or how do I specify it via spark-submit?
Please follow this example (Spark 1.5) configuration:
The file can be placed in the working directory from which you submit the Spark job (which is what we used); passing it with --files, as in the command below, ships it to the cluster, and a short loading sketch follows the command.
Another approach is to keep it on HDFS and read it from there.
Check the Runtime Environment configuration options; these change from one version to another, so consult the runtime configuration documentation for the version you are using.
spark-submit --verbose --class <your driver class > \
--master yarn-client \
--num-executors 12 \
--driver-memory 1G \
--executor-memory 2G \
--executor-cores 4 \
--conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+UseSerialGC -XX:+UseCompressedOops -XX:+UseCompressedStrings -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:PermSize=256M -XX:MaxPermSize=512M" \
--conf "spark.driver.extraJavaOptions=-XX:PermSize=256M -XX:MaxPermSize=512M" \
--conf "spark.shuffle.memoryFraction=0.5" \
--conf "spark.worker.cleanup.enabled=true" \
--conf "spark.worker.cleanup.interval=3600" \
--conf "spark.shuffle.io.numConnectionsPerPeer=5" \
--conf "spark.eventlog.enabled=true" \
--conf "spark.driver.extraLibrayPath=$HADOOP_HOME/*:$HBASE_HOME/*:$HADOOP_HOME/lib/*:$HBASE_HOME/lib/htrace-core-3.1.0-incubating.jar:$HDFS_PATH/*:$SOLR_HOME/*:$SOLR_HOME/lib/*" \
--conf "spark.executor.extraLibraryPath=$HADOOP_HOME/*:$folder/*:$HADOOP_HOME/lib/*:$HBASE_HOME/lib/htrace-core-3.1.0-incubating.jar:$HDFS_PATH/*:$SOLR_HOME/*:$SOLR_HOME/lib/*" \
--conf "spark.executor.extraClassPath=$OTHER_JARS:hbase.Properties" \
--conf "spark.yarn.executor.memoryOverhead=2048" \
--conf "spark.yarn.driver.memoryOverhead=1024" \
--conf "spark.eventLog.overwrite=true" \
--conf "spark.shuffle.consolidateFiles=true" \
--conf "spark.akka.frameSize=1024" \
--properties-file yourconfig.conf \
--files hbase.properties \
--jars $your_JARS \
<your-application-jar>
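For reference, here is a minimal sketch (not from the original post; loadProps is a hypothetical helper) of how the properties file can be resolved inside the job, whether it was shipped with --files or kept on HDFS and passed as an hdfs:// path:
import java.util.Properties
import org.apache.hadoop.fs.Path
import org.apache.spark.{SparkContext, SparkFiles}
import scala.util.Try

// Hypothetical helper: the file is assumed to be either shipped with --files
// (then available in the container's working directory / via SparkFiles.get)
// or stored on HDFS and passed as an hdfs:// path.
def loadProps(sc: SparkContext, location: String = "hbase.properties"): Try[Properties] = Try {
  val prop = new Properties()
  val in =
    if (location.startsWith("hdfs://")) {
      val path = new Path(location)
      path.getFileSystem(sc.hadoopConfiguration).open(path)   // read directly from HDFS
    } else {
      val local = new java.io.File(location)
      // --files localizes the file into the YARN container's working directory;
      // SparkFiles.get covers the case where it was added via SparkContext.addFile.
      val resolved = if (local.exists()) local.getAbsolutePath else SparkFiles.get(location)
      new java.io.FileInputStream(resolved)
    }
  try prop.load(in) finally in.close()
  prop
}
With --files hbase.properties on the command above, the question's existing new FileInputStream("hbase.properties") already works in yarn-cluster mode, because the file is localized into the driver's working directory.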
Also, have a look at
http://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management
How to load java properties file and use in Spark?
How to pass -D parameter or environment variable to Spark job?
spark-configuration-mess-solved
Related
I am generating around 30 window functions and running them against a pretty large dataset (1.5 billion records), which is 14 days' worth of data. If I run it against 1 day, so roughly 100 million records, it takes 27 hours to compute, which makes me think there must be something wrong. As a comparison, the join I do on the same dataset takes 2 minutes.
from functools import partial, reduce

import pyspark.sql.functions as F
from pyspark.sql import Window

# Window Time = 30min
window_time = 1800

# TCP ports
ports = ['22', '25', '53', '80', '88', '123', '514', '443', '8080', '8443']

# Stats fields for window
stat_fields = ['source_bytes', 'destination_bytes', 'source_packets', 'destination_packets']

def add_port_column(r_df, port, window):
    '''
    Input:
        r_df: dataframe
        port: port
        window: pyspark window to be used
    Output: pyspark dataframe
    '''
    return r_df.withColumn('pkts_src_port_{}_30m'.format(port), F.when(F.col('source_port') == port, F.sum('source_packets').over(window)).otherwise(0))\
        .withColumn('pkts_dst_port_{}_30m'.format(port), F.when(F.col('destination_port') == port, F.sum('destination_packets').over(window)).otherwise(0))

def add_stats_column(r_df, field, window):
    '''
    Input:
        r_df: dataframe
        field: field to generate stats with
        window: pyspark window to be used
    Output: pyspark dataframe
    '''
    r_df = r_df \
        .withColumn('{}_sum_30m'.format(field), F.sum(field).over(window))\
        .withColumn('{}_avg_30m'.format(field), F.avg(field).over(window))\
        .withColumn('{}_std_30m'.format(field), F.stddev(field).over(window))\
        .withColumn('{}_min_30m'.format(field), F.min(field).over(window))\
        .withColumn('{}_max_30m'.format(field), F.max(field).over(window))\
        .withColumn('{}_q25_30m'.format(field), F.expr("percentile_approx('{}', 0.25)".format(field)).over(window))\
        .withColumn('{}_q75_30m'.format(field), F.expr("percentile_approx('{}', 0.75)".format(field)).over(window))
    return r_df

w_s = (Window()
       .partitionBy("ip")
       .orderBy(F.col("timestamp"))
       .rangeBetween(-window_time, 0))

flows_filtered_v3_df = (reduce(partial(add_port_column, window=w_s),
                               ports,
                               flows_filtered_v3_df))

flows_filtered_v3_df = (reduce(partial(add_stats_column, window=w_s),
                               stat_fields,
                               flows_filtered_v3_df))
Doing an aggregated count on ip (the partition key I chose), I get:
+------------+---------+
|          ip| count(1)|
+------------+---------+
|       xxxx1|267639084|
|       xxxx2| 82596506|
|       xxxx3| 77049896|
|       xxxx4| 73114994|
+------------+---------+
I wonder how I could speed things up here, or what I am doing wrong that it takes such a long time to compute.
EDIT:
Adding a few stats:
6 Spark nodes, 1 TB memory / 252 cores in total
Spark version: 2.4.0-cdh6.3.1
Options specified:
org.apache.spark.deploy.SparkSubmit --conf spark.executor.memory=8g --conf spark.driver.memory=8g --conf spark.local.dir=/pkgs/cdh/tmp/spark --conf spark.yarn.security.tokens.hive.enabled=false --conf spark.yarn.security.credentials.hadoopfs.enabled=false --conf spark.security.credentials.hive.enabled=false --conf spark.app.name=DSS (Py): compute_flows_window_pyspark_2020-04-14 --conf spark.io.compression.codec=snappy --conf spark.sql.shuffle.partitions=40 --conf spark.shuffle.spill.compress=false --conf spark.shuffle.compress=false --conf spark.dku.limitedLogs={"filePartitioner.noMatch":100,"s3.ignoredPath":100,"s3.ignoredFile":100} --conf spark.security.credentials.hadoopfs.enabled=false --conf spark.jars.repositories=https://nexus.bisinfo.org:8443/repository/maven-central --conf spark.yarn.executor.memoryOverhead=600
I am running a Glue ETL transformation job. This job is supposed to read data from S3 and convert it to Parquet.
Below is the Glue source; sourcePath is the location of the S3 data.
In this location we have around 100 million JSON files, all of them nested in sub-folders.
That is the reason I am applying an exclusionPattern to exclude all files that do not start with 'a' (the files starting with 'a' are around 2.7 million), and I believe that only the files starting with 'a' will be processed.
val file_paths = Array(sourcePath)
val exclusionPattern = "\"" + sourcePath + "{[!a]}**" + "\""
glueContext
.getSourceWithFormat(connectionType = "s3",
options = JsonOptions(Map(
"paths" -> file_paths, "recurse" -> true, "groupFiles" -> "inPartition", "exclusions" -> s"[$exclusionPattern]"
)),
format = "json",
transformationContext = "sourceDF"
)
.getDynamicFrame()
.map(transformRow, "error in row")
.toDF()
After running this job with the Standard worker type, and with the G2 worker type as well, I keep getting this error:
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 27788"...
And in CloudWatch I can see that driver memory utilisation reaches 100%, while executor memory usage is almost nil.
When running the job I am setting spark.driver.memory=10g and spark.driver.memoryOverhead=4096 via the --conf job parameter.
These are the details from the logs:
--conf spark.hadoop.yarn.resourcemanager.connect.max-wait.ms=60000
--conf spark.hadoop.fs.defaultFS=hdfs://ip-myip.compute.internal:1111
--conf spark.hadoop.yarn.resourcemanager.address=ip-myip.compute.internal:1111
--conf spark.dynamicAllocation.enabled=true
--conf spark.shuffle.service.enabled=true
--conf spark.dynamicAllocation.minExecutors=1
--conf spark.dynamicAllocation.maxExecutors=4
--conf spark.executor.memory=20g
--conf spark.executor.cores=16
--conf spark.driver.memory=20g
--conf spark.default.parallelism=80
--conf spark.sql.shuffle.partitions=80
--conf spark.network.timeout=600
--job-bookmark-option job-bookmark-disable
--TempDir s3://my-location/admin
--class com.example.ETLJob
--enable-spark-ui true
--enable-metrics
--JOB_ID j_111...
--spark-event-logs-path s3://spark-ui
--conf spark.driver.memory=20g
--JOB_RUN_ID jr_111...
--conf spark.driver.memoryOverhead=4096
--scriptLocation s3://my-location/admin/Job/ETL
--SOURCE_DATA_LOCATION s3://xyz/
--job-language scala
--DESTINATION_DATA_LOCATION s3://xyz123/
--JOB_NAME ETL
Any ideas what the issue could be?
Thanks
If you have too many files, you are probably overwhelming the driver. Try using useS3ListImplementation. This is an implementation of the Amazon S3 ListKeys operation, which splits large result sets into multiple responses.
Try adding:
"useS3ListImplementation" -> true
[1] https://aws.amazon.com/premiumsupport/knowledge-center/glue-oom-java-heap-space-error/
As suggested by #eman..., I applied all three of groupFiles, groupSize and useS3ListImplementation, as below:
options = JsonOptions(Map(
"path" -> sourcePath,
"recurse" -> true,
"groupFiles" -> "inPartition",
"groupSize" -> 104857600,//100 mb
"useS3ListImplementation" -> true
))
And that is working for me. There is also an "acrossPartitions" option if the data is not arranged properly.
I'm consuming from Kafka and writing to Parquet in EMRFS. The code below works in spark-shell:
val filesink_query = outputdf.writeStream
.partitionBy(<some column>)
.format("parquet")
.option("path", <some path in EMRFS>)
.option("checkpointLocation", "/tmp/ingestcheckpoint")
.trigger(Trigger.ProcessingTime(10.seconds))
.outputMode(OutputMode.Append)
.start
SBT packages the code without errors. When the .jar is sent to spark-submit, the job is accepted and stays in the RUNNING state forever without writing any data to HDFS.
There is no ERROR in the .inprogress log.
Some posts suggest that a large watermark duration can cause it, but I have not set a custom watermark duration.
I can write to Parquet using PySpark; I'll share my code in case it is useful:
stream = self.spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", self.kafka_bootstrap_servers) \
.option("subscribe", self.topic) \
.option("startingOffsets", self.startingOffsets) \
.option("max.poll.records", self.max_poll_records) \
.option("auto.commit.interval.ms", self.auto_commit_interval_ms) \
.option("session.timeout.ms", self.session_timeout_ms) \
.option("key.deserializer", self.key_deserializer) \
.option("value.deserializer", self.value_deserializer) \
.load()
self.query = stream \
.select(col("value")) \
.select((self.proto_function("value")).alias("value_udf")) \
.select(*columns,
date_format(column_time, "yyyy").alias("date").alias("year"),
date_format(column_time, "MM").alias("date").alias("month"),
date_format(column_time, "dd").alias("date").alias("day"),
date_format(column_time, "HH").alias("date").alias("hour"))
query = self.query \
.writeStream \
.format("parquet") \
.option("checkpointLocation", self.path) \
.partitionBy("year", "month", "day", "hour") \
.option("path", self.path) \
.start()
Also, you need to run the code this way:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 <code>
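Since the jar in the question is built with SBT, the same connector can alternatively be declared as a dependency (a sketch assuming Scala 2.11 and Spark 2.3.0, matching the --packages coordinates above); note that with a plain sbt package the dependency is not bundled into the jar, so --packages (or --jars) is still needed at submit time:
// build.sbt -- version numbers are assumptions taken from the --packages coordinates above
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.3.0"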
I am using the following code to write a stream to Elasticsearch from a Python (PySpark) application.
#Streaming code
query = df.writeStream \
.outputMode("append") \
.format("org.elasticsearch.spark.sql") \
.option("checkpointLocation", "/tmp/") \
.option("es.resource", "logs/raw") \
.option("es.nodes", "localhost") \
.start()
query.awaitTermination()
If I write the results to the console it works fine; also, if I write to ES in non-streaming mode, it works OK. This is the code I used to write to ES:
#Not streaming
df.write.format("org.elasticsearch.spark.sql") \
.mode('append') \
.option("es.resource", "log/raw") \
.option("es.nodes", "localhost").save("log/raw")
The thing is, I can't debug it: the code is running, but nothing is written to ES (in streaming mode).
Thanks
Eventually it did work for me; the problem was a connectivity issue (I needed a VPN):
query = df.writeStream \
.outputMode("append") \
.queryName("writing_to_es") \
.format("org.elasticsearch.spark.sql") \
.option("checkpointLocation", "/tmp/") \
.option("es.resource", "index/type") \
.option("es.nodes", "localhost") \
.start()
query.awaitTermination()
Code:
val stream = df
.writeStream
.option("checkpointLocation", checkPointDir)
.format("es")
.start("realtime/data")
SBT Dependency:
libraryDependencies += "org.elasticsearch" %% "elasticsearch-spark-20" % "6.2.4"
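When the job is submitted rather than run from spark-shell, the connector also has to be on the classpath at submit time, for example via --packages (the coordinates here are an assumption derived from the SBT line above, for Scala 2.11):
spark-submit --packages org.elasticsearch:elasticsearch-spark-20_2.11:6.2.4 <your code>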
I am running the following job in HDP.
export SPARK_MAJOR_VERSION=2
spark-submit --class com.spark.sparkexamples.Audit --master yarn --deploy-mode cluster \
--files /bigdata/datalake/app/config/metadata.csv \
BRNSAUDIT_v4.jar dl_raw.ACC /bigdatahdfs/landing/AUDIT/BW/2017/02/27/ACC_hash_total_and_count_20170227.dat TH 20170227
It is failing with the error:
Table or view not found: dl_raw.ACC; line 1 pos 94; 'Aggregate [count(1) AS rec_cnt#58L, 'count('BRCH_NUM) AS hashcount#59, 'sum('ACC_NUM) AS hashsum#60] +- 'Filter (('trim('country_code) = trim(TH)) && ('from_unixtime('unix_timestamp('substr('bus_date, 0, 11), MM/dd/yyyy), yyyyMMdd) = 20170227)) +- 'UnresolvedRelation dl_raw.`ACC`
The table is present in Hive, and it is accessible from spark-shell.
This is the code for the Spark session:
val sparkSession = SparkSession.builder
  .appName("spark session example")
  .enableHiveSupport()
  .getOrCreate()
sparkSession.conf.set("spark.sql.crossJoin.enabled", "true")
val df_table_stats = sparkSession.sql("""select count(*) as rec_cnt,count(distinct BRCH_NUM) as hashcount, sum(ACC_NUM) as hashsum
from dl_raw.ACC
where trim(country_code) = trim('BW')
and from_unixtime(unix_timestamp(substr(bus_date,0,11),'MM/dd/yyyy'),'yyyyMMdd')='20170227'
""")
Include hive-site.xml in the --files parameter when you submit the Spark job.
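For example, with the command from the question that would look like this (the hive-site.xml path is taken from the default location used in the cp command below; --files accepts a comma-separated list):
spark-submit --class com.spark.sparkexamples.Audit --master yarn --deploy-mode cluster \
--files /bigdata/datalake/app/config/metadata.csv,/etc/hive/conf/hive-site.xml \
BRNSAUDIT_v4.jar dl_raw.ACC /bigdatahdfs/landing/AUDIT/BW/2017/02/27/ACC_hash_total_and_count_20170227.dat TH 20170227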
You can also copy the hive-site.xml configuration file from the Hive conf directory to the Spark conf directory; this should resolve your issue:
cp /etc/hive/conf/hive-site.xml /etc/spark2/conf