I am running a Glue ETL transformation job. This job is supposed to read data from S3 and convert it to Parquet.
Below is the Glue source... sourcePath is the S3 location of the files.
In this location we have around 100 million JSON files, all of them nested in sub-folders.
That is the reason I am applying an exclusionPattern to exclude all files not starting with a, and I believe that only the files starting with a (which are around 2.7 million files) will be processed.
val file_paths = Array(sourcePath)
val exclusionPattern = "\"" + sourcePath + "{[!a]}**" + "\""

glueContext
  .getSourceWithFormat(
    connectionType = "s3",
    options = JsonOptions(Map(
      "paths" -> file_paths,
      "recurse" -> true,
      "groupFiles" -> "inPartition",
      "exclusions" -> s"[$exclusionPattern]"
    )),
    format = "json",
    transformationContext = "sourceDF"
  )
  .getDynamicFrame()
  .map(transformRow, "error in row")
  .toDF()
I have run this job with both the Standard and G.2X worker types, and I keep getting this error:
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 27788"...
In CloudWatch I can see that driver memory utilisation reaches 100%, but executor memory usage is almost nil.
When running the job I am setting spark.driver.memory=10g and spark.driver.memoryOverhead=4096 via the --conf job parameter.
These are the details from the logs:
--conf spark.hadoop.yarn.resourcemanager.connect.max-wait.ms=60000
--conf spark.hadoop.fs.defaultFS=hdfs://ip-myip.compute.internal:1111
--conf spark.hadoop.yarn.resourcemanager.address=ip-myip.compute.internal:1111
--conf spark.dynamicAllocation.enabled=true
--conf spark.shuffle.service.enabled=true
--conf spark.dynamicAllocation.minExecutors=1
--conf spark.dynamicAllocation.maxExecutors=4
--conf spark.executor.memory=20g
--conf spark.executor.cores=16
--conf spark.driver.memory=20g
--conf spark.default.parallelism=80
--conf spark.sql.shuffle.partitions=80
--conf spark.network.timeout=600
--job-bookmark-option job-bookmark-disable
--TempDir s3://my-location/admin
--class com.example.ETLJob
--enable-spark-ui true
--enable-metrics
--JOB_ID j_111...
--spark-event-logs-path s3://spark-ui
--conf spark.driver.memory=20g
--JOB_RUN_ID jr_111...
--conf spark.driver.memoryOverhead=4096
--scriptLocation s3://my-location/admin/Job/ETL
--SOURCE_DATA_LOCATION s3://xyz/
--job-language scala
--DESTINATION_DATA_LOCATION s3://xyz123/
--JOB_NAME ETL
Any ideas what could be the issue?
Thanks
If you have too many files, you are probably overwhelming the driver. Try using useS3ListImplementation. This is an implementation of the Amazon S3 ListKeys operation, which splits large result sets into multiple responses.
Try adding:
"useS3ListImplementation" -> true
[1] https://aws.amazon.com/premiumsupport/knowledge-center/glue-oom-java-heap-space-error/
As suggested by @eman..., I applied all three options: groupFiles, groupSize and useS3ListImplementation, like below:
options = JsonOptions(Map(
  "path" -> sourcePath,
  "recurse" -> true,
  "groupFiles" -> "inPartition",
  "groupSize" -> 104857600, // 100 MB
  "useS3ListImplementation" -> true
))
And that is working for me... there is also an "acrossPartitions" option if the data is not arranged properly.
Related
For the life of me I cannot figure out what is going on here.
I am starting a Glue job via Boto3 (from Lambda, but testing locally gives the exact same issue), and when I pass parameters in via the start_job_run API I get the same error, yet looking at the logs the parameters all look correct. Here is the output (I have changed some names of the buckets etc.).
Glue Code (sample):
import sys

import boto3
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext


def main():
    args = getResolvedOptions(sys.argv, [
        'JOB_NAME',
        's3_bucket',
        's3_temp_prefix',
        's3_schema_prefix',
        's3_processed_prefix',
        'ingestion_run_id'
    ])

    sc = SparkContext()
    glueContext = GlueContext(sc)
    logger = glueContext.get_logger()
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)

    s3_client = boto3.client('s3')

    s3_bucket = args['s3_bucket']
    temp_prefix = args['s3_temp_prefix']
    schema_prefix = args['s3_schema_prefix']
    processed_prefix = args['s3_processed_prefix']
    ingestion_run_id = args['ingestion_run_id']

    logger.info(f's3_bucket: {s3_bucket}')
    logger.info(f'temp_prefix: {temp_prefix}')
    logger.info(f'schema_prefix: {schema_prefix}')
    logger.info(f'processed_prefix: {processed_prefix}')
    logger.info(f'ingestion_run_id: {ingestion_run_id}')
SAM Template to make the Glue Job:
CreateDataset:
  Type: AWS::Glue::Job
  Properties:
    Command:
      Name: glueetl
      PythonVersion: 3
      ScriptLocation: !Sub "s3://bucket-name/GLUE/create_dataset.py"
    DefaultArguments:
      "--extra-py-files": "s3://bucket-name/GLUE/S3GetKeys.py"
      "--enable-continuous-cloudwatch-log": ""
      "--enable-metrics": ""
    GlueVersion: 2.0
    MaxRetries: 0
    Role: !GetAtt GlueRole.Arn
    Timeout: 360
    WorkerType: Standard
    NumberOfWorkers: 15
Code to attempt to start the Glue Job:
import boto3

region = 'eu-west-2'  # placeholder; set to the region the job lives in

session = boto3.session.Session(profile_name='glue_admin', region_name=region)
client = session.client('glue')

name = 'CreateDataset-1uPuNfIw1Tjd'

args = {
    "--s3_bucket": 'bucket-name',
    "--s3_temp_prefix": 'TEMP',
    "--s3_schema_prefix": 'SCHEMA',
    "--s3_processed_prefix": 'PROCESSED',
    "--ingestion_run_id": 'FakeRun'
}

client.start_job_run(JobName=name, Arguments=args)
This starts the job fine, but then the script errors, and this is the log left behind; from what I can see the parameters seem to be lined up correctly?
Wed Feb 10 09:16:00 UTC 2021/usr/bin/java -cp /opt/amazon/conf:/opt/amazon/lib/hadoop-lzo/*:/opt/amazon/lib/emrfs-lib/*:/opt/amazon/spark/jars/*:/opt/amazon/superjar/*:/opt/amazon/lib/*:/opt/amazon/Scala2.11/* com.amazonaws.services.glue.PrepareLaunch --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.minExecutors=1 --conf spark.dynamicAllocation.maxExecutors=29 --conf spark.executor.memory=5g --conf spark.executor.cores=4 --conf spark.driver.memory=5g --JOB_ID j_76c49a0d580594d5c0f584458cc0c9d519 --enable-metrics --extra-py-files s3://bucket-name/GLUE/S3GetKeys.py --JOB_RUN_ID jr_c0b9049abf1ee1161de189a901dd4be05694c1c42863 --s3_schema_prefix SCHEMA --enable-continuous-cloudwatch-log --s3_bucket bucket-name --scriptLocation s3://bucket-name/GLUE/create_dataset.py --s3_temp_prefix TEMP --ingestion_run_id FakeRun --s3_processed_prefix PROCESSED --JOB_NAME CreateDataset-1uPuNfIw1Tjd
Bucket name has been altered for this post but it matches exactly.
Fail point in Glue job log:
java.lang.IllegalArgumentException: For input string: "--s3_bucket"
The bucket name has no illegal characters, but it does have '-' in it?
Thanks in advance for help.
This happened because the --enable-continuous-cloudwatch-log argument expects a value. Since you didn't provide one, the argument parser assumed the next argument was its value (--enable-continuous-cloudwatch-log --s3_bucket). --s3_bucket is not a valid value for the --enable-continuous-cloudwatch-log option, hence the error.
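As a rough sketch of the fix (my illustration, not part of the original answer): give --enable-continuous-cloudwatch-log an explicit value, either in the SAM template's DefaultArguments or per run in the Arguments dict, so the parser does not swallow the next flag. For example:
import boto3

client = boto3.client('glue')  # assumes the same profile/region setup as in the question

args = {
    # an explicit value keeps the parser from consuming "--s3_bucket" as the value
    "--enable-continuous-cloudwatch-log": "true",
    "--s3_bucket": "bucket-name",
    "--s3_temp_prefix": "TEMP",
    "--s3_schema_prefix": "SCHEMA",
    "--s3_processed_prefix": "PROCESSED",
    "--ingestion_run_id": "FakeRun"
}

client.start_job_run(JobName="CreateDataset-1uPuNfIw1Tjd", Arguments=args)
The equivalent change in the SAM template would be setting "--enable-continuous-cloudwatch-log": "true" instead of the empty string.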
I am generating around 30 window functions and running them against a pretty large dataset (1.5 billion records), which is 14 days' worth of data. If I run it against one day, so roughly 100 million records, it takes 27 hours to compute, which makes me think there must be something wrong. As a comparison, the join I do on the same dataset takes 2 minutes.
from functools import partial, reduce

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Window Time = 30 min
window_time = 1800
# TCP ports
ports = ['22', '25', '53', '80', '88', '123', '514', '443', '8080', '8443']
# Stats fields for window
stat_fields = ['source_bytes', 'destination_bytes', 'source_packets', 'destination_packets']
def add_port_column(r_df, port, window):
    '''
    Input:
        r_df: dataframe
        port: port
        window: pyspark window to be used
    Output: pyspark dataframe
    '''
    return r_df.withColumn('pkts_src_port_{}_30m'.format(port),
                           F.when(F.col('source_port') == port, F.sum('source_packets').over(window)).otherwise(0))\
               .withColumn('pkts_dst_port_{}_30m'.format(port),
                           F.when(F.col('destination_port') == port, F.sum('destination_packets').over(window)).otherwise(0))
def add_stats_column(r_df, field, window):
    '''
    Input:
        r_df: dataframe
        field: field to generate stats with
        window: pyspark window to be used
    '''
    r_df = r_df \
        .withColumn('{}_sum_30m'.format(field), F.sum(field).over(window))\
        .withColumn('{}_avg_30m'.format(field), F.avg(field).over(window))\
        .withColumn('{}_std_30m'.format(field), F.stddev(field).over(window))\
        .withColumn('{}_min_30m'.format(field), F.min(field).over(window))\
        .withColumn('{}_max_30m'.format(field), F.max(field).over(window))\
        .withColumn('{}_q25_30m'.format(field), F.expr("percentile_approx('{}', 0.25)".format(field)).over(window))\
        .withColumn('{}_q75_30m'.format(field), F.expr("percentile_approx('{}', 0.75)".format(field)).over(window))
    return r_df
w_s = (Window()
       .partitionBy("ip")
       .orderBy(F.col("timestamp"))
       .rangeBetween(-window_time, 0))

flows_filtered_v3_df = (reduce(partial(add_port_column, window=w_s),
                               ports,
                               flows_filtered_v3_df))

flows_filtered_v3_df = (reduce(partial(add_stats_column, window=w_s),
                               stat_fields,
                               flows_filtered_v3_df))
Doing an aggregated count on ip (the partition key I chose), I get:
+------------+---------+
|          ip| count(1)|
+------------+---------+
|       xxxx1|267639084|
|       xxxx2| 82596506|
|       xxxx3| 77049896|
|       xxxx4| 73114994|
+------------+---------+
I wonder how I could speed up things here or what I am doing wrong that it takes such a long time to compute.
EDIT:
Adding a few stats
6 spark nodes - 1 TB Memory / 252 Cores in total
Spark version: 2.4.0-cdh6.3.1
Options specified
org.apache.spark.deploy.SparkSubmit --conf spark.executor.memory=8g --conf spark.driver.memory=8g --conf spark.local.dir=/pkgs/cdh/tmp/spark --conf spark.yarn.security.tokens.hive.enabled=false --conf spark.yarn.security.credentials.hadoopfs.enabled=false --conf spark.security.credentials.hive.enabled=false --conf spark.app.name=DSS (Py): compute_flows_window_pyspark_2020-04-14 --conf spark.io.compression.codec=snappy --conf spark.sql.shuffle.partitions=40 --conf spark.shuffle.spill.compress=false --conf spark.shuffle.compress=false --conf spark.dku.limitedLogs={"filePartitioner.noMatch":100,"s3.ignoredPath":100,"s3.ignoredFile":100} --conf spark.security.credentials.hadoopfs.enabled=false --conf spark.jars.repositories=https://nexus.bisinfo.org:8443/repository/maven-central --conf spark.yarn.executor.memoryOverhead=600
When using TensorFlow Java for inference, the amount of memory needed to make the job run on YARN is abnormally large. The job runs perfectly with Spark on my computer (2 cores, 16 GB of RAM) and takes 35 minutes to complete. But when I try to run it on YARN with 10 executors, each with 16 GB of memory and 16 GB of memoryOverhead, the executors are killed for using too much memory.
Prediction runs on a Hortonworks cluster with YARN 2.7.3 and Spark 2.2.1. Previously we used DL4J to do inference and everything ran in under 3 minutes.
Tensors are correctly closed after usage and we use mapPartitions to do prediction. Each task contains approximately 20,000 records (1 MB), so this makes an input tensor of 2,000,000 x 14 and an output tensor of 2,000,000 (5 MB).
Options passed to Spark when running on YARN:
--master yarn --deploy-mode cluster --driver-memory 16G --num-executors 10 --executor-memory 16G --executor-cores 2 --conf spark.driver.memoryOverhead=16G --conf spark.yarn.executor.memoryOverhead=16G --conf spark.sql.shuffle.partitions=200 --conf spark.tasks.cpu=2
This configuration may work if we set spark.sql.shuffle.partitions=2000, but then it takes 3 hours.
UPDATE:
The difference between local and cluster was in fact due to a missing filter; we were actually running the prediction on more data than we thought.
To reduce the memory footprint of each partition you should create batches inside each partition (use grouped(batchSize)). This is faster than running predict for each row, and you allocate tensors of a predetermined size (batchSize). If you investigate the code of the TensorFlowOnSpark Scala inference, this is what they do. Below you will find a reworked example of an implementation; this code may not compile, but you get the idea of how to do it.
lazy val sess = SavedModelBundle.load(modelPath, "serve").session
lazy val numberOfFeatures = 1
lazy val laggedFeatures = Seq("cost_day1", "cost_day2", "cost_day3")
lazy val numberOfOutputs = 1

val predictionsRDD = preprocessedData.rdd.mapPartitions { partition =>
  // Batch the rows of each partition so one tensor of a known size is fed per batch
  partition.grouped(batchSize).flatMap { batchPreprocessed =>
    val numberOfLines = batchPreprocessed.size
    val featuresShape: Array[Long] = Array(numberOfLines, laggedFeatures.size / numberOfFeatures, numberOfFeatures)

    // Fill a FloatBuffer with the features of the whole batch
    val featuresBuffer: FloatBuffer = FloatBuffer.allocate(numberOfLines)
    for (
      featuresWithKey <- batchPreprocessed;
      feature <- featuresWithKey.features
    ) {
      featuresBuffer.put(feature)
    }
    featuresBuffer.flip()

    val featuresTensor = Tensor.create(featuresShape, featuresBuffer)

    // Run the session once per batch and copy the predictions out
    val results: Tensor[_] = sess.runner
      .feed("cost", featuresTensor)
      .fetch("prediction")
      .run.get(0)
    val output = Array.ofDim[Float](results.numElements(), numberOfOutputs)
    val outputArray: Array[Array[Float]] = results.copyTo(output)

    // Close the tensors as soon as the batch is done
    results.close()
    featuresTensor.close()
    outputArray
  }
}

spark.createDataFrame(predictionsRDD)
We use FloatBuffer and Shape to create the Tensor, as recommended in this issue.
My Spark is installed with CDH 5.8.0 and runs its applications on YARN. There are 5 servers in the cluster: one server is the resource manager, and the other four servers are node managers. Each server has 2 cores and 8 GB of memory.
The Spark application's main logic is not complex: query a table from a Postgres DB, do some business logic for each record, and finally save the result back to the DB. Here is the main code:
String columnName = "id";
long lowerBound = 1;
long upperBound = 100000;
int numPartitions = 20;
String tableBasic = "select * from table1 order by id";

DataFrame dfBasic = sqlContext.read().jdbc(JDBC_URL, tableBasic, columnName,
        lowerBound, upperBound, numPartitions, dbProperties);

JavaRDD<Result> rddResult = dfBasic.javaRDD().flatMap(new FlatMapFunction<Row, Result>() {
    public Iterable<Result> call(Row row) {
        List<Result> list = new ArrayList<Result>();
        ........
        return list;
    }
});

DataFrame saveDF = sqlContext.createDataFrame(rddResult, Result.class);
saveDF = saveDF.select("id", "column 1", "column 2");
saveDF.write().mode(SaveMode.Append).jdbc(SQL_CONNECTION_URL, "table2", dbProperties);
I use this command to submit application to yarn:
spark-submit --master yarn-cluster --executor-memory 6G --executor-cores 2 --driver-memory 6G --conf spark.default.parallelism=90 --conf spark.storage.memoryFraction=0.4 --conf spark.shuffle.memoryFraction=0.4 --conf spark.executor.memory=3G --class com.Main1 jar1-0.0.1.jar
There are 7 executors and 20 partitions. When the number of table records is small, for example fewer than 200,000, the 20 active tasks are assigned evenly across the 7 executors, like this:
[Screenshot: tasks assigned evenly across executors]
But when the number of table records is huge, for example 1,000,000, the tasks are not assigned evenly. There is always one executor that runs for a long time while the others finish quickly, and some executors are not assigned any tasks at all. Like this:
[Screenshot: skewed task assignment across executors]
I want to load a properties config file when submitting a Spark job, so I can load the proper config for different environments, such as a test environment or a production environment. But I don't know where to put the properties file. Here is the code that loads the properties file:
import java.io.FileInputStream
import java.util.Properties

import scala.util.{Failure, Success, Try}

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object HbaseRDD {

  val QUORUM_DEFAULT = "172.16.1.10,172.16.1.11,172.16.1.12"
  val TIMEOUT_DEFAULT = "120000"

  val config = Try {
    val prop = new Properties()
    prop.load(new FileInputStream("hbase.properties"))
    (
      prop.getProperty("hbase.zookeeper.quorum", QUORUM_DEFAULT),
      prop.getProperty("timeout", TIMEOUT_DEFAULT)
    )
  }

  def getHbaseRDD(tableName: String, appName: String = "test", master: String = "spark://node0:7077") = {
    val sparkConf = new SparkConf().setAppName(appName).setMaster(master)
    val sc = new SparkContext(sparkConf)
    val conf = HBaseConfiguration.create()

    config match {
      case Success((quorum, timeout)) =>
        conf.set("hbase.zookeeper.quorum", quorum)
        conf.set("timeout", timeout)
      case Failure(ex) =>
        ex.printStackTrace()
        conf.set("hbase.zookeeper.quorum", QUORUM_DEFAULT)
        conf.set("timeout", TIMEOUT_DEFAULT)
    }

    conf.set(TableInputFormat.INPUT_TABLE, tableName)
    val hbaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])
    hbaseRDD
  }
}
The question is where to put the hbase.properties file so that Spark can find and load it, or how to specify it via spark-submit?
Please follow this example (Spark 1.5) configuration:
The properties file can be placed in the working directory from which you are submitting the Spark job (which is what we used).
Another approach is keeping it on HDFS.
Check the Runtime Environment configuration; these configuration options change from one version to another, so check the corresponding runtime config documentation.
spark-submit --verbose --class <your driver class > \
--master yarn-client \
--num-executors 12 \
--driver-memory 1G \
--executor-memory 2G \
--executor-cores 4 \
--conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+UseSerialGC -XX:+UseCompressedOops -XX:+UseCompressedStrings -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:PermSize=256M -XX:MaxPermSize=512M" \
--conf "spark.driver.extraJavaOptions=-XX:PermSize=256M -XX:MaxPermSize=512M" \
--conf "spark.shuffle.memoryFraction=0.5" \
--conf "spark.worker.cleanup.enabled=true" \
--conf "spark.worker.cleanup.interval=3600" \
--conf "spark.shuffle.io.numConnectionsPerPeer=5" \
--conf "spark.eventlog.enabled=true" \
--conf "spark.driver.extraLibrayPath=$HADOOP_HOME/*:$HBASE_HOME/*:$HADOOP_HOME/lib/*:$HBASE_HOME/lib/htrace-core-3.1.0-incubating.jar:$HDFS_PATH/*:$SOLR_HOME/*:$SOLR_HOME/lib/*" \
--conf "spark.executor.extraLibraryPath=$HADOOP_HOME/*:$folder/*:$HADOOP_HOME/lib/*:$HBASE_HOME/lib/htrace-core-3.1.0-incubating.jar:$HDFS_PATH/*:$SOLR_HOME/*:$SOLR_HOME/lib/*" \
--conf "spark.executor.extraClassPath=$OTHER_JARS:hbase.Properties" \
--conf "spark.yarn.executor.memoryOverhead=2048" \
--conf "spark.yarn.driver.memoryOverhead=1024" \
--conf "spark.eventLog.overwrite=true" \
--conf "spark.shuffle.consolidateFiles=true" \
--conf "spark.akka.frameSize=1024" \
--properties-file yourconfig.conf \
--files hbase.properties \
--jars $your_JARS\
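For reference, here is my own sketch (not part of the original answer) of reading the file after it has been shipped with --files: the local path of hbase.properties on the driver and executors can be resolved with SparkFiles, assuming the file keeps that name.
import java.io.FileInputStream
import java.util.Properties

import org.apache.spark.SparkFiles

// Minimal sketch: resolve the local path of a file distributed via --files and load it.
def loadHbaseProps(fileName: String = "hbase.properties"): Properties = {
  val prop = new Properties()
  val in = new FileInputStream(SparkFiles.get(fileName))
  try prop.load(in) finally in.close()
  prop
}
Note that SparkFiles.get only works once the SparkContext has been created, so the properties would have to be loaded after the context is up rather than at object initialisation.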
Also, have a look at
http://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management
How to load java properties file and use in Spark?
How to pass -D parameter or environment variable to Spark job?
spark-configuration-mess-solved