Spark - ignoring corrupted files - apache-spark

In the ETL process that we are managing, we sometimes receive corrupted files.
We tried this Spark configuration and it seems to work (the Spark job does not fail because the corrupted files are discarded):
spark.sqlContext.setConf("spark.sql.files.ignoreCorruptFiles", "true")
But I don't know if there is any way to find out which files were ignored. Is there any way to get those filenames?
Thanks in advance

One way is to look through your executor logs, if you have set the following configurations to true in your Spark configuration:
RDD: spark.files.ignoreCorruptFiles
DataFrame: spark.sql.files.ignoreCorruptFiles
With these enabled, Spark will log each corrupted file as a WARN message in your executor logs.
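Both flags can be set when the SparkSession is created, for example (a minimal sketch; adapt to however you build your session):
val spark = SparkSession.builder()
  .appName("ignore-corrupt-files-example")
  .config("spark.files.ignoreCorruptFiles", "true")      // RDD-based file reads
  .config("spark.sql.files.ignoreCorruptFiles", "true")  // DataFrame/Dataset file reads
  .getOrCreate()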
Here is the code snippet from Spark that emits that warning:
if (ignoreCorruptFiles) {
  currentIterator = new NextIterator[Object] {
    // The readFunction may read some bytes before consuming the iterator, e.g.,
    // vectorized Parquet reader. Here we use lazy val to delay the creation of
    // iterator so that we will throw exception in `getNext`.
    private lazy val internalIter = readCurrentFile()

    override def getNext(): AnyRef = {
      try {
        if (internalIter.hasNext) {
          internalIter.next()
        } else {
          finished = true
          null
        }
      } catch {
        // Throw FileNotFoundException even if `ignoreCorruptFiles` is true
        case e: FileNotFoundException => throw e
        case e @ (_: RuntimeException | _: IOException) =>
          logWarning(
            s"Skipped the rest of the content in the corrupted file: $currentFile", e)
          finished = true
          null
      }
    }

Did you solve it?
If not, maybe you can try the approach below:
Read everything from the location with that ignoreCorruptFiles setting enabled.
Get the file name each record belongs to using the input_file_name function, and take the distinct names.
Separately, get the list of all the objects in the respective directory.
Find the difference; a rough sketch of this is shown below.
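A minimal sketch of that approach, assuming Parquet files under an inputPath directory (inputPath and the file format are assumptions; adjust for your data):
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.functions.input_file_name
import spark.implicits._

spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

// Files Spark actually managed to read (corrupted ones are silently skipped)
val df = spark.read.parquet(inputPath)
val readFiles = df.select(input_file_name()).distinct().as[String].collect().toSet

// All files sitting in the directory, via the Hadoop FileSystem API
val fs = new Path(inputPath).getFileSystem(spark.sparkContext.hadoopConfiguration)
val allFiles = fs.listStatus(new Path(inputPath)).map(_.getPath.toString).toSet

// Roughly, the difference is the set of ignored files
// (watch out for scheme/prefix differences, e.g. s3a:// vs s3://, when comparing,
// and note that a partially readable corrupted file may still show up in readFiles)
val ignored = allFiles.diff(readFiles)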
Did you use a different approach?

Related

Create documents that do not exist, skip others

I'm working in a concurrent environment where an index being built by a Spark job may receive updates for the same document id from the job itself and from other sources. It is assumed that updates from other sources are fresher, and the Spark job needs to silently ignore documents that already exist while creating all other documents. This is very close to indexing with op_type: create, but the latter throws an exception that is not passed to my error handler. The following block of code:
.rdd
.repartition(getTasks(configurationManager))
.saveJsonToEs(
  s"$indexName/_doc",
  Map(
    "es.mapping.id" -> MenuItemDocument.ID_FIELD,
    "es.write.operation" -> "create",
    "es.write.rest.error.handler.bulkErrorHandler" ->
      "<some package>.IgnoreExistsBulkWriteErrorHandler",
    "es.write.rest.error.handlers" -> "bulkErrorHandler"
  )
)
where the error handler has survived several variations, but currently is:
class IgnoreExistsBulkWriteErrorHandler extends BulkWriteErrorHandler with LazyLogging {
  override def onError(entry: BulkWriteFailure, collector: DelayableErrorCollector[Array[Byte]]): HandlerResult = {
    logger.info("Encountered exception:", entry.getException)
    if (entry.getException.getMessage.contains("version_conflict_engine_exception")) {
      logger.info("Encountered document already present in index, skipping")
      HandlerResult.HANDLED
    } else {
      HandlerResult.ABORT
    }
  }
}
(I was obviously checking for org.elasticsearch.index.engine.VersionConflictEngineException in getException().getCause() first, but it didn't work)
emits this in log:
org.elasticsearch.hadoop.EsHadoopException: Could not write all entries for bulk operation [186/1000]. Error sample (first [5] error messages):
org.elasticsearch.hadoop.rest.EsHadoopRemoteException: version_conflict_engine_exception: [_doc][12]: version conflict, document already exists (current version [1])
(I assume that my error handler is not being called at all)
and terminates my whole Spark job. What is the correct way to achieve my desired result?

Spark Structured Streaming program that reads from non-empty Kafka topic (starting from earliest) triggers batches locally, but not on EMR cluster

We are running into a problem where -- for one of our applications --
we don't see any evidence of batches being processed in the Structured
Streaming tab of the Spark UI.
I have written a small program (below) to reproduce the issue.
A self-contained project that allows you to build the app, along with scripts that facilitate upload to AWS, and details on how to run and reproduce the issue can be found here: https://github.com/buildlackey/spark-struct-streaming-metrics-missing-on-aws (The github version of the app is a slightly evolved version of what is presented below, but it illustrates the problem of Spark streaming metrics not showing up.)
The program can be run 'locally' -- on someone's laptop in local[*] mode (say, with a dockerized Kafka instance),
or on an EMR cluster. For local mode operation you invoke the main method with 'localTest' as the first
argument.
In our case, when we run on the EMR cluster, pointing to a topic
where we know there are many data records (we read from 'earliest'), we
see that THERE ARE INDEED NO BATCHES PROCESSED -- on the cluster for some reason...
In the local[*] case we CAN see batches processed.
To capture evidence of this I wrote a foreachBatch handler that simply does a
toLocalIterator.asScala.toList.mkString("\n") on the Dataset of each batch, and then dumps the
resultant string to a file. Running locally, I see evidence of the
captured records in the temporary file. HOWEVER, when I run on
the cluster and ssh into one of the executors I see NO SUCH
files. I also checked the master node... no files matching the pattern 'Missing'.
So... batches are not triggering on the cluster. Our Kafka has plenty of data, and
when running on the cluster the logs show we are churning through messages at increasing offsets:
21/12/16 05:15:21 DEBUG KafkaDataConsumer: Get spark-kafka-source-blah topic.foo.event-18 nextOffset 4596542913 requested 4596542913
21/12/16 05:15:21 DEBUG KafkaDataConsumer: Get spark-kafka-source-blah topic.foo.event-18 nextOffset 4596542914 requested 4596542914
Note: to get the logs we are using
yarn logs --applicationId <appId>
which should get both driver and executor logs for the entire run (once the app terminates).
Now, in the local[*] case we CAN see batches processed. The evidence is that we see a file whose name
matches the pattern 'Missing' in our tmp folder.
I am including my simple demo program below. If you can spot the issue and clue us in, I'd be very grateful!
// Please forgive the busy code.. i stripped this down from a much larger system....
import com.typesafe.scalalogging.StrictLogging
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import org.apache.spark.sql.{Dataset, SparkSession}

import java.io.File
import java.util
import scala.collection.JavaConverters.asScalaIteratorConverter
import scala.concurrent.duration.Duration

object AwsSupportCaseFailsToYieldLogs extends StrictLogging {

  case class KafkaEvent(fooMsgKey: Array[Byte],
                        fooMsg: Array[Byte],
                        topic: String,
                        partition: String,
                        offset: String) extends Serializable

  case class SparkSessionConfig(appName: String, master: String) {
    def sessionBuilder(): SparkSession.Builder = {
      val builder = SparkSession.builder
      builder.master(master)
      builder
    }
  }

  case class KafkaConfig(kafkaBootstrapServers: String, kafkaTopic: String, kafkaStartingOffsets: String)

  def sessionFactory: (SparkSessionConfig) => SparkSession = {
    (sparkSessionConfig) => {
      sparkSessionConfig.sessionBuilder().getOrCreate()
    }
  }

  def main(args: Array[String]): Unit = {
    val (sparkSessionConfig, kafkaConfig) =
      if (args.length >= 1 && args(0) == "localTest") {
        getLocalTestConfiguration
      } else {
        getRunOnClusterConfiguration
      }

    val spark: SparkSession = sessionFactory(sparkSessionConfig)
    spark.sparkContext.setLogLevel("ERROR")

    import spark.implicits._

    val dataSetOfKafkaEvent: Dataset[KafkaEvent] = spark.readStream.
      format("kafka").
      option("subscribe", kafkaConfig.kafkaTopic).
      option("kafka.bootstrap.servers", kafkaConfig.kafkaBootstrapServers).
      option("startingOffsets", kafkaConfig.kafkaStartingOffsets).
      load.
      select(
        $"key" cast "binary",
        $"value" cast "binary",
        $"topic",
        $"partition" cast "string",
        $"offset" cast "string").map { row =>
        KafkaEvent(
          row.getAs[Array[Byte]](0),
          row.getAs[Array[Byte]](1),
          row.getAs[String](2),
          row.getAs[String](3),
          row.getAs[String](4))
      }

    val initDF = dataSetOfKafkaEvent.map { item: KafkaEvent => item.toString }

    val function: (Dataset[String], Long) => Unit =
      (dataSetOfString, batchId) => {
        val iter: util.Iterator[String] = dataSetOfString.toLocalIterator()
        val lines = iter.asScala.toList.mkString("\n")
        val outfile = writeStringToTmpFile(lines)
        println(s"writing to file: ${outfile.getAbsolutePath}")
        logger.error(s"writing to file: ${outfile.getAbsolutePath} / $lines")
      }

    val trigger = Trigger.ProcessingTime(Duration("1 second"))

    initDF.writeStream
      .foreachBatch(function)
      .trigger(trigger)
      .outputMode("append")
      .start
      .awaitTermination()
  }

  private def getLocalTestConfiguration: (SparkSessionConfig, KafkaConfig) = {
    val sparkSessionConfig: SparkSessionConfig =
      SparkSessionConfig(master = "local[*]", appName = "dummy2")
    val kafkaConfig: KafkaConfig =
      KafkaConfig(
        kafkaBootstrapServers = "localhost:9092",
        kafkaTopic = "test-topic",
        kafkaStartingOffsets = "earliest")
    (sparkSessionConfig, kafkaConfig)
  }

  private def getRunOnClusterConfiguration = {
    val sparkSessionConfig: SparkSessionConfig = SparkSessionConfig(master = "yarn", appName = "AwsSupportCase")
    val kafkaConfig: KafkaConfig =
      KafkaConfig(
        kafkaBootstrapServers = "kafka.foo.bar.broker:9092", // TODO - change this for kafka on your EMR cluster.
        kafkaTopic = "mongo.bongo.event", // TODO - change this for kafka on your EMR cluster.
        kafkaStartingOffsets = "earliest")
    (sparkSessionConfig, kafkaConfig)
  }

  def writeStringFile(string: String, file: File): File = {
    java.nio.file.Files.write(java.nio.file.Paths.get(file.getAbsolutePath), string.getBytes).toFile
  }

  def writeStringToTmpFile(string: String, deleteOnExit: Boolean = false): File = {
    val file: File = File.createTempFile("streamingConsoleMissing", "sad")
    if (deleteOnExit) {
      file.delete()
    }
    writeStringFile(string, file)
  }
}
I have encountered a similar issue; maxOffsetsPerTrigger should fix it. Actually, it's not really an issue.
All logs and metrics for a batch are only printed or shown after
that batch finishes. That's the reason why you can't see the job making
progress.
If maxOffsetsPerTrigger doesn't solve it, you could try consuming from the latest offset to confirm that the processing logic is correct.
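For example, the cap can be set as a Kafka source option so the first batch from 'earliest' stays small (the value below is an illustrative guess; tune it to your volume):
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaConfig.kafkaBootstrapServers)
  .option("subscribe", kafkaConfig.kafkaTopic)
  .option("startingOffsets", "earliest")
  .option("maxOffsetsPerTrigger", 100000) // limit records pulled per trigger, across all partitions
  .load()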
This is a provisional answer. One of our team members has a theory that looks pretty likely. Here it is: batches ARE getting processed (this is demonstrated better by the version of the program I linked to on GitHub), but we think that since there is so much backed up in the topic on our cluster, processing the first batch (from earliest) takes a very long time; hence, when looking at the cluster, we see zero batches processed even though there is clearly work being done. The solution might be to use maxOffsetsPerTrigger to gate the amount of incoming traffic (when starting from earliest and working with a topic that has huge volumes of data). We are working on confirming this.

Spark transformation (map) is not called even after calling the action (count)

I have a map function defined on the DataFrame, and when I invoke the action (count() in this case) I do not see the calls made inside the map function being invoked for each row.
Here is the code I have:
def copyFilesToArchive(recordDF: DataFrame, s3Util: S3Client): Unit = {
  if (s3Util != null) {
    // Copy all the Object to new Path
    logger.info(".copyFilesToArchive() : Before Copying the Files to Archive and no.of RDD Partitions ={}", recordDF.rdd.partitions.length);
    recordDF.rdd.map(row => {
      var key = row.getAs("object_key")
      var bucketName = row.getAs("bucket_name")
      var targetBucketName = row.getAs("target_bucket_name")
      var targetKey = "archive/" + "/" + key
      var copyObjectRequest = new CopyObjectRequest(bucketName, key, targetBucketName, targetKey)
      logger.info(".copyFilesToArchive() : Copying the File from [" + key + "] to [" + targetKey + "]");
      s3Util.getS3Client.copyObject(copyObjectRequest)
    })
    logger.info(".copyFilesToArchive() : Copying the Files to Archive Folder. No.of Files to Copy ={}", recordDF.count());
  }
  else {
    logger.info(".copyFilesToArchive() : Skipping Moving the Files as S3 Util is null");
  }
}
And when I run my unit tests I do not see the logging statement for copying the files.
INFO ArchiveProcessor - .copyFilesToArchive() : Before Copying the Files to Archive and no.of RDD Partitions =200
INFO ArchiveProcessor - .copyFilesToArchive() : Copying the Files to Archive Folder. No.of Files to Copy =3000000
When I use collect(), however, I get an OOM error.
If I use collect() then I can see the logging output:
recordDF.collect().map(row => {
...
})
Thanks
Sateesh
Spark DataFrames are immutable; any transformation you apply does not change the original DataFrame variable, it produces a new one.
You are calling the action count() on recordDF, but not on the transformed version of recordDF, i.e. recordDF.rdd.map(...). Since you are not calling any action on that transformed RDD, that particular code block never gets executed.
Since collect() is an action, recordDF.collect().map(..) works for you. collect() brings all the records to the driver; if driver memory is not enough (the default is 1 GB) you will get an OOM error.
You can instead use the foreach or foreachPartition functions on the DataFrame, e.g. recordDF.foreach(row => /* copy logic goes here */), or call an action on the mapped RDD:
val outRDD = recordDF.rdd.map(row => /* ... */)
logger.info("--<your message>--", outRDD.count)

Reading/writing with Avro schemas AND Parquet format in SparkSQL

I'm trying to write and read Parquet files from SparkSQL. For reasons of schema evolution, I would like to use Avro schemas with my writes and reads.
My understanding is that this is possible outside of Spark (or manually within Spark) using e.g. AvroParquetWriter and Avro's Generic API. However, I would like to use SparkSQL's write() and read() methods (which work with DataFrameWriter and DataFrameReader), and which integrate well with SparkSQL (I will be writing and reading Datasets).
I can't for the life of me figure out how to do this, and am wondering if this is possible at all. The only options the SparkSQL parquet format seems to support are "compression" and "mergeSchema" -- i.e. no options for specifying an alternate schema format or alternate schema. In other words, it appears that there is no way to read/write Parquet files using Avro schemas using the SparkSQL API. But perhaps I'm just missing something?
To clarify, I also understand that this will basically just add the Avro schema to the Parquet metadata on write, and will add one more translation layer on read (Parquet format -> Avro schema -> SparkSQL internal format) but will specifically allow me to add default values for missing columns (which Avro schema supports but Parquet schema does not).
Also, I am not looking for a way to convert Avro to Parquet, or Parquet to Avro (rather a way to use them together), and I am not looking for a way to read/write plain Avro within SparkSQL (you can do this using databricks/spark-avro).
I am doing something similar. I use an Avro schema to write into Parquet files; however, I don't read them back as Avro. The same technique should work on read as well. I am not sure if this is the best way to do it, but here it is anyway:
I have AvroData.avsc, which contains the Avro schema.
val kafkaArr = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder, Tuple2[String, Array[Byte]]](
  ssc, kafkaProps, fromOffsets, messageHandler)

kafkaArr.foreachRDD { (rdd, time) => {
  val schema = SchemaConverters.toSqlType(AvroData.getClassSchema).dataType.asInstanceOf[StructType]
  val ardd = rdd.mapPartitions { itr =>
    itr.map { r =>
      try {
        val cr = avroToListWithAudit(r._2, offsetSaved, loadDate, timeNow.toString)
        Row.fromSeq(cr.toArray)
      } catch {
        case e: Exception =>
          LogHandler.log.error("Exception while converting to Avro" + e.printStackTrace())
          System.exit(-1)
          Row(0) // This is just to allow the compiler to accept. On exception, the application will exit before this point.
      }
    }
  }
}}
public static List avroToListWithAudit(byte[] kfkBytes, String kfkOffset, String loaddate, String loadtime) throws IOException {
    AvroData av = getAvroData(kfkBytes);
    av.setLoaddate(loaddate);
    av.setLoadtime(loadtime);
    av.setKafkaOffset(kfkOffset);
    return avroToList(av);
}

public static List avroToList(AvroData a) throws UnsupportedEncodingException {
    List<Object> l = new ArrayList<>();
    for (Schema.Field f : a.getSchema().getFields()) {
        String field = f.name().toString();
        Object value = a.get(f.name());
        if (value == null) {
            //System.out.println("Adding null");
            l.add("");
        }
        else {
            switch (f.schema().getType().getName()) {
                case "union": //System.out.println("Adding union");
                    l.add(value.toString());
                    break;
                default:
                    l.add(value);
                    break;
            }
        }
    }
    return l;
}
The getAvroData method needs to have code to construct the Avro object from the raw bytes. I am also trying to figure out a way to do that without having to specify each attribute setter explicitly, but it seems like there isn't one.
public static AvroData getAvroData(byte[] kfkBytes)
{
    AvroData av = AvroData.newBuilder().build();
    try {
        av.setAttr(String.valueOf("xyz"));
        .....
    }
}
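The snippets above stop short of the actual Parquet write. A minimal sketch of that final step, assuming it sits inside the foreachRDD block where ardd and schema are in scope (sqlContext and outputPath are assumptions, not from the original code):
// Build a DataFrame from the Avro-derived rows and the converted StructType,
// then write it out as Parquet.
val df = sqlContext.createDataFrame(ardd, schema)
df.write
  .mode("append")
  .parquet(outputPath)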
Hope it helps

Spark broadcasting a HashMap: no NullPointerException, but it doesn't fetch any values either

I am broadcasting a hashmap and returning a map from the below method
public static Map<Object1, Object2> lkpBC(JavaSparkContext ctx, String FilePath) {
    Broadcast<Map<Object1, Object2>> CodeBC = null;
    Map<Object1, Object2> codePairMap = null;
    try {
        Map<Object1, Object2> CodepairMap = LookupUtil.loadLookup(ctx, FilePath);
        CodeBC = ctx.broadcast(codePairMap);
        codePairMap = CodeBC.value();
    } catch (Exception e) {
        LOG.error("Error while broadcasting ", e);
    }
    return codePairMap;
}
and passing the map to the below method
public static JavaRDD<Object3> fetchDetails(
        JavaSparkContext ctx,
        JavaRDD<Object3> CleanFileRDD,
        String FilePath,
        Map<Object1, Object2> BcMap
) {
    JavaRDD<Object3> assignCd = CleanFileRDD.map(row -> {
        Object3 FileData = null;
        try {
            FileData = row;
            if (BCMap.containsKey("some key")) {......}
        } catch (Exception e) {
            LOG.error("Error in Map function ", e);
        }
        return some object;
    });
    return assignCd;
}
In local mode this works fine without any issues, but when I run it on a Spark standalone cluster (1 master, 3 slaves) on EC2 it doesn't fetch any values, nor does it throw an error. All the objects you see in the methods are serialized. Does it matter whether I call these methods from a main class or from a different class?
PS: We use the Kryo serializer in the Spark conf.
I think what's going on is that you are not accessing the broadcast variable inside the closure of your map function. I think you are directly accessing the underlying BcMap (or BCMap, not sure if they are supposed to be different).
The line if (BCMap.containsKey("some key")) isn't accessing the broadcast variable CodeBC, since the type of BCMap is Map, not Broadcast.
To access the broadcast variable you would call CodeBC.value.containsKey.
Spark is designed in a functional way, it doesn't "do" anything to the underlying map, it makes a copy of it, broadcasts the copy, and wraps that copy in a Broadcast type.
I don't know what LookupUtil.loadLookup does, but I guess if the file doesn't exist or is empty does it return an empty map?
Here is an example of how you would do it in Scala:
val bcMap = ctx.broadcast(LookupUtil.loadLookup(ctx, FilePath))
cleanFileRDD.map { row =>
  if (bcMap.value.containsKey("some key")) ... // access the map via .value inside the closure
  else ...
}
I think you will solve your situation by following the wise words of a friend of mine "first solve all the obvious issues, then the harder issues seem to solve themselves". In your case they are:
Using mutable variables that get initialised to null
Using try catches that log errors but don't re-throw them. Just let exceptions bubble up.
Prematurely splitting things out into lots of different methods before you have it working as just one method.
And just because something works locally doesn't mean it will work when distributed. There are a lot of differences between running something locally and across a cluster, like: a) Data locality b) Serialization c) Closure capture d) Number of threads e) execution order ... etc
