Create documents that do not exist, skip others - apache-spark

I'm working in a concurrent environment where an index being built by a Spark job may receive updates for the same document id from both the job itself and other sources. Updates from other sources are assumed to be fresher, so the Spark job needs to silently ignore documents that already exist and create all the others. This is very close to indexing with op_type: create, but the latter throws an exception that is not passed to my error handler. The following block of code:
.rdd
.repartition(getTasks(configurationManager))
.saveJsonToEs(
  s"$indexName/_doc",
  Map(
    "es.mapping.id" -> MenuItemDocument.ID_FIELD,
    "es.write.operation" -> "create",
    "es.write.rest.error.handler.bulkErrorHandler" ->
      "<some package>.IgnoreExistsBulkWriteErrorHandler",
    "es.write.rest.error.handlers" -> "bulkErrorHandler"
  )
)
where the error handler has gone through several variations, but currently is:
class IgnoreExistsBulkWriteErrorHandler extends BulkWriteErrorHandler with LazyLogging {
  override def onError(entry: BulkWriteFailure, collector: DelayableErrorCollector[Array[Byte]]): HandlerResult = {
    logger.info("Encountered exception:", entry.getException)
    if (entry.getException.getMessage.contains("version_conflict_engine_exception")) {
      logger.info("Encountered document already present in index, skipping")
      HandlerResult.HANDLED
    } else {
      HandlerResult.ABORT
    }
  }
}
(I was obviously checking for org.elasticsearch.index.engine.VersionConflictEngineException in getException().getCause() first, but it didn't work.)
emits this in the log:
org.elasticsearch.hadoop.EsHadoopException: Could not write all entries for bulk operation [186/1000]. Error sample (first [5] error messages):
org.elasticsearch.hadoop.rest.EsHadoopRemoteException: version_conflict_engine_exception: [_doc][12]: version conflict, document already exists (current version [1])
(I assume that my error handler is not called at all)
and terminates my whole Spark job. What is the correct way to achieve my desired result?

Related

Spark transformation (map) is not called even after calling the action (count)

I have a map function defined on the DataFrame, but when I invoke the action (count() in this case) I do not see the calls inside the map function being invoked for each row.
Here is the code I have
def copyFilesToArchive(recordDF: DataFrame, s3Util: S3Client): Unit = {
  if (s3Util != null) {
    // Copy all the Object to new Path
    logger.info(".copyFilesToArchive() : Before Copying the Files to Archive and no.of RDD Partitions ={}", recordDF.rdd.partitions.length);
    recordDF.rdd.map(row => {
      var key = row.getAs("object_key")
      var bucketName = row.getAs("bucket_name")
      var targetBucketName = row.getAs("target_bucket_name")
      var targetKey = "archive/" + "/" + key
      var copyObjectRequest = new CopyObjectRequest(bucketName, key, targetBucketName, targetKey)
      logger.info(".copyFilesToArchive() : Copying the File from [" + key + "] to [" + targetKey + "]");
      s3Util.getS3Client.copyObject(copyObjectRequest)
    })
    logger.info(".copyFilesToArchive() : Copying the Files to Archive Folder. No.of Files to Copy ={}", recordDF.count());
  }
  else {
    logger.info(".copyFilesToArchive() : Skipping Moving the Files as S3 Util is null");
  }
}
And when I run my unit tests, I do not see the logging statement for copying the files.
INFO ArchiveProcessor - .copyFilesToArchive() : Before Copying the Files to Archive and no.of RDD Partitions =200
INFO ArchiveProcessor - .copyFilesToArchive() : Copying the Files to Archive Folder. No.of Files to Copy =3000000
If I use collect(), I can see the logging output, but then I get an OOM error:
recordDF.collect().map(row => {
...
})
Thanks
Sateesh
Spark DataFrames are immutable; a transformation never changes the original DataFrame variable.
You are calling the action count() on recordDF, but not on the transformed version of recordDF, i.e. recordDF.rdd.map(...). Since no action is ever called on that mapped RDD, that block of code never executes.
Since collect() is an action, recordDF.collect().map(...) works for you. However, collect() brings all the records to the driver; if driver memory is not enough (the default is 1 GB), you get an OOM error.
You can use the foreach or foreachPartition functions on the DataFrame, e.g. recordDF.foreach(row => /* copy logic goes here */), or call an action on the mapped RDD:
val outRDD = recordDF.rdd.map(row => /* ... */)
logger.info("--<your message>--", outRDD.count)

Spark - ignoring corrupted files

In the ETL process that we are managing, we sometimes receive corrupted files.
We tried this Spark configuration and it seems it works (the Spark job is not failing because the corrupted files are discarded):
spark.sqlContext.setConf("spark.sql.files.ignoreCorruptFiles", "true")
But I don't know if there is any way to tell which files were ignored. Is there any way to get those filenames?
Thanks in advance
One way is to look through your executor logs. If you have set the following configurations to true in your Spark configuration:
RDD: spark.files.ignoreCorruptFiles
DataFrame: spark.sql.files.ignoreCorruptFiles
then Spark will log each corrupted file as a WARN message in your executor logs.
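For reference, a minimal sketch of enabling both settings when building the session (this is not from the answer; only the two key names above come from it, the rest is a standard SparkSession setup):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ignore-corrupt-files-example")               // hypothetical app name
  .config("spark.files.ignoreCorruptFiles", "true")      // RDD-based reads
  .config("spark.sql.files.ignoreCorruptFiles", "true")  // DataFrame/SQL reads
  .getOrCreate()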
Here is the snippet from the Spark source that does that logging:
if (ignoreCorruptFiles) {
  currentIterator = new NextIterator[Object] {
    // The readFunction may read some bytes before consuming the iterator, e.g.,
    // vectorized Parquet reader. Here we use lazy val to delay the creation of
    // iterator so that we will throw exception in `getNext`.
    private lazy val internalIter = readCurrentFile()

    override def getNext(): AnyRef = {
      try {
        if (internalIter.hasNext) {
          internalIter.next()
        } else {
          finished = true
          null
        }
      } catch {
        // Throw FileNotFoundException even `ignoreCorruptFiles` is true
        case e: FileNotFoundException => throw e
        case e @ (_: RuntimeException | _: IOException) =>
          logWarning(
            s"Skipped the rest of the content in the corrupted file: $currentFile", e)
          finished = true
          null
      }
    }
    // ... (rest of the NextIterator elided)
  }
}
Did you solve it?
If not, maybe you can try the approach below (see the sketch that follows):
Read everything from the location with that ignoreCorruptFiles setting enabled.
Get the file name each record belongs to using the input_file_name function, then take the distinct names.
Separately, get the list of all the objects in the respective directory.
Find the difference.
Did you use a different approach?
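A rough sketch of those steps (assuming an active SparkSession named spark and a hypothetical Parquet input path; adjust the reader and the path-string comparison to your layout, since input_file_name and the Hadoop listing may format URIs slightly differently):

import org.apache.spark.sql.functions.input_file_name
import org.apache.hadoop.fs.{FileSystem, Path}

val inputPath = "s3://bucket/input/"   // hypothetical location

// 1. Read everything with corrupted files ignored.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
val df = spark.read.parquet(inputPath)

// 2. Distinct file names that actually contributed records.
val readFiles = df.select(input_file_name()).distinct()
  .collect().map(_.getString(0)).toSet

// 3. All files in the directory (flat listing via the Hadoop FileSystem API).
val fs = FileSystem.get(new java.net.URI(inputPath), spark.sparkContext.hadoopConfiguration)
val allFiles = fs.listStatus(new Path(inputPath)).map(_.getPath.toString).toSet

// 4. The difference is (roughly) the set of ignored files.
val ignored = allFiles.diff(readFiles)
ignored.foreach(println)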

Azure Search - Error

When trying to index documents we are getting this error:
{"Token PropertyName in state ArrayStart would result in an invalid JSON object. Path 'value[0]'."}
Our code for indexing using the .NET library is:
using (var indexClient = new SearchIndexClient(searchServiceName, indexName, new SearchCredentials(apiKey)))
{
    indexClient.Documents.Index(IndexBatch.Create(IndexAction.Create(documents.Select(doc => IndexAction.Create(doc)))));
}
Does anyone know why this error occurs?
The issue is because of an extra call to IndexAction.Create. If you change your indexing code to this, it will work:
indexClient.Documents.Index(IndexBatch.Create(documents.Select(doc => IndexAction.Create(doc))));
The compiler didn't catch this because IndexBatch.Create has a params argument that can take any number of IndexAction<T> for any type T. In this case, T was a collection, which is not supported (documents must be objects, not collections).
The programming model for creating batches and actions is changing substantially in the 1.0.0-preview release of the SDK. It will be more type-safe so that mistakes like this are more likely to be caught at compile-time.

primaryValues does not behave as expected

In our PoC, we have a cache in PARTITIONED mode with 2 backups, and we started 3 nodes. 100 entries were loaded into the cache and we did the steps below to retrieve them.
public void perform() throws GridException {
    final GridCache<Long, Entity> cache = g.cache("cache");
    GridProjection proj = g.forCache("cache");
    Collection<Collection<Entity>> list = proj.compute().broadcast(
        new GridCallable<Collection<Entity>>() {
            @Override public Collection<Entity> call() throws Exception {
                Collection<Entity> values = cache.primaryValues();
                System.out.println("List size on each Node: " + values.size());
                // console from each node shows 28,38,34 respectively, which is correct
                return values;
            }
        }).get();
    for (Collection<Entity> e : list) {
        System.out.println("list size when arrives on main Node :" + e.size());
        // console shows 28 for three times, which is not correct
    }
}
I assumed that primaryValues() takes the value of each element returned by primaryEntrySet() and puts it into a Collection. I also tried primaryEntrySet() and it works without this problem.
The way GridGain serializes cache collections is by reference, which may not be very intuitive. I have filed a Jira issue with the Apache Ignite project (which is the next version of the GridGain open source edition): https://issues.apache.org/jira/browse/IGNITE-38
In the meantime, please try the following from your GridCallable, which should work:
return new ArrayList(cache.primaryValues());

Using Squeryl with Akka Actors

So I'm trying to work with both Squeryl and Akka Actors. I've done a lot of searching and all I've been able to find is the following Google Group post:
https://groups.google.com/forum/#!topic/squeryl/M0iftMlYfpQ
I think I might have shot myself in the foot as I originally created this factory pattern so I could toss around Database objects.
object DatabaseType extends Enumeration {
  type DatabaseType = Value
  val Postgres = Value(1, "Postgres")
  val H2 = Value(2, "H2")
}

object Database {
  def getInstance(dbType: DatabaseType, jdbcUrl: String, username: String, password: String): Database = {
    Class.forName(jdbcDriver(dbType))
    new Database(Session.create(
      _root_.java.sql.DriverManager.getConnection(jdbcUrl, username, password),
      squerylAdapter(dbType)))
  }
  private def jdbcDriver(db: DatabaseType) = {
    db match {
      case DatabaseType.Postgres => "org.postgresql.Driver"
      case DatabaseType.H2 => "org.h2.Driver"
    }
  }
  private def squerylAdapter(db: DatabaseType) = {
    db match {
      case DatabaseType.Postgres => new PostgreSqlAdapter
      case DatabaseType.H2 => new H2Adapter
    }
  }
}
Originally in my implementation, I tried surrounding all my statements in using(session), but I kept getting the dreaded "No session is bound to the current thread" error, so I added session.bindToCurrentThread to the constructor.
class Database(session: Session) {
  session.bindToCurrentThread
  def failedBatch(filename: String, message: String, start: Timestamp = now, end: Timestamp = now) =
    batch.insert(new Batch(0, filename, Some(start), Some(end), ProvisioningStatus.Fail, Some(message)))
  def startBatch(batch_id: Long, start: Timestamp = now) =
    batch update (b => where(b.id === batch_id) set (b.start := Some(start)))
  // ...more functions
}
This worked reasonably well, until I got to Akka actors.
class TransferActor() extends Actor {
  def databaseInstance() = {
    val dbConfig = config.getConfig("provisioning.database")
    Database.getInstance(DatabaseType.Postgres,
      dbConfig.getString("jdbcUrl"),
      dbConfig.getString("username"),
      dbConfig.getString("password"))
  }

  lazy val config = ConfigManager.current

  override def receive: Actor.Receive = { /* .. do some work */ }
}
I constantly get the following:
[ERROR] [03/11/2014 17:02:57.720] [provisioning-system-akka.actor.default-dispatcher-4] [akka://provisioning-system/user/$c] No session is bound to current thread, a session must be created via Session.create
and bound to the thread via 'work' or 'bindToCurrentThread'
Usually this error occurs when a statement is executed outside of a transaction/inTrasaction block
java.lang.RuntimeException: No session is bound to current thread, a session must be created via Session.create
and bound to the thread via 'work' or 'bindToCurrentThread'
I'm getting a fresh Database object each time, not caching it with a lazy val, so shouldn't that constructor always get called and attach to my current thread? Does Akka attach different threads to different actors and swap them around? Should I just add a function to call session.bindToCurrentThread each time I'm in an actor? Seems kinda hacky.
Does Akka attach different threads to different actors and swap them around?
That's exactly how the actor model works. The idea is that you can have a small thread pool servicing a very large number of actors, because an actor only needs a thread when it has a message waiting to be processed.
Some general tips for Squeryl: a Session has a one-to-one association with a JDBC connection. The main advantage of keeping a Session open is that you can keep a transaction open that gives you a consistent view of the database as you perform multiple operations. If you don't need that, make your session/transaction code granular to avoid these types of issues. If you do need it, don't rely on Sessions being available in a thread-local context. Use the transaction(session){} or transaction(sessionFactory){} methods to explicitly tell Squeryl where you want your Session to come from.
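A minimal sketch of the session-factory route (assuming Postgres and placeholder connection values; the real ones would come from the provisioning.database config in the question):

import akka.actor.Actor
import org.squeryl.{Session, SessionFactory}
import org.squeryl.adapters.PostgreSqlAdapter
import org.squeryl.PrimitiveTypeMode._

object Db {
  // Hypothetical placeholders for the values read from configuration in the question.
  val jdbcUrl = "jdbc:postgresql://localhost/provisioning"
  val username = "user"
  val password = "password"

  // Register a factory once at startup; each transaction { } block then creates
  // (and closes) its own Session, so nothing needs to stay bound to one thread.
  def init(): Unit = {
    Class.forName("org.postgresql.Driver")
    SessionFactory.concreteFactory = Some(() =>
      Session.create(
        java.sql.DriverManager.getConnection(jdbcUrl, username, password),
        new PostgreSqlAdapter))
  }
}

class TransferActor extends Actor {
  override def receive: Actor.Receive = {
    case msg =>
      // Wrap each unit of database work in its own transaction; Squeryl obtains a
      // Session from the factory for the duration of the block, regardless of which
      // dispatcher thread happens to be running the actor.
      transaction {
        // ... Squeryl statements, e.g. the inserts/updates from the Database class ...
      }
  }
}

In production you would typically back the factory with a connection pool rather than DriverManager, but the shape of the transaction blocks stays the same.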