Handle warnings in data processing with apache spark

Handle warnings in data processing with apache spark - apache-spark

I looked around on the web and I couldn’t find any way to deal in a structured way with malformed/faulty records during computation. All I was able to find was the flatMap/Some/None technique + logging.
I’m facing this problem because I have a processing algorithm that extracts more than one value from each record, but can fail in extracting one of those multiple values, and I want to keep track of them. Logging is not feasible because this “warning” happens so frequently that the logs would become overwhelming and impossibile to read.
Since I have 3 different possible outcomes from my processing I modeled it with this class hierarchy:
That holds result and/or warnings.
Since Result implements Traversable it can be used in a flatMap, discarding all warnings and failure results, in the other hand, if we want to keep track of warnings, we can elaborate them and output them if we need.

Related

Are there patterns for "lookbehind filters" in Apache Spark?

I stumbled at a couple of workloads which seem to require filtering data with "lookback" capability - mainly in IoT scenarios, where sensors can produce garbage data, and to detect that it's necessary to look at the previous record of that sensor.
Spark's filter() operation is obviously "element-only" - in fact, the RDD as a whole can't know the order of the elements you want it to look behind on. So another approach is needed.
My naive approach would involve keying the RDD on the sensor, repartitioning it so that keys and partitions become one and the same, and sorting all elements for the keys/partition so that they are in temporal order. Then we can filter with a user function and maybe emit the data back so that the rest of the pipeline can deal with it as it wishes.
However, this looks heavyweight and likely inefficient. Is there a more idiomatic way?
Summary: Is there a Spark-related design pattern to deal with filtering tasks which need to "lookback" at the previous element of a sequence?

Searching through polymorphic data with Elasticsearch

I am stumped at what seems to be a fundamental problem with Elasticsearch and polymorphic data. I would like to be able to find multiple types of results (e.g. users and videos and playlists) with just one Elasticsearch query. It has to be just one query, since that way Elasticsearch can do all the scoring and I won't have to do any magic to combine multiple query results of different types.
I know that Elasticsearch uses a flat document structure, bringing me to the following problem. If I index polymorphic data, I will have to specify a 'missing' value for each unique attribute that I care about in scoring subtypes of the polymorphic data.
I've looked for examples of other dealing with this problem and couldn't find any. There doesn't seem to be anything in the documentation on this either. Am I overlooking something obvious or was Elasticsearch just not designed to do something like this?
Kind regards,
Steffan

Thats not the issue of Elasticsearch itself, its the problem (or limitation) of underlying lucene indexes. So, any db/engine based on lucene will have the same problems (if not worse :), ES does a hell ton of job for you). Probably, ES will ease the pain in further releases, but not dramatically. And IMO, there's hardly any hi-perf search engine that can bear with true polymorphic data.
The answer depends on your data structure, thats for sure. Basically, you have two options:
Put all your data in single index, and split it by types. And you already know the overhead - lucent indexes works poorly with sparse data. More similar your data is, less problem you have. Anyway, ES will do all the underlying job for "missing" values, you only have to cope with memory/disk overhead for storing sparse data.
If your data is organised with parent-child relation (i.e. video -> playlist), you definitely need single index for such data. Which is leaving you with this approach only.
Divide your data into multiple indexes. This way you have slightly higher disk overhead for lucene index + possibly higher CPU usage when aggregation data from multiple shards (so, you should tune your sharding respectively).
You still can query ES for all your documents in single request, as ES supports multi-index queries.
So, this looks like question purely of your data structure. I'd recommend to simply fire up small cluster to measure memory/disk/cpu usage for expected data. More details on "index vs shard" – great article by Adrien.
Slightly off-topic, if ES doesn't seem to feet your needs, I suggest you to
still consider merging data on application side. ES works great with multiple light request (instead of few heavier), and as your results from ES is sorted already, you need to merge sorted streams having sorted input. Not so much magic there, tbh.

Avoid the use of Java data structures in Apache Spark to avoid copying the data

I have a MySQL database with a single table containing about 100 million records (~25GB, ~5 columns). Using Apache Spark, I extract this data via a JDBC connector and store it in a DataFrame.
From here, I do some pre-processing of the data (e.g. replacing the NULL values), so I absolutely need to go through each record.
Then I would like to perform dimensionality reduction and feature selection (e.g. using PCA), perform clustering (e.g. K-Means) and later on do the testing of the model on new data.
I have implemented this in Spark's Java API, but it is too slow (for my purposes) since I do a lot of copying of the data from a DataFrame to a java.util.Vector and java.util.List (to be able to iterate over all records and do the pre-processing), and later back to a DataFrame (since PCA in Spark expects a DataFrame as input).
I have tried extracting information from the database into a org.apache.spark.sql.Column but cannot find a way to iterate over it.
I also tried avoiding the use of Java data structures (such as List and Vector) by using the org.apache.spark.mllib.linalg.{DenseVector, SparseVector}, but cannot get that to work either.
Finally, I also considered using JavaRDD (by creating it from a DataFrame and a custom schema), but couldn't work it out entirely.
After a lengthy description, my question is: is there a way to do all steps mentioned in the first paragraph, without copying all the data into a Java data structure?
Maybe one of the options I tried could actually work, but I just can't seem to find out how, as the docs and literature on Spark are a bit scarce.

From the wording of your question, it seems there is some confusion about the stages of Spark processing.
First, we tell Spark what to do by specifying inputs and transformations. At this point, the only things that are known are (a) the number of partitions at various stages of processing and (b) the schema of the data. org.apache.spark.sql.Column is used at this stage to identify the metadata associated with a column. However, it doesn't contain any of the data. In fact, there is no data at all at this stage.
Second, we tell Spark to execute an action on a dataframe/dataset. This is what kicks off processing. The input is read and flows through the various transformations and into the final action operation, be it collect or save or something else.
So, that explains why you cannot "extract information from the database into" a Column.
As for the core of your question, it's hard to comment without seeing your code and knowing exactly what it is you are trying to accomplish but it is safe to say that much migrating between types is a bad idea.
Here are a couple of questions that might help guide you to a better outcome:
Why can't you perform the data transformations you need by operating directly on the Row instances?
Would it be convenient to wrap some of your transformation code into a UDF or UDAF?
Hope this helps.

Real time multi threaded max-heap for top-N geohash

There is a requirement to keep a list of top-10 localities in a city from where demand for our food service is emanating at any given instant. The city could have tens of thousands of localities.
If one has to make a near real time (lag no more than 5 minutes) datastore in memory that would
- keep count of incoming demand by locality (geo hash)
- reads by hundreds of our suppliers every minute (the ajax refresh is every minute)
I was thinking of a multi threaded synchronized max-heap. This would be a complex solution as tree locking is by itself a complex implementation.
Any recommendations for the best in-memory (replicatable master slave) data structure that can be read and updated in multi threaded environment?
We expect 10K QPS and 100K updates per second. When we scale to other cities and regions, we will need per city implementation of top-10.
Are there any off the shelf solutions available?
Persistence is not a need so no mySQL based solutions. If you recommend redis or mongo DB solution, please realize that the queries are not pointed-queries by key but a top-N query instead.
Thanks in advance.

If you're looking for exactly what you're describing, there are a few approaches that might work nicely. There are several papers describing concurrent data structures that could work as priority queues; here is one option that I'm not super familiar with but which looks promising. You might also want to check out concurrent skip lists, which should also match your requirements.
If I'm interpreting your problem statement correctly, you're hoping to maintain a top-10 list of locations based on the number of hits you receive. If that's the case, I would suspect that while the number of updates would be huge, the number of times that two locations would switch positions would not actually be all that large. In other words, most updates wouldn't actually require the data structure to change shape. Consequently, you could consider using a standard binary heap where each element uses an atomic-compare-and-set integer key and where you have some kind of locking system that's used only in the case where you need to add, move, or delete an element from the heap.
Given the scale that you're working at, you may also want to consider approximate solutions to your problem. The count-min sketch data structure, for example, was specifically designed to estimate frequent elements in a data stream and does so extremely quickly. It can easily be distributed and linked up with a priority queue in a manner similar to what I described above. There are lots of good implementations out there, and if I remember correctly this data structure is actually deployed in situations like the one you're describing.
Hope this helps!

Strategies for search across disparate data sources

I am building a tool that searches people based on a number of attributes. The values for these attributes are scattered across several systems.
As an example, dateOfBirth is stored in a SQL Server database as part of system ABC. That person's sales region assignment is stored in some horrible legacy database. Other attributes are stored in a system only accessible over an XML web service.
To make matters worse, the the legacy database and the web service can be really slow.
What strategies and tips should I consider for implementing a search across all these systems?
Note: Although I posted an answer, I'm not confident its a great answer. I don't intend to accept my own answer unless no one else gives better insight.

You could consider using an indexing mechanism to retrieve and locally index the data across all the systems, and then perform your searches against the index. Searches would be an awful lot faster and more reliable.
Of course, this just shifts the problem from one part of your system to another - now your indexing mechanism has to handle failures and heterogeneous systems, but that may be an easier problem to solve.
Another factor is how often the data changes. If you have to query data in real-time that goes stale very quickly, then indexing may not be practical.

If you can get away with a restrictive search, start by returning a list based on the search criteria corresponding to the fastest data source. Then join up those records with the other systems and remove records which don't match the search criteria.
If you have to implement OR logic, this approach is not going to work.

While not an actual answer, this might at least get you partway to a workable solution. We had a similar situation at a previous employer - lots of data sources, different ways of accessing those data sources, different access permissions, military/government/civilian sources, etc. We used Mule, which is built around the Enterprise Service Bus concept, to connect these data sources to our application. My details are a bit sketchy, as I wasn't the actual implementor, just an integrator, but what we did was define a channel in Mule. Then you write a simple integration piece to go between the channel and the data source, and the application and the channel. The integration piece does the work of making the actual query, and formatting the results, so we had a generic SQL integration piece for accessing a database, and for things like web services, we had some base classes that implemented common functionality, so the actual customization of the integration piecess was a lot less work than it sounds like. The application could then query the channel, which would handle accessing the various data sources, transforming them into a normalized bit of XML, and return the results to the application.
This had a lot of advantages for our situation. We could include new data sources for existing queries by simply connecting them to the channel - the application didn't have to know or care what data sources where there, as it only looked at the data from the channel. Since data can be pushed or pulled from the channel, we could have a data source update the application when, for example, it was updated.
It took a while to get it configured and working, but once we got it going, we were pretty successful with it. In our demo setup, we ended up with 4 or 5 applications acting as both producers and consumers of data, and connecting to maybe 10 data sources.

Have you thought of moving the data into a separate structure?
For example, Lucene stores data to be searched in a schema-less inverted indexed. You could have a separate program that retrieves data from all your different sources and puts them in a Lucene index. Your search could work against this index and the search results could contain a unique identifier and the system it came from.
http://lucene.apache.org/java/docs/
(There are implementations in other languages as well)

Have you taken a look at YQL? It may not be the perfect solution but I might give you starting point to work from.

Well, for starters I'd parallelize the queries to the different systems. That way we can minimize the query time.
You might also want to think about caching and aggregating the search attributes for subsequent queries in order to speed things up.
You have the option of creating an aggregation service or middleware that aggregates all the different systems so that you can provide a single interface for querying. If you do that, this is where I'd do the previously mentioned cache and parallize optimizations.
However, with all of that it you will need weighing up the development time/deployment time /long term benefits of the effort against migrating the old legacy database to a faster more modern one. You haven't said how tied into other systems those databases are so it may not be a very viable option in the short term.
EDIT: in response to data going out of date. You can consider caching if your data if you don't need the data to always match the database in real time. Also, if some data doesn't change very often (e.g. dates of birth) then you should cache them. If you employ caching then you could make your system configurable as to what tables/columns to include or exclude from the cache and you could give each table/column a personalizable cache timeout with an overall default.

Use Pentaho/Kettle to copy all of the data fields that you can search on and display into a local MySQL database
http://www.pentaho.com/products/data_integration/
Create a batch script to run nightly and update your local copy. Maybe even every hour. Then, write your query against your local MySQL database and display the results.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string