How to perform decision tree lookup using MapReduce? I am looking for an optimized version - decision-tree

I have a decision tree with millions of nodes, serialized on HDFS. Can anyone give me some pointers on how to serialize it better so that I can perform searches more efficiently on Hadoop using MapReduce?
Thanks.

Well, in order to traverse your tree, you need the model loaded into memory. Once it is loaded, traversing it for a single instance is easy and fast. You can't avoid storing your model on HDFS, so to traverse it more efficiently you need to do something better in main memory. But as I said, a tree traversal is always very fast. Providing some more information about your problem would help. Is your problem that you have millions of new examples and need to predict their labels?
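For illustration, a common pattern is to ship the serialized tree to every task (for example via Hadoop Streaming's -files option or the distributed cache), load it into memory once per task, and then classify each record as it streams through the mapper. The following is a minimal Hadoop Streaming mapper sketch; the file name "tree.pkl", the tab-separated input layout (record id, then features) and the node dictionary layout are assumptions for the example, not something from the question.

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming mapper sketch: load the pickled tree once per task,
# then classify every input record that streams through stdin.
import pickle
import sys

with open("tree.pkl", "rb") as f:        # shipped to each task, e.g. via -files tree.pkl
    tree = pickle.load(f)

def classify(node, features):
    # Walk from the root down to a leaf; internal nodes hold a split, leaves a label.
    while "label" not in node:
        child = "left" if features[node["feature"]] <= node["threshold"] else "right"
        node = node[child]
    return node["label"]

for line in sys.stdin:
    rec_id, *raw = line.rstrip("\n").split("\t")
    label = classify(tree, [float(x) for x in raw])
    print("%s\t%s" % (rec_id, label))
```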

Related

Searching through polymorphic data with Elasticsearch

I am stumped at what seems to be a fundamental problem with Elasticsearch and polymorphic data. I would like to be able to find multiple types of results (e.g. users and videos and playlists) with just one Elasticsearch query. It has to be just one query, since that way Elasticsearch can do all the scoring and I won't have to do any magic to combine multiple query results of different types.
I know that Elasticsearch uses a flat document structure, bringing me to the following problem. If I index polymorphic data, I will have to specify a 'missing' value for each unique attribute that I care about in scoring subtypes of the polymorphic data.
I've looked for examples of others dealing with this problem and couldn't find any. There doesn't seem to be anything in the documentation on this either. Am I overlooking something obvious, or was Elasticsearch just not designed to do something like this?
Kind regards,
Steffan
That's not an issue with Elasticsearch itself; it's a problem (or limitation) of the underlying Lucene indexes. So any DB/engine based on Lucene will have the same problems (if not worse :), ES does a ton of work for you). ES will probably ease the pain in future releases, but not dramatically. And IMO, there's hardly any high-performance search engine that can cope with truly polymorphic data.
The answer depends on your data structure, that's for sure. Basically, you have two options:
Put all your data in a single index and split it by types. You already know the overhead: Lucene indexes work poorly with sparse data. The more similar your data is, the fewer problems you have. Anyway, ES will do all the underlying work for "missing" values; you only have to cope with the memory/disk overhead of storing sparse data.
If your data is organised with a parent-child relation (e.g. video -> playlist), you definitely need a single index for such data, which leaves you with this approach only.
Divide your data into multiple indexes. This way you get slightly higher disk overhead for the Lucene indexes, plus possibly higher CPU usage when aggregating data from multiple shards (so you should tune your sharding accordingly).
You can still query ES for all your documents in a single request, as ES supports multi-index queries (see the sketch below).
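For example, with the official Python client a single request can span several indexes; the index names and the query below are made-up, and the exact call shape varies by elasticsearch-py version.

```python
# One request across several indexes with the official Python client.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])
resp = es.search(
    index="users,videos,playlists",                       # multi-index query
    body={"query": {"match": {"title": "guitar"}}, "size": 20},
)
for hit in resp["hits"]["hits"]:
    print(hit["_index"], hit["_score"], hit["_id"])       # scored across all types
```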
So this looks like a question purely about your data structure. I'd recommend simply firing up a small cluster to measure memory/disk/CPU usage for the expected data. For more details on "index vs shard", see the great article by Adrien.
Slightly off-topic: if ES doesn't seem to fit your needs, I suggest you still consider merging the data on the application side. ES works great with multiple light requests (instead of a few heavier ones), and as your results from ES are already sorted, you only need to merge sorted streams given sorted input. Not much magic there, tbh.
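If you do end up merging on the application side, Python's heapq.merge does exactly that for already-sorted inputs; the per-type result lists below are made-up examples.

```python
# Merge already-sorted per-type result lists on the application side;
# heapq.merge keeps the combined stream sorted without re-sorting anything.
import heapq

users     = [(9.1, "user:42"), (7.5, "user:7")]       # each list sorted by score, descending
videos    = [(8.8, "video:3"), (6.0, "video:11")]
playlists = [(7.9, "playlist:5")]

merged = heapq.merge(users, videos, playlists, key=lambda hit: hit[0], reverse=True)
print(list(merged)[:10])   # top combined hits across all types
```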

How does Spark's sort shuffle work?

From https://0x0fff.com/spark-architecture-shuffle/ I know that the default way of shuffling in Spark is sort shuffle. However, the description was not step-by-step enough to be clear to me. How does it work?
What I understand is that each mapper writes into exactly one AppendOnlyMap (what are the keys?), which is sorted (and spilled - why spilled?) into potentially multiple... what exactly?... and then somehow written to some indexed file (what exactly is indexed, by what, with what key?). I think the idea in the end is that all those sorted-and-indexed files are brought together with a min-heap merge so that there is only one big file per reducer.
As you can see, there are more holes (things I don't understand) than Swiss cheese (things I do understand)...
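A toy sketch of the layout sort shuffle aims for may make the moving parts easier to picture. Per map task, the output is one data file with records grouped by reduce partition, plus an index file of byte offsets so each reducer fetches only its own slice; the sorted spill files that get min-heap merged on the map side all exist to produce this layout. The sketch below is plain Python with made-up records and a made-up partitioner, not Spark internals verbatim.

```python
# Toy illustration of the final sort-shuffle output of one map task:
# one data file grouped by reduce partition + an index of start offsets.
import io
from collections import defaultdict

NUM_REDUCERS = 3
records = [("b", 2), ("a", 1), ("c", 3), ("a", 4), ("b", 5)]       # map output
partition_of = lambda key: (ord(key) - ord("a")) % NUM_REDUCERS    # toy partitioner

# Group (in Spark: sort) the map output by target reduce partition; when this
# in-memory buffer outgrows its budget it is spilled to disk as a sorted run.
by_partition = defaultdict(list)
for k, v in records:
    by_partition[partition_of(k)].append((k, v))

# Write all partitions back-to-back into one data file; the list of start
# offsets is what the accompanying index file stores.
data, offsets = io.BytesIO(), []
for pid in range(NUM_REDUCERS):
    offsets.append(data.tell())
    for k, v in by_partition[pid]:
        data.write(("%s=%s;" % (k, v)).encode())
offsets.append(data.tell())

# A reducer responsible for partition 1 reads only bytes [offsets[1], offsets[2]).
print(data.getvalue()[offsets[1]:offsets[2]])      # b=2;b=5;
```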

Spark broadcast variables: large maps

I am broadcasting a large Map (~6-10 GB). I am using sc.broadcast(prod_rdd) to do that. However, I am not sure whether broadcasting is meant only for small data/files and not for larger objects like the one I have. If it's the former, what is the recommended practice? One option is to use a NoSQL database and then do the lookup using that. One issue with that is that I might have to give up performance, since I will be going through a single node (region server or whatever the equivalent of that is). If anyone has any insight into the performance impact of these design choices, it would be greatly appreciated.
I'm wondering if you could perhaps use mapPartitions and read the map once per partition rather than broadcasting it?
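A minimal sketch of that mapPartitions idea, assuming the lookup map can be pickled to storage every executor can read; the path "/shared/prod_map.pkl", the sample RDD and the join logic are illustrative, not from the question.

```python
# Each partition loads the lookup map once and reuses it for all of its
# records, instead of broadcasting a multi-GB object to every executor.
import pickle
from pyspark import SparkContext

sc = SparkContext(appName="partition-side-lookup")

def lookup_partition(rows):
    with open("/shared/prod_map.pkl", "rb") as f:   # loaded once per partition
        prod_map = pickle.load(f)
    for key, value in rows:
        yield key, (value, prod_map.get(key))

events = sc.parallelize([("p1", 10), ("p2", 20)])   # stand-in for the real RDD
print(events.mapPartitions(lookup_partition).collect())
```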

How to score all user-product combinations in Spark MatrixFactorizationModel?

Given a MatrixFactorizationModel what would be the most efficient way to return the full matrix of user-product predictions (in practice, filtered by some threshold to maintain sparsity)?
Via the current API, one could pass a Cartesian product of users and products to the predict function, but it seems to me that this would do a lot of extra processing.
Would accessing the private userFeatures, productFeatures be the correct approach, and if so, is there a good way to take advantage of other aspects of the framework to distribute this computation in an efficient way? Specifically, is there an easy way to do better than multiplying all pairs of userFeature, productFeature "by hand"?
Spark 1.1 has a recommendProducts method that can be mapped to each user ID. This is better than nothing but not really optimized for recommending to all users.
I would double-check that you really mean to make recommendations for everyone; at scale, this is inherently a big slow operation. Consider predicting for users that have been recently active only.
Otherwise, yes, your best bet is to create your own method. The Cartesian join of the feature RDDs is probably too slow, as it shuffles so many copies of the feature vectors. Choose the larger of the user / product feature sets, and map over that. In each worker, hold the other product / user feature set in memory. If this isn't feasible, you can make this more complex and map several times against subsets of the smaller RDD held in memory.
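As a rough illustration of that approach, the sketch below (pyspark.mllib) broadcasts the product factors, assumed here to be the smaller side, and maps over the user factors; `model`, `sc` and the 0.5 threshold are assumptions for the example, not part of the answer above.

```python
# "Map the larger side, keep the smaller side in memory": broadcast product
# factors and score them against each user's factors, keeping only scores
# above an arbitrary threshold to stay sparse.
import numpy as np

product_factors = [(pid, np.array(f)) for pid, f in model.productFeatures().collect()]
pf_broadcast = sc.broadcast(product_factors)

def score_user(user):
    uid, uf = user
    uf = np.array(uf)
    for pid, pf in pf_broadcast.value:
        score = float(uf.dot(pf))
        if score > 0.5:                     # arbitrary sparsity threshold
            yield uid, pid, score

user_product_scores = model.userFeatures().flatMap(score_user)
print(user_product_scores.take(5))
```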
As of Spark 2.2, recommendProductsForUsers(num) would be the method.
Recommends the top "num" number of products for all users. The number of recommendations returned per user may be less than "num".
https://spark.apache.org/docs/2.2.0/api/python/pyspark.mllib.html
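A short usage sketch of that method (pyspark.mllib, Spark 2.2+); ten recommendations per user is an arbitrary choice and `model` is assumed to be an already trained MatrixFactorizationModel.

```python
# recommendProductsForUsers returns an RDD of (userId, list-of-Rating).
top10 = model.recommendProductsForUsers(10)
for user_id, ratings in top10.take(3):
    print(user_id, [(r.product, r.rating) for r in ratings])
```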

In CouchDB, are there ways to improve performance of the View index process?

I have some basic views and some map/reduce views with logic. Nothing too complex. Not too many documents. I've tried with 250k, 75k, and 10k documents. Seems like I'm always waiting for view indexing.
Does better, more efficient code in the view help? I'm assuming it basically has to process the view at all levels of aggregation, so there must be some room for improvement there.
Does emit()-ing less data help? emit(doc.id, doc) vs specifying fewer fields?
Do more or less complex keys impact view indexing?
Or is it all about memory, CPU cores, and processor speed?
There must be some documentation out there, but I can't find anything referencing ways to improve performance.
I would take a deeper look into the reduce function. Try to use the built-in Erlang functions like _sum and _count instead of writing JavaScript.
Complex views can take hours and more, that's normal.
Maybe post that not-too-complex map/reduce.
And don't forget: indexing all docs is only done once after changing the view (or pushing a whole bunch of new docs). Subsequent new docs are indexed incrementally.
Use a view with stale=ok to retrieve the "old" data instantly, so you don't have to wait. (But pay attention: you always have to call the view without stale=ok at least once to trigger the indexing process.) Or better: use stale=update_after.
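For illustration, here is a minimal sketch of a design document whose view uses the built-in _sum reducer, queried with stale=update_after; the database name, URL and fields are made-up examples, and the HTTP calls are shown with Python's requests library rather than curl.

```python
# A view with a built-in Erlang reducer, plus a query that returns whatever is
# already indexed and triggers the index update afterwards.
import requests

db = "http://localhost:5984/mydb"

ddoc = {
    "_id": "_design/stats",
    "views": {
        "total_by_type": {
            "map": "function(doc) { if (doc.type) { emit(doc.type, doc.amount); } }",
            "reduce": "_sum"                 # built-in reducer, faster than a JS reduce
        }
    }
}
requests.put(db + "/_design/stats", json=ddoc)

resp = requests.get(db + "/_design/stats/_view/total_by_type",
                    params={"group": "true", "stale": "update_after"})
print(resp.json())
```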
The code you write in views is more like CREATE INDEX than SELECT. It should be irrelevant how long it takes, as long as the view builds keep up with the document change rate. Building a view is a sunk (one-time) cost.
When you query the view, that is always a B-tree scan, which operates against a static data set in logarithmic time. That is usually the performance people care about more (in production).
If you are not seeing behavior like I describe, perhaps we could discuss your view functions and your general approach to your problem. CouchDB is very different from relational databases. In the latter, you have highly structured data and free-form queries. In CouchDB, you have free-form data but highly structured index definitions (views). Except during development, changing and rebuilding views should be rare.
Not emitting anything will help, but doing the view creation in smaller batches (there are scripts that do this automagically) helps more than anything other than not emitting anything at all, which sometimes can't be avoided.
