Spark broadcast variables: large maps

I am broadcasting a large Map (~6-10 GB). I am using sc.broadcast(prod_rdd) to do that. However, I am not sure whether broadcasting is meant only for small data/files and not for larger objects like the one I have. If it is the former, what is the recommended practice? One option is to use a NoSQL database and then do the lookup through that. One issue with that is that I might have to give up performance, since I will be going through a single node (a region server or whatever the equivalent is). If anyone has any insight into the performance impact of these design choices, it would be greatly appreciated.
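Note that sc.broadcast takes a plain local object, not an RDD, so the pair RDD would have to be collected to the driver first, which at 6-10 GB already strains driver memory. A minimal sketch (prod_rdd/events_rdd are illustrative names):

    # Broadcast wants a local object, so the pair RDD is collected first.
    lookup = prod_rdd.collectAsMap()   # driver must hold the whole ~6-10 GB map
    bcast = sc.broadcast(lookup)       # shipped once per executor, not per task

    enriched = events_rdd.map(lambda kv: (kv[0], bcast.value.get(kv[0])))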

I'm wondering if you could perhaps use mapPartitions and read the map once per partition rather than broadcasting it?
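Something along these lines, where load_lookup() is a stand-in for however the map gets read (an untested sketch):

    # mapPartitions idea: build/read the lookup once per partition instead of
    # broadcasting it to every executor up front.
    def lookup_partition(rows):
        table = load_lookup()   # hypothetical loader: HDFS file, KV store, etc.
        for key, value in rows:
            yield key, (value, table.get(key))

    result = events_rdd.mapPartitions(lookup_partition)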

Related

Spark intersection implementation

How does Spark implement the intersection method? Does it require the two RDDs to be colocated on a single machine?
From here it says that it uses hash tables, which is a bit odd, as that's probably not scalable; sorting both RDDs and then comparing them item by item might have provided a more scalable solution.
Any thoughts on the subject are welcome.
It definitely doesn't need the RDDs to be colocated on a single machine. You can just look at the code for the details. It looks like it uses a cogroup.
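In PySpark terms the cogroup trick looks roughly like this (a sketch of the idea, not the actual Spark source):

    # Key both RDDs by their values, cogroup them (a shuffle, so no colocation
    # is required), and keep the keys that appear on both sides.
    def intersect_via_cogroup(rdd_a, rdd_b):
        paired_a = rdd_a.map(lambda v: (v, None))
        paired_b = rdd_b.map(lambda v: (v, None))
        return (paired_a.cogroup(paired_b)
                .filter(lambda kv: len(kv[1][0]) > 0 and len(kv[1][1]) > 0)
                .keys())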

Cassandra Data Model for Apache access logs

In a POC, we are using Cassandra for storing (among other things) parsed Apache access logs, used together with Apache Spark + Zeppelin. We have managed to get things working, but we are very uncertain about how to model the data correctly.
Edit: Our queries will span months and years rather than weeks and days. In production, jobs will likely be executed perhaps daily (at least for now), and we will use a smaller dataset during development.
Since this will be used for analytics ONLY, the queries can be pretty much anything, but of course we could consider a handful of queries in advance, e.g.:
latency percentiles
geo distribution
sum of requests
popular REST resources
... etc.
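For instance, the latency-percentile case we would probably compute in Spark anyway, something like this (a sketch assuming the spark-cassandra-connector and the Zeppelin-provided spark session; table/column names are made up):

    # Compute latency percentiles in Spark rather than modeling them in Cassandra.
    logs = (spark.read.format("org.apache.spark.sql.cassandra")
            .options(keyspace="logs", table="access_logs")
            .load())

    # approxQuantile(column, probabilities, relative_error)
    p50, p95, p99 = logs.approxQuantile("latency_ms", [0.5, 0.95, 0.99], 0.01)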
Partition key + primary key: this is really difficult... The only thing I can think of is something like ((userid, [webresource]), timestamp).
At least this would give a fairly even distribution. Otherwise we would have to use a checksum or something, which feels wrong.
Or should I have different tables for the different types, like latency, geo, etc.? Or would this be a good case for materialized views?
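In CQL, what I have in mind would look something like this (sketched through the Python driver; the non-key columns are just for illustration):

    # Sketch of the proposed ((userid, webresource), timestamp) layout.
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("logs")
    session.execute("""
        CREATE TABLE IF NOT EXISTS access_logs (
            userid      text,
            webresource text,
            ts          timestamp,
            latency_ms  int,
            geo         text,
            PRIMARY KEY ((userid, webresource), ts)
        ) WITH CLUSTERING ORDER BY (ts DESC)
    """)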
I have googled for something like this without any luck, so perhaps Cassandra is a poor fit for this, but still, we would really like to see how far we can get.
Anyway, any input is highly appreciated!
Regards /Johan

Spark-Cassandra Vs Spark-Elasticsearch

I have been using Elasticsearch for quite some time now and have little experience with Cassandra.
Now I have a project where we want to use Spark to process the data, but I need to decide whether we should use Cassandra or Elasticsearch as the datastore to load my data from.
In terms of connectors, both Cassandra and Elasticsearch now have good connectors for loading the data, so that won't be the deciding factor.
The deciding factor will be how fast I can load my data into Spark. My data is almost 20 terabytes.
I know I could run some tests with JMeter and see the results myself, but I would like to ask anyone familiar with both systems.
Thanks
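For reference, the Spark read path looks nearly identical for the two stores, which is why the connectors themselves won't decide this (a sketch assuming the DataStax spark-cassandra-connector and elasticsearch-hadoop packages are on the classpath; keyspace/table/index names are made up):

    cass_df = (spark.read.format("org.apache.spark.sql.cassandra")
               .options(keyspace="ks", table="events")
               .load())

    es_df = (spark.read.format("org.elasticsearch.spark.sql")
             .option("es.nodes", "es-host:9200")
             .load("events-index"))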
The short answer is "it depends", mostly on cluster sizes =)
I wouldn't choose Elasticsearch as a primary source for the data, because it's built for searching. Searching is a very specific task and it requires a very specific approach, which in this case uses an inverted index to store the actual data. Each field basically goes into a separate index, and because of that the indexes are very compact. Although it's possible to store complete objects in an index, such an index will hardly get any benefit from compression. That requires much more disk space to store the indexes, and many more CPU cycles and disk seeks to process them.
Cassandra on the other hand is pretty good at storing and retrieving data.
Without more specific requirements, I'd say that Cassandra is good as primary storage (and covers pretty simple search scenarios) and ES is good at searching.
I will push back on Evgenii's answer that ES is only good at searching.
Yes, ES excels at text search, but that doesn't mean it can't handle data.
You can actually treat it like a "Mongo"-style document store and run "filter" queries against it for fast fetches. HOWEVER, the question now becomes: how fast do you need your reads/writes, and do you need any distribution? What ES lacks is distribution. Yes, ES can do sharding, but it has issues with multi-region distribution and with the reliability of data replication.
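A filter-context query skips scoring entirely, e.g. (a sketch with the official Python client, 8.x style; index and field names are made up):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")
    resp = es.search(index="events", query={
        "bool": {
            "filter": [                      # filter context: no scoring, cacheable
                {"term": {"status": 200}},
                {"range": {"ts": {"gte": "now-1d"}}},
            ]
        }
    })
    print(resp["hits"]["total"])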
If you need flexibility/reliability for your data, I would swing toward Cassandra. Also, since you are dealing with TBs, Cassandra might be the winner because it is built for extreme volume.
If you need an easier time running searches (not limited to text search; e.g., geospatial works too), then ES might be a better fit. (Note: for the sheer volume you are handling, you will need to shard in order to distribute your load.)

Atomic probabilistic counting and set membership in Cassandra

I am looking to do probabilistic counting and set membership using structures such as bloom filters and hyperloglog.
Is there any support for using such data structures and performing operations on them atomically on the server-side, through user-defined functions or similar? Or any way for me to add extensions with such functionality?
(I could ingest the data through another system and batch the updates to reduce the contention, but it would be far simpler if all this could be handled in the database server.)
You have to implement them client-side. A common approach is, every X minutes, to serialize and insert the HLL you keep in memory on your system, and then merge the stored HLLs on reads across the range of interest (maybe using an RRD-type approach for different periods beyond X minutes). This is not very durable, so depending on the use case it might require something more complex.
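A minimal client-side sketch of that pattern (using the datasketch library and pickle for serialization; the table layout is made up):

    import datetime
    import pickle

    from cassandra.cluster import Cluster
    from datasketch import HyperLogLog

    session = Cluster(["127.0.0.1"]).connect("metrics")
    hll = HyperLogLog()          # the in-memory HLL you keep updating

    def record(item: bytes):
        hll.update(item)

    def flush(metric: str):
        # Every X minutes: snapshot the HLL into a time bucket.
        bucket = datetime.datetime.utcnow().replace(second=0, microsecond=0)
        session.execute(
            "INSERT INTO hll_snapshots (metric, bucket, payload) VALUES (%s, %s, %s)",
            (metric, bucket, pickle.dumps(hll)))

    def count(metric: str) -> float:
        # On read: merge all stored snapshots for the range of interest.
        merged = HyperLogLog()
        for row in session.execute(
                "SELECT payload FROM hll_snapshots WHERE metric = %s", (metric,)):
            merged.merge(pickle.loads(row.payload))
        return merged.count()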
Although it seems a close fit for C*, I think one of the big issues is deletes, but you can probably work around them. There's a proof of concept for a C*-side implementation here:
http://vilkeliskis.com/blog/2013/12/28/hacking_cassandra.html
that you can likely get working "well enough". https://issues.apache.org/jira/browse/CASSANDRA-8861 may be something to watch.

Cassandra or MongoDB or something else for a big online sales site

Currently we are using MongoDB as the primary store for a big online sales site, and we are focusing on scalability across multiple machines.
The site backend is written in node.js, and we are using mongoose as the ODM.
I can see many blog posts praising Cassandra, and I am starting to think about switching to it. But I am still not sure whether this is really a good decision, because I haven't found any good ODM/ORM lib for Cassandra and node.js (and writing raw queries can be a pain; writing a good, well-tested ORM/ODM is a time-consuming task). So I am not sure how much benefit we would get from the switch. We are using Elasticsearch as the search engine, and it works excellently in combination with MongoDB; I am asking myself whether it will do as well with Cassandra.
If you have any experience with this, it would be very helpful.
Thank you!
Cassandra is a very nicely designed database which can handle a lot of scenarios. MongoDB is also a really good DB engine. So let me just compare a couple of the main bullet points for you.
Always on system
Cassandra is really great when you need to provide 24x7 operations in multiple data centers. If you have more than one datacenter with multiple servers in each of them, then Cassandra is great for you. Cassandra can sync writes to more than one datacenter and maintain the desired data consistency across complex setups. Recovery and re-sync are also quite easy.
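The consistency knob is per query, for instance (sketched with the Python driver for brevity; the table is made up):

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(["127.0.0.1"]).connect("shop")
    # LOCAL_QUORUM acks within the local datacenter, so a multi-DC write
    # doesn't wait on the remote DC.
    stmt = SimpleStatement(
        "INSERT INTO orders (id, total) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.LOCAL_QUORUM)
    session.execute(stmt, (42, 99.95))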
On the other hand, MongoDB is easy to operate. If you have one data center and only a couple of servers, it might be a perfect fit (although the global write lock might become a pain over time). In simple deployments it's easy to maintain and monitor.
Scalability
To continue the statements above: Cassandra is linearly scalable. There is literally no limit to how big the cluster can be. Your writes will always stay fast, while reads might become more complicated over time, depending on the structure of your data.
Denormalization of data
With Cassandra, your writes and reads can be extremely fast if you create a structure that reflects what you need to get out of your data. There is no query language (well, there is, but it's not exactly SQL) that lets you reorganize your result set with aggregates, groupings, etc. Yes, some things are doable and some are not; that is very specific to the Cassandra data model. You will have to implement a lot of things on your own and write the results back to the DB, e.g. counters for aggregation, different groupings, etc.
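Concretely, an aggregate you would get from a GROUP BY in SQL usually becomes a counter table you bump at write time (a sketch; table and column names are made up):

    import datetime

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("shop")
    session.execute("""
        CREATE TABLE IF NOT EXISTS sales_by_day (
            day  date,
            sku  text,
            sold counter,
            PRIMARY KEY (day, sku)
        )
    """)
    # Increment at write time; reads are then a cheap per-partition lookup.
    session.execute(
        "UPDATE sales_by_day SET sold = sold + 1 WHERE day = %s AND sku = %s",
        (datetime.date.today(), "sku-123"))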
In comparison, MongoDB is easy to use, easier to learn, and more flexible, both for development (as far as the learning curve/effort goes) and for implementing business logic (as far as time/effort is concerned). That is, kind of, the reason why there are ORM engines for MongoDB and only a couple of (very limited) ones for Cassandra.
To summarize: both DBs are really good... if you embrace their limitations. If you have only 100 GB of data and need a flexible, easy-to-implement DB engine, I would stick with MongoDB; alternatively, take a look at RethinkDB, which has a very similar model and a way better (in my personal opinion) clustering/data center replication implementation.
Cassandra is a great option for you if you will need to store TBs of data soon and deploy your apps across multiple data centers, while accepting the cost of the additional effort to implement the same features and maintain similar capabilities.
Don't take it personally that I used the word only while describing your data set. Yes, it's not big; my company stores more than 20 TB these days... so yeah, 100 GB is really not that much.
To stop everyone from pointing out that I should compare some other features or other differences between the two: this is just a rough, high-level overview of the things I consider relevant to the problem, not a full comparison or analysis. But feel free to point out what I have missed, and I will be happy to include new stuff in this answer...
