A new project came to my hands and looks interesting for my own.
I need to stored all the coming data from industrial PLCs (control the machinery inside a factory) and every event in the plc generated a output that need to be saved for after data analysis.
I was wondering what will be the perfect match for this type of data (time series) to make a hole architecture to manage data IO and at the moment only querying it for graphics (later will be applied machine learning analysis for predictive maintenance).
I don't know if I working in the correct direction and will be good to have some knowledge from experts in that subject.
IO producer (this a own made project and cant not be change)
IO events layer --> Is apache kafka a option for manage a big amount of signal coming for a lot of different computers (collected to plcs) and also manage the data saving to a nosql database. (it is suitable for that?any better option)
nosql database--> This point is more clear choosing Cassandra for time series storing.
queryng nosql data--> We are choosing spark for make fast queries and later on some data analysis.
The layer where I have more doubts is the layer involved in administrate the io data before storing and I have serious doubts that kafka is the correct option.
Thanks for reading and sorry for my bad English ;) Feel free to give your point of view.
we have a similar project based on sensor data. we have about 30 GB of data coming per day. We use kafka to stream the data and store it in hdfs. we have a set up of python ( numpy , pandas and pyspark) along with spark for any data crunching basically for prediction part.
As far as your doubts on kafka goes ... its capable enuf to handle large datasets. The other benfit would be that kafka can handle multiple sources and will be easier to scale.
As far as data storage is concerned i would recommend you to go with HDFS as it can be used in multiple ways to consume the data. You can leverage hive or hbase if required in future.
Related
I'm using intake to access the catalog catalog.ocean.GFDL_CM2_6.GFDL_CM2_6_control_ocean_surface.
At the moment I only work with small patches of that data, but accessing that data every single time is still quite costly (it's on Google Cloud Storage). So I want to use the persist option of intake to store that data locally. However as far as I've understood from the docs, it looks like one can only persist the whole dataset. For that specific dataset that would amount to almost 400 dollars if I take a cost of 0.1$ per GB, since the total data is 3976GB.
Hence my questions:
Is there a way (especially for a zarr file which in theory should make this quite easy) to persist only parts of the data (for instance only a subset of the variables)
This is probably more complicated, but can I push things further, by persisting regions of data I'm interested in (in terms of coordinates values for instance)?
There is no direct Intake way to do what you are asking for. Intake was conceived as a way to get your data into a format that you can then manipulate as you normally do, i.e., deal with only the loading part, so that a persisted data-set is the same as the original.
However, it is not hard to accomplish manually: you should grab the xarray, filter for the region you need, and call to_zarr to save the new dataset. You can then point a simple catalogue entry like the old one at the new location.
You could have done this manipulation in a driver directly if this was a specific pattern that would repeat a lot. In fact, we have mooted the idea of whether/how to implement such processing steps in Intake, but there is no plan yet. In the end, we may take the work on pipelines in Holoviews to describe processing steps.
I was watching one of the Cassandra videos on DataSax Academy. One concept they talk a lot about is query driven modelling. This makes sense when you know your queries upfront like in the KillrVideo example.
However, in big data cases, I hope I am not the only one to think that we barely know what kind of queries analysts will perform on the data 5 months or one year down the road.
If this is the case, what are the best practices for storing your data? My guess is that for advanced querying of such data, you likely will end up loading your data into Spark. But what do I have to consider at storage time to avoid operational troubles and troubles at retrieval time? What retrieval approaches are less problematic?
Cassandra is also a database for analytics use cases, but not always for Ad-Hoc Analaytics (Only one report and this query will never perform again stuff).
For this use cases is a hadoop cluster a better option for your. (Maybe parquete on hadoop) If you see that queries will perform over and over again, Cassandra is your friend. Generally you can use Cassandra for 50 to 70% of your use cases. With column keys and secondary indizies you can perform really a wide spectrum of queries. Go to your Analytics Guys and ask them what they need. Then: Create your tables :)
Datastax has a course on doing analysis on Cassandra with Apache Spark.
I have been using Elasticsearch for quite sometime now and little experience using Cassandra.
Now, I have a project we want to use spark to process the data but I need to decide if we should use Cassandra or Elasticsearch as the datastore to load my data.
In terms of connector, both Cassandra and Elasticsearch now has a good connector to load the data so that won't be deciding factor.
The winning factor to decide will be how fast I can load my data inside Spark. My data is almost 20 terabytes.
I know I can run some test using JMeter and see the result myself but I would like to ask anyone familiar with both systems.
Thanks
The short exact answer is "it depends", mostly on cluster sizes =)
I wouldn't chose Elastisearch as a primary source for the data, because it's good at searching. Searching is a very specific task and it requires a very specific approach, which in this cases uses inverted index to store actual data. Each field basically goes into separate index and because of that the indexes are very compact. Although it's possible to store into index complete objects, such an index will hardly get any benefit of compression. That requires much more disk space to store indexes and much more cpu clocks, spinning disks to process them.
Cassandra on the other hand is pretty good at storing and retrieving data.
Without any more or less specific requirements, I'd say that Cassandra is good at being primary storage (and provides pretty simple search scenarios) and ES is good at searching.
I will refute Evgenii answer about how ES is only good at searching.
YES ES exceed at text search but it doesnt mean it can't do data.
You can actually treat it as if it was "Mongo" style Documentation and run "filter" queries on it to have fast fetch results. HOWEVER the question now becomes: how fast do you need your read/write and do you need any distributions? What ES lacks is distribution. Yes ES can do sharding but it has issues doing multi region distribution and reliability of replication of your data.
If you need the flexibility / reliability of your data I would swing to Cassanda. Also since you are dealing with TB - Cassandra might be a winner too because it is fitted for extreme volume.
If you need an easier time to to run searches (not limited to text search, eg: geo spacial you can do too) then ES might be a better fit. (note for the shear volume you are doing, you will need to shard in order to distribute your load).
Currently we are using mongodb as our primary store for big online sales site, and currently we are focusing ourselves on big scalability among multiple machines.
Site backend is written in node.js and we are using mongoose as ODM.
I can see many blog posts which are writing about awesome cassandra DB, and I am starting to think about switching to cassandra. But still I am not sure if this is a really good decision, because I didn't found any good ODM/ORM lib for cassandra and node.js (and writing raw queries can be pain. Also writing good tested ORM/ODM can be time consuming task). So I am not sure how much benefit will I have after this switch. We are using elasticsearch as search engine, and it works excellent in combination with mongodb, and I am asking my self will do also good with cassandra.
If you have any experiance with this, it will be very helpfull.
Thank you!
Cassandra is a very nicely designed database, which can fulfill a lot of scenarios. MongoDB is also a really good DB engine. So let me just compare couple of main bullet points for you.
Always on system
Cassandra is really great when you need to provide 24x7 operations in multiple data centers. If you got more then one datacenter with multiple servers in each of them then Cassandra is great for you. Cassandra can sync writes to more than one datacenter and maintain desired data consistency across complex set ups. Recovery and re-sync is also quite easy.
On the other note MongoDB is easy to operate. If you got one data center and only couple of servers it might be a perfect fit (although global write lock might be a pain over time). In simple deployments it's easy to maintain and monitor.
Scalability
To continue the above statements - Cassandra is linearly scalable. There is, literally, no limit of how big the cluster will be. Your writes will always stay fast, while reads might become more complicated over time - depending on the structure of your data.
Denormalization of data
With Cassandra your writes and reads can be extremely fast if you will create a structure that will reflect what you need to get from your data. There is no query language (well, there is, but it's not exactly SQL) that you can use to reorganize your result set using aggregates, groupings, etc. Yes, some things are doable and some not - that is very specific to Cassandra data model. You will have to implement a lot of things on your own and write the result to the DB - i.e. counters for aggregation, different groupings, etc.
In comparison MongoDB is easy to use, easier to learn and more flexible - both for development (as knowledge curve/efforts goes) and for implementation of business logic (as time/effort is considered). That is - kind of - a reason why there are ORM engines for MongoDB and only couple (very limited) for Cassandra.
To summarize - both DBs are really good... if you will embrace their limitations. If you got only 100GB of data and you need flexible, easy to implement DB engine I would stick to MongoDB, alternatively take a look RethinkDB which have a very similar model and way better (in my personal opinion) clustering/data center replication implementation.
Cassandra is a great option for you if you will need to store TBs of data soon, deploying your apps across multiple data centers while accepting the cost of additional efforts to implement the same features and maintaining similar capabilities.
Don't take it personally that I have used the word only while describing your data set. Yes, it's not big - my company stores more than 20 TB these days... so yeah, 100GB is really not that much...
To stop everyone from pointing that I should compare some other features or point out some other differences between those two - it's just a rough, high level overview on the things I consider relevant to the problem, not a full comparison or analysis of the problem. But feel free to point out what I have missed and I will be happy to include new stuff in this answer...
Now I have a project with ads exchange service (something like google double click) and I have to pick a high-scalable database. I'm thinking about mongodb or cassandra.
Cassandra:
fit with our write-intensive system. (+)
looks hard to do aggregate(very important for analytics) (is there a good way? Just read slide about Twitter rainbird, seem good) (?)
I dont prefer java much. (-)
MongoDB:
Seem easier to do analytics. (have build-in aggregate functions) (+)
more RAM-consuming? (because of document-oriented vs key-value Cassandra) (?)
write perfomance compare to Cassandra? (?)
javascript shell and natural fit with node.js(one important part in our project) (+)
http://pastebin.com/raw.php?i=FD3xe6Jt - This article make me cautious. (-)
Can you guys help me to pick the one or answer some of my questions above
Thanks.
I don't know about Cassandra, but MongoDB has some advantages for using it for analytics: high concurrency, sharding, storing everything about an event in a single document, features like upsert and $inc.
For more detailed explanations check the following resources:
MongoDB Analytics - videos
http://blog.mongodb.org/post/171353301/using-mongodb-for-real-time-analytics
http://www.mongodb.org/display/DOCS/Use+Cases
http://www.slideshare.net/jrosoff/scalable-event-analytics-with-mongodb-ruby-on-rails
http://nosql.mypopescu.com/post/3508305955/fast-asynchronous-analytics-with-mongodb
http://blog.opengovernment.org/2011/02/24/fast-asynchronous-analytics-with-mongodb/
http://blog.10gen.com/post/4416876632/london-startup-ubervu-on-storing-5tb-of-data-in-mongodb
It depends a lot on your domain, most cases one would probably choose Mongo.
For example http://square.github.com/cube/ is built on Mongo.
Cube is an open-source system for visualizing time series data, built on MongoDB, Node and D3. If you send Cube timestamped events (with optional structured data), you can easily build realtime visualizations of aggregate metrics for internal dashboards. For example, you might use Cube to monitor traffic to your website, counting the number of requests in 5-minute intervals:
Most use cases of Cassandra draw from the need oh high availability that's the main feature of it afaik. Your needs seem to be centered around having a cheap way to shove queryable data in a scale-out DB, and mongo almost matches RDBMS in regards to querying. Mongo is also probably easier to deal with.
I think cassandra is a good fit for this problem.
You don't need to know much java to get it running (other than install java), as long as there is a client library in your chosen language.
Cassandra 0.8+ now has atomic counter support - perfect for impressions/click tracking.
You could also run hadoop on top of cassandra, giving you a proven platform for writing map reduce jobs to do analytics/aggregations and store the results back to Cassandra too.
Check out this slideshow about cassandra and hadoop: http://www.slideshare.net/jeromatron/cassandrahadoop-4399672
I hope that helps.