I want to use a Cassandra trigger to import my data into Elasticsearch for searching.
For the sake of data consistency, I would like the two writes to execute atomically.
So I want to know where the trigger sits in the execution sequence: does it run atomically together with the commit log write, the memtable update, and the index update, or is the trigger completely asynchronous?
Triggers are run before anything you listed above. The intent is to capture a mutation before it is persisted in the database, so that the data can potentially be enhanced as it is received. What you have outlined above could hit edge-case failures in which data is indexed in ES but not persisted to the database.
Have you looked at the DataStax search product? It has a much deeper integration with Cassandra that avoids these problems.
Recently I have been using the jdbc_streaming filter plugin for Logstash. It is a very helpful plugin that lets me connect to my database on the fly and perform checks against my events.
But are there any drawbacks or pitfalls to using this filter?
I have the following questions:
For example, I am firing a select query for each of my events.
Is it a good idea to query my database for every event? Suppose I am processing syslog events from a server that is continuously sending me data. In that case I will trigger a select query on my database for each event, so how will my database react in terms of load and response time?
What about the number of connections, and how are they managed?
How will this behave if I join multiple tables?
I hope I have been able to convey my question.
I just want to understand how exactly it works in the back end, and whether querying my database at a very high rate will degrade its performance.
I am not sure whether this answer is correct or not.
But in my experience, Logstash works sequentially with the above plugin.
It creates only a single connection to RDS and queries the DB for each record.
So there is no connection overhead, but running one query per record degrades performance many times over.
This answer is just from my experience, it might be possible that this can be a completely wrong answer. Any edits or answers are welcome.
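One common way to soften the cost of a query per event is to memoize lookup results for keys that repeat. The following is only a minimal sketch of that idea in Java, not the plugin's implementation: `LookupCache`, its size bound, and the `loader` function standing in for the database query are all hypothetical names chosen for illustration. It may be worth checking whether the plugin's own caching settings already cover this before building anything similar.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical bounded LRU cache placed in front of a per-event lookup. */
final class LookupCache<K, V> {
    private final Map<K, V> entries;
    private final Function<K, V> loader; // stands in for the real database query
    private int loads = 0;

    LookupCache(final int maxSize, Function<K, V> loader) {
        this.loader = loader;
        // accessOrder = true makes iteration order least-recently-used first.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxSize;
            }
        };
    }

    /** Returns the cached value, calling the loader only on a miss. */
    synchronized V get(K key) {
        V value = entries.get(key);
        if (value == null) {
            value = loader.apply(key);
            loads++;
            entries.put(key, value);
        }
        return value;
    }

    /** Number of times the loader (the "database") was actually hit. */
    synchronized int loadCount() {
        return loads;
    }
}
```

With such a cache, a stream of events that keeps repeating the same few keys only reaches the database once per key until that key is evicted. The trade-off is staleness: cached rows do not reflect later updates in the database.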
Query 1: Event data from devices is stored in a Cassandra table. Obviously this is time series data. If we need to store older events (for example, events that were cached on the device because of some issue) at the current time, are we going to run into performance issues? If yes, what is the solution to avoid that?
Query 2: Is it good practice to write an event into the Cassandra table as soon as it comes in? Or should we queue events for some time and write several in one go, if that improves Cassandra write performance significantly?
Q1: This all depends on the table design. Usually this shouldn't be an issue, but it may depend on your access patterns and compaction strategy. If you have a table structure, please share it.
Q2: Individual writes shouldn't be a problem, but it really depends on your throughput requirements. If you write several data points that belong to the same partition key, you can potentially use unlogged batches, and in that case Cassandra will perform only one write for the several inserts in the batch. Please read this document.
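The grouping step behind that advice can be sketched as follows. This is a hedged illustration in plain Java with no driver involved: the `Event` class, the assumption that `deviceId` is the partition key, and the `events` table with its column names are all hypothetical. The batch text uses bind markers only, and each group would be sent as its own batch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PartitionBatcher {

    /** Hypothetical event; deviceId is assumed to be the partition key. */
    static final class Event {
        final String deviceId;
        final long timestampMillis;
        final String payload;

        Event(String deviceId, long timestampMillis, String payload) {
            this.deviceId = deviceId;
            this.timestampMillis = timestampMillis;
            this.payload = payload;
        }
    }

    /** Groups events by partition key, keeping arrival order inside each group. */
    static Map<String, List<Event>> groupByPartition(List<Event> events) {
        Map<String, List<Event>> groups = new LinkedHashMap<>();
        for (Event e : events) {
            groups.computeIfAbsent(e.deviceId, k -> new ArrayList<>()).add(e);
        }
        return groups;
    }

    /** Builds a CQL unlogged-batch template with one bind-marker INSERT per event. */
    static String unloggedBatchTemplate(int insertCount) {
        StringBuilder cql = new StringBuilder("BEGIN UNLOGGED BATCH\n");
        for (int i = 0; i < insertCount; i++) {
            // Table and column names are assumptions for this sketch.
            cql.append("  INSERT INTO events (device_id, ts, payload) VALUES (?, ?, ?);\n");
        }
        return cql.append("APPLY BATCH;").toString();
    }
}
```

Keeping each batch within a single partition is the point of the grouping: a batch that spans many partitions does not get the single-write benefit described above.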
I am trying to sync my Spark database on S3 with an older Oracle database via a daily ETL Spark job. I am trying to understand just what Spark does when it connects to an RDBMS such as Oracle to fetch data.
Does it only grab the data that exists at the time of Spark's request to the DB (i.e. if it fetches data from an Oracle DB at 2/2 17:00:00, will it only grab data UP to that point in time)? In other words, will any new data or updates at 2/2 17:00:01 be left out of the fetch?
Well, it depends. In general you have to assume that this behavior is non-deterministic, unless explicitly ensured by your application and database design.
By default Spark will fetch data every time you execute an action on the corresponding Spark dataset. This means that every execution might see a different state of your database.
This behavior can be affected by multiple factors:
Explicit caching and possible cache evictions.
Implicit caching with shuffle files.
The exact set of parameters you use with the JDBC data source.
In the first two cases Spark can reuse already fetched data without going back to the original data source. The third one is much more interesting. By default Spark fetches data using a single transaction, but there are methods which enable parallel reads based on column ranges or predicates. If one of these is used, Spark will fetch data using multiple transactions, and each one can observe a different state of your database.
If consistent point-in-time semantics is required you have basically two options:
Use immutable, append-only and timestamped records in your database and issue timestamp dependent queries from Spark.
Perform consistent database dumps and use these as a direct input to your Spark jobs.
While the first approach is much more powerful, it is much harder to implement if you're working with a pre-existing architecture.
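For the first option, a timestamp-dependent query might look like the sketch below. It is an illustration under stated assumptions: the table and column names passed in are hypothetical, the cutoff is rendered in UTC (the database session may use a different time zone), and the values are concatenated only because they are trusted constants in this example.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

final class SnapshotQuery {
    // Renders the cutoff as an ANSI timestamp literal body, in UTC (an assumption).
    private static final DateTimeFormatter SQL_TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(ZoneOffset.UTC);

    /**
     * Builds a parenthesized subquery that only sees rows stamped at or
     * before the cutoff. Table and column names are supplied by the caller.
     */
    static String boundedSubquery(String table, String timestampColumn, Instant cutoff) {
        return "(SELECT * FROM " + table
                + " WHERE " + timestampColumn
                + " <= TIMESTAMP '" + SQL_TS.format(cutoff) + "') t";
    }
}
```

A string of this shape can be passed as the `dbtable` option of Spark's JDBC source. The cutoff should be computed once per job and reused, so that every parallel read and every re-execution applies the same upper bound. This only yields a stable snapshot if the records really are append-only and timestamped, as the first option requires.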
I'm not a mongodb expert, so I'm a little unsure about server setup now.
I have a single instance running MongoDB 3.0.2 with WiredTiger, accepting both read and write ops. It collects logs from clients, so the write load is decent. Once a day I want to process these logs and calculate some metrics using the aggregation framework. The data set to process is something like all logs from the last month, and the whole calculation takes about 5-6 hours.
I'm thinking about splitting writes and reads to avoid locks on my collections (the server continues to write logs while I'm reading; newly written logs may match my queries, but I can skip them, because I don't need 100% accuracy).
In other words, I want a setup with a secondary for reads, where replication does not run continuously but starts at a configured time or, better, is triggered before the read operations start.
I'm doing all my processing from node.js, so one option I see is to export the data created in some period like [yesterday, today], import it into the read instance myself, and run the calculations after the import is done. I was looking at replica sets and master/slave replication as possible setups, but I couldn't work out how to configure them to achieve the described scenario.
So maybe I'm wrong and missing something here? Are there any other options to achieve this?
Your idea of using a replica-set is flawed for several reasons.
First, a replica-set always replicates the whole mongod instance. You can't enable it for individual collections, and certainly not only for specific documents of a collection.
Second, deactivating replication and enabling it before you start your report generation is not a good idea either. When you enable replication, the new slave will not be immediately up-to-date. It will take a while until it has processed the changes since its last contact with the master. There is no way to tell in advance how long this will take (you can check how far a secondary is behind the primary using rs.status() and comparing the secondary's optimeDate with its lastHeartbeat date).
But when you want to perform data-mining on a subset of your documents selected by timespan, there is another solution.
Transfer the documents you want to analyze to a new collection. You can do this with an aggregation pipeline consisting only of a $match which matches the documents from the last month followed by an $out. The out-operator specifies that the results of the aggregation are not sent to the application/shell, but instead written to a new collection (which is automatically emptied before this happens). You can then perform your reporting on the new collection without locking the actual one. It also has the advantage that you are now operating on a much smaller collection, so queries will be faster, especially those which can't use indexes. Also, your data won't change between your aggregations, so your reports won't have any inconsistencies between them due to data changing between them.
When you are certain that you will need a second server for report generation, you can still use replication and perform the aggregation on the secondary. However, I would really recommend you to build a proper replica-set (consisting of primary, secondary and an arbiter) and leave replication active at all times. Not only will that make sure that your data isn't outdated when you generate your reports, it also gives you the important benefit of automatic failover should your primary go down for some reason.
I am currently working on an application using Neo4j as an embedded database.
I am wondering how to make sure that separate threads use separate transactions. Normally, I would assign database operations to a transaction explicitly, but the code examples I found don't show how to make sure that write operations use separate transactions:
try (Transaction tx = graphDb.beginTx()) {
    Node node = graphDb.createNode();
    tx.success();
}
As graphDb is meant to be used as a thread-safe singleton, I really don't see how that is supposed to work... (E.g. for several users creating a shopping list in separate transactions.)
I would be grateful for pointing out where I misunderstand the concept of transactions in Neo4j.
Best regards and many thanks in advance,
Oliver
The code you posted will run in separate transactions if executed by multiple threads, one transaction per thread.
The way this is achieved (and it's quite a common pattern) is storing transaction state against ThreadLocal (read the Javadoc and things will become clear).
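The ThreadLocal pattern can be sketched in plain Java as follows. This is not Neo4j's implementation, only a minimal illustration of how a shared singleton can hand each thread its own transaction; `ThreadBoundTransactions`, `Tx`, and `demo` are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical shared "database" object that binds one transaction to each thread. */
final class ThreadBoundTransactions {

    static final class Tx implements AutoCloseable {
        final long id;
        private final ThreadBoundTransactions owner;

        Tx(long id, ThreadBoundTransactions owner) {
            this.id = id;
            this.owner = owner;
        }

        @Override
        public void close() {
            // Unbind from the current thread when the try block ends.
            owner.current.remove();
        }
    }

    private final AtomicLong nextId = new AtomicLong(1);
    private final ThreadLocal<Tx> current = new ThreadLocal<>();

    /** Starts a transaction visible only to the calling thread. */
    Tx beginTx() {
        if (current.get() != null) {
            throw new IllegalStateException("transaction already open on this thread");
        }
        Tx tx = new Tx(nextId.getAndIncrement(), this);
        current.set(tx);
        return tx;
    }

    /** Returns the calling thread's open transaction, or null if there is none. */
    Tx currentTx() {
        return current.get();
    }

    /** Runs the same code on two threads and returns the transaction id each one saw. */
    static long[] demo() {
        ThreadBoundTransactions db = new ThreadBoundTransactions();
        long[] ids = new long[2];
        Thread[] threads = new Thread[2];
        for (int i = 0; i < threads.length; i++) {
            final int slot = i;
            threads[i] = new Thread(() -> {
                try (Tx tx = db.beginTx()) {
                    ids[slot] = db.currentTx().id;
                }
            });
            threads[i].start();
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        return ids;
    }
}
```

Both threads call `beginTx()` on the same `db` object, yet each sees a different transaction, which is the behavior the answer describes for the shared `graphDb` instance.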
Neo4j Transaction Management
In order to fully maintain data integrity and ensure good transactional behavior, Neo4j supports the ACID properties:
atomicity: If any part of a transaction fails, the database state is left unchanged.
consistency: Any transaction will leave the database in a consistent state.
isolation: During a transaction, modified data cannot be accessed by other operations.
durability: The DBMS can always recover the results of a committed transaction.
Specifically:
- All database operations that access the graph, indexes, or the schema must be performed in a transaction.
Here are some useful links for understanding Neo4j transactions:
http://neo4j.com/docs/stable/rest-api-transactional.html
http://neo4j.com/docs/stable/query-transactions.html
http://comments.gmane.org/gmane.comp.db.neo4j.user/20442