I'm currently attempting to write a new snitch that makes minor tweaks and improvements to the SimpleSnitch that ships with Apache Cassandra out of the box.
My current objective is, instead of using the pre-made dynamic snitch that Cassandra provides as a wrapper around a subsnitch, to create my own snitch that can tell Cassandra which nodes to send requests to. Whereas most snitches simply inform Cassandra about the topology of the network, I want to tell Cassandra which node to send a request to, and where that node resides.
My main trouble is figuring out how to get my snitch to interact with Cassandra in a way that goes beyond serving topology data. For example, my public String getDatacenter(InetAddress endpoint) method is actively being called, but it is the only method in my program that Cassandra calls. I would love to be able to write some method, public String getBestNode(), that returns the IP of the chosen node when Cassandra requests it. However, I can't seem to find any information online about which methods Cassandra calls that I could override or write myself.
If anyone has a good Snitch writing resource they could link me, that would be greatly appreciated. Otherwise, I would be thankful for any advice anyone has for me.
Has anyone had any experience with database partitioning? We already have a lot of data, and queries against it are starting to slow down. Maybe someone has some examples? These are tables related to orders.
Shopware, since version 6.4.12.0, allows the use of database clusters, see the relevant documentation. You will have to set up a number of read-only nodes first. The load of reading data will then be distributed among the read-only nodes, while write operations are restricted to the primary node.
Note that in a cluster setup you should also use a lock storage that complements the setup.
Besides using a DB cluster, you can also try to reduce the load on the DB server.
The first thing you should do is enable the HTTP cache; better still, additionally set up a reverse proxy cache like Varnish. This will greatly decrease the number of requests that hit your web server and thus your DB server as well.
Besides that, all the measures explained here should improve the overall performance of your shop as well as decrease the load on the DB.
Additionally, you could use Elasticsearch, so that costly search requests won't hit the database; use a "real" message queue, so that the messages are not stored in the database; and use Redis instead of the database for storing performance-critical information, as documented in the articles in this category of the official docs.
The impact of all those measures probably depends on your concrete project setup. Maybe you will see something in the DB locks that hints at one of the points I mentioned previously; that would be an indicator to start in that direction. E.g. if you see a lot of search-related queries, Elasticsearch would be a great start, but if a lot of DB load comes from writing/reading/deleting messages, then the message queue might be the better starting point.
All in all, when you use a DB cluster with a primary and multiple replicas, and use the additional services I mentioned here, your shop should be able to scale quite well without the need for partitioning the actual DB.
I have code in PySpark that parallelizes heavy computation. It takes two files and several parameters, performs the computation, and generates information that for the moment is stored in a CSV file (ideally, it would eventually be stored in a Postgres database instead).
Now, I want to consume this functionality as a service from a system built in Django, from which users will set the parameters of the Spark service, select the two files, and then query the results of the process.
I can think of several ways to cover this, but I don't know which one is the most convenient in terms of simplicity of implementation:
Use Spark's REST API: this would allow a request to be made from Django to the Spark cluster. The problem is that I found no official documentation; I get everything through blogs whose parameters and experiences correspond to one particular situation or solution. None of them mention, for example, how I could send files to Spark through the API, or how to get the result back.
Develop a simple API on Spark's master node that receives all the parameters and executes the spark-submit command at the OS level. The awkwardness of this solution is that everything must be handled at the OS level: the parameter files and the final process result must be saved to disk and be accessible to the Django server, which later reads the result to save its information in the DB.
Integrate the Django app on the master server, writing the PySpark code inside it; Django connects to the Spark master and runs the code that manipulates the RDDs. This scheme does not convince me because it sacrifices the independence between Spark and the Django application, which is already huge.
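For what it's worth, the second option can be sketched in a few lines of Python. This is only an illustration, not a recommendation: the master URL, the file paths, the job script name (process_job.py), and the parameter names are all hypothetical.

```python
import subprocess

def build_submit_command(master_url, job_script, input_a, input_b, params):
    """Assemble the spark-submit argv for a job that takes two input files
    plus free-form keyword parameters (all names here are hypothetical)."""
    cmd = [
        "spark-submit",
        "--master", master_url,
        job_script,
        input_a,
        input_b,
    ]
    for key, value in params.items():
        cmd += ["--" + key, str(value)]
    return cmd

def run_job(cmd):
    # Blocks until the job finishes; a real service would run this
    # asynchronously (e.g. from a task queue) and poll for the result.
    return subprocess.run(cmd, capture_output=True, text=True)

cmd = build_submit_command(
    "spark://master:7077", "process_job.py",
    "/data/in/a.csv", "/data/in/b.csv", {"threshold": 0.8})
```

The point of separating command construction from execution is that the small API receiving the request stays trivial, while the file transfer and result pickup still have to be solved at the filesystem level, as noted above.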
If someone could enlighten me about this, I would appreciate it; maybe due to lack of experience I am overlooking a cleaner, more robust, or more idiomatic solution.
Thanks in advance
Let me describe the architecture of my system before diving into the heart of the problem.
I have a stream of data coming from Kafka, and my company uses a distributed cache (Hazelcast, precisely) that makes the data available to be requested through web services we expose. We also want to persist the data in the cache to Cassandra so it is durable. I have two solutions for how to put the data into Hazelcast, and I would like your suggestions (maybe there is another way of doing it): which is the best solution in your view, and why?
1/ Use a Kafka-Hazelcast connector to send data directly from Kafka to Hazelcast, and then persist the data to Cassandra using write-behind and MapStores ==> there are two main drawbacks with this solution: first, we have to serialize/deserialize each time we store data to Cassandra (significant CPU usage), and second, we put all the data into the cache even when users don't need it (we have lots of evictions happening).
2/ Use a Kafka-Cassandra connector to write data directly to Cassandra, and then find a means (how complex do you think this part could be?) to notify Hazelcast to update/evict the data if it's already in the cache ==> the pros of this solution are that we get rid of the serialization/deserialization needed by the MapStores, and we only keep data that was queried before, i.e. whose key is already in the cache.
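A rough sketch of the notification side of option 2, assuming some consumer reads change events (from the same Kafka topic, or a Cassandra CDC stream) and touches Hazelcast only for keys that are already cached. The InMemoryMap class below is a stand-in for illustration; the real Hazelcast client map exposes a similar contains_key/evict surface, but the Kafka wiring is deliberately omitted.

```python
def evict_if_cached(imap, key):
    """Invalidate a cache entry only if it is already present, so the
    notifier never causes uncached data to enter Hazelcast."""
    if imap.contains_key(key):
        imap.evict(key)
        return True
    return False

class InMemoryMap:
    """Minimal stand-in for a distributed map, for illustration only."""
    def __init__(self, data=None):
        self._data = dict(data or {})
    def contains_key(self, key):
        return key in self._data
    def evict(self, key):
        self._data.pop(key, None)

cache = InMemoryMap({"user:1": {"name": "alice"}})
```

In production, the consumer loop would call evict_if_cached for each affected key; the next web-service read for that key then falls through to Cassandra and repopulates the cache.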
Which one of the two solutions do you prefer and why ?
What's the best means to notify Hazelcast in the second solution, in your point of view?
Thank you in advance for your suggestions/answers
I hope I was concise and clear!
We plan to use Cassandra 3.x and we want to allow our customers to connect to Cassandra directly for exporting the data into their data warehouses.
They will connect via ODBC from remote.
Is there any way to prevent customers from executing huge or bad SELECT statements that would result in high load on all nodes? We use an extra data center in our replication strategy where only customers can connect, so the live system will not be affected. But we also want to set up some workers that will run on this shadow system. The most important thing is that a connected remote client must not have any noticeable impact on other remote connections or on our local worker jobs. There is a materialized view already, and I want to force customers to get data based on the primary key only (i.e. disallow usage of ALLOW FILTERING). It would also be great if one could limit the number of rows returned (e.g. 1 million) to prevent a pull of all data.
Is there a best practice for this use case?
I know of BlackRock's video on multi-tenant strategy in C*, which advises using a tenant_id in the schema. That is what we're doing already, but how can I ensure security/isolation for ODBC-connected tenants/customers? Or do I have to write an API of my own that handles security?
I would recommend exposing access via an API, not via ODBC; at least you would have greater control over what is executed, and could enforce tenant_id and other checks, like limits, etc. You can try to utilize Cassandra's CQL parser to decompose the query and put all the required pieces back.
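To illustrate the kind of checks such an API could enforce, here is a naive string-level sketch. The rules (SELECT only, no ALLOW FILTERING, mandatory tenant_id filter, forced row limit) mirror the requirements above, but a production version should use a proper CQL parser rather than regexes; the 1-million-row cap is the example figure from the question.

```python
import re

MAX_ROWS = 1_000_000  # example cap from the question

def guard_select(cql):
    """Reject disallowed constructs and force a row limit on a SELECT.
    Naive string checks for illustration only; real code should parse
    the statement properly (e.g. with Cassandra's CQL grammar)."""
    stmt = cql.strip().rstrip(";")
    lowered = stmt.lower()
    if not lowered.startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    if "allow filtering" in lowered:
        raise ValueError("ALLOW FILTERING is not permitted")
    if not re.search(r"\btenant_id\s*=", stmt, re.IGNORECASE):
        # Hypothetical policy: every query must be scoped to one tenant.
        raise ValueError("query must filter on tenant_id")
    if not re.search(r"\blimit\s+\d+\s*$", stmt, re.IGNORECASE):
        stmt += f" LIMIT {MAX_ROWS}"
    return stmt
```

The tenant_id value itself should of course come from the authenticated session, not from the customer's query text; the check here only shows where such enforcement would hook in.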
Theoretically, you could utilize Apache Calcite, for example. It has a JDBC driver implementation that could be used, plus there is an existing Cassandra adapter that you can modify to accomplish your task (mapping authentication onto tenant_ids, etc.), but this would be quite a lot of work.
I am thinking of building an application that uses Cassandra as its data store but has low-latency requirements. I am aware of EmbeddedCassandraService from this blog post.
Is the following implementation possible and what are known pitfalls (defects, functional limitations)?
1) Run Cassandra as an embedded service, persisting data to disk (durable).
2) Java application interacts with the local embedded service via one of the following. What are the pros and cons of each?
TMemoryBuffer (or something more appropriate?)
StorageProxy (what are the pitfalls of using this API?)
Apache Avro? (see question #5 below)
3) Java application interacts with remote Cassandra service ("backup" nodes) via Thrift (or Avro?).
4) Write must always succeed to the local embedded Cassandra service in order to be successful, and at least one of the remote (non-embedded) Cassandra nodes. Is this possible? Is it possible to define a custom / complex consistency level?
5) Side question: Cassandra: The Definitive Guide mentions in several places that Thrift will ultimately be replaced with Avro, but it seems like that's not the case just yet?
As you might guess, I am new to Cassandra, so any direction to specific documentation pages (not the wiki homepage) or sample projects are appreciated.
Unless your entire database is sitting on the local machine (i.e. a single node), you gain nothing by this configuration. Cassandra will shard your data across the cluster, so (as mentioned in one of the comments) your writes will frequently be made to another node that owns the data. Presuming you write with a consistency level of at least one, your call will block until that other node acks the write. This negates any benefit of talking to the embedded instance since you have some network latency anyway.