I am running a Spark job where some data is loaded from a Cassandra table. From that data, I build some insert and delete statements and execute them (using forEach).
boolean deleteStatus = connector.openSession().execute(delete).wasApplied();
boolean insertStatus = connector.openSession().execute(insert).wasApplied();
System.out.println(delete+":"+deleteStatus);
System.out.println(insert+":"+insertStatus);
When I run it locally, I see the respective results in the table.
However, when I run it on a cluster, sometimes the result appears and sometimes the changes don't take place.
I looked at the stdout in the Spark web UI, and the query along with true was printed for both queries. (The data was loaded correctly, but sometimes only the insert is reflected, sometimes only the delete, sometimes both, and most of the time neither.)
Specifications:
Spark slaves on the same machines as the Cassandra nodes (each node has two slave instances).
Spark master on a separate machine.
Repair done on all nodes.
Cassandra restarted.
boolean deleteStatus = connector.openSession().execute(delete).wasApplied();
boolean insertStatus = connector.openSession().execute(insert).wasApplied();
This is a known anti-pattern: you create a new Session object for each query, which is extremely expensive.
Just create the session once and re-use it for all the queries.
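As a minimal sketch of the reuse pattern (assuming connector is the same object as in your code, and a hypothetical statements collection holding your insert/delete statements):

import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;

// Create the Session once and reuse it for every statement; close it only at the end.
Session session = connector.openSession();
try {
    for (Statement stmt : statements) {   // 'statements' = your insert/delete statements
        boolean applied = session.execute(stmt).wasApplied();
        System.out.println(stmt + ":" + applied);
    }
} finally {
    session.close();
}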
To see which queries are being executed and sent to Cassandra, use the slow query logger feature as a hack: http://datastax.github.io/java-driver/manual/logging/#logging-query-latencies
The idea is to set the threshold to a ridiculously low value so that every query will be considered slow and displayed in the log.
You should use this hack only for testing, of course.
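For example, with the 3.x Java driver something like this should work (the 1 ms threshold is just an arbitrarily low value so that everything qualifies as "slow"):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.QueryLogger;

Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
// Treat every query taking longer than 1 ms as "slow", so effectively all queries get logged.
QueryLogger queryLogger = QueryLogger.builder()
        .withConstantThreshold(1)
        .build();
cluster.register(queryLogger);
// Also set the logger com.datastax.driver.core.QueryLogger.SLOW to DEBUG in your logging configuration.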
Related
I'm running OS Cassandra 3.11.9 with DataStax Java Driver 3.8.0. I have a Cassandra keyspace that has multiple tables functioning as lookup tables / search indices. Whenever I receive a new POST request to my endpoint, I parse the object and insert it into the corresponding Cassandra table. I also issue inserts to each corresponding lookup table (10-20 per object).
When ingesting a lot of data into the system, I've been running into WriteTimeoutExceptions in the driver.
I tried to serialize the insert requests into the lookup tables by introducing Apache Camel and putting all the Statements into a queue that the Session could work off of, but it did not help.
With Camel, since the exceptions now happen in the Camel thread, the test continues to run instead of failing on the first exception. Eventually, the test seems to crash Cassandra (nothing in the Cassandra logs, though).
I also tried to turn off my lookup tables and instead insert into the main table 15x per object (to simulate a similar number of writes as if I had the lookup tables on). This test passed with no exception, which makes me think the large number of tables is the problem.
Is a large number (2k+) of Cassandra tables a code smell? Should we rearchitect, or just throw more resources at it? Nothing indicative has shown up in the logs, mostly just some status about the number of tables, etc. (no exceptions).
Can the DataStax Java Driver be used multithreaded like this? It says it is thread-safe.
The high number of tables has a direct effect on performance - see this doc (the whole series is a good source of information) and this blog post for more details. Basically, with ~1000 tables you get ~20-25% performance degradation.
That could be a reason - not completely direct, but related. For each table, Cassandra needs to allocate memory, reserve a part of the memtable for it, keep information about it, etc. This specific problem could come from blocked memtable flushes or something similar. Check nodetool tpstats and nodetool tablestats for blocked or pending memtable flushes. It's better to set up a continuous monitoring solution, such as the Metrics Collector for Apache Cassandra, and watch the important metrics (which include that information) over a period of time.
I am trying to run a Hive query with pyspark. I am using Hortonworks so I need to use the Hive WarehouseConnector.
Running one or even multiple queries is easy and works. My problem is that I want to issue set commands beforehand, for instance to set the DAG name in the Tez UI (set hive.query.name=something relevant) or to set up some memory configuration (set hive.tez.container.size = 8192). For these statements to take effect, they need to run in the same session as the main query, and that's my issue.
I tried 2 ways:
The first one was to generate a new Hive session for each query, with a properly set-up URL, e.g.:
url='jdbc:hive2://hiveserver:10000/default?hive.query.name=relevant'
builder = HiveWarehouseSession.session(self.spark)
builder.hs2url(url)
hive = builder.build()
hive.execute("select * from whatever")
It works well for the first query, but the same URL is reused for the next one (even if I try to manually delete builder and hive), so this does not work.
The second way is to set spark.sql.hive.thriftServer.singleSession=true globally in the Spark Thrift Server. This does seem to work, but I find it a shame to limit the global Spark Thrift Server for the benefit of one application only.
Is there a way to achieve what I am looking for? Maybe there could be a way to pin a query to one executor, so hopefully one session?
This has been a big peeve of mine... still is, actually.
The solution that resolved this issue for me was putting all the queries in a query file, where each query is separated by a semicolon. Then I run the file using beeline from within a Python script.
Unfortunately, it does not work for queries that return results... it's only suitable for set, overwrite, insert kinds of queries.
In case you might have discovered a more efficient way to do this, please do share.
I'm using the Cassandra Java driver with a fetch size set to 1k. I need to query all records in a table and perform some time-consuming action for every row.
What will happen if I keep the ResultSet open (not fully iterated) for one day?
What I don't care about:
Consistency. If a new record is written in the meantime, I'm OK with fetching it; however, I'm also fine if I don't get it.
Fault tolerance. If some node fails during that process, I'm fine with the query failing too; however, I would like to detect that from the client perspective.
What I care about:
Cassandra resource utilization - I don't want to cause a cluster outage due to blocked resources.
Latency - I don't want to block (or significantly slow down) the cluster for other consumers of that table.
I would like to get all records which existed when I started the query (assuming no deletions); however, they don't have to be up to date.
The paging state is the information about the last data read (literally the serialized partition key, clustering columns, and remaining count). When it is sent to the coordinator, the coordinator looks for everything greater than that, so no server-side resources are held for this and there is no performance impact versus a normal read.
Cassandra does not have any feature to provide isolation, even within a single query. If data has changed between when the first page was fetched and the second, you will get the up-to-date information.
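As a minimal sketch (3.x Java driver, hypothetical table name), the paging state can be captured while iterating and used later to resume from the same position:

import com.datastax.driver.core.PagingState;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

Statement stmt = new SimpleStatement("SELECT * FROM my_table").setFetchSize(1000);
ResultSet rs = session.execute(stmt);

String savedState = null;
for (Row row : rs) {
    // time-consuming processing of the row goes here
    // persist the current paging state so the scan can be resumed if the client dies
    PagingState ps = rs.getExecutionInfo().getPagingState();
    if (ps != null) {
        savedState = ps.toString();
    }
}

// Resuming later: the coordinator simply reads everything "greater than" the saved position.
Statement resumed = new SimpleStatement("SELECT * FROM my_table")
        .setFetchSize(1000)
        .setPagingState(PagingState.fromString(savedState));
session.execute(resumed);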
I'm banging my head on this, but, frankly speaking, my brains won't get it - or so it seems.
I have a column family that holds jobs for a rather large group of actors. It is a central job management and scheduling table that must be distributed and available throughout the whole cluster and possibly even traverses datacenter barriers some day in the near future.
Each job executor actor system - the ones that actually execute the jobs - is installed alongside one Cassandra node, that is, on the same machine. Actually, of course, there is a master actor that pulls the jobs and distributes them to the actor agents, but that has nothing to do with my question.
There are also some actor systems that can create jobs in the central job table to be executed by other actors or even actor systems, but usually the jobs are loaded batch wise or manually through a web interface.
An actor that is to execute a job only ever queries its local Cassandra node. When finished, it will update the job table to indicate it is finished. This write should, under normal circumstances, also only update records of jobs for which its local Cassandra node is authoritative.
Now, sometimes it may happen that an actor system on a given host has nothing to do. In this case it should indeed get jobs from other nodes too, but of course it will still only talk to its local Cassandra node. I know this works and it doesn't bother me a bit.
What keeps me up at night is this:
How would I create a compound key that makes a Cassandra node authoritative for the job entries of its local actor system, and thereby its job execution actors, without splitting the job table into multiple column families or the like?
In other words: how can I create a compound key that makes sure that a) jobs are evenly distributed through my cluster and
b) a local query on the job table only returns jobs for which this Cassandra node is authoritative and
c) my distributed agent system still has the possibility to fetch jobs from other nodes, in case it has no jobs of its own to execute?
A last word on c) above: I do not want to do 2 queries in the case that there is no local job, but still only one!
Any hints on this?
This is the general structure of the job table so far:
ClusterKey UUID: Primary Key
JobScope String: HOST / GLOBAL / SERVICE / CHANNEL
JobIdentifier String: Web-Crawler, Twitter
Description String:
URL String:
JobType String: FETCH / CLEAN / PARSE /
Job String: Definition of the job
AdditionalData Collection:
JobStatus String: NEW / WORKING / FINISHED
User String:
ValidFrom Timestamp:
ValidUntill Collection:
I'm still in the process of setting everything up, so no queries are defined so far. But an actor will pull jobs out of it, set the status, and so on.
Cassandra has no way of "pinning" a key to a node, if that's what you are after.
If I were you, I'd stop worrying about whether my local node was authoritative for some set of data, and start leveraging the built-in consistency controls in Cassandra for managing the set of nodes that you read from or write to.
Lots of information here on read consistency and write consistency - using the right consistency will ensure that your application scales well while keeping it logically correct: http://www.datastax.com/documentation/cassandra/2.0/cassandra/dml/dml_config_consistency_c.html
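A rough sketch of what that looks like with the Java driver (table and column names borrowed from your structure above; the specific consistency levels are just an example, not a recommendation):

import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

// Read jobs with a cheap, node-local consistency level...
Statement read = new SimpleStatement("SELECT * FROM jobs WHERE clusterkey = ?", jobId)
        .setConsistencyLevel(ConsistencyLevel.LOCAL_ONE);

// ...and write status updates with a stronger level so replicas agree.
Statement write = new SimpleStatement(
        "UPDATE jobs SET jobstatus = 'FINISHED' WHERE clusterkey = ?", jobId)
        .setConsistencyLevel(ConsistencyLevel.QUORUM);

session.execute(read);
session.execute(write);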
Another item worth mentioning is atomic "compare and swap", also known as lightweight transactions. Let's say you want to ensure that a given job is only performed once. You could add a field indicating whether the job has been "picked up", then query on that field (where picked_up = 0) and simultaneously (and atomically) update the field to indicate that you are "picking up" that work. That way no other actors will pick it up again.
Info on lightweight transactions here: http://www.datastax.com/documentation/cassandra/2.0/cassandra/dml/dml_ltwt_transaction_c.html
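For instance, reusing the JobStatus column from your table sketch (untested, just to show the shape of the conditional update):

import com.datastax.driver.core.ResultSet;

// Lightweight transaction: only one actor can move the job from NEW to WORKING.
ResultSet rs = session.execute(
        "UPDATE jobs SET jobstatus = 'WORKING' WHERE clusterkey = ? IF jobstatus = 'NEW'",
        jobId);

if (rs.wasApplied()) {
    // this actor owns the job now
} else {
    // another actor already picked it up - skip it
}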
I noticed that if I have a Java method with a PreparedStatement using the JDBC driver that comes with Cassandra, it is always slow. But if I put the same query twice in the method, the second time it is 20x faster. Why is that? I would think the second, third, fourth time I call the Java method it would be faster than the first. I am using Cassandra 1.2.5. I have also cached 100MB of rows in the row cache and set the table to caching = "all". In cassandra-cli I verified the settings, and in cassandra-cli I verified that the second, third, fourth time I get the rows from the same table I run the JDBC calls against, I get faster response times.
Any Ideas?
Thanks,
-Tony
From the all-knowing CQL3 documentation (always a great starting point, btw):
Prepared statement is an optimization that allows to parse a query only once but execute it multiple times with different concrete values.
The statement gets cached; this is the difference-maker you are experiencing. Prepared statements also get pre-compiled, typically meaning an execution plan is prepared before the query is run against the database. Knowing in advance what needs to be done makes the process faster.
On the first run your prepared statement is cached in case you run the same query again - which you do - and since it's cached, the query executes much faster.
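To illustrate the prepare-once pattern with the DataStax Java driver (rather than the JDBC wrapper; the table and column names here are made up):

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Row;

// Parsed and cached once - this is the "slow" first step you are seeing.
PreparedStatement ps = session.prepare("SELECT * FROM my_table WHERE id = ?");

// Every subsequent execution only sends the bound values, so it is much faster.
for (String id : ids) {
    BoundStatement bound = ps.bind(id);
    Row row = session.execute(bound).one();
    // process row...
}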