I am using DataStax Enterprise 4.5 and trying to use Shark. I am able to open the Shark shell, but queries are not working. The error is:
shark> use company2;
OK
Time taken: 0.126 seconds
shark> select count(*) from nhanes;
java.lang.RuntimeException: Could not get input splits
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:158)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
at shark.SharkCliDriver.processCmd(SharkCliDriver.scala:347)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
at shark.SharkCliDriver$.main(SharkCliDriver.scala:240)
at shark.SharkCliDriver.main(SharkCliDriver.scala)
FAILED: Execution Error, return code -101 from shark.execution.SparkTask
Any idea about this error?
My second question is related to backup.
I am using OpsCenter for taking backups, but is it reliable in production, or should I use nodetool to take backups and schedule it on each individual node?
Thanks
Check the "Could not get input splits" error with the Hive-Cassandra-CqlStorageHandler. You can first test it using Hive; if it fails in Hive, check your keyspace partitioner. I would suggest creating a clean new keyspace and table to test with. Most likely something is wrong with your keyspace settings. Also check the keyspace's replication and make sure it is replicated to the datacenter where the Cassandra node runs.
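For instance, a quick way to rule out keyspace settings is to create a fresh keyspace whose replication explicitly names the datacenter. The keyspace, datacenter, and table names below are placeholders (check your actual datacenter name with nodetool status):

```sql
-- Placeholder names: substitute your own datacenter name and replication factor.
CREATE KEYSPACE test_ks
  WITH replication = {'class': 'NetworkTopologyStrategy', 'Analytics': 2};

USE test_ks;
CREATE TABLE t (id int PRIMARY KEY, val text);
```

If a query against this clean table works from Hive/Shark, the problem is with the original keyspace's settings rather than the cluster.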
For the second question, it is recommended to use OpsCenter for backups; it is fully tested and easy to use. You can also back up manually by running nodetool on each node, but doing so invites human error.
There is a well-known issue in Apache Spark: if you read a table while it is being updated at the same time, you can hit an error like this:
Caused by: java.io.FileNotFoundException: No such file or directory '<snipped for posting>.snappy.parquet'
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
Recommended solution is to call REFRESH TABLE tableName.
In my case the issue is that the table I want to read is updated by another process at random moments. The update can start right after my job starts, or it may already be in progress when my job starts.
What can I do to prevent the error? The update of the source table may be ongoing, so it doesn't make sense to call REFRESH TABLE tableName in a loop, and I cannot wait for the update to finish (I don't know how to detect it anyway). I just want to ignore it and read some consistent version of the table.
Is there a way to tell Spark that I don't care if it reads an outdated, but consistent, version of the table?
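One pragmatic pattern, if re-reading is tolerable, is to retry the read and refresh Spark's cached file listing between attempts. A minimal sketch with placeholder callables; the session name `spark` and table name `my_table` in the comments are assumptions, not from the original post:

```python
import time

def read_with_refresh(read_fn, refresh_fn, max_attempts=3, delay_s=1.0):
    """Retry a read, refreshing cached file listings between attempts."""
    last_err = None
    for _ in range(max_attempts):
        try:
            return read_fn()
        except Exception as err:  # with Spark this would be the FileNotFoundException-wrapping error
            last_err = err
            refresh_fn()          # e.g. lambda: spark.sql("REFRESH TABLE my_table")
            time.sleep(delay_s)
    raise last_err

# With a real session this might be wired up as (names are assumptions):
# read_with_refresh(lambda: spark.table("my_table").count(),
#                   lambda: spark.catalog.refreshTable("my_table"))
```

This does not give a consistent snapshot of a table that is mid-update; it only recovers from the stale-file-listing failure.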
We have Cassandra 3.0.10 installed on CentOS. The developers made some mistakes when preparing statements. As a result, the prepared statement cache is overflowing and we constantly get eviction messages. The error is shown below:
INFO [ScheduledTasks:1] 2017-12-07 10:38:28,216 QueryProcessor.java:134 - 7 prepared statements discarded in the last minute because cache limit reached (8178944 bytes)
We have corrected the prepared statements and would like to flush the prepared statement cache to start from scratch. We stopped and restarted the Cassandra instance, but the prepared statement count was not reset.
Cassandra 3.0.10 is installed on CentOS, and we are using svcadm disable/enable cassandra to stop/start Cassandra.
I noticed that in later versions of Cassandra, e.g. 3.11.1, there is a prepared_statements table under the system keyspace. Shutting down Cassandra, deleting the files ${CASSANDRA_HOME}/data/data/system/prepared_statements-*, and then restarting Cassandra actually resets the prepared statement cache.
Appreciate any help on this.
Thanks.
Update: 2018-06-01
We are currently using a work-around: to clear prepared statements associated with certain tables, we drop and then recreate an index on the table, which discards any prepared statements that depend on that index. For now, this is the most we can do. The problem is that this won't work for tables that have no index defined on them.
Still need a better way of doing this, e.g. some admin command to clear the cache.
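For reference, the index-based work-around looks like this. The keyspace, table, index, and column names are hypothetical:

```sql
-- Dropping and recreating an index discards prepared statements that
-- depend on it; tables without an index are unaffected by this trick.
DROP INDEX my_ks.my_table_col_idx;
CREATE INDEX my_table_col_idx ON my_ks.my_table (col);
```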
I want to test my cluster a little: how data replicates, etc.
I have a Cassandra cluster formed by 5 machines (CentOS 7 and Cassandra 3.4 on them).
Are there any ready-made tables for testing that I can import into my database in some keyspace?
If so, please be kind enough to explain how to import them into a keyspace and where to get them.
You can use cassandra-stress. It is great for creating data shaped like your own tables, and it also ships with some default tables.
http://docs.datastax.com/en/cassandra_win/3.0/cassandra/tools/toolsCStress.html
I highly recommend it.
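An example invocation; the node address and row count are placeholders for your own cluster:

```shell
# Write 100,000 rows of the default stress table to a node at 192.168.1.100,
# then read them back to exercise replication.
cassandra-stress write n=100000 -rate threads=50 -node 192.168.1.100
cassandra-stress read n=100000 -node 192.168.1.100
```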
Actually, there is a lot of data on the internet that can be used for testing, e.g.:
https://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public
http://bigdata-madesimple.com/70-websites-to-get-large-data-repositories-for-free/
Cassandra provides the cqlsh tool for executing CQL commands, such as COPY for importing CSV data into the database.
P.S. Pay attention to the fact that cqlsh has some restrictions related to timeouts. That is why it may be better to use a Cassandra connector to make the process more efficient.
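A minimal COPY sketch; the keyspace, table, column, and file names are placeholders:

```sql
-- cqlsh-only command; WITH HEADER = true skips the CSV's header row.
COPY my_ks.my_table (id, name, value) FROM 'data.csv' WITH HEADER = true;
```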
I'm encountering the same problem as in "Cassandra system.hints table is empty even when one of the nodes is down":
I am learning Cassandra from academy.datastax.com. I am trying the Replication and Consistency demo on local machine. RF = 3 and Consistency = 1.
When my Node3 is down and I update my table with an UPDATE command, the system.hints table is expected to store a hint for Node3, but it is always empty.
#amalober pointed out that this was due to a difference in the Cassandra version being used. From the DataStax Cassandra docs:
In Cassandra 3.0 and later, the hint is stored in a local hints directory on each node for improved replay.
This same question was asked 3 years ago, How to access the local data of a Cassandra node, but the accepted solution was to
...Hack something together using the Cassandra source that reads SSTables and have that feed the local client you're hoping to build. A great starting point would be looking at the source of org.apache.cassandra.tools.SSTableExport which is used in the sstable2json tool.
Is there an easier way to access the local hints directory of a Cassandra node?
Is there an easier way to access the local hints directory of a Cassandra node?
The hints directory is defined in the $CASSANDRA_HOME/conf/cassandra.yaml file (sometimes it is located under /etc/cassandra, depending on how you installed Cassandra). Look for the hints_directory property.
I guess you are using ccm, so the hint files should be in the $CASSANDRA_HOME/.ccm/yourcluster/yournode/hints directory.
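To locate the directory on a node, you can read it straight out of cassandra.yaml. A self-contained sketch using a stand-in file; point CONF at your real cassandra.yaml:

```shell
# Stand-in cassandra.yaml so the sketch is self-contained; on a real node
# CONF is often /etc/cassandra/cassandra.yaml.
CONF=/tmp/hints_demo/cassandra.yaml
mkdir -p /tmp/hints_demo
printf 'hints_directory: /var/lib/cassandra/hints\n' > "$CONF"

# Find where this node writes its hint files:
grep '^hints_directory' "$CONF"
```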
I haven't been able to reproduce your issue of not getting a hints file; every attempt I made produced the hints file as expected. There is now an easier way to view hints, though.
We added a hints dump to sstable-tools that you can use to view the mutations in the hinted-handoff files. In the future we may add the ability to use HH files like sstables in the shell (using the mutations to build a memtable and include it in queries), but for now it's pretty raw.
It's pretty simple (sans metadata setup) if you want to analyze the data yourself. You can see what we did here and adapt it to your needs: https://github.com/tolbertam/sstable-tools/blob/master/src/main/java/org/apache/cassandra/hints/HintsTool.java#L39
I have two different independent machines running Cassandra and I want to migrate the data from one machine to the other.
Thus, I first took a snapshot of my Cassandra Cluster on machine 1 according to the datastax documentation.
Then I moved the data to machine 2, where I'm trying to import it with sstableloader.
As a note: the keyspace (open_weather) and table name (raw_weather_data) on machine 2 have been created and are the same as on machine 1.
The command I'm using looks as follows:
bin/sstableloader -d localhost "path_to_snapshot"/open_weather/raw_weather_data
And then get the following error:
Established connection to initial hosts
Opening sstables and calculating sections to stream
For input string: "CompressionInfo.db"
java.lang.NumberFormatException: For input string: "CompressionInfo.db"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.parseInt(Integer.java:615)
at org.apache.cassandra.io.sstable.Descriptor.fromFilename(Descriptor.java:276)
at org.apache.cassandra.io.sstable.Descriptor.fromFilename(Descriptor.java:235)
at org.apache.cassandra.io.sstable.Component.fromFilename(Component.java:120)
at org.apache.cassandra.io.sstable.SSTable.tryComponentFromFilename(SSTable.java:160)
at org.apache.cassandra.io.sstable.SSTableLoader$1.accept(SSTableLoader.java:84)
at java.io.File.list(File.java:1161)
at org.apache.cassandra.io.sstable.SSTableLoader.openSSTables(SSTableLoader.java:78)
at org.apache.cassandra.io.sstable.SSTableLoader.stream(SSTableLoader.java:162)
at org.apache.cassandra.tools.BulkLoader.main(BulkLoader.java:106)
Unfortunately, I have no idea why.
I'm not sure if it is related to the issue, but the *.db files on machine 1 are named rather "strangely" compared to the *.db files I already have on machine 2.
*.db files from machine 1:
la-53-big-CompressionInfo.db
la-53-big-Data.db
...
la-54-big-CompressionInfo.db
...
*.db files from machine 2:
open_weather-raw_weather_data-ka-5-CompressionInfo.db
open_weather-raw_weather_data-ka-5-Data.db
What am I missing? Any help would be highly appreciated. I'm also open to other suggestions; the COPY command will most probably not work, since as far as I know it is limited to 99,999,999 rows.
P.S. I didn't want to create an overly long post, but if you need any further information to help me out, just let me know.
EDIT:
Note that I'm using Cassandra in the stand-alone mode.
EDIT2:
After installing the same version, 2.1.4, on my destination machine (machine 2), I still get the same errors: with sstableloader I get the error mentioned above, and when copying the files manually (as described by LHWizard) the tables are still empty after restarting Cassandra and running a SELECT.
Regarding the initial tokens, I get a huge list of tokens when I run nodetool ring on machine 1. I'm not sure what to do with them.
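A side note on the "strange" names: the two layouts quoted above look like different sstable format generations. To the best of my knowledge, the prefix-style "ka" names are written by Cassandra 2.1, while the bare "la" names are written by 2.2+, which also dropped the keyspace-table prefix from filenames; if that holds, the snapshot was taken on a newer Cassandra than the one reading it, which would explain the NumberFormatException. A small sketch that tells the two layouts apart (the parsing rule is my reading of the filenames above, not an official API):

```python
def sstable_generation(filename):
    """Return (format_version, generation) from an sstable component name."""
    parts = filename.split("-")
    if parts[0] in ("ka", "la"):
        # new style (2.2+): la-53-big-Data.db
        return parts[0], int(parts[1])
    # old style (2.1): keyspace-table-ka-5-Data.db
    return parts[2], int(parts[3])

print(sstable_generation("la-53-big-Data.db"))
print(sstable_generation("open_weather-raw_weather_data-ka-5-Data.db"))
```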
Your data is already in the form of a snapshot (or backup). What I have done in the past is the following:
1. Install the same version of Cassandra on the restore node.
2. Edit cassandra.yaml on the restore node: make sure that cluster_name and snitch are the same.
3. Edit the seeds: list and any other properties that were altered on the original node.
4. Get the schema from the original node using cqlsh DESC KEYSPACE.
5. Start Cassandra on the restore node and import the schema.
6. Stop Cassandra, then delete the contents of the /var/lib/cassandra/data/, commitlog/, and saved_caches/ folders.
7. Restart Cassandra on the restore node to recreate the correct folders, then stop it.
(Steps 6 and 7 may not be completely necessary, but this is what I do.)
8. Copy the contents of the snapshots folder to each corresponding table folder on the restore node, then start Cassandra. You probably want to run nodetool repair.
You don't really need to bulk-import the data; it's already in the correct format if you are using the same version of Cassandra, although you didn't specify that in your original question.
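The final copy step can be sketched with placeholder paths; a real restore copies from each table's snapshots subdirectory into the table's own data directory (substitute your actual data_file_directories, keyspace, and snapshot names):

```shell
# Placeholder paths standing in for
# <data>/<keyspace>/<table>/snapshots/<snapshot_name>/ -> <data>/<keyspace>/<table>/
SNAP=/tmp/restore_demo/snapshots/migration
TABLE_DIR=/tmp/restore_demo/data/open_weather/raw_weather_data

mkdir -p "$SNAP" "$TABLE_DIR"
touch "$SNAP/la-53-big-Data.db"   # stand-in for real sstable components

cp "$SNAP"/* "$TABLE_DIR"/        # then start Cassandra and run nodetool repair
ls "$TABLE_DIR"
```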