I am bulk loading data into Cassandra using SSTables, following https://github.com/SPBTV/csv-to-sstable.
I created the SSTables with:
$ java -jar csv-to-sstable.jar quote /home/arque/table_big.cql /home/arque/Documents/data.csv /home/arque
I get an error when I try to run the following command:
$ sstableloader -d 192.168.0.7 /home/arque/quote/table_big
Error:
Established connection to initial hosts
Opening sstables and calculating sections to stream
Failed to list files in /home/arque/quote/table_big
java.lang.AssertionError
java.lang.RuntimeException: Failed to list files in /home/arque/quote/table_big
at org.apache.cassandra.db.lifecycle.LogAwareFileLister.list(LogAwareFileLister.java:77)
The error is in the csv-to-sstable tool. Look at this file: https://github.com/SPBTV/csv-to-sstable/blob/master/src/main/java/com/spbtv/cassandra/bulkload/Bulkload.java
You say you only have an issue when the primary key is a composite key. That's because the tool expects the primary key to be defined on the same line as the corresponding column.
Line 66:
// Primary key defined on the same line as the corresponding column
Pattern pattern = Pattern.compile(".*?(\\w+)\\s+\\w+\\s+PRIMARY KEY.*");
If you change this regex to suit your needs, it should work.
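For example, here is a rough sketch (not the tool's actual code; the class and method names are made up) of how the parsing could be extended to also recognize a composite key declared in a separate PRIMARY KEY (...) clause:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PrimaryKeyParser {

    // Case 1: key defined inline, e.g. "region_id text PRIMARY KEY,"
    private static final Pattern INLINE_PK =
            Pattern.compile(".*?(\\w+)\\s+\\w+\\s+PRIMARY KEY.*", Pattern.CASE_INSENSITIVE);

    // Case 2: composite key on its own line, e.g. "PRIMARY KEY ((symbol, date), ts)"
    private static final Pattern COMPOSITE_PK =
            Pattern.compile(".*PRIMARY KEY\\s*\\(([^;]+)\\).*", Pattern.CASE_INSENSITIVE);

    // Returns the primary key column names found on one line of the CQL schema.
    // Note: it flattens partition and clustering columns into a single list.
    public static List<String> primaryKeyColumns(String line) {
        List<String> columns = new ArrayList<>();
        Matcher inline = INLINE_PK.matcher(line);
        if (inline.matches()) {
            columns.add(inline.group(1));
            return columns;
        }
        Matcher composite = COMPOSITE_PK.matcher(line);
        if (composite.matches()) {
            // Strip the partition-key parentheses and split on commas.
            for (String col : composite.group(1).replaceAll("[()]", "").split(",")) {
                columns.add(col.trim());
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        System.out.println(primaryKeyColumns("id uuid PRIMARY KEY,"));             // [id]
        System.out.println(primaryKeyColumns("PRIMARY KEY ((symbol, date), ts)")); // [symbol, date, ts]
    }
}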
I am getting an error while executing an ALTER TABLE script.
ALTER TABLE user.employee ADD salary text;
ServerError: java.lang.RuntimeException: java.util.concurrent.ExecutionException: org.apache.cassandra.exceptions.ConfigurationException: Column family ID mismatch (found e5da3980-83eb-11ec-8c56-1b3845d1a791; expected c8ac48d0-83eb-11ec-8c56-1b3845d1a791)
When I describe the table, I can see the newly created column present. But I am not able to access the new column; it throws the error below: InvalidRequest: Error from server: code=2200 [Invalid query] message="Undefined name xxxxxxxxx in selection clause"
We have close to 100GB of data.
This looks like the same question asked on https://community.datastax.com/questions/13220/ so I'm re-posting my answer here.
This exception indicates that you have a schema disagreement in your cluster:
ConfigurationException: Column family ID mismatch (\
found e5da3980-83eb-11ec-8c56-1b3845d1a791; \
expected c8ac48d0-83eb-11ec-8c56-1b3845d1a791 \
)
In my experience, the most common cause of this problem is that you dropped and re-created the table without waiting for the schema to propagate to all nodes in the cluster in between the DROP and CREATE. Alternatively, it's possible that you've tried to create the table and assumed it didn't work then tried to create it again.
In any case, Cassandra thinks the table was created at 05:48 GMT but found a version created at 05:49 GMT. For what it's worth:
e5da3980-83eb-11ec-8c56-1b3845d1a791 = February 2, 2022 at 5:49:33 AM GMT
c8ac48d0-83eb-11ec-8c56-1b3845d1a791 = February 2, 2022 at 5:48:44 AM GMT
You'll need to resolve the schema disagreement. Depending on the Cassandra version you can either (a) run nodetool resetlocalschema on nodes which have a different schema version based on the output of nodetool describecluster, or (b) perform a rolling restart of all nodes. Cheers!
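As an aside, if you are applying schema changes from application code with the DataStax Java driver, you can also guard against racing ahead of schema propagation by checking schema agreement before using the new column. A minimal sketch, assuming driver 3.x and reusing the table and column from the question (the contact point is a placeholder):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class SchemaAgreementCheck {
    public static void main(String[] args) throws InterruptedException {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

            session.execute("ALTER TABLE user.employee ADD salary text");

            // Poll until every node reports the same schema version
            // (the driver also waits internally, bounded by maxSchemaAgreementWaitSeconds).
            while (!cluster.getMetadata().checkSchemaAgreement()) {
                Thread.sleep(500);
            }

            // Only use the new column once the schema has converged.
            session.execute("SELECT salary FROM user.employee LIMIT 1");
        }
    }
}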
ExecutionException: org.apache.cassandra.exceptions.ConfigurationException: Column family ID mismatch (found e5da3980-83eb-11ec-8c56-1b3845d1a791; expected c8ac48d0-83eb-11ec-8c56-1b3845d1a791)
Has that column been deleted/added more than once? Cassandra (especially the pre-3.0 versions) is notorious for problems with that.
Check the output of nodetool describecluster. Are there multiple schema versions being reported?
If there are multiple schema versions, then run a rolling restart of the cluster. That's a sure-fire way to force schema agreement. Check the table, and see if that column is there. If not, try to add it.
The other solution would be to try adding it with a different name (e.g. "salary2").
A Google Spanner DDL script runs successfully when submitted in the Spanner Console, but when executed via the "gcloud spanner databases ddl update" command using the "--ddl-file" argument it consistently fails with the error:
(gcloud.spanner.databases.ddl.update) INVALID_ARGUMENT: Error parsing Spanner DDL
statement: \n : Syntax error on line 1, column 1: Encountered 'EOF' while parsing:
ddl_statement
'#type': type.googleapis.com/google.rpc.LocalizedMessage
locale: en-US
message: |-
Error parsing Spanner DDL statement:
: Syntax error on line 1, column 1: Encountered 'EOF' while parsing: ddl_statement
Example of the command:
gcloud spanner databases ddl update test-db \
  --instance=test-instance \
  --ddl-file=table.ddl
cat table.ddl
CREATE TABLE regions
(
region_id STRING(2) NOT NULL,
name STRING(13) NOT NULL,
) PRIMARY KEY (region_id);
There is only one other reference to this identical situation on the internet. Has anyone gotten the "--ddl-file" argument to work successfully?
The problem is (most probably) caused by the last semicolon in your DDL script. It seems that the --ddl-file option accepts scripts with multiple DDL statements separated by semicolons (;), but the last statement should not be terminated by a semicolon. If it is, gcloud tries to parse another DDL statement after the last one, finds that there is none, and throws an unexpected-end-of-file error.
So, TL;DR: remove the last semicolon in your script and it should work.
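For illustration, here is a small sketch of the kind of splitting that appears to happen (this is not gcloud's actual code): splitting the file on semicolons leaves an empty trailing "statement" when the last statement ends with a semicolon, which is what surfaces as "Encountered 'EOF' while parsing: ddl_statement".

import java.util.ArrayList;
import java.util.List;

public class DdlSplitDemo {

    // Splits a DDL script on semicolons, the way a naive statement splitter might.
    static List<String> splitNaively(String script) {
        List<String> statements = new ArrayList<>();
        // The -1 keeps trailing empty strings, mimicking a parser that still
        // expects one more ddl_statement after the final ';'.
        for (String part : script.split(";", -1)) {
            statements.add(part.trim());
        }
        return statements;
    }

    public static void main(String[] args) {
        String ddl = "CREATE TABLE regions (\n"
                + "  region_id STRING(2) NOT NULL,\n"
                + "  name STRING(13) NOT NULL,\n"
                + ") PRIMARY KEY (region_id);";

        // The trailing ';' produces an empty final element.
        System.out.println(splitNaively(ddl));

        // Without the trailing ';' every element is a complete statement.
        System.out.println(splitNaively(ddl.substring(0, ddl.length() - 1)));
    }
}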
TL;DR: a table can still be inaccessible even though system_schema.tables already contains a record for that table.
I'm trying to use Cassandra concurrently.
Cassandra version: [cqlsh 5.0.1 | Cassandra 3.11.3 | CQL spec 3.4.4 | Native protocol v4]
I have two Python scripts, a Producer and a Consumer, using cassandra-driver==3.16.0 and running in different processes.
While the Producer creates and fills the table, the Consumer waits until the table is created by polling with this CQL statement:
from cassandra.query import SimpleStatement

table_exists = False
while not table_exists:
    cql = SimpleStatement(
        "SELECT table_name FROM system_schema.tables WHERE keyspace_name = 'test_keyspace' AND table_name = 'test_table'"
    )
    results = cassandra_session.execute(cql)
    table_exists = bool(results.current_rows)
Once the statement returns at least one record, I conclude that the table has been created and try to read it with a SELECT:
SELECT * FROM test_keyspace.test_table WHERE ...
But sometimes I get a really annoying error:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/stress.py", line 128, in runner
for r in select(TEST_KEYSPACE, table_name):
File "/stress.py", line 63, in select
results = cassandra_session.execute(statement)
File "cassandra/cluster.py", line 2171, in cassandra.cluster.Session.execute
File "cassandra/cluster.py", line 4062, in cassandra.cluster.ResponseFuture.result
cassandra.InvalidRequest: Error from server: code=2200 [Invalid query] message="unconfigured table test_table"
From what I found, this error happens when a SELECT statement is executed against a table that has not been created yet.
So while system_schema.tables already contains a record about the table, the table is not yet accessible.
Is there a more reliable way to check table accessibility, or a common workaround?
With single-node Cassandra setups, I have seen schema changes fail to propagate immediately: you create a table, then insert into it, and the insert fails because the table does not exist. Then you check whether the table exists, and it is there. And then, since some time has passed, inserts work.
The only way I managed to make single-node Cassandra behave consistently was to introduce a one-second wait after every schema change. That was fine for me, since single-node Cassandra is only used in local development scenarios; in production environments I simply disable the wait.
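Another option may be to wait for the table to appear in the driver's own schema metadata instead of querying system_schema directly (the Python driver exposes the same information via cluster.metadata.keyspaces). A rough sketch with the Java driver, using the keyspace and table names from the question; a retry around the first read is still wise, since metadata can run ahead of individual replicas:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.KeyspaceMetadata;
import com.datastax.driver.core.Session;

public class WaitForTable {

    // Blocks until the driver's metadata reports the table, or the timeout expires.
    static boolean waitForTable(Cluster cluster, String keyspace, String table,
                                long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            KeyspaceMetadata ks = cluster.getMetadata().getKeyspace(keyspace);
            if (ks != null && ks.getTable(table) != null) {
                return true;
            }
            Thread.sleep(200);
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        // The contact point is a placeholder; adjust for your environment.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            if (waitForTable(cluster, "test_keyspace", "test_table", 30_000)) {
                session.execute("SELECT * FROM test_keyspace.test_table LIMIT 1");
            }
        }
    }
}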
In my case, it was looking for the table in another keyspace.
ERROR:
cassandra.InvalidRequest: Error from server: code=2200 [Invalid query] message="unconfigured table <table_name>"
I made sure my keyspace setting pointed to the right keyspace, and then it worked.
I am trying to load a TSV file into HBase running on HDInsight in the Microsoft Azure cloud, using the recommended approach of connecting through Remote Desktop and running the following on the command line to load the file t1.tsv (with two tab-separated columns) from HDFS into the HBase table t1:
C:\apps\dist\hbase-0.98.0.2.1.5.0-2057-hadoop2\bin>hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,num t1 t1.tsv
and get:
ERROR: One or more columns in addition to the row key and timestamp(optional) are required
Usage: importtsv -Dimporttsv.columns=a,b,c
Reversing the order of the specified columns to num,HBASE_ROW_KEY:
C:\apps\dist\hbase-0.98.0.2.1.5.0-2057-hadoop2\bin>hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=num,HBASE_ROW_KEY t1 t1.tsv
I get:
ERROR: Must specify exactly one column as HBASE_ROW_KEY
Usage: importtsv -Dimporttsv.columns=a,b,c
This tells me that either the comma separator in the column list is not recognized or the column name is incorrect. I also tried specifying the column with a qualifier, as num:v, and as 'num' - nothing helps.
Any ideas what could be wrong here? Thanks.
>hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns="HBASE_ROW_KEY,d:c1,d:c2" testtable /example/inputfile.txt
This works for me. I think there are some differences between the Linux and Windows terminals: on Windows you need to add quotation marks around the column list to make clear that it is a single value string, otherwise it might not be recognized.
I am trying to export data from Cassandra using the CQL client. A column family has about 100,000 rows in it. When I copy the data into a CSV file using the COPY TO command, I get the following rpc_timeout error:
copy mycolfamily to '/root/mycolfamily.csv'
Request did not complete within rpc_timeout.
I am running:
[cqlsh 3.1.6 | Cassandra 1.2.8 | CQL spec 3.0.0 | Thrift protocol 19.36.0]
How can I increase RPC timeout limit?
I tried adding rpc_timeout_in_ms: 20000 (the default is 10000) in my conf/cassandra.yaml file, but while restarting Cassandra I get:
[root@user ~]# null; Can't construct a java object for tag:yaml.org,2002:org.apache.cassandra.config.Config; exception=Cannot create property=rpc_timeout_in_ms for JavaBean=org.apache.cassandra.config.Config@71bfc4fc; Unable to find property 'rpc_timeout_in_ms' on class: org.apache.cassandra.config.Config
Invalid yaml; unable to start server. See log for stacktrace.
The COPY command currently does the same thing as a SELECT with LIMIT 99999999, so it will eventually hit the timeout as your data grows. Here's the export function:
https://github.com/apache/cassandra/blob/trunk/bin/cqlsh#L1524
I'm doing the same export in production. What I do is the following (sketched in code below):
run SELECT * FROM table WHERE timeuuid = someTimeuuid LIMIT 10000
write the result set to a CSV file in append (>>) mode
make the next SELECTs relative to the last timeuuid seen
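A rough sketch of that manual paging loop with the Java driver follows. The schema is assumed (a hypothetical table mycolfamily with partition key id and a timeuuid clustering column ts), since the real one isn't shown here, and the partition to export is passed as a command-line argument:

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.List;
import java.util.UUID;

public class ManualPagingExport {
    public static void main(String[] args) throws IOException {
        UUID partition = UUID.fromString(args[0]); // the partition key being exported

        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("mykeyspace");
             // FileWriter in append mode plays the role of ">>" above.
             PrintWriter out = new PrintWriter(new FileWriter("mycolfamily.csv", true))) {

            PreparedStatement firstPage = session.prepare(
                    "SELECT ts, value FROM mycolfamily WHERE id = ? LIMIT 10000");
            PreparedStatement nextPage = session.prepare(
                    "SELECT ts, value FROM mycolfamily WHERE id = ? AND ts > ? LIMIT 10000");

            UUID lastSeen = null;
            while (true) {
                BoundStatement bound = (lastSeen == null)
                        ? firstPage.bind(partition)
                        : nextPage.bind(partition, lastSeen);
                List<Row> rows = session.execute(bound).all();
                if (rows.isEmpty()) {
                    break; // no more data in this partition
                }
                for (Row r : rows) {
                    out.println(r.getUUID("ts") + "," + r.getString("value"));
                    lastSeen = r.getUUID("ts"); // resume point for the next page
                }
            }
        }
    }
}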
You can also pipe a command into cqlsh with the following command:
echo "{$cql}" | /usr/bin/cqlsh -u user -p password localhost 9160 > file.csv
You can use automatic paging by specifying a fetch size in the DataStax Java driver:
Statement stmt = new SimpleStatement("SELECT id FROM mycolfamily");
stmt.setFetchSize(500);
ResultSet result = session.execute(stmt);
// Iterating the ResultSet directly fetches further pages transparently;
// result.all() would load the entire table into memory.
for (Row r : result) {
    // write to file
}
I encountered the same problem a few minutes ago, then I found CAPTURE and it worked:
First start capturing in cqlsh, then run your query with a LIMIT of your choice.
http://www.datastax.com/documentation/cql/3.0/cql/cql_reference/capture_r.html
The best way to export the data is using the nodetool snapshot option. It returns immediately and the snapshot can be restored later. The only catch is that the snapshot is taken per node, so you need to run it on every node to cover the entire cluster.
Example:
nodetool -h localhost -p 7199 snapshot
See reference:
http://docs.datastax.com/en/archived/cassandra/1.1/docs/backup_restore.html#taking-a-snapshot