How to use the describe keyspaces; command in a Jupyter notebook - Cassandra

I am reading Cassandra data in a Jupyter notebook. In Cassandra we can do this with the command "describe keyspaces;".
Suppose Jupyter is connected to Cassandra and I don't want to use Cassandra directly, but instead want to issue Cassandra commands through the Jupyter notebook. How can I run describe keyspaces in order to know the keyspaces?
I tried entering the describe keyspaces; command.
from cassandra.cluster import Cluster
cluster = Cluster(['127.0.0.1']) # provide contact points and port
session = cluster.connect('fiirstkeyspace')
rows = session.execute('select * from books_by_author limit 5 ;')
for row in rows:
    print(row)
In the above code I already know that I have a keyspace called 'fiirstkeyspace'; however, I want to know all the keyspaces in Cassandra through the Jupyter notebook.
show keyspaces;
File "<ipython-input-62-dd2f479cd0fc>", line 1
show keyspaces;
^
SyntaxError: invalid syntax
describe keyspaces;
File "<ipython-input-67-21f5033a29b3>", line 1
describe keyspaces;
^
SyntaxError: invalid syntax

describe keyspaces, etc. are commands that are implemented in cqlsh - they aren't real CQL commands. In Python you can get all of this information via the Metadata class, which hides the implementation details, since the schema of the system tables can differ between versions.
The code for getting the names of all keyspaces is quite simple (cluster is the name of the object that you created to connect to the Cassandra cluster):
cluster.metadata.keyspaces.keys()
You can then fetch data about individual keyspaces from the cluster.metadata.keyspaces map.
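For example, a minimal sketch using the driver's schema metadata (the contact point mirrors the question; the keyspaces and tables printed are simply whatever exists on your node):

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

# All keyspace names known to the cluster
print(list(cluster.metadata.keyspaces.keys()))

# Per-keyspace details: here, the table names in each keyspace
for name, ks_meta in cluster.metadata.keyspaces.items():
    print(name, list(ks_meta.tables.keys()))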

Related

Pyspark on EMR and external hive/glue - can drop but not create tables via sqlContext

I'm writing a dataframe to an external Hive table from pyspark running on EMR. The work involves dropping/truncating data from an external Hive table, writing the contents of a dataframe into the aforementioned table, then writing the data from Hive to DynamoDB. I am looking to write to an internal table on the EMR cluster eventually, but for now I would like the Hive data to be available to subsequent clusters. I could write to the Glue catalog directly and force it to be registered, but that is a step further than I need to go.
All components work fine individually on a given EMR cluster: I can create an external Hive table on EMR, either using a script or via ssh and the Hive shell. This table can be queried by Athena and read by pyspark. I can create a dataframe and INSERT OVERWRITE the data into the aforementioned table in pyspark.
I can then use the Hive shell to copy the data from the Hive table into a DynamoDB table.
I'd like to wrap all of the work into one pyspark script instead of having to submit multiple distinct steps.
I am able to drop tables using
sqlContext.sql("drop table if exists default.my_table")
When I try to create a table using sqlContext.sql("create table default.mytable(id string,val string) STORED AS ORC") I get the following error:
org.apache.hadoop.net.ConnectTimeoutException: Call From ip-xx-xxx-xx-xxx/xx.xxx.xx.xx to ip-xxx-xx-xx-xx:8020 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=ip-xxx-xx-xx-xx:8020]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
I can't figure out why I can create an external Hive table in Glue using the Hive shell on the cluster, and drop the table using the Hive shell or the pyspark sqlContext, but I can't create a table using sqlContext. I have checked around and the solutions offered (copying hive-site.xml) don't make sense in this context, as I can clearly write to the required addresses with no hassle, just not in pyspark. And it is doubly strange that I can drop the tables, and they are definitely dropped when I check in Athena.
Running on:
emr-5.28.0,
Hadoop distribution Amazon 2.8.5
Spark 2.4.4
Hive 2.3.6
Livy 0.6.0 (for notebooks but my experimentation is via ssh and pyspark shell)
Turns out I could create tables via a spark.sql() call as long as I provided a location for the tables. It seems like the Hive shell doesn't require it, yet spark.sql() does. Not expected, but not entirely surprising.
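For example, a minimal sketch of a create statement with an explicit location (the S3 bucket and prefix here are placeholders, not from the original question):

# Providing a LOCATION lets the create succeed from spark.sql();
# substitute a bucket/prefix you control for the placeholder path.
spark.sql("""
    CREATE TABLE IF NOT EXISTS default.my_table (id string, val string)
    STORED AS ORC
    LOCATION 's3://your-bucket/path/to/my_table/'
""")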
Complementing @Zeathor's answer: after configuring the EMR and Glue connection and permissions (you can find more details here: https://www.youtube.com/watch?v=w20tapeW1ME), you just need to write Spark SQL commands:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('TestSession').getOrCreate()
spark.sql("create database if not exists test")
You can then create your tables from dataframes:
df.createOrReplaceTempView("first_table")
spark.sql("create table test.table_name as select * from first_table")
All database and table metadata will then be stored in the AWS Glue Data Catalog.
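As an alternative sketch, continuing the df from the example above and assuming the same Glue-backed metastore, the dataframe can also be written out as a table directly:

# Writes the dataframe as a table; with Glue configured as the metastore,
# the table's metadata lands in the Glue Data Catalog as well.
df.write.mode("overwrite").format("orc").saveAsTable("test.table_name")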

Cassandra drop keyspace, tables not working

I am starting out with Cassandra and have had some problems. I create keyspaces and tables to play around with; if I drop them and then run a describe keyspace, they still show up. Other times I drop them and it tells me they don't exist, but I can't create them either because it says they already exist.
Is there a way to clear that "cache" or something similar?
I would also like to know whether I can execute a .cql file that is on my computer through cqlsh.
[cqlsh 5.0.1 | Cassandra 3.11.0 | CQL spec 3.4.4 | Native protocol v4]
This may be due to the eventually consistent nature of Cassandra. If you are on a small test cluster and just playing around, you could try running CONSISTENCY ALL in cqlsh, which will force the nodes to become consistent.
You should always run DELETE or DROP commands with CONSISTENCY ALL so that they are reflected on all nodes and DCs. You also need to wait a moment for the change to replicate across the cluster. Once it has replicated you will no longer see the deleted data; otherwise you need to run a repair on the cluster.
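If you are driving Cassandra from Python rather than cqlsh, a minimal sketch of the same advice (the keyspace name and contact point are placeholders):

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

# Run the DROP at consistency ALL, as suggested above
drop = SimpleStatement("DROP KEYSPACE IF EXISTS my_test_keyspace",
                       consistency_level=ConsistencyLevel.ALL)
session.execute(drop)

# Refresh the driver's view of the schema and confirm the keyspace is gone
cluster.refresh_schema_metadata()
print('my_test_keyspace' in cluster.metadata.keyspaces)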

HBase phantom table can't be created or deleted

As I am trying to get familiar with HBase, I created a table called example in the HBase shell. I reformatted the NameNode for Hadoop and (because I didn't properly shut it down before my computer ran out of battery) restarted Hadoop and HBase. But now when I try to create the example table I get the following error:
ERROR: Table already exists: example!
and when I try to disable it and drop it I get the following:
ERROR: Table example does not exist.
When I try to list the tables, no tables are listed. I even removed the HBase directory from Hadoop, but the problem still persists.

Why does a read fail with a cqlsh query when huge numbers of tombstones are present

I have a table with a huge number of tombstones. When I ran a Spark job that reads from that specific table, it returned results without any issues. But when I executed the same query using cqlsh, it gave me an error because of the huge number of tombstones present in that table:
Cassandra failure during read query at consistency ONE (1 replica
needed but 0 replicas responded, 1 failed)
I know the tombstones should not be there, and I can run a repair to avoid them, but apart from that, why did Spark succeed while cqlsh failed? They both use the same sessions and queries.
How does the spark-cassandra connector work? Is it different from cqlsh?
Please let me know.
Thank you.
The Spark Cassandra Connector is different from cqlsh in a few ways:
It uses the Java driver and not the Python driver.
It has significantly more lenient retry policies.
It performs full table scans by breaking the request up into pieces.
Any of these items could contribute to why it would work in the SCC and not in cqlsh.
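For reference, a minimal sketch of reading a table through the connector from pyspark; the split-size and retry settings are real connector options, but the values, keyspace, and table names here are only illustrative:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('tombstone-read-test')
         .config('spark.cassandra.connection.host', '127.0.0.1')
         .config('spark.cassandra.input.split.sizeInMB', '64')   # scan is split into token-range pieces
         .config('spark.cassandra.query.retry.count', '10')      # more lenient retries than cqlsh
         .getOrCreate())

# Read the table through the connector rather than cqlsh
df = (spark.read
      .format('org.apache.spark.sql.cassandra')
      .options(keyspace='fiirstkeyspace', table='books_by_author')
      .load())
df.show(5)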

Import all existing keyspaces schemas

I have 6 keyspaces in my Cassandra database. I want to migrate all of my keyspace schemas to another Cassandra database. How can I do it all at once?
DESCRIBE SCHEMA;
will give you the DDL statements needed to recreate the non-system keyspaces and tables on a new cluster.
You can run the following command to write the schema to a .cql file:
cqlsh -e "Desc keyspace keyspacename" > out.cql
and then use SOURCE to import the .cql file on another host, or run cqlsh -f out.cql optionalHostname.
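If you prefer to script the export from Python, a minimal sketch using the driver's schema metadata (the output file name, contact point, and list of system keyspaces to skip are assumptions):

from cassandra.cluster import Cluster

# Hypothetical skip list; adjust for the system keyspaces present in your version
SYSTEM_KEYSPACES = {'system', 'system_schema', 'system_auth',
                    'system_distributed', 'system_traces'}

cluster = Cluster(['127.0.0.1'])
cluster.connect()

# Dump the CQL needed to recreate every non-system keyspace and its tables
with open('schema_export.cql', 'w') as out:
    for name, ks in cluster.metadata.keyspaces.items():
        if name not in SYSTEM_KEYSPACES:
            out.write(ks.export_as_string() + '\n\n')

You can then replay schema_export.cql on the target cluster, for example with cqlsh -f schema_export.cql.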
