Connect to Cassandra from Pig

Connect to Cassandra from Pig - cassandra

I am trying to connect to Cassandra from pig.
But Cassandra is installed in different cluster i need to connect to connect to Cassandra remotely from pig.
I am referring following link exmaple
Getting error like
Failed to parse: Can not retrieve schema from loader org.apache.cassandra.hadoop.pig.CqlStorage#1216d9bf
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:198)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1688)
at org.apache.pig.PigServer$Graph.access$000(PigServer.java:1421)
at org.apache.pig.PigServer.parseAndBuild(PigServer.java:354)
at org.apache.pig.PigServer.executeBatch(PigServer.java:379)
at org.apache.pig.PigServer.executeBatch(PigServer.java:365)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:140)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:769)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:372)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
at org.apache.pig.Main.run(Main.java:484)
at org.apache.pig.Main.main(Main.java:158)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
My pig script is as follows
A = LOAD 'cql://userName:password/mykeyspace/mycolumnfamily'
USING org.apache.cassandra.hadoop.pig.CqlStorage()
AS (user_id:long, fname:chararray, last_update_date:chararray, lname:chararray);
DUMP A;
Please let me know where we have to provide the ip of the system where the Cassandra is installed

Something i got by searching on internet is http://www.datastax.com/dev/blog/cassandra-and-pig-tutorial
Querying Cassandra Using Pig
Starting the pig client through Datastax Enterprise.
This requires no setup beyond having started the cluster in Analytics mode.
(14:52:17)[~/BlogPosts/CassPig_Libraries]dse pig
2013-08-26 14:52:27,166 [main] INFO org.apache.pig.Main - Logging error messages to: /Users/russellspitzer/BlogPosts/CassPig_Libraries/pig_1377553947163.log
2013-08-26 14:52:27,421 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: cfs://127.0.0.1/
2013-08-26 14:52:27.488 java[64588:1503] Unable to load realm info from SCDynamicStore
2013-08-26 14:52:28,348 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: 127.0.0.1:8012
grunt>
Next we construct our pig commands, starting with loading our data from Cassandra. We’ll be using the cql:// url and the CqlStorage() connector. The format of the command is basically load ‘cql://keyspace/table’. More info on CQL3 and Pig.
grunt> libdata = load 'cql://libdata/libout' USING CqlStorage();
grunt> DESCRIBE libdata;
Set the following as environment variables (uppercase,
underscored), or as Hadoop configuration variables (lowercase, dotted):
* PIG_INITIAL_ADDRESS or cassandra.thrift.address : initial address to connect to
* PIG_RPC_PORT or cassandra.thrift.port : the port thrift is listening on
* PIG_PARTITIONER or cassandra.partitioner.class : cluster partitioner
For example, against a local node with the default settings, you'd use:
export PIG_INITIAL_ADDRESS=localhost
export PIG_RPC_PORT=9160
export PIG_PARTITIONER=org.apache.cassandra.dht.Murmur3Partitioner
These properties can be overridden with the following if you use different clusters for input and output:
* PIG_INPUT_INITIAL_ADDRESS : initial address to connect to for reading
* PIG_INPUT_RPC_PORT : the port thrift is listening on for reading
* PIG_INPUT_PARTITIONER : cluster partitioner for reading
* PIG_OUTPUT_INITIAL_ADDRESS : initial address to connect to for writing
* PIG_OUTPUT_RPC_PORT : the port thrift is listening on for writing
* PIG_OUTPUT_PARTITIONER : cluster partitioner for writing
For more reference refer the below URL
https://github.com/Stratio/stratio-cassandra/tree/master/examples/pig
Hope this Helps!!!...

Related

Kafka Schema Registry failed to initialize after ZK and Brokers are configured with SASL_PLAINTEXT Security

We are using Confluent community edition setup for Kafka, currently we have a requirement to configure ACLs around the cluster, accordingly we have configured the zk and broker nodes so clients requires authentication(username/password) SASL_PLAINTEXT tokens to publish/subscribe to cluster, its working perfectly without schema registry, however while configuring schema-registry, its unable initialize throwing below exception even though we have configurred it to use SASL+PLAINTEXT connection with brokers/zk nodes. Is there anything I'm missing please help.
Please note we are using allow.everyone.if.no.acl.found=true flag and currently we dont have and ACL defined, so I don't think we need to setup any ACLs for _schemas topic which is used by schema registry to initialize.
[2019-12-17 00:33:23,844] ERROR Error starting the schema registry (io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication:64)
io.confluent.kafka.schemaregistry.exceptions.SchemaRegistryInitializationException: Error initializing kafka store while initializing schema registry
at io.confluent.kafka.schemaregistry.storage.KafkaSchemaRegistry.init(KafkaSchemaRegistry.java:212)
at io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication.initSchemaRegistry(SchemaRegistryRestApplication.java:62)
at io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication.setupResources(SchemaRegistryRestApplication.java:73)
at io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication.setupResources(SchemaRegistryRestApplication.java:40)
at io.confluent.rest.Application.createServer(Application.java:201)
at io.confluent.kafka.schemaregistry.rest.SchemaRegistryMain.main(SchemaRegistryMain.java:42)
Caused by: io.confluent.kafka.schemaregistry.storage.exceptions.StoreInitializationException: Timed out trying to create or validate schema topic configuration
at io.confluent.kafka.schemaregistry.storage.KafkaStore.createOrVerifySchemaTopic(KafkaStore.java:172)
at io.confluent.kafka.schemaregistry.storage.KafkaStore.init(KafkaStore.java:114)
at io.confluent.kafka.schemaregistry.storage.KafkaSchemaRegistry.init(KafkaSchemaRegistry.java:210)
... 5 more
Caused by: java.util.concurrent.TimeoutException
at org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:108)
at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:274)
at io.confluent.kafka.schemaregistry.storage.KafkaStore.createOrVerifySchemaTopic(KafkaStore.java:165)
... 7 more

How to connect KairosDB with Azure Cosmos (Cassandra)

I use KairosDB on top of Cassandra for saving all our time series data. I am now trying to replicate the same KairosDB with Azure CosmosDB (Cassandra API). But its throwing error:
16:59:08.364 [main] INFO [LZ4Compressor.java:52] - Using LZ4Factory:JNI
16:59:08.441 [main] INFO [NettyUtil.java:73] - Did not find Netty's native epoll transport in the classpath, defaulting to NIO.
16:59:08.842 [main] ERROR [CassandraModule.java:136] - Unable to setup cassandra schema
com.datastax.driver.core.exceptions.AuthenticationException: Authentication error on host ilenstsdb2.cassandra.cosmos.azure.com/40.65.106.154:10350: Cql request had unsupported headers Compression
at com.datastax.driver.core.Connection$8.apply(Connection.java:392)
at com.datastax.driver.core.Connection$8.apply(Connection.java:361)
enter image description here

I don't have a huge amount of exposure to the Cassandra API but the error is indicating that your API request contained the header Compression while authenticating and the server didn't like it.
Make sure you're not using withCompression() when starting your datastax driver:
cluster = Cluster.builder()
.addContactPoint("X.X.X.X")
.withCompression(ProtocolOptions.Compression.LZ4)
.build();
...should be...
cluster = Cluster.builder()
.addContactPoint("X.X.X.X")
.build();

Connecting to Cassandra with Spark

First, I have bought the new O'Reilly Spark book and tried those Cassandra setup instructions. I've also found other stackoverflow posts and various posts and guides over the web. None of them work as-is. Below is as far as I could get.
This is a test with only a handful of records of dummy test data. I am running the most recent Cassandra 2.0.7 Virtual Box VM provided by plasetcassandra.org linked from the main Cassandra project page.
I downloaded Spark 1.2.1 source and got the latest Cassandra Connector code from github and built both against Scala 2.11. I have JDK 1.8.0_40 and Scala 2.11.6 setup on Mac OS 10.10.2.
I run the spark shell with the cassandra connector loaded:
bin/spark-shell --driver-class-path ../spark-cassandra-connector/spark-cassandra-connector/target/scala-2.11/spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar
Then I do what should be a simple row count type test on a test table of four records:
import com.datastax.spark.connector._
sc.stop
val conf = new org.apache.spark.SparkConf(true).set("spark.cassandra.connection.host", "192.168.56.101")
val sc = new org.apache.spark.SparkContext(conf)
val table = sc.cassandraTable("mykeyspace", "playlists")
table.count
I get the following error. What is confusing is that it is getting errors trying to find Cassandra at 127.0.0.1, but it also recognizes the host name that I configured which is 192.168.56.101.
15/03/16 15:56:54 INFO Cluster: New Cassandra host /192.168.56.101:9042 added
15/03/16 15:56:54 INFO CassandraConnector: Connected to Cassandra cluster: Cluster on a Stick
15/03/16 15:56:54 ERROR ServerSideTokenRangeSplitter: Failure while fetching splits from Cassandra
java.io.IOException: Failed to open thrift connection to Cassandra at 127.0.0.1:9160
<snip>
java.io.IOException: Failed to fetch splits of TokenRange(0,0,Set(CassandraNode(/127.0.0.1,/127.0.0.1)),None) from all endpoints: CassandraNode(/127.0.0.1,/127.0.0.1)
BTW, I can also use a configuration file at conf/spark-defaults.conf to do the above without having to close/recreate a spark context or pass in the --driver-clas-path argument. I ultimately hit the same error though, and the above steps seem easier to communicate in this post.
Any ideas?

Check the rpc_address config in your cassandra.yaml file on your cassandra node. It's likely that the spark connector is using that value from the system.local/system.peers tables and it may be set to 127.0.0.1 in your cassandra.yaml.
The spark connector uses thrift to get token range splits from cassandra. Eventually I'm betting this will be replaced as C* 2.1.4 has a new table called system.size_estimates (CASSANDRA-7688). It looks like it's getting the host metadata to find the nearest host and then making the query using thrift on port 9160.

Hector Cassandra Connectivity Error

I am trying to connect to a local cassandra instance through a java client powered by Hector. I attempt to read rows after trying to connect. The code snippet is as follows
Cluster myCluster = HFactory.getOrCreateCluster("test" , "localhost:9160");
KeyspaceDefinition keySpaceDef = myCluster.describeKeyspace("testkeyspace");
.....
However the connectivity fails with this error
Exception in thread "main" java.lang.NoSuchFieldError: DEFAULT_MEMTABLE_OPERATIONS_IN_MILLIONS
at me.prettyprint.cassandra.service.ThriftCfDef.(ThriftCfDef.java:65)
at me.prettyprint.cassandra.service.ThriftCfDef.fromThriftList(ThriftCfDef.java:144)
at me.prettyprint.cassandra.service.ThriftKsDef.(ThriftKsDef.java:34)
at me.prettyprint.cassandra.service.AbstractCluster$4.execute(AbstractCluster.java:192)
at me.prettyprint.cassandra.service.AbstractCluster$4.execute(AbstractCluster.java:187)
at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:101)
at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:232)
at me.prettyprint.cassandra.service.AbstractCluster.describeKeyspace(AbstractCluster.java:201)
I have cassandra, thrift as dependencies in my pom.xml. Any clues as to what could be wrong?

cassandra sstable-loader error: "Got an unknow host from describe_ring()"

I am trying to load sstables to cassandra cluster of two nodes with sstable-loader utility provided in cassandra 0.8.4
1) I have loaded the data successfully on single node environment .
2) As i have created the cluster of two nodes ,while loading ,after gossip it throws exception
java.lang.RuntimeException: Got an unknow host from describe_ring()

This is a bug in 0.8.4 (https://issues.apache.org/jira/browse/CASSANDRA-3044). It's fixed in 0.8.5; you can test that by following the link on the release thread here.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string