How to insert data into Cassandra using Pig

I am trying to copy data from a file in HDFS to a table in Cassandra using Pig, but the job fails with a NullPointerException while storing the data in Cassandra. Can someone help me with this?
Users table structure:
CREATE TABLE users (
user_id text PRIMARY KEY,
age int,
first text,
last text
)
My Pig script:
A = load '/user/hduser/user.txt' using PigStorage(',') as (id:chararray,age:int,fname:chararray,lname:chararray);
C = foreach A GENERATE TOTUPLE(TOTUPLE('user_id',id)), TOTUPLE('age',age),TOTUPLE('first',fname),TOTUPLE('last',lname);
STORE C into 'cql://ram_keyspace/users' USING CqlStorage();
Exception:
java.lang.RuntimeException: java.lang.NullPointerException
at org.apache.cassandra.hadoop.cql3.CqlRecordWriter.&lt;init&gt;(CqlRecordWriter.java:123)
at org.apache.cassandra.hadoop.cql3.CqlRecordWriter.&lt;init&gt;(CqlRecordWriter.java:90)
at org.apache.cassandra.hadoop.cql3.CqlOutputFormat.getRecordWriter(CqlOutputFormat.java:76)
at org.apache.cassandra.hadoop.cql3.CqlOutputFormat.getRecordWriter(CqlOutputFormat.java:57)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.getRecordWriter(PigOutputFormat.java:84)
at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.&lt;init&gt;(MapTask.java:627)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:753)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.NullPointerException
at org.apache.cassandra.hadoop.cql3.CqlRecordWriter.&lt;init&gt;(CqlRecordWriter.java:109)
... 12 more
Can someone who has used Pig with Cassandra help me fix this?

You are using CqlStorage, which requires you to specify an output_query: a prepared statement that will be used to insert the data into the column family. The DSE Pig documentation provides an example:
grunt> STORE insertformat INTO
'cql://cql3ks/simple_table1?output_query=UPDATE+cql3ks.simple_table1+set+b+%3D+%3F'
USING CqlStorage;
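Applied to the users table from the question, a minimal sketch could look like the one below. This assumes the keyspace is ram_keyspace and that, as in the DSE example, the key columns are passed as ('name', value) tuples while the remaining columns are passed as plain values bound, in order, to the ? placeholders of the URL-encoded UPDATE; double-check the exact value-tuple shape against the CqlStorage documentation for your Cassandra version.
A = load '/user/hduser/user.txt' using PigStorage(',')
    as (id:chararray, age:int, fname:chararray, lname:chararray);
-- Key columns as ('name', value) tuples; non-key columns as plain values
-- bound to the ? placeholders of the output_query, in order.
C = foreach A GENERATE TOTUPLE(TOTUPLE('user_id', id)), TOTUPLE(age, fname, lname);
-- output_query is the URL-encoded form of:
--   UPDATE ram_keyspace.users SET age = ?, first = ?, last = ?
STORE C INTO 'cql://ram_keyspace/users?output_query=UPDATE+ram_keyspace.users+SET+age+%3D+%3F%2C+first+%3D+%3F%2C+last+%3D+%3F' USING CqlStorage();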

Related

Partitioned table on Synapse

I'm trying to create a new partitioned table on my SQL DW (Synapse) based on a partitioned table in Spark (Synapse) with:
%%spark
val df1 = spark.sql("SELECT * FROM sparkTable")
df1.write.partitionBy("year").sqlanalytics("My_SQL_Pool.dbo.StudentFromSpak", Constants.INTERNAL )
Error : StructuredStream-spark package version: 2.4.5-1.3.1
StructuredStream-spark package version: 2.4.5-1.3.1
StructuredStream-spark package version: 2.4.5-1.3.1
java.sql.SQLException:
com.microsoft.sqlserver.jdbc.SQLServerException: External file access
failed due to internal error: 'File
/synapse/workspaces/test-partition-workspace/sparkpools/myspark/sparkpoolinstances/c5e00068-022d-478f-b4b8-843900bd656b/livysessions/2021/03/09/1/tempdata/SQLAnalyticsConnectorStaging/application_1615298536360_0001/aDtD9ywSeuk_shiw47zntKz.tbl/year=2000/part-00004-5c3e4b1a-a580-4c7e-8381-00d92b0d32ea.c000.snappy.parquet:
HdfsBridge::CreateRecordReader - Unexpected error encountered
creating the record reader: HadoopExecutionException: Column count
mismatch. Source file has 5 columns, external table definition has 6
columns.' at
com.microsoft.spark.sqlanalytics.utils.SQLAnalyticsJDBCWrapper.executeUpdateStatement(SQLAnalyticsJDBCWrapper.scala:89)
at
thanks
The sqlanalytics() function name has been changed to synapsesql(). It does not currently support writing partitioned tables, but you could implement this yourself, e.g. by writing multiple tables back to the dedicated SQL pool and then using partition switching there.
The syntax is simply (as per the documentation):
df.write.synapsesql("<DBName>.<Schema>.<TableName>", <TableType>)
An example would be:
df.write.synapsesql("yourDb.dbo.yourTablePartition1", Constants.INTERNAL)
df.write.synapsesql("yourDb.dbo.yourTablePartition2", Constants.INTERNAL)
Now do the partition switching in the database using the ALTER TABLE ... SWITCH PARTITION syntax.

Not able to load simple table from Cassandra using Pig

I am trying to load a simple table created in Cassandra using a CQL command, but the load fails when I try to DUMP it. My Pig script looks like this:
A = LOAD 'cql://pigtest/myusers' USING CqlStorage()
AS (user_id:int,fname:chararray,lname:chararray);
describe A;
DUMP A;
My users table schema looks like
CREATE TABLE users (
user_id int PRIMARY KEY,
fname text,
lname text
)
I am getting the following exception (I tried with Cassandra 2.0.9 and 2.1.0, and Pig 0.13). Please help us find the root cause.
ERROR 1002: Unable to store alias A
Caused by: InvalidRequestException(why:Expected 8 or 0 byte long (7))
at org.apache.cassandra.thrift.Cassandra$execute_prepared_cql3_query_result$execute_prepared_cql3_query_resultStandardScheme.read(Cassandra.java:54918)
at org.apache.cassandra.thrift.Cassandra$execute_prepared_cql3_query_result$execute_prepared_cql3_query_resultStandardScheme.read(Cassandra.java:54895)
at org.apache.cassandra.thrift.Cassandra$execute_prepared_cql3_query_result.read(Cassandra.java:54810)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.cassandra.thrift.Cassandra$Client.recv_execute_prepared_cql3_query(Cassandra.java:1861)
at org.apache.cassandra.thrift.Cassandra$Client.execute_prepared_cql3_query(Cassandra.java:1846)
at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader$RowIterator.executeQuery(CqlPagingRecordReader.java:635)
... 28 more
Verify that the partitioner is the same on the server and on the client (e.g. Murmur3Partitioner vs RandomPartitioner). On the server:
> cqlsh -e "describe cluster" | head
Cluster: Test Cluster
Partitioner: Murmur3Partitioner
In the Pig script:
set cassandra.input.partitioner.class org.apache.cassandra.dht.Murmur3Partitioner;
set cassandra.output.partitioner.class org.apache.cassandra.dht.Murmur3Partitioner;
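Putting that together with the LOAD from the question, the start of the script would look something like this (a sketch assuming the same pigtest/myusers table and a Murmur3Partitioner cluster as shown above):
set cassandra.input.partitioner.class org.apache.cassandra.dht.Murmur3Partitioner;
set cassandra.output.partitioner.class org.apache.cassandra.dht.Murmur3Partitioner;
A = LOAD 'cql://pigtest/myusers' USING CqlStorage()
    AS (user_id:int, fname:chararray, lname:chararray);
DUMP A;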

Pig CQL not working with greater/less than conditions

I am using Pig 0.11.1 with Cassandra 1.2.12. The primary key for "employees" is ((date, id), joindatems). When I write a Pig CQL load like
data = LOAD
'cql://company/employees?where_clause=+joindatems+%3C+1388769238536746+and+joindatems+%3E+1388768338536746'
using CqlStorage();
joindatems is a datetime in milliseconds, and here I have the condition joindatems < 1388769238536746 and joindatems > 1388768338536746.
When I run this I get:
Caused by: InvalidRequestException(why:joindatems cannot be restricted by both an equal and an inequal relation)
But this kind of query (joindatems < 1388769238536746 and joindatems > 1388768338536746) is supported in Cassandra CQL. Am I doing anything wrong in the Pig script?
Thanks in advance

Pig into Cassandra - pass list objects using a Python UDF and CqlStorage

I am working on a dataflow that includes some aggregation steps in Pig and storage steps into Cassandra. I have been able to pass relatively simple data types such as integers, longs or dates, but I can't find out how to pass lists, sets or tuples from Pig to Cassandra using CqlStorage.
I use Pig 0.9.2, so I can't use the FLATTEN methods.
Question
How do I fill a Cassandra table carrying complex data types such as sets or lists from Pig 0.9.2?
Overview of my specific application:
I created the corresponding Cassandra table matching this description:
CREATE TABLE mycassandracf (
my_id int,
date timestamp,
my_count bigint,
grouped_ids list<bigint>,
PRIMARY KEY (my_id, date));
and a STORE instruction carrying a prepared statement:
STORE CassandraAggregate
INTO 'cql://test/mycassandracf?output_query=UPDATE+test.mycassandracf+set+my_count+%3D+%3F%2C+grouped_ids+%3D+%3F'
USING CqlStorage;
From a 'GROUP BY' relation, I 'generate' a relation in a CQL-friendly format (e.g. in tuples) that I want to store into Cassandra:
CassandraAggregate = FOREACH GroupedRelation
GENERATE TOTUPLE(TOTUPLE('my_id', $0.my_id),
TOTUPLE('date', ISOToUnix($0.createdAt))),
TOTUPLE(COUNT($1), $1.grouped_id);
DUMP CassandraAggregate;
(((my_id,30021),(date,1357084800000)),(2,{(60128490006325819),(62726281032786005)}))
(((my_id,30165),(date,1357084800000)),(1,{(60128411174143024)}))
(((my_id,30376),(date,1357084800000)),(4,{(60128411146211875),(63645100121476995),(60128411146211875),(63645100121476995)}))
Unsurprisingly, using the STORE instruction on this relation raises the exception:
java.lang.ClassCastException: org.apache.pig.data.DefaultDataBag cannot be cast to org.apache.pig.data.DataByteArray
I thus added a UDF written in Python to apply some flattening on the grouped_id bag:
#outputSchema("flat_bag:bag{}")
def flattenBag(bag):
    return tuple([long(item) for tup in bag for item in tup])
I use a tuple because using Python sets as well as Python lists ends up in casting errors.
Adding it to my pipeline, I have:
CassandraAggregate = FOREACH GroupedRelation
GENERATE TOTUPLE(TOTUPLE('my_id', $0.my_id),
TOTUPLE('date', ISOToUnix($0.createdAt))),
TOTUPLE(COUNT($1), py_f.flattenBag($1.grouped_id));
DUMP CassandraAggregate;
(((my_id,30021),(date,1357084800000)),(2,(60128490006325819,62726281032786005)))
(((my_id,31120),(date,1357084800000)),(1,(60128411174143024)))
(((my_id,31120),(date,1357084800000)),(1,(60128411146211875,63645100121476995,6012841114621187563645100121476995)))
Using the STORE instruction on this last relation raises the exception with error stack:
java.io.IOException: java.io.IOException: org.apache.thrift.transport.TTransportException
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:465)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:428)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:408)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:262)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:652)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:260)
Caused by: java.io.IOException: org.apache.thrift.transport.TTransportException
at org.apache.cassandra.hadoop.cql3.CqlRecordWriter$RangeClient.run(CqlRecordWriter.java:248)
Caused by: org.apache.thrift.transport.TTransportException
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
at org.apache.cassandra.thrift.Cassandra$Client.recv_execute_prepared_cql3_query(Cassandra.java:1724)
at org.apache.cassandra.thrift.Cassandra$Client.execute_prepared_cql3_query(Cassandra.java:1709)
at org.apache.cassandra.hadoop.cql3.CqlRecordWriter$RangeClient.run(CqlRecordWriter.java:232)
I tested the exact same workflow with simple data types and it works perfectly. What I am really looking for is a way to fill a Cassandra table with complex types such as sets or lists from Pig.
Many thanks
After further investigation, I found the solution here:
https://issues.apache.org/jira/browse/CASSANDRA-5867
Basically, CqlStorage supports complex types. For that, the value should be represented by a nested tuple, carrying the data type itself as a string in its first element. For a list, this is how one does it:
# python
#outputSchema("flat_bag:bag{}")
def flattenBag(bag):
    return ('list',) + tuple([long(item) for tup in bag for item in tup])
Thus, in grunt:
# pig
CassandraAggregate = FOREACH GroupedRelation
GENERATE TOTUPLE(TOTUPLE('my_id', $0.my_id),
TOTUPLE('date', ISOToUnix($0.createdAt))),
TOTUPLE(COUNT($1), py_f.flattenBag($1.grouped_id));
DUMP CassandraAggregate;
(((my_id,30021),(date,1357084800000)),(2,(list, 60128490006325819,62726281032786005)))
(((my_id,31120),(date,1357084800000)),(1,(list, 60128411174143024)))
(((my_id,31120),(date,1357084800000)),(1,(list, 60128411146211875,63645100121476995,6012841114621187563645100121476995)))
This is then stored into Cassandra using the classic URL-encoded prepared statement.
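For completeness, that is the same STORE statement as earlier in the question, i.e. the URL-encoded form of UPDATE test.mycassandracf SET my_count = ?, grouped_ids = ?:
STORE CassandraAggregate
INTO 'cql://test/mycassandracf?output_query=UPDATE+test.mycassandracf+set+my_count+%3D+%3F%2C+grouped_ids+%3D+%3F'
USING CqlStorage;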
Hope this will be of some help.

DataStax Cassandra File System - Fixed Width Text File - Hive Integration Issue

I'm trying to read a fixed-width text file stored in the Cassandra File System (CFS) using Hive. I'm able to query the file when I run from the Hive client. However, when I try to run from the Hadoop Hive JDBC driver, it says the table is not available or the connection is bad. Below are the steps I followed.
Input file (employees.dat):
21736Ambalavanar Thirugnanam BOY-EAG 2005-05-091992-11-18
21737Anand Jeyamani BOY-AST 2005-05-091985-02-12
31123Muthukumar Rajendran BOY-EES 2009-08-121983-02-23
Starting Hive Client
bash-3.2# dse hive;
Logging initialized using configuration in file:/etc/dse/hive/hive-log4j.properties
Hive history file=/tmp/root/hive_job_log_root_201209250900_157600446.txt
hive> use HiveDB;
OK
Time taken: 1.149 seconds
Creating Hive External Table pointing to fixed width format text file
hive> CREATE EXTERNAL TABLE employees (empid STRING, firstname STRING, lastname STRING, dept STRING, dateofjoining STRING, dateofbirth STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
> WITH SERDEPROPERTIES ("input.regex" = "(.{5})(.{25})(.{25})(.{15})(.{10})(.{10}).*" )
> LOCATION 'cfs://hostname:9160/folder/';
OK
Time taken: 0.524 seconds
Do a select * from the table:
hive> select * from employees;
OK
21736 Ambalavanar Thirugnanam BOY-EAG 2005-05-09 1992-11-18
21737 Anand Jeyamani BOY-AST 2005-05-09 1985-02-12
31123 Muthukumar Rajendran BOY-EES 2009-08-12 1983-02-23
Time taken: 0.698 seconds
Doing a select with specific fields from the Hive table throws a permission error (first issue):
hive> select empid, firstname from employees;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
java.io.IOException: The ownership/permissions on the staging directory cfs:/tmp/hadoop-root/mapred/staging/root/.staging is not as expected. It is owned by root and permissions are rwxrwxrwx. The directory must be owned by the submitter root or by root and permissions must be rwx------
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:108)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:856)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:452)
at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:136)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:133)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1332)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1123)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:931)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:255)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:212)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:671)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:554)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Job Submission failed with exception 'java.io.IOException(The ownership/permissions on the staging directory cfs:/tmp/hadoop-root/mapred/staging/root/.staging is not as expected. It is owned by root and permissions are rwxrwxrwx. The directory must be owned by the submitter root or by root and permissions must be rwx------)'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask
The second issue is, when I try to run the select * query from the Hive JDBC driver (outside of the DSE/Cassandra nodes), it says the table employees is not available. The external table I created acts like a temporary table and does not get persisted. When I use 'hive> show tables;', the employees table is not listed. Can anyone please help me figure out the problem?
I don't have an immediate answer for the first issue, but the second looks like it's due to a known issue.
There is a bug in DSE 2.1 which drops external tables created from CFS files from the metastore when show tables is run. Only the table metadata is removed; the data remains in CFS, so if you recreate the table definition you shouldn't have to reload it. Tables backed by Cassandra column families are not affected by this bug. This has been fixed in the 2.2 release of DSE, which is due for release imminently.
I'm not familiar with the Hive JDBC driver, but if it issues a SHOW TABLES command at any point, it could be triggering this bug.
