Not able to load simple table from Cassandra using Pig - cassandra

I am trying to load a simple table created in Cassandra using a CQL command, but the load fails when I try to DUMP it. My Pig script looks like this:
A = LOAD 'cql://pigtest/myusers' USING CqlStorage()
AS (user_id:int,fname:chararray,lname:chararray);
describe A;
DUMP A;
My users table schema looks like:
CREATE TABLE users (
user_id int PRIMARY KEY,
fname text,
lname text
);
I am getting the following exception (I tried with Cassandra 2.0.9 and 2.1.0, and Pig 0.13). Please help me find the root cause.
ERROR 1002: Unable to store alias A
Caused by: InvalidRequestException(why:Expected 8 or 0 byte long (7))
at org.apache.cassandra.thrift.Cassandra$execute_prepared_cql3_query_result$execute_prepared_cql3_query_resultStandardScheme.read(Cassandra.java:54918)
at org.apache.cassandra.thrift.Cassandra$execute_prepared_cql3_query_result$execute_prepared_cql3_query_resultStandardScheme.read(Cassandra.java:54895)
at org.apache.cassandra.thrift.Cassandra$execute_prepared_cql3_query_result.read(Cassandra.java:54810)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.cassandra.thrift.Cassandra$Client.recv_execute_prepared_cql3_query(Cassandra.java:1861)
at org.apache.cassandra.thrift.Cassandra$Client.execute_prepared_cql3_query(Cassandra.java:1846)
at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader$RowIterator.executeQuery(CqlPagingRecordReader.java:635)
... 28 more

Verify that the partitioner is the same on the server and on the client (Murmur3Partitioner vs. RandomPartitioner).
> cqlsh -e "describe cluster" | head
Cluster: Test Cluster
Partitioner: Murmur3Partitioner
Pig Script
set cassandra.input.partitioner.class org.apache.cassandra.dht.Murmur3Partitioner;
set cassandra.output.partitioner.class org.apache.cassandra.dht.Murmur3Partitioner;
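Putting the two together, the script from the question would start roughly like this (a minimal sketch; it assumes the cluster really does run Murmur3Partitioner, as the cqlsh output above shows, and reuses the keyspace, table, and column names from the question):
-- match the cluster's partitioner before touching the table
set cassandra.input.partitioner.class org.apache.cassandra.dht.Murmur3Partitioner;
set cassandra.output.partitioner.class org.apache.cassandra.dht.Murmur3Partitioner;
A = LOAD 'cql://pigtest/myusers' USING CqlStorage()
    AS (user_id:int, fname:chararray, lname:chararray);
DESCRIBE A;
DUMP A;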

Related

Partitioned table on synapse

I'm trying to create a new partitioned table in my SQL DW (Synapse) based on a partitioned table in Spark (Synapse) with:
%%spark
val df1 = spark.sql("SELECT * FROM sparkTable")
df1.write.partitionBy("year").sqlanalytics("My_SQL_Pool.dbo.StudentFromSpak", Constants.INTERNAL )
Error : StructuredStream-spark package version: 2.4.5-1.3.1
StructuredStream-spark package version: 2.4.5-1.3.1
StructuredStream-spark package version: 2.4.5-1.3.1
java.sql.SQLException:
com.microsoft.sqlserver.jdbc.SQLServerException: External file access
failed due to internal error: 'File
/synapse/workspaces/test-partition-workspace/sparkpools/myspark/sparkpoolinstances/c5e00068-022d-478f-b4b8-843900bd656b/livysessions/2021/03/09/1/tempdata/SQLAnalyticsConnectorStaging/application_1615298536360_0001/aDtD9ywSeuk_shiw47zntKz.tbl/year=2000/part-00004-5c3e4b1a-a580-4c7e-8381-00d92b0d32ea.c000.snappy.parquet:
HdfsBridge::CreateRecordReader - Unexpected error encountered
creating the record reader: HadoopExecutionException: Column count
mismatch. Source file has 5 columns, external table definition has 6
columns.' at
com.microsoft.spark.sqlanalytics.utils.SQLAnalyticsJDBCWrapper.executeUpdateStatement(SQLAnalyticsJDBCWrapper.scala:89)
at
thanks
The sqlanalytics() function name has been changed to synapsesql(). It does not currently support writing partitioned tables, but you could implement this yourself, e.g. by writing multiple tables back to the dedicated SQL pool and then using partition switching there.
The syntax is simply (as per the documentation):
df.write.synapsesql("<DBName>.<Schema>.<TableName>", <TableType>)
An example would be:
df.write.synapsesql("yourDb.dbo.yourTablePartition1", Constants.INTERNAL)
df.write.synapsesql("yourDb.dbo.yourTablePartition2", Constants.INTERNAL)
Now do the partition switching in the database using the ALTER TABLE ... SWITCH PARTITION syntax.
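For example, once yourTablePartition2 holds exactly the rows that belong in partition 2 of the target table, the switch could look roughly like this (a sketch; the target table name and partition number are illustrative, and the staging table must match the target's schema and partition boundary):
-- move the staged rows into partition 2 of the partitioned target table
ALTER TABLE dbo.yourTablePartition2
    SWITCH TO dbo.yourPartitionedTable PARTITION 2;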

Spark: Can't set table properties starting with "spark.sql" on a Hive external table while creating it

Env : linux (spark-submit xxx.py)
Target database : Hive
We used to use Beeline to execute HQL, but now we are trying to run the HQL through PySpark and ran into an issue when setting table properties while creating the table.
SQL
CREATE EXTERNAL TABLE example.a(
column_a string)
TBLPROPERTIES (
'discover.partitions'='true',
'spark.sql.sources.schema.numPartCols'='1',
'spark.sql.sources.schema.numParts'='1',
'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"column_a","type":"string","nullable":true,"metadata":{}}]}',
'spark.sql.sources.schema.partCol.0'='received_utc_date_partition');
Error message
Hive - ERROR - Cannot persist
example.a into Hive metastore as table property
keys may not start with 'spark.sql.': [spark.sql.sources.schema.partCol.0, spark.sql.sources.schema.numParts,
spark.sql.sources.schema.numPartCols, spark.sql.sources.schema.part.0];
In lines 130-147 of the Spark source code it seems that it prevents all table properties that start with "spark.sql".
Not sure if I did something wrong or if there is another way to set these table properties for a Hive table.
Any kind of suggestion is appreciated.
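The guard the question refers to is, roughly, a prefix check of this shape (a paraphrased, self-contained sketch, not the exact Spark source; Spark's HiveExternalCatalog raises an AnalysisException, and the object and method names here are illustrative):
// paraphrased sketch of the table-property guard in Spark's Hive catalog (illustrative only)
object TablePropertyGuard {
  private val SparkSqlPrefix = "spark.sql."

  def verifyTableProperties(props: Map[String, String]): Unit = {
    val invalid = props.keys.filter(_.startsWith(SparkSqlPrefix)).toSeq
    if (invalid.nonEmpty) {
      throw new IllegalArgumentException(
        s"Cannot persist table: property keys may not start with '$SparkSqlPrefix': " +
          invalid.mkString("[", ", ", "]"))
    }
  }
}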

User does not have privileges for ALTERTABLE_ADDCOLS while using spark.sql to read the data

Select query in spark.sql is resulting in the following error:
User *username* does not have privileges for ALTERTABLE_ADDCOLS
Spark version - 2.1.0
Trying to execute the following query:
dig = spark.sql("""select col1, col2 from dbname.tablename""")
This is caused by the spark.sql.hive.caseSensitiveInferenceMode property.
By default, Spark tries to infer the table's schema and then alter the table's properties.
To avoid these messages you can change that configuration to INFER_ONLY.
Assuming a Spark session named spark, the code below should work:
spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "INFER_ONLY")
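If the session is created in your own code, the same setting can also be applied when building it, before any Hive table is read (a sketch; the app name is a placeholder and the query is the one from the question):
from pyspark.sql import SparkSession

# set the inference mode at session build time, before the first Hive read,
# so Spark never tries to alter the table's properties
spark = (
    SparkSession.builder
    .appName("example-app")  # placeholder app name
    .config("spark.sql.hive.caseSensitiveInferenceMode", "INFER_ONLY")
    .enableHiveSupport()
    .getOrCreate()
)

dig = spark.sql("select col1, col2 from dbname.tablename")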

How to insert data in cassandra using Pig

I am trying to copy data from a file in HDFS to a table in Cassandra using Pig, but the job fails with a null pointer exception while storing the data in Cassandra. Can someone help me with this?
Users table structure:
CREATE TABLE users (
user_id text PRIMARY KEY,
age int,
first text,
last text
)
My pig script
A = load '/user/hduser/user.txt' using PigStorage(',') as (id:chararray,age:int,fname:chararray,lname:chararray);
C = foreach A GENERATE TOTUPLE(TOTUPLE('user_id',id)), TOTUPLE('age',age),TOTUPLE('first',fname),TOTUPLE('last',lname);
STORE C into 'cql://ram_keyspace/users' USING CqlStorage();
Exception:
java.lang.RuntimeException: java.lang.NullPointerException
at org.apache.cassandra.hadoop.cql3.CqlRecordWriter.(CqlRecordWriter.java:123)
at org.apache.cassandra.hadoop.cql3.CqlRecordWriter.(CqlRecordWriter.java:90)
at org.apache.cassandra.hadoop.cql3.CqlOutputFormat.getRecordWriter(CqlOutputFormat.java:76)
at org.apache.cassandra.hadoop.cql3.CqlOutputFormat.getRecordWriter(CqlOutputFormat.java:57)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.getRecordWriter(PigOutputFormat.java:84)
at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.(MapTask.java:627)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:753)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.NullPointerException
at org.apache.cassandra.hadoop.cql3.CqlRecordWriter.(CqlRecordWriter.java:109)
... 12 more
Can someone who has used Pig with Cassandra help me fix this?
You are using CqlStorage, which requires you to specify an output_query: a prepared statement that is used to insert the data into the column family. The DSE Pig documentation provides an example:
grunt> STORE insertformat INTO
'cql://cql3ks/simple_table1?output_query=UPDATE+cql3ks.simple_table1+set+b+%3D+%3F'
USING CqlStorage;
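Adapted to the users table from the question, the store would look roughly like this (a sketch; the output_query is URL-encoded with + for spaces and %3D %3F for "= ?", and it sets every non-key column so that the value tuples in C line up with the placeholders in order):
-- user_id comes from the leading key tuple; age, first and last bind to the three ? placeholders
STORE C INTO
  'cql://ram_keyspace/users?output_query=UPDATE+ram_keyspace.users+SET+age+%3D+%3F,first+%3D+%3F,last+%3D+%3F'
  USING CqlStorage();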

pig cql not working with greater/lesser than condition

I am using Pig 0.11.1 with Cassandra 1.2.12. When I write a Pig CQL load like
PK for "employees" is ((date,id),joindatems)
data = LOAD
'cql://company/employees?where_clause=+joindatems+%3C+1388769238536746+and+joindatems+%3E+1388768338536746'
using CqlStorage();
Here joindatems is a datetime in milliseconds, and I have the condition joindatems < 1388769238536746 and joindatems > 1388768338536746.
When I run this I get
Caused by: InvalidRequestException(why:joindatems cannot be restricted by both an equal and an inequal relation)
But this kind of query (joindatems < 1388769238536746 and joindatems > 1388768338536746) is supported in Cassandra CQL. Am I doing anything wrong in the Pig script?
Thanks in advance
