CQL expecting set null error - cassandra

I am trying to export Cassandra data to a file using CQL, but I am getting an 'expecting set null' error. My keyspace name and column family name are the same.
Cassandra version : 1.1.2
I need to export the Cassandra data to CSV or any other format. I have tried most of the export commands, but they all give the same error. Could the keyspace and column family having the same name be the issue?

The first thing you should do is upgrade to 1.1.8. If that still doesn't work, file a bug report at https://issues.apache.org/jira/browse/CASSANDRA.

Related

How can I extract values from Cassandra output using Python?

I'm trying to connect to a Cassandra database from Python using the cassandra driver, and the connection works without any problem. But when I fetch values from Cassandra, the output comes back wrapped, like Row(values).
python version 3.6
package : cassandra
from cassandra.cluster import Cluster
cluster = Cluster()
session = cluster.connect('employee')
k=session.execute("select count(*) from users")
print(k[0])
Output :
Row(count=11)
Expected :
11
From documentation:
By default, each row in the result set will be a named tuple. Each row will have a matching attribute for each column defined in the schema, such as name, age, and so on. You can also treat them as normal tuples by unpacking them or accessing fields by position.
So you can access your data by name as k[0].count, or by position as k[0][0].
Please read the Getting Started guide in the driver's documentation; it will answer most of your questions.
The driver returns every row through something called a row factory, which by default produces named tuples.
In your case, to get the value you should access k[0].count.
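As a rough illustration of the named-tuple behaviour described above, the sketch below builds a stand-in row with collections.namedtuple instead of querying a real cluster. The Row class and the k list here are local substitutes for what session.execute returns, not the driver's own types.

```python
from collections import namedtuple

# Stand-in for a driver row: a named tuple with a single "count" column.
Row = namedtuple("Row", ["count"])

# Stand-in for the result of session.execute("select count(*) from users").
k = [Row(count=11)]

first = k[0]
by_name = first.count   # access by column name
by_position = first[0]  # access by position
(unpacked,) = first     # unpack like a normal tuple

print(by_name, by_position, unpacked)
```

All three variables hold the plain integer rather than the Row wrapper, which is what the question expects to see.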

PySpark cannot insertInto Hive table because "Can only write data to relations with a single path"

I have a Hive ORC table with a definition similar to the following:
CREATE EXTERNAL TABLE `example.example_table`(
...
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
'path'='s3a://path/to/table')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
's3a://path/to/table'
TBLPROPERTIES (
...
)
I am attempting to use PySpark to append a DataFrame to this table with df.write.insertInto("example.example_table"). When running this, I get the following error:
org.apache.spark.sql.AnalysisException: Can only write data to relations with a single path.;
at org.apache.spark.sql.execution.datasources.DataSourceAnalysis$$anonfun$apply$1.applyOrElse(DataSourceStrategy.scala:188)
at org.apache.spark.sql.execution.datasources.DataSourceAnalysis$$anonfun$apply$1.applyOrElse(DataSourceStrategy.scala:134)
...
When looking at the underlying Scala code, the condition that throws this error is checking to see if the table location has multiple "rootPaths". Obviously, my table is defined with a single location. What else could cause this?
It is the path property you are defining that causes the error. I just ran into this same problem myself. Hive generates a location path based on the hive.metastore.warehouse.dir property, so the table ends up with that default location plus the path you specified, which makes the check in that linked code fail.
If you want to pick a specific path other than the default, then try using LOCATION.
Try running DESCRIBE EXTENDED example.example_table to see more detailed information on the table. One of the output rows, Detailed Table Information, shows the storage location and the SerDe parameters, among other things:
Table(
tableName:
dbName:
owner:
createTime:1548335003
lastAccessTime:0
retention:0
sd:StorageDescriptor(cols:
location:[*path_to_table*]
inputFormat:org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
outputFormat:org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
compressed:false
numBuckets:-1
serdeInfo:SerDeInfo(
name:null
serializationLib:org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
parameters:{
serialization.format=1
path=[*path_to_table*]
}
)
bucketCols:[]
sortCols:[]
parameters:{}
skewedInfo:SkewedInfo(skewedColNames:[]
skewedColValues:[]
skewedColValueLocationMaps:{})
storedAsSubDirectories:false
)
partitionKeys:[]
parameters:{transient_lastDdlTime=1548335003}
viewOriginalText:null
viewExpandedText:null
tableType:MANAGED_TABLE
rewriteEnabled:false
)
We had the same problem in a project when migrating from Spark 1.x and HDFS to Spark 3.x and S3. We solved it by setting the following Spark property to false:
spark.sql.hive.convertMetastoreParquet
You can just run
spark.sql("SET spark.sql.hive.convertMetastoreParquet=false")
Or, alternatively:
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
Here spark is the SparkSession object. The explanation for this setting is currently in the Spark documentation.
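The answer above names the Parquet flag, while the table in the question is stored as ORC. Spark has a matching flag for ORC, spark.sql.hive.convertMetastoreOrc, so a small helper covering both might look like the sketch below. The helper name and the CONVERSION_FLAGS mapping are made up for illustration, and whether the ORC flag resolves the error in the question is an assumption, not something confirmed in this thread.

```python
# Hypothetical mapping from file format to the Spark property that controls
# whether Hive metastore tables are converted to Spark's native readers/writers.
CONVERSION_FLAGS = {
    "parquet": "spark.sql.hive.convertMetastoreParquet",
    "orc": "spark.sql.hive.convertMetastoreOrc",
}


def disable_metastore_conversion(spark, file_format="parquet"):
    """Set the conversion flag for the given format to false on a SparkSession.

    `spark` is expected to be a SparkSession (anything exposing conf.set works).
    Returns the property name that was set.
    """
    key = CONVERSION_FLAGS[file_format.lower()]
    spark.conf.set(key, "false")
    return key
```

With a live session this would be called as disable_metastore_conversion(spark, "orc") before running insertInto.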

Spark returns Empty DataFrame but Populated in Hive

I have a table in hive
db.table_name
When I run the following in hive I get results back
SELECT * FROM db.table_name;
When I run the following in a spark-shell
spark.read.table("db.table_name").show
It shows nothing. Similarly
sql("SELECT * FROM db.table_name").show
Also shows nothing. Selecting arbitrary columns out before the show also displays nothing. Performing a count states the table has 0 rows.
Running the same queries works against other tables in the same database.
Spark Version: 2.2.0.cloudera1
The table is created using
table.write.mode(SaveMode.Overwrite).saveAsTable("db.table_name")
And if I read the Parquet files directly, it works:
spark.read.parquet(<path-to-files>).show
EDIT:
I'm currently using a workaround: I describe the table, take the location from the output, and read it with spark.read.parquet.
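The workaround in the edit could be sketched in PySpark roughly as follows. It assumes that DESCRIBE FORMATTED returns rows with col_name and data_type fields and that one row is labelled Location (with or without a trailing colon, depending on the Spark version). Both function names are hypothetical.

```python
def find_table_location(describe_rows):
    """Return the storage path from DESCRIBE FORMATTED rows, or None if absent."""
    for row in describe_rows:
        # Normalise labels such as "Location" and "Location:".
        label = (row["col_name"] or "").strip().rstrip(":")
        if label == "Location":
            return row["data_type"].strip()
    return None


def read_table_files(spark, table_name):
    """Look up the table's location and read the Parquet files there directly."""
    rows = spark.sql("DESCRIBE FORMATTED {}".format(table_name)).collect()
    location = find_table_location(rows)
    if location is None:
        raise ValueError("No Location row found for {}".format(table_name))
    return spark.read.parquet(location)
```

This bypasses the metastore table definition entirely, so it only helps for reading and does not explain why the table itself appears empty.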
Have you refreshed the table metadata? You may need to refresh the table to see new data.
spark.catalog.refreshTable("my_table")
I solved the problem by using
query_result.write.mode(SaveMode.Overwrite).format("hive").saveAsTable("table")
which stores the results as a text file.
There is probably some incompatibility with Hive's Parquet handling.
I also found a Cloudera report about it (CDH Release Notes): they recommend creating the Hive table manually and then loading data from a temporary table or by query.

Cassandra: Unable to import null value from csv

I am trying to import a csv file to Cassandra. The csv file has been generated from Postgres and it contains some null values.
Cassandra version:
[cqlsh 5.0.1 | Cassandra 3.5 | CQL spec 3.4.0 | Native protocol v4]
I am using this query to import:
copy reports
(id,name,user_id,user_name,template_id,gen_epoch,exp_epoch,file_name,format,refile_size,is_sch,job_id,status,status_msg)
from '/home/reports.csv' with NULL='' and header=true and DELIMITER =
',';
I keep on receiving this error:
Failed to import 66 rows: ParseError - invalid literal for int() with
base 10: '', given up without retries
However, when I changed all the null values to some random value, I was able to import that row using the same command. I have already tried all the solutions I found on the internet.
Please advise.
You can simply leave the field empty in the CSV.
For example, if I want to overwrite a value in the DB with null, I can do something like:
cqlsh$> copy my_table(id,value_column1,value_column2) from 'myimport.csv';
And in myimport.csv, there will be
1234,,3
This way, value_column1 will be set to null.
This looks like a Cassandra bug (see https://issues.apache.org/jira/browse/CASSANDRA-11549). I haven't been able to find a way to get Cassandra to accept the nulls. You may have to stick with a workaround for now: substitute some sentinel value for the nulls.
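One way to apply the sentinel workaround is to preprocess the CSV before running COPY. The sketch below is only an illustration: the function name, the choice of "-1" as the sentinel, and the idea of limiting the substitution to the integer columns are all assumptions, and the sentinel must be a value that never occurs in the real data.

```python
import csv
import io


def fill_empty_ints(src, dst, int_columns, sentinel="-1"):
    """Copy CSV rows from src to dst, replacing empty fields in the
    named integer columns with a sentinel value (hypothetical default "-1")."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for col in int_columns:
            if row[col] == "":
                row[col] = sentinel
        writer.writerow(row)


# Small in-memory demonstration; real use would pass open file objects.
source = io.StringIO("id,name,user_id\n1,,\n2,bob,7\n")
target = io.StringIO()
fill_empty_ints(source, target, int_columns=["user_id"])
result = target.getvalue()
```

Empty text fields are left untouched, since the error in the question only concerns columns parsed with int().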

How to write into a Microsoft SQL Server table even if the table exists using PySpark

I have PySpark code that writes to a SQL Server database like this:
df.write.jdbc(url=url, table="AdventureWorks2012.dbo.people", properties=properties)
However, I want to keep writing to the people table even if it already exists. The Spark documentation lists the possible values for mode as error, append, overwrite and ignore, but all of them fail with an 'object already exists' error when the table is already in the database.
Spark throws the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o43.jdbc.
com.microsoft.sqlserver.jdbc.SQLServerException: There is already an object named 'people' in the database
Is there a way to write data into the table even if the table already exists?
Please let me know if you need more explanation.
For me the issue was with Spark 1.5.2. It checks whether the table exists by running SELECT 1 FROM $table LIMIT 1. If the query fails, Spark assumes the table doesn't exist. SQL Server does not support LIMIT, so that query failed even when the table was there.
This was changed to SELECT * FROM $table WHERE 1=0 in 1.6.0.
So from 1.6.0 on, append and overwrite modes will not throw an error when the table already exists. The Spark documentation ( http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes ) describes SaveMode.Append as "When saving a DataFrame to a data source, if data/table already exists, contents of the DataFrame are expected to be appended to existing data." and SaveMode.Overwrite as "Overwrite mode means that when saving a DataFrame to a data source, if data/table already exists, existing data is expected to be overwritten by the contents of the DataFrame." Depending on how you want to handle the existing table, one of these two should likely meet your needs.
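Applied to the call from the question, passing the mode explicitly might look like the sketch below. It assumes Spark 1.6.0 or later and that df, url and properties are defined as in the question. The wrapper function name is hypothetical.

```python
def append_to_existing_table(df, url, table, properties):
    """Append the rows of a DataFrame to a JDBC table that may already exist.

    `df` is expected to be a PySpark DataFrame; mode="append" asks Spark to
    add rows to the existing table instead of trying to create it.
    """
    df.write.jdbc(url=url, table=table, mode="append", properties=properties)
```

For the table in the question, the call would be append_to_existing_table(df, url, "AdventureWorks2012.dbo.people", properties). Using mode="overwrite" instead would replace the existing contents.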
