Cassandra: Unable to import null value from csv - string

I am trying to import a CSV file into Cassandra. The CSV file was generated from Postgres and contains some null values.
Cassandra version:
[cqlsh 5.0.1 | Cassandra 3.5 | CQL spec 3.4.0 | Native protocol v4]
I am using this query to import:
copy reports (id,name,user_id,user_name,template_id,gen_epoch,exp_epoch,file_name,format,refile_size,is_sch,job_id,status,status_msg)
from '/home/reports.csv' with NULL='' and header=true and DELIMITER=',';
I keep on receiving this error:
Failed to import 66 rows: ParseError - invalid literal for int() with
base 10: '', given up without retries
However, when I changed all the null values to some random value, I was able to import that row using the same command. I have already tried all the solutions I found on the internet.
Please advise.

You can just leave the field empty in the CSV.
For example, if I want to write a null value into the DB, I can do something like:
cqlsh> copy my_table(id,value_column1,value_column2) from 'myimport.csv';
And in myimport.csv there will be
1234,,3
This way, value_column1 will end up with a null value.

This looks like a Cassandra bug (see https://issues.apache.org/jira/browse/CASSANDRA-11549). I haven't been able to find a way to get Cassandra to accept the nulls, so you may have to stick with a workaround for now: substitute some sentinel value for the nulls.
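If you end up going the sentinel route, here is a minimal preprocessing sketch in Python (the integer column positions and the sentinel value are assumptions, not from the original post):
import csv

# Positions of integer columns that may be empty; adjust to the real schema.
INT_COLUMNS = {2, 4, 9, 11}
SENTINEL = "-1"

with open("/home/reports.csv", newline="") as src, \
     open("/home/reports_fixed.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    writer.writerow(next(reader))  # copy the header row unchanged
    for row in reader:
        writer.writerow(SENTINEL if i in INT_COLUMNS and v == "" else v
                        for i, v in enumerate(row))
The rewritten file can then be loaded with the same COPY command, and the sentinel values cleaned up afterwards.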

Related

How to manage mangled data when importing from your source in sqoop or pyspark

I have been working on a project to import the Danish 2.5 million ATM transaction data set and derive some visualizations from it.
The data is hosted on a MySQL server provided by the university. The objective is to import the data using Sqoop and then apply a few transformations to it using pyspark.
Link to the dataset: https://www.kaggle.com/sparnord/danish-atm-transactions
The SQL server that hosts this information has a few rows which are intentionally or unintentionally mangled.
So I have a very basic Sqoop command which gets the data from the source database. However, I run into an issue with values that contain a double quote ("), especially in the column message_text.
Sqoop Command :
sqoop import --connect jdbc:mysql:{source-connection-string} --table SRC_ATM_TRANS --username {username} --password {password} --target-dir /user/root/etl_project --fields-terminated-by '|' --lines-terminated-by "\n" -m 1
Here is a sample row as it is imported:
2017|January|1|Sunday|21|Active|85|Diebold Nixdorf|København|Regnbuepladsen|5|1550|55.676|12.571|DKK|MasterCard|4531|Withdrawal|4017|"Suspected malfunction|0.000|55.676|13|2618425|0.000|277|1010|93|3|280.000|0|75|803|Clouds
However, the expected output should be:
2017|January|1|Sunday|21|Active|85|Diebold Nixdorf|København|Regnbuepladsen|5|1550|55.676|12.571|DKK|MasterCard|4531|Withdrawal|4017|"Suspected malfunction,0.000|55.676|13|2618425|0.000|277|1010|93|3|280.000|0|75|803|Clouds|Cloudy
At first I was okay with this, hoping that pyspark would handle the mangled data since the delimiters are specified.
But now I run into issues when populating my dataframe.
transactions = spark.read.option("sep","|").csv("/user/root/etl_project/part-m-00000", header = False,schema = transaction_schema)
However, when I inspect the rows, I see that the mangled data has caused the dataframe to put all of the affected values into a single column!
transactions.filter(transactions.message_code == "4017").collect()
Row(year=2017, month=u'January', day=1, weekday=u'Sunday', hour=17, atm_status=u'Active', atm_id=u'35', atm_manufacturer=u'NCR', atm_location=u'Aabybro', atm_streetname=u'\xc3\u0192\xcb\u0153stergade', atm_street_number=6, atm_zipcode=9440, atm_lat=57.162, atm_lon=9.73, currency=u'DKK', card_type=u'MasterCard', transaction_amount=7387, service=u'Withdrawal', message_code=u'4017', message_text=u'Suspected malfunction|0.000|57.158|10|2625037|0.000|276|1021|83|4|319.000|0|0|800|Clear', weather_lat=None, weather_lon=None, weather_city_id=None, weather_city_name=None, temp=None, pressure=None, humidity=None, wind_speed=None, wind_deg=None, rain_3h=None, clouds_all=None, weather_id=None, weather_main=None, weather_description=None)
At this point I am not sure what to do.
Do I go ahead and create temporary columns to manage this and use a regex replacement to fill in these values?
Or is there a better way to import the data and manage these mangled values, either in Sqoop or in pyspark?
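For what it's worth, Spark's CSV reader treats the double quote as a quote character by default, which is why everything after the stray " gets swallowed into message_text. A minimal sketch of disabling quote handling when reading the pipe-delimited file (the path comes from the question; the transaction_schema from the question can still be passed via .schema(...); this is an untested assumption, not a verified fix):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With quoting turned off, a stray " inside message_text is treated as
# ordinary data instead of opening a quoted field that swallows the
# remaining | delimiters.
transactions = (
    spark.read
    .option("sep", "|")
    .option("quote", "")   # empty string disables quote handling
    .csv("/user/root/etl_project/part-m-00000", header=False)
)
Alternatively, Sqoop's --enclosed-by / --escaped-by options can be used to control how such characters are written on the import side.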

How can I extract values from Cassandra output using Python?

I'm trying to connect to a Cassandra database through Python using the Cassandra driver, and the connection works without any problem. But when I try to fetch values from Cassandra, the output comes formatted like Row(values).
Python version: 3.6
Package: cassandra
from cassandra.cluster import Cluster
cluster = Cluster()
session = cluster.connect('employee')
k=session.execute("select count(*) from users")
print(k[0])
Output :
Row(count=11)
Expected :
11
From the documentation:
By default, each row in the result set will be a named tuple. Each row will have a matching attribute for each column defined in the schema, such as name, age, and so on. You can also treat them as normal tuples by unpacking them or accessing fields by position.
So you can access your data by name as k[0].count, or by position as k[0][0].
Please read the Getting Started document in the driver's documentation - it will answer most of your questions.
Cassandra returns everything through something called a row factory, which by default produces a named tuple.
In your case, you should access the output as k[0].count.
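To make the access patterns concrete, a short sketch based on the code in the question:
from cassandra.cluster import Cluster

cluster = Cluster()
session = cluster.connect('employee')
rows = session.execute("select count(*) from users")

row = rows[0]      # Row(count=11), a named tuple
print(row.count)   # access by column name -> 11
print(row[0])      # access by position    -> 11
(count,) = row     # or unpack it like any other tuple
print(count)       # -> 11
If you would rather get plain dicts or tuples for every query, the driver also lets you change session.row_factory (for example to cassandra.query.dict_factory).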

Cassandra : COPY data with cql function on column

I am trying to export and re-import data from a Cassandra table in order to change a timestamp column into a unix epoch column (i.e. type timestamp to bigint).
I tried exporting the data to CSV using the command below:
COPY raw_data(raw_data_field_id, toUnixTimestamp(dt_timestamp), value) TO 'raw_data_3_feb_19.csv' WITH PAGETIMEOUT=40 AND PAGESIZE=20;
but I am getting the error: Improper COPY command.
How can I fix this issue or is there a better way to achieve this?
from
raw_data_field_id | dt_timestamp | value
-------------------+---------------------------------+-------
23 | 2018-06-12 07:15:00.656000+0000 | 131.3
to
raw_data_field_id | dt_unix_timestamp_epoch | value
-------------------+---------------------------------+-------
23 | 1528787700656 | 131.3
The COPY command does not support applying functions to columns to transform the output.
I would say you have several solutions:
export the data to CSV using COPY, convert the timestamp values (using shell commands or a high-level language; see the sketch after this list) and import the result into a new table
export using echo "select raw_data_field_id, toUnixTimestamp(dt_timestamp), value from raw.raw_data;" | ccm node1 cqlsh > output.csv, adjust the CSV so it has a proper format and import it into a new table (this solution is from here)
write your own conversion tool using one of the Cassandra drivers (Python, Java, etc.)
maybe you could try something with a UDF, but I haven't tested this
You should also be aware that COPY FROM only supports datasets with fewer than 2 million rows.
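As a rough illustration of the first option, here is a minimal Python sketch that rewrites the exported CSV, converting the dt_timestamp column to epoch milliseconds (the column position and timestamp format are assumptions based on the sample rows above):
import csv
from datetime import datetime

TS_COLUMN = 1                           # position of dt_timestamp in the export (assumed)
TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f%z"    # e.g. 2018-06-12 07:15:00.656000+0000

with open("raw_data_3_feb_19.csv", newline="") as src, \
     open("raw_data_epoch.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        ts = datetime.strptime(row[TS_COLUMN], TS_FORMAT)
        row[TS_COLUMN] = str(int(ts.timestamp() * 1000))   # 1528787700656
        writer.writerow(row)
The converted file can then be loaded into the new table (with the bigint column) using COPY FROM.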

Unable to read column types from amazon redshift using psycopg2

I'm trying to access the types of columns in a table in Redshift using psycopg2.
I'm doing this by running a simple query on pg_table_def, as follows:
SELECT * FROM pg_table_def;
This returns the traceback:
psycopg2.NotSupportedError: Column "schemaname" has unsupported type "name"
So it seems like the types of the columns that store schema (and other similar information on further queries) are not supported by psycopg2.
Has anyone run into this issue or a similar one and is aware of a workaround? My primary goal in this is to be able to return the types of columns in the table. For the purposes of what I'm doing, I can't use another postgresql adapter.
Using:
python- 3.6.2
psycopg2- 2.7.4
pandas- 0.17.1
You could do something like the code below, and return the result back to the calling service.
cur.execute("select * from pg_table_def where tablename='sales'")
results = cur.fetchall()
for row in results:
print ("ColumnNanme=>"+row[2] +",DataType=>"+row[3]+",encoding=>"+row[4])
I'm not sure about the exception; if all the permissions are fine, it should work and print something like the following:
ColumnName=>salesid,DataType=>integer,encoding=>lzo
ColumnName=>commission,DataType=>numeric(8,2),encoding=>lzo
ColumnName=>saledate,DataType=>date,encoding=>lzo
ColumnName=>description,DataType=>character varying(255),encoding=>lzo
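If the NotSupportedError still comes up even with the right permissions, one workaround sometimes suggested for this kind of catalog query (an assumption here, not verified against the original setup) is to cast the name-typed columns to varchar so the driver never sees the unsupported type:
import psycopg2

# Connection parameters below are placeholders.
conn = psycopg2.connect(host="redshift-host", port=5439,
                        dbname="mydb", user="user", password="secret")
cur = conn.cursor()

# Casting the name-typed catalog columns to varchar means psycopg2 only
# ever sees types it knows how to handle.
cur.execute("""
    select "column"::varchar, type::varchar, encoding::varchar
    from pg_table_def
    where tablename = 'sales'
""")
for column_name, data_type, encoding in cur.fetchall():
    print("ColumnName=>" + column_name + ",DataType=>" + data_type + ",encoding=>" + encoding)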

CQL expecting set null error

I am trying to export Cassandra data to a file using CQL, but I am getting an 'expecting set null' error. My keyspace name and column family name are the same.
Cassandra version : 1.1.2
I actually need to export the Cassandra data to CSV or some other format, but I have tried most of the export commands and keep getting the same error. Is it an issue that the keyspace name and column family name are the same?
The first thing you should do is upgrade to 1.1.8. If that still doesn't work, file a bug report at https://issues.apache.org/jira/browse/CASSANDRA.
