How can I add a CSV to a Cassandra DB?

I know it can be done in the traditional way, but if I were to use Cassandra DB, is there an easy/quick and agile way to add a CSV to the DB as a set of key-value pairs?
The ability to add time-series data coming in via a CSV file is my prime requirement. I am OK switching to another database such as MongoDB or Riak if it is conveniently doable there.

Edit 2 (Dec 02, 2017)
Please use port 9042. Cassandra access has changed to CQL, with 9042 as the default port; 9160 was the default port for Thrift.
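For example, connecting over CQL on 9042 with the DataStax Python driver (my assumption; any CQL driver will do) looks roughly like this:
# Minimal sketch, assuming `pip install cassandra-driver` and a node on localhost.
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'], port=9042)  # 9042 is the default CQL native-protocol port
session = cluster.connect()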
Edit 1
There is a better way to do this without any coding. Look at this answer https://stackoverflow.com/a/18110080/298455
However, if you want to pre-process the data or do something custom, you may want to do it yourself. Here is a lengthy method:
Create a column family.
cqlsh> create keyspace mykeyspace
       with strategy_class = 'SimpleStrategy'
       and strategy_options:replication_factor = 1;
cqlsh> use mykeyspace;
cqlsh:mykeyspace> create table stackoverflow_question
                  (id text primary key, name text, class text);
Assuming your CSV is like this:
$ cat data.csv
id,name,class
1,hello,10
2,world,20
Write a simple Python script to read the file and dump it into your CF, something like this (this uses the old pycassa Thrift client, hence port 9160):
import csv
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

pool = ConnectionPool('mykeyspace', ['localhost:9160'])
cf = ColumnFamily(pool, "stackoverflow_question")

with open('data.csv', 'rb') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print str(row)
        key = row['id']
        del row['id']
        cf.insert(key, row)

pool.dispose()
Execute this:
$ python loadcsv.py
{'class': '10', 'id': '1', 'name': 'hello'}
{'class': '20', 'id': '2', 'name': 'world'}
Look at the data:
cqlsh:mykeyspace> select * from stackoverflow_question;
id | class | name
----+-------+-------
2 | 20 | world
1 | 10 | hello
See also:
a. Beware of DictReader.
b. Look at Pycassa.
c. Google for existing CSV loaders for Cassandra; I guess there are some.
d. There may be a simpler way using a CQL driver (see the sketch after this list); I do not know.
e. Use appropriate data types. I just wrapped them all as text, which is not good.
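To illustrate point d, here is a minimal sketch with the DataStax Python driver (an assumption on my part; the table and data.csv are the ones from above):
# Hedged sketch: load data.csv through the CQL native driver instead of pycassa.
import csv
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'], port=9042)
session = cluster.connect('mykeyspace')

# Prepare the insert once and reuse it for every row.
insert = session.prepare(
    "INSERT INTO stackoverflow_question (id, name, class) VALUES (?, ?, ?)")

with open('data.csv') as csvfile:
    for row in csv.DictReader(csvfile):
        session.execute(insert, (row['id'], row['name'], row['class']))

cluster.shutdown()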
HTH
I did not see the time-series requirement. Here is how you do it for time-series data.
This is your data
$ cat data.csv
id,1383799600,1383799601,1383799605,1383799621,1383799714
1,sensor-on,sensor-ready,flow-out,flow-interrupt,sensor-killAll
Create a traditional wide row. (CQL suggests not using COMPACT STORAGE, but this is just to get you going quickly.)
cqlsh:mykeyspace> create table timeseries
                  (id text, timestamp text, data text, primary key (id, timestamp))
                  with compact storage;
This is the altered code:
import csv
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

pool = ConnectionPool('mykeyspace', ['localhost:9160'])
cf = ColumnFamily(pool, "timeseries")

with open('data.csv', 'rb') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print str(row)
        key = row['id']
        del row['id']
        for (timestamp, data) in row.iteritems():
            cf.insert(key, {timestamp: data})

pool.dispose()
This is your time series:
cqlsh:mykeyspace> select * from timeseries;
id | timestamp | data
----+------------+----------------
1 | 1383799600 | sensor-on
1 | 1383799601 | sensor-ready
1 | 1383799605 | flow-out
1 | 1383799621 | flow-interrupt
1 | 1383799714 | sensor-killAll

Let's say your CSV looks like
'P38-Lightning', 'Lockheed', 1937, '.7'
Open cqlsh against your DB, and create the table:
CREATE TABLE airplanes (
name text PRIMARY KEY,
manufacturer ascii,
year int,
mach float
);
Then load the CSV:
COPY airplanes (name, manufacturer, year, mach) FROM '/classpath/temp.csv';
Refer: http://www.datastax.com/docs/1.1/references/cql/COPY

To back up:
./cqlsh -e"copy <keyspace>.<table> to '../data/table.csv';"
To restore from the backup:
./cqlsh -e"copy <keyspace>.<table> from '../data/table.csv';"

There is now an open-source tool for bulk-loading data (local or remote) into Cassandra from multiple CSV or JSON files, called DataStax Bulk Loader (see its docs, source, and examples):
dsbulk load -url ~/my_data_folder -k keyspace1 -t table1 -header true

Related

psycopg2 export DB to csv with column names

I'm using psycopg2 to connect to a Postgres DB and to export the data into a CSV file.
This is how I export the DB to CSV:
def export_table_to_csv(self, table, csv_path):
    sql = "COPY (SELECT * FROM %s) TO STDOUT WITH CSV DELIMITER ','" % table
    self.cur.execute(sql)
    with open(csv_path, "w") as file:
        self.cur.copy_expert(sql, file)
But the data is just the rows - without the column names.
How can I export the data with the column names?
P.S. I am able to print the column names:
sql = '''SELECT * FROM test'''
self.cur.execute(sql)
column_names = [desc[0] for desc in self.cur.description]
for i in column_names:
    print(i)
I want the cleanest way to export the DB with column names (i.e. I prefer to do this in one method, rather than renaming the columns in retrospect).
As I said in my comment, you can add HEADER to the WITH clause of your SQL:
sql = "COPY (SELECT * FROM export_test) TO STDOUT WITH CSV HEADER"
By default, a comma delimiter is used with the CSV option, so you don't need to specify it.
For future questions, you should submit a minimal reproducible example, i.e. code we can directly copy, paste, and run. I was curious whether this would work, so I made one and tried it:
import psycopg2
conn = psycopg2.connect('host=<host> dbname=<dbname> user=<user>')
cur = conn.cursor()
# create test table
cur.execute('DROP TABLE IF EXISTS export_test')
sql = '''CREATE TABLE export_test
(
id integer,
uname text,
fruit1 text,
fruit2 text,
fruit3 text
)'''
cur.execute(sql)
# insert data into table
sql = '''BEGIN;
insert into export_test
(id, uname, fruit1, fruit2, fruit3)
values(1, 'tom jones', 'apple', 'banana', 'pear');
insert into export_test
(id, uname, fruit1, fruit2, fruit3)
values(2, 'billy idol', 'orange', 'cherry', 'strawberry');
COMMIT;'''
cur.execute(sql)
# export to csv
fid = open('export_test.csv', 'w')
sql = "COPY (SELECT * FROM export_test) TO STDOUT WITH CSV HEADER"
cur.copy_expert(sql, fid)
fid.close()
And the resultant file is:
id,uname,fruit1,fruit2,fruit3
1,tom jones,apple,banana,pear
2,billy idol,orange,cherry,strawberry

Import a random csv as a table on the fly - Postgresql and Python

I am using a pgadmin client. I have multiple csv files.
I would like to import each csv file as a table.
When I tried the below
a) Click create table
b) Enter the name of table and save it.
c) I see the table name
d) Click on "Import csv"
e) selected columns as "header"
f) Clicked "Import"
But I got an error message as below
ERROR: extra data after last expected column
CONTEXT: COPY Test_table, line 2: "32,F,52,Single,WHITE,23/7/2180 12:35,25/7/2180..."
I also tried the python psycopg2 version as shown below
import psycopg2
conn = psycopg2.connect("host='xxx.xx.xx.x' port='5432' dbname='postgres' user='abc' password='xxx'")
cur = conn.cursor()
f = open(r'test.csv', 'r')
cur.copy_from(f, public.test, sep=',')  # while I see the 'test' table under my schema, how can I give the schema name here? I don't know why it says the table is not defined
f.close()
UndefinedTable: relation "public.test" does not exist
May I check whether it is possible to import some random CSV as a table using the pgAdmin import?
Pandas will do this easily; it creates a table with a structure matching the CSV:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html
The CSV is first read by read_csv into a DataFrame:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
Regards Niels
As I understand the requirement, a new table is wanted for every CSV. The code below illustrates that. It can be customized and the data types can be elaborated; see the documentation for Pandas.DataFrame.to_sql. I think, actually, that the heavy lifting is done by SQLAlchemy.
import io
import os
import pandas as pd
import psycopg2
buf_t1 = io.StringIO()
buf_t1.write("a,b,c,d\n")
buf_t1.write("1,2,3,4\n")
buf_t1.seek(0)
df_t1 = pd.read_csv(buf_t1)
df_t1.to_sql(name="t1", con="postgresql+psycopg2://host/db", index=False, if_exists='replace')
#
buf_t2 = io.StringIO()
buf_t2.write("x,y,z,t\n")
buf_t2.write("1,2,3,'Hello World'\n")
buf_t2.seek(0)
df_t2 = pd.read_csv(buf_t2)
df_t2.to_sql(name="t2", con="postgresql+psycopg2://host/db", index=False, if_exists='replace')
This will result in two new tables, t1 and t2, defined like this:
create table t1
(
a bigint,
b bigint,
c bigint,
d bigint
);
create table t2
(
x bigint,
y bigint,
z bigint,
t text
);
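For the files-on-disk case in the question, the same pattern would look roughly like this (a sketch; the connection string is assembled from the parameters given in the question, and the file list is a placeholder):
# Hedged sketch: one new table per CSV file, named after the file.
import os
import pandas as pd

conn_str = "postgresql+psycopg2://abc:xxx@xxx.xx.xx.x:5432/postgres"
for path in ['test.csv']:  # extend with the other CSV files
    table = os.path.splitext(os.path.basename(path))[0]
    pd.read_csv(path).to_sql(name=table, con=conn_str, index=False, if_exists='replace')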

Table in Pyspark shows headers from CSV File

I have a CSV file with contents as below, which has a header in the first line.
id,name
1234,Rodney
8984,catherine
Now, I was able to create a table in Hive that skips the header and reads the data appropriately.
Table in Hive
CREATE EXTERNAL TABLE table_id(
`tmp_id` string,
`tmp_name` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'=',',
'serialization.format'=',')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://some-testing/test/data/'
tblproperties ("skip.header.line.count"="1");
Results in Hive
select * from table_id;
OK
1234 Rodney
8984 catherine
Time taken: 1.219 seconds, Fetched: 2 row(s)
But when I use the same table in PySpark (running the same query), I see even the header from the file in the PySpark results, as below.
>>> spark.sql("select * from table_id").show(10,False)
+------+---------+
|tmp_id|tmp_name |
+------+---------+
|id |name |
|1234 |Rodney |
|8984 |catherine|
+------+---------+
Now, how can I stop the header from showing up in the PySpark results?
I'm aware that we can read the CSV file and add .option("header", True) to achieve this, but I want to know if there is a way to do something similar in PySpark while querying tables.
Can someone suggest a way? Thanks 🙏 in advance!
You can use the two properties below, SERDE properties and table properties; with them you will be able to access the table from both Hive and Spark, skipping the header in both environments.
CREATE EXTERNAL TABLE `student_test_score_1`(
student string,
age string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'delimiter'=',',
'field.delim'=',',
'header'='true',
'skip.header.line.count'='1',
'path'='hdfs:<path>')
LOCATION
'hdfs:<path>'
TBLPROPERTIES (
'spark.sql.sources.provider'='CSV')
This is a known issue (SPARK-11374) and was closed as won't fix.
In the query you can add a WHERE clause that selects all records except the header values 'id' and 'name':
spark.sql("select * from table_id where tmp_id <> 'id' and tmp_name <> 'name'").show(10,False)
#or
spark.sql("select * from table_id where tmp_id != 'id' and tmp_name != 'name'").show(10,False)
Another way would be to read the files from HDFS with .option("header", "true"), as in the sketch below.
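This is a rough sketch of that approach (the S3 path is the one from the Hive DDL above; the Spark session setup is assumed):
# Hedged sketch: read the raw CSV files directly, treating line 1 as the header.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read
      .option("header", True)       # first line of each file becomes the column names
      .option("inferSchema", True)  # optional: infer types instead of reading everything as strings
      .csv("s3://some-testing/test/data/"))

df.show(10, False)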

Why does the first column appear as the word 'data' when inserting columns into PostgreSQL using psycopg2?

When adding columns to PostgreSQL using Python's psycopg2 module, why does the first column name always appear as the word "data"? I'm trying to better understand this for myself and can provide code as needed.
data | Fieldtest1 | Fieldtest2 | Fieldtest3 | Fieldtest4 | Fieldtest5 | Fieldtest6
------+------------+------------+------------+------------+------------+------------
(0 rows)
Code:
import psycopg2

def add_cols(tablename, colnames, data):
    conn = psycopg2.connect("dbname=test user=test_user password=test_password host=localhost port=5432")
    cursor = conn.cursor()
    ### Create a new table
    cursor.execute("CREATE TABLE IF NOT EXISTS \""+tablename+"\" (data text);")
    ### Layer the new table with column names
    for colname in colnames:
        cursor.execute("ALTER TABLE IF EXISTS "+tablename+" ADD IF NOT EXISTS \""+colname+"\" VARCHAR;")
Because you are specifying it here (in the parentheses):
cursor.execute("CREATE TABLE IF NOT EXISTS \""+tablename+"\" (data text);")

Inserting data in table with umlaut is not possible

I am using Cassandra 1.2.5 (cqlsh 3.0.2) and trying to insert data with German characters into a small test database, which is not working. I get back this message from cqlsh: "Bad Request: Input length = 1".
Below is the setup of the keyspace, the table, and the insert.
CREATE KEYSPACE test WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
use test;
CREATE TABLE testdata (
    id varchar,
    text varchar,
    PRIMARY KEY (id)
);
This is working:
insert into testdata (id, text) values ('4711', 'test');
This is not allowed:
insert into testdata (id, text) values ('4711', 'töst`);
->Bad Request: Input length = 1
My locale is de_DE.UTF-8.
Does Cassandra 1.2.5 have a problem with umlauts?
I just did what you posted and it worked for me. The one thing that was different, however, is that you finished 'töst` with a backtick instead of a single quote. That doesn't allow me to finish the statement in cqlsh. When I replace it with 'töst' it succeeds and I get:
cqlsh:test> select * from testdata;
id | text
------+------
4711 | töst
