I am looking for an example of using "CassandraDataModel" for Mahout. Also, what tables are needed, and how do I import data into those tables from CSV files?
Posting this in case others face the same issue with the Windows version of Cassandra 3.9.0 (DataStax distribution).
The following tables need to be created through CQL:
CREATE TABLE "users" ("userID" bigint , "itemID" bigint,value float,PRIMARY KEY ("userID","itemID") ) WITH COMPACT STORAGE;
CREATE TABLE "items" ("itemID" bigint , "userID" bigint,value float,PRIMARY KEY ("itemID","userID")) WITH COMPACT STORAGE;
CREATE TABLE "userIDs" ("ID" bigint , "userID" bigint, PRIMARY KEY ("ID","userID")) WITH COMPACT STORAGE;
CREATE TABLE "itemIDs" ("ID" bigint , "itemID" bigint, PRIMARY KEY ("ID","itemID")) WITH COMPACT STORAGE;
The following is the format of the command for importing data from the CSV files into these four tables:
copy users from 'C:\Recommender_mv\RecommenderApp\data\movies.csv';
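If the CSV file has a header row, or its columns need to be mapped onto the table explicitly, cqlsh's COPY also accepts a column list and options (the option shown here is just illustrative):
COPY users ("userID", "itemID", value) FROM 'C:\Recommender_mv\RecommenderApp\data\movies.csv' WITH HEADER = true;
The same COPY form is repeated for the items, userIDs, and itemIDs tables with their respective CSV files.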
I already have a MySQL table on my local (Linux) machine, and I have a Hive external table with the same schema as the MySQL table.
I want to sync my Hive external table whenever a new record is inserted or updated. A batch update is fine with me, say hourly.
What is the best possible approach to achieve this without using Sqoop?
Thanks,
Sumit
Without Sqoop, you can create a table STORED BY the JdbcStorageHandler. Project repository: https://github.com/qubole/Hive-JDBC-Storage-Handler It will work like a usual Hive table, but queries will run on MySQL. Predicate pushdown will work.
DROP TABLE HiveTable;
CREATE EXTERNAL TABLE HiveTable(
id INT,
id_double DOUBLE,
names STRING,
test INT
)
STORED BY 'org.apache.hadoop.hive.jdbc.storagehandler.JdbcStorageHandler'
TBLPROPERTIES (
"mapred.jdbc.driver.class"="com.mysql.jdbc.Driver",
"mapred.jdbc.url"="jdbc:mysql://localhost:3306/rstore",
"mapred.jdbc.username"="root",
"mapred.jdbc.input.table.name"="JDBCTable",
"mapred.jdbc.output.table.name"="JDBCTable",
"mapred.jdbc.password"="",
"mapred.jdbc.hive.lazy.split"= "false"
);
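For example, with the table above you query it like any other Hive table, and a simple filter like this is pushed down to MySQL (the filter value here is just illustrative):
SELECT id, names
FROM HiveTable
WHERE id > 100; -- this predicate is pushed down and evaluated on the MySQL side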
How can I append a new element to a set that is inside a custom type in Cassandra?
The custom type is:
CREATE TYPE custom_type (
normal_type TEXT,
set_type Set<TEXT>
);
and the table to be updated is:
CREATE TABLE test_table (
id TEXT,
my_type FROZEN<custom_type>,
clustering_key TEXT,
PRIMARY KEY ((id),clustering_key)
);
I tried the query below, but it did not work:
@Query("update test_table set my_type.set_type = my_type.set_type + {'newelement'} where id=?1 and clustering_key=?2")
Any idea how to do that?
Using cqlsh 5.0.1 | Cassandra 3.11.1 | CQL spec 3.4.4.
When you declare a value as frozen, the whole value is treated as one piece (a blob), so you can't update individual parts of the field. The official documentation states:
When using the frozen keyword, you cannot update parts of a user-defined type value. The entire value must be overwritten. Cassandra treats the value of a frozen, user-defined type like a blob.
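So the usual workaround is a read-modify-write from the client: read the current UDT value, add the element in application code, and overwrite the whole frozen value. A minimal sketch (all values below are illustrative):
UPDATE test_table
SET my_type = { normal_type: 'abc', set_type: {'oldelement', 'newelement'} } -- entire frozen value rewritten
WHERE id = 'someid' AND clustering_key = 'somekey';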
Hi, I'm new to Cassandra/CQL but have SQL/relational database experience.
I have a very simple keyspace and table set up for testing purposes and need to add some very basic data, some of which is in BLOB format.
CREATE TABLE file_share (
key varchar,
content_type varchar,
client_id varchar,
org_id varchar,
user_id varchar,
value blob,
internal_key uuid,
PRIMARY KEY (client_id, org_id, user_id, key)
);
The trouble for me right now is: how can I (using a scripted INSERT statement or the like) put some dummy data, specifically into the blob itself, so that I can use it for testing? Is there a way to upload a file? If so, how? If not, is there a better way to get binary data into a blob? I've been trying to find a list of blob-handling functions for CQL but can't find anything so far. Thanks
If this is just for testing, you can use the following syntax:
INSERT INTO file_share (client_id, org_id, user_id, key, value)
VALUES ('A', 'B', 'C' , 'D', 0xabcd);
I don't know of any way to input file data to a blob using cqlsh.
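That said, CQL does provide blob conversion functions such as textAsBlob(), which are handy for generating dummy blob values from strings (the literal below is just placeholder data):
INSERT INTO file_share (client_id, org_id, user_id, key, value)
VALUES ('A', 'B', 'C', 'D', textAsBlob('dummy file contents')); -- textAsBlob converts a text value to a blob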
We have a CQL table that looks something like this:
CREATE table data (
occurday text,
seqnumber int,
occurtimems bigint,
unique bigint,
fields map<text, text>,
primary key ((occurday, seqnumber), occurtimems, unique)
)
I can query this table from cqlsh like this:
select * from data where seqnumber = 10 AND occurday = '2013-10-01';
This query works and returns the expected data.
If I execute this query as part of a LOAD from within Pig, however, things don't work.
-- Need to URL encode the query
data = LOAD 'cql://ks/data?where_clause=seqnumber%3D10%20AND%20occurday%3D%272013-10-01%27' USING CqlStorage();
gives
InvalidRequestException(why:seqnumber cannot be restricted by more than one relation if it includes an Equal)
at org.apache.cassandra.thrift.Cassandra$prepare_cql3_query_result.read(Cassandra.java:39567)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.cassandra.thrift.Cassandra$Client.recv_prepare_cql3_query(Cassandra.java:1625)
at org.apache.cassandra.thrift.Cassandra$Client.prepare_cql3_query(Cassandra.java:1611)
at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader$RowIterator.prepareQuery(CqlPagingRecordReader.java:591)
at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader$RowIterator.executeQuery(CqlPagingRecordReader.java:621)
Shouldn't these behave the same? Why is the version through Pig failing where the straight cqlsh command works?
Hadoop uses CqlPagingRecordReader to load your data, which leads to queries that are not identical to the one you entered: the paging record reader obtains small slices of Cassandra data at a time to avoid timeouts.
This means that your query is executed as
SELECT * FROM "data" WHERE token("occurday","seqnumber") > ? AND
token("occurday","seqnumber") <= ? AND occurday='A Great Day'
AND seqnumber=1 LIMIT 1000 ALLOW FILTERING
This is why you are seeing the repeated-key error. I'll submit a bug to the Cassandra project.
Jira:
https://issues.apache.org/jira/browse/CASSANDRA-6151
Creating tables fails with inet, multiple primary keys, and collections, even though the syntax was correct.
The error messages don't make sense for the compound primary key (unmatched parens). After removing that, I learned that inet won't work except in some cases.
Is there anything I'm doing wrong, or not understanding, about using CQL3 (interfaces or syntax)?
CREATE TABLE session (
'user_id' bigint,
'admin_id' bigint,
'session_id' varchar,
'cache' text ,
'created' timestamp ,
'hits' list<timestamp>,
'ip' inet ,
PRIMARY KEY ( 'session_id' , 'user_id' )
);
The following also fails
CREATE TABLE 'session' (
'user_id' bigint,
'session_id' varchar,
PRIMARY KEY ( 'session_id' , 'user_id' )
);
This works
CREATE TABLE 'session' (
'user_id' bigint,
'session_id' varchar PRIMARY KEY
);
The clue
>help TYPES
CQL types recognized by this version of cqlsh:
ascii
bigint
blob
boolean
counter
decimal
double
float
int
text
timestamp
uuid
varchar
varint
DSE 3.0.x
[EDIT] It turns out DSE 3.0.x has Cassandra 1.1.x installed.
TL;DR: Collections (part of CQL3) are not yet available in DSE 3.0.x.
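As an aside: in current CQL3, identifiers are written unquoted or with double quotes, and single quotes denote string literals, so on recent Cassandra versions the working form of the table would be written like this:
CREATE TABLE session (
user_id bigint,
session_id varchar,
PRIMARY KEY (session_id, user_id) -- compound primary key, no quotes around identifiers
);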
Also worth noting, but unrelated to my issue:
Even in the DataStax Community Edition, one needs to activate CQL 3. The documentation says it should be activated by default in cqlsh:
http://www.datastax.com/docs/1.2/cql_cli/using_cql
"Activating CQL 3
You activate the CQL mode in one of these ways:
Use the DataStax Java Driver to activate CQL through the
native/binary protocol. Start cqlsh, a Python-based command-line
client.
Use the set_sql_version Thrift method.
Specify the desired CQL mode in the connect() call to the Python driver:
*connection = cql.connect('localhost:9160', cql_version='3.0')*
The documentation there was also incorrect; it should be:
con = cql.connect('localhost', cql_version='3.0.0')
Also, Enterprise OpsCenter doesn't yet support CQL 3 in DSE.
To start cqlsh in CQL 3 mode:
cqlsh --cqlversion=3