Importing CSV to Cassandra DB - cassandra

I am trying to import a .csv file to Cassandra by using the COPY command in the CQL. The problem is that, after executing the command, the console displays that '327 rows imported in 0.589 seconds' but only the last row from the csv file gets into the Cassandra table, and the size of the table is only one row.
Here is the command that i am using for copying:
copy test from '/root/Documents/test.csv' with header=true;
And when i do select * from test; it shows only one row.
Any help would be appreciated.

Posting my comment here as an answer for future reference.
One way this could happen (where all rows are imported but the result is only 1 left in table) is if every row had the same primary key. Each insert is really doing a replacement, hence the result of just a single row of data.

Related

Upsert Option in ADF Copy Activity

With the "upsert option" , should I expect to see "0" as "Rows Written" in a copy activity result summary?
My situation is this: The source and sink table columns are not exactly the same but the Key columns to tell it how to know the write behavior are correct.
I have tested and made sure that it does actually do insert or update based on the data I give to it BUT what I don't understand is if I make ZERO changes and just keep running the pipeline , why does it not show "zero" in the Rows Written summary?
The main reason why rowsWritten is not shown as 0 even when the source and destination have same data is:
Upsert inserts data when a key column value is absent in target table and updates the values of other rows whenever the key column is found in target table.
Hence, it is modifying all records irrespective of the changes in data. As in SQL Merge, there is no way to tell copy activity that if an entire row already exists in target table, then ignore that case.
So, even when key_column matches, it is going to update the values for rest of the columns and hence counted as row written. The following is an example of 2 cases
The rows of source and sink are same:
The rows present:
id,gname
1,Ana
2,Ceb
3,Topias
4,Jerax
6,Miracle
When inserting completely new rows:
The rows present in source are (where sink data is as above):
id,gname
8,Sumail
9,ATF

Azure Data Factory Copy Activity is dropping columns on the floor

first time, long time.
I'm running an import of a csv file that has 734 columns in Azure Data Factory Copy Activity. Data factory is not reading the last 9 columns and is populating with NULL. Even in the preview I can see that the columns have no values but the schema for those columns is detected. Is there a limit of columns in Copy to 725?
As Joel said there is no restriction for 725 or so columns . I suggest
Go to the mapping tab and only pick 726th column ( if you have a header it will be easy or ADF will generate header like Prop_726( most probably) , copy the data to blob as sink , If the blob has the field , that means that you have a data type issue on the table .
Let me know how its goes , if you are still facing the issue , please share some dummy data for 726th column .
Here is what happened. I had the file in zip folders, and I thought I had to unzip the files first to process them. It turns out that when unzipping through ADF, it stripped the quotation marks from my columns, and then one of the columns had an escape character in it. That escape character shifted everything over, and resulted in me losing nine columns.
But I did learn a bunch of things NOT to do, so it wasn't a total waste of time. Thanks for the answers!

Cassandra Altering the table

I have a table in Cassandra say employee(id, email, role, name, password) with only id as my primary key.
I want to ...
1. Add another column (manager_id) in with a default value in it
I know that I can add a column in the table but there is no way i can provide a default value to that column through CQL. I can also not update the value for manager_id later since I need to know the id (Partition key and the values are randomly generated unique values which i don't know) to update the row. Is there any way I can achieve this?
2. Rename this table to all_employee.
I also know that its not allowed to rename a table in cassandra. So I am trying to copy the data of table(employee) to csv and copy from csv to new table (all_employee) and deleting the old table(employee). I am doing this through an automated script with cql queries in it and script works fine but will fail if it gets executed again(Which i can not restrict) since the table employee will not be there once its deleted. Essentially I am looking for "If exists" clause in COPY query which is not supported in cql. Is there any other way I can achieve the outcome?
Please note that the amount of data in the table is very small so performance in not an issue.
For #1
I dont think cassandra support default column . You need to do that from your appliaction. Write some default value every time you insert a row.
For #2
you can check if the table exists before trying to copy from it.
SELECT your_table_name FROM system_schema.tables WHERE keyspace_name='your_keyspace_name';

Hive Context to load into a table

I am selecting a entire table and loading into a new table.It is loaded correctly but the value is appending not overwriting.
Spark version 1.6
Below is the code snippet
DataFrame df = hiveContext.createDataFrame(JavaRDD<Row>, StructType);
df.registerTempTable("tempregtable");
String query="insert into employee select * from tempregtable";
hiveContext.sql(query);
I am droping and creating the table (employee) and executing the above code.But still the old row value gets appended with new row.For eg if I am inserted four rows and dropped the table and again inserting four rows totally 8 rows got added.Kindly help me, how to overwrite the data instead of appending.
Regards
Prakash
try
String query="insert overwrite table employee select * from tempregtable";
INSERT OVERWRITE will overwrite any existing data in the table or partition
INSERT INTO will append to the table or partition
Reference: Hive Language Manual

How to retrieve a very big cassandra table and delete some unuse data from it?

I hava created a cassandra table with 20 million records. Now I want to delete the expired data decided by one none primary key column. But it doesn't support the operation on the column. So I try to retrieve the table and get the data line by line to delete the data.Unfortunately,it is too huge to retrieve. Otherwise,I couldn't delete the whole table, how could I achieve my goal?
Your question is actually, how to get the data from the table in bulks (also called pagination).
You can do that by selecting different slices from your primary key: For example, if your primary key is some sort of ID, select a range of IDs each time, process the results and do whatever you want to do with them, then get the next range, and so on.
Another way, which depends on the driver you're working with, will be to use fetch_size. You can see a Python example here and a Java example here.

Resources