How to export a Cassandra database into another system?

I am using a Cassandra database with several keyspaces. Now I want to use those keyspaces within another system.
What are valid options to achieve that?

You can use the cqlsh COPY command to export your data to CSV. Then you can import the CSV into your other database, if it supports that, e.g.:
COPY keyspace.tablename (column1, column2, ...) TO '../export.csv' WITH HEADER = TRUE;
https://docs.datastax.com/en/cql/3.3/cql/cql_reference/cqlshCopy.html
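As an illustration (not part of the original answer), here is a minimal Python sketch that loads such an exported CSV into another system, using SQLite from the standard library as a stand-in target; the file name and column names are assumptions taken from the example above:

import csv
import sqlite3

# Minimal sketch: load the CSV produced by cqlsh COPY ... WITH HEADER = TRUE
# into another database (SQLite here, purely for illustration).
conn = sqlite3.connect("target.db")  # hypothetical target database
conn.execute("CREATE TABLE IF NOT EXISTS tablename (column1 TEXT, column2 TEXT)")  # assumed columns

with open("export.csv", newline="") as f:
    reader = csv.DictReader(f)  # HEADER = TRUE gives us the column names
    rows = [(r["column1"], r["column2"]) for r in reader]

conn.executemany("INSERT INTO tablename (column1, column2) VALUES (?, ?)", rows)
conn.commit()
conn.close()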

Related

How to manage mangled data when importing from your source in sqoop or pyspark

I have been working on a project to import the Danish 2.5 million ATM transactions data set to derive some visualizations.
The data is hosted on a MySQL server provided by the university. The objective is to import the data using Sqoop and then apply a few transformations to it using PySpark.
Link to the dataset: https://www.kaggle.com/sparnord/danish-atm-transactions
The SQL server that hosts this information has a few rows which are intentionally or unintentionally mangled.
So I have a very basic Sqoop command which gets the data from the source database. However, I run into an issue with values that contain a double quote ("), especially in the column message_text.
Sqoop Command :
sqoop import --connect jdbc:mysql:{source-connection-string} --table SRC_ATM_TRANS --username {username} --password {password} --target-dir /user/root/etl_project --fields-terminated-by '|' --lines-terminated-by "\n" -m 1
Here is a sample row as it is imported:
2017|January|1|Sunday|21|Active|85|Diebold Nixdorf|København|Regnbuepladsen|5|1550|55.676|12.571|DKK|MasterCard|4531|Withdrawal|4017|"Suspected malfunction|0.000|55.676|13|2618425|0.000|277|1010|93|3|280.000|0|75|803|Clouds
However the expected output should be
2017|January|1|Sunday|21|Active|85|Diebold Nixdorf|København|Regnbuepladsen|5|1550|55.676|12.571|DKK|MasterCard|4531|Withdrawal|4017|"Suspected malfunction,0.000|55.676|13|2618425|0.000|277|1010|93|3|280.000|0|75|803|Clouds|Cloudy
At first I was okay with this, hoping that PySpark would handle the mangled data since the delimiters are specified.
But now I run into issues when populating my dataframe.
transactions = spark.read.option("sep","|").csv("/user/root/etl_project/part-m-00000", header = False,schema = transaction_schema)
However when I inspect my rows I see that the mangled data has caused the dataframe to put these affected values into a single column!
transactions.filter(transactions.message_code == "4017").collect()
Row(year=2017, month=u'January', day=1, weekday=u'Sunday', hour=17, atm_status=u'Active', atm_id=u'35', atm_manufacturer=u'NCR', atm_location=u'Aabybro', atm_streetname=u'\xc3\u0192\xcb\u0153stergade', atm_street_number=6, atm_zipcode=9440, atm_lat=57.162, atm_lon=9.73, currency=u'DKK', card_type=u'MasterCard', transaction_amount=7387, service=u'Withdrawal', message_code=u'4017', message_text=u'Suspected malfunction|0.000|57.158|10|2625037|0.000|276|1021|83|4|319.000|0|0|800|Clear', weather_lat=None, weather_lon=None, weather_city_id=None, weather_city_name=None, temp=None, pressure=None, humidity=None, wind_speed=None, wind_deg=None, rain_3h=None, clouds_all=None, weather_id=None, weather_main=None, weather_description=None)
At this point I am not sure what to do.
Do I go ahead and create temporary columns to manage this and use a regex replacement to fill in these values?
Or is there a better way to import the data and handle these mangled values, either in Sqoop or in PySpark?
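This is not from the original thread, but one direction worth sketching: Spark's CSV reader lets you turn quote handling off entirely, so a stray double quote inside message_text is treated as ordinary data instead of opening a quoted field. A minimal sketch, assuming the same path and the transaction_schema object defined in the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("atm-import").getOrCreate()

# Minimal sketch: read the Sqoop output as plain '|'-delimited text and
# disable quote handling so an unmatched '"' inside message_text does not
# swallow the following fields. transaction_schema is the schema object
# already defined in the question.
transactions = (
    spark.read
    .option("sep", "|")
    .option("quote", "")  # empty string turns off quote handling in the CSV reader
    .schema(transaction_schema)
    .csv("/user/root/etl_project/part-m-00000", header=False)
)

transactions.filter(transactions.message_code == "4017").show(truncate=False)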

Convert Access database into delimited format on Unix/Linux

I have an Access database file and I need to convert it into delimited file format.
The Access DB file has multiple tables and I need to create separate delimited files for each table.
So far I have not been able to parse Access DB files with any Unix commands. Is there some way that I can do this on Unix?
You can use UCanAccess to dump Access tables to CSV files using the console utility (a scripted variant for several tables is sketched after the example):
gord@xubuntu64-nbk1:~/Downloads/UCanAccess$ ./console.sh
/home/gord/Downloads/UCanAccess
Please, enter the full path to the access file (.mdb or .accdb): /home/gord/ClientData.accdb
Loaded Tables:
Clients
Loaded Queries:
Loaded Procedures:
Loaded Indexes:
Primary Key on Clients Columns: (ID)
UCanAccess>
Copyright (c) 2019 Marco Amadei
UCanAccess version 4.0.4
You are connected!!
Type quit to exit
Commands end with ;
Use:
export [--help] [--bom] [-d <delimiter>] [-t <table>] [--big_query_schema <pathToSchemaFile>] [--newlines] <pathToCsv>;
for exporting the result set from the last executed query or a specific table into a .csv file
UCanAccess>export -d , -t Clients clientdata.csv;
UCanAccess>Created CSV file: /home/gord/Downloads/UCanAccess/clientdata.csv
UCanAccess>quit
Cheers! Thank you for using the UCanAccess JDBC Driver.
gord@xubuntu64-nbk1:~/Downloads/UCanAccess$
gord@xubuntu64-nbk1:~/Downloads/UCanAccess$ cat clientdata.csv
ID,LastName,FirstName,DOB
1,Thompson,Gord,2017-04-01 07:06:27
2,Loblaw,Bob,1966-09-12 16:03:00
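Since the question mentions several tables, here is a rough Python sketch (my own illustration, not part of the original answer) that pipes the same commands shown above into console.sh once per table; it assumes the console accepts the answers and commands on standard input exactly as it does interactively:

import subprocess

ACCESS_FILE = "/home/gord/ClientData.accdb"  # path that the console prompts for
TABLES = ["Clients"]  # add the other table names here

# Rough sketch: feed the interactive console the file path, one export
# command per table, and finally quit. Assumes console.sh reads these
# from stdin the same way it does interactively.
commands = ACCESS_FILE + "\n"
for table in TABLES:
    commands += "export -d , -t {0} {0}.csv;\n".format(table)
commands += "quit\n"

subprocess.run(["./console.sh"], input=commands, text=True, check=True)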

How to export result as CSV file with Header from Cassandra DevCenter?

I am using DevCenter 1.6.0. I know that from a unix system I can use the COPY command with HEADER = TRUE and export all rows with a header. But when I am using DataStax DevCenter I am able to export all rows from the result, but unable to include the header. How can I export the result as a CSV file with a header?
Thanks !!!

Cassandra CQL : insert data from existing file

I have a JSON file that I want to insert into a Cassandra table using CQL.
According to the DataStax documentation, you can insert JSON with the following command:
INSERT INTO data JSON '{My_Json}';
But I can't find a way to do that directly from an existing JSON file. Is this possible, or do I need to write some Java code to do the insert?
Note: I am using Cassandra 3.9
The only file format supported for importing with cqlsh is CSV. It is possible to convert your JSON file to CSV format and import it with the COPY command. If that is not an option for you, Java code is needed to parse your file and insert it into Cassandra.
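As an illustration of that last option, here is a minimal sketch in Python with the DataStax driver rather than Java; host, keyspace and file name are placeholders, and the table name data is taken from the question:

import json
from cassandra.cluster import Cluster  # pip install cassandra-driver

# Minimal sketch: read an existing JSON file and insert each object with
# INSERT ... JSON. Host, keyspace and file name are placeholders.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")

with open("data.json") as f:
    records = json.load(f)  # assumes the file holds a JSON array of row objects

for record in records:
    session.execute("INSERT INTO data JSON %s", (json.dumps(record),))

cluster.shutdown()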

Export data from SqlQuery to Excel sheet [duplicate]

I have a table with more than 3,000,000 rows. I have tried to export the data from it manually and with the SQL Server Management Studio Export Data functionality to Excel, but I have met several problems:
when creating a .txt file manually by copying and pasting the data (in several passes, because copying all rows at once from SQL Server Management Studio throws an out-of-memory error), I am not able to open it with any text editor to copy the rows;
the Export Data to Excel option does not work, because Excel does not support that many rows;
finally, with the Export Data functionality I have created a .sql file, but it is 1.5 GB and I am not able to open it in SQL Server Management Studio again.
Is there a way to import it with the Import Data functionality, or another, more clever way to make a backup of the information in my table and then import it again if I need it?
Thanks in advance.
I am not quite sure if I understand your requirements (I don't know whether you need to export your data to Excel or you want to make some kind of backup).
To export data from single tables, you can use the Bulk Copy Program (bcp), which lets you export data from single tables to files and import it again. You can also use a custom query to export the data.
Note that this does not generate an Excel file, but another format. You can use it to move data from one database to another (both must be MS SQL Server).
Examples:
Create a format file:
bcp [TABLE_TO_EXPORT] format "[EXPORT_FILE]" -n -f "[FORMAT_FILE]" -S [SERVER] -E -T -a 65535
Export all Data from a table:
bcp [TABLE_TO_EXPORT] out "[EXPORT_FILE]" -f "[FORMAT_FILE]" -S [SERVER] -E -T -a 65535
Import the previously exported data:
bcp [TABLE_TO_EXPORT] in "[EXPORT_FILE]" -f "[FORMAT_FILE]" -S [SERVER] -E -T -a 65535
I redirect the output from the export/import operations to a logfile (by appending "> mylogfile.log" at the end of the commands) - this helps if you are exporting a lot of data.
Here is a way of doing it without bcp:
EXPORT THE SCHEMA AND DATA IN A FILE
Use the SSMS wizard:
Database >> Tasks >> Generate Scripts… >> choose the table >> choose to script both schema and data
Save the SQL file (it can be huge)
Transfer the SQL file on the other server
SPLIT THE DATA IN SEVERAL FILES
Use a program like textfilesplitter to split the file into smaller files of 10,000 lines each (so each file is not too big); a small Python splitting sketch is shown after these steps.
Put all the files in the same folder, with nothing else
IMPORT THE DATA IN THE SECOND SERVER
Create a .bat file in the same folder, name execFiles.bat
You may need to check the table schema and disable the identity column in the first file; you can add it back after the import is finished.
This will execute all the files in the folder against the server and the database; the -f 65001 option tells sqlcmd to use Unicode (UTF-8) encoding so the accents are handled correctly:
for %%G in (*.sql) do sqlcmd /S ServerName /d DatabaseName -E -i"%%G" -f 65001
pause
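For the splitting step above, if a separate splitter tool is not available, a small Python sketch along these lines produces the 10,000-line chunks (file names are placeholders):

# Rough sketch: split a huge export.sql into numbered chunks of 10,000 lines
# so each file stays small enough for sqlcmd. File names are placeholders.
CHUNK_LINES = 10000

with open("export.sql", encoding="utf-8") as src:
    chunk, part = [], 0
    for line in src:
        chunk.append(line)
        if len(chunk) == CHUNK_LINES:
            with open("part_{:04d}.sql".format(part), "w", encoding="utf-8") as out:
                out.writelines(chunk)
            chunk, part = [], part + 1
    if chunk:  # write whatever is left over
        with open("part_{:04d}.sql".format(part), "w", encoding="utf-8") as out:
            out.writelines(chunk)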
