We have our ODS in Cassandra, but for the EDW workload and reporting we need to load the data into Snowflake. One of the options we explored is DSBulk, where DSBulk unloads the data as CSV and we then load that CSV into Snowflake. We do not want to go with that option; instead, we are trying to find out if there is a way to unload from the SSTables to CSV, which would be less intrusive. Is there a way to bulk unload from SSTables to multiple CSVs?
Thanks
I am stuck on inserting/updating multiple rows (approximately 800 rows) into a Cassandra table with cqlengine. I do not want to use a loop in Python. I searched and found batch queries, but I cannot get them to work.
Please help me build a batch query, or suggest another efficient way to insert multiple rows into Cassandra.
Thank you.
https://cqlengine.readthedocs.io/en/latest/topics/queryset.html#batch-queries
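For reference, the linked page documents the BatchQuery context manager; a minimal sketch of that pattern (the model, keyspace and columns below are invented for illustration, and the table is assumed to already exist):

```python
# Minimal cqlengine batch sketch; ExampleModel, my_keyspace and the columns are placeholders.
from cassandra.cqlengine import columns, connection
from cassandra.cqlengine.models import Model
from cassandra.cqlengine.query import BatchQuery

class ExampleModel(Model):
    id = columns.Integer(primary_key=True)
    description = columns.Text()

connection.setup(["127.0.0.1"], "my_keyspace")

rows = [(i, "row %d" % i) for i in range(800)]

# All 800 inserts are sent through a single coordinator; the answer below explains
# why this is usually not faster than issuing the writes asynchronously.
with BatchQuery() as b:
    for pk, text in rows:
        ExampleModel.batch(b).create(id=pk, description=text)
```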
CQL batches are not an optimisation -- they do not make your queries run faster. In fact, they do the opposite if you have large batches because they can overload the coordinator of the request and queries end up running slower.
CQL batches are designed to achieve atomicity so either (a) all the statements in the batch are executed successfully, or (b) none at all.
In Cassandra, you can achieve higher throughput if you issue multiple asynchronous writes instead of a single batch. Running more app instances (clients) also performs better, because traffic can get bottlenecked on a single client app.
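As a rough illustration of the asynchronous approach with the DataStax Python driver (keyspace, table and column names are placeholders, not your schema):

```python
# Concurrent asynchronous inserts instead of one large batch; all names are placeholders.
from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")

insert = session.prepare("INSERT INTO example (id, description) VALUES (?, ?)")
rows = [(i, "row %d" % i) for i in range(800)]

# Keeps up to 50 requests in flight at a time and returns (success, result_or_exception) pairs.
results = execute_concurrent_with_args(session, insert, rows, concurrency=50)
for success, result in results:
    if not success:
        print("write failed:", result)
```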
If your goal is to bulk load data, I recommend instead using a tool like DataStax Bulk Loader (DSBulk). DSBulk is free, open-source software that lets you bulk load data in CSV or JSON format into a Cassandra cluster.
Here are some resources to help you get started:
Introducing DataStax Bulk Loader
Loading data to Cassandra with DSBulk
More loading examples with DSBulk
Common DSBulk settings
Unloading data from Cassandra with DSBulk
Counting data with DSBulk (handy for verifying records loaded)
Examples for loading from other locations
I have the following code:
Dataset<Row> rows = sparkSession.sql("select ... from hive tables with multiple joins");
rows.write().saveAsTable("target_external_table"); // write to another external table in Hive immediately
1) In the above case, when saveAsTable() is invoked, will Spark load the whole dataset into memory?
1.1) If yes, then how do we handle the scenario where this query returns a huge volume of data that cannot fit into memory?
2) If the server crashes while Spark is executing saveAsTable() to write data to the external Hive table, is there a possibility of partial data being written to the target Hive table?
2.1) If yes, how do we avoid incomplete/partial data being persisted into the target Hive table?
Yes, Spark will place all the data in memory, but it uses parallel processes. However, when we write data it will use driver memory to stage the data before the write, so try increasing driver memory.
You have a couple of options here. If you have spare memory in the cluster, you can increase num-cores, num-executors and executor-memory, along with driver-memory, based on the data size.
If you cannot fit all the data in memory, break the data up and process it in a loop programmatically.
Let's say the source data is partitioned by date and you have 10 days to process: process one day at a time, write it to a staging DataFrame, create a date-based partition in the final table, and overwrite that date's partition on each iteration of the loop, as sketched below.
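A rough PySpark sketch of that loop, assuming the source table is partitioned by an event_date column (the database, table and column names are made up):

```python
# Sketch: process one date partition per iteration instead of the whole query at once.
# source_db.events, target_db.events_flat and event_date are assumed names.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("daily-loop")
         .config("spark.sql.sources.partitionOverwriteMode", "dynamic")  # overwrite only touched partitions
         .enableHiveSupport()
         .getOrCreate())

dates = [r.event_date for r in
         spark.sql("SELECT DISTINCT event_date FROM source_db.events").collect()]

for d in dates:
    daily = spark.sql(
        "SELECT /* your multi-join query here */ * "
        "FROM source_db.events WHERE event_date = '{}'".format(d))
    # insertInto matches columns by position; the target table is partitioned by event_date,
    # so with dynamic partition overwrite only this date's partition is replaced per pass.
    daily.write.mode("overwrite").insertInto("target_db.events_flat")
```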
I am trying to understand which of the two options below would be the better choice, especially in a Spark environment:
Loading the Parquet files directly into a DataFrame and accessing the data there (a table of about 1 TB)
Using a database to store and access the data.
I am working on a data pipeline design and trying to understand which of the above two options will result in a more optimized solution.
Loading the Parquet files directly into a DataFrame and accessing the data there is more scalable than reading an RDBMS like Oracle through the JDBC connector. I handle more than 10 TB of data, though I prefer the ORC format for better performance. I suggest you read the data directly from files; the reason is data locality -- if you run your Spark executors on the same hosts where the HDFS data nodes are located, they can read data into memory efficiently without network overhead. See https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-data-locality.html and How does Apache Spark know about HDFS data nodes? for more details.
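To make the comparison concrete, here is a hedged PySpark sketch of the two access patterns; the HDFS path, JDBC URL, credentials and table names are invented:

```python
# Option 1 vs option 2; every path, URL and name below is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("files-vs-jdbc").getOrCreate()

# Reading columnar files directly: executors read their HDFS blocks in parallel,
# ideally from local data nodes (data locality).
events = spark.read.parquet("hdfs:///warehouse/events_parquet")  # or spark.read.orc(...) for ORC
events.filter("event_date = '2023-01-01'").groupBy("country").count().show()

# Reading the same table from an RDBMS over JDBC: all rows flow through the database
# and the network, and parallelism depends on the partitioning options you set.
events_jdbc = (spark.read.format("jdbc")
               .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
               .option("dbtable", "EVENTS")
               .option("user", "scott")
               .option("password", "tiger")
               .option("partitionColumn", "EVENT_ID")
               .option("lowerBound", "1")
               .option("upperBound", "100000000")
               .option("numPartitions", "32")
               .load())
```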
I have a Spark job that currently pulls data from HDFS and transforms it into flat files to load into Cassandra.
The Cassandra table is essentially 3 columns, but the last two are map collections, so a "complex" data structure.
Right now I use the COPY command and get about 3k rows/sec loaded, but that's extremely slow given that I need to load about 50 million records.
I see I can convert the CSV file to SSTables, but I don't see an example involving map collections and/or lists.
Can I use the Spark connector for Cassandra to load data with map collections and lists and get better performance than just the COPY command?
Yes, the Spark Cassandra Connector can be much, much faster for files already in HDFS. Using Spark, you'll be able to read from HDFS and write into C* in a distributed fashion.
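A rough PySpark sketch of a connector write that includes map columns; the connector version, contact point, keyspace, table and column names below are assumptions, not your schema:

```python
# Write a DataFrame with map columns to Cassandra via the Spark Cassandra Connector.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, create_map, lit

spark = (SparkSession.builder
         .appName("hdfs-to-cassandra")
         .config("spark.jars.packages",
                 "com.datastax.spark:spark-cassandra-connector_2.12:3.4.1")
         .config("spark.cassandra.connection.host", "cassandra-host")
         .getOrCreate())

# Read the flat files from HDFS and build the two map columns the table expects.
raw = spark.read.option("header", "true").csv("hdfs:///staging/flat_files/")
rows = raw.select(
    col("id"),
    create_map(lit("k1"), col("v1"), lit("k2"), col("v2")).alias("attrs_a"),
    create_map(lit("k3"), col("v3")).alias("attrs_b"),
)

# Each executor writes its own partitions directly to Cassandra -- no single client bottleneck.
(rows.write
     .format("org.apache.spark.sql.cassandra")
     .options(keyspace="my_ks", table="my_table")
     .mode("append")
     .save())
```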
Even without Spark, using a Java-based loader like https://github.com/brianmhess/cassandra-loader will give you a significant speed improvement.
How do I load data into Cassandra from Netezza? In doing that I also need to transform some tables. I have no experience in ETL. I would like to know how to start on this.