I have a Cassandra table that is ~150 GB in size. I want to migrate the table to a different Cassandra cluster. I have two approaches here:
1. Use a Spark job to read data from the old cluster and write it to the new cluster.
2. Save the Cassandra data to S3 in some format. Once the data is saved to S3, read it back with Spark and write it to the new cluster.
If I go with the second approach, what format should I save the data in? Since I have to read the data back from S3, which format will be best in this case: CSV, JSON, or Parquet?
I'd recommend using the COPY TO command to extract the data to CSV, then using COPY FROM to load it back in - https://docs.datastax.com/en/cql/3.3/cql/cql_reference/cqlshCopy.html
e.g.
COPY my_table TO 'my_table.csv' // on source Cassandra
COPY my_table FROM 'my_table.csv' // on destination Cassandra
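If you instead go with the first approach (the Spark job), a minimal sketch using the DataStax spark-cassandra-connector might look like the following; the host names, keyspace, and table are placeholders, and passing connection settings as per-operation options should be verified against your connector version:

import org.apache.spark.sql.SparkSession

// Read from the source cluster; host/keyspace/table names are placeholders.
val spark = SparkSession.builder()
  .appName("cassandra-table-migration")
  .config("spark.cassandra.connection.host", "old-cluster-host")
  .getOrCreate()

val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_ks", "table" -> "my_table"))
  .load()

// Recent connector versions accept connection settings as per-write options;
// verify this against the connector version you are running.
df.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_ks", "table" -> "my_table"))
  .option("spark.cassandra.connection.host", "new-cluster-host")
  .mode("append")
  .save()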
I have a Spark Streaming job with a batch interval of 2 minutes (configurable).
This job reads from a Kafka topic, creates a Dataset, applies a schema on top of it, and inserts these records into a Hive table.
The Spark Job creates one file per batch interval in the Hive partition like below:
dataset.coalesce(1).write().mode(SaveMode.Append).insertInto(targetEntityName);
Now the data that comes in is not that big, and even if I increase the batch duration to 10 minutes or so, I might end up with only 2-3 MB of data, which is way less than the HDFS block size.
This is the expected behaviour in Spark Streaming.
I am looking for efficient ways to do post-processing that merges all these small files into one big file.
If anyone's done it before, please share your ideas.
I would encourage you to not use Spark to stream data from Kafka to HDFS.
The Kafka Connect HDFS plugin by Confluent (or Apache Gobblin by LinkedIn) exists for this very purpose. Both offer Hive integration.
Find my comments about compaction of small files in this GitHub issue.
If you need to write Spark code to process Kafka data into a schema, you can still do that and write the result to another topic, preferably in Avro format, which Hive can easily read without a predefined table schema.
I personally have written a "compaction" process that grabs a bunch of hourly Avro data partitions from a Hive table and converts them into a daily-partitioned Parquet table for analytics (a rough sketch follows at the end of this answer). It's been working great so far.
If you want to batch the records before they land on HDFS, that's where Kafka Connect or Apache NiFi (mentioned in the link) can help, provided you have enough memory to buffer records before they are flushed to HDFS.
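For reference, a minimal sketch of that hourly-to-daily compaction idea, assuming hypothetical table names, a dt partition column, and paths:

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("daily-compaction")
  .enableHiveSupport()
  .getOrCreate()

// Pull one day's worth of hourly Avro partitions (table/column names are made up).
val day = "2019-01-01"
val hourly = spark.table("events_hourly_avro").where(s"dt = '$day'")

// Rewrite as a handful of larger Parquet files; pick the partition count so
// each output file lands near the HDFS block size.
hourly
  .repartition(8)
  .write
  .mode(SaveMode.Overwrite)
  .parquet(s"hdfs:///warehouse/events_daily_parquet/dt=$day")

// Finally, register the new partition with the daily table, e.g. via
// spark.sql(s"ALTER TABLE events_daily_parquet ADD IF NOT EXISTS PARTITION (dt = '$day')")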
I had exactly the same situation as you. I solved it as follows:
Let's assume that your newly arriving data is stored in a dataset: dataset1.
1 - Partition the table with a good partition key; in my case, I found that I can partition using a combination of keys to get around 100 MB per partition.
2 - Save using Spark Core, not Spark SQL:
a - Load the whole partition into memory (into a dataset: dataset2) when you want to save.
b - Then apply the dataset union function: dataset3 = dataset1.union(dataset2).
c - Make sure the resulting dataset is partitioned as you wish, e.g. dataset3.repartition(1).
d - Save the resulting dataset in "Overwrite" mode to replace the existing file (a sketch of these steps follows below).
If you need more details about any step, please reach out.
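A minimal Scala sketch of those steps, with placeholder paths; note that Spark cannot safely overwrite a path it is still reading from, so the union result goes to a temporary location first:

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("merge-small-files").getOrCreate()

// dataset1: the newly arrived batch (source path is a placeholder)
val dataset1 = spark.read.parquet("hdfs:///staging/new_batch")

// a - load the existing partition you are about to rewrite
val dataset2 = spark.read.parquet("hdfs:///warehouse/events/key=42")

// b + c - union the two and collapse to the desired number of files
val dataset3 = dataset1.union(dataset2).repartition(1)

// d - write to a temporary path first: overwriting the source path directly
// would fail or corrupt the data while it is still being read
dataset3.write.mode(SaveMode.Overwrite).parquet("hdfs:///warehouse/_tmp/key=42")
// ...then move the temporary output over the old partition (e.g. an HDFS rename)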
I'm trying to decrease the time Spark spends reading and writing data by using Alluxio.
But I found that I have to specify the path to read the data.
I've found that I can use Hive's metatool to change Hive's warehouse from HDFS to Alluxio, so I can write data to Alluxio via Spark SQL. But I don't know how to read Alluxio's data with SQL.
Is there any way to read/write Alluxio data just like Hive? Maybe read Alluxio's metadata and add it to the metastore?
All you need to do is modify the table location in the metastore that Spark uses.
You can check the Alluxio documentation for details; if altering the table location takes too long, check this thread for help.
Note that the first time you query that table, Alluxio will fetch data from the UFS (under file system). After the data is stored in Alluxio, future table queries will read directly from Alluxio.
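As an illustration, the location change can be issued through Spark SQL. A minimal sketch, where the table name and Alluxio URI are placeholders (19998 is Alluxio's default master RPC port):

// Point the existing table at the Alluxio path; names below are hypothetical.
spark.sql("ALTER TABLE my_db.my_table SET LOCATION 'alluxio://alluxio-master:19998/warehouse/my_table'")

// Subsequent SQL reads then go through Alluxio transparently:
spark.sql("SELECT COUNT(*) FROM my_db.my_table").show()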
I am trying to understand which of the two options below would be the better choice, especially in a Spark environment:
1. Loading the Parquet files directly into a dataframe and accessing the data (a ~1 TB data table)
2. Using a database to store and access the data.
I am working on a data pipeline design and trying to understand which of the above two options will result in a more optimized solution.
Loading the Parquet files directly into a dataframe and accessing the data is more scalable than reading an RDBMS like Oracle through a JDBC connector. I handle more than 10 TB of data, though I prefer the ORC format for better performance. I suggest you read the data directly from files; the reason is data locality - if your Spark executors run on the same hosts as the HDFS data nodes, they can read data into memory effectively without network overhead. See https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-data-locality.html and How does Apache Spark know about HDFS data nodes? for more details.
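To make the contrast concrete, here is a small sketch of both access paths; the paths, JDBC URL, and credentials are placeholders:

// Option 1: read the columnar files directly - scales out with executors and
// benefits from data locality when executors sit next to HDFS data nodes.
val direct = spark.read.parquet("hdfs:///warehouse/big_table")

// Option 2: pull the same data over JDBC - rows are funneled through database
// connections, which tends to become the bottleneck at the 1 TB scale.
val viaJdbc = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@db-host:1521/ORCL")
  .option("dbtable", "BIG_TABLE")
  .option("user", "scott")
  .option("password", "tiger")
  .load()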
We have 2 clusters, one MapR and another of our own. We want to create a new setup on our own hardware using the MapR data.
1. I copied all the ORC files from the MapR cluster and followed the same folder structure
2. Created an ORC-formatted table with the location from step 1
3. Then executed "MSCK REPAIR TABLE <>"
The above steps passed without error, but when I query the partitions, the job fails with the error below:
java.lang.IllegalArgumentException: Buffer size too small. size = 262144 needed = 4958903
at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.readHeader(InStream.java:193)
at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.read(InStream.java:238)
Can someone tell me whether we can create Hive ORC partitioned tables directly from the ORC files?
My storage is Azure Data Lake.
According to your description, based on my understanding, I think you want to copy all the ORC files from one cluster to another and load these ORC files as a Hive table.
To do that, please try the command below to create an external table for loading the ORC file data.
CREATE EXTERNAL TABLE IF NOT EXISTS <table name> (<column_name column_type>, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS ORC
LOCATION '<orcfile path>'
If you are not aware of the column list of an ORC file, you can refer to the Hive manual's ORC File Dump Utility to print the ORC file metadata in JSON format via hive --orcfiledump -j -p <location-of-orc-file-or-directory>.
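One caveat: since the copied layout is partitioned, the external table also needs its partition columns declared, and the partitions registered afterwards. A minimal sketch via Spark SQL, where the table name, columns, and ADL path are all placeholders:

// Hypothetical partitioned variant of the external table above.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS my_orc_table (id BIGINT, payload STRING)
  PARTITIONED BY (dt STRING)
  STORED AS ORC
  LOCATION 'adl://mydatalake.azuredatalakestore.net/warehouse/my_orc_table'
""")

// Register the partition directories that were copied over.
spark.sql("MSCK REPAIR TABLE my_orc_table")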
I am using Spark Streaming to write aggregated output as Parquet files to HDFS using SaveMode.Append. I have an external table created like:
CREATE TABLE if not exists rolluptable
USING org.apache.spark.sql.parquet
OPTIONS (
path "hdfs:////"
);
I was under the impression that, with an external table, queries should also fetch the data from newly added Parquet files. But it seems the newly written files are not being picked up.
Dropping and recreating the table every time works fine, but that is not a solution.
Please suggest how my table can also pick up the data from newer files.
Are you reading those tables with Spark?
If so, Spark caches Parquet table metadata (since schema discovery can be expensive).
To overcome this, you have 2 options:
1. Set the config spark.sql.parquet.cacheMetadata to false.
2. Refresh the table before the query: sqlContext.refreshTable("my_table").
See here for more details: http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-metastore-parquet-table-conversion
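A quick sketch of both options; refreshTable is the SQLContext call from the Spark 1.x docs linked above, and Spark 2.x exposes the same thing as spark.catalog.refreshTable:

// Option 1: disable Parquet metadata caching globally for the session.
sqlContext.setConf("spark.sql.parquet.cacheMetadata", "false")

// Option 2: invalidate the cached metadata for one table before querying it.
sqlContext.refreshTable("my_table")
sqlContext.sql("SELECT COUNT(*) FROM my_table").show()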