I want to run a dsbulk unload command, but my Cassandra cluster has ~1 TB of data in the table I want to export. Is there a way to run the dsbulk unload command and stream the data into S3 instead of writing to disk?
I'm running the following command in my dev environment, but obviously this is just writing to disk on my machine:
bin/dsbulk unload -k myKeySpace -t myTable -url ~/data --connector.csv.compression gzip
It doesn't support this natively out of the box. In theory it could be implemented, since DSBulk is now open source, but somebody would have to do it.
Update:
A workaround, as pointed out by Adam, is to use aws s3 cp and pipe DSBulk's output into it, like this:
dsbulk unload .... |aws s3 cp - s3://...
but there is a limitation: the unload will be performed in a single thread, so it could be much slower.
In the short term you can use Apache Spark in local master mode with the Spark Cassandra Connector, something like this (for Spark 2.4):
spark-shell --packages com.datastax.spark:spark-cassandra-connector-assembly_2.11:2.5.1
and inside:
val data = spark.read.format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "table_name", "keyspace" -> "keyspace_name"))
  .load()
data.write.format("json").save("s3a://....")
Related
My problem is as below:
A pyspark script that runs perfectly on a local machine and on an EC2 instance has been ported to an EMR cluster for scaling up. There's a config file that specifies relative locations for the outputs.
An example:
Config
feature_outputs= /outputs/features/
File structure:
classifier_setup
    feature_generator.py
    model_execution.py
    config.py
    utils.py
    logs/
    models/
    resources/
    outputs/
The code reads the config, generates features, and writes them to the path mentioned above. On EMR, this is getting saved into HDFS (spark.write.parquet writes into HDFS; on the other hand, df.toPandas().to_csv() writes to the relative output path mentioned). The next part of the script reads the same path mentioned in the config, tries to read the parquet from that location, and fails.
How can I make sure that the outputs are created in the relative path that is specified in the code?
If that's not possible, how can I make sure that I read them from HDFS in the subsequent steps?
I referred to these discussions on HDFS paths, but it's not very clear to me. Can someone help me with this?
Thanks.
Short Answer to your question:
Writing with Pandas and writing with Spark are two different things. Pandas doesn't use Hadoop to process, read, or write; it writes to the standard EMR (local) filesystem, which is not HDFS. Spark, on the other hand, uses distributed computing to get things onto multiple machines at the same time, and it's built on top of Hadoop, so by default when you write with Spark it writes into HDFS.
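A small illustration of that difference, assuming df is a Spark DataFrame (the path is just an example):

# Spark resolves paths through Hadoop, so on EMR a bare path ends up in HDFS
df.write.parquet("/outputs/features/")

# pandas writes through the OS, so the same path lands on the local disk
# of the node the driver runs on, not in HDFS
df.toPandas().to_csv("/outputs/features/sample.csv")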
While writing from EMR, you can choose to write either into
EMR local filesystem,
HDFS, or
EMRFS (which is S3 buckets).
Refer to the AWS documentation.
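In path terms, the three targets above look roughly like this (bucket and paths are placeholders):

df.write.parquet("file:///home/hadoop/outputs/features/")   # local filesystem of the nodes
df.write.parquet("hdfs:///outputs/features/")               # cluster HDFS
df.write.parquet("s3://my-bucket/outputs/features/")        # EMRFS (S3)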
If at the end of your job you are writing with a Pandas dataframe and you want to write it into an HDFS location (maybe because your next-step Spark job reads from HDFS, or for some other reason), you might have to use PyArrow for that.
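A minimal sketch of that with PyArrow (not from the original answer; the host setting and output path are placeholders, and the exact API varies between PyArrow versions):

import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

hdfs = fs.HadoopFileSystem(host="default")     # picks up the cluster's Hadoop config
table = pa.Table.from_pandas(pandas_df)        # pandas_df is your pandas DataFrame

with hdfs.open_output_stream("/outputs/features/features.parquet") as out:
    pq.write_table(table, out)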
If at the end of your job you are writing into HDFS using a Spark dataframe, in the next step you can read it using a path like hdfs://<feature_outputs>.
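For example, with the config value from the question (feature_outputs=/outputs/features/), the next step could read it roughly like this:

# hdfs:/// resolves against the cluster's default namenode
features = spark.read.parquet("hdfs:///outputs/features/")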
Also, while you are saving data into EMR HDFS, keep in mind that the default EMR storage is volatile: all the data will be lost once the EMR cluster goes down, i.e. gets terminated. If you want to keep your data stored in EMR, you might have to attach an external EBS volume that can also be used with another EMR cluster, or use some other storage solution that AWS provides.
The best option, if you are writing data that needs to be persisted, is to write it into S3 instead of EMR.
The use case is to load a local file into HDFS. Below are two approaches to do the same; please suggest which one is more efficient.
Approach 1: Using the hdfs put command
hadoop fs -put /local/filepath/file.parquet /user/table_nm/
Approach 2: Using Spark.
spark.read.parquet("/local/filepath/file.parquet ").createOrReplaceTempView("temp")
spark.sql(s"insert into table table_nm select * from temp")
Note:
The source file can be in any format.
No transformations are needed for file loading.
table_nm is a Hive external table pointing to /user/table_nm/.
Assuming the local .parquet files are already built, using -put will be faster as there is no overhead of starting a Spark application.
If there are many files, there is still less work to do via -put.
I have an external Hive table defined with a location in S3:
LOCATION 's3n://bucket/path/'
When writing to this table at the end of a pyspark job that aggregates a bunch of data, the write to Hive is extremely slow because only one executor/container is being used for the write. When writing to an HDFS-backed table, the write happens in parallel and is significantly faster.
I've tried defining the table using the s3a path but my job fails due to some vague errors.
This is on Amazon EMR 5.0 (Hadoop 2.7) with pyspark 2.0, but I have experienced the same issue with previous versions of EMR/Spark.
Is there a configuration or alternative library that I can use to make this write more efficient?
I guess you're using Parquet. The DirectParquetOutputCommitter has been removed to avoid a potential data loss issue. The change was actually made in 04/2016.
It means the data you write to S3 will first be saved in a _temporary folder and then "moved" to its final location. Unfortunately, "moving" == "copying & deleting" in S3, and it is rather slow. To make it worse, this final "move" is done only by the driver.
If you don't want to fight to add that class back, you will have to write to local HDFS and then copy the data over (I do recommend this). In HDFS, "moving" ~ "renaming", so it takes almost no time.
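A hedged sketch of that two-step approach (paths are placeholders; the copy step is usually done after the job with s3-dist-cp, which ships with EMR, or hadoop distcp):

# 1) write to HDFS, where the committer's final "move" is a cheap rename
df.write.mode("overwrite").parquet("hdfs:///tmp/staging/aggregated_table/")

# 2) then copy the finished output to S3, for example:
#    s3-dist-cp --src hdfs:///tmp/staging/aggregated_table/ --dest s3://bucket/path/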
I'm just getting started with Apache Spark. I'm using cluster mode and I want to process a big file. I am using the textFile method from SparkContext; it will read a local file system available on all nodes.
Because my file is really big, it is a pain to copy it to each cluster node. My question is: is there any way to have this file in a single location, like a shared folder?
Thanks a lot
You can keep the file in Hadoop (HDFS) or S3.
Then you can give the path of the file in the textFile method itself.
For S3:
val data = sc.textFile("s3n://yourAccessKey:yourSecretKey@yourBucket/path/")
For Hadoop (HDFS):
val hdfsRDD = sc.textFile("hdfs://...")
Is it possible to create an RDD using data from the master or a worker? I know there is an option, sc.textFile(), which sources the data from the local system (driver); similarly, can we use something like "master:file://input.txt"? I am accessing a remote cluster, my input data is large, and I cannot log in to the remote cluster.
I am not looking for S3 or HDFS. Please suggest if there is any other option.
Data in an RDD is always controlled by the Workers, whether it is in memory or located in a data-source. To retrieve the data from the Workers into the Driver you can call collect() on your RDD.
You should put your file on HDFS or a filesystem that is available to all nodes.
The best way to do this is, as you stated, to use sc.textFile. To do that you need to make the file available on all nodes in the cluster. Spark provides an easy way to do this via the --files option of spark-submit. Simply pass the option followed by the path to the file that you need copied.
You can access the Hadoop file by creating a Hadoop configuration.
import org.apache.spark.deploy.SparkHadoopUtil
import java.io.{File, FileInputStream, FileOutputStream, InputStream}

// reuse Spark's Hadoop configuration to get a FileSystem handle for the file's URI
val hadoopConfig = SparkHadoopUtil.get.conf
val fs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI(fileName), hadoopConfig)
val fsPath = new org.apache.hadoop.fs.Path(fileName)
Once you have the FileSystem and the Path, you can copy, delete, move, or perform any other operations (e.g. fs.exists(fsPath), fs.delete(fsPath, true), fs.copyFromLocalFile(src, dst)).