I'd like to stream the contents of a Couchbase bucket to S3 as Parquet files using a Spark job. I'm currently leveraging Couchbase's Spark streaming integration with the Couchbase connector to generate a DStream, but each DStream contains multiple RDDs that only hold around 10 records each. I could write a file for each RDD and upload them individually to S3, but with 12 million records to import I'd end up with around a million small files in S3, which is not ideal. What would be the best way to load the contents of a Couchbase bucket into S3 using a Spark job? Ideally I'd end up with a single Parquet file containing the entire contents of the bucket.
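If it helps, the fallback I'm considering is to let the streaming job keep writing its small per-batch files to a staging prefix and then run a separate batch job to compact them into a single Parquet file, roughly like the sketch below (bucket names and paths are made up, and I'm not sure this is the right approach):

import org.apache.spark.sql.SparkSession

object CompactCouchbaseDump {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("compact-couchbase-dump").getOrCreate()

    // Read back all the small per-batch Parquet files written by the streaming job...
    spark.read.parquet("s3://my-bucket/couchbase-staging/")
      // ...and collapse them into a single partition so only one Parquet file is written.
      // coalesce(1) funnels everything through one task, so it only works if the final
      // output is small enough for a single executor to write in reasonable time.
      .coalesce(1)
      .write.mode("overwrite")
      .parquet("s3://my-bucket/couchbase-final/")

    spark.stop()
  }
}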
Related
I am designing a solution that reads a DynamoDB stream and writes to an S3 data lake using Lambda.
I will be reading a maximum batch size of 10,000 changes in a batch window of about 2 minutes. Depending on the velocity of change in the database, my design might also create small S3 files. I am seeing customers have issues with Spark not working well with small S3 files. How common is this issue?
I'm currently playing around with Spark on an EMR cluster. I noticed that if I perform the reads/writes into/out of my cluster in the Spark script itself, there is an absurd wait time before my output shows up in the S3 console, even for relatively lightweight files. Would this write be expedited by writing to HDFS in my Spark script and then adding an additional step to transfer the output from HDFS to S3 using s3-dist-cp?
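For concreteness, the two-step version I have in mind looks something like this sketch (paths are placeholders and I haven't actually run it yet):

import org.apache.spark.sql.SparkSession

object WriteToHdfsThenCopy {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hdfs-then-s3").getOrCreate()

    // Write the job output to HDFS on the EMR cluster first...
    val result = spark.read.parquet("s3://my-bucket/input/")
    result.write.mode("overwrite").parquet("hdfs:///user/hadoop/output/")

    spark.stop()
    // ...then add an EMR step (or run it on the master node) along the lines of:
    //   s3-dist-cp --src hdfs:///user/hadoop/output/ --dest s3://my-bucket/output/
  }
}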
Log files will be dropped into S3 at some interval, and I want to automate picking up the new files from S3 and pushing them into my PySpark ETL code. Can we watch S3 using Spark streaming, and how would that be done with Python?
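What I have in mind is something like Structured Streaming's file source, sketched below in Scala to match the other snippets in this thread (the schema and paths are made up); I'm assuming the same readStream/writeStream calls are available in PySpark. Is that the right mechanism for watching an S3 prefix?

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object WatchS3Logs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("watch-s3-logs").getOrCreate()

    // Schema for the incoming log records (made up for illustration).
    val logSchema = new StructType()
      .add("timestamp", StringType)
      .add("message", StringType)

    // The file source monitors the prefix: existing files are processed on the
    // first trigger, and new files are picked up as they arrive.
    val logs = spark.readStream
      .schema(logSchema)
      .json("s3://my-bucket/incoming-logs/")

    val query = logs.writeStream
      .format("parquet")
      .option("path", "s3://my-bucket/etl-output/")
      .option("checkpointLocation", "s3://my-bucket/checkpoints/logs/")
      .start()

    query.awaitTermination()
  }
}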
Is it possible to get Presto to query only a subset of the files in an S3 folder, filtered by file updated/created time? I have a folder that contains thousands of files, and I'm hoping for a solution that does not require me to rearrange the data in S3.
I am using a vanilla self-hosted Presto cluster, not Athena, and I'm not using S3 Select either.
I am trying to migrate an entire table from my RDS instance (MySQL 5.7) to either S3 (as a CSV file) or Hive.
The table holds about 2TB of data and has a BLOB column that stores a zip file (usually around 100KB, but it can reach 5MB).
I ran some tests with Spark, Sqoop, and AWS DMS, but had problems with all of them. I have no experience exporting data from RDS with these tools, so I'd really appreciate any help.
Which tool is the most suitable for this task, and what strategy do you think is more efficient?
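For reference, the Spark approach I had in mind was a partitioned JDBC read dumped straight to S3, roughly like this sketch (the endpoint, credentials, partition column, and bounds are all placeholders):

import org.apache.spark.sql.SparkSession

object ExportRdsTable {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rds-to-s3").getOrCreate()

    // Partitioned JDBC read so the 2TB table is pulled over many parallel connections
    // instead of one. The "id" column, bounds, and partition count are placeholders.
    val df = spark.read.format("jdbc")
      .option("url", "jdbc:mysql://my-rds-endpoint:3306/mydb")
      .option("dbtable", "mytable")
      .option("user", "myuser")
      .option("password", "mypassword")
      .option("partitionColumn", "id")
      .option("lowerBound", "1")
      .option("upperBound", "200000000")
      .option("numPartitions", "200")
      .option("fetchsize", "1000")
      .load()

    // Parquet keeps the BLOB column as binary; CSV would force it through a string encoding.
    df.write.mode("overwrite").parquet("s3://my-bucket/mytable/")

    spark.stop()
  }
}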
You can copy the RDS data to S3 using AWS Data Pipeline. Here is an example that does exactly that.
Once you've taken the dump to S3 in CSV format, it is easy to read the data with Spark and register it as a Hive table:
val df = spark.read.csv("s3://...")  // add .option("header", "true") etc. to match how the dump was written
df.write.saveAsTable("mytable")      // persists as a Hive table (requires a SparkSession with Hive support enabled)