Streaming Couchbase data to S3 - apache-spark

I'd like to stream the contents of a Couchbase bucket to S3 as a Parquet file using a Spark job. I'm currently leveraging Couchbase's Spark streaming integration with the Couchbase connector to generate a DStream, but each DStream is made up of multiple RDDs that only contain around 10 records each. I could create a file for each RDD and upload them individually to S3, but with 12 million records to import I would be left with around a million small files in S3, which is not ideal. What would be the best way to load the contents of a Couchbase bucket into S3 using a Spark job? I'd ultimately like to end up with a single Parquet file containing all the contents of the Couchbase bucket, if possible.
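For reference, if a one-shot batch export is acceptable instead of the DStream route, a DataFrame read through the Couchbase Spark connector followed by a coalesced Parquet write avoids the many-small-files problem entirely. This is only a minimal sketch, assuming the connector 3.x source name ("couchbase.query") and config keys, with placeholder credentials, bucket name and output path:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("couchbase-to-s3")
  .config("spark.couchbase.connectionString", "couchbase://127.0.0.1") // placeholder host
  .config("spark.couchbase.username", "user")                          // placeholder credentials
  .config("spark.couchbase.password", "pass")
  .getOrCreate()

// Read the whole bucket as a DataFrame (connector 3.x exposes a "couchbase.query" source)
val docs = spark.read
  .format("couchbase.query")
  .option("bucket", "my-bucket") // placeholder bucket name
  .load()

// coalesce(1) yields a single Parquet file; for ~12M records a handful of
// partitions (e.g. coalesce(8)) is usually a safer compromise than one giant task.
docs.coalesce(1)
  .write
  .mode("overwrite")
  .parquet("s3a://my-output-bucket/couchbase-export/") // placeholder output path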

Related

Spark too many small S3 files with DDB stream

I am designing a solution that reads a DynamoDB stream and writes to an S3 data lake using Lambda.
I will be reading a maximum batch size of 10,000 changes in a batch window of about 2 minutes. Depending on the velocity of change in the database, my design might also create small S3 files. I have seen customers having issues with Spark not working properly with small S3 files, and I am wondering how common this issue is.
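A common pattern here is to leave the Lambda writer as-is and compact the small files periodically with a separate Spark job before analytics run over them. A minimal sketch of such a compaction pass, assuming JSON change records and placeholder S3 prefixes:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("s3-compaction").getOrCreate()

// Read the many small JSON files the Lambda wrote for one slice of the data lake
val raw = spark.read.json("s3a://my-datalake/raw/dt=2021-01-01/") // placeholder input prefix

// Rewrite them as a small number of larger Parquet files; tune the partition
// count so each output file lands roughly in the 128 MB - 1 GB range.
raw.repartition(4)
  .write
  .mode("overwrite")
  .parquet("s3a://my-datalake/compacted/dt=2021-01-01/") // placeholder output prefix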

Is s3-dist-cp faster for transferring data between S3 and EMR than using Spark and S3 directly?

I'm currently playing around with Spark on an EMR cluster. I noticed that if I perform the reads/writes into/out of my cluster in the Spark script itself, there is an absurd wait time for my output data to show up in the S3 console, even for relatively lightweight files. Would this write be expedited by writing to HDFS in my Spark script and then adding an additional step to transfer the output from HDFS to S3 using s3-dist-cp?
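The two-step approach does sidestep the slow rename-based commit against S3. A minimal sketch of what it looks like, with placeholder paths (the s3-dist-cp invocation would typically run as an additional EMR step after the Spark step completes):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("write-to-hdfs-then-s3").getOrCreate()

// Placeholder for whatever the job actually computes
val result = spark.read.parquet("hdfs:///input/mydata/")

// Step 1: write the output to HDFS, where the commit/rename phase is cheap
result.write
  .mode("overwrite")
  .parquet("hdfs:///output/myjob/") // placeholder HDFS path

// Step 2 (outside Spark): copy the finished output to S3 in one pass, e.g.
//   s3-dist-cp --src hdfs:///output/myjob/ --dest s3://my-bucket/myjob/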

Automate pulling json files from S3 and pushing the same to pyspark for ETL

Log files will be dropped into S3 at some interval, and I want to automate picking up the new files from S3 and pushing them into my PySpark ETL code. Can we watch S3 using Spark streaming, and how do I do that with Python?
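Structured Streaming's file source can monitor an S3 prefix and process only newly arrived files in each micro-batch; the same spark.readStream API is available in PySpark. A minimal sketch (shown in Scala to match the rest of this page), with placeholder paths and a hypothetical log schema:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("s3-log-etl").getOrCreate()

// File sources require an explicit schema; these fields are placeholders.
val logSchema = new StructType()
  .add("timestamp", StringType)
  .add("level", StringType)
  .add("message", StringType)

// Monitor the S3 prefix; each micro-batch contains only newly arrived files.
val logs = spark.readStream
  .schema(logSchema)
  .json("s3a://my-log-bucket/incoming/") // placeholder input prefix

// Apply the ETL transformations here, then write the results out continuously.
val query = logs.writeStream
  .format("parquet")
  .option("path", "s3a://my-log-bucket/processed/")              // placeholder output path
  .option("checkpointLocation", "s3a://my-log-bucket/checkpoints/etl/")
  .start()

query.awaitTermination()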

Making presto/trino only query a subset of files in s3

Is it possible to get Presto to only query a subset of files in an S3 folder by file updated/created time? I have a folder that contains thousands of files and am hoping for a solution that does not require me to rearrange the data in S3.
I am using a vanilla self-hosted Presto cluster, not Athena, and I am not using S3 Select either.

How to export a 2TB table from a RDS instance to S3 or Hive?

I am trying to migrate an entire table from my RDS instance (MySQL 5.7) to either S3 (as a CSV file) or Hive.
The table has a total of 2TB of data, and it has a BLOB column which stores a zip file (usually 100KB, but it can reach 5MB).
I made some tests with Spark, Sqoop and AWS DMS, but had problems with all of them. I have no experience exporting data from RDS with those tools, so I really appreciate any help.
Which one is the most recommended for this task? And what strategy do you think is more efficient?
You can copy the RDS data to S3 using AWS Data Pipeline, which provides a ready-made template for exactly this kind of RDS-to-S3 copy.
Once you have taken the dump to S3 in CSV format, it is easy to read the data with Spark and register it as a Hive table.
val df = spark.read.csv("s3://...")
df.write.saveAsTable("mytable") // saves as a Hive table
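For a 2TB dump it may also help to be explicit about the CSV options and to store the Hive table as Parquet rather than relying on defaults. A minimal sketch along those lines, where the path, header option and table name are placeholders and the dump format is an assumption:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("rds-dump-to-hive")
  .enableHiveSupport()
  .getOrCreate()

// Placeholder path and options: adjust header, delimiter, etc. to match the actual dump
val dump = spark.read
  .option("header", "true")
  .csv("s3://my-bucket/rds-dump/")

// Store the Hive table as Parquet, which compresses and scans far better than CSV at this size
dump.write
  .mode("overwrite")
  .format("parquet")
  .saveAsTable("mytable")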
