Making Presto/Trino only query a subset of files in S3

Is it possible to get Presto to query only a subset of the files in an S3 folder, selected by file updated time or created time? I have a folder that contains thousands of files and am hoping for a solution that does not require me to rearrange the data in S3.
I am using a vanilla self-hosted Presto cluster, not Athena, and I am not using S3 Select either.

Related

Streaming Couchbase data to S3

I'd like to stream the contents of a Couchbase bucket to S3 as a Parquet file using a Spark job. I'm currently using Couchbase's Spark Streaming integration (the Couchbase connector) to generate a DStream, but within each DStream there are multiple RDDs that contain only around 10 records each. I could create a file for each RDD and upload them individually to S3, but since I have 12 million records to import, I would be left with around a million small files in S3, which is not ideal. What would be the best way to load the contents of a Couchbase bucket into S3 using a Spark job? Ultimately I'd like to have a single Parquet file with all the contents of the Couchbase bucket, if possible.

History Server running with different Spark versions

I have a use case where a Spark application runs on one Spark version and publishes its event data to S3, and the history server is started from the same S3 path but with a different Spark version. Will this cause any problems?
No, it will not cause any problems as long as you can read from the S3 bucket in that specific format. Spark versions are mostly compatible. As long as you can figure out how to work with the specific version, you're good.
EDIT:
Spark writes to the S3 bucket in the data format that you specify. For example, if you create a txt file on a PC, any computer can open that file. Similarly on S3, once you've created a Parquet file, any Spark version can open it; just the API may be different.

Working on Local Partitions in Spark

I have a huge file stored in S3 that I am loading into my Spark cluster, and I want to invoke a custom Java library that takes an input file location, processes the data, and writes to a given output location. However, I cannot rewrite that custom logic in Spark.
I am trying to see whether I can load the file from S3, save each partition to local disk, and give that location to the custom Java app; once it has been processed, I would load all the partitions and save them to S3.
Is this possible? From what I have read so far, it looks like I need to use the RDD API, but I couldn't find more info on how to save each partition to local disk.
I'd appreciate any input.

AWS Data Lake Ingest

Do you need to ingest Excel and other proprietary formats using Glue, or can you let Glue crawl your S3 bucket in order to use these data formats within your data lake?
I have gone through the "Data Lake Foundation on the AWS Cloud" document and am left scratching my head about getting data into the lake. I have a data provider with a large set of data stored on their system as Excel and Access files.
Based on the process flow, they would upload the data into the submission S3 bucket, which would set off a series of actions, but there is no ETL of the data into a format that would work with the other tools.
Would using these files require running Glue on the data that is submitted to the bucket, or is there another way to make this data available to other tools such as Athena and Redshift Spectrum?
Thank you for any light you can shed on this topic.
-Guido
I don't see anything that can take Excel data directly into the data lake. You might need to convert it into CSV/TSV/JSON or another format before loading it into the data lake.
Formats Supported by Redshift Spectrum:
http://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-data-files.html -- Again I don't see Excel as of now.
Athena Supported File Formats:
http://docs.aws.amazon.com/athena/latest/ug/supported-formats.html -- I don't see Excel supported here either.
You need to upload the files to S3 to use either Athena or Redshift Spectrum, or even Redshift storage itself.
Uploading Files to S3:
If you have bigger files, use S3 multipart upload to upload them more quickly. If you want more speed, use S3 Transfer Acceleration to upload your files.
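As a rough illustration of what a multipart upload involves, here is a minimal sketch that only plans how a file would be split into parts. It makes no AWS calls. The function name `plan_parts` and the 8 MiB default are my own assumptions; the limits it respects (at least 5 MiB per part except the last, at most 10,000 parts per upload) are the documented S3 ones.

```python
import math
from typing import List, Tuple

MIB = 1024 ** 2
MIN_PART_SIZE = 5 * MIB   # S3 minimum size for every part except the last
MAX_PARTS = 10_000        # S3 maximum number of parts in one multipart upload


def plan_parts(file_size: int, preferred_part_size: int = 8 * MIB) -> List[Tuple[int, int]]:
    """Return (offset, length) byte ranges that cover a file of file_size bytes.

    Hypothetical helper for illustration only; it does not talk to S3.
    """
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    part_size = max(preferred_part_size, MIN_PART_SIZE)
    # Grow the part size if the file would otherwise need more than MAX_PARTS parts.
    part_size = max(part_size, math.ceil(file_size / MAX_PARTS))
    parts = []
    offset = 0
    while offset < file_size:
        length = min(part_size, file_size - offset)  # the last part may be shorter
        parts.append((offset, length))
        offset += length
    return parts
```

Each range would then be uploaded as one part, and the parts can be sent in parallel, which is where the speed-up comes from.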
Querying Big Data with Athena:
You can create external tables with Athena from S3 locations. Once you create the external tables, use the Athena SQL reference to query your data.
http://docs.aws.amazon.com/athena/latest/ug/language-reference.html
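To give an idea of what such an external table definition looks like, here is a small sketch that just builds the DDL text for a delimited-text table. The helper name, table name, columns and bucket path are all made up for illustration; check the generated statement against the Athena reference before running it.

```python
def create_external_table_ddl(table, columns, s3_location, delimiter=","):
    """Build a CREATE EXTERNAL TABLE statement for delimited text files.

    Hypothetical helper: `columns` is a list of (name, type) pairs and
    `s3_location` is the S3 folder that holds the data files.
    """
    cols = ",\n  ".join(f"`{name}` {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"LOCATION '{s3_location}'"
    )


# Example with placeholder names; nothing here refers to a real bucket.
ddl = create_external_table_ddl(
    "submissions",
    [("id", "int"), ("name", "string")],
    "s3://example-bucket/converted/",
)
print(ddl)
```

The table only points at the files; the data itself stays in S3.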
Querying Big Data with Redshift Spectrum:
Similar to Athena, you can create external tables with Redshift. Start querying those tables and get the results in Redshift.
Redshift works with a lot of commercial tools; I use SQL Workbench. It is free, open source, rock solid, and supported by AWS.
SQL WorkBench: http://www.sql-workbench.net/
Connecting your WorkBench to Redshift: http://docs.aws.amazon.com/redshift/latest/mgmt/connecting-using-workbench.html
Copying data to Redshift:
Also, if you want to move the data storage to Redshift, you can use the COPY command to pull the data from S3 and load it into Redshift.
Copy Command Examples:
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html
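For a CSV file already sitting in S3, the statement is short. Below is a sketch that only assembles the COPY text; the helper name, table name, bucket path and role ARN are placeholders I made up, and it assumes the load is authorized through an IAM role attached to the cluster.

```python
def copy_from_s3_sql(table, s3_prefix, iam_role_arn, ignore_header=1):
    """Build a Redshift COPY statement for CSV files under an S3 prefix.

    Hypothetical helper for illustration; it returns SQL text and runs nothing.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS CSV\n"
        f"IGNOREHEADER {ignore_header};"
    )


# Placeholder values only.
sql = copy_from_s3_sql(
    "submissions",
    "s3://example-bucket/converted/",
    "arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
print(sql)
```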
Redshift Cluster Size and Number of Nodes:
Before creating a Redshift cluster, check the required size and the number of nodes you need. More nodes let queries run with more parallelism. One more important factor is how well your data is distributed (distribution key and sort keys).
I have had a very good experience with Redshift, though getting up to speed might take some time.
Hope it helps.

How to export a 2TB table from a RDS instance to S3 or Hive?

I am trying to migrate an entire table from my RDS instance (MySQL 5.7) to either S3 (csv file) or Hive.
The table has a total of 2TB of data. And it has a BLOB column which stores a zip file (usually 100KB, but it can reach 5MB).
I made some tests with Spark, Sqoop and AWS DMS, but had problems with all of them. I have no experience exporting data from RDS with those tools, so I really appreciate any help.
Which one is the most recommended for this task? And what strategy do you think is more efficient?
You can copy the RDS data to S3 using AWS Data Pipeline. Here is an example which does the very thing.
Once you have taken the dump to S3 in CSV format, it is easy to read the data using Spark and register it as a Hive table.
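// Assumes an existing SparkSession named spark with Hive support enabled
val df = spark.read.csv("s3://...")
// saveAsTable is a DataFrameWriter method, so it is reached through df.write
df.write.saveAsTable("mytable") // saves as a Hive table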
