Set the number of files written by an INSERT statement - Presto

Is there a config for controlling the number of files written using INSERT or CREATE TABLE AS in Presto? Looking for something similar or identical to the Spark counterpart spark.sql.shuffle.partitions = 1.
I am looking to decrease the number of small files that are generated with INSERT, to avoid additional ETL in Spark with the above Spark config. Is this possible? I haven't found anything close to this in the Presto docs.

You can't control the number of output files directly, but you can reduce the number of files that get written by turning on the scale-writers config option (or scale_writers session property). Add the following to the config.properties file:
scale-writers=true
When that option is enabled, Trino (formerly known as PrestoSQL) will start with the minimum number of writers and scale up as needed based on throughput.
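If you prefer to set it per session rather than cluster-wide, the scale_writers session property can also be passed from a client. Below is a rough, untested sketch using the trino Python client; the connection details are placeholders and the session_properties argument is assumed from that client's documentation:
# Rough sketch, not tested: enable scale_writers for this connection only.
# Host, user, catalog and schema values are placeholders.
import trino

conn = trino.dbapi.connect(
    host="coordinator.example.com",
    port=8080,
    user="etl",
    catalog="hive",
    schema="default",
    session_properties={"scale_writers": "true"},  # session-level equivalent of scale-writers=true
)

cur = conn.cursor()
cur.execute("INSERT INTO target_table SELECT * FROM source_table")
cur.fetchall()  # drain the result so the INSERT runs to completion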
See this discussion on the Trino Community Slack:
https://trinodb.slack.com/archives/CFLB9AMBN/p1564046069087800?thread_ts=1563945529.046400&cid=CFLB9AMBN
Unfortunately, this option is not yet documented as of Presto 327. I created an issue to track this improvement to the documentation: https://github.com/trinodb/trino/issues/2352.

Related

What is the correct way to use maxBytesPerTrigger in Pyspark?

I'm using Spark readStream and setting option maxBytesPerTrigger like this: temp_data = spark.readStream.format("delta").option("maxBytesPerTrigger",1000).load(raw_data_delta_table)
But my whole file is loaded in a single batch. I want it to load in multiple batches. What am I missing? And yes, if I use maxFilesPerTrigger then it works fine, but maxBytesPerTrigger is not working.
Thanks
You can find the behavior of maxBytesPerTrigger on the official Delta page.
maxBytesPerTrigger: How much data gets processed in each micro-batch. This option sets a “soft max”, meaning that a batch processes approximately this amount of data and may process more than the limit. If you use Trigger.Once for your streaming, this option is ignored. This is not set by default.
Spark needs to read in an entire file; it cannot read pieces of a file. So even if the soft max is smaller than the file size, Spark will still ingest that entire file in one batch.
See "Limit Input Rate" Section in link below.
https://docs.databricks.com/delta/delta-streaming.html
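As a small, untested sketch building on the snippet from the question (the option values are placeholders), the two limits can also be combined; note that neither option will split a single input file, so a file larger than the byte limit still arrives as one piece:
# Sketch: cap each micro-batch by files *and* bytes (values are placeholders).
# maxBytesPerTrigger is a soft limit and never splits an individual file.
temp_data = (
    spark.readStream
    .format("delta")
    .option("maxFilesPerTrigger", 10)      # hard cap: at most 10 files per batch
    .option("maxBytesPerTrigger", "512m")  # soft cap: roughly 512 MB per batch
    .load(raw_data_delta_table)
)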

What is the best way to export all of my data from a Cassandra Cluster?

I am very new to Cassandra and any help here would be appreciated. I have a cluster of 6 nodes that spans 2 datacenters (3 nodes in each datacenter). My client has decided that they do not want to renew their Cassandra license with DataStax anymore and want their data exported into a format that can be easily imported into another database in the future. I was thinking of exporting the data as a CSV file, but since the data is distributed between all the nodes, I am not sure what the best way to export all the data is.
One option - you should be able to use the CQL COPY command, which copies the data into CSV format. The nice thing about COPY is that you can run it from a single node (i.e. it is not a "node"-level tool). The command would be (once in cqlsh):
CQL> COPY <keyspace>.<table> TO '/path/to/file'
If there is a LOT of data, or a lot of tables, this tool may not be a great fit. But for a small number of tables that don't have huge row counts (fewer than several million), this works well. Hope that helps.
-Jim
Since 2018 you can use DSBulk with DSE to export or import data to/from CSV (by default), or JSON. Since the end of 2019 it's possible to use it with open source Cassandra as well.
It could be as simple as:
dsbulk unload -k keyspace -t table -u user -p password -url filename
DSBulk is heavily optimized for fast data export without putting too much load onto the coordinator node, which is what happens when you just run select * from table.
You can control what columns to export, and even provide your own query, etc. DataStax blog has a series of blog posts about different aspects of using DSBulk:
Introduction and Loading
More Loading
Common Settings
Unloading
Counting
Examples for Loading From Other Locations
You can use the CQL COPY command for exporting data from a Cassandra cluster. However, it only performs well for small data sets; if you have a large amount of data, this command is not useful because it will run into errors or timeouts. Alternatively, you can use sstabledump to export each node's data into JSON format. Hope this is useful for you.
I have implemented a small script for this purpose. It isn't the best way, since it is slow and, in my experience, produces connection errors on system tables. But it can be useful for inspecting Cassandra on small datasets: https://github.com/kirillt/cassandra-utils
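For reference, here is a minimal sketch of that same row-by-row idea with the DataStax Python driver (contact point, keyspace, and table names are placeholders); like the script above, it is slow and only suitable for small tables:
# Minimal sketch: page through a table and dump it to CSV.
# Contact point, keyspace and table names are placeholders.
import csv
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1"])
session = cluster.connect("my_keyspace")

# Rows come back as named tuples and are paged automatically by the driver.
rows = session.execute("SELECT * FROM my_table")

with open("my_table.csv", "w", newline="") as f:
    writer = csv.writer(f)
    header_written = False
    for row in rows:
        if not header_written:
            writer.writerow(row._fields)  # column names from the named tuple
            header_written = True
        writer.writerow(row)

cluster.shutdown()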

Spark tagging file names for purpose of possible later deletion/rollback?

I am using Spark 2.4 in AWS EMR.
I am using Pyspark and SparkSQL for my ELT/ETL and using DataFrames with Parquet input and output on AWS S3.
As of Spark 2.4, as far as I know, there is no way to tag or customize the file names of the output (Parquet) files. Please correct me if I am wrong.
When I store parquet output files on S3 I end up with file names which look like this:
part-43130-4fb6c57e-d43b-42bd-afe5-3970b3ae941c.c000.snappy.parquet
The middle part of the file name, 4fb6c57e-d43b-42bd-afe5-3970b3ae941c, looks like it is an embedded GUID/UUID.
I would like to know if I can obtain this GUID/UUID value from PySpark or SparkSQL at run-time, so I can log/save/display it in a text file.
I need to log this GUID/UUID value because I may need to later remove the files that have this value in their names, for manual rollback purposes (for example, I may discover a day or a week later that this data is somehow corrupt and needs to be deleted, so all files tagged with that GUID/UUID can be identified and removed).
I know that I can partition the table manually on a GUID column, but then I end up with too many partitions, which hurts performance. What I need is to somehow tag the files for each data load job, so I can identify and delete them easily from S3, hence a GUID/UUID value seems like one possible solution.
Open for any other suggestions.
Thank you
Is this with the new "S3A-specific committer"? If so, it means that they're using Netflix's code/trick of putting a GUID in each file written so as to avoid eventual-consistency problems. That doesn't help much here, though.
Consider offering a patch to Spark which lets you add a specific prefix to a file name.
Or, for Apache Hadoop & Spark (i.e. not EMR), an option for the S3A committers to put that prefix in when they generate temporary filenames.
Short term: well, you can always list the before-and-after state of the directory tree (tip: use FileSystem.listFiles(path, recursive) for speed) and either remember the new files or rename them (which will be slow; remembering the new filenames is better).
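Here is a rough sketch of that before-and-after listing, using boto3 instead of the Hadoop FileSystem API mentioned above (bucket, prefix, and DataFrame names are placeholders):
# Sketch: record exactly which files this job run added under a prefix.
# Bucket, prefix and DataFrame names are placeholders.
import boto3

def list_keys(bucket, prefix):
    s3 = boto3.client("s3")
    keys = set()
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            keys.add(obj["Key"])
    return keys

before = list_keys("my-bucket", "warehouse/my_table/")
df.write.mode("append").parquet("s3://my-bucket/warehouse/my_table/")
after = list_keys("my-bucket", "warehouse/my_table/")

new_files = sorted(after - before)  # log these somewhere durable for a later rollback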
Spark already writes files with a UUID in their names. Instead of creating too many partitions, you can set up custom file naming (e.g. add some id). Maybe this is a solution for you - https://stackoverflow.com/a/43377574/1251549
Not tried yet (but planning to) - https://github.com/awslabs/amazon-s3-tagging-spark-util
In theory, you can tag each file with a job id (or whatever) and then run a cleanup job later that finds and deletes the tagged objects.
Both solutions lead to performing multiple S3 list-objects API requests, checking the tags/filenames, and deleting the files one by one.
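For the delete step itself, here is a rough boto3 sketch (bucket and prefix are placeholders; the UUID is whatever value you logged at load time):
# Sketch: delete every object under a prefix whose key contains the logged UUID.
# Bucket and prefix are placeholders.
import boto3

s3 = boto3.client("s3")
run_uuid = "4fb6c57e-d43b-42bd-afe5-3970b3ae941c"  # the value logged when the data was loaded

to_delete = []
for page in s3.get_paginator("list_objects_v2").paginate(Bucket="my-bucket", Prefix="warehouse/my_table/"):
    for obj in page.get("Contents", []):
        if run_uuid in obj["Key"]:
            to_delete.append({"Key": obj["Key"]})

# delete_objects accepts at most 1000 keys per call
for i in range(0, len(to_delete), 1000):
    s3.delete_objects(Bucket="my-bucket", Delete={"Objects": to_delete[i:i + 1000]})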

Spark Temptable vs Broadcasting

I have a question/opinion that needs an expert's suggestion.
I have a table called config that contains some configuration information, as the table name suggests. I need these details to be accessible from all the executors during my job's life cycle. So my first option is broadcasting them as a List[Case Class]. But then I got the idea of registering the config as a temp table using registerTempTable() and using it across my job.
Can this temp table approach be used as an alternative to broadcast variables? (I have extensive hands-on experience with broadcasting.)
registerTempTable just gives you the ability to run plain SQL queries against your DataFrame; there is no performance benefit, caching, or materialization involved.
You should go with broadcasting (I would suggest using a Map for the configuration parameters).
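For illustration, here is a minimal PySpark sketch of that broadcast-a-map approach, assuming the config table has key and value columns (all names here are placeholders):
# Sketch: collect the small config table once and broadcast it as a plain dict.
# Table, column and DataFrame names are placeholders.
from pyspark.sql.functions import udf

config_rows = spark.table("config").collect()             # config is small, safe to collect
config_map = {r["key"]: r["value"] for r in config_rows}
bc_config = spark.sparkContext.broadcast(config_map)

# Each executor reads its local copy of the broadcast map; no shuffle or join involved.
lookup_config = udf(lambda name: bc_config.value.get(name))  # default return type is string

df = df.withColumn("config_value", lookup_config(df["config_key"]))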
Registering the config with registerTempTable() and then using it for lookups will mostly end up as a broadcast join internally anyway, given that the table/config file size is < 10 MB.

What is a good Bulk data loading tool for Cassandra

I'm looking for a tool to load CSV into Cassandra. I was hoping to use RazorSQL for this but I've been told that it will be several months out.
What is a good tool?
Thanks
1) If you have all the data to be loaded in place, you can try the sstableloader utility (only for Cassandra 0.8.x onwards) to bulk load the data. For more details see: cassandra bulk loader
2) Cassandra has introduced BulkOutputFormat for bulk loading data into Cassandra with a Hadoop job in the latest version, that is cassandra-1.1.x onwards.
For more details see: Bulkloading to Cassandra with Hadoop
I'm dubious that tool support would help a great deal with this, since a Cassandra schema needs to reflect the queries that you want to run, rather than just being a generic model of your domain.
The built-in bulk loading mechanism for cassandra is via BinaryMemtables: http://wiki.apache.org/cassandra/BinaryMemtable
However, whether you use this or the more usual Thrift interface, you still probably need to manually design a mapping from your CSV into Cassandra ColumnFamilies, taking into account the queries you need to run. A generic mapping from CSV-> Cassandra may not be appropriate since secondary indexes and denormalisation are commonly needed.
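To make that concrete, here is a rough sketch of a hand-rolled CSV loader using the (modern) DataStax Python driver rather than Thrift, with the CSV-to-table mapping written out explicitly (keyspace, table, column names, and types are placeholders):
# Rough sketch: load a CSV with an explicit, query-driven column mapping.
# Keyspace, table, column names and types are placeholders.
import csv
from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

cluster = Cluster(["10.0.0.1"])
session = cluster.connect("my_keyspace")

# The target table is designed around the query you want to run.
insert = session.prepare("INSERT INTO users_by_email (email, name, age) VALUES (?, ?, ?)")

with open("users.csv", newline="") as f:
    # The mapping (and any type conversion) from CSV columns is explicit.
    params = [(r["email"], r["name"], int(r["age"])) for r in csv.DictReader(f)]

# Run the prepared insert concurrently instead of one request at a time.
execute_concurrent_with_args(session, insert, params, concurrency=50)
cluster.shutdown()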
For Cassandra 1.1.3 and higher, there is the CQL COPY command, which is available for importing (or exporting) data to (or from) a table. According to the documentation, if you are importing roughly less than 2 million rows, then this is a good option. It is much easier to use than the sstableloader and less error-prone. The sstableloader requires you to create strictly formatted .db files, whereas the CQL COPY command accepts a delimited text file. Documentation here:
http://www.datastax.com/docs/1.1/references/cql/COPY
For larger data sets, you should use the sstableloader: http://www.datastax.com/docs/1.1/references/bulkloader. A working example is described here: http://www.datastax.com/dev/blog/bulk-loading.
