Changing output file name in Azure HDInsight

Using the .NET SDK, I'm doing some log file parsing with Azure HDInsight. Seemingly simple things like changing the output file naming from "part-xxxxx" to something related to the input file name seem to be quite complicated, and documentation is scant.
Based on what I've seen about output file formats in Hadoop in general, it looks like this isn't a setting I can change through configuration (which could then be fed in with HadoopJobConfiguration.AdditionalGenericArguments in the .NET SDK), but something that requires actual Java code, which suggests that the only way to get this done is to recode my solution as a Java class.
Suggestions?

This is a fundamental Hadoop thing.
Hadoop jobs will always output files in part-nnnnn format; the only thing you can specify is the baseOutputDirectory path they go into, so you could certainly use the directory to relate the output to the input.
The reason for this is that each reducer has to have its own output file.
If you're doing any further processing on the output in Hadoop, with Hive for example, then this shouldn't be too much of a hardship, since the InputFormats used will pick up all the part-nnnnn files for you.
That said, you could provide a subclass of the MultipleOutputFormat class to control the pattern of the filenames, but that will need to be in Java, since you can't write OutputFormats with the streaming API.
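For what it's worth, the subclass itself is small. Here is a minimal sketch, written in Scala for brevity (an equivalent Java class works the same way; either way it runs on the JVM side, not through the streaming API), which derives the output file name from the key. The class name and the key-to-filename mapping are illustrative, and it assumes a Text key/value job:

import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Sketch: name each output file after the key instead of the plain part-nnnnn leaf name.
class KeyBasedOutputFormat extends MultipleTextOutputFormat[Text, Text] {
  override def generateFileNameForKeyValue(key: Text, value: Text, name: String): String =
    // "name" is the default part-nnnnn leaf; prefixing it with the key keeps the
    // per-reducer outputs distinct while making the file name meaningful
    s"${key.toString}-$name"
}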
Another option might be to use the Azure Storage client to merge and rename the output files once the job has completed.

Related

Databricks spark.readstream format differences

I'm confused about the difference between the following two snippets in Databricks:
spark.readStream.format('json')
vs
spark.readStream.format('cloudfiles').option('cloudFiles.format', 'json')
I know that using cloudfiles as the format means Databricks Auto Loader. In terms of performance and functionality, which one is better? Does anyone have experience with this?
Thanks
There are multiple differences between these two. When you use Auto Loader you get at least the following; there is more, so see the docs for all the details:
Better performance, scalability, and cost efficiency when discovering new files. You can use either file notification mode (where you get notified about new files via cloud-native integration) or optimized file listing mode, which uses native cloud APIs to list files and directories. Spark's file streaming relies on the Hadoop APIs, which are much slower, especially if you have a lot of nested directories and a lot of files.
Support for schema inference and evolution. With Auto Loader you can detect changes in the schema for JSON/CSV/Avro, and adjust it to process new fields.
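For reference, a minimal sketch of the two variants side by side (Scala syntax; the paths, schema, and schemaLocation value are placeholders, and the cloudFiles format only exists on Databricks):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().getOrCreate()   // already provided in a Databricks notebook

// Plain file-source streaming: directory listing via Hadoop APIs, schema supplied up front
val eventSchema = StructType(Seq(
  StructField("id", StringType),
  StructField("ts", TimestampType)
))
val plainStream = spark.readStream
  .format("json")
  .schema(eventSchema)
  .load("s3://my-bucket/events/")

// Auto Loader: incremental file discovery plus schema inference/evolution
val autoLoaderStream = spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/events")  // placeholder path
  .load("s3://my-bucket/events/")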

Apache Beam bundling issue

My problem is the following, I want to aggregate some data that is stored on S3. As initial input to my pipeline I use a text file that contains the path of all the S3 files that should be aggregated.
PCollection<String> readInputPipeline = p.apply("ReadLines", TextIO.read().from(options.getInputFile()));
readInputPipeline = readInputPipeline.apply(ParDo.of(new ReadFromS3Mapper()));
The input file has 346k lines. When I deploy this code to a Spark cluster, the reading from S3 looks like it happens in only 2 Spark tasks, even though many cores are available. Is there any way for me to increase the parallelism of this operation?
I am running this on EMR on Amazon with a master (m3.xlarge) and a core machine (R3.4xlarge) with the following options:
"spark-submit"
"--driver-java-options='-Dspark.yarn.app.container.log.dir=/mnt/var/log/hadoop'",
"--master", "yarn",
"--executor-cores","16",
"--executor-memory","6g"
PS: maybe the answer is that I shouldn't be doing this kind of expensive IO operation in this context?
Spark decides how to split up an input; here it has decided to go through the entire file in one go, because it is so small.
I've done something similar in a distcp application; it uses Spark's ParallelCollectionRDD class to explicitly tell Spark to split the listing up one-by-one.
That class should be enough for you to do something similar: you may have to read the initial text file locally into a list, then pass the list to the ParallelCollectionRDD constructor.
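In user code you would normally reach ParallelCollectionRDD through sc.parallelize rather than constructing it directly. A rough sketch of the idea (Scala; the listing path, slice count, and the per-file fetch are placeholders for your own logic):

import scala.io.Source
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Read the 346k-line listing on the driver...
val listing = Source.fromFile("/path/to/s3-listing.txt").getLines().toVector

// ...then hand it to Spark with an explicit slice count so the per-file reads fan out.
// sc.parallelize builds a ParallelCollectionRDD under the hood.
val pathsRdd = spark.sparkContext.parallelize(listing, numSlices = 1000)

// Placeholder for whatever your ReadFromS3Mapper does per path.
def readFromS3(path: String): String = ???

val contents = pathsRdd.map(readFromS3)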
A bit of a late reply, but I looked into what Beam does in the 2.16.0 release.
You're getting 2 tasks after the first TextIO.read() -- I suspect that your initial list of files of 346k lines is being split into two partitions. This behaviour is controlled by the desiredBundleSize inside TextIO, which is hard-coded to 64MB.
In Spark, your ReadFromS3Mapper step will be "fused" to the arriving records and you'll always stay at two partitions.
If you want to keep the same code, you can force a repartition between the two transformations:
PCollection<String> allContents = p.apply("ReadLines", TextIO.read().from(options.getInputFile()))
.apply("Repartition", Reshuffle.viaRandomKey())
.apply(ParDo.of(new ReadFromS3Mapper()));
As an alternative, there are quite a few interesting patterns available in the TextIO and FileIO utilities. There's an example that matches yours almost exactly (implicitly including the reshuffle); a rough sketch is below.
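Sketched here in Scala against the Java Beam SDK, reusing the p and options from your snippet; the transform names are mine and it assumes the S3 filesystem is on the classpath:

import org.apache.beam.sdk.io.{FileIO, TextIO}
import org.apache.beam.sdk.values.PCollection

// "File of filenames" pattern: read the listing, expand each line as a filepattern,
// then read the matched files. matchAll()/readMatches() redistribute the work
// (the implicit reshuffle mentioned above).
val fileContents: PCollection[String] = p
  .apply("ReadListing", TextIO.read().from(options.getInputFile()))
  .apply("MatchAll", FileIO.matchAll())
  .apply("ReadMatches", FileIO.readMatches())
  .apply("ReadFileContents", TextIO.readFiles())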

Spark: tagging file names for the purpose of possible later deletion/rollback?

I am using Spark 2.4 in AWS EMR.
I am using Pyspark and SparkSQL for my ELT/ETL and using DataFrames with Parquet input and output on AWS S3.
As of Spark 2.4, as far as I know, there is no way to tag or customize the file names of output (Parquet) files. Please correct me if I'm wrong.
When I store parquet output files on S3 I end up with file names which look like this:
part-43130-4fb6c57e-d43b-42bd-afe5-3970b3ae941c.c000.snappy.parquet
The middle part of the file name looks like it has an embedded GUID/UUID: 4fb6c57e-d43b-42bd-afe5-3970b3ae941c
I would like to know if I can obtain this GUID/UUID value from a PySpark or SparkSQL function at run time, so that I can log/save/display the value in a text file.
I need to log this GUID/UUID value because I may need to remove the files with this value in their names later, for manual rollback purposes (for example, I may discover a day or a week later that this data is somehow corrupt and needs to be deleted, so all files tagged with the GUID/UUID can be identified and removed).
I know that I can partition the table manually on a GUID column, but then I end up with too many partitions, which hurts performance. What I need is to somehow tag the files for each data load job, so I can identify and delete them easily from S3; hence the GUID/UUID value seems like one possible solution.
Open for any other suggestions.
Thank you
Is this with the new "s3a specific committer"? If so, it means they're using Netflix's code/trick of putting a GUID in each file written so as to avoid eventual consistency problems. That doesn't help much here, though.
Consider offering a patch to Spark which lets you add a specific prefix to the file names.
Or for Apache Hadoop & Spark (i.e. not EMR), an option for the S3A committers to put that prefix in when they generate temporary filenames.
Short term: well, you can always list the before-and-after state of the directory tree (tip: use FileSystem.listFiles(path, recursive) for speed) and either remember the new files or rename them (which will be slow; remembering the new filenames is better).
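A rough sketch of that before-and-after listing approach, in Scala for brevity (the bucket, output path, and DataFrame are placeholders; from PySpark the same FileSystem API is reachable through the JVM gateway):

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val outputDir = "s3://my-bucket/warehouse/my_table"   // placeholder output location
val fs = FileSystem.get(new URI(outputDir), spark.sparkContext.hadoopConfiguration)

// Recursive listing via listFiles(path, recursive = true); on S3 this pages through the
// object store rather than walking a directory tree, so it is the fast option.
def listAll(dir: String): Set[String] = {
  val p = new Path(dir)
  if (!fs.exists(p)) Set.empty[String]
  else {
    val it = fs.listFiles(p, true)
    val acc = scala.collection.mutable.Set.empty[String]
    while (it.hasNext) acc += it.next().getPath.toString
    acc.toSet
  }
}

val before = listAll(outputDir)
val df = spark.range(10).toDF("id")                   // stand-in for the real load
df.write.mode("append").parquet(outputDir)
val createdByThisLoad = listAll(outputDir).diff(before)

// Persist this manifest somewhere durable so a later rollback can delete exactly these files.
createdByThisLoad.foreach(println)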
Spark already writes files with a UUID in their names. Instead of creating too many partitions, you can set up custom file naming (e.g. add some id). Maybe this is a solution for you: https://stackoverflow.com/a/43377574/1251549
Not tried yet (but planning to): https://github.com/awslabs/amazon-s3-tagging-spark-util
In theory, you can tag each load with a job id (or whatever) and then run something that finds the objects with that tag and deletes them.
Both solutions lead to performing multiple S3 list-objects API requests, checking tags/filenames, and deleting the files one by one.

How to create an HBase sequencefile key in Spark for loading to Bigtable?

I want to be able to easily create test data files that I can save and re-load into a dev Bigtable instance at will, and pass to other members of my team so they can do the same. The suggested way of loading into Bigtable via Dataflow seems ridiculously heavy-weight (anyone loading a new type of data--not for production purposes, even just playing around with Bigtable for the first time--needs to know Apache Beam, Dataflow, Java, and Maven?? That's potentially going to limit Bigtable adoption for my team), and my data isn't already in HBase, so I can't just export a sequencefile.
However, per this document, it seems like the sequencefile key for HBase should be constructible in regular Java/Scala/Python code:
The HBase Key consists of: the row key, column family, column qualifier, timestamp and a type.
It just doesn't go into enough detail for me to actually do it. What delimiters exist between the different parts of the key? (This is my main question).
From there, Spark at least has a method to write a sequencefile so I think I should be able to create the files I want as long as I can construct the keys.
I'm aware that there's an alternative (described in this answer, whose example link is broken) that would involve writing a script to spin up a Dataproc cluster, push a TSV file there, and use HBase ImportTsv to push the data to Bigtable. This also seems overly heavy-weight to me but maybe I'm just not used to the cloud world yet.
The sequence file solution is meant for situations where large sets of data need to be imported into and/or exported from Cloud Bigtable. If your file is small enough, then create a script that creates a table, reads from a file, and uses a BufferedMutator (or batch writes in your favorite language) to write to Cloud Bigtable.
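A rough sketch of such a script using the bigtable-hbase client, in Scala for brevity (the project, instance, table, column family, and input-file layout are all made up for illustration, and the table and column family are assumed to already exist):

import com.google.cloud.bigtable.hbase.BigtableConfiguration
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes
import scala.io.Source

val connection = BigtableConfiguration.connect("my-project", "my-dev-instance")
val mutator = connection.getBufferedMutator(TableName.valueOf("test-table"))

// Made-up input layout: one tab-separated "rowkey<TAB>qualifier<TAB>value" record per line.
for (line <- Source.fromFile("test-data.tsv").getLines()) {
  val Array(rowKey, qualifier, value) = line.split("\t", 3)
  val put = new Put(Bytes.toBytes(rowKey))
  put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes(qualifier), Bytes.toBytes(value))
  mutator.mutate(put)   // buffered client-side, flushed in batches
}

mutator.close()         // flush any remaining mutations
connection.close()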

Is it possible to retrieve the list of files when a DataFrame is written, or to have Spark store it somewhere?

With a call like
df.write.csv("s3a://mybucket/mytable")
I obviously know where files/objects are written, but because of S3's eventual consistency guarantees, I can't be 100% sure that getting a listing from that location will return all (or even any) of the files that were just written. If I could get the list of files/objects Spark just wrote, then I could prepare a manifest file for a Redshift COPY command without worrying about eventual consistency. Is this possible, and if so, how?
The spark-redshift library can take care of this for you. If you want to do it yourself you can have a look at how they do it here: https://github.com/databricks/spark-redshift/blob/1092c7cd03bb751ba4e93b92cd7e04cffff10eb0/src/main/scala/com/databricks/spark/redshift/RedshiftWriter.scala#L299
EDIT: I avoid further consistency worries by using df.coalesce(fileCount) to output a known number of file parts (for Redshift, you want a multiple of the slices in your cluster). You can then check how many files are listed in the Spark code and also how many files are loaded in Redshift via stl_load_commits.
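A rough sketch of that approach in Scala (the bucket, path, file count, and DataFrame are placeholders):

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val target = "s3a://mybucket/mytable"
val fileCount = 8                                  // e.g. a multiple of your Redshift slice count

val df = spark.range(1000).toDF("id")              // stand-in for the real DataFrame
df.coalesce(fileCount).write.mode("overwrite").csv(target)

// List what landed and sanity-check the count before building the COPY manifest.
val fs = FileSystem.get(new URI(target), spark.sparkContext.hadoopConfiguration)
val parts = fs.listStatus(new Path(target))
  .filter(_.getPath.getName.startsWith("part-"))
  .map(_.getPath.toString)

require(parts.length == fileCount,
  s"expected $fileCount part files but the listing returned ${parts.length}; retry the listing")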
It's good to be aware of the consistency risks; you can hit them in listings, with delayed visibility of newly created objects and with deleted objects still being found.
AFAIK, you can't get a list of the files created, since tasks generate whatever they want into the task output dir, which is then marshalled (via listing and copy) into the final output dir.
In the absence of a consistency layer atop S3 (S3mper, S3Guard, etc.), you can read & spin for "a bit" to allow the shards to catch up. I have no good idea of what a good value of "a bit" is.
However, if you are calling fs.write.csv(), you may have been caught by listing inconsistencies within the committer used to propagate task output to the job dir; that's done in S3A via list + copy.
