can we use regex expression in prefix in terraform replication configuration block - terraform

I am using s3 module https://github.com/terraform-aws-modules/terraform-aws-s3-bucket for creating the buckets. I need to add a cross account replication configuration for a bucket and need to exclude a folder test in the bucket from getting replicated from source bucket to the destination bucket. The bucket has folders of following format and it continues in the similar way
0067hy89-9834-c91c-6it78-d45678as5673
0081as45-1988-e32i-4zx01-l3688we31356
test
0019li23-2012-b41l-7vd38-k267180qw340
The condition is that all the above folders should be replicated except the test folder. I could not find any exclude to be used in the prefix to do this. Also, I read that S3 lifecycles rules do not support regular expressions. Can we use regex expression in s3 replication configuration block to use a filter where the prefix should include all the alpha numeric folders mentioned above except the test folder . Please suggest if it is possible to use regex in the replication configuration prefix

Related

Azure Databricks dbfs with python

In azure databricks i have different results for the directory list of dbfs by simply adding two dots.
Can anybody explain to me why this happens?
With dbutils, you can only use "dbfs:/" paths.
If you do not specify "dbfs:/" at the start of your path, it will simply auto-add it.
dbutils.fs.ls('pathA')
--> dbfs:/pathA
is exactly the same as
dbutils.fs.ls('dbfs:/pathA')
but if you do not use the ':', then it will add it silently.
dbutils.fs.ls('dbfs/pathB')
--> dbfs:/dbfs/pathB
It means your dbfs/ is considered as a folder name dbfs at the root of your dbfs:/
To avoid confusion, always specify dbfs:/ to your path.

Parametrization using Azure Data Factory

I have a Pipeline job in Azure Data Factory which I want to use to run the pipeline job but pass all files for a specific month through for example.
I have a folder called 2020/01 inside this folder is numerous files with different names.
The question is: Can one pass a parameter through to only extract and load the files for 2020/01/01 and 2020/01/02 if that makes sense?
Excellent, Thanks Jay it worked and i can now run my pipeline jobs passing through the month or even day level.
Really appreciate your response, have a fantastic day.
Regards
Rayno
The question is: Can one pass a parameter through to only extract and
load the files for 2020/01/01 and 2020/01/02 if that makes sense?
You did't mention which connector you are using in pipeline job,but you mentioned folder in your question.As i know,the majority folder path could be parametrization in ADF copy activity configuration.
You could create a param :
Then apply it in the wildcard folder path:
Even if your files' names have same prefix,you could apply 01*.json on the wildcard file name property.

assertion failed: conflicting directory structures detected. Suspicious paths

I am trying to read data from aws s3 where I am having error.
s3 bucket and paths for example as below:
s3://USA/Texas/Austin/valid
s3://USA/Texas/Austin/invalid
s3://USA/Texas/Houston/valid
s3://USA/Texas/Houston/invalid
s3://USA/Texas/Dallas/valid
s3://USA/Texas/Dallas/invalid
s3://USA/Texas/San_Antonio/valid
s3://USA/Texas/San_Antonio/invalid
when I try to read as
spark.read.parquet("s3://USA/Texas/Austin/valid")
or
spark.read.parquet("s3://USA/Texas/Austin/invalid")
or
spark.read.parquet("s3://USA/Texas/Austin")
it works just fine.
but when I try to read as
spark.read.parquet("s3://USA/Texas/*")
or
spark.read.parquet("s3://USA/Texas")
it throws an exception.
java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:
If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them.
as per suggestion I can read them individually but I have more then 500 files, to read them individually and union them will be hectic.
is there any other way to achieve this?
I am using HDFS with Parquet but I ran into the same issue. For me, setting the basePath to a path level above anything you will be accessing in that query works.
Also, I believe the '*' is unnecessary, though I'm not sure of the behavior of S3 on this one.
eg.
spark.read.option("basePath", "s3://USA/Texas/").parquet("s3://USA/Texas/")
Perhaps this is off-base for your S3 scenario but will hopefully help someone else with HDFS getting the same error.
If you can use Hive, then set two configurations
hive.input.dir.recursive=true
hive.mapred.supports.subdirectories=true
and create external table on the root path. Then, the table should read all the subdirectories data in the table but the schema should be the same or it will get an error.

Change spark _temporary directory path

Is it possible to change the _temporary directory where spark save its temporary files before writing?
In particular, since I am writing single partitions of a table I woud like the temporary folder to be within the partition folder.
Is it possibile?
There is no way to use the default FileOutputCommitter because of its implementation, the FileOutputCommiter creates a ${mapred.output.dir}/_temporary subdirectory where the files are written and later on, after being committed, moved to ${mapred.output.dir}.
In the end, an entire temporary folder deleted. When two or more Spark jobs have the same output directory, mutual deletion of files will be inevitable.
Eventually, I've downloaded org.apache.hadoop.mapred.FileOutputCommitter and org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter (you can name it YourFileOutputCommitter) made some changes that allows _temporaly rename
in your driver, you'll have to add following code:
val conf: JobConf = new JobConf(sc.hadoopConfiguration)
conf.setOutputCommitter(classOf[YourFileOutputCommitter])
// update temporary path for committer
YourFileOutputCommitter.tempPath = "_tempJob1"
note: it's better to use MultipleTextOutputFormat to rename files because two jobs that write to the same location can override each other.
Update
I've created short post in our tech blog, it has more details
https://www.outbrain.com/techblog/2020/03/how-you-can-set-many-spark-jobs-write-to-the-same-path/

Junk Spark output file on S3 with dollar signs

I have a simple spark job that reads a file from s3, takes five and writes back in s3.
What I see is that there is always additional file in s3, next to my output "directory", which is called output_$folder$.
What is it? How I can prevent spark from creating it?
Here is some code to show what I am doing...
x = spark.sparkContext.textFile("s3n://.../0000_part_00")
five = x.take(5)
five = spark.sparkContext.parallelize(five)
five.repartition(1).saveAsTextFile("s3n://prod.casumo.stu/dimensions/output/")
After the job I have s3 "directory" called output which contains results and another s3 object called output_$folder$ which I don't know what it is.
Changing S3 paths in the application from s3:// to s3a:// seems to have done the trick for me. The $folder$ files are no longer getting created since I started using s3a://.
Ok, it seems I found out what it is.
It is some kind of marker file, probably used for determining if the S3 directory object exists or not.
How I reached this conclusion?
First, I found this link that shows the source of
org.apache.hadoop.fs.s3native.NativeS3FileSystem#mkdir
method: http://apache-spark-user-list.1001560.n3.nabble.com/S3-Extra-folder-files-for-every-directory-node-td15078.html
Then I googled other source repositories to see if I am going to find different version of the method. I didn't.
At the end, I did an experiment and rerun the same spark job after I removed the s3 output directory object but left output_$folder$ file. Job failed saying that output directory already exists.
My conclusion, this is hadoop's way to know if there is a directory in s3 with given name and I will have to live with that.
All the above happens when I run the job from my local, dev machine - i.e. laptop. If I run the same job from a aws data pipeline, output_$folder$ does not get created.
s3n:// and s3a:// doesn't generate marker directory like <output>_$folder$
If you are using hadoop with AWS EMR., I found moving from s3 to s3n is straight forward since they both use same file system implementation, whereas s3a involves AWS credential related code change.
('fs.s3.impl', 'com.amazon.ws.emr.hadoop.fs.EmrFileSystem')
('fs.s3n.impl', 'com.amazon.ws.emr.hadoop.fs.EmrFileSystem')
('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')

Resources