Databricks - How to remove files, directories based on regular expression

I had a lot of files in Databricks and wanted to clean them up. Some of the files have a prefix such as "tweets1*".
How can I delete files by prefix, using something like a Linux glob pattern? I tried the following command, and it didn't work:
dbutils.fs.rm("/tweets1*",recurse=True)

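You can go with classic bash.
Inside your cell, type:
%sh
rm -rf path/to/your/folder/tweets1*
Note that %sh runs on the driver node's local filesystem, where DBFS is typically mounted under /dbfs, so a DBFS path such as /tweets1* would usually be written as /dbfs/tweets1*.
When I have to perform some complex operation that I already know how to do in bash, I use it directly inside the cell.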
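
If you would rather stay in Python, one option is to list the directory and match names yourself, since dbutils.fs.rm appears to treat "/tweets1*" as a literal path rather than a pattern. The following is only a sketch: the helper name remove_matching and its dry_run parameter are made up for illustration, and it assumes that entries returned by dbutils.fs.ls expose .name and .path and that directory names carry a trailing slash.

```python
import fnmatch


def remove_matching(fs, directory, pattern, dry_run=True):
    """Delete entries directly under `directory` whose name matches `pattern`.

    `fs` is assumed to behave like `dbutils.fs`: `fs.ls(path)` returns
    entries with `.name` and `.path`, and `fs.rm(path, True)` removes a
    path recursively. Returns the list of matched paths.
    """
    matched = []
    for entry in fs.ls(directory):
        # Directory names are assumed to end with "/"; drop it before matching.
        name = entry.name.rstrip("/")
        if fnmatch.fnmatchcase(name, pattern):
            matched.append(entry.path)
            if not dry_run:
                fs.rm(entry.path, True)  # second argument: recurse
    return matched


# In a notebook cell (hypothetical usage, not executed here):
# remove_matching(dbutils.fs, "/", "tweets1*")                 # preview matches
# remove_matching(dbutils.fs, "/", "tweets1*", dry_run=False)  # actually delete
```

Defaulting to a dry run means you can check the matched paths before anything is deleted. The pattern is a shell-style wildcard matched by fnmatch, not a regular expression.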

Related

Databricks notebook: use magic commands for several lines

I may be missing the obvious, but:
I am using the Databricks community edition notebook.
I am trying to use several %fs lines within the same cell
Is this possible... ?
I tried this, as cell content:
%fs rm /FileStore/tables/file.txt
%fs ls /FileStore/tables/
and also this:
%%fs
rm /FileStore/tables/file.txt
ls /FileStore/tables/
...and just in case...
%fs
rm /FileStore/tables/file.txt
ls /FileStore/tables/
Having the rm and ls commands in different cells works, but is there a way to have them both in the same cell...?

You can't do that with %fs: it treats the rest of the cell as arguments to the first command. The same goes for the other magic commands.
If you want to execute multiple commands in one cell, you need to use the dbutils.fs... commands in Python or Scala:
dbutils.fs.rm("/FileStore/tables/file.txt")
dbutils.fs.ls("/FileStore/tables/")

How do I mark a file as config(missingok) in fpm non-interactively?

I've been trying to use fpm to create an rpm, but have run into a problem. After I install the package, there are files I no longer need, which are deleted in a post-install script in order to save space. Unfortunately, when the package is uninstalled, rpm complains about the files not being there, as they are still registered as part of the package. When I looked into how to fix this in the rpm itself, I stumbled on the %config(missingok) directive, which seems ideal. However, there doesn't seem to be a way to set this via fpm.
My current options are changing the --edit flag from using vi to edit the spec file to using a script, by setting the fpm_editor variable, or touching the files in a pre-remove script to trick rpm into thinking these problematic files still exist. Neither of these options is very appealing.
So my question is this: Is there a way to use fpm to either (a) remove the files from the "sight" of rpm post-install, or (b) mark the files as config(missingok)?
Without using the two solutions above, of course.

The usual way of doing this is to rm -f these files at the end of the %install section, instead of doing it in the post-install scriptlet.
This way the useless files will not be packaged in the final rpm.
I have never packaged an rpm with fpm, but looking at the source code I see the command-line switches --exclude and --exclude-file, which should be the ones you're looking for:
option ["-x", "--exclude"], "EXCLUDE_PATTERN",
"Exclude paths matching pattern (shell wildcard globs valid here). " \
"If you have multiple file patterns to exclude, specify this flag " \
"multiple times.", :attribute_name => :excludes do |val|
excludes << val
next excludes
end # -x / --exclude
option "--exclude-file", "EXCLUDE_PATH",
"The path to a file containing a newline-sparated list of "\
"patterns to exclude from input."

Copy a range of folders in command line

I found this link with an example of how to copy a range of files, https://serverfault.com/questions/370403/copy-a-range-of-files-in-command-line-zsh-bash, using this:
$ cp P10802[75-83].JPG ~/Images/.
Is there any way I can also copy a range of folders or directories?
Maybe something like this: $ cp -r folder[001-999] ~/images./

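Use the -R flag to recursively copy the directories. According to "Can I use shell wildcards to select filenames ranging across double-digit numbers", you can use the brace-expansion syntax {start..end} to generate a number range (zero-padded ranges such as {001..999} need bash 4 or later). Brace expansion produces every name in the range whether or not it exists, so cp will report an error for any missing folder. Putting that together would give you: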
cp -R folder{001..999} ~/images./

Yes, using the same logic. Globbing and expansion (which is what bash uses to generate the individual names out of these patterns) work on files as well as on directory names.
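
If your shell lacks zero-padded brace expansion, or you want to skip numbers with no matching folder, the same set of names can be generated in a script instead. This is a rough Python sketch rather than a shell solution: the function name copy_folder_range and all of its parameters are hypothetical, and it assumes Python 3.8 or later for dirs_exist_ok.

```python
import shutil
from pathlib import Path


def copy_folder_range(src_root, dest_root, prefix="folder", start=1, end=999, width=3):
    """Copy each existing src_root/<prefix>NNN directory into dest_root."""
    src_root, dest_root = Path(src_root), Path(dest_root)
    dest_root.mkdir(parents=True, exist_ok=True)
    copied = []
    for i in range(start, end + 1):
        # Build zero-padded names such as folder001, mirroring folder{001..999}.
        name = f"{prefix}{i:0{width}d}"
        src = src_root / name
        if src.is_dir():  # silently skip numbers with no matching directory
            shutil.copytree(src, dest_root / name, dirs_exist_ok=True)
            copied.append(name)
    return copied
```

Unlike the brace-expansion command, this only touches folders that actually exist, so there are no errors for gaps in the numbering.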

using execv/execl to delete all files

I'm trying to delete all files in a folder from a C program using the following method:
execl("/bin/rm","/media/sda1/*",0,0,0,0,0,0,0,0,0);
But I get the following failure:
rm: can't remove '/media/sda1/*': No such file or directory, though there are files in this folder.
How can we delete all files, or copy all files from one folder to another, using the execv family? Does anyone have any idea?
Thanks,
Ran

The problem is caused by the glob pattern /media/sda1/* you are using. Note the asterisk, which a shell would expand to the list of all non-hidden files in that folder. If you pass it directly to rm, no expansion happens, and rm attempts to delete a file literally named *.
If you don't want to manually iterate over all files inside the folder, you'll need to start the command in a shell, which will expand the glob pattern for you.
You could use
execl("/bin/bash", "bash", "-c", "rm -rf /media/sda1/*", (char *)NULL);
... for that. Note that the first argument after the path becomes argv[0] (the program name), that -c and the command string are passed as separate arguments, and that the list must end with a null pointer. A nice alternative would be to use system(), which implicitly starts the command in a shell:
system("rm -rf /media/sda1/*");
More about:
glob
the function system()

How to directly overwrite with 'unexpand' (spaces-to-tabs conversion)?

I'm trying to use something along the lines of
unexpand -t 4 *.php
but am unsure how to write this command to do what I want.
Weirdly,
unexpand -t 4 file.php > file.php
gives me an empty file. (i.e. overwriting file.php with nothing)
I can specify multiple files okay, but don't know how to then overwrite each file.
I could use my IDE, but there are ~67000 instances to be replaced across 200 files, and this will take a while.
I expect that the answers to my question(s) will be standard unix fare, but I'm still learning...

You can very seldom use output redirection to replace the input: the shell truncates the output file before the command even starts, so by the time unexpand reads file.php it is already empty. In-place replacement only works with commands that support it internally (since they then do the basic steps themselves). From the shell level, it's far better to work in two steps, like so:
Do the operation on foo, creating foo.tmp
Move (rename) foo.tmp to foo, overwriting the original
This will be fast. It will require a bit more disk space, but if you do both steps before continuing to the next file, you will only need as much extra space as the largest single file, so this should not be a problem.
Sketch script:
for a in *.php
do
    # Quote the variable so filenames containing spaces survive
    unexpand -t 4 "$a" > "$a-notab"
    mv "$a-notab" "$a"
done
You could do better (error-checking, and so on), but that is the basic outline.
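
For comparison, the same two-step idea (write a temporary file, then rename it over the original) could be sketched in Python as below. The names rewrite_in_place and leading_spaces_to_tabs are hypothetical. The conversion function is a deliberately simplified stand-in that only touches leading spaces, so it is not a replacement for unexpand -t 4, which also converts runs of blanks elsewhere in a line.

```python
import os
import re
import tempfile


def rewrite_in_place(path, transform):
    """Write transform(text) to a temp file next to `path`, then rename it over `path`."""
    with open(path, "r", encoding="utf-8", newline="") as src:
        text = src.read()
    directory = os.path.dirname(os.path.abspath(path))
    # Step 1: write the result to a temp file in the same directory,
    # so the rename below stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8", newline="") as dst:
            dst.write(transform(text))
        # Step 2: rename the temp file over the original.
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)  # clean up the temp file if anything failed
        raise


def leading_spaces_to_tabs(text, width=4):
    """Simplified stand-in for unexpand: converts only leading runs of spaces."""
    def convert(match):
        spaces = len(match.group(0))
        return "\t" * (spaces // width) + " " * (spaces % width)
    return re.sub(r"^ +", convert, text, flags=re.MULTILINE)
```

Because the original is only replaced after the new content has been written completely, a failure partway through leaves the original file untouched.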

Here's the command I used:
for p in $(find . -iname "*.js")
do
unexpand -t 4 $(dirname $p)/"$(basename $p)" > $(dirname $p)/"$(basename $p)-tab"
mv $(dirname $p)/"$(basename $p)-tab" $(dirname $p)/"$(basename $p)"
done
This version changes all files within the directory hierarchy rooted at the current working directory.
In my case, I only wanted to make this change to .js files; you can omit the iname clause from find if you wish, or use different args to cast your net differently.
My version wraps filenames in quotes, but it doesn't use quotes around 'interesting' directory names that appear in the paths of matching files.
To get it all on one line, add a semicolon after lines 1, 3, and 4.
This is potentially dangerous, so make a backup or use git before running the command. If you're using git, you can verify that only whitespace was changed with git diff -w.
