Recursive search grep - linux

I'm trying to search through HDFS for parquet files and list them out. I'm using this, which works great. It looks through all of the subdirectories in /sources/works_dbo and gives me all the parquet files:
hdfs dfs -ls -R /sources/works_dbo | grep ".*\.parquet$"
However, I just want to return the first file it encounters per subdirectory, so that each subdirectory only appears on a single line in my output. Say I had this:
sources/works_dbo/test1/file1.parquet
sources/works_dbo/test1/file2.parquet
sources/works_dbo/test2/file3.parquet
When I run my command I expect the output to look like this:
sources/works_dbo/test1/file1.parquet
sources/works_dbo/test2/file3.parquet

... | awk '!seen[gensub(/[^/]+$/,"",1)]++'
sources/works_dbo/test1/file1.parquet
sources/works_dbo/test2/file3.parquet
The above uses GNU awk for gensub(); with other awks you'd use a variable and sub():
awk '{path=$0; sub(/[^/]+$/,"",path)} !seen[path]++'
It will work for any mixture of any length of paths.
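As a quick check, here is a sketch of the portable variant run over the sample paths from the question (printf just simulates the file list):
printf '%s\n' sources/works_dbo/test1/file1.parquet \
  sources/works_dbo/test1/file2.parquet \
  sources/works_dbo/test2/file3.parquet |
  awk '{path=$0; sub(/[^/]+$/,"",path)} !seen[path]++'
sources/works_dbo/test1/file1.parquet
sources/works_dbo/test2/file3.parquet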

You can use sort -u (unique) with / as the delimiter, using the first three fields as the key. The -s option ("stable") makes sure that the file retained is the first one encountered for each subdirectory.
For this input
sources/works_dbo/test1/file1.parquet
sources/works_dbo/test1/file2.parquet
sources/works_dbo/test2/file3.parquet
the result is
$ sort -s -t '/' -k 1,3 -u infile
sources/works_dbo/test1/file1.parquet
sources/works_dbo/test2/file3.parquet

If the subdirectory paths are of variable depth, this awk solution may come in handy:
hdfs dfs -ls -R /sources/works_dbo | awk '
BEGIN{FS="/"; OFS="/";}
{
  file=$NF;                      # file name is always the last field
  $NF=""; folder=$0;             # chomp off the last field to cache the folder (it keeps a trailing "/")
  if (!(folder in seen_dirs))    # cache the first file per folder
    seen_dirs[folder]=file;
}
END{
  for (f in seen_dirs)           # after we have processed all rows, print our cache
    print f seen_dirs[f];
}'

Using Perl:
hdfs dfs -ls -R /sources/works_dbo | grep '.*\.parquet$' | \
perl -MFile::Basename -nle 'print unless $h{ dirname($_) }++'
In the perl command above:
-M loads File::Basename module;
-n causes Perl to apply the expression passed via -e for each input line;
-l handles line endings: it chomps the line terminator on input and adds it back on output;
$_ is the default variable keeping the currently read line;
dirname($_) returns the directory part for the path specified by $_;
$h is a hash where keys are directory names, and values are integers 0, 1, 2 etc;
the line is printed to the standard output, unless the directory name is seen in the previous iterations, i.e. the hash value $h{ dirname($_) } is non-zero.
By the way, instead of piping the result of hdfs dfs -ls -R via grep, you can use the find command:
hdfs dfs -find /sources/works_dbo -name '*.parquet'
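Since find prints bare paths (no size or date columns), it also combines cleanly with the per-directory dedup shown above; a sketch, assuming the same layout:
hdfs dfs -find /sources/works_dbo -name '*.parquet' |
  awk '{path=$0; sub(/[^/]+$/,"",path)} !seen[path]++'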

Related

Automate and looping through batch script

I'm new to batch scripting. I want to iterate through a list and use its contents to replace a string in another file.
ls -l somefile | grep .txt | awk 'print $4}' | while read file
do
toreplace="/Team/$file"
sed 's/dataFile/"$toreplace"/$file/ file2 > /tmp/test.txt
done
When I run the code I get the error
sed: 1: "s/dataFile/"$torepla ...": bad flag in substitute command: '$'
Example of somefile, which has a list of file paths:
foo/name/xxx/2020-01-01.txt
foo/name/xxx/2020-01-02.txt
foo/name/xxx/2020-01-03.txt
However, my desired output is to use the list of file paths in somefile to replace a string in the content of another file, file2. Something like this:
This is the directory of locations where data from /Team/foo/name/xxx/2020-01-01.txt ............
I'm not sure if I understand your desired outcome, but hopefully this will help you to figure out your problem:
You have three files in a directory:
TEAM/foo/name/xxx/2020-01-02.txt
TEAM/foo/name/xxx/2020-01-03.txt
TEAM/foo/name/xxx/2020-01-01.txt
Suppose you also have another file called to_be_changed.txt which contains the text This is the directory of locations where data from TO_BE_REPLACED ............ and you want to grab the filenames of your three files and insert them into to_be_changed.txt. You can do it with:
while read file
do
filename="$file"
sed "s/TO_BE_REPLACED/${filename##*/}/g" to_be_changed.txt >> changed.txt
done < <(find ./TEAM/ -name "*.txt")
And you will then have made a file called changed.txt which contains:
This is the directory of locations where data from 2020-01-02.txt ............
This is the directory of locations where data from 2020-01-03.txt ............
This is the directory of locations where data from 2020-01-01.txt ............
Is this what you're trying to achieve? If you need further clarification I'm happy to edit this answer to provide more details/explanation.
ls -l somefile | grep .txt | awk 'print $4}' | while read file
No. No, no, nono.
ls -l somefile is only going to show somefile unless it's a directory.
(Don't name a directory "somefile".)
If you mean somefile.txt, please clarify in your post.
grep .txt is going to look through the lines presented for the three characters txt preceded by any character (the dot is a regex wildcard). Since you asked for a long listing of somefile it shouldn't find any, so nothing should be passed along.
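If you really do want to match a literal .txt, escape the dot or use a fixed-string match, for example:
grep '\.txt$'   # regex: a literal dot followed by txt at the end of the line
grep -F '.txt'  # -F treats the pattern as a fixed string, not a regex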
awk 'print $4}' contains a typo (the opening brace is missing); awk will report a syntax error and refuse to run.
Keep it simple. What I suspect you meant was
for file in *.txt
Then in
toreplace="/Team/$file"
sed 's/dataFile/"$toreplace"/$file/ file2 > /tmp/test.txt
it's unclear what you expect $file to be - awk's $4 from an ls -l seems unlikely.
Assuming it's the filenames from the for above, then try
sed "s,dataFile,/Team/$file," file2 > /tmp/test.txt
Does that help? Correct me as needed. Sorry if I seem harsh.
Welcome to SO. ;)

Unix: Sort 'ls' by return value of program

How can I use a program as a key to sort in a Unix shell? In other words, how do I sort the output of 'ls' (or any other program) by the return value of a program applied to each line?
I'll give two example solutions:
A one-line command that is simpler and therefore something I'd try to use first.
A bash script that allows sorting a list by output from an arbitrary bash function that reads each line of the list as input.
Example 1 (without executing command on each line)
If the question is how to, in general, sort outputs of programs like ls, below is an example specific to ls that sorts by inode. However, every program may have its own idiosyncrasies when generating its output so this example may have to be adapted:
ls -ail /home/user/ | tail -n+2 | tr -s ' ' | sort -t' ' -k1,1 -g
Here are the different parts of this command broken down:
ls -ail /home/user/
Lists all (-a) files in directory /home/user/ in list (-l) format with inode (-i).
tail -n+2
Cuts off first line from ls output.
tr -s ' '
Combines (-s) multiple spaces (' ') for sort.
sort -t ' ' -k1,1 -g
Sorts list by first (1) field of integers (-g) separated by one space (' ').
Example 2 (executing command with each line as input)
Here is a more adaptable example: a bash script I worked up to show how the list of files generated by ls -a1 can be fed into the bash function getinode, which uses stat to output the inode for each file. A while loop repeats this process for each file, appending the data in comma-delimited format to a variable named OUTPUT, which at the end is sorted by sort on the first field.
The important part is that the function getinode can be anything, so long as it outputs a string. I set up getinode to receive a file path as input (first argument $1) and to then output the inode to stdout via echo $INODE. The script calls getinode via $(getinode "$FILEPATH").
#!/bin/bash
# Usage: lsinodesort.sh [file]
# Refs/attrib:
# [1]: How to sort a csv file by sorting on a single field. https://stackoverflow.com/a/44744800
# [2]: How to read a while loop variable. https://stackoverflow.com/a/16854326
WORKDIR="$1" # read directory from first argument
getinode() {
    # Usage: getinode [path]
    INODE="$(stat "$1" --format=%i)"
    echo "$INODE"
}
if [ -d "$WORKDIR" ]; then
    LINES="$(ls -a1 "$WORKDIR")" # save `ls` output to variable LINES
else
    exit 1; # not a valid directory
fi
while read line; do
    path="$WORKDIR"/"$line"               # Determine path.
    if [ -f "$path" ]; then               # Check if path is a file.
        FILEPATH="$path"
        FILENAME="$(basename "$path")"    # Determine filename from path.
        FILEINODE=$(getinode "$FILEPATH") # Get inode.
        OUTPUT="$FILEINODE"",""$FILENAME""\n""$OUTPUT" ; # Prepend inode and file name to OUTPUT
    fi
done <<< "$LINES" # See [2].
OUTPUT=$(printf "${OUTPUT}" | sort -t, -k1,1) # sort OUTPUT. See [1]
OUTPUT="inode","filename""\n""$OUTPUT"
printf "${OUTPUT}\n" # print final OUTPUT.
When I run it on my own home folder I get output like this:
inode,filename
3932162,.bashrc
3932165,.bash_logout
3932382,.zshrc
3932454,.gitconfig
3933234,.bash_aliases
3933512,.profile
3933612,.viminfo
I'm not sure I understand your question, so I'll try to rephrase it first.
If I'm not mistaken, you want to sort the output of a program (it may be ls or any other command in a Unix shell).
I'll suggest using the pipeline feature available on Unix shell.
For instance, you can sort the output of the ls command using :
ls /home | sort
This feature is available but not limited to the ls command.
By the way, there are optional flags you can use for sorting ls command results, if that's your specific use case:
ls -S # for sorting by file size
ls -t # for sorting by modification time
You can also append the --reverse or -r flag for displaying the result in reverse order.
As for the sort command, there are also flags allowing you to customize the result as per your needs:
sort -n # for sorting numerically instead of alphabetically
sort -k5 # for sorting based on the 5th column
sort -t "," # for using the comma as a field separator
You can combine these flags. For example, to sort the output of the ls -l command numerically on fields 2 and 5 and alphabetically on field 9 (ls -l output is whitespace-separated, so no custom field separator is needed):
ls -l /home/$USER | sort -nk2,5 -k9
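For the more general case in the question, sorting lines by the output of an arbitrary program, a decorate-sort-undecorate sketch along these lines can work (myprog is a hypothetical stand-in for your program):
ls /home/$USER | while IFS= read -r f; do
  printf '%s\t%s\n' "$(myprog "$f")" "$f"   # prefix each line with the program's output (assumed numeric here)
done | sort -n | cut -f2-                   # sort on that prefix, then strip it; drop -n if myprog prints text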

Looping through a file with path and file names and within these file search for a pattern

I have a file called lookupfile.txt with the following info:
path, including filename
Within bash I would like to search through the files listed in lookupfile.txt for a pattern: myerrorisbeinglookedat. When found, output the matching lines into another record file. All the found results can land in the same file.
Please help.
You can write a single grep statement to achieve this:
grep myerrorisbeinglookedat $(< lookupfile.txt) > outfile
Assuming:
the number of entries in lookupfile.txt is small (tens or hundreds)
there are no white spaces or wildcard characters in the file names
Otherwise:
while IFS= read -r file; do
# print the file names separated by a NULL character '\0'
# to be fed into xargs
printf "$file\0"
done < lookupfile.txt | xargs -0 grep myerrorisbeinglookedat > outfile
xargs takes output of the loop, tokenizes them correctly and invokes grep command. xargs batches up the files based on operating system limits in case there are a large number of files.
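An equivalent sketch without xargs, calling grep once per file (simpler, though slower for many files); -H keeps the file name prefix in the output even though grep sees only one file at a time:
while IFS= read -r file; do
  grep -H myerrorisbeinglookedat "$file"
done < lookupfile.txt > outfile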

Changing the file names and copying into different directory

I have some files, about 1000 in number. I want to rename those files by cutting out only a few characters from the file name and copying them to some other directory.
Ex: Original file name.
vfcon062562~19.xml
vfcon058794~29.xml
vfcon072009~3.xml
vfcon071992~10.xml
vfcon071986~2.xml
vfcon071339~4.xml
vfcon069979~43.xml
The required output is obtained by cutting the ~ and the following characters.
O/P Ex:
vfcon058794.xml
vfcon062562.xml
vfcon069979.xml
vfcon071339.xml
vfcon071986.xml
vfcon071992.xml
vfcon072009.xml
But I want to place them in a different directory.
If you are using bash or similar you can use the following simple loop:
for input in vfcon*xml
do
mv $input targetDir/$(echo $input | awk -F~ '{print $1".xml"}')
done
Or in a single line:
for input in vfcon*xml; do mv $input targetDir/$(echo $input | awk -F~ '{print $1".xml"}'); done
This uses awk to separate everything before ~ using it as a field separator and printing the first column and appending ".xml" to create the output file name. All this is prepended with the targetDir which can be a full path.
If you are using csh / tcsh then the syntax of the loop will be slightly different but the commands will be the same.
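A sketch of the same idea using only shell parameter expansion (no awk), and cp since the question mentions copying; it assumes targetDir already exists:
for input in vfcon*xml
do
  cp "$input" "targetDir/${input%%~*}.xml"   # strip everything from the first ~ onward, then re-add .xml
done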
I like to make sure that my data set is correct prior to changing anything so I would put that into a variable first and then check over it.
files=$(ls vfcon*xml)
echo $files | less
Then, like @Stefan said, use a loop:
for i in $files
do
mv "$i" "$( echo "$file" | sed 's/~[0-9].//g')"
done
If you need help with bash you can use http://www.shellcheck.net/

Clearing archive files with linux bash script

Here is my problem,
I have a folder where is stored multiple files with a specific format:
Name_of_file.TypeMM-DD-YYYY-HH:MM
where MM-DD-YYYY-HH:MM is the time of its creation. There could be multiple files with the same name but not the same time of course.
What I want is a script that can keep the 3 newest versions of each file.
So, I found one example there:
Deleting oldest files with shell
But I don't want to delete a fixed number of files; I want to keep a certain number of the newer ones. Is there a way to get that find command to parse the Name_of_file and keep the 3 newest?
Here is the code I've tried yet, but it's not exactly what I need.
find /the/folder -type f -name 'Name_of_file.Type*' -mtime +3 -delete
Thanks for help!
So I decided to add my final solution in case anyone would like it. It's a combination of the 2 solutions given.
ls -r | grep -P "(.+)\d{4}-\d{2}-\d{2}-\d{2}:\d{2}" | awk 'NR > 3' | xargs rm
One line, super efficient. If anything changes in the pattern of the date or name, just change the grep -P pattern to match it. This way you are sure that only the files fitting this pattern will get deleted.
Can you be extra, extra sure that the timestamp on the file is the exact same timestamp on the file name? If they're off a bit, do you care?
The ls command can sort files by timestamp order. You could do something like this:
$ ls -t | awk 'NR > 3' | xargs rm
The ls -t lists the files by modification time, with the newest first.
The awk 'NR > 3' prints out the list of files except for the first three lines, which are the three newest.
The xargs rm will remove the files that are older than the first three.
Now, this isn't the exact solution. There are possible problems with xargs because file names might contain weird characters or whitespace. If you can guarantee that's not the case, this should be okay.
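If weird characters or whitespace are a concern, here is a NUL-delimited sketch of the same "keep the 3 newest overall" idea, assuming GNU find, sort, tail, cut and xargs:
find . -maxdepth 1 -type f -printf '%T@\t%p\0' |   # modification time, tab, path, NUL-terminated
  sort -z -rn |                                    # newest first
  tail -z -n +4 |                                  # skip the 3 newest
  cut -z -f2- |                                    # keep only the path
  xargs -0 -r rm --                                # delete the rest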
Also, you probably want to group the files by name, and keep the last three. Hmm...
ls | sed 's/MM-DD-YYYY-HH:MM$//' | sort -u | while read file
do
ls -t $file* | awk 'NR > 3' | xargs rm
done
The ls will list all of the files in the directory. The sed 's/MM-DD-YYYY-HH:MM$//' (where MM-DD-YYYY-HH:MM stands for a pattern matching your actual timestamp suffix) will remove the date-time stamp from the file names. The sort -u will make sure you only have the unique file names. Thus
file1.txt-01-12-1950
file2.txt-02-12-1978
file2.txt-03-12-1991
Will be reduced to just:
file1.txt
file2.txt
These are fed through the loop, and the ls -t $file* lists all of the files that start with that file name, newest first; this is piped to awk, which filters out the three newest, and then to xargs rm, which deletes the remaining older files.
Assuming we're using the date in the filename to date the archive file, and that it is possible to change the date format to YYYY-MM-DD-HH:MM (as established in comments above), here's a quick and dirty shell script to keep the newest 3 versions of each file within the present working directory:
#!/bin/bash
KEEP=3 # number of versions to keep
while read FNAME; do
NODATE=${FNAME:0:-16} # get filename without the date (remove last 16 chars)
if [ "$NODATE" != "$LASTSEEN" ]; then # new file found
FOUND=1; LASTSEEN="$NODATE"
else # same file, different date
let FOUND="FOUND + 1"
if [ $FOUND -gt $KEEP ]; then
echo "- Deleting older file: $FNAME"
rm "$FNAME"
fi
fi
done < <(\ls -r | grep -P "(.+)\d{4}-\d{2}-\d{2}-\d{2}:\d{2}")
Example run:
[me#home]$ ls
another_file.txt2011-02-11-08:05
another_file.txt2012-12-09-23:13
delete_old.sh
not_an_archive.jpg
some_file.exe2011-12-12-12:11
some_file.exe2012-01-11-23:11
some_file.exe2012-12-10-00:11
some_file.exe2013-03-01-23:11
some_file.exe2013-03-01-23:12
[me#home]$ ./delete_old.sh
- Deleting older file: some_file.exe2012-01-11-23:11
- Deleting older file: some_file.exe2011-12-12-12:11
[me#home]$ ls
another_file.txt2011-02-11-08:05
another_file.txt2012-12-09-23:13
delete_old.sh
not_an_archive.jpg
some_file.exe2012-12-10-00:11
some_file.exe2013-03-01-23:11
some_file.exe2013-03-01-23:12
Essentially, by changing the dates in the file names to the form YYYY-MM-DD-HH:MM, a normal string sort (such as that done by ls) will automatically group similar files together, sorted by date-time.
The ls -r on the last line simply lists all files within the current working directory and prints the results in reverse order so that newer archive files appear first.
We pass the output through grep to extract only files that are in the correct format.
The output of that command combination is then looped through (see the while loop) and we can simply start deleting after 3 occurrences of the same filename (minus the date portion).
This pipeline will get you the 3 newest files (by modification time) in the current dir
stat -c $'%Y\t%n' file* | sort -n | tail -3 | cut -f 2-
To get all but the 3 newest:
stat -c $'%Y\t%n' file* | sort -rn | tail -n +4 | cut -f 2-
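To actually remove those older files, the second pipeline can be fed into xargs; a sketch assuming GNU xargs and file names without embedded newlines:
stat -c $'%Y\t%n' file* | sort -rn | tail -n +4 | cut -f 2- | xargs -d '\n' -r rm --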
