Split multiple files - linux

I have a directory with hundreds of files, and I have to split each of them into files of 400 lines (or fewer).
I have tried combining ls and split, wc and split, and writing some scripts.
Honestly, I'm lost.
Please, can anybody help me?
EDIT:
Thanks to John Bollinger and his answer, this is the script we will use for our purpose:
#!/bin/bash
# $# -> all args passed to the script
# The arguments passed in order:
# $1 = num of lines (required)
# $2 = dir origin (optional)
# $3 = dir destination (optional)
if [ $# -gt 0 ]; then
    lin=$1
    if [ $# -gt 1 ]; then
        dirOrg=$2
        if [ $# -gt 2 ]; then
            dirDest=$3
            if [ ! -d "$dirDest" ]; then
                mkdir -p "$dirDest"
            fi
        else
            dirDest=$dirOrg
        fi
    else
        dirOrg=.
        dirDest=.
    fi
else
    echo "Missing parameters: NumLineas [DirectorioOrigen] [DirectorioDestino]"
    exit 1
fi
# The shell glob expands to all the files in the target directory; a different
# glob pattern could be used if you want to restrict splitting to a subset,
# or if you want to include dotfiles.
for file in "$dirOrg"/*; do
    # Details of the split command are up to you. This one splits each file
    # into pieces named by appending a sequence number to the original file's
    # name. The original file is left in place.
    fileDest=${file##*/}
    split --lines="$lin" --numeric-suffixes "$file" "$dirDest"/"$fileDest"
done
exit 0
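For instance, assuming the script is saved as splitFiles.sh (the name and paths here are only illustrative):
chmod +x splitFiles.sh
./splitFiles.sh 400                      # split files in . into 400-line pieces
./splitFiles.sh 400 /data/in /data/out   # read /data/in, write pieces to /data/out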

Since you seem to know about split, and to want to use it for the job, I guess your issue revolves around using one script to wrap the whole task. The details are unclear, but something along these lines is probably what you want:
#!/bin/bash
# If an argument is given then it is the name of the directory containing the
# files to split. Otherwise, the files in the working directory are split.
if [ $# -gt 0 ]; then
    dir=$1
else
    dir=.
fi
# The shell glob expands to all the files in the target directory; a different
# glob pattern could be used if you want to restrict splitting to a subset,
# or if you want to include dotfiles.
for file in "$dir"/*; do
    # Details of the split command are up to you. This one splits each file
    # into pieces named by appending a sequence number to the original file's
    # name. The original file is left in place.
    split --lines=400 --numeric-suffixes "$file" "$file"
done
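Invocation is analogous; for example, assuming the script is saved as splitdir.sh (a name I made up):
./splitdir.sh /path/to/files   # or run it with no argument to split the files in .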

Related

How does Google search PDF's so quickly even though it takes way more time for the PDF to load when I open the link?

It takes time for a PDF to load completely when I click on a link in Google. But Google searches millions of files and returns the exact result, even showing the part where the words I searched for can be found (briefly, below the result). All of this in a few seconds.
However, it takes far more time to open those links individually.
MY REASONING: Google has already gone through the internet (as soon as a link is published) and just gives me links from its index rather than doing the search in real time.
But that still sounds unconvincing, as it would make things slightly quicker, but not by this much.
Also, as an extension: can the solution to this be used as a hack to open web pages/PDFs quickly, avoiding all irrelevant parts (like ads and toolbars on some news pages)? If Google can search them in such a short time, there should be a way for us to get the relevant pages quickly, right?
Thanks in advance.
Example: an image of a PDF that took over 10 s to open, while Google returned the result in 0.58 s (according to Google itself).
In case you wanted to do this yourself, here is one solution that I came up with to the problem of quickly searching every law ever enacted by the U.S. Congress. These laws are in more than 32GB of PDF files, which you can download for free from here and there on the Internet.
For the more than 150 PDF files that I downloaded, I used the naming convention V<volume>[C<congress>[S<session>]]Y<year>[<description>].pdf. Here are some examples of how my PDF files were named, whitespace and all.
V1C1Y1789-1791.pdf
V6Y1789-1845 Private Laws and Resolutions.pdf
V7Y1789-1845 Indian Treaties.pdf
V50C75S1Y1937.pdf
V51C75S2Y1937.pdf
V52C75S3Y1938.pdf
V53C76S1Y1939.pdf
V54C76S2-3Y1939-1941.pdf
Then I made the following directories to hold the PDF files and their text representations (to be used for fast searching) that I was going to create. I placed all the downloaded PDFs in the first, Originals, directory.
~/Documents/Books/Laws/Originals/
~/Documents/Books/Laws/PDF/
~/Documents/Books/Laws/Text/
The first problem I encountered was that, for the volumes before number 65, the (selectable) text in the PDF files was poorly constructed: often out of order and jumbled. (I initially discovered this when using the pdfgrep tool.) This problem made the text almost impossible to search through. However, the images in the PDF files seemed quite reasonable.
Using brew, I installed the ocrmypdf tool ("brew install ocrmypdf") to improve the OCR text layer on those problematic PDF files. It worked very well.
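For a single file, a run looks roughly like this (the file names are taken from the list above; -f/--force-ocr redoes the existing text layer and --sidecar also writes the recognized text to a separate file):
ocrmypdf -f --sidecar V1C1Y1789-1791.txt V1C1Y1789-1791.pdf ../PDF/V1C1Y1789-1791.pdf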
To get around some apparent limitations of xargs (long command lines halted file substitutions) and zargs (file substitution halted after a pipe, a redirect, or the first substitution in a string), I created the following Zsh function, which I used to mass-execute ocrmypdf on 92 PDF files.
# Make a function to execute a string of shell code in the first argument on
# one or more files specified in the remaining arguments. Every occurrence
# of $F in the string will be replaced with the current file. If the last
# argument is "test," show each command before asking the user to execute it.
#
# XonFs <command_string> <file_or_directory>... [test]
#
XonFs()
{
    # If the last argument is a test, identify and remove it.
    #
    local testing
    if [[ ${argv[-1]} == "test" ]]
    then
        testing=1
        unset -v 'argv[-1]'
    fi
    # Get a list of files, from each argument after the first one, and sort
    # the list like the Finder does. The IFS setting makes the output of the
    # sort command be separated by newlines, instead of by whitespace.
    #
    local F IFS=$'\n' answer=""
    local files=($(sort -Vi <<<"${argv[2,-1]}"))
    # Execute the command for each file. But if we are testing, show the
    # command that will be executed before asking the user for permission to
    # do so.
    #
    for F in $files
    do
        # If this is a test, show the user the command that we will execute,
        # using the current filename. Then ask the user whether they want to
        # execute the command or not, or just quit the script. If this is not
        # a test, execute the command.
        #
        if (( $testing ))
        then
            # Separate each file execution with a newline. Show the first
            # argument to the function, the command that can be executed, with
            # the F variable expanded as the current file. Then, ask the user
            # whether the command should be executed.
            #
            [[ -n $answer ]] && print
            printf "%s\n" "${(e)1}"
            read -ks "answer?EXECUTE? y/n/q [no] "
            # Report what the user's answer is interpreted to be, and do what
            # they want.
            #
            if [[ "$answer" == [yY] ]]
            then
                print "Yes."
                eval $1
            elif [[ "$answer" == [qQ] ]]
            then
                print "Quit."
                break
            else
                answer="n"
                print "No."
            fi
        else
            eval $1
        fi
    done
}
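For example, to preview (without immediately running) a pdftotext conversion for each PDF, the test mode can be used like this (a sketch reusing the same $F and ${F:r} substitutions as the real invocations below):
XonFs 'pdftotext "$F" "${F:r}.txt";' *.pdf test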
Thus, I used the following shell commands to put the best versions of the PDF files in the PDF directory. (It took about a day on my computer to complete the ocrmypdf conversions.) During the conversions, I also had text files created from the converted PDF files and placed in the Originals directory.
cd ~/Documents/Books/Laws/Originals/
cp V{65..128}[YC]*.pdf ../PDF
XonFs 'print "$F"; ocrmypdf -f --sidecar "${F:r}.txt" "$F" "../PDF/$F"; print;' V{1..64}[YC]*.pdf
I then used pdftotext to create the text file versions of the original (unconverted) PDF files, as follows. If I remember correctly, the pdftotext tool is installed automatically with the installation of pdfgrep ("brew install pdfgrep").
XonFs 'print "$F"; pdftotext "$F" "${F:r}.txt";' V{65..128}[YC]*.pdf
Next, I created easily and quickly searchable versions of all the text files (direct conversions of the PDF files) and placed these new versions of the text files in the Text directory with the following command.
XonFs 'print "$F"; cat "$F" | tr -cds "[:space:][:alnum:]\!\$%&,.:;?" "\!\$%&,.:;?" | tr -s "\n" "=" | tr "\f" "_" | tr -s "[:space:]" " " | sed -E -e "s/ ?& ?/ and /g" -e '"'s/[ =]*_[ =]*/\\'$'\n/g' -e 's/( ?= ?)+/\\'$'\t/g'"' > "../Text/$F";' *.txt
(OK, the tr and sed commands look crazy, but they basically do the following: delete everything except certain characters and squeeze some of their repetitions, change all newlines to =, change each formfeed to _, change all whitespace to " ", change each "&" into "and", change each _ to a newline, and change each = to a tab. Thus, in the new versions of the text files, a lot of extraneous characters are removed or reduced, newlines separate pages, and tabs represent newlines. Unfortunately, sed seems to require tricky escaping of characters like \n and \t in replacement specifications.)
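A quick way to see the encoding is to push a two-line, two-page sample through the second half of the pipeline (the sample text is mine; print interprets \n and \f):
print 'One\nTwo\fThree' | tr -s '\n' '=' | tr '\f' '_' | tr -s '[:space:]' ' ' |
    sed -E -e 's/[ =]*_[ =]*/\'$'\n''/g' -e 's/( ?= ?)+/\'$'\t''/g'
The old line break between One and Two comes out as a tab, and the old page break before Three comes out as a newline.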
Below is the zsh code for a tool I created called greplaw (and a supporting function called error). Since I will be using this tool a lot, I placed this code in my ~/.zshenv file.
# Provide a function to print an error message for the current executor, which
# is identified by the first argument. The second argument, if not null, is a
# custom error message to print. If the third argument exists and is neither
# zero nor null, the script is exited, but only to the prompt if there is one.
# The fourth argument, if present, is the current line number to report in the
# error message.
#
# Usage:
#   error [executor [messageString [exitIndicator [lineNumber]]]]
#
# Examples:
#   error greplaw
#   error greplaw "" 1
#   error greplaw "No text files found" 0 $LINENO
#   error greplaw "No pdf files found" "" $LINENO
#   error greplaw "No files found" x $LINENO
#   error greplaw HELL eject $LINENO
#
error()
{
    print ${1:-"Script"}": "${4:+"Line $4: "}"Error: "${2:-"Unknown"} 1>&2
    [[ $3 =~ '^0*$' ]] || { ${missing_variable_ejector:?} } 2>/dev/null
}
# Function to grep through law files: see usage below or execute with -h.
#
greplaw()
{
    # Provide a function to print an error message for the current executor.
    # If the user did not include any arguments, give the user a little help.
    #
    local executor=$( basename $0 )
    local err() {error $executor $*}
    (( $# )) || 1="-h"
    # Create variables with any defaults that we have. The color and no-color
    # variables are for coloring the extra output that this script outputs
    # beyond what grep does.
    #
    local lawFileFilter=() contextLines=1 maxFileMatches=5 grepOutput=1
    local maxPageMatches=5 quiet=0 c="%B%F{cyan}" nc="%f%b" grep="pcregrep"
    local grepOptions=(-M -i --color) fileGrepOptions=(-M -n -i)
    # Print out the usage for the greplaw function roughly in the fashion of a
    # man page, with color and bolding. The output of this function should be
    # sent to a print -P command.
    #
    local help()
    {
        # Make some local variables to make the description more readable in
        # the below string. However, insert the codes for bold and color as
        # appropriate.
        #
        local func="%B%F{red}$executor%f%b" name synopsis description examples
        local c f g h l o p q number pattern option
        # Mass-declare our variables for usage categories, function flags, and
        # function argument types. Use the man page standards for formatting
        # the text in these variables.
        #
        for var in name synopsis description examples
            declare $var="%B%F{red}"${(U)var}"%f%b"
        for var in c f g h l o p q
            declare $var="%B%F{red}-"$var"%f%b"
        for var in number pattern option
            declare $var="%U"$var"%u"
        # Print the usage for the function, using our easier to use and read
        # variables.
        #
        cat <<greplawUsage
$name
    $func
$synopsis
    $func [$c $number] [$f $number] [$g] [$h] [$l $pattern]... [$o $option]...
        [$p $number] [$q] $pattern...
$description
    This function searches law files with a regular expression, which is
    specified as one or more arguments. If more than one argument is provided,
    they are joined, with a pattern of whitespace, into a singular expression.
    The searches are done without regard to case or whitespace, including line
    and page breaks. The output of this function is the path of each law file
    the expression is found in, including the PDF page as well as the results
    of the $grep. If just a page is reported without any $grep results,
    the match begins on that page and continues to the next one.
    The following options are available:
    $c $number
        Context of $number lines around each match shown. Default is 1.
    $f $number
        File matches maximum is $number. Default is 5. Infinite is -1.
    $g
        Grep output is omitted.
    $h
        Help message is merely printed.
    $l $pattern
        Law file regex $pattern will be added as a filename filter.
    $o $option
        Option $option added to the final $grep execution, for output.
    $p $number
        Page matches maximum is $number. Default is 5. Infinite is -1.
    $q
        Quiet file and page information: information not from $grep.
$examples
    $func bureau of investigation
    $func $o --color=always congress has not | less -r
    $func $l " " $l Law congress
greplawUsage
    }
    # Update our defaulted variables according to the supplied arguments,
    # until an argument is not understood. If an argument looks invalid,
    # complain and eject. Add each law file filter to an array.
    #
    while (( $# ))
    do
        case $1 in
            (-[Cc]*)
                [[ $2 =~ '^[0-9]+$' ]] || err "Bad $1 argument: $2" eject
                contextLines=$2
                shift 2;;
            (-f*)
                [[ $2 =~ '^-?[0-9]+$' ]] || err "Bad $1 argument: $2" eject
                maxFileMatches=$2
                shift 2;;
            (-g*)
                grepOutput=0
                shift;;
            (-h*|--h*)
                print -P "$( help )"
                return;;
            (-l*)
                lawFileFilter+=$2
                shift 2;;
            (-o*)
                grepOptions+=$2
                shift 2;;
            (-p*)
                [[ $2 =~ '^-?[0-9]+$' ]] || err "Bad $1 argument: $2" eject
                maxPageMatches=$2
                shift 2;;
            (-q*)
                quiet=1
                shift;;
            (*)
                break;;
        esac
    done
    # If the user asked for quiet output and no grep output, then we will give
    # no output at all, so just eject with an error. Also, make sure we have
    # remaining arguments to assemble the search pattern with. Assemble it by
    # joining them with the only allowable whitespace in the text files: a
    # space, a tab (which represents a newline), or a newline (which
    # represents a new PDF page).
    #
    (( $quiet && ! $grepOutput )) && err "No grep output and quiet: nothing" x
    (( $# )) || err "No pattern supplied to grep law files with" eject
    local pattern=${(j:[ \t\n]:)argv[1,-1]}
    # Quickly searchable text files are searched as representatives of the
    # actual PDF law files. Define our PDF and text directories. Note that to
    # expand the home directory specification, no quotes can be used.
    #
    local pdfDirectory=~/Documents/Books/Laws/PDF
    local textDirectory=${pdfDirectory:h}"/Text"
    # Get a list of the text files, without their directory specifications,
    # sorted like the Finder would: this puts the files in the order in which
    # the laws were created. The IFS setting separates the output of the sort
    # command by newlines, instead of by whitespace.
    #
    local filter fileName fileMatches=0 IFS=$'\n'
    local files=( $textDirectory/*.txt )
    local fileNames=( $( sort -Vi <<<${files:t} ) )
    [[ $#files -gt 1 ]] || err "No text files found" eject $LINENO
    # Repeatedly filter the fileNames for each of the law file filters that
    # were passed in.
    #
    for filter in $lawFileFilter
        fileNames=( $( grep $filter <<<"${fileNames}" ) )
    [[ $#fileNames -gt 0 ]] || err "All law files were filtered out" eject
    # For each filename, search for pattern matches. If there are any, report
    # the corresponding PDF file, the page numbers, and the lines of the match.
    #
    for fileName in $fileNames
    do
        # Do a case-insensitive, multiline grep of the current file for the
        # search pattern. In the grep, have each line prepended with the line
        # number, which represents the PDF page number.
        #
        local pages=() page="" pageMatches=0
        local file=$textDirectory"/"$fileName
        pages=( $( $grep $fileGrepOptions -e $pattern $file ) )
        # If the grep found nothing, move on to the next file. Otherwise, if
        # the maximum file matches has been defined and has been exceeded,
        # then stop processing files.
        #
        if [[ $#pages -eq 0 ]]
        then
            continue
        elif [[ ++fileMatches -gt $maxFileMatches && $maxFileMatches -gt 0 ]]
        then
            break
        fi
        # For each page with a match, print the page number and the matching
        # lines in the page.
        #
        for page in $pages
        do
            # If there have been no previous page matches in the current file,
            # identify the corresponding PDF file that the matches, in theory,
            # come from.
            #
            if [[ ++pageMatches -eq 1 ]]
            then
                # Put a blank line between matches for each file, unless
                # either minimum output is requested or page matches are not
                # reported.
                #
                if [[ $fileMatches -ne 1 && $pageMatches -ne 0
                    && $maxPageMatches -ne 0 ]]
                then
                    (( $quiet )) || print
                fi
                # Identify and print in color the full location of the PDF
                # file (prepended with an open command for easy access),
                # unless minimum output is requested.
                #
                local pdfFile=$pdfDirectory"/"${fileName:r}".pdf"
                (( $quiet )) || print -P $c"open "$pdfFile$nc
            fi
            # If the maximum page matches has been defined and has been
            # exceeded, stop processing pages for the current file.
            #
            if [[ $maxPageMatches -gt 0 && $pageMatches -gt $maxPageMatches ]]
            then
                break
            fi
            # Extract and remove the page number specification (an initial
            # number before a colon) from the grep output for the page. Then
            # extract the lines of the page: tabs are decoded as newlines.
            #
            local pageNumber=${page%%:*}
            page=${page#*:}
            local lines=( $( tr '\t' '\n' <<<$page ) )
            # Print the PDF page number in yellow. Then grep the lines of the
            # page that we have, matching possibly multiple lines without
            # regard to case. And have any grep output use color and a line
            # before and after the match, for context.
            #
            (( $quiet )) || print -P $c"Page "$pageNumber$nc
            if (( $grepOutput ))
            then
                $grep -C $contextLines -e $pattern $grepOptions <<<$lines
            fi
        done
    done
}
Yes, if I had to do it again, I would have used Perl...
Here's the usage for greplaw, as a zsh shell function. It's a surprisingly fast search tool.
NAME
    greplaw
SYNOPSIS
    greplaw [-c number] [-f number] [-g] [-h] [-l pattern]... [-o option]...
        [-p number] [-q] pattern...
DESCRIPTION
    This function searches law files with a regular expression, which is
    specified as one or more arguments. If more than one argument is provided,
    they are joined, with a pattern of whitespace, into a singular expression.
    The searches are done without regard to case or whitespace, including line
    and page breaks. The output of this function is the path of each law file
    the expression is found in, including the PDF page as well as the results
    of the pcregrep. If just a page is reported without any pcregrep results,
    the match begins on that page and continues to the next one.
    The following options are available:
    -c number
        Context of number lines around each match shown. Default is 1.
    -f number
        File matches maximum is number. Default is 5. Infinite is -1.
    -g
        Grep output is omitted.
    -h
        Help message is merely printed.
    -l pattern
        Law file regex pattern will be added as a filename filter.
    -o option
        Option option added to the final pcregrep execution, for output.
    -p number
        Page matches maximum is number. Default is 5. Infinite is -1.
    -q
        Quiet file and page information: information not from pcregrep.
EXAMPLES
    greplaw bureau of investigation
    greplaw -o --color=always congress has not | less -r
    greplaw -l " " -l Law congress
That'll do it... (if I remembered everything correctly and didn't make any typos).

Looping over all files of certain extension in a directory

I wrote a small script that unzips all the *.zip files in the current directory, extracting only the *.srt files into a newly created directory. It then loops over all the *.mkv files in the current directory to get their names and renames each subs/*.srt file so that it exactly matches the corresponding *.mkv file name.
The script works when there is one zip file and one mkv file, but with more files it produces bad filenames. I could not initially track down why; I have now figured out when it happens.
EDIT
I managed to narrow down the scenarios in which the file names are changed erroneously.
Let's say in the current directory we have three *.mkv files (sorted alphabetically):
$ ls -1a *.mkv
Home.S06E10.1080p.BluRay.x264-PRINTER.mkv
Home.S06E11.1080p.BluRay.x264-PRINTER.mkv
Home.S06E12.1080p.BluRay.x264-PRINTER.mkv
and three *.srt files:
$ ls -1a *.srt
Home.S06E10.srt
Home.S06E11.BDRip.X264-PRINTER.srt
Home.S06E12.BDRip.X264-PRINTER.srt
When I run the script, I get:
subs/Home.S06E10.srt -> subs/Home.S06E10.1080p.BluRay.x264-PRINTER.srt
subs/Home.S06E10.1080p.BluRay.x264-PRINTER.srt -> subs/Home.S06E11.1080p.BluRay.x264-PRINTER.srt
subs/Home.S06E11.1080p.BluRay.x264-PRINTER.srt -> subs/Home.S06E12.1080p.BluRay.x264-PRINTER.srt
As you can see, Home.S06E10.srt is used twice.
#!/usr/bin/env bash
mkdir -p subs
mkdir -p mkv-out
mkdir -p subs-bak
# unzip files, maybe there are subtitles in it...
for zip in *.zip; do
    if [ -f "$zip" ]; then
        unzip "$zip" -d subs "*.srt" >/dev/null
    fi
done
# move all subtitles to subs catalog
for srt in *.srt; do
    if [ -f "$srt" ]; then
        mv "$srt" subs
    fi
done
mkvCount=(*.mkv)
srtCount=(subs/*.srt)
if [ ${#mkvCount[@]} != ${#srtCount[@]} ]; then
    echo "Different number of srt and mkv files!"
    exit 1
fi
for MOVIE in *.mkv; do
    for SUBTITLE in subs/*.srt; do
        NAME=$(basename "$MOVIE" .mkv)
        SRT="subs/$NAME.srt"
        if [ ! -f "$SRT" ]; then
            echo "$SUBTITLE -> ${SRT}"
            mv "$SUBTITLE" "$SRT"
        fi
    done
done
You seem to be relying on the lexicographical order of the files to associate one SRT with one MKV. If all you have are season-episode files for the same series, then I suggest a completely different approach: iterate over season and episode counters, generate masks of the form S##E##, and look for a movie file and a subtitle file that match. If you find them, you move them.
for season in {01..06}; do
    for episode in {01..24}; do
        # Count how many movies and subtitles we have in the form S##E##
        nummovies=$(find -name "*S${season}E${episode}*.mkv" | wc -l)
        numsubs=$(find -name "*S${season}E${episode}*.srt" | wc -l)
        if [[ $nummovies -gt 1 || $numsubs -gt 1 ]]; then
            echo "Multiple movies/subtitles for S${season}E${episode}"
            exit 1
        fi
        # Skip if there is no movie or subtitle for this particular
        # season/episode combination
        if [[ $nummovies -eq 0 ]]; then
            continue
        fi
        if [[ $numsubs -eq 0 ]]; then
            echo "No subtitle for S${season}E${episode}"
            continue
        fi
        # Now actually take the MKV file, get its basename, then find the
        # SRT file with the same S##E## and rename it to match
        moviename=$(find -name "*S${season}E${episode}*.mkv")
        basename=$(basename -s .mkv "$moviename")
        subfile=$(find -name "*S${season}E${episode}*.srt")
        mv "${subfile}" "${basename}.srt"
    done
done
If you don't want to rewrite everything, just change your last loop (see the sketch after this list):
Drop the inner loop
Take the movie name instead and use sed to find the particular S##E## substring
Use find to find one SRT file like in my code
Move it
This has the benefit of not relying on a hard-coded number of seasons/episodes. I guessed six seasons and no season with more than 24 episodes; however, I thought my code would do the job and look simpler.
Make certain that there will be exactly one SRT file. Having zero or more than one file will probably just give an error from mv, but it's better to be safe. In my code I used a separate call to find with wc to count the number of lines, but if you are more knowledgeable in bash-fu, then perhaps there's a way to treat the output of find as an array instead.
In both my suggestions you can also drop the check for # movies = # subtitles. This gives you more flexibility. The subtitles can be in whatever directories you want, but the movies are assumed to be in the current working directory. With find you can also use the -or operator to accept other extensions, such as AVI and MPG.
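A minimal sketch of that modified last loop (the tag variable is mine, and the subs/ location is carried over from the question):
for MOVIE in *.mkv; do
    NAME=$(basename "$MOVIE" .mkv)
    # pull the S##E## tag out of the movie name
    tag=$(sed -E 's/.*(S[0-9]+E[0-9]+).*/\1/' <<<"$MOVIE")
    # find the single SRT with the same tag and rename it to match the movie
    SUBTITLE=$(find subs -name "*${tag}*.srt")
    if [ -n "$SUBTITLE" ]; then
        mv "$SUBTITLE" "subs/$NAME.srt"
    fi
done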

Delete files in one directory that do not exist in another directory or its child directories

I am still a newbie in shell scripting and am trying to come up with a simple script. Could anyone give me some direction here? Here is what I need.
Files in path 1: /tmp
100abcd
200efgh
300ijkl
Files in path 2: /home/storage
backupfile_100abcd_str1
backupfile_100abcd_str2
backupfile_200efgh_str1
backupfile_200efgh_str2
backupfile_200efgh_str3
Now I need to delete the file 300ijkl in /tmp, as the corresponding backup file is not present in /home/storage. The /tmp directory contains more than 300 files. I need to delete the files in /tmp for which the corresponding backup files are not present; the file names in /tmp will match file names in /home/storage or in directories under /home/storage.
Appreciate your time and response.
You can also approach the deletion using grep. You can loop through the files in /tmp, checking with ls piped to grep, and delete if there is no match:
#!/bin/bash
[ -z "$1" -o -z "$2" ] && { ## validate input
    printf "error: insufficient input. Usage: %s tmpfiles storage\n" ${0//*\//}
    exit 1
}
for i in "$1"/*; do
    fn=${i##*/} ## strip path, leaving filename only
    ## if file in backup matches filename, skip rest of loop
    ls "${2}"* | grep -q "$fn" &>/dev/null && continue
    printf "removing %s\n" "$i"
    # rm "$i" ## remove file
done
Note: the actual removal is commented out above; test and ensure there are no unintended consequences before performing the actual delete. Call it passing the path to tmp (without a trailing /) as the first argument and /home/storage as the second:
$ bash scriptname /path/to/tmp /home/storage
You can solve this by
making a list of the files in /home/storage
testing each filename in /tmp to see if it is in the list from /home/storage
Given the linux+shell tags, one might use bash:
make the list of files from /home/storage an associative array
make the subscript of the array the filename
Here is a sample script to illustrate ($1 and $2 are the parameters to pass to the script, i.e., /home/storage and /tmp):
#!/bin/bash
declare -A InTarget
while read path
do
    name=${path##*/}
    InTarget[$name]=$path
done < <(find $1 -type f)
while read path
do
    name=${path##*/}
    [[ -z ${InTarget[$name]} ]] && rm -f $path
done < <(find $2 -type f)
It uses two interesting shell features:
name=${path##*/} is a POSIX shell feature which allows the script to perform the basename function without an extra process (per filename). That makes the script faster.
done < <(find $2 -type f) is a bash feature which lets the script read the list of filenames from find without making the assignments to the array run in a subprocess. Here the reason for using the feature is that if the array is updated in a subprocess, it would have no effect on the array value in the script which is passed to the second loop.
For related discussion:
Extract File Basename Without Path and Extension in Bash
Bash Script: While-Loop Subshell Dilemma
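The second point is easy to demonstrate. A minimal sketch (the directory contents are arbitrary):
#!/bin/bash
declare -A InTarget
# Piping into the loop runs it in a subshell, so the assignments are lost:
find . -type f | while read path; do InTarget[${path##*/}]=$path; done
echo "${#InTarget[@]}"   # prints 0
# With process substitution the loop runs in the current shell:
while read path; do InTarget[${path##*/}]=$path; done < <(find . -type f)
echo "${#InTarget[@]}"   # prints the number of files found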
I spent quite some time on this today because I needed to delete files which have the same name but different extensions, so if anyone is looking for a quick implementation, here you go:
#!/bin/bash
# We need some reference to files which we want to keep and not delete.
# Let's assume you want to keep the files in the first folder that match the
# jpeg versions in the second, so you need to map the second folder's names
# to the desired file extension first.
FILES_TO_KEEP=`ls -1 ${2} | sed 's/\.pdf$/.jpeg/g'`
# iterate through files in the first argument path
for file in ${1}/*; do
    # In my case, I did not want to do anything with directories, so let's
    # continue the cycle when hitting one.
    if [[ -d $file ]]; then
        continue
    fi
    # let's omit the path from the iterated file with basename so we can
    # compare it to the files we want to keep
    NAME_WITHOUT_PATH=`basename $file`
    # I use a Mac, which is equal to having poor-quality CLI tools when it
    # comes to operating with strings; this should be a safe check to see
    # whether FILES_TO_KEEP contains NAME_WITHOUT_PATH
    if [[ $FILES_TO_KEEP == *"$NAME_WITHOUT_PATH"* ]]; then
        echo "Not deleting: $NAME_WITHOUT_PATH"
    else
        # If it does not contain a file from the other directory, remove it.
        echo "deleting: $NAME_WITHOUT_PATH"
        rm -rf $file
    fi
done
Usage: sh deleteDifferentFiles.sh path/from/where path/source/of/truth

Find and delete files that contain same string in filename in linux terminal

I want to delete all files from a folder that contain a non-unique numerical string in the filename, using the Linux terminal. E.g.:
werrt-110009.jpg => delete
asfff-110009.JPG => delete
asffa-123489.jpg => maintain
asffa-111122.JPG => maintain
Any suggestions?
I only now understand your question, I think. You want to remove all files that contain a numeric value that is not unique (in a particular folder). If a filename contains a value that is also found in another filename, you want to remove both files, right?
This is how I would do that (it may not be the fastest way):
# put all the files in your folder in a list
# (for array=(*) to work, make sure you have enabled nullglob: shopt -s nullglob)
array=(*)
delete=()
for elem in "${array[@]}"; do
    # for each elem in your list, extract the number
    num_regex='([0-9]+)\.'
    [[ "$elem" =~ $num_regex ]]
    num="${BASH_REMATCH[1]}"
    # use the extracted number to check if it is unique
    dup_regex="[^0-9]($num)\..+?(\1)"
    # if it is not unique, put the file in the files-to-delete list
    if [[ "${array[@]}" =~ $dup_regex ]]; then
        delete+=("$elem")
    fi
done
# delete all found duplicates
for elem in "${delete[@]}"; do
    rm "$elem"
done
In your example, array would be:
array=(werrt-110009.jpg asfff-110009.JPG asffa-123489.jpg asffa-111122.JPG)
And the result in delete would be:
delete=(werrt-110009.jpg asfff-110009.JPG)
Is this what you meant?
You can use the Linux find command with the -regex and -delete parameters to do it in one command.
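For example, with GNU find, and a number you already know to be duplicated (110009 from the question):
find . -maxdepth 1 -regextype posix-extended -iregex '.*110009.*\.jpe?g' -delete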
Use "rm" command to delete all matching string files in directory
cd <path-to-directory>/ && rm *110009*
This command deletes all files containing the matching string, and it doesn't depend on the position of the string in the file name.
I mentioned the rm command above as another option to delete files with a matching string.
Below is a complete script to achieve your requirement:
#!/bin/bash -eu
# Note: bash (not sh) is needed for [[ ... =~ ]] and BASH_REMATCH below.
# provide the destination folder path
DEST_FOLDER_PATH="$1"
TEMP_BUILD_DIR="/tmp/$( date +%Y%m%d-%H%M%S)_cleanup_duplicate_files"
#++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
clean_up()
{
    if [ -d $TEMP_BUILD_DIR ]; then
        rm -rf $TEMP_BUILD_DIR
    fi
}
trap clean_up EXIT
[ ! -d $TEMP_BUILD_DIR ] && mkdir -p $TEMP_BUILD_DIR
TEMP_FILES_LIST_FILE="$TEMP_BUILD_DIR/folder_file_names.txt"
echo "$(ls $DEST_FOLDER_PATH)" > $TEMP_FILES_LIST_FILE
while read filename
do
    # check files with a number pattern (the regex must be unquoted)
    if [[ "$filename" =~ ([0-9]+)\. ]]; then
        # fetch the number to find files with a similar number
        matching_string="${BASH_REMATCH[1]}"
        # use the extracted number to check if it is unique:
        # find the count of files with matching_string
        if [ $(ls -1 $DEST_FOLDER_PATH/*$matching_string* | wc -l) -gt 1 ]; then
            rm $DEST_FOLDER_PATH/*$matching_string*
        fi
    fi
    # reload the remaining files in the folder (this optimizes the loop and
    # speeds up the operation, which helps a lot when the folder contains
    # many files)
    echo "$(ls $DEST_FOLDER_PATH)" > $TEMP_FILES_LIST_FILE
done < $TEMP_FILES_LIST_FILE
exit 0
How to execute this script:
Save this script into a file such as path-to-script/delete_duplicate_files.sh (you can rename it however you want).
Make the script executable:
chmod +x {path-to-script}/delete_duplicate_files.sh
Execute the script, providing the directory path in which the duplicate files (files with a matching number pattern) need to be deleted:
{path-to-script}/delete_duplicate_files.sh "{path-to-directory}"

Bash command to move only some files?

Let's say I have the following files in my current directory:
1.jpg
1original.jpg
2.jpg
2original.jpg
3.jpg
4.jpg
Is there a terminal/bash/linux command that can do something like
if the file [an integer]original.jpg exists,
then move [an integer].jpg and [an integer]original.jpg to another directory.
Executing such a command will cause 1.jpg, 1original.jpg, 2.jpg and 2original.jpg to be in their own directory.
NOTE
This doesn't have to be one command. It can be a combination of simple commands. Maybe something like: copy the original files to a new directory, then run some regular-expression filter on the files in the new directory to get a list of file names from the old directory that still need to be copied over, etc.
Turning on extended glob support will allow you to write a regular-expression-like pattern. This can handle files with multi-digit integers, such as '87.jpg' and '87original.jpg'. Bash parameter expansion can then be used to strip "original" from the name of a found file to allow you to move the two related files together.
shopt -s extglob
for f in +([[:digit:]])original.jpg; do
    mv "$f" "${f/original/}" otherDirectory
done
In an extended pattern, +( x ) matches one or more of the things inside the parentheses, analogous to the regular expression x+. Here, x is any digit. Therefore, we match all files in the current directory whose name consists of 1 or more digits followed by "original.jpg".
${f/original/} is an example of bash's pattern substitution. It removes the first occurrence of the string "original" from the value of f. So if f is the string "1original.jpg", then ${f/original/} is the string "1.jpg".
Well, not directly, but it's a one-liner (edit: not anymore):
for i in [0-9].jpg; do
    orig=${i%.*}original.jpg
    [ -f $orig ] && mv $i $orig another_dir/
done
edit: probably I should explain my solution:
for i in [0-9].jpg: execute the loop body for each jpg file with a single digit as its filename; store the whole filename in $i
orig=${i%.*}original.jpg: save in $orig the possible filename of the "original" file
[ -f $orig ]: check via test(1) (the [ ... ] stuff) whether the original file for $i exists. If yes, move both files to another_dir. This is done via &&: the part after it is only executed if the test was successful.
This should work for any strictly numeric prefix, e.g. 234.jpg:
for f in *original.jpg; do
    pre=${f%original.jpg}
    if [[ -e "$pre.jpg" && "$pre" -eq "$pre" ]] 2>/dev/null; then
        mv "$f" "$pre.jpg" targetDir
    fi
done
"$pre" -eq "$pre" gives an error if not integer
EDIT:
This fails if both original.jpg and .jpg exist: $pre is then the null string, and "$pre" -eq "$pre" is true.
The following would work and is easy to understand (replace out with the output directory, and {1..9} with the actual range of your numbers).
for x in {1..9}
do
    if [ -e ${x}original.jpg ]
    then
        mv $x.jpg out
        mv ${x}original.jpg out
    fi
done
You can obviously also enter it as a single line.
You can use regex statements to find "matches" in the file names that you are looking through, then perform your actions on the "matches" you find.
integer=0; while [ $integer -le 9 ] ; do if [ -e ${integer}original.jpg ] ; then mv -vi ${integer}.jpg ${integer}original.jpg lol/ ; fi ; integer=$((integer + 1)) ; done
Note that here, "lol" is the destination directory. You can change it to anything you like. Also, you can change the 9 in while [ $integer -le 9 ] to check integers larger than 9. Right now it starts at 0* and stops after checking 9*.
Edit: If you want, you can replace the semicolons in my code with carriage returns, which may make it easier to read. Also, you can paste the whole block into the terminal either way, even if that might not immediately be obvious.
