Merge multiple files containing same IDs - Linux

I have 10000 files in one folder, like this:
1000.htm
Page_1000.html
file-1000.txt
2000.htm
Page_2000.html
file-2000.txt
I want to merge each group of files that have a similar name.
Example:
1000.htm Page_1000.html file-1000.txt > 1.txt
2000.htm Page_2000.html file-2000.txt > 2.txt
I have tried merging with cat like this, and it works, but I can't do that by hand for 10k files:
cat 1000* > 1.txt
cat 2000* > 2.txt
Thanks

You probably can't do that because the glob (*) expands to too large a number of arguments. You can use find instead to find all files matching the pattern and then use xargs to run cat on them.
find . -name '1000*' -print0 | xargs -0 cat > 1.txt
'-print0' and '-0' delimit entries with the null (\0) character instead of the default line-break character (\n). This way files with line breaks in their names also work as expected.

find . -name '*.htm' -printf '%P\n' |
while IFS='.' read -r key sfx; do
    cnt=$(( cnt + 1 ))
    cat "${key}.htm" "Page_${key}.html" "file-${key}.txt" > "${cnt}.txt"
done
Though you should consider using the key in the output file name instead of a cnt variable, so it's easy to tell which input files went into which output file.
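For example, a minimal variation of that loop keyed on the ID itself (a sketch assuming every ID really has all three files; the merged_ prefix is just an illustrative choice, not from the question):
find . -name '*.htm' -printf '%P\n' |
while IFS='.' read -r key sfx; do
    cat "${key}.htm" "Page_${key}.html" "file-${key}.txt" > "merged_${key}.txt"
done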

i=1
for (( num = 1000; num < 10000; num += 1000 )); do
    cat ${num}.htm Page_${num}.html file-${num}.txt > ${i}.txt
    i=$(( i + 1 ))
done
You can change num < 10000, as per your requirement.

Related

Linux - is there a way to get the file size of a directory BUT only including the files that have a last modified / creation date of x?

As per the title, I am trying to find a way to get the total size of a directory (using du) but only counting the files in it that have been created (or modified) after a specific date.
Is this something that can be done from the command line?
Thanks :)
From @Bodo's comment. Using GNU find:
find directory/ -type f -newermt 2021-11-25 -printf "%s\t %f\n" | \
awk '{s += $1 } END { print s }' | \
numfmt --to=iec-i
find looks in directory/ (change this)
Looks for files (-type f)
that have a modified time newer than 2021-11-25 (-newermt) (change this)
and outputs each file's size (%s) on its own line
awk adds up all the sizes from those lines: { s += $1 }
and prints the result: END { print s }
numfmt's --to=iec-i formats the byte value as human-readable
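If you only need the files modified within a date range (an assumption about the use case; the dates below are illustrative), the same pipeline works with a second, negated -newermt test:
find directory/ -type f -newermt 2021-11-01 ! -newermt 2021-12-01 -printf "%s\n" | \
awk '{ s += $1 } END { print s }' | \
numfmt --to=iec-i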

To find the nearest smaller value based on a variable input inside bash script

I'm trying to find all the files with format: 100_Result.out, 101_Result.out, ... 104_Result.out from the subdirectories of a directory: /as/test_dir/. 
The structure of the subdirectories looks like: /as/test_dir/4/, /as/test_dir/5/,/as/test_dir/6/, /as/test_dir/7/, ...
So, if I have a variable num=102 in the script, then I want to check all the *_Result.out files and need to capture the file which is one value smaller than the num variable
– i.e., I want the file: 101_Result.out.
Similarly, if num=101, then file should be 100_Result.out
But sometimes it can happen that the .out files are not sequential,
i.e., not all values are present. 
So, if num=102 but there is no 101_Result.out file,
but I have a 100_Result.out file in one of the sub-directories,
then that's what I want.
I tried the script below and I believe I've more or less got there,
but it doesn't look clean.
#!/bin/bash
dir="/as/test_dir"
files=( $(find "$dir"/*/ -type f -name "*_Result.out" -exec basename "{}" \;) )
num=102
len=${#files[@]}
i=0
while [ $i -lt $len ]; do
    var=$(echo "${files[$i]}" | awk -F'_' '{print $1}')
    dif=$(($num - $var))
    if [[ "$dif" -ge '1' ]]; then
        echo "$dif" >> tmpfile
    fi
    let i++
done
arr=( $(cat tmpfile) )
min=${arr[0]}
max=${arr[0]}
for i in "${arr[@]}"; do
    if [[ "$i" -lt "$min" ]]; then
        min="$i"
    elif [[ "${#arr[@]}" -eq '1' ]] && [[ "${arr[0]}" -eq '1' ]]; then
        min="$i"
        for j in "${files[@]}"; do
            var=$(echo "${j}" | awk -F'_' '{print $1}')
            if [[ $(($num - $var)) -eq "$min" ]]; then
                file_name="${var}_Result.out"
                echo "$file_name"
            fi
        done
    fi
done
echo "$min"
#rm tmpfile
Any help is most welcome.
and need to capture the file which is one value smaller than the num variable .... if I have num=102 in the script, then I want the file: 101_Result.out
Just glob the file.
echo "$dir"/*/"$((num - 1))_Result.out"
sometimes it could happen that the .out files are not sequential, i.e. if num=102, then it may happen that I only have a 100_Result.out file in one of the sub-directories
Try not to store state in bash. Instead, write one big pipeline, like so:
# Get all files
find "$dir"/*/ -type f -name "*_Result.out" |
# Prepend the number as the first column, separated by a space.
sed 's~.*/\([0-9]*\)_Result.out$~\1 &~' |
# Filter only smaller
awk -v num="$num" '$1 < num' |
# Sort numerically and keep the candidate with the largest number below num
sort -n | tail -n1 |
# Remove the number
cut -d' ' -f2-
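If you need the result in a shell variable (a sketch built on the same pipeline, assuming dir and num are set as in the question; nearest is just an illustrative name):
nearest=$(
    find "$dir"/*/ -type f -name "*_Result.out" |
    sed 's~.*/\([0-9]*\)_Result.out$~\1 &~' |
    awk -v num="$num" '$1 < num' |
    sort -n | tail -n1 |
    cut -d' ' -f2-
)
[ -n "$nearest" ] || { echo "no *_Result.out below $num" >&2; exit 1; }
echo "$nearest"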
The question does not specify what to do if there are multiple files
with the same number (i.e., a tie;
e.g., /as/test_dir/4/101_Result.out 
and /as/test_dir/7/101_Result.out). 
This answer assumes that you want all of them.
One Step
This is very similar to Darkman’s answer, but
It handles pathnames somewhat better.
IMO, it’s clearer. 
It uses meaningful variable names (one-letter names are too short),
looser spacing (169 characters is too many for a one-liner),
and a simpler algorithm (no subtraction).
dir="/as/test_dir"
num=102
find "$dir" -type f -name '*_Result.out' |
awk -F'[/_]' -v limit="$num" '
    {
        this_num = $(NF-1)
        if (this_num < limit)
        {
            numbers[$0] = this_num
            if (this_num > max) max = this_num
        }
    }
    END { for (i in numbers) if (numbers[i] == max) print i }
'
You clearly already understand the find command —
find all files in and under $dir whose names match *_Result.out.
Pipe into awk.  
Each filename (pathname) becomes an input record to awk.
-F'[/_]' means use slash (/) and underscore (_) as field separators. 
That means that a filename (input record) of /as/test_dir/4/100_Result.out
gets broken into these fields:
 $1 = (blank)
 $2 = as
 $3 = test
 $4 = dir
 $5 = 4
 $6 = 100
 $7 = Result.out
($1 would be set to the text before the first /  (or _) 
if there were any.)
As illustrated above, the number part of the file name
is the second-to-last field in the record; i.e., $(NF-1). 
This depends on the fact that the file name
always contains exactly one underscore, and it comes right after the number. 
(See Part 2 of this answer for a more flexible approach.)
If the number is less than the limit (e.g., 102),
save the pathname in an array, associated with the number.
If the number is less than the limit
but more than the maximum we have seen so far, update the maximum. 
(We don’t need to initialize max explicitly;
awk automatically initializes all variables to zero¹.)
Finally, print all the pathnames that are associated with the max value.
The above will list (all) the desired filenames on the standard output. 
As you know, you can put them into an array
by putting arr=( before the find … | awk … pipeline, and ) after it.
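For example, wrapping the One-Step pipeline (condensed here; a sketch, not a new answer) in arr=( … ) looks roughly like this, subject to the filename caveats discussed under Two Steps:
arr=( $(find "$dir" -type f -name '*_Result.out' |
    awk -F'[/_]' -v limit="$num" '
        { if ($(NF-1) < limit) { numbers[$0] = $(NF-1); if ($(NF-1) > max) max = $(NF-1) } }
        END { for (i in numbers) if (numbers[i] == max) print i }') )
printf '%s\n' "${arr[@]}"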
________
¹ Actually, variables are initialized to null.
This is treated as zero when it is used in a numeric context.
Two Steps
The above is OK
for producing a human-readable, displayable list of filenames. 
However, filenames can contain weird characters
like space, tab, newline, *, ?, etc.;
processing the output from find can be problematic. 
A somewhat safer approach is to determine the max value,
and then, as a second step, find the file(s) that match that value. 
You can then process those files with -exec.
max=$(find "$dir" -type f -name '*_Result.out' |
    awk -F'/' -v limit="$num" '
        {
            this_num = $NF
            sub(/_Result.out/, "", this_num)
            if (this_num < limit)
            {
                numbers[$0] = this_num
                if (this_num > max) max = this_num
            }
        }
        END { print max }
    '
)
if [ "$max" = "" ]
then
echo "No file(s) found."
else
find "$dir" -type f -name "${max}_Result.out"
fi
-F'/' means use only slash (/) as field separator. 
That means that a filename (input record) of /as/test_dir/4/100_Result.out gets broken into these fields:
 $1 = (blank)
 $2 = as
 $3 = test_dir
 $4 = 4
 $5 = 100_Result.out
Here, the last (rightmost) component of the pathname (i.e., the file name)
is the last field in the record; i.e., $NF. 
This, of course,
is equivalent to the -exec basename "{}" you’re already using.
Temporarily assign the file name to the this_num variable. 
Then strip off the _Result.out part (by substituting null for it),
leaving just the number. 
Strictly speaking, the sub call should be
sub(/_Result\.out/, "", this_num)
to treat the . as a literal dot
rather than an any-character (wildcard). 
But we know that the fourth-to-last character in the file name
is an actual dot, because it matched the -name.
At the end, just print the maximum number,
capturing the value in a shell variable, …
…, and then find the files that match that number (name).
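As a final sketch (the cp and the destination directory are illustrative assumptions, not part of the original answer), the matching file(s) can be handed straight to -exec instead of being printed:
find "$dir" -type f -name "${max}_Result.out" -exec cp -v {} /some/destination/ \;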

Searching multiple files for list of words in a text file

I need to go through a huge number of text files and list the ones that contain ALL of the words listed in another text file.
I need to list only the files containing all of the words. It does not have to be in any specific order. I've tried to use a variety of grep commands, but it only outputs the files containing any of the words, not all of them. It would be ideal to use the txt file containing the list of words as a search for grep.
Expected output is a list of just the files that succeed in the search (files that contain all the words from the "query" text file).
Tried
grep -Ffw word_list.txt /*.fas
find . -exec grep "word_list.txt" '{}' \; -print
I've found solutions using a number of pipes like
awk "/word1/&&/word2/&&/word3/" ./*.txt
find . -path '*.txt' -prune -o -type f -exec gawk '/word1/{a=1}/word2/{b=1}/word3/{c=1}END{ if (a && b && c) print FILENAME }' {} \;
But I have a huge list of words, so that would be impractical.
Thank you.
Given sample files
file1.txt
word1
word2
word4
word5
file2.txt
word1
word2
word3
word4
file3.txt
word2
word3
word4
file4.txt
word0
word1
word2
word3
word4
file5.txt
word0
word1
word2
word3
word4
word5
This old-fashioned awk/shell code
#!/bin/bash
wordList="$1"
shift
awk -v wdListFile="$wordList" '
BEGIN{
    dbg=0
    while (getline < wdListFile > 0) {
        words[$0]=$0
        flags[$0]=0
        numFlags++
    }
}
{
    if (dbg) { print "#dbg: myFile=" myFile " FILENAME=" FILENAME }
    if (myFile != FILENAME) {
        # a minor cost of an extra reset on the first iteration in the run
        if (dbg) { print "#dbg: inside flags reset" }
        for (flg in flags) {
            flags[flg]=0
        }
    }
    for (i=1; i<=NF; i++) {
        if (dbg) { print "#dbg: $i=" $i }
        if ($i in words) {
            flags[$i]++
        }
    }
    matchedCnt=0
    for (f in flags) {
        if (dbg) { print "#dbg: flags[" f "]=" flags[f] }
        if (flags[f] > 0) {
            matchedCnt++
            if (dbg) { print "#dbg: incremented matchedCnt to " matchedCnt }
        }
    }
    if (dbg) { print "#dbg: Testing matchedCnt=" matchedCnt "==numFlags=" numFlags }
    if (matchedCnt == numFlags) {
        if (dbg) { print "All words found in " FILENAME " matchedCnt=" matchedCnt " numFlags=" numFlags }
        print FILENAME
        nextfile
    }
    myFile=FILENAME
    if (dbg) { print "#dbg: myFile NOW=" myFile }
}' "$@"
Run from the command line as
./genGrep.sh wd.lst file*.txt
Produces the following output
file2.txt
file4.txt
file5.txt
One time only, make the script executable with
chmod 755 ./genGrep.sh
I would recommend making a copy of this file with dbg in the name, then take the original copy and delete all lines with dbg. This way you'll have a dbg version if you need it, but the dbg lines add an extra ~20% to reading the code.
Note that you can switch all debugging on by setting dbg=1, OR you can turn on individual lines by adding a ! character, i.e. if (! dbg) { ... }.
If for some reason you're running on really old Unix hardware, the nextfile command may not work. See if your system has gawk available, or get it installed.
I think there is a trick to getting nextfile behavior if it's not built in, but I don't want to spend time researching that now.
Note that the use of the flags[] array, matchedCnt variable and the builtin awk function nextfile is designed to stop searching in a file once all words have been found.
You could also add a parameter to say "if n percent match, then print file name", but that comes with a consulting rate attached.
If you don't understand the stripped-down awk code (after removing the dbg sections), please work your way through Grymoire's Awk Tutorial before asking questions.
Managing thousands of files (as you indicate) is a separate problem. But to get things going, I would call genGrep.sh wd.lst A* ; genGrep.sh wd.lst B*; ... and hope that works. The problem is that the command line has a limit on how many characters it can process at once in filename lists. So if A* expands to 1 billion characters, you have to find a way to break the list up into something the shell can process.
Typically, this is solved with xargs, so
find /path/to/files -name 'file*.txt' | xargs -I {} ./genGrep.sh wd.lst {}
Will find all the files that you specify by wildcard as demonstrated, from 1 or more /path/to/file that you list as the first argument to find.
All matching files are sent through the pipe to xargs, which reads as many files from the list as one command invocation can process, and keeps looping (not visible to you) until all files have been processed.
There are extra options to xargs that allow running multiple copies of ./genGrep.sh at once, if you have extra "cores" available on your computer. I don't want to get too deep into that, as I don't know if the rest of this is really going to work in your real-world use.
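A rough sketch of that (the -n and -P values are illustrative assumptions, not tuned numbers):
# pass up to 500 file names per genGrep.sh invocation, run up to 4 invocations in parallel
find /path/to/files -name 'file*.txt' -print0 | xargs -0 -n 500 -P 4 ./genGrep.sh wd.lst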
IHTH
It's a little hack, as there is no direct way to do AND in grep. We can chain grep -l calls so that each one only keeps the files that survived the previous word:
grep -l -E "word1" *.txt | xargs grep -l -E "word2" | xargs grep -l -E "word3" | xargs grep -l -E "word4"
-l => --files-with-matches, print only the names of files that contain a match
-E => --extended-regexp
Each xargs passes the surviving file names to the next grep, so only files containing every word reach the end of the pipeline.
Try something like:
WORD_LIST=file_with_words.txt
FILES_LIST=file_with_files_to_search.txt
RESULT=file_with_files_containing_all_words.txt
# Generate a list of files to search and store as provisional result
# You can use find, ls, or any other way you find useful
find . -type f > ${RESULT}
# Now perform the search for every word
for WORD in $(<${WORD_LIST}); do
# Remove any previous file list
rm -f ${FILES_LIST}
# Set the provisional result as the new starting point
mv ${RESULT} ${FILES_LIST}
# Do a grep on this file list and keep only the files that
# contain this particular word (and all the previous ones)
cat ${FILES_LIST} | xargs grep -l -e "${WORD}" > ${RESULT}
done
# Clean up temporary files
rm -f ${FILES_LIST}
At this point you should have in ${RESULT} the list of files that contain all the words in ${WORD_LIST}.
This operation is costly, as you have to read all the (still) candidate files again and again for every word you check, so try to put the least frequent words first in ${WORD_LIST}; that way you drop as many files as possible from the check as early as possible.
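Another option that uses the word file directly (a sketch, assuming word_list.txt holds one unique word per line and the candidate files match ./*.txt): count how many distinct listed words occur in each file and compare that with the total.
total=$(wc -l < word_list.txt)
for f in ./*.txt; do
    # -F fixed strings, -w whole words only, -o one match per output line
    found=$(grep -owFf word_list.txt "$f" | sort -u | wc -l)
    [ "$found" -eq "$total" ] && echo "$f"
done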

Get distinct extension list Linux

I am new to Linux and currently I am facing a problem. I want to get the list of extensions (.doc, .pdf, ...) used in a folder. I googled a lot and finally got a solution, which is given below:
find . -type f | awk -F. '!a[$NF]++{print $NF}'
I understand find . -type f, but I am unable to understand awk -F. '!a[$NF]++{print $NF}'. What does it mean?
NF = Number of Fields in the current record
Can anyone explain?
Thanks in advance.
To answer your question what the awk line is doing :
As you already indicated, the line find . -type f returns a list of the files located in and below the current directory. E.g.
./foo.ext1
./bar.ext2
./spam.ext2
./ham.ext3
./spam.ham.eggs
This list of files is sent through a pipe to the command awk -F. '!a[$NF]++{print $NF}'. This awk line contains a lot of information. First of all, you need to know that awk is a record parser where each record consists of a number of fields. The default record is a line, while the default field separator is a sequence of spaces. So what does your awk line do now:
-F. :: this redefines the field separator to be a dot (.). From this point forward, the lines in the example now have 2 fields (e.g. line 1: foo and ext1), while the last line has 3 fields (spam, ham and eggs).
NF :: this is an awk variable that returns the number of fields per record. It is clear that the extension is represented by the last field ($NF)
a[$NF] :: this is an array whose index is the extension. The default array value is zero unless you assign something to it.
a[$NF]++ :: this returns the current value of a[$NF] and increments the value with 1 after the return. Thus for line 1, a["ext1"]++ returns 0 and sets a["ext1"] to 1. While for line 3, a["ext2"]++ returns 1 and sets a["ext2"] to 2. This indicates that a[$NF] keeps track of the amount of times $NF appeared.
!a[$NF]++ :: this combines the logic of the above but checks whether the return value of a[$NF]++ is 0. If it is 0, it returns true, otherwise false. In case of line 2 of the example, this statement returns true because a["ext2"]++ has the value 0. However, after the statement a["ext2"] has the value 1. When reading line 3, the statement returns false. In other words, it asks "have we seen $NF already?" and, whatever the answer, increments the count of $NF by one.
!a[$NF]++{print $NF} :: this combines everything. It essentially states, If !a[$NF]++ returns true, then print $NF, but before printing increment a[$NF] by one. Or in other words, If the field representing the extension ($NF) appears for the first time, print that field. If it has already appeared before, do nothing.
The incrementing of the array is important as it keeps track of what has been seen already. So line by line the following will happen
foo.ext1 => $NF="ext1", a["ext1"] is 0 so print $NF and set a["ext1"]=1
bar.ext2 => $NF="ext2", a["ext2"] is 0 so print $NF and set a["ext2"]=1
spam.ext2 => $NF="ext2", a["ext2"] is 1 so do not print and set a["ext2"]=2
ham.ext3 => $NF="ext3", a["ext3"] is 0 so print $NF and set a["ext3"]=1
spam.ham.eggs => $NF="eggs", a["eggs"] is 0 so print $NF and set a["eggs"]=1
The output is
ext1
ext2
ext3
eggs
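As an aside (not part of the original command), !a[...]++ is a general "first time seen" idiom; for example, the following prints each distinct line of a file once, in the order of first appearance (somefile.txt is just a placeholder name):
awk '!seen[$0]++' somefile.txt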
General comments:
A file without any extension at all, or a file in a hidden directory (e.g. ./path/to/awesome_filename_without_extension or ./path/to/.secret/filename_without_extension), gets a part of its full path printed as if it were the extension. The result, however, is meaningless, i.e.
/path/to/awesome_filename_without_extension
secret/filename_without_extension
This is best resolved as
find . -type f -exec basename -a '{}' + \
| awk -F. '((NF>1)&&(!a[$NF]++)){print $NF}'
Here the output of find is processed directly by basename, which strips the directory from the filename. The awk line does one extra check: do we have more than 1 field (i.e. is there an extension at all)?
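If you also want .TXT and .txt reported as a single extension (an assumption about the goal, not part of the original answer), a small variation is:
find . -type f -exec basename -a '{}' + \
| awk -F. '((NF>1)&&(!a[tolower($NF)]++)){print tolower($NF)}'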
A very simple way of doing what you are attempting is to sort the output keeping only unique extensions, e.g.
find . -type f -regex ".*[.][a-zA-Z0-9][a-zA-Z0-9]*$" | \
awk -F '.' '{ print $NF }' | sort -u
if your sort doesn't support the -u option, then you can pipe the results of sort to uniq, e.g.
find . -type f -regex ".*[.][a-zA-Z0-9][a-zA-Z0-9]*$" | \
awk -F '.' '{ print $NF }' | sort | uniq
The -regex option limits the find selection to filenames with at least a one-ASCII-character extension. However, it will also pick up files without a real extension if their names contain a '.', e.g. foo.bar.fatcat would result in fatcat being included in the list.
You could adjust the regular expression to meet your needs. If your version of find supports posix-extended regular expressions then you can prevent longer extensions from being picked up. For example to limit the extension to 1-3 characters, you could use:
find . -type f -regextype posix-extended -regex ".*[.][a-zA-Z0-9]{1,3}$" | \
awk -F '.' '{ print $NF }' | sort -u
There are other ways to approach this, but given your initial example, this is a close follow-on.
You can use the following command for this purpose:
$ find <DIR> -type f -print0 | xargs -0 -n1 basename | grep -Po '(?<=.)\..*$' | sort | uniq
.bak
.c
.file
.file.bak
.input
.input.bak
.log
.log.bak
.out
.out.bak
.test
.test.bak
.txt
.txt.bak
where the find command looks for all files under the <DIR> subtree and passes them to basename to get only the filename without the path part (-print0 and -0 are used to cope with files that have spaces in their names). Then grep keeps only the part of the string that starts with a . (the extension, e.g. .tar, .txt, .tar.gz) while ignoring hidden files whose names start with a .. After that you sort the results and keep only the unique values.
If you do not need the leading . in the extension name, add
| sed 's/^\.//'

Count occurrences of a character in files

I want to count all $ characters in each file in a directory with several subdirectories.
My goal is to count all variables in a PHP project. The files have the suffix .php.
I tried
grep -r '$' . | wc -c
grep -r '$' . | wc -l
and a lot of other things, but they all returned a number that cannot be right. My example file contains only four $ characters.
So I hope someone can help me.
EDIT
My example file
<?php
class MyClass extends Controller {
$a;$a;
$a;$a;
$a;
$a;
To recursively count the number of $ characters in a set of files in a directory you could do:
fgrep -Rho '$' some_dir | wc -l
To include only files of extension .php in the recursion you could instead use:
fgrep -Rho --include='*.php' '$' some_dir | wc -l
The -R is for recursively traversing the files in some_dir, and the -o is for printing each match on its own line rather than the whole matching line. The set of files is restricted to the pattern *.php, and file names are suppressed in the output with -h, since they might otherwise have caused false positives.
For counting variables in a PHP project you can use the variable regex defined here.
So, the following will grep all variables in each file:
cd ~/my/php/project
grep -Pro '\$[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*' .
-P - use perlish regex
-r - recursive
-o - each match on separate line
will produce something like:
./elFinderVolumeLocalFileSystem.class.php:$path
./elFinderVolumeLocalFileSystem.class.php:$path
./elFinderVolumeMySQL.class.php:$driverId
./elFinderVolumeMySQL.class.php:$db
./elFinderVolumeMySQL.class.php:$tbf
You want to count them, so you can use:
$ grep -Proc '\$[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*' .
and will get the count of variables in each file, like:
./connector.minimal.php:9
./connector.php:9
./elFinder.class.php:437
./elFinderConnector.class.php:46
./elFinderVolumeDriver.class.php:1343
./elFinderVolumeFTP.class.php:577
./elFinderVolumeFTPIIS.class.php:63
./elFinderVolumeLocalFileSystem.class.php:279
./elFinderVolumeMySQL.class.php:335
./mime.types:0
./MySQLStorage.sql:0
When you want counts per file and per variable, you can use:
$ grep -Pro '\$[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*' . | sort | uniq -c
for getting result like:
17 ./elFinderVolumeLocalFileSystem.class.php:$target
8 ./elFinderVolumeLocalFileSystem.class.php:$targetDir
3 ./elFinderVolumeLocalFileSystem.class.php:$test
97 ./elFinderVolumeLocalFileSystem.class.php:$this
1 ./elFinderVolumeLocalFileSystem.class.php:$write
6 ./elFinderVolumeMySQL.class.php:$arc
3 ./elFinderVolumeMySQL.class.php:$bg
10 ./elFinderVolumeMySQL.class.php:$content
1 ./elFinderVolumeMySQL.class.php:$crop
where you can see that the variable $write is used only once, so (maybe) it is useless.
You can also count per variable across the whole project:
$ grep -Proh '\$[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*' . | sort | uniq -c
and will get something like:
13 $tree
1 $treeDeep
3 $trg
3 $trgfp
10 $ts
6 $tstat
35 $type
where you can see that $treeDeep is used only once in the whole project, so it is surely useless.
You can achieve many other combinations with different grep, sort and uniq commands.
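For example (one more combination along the same lines, not from the original answer), the ten most-used variable names in the whole project:
$ grep -Proh '\$[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*' . | sort | uniq -c | sort -rn | head -n 10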
