Sorting file numerically and incrementally - linux

So, I have 1000 files in a folderA.
Let's say:
File_0001, File_0002, File_0003, File_0004, File_0005, . . . , File_1000
Question: how can I pick every second file (by number) and copy those files into another folder (folderB), so that the files in folderB look like this:
File_0002, File_0004, File_0006, File_0008, File_0010, . . . , File_1000
Any suggestions will be really appreciated.
Thank you

You can also use simple cp command:
cp File_*[02468] folderB

ls | sort | xargs -n2 echo | awk '{print $2}' | xargs -I '{}' echo mv '{}' /folderB
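The trick is to use | xargs -n2 echo | awk '{print $2}' to keep only the even-numbered lines. As written, the final xargs only prints the mv commands; remove the echo to actually run them.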

Depending on what's actually wanted, I'd say #demostene's answer is probably right. If OP actually wants alternate files from the list, regardless of possibly skipped numbers, then
cp $(ls | awk 'NR%2 == 0 {print $0}') folderB
would seem to do the trick. Note the obvious extensions for every third, fourth, or Nth file.
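The "every Nth file" extension mentioned above can be sketched without parsing ls, by letting the shell glob produce the sorted list. This is only a sketch: the function name copy_every_nth and its argument order are made up for illustration, and it assumes the glob order (sorted by name) is the order you want.

```shell
# copy_every_nth SRC DEST N   (hypothetical helper, not from the answers above)
# Copies every Nth regular file from SRC, in glob (name-sorted) order, into DEST.
copy_every_nth() {
    src=$1 dest=$2 n=$3
    i=0
    mkdir -p "$dest" || return 1
    for f in "$src"/*; do
        [ -f "$f" ] || continue      # skip directories and an unmatched glob
        i=$((i + 1))
        if [ $((i % n)) -eq 0 ]; then
            cp -- "$f" "$dest"/
        fi
    done
}
```

For the question as asked, copy_every_nth folderA folderB 2 would copy the 2nd, 4th, 6th, ... file in the listing, regardless of gaps in the numbering.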

Related

Total number of lines in a directory

I have a directory with thousands of files (100K for now). When I use wc -l ./*, I'll get:
c1 ./test1.txt
c2 ./test2.txt
...
cn ./testn.txt
c1+c2+...+cn total
Because there are a lot of files in the directory, I just want to see the total count and not the details. Is there any way to do so?
I tried several ways and I got following error:
Argument list too long
If what you want is the total number of lines and nothing else, then I would suggest the following command:
cat * | wc -l
This catenates the contents of all of the files in the current working directory and pipes the resulting blob of text through wc -l.
I find this to be quite elegant. Note that the command produces no extraneous output.
UPDATE:
I didn't realize your directory contained so many files. In light of this information, you should try this command:
for file in *; do cat "$file"; done | wc -l
Most people don't know that you can pipe the output of a for loop directly into another command.
Beware that this could be very slow. If you have 100,000 or so files, my guess would be around 10 minutes. This is a wild guess because it depends on several parameters that I'm not able to check.
If you need something faster, you should write your own utility in C. You could make it surprisingly fast if you use pthreads.
Hope that helps.
LAST NOTE:
If you're interested in building a custom utility, I could help you code one up. It would be a good exercise, and others might find it useful.
Credit: this builds on #lifecrisis's answer, and extends it to handle large numbers of files:
find . -maxdepth 1 -type f -exec cat {} + | wc -l
find will find all of the files in the current directory, break them into groups as large as can be passed as arguments, and run cat on the groups.
awk 'END {print NR" total"}' ./*
Would be an interesting comparison to find out how many lines don't end with a new line.
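The difference hinted at here comes from wc -l counting newline characters while awk counts records, so a final line without a trailing newline is seen by awk but not by wc. A small sketch of that difference (the two helper names are invented for the example):

```shell
# count_newlines: what `wc -l` reports, i.e. the number of newline characters.
count_newlines() { wc -l | tr -d ' '; }   # tr removes padding some wc versions print

# count_records: what awk reports, which includes an unterminated last line.
count_records() { awk 'END {print NR}'; }

printf 'one\ntwo' | count_newlines    # input whose last line has no newline
printf 'one\ntwo' | count_records
```

On this input the first command prints 1 and the second prints 2.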
Combining the awk and Gordon’s find solutions and avoiding the "." files.
find ./* -maxdepth 0 -type f -exec awk 'END {print NR}' {} +
No idea if this is better or worse but it does give a more accurate count (for me) and does not count lines in "." files. Using ./* is just a guess that appears to work.
Still need depth and ./* requires "0" depth.
I did get the same result with the "cat" and "awk" solutions (using the same find) since the "cat *" takes care of the new line issue. I don't have a directory with enough files to measure time. Interesting, I'm liking the "cat" solution.
This will give you the total count for all the files (including hidden files) in your current directory :
$ find . -maxdepth 1 -type f | xargs wc -l | grep total
1052 total
To count for files excluding hidden files use :
find . -maxdepth 1 -type f -not -path "*/\.*" | xargs wc -l | grep total
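One caveat with grep total: with very many files, xargs may run wc several times, and each run prints its own "total" line. A possible workaround, sketched here under the assumption that summing the per-file rows is acceptable, is to ignore the total rows and add up the rest. The function name total_lines is illustrative only.

```shell
# total_lines DIR   (hypothetical helper)
# Sums the per-file counts printed by `wc -l` and skips the per-batch "total" rows.
# Running find from inside DIR makes every path start with "./", so a real file
# is never reported as the bare word "total".
total_lines() {
    (cd "$1" && find . -maxdepth 1 -type f -exec wc -l {} +) |
        awk '$2 != "total" {sum += $1} END {print sum + 0}'
}
```

File names containing newlines would still confuse this, since wc prints one row per line of output.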
(Apologies for adding this as an answer—but I do not have enough reputation for commenting.)
A comment on #lifecrisis's answer. Perhaps cat is slowing things down a bit. We could replace cat by wc -l and then use awk to add the numbers. (This could be faster since much less data needs to go through the pipe.)
That is
for file in *; do wc -l "$file"; done | awk '{sum += $1} END {print sum}'
instead of
for file in *; do cat "$file"; done | wc -l
(Disclaimer: I am not incorporating many of the improvements in other answers, but I thought the point was valid enough to write down.)
Here are my results for comparison (I ran the newer version first so that any cache effects would go against the newer candidate).
$ time for f in `seq 1 1500`; do head -c 5M </dev/urandom >myfile-$f |sed -e 's/\(................\)/\1\n/g'; done
real 0m50.360s
user 0m4.040s
sys 0m49.489s
$ time for file in myfile-*; do wc -l "$file"; done | awk '{sum += $1} END {print sum}'
30714902
real 0m3.455s
user 0m2.093s
sys 0m1.515s
$ time for file in myfile-*; do cat "$file"; done | wc -l
30714902
real 0m4.481s
user 0m2.544s
sys 0m4.312s
If you only want the number of entries in the directory, excluding the "total" line:
ls -ltr | sed -n '/total/!p' | awk 'END {print NR}'
The previous answers give the total line count across all files.
The commands below also give the total count of lines from all files in the path:
for i in `ls -ltr | awk '$1~"^-rw"{print $9}'`; do wc -l $i | awk '{print $1}'; done >>/var/tmp/filelinescount.txt
cat /var/tmp/filelinescount.txt | sed -r "s/\s+//g" | tr "\n" "+" | sed "s:+$::g" | sed 's/^/"/g' | sed 's/$/"/g' | awk '{print "echo" " " $0 "|bc"}' | sh

Replacing unknown amount of blank spaces for X amount

Hey, so I'm writing a Linux script and I came across an interesting finding.
I've got a command that sorts the files inside a directory by size and prints the largest one. The command is as follows:
find . -type f -ls | sort -r -n -k7 | head -n 1
This will print something like
895918591 8 -r-w-x 1 user01 xdf 1931 28 march 23:21 ./myscript.sh
So I want to get the largest file size alone and print it. To separate it I used cut -d' ' -f2. The issue is that this leaves only empty output, because the number of spaces between columns is inconsistent.
So I tried doing something like this
find . -type f -ls | sort -r -n -k7 | head -n 1 | tr -d [:blank:] | cut -d' ' -f2
The issue is that this removes all the blank spaces, so now I can't separate the columns by a common separator. So I'm asking: is there a way to replace each run of blank spaces with a single blank space?
If not, at least any other way to get to that number of bytes?
Sed and Awk are great tools for this kind of thing. Sed is a regex-based language that modifies the contents of each line the Sed program receives, and Awk is also a line-oriented tool that automatically splits its input into fields.
To turn sequences of blanks into one blank (substitute all matches of /\s+/ with a single space) in Sed, using -E so that + is treated as a quantifier:
$ find ... | sed -E 's/\s+/ /g'
To just print the first "word" (sequence of nonspaces) of each line in Awk:
$ find ... | awk '{print $1}'
http://tldp.org/LDP/abs/html/sedawk.html can get you started with these languages.
Instead of cut you can use awk:
find . -type f -ls | sort -r -n -k7 | head -n 1 | awk '{print $2}'
However you can even avoid head as well using awk:
find . -type f -ls | sort -r -n -k7 | awk '{print $2; exit}'
The tool to convert multiple spaces to just one is called tr -s:
tr translates
s squeezes
Sample:
$ cat a
hello this   is a  sample     text with   multiple    spaces
$ tr -s " " < a
hello this is a sample text with multiple spaces
If you then want to convert every space into X, just pipe to sed 's/ /X/g'.
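Putting tr -s together with cut might look like the sketch below. The helper name nth_field is made up, and the extra sed step is an assumption on my part: without it, a line that starts with spaces would have an empty first field after squeezing.

```shell
# nth_field N   (hypothetical helper)
# Squeezes runs of spaces into one, drops a single leading space,
# then prints the Nth space-separated field of each input line.
nth_field() {
    tr -s ' ' | sed 's/^ //' | cut -d' ' -f"$1"
}

# Sample line in the shape of the find -ls output from the question.
printf '  895918591    8 -r-w-x   1 user01 xdf     1931 28 march 23:21 ./myscript.sh\n' |
    nth_field 7
```

For the sample line this prints 1931, the size column.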
I think you're overthinking the issue at hand:
find -type f -printf "%s\n"|sort -n|tail -n1
Instead of using cut, you can try find's -printf action, which gives you control over what is printed:
find . -type f -printf '%s\n' | sort -r -n | head -n 1
You're doing it wrong.
Parsing ls in any form (like find's -ls option) is a bad approach.
Do not use ls output for anything. ls is a tool for interactively looking at directory metadata. Any attempts at parsing ls output with code are broken.
I strongly suggest that you read further about this subject. Read Parsing ls.
Instead, use the following function:
# Usage: largest [dir]
largest() {
    local f size largest
    # IFS= keeps leading and trailing whitespace in file names intact
    while IFS= read -rd '' f; do
        size=$(wc -c < "$f")
        if (( size > largest[0] )); then
            largest=("$size" "$f")
        fi
    done < <(find "${1-.}" -type f -print0)
    printf '%s is the largest file in %s\n' "${largest[1]}" "${1-.}"
}

viewing file's content for each file-name appearing in a list

I'm creating a list of file-names using the command:
ls | grep "\.txt$"
I'm getting a list of files:
F1.txt
F2.txt
F3.txt
F4.txt
I want to view the content of these files (using less / more / cat / ...).
Is there a way to do this by piping?
(Btw, I got a list of file-names using a more complex command, this is just a simpler example for clarification)
Would this be enough?
$ cat *txt
For richer queries, you could use find and xargs:
$ find . -name "*txt" | xargs cat
you can try something like this:
#!/bin/bash
for i in *.txt
do
    echo "Displaying file $i ..."
    more "$i"
done
What about:
cat $(ls | grep "\.txt$")
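Since the question asks specifically about piping, here is one way it could be done when the list comes from an arbitrary command: read the names line by line from standard input. This is a sketch; the function name show_files and the header format are my own choices, and it assumes one file name per line (so names containing newlines are not supported).

```shell
# show_files   (hypothetical helper)
# Reads one file name per line on stdin and prints each file's
# contents under a small header line.
show_files() {
    while IFS= read -r name; do
        printf '==> %s <==\n' "$name"
        cat -- "$name"
    done
}
```

It would be used as ls | grep "\.txt$" | show_files | less, and unlike the unquoted $(...) form it copes with spaces in file names.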

how to compare output of two ls in linux

So here is the task which I can't solve. I have a directory with .h files and a directory with .i files, which have the same names as the .h files. I want a single command that lists all .h files that have no matching .i file. It's not a hard problem, and I could do it in some programming language, but I'm just curious what it would look like on the command line :). To be more specific, here is the algorithm:
get file names without extensions from ls *.h
get file names without extensions from ls *.i
compare them
print all names from 1 that are not met in 2
Good luck!
diff \
<(ls dir.with.h | sed 's/\.h$//') \
<(ls dir.with.i | sed 's/\.i$//') \
| grep '^<' \
| cut -c3-
diff <(ls dir.with.h | sed 's/\.h$//') <(ls dir.with.i | sed 's/\.i$//') executes ls on the two directories, cuts off the extensions, and compares the two lists. Then grep '^<' keeps the lines that appear only in the first listing, and cut -c3- cuts off the "< " characters that diff inserted.
ls ./dir_h/*.h | sed -r -n 's:.*dir_h/([^.]*).h$:dir_i/\1.i:p' | xargs ls 2>&1 | \
grep "No such file or directory" | awk '{print $4}' | sed -n -r 's:dir_i/([^:]*).*:dir_h/\1:p'
ls -1 dir1/*.hh dir2/*.ii | awk -F"/" '{print $NF}' |awk -F"." '{a[$1]++;b[$0]}END{for(i in a)if(a[i]==1 && b[i".hh"]) print i}'
explanation:
ls -1 dir1/*.hh dir2/*.ii
above will list all the files *.hh and *.ii files in both the directories.
awk -F"/" '{print $NF}'
above will just print the file name excluding the complete path of the file.
awk -F"." '{a[$1]++;b[$0]}END{for(i in a)if(a[i]==1 && b[i".hh"]) print i}'
above will create two associative arrays, one keyed by the full file name and one keyed by the name without the extension.
If both the hh and ii files exist, the value in the associative array will be 2; if there is only one file, the value will be 1. So we need the array items whose value is 1 and which correspond to a header file (.hh).
This is checked using the associative array b, which is done in the END block.
Assuming bash is your shell:
for file in dir_with_h/*.h; do
    name=${file%.h}             # trim trailing ".h" file extension
    name=${name#dir_with_h/}    # trim leading folder name
    if [ ! -e "dir_with_i/${name}.i" ]; then
        echo "${name}"
    fi
done
Undoubtedly this can be ported to virtually all other shells. I find this less cryptic than some other approaches (although this is surely my problem), but it is a little wordy. As such, a shell script might help recall it.
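Another option that none of the answers above use is comm, which compares two sorted lists directly. The sketch below assumes two flat directories and invents the function names names_without_ext and h_without_i; it uses temporary files rather than process substitution so that it also runs in a plain POSIX shell.

```shell
# names_without_ext DIR EXT   (hypothetical helper)
# Prints the base names of DIR/*.EXT with the extension removed, sorted.
names_without_ext() {
    (cd "$1" || exit 1
     for f in *."$2"; do
         [ -e "$f" ] && printf '%s\n' "${f%."$2"}"
     done) | LC_ALL=C sort
}

# h_without_i DIR_H DIR_I   (hypothetical helper)
# Prints names that have a .h file in DIR_H but no .i file in DIR_I.
h_without_i() {
    list_h=$(mktemp) && list_i=$(mktemp) || return 1
    names_without_ext "$1" h > "$list_h"
    names_without_ext "$2" i > "$list_i"
    # -23 suppresses lines unique to the second list and lines common to both
    LC_ALL=C comm -23 "$list_h" "$list_i"
    rm -f "$list_h" "$list_i"
}
```

comm requires both inputs to be sorted in the same collation order, which is why the sketch pins LC_ALL=C for both sort and comm.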

Problems with Grep Command in bash script

I'm having some rather unusual problems using grep in a bash script. Below is an example of the bash script code that I'm using that exhibits the behaviour:
UNIQ_SCAN_INIT_POINT=1
cat "$FILE_BASENAME_LIST" | uniq -d >> $UNIQ_LIST
sed '/^$/d' $UNIQ_LIST >> $UNIQ_LIST_FINAL
UNIQ_LINE_COUNT=`wc -l $UNIQ_LIST_FINAL | cut -d \ -f 1`
while [ -n "`cat $UNIQ_LIST_FINAL | sed "$UNIQ_SCAN_INIT_POINT"'q;d'`" ]; do
CURRENT_LINE=`cat $UNIQ_LIST_FINAL | sed "$UNIQ_SCAN_INIT_POINT"'q;d'`
CURRENT_DUPECHK_FILE=$FILE_DUPEMATCH-$CURRENT_LINE
grep $CURRENT_LINE $FILE_LOCTN_LIST >> $CURRENT_DUPECHK_FILE
MATCH=`grep -c $CURRENT_LINE $FILE_BASENAME_LIST`
CMD_ECHO="$CURRENT_LINE matched $MATCH times," cmd_line_echo
echo "$CURRENT_DUPECHK_FILE" >> $FILE_DUPEMATCH_FILELIST
let UNIQ_SCAN_INIT_POINT=UNIQ_SCAN_INIT_POINT+1
done
On numerous occasions, when grepping for the current line in the file location list, it has put no output to the current dupechk file even though there have definitely been matches to the current line in the file location list (I ran the command in terminal with no issues).
I've rummaged around the internet to see if anyone else has had similar behaviour, and thus far all I have found is that it is something to do with buffered and unbuffered outputs from other commands operating before the grep command in the Bash script....
However no one seems to have found a solution, so basically I'm asking you guys if you have ever come across this, and any idea/tips/solutions to this problem...
Regards
Paul
The `problem' is the standard I/O library. When it is writing to a terminal
it is unbuffered, but if it is writing to a pipe then it sets up buffering.
Try changing
CURRENT_LINE=`cat $UNIQ_LIST_FINAL | sed "$UNIQ_SCAN_INIT_POINT"'q;d'`
to
CURRENT_LINE=`sed "$UNIQ_SCAN_INIT_POINT"'q;d' $UNIQ_LIST_FINAL`
Are there any directories with spaces in their names in $FILE_LOCTN_LIST? Because if there are, those spaces will need to be escaped somehow. Some combination of find and xargs can usually deal with that for you, especially xargs -0.
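Whether or not spaces are the actual cause here, the grep calls in the question pass $CURRENT_LINE unquoted and as a regular expression, so a line containing spaces is split into several arguments and characters such as . match more than intended. A minimal sketch of a safer call, with an invented helper name count_fixed:

```shell
# count_fixed PATTERN FILE   (hypothetical helper)
# Counts the lines of FILE that contain PATTERN as a literal string.
# Quoting keeps a pattern with spaces as one argument, -F disables regex
# interpretation, and -- protects patterns that begin with a dash.
count_fixed() {
    grep -c -F -- "$1" "$2"
}
```

In the script from the question this would correspond to writing grep -F -- "$CURRENT_LINE" "$FILE_LOCTN_LIST" instead of the unquoted form.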
A small bash script using md5sum and sort that detects duplicate files in the current directory:
CURRENT=""
md5sum * |
sort |
while read -r md5sum filename; do
    # sorted input puts identical checksums on adjacent lines
    [[ $CURRENT == "$md5sum" ]] && echo "$filename is duplicate"
    CURRENT=$md5sum
done
You tagged linux, so I assume you have tools like GNU find, md5sum, uniq, sort etc. Here's a simple example to find duplicate files:
$ echo "hello world">file
$ md5sum file
6f5902ac237024bdd0c176cb93063dc4 file
$ cp file file1
$ md5sum file1
6f5902ac237024bdd0c176cb93063dc4 file1
$ echo "blah" > file2
$ md5sum file2
0d599f0ec05c3bda8c3b8a68c32a1b47 file2
$ find . -type f -exec md5sum "{}" \; |sort -n | uniq -w32 -D
6f5902ac237024bdd0c176cb93063dc4 ./file
6f5902ac237024bdd0c176cb93063dc4 ./file1
