find files that have a number in the file name greater than N - linux

If I have 10 files called 01-a.txt, 02-a.txt, ..., 10-a.txt, how can I find the files where the number is greater than 5? I would like a general solution, and I will be putting the contents of all the files into one file using something like
cat *.txt > bigfile.txt
I can get files with numbers using
ls *[0-9]*
but can't seem to go beyond that.
Thanks.

You may use seq for that, but it only works if all the files follow the same naming convention:
seq -f "%02g-a.txt" 6 10
06-a.txt
07-a.txt
08-a.txt
09-a.txt
10-a.txt
I.e.:
cat `seq -f "%02g-a.txt" 6 10` > bigfile.txt

Assuming the folder contains only these files, this will list all the files where the number is > 5:
ls [0-9]* | awk -F '-' '{if ($1 > 5) print $0}'
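To then build bigfile.txt as in the question, the same filter can feed cat; a minimal sketch, assuming the file names contain no spaces:
cat $(ls [0-9]* | awk -F '-' '{if ($1 > 5) print $0}') > bigfile.txt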

This will cat all files named "<numeric_value>-<rest>" whose <numeric_value> is greater than $LIM,
whether the number is written with a single digit (like 5), two digits (like 05), or more,
and even if the <rest> part differs between the files.
LIM=5
for file in *-*; do
    number=$(echo "$file" | cut -f1 -d'-')
    [ "$number" -gt "$LIM" ] && cat "$file" >> bigfile.txt
done

ls *.txt | perl -n -e '$f = $_; $f =~ s/\D//g; $f > 5 and print'
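To go from that filter straight to the concatenation the question asks for, the matching names can be piped into xargs; a sketch, again assuming no whitespace in the file names:
ls *.txt | perl -n -e '$f = $_; $f =~ s/\D//g; $f > 5 and print' | xargs cat > bigfile.txt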

Related

Replace filename to a string of the first line in multiple files in bash

I have multiple fasta files, where the first line always contains a > with multiple words, for example:
File_1.fasta:
>KY620313.1 Hepatitis C virus isolate sP171215 polyprotein gene, complete cds
File_2.fasta:
>KY620314.1 Hepatitis C virus isolate sP131957 polyprotein gene, complete cds
File_3.fasta:
>KY620315.1 Hepatitis C virus isolate sP127952 polyprotein gene, complete cds
I would like to take the word starting with sP* from each file and rename each file to this string (for example: File_1.fasta to sP171215.fasta).
So far I have this:
$ for match in "$(grep -ro '>')";do
fname=$("echo $match|awk '{print $6}'")
echo mv "$match" "$fname"
done
But it doesn't work, I always get the error:
grep: warning: recursive search of stdin
I hope you can help me!
You can use something like this:
grep '>' *.fasta | while read -r line ; do
    new_name="$(echo $line | cut -d' ' -f 6)"
    old_name="$(echo $line | cut -d':' -f 1)"
    mv $old_name "$new_name.fasta"
done
It searches the *.fasta files and handles every matched line:
it splits each grep result on spaces and takes the 6th field as the new name,
it splits each grep result on ':' and takes the first field as the old name,
and it moves/renames the old filename to the new filename.
There are several things going on with this code.
For a start, I actually don't get this particular error, which might be down to different versions.
It might come down to grep interpreting '>' the same as >, due to the shell expansion going wrong. I would suggest maybe going for "\>".
Secondly:
fname=$("echo $match|awk '{print $6}'")
The quotes inside serve an unintended purpose. Your code should look like this, if anything:
fname="$(echo $match|awk '{print $6}')"
Lastly, to properly retrieve your data, this should be your final code:
for match in "$(grep -Hr "\>")"; do
fname="$(echo "$match" | cut -d: -f1)"
new_fname="$(echo "$match" | grep -o "sP[^ ]*")".fasta
echo mv "$fname" "$new_fname"
done
Explanations:
grep -H -> you want your grep to explicitly use "Include Filename", just in case other shell environments decide to alias grep to grep -h (no filenames)
you don't want to be doing grep -o on your file search, as you want to have both the filename and the "new filename" in one data entry.
Although, I don't see why you would search for '>' and not directly for 'sP', as such:
for match in "$(grep -Hro "sP[0-9]*")"
This is not the exact same behaviour, and has different edge cases, but it just might work for you.
Quite straightforward in (g)awk:
Create a file "script.awk":
FNR == 1 {
    for (i=1; i<=NF; i++) {
        if (index($i, "sP") == 1) {
            print "mv", FILENAME, $i ".fasta"
            nextfile
        }
    }
}
Use it:
awk -f script.awk *.fasta > cmmd.txt
Check the content of the output:
mv File_1.fasta sP171215.fasta
mv File_2.fasta sP131957.fasta
If it looks OK, execute the renames with: . cmmd.txt
For all fasta files in the directory, search their first line for the first word starting with sP and rename them using that word as the basename.
Using a bash array:
for f in *.fasta; do
    arr=( $(head -1 "$f") )
    for word in "${arr[@]}"; do
        [[ "$word" =~ ^sP ]] && echo mv "$f" "${word}.fasta" && break
    done
done
or using grep:
for f in *.fasta; do
    word=$(head -1 "$f" | grep -o "\bsP\w*")
    [ -z "$word" ] || echo mv "$f" "${word}.fasta"
done
Note: remove echo after you are ok with testing.

Concatenate files together that have similar names

I have a number of files that look something like this:
418_S32_L003_R1_001.fastq.gz
418_S32_L003_R2_001.fastq.gz
418_S1_L002_R1_001.fastq.gz
418_S1_L002_R2_001.fastq.gz
419_S32_L003_R1_001.fastq.gz
419_S32_L003_R2_001.fastq.gz
419_S1_L002_R1_001.fastq.gz
419_S1_L002_R2_001.fastq.gz
The first number is different for each set of four files.
Samples that start with the same number should be combined together if they have the same value for *R1* or *R2*.
So, these two samples should be concatenated:
418_S32_L003_R1_001.fastq.gz
418_S1_L002_R1_001.fastq.gz
And these two should be concatenated:
419_S32_L003_R2_001.fastq.gz
419_S1_L002_R2_001.fastq.gz
And this should be repeated for all files within the directory.
Is there a good way to do this in bash other than manually concatenating like this:
cat 418_S32_L003_R1_001.fastq.gz 418_S1_L002_R1_001.fastq.gz > 418_R1.fastq.gz
You can read each file and append it to the target file, whose name you can derive from the file name.
for file in *.fastq.gz; do
    IFS='_' read -r -a array <<< "$file"
    name="${array[0]}_${array[3]}.fastq.gz"
    cat "$file" >> "$name"
done
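With the sample names from the question, this appends the pieces into four files: 418_R1.fastq.gz, 418_R2.fastq.gz, 419_R1.fastq.gz and 419_R2.fastq.gz (array[0] is the leading number and array[3] is the R1/R2 field).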
a=0
for i in *
do
    for j in *
    do
        if [ "$(echo "$j" | cut -d _ -f 1)" = "$(echo "$i" | cut -d _ -f 1)" ]
        then
            cat "$i" "$j" > "$a"
            a=$((a + 1))
        fi
    done
done
This might work for you (GNU parallel):
parallel --dry-run -N4 --plus cat {1} {4} \> {1%_.*}_R1.{1+..} ::: *R[12]*
This will print out the intended cat commands; check the results and, if they look OK, remove the --dry-run option.

How to get count of file with T date as part of file name

Consider today's date as 24/02/14.
I have a set of files in the directory "/apps_kplus/KplusOpenReport/crs_scripts/rconvysya" with file names as follows:
INTTRADIVBMM20142402
INTTRADIVBFX20142402
INTTRADIVBFI20142402
INTTRADIVBDE20142402
INTPOSIVBIR20142402
INTPOSIVBIR20142302
INTTRADIVBDE20142302
INTTRADIVBFI20142302
INTTRADIVBFX20142302
INTTRADIVBMM20142302
I want to get the count of files with the current date. For that I am using the script below:
#! /bin/bash
tm=$(date +%y%d%m)
x=$(ls /apps_kplus/KplusOpenReport/crs_scripts/rconvysya/INT*$tm.txt 2>/dev/null | wc -l)
if [ $x -eq 5 ]
then
exit 4
else
exit 3
fi
But I am not getting the desired output. What is wrong?
You're searching for files with a .txt extension, and the files you list have none. If you remove the .txt from the pathname, it works.
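For reference, here is the asker's script with just that one change applied:
#! /bin/bash
tm=$(date +%y%d%m)
x=$(ls /apps_kplus/KplusOpenReport/crs_scripts/rconvysya/INT*$tm 2>/dev/null | wc -l)
if [ $x -eq 5 ]
then
    exit 4
else
    exit 3
fi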
With Awk, anything is possible!
tm=$(date +%Y%d%m) # Capital Y is for the four digit year!
ls -a | awk -v tm=$tm '$0 ~ tm' | wc -l
This is the Great and Awful Awk Command. (Awful as in both full of awe and as in Starbucks Coffee).
The -v parameter allows me to set Awk variables. The $0 represents the entire line which is the file name from the ls command. The $0 ~ tm means I'm looking for files that contain the time string you specified.
I could do this too:
ls -a | awk "/$tm\$/"
Which lets the shell interpolate $tm into the Awk program. This is looking only for files that end in your $tm stamp. The /.../ by itself means matching the entire line. It's an even shorter Awk shortcut than I had before. Plus, it makes sure you're only matching the timestamp on the end of the file.
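Tying that back to the exit codes in the original script, a sketch (assuming the file names really do end with the %Y%d%m stamp, as in the listing above):
tm=$(date +%Y%d%m)
count=$(ls /apps_kplus/KplusOpenReport/crs_scripts/rconvysya | awk "/$tm\$/" | wc -l)
if [ "$count" -eq 5 ]
then
    exit 4
else
    exit 3
fi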
You can try the following awk:
awk -v d="$date" '{split(d,ary,/\//);if($0~"20"ary[3]ary[1]ary[2]) cnt++}END{print cnt}' file
where $date is your shell variable containing the date you wish to search for.
$ cat file
INTTRADIVBMM20142402
INTTRADIVBFX20142402
INTTRADIVBFI20142402
INTTRADIVBDE20142402
INTPOSIVBIR20142402
INTPOSIVBIR20142302
INTTRADIVBDE20142302
INTTRADIVBFI20142302
INTTRADIVBFX20142302
INTTRADIVBMM20142302
$ awk -v d="$date" '{split(d,ary,/\//);if($0~"20"ary[3]ary[1]ary[2]) cnt++}END{print cnt}' file
5
ls /apps_kplus/KplusOpenReport/crs_scripts/rconvysya | grep "`date +%Y%m%d`"| wc -l
This worked for me and it seems like the simplest solution; I hope it fits your needs.
To exit with status 4 only when none of the matching files are empty (0k), use the code below:
dir=/apps_kplus/KplusOpenReport/crs_scripts/rconvysya
zero_count=$(ls "$dir" | grep "$(date +%Y%m%d)" | xargs -I{} stat -c%s "$dir/{}" | grep "^0$" | wc -l)
if [ "$zero_count" -eq 0 ]
then
    exit 4
else
    exit 3
fi

How can I count the different file types within a folder using linux terminal?

Hey, I'm stuck on how to count the different file types / extensions recursively in a folder. I also need to print the result to a .txt file.
For example, I have 10 .txt and 20 .docx files mixed up in multiple folders.
Help me !
find ./ -type f | awk -F . '{print $NF}' | sort | awk '{count[$1]++} END {for (j in count) print j, "(" count[j] " occurrences)"}'
This gets all filenames with find, then uses awk to extract the extension, then uses awk again to count the occurrences.
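Since the question also asks for the counts in a .txt file, the same pipeline can simply be redirected (extension_counts.txt is just an example name):
find ./ -type f | awk -F . '{print $NF}' | sort | awk '{count[$1]++} END {for (j in count) print j, "(" count[j] " occurrences)"}' > extension_counts.txt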
Just with bash (version 4 is required for this code):
#!/bin/bash
shopt -s globstar nullglob
declare -A exts
for f in * **/*; do
    [[ -f $f ]] || continue    # only count files
    filename=${f##*/}          # remove directories from pathname
    ext=${filename##*.}
    [[ $filename == $ext ]] && ext="no_extension"
    : ${exts[$ext]=0}          # initialize array element if unset
    (( exts[$ext]++ ))
done
for ext in "${!exts[@]}"; do
    echo "$ext ${exts[$ext]}"
done | sort -k2nr | column -t
This one seems unsolved so far, so here is how far I got counting files and ordering them:
find . -type f | sed -n 's/..*\.//p' | sort -f | uniq -ic

How to split a file and keep the first line in each of the pieces?

Given: One big text-data file (e.g. CSV format) with a 'special' first line (e.g., field names).
Wanted: An equivalent of the coreutils split -l command, but with the additional requirement that the header line from the original file appear at the beginning of each of the resulting pieces.
I am guessing some concoction of split and head will do the trick?
This is robhruska's script cleaned up a bit:
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
    head -n 1 file.txt > tmp_file
    cat "$file" >> tmp_file
    mv -f tmp_file "$file"
done
I removed wc, cut, ls and echo in the places where they're unnecessary. I changed some of the filenames to make them a little more meaningful. I broke it out onto multiple lines only to make it easier to read.
If you want to get fancy, you could use mktemp or tempfile to create a temporary filename instead of using a hard coded one.
Edit
Using GNU split it's possible to do this:
split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }; export -f split_filter; tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_
Broken out for readability:
split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }
export -f split_filter
tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_
When --filter is specified, split runs the command (a function in this case, which must be exported) for each output file and sets the variable FILE, in the command's environment, to the filename.
A filter script or function could do any manipulation it wanted to the output contents or even the filename. An example of the latter might be to output to a fixed filename in a variable directory: > "$FILE/data.dat" for example.
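Another thing the filter can do is compress each piece as it is written; a minimal sketch, assuming GNU split and gzip are available:
tail -n +2 file.txt | split -l 4 --filter='{ head -n 1 file.txt; cat; } | gzip > "$FILE.gz"' - split_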
This one-liner will split the big csv into pieces of 999 records, preserving the header row at the top of each one (so 999 records + 1 header = 1000 rows)
cat bigFile.csv | parallel --header : --pipe -N999 'cat >file_{#}.csv'
Based on Ole Tange's answer.
See comments for some tips on installing parallel
You could use the new --filter functionality in GNU coreutils split >= 8.13 (2011):
tail -n +2 FILE.in | split -l 50 - --filter='sh -c "{ head -n1 FILE.in; cat; } > $FILE"'
You can use [mg]awk:
awk 'NR==1{
header=$0;
count=1;
print header > "x_" count;
next
}
!( (NR-1) % 100){
count++;
print header > "x_" count;
}
{
print $0 > "x_" count
}' file
100 is the number of lines of each slice.
It doesn't require temp files and can be put on a single line.
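For reference, the same program collapsed onto a single line:
awk 'NR==1{header=$0; count=1; print header > "x_" count; next} !((NR-1) % 100){count++; print header > "x_" count} {print $0 > "x_" count}' file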
I'm a novice when it comes to Bash-fu, but I was able to concoct this two-command monstrosity. I'm sure there are more elegant solutions.
$> tail -n +2 file.txt | split -l 4
$> for file in `ls xa*`; do echo "`head -1 file.txt`" > tmp; cat $file >> tmp; mv -f tmp $file; done
This is assuming your input file is file.txt, you're not using the prefix argument to split, and you're working in a directory that doesn't have any other files that start with split's default xa* output format. Also, replace the '4' with your desired split line size.
Use GNU Parallel:
parallel -a bigfile.csv --header : --pipepart 'cat > {#}'
If you need to run a command on each of the parts, then GNU Parallel can help do that, too:
parallel -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
parallel -a bigfile.csv --header : --pipepart --fifo my_program_reading_from_fifo {}
parallel -a bigfile.csv --header : --pipepart --cat my_program_reading_from_a_file {}
If you want to split into 2 parts per CPU core (e.g. 24 cores = 48 equal sized parts):
parallel --block -2 -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
If you want to split into 10 MB blocks:
parallel --block 10M -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
Below is a 4-liner that can be used to split a bigfile.csv into multiple smaller files while preserving the csv header. It uses only standard command-line utilities (head, split, find, grep, xargs, and sed), so it should work on most *nix systems. It should also work on Windows if you install mingw-64 / git-bash.
csvheader=`head -1 bigfile.csv`
split -d -l10000 bigfile.csv smallfile_
find .|grep smallfile_ | xargs sed -i "1s/^/$csvheader\n/"
sed -i '1d' smallfile_00
Line by line explanation:
Capture the header to a variable named csvheader
Split the bigfile.csv into a number of smaller files with prefix smallfile_
Find all smallfiles and insert the csvheader into the FIRST line using xargs and sed -i. Note that you need to use sed within "double quotes" in order to use variables.
The first file named smallfile_00 will now have redundant headers on lines 1 and 2 (from the original data as well as from the sed header insert in step 3). We can remove the redundant header with sed -i '1d' command.
This is a more robust version of Denis Williamson's script. The script creates a lot of temporary files, and it would be a shame if they were left lying around if the run was incomplete. So, let's add signal trapping (see http://tldp.org/LDP/Bash-Beginners-Guide/html/sect_12_02.html and then http://tldp.org/LDP/abs/html/debugging.html) and remove our temporary files; this is a best practice anyways.
trap 'rm split_* tmp_file ; exit 13' SIGINT SIGTERM SIGQUIT
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
    head -n 1 file.txt > tmp_file
    cat $file >> tmp_file
    mv -f tmp_file $file
done
Replace '13' with whatever return code you want. Oh, and you should probably be using mktemp anyway (as some have already suggested), so go ahead and remove 'tmp_file' from the rm in the trap line. See the signal man page for more signals to catch.
I liked the awk version from marco, and adapted it into a simplified one-liner where you can easily specify the split fraction as granular as you want:
awk 'NR==1{print $0 > FILENAME ".split1"; print $0 > FILENAME ".split2";} NR>1{if (NR % 10 > 5) print $0 >> FILENAME ".split1"; else print $0 >> FILENAME ".split2"}' file
I really liked Rob and Dennis' versions, so much so that I wanted to improve them.
Here's my version:
in_file=$1
awk '{if (NR!=1) {print}}' $in_file | split -d -a 5 -l 100000 - $in_file"_" # Get all lines except the first, split into 100,000 line chunks
for file in $in_file"_"*
do
tmp_file=$(mktemp $in_file.XXXXXX) # Create a safer temp file
head -n 1 $in_file | cat - $file > $tmp_file # Get header from main file, cat that header with split file contents to temp file
mv -f $tmp_file $file # Overwrite non-header containing file with header-containing file
done
Differences:
in_file is the file argument you want to split maintaining headers
Use awk instead of tail due to awk having better performance
split into 100,000 line files instead of 4
Split file name will be input file name appended with an underscore and numbers (up to 99999 - from the "-d -a 5" split argument)
Use mktemp to safely handle temporary files
Use single head | cat line instead of two lines
Inspired by @Arkady's comment on a one-liner.
MYFILE variable simply to reduce boilerplate
split doesn't show file name, but the --additional-suffix option allows us to easily control what to expect
removal of intermediate files via rm $part (assumes no files with same suffix)
MYFILE=mycsv.csv && for part in $(split -n4 --additional-suffix=foo $MYFILE; ls *foo); do cat <(head -n1 $MYFILE) $part > $MYFILE.$part; rm $part; done
Evidence:
-rw-rw-r-- 1 ec2-user ec2-user 32040108 Jun 1 23:18 mycsv.csv.xaafoo
-rw-rw-r-- 1 ec2-user ec2-user 32040108 Jun 1 23:18 mycsv.csv.xabfoo
-rw-rw-r-- 1 ec2-user ec2-user 32040108 Jun 1 23:18 mycsv.csv.xacfoo
-rw-rw-r-- 1 ec2-user ec2-user 32040110 Jun 1 23:18 mycsv.csv.xadfoo
and of course head -2 *foo to see the header is added.
A simple but maybe not as elegant way: Cut off the header beforehand, split the file, and then rejoin the header on each file with cat, or with whatever file is reading it in.
So something like:
head -n1 file.txt > header.txt
tail -n +2 file.txt | split -l 4 -    # pieces land in xaa, xab, ... by default
cat header.txt xaa
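To rejoin the header onto every piece rather than just the first, a small loop over split's default xa* output names could look like this (a sketch; it assumes no other xa* files exist in the directory):
for piece in xa*; do
    cat header.txt "$piece" > "${piece}.txt" && rm "$piece"
done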
I had a better result using the following code; every split file will have a header and the generated files will have normalized names.
export F=input.csv && LINES=3 &&\
export PF="${F%.*}_" &&\
split -l $LINES "${F}" "${PF}" &&\
for fn in $PF*
do
    mv "${fn}" "${fn}.csv"
done &&\
export FILES=($PF*) && for file in "${FILES[@]:1}"
do
    head -n 1 "${F}" > tmp_file
    cat "$file" >> tmp_file
    mv -f tmp_file "${file}"
done
output
$ wc -l input*
22 input.csv
3 input_aa.csv
4 input_ab.csv
4 input_ac.csv
4 input_ad.csv
4 input_ae.csv
4 input_af.csv
4 input_ag.csv
2 input_ah.csv
51 total
