The Situation:
I have hundreds of zip files with an arbitrary date/time embedded in their names (e.g. 4-6-2021 12-34-09 AM.zip). I need to get all of these files in date order and renamed sequentially (0.zip, 1.zip, 2.zip, etc.) on a Linux CLI system.
What I've tried:
I've tried ls -tr | while read i; do n=$((n+1)); mv -- "$i" "$(printf '%03d' "$n").zip"; done, which almost does what I want but still comes out in the wrong order (I think it's using the files' modification times rather than the date in the filename, which is what I need).
If I can get this done, my next step would be to rename the file (yes, a single file) inside each zip to the name of the zip file. I'm not sure how I'd go about this either.
tl;dr
I have these files named with a weird date format. I need them ordered by that date and renamed sequentially (0.zip, 1.zip, 2.zip, etc.). It's 3:00 AM and I don't know why I'm still up trying to solve this, and I have no idea how I'll rename the files inside the zips to that sequential number (read above for more detail on this).
Thanks in advance!
GNU awk is an option here, feeding the file listing into awk via process substitution:
awk '{
fil=$0; # Set a variable fil to the line
gsub("-"," ",$1); # Replace "-" for " " in the first space delimited field
split($1,map," "); # Split the first field into the array map, using " " as the delimiter
if (length(map[1])==1) {
map[1]="0"map[1] # If the length of the day is 1, pad out with "0"
};
if (length(map[2])==1) {
map[2]="0"map[2] # Do the same for month
}
$1=map[3]" "map[2]" "map[1]; # Rebuild the first field as "YYYY MM DD", the order mktime expects
gsub("-"," ",$2); # Change "-" for " " in time
map1[mktime($1" "$2)]=fil # Get epoch format of date/time using mktime function and use this as an index for array map1 with the original line (fil) as the value
}
END {
PROCINFO["sorted_in"]="#ind_num_asc"; # At the end of processing, set the array sorting to index number ascending
cnt=0; # Initialise a cnt variable
for (i in map1) {
print "mv \""map1[i]"\" \""cnt".zip\""; # Loop through map1 array printing values and using these values along with cnt to generate and print move command
cnt++
}
}' <(for fil in *AM.zip;do echo $fil;done)
Once you are happy with the way the mv commands are printed, pipe the result into bash like so (note the glob only picks up the *AM.zip files; widen it and adjust the hours if you also have PM archives):
awk '{ fil=$0;gsub("-"," ",$1);split($1,map," ");if (length(map[1])==1) { map[1]="0"map[1] };if (length(map[2])==1) { map[2]="0"map[2] };$1=map[3]" "map[2]" "map[1];gsub("-"," ",$2);map1[mktime($1" "$2)]=fil} END { PROCINFO["sorted_in"]="#ind_num_asc";cnt=0;for (i in map1) { print "mv \""map1[i]"\" \""cnt".zip\"";cnt++ } }' <(for fil in *AM.zip;do echo $fil;done) | bash
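For the second part of the question (renaming the single file inside each renamed zip), Info-ZIP's zipnote can rename archive members in place. This is an untested sketch, assuming zipinfo and zipnote are installed and each archive really does contain exactly one file; check zipnote's own help output for the exact directive format it expects:

```shell
# Rename the single member of each zip to the zip's basename.
# zipnote -w reads "@ oldname" / "@=newname" directive pairs on stdin.
for f in *.zip; do
  [ -e "$f" ] || continue                 # skip if the glob matched nothing
  inner=$(zipinfo -1 "$f")                # name of the single file inside
  printf '@ %s\n@=%s\n' "$inner" "${f%.zip}" | zipnote -w "$f"
done
```

Whether to keep the inner file's original extension on the new name is up to you; the sketch above drops everything but the zip's basename.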
Related
I have multiple paired files with headings xxx_1.txt and xxx_2.txt, yyy_1.txt and yyy_2.txt, etc. They are single column files with the following format:
xxx_1.txt:
#CHROM_POSREFALT
MSHR1153_annotated_1_9107CA
MSHR1153_annotated_1_9197CT
MSHR1153_annotated_1_9303TC
MSHR1153_annotated_1_10635GA
MSHR1153_annotated_1_10836AG
MSHR1153_annotated_1_11108AG
MSHR1153_annotated_1_11121GA
MSHR1153_annotated_1_11123CT
MSHR1153_annotated_1_11131CT
MSHR1153_annotated_1_11155AG
MSHR1153_annotated_1_11166CT
MSHR1153_annotated_1_11186TC
MSHR1153_annotated_1_11233TG
MSHR1153_annotated_1_11274GT
MSHR1153_annotated_1_11472CG
MSHR1153_annotated_1_11814GA
MSHR1153_annotated_1_11815CT
xxx_2.txt:
LocationMSHR1153_annotatedMSHR0491_Australasia
MSHR1153_annotated_1_56TC
MSHR1153_annotated_1_226AG
MSHR1153_annotated_1_670AG
MSHR1153_annotated_1_817CT
MSHR1153_annotated_1_1147TC
MSHR1153_annotated_1_1660TC
MSHR1153_annotated_1_2488AG
MSHR1153_annotated_1_2571GA
MSHR1153_annotated_1_2572TC
MSHR1153_annotated_1_2698TC
MSHR1153_annotated_1_2718TG
MSHR1153_annotated_1_3018TC
MSHR1153_annotated_1_3424TC
MSHR1153_annotated_1_3912CT
MSHR1153_annotated_1_4013GA
MSHR1153_annotated_1_4087GC
MSHR1153_annotated_1_4878CT
MSHR1153_annotated_1_5896GA
MSHR1153_annotated_1_7833TG
MSHR1153_annotated_1_7941CT
MSHR1153_annotated_1_8033GA
MSHR1153_annotated_1_8888AC
MSHR1153_annotated_1_9107CA
MSHR1153_annotated_1_9197CT
They are actually much longer than this. My goal is to compare each line and produce multiple outputs for the purpose of creating a Venn diagram later on. So I need one file which lists all the lines in common, which looks like this (in this case there is only one):
MSHR1153_annotated_1_9107CA
One file that lists everything specific to xxx_1 and one file which lists everything specific to xxx_2.
I have so far come up with this:
awk ' FNR==NR { position[$1]=$1; next} {if ( $1 in position ) {print $1 > "foundinboth"} else {print $1 > "uniquetofile1"}} ' FILE2 FILE1
The problem is I now have over 300 paired files to run through, and with this I have to change the file names manually each time. It also doesn't produce all the output files at the same time. Is there a way to loop through and handle everything automatically? The files are paired so that the suffix at the end is different ("_1" and "_2"). I need it to loop through each paired file and produce everything I need in one go.
Would you please try the following:
for f in *_1.txt; do # find files such as "xxx_1.txt"
basename=${f%_*} # extract "xxx" portion
if [[ -f ${basename}_2.txt ]]; then # make sure "xxx_2.txt" exists
file1="${basename}_1.txt" # assign bash variable file1
file2="${basename}_2.txt" # assign bash variable file2
both="${basename}_foundinboth.txt"
uniq1="${basename}_uniquetofile1.txt"
uniq2="${basename}_uniquetofile2.txt"
awk -v both="$both" -v uniq1="$uniq1" -v uniq2="$uniq2" '
# pass the variables to AWK with -v option
FNR==NR { b[$1]=$1; next }
{
if ($1 in b) {
print $1 > both
seen[$1]++ # mark if the line is found in file1
} else {
print $1 > uniq1
}
}
END {
for (i in b) {
if (! seen[i]) { # the line is not found in file1
print i > uniq2 # then it is unique to file2
}
}
}' "$file2" "$file1"
fi
done
Please note that the lines in *_uniquetofile2.txt do not keep the original order.
If you need them to, please try to sort them for yourself or let me know.
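To see the core awk logic in isolation, here is a toy run with a hypothetical pair a_1.txt / a_2.txt (note the file order: file2 is read first to build the lookup array):

```shell
printf '%s\n' s1 s2 s3 > a_1.txt
printf '%s\n' s2 s4 > a_2.txt
awk -v both=both.txt -v uniq1=u1.txt -v uniq2=u2.txt '
  FNR==NR { b[$1]=$1; next }                            # hash file2 lines
  { if ($1 in b) { print $1 > both; seen[$1]++ }        # in both files
    else         { print $1 > uniq1 } }                 # only in file1
  END { for (i in b) if (!seen[i]) print i > uniq2 }    # only in file2
' a_2.txt a_1.txt
# both.txt -> s2 ; u1.txt -> s1, s3 ; u2.txt -> s4
```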
I have a file named "compare" and a file named "final_contigs_c10K.fa"
I want to eliminate lines AND THE NEXT LINE from "final_contigs_c10K.fa" containing specific strings listed in "compare".
compare looks like this :
k119_1
k119_3
...
and the number of lines of compare is 26364.
final_contigs_c10K.fa looks like :
>k119_1
AAAACCCCC
>k119_2
CCCCC
>k119_3
AAAAAAAA
...
I want to turn final_contigs_c10K.fa into this format:
>k119_1
AAAACCCCC
>k119_3
AAAAAAAA
...
I tried this code, and it seems to work fine, but it takes too much time. I think that's because compare has 26364 lines, far more than in the other files I had tested the code on.
while read line; do sed -i -e "/$line/ { N; d; }" final_contigs_c10K.fa; done < compare
Is there a way to make this command faster?
Using awk
$ awk 'NR==FNR{a[">" $1];next}$1 in a{p=3} --p>0' compare final_contigs_c10K.fa
>k119_1
AAAACCCCC
>k119_3
AAAAAAAA
This will produce the output on stdout, i.e. it won't make any changes to the original files.
Explained:
$ awk '
NR==FNR { # process the first file
a[">" $1] # hash to a, adding > while at it
next # process the next record
} # process the second file after this point
$1 in a { p=3 } # if current record was in compare file set p
--p>0 # true for the matching record and the one after it, so both get printed
' compare final_contigs_c10K.fa # mind the file order
I've looked everywhere and I'm out of luck.
I am trying to count the files in my current directory and all sub directories so that when I run the shell script count_files.sh it will produce a similar output to:
$
2 sh
4 html
1 css
2 noexts
(EDIT: each count and extension in the output should be on its own line, as above)
$
where noexts are either files without any period as an extension (ex: fileName ) or files with a period but no extension (ex: fileName. ).
this pipeline:
find * | awk -F . '{print $NF}'
gives me a comprehensive list of all the files, and I've figured out how to remove files without any period (ex: fileName ) using sed '/\//d'
MY ISSUE is that I cannot remove, from the output of the above pipeline, the names that have a period but nothing after it (ex: fileName. ), since the last field after the delimiter '.' is empty.
How can I use sed like above to remove a null character from a pipe input?
I understand this could be a quick fix, but I've been googling like a madman with no luck. Thanks in advance.
Chip
To filter out filenames that end with ., since filenames are the whole input line in find's output, you could use
sed '/\.$/d'
Where \. matches a literal dot and $ matches the end of the line.
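For example, with a few sample names piped in:

```shell
printf '%s\n' 'notes.txt' 'fileName.' 'README' | sed '/\.$/d'
# prints:
# notes.txt
# README
```

Only the name ending in a literal dot is dropped; a name with no dot at all passes through untouched.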
However, I think I'd do the whole thing in awk, since sorting does not appear to be necessary:
EDIT: Found a nicer way to do it with awk and find's -printf action.
find . -type f -printf '%f\n' | awk -F. '!/\./ || $NF == "" { ++count["noext"]; next } { ++count[$NF] } END { for(k in count) { print count[k] " " k } }'
Here we pass -printf '%f\n' to find to make it print only the file name without the preceding directory, which makes it much easier to work with for our purposes -- this way there's no need to worry about periods in directory names (such as /etc/somethingorother.d). The field separator is '.', the awk code is
!/\./ || $NF == "" { # if the line (the filename) does not contain
# a period or there's nothing after the last .
++count["noext"] # increment the "noext" counter
# note that this will be collated with files that
# have ".noext" as filename extension. see below.
next # go to the next line
}
{ # in all other lines
++count[$NF] # increment the counter for the file extension
}
END { # in the very end:
for(k in count) { # print the counters.
print count[k] " " k
}
}
Note that this way, if there is a file "foo.noext", it will be counted among the files without a filename extension. If this is a worry, use a special counter for files without an extension -- either apart from the array or with a key that cannot be a filename extension (such as one that includes a . or the empty string).
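A quick sanity check of the counting logic on a hand-made list of names (piped in instead of coming from find; the key is spelled noexts here to match the question's desired output, and sort is added because awk's for (k in count) iteration order is unspecified):

```shell
printf '%s\n' a.sh b.sh index.html style.css README data. \
  | awk -F. '!/\./ || $NF == "" { ++count["noexts"]; next }  # no dot, or empty last field
             { ++count[$NF] }                                 # count by extension
             END { for (k in count) print count[k] " " k }' \
  | sort
# prints:
# 1 css
# 1 html
# 2 noexts
# 2 sh
```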
I have a file (currently about 1 GB, 40M lines), and I need to split it into smaller files based on a target file size (target is ~1 MB per file).
The file contains multiple lines of tab-separated values. The first column has an integer value. The file is sorted by the first column. There are about 1M values in the first column, so each value has on average 40 lines, but some may have 2 and others may have 100 or more lines.
12\t...
12\t...
13\t...
14\t...
15\t...
15\t...
15\t...
16\t...
...
2584765\t...
2586225\t...
2586225\t...
After splitting the file, any distinct first value must only appear in a single file. E.g. when I read a smaller file and find a line starting with 15, it is guaranteed that no other files contain lines starting with 15.
This does not mean mapping all lines that start with a specific value to a single file each; one output file may hold many distinct values.
Is this possible with the commandline tools available on a Unix/Linux system?
The following will try to split every 40,000 records, but postpone the split if the next record has the same key as the previous.
awk -F '\t' 'BEGIN { i=1; s=0; f=sprintf("file%05i", i) }
NR % 40000 == 0 { s=1 }
s==1 && $1!=k { close(f); f=sprintf("file%05i", ++i); s=0 }
{ k=$1; print >>f }' input
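A toy run with a threshold of 3 records instead of 40,000 shows the postponed split (file00002 gets an extra record because key 15 straddles a boundary):

```shell
rm -f file00* input
printf '12\ta\n12\tb\n13\tc\n15\td\n15\te\n15\tf\n16\tg\n' > input
awk -F '\t' 'BEGIN { i=1; s=0; f=sprintf("file%05i", i) }
  NR % 3 == 0 { s=1 }                                   # boundary reached: request a split
  s==1 && $1!=k { close(f); f=sprintf("file%05i", ++i); s=0 }  # split only when key changes
  { k=$1; print >>f }' input
# file00001: keys 12 12 ; file00002: keys 13 15 15 15 ; file00003: key 16
```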
List all the keys by looking at only the first column (awk) and making them unique (sort -u). Then, for each of these keys, select only the lines that start with the key (grep) and redirect them into a file named after the key.
Oneliner:
for key in `awk '{print $1;}' file_to_split | sort -u` ; do grep -e "^$key\\s" file_to_split > splitted_file_$key ; done
Or multiple lines for a script file and better readability:
for key in `awk '{print $1;}' file_to_split | sort -u`
do
grep -e "^$key\\s" file_to_split > splitted_file_$key
done
Not especially efficient, as it parses the file many times.
Also, I'm not sure the for command can handle such a large input from the `` subcommand.
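Since the input is already sorted by the first column, a single-pass awk variant avoids both problems: it reads the file once and closes each output as soon as the key changes, so the number of open files never grows (the printf line below just creates a tiny stand-in for file_to_split):

```shell
printf '12\ta\n12\tb\n13\tc\n' > demo_input   # tiny stand-in for file_to_split
awk '$1 != prev { if (out) close(out)         # key changed: close previous output
                  out = "splitted_file_" $1   # open a file named after the new key
                  prev = $1 }
     { print > out }' demo_input
```

Because the keys are sorted, each output file is opened and closed exactly once.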
On Unix systems you can usually also use Perl. So here is a Perl solution:
#!/usr/local/bin/perl
use strict;
my $last_key;
my $store;
my $c = 0;
my $max_size = 1000000;
while(<>){
my @fields = split(/\t/);
my $key = $fields[0];
if ($last_key ne $key) {
store() if (length($store)>$max_size);
}
$store.=$_;
$last_key = $key;
}
store();
sub store {
$c++;
open (O, ">", "${c}.part");
print O $store;
close O;
$store='';
}
Save it as x.pl and use it like:
perl x.pl bigfile.txt
It writes your entries into
1.part
2.part
...
files and tries to keep them around $max_size.
HTH
I have a huge set of files (64,000) and I want to create a Bash script that lists file names using
ls -1 > file.txt
for every 4,000 files and stores the resulting file.txt in a separate folder. So every 4,000 files have their names listed in a text file that is stored in its own folder. The result is
folder01 contains file.txt that lists files #0-#4000
folder02 contains file.txt that lists files #4001-#8000
folder03 contains file.txt that lists files #8001-#12000
.
.
.
folder16 contains file.txt that lists files #60000-#64000
Thank you very much in advance
You can try
ls -1 | awk '
{
if (! ((NR-1)%4000)) {
if (j) close(fnn)
fn=sprintf("folder%02d",++j)
system("mkdir "fn)
fnn=fn"/file.txt"
}
print >> fnn
}'
Explanation:
NR is the current record number in awk, that is: the current line number.
NR starts at 1, on the first line, so we subtract 1 such that the if statement is true for the first line
system calls an operating system function from within awk
print in itself prints the current line to standard output, we can redirect (and append) the output to the file using >>
All uninitialized variables in awk will have a zero value, so we do not need to say j=0 in the beginning of the program
This will get you pretty close;
ls -1 | split -l 4000 -d - folder
Run the result of ls through split, breaking every 4000 lines (-l 4000), using numeric suffixes (-d), from standard input (-) and start the naming of the files with folder.
Results in folder00, folder01, ...
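A toy run with 10 names and 4 per chunk (split -d is a GNU coreutils option; older BSD split lacks it):

```shell
# 10 sample names in, 4 lines per output file, numeric suffixes:
printf '%s\n' f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 | split -l 4 -d - folder
# creates folder00 (4 lines), folder01 (4 lines), folder02 (2 lines)
```

Note this yields plain files named folder00, folder01, ..., not directories containing a file.txt; if the directory layout matters, the awk answers below handle that part.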
Here is an exact solution using awk:
ls -1 | awk '
(NR-1) % 4000 == 0 {
dir = sprintf("folder%02d", ++nr)
system("mkdir -p " dir);
}
{ print >> (dir "/file.txt") }'
There are already some good answers above, but I would also suggest you take a look at the watch command. This will re-run a command every n seconds, so you can, well, watch the output.