Awk script to select files and print file sizes - linux

I'm working on a homework assignment. The question is:
Write an awk script to select all regular files (not directories or
links) in /etc ending with .conf, sort the result by size from
smallest to largest, count the number of files, and print out the
number of files followed by the filenames and sizes in two columns.
Include a header row for the filenames and sizes. Paste both your
script and its output in the answer area.
I'm really struggling to get this to work using awk. Here's what I came up with.
ls -lrS /etc/*.conf | wc -l
will return the number 33, which is the number of .conf files in the directory.
ls -lrS /etc/*.conf |awk '{print "File_Size"": " $5 " ""File_Name and Size"": " $9}'
this will make two columns with the size and name of each .conf file in the directory.
It works, but I don't think it is what he's looking for. I'm having an AWKful time.

Let's see here...
select all regular files (not directories or links)
So far you haven't addressed this, but if you are piping in the output of ls -l..., this is easy: select on
/^-/
because directories start with d, symbolic links with l, and so on; only plain old files start with -.
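For illustration only (these entries are invented, not taken from the question), the first character of each ls -l line is exactly what that pattern tests:
drwxr-xr-x  2 root root 4096 Sep  5 12:00 cron.d          <- directory: starts with d
lrwxrwxrwx  1 root root   22 Sep  5 12:00 resolv.conf     <- symbolic link: starts with l
-rw-r--r--  1 root root 4589 Sep  5 12:00 man.conf        <- regular file: starts with -
Now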
print out the number of files followed
Well, counting matches is easy enough...
BEGIN{count=0} # This is not *necessary*, but I tend to put it in for clarity
/^-/ {count++;}
To get the filename and size, look at the output of ls -l and count up columns
BEGIN{count=0}
/^-/ {
count++;
SIZE=$5;
FNAME=$9;
}
The big difficulty here is that awk doesn't provide much by way of sorting primitives, so that's the hard part. That can be beaten if you want to be clever, but it is not particularly efficient (see the awful thing I did in a [code-golf] solution). The easy (and Unixy) thing to do is to pipe part of the output to sort, so we collect a line for each file into a big string:
BEGIN{count=0}
/^-/ {
count++
SIZE=$5;
FNAME=$9;
OUTPUT=sprintf("%10d\t%s\n%s",SIZE,FNAME,OUTPUT);
}
END{
printf("%d files\n",count);
printf(" SIZE \tFILENAME"); # No newline here because OUTPUT has it
print OUTPUT|"sort -n --key=1";
}
Gives output like
11 files
SIZE FILENAME
673 makefile
2192 houghdata.cc
2749 houghdata.hh
6236 testhough.cc
8751 fasthough.hh
11886 fasthough.cc
19270 HoughData.png
60036 houghdata.o
104680 testhough
150292 testhough.o
168588 fasthough.o
(BTW, there is a test subdirectory here, and you'll note that it does not appear in the output.)

Maybe something like this will get you on your way:
ls -lrS /etc/*.conf |
awk '
BEGIN{print "Size:\tFilename:"} # Prints Headers
/^-/{count++; print $5"\t"$9} # Prints the two desired columns, /^-/ matches only regular files
END{print "Total Files = " count}' # Prints the count of matched regular files
Test (the text after # in the script above is a comment, for your reference):
[jaypal:~/Temp] ls -lrS /etc/*.conf |
awk '
BEGIN{print "Size:\tFilename:"}
/^-/{count++; print $5"\t"$9}
END{print "Total Files = " count}'
Size: Filename:
0 /etc/kern_loader.conf
22 /etc/ntp.conf
54 /etc/ftpd.conf
105 /etc/launchd.conf
168 /etc/memberd.conf
242 /etc/notify.conf
366 /etc/ntp-restrict.conf
526 /etc/gdb.conf
723 /etc/pf.conf
753 /etc/6to4.conf
772 /etc/syslog.conf
983 /etc/rtadvd.conf
1185 /etc/asl.conf
1238 /etc/named.conf
1590 /etc/newsyslog.conf
1759 /etc/autofs.conf
2378 /etc/dnsextd.conf
4589 /etc/man.conf
Total Files = 18

I would first find the files with something like find /etc -type f -name '*.conf', so you get the right list of files. Then run ls -l on them (perhaps using xargs), and then the awk should be simple.
But I don't think it would help you if I did more of your homework. You need to think it through yourself and find out.
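For what it's worth, here is a rough sketch of that kind of pipeline (my own illustration, not a definitive answer: it assumes GNU-style ls -l columns, i.e. $5 = size and $9 = name, and filenames without spaces):
find /etc -maxdepth 1 -type f -name '*.conf' |   # regular files directly in /etc
xargs ls -l |                                    # long listing for each one
sort -n -k5,5 |                                  # sort numerically on the size column
awk '{ line[++n] = $5 "\t" $9 }                  # collect "size<TAB>name" rows
     END { print n " files"; print "SIZE\tNAME"; for (i = 1; i <= n; i++) print line[i] }'
The find -type f step replaces the /^-/ test, since only regular files reach ls at all.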

Disclaimer: I'm not a shell expert.
Thought I'd give this a go; I've been beaten on speed of reply though :-) :
clear
FILE_COUNT=`find /etc/ -name '*.conf' -type f -maxdepth 1 | wc -l`
echo "Number of files: $FILE_COUNT"
ls -lrS /etc/[^-]*.conf | awk '
BEGIN {print "NAME | SIZE"}\
{print $9," | ",$5}\
END {print "- DONE -"}\
'
My output is ugly :-( :
Number of files: 21
NAME | SIZE
/etc/kern_loader.conf | 0
/etc/resolv.conf | 20
/etc/AFP.conf | 24
/etc/ntp.conf | 42
/etc/ftpd.conf | 54
/etc/notify.conf | 132
/etc/memberd.conf | 168
/etc/Symantec.conf | 246
/etc/ntp-restrict.conf | 366
/etc/gdb.conf | 526
/etc/6to4.conf | 753
/etc/syslog.conf | 772
/etc/asl.conf | 860
/etc/liveupdate.conf | 861
/etc/rtadvd.conf | 983
/etc/named.conf | 1238
/etc/newsyslog.conf | 1590
/etc/autofs.conf | 1759
/etc/dnsextd.conf | 2378
/etc/smb.conf | 2975
/etc/man.conf | 4589
/etc/amavisd.conf | 31925
- DONE -

Related

Separating data into files based on random ID numbers using shell script

I have a tab-delimited file (1993_NYA.tab) with radiosonde data containing an ID column. I want to extract the data for each ID into a separate tab file. The file looks like this.
1993-01-01T10:45:03 083022143 250 78.93018 11.95426 960.0 -16.8 76 1.7 276
1993-01-01T10:45:16 083022143 300 78.93011 11.95529 953.7 -17.2 77 1.8 288
1993-01-01T10:45:30 083022143 350 78.93000 11.95638 947.3 -17.6 79 2.0 297
Here 083022143 is the ID, but it changes randomly (not in ascending order). The code I tried is as follows.
ID=$(cat 1993_NYA.tab | cut -f 2 | sort | uniq)
for i in {$ID}
do
awk -F '\t' '$2 = "$i"' 1993_NYA.tab > 1993_$i.tab
done
This is not storing the data for a particular ID in the file named after that ID. Can anyone please help?
There are three small mistakes.
{$ID} should be just $ID.
'$2 = "$i"' has the assignment operator = rather than the comparison ==.
'$2 = "$i"' does not interpolate the value of i into the argument because of the single quotes; we can write e.g. "\$2 == $i" instead.
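For reference, here is the loop from the question with just those three changes applied (a sketch; it still re-reads the data file once per ID):
ID=$(cat 1993_NYA.tab | cut -f 2 | sort | uniq)
for i in $ID
do
    awk -F '\t' "\$2 == $i" 1993_NYA.tab > 1993_$i.tab
done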
With the above changes, your code works. But awk has built-in magic which makes the task much easier. The single command
awk -F '\t' '{print > ("1993_" $2 ".tab")}' 1993_NYA.tab
does what you want.
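One caveat, as an aside: with a very large number of distinct IDs, some awk implementations can run into a limit on simultaneously open files (gawk usually copes by juggling descriptors). A variant of the same idea that closes each output file after writing (my sketch, slower but frugal with descriptors) would be:
awk -F '\t' '{ f = "1993_" $2 ".tab"; print >> f; close(f) }' 1993_NYA.tab
Note the >> append, since each file is reopened for every matching line; remove any old 1993_*.tab files before re-running.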

Print a row of 16 lines evenly side by side (column)

I have a file with an unknown (but always even) number of lines. I want to print them side by side based on the total number of lines in the file. For example, I have a file with 16 lines like below:
asdljsdbfajhsdbflakjsdff235
asjhbasdjbfajskdfasdbajsdx3
asjhbasdjbfajs23kdfb235ajds
asjhbasdjbfajskdfbaj456fd3v
asjhbasdjb6589fajskdfbaj235
asjhbasdjbfajs54kdfbaj2f879
asjhbasdjbfajskdfbajxdfgsdh
asjhbasdf3709ddjbfajskdfbaj
100
100
150
125
trh77rnv9vnd9dfnmdcnksosdmn
220
225
sdkjNSDfasd89asdg12asdf6asdf
So now I want to print them side by side. As there are 16 lines in total, I am trying to get the result as 8:8, like below:
asdljsdbfajhsdbflakjsdff235 100
asjhbasdjbfajskdfasdbajsdx3 100
asjhbasdjbfajs23kdfb235ajds 150
asjhbasdjbfajskdfbaj456fd3v 125
asjhbasdjb6589fajskdfbaj235 trh77rnv9vnd9dfnmdcnksosdmn
asjhbasdjbfajs54kdfbaj2f879 220
asjhbasdjbfajskdfbajxdfgsdh 225
asjhbasdf3709ddjbfajskdfbaj sdkjNSDfasd89asdg12asdf6asdf
The paste command did not work for me exactly (paste - - - - - - - - < file1), nor did the awk command that I used: awk '{printf "%s" (NR%2==0?RS:FS),$1}'
Note: the number of lines in the file is dynamic. The only known thing in my scenario is that it is always an even number.
If you have the memory to hash the whole file ("max" below):
$ awk '{
a[NR]=$0 # hash all the records
}
END { # after hashing
mid=int(NR/2) # compute the midpoint, int in case NR is uneven
for(i=1;i<=mid;i++) # iterate from start to midpoint
print a[i],a[mid+i] # output
}' file
If you have the memory to hash half of the file ("mid"):
$ awk '
NR==FNR { # on 1st pass hash second half of records
if(FNR>1) { # we don't need the 1st record ever
a[FNR]=$0 # hash record
if(FNR%2) # if odd record
delete a[int(FNR/2)+1] # remove one from the past
}
next
}
FNR==1 { # on the start of 2nd pass
if(NR%2==0) # if record count is uneven
exit # exit as there is always even count of them
offset=int((NR-1)/2) # compute offset to the beginning of hash
}
FNR<=offset { # only process the 1st half of records
print $0,a[offset+FNR] # output one from file, one from hash
next
}
{ # once 1st half of 2nd pass is finished
exit # just exit
}' file file # notice filename twice
And finally, if you have awk compiled into a worm's brain (i.e. not so much memory, "min"):
$ awk '
NR==FNR { # just get the NR of 1st pass
next
}
FNR==1 {
mid=(NR-1)/2 # get the midpoint
file=FILENAME # filename for getline
while(++i<=mid && (getline line < file)>0); # jump getline to mid
}
{
if((getline line < file)>0) # getline read from mid+FNR
print $0,line # output
}' file file # notice filename twice
Standard disclaimer on getline and no real error control implemented.
Performance:
I ran seq 1 100000000 > file and tested how the above solutions performed. Output was redirected to /dev/null, but writing it to a file took around 2 s longer. max performance is so-so, as its memory footprint was 88 % of my 16 GB, so it might have swapped. Well, I killed all the browsers and shaved 7 seconds off the real time of max.
+------+-----------+-----------+-----------+
|      |    min    |    mid    |    max    |
+------+-----------+-----------+-----------+
| real | 1m7.027s  | 1m30.146s | 0m48.405s |
| user | 1m6.387s  | 1m27.314s | 0m43.801s |
| sys  | 0m0.641s  | 0m2.820s  | 0m4.505s  |
+------+-----------+-----------+-----------+
| mem  | 3 MB      | 6.8 GB    | 13.5 GB   |
+------+-----------+-----------+-----------+
Update:
I tested @DavidC.Rankin's and @EdMorton's solutions and they ran, respectively:
real 0m41.455s
user 0m39.086s
sys 0m2.369s
and
real 0m39.577s
user 0m37.037s
sys 0m2.541s
The memory footprint was about the same as my mid solution had. It pays to use wc, it seems.
$ pr -2t file
asdljsdbfajhsdbflakjsdff235 100
asjhbasdjbfajskdfasdbajsdx3 100
asjhbasdjbfajs23kdfb235ajds 150
asjhbasdjbfajskdfbaj456fd3v 125
asjhbasdjb6589fajskdfbaj235 trh77rnv9vnd9dfnmdcnksosdmn
asjhbasdjbfajs54kdfbaj2f879 220
asjhbasdjbfajskdfbajxdfgsdh 225
asjhbasdf3709ddjbfajskdfbaj sdkjNSDfasd89asdg12asdf6asdf
If you want just one space between columns, change it to
$ pr -2ts' ' file
You can also do it with awk, simply by storing the first half of the lines in an array and then printing each stored line with the corresponding line from the second half appended, e.g.
awk -v nlines=$(wc -l < file) -v j=0 'FNR<=nlines/2{a[++i]=$0; next} j<i{print a[++j],$1}' file
Example Use/Output
With your data in file, then
$ awk -v nlines=$(wc -l < file) -v j=0 'FNR<=nlines/2{a[++i]=$0; next} j<i{print a[++j],$1}' file
asdljsdbfajhsdbflakjsdff235 100
asjhbasdjbfajskdfasdbajsdx3 100
asjhbasdjbfajs23kdfb235ajds 150
asjhbasdjbfajskdfbaj456fd3v 125
asjhbasdjb6589fajskdfbaj235 trh77rnv9vnd9dfnmdcnksosdmn
asjhbasdjbfajs54kdfbaj2f879 220
asjhbasdjbfajskdfbajxdfgsdh 225
asjhbasdf3709ddjbfajskdfbaj sdkjNSDfasd89asdg12asdf6asdf
Extract the first half of the file and the last half of the file and merge the lines:
paste <(head -n $(($(wc -l <file.txt)/2)) file.txt) <(tail -n $(($(wc -l <file.txt)/2)) file.txt)
You can use the columns utility from autogen:
columns -c2 --by-columns file.txt
You can also use column, but the number of output columns is calculated in a strange way from the width of your terminal. So, assuming your lines are 28 characters long, you can also do:
column -c $((28*2+8)) file.txt
I do not want to solve this, but if I were you:
wc -l file.txt
gives the number of lines
echo $(($(wc -l < file.txt)/2))
gives half of that
head -n $(($(wc -l < file.txt)/2)) file.txt > first.txt
tail -n $(($(wc -l < file.txt)/2)) file.txt > last.txt
create files with the first half and the last half of the original file. Now you can merge those files together side by side as described here.
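For instance, using paste as in the earlier answer (a minimal sketch; -d' ' makes the separator a single space):
paste -d' ' first.txt last.txt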
Here is my take on it, using the bash shell, wc(1) and ed(1):
#!/usr/bin/env bash
array=()
file=$1
total=$(wc -l < "$file")
half=$(( total / 2 ))
plus1=$(( half + 1 ))
for ((m=1;m<=half;m++)); do
array+=("${plus1}m$m" "${m}"'s/$/ /' "${m}"',+1j')
done
After all of that, if you just want to print the output to stdout, add the line below to the script.
printf '%s\n' "${array[@]}" ,p Q | ed -s "$file"
If you want to write the changes directly to the file itself, use this line at the end of the script instead.
printf '%s\n' "${array[@]}" w | ed -s "$file"
Here is an example.
printf '%s\n' {1..10} > file.txt
Now running the script against that file.
./myscript file.txt
Output
1 6
2 7
3 8
4 9
5 10
Or, using the bash 4+ builtin mapfile (aka readarray):
Save the file in an array named array.
mapfile -t array < file.txt
Split the array into two halves.
left=("${array[@]::((${#array[@]} / 2))}") right=("${array[@]:((${#array[@]} / 2))}")
Loop and print side by side.
for i in "${!left[@]}"; do
printf '%s %s\n' "${left[i]}" "${right[i]}"
done
Given that, as you said, the number of lines is always even, that solution should work.

Finding duplicate entries across very large text files in bash

I am working with very large data files extracted from a database. There are duplicates across these files that I need to remove. If there are duplicates, they will exist across files, not within the same file. The files contain entries that look like the following:
File1
623898/bn-oopi-990iu/I Like Potato
982347/ki-jkhi-767ho/Let's go to Sesame Street
....
File2
568798/jj-ytut-786hh/Hello Mike
982347/ki-jkhi-767ho/Let's go to Sesame Street
....
So the Sesame Street line will have to be removed, possibly even across 5 files, but it must remain in at least one of them. From what I have worked out so far, I can run cat * | sort | uniq -cd to give me each duplicated line and the number of times it has been duplicated, but I have no way of getting the file names. cat * | sort | uniq -cd | grep "" * doesn't work. Any ideas or approaches for a solution would be great.
Expanding your original idea:
sort * | uniq -cd | awk '{print $2}' | grep -Ff- *
i.e. from that output, print only the duplicate strings, then search all the files for them (the list of things to search for is taken from -, i.e. stdin), literally (-F).
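Assuming File1 and File2 contain just the lines shown in the question, that pipeline would print something like:
File1:982347/ki-jkhi-767ho/Let's go to Sesame Street
File2:982347/ki-jkhi-767ho/Let's go to Sesame Street
so the filename prefix added by grep tells you where each duplicate lives.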
Something along these lines might be useful:
awk '!seen[$0] { print $0 > (FILENAME ".new") } { seen[$0] = 1 }' file1 file2 file3 ...
twalberg's solution works perfectly, but if your files are really large it could exhaust the available memory, because it creates one entry in an associative array per unique record encountered. If that happens, you can try a similar approach where there is only one entry per duplicate record (I assume you have GNU awk and your files are named *.txt):
sort *.txt | uniq -d > dup
awk 'BEGIN {while(getline < "dup") {dup[$0] = 1}} \
!($0 in dup) {print >> (FILENAME ".new")} \
$0 in dup {if(dup[$0] == 1) {print >> (FILENAME ".new");dup[$0] = 0}}' *.txt
Note that if you have many duplicates, this could also exhaust the available memory. You can solve that by splitting the dup file into smaller chunks and running the awk script on each chunk.

Compare all files in a folder

I've got a script in crontab which, every 30 minutes, creates files with a list of offline peers in Asterisk:
now=$(date +"%Y%m%d%H%M")
/usr/sbin/asterisk -rx 'sip show peers' | grep "Unspec" | sed 's/[/].*//' >> /var/log/asterisk/offline/offline_$now
I need to parse these files and find extensions that were always offline, i.e. strings that appear in every file.
How can I do this?
Output is:
/usr/sbin/asterisk -rx 'sip show peers' | grep "Unspec" | sed 's/[/].*//' | tail -3
891
894
899
ls /var/log/asterisk/offline/
offline_201309051400 offline_201309051418 offline_201309051530 offline_201309051700
offline_201309051830 offline_201309052000 offline_201309052130
offline_201309051405 offline_201309051430 offline_201309051600 offline_201309051730
offline_201309051900 offline_201309052030 offline_201309052200
offline_201309051406 offline_201309051500 offline_201309051630 offline_201309051800
offline_201309051930 offline_201309052100 offline_201309052230
This awk script will print the lines that are present in all of the files:
awk 'FNR==1{f++}{a[$0]++}END{for (i in a) if (a[i]==f) print i}' offline_*
How it works:
With FNR==1{f++} we count the number of files that are parsed (FNR is equal to one for the first line of each file)
with {a[$0]++} we count how many times each line has appeared.
the END block prints the elements of the array that have been found f times.
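A toy run (file contents invented, in an otherwise empty directory) illustrates it: only the extension present in every snapshot is printed.
printf '891\n894\n899\n' > offline_201309051400
printf '891\n899\n'      > offline_201309051430
printf '899\n'           > offline_201309051500
awk 'FNR==1{f++}{a[$0]++}END{for (i in a) if (a[i]==f) print i}' offline_*
The output is 899, the only extension that appears in all three files.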

grouping lines from a txt file using filters in Linux to create multiple txt files

I have a txt file where each line starts with a participant number, followed by the date and other variables (numbers only), so it has this format:
S001_2 20090926 14756 93
S002_2 20090803 15876 13
I want to write a script that creates smaller txt files containing only 20 participants per file (so the first one will contain lines from S001_2 to S020_2, the second from S021_2 to S040_2; the total number of subjects is approximately 200). However, the subjects are not organized, therefore I can't set a range with sed.
What would be the best command to filter participants into chunks depending on what number (S001_2) the line starts with?
Thanks in advance.
Use the split command to split a file (or a filtered result) without ranges and sed. According to the documentation, this should work:
cat file.txt | split -l 20 - PREFIX
This will produce the files PREFIXaa, PREFIXab, ... (Note that it does not add the .txt extension to the file name!)
If you want to filter the file first, in the way @Sergey described:
cat file.txt | sort | split -l 20 - PREFIX
Sort without any parameters should be suitable, because there are leading zeros in your numbers like S001_2. So, first sort the file:
sort file.txt > sorted.txt
Then you will be able to set ranges with sed on sorted.txt.
Here is a whole script for splitting the sorted file into 20-line files:
num=1;
i=1;
lines=`wc -l sorted.txt | cut -d' ' -f 1`;#get number of lines
while [ $i -lt $lines ];do
sed -n $i,`echo $i+19 | bc`p sorted.txt > file$num;
num=`echo $num+1 | bc`;
i=`echo $i+20 | bc`;
done;
$ split -d -l 20 file.txt -a3 db_
produces: db_000, db_001, db_002, ..., db_N
