Compare all files in a folder - linux

I've got a script in crontab which creates a file every 30 minutes with the list of offline peers in Asterisk:
now=$(date +"%Y%m%d%H%M")
/usr/sbin/asterisk -rx 'sip show peers' | grep "Unspec" | sed 's/[/].*//' >> /var/log/asterisk/offline/offline_$now
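For reference, the crontab entry driving it would look something like this (the script path is hypothetical):
*/30 * * * * /usr/local/bin/offline_snapshot.sh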
I need to parse these files and find the extensions that were always offline, i.e. the strings that are constant across all the files.
How can I do this?
Output is:
/usr/sbin/asterisk -rx 'sip show peers' | grep "Unspec" | sed 's/[/].*//' | tail -3
891
894
899
ls /var/log/asterisk/offline/
offline_201309051400 offline_201309051418 offline_201309051530 offline_201309051700 offline_201309051830 offline_201309052000 offline_201309052130
offline_201309051405 offline_201309051430 offline_201309051600 offline_201309051730 offline_201309051900 offline_201309052030 offline_201309052200
offline_201309051406 offline_201309051500 offline_201309051630 offline_201309051800 offline_201309051930 offline_201309052100 offline_201309052230

This awk script will print the lines that are present in all of the files:
awk 'FNR==1{f++}{a[$0]++}END{for (i in a) if (a[i]==f) print i}' offline_*
How it works:
With FNR==1{f++} we count the number of files being parsed (FNR equals 1 on the first line of each file).
With {a[$0]++} we count how many times each line has appeared.
The END block prints the elements of the array that have been seen f times, i.e. the lines present in every file.
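One caveat: if the same extension could ever be listed twice within a single file, its total count could reach f without it appearing in every file. A minimal variant (a sketch in standard awk) that counts each line at most once per file guards against that:
awk 'FNR==1{f++} !seen[FILENAME,$0]++{a[$0]++} END{for (i in a) if (a[i]==f) print i}' offline_*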

Related

How can I fix my bash script to find a random word from a dictionary?

I'm studying bash scripting and I'm stuck on an exercise from this site: https://ryanstutorials.net/bash-scripting-tutorial/bash-variables.php#activities
The task is to write a bash script that outputs a random word from a dictionary whose length equals the number supplied as the first command-line argument.
My idea was to create a sub-dictionary, assign each word a line number, select a random number from those lines and filter the output. This worked for a similar, simpler script, but not for this one.
This is the code I used:
6 DIC='/usr/share/dict/words'
7 SUBDIC=$( egrep '^.{'$1'}$' $DIC )
8
9 MAX=$( $SUBDIC | wc -l )
10 RANDRANGE=$((1 + RANDOM % $MAX))
11
12 RWORD=$(nl "$SUBDIC" | grep "\b$RANDRANGE\b" | awk '{print $2}')
13
14 echo "Random generated word from $DIC which is $1 characters long:"
15 echo $RWORD
and this is the error I get when using "21" as input:
bash script.sh 21
script.sh: line 9: counterintelligence's: command not found
script.sh: line 10: 1 + RANDOM % 0: division by 0 (error token is "0")
nl: 'counterintelligence'\''s'$'\n''electroencephalograms'$'\n''electroencephalograph': No such file or directory
Random generated word from /usr/share/dict/words which is 21 characters long:
I tried running the pieces of code separately in bash and got no error (input=21):
egrep '^.{'21'}$' /usr/share/dict/words | wc -l
3
but once in the script, lines 9 and 10 give errors.
Where do you think the error is?
problems
SUBDIC=$( egrep '^.{'$1'}$' $DIC ) will store all words of the given length in the SUBDIC variable, so its content is now something like foo bar baz.
MAX=$( $SUBDIC | ... ) will try to run the command foo bar baz which is obviously bogus; it should be more like MAX=$(echo $SUBDIC | ... )
MAX=$( ... | wc -l ) will count the lines; when using the above mentioned echo $SUBDIC you will have multiple words, but all in one line...
RWORD=$(nl "$SUBDIC" | ...) same problem as above: there's only one line (also note @armali's answer that nl requires a file or stdin)
RWORD=$(... | grep "\b$RANDRANGE\b" | ...) might match the dictionary entry catch 22
likely RWORD=$(... | awk '{print $2}') won't handle lines containing spaces
a simple solution
doing a "random sort" over all the possible words and taking the first line should be sufficient:
egrep "^.{$1}$" "${DIC}" | sort -R | head -1
MAX=$( $SUBDIC | wc -l ) - A pipe is used for connecting a command's output, while $SUBDIC isn't a command; an appropriate syntax is MAX=$( <<<$SUBDIC wc -l ).
nl "$SUBDIC" - The argument to nl has to be a filename, which "$SUBDIC" isn't; an appropriate syntax is nl <<<"$SUBDIC".
This code will do it. My test dictionary of words is in the file file. It's a good idea to get all words of a given length first, but put them in an array, not in a plain variable, and then pick a random index and echo it.
dic=( $(sed -n "/^.\{$1\}$/p" file) )
ind=$((0 + RANDOM % ${#dic[@]}))
echo "${dic[$ind]}"
I am also doing this activity and I came up with one simple solution.
I created this script:
#!/bin/bash
awk "NR==$1 {print}" /usr/share/dict/words
Here, if you want a random word, you have to run the script as per the below command from the terminal.
./script.sh $RANDOM
If you want to print the word at a specific line number, you can run it as per the below command from the terminal.
./script.sh 465
cat /usr/share/dict/american-english | head -n $RANDOM | tail -n 1
$RANDOM - returns a different random number each time it is referred to.
This simple line outputs a random word from the mentioned dictionary. (Note that $RANDOM is at most 32767, so it only ever selects from the first 32767 lines of the file.)
Otherwise, as umläute mentioned, you can do:
cat /usr/share/dict/american-english | sort -R | head -1
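If GNU coreutils is available, shuf is arguably the most direct tool for picking one random line, e.g. combined with the length filter from above:
grep "^.\{$1\}$" /usr/share/dict/words | shuf -n 1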

Finding duplicate entries across very large text files in bash

I am working with very large data files extracted from a database. There are duplicates across these files that I need to remove. If there are duplicates they will exist across files not within the same file. The files contain entries that look like the following:
File1
623898/bn-oopi-990iu/I Like Potato
982347/ki-jkhi-767ho/Let's go to Sesame Street
....
File2
568798/jj-ytut-786hh/Hello Mike
982347/ki-jkhi-767ho/Let's go to Sesame Street
....
So the Sesame Street line will have to be removed, possibly even across 5 files, but it must remain in at least one of them. From what I have been able to work out so far, I can run cat * | sort | uniq -cd to get each duplicated line and the number of times it has been duplicated, but I have no way of getting the file name. cat * | sort | uniq -cd | grep "" * doesn't work. Any ideas or approaches for a solution would be great.
Expanding your original idea:
sort * | uniq -cd | awk '{print $2}' | grep -Ff- *
i.e. from the output print only the duplicated strings, then search all the files for them (the list of patterns to search for is taken from -, i.e. stdin), matching literally (-F).
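Note that because the entries contain spaces, awk '{print $2}' passes only the text up to the first space as the pattern; grep -F still finds the right lines, since it matches substrings, but stripping just the leading count with sed keeps the whole entry as the pattern (a sketch):
sort * | uniq -cd | sed 's/^ *[0-9]* //' | grep -Ff - *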
Something along these lines might be useful:
awk '!seen[$0] { print > (FILENAME ".new") } { seen[$0] = 1 }' file1 file2 file3 ...
twalberg's solution works perfectly, but if your files are really large it could exhaust the available memory, because it creates one entry in an associative array per unique record encountered. If that happens, you can try a similar approach where there is only one entry per duplicate record (I assume you have GNU awk and your files are named *.txt):
sort *.txt | uniq -d > dup
awk 'BEGIN {while(getline < "dup") {dup[$0] = 1}} \
!($0 in dup) {print >> (FILENAME ".new")} \
$0 in dup {if(dup[$0] == 1) {print >> (FILENAME ".new");dup[$0] = 0}}' *.txt
Note that if you have many duplicates it could also exhaust the available memory. You can solve this by splitting the dup file into smaller chunks and running the awk script on each chunk.

Count occurrences of a character in files

I want to count all $ characters in each file in a directory with several subdirectories.
My goal is to count all variables in a PHP project. The files have the suffix .php.
I tried
grep -r '$' . | wc -c
grep -r '$' . | wc -l
and a lot of other stuff, but all of them returned a number that doesn't match: my example file contains only six $ characters.
So I hope someone can help me.
EDIT
My example file
<?php
class MyClass extends Controller {
$a;$a;
$a;$a;
$a;
$a;
To recursively count the number of $ characters in a set of files in a directory you could do:
fgrep -Rho '$' some_dir | wc -l
To include only files of extension .php in the recursion you could instead use:
fgrep -Rho --include='*.php' '$' some_dir | wc -l
The -R is for recursively traversing the files in some_dir and the -o is for printing only the matched part of each line. The set of files is restricted to the pattern *.php, and file names are omitted from the output with -h, since they might otherwise have caused false positives.
For counting variables in a PHP project you can use the variable regex defined here.
So, the following will grep all variables for each file:
cd ~/my/php/project
grep -Pro '\$[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*' .
-P - use perlish regex
-r - recursive
-o - each match on separate line
will produce something like:
./elFinderVolumeLocalFileSystem.class.php:$path
./elFinderVolumeLocalFileSystem.class.php:$path
./elFinderVolumeMySQL.class.php:$driverId
./elFinderVolumeMySQL.class.php:$db
./elFinderVolumeMySQL.class.php:$tbf
If you want to count them, you can use:
$ grep -Proc '\$[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*' .
which gives the count per file (note that -c counts matching lines, not individual matches, so a line with several variables counts only once), like:
./connector.minimal.php:9
./connector.php:9
./elFinder.class.php:437
./elFinderConnector.class.php:46
./elFinderVolumeDriver.class.php:1343
./elFinderVolumeFTP.class.php:577
./elFinderVolumeFTPIIS.class.php:63
./elFinderVolumeLocalFileSystem.class.php:279
./elFinderVolumeMySQL.class.php:335
./mime.types:0
./MySQLStorage.sql:0
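Because of that lines-versus-matches caveat, exact per-file occurrence counts can instead be obtained by keeping -o and counting the output lines per file (a sketch):
grep -Pro '\$[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*' . | cut -d: -f1 | sort | uniq -c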
When you want to count by file and by variable, you can use:
$ grep -Pro '\$[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*' . | sort | uniq -c
to get a result like:
17 ./elFinderVolumeLocalFileSystem.class.php:$target
8 ./elFinderVolumeLocalFileSystem.class.php:$targetDir
3 ./elFinderVolumeLocalFileSystem.class.php:$test
97 ./elFinderVolumeLocalFileSystem.class.php:$this
1 ./elFinderVolumeLocalFileSystem.class.php:$write
6 ./elFinderVolumeMySQL.class.php:$arc
3 ./elFinderVolumeMySQL.class.php:$bg
10 ./elFinderVolumeMySQL.class.php:$content
1 ./elFinderVolumeMySQL.class.php:$crop
where you can see that the variable $write is used only once, so (maybe) it is useless.
You can also count per variable across the whole project:
$ grep -Proh '\$[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*' . | sort | uniq -c
and you will get something like:
13 $tree
1 $treeDeep
3 $trg
3 $trgfp
10 $ts
6 $tstat
35 $type
where you can see that $treeDeep is used only once in the whole project, so it is surely useless.
You can achieve many other combinations with different grep, sort and uniq commands.
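For example, to list only the variables that occur exactly once anywhere in the project (the most likely typos or leftovers), filter the counts (a sketch):
grep -Proh '\$[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*' . | sort | uniq -c | awk '$1 == 1 {print $2}'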

grouping lines from a txt file using filters in Linux to create multiple txt files

I have a txt file where each line starts with a participant number, followed by the date and other variables (numbers only), so it has the format:
S001_2 20090926 14756 93
S002_2 20090803 15876 13
I want to write a script that creates smaller txt files containing only 20 participants per file (so the first one will contain lines from S001_2 to S020_2, the second from S021_2 to S040_2; the total number of subjects is approximately 200). However, the subjects are not in order, therefore I can't set a range with sed.
What would be the best command to split the participants into chunks depending on which number (S001_2) the line starts with?
Thanks in advance.
Use the split command to split a file (or a filtered result) without ranges and sed. According to the documentation, this should work:
cat file.txt | split -l 20 - PREFIX
This will produce the files PREFIXaa, PREFIXab, ... (Note that it does not add the .txt extension to the file name!)
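If the .txt extension (or zero-padded numbering) matters, a short awk sketch can write the 20-line chunks directly (the chunk naming is just an example):
awk 'NR % 20 == 1 { if (out) close(out); out = sprintf("chunk%03d.txt", ++n) } { print > out }' file.txt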
If you want to filter the files first, in the way @Sergey described:
cat file.txt | sort | split -l 20 - PREFIX
Sort without any parameters should be suitable, because there are leading zeros in your numbers like S001_2. So, first sort the file:
sort file.txt > sorted.txt
Then you will be able to set ranges with sed for sorted.txt.
This looks like a whole script for splitting the sorted file into 20-line files (note -le rather than -lt, so the last partial chunk is not dropped):
num=1
i=1
lines=`wc -l sorted.txt | cut -d' ' -f 1`   # get number of lines
while [ $i -le $lines ]; do
    sed -n $i,`echo $i+19 | bc`p sorted.txt > file$num
    num=`echo $num+1 | bc`
    i=`echo $i+20 | bc`
done
$ split -d -l 20 file.txt -a3 db_
produces: db_000, db_001, db_002, ..., db_N
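Combining this with the sorting step from above:
sort file.txt | split -d -l 20 -a3 - db_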

Awk script to select files and print file sizes

I'm working on a home work assignment. The question is:
Write an awk script to select all regular files (not directories or
links) in /etc ending with .conf, sort the result by size from
smallest to largest, count the number of files, and print out the
number of files followed by the filenames and sizes in two columns.
Include a header row for the filenames and sizes. Paste both your
script and its output in the answer area.
I'm really struggling trying to get this to work through using awk. Here's what I came up with.
ls -lrS /etc/*.conf | wc -l
will return the number 33, which is the number of .conf files in the directory.
ls -lrS /etc/*.conf |awk '{print "File_Size"": " $5 " ""File_Name and Size"": " $9}'
this will make two columns with the name and size of each .conf file in the directory.
It works, but I don't think it is what he's looking for. I'm having an AWKful time.
Let's see here...
select all regular files (not directories or links)
So far you haven't addressed this, but if you are piping in the output of ls -l..., this is easy, select on
/^-/
because directories start with d, symbolic links with l and so on. Only plain old files start with -. Now
print out the number of files followed
Well, counting matches is easy enough...
BEGIN{count=0} # This is not *necessary*, but I tend to put it in for clarity
/^-/ {count++;}
To get the filename and size, look at the output of ls -l and count up columns
BEGIN{count=0}
/^-/ {
count++;
SIZE=$5;
FNAME=$9;
}
The big difficulty here is that awk doesn't provide much by way of sorting primitives, so that's the hard part. That can be beaten if you want to be clever but it is not particularly efficient (see the awful thing I did in a [code-golf] solution). The easy (and unixy) thing to do would be to pipe part of the output to sort, so...we collect a line for each file into a big string
BEGIN{count=0}
/^-/ {
count++
SIZE=$5;
FNAME=$9;
OUTPUT=sprintf("%10d\t%s\n%s",SIZE,FNAME,OUTPUT);
}
END{
printf("%d files\n",count);
printf(" SIZE \tFILENAME"); # No newline here because OUTPUT has it
print OUTPUT|"sort -n --key=1";
}
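Saved to a file (the name confsizes.awk here is just an example), it would be run by piping ls -l output into it:
ls -l /etc/*.conf | awk -f confsizes.awk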
Gives output like
11 files
SIZE FILENAME
673 makefile
2192 houghdata.cc
2749 houghdata.hh
6236 testhough.cc
8751 fasthough.hh
11886 fasthough.cc
19270 HoughData.png
60036 houghdata.o
104680 testhough
150292 testhough.o
168588 fasthough.o
(BTW, there is a test subdirectory here, and you'll note that it does not appear in the output.)
Maybe something like this will get you on your way -
ls -lrS /etc/*.conf |
awk '
BEGIN{print "Size:\tFilename:"} # Prints Headers
/^-/{print $5"\t"$9} # Prints two desired columns, /^-/ captures only files
END{print "Total Files = "(NR-1)}' # Uses in-built variable to print count
Test: the text after # is a comment for your reference.
[jaypal:~/Temp] ls -lrS /etc/*.conf |
awk '
BEGIN{print "Size:\tFilename:"}
/^-/{print $5"\t"$9}
END{print "Total Files = "(NR-1)}'
Size: Filename:
0 /etc/kern_loader.conf
22 /etc/ntp.conf
54 /etc/ftpd.conf
105 /etc/launchd.conf
168 /etc/memberd.conf
242 /etc/notify.conf
366 /etc/ntp-restrict.conf
526 /etc/gdb.conf
723 /etc/pf.conf
753 /etc/6to4.conf
772 /etc/syslog.conf
983 /etc/rtadvd.conf
1185 /etc/asl.conf
1238 /etc/named.conf
1590 /etc/newsyslog.conf
1759 /etc/autofs.conf
2378 /etc/dnsextd.conf
4589 /etc/man.conf
Total Files = 18
I would first find the files with something like find /etc -type f -name '*.conf', so you get the right list of files. Then do ls -l on them (perhaps using xargs), and from there the awk should be simple.
But I don't think that doing more of your homework would help you. You need to think about it yourself and work it out.
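A sketch of just the file-selection step described above (the rest is left to you):
find /etc -type f -name '*.conf' -print0 | xargs -0 ls -l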
Disclaimer: I'm not a shell expert.
Thought I'd give this a go; I've been beaten on speed of reply though :-) :
clear
FILE_COUNT=`find /etc/ -maxdepth 1 -type f -name '*.conf' | wc -l`
echo "Number of files: $FILE_COUNT"
ls -lrS /etc/[^-]*.conf | awk '
BEGIN {print "NAME | SIZE"}\
{print $9," | ",$5}\
END {print "- DONE -"}\
'
My output is ugly :-( :
Number of files: 21
NAME | SIZE
/etc/kern_loader.conf | 0
/etc/resolv.conf | 20
/etc/AFP.conf | 24
/etc/ntp.conf | 42
/etc/ftpd.conf | 54
/etc/notify.conf | 132
/etc/memberd.conf | 168
/etc/Symantec.conf | 246
/etc/ntp-restrict.conf | 366
/etc/gdb.conf | 526
/etc/6to4.conf | 753
/etc/syslog.conf | 772
/etc/asl.conf | 860
/etc/liveupdate.conf | 861
/etc/rtadvd.conf | 983
/etc/named.conf | 1238
/etc/newsyslog.conf | 1590
/etc/autofs.conf | 1759
/etc/dnsextd.conf | 2378
/etc/smb.conf | 2975
/etc/man.conf | 4589
/etc/amavisd.conf | 31925
- DONE -
