How to get "random" number based on a string in Bash? - string

I know there are plenty of random number generators out there, but I am looking for something that may be a little more predictable. Is there a way to get a "random" number (one that would be the same in every instance) given a string? I would like to do this in bash. I am looking for something a little more advanced than the count of characters in the string, but not as advanced as a full checksum of it.
The end goal is to get a decimal value, so this could be run against a string multiple times and reproduce the same result.

You need a random number, but you don't want a full checksum; that's a contradiction. I think md5sum and sha1sum are really easy to use and should fit your needs:
md5sum <<< "$your_str"
or
sha1sum <<< "$your_str"
Update:
If you need decimal numbers, just:
n=$(md5sum <<< "$your_str")
echo $((0x${n%% *}))
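Note that a full 32-digit MD5 hash does not fit in bash's 64-bit integer arithmetic, so the value above may silently wrap. A minimal sketch that stays in range by truncating the hash first (variable names are just illustrative):
n=$(md5sum <<< "$your_str")
hash=${n%% *}              # strip the trailing "  -" that md5sum appends
echo $((0x${hash:0:15}))   # 15 hex digits fit comfortably in a signed 64-bit integer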

To hash the string abc to 3 decimal digits:
echo $((0x$(echo abc | md5sum | cut -f 1 -d " " | cut -c 1-3))) | cut -c 1-3
To get e.g. 4 decimal digits instead of 3, replace the two occurrences of 3 with 4:
echo $((0x$(echo abc | md5sum | cut -f 1 -d " " | cut -c 1-4))) | cut -c 1-4
$(echo abc | md5sum | cut -f 1 -d " " | cut -c 1-4) gives a hexadecimal number of 4 digits.
$((0xHEX)) converts a hexadecimal number (HEX) to decimal, which may have more than 4 digits even when HEX has 4 digits, e.g. ffff is 65535.
cut -c 1-4 gets the first 4 digits.
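If this is needed repeatedly, the pipeline above can be wrapped in a small function. This is only a sketch; str_to_digits is a hypothetical name, and it inherits the same approach of truncating the decimal value:
str_to_digits() {   # usage: str_to_digits "some string" 4
    local s=$1 n=${2:-3}
    echo $((0x$(echo "$s" | md5sum | cut -f 1 -d " " | cut -c 1-"$n"))) | cut -c 1-"$n"
}
str_to_digits abc 3   # same result on every run for the same input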

$ echo abc | md5sum | grep -Eo "[[:digit:]]{3}" | head -n1
Result -> 248
$ echo xyz | md5sum | grep -Eo "[[:digit:]]{3}" | head -n1
Result -> 627
This combines the other answers.
For hex colors:
echo abc | sha1sum | grep -Eo "[a-f0-9]{6}" | head -n1
03cfd7
echo abc | sha256sum | grep -Eo "[a-f0-9]{6}" | head -n1
edeaaf
echo abc | sha512sum | grep -Eo "[a-f0-9]{6}" | head -n1
4f285d
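As a usage sketch (the function name is purely illustrative), the same trick can assign a stable color to any string, e.g. a username:
string_to_color() {   # deterministic 6-digit hex color for a string
    echo "$1" | sha1sum | grep -Eo "[a-f0-9]{6}" | head -n1
}
echo "#$(string_to_color alice)"   # always prints the same color for "alice"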

Related

Linux command to retrieve unique words and count along with punctuation marks

tr -c '[:alnum:]' '[\n*]' < 4300-0.txt | sort | uniq -c | sort -nr | head
The command above retrieves unique words along with their counts. I'd like to retrieve punctuation marks along with the unique word counts.
What is the way to achieve this?
You could split your input with tee and extract the punctuation and the alphanumeric words separately.
echo "Helo, world!" |
{
tee >(tr -c '[:alnum:]' '\n' >&3) |
tr -c '[:punct:]' '\n'
} 3>&1 |
sed '/^$/d' |
sort | uniq -c | sort -nr | head
should output:
1 world
1 Helo
1 !
1 ,
A short sed script also seems to work:
echo "Helo, world!
OK!" |
sed '
s/\([[:alnum:]]\+\)\([^[:alnum:]]\)/\1\n\2/g
s/\([[:punct:]]\+\)\([^[:punct:]]\)/\1\n\2/g
s/[^[:punct:][:alnum:]]/\n/g
' |
sed '/^$/d' |
sort | uniq -c | sort -nr | head
should output:
2 !
1 world
1 OK
1 Helo
1 ,
You can use [:punct:] to retrieve the punctuation marks, so you can run:
tr -c '[:alnum:][:punct:]' '[\n*]' < 4300-0.txt | sort | uniq -c | sort -nr | head
It will print out the punctuation marks as well.
For example, if you have this in your txt file:
aaa,
aaa
the output will be:
1 aaa
1 aaa,
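If GNU grep is available, a similar result can also be reached with grep -o, which prints every word and every run of punctuation on its own line before counting (a sketch, not taken from the answers above):
grep -oE '[[:alnum:]]+|[[:punct:]]+' 4300-0.txt | sort | uniq -c | sort -nr | head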

Grep for string containing several metacharacters and extract 3 lines after match

I'd like to grep for 1:N:0:CGATGT within a file and extract the line containing 1:N:0:CGATGT plus 3 additional lines after it (4 lines total for each match). I've tried to grep numerous ways, all unsuccessful:
[ssabri@login2 data]$ history | tail -n 8
1028 zcat A1_S1_L008_R1_001.fastq.gz | grep -A4 "1[[:]][[N]][[:]]0[[:]]CGATGT" | wc -l
1029 zcat A1_S1_L008_R1_001.fastq.gz | grep -A4 "1[[:]][[N]][[:]]0[[:]]CGATGT$" | wc -l
1030 zcat A1_S1_L008_R1_001.fastq.gz | grep -A4 "1[[:]][[N]][[:]][[0]][[:]]CGATGT$" | wc -l
1031 zcat A1_S1_L008_R1_001.fastq.gz | grep -A4 -w "1[[:]][[N]][[:]][[0]][[:]]CGATGT$" | wc -l
1032 zcat A1_S1_L008_R1_001.fastq.gz | egrep -A4 -w "1[[:]][[N]][[:]][[0]][[:]]CGATGT$" | wc -l
1033 zcat A1_S1_L008_R1_001.fastq.gz | grep -x -A4 -w "1:N:0:CGATGT" | wc -l
1034 zcat A1_S1_L008_R1_001.fastq.gz | grep -E -A4 -w "1:N:0:CGATGT" | wc -l
1035 zcat A1_S1_L008_R1_001.fastq.gz | grep -A4 -w "1\:N\:0\:CGATGT$" | wc -l
EDIT: The input file looks something like this:
[ssabri@login2 data]$ zcat A1_S1_L008_R1_001.fastq.gz | head -n 12
@J00153:28:H7LNWBBXX:8:1101:28625:1191 1:N:0:CGAGGT
ACNTGCTCCATCCATAGCACCTAGAACAGAGCCTGGNACAGAANAAGNGC
+
A-#<-<<FJJAJFFFF-FJJJJJAJFJJJFF-A-FA#JJJJFJ#JJA#FJ
@J00153:28:H7LNWBBXX:8:1101:29457:1191 1:N:0:CGATGT
GTNGTGGTAGATCTGGACGCGGCTGAAGGCCTGGGGNCCCGTGNCAGN
+
-<#<FJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJ#JJJJJJ#JJJ#
@J00153:28:H7LNWBBXX:8:1101:31000:1191 1:N:0:CCATGT
TCNAATTATCACCATTACAGGAGGGTCAGTAGAACANGCGTTCTGGTNGG
+
<A#<AFFJJJFJJJFJJJJJJFFFJ7A<<JJFJJJJ#JJJAFJJJJF#-A
grep -A3 "1:N:0:CGATGT" file
@J00153:28:H7LNWBBXX:8:1101:29457:1191 1:N:0:CGATGT
GTNGTGGTAGATCTGGACGCGGCTGAAGGCCTGGGGNCCCGTGNCAGN
+
-<#<FJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJ#JJJJJJ#JJJ#
Sometimes simpler thinking is better: here you don't need any regex extensions, since you're matching a plain string with no special regex characters that would need escaping. The A(fter) context should be 3, since you want 3 trailing lines (the total will be 4 including the matching line).
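Since this is plain string matching, grep's fixed-string mode (-F) also works and can be fed straight from the compressed file (a sketch):
zcat A1_S1_L008_R1_001.fastq.gz | grep -F -A3 "1:N:0:CGATGT"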
I understand that you are looking for a grep solution. However, this is not the only option in text processing. If you use awk, then this might be a solution:
awk 'BEGIN {line=4;}
     /1:N:0:CGATGT/ {line=0; print $0; next;}
     {if (line<3) {print $0; line=line+1;}}' your-file
Given the problem you seem to be having with using grep and pulling out a fixed 4 lines, try this:
$ awk 'NF>1{f=0} $NF=="1:N:0:CGATGT"{f=1} f' file
@J00153:28:H7LNWBBXX:8:1101:29457:1191 1:N:0:CGATGT
GTNGTGGTAGATCTGGACGCGGCTGAAGGCCTGGGGNCCCGTGNCAGN
+
-<#<FJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJ#JJJJJJ#JJJ#
Rather than printing a fixed number of lines after a match, it will print from the first line where the last field is your target string to just before the next line that COULD contain your target string.
To identify any blocks that have some number of lines other than 4 (counting the target line), use this:
$ awk 'f && NF>1{ if (f!=5) print NR, f-1 | "cat>&2"; f=0} $NF=="1:N:0:CGATGT"{f=1} f{print; ++f}' file
It will output to stderr the input file line number and the count of the number of lines in the unexpected block.
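To run the same awk command directly on the compressed file from the question, the input can simply be piped in (a sketch):
zcat A1_S1_L008_R1_001.fastq.gz | awk 'NF>1{f=0} $NF=="1:N:0:CGATGT"{f=1} f'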

Removing blank space before string

I would like to count the occurrences of strings in a certain file using pipelines, without the awk and sed commands.
my_file content:
ls -al
bash
cat datoteka.txt
cat d.txt | sort | less
/bin/bash
terminal command that I use:
cat $my_file | cut -d ' ' -f 1 | tr '|' '\n' | xargs -r -L1 basename | sort | uniq -c | xargs -r -L1 sh -c 'echo $1 $0'
desired output:
bash 2
cat 2
less 1
ls 1
sort 1
In my case, I get:
bash 2
cat 2
ls 1
_sort 1 (not counted )
_less 1 (not counted )
The sort and less commands are not counted because of the whitespace (which I marked with _) in front of those two strings. How should I improve my code to remove this blank space before "sort" and "less"? Thanks in advance!
Update: Here is a second and longer example of an input file:
nl /etc/passwd
seq 1 10 | tr "\n" ","
seq 1 10 | tr -d 13579 | tr -s "\n "
seq 1 100 | split -d -a 2 -l10 - blabla-
uname -a | cut -d" " -f1,3
cut -d: -f1 /etc/passwd > fst
cut -d: -f3 /etc/passwd > scnd
ps -e | column
echo -n ABC | wc -m -c
cmp -s dat1.txt dat1.txt ; echo $?
diff dat1 dat2
ps -e | grep firefox
echo dat1 dat2 dat3 | tr " " "\n" | xargs -I {} -p ln -s {}
The problem with the code in the question, as you were aware, was with the cut statement. This replaces cut with a shell while loop that also includes the basename command:
$ tr '|' '\n' <my_file | while read cmd other; do basename "$cmd"; done | sort | uniq -c | xargs -r -L1 sh -c 'echo $1 $0'
bash 2
cat 2
less 1
ls 1
sort 1
Alternate Sorting
The above sorts the results alphabetically by the name of the command. If instead we want to sort in descending numerical order of number of occurrences, then:
tr '|' '\n' <file2 | while read cmd other; do basename "$cmd"; done | sort | uniq -c | xargs -r -L1 sh -c 'echo $1 $0' | sort -snrk2
Applying this command to the second input example in the question:
$ tr '|' '\n' <file2 | while read cmd other; do basename "$cmd"; done | sort | uniq -c | xargs -r -L1 sh -c 'echo $1 $0' | sort -snrk2
tr 4
cut 3
seq 3
echo 2
ps 2
cmp 1
column 1
diff 1
grep 1
nl 1
split 1
uname 1
wc 1
xargs 1
while IFS='|' read -ra commands; do
for cmd in "${commands[@]}"; do
set -- $cmd # unquoted to discard irrelevant whitespace
basename $1
done
done < myfile |
sort |
uniq -c |
while read num cmd; do
echo "$cmd $num"
done
bash 2
cat 2
less 1
ls 1
sort 1

Trying to get the file with the most lines to print along with its line count

So I've been goofing with this since last night and I can get a lot of things to happen, just not what I want.
I need code to find the file with the most lines in a directory and then print the name of the file and the number of lines that file has.
I can get the entire directory's lines to print but can't seem to narrow the field so to speak.
Any help for a fool of a learner?
wc -l $1/* 2>/dev/null
| grep -v ' total$'
| sort -n -k1
| tail -1l
After some pro help in another question, this is where I got to, but it returns them all, and doesn't print their line counts.
The following awk command should do the job for you, and you can avoid all the redundant piped commands:
wc -l $1/* | awk '$2 != "total"{if($1>max){max=$1;fn=$2}} END{print max, fn}'
UPDATE: To avoid the last line of wc's output, this might be a better awk command:
wc -l $1/* | awk '{arr[cnt++]=$0} END {for (i=0; i<length(arr)-1; i++)
{split(arr[i], a, " "); if(a[1]>max) {max=a[1]; fn=a[2]}} print max, fn}'
you can try:
wc -l $1/* | grep -v total | sort -g | tail -1
Actually, to avoid the grep (which would also remove files whose names contain "total"):
for f in $1/*; do wc -l $f; done | sort -g | tail -1
or even better, as suggested in comments:
wc -l $1/* | sort -rg | sed -n '2p'
you can even make it a function:
function get_biggest_file() {
wc -l $* | sort -rg | sed -n '2p'
}
% ls -l
... 0 Jun 12 17:33 a
... 0 Jun 12 17:33 b
... 0 Jun 12 17:33 c
... 0 Jun 12 17:33 d
... 25 Jun 12 17:33 total
% get_biggest_file ./*
5 total
EDIT2: Using the function I gave, you can simply output what you need as follows:
get_biggest_file $1/* | awk '{print "The file \"" $2 "\" has the maximum number of lines: " $1}'
EDIT: If you write the pipeline the way it appears in the question, you should add a line continuation character at the end of each line, as follows, or your shell will think you're trying to issue 4 separate commands:
wc -l $1/* 2>/dev/null \
| grep -v ' total$' \
| sort -n -k1 \
| tail -1l
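Putting the corrected pipeline from the question together with the formatting from EDIT2 gives one small script (a sketch; the script name is illustrative):
#!/bin/bash
# usage: ./maxlines.sh directory
wc -l "$1"/* 2>/dev/null \
    | grep -v ' total$' \
    | sort -n -k1 \
    | tail -n 1 \
    | awk '{print "The file \"" $2 "\" has the maximum number of lines: " $1}'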

How do I compare 80 md5sums with each other in bash

I have to compare the md5sums of 80 copies of the same file with each other and report a failure on a mismatch. How do I do it effectively in bash? I am looking for an elegant algorithm to do it.
md5sum FILES | sed 's/ .*$//' | sort -u
If you get more than one line of output, you have a mismatch.
(This doesn't tell you where the mismatch is.)
Putting it together, and replacing the sed command with a somewhat less terse awk command:
count=$(md5sum "$#" | awk '{print $1}' | sort -u | wc -l)
if [ $count -eq 1 ] ; then
echo "Everything matches"
else
echo "Nope"
fi
The output of:
md5sum $files | sort -k 1,2
is a list of the checksums in sorted order, with the corresponding file names afterwards. If you need to eyeball the results, this might be sufficient. If you need to identify odd-ball results, you have to decide on the presentation. You say you've got 80 copies of 'the same file'. Suppose there are actually 10 copies of each of 8 versions of 'the file'. How are you going to decide which is correct and which is bogus? What if you have 41 with one hash and 39 with another - are you sure the 39 are wrong and the 41 correct? Clearly, it is likely that one hash will predominate, but you'll have to worry about those pesky boundary conditions.
You can also do fancier things, such as:
md5sum $files | sort -k 1,2 > sorted.md5
sed 's/ .*//' sorted.md5 | uniq -c | sed 's/^ *\([0-9][0-9]*\) \(.*\)/\2 \1/' > counted.md5
join -j 1 -o 1.1,2.2,1.2 sorted.md5 counted.md5
This gives you an output consisting of the MD5 checksum, repetition count, and file name. The first sed script could be replaced by awk '{print $1}' if you prefer. The second would be replaced by awk '{printf "%s %s\n", $2, $1}', which is probably clearer (and is shorter). The reason for that futzing around is to get rid of the leading spaces in the output of uniq -c which confuse join.
md5sum $files | sort -k 1,2 > sorted.md5
awk '{print $1}' sorted.md5 | uniq -c | awk '{printf "%s %s\n", $2, $1}' > counted.md5
join -j 1 -o 1.1,2.2,1.2 sorted.md5 counted.md5
I created some files x1.h, x2.h and x3.h by copying dbatools.h, and set files=$(ls *.h). The output was:
0763af91756ef24f3d8f61131eb8f8f2 1 dblbac.h
10215826449a3e0f967a4c436923cffa 1 dbatool.h
37f48869409c2b0554d83bd86034c9bf 4 dbatools.h
37f48869409c2b0554d83bd86034c9bf 4 x1.h
37f48869409c2b0554d83bd86034c9bf 4 x2.h
37f48869409c2b0554d83bd86034c9bf 4 x3.h
5a48695c6b8673373d30f779ccd3a3c2 1 dbxglob.h
7b22f7e2373422864841ae880aad056d 1 dbstringlist.h
a5b8b19715f99c7998c4519cd67f0230 1 dbimglob.h
f9ef785a2340c7903b8e1ae4386df211 1 dbmach11.h
This can be further processed as necessary (for example, with sort -k2,3nr to get the counts in decreasing order, so the deviant files appear last). You have the names of the duplicate files grouped together along with a count telling you how many copies there are of each. What you do next is up to you.
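For example, to get the counts in decreasing order so that the deviant files appear last (a sketch combining the commands above):
join -j 1 -o 1.1,2.2,1.2 sorted.md5 counted.md5 | sort -k2,3nr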
A real production script would use temporary file names instead of hard-coded names, of course, and would clean up after itself.
md5sum FILES > MD5SUMS.md5
cut -c1-32 < MD5SUMS.md5 | sort | uniq -c | sort -n
will return something like this:
1 485fd876eef8e941fcd6fc19643e5e59
1 585fd876eef8e941fcd6fc19643e5e59
5 385fd876eef8e941fcd6fc19643e5e59
Reading: 5 files have the same checksum, two others have "individual" checksums. I assume that the majority is right, so an additional
| tail -1 | cut -c 9-
returns the checksum of the last line. Now filter everything else (and put the parts together):
md5sum FILES > MD5SUMS.md5
grep -v "$(cut -c1-32 < MD5SUMS.md5 | sort | uniq -c | sort -n | tail -1 | cut -c 9-)" MD5SUMS.md5 | cut -c35-
This will print the filenames of the non-majority files.
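The same idea can be wrapped into a quick pass/fail check, which matches the original goal of reporting a failure on a mismatch (a sketch; the script name is hypothetical, and the column offsets are the ones assumed above):
#!/bin/bash
# usage: ./check_copies.sh file1 file2 ... file80
md5sum "$@" > MD5SUMS.md5
majority=$(cut -c1-32 < MD5SUMS.md5 | sort | uniq -c | sort -n | tail -1 | cut -c 9-)
bad=$(grep -v "$majority" MD5SUMS.md5 | cut -c35-)
if [ -z "$bad" ]; then
    echo "All copies match"
else
    echo "Mismatch in: $bad" >&2
    exit 1
fi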
