Hello and thank you for taking the time to read this question. For the last day I have been trying to solve a problem and haven’t come any closer to a solution. I have a sample file of data that contains the following:
Fighter#Trainer
Bobby#SamBonen
Billy#BobBrown
Sammy#DJacobson
James#DJacobson
Donny#SonnyG
Ben#JasonS
Dave#JuanO
Derrek#KMcLaughlin
Dillon#LGarmati
Orson#LGarmati
Jeff#RodgerU
Brad#VCastillo
The goal is to identify “Trainers” that have more than one fighter. My gut feeling is that the “getline” statement and variable assignments in AWK are going to be needed. I have tried different combinations of
awk -F# 'NR>1{a=$2; getline; if($2 = a) {print $0,"Yes"} else {print $0,"NO"}}' sample.txt
Yet, the output is nowhere near the desired results. In fact, it doesn’t even output all the rows in the sample file!
My desired results are:
Fighter#Trainer
Bobby#SamBonen#NO
Billy#BobBrown#NO
Sammy#DJacobson#YES
James#DJacobson#YES
Donny#SonnyG#NO
Ben#JasonS#NO
Dave#JuanO#NO
Derrek#KMcLaughlin#NO
Dillon#LGarmati#YES
Orson#LGarmati#YES
Jeff#RodgerU#NO
Brad#VCastillo#NO
I am completely lost as to where to go from here. I have been searching and trying to find a solution to no avail, and I'm looking for some input. Thank you!
You don't need getline. You can just process the input normally, building up counts per trainer, and print the results in an END block:
awk -F# '{
    lines[NR] = $0;
    trainers[NR] = $2;
    counts[$2]++;
}
END {
    # print the header, then each line with a YES/NO flag appended
    print lines[1];
    for (i = 2; i <= NR; i++) {
        print lines[i] "#" (counts[trainers[i]] > 1 ? "YES" : "NO");
    }
}' sample.txt
Another option is to make two passes:
$ cat p.awk
BEGIN {FS=OFS="#"}
NR==1 {print;next};
NR==FNR {++trainers[$2]; next}
FNR>1 {$3=(trainers[$2]>1)?"YES":"NO"; print}
$ awk -f p.awk p.txt p.txt
Fighter#Trainer
Bobby#SamBonen#NO
Billy#BobBrown#NO
Sammy#DJacobson#YES
James#DJacobson#YES
Donny#SonnyG#NO
Ben#JasonS#NO
Dave#JuanO#NO
Derrek#KMcLaughlin#NO
Dillon#LGarmati#YES
Orson#LGarmati#YES
Jeff#RodgerU#NO
Brad#VCastillo#NO
Explained:
Set the input and output file separators:
BEGIN {FS=OFS="#"}
Print the header:
NR==1 {print;next};
First pass, count occurrences of each trainer:
NR==FNR {++trainers[$2]; next}
Second pass, set YES or NO according to trainer count, and print result:
FNR>1 {$3=(trainers[$2]>1)?"YES":"NO"; print}
I have the following file:
1,A
2,B
3,C
10000,D
20,E
4000,F
I want to select the lines having a count greater than 10 and less than 5000. The output should be E and F. In C++ or any other language this is a piece of cake, but I really want to know how to do it with a Linux command.
I tried the following command
awk -F ',' '{$1 >= 10 && $1 < 5000} { count++ } END { print $1,$2}' test.txt
But it is only giving 4000,F.
Just do:
awk -F',' '$1 >= 10 && $1 < 5000' test.txt
You put a boolean check inside {....} and don't use the result at all; that doesn't make any sense. You should write either {if(...) ...} or booleanExpression{do...} (see the sketch below this list).
The count++ is useless.
You have only a print statement in END, so only the last line is printed.
Your script actually does this:
print the last line of test.txt, no matter what it is.
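For example, these two forms apply the same filter and are equivalent to the bare condition above (print $0 is awk's default action):
awk -F',' '{ if ($1 >= 10 && $1 < 5000) print $0 }' test.txt
awk -F',' '$1 >= 10 && $1 < 5000 { print $0 }' test.txt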
Hi, I am looking for an awk that can find two patterns and print the data between them to a file, but only if a third pattern appears between them.
For example:
Start
1
2
middle
3
End
Start
1
2
End
And the output will be:
Start
1
2
middle
3
End
I found on the web awk '/pattern1/,/pattern2/' path > text.txt,
but I need only the output that has the third pattern in the middle.
And here is a solution without flags:
$ awk 'BEGIN{RS="End"}/middle/{printf "%s", $0; print RT}' file
Start
1
2
middle
3
End
Explanation: The RS variable is the record separator. We set it to "End", so that each record is terminated by "End".
Then we select the records that contain "middle" with the /middle/ filter, and for each matched record we print the record itself with $0 and the separator that terminated it with print RT. (Note that a multi-character RS and the RT variable are gawk extensions.)
This awk should work:
awk '$1=="Start"{ok++} ok>0{a[b++]=$0} $1=="middle"{ok++} $1=="End"{if(ok>1) for(i=0; i<length(a); i++) print a[i]; ok=0;b=0;delete a}' file
Start
1
2
middle
3
End
Expanded:
awk '$1 == "Start" {
ok++
}
ok > 0 {
a[b++] = $0
}
$1 == "middle" {
ok++
}
$1 == "End" {
if (ok > 1)
for (i=0; i<length(a); i++)
print a[i];
ok=0;
b=0;
delete a
}' file
Just use some flags with awk:
/Start/ {
    start_flag=1
    n=NR                 # remember where this block starts
}
/middle/ {
    mid_flag=1
}
start_flag {
    lines[NR]=$0
}
/End/ {
    if (start_flag && mid_flag)
        for(i=n;i<=NR;i++)
            print lines[i]
    start_flag=mid_flag=0
    delete lines
}
A modified version of user000001's awk:
awk '/middle/{printf "%s%s\n",$0,RT}' RS="End" file
EDIT:
Added test for Start tag
awk '/Start/ && /middle/{printf "%s%s\n",$0,RT}' RS="End" file
This will work with any modern awk:
awk '/Start/{f=1;rec=""} f{rec=rec $0 ORS} /End/{if (rec~/middle/) printf "%s",rec}' file
The solutions that set RS to "End" are gawk-specific, which may be fine but it's definitely worth mentioning.
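For readability, that one-liner expands to:
awk '
/Start/ { f = 1; rec = "" }     # start (or restart) collecting at each Start
f       { rec = rec $0 ORS }    # accumulate every line while collecting
/End/   { if (rec ~ /middle/)   # at End, print the block only if it contains middle
              printf "%s", rec }
' file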
I have some CSV/tabular data in a file, like so:
1,7,3,2
8,3,8,0
4,9,5,3
8,5,7,3
5,6,1,9
(They're not always numbers, just random comma-separated values. Single-digit numbers are easier for an example, though.)
I want to shuffle a random 40% of any of the columns. As an example, say the 3rd one. So perhaps 3 and 1 get swapped with each other. Now the third column is:
1 << Came from the last position
8
5
7
3 << Came from the first position
I am trying to do this in place in a file from within a bash script that I am working on, and I am not having much luck. I keep wandering down some pretty crazy and fruitless grep rabbit holes that leave me thinking that I'm going the wrong way (the constant failure is what's tipping me off).
I tagged this question with a litany of things because I'm not entirely sure which tool(s) I should even be using for this.
Edit: I'm probably going to end up accepting Rubens' answer, however wacky it is, because it directly contains the swapping concept (which I guess I could have emphasized more in the original question), and it allows me to specify a percentage of the column for swapping. It also happens to work, which is always a plus.
For someone who doesn't need this, and just wants a basic shuffle, Jim Garrison's answer also works (I tested it).
A word of warning, however, on Rubens' solution. I took this:
for (i = 1; i <= NF; ++i) {
delim = (i != NF) ? "," : "";
...
}
printf "\n";
removed the printf "\n"; and moved the newline character up like this:
for (i = 1; i <= NF; ++i) {
delim = (i != NF) ? "," : "\n";
...
}
because just having "" on the else case was causing awk to write broken characters at the end of each line (\00). At one point, it even managed to replace my entire file with Chinese characters. Although, honestly, this probably involved me doing something extra stupid on top of this problem.
This will work for a specifically designated column, but should be enough to point you in the right direction. This works on modern bash shells including Cygwin:
paste -d, <(cut -d, -f1-2 test.dat) <(cut -d, -f3 test.dat|shuf) <(cut -d, -f4- test.dat)
The operative feature is "process substitution".
The paste command joins files horizontally, and the three pieces are split from the original file via cut, with the second piece (the column to be randomized) run through the shuf command to reorder the lines. Here's the output from running it a couple of times:
$ cat test.dat
1,7,3,2
8,3,8,0
4,9,5,3
8,5,7,3
5,6,1,9
$ paste -d, <(cut -d, -f1-2 test.dat) <(cut -d, -f3 test.dat|shuf) <(cut -d, -f4- test.dat)
1,7,1,2
8,3,8,0
4,9,7,3
8,5,3,3
5,6,5,9
$ paste -d, <(cut -d, -f1-2 test.dat) <(cut -d, -f3 test.dat|shuf) <(cut -d, -f4- test.dat)
1,7,8,2
8,3,1,0
4,9,3,3
8,5,7,3
5,6,5,9
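To make the column number a parameter, the same pipeline generalizes naturally; a sketch, assuming the target column is neither the first nor the last:
col=3
paste -d, <(cut -d, -f1-$((col-1)) test.dat) \
          <(cut -d, -f"$col" test.dat | shuf) \
          <(cut -d, -f$((col+1))- test.dat)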
Algorithm:
create a vector of n pairs, pairing each line number (1 to the number of lines) with that line's value in the selected column, and then sort it randomly;
find how many lines should be randomized: num_random = percentage * num_lines / 100;
select the first num_random entries from your randomized vector;
you could shuffle the selected entries again, but they are already in random order from the previous sort;
printing output:
i = 0
for num_line, value in column; do
if num_line not in random_vector:
print value; # printing non-randomized value
else:
print random_vector[i]; # randomized entry
i++;
done
Implementation:
#! /bin/bash
infile=$1
col=$2
n_lines=$(wc -l < ${infile})
prob=$(bc <<< "$3 * ${n_lines} / 100")
# Selected lines
tmp=$(tempfile)
paste -d ',' <(seq 1 ${n_lines}) <(cut -d ',' -f ${col} ${infile}) \
| sort -R | head -n ${prob} > ${tmp}
# Rewriting file
awk -v "col=$col" -F "," '
(NR == FNR) {id[$1] = $2; next}
(FNR == 1) {
i = c = 1;
for (v in id) {value[i] = id[v]; ++i;}
}
{
for (i = 1; i <= NF; ++i) {
delim = (i != NF) ? "," : "";
if (i != col) {printf "%s%c", $i, delim; continue;}
if (FNR in id) {printf "%s%c", value[c], delim; c++;}
else {printf "%s%c", $i, delim;}
}
printf "\n";
}
' ${tmp} ${infile}
rm ${tmp}
If you want something close to in-place editing, you can pipe the output back into the input file using sponge (from moreutils).
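For example (assuming sponge is installed):
$ ./script.sh infile 3 40 | sponge infile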
Execution:
To execute, simply use:
$ ./script.sh <inpath> <column> <percentage>
As in:
$ ./script.sh infile 3 40
1,7,3,2
8,3,8,0
4,9,1,3
8,5,7,3
5,6,5,9
Conclusion:
This allows you to select the column, randomly shuffle a percentage of entries in that column, and write the new column back into the original file.
This script proves, like no other, not only that shell scripting is extremely entertaining, but also that there are cases where it definitely should not be used. (:
I'd use a 2-pass approach: first get a count of the number of lines and read the file into an array, then use awk's rand() function to generate random numbers identifying the lines you'll change, and rand() again to determine which pairs of those lines to swap, then swap the array elements before printing. Something like this PSEUDO-CODE, rough algorithm:
awk -F, -v pct=40 -v col=3 '
NR == FNR {
array[++totNumLines] = $0
next
}
FNR == 1{
pctNumLines = totNumLines * pct / 100
srand()
for (i=1; i<=(pctNumLines / 2); i++) {
oldLineNr = rand() * some factor to produce a line number that's in the 1 to totNumLines range but is not already recorded as processed in the "swapped" array.
newLineNr = ditto plus must not equal oldLineNr
swap field $col between array[oldLineNr] and array[newLineNr]
swapped[oldLineNr]
swapped[newLineNr]
}
}
{ print array[FNR] }
' "$file" "$file" > tmp &&
mv tmp "$file"
I am struggling with this awk code, which should emulate the tail command:
num=$1;
{
vect[NR]=$0;
}
END{
for(i=NR-num;i<=NR;i++)
print vect[$i]
}
So what I'm trying to achieve here is a tail command emulated by awk. For example, cat somefile | awk -f tail.awk 10 should print the last 10 lines of a text file. Any suggestions?
All of these answers store the entire source file. That's a horrible idea and will break on larger files.
Here's a quick way to store only the number of lines to be outputted (note that the more efficient tail will always be faster because it doesn't read the entire source file!):
awk -vt=10 '{o[NR%t]=$0}END{i=(NR<t?0:NR);do print o[++i%t];while(i%t!=NR%t)}'
more legibly (and with less code golf):
awk -v tail=10 '
{
output[NR % tail] = $0
}
END {
if(NR < tail) {
i = 0
} else {
i = NR
}
do {
i = (i + 1) % tail;
print output[i]
} while (i != NR % tail)
}'
Explanation of legible code:
This uses the modulo operator to store only the desired number of items (the tail variable). As each line is parsed, it is stored on top of older array values (so line 11 gets stored in output[1]).
The END stanza sets an increment variable i to either zero (if we've got fewer than the desired number of lines) or else the number of lines, which tells us where to start recalling the saved lines. Then we print the saved lines in order. The loop ends when we've returned to that first value (after we've printed it).
You can replace the if/else stanza (or the ternary clause in my golfed example) with just i = NR if you don't care about getting blank lines to fill the requested number (echo "foo" |awk -vt=10 … would have nine blank lines before the line with "foo").
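With that replacement, the simplified version looks like this (a sketch; when the input has fewer than tail lines, the extra entries print as blanks):
awk -v tail=10 '
{
    output[NR % tail] = $0
}
END {
    i = NR
    do {
        i = (i + 1) % tail
        print output[i]
    } while (i != NR % tail)
}'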
for(i=NR-num;i<=NR;i++)
print vect[$i]
In awk, $ is the field-access operator, so $i means field number i of the current record, not the variable i. Use just plain i:
for(i=NR-num;i<=NR;i++)
print vect[i]
The full code that worked for me is:
#!/usr/bin/awk -f
BEGIN{
num=ARGV[1];
# Make that arg empty so awk doesn't interpret it as a file name.
ARGV[1] = "";
}
{
vect[NR]=$0;
}
END{
for(i=NR-num;i<=NR;i++)
print vect[i]
}
You should probably add some code to the END to handle the case when NR < num.
You need to add -v num=10 to the awk command line to set the value of num. And start at NR-num+1 in your final loop, otherwise you'll end up with num+1 lines of output.
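Putting both fixes together (a sketch, assuming the input has at least num lines):
$ awk -v num=10 '{vect[NR]=$0} END{for(i=NR-num+1;i<=NR;i++) print vect[i]}' somefile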
This might work for you:
awk '{a=a b $0;b=RS;if(NR<=v)next;a=substr(a,index(a,RS)+1)}END{print a}' v=10
It appends each line to the buffer a (b holds the separator, empty before the first line), and once more than v lines are buffered it trims the oldest line off the front; END prints what remains.
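More legibly:
awk '
{
    a = a b $0                        # append the current line to the buffer
    b = RS                            # prepend RS before each subsequent line
    if (NR <= v) next                 # still filling the buffer
    a = substr(a, index(a, RS) + 1)   # drop the oldest buffered line
}
END { print a }' v=10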