Search for a word from a certain line onward with grep

File type = "ooTextFile"
Object class = "TextGrid"
xmin = 0
xmax = 82.7959410430839
tiers? <exists>
size = 1
item []:
item [1]:
class = "IntervalTier"
name = "ortho"
xmin = 0
xmax = 82.7959410430839
intervals: size = 6
intervals [1]:
xmin = 0
xmax = 15.393970521541949
text = "Aj tento rok organizuje Rádio Sud piva. Kto chce súťažiť, nemusí sa nikde registrovať.
intervals [2]:
xmin = 15.393970521541949
xmax = 27.58997052154195
.
.
.
Hi, I am working with hundreds of text files like this.
I want to filter out all the xmin = ... values from each file, but only from the 16th line onward, because the xmin values at the start are useless, as you can see.
I tried:
cat text.txt | grep xmin
but it shows every line where xmin appears.
Please help me. I can't modify the text files, because I need to work with hundreds of them, so I have to find a suitable way to filter them as they are.

Like this:
awk 'FNR>15 && /xmin/' file*
xmin = 0
xmin = 15.393970521541949
It shows all xmin values from line 16 onward.
You can also print the file name along with each match:
awk 'FNR>15 && /xmin/ {$1=$1;print FILENAME" -> "$0}' file*
file22 -> xmin = 0
file22 -> xmin = 15.393970521541949
Update: it needs to be FNR (not NR) to work with multiple files.
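For illustration, a minimal sketch of why FNR matters here (assuming two input files): NR keeps counting across all files, while FNR restarts at 1 for each file, so only FNR skips the first 15 lines of every file.
awk 'NR>15 && /xmin/' file1 file2     # skips 15 lines only once, over the combined input
awk 'FNR>15 && /xmin/' file1 file2    # skips the first 15 lines of each file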

Using sed and grep to look for "xmin" from the 16th line to the end of a single file:
sed -n '16,$p' foobar.txt | grep "xmin"
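A single sed invocation should also work here (a hedged alternative, assuming GNU sed), by restricting the match to the same line range:
sed -n '16,${/xmin/p;}' foobar.txt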
In case of multiple files, here is a bash script to get the output:
#!/bin/bash
for file in "$1"/*; do
    output=$(sed -n '16,$p' "$file" | grep "xmin")
    if [[ -n $output ]]; then
        echo -e "$file has the following entries:\n$output"
    fi
done
Run the script as bash script.sh /directory/containing/the/files/to/be/searched

Related

AWK print every other column, starting from the last column (and next to last column) for N iterations (print from right to left)

Hopefully someone out there in the world can help me, and anyone else with a similar problem, find a simple solution to capturing data. I have spent hours trying a one-liner to solve something I thought was a simple problem involving awk, a CSV file, and saving the output as a bash variable. In short, here's the nut...
The Missions:
1) To output every other column, starting from the LAST COLUMN, with a specific iteration count.
2) To output every other column, starting from NEXT TO LAST COLUMN, with a specific iteration count.
The Data (file.csv):
#12#SayWhat#2#4#2.25#3#1.5#1#1#1#3.25
#7#Smarty#9#6#5.25#5#4#4#3#2#3.25
#4#IfYouLike#4#1#.2#1#.5#2#1#3#3.75
#3#LaughingHard#8#8#13.75#8#13#6#8.5#4#6
#10#AtFunny#1#3#.2#2#.5#3#3#5#6.5
#8#PunchLines#7#7#10.25#7#10.5#8#11#6#12.75
Desired results for Mission 1:
2#2.25#1.5#1#3.25
9#5.25#4#3#3.25
4#.2#.5#1#3.75
8#13.75#13#8.5#6
1#.2#.5#3#6.5
7#10.25#10.5#11#12.75
Desired results for Mission 2:
SayWhat#4#3#1#1
Smarty#6#5#4#2
IfYouLike#1#1#2#3
LaughingHard#8#8#6#4
AtFunny#3#2#3#5
PunchLines#7#7#8#6
My Attempts:
The closest I have come to solving any of the above problems is an ugly pipeline (which is OK for skinning a cat) for Mission 1. However, it doesn't use any declared iteration count (which should be 5). Also, I'm completely lost on solving Mission 2.
Any help to simplify the below and to solve Mission 2 will be HELLA appreciated!
outcome=$( awk 'BEGIN {FS = "#"} {for (i = 0; i <= NF; i += 2) printf ("%s%c", $(NF-i), i + 2 <= NF ? "#" : "\n");}' file.csv | sed 's/##.*//g' | awk -F# '{for (i=NF;i>0;i--){printf $i"#"};printf "\n"}' | sed 's/#$//g' | awk -F# '{$1="";print $0}' OFS=# | sed 's/^#//g' );
Also, if doing a loop for a specific number of iterations is helpful in solving this problem, the magic number is 5. Maybe a solution could be a for loop that counts from right to left, treating every other column as one iteration, with the starting column declared as an awk variable (just a thought; I have no idea how to do it).
Thank you for looking over this problem.
There are certainly more elegant ways to do this, but I am not really an awk person:
Part 1:
awk -F# '{ x = ""; for (f = NF; f > (NF - 5 * 2); f -= 2) { x = x ? $f "#" x : $f ; } print x }' file.csv
Output:
2#2.25#1.5#1#3.25
9#5.25#4#3#3.25
4#.2#.5#1#3.75
8#13.75#13#8.5#6
1#.2#.5#3#6.5
7#10.25#10.5#11#12.75
Part 2:
awk -F# '{ x = ""; for (f = NF - 1; f > (NF - 5 * 2); f -= 2) { x = x ? $f "#" x : $f ; } print x }' file.csv
Output:
SayWhat#4#3#1#1
Smarty#6#5#4#2
IfYouLike#1#1#2#3
LaughingHard#8#8#6#4
AtFunny#3#2#3#5
PunchLines#7#7#8#6
The literal 5 in each of those is your "number of iterations."
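If the iteration count should be configurable, it could be passed in as an awk variable instead of a literal; a small variation on the same one-liner (the variable name n is just an example):
awk -F# -v n=5 '{ x = ""; for (f = NF; f > (NF - n * 2); f -= 2) { x = x ? $f "#" x : $f } print x }' file.csv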
Sample data:
$ cat mission.dat
#12#SayWhat#2#4#2.25#3#1.5#1#1#1#3.25
#7#Smarty#9#6#5.25#5#4#4#3#2#3.25
#4#IfYouLike#4#1#.2#1#.5#2#1#3#3.75
#3#LaughingHard#8#8#13.75#8#13#6#8.5#4#6
#10#AtFunny#1#3#.2#2#.5#3#3#5#6.5
#8#PunchLines#7#7#10.25#7#10.5#8#11#6#12.75
One awk solution:
NOTE: OP can add logic to validate the input parameters.
$ cat mission
#!/bin/bash
# format: mission { 1 | 2 } { number_of_fields_to_display }
mission=${1} # assumes user inputs "1" or "2"
offset=$(( mission - 1 )) # subtract one to determine awk/NF offset
iteration_count=${2} # assume for now this is a positive integer
awk -F"#" -v offset=${offset} -v itcnt=${iteration_count} 'BEGIN { OFS=FS }
{ # we will start by counting fields backwards until we run out of fields
# or we hit "itcnt==iteration_count" fields
loopcnt=0
for (i=NF-offset ; i>=0; i-=2) # offset=0 for mission=1; offset=1 for mission=2
{ loopcnt++
if (loopcnt > itcnt)
break
fstart=i # keep track of the field we want to start with
}
# now printing our fields starting with field # "fstart";
# prefix the first printf with an empty string, then each successive
# field is prefixed with OFS=#
pfx = ""
for (i=fstart; i<= NF-offset; i+=2)
{ printf "%s%s",pfx,$i
pfx=OFS
}
# terminate a line of output with a linefeed
printf "\n"
}
' mission.dat
Some test runs:
###### mission #1
# with offset/iteration = 4
$ mission 1 4
2.25#1.5#1#3.25
5.25#4#3#3.25
.2#.5#1#3.75
13.75#13#8.5#6
.2#.5#3#6.5
10.25#10.5#11#12.75
# with offset/iteration = 5
$ mission 1 5
2#2.25#1.5#1#3.25
9#5.25#4#3#3.25
4#.2#.5#1#3.75
8#13.75#13#8.5#6
1#.2#.5#3#6.5
7#10.25#10.5#11#12.75
# with offset/iteration = 6
$ mission 1 6
12#2#2.25#1.5#1#3.25
7#9#5.25#4#3#3.25
4#4#.2#.5#1#3.75
3#8#13.75#13#8.5#6
10#1#.2#.5#3#6.5
8#7#10.25#10.5#11#12.75
###### mission #2
# with offset/iteration = 4
$ mission 2 4
4#3#1#1
6#5#4#2
1#1#2#3
8#8#6#4
3#2#3#5
7#7#8#6
# with offset/iteration = 5
$ mission 2 5
SayWhat#4#3#1#1
Smarty#6#5#4#2
IfYouLike#1#1#2#3
LaughingHard#8#8#6#4
AtFunny#3#2#3#5
PunchLines#7#7#8#6
# with offset/iteration = 6;
# notice we pick up field #1 = empty string so output starts with a '#'
$ mission 2 6
#SayWhat#4#3#1#1
#Smarty#6#5#4#2
#IfYouLike#1#1#2#3
#LaughingHard#8#8#6#4
#AtFunny#3#2#3#5
#PunchLines#7#7#8#6
This is probably not exactly what you're asking, but perhaps it will give you an idea.
$ awk -F_ -v skip=4 -v endoff=0 '
BEGIN {OFS=FS}
{offset=(NF-endoff)%skip;
for(i=offset;i<=NF-endoff;i+=skip) printf "%s",$i (i>=(NF-endoff)?ORS:OFS)}' file
112_116_120
122_126_130
132_136_140
142_146_150
You specify the skip between columns and the end offset as input variables. Here, for the last column, the end offset is set to zero and the skip is 4.
For clarity I used the input file:
$ cat file
_111_112_113_114_115_116_117_118_119_120
_121_122_123_124_125_126_127_128_129_130
_131_132_133_134_135_136_137_138_139_140
_141_142_143_144_145_146_147_148_149_150
Changing FS to match your format should work.

Looping through a table and append information of that table to another file

This is my first post and I'm fairly new to bash scripting.
We ran some experiments where I work, and for plotting them in gnuplot we need to append a reaction label to each result.
We have a file that looks like this:
G135b CH2O+HCO=O2+C2H3
R020b 2CO+H=OH+C2O
R021b 2CO+O=O2+C2O
and a result file (which I can't access right now, sorry) whose first column matches the first column of the file shown, followed by multiple values. The lines are not in the same order.
Now I want to loop through the result file, take the value of the first column, search for it in the file shown, and append the reaction label to that line.
How can I loop through all the lines of the result file and put the value of the first column into a temporary variable?
I want to use this variable like this:
grep -r '^$var' shownfile | awk '{print $2}'
(This gives back something like CH2O+HCO=O2+C2H3.)
How can I append that result to the corresponding line of the result file?
Edit: I also wrote a script to go from a file that looks like this:
G135b : 0.178273 C H 2 O + H C O = O 2 + C 2 H 3
to this:
G135b CH2O+HCO=O2+C2H3
which is:
#!/bin/bash
file=$(pwd)
cd $file
# echo "$file"
cut -f1,3 $file/newfile >>tmpfile
sed -i "s/://g" tmpfile
sed -i "s/ //g" tmpfile
cp tmpfile newfile
How do I make cut edit a file in place, like -i does for sed? My workaround is pretty ugly because it creates another file in the current directory.
Thank you :)
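One common idiom for the in-place cut question, since cut has no -i flag: write to a temporary file and move it back (or use sponge from moreutils, if it is available). A minimal sketch, reusing the file names from the script above:
cut -f1,3 newfile > tmpfile && mv tmpfile newfile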
The join command works here; by default it performs an inner join of the two files on their first columns. (Note that join expects both inputs to be sorted on the join field, which the sample data already is.)
$ cat data
G135b CH2O+HCO=O2+C2H3
R020b 2CO+H=OH+C2O
R021b 2CO+O=O2+C2O
$ cat result_file
G135b a b c
R020b a b
R021b a b x y z
$ join data result_file
G135b CH2O+HCO=O2+C2H3 a b c
R020b 2CO+H=OH+C2O a b
R021b 2CO+O=O2+C2O a b x y z
Using awk, it would be something like:
NR == FNR { data[$1] = $2; next; }   # first file: remember the label for each ID
{ print $0 " " data[$1]; }           # second file: append the matching label
Save that in a file called reactions.awk, then call awk -f reactions.awk shownfile resultfile.
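With the sample data and result_file shown above, this would be expected to append the label at the end of each result line, e.g.:
G135b a b c CH2O+HCO=O2+C2H3
R020b a b 2CO+H=OH+C2O
R021b a b x y z 2CO+O=O2+C2O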
awk '{a[$1]=a[$1]$2} END{for (i in a){print i,a[i]}}' file1 file2

Linux replace column in filename with parent directory name

I have a file structure that looks like this:
Surge/Track_000000/000_extracted.csv
where the zeroes can be replaced by any numerical value.
The files 000_extracted.csv look like:
Timestep ElementID SE
1 100 .5
2 100 1.3
3 100 .7
4 100 .2
Ideally what I would like to have is a resulting file that looks like this:
Track Timestep ElementID SE
0000000 1 100 .5
0000000 2 100 1.3
0000000 3 100 .7
0000000 4 100 .2
Where the 0000000 is the 7-digit track code from the parent directory name.
As a first step I want to prepend the directory name (Track_000000) to the filename, so it would go from 212_extracted.csv to Track_000000_212_extracted.csv.
I tried this:
for i in 'ls ./Surge/'
do
for j in 'ls ./Surge/$i/'
do
mv -v './Surge/$i/*.csv' './Surge/$i/$i-*.csv'
done
done
This is not working. While $i should be Track_0000000, it instead comes out as /Surge/.
Any help would be appreciated.
Thanks,
K
Something along these lines:
for dir in ./Surge/*/; do
    prefix=$(basename "$dir")
    trk=${prefix#Track_}
    for file in "$dir"*.csv; do
        awk -v TRK="$trk" '{ print TRK, $0 }' "$file" > "${dir}${prefix}_$(basename "$file")"
    done
done
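Going one step further toward the output shown in the question (a Track column plus header), a hedged sketch along the same lines, assuming the Surge/Track_*/..._extracted.csv layout shown; the output name *_with_track.csv is only an example:
for dir in ./Surge/Track_*/; do
    trk=$(basename "$dir"); trk=${trk#Track_}   # keep only the numeric track code
    for file in "$dir"*_extracted.csv; do
        # first line: add the new column header; remaining lines: prepend the track code
        awk -v trk="$trk" 'NR==1 { print "Track", $0; next } { print trk, $0 }' \
            "$file" > "${file%.csv}_with_track.csv"
    done
done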
Use back apostrophe: ` instead of ':
for i in `ls ./Surge/`; do
  for j in `ls ./Surge/$i/`; do
    mv -v "./Surge/$i/$j" "./Surge/$i/${i}_$j"
  done
done

How to select random lines from a file

I have a text file containing about a thousand lines, with different lengths. Now I want to select N lines randomly, save them in another file, and remove them from the original file.
I've found some answers to this question, but most of them use a simple idea: sort the file randomly and select the first or last N lines. Unfortunately this idea doesn't work for me, because I want to preserve the order of the lines.
I tried this piece of code, but it's very slow and takes hours.
FILEsrc=$1;
FILEtrg=$2;
MaxLines=$3;
let LineIndex=1;
while [ "$LineIndex" -le "$MaxLines" ]
do
# count number of lines
NUM=$(wc -l $FILEsrc | sed 's/[ \r\t].*$//g');
let X=(${RANDOM} % ${NUM} + 1);
echo $X;
sed -n ${X}p ${FILEsrc}>>$FILEtrg; #write selected line into target file
sed -i -e ${X}d ${FILEsrc}; #remove selected line from source file
LineIndex=`expr $LineIndex + 1`;
done
I found this line to be the most time-consuming one in the code:
sed -i -e ${X}d ${FILEsrc};
Is there any way to overcome this problem and make the code faster?
Since I'm in a hurry, may I ask you to send me complete C/C++ code for doing this?
A simple O(n) algorithm is described in:
http://en.wikipedia.org/wiki/Reservoir_sampling
array R[k]; // result
integer i, j;
// fill the reservoir array
for each i in 1 to k do
R[i] := S[i]
done;
// replace elements with gradually decreasing probability
for each i in k+1 to length(S) do
j := random(1, i); // important: inclusive range
if j <= k then
R[j] := S[i]
fi
done
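To make that concrete, here is a minimal awk sketch of the same algorithm (an illustration, not from the linked article); the sampled lines are tagged with their line numbers and re-sorted at the end so the original order is preserved:
awk -v k=5 'BEGIN { srand() }
NR <= k { r[NR] = NR "\t" $0; next }                          # fill the reservoir with the first k lines
{ j = int(rand() * NR) + 1; if (j <= k) r[j] = NR "\t" $0 }   # replace with decreasing probability
END { for (i = 1; i <= k; i++) print r[i] }' "$FILEsrc" |
sort -n | cut -f2- > "$FILEtrg"                                # restore original order, drop the tag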
Generate all your offsets, then make a single pass through the file. Assuming you have the desired number of offsets in offsets (one number per line) you can generate a single sed script like this:
sed "s!.*!&{w $FILEtrg\nd;}!" offsets
The output is a sed script which you can save to a temporary file, or (if your sed dialect supports it) pipe to a second sed instance:
... | sed -i -f - "$FILEsrc"
Generating the offsets file is left as an exercise.
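(One possible way to generate it, assuming GNU shuf is available and $FILEsrc and $MaxLines are set as in the question:
shuf -i 1-"$(wc -l < "$FILEsrc")" -n "$MaxLines" > offsets
The order of the offsets does not matter, since each one becomes its own sed command.)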
Given that you have the Linux tag, this should work right off the bat. The default sed on some other platforms may not understand \n and/or accept -f - to read the script from standard input.
Here is a complete script, updated to use shuf (thanks @Thor!) to avoid possible duplicate random numbers.
#!/bin/sh
FILEsrc=$1
FILEtrg=$2
MaxLines=$3
# Add a line number to each input line
nl -ba "$FILEsrc" |
# Rearrange lines
shuf |
# Pick out the line number from the first $MaxLines ones into sed script
sed "1,${MaxLines}s!^ *\([1-9][0-9]*\).*!\1{w $FILEtrg\nd;}!;t;D;q" |
# Run the generated sed script on the original input file
sed -i -f - "$FILEsrc"
[I've updated each solution to remove selected lines from the input, but I'm not positive the awk is correct. I'm partial to the bash solution myself, so I'm not going to spend any time debugging it. Feel free to edit any mistakes.]
Here's a simple awk script (the probabilities are simpler to manage with floating point numbers, which don't mix well with bash):
tmp=$(mktemp /tmp/XXXXXXXX)
awk -v total=$(wc -l < "$FILEsrc") -v maxLines=$MaxLines '
BEGIN { srand(); }
maxLines==0 { exit; }
{ if (rand() < maxLines/total--) {
print; maxLines--;
} else {
print $0 > "/dev/fd/3"
}
}' "$FILEsrc" > "$FILEtrg" 3> $tmp
mv $tmp "$FILEsrc"
As you print a line to the output, you decrement maxLines to decrease the probability of choosing further lines. But as you consume the input, you decrease total to increase the probability. In the extreme, the probability hits zero when maxLines does, so you can stop processing the input. In the other extreme, the probability hits 1 once total is less than or equal to maxLines, and you'll be accepting all further lines.
Here's the same algorithm, implemented in (almost) pure bash using integer arithmetic:
FILEsrc=$1
FILEtrg=$2
MaxLines=$3
tmp=$(mktemp /tmp/XXXXXXXX)
total=$(wc -l < "$FILEsrc")
while read -r line && (( MaxLines > 0 )); do
(( MaxLines * 32768 > RANDOM * total-- )) || { printf >&3 '%s\n' "$line"; continue; }
(( MaxLines-- ))
printf '%s\n' "$line"
done < "$FILEsrc" > "$FILEtrg" 3> $tmp
mv $tmp "$FILEsrc"
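Saved under a name of your choice (say sample.sh, just as an example), it takes the same three arguments as the original script:
bash sample.sh infile.txt picked.txt 10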
Here's a complete Go program:
package main
import (
"bufio"
"fmt"
"log"
"math/rand"
"os"
"sort"
"time"
)
func main() {
N := 10
rand.Seed( time.Now().UTC().UnixNano())
f, err := os.Open(os.Args[1]) // open the file
if err!=nil { // and tell the user if the file wasn't found or readable
log.Fatal(err)
}
r := bufio.NewReader(f)
var lines []string // this will contain all the lines of the file
for {
if line, err := r.ReadString('\n'); err == nil {
lines = append(lines, line)
} else {
break
}
}
nums := make([]int, N) // creates the array of desired line indexes
for i, _ := range nums { // fills the array with random numbers (lower than the number of lines)
nums[i] = rand.Intn(len(lines))
}
sort.Ints(nums) // sorts this array
for _, n := range nums { // let's print the line
fmt.Println(lines[n])
}
}
Provided you put the Go file in a directory named randomlines in your GOPATH, you may build it like this:
go build randomlines
And then call it like this:
./randomlines "path_to_my_file"
This will print N (here 10) random lines from your file, without changing their order. Of course it's near instantaneous even with big files.
Here's an interesting two-pass option with coreutils, sed and awk:
n=5
total=$(wc -l < infile)
seq 1 $total | shuf | head -n $n \
| sed 's/^/NR == /; $! s/$/ ||/' \
| tr '\n' ' ' \
| sed 's/.*/ & { print >> "rndlines" }\n!( &) { print >> "leftover" }/' \
| awk -f - infile
A list of random numbers is passed to sed, which generates an awk script. If awk were removed from the pipeline above, this would be the output:
{ if(NR == 14 || NR == 1 || NR == 11 || NR == 20 || NR == 21 ) print > "rndlines"; else print > "leftover" }
So the random lines are saved in rndlines and the rest in leftover.
The thousand or so lines mentioned should sort quite quickly, so this is a nice case for the decorate-sort-undecorate pattern. It actually creates two new files; removing lines from the original one can be simulated by renaming.
Note: head and tail cannot be used instead of awk, because they close the file descriptor after the given number of lines, making tee exit and thus causing missing data in the .rest file.
FILE=input.txt
SAMPLE=10
SEP=$'\t'
<$FILE nl -s "$SEP" -nln -w1 |
sort -R |
tee \
>(awk "NR > $SAMPLE" | sort -t"$SEP" -k1n,1 | cut -d"$SEP" -f2- > $FILE.rest) \
>(awk "NR <= $SAMPLE" | sort -t"$SEP" -k1n,1 | cut -d"$SEP" -f2- > $FILE.sample) \
>/dev/null
# check the results
wc -l $FILE*
# 'remove' the lines, if needed
mv $FILE.rest $FILE
This might work for you (GNU sed, sort and seq):
n=10
seq 1 $(sed '$=;d' input_file) |
sort -R |
sed "${n}q" |
sed 's/.*/&{w output_file\nd}/' |
sed -i -f - input_file
Where $n is the number of lines to extract.

Linux, big text file, strip out content from line A to line B

I want to strip a chunk of lines from a big text file. I know the start and end line numbers. What is the most elegant way to get the content (the lines between A and B) out into another file?
I know the head and tail commands; is there an even quicker (one-step) way?
The file is over 5 GB and contains over 81 million lines.
UPDATED: The results
time sed -n 79224100,79898190p BIGFILE.log > out4.log
real 1m9.988s
time tail -n +79224100 BIGFILE.log | head -n +`expr 79898190 - 79224100` > out1.log
real 1m11.623s
time perl fileslice.pl BIGFILE.log 79224100 79898190 > out2.log
real 1m13.302s
time python fileslice.py 79224100 79898190 < BIGFILE.log > out3.log
real 1m13.277s
The winner is sed. The fastest, the shortest. I think Chuck Norris would use it.
sed -n '<A>,<B>p' input.txt
This works for me in GNU sed (where I is the first and J the last line to print):
sed -n 'I,$p; Jq'
The q quits when the indicated line is processed, so sed does not have to read the rest of the file.
for example, these large numbers work:
$ yes | sed -n '200000000,${=;p};200000005q'
200000000
y
200000001
y
200000002
y
200000003
y
200000004
y
200000005
y
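Applied to the line numbers from the timings above, this answer's form (with the early quit) would look like, for illustration:
sed -n '79224100,$p;79898190q' BIGFILE.log > out.log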
I guess big files need a bigger solution...
fileslice.py:
import sys
import itertools
for line in itertools.islice(sys.stdin, int(sys.argv[1]) - 1, int(sys.argv[2])):
sys.stdout.write(line)
invocation:
python fileslice.py 79224100 79898190 < input.txt > output.txt
Here's a perl solution :)
fileslice.pl:
#!/usr/bin/perl
use strict;
use warnings;
use IO::File;
my $first = $ARGV[1];
my $last = $ARGV[2];
my $fd = IO::File->new($ARGV[0], 'r') or die "Unable to open file $ARGV[0]: $!\n";
my $i = 0;
while (<$fd>) {
$i++;
next if ($i < $first);
last if ($i > $last);
print $_;
}
Run it with
perl fileslice.pl file 79224100 79898190
