Linux Shell Bash: Split a text file by columns

I have a massive list of English words that looks something like this.
1. do accomplish, prepare, resolve, work out
2. say suggest, disclose, answer
3. go continue, move, lead
4. get bring, attain, catch, become
5. make create, cause, prepare, invest
6. know understand, appreciate, experience, identify
7. think contemplate, remember, judge, consider
8. take accept, steal, buy, endure
9. see detect, comprehend, scan
10. come happen, appear, extend, occur
11. want choose, prefer, require, wish
12. look glance, notice, peer, read
13. use accept, apply, handle, work
14. find detect, discover, notice, uncover
15. give grant, award, issue
16. tell confess, explain, inform, reveal
And I would like to be able to extract the second column:
do
say
go
get
make
know
think
take
see
come
want
look
use
find
give
tell
Does anybody know how to do this in bash?
Thanks.

Using bash
$ cat tst.sh
#!/usr/bin/env bash
while IFS= read -r line; do
    line=${line//[0-9.]}    # remove all digits and dots (the item number)
    line=${line/,*}         # cut everything from the first comma onward
    echo ${line% *}         # drop the last word; unquoted echo also trims the leading space
done < /path/to/input_file
$ ./tst.sh
do
say
go
get
make
know
think
take
see
come
want
look
use
find
give
tell
Using sed
$ sed 's/[^a-z]*\([^ ]*\).*/\1/' input_file
do
say
go
get
make
know
think
take
see
come
want
look
use
find
give
tell
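A PCRE-capable grep can also do it (a sketch, assuming a GNU grep built with -P support; \K discards the already-matched prefix so -o prints only the word):
$ grep -oP '^[0-9]+\.\s+\K\S+' input_file
which prints the same list as above.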

There are a lot of ways to do that:
awk '{print $2}' input-file
cut -d ' ' -f 5 input-file # assuming four spaces between the first two columns
< input-file tr -s ' ' | cut -d ' ' -f 2
< input-file tr -s ' ' \\t | cut -f 2
perl -lane 'print $F[1]' input-file
sed 's/[^ ]* *\([^ ]*\).*/\1/' input-file
while read -r a b c; do printf '%s\n' "$b"; done < input-file

What about awk:
awk '{print $2}' file.txt
Does this work?

Related

Bash: Flip strings to the other side of the delimiter

Basically, I have a file formatted like
ABC:123
And I would like to flip the strings around the delimiter, so it would look like this
123:ABC
I would prefer to do this with bash/linux tools.
Thanks for any help!
That's reasonably easy with internal bash commands, assuming two fields, as per the following transcript:
pax:~$ x='abc:123'
pax:~$ echo "${x#*:}:${x%:*}"
123:abc
The first substitution ${x#*:} removes everything from the start up to the colon. The second, ${x%:*}, removes everything from the colon to the end.
Then you just re-join them with the colon in-between.
It doesn't matter for your particular data but % and # use the shortest possible pattern. The %% and ## variants will give you the longest possible pattern (greedy).
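To see the difference, try a string with two colons (a quick illustration of the same expansions):
pax:~$ y='abc:123:xyz'
pax:~$ echo "${y#*:}"
123:xyz
pax:~$ echo "${y##*:}"
xyz
pax:~$ echo "${y%:*}"
abc:123
pax:~$ echo "${y%%:*}"
abc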
As an aside, this is ideal if you're doing it for one string at a time, since you don't need to kick off an external process to do the work for you. But if you're processing an entire file, there are better ways to do it, such as with awk:
pax:~$ printf "abc:123\ndef:456\nghi:789\n" | awk -F: '{print $2 FS $1}'
123:abc
456:def
789:ghi
#!/bin/sh -x
var1=$(echo 'ABC:123' | cut -d':' -f1)
var2=$(echo 'ABC:123' | cut -d':' -f2)
echo "${var2}":"${var1}"
I use cut to split the string into two parts, and store both of those parts as variables.
From there, it's possible to use echo to re-arrange the variables as you see fit.
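To apply the same idea to every line of a file rather than a single string, one way (a sketch; it spawns two cut processes per line, so it will be slow on big files) is:
#!/bin/sh
# Sketch: swap the two colon-separated fields on every line of file.txt
while IFS= read -r line; do
    var1=$(printf '%s\n' "$line" | cut -d':' -f1)
    var2=$(printf '%s\n' "$line" | cut -d':' -f2)
    printf '%s:%s\n' "$var2" "$var1"
done < file.txt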
Using sed, capturing the two fields and writing them back in reverse order:
sed -E 's/(.*):(.*)/\2:\1/' file.txt
Using paste and cut with process substitution.
paste -d: <(cut -d : -f2 file.txt) <(cut -d : -f1 file.txt)
A pure-shell solution, which will be the slowest on large files:
while IFS=: read -r left right; do printf '%s:%s\n' "$right" "$left"; done < file.txt

With Bash, how do I print the whole lines of a .txt file that contain a specific string in their first field?

So far I have used grep for this kind of question, but here I don't think it can be used. Indeed, I want a match in the first field of each line, and then to print the whole line.
I wrote something like that:
cat file.txt | cut -d " " -f1 | grep root
But of course, this command does not print the whole line, only the first field of each line that contains "root". I have heard of the awk command, but even with the manual I do not understand how to reach my goal.
Thank you in advance for your answer.
Simple awk should do it:
awk '$1 ~ /root/' file.txt
The condition $1 ~ /root/ selects lines whose first field matches root; with no action given, awk's default is to print the whole line.
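If you want the first field to equal root exactly, rather than merely contain it, a plain string comparison (standard awk) works:
awk '$1 == "root"' file.txt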

Remove Letter After Number and Before Comma

I need to remove any letters that occur after the first comma in a line
some.file
JAN,334X,333B,337A,338D,332Q,335H,331U
Expected Result:
JAN,334,333,337,338,332,335,331
Code:
sed -i 's/\[0-9][0-9][0-9].*,/[0-9][0-9][0-9],/g' some.file
What am I doing wrong?
You could also use a small loop (this is GNU sed):
sed ':;s/[A-Z],/,/2;t;s/[A-Z]$//'
On each pass it deletes only the second occurrence of a letter followed by a comma (so the letters of JAN are left alone) and branches back with t until no substitution happens; finally it deletes a letter at the end of the line, if there is one.
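On the sample line:
$ sed ':;s/[A-Z],/,/2;t;s/[A-Z]$//' <<<'JAN,334X,333B,337A,338D,332Q,335H,331U'
JAN,334,333,337,338,332,335,331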
Try this
$ sed 's/,\([0-9]*\)[^,]*/,\1/g' <<<'JAN,334X,333B,337A,338D,332Q,335H,331U'
JAN,334,333,337,338,332,335,331
You need to capture the digits with parentheses in order to reuse the captured string in the replacement. The g flag applies the substitution to every occurrence.
Comparison of the different answers
Test data:
$ > data; for ((x=1000000;x>0;x--)); do echo 'JAN,334X,333B,337A,338D,332Q,335H,331U' >> data; done
My answer is the slowest:
$ time sed 's/,\([0-9]*\)[^,]*/,\1/g' < data >/dev/null
real 0m16.368s
user 0m16.296s
sys 0m0.024s
Michael is a bit faster:
$ time sed ':;s/[A-Z],/,/2;t;s/[A-Z]$//' < data >/dev/null
real 0m9.669s
user 0m9.624s
sys 0m0.012s
But Sundeep is the fastest:
$ time sed 's/[A-Z]//4g' < data >/dev/null
real 0m4.905s
user 0m4.856s
sys 0m0.028s
Some issues are:
There is no need to escape [.
Your replacement value is wrong: in s/regex/replacement/g the replacement part is literal text, so [0-9] there inserts the characters [0-9] instead of matching a digit.
Use this:
sed -e 's/\([0-9]\+\)[a-zA-Z],/\1,/g' -e 's/\([0-9]\+\)[a-zA-Z]$/\1/g' file
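On the sample line this produces:
$ sed -e 's/\([0-9]\+\)[a-zA-Z],/\1,/g' -e 's/\([0-9]\+\)[a-zA-Z]$/\1/g' <<<'JAN,334X,333B,337A,338D,332Q,335H,331U'
JAN,334,333,337,338,332,335,331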
You should omit the *, and the first \ looks like a mistake, i.e.
sed -i 's/[0-9][0-9][0-9].,/[0-9][0-9][0-9],/g' some.file
but I think you also want to capture the number:
sed -i 's/\([0-9][0-9][0-9]\).,/\1,/g' some.file
It would be helpful if you posted your actual output as well.
Since question is tagged linux, this GNU sed option comes in handy
$ echo 'JAN,334X,333B,337A,338D,332Q,335H,331U' | sed -E 's/[A-Z](,|$)/\1/2g'
JAN,334,333,337,338,332,335,331
2g means replace from the 2nd match onwards to the end of the line.
If the number of letters in the first column is known, this can be simplified to
$ echo 'JAN,334X,333B,337A,338D,332Q,335H,331U' | sed 's/[A-Z]//4g'
JAN,334,333,337,338,332,335,331
No need for sed, coreutils will do:
paste -d, <(cut -d, -f1 data) <(cut -d, -f2- data | tr -d 'A-Z')
The first cut passes the JAN column through untouched (its letters must survive), while tr deletes every uppercase letter from the remaining columns. This takes 0.3 seconds on my computer when run on the data file generated in ceving's answer.

Grep versus Awk: How do the search mechanisms differ?

I am writing a script that runs in a loop; on each iteration, different scripts pull variables from external files, and the last step compiles them. I am trying to maximize the speed at which this loop can run, and am thus trying to find the best programs for the job.
The rate-limiting step right now is searching through a file which has 2 columns and 4.5 million lines. Column one is a key and column two is the value I am extracting.
The two programs I am evaluating are awk and grep. I have put the two scripts and their run times to find the last value below.
time awk -v a=15 'BEGIN{B=10000000}$1==a{print $2;B=NR}NR>B{exit}' infile
T
real 0m2.255s
user 0m2.237s
sys 0m0.018s
time grep "^15 " infile |cut -d " " -f 2
T
real 0m0.164s
user 0m0.127s
sys 0m0.037s
This brings me to my question: how does grep search? I understand awk runs line by line and field by field, which is why it takes longer as the file gets longer and I have to search further into it.
How does grep search? Clearly not line by line, or if it is, it's in a much different manner than awk, considering the almost 20x time difference.
(I have noticed awk runs faster than grep for short files, and I have yet to find where they diverge, but for those sizes it really doesn't matter nearly as much!)
I'd like to understand this so I can make good decisions for future program usage.
The awk command you posted does far more than the grep+cut:
awk -v a=15 'BEGIN{B=10000000}$1==a{print $2;B=NR}NR>B{exit}' infile
grep "^15 " infile |cut -d " " -f 2
so a time difference is very understandable. Try this awk command, which IS equivalent to the grep+cut, and see what results you get so we can compare apples to apples:
awk '/^15 /{print $2}' infile
or even:
awk '$1==15{print $2}' infile
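If the key occurs only once, you can also keep the early exit without the line-number bookkeeping (a sketch, assuming a unique key); this stops reading as soon as the value is printed:
awk '$1==15{print $2; exit}' infile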

Grep Usage help

I want to use grep to find all of the headers in a corpus; I want to find everything up to the : and ignore everything after it. Does anyone know how to do that? (Could I get a complete line of code?)
Use sed or awk.
A sed example:
sed -e '/^[^:]*$/d' -e 's/\(.*\):.*/\1/' filename
The first expression deletes lines that contain no colon; the second keeps everything up to the last colon (the .* before it is greedy).
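An awk counterpart (a sketch; note it keeps the text before the first colon, whereas the sed above keeps everything before the last one):
awk -F: 'NF > 1 { print $1 }' filename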
If all you want to do is display the first portion of the matched line then you can say
grep your_pattern | cut -d: -f 1
but if you want to not match against data after the colon, you need a different tool. There are many tools available: sed, awk, perl, python, etc. For instance, the Perl code would look something like this
perl -nle '($s) = split /:/; print $s if $s =~ /your_pattern/'
or the longer script version:
#!/usr/bin/perl
use strict;
use warnings;
while (my $line = <>) {
    # list context, so we get the first field rather than the field count
    my ($substring) = split /:/, $line;
    if ($substring =~ /your_pattern/) {
        print "$substring\n";
    }
}
(I'm not sure I fully understand your question)
You could combine grep and cut; one solution (albeit far from perfect) would be:
$ grep ':' file | cut -f 1 -d ':'
sed -n '/^$/q;/:/{s/:.*/:/;p;}'
The /^$/q quits at the first blank line, so this stops after all the headers are processed; for each line containing a colon, it deletes everything after the first colon and prints the result.
Edit: a bit improved version (GNU sed; the interval needs escaping without -E):
sed -n '/^$/q;/^[^ :\t]\{1,\}:/{s/:.*/:/;p;}'
