Take nth column in a text file - linux

I have a text file:
1 Q0 1657 1 19.6117 Exp
1 Q0 1410 2 18.8302 Exp
2 Q0 3078 1 18.6695 Exp
2 Q0 2434 2 14.0508 Exp
2 Q0 3129 3 13.5495 Exp
I want to take the 2nd and 4th word of every line like this:
1657 19.6117
1410 18.8302
3078 18.6695
2434 14.0508
3129 13.5495
I'm using this code:
nol=$(cat "/path/of/my/text" | wc -l)
x=1
while [ $x -le "$nol" ]
do
line=($(sed -n "$x"p /path/of/my/text))
echo ""${line[1]}" "${line[3]}"" >> out.txt
x=$(( $x + 1 ))
done
It works, but it is very complicated and takes a long time to process long text files.
Is there a simpler way to do this?

iirc:
cat filename.txt | awk '{ print $2, $4 }'
or, as mentioned in the comments, without the extra cat:
awk '{ print $2, $4 }' filename.txt
The comma inserts awk's output field separator (a space by default) between the two fields; without it the values are concatenated.

You can use the cut command:
cut -d' ' -f3,5 < datafile.txt
prints
1657 19.6117
1410 18.8302
3078 18.6695
2434 14.0508
3129 13.5495
where
-d' ' means use a space as the field delimiter
-f3,5 selects and prints the 3rd and 5th columns
cut is also much faster on large files than a pure shell solution. If your file is delimited with multiple whitespace characters, you can squeeze them first, like:
sed 's/[\t ][\t ]*/ /g' < datafile.txt | cut -d' ' -f3,5
where the (GNU) sed replaces any run of tab or space characters with a single space.
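As an alternative to the sed pre-pass (my addition, not part of the original answer), GNU tr can squeeze the runs of blanks in one step:
tr -s '[:blank:]' ' ' < datafile.txt | cut -d' ' -f3,5
Here tr turns every tab into a space and, because of -s, squeezes each run down to a single space before cut splits the line.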
For a variant - here is a perl solution too:
perl -lanE 'say "$F[2] $F[4]"' < datafile.txt

For the sake of completeness:
while read -r _ _ one _ two _; do
echo "$one $two"
done < file.txt
Instead of _, an arbitrary variable (such as junk) can be used as well. The placeholders simply absorb the columns you don't want, so only the desired ones are extracted.
Demo:
$ while read -r _ _ one _ two _; do echo "$one $two"; done < /tmp/file.txt
1657 19.6117
1410 18.8302
3078 18.6695
2434 14.0508
3129 13.5495

One more simple variant -
$ while read line
do
set $line # assigns words in line to positional parameters
echo "$3 $5"
done < file
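One caveat worth adding (my note, not part of the original answer): if a line could ever begin with a dash, use set -- so the words are not mistaken for options to set:
while read -r line
do
set -- $line # -- ends option processing, so a leading "-" in the data is safe
echo "$3 $5"
done < file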

If your file contains n lines, then your script has to read the file n times; so if you double the length of the file, you quadruple the amount of work your script does — and almost all of that work is simply thrown away, since all you want to do is loop over the lines in order.
Instead, the best way to loop over the lines of a file is to use a while loop, with the condition-command being the read builtin:
while IFS= read -r line ; do
# $line is a single line of the file, as a single string
: ... commands that use $line ...
done < input_file.txt
In your case, since you want to split the line into an array, you can use the read builtin's special support for populating an array variable (-a) and write:
while read -r -a line ; do
echo ""${line[1]}" "${line[3]}"" >> out.txt
done < /path/of/my/text
or better yet:
while read -r -a line ; do
echo "${line[1]} ${line[3]}"
done < /path/of/my/text > out.txt
However, for what you're doing you can just use the cut utility:
cut -d' ' -f2,4 < /path/of/my/text > out.txt
(or awk, as Tom van der Woerdt suggests, or perl, or even sed).
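Since sed is mentioned but not shown, here is one possible sed version (a sketch, assuming the fields are separated by single spaces and picking the same 2nd and 4th words as the cut command above):
sed -E 's/^[^ ]+ ([^ ]+) [^ ]+ ([^ ]+).*/\1 \2/' /path/of/my/text > out.txt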

If you are using structured data, this has the added benefit of not invoking an extra process to run tr and/or cut. ...
(Of course, you will want to guard against bad inputs with conditionals and sane alternatives.)
...
while read line ;
do
lineCols=( $line ) ;
echo "${lineCols[0]}"
echo "${lineCols[1]}"
done < "$myFQFileToRead" ;
...
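For instance, a guard of the kind hinted at above might look like this (a sketch only; the two-field minimum is an assumed requirement, and $myFQFileToRead is carried over from the fragment):
while read -r line ;
do
lineCols=( $line ) ;
if (( ${#lineCols[@]} < 2 )) ; then
echo "skipping malformed line: $line" >&2 ; # sane alternative: report it and move on
continue ;
fi
echo "${lineCols[0]}"
echo "${lineCols[1]}"
done < "$myFQFileToRead" ;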

Related

echo without trimming the space in awk command

I have a file consisting of multiple rows like this
10|EQU000000001|12345678|3456||EOMCO042|EOMCO042|31DEC2018|16:51:17|31DEC2018|SHOP NO.5,6,7 RUNWAL GRCHEMBUR MHIN|0000000010000.00|6761857316|508998|6011|GL
I have to split column 11 into 4 different columns based on character counts.
This is the 11th column, which also contains extra spaces:
SHOP NO.5,6,7 RUNWAL GRCHEMBUR MHIN
This is what I have done:
ls *.txt *.TXT| while read line
do
subName="$(cut -d'.' -f1 <<<"$line")"
awk -F"|" '{ "echo -n "$11" | cut -c1-23" | getline ton;
"echo -n "$11" | cut -c24-36" | getline city;
"echo -n "$11" | cut -c37-38" | getline state;
"echo -n "$11" | cut -c39-40" | getline country;
$11=ton"|"city"|"state"|"country; print $0
}' OFS="|" $line > $subName$output
done
But when I echo the 11th column, the extra spaces get trimmed, which throws off the character counts. Is there any way to echo without trimming the spaces?
Actual output
10|EQU000000001|12345678|3456||EOMCO042|EOMCO042|31DEC2018|16:51:17|31DEC2018|SHOP NO.5,6,7 RUNWAL GR|CHEMBUR MHIN|||0000000010000.00|6761857316|508998|6011|GL
Expected Output
10|EQU000000001|12345678|3456||EOMCO042|EOMCO042|31DEC2018|16:51:17|31DEC2018|SHOP NO.5,6,7 RUNWAL GR|CHEMBUR|MH|IN|0000000010000.00|6761857316|508998|6011|GL
The least annoying way to code this that I've found so far is:
perl -F'\|' -lane '$F[10] = join "|", unpack "a23 A13 a2 a2", $F[10]; print join "|", @F'
It's fairly straightforward:
Iterate over lines of input; split each line on | and put the fields in @F.
For the 11th field ($F[10]), split it into fixed-width subfields using unpack, trimming trailing spaces from the second subfield (A instead of a).
Reassemble subfields by joining with |.
Reassemble the whole line by joining with | and printing it.
I haven't benchmarked it in any way, but it's likely much faster than the original code that spawns multiple shell and cut processes per input line because it's all done in one process.
A complete solution would wrap it in a shell loop:
for file in *.txt *.TXT; do
outfile="${file%.*}$output"
perl -F'\|' -lane '...' "$file" > "$outfile"
done
Or, if you don't need to trim the .txt part (and the whole file list fits on the command line):
perl -i.out -F'\|' -lane '...' *.txt *.TXT
This simply places the output for each input file foo.txt in foo.txt.out.
A pure-bash implementation of all this logic:
#!/usr/bin/env bash
shopt -s nocaseglob extglob
for f in *.txt; do
subName=${f%.*}
while IFS='|' read -r -a fields; do
location=${fields[10]}
ton=${location:0:23}; ton=${ton%%+([[:space:]])}
city=${location:23:13}; city=${city%%+([[:space:]])}
state=${location:36:2}
country=${location:38:2}
fields[10]="$ton|$city|$state|$country"
printf -v out '%s|' "${fields[@]}"
printf '%s\n' "${out:0:$(( ${#out} - 1 ))}"
done <"$f" >"$subName.out"
done
It's slower (if I did this well, by about a factor of 10) than pure awk would be, but much faster than the awk/shell combination proposed in the question.
Going into the constructs used:
All the ${varname%...} and related constructs are parameter expansion. The specific ${varname%pattern} construct removes the shortest possible match for pattern from the value in varname, or the longest match if % is replaced with %%.
Using extglob enables extended globbing syntax, such as +([[:space:]]), which is equivalent to the regex syntax [[:space:]]+.
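A tiny standalone illustration of the trimming idiom (the sample string and its trailing spaces are made up):
shopt -s extglob
location='SHOP NO.5,6,7 RUNWAL GR   '
trimmed=${location%%+([[:space:]])} # strip the longest run of trailing whitespace
printf '[%s]\n' "$trimmed" # prints [SHOP NO.5,6,7 RUNWAL GR]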

Fast ways to make new multiple files from one file matching multiple patterns

I have one file called uniq.txt (20,000 lines).
head uniq.txt
1
103
10357
1124
1126
I have another file called all.txt (106,371,111 lines)
head all.txt
cg0001 ? 1 -0.394991215660192
cg0001 AB 103 -0.502535661820095
cg0002 A 10357 -0.563632386999913
cg0003 ? 1 -0.394991215660444
cg0004 ? 1 -0.502535661820095
cg0004 A 10357 -0.563632386999913
cg0003 AB 103 -0.64926706504459
I would like to make 20,000 new files from all.txt, one for each line of uniq.txt, matched on the third column. For example,
head 1.newfile.txt
cg0001 ? 1 -0.394991215660192
cg0003 ? 1 -0.394991215660444
cg0004 ? 1 -0.502535661820095
head 103.newfile.txt
cg0001 AB 103 -0.502535661820095
cg0003 AB 103 -0.64926706504459
head 10357.newfile.txt
cg0002 A 10357 -0.563632386999913
cg0004 A 10357 -0.563632386999913
Is there any way to make the 20,000 new files really fast?
My current script takes 1 min to make one new file. I guess it's scanning the whole all.txt file every time it makes a new file.
You can try it with awk. Ideally you don't need >> in awk, but since you have stated there will be 20,000 files, we don't want to exhaust the system's resources by keeping too many files open.
awk '
NR==FNR { names[$0]++; next }
($3 in names) { file=$3".newfile.txt"; print $0 >>(file); close (file) }
' uniq.txt all.txt
This will first scan the uniq.txt file into memory creating a lookup table of sorts. It will then read through the all.txt file and start inserting entries into corresponding files.
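If your awk can keep that many output files open at once (GNU awk manages its own pool of file descriptors; other awks may hit the open-file limit), a variant without the per-line close() may be faster. A sketch under that assumption:
awk '
NR==FNR { names[$0]; next }
($3 in names) { print > ($3 ".newfile.txt") }
' uniq.txt all.txt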
This approach uses a while loop. It may or may not be the quickest way, but give it a try:
lines_to_files.sh
#!/bin/bash
while IFS='' read -r line || [[ -n "$line" ]]; do
num=$(echo "$line" | awk '{print $3}')
echo "$line" >> /path/to/save/${num}_newfile.txt
done < "$1"
usage:
$ ./lines_to_files.sh all.txt
This should create a new file for each line in your all.txt file based on the third column. As it reads each line it will add it to the appropriate file. Keep in mind that if you run the script successive times it will append the data that is already there for each file.
An explanation of the while loop used above for reading the file can be found here:
↳ https://stackoverflow.com/a/10929511/499581
You can read each line into a Bash array, then append to the file named after the number in column three (array index 2):
#!/bin/bash
while read -ra arr; do
echo "${arr[#]}" >> "${arr[2]}".newfile.txt
done < all.txt
This creates space separated output. If you prefer tab separated, it depends a bit on your input data: if it is tab separated as well, you can just set IFS to a tab to get tab separated output:
IFS=$'\t'
while read -ra arr; do
echo "${arr[*]}" >> "${arr[2]}".newfile.txt
done < all.txt
Notice the change in how the array is printed: the * is now actually required.
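A quick illustration of the difference (the sample values are made up):
arr=(a b c)
IFS=$'\t'
echo "${arr[*]}" # joined with the first character of IFS: a<TAB>b<TAB>c
echo "${arr[@]}" # expanded as separate words, which echo then joins with plain spaces: a b c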
Or, if the input data is not tab separated (or we don't know), we can set IFS in a subshell in each loop:
while read -ra arr; do
( IFS=$'\t'; echo "${arr[*]}" >> "${arr[2]}".newfile.txt )
done < all.txt
I'm not sure what's more expensive, spawning a subshell or a few parameter assignments, but I suspect it's the subshell. To avoid spawning it, we can set and reset IFS in each iteration instead:
while read -ra arr; do
old_ifs="$IFS"
IFS=$'\t'
echo "${arr[*]}" >> "${arr[2]}".newfile.txt
IFS="$old_ifs"
done < all.txt
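Another way to scope the IFS change (my addition, not from the answer above) is to do the joining inside a function where IFS is declared local:
print_tabbed() {
local IFS=$'\t' # the IFS change is confined to this function
echo "$*" # "$*" joins the arguments with the first character of IFS
}
while read -ra arr; do
print_tabbed "${arr[@]}" >> "${arr[2]}".newfile.txt
done < all.txt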
OP asked for fast ways. This is the fastest I've found.
sort -S 4G -k3,3 all.txt |
awk '{if(last!=$3){close(file); file=$3".newfile.txt"; last=$3} print $0 > file}'
Total time was 2m4.910s vs 10m4.058s for the runner-up. Note that it uses 4 GB of memory (possibly faster if more, definitely slower if less) and that it ignores uniq.txt.
Results for full-sized input files (100,000,000-line all.txt, 20,000-line uniq.txt):
sort + awk write (me): ~800,000 input lines/second
awk append (@jaypal-singh): ~200,000 input lines/second
bash append (@benjamin-w): ~15,000 input lines/second
bash append + extra awk (@lll): ~2,000 input lines/second
Here's how I created the test files:
seq 1 20000 | sort -R | sed 's/.*/cg0001\tAB\t&\t-0.502535661820095/' > tmp.txt
seq 1 5000 | while read i; do cat tmp.txt; done > all.txt
seq 1 20000 | sort -R > uniq.txt
PS: Apologies for the flaw in my original test setup.

concatenate the result of echo and a command output

I have the following code:
names=$(ls *$1*.txt)
head -q -n 1 $names | cut -d "_" -f 2
where the first line finds and stores all names matching the command line input into a variable called names, and the second grabs the first line in each file (element of the variable names) and outputs the second part of the line based on the "_" delim.
This is all good, however I would like to prepend the filename (stored as lines in the variable names) to the output of cut. I have tried:
names=$(ls *$1*.txt)
head -q -n 1 $names | echo -n "$names" cut -d "_" -f 2
however this only prints out the filenames
I have tried
names=$(ls *$1*.txt)
head -q -n 1 $names | echo -n "$names"; cut -d "_" -f 2
and again I only print out the filenames.
The desired output is:
$
filename1.txt <second character>
where there is a single whitespace between the filename and the result of cut.
Thank you.
Best approach, using awk
You can do this all in one invocation of awk:
awk -F_ 'NR==1{print FILENAME, $2; exit}' *"$1"*.txt
On the first line of the first file, this prints the filename and the value of the second column, then exits.
Pure bash solution
I would always recommend against parsing ls; instead, loop over the glob directly.
You can also avoid calling awk to read the first line of the file by using bash built-in functionality:
for i in *"$1"*.txt; do
IFS=_ read -ra arr <"$i"
echo "$i ${arr[1]}"
break
done
Here we read the first line of the file into an array, splitting it into pieces on the _.
Maybe something like that will satisfy your need BUT THIS IS BAD CODING (see comments):
#!/bin/bash
names=$(ls *$1*.txt)
for f in $names
do
pattern=`head -q -n 1 $f | cut -d "_" -f 2`
echo "$f $pattern"
done
If I didn't misunderstand your goal, this also works.
I've always done it this way; I just found out that this is a deprecated way to do it.
#!/bin/bash
names=$(ls *"$1"*.txt)
for e in $names;
do echo $e `echo "$e" | cut -c2-2`;
done

bash print first to nth column in a line iteratively

I am trying to get the column names of a file and print them iteratively. I guess the problem is with the print $i but I don't know how to correct it. The code I tried is:
#! /bin/bash
for i in {2..5}
do
set snp = head -n 1 smaller.txt | awk '{print $i}'
echo $snp
done
Example input file:
ID Name Age Sex State Ext
1 A 12 M UT 811
2 B 12 F UT 818
Desired output:
Name
Age
Sex
State
Ext
But the output I get is blank screen.
You'd better just read the first line of your file and store the result as an array:
read -a header < smaller.txt
and then printf the relevant fields:
printf "%s\n" "${header[#]:1}"
Moreover, this uses bash only, and involves no unnecessary loops.
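To see what the ${header[@]:1} slice does, here it is applied to the sample header from the question:
header=(ID Name Age Sex State Ext) # what read -a stores for the first line
printf "%s\n" "${header[@]:1}" # everything from index 1 on: Name, Age, Sex, State, Ext, one per line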
Edit. To also answer your comment, you'll be able to loop through the header fields thus:
read -a header < smaller.txt
for snp in "${header[#]:1}"; do
echo "$snp"
done
Edit 2. Your original method had many many mistakes. Here's a corrected version of it (although what I wrote before is a much preferable way of solving your problem):
for i in {2..5}; do
snp=$(head -n 1 smaller.txt | awk "{print \$$i}")
echo "$snp"
done
set probably doesn't do what you think it does.
Because of the single quotes in awk '{print $i}', the $i never gets expanded by bash.
This algorithm is not good, since you're calling head and awk 4 times when you don't need a single external process.
Hope this helps!
You can print it using awk itself:
awk 'NR==1{for (i=2; i<=5; i++) print $i}' smaller.txt
The main problem with your code is that your assignment syntax is wrong. Change this:
set snp = head -n 1 smaller.txt | awk '{print $i}'
to this:
snp=$(head -n 1 smaller.txt | awk -v i="$i" '{print $i}')
(Note the added -v i="$i": because of the single quotes, awk cannot see the shell's $i unless it is passed in explicitly.)
That is:
Do not use set. set is for setting shell options, numbered parameters, and so on, not for assigning arbitrary variables.
Remove the spaces around =.
To run a command and capture its output as a string, use $(...) (or `...`, but $(...) is less error-prone).
That said, I agree with gniourf_gniourf's approach.
Here's another alternative; not necessarily better or worse than any of the others:
for n in $(head -n 1 smaller.txt)
do
echo "$n"
done
Something like:
for x1 in $(head -n1 smaller.txt); do
echo "$x1"
done

extracting specified line numbers from file using shell script

I have a file with a list of addresses; it looks like this (ADDRESS_FILE):
0xf012134
0xf932193
.
.
0fx12923a
I have another file with a list of numbers; it looks like this (NUMBERS_FILE):
20
40
.
.
12
I want to cut the first 20 lines from ADDRESS_FILE and put them into a new file,
then cut the next 40 lines from ADDRESS_FILE, and so on ...
I know that a series of sed commands like the one given below does the job
sed -n 1,20p ADDRESS_FILE > temp_file_1
sed -n 21,60p ADDRESS_FILE > temp_file_2
.
.
sed -n somenumber,endoffilep ADDRESS_FILE > temp_file_n
But I want to do this automatically with a shell script that changes the line numbers passed to sed on each execution.
How can I do this?
Also, on a general note, which Linux text-processing commands are most useful in cases like this?
Assuming your line numbers are in a file called lines, sorted etc., try:
#!/bin/sh
j=0
count=1
while read -r i; do
sed -n $j,$i > filename.$count # etc... details of sed/redirection elided
j=$i
count=$(($count+1))
done < lines
Note. The above doesn't assume a consistent number of lines to split on for each iteration.
Since you've additionally asked for a general utility, try split. However this splits on a consistent number of lines, and is perhaps of limited use here.
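As for other text-processing tools: if you first turn the counts into absolute line offsets, csplit can do the uneven split in one call. A sketch (it assumes the counts in NUMBERS_FILE don't add up to more than the number of lines in ADDRESS_FILE):
offsets=$(awk 'NR > 1 { print sum + 1 } { sum += $1 }' NUMBERS_FILE) # line numbers where each later chunk starts
csplit -n 4 -f temp_file_ ADDRESS_FILE $offsets # $offsets deliberately unquoted so each offset is a separate argument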
Here's an alternative that reads directly from the NUMBERS_FILE:
n=0; i=1
while read; do
sed -n ${i},+$(( REPLY - 1 ))p ADDRESS_FILE > temp_file_$(( n++ ))
(( i += REPLY ))
done < NUMBERS_FILE
size=$(wc -l < ADDRESS_FILE)
i=1
n=1
while [ $n -lt $size ]
do
sed -n $n,$((n+19))p ADDRESS_FILE > temp_file_$i
i=$((i+1))
n=$((n+20))
done
or just
split -l20 ADDRESS_FILE temp_file_
(thanks Brian Agnew for the idea).
An ugly solution that works with a single sed invocation; it can probably be made less horrible.
It generates a tiny sed script to split the file:
#!/bin/bash
sum=0
count=0
sed -n -f <(while read -r n ; do
echo "$((sum + 1)),$((sum += n)) w temp_file_$((count++))"
done < NUMBERS_FILE) ADDRESS_FILE
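With the sample NUMBERS_FILE above (20, 40, ...), the generated sed script would look something like this:
1,20 w temp_file_0
21,60 w temp_file_1
so sed writes each address range to its own file in a single pass over ADDRESS_FILE.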
