Extracting specified line numbers from a file using a shell script - Linux

I have a file with a list of addresses; it looks like this (ADDRESS_FILE):
0xf012134
0xf932193
.
.
0xf12923a
I have another file with a list of numbers; it looks like this (NUMBERS_FILE):
20
40
.
.
12
I want to cut the first 20 lines from ADDRESS_FILE and put them into a new file,
then cut the next 40 lines from ADDRESS_FILE, and so on.
I know that a series of sed commands like the ones given below does the job:
sed -n 1,20p ADDRESS_FILE > temp_file_1
sed -n 21,60p ADDRESS_FILE > temp_file_2
.
.
sed -n somenumber,endoffilep ADDRESS_FILE > temp_file_n
But I want to do this automatically using shell scripting, changing the number of lines to cut on each sed execution.
How can I do this?
Also, on a general note, which text processing commands in Linux are most useful in such cases?

Assuming your line numbers are in a file called lines, sorted etc., try:
#!/bin/sh
j=0
count=1
while read -r i; do
    # print lines (j+1) through i of ADDRESS_FILE into its own file
    sed -n "$((j + 1)),${i}p" ADDRESS_FILE > filename.$count
    j=$i
    count=$((count + 1))
done < lines
Note. The above doesn't assume a consistent number of lines to split on for each iteration.
Since you've additionally asked for a general utility, try split. However this splits on a consistent number of lines, and is perhaps of limited use here.
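For example, a fixed 20-line split (a quick sketch; the temp_file_ prefix is just an example) would be:
split -l 20 ADDRESS_FILE temp_file_
split names its output pieces temp_file_aa, temp_file_ab, and so on.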

Here's an alternative that reads directly from the NUMBERS_FILE:
n=0; i=1
while read; do
    # REPLY holds the chunk size just read from NUMBERS_FILE
    sed -n "${i},+$(( REPLY - 1 ))p" ADDRESS_FILE > temp_file_$(( n++ ))
    (( i += REPLY ))
done < NUMBERS_FILE

size=$(wc -l < ADDRESS_FILE)
i=1
n=1
while [ $n -le $size ]
do
    sed -n $n,$((n+19))p ADDRESS_FILE > temp_file_$i
    i=$((i+1))
    n=$((n+20))
done
or just
split -l20 ADDRESS_FILE temp_file_
(thanks Brian Agnew for the idea).

An ugly solution which works with a single sed invocation; it can probably be made less horrible.
This generates a tiny sed script to split the file:
#!/bin/bash
sum=0
count=0
sed -n -f <(while read -r n ; do
    # emit one sed command per chunk: "START,END w temp_file_N"
    echo "$((sum + 1)),$((sum += n)) w temp_file_$((count++))"
done < NUMBERS_FILE) ADDRESS_FILE
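For completeness, the same variable-length split can also be done in a single awk pass over both files (an untested sketch that assumes the sizes in NUMBERS_FILE cover the whole of ADDRESS_FILE, and that reuses the temp_file_N naming from above):
awk 'NR == FNR { counts[NR] = $1; next }          # first file: chunk sizes from NUMBERS_FILE
     written == 0 { idx++; file = "temp_file_" idx }
     {
         print > file
         if (++written >= counts[idx]) { close(file); written = 0 }
     }' NUMBERS_FILE ADDRESS_FILE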

Related

Bash: Two for loops at once?

I am not sure how to do this at all.
I have two text files, FILE1 and FILE2.
I would like to run a for loop for each file at the same time and display the
contents next to each other.
For example,
for i in $(cat FILE1); do echo $i; done
for j in $(cat FILE2); do echo $j; done
I would like to combine these two commands, so I can process both files at the same time and get output like $i $j.
Solution 1
Use the paste command
paste FILE1 FILE2
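By default paste joins corresponding lines with a tab. If you want the $i $j style output with a space instead, you can pick the delimiter explicitly (a small illustration, not part of the original answer):
paste -d' ' FILE1 FILE2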
Solution 2
You can do this if they have the same number of lines.
#!/bin/bash
t=$(wc -l < FILE1)
for i in $(seq 1 "$t")
do
    head -n "$i" FILE1 | tail -n 1
    head -n "$i" FILE2 | tail -n 1
done
You can extend it to what you want for unequal number of lines.
You shouldn't be using for loops at all; see Bash FAQ 001. Instead, use two read commands in a single while loop.
while IFS= read -r line1 && IFS= read -r line2 <&3; do
    printf '%s | %s\n' "$line1" "$line2"
done < FILE1 3< FILE2
Each read command reads from a separate file descriptor. In this version, the loop will exit when the shorter of the two files is exhausted.
There are two different questions being asked here. Other answers address the question of how to display the contents of the files in two columns. Running two loops simultaneously (which is the wrong way to address the first problem) can be done by running them each asynchronously:
for i in ${seqi?}; do ${cmdi?}; done & for j in ${seqj?}; do ${cmdj?}; done & wait
Although you could also implement paste -d ' ' file1 file2 with something like:
while read line_from_file1; p=$?; read line_from_file2 <&3 || test "$p" = 0; do
    echo "$line_from_file1" "$line_from_file2"
done < file1 3< file2
Another option, in bash v4+, is to read the two files into two arrays, then echo the array elements side by side:
# Load each file into its own array
readarray -t f1 < file1
readarray -t f2 < file2
# Print elements of both arrays side-by-side
for ((i=0; i<${#f1[@]}; i++)) ; do echo "${f1[i]}" "${f2[i]}"; done
Or change the echo to printf if you want the columns to line up:
printf "%-20s %-20s\n" ${f1[i]} ${f2[i]}
I'm not suggesting you do this if your files are 100s of megabytes.
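If the goal is just readable side-by-side output and the files are not huge, another option (assuming the column utility from util-linux is available) is to let paste and column do the alignment; note that column -t aligns on every whitespace-separated field, so this suits single-word lines best:
paste FILE1 FILE2 | column -t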

Quickest way to split a string in bash

The goal: produce a path from an integer.
I need to split strings in fixed length (2 characters in this case), and then glue the pieces with a separator. Example : 123456 => 12/34/56, 12345 => 12/34/5.
I found a solution with sed:
sed 's/\(..\)/\1\//g'
but I'm not sure it's really quick, since I'm not doing any analysis of the string content (which will always be an integer, if that matters), but simply splitting it into pieces of length 2 (or 1 if the original length is odd).
Bash parameter expansion can extract substrings:
var=123456
echo "${var:0:2}" # first two chars
echo "${var:2:2}" # next two
echo "${var:4:2}" # etc.
Joining them manually with /:
echo "${var:0:2}/${var:2:2}/${var:4:2}"
Use parameter substitution. ${var:position:length} extracts substrings, ${#var} returns the length of the value, ${var%final} removes "final" from the end of the value. Run it in a loop for strings of unknown length:
#!/bin/bash
for s in 123456 1234567 ; do
    o=""
    for (( pos=0 ; pos<${#s} ; pos+=2 )) ; do
        o+=${s:pos:2}/
    done
    o=${o%/}
    echo "$o"
done
TL;DR
sed is fast enough.
If we are talking about speed, let's check.
I think sed is the shortest solution, but as an example I'll take @choroba's shell script:
$ wc -l hugefile
10877493 hugefile
Sed:
sed 's/\(..\)/\1\//g' hugefile
Output:
real 0m25.432s
user 0m8.731s
sys 0m10.123s
Script:
#!/bin/bash
while IFS='' read -r s ; do
o=""
for (( pos=0 ; pos<${#s} ; pos+=2 )) ; do
o+=${s:pos:2}/
done
o=${o%/}
echo "$o"
done < hugefile
It ran for a really long time; I interrupted it at:
real 1m19.480s
user 1m14.795s
sys 0m4.683s
So on my PC (Intel(R) Core(TM) i5-7500 CPU @ 3.40GHz, MemTotal: 16324532 kB), sed makes around 426568 (close to half a million) string modifications per second. That seems fast enough.
You can split a string into elements using the fold command, read the elements into an array with readarray and process substitution, and then insert the field separator using IFS:
$ var=123456
$ readarray -t arr < <(fold -w2 <<< "$var")
$ (IFS=/; echo "${arr[*]}")
12/34/56
I put the last command in a subshell so the change to IFS is not persistent.
Notice that the [*] syntax is required here, or IFS won't be used as the output separator, i.e., the usually preferred [@] wouldn't work.
readarray and its synonym mapfile require Bash 4.0 or newer.
This works with an odd number of elements as well:
$ var=12345
$ readarray -t arr < <(fold -w2 <<< "$var")
$ (IFS=/; echo "${arr[*]}")
12/34/5
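Putting this together for the original goal of producing a path from an integer, a small helper along these lines should work (a sketch; the function name int_to_path is made up for illustration):
int_to_path() {
    local var=$1
    local -a arr
    # split into 2-character pieces, then join them with /
    readarray -t arr < <(fold -w2 <<< "$var")
    ( IFS=/; printf '%s\n' "${arr[*]}" )
}
int_to_path 123456   # prints 12/34/56
int_to_path 12345    # prints 12/34/5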

Fast ways to make new multiple files from one file matching multiple patterns

I have one file called uniq.txt (20,000 lines).
head uniq.txt
1
103
10357
1124
1126
I have another file called all.txt (106,371,111 lines)
head all.txt
cg0001 ? 1 -0.394991215660192
cg0001 AB 103 -0.502535661820095
cg0002 A 10357 -0.563632386999913
cg0003 ? 1 -0.394991215660444
cg0004 ? 1 -0.502535661820095
cg0004 A 10357 -0.563632386999913
cg0003 AB 103 -0.64926706504459
I would like to make 20,000 new files from all.txt, matching each line pattern of uniq.txt. For example,
head 1.newfile.txt
cg0001 ? 1 -0.394991215660192
cg0003 ? 1 -0.394991215660444
cg0004 ? 1 -0.502535661820095
head 103.newfile.txt
cg0001 AB 103 -0.502535661820095
cg0003 AB 103 -0.64926706504459
head 10357.newfile.txt
cg0002 A 10357 -0.563632386999913
cg0004 A 10357 -0.563632386999913
Is there any way that I can make the 20,000 new files really fast?
My current script takes 1 min to make one new file. I guess it's scanning the all.txt file every time it makes a new file.
You can try it with awk. Ideally you don't need >> in awk, but since you have stated there would be 20,000 files, we don't want to exhaust the system's resources by keeping too many files open.
awk '
NR==FNR { names[$0]++; next }
($3 in names) { file=$3".newfile.txt"; print $0 >>(file); close (file) }
' uniq.txt all.txt
This will first scan the uniq.txt file into memory creating a lookup table of sorts. It will then read through the all.txt file and start inserting entries into corresponding files.
This uses a while loop. It may or may not be the quickest way, but give it a try:
lines_to_files.sh
#!/bin/bash
while IFS='' read -r line || [[ -n "$line" ]]; do
    num=$(echo "$line" | awk '{print $3}')
    echo "$line" >> /path/to/save/${num}_newfile.txt
done < "$1"
usage:
$ ./lines_to_files.sh all.txt
This should create a new file for each line in your all.txt file based on the third column. As it reads each line it will add it to the appropriate file. Keep in mind that if you run the script successive times it will append the data that is already there for each file.
An explanation of the while loop used above for reading the file can be found here:
↳ https://stackoverflow.com/a/10929511/499581
You can read each line into a Bash array, then append to the file named after the number in column three (array index 2):
#!/bin/bash
while read -ra arr; do
    echo "${arr[@]}" >> "${arr[2]}".newfile.txt
done < all.txt
This creates space separated output. If you prefer tab separated, it depends a bit on your input data: if it is tab separated as well, you can just set IFS to a tab to get tab separated output:
IFS=$'\t'
while read -ra arr; do
    echo "${arr[*]}" >> "${arr[2]}".newfile.txt
done < all.txt
Notice the change in printing the array, the * is now actually required.
Or, if the input data is not tab separated (or we don't know), we can set IFS in a subshell in each loop:
while read -ra arr; do
    ( IFS=$'\t'; echo "${arr[*]}" >> "${arr[2]}".newfile.txt )
done < all.txt
I'm not sure what's more expensive, spawning a subshell or a few parameter assignments, but I suspect it's the subshell. To avoid spawning it, we can set and reset IFS in each iteration instead:
while read -ra arr; do
    old_ifs="$IFS"
    IFS=$'\t'
    echo "${arr[*]}" >> "${arr[2]}".newfile.txt
    IFS="$old_ifs"
done < all.txt
OP asked for fast ways. This is the fastest I've found.
sort -S 4G -k3,3 all.txt |
awk '{if(last!=$3){close(file); file=$3".newfile.txt"; last=$3} print $0 > file}'
Total time was 2m4.910s vs 10m4.058s for the runner-up. Note that it uses 4 GB of memory (possibly faster if more, definitely slower if less) and that it ignores uniq.txt.
Results for full-sized input files (100,000,000-line all.txt, 20,000-line uniq.txt):
sort + awk write (me)               ~800,000 input lines/second
awk append (@jaypal-singh)          ~200,000 input lines/second
bash append (@benjamin-w)            ~15,000 input lines/second
bash append + extra awk (@lll)        ~2000 input lines/second
Here's how I created the test files:
seq 1 20000 | sort -R | sed 's/.*/cg0001\tAB\t&\t-0.502535661820095/' > tmp.txt
seq 1 5000 | while read i; do cat tmp.txt; done > all.txt
seq 1 20000 | sort -R > uniq.txt
PS: Apologies for the flaw in my original test setup.
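If you do need to restrict the output to the keys in uniq.txt, the same sorted single-pass idea can be combined with the lookup table from the earlier awk answer (an untested sketch):
sort -S 4G -k3,3 all.txt |
awk 'NR == FNR { keep[$1]; next }                  # load the uniq.txt keys
     ($3 in keep) {
         if (last != $3) { close(file); file = $3 ".newfile.txt"; last = $3 }
         print > file
     }' uniq.txt -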

Adding spaces after each character in a string

I have a string variable in my script, made up of the 9 permission characters from ls -l
eg:
rwxr-xr--
I want to manipulate it so that it displays like this:
r w x r - x r - -
i.e. every group of three characters is tab separated and all other characters are separated by a space. The closest I've come is using printf:
printf "%c %c %c\t%c %c %c\t%c %c %c\t/\n" "$output"{1..9}
This only prints the first character, but formatted correctly.
I'm sure there's a way to do it using sed that I can't think of.
Any advice?
Using the POSIX-specified utilities fold and paste, split the string into individual characters, and then interleave a series of delimiters:
fold -w1 <<<"$str" | paste -sd'  \t'
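For example, with the permission string from the question (the groups are joined with real tabs; shown here with plain spaces):
$ str=rwxr-xr--
$ fold -w1 <<<"$str" | paste -sd'  \t'
r w x r - x r - -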
$ sed -r 's/(.)(.)(.)/\1 \2 \3\t/g' <<< "$output"
r w x r - x r - -
Sadly, this leaves a trailing tab in the output. If you don't want that, use:
$ sed -r 's/(.)(.)(.)/\1 \2 \3\t/g; s/\t$//' <<< "$str"
r w x r - x r - -
Why do you need to parse them? You can access any element of the string by copying just the part you need. It's very easy and needs no external utility, for example:
DATA="rwxr-xr--"
i=0
while [ $i -lt ${#DATA} ]; do
    echo ${DATA:$i:1}
    i=$(( i+1 ))
done
With awk:
$ echo "rwxr-xr--" | awk '{gsub(/./,"& ");gsub(/. . . /,"&\t")}1'
r w x r - x r - -
> echo "rwxr-xr--" | sed 's/\(.\{3,3\}\)/\1\t/g;s/\([^\t]\)/\1 /g;s/\s*$//g'
r w x r - x r - -
( Evidently I didn't put much thought into my sed command. John Kugelman's version is obviously much clearer and more concise. )
Edit: I wholeheartedly agree with triplee's comment though. Don't waste your time trying to parse ls output. I did that for a long time before I figured out you can get exactly what you want (and only what you want) much easier by using stat. For example:
> stat -c %a foo.bar # Equivalent to stat --format %a
0754
The -c %a tells stat to output the access rights of the specified file, in octal. And that's all it prints out, thus eliminating the need to do wacky stuff like ls foo.bar | awk '{print $1}', etc.
So for instance you could do stuff like:
GROUP_READ_PERMS=040
perms=0$(stat -c %a foo.bar)  # prefix a 0 so the shell's arithmetic treats the value as octal
if (( (perms & GROUP_READ_PERMS) != 0 )); then
    ... # Do some stuff
fi
Sure as heck beats parsing strings like "rwxr-xr--"
sed 's/.../& /2g;s/./& /g' YourFile
in two simple steps
A version which uses pure bash for short strings and sed for longer strings, and which preserves newlines (adding a space after them too):
if [ "${OS-}" = "Windows_NT" ]; then
threshold=1000
else
threshold=100
fi
function escape()
{
local out=''
local -i i=0
local str="${1}"
if [ "${#str}" -gt "${threshold}" ]; then
# Faster after sed is started
sed '# Read all lines into one buffer
:combine
$bdone
N
bcombine
:done
s/./& /g' <<< "${str}"
else
# Slower, but no process to load, so faster for short strings. On windows
# this can be a big deal
while (( i < ${#str} )); do
out+="${str:$i:1} "
i+=1
done
echo "$out"
fi
}
Explanation of the sed script: if this is the last line, jump to :done; otherwise append the Next line into the buffer and jump back to :combine. After :done there is a simple sed replacement expression. The entire string (newlines and all) is in one buffer so that the replacement works on newlines too (which are lost in some of the awk -F examples).
Plus this is Linux, Mac, and Git for Windows compatible.
Setting awk -F '' makes each character its own field; then you loop through and print each field.
Example:
ls -l | sed -n 2p | awk -F '' '{for(i=1;i<=NF;i++){printf " %s ",$i;}}'; echo ""
This part seems like the answer to your question:
awk -F '' '{for(i=1;i<=NF;i++){printf " %s ",$i;}}'
I realize this doesn't provide the groups-of-three formatting you wanted, though.
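To get the space/space/tab grouping as well, the per-field loop can pick the separator based on position (a sketch, not from the original answer; like the line above it relies on -F '' splitting into characters, which is gawk behaviour):
echo "rwxr-xr--" |
awk -F '' '{
    for (i = 1; i <= NF; i++)
        printf "%s%s", $i, (i == NF ? "\n" : i % 3 == 0 ? "\t" : " ")
}'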

Take nth column in a text file

I have a text file:
1 Q0 1657 1 19.6117 Exp
1 Q0 1410 2 18.8302 Exp
2 Q0 3078 1 18.6695 Exp
2 Q0 2434 2 14.0508 Exp
2 Q0 3129 3 13.5495 Exp
I want to take the 2nd and 4th word of every line like this:
1657 19.6117
1410 18.8302
3078 18.6695
2434 14.0508
3129 13.5495
I'm using this code:
nol=$(cat "/path/of/my/text" | wc -l)
x=1
while [ $x -le "$nol" ]
do
    line=($(sed -n "$x"p /path/of/my/text))
    echo ""${line[1]}" "${line[3]}"" >> out.txt
    x=$(( $x + 1 ))
done
It works, but it is very complicated and takes a long time to process long text files.
Is there a simpler way to do this?
IIRC:
cat filename.txt | awk '{ print $2, $4 }'
or, as mentioned in the comments:
awk '{ print $2, $4 }' filename.txt
You can use the cut command:
cut -d' ' -f3,5 < datafile.txt
prints
1657 19.6117
1410 18.8302
3078 18.6695
2434 14.0508
3129 13.5495
where
-d' ' means use a space as the delimiter
-f3,5 means take and print the 3rd and 5th columns
cut is much faster on large files than a pure shell solution. If your file is delimited with multiple whitespace characters, you can remove them first, like:
sed 's/[\t ][\t ]*/ /g' < datafile.txt | cut -d' ' -f3,5
where the (GNU) sed will replace any run of tab or space characters with a single space.
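Alternatively, tr -s can squeeze runs of blanks before cut (a small variation, assuming only spaces and tabs need collapsing):
tr -s '[:blank:]' ' ' < datafile.txt | cut -d' ' -f3,5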
For a variant - here is a perl solution too:
perl -lanE 'say "$F[2] $F[4]"' < datafile.txt
For the sake of completeness:
while read -r _ _ one _ two _; do
echo "$one $two"
done < file.txt
Instead of _ an arbitrary variable (such as junk) can be used as well. The point is just to extract the columns.
Demo:
$ while read -r _ _ one _ two _; do echo "$one $two"; done < /tmp/file.txt
1657 19.6117
1410 18.8302
3078 18.6695
2434 14.0508
3129 13.5495
One more simple variant:
$ while read line
do
    set $line # assigns words in line to positional parameters
    echo "$3 $5"
done < file
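A slightly more defensive variant of the same idea (a sketch; set -- guards against lines that start with a dash, and set -f avoids surprises if a line contains glob characters):
while read -r line
do
    set -f            # temporarily disable filename expansion
    set -- $line      # word-split the line into $1, $2, ...
    set +f
    echo "$3 $5"
done < file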
If your file contains n lines, then your script has to read the file n times; so if you double the length of the file, you quadruple the amount of work your script does — and almost all of that work is simply thrown away, since all you want to do is loop over the lines in order.
Instead, the best way to loop over the lines of a file is to use a while loop, with the condition-command being the read builtin:
while IFS= read -r line ; do
    # $line is a single line of the file, as a single string
    : ... commands that use $line ...
done < input_file.txt
In your case, since you want to split the line into an array, and the read builtin has special support for populating an array variable, you can write:
while read -r -a line ; do
    echo ""${line[1]}" "${line[3]}"" >> out.txt
done < /path/of/my/text
or better yet:
while read -r -a line ; do
    echo "${line[1]} ${line[3]}"
done < /path/of/my/text > out.txt
However, for what you're doing you can just use the cut utility:
cut -d' ' -f2,4 < /path/of/my/text > out.txt
(or awk, as Tom van der Woerdt suggests, or perl, or even sed).
If you are using structured data, this has the added benefit of not invoking an extra shell process to run tr and/or cut or something. ...
(Of course, you will want to guard against bad inputs with conditionals and sane alternatives.)
...
while read line ;
do
    lineCols=( $line ) ;
    echo "${lineCols[0]}"
    echo "${lineCols[1]}"
done < $myFQFileToRead ;
...
