I really need your help. I have a file which contains data like (field:value) on one line:
File.A
A:13 B:2 D:5 F:92 G:3 ...
I have created a file which includes "A to Z".
File.B
A B C D E F G H I J ...
I'm trying to use a bash script to read the content and fix the output so that missing fields are inserted with a 0 value:
A:13 B:2 C:0 D:5 E:0 F:92 G:3 H:0 ...
I've been thinking about this for two days, but nothing has come to mind. Is there any way I can solve it?
Let's use brace expansion: {A..Z} expands to the full list of letters:
$ echo {A..Z}
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Then we can loop through the letters, grepping for each one. If it matches, we print the matching line; otherwise, we print letter:0.
for letter in {A..Z}
do
grep "^$letter" file || echo "$letter:0"
done
Test
$ for letter in {A..Z}; do grep "^$letter" file || echo "$letter:0"; done
A:13
B:2
C:0
D:5
E:0
F:92
G:3
H:0
I:0
J:0
K:0
L:0
M:0
N:0
O:0
P:0
Q:0
R:0
S:0
T:0
U:0
V:0
W:0
X:0
Y:0
Z:0
Now that you have updated the question and the input file contains everything on the same line, you can use this grep to match:
grep -o "$word:[0-9]*" file
and then replace newlines with spaces:
$ for word in {A..Z}; do grep -o "$word:[0-9]*" file || echo "$word:0"; done | tr '\n' ' '
A:13 B:2 C:0 D:5 E:0 F:92 G:3 H:0 I:0 J:0 K:0 L:0 M:0 N:0 O:0 P:0 Q:0 R:0 S:0 T:0 U:0 V:0 W:0 X:0 Y:0 Z:0
If you fancy a bit of awk, you could try this:
awk -F: -v RS=" " '
{ c[$1] = $2 }
END {
    for (i = 65; i < 91; ++i) {
        a = sprintf("%c", i)
        printf("%c:%d ", i, c[a])
    }
}' A
where A is your file. The first block builds an array of all the values that have been set. Once the whole file has been read, the loop goes through the ASCII values of A (65) to Z (90) and prints out the values that have been set in the array. The ones that are missing are printed as 0.
Output:
A:13 B:2 C:0 D:5 E:0 F:92 G:3 H:0 I:0 J:0 K:0 L:0 M:0 N:0 O:0 P:0 Q:0 R:0 S:0 T:0 U:0 V:0 W:0 X:0 Y:0 Z:0
Since everyone clearly can't get enough of my answer, here's another way you could do it, inspired by the {A..Z} range used in @fedorqui's answer:
awk -F: -v RS=" " '
NR==FNR { a[i++] = $1; next }
{ b[$1] = $2 }
END { for (i = 0; i < length(a); ++i) printf("%c:%d ", a[i], b[a[i]]) }' - <<<$(echo {A..Z}) A
The first block reads in all the letters of the alphabet, which avoids having to know their character codes. The second block builds an array from your file A. Once the file has been read, all the values are printed out, resulting in the same output as above.
Pure Bash, no external processes. Print the match if the letter is found in the line, or the letter followed by :0 otherwise.
read content < "$infile"
for letter in {A..Z}; do
if [[ $content =~ ${letter}:[[:digit:]]+ ]] ; then
echo "${BASH_REMATCH[0]}"
else
echo "${letter}:0"
fi
done
or shorter
for x in {A..Z}; do
[[ $content =~ ${x}:[0-9]+ ]] && echo "${BASH_REMATCH[0]}" || echo "${x}:0"
done
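If you want everything on one line, as in the question's expected output, the matches can be collected into an array first. A small sketch along the same lines (infile is assumed to contain your data):
read content < "$infile"
out=()
for x in {A..Z}; do
    [[ $content =~ ${x}:[0-9]+ ]] && out+=("${BASH_REMATCH[0]}") || out+=("${x}:0")
done
echo "${out[@]}"    # prints A:13 B:2 C:0 D:5 E:0 F:92 G:3 ... on one line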
Related
How to get information from specimen1 to specimen3 and paste it into another file 'DNA_combined.txt'?
I tried the cut command and the awk command, but I found that it is tricky to cut by paragraph(?) or sequence.
My trial was something like cut -d '>' -f 1-3 dna1.fasta > DNA_combined.txt
You can get the line number for each row in vi/vim by pressing Esc + : and typing set nu.
Once you can see the line number for each row:
Note down the line number of the line containing >Specimen1 (say X) and of the line containing >Specimen3 (say Y).
Then, use the sed command to get the text between the two lines:
sed -n 'X,Yp' dna1.fasta > DNA_combined.txt
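For example, with made-up line numbers (say >Specimen1 starts at line 5 and the Specimen3 block ends at line 12):
sed -n '5,12p' dna1.fasta > DNA_combined.txt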
Please let me know if you have any questions.
If you want the first three sequences irrespective of the content after >, you can use this:
$ cat ip.txt
>one
ACGTA
TCGAAA
>two
TGACA
>three
ACTG
AAAAC
>four
ATGC
>five
GTA
$ awk '/^>/ && ++count==4{exit} 1' ip.txt
>one
ACGTA
TCGAAA
>two
TGACA
>three
ACTG
AAAAC
/^>/ matches the start of a sequence
for such sequences, increment the count variable
if count reaches 4, the exit command will terminate the script
1 is an awk idiom that prints the contents of the input record
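If you instead need a middle range of sequences rather than the first few, a similar counter can select it. An illustrative sketch that prints only the second and third sequences:
awk '/^>/{++count} count>=2 && count<=3' ip.txt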
Would you please try the following:
awk '
BEGIN {print ">Specimen1-3"} # print header
/^>Specimen/ {f = match($0, "^>Specimen[1-3]") ? 1 : 0; next}
# set the flag depending on the number
f # print if f == 1
' dna1.fasta > DNA_combined.txt
Hello, let's say I have a file such as:
$OUT some text
some text
some text
$OUT
$OUT
$OUT
How can I use sed to replace the 3 consecutive $OUT lines with "replace-thing"?
and get
$OUT some text
some text
some text
replace-thing
With sed:
sed -n '1h; 1!H; ${g; s/\$OUT\n\$OUT\n\$OUT/replace-thing/g; p;}' file
GNU sed does not require the semicolon after p.
With commentary
sed -n ' # without printing every line:
# next 2 lines read the entire file into memory
1h # line 1, store current line in the hold space
1!H # not line 1, append a newline and current line to hold space
# now do the search-and-replace on the file contents
${ # on the last line:
g # replace pattern space with contents of hold space
s/\$OUT\n\$OUT\n\$OUT/replace-thing/g # do replacement
p # and print the revised contents
}
' file
This is the main reason I only use sed for very simple things: once you start using the lesser-used commands, you need extensive commentary to understand the program.
Note the commented version does not work on the BSD-derived sed on MacOS -- the comments break it, but removing them is OK.
In plain bash:
pattern=$'$OUT\n$OUT\n$OUT' # using ANSI-C quotes
contents=$(< file)
echo "${contents//$pattern/replace-thing}"
And the perl one-liner (-0777 slurps the whole file so the regex can match across newlines):
perl -0777 -pe 's/\$OUT(\n\$OUT){2}/replace-thing/g' file
For this particular task, I recommend using awk instead (hope that's an option too).
Update: to replace all 3 $OUT use: (Thanks to @thanasisp and @glenn jackman)
awk '
BEGIN {
    i = 0
    p = "$OUT"          # pattern to match
    n = 3               # number of consecutive matches
    r = "replace-thing"
}
$0 == p {
    if (++i == n) {
        print(r)
        i = 0           # reset counter
    }
    next
}
{
    for (j = 0; j < i; ++j) print(p)    # flush a shorter run of the pattern
    i = 0
    print($0)
}
END {
    for (j = 0; j < i; ++j) print(p)    # flush a trailing short run
}' input.txt
If you just want to replace the 3rd $OUT usage, use:
awk '
BEGIN {
    i = 0
    p = "\\$OUT"        # pattern to match
    n = 3               # Nth match
    r = "replace-thing"
}
$0 ~ p {
    if (++i == n) {
        print(r)        # print the replacement ...
        next            # ... and skip the matched line itself
    }
}
{ print($0) }' input.txt
This might work for you (GNU sed):
sed -E ':a;N;s/[^\n]*/&/3;Ta;/^(\$OUT\n?){3}$/d;P;D' file
Gather up 3 lines in the pattern space and if those 3 lines each contain $OUT, delete them. Otherwise, print/delete the first line and repeat.
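Note this deletes the three lines rather than replacing them. A sketch of a variant that substitutes "replace-thing" instead, using the same gathering loop (GNU sed assumed):
sed -E ':a;N;s/[^\n]*/&/3;Ta;s/^(\$OUT\n?){3}$/replace-thing/;P;D' file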
I'm trying to count occurrences of a repeated string. E.g.
echo 'joebobtomtomtomjoebobmike' | grep -o 'tomtom' | wc -l
This outputs 1, but the string 'tomtom' obviously fits twice here. How can I make it count both occurrences?
Thanks!
You can use this awk script:
{
count = 0
$0 = tolower($0)
while (length() > 0) {
m = match($0, pattern)
if (m == 0)
break
count++
$0 = substr($0, m + 1)
}
print count
}
Explanation
We first convert the line to all lower case to ignore case. This script works by shortening the string after matching the pattern. It uses the function match() to find the position where the pattern is matched. If
m == 0, that means no matches were found, so we can break from the loop. We increment count each iteration of the loop, then reset the $0 string to the substring starting at index m + 1.
If you save this as a.awk, you can do
echo "joebobtomtomtomjoebobmike" | awk -v "pattern=tomtom" -f a.awk
And it will output 2.
This might work for you (GNU sed):
sed -r '/(tom)\1/!d;:a;s//\n\1/;ta;s/\n//' | wc -l
The repeating pattern tomtom can be rewritten in regexp form as (tom)\1. Replacing the first half of each repeating pattern with a newline, and looping until no more matches are found, yields one newline per overlapping match. Since the printed result ends with a newline of its own, that must be taken into account: one newline (here the first) is removed before printing so the count comes out right. Of course, if there is no repeating pattern at all, the result must be zero, hence the first sed command, which deletes non-matching lines.
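For example, with the string from the question (GNU sed assumed):
$ echo 'joebobtomtomtomjoebobmike' | sed -r '/(tom)\1/!d;:a;s//\n\1/;ta;s/\n//' | wc -l
2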
You could just walk the length of the string and see if the substring at the current location is the desired text:
string=joebobtomtomtomjoebobmiketomtomtom
match=tomtom
count=0
for ((i=0; i <= ${#string} - ${#match}; i++)); do
[[ ${string:i:${#match}} == $match ]] && ((count++))
done
echo $count # => 4
I have a column
1
1
1
2
2
2
I would like to insert a blank line when the value in the column changes:
1
1
1
<- blank line
2
2
2
I would recommend using awk:
awk -v i=1 'NR>1 && $i!=p { print "" }{ p=$i } 1' file
On any line after the first, if the value of the "i"th column is different from the previous value, print a blank line. Always set the value of p. The 1 at the end evaluates to true, which means that awk prints the line. i can be set to the column number of your choice.
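For instance, to key on the second column instead (an illustrative variation, assuming whitespace-separated input):
awk -v i=2 'NR>1 && $i!=p { print "" }{ p=$i } 1' file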
while read L; do [[ "$L" != "$PL" && "$PL" != "" ]] && echo; echo "$L"; PL="$L"; done < file
awk(1) seems like the obvious answer to this problem:
#!/usr/bin/awk -f
BEGIN { prev = "" }
/./ {
if (prev != "" && prev != $1) print ""
print
prev = $1
}
You can also do this with sed:
sed '{N;s/^\(.*\)\n\1$/\1\n\1/;tx;P;s/^.*\n/\n/;P;D;:x;P;D}'
The long version with explanations is:
sed '{
N # read second line; (terminate if there are no more lines)
s/^\(.*\)\n\1$/\1\n\1/ # try to replace two identical lines with themselves
tx # if replacement succeeded then goto label x
P # print the first line
s/^.*\n/\n/ # replace first line by empty line
P # print this empty line
D # delete empty line and proceed with input
:x # label x
P # print first line
D # delete first line and proceed with input
}'
One thing I like about using (GNU) sed (though it's not clear from your question whether this is useful to you) is that you can easily apply changes in place with the -i switch, e.g.
sed -i '{N;s/^\(.*\)\n\1$/\1\n\1/;tx;P;s/^.*\n/\n/;P;D;:x;P;D}' FILE
You could use the getline function in awk to match the current line against the following line (guarding getline avoids duplicating the last line of an odd-length file):
awk '{f=$1; print; if ((getline) > 0) { if (f != $1) print ""; print } }' file
I have a text file containing about a thousand ("10 hundreds") lines, with different lengths. Now I want to select N lines randomly, save them in another file, and remove them from the original file.
I've found some answers to this question, but most of them use a simple idea: sort the file and select the first or last N lines. Unfortunately this idea doesn't work for me, because I want to preserve the order of the lines.
I tried this piece of code, but it's very slow and takes hours.
FILEsrc=$1;
FILEtrg=$2;
MaxLines=$3;
let LineIndex=1;
while [ "$LineIndex" -le "$MaxLines" ]
do
# count number of lines
NUM=$(wc -l $FILEsrc | sed 's/[ \r\t].*$//g');
let X=(${RANDOM} % ${NUM} + 1);
echo $X;
sed -n ${X}p ${FILEsrc}>>$FILEtrg; #write selected line into target file
sed -i -e ${X}d ${FILEsrc}; #remove selected line from source file
LineIndex=`expr $LineIndex + 1`;
done
I found this line to be the most time-consuming one in the code:
sed -i -e ${X}d ${FILEsrc};
Is there any way to overcome this problem and make the code faster?
Since I'm in a hurry, may I ask you to send me complete C/C++ code for doing this?
A simple O(n) algorithm is described in:
http://en.wikipedia.org/wiki/Reservoir_sampling
array R[k]; // result
integer i, j;
// fill the reservoir array
for each i in 1 to k do
R[i] := S[i]
done;
// replace elements with gradually decreasing probability
for each i in k+1 to length(S) do
j := random(1, i); // important: inclusive range
if j <= k then
R[j] := S[i]
fi
done
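A minimal awk sketch of that algorithm, assuming k is your sample size and $FILEsrc holds at least k lines (note the sample comes out in reservoir order, not file order, so re-sort it if order matters):
awk -v k=10 '
BEGIN { srand() }
NR <= k { R[NR] = $0; next }          # fill the reservoir with the first k lines
{
    j = int(rand() * NR) + 1          # random index in 1..NR, inclusive
    if (j <= k) R[j] = $0             # replace with gradually decreasing probability
}
END { for (i = 1; i <= k; ++i) print R[i] }' "$FILEsrc"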
Generate all your offsets, then make a single pass through the file. Assuming you have the desired number of offsets in offsets (one number per line) you can generate a single sed script like this:
sed "s!.*!&{w $FILEtrg\nd;}!" offsets
The output is a sed script which you can save to a temporary file, or (if your sed dialect supports it) pipe to a second sed instance:
... | sed -i -f - "$FILEsrc"
Generating the offsets file is left as an exercise (one possibility is sketched below).
Given that you have the Linux tag, this should work right off the bat. The default sed on some other platforms may not understand \n and/or accept -f - to read the script from standard input.
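For instance, one way to generate the offsets file, assuming GNU shuf is available (shuf -i picks MaxLines distinct numbers from the range without repetition):
shuf -i 1-"$(wc -l < "$FILEsrc")" -n "$MaxLines" > offsets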
Here is a complete script, updated to use shuf (thanks @Thor!) to avoid possible duplicate random numbers.
#!/bin/sh
FILEsrc=$1
FILEtrg=$2
MaxLines=$3
# Add a line number to each input line
nl -ba "$FILEsrc" |
# Rearrange lines
shuf |
# Pick out the line number from the first $MaxLines ones into sed script
sed "1,${MaxLines}s!^ *\([1-9][0-9]*\).*!\1{w $FILEtrg\nd;}!;t;D;q" |
# Run the generated sed script on the original input file
sed -i -f - "$FILEsrc"
[I've updated each solution to remove selected lines from the input, but I'm not positive the awk is correct. I'm partial to the bash solution myself, so I'm not going to spend any time debugging it. Feel free to edit any mistakes.]
Here's a simple awk script (the probabilities are simpler to manage with floating point numbers, which don't mix well with bash):
tmp=$(mktemp /tmp/XXXXXXXX)
awk -v total=$(wc -l < "$FILEsrc") -v maxLines=$MaxLines '
BEGIN { srand(); }
maxLines==0 { exit; }
{ if (rand() < maxLines/total--) {
print; maxLines--;
} else {
print $0 > "/dev/fd/3"
}
}' "$FILEsrc" > "$FILEtrg" 3> $tmp
mv $tmp "$FILEsrc"
As you print a line to the output, you decrement maxLines to decrease the probability of choosing further lines. But as you consume the input, you decrease total to increase the probability. In the extreme, the probability hits zero when maxLines does, so you can stop processing the input. In the other extreme, the probability hits 1 once total is less than or equal to maxLines, and you'll be accepting all further lines.
Here's the same algorithm, implemented in (almost) pure bash using integer arithmetic:
FILEsrc=$1
FILEtrg=$2
MaxLines=$3
tmp=$(mktemp /tmp/XXXXXXXX)
total=$(wc -l < "$FILEsrc")
while IFS= read -r line && (( MaxLines > 0 )); do
    (( MaxLines * 32768 > RANDOM * total-- )) || { printf '%s\n' "$line" >&3; continue; }
    (( MaxLines-- ))
    printf '%s\n' "$line"
done < "$FILEsrc" > "$FILEtrg" 3> $tmp
mv $tmp "$FILEsrc"
Here's a complete Go program:
package main
import (
"bufio"
"fmt"
"log"
"math/rand"
"os"
"sort"
"time"
)
func main() {
N := 10
rand.Seed(time.Now().UTC().UnixNano())
f, err := os.Open(os.Args[1]) // open the file
if err!=nil { // and tell the user if the file wasn't found or readable
log.Fatal(err)
}
r := bufio.NewReader(f)
var lines []string // this will contain all the lines of the file
for {
if line, err := r.ReadString('\n'); err == nil {
lines = append(lines, line)
} else {
break
}
}
nums := make([]int, N) // creates the array of desired line indexes
for i := range nums { // fills the array with random numbers (lower than the number of lines)
nums[i] = rand.Intn(len(lines))
}
sort.Ints(nums) // sorts this array
for _, n := range nums { // let's print the line
fmt.Println(lines[n])
}
}
Provided you put the go file in a directory named randomlines in your GOPATH, you may build it like this:
go build randomlines
And then call it like this:
./randomlines "path_to_my_file"
This will print N (here 10) random lines from your file, without changing their order. Of course it's near instantaneous even with big files.
Here's an interesting two-pass option with coreutils, sed and awk:
n=5
total=$(wc -l < infile)
seq 1 $total | shuf | head -n $n \
| sed 's/^/NR == /; $! s/$/ ||/' \
| tr '\n' ' ' \
| sed 's/.*/ & { print >> "rndlines" }\n!( &) { print >> "leftover" }/' \
| awk -f - infile
A list of random numbers is passed to sed, which generates an awk script. If awk were removed from the pipeline above, this would be the output:
NR == 14 || NR == 1 || NR == 11 || NR == 20 || NR == 21 { print >> "rndlines" }
!(NR == 14 || NR == 1 || NR == 11 || NR == 20 || NR == 21) { print >> "leftover" }
So the random lines are saved in rndlines and the rest in leftover.
Mentioned "10 hundreds" lines should sort quite quickly, so this is a nice case for the Decorate, Sort, Undecorate pattern. It actually creates two new files, removing lines from the original one can be simulated by renaming.
Note: head and tail cannot be used instead of awk, because they close the file descriptor after the given number of lines, making tee exit and thus causing missing data in the .rest file.
FILE=input.txt
SAMPLE=10
SEP=$'\t'
<$FILE nl -s "$SEP" -nln -w1 |
sort -R |
tee \
>(awk "NR > $SAMPLE" | sort -t"$SEP" -k1n,1 | cut -d"$SEP" -f2- > $FILE.rest) \
>(awk "NR <= $SAMPLE" | sort -t"$SEP" -k1n,1 | cut -d"$SEP" -f2- > $FILE.sample) \
>/dev/null
# check the results
wc -l $FILE*
# 'remove' the lines, if needed
mv $FILE.rest $FILE
This might work for you (GNU sed, sort and seq):
n=10
seq 1 $(sed '$=;d' input_file) |
sort -R |
sed "${n}q" |
sed 's/.*/&{w output_file\nd}/' |
sed -i -f - input_file
Where $n is the number of lines to extract.