Extracting text from a txt file - linux

I have a txt file with records in it. The records follow this pattern:
six lines, a blank line, six lines, and so on, like this example:
string line 1
string line 2
string line 3
string line 4
string line 5 (year format yyyy)
string line 6 (can use several lines)
<blank line> (there is always a blank line where a new text block begins)
string line 1
string line 2
string line 3
string line 4
string line 5 (year format yyyy)
string line 6
Here is a real example; I need the title (line 2) and the year (line 5):
Hualong Yu, Geoffrey I. Webb,
Adaptive online extreme learning machine by regulating forgetting factor by concept drift map,
Neurocomputing,
Volume 343,
2019,
Pages 141-153,
ISSN 0925-2312,
https://doi.org/10.1016/j.neucom.2018.11.098.
https://www.sciencedirect.com/science/article/pii/S0925231219301572
Antonino Feitosa Neto, Anne M.P. Canuto,
EOCD: An ensemble optimization approach for concept drift applications,
Information Sciences,
Volume 561,
2021,
Pages 81-100,
ISSN 0020-0255,
https://doi.org/10.1016/j.ins.2021.01.051.
https://www.sciencedirect.com/science/article/pii/S002002552100089X
I want to extract the string in line 2 and the year in line 5 of all blocks of text (separated by blank lines) and save them to another txt file with this output:
string line2 , yyyy
I don't have experience with the Linux shell, so I am asking for some input to help me do this task.
Thanks

If you don't care about the trailing comma in line 5, just do:
awk '{print $2, $5}' RS= FS='\\n' input > output
This assumes that the blank line separating the records is indeed completely blank and does not contain any whitespace. If there is any whitespace in that line, you'll want to pre-filter the data to remove it.
e.g.:
$ cat input
Hualong Yu, Geoffrey I. Webb,
Adaptive online extreme learning machine by regulating forgetting factor by concept drift map,
Neurocomputing,
Volume 343,
2019,
Pages 141-153,
ISSN 0925-2312,
https://doi.org/10.1016/j.neucom.2018.11.098.
https://www.sciencedirect.com/science/article/pii/S0925231219301572
Antonino Feitosa Neto, Anne M.P. Canuto,
EOCD: An ensemble optimization approach for concept drift applications,
Information Sciences,
Volume 561,
2021,
Pages 81-100,
ISSN 0020-0255,
https://doi.org/10.1016/j.ins.2021.01.051.
https://www.sciencedirect.com/science/article/pii/S002002552100089X
$ awk '{print $2, $5}' RS= FS='\\n' input
Adaptive online extreme learning machine by regulating forgetting factor by concept drift map, 2019,
EOCD: An ensemble optimization approach for concept drift applications, 2021,
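If the separator lines do contain stray spaces or tabs, a minimal pre-filter sketch (assuming a sed with POSIX character classes) is to blank them out before awk sees them:
$ sed 's/^[[:blank:]]*$//' input | awk 'BEGIN{RS=""; FS="\n"} {print $2, $5}' > output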

Something like:
perl -00 -nE 'my @ln = (split /,\n/)[1,4]; say join(",", @ln)' input.txt > output.txt
should work as at least a starting point. Reads a paragraph at a time, splits up into lines, and prints the two you're looking for on the same line separated by a comma.
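If you also want to strip the trailing commas so the output is exactly string line2 , yyyy, a rough awk variant in paragraph mode could look like this (the field numbers are assumptions based on your sample, and it assumes the same blank-line separation):
$ awk 'BEGIN{RS=""; FS="\n"} {t=$2; y=$5; sub(/,$/,"",t); sub(/,$/,"",y); print t" , "y}' input.txt > output.txt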

Related

Count the number of times a substring appears in a file and place it in a new column

Question:
I have 2 files, file 1 is a TSV (BED) file that has 23 base-pair sequences in column 7, for example:
1 779692 779715 Sample_3 + 1 ATGGTGCTTTGTTATGGCAGCTC
1 783462 783485 Sample_4 - 1 ATGAATAAGTCAAGTAAATGGAC
File 2 is a FASTA file (hg19.fasta) that looks like this. Although it breaks across the lines, this continuous string of A, C, G, and T's represents a continuous sequence (i.e. a chromosome). This file is the entire human reference genome build 19, so the two > headers followed by sequences essentially occur 23 times, once for each of the 23 chromosomes:
>1 dna:chromosome chromosome:GRCh37:1:1:249250621:1
AATTTGACCAGAAGTTATGGGCATCCCTCCCCTGGGAAGGAGGCAGGCAGAAAAGTTTGGAATCTATGTAGTAAAATATG
TTACTCTTTTATATATATGAATAAGTCAAGTAAATGGACATACATATATGTGTGTATATGTGTATATATATATACACACA
TATATACATACATACATACATACATATTATCTGAATTAGGCCATGGTGCTTTGTTATGGCAGCTCTCTGGGATACATGTG
CAGAATGTACAGGTTTGTTACACAGGTATACACCTGCCATGGTTGTTTGCTGCACCCATCAACTCACCATCTACATTAGG
TATTTCTCCTAACGTTATCCCTCATGAATAAGTCAAGTAAATGGAC
>2 dna:chromosome chromosome:GRCh37:1:1:2492300:1
AATTTGACCAGAAGTTATGGGCATCCCTCCCCTGGGAAGGAGGCAGGCAGAAAAGTTTGGAATCTATGTAGTAAAATATG
TTACTCTTTTATATATATGAATAAGTCAAGTAAATGGACATACATATATGTGTGTATATGTGTATATATATATACACACA
TATATACATACATACATACATACATATTATCTGAATTAGGCCATGGTGCTTTGTTATGGCAGCTCTCTGGGATACATGTG
I want to 1) find out how many times each 23 bp sequence appears in the second file, counting only non-overlapping matches and including sequences that break across lines, and 2) append this number as a new column next to the sequence, so the new file looks like this:
Desired Output:
1 779692 779715 Sample_3 + 1 ATGGTGCTTTGTTATGGCAGCTC 1
1 783462 783485 Sample_4 - 1 ATGAATAAGTCAAGTAAATGGAC 2
My attempt:
I imagine solving the first part will be some variation on grep, so far I've managed:
grep -o ATGGTGCTTTGTTATGGCAGCTC "$file_2" | grep -c ""
which gets the count of a specific sequence, but not each sequence in the column. I think appending the grep results will require awk and paste but I haven't gotten that far!
Any help is appreciated as always! =)
Updates and Edits:
The actual size of these files is quite large (30 MB or ~500,000 lines for each tsv/BED file), and the FASTA file is the entire human reference genome build 19, which is ~60,000,000 lines. The Perl solution proposed by @choroba works, but doesn't scale well to these sizes.
Unfortunately, because of the need to identify matches across lines, the awk and bash/grep solutions mentioned below won't work.
I want multiple non-overlapping hits in the same chromosome to count as the actual number of hits. I.e. If you search for a sequence and get 2 hits in a single chromosome and 1 in another chromosome, the total count should be 3.
Ted Lyngmo is very kindly helping me develop a solution in C++ that allows this to be run in a realistic timeframe; there's more detail in his post in this thread. The link to the GitHub repository for this is here =)
If the second file is significantly bigger than the first one, I would try this awk script:
awk 'v==1 {a[$7]; next}           # Get the patterns from the first file into the array a
     v==2 {                       # For each line of the second file
       for (i in a) {             # Loop through all patterns
         a[i] += split($0,b,i)-1  # Accumulate the number of pattern matches on this line
       }
     }
     v==3 {print $0, a[$7]}       # Re-read the first file to append the number of pattern matches
' v=1 file1 v=2 file2 v=3 file1
I'd reach for a programming language like Perl.
#!/usr/bin/perl
use warnings;
use strict;
my ($fasta_file, $bed_file) = @ARGV;
open my $fasta, '<', $fasta_file or die "$fasta_file: $!";
open my $bed, '<', $bed_file or die "$bed_file: $!";
my $seq;
while (<$fasta>) {
$seq .= "\n", next if /^>/;
chomp;
$seq .= $_;
}
while (<$bed>) {
chomp;
my $short_seq = (split /\t/, $_)[-1];
my $count = () = $seq =~ /\Q$short_seq\E/g;
print "$_\t$count\n";
}
To count overlapping sequences, change the regex to a lookahead.
my $count = () = $seq =~ /(?=\Q$short_seq\E)/g;
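As a quick illustration of the difference on a toy string (not your real data), non-overlapping matching finds 2 occurrences of AAA in AAAAAA, while the lookahead counts every possible start position:
$ perl -E '$s = "AAAAAA"; say scalar(() = $s =~ /AAA/g); say scalar(() = $s =~ /(?=AAA)/g)'
2
4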
Since grep -c seems to give you the correct count (matching lines, not counting multiple occurrences on the same line), you could read the 7 fields from the TSV (BED) file and just print them again with the grep output added to the end:
#!/bin/bash
# read the fields into the array `v`:
while read -ra v
do
    # print the 7 first elements in the array + the output from `grep -c`:
    echo "${v[@]:0:7}" "$(grep -Fc "${v[6]}" hg19.fasta)"
done < tsv.bed > outfile
outfile will now contain
1 779692 779715 Sample_3 + 1 ATGGTGCTTTGTTATGGCAGCTC 1
1 783462 783485 Sample_4 - 1 ATGAATAAGTCAAGTAAATGGAC 2
Benchmarks
This table is a comparison of the three different solutions presented as answers here, with timings for completing different numbers of tsv/bed records against the full hg19.fa file (excluding the records containing only Ns); hg19.fa contains 57'946'726 such records. As a baseline I've used two versions of a C++ program (called hgsearch/hgsearchmm). hgsearch reads the whole hg19.fa file into memory and then searches it in parallel. hgsearchmm uses a memory-mapped file instead and then searches that (also in parallel).
search \ beds    1           2            100          1000         10000
awk              1m0.606s    17m19.899s   -            -            -
perl             13.263s     15.618s      4m48.751s    48m27.267s   -
bash/grep        2.088s      3.670s       3m27.378s    34m41.129s   -
hgsearch         8.776s      9.425s       30.619s      3m56.508s    38m43.984s
hgsearchmm       1.942s      2.146s       21.715s      3m28.265s    34m56.783s
The tests were run on an Intel Core i9 with 12 cores / 24 hyperthreads, in WSL/Ubuntu 20.04 (SSD disk).
The sources for the scripts and baseline programs used can be found here

How can I append any string at the end of line and keep doing it after specific number of lines?

I want to add the symbol " >>" at the end of the 1st line, then the 5th line, and so on: 1, 5, 9, 13, 17, .... I searched the web and went through the article below, but I'm unable to achieve it. Please help.
How can I append text below the specific number of lines in sed?
retentive
good at remembering
The child was very sharp, and her memory was extremely retentive.
— Rowlands, Effie Adelaide
unconscionable
greatly exceeding bounds of reason or moderation
For generations in the New York City public schools, this has become the norm with devastating consequences rooted in unconscionable levels of student failure.
— New York Times (Nov 4, 2011)
The output should be like this:
retentive >>
good at remembering
The child was very sharp, and her memory was extremely retentive.
— Rowlands, Effie Adelaide
unconscionable >>
greatly exceeding bounds of reason or moderation
For generations in the New York City public schools, this has become the norm with devastating consequences rooted in unconscionable levels of student failure.
— New York Times (Nov 4, 2011)
You can do it with awk:
awk '{if ((NR-1) % 5) {print $0} else {print $0 " >>"}}'
We check if line number minus 1 is a multiple of 5 and if it is we output the line followed by a >>, otherwise, we just output the line.
Note: The above code outputs the suffix every 5 lines, because that's what is needed for your example to work.
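To see which line numbers end up with the suffix, you can feed the same command some numbered toy input:
$ seq 1 7 | awk '{if ((NR-1) % 5) {print $0} else {print $0 " >>"}}'
1 >>
2
3
4
5
6 >>
7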
You can do it multiple ways. sed is kind of odd when it comes to selecting lines but it's doable. E.g.:
sed:
sed -i -e 's/$/ >>/;n;n;n;n' file
You can also do it as a Perl one-liner:
perl -pi.bak -e 's/(.*)/$1 >>/ if not (( $. - 1 ) % 5)' file
You're thinking about this wrong. You should append to the end of the first line of every paragraph; don't worry about how many lines there happen to be in any given paragraph. That's just:
$ awk -v RS= -v ORS='\n\n' '{sub(/\n/," >>&")}1' file
retentive >>
good at remembering
The child was very sharp, and her memory was extremely retentive.
— Rowlands, Effie Adelaide
unconscionable >>
greatly exceeding bounds of reason or moderation
For generations in the New York City public schools, this has become the norm with devastating consequences rooted in unconscionable levels of student failure.
— New York Times (Nov 4, 2011)
This might work for you (GNU sed):
sed -i '1~4s/$/ >>/' file
There's a couple more:
$ awk 'NR%5==1 && sub(/$/,">>>") || 1 ' foo
$ awk '$0=$0(NR%5==1?">>>":"")' foo
Here is a non-numeric way in Awk. This works if we have an Awk that supports the RS variable being more than one character long. We break the data into records based on the blank line separation: "\n\n". Inside these records, we break fields on newlines. Thus $1 is the word, $2 is the definition, $3 is the quote and $4 is the source:
awk 'BEGIN {OFS=FS="\n";ORS=RS="\n\n"} $1=$1" >>"'
We use the same output separators as input separators. Our only pattern/action step is then to edit $1 so that it has >> on it. The default action is { print }, which is what we want: print each record. So we can omit it.
Shorter: Initialize RS from catenation of FS.
awk 'BEGIN {OFS=FS="\n";ORS=RS=FS FS} $1=$1" >>"'
This is nicely expressive: it says that the format uses two consecutive field separators to separate records.
What if we use a flag, initially reset, which is reset on every blank line? This solution still doesn't depend on a hard-coded number, just the blank line separation. The rule fires on the first line, because C evaluates to zero, and then after every blank line, because we reset C to zero:
awk 'C++?1:$0=$0" >>";!NF{C=0}'
Shorter version of accepted Awk solution:
awk '(NR-1)%5?1:$0=$0" >>"'
We can use a ternary conditional expression cond ? then : else as a pattern, leaving the action empty so that it defaults to {print}, which of course means {print $0}. If the zero-based record number is not congruent to 0, modulo 5, then we produce 1 to trigger the print action. Otherwise we evaluate $0=$0" >>" to add the required suffix to the record. The result of this expression is also a Boolean true, which triggers the print action.
Shave off one more character: we don't have to subtract 1 from NR and then test for congruence to zero. Basically whenever the 1-based record number is congruent to 1, modulo 5, then we want to add the >> suffix:
awk 'NR%5==1?$0=$0" >>":1'
Though we have to add ==1 (+3 chars), we win because we can drop two parentheses and -1 (-4 chars).
We can do better (with some assumptions): instead of editing $0, we can create a second field containing >> by assigning to the field $2. The implicit print action will print this, offset by a space:
awk 'NR%5==1?$2=">>":1'
But this only works when the definition line contains one word. If any of the words in this dictionary are compound nouns (separated by space, not hyphenated), this fails. If we try to repair this flaw, we are sadly brought back to the same length:
awk 'NR%5==1?$++NF=">>":1'
Slight variation on the approach: Instead of trying to tack >> onto the record or last field, why don't we conditionally install >>\n as ORS, the output record separator?
awk 'ORS=(NR%5==1?" >>\n":"\n")'
Not the tersest, but worth mentioning. It shows how we can dynamically play with some of these variables from record to record.
Different way for testing NR == 1 (mod 5): namely, regexp!
awk 'NR~/[16]$/?$0=$0" >>":1'
Again, not tersest, but seems worth mentioning. We can treat NR as a string representing the integer as decimal digits. If it ends with 1 or 6 then it is congruent to 1, mod 5. Obviously, not easy to modify to other moduli, not to mention computationally disgusting.
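A quick toy check of which record numbers that regexp selects:
$ seq 1 20 | awk 'NR~/[16]$/{print NR}'
1
6
11
16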

Grep lines from a file in batches according to a format

I have a file with contents as:
Hi welcome
! Chunk Start
Line 1
Line2
! Chunk Start
Line 1
Line 2
Line 3
! Chunk Start
Line 1
Line 2
Line 3
Line 1
Line 2
Line 3
Line 4
Line 5
Line 1
Line 2
Line 3
Line 4
Now, everything beginning with "! Chunk Start" and before the next "! Chunk Start" is a chunk, i.e. the lines between the "! Chunk Start" markers make a chunk. I need to get the contents of each chunk on a single line, i.e.:
Line 1 Line2
Line 1 Line 2 Line 3
Line 1 Line 2 Line 3 Line 1 Line 2 Line 3 Line 4 Line 5 Line 1 Line 2 Line 3 Line 4
I have done this, but I think there should be a better way. The way I have done this is:
grep -A100 "! Chunk Start" file.txt
The rest of the logic is there to concatenate the lines. But this -A100 is what I am worried about: if there are more than 100 lines in a chunk, this will fail.
I probably need to do this with awk/sed. Please suggest.
You can use GNU AWK (gawk). It has a GNU extension for a powerful regexp form of the record separator RS to divide the input by ! Chunk Start. Each line of your "chunks" can then be processed as a field. Standard AWK has a limit on the number of fields (99 or something?), but gawk supports up to MAX_LONG fields. This large number of fields should solve your worry about 100+ input lines per chunk.
$ gawk 'BEGIN{RS="! Chunk Start\n";FS="\n"}NR>1{$1=$1;print}' infile.txt
AWK (and GNU AWK) works by dividing input into records, then dividing each record into fields. Here, we are dividing records (record separator RS) based on the string ! Chunk Start and then dividing each record into fields (field separator FS) based on a newline \n. You can also specify a custom output record separator ORS and a custom output field separator OFS, but in this case what we want happens to be the defaults (ORS="\n" and OFS=" ").
When dividing into records, the part before the first ! Chunk Start will be considered a record. We ignore this using NR>1. I have interpreted your problem specification
everything beginning with "! Chunk Start" and before the next "! Chunk Start" is a chunk
to mean that once ! Chunk Start has been seen, everything else until the end of input belongs in at least some chunk.
The mysterious $1=$1 forces gawk to reprocess the input line $0, which parses it using the input format (FS), consuming the newlines. The print prints this reprocessed line using the output format (OFS and ORS).
Edit: The version above prints spaces at the end of each line. Thanks to @EdMorton for pointing out that the default field separator FS separates on whitespace (including newlines), so FS should be left unmodified:
$ gawk 'BEGIN{RS="! Chunk Start\n"}NR>1{$1=$1;print}' infile.txt
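To see the effect of $1=$1 in isolation, here is a toy sketch (unrelated to the chunk file) in which a three-line record is rebuilt onto one line:
$ printf 'a\nb\nc\n' | gawk 'BEGIN{RS=""} {$1=$1; print}'
a b c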
This might work for you (GNU sed):
sed '0,/^! Chunk Start/d;:a;$!N;/! Chunk Start/!s/\n/ /;ta;P;d' file
Delete up to and including the first line containing ! Chunk Start. Gather up lines, replacing each newline with a space. When the next match is found, print the first line, delete the pattern space and repeat.
Good grief. Just use awk:
$ awk -v RS='! Chunk Start' '{$1=$1}NR>1' file
Line 1 Line2
Line 1 Line 2 Line 3
Line 1 Line 2 Line 3 Line 1 Line 2 Line 3 Line 4 Line 5 Line 1 Line 2 Line 3 Line 4
The above uses GNU awk for multi-char RS.

Remove line break every nth line using sed

Is there a way to use sed to remove/substitute a pattern in a file for every 3n+1 and 3n+2 line?
For example, turn
Line 1n/
Line 2n/
Line 3n/
Line 4n/
Line 5n/
Line 6n/
Line 7n/
...
To
Line 1 Line 2 Line 3n/
Line 4 Line 5 Line 6n/
...
I know this can probably be handled by awk. But what about sed?
Well, I'd just use awk for that [1] since this is a little more complex but, if you're really intent on using sed, the following command will combine groups of three lines into a single line (which appears to be what you're after based on the title and text, despite the strange use of n/ for newline):
sed '$!N;$!N;s/\n/ /g'
See the following transcript for how to test this:
$ printf 'Line 1\nLine 2\nLine 3\nLine 4\nLine 5\n' | sed '$!N;$!N;s/\n/ /g'
Line 1 Line 2 Line 3
Line 4 Line 5
The sub-commands are as follows:
$!N will append the next line to the pattern space, but only if you're not on the last line (you do this twice to get three lines). Each line in the pattern space is separated by a newline character.
s/\n/ /g replaces all the newlines in the pattern space with a space character, effectively combining the three lines into one.
[1] With something like:
awk '{if(NR%3==1){s="";if(NR>1){print ""}};printf s"%s", $0;s=" "}'
This is complicated by the likelihood you don't want an extraneous space at the end of each line, necessitating the introduction of the s variable.
Since the sed variant is smaller (and less complex once you understand it), you're probably better off sticking with it. Well, at least up to the point where you want to combine groups of 17 lines, or do something else more complex than sed was meant to handle :-)
The example is about merging 3 consecutive lines, although the description says something different. To generate the example output, you can use the awk idiom
awk 'ORS=NR%3?FS:RS' <(seq 1 9)
1 2 3
4 5 6
7 8 9
In your case the record separator needs to be defined up front to include the literal n/ endings:
awk -v RS="n/\\n" 'ORS=NR%3?FS:RS'
OK, the following are ways to deal with it generally using awk and sed.
awk:
awk 'NR % 3 { sub(/pattern/, substitution) } { print }' file | paste -d' ' - - -
sed:
sed '{s/pattern/substitution/p; n;s/pattern/substitution/p; n;p}' file | paste -d' ' - - -
Both of them replace pattern with substitution on the 3n+1 and 3n+2 lines and keep the 3n lines untouched.
paste - - - is the shell idiom to fold stdout into groups of 3.
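For example, a toy illustration of the paste idiom on its own (independent of the substitution step):
$ seq 1 9 | paste -d' ' - - -
1 2 3
4 5 6
7 8 9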

Parse a file under linux

I'm trying to compute some news article popularity based on twitter data. However, while retrieving the tweets I forgot to escape the characters ending up with an unusable file.
Here is a line from the file:
1369283975$,$337427565662830592$,$0$,$username$,$Average U.S. 401(k) balance tops $80$,$000$,$ up 75 pct since 2009 http://t.co/etHHMUFpoo #news$,$http://www.reuters.com/article/2013/05/23/funds-fidelity-401k-idUSL2N0E31ZC20130523?feedType=RSS&feedName=marketsNews
The '$,$' pattern occurs not only as a field delimiter but also in the tweet, from where I want to remove it.
A correct line would be:
1369283975$,$337427565662830592$,$0$,$username$,$Average U.S. 401(k) balance tops $80000 up 75 pct since 2009 http://t.co/etHHMUFpoo #news$,$http://www.reuters.com/article/2013/05/23/funds-fidelity-401k-idUSL2N0E31ZC20130523?feedType=RSS&feedName=marketsNews
I tried to use cut and sed but I'm not getting the results I want. What would be a good strategy to solve this?
If we can assume that there are never extra separators in the time, id, retweets, username, and link fields, then you could take the middle part and remove all $,$ from it, for example like this:
perl -ne 'chomp; @a=split(/\$,\$/); $_ = join("", @a[4..($#a-1)]); print join("\$,\$", @a[0..3], $_, $a[$#a]), "\n"' < data.txt
What this does:
splits the line using $,$ as the delimiter
takes the middle part = fields[4] .. fields[N-2] (0-indexed; everything between the username and the final link)
joins the first 4 fields, the fixed middle part, and the last field (the link) back together with $,$
This works with your example, but I don't know what other corner cases you might have.
A good way to validate the result is to check that every line now splits into exactly 6 fields (i.e. contains exactly 5 $,$ separators). You can do that by piping the result to this:
... | perl -ne 'print scalar split(/\$,\$/), "\n"' | sort -u
(should output a single line, with "6")
