Parse a file under linux - linux

I'm trying to compute some news article popularity based on Twitter data. However, while retrieving the tweets I forgot to escape the characters, ending up with an unusable file.
Here is a line from the file:
1369283975$,$337427565662830592$,$0$,$username$,$Average U.S. 401(k) balance tops $80$,$000$,$ up 75 pct since 2009 http://t.co/etHHMUFpoo #news$,$http://www.reuters.com/article/2013/05/23/funds-fidelity-401k-idUSL2N0E31ZC20130523?feedType=RSS&feedName=marketsNews
The '$,$' pattern occurs not only as a field delimiter but also in the tweet, from which I want to remove it.
A correct line would be:
1369283975$,$337427565662830592$,$0$,$username$,$Average U.S. 401(k) balance tops $80000 up 75 pct since 2009 http://t.co/etHHMUFpoo #news$,$http://www.reuters.com/article/2013/05/23/funds-fidelity-401k-idUSL2N0E31ZC20130523?feedType=RSS&feedName=marketsNews
I tried to use cut and sed but I'm not getting the results I want. What would be a good strategy to solve this?

If we can assume that there are never extra separators in the time, id, retweets, username, and link fields, then you could take the middle part and remove all $,$ from it, for example like this:
perl -ne 'chomp; @a=split(/\$,\$/); $_ = join("", @a[4..($#a-1)]); print join("\$,\$", @a[0..3], $_, $a[$#a]), "\n"' < data.txt
What this does:
splits the line using $,$ as delimiter
takes the middle part = fields[4] .. fields[N-1]
joins again by $,$ the first 4 fields, the fixed middle part, and the last field (the link)
This works with your example, but I don't know what other corner cases you might have.
A good way to validate the result is to check that every line splits into exactly 6 fields (i.e. contains 5 $,$ delimiters). You can do that by piping the result to this:
... | perl -ne 'print scalar split(/\$,\$/), "\n"' | sort -u
(should output a single line with "6", the number of fields)
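For comparison, here is a rough awk sketch of the same idea (untested beyond the sample line; it carries over the assumption that only the tweet field can contain the delimiter, and data.txt is the input file):
awk -F'[$],[$]' 'BEGIN{OFS="$,$"} {tweet=$5; for(i=6;i<NF;i++) tweet=tweet $i; print $1, $2, $3, $4, tweet, $NF}' data.txt
It splits on the literal $,$ delimiter, glues any extra tweet pieces back onto field 5, and reprints the six expected fields.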

Related

AWK - string containing required fields

I thought it would be easy to define a string such as "1 2 3" and use it within AWK (GAWK) to extract the required fields; how wrong I have been.
I have tried creating AWK arrays, BASH arrays, splitting, string substitution etc, but could not find any method to use the resulting 'chunks' (ie the column/field numbers) in a print statement.
I believe Akshay Hegde has provided an excellent solution with the get_cols function, here
but it was over 8 years ago, and I am really struggling to work out 'how it works', namely, what this is doing:
s = length(s) ? s OFS $(C[i]) : $(C[i])
I am unable to post a comment asking for clarification due to my lack of reputation (and it is an old post).
Is someone able to explain how the solution works?
NB I don't think I need the sub as I am using the following to clean up (replace all non-numeric characters with a comma, i.e. a separator, and sort numerically):
Columns=$(echo $Input_string | sed 's/[^0-9]\+/,/g')
Columns=$(echo $Columns | xargs -n1 | sort -n | xargs)
(using this string, the awk would be executed as awk -v cols=$Columns -f test.awk infile in the given solution)
Given the informative answer from @Ed Morton, with a nice worked example, I have attempted to remove the need for a function (and also an additional awk program file). The intention is to have this within a shell script, and I would rather it be self-contained, but this is also further investigation into 'how it works'.
Fields="1 2 3"
echo $Fields | awk -F "," '{n=split($0,Column," "); for(i=1;i<=n;i++) s = length(s) ? s OFS $(Column[i]) : $(Column[i])}END{print "s="s " arr1="Column[1]" arr2="Column[2]" arr3="Column[3]}'
The results have surprised me (taking note of my Comment to Ed)
s=1 2 3 arr1=1 arr2=2 arr3=3
The above clearly shows the split has worked into the array, but I thought s would include $ for each ternary operator concatenation, ie "$1 $2 $3"
Moreover, I was hoping to append the actual file to the above command, since I have found that I can use echo $string | awk '{program}' file.name
NB it is a little insulting that my question has been marked as -1 indicating little research effort, as I have spent days trying to work this out.
Taking all the information above, I think s results in "1 2 3", but print doesn't accept this in the same way as it does when called from a function; it simply tries to 'print 1 2 3' in relation to the file, which seems to be how all my efforts have ended up.
This really confuses me, as Ed's 'diagonal' example works from the command line, indicating that the concept of 'print s' is absolutely fine when used with a file name input.
Can anyone suggest how this (example below) can work?
I don't know if using an echo pipe and appending the file name is strictly allowed, but it appears to work (?!?!?!)
(failed result)
echo $Fields | awk -F "," '{n=split($0,Column," "); for(i=1;i<=n;i++) s = length(s) ? s OFS $(Column[i]) : $(Column[i])}END{print s}' myfile.txt
This appears to go through myfile.txt and output all lines containing many comma-separated values, i.e. the whole file (I haven't included the values; this is just for illustration)
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
what this is doing; s = length(s) ? s OFS $(C[i]) : $(C[i])
You have encountered a ternary operator; it has the following syntax:
condition ? valueiftrue : valueiffalse
The length function, when given a single argument, returns the number of characters. In GNU AWK the integer 0 is considered false and other integers are considered true, so in this case it is an "is not empty" check. When s is not empty (it might also not be initialized yet, in which case GNU AWK assumes an empty string), it is concatenated with the output field separator (OFS, a space by default) and the value of the C[i]-th field, and the result is assigned to s; when s is empty, just the value of the C[i]-th field is assigned. Used multiple times, this builds a string of values separated by OFS. Consider the following simple example: say you want to get the diagonal of a 2D matrix, stored in file.txt with the following content
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
then you might do
awk '{s = length(s) ? s OFS $(NR) : $(NR)}END{print s}' file.txt
which will get output
1 7 13 19 25
Explanation: NR is the row number, so for the 1st row $(NR) is the 1st field, for the 2nd row it is the 2nd field, for the 3rd row the 3rd field, and so on
(tested in GNU Awk 5.0.1)
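To tie this back to the original goal of a self-contained command (no separate awk program file), a minimal sketch might look like the following; Fields and infile are placeholder names, and s is reset on every record so the selected columns are printed per line rather than accumulated across the whole file:
Fields="1 3 5"
awk -v cols="$Fields" 'BEGIN{n=split(cols,C," ")} {s=""; for(i=1;i<=n;i++) s = length(s) ? s OFS $(C[i]) : $(C[i]); print s}' infile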

Extracting text from a txt file

I have a txt file with records on it. The records follow this pattern:
six lines, a blank line, six lines, ... like this example:
string line 1
string line 2
string line 3
string line 4
string line 5 (year format yyyy)
string line 6 (can use several lines)
<blank line> (always a blank line when a new txt block begins)
string line 1
string line 2
string line 3
string line 4
string line 5 (year format yyyy)
string line 6
Here is a proper example: I need the title (line 2) and year (line 5)
Hualong Yu, Geoffrey I. Webb,
Adaptive online extreme learning machine by regulating forgetting factor by concept drift map,
Neurocomputing,
Volume 343,
2019,
Pages 141-153,
ISSN 0925-2312,
https://doi.org/10.1016/j.neucom.2018.11.098.
https://www.sciencedirect.com/science/article/pii/S0925231219301572

Antonino Feitosa Neto, Anne M.P. Canuto,
EOCD: An ensemble optimization approach for concept drift applications,
Information Sciences,
Volume 561,
2021,
Pages 81-100,
ISSN 0020-0255,
https://doi.org/10.1016/j.ins.2021.01.051.
https://www.sciencedirect.com/science/article/pii/S002002552100089X
I want to extract the string in line 2 and the year in line 5 of all blocks of text (separated by blank lines), and save them to another txt file as this output:
string line2 , yyyy
I don't have experience with the Linux shell, so I am here asking for some input to help me do this task.
Thanks
If you don't care about the trailing commas, just do:
awk '{print $2, $5}' RS= FS='\\n' input > output
This assumes that the blank line separating the records is indeed completely blank and does not contain any whitespace. If there is any whitespace in that line, you'll want to pre-filter the data to remove it.
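For instance, one possible pre-filter (just a sketch; it assumes a sed with POSIX character classes and that input/output are your file names) is to blank out whitespace-only lines before awk sees them:
sed 's/^[[:blank:]]*$//' input | awk 'BEGIN{RS=""; FS="\n"} {print $2, $5}' > output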
eg:
$ cat input
Hualong Yu, Geoffrey I. Webb,
Adaptive online extreme learning machine by regulating forgetting factor by concept drift map,
Neurocomputing,
Volume 343,
2019,
Pages 141-153,
ISSN 0925-2312,
https://doi.org/10.1016/j.neucom.2018.11.098.
https://www.sciencedirect.com/science/article/pii/S0925231219301572

Antonino Feitosa Neto, Anne M.P. Canuto,
EOCD: An ensemble optimization approach for concept drift applications,
Information Sciences,
Volume 561,
2021,
Pages 81-100,
ISSN 0020-0255,
https://doi.org/10.1016/j.ins.2021.01.051.
https://www.sciencedirect.com/science/article/pii/S002002552100089X
$ awk '{print $2, $5}' RS= FS='\\n' input
Adaptive online extreme learning machine by regulating forgetting factor by concept drift map, 2019,
EOCD: An ensemble optimization approach for concept drift applications, 2021,
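If the trailing commas are unwanted, a small variation (a sketch, not tested beyond the sample above) can strip them before printing and separate the two values with a comma, as in the requested output:
awk 'BEGIN{RS=""; FS="\n"; OFS=", "} {sub(/,$/,"",$2); sub(/,$/,"",$5); print $2, $5}' input > output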
Something like:
perl -00 -nE 'my @ln = (split /,\n/)[1,4]; say join(",", @ln)' input.txt > output.txt
should work as at least a starting point. Reads a paragraph at a time, splits up into lines, and prints the two you're looking for on the same line separated by a comma.

Remove redundant strings without looping

Is there a way to remove both duplicates and redundant substrings from a list, using shell tools? By "redundant", I mean a string that is contained within another string, so "foo" is redundant with "foobar" and "barfoo".
For example, take this list:
abcd
abc
abd
abcd
bcd
and return:
abcd
abd
uniq, sort -u and awk '!seen[$0]++' remove duplicates effectively but not redundant strings:
How to delete duplicate lines in a file without sorting it in Unix?
Remove duplicate lines without sorting
I can loop through each line recursively with grep but this is quite slow for large files. (I have about 10^8 lines to process.)
There's an approach using a loop in Python here: Remove redundant strings based on partial strings and Bash here: How to check if a string contains a substring in Bash, but I'm trying to avoid loops. Edit: I mean nested loops here, thanks for the clarification @shellter
Is there a way to use awk's match() function with an array index? This approach builds the array progressively so it never has to search the whole file, and should therefore be faster for large files. Or am I missing some other simple solution?
An ideal solution would allow matching of a specified column, as for the methods above.
EDIT
Both of the answers below work, thanks very much for the help. Currently testing for performance on a real dataset, will update with results and accept an answer. I tested both approaches on the same input file, which has 430,000 lines, of which 417,000 are non-redundant. For reference, my original looped grep approach took 7h30m with this file.
Update:
James Brown's original solution took 3h15m and Ed Morton's took 8h59m. On a smaller dataset, James's updated version was 7m versus the original's 20m. Thank you both, this is really helpful.
The data I'm working with are around 110 characters per string, with typically hundreds of thousands of lines per file. The way in which these strings (which are antibody protein sequences) are created can lead to characters from one or both ends of the string getting lost. Hence, "bcd" is likely to be a fragment of "abcde".
An awk script that, on the first run, extracts and stores all strings and their substrings in two arrays, strs and subs, and checks them on the second run:
$ awk '
NR==FNR { # first run
if(($0 in strs)||($0 in subs)) # process only unseen strings
next
len=length()-1 # initial substring length
strs[$0] # hash the complete strings
while(len>=1) {
for(i=1;i+len-1<=length();i++) { # get all substrings of current len
asub=substr($0,i,len) # sub was already reserved :(
if(asub in strs) # if substring is in strs
delete strs[asub] # we do not want it there
subs[asub] # hash all substrings too
}
len--
}
next
}
($0 in strs)&&++strs[$0]==1' file file
Output:
abcd
abd
I tested the script with about 30 M records of 1-20 char ACGT strings. The script ran 3m27s and used about 20 % of my 16 GBs. Using strings of length 1-100 I OOM'd in a few mins (tried it again with about 400k records of length 50-100; it used about 200 GBs and ran about an hour). (20 M records of 1-30 chars ran 7m10s and used 80 % of the mem)
So if your data records are short or you have unlimited memory, my solution is fast but in the opposite case it's going to crash running out of memory.
Edit:
Another version that tries to preserve memory. On the first run it checks the min and max lengths of strings and on the second run won't store substrings shorter than the global min. For about 400 k records of length 50-100 it used around 40 GBs and ran 7 mins. My random data didn't have any redundancy so input==output. It did remove redundancy with other datasets (2 M records of 1-20 char strings):
$ awk '
BEGIN {
while((getline < ARGV[1])>0) # 1st run, check min and max lengths
if(length()<min||min=="") # TODO: test for length()>0, too
min=length()
else if(length()>max||max=="")
max=length()
# print min,max > "/dev/stderr" # debug
close(ARGV[1])
while((getline < ARGV[1])>0) { # 2nd run, hash strings and substrings
# if(++nr%10000==0) # debug
# print nr > "/dev/stderr" # debug
if(($0 in strs)||($0 in subs))
continue
len=length()-1
strs[$0]
while(len>=min) {
for(i=1;i+len-1<=length();i++) {
asub=substr($0,i,len)
if(asub in strs)
delete strs[asub]
subs[asub]
}
len--
}
}
close(ARGV[1])
while((getline < ARGV[1])>0) # 3rd run, output
if(($0 in strs)&&!strs[$0]++)
print
}' file
$ awk '{print length($0), $0}' file |
sort -k1,1rn -k2 -u |
awk '!index(str,$2){str = str FS $2; print $2}'
abcd
abd
The above assumes the set of unique values will fit in memory.
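The question also asks for matching on a specified column. One possible adaptation of the same idea (a sketch, untested; it assumes whitespace-delimited input, the key in column 2, and that whole lines should be kept) prefixes each line with the key length, sorts longest-key-first, and then keeps a line only if its key is not already contained in a previously kept key:
awk -v c=2 '{print length($c), $0}' file |
sort -k1,1rn -k2 -u |
awk -v c=2 '!index(str, $(c+1)){str = str FS $(c+1); sub(/^[0-9]+ /, ""); print}'
The c+1 accounts for the length prefix added in the first step, and the final sub() removes that prefix again before printing.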
EDIT
This won't work. Sorry.
@Ed's solution is the best idea I can imagine without some explicit looping, and even that is implicitly scanning over the near-entire growing history of data on every record. It has to.
Can your existing resources hold that whole column in memory, plus a delimiter per record? If not, then you're going to be stuck with either very complex optimization algorithms, or VERY slow redundant searches.
Original post left for reference in case it gives someone else an inspiration.
That's a lot of data.
Given the input file as-is,
while read next
do [[ "$last" == "$next" ]] && continue # throw out repeats
[[ "$last" =~ $next ]] && continue # throw out sustrings
[[ "$next" =~ $last ]] && { last="$next"; continue; } # upgrade if last a substring of next
echo $last # distinct string
last="$next" # set new key
done < file
yields
abcd
abd
With a file of that size I wouldn't trust that sort order, though. Sorting is going to be very slow and take a lot of resources, but will give you more trustworthy results. If you can sort the file once and use that output as the input file, great. If not, replace that last line with done < <( sort -u file ) or something to that effect.
Reworking this logic in awk will be faster.
$: sort -u file | awk '1==NR{last=$0} last~$0{next} $0~last{last=$0;next} {print last;last=$0}'
Aside from the sort this uses trivial memory and should be very fast and efficient, for some value of "fast" on a file with 10^8 lines.

How can I append any string at the end of line and keep doing it after specific number of lines?

I want to add a symbol " >>" at the end of the 1st line, then the 5th line, and so on: 1, 5, 9, 13, 17, .... I was searching the web and went through the article below, but I'm unable to achieve it. Please help.
How can I append text below the specific number of lines in sed?
retentive
good at remembering
The child was very sharp, and her memory was extremely retentive.
— Rowlands, Effie Adelaide

unconscionable
greatly exceeding bounds of reason or moderation
For generations in the New York City public schools, this has become the norm with devastating consequences rooted in unconscionable levels of student failure.
— New York Times (Nov 4, 2011)
Output should be like-
retentive >>
good at remembering
The child was very sharp, and her memory was extremely retentive.
— Rowlands, Effie Adelaide

unconscionable >>
greatly exceeding bounds of reason or moderation
For generations in the New York City public schools, this has become the norm with devastating consequences rooted in unconscionable levels of student failure.
— New York Times (Nov 4, 2011)
You can do it with awk:
awk '{if ((NR-1) % 5) {print $0} else {print $0 " >>"}}'
We check if line number minus 1 is a multiple of 5 and if it is we output the line followed by a >>, otherwise, we just output the line.
Note: The above code outputs the suffix every 5 lines, because that's what is needed for your example to work.
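Usage might look like this (input.txt and output.txt are placeholder names):
awk '{if ((NR-1) % 5) {print $0} else {print $0 " >>"}}' input.txt > output.txt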
You can do it multiple ways. sed is kind of odd when it comes to selecting lines but it's doable. E.g.:
sed:
sed -i -e 's/$/ >>/;n;n;n;n' file
You can do it also as perl one-liner:
perl -pi.bak -e 's/(.*)/$1 >>/ if not (( $. - 1 ) % 5)' file
You're thinking about this wrong. You should append to the end of the first line of every paragraph, don't worry about how many lines there happen to be in any given paragraph. That's just:
$ awk -v RS= -v ORS='\n\n' '{sub(/\n/," >>&")}1' file
retentive >>
good at remembering
The child was very sharp, and her memory was extremely retentive.
— Rowlands, Effie Adelaide

unconscionable >>
greatly exceeding bounds of reason or moderation
For generations in the New York City public schools, this has become the norm with devastating consequences rooted in unconscionable levels of student failure.
— New York Times (Nov 4, 2011)
This might work for you (GNU sed):
sed -i '1~4s/$/ >>/' file
There's a couple more:
$ awk 'NR%5==1 && sub(/$/,">>>") || 1 ' foo
$ awk '$0=$0(NR%5==1?">>>":"")' foo
Here is a non-numeric way in Awk. This works if we have an Awk that supports the RS variable being more than one character long. We break the data into records based on the blank line separation: "\n\n". Inside these records, we break fields on newlines. Thus $1 is the word, $2 is the definition, $3 is the quote and $4 is the source:
awk 'BEGIN {OFS=FS="\n";ORS=RS="\n\n"} $1=$1" >>"'
We use the same output separators as input separators. Our only pattern/action step is then to edit $1 so that it has >> on it. The default action is { print }, which is what we want: print each record. So we can omit it.
Shorter: Initialize RS from catenation of FS.
awk 'BEGIN {OFS=FS="\n";ORS=RS=FS FS} $1=$1" >>"'
This is nicely expressive: it says that the format uses two consecutive field separators to separate records.
What if we use a flag, initially reset, which is reset on every blank line? This solution still doesn't depend on a hard-coded number, just the blank line separation. The rule fires on the first line, because C evaluates to zero, and then after every blank line, because we reset C to zero:
awk 'C++?1:$0=$0" >>";!NF{C=0}'
Shorter version of accepted Awk solution:
awk '(NR-1)%5?1:$0=$0" >>"'
We can use a ternary conditional expression cond ? then : else as a pattern, leaving the action empty so that it defaults to {print}, which of course means {print $0}. If the zero-based record number is not congruent to 0, modulo 5, then we produce 1 to trigger the print action. Otherwise we evaluate $0=$0" >>" to add the required suffix to the record. The result of this expression is also a Boolean true, which triggers the print action.
Shave off one more character: we don't have to subtract 1 from NR and then test for congruence to zero. Basically whenever the 1-based record number is congruent to 1, modulo 5, then we want to add the >> suffix:
awk 'NR%5==1?$0=$0" >>":1'
Though we have to add ==1 (+3 chars), we win because we can drop two parentheses and -1 (-4 chars).
We can do better (with some assumptions): Instead of editing $0, what we can do is create a second field which contains >> by assigning to the parameter $2. The implicit print action will print this, offset by a space:
awk 'NR%5==1?$2=">>":1'
But this only works when the definition line contains one word. If any of the words in this dictionary are compound nouns (separated by space, not hyphenated), this fails. If we try to repair this flaw, we are sadly brought back to the same length:
awk 'NR%5==1?$++NF=">>":1'
Slight variation on the approach: Instead of trying to tack >> onto the record or last field, why don't we conditionally install >>\n as ORS, the output record separator?
awk 'ORS=(NR%5==1?" >>\n":"\n")'
Not the tersest, but worth mentioning. It shows how we can dynamically play with some of these variables from record to record.
Different way for testing NR == 1 (mod 5): namely, regexp!
awk 'NR~/[16]$/?$0=$0" >>":1'
Again, not tersest, but seems worth mentioning. We can treat NR as a string representing the integer as decimal digits. If it ends with 1 or 6 then it is congruent to 1, mod 5. Obviously, not easy to modify to other moduli, not to mention computationally disgusting.

replace every nth occurrence of a pattern using awk [duplicate]

This question already has answers here:
Printing with sed or awk a line following a matching pattern
(9 answers)
Closed 6 years ago.
I'm trying to replace every nth occurrence of a string in a text file.
background:
I have a huge bibtex file (called in.bib) containing hundreds of entries beginning with "@". But every entry has a different amount of lines. I want to write a string (e.g. "@") right before every (let's say) 6th occurrence of "@" so, in a second step, I can use csplit to split the huge file at "@" into files containing 5 entries each.
The problem is to find and replace every fifth "@".
Since I need it repeatedly, the suggested answer in printing with sed or awk a line following a matching pattern won't do the job. Again, I am not looking for just one matching place but for many.
What I have so far:
awk '/^@/ && v++%5 {sub(/^@/, "\n@\n@")} {print > "out.bib"}' in.bib
replaces the 2nd through 5th occurrences (and no more).
(btw, I found and adopted this solution here: "Sed replace every nth occurrence". Initially, it was meant to replace every second occurrence--which it does.)
And, second:
awk -v p="#" -v n="5" '$0~p{i++}i==n{sub(/^#/, "\n#\n#")}{print > "out.bib"}' in.bib
replaces exactly the 5th occurance and nothing else.
(adopted solution from here: "Display only the n'th match of grep"
What I need (and not able to write) is imho a loop. Would a for loop do the job? Something like:
for (i = 1; i <= 200; i * 5)
<find "#"> and <replace with "\n#\n#">
then print
The material I have looks like this:
@article{karamanic_jedno_2007,
title = {Jedno Kosova, Dva Srbije},
journal = {Ulaznica: Journal for Culture, Art and Social Issues},
author = {Karamanic, Slobodan},
year = {2007}
}
@inproceedings{blome_eigene_2008,
title = {Das Eigene, das Andere und ihre Vermischung. Zur Rolle von Sexualität und Reproduktion im Rassendiskurs des 19. Jahrhunderts},
comment = {Rest of lines snippet off here for usability -- as in following entries. All original entries may have a different amount of lines.}
}
@book{doring_inter-agency_2008,
title = {Inter-agency coordination in United Nations peacebuilding}
}
@book{reckwitz_subjekt_2008,
address = {Bielefeld},
title = {Subjekt}
}
What I want is every sixth entry looking like this:
@
@book{reckwitz_subjekt_2008,
address = {Bielefeld},
title = {Subjekt}
}
Thanks for your help.
Your code is almost right; I modified it.
To replace every nth occurrence, you need a modulo expression.
So, for better understanding with brackets, you need an expression like ((i % n) == 0)
awk -v p="@" -v n="5" ' $0~p { i++ } ((i%n)==0) { sub(/^@/, "\n@\n@") }{ print }' in.bib > out.bib
you can do the splitting in awk easily in one step.
awk -v RS='@' 'NR==1{next} (NR-1)%5==1{c++} {print RT $0 > FILENAME"."c}' file
will create file.1, file.2, etc with 5 records each, where the record is defined by the delimiter @.
Instead of doing this in multiple steps with multiple tools, just do something like:
awk '/@/ && (++v%5)==1{out="out"++c} {print > out}' file
Untested since you didn't provide any sample input/output.
If you don't have GNU awk and your input file is huge you'll need to add a close(out) right before the out=... to avoid having too many files open simultaneously.
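A sketch of that variant (same idea, with a guard so nothing is closed before the first output file is opened; still untested for the same reason):
awk '/@/ && (++v%5)==1{if (out) close(out); out="out"++c} {print > out}' file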
