Finding all information between keywords - linux

I want to search for a given word and retrieve all the surrounding lines between a pair of keywords.
I have the following data:
NEW:
this is stackoverflow
this is a ghi/enlightening website
NEW:
put returns between paragraphs
indent code by 4 spaces
NEW:
here is this
most productive website
this is abc/enlightening/def
Now I want to retrieve all the information between two NEW markers wherever that block contains the word "enlightening". That is, for the example input above I want the following output:
OUTPUT:
NEW:
this is stackoverflow
this is a ghi/enlightening website
NEW:
here is this
most productive website
this is abc/enlightening/def
I know that grep allows me to search for a word, but it retrieves only a fixed number of context lines, e.g. 5 (specified by the user), above and below the match. But how do I find all the information between occurrences of a delimiting keyword in Linux ("NEW" in this case)? I specify the delimiting keyword as "NEW" and call the information between any two NEW markers a paragraph. So here my first paragraph is:
this is stackoverflow
this is a ghi/enlightening website
my second paragraph is:
put returns between paragraphs
indent code by 4 spaces
and so on.
Now I want all those paragraphs which have the keyword "enlightening" in them, i.e. the output shown above.

The following AWK command should work (for mawk anyway -- POSIX doesn't seem to allow RS to be an arbitrary string):
awk -vRS='NEW:\n' -vORS= '/enlightening/ { print RS $0 }' data
Explanation:
-vFOO=BAR is a variable assignment.
Setting RS (the Record Separator) to NEW:\n makes records be delimited by NEW:\n rather than by newlines, i.e. one record per paragraph instead of per line.
Setting ORS to the empty string removes redundant blank lines after records on output. (Another option is to set it to NEW:\n, if having NEW:\n appear after the record is okay.)
/enlightening/ { print RS $0 } prints the record separator followed by the entire matching record ($0) for each record that contains "enlightening".
If having the separator appear after records is okay, then the command can be simplified to the following:
awk -vRS='NEW:\n' -vORS='NEW:\n' '/enlightening/' data
The default action when no action is specified is to print the record.
For strict POSIX compliance, appending lines to a temporary buffer while between two NEW:s and only printing that buffer if the search term was seen (could use a flag) should work, though it's more complicated.
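A minimal POSIX-portable sketch of that buffering approach (assuming, as in the sample data, that each NEW: marker sits on its own line):
awk '
    /^NEW:$/ { if (found) printf "%s", buf; buf = ""; found = 0 }
    { buf = buf $0 "\n" }
    /enlightening/ { found = 1 }
    END { if (found) printf "%s", buf }
' data
Each paragraph is accumulated in buf (marker line included) and flushed only when the flag says the search term appeared in it.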

Related

Matching emails from second column in one file against another file

I have two files, one with emails in it (useremail.txt), and another with email:phonenumber pairs (emailnumber.txt).
useremail.txt contains:
John smith:blabla@hotmail.com
David smith:haha@gmail.com
emailnumber.txt contains:
blabla@hotmail.com:093748594
So the solution needs to grab the email from the second column of useremail.txt, search through emailnumber.txt for matches, and output John smith:093748594, i.e. just the name and phone number.
I'm on Windows, so I need a gawk or grep solution. I have spent a long time trying to get this to work with awk/grep and can't find the right solution; any help would be really appreciated.
Another in (GNU) awk:
$ awk '
BEGIN {
    # RS=ORS="\r\n"       # since you are using GNU awk, this replaces the sub() below
    FS=OFS=":"            # input and output field separators
}
NR==FNR {                 # processing the first file
    sub(/\r$/,"",$NF)     # remove the \r after the email OR uncomment RS above
    a[$2]=$1              # hash the name, indexed on the email
    next                  # on to the next record
}
($1 in a) {               # if an email in the second file matches one in the hash
    print a[$1],$2        # output. If ORS is uncommented above, output ends in \r;
                          # if not, you may want to add it to the print ... "\r"
}' useremail emailnumber
Output:
John smith:093748594
Since you tried the accepted answer in both Linux and Windows and you use GNU awk, in the future you could set RS="\r?\n", which accepts both forms, \r\n and bare \n. However, I've recently run into a problem with that form in a specific condition (for which I've not yet filed a bug report).
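A sketch of that more forgiving form (same logic as above, relying on GNU awk treating a multi-character RS as a regex, so the optional \r is consumed as part of the separator):
awk 'BEGIN { RS="\r?\n"; FS=OFS=":" } NR==FNR { a[$2]=$1; next } ($1 in a) { print a[$1], $2 }' useremail emailnumber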
You could try this:
awk -F":" '(FNR==NR){a[$2]=$1}(FNR!=NR){print a[$1]":"$2}' useremail.txt emailnumber.txt
If there are entries in emailnumber.txt with no matching entry in useremail.txt:
awk -F":" '(FNR==NR){a[$2]=$1}(FNR!=NR){if(a[$1]){print a[$1]":"$2}}' useremail.txt emailnumber.txt

IBM Domino xpage - parse iCalendar summary with new lines manually/ical4j

So far I have been parsing the NotesCalendarEntry ics manually and overwriting certain properties, and it worked fine. Today I stumbled upon a problem: when the appointment has a long summary name, the summary gets split into multiple lines, and my parsing goes wrong; it replaces only the part up to the first line break, and the old remainder is still there.
Here's how I do this "parsing":
NotesCalendarEntry calEntry = cal.getEntryByUNID(apptuid);
String iCalE = calEntry.read();
StringBuilder sb = new StringBuilder(iCalE);
int startIndex = iCalE.indexOf("BEGIN:VEVENT"); // care only about the VEVENT part
int tmpIndex = sb.indexOf("SUMMARY:") + 8;
int lineBreakIndex = sb.indexOf(Character.toString('\n'), tmpIndex);
if (sb.charAt(lineBreakIndex - 1) == '\r') // take \r\n into account if it exists
    lineBreakIndex--;
sb.delete(tmpIndex, lineBreakIndex); // delete the old content
sb.insert(tmpIndex, subject); // put in my new content
It works when line breaks are where they are supposed to be, but with a long summary name, line breaks are inserted into the summary itself (not literal \r\n characters, but real line breaks).
I split the iCalE string by \r\n and got this (only a part obviously):
SEQUENCE:6
ATTENDEE;ROLE=CHAIR;PARTSTAT=ACCEPTED;CN="test/Test";RSVP=FALSE:
 mailto:test@test.test
ATTENDEE;CUTYPE=ROOM;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED
;CN="Room 2/Test";RSVP=TRUE:mailto:room2#test.test
CLASS:PUBLIC
DESCRIPTION:Test description\n
SUMMARY:Very long name asdjkasjdklsjlasdjlasjljraoisjroiasjroiasjoriasoiruasoiruoai Mee
 ting long name
LOCATION:Room 2/Test
ORGANIZER;CN="test/Test":mailto:test@test.test
Each line is one array element from iCalE.split("\\r\\n"). As you can see, the SUMMARY field got split into two lines, and a space was added after the line break.
Now I have no idea how to parse this correctly. I thought about finding the index of the next : instead of the next line break, and then finding the first line break before that : character, but that wouldn't work if the summary also contained a : after the injected line break, and it also wouldn't work on fields like ORGANIZER;CN=, which use ; rather than :.
I tried importing external ical4j jar into my xpage to overcome this problem, and while everything is recognized in Domino Designer it resulted in lots of NoClassDefFound exceptions after trying to reach my xpage service, despite the jars being in the build path and all.
java.lang.NoClassDefFoundError: net.fortuna.ical4j.data.CalendarBuilder
How can I safely parse this manually, or how can I properly import the ical4j jar into my xpage? I just want to modify 3 fields: DTSTART, DTEND and SUMMARY; with the dates I have had no problems so far. Fields like DESCRIPTION use a literal \n string to mark new lines, and it should be the same in the other fields...
Update
So I have read more about iCalendar, and it turns out there is a standard mechanism for this called line folding: long lines are broken with a CRLF line ending followed by a space. I made a while loop that keeps joining lines until the last line break is not followed by a space, and it works great so far. I will use this unless there's a better solution (ical4j is one, but I can't get it working with Domino).
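For reference, a minimal unfolding sketch (not the exact loop described above; note that RFC 5545 also allows a tab, not just a space, after the fold):
// Unfold iCalendar line folds before any index-based parsing: a CRLF
// followed by a space or tab is a continuation, so drop the break and
// the single whitespace character that follows it.
String unfolded = iCalE.replaceAll("\r\n[ \t]", "");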

Performing a sort using k1,1 only

Assume you have an unsorted file with the following content:
identifier,count=Number
identifier, extra information
identifier, extra information
...
I want to sort this file so that, for each id, the line with the count is written first, followed by the lines with extra info. I can only use the Unix sort command with the option -k1,1, but I am allowed to slightly change the lines to get this sort.
As an example, take
a,Count=1
a,giulio
aa,Count=44
aa,tango
aa,information
ee,Count=2
bb,que
f,Count=3
b,Count=23
bax,game
f,ee
c,Count=3
c,roma
b,italy
bax,Count=332
a,atlanta
bb,Count=78
c,Count=3
The output should be
a,Count=1
a,atlanta
a,giulio
aa,Count=44
aa,information
aa,tango
b,Count=23
b,italy
bax,Count=332
bax,game
bb,Count=78
bb,que
c,Count=3
c,roma
ee,Count=2
f,Count=3
f,ee
but I get:
aa,Count=44
aa,information
aa,tango
a,atlanta
a,Count=1
a,giulio
bax,Count=332
bax,game
bb,Count=78
bb,que
b,Count=23
b,italy
c,Count=3
c,Count=3
c,roma
ee,Count=2
f,Count=3
f,ee
I tried adding spaces at the end of the identifier and/or at the beginning of the count field and other characters, but none of these approaches work.
Any pointer on how to perform this sorting?
EDIT:
If you consider, for example, the products with an id starting with a, one of them has the info 'atlanta' and it appears before the Count line (but I want Count to appear before any other information). In addition, bb should come after b in the alphabetical order of the ids. To make my question clearer: how can I get the ids sorted in alphabetical order, and such that for a given id the line with Count appears before the others, using sort -k1,1? (This is a group project I am working on and I am not free to change the sorting command.) I may change the content, though; I tried for example adding a '~' to all the infos so that Count sorts first.
You need to tell sort that the comma is used as the field separator:
sort -t, -k1,1
For ASCII sorting, make sure LC_ALL=C is set and that LANG and LANGUAGE are unset.
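For example, a sketch of the full invocation (assuming the data is in file):
LC_ALL=C sort -t, -k1,1 file
In the C locale, when sort breaks first-field ties with its last-resort whole-line comparison, the uppercase C of Count (byte 0x43) sorts before any lowercase info text, which is what puts each Count line first within its id; likewise b sorts before bax, and bax before bb, by plain byte comparison of the first field.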

Splitting A File On Delimiter

I have a file on a Linux system that is roughly 10GB. It contains 20,000,000 binary records, but each record is separated by an ASCII delimiter "$". I would like to use the split command or some combination thereof to chunk the file into smaller parts. Ideally I would be able to specify that the command should split every 1,000 records (therefore every 1,000 delimiters) into separate files. Can anyone help with this?
The only unorthodox part of the problem seems to be the record separator. I'm sure this is fixable in awk pretty simply - but I happen to hate awk.
I would move it into the realm of 'normal' problems first:
tr '$' '\n' < large_records.txt | split -l 1000
This will by default create files named xaa, xab, xac, and so on; look at man split for more options.
I love awk :)
BEGIN { RS="$"; chunk=1; count=0; size=1000 }
{
    print $0 > ("/tmp/chunk" chunk)   # parentheses keep the concatenated
                                      # file name unambiguous across awks
    if (++count >= size) {
        chunk++
        count = 0
    }
}
(note that the redirection operator in awk only truncates/creates the file on its first invocation - subsequent references are treated as append operations - unlike shell redirection)
Note that by default split uses two-character suffixes, so implementations that don't lengthen suffixes automatically can exhaust them after 676 output files; use the -a option to allow longer suffixes. More info: https://www.gnu.org/software/coreutils/manual/html_node/split-invocation.html
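For this file that matters: 20,000,000 records at 1,000 per chunk means 20,000 output files, which doesn't fit in the default two-character (676) or even a three-character (17,576) suffix space. A sketch combining this with the tr answer (the -a 4 is only needed on split implementations that don't grow suffixes on their own):
tr '$' '\n' < large_records.txt | split -l 1000 -a 4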

Using Awk to process a file where each record has different fixed-width fields

I have some data files from a legacy system that I would like to process using Awk. Each file consists of a list of records. There are several different record types and each record type has a different set of fixed-width fields (there is no field separator character). The first two characters of the record indicate the type, from this you then know which fields should follow. A file might look something like this:
AAField1Field2LongerField3
BBField4Field5Field6VeryVeryLongField7Field8
CCField99
Using Gawk I can set the FIELDWIDTHS, but that applies to the whole file (unless I am missing some way of setting this on a record-by-record basis), or I can set FS to "" and process the file one character at a time, but that's a bit cumbersome.
Is there a good way to extract the fields from such a file using Awk?
Edit: Yes, I could use Perl (or something else). I'm still keen to know whether there is a sensible way of doing it with Awk though.
Hopefully this will lead you in the right direction. Assuming your multi-line records are guaranteed to be terminated by a 'CC' type row, you can pre-process your text file using simple if-then logic. I have presumed you require fields 1, 5 and 7 on one row; a sample awk script would be:
BEGIN {
    field1 = ""
    field5 = ""
    field7 = ""
}
{
    record_type = substr($0, 1, 2)
    if (record_type == "AA") {
        field1 = substr($0, 3, 6)
    }
    else if (record_type == "BB") {
        field5 = substr($0, 9, 6)
        field7 = substr($0, 21, 18)
    }
    else if (record_type == "CC") {
        print field1 "|" field5 "|" field7
    }
}
Create an awk script file called program.awk, pop that code into it, and execute the script using:
awk -f program.awk < my_multi_line_file.txt
Maybe you can use two passes:
1step.awk
/^AA/{printf "2 6 6 12" }
/^BB/{printf "2 6 6 6 18 6"}
/^CC/{printf "2 8" }
{printf "\n%s\n", $0}
2step.awk
NR%2 == 1 {FIELDWIDTHS=$0}
NR%2 == 0 {print $2}
And then
awk -f 1step.awk sample | awk -f 2step.awk
You probably need to suppress (or at least ignore) awk's built-in field separation code, and use a program along the lines of:
awk '/^AA/ { manually process record AA out of $0 }
/^BB/ { manually process record BB out of $0 }
/^CC/ { manually process record CC out of $0 }' file ...
The manual processing will be a bit fiddly - I suppose you'll need to use the substr function to extract each field by position, so what I've got as one line per record type will be more like one line per field in each record type, plus the follow-on printing.
I do think you might be better off with Perl and its unpack feature, but awk can handle it too, albeit verbosely.
Could you use Perl and then select an unpack template based on the first two chars of the line?
Better to use some fully featured scripting language like Perl or Ruby.
What about two scripts? E.g. the first script inserts field separators based on the first characters, and then the second one processes it?
Or, first of all, define a function in your awk script which splits the lines into variables based on the input. I would go this way, for the sake of possible re-use; a sketch of that approach follows.
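A minimal sketch of such a reusable splitter, driven by a per-type width table (the widths here are inferred from the sample data and the earlier answers, so treat them as assumptions):
BEGIN {
    # assumed field widths per record type, taken from the sample data
    widths["AA"] = "2 6 6 12"
    widths["BB"] = "2 6 6 6 18 6"
    widths["CC"] = "2 8"
}
# Split line into field[1..n] according to the width list for its record type.
function split_fixed(line, field,    n, w, i, pos) {
    n = split(widths[substr(line, 1, 2)], w, " ")
    pos = 1
    for (i = 1; i <= n; i++) {
        field[i] = substr(line, pos, w[i])
        pos += w[i]
    }
    return n
}
{ split_fixed($0, f); print f[2] }   # e.g. print the first data field of every record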
