Multilevel parsing using shell command - string

I have a file in the following format
/////
name 1
start_occurrence:
occurrence 1
occurrence 2
///
name 2
start_occurrence:
occurrence 1
occurrence 2
///
name 3
start_occurrence:
occurrence 1
occurrence 2
occurrence 3
All I need is to count the number of occurrences for each name and save the counts in a CSV file. Can I do it using a combination of shell commands? Yes, I could do it programmatically, but I am looking for a set of shell commands chained in a pipeline.
The names can be anything; they do not follow a pattern. The only catch is that the line after /// is the name. Also, an occurrence does not necessarily have a number with it: any line that starts with or contains "occurrence" is of interest.

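awk 'c=="THISISNAME"{b=$0;c=""} $1~/^\/\/\/+$/{c="THISISNAME"} $0~/\<occurrence\>/{a[b]+=1} END{for (i in a){print i","a[i]}}' YOUR_FILE_HERE
Explanation:
If a line matches the name-start condition ($1~/^\/\/\/+$/, a line of three or more slashes, which also covers the ///// at the top of the sample), set c to THISISNAME.
If this is the name line (c=="THISISNAME"), store the line in b and clear c (c="") to mark that the name has been read.
If the line matches the occurrence condition ($0~/\<occurrence\>/), increment a[b].
The map a records the number of occurrences for each name; the END block prints each name and its count separated by a comma. The order of for (i in a) is unspecified.
awk uses EREs; $0~/ERE/ means that $0 matches the regex. \< and \> are GNU awk word-boundary anchors, similar to \b in PCRE.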
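For awk implementations without the GNU \< and \> anchors, the same idea can be sketched in portable awk. This is only a sketch under stated assumptions: the temporary file stands in for the real input, only lines that start with "occurrence" are counted (so start_occurrence: is skipped), and names are printed in order of first appearance. The variable names (expect_name, order, count) are illustrative.

```shell
# Sample input reproduced from the question, written to a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
/////
name 1
start_occurrence:
occurrence 1
occurrence 2
///
name 2
start_occurrence:
occurrence 1
occurrence 2
///
name 3
start_occurrence:
occurrence 1
occurrence 2
occurrence 3
EOF

counts=$(awk '
    # A line made only of three or more slashes announces a name line.
    /^\/\/\/+$/ { expect_name = 1; next }
    expect_name {
        name = $0
        # Remember the order in which names first appear.
        if (!(name in count)) { order[++n] = name; count[name] = 0 }
        expect_name = 0
        next
    }
    # Assumption: an occurrence line starts with the word "occurrence".
    /^occurrence/ { count[name]++ }
    END { for (i = 1; i <= n; i++) print order[i] "," count[order[i]] }
' "$input")

printf '%s\n' "$counts"
rm -f "$input"
```

Redirecting the printf output to a file gives the CSV. Names that themselves contain commas would need quoting, which this sketch does not attempt.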

Related

Find if the first 10 digits of two columns on csv file are matched in bash

I have a file which contains two columns (names.csv), values are separated by comma
,
a123456789-anything,a123456789-anything
b123456789-anything,b123456789-anything
c123456789-anything,c123456789-anything
d123456789-anything,d123456789-anything
e123456789-anything,e123456789-anything
e123456777-anything,e123456999-anything
Each value starts with a 10-character unique identifier, followed by some extra junk (-anything).
I want to see whether the identifier prefixes in the two columns match.
To inspect the values in the first and second columns I use:
cat /home/names.csv | parallel --colsep ',' echo column 1 = {1} column 2 = {2}
This prints the values. Because the values are hex digits, it is cumbersome to verify them one by one by eye. Is there a way to check whether the first 10 characters of each column pair match exactly? They might contain special characters!
Expected output (example, but anything that says the columns are matched or not can work):
Matches (including first line):
,
a123456789-anything,a123456789-anything
b123456789-anything,b123456789-anything
c123456789-anything,c123456789-anything
d123456789-anything,d123456789-anything
e123456789-anything,e123456789-anything
Non-matches
e123456777-anything,e123456999-anything
Here's one way using awk. It prints every line where the first 10 characters of the first two fields match.
% cat /tmp/names.csv
,
a123456789-anything,a123456789-anything
b123456789-anything,b123456789-anything
c123456789-anything,c123456789-anything
d123456789-anything,d123456789-anything
e123456789-anything,e123456789-anything
e123456777-anything,e123456999-anything
% awk -F, 'substr($1,1,10)==substr($2,1,10)' /tmp/names.csv
,
a123456789-anything,a123456789-anything
b123456789-anything,b123456789-anything
c123456789-anything,c123456789-anything
d123456789-anything,d123456789-anything
e123456789-anything,e123456789-anything
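If both the matching and the non-matching lines should appear in one report, the same comparison can label each line instead of filtering. A minimal sketch, assuming a shortened copy of the sample data in a temporary file; the MATCH and DIFF labels are arbitrary choices, not part of the original answer.

```shell
# Shortened sample data in a temporary file.
csv=$(mktemp)
cat > "$csv" <<'EOF'
,
a123456789-anything,a123456789-anything
e123456777-anything,e123456999-anything
EOF

report=$(awk -F, '{
    # substr returns strings, so this is always a string comparison.
    status = (substr($1, 1, 10) == substr($2, 1, 10)) ? "MATCH" : "DIFF"
    print status ": " $0
}' "$csv")

printf '%s\n' "$report"
rm -f "$csv"
```

Piping the report through grep '^DIFF' would list only the pairs that need attention.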

How do I remove text using sed?

For instance, let's say I have a text file:
worker1, 0001, company1
worker2, 0002, company2
worker3, 0003, company3
How would I use sed to keep only the first 2 characters of the first column ("wo"), drop the rest of that column, and attach them to the second column, so that the output looks like this:
wo0001,company1
wo0002,company2
wo0003,company3
$ sed -E 's/^(..)[^,]*, ([^,]*,) /\1\2/' file
wo0001,company1
wo0002,company2
wo0003,company3
s/ begins the substitution
^(..) matches the first two characters at the beginning of the line, captured in a group
[^,]* matches any number of non-comma characters (the rest of the first column)
", " (a comma and a space) matches the separator after the first column
([^,]*,) matches the second field and its comma, captured in a group (any number of non-comma characters followed by a comma)
" " (a single space) matches the space after that comma
/\1\2/ replaces the match with the first and second capture groups
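For comparison, the same transformation can be sketched with awk by treating ", " (comma plus space) as the field separator. This is an alternative to the sed command above, not a replacement for it, and it assumes every separator in the file is exactly a comma followed by one space.

```shell
# Feed the sample rows to awk; -F', ' splits on comma + space.
result=$(printf '%s\n' \
    'worker1, 0001, company1' \
    'worker2, 0002, company2' \
    'worker3, 0003, company3' |
    awk -F', ' '{ print substr($1, 1, 2) $2 "," $3 }')

printf '%s\n' "$result"
```

Here substr($1, 1, 2) plays the role of the ^(..) capture group, and the fields are glued back together explicitly.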

Linux - How to remove certain lines from a files based on a field value

I want to remove certain lines from a tab-delimited file and write output to a new file.
a b c 2017-09-20
a b c 2017-09-19
es fda d 2017-09-20
es fda d 2017-09-19
The 4th column is a date. I want to keep only the lines whose 4th column is "2017-09-19" (lines 2 and 4) and write them to a new file. The new file should have the same format as the original file.
How do I write the Linux command for this example?
Note: the search must be restricted to the 4th field, because the real data has other fields that may contain the same value.
With awk:
awk 'BEGIN{OFS="\t"} $4=="2017-09-19"' file
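OFS is the output field separator (a space by default). It only takes effect when awk rebuilds a record after a field is modified; here each matching line is printed unchanged, so the tabs are preserved either way.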
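Because the file is tab-delimited, it may be safer to set the input separator explicitly, so that a field containing spaces is not split and the 4th field stays the 4th field. A minimal sketch, assuming sample rows like those in the question plus one made-up row where the date appears in the first field and the second field contains a space:

```shell
# Build a tab-delimited sample; printf reuses the format for each group of four arguments.
data=$(mktemp)
printf '%s\t%s\t%s\t%s\n' \
    a b c 2017-09-20 \
    a b c 2017-09-19 \
    es fda d 2017-09-20 \
    es fda d 2017-09-19 \
    2017-09-19 'x y' z 2017-09-21 > "$data"

# -F'\t' splits on tabs only; the comparison applies to the 4th field alone.
kept=$(awk -F'\t' '$4 == "2017-09-19"' "$data")

printf '%s\n' "$kept"
rm -f "$data"
```

The made-up last row is dropped even though it contains the date, because the date is not in its 4th field.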
Use grep to filter:
cat file.txt | grep '2017-09-19' > filtered_file.txt
This is not perfect, since the string 2017-09-19 is not required to appear in the 4th column, but if your file looks like the example, it'll work.
Sed solution:
sed -nr "/^([^\t]*\t){3}2017-09-19/p" input.txt >output.txt
where:
-n - don't print every line automatically
-r - use extended regular expressions
/regexp/p - print lines that match the regular expression regexp
^ - beginning of line
(regexp){3} - repeat regexp 3 times
[^\t] - any character except a tab
\t - tab character
* - repeat the preceding item zero or more times
2017-09-19 - the search text
That is, skip 3 tab-separated columns from the beginning of the line, and then check that column 4 starts with the required value.
awk '/2017-09-19/' file >newfile
cat newfile
a b c 2017-09-19
es fda d 2017-09-19

How to grep multiples strings within N lines

I was wondering if there is anyway that I could grep (or any other command) that will search multiple strings within N lines.
Example
Search for "orange", "lime", "banana" all within 3 lines
If the input file is
xxx
a lime
b orange
c banana
yyy
d lime
foo
e orange
f banana
I want to print the three lines starting with a, b, c.
The lines with the searched strings can appear in any order.
I do not want to print the lines d, e, f, as there is a line in between, and so the three strings are not grouped together.
Your question is rather unclear. Here is a simple Awk script which collects consecutive matching lines and prints them if there are at least three.
awk '/orange|lime|banana/ { a[++n] = $0; next }
{ if (n>=3) for (i=1; i<=n; i++) print a[i]; delete a; n=0 }
END { if (n>=3) for (i=1; i<=n; i++) print a[i] }' file
It's not clear whether you require all of your expressions to match; this one doesn't attempt to. If you see three successive lines with orange, that's a match, and will be printed.
The logic should be straightforward. The array a collects matches, with n indexing into it. When we see a non-match, we check its length, and print if it's 3 or more, then start over with an empty array and index. This is (clumsily) repeated at end of file as well, in case the file ends with a match.
If you want to permit gaps (so, if there are three successive lines where one matches "orange" and "banana", then one which doesn't match, then one which matches "lime", should those three lines be printed? Your question is unclear), you could change the script to always keep an array of the last three lines, though then you also need to specify how to deal with, e.g., a sequence of five lines which matches by these rules.
Similar to tripleee's answer, I would also use awk for this purpose.
The main idea is to implement a simple state machine.
Simple example
As a simple example, first try to find three consecutive lines of banana.
Consider the pattern-action statement
/banana/ { bananas++ }
For every line matching the regex banana, it increases the variable bananas (in awk, all variables are initialised with 0).
Of course, you want bananas to be reset to 0 when there is non-matching line, so your search starts from the beginning:
/banana/ { bananas++; next }
{ bananas = 0 }
You can also test the values of variables, either in a pattern or inside an action.
For example, if you want to print "Found" after three lines containing banana, extend the rule:
/banana/ {
    bananas++
    if (bananas >= 3) {
        print "Found"
        bananas = 0
    }
    next
}
This resets the variable bananas to 0, and prints the string "Found".
How to proceed further
Using this basic idea, you should be able to write your own awk script that handles all the cases.
First, you should familiarise yourself with awk (pattern, actions, program execution).
Then, extend and adapt my example to fit your needs.
In particular, you probably need an associative array matched, with indices "banana", "orange", "lime".
You set matched["banana"] = $0 when the current line matches /banana/. This saves the current line for later output.
You clear that whole array when the current line does not match any of your expressions.
When all strings are found (matched[s] is not empty for every string s), you can print the contents of matched[s].
I leave the actual implementation to you.
As others have said, your description leaves many corner-cases unclear.
You should figure them out for yourself and adapt your implementation accordingly.
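One possible reading of that outline can be sketched as follows. It rests on assumptions the question leaves open: each line contains at most one of the three words, a repeated word restarts the group from the current line, and the lines are printed in input order. The names clear_group, matched, lines and hit are illustrative.

```shell
# Sample input from the question, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
xxx
a lime
b orange
c banana
yyy
d lime
foo
e orange
f banana
EOF

groups=$(awk '
    function clear_group() { split("", matched); split("", lines); n = 0 }
    {
        hit = ""
        if      ($0 ~ /orange/) hit = "orange"
        else if ($0 ~ /lime/)   hit = "lime"
        else if ($0 ~ /banana/) hit = "banana"

        # A non-matching line or a repeated word ends the current group.
        if (hit == "" || (hit in matched)) clear_group()

        if (hit != "") {
            matched[hit] = 1
            lines[++n] = $0
            if (n == 3) {    # all three words seen on consecutive lines
                for (i = 1; i <= 3; i++) print lines[i]
                clear_group()
            }
        }
    }
' "$sample")

printf '%s\n' "$groups"
rm -f "$sample"
```

Since a group only grows when a new word is seen, reaching three saved lines means all three words appeared on consecutive lines.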
I think you want this:
awk '
/banana/ {banana=3}
/lime/ {lime=3}
/orange/ {orange=3}
(orange>0)&&(lime>0)&&(banana>0){print l2,l1,$0}
{orange--;lime--;banana--;l2=l1;l1=$0}' OFS='\n' yourFile
So, if you see the word banana you set banana=3 so it is valid for the next 3 lines. Likewise, if you see lime, give it 3 lines of chances to make a group, and similarly for orange.
Now, if all of orange, lime and banana have been seen in the previous three lines, print the second to last line (l2), the last line (l1) and the current line $0.
Then decrement the count for each fruit before moving to the next line, and shift the saved lines back by one: the previous line becomes l2 and the current line becomes l1.

Search for a pattern in Column in a CSV and replace another pattern in the same line using sed command

I want to check whether the second column of a CSV file starts with a given pattern and, if it does, replace something else on the same line.
I wrote the following sed command for the CSV below to change I to N if the pattern 676 exists in the second column. But it also finds 676 in the 7th and 9th columns, since ,676 appears there too. Ideally, only the second column should be checked for the prefix 676. All I want is to check whether the second column starts with 676 (not 676 in the middle or at the end of the value, e.g. 46769777) and then change ,I, to ,N,.
sed -i '/,676/ {; s/,I,/,N,/;}' temp.csc
6768880,55999777,S,I,TTTT,I,67677,yy
6768880,676999777,S,I,TTTT,I,67677,yy
6768880,46769777,S,I,TTTT,I,67677,yy
Expected result required
6768880,55999777,S,I,TTTT,I,67677,yy
6768880,676999777,S,N,TTTT,N,67677,yy
6768880,46769777,S,I,TTTT,I,67677,yy
If you are not bound by sed, awk might be a better option for you. Give this a try :
awk -F"," '{match($2,/^676/)&&gsub(",I",",N")}{print}' temp.csc
match tests whether the second column starts with (^) 676. If it does, gsub replaces every ",I" on the line with ",N".
Result:
6768880,55999777,S,I,TTTT,I,67677,yy
6768880,676999777,S,N,TTTT,N,67677,yy
6768880,46769777,S,I,TTTT,I,67677,yy
This requires that 676 appear at the beginning of the second column before any changes are made:
$ sed '/^[^,]*,676/ s/,I,/,N,/g' file
6768880,55999777,S,I,TTTT,I,67677,yy
6768880,676999777,S,N,TTTT,N,67677,yy
6768880,46769777,S,I,TTTT,I,67677,yy
Notes:
The regex /^[^,]*,676/ requires that 676 appear after the first appearance of a comma on the line. In more detail:
^ matches the beginning of the line
[^,]* matches the first column
,676 matches the first comma followed by 676
In your desired output, ,I, was replaced with ,N, every time it appeared on the line. To accomplish this, g (meaning global) was added to the substitute command.
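A field-oriented variant is also possible in awk: test the second field directly and rewrite only fields that are exactly I. This is a sketch rather than a drop-in replacement, and it assumes the file has no quoted fields containing commas.

```shell
# Sample rows from the question, written to a temporary file.
rows=$(mktemp)
cat > "$rows" <<'EOF'
6768880,55999777,S,I,TTTT,I,67677,yy
6768880,676999777,S,I,TTTT,I,67677,yy
6768880,46769777,S,I,TTTT,I,67677,yy
EOF

changed=$(awk 'BEGIN { FS = OFS = "," }
    $2 ~ /^676/ {
        # Only whole fields equal to I are changed, never substrings.
        for (i = 1; i <= NF; i++) if ($i == "I") $i = "N"
    }
    { print }' "$rows")

printf '%s\n' "$changed"
rm -f "$rows"
```

Setting OFS to a comma matters here because modifying a field makes awk rebuild the line with the output separator.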

Resources