removing text between pipe and comma - linux

I have an enormous file with text separated like this:
subtlechanges|NEW=19647490,subtle|NEW=19638255
and I want the text to look like this:
subtlechanges,subtle
I tried using \|.*$ but it removes everything after the first pipe. Any ideas? Thanks in advance.

If I understand you correctly, we have a file that may look like:
$ cat file
subtlechanges|NEW=19647490,subtle|NEW=19638255
And we want to remove everything from each pipe character up to the next comma. In that case:
$ sed 's/|[^,]*//g' file
subtlechanges,subtle
How it works
In sed, substitute commands look like s/old/new/g where old is a regular expression for what is removed, new is what gets substituted in, and the final g signifies that we want to do this not just once per line but as many times per line as we can.
The regular expression that we use for old here is |[^,]*. This matches a pipe, |, and any characters after up to, but not including, the first comma.
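To see why the bounded class matters, here is a minimal sketch that compares a greedy pattern with the one above. It assumes the sample line from this answer; the variable names are only illustrative.

```shell
line='subtlechanges|NEW=19647490,subtle|NEW=19638255'

# Greedy: .* runs to the end of the line, so everything after the first pipe goes.
greedy=$(printf '%s\n' "$line" | sed 's/|.*$//')

# Bounded: [^,]* stops at the next comma, and g repeats this for every pipe.
bounded=$(printf '%s\n' "$line" | sed 's/|[^,]*//g')

printf '%s\n' "$greedy" "$bounded"
```

The first result keeps only subtlechanges, while the second keeps both names separated by the comma.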

Another approach: using comma or pipe as the field separator, print every odd-numbered field (1st, 3rd, and so on).
awk -F '[,|]' '{
    sep = ""
    # i <= NF so that a trailing odd-numbered field is not dropped
    for (i = 1; i <= NF; i += 2) {
        printf "%s%s", sep, $i
        sep = ","
    }
    print ""
}' file
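It may help to look at how awk splits the line before deciding which fields to keep. The sketch below, which assumes the same sample line, numbers each field so the odd/even pattern is visible.

```shell
# Print every field with its index, using comma or pipe as the separator.
fields=$(printf '%s\n' 'subtlechanges|NEW=19647490,subtle|NEW=19638255' |
  awk -F '[,|]' '{ for (i = 1; i <= NF; i++) printf "%d=%s\n", i, $i }')

printf '%s\n' "$fields"
```

Fields 1 and 3 hold the names and fields 2 and 4 hold the NEW=... parts, which is why stepping by two works as long as every name is followed by a pipe.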

Linux remove whitespace first line

I have the file virt.txt, which contains:
0302 000000 23071SOCIETY 117
0602 000000000000000001 PAYMENT BANK
I want to remove the 3 whitespace characters in columns 6 to 8, on the first line only.
I tried:
sed '1s/[[:blank:]]+[[:blank:]]+[[:blank:]]//6' virt.txt
but it does not work.
Please help.
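Without -E, sed uses basic regular expressions, in which + is a literal plus sign, so your pattern cannot match a line that contains no plus signs. Even with -E, the regex would consume all the available blanks from a sequence of three or more (in a quite inefficient way), and the trailing 6 asks sed to replace only the sixth such match. Because your first input line does not contain six or more separate stretches of three or more whitespace characters, the command did nothing. But you can in fact use sed to do exactly what you say you want: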
sed '1s/^\(.....\)   /\1/' virt.txt
(or for convenience, if you have sed -E or the variant sed -r which works on some platforms, but neither of these is standard):
sed -E '1s/^(.{5}) {3}/\1/' virt.txt # -E is not portable
The parentheses capture the first five characters into a back reference, and we then use the first back reference \1 as the replacement string, effectively replacing only the text which matched outside the parentheses.
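As a small illustration of the back reference, the sketch below runs the substitution on two made-up lines. The exact spacing is an assumption (four spaces after the first four digits), since the layout of the real virt.txt is not preserved in the question as shown.

```shell
# Assumed layout: four digits, then four spaces (columns 5 to 8), then the rest.
result=$(printf '%s\n' '0302    000000 23071SOCIETY 117' '0602    000000' |
  sed '1s/^\(.....\)   /\1/')

printf '%s\n' "$result"
```

Only the first line changes: the five captured characters are put back by \1, and the three spaces matched outside the parentheses are dropped.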
If your sed supports the -i option, you can use that to modify the file directly; but this is also not standard, so the most portable solution is to write the result to a new file, then move it back on top of the original file if you want to replace it.
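A minimal sketch of the write-then-move approach follows. The scratch directory, the .new suffix and the sample contents are assumptions made so the example does not touch a real file.

```shell
workdir=$(mktemp -d)    # scratch directory, so nothing real is modified
printf '%s\n' '0302    000000' '0602    000000' > "$workdir/virt.txt"

# Write to a new file first; replace the original only if sed succeeded.
sed '1s/^\(.....\)   /\1/' "$workdir/virt.txt" > "$workdir/virt.txt.new" &&
  mv "$workdir/virt.txt.new" "$workdir/virt.txt"

edited=$(cat "$workdir/virt.txt")
printf '%s\n' "$edited"
rm -r "$workdir"
```

Redirecting straight back onto the input file would truncate it before sed reads it, which is why the intermediate file is needed.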
sed is convenient if you are familiar with it, but as you are clearly not, perhaps a better approach would be to use a different language, ideally one which is not write-only for many users, like sed.
If you know the three characters will always be spaces, just do a static replacement.
awk 'NR==1 { $0 = substr($0, 1, 5) substr($0, 9) } 1' virt.txt
On the first line (NR is the current input line number) replace the input line $0 with a catenation of the substrings on both sides of the part you want to cut.
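To make the substr arithmetic concrete, here is a tiny sketch on a made-up ten-letter string, where the character positions are easy to count.

```shell
# substr(s, 1, 5) keeps characters 1 to 5; substr(s, 9) keeps character 9 onwards,
# so characters 6 to 8 ("fgh") are cut out.
joined=$(awk 'BEGIN { s = "abcdefghij"; print substr(s, 1, 5) substr(s, 9) }')

printf '%s\n' "$joined"
```

This prints abcdeij.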
For a simple replacement like that, you can also use basic Unix text manipulation utilities, though it's rather inefficient and inelegant:
head -n 1 virt.txt | cut -c1-5,9- >newfile.txt
tail -n +2 virt.txt >>newfile.txt
If you need to check that the three characters are spaces, the Awk script only needs a minor tweak.
awk 'NR==1 && /^.{5} {3}/ { $0 = substr($0, 1, 5) substr($0, 9) } 1' virt.txt
You should vaguely recognize the regex from above. Awk is less succinct, but as a consequence also quite a lot more readable, than sed.

How to count number of lines with only 1 character?

I'm trying to print just the number of lines that contain only 1 character.
I have a file with 200k lines; some of the lines have only one character (any type of character).
Since I have no experience, I googled a lot, read the documentation, and came up with this solution pieced together from different sources:
awk -F^\w$ '{print NF-1}' myfile.log
I was expecting this pattern to filter the lines with a single character, and it seems to work:
^\w$
However, I'm not getting the number of lines containing a single character. Instead, a number is printed for every line.
If a non-awk solution is OK:
grep -c '^.$' myfile.log
You could try the following:
awk '/^.$/{++c}END{print c+0}' file
The variable c is incremented for every line containing only 1 character (any character).
When the whole file has been read, the END block prints the variable. Writing c+0 makes awk print 0 instead of an empty line when no line matched.
In awk, rules like your {print NF-1} are executed for each line. To print only one thing for the whole file you have to use END { print ... }. There you can print a counter which you increment each time you see a line with one character.
However, I'd use grep instead because it is easier to write and faster to execute:
grep -xc . yourFile
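The three commands can be checked against each other on a small made-up input. In this sketch the sample function and its six lines are assumptions standing in for the real log file.

```shell
# Six sample lines: three of them ("a", "c" and "1") are exactly one character long.
sample() { printf '%s\n' 'a' 'bb' '' 'c' '1' 'long line'; }

count_awk=$(sample | awk '/^.$/ { ++c } END { print c + 0 }')
count_grep=$(sample | grep -c '^.$')
count_grep_x=$(sample | grep -xc .)

printf '%s %s %s\n' "$count_awk" "$count_grep" "$count_grep_x"
```

All three report 3; the empty line and the longer lines are not counted.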

which bash command is good for extracting multiple pattern from text file?

I have a very big text file and just need to extract some specific patterns from it and save them in another .txt file.
Here is the format of my text file:
"1","Dbxref=Entrez%7CGene:5008779;ID=GSPATG00000003001;Name=GSPATG00000003001;Ontology_term=GO:0005488"
"2","Dbxref=Entrez%7CProtein:XP_001422966,EMBL:CAK55568,Uniprot:A0BAK1_PARTE,Entrez%7CProtein:124390026;Derived_from=GSPATT00000003001;ID=GSPATP00000003001;isoelectric_point=10.31;molecular_weight=55095.3;Name=GSPATP00000003001;Ontology_term=GO:0005488"
"3","Alias=PTMB.459;Dbxref=Entrez%7CGene:5008781,Entrez%7CNucleotide:CR548612;ID=GSPATG00000005001;Name=GSPATG00000005001;Ontology_term=GO:0004185,GO:0006508"
"4","Dbxref=Entrez%7CProtein:XP_001422968,Entrez%7CProtein:124390028,EMBL:CAK55570,Uniprot:Q6BFB1_PARTE;Derived_from=GSPATT00000005001;ID=GSPATP00000005001;isoelectric_point=6.41;molecular_weight=48434.5;Name=GSPATP00000005001;Ontology_term=GO:0004185,GO:0006508"
"5","Alias=PTMB.456;Dbxref=Entrez%7CNucleotide:CR548612,Entrez%7CGene:5008770;ID=GSPATG00000009001;Name=GSPATG00000009001;Ontology_term=GO:0004672,GO:0004674,GO:0004713,GO:0005524,GO:0006468"
"6","Dbxref=Entrez%7CProtein:XP_001422972,Entrez%7CProtein:124390032,EMBL:CAK55574,Uniprot:Q6BFB4_PARTE;Derived_from=GSPATT00000009001;ID=GSPATP00000009001;isoelectric_point=9.79;molecular_weight=73346.4;Name=GSPATP00000009001;Ontology_term=GO:0004672,GO:0004674,GO:0004713,GO:0005524,GO:0006468"
"7","Dbxref=Entrez%7CGene:5008748;ID=GSPATG00000010001;Name=GSPATG00000010001;Ontology_term=GO:0005515,GO:0007154,GO:0035091"
I just need to extract the following fields:
ID, Name and Ontology_term.
For example, the expected output for line 7 would be:
ID=GSPATG00000010001;Name=GSPATG00000010001;Ontology_term=GO:0005515,GO:0007154,GO:0035091"
How can I do this in a Linux terminal?
Through sed,
$ sed 's/.*;\(ID[^;]*\).*;\(Name[^;]*\).*;\(Ontology_term[^;]*\).*/\1;\2;\3/' file
ID=GSPATG00000003001;Name=GSPATG00000003001;Ontology_term=GO:0005488"
ID=GSPATP00000003001;Name=GSPATP00000003001;Ontology_term=GO:0005488"
ID=GSPATG00000005001;Name=GSPATG00000005001;Ontology_term=GO:0004185,GO:0006508"
ID=GSPATP00000005001;Name=GSPATP00000005001;Ontology_term=GO:0004185,GO:0006508"
ID=GSPATG00000009001;Name=GSPATG00000009001;Ontology_term=GO:0004672,GO:0004674,GO:0004713,GO:0005524,GO:0006468"
ID=GSPATP00000009001;Name=GSPATP00000009001;Ontology_term=GO:0004672,GO:0004674,GO:0004713,GO:0005524,GO:0006468"
ID=GSPATG00000010001;Name=GSPATG00000010001;Ontology_term=GO:0005515,GO:0007154,GO:0035091"
[^;]* matches any character other than a semicolon, zero or more times. In basic (non-extended) sed regular expressions, capturing groups are written as \(..\) and referred to in the replacement as \1, \2 and so on.
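The output above still ends with the closing double quote, because that character is not a semicolon. As a possible variation (not part of the original answer), the last class can exclude the quote as well. The sketch below tries it on line 7 of the sample input.

```shell
line='"7","Dbxref=Entrez%7CGene:5008748;ID=GSPATG00000010001;Name=GSPATG00000010001;Ontology_term=GO:0005515,GO:0007154,GO:0035091"'

# Same command as above, except the last class is [^;"]* so the closing quote is left out.
trimmed=$(printf '%s\n' "$line" |
  sed 's/.*;\(ID[^;]*\).*;\(Name[^;]*\).*;\(Ontology_term[^;"]*\).*/\1;\2;\3/')

printf '%s\n' "$trimmed"
```

The trailing .* in the pattern absorbs the quote, so only the three fields remain.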
Your input format is pesky in that it contains semicolon-separated fields inside a double-quoted comma-separated field. If we can be sure that the first field before the first semicolon is always uninteresting and that the last field should also always be discarded, we can cheat by simply splitting on semicolons and extract the fields we want.
awk -F ';' '{ sep = ""
  for (i=1; i<=NF; ++i) { sub(/"$/, "", $i)
    if ($i ~ /^(ID|Name|Ontology_term)=/) { printf "%s%s", sep, $i; sep = ";" } }
  printf "\n" }' file
If these assumptions do not always hold, maybe you can massage or preprocess the input so they do. In fact, I do this by trimming any final double-quote. Ultimately, parsing the input and translating it to a well-defined flat comma- or semicolon-separated format (or JSON if you have a lot of optional fields or nested structures) might be the most robust and fruitful solution.

remove a line with special character with pattern

I'm trying to remove lines with special characters that are not prefixed with \.
Below are the special characters:
^$%.*+?!(){}[]|\
I need to check the 2nd column for any of the above special characters that are not prefixed with \.
I'm trying to do this with awk, but no luck. I want the output as below.
input.txt
1,ap^ple
2,o$range
3,bu+tter
4,gr(ape
5,sm\(oke
6,ra\in
7,pla\\y
8,wor\+k
output.txt
1,ap^ple
2,o$range
3,bu+tter
4,gr(ape
6,ra\in
I believe you are simply looking for:
awk '$2 !~ /\\[][|\\{}()!?+*.%$^]/' FS=,
This gives the desired output on the given input file, but does not at all match the description given in the question.
EDIT
Given the discussion in the comment section, it appears that the desired solution should output all lines that contain a special character, unless that character is preceded by a backslash. Given that description, we must remove backslash from the list of special characters. A (non-working, given for the purpose of description) solution is:
awk '$2 ~ /[^\\][][|{}()!?+*.%$^]/' FS=,
This simply matches any two character string in which the first is not a backslash and the 2nd is one of the characters ][|{}()!?+*.%$^. This fails because it does not catch the case in which a special character occurs as the first element of the string. For that, we extend the regex so that the first character can be either the beginning of the string or anything that is not a backslash.
awk '$2 ~ /(^|[^\\])[][|{}()!?+*.%$^]/' FS=,
The reason we need to re-order the special characters is that ] has a special meaning inside brackets (namely, it closes the brackets!) and it must be listed first to avoid that meaning. Similarly, ^ must not be first because it has a special meaning when it is the first member of a character class (it negates the class). (The other characters don't matter; they just got reordered as a typographical accident.)
One part of the trick is to put the special characters into a character class safely, remembering that ], ^ and - (not present in your list) have special rules associated with them in character classes. Specifically, the ^ as first character negates the character class (so place it somewhere other than first), and the ] character terminates the character class unless it is either first or second after a ^.
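The placement rules can be tried in isolation. This sketch uses grep rather than awk, purely to keep the example short; the bracket-expression rules shown are the same, and the sample lines are made up.

```shell
# "]" placed first is literal; "^" placed anywhere but first is literal.
hits=$(printf '%s\n' 'a]b' 'a^b' 'abc' | grep -c '[]^]')

# With "^" first the class is negated: any character that is not "]".
negated=$(printf '%s\n' ']' 'x' | grep -c '[^]]')

printf '%s %s\n' "$hits" "$negated"
```

The first count is 2 (the lines containing ] or ^) and the second is 1 (only the line x contains a character other than ]).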
Hence, you want:
awk '/\\[]^$%.*+?!(){}[\\|]/ { next } { print }' input.txt
The complex (ghastly) regex matches a backslash followed by one of the special characters; the action is next to skip that line. The { print } (which could also be written 1 or any other true value) prints those lines which are not eliminated by the regex.
Example output
1,ap^ple
2,o$range
3,bu+tter
4,gr(ape
6,ra\in
You can refine the processing to ignore the first field and so on as in William Pursell's answer, which does the reordering of the characters in the list substantially the same way I did, but without explaining why.
awk -F, '$2 !~ /\\[]^$%.*+?!(){}[\\|]/ { print }' input.txt

Combine matching lines using sed or awk?

I have a file like the following:
1,
cake:01351
12,
bun:1063
scone:13581
biscuit:1931
14,
jelly:1385
I need to convert it so that when a number is read at the start of a line it is combined with the line beneath it, but if there is no number at the start the line is left as is. This would be the output that I need:
1,cake:01351
12,bun:1063
scone:13581
biscuit:1931
14,jelly:1385
I'm having a lot of trouble achieving this with sed; it seems it may not be the best tool for what I think should be quite simple.
Any suggestions greatly appreciated.
A very basic sed implementation:
sed -e '/^[0-9]/{N;s/\n//;}'
This relies on the first character on only the 'number' lines being a number (as you specified).
It matches lines starting with a number (^[0-9]), brings in the next line (N), and deletes the embedded newline (s/\n//).
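Here is the command run against the sample from the question, fed through printf so that the sketch is self-contained.

```shell
# Lines starting with a digit pull in the following line and lose the newline between them.
result=$(printf '%s\n' '1,' 'cake:01351' '12,' 'bun:1063' 'scone:13581' 'biscuit:1931' '14,' 'jelly:1385' |
  sed -e '/^[0-9]/{N;s/\n//;}')

printf '%s\n' "$result"
```

One caveat: if two number lines follow each other, the second is consumed by N and joined to the first instead of starting a new pair.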
This is a file on my intranet. I can't recall where I found the handy sed one-liner. You might find something if you search for 'sed one-liner'
Have you ever needed to combine lines of text, but found it too tedious to do by hand?
For example, imagine that we have a text file with hundreds of lines which look like this:
14/04/2003,10:27:47,0
IdVg,3.000,-1.000,0.050,0.006
GmMax,0.011,0.975,0.005
IdVg,3.000,-1.000,0.050,0.006
GmMax,0.011,0.975,0.005
14/04/2003,10:30:51,600
IdVg,3.000,-1.000,0.050,0.006
GmMax,0.011,0.975,0.005
IdVg,3.000,-1.000,0.050,0.006
GmMax,0.010,0.975,0.005
14/04/2003,10:34:02,600
IdVg,3.000,-1.000,0.050,0.006
GmMax,0.011,0.975,0.005
IdVg,3.000,-1.000,0.050,0.006
GmMax,0.010,0.975,0.005
Each date (14/04/2003) is the start of a data record, and it continues on the next four lines.
We would like to input this to Excel as a 'comma separated value' file, and see each record in its own row.
In our example, we need to append any line starting with a G or I to the preceding line, and insert a comma, so as to produce the following:
14/04/2003,10:27:47,0,IdVg,3.000,-1.000,0.050,0.006,GmMax,0.011,0.975,0.005,IdVg,3.000,...
14/04/2003,10:30:51,600,IdVg,3.000,-1.000,0.050,0.006,GmMax,0.011,0.975,0.005,IdVg,3.000,...
14/04/2003,10:34:02,600,IdVg,3.000,-1.000,0.050,0.006,GmMax,0.011,0.975,0.005,IdVg,3.000,...
This is a classic application of a 'regular expression' and, once again, sed comes to the rescue.
The editing can be done with a single sed command:
sed -e :a -e '$!N;s/\n\([GI]\)/,\1/;ta' -e 'P;D' filename >newfilename
I didn't say it would be obvious, or easy, did I?
This is the kind of command you write down somewhere for the rare occasions when you need it.
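For anyone who wants to see what the pieces do, here is the same command with each part annotated and run on a shortened, made-up version of the data (two records with fewer fields).

```shell
# :a       label marking the start of the loop
# $!N      unless this is the last line, append the next input line
# s/.../   if the appended line starts with G or I, turn the newline into a comma
# ta       if that substitution succeeded, jump back to :a and try the next line
# P;D      otherwise print and delete the finished record, keeping the new line
records=$(printf '%s\n' \
  '14/04/2003,10:27:47,0' 'IdVg,3.000' 'GmMax,0.011' \
  '14/04/2003,10:30:51,600' 'IdVg,3.000' |
  sed -e :a -e '$!N;s/\n\([GI]\)/,\1/;ta' -e 'P;D')

printf '%s\n' "$records"
```

Each date line comes out with its continuation lines attached by commas, one record per row.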
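Try a regular expression, such as:
sed '/^[0-9]\+,$/{N;s/\n//;}' file
That checks each line for a number followed by a comma, reads in the next line with N, then replaces the embedded newline with nothing, removing it. Note that \+ is a GNU extension; [0-9][0-9]* is the portable spelling.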
Another awk solution, less cryptic than some other answers:
awk '/^[0-9]/ {n = $0; getline; print n $0; next} 1'
$ awk 'ORS= /^[0-9]+,$/?" ":"\n"' file
1, cake:01351
12, bun:1063
scone:13581
biscuit:1931
14, jelly:1385
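The one-liner above uses the ORS assignment itself as the pattern, so the value assigned has to be a non-empty string for the line to be printed; that is why it joins with a space. A possible variation (an assumption, not part of the original answer) moves the assignment into the action so that ORS can be empty, which matches the requested output exactly.

```shell
# Set ORS per line: empty after a "number," line, newline otherwise, then print.
joined=$(printf '%s\n' '1,' 'cake:01351' '12,' 'bun:1063' 'scone:13581' |
  awk '{ ORS = /^[0-9]+,$/ ? "" : "\n"; print }')

printf '%s\n' "$joined"
```

This gives 1,cake:01351 and 12,bun:1063 with no space after the comma, and leaves scone:13581 on its own line.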
