How to delete lines from TXT or CSV with specific pattern - linux

I have a txt file formatted as shown below.
The aim is to remove the rows that begin with "Subtotal Group 1", "Subtotal Group 2", or "Grand Total" (these strings are always at the beginning of the line), but only when the remainder of the line is blank (or filled with spaces).
This should be achievable with awk or sed in a single pass, but I'm currently doing it in 3 separate steps (one for each string). A more generic syntax would be great. Thanks everybody.
My txt file looks like this:
Some Generic Headers at the beginning of the file
=======================================================================
Group 1
=======================================================================
6.00 500 First Line Text 1685.52
1.00 502 Second Line Text 280.98
530 Other Line text 157.32
_________________________________________________________________________
Subtotal Group 1
Subtotal Group 1
Subtotal Group 1
Subtotal Group 1 2123.82
Subtotal Group 1
Subtotal Group 1
========================================================================
GROUP 2
========================================================================
7.00 701 First Line Text 53.63
711 Second Line text 97.85
7.00 740 Third Line text 157.32
741 Any Line text 157.32
742 Any Line text 18.04
801 Last Line text 128.63
_______________________________________________________________________
Subtotal Group 2
Subtotal Group 2
Subtotal Group 2
Subtotal Group 2
Subtotal Group 2 612.79
Subtotal Group 2
_______________________________________________________________________
Grand total
Grand total
Grand total
Grand total
Grand total
Grand total
Grand total 1511.03
The output I'm trying to achieve is:
Some Generic Headers at the beginning of the file
=======================================================================
Group 1
=======================================================================
6.00 500 First Line Text 1685.52
1.00 502 Second Line Text 280.98
530 Other Line text 157.32
_______________________________________________________________________
Subtotal Group 1 2123.82
=======================================================================
GROUP 2
=======================================================================
7.00 701 First Line Text 53.63
711 Second Line text 97.85
7.00 740 Third Line text 157.32
741 Any Line text 157.32
742 Any Line text 18.04
801 Last Line text 128.63
_______________________________________________________________________
Subtotal Group 2 612.79
_______________________________________________________________________
Grand total 1511.03

That's a job grep was invented to do:
$ grep -Ev '^(Subtotal Group [0-9]+|Grand total)[[:blank:]]*$' file
Some Generic Headers at the beginning of the file
=======================================================================
Group 1
=======================================================================
6.00 500 First Line Text 1685.52
1.00 502 Second Line Text 280.98
530 Other Line text 157.32
_________________________________________________________________________
Subtotal Group 1 2123.82
========================================================================
GROUP 2
========================================================================
7.00 701 First Line Text 53.63
711 Second Line text 97.85
7.00 740 Third Line text 157.32
741 Any Line text 157.32
742 Any Line text 18.04
801 Last Line text 128.63
_______________________________________________________________________
Subtotal Group 2 612.79
_______________________________________________________________________
Grand total 1511.03
You can use the same regexp in awk or sed if you prefer:
awk '!/^(Subtotal Group [0-9]+|Grand total)[[:blank:]]*$/' file
sed -E '/^(Subtotal Group [0-9]+|Grand total)[[:blank:]]*$/d' file

If your good lines always end with a number and your Any Text lines don't, you could use:
sed -n '/^.*[0-9]$/p' file
Here -n suppresses automatic printing of the pattern space, and the p command outputs only the lines ending with [0-9]. Given your example file, the output is:
Subtotal 2123.82
Total 625.80
Any Word 9999.99

You can do:
grep -v -P "^(Subtotal Group \d+|Grand total)[,\s]*$" inputfile > outputfile
Edited as per comment.
Second Edit: adapted to new specs

The question isn't quite clear on whether the goal is to keep the total/subtotal lines or to remove them.
It's also not clear whether the "#*" comments are an actual part of the input file or merely descriptive.
Fortunately, both of these are minor details. This is fairly simple to do with perl:
$ perl -n -e 'print if /^(Subtotal|Grand Total),(,| |#.*)*/' inputfile
Subtotal,,, #This is unuseful --> To be removed
Subtotal,,, #This is unuseful --> To be removed
Subtotal,,,125.40 #This is a good line
Subtotal,,, #This is unuseful --> To be removed
Grand Total,,, #This is unuseful --> To be removed
Grand Total,,,125.40 #This is a good line
This assumes you want to keep the total and the subtotal lines, and remove all other lines.
To do it the other way around, to remove the total/subtotal lines, and keep the others, replace the if keyword with unless.
And if the comments aren't actually in the input file itself, the pattern only needs to be tweaked slightly:
perl -n -e 'print if /^(Subtotal|Grand Total),(,| )*/' inputfile
This also ignores any extra whitespace. If you want whitespace to be significant, this becomes:
perl -n -e 'print if /^(Subtotal|Grand Total),(,)*/' inputfile
Like I said, even though your question is not 100% clear, the unclear parts are just minor details. perl will easily handle every possibility.
As shown in the example, perl will print the edited inputfile on standard output. In order to replace inputfile with the edited contents, simply add the -i option to the command (before the -e option).
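For instance, with a .bak suffix so the original is kept as a backup (a sketch; the file name and contents below are made up for illustration):

```shell
# Illustrative input: one data row plus the total/subtotal rows.
printf '%s\n' 'item,1,foo,10.00' 'Subtotal,,,125.40' 'Grand Total,,,125.40' > report.csv

# Keep only the Subtotal/Grand Total lines, editing the file in place;
# -i.bak saves the original as report.csv.bak.
perl -i.bak -n -e 'print if /^(Subtotal|Grand Total),(,| )*/' report.csv
```

Afterwards report.csv holds only the two total lines, and report.csv.bak holds the untouched original.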

And an attempt at an awk solution ...
awk -F, '{for(i=2;i<=NF;i++){if($i~/[0-9.-]+/){print $0;next}}}' falzone
Subtotal,,,125.40
Grand Total,,,125.40
Any other text,,,9999.99
Or, looking at the non-csv version:
grep [0-9.-] falzone2
Subtotal 2123.82
Total 625.80
Any Word 9999.99

Related

How do I use grep to get numbers larger than 50 from a txt file

I am relatively new to grep and unix. I am trying to get the names of people who have won more than 50 races from a txt file. So far I have used cat file.txt | grep -E "[5-9][0-9]$", but this only gives me numbers from 50 to 99. How could I get it from 50 to 200? Thank you!!
driver
races
wins
Some_Man
90
160
Some_Man
10
80
the above is similar to the format of the data, although it is not tabulated.
Do you have to use grep? You could use awk like this:
awk '{if($[replace with the field number]>50)print $2}' < file.txt
This assumes your fields are delimited by spaces; otherwise you can use the -F flag to specify a delimiter.
If you must use grep, then it's a regular expression like the one you wrote. To cover 50 to 200 you can do:
cat file.txt | grep -E "(\b[5-9][0-9]|\b1[0-9][0-9]|\b200)$"
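A quick check of that range regex (the \b word boundary is a GNU grep extension; the sample lines below are made up, and the extra \b200 alternative makes the upper bound 200 itself match):

```shell
# Only lines ending in 50-200 should survive the filter.
printf '%s\n' 'Some_Man 160' 'Other_Man 49' 'Top_Man 200' |
  grep -E '(\b[5-9][0-9]|\b1[0-9][0-9]|\b200)$'
```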
Input:
Rank Country Driver Races Wins
1 [United_Kingdom] Lewis_Hamilton 264 94
2 [Germany] Sebastian_Vettel 254 53
3 [Spain] Fernando_Alonso 311 32
4 [Finland] Kimi_Raikkonen 326 21
5 [Germany] Nico_Rosberg 200 23
Awk would be a better candidate for this:
awk '$4>=50 && $4<=200 { print $0 }' file
This checks whether the fourth space-delimited field ($4; change this to whatever the actual field number is) is both greater than or equal to 50 and less than or equal to 200, and prints the line ($0) if the condition is met.
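Applied to the sample input above, if the wins are in field 5 (a sketch; races.txt is a stand-in file name):

```shell
# Two rows from the sample; field 5 is the wins column.
printf '%s\n' \
  '1 [United_Kingdom] Lewis_Hamilton 264 94' \
  '5 [Germany] Nico_Rosberg 200 23' > races.txt

# Print the driver name (field 3) when wins fall in the 50-200 range.
awk '$5>=50 && $5<=200 { print $3 }' races.txt
```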

Move all rows in a tsv with a certain date to their own file

I have a TSV file with 4 columns in this format
dog phil tall 2020-12-09 12:34:22
cat jill tall 2020-12-10 11:34:22
The 4th column is a date string Example : 2020-12-09 12:34:22
I want every row with the same date to go into its own file
For example,
file 20201209 should have all rows that start with 2020-12-09 in the 4th column
file 20201210 should have all rows that start with 2020-12-10 in the 4th column
Is there any way to do this through the terminal?
With GNU awk to allow potentially large numbers of concurrently open output files and gensub():
awk '{print > gensub(/-/,"","g",$(NF-1))}' file
With any awk:
awk '{out=$(NF-1); gsub(/-/,"",out); if (seen[out]++) print >> out; else print > out; close(out)}' file
There are ways to speed up either script by sorting the input first, if that's an issue.
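A sketch of the sort-first variant (assuming the date is the second-to-last whitespace-separated field, as above; rows.txt and its contents are illustrative): sorting makes all rows for one date adjacent, so each output file is opened and closed exactly once.

```shell
printf '%s\n' \
  'dog phil tall 2020-12-09 12:34:22' \
  'cat jill tall 2020-12-10 11:34:22' \
  'fox bill tall 2020-12-09 13:00:00' > rows.txt

# Sort by the date field, then write each run of equal dates to its
# own file (2020-12-09 -> 20201209), closing the previous file when
# the date changes.
sort -k4,4 rows.txt | awk '{
  out = $(NF-1); gsub(/-/, "", out)
  if (out != prev && prev != "") close(prev)
  prev = out
  print > out
}'
```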

Extract substring from first column

I have a large text file with 2 columns. The first column is large and complicated, but contains a name="..." portion. The second column is just a number.
How can I produce a text file such that the first column contains ONLY the name, but the second column stays the same and shows the number? Basically, I want to extract a substring from the first column only AND have the 2nd column stay unaltered.
Sample data:
application{id="1821", name="app-name_01"} 0
application{id="1822", name="myapp-02", optionalFlag="false"} 1
application{id="1823", optionalFlag="false", name="app_name_public"} 3
...
So the result file would be something like this
app-name_01 0
myapp-02 1
app_name_public 3
...
If your actual Input_file is the same as the sample shown, then the following code may help.
awk '{sub(/.*name=\"/,"");sub(/\".* /," ")} 1' Input_file
Output will be as follows.
app-name_01 0
myapp-02 1
app_name_public 3
Using GNU awk
$ awk 'match($0,/name="([^"]*)"/,a){print a[1],$NF}' infile
app-name_01 0
myapp-02 1
app_name_public 3
Non-Gawk
awk 'match($0,/name="([^"]*)"/){t=substr($0,RSTART,RLENGTH);gsub(/name=|"/,"",t);print t,$NF}' infile
app-name_01 0
myapp-02 1
app_name_public 3
Input:
$ cat infile
application{id="1821", name="app-name_01"} 0
application{id="1822", name="myapp-02", optionalFlag="false"} 1
application{id="1823", optionalFlag="false", name="app_name_public"} 3
...
Here's a sed solution:
sed -r 's/.*name="([^"]+).* ([0-9]+)$/\1 \2/g' Input_file
Explanation:
The parentheses capture what's in between them into groups.
The first group is everything after name=" up to the next ". [^"] means "not a double-quote".
The second group is simply "one or more digits at the end of the line, preceded by a space".
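A quick check of the two groups against the sample data (apps.txt is a stand-in file name):

```shell
printf '%s\n' \
  'application{id="1821", name="app-name_01"} 0' \
  'application{id="1822", name="myapp-02", optionalFlag="false"} 1' > apps.txt

# \1 is the captured name, \2 the trailing number.
sed -E 's/.*name="([^"]+).* ([0-9]+)$/\1 \2/' apps.txt
```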

Linux - How to remove certain lines from a files based on a field value

I want to remove certain lines from a tab-delimited file and write output to a new file.
a b c 2017-09-20
a b c 2017-09-19
es fda d 2017-09-20
es fda d 2017-09-19
The 4th column is a date. Basically, I want to keep only the lines whose 4th column is "2017-09-19" (keep lines 2 and 4) and write them to a new file. The new file should have the same format as the raw file.
How do I write the Linux command for this example?
Note: The search criteria should be on the 4th field, as I have other fields in the real data that could have the same value as the 4th field.
With awk:
awk 'BEGIN{OFS="\t"} $4=="2017-09-19"' file
OFS: output field separator, a space by default
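To write the matching lines to a new file, redirect the output (a sketch with an explicit tab separator; the file names are placeholders):

```shell
# Two sample tab-delimited rows; keep only the 2017-09-19 one.
printf 'a\tb\tc\t2017-09-20\na\tb\tc\t2017-09-19\n' > file
awk -F'\t' '$4=="2017-09-19"' file > newfile
```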
Use grep to filter:
grep '2017-09-19' file.txt > filtered_file.txt
This is not perfect, since the string 2017-09-19 is not required to appear in the 4th column, but if your file looks like the example, it'll work.
Sed solution:
sed -nr "/^([^\t]*\t){3}2017-09-19/p" input.txt >output.txt
this is:
-n - don't output every line
-r - extended regular expression
/regexp/p - print line that contains regular expression regexp
^ - beginning of line
(regexp){3} - repeat regexp 3 times
[^\t] - any character except tab
\t - tab character
* - repeat the preceding character zero or more times
2017-09-19 - search text
That is, skip 3 columns separated by a tab from the beginning of the line, and then check that the value of column 4 coincides with the required value.
awk '/2017-09-19/' file >newfile
cat newfile
a b c 2017-09-19
es fda d 2017-09-19

Grep find lines that have 4,5,6,7 and 9 in zip code column

I'm using grep to display all lines that have ONLY the digits 4, 5, 6, 7, and 9 in the zipcode column.
How do I display only the lines of the file that contain those digits in the zipcode field?
A sample row is:
15 m jagger mick 41 4th 95115
Thanks
I am going to assume you meant "How do I use grep to..."
If all of the lines in the file have a 5 digit zip at the end of each line, then:
egrep "[45679]{5}$" filename
Should give you what you want.
If there might be whitespace between the zip and the end of the line, then:
egrep "[45679]{5}[[:space:]]*$" filename
would be more robust.
If the problem is more general than that, please describe it more accurately.
Following regex should fetch you desired result:
egrep "[45679]+$" file
If by "grep" you mean, "the correct tool", then the solution you seek is:
awk '$7 ~ /^[45679]*$/' input
This will print all lines of input in which the 7th field consists only of the characters 4,5,6,7, and 9. If you want to specify 'the last column' rather than the 7th, try
awk '$NF ~ /^[45679]*$/' input
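For example, with the sample row plus a second made-up row whose zip contains only those digits:

```shell
printf '%s\n' \
  '15 m jagger mick 41 4th 95115' \
  '16 f smith jane 33 5th 94567' > people.txt

# Keep rows whose last field consists only of the digits 4, 5, 6, 7, 9.
awk '$NF ~ /^[45679]*$/' people.txt
```

The first row is dropped because its zip contains a 1.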
