Remove Letter After Number and Before Comma - linux

I need to remove any letters that occur after the first comma in a line
some.file
JAN,334X,333B,337A,338D,332Q,335H,331U
Expected Result:
JAN,334,333,337,338,332,335,331
Code:
sed -i 's/\[0-9][0-9][0-9].*,/[0-9][0-9][0-9],/g' some.file
What am I doing wrong?

You could also use a small loop (this is GNU sed):
sed ':;s/[A-Z],/,/2;t;s/[A-Z]$//'
Each pass deletes only the second remaining letter that precedes a comma, and the t command loops back as long as a substitution succeeded. Finally, it deletes the letter at the line's end, if there is one.
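If you prefer an explicit label, the same loop can be written like this (still GNU sed; the named label is just a stylistic variant of the command above):
sed ':a;s/[A-Z],/,/2;ta;s/[A-Z]$//'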

Try this
$ sed 's/,\([0-9]*\)[^,]*/,\1/g' <<<'JAN,334X,333B,337A,338D,332Q,335H,331U'
JAN,334,333,337,338,332,335,331
You need to capture the digits with round parentheses in order to use the captured string in the replacement. The g flag applies the substitution to every occurrence.
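The same command can also be written with extended regular expressions (-E), which drops the backslashes before the parentheses; this is purely a stylistic variant:
$ sed -E 's/,([0-9]*)[^,]*/,\1/g' <<<'JAN,334X,333B,337A,338D,332Q,335H,331U'
JAN,334,333,337,338,332,335,331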
Comparison of the different answers
Test data:
$ > data; for ((x=1000000;x>0;x--)); do echo 'JAN,334X,333B,337A,338D,332Q,335H,331U' >> data; done
My answer is the slowest:
$ time sed 's/,\([0-9]*\)[^,]*/,\1/g' < data >/dev/null
real 0m16.368s
user 0m16.296s
sys 0m0.024s
Michael is a bit faster:
$ time sed ':;s/[A-Z],/,/2;t;s/[A-Z]$//' < data >/dev/null
real 0m9.669s
user 0m9.624s
sys 0m0.012s
But Sundeep is the fastest:
$ time sed 's/[A-Z]//4g' < data >/dev/null
real 0m4.905s
user 0m4.856s
sys 0m0.028s

Some issues are:
No need to escape [.
Your replacement value is wrong: in s/regex/replace/g, the replacement is literal text (plus backreferences such as \1), not another pattern.
Use this:
sed -e 's/\([0-9]\+\)[a-zA-Z],/\1,/g' -e 's/\([0-9]\+\)[a-zA-Z]$/\1/g' file

You should omit the * and the first \ looks like a mistake, i.e.:
sed -i 's/[0-9][0-9][0-9].,/[0-9][0-9][0-9],/g' some.file
but I think you also want to capture the number ...
sed -i 's/\([0-9][0-9][0-9]\).,/\1,/g' some.file
Would be helpful if you posted your actual output as well ...

Since question is tagged linux, this GNU sed option comes in handy
$ echo 'JAN,334X,333B,337A,338D,332Q,335H,331U' | sed -E 's/[A-Z](,|$)/\1/2g'
JAN,334,333,337,338,332,335,331
2g means replace from 2nd match onwards till end of line
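As a quick illustration of the 2g flag on a made-up string (the 1st match is kept, everything from the 2nd onwards is replaced):
$ echo 'aXbXcXdX' | sed 's/X//2g'
aXbcd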
If the number of letters in the first column is known, this can be simplified to
$ echo 'JAN,334X,333B,337A,338D,332Q,335H,331U' | sed 's/[A-Z]//4g'
JAN,334,333,337,338,332,335,331

No need for sed, coreutils will do:
paste -d, <(cut -d, -f1 data) <(cut -d, -f2- data | tr -d 'A-Z')
This takes .3 seconds on my computer when run on the data file generated in ceving's answer.

How can I insert commas in 1 word that I have tailed

I have a log tailed with the following output:
$ tail -n1 /home/shares/number-10.log
123456
I want the output to be: 12,34,56
tail -n1 /home/shares/number-10.log | sed -r 's/([[:digit:]]{2})/\1,/g;s/,[[:space:]]?+$//'
Enable extended regular expressions with -r (or -E), then substitute every occurrence of two consecutive digits with those two digits followed by a comma. The second expression then removes the trailing comma (and any trailing whitespace).
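For example, on the sample number (with the trailing-comma cleanup written in its simplest form, assuming no trailing whitespace on the line):
$ echo 123456 | sed -r 's/([[:digit:]]{2})/\1,/g; s/,$//'
12,34,56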
Amended answer as requested:
tail -n1 /home/shares/number-10.log | sed -r 's/([[:digit:]]{2})/\1|/g;s/\|[[:space:]]?+$//'
Does the same as the first example but uses "|" instead of ","
It is also possible to use the bash variable operations:
var=`tail -n1 /home/shares/number-10.log`
echo "${var:0:2},${var:2:2},${var:4:2}"
echo "${var:0:2}|${var:2:2}|${var:4:2}"

Bash - Add blank line above line starting with a period

I need to edit a text file by adding a blank line above every line starting with a period.
Before
Corn
.Apple
Words.
.Orange
Bean
After
Corn

.Apple
Words.

.Orange
Bean
Here is what I have so far.
This adds a newline before every period, but there are more periods in the actual file than just the ones at the start of a line.
cat File.txt | sed -r 's/([.]+)/\n\1/g'
This displays the lines that start with a period
while read -r line; do
if [[ "$line" == "."* ]]; then
echo "$line"
fi
done < File.txt
How do I merge them together?
This produces the output that you want:
$ sed 's/^[.]/\n./' file
Corn

.Apple
Words.

.Orange
Bean
If you want to change the file in-place, use sed's -i option:
sed -i 's/^[.]/\n./' file
For Mac OS X or other BSD systems, use -i '' (note that BSD sed does not interpret \n in the replacement, so use an escaped literal newline instead):
sed -i '' 's/^[.]/\
./' file
We use ^ which matches only at the beginning of a line. Since we are matching a period at the beginning of the line, it is not necessary to capture a group with parentheses: we know the match is a period. All that we need to do is add a newline before that period.
with sed
sed 's/^\./\n\./'
with awk
awk '/^\./{print ""} 1'
or
awk 'sub(/^\./,"\n.") 1'
Using RegExp it could be:
cat File.txt | sed -r 's/^(\..+)/\n\1/g'
I think an awk script is going to work best.
/^\./ {print "";}
{print $0;}
Put that into a file, called "awkfile" in this case, and run it like: awk -f awkfile File.txt
$ sed -e '/^\./i\\' pru.txt
Corn

.Apple
Words.

.Orange
Bean
This command line instructs sed(1) to search for lines beginning with a dot and insert a blank line before each of them. Look in the sed(1) manual page for how to use the insert, replace and append commands.

How to find the last field using 'cut'

Without using sed or awk, only cut, how do I get the last field when the number of fields are unknown or change with every line?
You could try something like this:
echo 'maps.google.com' | rev | cut -d'.' -f 1 | rev
Explanation
rev reverses "maps.google.com" to be moc.elgoog.spam
cut uses dot (i.e. '.') as the delimiter and chooses the first field, which is moc
lastly, we reverse it again to get com
Use a parameter expansion. This is much more efficient than any kind of external command, cut (or grep) included.
data=foo,bar,baz,qux
last=${data##*,}
See BashFAQ #100 for an introduction to native string manipulation in bash.
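For instance (the %% form is shown only for contrast with the first field):
data=foo,bar,baz,qux
echo "${data##*,}"   # qux - removes the longest prefix matching '*,'
echo "${data%%,*}"   # foo - removes the longest suffix matching ',*'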
It is not possible using just cut. Here is a way using grep:
grep -o '[^,]*$'
Replace the comma with whichever delimiter you need.
Explanation:
-o (--only-matching) only outputs the part of the input that matches the pattern (the default is to print the entire line if it contains a match).
[^,] is a character class that matches any character other than a comma.
* matches the preceding pattern zero or more times, so [^,]* matches zero or more non-comma characters.
$ matches the end of the string.
Putting this together, the pattern matches zero or more non-comma characters at the end of the string.
When there are multiple possible matches, grep prefers the one that starts earliest. So the entire last field will be matched.
Full example:
If we have a file called data.csv containing
one,two,three
foo,bar
then grep -o '[^,]*$' < data.csv will output
three
bar
Without awk ?...
But it's so simple with awk:
echo 'maps.google.com' | awk -F. '{print $NF}'
AWK is a way more powerful tool to have in your pocket.
-F is for the field separator
NF is the number of fields (and thus also the index of the last one, so $NF is the last field)
There are multiple ways. You may use this too.
echo "Your string here"| tr ' ' '\n' | tail -n1
> here
Obviously, the blank space input for tr command should be replaced with the delimiter you need.
This is the only solution possible using nothing but cut:
echo "s.t.r.i.n.g." | cut -d'.' -f2-
[repeat_following_part_forever_or_until_out_of_memory:] | cut -d'.' -f2-
Using this solution, the number of fields can indeed be unknown and vary from time to time. However, as line length must not exceed LINE_MAX characters, including the newline character, an arbitrarily large number of fields can never really occur, so the repetition always terminates.
Yes, a very silly solution, but the only one that meets the criteria, I think.
If your input string doesn't contain forward slashes then you can use basename and a subshell:
$ basename "$(echo 'maps.google.com' | tr '.' '/')"
This doesn't use sed or awk, but it doesn't use cut either, so I'm not quite sure if it qualifies as an answer to the question as it's worded.
This doesn't work well if processing input strings that can contain forward slashes. A workaround for that situation would be to replace forward slash with some other character that you know isn't part of a valid input string. For example, the pipe (|) character is also not allowed in filenames, so this would work:
$ basename "$(echo 'maps.google.com/some/url/things' | tr '/' '|' | tr '.' '/')" | tr '|' '/'
The following implements a friend's suggestion:
#!/bin/bash
rcut(){
    nu="$( echo $1 | cut -d"$DELIM" -f 2- )"
    if [ "$nu" != "$1" ]
    then
        rcut "$nu"
    else
        echo "$nu"
    fi
}
$ export DELIM=.
$ rcut a.b.c.d
d
An alternative using perl would be:
perl -pe 's/(.*) (.*)$/$2/' file
where you may change the space in the pattern (or use \t) to whichever delimiter the file uses
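For example, with a space as the delimiter, as in the pattern above:
$ echo 'foo bar baz' | perl -pe 's/(.*) (.*)$/$2/'
baz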
It is better to use awk when working with tabular data. You don't have to master the command. If it can be achieved with awk, why not use that? I suggest you do not waste your precious time and use a handful of commands to get the job done.
Example:
# $NF refers to the last column in awk
ll | awk '{print $NF}'
If you have a file named filelist.txt that is a list of paths such as the following:
c:/dir1/dir2/file1.h
c:/dir1/dir2/dir3/file2.h
then you can do this:
rev filelist.txt | cut -d"/" -f1 | rev
Adding an approach to this old question just for the fun of it:
$ cat input.file # file containing input that needs to be processed
a;b;c;d;e
1;2;3;4;5
no delimiter here
124;adsf;15454
foo;bar;is;null;info
$ cat tmp.sh # showing off the script to do the job
#!/bin/bash
delim=';'
while read -r line; do
    while [[ "$line" =~ "$delim" ]]; do
        line=$(cut -d"$delim" -f 2- <<<"$line")
    done
    echo "$line"
done < input.file
$ ./tmp.sh # output of above script/processed input file
e
5
no delimiter here
15454
info
Besides bash, only cut is used.
Well, and echo, I guess.
choose -1
choose supports negative indexing (the syntax is similar to Python's slices).
I realized that if we just ensure a trailing delimiter exists, it works. In my case I have comma and whitespace delimiters, so I add a space at the end:
$ ans="a, b"
$ ans+=" "; echo ${ans} | tr ',' ' ' | tr -s ' ' | cut -d' ' -f2
b

Print a file, skipping the first X lines, in Bash [duplicate]

This question already has answers here:
How can I remove the first line of a text file using bash/sed script?
(19 answers)
Closed 3 years ago.
I have a very long file which I want to print, skipping the first 1,000,000 lines, for example.
I looked into the cat man page, but I did not see any option to do this. I am looking for a command to do this or a simple Bash program.
You'll need tail. Some examples:
$ tail great-big-file.log
< Last 10 lines of great-big-file.log >
If you really need to SKIP a particular number of "first" lines, use
$ tail -n +<N+1> <filename>
< filename, excluding first N lines. >
That is, if you want to skip N lines, you start printing line N+1. Example:
$ tail -n +11 /tmp/myfile
< /tmp/myfile, starting at line 11, or skipping the first 10 lines. >
If you want to just see the last so many lines, omit the "+":
$ tail -n <N> <filename>
< last N lines of file. >
Easiest way I found to remove the first ten lines of a file:
$ sed 1,10d file.txt
In the general case where X is the number of initial lines to delete, credit to commenters and editors for this:
$ sed 1,Xd file.txt
If you have GNU tail available on your system, you can do the following:
tail -n +1000001 huge-file.log
It's the + character that does what you want. To quote from the man page:
If the first character of K (the number of bytes or lines) is a
`+', print beginning with the Kth item from the start of each file.
Thus, as noted in the comment, putting +1000001 starts printing with the first item after the first 1,000,000 lines.
If you want to skip the first two lines:
tail -n +3 <filename>
If you want to skip the first x lines:
tail -n +$((x+1)) <filename>
A less verbose version with AWK:
awk 'NR > 1e6' myfile.txt
But I would recommend using integer numbers.
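That is, with the count spelled out:
awk 'NR > 1000000' myfile.txt
This prints every line whose line number is greater than 1,000,000.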
Use the sed delete command with a range address. For example:
sed 1,100d file.txt # Print file.txt omitting lines 1-100.
Alternatively, if you want to only print a known range, use the print command with the -n flag:
sed -n 201,300p file.txt # Print lines 201-300 from file.txt
This solution should work reliably on all Unix systems, regardless of the presence of GNU utilities.
Use:
sed -n '1d;p'
This command will delete the first line and print the rest.
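A quick sanity check on a small input:
$ seq 5 | sed -n '1d;p'
2
3
4
5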
If you want to see the first 10 lines you can use sed as below:
sed -n '1,10 p' myFile.txt
Or if you want to see lines from 20 to 30 you can use:
sed -n '20,30 p' myFile.txt
Just to propose a sed alternative. :) To skip the first one million lines, try | sed '1,1000000d'.
Example:
$ perl -wle 'print for (1..1_000_005)'|sed '1,1000000d'
1000001
1000002
1000003
1000004
1000005
You can do this using the head and tail commands:
head -n <num> | tail -n <lines to print>
where num is 1e6 + the number of lines you want to print.
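For example, to print the 100 lines immediately after the first 1,000,000:
head -n 1000100 huge-file.log | tail -n 100
head emits lines 1 through 1,000,100, and tail keeps the last 100 of those, i.e. lines 1,000,001 to 1,000,100.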
This shell script works fine for me:
#!/bin/bash
awk -v initial_line=$1 -v end_line=$2 '{
    if (NR >= initial_line && NR <= end_line)
        print $0
}' $3
Used with this sample file (file.txt):
one
two
three
four
five
six
The command (it will extract from second to fourth line in the file):
edu@debian5:~$ ./script.sh 2 4 file.txt
Output of this command:
two
three
four
Of course, you can improve it, for example by testing that all argument values are as expected :-)
cat <filename> | awk '{if (NR > 6) print $0}'
I needed to do the same and found this thread.
I tried "tail -n +, but it just printed everything.
The more +lines worked nicely on the prompt, but it turned out it behaved totally different when run in headless mode (cronjob).
I finally wrote this myself:
skip=5
FILE="/tmp/filetoprint"
tail -n$((`cat "${FILE}" | wc -l` - skip)) "${FILE}"
