using awk or sed to print all columns from the n-th to the last [duplicate] - linux

This is NOT a duplicate of another question.
All of the questions/solutions previously posted on Stack Overflow share the same issue: runs of additional spaces get collapsed into a single space.
Example (1.txt)
filename Nospaces
filename One space
filename Two  spaces
filename Three   spaces
Result:
awk '{$1="";$0=$0;$1=$1}1' 1.txt
Nospaces
One space
Two spaces
Three spaces
awk '{$1=""; print substr($0,2)}' 1.txt
Nospaces
One space
Two spaces
Three spaces

Specify the field separator (FS, not the shell's IFS) as a single literal space with the -F option, so that awk does not collapse multiple spaces:
awk -F "[ ]" '{$1="";$0=$0;$1=$1}1' 1.txt
awk -F "[ ]" '{$1=""; print substr($0,2)}' 1.txt

If you define a field as one or more non-space characters followed by any number of space characters, then you can remove the first N like this:
$ sed -E 's/([^[:space:]]+[[:space:]]*){1}//' file
Nospaces
One space
Two  spaces
Three   spaces
Change {1} to {N}, where N is the number of fields to remove. If you only want to remove 1 field from the start, then you can remove the {1} entirely (as well as the parentheses which are used to create a group):
sed -E 's/[^[:space:]]+[[:space:]]*//' file
Some versions of sed (e.g. GNU sed) allow you to use the shorthand:
sed -E 's/(\S+\s*){1}//' file
If there may be some white space at the start of the line, you can add a \s* (or [[:space:]]*) to the start of the pattern, outside of the group:
sed -E 's/\s*(\S+\s*){1}//' file
The problem with using awk is that whenever you touch any of the fields of a given record, the entire record is reformatted, with each field separated by OFS (the Output Field Separator), which is a single space by default. You could use awk with sub if you wanted, but since this is a simple substitution, sed is the right tool for the job.
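A quick demonstration of that reformatting (a minimal sketch, not from the original answer):
echo 'a b  c' | awk '{$1=$1}1'
a b c
The assignment $1=$1 forces awk to rebuild $0, joining the fields with OFS and collapsing the double space.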

To preserve whitespace in awk, you'll have to use regular expression substitutions or substrings; as soon as you start modifying individual fields, awk has to recalculate $0 using the defined (or implicit) OFS.
Referencing Tom's sed answer above:
awk '{sub(/^([^[:blank:]]+[[:blank:]]+){1}/, "", $0); print}' 1.txt
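To make the field count a parameter, the regex can be built dynamically (a sketch assuming an awk that supports interval expressions, such as GNU awk):
awk -v n=1 '{ sub("^([^[:blank:]]+[[:blank:]]+){" n "}", ""); print }' 1.txt
The regex string is concatenated around the variable n, so changing -v n=1 to -v n=2 strips the first two fields instead.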

Use cut:
cut -d' ' -f2- 1.txt
prints all columns from the second to the last and preserves the whitespace between them.
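One caveat (my own note, not from the original answer): cut treats every single space as a delimiter, so if the first column is followed by more than one space, field 2 is empty and the output keeps a leading space:
printf 'a  b\n' | cut -d' ' -f2-
 b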

Working code in awk that leaves no leading space, supports multiple spaces within the columns, and prints from the n-th column (the original used an unexpanded $column_id; passing the column number with -v makes it runnable):
awk -v n=2 '{ print substr($0, index($0, $n)) }' 1.txt
Note that index finds the first occurrence of the n-th field's text, so this misprints when that text also appears earlier in the line.
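For example (a constructed case, not from the original answer):
echo 'abc ab cd' | awk -v n=2 '{ print substr($0, index($0, $n)) }'
abc ab cd
Here $2 is ab, but index finds the ab inside abc at position 1, so the whole line is printed instead of ab cd.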

Related

How to extract and replace columns with a multi-character delimiter?

I have a file with ^$ as the delimiter; the text looks like:
tony^$36^$developer^$20210310^$CA
I want to replace the datetime.
I tried awk -F '\^\$' '{print $4}' file.txt | sed -i '/20210310/20221210/', but it returns nothing. Then I tried just the awk part; it also returns nothing. I guess awk still treats the line as a whole and the delimiter doesn't take effect. Why is that, and how can I solve it?
A simple solution would be:
sed 's/\^\$/\n/g; s/20210310/20221210/g' -i file.txt
which modifies the file, splitting each field onto its own line and replacing the date.
If you need a different delimiter, change the \n in the command to whatever you like, say a space or a comma.
If you only want to preview the changes without actually modifying the file, remove the -i from the command.
When I run your awk command, I get these warnings:
awk: warning: escape sequence `\^' treated as plain `^'
awk: warning: escape sequence `\$' treated as plain `$'
That explains why your output is blank: the field delimiter is interpreted as the regular expression ^$, which matches only a completely blank line. As a result, each non-blank line of input contains no field separators and therefore has only a single field; $4 can be non-empty only if there are at least four fields.
You can fix that by escaping the backslashes:
awk -F '\\^\\$' '{print $4}' file.txt
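The sed half of the pipeline needs fixing too: -i makes no sense when sed reads from a pipe, and the expression is missing the leading s. A working version of the original idea would be (a sketch of the corrected command):
awk -F '\\^\\$' '{print $4}' file.txt | sed 's/20210310/20221210/'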
If all you want to do is print the modified datecodes by themselves, then that should get you going. However, the question ...
How to extract and replace columns with a multi-character delimiter?
... sounds like you may actually want to replace the datecode within each line, keeping the rest intact. In that case, it is a non-starter for the awk command to discard the other parts of the line. You have several options here, but two of the more likely would be
instead of sending field 4 out to sed for substitution, do the sub in the awk script, and then reconstitute the input line by printing all fields with the expected delimiters (a sketch of this option follows the sed explanation below); OR
do the whole thing in sed:
sed -E 's/^((([^^]|\^[^$])*\^\$){3})20210310(\^\$.*)/\120221210\4/' file.txt
If you wanted to modify file.txt in-place then you could add the -i flag (which, on the other hand, is not useful in your original command, where sed's input is coming from a pipe rather than a file).
The -E option engages the POSIX extended regex dialect, which allows the given regex to be more readable (the alternative would require a bunch more \ characters).
Overall, presuming that there are five or more fields delimited by literal '^$' strings and that the fourth contains exactly "20210310", the regex matches the first three fields, including their trailing delimiters, and captures them all as group 1; matches the leading delimiter of the fifth field plus all the remainder of the line and captures it as group 4; and replaces the whole line with group 1, followed by the new datecode, followed by group 4.
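As for the first option, a minimal sketch (my own, assuming the date always sits in field 4) could look like:
awk -F '\\^\\$' -v OFS='^$' '$4 == "20210310" { $4 = "20221210" } { print }' file.txt
Assigning to $4 forces awk to rebuild the record with OFS between the fields, which restores the '^$' delimiters in the output.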

Filtering large data file by date using command line

I have a csv file that contains a bunch of data, with one of the columns being a date. I am trying to extract all lines that have dates in a specific year and save them into a new file.
The format of file is like this with the date and time in the second column:
000000000,10/04/2021 02:10:15 AM,.....
So far I tried:
grep -E ^2020 data.csv >> temp.csv
But it just produced an empty temp.csv. Any ideas on how I can do this?
One potential solution is with awk (your grep produced nothing because ^2020 anchors the year to the start of the line, while your lines start with the first column, not the year):
awk -F"," '$2 ~ /\/2020 /' data.csv > temp.csv
Another potential option is with grep:
grep "\/2020 " data.csv > temp.csv
However, the grep solution may detect "/2020 " elsewhere in the file, rather than in column 2.
Although an awk solution is best here, e.g.
awk -F, 'index($2, "/2021 ")' file
grep can also be used here:
grep '^[^,]*,[^,]*/2021 ' file
Notes:
awk -F, 'index($2, "/2021 ")' splits the lines (records) into fields on commas (see -F,), and if /2021 followed by a space occurs in the second field ($2), the line is printed
the ^[^,]*,[^,]*/2021 pattern in the grep command matches
^ - start of string
[^,]* - zero or more non-comma chars
,[^,]* - a , and zero or more non-comma chars
/2021 - a literal substring (the trailing space ensures the year is exactly 2021, followed by the time).
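If you want to be stricter about where the year sits, you can parse the date itself (a sketch I'm adding, assuming the MM/DD/YYYY HH:MM:SS format shown in the question):
awk -F, '{ split($2, d, "[/ ]"); if (d[3] == "2021") print }' data.csv > temp.csv
split breaks the second column on slashes and spaces, so d[3] is exactly the year, which avoids accidental matches of /2021 elsewhere in the field.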

Get text only within parenthesis from a file in linux terminal [duplicate]

I have a large log file I need to sort through, and I want to extract the text between parentheses. The format is something like this:
<#44541545451865156> (example#6144) has left the server!
How would I go about extracting "example#6144"?
This sed should work here:
sed -E -n 's/.*\((.*)\).*$/\1/p' file_name
There are many ways to skin this cat.
Assuming you always have only one lexeme in parentheses, you can use bash parameter expansion:
while read t; do echo $(t=${t#*(}; echo ${t%)*}); done <logfile
The first substitution, ${t#*(}, cuts off everything up to and including the left parenthesis, leaving you with example#6144) has left the server!; the second one, ${t%)*}, cuts off the right parenthesis and everything after it.
Alternatively, you can also use awk:
awk -F'[)(]' '{print $2}' logfile
-F'[)(]' tells awk to use either parenthesis as the field delimiter, so it splits the input string into three tokens: <#44541545451865156>, example#6144, and has left the server!; then {print $2} instructs it to print the second token.
cut would also do:
cut -d'(' -f 2 logfile | cut -d')' -f 1
Try this:
sed -e 's/^.*(\([^()]*\)).*$/\1/' <logfile
The ^.*(\([^()]*\)).*$ part is a regular expression, or regex. Regexes are hard to read until you get used to them, but they are most useful for extracting text by pattern, as you are doing here.
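If your grep supports Perl-compatible regexes (the -P flag in GNU grep), that is another option (my own addition, not from the original answers):
grep -oP '\(\K[^)]*(?=\))' logfile
\K discards the opening parenthesis from the match and the lookahead (?=\)) stops before the closing one, so only example#6144 is printed.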

Change some field separators in awk

I have an input file
1.txt
joshwin_xc8#yahoo.com:1802752:2222:
ihearttofurkey#yahoo.com:1802756:111113
www.rothmany#mail.com:xxmyaduh:13#;:3A
and I want an output file:
out.txt
joshwin_xc8#yahoo.com||o||1802752||o||2222:
ihearttofurkey#yahoo.com||o||1802756||o||111113
www.rothmany#mail.com||o||xxmyaduh||o||13#;:3A
I want to replace the first two ':' in 1.txt with '||o||', but the script I am using,
awk -F: '{print $1,$2,$3}' OFS="||o||" 1.txt
is not giving the expected output.
Any help would be highly appreciated.
Perl solution:
perl -pe 's/:/||o||/ for $_, $_' 1.txt
-p reads the input line by line and prints each line after processing it
s/// is similar to substitution you might know from sed
for used as a statement modifier runs the preceding substitution once for every element of the following list
$_ holds the line being processed; listing it twice therefore applies the substitution twice, replacing the first two colons
For higher numbers, you can use for ($_) x N where N is the number. For example, to substitute the first 7 occurrences:
perl -pe 's/:/||o||/ for ($_) x 7' 1.txt
The following sed may also help here:
sed 's/:/||o||/;s/:/||o||/' Input_file
Explanation: the first s/// substitutes the 1st occurrence of a colon with ||o||; after that, what was the 2nd colon has become the 1st, so the second s/// substitutes it as well, as per the OP's requirement.
Another Perl solution, though the idea applies to other languages too: use the limit parameter of split:
perl -nE 'print join q(||o||), split q(:), $_, 3' file
(q quotes because I'm on Windows)
If you need to replace the first two occurrences of :, you can use the loop below; to change, say, the first 7 occurrences, change {1..2} to {1..7}.
Because of sed's -i flag, the output is saved back into the original file (p.txt here) rather than displayed:
for i in {1..2}
do
  sed -i "s/:/||o||/1" p.txt
done
Each pass replaces the first remaining colon; since ||o|| contains no colon, the next pass moves on to what was originally the next colon.
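For completeness, the same result is possible in a single awk pass (a sketch of my own, not one of the original answers):
awk -F: -v OFS='||o||' '{ print $1, $2, substr($0, length($1) + length($2) + 3) }' 1.txt
The first two colon-separated fields are printed with the new separator, and substr keeps everything after the second colon, including any further colons, untouched.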

How to display the first word of each line in my file using the linux commands?

I have a file containing many lines, and I want to display only the first word of each line using Linux commands.
How can I do that?
You can use awk:
awk '{print $1}' your_file
This will "print" the first column ($1) in your_file.
Try doing this using grep:
grep -Eo '^[^ ]+' file
Or try doing this with coreutils cut:
cut -d' ' -f1 file
I see there are already answers. But you can also do this with sed:
sed 's/ .*//' fileName
The above solutions seem to fit your specific case. For a more general application of your question, consider that words are generally defined as being separated by whitespace, but not necessarily space characters specifically. Columns in your file may be tab-separated, for example, or even separated by a mixture of tabs and spaces.
The previous examples are all useful for finding space-separated words, while only the awk example also finds words separated by other whitespace characters (and in fact this turns out to be rather difficult to do uniformly across various sed/grep versions). You may also want to explicitly skip empty lines, by amending the awk statement thus:
awk '{if ($1 != "") print $1}' your_file
If you are also concerned about the possibility of empty fields, i.e., lines that begin with whitespace, then a more robust solution would be in order. I'm not adept enough with awk to produce a one-liner for such cases, but a short Python script that does the trick might look like this:
import re
for line in open('your_file'):
    words = re.split(r'\s', line)
    if words and words[0]:
        print(words[0])
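For what it's worth, if the goal is to print the first word even when a line starts with whitespace, awk's default field splitting already ignores leading whitespace, so a one-liner (my suggestion, not from the original answer) could be:
awk 'NF { print $1 }' your_file
The NF pattern skips empty and whitespace-only lines, which have no fields at all.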
...or on Windows (if you have GnuWin32 grep):
grep -Eo "^[^ ]+" file
