Use tr to replace single new lines but not multiple new lines - linux

Hi I have a file with data in the following format:
262353824192
Motley Crue Too Fast For Love Vinyl LP Leathur Records LR123 rare 3rd pressing
http://www.ebay.co.uk/itm/Motley-Crue-Too-Fast-Love-Vinyl-LP-Leathur-Records-LR123-rare-3rd-pressing-/262353824192
301870324112
TRAFFIC Same UK 1st press vinyl LP in gatefold / booklet sleeve Island pink eye
http://www.ebay.co.uk/itm/TRAFFIC-Same-UK-1st-press-vinyl-LP-gatefold-booklet-sleeve-Island-pink-eye-/301870324112
141948187203
NOW That's What I Call Music LP'S Joblot 2-14 MINT CONDITION Vinyl
http://www.ebay.co.uk/itm/NOW-Thats-Call-Music-LPS-Joblot-2-14-MINT-CONDITION-Vinyl-/141948187203
I would like to replace the single new lines with a pipe, but leave the double new lines as they are. I have tried:
tr '\n' '|' < text.txt
But this replaces all new lines with | so the separate products are no longer on different lines. I basically want a | delimiter between the product number, title and url, but each separate product on a different line. How can I achieve this?

Use tr and a little bit of sed:
tr "\n" "|" < text.txt | sed 's/||\+/\n/g'

You could use awk to do this:
awk '/^$/ { print } /./ { printf("%s|", $0) } END { print "" }' text.txt
This will find any blank line and just print it as-is. If it finds any value on the line it will use printf and stick a pipe after it. At the end of processing it prints a newline character to finish up.

This has already been partially answered HERE, but not completely.
I would add an additional transform to change double newlines to some character (hash in this case), then replace the hashes with a newline (or two if you want to go back to the original formatting of those) after changing the single newlines to be pipes.
sed -e ':a' -e 'N' -e '$!ba' -e 's/\n\n/#/g' -e 's/\n/|/g' -e 's/#/\n/g'
This gives the output:
262353824192|Motley Crue Too Fast For Love Vinyl LP Leathur Records LR123 rare 3rd pressing|http://www.ebay.co.uk/itm/Motley-Crue-Too-Fast-Love-Vinyl-LP-Leathur-Records-LR123-rare-3rd-pressing-/262353824192
301870324112|TRAFFIC Same UK 1st press vinyl LP in gatefold / booklet sleeve Island pink eye|http://www.ebay.co.uk/itm/TRAFFIC-Same-UK-1st-press-vinyl-LP-gatefold-booklet-sleeve-Island-pink-eye-/301870324112
141948187203|NOW That's What I Call Music LP'S Joblot 2-14 MINT CONDITION Vinyl|http://www.ebay.co.uk/itm/NOW-Thats-Call-Music-LPS-Joblot-2-14-MINT-CONDITION-Vinyl-/141948187203

awk to the rescue!
awk -F'\n' -v RS= -v OFS='|' '{$1=$1;printf "%s", $0 RT}' file
This preserves the spacing between paragraphs (3 lines, as in the original file). Note that RT is specific to GNU awk.
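For awk implementations without RT, a minimal sketch of the same paragraph-mode idea is shown here. It assumes records are separated by at least one blank line, and it prints one record per output line instead of reproducing the original spacing. The sample data and the function name join_paragraphs are made up for illustration.

```shell
# Hypothetical sample: two three-line records separated by a blank line.
sample=$(printf '111\nTitle A\nhttp://a\n\n222\nTitle B\nhttp://b\n')

join_paragraphs() {
  # RS="" enables paragraph mode: runs of blank lines separate records.
  # FS="\n" makes each line a field; $1=$1 rebuilds the record with OFS.
  awk 'BEGIN { RS = ""; FS = "\n"; OFS = "|" } { $1 = $1; print }'
}

printf '%s\n' "$sample" | join_paragraphs
# 111|Title A|http://a
# 222|Title B|http://b
```

With a real file this would be called as join_paragraphs < text.txt.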

I made a very specific solution to your problem with awk (specific because it assumes you always have the same number of new lines between the groups of records).
awk 'BEGIN {RS="\n\n\n"; FS="\n"; OFS="|"} {print $1,$2,$3}' < text.txt
It sets the record separator to 3 newlines, the field separator to one newline, and the output field separator to a pipe. Then for each record (every block separated by 3 newlines), it prints the first 3 fields (which are separated by one newline), and on output it separates them with a pipe.

Just use sed:
sergey#x50n:~> cat in.txt | tr '\n' '|' | sed -e 's/||\+/\n\n/g; s/|$/\n/'
262353824192|Motley Crue Too Fast For Love Vinyl LP Leathur Records LR123 rare 3rd pressing|http://www.ebay.co.uk/itm/Motley-Crue-Too-Fast-Love-Vinyl-LP-Leathur-Records-LR123-rare-3rd-pressing-/262353824192
301870324112|TRAFFIC Same UK 1st press vinyl LP in gatefold / booklet sleeve Island pink eye|http://www.ebay.co.uk/itm/TRAFFIC-Same-UK-1st-press-vinyl-LP-gatefold-booklet-sleeve-Island-pink-eye-/301870324112
141948187203|NOW That's What I Call Music LP'S Joblot 2-14 MINT CONDITION Vinyl|http://www.ebay.co.uk/itm/NOW-Thats-Call-Music-LPS-Joblot-2-14-MINT-CONDITION-Vinyl-/141948187203
First we replace all newlines with a pipe using tr as in your example.
Then the first expression in the sed command (i.e. s/||\+/\n\n/g;) replaces all occurrences of more than one pipe with two newlines. You may also replace them with a single newline if you do not want blank lines between the lines of output. The second sed expression replaces the trailing pipe with a newline to produce more readable output (or a more "conventional" empty line at the end of the file).
Also note that \+ in sed regex is a GNU extension. Thus if you are using non-GNU implementation of sed (FreeBSD, AIX or so), use standard syntax: |||* instead of ||\+.
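As a sketch of that portable spelling, the pipeline below uses |||* and writes the replacement newline as a backslash followed by a real line break, which should work in both GNU and non-GNU sed. It joins records onto single lines with no blank line in between. The function name join_records and the sample input are assumptions for illustration only.

```shell
join_records() {
  # |||* is the portable spelling of ||\+ (two or more pipes).
  # The replacement is a backslash followed by a literal newline.
  # The second expression drops the single trailing pipe.
  tr '\n' '|' | sed -e 's/|||*/\
/g' -e 's/|$//'
}

# Hypothetical sample: two records separated by one blank line.
printf '111\nTitle A\nhttp://a\n\n222\nTitle B\nhttp://b\n' | join_records
echo
# 111|Title A|http://a
# 222|Title B|http://b
```

Because tr removes every newline, sed sees the whole file as one line, so this is best suited to files of moderate size.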

Related

How to truncate rest of the text in a file after finding a specific text pattern, in unix?

I have an HTML page that I downloaded on Unix using the wget command. I need to remove all of the text after the words "Check list", and then grep some data from what remains. I can't think of a way to remove the text after a keyword. If I do
s/Check list.*//g
it just removes the rest of that line; I want everything below it to be gone as well. How do I do this?
The other solutions you have so far require non-POSIX-mandatory tools (GNU sed, GNU awk, or perl) so YMMV with their availability and will read the whole file into memory at once.
These will work in any awk in any shell on every Unix box and only read 1 line at a time into memory:
awk -F 'Check list' '{print $1} NF>1{exit}' file
or:
awk 'sub(/Check list.*/,""){f=1} {print} f{exit}' file
With GNU awk for multi-char RS you could do:
awk -v RS='Check list' '{print; exit}' file
but that would still read all of the text before Check list into memory at once.
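If the marker should be treated as literal text and not as a regular expression, the same line-at-a-time idea can be sketched with index(). The function name truncate_at and the sample input are hypothetical. Note also that awk processes backslash escapes in values passed with -v, so a marker containing backslashes would need extra care.

```shell
truncate_at() {
  # $1: literal marker text; input is read from stdin.
  awk -v marker="$1" '{
    pos = index($0, marker)            # 0 when the marker is absent
    if (pos) { print substr($0, 1, pos - 1); exit }
    print
  }'
}

printf 'intro\nitems Check list tail\nfooter\n' | truncate_at 'Check list'
# prints "intro", then "items " (the text before the marker), then stops
```

With a real file this would be called as truncate_at 'Check list' < file.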
Depending on which sed version you have, maybe
sed -z 's/Check list.*//'
The /g flag is useless as you only want to replace everything once.
If your sed does not have the -z option (which says to use the ASCII null character as line terminator instead of newline; this hinges on your file not containing any actual nulls, but that should trivially be true for any text file), try Perl:
perl -0777 -pe 's/Check list.*//s'
Unlike sed -z, this explicitly says to slurp the entire file into memory (the argument to -0 is the octal character code of a terminator character, but 777 is not a valid terminator character at all, so it always reads the entire file as a single "line") so this works even if there are spurious nulls in your file. The final s flag says to include newline in what . matches (otherwise s/.*// would still only substitute on the matching physical line).
I assume you are aware that removing everything will violate the integrity of the HTML file; it needs there to be a closing tag for every start tag near the beginning of the document (so if it starts with <html><body> you should keep </body></html> just before the end of the file, for example).
With awk you could make use of RS variable and then set field separator to regex with word boundaries and then print the very first field as per need.
awk -v RS="^$" -v FS='\\<check_list\\>' '{print $1}' Input_file
You can use q to instruct GNU sed to quit, thus ending processing. Consider the following simple example. Let the content of file.txt be
123
456
789
and say you want to jettison everything beyond 5, then you could do
sed '/5/{s/5.*//;q}' file.txt
which gives output
123
4
Explanation: for a line containing 5, substitute 5 and everything after it with the empty string (i.e. delete it), then quit. Observe that lowercase q is used so that the altered line is printed before quitting.
(tested in GNU sed 4.7)

Replacing characters in each line on a file in linux

I have a file with a different word on each line.
My goal is to replace the first character with a capital letter and replace the 3rd character with "#".
For example: football will be exchanged to Foo#ball.
I thought about using awk and sed, but that didn't help, since (to my knowledge) sed needs an exact character as input and awk can print the desired character but not change it.
With GNU sed and two s commands:
echo 'football' | sed -E 's/(.)/\U\1/; s/(...)./\1#/'
Output:
Foo#ball
See: 3.3 The s Command, 5.7 Back-references and Subexpressions and 5.9.2 Upper/Lower case conversion
This might work for you (GNU sed):
sed 's/\(...\)./\u\1#/' file
With bash you can use parameter expansions alone to accomplish the task. For example, if you read each line into the variable line, you can do:
line="${line^}" # change football to Football (capitalize 1st char)
line="${line:0:3}#${line:4}" # make 4th character '#'
Example Input File
$ cat file
football
soccer
baseball
Example Use/Output
$ while read -r line; do line="${line^}"; echo "${line:0:3}#${line:4}"; done < file
Foo#ball
Soc#er
Bas#ball
While shell is typically slower, when use is limited to builtins, it doesn't fall too far behind.
(note: your question says 3rd character, but your example replaces the 4th character with '#')
With GNU awk for the 3rd arg to match():
$ echo 'football' | awk 'match($0,/(.)(..).(.*)/,a){$0=toupper(a[1]) a[2] "#" a[3]} 1'
Foo#ball
Cyrus' or Potong's answers are the preferred ones. (For Linux or systems with GNU sed because of \U or \u.)
This is just an additional awk solution, since you mentioned awk and also used the awk tag:
$ echo 'football'|awk '{a=substr($0,1,1);b=substr($0,2,2);c=substr($0,5);print toupper(a)b"#"c}'
Foo#ball
This is a very simple solution that needs no regular expressions. It will also work on non-GNU awk.
This should work with any version of awk:
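awk '{
    for (i = 1; i <= NF; i++) {
        # Note that string indexes start at 1 in awk!
        # Keep characters 1-3 (first one upper-cased) and put "#" in place of character 4,
        # so that football becomes Foo#ball as in the example from the question.
        $i = toupper(substr($i, 1, 1)) substr($i, 2, 2) "#" substr($i, 5)
    }
    print
}' file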
Note: If a word is less than 3 characters long, like it, it will be printed as It#
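To leave short words without a trailing #, one option is to check the length before substituting. The following is a minimal sketch in plain awk that follows the Foo#ball example from the question (replacing the 4th character). The function name mask_word and the sample words are assumptions for illustration.

```shell
mask_word() {
  # Reads one word per line from stdin.
  awk '{
    first = toupper(substr($0, 1, 1))
    if (length($0) >= 4) {
      # Characters 2-3 are kept, character 4 becomes "#".
      print first substr($0, 2, 2) "#" substr($0, 5)
    } else {
      # Too short to have a 4th character: only capitalize.
      print first substr($0, 2)
    }
  }'
}

printf 'football\nit\n' | mask_word
# Foo#ball
# It
```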
If your data is in a file named d (tested with GNU sed):
sed -E 's/^(\w)(\w\w)\w/\U\1\E\2#/' d

using awk or sed to print all columns from the n-th to the last [duplicate]

This is NOT a duplicate of another question.
All previous questions/solutions posted on Stack Overflow have the same issue: runs of multiple spaces get collapsed into a single space.
Example (1.txt)
filename Nospaces
filename One space
filename Two  spaces
filename Three   spaces
Result:
awk '{$1="";$0=$0;$1=$1}1' 1.txt
One space
Two spaces
Three spaces
awk '{$1=""; print substr($0,2)}' 1.txt
One space
Two spaces
Three spaces
Specify the field separator (FS) with the -F option so that awk does not collapse runs of multiple spaces:
awk -F "[ ]" '{$1="";$0=$0;$1=$1}1' 1.txt
awk -F "[ ]" '{$1=""; print substr($0,2)}' 1.txt
If you define a field as any number of non-space characters followed by any number of space characters, then you can remove the first N like this:
$ sed -E 's/([^[:space:]]+[[:space:]]*){1}//' file
Nospaces
One space
Two  spaces
Three   spaces
Change {1} to {N}, where N is the number of fields to remove. If you only want to remove 1 field from the start, then you can remove the {1} entirely (as well as the parentheses which are used to create a group):
sed -E 's/[^[:space:]]+[[:space:]]*//' file
Some versions of sed (e.g. GNU sed) allow you to use the shorthand:
sed -E 's/(\S+\s*){1}//' file
If there may be some white space at the start of the line, you can add a \s* (or [[:space:]]*) to the start of the pattern, outside of the group:
sed -E 's/\s*(\S+\s*){1}//' file
The problem with using awk is that whenever you touch any of the fields of a given record, the entire record is reformatted, causing each field to be separated by OFS (the Output Field Separator), which is a single space by default. You could use awk with sub if you wanted, but since this is a simple substitution, sed is the right tool for the job.
To preserve whitespace in awk, you'll have to use regular expression substitutions or use substrings. As soon as you start modifying individual fields, awk has to recalculate $0 using the defined (or implicit) OFS.
Referencing Tom's sed answer:
awk '{sub(/^([^[:blank:]]+[[:blank:]]+){1}/, "", $0); print}' 1.txt
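The difference between touching a field and editing $0 directly can be seen in a small sketch. The sample line is made up to mirror the 1.txt example and assumes three spaces between the last two words.

```shell
line='filename Three   spaces'

# Assigning to a field makes awk rebuild $0 with OFS (one space by default),
# so the run of three spaces collapses and an empty first field remains.
printf '%s\n' "$line" | awk '{ $1 = ""; print }'
# prints " Three spaces" (with a leading space)

# Editing $0 with sub() leaves the remaining spacing untouched.
printf '%s\n' "$line" | awk '{ sub(/^[^ ]+ +/, ""); print }'
# prints "Three   spaces"
```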
Use cut:
cut -d' ' -f2- a.txt
prints all columns from the second to the last and preserves whitespace.
Working awk code that prints from the n-th column onward, with no leading space and with runs of spaces inside the columns preserved (column_id is set to 2 here as an example):
awk -v column_id=2 '{ print substr($0, index($0, $column_id)) }' 1.txt
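One caveat with index() is that it finds the first occurrence of the field's text, which can be earlier in the line than the field itself when a value repeats. A sketch that instead strips the leading fields one at a time is given below. The function name from_field is hypothetical, and it assumes fields are separated by spaces or tabs.

```shell
from_field() {
  # $1: number of the first field to keep; input is read from stdin.
  awk -v n="$1" '{
    s = $0
    # Remove one leading field plus the blanks after it, n-1 times.
    for (i = 1; i < n; i++) sub(/^[ \t]*[^ \t]+[ \t]+/, "", s)
    print s
  }'
}

printf 'filename Three   spaces\n' | from_field 2
# prints "Three   spaces"
printf 'a b a c\n' | from_field 3
# prints "a c", whereas index($0, $3) would be 1 here and keep the whole line
```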

Shell Script get text between 2 special characters

I have read a few things out there but can't seem to work out this particular problem. I am writing a shell script. I am reading a file to a variable using
LOCAL_CONFIG=`cat local-config.php`
Which has lines like this
define( 'DB_USER', 'abcxyz' );
define( 'DB_PASSWORD', 'qwerty' );
How can I get the abcxyz and the qwerty parts of this??
Thanks in advance
Using awk
$ awk -F"'" '/^define\(/ {print $4}' local-config.php
abcxyz
qwerty
Explanation:
-F"'"
This defines the field separator as the single quote.
/^define\(/
This selects the lines that start with define(
print $4
For those selected lines, this prints the fourth field.
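Since the question is about a shell script, the same field-splitting idea could be wrapped in a small helper that looks up one constant by name. This is only a sketch: the function name get_define is made up, the config text is inlined so that the example stands alone, and values that contain an escaped single quote would not be handled.

```shell
# Sample text mirroring the two lines from the question.
config="define( 'DB_USER', 'abcxyz' );
define( 'DB_PASSWORD', 'qwerty' );"

get_define() {
  # With "'" as the separator, field 2 is the constant name and field 4 its value.
  awk -F"'" -v key="$1" '$2 == key { print $4; exit }'
}

DB_USER=$(printf '%s\n' "$config" | get_define DB_USER)
DB_PASSWORD=$(printf '%s\n' "$config" | get_define DB_PASSWORD)
printf '%s %s\n' "$DB_USER" "$DB_PASSWORD"
# abcxyz qwerty
```

With the real file, the lookup would read get_define DB_USER < local-config.php, which also avoids loading the whole file into a variable first.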
Using sed
$ sed -rn "/^define\(/ {s/([^']*'){3}//; s/'.*//; p;}" local-config.php
abcxyz
qwerty
-rn
This turns on extended regex syntax and turns off automatic printing.
/^define\(/
This selects the lines that start with define(
{
This starts a group. Commands in this group are executed only for the selected lines.
s/([^']*'){3}//
This removes all text up through and including the third quote.
s/'.*//
This removes all text after the next remaining quote.
p
This prints the line.
}
This ends the group.
Use grep along with -P parameter to enable perl-regexp mode.
$ grep -oP "\bdefine\( *'[^']*' *, *'\K[^']*(?=' *\);)" file
abcxyz
qwerty
\K discards the characters matched so far, so they are not printed in the final output.
The cut command will do this in a simpler way:
Command:
cat local-config.php | cut -d "'" -f4
output:
abcxyz
qwerty
Explanation:
Using cut with ' as the delimiter, we take the fourth field (-f4) of each line.

Replace whitespace with a comma in a text file in Linux

I need to edit a few text files (an output from sar) and convert them into CSV files.
I need to change every whitespace (maybe it's a tab between the numbers in the output) using sed or awk functions (an easy shell script in Linux).
Can anyone help me? Every command I used didn't change the file at all; I tried gsub.
tr ' ' ',' <input >output
This substitutes each space with a comma. If needed, you can first make a pass with the -s flag (squeeze repeats), which replaces each sequence of a repeated character listed in SET1 (here, the blank space) with a single occurrence of that character.
Using squeeze-repeats before substituting tabs:
tr -s '\t' <input | tr '\t' ',' >output
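Translating and squeezing can also be combined in a single tr call, because with -s and two sets tr squeezes the characters of the second set after translating. A minimal sketch follows; the function name squeeze_to_commas is made up, and any commas already repeated in the input would be squeezed as well.

```shell
squeeze_to_commas() {
  # Map space and tab to commas, then squeeze runs of commas into one.
  tr -s ' \t' ',,'
}

printf 'a  b\t\tc\n' | squeeze_to_commas
# a,b,c
```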
Try something like:
sed -E 's/[[:space:]]+/,/g' orig.txt > modified.txt
The character class [[:space:]] will match all whitespace (spaces, tabs, etc.); note that it must be written inside a bracket expression, and that -E is needed for + to mean "one or more". If you just want to replace a single character, e.g. just space, use that only.
EDIT: Actually [[:space:]] includes carriage return, so this may not do what you want. The following will replace tabs and spaces.
sed -E 's/[[:blank:]]+/,/g' orig.txt > modified.txt
as will (with GNU sed, which understands \t)
sed -E 's/[\t ]+/,/g' orig.txt > modified.txt
In all of this, you need to be careful that the items in your file that are separated by whitespace don't contain their own whitespace that you want to keep, eg. two words.
without looking at your input file, only a guess
awk '{$1=$1}1' OFS=","
redirect to another file and rename as needed
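The reason this works: assigning to any field makes awk rebuild the record using OFS, and the default field splitting ignores leading and trailing blanks and treats runs of spaces or tabs as one separator. A small sketch, with a made-up function name ws_to_csv and sample input:

```shell
ws_to_csv() {
  # $1 = $1 changes nothing in the fields but forces $0 to be rebuilt with OFS.
  awk -v OFS=',' '{ $1 = $1; print }'
}

printf '  a  b\tc \n' | ws_to_csv
# a,b,c
```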
What about something like this :
cat texte.txt | sed -e 's/\s/,/g' > texte-new.txt
(Yes, with some useless catting and piping ; could also use < to read from the file directly, I suppose -- used cat first to output the content of the file, and only after, I added sed to my command-line)
EDIT : as #ghostdog74 pointed out in a comment, there's definitely no need for the cat/pipe ; you can give the name of the file to sed :
sed -e 's/\s/,/g' texte.txt > texte-new.txt
If "texte.txt" is this way :
$ cat texte.txt
this is a text
in which I want to replace
spaces by commas
You'll get a "texte-new.txt" that'll look like this :
$ cat texte-new.txt
this,is,a,text
in,which,I,want,to,replace
spaces,by,commas
I wouldn't go just replacing the old file by the new one (could be done with sed -i, if I remember correctly ; and as #ghostdog74 said, this one would accept creating the backup on the fly) : keeping might be wise, as a security measure (even if it means having to rename it to something like "texte-backup.txt")
This command should work:
sed "s/\s/,/g" < infile.txt > outfile.txt
Note that you have to redirect the output to a new file. The input file is not changed in place.
sed can do this:
sed 's/[\t ]/,/g' input.file
That will send to the console,
sed -i 's/[\t ]/,/g' input.file
will edit the file in-place
Here's a Perl script which will edit the files in-place:
perl -i.bak -lpe 's/\s+/,/g' files*
Consecutive whitespace is converted to a single comma.
Each original input file is kept as a backup copy with a .bak suffix.
These command-line options are used:
-i.bak edit in-place and make .bak copies
-p loop around every line of the input file, automatically print the line
-l removes newlines before processing, and adds them back in afterwards
-e execute the perl code
If you want to replace an arbitrary sequence of blank characters (tab, space) with one comma, use the following:
sed -E 's/[\t ]+/,/g' input_file > output_file
or
sed -r 's/[[:blank:]]+/,/g' input_file > output_file
If some of your input lines include leading space characters which are redundant and don't need to be converted to commas, then first you need to get rid of them, and then convert the remaining blank characters to commas. For such case, use the following:
sed -E 's/^ +//' input_file | sed -E 's/[\t ]+/,/g' > output_file
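The trimming and the conversion can also be done in one sed call. The following is a sketch assuming a sed that supports -E (GNU and BSD sed both do). The function name to_csv and the sample line are made up, and trailing blanks are trimmed too so that no empty last column appears.

```shell
to_csv() {
  sed -E \
    -e 's/^[[:blank:]]+//' \
    -e 's/[[:blank:]]+$//' \
    -e 's/[[:blank:]]+/,/g'
}

printf '  12:00  all\t0.5 \n' | to_csv
# 12:00,all,0.5
```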
This worked for me.
sed -e 's/\s\+/,/g' input.txt >> output.csv

Resources