Sed is not writing to file - linux

I simply want to change the delimiter on my CSV.
The file comes from an outside server, so the delimiter is something like this: ^A.
name^Atype^Avalue^A
john^Ab^A500
mary^Ac^A400
jack^Ad^A200
I want to get this:
name,type,value
john,b,500
mary,c,400
jack,d,200
I need to change it to a comma (,) or a tab (\t), but my sed command, despite producing the correct output, does not write to the file.
cat -v CSVFILE | sed -i "s/\^A/,/g"
When I use the line above, it correctly outputs the file delimited by commas instead of ^A, but it doesn't write to the file.
I also tried like this:
sed -i "s/\^A/,/g" CSVFILE
That does not work either...
What am i doing wrong?

Literal ^A (two characters, ^ and A) is how cat -v visualizes control character 0x1 (ASCII code 1, named SOH (start of heading)). ^A is an example of caret notation to represent unprintable ASCII characters:
^A stands for keyboard combination Control-A, which, when preceded by generic escape sequence Control-V, is how you can create the actual control character in your terminal; in other words,
Control-VControl-A will insert an actual 0x1 character.
Incidentally, the logic of caret notation (^<letter>) is: the letter corresponds to the ASCII value of the control character represented; e.g., A corresponds to 0x1, and D corresponds to 0x4 (^D, EOT).
To put it differently: you add 0x40 to the ASCII value of the control character to get the ASCII value of its letter representation in caret notation.
^@ to represent NUL (0x0 characters) and ^? to represent DEL (0x7f) are consistent with this notation, because @ has ASCII value 0x40 (i.e., it comes just before A (0x41) in the ASCII table) and 0x40 + 0x7f constrained to 7 bits (bit-ANDed with the max. ASCII value 0x7f) yields 0x3f, which is the ASCII value of ?.
To inspect a given file for the ASCII values of exotic control characters, you can pipe it to od -c, which represents 0x1 as (octal) 001.
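For example, the difference between the raw byte and its visualization can be seen directly (the sample line below is illustrative):

```shell
# Build a line containing an actual 0x1 (SOH) byte via an octal escape
printf 'a\001b\n' | cat -v   # cat -v renders the byte as ^A: a^Ab
printf 'a\001b\n' | od -c    # od -c shows it as octal 001
```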
This implies that, when passing the file to sed directly, you cannot use caret notation and must instead use the actual control character in your s call.
Note that when you use Control-VControl-A to create an actual 0x1 character, it will also appear in caret notation - as ^A - but in that case it is just the terminal's visualization of the true control character; while it may look like the two printable characters ^ and A, it is not. Purely visually you cannot tell the difference - which is why using an escape sequence or ANSI C-quoted string to represent the control character is the better choice - see below.
Assuming your shell is bash, ksh, or zsh, the better alternative to using Control-VControl-A is to use an ANSI C-quoted string to generate the 0x1 character: $'\1'
However, as Lars Fischer points out in a comment on the question, GNU sed also recognizes escape sequence \x01 for 0x1.
Thus, your command should be:
sed -i 's/\x01/,/g' CSVFILE # \x01 only recognized by GNU sed
or, using an ANSI C-quoted string:
sed -i $'s/\1/,/g' CSVFILE
Note: While this form can in principle be used with BSD/OSX sed, the -i syntax is slightly different: you'd have to use sed -i '' $'s/\1/,/g' CSVFILE
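A quick sanity check of the GNU-sed form on a scratch file (the sample row and temporary file are illustrative):

```shell
tmp=$(mktemp)                          # scratch copy, so nothing real is touched
printf 'john\001b\001500\n' > "$tmp"   # one 0x1-delimited sample row
sed -i 's/\x01/,/g' "$tmp"             # GNU sed: \x01 denotes the 0x1 byte
cat "$tmp"                             # prints: john,b,500
rm -f "$tmp"
```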
The only reason to use sed for your task is to take advantage of in-place updating (-i); otherwise, tr is the better choice - see Ed Morton's answer.

This is the job tr was created to do:
tr '<control-A>' ',' < file > tmp && mv tmp file
Replace <control-A> with a literal control-A, obviously.
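If you prefer not to type a literal control character, tr also understands octal escapes, so the same job can be done as (sample data is illustrative):

```shell
# Translate every 0x1 byte (octal 001) into a comma
printf 'john\001b\001500\n' | tr '\001' ','   # prints: john,b,500
```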

If your sed supports the -i option, you could use it like this:
sed -i.bak -e "s/\^A/,/g" CSVFILE
(This assumes the delimiter in the source file consists of the two characters ^ and A; if ^A is supposed to refer to Control-A, then you will have to make adjustments accordingly, e.g. using 's/\x01/,/g'.)
Otherwise, assuming you want to keep a copy of the original file (e.g. in case the result is not what you expect -- see below), an incantation such as the following can be used:
mv CSVFILE CSVFILE.bak && sed "s/\^A/,/g" CSVFILE.bak > CSVFILE
As pointed out elsewhere, if the source-file separator is Control-A, you could also use tr '\001' , (or tr '\001' '\t' for a tab).
One caution: the delimiter in the source file may well have been chosen precisely because commas can appear in the "values" that the separator character is separating. If that is a possibility, then a different approach will be needed. (See e.g. https://www.rfc-editor.org/rfc/rfc4180)

In case it's run under OS X:
Add an extension to -i to write to a new file:
sed -i.bak "s/^A/,/g" CSVFILE
Or to write in place:
sed -i '' "s/^A/,/g" CSVFILE
You can also output to a new file with cat, but without -i on your sed
command:
cat CSVFILE | sed "s/^A/,/g" > output
Make sure you write the ^A this way:
Ctrl+V, Ctrl+A

Related

How to add character at the end of specific line in UNIX/LINUX?

Here is my input file. I want to add a ":" character to the end of lines that have ">" at the beginning of the line. I tried sed -i 's|$|:|' input.txt, but ":" was added to the end of every line. It is also hard to call out specific line numbers because, in each of my input files, the ">" lines appear at different line numbers; I want to run a loop over multiple files, so hard-coding line numbers is useless.
>Pas_pyrG_2
AAAGTCACAATGGTTAAAATGGATCCTTATATTAATGTCGATCCAGGGACAATGAGCCCA
TTCCAGCATGGTGAAGTTTTTGTTACCGAAGATGGTGCAGAAACAGATCTGGATCTGGGT
>Pas_rpoB_4
CAAACTCACTATGGTCGTGTTTGTCCAATTGAAACTCCTGAAGGTCCAAACATTGGTTTG
ATCAACTCGCTTTCTGTATACGCAAAAGCGAATGACTTCGGTTTCTTGGAAACTCCATAC
CGCAAAGTTGTAGATGGTCGTGTAACTGATGATGTTGAATATTTATCTGCAATTGAAGAA
>Pas_cpn60_2
ATGAACCCAATGGATTTAAAACGCGGTATCGACATTGCAGTAAAAACTGTAGTTGAAAAT
ATCCGTTCTATTGCTAAACCAGCTGATGATTTCAAAGCAATTGAACAAGTAGGTTCAATC
TCTGCTAACTCTGATACTACTGTTGGTAAACTTATTGCTCAAGCAATGGAAAAAGTAGGT
AAAGAAGGCGTAATCACTGTAGAAGAAGGCTCAGGCTTCGAAGACGCATTAGACGTTGTA
Here is the expected output file:
>Pas_pyrG_2:
AAAGTCACAATGGTTAAAATGGATCCTTATATTAATGTCGATCCAGGGACAATGAGCCCA
TTCCAGCATGGTGAAGTTTTTGTTACCGAAGATGGTGCAGAAACAGATCTGGATCTGGGT
>Pas_rpoB_4:
CAAACTCACTATGGTCGTGTTTGTCCAATTGAAACTCCTGAAGGTCCAAACATTGGTTTG
ATCAACTCGCTTTCTGTATACGCAAAAGCGAATGACTTCGGTTTCTTGGAAACTCCATAC
CGCAAAGTTGTAGATGGTCGTGTAACTGATGATGTTGAATATTTATCTGCAATTGAAGAA
>Pas_cpn60_2:
ATGAACCCAATGGATTTAAAACGCGGTATCGACATTGCAGTAAAAACTGTAGTTGAAAAT
ATCCGTTCTATTGCTAAACCAGCTGATGATTTCAAAGCAATTGAACAAGTAGGTTCAATC
TCTGCTAACTCTGATACTACTGTTGGTAAACTTATTGCTCAAGCAATGGAAAAAGTAGGT
AAAGAAGGCGTAATCACTGTAGAAGAAGGCTCAGGCTTCGAAGACGCATTAGACGTTGTA
Does sed have more options to do this, or can other commands solve the problem?
sed -i '/^>/ s/$/:/' input.txt
Search the input for lines that match ^> (regex for "starts with the > character"). On those lines, substitute : for the end-of-line (you got this part right).
Slashes (/) are the standard separator character in sed. You can use a different character, but be sure to quote the script: unlike /, a character such as | is meaningful to the shell, so an unquoted s|$|:| probably won't work. It's best to stick with / unless the pattern itself contains slashes, in which case things get unwieldy.
Be careful with sed -i. Make a backup - make sure you know what's changing by using diff to compare the files.
On OSX -i requires an argument.
Using ed to edit the file:
printf "%s\n" 'g/^>/s/$/:/' w | ed -s input.txt
For every line starting with >, add a colon to the end, and then write the changed file back to disk.
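A minimal check of the sed command on a two-line sample (file contents are illustrative):

```shell
tmp=$(mktemp)
printf '>Pas_pyrG_2\nAAAGTC\n' > "$tmp"
sed -i '/^>/ s/$/:/' "$tmp"   # append ':' only to lines starting with '>'
cat "$tmp"                    # the header line gains a trailing colon
rm -f "$tmp"
```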

How to insert UTF-16 character with sed in a file?

I have a file coded as UTF-16 and I want to split each line into fields (at fixed positions) separated by commas.
I have tried the following:
Option 1)
sed -i 's/./&,/400;s/./&,/360; <and so on for several positions in the line>' FILE
This seems to work, but when editing the file with vim, it is obvious that something is wrong: the commas are displayed as single characters, while the other symbols are displayed as two-byte characters.
BEFORE: 2^@A^@U^@W^@2^@0^@1^@9^@0^@1^@0^@1^@0^@0^@0^@1^@0^@0^@
AFTER: 2^@,A^@U^@,W^@,2^@0^@1^@9^@0^@1^@0^@1^@,0^@0^@0^@1^@0^@0^@,
Option 2)
Then I tried to use sed again but, instead of "," I typed the UTF-16 code of the comma, that is 002c:
sed -i "s/./&\U002c/400...
or even
s/./&$(echo -ne '\u002c')/400...
None of these options worked, the results are exactly the same as in Option1.
The problem is not the insertion of a new character with sed.
I noticed that the file is UTF-16LE (I was assuming it was UTF-16BE). I converted it directly to ASCII with
iconv -f UTF-16LE -t ASCII sourcefile >destinationfile
and after that I could handle the file as usual with sed, grep, awk, etc.
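The effect can be reproduced on a tiny hand-built sample; in UTF-16LE every ASCII character is followed by a NUL byte, which is exactly what the printf below fakes:

```shell
# 'H' 0x00 'i' 0x00 is the string "Hi" encoded as UTF-16LE
printf 'H\000i\000' | iconv -f UTF-16LE -t ASCII   # prints: Hi
```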

Find files with non-printing characters (null bytes)

I have got the log of my application with a field that contains strange characters.
I see these characters only when I use less command.
I tried to copy the result of my line of code in a text file and what I see is
CTP_OUT=^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
I'd like to know if there is a way to find these null characters. I have tried with a grep command but it didn't show anything
I can hardly believe it; I might write an answer involving cat!
The characters you are observing are non-printable characters, which are often written in caret notation. The caret notation of a character is a way to visualize non-printable characters. As mentioned in the OP, ^@ is the representation of NUL.
If your file has non-printable characters, you can visualize them using cat -vET:
-E, --show-ends: display $ at end of each line
-T, --show-tabs: display TAB characters as ^I
-v, --show-nonprinting: use ^ and M- notation, except for LFD and TAB
source: man cat
I've added the -E and -T flags to it, so that everything non-printable is made visible.
As grep will not output the non-printable characters itself in any form, you have to pipe its output to cat to see them. The following example shows all lines containing non-printable characters
Show all lines with non-printable characters:
$ grep -E '[^[:print:]]' --color=never file | cat -vET
Here, the ERE [^[:print:]] selects all non-printable characters.
Show all lines with NULL:
$ grep -Pa '\x00' --color=never file | cat -vET
Be aware that we need to make use of the Perl regular expressions here as they understand the hexadecimal and octal notation.
Various control characters can be written in C language style: \n matches a newline, \t a tab, \r a carriage return, \f a form feed, etc.
More generally, \nnn, where nnn is a string of three octal digits, matches the character whose native code point is nnn. You can easily run into trouble if you don't have exactly three digits. So always use three, or since Perl 5.14, you can use \o{...} to specify any number of octal digits.
Similarly, \xnn, where nn are hexadecimal digits, matches the character whose native ordinal is nn. Again, not using exactly two digits is a recipe for disaster, but you can use \x{...} to specify any number of hex digits.
source: Perl 5 version 26.1 documentation
An example:
$ printf 'foo\012\011\011bar\014\010\012foobar\012\011\000\013\000car\012\011\011\011\012' > test.txt
$ cat test.txt
foo
bar
foobar
car
If we now use grep alone, we get the following:
$ grep -Pa '\x00' --color=never test.txt
car
But piping it to cat allows us to visualize the control characters:
$ grep -Pa '\x00' --color=never test.txt | cat -vET
^I^#^K^#car$
Why --color=never: if your grep is tuned to --color=auto or --color=always, it will add extra control characters that the terminal interprets as colors, and these might be confused with the content.
$ grep -Pa '\x00' --color=always test.txt | cat -vET
^I^[[01;31m^[[K^#^[[m^[[K^K^[[01;31m^[[K^#^[[m^[[Kcar$
sed can do this too.
sed -n '/\x0/ { s/\x0/<NUL>/g; p}' file
-n skips printing any output unless explicitly requested.
/\x0/ selects for only lines with null bytes.
{...} encapsulates multiple commands, so that they can be collectively applied always and only when the /\x0/ has detected a null on the line.
s/\x0/<NUL>/g; substitutes in a new, visible value for the null bytes. You could make it whatever you want - I used <NUL> as something both reasonably obvious and yet unlikely to occur otherwise. You should probably grep the file for it first to be sure the pattern doesn't exist before using it.
p; causes lines that have been edited (because they had a null byte) to show.
This basically makes sed an effective grep for nulls.
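For example, with GNU sed (which accepts \x0 and is 8-bit clean, so NUL bytes survive into the pattern space):

```shell
# Two lines: one containing a NUL byte, one without; only the first is printed
printf 'a\000b\nplain\n' | sed -n '/\x0/ { s/\x0/<NUL>/g; p }'   # prints: a<NUL>b
```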

How to replace a <85> to a new line in bash script

I'm running out of ideas on how to replace this character "<85>" with a new line (please treat this as one character only; I think it is a non-printable character).
I tried this one in my script:
cat file | awk '{gsub("<85>",RS);print}' > /tmp/file.txt
but it didn't work.
I hope someone can help.
Thanks.
With sed: sed -e $'s/\302\205/\\n/' file > file.txt
Or awk: awk '{gsub("\302\205","\n")}7'
The magic here was in converting the <85> character to octal codepoints.
I used hexdump -b on a file I manually inserted that character into.
tr '\205' '\n' <file > file.txt
tr is the transliterate command; it translates one character to another (or deletes it, or …). The version of tr on Mac OS X doesn't recognize hexadecimal escapes, so you have to use octal, and octal 205 is hex 85.
I am assuming that the file contains a single byte '\x85', rather than some combination of bytes that is being presented as <85>. tr is not good for recognizing multibyte sequences that need to be transliterated.
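Assuming the file really does contain the single byte 0x85 (octal 205), the command can be checked on a synthetic sample:

```shell
# Translate each 0x85 byte into a newline
printf 'foo\205bar\n' | tr '\205' '\n'   # prints foo and bar on separate lines
```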

Convert string to hexadecimal on command line

I'm trying to convert "Hello" to 48 65 6c 6c 6f in hexadecimal as efficiently as possible using the command line.
I've tried looking at printf and google, but I can't get anywhere.
Any help greatly appreciated.
Many thanks in advance,
echo -n "Hello" | od -A n -t x1
Explanation:
The echo program will provide the string to the next command.
The -n flag tells echo to not generate a new line at the end of the "Hello".
The od program is the "octal dump" program. (We will be providing a flag to tell it to dump it in hexadecimal instead of octal.)
The -A n flag is short for --address-radix=n, with n being short for "none". Without this part, the command would output an ugly numerical address prefix on the left side. This is useful for large dumps, but for a short string it is unnecessary.
The -t x1 flag is short for --format=x1, with the x being short for "hexadecimal" and the 1 meaning 1 byte.
If you want to do this and remove the spaces you need:
echo -n "Hello" | od -A n -t x1 | sed 's/ *//g'
The first two commands in the pipeline are well explained by @TMS in his answer, as edited by @James. The last command differs from @TMS's comment in that it is both correct and has been tested. The explanation is:
sed is a stream editor.
s is the substitute command.
/ opens a regular expression - any character may be used. / is
conventional, but inconvenient for processing, say, XML or path names.
/ or the alternate character you chose, closes the regular expression and
opens the substitution string.
In / */ the * matches any sequence of the previous character (in this
case, a space).
/ or the alternate character you chose, closes the substitution string.
In this case, the substitution string // is empty, i.e. the match is
deleted.
g is the option to do this substitution globally on each line instead
of just once for each line.
The quotes keep the command parser from getting confused - the whole
sequence is passed to sed as the first option, namely, a sed script.
@TMS's brainchild (sed 's/^ *//') only strips spaces from the beginning of each line (^ matches the beginning of the line, the 'pattern space' in sed-speak).
If you additionally want to remove newlines, the easiest way is to append
| tr -d '\n'
to the command pipes. It functions as follows:
| feeds the previously processed stream to this command's standard input.
tr is the translate command.
-d specifies deleting the match characters.
Quotes list your match characters - in this case just newline (\n).
Translate only matches single characters, not sequences.
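Putting the whole pipeline together (printf is used here instead of echo -n for portability):

```shell
printf '%s' Hello | od -A n -t x1 | sed 's/ *//g' | tr -d '\n'   # prints: 48656c6c6f
```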
sed is uniquely awkward when dealing with newlines. This is because sed is one of the oldest unix commands - it was created before people really knew what they were doing. Pervasive legacy software keeps it from being fixed. I know this because I was born before unix was born.
The historical origin of the problem was the idea that a newline was a line separator, not part of the line. It was therefore stripped by line processing utilities and reinserted by output utilities. The trouble is, this makes assumptions about the structure of user data and imposes unnatural restrictions in many settings. sed's inability to easily remove newlines is one of the most common examples of that malformed ideology causing grief.
It is possible to remove newlines with sed - it is just that all solutions I know about make sed process the whole file at once, which chokes for very large files, defeating the purpose of a stream editor. Any solution that retains line processing, if it is possible, would be an unreadable rat's nest of multiple pipes.
If you insist on using sed try:
sed -z 's/\n//g'
-z tells sed to use nulls as line separators.
Internally, a string in C is terminated with a null. The -z option is also a result of legacy, provided as a convenience for C programmers who might like to use a temporary file filled with C-strings and uncluttered by newlines. They can then easily read and process one string at a time. Again, the early assumptions about use cases impose artificial restrictions on user data.
If you omit the g option, this command removes only the first newline. With the -z option sed interprets the entire file as one line (unless there are stray nulls embedded in the file), terminated by a null and so this also chokes on large files.
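A small demonstration of -z (GNU sed only; the input is illustrative):

```shell
# With -z the whole input is one NUL-terminated record, so /g removes every newline
printf 'a\nb\nc\n' | sed -z 's/\n//g'   # prints: abc
```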
You might think
sed 's/^/\x00/' | sed -z 's/\n//' | sed 's/\x00//'
might work. The first command puts a null at the front of each line on a line by line basis, resulting in \n\x00 ending every line. The second command removes one newline from each line, now delimited by nulls - there will be only one newline by virtue of the first command. All that is left are the spurious nulls. So far so good. The broken idea here is that the pipe will feed the last command on a line by line basis, since that is how the stream was built. Actually, the last command, as written, will only remove one null since now the entire file has no newlines and is therefore one line.
Simple pipe implementation uses an intermediate temporary file and all input is processed and fed to the file. The next command may be running in another thread, concurrently reading that file, but it just sees the stream as a whole (albeit incomplete) and has no awareness of the chunk boundaries feeding the file. Even if the pipe is a memory buffer, the next command sees the stream as a whole. The defect is inextricably baked into sed.
To make this approach work, you need a g option on the last command, so again, it chokes on large files.
The bottom line is this: don't use sed to process newlines.
echo hello | hexdump -v -e '/1 "%02X "'
Playing around with this further, a working solution is to remove the "*". It is unnecessary both for the original requirement to simply remove spaces, and if substituting an actual character is desired, as follows:
echo -n "Hello" | od -A n -t x1 | sed 's/ /%/g'
%48%65%6c%6c%6f
So, I consider this an improvement on the original answer, since the command now does exactly what is required, rather than only appearing to.
Combining the answers from TMS and i-always-rtfm-and-stfw, the following works under Windows using gnu-utils versions of the programs 'od', 'sed', and 'tr':
echo "Hello"| tr -d '\42' | tr -d '\n' | tr -d '\r' | od -v -A n -tx1 | sed "s/ //g"
or in a CMD file as:
@echo "%1"| tr -d '\42' | tr -d '\n' | tr -d '\r' | od -v -A n -tx1 | sed "s/ //g"
A limitation on my solution is it will remove all double quotes (").
"tr -d '\42'" removes quote marks that the Windows 'echo' will include.
"tr -d '\r'" removes the carriage return, which Windows includes as well as '\n'.
The pipe (|) character must immediately follow the string, or the Windows echo will add a space after the string.
There is no '-n' switch to the Windows echo command.
