Why is my Bash script adding <feff> to the beginning of files? - linux

I've written a script that cleans up .csv files, removing some bad commas and bad quotes (bad, means they break an in house program we use to transform these files) using sed:
# remove all commas, and re-insert the good commas using clean.sed
sed -f clean.sed $1 > $1.1st
# remove all quotes
sed 's/\"//g' $1.1st > $1.tmp
# add the good quotes around good commas
sed 's/\,/\"\,\"/g' $1.tmp > $1.tmp1
# add leading quotes
sed 's/^/\"/' $1.tmp1 > $1.tmp2
# add trailing quotes
sed 's/$/\"/' $1.tmp2 > $1.tmp3
# remove utf characters
sed 's/<feff>//' $1.tmp3 > $1.tmp4
# replace original file with new stripped version and delete .tmp files
cp -rf $1.tmp4 quotes_$1
Here is clean.sed:
s/\",\"/XXX/g;
:a
s/,//g
ta
s/XXX/\",\"/g;
Then it removes the temp files and viola we have a new file that starts with the word "quotes" that we can use for our other processes.
My question is:
Why do I have to make a sed statement to remove the feff tag in that temp file? The original file doesn't have it, but it always appears in the replacement. At first I thought cp was causing this but if I put in the sed statement to remove before the cp, it isn't there.
Maybe I'm just missing something...

U+FEFF is the code point for a byte order mark. Your files most likely contain data saved in UTF-16 and the BOM has been corrupted by your 'cleaning process' which is most likely expecting ASCII. It's probably not a good idea to remove the BOM, but instead to fix your scripts to not corrupt it in the first place.

To get rid of these in GNU emacs:
Open Emacs
Do a find-file-literally to open the file
Edit off the leading three bytes
Save the file
There is also a way to convert files with DOS line termination convention to Unix line termination convention.

Related

How to remove 1st empty column from file using bash commands?

I have output file that can be read by a visualizing tool, the problem is that at the beginning of each line there is one space. Because of this the visualizing tool isn't able to recognize the scripts and, hence crashing. When I manually remove the first column of spaces, it works fine. Is there a way to remove the 1st empty column of spaces using bash command
What I want is to remove the excess column of empty space like shown in this second image using a bash command
At present I use Vim editor to remove the 1st column manually. But I would like to do it using a bash command so that I can automate the process. My file is not just full of columns, it has some independent data line
Using cut or sed would be two simple solutions.
cut
cut -c2- file > file.cut && mv file.cut file
cut cannot modify a file, therefore you need to redirect its output to a different file and then overwrite the old file with the new file.
sed
sed -i 's/^.//' file
-i modifies the file in-place.
I would use
sed -ie 's/^ //' file
to just remove spaces (in case a line does not contain it)

How to add character at the end of specific line in UNIX/LINUX?

Here is my input file. I want to add a character ":" into the end of lines that have ">" at the beginning of the line. I tried seq -i 's|$|:|' input.txt but ":" was added to all the ending of each line. It is also hard to call out specific line numbers because, in each of my input files, the line contains">" present in different line numbers. I want to run a loop for multiple files so it is useless.
>Pas_pyrG_2
AAAGTCACAATGGTTAAAATGGATCCTTATATTAATGTCGATCCAGGGACAATGAGCCCA
TTCCAGCATGGTGAAGTTTTTGTTACCGAAGATGGTGCAGAAACAGATCTGGATCTGGGT
>Pas_rpoB_4
CAAACTCACTATGGTCGTGTTTGTCCAATTGAAACTCCTGAAGGTCCAAACATTGGTTTG
ATCAACTCGCTTTCTGTATACGCAAAAGCGAATGACTTCGGTTTCTTGGAAACTCCATAC
CGCAAAGTTGTAGATGGTCGTGTAACTGATGATGTTGAATATTTATCTGCAATTGAAGAA
>Pas_cpn60_2
ATGAACCCAATGGATTTAAAACGCGGTATCGACATTGCAGTAAAAACTGTAGTTGAAAAT
ATCCGTTCTATTGCTAAACCAGCTGATGATTTCAAAGCAATTGAACAAGTAGGTTCAATC
TCTGCTAACTCTGATACTACTGTTGGTAAACTTATTGCTCAAGCAATGGAAAAAGTAGGT
AAAGAAGGCGTAATCACTGTAGAAGAAGGCTCAGGCTTCGAAGACGCATTAGACGTTGTA
Here is experted output file:
>Pas_pyrG_2:
AAAGTCACAATGGTTAAAATGGATCCTTATATTAATGTCGATCCAGGGACAATGAGCCCA
TTCCAGCATGGTGAAGTTTTTGTTACCGAAGATGGTGCAGAAACAGATCTGGATCTGGGT
>Pas_rpoB_4:
CAAACTCACTATGGTCGTGTTTGTCCAATTGAAACTCCTGAAGGTCCAAACATTGGTTTG
ATCAACTCGCTTTCTGTATACGCAAAAGCGAATGACTTCGGTTTCTTGGAAACTCCATAC
CGCAAAGTTGTAGATGGTCGTGTAACTGATGATGTTGAATATTTATCTGCAATTGAAGAA
>Pas_cpn60_2:
ATGAACCCAATGGATTTAAAACGCGGTATCGACATTGCAGTAAAAACTGTAGTTGAAAAT
ATCCGTTCTATTGCTAAACCAGCTGATGATTTCAAAGCAATTGAACAAGTAGGTTCAATC
TCTGCTAACTCTGATACTACTGTTGGTAAACTTATTGCTCAAGCAATGGAAAAAGTAGGT
AAAGAAGGCGTAATCACTGTAGAAGAAGGCTCAGGCTTCGAAGACGCATTAGACGTTGTA
Do seq have more option to modify or the other commands can solve this problem?
sed -i '/^>/ s/$/:/' input.txt
Search the lines of input for lines that match ^> (regex for "starts with the > character). Those that do substitute : for end-of-line (you got this part right).
/ slashes are the standard separator character in sed. If you wish to use different characters, be sure to pass -e or s|$|:| probably won't work. Since / characters, unlike | characters, are not meaningful character within the shell, it's best to use them unless the pattern also contains slashes, in which case things get unwieldy.
Be careful with sed -i. Make a backup - make sure you know what's changing by using diff to compare the files.
On OSX -i requires an argument.
Using ed to edit the file:
printf "%s\n" 'g/^>/s/$/:/' w | ed -s input.txt
For every line starting with >, add a colon to the end, and then write the changed file back to disk.

concatenation of strings in bash results in substitution

I need to read a file into an array and concatenate a string at the end of each line. Here is my bash script:
#!/bin/bash
IFS=$'\n' read -d '' -r -a lines < ./file.list
for i in "${lines[#]}"
do
tmp="$i"
tmp="${tmp}stuff"
echo "$tmp"
done
However, when I do this, an action of replace happens, instead of concatenation.
For example, in the file.list, we have:
http://www.example1.com
http://www.example2.com
What I need is:
http://www.example1.comstuff
http://www.example2.comstuff
But after executing the script above, I get things as below on the terminal:
stuff//www.example1.com
stuff//www.example2.com
Btw, my PC is Mac OS.
The problem also occurs while concatenating strings via awk, printf, and echo commands. For example echo $tmp"stuff" or echo "${tmp}""stuff"
The file ./file.lst is, most probably, generated on a Windows system or, at least, it was saved using the Windows convention for end of line.
Windows uses a sequence of two characters to mark the end of lines in a text file. These characters are CR (\r) followed by LF (\n). Unix-like systems (Linux and macOS starting with version 10) use LF as end of line character.
The assignment IFS=$'\n' in front of read in your code tells read to use LF as line separator. read doesn't store the LF characters in the array it produces (lines[]) but each entry from lines[] ends with a CR character.
The line tmp="${tmp}stuff" does what is it supposed to do, i.e. it appends the word stuff to the content of the variable tmp (a line read from the file).
The first line read from the input file contains the string http://www.example1.com followed by the CR character. After the string stuff is appended, the content of variable tmp is:
http://www.example1.com$'\r'stuff
The CR character is not printable. It has a special interpretation when it is printed on the terminal: it sends the cursor at the start of the line (column 1) without changing the line.
When echo prints the line above, it prints (starting on a new line) http://www.example1.com, then the CR character that sends the cursor back to the start of the line where is prints the string stuff. The stuff fragment overwrites the first 5 characters already printed on that line (http:) and the result, as it is visible on screen, is:
stuff//www.example1.com
The solution is to get rid of the CR characters from the input file. There are several ways to accomplish this goal.
A simple way to remove the CR characters from the input file is to use the command:
sed -i.bak s/$'\r'//g file.list
It removes all the CR characters from the content of file file.list, saves the updated string back into the file.list file and stores the original file.list file as file.list.bak (a backup copy in case it doesn't produce the output you expect).
Another way to get rid of the CR character is to ask the shell to remove it in the command where stuff is appended:
tmp="${tmp/$'\r'/}stuff"
When a variable is expanded in a construct like ${tmp/a/b}, all the appearances of a in $tmp are replaced with b. In this case we replace \r with nothing.
I'm guessing it's have something to do with the Carriage Return character.
Did your file.list created on windows? If so, try to use dos2unix before running the script.
Edit
You can check your files using the file command.
Example:
file file.list
If you saved the file in Windows Notepad like this:
Then it will probably come up like this:
file.list: ASCII text, with no line terminators
You can use built in tools like iconv to convert the encodings. However for a simple use like this, you can just use a command that works for multiple encodings without any conversion necessary.
You could simply buffer the file through cat, and use a regular expression that applies to either:
Carriage return followed by line terminator, or
Line terminator on it's own
Then append the string.
Example:
cat file.list | grep -E -v "^$" | sed -E -e "s/(\r?$)/stuff/g"
Will work with ASCII text, and ASCII text with no line terminators.
If you need to modify a stream to append a fixed string, you can use sed or awk, for instance:
sed 's/$/stuff/'
to append stuff to the end of each line.
using "dos2unix file.list" would also solve the problem

Replace Windows newlines in a lot of files using sed - but it doesn't

I have a lot of files that end in the classical ^M, an artifact from my Windows times. As this is all source code, git actually thinks those files changed, so I want to remove those nasty lines once and for all.
Here is what I created:
sed -i 's/^M//g' file
But that does not work. Of course I did not type a literal ^M but rather ^V^M (ctrl V, ctrl M). In vim it works (:%s/s/^M//g) and if I modify it like this:
sed -i 's/^M/a/g' file
It also works, i.e. it ends every line with an 'a'. It also works to do this:
sed -i 's/random_string//g' file
Where random_string exists in the file. So I can replace ^M by any character and I can remove lines but I cannot remove ^M. Why?
Note: It is important that it is just removed, no replacing by another invisible char or something. I would also like to avoid double execution and adding an arbitrary string and removing it afterwards. I want to understand why this fails (but it does not report an error).
That character is matched with \r by sed. Use:
sed -e "s/\r//g" input-file
For my case, I had to do
sed -e "s/\r/\n/g" filename.csv
After that wc -l filename Showed correct output instead of 0 lines.

Best tool to remove dos line ends and join line back up again

I have a csv file into which has crept some ^M dos line ends, and I want to get rid of them, as well as 16 spaces and 3 tabs which follow. Like, I have to merge that line with the next one down. Heres an offending record and a good one as a sample of what I mean:
"Mary had a ^M
little lamb", "Nursery Rhyme", 1878
"Mary, Mary quite contrary", "Nursery Rhyme", 1838
I can remove the ^M using sed as you can see, but I cannot work out how to rm the nix line end to join the lines back up.
sed -e "s/^M$ //g" rhymes.csv > rhymes.csv
UPDATE
Then I read "However, the Microsoft CSV format allows embedded newlines within a double-quoted field. If embedded newlines within fields are a possibility for your data, you should consider using something other than sed to work with the data file." from:
http://sed.sourceforge.net/sedfaq4.html
So editing my question to ask Which tool I should be using?
With help from How can I replace a newline (\n) using sed?, I made this one:
sed -e ':a;N;$!ba;s/\r\n \t\t\t/=/' -i rhymes.csv
<CR> <LF> <16 spaces> <3 tabs>
If you just want to delete the CR, you could use:
<yourfile tr -d "\r" | tee yourfile
(or if the two input and output file are different: <yourfile tr -d "\r" > output)
dos2unix file_name
to convert file, or
dos2unix old_file new_file
to create new file.

Resources