I got some Postgres table dumps from somebody using pgAdmin3 on Windows. (Blech.) First of all, they have a whole bunch of extra crap at the top of each file that I've had to get rid of, things like "toc.dat" without comments, etc.
I've resorted to editing them by hand to get them into a workable format for importing, because as they stand they are somewhat garbled. For the most part I've succeeded, but when I open them in emacs, for example, they tend to be littered with the following character:
^#
and sometimes just a lot of:
###
I haven't figured out how to remove them using sed or awk, mainly because I have no idea what they are (I don't think they are null characters) or even how to search for them in emacs. They show up in red as 'unprintable' characters. (Screenshot above.) They also don't seem to be printed to the terminal when I cat the file or when I open it in my OS X text editor, but they certainly cause errors when I try to import the file into Postgres using
psql mydatabase < table.backup
unless I edit them all out.
Does anybody have any idea of a good way to get rid of these, short of editing them out by hand? I've tried in-place sed and also tried tr, but to no effect; perhaps I'm looking for the wrong thing. (As I'm sure you are aware, trying to google for '^#' is futile!)
Just was wondering if anybody had come across this at all because it's going to eat at me unless I figure it out...
Thanks!
Those are null characters. You can remove them with:
tr -d '\000' < file1 > file2
where the -d option tells tr to delete characters with the octal value 000.
I found the tr command on this forum post, so some credit goes to them.
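If you want to double-check that the offending bytes really are nulls before stripping them, od will show them as \0 (just a sanity check, not required):
od -c table.backup | head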
I might suggest acquiring access to a Windows machine (never thought I'd say that), loading the original dumps they gave you, and exporting in some other format to see if you can avoid the problem altogether. That seems safer to me than running any form of sed or tr on a database dump before importing. Good luck!
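One more thought, and admittedly a guess: the "toc.dat" you mention sounds like the dump might actually be a pgAdmin custom- or tar-format archive rather than a plain SQL script. If so, pg_restore may be able to load it directly without any hand-editing, along the lines of:
pg_restore -d mydatabase table.backup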
I have spent all day today trying to find a proper solution, but I am not able to. My problem:
I have an XML file containing multiple instances of the same tag.
Example:
<TASK INSTANCE />
<WORKFLOWLINK CONDITION=""/>
<WORKFLOWLINK CONDITION=""/>
I want to add the contents of another XML file before the first <WORKFLOWLINK. The issue I've run into is that this file is full of double quotes and slashes. I've tried replacing them and escaping them, but to no avail.
My attempts mainly culminated in something like:
sed -e "0,/<WORKFLOWLINK/ /<WORKFLOWLINK/{ r ${filename}" -e "}" ${sourcefile}
If this isn't clear enough I'll get the exact data so you can see.
For the fun of sed:
sed -e "0,/<WORKFLOWLINK/{/<WORKFLOWLINK/{r ${sourcefile}" -e"}}"
The trick is to start a new "pattern/command" pair after your first address range condition 0,/<WORKFLOWLINK/.
Two nested patterns/addresses are not understood; there must be a command after the first pattern. Using an additional pair of curly braces {} does that for you.
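For what it's worth, here is a self-contained sketch of how that might be invoked (the file names insert.xml and workflow.xml are made up, so substitute your own; note that sed's r command queues the inserted file for output after the matched line):
sourcefile=insert.xml
sed -e "0,/<WORKFLOWLINK/{/<WORKFLOWLINK/{r ${sourcefile}" -e "}}" workflow.xml > workflow_new.xml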
Apart from the brain exercise of doing it in sed, @EdMorton is right in recommending an XML processor. His request for an MCVE is also appropriate. I had to do some guessing to see what you want, and I hope I guessed right.
The MCVE should at least have included:
the error message or problem description defining your problem
the initialisation of your environment variables
some sample input; not the original data
You surely would have had an answer earlier and (in case mine does not satisfy you) probably a better one by now.
So, before your next question, please take the tour: https://stackoverflow.com/tour
GNU sed version 4.2.1
GNU bash, version 3.1.17(1)-release (i686-pc-msys)
Everyone,
Thank you for thinking with me, even if I apparently broke some rules.
I have figured out a solution; granted, it is not as pretty as it could be, but for a one-time action it is good enough.
I have moved from a single command to a combination of two. First I detect the location where I want to put my data:
sed -e "0,/<WORKFLOWLINK/ s/<WORKFLOWLINK/##MARKER##\n\t<WORKFLOWLINK'" which will put the marker string in the desired location.
After this I replace the marker with the contents of the file I have. I had already managed to get the individual statements working when I was trying to do it all in one statement, so I just execute them separately:
sed -e "/##MARKER##/{r ${sourcefile}" -e 'd}'
I'm trying to use the following command on a text file:
$ sort <m.txt | uniq -c | sort -nr >m.dict
However I get the following error message:
sort: string comparison failed: Invalid or incomplete multibyte or wide character
sort: Set LC_ALL='C' to work around the problem.
sort: The strings compared were ‘enwedig\r’ and ‘mwy\r’.
I'm using Cygwin on Windows 7 and was having trouble earlier editing m.txt to put each word within the file on a new line. Please see:
Using AWK to place each word in a text file on a new line
I'm not sure if I'm getting these errors due to this, or because m.txt contains characters from the Welsh alphabet (when I was working with Welsh text in Python, I was required to change the encoding to 'Latin-1').
I tried following the error message's advice and changing LC_ALL='C', but this has not helped. Can anyone elaborate on the errors I'm receiving and provide any advice on how I might go about solving this problem?
UPDATE:
When trying dos2unix, errors were displayed about invalid characters at certain lines. It turns out these were not Welsh characters, but other strange characters (arrows etc.). I went through my text file removing these characters until I was able to run dos2unix without error. However, after using dos2unix all the text was concatenated (no spaces/newlines or anything, whereas each word in the file should have been on a separate line). I then used unix2dos and the text file was back to normal. How can I put each word on its own line and use the sort command without it giving me errors about '\r' characters?
I know it's an old question, but just running export LC_ALL='C' does the trick, as suggested by the message sort: Set LC_ALL='C' to work around the problem.
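With the pipeline from the question, that would be:
export LC_ALL='C'
sort <m.txt | uniq -c | sort -nr >m.dict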
Looks like a Windows line-ending related problem (\r\n versus \n). You can convert m.txt to Unix line-endings with
dos2unix m.txt
and then rerun your command.
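Regarding the update: if the text ends up concatenated after conversion, the file may be using bare carriage returns (old Mac-style line endings) rather than \r\n pairs, which is a case plain dos2unix is not meant for. Converting the CRs to newlines explicitly should give you one word per line again; one possible way (or use mac2unix, if your dos2unix package provides it):
tr '\r' '\n' < m.txt > m_unix.txt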
I must warn you I'm a beginner. I have a text file in which some lines contain encoding errors. By "error", I mean that when I view the file in my Linux console I get question marks instead of characters.
I want to remove every line showing those "question marks". I tried to grep -v the problematic character, but it doesn't work. The file itself is UTF8 and I guess some of the lines come from texts encoded in another format. I know I could find a way to reconvert them properly, but I just want them gone for now.
Do you have any ideas about how I could do this please?
PS: Some lines contain diacritics which are displayed fine. The "strings" command seems to remove too many "good" lines.
When dealing with mojibake in character encodings other than ANSI, you must check two things:
Is the file really encoded in X? (X being UTF-8 WITHOUT BOM in your case. You could be trying to read UTF-8 WITH BOM, UTF-16, latin-1, etc. as UTF-8, and that would be the problem). Try reading in (not converting to) other encodings and see if any of them fits.
Is your locale or text editor set to read the file as UTF-8? If not, that may be the problem. Check for support and figure out how to change the setting. On Linux, the locale command shows the current settings, and setting the LANG/LC_ALL environment variables changes them.
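A rough example of checking and changing it in a Linux shell (the exact locale name varies by system):
locale                      # show the current locale settings
export LC_ALL=en_US.UTF-8   # switch this shell session to a UTF-8 locale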
I like how Notepad++ for Windows (which also runs perfectly on Linux using Wine) lets you choose any encoding in which to read the file without trying to convert it (of course, if you pick one other than the encoding the file is actually in, you will only see those weird characters), and it also has a separate option to convert from one encoding to another. That has been pretty useful to me.
If you are a beginner you may be interested in this article. It explains briefly and clearly the whats, whys and hows of character encoding.
[EDIT] If all of the above fails, even with windows-1252 and similar ANSI encodings, I've just learned here how to remove non-ASCII characters using the Unix tr command, turning the file into plain ASCII (but be aware that the information in the extra characters is lost in this output and there is no coming back, so keep the input file around in case you find a better fix):
tr -cd '\11\12\40-\176' < $INPUT_FILE > $OUTPUT_FILE
The octal escapes \11, \12 and \40-\176 are tab, newline and the printable ASCII range. Or, if you want to get rid of the whole offending line instead:
grep -v -P "[^\11\12\40-\176]" $INPUT_FILE > $OUTPUT_FILE
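If you'd rather preview which lines would be removed before deleting anything, the same character class works with grep -n to list them first:
grep -n -P "[^\11\12\40-\176]" $INPUT_FILE | head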
[EDIT 2] This answer here gives a pretty good guess at what could be happening if none of the encodings work on your file (unfortunately, the only straightforward solution seems to be removing those problematic characters).
You can use a one-line Perl script like:
perl -pe 's/[^[:ascii:]]+//g;' my_utf8_file.txt
I'm trying to write a bash script that imports a CSV file and sends it off to somewhere on the web. If I use a handwritten CSV, e.g.:
summary,description
CommaTicket1,"Description, with a comma"
QuoteTicket2,"Description ""with quotes"""
CommaAndQuoteTicke3,"Description, with a commas, ""and quotes"""
DoubleCommaTicket4,"Description, with, another comma"
DoubleQuoteTicket5,"Description ""with"" double ""quoty quotes"""
bash's read builtin is able to read the file fine. However, if I create "the same file" (i.e. with the same fields) in Excel, read doesn't work as it should and usually just reads the first value and that's all.
I'm relatively new to Bash scripting, so if someone thinks it's a problem with my code, I'll upload it, but it seems to be a problem with the way Excel for Mac saves files, and I thought someone might have some thoughts on that.
Anything you guys can contribute will be much appreciated. Cheers!
By default, Excel on Mac marks the end of each record with a carriage-return character, but bash looks for records ending in a newline character. When saving a file in Excel for Mac, be sure to change the file format (an option available when saving the file) to a DOS or Windows CSV, or the like, which should write a carriage return plus a newline at the end of each record and should be "readable".
Alternatively, you could just process the file with tr, and convert all the CRs to LFs, i.e.,
tr '\r' '\n' < myfile.csv > newfile.csv
One way you can verify if this actually is the problem is by using od to inspect the file. Use something like:
od -c myfile.csv
And look for the end-of-line character.
Finally, you could also investigate bash's internal IFS variable and set it to include "\r". See: http://tldp.org/LDP/abs/html/internalvariables.html
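As a rough sketch of that last idea, though using read's -d option rather than IFS to treat the carriage return as the record delimiter (the column names are made up from the sample above, the header row is not skipped, and the CSV quote syntax is not interpreted):
while IFS=, read -r -d $'\r' summary description || [ -n "$summary" ]; do
    echo "summary: $summary | description: $description"
done < myfile.csv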
I am trying to parse a giant log file using node.js. The file does not seem to have '\n' in it, but when I do :set list in vi it shows me '$' at the end of every line. Does anyone know what that is? I mean, can I split a string on that?
I would recommend checking out your file via
cat -v -e
which will show you all unprintable characters and line endings.
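For instance (the file name here is made up), a line ending in a plain \n shows up as $ at the end of the line, while a \r\n ending would show as ^M$:
$ cat -v -e app.log | head -2
first log line$
second log line$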
It happens because you did :set list, so you should read :h 'list' instead of asking this here. Everything you need to know about this $ is stated in the help.
The second question (splitting the string on the end of line) is answered in :h getline(). I also doubt that the file really does not have a NL, so please write here how you came to the conclusion that «the file does not seem to get '\n'».