Bash doesn't read CSV when saved in Excel - excel

I'm trying to write a bash script that imports a CSV file and sends it off to somewhere on the web. If I use a handwritten CSV, e.g.:
summary,description
CommaTicket1,"Description, with a comma"
QuoteTicket2,"Description ""with quotes"""
CommaAndQuoteTicke3,"Description, with a commas, ""and quotes"""
DoubleCommaTicket4,"Description, with, another comma"
DoubleQuoteTicket5,"Description ""with"" double ""quoty quotes"""
the READ command is able to read the file fine. However, if I create "the same file" (i.e. with the same fields) in Excel, READ doesn't work as it should and usually just reads the first value and that's all.
I'm relatively new to Bash scripting, so if someone thinks it's a problem with my code, I'll upload it, but it seems to be a problem with the way Excel for Mac saves files, and I thought someone might have some thoughts on that.
Anything you guys can contribute will be much appreciated. Cheers!

By default, Excel on Mac ends each record with the carriage-return character, but bash is looking for records terminated by the newline character. When saving the file in Excel for Mac, change the format (an option available when saving the file) to a DOS or Windows CSV, or the like, which ends each record with a carriage return plus a newline and should be readable.
Alternatively, you could just process the file with tr, and convert all the CRs to LFs, i.e.,
tr '\r' '\n' < myfile.csv > newfile.csv
One way you can verify if this actually is the problem is by using od to inspect the file. Use something like:
od -c myfile.csv
And look for the end-of-line character.
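For illustration (a made-up two-record file containing a,b and c,d, each ended with a carriage return), the output would look roughly like this, with \r where a Unix-style file would show \n:
od -c sample.csv
0000000   a   ,   b  \r   c   ,   d  \r
0000010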
Finally, you could also investigate bash's internal IFS variable, and set it to include "\r" in it. See: http://tldp.org/LDP/abs/html/internalvariables.html
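Putting it together, here is a minimal sketch of a loop that normalizes the line endings with tr and then reads each record (using the myfile.csv name from above; splitting the quoted, comma-containing fields correctly is a separate problem and may still call for a real CSV parser):
tr '\r' '\n' < myfile.csv | while IFS= read -r line; do
    # each record now arrives as one line, regardless of how Excel ended it
    echo "record: $line"
done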

Related

Data hidden in jpg

I am currently looking for hidden data in a jpg file but I have no clue how to proceed.
There is a jpg file containing text in a format I have never seen before:
-ne \xff\xd8\xff\xe0\x00\x10\x4a\x46\x49\x46\x00\x01\x01\x01\x00\x60\x00\x60\x00\x00\xff\xdb\x00\x43\x00\x06\x04\x04\x05\x04\x04\x06\x05\x05\x05\x06\x06\x06\x07\x09\x0e\x09\x09\x08\x08\x09\x12\x0d\x0d\x0a\x0e\x15\x12\x16\x16\x15\x12\x14\x14\x17\x1a\x21\x1c\x17\x18\x1f\x19\x14\x14\x1d\x27\x1d\x1f\x22\x23\x25\x25\x25\x16\x1c\x29\x2c\x28\x24\x2b\x21\x24\x25\x24\xff\xdb\x00\x43\x01\x06\x06\x06\x09\x08\x09\x11\x09\x09\x11\x24\x18\x14\x18\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\xff\xc0\x00\x11\x08\x01\x8e\x03\x4e\x03\x01\x22\x00\x02\x11\x01\x03\x11\x01\xff\xc4\x00\x1f\x00\x00\x01\x05\x01\x01\x01\x01\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\xff\xc4\x00\xb5\x10\x00\x02\x01\x03\x03\x02\x04\x03\x05\x05\x04\x04\x00\x00\x01\x7d\x01\x02\x03\x00\x04\x11\x05\x12\x21\x31\x41\x06\x13\x51\x61\x07\x22\x71\x14\x32\x81\x91\xa1\x08\x23
-ne \x42\xb1\xc1\x15\x52\xd1\xf0\x24\x33\x62\x72\x82\x09\x0a\x16\x17\x18\x19\x1a\x25\x26\x27\x28\x29\x2a\x34\x35\x36\x37\x38\x39\x3a\x43\x44\x45\x46\x47\x48\x49\x4a\x53\x54\x55\x56\x57\x58\x59\x5a\x63\x64\x65\x66\x67\x68\x69\x6a\x73\x74\x75\x76\x77\x78\x79\x7a\x83\x84\x85\x86\x87\x88\x89\x8a\x92\x93\x94\x95\x96\x97\x98\x99\x9a\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xff\xc4\x00\x1f\x01\x00\x03\x01\x01\x01\x01\x01\x01\x01\x01\x01\x00\x00\x00\x00\x00\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\xff\xc4\x00\xb5\x11\x00\x02\x01\x02\x04\x04\x03\x04\x07\x05\x04\x04\x00\x01\x02\x77\x00\x01\x02\x03\x11\x04\x05\x21\x31\x06\x12\x41\x51\x07\x61\x71\x13\x22\x32\x81\x08\x14\x42\x91\xa1\xb1\xc1\x09\x23\x33\x52\xf0\x15\x62\x72\xd1\x0a\x16\x24\x34\xe1\x25\xf1\x17\x18\x19\x1a\x26\x27\x28\x29\x2a\x35\x36\x37\x38\x39\x3a\x43\x44\x45\x46\x47\x48\x49
This is just the beginning of the file; there are at least a hundred lines like this.
The file type reported by the file command is: file.jpg: ASCII text, with very long lines
I tried some of the common tools to identify patterns or hidden data, such as exiftool, strings, and xxd, but I found nothing.
If you have any idea on what to do it would be very much appreciated.
If this is a CTF challenge, there are some common ways to find the flag.
First, try to find the flag in the file metadata, such as the file description field.
You can also try the tool stegsolve.jar.
In more advanced cases, where the stego information is hidden with some mathematical calculation, give the tool zsteg a try.
Perhaps I'm misunderstanding the problem here, but if your file actually starts with a backslash character followed by the characters x, f, f, \, x, d, 8 and so on, then what you're looking at is the binary content of a JPG file that has been converted into ASCII text.
If so, you need to convert this back into binary data. For example, on Linux or macOS, you could do this by entering the following on the command line:
echo -ne '\xff\xd8\xff\xe0\x00\x10\x4a\x46\x49\x46\x00\x01...etc...' > img.jpg
echo -ne '\x42\xb1\xc1\x15\x52\xd1\xf0\x24\x33\x62\x72\x82...etc...' >> img.jpg
(Note: > sends the results to a new file, and >> appends to the end of the file)
Or alternatively in Python:
with open("img.jpg", "wb") as f:
    f.write(b'\xff\xd8\xff\xe0\x00\x10\x4a\x46\x49\x46\x00\x01...etc...')
    f.write(b'\x42\xb1\xc1\x15\x52\xd1\xf0\x24\x33\x62\x72\x82...etc...')
    # and so on for all the other lines
Either way, you should end up with a file called img.jpg containing the image you're after.
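Rather than pasting every line by hand, you could also let the shell rebuild the whole file in one go. This is only a sketch and assumes the dump is saved as file.jpg and that every line looks exactly like the ones quoted above (a literal "-ne " prefix followed by \xNN escapes):
# strip the "-ne " prefix and the \x escapes, leaving bare hex digits,
# then let xxd turn that hex back into binary
sed -e 's/^-ne //' -e 's/\\x//g' file.jpg | xxd -r -p > img.jpg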

How to echo/print actual file contents on a unix system

I would like to see the actual file contents without it being formatted to print. For example, to show:
\n0.032,170\n0.034,290
Instead of:
0.032,170
0.34,290
Is there a command to echo the file's actual data in bash? I've tried using head, cat, more, etc. but all those seem to echo the "print-formatted" text. For example:
$ cat example.csv
0.032,170
0.34,290
How can I print the actual characters within the file?
This reads as if you misunderstand what the "actual characters in the file" are. You will not find the characters \ and n in that file, only a line feed, which is a single specific character. So utilities like cat do in fact output exactly the characters in the file.
Putting it the other way around: if you really had those two characters literally in the file, then a utility like cat would actually output them. I just checked, to be sure.
You can easily verify this yourself by opening the file in a hex editor. There you will see the byte 0A (decimal 10), which is the line feed character. You will not see the pair of characters \ and n anywhere in that file.
Many programming languages and shell environments use escape sequences like \n in string definitions to denote control characters that would otherwise not be typable. Maybe that is where the impression comes from that your file should contain those two characters.
To display newlines as \n, you might try:
awk 1 ORS='\\n' input-file
This is not the "actual characters in the file", as \n is merely a conventional method of displaying a newline, but this does seem to be what you want.
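For the example.csv shown above, this would print something like the following (all on one line; since there is no trailing real newline, your prompt may end up right after it):
$ awk 1 ORS='\\n' example.csv
0.032,170\n0.34,290\n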

using tr to strip characters but keep line breaks

I am trying to format some text that was converted from UTF-16 to ASCII, the output looks like this:
C^#H^#M^#M^#2^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#
T^#h^#e^#m^#e^# ^#M^#a^#n^#a^#g^#e^#r^# ^#f^#o^#r^# ^#3^#D^#S^#^#^#^#^#^#^#^#^#^#^#^#^#^#
The only text I want out of that is:
CHMM2
Theme Manager for 3DS
So there is a line break "\n" at the end of each line, but when I use
tr -cs 'a-zA-Z0-9' 'newtext' infile.txt > outfile.txt
it strips the newline as well, so all the text ends up in one big string on one line.
Can anyone assist with figuring out how to strip out only the ^#'s while keeping the spaces and newlines?
The ^#s are most certainly null characters, \0s, so:
tr -d '\0'
Will get rid of them.
But this is not really the correct solution. You should simply use the iconv command to convert from UTF-16 to UTF-8 (see its man page for more information). That is, of course, what you're really trying to accomplish here, and this would be the correct way to do it.
This is an XY problem. Your problem is not deleting the null characters. Your real problem is how to convert from UTF-16 to either UTF-8, or maybe US-ASCII (and I chose UTF-8, as the conservative answer).
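For example (just a sketch, assuming the original UTF-16 file is available as infile.txt; the exact source encoding, such as UTF-16LE, may need adjusting):
iconv -f UTF-16 -t UTF-8 infile.txt > outfile.txt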

^# character wreaking havoc in Windows Postgres backup file on Linux

I got some Postgres table dumps from somebody using pgAdmin3 on Windows. (Blech.) First of all, they have a whole bunch of extra crap at the top of the file that I've had to get rid of, things like "toc.dat" without comments, etc.
I've resorted to editing them by hand to get them into a workable format to be imported, because as they stand they are somewhat garbled; for the most part I've succeeded, but when I open them in emacs, for example, they tend to be littered with the following character:
^#
and sometimes just a lot of:
###
I haven't figured out how to remove them using sed or awk, mainly because I have no idea what they are (I don't think they are null characters) or even how to search for them in emacs. They show up in red as 'unprintable' characters. They also don't seem to be printed to the terminal when I cat the file or when I open it in my OS X text editor, but they certainly cause errors when I try to import the file into Postgres using
psql mydatabase < table.backup
unless I edit them all out.
Anybody have any idea of a good way to get rid of these, short of editing them by hand? I've tried in-place sed and also tried using tr, but to no effect; perhaps I'm looking for the wrong thing. (As I'm sure you are aware, trying to google for '^#' is futile!)
Just was wondering if anybody had come across this at all because it's going to eat at me unless I figure it out...
Thanks!
Those are null characters. You can remove them with:
tr -d '\000' < file1 > file2
where the -d flag tells tr to delete the listed characters, in this case the character with octal value 000 (the null byte).
I found the tr command on this forum post, so some credit goes to them.
I might suggest acquiring access to a Windows machine (never thought I'd say that), loading the original dumps they gave you, and exporting in some other format to see if you can avoid the problem altogether. That seems safer to me than running any form of sed or tr on a database dump before importing. Good luck!
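If you first want to confirm that the mystery characters really are null bytes, a quick sanity check (a sketch, reusing the table.backup name from the question) is to count them:
# delete everything except NUL bytes, then count what is left
tr -cd '\000' < table.backup | wc -c
A nonzero count means the tr -d '\000' cleanup above is removing exactly those characters.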

Removing lines containing encoding errors in a text file

I must warn you I'm a beginner. I have a text file in which some lines contain encoding errors. By "error", I mean that when I view the file in my Linux console, those lines show question marks instead of the expected characters.
I want to remove every line showing those "question marks". I tried to grep -v the problematic character, but it doesn't work. The file itself is UTF-8, and I guess some of the lines come from texts encoded in another format. I know I could find a way to reconvert them properly, but I just want them gone for now.
Do you have any ideas about how I could do this please?
PS: Some lines contain diacritics which are displayed fine. The "strings" command seems to remove too many "good" lines.
When dealing with mojibake in character encodings other than ANSI, you must check two things:
Is the file really encoded in X? (X being UTF-8 WITHOUT BOM in your case. You could be trying to read UTF-8 WITH BOM, UTF-16, latin-1, etc. as UTF-8, and that would be the problem). Try reading in (not converting to) other encodings and see if any of them fits.
Is your locale or text editor set to read the file as UTF-8? If not, that may be the problem. Check for support and figure out how to change the setting. On Linux, try the locale command to check the current setting, and set the LANG or LC_ALL environment variables to change it.
I like how Notepad++ on Windows (which also runs fine on Linux using Wine) lets you choose the encoding used to read the file without trying to convert it (of course, if you pick one other than the encoding the file was actually written in, you will only see those weird characters), and it also has a separate option to convert from one encoding to another. That has been pretty useful to me.
If you are a beginner you may be interested in this article. It explains briefly and clearly the whats, whys and hows of character encoding.
[EDIT] If the above fails, even with windows-1252 and similar ANSI encodings, I've since learned how to remove the non-ASCII characters with the Unix tr command, turning the file into plain ASCII (but be aware that the information in the extra characters is lost in this output and there is no going back, so keep the input file in case you find a better fix):
tr -cd '\11\12\40-\176' < $INPUT_FILE > $OUTPUT_FILE
or, if you want to get rid of the whole line:
grep -v -P "[^\11\12\40-\176]" $INPUT_FILE > $OUTPUT_FILE
[EDIT 2] This answer gives a pretty good guess at what could be happening if none of the encodings work on your file (unfortunately, the only straightforward solution seems to be removing those problematic characters).
You can use a small Perl one-liner like:
perl -pe 's/[^[:ascii:]]+//g;' my_utf8_file.txt
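Note that this strips the non-ASCII characters but leaves the lines they were on in place; if you want the whole line removed, as asked, a sketch of the same idea would be (the cleaned.txt output name is just for illustration):
perl -ne 'print unless /[^[:ascii:]]/;' my_utf8_file.txt > cleaned.txt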
