I am importing a dataset from Excel using the built-in Import Data wizard. However, when viewing the data in SAS, cells with newlines have all line feeds (Alt+Enter) replaced with a period (.).
For example, in excel:
"Example text
with new line"
will be read in by SAS as:
"Example text.with new line"
Usually line feeds or carriage returns are replaced by spaces, and the underlying hex code (if you format the text as hex) is 0A. When I convert the text to hex in Excel using a formula, the line feeds also show up as 0A.
However, the hex code for the period in my text (what used to be a line break in Excel) is 2E rather than the expected 0A. This prevents me from telling them apart from genuine full stops, which rules out a simple find-and-replace workaround. Has anyone else come across this issue? Is there an option to change/set the default line feed replacement character in SAS?
My import code (variables replaced with 'text' for simplicity) for reference:
data work.table;
    length
        text $ 50;
    label
        text = "Text";
    format
        text $CHAR50.;
    informat
        text $CHAR50.;
    infile 'path/to/file'
        lrecl=1000
        encoding='LATIN9'
        termstr=CRLF
        dlm='7F'x
        missover
        dsd;
    input
        text $CHAR50.;
run;
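To confirm which byte is actually present after the import, here is a minimal sketch reusing the data set and variable names from the code above:

/* Count embedded line feeds ('0A'x) versus literal periods ('2E'x)
   in each record; the counts are written to the log. */
data _null_;
    set work.table;
    lf_count  = countc(text, '0A'x);
    dot_count = countc(text, '2E'x);
    put lf_count= dot_count=;
run;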
SAS Viewer will not render so-called non-printables (characters <= '1F'x) and does not display carriage return characters as a line break.
Example:
An Excel cell with two line breaks in the data value,
imported with
proc import datafile='sample.xlsx' out=work.have dbms=xlsx replace;
run;
and viewed in the standard SAS data set viewer (viewtable), appears to have lost the new lines.
Rest assured they are still there.
proc print data=have;
var text / style = [fontsize=14pt];
format text $hex48.;
run;
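If you want the embedded line breaks to be visible in viewtable rather than hidden, one option is to swap them for printable markers first. A minimal sketch, assuming the work.have data set and text variable from the example above:

/* Copy TEXT into a longer variable so the markers fit, and make
   the CR/LF bytes visible. */
data want;
    set have;
    length text_vis $ 200;
    text_vis = tranwrd(text, '0D'x, '<CR>');
    text_vis = tranwrd(text_vis, '0A'x, '<LF>');
run;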
I would not recommend using the Import Wizard; there are far better tools nowadays. EG's import wizard is unique among SAS tools in how it works, and it was really meant only to give data analysts who were not programmers a quick way to bring in data; it's not robust enough for production work.
In this case, what's happening is that SAS's method for reading the data in is very rudimentary: it converts the spreadsheet to a delimited file, and it doesn't handle LF characters very cleanly in the process. Instead of keeping them, which would be possible but riskier (remember, this has to work for any incoming file), it converts them to periods.
You'll see that in the notes in the program it generates:
Some characters embedded within the spreadsheet data were
translated to alternative characters so as to avoid transmission
errors.
It's referring to the LF character in that case.
The only way to get around this that I'm aware of is to either:
Convert the file to CSV from Excel yourself, and then read it in
Use SAS/ACCESS to PC Files (via PROC IMPORT, or the checkbox in the import wizard)
Either of those will allow you to read in your line feed characters.
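For the second option, a minimal sketch, assuming SAS/ACCESS to PC Files is licensed (the file path and output data set name are placeholders):

proc import datafile='path/to/file.xlsx'
    out=work.have
    dbms=xlsx
    replace;
    getnames=yes;
run;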
I have tried using CSV and TXT formats, but each poses its own problem.
With the CSV format, I have found that it is possible to add a tab character before the number, which preserves the leading zeros; but when the user adds characters to the front of the number, a space ends up in the middle of the number.
With the TXT format, the only way to preserve the leading zeros is to import the TXT file into Excel and explicitly set each column that holds numbers to be treated as text. Since a client could be opening this file, those instructions are too easy to get wrong to rely on.
Basically, what I need is a file format, or a way to edit a CSV, that maintains leading zeros when the file is opened in Excel, but that is also editable without losing the integrity of the number and without showing any special characters when editing the cell. Please let me know if any areas need clarification, as this is an oddly specific problem.
Since you're importing text or CSV I assume that you're using the "get data from text" tool. The way to maintain the leading zeros is to set the imported field to be "text" instead of "general" during step 3 of the import.
If you need to pad the leading zeros back in to a specified length, you can also use =RIGHT(CONCATENATE("00...00", cell), length); for example, =RIGHT(CONCATENATE("0000000000", A1), 10) pads the value in A1 to 10 characters (the cell reference is just an illustration).
I need to work on tabular data with some people who will edit it using Excel 2010, likely set up with , as decimal delimiter and ; as list separator.
The data needs to be under version control, and it must be easy to fork/share, merge, and compare different states of the data set. This includes low-barrier ways to contribute for people outside our work group who do not have me around to help them set the system up.
What setup will allow my co-workers to easily open the file using Excel and edit cells, and commit and compare with very few clicks and maybe entering a commit message?
In particular, I require that if I open the file in Excel, do a null-change and save it again, and then do whatever is the commit step for this setup, the commit will be empty.
The data contains non-ASCII characters, in particular IPA symbols.
I expected that I would just be able to use git with CSV files. For ASCII data, .csv with comma as separator does seem to fulfill these conditions; but with a fuller character repertoire I cannot seem to get Excel to keep the data the same on round-trips, not even in appearance, let alone byte-for-byte: it either loses Unicode characters upon saving or does not recognize the format upon reading.
I have written a Robot UI test which extracts some data from an Excel file and compares it to what it got from the UI. The problem is that in some cases, what the script reads from the Excel file has some hidden characters which cause a failure in the comparison. For example, I have these two strings (whose reprs I have printed); the first is obtained from UI elements and the second is derived from the Excel file:
1- u'Please fill back date...'
2- u'Please fill back date\u2026'
Those hidden characters at the end of the second string make the test case fail. How can I avoid this? I should mention that I have tried strip and it didn't help.
The "hidden" characters that you mention are just the ascii representation of the unicode horizontal ellipsis. Some microsoft products (and perhaps some non-microsoft products) will autocorrect ... into this character.
Is this at all possible?
If I open up my file in a standard text editor, e.g. Notepad, the preceding zeros are displayed.
e.g. 000485001 shows up.
This doesn't happen in Excel, though. All that's displayed is 485001.
Just wondering if there's a way around this?
Thanks,
Yes, when you're importing (or using 'Text to Columns') you can explicitly indicate the data type for a column (instead of General). If you select 'Text', the zeros will not be dropped.
Unfortunately, you only see the dialog to specify this option when Excel is already open and you use either File/Open or Data/Text to Columns. If you just double-click a .csv in Explorer, you don't get this choice.
Excel tries very hard to determine the type of value it's importing. If it looks like a number, it will treat it like a number, and drop all the leading zeros as it reads it in. There's no way to get them back once they're lost.
You might try to import the file using the wizard that lets you set the data type for each column.
Rather than writing your data as a CSV file, use the SYLK (Symbolic Link) format instead. This format includes information about the style of a column, so that Excel will not try to auto-guess the type of data.
The easiest way to get started with this format is to export a small file from Excel and use that as a template.
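For illustration, an untested sketch of the record layout; a quoted K field is stored as text, so the leading zeros survive. Exporting a template from Excel, as suggested above, remains the more reliable route:

ID;PWXL;N;E
C;Y1;X1;K"000485001"
E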
OK, I got around this by inserting a text character before the number, i.e. #000485001.
Simple enough!