Truncating characters when importing with SAS - excel

I have an Excel spreadsheet with company data and descriptions. Some of the cells basically contain mini-essays in them, pages and pages of straight text contained in a single cell. SAS has been giving me problems when I'm importing the file because it truncates some of the longer cells and the text gets cut off mid-sentence. Any ideas on how to avoid this? I've tried saving the file to a tab-delimited text file, but no luck.
Thanks!

Exporting to tab-delimited or csv may be the way to go, as you said. Be sure to have strings enclosed in quotes also. But do you have the length specified for the variable containing the long cells? According to SAS the maximum length is 32,767 characters, so perhaps try as large a number as it takes -- hopefully less than that.
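Also specify lrecl (the maximum length of each line of the file) so that it is at least as long as the longest line. The 32,767 limit applies to each character variable, not to the line as a whole.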
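data test;
    /* Declare lengths up front; a character variable can hold at most 32,767 bytes. */
    length company_name $20 description1 description2 $10000;
    /* my_tab_dlm_file is a fileref for the exported tab-delimited file; '09'x is the tab character. */
    infile my_tab_dlm_file lrecl = 50000 dsd delimiter = '09'x;
    input company_name
          description1
          description2
    ;
run;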
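Before choosing the numbers in the length and lrecl statements, it can help to measure the exported file. The following is a minimal Python sketch, not part of the original answer: the function name measure_tab_file is made up for illustration, and it counts characters, whereas SAS lengths are in bytes, so leave some headroom for multi-byte text.

```python
import csv


def measure_tab_file(lines):
    """Return (longest_line, longest_field_per_column) for tab-delimited text.

    `lines` is any iterable of text lines, for example an open file object.
    Lengths are counted in characters, not bytes.
    """
    lines = list(lines)
    # Longest physical line, ignoring the line terminator (a guide for lrecl).
    longest_line = max((len(line.rstrip("\r\n")) for line in lines), default=0)

    # Longest value seen in each column (a guide for the length statement).
    widths = []
    for row in csv.reader(lines, delimiter="\t"):
        for i, field in enumerate(row):
            if i == len(widths):
                widths.append(0)
            widths[i] = max(widths[i], len(field))
    return longest_line, widths


if __name__ == "__main__":
    sample = ["Acme\tabc\tde\n", 'Globex Corp\t"quoted, text"\tx\n']
    print(measure_tab_file(sample))
```

For a real file, pass an object opened with open(path, newline="", encoding=...) instead of the sample list. Any column whose reported width approaches 32,767 will not fit into a single SAS character variable.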

If you have a license for SAS/ACCESS (this link explains how to check), you can use a libname to access the Excel spreadsheet (this link talks about Excel access), and this is a great paper which details how to get at the Excel data just like a SAS data set.
(but @Neil Neyman's answer sounds good too)


Ideas to extract specific invoice pdf data for different formats and convert to Excel

I am currently working on a digitalisation project which consists of extracting specific information from PDF-formatted electricity invoices. Once the data is extracted, I would like to store it in an Excel spreadsheet.
The data to be extracted is shown in this image:
https://i.stack.imgur.com/6RLo2.png
In this case, the data to be extracted is the information outlined in red: the CUPS, the total amount and the electricity consumed per period (P1-P6).
Once this is extracted, I would like to display it in an Excel spreadsheet.
Could you please give me any ideas or tips regarding the extraction of this data? I understand that OCR software would do this best, but I do not know how to extract this specific information.
Thanks for your help and advice.
If there is no text data in your PDF then I don't believe there is a clean and consistent way to do this yet. If your invoice templates are always the same format and resolution, then the pixel coordinates of the text positions should be the same.
This means that you can create a cropped image with only the text you're interested in. Then you can use your OCR tool to extract all the text and you have extracted your data field. You would have to do this for all the data fields that you want to extract.
This would only work for invoices that always have the same format and resolution. So scanned invoices wouldn't work, and dynamic tables make things exponentially more complex as well.
I would first check whether it is possible to simply extract the text with pdftotext, then build my cmd text parsing around that output and loop from file to file.
I don't have your sample to test, so you would need to adjust this to suit your bills:
pdftotext -nopgbrk -layout electric.pdf - |findstr /i "cups factura" & pdftotext -nopgbrk -layout -y 200 -W 300 -H 200 electric.pdf
Personally, I would treat the two parts as separate passes. For the first pair, replace each , with a safe CSV character such as *, then inject a , at the large gap to make a 2-column CSV (perhaps replace the Γé¼ with ,€ if necessary, since your captured text may already be in euros; Γé¼ is how the console renders the UTF-8 € sign).
For the second group, I would probably inject a , at fixed character positions to form the desired columns. I only demo 4 columns by 2 rows, but you want 7 columns by 4 rows, so adjust those values to suit. However, you can use any language you are familiar with, such as VBA, to split the text the way you want before importing it into Excel.
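As an alternative to doing the splitting in cmd, here is a minimal Python sketch of the "inject a comma at the large gap" step. It assumes the pdftotext -layout output has already been captured as text; the function names and the sample lines (including the CUPS value) are invented for illustration. The csv module quotes any field that contains a comma, so the decimal commas do not need to be swapped for * first.

```python
import csv
import io
import re


def layout_to_rows(text, min_gap=2):
    """Split each non-empty line into columns wherever min_gap or more spaces occur."""
    gap = re.compile(r" {%d,}" % min_gap)
    return [gap.split(line.strip()) for line in text.splitlines() if line.strip()]


def rows_to_csv(rows):
    """Serialise rows as CSV text; fields containing commas are quoted automatically."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical stand-in for what pdftotext -layout might print.
    sample = "CUPS      ES0000000000000000XX\nTotal factura      123,45 €\n"
    print(rows_to_csv(layout_to_rows(sample)))
```

Single spaces inside a label such as "Total factura" are kept, because only runs of at least min_gap spaces count as a column break. Raise min_gap if your invoices use wider spacing inside a field.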
In Excel you may want to use PowerQuery to read the pdf:
https://learn.microsoft.com/en-us/power-query/connectors/pdf
Then you can further process to extract the data you want within PowerQuery.
If you are interested in further data analysis after extraction you may want to consider KNIME as well:
https://hub.knime.com/jyotendra/spaces/Public/latest/Reading%20PDF%20and%20extracting%20information~pNh3GdorF0Z9WGm8
From there export to Excel is also supported.
edit:
After extracting, regex helps to filter for the specific data, e.g. by looking for key words, the length and structure of the data item (e.g. the CUPS number), whether it is a currency amount with decimals, etc.
edit 2: regex in Excel
How to use Regular Expressions (Regex) in Microsoft Excel both in-cell and loops
e.g. look for a new line starting with CUPS followed by a sequence of 15 characters (if you have more details, you can make the matching pattern more specific: e.g. starting with E, or the 5th character is X or 5, etc.)
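The same filtering idea can be sketched outside Excel. The Python example below is only an illustration: the patterns, the function name extract_fields and the sample text are assumptions, since the real invoice layout is not available. It looks for a line starting with CUPS followed by at least 15 letters or digits, and for an amount with a decimal comma followed by a € sign.

```python
import re

# Assumed patterns; tighten them once the real invoice layout is known.
CUPS_PATTERN = re.compile(r"^\s*CUPS\W*([A-Z0-9]{15,})", re.MULTILINE)
AMOUNT_PATTERN = re.compile(r"(\d+(?:\.\d{3})*,\d{2})\s*€")


def extract_fields(text):
    """Return the first CUPS code and the first euro amount found, or None for each."""
    cups = CUPS_PATTERN.search(text)
    amount = AMOUNT_PATTERN.search(text)
    total = None
    if amount:
        # Convert "1.234,56" (dot for thousands, comma for decimals) to a float.
        total = float(amount.group(1).replace(".", "").replace(",", "."))
    return {"cups": cups.group(1) if cups else None, "total": total}


if __name__ == "__main__":
    # Hypothetical text, standing in for the extracted invoice content.
    sample = "Factura\nCUPS: ES0000000000000000XX\nTotal factura   1.234,56 €\n"
    print(extract_fields(sample))
```

Taking the first match is a simplification. If an invoice contains several amounts, anchor the amount pattern to a nearby key word instead.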

csv format hide decimal part when is opened with excel

I have a CSV file. One column has a number with 2 decimal digits, like 100,00, but Excel hides the trailing zeros and shows 100. When I open the file with Notepad, it is 100,00.
Welcome to the mad world of CSVs and Excel.
That's one of the problems with CSV files and Excel: trailing zeros aren't shown by default. You can always modify the CSV within Excel and then save it as an *.xls or *.xlsx file.
There is no way to tell Excel to open a CSV and show numbers with all their given digits by default when those digits are zeros, because Excel applies its default "Standard" number format to them (called "General" in English versions).
If you don't need to work with the numbers as numbers, you can always exchange the , with a . (that might depend on the locale, not sure). Or just export the value as a string:
The CSV:
test
="100,00"
13,37
100.00
=100,01
will produce the following output (locale de_DE):
In general, if you need to work with excel and want as much pre-formatted as possible, don't use CSV.
CSV is just comma-separated values. Just values. No formatting hints are included. So you can't have a number's display format specified within the CSV -- the numbers will be displayed in whatever format Excel shows.
If you'd rather, you can change the default display format in Excel (for all sheets, not just CSVs). Not what you're asking, but perhaps it will be your preference. See How to change the default number format in Excel? for details on that option.
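If the CSV is generated by a script, the ="100,00" trick from the first answer can be applied automatically. This is a minimal Python sketch; the helper names are made up, and the semicolon delimiter is an assumption that matches the usual de_DE list separator. The cells arrive in Excel as text, so this only suits values you do not need to calculate with.

```python
import csv
import io


def as_excel_text(value):
    """Wrap a value as ="..." so Excel displays the characters exactly as given."""
    return '="{}"'.format(str(value).replace('"', '""'))


def write_text_csv(rows, delimiter=";"):
    """Return CSV text in which every cell is wrapped with as_excel_text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    for row in rows:
        writer.writerow([as_excel_text(cell) for cell in row])
    return buffer.getvalue()


if __name__ == "__main__":
    print(write_text_csv([["100,00", "13,37"]]))
```

The csv module doubles the inner quotes when it writes the field, which is standard CSV escaping; Excel should undo that on import and see ="100,00" again.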

How Do I separate data written in the same column?

So I've got a bunch of Excel files that hold measurements of electric current versus voltage, with about 2000 points each. The problem is that both values are saved in the same cell. It's supposed to be such that Column A has all the voltages and Column B all the currents. Right now, everything is saved in Column A, both voltages and currents.
I really don't want to separate 2000 points in each of 20 files by hand. Is there a good way to separate them? The good news is that the values are separated by a space, e.g. [1.001 2.002], with 1.001 being a voltage and 2.002 being a current, or by a negative sign if there is one, e.g. [-1.001-2.002], so I feel like a simple program can fix this up. I know how to code in C and MATLAB (also, the goal is to make the data MATLAB-readable), but what's the best way to resolve this? Could it also be done with an Excel macro?
Are you on a unix/linux?
Save the file as CSV.
Run this from the terminal.
$ sed -i .bak "s/ /,/g" my-file.csv
Your original file will be saved as my-file.csv.bak. my-file.csv will now be comma-delimited instead of space-delimited. (This is the BSD/macOS form of the command; with GNU sed, write -i.bak without the space.)
Open the CSV in excel.
One option would be to open the sheet in Power Query and split the column based on a delimiter. Assuming the values are separated by something (a space, comma, etc.), this will give you two new columns. In Query Editor, go to Split Column and then By Delimiter. Source: https://support.office.com/en-au/article/Split-a-column-of-text-Power-Query-5282d425-6dd0-46ca-95bf-8e0da9539662?ui=en-US&rs=en-AU&ad=AU#__toc354843578
Edit: If you don't have Power Query, it's a free add-on for Excel: https://www.microsoft.com/en-gb/download/details.aspx?id=39379
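Neither approach above handles the case where the only separator is the minus sign of the second value, as in -1.001-2.002. A small script can cover both cases by matching the numbers themselves rather than the separator. This Python sketch is an illustration only: the function names are made up, and it assumes every number starts with a digit after its optional sign.

```python
import csv
import io
import re

# A signed decimal number, optionally with an exponent (e.g. 1.5e-3).
NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?")


def split_pair(cell):
    """Split one cell such as '1.001 2.002' or '-1.001-2.002' into (voltage, current)."""
    values = [float(match) for match in NUMBER.findall(cell)]
    if len(values) != 2:
        raise ValueError("expected two numbers in %r" % cell)
    return values[0], values[1]


def pairs_to_csv(cells):
    """Return two-column CSV text (voltage, current) that MATLAB can read."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(split_pair(c) for c in cells)
    return buffer.getvalue()


if __name__ == "__main__":
    print(pairs_to_csv(["1.001 2.002", "-1.001-2.002"]))
```

Feed it the contents of Column A, for example the lines of the file after saving the sheet as CSV, and write the result to a new file per measurement.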

How to make excel treat text as string in Clojure using data.csv?

I am using data.csv to write export data to a CSV file. However, I have some alphanumeric fields which are ids, and since they consist only of digits, Excel treats them as doubles and shows them in exponential form.
Is there a way to tell Excel to treat them as they are?
Excel displays long numbers in csv files in an abbreviated form with exponents.
Unfortunately there is no way to disable that functionality from within the generated csv.
Sending it in as text also shows the same abbreviated format. Your choices are:
1) Assuming the id number has fewer than 16 digits, you can go into Excel and change the format.
2) Alternatively, you can prepend an apostrophe or text character to your ids before you generate the CSV. For example:
(ns sample.core
  (:use [clojure.data.csv]
        [clojure.java.io]))

(defn gen-csv [filename]
  (with-open [out-file (writer filename)]
    (write-csv out-file
               [["'123000000" "'45612333"]
                ["'789909990" "'90099999124"]])))

How can I prevent leading zeroes from being stripped from columns in my CSV?

I need to upload a CSV file, but my text data 080108 keeps converting to a number 80108. What do I do?
Use Quoted CSV
Use quotes in your CSV so that the column is treated as a string instead of an integer. For example:
"Foo","080108","Bar"
I have seen this often when viewing the CSV in Microsoft Excel. Excel strips leading zeros from all number fields. If the output is not coming from Excel, check it in Notepad. If it is coming from Excel, you need to add a single quote mark ' to the beginning of each zip code; Excel will then retain the leading 0.
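If the CSV is produced by a script, quoting every field is usually a single option. Here is a minimal Python sketch using the standard csv module; the function name to_quoted_csv is made up. Whether the receiving system then treats quoted fields as strings depends on that system, so check a small upload first.

```python
import csv
import io


def to_quoted_csv(rows):
    """Return CSV text with every field wrapped in double quotes."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Keep the value as a string; an integer 80108 has already lost its zero.
    print(to_quoted_csv([["Foo", "080108", "Bar"]]))
```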
