How to parse text file with 2 delimiters

I have a pipe-delimited text file that also has a start-of-record marker (START_OF_LINE). The values are enclosed in single quotes, and line breaks are expected in the 5th field. Note that values containing line breaks are still enclosed in single quotes.
Does Excel have a native way to handle this? As far as I know, Excel can only take a custom delimiter. It's the START_OF_LINE marker that is causing the issue.
Sample screen shot of desired output, followed by input, followed by input as text.
|'START_OF_LINE'|'Key 1'|'Key 2'|'Key 3'|'text1
text2
text3
text4
text5'|'Date'|'END_OF_LINE'|'ID 1'|'ID 2'|'ID 3'|'ID 4'|'ID 5'|
|'START_OF_LINE'|'Key 1'|'Key 2'|'Key 3'|'text5
text6
text7
text8
text9'|'Date'|'END_OF_LINE'|'ID 1'|'ID 2'|'ID 3'|'ID 4'|'ID 5'|
I'm sure this can be hacked together with some tedious VBA, but I am really hoping there is a better way to do this before I start writing code. I just have no idea how to handle the field with line breaks using native functionality in Excel.

The structure looks consistent. I ran the Notepad++ find and replace function on the text, and it seems to deliver what you need.
Copy the text above and paste it into Notepad++, then replace "|\r\n|" with "|·|" **
Then replace "\n" with "\n||||"
Then replace "|·|" with "|\n"
Remove the first "|"
Copy and paste into Excel, with "|" as the delimiter.
Done.
** [Note: \r may not appear in the original file; it is introduced by the copy and paste. Omit it if it does not apply.]
If all of the above can be expressed as a regex, then it is just a line of code away. ( :
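The manual passes above can also be sketched as a small script instead of Notepad++ replacements. This is a minimal sketch, assuming Python is available and that single quotes never occur inside a value; `parse_records` is a hypothetical helper name, not something from the question. Python's standard `csv` module accepts a custom delimiter and quote character, and it keeps line breaks that fall inside a quoted value.

```python
import csv
import io


def parse_records(text):
    """Split pipe-delimited, single-quoted text into rows of fields."""
    # newline="" hands line breaks to the csv module untouched, so breaks
    # inside a quoted value stay part of that value.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter="|", quotechar="'")
    rows = []
    for row in reader:
        if not row:  # skip blank lines
            continue
        # Each record starts and ends with "|", which produces an empty
        # first and last field; drop both.
        rows.append(row[1:-1])
    return rows


# Shortened sample in the same shape as the input shown in the question.
sample = (
    "|'START_OF_LINE'|'Key 1'|'text1\ntext2'|'Date'|'END_OF_LINE'|\n"
    "|'START_OF_LINE'|'Key 2'|'text3\ntext4'|'Date'|'END_OF_LINE'|\n"
)
for fields in parse_records(sample):
    print(fields)
```

Each row can then be written back out with `csv.writer` using a comma and double quotes, which gives a file that spreadsheet tools are more likely to accept.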

Related

Importing CSV can't split correctly

I am importing a CSV into Excel. The CSV file looks like this if I open it in Notepad:
"Name",”id”,”comment”,”date””Mike”,”123”,”NA------save notes above this line------“,”01/03/2018””Jane”,”278”,”ANS----save notes above this line”,”01/02/2017”
Now if I open it in Word and check where the breaks are, they appear at the end of each line correctly, but there are also breaks in the comment field. I load the whole file into a variable called whole_file and then split it like this:
lines = Split(whole_file, vbCrLf)
It splits correctly on the first line, as these are the headers, but on the following lines it also splits at the carriage returns in the comments, which I don't want. If I remove Chr(13) & Chr(10), then the above split will not work. My question is: what can I do to prepare my CSV so that these carriage returns are removed from the comments, or is there a way to split each line at the quotation marks which do not have an apostrophe in between?
Thanks

Use the built-in wizard to clarify the breaks in your data. Like comma, space, tab etc.

Importing tab-delimited text file into openrefine

I have a medium-sized tab-delimited .txt file - about 40k lines. When I import to Openrefine, line 406 puts all the rest of the content - the whole 40,000 lines, into a single cell in column 13 of that line.
I've tried grep-searching the invisibles in two different text editors (Sublime Text 2 & TextWrangler), and everything looks like it should.
I've also tried using Excel to convert to csv, and that actually works, but:
it's an inelegant workaround,
it has trouble with diacriticals, and
I don't want to spend any more time resolving it in Excel anyway
I tried excerpting the offending line with 10 lines on either side, and that throws the same problem.
Here are those 21 lines, copied directly from TextWrangler. (I can copy from Terminal output if that makes any difference.)
Any help, as always, is very much appreciated!!

Solved It! Well, sort of. It turns out that Column 13 had text that included double-quotes within the text itself (In other words, not having to do with delimiters at all).
For now, I'm just going to remove those quotes in the entire file, which does work - I tested it. I'd rather figure out how to keep the quotes as part of the text. Tried escaping them with /, but that didn't work.
Thanks SO Community. Especially @Ettore.

I see. The problem is related to the quotations marks. Try importing your file by unchecking "Quotation marks are used to enclose cells containing column separators".
The empty columns in my screenshot are due to the fact that your file sometimes has two or three tabs as a separator. You can delete them easily after import using "reorder / remove columns"
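For readers preprocessing such a file outside OpenRefine, the same idea (treat double quotes as ordinary text rather than as a text qualifier) can be sketched in Python. This is an illustration only, assuming a tab-delimited file whose fields never contain tabs or line breaks; `read_tsv_keep_quotes` is a hypothetical name and the sample rows are made up.

```python
import csv
import io


def read_tsv_keep_quotes(text):
    """Parse tab-delimited text, treating double quotes as plain characters."""
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # no character acts as a text qualifier
    )
    return list(reader)


# Made-up sample: quotes appear inside the text, one of them unbalanced.
sample = 'id\ttitle\n406\tthe "quoted" word\n407\tan unbalanced " mark\n'
for row in read_tsv_keep_quotes(sample):
    print(row)
```

Because no quote character is active, an unbalanced quote cannot swallow the rest of the file into one cell.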

.csv file seems to have a hidden delimiter - recognized by Excel but not Notepad (and other programs)

I have received a .csv file.
When I open this file with Notepad, all of the information is displayed on one line:
Email;Cityjohnsmith#live.com;New York
However, when I open the file with MS Excel, it displays the information correctly. How can I recognize the delimiter character? Because the third program that is supposed to read this file is not able to recognize the delimiter.

So the problem appears to be that your CSV isn't comma-delimited.
Judging by your Notepad copy, the data is delimited by the semicolon (;) rather than the typical comma (,). Notepad simply displays the raw text, whereas MS Excel tries, and succeeds, to find a common delimiting character in the file and lays out the data accordingly. That is why the two show different results.
You may be well served by either A) writing your code to recognize the semicolon as the delimiter instead of the comma, or B) using one of your tools to replace the semicolons with commas.

.csv originally referred to comma-separated values (CSV). However, any character may be used to separate the values; the most common delimiters are the comma, tab, semicolon and colon. If the data is generated by another application, you might need to accept semicolons as delimiters.
I'm not sure I'd write code for the problem as you describe. If I was forced to code it I'd write a short awk script to remove hidden (i.e. non-printing) characters.
I use two tools for csv issues. 010 Editor, from SweetScape Software Inc., will show you the file in hex, so you can see any non-displayable characters. The other, Delimit, from delimitware.com, is great for showing columns. In my opinion, 010 Editor will make your problem (and solution) obvious.
Here is a sample awk script that injects non-printing characters into your text. It then uses a regular expression to remove the non-printing characters.
BEGIN {
    t = sprintf("%s\a%s\v%s", "Email;", "Cityjohnsmith#live.", "com;New York");
    print "Input :", t;
    gsub(/[^\x20-\x7E]/, "", t);
    print "Result:", t;
}
To run the above code, use the following command:
awk -f xx.awk
where the above code is put in a text file called xx.awk.
The regex /[^\x20-\x7E]/ identifies all characters that are not printable (i.e. not between 'space' and tilde in ASCII).
The awk gsub statement searches for all characters meeting the regex and removes them.
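As an alternative to a hex editor, the offending bytes can be listed with a few lines of Python before deciding what to remove. This is a minimal sketch that applies the same 0x20 to 0x7E range as the regex above; `find_nonprinting` is a hypothetical helper, and the sample bytes are made up for illustration.

```python
def find_nonprinting(data):
    """Return (offset, byte value) pairs for bytes outside 0x20-0x7E."""
    return [(i, b) for i, b in enumerate(data) if not 0x20 <= b <= 0x7E]


# Made-up sample: a bare line feed hides between "City" and "john".
sample = b"Email;City\njohnsmith#live.com;New York"
for offset, value in find_nonprinting(sample):
    print(f"offset {offset}: 0x{value:02X}")
```

For a real file, the bytes would come from `open(path, "rb").read()`, so that nothing is decoded or translated before inspection.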

Inserting multiline text in a csv field

I want to insert a multiline text data in a CSV field.
Ex:
var data = "\nanything\nin\nthis\nfield";
var fields = "\"Datafield1\",\"Datafield2:"+data+"\"\n";
When I save fields into a csv file and open it using MS Excel, I get to see only the first column. But when I open the file using a text editor I see:
"Datafield1","Datafield2:
anything
in
this
field"
I don't know whether I am going against CSV standards. Even if I am, please help me with a workaround.
Thanks...

Depending on your regional settings, MS Excel may expect a semicolon as the separator. Use ; instead of , and the fields should land in separate columns.

Here I place some text followed by the NewLine char followed by some more text and the
whole string MUST be quoted into a field in a csv file.
Do not use a CR since EXCEL will place it in the next cell.
""2" + NL + "DATE""
When you invoke EXCEL, you will see this. You may have to auto size the height to see the entire cell.
2
DATE
Here's the code in Basic
CHR$(34,"2", 10,"DATE", 34)

Importing CSV with line breaks in Excel 2007

I'm working on a feature to export search results to a CSV file to be opened in Excel. One of the fields is a free-text field, which may contain line breaks, commas, quotations, etc. In order to counteract this, I have wrapped the field in double quotes (").
However, when I import the data into Excel 2007, set the appropriate delimiter, and set the text qualifier to double quote, the line breaks are still creating new records at the line breaks, where I would expect to see the entire text field in a single cell.
I've also tried replacing CR/LF (\r\n) with just CR (\r), and again with just LF (\n), but no luck.
Has anyone else encountered this behavior, and if so, how did you fix it?
TIA,
-J
EDIT:
Here's a quick file I wrote by hand to duplicate the problem.
ID,Name,Description
"12345","Smith, Joe","Hey.
My name is Joe."
When I import this into Excel 2007, I end up with a header row, and two records. Note that the comma in "Smith, Joe" is being handled properly. It's just the line breaks that are causing problems.

Excel (at least in Office 2007 on XP) can behave differently depending on whether a CSV file is imported by opening it from the File->Open menu or by double-clicking on the file in Explorer.
I have a CSV file that is in UTF-8 encoding and contains newlines in some cells. If I open this file from Excel's File->Open menu, the "import CSV" wizard pops up and the file cannot be correctly imported: the newlines start a new row even when quoted. If I open this file by double-clicking on it in an Explorer window, then it opens correctly without the intervention of the wizard.

None of the suggested solutions worked for me.
What actually works (with any encoding):
Copy/paste the data from the csv-file (open in a text editor), then perform "text to columns" --> data gets transformed incorrectly.
The next step is to go to the nearest empty column or empty worksheet and copy/paste again (the same thing you already have in your clipboard) --> automagically works now.

If you are doing this manually, download LibreOffice and use LibreOffice Calc to import your CSV. It does a much better job of stuff like this than any version of Excel I've tried, and it can save to XLS or XLSX as required if you need to transfer to Excel afterwards.
But if you're stuck with Excel and need a better fix, there seems to be a way. It seems to be locale dependent (which seems idiotic, in my humble opinion). I don't have Excel 2007, but I have Excel 2010, and the example given:
ID,Name,Description
"12345","Smith, Joe","Hey.
My name is Joe."
doesn't work. I wrote it in Notepad and chose Save as..., and next to the Save button you can choose the encoding. I chose UTF-8 as suggested, but with no luck. Changing the commas to semicolons worked for me, though. I didn't change anything else, and it just worked. So I changed the example to look like this, and chose the UTF-8 encoding when saving in Notepad:
ID;Name;Description
"12345";"Smith, Joe";"Hey.
My name is Joe."
But there's a catch! The only way it works is if you double-click the CSV file to open it in Excel. If I try to import data from text and choose this CSV, then it still fails on quoted newlines.
But there's another catch! The working field separator (comma in the original example, semicolon in my case) seems to depend on the system's Regional Settings (set under Control Panel -> Region and Language). In Norway, the comma is the decimal separator. Excel seems to avoid this character and prefer a semicolon instead. I have access to another computer set to the UK English locale, and on that computer, the first example with a comma separator works fine (only on double-click), and the one with the semicolon actually fails! So much for interoperability. If you want to publish this CSV online and users may have Excel, I guess you have to publish both versions and suggest that people check which file gives the correct number of rows.
So all the details that I've been able to gather to get this to work are:
The file must be saved as UTF-8 with a BOM, which is what Notepad does when you choose UTF-8. I tried UTF-8 without BOM (can be switched easily in Notepad++), but then double-clicking the document fails.
You must use a comma or a semicolon separator, but not the one that is the decimal separator in your Regional Settings. Perhaps other characters work, but I don't know which.
You must quote fields that contain a newline with the " character.
I've used Windows line-endings (\r\n) both in the text field and as a record separator, that works.
You must double-click the file to open it, importing data from text doesn't work.
Hope this helps someone.
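The encoding and separator points in the list above can be sketched in Python for anyone generating the file programmatically. This is a sketch under assumptions, not a guaranteed recipe: Python's `utf-8-sig` codec writes the BOM, `write_excel_csv` and its `delimiter` parameter are hypothetical names, and the right separator still depends on the regional settings of whoever opens the file.

```python
import csv
import os
import tempfile


def write_excel_csv(path, rows, delimiter=";"):
    """Write rows as UTF-8 with a BOM, quoting every field."""
    # utf-8-sig prepends the BOM; newline="" lets the csv module emit
    # \r\n between records itself.
    with open(path, "w", encoding="utf-8-sig", newline="") as handle:
        writer = csv.writer(handle, delimiter=delimiter, quoting=csv.QUOTE_ALL)
        writer.writerows(rows)


rows = [
    ["ID", "Name", "Description"],
    ["12345", "Smith, Joe", "Hey.\r\nMy name is Joe."],
]
# Temporary location used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "example.csv")
write_excel_csv(path, rows)
with open(path, "rb") as handle:
    raw = handle.read()
print(raw[:3])  # the three BOM bytes
```

Passing `delimiter=","` produces the variant for locales where the comma is not the decimal separator.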

I have finally found the problem!
It turns out that we were writing the file using Unicode encoding, rather than ASCII or UTF-8. Changing the encoding on the FileStream seems to solve the problem.
Thanks everyone for all your suggestions!

Use Google Sheets and import the CSV file.
Then you can export that to use in Excel

Short Answer
Remove the newline/linefeed characters (\n with Notepad++). Excel will still recognise the carriage return character (\r) to separate records.
Long Answer
As mentioned newline characters are supported inside CSV fields but Excel doesn't always handle them gracefully. I faced a similar issue with a third party CSV that possibly had encoding issues but didn't improve with encoding changes.
What worked for me was removing all newline characters (\n). This has the effect of collapsing fields to a single record assuming that your records are separated by the combination of a carriage return and a newline (CR/LF). Excel will then properly import the file and recognise new records by the carriage return.
Obviously a cleaner solution is to first replace the real newlines (\r\n) with a temporary character combination, then replace the remaining newlines (\n) with your separating character of choice (e.g. comma in a semicolon file), and then replace the temporary characters with proper newlines again.

If the field contains a leading space, Excel ignores the double quote as a text qualifier. The solution is to eliminate leading spaces between the comma (field separator) and double-quote. For example:
Broken:
Name,Title,Description
"John", "Mr.", "My detailed description"
Working:
Name,Title,Description
"John","Mr.","My detailed description"
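If the file cannot be regenerated at the source, the stray spaces can be stripped in a preprocessing pass. This is a minimal sketch in Python, assuming the spaces only ever follow a comma; `strip_leading_spaces` is a hypothetical name. The `skipinitialspace` option of the standard `csv` reader ignores whitespace right after a delimiter, so the double quotes are recognized as qualifiers again.

```python
import csv
import io


def strip_leading_spaces(text):
    """Re-emit CSV text without spaces between a comma and an opening quote."""
    reader = csv.reader(io.StringIO(text, newline=""), skipinitialspace=True)
    out = io.StringIO(newline="")
    # QUOTE_ALL quotes every field, including the header names.
    writer = csv.writer(out, quoting=csv.QUOTE_ALL)
    writer.writerows(reader)
    return out.getvalue()


broken = 'Name,Title,Description\n"John", "Mr.", "My detailed description"\n'
print(strip_leading_spaces(broken))
```

Note that this sketch quotes every field on output, so the header row comes back quoted as well.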

If anyone stumbling across this thread is looking for a definitive answer, here goes (credit to the person mentioning LibreOffice):
1) Install LibreOffice
2) Open Calc and import file
3) My txt file had the fields separated by , and character fields enclosed in "
4) save as ODS file
5) Open ODS file in Excel
6) Save as .xls(x)
7) Done.
8) This worked perfectly for me and saved me BIGTIME!

+1 on J Ashley's comment. I ran into this problem also. It turns out that Excel requires:
A newline character("\n") in the quoted string
A carriage return and newline between each row.
E.g.
"Test", "Multiline item\n
multiline item"\r\n
"Test2", "Multiline item\n
multiline item"\r\n
I used Notepad++ to delimit each row properly and to only use newlines in the string. I discovered this by creating multiline entries in a blank Excel doc and opening the resulting csv in Notepad++.
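The layout described above (a bare line feed inside the quoted field, carriage return plus line feed between rows) can also be produced in code. This is a minimal sketch in Python, with `to_excel_csv` as a hypothetical helper name; it normalizes any line break inside a cell to a single line feed and sets the `csv` writer's row terminator to `\r\n` explicitly.

```python
import csv
import io


def to_excel_csv(rows):
    """Serialize rows with LF inside cells and CRLF between rows."""
    out = io.StringIO(newline="")
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\r\n")
    for row in rows:
        # Collapse CRLF or bare CR inside a cell to a single LF.
        cleaned = [cell.replace("\r\n", "\n").replace("\r", "\n") for cell in row]
        writer.writerow(cleaned)
    return out.getvalue()


text = to_excel_csv([["Test", "Multiline item\r\nmultiline item"]])
print(repr(text))
```

When saving the result to disk, opening the file with `newline=""` keeps Python from translating the line endings a second time.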

Overview
Almost 10 years after the original post, Excel hasn't improved in importing CSV files. However, I found that it is much better in importing HTML tables. So, one can use Python to convert CSV to HTML and then import the resulting HTML to Excel.
The advantages of this approach are: (a) it works reliably, (b) you don't need to send your data to a third party service (e.g. Google sheets), (c) no extra "fat" installations required (LibreOffice, Numbers etc.) for most users, (d) higher level than meddling with CR/LF characters and BOM markers, (e) no need to fiddle with locale settings.
Steps
The following steps can be run on any bash-like shell as long as Python 3 is installed. Although Python can be used to directly read CSV, csvkit is used to do an intermediate conversion to JSON. This allows us to avoid having to deal with CSV intricacies in our Python code.
First, save the following script as json2html.py. The script reads a JSON file from stdin and dumps it as an HTML table:
#!/usr/bin/env python3
import sys, json, html

if __name__ == '__main__':
    header_emitted = False
    # Cell and row builders; empty or null values become empty cells.
    make_th = lambda s: "<th>%s</th>" % (html.escape(s if s else ""))
    make_td = lambda s: "<td>%s</td>" % (html.escape(s if s else ""))
    make_tr = lambda l, make_cell: "<tr>%s</tr>" % ( "".join([make_cell(v) for v in l]) )
    print("<html><body>\n<table>")
    for line in json.load(sys.stdin):
        # Each JSON object is one row: keys are column names, values are cells.
        lk, lv = zip(*line.items())
        if not header_emitted:
            print(make_tr(lk, make_th))
            header_emitted = True
        print(make_tr(lv, make_td))
    print("</table>\n</body></html>")
Then, install csvkit in a virtual environment and use csvjson to feed the input file to our script. It is a good idea to disable cell type guessing with the -I argument:
$ virtualenv -p python3 pyenv
$ . ./pyenv/bin/activate
$ pip install csvkit
$ csvjson -I input.csv | python3 json2html.py > output.html
Now output.html can be imported in Excel. Line breaks in cells will have been preserved.
Optionally, you may want to clean up your Python virtual environment:
$ deactivate
$ rm -rf pyenv

Multiline CSV can be imported easily in Excel versions with Power Query using the following steps (tested in Excel 365 version 2207):
Go to Data-tab
Click "From Text/CSV" on the ribbon
Select file and click Import
Click "Transform Data" to open Power Query Editor
Click "Data source settings" on the Power Query Editor ribbon
Click "Change Source"
Select "Ignore quoted line breaks" from the "Line breaks" dropdown.
Click OK -> Close -> Close & Load

Paste into Notepad++, select Encoding > Encode in ANSI, copy all again and paste into Excel :)

I had a similar problem. I had some Twitter data in MySQL. The data had line feeds (LF or \n) within it. I needed to export the MySQL data to Excel, and the LFs were messing up my import of the CSV file. So I did the following:
1. From MySQL exported to CSV with Record separator as CRLF
2. Opened the data in notepad++
3. Replaced CRLF (\r\n) with some string I am not expecting in the Data. I used ###~###! as replacement of CRLF
4. Replaced LF (\n) with Space
5. Replaced ###~###! with \r\n, so my record separator are back.
6. Saved and then imported into Excel
NOTE: While replacing CRLF or LF, don't forget to check the Extended (\n, \r, \t, ...) checkbox (look at the bottom left of the dialog box).
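Steps 3 to 5 above can be sketched as a small Python function for files that are too large or too numerous to fix by hand. This is an illustration, assuming records end in CRLF and embedded breaks are bare LF; `flatten_embedded_lf` is a hypothetical name, and the placeholder is the same string used in the steps.

```python
def flatten_embedded_lf(text, replacement=" "):
    """Replace bare LF inside records while keeping CRLF record separators."""
    placeholder = "###~###!"
    if placeholder in text:
        # The trick only works if the placeholder never occurs in the data.
        raise ValueError("placeholder already present in the data")
    text = text.replace("\r\n", placeholder)   # protect record separators
    text = text.replace("\n", replacement)     # flatten embedded line feeds
    return text.replace(placeholder, "\r\n")   # restore record separators


sample = '"1","first line\nsecond line"\r\n"2","single line"\r\n'
print(repr(flatten_embedded_lf(sample)))
```

The file should be read and written with `open(path, newline="")` so that Python leaves the CRLF and LF characters exactly as they are on disk.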

My experience with Excel 2010 on WinXP with French regional settings
the separator of your imported csv must correspond to the list separator of your regional settings (; in my case)
you must double click on the file from the explorer. don't open it from Excel

On MacOS try using Numbers
If you have access to Mac OS I have found that the Apple spreadsheet Numbers does a good job of unpicking a complex multi-line CSV file that Excel could not handle. Just open the .csv with Numbers and then export to Excel.

Excel is incredibly broken when dealing with CSVs. LibreOffice does a much better job. So, I found out that:
The file must be encoded in UTF-8 with BOM, so consider this for all the points below
The best result, by far, is achieved by opening it from File Explorer
If you open it from within Excel there are two possible outcomes:
If it has only ASCII characters, it will most likely work
If it has non-ASCII characters, it will mess your line breaks
It seems to be heavily dependent on the decimal separator configured in the
OS's regional settings, so you have to select the right one
I would bet that it may also behave differently depending on OS and
Office version

This is for Excel 2016:
Just had the same problem with line breaks inside a csv file with the Excel Wizard.
Afterwards I was trying it with the "New Query" Feature:
Data -> New Query -> From File -> From CSV -> Choose the File -> Import -> Load
It was working perfectly and a very quick workaround for all of you that have the same problem.

With Excel 2019 I had a similar problem when working with CSV files via Data -> Import from text file / CSV. Once the connection is made and the data is synced, it reported xx error(s) because of shifted columns caused by the line breaks.
I managed to solve this by
Edit the query (Query -> Edit)
This opens the Power Query Editor
Go to Start -> Advanced Editor
This opens the query in text format, where line #2 had an instruction like
Source = Csv.Document(File.Contents("my.csv"),[Delimiter=",", .... , QuoteStyle=QuoteStyle.None]),
Change QuoteStyle.None to QuoteStyle.Csv
Click Finish
Apply & close
Documentation found here: https://learn.microsoft.com/en-us/powerquery-m/csv-document
NB. I since found where this is "hidden" in the UI. In the Power Query-editor, click Data source settings, Change source (bottom left), and the Line breaks combo should say Ignore line breaks between quotes.
NB2. Working from Dutch Excel here so my above-mentioned translations of button captions etc. may be a little off.

What just worked for me: importing into Excel directly, provided that the import is done as text format instead of CSV format.
M/

Just create a new sheet with cells that contain line breaks, save it as CSV, then open it with an editor that can show the end-of-line characters (like Notepad++). By doing that you will notice that a line break in a cell is coded with LF, while a "real" end of line is coded with CR LF. Voilà, now you know how to generate a "correct" CSV file for Excel.

I also had this problem: ie., csv files (comma delimited, double quote delimited strings) with LF in quoted strings. These were downloaded Square files. I did a data import but instead of importing as text files, imported as "from HTML". This time it ignored the LF's in the quoted strings.

This worked on Mac, using csv and opening the file in Excel.
Using python to write the csv file.
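data = '"first line of cell a1\r 2nd line in cell a1\r 3rd line in cell a1","cell b1","1st line in cell c1\r 2nd line in cell c1"\n"first line in cell a2"\n'
# The file name is illustrative; newline="" writes \r and \n exactly as given.
with open("output.csv", "w", newline="") as file:
    file.write(data)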

In my case opening CSV in notepad++ and adding SEP="," as the first line allows me open CSV with line breaks and utf-8 in Excel without issues

Replace the separator with TAB(\t) instead of comma(,).
Then open the file in your editor (Notepad etc.), copy the content from there, then paste it in the Excel file.

It appears that this is much easier in more recent versions of Excel:
Go to "Data" -> "Get Data (Power Query)"
In the dialogue that opens, select "Text / CSV" on the right
Search for the file, then click "Next" and follow the recommendations (in my case, Excel now correctly realized that it's UTF-8, that cells were separated by ";" and that the text identifier was the double quote (")).
You're done!
This took a little moment to load but afterwards I had an auto-formatted table that looked really nice and that did understand that multi-line entries were still part of the same entry.
If you want the multi-lines to show correctly, simply format the cells and under "Alignment", click the checkbox for "Wrap text". That should solve the last of your issues.
Good luck! ;-)
