Comma separators in Fortran - io

I have run into the following issue with Fortran: when reading a character array (or any list, really) from a data file with fmt=*, both blanks outside quotes AND commas are treated as delimiters between the elements of the array/list. The fact that commas act as delimiters is a big problem for me.
So the question is: do you know of any option or compiler directive in Fortran that makes commas in input files be treated as ordinary characters rather than delimiters,
with blanks as the only delimiters? As a specific example, I would like reading a record like:
x,y,z
with:
read (7,*) adummy
to result in adummy (a scalar character variable) getting the value x,y,z, not x.
Any help would be most welcome.

The solution is to specify a format that matches your data record, i.e. use the character (A) edit descriptor in the format:
read(7,fmt='(A)')adummy
This will result in adummy having the value x,y,z, assuming it is a variable of sufficient length.
However, this method does not treat blanks as delimiters either. If you want commas to be read as ordinary characters while blanks still act as delimiters, the common approach is to read the whole record into a character variable and split it into separate variables afterwards.

Related

What is the difference between these two tab-delimited .txt files that is causing .split("\t") to properly separate values from one but not the other?

I have two Japanese word frequency reports that were compiled from different sources. Each line contains a word and its number of occurrences, delimited by tabs. I also have a python script that is supposed to split each line into those two values using .split("\t"). The latter value is then converted into an integer, which is where the error is coming from:
ValueError: invalid literal for int() with base 10: '\ufeff29785713'
This is only occurring for data from the second file.
Upon testing to see if converting the number to a float would work (or change the error), the result was this:
ValueError: could not convert string to float: '\ufeff29785713'
Is this a result of the tabs or numerals in the second file perhaps not technically being the same character and not delimiting properly, causing unwanted characters in the latter value (or perhaps not splitting at all)? Both files are UTF-8 encoded.
Shorter version of first file (working)
Shorter version of second file
Honestly, I'm not a Python dev at all, but given that your second array element starts with a stray '\ufeff' character (U+FEFF, the Unicode byte order mark), you could try removing it after you split and before you convert to a number:
x[1] = x[1].replace('\ufeff', '')
where x is the name of the list you split your line into. The replace operation will have no effect on the first file, because U+FEFF is not present there.
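An alternative to patching each value is to strip the byte order mark while decoding. The sketch below is only an illustration under assumptions: the sample bytes are made up (a count followed by a tab and a word), and it relies on Python's standard "utf-8-sig" codec, which drops a leading BOM if one is present and otherwise behaves like plain UTF-8.

```python
import codecs

# Hypothetical sample: one tab-delimited line from a UTF-8 file that starts with a BOM.
raw = codecs.BOM_UTF8 + "29785713\tの\n".encode("utf-8")

# Plain "utf-8" keeps the BOM as the character '\ufeff' at the start of the text.
plain = raw.decode("utf-8")

# "utf-8-sig" removes a leading BOM if there is one and is harmless if there is none.
clean = raw.decode("utf-8-sig")

count_text, word = clean.rstrip("\n").split("\t")
count = int(count_text)  # no ValueError, because the BOM is gone
```

When reading from disk, the same codec can be passed to open, for example open(path, encoding="utf-8-sig"), so that both files go through one code path.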

Most efficient way to store a string in bytes?

Say I have a simple bytecode-like file format for saving data.
If I want to store a string, should I do it like in source files, where all the characters between a pair of delimiter bytes make up the string,
or should I first store the length of the string and then the string bytes?
Or are both solutions horrible, and if so, what should I use instead?
It depends on whether you want to store:
a single string
a number of strings
different length strings
all the same length
For all of the above, it may also matter if your strings contain:
any characters
only certain characters
formatting
In general, you should use Unicode.
For a single string, you simply can use an entire file to contain the string, the end-of-file will be the same as the end of string. No need to store the length of the string.
If the strings aren't all (around) the same length you can use an inline separator to separate the strings. Often the newline character is useful for this (especially since a lot of programming languages support this way of reading in a file line-by-line), but other markers such as tab are common.
CSV text files often use double quotes to enclose strings that contain commas (or other column separator) (which would otherwise indicate the next column value was starting), or line-breaks (which would otherwise indicate the next row).
Of course, now you have the problem of how to store a double quote in your string.
If you want to store formatting, you can use a markup language (html) or it may be enough to allow for line breaks and/or some markdown.
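For comparison, the length-prefix option from the question avoids the escaping problem entirely, because no byte value is reserved as a separator. The following is a minimal sketch, not a recommended file format: the function names are hypothetical, and the choice of a 4-byte little-endian length and UTF-8 encoding is an assumption made for the example.

```python
import struct

def pack_strings(strings):
    """Serialize strings as [4-byte little-endian length][UTF-8 bytes] records."""
    out = bytearray()
    for s in strings:
        data = s.encode("utf-8")
        out += struct.pack("<I", len(data))  # length in bytes, not characters
        out += data
    return bytes(out)

def unpack_strings(blob):
    """Read back the records written by pack_strings."""
    strings = []
    offset = 0
    while offset < len(blob):
        (length,) = struct.unpack_from("<I", blob, offset)
        offset += 4
        strings.append(blob[offset:offset + length].decode("utf-8"))
        offset += length
    return strings

# Strings containing quotes, commas and newlines need no escaping.
samples = ['say "hi"', "a,b,c", "two\nlines", ""]
restored = unpack_strings(pack_strings(samples))
```

The trade-off is that the file is no longer readable in a text editor, and a reader has to walk the records in order to find the n-th string.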

Erase characters from a string until a specific character

Python 3.4
I've got an Excel file with some messy organizing, but one thing is for sure:
I need EVERYTHING except the stuff that appears before the very first comma in every single line, the comma included.
Example:
Print command of the file gives me this:
Word1 Funky,Left Side,UDLRDURLUDRUDLUR
Nothing (because not) exists lol extraline,Right
Side,RBRGBRGBRGRBGRBGBR
What I want to get is this:
Left Side,UDLRDURLUDRUDLUR
Right Side,RBRGBRGBRGRBGRBGBR
I'd also like to make that into a dictionary:
dictionary = {"Left Side":"UDLRDURLUDRUDLUR", "Right Side":"RBRGBRGBRGRBGRBGBR",}
So basically I want to get rid of everything until the first comma (comma included), make the second part the key (ends at second comma), and third part the value (line ends with value).
What would be the easiest way to execute this?
Suppose s contains the string to be examined:
s = "word1,Left Side,UDLRDURLUDRUDLUR"
There are a number of ways to get rid of everything up to and including the first comma. You can use
Slicing coupled with find: s[s.find(',')+1:]
This expression will yield the desired result if the string s contains at least one comma, but it will yield the entire string if s does not contain any commas (find returns -1 in that case).
Split coupled with indexing: s.split(',',1)[1]
This expression will yield the desired result if the string s contains at least one comma, but it will raise IndexError if s does not contain any commas.
Regular expressions, but that's overkill here.
Other techniques, but those are also overkill here.
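The dictionary part of the question can be handled with the same split idea, using a maximum of two splits so that the first field is dropped, the second becomes the key, and the rest becomes the value. This is a sketch under assumptions: the function name build_dictionary is made up, each record is assumed to sit on a single line, and lines with fewer than two commas are simply skipped.

```python
def build_dictionary(lines):
    """Map the second comma-separated field to the remainder of each line."""
    result = {}
    for line in lines:
        parts = line.strip().split(",", 2)  # at most three pieces
        if len(parts) < 3:
            continue  # assumption: ignore lines without two commas
        _, key, value = parts
        result[key] = value
    return result

lines = [
    "Word1 Funky,Left Side,UDLRDURLUDRUDLUR",
    "Nothing (because not) exists lol extraline,Right Side,RBRGBRGBRGRBGRBGBR",
]
dictionary = build_dictionary(lines)
```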

String Comparison ignoring special characters C#

I am creating a generic List of unique strings. My strings are formatted like GBP/101-P506, but sometimes they could be GBP-101-P-506. Both of these strings have to be considered the SAME. How can I compare such strings?
The most straightforward way would be to replace the special characters with empty strings and compare the results...
Use temporary variables if you don't want to modify your originals.
RegEx the input and normalize the data before entering it into your data structure. If you don't want to change the original strings, you will have to consider all possible valid values anytime you need to perform operations on the strings.
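The normalize-then-compare idea from both answers can be sketched as follows. The example is written in Python purely for illustration (the question is about C#, where the same approach applies), the function names are hypothetical, and treating every non-alphanumeric ASCII character as "special" is an assumption about the data.

```python
import re

def normalize(code):
    """Drop every character that is not an ASCII letter or digit."""
    return re.sub(r"[^A-Za-z0-9]", "", code)

def unique_by_normalized(codes):
    """Keep the first original spelling seen for each normalized form."""
    seen = {}
    for code in codes:
        seen.setdefault(normalize(code), code)
    return list(seen.values())

same = normalize("GBP/101-P506") == normalize("GBP-101-P-506")
unique = unique_by_normalized(["GBP/101-P506", "GBP-101-P-506", "USD/7"])
```

Keying the lookup on the normalized form leaves the original strings untouched, which matches the suggestion to use temporary values rather than modifying the originals.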

Loading dataset containing both strings and number

I'm trying to load the following dataset:
Afghanistan,5,1,648,16,10,2,0,3,5,1,1,0,1,1,1,0,green,0,0,0,0,1,0,0,1,0,0,black,green
Albania,3,1,29,3,6,6,0,0,3,1,0,0,1,0,1,0,red,0,0,0,0,1,0,0,0,1,0,red,red
Algeria,4,1,2388,20,8,2,2,0,3,1,1,0,0,1,0,0,green,0,0,0,0,1,1,0,0,0,0,green,white
...
The problem is that it contains both integers and strings.
I found some information on how to extract only the integers,
but I haven't been able to find out whether there's any way to get all the data.
Is that possible?
If it is not possible, is there any way to find the numbers on each line and throw everything else away without having to choose the columns?
I need this specifically because it seems I cannot use str2num on a whole line at a time.
Almost anything is possible, you just have to define your goal accurately.
Assuming that your database is stored as a text file, you can parse it line by line using textread, and then apply regexp to filter only the numerical fields (this does not require having prior knowledge about the columns):
C = textread('database.txt', '%s', 'delimiter', '\n');
C = cellfun(@(x)regexp(x, '\d+', 'match'), C, 'Uniform', false);
The result here is a cell array of cell array of strings, where each string corresponds to a numerical field in a specific line.
Since the numbers are still stored as strings, you'd probably need to convert them to actual numerical values. There's a multitude of ways to do that, but you can use str2num in a tricky way: it can convert delimited strings into an array of numbers. This means that if you concatenate all strings in a specific line back into one string, and put spaces in between, you can apply str2num on all of them at once, like so:
C = cellfun(@(x)str2num(sprintf('%s ', x{:})), C, 'Uniform', false);
The resulting C is a cell array of vectors, each vector containing the values of all numerical fields in the corresponding line. To access a specific vector, you can use curly braces ({}). For instance, to access the numbers of the second line, you would use C{2}.
All the non-numerical fields are discarded in the process of parsing, of course. If you want to keep them as well, you should use a different regular expression with regexp.
Good luck!
