Loading dataset containing both strings and number - string

I'm trying to load the following dataset:
Afghanistan,5,1,648,16,10,2,0,3,5,1,1,0,1,1,1,0,green,0,0,0,0,1,0,0,1,0,0,black,green
Albania,3,1,29,3,6,6,0,0,3,1,0,0,1,0,1,0,red,0,0,0,0,1,0,0,0,1,0,red,red
Algeria,4,1,2388,20,8,2,2,0,3,1,1,0,0,1,0,0,green,0,0,0,0,1,1,0,0,0,0,green,white
...
The problem is that it contains both integers and strings.
I found some information on how to extract only the integers, but I haven't been able to find out whether there is a way to load all the data.
Is that possible?
If not, is there a way to find the numbers on each line and throw everything else away without having to choose the columns?
I need this specifically because it seems I cannot use str2num on a whole line at a time.

Almost anything is possible, you just have to define your goal accurately.
Assuming that your database is stored as a text file, you can parse it line by line using textread, and then apply regexp to filter only the numerical fields (this does not require having prior knowledge about the columns):
C = textread('database.txt', '%s', 'delimiter', '\n');
C = cellfun(@(x)regexp(x, '\d+', 'match'), C, 'Uniform', false);
The result here is a cell array of cell arrays of strings, where each string corresponds to a numerical field in a specific line.
Since the numbers are still stored as strings, you'd probably need to convert them to actual numerical values. There's a multitude of ways to do that, but you can use str2num in a tricky way: it can convert delimited strings into an array of numbers. This means that if you concatenate all strings in a specific line back into one string, and put spaces in between, you can apply str2num on all of them at once, like so:
C = cellfun(@(x)str2num(sprintf('%s ', x{:})), C, 'Uniform', false);
The resulting C is a cell array of vectors, each vector containing the values of all numerical fields in the corresponding line. To access a specific vector, you can use curly braces ({}). For instance, to access the numbers of the second line, you would use C{2}.
All the non-numerical fields are discarded in the process of parsing, of course. If you want to keep them as well, you should use a different regular expression with regexp.
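For readers who are not tied to MATLAB, the same two steps (match every run of digits on a line, then convert the matches to numbers) can be sketched in Python. This is only an illustration under assumptions: the helper name numeric_fields is hypothetical, and the sample line is copied from the question. Like the '\d+' pattern above, it ignores signs and decimal points.

```python
import re

def numeric_fields(line):
    """Return every run of digits in `line` as an int, discarding all other text."""
    return [int(token) for token in re.findall(r"\d+", line)]

# One sample line taken from the question; a real file would be read line by line.
sample_lines = [
    "Albania,3,1,29,3,6,6,0,0,3,1,0,0,1,0,1,0,red,0,0,0,0,1,0,0,0,1,0,red,red",
]

# rows[i] plays the role of C{i+1} in the MATLAB code above.
rows = [numeric_fields(line) for line in sample_lines]
```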
Good luck!


reading data from excel into SAS and converting character to numeric

I am reading data from an excel file in SAS and inserting the values to an oracle table. The oracle table has a numeric column. If the excel file has numbers, it works fine. But if the column is left blank in the excel file, it is read as a character value and insertion to oracle fails.
Is it possible to convert the column to numeric if it is blank, but read it as is if it has a number?
Thanks!
Let's assume that SAS is reading this column as a character and you cannot convert it directly within the file. This happens sometimes: maybe you don't have authorization to do it, or maybe it's just not working like you're expecting. SAS can go from character to numeric and numeric to character with two functions: input() and put().
Going from Character to Numeric: input()
input() is for changing character data into numbers.
This is great for reading in dates, currency, comma-separated numbers, etc. If you need your data as a number, use this function. Its syntax is:
num_var = input(char_var, informat.);
In your case, let's say we always expect numbers to be here even if it's missing. We'll use the 8. informat for our variable of interest, my_var.
data want;
set have_excel(rename=(my_var = my_var_char) );
my_var = input(my_var_char, 8.);
drop my_var_char;
run;
Note that we need to create a new variable. We rename the variable of interest to something else, then create a new version of the variable of interest that is a number. In SAS, as in many other languages and database systems, once a variable is declared as character or numeric, its type cannot be changed in place.
Going from Numeric to Character: put()
put() is for putting a number to a character or a character to another character.
This is great for converting SAS dates to characters, adding custom formats, converting a character to another character, etc. The syntax is:
char_var = put(num_var, format.);
OR:
char_var = put(char_var, format.);
Note the previous use case: with put(), you can convert characters to other characters. This is very handy for standardizing values or even merging data using a format.
For example: let's convert a number to a comma-separated character number.
data want;
char_number = put(1234, comma.);
run;
Output:
char_number
1,234
The case statement below worked for me.
case
when missing(input(cats(COLUMN_VALUE), best8.)) THEN input(cats(COLUMN_VALUE), best8.)
when not missing(input(cats(COLUMN_VALUE), best8.)) THEN input(cats(COLUMN_VALUE), best8.)
end as COLUMN_VALUE

csv.writer inserting comma between each character

I'm using the following code:
import csv
plugara = ['CNPJ', 56631781000177, 21498104000148, 3914296000144, 28186370000184]
plugara = map(str, plugara)
with open('result.csv', 'w') as f:
    wr = csv.writer(f, dialect='excel')
    wr.writerows(plugara)
The result I'm getting is putting a comma between each character and is breaking into a different column:
I would like it to be without those commas like this:
Any ideas?
The writerows method that you're calling expects its argument to be an iterable containing rows. Each row should be iterable itself, with its values being the items in the row. In your case, the values are strings, which can be iterated upon to give their characters. Unfortunately, that's not what you intended!
Exactly how to fix this issue depends on what you want the output to be.
If you want your output to consist of a single row, then just change your call to writerow instead of writerows (note the plural). The writerow method writes a single row, rather than trying to write several of them at once.
On the other hand, if you want many rows, with just one item in each one (forming a single column), then you'll need to transform your data a little bit. Rather than directly passing in your list of strings, you need to produce an iterable of rows with one item in them (perhaps 1-tuples). Try something like this:
wr.writerows((item,) for item in plugara)
This call uses a generator expression to transform each string from plugara into a 1-tuple containing the string. This should produce the output you want.
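The difference between the calls can be checked without touching the disk. The following is a minimal sketch that writes into in-memory buffers (io.StringIO) instead of result.csv; the shortened plugara list and the variable names one_row, one_column and broken are chosen here for illustration only.

```python
import csv
import io

plugara = ["CNPJ", "56631781000177", "21498104000148"]

# Single row: writerow treats the list as one row, one string per cell.
one_row = io.StringIO()
csv.writer(one_row, dialect="excel").writerow(plugara)

# Single column: wrap each string in a 1-tuple so it forms a row with one cell.
one_column = io.StringIO()
csv.writer(one_column, dialect="excel").writerows((item,) for item in plugara)

# The original problem: writerows iterates over each string character by character.
broken = io.StringIO()
csv.writer(broken, dialect="excel").writerows(plugara[:1])

print(repr(one_row.getvalue()))
print(repr(one_column.getvalue()))
print(repr(broken.getvalue()))
```

The last buffer reproduces the symptom from the question: the string CNPJ is split into the cells C, N, P and J.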

String Comparison ignoring special characters C#

I am creating a generic List of unique strings. My strings are formatted like GBP/101-P506, but sometimes the same value appears as GBP-101-P-506. Both of these strings have to be considered the same. How can I compare such strings?
The most straightforward way would be to replace the special characters with empty strings and compare the results.
Use temporary variables if you don't want to modify your originals.
Regards
RegEx the input and normalize the data before entering it into your data structure. If you don't want to change the original strings, you will have to consider all possible valid values anytime you need to perform operations on the strings.
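Both suggestions come down to comparing a normalized key instead of the raw string. The question is about C#, so the following Python sketch only illustrates the idea; the helper name normalize, the rule "keep letters and digits only", and the third sample value USD/200-X1 are assumptions made for the example.

```python
import re

def normalize(code):
    """Remove every character that is not an ASCII letter or digit."""
    return re.sub(r"[^A-Za-z0-9]", "", code)

# Map each normalized key to the first original spelling seen,
# so the originals are left unmodified.
unique = {}
for code in ["GBP/101-P506", "GBP-101-P-506", "USD/200-X1"]:
    unique.setdefault(normalize(code), code)

print(list(unique.values()))
```

Keeping the key separate from the stored value is one way to follow the advice about not changing the original strings.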

How to convert a string containing non-numeric values into numeric values?

I have several variables of the form:
1 gdppercap
2 19786,97
3 20713,737
4 20793,163
5 23070,398
6 5639,175
I have copy-pasted the data into Stata, and it thinks they are strings. So far I have tried:
destring gdppercap, generate(gdppercap_n)
but get
gdppercap contains nonnumeric characters; no generate
And:
encode gdppercap, gen(gdppercap_n)
but get a variable numbered from 1 to 1055 regardless of the previous value.
Also I've tried:
gen gdppercap_n = real(gdppercap)
But get:
(1052 missing values generated)
Can you help me? As far as I can tell, Stata does not like the fact that the variable contains fractional numbers.
If I understand you correctly, the interpretation as string arises from one and possibly two facts:
The variable name may be echoed in the first observation. If so, that's text and it's inconsistent with a numeric variable. The root problem there is likely to be a copy-and-paste operation that copied too much. Stata typically gives you a choice when importing by copy-and-paste of whether the first row of what you copied is to be treated as variable names or as data, and you need the first choice, so that column headers become variable names, not data. It may be best to go back and do the copy-and-paste correctly. However, Stata can struggle with multiple header lines in a spreadsheet. Alternatively, use import excel, not a copy-and-paste. Alternatively, drop in 1 to remove the first observation, provided that it consistently is superfluous.
Commas indicate decimal places. destring can easily cope with this: see the help for its dpcomma option. Stata has no objection to fractions; that would be absurd. The problem is that you need to flag your use of commas.
Note that
destring is a wrapper for real(), so real() is not a way round this.
encode is for mapping genuine categorical variables to integers, as you discovered, and as its help does explain. It is not for fixing data input errors.
You can write a for loop to convert a comma to a period. I don't quite know your variables but imagine you have a variable gdppercap with information like 1234,343 and you want that to be 1234.343 before you do the destring.
For example:
forvalues x = 1(1)10 {
    replace gdppercap = substr(gdppercap, 1, `x'-1) + "." + substr(gdppercap, `x'+1, .) if substr(gdppercap, `x', 1) == ","
}
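The underlying conversion (treat the comma as the decimal mark, and treat anything that still is not a number as missing) can be sketched outside Stata as well. The Python below is only an illustration: the helper name parse_decimal_comma is hypothetical, and it assumes the values contain no thousands separators.

```python
def parse_decimal_comma(text):
    """Convert a string such as '19786,97' to a float; return None if it is not numeric."""
    try:
        return float(text.strip().replace(",", "."))
    except ValueError:
        # For example, a variable name that was pasted into the first observation.
        return None

# Values taken from the listing in the question, including the stray header row.
values = [parse_decimal_comma(v) for v in ["gdppercap", "19786,97", "5639,175"]]
```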

Comma separators in Fortran

I have come across the following issue with Fortran: when reading a character array (or any list, in fact) from a data file with fmt=*, both blanks outside quotes and commas are treated as delimiters between the elements. The fact that commas act as delimiters is a big problem for me.
So the question is: do you know of any language option or compiler directive in Fortran that makes commas in input files count as ordinary characters rather than delimiters, with blanks as the only delimiters? As a specific example, I would like it if reading a record like:
x,y,z
with:
read (7,*) adummy
would result in adummy (a scalar character variable) getting the value x,y,z not x.
Any help would be most welcome.
The solution is to specify a format that matches your data record, i.e. use the character edit descriptor (A) in the format:
read(7,fmt='(A)')adummy
will result in adummy having value x,y,z, assuming it is a variable of sufficient length.
However, this method does not treat blanks as delimiters either. If you want commas to be read as ordinary characters but blanks to act as delimiters, the common approach is to read the whole record into a character variable and split it into separate variables afterwards.
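The read-then-split approach is language independent. As a rough sketch of the splitting step only, here it is in Python rather than Fortran; the helper name split_on_blanks and the sample record are made up for the example, and quoted fields that contain blanks are not handled.

```python
def split_on_blanks(record):
    """Split a record on runs of whitespace only; commas stay inside the fields."""
    return record.split()

# The whole record is read as one string first, then split afterwards.
fields = split_on_blanks("x,y,z  a,b   c")
```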
