Trouble with "destring" and keeping decimals

I am new to Stata and I assume this is a beginner question. Yet I have just spent the last hour searching the internet for an answer, to no avail!
I am using World Bank GDP data (imported from a CSV file), and the data is stored as strings. When I destring, the decimal point in GDP values that have decimal places gets ignored, and the value simply comes out as one big number.
destring yr*, replace ignore("..")
Here is a sample of my data:
yr2016
205276172134.901
..
13397100000
When I run the command I posted, it transforms to:
yr2016
2.053e+14
1.340e+10
As you can see, the .901 was tacked onto the number instead of being treated as decimal places.
I have tried:
set dp period
But it didn't work.

You just need to set the format of the converted variable:
clear
set obs 1
generate string = "205276172134.901"
destring string, generate(numeric)
list
+------------------------------+
| string numeric |
|------------------------------|
1. | 205276172134.901 2.053e+11 |
+------------------------------+
format numeric %18.0g
list
+-------------------------------------+
| string numeric |
|-------------------------------------|
1. | 205276172134.901 205276172134.901 |
+-------------------------------------+
Type help format for more information.

The problem is that the ignore() option removes every instance of a . in the string variable, including the decimal point. Stata is not searching for a sequence of two consecutive periods (..). There is no need to use the ignore() option in this case. Try destring var, replace force and allow Stata to set rows containing .. to missing.
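The difference between the two behaviours can be sketched outside Stata. The following Python snippet is only an illustration, assuming that ignore() strips each listed character individually; the function names are hypothetical and not part of Stata.

```python
def strip_chars(value, chars):
    """Remove every occurrence of each character in chars,
    which is the effect ignore("..") has on the string."""
    return "".join(ch for ch in value if ch not in chars)


def to_number_or_missing(value):
    """Convert a string to a float; return None for placeholders such as ".."
    (roughly what the force option does by setting them to missing)."""
    try:
        return float(value)
    except ValueError:
        return None


raw = ["205276172134.901", "..", "13397100000"]

# Stripping "." characters one by one also deletes the decimal point.
stripped = [strip_chars(v, "..") for v in raw]

# Treating unparseable values as missing keeps the decimal point intact.
converted = [to_number_or_missing(v) for v in raw]

print(stripped)
print(converted)
```

The first list shows the GDP value with its decimal point removed, which is why the number grows by three orders of magnitude.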

Related

Trouble with format command changing the value

I first applied destring to an ID variable (with 17 digits). The values were converted, but they are shown in scientific notation. So I tried the command format %20.0f. Now all digits are shown, but the last 2-3 digits have changed.
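Stata can only hold integers exactly in a numeric variable up to about 16 digits (a double represents every integer exactly only up to 2^53 = 9,007,199,254,740,992).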
Your best option is probably to keep the ID as a string.
The command format only affects how a data point is displayed to humans, not how it is actually stored.
This is to complement the answer by @TheIceBear.
format never changes values. The problem is that your ID has too many digits for its numeric equivalent to be held exactly in a double, except occasionally.
clear
set obs 5
gen id = 17*"9" in 1
replace id = 16*"9" + "6" in 2
replace id = 16*"9" + "2" in 3
replace id = 15*"9" + "88" in 4
replace id = 15*"9" + "84" in 5
format id %20s
destring id, gen(nid)
format nid %20.0f
list
+----------------------------------------+
| id nid |
|----------------------------------------|
1. | 99999999999999999 100000000000000000 |
2. | 99999999999999996 100000000000000000 |
3. | 99999999999999992 100000000000000000 |
4. | 99999999999999988 99999999999999984 |
5. | 99999999999999984 99999999999999984 |
+----------------------------------------+
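The same rounding can be reproduced in any language that uses IEEE 754 doubles. Here is a minimal Python sketch, assuming only that Stata's double is a standard 64-bit double; the variable names are illustrative.

```python
ids = [
    "99999999999999999",
    "99999999999999996",
    "99999999999999992",
    "99999999999999988",
    "99999999999999984",
]

# Every integer up to this limit has an exact double representation.
EXACT_LIMIT = 2 ** 53  # 9007199254740992, a 16-digit number

for s in ids:
    as_double = float(s)         # nearest representable double
    round_trip = int(as_double)  # exact integer value of that double
    status = "exact" if round_trip == int(s) else "changed"
    print(s, round_trip, status)
```

All five IDs exceed the exact limit, so only those that happen to land on a representable value survive unchanged. Keeping the IDs as strings avoids the problem entirely.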

Using Turi to create a simple text classification

To get familiar with Turi, I'm trying to create a model that is able to distinguish between strings consisting of letters and strings consisting of digits.
I have a CSV file with training data. Each line consists of two entries: a string and an indicator of whether this string is a number or a plain string.
String, isNumber
bvmuuflo , 0
71047015 , 1
My Python-Script to generate the model looks like this:
import graphlab as gl
data = gl.SFrame('data.csv')
model = gl.classifier.create(data, target="isNumber", features=["String"])
This works fine. But I have no idea how to use the model to check for example if "qwerty" is a String or a Number.
I'm trying to use the model.classify(...) API call. But the two calls
model.classify(gl.SFrame(["qwertzui"]))
and
model.classify(gl.SFrame(["98765432"]))
return the same result:
Columns:
class int
probability float
Rows: 1
Data:
+-------+----------------+
| class | probability |
+-------+----------------+
| 1 | 0.509227594584 |
+-------+----------------+
[1 rows x 2 columns]
Obviously there is a mistake in my program, but I'm not able to find it.
Any help is welcome!
Since the model only has one column for training it will be able to identify strings it has already seen but unable to identify ones it has not. My guess is the .509 is the percentage of your input that is a string, so it just responds with that for anything it has not seen before.
This is obviously a toy example but if you want to get it to work I would use something like a bag of words, but for letters. Make 36 columns with the titles a,b,c...z,0,1...9 and put the count of each character per string for each row. This way the model will look at individual letters as giving a probability to the class instead of the string as a whole.
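The character-count features described above could be built roughly as follows. This is a sketch in plain Python under stated assumptions: the helper name char_counts is hypothetical, and loading the resulting rows into an SFrame and training the classifier is not shown.

```python
import string

# 26 lowercase letters + 10 digits = 36 feature columns.
ALPHABET = string.ascii_lowercase + string.digits


def char_counts(text):
    """Return a dict mapping each of the 36 characters to its count in text."""
    text = text.strip().lower()
    return {ch: text.count(ch) for ch in ALPHABET}


# The two training rows shown in the question.
rows = [("bvmuuflo", 0), ("71047015", 1)]

# One feature dict per row, plus the target column.
features = [dict(char_counts(s), isNumber=label) for s, label in rows]

print(features[1]["7"], features[1]["isNumber"])
```

With these columns, a string the model has never seen still shares features (individual characters) with the training data, which is what lets it generalize.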

Comma separated combination of LTR/RTL/Digit characters reorder issue

I have a comma separated list of values generated from an excel sheet. (Numbers and RTL Characters)
Having these values in columns: 1 | 2 | 3 | 4 | 5
would yield me the output of 1,2,3,4,5
But the issue arises when I have RTL characters (Persian/Arabic) in my columns: 1 2 ب الف and a 5 in the end.
Now the output becomes 1,2الف, ب, 5
Since my columns can have multiple sets of RTL characters, it can really mess up the output, to the point that it is no longer trivial to fix by substituting several inputs.
What are my options to produce a CSV file with the right order?
The tools I used were JavaScript and Excel, and both had the same issue.
If your purpose is only to display the CSV to the human eye, you can add a RIGHT-TO-LEFT MARK (U+200F) before each number:
‏1, ‏2, ب, الف, ‏5
Note that these invisible characters may confuse any tool you use to parse the CSV.
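As a rough illustration of this approach, the marks could be added programmatically. The sketch below is in Python rather than the JavaScript mentioned in the question, and the helper name mark_numbers is hypothetical; how the result renders depends on the viewer.

```python
RLM = "\u200f"  # RIGHT-TO-LEFT MARK


def mark_numbers(fields):
    """Prefix purely numeric fields with RLM; leave other fields untouched."""
    return [RLM + f if f.isdigit() else f for f in fields]


# "\u0628" is ب and "\u0627\u0644\u0641" is الف, written as escapes
# so the logical order of the source code is unambiguous.
fields = ["1", "2", "\u0628", "\u0627\u0644\u0641", "5"]

display_line = ", ".join(mark_numbers(fields))
print(repr(display_line))
```

Stripping the marks again with display_line.replace(RLM, "") recovers the plain CSV line, which is one way to keep a parser-friendly copy alongside the display copy.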
I think your CSV file already has the right order. In the text you pasted in the question:
1,2الف, ب, 5
The "1" is the first character in the string, and the "5" is the last. It just doesn't seem that way to you because the first half of the string (1,2) is rendering LTR whereas the second half of the string (الف, ب, 5) is rendering RTL.
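One way to check that the stored order is right, independent of how the line renders, is to inspect the string programmatically. This Python sketch assumes the intended line is 1,2,ب,الف,5 in logical (storage) order, with the RTL letters written as escapes.

```python
import unicodedata

# Intended CSV line in logical order; "\u0628" is ب and "\u0627\u0644\u0641" is الف.
line = "1,2,\u0628,\u0627\u0644\u0641,5"

# Splitting works on logical order, so the fields come back in column order.
fields = line.split(",")

# Bidirectional class per character:
# EN = European number, AL = Arabic letter, CS = common separator.
classes = [(ch, unicodedata.bidirectional(ch)) for ch in line]

print(fields)
print(line[0], line[-1])
```

The mix of EN and AL classes is what makes a renderer show part of the line left to right and part right to left, even though the underlying order never changes.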

Changing a numeric to a string variable in Stata

I have a variable ShiftStart that is a numeric variable in the format 01jan2014 06:59:59 (and so on). I want to change this to a string variable so that I can then substring it and create variables based on just date and just time separately.
When I try
generate str20 string_shiftstart=string(ShiftStart)
I create a string but all of the cells have been converted to strange values ("1.70e+12" and so on).
How can I keep the original contents of ShiftStart when it is converted to a string?
It seems you have a variable formatted as datetime. If so, no need to convert to string. There are appropriate functions that allow you to manipulate the original variable. This is clearly explained in help datetime:
clear
set more off
*----- example data -----
set obs 5
gen double datet = _n * 100000000
format datet %tc
list
*----- what you want -----
gen double date = dofc(datet)
format %td date
gen double hour = hh(datet) + mm(datet)/60 + ss(datet)/3600
list
Your original result is surprising only until you realize that underneath the datetime display format is a numeric value (for %tc, the number of milliseconds since 01jan1960 00:00:00). That number is what string() converted.
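To make the underlying number concrete, here is a small Python sketch of the %tc encoding, assuming milliseconds since 01jan1960 00:00:00 with no leap seconds. The function names are hypothetical and only mirror what the display format and dofc() do conceptually.

```python
from datetime import datetime, timedelta

STATA_EPOCH = datetime(1960, 1, 1)


def to_stata_tc(dt):
    """Milliseconds elapsed between the 1960 epoch and dt."""
    return (dt - STATA_EPOCH) // timedelta(milliseconds=1)


def from_stata_tc(ms):
    """Datetime corresponding to a %tc-style millisecond count."""
    return STATA_EPOCH + timedelta(milliseconds=ms)


shift_start = datetime(2014, 1, 1, 6, 59, 59)
ms = to_stata_tc(shift_start)  # a number on the order of 1.7e12

# Split the value into date and time parts without going through strings.
date_part = from_stata_tc(ms).date()
time_part = from_stata_tc(ms).time()

print(ms, date_part, time_part)
```

The magnitude of ms matches the "1.70e+12" values seen in the question, which is the numeric value shown without its datetime format.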
A good read (aside from help datetime) is
Stata tip 113: Changing a variable's format: What it does and does not mean, The Stata Journal, by Nicholas J. Cox.
Edit
To answer your last question:
If you want to create an indicator variable marking pre/post periods, one way is using td() (see the help file). Following the example given above:
// before 04jan1960
gen pre = date < td(04jan1960)
Creating this indicator variable is not always necessary. Most commands allow the use of the if qualifier, and you can insert the condition directly. See help if.
If you mean something else, you should be more explicit.

Match text from column within a certain cell - Excel

I have a column of a few thousand filenames that are not uniform. For instance:
| Column A | Column B |
===============================
| junk_City1_abunc | City1 |
-------------------------------
| nunk_City1_blahb | City1 |
-------------------------------
| small=City2_jdjf | City2 |
-------------------------------
| mozrmcity3_somet | City3 |
I would like to identify the city within the text in column A and return it in Column B.
I've come up with a complex formula that does the trick, but it is difficult to adjust if more cities are added within the filenames in new entries within column A.
Here is an example:
=IF(ISNA(MATCH("*"&$W$3&"*",I248,0)),IF(ISNA(MATCH("*"&$W$4&"*",I248,0)),IF(ISNA(MATCH("*"&$W$5&"*",I248,0)),IF(ISNA(MATCH("*"&$W$6&"*",I248,0)),IF(ISNA(MATCH("*"&$W$7&"*",I248,0)),IF(ISNA(MATCH("*"&$W$8&"*",I248,0)),"Austin","Orlando"),"Las Vegas"),"Chicago"),"Boston"),"Las Angeles"),"National")
It seems like there should be an easier way to do it, but I just can't figure it out.
(To make matters worse, not only am I identifying a city within the filename, I'm looking for other attributes to populate other columns)
Can anyone help?
Use the formula =IFERROR(LOOKUP(1E+100,SEARCH($E$2:$E$11,A2),$E$2:$E$11),A2)
This does NOT have to be array-entered.
Here $E$2:$E$11 is the list of names you want returned and A2 is the cell to test.
If no match is found, the formula returns the full name from column A instead of an error.
If you want errors, or expect never to have an unmatched name, you can just use:
=LOOKUP(1E+100,SEARCH($E$2:$E$11,A2),$E$2:$E$11)
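For readers who find the LOOKUP/SEARCH idiom opaque, its logic can be sketched in Python. This is an illustration under stated assumptions, not a replacement for the formula: find_city is a hypothetical name, SEARCH is taken to be case-insensitive, and LOOKUP with a very large lookup value is taken to return the last entry that matched.

```python
def find_city(text, cities):
    """Return the last entry of cities found in text (case-insensitive),
    or text itself when nothing matches, like the IFERROR fallback."""
    lowered = text.lower()
    match = None
    for city in cities:
        if city.lower() in lowered:
            match = city  # later matches overwrite earlier ones
    return match if match is not None else text


cities = ["City1", "City2", "City3"]
filenames = [
    "junk_City1_abunc",
    "nunk_City1_blahb",
    "small=City2_jdjf",
    "mozrmcity3_somet",
]

result = [find_city(name, cities) for name in filenames]
print(result)
```

Adding a new city then means appending one entry to the list, which mirrors extending the $E$2:$E$11 range rather than nesting another IF.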
Here's a roundabout way that works. It is not all my own work, but a mishmash of bits from other sources.
Assuming the search terms are in C2:C8 and the text to search is in A2:
The formula to use is below. It must be entered using Ctrl+Shift+Enter.
=INDEX($C$2:$C$8,MAX(IF(ISERROR(SEARCH($C$2:$C$8,A2)),-1,1)*(ROW($C$2:$C$8)-ROW($C$2)+1)))
Annotated version:
=INDEX([List of search terms],MAX(IF(ISERROR(SEARCH([List of search terms],[Cell to search])),-1,1)*(ROW([List of search terms])-ROW([Cell containing first search term])+1)))
