Converting very long string into numeric

I have a person identification number variable in a panel dataset that is of string type with 19 characters (str19). Whenever I convert it into numeric using the destring command I lose precision because it is converted into either double (max 16 characters) or float, meaning that the ID numbers no longer identify respondents uniquely. I need it to be numeric in order to treat the data as panel (xt commands). What can I do?

The best way forward I can think of is to use egen's group() function to create identifiers. You don't provide a data or code example, but this illustrates the point.
. clear
. set obs 1
number of observations (_N) was 0, now 1
. gen strid = "1234567890123456789"
. egen numid = group(strid), label
. list
+-------------------------------------------+
| strid numid |
|-------------------------------------------|
1. | 1234567890123456789 1234567890123456789 |
+-------------------------------------------+
. list, nolabel
+-----------------------------+
| strid numid |
|-----------------------------|
1. | 1234567890123456789 1 |
+-----------------------------+
Note that this is documented: see this FAQ.
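A minimal sketch of how the grouped identifier could then feed the panel commands, meant to be run as a do-file. The toy data and the time variable year are assumptions for illustration; neither comes from the question.

```stata
* Toy data: two hypothetical 19-digit string IDs, two periods each
clear
set obs 4
gen strid = cond(_n <= 2, "1234567890123456789", "1234567890123456790")
bysort strid: gen year = 2000 + _n   // hypothetical time variable
egen numid = group(strid)            // one small integer per distinct string
isid numid year                      // stops with an error if (numid, year) is not unique
xtset numid year
```

The original string stays in the dataset, so the full 19-character ID remains available for merges and lookups.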

Related

Trouble with format command changing the value

I first applied destring to an ID variable (with 17 digits). They are destrung but then they are shown in scientific notation. So I tried the command format %20.0f. Now all digits are shown but the last 2-3 digits are now changed.
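Stata's most precise numeric storage type, double, holds integers exactly only up to 2^53 (9,007,199,254,740,992), so IDs with 17 digits cannot in general be stored exactly as numbers.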
Your best option is probably to keep the ID as a string.
The command format only affects how a data point is displayed to humans, not how it is actually stored.
This is to complement the answer by @TheIceBear.
format never changes values. The problem is that your string is too big even for its numeric equivalent to be held exactly in a double, except occasionally.
clear
set obs 5
gen id = 17*"9" in 1
replace id = 16*"9" + "6" in 2
replace id = 16*"9" + "2" in 3
replace id = 15*"9" + "88" in 4
replace id = 15*"9" + "84" in 5
format id %20s
destring id, gen(nid)
format nid %20.0f
list
+----------------------------------------+
| id nid |
|----------------------------------------|
1. | 99999999999999999 100000000000000000 |
2. | 99999999999999996 100000000000000000 |
3. | 99999999999999992 100000000000000000 |
4. | 99999999999999988 99999999999999984 |
5. | 99999999999999984 99999999999999984 |
+----------------------------------------+
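The limit comes from the double storage type, which represents every integer exactly only up to 2^53. The sketch below shows that boundary and one possible way to check whether a destrung ID still identifies rows; the two-row dataset is made up for illustration, and it is meant to be run as a do-file.

```stata
* 2^53 = 9007199254740992 is exact in a double; 2^53 + 1 is not
display %20.0f (2^53)
display %20.0f (2^53 + 1)   // displays the same value as the line above

* Two distinct 17-digit strings that collapse to one number
clear
set obs 2
gen id = "99999999999999999" in 1
replace id = "99999999999999996" in 2
destring id, gen(nid)
duplicates report nid       // a surplus copy here means nid no longer identifies rows
```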

Trouble with "destring" and keeping decimals

I am new to Stata and I assume this is a beginner question. Yet I have just spent the last hour searching the internet for an answer, to no avail!
I am using World Bank GDP data (imported from a csv file) and the data is in the string format. When I destring, the GDP data that contains decimal places gets ignored and simply comes out as a big number.
destring yr*, replace ignore("..")
Here is a sample of my data:
yr2016
205276172134.901
..
13397100000
When I run the command I posted, it transforms to:
yr2016
2.053e+14
1.340e+10
As you can see the .901 was tacked into the number instead of being perceived as a decimal space.
I have tried:
set dp period
But it didn't work.
You just need to set the format of the converted variable:
clear
set obs 1
generate string = "205276172134.901"
destring string, generate(numeric)
list
+------------------------------+
| string numeric |
|------------------------------|
1. | 205276172134.901 2.053e+11 |
+------------------------------+
format numeric %18.0g
list
+-------------------------------------+
| string numeric |
|-------------------------------------|
1. | 205276172134.901 205276172134.901 |
+-------------------------------------+
Type help format for more information.
The problem is that the ignore() option removes every instance of . in the string variable; Stata is not searching for a sequence of two consecutive periods (..). There is no need to use the ignore() option in this case. Try destring var, replace force and allow Stata to set rows with .. to missing.
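A short sketch of that suggestion, using the three sample values from the question. The str20 width is an assumption chosen to fit them.

```stata
clear
input str20 yr2016
"205276172134.901"
".."
"13397100000"
end
* force turns any value that is not a valid number (here "..") into missing
destring yr2016, replace force
format yr2016 %18.0g
list
```

If .. is the only non-numeric code in the data, a more targeted alternative would be replace yr2016 = "" if yr2016 == ".." followed by destring yr2016, replace without force, so that any other unexpected text still stops the conversion.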

Why does Stata report zero observations when the variable has non-missing values?

I have a variable called co_dormant that takes on two string values:
Y or N.
So far, when I type summarize co_dormant, I get zero observations.
However, when I type table co_dormant, I get the frequency of Y and N.
I want to keep all observations that have non-missing co_dormant, and when I type
keep if co_dormant != .
all the observations are dropped.
Does anyone know what is happening?
summarize is meant for numeric type variables. (What would be, for example, the mean of a string variable?)
table by default gives the frequency. Stata can count frequencies for either string or numeric type variables.
If you want to drop missings (what Stata considers missings) you can use the missing() function. This works for both string and numeric variables:
clear
set more off
input ///
str1 myvar
Y
N
""
end
list
drop if missing(myvar)
list
See help missing for details on missing values.
If you executed what you say you executed, and the variable was string type, you would get an error:
. input ///
> str1 myvar
myvar
1. Y
2. N
3. ""
4. end
.
. list
+-------+
| myvar |
|-------|
1. | Y |
2. | N |
3. | |
+-------+
.
. keep if myvar != .
type mismatch
r(109);
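For the variable in the question, a sketch along these lines should keep the non-missing rows and give summarize something numeric to work with. The indicator name dormant is hypothetical, and the three-row dataset is made up.

```stata
clear
input str1 co_dormant
"Y"
"N"
""
end
* String missing is the empty string "", not the numeric missing value .
keep if !missing(co_dormant)
* Hypothetical 0/1 indicator so that summarize has a numeric variable
gen byte dormant = (co_dormant == "Y")
summarize dormant
```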

Convert odd Stata string variable to date

I currently have economic data in the format YYYY.QX where Q indicates "Quarter" followed by X, which is in [1,4]. This is interpreted as a string.
I've tried to use the date(series, "YMD") and formatting command, as well as the encode function.
Ideally, I'd end up with a numerical variable indicating something like:
YYYY.X
YYYY.M, where "M" is the first month of that quarter
YYYYMM01, where "MM" is the first month of that quarter.
It's best to show exactly what code you tried and what Stata did or said in response.
Such dates are quarterly dates so treating them as anything else is at best indirect and at worst quite wrong.
. set obs 1
obs was 0, now 1
. gen example = "2013.Q4"
. gen qdate = yq(real(substr(example, 1,4)),real(substr(example, -1,1)))
. list
+-----------------+
| example qdate |
|-----------------|
1. | 2013.Q4 215 |
+-----------------+
. format qdate %tq
. list
+------------------+
| example qdate |
|------------------|
1. | 2013.Q4 2013q4 |
+------------------+
Note that your code indicating the date is a daily date can only be wrong. Note also that encode (incidentally not a function, but a command) cannot help here unless you specify every string date explicitly as a value label.
UPDATE Note that the function date() is not an all-purpose function for creating any kind of date: it is only for daily dates. There is in fact a synonym daily().
This example shows that using quarterly() is another possibility.
. di quarterly(substr("2013.Q4", 1,4) + " " + substr("2013.Q4", -1,1), "Yq")
215
For a variable series containing such string dates, you could go
. gen qdate = quarterly(substr(series, 1, 4) + " " + substr(series, -1, 1), "Yq")
. format qdate %tq
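If a number of the form YYYYMM01 is still wanted, one possible route is to go from the quarterly date to the daily date of the quarter's first day and assemble the number from its parts. This is only a sketch: the variable names ddate and yyyymmdd are made up, and the result is a plain integer rather than a Stata date.

```stata
clear
set obs 1
gen series = "2013.Q4"
gen qdate = yq(real(substr(series, 1, 4)), real(substr(series, -1, 1)))
format qdate %tq
* dofq() returns the daily date of the first day of the quarter
gen ddate = dofq(qdate)
format ddate %td
* long, not the default float: a float cannot hold 8-digit integers exactly
gen long yyyymmdd = year(ddate)*10000 + month(ddate)*100 + day(ddate)
list
```

For 2013.Q4 the first day of the quarter is 1 October 2013, so yyyymmdd should come out as 20131001.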

Algorithm to form a given pattern using some strings

Given are 6 strings of any length. The words are to be arranged in the pattern shown below. They can be arranged either vertically or horizontally.
--------
|      |
|      |
|      |
---------------
       |      |
       |      |
       |      |
       --------
The pattern need not be symmetric, and there must be two empty areas as shown.
For example:
Given strings
PQF
DCC
ACTF
CKTYCA
PGYVQP
DWTP
The pattern can be
DCC...
W.K...
T.T...
PGYVQP
..C..Q
..ACTF
where dots represent empty areas.
The other example is
RVE
LAPAHFUIK
BIRRE
KZGLPFQR
LLHU
UUZZSQHILWB
Pattern is
LLHU....
A..U....
P..Z....
A..Z....
H..S....
F..Q....
U..H....
I..I....
KZGLPFQR
...W...V
...BIRRE
If multiple patterns are possible, then the pattern with the lexicographically smallest first line, then second line, and so on, is to be formed. What algorithm can be used to solve this?
Find strings that satisfy these constraints:
strlen(a) + strlen(b) - 1 = strlen(c)
strlen(d) + strlen(e) - 1 = strlen(f)
After that, try every possible arrangement and check whether it is valid. For example:
aaa.....
d.f.....
d.f.....
d.f.....
cccccccc
..f....e
..f....e
..bbbbbb
There will be 2*2*2 = 8 different arrangements.
There are a number of heuristics that you can apply, but before that, let's go over some properties of the puzzle.
+aa+
c  f
+ee+eee+
   f   d
   +bbb+
Let us denote the length of each string by the character used for it in the diagram above. We have:
a + b - 1 = e
c + d - 1 = f
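As a quick check, the first example in the question satisfies both constraints. This reads DCC, ACTF and PGYVQP as the top, bottom and middle horizontal strings, and DWTP, PQF and CKTYCA as the left, right and middle vertical strings:

```latex
\begin{aligned}
a + b - 1 &= |\mathrm{DCC}| + |\mathrm{ACTF}| - 1 = 3 + 4 - 1 = 6 = |\mathrm{PGYVQP}| = e \\
c + d - 1 &= |\mathrm{DWTP}| + |\mathrm{PQF}| - 1 = 4 + 3 - 1 = 6 = |\mathrm{CKTYCA}| = f
\end{aligned}
```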
I will refer to the 2 strings for the cross in the middle as middle strings.
We also infer that the length of the string cannot be less than 2. Therefore, we can infer:
e > a, e > b
f > c, f > d
From this, we know that the 2 shortest strings cannot be middle strings, due to the inequality above.
The 3 longest strings also cannot all be of equal length: every non-middle string is strictly shorter than its middle string, so at most the 2 middle strings can share the maximum length.
The puzzle is only tricky when the lengths are regular. When the lengths are irregular, you can do direct mapping from length to position.
If we have the 2 largest strings being equal, due to the inequality above, they are the 2 middle strings. The worst case for this one is a "regular" puzzle, where the length a, b, c, d are equal.
If the 2 largest strings are unequal, the largest string's position can be determined immediately (since its length is unique in the puzzle) - as one of the middle string. In worst case, there can be 3 candidates for the other middle string - just brute force and check all of them.
Algorithm:
Try to map unique length string to the position.
Brute force the 2 strings in the middle (taking into consideration what I mentioned above), and brute force to fill in the rest.
Even with naive brute force, there are only 6! = 720 cases if the strings can only run left to right and top to bottom (no reversal). There will be 46080 cases (720 * 2^6) if each string is allowed to run in either direction.
