Convert odd Stata string variable to date - string

I currently have economic data in the format YYYY.QX where Q indicates "Quarter" followed by X, which is in [1,4]. This is interpreted as a string.
I've tried using the date(series, "YMD") function and the format command, as well as the encode function.
Ideally, I'd end up with a numerical variable indicating something like:
YYYY.X
YYYY.M, where "M" is the first month of that quarter
YYYYMM01, where "MM" is the first month of that quarter.

It's best to show exactly what code you tried and what Stata did or said in response.
Such dates are quarterly dates so treating them as anything else is at best indirect and at worst quite wrong.
. set obs 1
obs was 0, now 1
. gen example = "2013.Q4"
. gen qdate = yq(real(substr(example, 1,4)),real(substr(example, -1,1)))
. list
+-----------------+
| example qdate |
|-----------------|
1. | 2013.Q4 215 |
+-----------------+
. format qdate %tq
. list
+------------------+
| example qdate |
|------------------|
1. | 2013.Q4 2013q4 |
+------------------+
Note that your code treating the date as a daily date can only be wrong. Note also that encode (incidentally not a function, but a command) cannot help here unless you specify every string date explicitly as a value label.
UPDATE Note that the function date() is not an all-purpose function for creating any kind of date: it is only for daily dates. There is in fact a synonym daily().
This example shows that using quarterly() is another possibility.
. di quarterly(substr("2013.Q4", 1,4) + " " + substr("2013.Q4", -1,1), "Yq")
215
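For context, 215 is Stata's internal quarterly encoding: quarterly dates count quarters from 1960q1, which is stored as 0, so 2013q4 is (2013 - 1960)*4 + 3 = 215. A quick sketch of that arithmetic in Python (purely illustrative; the helper name is mine):

def stata_quarterly(year, quarter):
    # quarters elapsed since 1960q1, which Stata stores as 0
    return (year - 1960) * 4 + (quarter - 1)

print(stata_quarterly(2013, 4))  # 215, matching yq() and quarterly() above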
For a variable series containing such string dates, you could go
. gen qdate = quarterly(substr(series, 1, 4) + " " + substr(series, -1, 1), "Yq")
. format qdate %tq

Related

Trouble with format command changing the value

I first applied destring to an ID variable (with 17 digits). The values are converted, but they are then displayed in scientific notation. So I tried the command format %20.0f. Now all digits are shown, but the last 2-3 digits have changed.
Stata can only hold numeric variables with up to 16 digits.
Your best option is probably to keep the ID as a string.
The command format only affects how a data point is displayed to humans, not how it is actually stored.
This is to complement the answer by @TheIceBear.
format never changes values. The problem is that your string is too big even for its numeric equivalent to be held exactly in a double, except occasionally.
clear
set obs 5
gen id = 17*"9" in 1
replace id = 16*"9" + "6" in 2
replace id = 16*"9" + "2" in 3
replace id = 15*"9" + "88" in 4
replace id = 15*"9" + "84" in 5
format id %20s
destring id, gen(nid)
format nid %20.0f
list
+----------------------------------------+
| id nid |
|----------------------------------------|
1. | 99999999999999999 100000000000000000 |
2. | 99999999999999996 100000000000000000 |
3. | 99999999999999992 100000000000000000 |
4. | 99999999999999988 99999999999999984 |
5. | 99999999999999984 99999999999999984 |
+----------------------------------------+
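The same collapse happens outside Stata, because it is a property of IEEE double precision, which can represent integers exactly only up to 2^53 (roughly 9.0e15, i.e. about 16 digits). A quick cross-check in Python, whose float is also a double:

for s in ["99999999999999999", "99999999999999996", "99999999999999992",
          "99999999999999988", "99999999999999984"]:
    print(s, "->", format(float(s), ".0f"))  # the five distinct IDs collapse to just two doubles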

Find average and total time between dates in a row string in pyspark?

The Spark df column has a list of ordered dates delimited by '|'. It looks like this:
+------------------------------------------------------------+
| date |
+------------------------------------------------------------+
|2018-08-01|2017-06-01|2015-11-01|2012-08-21|2010-04-02 |
|2016-08-11|2013-08-25|2013-04-01|2012-01-01 |
+------------------------------------------------------------+
I want to calculate the average time for each row and the total difference (last date - first date) in days, like this:
+------------------------------------------------------------+---------------+---------------+
| date | average time | total time |
+------------------------------------------------------------+---------------+---------------+
|2018-08-01|2017-06-01|2015-11-01|2012-08-21|2010-08-01 | 584 | 2920 |
|2016-08-11|2013-08-25|2013-04-01|2012-08-11 | 365 | 1460 |
+------------------------------------------------------------+---------------+---------------+
We can use the built-in array functions of DataFrame for this case.
Example:
import org.apache.spark.sql.functions._ // for split, array_sort, reverse, datediff, size, col
import spark.implicits._                // for .toDF on a Seq (pre-imported in spark-shell)

val df = Seq(("2018-08-01|2017-06-01|2015-11-01|2012-08-21|2010-04-02"),("2016-08-11|2013-08-25|2013-04-01|2012-01-01")).toDF("id") //sample data
df.withColumn("arr_id", array_sort(split(col("id"),"\\|"))). //split to create array and sort the array
withColumn("first_ele", col("arr_id")(0)). // to get first element of array
withColumn("last_ele",reverse(col("arr_id"))(0)). //to get last element of array
withColumn("total_diff",datediff(col("last_ele").cast("date"), col("first_ele").cast("date"))). //get total diff of first and last dates
withColumn("avg_time", col("total_diff")/size(col("arr_id"))). //caluculate diff of date and divide with size of array column
select("id", "total_diff","avg_time").
show(false)
Result:
+------------------------------------------------------+----------+--------+
|id |total_diff|avg_time|
+------------------------------------------------------+----------+--------+
|2018-08-01|2017-06-01|2015-11-01|2012-08-21|2010-04-02|3043 |608.6 |
|2016-08-11|2013-08-25|2013-04-01|2012-01-01 |1684 |421.0 |
+------------------------------------------------------+----------+--------+
I think the numbers in the question's expected output and in this answer differ because the start and end dates in the arrays are slightly different between the question's input and its expected output!
@Shu gave an awesome answer in Scala.
This is just the pyspark version. Feel free to give him the accepted answer.
from pyspark.sql import functions as F
df.withColumn("date", F.split(F.regexp_replace("date","\|",","),','))\
.withColumn("start", F.reverse(F.col("date"))[0]).withColumn("end",F.col("date")[0])\
.withColumn("total_time", F.datediff("end","start"))\
.withColumn("Average", F.col("total_time")/(F.size(F.col("date"))))\
.drop("start","end").show()
+--------------------+----------+-------+
| date|total_time|Average|
+--------------------+----------+-------+
|[2018-08-01, 2017...| 3043| 608.6|
|[2016-08-11, 2013...| 1684| 421.0|
+--------------------+----------+-------+
Here is a simplified version of the other answers using array built-in functions in Spark 2.4+.
First, split the dates string to get an array.
Now, since the dates are already sorted, use the element_at function to get the first and last dates; when the index is negative, it accesses the elements of the array from the end. The rest is just calculating datediff and dividing it by the size of the array to get the average:
df.withColumn("array_dates", split(col("date"), "[|]"))\
.withColumn("total_time", datediff(element_at("array_dates", -1), element_at("array_dates", 1))) \
.withColumn("average_time", col("total_time") / size(col("array_dates"))) \
.drop("array_dates")\
.show(2, False)
#+------------------------------------------------------+----------+------------+
#|date |total_time|average_time|
#+------------------------------------------------------+----------+------------+
#|2018-08-01|2017-06-01|2015-11-01|2012-08-21|2010-04-02|3043 |608.6 |
#|2016-08-11|2013-08-25|2013-04-01|2012-01-01 |1684 |421.0 |
#+------------------------------------------------------+----------+------------+
If the dates in the string aren't ordered as in your example, you could sort the array after the split using array_sort, as sketched below.
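For example (a sketch, assuming the same df and the Spark 2.4+ functions used above), inserting array_sort right after the split leaves the rest of the pipeline unchanged:

from pyspark.sql.functions import array_sort, col, datediff, element_at, size, split

df.withColumn("array_dates", array_sort(split(col("date"), "[|]"))) \
  .withColumn("total_time", datediff(element_at("array_dates", -1), element_at("array_dates", 1))) \
  .withColumn("average_time", col("total_time") / size(col("array_dates"))) \
  .drop("array_dates") \
  .show(2, False)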

Trouble with "destring" and keeping decimals

I am new to Stata and I assume this is a beginner question. Yet I have just spent the last hour searching the internet for an answer, to no avail!
I am using World Bank GDP data (imported from a csv file) and the data is in the string format. When I destring, the GDP data that contains decimal places gets ignored and simply comes out as a big number.
destring yr*, replace ignore("..")
Here is a sample of my data:
yr2016
205276172134.901
..
13397100000
When I run the command I posted, it transforms to:
yr2016
2.053e+14
1.340e+10
As you can see, the .901 was tacked onto the number instead of being treated as a decimal point.
I have tried:
set dp period
But it didn't work.
You just need to set the format of the converted variable:
clear
set obs 1
generate string = "205276172134.901"
destring string, generate(numeric)
list
+------------------------------+
| string numeric |
|------------------------------|
1. | 205276172134.901 2.053e+11 |
+------------------------------+
format numeric %18.0g
list
+-------------------------------------+
| string numeric |
|-------------------------------------|
1. | 205276172134.901 205276172134.901 |
+-------------------------------------+
Type help format for more information.
The problem is that the ignore() option removes every instance of . in the string variable; Stata is not searching for a sequence of two consecutive periods. There is no need to use the ignore() option in this case. Try destring var, replace force and allow Stata to set rows containing .. to missing.
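You can see the mechanism concretely: stripping every period turns the decimal value into a much larger integer, which is exactly why the question shows 2.053e+14 instead of 2.053e+11. A tiny illustration in Python:

s = "205276172134.901"
print(s.replace(".", ""))         # 205276172134901 -- the decimal point has been removed
print(float(s.replace(".", "")))  # 205276172134901.0, i.e. roughly 2.053e+14 as in the question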

converting very long string into numeric

I have a person identification number variable in a panel dataset that is of string type with 19 characters (str19). Whenever I convert it into numeric using the destring command I lose precision because it is converted into either double (max 16 characters) or float, meaning that the ID numbers no longer identify respondents uniquely. I need it to be numeric in order to treat the data as panel (xt commands). What can I do?
The best way forward I can think of is to use egen's group() function to create identifiers. You don't provide a data or code example, but this illustrates the point.
. clear
. set obs 1
number of observations (_N) was 0, now 1
. gen strid = "1234567890123456789"
. egen numid = group(strid), label
. list
+-------------------------------------------+
| strid numid |
|-------------------------------------------|
1. | 1234567890123456789 1234567890123456789 |
+-------------------------------------------+
. list, nolabel
+-----------------------------+
| strid numid |
|-----------------------------|
1. | 1234567890123456789 1 |
+-----------------------------+
Note that this is documented: see this FAQ.
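For readers coming from Python, the same idea (map each distinct string ID to a small consecutive integer) is what pandas' factorize does; a hypothetical sketch, not part of the Stata workflow above:

import pandas as pd

df = pd.DataFrame({"strid": ["1234567890123456789",
                             "9876543210987654321",
                             "1234567890123456789"]})
codes, uniques = pd.factorize(df["strid"])  # codes are 0-based integers per distinct string
df["numid"] = codes + 1                     # 1-based, analogous to egen numid = group(strid)
print(df)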

Matching, splitting, converting and summing string in Excel / Numbers

I'm trying to write a match-and-calculate formula in Excel (or in Numbers for Mac, which is the same for me: I've tried both and they seem equivalent, and even the function names are the same!).
This is what I have:
|     1     |      2      |     3     |
|-----------+-------------+-----------|
| Category  | other stuff | duration  |
|-----------+-------------+-----------|
| A         | ...         | 00:01:23  |
| A         | ...         | 00:30:19  |
| B         | ...         | ...       |
| A         | ...         | 00:22:12  |
| ...       | ...         | ...       |
So, in column 3 I have a duration in "hh:mm:ss" format and in column 1 I have my categories.
I want to find all rows whose column 1 matches category "A", take the corresponding column 3 value, split the string and convert the parts to numbers (in particular to seconds, i.e. hh*3600 + mm*60 + ss), and finally sum all these values. Is that possible?
I'm new to Excel and Numbers, but I'm fairly familiar with programming; this is roughly what I'd do in code:
global_secs=0;
for(row r=top to end){
if(r.get_column(1).content_equals("A")){
cell c=r.get_column(3);
string=split(c.get_content(),":")
global_secs+=int(string[1])*3600+int(string[2])*60+int(string[3])
}
}
Is there a way to achieve this in Excel sheet (or Numbers)?
I'd like to do all of this in one, or more, formula only in Excel or Numbers.
One more thing: I do not want to change the cells' format, because this should be an automatic process without human interaction. So unless there is a function to change a range of cells' format dynamically, I'd prefer not to do that (I know I could set the format to "duration" and sum without converting to integers, but my data is originally in hh:mm:ss format).
Thanks so much!
The formula you are looking for is
=SUMIF(A2:A5,"A",C2:C5)
The easiest way to get the result in seconds would have been to format the cell as [ss] in the Custom category. But as you don't want to change formatting, the other way could be:
=HOUR(result) * 3600 + MINUTE(result) * 60 + SECOND(result)
So the formula becomes
=HOUR(SUMIF(A2:A5,"A",C2:C5)) * 3600 + MINUTE(SUMIF(A2:A5,"A",C2:C5)) * 60 + SECOND(SUMIF(A2:A5,"A",C2:C5))
This looks like a job for an array (matrix) formula:
=SUM(N($A$2:$A$8="A")*$B$2:$B$8)
where column A contains the category and column B the duration. Note that you need to press Ctrl+Shift+Enter to make it work.
To convert the result to seconds, an alternative approach to @Mrig's solution would be to format the result and convert it back to a number, i.e.
=VALUE(TEXT(SUM(N($A$2:$A$8="A")*$B$2:$B$8),"[ss]"))
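If you want to sanity-check the arithmetic outside the spreadsheet, the asker's pseudocode translates almost directly; a small Python sketch with made-up sample rows:

rows = [("A", "00:01:23"), ("A", "00:30:19"), ("B", "01:00:00"), ("A", "00:22:12")]

total_secs = 0
for category, duration in rows:
    if category == "A":
        h, m, s = (int(part) for part in duration.split(":"))
        total_secs += h * 3600 + m * 60 + s

print(total_secs)  # 83 + 1819 + 1332 = 3234 seconds for category A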
