Stata Replacing Part of String - string

I have a 10-digit string "0000000000" called my_var. I have two variables, highclass (between 0 and 10) and lowclass (between 0 and 10).
I need to convert the digits between lowclass and highclass to 1.
For example, if a row has highclass = 5 and lowclass = 1, then my_var should become 1111100000.
I am not sure if the substring command will help me, since I need to reference a variable.

As I understand it, lowclass is the position of the first 1 and highclass is the position of the last 1.
No loops are needed. In fact, a single statement would do it in Stata (which is the language the question is about).
Two ways to do it:
Old style (particularly pertinent to Stata 12 and below)
Here I have split the single statement into several, because I suspect it is clearer that way. Note that substr() (not substring()) is a function, not a command.
clear
input str10 my_var lowclass highclass
"0000000000" 1 5
"0000000000" 2 4
"0000000000" 3 3
"0000000000" 1 10
"0000000000" 7 10
end
local zeros "0000000000"
local ones "1111111111"
replace my_var = substr("`zeros'", 1, lowclass - 1)
replace my_var = my_var + substr("`ones'", 1, highclass - lowclass + 1)
replace my_var = my_var + substr("`zeros'", 1, 10 - highclass)
list
+----------------------------------+
| my_var lowclass highcl~s |
|----------------------------------|
1. | 1111100000 1 5 |
2. | 0111000000 2 4 |
3. | 0010000000 3 3 |
4. | 1111111111 1 10 |
5. | 0000001111 7 10 |
+----------------------------------+
New style (Stata 13 up)
Mata and Stata 13 up allow string multiplication (e.g. 10 * "1"), so this works:
replace my_var = (lowclass - 1) * "0" + (highclass - lowclass + 1) * "1" + (10 - highclass) * "0"
Note that e.g. -1 * "0" is perfectly legal but evaluates as missing (empty string).
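The same arithmetic can be checked outside Stata. The following is a minimal Python sketch, not Stata code; the function name build_mask and the width parameter are illustrative assumptions. As with the Stata behaviour noted above, Python returns an empty string when a string is multiplied by zero or a negative number.

```python
def build_mask(lowclass: int, highclass: int, width: int = 10) -> str:
    """Return `width` digits with 1s in positions lowclass..highclass (1-based)."""
    leading = "0" * (lowclass - 1)             # zeros before the first 1
    ones = "1" * (highclass - lowclass + 1)    # the run of 1s
    trailing = "0" * (width - highclass)       # zeros after the last 1
    return leading + ones + trailing


# Same rows as the Stata example data
for low, high in [(1, 5), (2, 4), (3, 3), (1, 10), (7, 10)]:
    print(low, high, build_mask(low, high))
```

The three pieces correspond one to one to the three terms of the Stata expression.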

Related

How to split data and assign it into designated variables?

I have data in Stata regarding feelings about the current situation. There are seven types of feeling. The data is stored in the following format (note that the data type is string, and one person can give more than one answer)
feeling
4,7
1,3,4
2,5,6,7
1,2,3,4,5,6,7
Since the data is a string, I tried to separate it by
split feeling, parse (,)
and I got the result
feeling1  feeling2  feeling3  feeling4  feeling5  feeling6  feeling7
4         7
1         3         4
2         5         6         7
1         2         3         4         5         6         7
However, this is not the result I want. Each feeling number should go into the variable with the matching number. For instance:
feeling1  feeling2  feeling3  feeling4  feeling5  feeling6  feeling7
                              4                             7
1                   3         4
          2                             5         6         7
1         2         3         4         5         6         7
I am not sure if there is any built-in command or function for this kind of problem. I am thinking about using forval to loop through every value in each variable and move it into the correct variable.
A loop over the distinct values would be enough here. I give your example in a form explained in the Stata tag wiki as more helpful and then give code to get the variables you want as numeric variables.
* Example generated by -dataex-. For more info, type help dataex
clear
input str13 feeling
"4,7"
"1,3,4"
"2,5,6,7"
"1,2,3,4,5,6,7"
end
forval j = 1/7 {
gen wanted`j' = `j' if strpos(feeling, "`j'")
gen better`j' = strpos(feeling, "`j'") > 0
}
l feeling wanted1-better3
+---------------------------------------------------------------------------+
| feeling wanted1 better1 wanted2 better2 wanted3 better3 |
|---------------------------------------------------------------------------|
1. | 4,7 . 0 . 0 . 0 |
2. | 1,3,4 1 1 . 0 3 1 |
3. | 2,5,6,7 . 0 2 1 . 0 |
4. | 1,2,3,4,5,6,7 1 1 2 1 3 1 |
+---------------------------------------------------------------------------+
If you wanted a string result that would be yielded by
gen wanted`j' = "`j'" if strpos(feeling, "`j'")
Had the number of feelings been 10 or more you would have needed more careful code as for example a search for "1" would find it within "10".
Indicator (some say dummy) variables with distinct values 1 or 0 are immensely more valuable for most analysis of this kind of data.
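The caveat about ten or more categories can be illustrated by matching whole comma-separated tokens instead of substrings. This is a Python sketch rather than Stata code; the function name indicators and the two-digit example row are assumptions made for illustration only.

```python
def indicators(feeling: str, n_categories: int = 7) -> list:
    """Return 0/1 indicators for categories 1..n_categories.

    Matching is done on whole comma-separated tokens, so "1" is not
    found inside "10".
    """
    tokens = {t.strip() for t in feeling.split(",") if t.strip()}
    return [int(str(j) in tokens) for j in range(1, n_categories + 1)]


print(indicators("4,7"))          # first row of the example data
print(indicators("10,12", 12))    # hypothetical row with two-digit codes
```

With token matching, the hypothetical row "10,12" sets only the indicators for 10 and 12, whereas a plain substring search for "1" or "2" would also match.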
Note Stata-related sources such as this FAQ, this paper, and this paper.

Creation of brace sequence not working in bash

I must create a sequence of numbers using the number of elements that a list has.
arr1=(1 2 3 4 5 6)
I thought about the following expression in order to do so, but it is not working.
echo {0..$(expr ${#arr1[*]} - 1)}
{0..5} # output
The correct output should be:
0 1 2 3 4 5
Could anyone explain why I do not get the correct values?
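Brace expansion is performed before command substitution, so bash sees {0..$(...)} as a literal string rather than a numeric range. You just need to add an eval, which makes the shell parse the line a second time after the substitution: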
$ a=(1 2 3 4 5 6)
$ eval echo {0..$(expr ${#a[*]} - 1)}
0 1 2 3 4 5

Adding zeros to a string without generating a new variable

I am trying to add zeros to a string variable in such a way that all values of the variable have the same number of digits (assume 3).
clear
input tina bina str4 pine
1 10 "99"
1 11 "99"
2 11 "99"
2 11 "99"
3 12 "."
4 12 "888"
5 14 "88"
6 15 "777"
7 16 "77"
8 17 "0"
8 18 "7"
end
I managed to do this by generating a new variable which stores the number of digits I need to add to each observation in order to reach 3:
generate pi=3-strlen(pine)
replace pine= ("0"*pi) + pine if strlen(pine)<3
I wonder if there is a way to obtain the same result but without generating the variable?
I tried the following but it does not work:
replace pine= ("0"*(`=3-strlen(pine)')) + pine if strlen(pine)<3
Probably I am not so clear about what happens when I evaluate expressions.
Your approach does not work because it evaluates the expression for the first observation only:
. display `= 3 - strlen(pine)'
1
The single quotes are not required:
replace pine = ("0" * (3-strlen(pine) ) ) + pine if strlen(pine) < 3
+--------------------+
| tina bina pine |
|--------------------|
1. | 1 10 099 |
2. | 1 11 099 |
3. | 2 11 099 |
4. | 2 11 099 |
5. | 3 12 00. |
|--------------------|
6. | 4 12 888 |
7. | 5 14 088 |
8. | 6 15 777 |
9. | 7 16 077 |
10. | 8 17 000 |
|--------------------|
11. | 8 18 007 |
+--------------------+
I know there is already an accepted answer, but I wanted to add my suggestion. It is perhaps a little simpler than the other answer and is straightforward to explain. You want to replace a string variable holding numbers with zero-padded versions and keep it as a string. You can do this by running:
replace pine = string(real(pine),"%03.0f")
Depending on your goal, this may be better than the previous answer, because it keeps your missing value as missing and does not add zeros to it. Hopefully this is helpful.
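For comparison, the padding rule can be sketched in Python. This is not Stata code; the function name pad_digits is hypothetical, and leaving "." untouched is an assumption that mirrors the behaviour described for the second answer.

```python
def pad_digits(value: str, width: int = 3) -> str:
    """Left-pad a string of digits with zeros; leave anything else unchanged."""
    if value.isdigit():
        return value.zfill(width)
    return value  # e.g. "." (missing) stays as it is


pine = ["99", ".", "888", "88", "777", "77", "0", "7"]
print([pad_digits(p) for p in pine])
```

Checking isdigit() first is what keeps the missing marker from being padded to "00.", which is what the first Stata answer produces for that row.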

How many times a specific character appears after each other in a string

I’ve been looking for a way to count how many times a specific character appears consecutively in a string. All the ways I found just count how many times character “A” appears in the string in total.
Example of string:
0xAAABBC0123456789AABBCCDD0123456789ABCDEF
Each string is 43 characters long and starts with “0x”. Each string contains only the characters 0-9 and A-F (16 different characters in total), in random order. A character can appear several times in a row, for example “AAA” or "111".
I’m interested in the maximum number of times each of the 16 characters appears consecutively in one string, and I want to check this for all my strings.
So far I’ve only come up with this Powershell script that counts how many times each character appears per line:
Get-Content "C:\Temp\strings.txt" | ForEach-Object{
New-Object PSObject -Property @{
Strings = $_
Row = $_.ReadCount
9 = [regex]::matches($_,"9").count
D = [regex]::matches($_,"D").count
B = [regex]::matches($_,"B").count
C = [regex]::matches($_,"C").count
7 = [regex]::matches($_,"7").count
3 = [regex]::matches($_,"3").count
1 = [regex]::matches($_,"1").count
8 = [regex]::matches($_,"8").count
F = [regex]::matches($_,"F").count
2 = [regex]::matches($_,"2").count
4 = [regex]::matches($_,"4").count
E = [regex]::matches($_,"E").count
6 = [regex]::matches($_,"6").count
5 = [regex]::matches($_,"5").count
A = [regex]::matches($_,"A").count
0 = [regex]::matches($_,"0").count
}
} | Sort Count -Descending | Export-Csv -Path "C:\Temp\output.csv" -NoTypeInformation
I would preferably do this in Powershell, but if there’s another way of doing this more easily, please let me know.
You could use a lookbehind and a backreference to split the string into repeating groups:
$s = '0xAAABBC0123456789AABBCCDD0123456789ABCDEF'
$repeats = $s.Remove(0, 2) -split '(?<=(.))(?!\1|$)'
Now we can group the substrings based on the first character of each:
$groups = $repeats |Group-Object {$_[0]} -AsHashTable
And finally grab the longest sequence of each character:
'0123456789ABCDEF'.ToCharArray() |%{
[pscustomobject]@{
Character = "$_"
MaxLength = "$($groups[$_] |Sort Length -Descending |Select -First 1)".Length
}
}
And you should end up with a list (for your example) like this:
Character MaxLength
--------- ---------
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 1
A 3
B 2
C 2
D 2
E 1
F 1
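The run-length idea above can also be sketched in Python with itertools.groupby, which groups consecutive equal characters. This is an illustrative sketch, not a translation of the PowerShell code; the function name max_runs is an assumption.

```python
from itertools import groupby


def max_runs(hex_string: str, alphabet: str = "0123456789ABCDEF") -> dict:
    """Return the longest run of consecutive repeats for each character in `alphabet`."""
    # Drop the "0x" prefix so its zero is not counted
    body = hex_string[2:] if hex_string.startswith("0x") else hex_string
    longest = {ch: 0 for ch in alphabet}
    for ch, run in groupby(body):
        if ch in longest:
            longest[ch] = max(longest[ch], len(list(run)))
    return longest


runs = max_runs("0xAAABBC0123456789AABBCCDD0123456789ABCDEF")
for ch, length in runs.items():
    print(ch, length)
```

For the example string this gives 3 for A, 2 for B, C and D, and 1 for every other character, in line with the table above.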
One approach is to iterate over the source string character by character and keep track of how many times each character has been seen. This is easily done with a hash table, like so:
# Hashtable initialization. Add keys for 0-9A-F:
# Each char has initial count 0
$ht = @{}
"ABCDEF0123456789".ToCharArray() | % {
$ht.Add($($_.ToString()), 0)
}
# Test data, the 0x prefix will contain one extra zero
$s = "0xAAABBC0123456789AABBCCDD0123456789ABCDEF"
# Convert data to char array for iteration
# Increment value in hashtable by using the char as key
$s.ToCharArray() | % { $ht[$_.ToString()]+=1 }
# Check results
PS C:\> $ht
Name Value
---- -----
B 5
3 2
5 2
x 1
9 2
2 2
8 2
0 3
1 2
E 1
7 2
F 1
6 2
4 2
D 3
A 6
C 4
Build a HexPair by iterating over the string position by position (omitting the last) and increment a value in a hash table with the HexPair as the key.
$String = '0xAAABBC0123456789AABBCCDD0123456789ABCDEF'
$Hash=@{}
for ($i=2;$i -le ($string.length-2);$i++){
$Hash[$($String.Substring($i,2))]+=1
}
$Hash.GetEnumerator()|ForEach-Object{
[PSCustomObject]@{HexPair = $_.Name
Count = $_.Value}
} |Sort Count -Descending
Sample output
HexPair Count
------- -----
BC 3
AB 3
AA 3
CD 2
BB 2
9A 2
89 2
78 2
67 2
56 2
45 2
34 2
23 2
12 2
01 2
EF 1
DE 1
DD 1
D0 1
CC 1
C0 1
Alternative output:
$Hash.GetEnumerator()|ForEach-Object{
[PSCustomObject]@{HexPair = $_.Name
Count = $_.Value}
} |Sort HexPair|group Count |%{"Count {0} {1}" -f $_.Name,($_.Group.HexPair -Join(', '))}|Sort
Count 1 C0, CC, D0, DD, DE, EF
Count 2 01, 12, 23, 34, 45, 56, 67, 78, 89, 9A, BB, CD
Count 3 AA, AB, BC
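The pair-counting approach maps naturally onto collections.Counter in Python. The sketch below is an illustration under the same assumption as the PowerShell loop (overlapping two-character windows after the "0x" prefix); the function name pair_counts is hypothetical.

```python
from collections import Counter


def pair_counts(hex_string: str) -> Counter:
    """Count overlapping two-character pairs, skipping a leading "0x"."""
    body = hex_string[2:] if hex_string.startswith("0x") else hex_string
    return Counter(body[i:i + 2] for i in range(len(body) - 1))


counts = pair_counts("0xAAABBC0123456789AABBCCDD0123456789ABCDEF")
for pair, n in counts.most_common():
    print(pair, n)
```

Note that a pair such as AA counts adjacent positions, so a run of three As contributes two AA pairs; this measures adjacency, not the maximum run length.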
The result came out this way, even though it gives me 15 extra rows per string, I can easily filter unwanted material out in Microsoft Excel.
# Removed all "0x" in the text file before running this script
$strings = Get-Content "C:\Temp\strings_without_0x.txt"
foreach($s in $strings) {
# No Remove(0, 2) here: the prefix is already gone from the input file
$repeats = $s -split '(?<=(.))(?!\1|$)'
$groups = $repeats |Group-Object {$_[0]} -AsHashTable
'0123456789ABCDEF'.ToCharArray() |%{
[pscustomobject]@{
String = "$s"
Character = "$_"
MaxLength = "$($groups[$_] |Sort Length -Descending |Select -First 1)".Length
}
} | Sort Count -Descending | Export-Csv -Path "C:\Temp\output.csv" -NoTypeInformation -Append}
Thank you for all great answers!
Try this.
$out=@()
$string="0xAAABBC0123456789AABBCCDD0123456789ABCDEF"
$out+="Character,Count"
$out+='0123456789ABCDEF'.ToCharArray()|%{"$_," + ($string.split("$_")|Where-object{$_ -eq ""}).count}
ConvertFrom-Csv $out |sort count -Descending
This yields the following:
Character Count
--------- -----
A 3
B 2
0 1
C 1
D 1
F 1
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
E 0
You can put it into a function like this:
function count_dups ($string){
$out=@() # empty array
$out+="Character,Count" # header
$out+='0123456789ABCDEF'.ToCharArray()|%{"$_," + ($string.split("$_")|Where-object{$_ -eq ""}).count}
return ConvertFrom-Csv $out | sort count -Descending
}
The biggest part of what I'm doing here is this line.
'0123456789ABCDEF'.ToCharArray()|%{"$_," + ($string.split("$_")|Where-object{$_ -eq ""}).count}
I am splitting the string into an array on each of the characters fed in from the character array '0123456789ABCDEF'. Then I am counting the empty elements in the array.
I'm only creating the array $out so that the output can be formatted like your example.

How to use slicing in Python

I am finding slicing in Python a bit difficult. Let's say I want to display the first five and last five characters of a phrase; how do I go about it? For example:
words = input("Enter a word ")
slice = words[:2]
print(slice)
You can use negative indexing to slice from the end:
>>> s="teststring"
>>>
>>> s[-5:]
'tring'
>>> s[:5]
'tests'
Slice notation follows this pattern:
[start:end:step]
One way to remember how slices work is to think of the indices as pointing between characters, with the left edge of the first character numbered 0. Then the right edge of the last character of a string of n characters has index n, for example:
 +---+---+---+---+---+---+
 | P | y | t | h | o | n |
 +---+---+---+---+---+---+
 0   1   2   3   4   5   6
-6  -5  -4  -3  -2  -1
Read more about slicing https://docs.python.org/2/tutorial/introduction.html#strings
And https://docs.python.org/2.3/whatsnew/section-slices.html
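Putting the two slices together answers the original question about the first five and last five characters. This is a minimal sketch; the function name first_and_last and its parameter n are illustrative, not part of any library.

```python
def first_and_last(text: str, n: int = 5) -> tuple:
    """Return the first n and the last n characters of text."""
    return text[:n], text[-n:]


first, last = first_and_last("teststring")
print(first)  # tests
print(last)   # tring
```

Slices never raise an error when the string is shorter than n; they simply return what is available, so for short inputs the two results can overlap.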
