lua split string and save in lua table - string

Is there a way to split a string and save it into a table like this:
str = "23 = John, 45 = Karl, 6 = Chloe, 34 = Sarah"
--[[ 23 John
45 Karl
6 Chloe
34 Sarah]]
I want the numbers to be the keys and the names to be the values.

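Adapt this code, which prints each pair and also stores it in a table t:
local t = {}
for k, v in str:gmatch("(%d+)%s*=%s*(%a+)") do
  t[tonumber(k)] = v -- number as key, name as value
  print(k, v)
end
This assumes that the names are composed of letters only.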

Related

Stripping ints from a string in pandas column

I have a column like this:
Age
15-20 years old
20-25 years old
I want this as output:
Age_Min Age_Max
15 20
20 25
I am trying to use str.strip(), but with no success so far.
I tried d[['Age_Min','Age_Max']]=d['Age'].str.split('-',expand=True)
and the result is almost there. Is there a way to keep only the integers and drop the trailing text?
Any tips?
Use Series.str.split with expand=True:
In [858]: out = df['Age'].str.split('-', expand=True).rename(columns={0:'Age_Min', 1: 'Age_Max'})
In [860]: out['Age_Max'] = out['Age_Max'].str.split().str[0]
In [861]: out
Out[861]:
Age_Min Age_Max
0 15 20
1 20 25
Or, using a regex:
In [870]: out = df['Age'].str.extract(r"(\d*\-?\d+)")[0].str.split('-', expand=True).rename(columns={0:'Age_Min', 1: 'Age_Max'})
In [871]: out
Out[871]:
Age_Min Age_Max
0 15 20
1 20 25
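Both approaches above leave the two columns as strings. As a minimal sketch of the extraction step on its own, here is the same idea with the standard-library re module, using one capture group per number and converting each to int. The names AGE_RANGE and parse_age_range are made up for this illustration. The same two-group pattern could presumably be passed to Series.str.extract, which returns one column per capture group, followed by a conversion to integers.

```python
import re

# Two capture groups: the lower and upper bound of a range like "15-20 years old".
AGE_RANGE = re.compile(r"(\d+)\s*-\s*(\d+)")


def parse_age_range(text):
    """Return (age_min, age_max) as ints, or None if no range is found."""
    match = AGE_RANGE.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


for label in ["15-20 years old", "20-25 years old"]:
    print(label, "->", parse_age_range(label))
```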

Datamash: Transposing the column into rows based on group in bash

I have a tab-delimited file with two columns, like the following:
A 123
A 23
A 45
A 67
B 88
B 72
B 50
B 23
C 12
C 14
I want to transpose the above data based on the first column, like the following:
A 123 23 45 67
B 88 72 50 23
C 12 14
I tried datamash transpose < input-file.txt, but it didn't yield the expected output.
One awk version:
awk '{printf ($1!=f?"\n%s":" "$2),$0;f=$1}' file
A 123 23 45 67
B 88 72 50 23
C 12 14
With this version you get one blank line at the top, but it should be fast and handle large data, since no loop or array variable is used.
How it works:
($1!=f?"\n%s":" "$2),$0 If the first field is not equal to f, print a newline followed by the whole line; if $1 equals f, print only a space and field 2.
f=$1 Set f to the first field.
datamash --group=1 --field-separator=' ' collapse 2 <file | tr ',' ' '
Output:
A 123 23 45 67
B 88 72 50 23
C 12 14
Input must be sorted, as in the question.
This might work for you (GNU sed):
sed -E ':a;N;s/^((\S+)\s+.*)\n\2/\1/;ta;P;D' file
Append the next line and if the first field of the first line is the same as the first field of the second line, remove the newline and the first field of the second line. Print the first line in the pattern space and then delete it and the following newline and repeat.
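For comparison, the grouping logic that all three answers rely on can be sketched in Python with itertools.groupby. This is only an illustration, not a replacement for the shell one-liners. Like the datamash answer, it assumes that rows with the same key are adjacent and that each row has exactly two whitespace-separated columns. The function name collapse_by_key is made up for the example.

```python
from itertools import groupby


def collapse_by_key(lines):
    """Join the second column of consecutive rows that share a first column."""
    # Split each non-empty line into [key, value].
    rows = (line.split() for line in lines if line.strip())
    return [
        " ".join([key] + [value for _, value in group])
        for key, group in groupby(rows, key=lambda row: row[0])
    ]


sample = ["A 123", "A 23", "A 45", "A 67", "B 88", "B 72", "C 12", "C 14"]
for line in collapse_by_key(sample):
    print(line)
```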

sort pyspark dataframe within groups

I would like to sort column "time" within each "id" group.
The data looks like:
id time name
132 12 Lucy
132 10 John
132 15 Sam
78 11 Kate
78 7 Julia
78 2 Vivien
245 22 Tom
I would like to get this:
id time name
132 10 John
132 12 Lucy
132 15 Sam
78 2 Vivien
78 7 Julia
78 11 Kate
245 22 Tom
I tried
df.orderBy(['id','time'])
But I don't need to sort "id".
I have two questions:
1. Can I sort "time" only within the same "id", and if so, how?
2. Would it be more efficient to sort just "time" than to use orderBy() on both columns?
This is exactly what windowing is for.
You can create a window partitioned by the "id" column and sorted by the "time" column. Next you can apply any function on that window.
# Create a Window
from pyspark.sql.window import Window
w = Window.partitionBy(df.id).orderBy(df.time)
Now use this window with any function.
For example, say you want to create a column holding the time delta between consecutive rows within the same group:
import pyspark.sql.functions as f
df = df.withColumn("timeDelta", df.time - f.lag(df.time,1).over(w))
I hope this gives you an idea. Effectively you have sorted your dataframe using the window and can now apply any function to it.
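If you just want to view the result, you can compute the row number within each group and sort by id and then by that number (sorting by "order" alone would interleave rows from different groups):
df.withColumn("order", f.row_number().over(w)).sort("id", "order").show()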
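To make the intended result concrete without a Spark session, here is a plain-Python sketch of "sort time within each id, keeping the ids in their original order", plus the per-group delta that the lag example computes. It only illustrates the semantics and is not how Spark executes a window. The function names are invented for this example.

```python
def sort_within_groups(rows):
    """Sort (id, time, name) rows by time inside each id, keeping id order as first seen."""
    first_seen = {}
    for row in rows:
        first_seen.setdefault(row[0], len(first_seen))
    return sorted(rows, key=lambda row: (first_seen[row[0]], row[1]))


def time_deltas(sorted_rows):
    """Difference to the previous time in the same id; None for the first row of a group."""
    deltas = []
    previous = None
    for row in sorted_rows:
        same_group = previous is not None and previous[0] == row[0]
        deltas.append(row[1] - previous[1] if same_group else None)
        previous = row
    return deltas


rows = [
    (132, 12, "Lucy"), (132, 10, "John"), (132, 15, "Sam"),
    (78, 11, "Kate"), (78, 7, "Julia"), (78, 2, "Vivien"),
    (245, 22, "Tom"),
]
ordered = sort_within_groups(rows)
for row, delta in zip(ordered, time_deltas(ordered)):
    print(*row, delta)
```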

Converting string to numeric in Stata

I have survey data with the age of individuals in a variable named agen. Originally, the variable was string so I converted it to numeric using the encode command. When I tried to generate a new variable hhage referring to the age of head of household, the new variable generated was inconsistent.
The commands I used are the following:
encode agen, gen(age)
gen hhage=age if relntohrp==1
The new variable is not consistent. When I browsed the data, the age of the household head in the first household was 65, but the new variable showed 63. In the second household, hhage reported 28 instead of 33. And so on.
Run help encode and you can read:
Do not use encode if varname contains numbers that merely happen to be stored
as strings; instead, use generate newvar = real(varname) or destring;
see real() or [D] destring.
For example:
clear all
set more off
input id str5 age
1 "32"
2 "14"
3 "65"
4 "54"
5 "98"
end
list
encode age, gen(age2)
destring age, gen(age3)
list, nolabel
Note the difference between encode and destring. The former assigns numerical codes (1, 2, 3, ...) to the string values, while destring converts the string values themselves to numbers. You can see this by stripping the value labels when you list:
. list, nolabel
+------------------------+
| id age age3 age2 |
|------------------------|
1. | 1 32 32 2 |
2. | 2 14 14 1 |
3. | 3 65 65 4 |
4. | 4 54 54 3 |
5. | 5 98 98 5 |
+------------------------+
A simple list or browse may confuse you because encode assigns the sequence of natural numbers but also assigns value labels equal to the original strings:
. list
+------------------------+
| id age age3 age2 |
|------------------------|
1. | 1 32 32 32 |
2. | 2 14 14 14 |
3. | 3 65 65 65 |
4. | 4 54 54 54 |
5. | 5 98 98 98 |
+------------------------+
The nolabel option shows the "underlying" data.
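The difference can also be imitated outside Stata. The sketch below is not Stata's implementation; it only mimics the documented behaviour that encode numbers the distinct strings 1, 2, 3, ... in sorted order, while destring parses each string as a number. The Python function names simply borrow the Stata command names for readability.

```python
def encode(values):
    """Mimic encode: give each distinct string a code 1, 2, 3, ... in sorted order."""
    codes = {text: code for code, text in enumerate(sorted(set(values)), start=1)}
    return [codes[value] for value in values]


def destring(values):
    """Mimic destring: convert each numeric string to the number it spells."""
    return [int(value) for value in values]


ages = ["32", "14", "65", "54", "98"]
print(encode(ages))    # codes, matching the age2 column listed with nolabel
print(destring(ages))  # real ages, matching the age3 column
```

With the five sample ages, the codes come out as 2, 1, 4, 3, 5, which is exactly the age2 column in the nolabel listing and explains why hhage looked wrong.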
You mention it is inconsistent, but for future questions posting exact input and results is more useful for those trying to help you.
Try taking a look at this method? Sounds like you may have slipped up somewhere in your method.

Choose For Random Strings In Commodore 64 BASIC

I have these variable declarations in my program:
X="MAGENTA"
Y="CYAN"
Z="TAN"
A="KHAKI"
Now I want to randomly choose one of these and PRINT it. How can I do this?
My BASIC is pretty rusty but you should just be able to use something like:
10 X$ = "MAGENTA"
20 Y$ = "CYAN"
30 Z$ = "TAN"
40 A$ = "KHAKI"
50 N = INT(RND(1) * 4)
60 IF N = 0 THEN PRINT X$
70 IF N = 1 THEN PRINT Y$
80 IF N = 2 THEN PRINT Z$
90 IF N = 3 THEN PRINT A$
or, putting it in a subroutine for code re-use:
10 X$ = "MAGENTA"
20 Y$ = "CYAN"
30 Z$ = "TAN"
40 A$ = "KHAKI"
50 GOSUB 1000
60 PRINT RC$
70 END
1000 TV = INT(RND(1) * 4)
1010 IF TV = 0 THEN RC$ = X$
1020 IF TV = 1 THEN RC$ = Y$
1030 IF TV = 2 THEN RC$ = Z$
1040 IF TV = 3 THEN RC$ = A$
1050 RETURN
Of course, you probably should be using arrays for that sort of thing so you can just use:
10 DIM A$(3)
20 A$(0) = "MAGENTA"
30 A$(1) = "CYAN"
40 A$(2) = "TAN"
50 A$(3) = "KHAKI"
60 PRINT A$(INT(RND(1)*4))
The above answer is correct and comprehensive.
This answer, on the other hand, is not, but I was actually doing a little bit of Commodore BASIC last month and decided that string indexing CAN be useful, sometimes, so here's a non-answer that sort of reframes your problem.
100 X$ = "MAGENTACYAN   TAN    KHAKI  "
110 PRINT MID$(X$,INT(RND(1)*4)*7, 7)
This code gets a random int from 0 to 3, then uses that to find the start index into a single string that contains all four entries, each of which is padded out (where necessary) to 7 characters. That padding is needed because the final parameter to MID$ is the length of the substring to be extracted.
WHY BOTHER?
When to consider indexing over an array:
(1) when your string data is near-uniform length, and
(2) when you have a LOT of little strings.
If those two conditions are true, then the full code, including the data, is more compact, and takes less memory due to allocating fewer pointers.
P.S. Bonus point if you find that I've made an off-by-one error!
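The index arithmetic behind the packed-string trick can be sketched in Python, where slices count from 0. MID$ in Commodore BASIC counts from 1, so the start position there would need to be shifted by one relative to the slice below. The names WIDTH, NAMES, PACKED and pick are made up for this illustration.

```python
import random

WIDTH = 7  # every entry is padded to the length of the longest name
NAMES = ["MAGENTA", "CYAN", "TAN", "KHAKI"]
PACKED = "".join(name.ljust(WIDTH) for name in NAMES)


def pick(index):
    """Return entry number index (0-based) from the packed string, without padding."""
    start = index * WIDTH  # 0-based offset; a 1-based MID$ start would be start + 1
    return PACKED[start:start + WIDTH].rstrip()


print(pick(random.randrange(len(NAMES))))
```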
Here's another way to do it, using one variable for the output and ON..GOSUB to set it based on a random number in the range [1..4].
10 ON INT(RND(1)*4+1) GOSUB 100,110,120,130
20 PRINT A$
30 END
100 A$ = "MAGENTA":RETURN
110 A$ = "CYAN":RETURN
120 A$ = "TAN":RETURN
130 A$ = "KHAKI":RETURN
