Gretl - How to create a dummy variable indicating that the individual has a small child - economics

I have a variable that is the age of the last child, and I have to create a dummy for individuals who have children under 6 years of age. We also have some individuals with empty values, or who have had no children.
example of variable:
1 - 10
2 - 5
3 - 7
4 - 30
5 -
6 - 25
7 - 3
8 - 15
9 -
10 - 33

If I understood correctly, you want to create a dummy using two conditions:
dummy = 1 if:
(condition 1) the age is less than 6
(condition 2) the age is available (i.e., not NA)
To achieve that in Gretl you can use:
##### Creating "age of the last child" series #####
nulldata 10
series age_of_the_last_child = NA
matrix m = {10, 5, 7, 30, NA, 25, 3, 15, NA, 33}
loop i = 1..10 --quiet
age_of_the_last_child[i] = m[i]
endloop
###################################################
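# the comparison below yields NA for missing ages; misszero() then recodes NA to 0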
series dummy = (age_of_the_last_child < 6) ? 1 : 0
series dummy = misszero(dummy)
Or, if you want a more compact way:
series dummy = misszero((age_of_the_last_child < 6) ? 1 : 0)

Related

Compare current value with n values above and below on Pandas DataFrame

I have this df:
x
0 2
1 2
2 2
3 1
4 1
5 2
6 2
I need to compare the current value in column x with the n previous and the n next values, based on a defined condition; if the condition is met at least q times, add 1 in a new column, if not, add 0.
For instance, suppose n is 2, q is 3 and the condition is current_value <= value / 2. In this case the code will do 7 comparisons:
1st comparison: compare current_value = 2 to the previous n = 2 numbers (there are none, because this is the first value in the column), then compare current_value = 2 to the next n = 2 values (both are 2, and 2 <= 2/2 fails for both). 0 conditions are met; since 0 < q = 3, the code adds 0 to the new column.
2nd comparison: compare current_value = 2 to the previous n = 2 numbers (there is just one number above, and 2 <= 2/2 fails), then compare current_value = 2 to the next n = 2 values (a 2 and then a 1; neither 2 <= 2/2 nor 2 <= 1/2 holds). 0 conditions are met; since 0 < q = 3, the code adds 0 to the new column.
3rd comparison: 0 conditions are met; since 0 < q = 3, the code adds 0 to the new column.
4th comparison: compare current_value = 1 to the previous n = 2 numbers (two 2s above, and 1 <= 2/2 holds for both), then compare current_value = 1 to the next n = 2 values (a 1 and then a 2; 1 <= 2/2 holds, 1 <= 1/2 does not). 3 conditions are met; since 3 >= q = 3, the code adds 1 to the new column.
5th comparison: 3 conditions are met; since 3 >= q = 3, the code adds 1 to the new column.
6th comparison: 0 conditions are met; since 0 < q = 3, the code adds 0 to the new column.
7th comparison: 0 conditions are met; since 0 < q = 3, the code adds 0 to the new column.
Desired result:
x comparison
0 2 0
1 2 0
2 2 0
3 1 1
4 1 1
5 2 0
6 2 0
I was thinking of using something like the shift function, but I'm not sure how to implement it. Any help?
I suggest using numpy here, to benefit from its sliding window view:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view as swv
n = 2
q = 3
# convert to numpy array
a = df['x'].astype(float).to_numpy()
# create a sliding window
# remove central value, divide by 2
# compare to original value
# count number of matches
count = (a[:,None] <= swv(np.pad(a, n, constant_values=np.nan), 2*n+1)[:, np.r_[:n,n+1:2*n+1]]/2).sum(1)
# array([0, 0, 0, 3, 3, 0, 0])
# compare number of matches to q
df['comparison'] = (count >= q).astype(int)
print(df)
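To make the dense one-liner above easier to follow, here is the same computation unpacked step by step (a sketch on the sample data; the intermediate variable names are mine, not from the answer):
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view as swv
a = np.array([2, 2, 2, 1, 1, 2, 2], dtype=float)
n, q = 2, 3
padded = np.pad(a, n, constant_values=np.nan)   # n NaNs on each side
windows = swv(padded, 2*n + 1)                  # shape (7, 5): n before, self, n after
neighbors = windows[:, np.r_[:n, n+1:2*n+1]]    # drop the central column (the value itself)
count = (a[:, None] <= neighbors / 2).sum(1)    # NaN comparisons count as False
print(count)                     # [0 0 0 3 3 0 0]
print((count >= q).astype(int))  # [0 0 0 1 1 0 0]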
An alternative with only pandas would require computing two rolling windows (forward and backward), as it's not trivial to access the current index in a centered rolling window with min_periods=1:
n = 2
q = 3
s1 = df['x'].rolling(n+1, min_periods=2).apply(lambda x: sum(x.iloc[-1]<=x.iloc[:-1]/2))
s2 = df.loc[::-1, 'x'].rolling(n+1, min_periods=2).apply(lambda x: sum(x.iloc[-1]<=x.iloc[:-1]/2))
df['comparison'] = s1.add(s2, fill_value=0).ge(q).astype(int)
Output:
x comparison
0 2 0
1 2 0
2 2 0
3 1 1
4 1 1
5 2 0
6 2 0
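As a sanity check, the combined counts can be printed before thresholding (a quick sketch rerunning the same rolling logic on the sample data; the count_matches helper is just a named version of the lambdas above):
import pandas as pd
df = pd.DataFrame({'x': [2, 2, 2, 1, 1, 2, 2]})
n, q = 2, 3
def count_matches(w):
    # the last element of each window is the "current" value
    return sum(w.iloc[-1] <= w.iloc[:-1] / 2)
s1 = df['x'].rolling(n + 1, min_periods=2).apply(count_matches)            # looks back
s2 = df.loc[::-1, 'x'].rolling(n + 1, min_periods=2).apply(count_matches)  # looks forward
print(s1.add(s2, fill_value=0).tolist())  # [0.0, 0.0, 0.0, 3.0, 3.0, 0.0, 0.0]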

How to extract data from data frame when value of column change

I want to extract part of the data frame when the value changes from 0 to 1.
logic1: when the value changes from 0 to 1, start saving data until the value changes back to 0 (also keep the points just before and just after the block of 1s).
logic2: when the value changes from 0 to 1, start saving data until the value changes back to 0 (don't keep the points before and after the block of 1s).
Only save data the first time the flag changes from 0 to 1; if the value later changes from 0 to 1 again, nothing more needs to be done.
df=pd.DataFrame({'value':[3,4,7,8,11,1,15,20,15,16,87],'flag':[0,0,0,1,1,1,0,0,1,1,0]})
Desired output for logic1:
df_out_1=pd.DataFrame({'value':[7,8,11,1,15]})
Desired output for logic2:
df_out_2=pd.DataFrame({'value':[8,11,1]})
The idea is to label consecutive groups of 1s and 0s in s, keep only the groups where flag is 1, and pick the first such group by comparing against the minimal group label:
df = df.reset_index(drop=True)
s = df['flag'].ne(df['flag'].shift()).cumsum()
m = s.eq(s[df['flag'].eq(1)].min())
df2 = df.loc[m, ['value']]
print (df2)
value
3 8
4 11
5 1
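To see why this selects the first block of 1s, here are the intermediate group labels on the sample data (a small sketch rerunning the grouping step from the snippet above):
import pandas as pd
df = pd.DataFrame({'value': [3, 4, 7, 8, 11, 1, 15, 20, 15, 16, 87],
                   'flag': [0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0]})
# each change in flag starts a new group label
s = df['flag'].ne(df['flag'].shift()).cumsum()
print(s.tolist())  # [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5]
# the groups where flag == 1 are labeled 2 and 4; the first one is min() == 2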
And then, for the first output, extend the selection by the rows just before and just after, by shifting the (default RangeIndex) positions by ±1 and taking the union:
df1 = df.loc[(df2.index + 1).union(df2.index - 1), ['value']]
print (df1)
value
2 7
3 8
4 11
5 1
6 15

Pandas remove group if difference between first and last row in group exceeds value

I have a dataframe df:
df = pd.DataFrame({})
df['X'] = [3,8,11,6,7,8]
df['name'] = [1,1,1,2,2,2]
X name
0 3 1
1 8 1
2 11 1
3 6 2
4 7 2
5 8 2
For each group within 'name', I want to remove that group if the absolute difference between the first and last row of that group is smaller than a specified value d_dif:
For example, when d_dif = 5, I want to get:
X name
0 3 1
1 8 1
2 11 1
If your data is increasing in X within each group, you can use groupby().transform() and np.ptp:
threshold = 5
ranges = df.groupby('name')['X'].transform(np.ptp)
df[ranges > threshold]
If you only care about first and last, then transform just first and last:
threshold = 5
groups = df.groupby('name')['X']
ranges = groups.transform('last') - groups.transform('first')
df[ranges.abs() > threshold]
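A minimal end-to-end check of the second variant on the sample data from the question (a sketch; the comments show what this would print):
import pandas as pd
df = pd.DataFrame({'X': [3, 8, 11, 6, 7, 8],
                   'name': [1, 1, 1, 2, 2, 2]})
threshold = 5
groups = df.groupby('name')['X']
# last-minus-first range per group, broadcast back to every row
ranges = groups.transform('last') - groups.transform('first')
print(df[ranges.abs() > threshold])
#     X  name
# 0   3     1
# 1   8     1
# 2  11     1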

Reorder columns in groups by number embedded in column name?

I have a very large dataframe with 1,000 columns. The first few columns occur only once, denoting a customer. The next few columns represent multiple encounters with the customer, with an underscore and the encounter number. Every additional encounter adds a new column, so there is NOT a fixed number of columns -- it'll grow with time.
Sample dataframe header structure excerpt:
id dob gender pro_1 pro_10 pro_11 pro_2 ... pro_9 pre_1 pre_10 ...
I'm trying to re-order the columns based on the number after the column name, so all _1 should be together, all _2 should be together, etc, like so:
id dob gender pro_1 pre_1 que_1 fre_1 gen_1 pro_2 pre_2 que_2 fre_2 ...
(Note that the re-order should order the numbers correctly; the current order treats them like strings, which orders 1, 10, 11, etc. rather than 1, 2, 3)
Is this possible to do in pandas, or should I be looking at something else? Any help would be greatly appreciated! Thank you!
EDIT:
Alternatively, is it also possible to re-arrange column names based on the string part AND number part of the column names? So the output would then look similar to the original, except the numbers would be considered so that the order is more intuitive:
id dob gender pro_1 pro_2 pro_3 ... pre_1 pre_2 pre_3 ...
EDIT 2.0:
Just wanted to thank everyone for helping! While only one of the responses worked, I really appreciate the effort and learned a lot about other approaches / ways to think about this.
Here is one way you can try:
# column names copied from your example
example_cols = 'id dob gender pro_1 pro_10 pro_11 pro_2 pro_9 pre_1 pre_10'.split()
# sample DF
df = pd.DataFrame([range(len(example_cols))], columns=example_cols)
df
# id dob gender pro_1 pro_10 pro_11 pro_2 pro_9 pre_1 pre_10
#0 0 1 2 3 4 5 6 7 8 9
# number of columns excluded from sorting
N = 3
# get a list of columns from the dataframe
cols = df.columns.tolist()
# split each column name into a tuple (column_name, prefix, number), sort on the
# 2nd and 3rd items of the tuple, then retrieve the first item.
# adjust to "key = lambda x: x[2]" to group cols by numbers only
cols_new = cols[:N] + [
    a[0] for a in sorted(
        [(c, p, int(n)) for c in cols[N:] for p, n in [c.split('_')]],
        key=lambda x: (x[1], x[2]))
]
# get the new dataframe based on the cols_new
df_new = df[cols_new]
# id dob gender pre_1 pre_10 pro_1 pro_2 pro_9 pro_10 pro_11
#0 0 1 2 8 9 3 6 7 4 5
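For the original goal (all _1 columns together, then all _2, and so on), the same approach works with the sort key flipped to (number, prefix); a small sketch on the example columns above:
import pandas as pd
example_cols = 'id dob gender pro_1 pro_10 pro_11 pro_2 pro_9 pre_1 pre_10'.split()
df = pd.DataFrame([range(len(example_cols))], columns=example_cols)
N = 3  # leading one-off columns kept in place
cols = df.columns.tolist()
# sort by (number, prefix) so all _1 columns come first, then _2, ...
cols_new = cols[:N] + sorted(cols[N:],
                             key=lambda c: (int(c.split('_')[1]), c.split('_')[0]))
print(df[cols_new].columns.tolist())
# ['id', 'dob', 'gender', 'pre_1', 'pro_1', 'pro_2', 'pro_9', 'pre_10', 'pro_10', 'pro_11']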
Luckily there is a one-liner in Python that can fix this:
df = df.reindex(sorted(df.columns), axis=1)
For example, let's say you had this dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Name': [2, 4, 8, 0],
                   'ID': [2, 0, 0, 0],
                   'Prod3': [10, 2, 1, 8],
                   'Prod1': [2, 4, 8, 0],
                   'Prod_1': [2, 4, 8, 0],
                   'Pre7': [2, 0, 0, 0],
                   'Pre2': [10, 2, 1, 8],
                   'Pre_2': [10, 2, 1, 8],
                   'Pre_9': [10, 2, 1, 8]})
print(df)
Output:
Name ID Prod3 Prod1 Prod_1 Pre7 Pre2 Pre_2 Pre_9
0 2 2 10 2 2 2 10 10 10
1 4 0 2 4 4 0 2 2 2
2 8 0 1 8 8 0 1 1 1
3 0 0 8 0 0 0 8 8 8
Then use:
df = df.reindex(sorted(df.columns), axis=1)
The dataframe will then look like:
ID Name Pre2 Pre7 Pre_2 Pre_9 Prod1 Prod3 Prod_1
0 2 2 10 2 10 10 2 10 2
1 0 4 2 0 2 2 4 2 4
2 0 8 1 0 1 1 8 1 8
3 0 0 8 0 8 8 0 8 0
As you can see, within each prefix the columns without an underscore come first, followed by the underscored columns ordered by the number after the underscore. However, this also sorts the column names alphabetically, so the column names that come first in the alphabet will be first.
You need to split your column names on '_' and then convert to int:
import numpy as np
import pandas as pd

c = ['A_1','A_10','A_2','A_3','B_1','B_10','B_2','B_3']
df = pd.DataFrame(np.random.randint(0, 100, (2, 8)), columns=c)
df.reindex(sorted(df.columns, key=lambda x: int(x.split('_')[1])), axis=1)
Output:
A_1 B_1 A_2 B_2 A_3 B_3 A_10 B_10
0 68 11 59 69 37 68 76 17
1 19 37 52 54 23 93 85 3
For the next case, you need natural ("human") sorting:
import re

def atoi(text):
    return int(text) if text.isdigit() else text

def natural_keys(text):
    '''
    alist.sort(key=natural_keys) sorts in human order
    http://nedbatchelder.com/blog/200712/human_sorting.html
    (See Toothy's implementation in the comments)
    '''
    return [atoi(c) for c in re.split(r'(\d+)', text)]

df.reindex(sorted(df.columns, key=natural_keys), axis=1)
Output:
A_1 A_2 A_3 A_10 B_1 B_2 B_3 B_10
0 68 59 37 76 11 69 68 17
1 19 52 23 85 37 54 93 3
Try this.
To re-order the columns based on the number after the column name:
cols_fixed = df.columns[:3].tolist()  # change index no based on your df
cols_variable = df.columns[3:].tolist()  # change index no based on your df
cols_variable = sorted(cols_variable, key=lambda x: int(x.split('_')[1]))  # sort by the number after '_'
cols_new = cols_fixed + cols_variable
new_df = pd.DataFrame(df[cols_new])
To re-arrange column names based on the string part AND number part of the column names:
cols_fixed = df.columns[:3].tolist()  # change index no based on your df
cols_variable = df.columns[3:].tolist()  # change index no based on your df
cols_variable = sorted(cols_variable)
cols_new = cols_fixed + cols_variable
new_df = pd.DataFrame(df[cols_new])
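A quick check of the first variant against the example header from the question (a one-row frame is enough to exercise the column order):
import pandas as pd
cols = 'id dob gender pro_1 pro_10 pro_11 pro_2 pro_9 pre_1 pre_10'.split()
df = pd.DataFrame([range(len(cols))], columns=cols)
cols_fixed = df.columns[:3].tolist()
cols_variable = sorted(df.columns[3:], key=lambda x: int(x.split('_')[1]))
print(df[cols_fixed + cols_variable].columns.tolist())
# ['id', 'dob', 'gender', 'pro_1', 'pre_1', 'pro_2', 'pro_9', 'pro_10', 'pre_10', 'pro_11']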

Keeping rows from the previous year

I am just going to give it a try, as I know that there are some smart people here who might have R code for this.
I won't be able to code this by myself.
So I have a dataset that contains names and year-months between 2000-01 and 2008-12, looking like this:
Name Date
A 2000-01
A 2000-02
A ...
A 2008-12
B 2000-01
B ...
B 2008-12
C and so on..
Ideally, for each name there is one value in my key column for each year; that's the best I can ask for. Unfortunately, some years don't have a value in my key column.
Going further into my dataset, looking only at Name A:
If I do not have an observation for every year between 2000 and 2008, I want to recover the row for the missing year, taking the month from the next available observation.
In this example:
2003-02 has a value for my key column and 2002-02 does not; I want to get back the row for the date 2002-02 and Name A.
In a nutshell: "Keeping rows from the previous year based on key column from the next year"
Is there some easy way to code this?
Thank you :)
There's no straightforward and easy way to code what you're describing, but it's certainly possible to break the problem down into easier parts. The core part of the problem is as follows. Given a dataframe of rows with non-NA values, e.g.
year month
1 2002 12
2 2005 11
3 2006 01
4 2008 07
for each row, check the dataframe to see if the previous year exists; if yes, return the row; if no, additionally return a row with the previous year and the same month. Here's what a function to do that might look like:
check_ym <- function(y, m, dat) {
  if ((y - 1) %in% dat$year) {
    return(data.frame(Date = paste(y, m, sep = "-"), stringsAsFactors = FALSE))
  } else {
    return(data.frame(Date = paste(c(y - 1, y), c(m, m), sep = "-"), stringsAsFactors = FALSE))
  }
}
Now, let's make some fake data.
library(dplyr)
library(tidyr)
library(purrr)
# Simulate data
set.seed(123)
x <- data.frame(Date = paste(sample(2000:2008, 4),
                             sprintf("%02d", sample(1:12, 4, replace = TRUE)),
                             sep = "-"),
                KeyColumn = floor(runif(4, 1, 10)))
d <- data.frame(Date = paste(rep(2000:2008, each = 12),
                             sprintf("%02d", rep(1:12, times = 9)),
                             sep = "-")) %>%
  left_join(x)
Identify the non-NA rows:
dd <- d %>%
  na.omit() %>%
  separate(Date, into = c("year", "month")) %>%
  mutate(year = as.numeric(year))
dd
# year month KeyColumn
# 1 2002 12 5
# 2 2005 11 5
# 3 2006 01 5
# 4 2008 07 9
Then, we run the function above, iterating through the year and month columns. This gives us
out <- map2_df(dd$year, dd$month, .f = check_ym, dat = dd)
out
# Date
# 1 2001-12
# 2 2002-12
# 3 2004-11
# 4 2005-11
# 5 2006-01
# 6 2007-07
# 7 2008-07
Finally, we join this with our original data:
inner_join(out, d)
# Joining, by = "Date"
# Date KeyColumn
# 1 2001-12 NA
# 2 2002-12 5
# 3 2004-11 NA
# 4 2005-11 5
# 5 2006-01 5
# 6 2007-07 NA
# 7 2008-07 9
This is just for one Name. We can also do this for many Names. First create some fake data:
# Simulate data
set.seed(123)
d <- map_df(setNames(1:3, LETTERS[1:3]), function(...) {
  x <- data.frame(Date = paste(sample(2000:2008, 4),
                               sprintf("%02d", sample(1:12, 4, replace = TRUE)),
                               sep = "-"),
                  KeyColumn = floor(runif(4, 1, 10)))
  data.frame(Date = paste(rep(2000:2008, each = 12),
                          sprintf("%02d", rep(1:12, times = 9)),
                          sep = "-")) %>%
    left_join(x)
}, .id = "Name")
dd <- d %>%
  na.omit() %>%
  separate(Date, into = c("year", "month")) %>%
  mutate(year = as.numeric(year))
dd
# Name year month KeyColumn
# 1 A 2002 12 5
# 2 A 2005 11 5
# 3 A 2006 01 5
# 4 A 2008 07 9
# 5 B 2000 04 6
# 6 B 2004 01 7
# 7 B 2005 12 9
# 8 B 2006 03 9
# 9 B 2000 04 6
# 10 C 2003 12 1
# 11 C 2005 04 7
# 12 C 2006 11 5
# 13 C 2008 02 8
Now, use split to split the dataframe into three dataframes by Name; for each sub-dataframe, we apply check_ym(), and then we combine the results and join them with the original data:
lapply(split(dd, dd$Name), function(dat) {
  map2_df(dat$year, dat$month, .f = check_ym, dat = dat)
}) %>%
  bind_rows(.id = "Name") %>%
  inner_join(d)
# Joining, by = c("Name", "Date")
# Name Date KeyColumn
# 1 A 2001-12 NA
# 2 A 2002-12 5
# 3 A 2004-11 NA
# 4 A 2005-11 5
# 5 A 2006-01 5
# 6 A 2007-07 NA
# 7 A 2008-07 9
# 8 B 2000-04 6
# 9 B 2003-01 NA
# 10 B 2004-01 7
# 11 B 2005-12 9
# 12 B 2006-03 9
# 13 C 2002-12 NA
# 14 C 2003-12 1
# 15 C 2004-04 NA
# 16 C 2005-04 7
# 17 C 2006-11 5
# 18 C 2007-02 NA
# 19 C 2008-02 8
