I'm relatively new to R, and I'm currently stuck.
I have observations that are made up of legal articles, e.g.:
BIV:III,XXVIII.1(b);CIV:2
So I split them, which resulted in a list with one element per observation, each listing the legal articles used. This looks like:
ArtAGr: List of 400230
 chr [1:2] "BIV:III,XXVIII.1(b)" "CIV:2"
 chr [1:1] "ILA:II.3(b)"
 chr [1:3] "BIV:IB.3(d)" "CIV:7,9" "ILA:VII.1"
The BIV and CIV prefixes would need to become my new variables. However, the observations vary: some include both BIV and CIV, while others include other legal articles such as ILA:II.3(b).
Now I would like to create a data frame from these, with a column for each major article so I can group all the observations.
Eventually, the perfect dataframe should look like:
Dispute BIV              CIV  ILA
1       III,XXVIII.1(b)  2    NA
2       NA               NA   II.3(b)
3       IB.3(d)          7,9  VII.1
4       II               NA   NA
So I need to create a new object grouping all observations that contain text like BIV, with a 0 or NA for observations that do not use that legal article. Any thoughts would be greatly appreciated!
Thanks a lot!
Sven
Here's an approach:
# a vector of character strings (not the split ones)
vec <- c("BIV:III,XXVIII.1(b);CIV:2",
         "ILA:II.3(b)",
         "BIV:IB.3(d);CIV:7,9;ILA:VII.1")
# split strings
s <- strsplit(vec, "[;:]")
# target words
tar <- c("BIV", "CIV", "ILA")
# create data frame
setNames(as.data.frame(do.call(rbind, lapply(s, function(x)
  replace(rep(NA_character_, length(tar)),
          match(x[c(TRUE, FALSE)], tar), x[c(FALSE, TRUE)])))), tar)
After splitting on ";" and ":", the odd elements x[c(TRUE, FALSE)] are the article codes and the even elements x[c(FALSE, TRUE)] are the corresponding article numbers; match places each value in the column of its code. The result:
              BIV  CIV     ILA
1 III,XXVIII.1(b)    2    <NA>
2            <NA> <NA> II.3(b)
3         IB.3(d)  7,9   VII.1
Related
I am trying to convert a pandas DataFrame with 2 columns into a dictionary such that the values of one column are the keys and the values of the other column are the dictionary's values. If a key repeats (which it does), I want the values for that key to be appended to a list.
So far I have done the following, but it takes a very long time when converting 100K+ records to a dictionary.
    A         B
1  ab      kate
2  ab      drew
3  ab      mike
4  ab      eric
5  cd     bobby
6  cd      kyle
7  ab      alex
8  ab  michelle
9  cd   heather
fdict = dict()
for d, d2 in zip(t.A, t.B):
    fdict.setdefault(d, list()).append(d2)
Please help me understand how I can do this faster using Python.
Thanks!
I think the one-liner df.set_index('ID').T.to_dict('list') would serve your purpose and be faster.
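If the keys repeat, a grouped alternative may be closer to what you describe. A minimal sketch (my addition, not from the original answer), assuming your DataFrame is named t with columns A and B as above:
import pandas as pd

# illustrative data in the same shape as the question
t = pd.DataFrame({
    'A': ['ab', 'ab', 'ab', 'ab', 'cd', 'cd', 'ab', 'ab', 'cd'],
    'B': ['kate', 'drew', 'mike', 'eric', 'bobby', 'kyle', 'alex', 'michelle', 'heather'],
})

# collect the B values for each A key into a list, then convert to a dict
fdict = t.groupby('A')['B'].apply(list).to_dict()
# {'ab': ['kate', 'drew', 'mike', 'eric', 'alex', 'michelle'], 'cd': ['bobby', 'kyle', 'heather']}
This is vectorised at the grouping step and usually scales better than a Python-level loop over rows.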
Hi, I have a dataset of multiple households where all people within households have been matched between two data sources. The DataFrame therefore consists of a 'household' column and two person columns (one for each data source). However, some people (like Jonathan or Peter below) could not be matched and so have a blank second person column.
Household  Person_source_A  Person_source_B
1          Oliver           Oliver
1          Jonathan
1          Amy              Amy
2          David            Dave
2          Mary             Mary
3          Lizzie           Elizabeth
3          Peter
As the DataFrame is gigantic, my aim is to take a sample of the unmatched individuals and then output a DataFrame that has all people within the households in which sampled unmatched people exist. I.e. if my random sample includes Jonathan but not Peter, then only household 1 would appear in the output.
My issue is that I've filtered to take the sample and am now stuck. Some combination of join and agg/groupBy should work, but I'm struggling. I add a flag to the sampled unmatched names to identify them, which I think is helpful.
My code:
# filter to unmatched people
df_unmatched = df.filter(col('per_A').isNotNull() & col('per_B').isNull())
# take random sample of 10%
df_unmatched_sample = df_unmatched.sample(0.1)
# add flag of sampled unmatched persons
df_unmatched_sample = df_unmatched_sample.withColumn('sample_flag', lit('1'))
As it pertains to your intent:
I just want to reduce my DataFrame to show only the full households in which an unmatched person exists that has been selected by a random sample out of all unmatched people
Using your existing approach, you could join on the Household of the sampled records:
# filter to unmatched people
df_unmatched = df.filter(col('per_A').isNotNull() & col('per_B').isNull())
# take random sample of 10% and keep only the distinct households
df_unmatched_sample = df_unmatched.sample(0.1).select("Household").distinct()
# keep every row of the original DataFrame whose household appears in the sample
desired_df = df.join(df_unmatched_sample, ["Household"], "inner")
Edit 1
In response to the OP's comment:
Is there a slightly different way that keeps a flag to identify the
sampled unmatched person (as there are some households with more than
one unmatched person)?
A left join on your existing dataset, after adding the flag column to your sample, may help you achieve this, e.g.:
# filter to unmatched people
df_unmatched = df.filter(col('per_A').isNotNull() & col('per_B').isNull())
# take random sample of 10% and flag the sampled rows
df_unmatched_sample = df_unmatched.sample(0.1).withColumn('sample_flag', lit('1'))
desired_df = (
    df.alias("dfo").join(
        df_unmatched_sample.alias("dfu"),
        [
            col("dfo.Household") == col("dfu.Household"),
            col("dfo.per_A") == col("dfu.per_A"),
            col("dfo.per_B").isNull()
        ],
        "left"
    )
)
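As a possible follow-up (my addition, not part of the original answer): after the left join, sample_flag is null for rows that were not sampled, so you may want to normalise the flag and, if required, keep only households that contain at least one sampled person. A sketch, assuming the aliases and column names used above:
from pyspark.sql.functions import coalesce, col, lit, max as spark_max

# fill the flag with '0' for rows that were not in the sample
flagged = desired_df.select(
    col("dfo.Household").alias("Household"),
    col("dfo.per_A").alias("per_A"),
    col("dfo.per_B").alias("per_B"),
    coalesce(col("dfu.sample_flag"), lit("0")).alias("sample_flag"),
)

# optionally restrict to households containing at least one sampled person
sampled_households = (
    flagged.groupBy("Household")
           .agg(spark_max("sample_flag").alias("has_sample"))
           .filter(col("has_sample") == "1")
           .select("Household")
)
result = flagged.join(sampled_households, ["Household"], "inner")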
I have a DataFrame with zip codes, among other things. The data, as a sample, looks like this:
Zip Item1 Item2 Item3
78264.0 pan elephant blue
73909.0 steamer panda yellow
2602.0 pot rhino orange
59661.0 fork zebra green
861893.0 sink ocelot red
77892.0 spatula doggie brown
Some of these zip codes are invalid, having either too many or too few digits. I'm trying to remove the rows with an invalid number of characters/digits (a valid zip is seven characters in this case, because I am checking the length of the str() value and the .0 is included). The following lengths loop:
zips = mydata.iloc[:,0].astype(str)
lengths = []
for i in zips:
    lengths.append(len(i))
produces a list (not to be confused with a pandas Series, although maybe it is; I'm new at Python) of zip code character lengths, one per row. I am then trying to subset the DataFrame based on the information in the lengths variable. I tried a couple of different ways; the following was the latest version:
for i in lengths.index(i):
    if mydata.iloc[i:,0] != 7:
        mydata.iloc[i:,0].drop()
Naturally, this fails with a ValueError: '44114.0' is not in list. Can anyone give some advice as to how to do what I'm trying to accomplish?
You can write this more concisely using Pandas filtering rather than loops and ifs.
Here is an example:
valid_zips = mydata[mydata['Zip'].astype(str).str.len() == 7]
or
zip_code_upper_bound = 100000
valid_zips = mydata[mydata['Zip'] < zip_code_upper_bound]
assuming fractional numbers are not included in your set. Note that the first example will remove shorter zips, while the second will leave them in, which you might want as they could have had leading zeros.
Sample output:
With df defined as (from your example):
Zip Item1 Item2 Item3
0 78264.0 pan elephant blue
1 73909.0 steamer panda yellow
2 2602.0 pot rhino orange
3 59661.0 fork zebra green
4 861893.0 sink ocelot red
5 77892.0 spatula doggie brown
Using the following code:
df[df.Zip.astype(str).str.len() == 7]
The result is:
Zip Item1 Item2 Item3
0 78264.0 pan elephant blue
1 73909.0 steamer panda yellow
3 59661.0 fork zebra green
5 77892.0 spatula doggie brown
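Regarding the leading-zeros point above (my addition, hedged): if the shorter values are zips that lost their leading zeros, one way to recover a clean five-digit string is to drop the trailing .0 and left-pad with zeros. A minimal sketch, assuming the column is named Zip as in the sample:
import pandas as pd

df = pd.DataFrame({'Zip': [78264.0, 73909.0, 2602.0, 59661.0, 861893.0, 77892.0]})

# convert to an integer-like string (drops the trailing .0), then pad to five digits
df['Zip_str'] = df['Zip'].astype(int).astype(str).str.zfill(5)

# anything still longer than five digits has too many digits and can be filtered out
valid_zips = df[df['Zip_str'].str.len() == 5]
print(valid_zips)
Here 2602.0 becomes '02602' and is kept, while 861893.0 stays six digits and is dropped.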
Using str.len:
df[df.iloc[:, 0].astype(str).str.len() != 7]
          A
1  1.222222
2  1.222200
dput:
df = pd.DataFrame({'A': [1.22222, 1.222222, 1.2222]})
See if this works:
df1 = df['ZipCode'].astype(str).map(len) == 5
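Note (my addition): df1 above is a boolean mask, and this answer assumes the column is a plain five-digit string named ZipCode; to keep only the matching rows you would index with it, for example:
# keep only the rows whose ZipCode string has exactly five characters
valid_rows = df[df1]
print(valid_rows)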
I have 3 columns
a b c
jon ben 2
ben jon 2
roy jack 1
jack roy 1
I'm trying to retrieve all unique permutations, e.g. ben and jon = jon and ben, so they should only appear once. Expected output:
a b c
jon ben 2
roy jack 1
Any ideas of a function that could do this? The order in the output does not matter. I've tried concatenating and then removing duplicates, but obviously this only considers the string order.
I've created a fourth column by joining all three columns together (=A1&","&B1&","&C1) and used Excel's built-in Remove Duplicates function. This doesn't work, as the order of the strings is different.
In your fourth column, use the formula
=IF(A1<B1,A1&","&B1&","&C1,B1&","&A1&","&C1)
which joins A and B in alphabetical order; then you can remove duplicates as you have done.
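For reference (my addition, not part of the original answer): the same order-insensitive de-duplication can be sketched in pandas by building a sorted key for each name pair before dropping duplicates, assuming columns named a, b, c as in the example:
import pandas as pd

df = pd.DataFrame({
    'a': ['jon', 'ben', 'roy', 'jack'],
    'b': ['ben', 'jon', 'jack', 'roy'],
    'c': [2, 2, 1, 1],
})

# build an order-insensitive key by joining each (a, b) pair in alphabetical order
key = df.apply(lambda row: ','.join(sorted([row['a'], row['b']])), axis=1)

# keep only the first row seen for each key
unique_pairs = df[~key.duplicated()]
print(unique_pairs)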
I have a large .csv file with 8 variables and about 350,000 observations. In this file, each actual observation is split across 105 rows. That is, each row has data for one specific demographic, and there are 105 demographic cuts (all relating to the same event). This makes it very difficult to merge this file with others.
I would like to change it so that there are 3,500 observations with variables for the demographic statistics. I've tried creating a macro, but I haven't had much luck.
This is what it looks like now.
This is what I'd like it to look like.
This way, each ID is a unique observation. I think that this will make it much easier to work with. I can use either Stata or Excel. What is the best way to do this?
Here is an example of what I understand you want:
clear all
set more off
*----- example data -----
input id store date cut
1 5 1 1
1 5 1 2
2 8 1 1
2 9 1 2
2 8 2 3
end
format date %td
set seed 012385
gen val1 = floor(runiform()*1000)
gen val2 = floor(runiform()*2000)
list, sepby(id)
*----- what you want ? -----
reshape wide val1 val2, i(id store date) j(cut)
list, sepby(id)
My id variable is numerical, as are the cuts (see help destring and help encode to convert). The example data is also a bit more complex than the one you posted (in case your example is not representative enough).
The missings (.) that result are expected. val11 is to be interpreted as val1 of cut == 1, val21 as val2 of cut == 1, val12 as val1 of cut == 2, and so on. So when id == 1, val13 and val23 are missing because this person does not appear with cut == 3.
I hope that was clear enough for you to apply to your data.