I have a DataFrame with zip codes, among other things. The data, as a sample, looks like this:
Zip Item1 Item2 Item3
78264.0 pan elephant blue
73909.0 steamer panda yellow
2602.0 pot rhino orange
59661.0 fork zebra green
861893.0 sink ocelot red
77892.0 spatula doggie brown
Some of these zip codes are invalid, having either too many or too few digits. I'm trying to remove those rows that have an invalid number of characters/digits (seven characters in this case, because I am checking length based on str() and the .0 is included in there). The following lengths loop:
zips = mydata.iloc[:,0].astype(str)
lengths = []
for i in zips:
    lengths.append(len(i))
produces a series (not to be confused with Series, although maybe it is--I'm new at Python) of zip code character lengths for each row. I am then trying to subset the DataFrame based on the information from the lengths variable. I tried a couple of different ways; the following was the latest version:
for i in lengths.index(i):
    if mydata.iloc[i:,0] != 7:
        mydata.iloc[i:,0].drop()
Naturally, this fails, with a ValueError: '44114.0' is not in list error. Can anyone give some advice as to how to do what I'm trying to accomplish?
You can write this more concisely using Pandas filtering rather than loops and ifs.
Here is an example:
valid_zips = mydata[mydata['Zip'].astype(str).str.len() == 7]
or
zip_code_upper_bound = 100000
valid_zips = mydata[mydata['Zip'] < zip_code_upper_bound]
assuming fractional numbers are not included in your set. Note that the first example will remove shorter zips, while the second will leave them in, which you might want as they could have had leading zeros.
Sample output:
With df defined as (from your example):
Zip Item1 Item2 Item3
0 78264.0 pan elephant blue
1 73909.0 steamer panda yellow
2 2602.0 pot rhino orange
3 59661.0 fork zebra green
4 861893.0 sink ocelot red
5 77892.0 spatula doggie brown
Using the following code:
df[df.Zip.astype(str).str.len() == 7]
The result is:
Zip Item1 Item2 Item3
0 78264.0 pan elephant blue
1 73909.0 steamer panda yellow
3 59661.0 fork zebra green
5 77892.0 spatula doggie brown
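As a rough sketch of both filtering rules in one place, the snippet below rebuilds part of the sample frame and first converts the zips to integer strings, so the trailing .0 no longer counts toward the length. The names zip_text, strict and lenient are made up for illustration, and the code assumes there are no missing zips (the int cast would fail on NaN).

```python
import pandas as pd

# Sample data copied from the question (one item column kept for brevity)
df = pd.DataFrame({
    "Zip": [78264.0, 73909.0, 2602.0, 59661.0, 861893.0, 77892.0],
    "Item1": ["pan", "steamer", "pot", "fork", "sink", "spatula"],
})

# Assumption: no missing zips, so the float -> int cast is safe.
zip_text = df["Zip"].astype(int).astype(str)

# Strict rule: keep only zips with exactly five digits.
strict = df[zip_text.str.len() == 5]

# Lenient rule: also keep shorter zips and restore the leading zeros.
lenient = df[zip_text.str.len() <= 5].copy()
lenient["Zip"] = zip_text.loc[lenient.index].str.zfill(5)

print(strict)
print(lenient)
```

Which rule is appropriate depends on whether a short value such as 2602 is a typo or a zip that lost its leading zero when it was read as a number.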
Using str.len (the != 7 mask below selects the rows with an invalid length; use == 7 to keep the valid ones):
df[df.iloc[:,0].astype(str).str.len()!=7]
A
1 1.222222
2 1.222200
Sample data used above:
df=pd.DataFrame({'A':[1.22222,1.222222,1.2222]})
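See if this works. The comparison builds a boolean mask, so index df with it to keep the matching rows (the length is 5 only if the zips are stored without the trailing .0):
df1 = df[df['ZipCode'].astype(str).map(len) == 5]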
I have a dataframe that looks like this:
CustomerID  CustomerStatus  CustomerTier  Order.Blue.Good    Order.Green.Bad   Order.Red.Good
-----------------------------------------------------------------------------------------------
101         ACTIVE          PREMIUM       NoticeABC: Good 5  NoticeYAF: Bad 1  NoticeAFV: Good 4
102         INACTIVE        DIAMOND                          NoticeTAC: Bad 3
I'm trying to transform it to look like this:
CustomerID  CustomerStatus  CustomerTier  Color  Outcome  NoticeCode  NoticeDesc
---------------------------------------------------------------------------------
101         ACTIVE          PREMIUM       Blue   Good     NoticeABC   Good 5
101         ACTIVE          PREMIUM       Green  Bad      NoticeYAF   Bad 1
101         ACTIVE          PREMIUM       Red    Good     NoticeAFV   Good 4
102         INACTIVE        DIAMOND       Green  Bad      NoticeTAC   Bad 3
I believe this is just a wide-to-long data transformation, which I tried using this approach:
df = pd.wide_to_long(df, ['Order'], i=['CustomerID','CustomerStatus','CustomerTier'], j='Color', sep='.')
However, this is returning an empty dataframe. I'm sure I'm doing something wrong with the separator--perhaps because there are 2 of them in the column names?
I feel like splitting the column names into Color, Outcome, NoticeCode, and NoticeDesc would be relatively easy once I figure out how to do this conversion, but just struggling with this part!
Any helpful tips to point me in the right direction would be greatly appreciated! Thank you!
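I believe this would need to be solved with two separate calls to pd.wide_to_long, as shown below. Note that the default suffix pattern only matches digits, which is why your call returned an empty dataframe; suffix=r'\w+' is needed for suffixes like Good or Blue:
# To set "Outcome" column
df = pd.wide_to_long(df,
                     stubnames = ['Order.Blue', 'Order.Green', 'Order.Red'],
                     i = ['CustomerID', 'CustomerStatus', 'CustomerTier'],
                     j = 'Outcome',
                     sep = '.',
                     suffix = r'\w+').reset_index()
# To set "Color" column ("Outcome" is now part of the row identifier)
df = pd.wide_to_long(df,
                     stubnames = 'Order',
                     i = ['CustomerID', 'CustomerStatus', 'CustomerTier', 'Outcome'],
                     j = 'Color',
                     sep = '.',
                     suffix = r'\w+').reset_index()
# Drop color/outcome combinations that never occurred (assumes blanks are NaN)
df = df.dropna(subset=['Order'])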
Then, to split the notice text (which ends up in the Order column), you could use Series.str.split as so:
df[['NoticeCode', 'NoticeDesc']] = df['Order'].str.split(': ', n=1, expand=True)
Let me know how this goes and we can workshop a bit!
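For comparison, here is a minimal sketch of the same reshape done with DataFrame.melt instead of wide_to_long. It is not the approach from the answer above, just an alternative under stated assumptions: the sample frame is typed in by hand from the question, blanks are assumed to be missing values, and the names id_cols, long_df, parts, notice and result are invented for this example.

```python
import pandas as pd

# Hand-typed copy of the question's data; blanks are assumed to be missing values
df = pd.DataFrame({
    "CustomerID": [101, 102],
    "CustomerStatus": ["ACTIVE", "INACTIVE"],
    "CustomerTier": ["PREMIUM", "DIAMOND"],
    "Order.Blue.Good": ["NoticeABC: Good 5", None],
    "Order.Green.Bad": ["NoticeYAF: Bad 1", "NoticeTAC: Bad 3"],
    "Order.Red.Good": ["NoticeAFV: Good 4", None],
})

id_cols = ["CustomerID", "CustomerStatus", "CustomerTier"]

# One row per (customer, original column); drop the empty cells
long_df = df.melt(id_vars=id_cols, var_name="column", value_name="Notice")
long_df = long_df.dropna(subset=["Notice"])

# "Order.Blue.Good" -> ["Order", "Blue", "Good"]
parts = long_df["column"].str.split(".", expand=True)
long_df["Color"] = parts[1]
long_df["Outcome"] = parts[2]

# "NoticeABC: Good 5" -> ["NoticeABC", "Good 5"]
notice = long_df["Notice"].str.split(": ", n=1, expand=True)
long_df["NoticeCode"] = notice[0]
long_df["NoticeDesc"] = notice[1]

result = (long_df.drop(columns=["column", "Notice"])
                 .sort_values(["CustomerID", "Color"])
                 .reset_index(drop=True))
print(result)
```

Because the column names carry two pieces of information (color and outcome), melting once and splitting the name afterwards avoids having to run two reshapes.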
I have a list Col_values and Data Frame df.
Col_values = ['App','dragons']
df
apps b c dragon e
1 apple bat cat dance eat
2 air ball can dog ear
3 ant biscuit camel doll enter
4 alpha batch came disc end
5 axis bag come dell
6 angry catch
7 attack
My expected output is OutDict
OutDict={'App' : ['apple','air','ant','alpha','axis','angry','attack'],
'dragons':['dance','dog','doll','disc','dell']}
I need the mapping to work irrespective of case and plurality.
Thanks in Advance. :-)
df.loc[:,['apps','dragon']].to_dict(orient='list')
Output
{'apps': ['apple', 'air', 'ant', 'alpha', 'axis', 'angry', 'attack'],
'dragon': ['dance', 'dog', 'doll', 'disc', 'dell', 'None', 'None']}
Try this (using the column names from your example):
{col:df[col].tolist() for col in df.loc[:,['apps','dragon']].columns}
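Neither snippet above handles the case and plurality requirement, so here is one possible sketch of that step. The helper normalize is hypothetical and uses a deliberately crude rule (lower-case, then strip a single trailing "s"); real plural handling would need something more careful. The frame is a hand-typed version of the two relevant columns, with missing cells assumed to be None.

```python
import pandas as pd

def normalize(name):
    """Lower-case a name and strip one trailing 's' (a crude plural rule)."""
    name = name.lower()
    return name[:-1] if name.endswith("s") else name

# Hand-typed copy of the two relevant columns; empty cells assumed to be None
df = pd.DataFrame({
    "apps": ["apple", "air", "ant", "alpha", "axis", "angry", "attack"],
    "dragon": ["dance", "dog", "doll", "disc", "dell", None, None],
})
Col_values = ["App", "dragons"]

# Map the normalized form of each column name back to the real column name
lookup = {normalize(col): col for col in df.columns}

# Build the dictionary keyed by the requested names, skipping missing cells
OutDict = {key: df[lookup[normalize(key)]].dropna().tolist() for key in Col_values}
print(OutDict)
```

A requested name with no matching column raises a KeyError here, which is probably preferable to silently returning an empty list.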
Background
I have the following toy df that contains lists in the columns Before and After as seen below
import pandas as pd
before = [list(['in', 'the', 'bright', 'blue', 'box']),
list(['because','they','go','really','fast']),
list(['to','ride','and','have','fun'])]
after = [list(['there', 'are', 'many', 'different']),
list(['i','like','a','lot','of', 'sports']),
list(['the','middle','east','has','many'])]
df= pd.DataFrame({'Before' : before,
'After' : after,
'P_ID': [1,2,3],
'Word' : ['crayons', 'cars', 'camels'],
'N_ID' : ['A1', 'A2', 'A3']
})
Output
After Before N_ID P_ID Word
0 [in, the, bright, blue, box] [there, are, many, different] A1 1 crayons
1 [because, they, go, really, fast] [i, like, a, lot, of, sports ] A2 2 cars
2 [to, ride, and, have, fun] [the, middle, east, has, many] A3 3 camels
Problem
Using the following block of code, taken from "Removing commas and unlisting a dataframe":
df.loc[:, ['After', 'Before']] = df[['After', 'Before']].apply(lambda x: x.str[0].str.replace(',', ''))
produces the following output:
Close-to-what-I-want-but-not-quite- Output
After Before N_ID P_ID Word
0 in there A1 1 crayons
1 because i A2 2 cars
2 to the A3 3 camels
This output is close but not quite what I am looking for, because the After and Before columns contain only the first word of each list (e.g. there), whereas my desired output looks like this:
Desired Output
After Before N_ID P_ID Word
0 in the bright blue box there are many different A1 1 crayons
1 because they go really fast i like a lot of sports A2 2 cars
2 to ride and have fun the middle east has many A3 3 camels
Question
How do I get my Desired Output?
agg + join. The commas aren't present in your lists, they are just part of the __repr__ of the list.
str_cols = ['Before', 'After']
d = {k: ' '.join for k in str_cols}
df.agg(d).join(df.drop(columns=str_cols))
Before After P_ID Word N_ID
0 in the bright blue box there are many different 1 crayons A1
1 because they go really fast i like a lot of sports 2 cars A2
2 to ride and have fun the middle east has many 3 camels A3
If you'd prefer in place (faster):
df[str_cols] = df.agg(d)
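To illustrate the point about __repr__ made above, this small sketch (plain Python, with an example list taken from the question) shows that the commas appear only when the list is printed, not inside its elements, which is why joining is enough and no comma stripping is needed.

```python
words = ["in", "the", "bright", "blue", "box"]

# The printed form of the list contains commas...
shown = repr(words)
has_comma_in_repr = "," in shown

# ...but none of the actual strings do.
has_comma_in_items = any("," in word for word in words)

# Joining the elements therefore gives the clean sentence directly.
joined = " ".join(words)

print(shown)
print(joined)
```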
applymap (called DataFrame.map since pandas 2.1, where applymap is deprecated)
In line
New copy of a dataframe with desired results
df.assign(**df[['After', 'Before']].applymap(' '.join))
Before After P_ID Word N_ID
0 in the bright blue box there are many different 1 crayons A1
1 because they go really fast i like a lot of sports 2 cars A2
2 to ride and have fun the middle east has many 3 camels A3
In place
Mutate existing df
df.update(df[['After', 'Before']].applymap(' '.join))
df
Before After P_ID Word N_ID
0 in the bright blue box there are many different 1 crayons A1
1 because they go really fast i like a lot of sports 2 cars A2
2 to ride and have fun the middle east has many 3 camels A3
stack and str.join
We can use this result in a similar "In line" and "In place" way as shown above.
df[['After', 'Before']].stack().str.join(' ').unstack()
After Before
0 there are many different in the bright blue box
1 i like a lot of sports because they go really fast
2 the middle east has many to ride and have fun
We can specify the lists we want to convert to string and then use .apply in a for loop:
lst_cols = ['Before', 'After']
for col in lst_cols:
    df[col] = df[col].apply(' '.join)
Before After P_ID Word N_ID
0 in the bright blue box there are many different 1 crayons A1
1 because they go really fast i like a lot of sports 2 cars A2
2 to ride and have fun the middle east has many 3 camels A3
I have a large .csv file with 8 variables and about 350,000 observations. In this file, each real observation is split across 105 rows. That is, each row has data for one specific demographic, and there are 105 demographic cuts (all relating to the same event). This makes it very difficult to merge this file with others.
I would like to change it so that there are 3,500 observations with variables for demographic statistics. I've tried creating a macro, but I haven't had much luck.
This is what it looks like now.
This is what I'd like it to look like.
This way, each ID is a unique observation. I think that this will make it much easier to work with. I can use either Stata or Excel. What is the best way to do this?
So here is an example with what I understand you want:
clear all
set more off
*----- example data -----
input id store date cut
1 5 1 1
1 5 1 2
2 8 1 1
2 9 1 2
2 8 2 3
end
format date %td
set seed 012385
gen val1 = floor(runiform()*1000)
gen val2 = floor(runiform()*2000)
list, sepby(id)
*----- what you want ? -----
reshape wide val1 val2, i(id store date) j(cut)
list, sepby(id)
My id variable is numerical, as are the cuts (see help destring and help encode to convert). The example data is also a bit more complex than the one you posted (in case your example is not representative enough).
The missings (.) that result are expected. val11 is to be interpreted as val1 of cut == 1, val21 as val2 of cut == 1, val12 as val1 of cut == 2, and so on. So when id == 1, val13 and val23 are missing because this person does not appear with cut == 3.
I hope that was clear enough for you to apply to your data.
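For readers who work in pandas rather than Stata (an assumption; the question only mentions Stata and Excel), the same long-to-wide reshape can be sketched as below. The val1 and val2 numbers are made up, since the Stata example draws them at random, and the column naming imitates Stata's val11, val12 convention.

```python
import pandas as pd

# Same layout as the Stata example; val1/val2 values are invented
long_df = pd.DataFrame({
    "id":    [1, 1, 2, 2, 2],
    "store": [5, 5, 8, 9, 8],
    "date":  [1, 1, 1, 1, 2],
    "cut":   [1, 2, 1, 2, 3],
    "val1":  [10, 20, 30, 40, 50],
    "val2":  [11, 21, 31, 41, 51],
})

# One row per (id, store, date), one column per (variable, cut) pair
wide = long_df.pivot(index=["id", "store", "date"],
                     columns="cut",
                     values=["val1", "val2"])

# Flatten the two-level column labels into Stata-style names: val11, val12, ...
wide.columns = [f"{name}{cut}" for name, cut in wide.columns]
wide = wide.reset_index()
print(wide)
```

As in the Stata output, combinations that do not occur (for example cut 3 for id 1) come out as missing values.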
I'm relatively new to R, and I'm currently stuck.
I have observations that are made up of legal articles, e.g.:
BIV:III,XXVIII.1(b);CIV:2.
So I split them, resulting in a list giving, for each observation, the legal articles used. This looks like:
ArtAGr list of 400230
chr[1:2] "BIV:III,XXVIII.1(b)" "CIV:2"
chr[1:1] "ILA:2.3(b)"
chr[1:3] "BIV:IB.3(d)" "CIV:7,9" "ILA:VII.1"
The BIV and CIV would need to become my new variables. However, the observations vary, so some observations include both BIV and CIV, while others include other legal articles like ILA:II.3(b)
Now, I would like to create a dataframe from these guys, so I can group all the observations in a column for each major article.
Eventually, the perfect dataframe should look like:
Dispute BIV CIV ILA
1 III, XXVIII.1(b) 2 NA
2 NA NA II.3(b)
3 IV.3(d) 7,9 VII.1
4 II NA NA
So, I will need to create a new object grouping all observations that contain a text like BIV, with a 0 or NA for those observations that do not use this legal article. Any thoughts would be greatly appreciated!
Thanks a lot!
Sven
Here's an approach:
# a vector of character strings (not the split ones)
vec <- c("BIV:III,XXVIII.1(b);CIV:2",
"ILA:II.3(b)",
"BIV:IB.3(d);CIV:7,9;ILA:VII.1")
# split strings
s <- strsplit(vec, "[;:]")
# target words
tar <- c("BIV", "CIV", "ILA")
# create data frame
setNames(as.data.frame(do.call(rbind, lapply(s, function(x)
replace(rep(NA_character_, length(tar)),
match(x[c(TRUE, FALSE)], tar), x[c(FALSE, TRUE)])))), tar)
The result:
BIV CIV ILA
1 III,XXVIII.1(b) 2 <NA>
2 <NA> <NA> II.3(b)
3 IB.3(d) 7,9 VII.1
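The same idea (split each string into article/value pairs, then line the values up under fixed column names) can also be sketched in Python with pandas, for readers comparing approaches. This is not part of the R answer; the names records and result are illustrative, and the code assumes every piece contains exactly one colon separating the article from its value.

```python
import pandas as pd

vec = ["BIV:III,XXVIII.1(b);CIV:2",
       "ILA:II.3(b)",
       "BIV:IB.3(d);CIV:7,9;ILA:VII.1"]

# Each observation becomes a dict such as {"BIV": "III,XXVIII.1(b)", "CIV": "2"}
records = [dict(piece.split(":", 1) for piece in item.split(";"))
           for item in vec]

# Fixing the column list fills articles an observation does not use with NaN
result = pd.DataFrame(records, columns=["BIV", "CIV", "ILA"])
print(result)
```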