turn lists of lists into strings pandas dataframe - python-3.x

Background
I have the following toy df that contains lists in the columns Before and After as seen below
import pandas as pd
before = [list(['in', 'the', 'bright', 'blue', 'box']),
list(['because','they','go','really','fast']),
list(['to','ride','and','have','fun'])]
after = [list(['there', 'are', 'many', 'different']),
list(['i','like','a','lot','of', 'sports']),
list(['the','middle','east','has','many'])]
df= pd.DataFrame({'Before' : before,
'After' : after,
'P_ID': [1,2,3],
'Word' : ['crayons', 'cars', 'camels'],
'N_ID' : ['A1', 'A2', 'A3']
})
Output
After Before N_ID P_ID Word
0 [in, the, bright, blue, box] [there, are, many, different] A1 1 crayons
1 [because, they, go, really, fast] [i, like, a, lot, of, sports ] A2 2 cars
2 [to, ride, and, have, fun] [the, middle, east, has, many] A3 3 camels
Problem
Using the following block of code, taken from "Removing commas and unlisting a dataframe":
df.loc[:, ['After', 'Before']] = df[['After', 'Before']].apply(lambda x: x.str[0].str.replace(',', ''))
produces the following output:
Close-to-what-I-want-but-not-quite- Output
After Before N_ID P_ID Word
0 in there A1 1 crayons
1 because i A2 2 cars
2 to the A3 3 camels
This output is close, but not quite what I am looking for: the After and Before columns contain only a single word each (e.g. there), whereas my desired output looks like this:
Desired Output
After Before N_ID P_ID Word
0 in the bright blue box there are many different A1 1 crayons
1 because they go really fast i like a lot of sports A2 2 cars
2 to ride and have fun the middle east has many A3 3 camels
Question
How do I get my Desired Output?

agg + join. The commas aren't present in your lists, they are just part of the __repr__ of the list.
str_cols = ['Before', 'After']
d = {k: ' '.join for k in str_cols}
df.agg(d).join(df.drop(columns=str_cols))
Before After P_ID Word N_ID
0 in the bright blue box there are many different 1 crayons A1
1 because they go really fast i like a lot of sports 2 cars A2
2 to ride and have fun the middle east has many 3 camels A3
If you'd prefer in place (faster):
df[str_cols] = df.agg(d)
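To see why no comma removal is needed, here is a minimal sketch on a two-row version of the toy data. The names first_row, has_comma and joined are illustrative and not part of the answer above; the sketch assumes every cell holds a list of strings.

```python
import pandas as pd

df = pd.DataFrame({
    'Before': [['in', 'the', 'bright', 'blue', 'box'],
               ['because', 'they', 'go', 'really', 'fast']],
    'After': [['there', 'are', 'many', 'different'],
              ['i', 'like', 'a', 'lot', 'of', 'sports']],
})

# The commas only appear when a list is printed; no element contains one.
first_row = df.loc[0, 'Before']
has_comma = any(',' in word for word in first_row)

# Joining each list with a space yields the desired string per row.
joined = df['Before'].map(' '.join)
```

Because the elements themselves are clean, str.replace(',', '') has nothing to remove; joining the list is the whole job.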

applymap
In line
New copy of a dataframe with desired results
df.assign(**df[['After', 'Before']].applymap(' '.join))
Before After P_ID Word N_ID
0 in the bright blue box there are many different 1 crayons A1
1 because they go really fast i like a lot of sports 2 cars A2
2 to ride and have fun the middle east has many 3 camels A3
In place
Mutate existing df
df.update(df[['After', 'Before']].applymap(' '.join))
df
Before After P_ID Word N_ID
0 in the bright blue box there are many different 1 crayons A1
1 because they go really fast i like a lot of sports 2 cars A2
2 to ride and have fun the middle east has many 3 camels A3
stack and str.join
We can use this result in a similar "In line" and "In place" way as shown above.
df[['After', 'Before']].stack().str.join(' ').unstack()
After Before
0 there are many different in the bright blue box
1 i like a lot of sports because they go really fast
2 the middle east has many to ride and have fun
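As a possible alternative, note that DataFrame.applymap was deprecated in pandas 2.1 in favor of DataFrame.map, so code that must run across versions may prefer Series.str.join applied column by column. The sketch below assumes each cell in the listed columns is a list of strings; list_cols and result are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'Before': [['in', 'the', 'bright', 'blue', 'box'],
               ['because', 'they', 'go', 'really', 'fast'],
               ['to', 'ride', 'and', 'have', 'fun']],
    'After': [['there', 'are', 'many', 'different'],
              ['i', 'like', 'a', 'lot', 'of', 'sports'],
              ['the', 'middle', 'east', 'has', 'many']],
    'P_ID': [1, 2, 3],
})

list_cols = ['Before', 'After']

# Series.str.join joins the list stored in each cell with the separator.
# assign returns a new DataFrame and leaves df untouched.
result = df.assign(**{col: df[col].str.join(' ') for col in list_cols})
```

Assigning back with df[col] = df[col].str.join(' ') inside a loop would give the in-place variant.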

We can specify the columns whose lists we want to convert to strings and then use .apply in a for loop:
lst_cols = ['Before', 'After']
for col in lst_cols:
    df[col] = df[col].apply(' '.join)
Before After P_ID Word N_ID
0 in the bright blue box there are many different 1 crayons A1
1 because they go really fast i like a lot of sports 2 cars A2
2 to ride and have fun the middle east has many 3 camels A3

Related

Search columns for keywords and match only the first substrings found

Below is an example table that shows what I'm hoping to achieve. I have the column "text" that provides comments and I would like to add a new column called "searchBrand" which returns a category based on a search for substring keywords within "text".
text                                                                          searchBrand
The M3 is quick!                                                              BMW
Some would say the Prius is economical, but others like the M3 for its speed  Toyota
Who likes the focus?                                                          Ford
I created a simple dictionary that shows the substrings I'm looking for as the key and the applicable brand as the value.
keywords = {"M3": "BMW",
"Prius": "Toyota",
"Focus": "Ford"}
The code below works in many cases; however, if there are two matches in the same row, the returned value is NaN.
df["searchBrand"] = df['text'].str.findall("|".join(keywords.keys())).str.join(",").map(keywords)
For example, this is what I get, but it's not what I'm looking for:
text                                                                          searchBrand
The M3 is quick!                                                              BMW
Some would say the Prius is economical, but others like the M3 for its speed  nan
Who likes the focus?                                                          Ford
Thank you in advance
You can try with pandas.Series.str.extract for getting the first match:
>>> df
text
0 The M3 is quick!
1 Some would say the Prius is economical, but ot...
2 Who likes the Focus?
>>>
>>> df['searchBrand'] = df['text'].str.extract(f'({"|".join(keywords.keys())})', expand=False).map(keywords)
>>> df
text searchBrand
0 The M3 is quick! BMW
1 Some would say the Prius is economical, but ot... Toyota
2 Who likes the Focus? Ford
In the above, I have changed Who likes the focus? to Who likes the Focus?. If that's not a typo in the question, you can try:
>>> df["searchBrand"] = df['text'].str.title().str.extract(f'({"|".join(keywords.keys())})', expand=False).map(keywords)
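The findall version returns NaN for the second row because the joined matches form the string "Prius,M3", which is not a key in keywords. As a hedged alternative to title-casing the text, the sketch below makes the match itself case-insensitive. The helper names lookup, pattern and first_match are illustrative, and re.escape is added on the assumption that keywords might contain regex metacharacters.

```python
import re
import pandas as pd

keywords = {"M3": "BMW", "Prius": "Toyota", "Focus": "Ford"}

df = pd.DataFrame({"text": [
    "The M3 is quick!",
    "Some would say the Prius is economical, but others like the M3 for its speed",
    "Who likes the focus?",
]})

# Hypothetical helper: lower-cased keyword -> brand, so lookups ignore case.
lookup = {key.lower(): brand for key, brand in keywords.items()}

# One capture group with all keywords as alternatives.
pattern = "(" + "|".join(re.escape(key) for key in keywords) + ")"

# extract keeps only the first (leftmost) match in each row.
first_match = df["text"].str.extract(pattern, flags=re.IGNORECASE, expand=False)
df["searchBrand"] = first_match.str.lower().map(lookup)
```

Unlike str.title(), this leaves the original text untouched and would also match keywords that are not in title case.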

Drop rows based on float length in Python

I have a DataFrame with zip codes, among other things. The data, as a sample, looks like this:
Zip Item1 Item2 Item3
78264.0 pan elephant blue
73909.0 steamer panda yellow
2602.0 pot rhino orange
59661.0 fork zebra green
861893.0 sink ocelot red
77892.0 spatula doggie brown
Some of these zip codes are invalid, having either too many or too few digits. I'm trying to remove those rows that have an invalid number of characters/digits (seven characters in this case, because I am checking length based on str() and the .0 is included in there). The following lengths loop:
zips = mydata.iloc[:,0].astype(str)
lengths = []
for i in zips:
    lengths.append(len(i))
produces a series (not to be confused with Series, although maybe it is--I'm new at Python) of zip code character lengths for each row. I am then trying to subset the DataFrame based on the information from the lengths variable. I tried a couple of different ways; the following was the latest version:
for i in lengths.index(i):
    if mydata.iloc[i:,0] != 7:
        mydata.iloc[i:,0].drop()
Naturally, this fails, with a ValueError: '44114.0' is not in list error. Can anyone give some advice as to how to do what I'm trying to accomplish?
You can write this more concisely using Pandas filtering rather than loops and ifs.
Here is an example:
valid_zips = mydata[mydata.iloc[:, 0].astype(str).str.len() == 7]
or
zip_code_upper_bound = 100000
valid_zips = mydata[mydata.iloc[:, 0] < zip_code_upper_bound]
assuming fractional numbers are not included in your set. Note that the first example will remove shorter zips, while the second will leave them in, which you might want as they could have had leading zeros.
Sample output:
With df defined as (from your example):
Zip Item1 Item2 Item3
0 78264.0 pan elephant blue
1 73909.0 steamer panda yellow
2 2602.0 pot rhino orange
3 59661.0 fork zebra green
4 861893.0 sink ocelot red
5 77892.0 spatula doggie brown
Using the following code:
df[df.Zip.astype(str).str.len() == 7]
The result is:
Zip Item1 Item2 Item3
0 78264.0 pan elephant blue
1 73909.0 steamer panda yellow
3 59661.0 fork zebra green
5 77892.0 spatula doggie brown
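If short codes such as 2602.0 are really five-digit zips that lost a leading zero, one possible approach is to convert the column to zero-padded strings before filtering. This is a sketch under that assumption, and it also assumes the column has no missing or fractional values; zip_str and valid are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'Zip': [78264.0, 73909.0, 2602.0, 59661.0, 861893.0, 77892.0],
    'Item1': ['pan', 'steamer', 'pot', 'fork', 'sink', 'spatula'],
})

# Assumption: no NaN or fractional values, so the float -> int cast is safe.
# zfill restores leading zeros that were dropped when the zip became a number.
zip_str = df['Zip'].astype(int).astype(str).str.zfill(5)

# Keep rows whose padded zip has exactly five characters and store
# the zip as text so the leading zeros survive.
valid = df[zip_str.str.len() == 5].assign(Zip=zip_str)
```

With this variant 861893 is dropped for being too long, while 2602 is kept as 02602.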
Using str.len
df[df.iloc[:,0].astype(str).str.len()!=7]
A
1 1.222222
2 1.222200
Sample data:
df=pd.DataFrame({'A':[1.22222,1.222222,1.2222]})
See if this works
df1 = df['ZipCode'].astype(str).map(len)==5

Concatenating INDEX/MATCH with multiple criteria and multiple matches

I am using Excel to track a team game where players are divided into teams and subteams within teams. Each player within a subteam scores a certain number of points, and I would like to have a summary string for each player with the number of points other players in the same subteam scored.
Example:
A B C D
PLAYER TEAM SUBTEAM POINTS
Alice Red 1 70
Bob Red 1 20
Charlie Red 1 10
Dave Red 2 70
Erin Red 2 30
Frank Blue 1 55
Grace Blue 1 45
My desired output looks like this:
A B C D E
PLAYER TEAM SUBTEAM POINTS SUMMARY
Alice Red 1 70 Bob:20, Charlie:10
Bob Red 1 20 Alice:70, Charlie:10
Charlie Red 1 10 Alice:70, Bob:20
Dave Red 2 70 Erin:30
Erin Red 2 30 Dave:70
Frank Blue 1 55 Grace:45
Grace Blue 1 45 Frank:55
The furthest I was able to go is a combination of CONCATENATE, INDEX, and MATCH in an array formula:
{=CONCATENATE(INDEX($A$2:$A$8,MATCH(1,(C2=$C$2:$C$8)*(B2=$B$2:$B$8),0)), ":", INDEX($D$2:$D$8,MATCH(1,(C2=$C$2:$C$8)*(B2=$B$2:$B$8),0)))}
This unfortunately just outputs a summary for the first player in the subteam:
A B C D E
PLAYER TEAM SUBTEAM POINTS SUMMARY
Alice Red 1 70 Alice:70
Bob Red 1 20 Alice:70
Charlie Red 1 10 Alice:70
Dave Red 2 70 Dave:70
Erin Red 2 30 Dave:70
Frank Blue 1 55 Grace:45
Grace Blue 1 45 Grace:45
What I need to do now is:
Excluding the player for the summary (I don't want Alice in the summary for Alice, but only Bob and Charlie)
Getting it to work for multiple matches (there can be an arbitrary number of players in each subteam)
Getting CONCATENATE to work with an unknown number of strings (because as said above, there can be an arbitrary number of players in each subteam).
Ideas appreciated!
I put together a helper column that concatenates each player/points pair and used the TEXTJOINIFS function from "TEXTJOIN for xl2010/xl2013 with criteria" to get the desired results.
Unfortunately Excel (prior to Excel 2016) cannot conveniently join text. The best you can do (if you want to avoid VBA) is to use some helper cells and split this "Summary" into separate cells.
See example below. The array formula in cell E4 is dragged to cell J10.
= IFERROR(INDEX($A$4:$D$10,MATCH(SMALL(IF(($B$4:$B$10=$B4)*($C$4:$C$10=$C4)*($A$4:$A$10<>$A4),
ROW($A$4:$A$10)),E$3),ROW($A$4:$A$10),0),MATCH(E$2,$A$1:$D$1,0)),"")
Note this is an array formula, so you must press Ctrl+Shift+Enter instead of just Enter after typing this formula.
Of course, in this example I assume 3 players. Your requirement of an arbitrary number of players cannot be met with formulas alone, but you can just extend the "Summary" section over to the right as far as necessary.
If you really wanted to, you could even concatenate the "Summary" rows to form a single cell, e.g. something like:
= CONCATENATE(E4,": ",F4,", ",...)

group pandas DataFrame by one column and then get lists of values which occur in those categories from other column

I am looking for a way to group a DataFrame by one (or more) columns and then add another column to the grouped DataFrame that lists the values occurring in each category from another column of the original DataFrame. (It's probably easier to understand what I would like to do from the following example.)
For example I have a DataFrame which contains the information of the color and location of some cars. I want to know how many cars of each color I have (for this I use groupby, but I am open for other suggestions), but I would also like to get a list of cities those cars are located in.
import pandas as pd
df = pd.DataFrame({'cars': ['A','B','C', 'D', 'E'], 'color':['blue','red', 'blue', 'red', 'blue'], 'city':['X', 'Y', 'X', 'Z', 'Z']})
df =
cars city color
0 A X blue
1 B Y red
2 C X blue
3 D Z red
4 E Z blue
new_df = df.groupby(['color']).size().reset_index().rename(columns={0:'nr_of_cars'})
new_df =
color nr_of_cars
0 blue 3
1 red 2
So in new_df I have the number of cars with each color, but I would also like to know the cities those cars are located in. A new DataFrame would finally look like this (I don't strictly need the cities in the same DataFrame, I just need to access them easily):
color nr_of_cars cities
0 blue 3 X, Z
1 red 2 Y, Z
What I know is that I could do a conditional selection for each color.
other_df = df[df['color'] == 'blue']['city'].unique()
But is there a way where I do not have to loop through a list of colors? My real DataFrame is a bit bigger, so that I would be happy to receive some suggestions.
edit: Just fixed typo.
IIUC:
In [90]: df.groupby('color').agg({'cars':'size','city':'unique'}).reset_index()
Out[90]:
color cars city
0 blue 3 [X, Z]
1 red 2 [Y, Z]
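To get the exact column names and the comma-separated city strings from the question, named aggregation (available since pandas 0.25) is one option. This is a sketch rather than the answerer's method; the output names nr_of_cars and cities come from the question's desired table.

```python
import pandas as pd

df = pd.DataFrame({
    'cars': ['A', 'B', 'C', 'D', 'E'],
    'color': ['blue', 'red', 'blue', 'red', 'blue'],
    'city': ['X', 'Y', 'X', 'Z', 'Z'],
})

# Named aggregation: output_column=(input_column, aggregation).
# unique() keeps cities in order of first appearance within each group.
new_df = (
    df.groupby('color')
      .agg(nr_of_cars=('cars', 'size'),
           cities=('city', lambda s: ', '.join(s.unique())))
      .reset_index()
)
```

Dropping the ', '.join and passing 'unique' instead would keep the cities as arrays, as in the output above.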
@Dillon,
if you want to see all available aggregate methods (functions) and attributes, try using ipython or Jupyter as follows:
first create a "GroupBy" object:
In [91]: g = df.groupby('color')
then type g. and press <Tab> key:
In [92]: g.
g.agg g.apply g.cars g.corrwith g.cummax g.describe g.ffill g.get_group g.idxmax g.mad g.min
g.aggregate g.backfill g.city g.count g.cummin g.diff g.fillna g.groups g.idxmin g.max g.ndim
g.all g.bfill g.color g.cov g.cumprod g.dtypes g.filter g.head g.indices g.mean g.ngroup >
g.any g.boxplot g.corr g.cumcount g.cumsum g.expanding g.first g.hist g.last g.median g.ngroup

Horizontal Index Match (returning column header)

A B C
1 Fruit Color Meat <- Column Header
2 Banana Red Pork
3 Apple Black Chicken
4 Orange White Beef
From the table1 above to table2 below:
A B
1 Name What? <- Column Header
2 Banana Fruit <- Formula should return these values, based on table 1
3 Red Color
4 Beef Meat
5 Pork Meat
Looking for a formula to return corresponding column name in B2,3,4...
I tried =INDEX(Table1[#Headers],MATCH(J:J,Table1,0))
It would seem that the three columns are unique; e.g. there would never be a Beef in the Color column. In that case you can simply query each column, passing back a 1, 2 or 3 as the case may be.
=IFERROR(INDEX(Table1[#Headers],
ISNUMBER(MATCH([#Name], Table1[Fruit], 0))*1+
ISNUMBER(MATCH([#Name], Table1[Color], 0))*2+
ISNUMBER(MATCH([#Name], Table1[Meat], 0))*3),
"no column")
    
I'm not sure whether Port was an intentional misspelling but it does demonstrate what occurs when there is no match.
