Transform the values of the features of a dataframe

I want to apply the following transformations to the values:
The 'Name' column should show only the title (e.g. Miss, Mr).
The 'Cabin' column should contain only the first letter (e.g. 'C' instead of the whole 'C54').
Finally, please help me with a general solution for similar problems. Thank you. (This was in a Jupyter notebook and I didn't know how to present the code properly.)
categoric.head()
output:
   Name                                               Cabin
0  Braund, Mr. Owen Harris                            A23
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  C85
2  Heikkinen, Miss. Laina                             C54
3  Futrelle, Mrs. Jacques Heath (Lily May Peel)       C123
4  Allen, Mr. William Henry                           B231

pandas has an entire set of string-handling methods for Series, available through the .str accessor.
For the cabins, slice off the first letter:
categoric.Cabin.str[0]
#0 A
#1 C
#2 C
#3 C
#4 B
To get the titles, you can use .str.extract with a capturing group that lists the possible values separated by vertical bars. Since . has a special meaning in regex patterns, you need to escape it by preceding it with \ (a raw string keeps the backslash intact):
categoric.Name.str.extract(r'(Mr\.|Mrs\.|Miss\.)')
# 0
#0 Mr.
#1 Mrs.
#2 Miss.
#3 Mrs.
#4 Mr.

categoric.Name = categoric.Name.apply(lambda x: x.split(', ')[1].split('.')[0])
categoric.Cabin = categoric.Cabin.str.slice(0, 1)
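
As for the general solution asked about, here is a minimal sketch, assuming every name follows the "Surname, Title. Given names" layout seen in the sample. Instead of listing the titles one by one, the pattern captures whatever sits between the comma and the first dot. The frame below is rebuilt by hand from rows in the question, and the missing cabin in the last row is an added assumption to show how missing values pass through.

```python
import pandas as pd

# Hand-built sample based on rows from the question; the None cabin is
# an assumption added to show how missing values behave.
categoric = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        "Allen, Mr. William Henry",
    ],
    "Cabin": ["A23", "C54", "C123", None],
})

# Capture the text between the comma and the first dot, whatever the title is.
titles = categoric["Name"].str.extract(r",\s*([^.]+)\.", expand=False)

# .str[0] takes the first character and leaves missing values missing.
decks = categoric["Cabin"].str[0]

result = categoric.assign(Name=titles, Cabin=decks)
print(result)
```

The same pattern (a .str method plus a small regex or slice) tends to cover most "keep only part of the string" problems, as long as the column really is text.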

Related

formula to find value from two separate tables and based on values in three tables

I'm looking for a formula for the Party column in Table 3 that will produce its values based on the data contained in Table 1 and Table 2.
NumSelect value in Table 3 determines Party value in Table 3.
Where NumSelect has "p", it refers to data in Table 1. If no "p" in NumSelect, then it refers to Table 2.
Number in NumSelect refers to row number.
If the corresponding ShortName has a value, that value should be returned.
If the corresponding ShortName is blank, then the corresponding Name should be returned.
Uppercase "P" and lowercase "p" in the NumSelect should both point to Table 1.
Each table is an Excel Table and its rows may expand or contract.
Certain rows in Table 1 and Table 2 may be empty.
Formula should not be volatile, not require control+shift+enter to enter the formula, and not require VBA.
Thanks!
Sorry for the bad formatting. I had this question formatted perfectly, but Stack Overflow kept preventing me from posting it because it claimed, "Your post appears to contain code that is not properly formatted as code. Please indent all code by 4 spaces using the code toolbar button or the CTRL+K keyboard shortcut. For more editing help, click the [?] toolbar icon."
Table 1
Name            Gender  ShortName  Occupation
Grace Turner    F       (blank)    Singer
Cadie Crawford  F       Tiger      Fine Artist
Paige Johnston  F       (blank)    Archeologist
Dexter Payne    M       Klondike   Veterinarian
Valeria Barnes  F       (blank)    Chef
Florrie Reed    F       (blank)    Lawer
Emily Ferguson  F       (blank)    Scientist
Sam Hawkins     M       Alpha      Biochemist
Savana Ellis    F       (blank)    Cook
Table 2
Name               Gender  ShortName  Occupation
Vanessa Cooper     F       (blank)    Producer
Jasmine Morris     F       Beta       Baker
Evelyn Taylor      F       (blank)    Economist
Adelaide Roberts   F       (blank)    Historian
Blake Cunningham   M       Lion       Chef
Adelaide Harrison  F       (blank)    Chemist
Frederick Watson   M       (blank)    Journalist
Table 3
NumSelect  Party
p2         Tiger
3          Evelyn Taylor
P8         Alpha
2          Beta
7          Frederick Watson
p7         Emily Ferguson

Long Formula
Your formula has 717 characters, this one has 347.
=IF(ISNUMBER(SEARCH("P",[@NumSelect])),
IF(INDEX(Table1[ShortName],VALUE(RIGHT([@NumSelect],1)))="",
INDEX(Table1[Name],VALUE(RIGHT([@NumSelect],1))),
INDEX(Table1[ShortName],VALUE(RIGHT([@NumSelect],1)))),
IF(INDEX(Table2[ShortName],[@NumSelect])="",
INDEX(Table2[Name],[@NumSelect]),
INDEX(Table2[ShortName],[@NumSelect])))
In pseudo-code, it looks like this:
=IF(ISNUMBER(A),IF(B="",C,B),IF(D="",E,D))
The issue is that B (lines 2 & 4) and D (lines 5 & 7) are repeated expressions.
Hopefully, this will help someone to make a major improvement.
Microsoft 365
Using the LET function, you could use the following:
=LET(iIndex,[@NumSelect],sIndex,VALUE(SUBSTITUTE(LOWER(iIndex),"p","")),
IF(LEN(iIndex)>LEN(sIndex),
LET(nShort,INDEX(Table1[ShortName],sIndex),nLong,INDEX(Table1[Name],sIndex),
IF(nShort="",nLong,nShort)),
LET(nShort,INDEX(Table2[ShortName],sIndex),nLong,INDEX(Table2[Name],sIndex),
IF(nShort="",nLong,nShort))))
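
For anyone who wants to check the lookup rules outside Excel, here is a rough Python sketch of the same logic. It is not a replacement for the formulas: the function name party and the (Name, ShortName) tuples are made up for illustration, and the data is copied from the question's tables with blank ShortName cells written as empty strings.

```python
# (Name, ShortName) pairs in row order; "" stands for a blank ShortName.
table1 = [
    ("Grace Turner", ""), ("Cadie Crawford", "Tiger"), ("Paige Johnston", ""),
    ("Dexter Payne", "Klondike"), ("Valeria Barnes", ""), ("Florrie Reed", ""),
    ("Emily Ferguson", ""), ("Sam Hawkins", "Alpha"), ("Savana Ellis", ""),
]
table2 = [
    ("Vanessa Cooper", ""), ("Jasmine Morris", "Beta"), ("Evelyn Taylor", ""),
    ("Adelaide Roberts", ""), ("Blake Cunningham", "Lion"),
    ("Adelaide Harrison", ""), ("Frederick Watson", ""),
]


def party(num_select):
    """Hypothetical helper mirroring the NumSelect rules from the question."""
    key = str(num_select).strip()
    # "p" or "P" in front means Table 1, otherwise Table 2.
    use_table1 = key.lower().startswith("p")
    row = int(key[1:] if use_table1 else key)  # 1-based row number
    name, short_name = (table1 if use_table1 else table2)[row - 1]
    # Prefer ShortName and fall back to Name when it is blank.
    return short_name or name


selections = ["p2", 3, "P8", 2, 7, "p7"]
print([party(s) for s in selections])
```

Stripping the prefix and converting the rest to a number, rather than reading only the last character, keeps row numbers of 10 and above working.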

Welp, I figured out the formula. But it's very inefficient. I'm sure someone here could make it a lot shorter and more efficient.
Here it is:
=IF(
INDEX(FILTER(CHOOSE(IF(LOWER(LEFT([@NumSelect],1))="p",1,2),Table1[[Name]:[ShortName]],Table2[[Name]:[ShortName]]),CHOOSE(IF(LOWER(LEFT([@NumSelect],1))="p",1,2),Table1[Name],Table2[Name])<>""),SUBSTITUTE(LOWER([@NumSelect]),"p",""),3)
=0,
INDEX(FILTER(CHOOSE(IF(LOWER(LEFT([@NumSelect],1))="p",1,2),Table1[[Name]:[ShortName]],Table2[[Name]:[ShortName]]),CHOOSE(IF(LOWER(LEFT([@NumSelect],1))="p",1,2),Table1[Name],Table2[Name])<>""),SUBSTITUTE(LOWER([@NumSelect]),"p",""),1),
INDEX(FILTER(CHOOSE(IF(LOWER(LEFT([@NumSelect],1))="p",1,2),Table1[[Name]:[ShortName]],Table2[[Name]:[ShortName]]),CHOOSE(IF(LOWER(LEFT([@NumSelect],1))="p",1,2),Table1[Name],Table2[Name])<>""),SUBSTITUTE(LOWER([@NumSelect]),"p",""),3)
)

split by dot for a column from pandas dataframe

I have a pandas dataframe with a name column as below
name
Dr. Maso Guilani
Paul Dupey
Mrs. Sarah Kant
Cathay Pane
Canine Paul
I want to remove prefixes like "Dr." and "Mrs." from that "name" column.
I tried the following:
df['name']=df.name.replace({"Mrs.": ""},regex=True).replace({"Dr.": ""},regex=True)
But I want to generalize this, as I am not sure how many prefixes like "Dr." and "Mrs." there are
in the huge dataset. Basically, I want to remove every prefix that ends with a dot. Thanks.
Expected output:
name
Maso Guilani
Paul Dupey
Sarah Kant
Cathay Pane
Canine Paul

With the samples you've shown, please try the following, which uses pandas' str.replace. The regex matches everything from the start of the value (lazily) up to the first dot followed by one or more whitespace characters, and replaces it with an empty string in the name column. Pass regex=True explicitly, since recent pandas versions treat the pattern as a literal string by default.
df['name'].str.replace(r'^.*?\.\s+', '', regex=True)
Output will be as follows.
Maso Guilani
Paul Dupey
Sarah Kant
Cathay Pane
Canine Paul

One way of doing this is with str.split() and apply(), splitting once at the first dot and stripping the leftover space:
df['name'] = df['name'].str.split('.', n=1).apply(lambda x: x[1].strip() if len(x) > 1 else x[0])
Output of df['name']:
0 Maso Guilani
1 Paul Dupey
2 Sarah Kant
3 Cathay Pane
4 Canine Paul
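
If the real data might contain dots elsewhere in a name (a middle initial, say), a stricter pattern may be safer. The sketch below is one possible variant, not something from the answers above: it assumes a prefix is one or more letters-only words at the very start of the value, each ending in a dot. The variable names prefix and cleaned are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Dr. Maso Guilani", "Paul Dupey", "Mrs. Sarah Kant",
             "Cathay Pane", "Canine Paul"],
})

# One or more leading "Word." tokens, each followed by whitespace.
# Assumption: prefixes contain only ASCII letters before the dot.
prefix = r"^(?:[A-Za-z]+\.\s+)+"

cleaned = df["name"].str.replace(prefix, "", regex=True)
print(cleaned.tolist())
```

Because the pattern is anchored at the start and only accepts dot-terminated words, a value such as "John Q. Public" is left untouched, whereas a lazy "everything up to the first dot" match would cut it down to "Public".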

EXCEL Get top 3 largest numbers in repetitive array

I have a list of people with their scores in another column. I need to find the top 3 people with the highest scores and print their names.
Example:
Maria 1
Thomas 4
John 3
Jack 2
Ray 2
Laura 4
Kate 3
Result should be:
Thomas
Laura
John
What I get:
Thomas
Thomas
John
What I get:
Thomas
John
num
I have tried using LARGE, MATCH, MIN and MAX, but nothing works.
My first failed attempt:
=INDEX($A$2:$A$8; MATCH(LARGE(($B$2:$B$8);{1;2;3}); $B$2:$B$8;0))
My second failed attempt:
{=INDEX($A$2:$A$14;SMALL(IF($B$2:$B$14=MAX($B$2:$B$14);ROW($B$2:$B$14)-1);ROW(B4)-1))}

Put this in the second row of the column you want:
=INDEX(A:A,AGGREGATE(15,7,ROW($B$1:$B$7)/((COUNTIF($D$1:D1,$A$1:$A$7)=0)*($B$1:$B$7=LARGE(B:B,ROW(1:1)))),1))
Then drag it down three rows.
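
The underlying idea (rank by score, but never return the same person twice when scores tie) can be checked outside Excel with a small Python sketch. It assumes ties should be broken by the order of appearance in the list, which matches the expected result in the question; the variable names are illustrative.

```python
# Sample data from the question.
scores = [
    ("Maria", 1), ("Thomas", 4), ("John", 3), ("Jack", 2),
    ("Ray", 2), ("Laura", 4), ("Kate", 3),
]

# sorted() is stable, so people with equal scores keep their original order.
ranked = sorted(scores, key=lambda row: row[1], reverse=True)

# Taking rows (not score values) means each person can appear only once.
top3 = [name for name, _ in ranked[:3]]
print(top3)
```

Looking a score up by value, as MATCH does, always lands on the first person with that score, which is why a tie produces the same name twice.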

Excel - return all unique permutations of 3 columns

I have 3 columns
a b c
jon ben 2
ben jon 2
roy jack 1
jack roy 1
I'm trying to retrieve all unique permutations e.g. ben and jon = jon and ben so they should only appear once. Expected output:
a b c
jon ben 2
roy jack 1
Any ideas for a function that could do this? The order of the output does not matter. I've tried concatenating and then removing duplicates, but obviously this only considers the string order.
I've created a fourth column by joining all three columns together with =a1&","&b1&","&c1 and used Excel's built-in Remove Duplicates feature. This doesn't work because the order of the strings is different.

In your fourth column, use the formula
=if(A1<B1,A1&","&B1&","&C1,B1&","&A1&","&C1)
This joins A and B in alphabetical order, so you can then remove duplicates as you have done.
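
The same trick (build a key in which the two names are always in a fixed order, then drop repeated keys) looks like this as a short Python sketch. The data is the sample from the question, and the variable names are illustrative.

```python
rows = [
    ("jon", "ben", 2),
    ("ben", "jon", 2),
    ("roy", "jack", 1),
    ("jack", "roy", 1),
]

seen = set()
unique_rows = []
for a, b, c in rows:
    # Sorting the two names makes ("jon", "ben") and ("ben", "jon") identical.
    key = (tuple(sorted((a, b))), c)
    if key not in seen:
        seen.add(key)
        unique_rows.append((a, b, c))  # keep the first occurrence as written

print(unique_rows)
```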

Selecting Text from an R string to create a new object

I'm relatively new to R, and I'm currently stuck.
I have observations that are made up of legal articles, e.g.:
BIV:III,XXVIII.1(b);CIV:2.
So I split them, which gave me a list with the legal articles used in each observation. It looks like this:
ArtAGr list of 400230
chr[1:2] "BIV:III,XXVIII.1(b)" "CIV:2"
chr[1:1] "ILA:2.3(b)"
chr[1:3] "BIV:IB.3(d)" "CIV:7,9" "ILA:VII.1"
BIV and CIV would need to become my new variables. However, the observations vary: some include both BIV and CIV, while others include other legal articles such as ILA:II.3(b).
Now I would like to create a dataframe from these, so I can group all the observations in a column for each major article.
Eventually, the perfect dataframe should look like:
Dispute BIV CIV ILA
1 III, XXVIII.1(b) 2 NA
2 NA NA II.3(b)
3 IV.3(d) 7,9 VII.1
4 II NA NA
So I will need to create a new object grouping all observations that contain a text like BIV, with a 0 or NA for those observations that do not use this legal article. Any thoughts would be greatly appreciated!
Thanks a lot!
Sven

Here's an approach:
# a vector of character strings (not the already-split ones)
vec <- c("BIV:III,XXVIII.1(b);CIV:2",
"ILA:II.3(b)",
"BIV:IB.3(d);CIV:7,9;ILA:VII.1")
# split strings
s <- strsplit(vec, "[;:]")
# target words
tar <- c("BIV", "CIV", "ILA")
# create data frame
setNames(as.data.frame(do.call(rbind, lapply(s, function(x)
  replace(rep(NA_character_, length(tar)),
          match(x[c(TRUE, FALSE)], tar), x[c(FALSE, TRUE)])))), tar)
The result:
BIV CIV ILA
1 III,XXVIII.1(b) 2 <NA>
2 <NA> <NA> II.3(b)
3 IB.3(d) 7,9 VII.1
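
To make the steps of the R one-liner easier to follow, here is the same idea written out as a Python sketch: split each observation on ";", split each piece on the first ":", then look up every target article and fall back to a missing value. It assumes each article code appears at most once per observation; the names vec and tar mirror the R code, while rows and table are illustrative.

```python
# Same sample strings as in the R answer.
vec = [
    "BIV:III,XXVIII.1(b);CIV:2",
    "ILA:II.3(b)",
    "BIV:IB.3(d);CIV:7,9;ILA:VII.1",
]
tar = ["BIV", "CIV", "ILA"]

rows = []
for obs in vec:
    # "BIV:III,XXVIII.1(b)" -> key "BIV", value "III,XXVIII.1(b)"
    pairs = (part.split(":", 1) for part in obs.split(";") if part)
    rows.append({key: value for key, value in pairs})

# One row per observation, one column per target; None marks "not used".
table = [[row.get(article) for article in tar] for row in rows]

for line in table:
    print(line)
```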
