Adding a variable to a df after creating it by mutating - string

I am new to R and have a basic question.
I want to create a numerical variable based on a string variable.
The main factor is called icd10:
icd10 <- c("v860", "x45", 'C150', 'NA')
df <- data.frame(icd10)
I want to create a new variable called main.
I want to have a result like this:
icd10 main
1 v860 2
2 x45 2
3 C150 1
4 NA NA
I used the following command:
df %>%
mutate(df, main=ifelse(grepl("C", icd10 ) , 1 , 2))
1- It works, but I do not know how to fix NA in the new variable.
2- When I print the data frame or tabulate the new variable, it doesn't show up.
It gives the following results:
print(df)
icd10
1 v860
2 x45
3 C150
4 NA
table(df$can2)
< table of extent 0 >
Thanks for help
KZ
I am expecting a quick and easy solution from the community.

Related

Renaming columns in dataframe w.r.t another specific column

BACKGROUND: Large excel mapping file with about 100 columns and 200 rows converted to .csv. Then stored as dataframe. General format of df as below.
Starts with a named column (e.g. Sales) and following two columns need to be renamed. This pattern needs to be repeated for all columns in excel file.
Essentially: Link the subsequent 2 columns to the "parent" one preceding them.
Sales Unnamed: 2 Unnamed: 3 Validation Unnamed: 5 Unnamed: 6
0 Commented No comment Commented No comment
1 x x
2 x x
3 x x
APPROACH FOR SOLUTION: I assume it would be possible to begin with an index (e.g. index of Sales column 1 = x) and then rename the following two columns as (x+1) and (x+2).
Then take in the text for the next named column (e.g. Validation) and so on.
I know the rename() function for dataframes,
but I am not sure how to apply it iteratively to change the column titles.
EXPECTED OUTPUT: Unnamed 2 & 3 changed to Sales_Commented and Sales_No_Comment, respectively.
Similarly Unnamed 5 & 6 change to Validation_Commented and Validation_No_Comment.
Again, repeated for all 100 columns of file.
EDIT: Due to the large number of cols in the file, creating a manual list to store column names is not a viable solution. I have already seen this elsewhere on SO. Also, the amount of columns and departments (Sales, Validation) changes in different excel files with the mapping. So a dynamic solution is required.
Sales Sales_Commented Sales_No_Comment Validation Validation_Commented Validation_No_Comment
0 Commented No comment Commented No comment
1 x x
2 x
3 x x x
As a python novice, I considered a possible approach for the solution using the limited knowledge I have, but not sure what this would look like as a workable code.
I would appreciate all help and guidance.
1. Make a list with the column names that you want.
2. Turn it into a dict with the old column names as the keys and the new column names as the values.
3. Use df.rename(columns = your_dictionary).
import numpy as np
import pandas as pd
df = pd.read_excel("name of the excel file",sheet_name = "name of sheet")
print(df.head())
Output>>>
Sales Unnamed : 2 Unnamed : 3 Validation Unnamed : 5 Unnamed : 6 Unnamed :7
0 NaN Commented No comment NaN Comment No comment Extra
1 1.0 2 1 1.0 1 1 1
2 3.0 1 1 1.0 1 1 1
3 4.0 3 4 5.0 5 6 6
4 5.0 1 1 1.0 21 3 6
# get new names based on the values of a previous named column
new_column_names = []
counter = 0
for col_name in df.columns:
    if (col_name[:7].strip()=="Unnamed"):
        # unnamed column: prefix its first-row label with the last named column
        new_column_names.append(base_name+"_"+df.iloc[0,counter].replace(" ", "_"))
    else:
        # named column: remember it as the parent for the columns that follow
        base_name = col_name
        new_column_names.append(base_name)
    counter +=1
# convert to dict key pair
dictionary = dict(zip(df.columns.tolist(),new_column_names))
# rename columns
df = df.rename(columns=dictionary)
# drop first row (it only held the sub-headers)
df = df.iloc[1:].reset_index(drop=True)
print(df.head())
Output>>
Sales Sales_Commented Sales_No_comment Validation Validation_Comment Validation_No_comment Validation_Extra
0 1.0 2 1 1.0 1 1 1
1 3.0 1 1 1.0 1 1 1
2 4.0 3 4 5.0 5 6 6
3 5.0 1 1 1.0 21 3 6
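For anyone who wants to try the idea without an Excel file, here is a minimal self-contained sketch of the same approach. The small in-memory frame and the variable names (new_names, renamed) are made up for illustration, and it assumes the first column is a named one, so that base_name is set before any "Unnamed" column appears.

```python
import pandas as pd

# Hypothetical stand-in for the spreadsheet: row 0 holds the sub-headers.
df = pd.DataFrame(
    [
        [None, "Commented", "No comment", None, "Commented", "No comment"],
        [1.0, 2, 1, 1.0, 1, 1],
        [3.0, 1, 1, 1.0, 1, 1],
    ],
    columns=["Sales", "Unnamed: 2", "Unnamed: 3",
             "Validation", "Unnamed: 5", "Unnamed: 6"],
)

new_names = []
base_name = None  # last named ("parent") column seen so far
for position, col_name in enumerate(df.columns):
    if str(col_name).startswith("Unnamed"):
        # Build "<parent>_<sub-header>" from the first data row.
        sub_header = str(df.iloc[0, position]).replace(" ", "_")
        new_names.append(f"{base_name}_{sub_header}")
    else:
        base_name = col_name
        new_names.append(col_name)

# Drop the sub-header row, then apply the new names positionally.
renamed = df.iloc[1:].reset_index(drop=True)
renamed.columns = new_names
print(list(renamed.columns))
```

Assigning to .columns positionally instead of building a dict also keeps working if two source columns happen to share a name, which a dict of old-to-new names cannot represent.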

Replacing less occurring as Others

In one column of a data frame df, namely "Name", I have the data below:
Sample Input:
Name
A
A
A
B
B
C
D
df['Name'].value_counts()
A 3
B 2
C 1
D 1
I need the output in the format below.
Expected output:
A 3
B 2
Others 2
Any code in python3 is appreciated. Thanks in advance.
You need:
import numpy as np
x = list(df['Name'].value_counts()[:2].index) # This will fetch the top N values (here N = 2)
df['Name'] = np.where(df['Name'].isin(x), df['Name'], 'Others')
print(df['Name'].value_counts())
Output:
A 3
B 2
Others 2
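A self-contained version of the same idea, as a sketch: it rebuilds the sample column from the question and uses Series.where instead of np.where, so no NumPy import is needed. The name top_n is an assumption for the cutoff; if several values tie at the cutoff, which of them is kept depends on the order value_counts returns.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({"Name": ["A", "A", "A", "B", "B", "C", "D"]})

top_n = 2  # assumed cutoff: keep the 2 most frequent values
top_values = df["Name"].value_counts().index[:top_n]

# Keep a value if it is among the most frequent ones, otherwise relabel it.
df["Name"] = df["Name"].where(df["Name"].isin(top_values), "Others")

counts = df["Name"].value_counts()
print(counts)
```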

Use a split function in every row of one column of a data frame

I have a rather big pandas data frame (more than 1 million rows) with columns containing either strings or numbers. Now I would like to split the strings in one column before the expression "is applied".
An example to explain what I mean:
What I have:
a b description
2 4 method A is applied
10 5 titration is applied
3 1 computation is applied
What I am looking for:
a b description
2 4 method A
10 5 titration
3 1 computation
I tried the following,
df.description = df.description.str.split('is applied')[0]
But this didn't bring the desired result.
Any ideas how to do it? :-)
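You are close; you need .str[0] after the split, so that the first piece is taken from every row:
df.description = df.description.str.split(' is applied').str[0]
Alternative solution (raw string for the regex, expand=False to get a Series back):
df.description = df.description.str.extract(r'(.*)\s+is applied', expand=False)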
print (df)
a b description
0 2 4 method A
1 10 5 titration
2 3 1 computation
For better performance, use a list comprehension (this assumes the column has no missing values):
df.description = [x.split(' is applied')[0] for x in df.description]
You can also use str.replace:
df.description = df.description.str.replace(' is applied', '', regex=False)
df
a b description
0 2 4 method A
1 10 5 titration
2 3 1 computation
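To compare the suggestions side by side, here is a small sketch that rebuilds the example frame and checks that the variants agree. The names suffix, by_split, by_extract, by_replace and by_listcomp are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [2, 10, 3],
    "b": [4, 5, 1],
    "description": [
        "method A is applied",
        "titration is applied",
        "computation is applied",
    ],
})

suffix = " is applied"

# 1. Split on the suffix and take the first piece of every row.
by_split = df["description"].str.split(suffix).str[0]

# 2. Capture everything before the suffix with a regex.
by_extract = df["description"].str.extract(r"(.*)\s+is applied", expand=False)

# 3. Remove the suffix as a literal string.
by_replace = df["description"].str.replace(suffix, "", regex=False)

# 4. Plain Python loop; fails on missing values, since NaN has no .split.
by_listcomp = [text.split(suffix)[0] for text in df["description"]]

df["description"] = by_split
print(df)
```

The .str variants pass missing values through as NaN, while the list comprehension raises on them, so the choice depends on whether the column can contain gaps.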

Python pandas: Weird index value

I have posted a similar thread but now have another angle to explore. After doing a covariance analysis between X and Z, grouped by 2 different levels, I get a DataFrame like:
index X Z
(1,1,'X') 2.3 0
...
'1' and '1' are the 2 different levels (I could have chosen '1' and '2'; there are 5 and 10 different levels)
Now I would like to extract each 'element' of the index and have something like:
index X Z H1 H2 H3
(1,1,'X') 2.3 0 1 1 X
...
I read a few posts on slicing and dicing, but this is not a normal string, is it?
Cheers
(1,1,'X') isn't a string here; it's a tuple.
So you need to split the tuple into multiple columns. You can achieve this
by using apply(pd.Series) on the column that holds the tuples.
Say your dataframe is df and the tuples are in its 'index' column:
In [10]: df['index'].apply(pd.Series)
Out[10]:
0 1 2
0 1 1 X
You need to add the columns back to the original data frame, so:
df[['H1', 'H2', 'H3']] = df['index'].apply(pd.Series)
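A runnable sketch of the same step, assuming the tuples sit in an ordinary column named index (not in the frame's actual row index). The second row is invented sample data. Instead of apply(pd.Series), it builds the new columns from a list of tuples, which avoids creating one Series per row and tends to be faster on large frames.

```python
import pandas as pd

# First row from the question; the second row is made up for illustration.
df = pd.DataFrame({
    "index": [(1, 1, "X"), (1, 2, "Z")],
    "X": [2.3, 0.7],
    "Z": [0.0, 1.1],
})

# Expand each tuple into three columns, aligned on the original row labels.
parts = pd.DataFrame(
    df["index"].tolist(),
    index=df.index,
    columns=["H1", "H2", "H3"],
)

result = df.join(parts)
print(result)
```

If the tuples are instead the row labels of the frame, df.index.tolist() can be passed in place of df["index"].tolist().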

R for loop: create a new column with the count of a sub str from a different column

I used to fiddle with R and now it all seems to have escaped me . . .
I have a table with a few hundred columns and about 100k rows. One of those columns contains strings that sometimes have commas in them (e.g. chicken,goat,cow or just chicken). I need a script with a (I believe) for loop that can create a new column (I know the new column code should not be in the for loop), count the number of commas (or the number of entries in the column in question less one) and add one so I can find out how many entries are in each column. An example:
col
chicken
chicken,goat
cow,chicken,goat
cow
I want a script to create an additional column in the table that would look like . . .
col2
1
2
3
1
I would use count.fields (from base R):
mydf$col2 <- count.fields(file = textConnection(as.character(mydf$col)),
sep = ",")
mydf
# col col2
# 1 chicken 1
# 2 chicken,goat 2
# 3 cow,chicken,goat 3
# 4 cow 1
Update: Accounting for blank lines
count.fields has a logical argument blank.lines.skip, which defaults to TRUE. So, to capture information for empty lines, just set that to FALSE.
Example:
mydf <- data.frame(col = c("chicken", "", "chicken,goat", "cow,chicken,goat", "cow"))
count.fields(file = textConnection(as.character(mydf$col)),
sep = ",", blank.lines.skip=FALSE)
# [1] 1 0 2 3 1
A loop is not needed here, I think. Using the stringr package...
require(stringr)
dat$aninum <- sapply(dat$ani,str_count,pattern=',')+1
which gives
ani aninum
1 chicken 1
2 chicken,goat 2
3 cow,chicken,goat 3
4 cow 1
You could use ?strsplit:
df <- data.frame(col=c("chicken", "chicken,goat", "cow,chicken,goat", "cow"), stringsAsFactors=FALSE)
df$col2 <- sapply(strsplit(df$col, ","), length)
df
# col col2
# 1 chicken 1
# 2 chicken,goat 2
# 3 cow,chicken,goat 3
# 4 cow 1

Resources