How to add a string to all values in a column of a pandas DataFrame - python-3.x

Say you have a DataFrame with the columns:
col_1 col_2
1 a
2 b
3 c
4 d
5 e
How would you change the values of col_2 so that new value = current value + 'new'?

Use +:
df.col_2 = df.col_2 + 'new'
print (df)
col_1 col_2
0 1 anew
1 2 bnew
2 3 cnew
3 4 dnew
4 5 enew
Thanks to hooy for another solution:
df.col_2 += 'new'
Or assign:
df = df.assign(col_2 = df.col_2 + 'new')
print (df)
col_1 col_2
0 1 anew
1 2 bnew
2 3 cnew
3 4 dnew
4 5 enew
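
These snippets assume col_2 already holds strings. A minimal, self-contained sketch (the frame is reconstructed from the question); if the column were numeric, you would cast it with astype(str) first:
import pandas as pd

df = pd.DataFrame({'col_1': [1, 2, 3, 4, 5],
                   'col_2': list('abcde')})

# plain + works because col_2 holds strings (object dtype)
df['col_2'] = df['col_2'] + 'new'

# a numeric column must be cast to str before concatenating
df['col_1'] = df['col_1'].astype(str) + 'new'
print(df)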

Related

How to check and delete a pandas column if all rows contain just a punctuation character?

I have a dataframe with multiple columns, and I want to delete the columns where every row contains just a punctuation character, e.g.:
col_1 col_2 col_3 col_4
0 1 _ ab 1,235
1 2 ? cd 8,900
2 3 _ ef 1,235
3 4 - gh 8,900
Here I just want to delete col_2. How can I achieve that?
The idea is to test whether every value in a column contains a digit or a word character, using Series.str.contains inside DataFrame.apply together with DataFrame.all, and finally filter with DataFrame.loc:
df = df.loc[:, df.astype(str).apply(lambda x: x.str.contains(r'\d|\w')).all()]
Or, since \w also matches the underscore, restrict to digits and ASCII letters explicitly:
df = df.loc[:, df.astype(str).apply(lambda x: x.str.contains(r'\d|[a-zA-Z]')).all()]
print (df)
col_1 col_3 col_4
0 1 ab 1,235
1 2 cd 8,900
2 3 ef 1,235
3 4 gh 8,900
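A runnable sketch of the regex filter, with the sample frame reconstructed from the question:
import pandas as pd

df = pd.DataFrame({'col_1': [1, 2, 3, 4],
                   'col_2': ['_', '?', '_', '-'],
                   'col_3': ['ab', 'cd', 'ef', 'gh'],
                   'col_4': ['1,235', '8,900', '1,235', '8,900']})

# keep only columns where every cell contains at least one digit or letter
df = df.loc[:, df.astype(str).apply(lambda x: x.str.contains(r'\d|[a-zA-Z]')).all()]
print(df)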
If it is possible to enumerate every character that marks a value for removal, add ^ for start of string and $ for end of string, and then invert the mask with ~:
p = """[!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ ]"""
df = df.loc[:, ~df.astype(str).apply(lambda x: x.str.contains('^' + p + '$')).all()]
print (df)
col_1 col_3 col_4
0 1 ab 1,235
1 2 cd 8,900
2 3 ef 1,235
3 4 gh 8,900
You could use Python's string.punctuation constant and filter out the columns that contain only punctuation with pandas' all method:
import string
#create a list of individual punctuations
punctuation_list = list(string.punctuation)
#filter out unwanted column
df.loc[:,~df.isin(punctuation_list).all()]
col_1 col_3 col_4
0 1 ab 1,235
1 2 cd 8,900
2 3 ef 1,235
3 4 gh 8,900
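For reference, the same filter applied to the sample frame from the sketch above (data reconstructed from the question):
import string
import pandas as pd

df = pd.DataFrame({'col_1': [1, 2, 3, 4],
                   'col_2': ['_', '?', '_', '-'],
                   'col_3': ['ab', 'cd', 'ef', 'gh'],
                   'col_4': ['1,235', '8,900', '1,235', '8,900']})

# drop columns where every cell is a lone punctuation character
punctuation_list = list(string.punctuation)
print(df.loc[:, ~df.isin(punctuation_list).all()])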

pandas transform one row into multiple rows

I have a dataframe as below.
ID list
1 a, b, c
2 a, s
3 NA
5 f, j, l
I need to break each item in the list column (a string) into an independent row, as below:
ID item
1 a
1 b
1 c
2 a
2 s
3 NA
5 f
5 j
5 l
Thanks.
Use str.split to separate your items, then explode:
print (df.assign(list=df["list"].str.split(", ")).explode("list"))
ID list
0 1 a
0 1 b
0 1 c
1 2 a
1 2 s
2 3 NaN
3 5 f
3 5 j
3 5 l
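Series.explode was added in pandas 0.25, so this needs a reasonably recent version. A self-contained sketch with the sample data reconstructed from the question:
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 5],
                   'list': ['a, b, c', 'a, s', None, 'f, j, l']})

# split each string into a Python list, then emit one row per element
out = df.assign(list=df['list'].str.split(', ')).explode('list')
print(out)
Note that explode keeps the original index; chain .reset_index(drop=True) if you want a fresh 0..n-1 index.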
A beginner's approach: just another way of doing the same thing, using pd.DataFrame.stack (note that str(x) turns the missing value into the literal string 'nan'):
df['list'] = df['list'].map(lambda x: str(x).split(', '))
dfOut = pd.DataFrame(df['list'].values.tolist())
dfOut.index = df['ID']
dfOut = dfOut.stack().reset_index()
del dfOut['level_1']
dfOut.rename(columns={0: 'list'}, inplace=True)
Output:
ID list
0 1 a
1 1 b
2 1 c
3 2 a
4 2 s
5 3 nan
6 5 f
7 5 j
8 5 l

Pandas: How to increment a new column based on increment and consecutive properties of 2 other columns?

I'm currently working on a bulk data pre-processing framework in pandas and since I'm relatively new to pandas, I can't seem to solve this problem:
Given: a dataset with 2 columns: col_1, col_2
Required: a new column req_col such that its value is incremented if
a. the values in col_1 are not consecutive, OR
b. the value of col_2 is incremented
NOTE:
col_2 always starts from 1 and always increases in value, and values are never missing (always consecutive), e.g. 1,1,2,2,3,3,4,5,6,6,6,7,8,8,9...
col_1 always starts from 0 and always increases in value, but some values can be missing (need not be consecutive), e.g. 0,1,2,2,3,6,6,6,10,10,10...
EXPECTED ANSWER:
col_1 col_2 req_col #Changes in req_col explained below
0 1 1
0 1 1
0 2 2 #because col_2 value has incremented
1 2 2
1 2 2
3 2 3 #because '3' is not consectutive to '1' in col_1
3 3 4 #because of increment in col_2
5 3 5 #because '5' is not consecutive to '3' in col_1
6 4 6 #because of increment in col_2 and so on...
6 4 6
Try:
df['req_col'] = (df['col_1'].diff().gt(1) |  # col_1 is not consecutive
                 df['col_2'].diff().ne(0)    # col_2 has incremented
                ).cumsum()
Output:
0 1
1 1
2 2
3 2
4 2
5 3
6 4
7 5
8 6
9 6
dtype: int32
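Each True in the combined mask marks the start of a new group, and cumsum numbers those groups. A self-contained sketch with the sample data from the question:
import pandas as pd

df = pd.DataFrame({'col_1': [0, 0, 0, 1, 1, 3, 3, 5, 6, 6],
                   'col_2': [1, 1, 2, 2, 2, 2, 3, 3, 4, 4]})

# True whenever col_1 jumps by more than 1 or col_2 changes;
# the first row is True because diff() yields NaN there and NaN != 0
df['req_col'] = (df['col_1'].diff().gt(1) | df['col_2'].diff().ne(0)).cumsum()
print(df)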

pandas: how to convert a two-dimensional dataframe to a one-dimensional dataframe

Suppose I have a dataframe with multiple columns:
a b c
1
2
3
How can I convert it to a single-column dataframe like the following?
1 a
2 a
3 a
1 b
2 b
3 b
1 c
2 c
3 c
Please note that the former is a DataFrame, not a Panel.
Use melt:
df = df.reset_index().melt('index', var_name='col').set_index('index')[['col']]
print (df)
col
index
1 a
2 a
3 a
1 b
2 b
3 b
1 c
2 c
3 c
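A self-contained sketch of the melt approach, assuming the frame is just an index of 1..3 with empty columns a/b/c, as drawn in the question:
import pandas as pd

df = pd.DataFrame(index=[1, 2, 3], columns=['a', 'b', 'c'])

# melt turns the column names into values; keep only that name column
out = df.reset_index().melt('index', var_name='col').set_index('index')[['col']]
print(out)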
Or use numpy.repeat and numpy.tile with the DataFrame constructor (this builds the result from the index and columns alone, ignoring the cell values):
import numpy as np

a = np.repeat(df.columns, len(df))
b = np.tile(df.index, len(df.columns))
df = pd.DataFrame(a, index=b, columns=['col'])
print (df)
col
1 a
2 a
3 a
1 b
2 b
3 b
1 c
2 c
3 c
Another way is:
import itertools

pd.DataFrame(list(itertools.product(df.index, df.columns.values))).set_index([0])
Output:
1
0
1 a
1 b
1 c
2 a
2 b
2 c
3 a
3 b
3 c
For the exact desired output, sort by the column-name values with sort_values:
print(pd.DataFrame(list(itertools.product(df.index, df.columns.values))).set_index([0]).sort_values(by=[1]))
1
0
1 a
2 a
3 a
1 b
2 b
3 b
1 c
2 c
3 c
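A hedged alternative (not from the original answers): pd.MultiIndex.from_product yields the column-major order of the expected output directly:
import pandas as pd

df = pd.DataFrame(index=[1, 2, 3], columns=['a', 'b', 'c'])

# the product of (columns x index) is already grouped by column
idx = pd.MultiIndex.from_product([df.columns, df.index])
out = pd.DataFrame({'col': idx.get_level_values(0).to_numpy()},
                   index=idx.get_level_values(1))
print(out)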

How to combine a data frame with another that contains comma separated values?

I am working with two data frames that I created from an Excel file. One data frame contains values that are separated by commas, that is:
df1 df2
----------- ------------
0 LFTEG42 X,Y,Z
1 JOCOROW 1,2
2 TLR_U01 I
3 PR_UDG5 O,M
df1 and df2 are my column names. My intention is to merge the two data frames and generate the following output:
desired result
----------
0 LFTEG42X
1 LFTEG42Y
2 LFTEG42Z
3 JOCOROW1
4 JOCOROW2
5 TLR_U01I
6 .....
n PR_UDG5M
This is the code that I used but I ended up with the following result:
input_file = pd.ExcelFile('C:\\Users\\devel\\Desktop_12\\Testing\\latest_Calculation'
                          + str(datetime.now()).split(' ')[0] + '.xlsx')
# convert the worksheets to dataframes
df1 = pd.read_excel(input_file, index_col=None, na_values=['NA'], usecols="H",
                    sheet_name="Analysis")
df2 = pd.read_excel(input_file, index_col=None, na_values=['NA'], usecols="I",
                    sheet_name="Analysis")
data_frames_merged = df1.append(df2, ignore_index=True)
current result
--------------
NaN XYZ
NaN 1,2
NaN I
... ...
PR_UDG5 NaN
Questions
Why did I end up receiving a NaN (not a number) value?
How can I achieve my desired result of merging these two data frames with the comma values?
I break down the steps. (The NaNs in your attempt appear because append stacks the frames vertically: df1 and df2 have different column names, so cells with no matching column are filled with NaN. Concatenating with axis=1 lines the frames up side by side instead.)
df = pd.concat([df1, df2], axis=1)
df.df2 = df.df2.str.split(',')
df = df.set_index('df1').df2.apply(pd.Series).stack().reset_index().drop('level_1', axis=1).rename(columns={0: 'df2'})
df['New'] = df.df1 + df.df2
df
Out[34]:
df1 df2 New
0 LFTEG42 X LFTEG42X
1 LFTEG42 Y LFTEG42Y
2 LFTEG42 Z LFTEG42Z
3 JOCOROW 1 JOCOROW1
4 JOCOROW 2 JOCOROW2
5 TLR_U01 I TLR_U01I
6 PR_UDG5 O PR_UDG5O
7 PR_UDG5 M PR_UDG5M
Data input:
df1
Out[36]:
df1
0 LFTEG42
1 JOCOROW
2 TLR_U01
3 PR_UDG5
df2
Out[37]:
df2
0 X,Y,Z
1 1,2
2 I
3 O,M
Dirty one-liner
new_df = pd.concat([df1['df1'], df2['df2'].str.split(',', expand=True).stack()
                    .reset_index(1, drop=True)], axis=1).sum(1)
0 LFTEG42X
0 LFTEG42Y
0 LFTEG42Z
1 JOCOROW1
1 JOCOROW2
2 TLR_U01I
3 PR_UDG5O
3 PR_UDG5M
Also, similar to @Vaishali's answer, except using melt:
df = pd.concat([df1,df2['df2'].str.split(',',expand=True)],axis=1).melt(id_vars='df1').dropna().drop('variable',axis=1).sum(axis=1)
0 LFTEG42X
1 JOCOROW1
2 TLR_U01I
3 PR_UDG5O
4 LFTEG42Y
5 JOCOROW2
7 PR_UDG5M
8 LFTEG42Z
Setup
df1 = pd.DataFrame(dict(A='LFTEG42 JOCOROW TLR_U01 PR_UDG5'.split()))
df2 = pd.DataFrame(dict(A='X,Y,Z 1,2 I O,M'.split()))
Getting creative
df1.A.repeat(df2.A.str.count(',') + 1) + ','.join(df2.A).split(',')
0 LFTEG42X
0 LFTEG42Y
0 LFTEG42Z
1 JOCOROW1
1 JOCOROW2
2 TLR_U01I
3 PR_UDG5O
3 PR_UDG5M
dtype: object
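On pandas 0.25 or newer, explode offers another route; a minimal sketch using the same df1/df2 setup as above:
import pandas as pd

df1 = pd.DataFrame(dict(A='LFTEG42 JOCOROW TLR_U01 PR_UDG5'.split()))
df2 = pd.DataFrame(dict(A='X,Y,Z 1,2 I O,M'.split()))

# explode repeats the index per element, so the string prefixes
# align with the exploded pieces when the two Series are added
s = df2['A'].str.split(',').explode()
print(df1['A'] + s)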
