How to label encode a DataFrame column which contains both numbers and strings? - python-3.x

I have this DataFrame column:
df:
Index    Ticket
0        254326
1        CA345
3        SA12
4        267891
...
700      CA356
It contains two kinds of values: some are pure numbers and others are strings containing both letters and numbers.
Many rows share the same letter prefix (CA345, CA675, etc.). I would like to group the rows that share a prefix and give them the same label.
Eg. all rows containing "CA" labelled as 0, all rows containing "SA" labelled as 1.
The remaining rows all hold six-digit numbers (no letters in them). I would like to label all such rows with the same number (say 2, for example).

1st Approach
Define a custom function that checks whether the value is a string (isinstance(val, str)) and whether it contains "CA" or "SA":
def label_ticket(row):
    if isinstance(row['Ticket'], str) and 'CA' in row['Ticket']:
        return 0
    if isinstance(row['Ticket'], str) and 'SA' in row['Ticket']:
        return 1
    return 2
Apply the custom function and assign the result to a new column df['Label'].
df['Label'] = df.apply(label_ticket, axis=1)
print(df)
     Ticket  Label
0    254326      2
1     CA345      0
2      SA12      1
3    267891      2
700   CA356      0
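As an aside (not from the original answer), the same labelling can be done without apply by using numpy.select on vectorized string tests; a sketch assuming the only cases are "CA", "SA" and pure-digit tickets, as in the question:
import numpy as np

t = df['Ticket'].astype(str)
df['Label'] = np.select(
    [t.str.contains('CA'), t.str.contains('SA')],  # conditions, checked in order
    [0, 1],                                        # label for each condition
    default=2                                      # everything else: the six-digit tickets
)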
2nd Approach
On further thought, it seems you do not know in advance which letter patterns will appear in df['Ticket']. In that case you can use re.split() to extract the letter pattern of each value and then map each distinct pattern to a label.
import pandas as pd
import re
df = pd.DataFrame(columns=['Ticket'],
                  data=[[254326],
                        ['CA345'],
                        ['SA12'],
                        [267891],
                        ['CA356']])
df['Pattern'] = df['Ticket'].apply(lambda x: ''.join(re.split("[^a-zA-Z]*", str(x))))
df_label = (pd.DataFrame(df['Pattern'].unique(), columns=['Pattern'])
              .reset_index(level=0)
              .rename(columns={'index': 'Label'}))
df = df.merge(df_label, how='left')
print(df)
   Ticket Pattern  Label
0  254326              0
1   CA345      CA      1
2    SA12      SA      2
3  267891              0
4   CA356      CA      1
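A small simplification, not in the original answer: pandas.factorize can produce the integer labels directly from the Pattern column, replacing the unique()/reset_index/merge step above:
# factorize returns (codes, uniques); the codes number the patterns in order of first appearance
df['Label'] = pd.factorize(df['Pattern'])[0]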

I do not have much knowledge of Python, but you could try pandas.Series.str.extract with a regular expression, like:
import pandas as pd
import numpy as np

ptrn = r'(?P<CA>(CA[\d]+))|(?P<SA>(SA[\d]+))|(?P<DIGIT>[\d]{6})'
ls = {'tk': ['254326', 'CA345', 'SA12', '267891', 'CA356']}
df = pd.DataFrame(ls)
s = df['tk'].str.extract(ptrn, expand=False)
newDf = {
    0: [x for x in s['CA'] if not pd.isnull(x)],
    1: [x for x in s['SA'] if not pd.isnull(x)],
    2: [x for x in s['DIGIT'] if not pd.isnull(x)],
}
print(newDf)
Output:
{0: ['CA345', 'CA356'], 1: ['SA12'], 2: ['254326', '267891']}
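This groups the ticket values into a dict rather than adding a label column; a hedged sketch of one way to turn the same extract result s into labels on df (assuming every ticket matches one of the three groups):
# pick, per row, the first non-null named group and map the group name to a code
df['Label'] = s[['CA', 'SA', 'DIGIT']].notna().idxmax(axis=1).map({'CA': 0, 'SA': 1, 'DIGIT': 2})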

Related

How to change values in a df by index, where the indexes to replace are given in multiple lists, one list per column

I have lists holding the indexes of the values to be replaced. I have to change values in 8 different columns using 8 different lists. The replacement can be a simple string.
How can I do it?
I have more than 20 different columns in this df.
Eg:
list1 = [0,1,2]
list2 =[2,4]
list8 = ...
sustitution = 'no data'
| Column A | Column B |
| -------- | -------- |
| marcos   | peter    |
| Julila   | mike     |
| Fran     | Ramon    |
| Pedri    | Gavi     |
| Olmo     | Torres   |
OUTPUT:
| Column A | Column B |
| -------- | -------- |
| no data | peter |
| no data | mike |
| no data | no data |
| Pedri | Gavi |
| Olmo | no data |
Use DataFrame.loc with zipped lists and column names:
list1 = [0, 1, 2]
list2 = [2, 4]
L = [list1, list2]
cols = ['Column A', 'Column B']
sustitution = 'no data'

for c, i in zip(cols, L):
    df.loc[i, c] = sustitution

print(df)
Column A Column B
0 no data peter
1 no data mike
2 no data no data
3 Pedri Gavi
4 Olmo no data
You can use the underlying numpy array:
import numpy as np

list1 = [0, 1, 2]
list2 = [2, 4]
lists = [list1, list2]

# repeat each column position once per index in its list
col = np.repeat(np.arange(len(lists)), list(map(len, lists)))
# array([0, 0, 0, 1, 1])
row = np.concatenate(lists)
# array([0, 1, 2, 2, 4])
df.values[row, col] = 'no data'
Output:
Column A Column B
0 no data peter
1 no data mike
2 no data no data
3 Pedri Gavi
4 Olmo no data
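Note that writing through df.values only sticks when the DataFrame is backed by a single array (here both columns hold strings); with mixed dtypes .values is a copy and the assignment is lost. A hedged per-cell variant that avoids relying on that:
# assign cell by cell through .iat, which always writes back into the frame
for r, c in zip(row, col):
    df.iat[r, c] = 'no data'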

Adding a new column whose values are based on another column in either dataframe or excel

I want to add a new column "X" whose values should be either 0 or 1: if there is a value (in my case a date) in column "A", X should be 1 (or any marker text), otherwise 0.
example:
A | X
----------
*date* | 1
null | 0
*date* | 1
*date* | 1
*date* | 1
null | 0
Is there any way to do this in pandas or Excel/Office?
Here is an example in Excel:
=IF(ISBLANK(A2);0;1)
or
=IF(A2>0;1;0)
(dates are always greater than zero).
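The question also mentions pandas; a minimal sketch there, assuming the date column is named A and missing dates are NaN/NaT:
import pandas as pd

# 1 where column A holds a date, 0 where it is missing
df['X'] = df['A'].notna().astype(int)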

Pandas: With array of col names in a desired column order, select those that exist, NULL those that don't

I have an array of column names I want as my output table in that order e.g. ["A", "B", "C"]
I have an input table that USUALLY contains all of the columns named in the array, but NOT ALWAYS (the raw data is a JSON API response).
I want to select all available columns from the input table, and if a column does not exist, I want it filled with NULLs or NA or whatever, it doesn't really matter.
Let's say my input DataFrame (call it input_table) looks like this:
+-----+--------------+
| A | C |
+-----+--------------+
| 123 | test |
| 456 | another_test |
+-----+--------------+
I want an output dataframe that has columns A, B, C in that order to produce
+-----+------+--------------+
| A | B | C |
+-----+------+--------------+
| 123 | NULL | test |
| 456 | NULL | another_test |
+-----+------+--------------+
I get a KeyError when I do input_table[["A","B","C"]]
I get a NoneType returned when I do input_table.get(["A","B","C"])
I was able to achieve what I want via:
for i in desired_columns_array:
    if i not in input_dataframe:
        output_dataframe[i] = ""
    else:
        output_dataframe[i] = input_dataframe[i]
But I'm wondering if there's something less verbose?
How do I get a desired output schema to match an input array when one or more columns in the input dataframe may not be present?
Transpose and reindex
df = pd.DataFrame([[123,'test'], [456, 'another test']], columns=list('AC'))
l = list('ACB')
df1 = df.T.reindex(l).T[sorted(l)]
A B C
0 123 NaN test
1 456 NaN another test
DataFrame.reindex over the column axis:
cols = ['A', 'B', 'C']
df.reindex(cols, axis='columns')
A B C
0 123 NaN test
1 456 NaN another_test

Python, converting int to str, trailing/leading decimal/zeros

I convert my dataframe values to str, but when I concatenate them together the values that were ints end up with a trailing decimal.
df["newcol"] = df['columna'].map(str) + '_' + df['columnb'].map(str) + '_' + df['columnc'].map(str)
This is giving me output like 500.0. How can I get rid of these leading/trailing decimals? Sometimes my data in columna will have non-alphanumeric characters.
+---------+---------+---------+------------------+----------------------+
| columna | columnb | columnc | expected | currently getting |
+---------+---------+---------+------------------+----------------------+
| | -1 | 27 | _-1_27 | _-1.0_27.0 |
| | -1 | 42 | _-1_42 | _-1.0_42.0 |
| | -1 | 67 | _-1_67 | _-1.0_67.0 |
| | -1 | 95 | _-1_95 | _-1.0_95.0 |
| 91_CCMS | 14638 | 91 | 91_CCMS_14638_91 | 91_CCMS_14638.0_91.0 |
| DIP96 | 1502 | 96 | DIP96_1502_96 | DIP96_1502.0_96.0 |
| 106 | 11694 | 106 | 106_11694_106 | 00106_11694.0_106.0 |
+---------+---------+---------+------------------+----------------------+
Error:
invalid literal for int() with base 10: ''
Edit:
If your df has more than 3 columns and you want to join only 3 of them, you can select those columns by slicing. Assume your df has 5 columns named AA, BB, CC, DD, EE and you want to join only CC, DD, EE. You just need to select those 3 columns before the fillna and assign the result to newcol:
df["newcol"] = df[['CC', 'DD', 'EE']].fillna('') \
    .applymap(lambda x: x if isinstance(x, str) else str(int(x))).agg('_'.join, axis=1)
Note: I split the command across 2 lines using '\' for readability.
Original:
I guess your real data in columna, columnb, columnc contains str, float, int, empty strings, blanks, and maybe even NaN.
Floats whose decimal part is .0 are cast to int so they show without the decimal.
Assume your df has only the 3 columns columna, columnb, columnc, as you said. The command below handles str, float, int and NaN, and joins the 3 columns into one as you want:
df.fillna('').applymap(lambda x: x if isinstance(x, str) else str(int(x))).agg('_'.join, axis=1)
I created a sample similar to yours:
   columna columnb columnc
0               -1      27
1      NaN      -1      42
2               -1      67
3               -1      95
4  91_CCMS   14638      91
5    DIP96              96
6      106   11694     106
Using your command returns concatenated strings containing '.0', as you described:
df['columna'].map(str) + '_' + df['columnb'].map(str) + '_' + df['columnc'].map(str)
Out[1926]:
0 _-1.0_27.0
1 nan_-1.0_42.0
2 _-1.0_67.0
3 _-1.0_95.0
4 91_CCMS_14638_91
5 DIP96__96
6 106_11694_106
dtype: object
Using my command:
df.fillna('').applymap(lambda x: x if isinstance(x, str) else str(int(x))).agg('_'.join, axis=1)
Out[1927]:
0 _-1_27
1 _-1_42
2 _-1_67
3 _-1_95
4 91_CCMS_14638_91
5 DIP96__96
6 106_11694_106
dtype: object
I couldn't reproduce this error but maybe you could try something like:
df["newcol"] = df['columna'].map(lambda x: str(int(x)) if isinstance(x, int) else str(x)) + '_' + df['columnb'].map(lambda x: str(int(x))) + '_' + df['columnc'].map(lambda x: str(int(x)))

Filtering columns in a pandas dataframe

I have a dataframe with the following column.
A
55B
<lhggkkk>
66c
dggfhhjjjj
I need to filter the records which start with a number (such as 55B and 66c) separately from the others. Can anyone please help?
Try:
import pandas as pd
df = pd.DataFrame()
df['A'] = ['55B','<lhggkkk>','66c','dggfhhjjjj']
df['B'] = df['A'].apply(lambda x:x[0].isdigit())
print(df)
A B
0 55B True
1 <lhggkkk> False
2 66c True
3 dggfhhjjjj False
Check whether the first character is a digit, then boolean index, i.e.
mask = df['A'].str[0].str.isdigit()
one = df[mask]
two = df[~mask]
print(one,'\n',two)
A
0 55B
2 66c
A
1 <lhggkkk>
3 dggfhhjjjj
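One caveat, not raised in the answer: if column A contains missing values, str[0].str.isdigit() yields NaN for those rows and boolean indexing with such a mask fails in recent pandas. A hedged variant that keeps the mask strictly boolean:
# str.match anchors at the start of the string and supports na=, so missing rows become False
mask = df['A'].str.match(r'\d', na=False)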
To check whether the first character is a digit or not:
df['A'].str[0].str.isdigit()
So:
import pandas as pd
import numpy as np
df:
-----------------
| A
-----------------
0 | 55B
1 | <lhggkkk>
2 | 66c
3 | dggfhhjjjj
df['Result'] = np.where(df['A'].str[0].str.isdigit(), 'Numbers', 'Others')
df:
----------------------------
| A | Result
----------------------------
0 | 55B | Numbers
1 | <lhggkkk> | Others
2 | 66c | Numbers
3 | dggfhhjjjj | Others
