How to decode 'one hot encoded' column names from bytes to string in pandas dataframe - python-3.x

I did one-hot encoding on the categorical variables in my dataframe, and the columns were renamed as shown below.
Before one-hot encoding:
d = {'PROD_ID': ['OM', 'RM', 'VL']}
df = pd.DataFrame(data=d)
full_data = pd.get_dummies(df, drop_first=True)
After one-hot encoding:
full_data
PROD_ID_b'OM'
PROD_ID_b'VL'
PROD_ID_b'RM'
I need to remove the b and the quotes from the column names above, i.e. I need:
PROD_ID_OM
PROD_ID_VL
PROD_ID_RM

You can pass the prefix param to the get_dummies method as follows; it will then add the prefix to all columns as you need.
df = pd.DataFrame({'PROD_ID': ['OM', 'RM', 'VL']})
nwdf = pd.get_dummies(df, prefix=['PROD_ID'])
print(nwdf.columns)
Output: Index(['PROD_ID_OM', 'PROD_ID_RM', 'PROD_ID_VL'], dtype='object')
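The b'...' in the column names usually means the PROD_ID values themselves were byte strings (e.g. data read from an HDF5 file). In that case you can decode the column before encoding, or strip the wrapper from the generated names afterwards. A minimal sketch, assuming UTF-8 bytes and a reasonably recent pandas:
import pandas as pd

# Option 1: decode the byte values to str before one-hot encoding
df = pd.DataFrame({'PROD_ID': [b'OM', b'RM', b'VL']})
df['PROD_ID'] = df['PROD_ID'].str.decode('utf-8')
full_data = pd.get_dummies(df, drop_first=True)
print(full_data.columns)  # Index(['PROD_ID_RM', 'PROD_ID_VL'], dtype='object')

# Option 2: if the dummies already exist, strip b'...' from the column names
full_data.columns = full_data.columns.str.replace(r"b'(.*)'", r"\1", regex=True)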

Related

Iteratively append new data into pandas dataframe column and join with another dataframe

I have been extracting data from many APIs. I would like to add a common id column across all of them.
I have tried the below:
df = pd.DataFrame()
for i in range(1, 200):
    url = '{id}/values'.format(id=i)
    res = requests.get(url, headers=headers)
    if res.status_code == 200:
        data = json.loads(res.content.decode('utf-8'))
        if data['success']:
            df['id'] = i
            test = pd.json_normalize(data[parent][child])
            df = df.append(test, ignore_index=True)
But in the dataframe's id column I'm getting only the last iterated id, and when an API returns many rows I'm getting invalid data.
For performance reasons it is better to first store the data in a dictionary and then create the dataframe from that dictionary:
import pandas as pd
from collections import defaultdict

d = defaultdict(list)
for i in range(1, 200):
    # simulate dataframe retrieved from pd.json_normalize() call
    row = pd.DataFrame({'id': [i], 'field1': [f'f1-{i}'], 'field2': [f'f2-{i}'], 'field3': [f'f3-{i}']})
    for k, v in row.to_dict().items():
        d[k].append(v[0])

df = pd.DataFrame(d)
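Alternatively, staying closer to the original loop, you can collect each normalized frame in a list and concatenate once at the end. This also fixes the id problem, because the id is assigned per chunk rather than to the growing frame. A sketch only: url, headers, parent and child are placeholders carried over from the question.
import json
import pandas as pd
import requests

frames = []
for i in range(1, 200):
    url = '{id}/values'.format(id=i)
    res = requests.get(url, headers=headers)  # headers defined elsewhere
    if res.status_code == 200:
        data = json.loads(res.content.decode('utf-8'))
        if data['success']:
            test = pd.json_normalize(data[parent][child])  # parent/child are placeholders
            test['id'] = i  # tag every row of this chunk with its id
            frames.append(test)

df = pd.concat(frames, ignore_index=True)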

How to remove rows in pandas dataframe column that contain the hyphen character?

I have a DataFrame given as follows:
new_dict = {'Area_sqfeet': '[1002, 322, 420-500,300,1.25acres,100-250,3.45 acres]'}
df = pd.DataFrame([new_dict])
df.head()
I want to remove the hyphenated values and convert acres to sqfeet in this dataframe.
How may I do it efficiently?
Use a list comprehension:
mylist = ["1002", "322", "420-500","300","1.25acres","100-250","3.45 acres"]
# ['1002', '322', '420-500', '300', '1.25acres', '100-250', '3.45 acres']
Step 1: Remove hyphens
filtered_list = [i for i in mylist if "-" not in i] # remove hyphens
Step 2: Convert acres to sqfeet:
final_list = [i if 'acres' not in i else eval(i.split('acres')[0])*43560 for i in filtered_list] # convert to sq foot
#['1002', '322', '300', 54450.0, 150282.0]
Also, if you want to keep the "sqfeet" next to the converted values, use this:
final_list = [i if 'acres' not in i else "{} sqfeet".format(eval(i.split('acres')[0])*43560) for i in filtered_list]
# ['1002', '322', '300', '54450.0 sqfeet', '150282.0 sqfeet']
It's not clear if this is homework, and you haven't shown us what you have already tried (see https://stackoverflow.com/help/how-to-ask).
Here's something that might get you going in the right direction:
import pandas as pd
col_name = 'Area_sqfeet'
# per comment on your question, you need to make a dataframe with more
# than one row, your original question only had one row
new_list = ["1002", "322", "420-500","300","1.25acres","100-250","3.45 acres"]
df = pd.DataFrame(new_list)
df.columns = ["Area_sqfeet"]
# once you have the df as strings, here's how to remove the ones with hyphens
df = df[df["Area_sqfeet"].str.contains("-")==False]
print(df.head())
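For completeness, both steps can also be done directly on the dataframe with pandas string methods. A sketch, assuming the column holds the strings from the list above:
import pandas as pd

df = pd.DataFrame({"Area_sqfeet": ["1002", "322", "420-500", "300",
                                   "1.25acres", "100-250", "3.45 acres"]})
# drop rows whose value contains a hyphen
df = df[~df["Area_sqfeet"].str.contains("-")]
# convert the 'acres' entries to square feet (1 acre = 43560 sqft)
is_acres = df["Area_sqfeet"].str.contains("acres")
acres = (df.loc[is_acres, "Area_sqfeet"]
           .str.replace("acres", "", regex=False)
           .str.strip()
           .astype(float))
df.loc[is_acres, "Area_sqfeet"] = (acres * 43560).astype(str)
print(df)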

converting the values in a text file and making new text file in python

I have a text file like this example:
"class" "Name" "Access" "CF33456_12.RCC" "CF33457_05.RCC" "CF33458_04.RCC"
"ff" "edi" "ff" "kju" 2444.91910958478 1669.55827263364 699.627215729572
"gg" "edi" "gg" "uhy" 2002.95278984564 369.565070720533 351.056685823175
In this file there are 6 columns (based on the headers), and the first column holds the row names. I would like to change the numbers (the last 3 columns) to their log2 values and make a new file with exactly the same structure. Here is the expected output:
"class" "Name" "Access" "CF33456_12.RCC" "CF33457_05.RCC" "CF33458_04.RCC"
"ff" "edi" "ff" "kju" 11.2555710189065 10.7052507333626 9.45044260143907
"gg" "edi" "gg" "uhy" 10.9679127014901 8.52968459728736 8.45556019395986
I am trying to do that in Python using this code:
import pandas as pd
import numpy as np

df = pd.read_table("myfile.txt", index_col=0)
df2 = df.iloc[:, [3, 4, 5]]
df3 = np.array(df2)
df4 = np.log2(df3)
final = pd.DataFrame(df4)
It converts to log2 values, but it does not return a file with the same structure. Do you know how to fix it?
In your example, the original dataframe (which has the structure of the input table) can be updated in place using this code:
import pandas as pd
import numpy as np

df = pd.read_table("myfile.txt", index_col=0)
df2 = df.iloc[:, 3:6]       # the three numeric columns
df3 = np.array(df2)
df4 = np.log2(df3)
df.iloc[:, 3:6] = df4       # write the log2 values back into the original frame
final = df
(df4 on its own has a different format: it is just a slice of the table, and the index is lost when converting to a numpy array, which is why the original code did not preserve the structure.)
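To produce the new file itself, write final back out with to_csv. A minimal sketch; the tab separator and the output name are assumptions (match sep to however the original file is delimited), and QUOTE_NONNUMERIC keeps the string fields quoted as in the sample:
import csv

final.to_csv("myfile_log2.txt", sep="\t", quoting=csv.QUOTE_NONNUMERIC)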

Pandas checks with prefix and more checksum if searched prefix exists or no data

I have the below code snippet, which works fine.
import pandas as pd
import numpy as np
prefixes = ['sj00', 'sj12', 'cr00', 'cr08', 'eu00', 'eu50']
df = pd.read_csv('new_hosts', index_col=False, header=None)
df['prefix'] = df[0].str[:4]
df['grp'] = df.groupby('prefix').cumcount()
df = df.pivot(index='grp', columns='prefix', values=0)
df['sj12'] = df['sj12'].str.extract(r'(\w{2}\d{2}\w\*)', expand=True)
df = df[ prefixes ].dropna(axis=0, how='all').replace(np.nan, '', regex=True)
df = df.rename_axis(None)
Example File new_hosts
sj000001
sj000002
sj000003
sj000004
sj124000
sj125000
sj126000
sj127000
sj128000
sj129000
sj130000
sj131000
sj132000
cr000011
cr000012
cr000013
cr000014
crn00001
crn00002
crn00003
crn00004
euk000011
eu0000012
eu0000013
eu0000014
eu5000011
eu5000013
eu5000014
eu5000015
Current output:
sj00      sj12  cr00      cr08      eu00       eu50
sj000001        cr000011  crn00001  euk000011  eu5000011
sj000002        cr000012  crn00002  eu0000012  eu5000013
sj000003        cr000013  crn00003  eu0000013  eu5000014
sj000004        cr000014  crn00004  eu0000014  eu5000015
What's expected:
1) The code works fine, but as you can see in the current output, the second column doesn't have any values yet still appears. How could I add a check so that a column with no values is removed from the display?
2) Can we check whether the prefixes exist in the dataframe before processing, to avoid an error?
Appreciate any help.
IIUC, before
df = df[ prefixes ].dropna(axis=0, how='all').replace(np.nan, '', regex=True)
you can do:
# remove all empty columns
df = df.dropna(axis=1, how='all')
That would solve your first part. The second part can be handled with reindex:
# select prefixes:
prefixes = ['sj00', 'sj12', 'cr00', 'cr08', 'eu00', 'eu50', 'sh00', 'dt00']
df = df.reindex(prefixes, axis=1).dropna(axis=1, how='all').replace(np.nan, '', regex=True)
Note the axis=1, not axis=0; that dropna is identical to what I propose for question 1.
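As a small illustration of why reindex handles a missing prefix (a toy example of mine, not from the question): reindexing with a label that does not exist creates an all-NaN column, which dropna(axis=1, how='all') then removes:
import pandas as pd

df = pd.DataFrame({'sj00': ['sj000001'], 'cr00': ['cr000011']})
out = df.reindex(['sj00', 'cr00', 'dt00'], axis=1)
print(out.columns.tolist())  # ['sj00', 'cr00', 'dt00'] -- dt00 is all NaN
out = out.dropna(axis=1, how='all')
print(out.columns.tolist())  # ['sj00', 'cr00']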
Many thanks to Quang Hoang for the hints on the post. Just as a workaround, I got it working as follows until I get a better answer:
# Select prefixes
prefixes = ['sj00', 'sj12', 'cr00', 'cr08', 'eu00', 'eu50']
df = pd.read_csv('new_hosts', index_col=False, header=None)
df['prefix'] = df[0].str[:4]
df['grp'] = df.groupby('prefix').cumcount()
df = df.pivot(index='grp', columns='prefix', values=0)
df = df[prefixes]
# For column `sj12`, extract only the values where `sj12` is followed by a word character, e.g. `sj12[a-z]`
df['sj12'] = df['sj12'].str.extract(r'(\w{2}\d{2}\w\*)', expand=True)
df.replace('', np.nan, inplace=True)
# Remove the empty columns
df = df.dropna(axis=1, how='all')
# again drop rows where all values are NaN, and replace NaN with '' in the remaining columns
df = df.dropna(axis=0, how='all').replace(np.nan, '', regex=True)
# drop the index field
df = df.rename_axis(None)
print(df)

Merge and then sort columns of a dataframe based on the columns of the merging dataframe

I have two dataframes, both indexed with timestamps. I would like to preserve the column order of the first dataframe when the two are merged.
For example:
#required packages
import pandas as pd
import numpy as np
# defining stuff
num_periods_1 = 11
num_periods_2 = 4
# create sample time series
dates1 = pd.date_range('1/1/2000 00:00:00', periods=num_periods_1, freq='10min')
dates2 = pd.date_range('1/1/2000 01:30:00', periods=num_periods_2, freq='10min')
column_names_1 = ['C', 'B', 'A']
column_names_2 = ['B', 'C', 'D']
df1 = pd.DataFrame(np.random.randn(num_periods_1, len(column_names_1)), index=dates1, columns=column_names_1)
df2 = pd.DataFrame(np.random.randn(num_periods_2, len(column_names_2)), index=dates2, columns=column_names_2)
df3 = df1.merge(df2, how='outer', left_index=True, right_index=True, suffixes=['_1', '_2'])
print("\nData Frame Three:\n", df3)
The above code generates two dataframes, the first with columns C, B, and A, the second with columns B, C, and D. The current output has the columns in the following order: C_1, B_1, A, B_2, C_2, D. What I want is for the columns in the merge output to be C_1, C_2, B_1, B_2, A, D: the order of the columns is preserved from the first dataframe, and any column shared with the second dataframe is placed next to its counterpart.
Is there a setting in merge, or can I use sort_index, to do this?
EDIT: Maybe a better way to phrase the sorting is that the columns are collated: each group of like-named columns is put together, and so on.
Using an OrderedDict, as you suggested.
from collections import OrderedDict
from itertools import chain

c = df3.columns.tolist()
o = OrderedDict()
for x in c:
    # group column names by their base name (the part before the '_' suffix)
    o.setdefault(x.split('_')[0], []).append(x)
c = list(chain.from_iterable(o.values()))
df3 = df3[c]
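With the sample frames from the question, printing the reordered columns confirms the collated order (my own check):
print(df3.columns.tolist())
# ['C_1', 'C_2', 'B_1', 'B_2', 'A', 'D']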
An alternative involves extracting the prefixes and then calling sorted on the column list.
# https://stackoverflow.com/a/46839182/4909087
p = [s[0] for s in c]  # first character of each column name acts as its prefix
c = sorted(c, key=lambda x: (p.index(x[0]), x))
df3 = df3[c]
