Is there any way to reverse categorical values to the original string or text values?

Below is the data frame I'm working with:
  cc  temp
0 US  37.0
1 CA  12.0
2 US  35.0
3 AU  20.0
I then converted the cc column to category codes using
df['cc'] = df['cc'].astype('category').cat.codes
I'm getting this as output
cc temp
0 2 37.0
1 1 12.0
2 2 35.0
3 0 20.0
How can I reverse this back to the original values? Any ideas?

You could use LabelEncoder from sklearn.preprocessing, which offers similar functionality to what you've done.
Here's how to do it with your dataframe:
# Assuming you've created the dataframe already
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Transforming categorical variable to label encoded form
df['cc'] = le.fit_transform(df['cc'])
# Converting back from label encoded form to labels
df['cc'] = le.inverse_transform(df['cc'])
You can read about label encoder and scikit-learn's implementation at https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html.
It might also help to read about other forms of encoding categorical variables, such as one-hot encoding and target encoding, and when to use each.
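If you'd prefer to stay within pandas, you can keep the Categorical dtype around and map the codes back through its categories. A minimal sketch, assuming the dataframe from the question:
import pandas as pd

df = pd.DataFrame({'cc': ['US', 'CA', 'US', 'AU'],
                   'temp': [37.0, 12.0, 35.0, 20.0]})

# Keep the categorical column around instead of overwriting it with its codes
cat = df['cc'].astype('category')
codes = cat.cat.codes               # 2, 1, 2, 0
categories = cat.cat.categories     # Index(['AU', 'CA', 'US'], dtype='object')

# Reverse: rebuild the original labels from the codes
restored = pd.Categorical.from_codes(codes, categories)
print(list(restored))               # ['US', 'CA', 'US', 'AU']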

Related

Unable to convert text format to proper data frame using Pandas

I am reading a text source from URL = 'https://www.census.gov/construction/bps/txt/tb2u201901.txt'.
Here I used pandas to convert it into a DataFrame:
df = pd.read_csv(URL, sep = '\t')
After exporting the df, I see all the columns merged into a single column in spite of giving the separator as '\t'. How do I solve this issue?
As your file is not a CSV file, you should use pandas' read_fwf() function, since your columns have fixed widths. You also need to skip the first 12 lines, which are not part of your data, and remove the empty lines with dropna():
df = pd.read_fwf(URL, skiprows=12)
df.dropna(inplace=True)
df.head()
United States 94439 58086 1600 1457 33296 1263
1 Northeast 9099.0 3330.0 272.0 242.0 5255.0 242.0
2 New England 1932.0 1079.0 90.0 72.0 691.0 46.0
3 Connecticut 278.0 202.0 8.0 3.0 65.0 8.0
4 Maine 357.0 222.0 6.0 0.0 129.0 5.0
5 Massachusetts 819.0 429.0 38.0 54.0 298.0 23.0
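Note that in the output above, the "United States" line has been promoted to the column header. If you want it kept as a data row, a variant like the following should work (the column names here are placeholders, not the file's real captions):
import pandas as pd

URL = 'https://www.census.gov/construction/bps/txt/tb2u201901.txt'

# header=None stops read_fwf from promoting the first data row to the header;
# the names below are placeholders for the table's seven columns
df = pd.read_fwf(URL, skiprows=12, header=None,
                 names=['area', 'col1', 'col2', 'col3', 'col4', 'col5', 'col6'])
df = df.dropna()
print(df.head())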
Your output is correct. If you open the URL, you will see that there are sentences at the top which are not tab-separated, so pandas is not able to parse them correctly.
From line 9 onwards, the results are correct.
(Screenshot: https://i.stack.imgur.com/2K61J.png)

Creating a single file from multiple files (Python 3.x)

I can't figure out a great way to do this. I have 2 files in a standard date/value format:
File 1          File 2
Date  Value     Date  Value
4     7.0       1     9.0
5     5.5       .     .
6     4.0       7     2.0
I want to combine files 1 and 2 to get the following:
Combined Files
Date Value1 Value2 Avg
1 NaN 9.0 9.0
2 NaN 9.0 9.0
3 NaN 8.5 8.5
4 7.0 7.5 7.25
5 5.5 5.0 5.25
6 4.0 3.5 3.75
7 NaN 2.0 2.0
How would I attempt this? I figured I should make a masked array with the date going from 1 to 7 and then just append the files together, but I don't know how I would do that with file 1. Any pointers on where to look would be appreciated.
Using Python 3.x
EDIT:
I solved my own problem!
I'm sure there is a better way to streamline this. My solution doesn't use the example above; I just threw in my code.
from glob import glob
from datetime import datetime

import numpy as np
import matplotlib.dates as mdates

def extractFiles(Dir, newDir, newDir2):
    fnames = glob(Dir)
    farray = np.array(fnames)
    ## Dates range over matplotlib date numbers 723911 to 737030
    dateArray = np.arange(723911, 737030)  # Store the dates
    dataArray = []  # Store the data; this needs to be a list, not an np.array!
    for f in farray:
        ## Extract the data
        CH4 = np.genfromtxt(f, comments='#', delimiter=None, dtype=float).T
        myData = np.full(dateArray.shape, np.nan)  # NaN-filled array, one slot per date
        myDate = np.array([])
        ## Convert the given year/month pairs into matplotlib date numbers
        for x, y in zip(CH4[1], CH4[2]):
            myDate = np.append(
                myDate,
                mdates.date2num(datetime.strptime('{}-{}'.format(int(x), int(y)), '%Y-%m')))
        ## Find where the dates match and place the appropriate concentration value
        for i in range(len(CH4[3])):
            idx = np.where(dateArray == myDate[i])
            myData[idx] = CH4[3, i]
        ## Store all values in the list
        dataArray.append(myData)
    ## Convert the list to a numpy array and save to a text file
    dataArray = np.vstack((dateArray, dataArray))
    np.savetxt(newDir, dataArray.T, fmt='%1.2f', delimiter=',')
    ## Average the data across files for plotting
    avg = np.nanmean(dataArray[1:].T, 1)
    avg = np.vstack((dateArray, avg))
    np.savetxt(newDir2, avg.T, fmt='%1.2f', delimiter=',')
    return avg
Here is my answer based on the information you gave me:
import pandas as pd
import os

# I stored two Excel files in a subfolder of this sample code:
# Code
# ---- Files
# -------- File1.xlsx
# -------- File2.xlsx
# Here I am saving the path to a variable
file_path = os.path.join(os.getcwd(), 'Files', '')
# Define an empty DataFrame that we then fill with the files' information
final_df = pd.DataFrame()
# file_number increments the Value column name per loaded file:
# the first file becomes Value1, the second Value2
file_number = 1
# os.listdir "has a look" into the "Files" folder and returns a list of the
# files contained there, ['File1.xlsx', 'File2.xlsx'] in our case
for file in os.listdir(file_path):
    # load the Excel file with pandas' "read_excel" function
    df = pd.read_excel(file_path + file)
    # rename the column "Value" to "Value" plus the file_number
    df = df.rename(columns={'Value': 'Value' + str(file_number)})
    # check whether the DataFrame already contains values
    if not final_df.empty:
        # if there are values already, merge them with the new values
        final_df = final_df.merge(df, how='outer', on='Date')
    else:
        # otherwise "initialize" final_df with the first Excel file we loaded
        final_df = df
    # increment the file number to continue with the next file
    file_number += 1
# get all column names that contain "Value"
value_columns = [w for w in final_df.columns if 'Value' in w]
# create a new column holding the row-wise average of the value columns
final_df['Avg'] = final_df[value_columns].mean(axis=1)
# sort the dataframe by Date
sorted_df = final_df.sort_values('Date')
print(sorted_df)
The print will output this:
Date Value1 Value2 Avg
3 1 NaN 9.0 9.00
4 2 NaN 9.0 9.00
5 3 NaN 8.5 8.50
0 4 7.0 7.5 7.25
1 5 5.5 5.0 5.25
2 6 4.0 3.5 3.75
6 7 NaN 2.0 2.00
Please be aware that this does not pay attention to the file names; it just loads one file after another in alphabetical order.
But this has the advantage that you can put as many files in there as you want.
If you need to load them in a specific order, I can probably help you with that as well.
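For what it's worth, the same outer-merge pattern can be written more compactly with functools.reduce. A sketch under the same assumptions (Excel files with Date and Value columns; the file paths are hypothetical):
import pandas as pd
from functools import reduce

files = ['Files/File1.xlsx', 'Files/File2.xlsx']  # hypothetical paths
dfs = [pd.read_excel(f).rename(columns={'Value': 'Value' + str(i)})
       for i, f in enumerate(files, start=1)]

# outer-merge all frames on Date, then average across the Value columns
combined = reduce(lambda a, b: a.merge(b, how='outer', on='Date'), dfs)
combined['Avg'] = combined.filter(like='Value').mean(axis=1)
print(combined.sort_values('Date'))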

One Hot Encoding a single column

I am trying to use a one-hot encoder on the target column ('Species') of the Iris dataset, but I am getting the following error:
ValueError: Expected 2D array, got 1D array instead:
Reshape your data either using array.reshape(-1, 1) if your data has a
single feature or array.reshape(1, -1) if it contains a single sample.
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
0 1 5.1 3.5 1.4 0.2 Iris-setosa
1 2 4.9 3.0 1.4 0.2 Iris-setosa
2 3 4.7 3.2 1.3 0.2 Iris-setosa
3 4 4.6 3.1 1.5 0.2 Iris-setosa
4 5 5.0 3.6 1.4 0.2 Iris-setosa
I googled the issue and found that most scikit-learn estimators need a 2D array rather than a 1D array.
At the same time, I also found that you can try passing the dataframe column by its index to encode a single column, but it didn't work:
onehotencoder = OneHotEncoder(categorical_features=[df.columns.tolist().index('pattern_id')])
X = dataset.iloc[:,1:5].values
y = dataset.iloc[:, 5].values
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder= LabelEncoder()
y = labelencoder.fit_transform(y)
onehotencoder = OneHotEncoder(categorical_features=[0])
y = onehotencoder.fit_transform(y)
I am trying to encode a single categorical column and split it into multiple columns (the way the encoding usually works).
ValueError: Expected 2D array, got 1D array instead: Reshape your
data either using array.reshape(-1, 1) if your data has a single
feature or array.reshape(1, -1) if it contains a single sample.
says that you need to reshape your 1D array into a 2D column vector.
You can do that as follows:
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import pandas as pd
import numpy as np
# load iris dataset
>>> iris = datasets.load_iris()
>>> iris = pd.DataFrame(data= np.c_[iris['data'], iris['target']], columns= iris['feature_names'] + ['target'])
>>> y = iris.target.values
>>> onehotencoder = OneHotEncoder(categories='auto')
>>> y = onehotencoder.fit_transform(y.reshape(-1,1))
# y will be a sparse matrix of dtype numpy.float64;
# if you want a plain array, call toarray():
>>> print(y.toarray())
[[1. 0. 0.]
[1. 0. 0.]
. . . .
[0. 0. 1.]
[0. 0. 1.]]
You can also use the pandas get_dummies function (https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html):
>>> pd.get_dummies(iris.target).head()
0.0 1.0 2.0
0 1 0 0
1 1 0 0
2 1 0 0
3 1 0 0
4 1 0 0
Hope that helps!
For your case, since it looks like you are using the kaggle dataset, I would just use
import pandas as pd
pd.get_dummies(df.Species).head()
Out[158]:
Iris-setosa Iris-versicolor Iris-virginica
0 1 0 0
1 1 0 0
2 1 0 0
3 1 0 0
4 1 0 0
Note that the default here encodes all the classes (3 species), whereas it is common to use just two and compare differences in the means to the baseline group (e.g. the default in R, or typically when doing regression/ANOVA); that can be accomplished with the drop_first argument.
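For example, dropping the baseline category leaves two indicator columns for the three species:
# drop_first=True drops the first category (here Iris-setosa) as the baseline
pd.get_dummies(df.Species, drop_first=True).head()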
I came across a similar situation and found the method below to work:
use double square brackets around the column name in the fit or fit_transform call.
one_hot_enc = OneHotEncoder()
arr = one_hot_enc.fit_transform(data[['column']])
# fit_transform returns a sparse matrix, so densify it before wrapping
df = pd.DataFrame(arr.toarray())
The fit_transform gives you a sparse matrix; converted with toarray(), it can be wrapped in a pandas DataFrame. You may append this to the original dataframe or assign it directly to existing columns.
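If you would rather skip the sparse-to-dense step, OneHotEncoder can return a dense array directly. A sketch, assuming scikit-learn 1.2 or later (older versions call the flag sparse instead of sparse_output):
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# sparse_output=False returns a dense ndarray (use sparse=False before 1.2)
one_hot_enc = OneHotEncoder(sparse_output=False)
arr = one_hot_enc.fit_transform(data[['column']])
df = pd.DataFrame(arr, columns=one_hot_enc.get_feature_names_out())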

Reading data with a varying-length header

I want to read, in Python, a file which contains a header of varying length, and then extract into a dataframe/series the variables which come after the header.
The data looks like:
....................................................................
Data coverage and measurement duty cycle:
When the instrument duty cycle is not in measure mode (i.e. in-flight
calibrations) the data is not given here (error flag = 2).
The measurements have been found to exhibit a strong sensitivity to cabin
pressure.
Consequently the instrument requires calibrated at each new cabin
pressure/altitude.
Data taken at cabin pressures for which no calibration was performed is
not given here (error flag = 2).
Measurement sensivity to large roll angles was also observed.
Data corresponding to roll angles greater than 10 degrees is not given
here (error flag = 2)
......................................................................
High Std: TBD ppb
Target Std: TBD ppb
Zero Std: 0 ppb
Mole fraction error flag description :
0 : Valid data
2 : Missing data
31636 0.69 0
31637 0.66 0
31638 0.62 0
31639 0.64 0
31640 0.71 0
.....
.....
So what I want is to extract the data as:
Time C2H6 Flag
0 31636 0.69 0 NaN
1 31637 0.66 0 NaN
2 31638 0.62 0 NaN
3 31639 0.64 0 NaN
4 31640 0.71 0 NaN
5 31641 0.79 0 NaN
6 31642 0.85 0 NaN
7 31643 0.81 0 NaN
8 31644 0.79 0 NaN
9 31645 0.85 0 NaN
I can do that with
infile="/nfs/potts.jasmin-north/scratch/earic/AEOG/data/mantildas_faam_20180911_r1_c118.na"
flightdata = pd.read_fwf(infile, skiprows=53, header=None, names=['Time', 'C2H6', 'Flag'],)
but I'm skipping approximately 53 rows because I counted how many I should skip. I have a bunch of these files, and some don't have exactly 53 header rows, so I was wondering: what is the best way to deal with this, and what criterion would let Python always read just the three columns of data wherever it finds them? For instance, suppose I want Python to start reading the data from where it encounters
Mole fraction error flag description :
0 : Valid data
2 : Missing data
What should I do? Is there another criterion that would work better?
You can split on the header delimiter, like so:
import io
import pandas as pd

with open(filename, 'r') as f:
    myfile = f.read()

# keep only what follows the last header delimiter
infile = myfile.split('Mole fraction error flag description :')[-1]
# drop the remaining flag-description lines; ' : ' marks them
# (you know the data better, there may be a safer indicator)
infile = '\n'.join([line for line in infile.split('\n') if ' : ' not in line])
# create the dataframe from the in-memory string
flightdata = pd.read_fwf(io.StringIO(infile), header=None,
                         names=['Time', 'C2H6', 'Flag'])
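Alternatively, you can let read_fwf read each file directly and compute skiprows per file by scanning for the last header line. A sketch, assuming every file ends its header with the '2 : Missing data' line:
import pandas as pd

def read_flight_file(path, marker='2 : Missing data'):
    # find the marker's line number; the data starts on the next line
    with open(path) as f:
        skip = next(n for n, line in enumerate(f, start=1) if marker in line)
    return pd.read_fwf(path, skiprows=skip, header=None,
                       names=['Time', 'C2H6', 'Flag'])

flightdata = read_flight_file(infile)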

Creating a matrix from a list of subjects with different genes present or absent with Python

I have a file with different subjects, each with a list of the genes present for that subject (one line per gene). I would like to restructure the data into a matrix with the subjects as rows and a column for every gene, with a 1 or 0 for present or absent. I have the original data as an Excel file, which I imported with pandas to try to do this in Python, but honestly I have no clue how to do it in a nice way.
[Image of how the data is structured and how it is supposed to be formatted.]
I really appreciate all the help I can get!
Many thanks already
If this is your original file:
Subject,Gene
subject1,gene1
subject1,gene2
subject1,gene3
subject2,gene1
subject2,gene4
subject3,gene2
subject3,gene4
subject3,gene5
Then you can do something like this with pd.crosstab:
>>> import pandas as pd
>>> df = pd.read_csv("genes.csv")
>>> pd.crosstab(df["Subject"], df["Gene"])
Gene gene1 gene2 gene3 gene4 gene5
Subject
subject1 1 1 1 0 0
subject2 1 0 0 1 0
subject3 0 1 0 1 1
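One caveat: pd.crosstab counts occurrences, so a subject listing the same gene twice would get a 2. Clipping the counts gives a strict presence/absence matrix:
presence = pd.crosstab(df["Subject"], df["Gene"]).clip(upper=1)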
Use pivot()
df['count'] = 1
df.pivot(index='Subject', columns='Gene', values='count')
Gene gene1 gene2 gene3 gene4 gene5
Subject
subject1 1.0 1.0 1.0 NaN NaN
subject2 1.0 NaN NaN 1.0 NaN
subject3 NaN 1.0 NaN 1.0 1.0
UPDATED -- Full example based on your comment
# import the pandas and numpy modules
import pandas as pd
import numpy as np
# read your Excel file
df = pd.read_excel(r'path\to\your\file\myFile.xlsx')
# create a new column called 'count' and set it to a value of 1
df['count'] = 1
# use pivot, replace the NaNs with 0, and assign the result to a new variable: df2
df2 = df.pivot(index='Subject', columns='Gene', values='count').replace(np.nan, 0)
# print your new dataframe
print(df2)
