I have a DataFrame in which the column 'team' needs to be encoded.
This is my code:
#Load the required libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
#Create dictionary
data = {'team': ['A', 'A', 'B', 'B', 'C'],
'Income': [5849, 4583, 3000, 2583, 6000],
'Coapplicant Income': [0, 1508, 0, 2358, 0],
'LoanAmount': [123, 128, 66, 120, 141]}
#Convert dictionary to dataframe
df = pd.DataFrame(data)
print("\n df",df)
# Initiate label encoder
le = LabelEncoder()
# return encoded label
label = le.fit_transform(df['team'])
# printing label
print("\n label =",label )
# removing the column 'team' from df
df.drop("team", axis=1, inplace=True)
# Appending the array to our dataFrame
df["team"] = label
# printing Dataframe
print("\n df",df)
After encoding, the labels start at 0 and the column 'team' is moved to the end of the DataFrame. However, I wish to ensure the following two things:
Encoding starts with 1, not 0.
The column 'team' should stay in its original position.
Can somebody please help me out with how to do this?
Don't drop the column; instead, increment the label on assignment:
le = LabelEncoder()
# return encoded label
label = le.fit_transform(df['team'])
# Replacing the column
df["team"] = label + 1
Output:
df
   team  Income  Coapplicant Income  LoanAmount
0     1    5849                   0         123
1     1    4583                1508         128
2     2    3000                   0          66
3     2    2583                2358         120
4     3    6000                   0         141
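As a side note, the same encoding can be produced without sklearn at all; a minimal sketch, assuming the original df from the question (before any encoding). cat.codes numbers the sorted categories starting from 0, matching LabelEncoder, so adding 1 starts the codes at 1, and assigning in place keeps the column where it was:
# Sketch: pandas-only alternative, assuming the original df with the 'team' column.
# Categories are sorted, so the codes match LabelEncoder's; +1 makes them start at 1.
df["team"] = df["team"].astype("category").cat.codes + 1
print(df)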
I have big data, as shown in the uploaded picture: it has 90 BAND-INDEX blocks, and each BAND-INDEX has 300 rows.
I want to search the text file for a specific value like -24.83271 and extract the BAND-INDEX containing that value as an array. Can you please write the code to do so? Thank you in advance.
I am unable to extract the specific BAND-INDEX in array form.
Try reading the file line by line and using a generator. Here is an example:
import csv
import pandas as pd
# generate and save demo csv
pd.DataFrame({
    'Band-Index': (0.01, 0.02, 0.03, 0.04, 0.05, 0.06),
    'value': (1, 2, 3, 4, 5, 6),
}).to_csv('example.csv', index=False)
def search_values_in_file(search_values: list):
    with open('example.csv') as csvfile:
        reader = csv.reader(csvfile)
        next(reader)  # skip the header row
        for row in reader:
            band_index, value = row
            if value in search_values:
                yield row
# get lines from csv where value in ['4', '6']
df = pd.DataFrame(list(search_values_in_file(['4', '6'])), columns=['Band-Index', 'value'])
print(df)
# Band-Index value
# 0 0.04 4
# 1 0.06 6
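If the real file is too large for a single read, a chunked pandas pass is an alternative to the generator above; a minimal sketch, assuming the same example.csv. Note that pandas parses the value column as integers, so the search values are ints here rather than the strings used with csv.reader:
# Sketch: filter a large CSV in fixed-size chunks; only one chunk is in memory at a time.
import pandas as pd
matches = []
for chunk in pd.read_csv('example.csv', chunksize=100_000):
    matches.append(chunk[chunk['value'].isin([4, 6])])  # keep matching rows only
df = pd.concat(matches, ignore_index=True)
print(df)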
I have a pandas DataFrame X_train containing a column 'state' with repeated state names. In another DataFrame, Y_train, I have class values 0 and 1.
In a dictionary variable Temp, I have the probability of each unique state belonging to class 0 and to class 1.
Now I want to replace every state name in X_train with its probability score for the corresponding class label in Y_train.
How do I do it?
Solution:
The data as you described:
import pandas as pd
X_train = pd.DataFrame([{'state': 'A'}, {'state': 'B'}, {'state': 'A'},{'state': 'A'}])
Y_train = pd.DataFrame([{'class': 1}, {'class': 0}, {'class': 1}, {'class': 1}])
Temp = {'A': {0: 0.75, 1: 0.25}, 'B': {0: 0.20, 1:0.8}}
Combine the two DataFrames column-wise using concat, like so:
combined = pd.concat([X_train, Y_train], axis=1)
where axis=1 means you want to combine by column.
Now run a double loop to assign the new values. The mask has to match both the state and the class, and the results are collected in a separate Series first so that already-replaced values are not matched again:
probs = combined['class'].astype(float)
for classname in combined['class'].unique():
    for state in combined['state'].unique():
        mask = (combined['state'] == state) & (combined['class'] == classname)
        probs.loc[mask] = Temp[state][classname]
combined['class'] = probs
You'll end up with combined looking like this:
state class
0 A 0.25
1 B 0.20
2 A 0.25
3 A 0.25
Then just split up your DataFrames again:
X_train = combined[['state']]
Y_train = combined[['class']]
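For larger frames, the double loop can be replaced with a single pass that looks up each (state, class) pair directly; a minimal sketch, assuming the same X_train, Y_train, and Temp as above:
# Sketch: one dictionary lookup per row instead of a double loop.
combined = pd.concat([X_train, Y_train], axis=1)
combined['class'] = [Temp[s][c] for s, c in zip(combined['state'], combined['class'])]
Because each row is touched exactly once, there is no risk of re-matching values that were already replaced.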
I have 3 big CSV files. I am trying to randomly extract some samples from the files without loading them into memory. I am doing this:
SITS = dd.read_csv("sits_train_0.csv", blocksize="512MB",
usecols=band_blue + ["samplefid"]).set_index("samplefid")
MASK = dd.read_csv("mask_train_0.csv", blocksize="512MB",
usecols=band_mask + ["samplefid"]).set_index("samplefid")
GP = dd.read_csv("sits_gp_train_0.csv", blocksize="512MB",
usecols=band_blue_gp + ["samplefid"]).set_index("samplefid")
# SITS = pd.read_csv("sits_train_0.csv",
# usecols=band_blue + ["samplefid"]).set_index("samplefid")
# MASK = pd.read_csv("mask_train_0.csv",
# usecols=band_mask + ["samplefid"]).set_index("samplefid")
# GP = pd.read_csv("sits_gp_train_0.csv",
# usecols=band_blue_gp + ["samplefid"]).set_index("samplefid")
np.random.seed(0)
NSAMPLES=100
samples = np.random.choice(MASK.index, size=NSAMPLES, replace=False)
s = SITS.loc[samples][band_blue].compute().values
m = MASK.loc[samples][band_mask].compute().values
sg = GP.loc[samples][band_blue_gp].compute().values
# s = SITS.loc[samples][band_blue].values
# m = MASK.loc[samples][band_mask].values
# sg = GP.loc[samples][band_blue_gp].values
I got strange results, so I compared against pandas with smaller files (see the commented code above), which gives correct results.
If I set blocksize to None, the results are fine, but everything is loaded into memory, so using dask is not useful in that case, and my CSVs are too big to fit in memory. My CSVs are written in random order, so I need to use the index to recover the same samples from the 3 CSVs.
I feel I'm missing something about dask, but I don't see what.
I'd recommend using sample:
In [16]: import pandas as pd
In [17]: import dask.dataframe as dd
In [18]: df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
...: 'num_wings': [2, 0, 0, 0],
...: 'num_specimen_seen': [10, 2, 1, 8]},
...: index=['falcon', 'dog', 'spider', 'fish'])
In [19]: ddf = dd.from_pandas(df, npartitions=2)
In [20]: ddf.sample??
In [21]: df.sample(frac=0.5, replace=True, random_state=1)
Out[21]:
num_legs num_wings num_specimen_seen
dog 4 0 2
fish 0 0 8
In [22]: ddf.sample(frac=0.5, replace=True, random_state=1)
Out[22]:
Dask DataFrame Structure:
num_legs num_wings num_specimen_seen
npartitions=2
dog int64 int64 int64
fish ... ... ...
spider ... ... ...
Dask Name: sample, 4 tasks
In [23]: ddf.sample(frac=0.5, replace=True, random_state=1).compute()
Out[23]:
num_legs num_wings num_specimen_seen
falcon 2 2 10
fish 0 0 8
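Two caveats worth noting: dask's sample accepts frac but not a fixed n, and sampling each file independently would select different rows. To recover the same samples from all three CSVs, one option is to sample the index of one frame once and reuse it for the others; a sketch, assuming the SITS, MASK, and GP frames from the question, all indexed by samplefid (the frac value here is arbitrary):
# Sketch: sample one frame, then align the other two on the sampled index.
sampled_ids = MASK.sample(frac=0.001, random_state=0).compute().index
s = SITS.loc[sampled_ids].compute()
m = MASK.loc[sampled_ids].compute()
sg = GP.loc[sampled_ids].compute()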
I have multiple files with a lot of data and 19 columns. I am trying to run multiple for-loops and set variables equal to the first column, the second column, etc., in the files.
import numpy as np
import glob
import pandas as pd
#
lat=np.zeros(90)
long=np.zeros(180)
indat=np.zeros(19)
#
file_in = glob.glob('filenames*.dat')
for a in range(140):
    for i in range(90):
        for j in range(180):
            df = pd.DataFrame()
            for f in file_in:
                cols = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18]  # there are nineteen columns
                indat = df.append(pd.read_csv(f, delimiter=r'\s+', header=None, usecols=cols, skiprows=4), ignore_index=True)
                lat[i] = indat[0]  # error here
                long[j] = indat[1]
                # updates some code here
                if i >= 70:
                    dens[a,j,i-70] = indat[2]
It gave me this error:
ValueError: setting an array element with a sequence.
Update:
indat has 19 columns; there are many files, but they all have the same format.
Sample indat
#columns
#0 1 2 3 ..... 19
-90 0 2e-12 #just some number
-90 2 3e-12 #just some number
-90 4 4e-12 #just some number
...
-90 360 1e-12 #just some number
-88 0 1e-11 #just some number
-88 2 2e-11 #just some number
-88 4 3e-11 #just some number
...
-88 360 4e-11 #just some number
...
90 0 2.5e-12 #just some number
90 2 3.5e-11 #just some number
90 4 4.5e-12 #just some number
...
90 360 1.5e-12 #just some number
EDIT: I cleaned the code up based on everyone's suggestions:
import numpy as np
import glob
import pandas as pd
file_in = glob.glob('filenames*.dat')
df = pd.DataFrame()
for f in file_in:
    cols = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18]
    indat = pd.read_csv(f, delimiter=r'\s+', header=None, usecols=cols, skiprows=4)
for a in range(140):
    for i in range(90):
        for j in range(180):
            lat[i] = indat[0]  # error here
            long[j] = indat[1]
            if i >= 70:
                dens[a,j,i-70] = indat[2]
You tried to assign a column (a pandas Series), indat[0], to an element of a NumPy vector, lat[i].
Also, what is the point of indat = np.zeros(19) when you override it with a DataFrame later?
What is the content of indat[0]?
This line of code
indat = df.append(pd.read_csv(f, delimiter=r'\s+', header=None, usecols=cols, skiprows=4), ignore_index=True)
is basically the same as
indat = pd.read_csv(f, delimiter=r'\s+', header=None, usecols=cols, skiprows=4)
because df never changes, i.e. it is always an empty DataFrame.
Since the content of indat is unknown, it's difficult to fix your code.
If you just want to make it run without an error, I suggest writing
lat[i] = indat[0].values[0]   # take the first value of the column
long[j] = indat[1].values[0]  # take the first value of the column
It's good to take some tutorials on NumPy and pandas, since they can be very confusing without some basic understanding.
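As a further simplification: if each file really contains a full latitude/longitude grid, as the sample suggests, the triple loop can be avoided entirely with a reshape. A sketch under assumed dimensions of 90 latitudes by 180 longitudes, with column 0 = latitude, column 1 = longitude, and column 2 = the value of interest:
# Sketch: read each file once and reshape into a (lat, long) grid.
# Assumes 90*180 rows per file, ordered by latitude, then longitude.
import glob
import pandas as pd
for f in glob.glob('filenames*.dat'):
    indat = pd.read_csv(f, delimiter=r'\s+', header=None, skiprows=4)
    lat = indat[0].values.reshape(90, 180)[:, 0]    # one latitude per block of rows
    long = indat[1].values.reshape(90, 180)[0, :]   # one longitude per column
    dens = indat[2].values.reshape(90, 180)         # the value on the lat/long grid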
I have a function for text preprocessing which simply removes stopwords:
def text_preprocessing():
    df['text'] = df['text'].apply(word_tokenize)
    df['text'] = df['text'].apply(lambda x: [item for item in x if item not in stopwords])
    new_array = []
    for keywords in df['text']:  # converts each list of words back into a string
        P = " ".join(str(x) for x in keywords)
        new_array.append(P)
    df['text'] = new_array
    return df['text']
I want to pass text_preprocessing() into another function, tf_idf(), which produces the feature matrix. This is essentially what I did:
def tf_idf():
    tfidf = TfidfVectorizer()
    feature_array = tfidf.fit_transform(text_preprocessing)
    keywords_data = pd.DataFrame(feature_array.toarray(), columns=tfidf.get_feature_names())
    return keywords_data
I got an error: TypeError: 'function' object is not iterable.
The error occurs because fit_transform receives the function object text_preprocessing itself rather than its return value, text_preprocessing(). But rather than building additional functions for stop-word removal, you can simply pass a custom list of stop words to TfidfVectorizer. As you can see in the example below, "test" is successfully excluded from the Tfidf vocabulary.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
# Setting up
numbers = np.random.randint(1, 5, 3)
text = ['This is a test.', 'Is this working?', "Let's see."]
df = pd.DataFrame({'text': text, 'numbers': numbers})
# Define custom stop words and instantiate TfidfVectorizer with them
my_stopwords = ['test'] # the list can be longer
tfidf = TfidfVectorizer(stop_words=my_stopwords)
text_tfidf = tfidf.fit_transform(df['text'])
# Optional - concatenating tfidf with df
df_tfidf = pd.DataFrame(text_tfidf.toarray(), columns=tfidf.get_feature_names())
df = pd.concat([df, df_tfidf], axis=1)
# Initial df
df
Out[133]:
numbers text
0 2 This is a test.
1 4 Is this working?
2 3 Let's see.
tfidf.vocabulary_
Out[134]: {'this': 3, 'is': 0, 'working': 4, 'let': 1, 'see': 2}
# Final df
df
Out[136]:
numbers text is let see this working
0 2 This is a test. 0.707107 0.000000 0.000000 0.707107 0.000000
1 4 Is this working? 0.517856 0.000000 0.000000 0.517856 0.680919
2 3 Let's see. 0.000000 0.707107 0.707107 0.000000 0.000000
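For completeness, the original TypeError can also be fixed directly: fit_transform needs the return value of the function, not the function object. A minimal sketch, assuming the text_preprocessing and tf_idf functions from the question:
def tf_idf():
    tfidf = TfidfVectorizer()
    # call text_preprocessing() so fit_transform receives the Series, not the function object
    feature_array = tfidf.fit_transform(text_preprocessing())
    keywords_data = pd.DataFrame(feature_array.toarray(), columns=tfidf.get_feature_names())
    return keywords_data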