Sklearn random forest - perform_grid_search gives back error: one or more test scores are non-finite - scikit-learn

Here is my code in sklearn for using a random forest. I have already manually filled the NAs with 0s, and I have gotten this code to run on the same data before, so it must be something fixable in the code rather than the data itself:
code:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

parameters = {'max_depth': [2, 3, 4, 5, 7],
              'n_estimators': [1, 10, 25, 50, 100, 256, 512],
              'random_state': [42]}

def perform_grid_search(X_data, y_data):
    """Function to perform a grid search."""
    rf = RandomForestClassifier(criterion='entropy')
    clf = GridSearchCV(rf, parameters, cv=4, scoring='roc_auc', n_jobs=3)
    clf.fit(X_data, y_data)
    print(clf.cv_results_['mean_test_score'])
    return clf.best_params_['n_estimators'], clf.best_params_['max_depth']
and when running:
#next function
n_estimator, depth = perform_grid_search(X_train, y_train)
c_random_state = 42
print(n_estimator, depth, c_random_state)
this error comes back:
model_selection/_search.py:922: UserWarning: One or more of the test scores are non-finite: [nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan] warnings.warn(
and ValueError: could not convert string to float: ''
Please let me know what is going on, as this has totally broken my workflow!
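The ValueError hints at the likely cause: some cells still contain empty strings ('') rather than real NaN, so fillna(0) never touched them, each CV fit fails, and GridSearchCV records that fold's score as nan. A quick check, assuming X_train is a pandas DataFrame (this snippet is illustrative, not from the original post):
import pandas as pd

# Empty strings survive fillna(0), which only replaces real NaN values.
# Count ''-cells per column, then coerce everything to numeric and re-fill.
print((X_train == '').sum().sort_values(ascending=False).head())
X_train = X_train.apply(pd.to_numeric, errors='coerce').fillna(0)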

Related

Trying to add content to an existing DataFrame

I've been trying for days to find a solution to my problem. I am trying to add content to a pre-existing column in a DataFrame, but when I print it, my program shows that only the first 100 lines are being modified: it never goes beyond line 100, and the items are overwritten rather than appended. I've tried several approaches and they always error out; the closest to what I want is the output I put in this post. If someone could help me, I would be very grateful.
import functions
import pandas as pd
from time import sleep
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

pd.set_option('max_rows', 10)

# Module which returns the page link
site = functions.pagina()

# Browser options
options = Options()
options.headless = True  # Prevent browser window from opening
driver = webdriver.Firefox(options=options)  # Include options in browser

# Open browser at the URL
driver.get(site)

# Wait in seconds for the page to load
print('Esperando 5s')
sleep(5)

# Create DataFrame
df = pd.DataFrame(columns=['rank', 'nome', 'classe', 'item_lvl', 'servidor', 'mortes_temporada'], index=None)

while True:
    # Get the 'next page' element
    next_page = driver.find_element_by_css_selector('#pagination-hook > nav > ul > li:nth-child(2) > a')
    # As long as the next-page element exists
    if next_page is not None:
        # URL of current page
        driver.get(driver.current_url)
        print(driver.current_url[-6:].upper())
        # Attempting to insert data into the existing rank column
        ranks = pd.Series([rank_page.text for rank_page in driver.find_elements_by_xpath('//div[4]//div[1]/table/tbody/tr/td[1]')])
        # MY PROBLEM: ADDING CONTENT TO A COLUMN
        df['rank'] =+ ranks
        # Click on Next
        driver.execute_script("arguments[0].click();", next_page)
        sleep(4)
    else:
        break

print('*' * 80)
print(df)
print('*' * 80)
Output (originally posted as an image):
Esperando 5s
PAGE=1
********************************************************************************
rank nome classe item_lvl servidor mortes_temporada
0 1 NaN NaN NaN NaN NaN
1 2 NaN NaN NaN NaN NaN
2 3 NaN NaN NaN NaN NaN
3 4 NaN NaN NaN NaN NaN
4 5 NaN NaN NaN NaN NaN
.. ... ... ... ... ... ...
95 96 NaN NaN NaN NaN NaN
96 97 NaN NaN NaN NaN NaN
97 98 NaN NaN NaN NaN NaN
98 99 NaN NaN NaN NaN NaN
99 100 NaN NaN NaN NaN NaN
[100 rows x 6 columns]
********************************************************************************
PAGE=2
********************************************************************************
rank nome classe item_lvl servidor mortes_temporada
0 101 NaN NaN NaN NaN NaN
1 102 NaN NaN NaN NaN NaN
2 103 NaN NaN NaN NaN NaN
3 104 NaN NaN NaN NaN NaN
4 105 NaN NaN NaN NaN NaN
.. ... ... ... ... ... ...
95 196 NaN NaN NaN NaN NaN
96 197 NaN NaN NaN NaN NaN
97 198 NaN NaN NaN NaN NaN
98 199 NaN NaN NaN NaN NaN
99 200 NaN NaN NaN NaN NaN
[100 rows x 6 columns]
********************************************************************************
When you add rows to a Pandas DataFrame, you have to include data for the rest of the columns of every row.
In your case, the fix is:
import numpy as np

df2 = pd.DataFrame({
    "rank": [rank_page.text for rank_page in driver.find_elements_by_xpath('//div[4]//div[1]/table/tbody/tr/td[1]')],
    "nome": np.nan,
    "classe": np.nan,
    "item_lvl": np.nan,
    "servidor": np.nan,
    "mortes_temporada": np.nan
})
df = df.append(df2)
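Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on current pandas the same append is spelled with pd.concat (ignore_index renumbers the rows so repeated pages don't collide):
# Same idea on pandas >= 2.0, where df.append no longer exists
df = pd.concat([df, df2], ignore_index=True)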

Merging rows in pandas DataFrame

I am writing a script to scrape a series of tables from a PDF into Python using tabula-py.
That part is fine; I do get the data. But the data comes out multi-line and is useless as-is.
I would like to merge each run of rows into the row where the first column (Tag) is not NaN.
I was about to put the whole thing in an iterator and do it manually, but I realize that pandas is a powerful tool; I just don't have the pandas vocabulary to search for the right one. Any help is much appreciated.
My Code
import tabula

filename = 'tags.pdf'
tagTableStart = 2   #784
tagTableEnd = 39    #822
tableHeadings = ['Tag', 'Item', 'Length', 'Description', 'Value']
pageRange = "%d-%d" % (tagTableStart, tagTableEnd)
print("Scanning pages %s" % pageRange)
# extract all the tables in that page range
tables = tabula.read_pdf(filename, pages=pageRange)
How the data is stored in the DataFrame (empty fields are NaN):
Tag   Item   Length   Description   Value
AA    Some   2        Very Very
      Text            Very long
                      Value
AB    More   4        Other Very    aaaa
      Text             Very long    bbbb
                      Value         cccc
How I want the data:
This is almost as it is displayed in the pdf (I couldn't figure out how to make text multi-line in the SO editor):
Tag   Item         Length   Description                    Value
AA    Some\nText   2        Very Very\nVery long\nValue
AB    More\nText   4        Other Very\nVery long\nValue   aaaa\nbbbb\ncccc
Actual sample output (obfuscated)
Tag Item Length Description Value
0 AA PYTHROM-PARTY-I 20 Some Current defined values are :
1 NaN NaN NaN texst Byte1:
2 NaN NaN NaN NaN C
3 NaN NaN NaN NaN DD
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN DD
6 NaN NaN NaN NaN DD
7 NaN NaN NaN NaN DD
8 NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN B :
10 NaN NaN NaN NaN JLSAFISFLIHAJSLIhdsflhdliugdyg89o7fgyfd
11 NaN NaN NaN NaN ISFLIHAJSLIhdsflhdliugdyg89o7fgyfd
12 NaN NaN NaN NaN upon ISFLIHAJSLIhdsflhdliugdyg89o7fgy
13 NaN NaN NaN NaN asdsadct on the dasdsaf the
14 NaN NaN NaN NaN actsdfion.
15 NaN NaN NaN NaN NaN
16 NaN NaN NaN NaN SLKJDBFDLFKJBDSFLIUFy7dfsdfiuojewv
17 NaN NaN NaN NaN csdfgfdgfd.
18 NaN NaN NaN NaN NaN
19 NaN NaN NaN NaN fgfdgdfgsdfgfdsgdfsgfdgfdsgsdfgfdg
20 BB PRESENT-AMOUNT-BOX 11 Lorem Ipsum NaN
21 CC SOME-OTHER-VALUE 1 sdlkfgsdsfsdf 1
22 NaN NaN NaN device NaN
23 NaN NaN NaN ueghkjfgdsfdskjfhgsdfsdfkjdshfgsfliuaew8979vfhsdf NaN
24 NaN NaN NaN dshf87hsdfe4ir8hod9 NaN
Create groups from the ID column (Tag), then join the rows within each group:
agg_func = dict(zip(df.columns, [lambda s: '\n'.join(s).strip()] * len(df.columns)))
out = df.fillna('').groupby(df['Tag'].ffill(), as_index=False).agg(agg_func)
Output:
>>> out
Tag Item Length Description Value
0 AA Some\nText 2 Very Very\nVery long\nValue
1 AB More\nText 4 Other Very\nVery long\nValue aaaa\nbbbb\ncccc
agg_func is equivalent to writing:
{'Tag': lambda s: '\n'.join(s).strip(),
 'Item': lambda s: '\n'.join(s).strip(),
 'Length': lambda s: '\n'.join(s).strip(),
 'Description': lambda s: '\n'.join(s).strip(),
 'Value': lambda s: '\n'.join(s).strip()}
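For reference, a minimal, self-contained run of this approach; the frame construction below is reconstructed from the sample table in the question, not taken from the original post:
import numpy as np
import pandas as pd

# Rebuild the sample table from the question (empty cells are NaN)
df = pd.DataFrame({
    'Tag':         ['AA', np.nan, np.nan, 'AB', np.nan, np.nan],
    'Item':        ['Some', 'Text', np.nan, 'More', 'Text', np.nan],
    'Length':      ['2', np.nan, np.nan, '4', np.nan, np.nan],
    'Description': ['Very Very', 'Very long', 'Value', 'Other Very', 'Very long', 'Value'],
    'Value':       [np.nan, np.nan, np.nan, 'aaaa', 'bbbb', 'cccc'],
})

# Forward-fill Tag to form group keys, then join each column's strings
agg_func = dict(zip(df.columns, [lambda s: '\n'.join(s).strip()] * len(df.columns)))
out = df.fillna('').groupby(df['Tag'].ffill(), as_index=False).agg(agg_func)
print(out)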

dropna remove all rows with valid values and only NA rows are left in pandas

I am trying to clean NA values in an open-source dataset.
I am using Python 3, Jupyter and pandas.
import urllib.request
import zipfile
import shutil
import pandas as pd

url = 'https://resources.lendingclub.com/LoanStats3c.csv.zip'
file_name = 'LoanStats3c.csv.zip'
with urllib.request.urlopen(url) as response, open(file_name, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)
with zipfile.ZipFile(file_name) as zf:
    zf.extractall()

loan = pd.read_csv('LoanStats3c.csv', skiprows=1, parse_dates=True, index_col='id')
loan.describe()

# remove all columns with all NAs
loan = loan.dropna(axis=1, how='all')
loan.describe()

# remove all rows with any NAs
loan = loan.dropna(axis=0)
loan.describe()
But the result shows every column with zero non-NA rows:
loan_amnt funded_amnt funded_amnt_inv installment annual_inc dti \
count 0.0 0.0 0.0 0.0 0.0 0.0
mean NaN NaN NaN NaN NaN NaN
std NaN NaN NaN NaN NaN NaN
min NaN NaN NaN NaN NaN NaN
25% NaN NaN NaN NaN NaN NaN
50% NaN NaN NaN NaN NaN NaN
75% NaN NaN NaN NaN NaN NaN
max NaN NaN NaN NaN NaN NaN
Why are all the rows with valid values gone, with only the all-NA columns left?
Thanks
When you use .dropna() like that, every occurrence of a NaN value triggers a deletion from the DataFrame.
loan.dropna(axis=1, how='all')
will delete the columns whose values are all NaN, while
loan.dropna(axis=0)
will delete the rows with at least one NaN value.
I looked at the file, and I'm pretty sure that every row has at least one NaN column.
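A tiny illustration of the difference, on made-up data (not from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 3], 'b': [np.nan, np.nan, 6]})
print(df.dropna(axis=1, how='all'))   # both columns survive: neither is all-NaN
print(df.dropna(axis=0))              # only row 2 survives: the others contain NaN
print(df.dropna(axis=0, thresh=1))    # rows 0 and 2 survive: each has >= 1 non-NaN value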
Finally, when you call .describe() on the now-empty DataFrame, the values shown are just descriptive statistics of that emptiness. If you want to see the real DF, use print(df), or in Jupyter just leave the variable at the end of the cell:
# some code
# some code
# some code
variable = pd.DataFrame([])
# print(variable)
variable
That would show you the value of the variable.

Empty columns when inserting into a df from another df

I tried to add columns from df5 to df_prog, but for some reason they remain empty. I do not understand what I'm doing wrong. Code:
df5['Kol1_1Y']
223520 14.0
223521 65.0
223522 13.0
223523 39.0
223524 13.0
223525 3.0
223526 10.0
223527 19.0
223528 16.0
223529 29.0
Name: Kol1_1Y, dtype: float64
df_prog['Kol1_1Y'] = df5['Kol1_1Y']
df_prog['Kol2_1Y'] = df5['Kol2_1Y']
df_prog['Kol1_3M'] = df5['Kol1_3M']
df_prog['Kol2_3M'] = df5['Kol2_3M']
df_prog.to_excel(r"C:\python\progGB.xlsx")
df_prog
0 RESPR PREVPR Kol1_1Y Kol2_1Y Kol1_3M Kol2_3M
0 0.4944 0.4944 1.4894 NaN NaN NaN NaN
1 0.7073 0.7073 3.2020 NaN NaN NaN NaN
2 0.3965 0.3965 -0.3989 NaN NaN NaN NaN
3 0.4501 0.4501 -0.1826 NaN NaN NaN NaN
4 0.0271 0.0271 -6.1202 NaN NaN NaN NaN
5 0.2488 0.2488 -2.8447 NaN NaN NaN NaN
6 0.5190 0.5190 0.0176 NaN NaN NaN NaN
7 0.6667 0.6667 2.2334 NaN NaN NaN NaN
8 0.7708 0.7708 4.5216 NaN NaN NaN NaN
9 0.7074 0.7074 2.9906 NaN NaN NaN NaN
Pandas assignment aligns on both index and columns. In your case the columns match, but the indexes are different (df5's index starts at 223520 while df_prog's starts at 0), so pandas assigns all NaN. To ignore index and column alignment, assign from the underlying numpy ndarray:
df_prog['Kol1_1Y'] = df5['Kol1_1Y'].values
df_prog['Kol2_1Y'] = df5['Kol2_1Y'].values
df_prog['Kol1_3M'] = df5['Kol1_3M'].values
df_prog['Kol2_3M'] = df5['Kol2_3M'].values
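An equivalent alternative, assuming df_prog uses the default RangeIndex (as the printed output suggests), is to discard df5's index before assigning:
# Drop df5's index so the values align positionally with df_prog's 0..n-1 index
df_prog['Kol1_1Y'] = df5['Kol1_1Y'].reset_index(drop=True)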

NaN output from Dense layer

I am trying to train a simple neural net on a few instances that have many features. I have 8 instances to train on and 3 to test on (I know this is very little, but it makes sense for my application). Each instance has 4212 features, floats ranging anywhere from 0 into the 1000s. When I feed these tensors into my first dense layer, the layer outputs all NaN values.
I have tried changing the loss function/optimizer/accuracy/activation functions and have not been able to get anything but NaN output from my first dense layer.
Here is the code that builds and fits the model. I have attached the output that shows tensor input and output, as well as what my data looks like.
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(50, input_dim=4212, activation='relu'))
model.add(Dense(25, activation='relu'))
model.add(Dense(1, activation='softmax'))
model.compile(loss='mse', optimizer='rmsprop', metrics=['accuracy'])
model.fit(X_train, Y_train, epochs=150, batch_size=BATCH_SIZE)
And here is what my data looks like, as well as the outputs from each tensor:
Training Data
[[ 0.00e+00 0.00e+00 0.00e+00 ... 1.43e+00 -1.24e-01 -7.60e-01]
[ 1.27e+02 2.00e+00 2.00e+00 ... 2.40e-01 -1.90e-02 -3.90e-01]
[ 2.80e+02 4.00e+00 4.00e+00 ... 1.29e+00 1.65e-01 1.62e+00]
[ 4.21e+02 4.00e+00 4.00e+00 ... 7.70e-01 9.00e-03 2.10e-01]
[ 5.81e+02 5.00e+00 5.00e+00 ... 9.90e-01 6.40e-02 5.30e-01]]
[50. 90. 92. 71. 67.]
Validation Data
[[ 1.276e+03 1.000e+00 1.000e+00 ... 6.700e-01 -3.600e-02 -4.100e-01]
[ 0.000e+00 0.000e+00 0.000e+00 ... 1.000e-02 3.000e-03 4.700e-01]
[ 0.000e+00 0.000e+00 0.000e+00 ... 1.000e+00 -2.500e-02 -3.400e-01]]
[54. 2. 3.]
Tensors input:
[[ 1.276e+03 1.000e+00 1.000e+00 ... 6.700e-01 -3.600e-02 -4.100e-01]
[ 0.000e+00 0.000e+00 0.000e+00 ... 1.000e-02 3.000e-03 4.700e-01]
[ 0.000e+00 0.000e+00 0.000e+00 ... 1.000e+00 -2.500e-02 -3.400e-01]]
[[nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan nan nan nan nan nan nan]
[nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan nan nan nan nan nan nan]
[nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan nan nan nan nan nan nan]]
[[nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan]
[nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan]
[nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan]]
I am trying to use vectors that represent the stats for a player in the MLB for one season to predict the number of runs they score the next season. I expect poor accuracy for now, but I intend to add LSTM layers and train on more players' careers later.
Here is more code showing how I am obtaining the tensor outputs:
intermediate_layer_model = Model(inputs=model.input,
                                 outputs=model.layers[0].input)
intermediate_output = intermediate_layer_model.predict(V_data)
print(intermediate_output)

intermediate_layer_model = Model(inputs=model.input,
                                 outputs=model.layers[1].input)
intermediate_output = intermediate_layer_model.predict(V_data)
print(intermediate_output)

intermediate_layer_model = Model(inputs=model.input,
                                 outputs=model.layers[2].input)
intermediate_output = intermediate_layer_model.predict(V_data)
print(intermediate_output)

print(model.summary())
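No answer is shown, but with raw features spanning 0 into the 1000s, a standard first step is to scale the inputs before training, since unscaled magnitudes like these can blow up the activations and the mse loss into NaN. A sketch, assuming X_train and X_test are numpy arrays (the scaler setup is mine, not from the original post):
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training features only, then apply it to both splits,
# so each of the 4212 features has roughly zero mean and unit variance.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Separately, Dense(1, activation='softmax') always outputs 1.0, since softmax over a single value is constant; for a regression target like runs scored, a linear output layer is the usual choice.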
