I can't seem to find the answer to my question so I'm trying my luck on here. Would very much appreciate your help.
I've got a Pandas dataframe with values in Col1 and Col2. Instead of the np.nan values in Col2, I'd like to calculate the following: today's Col2 value = previous day's Col2 value multiplied by today's Col1 value.
This calls for some form of recursive calculation. I've tried several approaches, including the for loop below, but none seem to work:
df = pd.read_excel('/Users/fhggshgf/Desktop/test.xlsx')
df.index = df.date
df.drop(['date'], axis=1, inplace=True)

for i in range(1, len(df)):
    fill_value = df['Col2'].iloc[i - 1]
    finaldf['Col2'].fillna(fill_value, inplace=True)
You could try something like this.
import pandas as pd
import numpy as np

df = pd.DataFrame({'date': [1, 2, 3, 4, 5, 6],
                   'col_1': [951, 909, 867, 844, 824, 826],
                   'col_2': [179, 170, 164, 159, 153, 149]})

col_2_update_list = []
for i, row in df.iterrows():
    if i != 0:
        today_col_1 = df.at[i, 'col_1']
        prev_day_col_2 = df.at[i - 1, 'col_2']
        new_col_2_val = prev_day_col_2 * today_col_1
        col_2_update_list.append(new_col_2_val)
    else:
        col_2_update_list.append(np.nan)
df['updated_col_2'] = col_2_update_list
This avoids the use of loops but you need to create 2 new columns:
import pandas as pd
import numpy as np
import sys

df = pd.DataFrame({'date': [1, 2, 3, 4, 5, 6],
                   'col_1': [951, 909, 867, 844, 824, 826],
                   'col_2': [179, np.nan, 164, 159, np.nan, 149]})
print(df)

# Compare the two candidate columns and keep the smaller value per row
df['col_4'] = df['col_2'].ffill() * df['col_1']
df['col_3'] = df['col_2'].fillna(sys.maxsize)
df['col_2'] = df[['col_4', 'col_3']].min(axis=1).astype(int)
df = df.drop(['col_4', 'col_3'], axis=1)
print(df)
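If two or more NaNs can occur in a row, the previous day's value may itself be a freshly computed one, so the fill has to walk the rows in order. A minimal single-pass sketch over the same sample data (variable names follow the snippet above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': [1, 2, 3, 4, 5, 6],
                   'col_1': [951, 909, 867, 844, 824, 826],
                   'col_2': [179, np.nan, 164, 159, np.nan, 149]})

# Walk the rows once: each NaN becomes previous col_2 * current col_1,
# so runs of consecutive NaNs compound correctly.
vals = df['col_2'].to_numpy(dtype=float)
for i in range(1, len(vals)):
    if np.isnan(vals[i]):
        vals[i] = vals[i - 1] * df['col_1'].iloc[i]
df['col_2'] = vals
```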
I'm trying to import data into a pandas dataframe. The file type is 'csv' but the data in the file is structured as a python list. The below code is only returning the column headers. Any suggestions? What am I doing wrong?
import pandas as pd
data_path = pd.read_csv(r'C:\Users\john_smith\file_name.csv')
df = pd.DataFrame(data_path, columns=["article_id","author_id","viewer_id","view_date"])
df
An example of the data in the file is below. There aren't any headers in the file.
[[126,17,62,"2019-07-02"],[149,42,22,"2019-06-23"],[138,39,33,"2019-07-26"]]
What is returned is just the empty column headers, with no rows.
It's really not clear, but if you have a file that literally looks like:
file.csv
[[126,17,62,"2019-07-02"],[149,42,22,"2019-06-23"],[138,39,33,"2019-07-26"]]
We can attempt to read that with ast.literal_eval
from ast import literal_eval

with open('file.csv') as f:
    data = literal_eval(f.read())

print(data)
print(type(data))
# Output:
[[126, 17, 62, '2019-07-02'], [149, 42, 22, '2019-06-23'], [138, 39, 33, '2019-07-26']]
<class 'list'>
Now we can work with pandas:
df = pd.DataFrame(data, columns=["article_id","author_id","viewer_id","view_date"])
print(df)
# Output:
article_id author_id viewer_id view_date
0 126 17 62 2019-07-02
1 149 42 22 2019-06-23
2 138 39 33 2019-07-26
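Since a list-of-lists file body is also valid JSON, another option (a sketch; StringIO stands in for the real file path here) is to let pandas parse it directly:

```python
from io import StringIO
import pandas as pd

# Sample row data copied from the question; in practice pass the file path
# to pd.read_json instead of StringIO.
raw = '[[126,17,62,"2019-07-02"],[149,42,22,"2019-06-23"],[138,39,33,"2019-07-26"]]'

# The bracketed list-of-lists is valid JSON, so read_json builds the rows for us
df = pd.read_json(StringIO(raw))
df.columns = ["article_id", "author_id", "viewer_id", "view_date"]
```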
You can use this:

import pandas as pd

path_file = r'C:\Users\john_smith\file_name.csv'
df = pd.read_csv(path_file, delimiter=';', names=['data'])

# Strip the outer brackets, split on "],[", then explode into one row per record
df['data'] = df['data'][0][1:-1]
df = df.assign(**{'data': df['data'].str.split(r'\],\[')})
out = df.explode('data').replace({r'\[': '', r'\]': '', '"': ''}, regex=True)
out = (out['data'].str.split(',', expand=True)
       .rename(columns={0: 'article_id', 1: 'author_id', 2: 'viewer_id', 3: 'view_date'})
       .reset_index(drop=True))
print(out)
I have a dataset Data.csv
Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes
I tried to fill nan values using sklearn.impute.SimpleImputer by using following code
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
# Taking care of missing data
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = 'NaN', strategy = 'mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
But I get a error which says:
File "C:\Users\Krishna Rohith\Machine Learning A-Z\Part 1 - Data Preprocessing\Section 2 ----------- --------- Part 1 - Data Preprocessing --------------------\missing_data.py", line 16, in <module>
imputer = imputer.fit(X[:, 1:3])
File "C:\Users\Krishna Rohith\Anaconda3\lib\site-packages\sklearn\impute\_base.py", line 268, in fit
X = self._validate_input(X)
File "C:\Users\Krishna Rohith\Anaconda3\lib\site-packages\sklearn\impute\_base.py", line 242, in _validate_input
raise ve
File "C:\Users\Krishna Rohith\Anaconda3\lib\site-packages\sklearn\impute\_base.py", line 235, in _validate_input
force_all_finite=force_all_finite, copy=self.copy)
File "C:\Users\Krishna Rohith\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 562, in check_array
allow_nan=force_all_finite == 'allow-nan')
File "C:\Users\Krishna Rohith\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 60, in _assert_all_finite
msg_dtype if msg_dtype is not None else X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
I know how to do it numpy but can someone please tell me using sklearn.impute?
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

Replace the string 'NaN' with NumPy's missing-value marker np.nan: SimpleImputer matches missing_values against the actual cell values, and the cells hold float NaN, not the text 'NaN'.
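Put together, a minimal runnable sketch (a small frame stands in for the Age/Salary columns of Data.csv):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for the numeric columns of the question's dataset
df = pd.DataFrame({'Age': [44.0, 27.0, np.nan, 38.0],
                   'Salary': [72000.0, 48000.0, 54000.0, np.nan]})

# missing_values must be the actual sentinel np.nan, not the string 'NaN'
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
df[['Age', 'Salary']] = imputer.fit_transform(df[['Age', 'Salary']])
```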
I have a table in Excel with one column that I want to read into the list:
At first I tried it like this:
>>> df = pandas.read_excel('emails.xlsx', sheet_name=None)
>>> df
OrderedDict([('Sheet1', Chadisayed#gmx.com
0 wonderct#mail.ru
1 fcl#fcl-bd.com
2 galina#dorax-investments.com
>>> for k, v in df.items():
... print(type(v), v)
...
<class 'pandas.core.frame.DataFrame'> Chadisayed#gmx.com
0 wonderct#mail.ru
1 fcl#fcl-bd.com
2 galina#dorax-investments.com
>>> df = df.items()[0]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'odict_items' object is not subscriptable
I tried it differently:
>>> df = pandas.read_excel('emails.xlsx', index_col=0)
>>> df
Empty DataFrame
Columns: []
Index: [wonderct#mail.ru, fcl#fcl-bd.com, galina#dorax-investments.com]
[419 rows x 0 columns]
>>> foo = []
>>> for i in df.index:
... foo.append(i)
...
>>> foo
['wonderct#mail.ru', 'fcl#fcl-bd.com', 'galina#dorax-investments.com']
It almost worked, but the first element is missing. What else can I do? Is there really no way to read the Excel file simply line by line?
Try this (header=None stops the first email from being consumed as the column header):

df = pd.read_excel('emails.xlsx', header=None)
target_list = list(df[0].values)
Use:
target_list = pandas.read_excel('emails.xlsx', index_col=None, names=['A'])['A'].tolist()
I get a keyerror for "Displacement" when I try to plot Force against Displacement with pandas for these group of dataframes. Please help.
The link to the excel sheet being used:
https://www.dropbox.com/s/f8lnp973ojv3ish/neurospheress.xlsx?dl=0
I tried clearing any space in the column titles but it doesn't work
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_excel('neurospheress.xlsx', sep='\s*,\s*', sheet_name = 'LS')
df1 = data.iloc[:80,:2]
df2 = data.iloc[:80,2:4]
df3 = data.iloc[:80,4:]
dfs = [df1,df2,df3]
for i, df in enumerate(dfs):
    plt.plot(df['Displacement'], df['Force'], linestyle='--', alpha=0.8, label='df{}'.format(i))
plt.legend(loc='best')
plt.show()
The solution below works; it adds two things to yours:
a) it skips the first row of the Excel sheet
b) it renames the columns of df2 and df3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_excel('neurospheress.xlsx', sep='\s*,\s*', sheet_name = 'LS',skiprows=1)
df1 = data.iloc[:80,:2]
df2 = data.iloc[:80,2:4]
df3 = data.iloc[:80,4:]
dfs = [df1,df2,df3]
df2.rename(columns={'Force.1':'Force','Displacement.1':'Displacement'},inplace=True)
df3.rename(columns={'Force.2':'Force','Displacement.2':'Displacement'},inplace=True)
print(data.columns)
print(df1.columns)
print(df2.columns)
for i, df in enumerate(dfs):
    plt.plot(df['Displacement'], df['Force'], linestyle='--', alpha=0.8, label='df{}'.format(i))
plt.legend(loc='best')
plt.show()
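The renaming step generalizes: pandas suffixes repeated headers as 'Force.1', 'Force.2', etc., so stripping the '.N' suffix restores a common name for every slice. A sketch with dummy data standing in for the spreadsheet:

```python
import pandas as pd

# One row of dummy data under the duplicated-and-suffixed headers pandas produces
data = pd.DataFrame([[1, 2, 3, 4, 5, 6]],
                    columns=['Force', 'Displacement', 'Force.1',
                             'Displacement.1', 'Force.2', 'Displacement.2'])

dfs = [data.iloc[:, :2], data.iloc[:, 2:4], data.iloc[:, 4:]]
# Drop the ".N" suffix so every slice exposes plain 'Force'/'Displacement'
dfs = [d.rename(columns=lambda c: c.split('.')[0]) for d in dfs]
```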
I am not able to understand why this error is there in the code. Yesterday it was working absolutely fine and I was getting the correct result.
for j in range(column):  # column is the number of columns in the dataframe 'traindata'
    if np.all(traindata.iloc[:, j] == 0):  # compare all values in a column to 0
        traindata = traindata.drop(traindata.columns[j], axis=1, inplace=True)
print(traindata.shape)
The error is:
AttributeError: 'NoneType' object has no attribute 'iloc'.
I tried to look for what the issue is, but haven't found the right answer
The problem is from traindata = traindata.drop(traindata.columns[j], axis=1, inplace=True). With inplace=True, drop returns None, so the assignment rebinds traindata to None. You can check this by adding print(traindata) right after that line.
You can change it to:

traindata.drop(traindata.columns[j], axis=1, inplace=True)

However, you may then hit a new error (IndexError: single positional indexer is out-of-bounds), because you keep dropping columns while iterating over the original column count.
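Both problems disappear if the all-zero columns are dropped in one vectorized step instead of one at a time (a sketch on toy data, not the questioner's CSV):

```python
import pandas as pd

# Toy frame: column 'a' is entirely zero and should be removed
traindata = pd.DataFrame({'a': [0, 0, 0], 'b': [1, 2, 3], 'c': [0, 5, 0]})

# Keep only columns with at least one non-zero value; no loop and no inplace,
# so neither the NoneType error nor the out-of-bounds error can occur.
traindata = traindata.loc[:, (traindata != 0).any(axis=0)]
```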
The problem is in this line of code:

traindata = traindata.drop(traindata.columns[j], axis=1, inplace=True)

With inplace=True, drop modifies the dataframe and returns None, so assigning the result back changes the type of traindata. You should not assign the result, because the drop has already been applied to the dataframe. Replace that line with:

traindata.drop(traindata.columns[j], axis=1, inplace=True)
Here is the complete code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy as sp
from sklearn.model_selection import train_test_split
from sklearn.model_selection import learning_curve
from sklearn.model_selection import validation_curve
from sklearn.model_selection import cross_val_score
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
traindata = pd.read_csv('santandertrain.csv')
traindata_copy = traindata.copy()
traindata = pd.DataFrame(traindata)
print(traindata.shape)
[row, column] = traindata.shape
# testdata = pd.read_csv('santandertest.csv')
# testdata_copy = testdata.copy()
# print(testdata.shape) # (49342, 4992)
# print(traindata.describe().T)
# print(traindata.isnull().sum()) There are no NA values
for j in range(column):
    if np.all((traindata.iloc[:, j]) == 0):
        traindata = traindata.drop(traindata.columns[j], axis=1, inplace=True)
print(traindata.shape)
When I run this code below, .iloc works perfect with training data
print(traindata.iloc[1:10, 0:5])
I had a similar problem with the following in a Jupyter notebook:
df = df.drop(['B1','C1','D2','E1','G1','H1','I1','K1'], axis=1, inplace=True)
df = df.replace(r'^\s*$', np.nan, regex=True)
In my case I got the error:
AttributeError: 'NoneType' object has no attribute 'replace'
The solution that worked for me was to stop combining inplace=True with assignment: either write df = df.drop(...) without inplace=True, or use inplace=True and do not assign the expression back to df.
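A tiny illustration of why mixing the two loses the frame:

```python
import pandas as pd

df = pd.DataFrame({'A': [1], 'B': [2]})

# With inplace=True, drop mutates df and returns None;
# assigning that return value back would replace df with None.
result = df.drop(['B'], axis=1, inplace=True)
```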
My solution without an error is:
df = df.drop(['B1','C1','D2','E1','G1','H1','I1','K1'], axis=1)
For your case, change it to:

traindata = traindata.drop(traindata.columns[j], axis=1)