I am trying to change all negative values to 0 in Excel files.
However, pandas seems to skip the first row.
Please help me prevent this. Thank you.
Here is the code:
# importing pandas module
import pandas as pd
import numpy as np
import csv
from pandas import DataFrame
df = pd.read_csv("FAPI-N2-rere_2D_modified.csv")
df[df < 0 ] = 0
df.to_csv('FAPI-N2-rere_2D_modified2.csv')
==========================================================
I then tried adding header=None to the code above:
# importing pandas module
import pandas as pd
import numpy as np
import csv
from pandas import DataFrame
df = pd.read_csv("FAPI-N2-rere_2D_modified.csv", header=None)
df[df < 0 ] = 0
df.to_csv('FAPI-N2-rere_2D_modified2.csv')
However, I keep getting this TypeError:
TypeError: '<' not supported between instances of 'str' and 'int'
I would really appreciate it if anyone could help me.
Thank you so much!
Some columns of your dataframe contain strings.
If their content is numeric, you can try converting them to a numeric dtype. How to do that is explained here:
Change column type in pandas
If you have columns containing strings, you need to replace your negative values only on numeric columns.
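A minimal sketch of that last suggestion, using a made-up three-column table in place of the real file (the column names `name`, `a` and `b` are assumptions for illustration). By default `read_csv` uses the first row as column names rather than skipping it. With `header=None` that row becomes data, so every column that has a text header turns into a string column, which would explain the TypeError.

```python
from io import StringIO

import pandas as pd

# Made-up stand-in for the CSV file in the question.
csv_text = "name,a,b\nx,-1,2.5\ny,3,-4.0\n"

# Default header handling: the first row supplies the column names.
df = pd.read_csv(StringIO(csv_text))

# Pick only the numeric columns and clamp their negative values to 0.
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].clip(lower=0)

# index=False keeps the row index out of the written file.
out = df.to_csv(index=False)
print(out)
```

The string column `name` is left untouched, and the header row is written back unchanged.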
Related
I want to read a CSV file (filled in by temperature sensors) with Python 3.
Reading the CSV file into an array works fine, but printing a single cell by index fails. What is the right line of code?
This is the code:
import sys
import pandas as pd
import numpy as np
import array as ar
#Reading the CSV File
# date;seconds;Time;Aussen-Temp;Ruecklauf;Kessel;Vorlauf;Diff
# 20211019;0;20211019;12,9;24;22,1;24,8;0,800000000000001
# ...
# ... (2800 rows in total)
np = pd.read_csv('/var/log/LM92Temperature_U-20211019.csv',
                 header=0,
                 sep=";",
                 usecols=['date','seconds','Time','Aussen-Temp','Ruecklauf','Kessel','Vorlauf','Diff'])
br = np # works fine
print (br) # works fine - prints whole CSV Table :-) !
#-----------------------------------------------------------
# Now I want to print the element [2] [3] of the two dimensional "CSV" array ... How to manage that ?
print (br [2] [3]) # ... ends up with an error ...
# what is the correct coding needed now, please?
Thanks in advance & Regards
Give the name of the column, not the index:
print(br['Time'][3])
As an aside, you can read your data with only the following, and you may want decimal=',' as well:
import pandas as pd
br = pd.read_csv('/var/log/LM92Temperature_U-20211019.csv', sep=';', decimal=',')
print(br)
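For completeness, here is a hedged sketch of the two usual ways to address a single cell. The sample rows are made up; only the column layout comes from the question. `br[2]` fails because plain brackets look up a column labelled 2, which does not exist here. `.loc` takes a row label plus a column name, and `.iloc` takes pure integer positions.

```python
from io import StringIO

import pandas as pd

# Made-up rows in the same layout as the sensor file in the question.
sample = StringIO(
    "date;seconds;Time;Aussen-Temp;Ruecklauf;Kessel;Vorlauf;Diff\n"
    "20211019;0;20211019;12,9;24;22,1;24,8;0,8\n"
    "20211019;60;20211019;12,8;24;22,0;24,7;0,7\n"
    "20211019;120;20211019;12,7;23,9;21,9;24,6;0,7\n"
)

# decimal="," makes pandas parse "12,9" as the float 12.9.
br = pd.read_csv(sample, sep=";", decimal=",")

# Row label 2, column name "Aussen-Temp".
by_label = br.loc[2, "Aussen-Temp"]

# Row position 2, column position 3 (the same cell, addressed by integers).
by_position = br.iloc[2, 3]

print(by_label, by_position)
```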
I don't know why na_values is not converting the "$-" values to NaN. I entered the $- manually in the file, and there are no spaces.
import pandas as pd
df=pd.read_csv('discounted_products.csv',na_values = ['$-'])
df.head()
Please help here.
This could be because some pandas string methods treat patterns as regular expressions by default, although the read_csv documentation does not mention that for na_values. Try a raw string:
na_values = [r'$-']
Update
It worked fine for me
from io import StringIO
import pandas as pd
df = pd.read_csv(StringIO(
'''a,b
test,$-'''), na_values=[r'$-'])
print(df)
      a   b
0  test NaN
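A small check, using made-up data, suggests the raw-string prefix is not what matters here: `r'$-'` and `'$-'` are the same Python string because there is no backslash in it. If the real file still shows `$-`, the cell probably differs from that exact text (this is an assumption, not something the question confirms). Reading everything as strings and printing the `repr` of a cell is one way to see what it really contains.

```python
from io import StringIO

import pandas as pd

# The r prefix changes nothing when the literal has no backslashes.
same_string = (r"$-" == "$-")

# Made-up stand-in for discounted_products.csv.
csv_text = "product,price\nwidget,$-\ngadget,$5.00\n"

# A plain string in na_values is enough to turn "$-" into NaN.
df = pd.read_csv(StringIO(csv_text), na_values=["$-"])
print(df)

# Debugging aid: read every cell as text with no NaN conversion,
# then inspect the exact characters of a suspicious cell.
raw = pd.read_csv(StringIO(csv_text), dtype=str, keep_default_na=False)
print(repr(raw.loc[0, "price"]))
```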
I am trying to run the script below to add two columns to the left of a file; however, it keeps giving me
ValueError: header must be integer or list of integers
Below is my code:
import pandas as pd
import numpy as np
read_file = pd.read_csv("/home/ex.csv",header='true')
df=pd.DataFrame(read_file)
def add_col(x):
    df.insert(loc=0, column='Creation_DT', value=pd.to_datetime('today'))
    df.insert(loc=1, column='Creation_By', value="Sean")
    df.to_parquet("/home/sample.parquet")
add_col(df)
Also, is there any way to make the Creation_DT column a string?
According to the pandas docs, header is the row number(s) to use as the column names and the start of the data, and it must be an int or a list of ints. So pass header=0 to read_csv instead of header='true'.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
Also, read_csv already returns a DataFrame, so you don't need to wrap the result in pd.DataFrame. Just use
df = pd.read_csv("/home/ex.csv", header=0)
You can try:
import pandas as pd
import numpy as np
read_file = pd.read_csv("/home/ex.csv")
df=pd.DataFrame(read_file)
def add_col(x):
    df.insert(loc=0, column='Creation_DT', value=str(pd.to_datetime('today')))
    df.insert(loc=1, column='Creation_By', value="Sean")
    df.to_parquet("/home/sample.parquet")
add_col(df)
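Putting both answers together, a self-contained sketch might look like the following. The in-memory CSV text stands in for /home/ex.csv, and the helper name `add_cols` and its `created_by` parameter are made up for illustration. `strftime` is used instead of `str(...)` only to control the exact text format, and the parquet write is left out so the snippet runs without extra dependencies.

```python
from io import StringIO

import pandas as pd

# Made-up stand-in for /home/ex.csv.
csv_text = "id,value\n1,10\n2,20\n"

# header=0 (an integer) tells pandas the first row holds the column names.
df = pd.read_csv(StringIO(csv_text), header=0)


def add_cols(frame, created_by):
    """Insert the two audit columns at the left of the given frame."""
    # Format the timestamp as text so the column holds strings.
    stamp = pd.Timestamp.today().strftime("%Y-%m-%d %H:%M:%S")
    frame.insert(loc=0, column="Creation_DT", value=stamp)
    frame.insert(loc=1, column="Creation_By", value=created_by)
    return frame


df = add_cols(df, "Sean")
print(df)
```

Operating on the function argument rather than on a global `df` makes the helper reusable for other frames.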
I am trying to calculate the number of days between two dates, and I get this error:
TypeError: ("Iterator operand 0 dtype could not be cast from dtype('<M8[us]') to dtype('<M8[D]') according to the rule 'safe'", 'occurred at index 0')
The REASSIGN_DATE column is blank in a few rows, so I fill those blanks with the value from another column, SETUP_DATE.
Then I try to calculate the number of days between the two dates using the code below.
import pandas as pd
import numpy as np
import datetime
from datetime import date
from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta
import xlrd
import workdays
import defusedxml
from xlrd import open_workbook
defusedxml.defuse_stdlib()
from defusedxml import EntitiesForbidden  # needed by the except clause below
def secure_open_workbook(**kwargs):
    try:
        return open_workbook(**kwargs)
    except EntitiesForbidden:
        raise ValueError('Please use a xlsx file without XEE')
#loading Raw Data
releases = pd.read_excel(r'C:\Desktop\Releases.xlsx',
sheet_name = 'Releases',
header = 0
)
releases.loc[releases['REASSIGN_DATE'].isnull(),'REASSIGN_DATE']=releases['SETUP_DATE']
releases['REASSIGN_DATE']=pd.to_datetime(releases['REASSIGN_DATE'])
releases['RELEASED_DATE']=pd.to_datetime(releases['RELEASED_DATE'])
releases['RELEASED_DAYS']=releases.apply(lambda x:
np.busday_count(x.REASSIGN_DATE,x.RELEASED_DATE),axis =1)
releases_2=releases.drop(['SETUPDATE','RELEASEDDATE','REASSIGNDATE'],axis=1)
I get the error above.
I also tried adding astype('datetime64[D]'), but then I get an error saying that datetime64[ns] cannot be converted to datetime64[D].
How can I avoid this error?
Regards,
Ren.
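One approach that may avoid the cast error, sketched here with made-up dates rather than the real workbook: convert the underlying NumPy arrays to day precision and call `np.busday_count` once on whole arrays instead of row by row through `apply`. The column names follow the question and everything else is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the 'Releases' sheet.
releases = pd.DataFrame({
    "SETUP_DATE": ["2021-10-15", "2021-10-19"],
    "REASSIGN_DATE": ["2021-10-18", None],
    "RELEASED_DATE": ["2021-10-25", "2021-10-21"],
})

# Parse every date column; the missing value becomes NaT.
for col in ["SETUP_DATE", "REASSIGN_DATE", "RELEASED_DATE"]:
    releases[col] = pd.to_datetime(releases[col])

# Fill blank reassignment dates from the setup date.
releases["REASSIGN_DATE"] = releases["REASSIGN_DATE"].fillna(releases["SETUP_DATE"])

# np.busday_count expects day precision, so cast the NumPy arrays explicitly.
start = releases["REASSIGN_DATE"].values.astype("datetime64[D]")
end = releases["RELEASED_DATE"].values.astype("datetime64[D]")

# Business days from start (inclusive) to end (exclusive), for all rows at once.
releases["RELEASED_DAYS"] = np.busday_count(start, end)
print(releases)
```

The cast is done on `.values` (a NumPy array) because a pandas column cannot hold `datetime64[D]` directly.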
New coder here, trying to run some t-tests in Python 3.6. Right now, to run my t-tests between my 2 data sets, I have been doing the following:
import plotly.plotly as py
import plotly.graph_objs as go
from plotly.tools import FigureFactory as FF
import numpy as np
import pandas as pd
import scipy
from scipy import stats
long_term_survivor_GENE1 = [-0.38,-0.99,-1.04,0.1, etc..]
short_term_survivor_GENE1 = [0.32, 0.33,0.96, etc...]
stats.ttest_ind(long_term_survivor_GENE1,short_term_survivor_GENE1)
This requires me to manually enter the values for each column of both data sets for each specific gene (GENE1 in this case). Is there a way to pull the values from the data set so that Python can read them without me typing them out? For example, something like:
long_term_survivor_GENE1 = ##call values from GENE1 column from dataset 1##
short_term_survivor_GENE1 = ## call values from GENE1 column from dataset 2##
Thanks for any help, and sorry that I'm not very well-versed in this stuff. Appreciate any feedback/tips. If you have any other questions, please let me know!
If you've shoved your data into the columns of a pandas dataframe then it might be as easy as this.
>>> import pandas as pd
>>> long_term_survivor_GENE1 = [-0.38,-0.99,-1.04,0.1]
>>> short_term_survivor_GENE1 = [0.32, 0.33,0.96, 0.56]
>>> df = pd.DataFrame({'long_term_survivor_GENE1': long_term_survivor_GENE1, 'short_term_survivor_GENE1': short_term_survivor_GENE1})
>>> from scipy import stats
>>> stats.ttest_ind(df['long_term_survivor_GENE1'], df['short_term_survivor_GENE1'])
Ttest_indResult(statistic=-3.615804684179662, pvalue=0.011153077626049458)
It might be a good idea to review the statistics behind this though. If you haven't already got them in a dataframe then have a look for some of the many answers here on SO about using read_csv for assistance.
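As a rough sketch of the `read_csv` route, assume each data set is a CSV file with one column per gene. The two in-memory files below are made-up stand-ins, and only the `GENE1` values come from the example above.

```python
from io import StringIO

import pandas as pd
from scipy import stats

# Made-up stand-ins for the two data set files (one column per gene).
long_term_csv = StringIO("GENE1\n-0.38\n-0.99\n-1.04\n0.1\n")
short_term_csv = StringIO("GENE1\n0.32\n0.33\n0.96\n0.56\n")

long_term = pd.read_csv(long_term_csv)
short_term = pd.read_csv(short_term_csv)

# Pull the column by name instead of typing the values out by hand.
result = stats.ttest_ind(long_term["GENE1"], short_term["GENE1"])
print(result)
```

With real files, the `StringIO` objects would be replaced by file paths, and the same two lines could be repeated in a loop over the gene column names.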