I am trying to run the script below to add two columns to the left of a file; however, it keeps giving me
ValueError: header must be integer or list of integers
Below is my code:
import pandas as pd
import numpy as np
read_file = pd.read_csv("/home/ex.csv",header='true')
df=pd.DataFrame(read_file)
def add_col(x):
    df.insert(loc=0, column='Creation_DT', value=pd.to_datetime('today'))
    df.insert(loc=1, column='Creation_By', value="Sean")
    df.to_parquet("/home/sample.parquet")
add_col(df)
Also, is there a way to make the Creation_DT column a string?
According to the pandas docs, header is the row number(s) to use as the column names and the start of the data, and it must be an int or a list of ints. The string 'true' is neither, so pass header=0 to read_csv instead.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
Also, read_csv already returns a DataFrame, so you don't need to wrap the result in pd.DataFrame. Just use:
df = pd.read_csv("/home/ex.csv", header=0)
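To see the difference without touching the original file, here is a minimal sketch that reads made-up CSV text from memory. The data and the variable names csv_text and error_message are assumptions for illustration only.

```python
import io

import pandas as pd

# Made-up data standing in for /home/ex.csv (an assumption for this sketch).
csv_text = "a,b\n1,2\n3,4\n"

# Passing a string such as "true" is rejected by read_csv.
try:
    pd.read_csv(io.StringIO(csv_text), header="true")
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# header=0 tells pandas that the first row holds the column names.
df = pd.read_csv(io.StringIO(csv_text), header=0)

print(error_message)
print(df.columns.tolist())
```

Since header=0 is what read_csv infers by default when no column names are passed, leaving the argument out should behave the same way here.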
You can try:
import pandas as pd
import numpy as np
read_file = pd.read_csv("/home/ex.csv")
df=pd.DataFrame(read_file)
def add_col(x):
    df.insert(loc=0, column='Creation_DT', value=str(pd.to_datetime('today')))
    df.insert(loc=1, column='Creation_By', value="Sean")
    df.to_parquet("/home/sample.parquet")
add_col(df)
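If you want control over how the timestamp string looks, strftime is one option instead of str(). This is a sketch on a made-up two-row DataFrame; the format string and the "value" column are assumptions, not something the question specifies.

```python
import pandas as pd

# Made-up frame standing in for the CSV contents (an assumption).
df = pd.DataFrame({"value": [10, 20]})

# Format the current timestamp as text; the format string is only an example.
created = pd.Timestamp.today().strftime("%Y-%m-%d %H:%M:%S")

# A scalar value is repeated for every row of the new column.
df.insert(loc=0, column="Creation_DT", value=created)
df.insert(loc=1, column="Creation_By", value="Sean")

print(df)
```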
Related
I am trying to change all negative values to 0 in Excel files.
However, it seems that pandas skips the first row.
How can I prevent the first row from being skipped? Thank you.
Here is the code:
# importing pandas module
import pandas as pd
import numpy as np
import csv
from pandas import DataFrame
df = pd.read_csv("FAPI-N2-rere_2D_modified.csv")
df[df < 0 ] = 0
df.to_csv('FAPI-N2-rere_2D_modified2.csv')
==========================================================
I tried adding header=None to the code above:
# importing pandas module
import pandas as pd
import numpy as np
import csv
from pandas import DataFrame
df = pd.read_csv("FAPI-N2-rere_2D_modified.csv", header=None)
df[df < 0 ] = 0
df.to_csv('FAPI-N2-rere_2D_modified2.csv')
However, I keep getting this TypeError:
TypeError: '<' not supported between instances of 'str' and 'int'
I would really appreciate it if anyone could help me.
Thank you so much!
Some columns of your DataFrame contain strings.
If their content is numeric, you can try converting them to a numeric dtype; how to do that is explained here:
Change column type in pandas
If some columns genuinely contain strings, replace the negative values only in the numeric columns.
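As a rough sketch of the second suggestion, one way is to pick the numeric columns with select_dtypes and clip only those. The DataFrame below is made up for illustration; the column names are assumptions.

```python
import pandas as pd

# Made-up frame with one text column and two numeric columns (an assumption).
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "x": [-1.5, 2.0, -3.0],
    "y": [4, -5, 6],
})

# Select only the numeric columns so string columns are never compared with 0.
numeric_cols = df.select_dtypes(include="number").columns

# clip(lower=0) raises every negative value to 0 and leaves the rest unchanged.
df[numeric_cols] = df[numeric_cols].clip(lower=0)

print(df)
```

If a column looks numeric but was read as text, pd.to_numeric with errors="coerce" can convert it first; whether that is appropriate depends on what the non-numeric cells mean.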
I am pretty new to Python (using Python 3) and am using pandas to import a dataset.
I need to import a dataset from this URL: https://newonlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/leukemia_remission/index.txt
and convert it to a CSV file, but I am getting some special characters in the converted CSV: ��
I am downloading the txt file and converting it to CSV; is this the right approach?
Also, the converted CSV puts the entire text into one column.
from urllib.request import urlretrieve
import pandas as pd
from pandas import DataFrame
url = 'https://newonlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/leukemia_remission/index.txt'
urlretrieve(url, 'index.txt')
df = pd.read_csv('index.txt', sep='/t', engine='python', lineterminator='\r\n')
csv_file = df.to_csv('index.csv', sep='\t', index=False, header=True)
print(csv_file)
After a successful import, I also have to extract X as all columns except the first, and Y as the first column.
I'd appreciate any help.
from urllib.request import urlretrieve
import pandas as pd
url = 'https://newonlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/leukemia_remission/index.txt'
urlretrieve(url, 'index.txt')
df = pd.read_csv('index.txt', sep='\t',encoding='utf-16')
Y = df[['REMISS']]
X = df.drop(['REMISS'],axis=1)
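The two changes in the code above are the separator ('\t' instead of '/t') and the encoding. The sketch below reproduces the idea offline by writing a small made-up tab-separated file as UTF-16 and reading it back; the sample values and the temporary file are assumptions, and only the REMISS column name comes from the answer.

```python
import os
import tempfile

import pandas as pd

# Made-up sample in the same tab-separated layout (values are assumptions).
sample = "REMISS\tCELL\tSMEAR\n1\t0.8\t0.83\n0\t0.9\t0.36\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.txt")
    # Write the file as UTF-16 to mimic the encoding the answer assumes.
    with open(path, "w", encoding="utf-16", newline="") as f:
        f.write(sample)
    # sep must be the tab escape '\t', and the encoding must match the file.
    df = pd.read_csv(path, sep="\t", encoding="utf-16")

Y = df["REMISS"]                 # target: the first column
X = df.drop(columns=["REMISS"])  # features: every other column

print(X.columns.tolist())
```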
I am trying to calculate the number of days between two dates, and I get an error:
TypeError: ("Iterator operand 0 dtype could not be cast from dtype('<M8[us]') to dtype('<M8[D]') according to the rule 'safe'", 'occurred at index 0')
The column REASSIGN_DATE has blank values in a few rows, which I fill from another column called SETUP_DATE so that no blanks remain.
Then I try to calculate the number of days between these two dates using the code below.
import pandas as pd
import numpy as np
import datetime
from datetime import date
from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta
import xlrd
import workdays
import defusedxml
from xlrd import open_workbook
defusedxml.defuse_stdlib()
def secure_open_workbook(**kwargs):
    try:
        return open_workbook(**kwargs)
    except EntitiesForbidden:
        raise ValueError('Please use a xlsx file without XEE')
#loading Raw Data
releases = pd.read_excel(r'C:\Desktop\Releases.xlsx',
                         sheet_name='Releases',
                         header=0
                         )
releases.loc[releases['REASSIGN_DATE'].isnull(),'REASSIGN_DATE']=releases['SETUP_DATE']
releases['REASSIGN_DATE']=pd.to_datetime(releases['REASSIGN_DATE'])
releases['RELEASED_DATE']=pd.to_datetime(releases['RELEASED_DATE'])
releases['RELEASED_DAYS'] = releases.apply(
    lambda x: np.busday_count(x.REASSIGN_DATE, x.RELEASED_DATE), axis=1)
releases_2=releases.drop(['SETUPDATE','RELEASEDDATE','REASSIGNDATE'],axis=1)
I get the error above.
I also tried adding astype('datetime64[D]'), but then I get an error saying that datetime64[ns] cannot be converted to datetime64[D].
How can I avoid this error?
Regards,
Ren.
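One possible way around the cast error, offered as a sketch rather than a tested fix for the workbook above: np.busday_count works on day-precision datetime64[D] values, so you can convert the underlying NumPy arrays explicitly and call it once on whole columns instead of row by row. The dates below are made up, and the sketch assumes that neither column contains missing dates.

```python
import numpy as np
import pandas as pd

# Made-up dates; only the column names come from the question.
releases = pd.DataFrame({
    "REASSIGN_DATE": pd.to_datetime(["2024-01-01", "2024-01-05"]),
    "RELEASED_DATE": pd.to_datetime(["2024-01-08", "2024-01-08"]),
})

# Convert the NumPy arrays (not the pandas Series) to day precision.
start = releases["REASSIGN_DATE"].to_numpy().astype("datetime64[D]")
end = releases["RELEASED_DATE"].to_numpy().astype("datetime64[D]")

# Vectorized call: counts business days in the half-open range [start, end).
releases["RELEASED_DAYS"] = np.busday_count(start, end)

print(releases)
```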
I have n txt files, each containing 99 floating-point numbers in 99 columns. I read each file and append all the data with the following script.
import glob
import numpy as np
import matplotlib.pyplot as plt
msd_files = (glob.glob('MSD_no_fs*'))
msd_all=[]
for msd_file in msd_files:
    # print(msd_file)
    msd = np.loadtxt(fname=msd_file, delimiter=',')
    msd_all.append(msd)
After that I need to compute the column-wise sum across all files, for example file1 column1 + file2 column1 + ... + file(n) column1, and repeat this for every column. What would be an efficient way to do this? Can I use a list comprehension for it?
Edit: I changed the code as follows, and it works fine now.
import glob
import numpy as np
import matplotlib.pyplot as plt
msd_files = (glob.glob('MSD_no_fs*'))
msd_all=[]
for msd_file in msd_files:
    with open(msd_file) as f:
        for line in f:
            # msd_all.append([float(v) for v in line.strip().split(',')])
            msd_all.append(float(line.strip()))
msa_array = np.array(msd_all)
x=np.split(msa_array,99)
x=np.array(x)
result=np.mean(x,axis=0)
print(result.shape)
print(len(result))
It depends on the level of efficiency you want. Using numpy to load many CSV files might be a bad choice. Here is my suggestion.
import glob
import numpy as np
msd_files = (glob.glob('MSD_no_fs*'))
msd_all=[]
for msd_file in msd_files:
    with open(msd_file) as f:
        for line in f:
            msd_all.append([float(v) for v in line.strip().split(',')])
msa_array = np.array(msd_all)
result = msa_array.sum(axis=0)
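For comparison, here is a sketch of the same column-wise sum that keeps np.loadtxt and uses a list comprehension, as the question asks about. It writes three tiny made-up files into a temporary directory so it can run on its own; the 4-column size and the file contents are assumptions standing in for the real 99-column files.

```python
import glob
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    # Create three made-up files, each with one row of 4 comma-separated floats.
    for i in range(3):
        path = os.path.join(tmp, f"MSD_no_fs{i}.txt")
        np.savetxt(path, np.full((1, 4), float(i + 1)), delimiter=",")

    msd_files = sorted(glob.glob(os.path.join(tmp, "MSD_no_fs*")))
    # ndmin=2 keeps every file as a 2-D array, even when it has a single row.
    arrays = [np.loadtxt(f, delimiter=",", ndmin=2) for f in msd_files]

# Stack all rows from all files, then sum down each column.
column_sums = np.vstack(arrays).sum(axis=0)

print(column_sums)
```

Which variant is faster depends on the number and size of the files, so it is worth timing both on the real data.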
I imported a file from SPSS (a .sav file); however, the titles of my columns
appear as integers instead of strings. Is there a way to fix this? Below is the code I used. I would appreciate any help!
import fnmatch
import sys # import sys
import os
import pandas as pd #pandas importer
import savReaderWriter as spss # to import file from SPSS
import io #importing io
import codecs #to resolve the UTF-8 unicode
with spss.SavReader('file_name.sav') as reader: #Should I add "Np"
    records = reader.all()
with codecs.open('file_name.sav', "r", encoding='utf-8', errors='strict') as fdata: # Not sure if the problem resides on this line
    df = pd.DataFrame(records)
df.head()
I am wondering whether there is a way to convert the titles from numbers to strings. The same thing can happen in Excel, but Excel has an easy fix for it.
Thanks in advance!
After you have created the DataFrame, you can use df.columns = df.columns.map(str) to change the column headers to strings.
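A small sketch of that call, using made-up records in place of the SPSS data: a DataFrame built from a plain list of rows gets the integer column labels 0, 1, 2, and map(str) turns them into the strings "0", "1", "2". The records list is an assumption for illustration.

```python
import pandas as pd

# Made-up rows standing in for reader.all() (an assumption).
records = [[1, 2.5, "a"], [2, 3.5, "b"]]

df = pd.DataFrame(records)
print(df.columns.tolist())  # integer labels assigned by pandas

# Convert every column label to a string.
df.columns = df.columns.map(str)
print(df.columns.tolist())
```

If you know the real variable names, assigning them directly with df.columns = [...] gives more descriptive headers than stringified numbers.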