user defined function in python to read csv file - python-3.x

I have defined the following UDF but it doesn't work.
def read_data(file_path):
    '''
    Reads in a dataset using pandas.
    Parameters
    ----------
    file_path : string containing path to a file
    Returns
    -------
    Pandas DataFrame with data read in from the file path
    '''
    pandas.read_csv('file_path')

It looks like you are missing the return statement, and the variable name shouldn't be in quotes:
import pandas as pd
def read_data(file_path: str) -> pd.DataFrame:
    return pd.read_csv(file_path)
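To make the difference concrete, here is a minimal sketch that contrasts the corrected function with a version that omits the return. The in-memory sample data and the name read_data_no_return are assumptions made for illustration; an io.StringIO object stands in for a real file path, which pd.read_csv also accepts.

```python
import io

import pandas as pd


def read_data(file_path) -> pd.DataFrame:
    """Read a CSV (path or file-like object) into a DataFrame."""
    return pd.read_csv(file_path)


def read_data_no_return(file_path):
    """Hypothetical variant without a return statement."""
    pd.read_csv(file_path)  # the DataFrame is built, then discarded


# Hypothetical sample data standing in for a CSV file on disk.
sample = "a,b\n1,2\n3,4\n"

df = read_data(io.StringIO(sample))
missing = read_data_no_return(io.StringIO(sample))

print(df)
print(missing)  # a function without a return statement yields None
```

Note that quoting the argument, as in pandas.read_csv('file_path'), asks pandas to open a file literally named file_path instead of using the value passed in.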

Related

Convert panda column to a string

I am trying to run the script below to add two columns to the left of a file; however, it keeps giving me
ValueError: header must be integer or list of integers
Below is my code:
import pandas as pd
import numpy as np
read_file = pd.read_csv("/home/ex.csv",header='true')
df=pd.DataFrame(read_file)
def add_col(x):
    df.insert(loc=0, column='Creation_DT', value=pd.to_datetime('today'))
    df.insert(loc=1, column='Creation_By', value="Sean")
    df.to_parquet("/home/sample.parquet")
add_col(df)
Also, is there any way to make the Creation_DT column a string?
According to the pandas docs, header is the row number(s) to use as the column names and the start of the data; it must be an int or a list of ints. So you have to pass header=0 to read_csv.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
Also, pandas automatically creates a DataFrame from the file it reads, so you don't need to wrap the result in pd.DataFrame. Just use:
df = pd.read_csv("/home/ex.csv", header=0)
You can try:
import pandas as pd
import numpy as np
read_file = pd.read_csv("/home/ex.csv")
df=pd.DataFrame(read_file)
def add_col(x):
    df.insert(loc=0, column='Creation_DT', value=str(pd.to_datetime('today')))
    df.insert(loc=1, column='Creation_By', value="Sean")
    df.to_parquet("/home/sample.parquet")
add_col(df)
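Both suggestions can be combined in one short sketch: header=0 for read_csv, and a timestamp converted to a string before it is inserted. The in-memory CSV text is a hypothetical stand-in for /home/ex.csv, and the strftime format is only one possible choice; the parquet step is left out so the example runs without a parquet engine installed.

```python
import io

import pandas as pd

# Hypothetical stand-in for the contents of /home/ex.csv.
csv_text = "id,name\n1,alpha\n2,beta\n"

# header=0 tells pandas that the first row holds the column names.
df = pd.read_csv(io.StringIO(csv_text), header=0)

# Format the timestamp as text so the new column holds strings.
created = pd.Timestamp.today().strftime("%Y-%m-%d %H:%M:%S")

df.insert(loc=0, column="Creation_DT", value=created)
df.insert(loc=1, column="Creation_By", value="Sean")

print(df)
```

Using strftime instead of str() gives control over the exact text format that ends up in the column.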

determine file path for pandas.read_csv in python

I use the following code to read a csv file and save it as a pandas DataFrame, but it always returns the iris dataset, not my data. What is the problem?
import pandas as pd
a = pd.read_csv(r"D:\data.csv")
print(a)

Python - Storing float values in CSV file

I am trying to store the positive and negative scores of statements in a text file. I want to store the scores in a csv file. I have implemented the code below:
import openpyxl
from nltk.tokenize import sent_tokenize
import csv
from senti_classifier import senti_classifier
from nltk.corpus import wordnet
file_content = open('amazon_kindle.txt')
for lines in file_content:
    sentence = sent_tokenize(lines)
    pos_score,neg_score = senti_classifier.polarity_scores(sentence)
with open('target.csv','w') as f:
    writer = csv.writer(f,lineterminator='\n',delimiter=',')
for val in range(pos_score):
    writer.writerow(float(s) for s in val[0])
f.close()
But the code gives me the following error in the for loop:
Traceback (most recent call last):
  File "C:\Users\pc\AppData\Local\Programs\Python\Python36-32\classifier.py", line 21, in <module>
    for val in pos_score:
TypeError: 'float' object is not iterable
Your code has several problems.
Your code and error do not correspond with each other:
for val in pos_score: # traceback
for val in range(pos_score): # code
pos_score is a float, so both are errors: range() takes an int, and a for loop needs an iterable. Where do you expect to get your list of values from?
From the usage, it looks like you are expecting a list of lists of values, because you are also using a generator expression in your writerow call:
writer.writerow(float(s) for s in val[0])
Perhaps you are only expecting a list of values, in which case you can get rid of the for loop and just use:
writer.writerow(float(val) for val in <list_of_values>)
Using:
with open('target.csv','w') as f:
means you no longer need to call f.close(), because with closes the file at the end of the with block. This also means the writerow() call needs to be inside the with block:
with open('target.csv','w') as f:
    writer = csv.writer(f,lineterminator='\n',delimiter=',')
    writer.writerow(float(val) for val in <list_of_values>)
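As a rough sketch of where the list of values could come from, the scores can be collected per statement and then written one row at a time. The scores list, the column names, and the write_scores helper are assumptions for illustration; in the real script the pairs would come from senti_classifier.polarity_scores. An in-memory buffer is used here so the example runs without touching the disk.

```python
import csv
import io

# Hypothetical (pos_score, neg_score) pairs, one per statement.
scores = [(0.625, 0.125), (0.0, 0.75)]


def write_scores(rows, f):
    """Write one CSV row per (pos, neg) pair to the open file object f."""
    writer = csv.writer(f, lineterminator="\n")
    writer.writerow(["pos_score", "neg_score"])  # header row
    for pos, neg in rows:
        writer.writerow([float(pos), float(neg)])


buffer = io.StringIO()
write_scores(scores, buffer)
output = buffer.getvalue()
print(output)
```

With a real file, the same helper would be called inside a with block, for example with open('target.csv', 'w', newline='') as f: write_scores(scores, f), so that every writerow call happens while the file is still open.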

Combine multiple csv files into a single xls workbook Python 3

We are in the middle of a transition at work from Python 2.7 to Python 3.5. It's a company-wide change, and most of our current scripts were written in 2.7 with no additional libraries. I've taken advantage of the Anaconda distro we are using and have already changed most of our scripts over, either with the 2to3 module or by completely rewriting them. I am stuck on one piece of code, though, which I did not write, and the original author is not here. He also did not supply comments, so I can only guess at the whole of the script. 95% of the script works correctly until the end, where, after it creates 7 csv files with different parsed information, it has a custom function to combine the csv files into an xls workbook with each csv as a new tab.
import csv
import xlwt
import glob
import openpyxl
from openpyxl import Workbook
Parsefiles = glob.glob(directory + '/' + "Parsed*.csv")
def xlsmaker():
    for f in Parsefiles:
        (path, name) = os.path.split(f)
        (short_name, extension) = os.path.splitext(name)
        ws = wb.add_sheet(short_name)
        xreader = csv.reader(open(f, 'rb'))
        newdata = [line for line in xreader]
        for rowx, row in enumerate(newdata):
            for colx, value in enumerate(row):
                if value.isdigit():
                    ws.write(rowx, colx, value)
xlsmaker()
for f in Parsefiles:
    os.remove(f)
wb.save(directory + '/' + "Finished" + '' + oshort + '' + timestr + ".xls")
This was all written in Python 2.7 and still works correctly if I run it in Python 2.7. The issue is that it throws an error when running in Python 3.5.
  File "parsetool.py", line 521, in <module>
    xlsmaker()
  File "parsetool.py", line 511, in xlsmaker
    ws = wb.add_sheet(short_name)
  File "c:\pythonscripts\workbook.py", line 168, in add_sheet
    raise TypeError("The parameter you have given is not of the type '%s'" % self._worksheet_class.__name__)
TypeError: The parameter you have given is not of the type "Worksheet"
Any ideas about what should be done to fix the above error? I've tried multiple rewrites, but I get similar errors or new errors. I'm considering just figuring out a whole new method to create the xls, possibly with pandas instead.
I'm not sure why it errors, but it is worth the effort to rewrite the code with pandas instead. Pandas can read each csv file into a separate DataFrame and save each DataFrame as a separate sheet in an xls(x) file, using pandas' ExcelWriter. E.g.
import pandas as pd
writer = pd.ExcelWriter('yourfile.xlsx', engine='xlsxwriter')
df = pd.read_csv('originalfile.csv')
df.to_excel(writer, sheet_name='sheetname')
writer.save()
Since you have multiple csv files, you would probably want to read all csv files and store them as a df in a dict. Then write each df to Excel with a new sheet name.
Multi-csv Example:
import pandas as pd
import sys
import os
writer = pd.ExcelWriter('default.xlsx') # Arbitrary output name
for csvfilename in sys.argv[1:]:
    df = pd.read_csv(csvfilename)
    df.to_excel(writer,sheet_name=os.path.splitext(csvfilename)[0])
writer.save()
(Note that it may be necessary to pip install openpyxl to resolve errors with xlsxwriter import missing.)
You can use the code below to read multiple .csv files into one big .xlsx Excel file.
I also added code that replaces ',' with '.' (or vice versa, depending on your locale settings) for better compatibility on Windows.
import pandas as pd
import sys
import os
import glob
from pathlib import Path
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
writer = pd.ExcelWriter('fc15.xlsx') # Arbitrary output name
for csvfilename in all_filenames:
    txt = Path(csvfilename).read_text()
    txt = txt.replace(',', '.')
    text_file = open(csvfilename, "w")
    text_file.write(txt)
    text_file.close()
    print("Loading "+ csvfilename)
    df= pd.read_csv(csvfilename,sep=';', encoding='utf-8')
    df.to_excel(writer,sheet_name=os.path.splitext(csvfilename)[0])
    print("done")
writer.save()
print("task completed")
Here's a slight extension to the accepted answer. Pandas 1.5 complains about the call to writer.save(). The fix is to use the writer as a context manager.
import sys
from pathlib import Path
import pandas as pd
with pd.ExcelWriter("default.xlsx") as writer:
    for csvfilename in sys.argv[1:]:
        p = Path(csvfilename)
        sheet_name = p.stem[:31]
        df = pd.read_csv(p)
        df.to_excel(writer, sheet_name=sheet_name)
This version also trims the sheet name down to fit in Excel's maximum sheet name length, which is 31 characters.
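One thing trimming does not cover is two files that end up with the same sheet name, either because they share a stem or because their first 31 characters match. A possible way to handle that is sketched below; unique_sheet_names and combine_csvs are hypothetical helper names, and the numeric suffix scheme is just one assumption about how collisions could be resolved.

```python
from pathlib import Path

import pandas as pd

MAX_SHEET_NAME = 31  # Excel's maximum sheet name length


def unique_sheet_names(csv_paths):
    """Map each CSV path to a sheet name that is unique and at most 31 chars."""
    names = {}
    used = set()
    for path in csv_paths:
        base = Path(path).stem[:MAX_SHEET_NAME]
        name = base
        counter = 1
        # Excel treats sheet names case-insensitively, so compare lowercased.
        while name.lower() in used:
            suffix = f"_{counter}"
            name = base[: MAX_SHEET_NAME - len(suffix)] + suffix
            counter += 1
        used.add(name.lower())
        names[path] = name
    return names


def combine_csvs(csv_paths, out_path):
    """Write each CSV to its own sheet of a single workbook."""
    with pd.ExcelWriter(out_path) as writer:
        for path, sheet in unique_sheet_names(csv_paths).items():
            pd.read_csv(path).to_excel(writer, sheet_name=sheet, index=False)
```

A call such as combine_csvs(sys.argv[1:], "default.xlsx") would then mirror the script above; it needs an Excel engine such as openpyxl installed.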
If your csv files contain Chinese text in GBK encoding, you can use the following code (gb18030 is a superset of GBK):
import pandas as pd
import glob
import datetime
from pathlib import Path
now = datetime.datetime.now()
extension = "csv"
all_filenames = [i for i in glob.glob(f"*.{extension}")]
with pd.ExcelWriter(f"{now:%Y%m%d}.xlsx") as writer:
    for csvfilename in all_filenames:
        print("Loading " + csvfilename)
        df = pd.read_csv(csvfilename, encoding="gb18030")
        df.to_excel(writer, index=False, sheet_name=Path(csvfilename).stem)
        print("done")
print("task completed")

Load XML string from Column in PySpark

I have a JSON file in which one of the columns is an XML string.
I tried extracting this field and writing to a file in the first step and reading the file in the next step. But each row has an XML header tag. So the resulting file is not a valid XML file.
How can I use the PySpark XML parser ('com.databricks.spark.xml') to read this string and parse out the values?
The following doesn't work:
tr = spark.read.json( "my-file-path")
trans_xml = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='book').load(tr.select("trans_xml"))
Thanks,
Ram.
Try Hive XPath UDFs (LanguageManual XPathUDF):
>>> from pyspark.sql.functions import expr
>>> df.select(expr("xpath({0}, '{1}')".format(column_name, xpath_expression)))
or Python UDF:
>>> from pyspark.sql.types import *
>>> from pyspark.sql.functions import udf
>>> import xml.etree.ElementTree as ET
>>> schema = ... # Define schema
>>> def parse(s):
...     root = ET.fromstring(s)
...     result = ...  # Select values
...     return result
>>> df.select(udf(parse, schema)(xml_column))
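To show what the elided "Select values" step might look like, here is a plain-Python sketch of the parsing function on its own. The XML layout (a book element with title and author children) and the name parse_book are assumptions for illustration; only the rowTag 'book' comes from the question. The function is kept free of Spark so it can be checked locally before being wrapped with udf as in the snippet above.

```python
import xml.etree.ElementTree as ET


def parse_book(xml_string):
    """Return (title, author) from one XML string, or (None, None) if unusable.

    Assumes a hypothetical layout: <book><title>..</title><author>..</author></book>
    """
    if not xml_string:
        return (None, None)
    try:
        # Each row carries its own XML header; fromstring accepts it as is.
        root = ET.fromstring(xml_string)
    except ET.ParseError:
        return (None, None)
    return (root.findtext("title"), root.findtext("author"))


# Hypothetical sample row, including the per-row XML header.
row = '<?xml version="1.0"?><book><title>Dune</title><author>Herbert</author></book>'
parsed = parse_book(row)
print(parsed)
```

Because every row is parsed separately, the repeated XML header on each row is no longer a problem. A matching schema for the udf would be a struct with two string fields.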
