I'm trying to generate records from a csv retrieved from a given url in Python 3.
Since urlopen returns a bytes-mode file object but csv.DictReader expects a text-mode file object, I've had to wrap the urlopen file object in a TextIOWrapper (with UTF-8 decoding). Now, unfortunately, I'm stuck between two undesirable options:
1) TextIOWrapper doesn't support seek, so I can't rewind the csv_file stream after checking for a header with Sniffer.
2) If I don't seek back to 0, I lose the first 18 records, the ones consumed by the Sniffer's 1024-character read.
How do I modify the code below so that it can both check for headers and yield all records from a single urlopen call?
What have I tried?
Reading the URL twice: once to check for headers, a second time to generate the csv records. This seems suboptimal.
The code below skips the first 18 records. Uncomment the csv_file.seek(0) line to trigger the seek error instead.
import csv
import io
import urllib.request

def read_url_csv(url, fieldnames=None, transform=None):
    if transform is None:
        transform = lambda x, filename, fieldnames: x
    with urllib.request.urlopen(url) as csv_binary:
        csv_file = io.TextIOWrapper(csv_binary, "utf-8")
        has_header = csv.Sniffer().has_header(csv_file.read(1024))
        #csv_file.seek(0)
        reader = csv.DictReader(csv_file, fieldnames=fieldnames)
        if has_header and fieldnames is not None:  # fieldnames overrides the header row, so skip it
            next(reader)
        for record in reader:
            yield transform(record, url, fieldnames)
StringIO supports seeking, so read the decoded response into one and sniff from that:
csv_file = io.StringIO(io.TextIOWrapper(csv_binary, "utf-8").read())
has_header = csv.Sniffer().has_header(csv_file.read(1024))
csv_file.seek(0)
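Putting it together, a minimal sketch of the original function with the StringIO workaround applied (one urlopen call, header sniffing, no lost records):

import csv
import io
import urllib.request

def read_url_csv(url, fieldnames=None, transform=None):
    if transform is None:
        transform = lambda x, filename, fieldnames: x
    with urllib.request.urlopen(url) as csv_binary:
        # read the whole response into a seekable in-memory buffer
        csv_file = io.StringIO(io.TextIOWrapper(csv_binary, "utf-8").read())
        has_header = csv.Sniffer().has_header(csv_file.read(1024))
        csv_file.seek(0)  # rewind so DictReader sees every record
        reader = csv.DictReader(csv_file, fieldnames=fieldnames)
        if has_header and fieldnames is not None:
            next(reader)  # fieldnames overrides the header row, so skip it
        for record in reader:
            yield transform(record, url, fieldnames)

Note this buffers the entire response in memory, which is fine for modestly sized files; for very large feeds the two-request approach may still be the lesser evil.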
import csv

import requests
from bs4 import BeautifulSoup
from lxml import etree

with open('1_colonia.csv', 'r', encoding='utf-8') as csvfile:
    reader = csv.reader(csvfile, delimiter=';')
    next(reader)  # skip the header row
    for row in reader:
        url = row[0]
        page = requests.get(url)
        # parse the html with BeautifulSoup
        soup = BeautifulSoup(page.content, 'html.parser')
        # build an lxml DOM from the soup for xpath queries
        dom = etree.HTML(str(soup))
        property = dom.xpath('//*[@id="header"]/div/div[2]/h1')
        duration = dom.xpath('//*[@id="header"]/div/p')
        price = dom.xpath('//*[@id="price"]/div/div/span/span[3]')
        # save the data to a CSV file, adding the url as a column
        with open('2_colonia.csv', 'a', newline='', encoding='utf-8') as csvfile:
            writer = csv.writer(csvfile, delimiter=';')
            writer.writerow([url, property[0].text, duration[0].text, price[0].text])
'1_colonia.csv' contains a list of 815 links to properties on sale.
The script works until this message appears:
Traceback (most recent call last):
File "/home/flimflam/Python/colonia/2_colonia.py", line 23, in <module>
writer.writerow([url, property[0].text, duration[0].text, price[0].text])
IndexError: list index out of range
I am not sure where the problem lies. Can anyone help me out, please?
Thanks,
xpath returns lists (for the kind of expression you are using), so in your script property, duration and price are lists.
Depending on what you're searching, xpath can return 0, 1 or multiple elements.
So you must check whether there are results in the list before accessing them. If the list is empty and you try to access the first element (as in property[0], for instance), you will get an exception.
A simple way of checking if there's data in your lists before writing to the csv file would be:
with open('2_colonia.csv', 'a', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile, delimiter=';')
    # check if the lists are not empty
    if len(property) > 0 and len(duration) > 0 and len(price) > 0:
        writer.writerow([url, property[0].text, duration[0].text, price[0].text])
    else:
        writer.writerow([url, 'error'])
I am reading data from a .csv file and want to write parts of that data into an output file.
When I execute the program and print the results inside the loop, I get the complete data set of the input file.
However, when I print again after the loop, only the last line of the input file is shown.
When I write the result into another csv file, likewise only the last line is transferred.
Basically I am new at this and struggle to understand how data is stored in memory and passed on.
import csv
import os

with open("path to the input file") as file:
    reader = csv.reader(file)
    for line in file:
        input_data = line.strip().split(";")
        print(input_data)

with open(os.path.join("path to the output file"), "w") as file1:
    toFile = input_data
    file1.write(str(toFile))
There are no error messages, just not the expected result. I expect 10 lines to be transferred, but only the last makes it to the output .csv.
Thank you for your help!
When you loop over the lines in the csv, each iteration assigns the value of that line to input_data, overwriting the value previously stored in input_data. The write then happens only once, after the loop, so only the last line survives.
I would recommend something like the following:
import csv

with open('path to input file', 'r', newline='') as infile, open('path to output file', 'w', newline='') as outfile:
    reader = csv.reader(infile, delimiter=';')
    writer = csv.writer(outfile, delimiter=';')
    for row in reader:  # each row arrives already split into a list of fields
        writer.writerow(row)

You can open multiple files in a single with statement, as shown in the example. Writing inside the loop means every row, not just the last one, reaches the output file.
This should do it. You created the reader object correctly but didn't use it. I hope my example will better your understanding of the reader class.
#!/usr/bin/env python
import csv

def write_csv_to_file(csv_file_name, out_file_name):
    # Open the csv-file
    with open(csv_file_name, newline='') as csvfile:
        # Create a reader object, pass it the csv-file and tell it how to
        # split up values
        reader = csv.reader(csvfile, delimiter=',', quotechar='|')
        # Open the output file
        with open(out_file_name, "w") as out_file:
            # Loop through the rows that the reader found
            for row in reader:
                # Join the row values using a comma as separator
                r = ', '.join(row)
                # Print row and write to output file
                print(r)
                # text mode translates '\n' to the platform's line ending,
                # so plain '\n' is safer than os.linesep here
                out_file.write(r + '\n')

if __name__ == '__main__':
    csv_file = "example.csv"
    out_file = "out.txt"
    write_csv_to_file(csv_file, out_file)
My Code:
import csv
import requests

url = 'https://xxxxxxxxxxxxxxxxxxx.csv'
r = requests.get(url)
text = r.iter_lines()
reader = csv.reader(text, delimiter=',')
for row in reader:
    print(row)
I am getting the following error:
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
Just save the csv separately and then read it in with pandas:

import pandas as pd

pd.read_csv('SO/AN_LATEST_ANNOUNCED.csv')
Result: the file loads as a pandas DataFrame.
(To save the file separately, just open the link in a browser.)
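Alternatively, the original error can be fixed without pandas by decoding each line before handing it to csv.reader, since iter_lines() yields bytes by default. A minimal sketch, assuming the feed is UTF-8:

import csv
import requests

url = 'https://xxxxxxxxxxxxxxxxxxx.csv'  # placeholder URL from the question
r = requests.get(url)
# iter_lines() yields bytes; decode each line to text for csv.reader
text = (line.decode('utf-8') for line in r.iter_lines())
reader = csv.reader(text, delimiter=',')
for row in reader:
    print(row)

requests can also do the decoding itself via r.iter_lines(decode_unicode=True).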
#bug: this program should output url contents but instead outputs the urls themselves
import csv
import requests
import json

with open('PluginIndex.csv', newline='') as csvfile:  # opens the plugin index file and stores it in "csvfile"
    reader = csv.DictReader(csvfile)  # reads the contents of csvfile
    for row in reader:  # for each row of the csv
        url = row["repo"]
        print(url)  # outputs the repo url of each csv row
        resp = requests.get(url)
        data = json.loads(resp.text)
        data.keys()
I need the content of each url in the csv file to be output to the command prompt, but the requests and json calls above never print anything. Does anyone know a way to achieve this?
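For what it's worth, the loop already fetches each page; the fix is to print the fetched body instead of (or as well as) the URL. A minimal sketch, assuming each repo URL returns JSON:

import csv
import json
import requests

with open('PluginIndex.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        url = row["repo"]
        resp = requests.get(url)
        try:
            data = resp.json()                 # parse the body as JSON
            print(json.dumps(data, indent=2))  # pretty-print the url contents
        except ValueError:
            print(resp.text)                   # fall back to raw text if the body isn't JSON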
I have data which is being accessed via http request and is sent back by the server in a comma separated format, I have the following code :
import urllib2
from bs4 import BeautifulSoup

site = 'www.example.com'
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(site, headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
soup = soup.get_text()
text = str(soup)
The content of text is as follows:
april,2,5,7
may,3,5,8
june,4,7,3
july,5,6,9
How can I save this data into a CSV file?
I know I can do something along the lines of the following to iterate line by line:
import StringIO
s = StringIO.StringIO(text)
for line in s:
But I'm unsure how to now properly write each line to CSV.
EDIT ---> Thanks for the feedback. As suggested, the solution was rather simple and can be seen below.
Solution:

import StringIO

s = StringIO.StringIO(text)
with open('fileName.csv', 'w') as f:
    for line in s:
        f.write(line)
General way:
##text = list of strings to be written to file
with open('csvfile.csv', 'wb') as file:
    for line in text:
        file.write(line)
        file.write('\n')

OR

Using CSV writer:

import csv

with open(<path to output_csv>, "wb") as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    for line in data:
        writer.writerow(line)

OR

Simplest way:

f = open('csvfile.csv', 'w')
f.write('hi there\n')  # Give your csv text here.
## Python will convert \n to os.linesep in text mode
f.close()
You could just write to the file as you would write any normal file.
with open('csvfile.csv', 'wb') as file:
    for l in text:
        file.write(l)
        file.write('\n')

If, instead, it is a list of lists, you could directly use the built-in csv module:

import csv

with open("csvfile.csv", "wb") as file:
    writer = csv.writer(file)
    writer.writerows(text)
I would simply write each line to a file, since it's already in a CSV format:
write_file = "output.csv"
with open(write_file, "wt", encoding="utf-8") as output:
    for line in text:
        output.write(line + '\n')
I can't recall how to write lines with line-breaks at the moment, though :p
Also, you might like to take a look at this answer about write(), writelines(), and '\n'.
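In short: write() takes a single string, writelines() takes an iterable of strings, and neither adds line breaks for you. A quick sketch:

lines = ["april,2,5,7", "may,3,5,8"]

with open("demo.csv", "w") as f:
    f.write("month,a,b,c\n")                     # write() takes one string
    f.writelines(line + "\n" for line in lines)  # writelines() takes an iterable; newlines are not added for you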
To complement the previous answers, I whipped up a quick class to write to CSV files. It makes it easier to manage and close open files and achieve consistency and cleaner code if you have to deal with multiple files.
import csv
import os

class CSVWriter():
    filename = None
    fp = None
    writer = None

    def __init__(self, filename):
        self.filename = filename
        self.fp = open(self.filename, 'w', encoding='utf8')
        self.writer = csv.writer(self.fp, delimiter=';', quotechar='"',
                                 quoting=csv.QUOTE_ALL, lineterminator='\n')

    def close(self):
        self.fp.close()

    def write(self, elems):
        self.writer.writerow(elems)

    def size(self):
        return os.path.getsize(self.filename)

    def fname(self):
        return self.filename
Example usage:
mycsv = CSVWriter('/tmp/test.csv')
mycsv.write((12,'green','apples'))
mycsv.write((7,'yellow','bananas'))
mycsv.close()
print("Written %d bytes to %s" % (mycsv.size(), mycsv.fname()))
Have fun
What about this:
with open("your_csv_file.csv", "w") as f:
f.write("\n".join(text))
str.join() Return a string which is the concatenation of the strings in iterable.
The separator between elements is
the string providing this method.
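For example, with the data from the question:

lines = ["april,2,5,7", "may,3,5,8"]
joined = "\n".join(lines)  # 'april,2,5,7\nmay,3,5,8'
print(joined)              # prints the two rows on separate lines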
In my situation...
import csv

with open('UPRN.csv', 'w', newline='') as out_file:
    writer = csv.writer(out_file)
    writer.writerow(('Name', 'UPRN', 'ADMIN_AREA', 'TOWN', 'STREET', 'NAME_NUMBER'))
    writer.writerows(lines)

You need to include the newline='' option in the open() call and it will work.
https://www.programiz.com/python-programming/writing-csv-files