Compiling a regex query with a row in Python - python-3.x

I have two CSV files. I want to take each row of csv1 in turn, find its entry in csv2, and then pull out other information for that row from csv2.
The difficulty is that the items I'm looking up are company names. In both csv1 and csv2 they may or may not carry a suffix such as 'Ltd', 'LTD', 'Limited' or 'LIMITED' within the name, so I would like the query to ignore these substrings. My code is still only finding exact matches rather than ignoring 'Ltd' etc. I'm guessing the problem is the way I'm combining row[0] with the regex pattern, but I can't figure it out.
Code
import re, csv
with open(r'c:\temp\noregcompanies.csv', 'r', newline='') as q:
    readerM = csv.reader(q)
    for row in readerM:
        companySourcename = row[0] + r'.*(?!Ltd|Limited|LTD|LIMITED).*'
        IBcompanies = re.compile(companySourcename)
        IBcompaniesString = str(companySourcename)
        with open(r'c:\temp\chdata.csv', 'r', newline='') as f:
            readerS = csv.reader(f)
            for row in readerS:
                companyCHname = row[0] + r'.*(?!Ltd|Limited|LTD|LIMITED)*'
                CHcompanies = re.compile(companyCHname)
                if CHcompanies.match(IBcompaniesString):
                    print('Match is: ', row[0], row[1])
                    with open(r'c:\temp\outputfile.csv', 'a', newline='') as o:
                        writer = csv.writer(o, delimiter=',')
                        writer.writerow(row)
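An aside that may help (not part of the original question): a lookahead placed after .* has already consumed the suffix, so it never excludes anything. A simpler approach is to normalise both names first, stripping any trailing Ltd/Limited suffix, and then compare the normalised strings. A minimal sketch, with a made-up helper name:

```python
import re

# Hypothetical helper: strip a trailing "Ltd"/"Limited" suffix (any case)
# plus surrounding punctuation/whitespace, so names compare directly.
_SUFFIX = re.compile(r'[\s.,]*\b(ltd|limited)\.?\s*$', re.IGNORECASE)

def normalise(name):
    return _SUFFIX.sub('', name).strip().lower()

print(normalise('Acme Widgets LTD'))      # acme widgets
print(normalise('Acme Widgets Limited'))  # acme widgets
print(normalise('Acme Widgets'))          # acme widgets
```

With this, row[0] from both files can be compared via normalise(...), for example by building a dict keyed on the normalised name from chdata.csv and looking each noregcompanies.csv name up in it.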


How to get specific column value from .csv Python3?

I have a .csv file with Bitcoin price and market data, and I want to get the 5th and 7th columns from the last row in the file. I have worked out how to get the last row, but I'm not sure how to extract columns (values) 5 and 7 from it. Code:
with open('BTCAUD_data.csv', mode='r') as BTCAUD_data:
    writer = csv.reader(BTCAUD_data, delimiter=',')
    data = list(BTCAUD_data)[-1]
    print(data)
Edit: How would I also add column names, and would adding them help me? (I have already manually put the names into individual columns in the first line of the file itself.)
Edit #2: Forget about the column names; they are unimportant. I still don't have a working solution. I have a vague idea that I'm not actually reading the file as a list, but rather as a string. (This means that when I subscript the data variable, I get a single character rather than an item in a list.) Any hints on how to read the line as a list?
Edit #3: I have got everything working to expectations now, thanks for everyone's help :)
Your code never uses the csv-reader. You can do so like this:
import csv

# This creates a file with demo data
with open('BTCAUD_data.csv', 'w') as f:
    f.write(','.join(f"header{u}" for u in range(10)) + "\n")
    for l in range(20):
        f.write(','.join(f"line{l}_{c}" for c in range(10)) + "\n")

# This reads and processes the demo data
with open('BTCAUD_data.csv', 'r', newline="") as BTCAUD_data:
    reader = csv.reader(BTCAUD_data, delimiter=',')
    # 1st line is the header
    header = next(reader)
    # skip through the file; row will be the last line read
    for row in reader:
        pass

print(header)
print(row)

# each row is a list and you can index into it
print(header[4], header[7])
print(row[4], row[7])
Output:
['header0', 'header1', 'header2', 'header3', 'header4', 'header5', 'header6', 'header7', 'header8', 'header9']
['line19_0', 'line19_1', 'line19_2', 'line19_3', 'line19_4', 'line19_5', 'line19_6', 'line19_7', 'line19_8', 'line19_9']
header4 header7
line19_4 line19_7
Better to use pandas for handling CSV files:
import pandas as pd
df = pd.read_csv('filename')
df.column_name gives you the corresponding column; for example, if the file has a Year column, df.Year returns it.
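As a sketch of the original task in pandas (the demo frame and its column names are made up, and positions 4 and 7 mirror the indices used in the answer above): df.iloc[-1] selects the last row, and .iloc on that row selects values by position.

```python
import pandas as pd

# Tiny demo frame standing in for BTCAUD_data.csv
# (column names here are invented for illustration).
df = pd.DataFrame(
    {f"col{i}": [f"r{r}_c{i}" for r in range(3)] for i in range(8)}
)

last_row = df.iloc[-1]                     # the last row, as a Series
print(last_row.iloc[4], last_row.iloc[7])  # values at positions 4 and 7
```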

Compare 2 CSV files (encoded = "utf8") keeping data format

I have 2 stock lists (new and old). How can I compare them to see which items have been added and which removed (happy to write them to 2 different files, added and removed)?
So far I have tried something along the lines of comparing row by row.
import csv
new = "new.csv"
old = "old.csv"
add_file = "add.csv"
remove_file = "remove.csv"
with open(new, encoding="utf8") as new_read, open(old, encoding="utf8") as old_read:
    new_reader = csv.DictReader(new_read)
    old_reader = csv.DictReader(old_read)
    for new_row in new_reader:
        for old_row in old_reader:
            if old_row["STOCK CODE"] == new_row["STOCK CODE"]:
                print("found")
This works for one item, but if I add an else: it just keeps printing that until a match is found, so it's not an accurate way of comparing the files.
I have about 5k rows.
There must be a better way to write the differences to the 2 files while keeping the same data structure at the same time?
N.B. I have tried this link: Python : Compare two csv files and print out differences
2 minor issues:
1. the data structure is not kept
2. there is no reference to the change of location
You could just read the data into memory and then compare.
I used sets for the codes in this example for faster lookup.
import csv

def get_csv_data(file_name):
    data = []
    codes = set()
    with open(file_name, encoding="utf8") as csv_file:
        reader = csv.DictReader(csv_file)
        for row in reader:
            data.append(row)
            codes.add(row['STOCK CODE'])
    return data, codes

def write_csv(file_name, data, codes):
    with open(file_name, 'w', encoding="utf8", newline='') as csv_file:
        headers = list(data[0].keys())
        writer = csv.DictWriter(csv_file, fieldnames=headers)
        writer.writeheader()
        for row in data:
            if row['STOCK CODE'] not in codes:
                writer.writerow(row)

new_data, new_codes = get_csv_data('new.csv')
old_data, old_codes = get_csv_data('old.csv')
write_csv('add.csv', new_data, old_codes)
write_csv('remove.csv', old_data, new_codes)
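The core idea, keeping rows whose STOCK CODE appears in one set but not the other, can be demonstrated without touching the filesystem (the data below is made up for illustration):

```python
import csv
import io

# Hypothetical in-memory stand-ins for new.csv and old.csv.
old_csv = "STOCK CODE,NAME\nA1,Widget\nB2,Gadget\n"
new_csv = "STOCK CODE,NAME\nB2,Gadget\nC3,Gizmo\n"

def load(text):
    # Parse the CSV text and collect the stock codes into a set.
    rows = list(csv.DictReader(io.StringIO(text)))
    return rows, {r["STOCK CODE"] for r in rows}

new_rows, new_codes = load(new_csv)
old_rows, old_codes = load(old_csv)

added = [r for r in new_rows if r["STOCK CODE"] not in old_codes]
removed = [r for r in old_rows if r["STOCK CODE"] not in new_codes]
print(added)    # rows only in new.csv
print(removed)  # rows only in old.csv
```

Because each kept item is the full DictReader row, the original data structure survives into add.csv and remove.csv.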

Using regex to find and delete data

Need to search through data and delete customer Social Security Numbers.
import csv, re

data = []
with open('customerdata.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        data.append(row)

for row in customerdata.csv:
    results = re.search(r'\d{3}-\d{2}-\d{4}', row)
    re.replace(results, "", row)
    print(results)
New to scripting and not sure what it is I need to do to fix this.
This is not a job for a regex.
You are using a csv.DictReader, which is awesome. This means you have access to the column names in your csv file. What you should do is make a note of the column that contains the SSN, then write out the row without it. Something like this (not tested):
with open('customerdata.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        del row['SSN']
        print(row)
If you need to keep the data but blank it out, then something like:
with open('customerdata.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        row['SSN'] = ''
        print(row)
Hopefully you can take things from here; for example, rather than printing, you might want to use a csv dict writer. Depends on your use case. Though, do stick with csv operations and definitely avoid regexes here. Your data is in csv format. Think about the data as rows and columns, not as individual strings to be regexed upon. :)
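For instance, a sketch of that suggestion (not from the answer; the column names and in-memory files are hypothetical) that writes the rows back out without the SSN column using csv.DictWriter:

```python
import csv
import io

# Hypothetical input standing in for customerdata.csv.
src = io.StringIO("name,SSN,city\nAlice,123-45-6789,Perth\nBob,987-65-4321,Hobart\n")
out = io.StringIO()

reader = csv.DictReader(src)
# Keep every column except SSN.
fields = [f for f in reader.fieldnames if f != "SSN"]
writer = csv.DictWriter(out, fieldnames=fields)
writer.writeheader()
for row in reader:
    del row["SSN"]
    writer.writerow(row)

print(out.getvalue())
```

In real code the io.StringIO objects would be replaced by open(...) calls on the input and output files.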
I'm not seeing a replace function for re in the Python 3.6.5 docs.
I believe the function you would want to use is re.sub:
re.sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged.
This means that all you need in your second for loop is (note the argument order: the pattern, then the replacement, then the string):
for row in data:
    results = re.sub(r'\d{3}-\d{2}-\d{4}', '', row)
    print(results)

Original order of columns in csv not retained in unicodecsv.DictReader

I am trying to read a CSV file into Python 3 using the unicodecsv library. Code follows:
with open('filename.csv', 'rb') as f:
    reader = unicodecsv.DictReader(f)
    Student_Data = list(reader)
But the order of the columns in the CSV file is not retained when I output any element of Student_Data; the columns come out in an arbitrary order. Is there anything wrong with the code? How do I fix this?
As stated in the csv.DictReader documentation, the DictReader returns each row as a dict, so it is not ordered.
You can obtain the list of the fieldnames with:
reader.fieldnames
But if you only want to obtain a list of the field values, in original order, you can just use a normal reader:
with open('filename.csv', 'rb') as f:
    reader = unicodecsv.reader(f)
    for row in reader:
        Student_Data = row
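If you want both dict-style access and the file's original column order, reader.fieldnames can serve as the ordering key. A sketch with the stdlib csv module rather than unicodecsv (the sample columns are made up):

```python
import csv
import io

# Stand-in for filename.csv (hypothetical columns).
f = io.StringIO("name,age,city\nAda,36,London\n")
reader = csv.DictReader(f)
row = next(reader)

# fieldnames preserves the file's original column order...
print(reader.fieldnames)                    # ['name', 'age', 'city']
# ...so values can be emitted in that order regardless of dict ordering.
print([row[k] for k in reader.fieldnames])  # ['Ada', '36', 'London']
```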

Can't perform reverse web search from a csv file

I've written some code to scrape "Address" and "Phone" for some shop names, and it works fine. However, it takes two parameters that must be filled in for it to run. I'd like to do the same from a csv file, where "Name" is in the first column and "Lid" in the second, with the harvested results placed in the third and fourth columns accordingly. At this point I can't work out how to drive the search from a csv file. Any suggestion will be vastly appreciated.
import requests
from lxml import html

Names = ["Literati Cafe", "Standard Insurance Co", "Suehiro Cafe"]
Lids = ["3221083", "497670909", "12183177"]
for Name in Names and Lids:
    Page_link = "https://www.yellowpages.com/los-angeles-ca/mip/" + Name.replace(" ", "-") + "-" + Name
    response = requests.get(Page_link)
    tree = html.fromstring(response.text)
    titles = tree.xpath('//article[contains(@class,"business-card")]')
    for title in titles:
        Address = title.xpath('.//p[@class="address"]/span/text()')[0]
        Contact = title.xpath('.//p[@class="phone"]/text()')[0]
        print(Address, Contact)
You can get your Names and Lids lists from CSV like:
import csv
Names, Lids = [], []
with open("file_name.csv", "r") as f:
    reader = csv.DictReader(f)
    for line in reader:
        Names.append(line["Name"])
        Lids.append(line["Lid"])
(nevermind PEP violations for now ;)). Then you can use these lists in the rest of your code, although I'm not sure what you are trying to achieve with your for Name in Names and Lids: loop; it's not giving you what you think it is: it will not loop through the Names list, only through the Lids list.
Also the first order of optimization should be to replace your loop with the loop over the CSV, like:
with open("file_name.csv", "r") as f:
    reader = csv.DictReader(f)
    for entry in reader:
        page_link = "https://www.yellowpages.com/los-angeles-ca/mip/{}-{}".format(entry["Name"].replace(" ", "-"), entry["Lid"])
        # rest of your scraping code...
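To also place the harvested results into the third and fourth columns, one sketch (the scrape_one helper is a made-up stand-in for the requests/lxml lookup above, so this runs without network access) reads each input row, extends it, and writes it out:

```python
import csv
import io

def scrape_one(name, lid):
    # Placeholder for the requests/lxml lookup; returns (address, phone).
    return f"{name} street", f"lid-{lid}-phone"

# Hypothetical in-memory input and output files.
src = io.StringIO("Name,Lid\nLiterati Cafe,3221083\nSuehiro Cafe,12183177\n")
out = io.StringIO()

reader = csv.DictReader(src)
writer = csv.writer(out)
writer.writerow(["Name", "Lid", "Address", "Phone"])
for entry in reader:
    address, phone = scrape_one(entry["Name"], entry["Lid"])
    writer.writerow([entry["Name"], entry["Lid"], address, phone])

print(out.getvalue())
```

Swapping the io.StringIO objects for open(...) calls gives an input file of names and ids and an output file with the scraped address and phone appended per row.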
