Using regex to find and delete data - python-3.x

Need to search through data and delete customer Social Security Numbers.
with open('customerdata.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        data.append(row)

for row in customerdata.csv:
    results = re.search(r'\d{3}-\d{2}-\d{4}', row)
    re.replace(results, "", row)
    print(results)
I'm new to scripting and not sure what I need to do to fix this.

This is not a job for a regex.
You are using a csv.DictReader, which is awesome. This means you have access to the column names in your csv file. What you should do is make a note of the column that contains the SSN, then write out the row without it. Something like this (not tested):
with open('customerdata.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        del row['SSN']
        print(row)
If you need to keep the data but blank it out, then something like:
with open('customerdata.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        row['SSN'] = ''
        print(row)
Hopefully you can take things from here; for example, rather than printing, you might want to use a csv dict writer. Depends on your use case. Though, do stick with csv operations and definitely avoid regexes here. Your data is in csv format. Think about the data as rows and columns, not as individual strings to be regexed upon. :)
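For instance, a minimal sketch of that last step using csv.DictWriter (the output file name and the 'SSN' column name are assumptions):

import csv

with open('customerdata.csv') as csvfile, \
     open('customerdata_clean.csv', 'w', newline='') as outfile:
    reader = csv.DictReader(csvfile)
    # Keep every column except the SSN
    fieldnames = [name for name in reader.fieldnames if name != 'SSN']
    writer = csv.DictWriter(outfile, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        del row['SSN']
        writer.writerow(row)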

I'm not seeing a replace function for re in the Python 3.6.5 docs.
I believe the function you would want to use is re.sub:
re.sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged.
This means that all you need in your second for loop is:
for row in open('customerdata.csv'):
    results = re.sub(r'\d{3}-\d{2}-\d{4}', '', row)
    print(results)
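Note the argument order: the pattern comes first, then the replacement, then the input string. A quick illustration with a made-up value:

import re

print(re.sub(r'\d{3}-\d{2}-\d{4}', '', 'SSN: 123-45-6789'))   # prints "SSN: "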

Related

Python problems writing rows in CSV

I have a script that reads a CSV and saves the second column to a list, and I'm trying to get it to write the contents of the list to a new CSV. The problem is that every entry should have its own row, but the new file puts everything into the same row.
I've tried moving the second with open block inside the first with open, and I've tried adding a for loop to the second with open, but no matter what I try I don't get the right results.
Here is the code:
import csv

col_store = []

with open('test-data.csv', 'r') as rf:
    reader = csv.reader(rf)
    for row in reader:
        col_store.append(row[1])

with open('meow.csv', 'wt') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerows([col_store])
In your case, if the column contains only single letters/numbers, then Y.R's answer will work.
For code that works in all cases, use this:
with open('meow.csv', 'wt') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerows(([_] for _ in col_store))
As mentioned in the docs, writerows expects 'an iterable of row objects', and every row object should be 'an iterable of strings or numbers' for Writer objects.
The problem is that you are passing '[col_store]' to 'writerows', which treats the whole of 'col_store' as a single row.
The simplest fix is to call:
csv_writer.writerows(col_store)
# instead of
csv_writer.writerows([col_store])
However, this will lead to a probably unwanted result: blank lines between the rows.
To solve this, use:
with open('meow.csv', 'wt', newline='') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerows(col_store)
For more about this, see CSV file written with Python has blank lines between each row.
Note: writerows expects 'an iterable of row objects', and 'row objects must be an iterable of strings or numbers' (https://docs.python.org/3/library/csv.html).
Therefore, in the generic case (trying to write integers, for example), you should use Sam's solution.
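To see why the wrapping matters, here is a minimal sketch (the file name is illustrative):

import csv

col_store = ['alpha', 'beta']

with open('demo.csv', 'wt', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(col_store)                # strings are iterated character by character: a,l,p,h,a
    writer.writerows([col_store])              # one row containing every value: alpha,beta
    writer.writerows([v] for v in col_store)   # one value per row: alpha / beta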

Writing each sublist in a list of lists to a separate CSV

I have a list of lists containing a varying number of strings in each sublist:
tq_list = [['The mysterious diary records the voice.', 'Italy is my favorite country', 'I am happy to take your donation', 'Any amount will be greatly appreciated.'], ['I am counting my calories, yet I really want dessert.', 'Cats are good pets, for they are clean and are not noisy.'], ['We have a lot of rain in June.']]
I would like to create a new CSV file for each sublist. All I have so far is a way to output each sublist as a row in the same CSV file using the following code:
name_list = ["sublist1", "sublist2", "sublist3"]

with open("{}.csv".format(*name_list), "w", newline="") as f:
    writer = csv.writer(f)
    for row in tq_list:
        writer.writerow(row)
This creates a single CSV file named 'sublist1.csv'.
I've toyed around with the following code:
name_list = ["sublist1", "sublist2", "sublist3"]

for row in tq_list:
    with open("{}.csv".format(*name_list), "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(row)
This also outputs only a single CSV file named 'sublist1.csv', and with only the values from the last sublist. I feel like this is a step in the right direction, but obviously I'm not quite there yet.
What the * in "{}.csv".format(*name_list) in your code actually does is this: It unpacks the elements in name_list to be passed into the function (in this case format). That means that format(*name_list) is equivalent to format("sublist1", "sublist2", "sublist3"). Since there is only one {} in your string, all arguments to format except "sublist1" are essentially discarded.
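A quick illustration of that unpacking, safe to run on its own:

name_list = ["sublist1", "sublist2", "sublist3"]
print("{}.csv".format(*name_list))   # prints sublist1.csv; the extra arguments are simply unused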
You might want to do something like this:
for index, row in enumerate(tq_list):
    with open("{}.csv".format(name_list[index]), "w", newline="") as f:
        ...
enumerate returns a counting index along with each element that it iterates over so that you can keep track of how many elements there have already been. That way you can write into a different file each time. You could also use zip, another handy function that you can look up in the Python documentation.
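For instance, a complete sketch using zip (assuming csv has been imported and tq_list and name_list are defined as above):

import csv

# Pair each file name with its sublist and write one file per pair
for name, row in zip(name_list, tq_list):
    with open("{}.csv".format(name), "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(row)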

Merge line in csv file python

I have this in a CSV file:
Titre,a,b,c,d,e
01,jean,paul,,
01,,,jack,
02,jeanne,jack,,
02,,,jean
and I want:
Titre,a,b,c,d,e
01,jean,paul,jack,
02,jeanne,jack,,jean
Can you help me?
In general, a good approach is to read the csv file and iterate through the rows using Python's CSV module.
CSV will create an iterator that will let you loop through your file like this:
import csv

with open('your filename.csv', 'r') as infile:
    reader = csv.reader(infile)
    for line in reader:
        for value in line:
            pass  # Do your thing
You're going to need to construct a new data set that has different properties. The requirements you described:
Ignore any empty cells
Any time you encounter a row that has a new index number, add a new row to your new data set
Any time you encounter a row that has an index number you've seen before, add it to the row that you already created (except for that index number value itself)
I'm not writing that part of the code for you because you need to learn and grow. It's a good task for a beginner.
Once you've constructed that data set, it will look like this:
example_processed_data = [["Titre", "a", "b", "c", "d", "e"],
                          ["01", "jean", "paul", "jack"],
                          ["02", "jeanne", "jack", "", "jean"]]
You can then create a CSV writer, and create your outfile by iterating over that data, similarly to how you iterated over the infile:
with open('outfile.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    for line in example_processed_data:
        writer.writerow(line)
    print("Done! Wrote", len(example_processed_data), "lines to outfile.csv.")

compiling a regex query with a row in python

I have two CSV files. I want to take each row of csv1 in turn, find its entry in csv2, and then pull out other information for that row from csv2.
I'm struggling because the items I am looking for are company names. In both csv1 and csv2 they may or may not have the suffix 'Ltd', 'LTD', 'Limited' or 'LIMITED' within the name, so I would like the query to ignore these substrings. My code is still only finding exact matches, rather than ignoring 'Ltd' etc. I'm guessing it's the way I'm combining 'row[0]' with the regex query, but I can't figure it out.
Code
import re, csv

with open(r'c:\temp\noregcompanies.csv', 'rb') as q:
    readerM = csv.reader(q)
    for row in readerM:
        companySourcename = row[0] + "".join(r'.*(?!Ltd|Limited|LTD|LIMITED).*')
        IBcompanies = re.compile(companySourcename)
        IBcompaniesString = str(companySourcename)
        with open(r'c:\temp\chdata.csv', 'rb') as f:
            readerS = csv.reader(f)
            for row in readerS:
                companyCHname = row[0] + "".join(r'.*(?!Ltd|Limited|LTD|LIMITED)*')
                CHcompanies = re.compile(companyCHname)
                if CHcompanies.match(IBcompaniesString):
                    print('Match is: ', row[0], row[1])
                    with open(r'c:\temp\outputfile.csv', 'ab') as o:
                        writer = csv.writer(o, delimiter=',')
                        writer.writerow(row)
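One common approach to this kind of matching (a sketch, not from the original thread) is to normalize both company names by stripping the suffix before comparing, instead of embedding a lookahead in the pattern:

import re

def normalize(name):
    # Drop a trailing Ltd/Limited suffix, in any case, plus surrounding whitespace
    return re.sub(r'\s+(Ltd\.?|Limited)\s*$', '', name.strip(), flags=re.IGNORECASE).lower()

print(normalize('Acme Widgets LIMITED') == normalize('Acme Widgets Ltd'))   # True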

Save Tweets as .csv, Contains String Literals and Entities

I have tweets saved in JSON text files. I have a friend who wants tweets containing keywords, and the tweets need to be saved in a .csv. Finding the tweets is easy, but I run into two problems and am struggling with finding a good solution.
Sample data are here. I have included the .csv file that is not working as well as a file where each row is a tweet in JSON format.
To get the tweets into a dataframe, I use pd.io.json.json_normalize. It works smoothly and handles nested dictionaries well, but pd.to_csv does not work because, as far as I can tell, it does not handle string literals well. Some of the tweets contain '\n' in the text field, and pandas writes new lines when that happens.
No problem, I process pd['text'] to remove '\n'. The resulting file still has too many rows, 1863 compared to the 1388 it should have. I then modified my code to replace all string literals:
tweets['text'] = [item.replace('\n', '') for item in tweets['text']]
tweets['text'] = [item.replace('\r', '') for item in tweets['text']]
tweets['text'] = [item.replace('\\', '') for item in tweets['text']]
tweets['text'] = [item.replace('\'', '') for item in tweets['text']]
tweets['text'] = [item.replace('\"', '') for item in tweets['text']]
tweets['text'] = [item.replace('\a', '') for item in tweets['text']]
tweets['text'] = [item.replace('\b', '') for item in tweets['text']]
tweets['text'] = [item.replace('\f', '') for item in tweets['text']]
tweets['text'] = [item.replace('\t', '') for item in tweets['text']]
tweets['text'] = [item.replace('\v', '') for item in tweets['text']]
Same result: pd.to_csv saves a file with more rows than actual tweets. I could replace string literals in all columns, but that is clunky.
Fine, don't use pandas. with open(outpath, 'w') as f: and so on creates a .csv file with the correct number of rows. Reading the file back, however, either with pd.read_csv or line by line, will fail.
It fails because of how Twitter handles entities. If a tweet's text contains a url, mention, hashtag, media, or link, then Twitter returns a dictionary that contains commas. When pandas flattens the tweet, the commas get preserved within a column, which is good. But when the data are read in, pandas splits what should be one column into multiple columns. For example, a column might look like [{'screen_name': 'ProfOsinbajo', 'name': 'Prof Yemi Osinbajo', 'id': 2914442873, 'id_str': '2914442873', 'indices': [0, 13]}], so splitting on commas creates too many columns:
[{'screen_name': 'ProfOsinbajo'
'name': 'Prof Yemi Osinbajo'
'id': 2914442873
'id_str': '2914442873'
'indices': [0
13]}]
That is the outcome when I use with open(outpath) as f: as well. With that approach, I have to split lines, so I split on commas. Same problem: I do not want to split on commas if they occur in a list.
I want those data to be treated as one column when saved to file or read from file. What am I missing? In terms of the data at the repository above, I want to convert forstackoverflow2.txt to a .csv with as many rows as tweets. Call this file A.csv, and let's say it has 100 columns. When opened, A.csv should also have 100 columns.
I'm sure there are details I've left out, so please let me know.
Using the csv module works. It writes the file out as a .csv while counting the lines, then reads it back in and counts the lines again.
The result matched, and opening the .csv in Excel also gives 191 columns and 1338 lines of data.
import json
import csv

with open('forstackoverflow2.txt') as f,\
     open('out.csv', 'w', encoding='utf-8-sig', newline='') as out:
    data = json.loads(next(f))
    print('columns', len(data))
    writer = csv.DictWriter(out, fieldnames=sorted(data))
    writer.writeheader()             # write header
    writer.writerow(data)            # write the first line of data
    for i, line in enumerate(f, 2):  # start line count at two
        data = json.loads(line)
        writer.writerow(data)
    print('lines', i)

with open('out.csv', encoding='utf-8-sig', newline='') as f:
    r = csv.DictReader(f)
    lines = list(r)
    print('readback columns', len(lines[0]))
    print('readback lines', len(lines))
Output:
columns 191
lines 1338
readback lines 1338
readback columns 191
@Mark Tolonen's answer is helpful, but I ended up going a separate route. When saving the tweets to file, I removed all \r, \n, \t, and \0 characters from anywhere in the JSON. Then, I saved the file as tab-separated so that commas in fields like location or text do not confuse a read function.
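A minimal sketch of that route (the cleaning step and file names are illustrative, not the asker's exact code):

import json
import pandas as pd

rows = []
with open('forstackoverflow2.txt') as f:
    for line in f:
        rows.append(json.loads(line))

df = pd.io.json.json_normalize(rows)
# Strip control characters that would otherwise break the row count
df['text'] = df['text'].str.replace(r'[\r\n\t\x00]', ' ', regex=True)
# Tab-separated output, so commas inside fields do not confuse a naive reader
df.to_csv('tweets.tsv', sep='\t', index=False)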
