Gap Analysis/Report for CSV in Python 3.6.2

Good afternoon. I wanted to create code in Python 3.6.2 that will essentially look for gaps in rows of consecutive numbers, such as this one:

Start End
MM0001 MM0009
MM0010 MM0020
MM0030 MM0039
MM0059 MM0071

It would then output the missing numbers to the screen in a format similar to the one below:

MM0021 MM0029
MM0040 MM0051
MM0052 MM0058

I've created some code for this based on an answer I found around here, but I don't believe it's complete, and I believe it was written for Python 2.7. I have, however, used it as a basis for what I was trying to do.
import csv

with open('thedata.csv') as csvfile:
    reader = csv.reader(csvfile)
    for line, row in enumerate(reader, 1):
        if not row:
            print 'Start of line', line, 'Contents', row
Any help will be greatly appreciated.

import csv

def out(*args):
    # zero-pad each endpoint to four digits and print them as "start,end"
    print('{},{}'.format(*(str(i).rjust(4, "0") for i in args)))

prev = 0
data = csv.reader(open('thedata.csv'))
print(*next(data), sep=', ')  # header
for line in data:
    start, end = (int(s.strip()[2:]) for s in line)
    if start != prev + 1:
        out(prev + 1, start - 1)  # the gap before this range
    prev = end
    out(start, end)  # echoes the current range too; drop this line to print only the gaps
It's really ugly, sorry, but it should work. It outputs comma-separated text.
If something doesn't work, or you want it to save to a file, just comment.
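For the save-to-a-file variant, here is a minimal sketch, assuming the same thedata.csv layout as above; gaps.csv is a hypothetical output name:

import csv

prev = 0
with open('thedata.csv') as src, open('gaps.csv', 'w', newline='') as dst:
    data = csv.reader(src)
    writer = csv.writer(dst)
    writer.writerow(next(data))  # copy the header row
    for line in data:
        start, end = (int(s.strip()[2:]) for s in line)
        if start != prev + 1:
            # write the missing range, restoring the MM prefix
            writer.writerow(['MM' + str(prev + 1).rjust(4, '0'),
                             'MM' + str(start - 1).rjust(4, '0')])
        prev = end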


Python 3.7 Start reading from a specific point within a csv

Hey, I could really use help here. I've spent an hour trying to find a solution in Python but was unable to find one.
I am using Python 3.7
My input is a file provided by a customer; I cannot change it. It is structured in the following way: it starts with random text not in CSV format, and from line 3 on, the rest of the file is in CSV format:
text line
text line
text line or nothing
[blank line]
"column Name 1","column Name 2", ... up to 6   <- start of the CSV part
"value1","value2", ... up to 6, continuing for many lines
I wanted to strip out the first three lines to create a pure CSV file, but was unable to find code that does this only for a specific line range. It also seems like the wrong solution, as I think starting to read from a certain point should be possible. Then I thought split() was the solution, but it did not work for this format; the values are sometimes numbers, dates, or strings. The seek() method cannot be used either, as the files start differently.
Right now my DictReader takes the first line as the header, and consequently the rest is rendered in chaos.
import csv
import pandas as pd
from prettytable import PrettyTable

with open(r'C:\Users\Hans\Downloads\file.csv') as csvfile:
    csv_reader = csv.DictReader(csvfile, delimiter=',')
    for line in csv_reader:
        print(line)
If an answer for Python has already been found, please link it; I was not able to find one.
Thank you so much for your help. I really appreciate it.
I will insist on the pandas option, given that the documentation clearly states that the skiprows parameter allows skipping the first n lines. I tried it with the example provided by @Chris Doyle (saving it to a file named line_file.csv) and it works as expected.
import pandas as pd
f = pd.read_csv('line_file.csv', skiprows=3)
Output
name num symbol
0 chris 4 $
1 adam 7 &
2 david 5 %
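If the number of junk lines varies from file to file, a small hedged variant is to find the header first and compute skiprows dynamically. This sketch assumes the header is the first line that contains a comma:

import pandas as pd

# assumption: the header row is the first line containing a comma
with open('line_file.csv') as fh:
    n_skip = next(i for i, line in enumerate(fh) if ',' in line)

df = pd.read_csv('line_file.csv', skiprows=n_skip)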
If you know the number of lines you want to skip, then just open the file, read that many lines, and pass the file handle to DictReader; it will read the remaining lines.
import csv

skip_n_lines = 3

with open('test.dat') as my_file:
    # consume the junk lines before handing the file to DictReader
    for _ in range(skip_n_lines):
        print("skipping line:", my_file.readline(), end='')
    print("###CSV DATA###")
    csv_reader = csv.DictReader(my_file)
    for row in csv_reader:
        print(row)
FILE
this is junk
this is more junk
last junk
name,num,symbol
chris,4,$
adam,7,&
david,5,%
OUTPUT
skipping line: this is junk
skipping line: this is more junk
skipping line: last junk
###CSV DATA###
OrderedDict([('name', 'chris'), ('num', '4'), ('symbol', '$')])
OrderedDict([('name', 'adam'), ('num', '7'), ('symbol', '&')])
OrderedDict([('name', 'david'), ('num', '5'), ('symbol', '%')])

How do I delete rows in one CSV based on another CSV

I am working with two CSV files; both contain only one column of data but are over 50,000 rows long. I need to compare the data from CSV1 against CSV2 and remove any data that appears in both files. I would like to write the final list of data out as a third CSV file if possible.
The CSV files contain usernames. I have tried running deduplication scripts, but realized that this does not entirely remove entries found in both CSV files, since it only removes duplicates of a username. This is what I have been working with, but I can already tell that it isn't going to give me the results I am looking for.
import csv

AD_AccountsCSV = open("AD_Accounts.csv", "r")
BA_AccountsCSV = open("BA_Accounts.csv", "r+")

def Remove(x, y):
    final_list = []
    for item in x:
        if item not in y:
            final_list.append(item)
    for i in y:
        if i not in x:
            final_list.append(i)
    print(final_list)
The way I wrote this code, the results are printed in the terminal after running the script, but I realize my output may be around 1,000 entries.
# define the paths
fpath1 = "/path/to/file1.csv"
fpath2 = "/path/to/file2.csv"
fpath3 = "/path/to/your/file3.csv"

with open(fpath1) as f1, open(fpath2) as f2, open(fpath3, "w") as f3:
    l1 = [line.strip() for line in f1.readlines()]
    l2 = [line.strip() for line in f2.readlines()]
    # keep only entries that appear in exactly one of the two files
    not_in_both = [x for x in set(l1 + l2) if (x in l1) != (x in l2)]
    for x in not_in_both:
        print(x, file=f3)
The with open() as ... clause takes care of closing the files.
You can combine several file openings under one with.
Assuming the elements in the files are the only things on each line, I used a simple readlines() and stripped the trailing newline character from each line (readlines() keeps it). Otherwise this step becomes more complicated.
List comprehensions make it easy to filter lists by conditions (a set-based sketch follows this list).
The default end='\n' in print() adds a newline at the end of each print.
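For files of this size, plain sets are also worth considering, since membership tests on lists are slow at 50,000 rows. A minimal set-based sketch, assuming one username per line and the fpath variables defined above:

with open(fpath1) as f1, open(fpath2) as f2:
    s1 = {line.strip() for line in f1}
    s2 = {line.strip() for line in f2}

with open(fpath3, "w") as f3:
    # symmetric difference: usernames that appear in exactly one file
    for name in sorted(s1 ^ s2):
        print(name, file=f3)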
Following the way you did it:
For formatting code, please follow the official style guide, PEP 8:
https://www.python.org/dev/peps/pep-0008/
def select_exclusive_accounts(path_to_f1, path_to_f2, path_to_f3):
    # you have quite huge indentations - use 4 spaces!
    with open(path_to_f1) as f1, open(path_to_f2) as f2, \
            open(path_to_f3, "w") as f3:
        in_f1 = f1.readlines()
        in_f2 = f2.readlines()
        for item in in_f1:
            if item not in in_f2:
                f3.write(item)
        for item in in_f2:
            if item not in in_f1:
                f3.write(item)

select_exclusive_accounts("AD_Accounts.csv",
                          "BA_Accounts.csv",
                          "exclusive_accounts.csv")
Also, no imports are needed here because these are all built-in Python features.

Remove unicode '\xa0' from pandas column

I was given a latin-1 .txt dataset, which I am trying to clean up for proper analysis using Python 3 and pandas. The dataset, being scraped from HTML, contains a number of \xa0 occurrences, which I can't seem to deal with using decode, strip, replace, or any other method I found on Stack Overflow. All of my attempts seem to be ignored by Python, which still prints out the same results. As I am new to data scraping, chances are I missed something obvious, but right now I don't see a way forward.
I have tried decoding to ASCII, stripping to str and then replacing, and replacing with u'' literals, none of which led to anything.
import pandas as pd

filepath = 'meow3.txt'

outF = open("myOutFile.txt", "a")
with open(filepath) as fp:
    for line in fp:
        # drop a leading comma if the line starts with one
        if line.strip().startswith(','):
            line = line.replace(',', '', 1)
        line = line.replace(u'\xa0', u' ')
        print(line)
        outF.write(line)
outF.close()

df = pd.read_csv('myOutFile.txt', sep=",", encoding="latin-1", header=None,
                 names=["Company name", "Total", "Invested since-to"])
print(df)
3sun Group, £10m ,Feb 2014
,Abacus, £4.15m ,Aug 2013
,Accsys Group ,£12m, Mar 2017,
Acro ,\xa0£7.8m, Nov 2015 – Sep 2017,
ACS Clothing, £25.3m ,Jan 2014
This is how the dataset looks, which is why my code removes the first comma, provided it is at the start of the line. But none of the suggested answers I tried helped with removing the \xa0 part of the dataset; I still get the same result (seen above). If anyone has any clue how I could make this work, I would be very grateful.
Cheers,
Jericho
Edit: While I know this would best be dealt with by pre-processing before it is turned into a txt file, I have no access to or control over that process, and I have to work with the data I was given.
I was suddenly stuck on this problem today and finally found a quick and neat solution.
Say your pandas DataFrame has a column with values like 'South\xa0Carolina'.
Use the following code to remove all '\xa0'. (I had also tried .replace("\xa0", " ") and .replace(r'\xa0', r' '), but neither of them worked.)
data['col'].str.split().str.join(' ')
Do this after reading the file:
df['col'] = df['col'].apply(lambda x: str(x).replace(u'\xa0', u''))
Maybe decoding the line to UTF-8 will help. Note that in Python 3 decode() only exists on bytes, so this applies only if the file was opened in binary mode:
line = line.decode('UTF-8')
Then do the string replacement after that, if necessary.
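If the stray characters only need to go once the data is in pandas, a hedged alternative is to clean every cell at load time with DataFrame.replace; regex=True makes the replacement apply inside cell values rather than only to exact whole-cell matches:

import pandas as pd

df = pd.read_csv('myOutFile.txt', sep=",", encoding="latin-1", header=None,
                 names=["Company name", "Total", "Invested since-to"])

# substitute the non-breaking space wherever it occurs inside a cell
df = df.replace('\xa0', ' ', regex=True)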

pandas.read_clipboard only reads whole lines not columns

I transferred all my Python 3 code from macOS to Ubuntu 18.04, and in one program I need to use pandas.read_clipboard(). At that point the clipboard contains a list with multiple lines, the columns divided by tabs and each element in quotation marks.
After just trying
import pandas as pd
df = pd.read_clipboard()
I'm getting this error: pandas.errors.ParserError: Expected 8 fields in line 3, saw 11. Error could possibly be due to quotes being ignored when a multi-char delimiter is used. Line 3 looks like "word1" "word2 and another" "word3" .... Ignoring the quotation marks you count 11 elements; respecting them you count 8.
In the next step I tried
import pandas as pd
df = pd.read_clipboard(sep='\t')
and I get no errors, but it results only in a Series with each line of the clipboard source as one element.
Yes, maybe one solution would be to write code that separates the elements of each line after this step, but because it works very well under macOS (with just pd.read_clipboard()) I hope there is a better solution.
Thank you for helping.
I wrote a workaround for my question. It's not an exact solution, but because I just need the elements of one column in an array, I solved it like this:
import pyperclip

# read clipboard
cb = pyperclip.paste()
# split into lines
cb_arr = cb.splitlines()

column = []
for cb_line in cb_arr:
    # split the line on the quote characters
    cb_words = cb_line.split("\"")
    # pick the element of column 1 (the first quoted field)
    word = cb_words[1]
    column.append(word)

# delete the column name
column.pop(0)
print(column)
Maybe it helps someone else, too.
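A hedged alternative: read_clipboard forwards its keyword arguments to read_csv, so if the clipboard content really is tab-separated with quoted fields, passing an explicit quote character may be enough (untested against this exact clipboard data):

import pandas as pd

# read_clipboard hands extra keyword arguments on to read_csv
df = pd.read_clipboard(sep='\t', quotechar='"')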

Iterate appending Python List output to rows in excel

As output of my Python code I get the marks of Randy and Shaw every time I run my program. I run this program a couple of times every month, and will for many years.
I am storing their marks in lists in Python, but how do I save them in the format below? I am getting the output one row at a time, for two different persons:
import pandas
from openpyxl import load_workbook

# These lists come from a very complicated piece of code, so I am just creating them here
L1 = ('7/6/2016', 24, 24, 13)
L2 = ('5/8/2016', 25, 24, 16)
L3 = ('7/6/2016', 21, 16, 19)
L4 = ('5/8/2016', 23, 24, 21)
L5 = ('4/11/2016', 13, 12, 17)

print("Randy's grades")
print(L1)
print(L2)
print(L3)

print("Shaw's grades")
print(L4)
print(L5)

book = load_workbook('C:/Users/Desktop/Masterfile.xlsx')
writer = pandas.ExcelWriter('Masterfile.xlsx', engine='openpyxl')
Output at run no 1:
For Randy
7/6/2016, 24,24,13
5/8/2016, 25,24,16
For Shaw
7/6/2016, 21,16,19
5/8/2016, 23,24,21
4/11/2016, 13, 12,17
Output at run no 2:
For Randy
7/8/2016, 24,24,13
5/9/2016, 25,24,16
For Shaw
7/8/2016, 21,16,19
5/9/2016, 23,24,21
I will have many such output runs over a couple of years, so I want to save the data by appending to the same document.
I am using openpyxl to open the document, and I know I need to use the append() operation, but I am having a hard time saving my lists as rows. I am new here; I understand the logic but am struggling with the syntax. Please help!
Thank you!
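Since the question explicitly mentions openpyxl's append(), here is a minimal sketch of that route (not from the answer below; it assumes the workbook from the question exists with a simple single sheet, and uses the first two tuples from the question):

from openpyxl import load_workbook

book = load_workbook('C:/Users/Desktop/Masterfile.xlsx')  # path from the question
ws = book.active  # or book['Sheet1'] for a specific sheet

ws.append(('Randy',))  # optional label row
for row in (('7/6/2016', 24, 24, 13), ('5/8/2016', 25, 24, 16)):
    ws.append(row)  # append() writes an iterable as the next row in the sheet

book.save('C:/Users/Desktop/Masterfile.xlsx')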
Since you said that you are willing to use csv format, I will show a csv solution.
with open('FileToWriteTo.csv', 'w') as outFile:
    outFile.write(','.join([str(item) for item in L1]))  # join everything in L1 with commas and write it to the file
    outFile.write('\n')  # write a newline
    outFile.write(','.join([str(item) for item in L2]))
    outFile.write('\n')
    outFile.write(','.join([str(item) for item in L3]))
    outFile.write('\n')
    outFile.write(','.join([str(item) for item in L4]))
    outFile.write('\n')
    outFile.write(','.join([str(item) for item in L5]))
    outFile.write('\n')
If you keep a list of lists instead of separate lists, this becomes easier with a for loop:
with open('FileToWriteTo.csv', 'w') as outFile:
    for row in listOfLists:
        outFile.write(','.join([str(item) for item in row]))
        outFile.write('\n')
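The same loop can also be handed off to the standard csv module, which takes care of the commas, quoting, and str() conversion. A small sketch using the question's first two tuples:

import csv

rows = [
    ('7/6/2016', 24, 24, 13),
    ('5/8/2016', 25, 24, 16),
]

# newline='' prevents blank lines between rows on Windows
with open('FileToWriteTo.csv', 'w', newline='') as outFile:
    csv.writer(outFile).writerows(rows)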
