Trying to compare two integers in Python - python-3.x

Okay, I have been digging through Stack Overflow and other sites trying to understand why this is not working. I created a function to open a CSV file. The function opens the file once to count the number of rows, then again to actually process the file. What I am attempting to do is this: once a file has been processed and the record counts match, I will load the data into a database. The problem is that the record counts are not matching. I checked both variables and they are both 'int', so I do not understand why '==' is not working for me. Here is the function I created:
import csv
import fnmatch
def mktdata_import(filedir):
    '''
    This function is used to import market data
    '''
    files = []
    files = filedir.glob('*.csv')
    for f in files:
        if fnmatch.fnmatch(f,'*NASDAQ*'):
            num_rows = 0
            nasObj = []
            with open(f,mode='r') as nasData:
                nasIn = csv.DictReader(nasData, delimiter=',')
                recNum = sum(1 for _ in nasData)
            with open(f,mode='r') as nasData:
                nasIn = csv.DictReader(nasData, delimiter=',')
                for record in nasIn:
                    if (recNum - 1) != num_rows:
                        num_rows += 1
                        nasObj.append(record)
                    elif(recNum - 1) == num_rows:
                        print('Add records to database')
                    else:
                        print('All files have been processed')
            print('{} has this many records: {}'.format(f, num_rows))
            print(type(recNum))
            print(type(num_rows))
        else:
            print("Not a NASDAQ file!")

(moving comment to answer)
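nasData includes all the rows in the file, including the header row. When converting the data to dictionaries with DictReader, only the data rows are processed, so the line count taken from nasData (recNum) will always be one more than the number of records yielded by nasIn. Inside the loop, num_rows only reaches recNum - 1 after the last record has been appended, so the elif branch is never entered.
As the OP mentioned, counting the records while iterating did not work, so comparing against the reader's line number was required to get the script working: (recNum) == nasIn.line_num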
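A minimal sketch of that line-number comparison, assuming the file has a single header row and no trailing blank lines, and that the check runs once the reader is exhausted. The function name load_if_complete is hypothetical and not part of the original post.

```python
import csv


def load_if_complete(path):
    """Return the data rows only if every line of the file was consumed."""
    # First pass: count physical lines, header included.
    with open(path, mode='r', newline='') as fh:
        total_lines = sum(1 for _ in fh)

    # Second pass: parse the rows; DictReader consumes the header itself.
    with open(path, mode='r', newline='') as fh:
        reader = csv.DictReader(fh, delimiter=',')
        records = list(reader)
        # line_num counts lines read from the source, header included.
        lines_read = reader.line_num

    if lines_read == total_lines:
        return records  # counts match: safe to hand over to the database step
    return None
```

Because line_num includes the header line, it can be compared to recNum directly, without subtracting 1.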

Related

Extracting data from multiple PDFs and putting that data into an Excel table

I am taking data extracted from multiple PDFs that were merged into one PDF.
The data is based on clinical measurements taken from a sample at different time points. Some time points have certain measurement values while others are missing.
So far, I've been able to merge the PDFs, extract the text and specific data from the text, but I want to put it all into a corresponding Excel table.
Below is my current code:
import PyPDF2
from PyPDF2 import PdfFileMerger
from glob import glob
#merge all pdf files in current directory
def pdf_merge():
    merger = PdfFileMerger()
    allpdfs = [a for a in glob("*.pdf")]
    [merger.append(pdf) for pdf in allpdfs]
    with open("Merged_pdfs1.pdf", "wb") as new_file:
        merger.write(new_file)
if __name__ == "__main__":
    pdf_merge()
#scan pdf
text =""
with open ("Merged_pdfs1.pdf", "rb") as pdf_file, open("sample.txt", "w") as text_file:
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    for page_number in range(0, number_of_pages):
        page = read_pdf.getPage(page_number)
        text += page.extractText()
    text_file.write(text)
#turn text script into list, separated by newlines
def Convert(text):
    li = list(text.split("\n"))
    return li
li = Convert(text)
filelines = []
for line in li:
    filelines.append(line)
print(filelines)
#extract data from text and put into dictionary
full_data = []
test_data = {"Sample":[], "Timepoint":[],"Phosphat (mmol/l)":[], "Bilirubin, total (µmol/l)":[],
"Bilirubin, direkt (µmol/l)":[], "Protein (g/l)":[], "Albumin (g/l)":[],
"AST (U/l)":[], "ALT (U/l)":[], "ALP (U/l)":[], "GGT (U/l)":[], "IL-6 (ng/l)":[]}
for line2 in filelines:
    # For each data item, extract it from the line and strip whitespace
    if line2.startswith("Phosphat"):
        test_data["Phosphat (mmol/l)"].append(line2.split(" ")[-2].strip())
    if line2.startswith("Bilirubin,total"):
        test_data["Bilirubin, total (µmol/l)"].append(line2.split(" ")[-2].strip())
    if line2.startswith("Bilirubin,direkt"):
        test_data["Bilirubin, direkt (µmol/l)"].append(line2.split(" ")[-4].strip())
    if line2.startswith("Protein "):
        test_data["Protein (g/l)"].append( line2.split(" ")[-2].strip())
    if line2.startswith("Albumin"):
        test_data["Albumin (g/l)"].append(line2.split(" ")[-2].strip())
    if line2.startswith("AST"):
        test_data["AST (U/l)"].append(line2.split(" ")[-2].strip())
    if line2.startswith("ALT"):
        test_data["ALT (U/l)"].append(line2.split(" ")[-4].strip())
    if line2.startswith("Alk."):
        test_data["ALP (U/l)"].append(line2.split(" ")[-2].strip())
    if line2.startswith("GGT"):
        test_data["GGT (U/l)"].append(line2.split(" ")[-4].strip())
    if line2.startswith("Interleukin-6"):
        test_data["IL-6 (ng/l)"].append(line2.split(" ")[-4].strip())
    for sampnum in range(100):
        num = str(sampnum)
        sampletype = "T" and "H"
        if line2.startswith(sampletype+num):
            sample = sampletype+num
            test_data["Sample"]=sample
    for time in range(0,360):
        timepoint = str(time) + "h"
        word_list = list(line2.split(" "))
        for word in word_list:
            if word == timepoint:
                test_data["Timepoint"].append(word)
full_data.append(test_data)
import pandas as pd
df = pd.DataFrame(full_data)
df.to_excel("IKC4.xlsx", sheet_name="IKC", index=False)
print(df)
The issue is that I'm wondering how to move the individual items in the lists to their own cells in Excel, with the proper timepoint, since they don't necessarily correspond to the right timepoint. For example, timepoints 1 and 3 can have protein measurements whereas timepoint 2 is missing this info, so the timepoint 3 measurement ends up at position 2 in the list and will likely land in the wrong row of the Excel table.
I figured maybe I need to make an alternative dictionary for the timepoints and attach the corresponding measurements to the proper timepoint. I'm starting to get confused about how to do all this, though, and am now asking for help!
Thanks in advance :)
I tried adding an "else" clause after every "if" statement to append a "-" when a measurement wasn't present for that timepoint, but I got far too many dashes, since the loop iterates through the lines of the entire PDF.
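One way the "dictionary per timepoint" idea could look is sketched below. It rests on assumptions that are not confirmed by the post: that a line naming a timepoint (such as "24h") appears before the measurements belonging to it, and that the token positions match those used in the question. FIELDS, TIMEPOINT, SAMPLE and rows_by_timepoint are hypothetical names, and only three of the measurements are listed to keep the sketch short. Note also that "T" and "H" in the question's code evaluates to just "H", so the pattern here matches either letter explicitly.

```python
import re

# Hypothetical subset of the question's fields:
# line prefix -> (column name, token index within the split line).
FIELDS = {
    "Phosphat": ("Phosphat (mmol/l)", -2),
    "Protein ": ("Protein (g/l)", -2),
    "ALT": ("ALT (U/l)", -4),
}
TIMEPOINT = re.compile(r"^\d+h$")  # tokens such as "0h" or "24h"
SAMPLE = re.compile(r"^[TH]\d+$")  # tokens such as "T1" or "H12"


def rows_by_timepoint(lines, missing="-"):
    """Build one row per timepoint, filling absent measurements with `missing`."""
    rows = []
    current = None  # the row that measurement lines are currently attached to
    for line in lines:
        words = line.split(" ")
        timepoint = next((w for w in words if TIMEPOINT.match(w)), None)
        if timepoint is not None:
            # Assumption: a line naming a timepoint opens a new record.
            sample = next((w for w in words if SAMPLE.match(w)), missing)
            current = {"Sample": sample, "Timepoint": timepoint}
            rows.append(current)
            continue
        if current is None:
            continue  # measurement seen before any timepoint: ignored here
        for prefix, (column, index) in FIELDS.items():
            if line.startswith(prefix) and len(words) >= -index:
                current[column] = words[index].strip()
    columns = ["Sample", "Timepoint"] + [column for column, _ in FIELDS.values()]
    # Every row gets every column, so a missing value cannot shift later ones.
    return [{c: row.get(c, missing) for c in columns} for row in rows]
```

Passing the returned list to pd.DataFrame(...) and then to_excel, as in the question, should give one row per timepoint, with "-" only where a measurement is actually absent.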

How to maintain the last index value in python list while doing iteration

My scenario is this: let's say I have a
list1= ["a","b","c"] and this list is dynamic (data is getting appended).
My requirement is that I need to send the list data to Event Hub each day, but I should not upload all the data every day. I just need to upload the delta.
My approach is:
index=0
for i in range(len(list1)):
    ## upload
    index=index+1
I want to preserve the latest index value. For example, in the first run index would be 2, and for the next run index should start at 3, not 0 as in the code above. How should I proceed?
I'd simply create a local file that stores the index. The next time you need it, load it and continue from there, assuming you are using the array and the script locally and only have to upload part of that array.
import os
index = 0
# Load the index saved by the previous run, if there is one.
if os.path.isfile("indexFile.txt"):
    with open("indexFile.txt", "r") as f:
        s = f.read()
    try:
        index = int(s)
    except ValueError:
        index = 0
print("index is: " + str(index))
print("do something with that index")
index += 1
print("store the index back to file")
with open("indexFile.txt", "w") as f:
    f.write(str(index))
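Applied to the original question, the stored index can be used to slice off only the new items. The following is a rough sketch under the assumption that the list only ever grows by appending; upload_delta and the upload callback are hypothetical stand-ins for whatever actually sends data to the event hub.

```python
import os


def upload_delta(items, upload, state_path="indexFile.txt"):
    """Upload only the items added since the last run; return how many were sent."""
    start = 0
    if os.path.isfile(state_path):
        with open(state_path, "r") as f:
            try:
                start = int(f.read().strip())
            except ValueError:
                start = 0  # unreadable state file: start from the beginning

    new_items = items[start:]  # the delta since the previous run
    for item in new_items:
        upload(item)

    # Remember how far we got, so the next run starts after the last item.
    with open(state_path, "w") as f:
        f.write(str(len(items)))
    return len(new_items)
```

Storing the length of the list rather than the last index means the next run can slice with it directly.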

Read out .csv and hand results to a dictionary

I am learning some coding, and I am stuck with an error I can't explain. Basically I want to read out a .csv file with birth statistics from the US to figure out the most popular name in the time recorded.
My code looks like this:
# 0:Id, 1: Name, 2: Year, 3: Gender, 4: State, 5: Count
names = {} # initialise dict names
maximum = 0 # store for maximum
l = []
with open("Filepath", "r") as file:
    for line in file:
        l = line.strip().split(",")
        try:
            name = l[1]
            if name in names:
                names[name] = int(names[name]) + int(l(5))
            else:
                names[name] = int(l(5))
        except:
            continue
print(names)
max(names)
def max(values):
    for i in values:
        if names[i] > maximum:
            names[i] = maximum
        else:
            continue
    return(maximum)
print(maximum)
It seems like the dictionary does not take any values at all, since the print command does not return anything. Where did I go wrong? (Incidentally, the file path is correct; it takes a while to get the result since the .csv is quite big. So my assumption is that I somehow made a mistake writing into the dictionary, but I have been staring at the code for a while now and I don't see it!)
A few suggestions to improve your code:
names = {} # initialise dict names
maximum = 0 # store for maximum
with open("Filepath", "r") as file:
    for line in file:
        l = line.strip().split(",")
        try:
            name = l[1]
            names[name] = names.get(name, 0) + int(l[5])
        except (IndexError, ValueError):
            continue # skip a header row or malformed lines
maximum = [(v, k) for k, v in names.items()] # (count, name) pairs
maximum.sort(reverse=True)
print(maximum[0])
You will want to look into Python dictionaries and learn about get. It helps you build your names dictionary in fewer lines of code (more Pythonic).
Also, you used def to define a function, but you never called that function. That is why it's not printing.
I propose the shorter code above. Ask if you have questions!
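One likely reason the dictionary in the question stays empty: l(5) uses parentheses, which tries to call the list and raises a TypeError, and the bare except then silently skips every line. The sketch below contrasts the two spellings; the sample row and both function names are made up for illustration.

```python
# Made-up row in the column order given in the question:
# 0: Id, 1: Name, 2: Year, 3: Gender, 4: State, 5: Count
row = "1,Mary,1910,F,AK,14".split(",")


def count_with_call(fields):
    """Mimic the question: parentheses instead of brackets, inside a bare except."""
    try:
        return int(fields(5))  # calling a list raises TypeError
    except:  # the bare except hides that error
        return None


def count_with_index(fields):
    """Index with square brackets and catch only the expected errors."""
    try:
        return int(fields[5])
    except (IndexError, ValueError):
        return None


print(count_with_call(row))   # the TypeError is swallowed
print(count_with_index(row))  # the count is parsed
```

Catching only IndexError and ValueError still skips a header row or a short line, but a mistake like l(5) would surface as a traceback instead of an empty dictionary.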
Figured it out.
I think there were a few flow issues: I called a function before defining it... is that an issue or is python okay with that?
Also I think I used max as a name for a variable, but there is a built-in function with the same name, that might cause an issue I guess?! Same with value
This is my final code:
names = {} # initialise dict names
l = []
def maxval(val):
    maxname = max(val.items(), key=lambda x : x[1])
    return maxname
with open("filepath", "r") as file:
    for line in file:
        l = line.strip().split(",")
        name = l[1]
        try:
            names[name] = names.get(name, 0) + int(l[5])
        except:
            continue
        #print(str(l))
#print(names)
print(maxval(names))

Compare 2 CSV files (encoded = "utf8") keeping data format

I have 2 stock lists (New and Old). How can I compare them to see what items have been added and what has been removed (happy to write them to 2 different files, added and removed)?
So far I have tried along the lines of looking row by row.
import csv
new = "new.csv"
old = "old.csv"
add_file = "add.csv"
remove_file = "remove.csv"
with open(new,encoding="utf8") as new_read, open(old,encoding="utf8") as old_read:
    new_reader = csv.DictReader(new_read)
    old_reader = csv.DictReader(old_read)
    for new_row in new_reader :
        for old_row in old_reader:
            if old_row["STOCK CODE"] == new_row["STOCK CODE"]:
                print("found")
This works for 1 item. If I add an else: it just keeps printing that until the item is found, so it's not an accurate way of comparing the files.
I have 5k rows.
There must be a better way to write the differences to the 2 different files and keep the same data structure at the same time?
N.B. I have tried this link: Python : Compare two csv files and print out differences
2 minor issues:
1. the data structure is not kept
2. there is no reference to the change of location
You could just read the data into memory and then compare.
I used sets for the codes in this example for faster lookup.
import csv
def get_csv_data(file_name):
    data = []
    codes = set()
    with open(file_name, encoding="utf8") as csv_file:
        reader = csv.DictReader(csv_file)
        for row in reader:
            data.append(row)
            codes.add(row['STOCK CODE'])
    return data, codes
def write_csv(file_name, data, codes):
    with open(file_name, 'w', encoding="utf8", newline='') as csv_file:
        headers = list(data[0].keys())
        writer = csv.DictWriter(csv_file, fieldnames=headers)
        writer.writeheader()
        for row in data:
            if row['STOCK CODE'] not in codes:
                writer.writerow(row)
new_data, new_codes = get_csv_data('new.csv')
old_data, old_codes = get_csv_data('old.csv')
write_csv('add.csv', new_data, old_codes)
write_csv('remove.csv', old_data, new_codes)
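A small illustration of why the nested loop in the question only works for one item, and of the set comparison that reading into memory makes possible. A csv reader is a one-shot iterator, so the inner loop has nothing left to yield on the second pass of the outer loop. The in-memory CSV strings below are made-up sample data, not the asker's files.

```python
import csv
import io

# Made-up sample data standing in for old.csv and new.csv.
old_csv = "STOCK CODE,QTY\nA1,5\nB2,3\n"
new_csv = "STOCK CODE,QTY\nB2,3\nC3,7\n"

old_reader = csv.DictReader(io.StringIO(old_csv))
first_pass = [row["STOCK CODE"] for row in old_reader]
second_pass = [row["STOCK CODE"] for row in old_reader]  # reader is already exhausted

# With the codes held in sets, the differences are plain set operations.
old_codes = set(first_pass)
new_codes = {row["STOCK CODE"] for row in csv.DictReader(io.StringIO(new_csv))}
added = sorted(new_codes - old_codes)
removed = sorted(old_codes - new_codes)

print(first_pass, second_pass)
print(added, removed)
```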

Is there a way to pass variable as counter to list index in python?

Sorry if I am asking a very basic question, but I am new to Python and need help with the question below.
I am trying to write a file parser that counts the number of occurrences (modified programs) mentioned in the file.
I then store all the occurrences in an initially empty list and keep a counter for each occurrence.
Up to here all is fine.
Now I am trying to create files based on the names captured in the list and store the lines in between that do not match in separate files, but I am getting an "index out of range" error: when I pass el[count], it seems to take count as a string and not count's value.
Can someone help?
import sys
import re
count =1
j=0
k=0
el=[]
f = open("change_programs.txt", 'w+')
data = open("oct-released_diff.txt",encoding='utf-8',errors='ignore')
for i in data:
    if len(i.strip()) > 0 and i.strip().startswith("diff --git"):
        count = count + 1
        el.append(i)
        fl=[]
    else:
        filename = "%s.txt" % el[int (count)]  # the line that raises the error
        h = open(filename, 'w+')
        fl.append(i)
        print(fl, file=h)
el = '\n'.join(el)
print(el, file=f)
print(filename)
data.close()
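A hedged note on the error: count is already an int, so int(count) changes nothing. The more likely cause is that count starts at 1 and is incremented before each append, so after the first "diff --git" line el has one element while count is 2, and el[count] points past the end of the list. The sketch below avoids the counter by remembering the current header instead. split_diff is a hypothetical helper, and it returns the grouped lines rather than writing files, since a raw "diff --git a/... b/..." line contains slashes and would need cleaning before it could serve as a file name.

```python
def split_diff(lines):
    """Group the lines of a diff under the 'diff --git' header they follow."""
    sections = {}
    current = None  # header of the section being filled
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("diff --git"):
            current = stripped
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
        # Lines before the first header are ignored in this sketch.
    return sections


# Made-up sample input.
sample = [
    "diff --git a/prog1.py b/prog1.py\n",
    "+print('hello')\n",
    "diff --git a/prog2.py b/prog2.py\n",
    "-x = 1\n",
    "+x = 2\n",
]
for header, body in split_diff(sample).items():
    print(header, len(body))
```

If a list is preferred over a dictionary, el[-1] always refers to the most recently appended header, with no counter to keep in sync.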
