I am trying to create dummy data for an NER task by replacing person_name entities with dummy names. But it gives me weird results when the same entity occurs multiple times, as discussed here:
Strange result when removing item from a list while iterating over it
Modifying list while iterating
Input example spans:
{
'text':"Mohan dob is 25th dec 1980. Mohan loves to play cricket.",
'spans':[{'start':0, 'end':5,'label':'person_name','ngram':'Mohan'},
{'start':28, 'end':33,'label':'person_name','ngram':'Mohan'},
{'start':13, 'end':26,'label':'date','ngram':'25th dec 1980'}
]
}
The entity person_name occurs twice in the sample.
sample_names=['Jon', 'Sam']
I want to replace (0, 5, 'person_name') and (28, 33, 'person_name') with sample_names.
Dummy Examples Output:
[
{'text':"Jon dob is 25th dec 1980. Jon loves to play cricket.",
'spans':[{'start':0, 'end':3,'label':'person_name','ngram':'Jon'},
{'start':26, 'end':29,'label':'person_name','ngram':'Jon'},
{'start':11, 'end':24,'label':'date','ngram':'25th dec 1980'}
]
},
{'text':"Sam dob is 25th dec 1980. Sam loves to play cricket.",
'spans':[{'start':0, 'end':3,'label':'person_name','ngram':'Sam'},
{'start':26, 'end':29,'label':'person_name','ngram':'Sam'},
{'start':11, 'end':24,'label':'date','ngram':'25th dec 1980'}
]
}
]
The spans also get updated in the output.
target_entity='person_name'
names=sample_names
Code:
def generate(data, target_entity, names):
    text = data['text']
    spans = data['spans']
    new_sents = []
    if spans:
        spans = [(d['start'], d['end'], d['label']) for d in spans]
        spans.sort()
        labellist = [s[2] for s in spans]
        # get before_spans and after_spans around target entity
        for n in names:
            gap = 0
            for i, tup in enumerate(spans):
                lab = tup[2]
                if lab == target_entity:
                    new_spans = {"before": spans[:i], "after": spans[i+1:]}
                    print("the spans before and after :\n", new_spans)
                    start = tup[0]  # check this
                    end = tup[1]
                    ngram = text[start:end]
                    new_s = text[:start] + n + text[end:]
                    gap = len(n) - len(ngram)
                    before = new_spans["before"]
                    after = [(tup[0]+gap, tup[1]+gap, tup[2]) for tup in new_spans["after"]]
                    s_sp = before + [(start, start + len(n), target_entity)] + after
                    text = new_s
                    en = {"text": new_s, "spans": [{"start": tup[0], "end": tup[1], "label": tup[2], "ngram": new_s[tup[0]:tup[1]]} for tup in s_sp]}
                    spans = s_sp
                    new_sents.append(en)
    return new_sents
If all you seek to do is replace the placeholder with a new value, you can do something like this:
## --------------------
## Some example input from you
## --------------------
input_data = [
(162, 171, 'pno'),
(241, 254, 'person_name'),
(373, 384, 'date'),
(459, 477, 'date'),
None,
(772, 785, 'person_name'),
(797, 806, 'pno')
]
## --------------------
## --------------------
## create an iterator out of our name list
## you will need to decide what happens if sample names
## gets exhausted.
## --------------------
sample_names = [
'Jon',
'Sam'
]
sample_names_itter = iter(sample_names)
## --------------------
for row in input_data:
    if not row:
        continue
    start = row[0]
    end = row[1]
    name = row[2] if row[2] != "person_name" else next(sample_names_itter)
    print(f"{name} dob is 25th dec 1980. {name} loves to play cricket.")
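If the spans have to be updated as well, one option is to rebuild the text from left to right for each name and keep a running offset, so that no list is modified while it is being iterated. The following is only a minimal sketch: it assumes the spans do not overlap, and it reuses the name generate with the argument order from the question.

```python
def generate(data, target_entity, names):
    """Return one new sample per name, replacing every target_entity span."""
    text = data["text"]
    # Sort by start so the running offset can be applied left to right.
    spans = sorted(data["spans"], key=lambda s: s["start"])
    samples = []
    for name in names:
        pieces = []     # chunks of the new text, joined at the end
        new_spans = []
        cursor = 0      # position reached in the ORIGINAL text
        shift = 0       # how far offsets have moved so far
        for span in spans:
            start, end, label = span["start"], span["end"], span["label"]
            pieces.append(text[cursor:start])
            ngram = name if label == target_entity else text[start:end]
            pieces.append(ngram)
            new_start = start + shift
            new_spans.append({"start": new_start,
                              "end": new_start + len(ngram),
                              "label": label,
                              "ngram": ngram})
            shift += len(ngram) - (end - start)
            cursor = end
        pieces.append(text[cursor:])
        samples.append({"text": "".join(pieces), "spans": new_spans})
    return samples


data = {
    "text": "Mohan dob is 25th dec 1980. Mohan loves to play cricket.",
    "spans": [
        {"start": 0, "end": 5, "label": "person_name", "ngram": "Mohan"},
        {"start": 28, "end": 33, "label": "person_name", "ngram": "Mohan"},
        {"start": 13, "end": 26, "label": "date", "ngram": "25th dec 1980"},
    ],
}
samples = generate(data, "person_name", ["Jon", "Sam"])
for sample in samples:
    print(sample["text"])
```

Because every sample is built from the original text and spans, the input is never mutated, and a second name does not see offsets already shifted by the first one.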
I have 2 txt files with names and scores. For example:
File 1            File 2            Desired Output
Name     Score    Name     Score    Name     Score
Michael  20       Michael  30       Michael  50
Adrian   40       Adrian   50       Adrian   90
Jane     60                         Jane     60
I want to sum the scores that belong to the same name and print them. I tried to pair names and scores in two different dictionaries and then merge the dictionaries. However, I can't keep the same name with different scores, so I'm stuck here. I've written something like the following:
d1 = dict()
d2 = dict()
with open('data1.txt', "r") as f:
    test = [i for line in f for i in line.split()]
    i = 0
    while i < len(test) - 1:
        d1[test[i]] = test[i + 1]
        i += 2
del d1['Name']
with open('data2.txt', "r") as f:
    test = [i for line in f for i in line.split()]
    i = 0
    while i < len(test) - 1:
        d2[test[i]] = test[i + 1]
        i += 2
del d2['Name']
z = dict(d2.items() | d1.items())
Using a dictionary comprehension should get you what you are after. I have assumed the contents of the files are:
File1.txt:
Name Score
Michael 20
Adrian 40
Jane 60
File2.txt:
Name Score
Michael 30
Adrian 50
Then you can get a total as:
with open("file1.txt", "r") as file_in:
    next(file_in)  # skip header
    # row.strip() filters blank lines; a bare "\n" is still truthy
    file1_data = dict(row.split() for row in file_in if row.strip())
with open("file2.txt", "r") as file_in:
    next(file_in)  # skip header
    file2_data = dict(row.split() for row in file_in if row.strip())
result = {
    key: int(file1_data.get(key, 0)) + int(file2_data.get(key, 0))
    for key
    in set(file1_data).union(file2_data)  # or: file1_data.keys() | file2_data.keys()
}
print(result)
This should give you a result like:
{'Michael': 50, 'Jane': 60, 'Adrian': 90}
Use defaultdict:
from collections import defaultdict

name_scores = defaultdict(int)
files = ('data1.txt', 'data2.txt')
for file in files:
    with open(file, 'r') as f:
        next(f)  # skip the header line
        for line in f:
            if not line.strip():
                continue  # ignore blank lines
            name, score = line.split()
            name_scores[name] += int(score)
edit: The header line has to be skipped (done with next(f) above), and split() already discards trailing whitespace.
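To try the idea without creating files on disk, here is a minimal sketch that uses collections.Counter and in-memory io.StringIO objects in place of the two text files. The function name sum_scores is hypothetical, and the sample contents are the ones assumed in the answer above.

```python
import io
from collections import Counter


def sum_scores(*sources):
    """Add up the scores per name across any number of open files."""
    totals = Counter()
    for source in sources:
        next(source, None)  # skip the "Name Score" header, if present
        for line in source:
            parts = line.split()
            if len(parts) != 2:
                continue  # ignore blank or malformed lines
            name, score = parts
            totals[name] += int(score)
    return dict(totals)


# Stand-ins for data1.txt and data2.txt.
file1 = io.StringIO("Name Score\nMichael 20\nAdrian 40\nJane 60\n")
file2 = io.StringIO("Name Score\nMichael 30\nAdrian 50\n")
totals = sum_scores(file1, file2)
print(totals)
```

With real files, the same function can be called with the objects returned by open() inside a with block.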
I'm new to Python and I need some help reading a file and counting the words in a column.
I have two data files, category.csv and data.csv.
category.csv:
CATEGORY
Technology
Furniture
Office Supplies
and below is data.csv
CATEGORY
Technology
Furniture
Technology
Furniture
Office Supplies
First, I want to select 'Technology' in category.csv and match it against data.csv, then count how many times 'Technology' appears in data.csv.
import csv  # import csv file

filePath1 = "category.csv"
filePath2 = "data.csv"
with open(filePath1) as csvfile1:  # open category file
    with open(filePath2) as csvfile2:  # open data file
        reader1 = csv.DictReader(csvfile1)  # dictread file
        reader2 = csv.DictReader(csvfile2)  # dictread file
        for row1 in reader1:  # read all row in data file
            for row2 in reader2:
                for row1['CATEGORY'] in row2['CATEGORY']:
                    total_tech = row2['CATEGORY'].count('Technology')
                    total_furn = row2['CATEGORY'].count('Furniture')
                    total_offi = row2['CATEGORY'].count('Office Supplies')
print("=============================================================================")
print("Display category average stock level")
print("=============================================================================")
print("Technology :", total_tech)
print("Furniture :", total_furn)
print("Office Supplies :", total_offi)
print("=============================================================================")
But I failed to count it with the above code. Can somebody help me? Thank you so much.
Here is the solution -
import csv  # import csv file

filePath1 = "category.csv"
filePath2 = "data.csv"
categories = {}
with open(filePath1) as csvfile:  # open category file
    reader = csv.DictReader(csvfile)  # dictread file
    for row in reader:  # Create a dictionary map of all the categories, and initialise count to 0
        categories[row["CATEGORY"]] = 0
with open(filePath2) as csvfile:  # open data file
    reader = csv.DictReader(csvfile)  # dictread file
    for row in reader:
        categories[row["CATEGORY"]] += 1  # For every item in data file, increment the count of the category
print("=============================================================================")
print("Display category average stock level")
print("=============================================================================")
for key, value in categories.items():
    print("{:<20} :{:>4}".format(key, value))
print("=============================================================================")
The output is like this -
=============================================================================
Display category average stock level
=============================================================================
Technology           :   2
Office Supplies      :   1
Furniture            :   2
=============================================================================
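The same count can also be sketched with collections.Counter, which returns 0 for a category that never appears in the data and does not raise a KeyError when the data contains a category missing from category.csv. This is only an illustration: io.StringIO objects stand in for the two CSV files, with the contents shown in the question.

```python
import csv
import io
from collections import Counter

# In-memory stand-ins for category.csv and data.csv.
category_csv = io.StringIO("CATEGORY\nTechnology\nFurniture\nOffice Supplies\n")
data_csv = io.StringIO(
    "CATEGORY\nTechnology\nFurniture\nTechnology\nFurniture\nOffice Supplies\n"
)

# Categories to report, in the order they are listed in category.csv.
categories = [row["CATEGORY"] for row in csv.DictReader(category_csv)]
# Occurrences of each value in the CATEGORY column of data.csv.
counts = Counter(row["CATEGORY"] for row in csv.DictReader(data_csv))

for category in categories:
    print("{:<20} :{:>4}".format(category, counts[category]))
```

With real files, pass the objects returned by open(filePath1) and open(filePath2) to csv.DictReader instead.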
I have been working on a program for a week now, but have been unable to get it to work according to the guidelines.
In this program (payroll.py), I have to open the CSV data file (employees.csv), read the records in the file, and produce a payroll report using the functions in payroll.py. The output should be printed, not written to a separate output file, and should end up looking like this:
LastName    FirstName   Hours   RegHours  OTHours   RegPay   OTPay    GrossPay  Deductions  NetPay
Hightower   Michael     42.0    40.0      2.0       400.00   30.00    430.00    107.07      322.93
Jackson     Samuel      53.0    40.0      13.0      506.00   246.68   752.67    187.42      565.25
Jones       Catherine   35.0    35.0      0.00      680.05   0.00     680.05    169.33      510.72
The payroll program works just fine on its own (without calling the CSV file), but when I try to call the file (using "from csv import reader"), one of two things happens:
1) I can call the first three columns (last name, first name, and hours), but I am unable to "insert" the additional columns (I get an index error because, of course, those columns don't exist in the original CSV file), or
2) The program only pulls up one entire record, which happens to be the last record in the CSV file.
Any guidance on how to accomplish this would be greatly appreciated. Thank you.
Here is the code for payroll.py:
def main() :
    employeeFirstName, employeeLastName = employeeFullName()
    employeePayRate, employeeHoursWorked = employeePay()
    employeeRegularHours, employeeOvertimeHours = calculateRegularHours(employeeHoursWorked)
    employeeOvertimeHours = calculateOvertimeHours(employeeHoursWorked)
    employeeTotalHours = calculateTotalHours(employeeRegularHours, employeeOvertimeHours)
    regularPayAmount = calculateRegularPay(employeePayRate, employeeRegularHours)
    overtimePayAmount = calculateOvertimePay(employeePayRate, employeeOvertimeHours)
    grossPayAmount = calculateGrossPay(regularPayAmount, overtimePayAmount)
    federalTaxWithheld = calculateFederalTax(grossPayAmount)
    stateTaxWithheld = calculateStateTax(grossPayAmount)
    medicareTaxWithheld = calculateMedicareTax(grossPayAmount)
    socSecTaxWithheld = calculateSocSecTax(grossPayAmount)
    totalTaxesWithheld = calculateTotalTaxes(federalTaxWithheld, stateTaxWithheld, medicareTaxWithheld, socSecTaxWithheld)
    netPayAmount = calculateNetPay(grossPayAmount, totalTaxesWithheld)
    payrollSummaryReport(employeeFirstName, employeeLastName, employeePayRate, employeeRegularHours, employeeOvertimeHours, employeeTotalHours, regularPayAmount, overtimePayAmount, grossPayAmount, federalTaxWithheld, stateTaxWithheld, medicareTaxWithheld, socSecTaxWithheld, totalTaxesWithheld, netPayAmount)

def employeeFullName() :
    employeeFirstName = str(input("Enter the employee's first name: "))
    employeeLastName = str(input("Enter the employee's last name: "))
    return employeeFirstName, employeeLastName

def employeePay() :
    employeePayRate = float(input("Enter the employee's hourly pay rate: "))
    employeeHoursWorked = float(input("Enter the employee's hours worked: "))
    return employeePayRate, employeeHoursWorked

def calculateRegularHours(employeeHoursWorked) :
    if employeeHoursWorked < 40 :
        employeeRegularHours = employeeHoursWorked
        employeeOvertimeHours = 0
    else:
        employeeRegularHours = 40
        employeeOvertimeHours = employeeHoursWorked - 40
    return employeeRegularHours, employeeOvertimeHours

def calculateOvertimeHours(employeeHoursWorked) :
    if employeeHoursWorked > 40 :
        employeeOvertimeHours = employeeHoursWorked - 40
    else :
        employeeOvertimeHours = 0
    return employeeOvertimeHours

def calculateTotalHours(employeeRegularHours, employeeOvertimeHours) :
    employeeTotalHours = employeeRegularHours + employeeOvertimeHours
    return employeeTotalHours

def calculateRegularPay(employeePayRate, employeeHoursWorked) :
    regularPayAmount = employeePayRate * employeeHoursWorked
    return regularPayAmount

def calculateOvertimePay(employeePayRate, employeeOvertimeHours) :
    overtimePayRate = 1.5
    overtimePayAmount = (employeePayRate * employeeOvertimeHours) * overtimePayRate
    return overtimePayAmount

def calculateGrossPay(regularPayAmount, overtimePayAmount) :
    grossPayAmount = regularPayAmount + overtimePayAmount
    return grossPayAmount

def calculateFederalTax(grossPayAmount) :
    federalTaxRate = 0.124
    federalTaxWithheld = grossPayAmount * federalTaxRate
    return federalTaxWithheld

def calculateStateTax(grossPayAmount) :
    stateTaxRate = 0.049
    stateTaxWithheld = grossPayAmount * stateTaxRate
    return stateTaxWithheld

def calculateMedicareTax(grossPayAmount) :
    medicareTaxRate = 0.014
    medicareTaxWithheld = grossPayAmount * medicareTaxRate
    return medicareTaxWithheld

def calculateSocSecTax(grossPayAmount) :
    socSecTaxRate = 0.062
    socSecTaxWithheld = grossPayAmount * socSecTaxRate
    return socSecTaxWithheld

def calculateTotalTaxes(federalTaxWithheld, stateTaxWithheld, medicareTaxWithheld, socSecTaxWithheld) :
    totalTaxesWithheld = federalTaxWithheld + stateTaxWithheld + medicareTaxWithheld + socSecTaxWithheld
    return totalTaxesWithheld

def calculateNetPay(grossPayAmount, totalTaxesWithheld) :
    netPayAmount = grossPayAmount - totalTaxesWithheld
    return netPayAmount

def payrollSummaryReport(employeeFirstName, employeeLastName, employeePayRate, employeeRegularHours, employeeOvertimeHours, employeeTotalHours, regularPayAmount, overtimePayAmount, grossPayAmount, federalTaxWithheld, stateTaxWithheld, medicareTaxWithheld, socSecTaxWithheld, totalTaxesWithheld, netPayAmount) :
    print()
    print("\t\t\t\t\t\tPayroll Summary Report")
    print()
    print("%-12s%-12s%-8s%-10s%-10s%-12s%-10s%-11s%-13s%-10s" % ("LastName", "FirstName", "Hours", "RegHours", "OTHours", "RegPay", "OTPay", "GrossPay", "Deductions", "NetPay"))
    print("%-12s%-12s%-8.2f%-10.2f%-10.2f$%-11.2f$%-9.2f$%-10.2f$%-12.2f$%-10.2f" % (employeeLastName, employeeFirstName, employeeTotalHours, employeeRegularHours, employeeOvertimeHours, regularPayAmount, overtimePayAmount, grossPayAmount, totalTaxesWithheld, netPayAmount))

main()
The CSV file (employees.csv) I need to use looks like this:
First,Last,Hours,Pay
Matthew,Hightower,42,10
Samuel,Jackson,53,12.65
Catherine,Jones,35,19.43
Charlton,Heston,52,10
Karen,Black,40,12
Sid,Caesar,38,15
George,Kennedy,25,35
Linda,Blair,42,18.6
Beverly,Garland,63,10
Jerry,Stiller,52,15
Efrem,Zimbalist,34,16
Linda,Harrison,24,14
Erik,Estrada,41,15.5
Myrna,Loy,40,14.23
You can treat your .csv file as a regular text file; there is no need for csv.reader. Here is a function that could process your file:
def get_data(fname):
    '''
    Function returns the dictionary with following
    format:
    { 0 : {
        "fname": "...",
        "lname": "...",
        "gross": "...",
        },
      1 : {
        ....,
        ,,,,
        },
    }
    '''
    result = {}  # return value
    i = 0  # you can zip range() if you want to
    with open(fname, 'r') as f:
        for line in f.readlines()[1:]:
            if not line.strip():
                continue  # skip blank lines
            result[i] = {}
            # list of values from file; strip() removes the trailing newline
            tmp = line.strip().split(",")
            # access file values by their index, e.g.
            # tmp[0] -> first name
            # tmp[1] -> last name
            # tmp[2] -> hours
            # tmp[3] -> pay rate
            # do calculations using your functions (calculateOvertimePay,
            # calculateTotalHours, etc.) and store the results in dictionary
            # e.g:
            result[i]["fname"] = tmp[0]
            result[i]["lname"] = tmp[1]
            hours, rate = float(tmp[2]), float(tmp[3])
            regular_hours, overtime_hours = calculateRegularHours(hours)
            # ...
            # do calculations for report
            # ...
            result[i]["regular"] = calculateRegularPay(rate, regular_hours)
            result[i]["overtime"] = calculateOvertimePay(rate, overtime_hours)
            result[i]["gross"] = calculateGrossPay(result[i]["regular"], result[i]["overtime"])
            i += 1
    return result
There are a couple of things you might want to do with your payrollSummaryReport(...) function to improve it:
1) replace the huge argument list with a dict or a list
2) tweak it a bit to fit your requirements
You might make these improvements this way:
def payrollSummaryReport(vals) :
    print()
    print("\t\t\t\t\t\tPayroll Summary Report")
    print()
    print("%-12s%-12s%-8s%-10s%-10s%-12s%-10s%-11s%-13s%-10s" %
          ("LastName", "FirstName", "Hours", "RegHours", "OTHours", "RegPay", "OTPay", "GrossPay", "Deductions", "NetPay"))
    for i in vals:
        row = vals[i]
        # Keys other than "fname", "lname" and "gross" are assumed names:
        # store each value under the matching key when you build the dict.
        print("%-12s%-12s%-8.2f%-10.2f%-10.2f$%-11.2f$%-9.2f$%-10.2f$%-12.2f$%-10.2f" %
              (row["lname"], row["fname"], row["hours"], row["reg_hours"], row["ot_hours"],
               row["regular"], row["overtime"], row["gross"], row["deductions"], row["net"]))
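Putting the pieces together, here is a minimal end-to-end sketch that computes every report column for every record. It is not the only way to do it: it uses csv.DictReader rather than manual splitting, the names payroll_row and build_report are hypothetical, and the overtime multiplier and tax rates are copied from payroll.py. An io.StringIO object with the first three records of employees.csv stands in for the real file. The calculation happens inside the loop over the records, which is what keeps the report from ending up with only the last record.

```python
import csv
import io

OVERTIME_MULTIPLIER = 1.5
# Sum of the four withholding rates defined in payroll.py.
TOTAL_TAX_RATE = 0.124 + 0.049 + 0.014 + 0.062

HEADER_FORMAT = "%-12s%-12s%-8s%-10s%-10s%-12s%-10s%-11s%-13s%-10s"
ROW_FORMAT = "%-12s%-12s%-8.1f%-10.1f%-10.1f%-12.2f%-10.2f%-11.2f%-13.2f%-10.2f"


def payroll_row(record):
    """Turn one CSV record (First, Last, Hours, Pay) into a report row."""
    hours = float(record["Hours"])
    rate = float(record["Pay"])
    reg_hours = min(hours, 40.0)
    ot_hours = max(hours - 40.0, 0.0)
    reg_pay = rate * reg_hours
    ot_pay = rate * ot_hours * OVERTIME_MULTIPLIER
    gross = reg_pay + ot_pay
    deductions = gross * TOTAL_TAX_RATE
    return (record["Last"], record["First"], hours, reg_hours, ot_hours,
            reg_pay, ot_pay, gross, deductions, gross - deductions)


def build_report(csv_file):
    """Compute one row per record; the derived columns are added here."""
    return [payroll_row(record) for record in csv.DictReader(csv_file)]


# Stand-in for employees.csv (first three records only).
sample = io.StringIO(
    "First,Last,Hours,Pay\n"
    "Matthew,Hightower,42,10\n"
    "Samuel,Jackson,53,12.65\n"
    "Catherine,Jones,35,19.43\n"
)
rows = build_report(sample)
print(HEADER_FORMAT % ("LastName", "FirstName", "Hours", "RegHours", "OTHours",
                       "RegPay", "OTPay", "GrossPay", "Deductions", "NetPay"))
for row in rows:
    print(ROW_FORMAT % row)
```

To run it against the real file, replace the StringIO object with open("employees.csv", newline="") inside a with block.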