How to convert cmudict-0.7b or cmudict-0.7b.dict into FST format to use it with phonetisaurus? - cmusphinx

I am looking for a simple procedure to generate an FST (finite state transducer) from cmudict-0.7b or cmudict-0.7b.dict, to be used with phonetisaurus.
I tried the following set of commands (phonetisaurus align, Google NGram library and phonetisaurus arpa2wfst) and was able to generate an FST, but it didn't work. I am not sure where I made a mistake or missed a step. I suspect the very first command, phonetisaurus-align, is not correct.
phonetisaurus-align --input=cmudict.dict --ofile=cmudict/cmudict.corpus --seq1_del=false
ngramsymbols < cmudict/cmudict.corpus > cmudict/cmudict.syms
/usr/local/bin/farcompilestrings --symbols=cmudict/cmudict.syms --keep_symbols=1 cmudict/cmudict.corpus > cmudict/cmudict.far
ngramcount --order=8 cmudict/cmudict.far > cmudict/cmudict.cnts
ngrammake --v=2 --bins=3 --method=kneser_ney cmudict/cmudict.cnts > cmudict/cmudict.mod
ngramprint --ARPA cmudict/cmudict.mod > cmudict/cmudict.arpa
phonetisaurus-arpa2wfst-omega --lm=cmudict/cmudict.arpa > cmudict/cmudict.fst
I tried the FST with phonetisaurus-g2p as follows:
phonetisaurus-g2p --model=cmudict/cmudict.fst --nbest=3 --input=HELLO --words
But it didn't return anything.
I'd appreciate any help on this matter.

It is very important to keep the dictionary in the right format. Phonetisaurus is very sensitive about that: it requires the word and the phonemes to be tab-separated; spaces will not work. It also does not allow the pronunciation variant numbers CMUSphinx uses, like (2) or (3). You need to clean up the dictionary, for example with a simple Python script, before feeding it into Phonetisaurus. Here is the one I use:
#!/usr/bin/python

import sys

if len(sys.argv) != 3:
    print "Split the list on train and test sets"
    print
    print "Usage: traintest.py file split_count"
    exit()

infile = open(sys.argv[1], "r")
outtrain = open(sys.argv[1] + ".train", "w")
outtest = open(sys.argv[1] + ".test", "w")

cnt = 0
split_count = int(sys.argv[2])
for line in infile:
    items = line.split()
    # Strip pronunciation variant markers like "(2)" from the word.
    if items[0][-1] == ')':
        items[0] = items[0][:-3]
    # Drop entries containing underscores.
    if items[0].find("_") > 0:
        continue
    # Re-join the entry with a tab between the word and its phonemes.
    line = items[0] + '\t' + " ".join(items[1:]) + '\n'
    # Send every split_count-th entry to the test set.
    if cnt % split_count == 3:
        outtest.write(line)
    else:
        outtrain.write(line)
    cnt = cnt + 1
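For example, assuming the dictionary file sits in the working directory, running python traintest.py cmudict-0.7b.dict 10 writes cmudict-0.7b.dict.train and cmudict-0.7b.dict.test (one entry in ten goes to the test set), with each line now tab-separated and the (2)/(3) variant markers stripped; you would then point phonetisaurus-align at the .train file instead of the raw dictionary.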

Related

Python skip the lines which do not have any of starting line in the output

I am trying to write code (after getting help from Google and SO) to parse a command's output, but I am still having a problem. Each record in the output should be three consecutive lines starting with dn, instance and tag, but the very first record only contains dn and tag. I want to skip any record that does not have all three of these starting strings; as I am still learning, I cannot work out how to do that.
Below is my code:
import subprocess as sp

p = sp.Popen(somecmd, shell=True, stdout=sp.PIPE)
stout = p.stdout.read().decode('utf8')
output = stout.splitlines()
startline = ["instance:", "tag"]
for line in output:
    print(line)
Script output:
dn: ou=People,ou=pti,o=pt
tag: pti00631
dn: cn=pti00857,ou=People,ou=pti,o=pt
instance: Jassu Lal
tag: pti00857
dn: cn=pti00861,ou=People,ou=pti,o=pt
instance: Gatti Lal
tag: pti00861
Desired output:
dn: cn=pti00857,ou=People,ou=pti,o=pt
instance: Jassu Lal
tag: pti00857
dn: cn=pti00861,ou=People,ou=pti,o=pt
instance: Gatti Lal
tag: pti00861
Assuming your output always has this shape, your loop can look like this:
lines_to_skip = 2  # the incomplete record has two lines: dn and tag
skip_lines = False
skipped_lines = 0
for line in output:
    if "dn: " in line and "dn: cn" not in line:
        skip_lines = True
    if skip_lines:
        if skipped_lines < lines_to_skip:
            skipped_lines += 1
            continue
        if skipped_lines == lines_to_skip:
            skip_lines = False
            skipped_lines = 0
    print(line)
It checks for a dn without the cn, skips the lines of that incomplete record (lines_to_skip is 2 here, since the record only has the dn and tag lines) and resumes output once they have been consumed.
It's a pretty hacky solution, but it's the best one I could come up with for the given context.
The code below is more flexible. You only need to add the tags without which you do not want to print to the necessary_tags dictionary; there can be more than three. It also accounts for receiving a particular tag more than once.
import subprocess as sp

p = sp.Popen(somecmd, shell=True, stdout=sp.PIPE)
stout = p.stdout.read().decode('utf8')
output = stout.splitlines()
output.append("")

necessary_tags = {'dn': 0, 'instance': 0, 'tag': 0}
temp_output = []
for line in output:
    tag = line.split(':')[0].strip()
    if necessary_tags.get(tag, -1) != -1:
        necessary_tags[tag] += 1
        temp_output.append(line)
    elif line == "":
        if all(necessary_tags.values()):
            for out in temp_output:
                print(out)
        temp_output = []
        necessary_tags.update({}.fromkeys(necessary_tags, 0))
        print()
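To see the grouping work without the LDAP command, here is a compact standalone variant of the same idea run on the question's sample data. Note that both it and the answer above rely on a blank line separating records (standard in LDIF output); if your output really has no separators, you would need to split on the dn: lines instead.
# Standalone check of the grouping idea on the question's sample data.
# Assumes records are separated by blank lines, as in typical LDIF output.
sample = """\
dn: ou=People,ou=pti,o=pt
tag: pti00631

dn: cn=pti00857,ou=People,ou=pti,o=pt
instance: Jassu Lal
tag: pti00857

dn: cn=pti00861,ou=People,ou=pti,o=pt
instance: Gatti Lal
tag: pti00861
"""

required = {"dn", "instance", "tag"}
for block in sample.split("\n\n"):
    lines = [l for l in block.splitlines() if l]
    seen = {l.split(":")[0] for l in lines}
    if required <= seen:  # keep only records that have all three tags
        print("\n".join(lines))
        print()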

How to handle blank line,junk line and \n while converting an input file to csv file

Below is the sample data in the input file. I need to process this file and turn it into a CSV file. With some help I was able to convert it to a CSV file, but the conversion is not complete: I am not able to handle the \n, the junk line (2nd line) and the blank line (4th line). I also need help filtering on transaction_type, i.e. skipping the "rewrite" transaction_type.
{"transaction_type": "new", "policynum": 4994949}
44uu094u4
{"transaction_type": "renewal", "policynum": 3848848,"reason": "Impressed with \n the Service"}
{"transaction_type": "cancel", "policynum": 49494949, "cancel_table":[{"cancel_cd": "AU"}, {"cancel_cd": "AA"}]}
{"transaction_type": "rewrite", "policynum": 5634549}
Below is the code
import ast
import csv

with open('test_policy', 'r') as in_f, open('test_policy.csv', 'w') as out_f:
    data = in_f.readlines()
    writer = csv.DictWriter(
        out_f,
        fieldnames=['transaction_type', 'policynum', 'cancel_cd', 'reason'],
        lineterminator='\n',
        extrasaction='ignore')
    writer.writeheader()
    for row in data:
        dict_row = ast.literal_eval(row)
        if 'cancel_table' in dict_row:
            cancel_table = dict_row['cancel_table']
            cancel_cd = []
            for cancel_row in cancel_table:
                cancel_cd.append(cancel_row['cancel_cd'])
            dict_row['cancel_cd'] = ','.join(cancel_cd)
        writer.writerow(dict_row)
Below is my output; it does not yet handle the junk line, the blank line or the "rewrite" transaction type.
transaction_type,policynum,cancel_cd,reason
new,4994949,,
renewal,3848848,,"Impressed with
the Service"
cancel,49494949,"AU,AA",
Expected output
transaction_type,policynum,cancel_cd,reason
new,4994949,,
renewal,3848848,,"Impressed with the Service"
cancel,49494949,"AU,AA",
Hmm, I tried to fix this, but I do not know how CSV files work; with my limited knowledge I would suggest running this code before you convert the file.
txt = {"transaction_type": "renewal",
       "policynum": 3848848,
       "reason": "Impressed with \n the Service"}

newTxt = {}
for i, j in txt.items():
    # local (temporary) variables
    lastX = ""
    correctJ = ""
    # check whether j contains the "\n" whitespace and take it out
    if "\n" in str(j):
        j = j.replace("\n", "")
    # for grammar purposes, check whether
    # j has at least one space
    if " " in str(j):
        # if yes, check it closely (character by character)
        for x in [j[y:y+1] for y in range(0, len(j), 1)]:
            # if two spaces are consecutive, drop this one
            if x == " " and lastX == " ":
                pass
            # if not, append the value to correctJ
            else:
                correctJ += x
            # remember the last value checked
            lastX = x
        # at the end make j the corrected string (in case j had no errors)
        j = correctJ
    # add the (possibly corrected) value to a new dictionary
    newTxt[i] = j

# show the result
print(f"txt = {txt}\nnewTxt = {newTxt}")
Terminal:
txt = {'transaction_type': 'renewal', 'policynum': 3848848, 'reason': 'Impressed with \n the Service'}
newTxt = {'transaction_type': 'renewal', 'policynum': 3848848, 'reason': 'Impressed with the Service'}
Process finished with exit code 0
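For completeness, here is a sketch of my own (not taken from the answers above) that extends the question's original code to cover all four requirements: it skips blank lines, skips junk lines that ast.literal_eval cannot parse, collapses the embedded \n in reason, and drops "rewrite" transactions. The file names are the ones from the question.
import ast
import csv

with open('test_policy', 'r') as in_f, open('test_policy.csv', 'w') as out_f:
    writer = csv.DictWriter(
        out_f,
        fieldnames=['transaction_type', 'policynum', 'cancel_cd', 'reason'],
        lineterminator='\n',
        extrasaction='ignore')
    writer.writeheader()
    for row in in_f:
        row = row.strip()
        if not row:  # blank line
            continue
        try:
            dict_row = ast.literal_eval(row)
        except (ValueError, SyntaxError):  # junk line such as "44uu094u4"
            continue
        if not isinstance(dict_row, dict):
            continue
        if dict_row.get('transaction_type') == 'rewrite':  # filter rewrites
            continue
        if 'reason' in dict_row:  # collapse the embedded "\n" and extra spaces
            dict_row['reason'] = ' '.join(dict_row['reason'].split())
        if 'cancel_table' in dict_row:
            dict_row['cancel_cd'] = ','.join(
                c['cancel_cd'] for c in dict_row['cancel_table'])
        writer.writerow(dict_row)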

Linecache getline does not work after my application was installed

I am creating a tool that gives an overview of hundreds of test results. The tool opens a log file and checks for Pass and Fail verdicts. When there is a fail, I need to go back to previous lines of the log to capture the cause of the failure.
linecache.getline works in my workspace (Python run via Eclipse). But after I created a Windows installer (.exe file) and installed the application on my computer, linecache.getline returns nothing. Is there something I need to add to my setup.py file to fix this, or is it an issue in my code?
Tool Code
# precon: the log file path comes from a wx.FileDialog
self.result_path = dlg.GetPath()
try:
    with open(self.result_path, 'r') as file:
        self.checkLog(self.result_path, file)
except IOError:
    pass  # (error handling elided in this excerpt)

def checkLog(self, path, f):
    line_no = 1
    index = 0
    for line in f:
        n = re.search("FAIL", line, re.IGNORECASE) or re.search("PASS", line, re.IGNORECASE)
        if n:
            currentline = re.sub('\s+', ' ', line.rstrip())
            finalresult = currentline
            self.list_ctrl.InsertStringItem(index, finaltestname)
            self.list_ctrl.SetStringItem(index, 1, finalresult)
            if currentline == "FAIL":
                fail_line1 = linecache.getline(path, int(line_no - 3))  # Get reason of failure
                fail_line2 = linecache.getline(path, int(line_no - 2))  # Get reason of failure
                cause = fail_line1.strip() + " " + fail_line2.strip()
                self.list_ctrl.SetStringItem(index, 2, cause)
            index += 1
        line_no += 1
The issue was resolved by using the get_line function from this link:
Python: linecache not working as expected?
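For readers who don't want to chase the link: the workaround amounts to reading the file directly rather than going through linecache, whose lookup machinery apparently misbehaves inside a frozen .exe. A minimal helper along those lines (my approximation, not the exact code from the link):
def get_line(path, line_no):
    """Return the 1-based line `line_no` from `path`, or '' if out of range."""
    with open(path, 'r') as f:
        for i, line in enumerate(f, start=1):
            if i == line_no:
                return line
    return ''

# In checkLog, the linecache calls would then become:
# fail_line1 = get_line(path, line_no - 3)
# fail_line2 = get_line(path, line_no - 2)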

Lua - Cipher Logic Error Involving string.gsub, Ciphered Output is Not Input

My program seems to be experiencing logical errors. I have looked it over multiple times and even written another program similar to this one (also seems to have the same error). I cannot figure out what is wrong, although I think it may involve my usage of string.gsub...
repeat
    local file = io.open("out.txt", "w")
    print("would you like to translate Cipher to English or English to Cipher?")
    print("enter 1 for translation to English. enter 2 for translation to Cipher")
    tchoice=io.read()
    if tchoice=="2" then
        print(" enter any text to translate it: ")
        rawtextin=io.read()
        text=string.lower(rawtextin)
        text1=string.gsub(text,"a","q")
        text2=string.gsub(text1,"b","e")
        text3=string.gsub(text2,"c","h")
        text4=string.gsub(text3,"d","c")
        text5=string.gsub(text4,"e","j")
        text6=string.gsub(text5,"f","m")
        text7=string.gsub(text6,"g","r")
        text8=string.gsub(text7,"h","g")
        text9=string.gsub(text8,"i","b")
        text10=string.gsub(text9,"j","a")
        text11=string.gsub(text10,"k","d")
        text12=string.gsub(text11,"l","y")
        text13=string.gsub(text12,"m","v")
        text14=string.gsub(text13,"n","z")
        text15=string.gsub(text14,"o","x")
        text16=string.gsub(text15,"p","k")
        text17=string.gsub(text16,"q","i")
        text18=string.gsub(text17,"r","l")
        text19=string.gsub(text18,"s","f")
        text20=string.gsub(text19,"t","s")
        text21=string.gsub(text20,"u","w")
        text22=string.gsub(text21,"v","t")
        text23=string.gsub(text22,"w","p")
        text24=string.gsub(text23,"x","u")
        text25=string.gsub(text24,"y","n")
        text26=string.gsub(text25,"z","o")
        text27=string.gsub(text26," ","#")
        print(text27)
    elseif tchoice=="1" then
        print("enter text!")
        rawtextin=io.read()
        text=string.lower(rawtextin)
        text1=string.gsub(text,"q","a")
        text2=string.gsub(text1,"e","b")
        text3=string.gsub(text2,"h","c")
        text4=string.gsub(text3,"c","d")
        text5=string.gsub(text4,"j","e")
        text6=string.gsub(text5,"m","f")
        text7=string.gsub(text6,"r","g")
        text8=string.gsub(text7,"g","h")
        text9=string.gsub(text8,"b","i")
        text10=string.gsub(text9,"a","j")
        text11=string.gsub(text10,"d","k")
        text12=string.gsub(text11,"y","l")
        text13=string.gsub(text12,"v","m")
        text14=string.gsub(text13,"z","n")
        text15=string.gsub(text14,"x","o")
        text16=string.gsub(text15,"k","p")
        text17=string.gsub(text16,"i","q")
        text18=string.gsub(text17,"l","r")
        text19=string.gsub(text18,"f","s")
        text20=string.gsub(text19,"s","t")
        text21=string.gsub(text20,"w","u")
        text22=string.gsub(text21,"t","v")
        text23=string.gsub(text22,"p","w")
        text24=string.gsub(text23,"u","x")
        text25=string.gsub(text24,"n","y")
        text26=string.gsub(text25,"o","z")
        text27=string.gsub(text26,"#"," ")
        print(text27)
    end
    print("writing to out.txt...")
    file:write(text27)
    file:close()
    print("done!")
    print("again? type y for yes or anything else for no.")
    again=io.read()
until again~="y"
x=io.read()
No errors are raised by the code - what am I missing? I am aware this is not the most efficient way of doing this, but I need to figure out what is going wrong before I write a more efficient version using loops and tables.
Sample run (with only significant data included):
in:2
in:hi test
out:gb#safs
in:y
in:1
in:gb#safs
out:hq vjvv
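The chained string.gsub calls are the bug: each call rescans the output of the previous one, so a letter that has already been enciphered gets enciphered again. In the sample, e is first replaced by j (the fifth rule), and the later rule j -> a then turns that j into a, which is why "test" comes out as "safs" rather than "sjfs". The fix is to translate each character exactly once, for example by passing a lookup table to a single gsub, as the following rewrite does: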
local decoded = 'abcdefghijklmnopqrstuvwxyz #'
local encoded = 'qehcjmrgbadyvzxkilfswtpuno# '
local enc, dec = {}, {}
for i = 1, #decoded do
    local e, d = encoded:sub(i,i), decoded:sub(i,i)
    enc[d] = e
    dec[e] = d
end

repeat
    local file = io.open("out.txt", "w")
    local text27, rawtextin
    print("would you like to translate Cipher to English or English to Cipher?")
    print("enter 1 for translation to English. enter 2 for translation to Cipher")
    local tchoice = io.read()
    if tchoice == "2" then
        print(" enter any text to translate it: ")
        rawtextin = io.read()
        text27 = rawtextin:lower():gsub('.', enc)
        print(text27)
    elseif tchoice == "1" then
        print("enter text!")
        rawtextin = io.read()
        text27 = rawtextin:lower():gsub('.', dec)
        print(text27)
    end
    print("writing to out.txt...")
    file:write(text27)
    file:close()
    print("done!")
    print("again? type y for yes or anything else for no.")
    local again = io.read()
until again ~= "y"
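With the lookup tables, each character is translated exactly once, so the cipher round-trips: "hi test" now encodes to "gb#sjfs", and feeding "gb#sjfs" back through option 1 returns "hi test".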

Lines of code you have written [closed]

Out of curiosity, is there any way to get the number of lines of code you have written (in a specific project)?
I tried Perforce with p4 describe #CLN | wc -l, but apart from the many edge cases (comments being included, new lines being added, etc.), it skips newly added files as well. The edge cases can be ignored if we only want physical lines of code, but newly added files still cause a problem.
I went ahead and wrote a Python script that prints out the number of lines of code added/changed by a user and the average number of lines per change.
Tested on Windows with Python 2.7.2. You can run from the command line - it assumes you have p4 in your path.
Usage: codestats.py -u [username]
It works with git too: codestats.py -u [authorname] -g.
It does some blacklisting to prune out bulk adds (e.g. you just added a library), and also imposes a blacklist on certain types of files (e.g. .HTML files, etc.). Otherwise, it works pretty well.
Hope this helps!
########################################################################
# Script that computes the lines of code stats for a perforce/git user.
########################################################################
import argparse
import logging
import subprocess
import sys
import re

VALID_ARGUMENTS = [
    ("user", "-u", "--user", "Run lines of code computation for the specified user.", 1),
    ("change", "-c", "--change", "Just display lines of code in the passed in change (useful for debugging).", 1),
    ("git", "-g", "--git", "Use git rather than perforce (which is the default versioning system queried).", 0)
]

class PrintHelpOnErrorArgumentParser(argparse.ArgumentParser):
    def error(self, message):
        logging.error("error: {0}\n\n".format(message))
        self.print_help()
        sys.exit(2)

def is_code_file(depot_path):
    fstat_output = subprocess.Popen(['p4', 'fstat', depot_path], stdout=subprocess.PIPE).communicate()[0].split('\n')
    text_file = False
    head_type_regex = re.compile('^... headType (\S+)\s*$')
    for line in fstat_output:
        head_type_line = head_type_regex.match(line)
        if head_type_line:
            head_type = head_type_line.group(1)
            text_file = (head_type.find('text') != -1)
    if text_file:
        blacklisted_file_types = ['html', 'css', 'twb', 'twbx', 'tbm', 'xml']
        for file_type in blacklisted_file_types:
            if re.match('^\/\/depot.*\.{}#\d+$'.format(file_type), depot_path):
                text_file = False
                break
    return text_file

def parse_args():
    parser = PrintHelpOnErrorArgumentParser()
    for arg_name, short_switch, long_switch, help, num_args in VALID_ARGUMENTS:
        if num_args != 0:
            parser.add_argument(
                short_switch,
                nargs=num_args,
                type=str,
                dest=arg_name)
        else:
            parser.add_argument(
                long_switch,
                short_switch,
                action="store_true",
                help=help,
                dest=arg_name)
    return parser.parse_args()

file_edited_regex = re.compile('^... .*?#\d+ edit\s*$')
file_deleted_regex = re.compile('^... .*?#\d+ delete\s*$')
file_integrated_regex = re.compile('^... .*?#\d+ integrate\s*$')
file_added_regex = re.compile('^... (.*?#\d+) add\s*$')
affected_files_regex = re.compile('^Affected files ...')

outliers = []  # Changes that seem as if they weren't hand coded and merit inspection

def num_lines_in_file(depot_path):
    lines = len(subprocess.Popen(['p4', 'print', depot_path], stdout=subprocess.PIPE).communicate()[0].split('\n'))
    return lines

def parse_change(changelist):
    change_description = subprocess.Popen(['p4', 'describe', '-ds', changelist], stdout=subprocess.PIPE).communicate()[0].split('\n')
    parsing_differences = False
    parsing_affected_files = False
    differences_regex = re.compile('^Differences \.\.\..*$')
    line_added_regex = re.compile('^add \d+ chunks (\d+) lines.*$')
    line_removed_regex = re.compile('^deleted \d+ chunks (\d+) lines.*$')
    line_changed_regex = re.compile('^changed \d+ chunks (\d+) / (\d+) lines.*$')
    file_diff_regex = re.compile('^==== (\/\/depot.*#\d+)\s*\S+$')
    skip_file = False
    num_lines_added = 0
    num_lines_deleted = 0
    num_lines_changed_added = 0
    num_lines_changed_deleted = 0
    num_files_added = 0
    num_files_edited = 0
    for line in change_description:
        if differences_regex.match(line):
            parsing_differences = True
        elif affected_files_regex.match(line):
            parsing_affected_files = True
        elif parsing_differences:
            if file_diff_regex.match(line):
                regex_match = file_diff_regex.match(line)
                skip_file = not is_code_file(regex_match.group(1))
            elif not skip_file:
                regex_match = line_added_regex.match(line)
                if regex_match:
                    num_lines_added += int(regex_match.group(1))
                else:
                    regex_match = line_removed_regex.match(line)
                    if regex_match:
                        num_lines_deleted += int(regex_match.group(1))
                    else:
                        regex_match = line_changed_regex.match(line)
                        if regex_match:
                            num_lines_changed_added += int(regex_match.group(2))
                            num_lines_changed_deleted += int(regex_match.group(1))
        elif parsing_affected_files:
            if file_added_regex.match(line):
                file_added_match = file_added_regex.match(line)
                depot_path = file_added_match.group(1)
                if is_code_file(depot_path):
                    lines_in_file = num_lines_in_file(depot_path)
                    if lines_in_file > 3000:
                        # Anomaly - probably a copy of existing code - discard this
                        lines_in_file = 0
                    num_lines_added += lines_in_file
                    num_files_added += 1
            elif file_edited_regex.match(line):
                num_files_edited += 1
    return [num_files_added, num_files_edited, num_lines_added, num_lines_deleted, num_lines_changed_added, num_lines_changed_deleted]

def contains_integrates(changelist):
    change_description = subprocess.Popen(['p4', 'describe', '-s', changelist], stdout=subprocess.PIPE).communicate()[0].split('\n')
    contains_integrates = False
    parsing_affected_files = False
    for line in change_description:
        if affected_files_regex.match(line):
            parsing_affected_files = True
        elif parsing_affected_files:
            if file_integrated_regex.match(line):
                contains_integrates = True
                break
    return contains_integrates

#################################################
# Note: Keep this function in sync with
# generate_line.
#################################################
def generate_output_specifier(output_headers):
    output_specifier = ''
    for output_header in output_headers:
        output_specifier += '| {:'
        output_specifier += '{}'.format(len(output_header))
        output_specifier += '}'
    if output_specifier != '':
        output_specifier += ' |'
    return output_specifier

#################################################
# Note: Keep this function in sync with
# generate_output_specifier.
#################################################
def generate_line(output_headers):
    line = ''
    for output_header in output_headers:
        line += '--'  # for the '| '
        header_padding_specifier = '{:-<'
        header_padding_specifier += '{}'.format(len(output_header))
        header_padding_specifier += '}'
        line += header_padding_specifier.format('')
    if line != '':
        line += '--'  # for the last ' |'
    return line

# Returns true if a change is a bulk addition or a private change
def is_black_listed_change(user, changelist):
    large_add_change = False
    all_adds = True
    num_adds = 0
    is_private_change = False
    is_third_party_change = False
    change_description = subprocess.Popen(['p4', 'describe', '-s', changelist], stdout=subprocess.PIPE).communicate()[0].split('\n')
    for line in change_description:
        if file_edited_regex.match(line) or file_deleted_regex.match(line):
            all_adds = False
        elif file_added_regex.match(line):
            num_adds += 1
        if line.find('... //depot/private') != -1:
            is_private_change = True
            break
        if line.find('... //depot/third-party') != -1:
            is_third_party_change = True
            break
    large_add_change = all_adds and num_adds > 70
    #print "{}: {}".format(changelist, large_add_change or is_private_change)
    return large_add_change or is_third_party_change

change_header_regex = re.compile('^Change (\d+)\s*.*?\s*(\S+)#.*$')

def get_user_and_change_header_for_change(changelist):
    change_description = subprocess.Popen(['p4', 'describe', '-s', changelist], stdout=subprocess.PIPE).communicate()[0].split('\n')
    user = None
    change_header = None
    for line in change_description:
        change_header_match = change_header_regex.match(line)
        if change_header_match:
            user = change_header_match.group(2)
            change_header = line
            break
    return [user, change_header]

if __name__ == "__main__":
    log = logging.getLogger()
    log.setLevel(logging.DEBUG)
    args = parse_args()

    user_stats = {}
    user_stats['num_changes'] = 0
    user_stats['lines_added'] = 0
    user_stats['lines_deleted'] = 0
    user_stats['lines_changed_added'] = 0
    user_stats['lines_changed_removed'] = 0
    user_stats['total_lines'] = 0
    user_stats['files_edited'] = 0
    user_stats['files_added'] = 0

    change_log = []
    if args.git:
        git_log_command = ['git', 'log', '--author={}'.format(args.user[0]), '--pretty=tformat:', '--numstat']
        git_log_output = subprocess.Popen(git_log_command, stdout=subprocess.PIPE).communicate()[0].split('\n')
        git_log_line_regex = re.compile('^(\d+)\s*(\d+)\s*\S+$')
        total = 0
        adds = 0
        subs = 0
        for git_log_line in git_log_output:
            line_match = git_log_line_regex.match(git_log_line)
            if line_match:
                adds += int(line_match.group(1))
                subs += int(line_match.group(2))
        total = adds - subs
        num_commits = 0
        git_shortlog_command = ['git', 'shortlog', '--author={}'.format(args.user[0]), '-s']
        git_shortlog_output = subprocess.Popen(git_shortlog_command, stdout=subprocess.PIPE).communicate()[0].split('\n')
        git_shortlog_line_regex = re.compile('^\s*(\d+)\s+.*$')
        for git_shortlog_line in git_shortlog_output:
            line_match = git_shortlog_line_regex.match(git_shortlog_line)
            if line_match:
                num_commits += int(line_match.group(1))
        print "Git Stats for {}: Commits: {}. Lines of code: {}. Average Lines Per Change: {}.".format(args.user[0], num_commits, total, total*1.0/num_commits)
        sys.exit(0)
    elif args.change:
        [args.user, change_header] = get_user_and_change_header_for_change(args.change)
        change_log = [change_header]
    else:
        change_log = subprocess.Popen(['p4', 'changes', '-u', args.user, '-s', 'submitted'], stdout=subprocess.PIPE).communicate()[0].split('\n')

    output_headers = ['Current Change', 'Num Changes', 'Files Added', 'Files Edited']
    output_headers.append('Lines Added')
    output_headers.append('Lines Deleted')
    if not args.git:
        output_headers.append('Lines Changed (Added/Removed)')
    avg_change_size = 0.0
    output_headers.append('Total Lines')
    output_headers.append('Avg. Lines/Change')

    line = generate_line(output_headers)
    output_specifier = generate_output_specifier(output_headers)
    print line
    print output_specifier.format(*output_headers)
    print line

    output_specifier_with_carriage_return = output_specifier + '\r'
    for change in change_log:
        change_match = change_header_regex.search(change)
        if change_match:
            user_stats['num_changes'] += 1
            changelist = change_match.group(1)
            if not is_black_listed_change(args.user, changelist) and not contains_integrates(changelist):
                [files_added_in_change, files_edited_in_change, lines_added_in_change, lines_deleted_in_change, lines_changed_added_in_change, lines_changed_removed_in_change] = parse_change(change_match.group(1))
                if lines_added_in_change > 5000 and changelist not in outliers:
                    outliers.append([changelist, lines_added_in_change])
                else:
                    user_stats['lines_added'] += lines_added_in_change
                    user_stats['lines_deleted'] += lines_deleted_in_change
                    user_stats['lines_changed_added'] += lines_changed_added_in_change
                    user_stats['lines_changed_removed'] += lines_changed_removed_in_change
                    user_stats['total_lines'] += lines_changed_added_in_change
                    user_stats['total_lines'] -= lines_changed_removed_in_change
                    user_stats['total_lines'] += lines_added_in_change
                user_stats['files_edited'] += files_edited_in_change
                user_stats['files_added'] += files_added_in_change
            current_output = [changelist, user_stats['num_changes'], user_stats['files_added'], user_stats['files_edited']]
            current_output.append(user_stats['lines_added'])
            current_output.append(user_stats['lines_deleted'])
            if not args.git:
                current_output.append('{}/{}'.format(user_stats['lines_changed_added'], user_stats['lines_changed_removed']))
            current_output.append(user_stats['total_lines'])
            current_output.append(user_stats['total_lines']*1.0/user_stats['num_changes'])
            print output_specifier_with_carriage_return.format(*current_output),
    print
    print line

    if len(outliers) > 0:
        print "Outliers (changes that merit inspection - and have not been included in the stats):"
        outlier_headers = ['Changelist', 'Lines of Code']
        outlier_specifier = generate_output_specifier(outlier_headers)
        outlier_line = generate_line(outlier_headers)
        print outlier_line
        print outlier_specifier.format(*outlier_headers)
        print outlier_line
        for change in outliers:
            print outlier_specifier.format(*change)
        print outlier_line
The other answers seem to have missed the source-control history side of things.
From http://forums.perforce.com/index.php?/topic/359-how-many-lines-of-code-have-i-written/
Calculate the answer in multiple steps:
1) Added files:
p4 filelog ... | grep ' add on .* by <username>'
p4 print -q foo#1 | wc -l
2) Changed files:
p4 describe <changelist> | grep "^>" | wc -l
Combine all the counts together (scripting...), and you'll have a total.
You might also want to get rid of whitespace-only lines, or lines without alphanumeric characters, with a grep.
Also, if you are doing this regularly, it would be more efficient to code the whole thing in P4Python and run it incrementally, keeping history and looking only at new commits.
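As a rough illustration of that incremental P4Python approach (a sketch, assuming the p4python package; how you persist last_seen between runs is up to you):
# Sketch: incrementally count only changelists newer than the last run.
from P4 import P4

p4 = P4()
p4.connect()

last_seen = 0  # placeholder: load the highest changelist number already counted
changes = p4.run("changes", "-u", "someuser", "-s", "submitted")
new_changes = [c for c in changes if int(c["change"]) > last_seen]

for change in new_changes:
    desc = p4.run("describe", "-ds", change["change"])
    # ... accumulate the add/delete line counts from desc here,
    # then store the highest change number as the new last_seen ...

p4.disconnect()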
Yes, there are many ways to count lines of code.
tl;dr: Install the Eclipse Metrics Plugin. Here are the instructions on how to do it. Below is a short script if you want to do it without Eclipse.
Shell script
I will present a fairly general approach. It works on Linux, but it is portable to other systems. Save these two lines to a lines.sh file:
#!/bin/sh
find -name "*.java" | awk '{ system("wc "$0) }' | awk '{ print $1 "\t" $4; lines += $1; files++ } END { print "Total: " lines " lines in " files " files."}'
It's a shell script which uses find, wc and the great awk. Add execute permission:
chmod +x lines.sh
Now we can execute the shell script. Let's say you saved lines.sh in /home/you/workspace/projectX.
The script counts lines in the .java files located in the subdirectories of /home/you/workspace/projectX, so run it from there with ./lines.sh. You can change *.java to any other file type.
Sample output:
adam#adam ~/workspace/Checkers $ ./lines.sh
23 ./src/Checkers.java
14 ./src/event/StartGameEvent.java
38 ./src/event/YourColorEvent.java
52 ./src/event/BoardClickEvent.java
61 ./src/event/GameQueue.java
14 ./src/event/PlayerEscapeEvent.java
14 ./src/event/WaitEvent.java
16 ./src/event/GameEvent.java
38 ./src/event/EndGameEvent.java
38 ./src/event/FakeBoardEvent.java
127 ./src/controller/ServerThread.java
14 ./src/controller/ServerConfig.java
46 ./src/controller/Server.java
170 ./src/controller/Controller.java
141 ./src/controller/ServerNetwork.java
246 ./src/view/ClientNetwork.java
36 ./src/view/Messages.java
53 ./src/view/ButtonField.java
47 ./src/view/ViewConfig.java
32 ./src/view/MainWindow.java
455 ./src/view/View.java
36 ./src/view/ImageLoader.java
88 ./src/model/KingJump.java
130 ./src/model/Cords.java
70 ./src/model/King.java
77 ./src/model/FakeBoard.java
90 ./src/model/CheckerMove.java
53 ./src/model/PlayerColor.java
73 ./src/model/Checker.java
201 ./src/model/AbstractPiece.java
75 ./src/model/CheckerJump.java
154 ./src/model/Model.java
105 ./src/model/KingMove.java
99 ./src/model/FieldType.java
269 ./src/model/Board.java
56 ./src/model/AbstractJump.java
80 ./src/model/AbstractMove.java
82 ./src/model/BoardState.java
Total: 3413 lines in 38 files.
Find an app to calculate the lines; there are many subtleties to counting lines: comments, blank lines, multiple operators per line, and so on.
Visual Studio has "Calculate Code Metrics" functionality. Since you're not mentioning one single language, I can't be more specific about which tool to use, but just using "find" and "grep" may not be the way to go.
Also consider the fact that lines of code don't measure actual progress. Completed features on your roadmap measure progress, and the fewer the lines of code, the better. It wouldn't be a first if a proud developer claimed his 60,000 lines of code were marvelous, only for it to turn out there's a way to do the same thing in 1,000 lines.
Have a look at SLOCCount. It only counts actual lines of code and performs some additional computations as well.
On OSX, you can easily install it via Homebrew with brew install sloccount.
Sample output for a project of mine:
$ sloccount .
Have a non-directory at the top, so creating directory top_dir
Adding /Users/padde/Desktop/project/./Gemfile to top_dir
Adding /Users/padde/Desktop/project/./Gemfile.lock to top_dir
Adding /Users/padde/Desktop/project/./Procfile to top_dir
Adding /Users/padde/Desktop/project/./README to top_dir
Adding /Users/padde/Desktop/project/./application.rb to top_dir
Creating filelist for config
Adding /Users/padde/Desktop/project/./config.ru to top_dir
Creating filelist for controllers
Creating filelist for db
Creating filelist for helpers
Creating filelist for models
Creating filelist for public
Creating filelist for tmp
Creating filelist for views
Categorizing files.
Finding a working MD5 command....
Found a working MD5 command.
Computing results.
SLOC Directory SLOC-by-Language (Sorted)
256 controllers ruby=256
66 models ruby=66
10 config ruby=10
9 top_dir ruby=9
5 helpers ruby=5
0 db (none)
0 public (none)
0 tmp (none)
0 views (none)
Totals grouped by language (dominant language first):
ruby: 346 (100.00%)
Total Physical Source Lines of Code (SLOC) = 346
Development Effort Estimate, Person-Years (Person-Months) = 0.07 (0.79)
(Basic COCOMO model, Person-Months = 2.4 * (KSLOC**1.05))
Schedule Estimate, Years (Months) = 0.19 (2.28)
(Basic COCOMO model, Months = 2.5 * (person-months**0.38))
Estimated Average Number of Developers (Effort/Schedule) = 0.34
Total Estimated Cost to Develop = $ 8,865
(average salary = $56,286/year, overhead = 2.40).
SLOCCount, Copyright (C) 2001-2004 David A. Wheeler
SLOCCount is Open Source Software/Free Software, licensed under the GNU GPL.
SLOCCount comes with ABSOLUTELY NO WARRANTY, and you are welcome to
redistribute it under certain conditions as specified by the GNU GPL license;
see the documentation for details.
Please credit this data as "generated using David A. Wheeler's 'SLOCCount'."
There is an easier way to do all this, which incidentally is faster than using grep.
First get all the changelists for a particular user. This is a command-line command, which you can run from a Python script using os.system():
p4 changes -u <username> > 'some_text_file.txt'
Now you need to extract all the changelist numbers, so we will use a regex; here it is done in Python:
import os
import re

f = open('some_text_file.txt', 'r')
lists = f.readlines()
pattern = re.compile(r'\b[0-9][0-9][0-9][0-9][0-9][0-9][0-9]\b')
labels = []
for i in lists:
    labels.append(pattern.findall(i))
changelists = []
for h in labels:
    if type(h) is list:
        changelists.append(str(h[0]))
    else:
        changelists.append(str(h))
Now you have all the changelist numbers in changelists.
We will iterate through the list and, for every changelist, find the number of lines added and the number of lines deleted; the difference gives the total number of lines added. The following lines of code do exactly that:
for i in changelists:
    os.system('p4 describe -ds ' + i + ' | findstr "^add" >> added.txt')
    os.system('p4 describe -ds ' + i + ' | findstr "^del" >> deleted.txt')

added = []
deleted = []
file = open('added.txt')
for i in file:
    added.append(i)

count = []
count_added = 0
count_add = 0
count_del = 0
for j in added:
    count = [int(s) for s in j.split() if s.isdigit()]
    count_add += count[1]
    count = []

file = open('deleted.txt')
for i in file:
    deleted.append(i)
for j in deleted:  # note: iterate the deleted lines, not the raw labels
    count = [int(s) for s in j.split() if s.isdigit()]
    count_del += count[1]
    count = []

count_added = count_add - count_del
print count_added
count_added will hold the number of lines that were added by the user.
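A sketch of the same idea without the temporary files or the Windows-only findstr: parse the p4 describe -ds summary lines (e.g. "add 2 chunks 17 lines") directly. It assumes Python 3 and that p4 is on the PATH:
import re
import subprocess

def lines_added_by(username):
    # List the user's submitted changelists and pull out their numbers.
    changes = subprocess.check_output(
        ['p4', 'changes', '-u', username], text=True)
    changelists = re.findall(r'^Change (\d+)', changes, re.MULTILINE)
    added = deleted = 0
    for cl in changelists:
        desc = subprocess.check_output(['p4', 'describe', '-ds', cl], text=True)
        for n in re.findall(r'^add \d+ chunks (\d+) lines', desc, re.MULTILINE):
            added += int(n)
        for n in re.findall(r'^deleted \d+ chunks (\d+) lines', desc, re.MULTILINE):
            deleted += int(n)
    return added - deleted

print(lines_added_by('someuser'))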
