How to reset QFile/ QTextStream? - python-3.x

I am trying to get a file to reset after a first read. All my googling hasn't helped at all.
How do I reset a file to its beginning after it has been read a first time?
Trial one:
inFile = QFile(self._pathFileName)
inFile.open(QFile.ReadOnly | QFile.Text)
stream = QTextStream(inFile)

# Count first all lines in the file
self._numLinesRead = 0
self._numLinesTotal = 0
while not stream.atEnd():
    self._numLinesTotal += 1
    stream.readLine()

inFile.seek(0)
stream.seek(0)
pos = stream.pos()  # pos is equal to 0 after this line (verified while debugging)
while not stream.atEnd():  # but here it still thinks it's at the file end and jumps over
    ....
Trial two:
inFile = QFile(self._pathFileName)
inFile.open(QFile.ReadOnly | QFile.Text)
stream = QTextStream(inFile)

# Count first all lines in the file
self._numLinesRead = 0
self._numLinesTotal = 0
while not stream.atEnd():
    self._numLinesTotal += 1
    stream.readLine()

inFile.close()
del inFile
del stream

inFile = QFile(self._pathFileName)
inFile.open(QFile.ReadOnly | QFile.Text)
stream = QTextStream(inFile)
# everything has been reset?!
while not stream.atEnd():  # Nope, it still thinks it is atEnd and jumps over
    ....
I have tried all the solutions found on the net. Nothing helps. What am I doing wrong?

Not sure what I should say, but after a complete reboot of the system, "Trial two" worked.
After a second reboot (two days and a lot of code changes later), "Trial one" also came to life.
In the end, both trials are, from my point of view, valid and working.
If you discover strange behaviour during development and debugging, try a reboot.
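For what it's worth, "Trial one" is the documented approach: QTextStream.seek(0) should rewind the stream. Here is a minimal, self-contained sketch (assuming PyQt5; the file name is hypothetical) that counts the lines and then re-reads the same stream, with resetStatus() added as a cheap safeguard:

from PyQt5.QtCore import QFile, QTextStream

inFile = QFile("example.txt")  # hypothetical path
if inFile.open(QFile.ReadOnly | QFile.Text):
    stream = QTextStream(inFile)

    # First pass: count the lines.
    numLinesTotal = 0
    while not stream.atEnd():
        stream.readLine()
        numLinesTotal += 1

    # Rewind: seek(0) repositions the stream on its device, and
    # resetStatus() clears any status flags left over from the first pass.
    stream.seek(0)
    stream.resetStatus()

    # Second pass: actually process the lines.
    while not stream.atEnd():
        line = stream.readLine()
        # ... process line ...

    inFile.close()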


Angr can't solve the googlectf beginner problem

I am a student studying angr for the first time.
I'm working from the code in this write-up:
https://github.com/Dvd848/CTFs/blob/master/2020_GoogleCTF/Beginner.md
import angr
import claripy

FLAG_LEN = 15
STDIN_FD = 0
base_addr = 0x100000  # To match addresses to Ghidra

proj = angr.Project("./a.out", main_opts={'base_addr': base_addr})

flag_chars = [claripy.BVS('flag_%d' % i, 8) for i in range(FLAG_LEN)]
flag = claripy.Concat(*flag_chars + [claripy.BVV(b'\n')])  # Add \n for scanf() to accept the input

state = proj.factory.full_init_state(
    args=['./a.out'],
    add_options=angr.options.unicorn,
    stdin=flag,
)

# Add constraints that all characters are printable
for k in flag_chars:
    state.solver.add(k >= ord('!'))
    state.solver.add(k <= ord('~'))

simgr = proj.factory.simulation_manager(state)
find_addr = 0x101124   # SUCCESS
avoid_addr = 0x10110d  # FAILURE
simgr.explore(find=find_addr, avoid=avoid_addr)

if (len(simgr.found) > 0):
    for found in simgr.found:
        print(found.posix.dumps(STDIN_FD))
https://github.com/google/google-ctf/tree/master/2020/quals/reversing-beginner/attachments
This is meant to be the solution for the Google CTF beginner challenge.
But the above code does not work; it doesn't give me the answer, and I want to know why. When I execute the code, the output is empty.
I run the code with Python 3 on Ubuntu 20.04 in WSL2.
Thank you.
I believe this script isn't printing anything because angr fails to find a solution and then exits. You can confirm this by appending the following to your script:
else:
    raise Exception('Could not find the solution')
If the exception is raised, a valid solution was not found.
In terms of why it doesn't work: this code looks like copy & paste from a few different sources, so it's fairly convoluted.
For example, the way the flag symbol is passed to stdin is not ideal. By default, stdin is a SimPackets, so it's best to keep it that way.
The following script solves the challenge; I have commented it to help you understand. You will notice that changing stdin=angr.SimPackets(name='stdin', content=[(flag, 15)]) to stdin=flag will cause the script to fail, for the reason mentioned above.
import angr
import claripy

base = 0x400000  # Default angr base
project = angr.Project("./a.out")

flag = claripy.BVS("flag", 15 * 8)  # length is expected in bits here

initial_state = project.factory.full_init_state(
    stdin=angr.SimPackets(name='stdin', content=[(flag, 15)]),  # provide symbol and length (in bytes)
    add_options={
        angr.options.SYMBOL_FILL_UNCONSTRAINED_MEMORY,
        angr.options.SYMBOL_FILL_UNCONSTRAINED_REGISTERS
    }
)

# constrain flag to common alphanumeric / punctuation characters
[initial_state.solver.add(byte >= 0x20, byte <= 0x7f) for byte in flag.chop(8)]

sim = project.factory.simgr(initial_state)
sim.explore(
    find=lambda s: b"SUCCESS" in s.posix.dumps(1),  # search for a state with this result
    avoid=lambda s: b"FAILURE" in s.posix.dumps(1)  # states that meet this constraint will be added to the avoid stash
)

if sim.found:
    solution_state = sim.found[0]
    print(f"[+] Success! Solution is: {solution_state.posix.dumps(0)}")  # dump whatever was sent to stdin to reach this state
else:
    raise Exception('Could not find the solution')  # Tell us if angr failed to find a solution state
A bit of trivia: there are actually multiple 'solutions' that the program would accept; I guess the CTF flag server only accepts one, though.
❯ echo -ne 'CTF{\x00\xe0MD\x17\xd1\x93\x1b\x00n)' | ./a.out
Flag: SUCCESS

I need to clean seismological events from a text file

The question here relates to the same type of file I asked another question about almost one month ago (I need to split a seismological file so that I have multiple subfiles).
My goal now is to delete events whose first line contains the string 'RSN 3'. So far I have tried editing the aforementioned question's best-answer code like this:
with open(refcatname) as fileContent:
    for l in fileContent:
        check_rsn_3 = l[45:51]
        if check_rsn_3 == "RSN 3":
            line = l[:-1]
            check_event = line[1:15]
            print(check_event, check_rsn_3)
        if not check_rsn_3 == "RSN 3":
            # Strip white spaces to make sure is an empty line
            if not l.strip():
                subFile.write(
                    eventInfo + "\n"
                )  # Add event to the subfile
                eventInfo = ""  # Reinit event info
                eventCounter += 1
                if eventCounter == 700:
                    subFile.close()
                    fileId += 1
                    subFile = open(
                        os.path.join(
                            catdir,
                            "Paquete_Continental_"
                            + str(fileId)
                            + ".out",
                        ),
                        "w+",
                    )
                    eventCounter = 0
            else:
                eventInfo += l
subFile.close()
Expected results: event info of earthquakes with 'RSN N' (where N ≠ 3).
Actual results: the first line of events with 'RSN 3' is deleted, but not the remaining event info.
Thanks in advance for your help :)
I'd advise against checking whether the string sits at an exact location (e.g. l[45:51]), since a single shifted character can break that; you can instead check whether the entire line contains "RSN 3" with if "RSN 3" in l.
Note that line = l[:-1] strips only the trailing newline (it keeps everything except the last character), so the line[1:15] slice still works on the resulting string.
But if you need to delete several lines, you could just check whether the current line contains "RSN 3", and then read line after line until one contains "RSN " again, skipping the ones in between.
skip = False
for line in fileContent:
    if "RSN 3" in line:
        skip = True
        continue
    if "RSN " in line and "RSN 3" not in line:
        skip = False
    # rest of the logic
    if skip:
        continue
This way you don't even parse the blocks whose first line contains "RSN 3".
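To make that concrete, here is a minimal, self-contained sketch of the same idea (hedged: the input/output file names are hypothetical, and it assumes every event's first line contains "RSN "):

with open("refcat.txt") as src, open("refcat_filtered.txt", "w") as dst:
    skip = False
    for line in src:
        if "RSN " in line:            # a new event header starts here
            skip = "RSN 3" in line    # drop the whole event if its header says RSN 3
        if not skip:
            dst.write(line)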

Lines of code you have written [closed]

Out of curiosity, is there any way to get the number of lines of code you have written (in a specific project)?
I tried Perforce with p4 describe #CLN | wc -l, but apart from the many edge cases (comments being included, new lines being added, etc.), it skips newly added files as well. The edge cases can be ignored if we only count physical lines of code, but newly added files still cause a problem.
I went ahead and wrote a Python script that prints the number of lines of code added/changed by a user and the average number of lines per change.
Tested on Windows with Python 2.7.2. You can run it from the command line - it assumes you have p4 in your path.
Usage: codestats.py -u [username]
It works with git too: codestats.py -u [authorname] -g.
It does some blacklisting to prune out bulk adds (e.g. you just added a library), and it also blacklists certain types of files (e.g. .html files). Otherwise, it works pretty well.
Hope this helps!
########################################################################
# Script that computes the lines of code stats for a perforce/git user.
########################################################################
import argparse
import logging
import subprocess
import sys
import re

VALID_ARGUMENTS = [
    ("user", "-u", "--user", "Run lines of code computation for the specified user.", 1),
    ("change", "-c", "--change", "Just display lines of code in the passed in change (useful for debugging).", 1),
    ("git", "-g", "--git", "Use git rather than perforce (which is the default versioning system queried).", 0)
]

class PrintHelpOnErrorArgumentParser(argparse.ArgumentParser):
    def error(self, message):
        logging.error("error: {0}\n\n".format(message))
        self.print_help()
        sys.exit(2)

def is_code_file(depot_path):
    fstat_output = subprocess.Popen(['p4', 'fstat', depot_path], stdout=subprocess.PIPE).communicate()[0].split('\n')
    text_file = False
    head_type_regex = re.compile('^... headType (\S+)\s*$')
    for line in fstat_output:
        head_type_line = head_type_regex.match(line)
        if head_type_line:
            head_type = head_type_line.group(1)
            text_file = (head_type.find('text') != -1)
    if text_file:
        blacklisted_file_types = ['html', 'css', 'twb', 'twbx', 'tbm', 'xml']
        for file_type in blacklisted_file_types:
            if re.match('^\/\/depot.*\.{}#\d+$'.format(file_type), depot_path):
                text_file = False
                break
    return text_file
def parse_args():
    parser = PrintHelpOnErrorArgumentParser()
    for arg_name, short_switch, long_switch, help, num_args in VALID_ARGUMENTS:
        if num_args != 0:
            parser.add_argument(
                short_switch,
                nargs=num_args,
                type=str,
                dest=arg_name)
        else:
            parser.add_argument(
                long_switch,
                short_switch,
                action="store_true",
                help=help,
                dest=arg_name)
    return parser.parse_args()

file_edited_regex = re.compile('^... .*?#\d+ edit\s*$')
file_deleted_regex = re.compile('^... .*?#\d+ delete\s*$')
file_integrated_regex = re.compile('^... .*?#\d+ integrate\s*$')
file_added_regex = re.compile('^... (.*?#\d+) add\s*$')
affected_files_regex = re.compile('^Affected files ...')

outliers = []  # Changes that seem as if they weren't hand coded and merit inspection

def num_lines_in_file(depot_path):
    lines = len(subprocess.Popen(['p4', 'print', depot_path], stdout=subprocess.PIPE).communicate()[0].split('\n'))
    return lines
def parse_change(changelist):
    change_description = subprocess.Popen(['p4', 'describe', '-ds', changelist], stdout=subprocess.PIPE).communicate()[0].split('\n')
    parsing_differences = False
    parsing_affected_files = False
    differences_regex = re.compile('^Differences \.\.\..*$')
    line_added_regex = re.compile('^add \d+ chunks (\d+) lines.*$')
    line_removed_regex = re.compile('^deleted \d+ chunks (\d+) lines.*$')
    line_changed_regex = re.compile('^changed \d+ chunks (\d+) / (\d+) lines.*$')
    file_diff_regex = re.compile('^==== (\/\/depot.*#\d+)\s*\S+$')
    skip_file = False
    num_lines_added = 0
    num_lines_deleted = 0
    num_lines_changed_added = 0
    num_lines_changed_deleted = 0
    num_files_added = 0
    num_files_edited = 0
    for line in change_description:
        if differences_regex.match(line):
            parsing_differences = True
        elif affected_files_regex.match(line):
            parsing_affected_files = True
        elif parsing_differences:
            if file_diff_regex.match(line):
                regex_match = file_diff_regex.match(line)
                skip_file = not is_code_file(regex_match.group(1))
            elif not skip_file:
                regex_match = line_added_regex.match(line)
                if regex_match:
                    num_lines_added += int(regex_match.group(1))
                else:
                    regex_match = line_removed_regex.match(line)
                    if regex_match:
                        num_lines_deleted += int(regex_match.group(1))
                    else:
                        regex_match = line_changed_regex.match(line)
                        if regex_match:
                            num_lines_changed_added += int(regex_match.group(2))
                            num_lines_changed_deleted += int(regex_match.group(1))
        elif parsing_affected_files:
            if file_added_regex.match(line):
                file_added_match = file_added_regex.match(line)
                depot_path = file_added_match.group(1)
                if is_code_file(depot_path):
                    lines_in_file = num_lines_in_file(depot_path)
                    if lines_in_file > 3000:
                        # Anomaly - probably a copy of existing code - discard this
                        lines_in_file = 0
                    num_lines_added += lines_in_file
                    num_files_added += 1
            elif file_edited_regex.match(line):
                num_files_edited += 1
    return [num_files_added, num_files_edited, num_lines_added, num_lines_deleted, num_lines_changed_added, num_lines_changed_deleted]

def contains_integrates(changelist):
    change_description = subprocess.Popen(['p4', 'describe', '-s', changelist], stdout=subprocess.PIPE).communicate()[0].split('\n')
    contains_integrates = False
    parsing_affected_files = False
    for line in change_description:
        if affected_files_regex.match(line):
            parsing_affected_files = True
        elif parsing_affected_files:
            if file_integrated_regex.match(line):
                contains_integrates = True
                break
    return contains_integrates
#################################################
# Note: Keep this function in sync with
# generate_line.
#################################################
def generate_output_specifier(output_headers):
    output_specifier = ''
    for output_header in output_headers:
        output_specifier += '| {:'
        output_specifier += '{}'.format(len(output_header))
        output_specifier += '}'
    if output_specifier != '':
        output_specifier += ' |'
    return output_specifier

#################################################
# Note: Keep this function in sync with
# generate_output_specifier.
#################################################
def generate_line(output_headers):
    line = ''
    for output_header in output_headers:
        line += '--'  # for the '| '
        header_padding_specifier = '{:-<'
        header_padding_specifier += '{}'.format(len(output_header))
        header_padding_specifier += '}'
        line += header_padding_specifier.format('')
    if line != '':
        line += '--'  # for the last ' |'
    return line
# Returns true if a change is a bulk addition or a private change
def is_black_listed_change(user, changelist):
    large_add_change = False
    all_adds = True
    num_adds = 0
    is_private_change = False
    is_third_party_change = False
    change_description = subprocess.Popen(['p4', 'describe', '-s', changelist], stdout=subprocess.PIPE).communicate()[0].split('\n')
    for line in change_description:
        if file_edited_regex.match(line) or file_deleted_regex.match(line):
            all_adds = False
        elif file_added_regex.match(line):
            num_adds += 1
        if line.find('... //depot/private') != -1:
            is_private_change = True
            break
        if line.find('... //depot/third-party') != -1:
            is_third_party_change = True
            break
    large_add_change = all_adds and num_adds > 70
    # print "{}: {}".format(changelist, large_add_change or is_private_change)
    return large_add_change or is_third_party_change

change_header_regex = re.compile('^Change (\d+)\s*.*?\s*(\S+)#.*$')

def get_user_and_change_header_for_change(changelist):
    change_description = subprocess.Popen(['p4', 'describe', '-s', changelist], stdout=subprocess.PIPE).communicate()[0].split('\n')
    user = None
    change_header = None
    for line in change_description:
        change_header_match = change_header_regex.match(line)
        if change_header_match:
            user = change_header_match.group(2)
            change_header = line
            break
    return [user, change_header]
if __name__ == "__main__":
    log = logging.getLogger()
    log.setLevel(logging.DEBUG)
    args = parse_args()

    user_stats = {}
    user_stats['num_changes'] = 0
    user_stats['lines_added'] = 0
    user_stats['lines_deleted'] = 0
    user_stats['lines_changed_added'] = 0
    user_stats['lines_changed_removed'] = 0
    user_stats['total_lines'] = 0
    user_stats['files_edited'] = 0
    user_stats['files_added'] = 0

    change_log = []
    if args.git:
        git_log_command = ['git', 'log', '--author={}'.format(args.user[0]), '--pretty=tformat:', '--numstat']
        git_log_output = subprocess.Popen(git_log_command, stdout=subprocess.PIPE).communicate()[0].split('\n')
        git_log_line_regex = re.compile('^(\d+)\s*(\d+)\s*\S+$')
        total = 0
        adds = 0
        subs = 0
        for git_log_line in git_log_output:
            line_match = git_log_line_regex.match(git_log_line)
            if line_match:
                adds += int(line_match.group(1))
                subs += int(line_match.group(2))
        total = adds - subs

        num_commits = 0
        git_shortlog_command = ['git', 'shortlog', '--author={}'.format(args.user[0]), '-s']
        git_shortlog_output = subprocess.Popen(git_shortlog_command, stdout=subprocess.PIPE).communicate()[0].split('\n')
        git_shortlog_line_regex = re.compile('^\s*(\d+)\s+.*$')
        for git_shortlog_line in git_shortlog_output:
            line_match = git_shortlog_line_regex.match(git_shortlog_line)
            if line_match:
                num_commits += int(line_match.group(1))

        print "Git Stats for {}: Commits: {}. Lines of code: {}. Average Lines Per Change: {}.".format(args.user[0], num_commits, total, total*1.0/num_commits)
        sys.exit(0)
    elif args.change:
        [args.user, change_header] = get_user_and_change_header_for_change(args.change)
        change_log = [change_header]
    else:
        change_log = subprocess.Popen(['p4', 'changes', '-u', args.user, '-s', 'submitted'], stdout=subprocess.PIPE).communicate()[0].split('\n')

    output_headers = ['Current Change', 'Num Changes', 'Files Added', 'Files Edited']
    output_headers.append('Lines Added')
    output_headers.append('Lines Deleted')
    if not args.git:
        output_headers.append('Lines Changed (Added/Removed)')
    avg_change_size = 0.0
    output_headers.append('Total Lines')
    output_headers.append('Avg. Lines/Change')

    line = generate_line(output_headers)
    output_specifier = generate_output_specifier(output_headers)
    print line
    print output_specifier.format(*output_headers)
    print line

    output_specifier_with_carriage_return = output_specifier + '\r'
    for change in change_log:
        change_match = change_header_regex.search(change)
        if change_match:
            user_stats['num_changes'] += 1
            changelist = change_match.group(1)
            if not is_black_listed_change(args.user, changelist) and not contains_integrates(changelist):
                [files_added_in_change, files_edited_in_change, lines_added_in_change, lines_deleted_in_change, lines_changed_added_in_change, lines_changed_removed_in_change] = parse_change(change_match.group(1))
                if lines_added_in_change > 5000 and changelist not in outliers:
                    outliers.append([changelist, lines_added_in_change])
                else:
                    user_stats['lines_added'] += lines_added_in_change
                    user_stats['lines_deleted'] += lines_deleted_in_change
                    user_stats['lines_changed_added'] += lines_changed_added_in_change
                    user_stats['lines_changed_removed'] += lines_changed_removed_in_change
                    user_stats['total_lines'] += lines_changed_added_in_change
                    user_stats['total_lines'] -= lines_changed_removed_in_change
                    user_stats['total_lines'] += lines_added_in_change
                    user_stats['files_edited'] += files_edited_in_change
                    user_stats['files_added'] += files_added_in_change
            current_output = [changelist, user_stats['num_changes'], user_stats['files_added'], user_stats['files_edited']]
            current_output.append(user_stats['lines_added'])
            current_output.append(user_stats['lines_deleted'])
            if not args.git:
                current_output.append('{}/{}'.format(user_stats['lines_changed_added'], user_stats['lines_changed_removed']))
            current_output.append(user_stats['total_lines'])
            current_output.append(user_stats['total_lines']*1.0/user_stats['num_changes'])
            print output_specifier_with_carriage_return.format(*current_output),
    print
    print line

    if len(outliers) > 0:
        print "Outliers (changes that merit inspection - and have not been included in the stats):"
        outlier_headers = ['Changelist', 'Lines of Code']
        outlier_specifier = generate_output_specifier(outlier_headers)
        outlier_line = generate_line(outlier_headers)
        print outlier_line
        print outlier_specifier.format(*outlier_headers)
        print outlier_line
        for change in outliers:
            print outlier_specifier.format(*change)
        print outlier_line
The other answers seem to have missed the source-control history side of things.
From http://forums.perforce.com/index.php?/topic/359-how-many-lines-of-code-have-i-written/
Calculate the answer in multiple steps:
1) Added files:
p4 filelog ... | grep ' add on .* by <username>'
p4 print -q foo#1 | wc -l
2) Changed files:
p4 describe <changelist> | grep "^>" | wc -l
Combine all the counts together (with some scripting), and you'll have a total.
You might also want to get rid of whitespace-only lines, or lines without any alphanumeric characters, with a grep.
Also, if you are doing this regularly, it would be more efficient to code the whole thing in P4Python and run it incrementally - keeping history and looking only at new commits, as sketched below.
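A rough sketch of that incremental P4Python idea (hedged: it assumes the p4python package and an already-configured Perforce connection; the user name and the caching strategy are placeholders, not a tested implementation):

from P4 import P4  # p4python package

p4 = P4()
p4.connect()
try:
    # p4.run returns a list of dicts; 'change' holds the changelist number.
    for change in p4.run('changes', '-u', 'some_user', '-s', 'submitted'):
        number = change['change']
        # On an incremental run, skip changelists counted previously (e.g. cached
        # in a small file keyed by changelist number), then parse the diff summary:
        summary = p4.run('describe', '-ds', number)
        # ... extract added/deleted line counts from summary, as the script above
        # does by scraping the 'p4 describe -ds' CLI output ...
finally:
    p4.disconnect()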
Yes, there are many ways to count lines of code.
tl;dr Install the Eclipse Metrics Plugin. Here are the instructions for doing so. Below is a short script if you want to do it without Eclipse.
Shell script
I will present a fairly general approach. It works on Linux, but it's portable to other systems. Save these 2 lines to a lines.sh file:
#!/bin/sh
find -name "*.java" | awk '{ system("wc "$0) }' | awk '{ print $1 "\t" $4; lines += $1; files++ } END { print "Total: " lines " lines in " files " files."}'
It's a shell script which uses find, wc, and the great awk. Add permission to execute:
chmod +x lines.sh
Now we can execute our shell script.
Let's say you saved lines.sh in /home/you/workspace/projectX.
The script counts lines in .java files located in subdirectories of /home/you/workspace/projectX.
So let's run it with ./lines.sh. You can change *.java to any other file type.
Sample output:
adam@adam ~/workspace/Checkers $ ./lines.sh
23 ./src/Checkers.java
14 ./src/event/StartGameEvent.java
38 ./src/event/YourColorEvent.java
52 ./src/event/BoardClickEvent.java
61 ./src/event/GameQueue.java
14 ./src/event/PlayerEscapeEvent.java
14 ./src/event/WaitEvent.java
16 ./src/event/GameEvent.java
38 ./src/event/EndGameEvent.java
38 ./src/event/FakeBoardEvent.java
127 ./src/controller/ServerThread.java
14 ./src/controller/ServerConfig.java
46 ./src/controller/Server.java
170 ./src/controller/Controller.java
141 ./src/controller/ServerNetwork.java
246 ./src/view/ClientNetwork.java
36 ./src/view/Messages.java
53 ./src/view/ButtonField.java
47 ./src/view/ViewConfig.java
32 ./src/view/MainWindow.java
455 ./src/view/View.java
36 ./src/view/ImageLoader.java
88 ./src/model/KingJump.java
130 ./src/model/Cords.java
70 ./src/model/King.java
77 ./src/model/FakeBoard.java
90 ./src/model/CheckerMove.java
53 ./src/model/PlayerColor.java
73 ./src/model/Checker.java
201 ./src/model/AbstractPiece.java
75 ./src/model/CheckerJump.java
154 ./src/model/Model.java
105 ./src/model/KingMove.java
99 ./src/model/FieldType.java
269 ./src/model/Board.java
56 ./src/model/AbstractJump.java
80 ./src/model/AbstractMove.java
82 ./src/model/BoardState.java
Total: 3413 lines in 38 files.
Find an app to calculate the lines; there are many subtleties to counting lines - comments, blank lines, multiple operators per line, etc.
Visual Studio has "Calculate Code Metrics" functionality. Since you're not mentioning one single language, I can't be more specific about which tool to use; just saying "find" and "grep" may not be the way to go.
Also consider the fact that lines of code don't measure actual progress. Completed features on your roadmap measure progress, and the lower the lines of code, the better. It wouldn't be the first time a proud developer claims his 60,000 lines of code are marvelous, only to find out there's a way to do the same thing in 1,000 lines.
Have a look at SLOCCount. It only counts actual lines of code and performs some additional computations as well.
On OSX, you can easily install it via Homebrew with brew install sloccount.
Sample output for a project of mine:
$ sloccount .
Have a non-directory at the top, so creating directory top_dir
Adding /Users/padde/Desktop/project/./Gemfile to top_dir
Adding /Users/padde/Desktop/project/./Gemfile.lock to top_dir
Adding /Users/padde/Desktop/project/./Procfile to top_dir
Adding /Users/padde/Desktop/project/./README to top_dir
Adding /Users/padde/Desktop/project/./application.rb to top_dir
Creating filelist for config
Adding /Users/padde/Desktop/project/./config.ru to top_dir
Creating filelist for controllers
Creating filelist for db
Creating filelist for helpers
Creating filelist for models
Creating filelist for public
Creating filelist for tmp
Creating filelist for views
Categorizing files.
Finding a working MD5 command....
Found a working MD5 command.
Computing results.
SLOC Directory SLOC-by-Language (Sorted)
256 controllers ruby=256
66 models ruby=66
10 config ruby=10
9 top_dir ruby=9
5 helpers ruby=5
0 db (none)
0 public (none)
0 tmp (none)
0 views (none)
Totals grouped by language (dominant language first):
ruby: 346 (100.00%)
Total Physical Source Lines of Code (SLOC) = 346
Development Effort Estimate, Person-Years (Person-Months) = 0.07 (0.79)
(Basic COCOMO model, Person-Months = 2.4 * (KSLOC**1.05))
Schedule Estimate, Years (Months) = 0.19 (2.28)
(Basic COCOMO model, Months = 2.5 * (person-months**0.38))
Estimated Average Number of Developers (Effort/Schedule) = 0.34
Total Estimated Cost to Develop = $ 8,865
(average salary = $56,286/year, overhead = 2.40).
SLOCCount, Copyright (C) 2001-2004 David A. Wheeler
SLOCCount is Open Source Software/Free Software, licensed under the GNU GPL.
SLOCCount comes with ABSOLUTELY NO WARRANTY, and you are welcome to
redistribute it under certain conditions as specified by the GNU GPL license;
see the documentation for details.
Please credit this data as "generated using David A. Wheeler's 'SLOCCount'."
There is an easier way to do all this, which is incidentally faster than using grep.
First, get all the changelists for a particular user. This is a command-line command, but you can use it in a Python script via os.system():
p4 changes -u <username> > 'some_text_file.txt'
Now you need to extract all the changelist numbers; we will use a regex for that, done here in Python:
f = open('some_text_file.txt','r')
lists = f.readlines()
pattern = re.compile(r'\b[0-9][0-9][0-9][0-9][0-9][0-9][0-9]\b')
labels = []
for i in lists:
    labels.append(pattern.findall(i))
changelists = []
for h in labels:
    if(type(h) is list):
        changelists.append(str(h[0]))
    else:
        changelists.append(str(h))
Now you have all the changelist numbers in changelists.
We will iterate through the list, and for every changelist find the number of lines added and the number of lines deleted; the overall difference gives the total number of lines added. The following lines of code do exactly that:
for i in changelists:
    os.system('p4 describe -ds '+i+' | findstr "^add" >> added.txt')
    os.system('p4 describe -ds '+i+' | findstr "^del" >> deleted.txt')

added = []
deleted = []
file = open('added.txt')
for i in file:
    added.append(i)
count = []
count_added = 0
count_add = 0
count_del = 0
for j in added:
    count = [int(s) for s in j.split() if s.isdigit()]
    count_add += count[1]
    count = []
file = open('deleted.txt')
for i in file:
    deleted.append(i)
for j in deleted:  # iterate the deleted lines (not the changelist labels)
    count = [int(s) for s in j.split() if s.isdigit()]
    count_del += count[1]
    count = []
count_added = count_add - count_del
print count_added
count_added will hold the number of lines that were added by the user.

Updating a label from a thread in Perl

I'm using Perl on a Linux box, and I have 2 devices on my local net - a PC (the Linux box) and a router/DSL modem - at 192.168.1.1 and 192.168.1.2. I am trying to list, or show the progress of, pinging those two plus a test of 8 other non-existent devices with the code below, but I am having trouble getting my StatusLabel to update. Any help...
for ($i = 1; $i <= 10; ++$i) {   # --- $i<$VarClients --- 254
    my $thr_List = ("ping$i");
    $thr_List = threads->create(\&pingingthreads, "$i");
}

sub pingingthreads {
    my @pingpong = ping("$localAddress$i", '-c 1', '-i .2');  # -i may not count for much?
    print "Pinging: $localAddress$i\n";   # output goes from address1 - address10 ok
    $StatusLabel = "Pinging: $localAddress$i";  # only the last responding device seems to be shown in my status label?!
    $val = ($val + 10);  # 0.392156863
    print "$val\% done...\n";  # goes to 100% for me ok
    # $indicatorbar->value( $val );  # I have a ProgressBar and it gets stuck on 20% also
    if ($val == 100) {
        $val = 0;
    }  # reset after scanning

    # then after the last ping, update the status label:
    # my @ParamList = ('something', 'testing', 7, 8, 9);
    # $thr5 = threads->create(\&updateStatusLable, @ParamList);  # starting a thread within a thread ???

    # ping response text...
    for (@pingpong) {  # need to do something for non-responding clients & any time lapse/ping latency..., or *** ???
        $pong = $_;
        chop($pong);  # Get rid of the trailing \n ??
        if ($pong =~ m/1 packets transmitted, 1 received, 0% packet loss/) {
            push(@boxs, "$localAddress$i");
        } else {
            # see the other lines from the ping's output
            # print "$pong\n";
        }
    }
}
# For $localAddress$i icmp_seq=1 Destination Host Unreachable ???
--------------------- # StatusBar/progress label & bar ----------------
my $sb = $main->StatusBar();
$sb->addLabel( -textvariable => \$StatusLabel,
-relief => 'flat',
-font => $font,
-foreground => "$statusbartextColour",
);
my $indicatorbar = $sb->ProgressBar( -padx=>2, -pady=>2, -borderwidth=>2,
-troughcolor=>"$Colour2",
-colors=>[ 0, "$indicatorcolour" ],
-length=>106,
-relief => 'flat',
-value => "$val",
)->pack;
# $val = 0;
# $indicatorbar->value( $val );
=====================================
my $StatusLabel :shared = ();
my $val :shared = (0); # var for progress bar value
I have uploaded my full code here (http://cid-99cdb89630050fff.office.live.com/browse.aspx/.Public) if needed; it's in the Boxy.zip...
By default, data in Perl threads are private; updates to a variable in one thread will not change the value of that variable in other threads (or in the main thread). You will want to declare $val as a shared variable.
See threads::shared.
I see you have declared $val as shared at the bottom of the script, so I didn't see it until it was too late. Not coincidentally, the Perl interpreter is also not going to see that declaration until it is too late. The top 95% of your program is manipulating the global, thread-private variable $val, and not the lexical, shared $val you declare at the end of your script. Move this declaration to the top of the script.
Putting use strict at the top of your program would have caught this and saved you minutes, if not hours, of grief.
You don't. GUI frameworks tend not to be thread-safe. You communicate the info to the thread in which the GUI is run instead. Example (a rough queue-based sketch also follows below).
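A rough sketch of that pattern (hedged: the variable names and the Tk timer call are placeholders for whatever your GUI loop provides, not a drop-in fix). Worker threads push status text onto a shared queue, and only the GUI thread touches the widgets:

use strict;
use warnings;
use threads;
use Thread::Queue;

my $q = Thread::Queue->new;
my $StatusLabel = '';    # the -textvariable target in the real script

# Worker thread: enqueue status messages instead of touching the GUI.
threads->create(sub {
    for my $i (1 .. 10) {
        $q->enqueue("Pinging: 192.168.1.$i");
        sleep 1;
    }
})->detach;

# GUI side: poll the queue from a repeating timer owned by the GUI thread,
# e.g. with Tk: $main->repeat(100 => \&drain_queue);
sub drain_queue {
    while (defined(my $msg = $q->dequeue_nb)) {
        $StatusLabel = $msg;   # only the GUI thread ever updates the widget variable
    }
}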
First, sorry for replying here, but I have lost my cookie or the ability to reply and edit, etc...
Thanks ikegami, I will have to play with the example for a while to see if I can work things out and mix it into what I'm doing... but at first sight it looks just right... Thanks very much.
I was able to update the $StatusLabel using:
# in 3 seconds maybe do a fade to: Ready...
my @ParamList = ('ping', 'testing', 4, 5, 6);
$thr2 = threads->create(\&updateStatusLable, @ParamList);

sub updateStatusLable {
    # should probably check if threads are running already first???
    # and if so ***/*** them ???
    my @InboundParameters = @_;
    my $tid = threads->tid();
    # my $thr_object = threads->self();  # Get a thread's object
    # print("This is a new thread\($tid\)... and I am counting to 3...\n");
    sleep(3);
    $StatusLabel = "Ready...";  # print "Am now trying to change the status bar's label to \"Ready...\"\n";
    # try updating better/smoother... My main window needs "focus and a mouse move" I think
    # for the new text to appear...
    # print('Received the parameters: ', join(', ', @InboundParameters), ".\n");
    # $returnedvalue = "the thread should return this...";
    # return($returnedvalue);  # try returning any value just to test/see how...
}
but will try your method... Thanks again.

Compare many text files that contain duplicate "stubs" from the previous and next file and remove duplicate text automatically

I have a large number of text files (1000+), each containing an article from an academic journal. Unfortunately each article's file also contains a "stub" from the end of the previous article (at the beginning) and from the beginning of the next article (at the end).
I need to remove these stubs in preparation for running a frequency analysis on the articles, because the stubs constitute duplicate data.
There is no simple field that marks the beginning and end of each article in all cases. However, the duplicate text does seem to be formatted the same and on the same lines in both cases.
A script that compared each file to the next file and then removed 1 copy of the duplicate text would be perfect. This seems like it would be a pretty common issue when programming, so I am surprised that I haven't been able to find anything that does this.
The file names sort in order, so a script that compares each file to the next sequentially should work. E.g.
bul_9_5_181.txt
bul_9_5_186.txt
are two articles, one starting on page 181 and the other on page 186. Both of these articles are included below.
There are two volumes of test data located at http://drop.io/fdsayre
Note: I am an academic doing content analysis of old journal articles for a project in the history of psychology. I am no programmer, but I do have 10+ years of experience with Linux and can usually figure things out as I go.
Thanks for your help
FILENAME: bul_9_5_181.txt
SYN&STHESIA
ISI
the majority of Portugese words signifying black objects or ideas relating to black. This association is, admittedly, no true synsesthesia, but the author believes that it is only a matter of degree between these logical and spontaneous associations and genuine cases of colored audition.
REFERENCES
DOWNEY, JUNE E. A Case of Colored Gustation. Amer. J. of Psycho!., 1911, 22, S28-539MEDEIROS-E-ALBUQUERQUE. Sur un phenomene de synopsie presente par des millions de sujets. / . de psychol. norm, et path., 1911, 8, 147-151. MYERS, C. S. A Case of Synassthesia. Brit. J. of Psychol., 1911, 4, 228-238.
AFFECTIVE PHENOMENA — EXPERIMENTAL
BY PROFESSOR JOHN F. .SHEPARD
University of Michigan
Three articles have appeared from the Leipzig laboratory during the year. Drozynski (2) objects to the use of gustatory and olfactory stimuli in the study of organic reactions with feelings, because of the disturbance of breathing that may be involved. He uses rhythmical auditory stimuli, and finds that when given at different rates and in various groupings, they are accompanied by characteristic feelings in each subject. He records the chest breathing, and curves from a sphygmograph and a water plethysmograph. Each experiment began with a normal record, then the stimulus was given, and this was followed by a contrast stimulus; lastly, another normal was taken. The length and depth of breathing were measured (no time line was recorded), and the relation of length of inspiration to length of expiration was determined. The length and height of the pulsebeats were also measured. Tabular summaries are given of the number of times the author finds each quantity to have been increased or decreased during a reaction period with each type of feeling. The feeling state accompanying a given rhythm is always complex, but the result is referred to that dimension which seemed to be dominant. Only a few disconnected extracts from normal and reaction periods are reproduced from the records. The author states that excitement gives increase in the rate and depth of breathing, in the inspiration-expiration ratio, and in the rate and size of pulse. There are undulations in the arm volume. In so far as the effect is quieting, it causes decrease in rate and depth of
182
JOHN F. SHEPARD
breathing, in the inspiration-expiration ratio, and in the pulse rate and size. The arm volume shows a tendency to rise with respiratory waves. Agreeableness shows
It looks like a much simpler solution would actually work.
No one seems to be using the information provided by the filenames. If you do make use of this information, you may not have to do any comparisons between files to identify the area of overlap. Whoever wrote the OCR probably put some thought into this problem.
The last number in the file name tells you what the starting page number for that file is. This page number appears on a line by itself in the file as well. It also looks like this line is preceded and followed by blank lines. Therefore for a given file you should be able to look at the name of the next file in the sequence and determine the page number at which you should start removing text. Since this page number appears in your file just look for a line that contains only this number (preceded and followed by blank lines) and delete that line and everything after. The last file in the sequence can be left alone.
Here's an outline for an algorithm (a Python sketch follows the outline):
choose a file; call it: file1
look at the filename of the next file; call it: file2
extract the page number from the filename of file2; call it: pageNumber
scan the contents of file1 until you find a line that contains only pageNumber
make sure this line is preceded and followed by a blank line.
remove this line and everything after
move on to the next file in the sequence
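A minimal sketch of that outline (hedged: it assumes the bul_<vol>_<issue>_<page>.txt naming shown in the question, and it writes trimmed copies rather than overwriting the originals):

import sys

files = sorted(sys.argv[1:])   # e.g. bul_9_5_181.txt bul_9_5_186.txt ...
for current, nxt in zip(files, files[1:]):
    page = nxt.rsplit('_', 1)[1].split('.')[0]   # starting page of the next article
    with open(current) as f:
        lines = f.readlines()
    for i, line in enumerate(lines):
        # the page number sits on a line by itself, between blank lines
        if line.strip() == page and \
           (i == 0 or not lines[i-1].strip()) and \
           (i + 1 == len(lines) or not lines[i+1].strip()):
            lines = lines[:i]   # delete that line and everything after it
            break
    with open(current + '.trimmed', 'w') as f:
        f.writelines(lines)

The last file in the sequence is left alone, as the outline requires.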
You should probably try something like this (I've now tested it on the sample data you provided):
#!/usr/bin/ruby

class A_splitter
    Title      = /^[A-Z]+[^a-z]*$/
    Byline     = /^BY /
    Number     = /^\d*$/
    Blank_line = /^ *$/
    attr_accessor :recent_lines, :in_references, :source_glob, :destination_path, :seen_in_last_file

    def initialize(src_glob, dst_path=nil)
        @recent_lines = []
        @seen_in_last_file = {}
        @in_references = false
        @source_glob = src_glob
        @destination_path = dst_path
        @destination = STDOUT
        @buffer = []
        split_em
    end

    def split_here
        if destination_path
            @destination.close if @destination
            @destination = nil
        else
            print "------------SPLIT HERE------------\n"
        end
        print recent_lines.shift
        @in_references = false
    end

    def at_page_break
        ((recent_lines[0] =~ Title and recent_lines[1] =~ Blank_line and recent_lines[2] =~ Number) or
         (recent_lines[0] =~ Number and recent_lines[1] =~ Blank_line and recent_lines[2] =~ Title))
    end

    def print(*args)
        (@destination || @buffer) << args
    end

    def split_em
        Dir.glob(source_glob).sort.each { |filename|
            if destination_path
                @destination.close if @destination
                @destination = File.open(File.join(@destination_path, filename), 'w')
                print @buffer
                @buffer.clear
            end
            in_header = true
            File.foreach(filename) { |line|
                line.gsub!(/\f/, '')
                if in_header and seen_in_last_file[line]
                    # skip it
                else
                    seen_in_last_file.clear if in_header
                    in_header = false
                    recent_lines << line
                    seen_in_last_file[line] = true
                end
                3.times { recent_lines.shift } if at_page_break
                if recent_lines[0] =~ Title and recent_lines[1] =~ Byline
                    split_here
                elsif in_references and recent_lines[0] =~ Title and recent_lines[0] !~ /\d/
                    split_here
                elsif recent_lines.length > 4
                    @in_references ||= recent_lines[0] =~ /^REFERENCES *$/
                    print recent_lines.shift
                end
            }
        }
        print recent_lines
        @destination.close if @destination
    end
end

A_splitter.new('bul_*_*_*.txt', 'test_dir')
Basically, run through the files in order, and within each file run through the lines in order, omitting from each file the lines that were present in the preceding file and printing the rest to STDOUT (from which it can be piped), unless a destination directory is specified (called 'test_dir' in the example; see the last line), in which case files are created in the specified directory with the same name as the file which contained the bulk of their contents.
It also removes the page-break sections (journal title, author, and page number).
It does two split tests:
a test on the title/byline pair
a test on the first title-line after a reference section
(it should be obvious how to add tests for additional split-points).
Retained for posterity:
If you don't specify a destination directory, it simply puts a split-here line in the output stream at the split point. This should make testing easier (you can just less the output), and when you want the articles in individual files, just pipe the output to csplit (e.g. with
csplit -f abstracts - '---SPLIT HERE---' '{*}'
or something) to cut it up.
Here's the beginning of another possible solution in Perl (it works as-is, but could probably be made more sophisticated if needed). It sounds as if all you are concerned about is removing duplicates across the corpus, and you don't really care if the last part of one article is in the file for the next one, as long as it isn't duplicated anywhere. If so, this solution will strip out the duplicate lines, leaving only one copy of any given line in the set of files as a whole.
You can either just run the script in the directory containing the text files with no argument, or alternately specify a file name containing the list of files you want to process in the order you want them processed. I recommend the latter, as your file names (at least in the sample files you provided) do not naturally list out in order when using simple commands like ls on the command line or glob in the Perl script. Thus it won't necessarily compare the correct files to one another, as it just runs down the list (entered or generated by the glob command). If you specify the list, you can guarantee that the files will be processed in the correct order, and it doesn't take that long to set up properly.
The script simply opens two files and makes note of the first three lines of the second file. It then opens a new output file (original file name + '.new') for the first file and writes out all the lines from the first file into the new output file until it finds the first three lines of the second file. There is an off chance that the first three lines of the second file do not appear in the previous one, but in all the files I spot-checked they did, because of the journal-name header and page numbers. One line definitely wasn't enough, as the journal title was often the first line and that would cut things off early.
I should also note that the last file in your list of files will not be processed (i.e. have a new file created based on it), as it will not be changed by this process.
Here's the script:
#!/usr/bin/perl
use strict;

my @files;
my $count = @ARGV;
if ($count > 0) {
    open(IN, "$ARGV[0]");
    @files = <IN>;
    close(IN);
} else {
    @files = glob "bul_*.txt";
}
$count = @files;
print "Processing $count files.\n";

my $lastFile = "";
foreach (@files) {
    if ($lastFile ne "") {
        print "Processing $_\n";
        open(FILEB, "$_");
        my @fileBLines = <FILEB>;
        close(FILEB);
        my $line0 = $fileBLines[0];
        if ($line0 =~ /\(/ || $line0 =~ /\)/) {
            $line0 =~ s/\(/\\\(/;
            $line0 =~ s/\)/\\\)/;
        }
        my $line1 = $fileBLines[1];
        my $line2 = $fileBLines[2];
        open(FILEA, "$lastFile");
        my @fileALines = <FILEA>;
        close(FILEA);
        my $newName = "$lastFile.new";
        open(OUT, ">$newName");
        my $i = 0;
        my $done = 0;
        while ($done != 1 and $i < @fileALines) {
            if ($fileALines[$i] =~ /$line0/
                && $fileALines[$i+1] eq $line1    # 'eq' for string comparison
                && $fileALines[$i+2] eq $line2) {
                $done = 1;
            } else {
                print OUT $fileALines[$i];
                $i++;
            }
        }
        close(OUT);
    }
    $lastFile = $_;
}
EDIT: Added a check for parentheses in the first line, which goes into the regex duplicate check later on; if found, they are escaped so that they don't mess up the duplicate check.
You have a nontrivial problem. It is easy to write code to find the duplicate text at the end of file 1 and the beginning of file 2. But you don't want to delete the duplicate text---you want to split it where the second article begins. Getting the split right might be tricky---one marker is the all-caps title, another is the BY at the start of the next line.
It would have helped to have examples from consecutive files, but the script below works on one test case. Before trying this code, back up all your files. The code overwrites existing files.
The implementation is in Lua.
The algorithm is roughly:
Ignore blank lines at the end of file 1 and the start of file 2.
Find a long sequence of lines common to end of file 1 and start of file 2.
This works by trying a sequence of 40 lines, then 39, and so on
Remove sequence from both files and call it overlap.
Split overlap at title
Append first part of overlap to file1; prepend second part to file2.
Overwrite contents of files with lists of lines.
Here's the code:
#!/usr/bin/env lua

local ext = arg[1] == '-xxx' and '.xxx' or ''
if #ext > 0 then table.remove(arg, 1) end

local function lines(filename)
    local l = { }
    for line in io.lines(filename) do table.insert(l, (line:gsub('', ''))) end
    assert(#l > 0, "No lines in file " .. filename)
    return l
end

local function write_lines(filename, lines)
    local f = assert(io.open(filename .. ext, 'w'))
    for i = 1, #lines do
        f:write(lines[i], '\n')
    end
    f:close()
end

local function lines_match(line1, line2)
    io.stderr:write(string.format("%q ==? %q\n", line1, line2))
    return line1 == line2 -- could do an approximate match here
end

local function lines_overlap(l1, l2, k)
    if k > #l2 or k > #l1 then return false end
    io.stderr:write('*** k = ', k, '\n')
    for i = 1, k do
        if not lines_match(l2[i], l1[#l1 - k + i]) then
            if i > 1 then
                io.stderr:write('After ', i-1, ' matches: FAILED <====\n')
            end
            return false
        end
    end
    return true
end

function find_overlaps(fname1, fname2)
    local l1, l2 = lines(fname1), lines(fname2)
    -- strip trailing and leading blank lines
    while l1[#l1]:find '^[%s]*$' do table.remove(l1) end
    while l2[1] :find '^[%s]*$' do table.remove(l2, 1) end
    local matchsize -- # of lines at end of file 1 that are equal to the same
                    -- # at the start of file 2
    for k = math.min(40, #l1, #l2), 1, -1 do
        if lines_overlap(l1, l2, k) then
            matchsize = k
            io.stderr:write('Found match of ', k, ' lines\n')
            break
        end
    end
    if matchsize == nil then
        return false -- failed to find an overlap
    else
        local overlap = { }
        for j = 1, matchsize do
            table.remove(l1) -- remove line from first set
            table.insert(overlap, table.remove(l2, 1))
        end
        return l1, overlap, l2
    end
end

local function split_overlap(l)
    for i = 1, #l-1 do
        if l[i]:match '%u' and not l[i]:match '%l' then -- has caps but no lowers
            -- io.stderr:write('Looking for byline following ', l[i], '\n')
            if l[i+1]:match '^%s*BY%s' then
                local first = {}
                for j = 1, i-1 do
                    table.insert(first, table.remove(l, 1))
                end
                -- io.stderr:write('Split with first line at ', l[1], '\n')
                return first, l
            end
        end
    end
end

local function strip_overlaps(filename1, filename2)
    local l1, overlap, l2 = find_overlaps(filename1, filename2)
    if not l1 then
        io.stderr:write('No overlap in ', filename1, ' an
Are the stubs identical to the end of the previous file? Or are there different line endings/OCR mistakes?
Is there a way to discern an article's beginning? Maybe an indented abstract? Then you could go through each file and discard everything before the first and after (including) the second title.
Are the titles & author always on a single line? And does that line always contain the word "BY" in uppercase? If so, you can probably do a fair job with awk, using those criteria as the begin/end markers.
Edit: I really don't think that using diff is going to work, as it is a tool for comparing broadly similar files. Your files are (from diff's point of view) actually completely different - I think it will get out of sync immediately. But then, I'm not a diff guru :-)
A quick stab at it, assuming that the stub is strictly identical in both files:
#!/usr/bin/perl
use strict;
use List::MoreUtils qw/ indexes all pairwise /;
my #files = #ARGV;
my #previous_text;
for my $filename ( #files ) {
open my $in_fh, '<', $filename or die;
open my $out_fh, '>', $filename.'.clean' or die;
my #lines = <$in_fh>;
print $out_fh destub( \#previous_text, #lines );
#previous_text = #lines;
}
sub destub {
my #previous = #{ shift() };
my #lines = #_;
my #potential_stubs = indexes { $_ eq $lines[0] } #previous;
for my $i ( #potential_stubs ) {
# check if the two documents overlap for that index
my #p = #previous[ $i.. $#previous ];
my #l = #lines[ 0..$#previous-$i ];
return #lines[ $#previous-$i + 1 .. $#lines ]
if all { $_ } pairwise { $a eq $b } #p, #l;
}
# no stub detected
return #lines;
}
