I am working on a program that reads in log files that are in a weird format.
I have the following code reading from a number of log files (logFiles).
Iterating through them works fine. The problem is that, for some reason, the last element I append to the list dataList overwrites the entire list, so all entries end up the same.
import datetime

def fetchDataFromLogFiles(colIndex, colNames, colType, logFiles):
    dataList = []
    rowchange = []
    row = []
    for i in range(0, len(colIndex)):
        row.append(0)
        rowchange.append(0)
    for x in range(0, len(logFiles)):
        print("fetching data from: ")
        print(logFiles[x])
        logFile = open(logFiles[x], "r")
        for x in range(0, 4):
            line = logFile.readline()
            if not line:
                break
            else:
                line = line.split(",")
                TimeStamp = line[1]
                rest = line[2:]
                #TimeDate formatting
                whiteSpaces = len(TimeStamp) - len(TimeStamp.lstrip())
                TimeStamp = TimeStamp[whiteSpaces:]
                TimeStampFormat = "%a %b %d %H:%M:%S %Y"
                DT = datetime.datetime.strptime(TimeStamp, TimeStampFormat)
                #print(type(DT))
                #print(DT)
                row[0] = DT
                rowchange[0] += 1
                for i in range(0, len(rest)):
                    rest[i] = rest[i].replace(" ", "")
                    if(i % 2 == 0):
                        try:
                            index = colIndex.index(rest[i])
                            rowchange[index] += 1
                            if(colType[index] == "Bool"):
                                if(rest[i+1] == "0"):
                                    row[index] = False
                                    #print(type(row[index]))
                                else:
                                    row[index] = True
                                    #print(type(row[index]))
                            elif(colType[index] == "Short"):
                                row[index] = int(rest[i+1])
                                #print(type(row[index]))
                            else:
                                row[index] = rest[i+1]
                        except:
                            continue
                print(x)
                print(row[0])
                dataList.append(row)
                print(dataList[x][0])
        logFile.close()
        print("file closed")
    for x in range(0, len(dataList)):
        print(dataList[x][0])
    return dataList
Debug print output follows, to show what I mean:
0
2022-01-24 04:57:39
2022-01-24 04:57:39
1
2022-01-24 04:57:39
2022-01-24 04:57:39
2
2022-01-24 04:57:39
2022-01-24 04:57:39
3
2022-01-24 04:57:40
2022-01-24 04:57:40
file closed
2022-01-24 04:57:40
2022-01-24 04:57:40
2022-01-24 04:57:40
2022-01-24 04:57:40
A couple of lines of log file:
$1, Mon Jan 24 04:57:39 2022,$2,10,$3,42,$4, 0,$5, 1, #0, 42, #1, 130, #2, 144, #3, 49, #4, 45, #5, 66, #6, 2975, #7, -3981, #8, 4028, #9, 3900, #10, 2480, #13, 4029, #16, 2930, #17, 2190, #20, 102, #21, 2, #24, 2900, #25, 2200, #26, 6, #27, 51, #30, 2898, #31, 0, #32, 0, #33, 150, #34, 511, #35, -2, #36, 22, #37, 549, #38, -2, #39, 22, #40, 60, #41, 45, #42, 218, #43, -306, #45, 236, #46, -152, #48, 100, #49, 0, #50, 100, #51, 0, #52, 137, #53, -200, #55, 137, #56, -64, #58, 7850, #59, 1, #60, 8300, #61, 1, #62, 100, #63, -1, #64, 60, #65, 88, #66, 86, #67, 108, #68, 1, #73, 1800, #74, 2500, #76, 0, #77, 0, #78, 0, #79, 0, #80, 0, #81, 173, #82, 174, #83, 0, #84, 0, #85, -11, #86, -100, #88, 400, #89, 2729, #90, 0, #91, 2762, #93, 5, #94, 15, #95, 11, #96, 99, #97, 30, #98, 0, #100, 0, #101, 0, #102, 0,#105, 1252, #106, 0, #107, 0, #108, 0,#109, 4029, #110, 0,#111, 3900,#112, 2480,#113, 2520,#114, 2200,#115, 2900, #116, 1, #117, 0, #118, 0, #119, 156, #120, 165, #121, 0, #123, 0, #124, 0, #126, 40, #127, 60, #128, 350, #129, 450, #130, 57,#131, 1740,#132, 1682, #133, 1, #134, 0, #135, 0, #136, 0, #137, 2, #138, 0, #141, 135, #142, 0,#150, 4800, #151, 0, #152, 0,#153, 4864, #154, 1, #155, 1, #156, 0, #157, 0, #158, 1, #160, 15, #161, 40, #162, 15, #163, 40, #164, 1, #165, 1, #166, 0,#170, 1000,#172, 3000, #173, 0, #175, 0, #179, 0, #180, 599, #181, 0, #182, 9, #183, 42, #184, 0, #185, 7, #186, 49, #187, 0, #188, 4, #189, 1, #190, 1,#191, 2261,#192, 1796, #193, 1, #194, 0, #197, 0,#199, 2000,#200, 10000, #201, 10, #202, 10,#203, 3000, #204, 0, #205, 900, #206, 0,#209, 1774, #210, 300, #211, 900,#212, 1800, #213, 100,#217, 3000,#218, 4500,#219, 1500,#220, 3000, #222, 60, #223, 9,#224, 3150, #225, 0, #226, 0, #231, 30, #232, 1, #233, 0, #234, 0, #235, 100, #236, 0, #237, 0, #240, 34, #241, 29, #251, 0,
$1, Mon Jan 24 04:57:39 2022,$2,10,$3,42,$4, 0,$5, 1, #8, 4029, #30, 2897, #82, 173, #141, 132,#153, 5120, #185, 8,
$1, Mon Jan 24 04:57:39 2022,$2,10,$3,42,$4, 0,$5, 1, #30, 2898, #91, 2761, #130, 17,#131, 1701,#132, 1683, #141, 130,#153, 5376, #155, 0, #185, 9,
$1, Mon Jan 24 04:57:40 2022,$2,10,$3,42,$4, 0,$5, 1, #7, -3982, #8, 4028, #89, 2731, #141, 128,#153, 5632, #185, 8,
After Ethan's comment I edited the code to not use globals, but I still have the same problem.
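The symptom looks like the classic Python aliasing effect: every iteration mutates and appends the same row list object, so every entry of dataList ends up referencing that one object. A minimal sketch of the effect and of one possible fix (appending a copy of row instead); this is a general illustration, not the original parsing code:

row = [0, 0]
dataList = []
for x in range(3):
    row[0] = x             # mutates the one shared list object
    dataList.append(row)   # every entry references that same object
print(dataList)            # [[2, 0], [2, 0], [2, 0]]

dataList = []
for x in range(3):
    row[0] = x
    dataList.append(row.copy())   # append an independent snapshot instead
print(dataList)            # [[0, 0], [1, 0], [2, 0]]

Rebuilding row inside the loop (or using list(row), or copy.deepcopy for nested data) achieves the same thing.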
Is there a way to find unique groups of n lines, and sort them as well? For example, given the following file content:
GPO: Microsoft Office Settings v2021.06
Folder Id: Software\Policies\Microsoft\Office\16.0\access\security\trusted locations\location20\description
Value: 74, 0, 38, 0, 74, 0, 32, 0, 71, 0, 101, 0, 110, 0, 101, 0, 114, 0, 97, 0, 108, 0, 32, 0, 84, 0, 114, 0, 117, 0, 115, 0, 116, 0, 101, 0, 100, 0, 32, 0, 76, 0, 111, 0, 99, 0, 97, 0, 116, 0, 105, 0, 111, 0, 110, 0, 0, 0
State: Enabled
GPO: Google Chrome Settings v2022.02
Folder Id: Software\Policies\Google\Chrome\CookiesAllowedForUrls\144
Value: 91, 0, 42, 0, 46, 0, 93, 0, 104, 0, 108, 0, 100, 0, 46, 0, 99, 0, 104, 0, 0, 0
State: Enabled
GPO: Internet Explorer Policy Settings
Folder Id: Software\Policies\Microsoft\Windows\CurrentVersion\Internet Settings\DisableCachingOfSSLPages
Value: 0, 0, 0, 0
State: Enabled
GPO: Google Chrome Settings v2022.02
Folder Id: Software\Policies\Google\Chrome\CookiesAllowedForUrls\38
Value: 91, 0, 42, 0, 46, 0, 93, 0, 100, 0, 111, 0, 99, 0, 117, 0, 115, 0, 105, 0, 103, 0, 110, 0, 46, 0, 110, 0, 101, 0, 116, 0, 0, 0
State: Enabled
GPO: AnyConnect Settings
Folder Id: Software\Policies\Microsoft\Windows\QoS\Cisco AnyConnect\
Value: 67, 0, 105, 0, 115, 0, 99, 0, 111, 0, 74, 0, 97, 0, 98, 0, 98, 0, 101, 0, 114, 0, 46, 0, 101, 0, 120, 0, 101, 0, 0, 0
State: Enabled
GPO: AnyConnect Settings
Folder Id: Software\Policies\Microsoft\Windows\QoS\Cisco AnyConnect\
Value: 67, 0, 105, 0, 115, 0, 99, 0, 111, 0, 74, 0, 97, 0, 98, 0, 98, 0, 101, 0, 114, 0, 46, 0, 101, 0, 120, 0, 101, 0, 0, 0
State: Enabled
Here n=4. The GPO policy AnyConnect Settings appears multiple times with duplicate attributes (Folder Id, Value, State); after processing the file, it should look like this:
GPO: AnyConnect Settings
Folder Id: Software\Policies\Microsoft\Windows\QoS\Cisco AnyConnect\
Value: 67, 0, 105, 0, 115, 0, 99, 0, 111, 0, 74, 0, 97, 0, 98, 0, 98, 0, 101, 0, 114, 0, 46, 0, 101, 0, 120, 0, 101, 0, 0, 0
State: Enabled
GPO: Google Chrome Settings v2022.02
Folder Id: Software\Policies\Google\Chrome\CookiesAllowedForUrls\38
Value: 91, 0, 42, 0, 46, 0, 93, 0, 100, 0, 111, 0, 99, 0, 117, 0, 115, 0, 105, 0, 103, 0, 110, 0, 46, 0, 110, 0, 101, 0, 116, 0, 0, 0
State: Enabled
GPO: Google Chrome Settings v2022.02
Folder Id: Software\Policies\Google\Chrome\CookiesAllowedForUrls\144
Value: 91, 0, 42, 0, 46, 0, 93, 0, 104, 0, 108, 0, 100, 0, 46, 0, 99, 0, 104, 0, 0, 0
State: Enabled
GPO: Internet Explorer Policy Settings
Folder Id: Software\Policies\Microsoft\Windows\CurrentVersion\Internet Settings\DisableCachingOfSSLPages
Value: 0, 0, 0, 0
State: Enabled
GPO: Microsoft Office Settings v2021.06
Folder Id: Software\Policies\Microsoft\Office\16.0\access\security\trusted locations\location20\description
Value: 74, 0, 38, 0, 74, 0, 32, 0, 71, 0, 101, 0, 110, 0, 101, 0, 114, 0, 97, 0, 108, 0, 32, 0, 84, 0, 114, 0, 117, 0, 115, 0, 116, 0, 101, 0, 100, 0, 32, 0, 76, 0, 111, 0, 99, 0, 97, 0, 116, 0, 105, 0, 111, 0, 110, 0, 0, 0
State: Enabled
How can I perform this programmatically?
Edit: Below is what I have right now:
import json
import re

GPOs = []
regexs = ['GPO:', 'Folder Id:', 'Value:', 'State:']

with open("administrative-templates-after.txt", mode="r") as after_template:
    lines = after_template.readlines()
    entry = []
    for index in range(0, len(lines) - 1):
        for regex in regexs:
            if re.search(regex, lines[index]):
                entry.append("%s %s" % (regex, lines[index].replace(regex, "").strip()))
                if len(entry) == 4 or (len(entry) == 3 and re.search('State:', entry[len(entry) - 1])):
                    GPOs.append(entry)
                    entry = []

GPOs = list(dict.fromkeys(list(map(tuple, GPOs))))

with open("after_templates_unique.txt", "w") as fp:
    json.dump(GPOs, fp, indent=2)
It works for now: first it groups the lines into a list of lists, then it removes the duplicates. But I feel like it can be optimized, as right now it works by brute force.
Once I am able to find a better approach I will try and sort these groups.
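If every record in the file really is the four lines GPO:, Folder Id:, Value:, State: in that fixed order (the original code also tolerates three-line records, which this sketch does not), the grouping, de-duplication and sorting can be done without regexes. A rough sketch only, reusing the file names from the code above:

n = 4
with open("administrative-templates-after.txt") as f:
    lines = [line.strip() for line in f if line.strip()]

# chunk the file into consecutive groups of n lines
groups = [tuple(lines[i:i + n]) for i in range(0, len(lines), n)]

# dict.fromkeys drops duplicate groups while preserving first-seen order
unique_groups = list(dict.fromkeys(groups))

# sort the groups by their first line (the "GPO:" line)
unique_groups.sort(key=lambda g: g[0])

with open("after_templates_unique.txt", "w") as fp:
    for group in unique_groups:
        fp.write("\n".join(group) + "\n")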
I am analysing an algorithm that returns the location of a "peak value" in a square matrix (meaning the neighbors of the value are less than or equal to the value).
The algorithm in question is very inefficient, because it checks values one by one, starting at position (0, 0) and moving to whichever neighbor is greater than the current value. Here is the code:
def algorithm(problem, location=(0, 0), trace=None):
    # if it's empty, it's done!
    if problem.numRow <= 0 or problem.numCol <= 0:  # O(1)
        return None
    # getBetterNeighbor evaluates the neighbor values and returns the location of the
    # highest one; if there is no better neighbor, it returns the location itself
    nextLocation = problem.getBetterNeighbor(location, trace)  # O(1)
    if nextLocation == location:
        # if it doesn't have a better neighbor, then it's a peak
        if not trace is None: trace.foundPeak(location)  # O(1)
        return location
    else:
        # there is a better neighbor, so do a recursive call with that location
        return algorithm(problem, nextLocation, trace)  # O(????)
I know that the best case is when the peak is at (0, 0), and I determined that the worst-case scenario is the following (using a 10x10 matrix):
problem = [
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 10],
[34, 35, 36, 37, 38, 39, 40, 41, 0, 11],
[33, 0, 0, 0, 0, 0, 0, 42, 0, 12],
[32, 0, 54, 55, 56, 57, 0, 43, 0, 13],
[31, 0, 53, 0, 0, 58, 0, 44, 0, 14],
[30, 0, 52, 0, 0, 0, 0, 45, 0, 15],
[29, 0, 51, 50, 49, 48, 47, 46, 0, 16],
[28, 0, 0, 0, 0, 0, 0, 0, 0, 17],
[27, 26, 25, 24, 23, 22, 21, 20, 19, 18]]
Note that it basically makes the algorithm go in a spiral and it has to evaluate 59 positions.
So, the question is: How do I get the time complexity for this case in particular and why is that?
I know that all the operations are O(1), except for the recursion, and I'm lost
For an arbitrary matrix of size [m,n], as you showed with your example, we can break down the traversal of a given matrix made by this algorithm (A) as follows:
A will traverse n-1 elements from the top-left corner to element 8,
then m-1 elements from 9 to 17,
then n-1 elements from 18 to 27,
then m-3 elements from 27 to 33,
then n-3 elements from 34 to 40,
then m-5 elements from 41 to 45,
then n-5 elements from 46 to 50,
then m-7 elements from 51 to 53
etc.
At this point, the pattern should be clear, and thus the following worst-case recurrence relation can be established:
T(m,n) = T(m-2, n-2) + (m-1) + (n-1)
T(m,n) = T(m-4, n-4) + (m-3) + (n-3) + (m-1) + (n-1)
...
T(m,n) = T(m-2i, n-2i) + i*m + i*n - 2*i^2
where i is the number of iterations, and this recurrence will continue only while m-2i and n-2i are both greater than 0.
WLOG we can assume m >= n, so the algorithm continues while m-2i > 0, i.e. while m > 2i, i.e. for i <= m/2 iterations. Thus, plugging i = m/2 back in, we get:
T(m,n) = T(m-m, n-m) + (m/2)*m + (m/2)*n - 2*(m/2)^2
T(m,n) = 0 + m^2/2 + m*n/2 - 2*(m^2/4)
T(m,n) = m*n/2 = O(m*n)
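As a quick numeric check of the 59-position figure from the question, the greedy walk can be simulated directly on the 10x10 worst case. This is a standalone sketch, not the original problem class: the grid is a plain list of lists, and getBetterNeighbor is re-implemented here under the assumption that only the four orthogonal neighbors are considered.

problem = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 10],
    [34, 35, 36, 37, 38, 39, 40, 41, 0, 11],
    [33, 0, 0, 0, 0, 0, 0, 42, 0, 12],
    [32, 0, 54, 55, 56, 57, 0, 43, 0, 13],
    [31, 0, 53, 0, 0, 58, 0, 44, 0, 14],
    [30, 0, 52, 0, 0, 0, 0, 45, 0, 15],
    [29, 0, 51, 50, 49, 48, 47, 46, 0, 16],
    [28, 0, 0, 0, 0, 0, 0, 0, 0, 17],
    [27, 26, 25, 24, 23, 22, 21, 20, 19, 18]]

def better_neighbor(grid, r, c):
    # location of the largest strictly greater 4-neighbor, or (r, c) itself if none is greater
    best = (r, c)
    for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
        if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and grid[nr][nc] > grid[best[0]][best[1]]:
            best = (nr, nc)
    return best

location, evaluated = (0, 0), 1
while True:
    nxt = better_neighbor(problem, *location)
    if nxt == location:
        break
    location, evaluated = nxt, evaluated + 1

print(location, problem[location[0]][location[1]], evaluated)   # (5, 5) 58 59

The count matches the 59 positions noted in the question and is consistent with the m*n/2 closed form above.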
I have multiple npz files which I want to merge into one npz file with a format similar to "mnist.npz".
The format of mnist.npz is:
((array([[[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]],
[0, 0, 0, ..., 0, 0, 0]]], dtype=uint8),
array([5, 0, 4, ..., 5, 6, 8], dtype=uint8))
Here two arrays are merged into one big npz file.
My two npz arrays are:
x_array:
[[[252, 251, 253],
[151, 150, 152],
[ 28, 25, 27],
...,
[ 30, 25, 27],
[ 30, 25, 27],
[ 32, 27, 29]],
[ 23, 18, 20]],
[[ 50, 92, 163],
[ 55, 90, 163],
[ 75, 105, 176],
...,
[148, 197, 242],
[109, 157, 208],
[109, 165, 222]],
[[ 87, 104, 155],
[ 82, 112, 168],
...,
[ 29, 52, 105],
[ 30, 55, 111],
[ 36, 55, 106]]]
y_array:
[1, 1, 1, 1, 1, 1]
When I tried to merge my files, the output I got is:
(array([[[252, 251, 253],
[151, 150, 152],
[ 28, 25, 27],
...,
[ 30, 25, 27],
[ 30, 25, 27],
[ 32, 27, 29]],
[ 23, 18, 20]]], dtype=uint8), array([[[ 50, 92, 163],
[ 55, 90, 163],
[ 75, 105, 176],
...,
[148, 197, 242],
[109, 157, 208],
[109, 165, 222]],
[ 87, 104, 155],
[ 82, 112, 168],
...,
[ 29, 52, 105],
[ 30, 55, 111],
[ 36, 55, 106]]], dtype=uint8),1, 1, 1, 1, 1, 1)
So in the last line, my array is formatted as
1, 1, 1, 1, 1, 1
instead of something like:
array([1, 1, 1, 1, 1, 1], dtype=uint8)
My code for merging two npz files is:
import numpy as np
from numpy import load

data = load('x_array.npz', allow_pickle=True)
lst = data.files
for item in lst:
    x_train = data[item]
    #print((x_item,x_train))

data1 = load('y_array.npz', allow_pickle=True)
lst1 = data1.files
for item in lst1:
    y_train = data1[item]

out1 = (*x_train, *y_train)
np.savez('out1.npz', out1)
print(out1)
Can anyone please suggest how I can convert my second array from (1, 1, 1, 1, 1, 1) to array([1, 1, 1, 1, 1, 1], dtype=uint8)? Any suggestions are helpful.
After going through my code I found out that the problem was solved by changing the line
out1 = (*x_train, *y_train)
to
out1 = (*x_train, y_train)
so that y_train is kept as a single array instead of being unpacked element by element.
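For reference, a minimal sketch of an alternative that keeps the two arrays as separate named members of the output file, the way mnist.npz stores its arrays. The file names come from the question; the key names x_train/y_train and the uint8 cast for the labels are assumptions:

import numpy as np

data = np.load('x_array.npz', allow_pickle=True)
x_train = data[data.files[0]]            # assumes each source file holds a single array
data1 = np.load('y_array.npz', allow_pickle=True)
y_train = data1[data1.files[0]]

# save both arrays under explicit keys instead of unpacking them into one flat tuple
np.savez('out1.npz',
         x_train=x_train,
         y_train=np.asarray(y_train, dtype=np.uint8))

merged = np.load('out1.npz')
print(repr(merged['y_train']))           # array([1, 1, 1, 1, 1, 1], dtype=uint8)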
I'm using pytesseract to return the coordinates of the objects in an image.
By using this piece of code:
import pytesseract
from pytesseract import Output
import cv2

img = cv2.imread('wine.jpg')
d = pytesseract.image_to_data(img, output_type=Output.DICT)
print(d)

n_boxes = len(d['text'])
for i in range(n_boxes):
    (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imshow('img', img)
cv2.waitKey(0)
I get that:
{'level': [1, 2, 3, 4, 5, 5, 2, 3, 4, 5, 4, 5, 2, 3, 4, 5], 'page_num': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'block_num': [0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3], 'par_num': [0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1], 'line_num': [0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 2, 2, 0, 0, 1, 1], 'word_num': [0, 0, 0, 0, 1, 2, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1], 'left': [0, 485, 485, 485, 485, 612, 537, 537, 555, 555, 537, 537, 454, 454, 454, 454], 'top': [0, 323, 323, 323, 323, 324, 400, 400, 400, 400, 426, 426, 0, 0, 0, 0], 'width': [1200, 229, 229, 229, 115, 102, 123, 123, 89, 89, 123, 123, 296, 296, 296, 296], 'height': [900, 29, 29, 29, 28, 28, 40, 40, 15, 15, 14, 14, 892, 892, 892, 892], 'conf': ['-1', '-1', '-1', '-1', 58, 96, '-1', '-1', '-1', 95, '-1', 95, '-1', '-1', '-1', 95], 'text': ['', '', '', '', "JACOB'S", 'CREEK', '', '', '', 'SHIRAZ', '', 'CABERNET', '', '', '', '']}
(image used: wine.jpg)
However, when I use this image:
I get that:
{'level': [1, 2, 3, 4, 5], 'page_num': [1, 1, 1, 1, 1], 'block_num': [0, 1, 1, 1, 1], 'par_num': [0, 0, 1, 1, 1], 'line_num': [0, 0, 0, 1, 1], 'word_num': [0, 0, 0, 0, 1], 'left': [0, 0, 0, 0, 0], 'top': [0, 162, 162, 162, 162], 'width': [1200, 0, 0, 0, 0], 'height': [900, 276, 276, 276, 276], 'conf': ['-1', '-1', '-1', '-1', 95], 'text': ['', '', '', '', '']}
Any idea why some images are working and some aren't?
It is mainly caused by differences in quality and contrast; it is much easier for the OCR engine to detect text in some images than in others.
You can add a few pre-processing routines, including thresholding, blurring, histogram equalization and lots of other techniques. It is mainly subjective, so I cannot provide you with working code; it is more a matter of trial and error to find the best technique for your case.
UPDATE:
Here is some code that might help you:
import cv2

def preprocessing_typing_detection(inputImage):
    inputImage = cv2.cvtColor(inputImage, cv2.COLOR_BGR2GRAY)
    inputImage = cv2.Laplacian(inputImage, cv2.CV_8U)
    return inputImage
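A possible way to wire it in (a sketch only; the image path comes from the question, and the confidence filter at the end is an assumption): run the preprocessing before calling image_to_data and compare the recognised words with the unprocessed result:

import pytesseract
from pytesseract import Output
import cv2

img = cv2.imread('wine.jpg')
pre = preprocessing_typing_detection(img)   # grayscale + Laplacian edge image
d = pytesseract.image_to_data(pre, output_type=Output.DICT)

# keep only the boxes where Tesseract actually recognised a word
words = [t for t, c in zip(d['text'], d['conf']) if str(c) != '-1' and t.strip()]
print(words)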
I have a list of numbers and I want to slice the list at every occurrence of 192 and collect each slice into its own list.
My list:
[192, 0, 1, 0, 1, 192, 12, 0, 5, 0, 1, 0, 1, 66, 218, 0, 10, 5, 115, 116, 97, 116, 115, 1, 108, 192, 20, 192, 53, 0, 1, 0, 1, 0, 0, 0, 162, 0, 4, 74, 125, 133, 155, 192, 53, 0, 1, 0, 1, 0, 0, 0, 162, 0, 4, 74, 125, 133, 156, 192, 53, 0, 1, 0, 1, 0, 0, 0, 162, 0, 4, 74, 125, 133, 154, 192, 53, 0, 1, 0, 1, 0, 0, 0, 162, 0, 4, 74, 125, 133, 157]
I want something like this:
[192, 0, 1, 0, 1 ]
[192, 12, 0, 5, 0, 1, 0, 1, 66, 218, 0, 10, 5, 115, 116, 97, 116, 115, 1, 108]
[192, 20, 192, 53, 0, 1, 0, 1, 0, 0, 0, 162, 0, 4, 74, 125, 133, 155]
until the end of the list.
Here's one possible way to do it:
# input list
lst = [192, 0, 1, 0, 1, 192, 12, 0, 5, 0, 1, 0, 1, 66, 218, 0, 10, 5, 115, 116, 97, 116, 115, 1, 108, 192, 20, 192, 53, 0, 1, 0, 1, 0, 0, 0, 162, 0, 4, 74, 125, 133, 155, 192, 53, 0, 1, 0, 1, 0, 0, 0, 162, 0, 4, 74, 125, 133, 156, 192, 53, 0, 1, 0, 1, 0, 0, 0, 162, 0, 4, 74, 125, 133, 154, 192, 53, 0, 1, 0, 1, 0, 0, 0, 162, 0, 4, 74, 125, 133, 157]
# list of indexes where 192 is found,
# plus one extra index for the final slice
indexes = [i for i, n in enumerate(lst) if n == 192] + [len(lst)]
# create the slices between consecutive indexes
[lst[indexes[i]:indexes[i+1]] for i in range(len(indexes) - 1)]
The result will be:
[[192, 0, 1, 0, 1],
[192, 12, 0, 5, 0, 1, 0, 1, 66, 218, 0, 10, 5, 115, 116, 97, 116, 115, 1, 108],
[192, 20],
[192, 53, 0, 1, 0, 1, 0, 0, 0, 162, 0, 4, 74, 125, 133, 155],
[192, 53, 0, 1, 0, 1, 0, 0, 0, 162, 0, 4, 74, 125, 133, 156],
[192, 53, 0, 1, 0, 1, 0, 0, 0, 162, 0, 4, 74, 125, 133, 154],
[192, 53, 0, 1, 0, 1, 0, 0, 0, 162, 0, 4, 74, 125, 133, 157]]
You can build a generator with itertools.groupby that uses 192's equality method as a key function, pair the output of the generator with zip and then use itertools.chain.from_iterable to join the pairs (the example below assumes your list is stored in variable l):
from itertools import groupby, chain
i = (list(g) for _, g in groupby(l, key=(192).__eq__))
[list(chain.from_iterable(p)) for p in zip(i, i)]
This returns:
[[192, 0, 1, 0, 1],
[192, 12, 0, 5, 0, 1, 0, 1, 66, 218, 0, 10, 5, 115, 116, 97, 116, 115, 1, 108],
[192, 20],
[192, 53, 0, 1, 0, 1, 0, 0, 0, 162, 0, 4, 74, 125, 133, 155],
[192, 53, 0, 1, 0, 1, 0, 0, 0, 162, 0, 4, 74, 125, 133, 156],
[192, 53, 0, 1, 0, 1, 0, 0, 0, 162, 0, 4, 74, 125, 133, 154],
[192, 53, 0, 1, 0, 1, 0, 0, 0, 162, 0, 4, 74, 125, 133, 157]]
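To see why zip(i, i) pairs things up correctly, it helps to look at what groupby yields on a short prefix of the list:

from itertools import groupby

sample = [192, 0, 1, 0, 1, 192, 12]
print([list(g) for _, g in groupby(sample, key=(192).__eq__)])
# [[192], [0, 1, 0, 1], [192], [12]]

zip(i, i) pulls two consecutive groups from the same generator on every step, so each [192] run is paired with the run that follows it, and chain.from_iterable then glues each pair back into one slice.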