Lack of documentation for the format of the Cuckoo Sandbox report file - need to know key names and order for parsing

My program parses the JSON report file output from Cuckoo, looking for DLLs. Unfortunately, there seems to be very little documentation (none, as far as I can see) specifying which fields are present in the file and whether they appear in a particular order; this makes scraping the documents very uncertain, since it is unclear whether the correct data will be captured.
At the moment, I parse the data between the 'pe_imports' key (if it is present) and the 'peid_signatures' key to find the DLLs. However, will the peid_signatures field always be present, and will it always directly follow the DLLs? If there is any documentation for this, or you could point me to the right part of the Cuckoo code, please let me know :)
Thanks

will the peid_signatures field always be present?
It depends on the signatures: if an object (in your case, a DLL) matches one of the defined signatures, Cuckoo will add the score and signature information to the report. These signatures will not always be present in the JSON report.
If you are interested in the keys Cuckoo generates for its analysis report, you can find them in the static.py file, located at:
cuckoo/cuckoo/processing/static.py
def _get_peid_signatures(self):
    """Gets PEID signatures.
    @return: matched signatures or None.
    """
    try:
        sig_path = cwd("peutils", "UserDB.TXT", private=True)
        signatures = peutils.SignatureDatabase(sig_path)
        return signatures.match(self.pe, ep_only=True)
    except:
        return None
and for the results:
def run(self):
    """Run analysis.
    @return: analysis results dict or None.
    """
    if not os.path.exists(self.file_path):
        return {}
    try:
        self.pe = pefile.PE(self.file_path)
    except pefile.PEFormatError:
        return {}
    results = {}
    results["peid_signatures"] = self._get_peid_signatures()
    results["pe_imports"] = self._get_imported_symbols()
    results["pe_exports"] = self._get_exported_symbols()
    results["pe_sections"] = self._get_sections()
    results["pe_resources"] = self._get_resources()
    results["pe_versioninfo"] = self._get_versioninfo()
    results["pe_imphash"] = self._get_imphash()
    results["pe_timestamp"] = self._get_timestamp()
    results["pdb_path"] = self._get_pdb_path()
    results["signature"] = self._get_signature()
    results["imported_dll_count"] = len([x for x in results["pe_imports"] if x.get("dll")])
    return results
If there is any documentation for this
Not sure about this point. Cuckoo Sandbox has a detailed installation guide and documentation on how to modify and customize the modules and output, but I am also currently unable to find documentation of the report format.
If there is any confusion or you need more details, feel free to ask.
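Since peid_signatures can be None and the keys should not be assumed to appear in a fixed order, it is safer to parse the report as JSON and read pe_imports directly rather than scraping the text between keys. A minimal sketch, assuming the static-analysis results produced by run() above end up under a "static" key in the report (verify this against your Cuckoo version):
import json

# Minimal sketch: read "pe_imports" from the parsed JSON instead of scraping
# the raw text between keys. The "static" key is an assumption; check where
# your Cuckoo version nests the results of run() shown above.
with open("report.json") as f:
    report = json.load(f)

imports = report.get("static", {}).get("pe_imports") or []  # may be absent or None
dlls = [entry["dll"] for entry in imports if entry.get("dll")]
print(dlls)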

Related

Unable to access text from a class using Selenium in Python

I am trying to parse https://2gis.kz, and I have encountered a problem: I get an error when using .text or any other method to extract text from a class.
I type in a search query such as "fitness".
My window variable is
all_cards = driver.find_elements(By.CLASS_NAME, "_1hf7139")
for card_ in all_cards:
    card_.click()
    window = driver.find_element(By.CLASS_NAME, "_18lzknl")
This is quite a simplified version of how I open a mini-window with all of the essential information inside it. Below is the piece of code where I am trying to extract text from a phone number holder.
texts = window.find_elements(By.CLASS_NAME, '_b0ke8')
print(texts)  # this prints something, from which I conclude that the element is accessible
try:
    print(texts.text)
except:
    print(".text")
try:
    print(texts.text())
except:
    print(".text()")
try:
    print(texts.get_attribute("innerHTML"))
except:
    print('getAttribute("innerHTML")')
try:
    print(texts.get_attribute("textContent"))
except:
    print('getAttribute("textContent")')
try:
    print(texts.get_attribute("outerHTML"))
except:
    print('getAttribute("outerHTML")')
Hi guys, I solved the issue. .text was not working for some reason; I guess the developers somehow managed to protect the information from that method. I used
get_attribute("innerHTML")  # afaik this gives us the HTML inside a particular element
and now it works like a charm.
import io
import re

texts = window.find_elements(By.TAG_NAME, "bdo")
with io.open("t.txt", "a", encoding="utf-8") as f:
    for text in texts:
        # keep only the digits of the phone number
        nums = re.sub("[^0-9]", "", text.get_attribute("innerHTML"))
        f.write(nums + '\n')
# no f.close() needed: the "with" block closes the file automatically
So the problem was that:
I was trying to print a whole list of items just by using print(texts).
Even when I tried to print each element of the texts variable in a for loop, I was getting an error because the text was encoded in UTF-8.
I hope someone will find this useful and will not spend a plethora of time trying to fix such a simple bug.
The find_elements method returns a list of web elements. So this
texts = window.find_elements(By.CLASS_NAME, '_b0ke8')
makes texts a list of web elements.
You cannot apply the .text method directly to a list.
In order to get each element's text, you have to iterate over the elements in the list and extract each element's text, like this:
text_elements = window.find_elements(By.CLASS_NAME, '_b0ke8')
for element in text_elements:
    print(element.text)
Also, I'm not sure about the locators you are using.
The _1hf7139, _18lzknl and _b0ke8 class names appear to be dynamically generated, i.e. they may change with each browsing session.
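If those class names do rotate, one workaround is to anchor on page structure or attributes that survive redeployments instead. A sketch only; the selector below is hypothetical, so inspect the page to find attributes that are actually stable:
from selenium.webdriver.common.by import By

# Hypothetical example: prefer structural/attribute selectors over generated
# class names. Replace "div[data-id]" with whatever is stable on the real page.
cards = driver.find_elements(By.CSS_SELECTOR, "div[data-id]")
for card in cards:
    print(card.text)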

Automatically generate Google search query

I am trying to parse the HTML data of certain patents to gather information, using Python 3.7 and bs4.
My problem, simplified:
Given this URL
https://patents.google.com/patent/X/en?oq=Y
where:
X = a string automatically generated by Google
Y = my user input (a patent number)
and usually X == Y (some patent number),
I need to get the value of X.
A more detailed description of my problem:
For 90% of my queries there is no problem, as I can simply parse using the following code:
import requests
from bs4 import BeautifulSoup

patent_number = "EP1000000B1"
patent_url = ("https://patents.google.com/patent/" + patent_number + "/en?oq=" + patent_number)
r = requests.get(patent_url)
response = r.content
soup = BeautifulSoup(response, "html.parser")
However, sometimes the query structure varies. For instance, if I search for patent number WO198700753A1 using the code above, I get a 404 error, because the URL
https://patents.google.com/patent/WO198700753A1/en?oq=WO198700753A1
does not exist.
The en?oq=" + patent_number part does not seem to be relevant, but the first part is.
Searching Google Patents by hand reveals that Google automatically redirects my query from WO198700753A1 to WO1987000753A1 (another 0 added).
Is there any way to automatically generate my URL (the part in the middle) so that my program will always find results?
Thanks for your help ;)
At the moment, the structure of the link has changed and the oq parameter is no longer needed. The patent number cannot be generated automatically, because each patent has its own unique number. If you generate a patent number that does not exist, then the page does not exist either.
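That said, the question itself notes that Google redirects WO198700753A1 to the canonical WO1987000753A1. Since requests follows redirects by default, one sketch for recovering X is to read the final URL after the redirect; this assumes the final URL keeps the .../patent/<number>/en form, which Google may change at any time:
import requests

# Sketch: let requests follow Google's redirect and recover the canonical
# patent number (X) from the final URL. Relies on the redirect behaviour
# described in the question and on the .../patent/<number>/en URL shape.
patent_number = "WO198700753A1"  # user input (Y)
r = requests.get("https://patents.google.com/patent/" + patent_number + "/en")
r.raise_for_status()
parts = r.url.split("/")
canonical = parts[parts.index("patent") + 1]
print(canonical)  # expected: WO1987000753A1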

How to modify spacy.tokens.doc.Doc tokens with pipeline components in SpaCy

I'm using SpaCy to pre-process some data. However, I'm stuck on how to modify the content of the spacy.tokens.doc.Doc class.
For example, here:
import spacy

npc = spacy.load("pt")

def pre_process_text(doc) -> str:
    new_content = ""
    current_tkn = doc[0]
    for idx, next_tkn in enumerate(doc[1:], start=0):
        # Pre-process data.
        # new_content is currently how I generate the new content,
        # by concatenating the modified tokens.
        pass
    return new_content

npc.add_pipe(pre_process_text, last=True)
In the commented part of the above code there are some tokens I would like to remove from the doc param, and others whose text content I would like to change. In other words, I want to modify the content of spacy.tokens.doc.Doc by (1) removing tokens entirely, or (2) changing token contents.
Is there a way to create another spacy.tokens.doc.Doc with those modified tokens while keeping the Vocab from npc = spacy.load("pt")?
Currently I generate the new content by returning a string, but is there a way to return the modified Doc?
One of the core principles of spaCy's Doc is that it should always represent the original input:
spaCy's tokenization is non-destructive, so it always represents the original input text and never adds or deletes anything. This is kind of a core principle of the Doc object: you should always be able to reconstruct and reproduce the original input text.
While you can work around that, there are usually better ways to achieve the same thing without breaking the input text ↔ Doc text consistency.
I've outlined some approaches for excluding tokens without destroying the original input in my comment here.
Alternatively, if you really want to modify the Doc, your component could create a new Doc object and return that. The Doc object takes a vocab (e.g. the original doc's vocab), a list of string words, and an optional list of spaces (booleans indicating whether the token at each position is followed by a space or not).
from spacy.tokens import Doc

def pre_process_text(doc):
    # Generate a new list of tokens here
    new_words = create_new_words_here(doc)
    new_doc = Doc(doc.vocab, words=new_words)
    return new_doc
Note that you probably want to add this component first in the pipeline before other components run. Otherwise, you'd lose any linguistic features assigned by previous components (like part-of-speech tags, dependencies etc).
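For instance, a component in this style that drops stopwords might look like the following sketch, using spaCy's built-in is_stop flag and each token's whitespace_ attribute to preserve spacing:
from spacy.tokens import Doc

def remove_stopwords(doc):
    # Keep only non-stopword tokens, preserving trailing-space information
    kept = [t for t in doc if not t.is_stop]
    words = [t.text for t in kept]
    spaces = [bool(t.whitespace_) for t in kept]
    return Doc(doc.vocab, words=words, spaces=spaces)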

Filled PDF fields showing up differently in different contexts

I have a Python script that creates a number of PDF forms (0-10) and then concatenates them into one form. The fields on the compiled PDF show up differently in four different contexts. I am developing on Debian Linux, where the PDF viewer (Okular) does not show any fields within the compiled PDF. On Windows 10, if I open the PDF with Chrome, I have to hover over a field to see its value; the first page has the correct field data, but each subsequent page is just a duplicate of the first page, which is incorrect. If I open the PDF with Microsoft Edge, it correctly displays the form data for each page, but when I go to print with Edge, none of the form data shows up.
I am using pdfrw for writing to the PDF and PyPDF2 for merging. I have tried a number of different things, including attempting to flatten the PDF with Python (for which there is very little support, by the way), reading and writing instead of merging, and attempting to convert the form fields into text, along with many other things that I have since forgotten because they did not work.
import pdfrw
from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.generic import BooleanObject, IndirectObject, NameObject

# Annotation keys (defined elsewhere in the original script; these are the
# standard PDF names the code below expects)
Annot_Key = '/Annots'
Annot_Field_Key = '/T'
Subtype_Key = '/Subtype'
Widget_Subtype_Key = '/Widget'

def writeToPdf(unfilled, output, data, fields):
    '''Writes the values from `data` into the form fields of `unfilled` and saves the result as `output`.'''
    # TODO: Use literal declarations for lists, dicts, etc.
    checkboxes = [
        'misconduct_complete',
        'misconduct_incomplete',
        'not_final_exam',
        'supervise_exam',
        'not_final_home_exam',
        'not_final_assignment',
        'not_final_oral_exam',
        'not_final_lab_exam',
        'not_final_practical_exam',
        'not_final_other'
    ]
    template_pdf = pdfrw.PdfReader(unfilled)
    annotations = template_pdf.pages[0][Annot_Key]
    for annotation in annotations:
        # TODO: Singly nested ifs with no elses suggest a logic problem; find a clearer way to do this.
        if annotation[Subtype_Key] == Widget_Subtype_Key:
            if annotation[Annot_Field_Key]:
                key = annotation[Annot_Field_Key][1:-1]
                if key in fields:
                    if key in checkboxes:
                        annotation.update(pdfrw.PdfDict(AS=pdfrw.PdfName('Yes')))
                    elif key == 'course':
                        annotation.update(pdfrw.PdfDict(V='{}'.format(data[key][0:8])))
                    else:
                        annotation.update(pdfrw.PdfDict(V='{}'.format(data[key])))
    pdfrw.PdfWriter().write(output, template_pdf)

def set_need_appearances_writer(writer):
    # Sets /NeedAppearances so viewers regenerate field appearance streams;
    # without it, filled values may not render (which makes printing hard)
    try:
        catalog = writer._root_object
        # get the AcroForm tree and add the "/NeedAppearances" attribute
        if "/AcroForm" not in catalog:
            writer._root_object.update({
                NameObject("/AcroForm"): IndirectObject(len(writer._objects), 0, writer)})
        need_appearances = NameObject("/NeedAppearances")
        writer._root_object["/AcroForm"][need_appearances] = BooleanObject(True)
    except Exception as e:
        print('set_need_appearances_writer() catch : ', repr(e))
    return writer

def mergePDFs(listOfPdfPaths, outputPDf):
    '''Merges a list of PDFs into a single one and saves it to outputPDf.'''
    pdf_writer = PdfFileWriter()
    set_need_appearances_writer(pdf_writer)
    pdf_writer.setPageMode('/UseOC')
    for path in listOfPdfPaths:
        pdf_reader = PdfFileReader(path)
        for page in range(pdf_reader.getNumPages()):
            pdf_writer.addPage(pdf_reader.getPage(page))
    with open(outputPDf, 'wb') as fh:
        pdf_writer.write(fh)
As mentioned above, the results differ by context. On Debian Linux, Okular shows no form fields. On Windows 10, Google Chrome shows duplicated fields after the first page (and I have to hover over/click a field to see its value). Microsoft Edge shows the correct result, with each page having its own field data, but its print preview shows no form data.
If anyone else is having this rather obscure problem: the behavior is unspecified for the use case I was dealing with (a template fillable form with the same field names on every page). The only solution available in Python at the moment (at least that I found in my many hours of researching and testing) was to flatten the PDF: create a separate PDF, write the form data to the desired locations (I did this with reportlab), and then overlay the template PDF with the created PDF. Overall this is not a good solution for many reasons, so if you have a better one, please post it!
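A minimal sketch of that overlay approach, drawing the values with reportlab and stamping them onto the template page with PyPDF2. The coordinates and the single-page handling are illustrative assumptions; measure positions from your actual template:
from io import BytesIO

from PyPDF2 import PdfFileReader, PdfFileWriter
from reportlab.pdfgen import canvas

def overlay_page(template_path, output_path, values):
    # values: {text: (x, y)} - hypothetical mapping of field text to position
    buf = BytesIO()
    c = canvas.Canvas(buf)
    for text, (x, y) in values.items():
        c.drawString(x, y, text)     # place each value at its position
    c.save()
    buf.seek(0)

    overlay = PdfFileReader(buf).getPage(0)
    template = PdfFileReader(template_path)
    writer = PdfFileWriter()
    page = template.getPage(0)
    page.mergePage(overlay)          # stamp the values onto the template page
    writer.addPage(page)
    with open(output_path, "wb") as fh:
        writer.write(fh)

# usage: overlay_page("template.pdf", "out.pdf", {"John Doe": (100, 700)})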

How to use a conditional statement while scraping?

I'm trying to scrape the MTA website and need a little help scraping the "Train Lines" row. (Website for reference: https://advisory.mtanyct.info/EEoutage/EEOutageReport.aspx?StationID=All)
The train line information is stored as image files (1 line subway, A line subway, etc.) describing each line that is accessible through a particular station. I've had success scraping info out of rows in which only one train passes through, but I'm having difficulty figuring out how to iterate through columns that have multiple trains passing through, using a conditional statement to test whether a cell has one line or multiple lines.
tableElements = table.find_elements_by_tag_name('tr')
That's the table I'm iterating through.
tableElements[2].find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_element_by_tag_name('img').get_attribute('alt')
This successfully gives me the value if only one value exists in that column.
tableElements[8].find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img')
This successfully gives me a list of values that I can iterate through to extract the values I need.
Now I try to combine these lines of code in a for loop to extract all the information in one pass.
for info in tableElements[1:]:
    if info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img')[1] == True:
        for images in info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img'):
            print(images.get_attribute('alt'))
    else:
        print(info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_element_by_tag_name('img').get_attribute('alt'))
I'm getting the error message "list index out of range". I don't know why, as every iteration done in isolation seems to work. My hunch is that I haven't used the boolean operation correctly here. My idea was that if find_elements_by_tag_name had an element at index [1], that would mean there are multiple image alt texts for me to iterate through. Hence why I want to use this boolean operation.
Hi all, thanks so much for your help. I've uploaded my full code to GitHub for your reference: https://github.com/tsp2123/MTA-Scraping/blob/master/MTA.ElevatorData.ipynb
The end goal is to put this information into a dataframe, using some formulation like the following, with a for loop that extracts the image information that I want.
dataframe = []
for elements in tableElements:
    row = {}
    columnName1 = find_element_by_class_name('td')
    ..
Your logic isn't far off here.
"My hunch is I haven't correctly used the boolean operation properly here. My idea was that if find_elements_by_tag_name had an index of [1] that would mean multiple image text for me to iterate through."
The problem is that it can't check whether the statement is True if there's nothing at index position [1]. Hence the error at this point:
if info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img')[1] == True:
What you want to do is use try/except. So something like:
for info in tableElements[1:]:
    try:
        if info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img')[1] == True:
            for images in info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img'):
                print(images.get_attribute('alt'))
        else:
            print(info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_element_by_tag_name('img').get_attribute('alt'))
    except:
        # do something else
        print('Nothing found in index position.')
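Alternatively (a sketch, not part of the original answer), you can avoid the exception entirely: find_elements returns a possibly empty list, so a single loop over it covers both the one-image and multi-image cases without probing index [1]:
for info in tableElements[1:]:
    imgs = (info.find_elements_by_tag_name('td')[1]
                .find_element_by_tag_name('h4')
                .find_elements_by_tag_name('img'))
    # find_elements returns a (possibly empty) list, so no try/except is needed
    for image in imgs:
        print(image.get_attribute('alt'))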
Could you also go back to your question and add the full code? When I try this, I'm getting 11 table elements, so I want to test against the specific table you're trying to scrape.
