Split IP address out of a list or string - python-3.x

I am trying to analyze a logfile in python 3.7. I have entries like this:
2020-07-13 18:05:43.332880;sshd;1144;logon root from 192.168.179.9 started
2020-07-13 18:10:12.332880;sshd;1854;logon admin from 192.168.179.3 finished
2020-07-14 03:17:02.332880;sshd;1169;logon admin from 10.0.1.5 failed
2020-07-14 03:19:30.332880;sshd;1297;logon root from 10.0.1.3 failed
I read the file into Python as such:
def readLog(fname):
    """ Read log-events from file """
    file = open(fname, encoding='utf-8')
    lines = file.read().split('\n')  # one log event per line
    file.close()
    return lines  # returns the logfile as a list of lines
I need to get a summary as output looking like this:
day:2020-07-01,prog:SSH,success:12,failure:13,ip:1.2.3.4
So the goal is to get all the info, such as the date and the number of failed and successful attempts, for each different IP address.
I tried splitting but I can't get it to work; I have tried a few approaches but with no result yet. I tried:
events = readLog(logFname)
for event in events:
    event.split(" ", 4)
    print(event)
    #ipList=[]
    #ipList.append(event)
The IP is after the 4th space, but I want only the IP, so only from the 4th space until the 5th. I'm not sure how to do that either, but I need to get the split to work first; then I can try to solve the details. I looked around on the internet and found a few solutions (see sources below), but I haven't been able to use them correctly or get them to work. I tried something like this:
events = readLog(logFname)
for event in events:
    [i.split(" ", 4) for i in event]
Hope someone can help me with this, thank you.
Sources:
How to split elements of a list?
Python Recognizing An IP In A String
http://www.datasciencemadesimple.com/remove-spaces-in-python/

Doing this should give you the IP; you can repeat with different indices to get other data from the string:
events = readLog(logFname)
ips = [event.split(" ")[4] for event in events if event]  # index 4 = the IP, the 5th space-separated token
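Building on that, the per-day, per-IP summary you describe can be produced by aggregating counts in a dictionary. A minimal sketch, assuming readLog returns a list of lines, that 'failed' marks a failure while 'started'/'finished' count as successes, and with prog hard-coded to SSH to match the desired output:

from collections import defaultdict

def summarize(lines):
    """Count successful and failed logons per (day, ip)."""
    counts = defaultdict(lambda: {"success": 0, "failure": 0})
    for line in lines:
        parts = line.split(" ")
        if len(parts) < 6:
            continue  # skip blank or malformed lines
        day, ip, outcome = parts[0], parts[4], parts[5]
        if outcome == "failed":
            counts[(day, ip)]["failure"] += 1
        else:
            counts[(day, ip)]["success"] += 1
    for (day, ip), c in sorted(counts.items()):
        print(f"day:{day},prog:SSH,success:{c['success']},failure:{c['failure']},ip:{ip}")

summarize(readLog(logFname))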

Related

Is there a way to fetch the URL from a Google search result when a CSV file full of keywords is uploaded in Python?

Is it possible to obtain the URL from a Google search result page, given a keyword? I have a CSV file that contains a lot of company names, and I want the website that shows up at the top of the Google search results for each of them: when I upload that CSV file, it should fetch each company name/keyword and put it in the search field.
For example: "stack overflow" is one of the entries in my CSV file; it should be fetched and put in the search field, and it should return the best match/first URL from the search results, e.g. www.stackoverflow.com.
The returned result should be stored in the same file which I uploaded, next to the keyword that was searched.
I am not very familiar with these concepts, so any help will be much appreciated.
Thanks!
The google package has a dependency on beautifulsoup, which needs to be installed first.
Then install:
pip install google
search(query, tld='com', lang='en', num=10, start=0, stop=None, pause=2.0)
query : the query string we want to search for.
tld : top-level domain, i.e. whether we want to search on google.com, google.in, or some other domain.
lang : the language to search in.
num : number of results we want.
start : first result to retrieve.
stop : last result to retrieve. Use None to keep searching forever.
pause : lapse to wait between HTTP requests. Too short a lapse may cause Google to block your IP; a longer lapse makes your program slower but is the safer option.
Return : a generator (iterator) that yields found URLs. If the stop parameter is None, the iterator will loop forever.
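For instance, fetching just the top hit for a single keyword might look like this (a minimal sketch, assuming the package installed above):

from googlesearch import search

# Take only the first URL the generator yields for the query.
top_url = next(search("stack overflow", tld="com", num=1, stop=1, pause=2.0))
print(top_url)  # e.g. https://stackoverflow.com/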
The code below is a solution to your question.
import pandas
from googlesearch import search

df = pandas.read_csv('test.csv')
result = []
for i in range(len(df['keys'])):
    for j in search(df['keys'][i], tld="com", num=10, stop=1, pause=2):
        result.append(j)
dict1 = {'keys': df['keys'], 'url': result}
df = pandas.DataFrame(dict1)
df.to_csv('test.csv')

Is it possible to modify the contents of a text object using Python?

I have a dictionary called “labels” that contains text objects.
When I display its contents, I get the following:
{'175.123.98.240': Text(-0.15349206308126684, -0.6696533109609498, '175.123.98.240'),
'54.66.152.105': Text(-1.0, -0.5455880938500245, '54.66.152.105'),
'62.97.116.82': Text(0.948676253595717, 0.6530664635187481, '62.97.116.82'),
'24.73.75.234': Text(0.849485905682265, -0.778703553136851, '24.73.75.234'),
'1.192.128.23': Text(0.2883091762715677, -0.03432011446968225, '1.192.128.23'),
'183.82.9.19': Text(-0.8855214994079628, 0.7201660238351776, '183.82.9.19'),
'14.63.160.219': Text(-0.047457773060320695, 0.655032585063581, '14.63.160.219')}
I want to change the IP address in the text object portion such that the file looks like this:
{'175.123.98.240': Text(-0.15349206308126684, -0.6696533109609498, 'xxx.123.98.240'),
'54.66.152.105': Text(-1.0, -0.5455880938500245, 'xxx.66.152.105'),
'62.97.116.82': Text(0.948676253595717, 0.6530664635187481, 'xxx.97.116.82'),
'24.73.75.234': Text(0.849485905682265, -0.778703553136851, 'xxx.73.75.234'),
'1.192.128.23': Text(0.2883091762715677, -0.03432011446968225, 'xxx.192.128.23'),
'183.82.9.19': Text(-0.8855214994079628, 0.7201660238351776, 'xxx.82.9.19'),
'14.63.160.219': Text(-0.047457773060320695, 0.655032585063581, 'xxx.63.160.219')}
This file is used for printing labels on a networkx graph.
I have a couple of questions.
Can the contents of a text object be modified?
If so, can it be changed without iterating through the file since the number of changes could range from 3 to 6,000, depending on what I am graphing?
How would I do it?
I did consider changing the IP addresses before I created my node and edge files, but that resulted in separate IP addresses being clustered incorrectly. For example: 173.6.48.24 and 1.6.48.24 would both be converted to xxx.6.48.24.
Changing the IP address at the time of printing the labels seems like the only sensible method.
I am hoping someone could point me in the right direction. I have never dealt with text objects and I am out of my depth on this one.
Thanks
Additional information
The original data set is a list of IP addresses that have attacked several honeypots I am running. I have catalogued the data based on certain attack criteria.
The data that I showed was just one of the small attack networks. The label file was generated using the code:
labels = nx.draw_networkx_labels(compX, pos=pos_df)
Where compX is the file containing the data to be graphed and pos_df is the layout of the graph. In this case, I used nx.spring_layout().
I can display the contents of the label file using:
for k, v in labels.items():
    print(v)
However, “v” contains the text object, which I do not seem to be able to work with. The content of “v” is as follows:
Text(-0.15349206308126684, -0.6696533109609498, '175.123.98.240')
Text(-1.0, -0.5455880938500245, '54.66.152.105')
Text(0.948676253595717, 0.6530664635187481, '62.97.116.82')
Text(0.849485905682265, -0.778703553136851, '24.73.75.234')
Text(0.2883091762715677, -0.03432011446968225, '1.192.128.23')
Text(-0.8855214994079628, 0.7201660238351776, '183.82.9.19')
Text(-0.047457773060320695, 0.655032585063581, '14.63.160.219')
This is where I am stuck. I cannot seem to come up with any code that does not return some kind of “'Text' object has no attribute xxxx” error.
As for replacing the first octet, I have the following code that works on a dataframe, and I have been experimenting to see if I can adapt it, but so far no luck:
df[column_ID] = df[column_ID].apply(lambda x: "xxx."+".".join(x.split('.')[1:4])) # Replace First octet
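For what it's worth, on a small Series this lambda behaves as follows (a toy example; note how the two distinct addresses from the earlier example collapse to the same label, which is exactly the clustering problem described above):

import pandas as pd

s = pd.Series(['173.6.48.24', '1.6.48.24'])
print(s.apply(lambda x: "xxx." + ".".join(x.split('.')[1:4])).tolist())
# ['xxx.6.48.24', 'xxx.6.48.24']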
As I said, I would prefer not to iterate through the file. This cluster has seven entries; others can contain up to 6,000 nodes – granted the graph looks like a hairball with this many nodes, but most are between 3 and 25 nodes. I have a total of 60 clusters and as I collect more information, this number will rise.
I found a solution for replacing text inside a text object:
1) Convert the text object to a string
2) Find the position to be changed and make the change
3) Use set_text() to apply the change to the text object
Example code:
# Anonymize source IP addresses
for k, v in labels.items():
    a = str(v)                        # e.g. "Text(-0.153..., -0.669..., '175.123.98.240')"
    a = a[a.find(", '"):]             # keep from the quoted IP onwards
    a = 'xxx' + a[a.find("."):][:-2]  # replace the first octet, strip the trailing ')
    v.set_text(a)
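A slightly tidier variant could skip parsing str(v) and use the Text object's get_text() method directly, reusing the octet-replacement logic from the dataframe lambda above (a sketch, assuming the dictionary values are matplotlib Text objects):

# Anonymize the first octet via get_text()/set_text().
for v in labels.values():
    ip = v.get_text()  # e.g. '175.123.98.240'
    v.set_text('xxx.' + '.'.join(ip.split('.')[1:4]))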

How to use a conditional statement while scraping?

I'm trying to scrape the MTA website and need a little help scraping the "Train Lines" row. (Website for reference: https://advisory.mtanyct.info/EEoutage/EEOutageReport.aspx?StationID=All)
The train line information is stored as image files (1 line subway, A line subway, etc.) describing each line that's accessible through a particular station. I've had success scraping info out of rows in which only one train passes through, but I'm having difficulty figuring out how to iterate through the columns that have multiple trains passing through them, using a conditional statement to test whether a row has one line or multiple lines.
tableElements = table.find_elements_by_tag_name('tr')
That's the table I'm iterating through.
tableElements[2].find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_element_by_tag_name('img').get_attribute('alt')
This successfully gives me the value if only one value exists in the particular column.
tableElements[8].find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img')
This successfully gives me a list of values I can iterate through to extract the values I need.
Now I try to combine these lines of code in a for loop to extract all the information without stopping.
for info in tableElements[1:]:
    if info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img')[1] == True:
        for images in info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img'):
            print(images.get_attribute('alt'))
    else:
        print(info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_element_by_tag_name('img').get_attribute('alt'))
I'm getting the error message "list index out of range." I don't know why, as every iteration done in isolation seems to work. My hunch is that I haven't used the boolean operation properly here. My idea was that if find_elements_by_tag_name had an element at index [1], that would mean there were multiple image alt texts for me to iterate through. Hence why I want to use this boolean operation.
Hi All, thanks so much for your help. I've uploaded my full code to Github and attached a link for your reference: https://github.com/tsp2123/MTA-Scraping/blob/master/MTA.ElevatorData.ipynb
The end goal is to put this information into a dataframe, using some formulation of a for loop that will extract the image information that I want.
dataframe = []
for elements in tableElements:
    row = {}
    columnName1 = find_element_by_class_name('td')
    ..
Your logic isn't far off here.
"My hunch is I haven't correctly used the boolean operation properly here. My idea was that if find_elements_by_tag_name had an index of [1] that would mean multiple image text for me to iterate through."
The problem is it can't check if the statement is True if there's nothing in index position [1]. Hence the error at this point.
if info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img')[1] == True:
What you want to do is use try/except. So something like:
for info in tableElements[1:]:
    try:
        if info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img')[1] == True:
            for images in info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img'):
                print(images.get_attribute('alt'))
        else:
            print(info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_element_by_tag_name('img').get_attribute('alt'))
    except:
        # do something else
        print('Nothing found in index position.')
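Alternatively, since find_elements_by_tag_name returns a plain (possibly empty) list, you could branch on its length instead of catching the IndexError; a sketch:

for info in tableElements[1:]:
    # find_elements_* returns a list, so its truthiness is a safe test.
    images = (info.find_elements_by_tag_name('td')[1]
                  .find_element_by_tag_name('h4')
                  .find_elements_by_tag_name('img'))
    if not images:
        print('No train-line images in this row.')
    else:
        for image in images:
            print(image.get_attribute('alt'))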
Also, is it possible to go back to your question and provide the full code? When I try this, I'm getting 11 table elements, so I want to test it with the specific table you're trying to scrape.

Pandas Drop and Replace functions won't work within a UDF

I looked around at other questions but couldn't find one that addresses the issue I'm having. I am cleaning a data set in an IPython notebook. When I run the cleaning tasks individually they work as expected, but I am having trouble with the replace() and drop() functions when they are included in a UDF (user-defined function). Specifically, these lines aren't doing anything within the UDF; however, a dataframe is returned that completes the other tasks as expected (i.e. reads in the file, sets the index, and filters select dates out).
Any help is much appreciated!
Note that in this problem the df.drop() and df.replace() commands both work as expected when executed outside of the UDF. The function is below for your reference. The issue is with the last two lines "station.replace()" and "station.drop()".
def read_file(file_path):
    '''Function to read in daily x data'''
    if os.path.exists(os.getcwd() + '/' + file_path) == True:
        station = pd.read_csv(file_path)
    else:
        !unzip alldata.zip
        station = pd.read_csv(file_path)
    station.set_index('date', inplace=True)  # put date in the index
    station = station_data[station_data.index > '1984-09-29']  # removes days where there is no y-data
    station.replace('---', '0', inplace=True)
    station.drop(columns=['Unnamed: 0'], axis=1, inplace=True)  # drop non-station columns
There was a mistake here:
station = station_data[station_data.index > '1984-09-29']
I was using an old table index. I corrected it to:
station = station[station.index > '1984-09-29']
Note, I had to restart the notebook and re-run it from the top for it to work. I believe it was an issue with conflicting table names in the UDF vs. what was already stored in memory.
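For reference, a corrected version of the function might look like this (a sketch: it uses the consistent name station throughout, swaps the notebook's !unzip for a subprocess call, and adds the return that the original snippet was missing):

import os
import subprocess

import pandas as pd

def read_file(file_path):
    '''Read in daily station data and clean it.'''
    if not os.path.exists(os.path.join(os.getcwd(), file_path)):
        subprocess.run(['unzip', 'alldata.zip'], check=True)  # stand-in for the notebook's !unzip
    station = pd.read_csv(file_path)
    station.set_index('date', inplace=True)             # put date in the index
    station = station[station.index > '1984-09-29']     # removes days where there is no y-data
    station.replace('---', '0', inplace=True)
    station.drop(columns=['Unnamed: 0'], inplace=True)  # drop non-station columns
    return station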

audio file isn't being parsed with Google Speech

This question is a followup to a previous question.
The snippet of code below almost works...it runs without error yet gives back a None value for results_list. This means it is accessing the file (I think) but just can't extract anything from it.
I have a file, sample.wav, living publicly here: https://storage.googleapis.com/speech_proj_files/sample.wav
I am trying to access it by specifying source_uri='gs://speech_proj_files/sample.wav'.
I don't understand why this isn't working. I don't think it's a permissions problem; my session is instantiated fine. The code chugs for a second, yet always comes up with no result. How can I debug this? Any advice is much appreciated.
from google.cloud import speech

speech_client = speech.Client()
audio_sample = speech_client.sample(
    content=None,
    source_uri='gs://speech_proj_files/sample.wav',
    encoding='LINEAR16',
    sample_rate_hertz=44100)
results_list = audio_sample.async_recognize(language_code='en-US')
Ah, that's my fault from the last question. That's the async_recognize command, not the sync_recognize command.
That library has three recognize commands. sync_recognize reads the whole file and returns the results. That's probably the one you want. Remove the letter "a" and try again.
Here's an example Python program that does this: https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/speech/cloud-client/transcribe.py
FYI, here's a summary of the other types:
async_recognize starts a long-running, server-side operation to translate the whole file. You can make further calls to the server, via the operation.poll() method, to see whether it has finished, and, when complete, you can get the results via operation.results.
The third type is streaming_recognize, which sends you results continually as they are processed. This can be useful for long files where you want some results immediately, or if you're continuously uploading live audio.
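Putting the suggestion above into code, the synchronous call might look like this (a sketch against the same old Client API the question uses; the exact shape of the returned results may vary by library version):

from google.cloud import speech

speech_client = speech.Client()
audio_sample = speech_client.sample(
    content=None,
    source_uri='gs://speech_proj_files/sample.wav',
    encoding='LINEAR16',
    sample_rate_hertz=44100)

# sync_recognize blocks until the whole file is processed and
# returns the results directly (attribute names assumed here).
for result in audio_sample.sync_recognize(language_code='en-US'):
    print(result.transcript, result.confidence)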
I finally got something to work:
import time
from google.cloud import speech

speech_client = speech.Client()
sample = speech_client.sample(
    content=None,
    source_uri='gs://speech_proj_files/sample.wav',
    encoding='LINEAR16',
    sample_rate=44100)  # the language code is passed to async_recognize below

retry_count = 100
operation = sample.async_recognize(language_code='en-US')
while retry_count > 0 and not operation.complete:
    retry_count -= 1
    time.sleep(10)
    operation.poll()  # API call
print(operation.complete)
print(operation.results[0].transcript)
print(operation.results[0].confidence)
Then something like:

for op in operation.results:
    print(op.transcript)
