LinkedIn web scraping snippet - python-3.x

I'm doing a university research project on web scraping. I started from an existing GitHub project, but it does not retrieve all the data.
The project works like this:
Search Google using keywords, for example: (accountant 'email me at' Google).
Extract a snippet.
Retrieve data from this snippet.
The issue is:
The snippets extracted are like this: " ... marketing division in 2009. For more information on career opportunities with our company, email me: vicki@productivedentist.com. Neighborhood Smiles, LLC ..."
The snippet does not show everything; the "..." hides information such as role and location. How can I retrieve all of the information with the script?
from googleapiclient.discovery import build  # For using the Google Custom Search Engine API
import datetime as dt  # System date, used to name the output file
import sys
from xlwt import Workbook  # For writing the .xls file
import re  # For finding emails with a regex

if __name__ == '__main__':
    # Create an output file name in the format "srch_res_yyyyMMdd_hhmmss.xls" in the output folder
    now_sfx = dt.datetime.now().strftime('%Y%m%d_%H%M%S')
    output_dir = './output/'
    output_fname = output_dir + 'srch_res_' + now_sfx + '.xls'

    search_term = sys.argv[1]
    num_requests = int(sys.argv[2])

    my_api_key = "replace_with_your_api_key"  # Read readme.md to learn how to get your API key.
    my_cse_id = "011658049436509675749:gkuaxghjf5u"  # Google CSE that searches possible LinkedIn profiles for the query.

    service = build("customsearch", "v1", developerKey=my_api_key)

    wb = Workbook()
    sheet1 = wb.add_sheet(search_term[0:15])
    wb.save(output_fname)

    sheet1.write(0, 0, 'Name')
    sheet1.write(0, 1, 'Profile Link')
    sheet1.write(0, 2, 'Snippet')
    sheet1.write(0, 3, 'Present Organisation')
    sheet1.write(0, 4, 'Location')
    sheet1.write(0, 5, 'Role')
    sheet1.write(0, 6, 'Email')

    sheet1.col(0).width = 256 * 20
    sheet1.col(1).width = 256 * 50
    sheet1.col(2).width = 256 * 100
    sheet1.col(3).width = 256 * 20
    sheet1.col(4).width = 256 * 20
    sheet1.col(5).width = 256 * 50
    sheet1.col(6).width = 256 * 50
    wb.save(output_fname)

    row = 1  # To insert the data in the next row.

    # Function to perform the Google search.
    def google_search(search_term, cse_id, start_val, **kwargs):
        res = service.cse().list(q=search_term, cx=cse_id, start=start_val, **kwargs).execute()
        return res

    for i in range(0, num_requests):
        # This is the offset from the beginning to start getting the results from
        start_val = 1 + (i * 10)

        # Make an HTTP request object
        results = google_search(search_term,
                                my_cse_id,
                                start_val,
                                num=10  # num can be 1 to 10; it sets the number of results per request.
                                )

        # Iterate over the items actually returned (there may be fewer than 10).
        for profile in range(len(results.get('items', []))):
            snippet = results['items'][profile]['snippet']
            newSnippet = ' '.join(snippet.split('\n'))
            contain = re.search(r'[\w\.-]+@[\w\.-]+', newSnippet)
            if contain is not None:
                title = results['items'][profile]['title']
                link = results['items'][profile]['link']
                org = "-NA-"
                location = "-NA-"
                role = "-NA-"
                if 'person' in results['items'][profile]['pagemap']:
                    person = results['items'][profile]['pagemap']['person'][0]
                    if 'org' in person:
                        org = person['org']
                    if 'location' in person:
                        location = person['location']
                    if 'role' in person:
                        role = person['role']
                print(title[:-23])
                sheet1.write(row, 0, title[:-23])
                sheet1.write(row, 1, link)
                sheet1.write(row, 2, newSnippet)
                sheet1.write(row, 3, org)
                sheet1.write(row, 4, location)
                sheet1.write(row, 5, role)
                sheet1.write(row, 6, contain[0])
                print('Wrote {} search result(s)...'.format(row))
                wb.save(output_fname)
                row = row + 1

    print('Output file "{}" written.'.format(output_fname))
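
The truncation is inherent to the Custom Search API: the snippet field is only a short excerpt, so the text hidden behind the "..." is not available in the response at all. One option is to follow each result's link and extract the data from the full page instead. Below is a minimal sketch of that idea (my addition, not part of the original project), assuming the linked page is publicly fetchable; note that LinkedIn itself blocks anonymous scraping, so this mainly helps for the non-LinkedIn pages the CSE returns:

import re
import requests

def fetch_full_text(link):
    # Fetch the page behind a search result; a browser-like User-Agent
    # avoids some trivial bot blocks.
    resp = requests.get(link, timeout=10, headers={'User-Agent': 'Mozilla/5.0'})
    resp.raise_for_status()
    # Crude tag stripping keeps the sketch dependency-free; an HTML parser
    # such as BeautifulSoup would be more robust.
    return re.sub(r'<[^>]+>', ' ', resp.text)

def extract_email(link):
    # Run the same email regex over the full page text instead of the snippet.
    text = fetch_full_text(link)
    match = re.search(r'[\w\.-]+@[\w\.-]+', text)
    return match[0] if match else None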

Related

How can I add all the elements which have the same class to a variable using Selenium in Python

content = driver.find_element_by_class_name('topics-sec-block')
container = content.find_elements_by_xpath('//div[@class="col-sm-7 topics-sec-item-cont"]')
The code is below:
for i in range(0, 40):
    title = []
    url = []
    heading = container[i].find_element_by_xpath('//div[@class="col-sm-7 topics-sec-item-cont"]/a/h2').text
    link = container[i].find_element_by_xpath('//div[@class="col-sm-7 topics-sec-item-cont"]/a')
    title.append(heading)
    url.append(link.get_attribute('href'))
    print(title)
    print(url)
It gives me 40 lines, but all the lines have the same title and URL (some of them are given below):
['Stuck in Mexico: Central American asylum seekers in limbo']
['https://www.aljazeera.com/news/2020/03/stuck-mexico-central-american-asylum-seekers-limbo-200305103910955.html']
['Stuck in Mexico: Central American asylum seekers in limbo']
['https://www.aljazeera.com/news/2020/03/stuck-mexico-central-american-asylum-seekers-limbo-200305103910955.html']
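
This one appears to be unanswered here, but a likely cause (my note, not from the original thread) is that an XPath starting with // is evaluated against the whole document even when called on an element, so every iteration matches the first item on the page. Prefixing the expression with . makes it relative to each container element:

# Sketch: relative XPaths (leading ".") search inside each container element.
titles = []
urls = []
for item in container:
    heading = item.find_element_by_xpath('./a/h2').text
    link = item.find_element_by_xpath('./a')
    titles.append(heading)
    urls.append(link.get_attribute('href'))
print(titles)
print(urls)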

AWS Cognito 90 day automated Password rotation

I have a requirement to create an automated password reset script. I created a custom field to try to track this, and I also hope I can access some of the standard fields. The script should find users meeting the following criterion:
The latest of any of the following three dates is >= 90 days ago: Sign_Up, Forgot_Password, or custom:pwdCreateDate.
I can't seem to find any boto3 Cognito client way of getting this information, except for forgot-password, which shows up in admin_list_user_auth_events; that response doesn't include the username. I suppose that since you provide the username to get the events, you can find the latest forgot-password event and tie it back to the username.
Has anyone else implemented any boto3 automation to force a password reset based on any of these fields?
Here is where I landed. Take it with the understanding that Cognito has some limitations which make truly flawless password rotation difficult. Also, if you can make the script more efficient, you should: in Lambda you will probably time out with more than about 350 users because of the 5 RPS limit on the admin API.
Prerequisites: set the Lambda function to a concurrency of 5 or you will exceed the 5 RPS limit; one mutable field in your Cognito user pool attributes to hold a date; a custom Lambda zip file that includes pandas, saved to S3.
import os
import sys

# This adds the parent directory of bin so we can find the module
parent_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), os.path.pardir))
sys.path.append(parent_dir)
# This adds venv lib/python2.7/site-packages/ to the search path
mod_path = os.path.abspath(parent_dir + "/lib/python" + str(sys.version_info[0]) + "." + str(sys.version_info[1]) + "/site-packages/")
sys.path.append(mod_path)

import boto3
import datetime
import pandas as pd
import time

current_path = os.path.dirname(os.path.realpath(__file__))
# Use this one for the parent directory
ENV_ROOT = os.path.abspath(os.path.join(current_path, os.path.pardir))
# Use this one for the current directory
#ENV_ROOT = os.path.abspath(os.path.join(current_path))
sys.path.append(ENV_ROOT)

#if __name__ == "__main__":
def lambda_handler(event, context):
    user_pool_id = os.environ['USER_POOL_ID']
    idp_client = boto3.client('cognito-idp')
    users_list = []
    page_token = None
    dateToday = datetime.datetime.today().date()

    def update_user(user):
        # admin_update_user_attributes does not accept a UserStatus argument,
        # so the attribute update and the forced reset are two separate calls.
        idp_client.admin_update_user_attributes(
            UserPoolId=user_pool_id,
            Username=user,
            UserAttributes=[
                {
                    'Name': 'custom:pwdCreateDate',
                    'Value': str(dateToday)
                }
            ]
        )
        idp_client.admin_reset_user_password(
            UserPoolId=user_pool_id,
            Username=user
        )

    # Page through all users in the pool.
    users = idp_client.list_users(UserPoolId=user_pool_id)
    for user in users['Users']:
        users_list.append(user['Username'])
    page_token = users.get('PaginationToken')  # .get() avoids a KeyError when there is only one page
    while page_token:
        users = idp_client.list_users(
            UserPoolId=user_pool_id,
            PaginationToken=page_token
        )
        for user in users['Users']:
            users_list.append(user['Username'])
        page_token = users.get('PaginationToken')

    # Date each user's password was last set, from the custom attribute.
    attrPwdDates = []
    for username in users_list:
        userAttributes = idp_client.admin_get_user(
            UserPoolId=user_pool_id,
            Username=username
        )
        pwd_date = None  # default keeps the list aligned with users_list when the attribute is missing
        for a in userAttributes['UserAttributes']:
            if a['Name'] == 'custom:pwdCreateDate':
                # Assumes the attribute was stored as a full datetime string.
                pwd_date = datetime.datetime.strptime(a['Value'], '%Y-%m-%d %H:%M:%S.%f').date()
        attrPwdDates.append(pwd_date)
        time.sleep(1.0)  # stay under the 5 RPS admin API limit
    df1 = pd.DataFrame(list(zip(users_list, attrPwdDates)), columns=['Username', 'Password_Last_Set'])

    # Date of each user's most recent successful ForgotPassword event.
    authPwdDates = []
    for username in users_list:
        authEvents = idp_client.admin_list_user_auth_events(
            UserPoolId=user_pool_id,
            Username=username
        )
        forgot_date = None  # default keeps the list aligned when a user has no ForgotPassword event
        for event in authEvents['AuthEvents']:
            if event['EventType'] == 'ForgotPassword' and event['EventResponse'] == 'Pass':
                forgot_date = event['CreationDate'].date()
                break  # events are returned newest first, so the first match is the latest
        authPwdDates.append(forgot_date)
        time.sleep(1.0)
    df2 = pd.DataFrame(list(zip(users_list, authPwdDates)), columns=['Username', 'Password_Last_Forgot'])

    # Users whose most recent password event is 90 or more days old get reset.
    df3 = df1.merge(df2, how='left', on='Username')
    cols = ['Password_Last_Set', 'Password_Last_Forgot']
    df3[cols] = df3[cols].apply(pd.to_datetime)
    stale = df3.loc[df3[cols].max(axis=1) <= pd.Timestamp.now() - pd.Timedelta(90, unit='d'), 'Username']
    for username in stale:
        update_user(username)

How can I return a string from a Google BigQuery row iterator object?

My task is to write a Python script that takes results from BigQuery and emails them out. I've written code that successfully sends an email, but I am having trouble including the results of the BigQuery query in the actual email. The query results are correct, but the object I return from the query (results) always ends up as NoneType.
For example, the email should look like this:
Hello,
You have the following issues that have been "open" for more than 7 days:
-List issues here from bigquery code
Thanks.
The code reads contacts from a contacts.txt file, and it reads the email message template from a message.txt file. I tried to turn the BigQuery object into a string, but it still results in an error.
from google.cloud import bigquery
import warnings
warnings.filterwarnings("ignore", "Your application has authenticated using end user credentials")
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from string import Template

def query_emailtest():
    client = bigquery.Client(project=("analytics-merch-svcs-thd"))
    query_job = client.query("""
        select dept, project_name, reset, tier, project_status, IssueStatus, division, store_number, top_category,
        DATE_DIFF(CURRENT_DATE(), in_review, DAY) as days_in_review
        from `analytics-merch-svcs-thd.MPC.RESET_DETAILS`
        where in_review IS NOT NULL
        AND IssueStatus = "In Review"
        AND DATE_DIFF(CURRENT_DATE(), in_review, DAY) > 7
        AND ready_for_execution IS NULL
        AND project_status = "Active"
        AND program_name <> "Capital"
        AND program_name <> "SSI - Capital"
        LIMIT 50
    """)
    results = query_job.result()  # Waits for job to complete.
    return results  # THIS IS A NONETYPE

def get_queryresults(results):  # New method to loop over the query results and store them in a variable
    for i, row in enumerate(results, 1):
        bq_data = (i, '. ' + str(row.dept) + " " + row.project_name + ", Reset #: " + str(row.reset) +
                   ", Store #: " + str(row.store_number) + ", " + row.IssueStatus +
                   " for " + str(row.days_in_review) + " days")
        print(bq_data)

def get_contacts(filename):
    names = []
    emails = []
    with open(filename, mode='r', encoding='utf-8') as contacts_file:
        for a_contact in contacts_file:
            names.append(a_contact.split()[0])
            emails.append(a_contact.split()[1])
    return names, emails

def read_template(filename):
    with open(filename, 'r', encoding='utf-8') as template_file:
        template_file_content = template_file.read()
    return Template(template_file_content)

names, emails = get_contacts('mycontacts.txt')  # read contacts
message_template = read_template('message.txt')
results = query_emailtest()
bq_results = get_queryresults(query_emailtest())

import smtplib

# set up the SMTP server
s = smtplib.SMTP(host='smtp-mail.outlook.com', port=587)
s.starttls()
s.login('email', 'password')

# For each contact, send the email:
for name, email in zip(names, emails):
    msg = MIMEMultipart()  # create a message
    # bq_data = get_queryresults(query_emailtest())
    # add in the actual person name to the message template
    message = message_template.substitute(PERSON_NAME=name.title())
    message = message_template.substitute(QUERY_RESULTS=bq_results)  # SUBSTITUTE QUERY RESULTS IN MESSAGE TEMPLATE.
    # This is where I am having trouble, because the row iterator object results in NoneType.

    # setup the parameters of the message
    msg['From'] = 'email'
    msg['To'] = 'email'
    msg['Subject'] = "This is TEST"

    # body = str(get_queryresults(query_emailtest()))  # get query results from method to put into message body
    # add in the message body
    # body = MIMEText(body)
    # msg.attach(body)
    msg.attach(MIMEText(message, 'plain'))
    # query_emailtest()
    # get_queryresults(query_emailtest())

    # send the message via the server set up earlier.
    s.send_message(msg)
    del msg
Message template:
Dear ${PERSON_NAME},
Hope you are doing well. Please find the following alert for Issues that have been "In Review" for greater than 7 days.
${QUERY_RESULTS}
If you would like more information, please visit this link that contains a complete dashboard view of the alert.
ISE Services
The BQ result() function returns an iterator, so I think you need to change your return to yield from.
I'm far from a Python expert, but the following pared-down code worked for me.
from google.cloud import bigquery
import warnings
warnings.filterwarnings("ignore", "Your application has authenticated using end user credentials")

def query_emailtest():
    client = bigquery.Client(project=("my_project"))
    query_job = client.query("""
        select field1, field2 from `my_dataset.my_table` limit 5
    """)
    results = query_job.result()
    yield from results  # NOTE THE CHANGE HERE

results = query_emailtest()
for row in results:
    print(row.field1, row.field2)
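
To get a single string for the ${QUERY_RESULTS} placeholder (my addition, not part of the original answer), the rows can be joined before substitution. Note that Template.substitute() raises a KeyError for any placeholder it is not given, so both keys belong in one call; the field names below are the ones from the original query:

rows = query_emailtest()
# Collapse the rows into one newline-separated string for the template.
bq_results = '\n'.join(
    '{}. {} {}, {} for {} days'.format(i, row.dept, row.project_name,
                                       row.IssueStatus, row.days_in_review)
    for i, row in enumerate(rows, 1)
)
message = message_template.substitute(PERSON_NAME=name.title(),
                                      QUERY_RESULTS=bq_results)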

Pygal bar chart says “No data”

I am trying to create a bar graph in pygal that uses the Hacker News API and charts the most active news items based on comments. My code is below, but I cannot figure out why the graph keeps saying "No data". Any suggestions? Thanks!
import requests
import pygal
from pygal.style import LightColorizedStyle as LCS, LightenStyle as LS
from operator import itemgetter

# Make an API call, and store the response.
url = 'https://hacker-news.firebaseio.com/v0/topstories.json'
r = requests.get(url)
print("Status code:", r.status_code)

# Process information about each submission.
submission_ids = r.json()
submission_dicts = []
for submission_id in submission_ids[:30]:
    # Make a separate API call for each submission.
    url = ('https://hacker-news.firebaseio.com/v0/item/' +
           str(submission_id) + '.json')
    submission_r = requests.get(url)
    print(submission_r.status_code)
    response_dict = submission_r.json()

    submission_dict = {
        'comments': int(response_dict.get('descendants', 0)),
        'title': response_dict['title'],
        'link': 'http://news.ycombinator.com/item?id=' + str(submission_id),
    }
    submission_dicts.append(submission_dict)

# Visualization
my_style = LS('#336699', base_style=LCS)
my_config = pygal.Config()
my_config.show_legend = False
my_config.title_font_size = 24
my_config.label_font_size = 14
my_config.major_label_font_size = 18
my_config.show_y_guides = False
my_config.width = 1000

chart = pygal.Bar(my_config, style=my_style)
chart.title = 'Most Active News on Hacker News'
chart.add('', submission_dicts)
chart.render_to_file('hn_submissons_repos.svg')
The values in the array passed to the add function need to be either numbers or dicts that contain the key value (or a mixture of the two). The simplest solution would be to change the keys used when creating submission_dict:
submission_dict = {
    'value': int(response_dict.get('descendants', 0)),
    'label': response_dict['title'],
    'xlink': 'http://news.ycombinator.com/item?id=' + str(submission_id),
}
Notice that link has become xlink; this is one of the optional parameters defined in the Value Configuration section of the pygal docs.
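
As a quick self-contained illustration (assumed data, not from the question), pygal accepts plain numbers, dicts carrying a value key, or a mixture of the two:

import pygal

chart = pygal.Bar()
chart.add('Comments', [
    {'value': 42, 'label': 'Example story',
     'xlink': 'http://news.ycombinator.com/item?id=1'},  # dict form
    17,  # plain number form
])
chart.render_to_file('example.svg')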

Getting the number of old issues and the table (login and number of commits) of the most active members of the repository

I cannot get the above information using the GitHub API. Reading the documentation did not help much; I still do not fully understand how to work with dates. Here is an example of my code for getting open issues:
import requests
import json
from datetime import datetime

username = '\'
password = '\'
another_page = True
opened = 0
closed = 0
api_oldest = 'https://api.github.com/repos/grpc/grpc/issues?per_page=5&q=sort=created:>`date -v-14d "+%Y-%m-%d"`&order=asc'
api_issue = 'https://api.github.com/repos/grpc/grpc/issues?page=1&per_page=5000'
api_pulls = 'https://api.github.com/repos/grpc/grpc/pulls?page=1'

datetime.now()
while another_page:
    r = requests.get(api_issue, auth=(username, password))
    #json_response = json.loads(r.text)
    #results.append(json_response)
    if 'next' in r.links:
        api_issue = r.links['next']['url']
        if item['state'] == 'open':
            opened += 1
        else:
            closed += 1
    else:
        another_page = False
datetime.now()
print(opened)
There are a few issues with your code. For example, what does item represent? Your code can be modified as follows to iterate and get the number of open issues.
import requests

username = '/'
password = '/'
another_page = True
opened = 0
closed = 0
api_issue = "https://api.github.com/repos/grpc/grpc/issues?page=1&per_page=5000"

while another_page:
    r = requests.get(api_issue, auth=(username, password))
    json_response = r.json()
    #results.append(json_response)
    for item in json_response:
        if item['state'] == 'open':
            opened += 1
        else:
            closed += 1
    if 'next' in r.links:
        api_issue = r.links['next']['url']
    else:
        another_page = False

print(opened)
If you want issues that were created in the last 14 days, you could make the API request using the following URL.
api_oldest = "https://api.github.com/repos/grpc/grpc/issues?q=sort=created:>`date -d '14 days ago'`&order=asc"
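
Note that the backticked date command in that URL is shell syntax and will not be expanded inside a Python string. A sketch of doing the same thing in Python (my addition; it uses the GitHub search endpoint, since the repo issues listing does not support a created qualifier):

import datetime
import requests

# Issues created in the last 14 days, oldest first.
cutoff = (datetime.date.today() - datetime.timedelta(days=14)).isoformat()
resp = requests.get(
    'https://api.github.com/search/issues',
    params={'q': 'repo:grpc/grpc type:issue created:>={}'.format(cutoff),
            'sort': 'created', 'order': 'asc'},
)
print(resp.json()['total_count'])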
