Google Ads CRITERIA_PERFORMANCE_REPORT doesn't allow removing the first row - python-3.x

I have encountered a problem with the Google Ads report and I have no clue how to fix it. I use the following code to extract the data from Google Ads via an API call:
import sys
from googleads import adwords
import pandas as pd
import numpy as np
import io

output = io.StringIO()


def main(client):
    # Initialize appropriate service.
    report_downloader = client.GetReportDownloader(version='v201809')

    # Create report query.
    report_query = (adwords.ReportQueryBuilder()
                    .Select('CampaignId', 'AdGroupId', 'Id', 'Criteria',
                            'CriteriaType', 'FinalUrls', 'Impressions', 'Clicks',
                            'Cost')
                    .From('CRITERIA_PERFORMANCE_REPORT')
                    .Where('Status').In('ENABLED', 'PAUSED')
                    .During('LAST_7_DAYS')
                    .Build())

    # You can provide a file object to write the output to. Here the report
    # is written to an in-memory StringIO buffer and then read into pandas.
    report_downloader.DownloadReportWithAwql(
        report_query, 'CSV', output, skip_report_header=False,
        skip_column_header=False, skip_report_summary=False,
        include_zero_impressions=True)
    output.seek(0)
    df = pd.read_csv(output)
    df.to_csv('results.csv')


if __name__ == '__main__':
    # Initialize client object.
    adwords_client = adwords.AdWordsClient.LoadFromStorage()
    main(adwords_client)
The code works as expected, pulls the data, and saves it in a CSV file. However, when I access the columns it prints just one column, 'CRITERIA_PERFORMANCE_REPORT (Nov 5, 2019-Nov 11, 2019)'. When I open the CSV file it looks like this:
result.csv (screenshot)
I have tried to drop the first row with df.drop(df.index[0]) to access the rest of the data, but nothing seems to be working. Is there any way I can remove the first row, or use the second row as the column names, which is the result I expected?
Thanks in advance

I'm able to eliminate the header there with the following download request:
report_downloader.DownloadReportWithAwql(
    report_query, 'CSV', output, skip_report_header=True,
    skip_column_header=False, skip_report_summary=True,
    include_zero_impressions=True
)
I think if you include skip_report_header=True and skip_report_summary=True you'll get what you want.
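If for some reason the extra rows can't be suppressed at download time, a fallback is to drop them when reading the CSV back with pandas. A minimal sketch, assuming the file was saved (as in the question) with both the one-line report header and the trailing summary row:

import pandas as pd

# skiprows=1 skips the one-line report header, so the second line
# (the real column names) becomes the header.
df = pd.read_csv('results.csv', skiprows=1)

# Drop the trailing report summary row (assumed to be the last line)
# if it was not skipped at download time.
df = df.iloc[:-1]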

Related

Read a table from google docs without using API

I want to read a Google Sheets table in Python, but without using the API.
I tried with BytesIO and BeautifulSoup.
I know about the solution with gspread, but I need to read the table without a token, using only the URL.
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from lxml.etree import tostring
from io import BytesIO
import requests
req=requests.get('https://docs.google.com/spreadsheets/d/sheetId/edit#gid=', auth=('email', 'password'))
page = req.text
Here I get HTML code, like <!doctype html><html lang="en-US" dir="ltr"><head><base href="h and so on...
I also tried the BeautifulSoup library, but the result is the same.
For reading a table from html, you can use pandas.read_html.
If it's an unrestricted spreadsheet like this one, you probably don't even need requests - you can just pass the URL directly to read_html.
import pandas as pd

sheetId = '1bQo1an4yS1tSOMDhmUTGYtUlgnHDQ47EmIcj4YyuIxo'  ## REPLACE WITH YOUR SHEETID
sheetUrl = f'https://docs.google.com/spreadsheets/d/{sheetId}'
sheetDF = pd.read_html(
    sheetUrl, attrs={'class': 'waffle'}, skiprows=1
)[0].drop(['1'], axis=1).dropna(axis=0, how='all').dropna(axis=1, how='all')
If it's not unrestricted, then the way you're using requests.get would not work either, because you're not passing the auth argument correctly. I actually don't think there is any way to log in to Google with just requests.auth. You could log in to Drive in your browser, open a sheet, copy the request to https://curlconverter.com/, and paste the generated code into yours from there.
import pandas as pd
import requests
from bs4 import BeautifulSoup

sheetUrl = 'YOUR_SHEET_URL'
cookies = 'PASTE_FROM_https://curlconverter.com/'  # the cookies dict generated by curlconverter
headers = 'PASTE_FROM_https://curlconverter.com/'  # the headers dict generated by curlconverter
req = requests.get(sheetUrl, cookies=cookies, headers=headers)

# in one line, no error-handling:
# sheetDF = pd.read_html(req.text, attrs={'class': 'waffle'}, skiprows=1)[0].drop(['1'], axis=1).dropna(axis=0, how='all').dropna(axis=1, how='all')

# req.raise_for_status()  # raise error if request fails
if req.status_code != 200:
    print(req.status_code, req.reason, '- failed to get', sheetUrl)

soup = BeautifulSoup(req.content, 'html.parser')
wTable = soup.find('table', class_="waffle")
if wTable is not None:
    dfList = pd.read_html(str(wTable), skiprows=1)  # skiprows=1 skips the top row with column names A, B, C...
    sheetDF = dfList[0]  # read_html returns a LIST of dataframes
    sheetDF = sheetDF.drop(['1'], axis='columns')  # drop the column of row numbers 1, 2, 3...
    sheetDF = sheetDF.dropna(axis='rows', how='all')  # drop empty rows
    sheetDF = sheetDF.dropna(axis='columns', how='all')  # drop empty columns
    ### WHATEVER ELSE YOU WANT TO DO WITH THE DATAFRAME ###
else:
    print('COULD NOT FIND TABLE')
However, please note that the cookies are probably only good for about 5 hours at most (after that you'll need to paste new ones), and that if there are multiple sheets in one spreadsheet, you'll only be able to scrape the first sheet with requests/pandas. So it would be better to use the API for restricted or multi-sheet spreadsheets.

Use requests to download webpage that requires cookies into a dataframe in python

I can't get the code below to navigate through the disclaimer page of the website; I think the issue is how I try to collect the cookie.
I want to use requests rather than selenium.
import requests
import pandas as pd
from pandas import read_html
# open the page with the disclaimer just to get the cookies
disclaimer = "https://umm.gassco.no/disclaimer"
disclaimerdummy = requests.get(disclaimer)
# open the actual page and use the cookies from the fake page opened before
actualpage = "https://umm.gassco.no/disclaimer/acceptDisclaimer"
actualpage2 = requests.get(actualpage, cookies=disclaimerdummy.cookies)
# store the content of the actual page in text format
actualpagetext = (actualpage2.text)
# identify relevant data sources by looking at the 'msgTable' class in the webpage code
# This is where the tables with the realtime data can be found
gasscoflow = read_html(actualpagetext, attrs={"class": "msgTable"})
# create the dataframes for the two relevant tables
Table0 = pd.DataFrame(gasscoflow[0])
Table1 = pd.DataFrame(gasscoflow[1])
Table2 = pd.DataFrame(gasscoflow[2])
Table3 = pd.DataFrame(gasscoflow[3])
Table4 = pd.DataFrame(gasscoflow[4])
Looking at the website, first of all it has only 2 tables, and you could use a session to carry cookies across requests instead of storing them in a variable. The code below gets all the data you expect; it prints only the last 2 rows of each table because I used tail, and you can modify it to get your desired data from those tables.
import requests
import pandas as pd
from pandas import read_html
s=requests.session()
s1=s.get("https://umm.gassco.no")
s2=s.get("https://umm.gassco.no/disclaimer/acceptDisclaimer?")
data = read_html(s2.text, attrs={"class": "msgTable"})
t0 = pd.DataFrame(data[0])
t1 = pd.DataFrame(data[1])
print(t0.tail(2))
print(t1.tail(2))
Output: (screenshot of the last two rows of each table)
Let me know if you have any questions :)
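If you'd rather not hard-code how many tables the page exposes, here is a small sketch building on the data list from the answer above (the t0, t1, ... names are just illustrative):

# Build one dataframe per matched table, however many there are.
tables = {f't{i}': pd.DataFrame(tbl) for i, tbl in enumerate(data)}
for name, frame in tables.items():
    print(name, frame.shape)  # quick sanity check of each table's size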

Python filewriting with write(), writelines(), to_csv() not working

I'm running a piece of code which takes input from a txt file, uses the input to scrape a Tor webpage, and then gives a list of strings called result. I'm using the tbselenium module. I need to write this list to two output files, valid.txt and address.txt. When I run the script I get the result (a list of strings) but nothing is written to the two output files. No error is raised, the print statements inside the two functions work perfectly, and the input is read successfully.
from tbselenium.tbdriver import TorBrowserDriver
import requests
import time
import pandas as pd


def read_input():
    with open('Entries.txt') as fp:
        users = fp.readlines()
    return users


users = read_input()
result = some_function(users)  # This function scrapes the webpage using selenium


def write_output(result):
    with open('valid.txt', 'a+') as fw:
        fw.writelines(result)
    print('Writing to valid.txt', result)


def write_addr(result):
    with open('address.txt', 'a+') as fw:
        for x in result:
            fw.write(x.split(':')[5] + '\n')
    print('Writing to address.txt')


write_output(result)
write_addr(result)
I then tried writing the same output to a csv file.
df = pd.DataFrame(result)
print(df)
df.to_csv('valid.csv', mode='a', header=False)
The DataFrame is created but nothing is written to the CSV file. The file isn't even created if I haven't already created one in my folder.
If I don't run the scraping function and just try to write something to the output files, it works.
Solved. While running, the selenium driver changes the current working directory to the directory where the Tor browser is located, and hence all the files are being saved there.
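One way to make the output location independent of whatever the driver does to the working directory is to build the paths from the script's own location. A minimal sketch, assuming the write_output function from the question:

import os

# Anchor output files to the directory of this script instead of the
# (possibly changed) current working directory.
BASE_DIR = os.path.dirname(os.path.abspath(__file__))


def write_output(result):
    with open(os.path.join(BASE_DIR, 'valid.txt'), 'a+') as fw:
        fw.writelines(result)
    print('Writing to valid.txt', result)
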
import pandas as pd


def write_output(result):
    with open('valid.txt', 'a+') as fw:
        fw.writelines(result)
    print('Writing to valid.txt', result)


def write_addr(result):
    with open('address.txt', 'a+') as fw:
        for x in result:
            fw.write(x.split(':')[5] + '\n')
    print('Writing to address.txt')


result = ['I am :scrap:ed d:ata:from tor wit:h add:ress:es\n', 'I am :scrap:ed d:ata:from tor wit:h add:ress:es\n', 'I am :scrap:ed d:ata:from tor wit:h add:ress:es\n']
write_output(result)
write_addr(result)
df = pd.DataFrame(result)
print(df)
df.to_csv('valid.csv', mode='a', header=False)
I didn't find any problem with your code (at least not with the write functions you have for creating valid.txt, address.txt, and valid.csv).
I have tested your code with my own dummy result, and the respective files were created successfully (screenshots not reproduced here). I suspect the issue might be with your result list. You should also check that the 5th index after the split(':') is not a space character, and note that the files will be created in the directory the Python script is run from (in case you are looking for the files in the wrong directory). Other than that, the functions should run properly, provided your result is actually returned from your web scraping function.
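As a quick sanity check on that point, here is a small sketch (assuming each entry in result has at least six colon-separated fields, as in the dummy data above):

# Inspect what would be written to address.txt and flag unusable entries.
for x in result:
    parts = x.split(':')
    if len(parts) > 5 and parts[5].strip():
        print(repr(parts[5]))
    else:
        print('skipping entry without a usable 6th field:', repr(x))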
Cheers

How to extract image src from the website

I tried scraping the table rows from the website to get the data on coronavirus spread.
I wanted to extract the src for all the image tags so as to get the source of each flag's image along with all the data for each country. Could someone help?
import pandas as pd
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument('--headless')
driver = webdriver.Firefox(options=options)
driver.get("https://google.com/covid19-map/?hl=en")
df = pd.read_html(driver.page_source)[1]
df.to_csv("Data.csv", index=False)
driver.quit()
While Gareth's answer has already been accepted, it inspired me to write this one from a pandas point of view. Since we know the URLs for the flags follow a fixed pattern and the only thing that changes is the name, we can create a new column by lowercasing the name, replacing spaces with underscores, and then weaving the name into the fixed URL pattern.
import pandas as pd
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument('--headless')
driver = webdriver.Firefox(options=options)
driver.get("https://google.com/covid19-map/?hl=en")
df = pd.read_html(driver.page_source)[1]
df['flag_url'] = df.apply(lambda row: f"https://www.gstatic.com/onebox/sports/logos/flags/{row.Location.lower().replace(' ', '_')}_icon_square.svg", axis=1)
df.to_csv("Data.csv", index=False)
driver.quit()
OUTPUT SAMPLE
Location,Confirmed,Cases per 1M people,Recovered,Deaths,flag_url
Worldwide,882068,125.18,185067,44136,https://www.gstatic.com/onebox/sports/logos/flags/worldwide_icon_square.svg
United Kingdom,29474,454.19,135,2352,https://www.gstatic.com/onebox/sports/logos/flags/united_kingdom_icon_square.svg
United States,189441,579.18,7082,4074,https://www.gstatic.com/onebox/sports/logos/flags/united_states_icon_square.svg
Not the most ingenious way, but since you have the page source already, how about using regex to match the URLs of the images?
import re
print (re.findall(r'https://www.gstatic.com/onebox/sports/logos/flags/.+?.svg', driver.page_source))
The image links are in order, so they match the order of confirmed cases - except that on my computer, the country I'm in right now is at the top of the list.
If this is not what you want, I can delete this answer.
As mentioned by @Chris Doyle in the comments, this can even be done simply by noticing that the URLs are all the same, with ".+?" replaced by the country's name (all lowercase, joined with underscores). You have that information in the CSV file.
country_name = "United Kingdom"
url = "https://www.gstatic.com/onebox/sports/logos/flags/"
url += '_'.join(country_name.lower().split())
url += '_icon_square.svg'
print(url)
Also be sure to check out his answer using pure pandas :)

running in parallel requests.post over a pandas data frame

So I am trying to run a defined function that does a requests.post, takes its input from a pandas dataframe, and saves the result to the same dataframe in a different column.
import requests, json
import pandas as pd
import argparse


def postRequest(input, url):
    '''Post response from url'''
    headers = {'content-type': 'application/json'}
    r = requests.post(url=url, json=json.loads(input), headers=headers)
    response = r.json()
    return response


def payload(text):
    # get proper payload from text
    std_payload = {"auth_key": "key",
                   "org": {"id": org_id, "name": "org"},
                   "ver": {"id": ver_id, "name": "ver"},
                   "mess": {"id": 80}}
    std_payload['mess']['text'] = text
    std_payload = json.dumps(std_payload)
    return std_payload


def find(df):
    ff = pd.DataFrame(columns=['text', 'expected', 'word', 'payload', 'response'])
    count = 0
    for leng in range(0, len(df)):
        search = df.text[leng].split()
        ff.loc[count] = df.iloc[leng]
        ff.loc[count, 'word'] = 'orginalphrase'
        count = count + 1
        for w in range(0, len(search)):
            if df.text[leng] == "3174":
                ff.append(df.iloc[leng], ignore_index=True)
                ff.loc[count, 'text'] = "3174"
                ff.loc[count, 'word'] = None
                ff.loc[count, 'expected'] = '[]'
                continue
            word = search[:]
            ff.loc[count, 'word'] = word[w]
            word[w] = 'z'
            phrase = ' '.join(word)
            ff.loc[count, 'text'] = phrase
            ff.loc[count, 'expected'] = df.loc[leng, 'expected']
            count = count + 1
        if df.text[leng] == "3174":
            continue
    return ff


# read in csv of phrases to be tested
df = pd.read_csv(filename, engine='python')
# allow empty cells by setting them to the placeholder phrase "3174"
df = df.fillna("3174")

sf = find(df)

for i in sf.index:
    sf['payload'] = payload(sf.text[i])

for index in df.index:
    sf.response[index] = postRequest(df.text[index], url)
From all my tests, this operation runs over the dataframe one row at a time, which means that when my dataframe is large it can take a few hours.
Searching online for running things in parallel gives me a few methods, but I do not understand what those methods are doing. I have seen pooling and threading examples, and I can get the examples themselves to work, such as:
Simultaneously run POST in Python
Asynchronous Requests with Python requests
When I try to apply them to my code, specifically to postRequest, I cannot get any method to work; it still goes one by one.
Can anyone provide assistance in getting the parallelization to work correctly? If more information is required please let me know.
Thanks
Edit:
Here is the last thing I was working with:
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    future_to_url = {executor.submit(postRequest, sf.payload[index], trends_url): index for index in range(10)}
    count = 0
    for future in concurrent.futures.as_completed(future_to_url):
        repo = future_to_url[future]
        data = future.result()
        sf.response[count] = data
        count = count + 1
Also, the dataframe has anywhere between 2000 and 4000 rows, so doing it in sequence can take up to 4 hours.
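For what it's worth, here is a minimal sketch of one way to parallelise the posts with concurrent.futures, assuming the postRequest function, the sf dataframe (with its payload column already filled), and a url variable as in the question. Each submitted future is mapped back to its row index, so the responses land on the right rows regardless of the order in which they complete:

import concurrent.futures

import pandas as pd


def run_parallel(sf, url, max_workers=5):
    # Post every row's payload in a thread pool and collect the responses.
    responses = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which row each future belongs to.
        future_to_index = {
            executor.submit(postRequest, sf.payload[i], url): i
            for i in sf.index
        }
        for future in concurrent.futures.as_completed(future_to_index):
            i = future_to_index[future]
            responses[i] = future.result()
    # Write the responses back in one go, aligned on the row index.
    sf['response'] = pd.Series(responses)
    return sf


sf = run_parallel(sf, url)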

Resources