I'm trying to create a script that returns domain and backlink counts from the SEMrush API for each URL held in a dataframe.
The dataframe containing the URLs looks like this:
0
0 www.ig.com/jp/trading-strategies/swing-trading...
1 www.ig.com/it/news-e-idee-di-trading/criptoval...
2 www.ig.com/uk/news-and-trade-ideas/the-omicron...
[1468 rows x 1 columns]
When I run my script I get the following error:
requests.exceptions.InvalidSchema: No connection adapters were found for '0 https://api.semrush.com/analytics/v1/?key=1f0e...\nName: 0, dtype: object'
Here is the part of the code that generates the error:
for index, url in gsdf.iterrows():
    rr = requests.request("GET", "https://api.semrush.com/analytics/v1/?key=" + API_KEY + "&type=backlinks_tld&target=" + url + "&target_type=url&export_columns=domains_num,backlinks_num&display_limit=1", headers=headers, data=payload)
    data = json.loads(rr.text.encode('utf8'))
    srdf = srdf.append({domains_num: data, backlinks_num: data}, ignore_index=True)
I'm not sure why this happens as I'm new to Python. Can you help?
Kind thanks
Mark
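For what it's worth, the `'0    https://...\nName: 0, dtype: object'` text inside the error suggests `url` is not a string: `iterrows()` yields `(index, row)` pairs where `row` is a whole pandas Series, so concatenating it into the request string makes pandas render the entire row. A minimal sketch of the likely fix (the column name `0` is assumed from the dataframe printout above):

```python
import pandas as pd

gsdf = pd.DataFrame(["www.ig.com/jp/trading-strategies/swing-trading",
                     "www.ig.com/it/news-e-idee-di-trading/criptoval"])

urls_built = []
# iterrows() yields (index, Series); pull the scalar out of the row
# before concatenating, or iterate the column's values directly.
for index, row in gsdf.iterrows():
    url = row[0]                      # a plain string, not a Series
    urls_built.append("https://api.semrush.com/analytics/v1/?target=" + url)
```

Iterating `gsdf[0]` directly (`for url in gsdf[0]:`) works just as well and skips the row object entirely.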
Using multiple if conditions to filter a list works, but I am looking for a better alternative. As a beginner Python user, I can't make the filter() and lambda functions work, even after using this resource: https://www.geeksforgeeks.org/python-filter-list-of-strings-based-on-the-substring-list/. Any help would be appreciated.
The following code block (Method 1) works.
mylist = ['MEPS HC-226: MEPS Panel 23 Three-Year Longitudinal Data File',
'HC-203: 2018 Jobs File',
'HC-051H 2000 Home Health',
'NHC-001F Facility Characteristics Round 1',
'HC-IC Linked Data 1999',
'HC-004 1996 Employment Data/Family-Level Weight (HC-004 replaced by HC-012)',
'HC-030 1998 MEPS HC Survey Data (CD-ROM)']
sublist1 = []
for element in mylist:
    if element.startswith(("MEPS HC", "HC")):
        if "-IC" not in element:
            if "replaced" not in element:
                if "CD-ROM" not in element:
                    sublist1.append(element)
print(sublist1)
(Output below)
['MEPS HC-226: MEPS Panel 23 Three-Year Longitudinal Data File', 'HC-203: 2018 Jobs File', 'HC-051H 2000 Home Health']
However, there are issues with the following code block (Method 2).
sublist2 = []
excludes = ['-IC', 'replaced', 'CD-ROM']
for element in mylist:
    if element.startswith(("MEPS HC", "HC")):
        sublist2 = mylist(filter(lambda x: any(excludes in x for exclude in excludes), mylist))
        sublist2.append(element)
print(sublist2)
TypeError: 'list' object is not callable
My code block with multiple if conditions (Method 1) filters the list correctly, but I could not figure out why the code block with filter() and lambda (Method 2) does not work. I was expecting the same results as Method 1. I am also open to other alternatives to Method 1.
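Two things in Method 2 would explain the behaviour: `mylist(filter(...))` calls the list object itself, which is exactly what `TypeError: 'list' object is not callable` reports (presumably `list(...)` was intended), and the lambda tests `excludes in x` where `exclude in x` was probably meant. A sketch of Method 2 rewritten as a single filter pass:

```python
mylist = ['MEPS HC-226: MEPS Panel 23 Three-Year Longitudinal Data File',
          'HC-203: 2018 Jobs File',
          'HC-051H 2000 Home Health',
          'NHC-001F Facility Characteristics Round 1',
          'HC-IC Linked Data 1999',
          'HC-004 1996 Employment Data/Family-Level Weight (HC-004 replaced by HC-012)',
          'HC-030 1998 MEPS HC Survey Data (CD-ROM)']
excludes = ['-IC', 'replaced', 'CD-ROM']

# Keep items with the right prefix that contain none of the excluded
# substrings; list() materialises the filter object.
sublist2 = list(filter(
    lambda x: x.startswith(("MEPS HC", "HC")) and not any(e in x for e in excludes),
    mylist))
```

This should match the Method 1 output; an equivalent list comprehension (`[x for x in mylist if ...]`) is often considered more readable than filter()/lambda.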
I need to calculate the distance between two dates.
df3['dist_2_1'] = (df3['Date2'] - df3['Date1'])
When I save this into my SQLite DB the format is terrible, so I decided to use an integer format, which is much better.
df3['dist_2_1'] = (df3['Date2'] - df3['Date1']).astype('timedelta64[D]').astype(int)
So far so good, but in a similar case I have NULL values, which cause an error when I try to compute the difference between the dates.
df3['dist_B_3'] = df3['Break_date'] - df3['Date3']
Break_date can be null, and in that case I want the final result in dist_B_3 to be 0, but right now it is an error that breaks everything. I tested this so far, but it doesn't work:
try:
    if df3['Break_date'] == 'NaT':
        df3['dist_B_3'] = 0
    else:
        df3['dist_B_3'] = df3['Break_date'] - df3['Date3']
        # ().astype('timedelta64[D]').astype(int)
except Exception:
    print("error in the dist_B_3")
My df3['Break_date'] column is shown below; the NaT values are the ones creating the error.
0 2022-07-13
1 2022-07-12
2 2022-07-14
3 2022-07-14
4 NaT
5 NaT
Any idea on how to handle this?
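One possible approach, assuming both columns are datetime64: the comparison `df3['Break_date'] == 'NaT'` compares each value against the string `'NaT'` and never matches, so the branch does not help. Subtracting first and filling the missing results afterwards sidesteps the branch entirely:

```python
import pandas as pd

# Small frame mimicking the question's data; row 1 has a missing Break_date.
df3 = pd.DataFrame({
    "Break_date": pd.to_datetime(["2022-07-13", None]),
    "Date3":      pd.to_datetime(["2022-07-10", "2022-07-01"]),
})

# Subtract, convert the timedeltas to whole days (NaT becomes NaN),
# then fill the missing rows with 0 before casting to int.
df3["dist_B_3"] = (df3["Break_date"] - df3["Date3"]).dt.days.fillna(0).astype(int)
```

`.dt.days` plays the same role as the `.astype('timedelta64[D]')` cast used earlier, but tolerates the NaT rows until `fillna(0)` replaces them.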
I've been trying to optimise a Bokeh server that calculates live Covid-19 stats for a selected country.
I found myself repeating a groupby call to calculate new columns and was wondering, having saved the groupby, whether I could apply it to multiple columns in a way similar to .agg().
For example:
dfall = pd.DataFrame(db("SELECT * FROM C19daily"))
dfall.set_index(['geoId', 'date'], drop=False, inplace=True)
dfall = dfall.sort_index(ascending=True)
dfall.head()
id date geoId cases deaths auid
geoId date
AD 2020-03-03 70119 2020-03-03 AD 1 0 AD03/03/2020
2020-03-14 70118 2020-03-14 AD 1 0 AD14/03/2020
2020-03-16 70117 2020-03-16 AD 3 0 AD16/03/2020
2020-03-17 70116 2020-03-17 AD 9 0 AD17/03/2020
2020-03-18 70115 2020-03-18 AD 0 0 AD18/03/2020
I need to create new columns based on 'cases' and 'deaths' by applying various functions like cumsum(). Currently I do this the long way:
dfall['ccases'] = dfall.groupby(level=0)['cases'].cumsum()
dfall['dpc_cases'] = dfall.groupby(level=0)['cases'].pct_change(fill_method='pad', periods=7)
.....
dfall['cdeaths'] = dfall.groupby(level=0)['deaths'].cumsum()
dfall['dpc_deaths'] = dfall.groupby(level=0)['deaths'].pct_change(fill_method='pad', periods=7)
I tried to optimise the groupby call like this:
with dfall.groupby(level=0) as gr:
    gr = g['cases'].cumsum()...
But the error suggests the class doesn't support this:
AttributeError: __enter__
I thought I could use .agg({}) and supply a dictionary:
g = dfall.groupby(level=0).agg({'cc' : 'cumsum', 'cd' : 'cumsum'})
but that produces another error
pandas.core.base.SpecificationError: nested renamer is not supported
I have plenty of other bits to optimise; I thought this Python part would be the easiest and could save a few ms!
Could anyone nudge me in the right direction?
To avoid repeating dfall.groupby(level=0) you can just save it in a variable:
gb = dfall.groupby(level=0)
gb_cases = gb['cases']
dfall['ccases'] = gb_cases.cumsum()
dfall['dpc_cases'] = gb_cases.pct_change(fill_method='pad', periods=7)
...
And to run multiple aggregations in a single expression, I think you can use named aggregation. I have no clue whether it will be more performant, though. Either way, it's better to profile the code and improve the actual bottlenecks.
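Another way to cut the repetition, as a sketch: selecting a list of columns from the saved groupby returns a DataFrame, so one `cumsum()` call can fill both cumulative columns at once (the tiny frame and the `ccases`/`cdeaths` names mirror the question):

```python
import pandas as pd

df = pd.DataFrame(
    {"geoId": ["AD", "AD", "BE", "BE"],
     "cases": [1, 3, 2, 5],
     "deaths": [0, 1, 0, 2]}
).set_index("geoId")

gb = df.groupby(level=0)

# One call computes the per-group cumulative sums for both columns.
out = gb[["cases", "deaths"]].cumsum()
df["ccases"] = out["cases"]
df["cdeaths"] = out["deaths"]
```

The same pattern works for `pct_change`, since that is also an element-wise transform rather than a reducing aggregation (which is why `.agg({'cc': 'cumsum'})` raised an error above: `agg` expects reducers).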
Working with the Search Console API, I made it through the basics.
Now I'm stuck on splitting and arranging the data: when trying to split, I get NaN, and nothing I try works.
46 ((174.0, 3753.0, 0.04636290967226219, 7.816147...
47 ((93.0, 2155.0, 0.0431554524361949, 6.59025522...
48 ((176.0, 4657.0, 0.037792570324243074, 6.90251...
49 ((20.0, 1102.0, 0.018148820326678767, 7.435571...
50 ((31.0, 1133.0, 0.02736098852603707, 8.0935569...
Name: test, dtype: object
When I try to manipulate the data like this (and with similar interactions):
data=source['test'].tolist()
data
it's clear that the data is not really available:
[<searchconsole.query.Report(rows=1)>,
<searchconsole.query.Report(rows=1)>,
<searchconsole.query.Report(rows=1)>,
<searchconsole.query.Report(rows=1)>,
<searchconsole.query.Report(rows=1)>]
Does anyone have an idea how I can interact with my data?
Thanks.
For reference, this is the code and the library I work with:
account = searchconsole.authenticate(client_config='client_secrets.json', credentials='credentials.json')
webproperty = account['https://www.example.com/']

def APIsc(date, keyword):
    results = webproperty.query.range(date, days=-30).filter('query', keyword, 'contains').get()
    return results

source['test'] = source.apply(lambda x: APIsc(x.date, x.keyword), axis=1)
source
made by: https://github.com/joshcarty/google-searchconsole
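The list output above shows the column holds `searchconsole.query.Report` objects, not numbers, so string splitting yields NaN; the metrics have to be pulled out of each object first. A self-contained sketch with a stand-in class (the real Report's attributes may differ; the `rows` of `(clicks, impressions, ctr, position)` tuples are an assumption based on the printed output):

```python
import pandas as pd

class FakeReport:
    """Hypothetical stand-in for searchconsole.query.Report with one row of metrics."""
    def __init__(self, rows):
        self.rows = rows

source = pd.DataFrame({"test": [
    FakeReport([(174.0, 3753.0, 0.0464, 7.82)]),
    FakeReport([(93.0, 2155.0, 0.0432, 6.59)]),
]})

# Unpack the first row of each report object into named numeric columns.
metrics = source["test"].apply(lambda r: pd.Series(
    r.rows[0], index=["clicks", "impressions", "ctr", "position"]))
source = source.join(metrics)
```

Once each metric sits in its own numeric column, sorting, filtering, and plotting work normally.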
I am trying to list nearby venues using the previously defined getNearbyVenues function. Every line worked fine, and then I could not label the nearby venues properly using Foursquare, although it was working fine before (I had to reset my ID and secret as they just stopped working). I'm using Python 3.5 in a Jupyter notebook.
What am I doing wrong? Thank you!
BT_venues = getNearbyVenues(names=BT_df['Sector'],
                            latitudes=BT_df['Latitude'],
                            longitudes=BT_df['Longitude'])
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-99-563e09cdcab5> in <module>()
      1 BT_venues=getNearbyVenues(names=BT_df['Sector'],
      2                           latitudes=BT_df['Latitude'],
----> 3                           longitudes=BT_df['Longitude']
      4                           )

<ipython-input-93-cfc09962ae0b> in getNearbyVenues(names, latitudes, longitudes, radius)
     18
     19     # make the GET request
---> 20     results = requests.get(url).json()['response']['groups'][0]['items']
     21
     22     # return only relevant information for each nearby venue

KeyError: 'groups'
As for 'groups', this was the code:
venues = res['response']['groups'][0]['items']
nearby_venues = json_normalize(venues)  # flatten JSON

# columns only
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# only one category per row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# columns cleaning up
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()
Check response['meta']; you may have exceeded your quota.
If you need an instant resolution, create a new Foursquare account, then create a new application and use your new client ID and secret to call the API.
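A sketch of that defensive check, assuming the usual explore-endpoint shape in which an error response carries the status in 'meta' and omits 'groups' (the sample dicts below are illustrative, not real API output):

```python
def extract_items(res):
    """Return the venue items, or raise with the API's own error info
    instead of a bare KeyError: 'groups'."""
    meta = res.get("meta", {})
    if meta.get("code") != 200 or "groups" not in res.get("response", {}):
        raise RuntimeError("Foursquare request failed: %s" % meta)
    return res["response"]["groups"][0]["items"]

# Illustrative success and quota-exceeded payloads.
ok = {"meta": {"code": 200},
      "response": {"groups": [{"items": [{"venue": {"name": "Cafe"}}]}]}}
quota = {"meta": {"code": 429, "errorType": "quota_exceeded"}, "response": {}}
```

Calling `extract_items(requests.get(url).json())` inside getNearbyVenues would surface the quota message directly in the traceback.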