Issues with Python lambda and the filter() function - python-3.x

Using multiple if conditions to filter a list works. However, I am looking for a better alternative. As a beginner Python user, I fail to make the filter() and lambda functions work even after using this resource: https://www.geeksforgeeks.org/python-filter-list-of-strings-based-on-the-substring-list/. Any help will be appreciated.
The following code block (Method 1) works.
mylist = ['MEPS HC-226: MEPS Panel 23 Three-Year Longitudinal Data File',
          'HC-203: 2018 Jobs File',
          'HC-051H 2000 Home Health',
          'NHC-001F Facility Characteristics Round 1',
          'HC-IC Linked Data 1999',
          'HC-004 1996 Employment Data/Family-Level Weight (HC-004 replaced by HC-012)',
          'HC-030 1998 MEPS HC Survey Data (CD-ROM)']
sublist1 = []
for element in mylist:
    if element.startswith(("MEPS HC", "HC")):
        if "-IC" not in element:
            if "replaced" not in element:
                if "CD-ROM" not in element:
                    sublist1.append(element)
print(sublist1)
(Output below)
['MEPS HC-226: MEPS Panel 23 Three-Year Longitudinal Data File', 'HC-203: 2018 Jobs File', 'HC-051H 2000 Home Health']
However, issues are with the following code block (Method 2).
sublist2 = []
excludes = ['-IC', 'replaced', 'CD-ROM']
for element in mylist:
    if element.startswith(("MEPS HC", "HC")):
        sublist2 = mylist(filter(lambda x: any(excludes in x for exclude in excludes), mylist))
        sublist2.append(element)
print(sublist2)
TypeError: 'list' object is not callable
My code block with multiple if conditions (Method 1) to filter the list works. However, I could not figure out why the code block with filter() and lambda functions (Method 2) does not work. I was expecting the same results I got from Method 1. I am open to other solutions as an alternative to Method 1.
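A minimal sketch of Method 2 with the apparent bugs fixed (assuming the same mylist and excludes as above): use list(...) instead of mylist(...), since calling mylist, which is a list, is exactly what raises TypeError: 'list' object is not callable; test each individual exclude inside the generator rather than the excludes list itself; and negate the test with not any(...) so that matching elements are dropped:
sublist2 = list(
    filter(
        lambda x: x.startswith(("MEPS HC", "HC"))
                  and not any(exclude in x for exclude in excludes),
        mylist,
    )
)
print(sublist2)
This should print the same three entries as Method 1.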

Related

Why Do I Keep Receiving A "Requests.Exceptions.InvalidSchema: No Connection Adapters Were Found For '0" Error?

I'm trying to create a script that returns domain and backlink numbers, via the SEMrush API, for each URL held in a dataframe.
The dataframe containing the URLs has some of the following information:
                                                   0
0  www.ig.com/jp/trading-strategies/swing-trading...
1  www.ig.com/it/news-e-idee-di-trading/criptoval...
2  www.ig.com/uk/news-and-trade-ideas/the-omicron...

[1468 rows x 1 columns]
When I run my script I get the following error:
requests.exceptions.InvalidSchema: No connection adapters were found for '0 https://api.semrush.com/analytics/v1/?key=1f0e...\nName: 0, dtype: object'
Here is the part of the code that generates the error:
for index, url in gsdf.iterrows():
    rr = requests.request("GET", "https://api.semrush.com/analytics/v1/?key="+API_KEY+"&type=backlinks_tld&target="+url+"&target_type=url&export_columns=domains_num,backlinks_num&display_limit=1", headers=headers, data=payload)
    data = json.loads(rr.text.encode('utf8'))
    srdf = srdf.append({domains_num: data, backlinks_num: data}, ignore_index=True)
I'm not sure why this happens as I'm new to Python. Can you help?
Kind thanks
Mark
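A likely cause, sketched under the assumption that the single URL column is labelled 0 as in the printout above: gsdf.iterrows() yields (index, row) pairs where row is a pandas Series, so "..." + url concatenates the Series' string repr into the request target, which is why the error shows '0 https://...\nName: 0, dtype: object' instead of a URL. Iterating the column's values sidesteps this:
for url in gsdf[0]:  # assumes the URL column is labelled 0
    rr = requests.get(
        "https://api.semrush.com/analytics/v1/?key=" + API_KEY
        + "&type=backlinks_tld&target=" + url
        + "&target_type=url&export_columns=domains_num,backlinks_num&display_limit=1",
        headers=headers,
        data=payload,
    )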

How to read in pandas column as column of lists?

Probably a simple solution, but I couldn't find a fix scrolling through previous questions, so I thought I would ask.
I'm reading in a csv using pd.read_csv(). One column is giving me issues:
0 ['Bupa', 'O2', 'EE', 'Thomas Cook', 'YO! Sushi...
1 ['Marriott', 'Evans']
2 ['Toni & Guy', 'Holland & Barrett']
3 []
4 ['Royal Mail', 'Royal Mail']
It looks fine here, but when I reference the first value in the column I get:
df['brand_list'][0]
Out : '[\'Bupa\', \'O2\', \'EE\', \'Thomas Cook\', \'YO! Sushi\', \'Costa\', \'Starbucks\', \'Apple Store\', \'HMV\', \'Marks & Spencer\', "Sainsbury\'s", \'Superdrug\', \'HSBC UK\', \'Boots\', \'3 Store\', \'Vodafone\', \'Marks & Spencer\', \'Clarks\', \'Carphone Warehouse\', \'Lloyds Bank\', \'Pret A Manger\', \'Sports Direct\', \'Currys PC World\', \'Warrens Bakery\', \'Primark\', "McDonald\'s", \'HSBC UK\', \'Aldi\', \'Premier Inn\', \'Starbucks\', \'Pizza Hut\', \'Ladbrokes\', \'Metro Bank\', \'Cotswold Outdoor\', \'Pret A Manger\', \'Wetherspoon\', \'Halfords\', \'John Lewis\', \'Waitrose\', \'Jessops\', \'Costa\', \'Lush\', \'Holland & Barrett\']'
That is obviously a string, not a list as expected. How can I retain the list type when I read in this data?
I've tried the ast approach I've seen in other posts, df['brand_list_new'] = df['brand_list'].apply(lambda x: ast.literal_eval(x)), which didn't work.
I've also tried to replicate with dummy dataframes:
df1 = pd.DataFrame({'a': [['test', 'test1', 'test3'], ['test59'], ['test'], ['rhg', 'wreg']],
                    'b': [['erg', 'retbn', 'ert', 'eb'], ['g', 'eg', 'egr'], ['erg'], 'eg']})
df1['a'][0]
Out: ['test', 'test1', 'test3']
Which works as I would expect; this suggests to me that the solution lies in how I am importing the data.
Apologies, I was being stupid. The following should work:
import ast
df['brand_list_new'] = df['brand_list'].apply(lambda x: ast.literal_eval(x))
df['brand_list_new'][0]
Out: ['Bupa','O2','EE','Thomas Cook','YO! Sushi',...]
As desired
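As a side note, a hedged alternative is to parse the column while reading, via read_csv's converters argument (the filename here is a placeholder):
import ast
import pandas as pd

# run ast.literal_eval on each cell of brand_list during the read,
# so the column arrives as actual lists
df = pd.read_csv('data.csv', converters={'brand_list': ast.literal_eval})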

pandas groupby trying to optimse several steps

I've been trying to optimise a bokeh server to calculate live stats by selected country on Covid19.
I found myself repeating a groupby function to calculate new columns and was wondering, having selected the groupby, if I could then apply it in a similar way to .agg() on multiple columns?
For example:
dfall = pd.DataFrame(db("SELECT * FROM C19daily"))
dfall.set_index(['geoId', 'date'], drop=False, inplace=True)
dfall = dfall.sort_index(ascending=True)
dfall.head()
                     id        date geoId  cases  deaths          auid
geoId date
AD    2020-03-03  70119  2020-03-03    AD      1       0  AD03/03/2020
      2020-03-14  70118  2020-03-14    AD      1       0  AD14/03/2020
      2020-03-16  70117  2020-03-16    AD      3       0  AD16/03/2020
      2020-03-17  70116  2020-03-17    AD      9       0  AD17/03/2020
      2020-03-18  70115  2020-03-18    AD      0       0  AD18/03/2020
I need to create new columns based on 'cases' and 'deaths', applying various functions like cumsum(). Currently I do this the long way:
dfall['ccases'] = dfall.groupby(level=0)['cases'].cumsum()
dfall['dpc_cases'] = dfall.groupby(level=0)['cases'].pct_change(fill_method='pad', periods=7)
.....
dfall['cdeaths'] = dfall.groupby(level=0)['deaths'].cumsum()
dfall['dpc_deaths'] = dfall.groupby(level=0)['deaths'].pct_change(fill_method='pad', periods=7)
I tried to optimise the groupby call like this:
with dfall.groupby(level=0) as gr:
    gr = gr['cases'].cumsum()...
But the error suggests the class doesn't support this:
AttributeError: __enter__
I thought I could use .agg({}) and supply a dictionary:
g = dfall.groupby(level=0).agg({'cc' : 'cumsum', 'cd' : 'cumsum'})
but that produces another error
pandas.core.base.SpecificationError: nested renamer is not supported
I have plenty of other bits to optimise; I thought this Python part would be the easiest and would save a few ms!
Could anyone nudge me in the right direction?
To avoid repeating dfall.groupby(level=0) you can just save it in a variable:
gb = dfall.groupby(level=0)
gb_cases = gb['cases']
dfall['ccases'] = gb_cases.cumsum()
dfall['dpc_cases'] = gb_cases.pct_change(fill_method='pad', periods=7)
...
And to run multiple aggregations using a single expression, I think you can use named aggregation. But I have no clue whether it will be more performant or not. Either way, it's better to profile the code and improve the actual bottlenecks.
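For what it's worth, a sketch of named aggregation (available since pandas 0.25). It fits true aggregations that return one value per group; cumsum and pct_change are transforms that return one value per row, so they still need the per-column calls shown above:
# one value per group for each named aggregation
summary = dfall.groupby(level=0).agg(
    total_cases=('cases', 'sum'),
    total_deaths=('deaths', 'sum'),
    last_date=('date', 'max'),
)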

Why am I getting <searchconsole.query.Report(rows=1)> instead of numbers/strs

Working with the Search Console API, I made it through the basics.
Now I'm stuck on splitting and arranging the data:
when trying to split, I'm getting a NaN; nothing I try works.
46 ((174.0, 3753.0, 0.04636290967226219, 7.816147...
47 ((93.0, 2155.0, 0.0431554524361949, 6.59025522...
48 ((176.0, 4657.0, 0.037792570324243074, 6.90251...
49 ((20.0, 1102.0, 0.018148820326678767, 7.435571...
50 ((31.0, 1133.0, 0.02736098852603707, 8.0935569...
Name: test, dtype: object
When trying to manipulate the data like this (and similar interactions):
data=source['test'].tolist()
data
It's clear that the data is not really available...
[<searchconsole.query.Report(rows=1)>,
<searchconsole.query.Report(rows=1)>,
<searchconsole.query.Report(rows=1)>,
<searchconsole.query.Report(rows=1)>,
<searchconsole.query.Report(rows=1)>]
Anyone have an idea how I can interact with my data?
Thanks.
For reference, this is the code and the library I work with:
account = searchconsole.authenticate(client_config='client_secrets.json', credentials='credentials.json')
webproperty = account['https://www.example.com/']

def APIsc(date, keyword):
    results = webproperty.query.range(date, days=-30).filter('query', keyword, 'contains').get()
    return results

source['test'] = source.apply(lambda x: APIsc(x.date, x.keyword), axis=1)
source
made by: https://github.com/joshcarty/google-searchconsole
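A hedged sketch of one way to get at the numbers, assuming Report exposes the rows list of row tuples shown in that project's README: return the row data from APIsc instead of the Report object itself:
def APIsc(date, keyword):
    report = webproperty.query.range(date, days=-30).filter('query', keyword, 'contains').get()
    # report.rows is a list of row tuples; return the first one, or None if empty
    return report.rows[0] if report.rows else None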

python - cannot make corr work

I'm struggling with getting a simple correlation done. I've tried all that was suggested under similar questions.
Here are the relevant parts of the code, the various attempts I've made and their results.
import numpy as np
import pandas as pd
try01 = data[['ESA Index_close_px', 'CCMP Index_close_px']].corr(method='pearson')
print (try01)
Out:
Empty DataFrame
Columns: []
Index: []
try04 = data['ESA Index_close_px'][5:50].corr(data['CCMP Index_close_px'][5:50])
print (try04)
Out:
AttributeError: 'float' object has no attribute 'sqrt'
Using numpy:
try05 = np.corrcoef(data['ESA Index_close_px'],data['CCMP Index_close_px'])
print (try05)
Out:
AttributeError: 'float' object has no attribute 'sqrt'
Converting the columns to lists:
ESA_Index_close_px_list = list()
start_value = 1
end_value = len(data['ESA Index_close_px']) + 1
for items in data['ESA Index_close_px']:
    ESA_Index_close_px_list.append(items)
    start_value = start_value + 1
    if start_value == end_value:
        break
    else:
        continue

CCMP_Index_close_px_list = list()
start_value = 1
end_value = len(data['CCMP Index_close_px']) + 1
for items in data['CCMP Index_close_px']:
    CCMP_Index_close_px_list.append(items)
    start_value = start_value + 1
    if start_value == end_value:
        break
    else:
        continue
try06 = np.corrcoef(['ESA_Index_close_px_list','CCMP_Index_close_px_list'])
print (try06)
Out:
TypeError: cannot perform reduce with flexible type
I also tried .astype, but it made no difference.
data['ESA Index_close_px'].astype(float)
data['CCMP Index_close_px'].astype(float)
Using Python 3.5, pandas 0.18.1 and numpy 1.11.1
Would really appreciate any suggestion.
edit1:
Data is coming from an Excel spreadsheet:
data = pd.read_excel('C:\\Users\\Ako\\Desktop\\ako_files\\for_corr_tool.xlsx')
Prior to the correlation attempts, there are only column renames and
data = data.drop(data.index[0])
to get rid of a line.
Regarding the types:
print (type (data['ESA Index_close_px']))
print (type (data['ESA Index_close_px'][1]))
Out:
edit2:
parts of the data:
print (data['ESA Index_close_px'][1:10])
print (data['CCMP Index_close_px'][1:10])
Out:
2 2137
3 2138
4 2132
5 2123
6 2127
7 2126.25
8 2131.5
9 2134.5
10 2159
Name: ESA Index_close_px, dtype: object
2 5241.83
3 5246.41
4 5243.84
5 5199.82
6 5214.16
7 5213.33
8 5239.02
9 5246.79
10 5328.67
Name: CCMP Index_close_px, dtype: object
Well, I've encountered the same problem today.
Try using .astype('float64') to make the types correct.
data['ESA Index_close_px'][5:50].astype('float64').corr(data['CCMP Index_close_px'][5:50].astype('float64'))
This works well for me. Hope it can help you as well.
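Along the same lines, a sketch that coerces the object-dtype columns to numeric once, right after loading, so every later corr call sees float data (errors='coerce' turns any unparseable entry into NaN):
data['ESA Index_close_px'] = pd.to_numeric(data['ESA Index_close_px'], errors='coerce')
data['CCMP Index_close_px'] = pd.to_numeric(data['CCMP Index_close_px'], errors='coerce')
try01 = data[['ESA Index_close_px', 'CCMP Index_close_px']].corr(method='pearson')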
You can try the following:
Top15['Citable docs per capita']=(Top15['Citable docs per capita']*100000)
Top15['Citable docs per capita'].astype('int').corr(Top15['Energy Supply per Capita'].astype('int'))
It worked for me.