Combining 3 data frames based on common field 'Country'

I have 3 dataframes with a common index of Country. I need to combine all 3 based on that Country field.
My first try was to combine two and then add the third, and this is how far I got:
pd.merge(energy, GDP, how='outer', left_index=True, right_index=True)
I have also tried 3 options that are highly rated on this site, but they end in the KeyError shown after the code:
import functools
dfs = [energy, GDP, ScimEn]
df_final = functools.reduce(lambda left,right: pd.merge(left,right,on='Country'), dfs)
energy.merge(GDP,on='Country').merge(ScimEn,on='Country')
pd.concat([energy.set_index('Country'), GDP.set_index('Country'), ScimEn.set_index('Country')], axis=1)
KeyError: 'Country'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
in ()
40 #df_final = functools.reduce(lambda left,right: pd.merge(left,right,on='Country'), dfs)
41 #energy.merge(GDP,on='Country').merge(ScimEn,on='Country')
---> 42 pd.concat([energy.set_index('Country'), GDP.set_index('Country'), ScimEn.set_index('Country')], axis=1)

You need to do multiple merges. Try the following:
merge1 = pd.merge(energy, GDP, how='inner', on='Country')
final = pd.merge(merge1, ScimEn, how='inner', on='Country')
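Since the question says Country is already the index of all three frames, set_index('Country') has no such column to look for, which would explain the KeyError from the concat attempt. A minimal sketch of an index-based alternative, assuming the three indexes hold unique country names; the tiny frames, their column names and their values are made up for illustration:

```python
import pandas as pd

# Hypothetical stand-ins for energy, GDP and ScimEn, each indexed by Country.
energy = pd.DataFrame(
    {"Energy Supply": [1.0, 2.0, 3.0]},
    index=pd.Index(["China", "Japan", "Brazil"], name="Country"),
)
GDP = pd.DataFrame(
    {"GDP": [10.0, 20.0]},
    index=pd.Index(["China", "Japan"], name="Country"),
)
ScimEn = pd.DataFrame(
    {"Rank": [1, 2, 3]},
    index=pd.Index(["China", "Japan", "India"], name="Country"),
)

# Align on the shared index; join="inner" keeps only countries present in all three.
combined = pd.concat([energy, GDP, ScimEn], axis=1, join="inner")
print(combined)

# Alternative: turn the index back into a column, after which on="Country" works.
merged = energy.reset_index().merge(GDP.reset_index(), on="Country")
print(merged)
```

Use join="outer" (or how='outer' in the merge) if countries missing from one frame should be kept with NaN values instead of being dropped.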

Related

Why I am not able to generate schema using tfdv.infer_schema()?

I get the error "TypeError: statistics is of type StatsOptions, should be a DatasetFeatureStatisticsList proto." when I generate a schema with tfdv.infer_schema() after filtering the relevant features with feature_allowlist in the tfdv.StatsOptions class. Can anyone help me with this?
features_remove= {"region","fiscal_week"}
columns= [col for col in df.columns if col not in features_remove]
stat_Options= tfdv.StatsOptions(feature_allowlist=columns)
print(stat_Options.feature_allowlist)
schema= tfdv.infer_schema(stat_Options)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-53-e61b2454028e> in <module>
----> 1 schema= tfdv.infer_schema(stat_Options)
2 schema
C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_data_validation\api\validation_api.py in infer_schema(statistics, infer_feature_shape, max_string_domain_size, schema_transformations)
95 """
96 if not isinstance(statistics, statistics_pb2.DatasetFeatureStatisticsList):
---> 97 raise TypeError(
98 'statistics is of type %s, should be '
99 'a DatasetFeatureStatisticsList proto.' % type(statistics).__name__)
TypeError: statistics is of type StatsOptions, should be a DatasetFeatureStatisticsList proto.

The reason is simple: tfdv.infer_schema expects a statistics_pb2.DatasetFeatureStatisticsList object, not a StatsOptions object.
Generate the statistics first, passing the options, and then infer the schema from them:
features_remove= {"region","fiscal_week"}
columns= [col for col in df.columns if col not in features_remove]
stat_Options= tfdv.StatsOptions(feature_allowlist=columns)
print(stat_Options.feature_allowlist)
stats = tfdv.generate_statistics_from_dataframe(df, stat_Options)
schema= tfdv.infer_schema(stats)

getting error while forming train matrix in book recommendation system

I am new to data science and am having trouble building a book recommendation system with collaborative filtering. Can someone please advise on the error below?
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
data = pd.read_csv('BX-Book-Ratings.csv',engine = 'python')
df = data.iloc[1:10000,:]
print(df)
print(df.dtypes)
df['isbn']= pd.to_numeric(df['isbn'], errors = 'coerce')
df = df[np.isfinite(df).all(1)]
df['isbn'] = df['isbn'].astype(np.int64)
from sklearn.model_selection import train_test_split
n_users = df.user_id.unique().shape[0]
n_book = df.isbn.unique().shape[0]
train_data, test_data = train_test_split(df, test_size=0.5)
print(n_users , n_book)
train_data_matrix = np.zeros((n_users, n_book))
for line in train_data.itertuples():
    # [user_id index, book_id index] = given rating.
    train_data_matrix[line[1] - 1, line[2] - 1] = line[3]
train_data_matrix
--------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-125-caa0bcd40167> in <module>
2 for line in train_data.itertuples():
3 #[user_id index, book_id index] = given rating.
----> 4 train_data_matrix[line[1] - 1, line[2] - 1] = line[3]
5 train_data_matrix
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

The most probable cause of the error is that the values used as indices do not correspond to valid positions in the matrix.
I can see that isbn is converted to an integer type, but what about user_id?
Fix:
Create a unique, zero-based index for each of the n_users users and n_book books.
Method 1: build a separate dataframe of unique users and another of unique books, and use their indexes.
Method 2: build a dict that maps each unique value to an index.
Whichever method is used must be applied consistently across the rest of the process, or the user-book ratings will be mismatched.
This fix uses method 2.
# Method 2
# Map each unique user_id to a row index.
user_dict = {}
for item, value in enumerate(df.user_id.unique().tolist()):
    user_dict[value] = item
# Map each unique isbn to a column index.
book_dict = {}
for item, value in enumerate(df.isbn.unique().tolist()):
    book_dict[value] = item
print(len(user_dict.keys()), len(book_dict.keys()))
for line in train_data.itertuples():
    row_index = user_dict[line[1]]
    col_index = book_dict[line[2]]
    train_data_matrix[row_index, col_index] = line[3]
Hope this helps. A snapshot of the data would probably help to pin this down.
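The same mapping idea can also be sketched with pd.factorize, which is a swapped-in alternative to the dict approach and not what the answer itself uses. It assigns each distinct value a zero-based integer code in order of first appearance. The ratings below are made-up sample rows, and the column names are assumed from the question:

```python
import numpy as np
import pandas as pd

# Hypothetical ratings; column names follow the question.
ratings = pd.DataFrame({
    "user_id": [276725, 276726, 276725, 276729],
    "isbn": ["034545104X", "0155061224", "0155061224", "052165615X"],
    "rating": [0, 5, 3, 6],
})

# One integer code per distinct user and per distinct book.
user_codes, user_ids = pd.factorize(ratings["user_id"])
book_codes, isbns = pd.factorize(ratings["isbn"])

# Rows are users, columns are books; unrated cells stay 0.
matrix = np.zeros((len(user_ids), len(isbns)))
matrix[user_codes, book_codes] = ratings["rating"].to_numpy()
print(matrix)
```

Factorizing the full dataframe before the train/test split keeps the row and column positions consistent between the training and test matrices.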

TypeError: frame_apply() got an unexpected keyword argument 'broadcast'

I am running a function on every row of a Pandas dataset using .loc.
dm.loc[:50, 'address_components'] = dm.loc[:50, ['description', 'city']].apply(lambda row: get_address(row[0], row[1]), axis=1)
The function itself is below, although the error doesn't apply to the function I'm passing
def get_address(address, city):
    geolocator = GoogleV3(api_key=api_key, domain='maps.googleapis.com', scheme=None, client_id=None,
                          secret_key=None,
                          user_agent=None)
    geocode = RateLimiter(geolocator.geocode, min_delay_seconds=0.0, swallow_exceptions=True, return_value_on_exception=None)
    # cleanup address
    address = address.strip(' ').strip(',')
    full_addr = '{}, {}'.format(address, city)
    #print(full_addr)
    data = geocode(full_addr, timeout=None, exactly_one=True)
    if data:
        #print(data.raw)
        return data.raw
    return None
and the response
TypeError Traceback (most recent call last)
<ipython-input-77-46aa2724c2f6> in <module>
1
2
----> 3 dm.loc[:50, 'address_components'] = dm.loc[:50, ['description', 'city']].apply(lambda row: get_address(row[0], row[1]), axis=1)
~/virt_env/virt2/lib/python3.6/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
6909 result_type=result_type,
6910 args=args,
-> 6911 kwds=kwds,
6912 )
6913 return op.get_result()
TypeError: frame_apply() got an unexpected keyword argument 'broadcast'
That error is new: the function ran without issues before and started throwing this error only after I updated pandas to 1.0.4. However, it persists even after I downgraded pandas back to 0.25.1. The broadcast parameter also doesn't make sense to me.

Turns out it really is a version issue. Downgraded back to pandas 0.25, restarted the kernel and it's flying again!
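The traceback would be consistent with a kernel that still had parts of the old and new pandas loaded at the same time, which is why the restart mattered. A quick check, as a sketch, of which pandas the running interpreter has actually imported:

```python
import pandas as pd

# Version and install location of the pandas module this interpreter loaded.
print(pd.__version__)
print(pd.__file__)
```

If the printed version differs from what pip or conda reports, the kernel is probably still holding the previous install and needs a restart.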

I can't be sure, but you seem to be assigning to the first 50 rows each time apply iterates, whilst iterating each row
dm.loc[:50, 'address_components'] =
should be
dm["address_components"] = dm.loc[:50, ['description', 'city']].apply(lambda row: get_address(row[0], row[1]), axis=1)

Applying function to pandas dataframe

I have a pandas dataframe called 'tourdata' consisting of 676k rows of data. Two of the columns are latitude and longitude.
Using the reverse_geocode package, I want to convert these coordinates to country data.
When I call:
import reverse_geocode as rg
tourdata['Country'] = rg.search((row[tourdata['latitude']],row[tourdata['longitude']]))
I get the error :
ValueError Traceback (most recent call last)
in ()
1 coordinates = (tourdata['latitude'],tourdata['longitude']),
----> 2 tourdata['Country'] = rg.search((row[tourdata['latitude']],row[tourdata['longitude']]))
~/anaconda/envs/py3/lib/python3.6/site-packages/reverse_geocode/__init__.py in search(coordinates)
114 """
115 gd = GeocodeData()
--> 116 return gd.query(coordinates)
117
118
~/anaconda/envs/py3/lib/python3.6/site-packages/reverse_geocode/__init__.py in query(self, coordinates)
46 except ValueError as e:
47 logging.info('Unable to parse coordinates: {}'.format(coordinates))
---> 48 raise e
49 else:
50 results = [self.locations[index] for index in indices]
~/anaconda/envs/py3/lib/python3.6/site-packages/reverse_geocode/__init__.py in query(self, coordinates)
43 """
44 try:
---> 45 distances, indices = self.tree.query(coordinates, k=1)
46 except ValueError as e:
47 logging.info('Unable to parse coordinates: {}'.format(coordinates))
ckdtree.pyx in scipy.spatial.ckdtree.cKDTree.query()
ValueError: x must consist of vectors of length 2 but has shape (2, 676701)
To test that the package is working :
coordinates = (tourdata['latitude'][0],tourdata['longitude'][0]),
results = (rg.search(coordinates))
print(results)
Outputs :
[{'country_code': 'AT', 'city': 'Wartmannstetten', 'country': 'Austria'}]
Any help with this is appreciated. Ideally I'd like to access the resulting dictionary and apply only the country code to the Country column.

The search method expects a list of coordinates. To look up a single point, you can use the get method instead.
Try:
tourdata['country'] = tourdata.apply(lambda x: rg.get((x['latitude'], x['longitude'])), axis=1)
It works fine for me :
import pandas as pd
tourdata = pd.DataFrame({'latitude':[0.3, 2, 0.6], 'longitude':[12, 5, 0.8]})
tourdata['country'] = tourdata.apply(lambda x: rg.get((x['latitude'], x['longitude'])), axis=1)
tourdata['country']
Output :
0 {'country': 'Gabon', 'city': 'Booué', 'country...
1 {'country': 'Sao Tome and Principe', 'city': '...
2 {'country': 'Ghana', 'city': 'Mumford', 'count...
Name: country, dtype: object
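To store only the country code, as the question asks, the value can be pulled out of each returned dictionary. A minimal sketch, using a hand-written list as a stand-in for the package's output so that it runs without reverse_geocode installed; the coordinates are made up and the dictionary keys are the ones shown in the question's test output:

```python
import pandas as pd

# Hypothetical stand-in for what rg.search(coords) returns: one dict per coordinate.
results = [
    {"country_code": "AT", "city": "Wartmannstetten", "country": "Austria"},
    {"country_code": "GA", "city": "Booué", "country": "Gabon"},
]

# Made-up coordinates, only here to give the frame two rows.
tourdata = pd.DataFrame({"latitude": [47.7, 0.3], "longitude": [16.1, 12.0]})

# With the real package, the list would come from a single call such as:
# coords = list(zip(tourdata["latitude"], tourdata["longitude"]))
# results = rg.search(coords)

# Keep only the country code from each result.
tourdata["Country"] = [r["country_code"] for r in results]
print(tourdata)
```

Passing all coordinates to search in one call, as in the commented lines, avoids a Python-level call per row, which may matter with 676k rows.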

applying a lambda function to pandas dataframe

First time posting on stackoverflow, so bear with me if I'm making some faux pas please :)
I'm trying to calculate the distance between two points, using geopy, but I can't quite get the actual application of the calculation to work.
Here's the head of the dataframe I'm working with (there are some missing values later in the dataframe, not sure if this is the issue or how to handle it in general):
   start lat  start long    end_lat   end_long
0  38.902760  -77.038630  38.880300 -76.986200
2  38.895914  -77.026064  38.915400 -77.044600
3  38.888251  -77.049426  38.895914 -77.026064
4  38.892300  -77.043600  38.888251 -77.049426
I've set up a function:
def dist_calc(st_lat, st_long, fin_lat, fin_long):
    from geopy.distance import vincenty
    start = (st_lat, st_long)
    end = (fin_lat, fin_long)
    return vincenty(start, end).miles
This one works fine when given manual input.
However, when I try to apply() the function, I run into trouble with the below code:
distances = df.apply(lambda row: dist_calc(row[-4], row[-3], row[-2], row[-1]), axis=1)
I'm fairly new to python, any help will be much appreciated!
Edit: error message:
distances = df.apply(lambda row: dist_calc2(row[-4], row[-3], row[-2], row[-1]), axis=1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 4262, in apply
ignore_failures=ignore_failures)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 4358, in _apply_standard
results[i] = func(v)
File "<stdin>", line 1, in <lambda>
File "<stdin>", line 5, in dist_calc2
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/geopy/distance.py", line 322, in __init__
super(vincenty, self).__init__(*args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/geopy/distance.py", line 115, in __init__
kilometers += self.measure(a, b)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/geopy/distance.py", line 414, in measure
u_sq = cos_sq_alpha * (major ** 2 - minor ** 2) / minor ** 2
UnboundLocalError: ("local variable 'cos_sq_alpha' referenced before assignment", 'occurred at index 10')

The default settings for pandas functions typically used to import text data like this (pd.read_table() etc) will interpret the spaces in the first 2 column names as separators, so you'll end up with 6 columns instead of 4, and your data will be misaligned:
In [23]: df = pd.read_clipboard()
In [24]: df
Out[24]:
   start        lat    start.1       long    end_lat  end_long
0      0  38.902760 -77.038630  38.880300 -76.986200       NaN
1      2  38.895914 -77.026064  38.915400 -77.044600       NaN
2      3  38.888251 -77.049426  38.895914 -77.026064       NaN
3      4  38.892300 -77.043600  38.888251 -77.049426       NaN
In [25]: df.columns
Out[25]: Index(['start', 'lat', 'start.1', 'long', 'end_lat', 'end_long'], dtype='object')
Notice that the column names are wrong, the last column is full of NaNs, and so on. If I apply your function to the dataframe in this form, I get the same error as you did.
It's usually better to fix this before the data gets imported as a dataframe. I can think of 2 methods:
1. Clean the data before importing, for example by copying it into an editor and replacing the offending spaces with underscores. This is the easiest.
2. Use a regex to fix it during import. This may be necessary if the dataset is very large, or if it is pulled from a website and has to be refreshed regularly.
Here's an example of case (2):
In [35]: df = pd.read_clipboard(sep=r'\s{2,}|\s(?=-)', engine='python')
In [36]: df = df.rename(columns={'start lat': 'start_lat', 'start long': 'start_long'})
In [37]: df
Out[37]:
   start_lat  start_long    end_lat   end_long
0  38.902760  -77.038630  38.880300 -76.986200
2  38.895914  -77.026064  38.915400 -77.044600
3  38.888251  -77.049426  38.895914 -77.026064
4  38.892300  -77.043600  38.888251 -77.049426
The regex specifies that a separator must be either 2+ whitespace characters, or 1 whitespace character followed by a hyphen (minus sign). Then I rename the columns to what I assume are the expected values.
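The same separator can be tried without the clipboard by feeding the text to pd.read_csv. A small sketch, assuming the source text has at least two spaces between columns except where a single space precedes a negative number; the two rows are copied from the question:

```python
import io
import pandas as pd

# Header has 4 names, data rows have 5 fields, so the first field becomes the index.
text = "\n".join([
    "start lat  start long  end_lat  end_long",
    "0  38.902760  -77.038630  38.880300 -76.986200",
    "2  38.895914  -77.026064  38.915400 -77.044600",
])

# Split on 2+ whitespace characters, or on 1 whitespace character followed by a minus sign.
df = pd.read_csv(io.StringIO(text), sep=r"\s{2,}|\s(?=-)", engine="python")
df = df.rename(columns={"start lat": "start_lat", "start long": "start_long"})
print(df.columns.tolist())
```

The single space inside 'start lat' is not followed by a hyphen, so it survives as part of the column name until the rename.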
From this point your function / apply works fine, but I've changed it a little:
- PEP8 recommends putting imports at the top of each file, rather than in a function.
- Extracting the columns by name is more robust, and would have given a much more understandable error than the weird error thrown by geopy.
For example:
In [51]: def dist_calc(row):
    ...:     start = row[['start_lat','start_long']]
    ...:     end = row[['end_lat', 'end_long']]
    ...:     return vincenty(start, end).miles
    ...:
In [52]: df.apply(lambda row: dist_calc(row), axis=1)
Out[52]:
0 3.223232
2 1.674780
3 1.365851
4 0.420305
dtype: float64
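The question also mentions missing values later in the dataframe. A sketch of one way to handle them, assuming rows with any missing coordinate should simply be skipped. To stay runnable without geopy it swaps in a plain haversine formula, which is not the Vincenty method used above and gives slightly different distances; the NaN in the third row is made up for illustration:

```python
import math
import pandas as pd

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


coord_cols = ["start_lat", "start_long", "end_lat", "end_long"]
df = pd.DataFrame({
    "start_lat": [38.902760, 38.895914, None],  # hypothetical missing value
    "start_long": [-77.038630, -77.026064, -77.049426],
    "end_lat": [38.880300, 38.915400, 38.895914],
    "end_long": [-76.986200, -77.044600, -77.026064],
})

# Drop rows with any missing coordinate, then access columns by name.
complete = df.dropna(subset=coord_cols)
distances = complete.apply(
    lambda row: haversine_miles(
        row["start_lat"], row["start_long"], row["end_lat"], row["end_long"]
    ),
    axis=1,
)
print(distances)
```

The result keeps the original row labels, so it can be assigned back to df as a new column, leaving NaN for the rows that were skipped.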
