Axis is specified but .drop() errors out - python-3.x

I've a data set with 3 columns:
UserID ProductID Ratings
0 AKM1MP6P0OYPR 0132793040 5.0
1 A2CX7LUOHB2NDG 0321732944 5.0
2 A2NWSAGRHCP8N5 0439886341 1.0
3 A2WNBOD3WNDNKT 0439886341 3.0
4 A1GI0U4ZRJA8WN 0439886341 1.0
I'm trying to drop the rows where the UserID count is les than 50, that is if the user has rated less than 50 products. So I ran the code below:
for i in data.UserID.unique():
print (i)
if(data['UserID'].loc[data['UserID'] == i].value_counts().values < 50):
data.drop(data.loc[data['UserID'] == i], axis = 0, inplace = True)
But I receive an error message stating the labels are not contained in an axis like so:
KeyError: "labels ['UserID' 'ProductID' 'Ratings'] not contained in axis"
Could anyone please tell me what I'm getting wrong?
Thanks!

Related

How to handle errors with TimeDelta and Integers in Python

I need to calculate the distance between two dates.
df3['dist_2_1'] = (df3['Date2'] - df3['Date1'])
When I save this into my SQLite DB the format is terrible, so I decided to use an integer format which is much better.
df3['dist_2_1'] = (df3['Date2'] - df3['Date1']).astype('timedelta64[D]').astype(int)
So far so good, but in a similar case, I've NULL values which cause an error when I try to do the diference between dates.
df3['dist_B_3'] = df3['Break_date'] - df3['Date3']
The Break_date can be null, so I want that in this case the final result in dist_B_3 is 0, but now is an error that breaks everything. I tested this so far, but doesn't work...
try:
if df3['Break_date'] == 'NaT':
df3['dist_B_3'] = 0
else:
df3['dist_B_3'] = df3['Break_date'] - df3['Date3']
#().astype('timedelta64[D]').astype(int)
except Exception:
print("error in the dist_B_3")
My df3['Break_date'] df is this one, so the NaT are the ones creating the error.
0 2022-07-13
1 2022-07-12
2 2022-07-14
3 2022-07-14
4 NaT
5 NaT
Any idea on how to handle this?

(Neo4j / py2neo) Update relationship after it's been created

I am having troubles with updating a relationship property:
my goal is to map a dataset like the following into a Neo4j graph:
PersonName IllnessType
0 A 1
1 A 2
2 A 3
3 B 1
4 B 2
5 B 1
I basically cycle over the lines of this dataset, creating a Node for each Person and each Illness found on the line, and merging to avoid duplicates:
from py2neo import *
graph = Graph()
person_node= Node("Person", **kwargs)
graph.merge(person_node, "Person", "Name")
illness_node = Node("Illness", **kwargs)
graph.merge(illness_node, "Illness", "IllnessType")
edge = Relationship.type("SUFFERS_FROM")
rel = sfEdge(person_node, illness_node)
self.graph.merge(rel)
What I like to add now, is to add a weight on the "SUFFERS_FROM" edge that count how many times a person has suffered from a certain illness. What I tried to do was:
rm = RelationshipMatcher()
edge_to_increment = rm.match(nodes=(None, patNode), r_type=None).first()
if edge_to_increment is None:
edge_to_increment = edge(person_node, illness_node)
edge_to_increment["COUNT"]=1
self.graph.merge(edge_to_increment)
else:
edge_to_increment["COUNT"] += 1
c = e2r["COUNT"]
But then when I visualize the result, all edges have weight 1 even though the edge B-->1 should have weight 2.
Thanks in advance

how to replace stock-prices symbols in a dataframe

I would like to get the S&P 500 ['Adj Close'] column and replace the column with the corresponding stock symbol, however, I am not able to replace the dataframe columns because it gives me an error: KeyError: '5'
What I would like to achieve is to loop through all the available stocks from the list and replace the Adj Close with the stock symbol.
This is what I did:
First I have scraped the stock symbols from Wikipedia and added them to a list.
data = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
symbols = data[0] # get first column
symbols.head()
stock = symbols['Symbol'].to_list()
print(stock[0:5])
this gives me a list of stock symbols as below:
['MMM', 'ABT', 'ABBV', 'ABMD', 'ACN']
then I scraped Yahoo finance to get the daily financial data as below
stock_url = 'https://query1.finance.yahoo.com/v7/finance/download/{}?'
params = {
'range' : '1y',
'interval' : '1d',
'events' : 'history'
}
response = requests.get(stock_url.format(stock[0]), params=params)
file = StringIO(response.text)
reader = csv.reader(file)
data = list(reader)
df = pd.DataFrame(data)
stock_data = df['5']
Fix for key error
You are calling the the url using the list 'stock' and it gives a 404 response when I tried.
Call the URL with individual stock like below,
requests.get(stock_url.format(stock[0]), params=params)
Also do below, The column 5 is stored as integer instead of character. That is the reason you got 'key error'
stock_data = df[5]
I tried for stock 'MMM' - stock[0] and it prints below:
0 1 2 3 4 5 \
0 Date Open High Low Close Adj Close
1 2019-12-11 168.380005 168.839996 167.330002 168.740005 162.682480
2 2019-12-12 166.729996 170.850006 166.330002 168.559998 162.508926
3 2019-12-13 169.619995 171.119995 168.080002 168.789993 162.730667
4 2019-12-16 168.940002 170.830002 168.190002 170.750000 164.620316
.. ... ... ... ... ... ...
249 2020-12-04 172.130005 173.160004 171.539993 172.460007 172.460007
250 2020-12-07 171.720001 172.500000 169.179993 170.149994 170.149994
251 2020-12-08 169.740005 172.830002 169.699997 172.460007 172.460007
252 2020-12-09 172.669998 175.639999 171.929993 175.289993 175.289993
253 2020-12-10 174.869995 175.399994 172.690002 173.490005 173.490005
[254 rows x 7 columns]
Loop through stocks and replace Adj Close (Edited as per requirements from comments)
Code for looping through stocks and replacing Adj close with Stock symbol.
stock_url = 'https://query1.finance.yahoo.com/v7/finance/download/{}?'
params = {
'range' : '1y',
'interval' : '1d',
'events' : 'history'
}
df = pd.DataFrame()
for i in stock:
response = requests.get(stock_url.format(i), params=params)
file = io.StringIO(response.text)
reader = csv.reader(file)
data = list(reader)
df1 = pd.DataFrame(data)
df1.loc[df1[5] == 'Adj Close',5] = i
df = df.append(df1)
Tried the code for first 3 stocks and here it is:

Find two consecutive quarters of GDP decline, and ending with two consecutive quarters of GDP growth

I have the following df with data about the American quarterly GDP in billions of chained 2009 dollars, from 1947q1 to 2016q2:
df = pd.DataFrame(data = [1934.5, 1932.3, 1930.3, 1960.7, 1989.5, 2021.9, 2033.2, 2035.3, 2007.5, 2000.8, 2022.8, 2004.7, 2084.6, 2147.6, 2230.4, 2273.4, 2304.5, 2344.5, 2392.8, 2398.1, 2423.5, 2428.5, 2446.1, 2526.4, 2573.4, 2593.5, 2578.9, 2539.8, 2528.0, 2530.7, 2559.4, 2609.3, 2683.8, 2727.5, 2764.1, 2780.8, 2770.0, 2792.9, 2790.6, 2836.2, 2854.5, 2848.2, 2875.9, 2846.4, 2772.7, 2790.9, 2855.5, 2922.3, 2976.6, 3049.0, 3043.1, 3055.1, 3123.2, 3111.3, 3119.1, 3081.3, 3102.3, 3159.9, 3212.6, 3277.7, 3336.8, 3372.7, 3404.8, 3418.0, 3456.1, 3501.1, 3569.5, 3595.0, 3672.7, 3716.4, 3766.9, 3780.2, 3873.5, 3926.4, 4006.2, 4100.6, 4201.9, 4219.1, 4249.2, 4285.6, 4324.9, 4328.7, 4366.1, 4401.2, 4490.6, 4566.4, 4599.3, 4619.8, 4691.6, 4706.7, 4736.1, 4715.5, 4707.1, 4715.4, 4757.2, 4708.3, 4834.3, 4861.9, 4900.0, 4914.3, 5002.4, 5118.3, 5165.4, 5251.2, 5380.5, 5441.5, 5411.9, 5462.4, 5417.0, 5431.3, 5378.7, 5357.2, 5292.4, 5333.2, 5421.4, 5494.4, 5618.5, 5661.0, 5689.8, 5732.5, 5799.2, 5913.0, 6017.6, 6018.2, 6039.2, 6274.0, 6335.3, 6420.3, 6433.0, 6440.8, 6487.1, 6503.9, 6524.9, 6392.6, 6382.9, 6501.2, 6635.7, 6587.3, 6662.9, 6585.1, 6475.0, 6510.2, 6486.8, 6493.1, 6578.2, 6728.3, 6860.0, 7001.5, 7140.6, 7266.0, 7337.5, 7396.0, 7469.5, 7537.9, 7655.2, 7712.6, 7784.1, 7819.8, 7898.6, 7939.5, 7995.0, 8084.7, 8158.0, 8292.7, 8339.3, 8449.5, 8498.3, 8610.9, 8697.7, 8766.1, 8831.5, 8850.2, 8947.1, 8981.7, 8983.9, 8907.4, 8865.6, 8934.4, 8977.3, 9016.4, 9123.0, 9223.5, 9313.2, 9406.5, 9424.1, 9480.1, 9526.3, 9653.5, 9748.2, 9881.4, 9939.7, 10052.5, 10086.9, 10122.1, 10208.8, 10281.2, 10348.7, 10529.4, 10626.8, 10739.1, 10820.9, 10984.2, 11124.0, 11210.3, 11321.2, 11431.0, 11580.6, 11770.7, 11864.7, 11962.5, 12113.1, 12323.3, 12359.1, 12592.5, 12607.7, 12679.3, 12643.3, 12710.3, 12670.1, 12705.3, 12822.3, 12893.0, 12955.8, 12964.0, 13031.2, 13152.1, 13372.4, 13528.7, 13606.5, 13706.2, 13830.8, 13950.4, 14099.1, 14172.7, 14291.8, 14373.4, 14546.1, 14589.6, 14602.6, 14716.9, 14726.0, 14838.7, 14938.5, 14991.8, 14889.5, 14963.4, 14891.6, 14577.0, 14375.0, 14355.6, 14402.5, 14541.9, 14604.8, 14745.9, 14845.5, 14939.0, 14881.3, 14989.6, 15021.1, 15190.3, 15291.0, 15362.4, 15380.8, 15384.3, 15491.9, 15521.6, 15641.3, 15793.9, 15747.0, 15900.8, 16094.5, 16186.7, 16269.0, 16374.2, 16454.9, 16490.7, 16525.0, 16583.1],
index = ['1947q1', '1947q2', '1947q3', '1947q4', '1948q1', '1948q2', '1948q3',
'1948q4', '1949q1', '1949q2', '1949q3', '1949q4', '1950q1', '1950q2',
'1950q3', '1950q4', '1951q1', '1951q2', '1951q3', '1951q4', '1952q1',
'1952q2', '1952q3', '1952q4', '1953q1', '1953q2', '1953q3', '1953q4',
'1954q1', '1954q2', '1954q3', '1954q4', '1955q1', '1955q2', '1955q3',
'1955q4', '1956q1', '1956q2', '1956q3', '1956q4', '1957q1', '1957q2',
'1957q3', '1957q4', '1958q1', '1958q2', '1958q3', '1958q4', '1959q1',
'1959q2', '1959q3', '1959q4', '1960q1', '1960q2', '1960q3', '1960q4',
'1961q1', '1961q2', '1961q3', '1961q4', '1962q1', '1962q2', '1962q3',
'1962q4', '1963q1', '1963q2', '1963q3', '1963q4', '1964q1', '1964q2',
'1964q3', '1964q4', '1965q1', '1965q2', '1965q3', '1965q4', '1966q1',
'1966q2', '1966q3', '1966q4', '1967q1', '1967q2', '1967q3', '1967q4',
'1968q1', '1968q2', '1968q3', '1968q4', '1969q1', '1969q2', '1969q3',
'1969q4', '1970q1', '1970q2', '1970q3', '1970q4', '1971q1', '1971q2',
'1971q3', '1971q4', '1972q1', '1972q2', '1972q3', '1972q4', '1973q1',
'1973q2', '1973q3', '1973q4', '1974q1', '1974q2', '1974q3', '1974q4',
'1975q1', '1975q2', '1975q3', '1975q4', '1976q1', '1976q2', '1976q3',
'1976q4', '1977q1', '1977q2', '1977q3', '1977q4', '1978q1', '1978q2',
'1978q3', '1978q4', '1979q1', '1979q2', '1979q3', '1979q4', '1980q1',
'1980q2', '1980q3', '1980q4', '1981q1', '1981q2', '1981q3', '1981q4',
'1982q1', '1982q2', '1982q3', '1982q4', '1983q1', '1983q2', '1983q3',
'1983q4', '1984q1', '1984q2', '1984q3', '1984q4', '1985q1', '1985q2',
'1985q3', '1985q4', '1986q1', '1986q2', '1986q3', '1986q4', '1987q1',
'1987q2', '1987q3', '1987q4', '1988q1', '1988q2', '1988q3', '1988q4',
'1989q1', '1989q2', '1989q3', '1989q4', '1990q1', '1990q2', '1990q3',
'1990q4', '1991q1', '1991q2', '1991q3', '1991q4', '1992q1', '1992q2',
'1992q3', '1992q4', '1993q1', '1993q2', '1993q3', '1993q4', '1994q1',
'1994q2', '1994q3', '1994q4', '1995q1', '1995q2', '1995q3', '1995q4',
'1996q1', '1996q2', '1996q3', '1996q4', '1997q1', '1997q2', '1997q3',
'1997q4', '1998q1', '1998q2', '1998q3', '1998q4', '1999q1', '1999q2',
'1999q3', '1999q4', '2000q1', '2000q2', '2000q3', '2000q4', '2001q1',
'2001q2', '2001q3', '2001q4', '2002q1', '2002q2', '2002q3', '2002q4',
'2003q1', '2003q2', '2003q3', '2003q4', '2004q1', '2004q2', '2004q3',
'2004q4', '2005q1', '2005q2', '2005q3', '2005q4', '2006q1', '2006q2',
'2006q3', '2006q4', '2007q1', '2007q2', '2007q3', '2007q4', '2008q1',
'2008q2', '2008q3', '2008q4', '2009q1', '2009q2', '2009q3', '2009q4',
'2010q1', '2010q2', '2010q3', '2010q4', '2011q1', '2011q2', '2011q3',
'2011q4', '2012q1', '2012q2', '2012q3', '2012q4', '2013q1', '2013q2',
'2013q3', '2013q4', '2014q1', '2014q2', '2014q3', '2014q4', '2015q1',
'2015q2', '2015q3', '2015q4', '2016q1', '2016q2'])
df.columns = ['GDP in billions of chained 2009 dollars']
df.index.rename('quarter', inplace = True)
A recession period is defined as starting with two consecutive quarters of GDP decline, and ending with two consecutive quarters of GDP growth. The goal is to create a function 'get_recession_periods()' that returns all of the recession periods between 1947q1 and 2016q2. The output could a dataframe with two columns (start and end) or a list of tuples [(start and end), ...] with all the recession periods found.
Here is my try:
get_recession_periods()
lst_start = []
for i in range(0,len(df['GDP in billions of chained 2009 dollars'])-2):
if df['GDP in billions of chained 2009 dollars'][i] < df['GDP in billions of chained 2009 dollars'][i-1] and df['GDP in billions of chained 2009 dollars'][i+1] < df['GDP in billions of chained 2009 dollars'][i]:
lst_start.append(df.index[i])
start = lst_start[0]
lst_end = []
for j in range(df.index.get_loc(start),len(df['GDP in billions of chained 2009 dollars'])-2):
if df['GDP in billions of chained 2009 dollars'][j] > df['GDP in billions of chained 2009 dollars'][j-1] and df['GDP in billions of chained 2009 dollars'][j+1] > df['GDP in billions of chained 2009 dollars'][j]:
lst_end.append(df.index[j])
return (lst_start[0], lst_end[0])
But with the function above, I am only able to get the start and end quarter of the first recession in 1947.
Any idea?
This is probably overkill for this particular example... In a nutshell this is a bit more complicated than #zaq's answer but also much faster (about 9x here, and the difference would be much bigger on larger datasets) because it's vectorized instead of looped. But for this very small dataset here, clearly you would go with the simpler answer since even the slower way is fast enough. Finally, it stores the data in the dataframe itself rather than as a tuple (which could be an advantage or disadvantage, depending on the situation).
Thanks to #zaq for pointing out that I misread the question initially. I believe this now gives the same answer as zaq's except we have different implicit assumptions about the initial state of the world (beginning in recession or not) which is indeterminate in the data provided.
df['change'] = df.diff() # change in GDP from prior quarter
start = (df.change<0) & (df.change.shift(-1)<0) # potential start
end = (df.change>0) & (df.change.shift(-1)>0) # potential end
df['recess' ] = np.nan
df.loc[ start, 'recess' ] = -1
df.loc[ end, 'recess' ] = 1
df['recess'] = df.recess.ffill() # if the current row doesn't fit the
# definition of a start or end, then
# fill it with the prior row value
df['startend'] = np.nan
df.loc[ (df.recess==-1) & (df.recess.shift()== 1), 'startend'] = -1 # start
df.loc[ (df.recess== 1) & (df.recess.shift()==-1), 'startend'] = 1 # end
df[df.startend.notnull()]
GDP change recess startend
quarter
1947q4 1960.7 30.4 1.0 1.0
1949q1 2007.5 -27.8 -1.0 -1.0
1950q1 2084.6 79.9 1.0 1.0
1953q3 2578.9 -14.6 -1.0 -1.0
1954q2 2530.7 2.7 1.0 1.0
1957q4 2846.4 -29.5 -1.0 -1.0
1958q2 2790.9 18.2 1.0 1.0
1969q4 4715.5 -20.6 -1.0 -1.0
1970q2 4715.4 8.3 1.0 1.0
1974q3 5378.7 -52.6 -1.0 -1.0
1975q2 5333.2 40.8 1.0 1.0
1980q2 6392.6 -132.3 -1.0 -1.0
1980q4 6501.2 118.3 1.0 1.0
1981q4 6585.1 -77.8 -1.0 -1.0
1982q4 6493.1 6.3 1.0 1.0
1990q4 8907.4 -76.5 -1.0 -1.0
1991q2 8934.4 68.8 1.0 1.0
2008q3 14891.6 -71.8 -1.0 -1.0
2009q3 14402.5 46.9 1.0 1.0
One issue with your code is that you are not tracking the current status of the economy. If there are 20 consecutive quarters of GDP decline, your code will report 18 recessions beginning. And if there are 20 quarters of growth, it will report 18 recessions ending even if there wasn't one to begin with.
So, I introduce a Boolean recession to indicate whether we are in recession currently. Other changes: chained inequalities like a < b < c work in Python as expected, and improve readability; also, your column name is so verbose that I used positional indexing iloc instead, to have readable conditions in if-statements.
lst_start = []
lst_end = []
recession = False
for i in range(1, len(df)-1):
if not recession and (df.iloc[i-1, 0] > df.iloc[i, 0] > df.iloc[i+1, 0]):
recession = True
lst_start.append(df.index[i])
elif recession and (df.iloc[i-1, 0] < df.iloc[i, 0] < df.iloc[i+1, 0]):
recession = False
lst_end.append(df.index[i])
print(list(zip(lst_start, lst_end)))

python - cannot make corr work

I'm struggling with getting a simple correlation done. I've tried all that was suggested under similar questions.
Here are the relevant parts of the code, the various attempts I've made and their results.
import numpy as np
import pandas as pd
try01 = data[['ESA Index_close_px', 'CCMP Index_close_px' ]].corr(method='pearson')
print (try01)
Out:
Empty DataFrame
Columns: []
Index: []
try04 = data['ESA Index_close_px'][5:50].corr(data['CCMP Index_close_px'][5:50])
print (try04)
Out:
**AttributeError: 'float' object has no attribute 'sqrt'**
using numpy
try05 = np.corrcoef(data['ESA Index_close_px'],data['CCMP Index_close_px'])
print (try05)
Out:
AttributeError: 'float' object has no attribute 'sqrt'
converting the columns to lists
ESA_Index_close_px_list = list()
start_value = 1
end_value = len (data['ESA Index_close_px']) +1
for items in data['ESA Index_close_px']:
ESA_Index_close_px_list.append(items)
start_value = start_value+1
if start_value == end_value:
break
else:
continue
CCMP_Index_close_px_list = list()
start_value = 1
end_value = len (data['CCMP Index_close_px']) +1
for items in data['CCMP Index_close_px']:
CCMP_Index_close_px_list.append(items)
start_value = start_value+1
if start_value == end_value:
break
else:
continue
try06 = np.corrcoef(['ESA_Index_close_px_list','CCMP_Index_close_px_list'])
print (try06)
Out:
****TypeError: cannot perform reduce with flexible type****
Also tried .astype but not made any difference.
data['ESA Index_close_px'].astype(float)
data['CCMP Index_close_px'].astype(float)
Using Python 3.5, pandas 0.18.1 and numpy 1.11.1
Would really appreciate any suggestion.
**edit1:*
Data is coming from an excel spreadsheet
data = pd.read_excel('C:\\Users\\Ako\\Desktop\\ako_files\\for_corr_‌​tool.xlsx') prior to the correlation attempts, there are only column renames and
data = data.drop(data.index[0])
to get rid of a line
regarding the types:
print (type (data['ESA Index_close_px']))
print (type (data['ESA Index_close_px'][1]))
Out:
**edit2*
parts of the data:
print (data['ESA Index_close_px'][1:10])
print (data['CCMP Index_close_px'][1:10])
Out:
2 2137
3 2138
4 2132
5 2123
6 2127
7 2126.25
8 2131.5
9 2134.5
10 2159
Name: ESA Index_close_px, dtype: object
2 5241.83
3 5246.41
4 5243.84
5 5199.82
6 5214.16
7 5213.33
8 5239.02
9 5246.79
10 5328.67
Name: CCMP Index_close_px, dtype: object
Well, I've encountered the same problem today.
try use .astype('float64') to help make the type correct.
data['ESA Index_close_px'][5:50].astype('float64').corr(data['CCMP Index_close_px'][5:50].astype('float64'))
This works well for me. Hope it can help you as well.
You can try as following:
Top15['Citable docs per capita']=(Top15['Citable docs per capita']*100000)
Top15['Citable docs per capita'].astype('int').corr(Top15['Energy Supply per Capita'].astype('int'))
It worked for me.

Resources