Web Scraping - stock prices, quandl - python-3.x

I have a quick question. My code looks like this:
import quandl

names_of_company = ['KGHM','INDYKPOL','KRUK','KRUSZWICA']
for names in names_of_company:
    x = quandl.get('WSE/{names_of_company}', start_date='2018-11-26',
                   end_date='2018-11-29')
I am trying to get all the data in one output, but I can't work out how to substitute each company's name one after another. Do you have any ideas?
Thanks for the help.

Unless I'm missing something, it looks like you should just be able to do a pretty basic for loop; it was only the syntax that was incorrect.
import quandl
import pandas as pd

names_of_company = ['KGHM','INDYKPOL','KRUK','KRUSZWICA']
results = pd.DataFrame()
for names in names_of_company:
    x = quandl.get('WSE/%s' % names, start_date='2018-11-26',
                   end_date='2018-11-29')
    x['company'] = names
    results = results.append(x).reset_index(drop=True)
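As a side note, DataFrame.append has since been deprecated (and removed in pandas 2.0), so a minimal sketch of the same loop built around pd.concat, using the same tickers and dates as above, might look like this:
import quandl
import pandas as pd

names_of_company = ['KGHM', 'INDYKPOL', 'KRUK', 'KRUSZWICA']
frames = []
for name in names_of_company:
    # Fetch one company's quotes and tag the rows with its name.
    x = quandl.get('WSE/%s' % name, start_date='2018-11-26',
                   end_date='2018-11-29')
    x['company'] = name
    frames.append(x)
# Stitch everything together in a single DataFrame at the end.
results = pd.concat(frames).reset_index(drop=True)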

Related

Use pandas to make sense of malformed Excel data

My job has me doing some data analysis, and the exported spreadsheet given to me (the ONLY way it can be provided) has data that looks like this:
But what I need it to look like, ideally, would be something like this:
I've tried some other code, but to be honest I made a mangled mess of it and threw it away, as I only succeeded in jumbling the data further. I've done several other pandas projects where I was able to sort and make sense of the data, but there the data had a consistent structure and was easier to work with. At this point I just don't have the logic for how to go about fixing this data. I would do it manually, but it's over 48k lines. Any help you may be able to provide would be greatly appreciated.
Edit: This is what the data looks like if we 'delete blanks and shift up':
Try this:
import pandas as pd

df = pd.read_excel('your_excel_file.xlsx')
for col in df.columns[-4:]:
    if col == 'Subscription Name':
        df[col] = df[col].shift(-1)
    elif col == 'Resource Group':
        df[col] = df[col].shift(-2)
    else:
        df[col] = df[col].shift(-3)
out = df.ffill().drop_duplicates().reset_index(drop=True)
display(out)
Edit:
You can also use:
out = df[df['Resource Name'].notna()].ffill()
Or, for better efficiency (as per @Vladimir Fokow):
out = df.dropna(how='all').ffill()
instead of:
out = df.ffill().drop_duplicates().reset_index(drop=True)
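To see why the shift-then-drop approach works, here is a minimal sketch on made-up data shaped like the question's export, where each record spills across several rows with one populated cell per extra row (the column names follow the answer above; the values are invented for illustration):
import pandas as pd

# Toy frame: each logical record occupies 3 rows, with each extra row
# holding exactly one value, as in the malformed export.
df = pd.DataFrame({
    'Resource Name':     ['vm-01', None, None, 'vm-02', None, None],
    'Subscription Name': [None, 'sub-a', None, None, 'sub-b', None],
    'Resource Group':    [None, None, 'rg-1', None, None, 'rg-2'],
})

# Shift each staggered column up so all values land on the record's row.
df['Subscription Name'] = df['Subscription Name'].shift(-1)
df['Resource Group'] = df['Resource Group'].shift(-2)

# Drop the now fully empty filler rows.
out = df.dropna(how='all')
print(out)  # one complete row per record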

How to get the intercept from the model summary in Python linearmodels?

I am running a panel regression using Python linearmodels, something like:
import pandas as pd
from linearmodels.panel import PanelOLS

data = pd.read_csv('data.csv', sep=',')
data = data.set_index(['panel_id', 'date'])
controls = data[['A','B','C']]
controls['const'] = 1
model = PanelOLS(data.Y, controls, entity_effects=True)
result = model.fit(use_lsdv=True)
I really need to pull out the coefficient on the constant, but it looks like this does not work:
intercept = result.summary.const
I could not really find the answer in
linearmodels' documentation on GitHub.
More generally, does anyone know how to pull out the estimated coefficients from the linearmodels summary? Thank you!
result.params['const']
would give the intercept; in general, result.params gives the Series of regression coefficients in linearmodels.
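A minimal end-to-end sketch (the file and column names are assumptions carried over from the question); statsmodels' add_constant is one common way to build the constant column:
import pandas as pd
import statsmodels.api as sm
from linearmodels.panel import PanelOLS

data = pd.read_csv('data.csv').set_index(['panel_id', 'date'])
controls = sm.add_constant(data[['A', 'B', 'C']])  # adds a 'const' column of 1s

result = PanelOLS(data.Y, controls, entity_effects=True).fit(use_lsdv=True)
print(result.params)                  # pandas Series of all coefficients
intercept = result.params['const']    # just the constant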

How can I do something similar to VLOOKUP in Excel with urlparse?

I need to compare two sets of data from CSVs: one (csv1) with a column 'listing_url', the other (csv2) with columns 'parsed_url' and 'url_code'. I would like to use the result of urlparse on csv1 (specifically the netloc) to compare against csv2's 'parsed_url' and output the matching value from 'url_code' to a CSV.
from urllib.parse import urlparse
import re, pandas as pd

scr = pd.read_csv('csv2', squeeze=True, usecols=['parsed_url','url_code'])[['parsed_url','url_code']]
data = pd.read_csv('csv1')
L = data.values.T[0].tolist()
T = pd.Series([scr])
for i in L:
    n = urlparse(i)
    nf = pd.Series([(n.netloc)])
I'm stuck trying to convert the data into objects I can use map with, if that's even the best thing to use; I don't know.
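A minimal sketch of the VLOOKUP-style lookup with Series.map, assuming the file and column names given in the question (everything else here is an assumption):
from urllib.parse import urlparse
import pandas as pd

lookup = pd.read_csv('csv2', usecols=['parsed_url', 'url_code'])
data = pd.read_csv('csv1', usecols=['listing_url'])

# Extract the netloc from each listing URL, then map it against the
# lookup table, much like VLOOKUP matching on 'parsed_url'.
data['netloc'] = data['listing_url'].apply(lambda u: urlparse(u).netloc)
mapping = lookup.set_index('parsed_url')['url_code']
data['url_code'] = data['netloc'].map(mapping)
data.to_csv('matched.csv', index=False)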

How to use pandas to organize sales data into 12 months and find the 10 most profitable products for those 12 months?

I need to write a program that organizes the data in the provided spreadsheet to find the top 10 most profitable products for each month. The program needs to take an input from the user to specify the year for which to compile the data.
I've gotten as far as printing all of the products sold in each month, ordered by profitability, but I don't know how to make it print only the top 10 for each month.
I'm also lost on how to take an input from the user to select only a certain year for the program to compile the data.
Please help.
the link to download the files for my project: https://drive.google.com/drive/folders/1VkzTWydV7Qae7hOn6WUjDQutQGmhRaDH?usp=sharing
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
xl = pd.ExcelFile("SalesDataFull.xlsx")
OrdersOnlyData = xl.parse("Orders")
df_year = OrdersOnlyData["Order Date"].dt.year
OrdersOnlyData["Year"] = df_year
df_month = OrdersOnlyData["Order Date"].dt.month
OrdersOnlyData["Month"] = df_month
dataframe = OrdersOnlyData[["Year","Month","Product Name","Profit"]]
month_profit = dataframe.groupby(["Year","Month","Product Name"]).Profit.sum().sort_values(ascending=False)
month_profit = month_profit.reset_index()
month_profit = month_profit.sort_values(["Year","Month","Profit"],ascending=[True,True,False])
print(month_profit)
As @Franco pointed out, it is difficult to recommend a proper solution since you did not provide a data sample with your question. In any case, the function you are looking for is most likely nth().
This is probably how it should look:
month_profit = (month_profit.sort_values('Profit', ascending=False)
                            .groupby(['Year', 'Month'])
                            .nth(list(range(10)))
                            .sort_values(by=['Year', 'Month', 'Profit'],
                                         ascending=[True, True, False]))
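As for the user-supplied year, a minimal sketch reusing the month_profit frame built above could look like this (head(10) is an equivalent alternative to nth here):
# Ask the user which year to compile, then keep the 10 most
# profitable products per month of that year.
year = int(input("Enter a year to compile: "))
yearly = month_profit[month_profit["Year"] == year]
top10 = (yearly.sort_values("Profit", ascending=False)
               .groupby("Month")
               .head(10)
               .sort_values(["Month", "Profit"], ascending=[True, False]))
print(top10)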

Adding numerical values from dict to a new column in a Pandas DataFrame

I am practicing machine learning and working with a movie/rating dataset. I am trying to create a new column in the dataframe which numerically identifies each genre (around 1,300 of them). My logic was to create a dictionary of the unique genres, labeling each with an integer, then write a for loop that iterates through each row of the dataframe, checks its genre, and assigns the appropriate value to a new column named "genre_id". However, this has been causing an infinite loop which I cannot break even with Ctrl-C. I have the same issue in Jupyter (Interrupt Kernel fails to stop it). Below is a summarized version of my approach.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

movies_data = pd.read_csv("C://mypython/moviedata/movies.csv")
ratings_data = pd.read_csv("C://mypython/moviedata/ratings.csv")
joined = pd.merge(movies_data, ratings_data, how='inner', on=['movieId'])
print(joined.head())
pd.options.display.float_format = '{:,.2f}'.format

genres = joined['genres'].unique()
genre_dict = {}
Id = 1
for i in genres:
    genre_dict[i] = Id
    Id += 1

joined['genre_id'] = 0
increment = 0
for i in joined['genres']:
    if i in genre_dict:
        joined['genre_id'][increment] = genre_dict[i]
    increment += 1
I know I should probably take a smaller sample to work with, as there are about 20,000,000 rows in the dataset, but I figured I'd try this as an exercise.
I also receive the SettingWithCopyWarning, though this hasn't caused me issues on my other projects in the past. Any thoughts on how to do this would be greatly appreciated.
EDIT: Found a solution using the Series map feature.
joined['genre_id'] = joined.genres.map(genre_dict)
I don't have enough reputation to comment, so this is a suggestion instead: the standard procedure for handling categorical values in a dataset is one-hot encoding. You can use the built-in sklearn.preprocessing.OneHotEncoder, which does the work you wanted to do.
For a better understanding with examples, check One Hot Encode Sequence Data in Python. Let me know if this works for you.
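A minimal sketch of what that could look like on the joined frame from the question (the column name is carried over from the question's code; the encoder call itself is standard scikit-learn):
from sklearn.preprocessing import OneHotEncoder

# Fit on the genres column; the input must be 2-D, hence the double brackets.
encoder = OneHotEncoder(handle_unknown='ignore')
genre_matrix = encoder.fit_transform(joined[['genres']])  # sparse matrix

print(genre_matrix.shape)          # (n_rows, n_unique_genres)
print(encoder.categories_[0][:5])  # first few genre labels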
