Read a NASDAQ HTML table to a Dataframe - python-3.x

I get the most recent list of traded companies from NASDAQ with this code, however I'd like to have the results shown in a dataframe instead of just a printed list with all the other info I might not need.
Any ideas on how that could be achieved? Thanks
Parsing latest NASDAQ company
from bs4 import BeautifulSoup
import requests

r = requests.get('https://www.nasdaq.com/screening/companies-by-industry.aspx?'
                 'exchange=NASDAQ&sortname=marketcap&sorttype=1&pagesize=4000')
data = r.text
soup = BeautifulSoup(data, "html.parser")
table = soup.find("table", {"id": "CompanylistResults"})
for row in table.findAll("tr"):
    for cell in row("td"):
        print(cell.get_text().strip())

Looks like you are looking for the aptly named read_html, though you need to play around until you get what you want. In your case:
>>> import pandas as pd
>>> df=pd.read_html(table.prettify(),flavor='bs4')[0]
>>> df.columns = [c.strip() for c in df.columns]
See output below.
The first line is what gets the job done, and the second one just strips off all the pesky spaces and newlines in your headers. There is also a hidden ADR TSO column which seems useless, so you could drop it if you do not know what it is. It may also make sense to drop all even rows, as they are just a continuation of the odd rows containing useless links as far as I can tell. In a single line each:
>>> df = df.drop(['ADR TSO'], axis=1)  # Drop useless column
>>> df1 = df[::2]  # To get rid of even rows
>>> df2 = df[~df['Name'].str.contains('Stock Quote')].head()  # By string filtration if we are not sure about the odd/even thing
Output of original head just for show:
>>> df.head()
Name Symbol Market Cap \
0 Amazon.com, Inc. AMZN $802.18B
1 AMZN Stock Quote AMZN Ratings AMZN Stock Report NaN NaN
2 Microsoft Corporation MSFT $789.12B
3 MSFT Stock Quote MSFT Ratings MSFT Stock Report NaN NaN
4 Alphabet Inc. GOOGL $740.3B
ADR TSO Country IPO Year \
0 NaN United States 1997
1 NaN NaN NaN
2 NaN United States 1986
3 NaN NaN NaN
4 NaN United States n/a
Subsector
0 Catalog/Specialty Distribution
1 NaN
2 Computer Software: Prepackaged Software
3 NaN
4 Computer Software: Programming, Data Processing
Output of cleaned df.head():
Name Symbol Market Cap Country IPO Year \
0 Amazon.com, Inc. AMZN $802.18B United States 1997
2 Microsoft Corporation MSFT $789.12B United States 1986
4 Alphabet Inc. GOOGL $740.3B United States n/a
6 Alphabet Inc. GOOG $735.24B United States 2004
8 Apple Inc. AAPL $720.3B United States 1980
Subsector
0 Catalog/Specialty Distribution
2 Computer Software: Prepackaged Software
4 Computer Software: Programming, Data Processing
6 Computer Software: Programming, Data Processing
8 Computer Manufacturing
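The cleanup steps above can be sketched end to end on a toy frame; the padded headers and "Stock Quote" link rows below mirror the output shown, not a live fetch of the NASDAQ page:

```python
import pandas as pd

# Toy frame mimicking the scraped table: data rows interleaved with
# "Stock Quote" link rows, plus padded headers and an empty ADR TSO column.
df = pd.DataFrame({
    "Name  ": ["Amazon.com, Inc.", "AMZN Stock Quote AMZN Ratings",
               "Microsoft Corporation", "MSFT Stock Quote MSFT Ratings"],
    " Symbol ": ["AMZN", "AMZN", "MSFT", "MSFT"],
    "ADR TSO": [None] * 4,
})

df.columns = [c.strip() for c in df.columns]      # strip padding from headers
df = df.drop(["ADR TSO"], axis=1)                 # drop the empty column
df = df[~df["Name"].str.contains("Stock Quote")]  # remove the link rows
print(df)
```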

Related

Map Pandas Series Containing key/value pairs to a new columns with data

I have a dataframe containing a pandas series (column 2) as below:
| column 1 | column 2                                                                                                 | column 3 |
|----------|----------------------------------------------------------------------------------------------------------|----------|
| 1123     | Requested By = John Doe 1\n Requested On = 12 October 2021\n Comments = This is a generic request        | INC29192 |
| 1251     | NaN                                                                                                      | INC18217 |
| 1918     | Requested By = John Doe 2\n Requested On = 2 September 2021\n Comments = This is another generic request | INC19281 |
I'm struggling to extract, split and map column 2 data to a series of new column names with the appropriate data for that record (where possible, that is where there is data available as I have NaNs).
The desired output is something like this (where I've dropped the column 2 data for legibility):
| column 1 | column 3 | Requested By | Requested On     | Comments                        |
|----------|----------|--------------|------------------|---------------------------------|
| 1123     | INC29192 | John Doe 1   | 12 October 2021  | This is a generic request       |
| 1251     | INC18217 | NaN          | NaN              | NaN                             |
| 1918     | INC19281 | John Doe 2   | 2 September 2021 | This is another generic request |
I have spent quite some time trying various approaches, from lambda functions to comprehensions to explode methods, but haven't quite found a solution that provides the desired output.
First I would convert column 2 values to dictionaries and then convert them to Dataframes and join them to your df:
df['column 2'] = df['column 2'].apply(
    lambda x: {y.split(' = ', 1)[0]: y.split(' = ', 1)[1]
               for y in x.split(r'\n ')}
    if not pd.isna(x) else {})
df = df.join(pd.DataFrame(df['column 2'].values.tolist())).drop('column 2', axis=1)
print(df)
Output:
column 1 column 3 Requested By Requested On Comments
0 1123 INC29192 John Doe 1 12 October 2021 This is a generic request
1 1251 INC18217 NaN NaN NaN
2 1918 INC19281 John Doe 2 2 September 2021 This is another generic request
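A self-contained sketch of the same dictionary approach, assuming the separators are a literal ' = ' and a literal '\n ' as shown in the sample data:

```python
import pandas as pd

# Sample data with literal "\n " separators, per the table in the question.
df = pd.DataFrame({
    "column 1": [1123, 1251, 1918],
    "column 2": [
        r"Requested By = John Doe 1\n Requested On = 12 October 2021\n Comments = This is a generic request",
        None,
        r"Requested By = John Doe 2\n Requested On = 2 September 2021\n Comments = This is another generic request",
    ],
    "column 3": ["INC29192", "INC18217", "INC19281"],
})

def to_dict(cell):
    # Split each "key = value" pair into a dict entry; NaN becomes an empty dict.
    if pd.isna(cell):
        return {}
    return dict(pair.split(" = ", 1) for pair in cell.split(r"\n "))

# Expand the dicts into columns and join back on the index.
expanded = pd.DataFrame(df["column 2"].apply(to_dict).tolist())
out = df.drop("column 2", axis=1).join(expanded)
print(out)
```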

How to display latitude and longitude as two separate columns from current single with no separator?

I have a data frame in Pandas with a column labeled “Location.”
The column is an object data type and is in the following format within the column:
Location
——————————————————————————————
POINT (-73.525969 41.081897)
I’d like to remove the formatting and store each of the data points in two columns: Latitude and Longitude, which would have to be created. How would I accomplish this?
Similar posts I've reviewed always have delimiters between the numbers (such as a comma), but this doesn't have one.
Thank you!
|   | Serial Number | List Year | Date Recorded | Town | Address | Assessed Value | Sale Amount | Sales Ratio | Property Type | Residential Type | Non Use Code | Assessor Remarks | OPM remarks | Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 141466 | 2014 | 2015-08-06 | Stamford | 83 OVERBROOK DRIVE | 503270.0 | 850000.0 | 0.592082 | Residential | Single Family | NaN | NaN | NaN | POINT (-73.525969 41.081897) |
| 1 | 140604 | 2014 | 2015-06-29 | New Haven | 56 HIGHVIEW LANE | 86030.0 | 149900.0 | 0.573916 | Residential | Single Family | NaN | NaN | NaN | POINT (-72.878115 41.30285) |
| 2 | 14340 | 2014 | 2015-07-01 | Ridgefield | 32 OVERLOOK DR | 351880.0 | 570000.0 | 0.617333 | Residential | Single Family | NaN | NaN | NaN | POINT (-73.508273 41.286223) |
| 3 | 140455 | 2014 | 2015-04-30 | New Britain | 171 BRADFORD WALK | 204680.0 | 261000.0 | 0.784215 | Residential | Condo | NaN | NaN | NaN | POINT (-72.775592 41.713335) |
| 4 | 141195 | 2014 | 2015-06-26 | Stamford | 555 GLENBROOK ROAD | 229330.0 | 250000.0 | 0.917320 | Residential | Single Family | NaN | NaN | NaN | POINT (-73.519774 41.07203) |
You can use pandas.Series.str.extract:
df['Location'].str.extract(r'POINT \((?P<longitude>[-\d.]+)\s+(?P<latitude>[-\d.]+)\)').astype(float)
Note that in the POINT (x y) format the first number is the longitude, not the latitude. You might need to change the \s+ separator depending on your real input.
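A runnable sketch on a minimal frame with the same POINT format as the question:

```python
import pandas as pd

# Minimal frame using the same POINT format as in the question.
df = pd.DataFrame({"Location": ["POINT (-73.525969 41.081897)",
                                "POINT (-72.878115 41.30285)"]})

# In "POINT (x y)" the first number is the longitude (x) and the second
# the latitude (y); extract both with named groups and cast to float.
coords = df["Location"].str.extract(
    r"POINT \((?P<Longitude>[-\d.]+)\s+(?P<Latitude>[-\d.]+)\)"
).astype(float)
df = df.join(coords)
print(df[["Latitude", "Longitude"]])
```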

Is it possible to set a custom delimiter/separator for a specific column when cleaning data with pandas?

My dataset is a .txt file separated by colons (:). One of the columns contains a date AND time, the date is separated by backslash (/) which is fine. However, the time is separated by colons (:) just like the rest of the data which throws off my method for cleaning the data.
Example of a couple of lines of the dataset:
USA:Houston, Texas:05/06/2020 12:00:00 AM:car
Japan:Tokyo:05/06/2020 11:05:10 PM:motorcycle
USA:Houston, Texas:12/15/2020 12:00:10 PM:car
Japan:Kyoto:01/04/1999 05:30:00 PM:bicycle
I'd like to clean the dataset before loading it into a dataframe in python using pandas. How do I separate the columns? I can't use
df = pandas.read_csv('example.txt', sep=':', header=None)
because that will separate the time data into different columns. Any ideas?
You can concatenate the columns back:
import pandas as pd

df = pd.read_csv("example.txt", sep=":", header=None)
df[6] = pd.to_datetime(
    df[2].astype(str) + ":" + df[3].astype(str) + ":" + df[4].astype(str)
)
df = df[[0, 1, 6, 5]].rename(
    columns={0: "State", 1: "City", 6: "Time", 5: "Type"}
)
print(df)
print(df)
Prints:
State City Time Type
0 USA Houston, Texas 2020-05-06 00:00:00 car
1 Japan Tokyo 2020-05-06 23:05:10 motorcycle
2 USA Houston, Texas 2020-12-15 12:00:10 car
3 Japan Kyoto 1999-01-04 17:30:00 bicycle
pd.read_csv(file, sep=r"(?<!\d):", header=None, engine="python")
The regex splits on a colon only when it is not preceded by a digit (the negative lookbehind (?<!\d)). The colons inside the time are always preceded by a digit, so they are left alone, while the separators after Texas, Tokyo, AM, etc. follow a letter and get split. A regex separator requires the python engine, which pandas would otherwise fall back to with a warning.
I get:
0 1 2 3
0 USA Houston, Texas 05/06/2020 12:00:00 AM car
1 Japan Tokyo 05/06/2020 11:05:10 PM motorcycle
2 USA Houston, Texas 12/15/2020 12:00:10 PM car
3 Japan Kyoto 01/04/1999 05:30:00 PM bicycle
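The lookbehind split can be checked on the sample lines with an inline buffer:

```python
import io
import pandas as pd

data = """USA:Houston, Texas:05/06/2020 12:00:00 AM:car
Japan:Tokyo:05/06/2020 11:05:10 PM:motorcycle
USA:Houston, Texas:12/15/2020 12:00:10 PM:car
Japan:Kyoto:01/04/1999 05:30:00 PM:bicycle"""

# Split only on colons that are not preceded by a digit, so the colons
# inside the time are left alone; a regex sep needs the python engine.
df = pd.read_csv(io.StringIO(data), sep=r"(?<!\d):", header=None, engine="python")
print(df)
```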

Unable to convert text format to proper data frame using Pandas

I am reading a text source from URL = 'https://www.census.gov/construction/bps/txt/tb2u201901.txt'
Here I used pandas to convert it into a dataframe:
df = pd.read_csv(URL, sep = '\t')
After exporting the df I see all the columns are merged into a single column in spite of giving the separator as '\t'. How do I solve this issue?
As your file is not a CSV file, you should use the function read_fwf() from pandas, because your columns have a fixed width. You also need to skip the first 12 lines, which are not part of your data, and remove the empty lines with dropna().
df = pd.read_fwf(URL, skiprows=12)
df.dropna(inplace=True)
df.head()
United States 94439 58086 1600 1457 33296 1263
1 Northeast 9099.0 3330.0 272.0 242.0 5255.0 242.0
2 New England 1932.0 1079.0 90.0 72.0 691.0 46.0
3 Connecticut 278.0 202.0 8.0 3.0 65.0 8.0
4 Maine 357.0 222.0 6.0 0.0 129.0 5.0
5 Massachusetts 819.0 429.0 38.0 54.0 298.0 23.0
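read_fwf's width inference can be seen on a small made-up sample with the same shape as the census file (a couple of preamble lines, then position-aligned columns):

```python
import io
import pandas as pd

# Made-up fixed-width sample: two preamble lines to skip, then rows whose
# columns are aligned by position rather than separated by a delimiter.
text = """Table 2. Units by region
(illustrative numbers only)
United States    94439    58086
  Northeast       9099     3330
  New England     1932     1079
"""

# read_fwf infers the column boundaries from the runs of spaces shared
# by every row; skiprows drops the preamble before parsing.
df = pd.read_fwf(io.StringIO(text), skiprows=2, header=None)
df = df.dropna()
print(df)
```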
Your output is coming out correct. If you open the URL, you will see that there are sentences written at the top which are not tab separated, so pandas is not able to present them correctly.
From line number 9 onward the results are correct.
(screenshot of the raw text file showing the non-tabular header lines)

How to access a column of grouped data to perform linear regression in pandas?

I want to perform a linear regression on groups of a grouped data frame in pandas. The function I am calling throws a KeyError that I cannot resolve.
I have an environmental data set called dat that includes concentration data of a chemical in different tree species of various age classes at different country sites over the course of several time steps. I now want to do a regression of concentration over time steps within each group of (site, species, age).
This is my code:
```
import pandas as pd
import statsmodels.api as sm
dat = pd.read_csv('data.csv')
dat.head(15)
SampleName Concentration Site Species Age Time_steps
0 batch1 2.18 Germany pine 1 1
1 batch2 5.19 Germany pine 1 2
2 batch3 11.52 Germany pine 1 3
3 batch4 16.64 Norway spruce 0 1
4 batch5 25.30 Norway spruce 0 2
5 batch6 31.20 Norway spruce 0 3
6 batch7 12.63 Norway spruce 1 1
7 batch8 18.70 Norway spruce 1 2
8 batch9 43.91 Norway spruce 1 3
9 batch10 9.41 Sweden birch 0 1
10 batch11 11.10 Sweden birch 0 2
11 batch12 15.73 Sweden birch 0 3
12 batch13 16.87 Switzerland beech 0 1
13 batch14 22.64 Switzerland beech 0 2
14 batch15 29.75 Switzerland beech 0 3
def ols_res_grouped(group):
    xcols_const = sm.add_constant(group['Time_steps'])
    linmod = sm.OLS(group['Concentration'], xcols_const).fit()
    return linmod.params[1]

grouped = dat.groupby(['Site','Species','Age']).agg(ols_res_grouped)
```
I want to get the regression coefficient of concentration data over Time_steps but get a KeyError: 'Time_steps'. How can the sm method access group["Time_steps"]?
According to the pandas documentation, agg applies functions to each column independently.
It might be possible to use NamedAgg but I am not sure.
I think it is a lot easier to just use a for loop for this:
for _, group in dat.groupby(['Site','Species','Age']):
    coeff = ols_res_grouped(group)
    # if you want to put the coeff inside the dataframe
    dat.loc[group.index, 'coeff'] = coeff
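As an alternative to the loop, the per-group slope can also be computed with groupby().apply. This sketch swaps in numpy.polyfit for statsmodels (with a single regressor, its degree-1 slope equals the OLS coefficient) on a few rows of the sample data:

```python
import numpy as np
import pandas as pd

# A few rows of the sample data from the question.
dat = pd.DataFrame({
    "Concentration": [2.18, 5.19, 11.52, 16.64, 25.30, 31.20],
    "Site": ["Germany"] * 3 + ["Norway"] * 3,
    "Species": ["pine"] * 3 + ["spruce"] * 3,
    "Age": [1, 1, 1, 0, 0, 0],
    "Time_steps": [1, 2, 3, 1, 2, 3],
})

def slope(group):
    # polyfit(x, y, 1) returns (slope, intercept); with one regressor
    # the slope is the same coefficient sm.OLS would report.
    return np.polyfit(group["Time_steps"], group["Concentration"], 1)[0]

# apply passes each sub-frame to slope, so group["Time_steps"] works here.
slopes = dat.groupby(["Site", "Species", "Age"]).apply(slope)
print(slopes)
```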
