Scraping and creating new df - python-3.x

I'm basically trying to create a dataframe using the following code:
import requests
from bs4 import BeautifulSoup

info_list = []
data_list = []
mini_exc = ['CLFAR', 'CLFLE', 'CLHOL', 'CLCAN', 'CLCLE']
for exc in mini_exc:
    grab_page = requests.get(f"http://availability.samknows.com/broadband/exchange/{exc}")
    soup = BeautifulSoup(grab_page.content, 'html.parser')
    warning = soup.findAll('div', class_='item-content')
    for x in warning:
        for y in x.findAll('th'):
            info_list.append(y.text)
        for z in x.findAll('td'):
            data_list.append(z.text)
Basically, I would like to have a dataframe with the elements of info_list as column names and the elements of data_list as rows under the corresponding columns.
As you can see, I obtained a dataframe with the correct info data, but I could not add the new columns. I know that:
for y in x.findAll('th'):
    info_list.append(y.text)
should be outside the loop because I only need it once, but I put it there so you could see the column names.
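To illustrate the shape I'm after, here is a minimal sketch (assuming every exchange page repeats the same headers, so the chunk size below is a guess):
import pandas as pd

# Hypothetical reshaping: the headers repeat once per page, so chunk
# data_list into one row per exchange and take one set of headers as columns.
ncols = len(info_list) // len(mini_exc)
rows = [data_list[i:i + ncols] for i in range(0, len(data_list), ncols)]
df = pd.DataFrame(rows, columns=info_list[:ncols])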

That's what I did in the end, guys: I created a DataFrame from a list with the info values I needed, and afterwards I ran this code:
for exc in ['CLFAR', 'CLFLE', 'CLHOL', 'CLCAN', 'CLCLE']:
    info_list = []
    grab_page = requests.get(f"http://availability.samknows.com/broadband/exchange/{exc}")
    soup = BeautifulSoup(grab_page.content, 'html.parser')
    warning = soup.findAll('div', class_='item-content')
    for x in warning:
        for z in x.findAll('td'):
            info_list.append(z.text)
    exch_only_serv[exc] = info_list
So each pass of the loop built a list with the values for one exchange and then added it to the DataFrame as a new column.
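For completeness, a minimal sketch of how the initial DataFrame could be created first (header_texts is a hypothetical name for the list of <th> texts scraped once):
import pandas as pd

# Hypothetical setup: index the DataFrame by the scraped header texts so that
# each per-exchange list of <td> values lines up as a new column.
exch_only_serv = pd.DataFrame(index=header_texts)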

Related

Fill csv data lists with for loop

I am manipulating .csv files. I have to loop through each column of numeric data in the file and enter them into different lists. The code I have is the following:
import csv

salto_linea = "\n"
csv_file = "02_CSV_data1.csv"
with open(csv_file, 'r') as csv_doc:
    doc_reader = csv.reader(csv_doc, delimiter=",")
    mpg = []
    cylinders = []
    displacement = []
    horsepower = []
    weight = []
    acceleration = []
    year = []
    origin = []
    lt = [mpg, cylinders, displacement, horsepower,
          weight, acceleration, year, origin]
    for i, ln in zip(range(0, 9), lt):
        print(f"{i} -> {ln}")
        for row in doc_reader:
            y = row[i]
            ln.append(y)
In the outer loop, I try to have range() serve as an index so that the nested loop walks one column (the i-th element of each row in the CSV) and feeds it into the matching list in lt. My idea was that the outer loop would then advance to i = 1 and the nested loop would traverse the next column, and so on. What actually happens is that the first pass of the nested loop consumes the whole CSV reader, so when the outer loop advances there are no rows left to read, and only the first list gets filled. I also tried a while loop with a counter as the index, but that didn't work either.
How can I fill the sublists in lt with the data inside the CSV file?
Without seeing the contents of the CSV file itself, the best way of reading the data into a table is with the pandas module, which can be done in one line of code:
import pandas as pd
df = pd.read_csv('02_CSV_data1.csv')
This reads all the data into a dataframe that you can work with directly. Alternatively, amend the for loop like this:
for row in doc_reader:
    for i, ln in enumerate(lt):
        ln.append(row[i])
For bigger data I would prefer pandas, which has vectorised methods.
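If you still need the separate lists afterwards, they can be pulled out of the dataframe column by column (a sketch; the column names here are assumptions, since the CSV's header row isn't shown):
import pandas as pd

df = pd.read_csv('02_CSV_data1.csv')

# Hypothetical column names; adjust to whatever the CSV header actually contains.
mpg = df['mpg'].tolist()
cylinders = df['cylinders'].tolist()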

How to create a dataframe from extracted hashtags?

I have used the below code to extract hashtags from tweets.
def find_tags(row_string):
    tags = [x for x in row_string if x.startswith('#')]
    return tags

df['split'] = df['text'].str.split(' ')
df['hashtags'] = df['split'].apply(lambda row: find_tags(row))
df['hashtags'] = df['hashtags'].apply(lambda x: str(x).replace('\\n', ',').replace('\\', '').replace("'", ""))
df.drop('split', axis=1, inplace=True)
df
However, when I count them using the below code, the output counts each individual character.
from collections import Counter
d = Counter(df.hashtags.sum())
data = pd.DataFrame([d]).T
data
The output counts single characters rather than whole hashtags. I think the problem lies with the code that I am using to extract hashtags, but I don't know how to solve this issue.
Change find_tags to do the replace inside a list comprehension over split, and for counting the values use Series.explode with Series.value_counts:
def find_tags(row_string):
    return [x.replace('\\n', ',').replace('\\', '').replace("'", "")
            for x in row_string.split() if x.startswith('#')]

df['hashtags'] = df['text'].apply(find_tags)
and then:
data = df.hashtags.explode().value_counts().rename_axis('val').reset_index(name='count')
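A quick end-to-end check of this approach (a sketch with made-up tweet text, just to show the expected shape of the result):
import pandas as pd

df = pd.DataFrame({'text': ['fun day #beach #sun', 'back at the #beach']})
df['hashtags'] = df['text'].apply(find_tags)
data = df.hashtags.explode().value_counts().rename_axis('val').reset_index(name='count')
print(data)
#       val  count
# 0  #beach      2
# 1    #sun      1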

How do I extract specific values from a DataFrame and add them to a list?

Sample DataFrame:
             id        date    price
93   6021501535  2014-07-25   430000
93   6021501535  2014-12-23   700000
313  4139480200  2014-06-18  1384000
313  4139480200  2014-12-09  1400000
first_list = []
second_list = []
I need to add the first price that corresponds to a specific ID to the first list and the second price for that same ID to the second list.
Example:
first_list = [430000, 1384000]
second_list = [700000, 1400000]
After which, I'm going to plot the values from both lists on a lineplot to compare the difference in price between the first and second list.
I've tried doing this with groupby and loc and I kept running into errors. I then tried iterating over each row using a simple for loop but ran into more problems...
I would appreciate some help.
Based on your question, I think it's not necessary to save them into lists, because you could also store them somewhere else (e.g. another DataFrame) and plot them from there. The functions below should help with filling whatever structure you want to store your data in.
def date(your_id):
    first_date = df.loc[df['id'] == your_id].iloc[0, 1]
    second_date = df.loc[df['id'] == your_id].iloc[1, 1]
    return first_date, second_date

def price(your_id):
    first_date, second_date = date(your_id)
    # filter on your_id so the lookup works for any id passed in
    price_first_date = df.loc[(df['id'] == your_id) & (df['date'] == first_date)].iloc[0, 2]
    price_second_date = df.loc[(df['id'] == your_id) & (df['date'] == second_date)].iloc[0, 2]
    return price_first_date, price_second_date

price_first_date, price_second_date = price(6021501535)
If now for example you want to store your data in a new df you could do something like:
import numpy as np

selected_ids = [6021501535, 4139480200]
new_df = pd.DataFrame(index=np.arange(1, len(selected_ids) + 1),
                      columns=['price_first_date', 'price_second_date'])
for i in range(len(selected_ids)):
    your_id = selected_ids[i]
    new_df.iloc[i, 0], new_df.iloc[i, 1] = price(your_id)
new_df then contains all 'first date prices' in the first column and all 'second date prices' in the second column. Plotting should work out.
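Since groupby was mentioned in the question, here is an alternative sketch (assuming each id has exactly two rows; note the lists come out in sorted-id order, which may differ from the order in the example above):
# Sort by date, then take the first and second price within each id group.
g = df.sort_values('date').groupby('id')['price']
first_list = g.nth(0).tolist()
second_list = g.nth(1).tolist()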

Python - issues with for loops to create dataframe with BeautifulSoup scrape

I'm a beginner in Python and I'm trying to create a new dataframe using BeautifulSoup to scrape a webpage. I'm following some code that worked in a different page, but it's not working here. My final table of data is blank, so seems it's not appending. Any help is appreciated. This is what I've done:
from bs4 import BeautifulSoup
import requests
import pandas as pd

allergens = requests.get(url='http://guetta.com/diginn/allergens/')
allergens = BeautifulSoup(allergens.content)
items = allergens.find_all('div', class_='menu-item-card')
final_table = {}
for item in allergens.find_all('div', class_='menu-item-card'):
    for row in item.find_all('h4', recursive=False)[0:]:
        for column in row.find_all('p', class_='menu-item__allergens'):
            col_name = column['class'][0].split('__')[1]
            if col_name not in final_table:
                final_table[col_name] = []
            final_table[col_name].append(column.text)
df_allergens = pd.DataFrame(final_table)
This returns nothing. No errors, just empty brackets. I was able to retrieve each element individually, so I think the items should work but obviously I'm missing something.
Edit:
Here is what the output needs to be:
Item Name   | Allergens
Classic Dig | Soy
Item2       | allergen1, allergen2
Item3       | allergen2
You don't need to find all h4 tags in every item. So make a change like below:
...
for item in allergens.find_all('div', class_='menu-item-card'):
    for column in item.find_all('p', class_='menu-item__allergens'):
        col_name = column['class'][0].split('__')[1]
        if col_name not in final_table:
            final_table[col_name] = []
        final_table[col_name].append(column.text)
...
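If you also want the "Item Name" column from your expected output, the name can be read from each card alongside its allergens. A sketch, assuming each card's <h4> holds the item name (that markup detail is a guess):
rows = []
for item in allergens.find_all('div', class_='menu-item-card'):
    # Assumed markup: the card's <h4> contains the item name.
    name_tag = item.find('h4')
    name = name_tag.text.strip() if name_tag else ''
    tags = item.find_all('p', class_='menu-item__allergens')
    rows.append({'Item Name': name,
                 'Allergens': ', '.join(t.text.strip() for t in tags)})
df_allergens = pd.DataFrame(rows)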

Display and save contents of a data frame with multi-dimensional array elements

I have created and updated a pandas dataframe to fill details of a section of an image and its corresponding features.
slice_sq_dim = 200
df_slice = pd.DataFrame({'Sample': str,
                         'Slice_ID': int,
                         'Slice_Array': [np.zeros((slice_sq_dim, slice_sq_dim))],
                         'Interface_Array': [np.zeros((slice_sq_dim, slice_sq_dim))],
                         'Slice_Array_Threshold': [np.zeros((slice_sq_dim, slice_sq_dim))]})
I filled in the individual elements of this dataframe by updating the value of each cell through row-by-row iteration. Once I had completed my dataframe (with around 200 rows), I could not display more than the first row of its contents. I assume that this is due to the inclusion of multi-dimensional numpy arrays (image slices) as a component. I have also exported this data into a JSON file so that it can act as an input file during the next run. The following code shows exactly how I tried this and also how I fill my dataframe.
Slices_data_file = os.path.join(os.getcwd(), "Slices_dataframe.json")
if os.path.isfile(Slices_data_file):
    print("Using the saved data of slices from previous run..")
    df_slice = pd.read_json(Slices_data_file, orient='records')
else:
    print("No previously saved slice data found..")
    no_of_slices = 20
    for index, row in df_files.iterrows():  # df_files is the previous dataframe with image path details
        path = row['image_path']
        slices, slices_thresh, slices_interface = slice_image(path, slice_sq_dim, no_of_slices)
        # each of the outputs is a list of 20 image slices
        for n, arr in enumerate(slices):
            indx = (indx_row - 1) * no_of_slices + n  # indx_row is defined elsewhere in the question's code
            df_slice.Sample[indx] = path
            df_slice.Slice_ID[indx] = n + 1
            df_slice.Slice_Array[indx] = arr
            df_slice.Interface_Array[indx] = slices_interface[n]
            df_slice.Slice_Array_Threshold[indx] = slices_thresh[n]
    df_slice.to_json(Slices_data_file, orient='records')
I would like to do the following things:
Complete the dataframe with the possibility to add further columns of scalar values
View the dataframe normally with multiple rows and iterate using functions such as df_slice.iterrows() which is currently not supported
Save and reuse the database so as to avoid the repeated and time-consuming operations
Any advice or better suggestions?
After a while of searching, I found some topics that helped; pd.Series was very appropriate here. Also, I think there was a "SettingWithCopyWarning" that I chose to ignore somewhere in between. The final code is given below:
Slices_data_file = os.path.join(os.getcwd(), "Slices_dataframe.json")
if os.path.isfile(Slices_data_file):
    print("Using the saved data of slices from previous run..")
    df_slice = pd.read_json(Slices_data_file, orient='columns')
else:
    print("No previously saved slice data found..")
    Sample_col = []
    Slice_ID_col = []
    Slice_Array_col = []
    Interface_Array_col = []
    Slice_Array_Threshold_col = []
    no_of_slices = 20
    slice_sq_dim = 200
    df_slice = pd.DataFrame({'Sample': str,
                             'Slice_ID': int,
                             'Slice_Array': [],
                             'Interface_Array': [],
                             'Slice_Array_Threshold': []})
    for index, row in df_files.iterrows():
        path = row['image_path']
        slices, slices_thresh, slices_interface = slice_image(path, slice_sq_dim, no_of_slices)
        for n, arr in enumerate(slices):
            Sample_col.append(Image_Unique_ID)
            Slice_ID_col.append(n + 1)
            Slice_Array_col.append(arr)
            Interface_Array_col.append(slices_interface[n])
            Slice_Array_Threshold_col.append(slices_thresh[n])
        print("Slicing -> ", Image_Unique_ID, " Complete")
    df_slice['Sample'] = pd.Series(Sample_col)
    df_slice['Slice_ID'] = pd.Series(Slice_ID_col)
    df_slice['Slice_Array'] = pd.Series(Slice_Array_col)
    df_slice['Interface_Array'] = pd.Series(Interface_Array_col)
    df_slice['Slice_Array_Threshold'] = pd.Series(Slice_Array_Threshold_col)
    df_slice.to_json(os.path.join(os.getcwd(), "Slices_dataframe.json"), orient='columns')
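One caveat when reloading: JSON has no ndarray type, so read_json returns the array cells as nested Python lists. A sketch of converting them back (assuming the reloaded frame should again hold numpy arrays):
import numpy as np

# After pd.read_json, each stored slice comes back as a nested list,
# so map it back to an ndarray before reuse.
for col in ['Slice_Array', 'Interface_Array', 'Slice_Array_Threshold']:
    df_slice[col] = df_slice[col].apply(np.array)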
