Web scraping multiple pages with python 3? - python-3.x

I have a CSV file with numerous URLs. I read it into a pandas DataFrame for convenience, since I need to do some statistical work later and pandas is handy for that. It looks a little like this:
import pandas as pd
csv = [{"URLs" : "www.mercedes-benz.de", "electric" : 1}, {"URLs" : "www.audi.de", "electric" : 0}]
df = pd.DataFrame(csv)
My task is to check whether each website contains certain strings, and to add an extra column with 1 if so, else 0. For example, I want to check whether www.mercedes-benz.de contains the string 'car':
import requests

# requests needs an explicit scheme such as http://
page_content = requests.get("http://www.mercedes-benz.de")
if "car" in page_content.text:
    print('1')
else:
    print('0')
How do I iterate/loop through df['URLs'] and store the result in the pandas dataframe?

I think you need to loop through the data with DataFrame.iterrows and then create the new values with loc:
for i, row in df.iterrows():
    # assumes each URL includes a scheme, e.g. http://www.audi.de
    page_content = requests.get(row['URLs'])
    if "car" in page_content.text:
        df.loc[i, 'car'] = '1'
    else:
        df.loc[i, 'car'] = '0'
print(df)
URLs electric car
0 http://www.mercedes-benz.de 1 1
1 http://www.audi.de 0 1
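As a side note: with many URLs, some requests will fail (timeouts, bad hosts). A hedged sketch wrapping the check in a small helper and using apply; treating a failed request as '0' is an assumption you may want to change:
import requests

def contains_car(url):
    # '1' if the page text contains "car", else '0'
    # a failed request also yields '0' -- an assumption, adjust as needed
    try:
        return '1' if "car" in requests.get(url, timeout=10).text else '0'
    except requests.RequestException:
        return '0'

df['car'] = df['URLs'].apply(contains_car)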

Related

Create dataframe from lists

I want to create a dataframe from existing lists (each row of the file will be written as a row of the dataframe).
with open(filename, mode='r', encoding='cp1252') as f:
    lines = f.readlines()
    liste1 = str(lines[0])
    df1 = pd.DataFrame(liste1)
Who can help me, please?
Below are the first 3 rows of file f1:
['x1', 'major', '1198', 'TCP']
['x1', 'minor', '1198', 'UDP']
['x2', 'major', '1198', 'UDP']
If I understand this properly, you want each row in the DataFrame to be a string you read from a line in the file?
Note that liste1 in your case is a string, so I am not sure what you are going for.
This approach should work anyway:
import pandas as pd

df1 = pd.DataFrame()
with open(filename, mode='r', encoding='cp1252') as f:
    lines = f.readlines()
    liste1 = str(lines[0])
    # note: DataFrame.append was removed in pandas 2.0; use pd.concat there
    df1 = df1.append(pd.Series(liste1), ignore_index=True)
So if liste1 has the form
> "This is a string"
then your DataFrame will look like this
df1.head()
0
0 This is a string
If liste1 has the form
> ["This", "is", "a", "list"]
then your DataFrame will look like this
df1.head()
0 1 2 3
0 This is a list
You can then call this append() routine as many times as you want inside a loop.
However, I suspect that there is a function, such as pd.read_table(), that can do all of this for you automatically (as @jezrael suggested in the comments to your question).
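Since each line in the sample file looks like a Python list literal, a hedged sketch (assuming that format holds for the whole file) is to parse each line with ast.literal_eval and build the DataFrame in one go:
import ast
import pandas as pd

rows = []
with open(filename, mode='r', encoding='cp1252') as f:
    for line in f:
        line = line.strip()
        if line:
            # turn a line such as ['x1', 'major', '1198', 'TCP'] into a real list
            rows.append(ast.literal_eval(line))

df1 = pd.DataFrame(rows)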

How to add another column to an existing dataframe performing a simple task

I have a sample spreadsheet which contains the name of an item, its price, and a URL.
I have to create a dataframe with another column named index, which compares the integer value obtained from the corresponding URL in the url column to the value in the price column, and shows whether it is less or more.
For example:
name price url
egg 2 www.xyz/1-ed
ham 34 www.xyz/2-ed
The URL contains another price; for example, for egg it is 4 and for ham it is 32.
So the output should be:
name price url index
egg 2 www.xyz/1-ed less
ham 34 www.xyz/2-ed more
Obviously the real data contains more than 300 entries, so I have to apply this to all of them.
from bs4 import BeautifulSoup
import time
from smtplib import SMTP
import pandas as pd
import numpy as np
import requests
import re
data = pd.read_csv(r'C:\Users\sahay\Desktop\python\priceforeca1.csv')
df = pd.DataFrame(data, columns=['rates', 'URL'])
print(df)
This is just a small part of the whole code. I am unable to get past this step.
Thanks for helping a noob out!
You can first extract the url column from the dataframe:
urls = df["url"]
Then you will need to access each URL, parse the HTML page, and obtain the price. You can use Beautiful Soup to achieve this, but more on this can be suggested only after knowing the HTML structure of the page; a sketch is shown below.
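Purely as an illustration (the selector and page layout here are assumptions, since the real HTML structure is unknown), price extraction might look like this:
import requests
from bs4 import BeautifulSoup

def get_price_from_url(url):
    # hypothetical: assumes the page marks its price up as <span class="price">4</span>
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    tag = soup.find('span', class_='price')
    return int(tag.get_text(strip=True)) if tag else None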
You can then append the fetched prices to the dataframe as priceFromUrl. On this dataframe you can apply a function which checks whether the price column is more or less than the priceFromUrl column, and create a new column named index.
Sample snippet:
import pandas as pd

def getMoreOrLess(price, priceFromUrl):
    result = "equal"
    if price < priceFromUrl:
        result = "less"
    elif price > priceFromUrl:
        result = "more"
    return result

table = {'name': ['egg', 'ham'],
         'price': [2, 34],
         'url': ['www.xyz/1-ed', 'www.xyz/2-ed']
         }
df = pd.DataFrame(table, columns=['name', 'price', 'url'])

urls = df["url"]
price = []
for url in urls:
    # do the Beautiful Soup code to extract the price here and append it to price
    pass

# let us say the extracted prices are:
price = [4, 32]

df['priceFromUrl'] = price
df['index'] = df.apply(lambda x: getMoreOrLess(x['price'], x['priceFromUrl']), axis=1)
df = df.drop(columns='priceFromUrl')  # you can delete the helper column later
print(df)
What you can do here is use the apply() function of pandas.
Here's what you need to do:
import pandas as pd
import re

# Function to check whether the digit in the url is greater than, less than, or equal to the price
def checkForIndex(price, url):
    match = re.findall(r'(\d+)', url)
    if int(match[0]) > price:
        return 'more'
    elif int(match[0]) < price:
        return 'less'
    else:
        return 'equal'

# Making sample data for the dataframe
d = {'Price': [2, 34, 4, 3, 67],
     'url': ['www.xyz/1-ed', 'www.xyz/2-ed', 'www.xyz/4-ed', 'www.xyz/5-ed', 'www.xyz/66-ed']}
# Making the dataframe
dataFrame = pd.DataFrame(data=d)
# Making a new column based on conditions on the other columns
dataFrame['index'] = dataFrame.apply(lambda x: checkForIndex(x.Price, x.url), axis=1)
# Printing the dataframe
print(dataFrame)
Output:
Price url index
0 2 www.xyz/1-ed less
1 34 www.xyz/2-ed less
2 4 www.xyz/4-ed equal
3 3 www.xyz/5-ed more
4 67 www.xyz/66-ed less
Here are some links for further reading:
apply()
Regex
Hope this helps, cheers!

Summarize non-zero values or any values from a pandas dataframe with timestamps - From_Time & To_Time

I have a dataframe, given below.
I want to extract all the non-zero values from each column and summarize them like this:
If any value repeats for a period of time, then the starting time of that value should go in the 'FROM' column and the end time in the 'TO' column, with the column name in the 'BLK-ASB-INV' column and the value in the 'Scount' column. To this end I have started to write the following code:
import pandas as pd

df = pd.read_excel("StringFault_Bagewadi_16-01-2020.xlsx")
df = df.set_index(['Date (+05:30)'])
cols = ['BLK-ASB-INV', 'Scount', 'FROM', 'TO']
res = pd.DataFrame(columns=cols)
for col in df.columns:
    ss = df[col].iloc[df[col].to_numpy().nonzero()[0]]
    .......
After that I am unable to think how should I approach to get the desired output. Is there any way to do this in python? Thanks in advance for any help.
Finally I have solved my problem. The code given below works perfectly for me:
import pandas as pd

df = pd.read_excel("StringFault.xlsx")
df = df.set_index(['Date (+05:30)'])
cols = ['BLK-ASB-INV', 'Scount', 'FROM', 'TO']
res = pd.DataFrame(columns=cols)
for col in df.columns:
    device = []
    for i in range(len(df[col])):
        if df[col][i] == 0:
            pass  # skip zero values
        else:
            if i < len(df[col]) - 1 and df[col][i] == df[col][i + 1]:
                # the value repeats: open a run if one is not already open
                try:
                    if df[col].index[i] > device[2]:
                        continue
                except IndexError:
                    device.append(df[col].name)
                    device.append(df[col][i])
                    device.append(df[col].index[i])
                    continue
            else:
                if len(device) == 3:
                    # a run is open: close it and record the row
                    device.append(df[col].index[i])
                    res = res.append({'BLK-ASB-INV': device[0], 'Scount': device[1],
                                      'FROM': device[2], 'TO': device[3]}, ignore_index=True)
                    device = []
                else:
                    # a single, non-repeating value
                    device.append(df[col].name)
                    device.append(df[col][i])
                    if i == 0:
                        device.append(df[col].index[i])
                    else:
                        device.append(df[col].index[i - 1])
                    device.append(df[col].index[i])
                    res = res.append({'BLK-ASB-INV': device[0], 'Scount': device[1],
                                      'FROM': device[2], 'TO': device[3]}, ignore_index=True)
                    device = []
For reference, here is the output dataframe.
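As a side note, a more compact (hedged, untested against the real file) alternative is to label runs of consecutive identical values with shift/cumsum and group on that label; this sketch assumes the timestamps live in the index, as in the code above:
import pandas as pd

def summarize_runs(df):
    rows = []
    for col in df.columns:
        s = df[col]
        # a new run starts whenever the value changes from the previous row
        run_id = (s != s.shift()).cumsum()
        for _, run in s.groupby(run_id):
            if run.iloc[0] == 0:
                continue  # ignore runs of zeros
            rows.append({'BLK-ASB-INV': col, 'Scount': run.iloc[0],
                         'FROM': run.index[0], 'TO': run.index[-1]})
    return pd.DataFrame(rows, columns=['BLK-ASB-INV', 'Scount', 'FROM', 'TO'])

res = summarize_runs(df)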

Use lambda, apply, and join functions on a pandas dataframe

Goal
Apply deid_notes function to df
Background
I have a df that resembles this sample df
import pandas as pd
df = pd.DataFrame({'Text': ['there are many different types of crayons',
                            'i like a lot of sports cars',
                            'the middle east has many camels'],
                   'P_ID': [1, 2, 3],
                   'Word': ['crayons', 'cars', 'camels'],
                   'P_Name': ['John', 'Mary', 'Jacob'],
                   'N_ID': ['A1', 'A2', 'A3']
                   })
#rearrange columns
df = df[['Text', 'N_ID', 'P_ID', 'P_Name', 'Word']]
df
Text N_ID P_ID P_Name Word
0 many types of crayons A1 1 John crayons
1 i like sports cars A2 2 Mary cars
2 has many camels A3 3 Jacob camels
I use the following function to deidentify certain words within the Text column, using NeuroNER (http://neuroner.com/):
def deid_notes(text):
    # use the predict function from NeuroNER to tag words to be deidentified
    ner_list = n1.predict(text)
    # n1.predict won't work in this toy example because the neuroNER package needs to be
    # installed (and installation is difficult), but the output resembles this:
    # [{'start': 1, 'end': 11, 'id': 1, 'tagged word': 'crayon'}]
    # use the start and end positions of tagged words to replace them with **BLOCK**
    if len(ner_list) > 0:
        parts_to_take = ([(0, ner_list[0]['start'])]
                         + [(first["end"] + 1, second["start"])
                            for first, second in zip(ner_list, ner_list[1:])]
                         + [(ner_list[-1]['end'], len(text) - 1)])
        parts = [text[start:end] for start, end in parts_to_take]
        deid = '**BLOCK**'.join(parts)
    # if n1.predict does not identify any words to be deidentified, place NaN
    else:
        deid = 'NaN'
    return pd.Series(deid, index=['Deid'])
Problem
I apply the deid_notes function to my df using the following code
fx = lambda x: deid_notes(x.Text,axis=1)
df.join(df.apply(fx))
But I get the following error
AttributeError: ("'Series' object has no attribute 'Text'", 'occurred at index Text')
Question
How do I get the deid_notes function to work on my df?
Assuming you are returning a pandas Series as output from the deid_notes function, taking text as the only input argument: pass the axis=1 argument to apply instead of to deid_notes. For example:
# Dummy function
def deid_notes(text):
    deid = 'prediction to: ' + text
    return pd.Series(deid, index=['Deid'])

fx = lambda x: deid_notes(x.Text)
df.join(df.apply(fx, axis=1))
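With the dummy function, the join adds a Deid column whose values are deterministic:
df.join(df.apply(fx, axis=1))['Deid']
0    prediction to: there are many different types ...
1    prediction to: i like a lot of sports cars
2    prediction to: the middle east has many camels
Name: Deid, dtype: object
Once the real NeuroNER-backed deid_notes is swapped back in, the same two lines apply unchanged.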

Searching many strings for many dictionary keys, quickly

I have a unique question, and I am primarily hoping to identify ways to speed up this code a little. I have a set of strings stored in a dataframe, each of which has several names in it, and I know the number of names before this step, like so:
print(df)
description num_people people
'Harry ran with Sally' 2 []
'Joe was swinging with Sally' 2 []
'Lola Dances alone' 1 []
I am using a dictionary with the keys that I am looking to find in description, like so:
my_dict = {'Harry': '1283', 'Joe': '1828', 'Sally': '1298', 'Lola': '1982'}
and then using iterrows to search each string for matches like so:
for index, row in df.iterrows():
    row.people = [key for key in my_dict if re.findall(key, row.description)]
and when run it ends up with:
print(df)
description num_people people
'Harry ran with Sally' 2 ['Harry', 'Sally']
'Joe was swinging with Sally' 2 ['Joe', 'Sally']
'Lola Dances alone' 1 ['Lola']
The problem I see is that this code is still fairly slow, and I have a large number of descriptions and over 1000 keys. Is there a faster way of performing this operation, maybe using the known number of people?
Faster solution:
# strip ' at the start and end of the text, create lists of words
splited = df.description.str.strip("'").str.split()
# filtering
df['people'] = splited.apply(lambda x: [i for i in x if i in my_dict.keys()])
print(df)
description num_people people
0 'Harry ran with Sally' 2 [Harry, Sally]
1 'Joe was swinging with Sally' 2 [Joe, Sally]
2 'Lola Dances alone' 1 [Lola]
Timings:
#[30000 rows x 3 columns]
In [198]: %timeit (orig(my_dict, df))
1 loop, best of 3: 3.63 s per loop
In [199]: %timeit (new(my_dict, df1))
10 loops, best of 3: 78.2 ms per loop
import re
import pandas as pd

df['people'] = [[], [], []]
df = pd.concat([df] * 10000).reset_index(drop=True)
df1 = df.copy()
my_dict = {'Harry': '1283', 'Joe': '1828', 'Sally': '1298', 'Lola': '1982'}

def orig(my_dict, df):
    for index, row in df.iterrows():
        df.at[index, 'people'] = [key for key in my_dict if re.findall(key, row.description)]
    return df

def new(my_dict, df):
    df.description = df.description.str.strip("'")
    splited = df.description.str.split()
    df.people = splited.apply(lambda x: [i for i in x if i in my_dict.keys()])
    return df

print(orig(my_dict, df))
print(new(my_dict, df1))
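One caveat with the split-based approach: it only matches whole whitespace-separated tokens, so a name followed by punctuation ('Sally.') is missed. A hedged sketch of a middle ground compiles all keys into one alternation regex and scans each description once:
import re

# one pattern matching any known name as a whole word, in a single pass per row
pattern = re.compile(r'\b(' + '|'.join(map(re.escape, my_dict)) + r')\b')
df['people'] = df['description'].str.findall(pattern)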
