I am trying to download photos listed in a CSV file. I want to save all photos from each link-list column into a folder with the same name as that column.
Expected:
CSV file:
1 a-Qc01o78E a-rrBuci0w a-qj8s5nlM a-Cwciy2zx
0 2 https://photo.yupoo.com/ven-new/9f13389c/big.jpg https://photo.yupoo.com/ven-new/8852c424/big.jpg https://photo.yupoo.com/ven-new/650d84fd/big.jpg https://photo.yupoo.com/ven-new/a99f9e52/big.jpg
1 3 https://photo.yupoo.com/ven-new/f0adc019/big.jpg https://photo.yupoo.com/ven-new/c434624c/big.jpg https://photo.yupoo.com/ven-new/bed9125c/big.jpg https://photo.yupoo.com/ven-new/2d0b7a67/big.jpg
2 4 https://photo.yupoo.com/ven-new/8844627a/big.jpg https://photo.yupoo.com/ven-new/edda4ec4/big.jpg https://photo.yupoo.com/ven-new/3283fe57/big.jpg https://photo.yupoo.com/ven-new/1f6425e5/big.jpg
3 5 https://photo.yupoo.com/ven-new/eeb8b78a/big.jpg https://photo.yupoo.com/ven-new/6cdcbbf7/big.jpg https://photo.yupoo.com/ven-new/f64ca040/big.jpg https://uvd.yupoo.com/ven-new/22259049_oneTrue.jpg
4 6 https://photo.yupoo.com/ven-new/9c3e9a92/big.jpg https://photo.yupoo.com/ven-new/ea257725/big.jpg https://photo.yupoo.com/ven-new/64b5a57f/big.jpg https://uvd.yupoo.com/ven-new/22257899_oneTrue.jpg
5 7 https://photo.yupoo.com/ven-new/baaf8945/big.jpg https://photo.yupoo.com/ven-new/e9cc7392/big.jpg https://photo.yupoo.com/ven-new/753418b5/big.jpg https://uvd.yupoo.com/ven-new/22257619_oneTrue.jpg
6 8 https://photo.yupoo.com/ven-new/a325f44a/big.jpg https://photo.yupoo.com/ven-new/671be145/big.jpg https://photo.yupoo.com/ven-new/1742a09d/big.jpg https://photo.yupoo.com/ven-new/c9d0aa0f/big.jpg
7 9 https://photo.yupoo.com/ven-new/367cf72a/big.jpg https://photo.yupoo.com/ven-new/ae7f1b1b/big.jpg https://photo.yupoo.com/ven-new/d0ef54ed/big.jpg https://photo.yupoo.com/ven-new/2d0905df/big.jpg
8 10 https://photo.yupoo.com/ven-new/3fcacff3/big.jpg https://photo.yupoo.com/ven-new/c4ea9b1e/big.jpg https://photo.yupoo.com/ven-new/683db958/big.jpg https://photo.yupoo.com/ven-new/3b065995/big.jpg
9 11 https://photo.yupoo.com/ven-new/c7de704a/big.jpg https://photo.yupoo.com/ven-new/92abc9ea/big.jpg https://photo.yupoo.com/ven-new/bd1083db/big.jpg https://photo.yupoo.com/ven-new/a9086d26/big.jpg
10 12 https://photo.yupoo.com/ven-new/fc481727/big.jpg https://photo.yupoo.com/ven-new/a49c94df/big.jpg
11 13 https://photo.yupoo.com/ven-new/1e0e0e10/big.jpg https://photo.yupoo.com/ven-new/62580909/big.jpg
12 14 https://photo.yupoo.com/ven-new/934b423e/big.jpg https://photo.yupoo.com/ven-new/74b81853/big.jpg
13 15 https://photo.yupoo.com/ven-new/adf878b2/big.jpg https://photo.yupoo.com/ven-new/5ad881c3/big.jpg
14 16 https://photo.yupoo.com/ven-new/59dc1203/big.jpg https://photo.yupoo.com/ven-new/3cd676ac/big.jpg
15 17 https://photo.yupoo.com/ven-new/6d8eb080/big.jpg
16 18 https://photo.yupoo.com/ven-new/9a027ada/big.jpg
17 19 https://photo.yupoo.com/ven-new/bdeaf1b5/big.jpg
18 20 https://photo.yupoo.com/ven-new/1f293683/big.jpg
import pandas as pd
import requests

def get_photos(x):
    with requests.Session() as c:
        df = pd.read_csv("C:\\Users\\Lukasz\\Desktop\\PROJEKTY PYTHON\\W TRAKCIE\\YUOPOO SCRAPER\\FF_data_frame.csv")
        NazwyKolum = pd.read_csv("C:\\Users\\Lukasz\\Desktop\\PROJEKTY PYTHON\\W TRAKCIE\\bf3_strona.csv")
        c.get('https://photo.yupoo.com/')
        c.headers.update({'referer': 'https://photo.yupoo.com/'})
        URLS = (df[NazwyKolum['LINKS'][x]])  # .to_string(index=False))
        print(URLS)  # prints a list of links, e.g. https://photo.yupoo.com/ven-new/650d84fd/big.jpg, https://photo.yupoo.com/ven-new/bed9125c/big.jpg
        # proxies = {'https': 'http://45.77.76.254:8080'}
        res = c.get(URLS, timeout=None)
        if res.status_code == 200:
            return res.content

try:
    for x in range(2, 50):
        with open("C:\\Users\\Lukasz\\Desktop\\PROJEKTY PYTHON\\W TRAKCIE\\YUOPOO SCRAPER\\" + str(x) + 'NAZWAZDECIA.jpg', 'wb') as f:
            f.write(get_photos(x))
except:
    print("wr")
To be honest, I do not know how to handle it anymore; I have wasted a lot of time with no progress. Thank you, guys.
import pandas as pd
import requests, os

df = pd.read_csv("a.csv")

def create_directory(directory):
    if not os.path.exists(directory):
        os.makedirs(directory)

def download_save(url, folder):
    create_directory(folder)
    res = requests.get(url)
    with open(f'{folder}/{url.split("/")[-2]}.jpg', 'wb') as f:
        f.write(res.content)

for col in df.columns:
    print(col)
    for url in df[col].tolist():
        print(url)
        if str(url).startswith("http"):
            download_save(url, col)
The above code creates the column names as directories, downloads the images, and saves each one under the unique name taken from its URL.
Here a.csv is as follows:
a-Qc01o78E,a-rrBuci0w,a-qj8s5nlM,a-Cwciy2zx
https://photo.yupoo.com/ven-new/9f13389c/big.jpg,https://photo.yupoo.com/ven-new/8852c424/big.jpg,https://photo.yupoo.com/ven-new/650d84fd/big.jpg,https://photo.yupoo.com/ven-new/a99f9e52/big.jpg
https://photo.yupoo.com/ven-new/f0adc019/big.jpg,https://photo.yupoo.com/ven-new/c434624c/big.jpg,https://photo.yupoo.com/ven-new/bed9125c/big.jpg,https://photo.yupoo.com/ven-new/2d0b7a67/big.jpg
https://photo.yupoo.com/ven-new/8844627a/big.jpg,https://photo.yupoo.com/ven-new/edda4ec4/big.jpg,https://photo.yupoo.com/ven-new/3283fe57/big.jpg,https://photo.yupoo.com/ven-new/1f6425e5/big.jpg
https://photo.yupoo.com/ven-new/eeb8b78a/big.jpg,https://photo.yupoo.com/ven-new/6cdcbbf7/big.jpg,https://photo.yupoo.com/ven-new/f64ca040/big.jpg,https://uvd.yupoo.com/ven-new/22259049_oneTrue.jpg
https://photo.yupoo.com/ven-new/9c3e9a92/big.jpg,https://photo.yupoo.com/ven-new/ea257725/big.jpg,https://photo.yupoo.com/ven-new/64b5a57f/big.jpg,https://uvd.yupoo.com/ven-new/22257899_oneTrue.jpg
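If the host rejects plain downloads (the question's own code sets a Referer header, which suggests the site checks it), a variation of the same answer that reuses one session and sets that header might look like this. This is a sketch under that assumption, not a tested implementation:

import os
import pandas as pd
import requests

df = pd.read_csv("a.csv")

with requests.Session() as s:
    s.headers.update({'referer': 'https://photo.yupoo.com/'})  # assumption: the host checks the Referer
    for col in df.columns:
        os.makedirs(col, exist_ok=True)  # one folder per column, named after the column
        for url in df[col].dropna():
            if str(url).startswith("http"):
                res = s.get(url, timeout=30)
                if res.status_code == 200:
                    # the second-to-last URL segment is the unique photo id
                    with open(f'{col}/{url.split("/")[-2]}.jpg', 'wb') as f:
                        f.write(res.content)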
Let's say I have a list of "classifiers", each of which has a "predict" method. For example, take the following code:
import pandas as pd

class ClassifierA():
    def __init__(self, name):
        self.name = name
    def predict(self, x):
        return x + 5

class ClassifierB():
    def __init__(self, name):
        self.name = name
    def predict(self, x):
        return x + 10

class ClassifierC():
    def __init__(self, name):
        self.name = name
    def predict(self, x):
        return x - 10

classifierA = ClassifierA(name="classifierA")
classifierB = ClassifierB(name="classifierB")
classifierC = ClassifierC(name="classifierC")
list_of_classifiers = [classifierA, classifierB, classifierC]
In this case, the classifiers do something very simple, but in the real world, imagine they actually perform some prediction.
Now, imagine we have a pandas DataFrame with many observations; in this case, 10 observations, each of which is just a number from 1 to 10.
df = pd.DataFrame({
    'x': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
})
How can I apply each of the classifiers in the list to df.x in parallel? The output of what I want to achieve is the same as:
for classifier in list_of_classifiers:
    df['{}_prediction'.format(classifier.name)] = df['x'].apply(lambda x: classifier.predict(x))
which yields
x classifierA_prediction classifierB_prediction classifierC_prediction
0 1 6 11 -9
1 2 7 12 -8
2 3 8 13 -7
3 4 9 14 -6
4 5 10 15 -5
5 6 11 16 -4
6 7 12 17 -3
7 8 13 18 -2
8 9 14 19 -1
9 10 15 20 0
but the problem here is that I'm looping over each classifier in the list and also looping over each row in the dataframe via apply. I wonder if there's something more efficient.
For example, I could also do
predictions = df['x'].apply(
    lambda x: (classifierA.predict(x), classifierB.predict(x), classifierC.predict(x)))
but the problem there is that then I am specifying each classifier manually and the list might be very big.
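One possible direction (a sketch, not a definitive answer): parallelize across classifiers rather than across rows, e.g. with concurrent.futures, assuming the classifier objects are picklable and each predict call is independent:

from concurrent.futures import ProcessPoolExecutor

def run_classifier(classifier, xs):
    # one task per classifier; each task predicts the whole column
    return classifier.name, [classifier.predict(x) for x in xs]

if __name__ == '__main__':
    xs = df['x'].tolist()
    with ProcessPoolExecutor() as executor:
        futures = [executor.submit(run_classifier, clf, xs) for clf in list_of_classifiers]
        for future in futures:
            name, preds = future.result()
            df['{}_prediction'.format(name)] = preds

Note that for predicts as cheap as these toy examples, process startup and pickling overhead will dominate; the approach only pays off when each predict call is genuinely expensive.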
Let's say we have many Excel files with multiple sheets, as follows:
Sheet 1: 2021_q1_bj
a b c d
0 1 2 23 2
1 2 3 45 5
Sheet 2: 2021_q2_bj
a b c d
0 1 2 23 6
1 2 3 45 7
Sheet 3: 2019_q1_sh
a b c
0 1 2 23
1 2 3 45
Sheet 4: 2019_q2_sh
a b c
0 1 2 23
1 2 3 40
I wish to append sheets into one whenever the last part of the sheet name (split by _) is the same across all Excel files; i.e., Sheet 1 will be appended to Sheet 2 since they both share bj, and if another Excel file also has sheets whose names end in bj, they will be appended to the same output. The same logic applies to Sheet 3 and Sheet 4.
How could I achieve that in Pandas or other Python packages?
The expected result for the current Excel file would be:
bj:
a b c d
0 1 2 23 2
1 2 3 45 5
2 1 2 23 6
3 2 3 45 7
sh:
a b c
0 1 2 23
1 2 3 45
2 1 2 23
3 2 3 40
Code for reference:
import os, glob
import pandas as pd

files = glob.glob("*.xlsx")
for each in files:
    dfs = pd.read_excel(each, sheet_name=None, index_col=[0])
    df_out = pd.concat(dfs.values(), keys=dfs.keys())
    for n, g in df_out.groupby(df_out.index.to_series().str[0].str.rsplit('_', n=1).str[-1]):
        g.droplevel(level=0).dropna(how='all', axis=1).reset_index(drop=True).to_excel(f'Out_{n}.xlsx', index=False)
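The groupby key above is dense; as a quick sanity check, this sketch shows what it computes for a sheet name:

import pandas as pd

# after pd.concat(..., keys=dfs.keys()) each row-index entry is a tuple like ('2021_q1_bj', 0);
# .str[0] picks the sheet name, and rsplit('_', n=1).str[-1] takes the trailing city code
idx = pd.Series([('2021_q1_bj', 0), ('2019_q2_sh', 1)])
print(idx.str[0].str.rsplit('_', n=1).str[-1])  # -> 'bj', 'sh'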
Update:
You may download test excel files and final expected result from this link.
Try:
dfs = pd.read_excel('Downloads/WS_1.xlsx', sheet_name=None, index_col=[0])
df_out = pd.concat(dfs.values(), keys=dfs.keys())
for n, g in df_out.groupby(df_out.index.to_series().str[0].str.rsplit('_', n=1).str[-1]):
    g.droplevel(level=0).dropna(how='all', axis=1).reset_index(drop=True).to_excel(f'Out_{n}.xlsx')
Update
import os, glob
import pandas as pd

files = glob.glob("Downloads/test_data/*.xlsx")
writer = pd.ExcelWriter('Downloads/test_data/Output_file.xlsx', engine='xlsxwriter')
excel_dict = {}
for each in files:
    dfs = pd.read_excel(each, sheet_name=None, index_col=[0])
    excel_dict.update(dfs)
df_out = pd.concat(excel_dict.values(), keys=excel_dict.keys())
for n, g in df_out.groupby(df_out.index.to_series().str[0].str.rsplit('_', n=1).str[-1]):
    g.droplevel(level=0).dropna(how='all', axis=1).reset_index(drop=True).to_excel(writer, index=False, sheet_name=f'{n}')
writer.save()
writer.close()
I have achieved the whole process and got the final expected result with the code below. I would be grateful for alternative, more concise solutions, or any advice:
import os, glob
import pandas as pd
from pandas import ExcelWriter
from datetime import datetime

def save_xls(dict_df, path):
    writer = ExcelWriter(path)
    for key in dict_df:
        dict_df[key].to_excel(writer, key, index=False)
    writer.save()

# step 1: add year/quarter/city columns derived from each sheet name
root_dir = './original/'
for root, subFolders, files in os.walk(root_dir):
    for file in files:
        if '.xlsx' in file:
            file_path = os.path.join(root_dir, file)
            print(file)
            f = pd.ExcelFile(file_path)
            dict_dfs = {}
            for sheet_name in f.sheet_names:
                df_new = f.parse(sheet_name=sheet_name)
                print(sheet_name)
                # get the year, quarter and city from the sheet name
                year, quarter, city = sheet_name.split("_")
                df_new["year"] = year
                df_new["quarter"] = quarter
                df_new["city"] = city
                dict_dfs[sheet_name] = df_new
            save_xls(dict_df=dict_dfs, path='./add_columns_from_sheet_name/' + "new_" + file)

# step 2: group sheets by city suffix and write one file per group
root_dir = './add_columns_from_sheet_name/'
for root, subFolders, files in os.walk(root_dir):
    for file in files:
        if '.xlsx' in file:
            file_path = os.path.join(root_dir, file)
            dfs = pd.read_excel(file_path, sheet_name=None)
            df_out = pd.concat(dfs.values(), keys=dfs.keys())
            for n, g in df_out.groupby(df_out.index.to_series().str[0].str.rsplit('_', n=1).str[-1]):
                print(n)
                timestr = datetime.utcnow().strftime('%Y%m%d-%H%M%S%f')[:-3]
                g.droplevel(level=0).dropna(how='all', axis=1).reset_index(drop=True).to_excel(f'./output/{n}_{timestr}.xlsx', index=False)

# step 3: collect the distinct city prefixes of the output files
file_set = set()
file_dir = './output/'
file_list = os.listdir(file_dir)
for file in file_list:
    data_type = file.split('_')[0]
    file_set.add(data_type)
print(file_set)

# step 4: append per-city files into one DataFrame per city
file_dir = './output'
file_list = os.listdir(file_dir)
df1 = pd.DataFrame()
df2 = pd.DataFrame()
df3 = pd.DataFrame()
df4 = pd.DataFrame()
for file in file_list:
    if '.xlsx' in file:
        df_temp = pd.read_excel(os.path.join(file_dir, file))
        if 'bj' in file:
            df1 = df1.append(df_temp)
        elif 'sh' in file:
            df2 = df2.append(df_temp)
        elif 'gz' in file:
            df3 = df3.append(df_temp)
        elif 'sz' in file:
            df4 = df4.append(df_temp)

# function: write a list of dataframes to one workbook, one sheet each
def dfs_tabs(df_list, sheet_list, file_name):
    writer = pd.ExcelWriter(file_name, engine='xlsxwriter')
    for dataframe, sheet in zip(df_list, sheet_list):
        dataframe.to_excel(writer, sheet_name=sheet, startrow=0, startcol=0, index=False)
    writer.save()

# list of dataframes and sheet names
dfs = [df1, df2, df3, df4]
sheets = ['bj', 'sh', 'gz', 'sz']
# run function
dfs_tabs(dfs, sheets, './final/final_result.xlsx')
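A more concise take on the same task (a hedged sketch, not tested against your files): read every sheet of every file into one dict, group once by the city suffix, and write a single workbook. It assumes sheet names are unique across files (otherwise key them by file as well) and skips the extra year/quarter/city columns the full solution adds:

import glob
import pandas as pd

all_sheets = {}
for path in glob.glob('./original/*.xlsx'):
    # sheet_name=None reads every sheet into a {name: DataFrame} dict
    all_sheets.update(pd.read_excel(path, sheet_name=None))

combined = pd.concat(all_sheets.values(), keys=all_sheets.keys())
# group key: the trailing city code of each sheet name, e.g. '2021_q1_bj' -> 'bj'
city = combined.index.get_level_values(0).str.rsplit('_', n=1).str[-1]

with pd.ExcelWriter('./final/final_result.xlsx') as writer:
    for n, g in combined.groupby(city):
        g.dropna(how='all', axis=1).reset_index(drop=True).to_excel(writer, sheet_name=n, index=False)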
I have created a script that collects the information on a website and puts it into a spreadsheet. I'm in the process of becoming acquainted with Python scraping, and I would like some help, as I would like the player numbers to be in a different column.
# import libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup
import xlsxwriter
import xlwt
from xlwt import Workbook

# Workbook is created
wb = Workbook()
# add_sheet is used to create sheet.
sheet1 = wb.add_sheet('Sheet 1')

# send request
#url = 'http://fcf.cat/acta/1920/futbol-11/infantil-primera-divisio/grup-11/1i/sant-ildefons-ue-b/1i/lhospitalet-centre-esports-c'
url = 'https://www.fcf.cat/acta/2422183'
page = requests.get(url, timeout=5, verify=False)
soup = BeautifulSoup(page.text, 'html.parser')

# read acta
#acta_text = []
#acta_text_element = soup.find_all(class_='acta-table')
#for item in acta_text_element:
#    acta_text.append(item.text)

i = 0
acta = []
for tr in soup.find_all('tr'):
    values = [td.text.strip() for td in tr.find_all('td')]
    print(values)
    acta.append(values)
    i = 1 + i
    sheet1.write(i, 0, values)

wb.save('xlwt example.xls')
print(acta)
Thanks,
Two things to consider:
You can separate the first element in the list by using values[0] then use values[1:] for the remaining items
Use isnumeric to check if a string value is a number
Try this code:
for tr in soup.find_all('tr'):
    values = [td.text.strip() for td in tr.find_all('td')]
    print(values)
    acta.append(values)
    i = 1 + i
    if len(values) and values[0].isnumeric():      # if the first element is a number
        sheet1.write(i, 0, values[0])              # number in column 1
        sheet1.write(i, 1, ', '.join(values[1:]))  # rest of the row, joined, in column 2 (xlwt cannot write a list directly)
    else:
        sheet1.write(i, 0, ', '.join(values))      # all values, joined, in column 1
Excel output (truncated)
To take the team on the left, for example, try this:
tables = soup.select('table')
players = []
columns = ["Player", "Shirt"]
titulars = [item for item in tables[1].text.strip().split('\n') if len(item) > 0]
# tables[1] holds the data for the first team; the other team is in tables[8]
for num, name in zip(titulars[2::2], titulars[1::2]):
    player = []
    player.extend((num, name))
    players.append(player)
pd.DataFrame(players, columns=columns)
Output:
Player Shirt
0 TORNER ENCINAS, GONZALO 1
1 MACHUCA LOVERA, OSMAR SILVESTRE 3
2 JARA MARTIN, BLAI 4
3 AGUILAR LUQUE, DANIEL 5
4 FONT MURILLO, JOAQUIN 6
5 MARTÍNEZ ELVIR, RICHARD ADRIAN 7
6 MARQUEZ RODRIGUEZ, GERARD 8
7 PATUEL BATLLE, GERARD 10
8 EL MAHI ZAROUALI, BILAL 11
9 JAUME MORERA, ADRIA 14
10 DEL VALLE ESCANCIANO, MARTI 15
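To get both teams, the same slicing can be wrapped in a loop over the two table positions the answer mentions. A sketch reusing `tables` from above; the index 8 for the away team comes from the answer and may change if the page layout does:

teams = {1: 'home', 8: 'away'}   # table positions taken from the answer above
players = []
for idx, side in teams.items():
    titulars = [item for item in tables[idx].text.strip().split('\n') if len(item) > 0]
    for num, name in zip(titulars[2::2], titulars[1::2]):
        players.append([side, num, name])

pd.DataFrame(players, columns=['Team', 'Player', 'Shirt'])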
I'm faced with the following challenge: I want to get all financial data about companies, and I wrote code that does it. Let's say that the result is like below:
Unnamed: 0 I Q 2017 II Q 2017 \
0 Przychody netto ze sprzedaży (tys. zł) 137 134
1 Zysk (strata) z działal. oper. (tys. zł) -423 -358
2 Zysk (strata) brutto (tys. zł) -501 -280
3 Zysk (strata) netto (tys. zł)* -399 -263
4 Amortyzacja (tys. zł) 134 110
5 EBITDA (tys. zł) -289 -248
6 Aktywa (tys. zł) 27 845 26 530
7 Kapitał własny (tys. zł)* 22 852 22 589
8 Liczba akcji (tys. szt.) 13 921,975 13 921,975
9 Zysk na akcję (zł) -0,029 -0,019
10 Wartość księgowa na akcję (zł) 1,641 1,623
11 Raport zbadany przez audytora N N
but repeated 464 times.
Unfortunately, when I want to save all 464 results in one CSV file, only the last result gets saved. Not all 464 results, just one... Could you help me save all of them? Below is my code.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.bankier.pl/gielda/notowania/akcje'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml')

# Find the first table on the page
t = soup.find_all('table')[0]

# Read the table into a Pandas DataFrame
df = pd.read_html(str(t))[0]

# get the company names
names_of_company = df["Walor AD"].values

# build all links from the names of companies
links = []
for i in range(len(names_of_company)):
    new_string = 'https://www.bankier.pl/gielda/notowania/akcje/' + names_of_company[i] + '/wyniki-finansowe'
    links.append(new_string)

for i in links:
    url2 = f'https://www.bankier.pl/gielda/notowania/akcje/{names_of_company[0]}/wyniki-finansowe'
    page2 = requests.get(url2)
    soup = BeautifulSoup(page2.content, 'lxml')
    # Find the first table on the page
    t2 = soup.find_all('table')[0]
    df2 = pd.read_html(str(t2))[0]
    df2.to_csv('output.csv', index=False, header=None)
You've almost got it. You're just overwriting your CSV each time. Replace
df2.to_csv('output.csv', index=False, header=None)
with
with open('output.csv', 'a') as f:
    df2.to_csv(f, header=False)
in order to append to the CSV instead of overwriting it.
Also, your example doesn't work because this:
for i in links:
    url2 = f'https://www.bankier.pl/gielda/notowania/akcje/{names_of_company[0]}/wyniki-finansowe'
should be:
for i in links:
    url2 = i
When the website has no data, skip and move on to the next one:
try:
    t2 = soup.find_all('table')[0]
    df2 = pd.read_html(str(t2))[0]
    with open('output.csv', 'a') as f:
        df2.to_csv(f, header=False)
except:
    pass
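Putting the three fixes together (append mode, using the loop variable, skipping empty pages), the download loop might look like this. A sketch: the broad except is narrowed slightly to Exception, and the delay between requests is an assumption added to be polite to the server:

import time

for link in links:
    page2 = requests.get(link)
    soup = BeautifulSoup(page2.content, 'lxml')
    try:
        t2 = soup.find_all('table')[0]
        df2 = pd.read_html(str(t2))[0]
        with open('output.csv', 'a') as f:   # append instead of overwrite
            df2.to_csv(f, header=False)
    except Exception:
        pass  # page without a results table: skip it
    time.sleep(1)  # assumption: small delay between requests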
After forming the below Python pandas DataFrame (for example):
import pandas
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pandas.DataFrame(data,columns=['Name','Age'])
If I iterate through it, I get
In [62]: for i in df.itertuples():
    ...:     print( i.Index, i.Name, i.Age )
    ...:
0 Alex 10
1 Bob 12
2 Clarke 13
What I would like to achieve is to replace the value of a particular cell
In [67]: for i in df.itertuples():
    ...:     if i.Name == "Alex":
    ...:         df.at[i.Index, 'Age'] = 100
    ...:
Which seems to work
In [64]: df
Out[64]:
Name Age
0 Alex 100
1 Bob 12
2 Clarke 13
The problem is that, when using a larger, different dataset, I do the following: first, I create a new column named NETELEMENT with a default value of "", and then I try to replace that default "" with the string that the function lookup_netelement returns:
df['NETELEMENT'] = ""
for i in df.itertuples():
df.at[i.Index, 'NETELEMENT'] = lookup_netelement(i.PEER_SRC_IP)
print( i, lookup_netelement(i.PEER_SRC_IP) )
But what I get as a result is:
Pandas(Index=769, SRC_AS='', DST_AS='', COMMS='', SRC_COMMS=nan, AS_PATH='', SRC_AS_PATH=nan, PREF='', SRC_PREF='0', MED='0', SRC_MED='0', PEER_SRC_AS='0', PEER_DST_AS='', PEER_SRC_IP='x.x.x.x', PEER_DST_IP='', IN_IFACE='', OUT_IFACE='', PROTOCOL='udp', TOS='0', BPS=35200.0, SRC_PREFIX='', DST_PREFIX='', NETELEMENT='', IN_IFNAME='', OUT_IFNAME='') routerX
meaning that it should be:
NETELEMENT='routerX' instead of NETELEMENT=''
Could you please advise what I am doing wrong?
EDIT: for reasons of completeness, lookup_netelement is defined as
def lookup_netelement(ipaddr):
    try:
        x = LOOKUP['conn'].hget('ipaddr;{}'.format(ipaddr), 'dev') or b""
    except Exception as e:
        logger.error('looking up `ipaddr` for netelement caused `{}`'.format(repr(e)), exc_info=True)
        x = b""
    x = x.decode("utf-8")
    return x
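One thing worth noting before the answer: the tuple printed above is built by itertuples before the assignment runs, so it will always show the old NETELEMENT='' even when df.at has already updated the frame. A quick sketch to verify the update actually sticks:

for i in df.itertuples():
    df.at[i.Index, 'NETELEMENT'] = lookup_netelement(i.PEER_SRC_IP)
    # i is a snapshot taken before the update; read back from df instead
    print(df.at[i.Index, 'NETELEMENT'])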
I think you're looking for where for conditional replacement, i.e.:
def wow(x):
    return x ** 10

df['new'] = df['Age'].where(~(df['Name'] == 'Alex'), wow(df['Age']))
Output:
Name Age new
0 Alex 10 10000000000
1 Bob 12 12
2 Clarke 13 13
3 Alex 15 576650390625
Based on your edit, you're trying to apply the function, i.e.:
df['new'] = df['PEER_SRC_IP'].apply(lookup_netelement)
Edit: for your comment on sending two columns, use a lambda with axis=1, i.e.:
def wow(x, y):
    return '{} {}'.format(x, y)

df.apply(lambda x: wow(x['Name'], x['Age']), 1)