How can I do something similar to VLOOKUP in Excel with urlparse? - python-3.x

I need to compare two sets of data from CSVs: one (csv1) with a column 'listing_url', the other (csv2) with columns 'parsed_url' and 'url_code'. I would like to use the result of urlparse on csv1 (specifically the netloc) to compare against csv2's 'parsed_url' and output the matching value from 'url_code' to a csv.
from urllib.parse import urlparse
import pandas as pd

scr = pd.read_csv('csv2', usecols=['parsed_url', 'url_code'])
data = pd.read_csv('csv1')

L = data.values.T[0].tolist()  # the 'listing_url' column as a list
T = pd.Series([scr])
for i in L:
    n = urlparse(i)
    nf = pd.Series([n.netloc])
I'm stuck trying to convert the data into objects I can use map with, if map is even the right tool here.
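One way to do the VLOOKUP-style lookup, sketched below with the question's column names (the file names csv1/csv2 and the output name out.csv are placeholders): build a Series mapping parsed_url to url_code from csv2, reduce each listing_url to its netloc, and map it through that Series.
from urllib.parse import urlparse
import pandas as pd

data = pd.read_csv('csv1')
scr = pd.read_csv('csv2', usecols=['parsed_url', 'url_code'])

# netloc -> url_code lookup table, playing the role of the VLOOKUP range
lookup = scr.set_index('parsed_url')['url_code']

# parse each listing URL down to its netloc, then look the netloc up
data['url_code'] = data['listing_url'].map(lambda u: urlparse(u).netloc).map(lookup)
data.to_csv('out.csv', index=False)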

Related

Remove '-' and two following characters using regex?

I have a set of data from the laboratory where the date column looks like this:
12-15.11.12
19-22.11.12
26-29.11.12
03-06.12.12
10-13.12.12
17-20.12.12
19-23.12.12
27-30.12.12
02-05.01.13
I only want the first value (the day of sampling) so I can convert it into a pandas datetime series and continue working with the data.
I know I could delete it manually in Excel, but I would like to do it in code. So my goal is, for example:
12-15.11.12 -> 12.11.2012, i.e. '-15' gets deleted.
You can use re.sub with the -\d+ pattern (regex101):
import re
data = '''\
12-15.11.12
19-22.11.12
26-29.11.12
03-06.12.12
10-13.12.12
17-20.12.12
19-23.12.12
27-30.12.12
02-05.01.13'''
data = re.sub(r'-\d+', '', data)
print(data)
Prints:
12.11.12
19.11.12
26.11.12
03.12.12
10.12.12
17.12.12
19.12.12
27.12.12
02.01.13
Alternatively, keep just the leading day number with a capture group:
import re

dates = [
    "12-15.11.12",
    "19-22.11.12",
    "26-29.11.12",
    "03-06.12.12",
    "10-13.12.12",
    "17-20.12.12",
    "19-23.12.12",
    "27-30.12.12",
    "02-05.01.13"
]
cleaned_dates = []
for date in dates:
    # keep the digits captured before the hyphen, drop '-' and the digits after it
    date = re.sub(r"(\d+)-\d+", r"\1", date)
    cleaned_dates.append(date)
print(cleaned_dates)
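Since the stated goal is a pandas datetime series, a short follow-up sketch (the %y format reads the two-digit years as 2012/2013):
import pandas as pd

s = pd.to_datetime(pd.Series(cleaned_dates), format='%d.%m.%y')
print(s.dt.date.tolist())  # [datetime.date(2012, 11, 12), ...]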

How do you take one dataframe that covers multiple years, and break it into a separate DF for each year

I already looked on SE and couldn't find an answer to my question. I am still new to this.
I am trying to take a purchasing csv file and break it into separate dataframes for each year.
For example, if I have a listing with full dates in MM/DD/YYYY format, I am trying to separate them into dataframes for each year. Like Ord2015, Ord2014, etc...
I tried to convert the full date into just the year, and I also attempted to use slicing to look at only the last four characters of the date, to no avail.
Here is my current (incomplete) attempt:
import pandas as pd

purch1 = pd.read_csv('purchases.csv')

# Remove unneeded fluff
del_colmn = ['pid', 'notes', 'warehouse_id', 'env_notes', 'budget_notes']
purch1 = purch1.drop(del_colmn, 1)

# break down by year only
purch1.sort_values(by=['order_date'])
Ord2015 = ()
Ord2014 = ()
for purch in purch1:
    Order2015.add(purch1['order_date'] == 2015)
Per request by @anon01, here are the results of the code you had me run. I only used a sample of four rows, as that was all I was initially playing with; the full record has almost 20k lines, so I only pulled aside a few to play with.
'{"pid":{"0":75,"2":95,"3":117,"1":82},"env_id":{"0":12454,"2":12532,"3":12623,"1":12511},"ord_date":{"0":"10\/2\/2014","2":"11\/22\/2014","3":"2\/17\/2015","1":"11\/8\/2014"},"cost_center":{"0":"Ops","2":"Cons","3":"Net","1":"Net"},"dept":{"0":"Ops","2":"Cons","3":"Ops","1":"Ops"},"signing_mgr":{"0":"M. Dodd","2":"L. Price","3":"M. Dodd","1":"M. Dodd"},"check_num":{"0":null,"2":null,"3":null,"1":82301.0},"rec_date":{"0":"10\/11\/2014","2":"12\/2\/2014","3":"3\/1\/2015","1":"11\/20\/2014"},"model":{"0":null,"2":null,"3":null,"1":null},"notes":{"0":"Shipped to east WH","2":"Rec'd by L.Price","3":"Shipped to Client (1190)","1":"Rec'd by K. Wilson"},"env_notes":{"0":"appr by K.Polt","2":"appr by S. Crane","3":"appr by K.Polt","1":"appr by K.Polt"},"budget_notes":{"0":null,"2":"OOB expense","3":"Bill to client","1":null},"cost_year":{"0":2014.0,"2":2015.0,"3":null,"1":2014.0}}'
You can add parse_dates to read_csv to convert the column to datetimes, and then create a dictionary of DataFrames, dfs; selecting a year is then just a key lookup:
purch1 = pd.read_csv('purchases.csv', parse_dates=['ord_date'])
dfs = dict(tuple(purch1.groupby(purch1['ord_date'].dt.year)))
Ord2015 = dfs[2015]
Ord2016 = dfs[2016]
It is not recommended, but it is possible to create a DataFrame per year group as a global variable:
for i, g in purch1.groupby(purch1['ord_date'].dt.year):
    globals()['Ord' + str(i)] = g
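To illustrate the dictionary approach end to end, here is the same pattern on a hypothetical two-row frame standing in for purchases.csv; dfs.get avoids a KeyError when a year is absent:
import pandas as pd

purch1 = pd.DataFrame({
    'ord_date': pd.to_datetime(['10/2/2014', '2/17/2015'], format='%m/%d/%Y'),
    'cost_center': ['Ops', 'Net'],
})
dfs = dict(tuple(purch1.groupby(purch1['ord_date'].dt.year)))
Ord2015 = dfs.get(2015)  # returns None instead of raising if 2015 is missing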

How to append a list of columns to a list

I'm trying to add a new column from a csv to a table built from the same csv. I'm trying to use append, but it's not working; it says "'numpy.ndarray' object has no attribute 'append'".
import pandas as pd
import numpy as np
path = r"D:\python projects\volcano_data_2010.csv"
data = pd.read_csv(path)
data_used = data.iloc[:,[1,2,8,9]].values
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan,strategy='mean')
data_used = imp.fit_transform(data_used) #so far ok
data_used = data_used.append([data.iloc[:,7].values])
print(data_used)
The append method is only available on the list datatype. Since your data is a NumPy array, you should use np.append instead, but note that without an axis it appends the values as a flat array:
a1 = np.append(data_used, data.iloc[:,7])
If you want to append it as a column, you should use the np.column_stack function:
a2 = np.column_stack((data_used, data.iloc[:,7]))
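A tiny illustration of the difference, with toy arrays standing in for data_used and the extra column:
import numpy as np

a = np.array([[1, 2], [3, 4]])  # stand-in for data_used
b = np.array([9, 8])            # stand-in for data.iloc[:, 7]

print(np.append(a, b))          # [1 2 3 4 9 8] -- flattened to 1-D
print(np.column_stack((a, b)))  # [[1 2 9], [3 4 8]] -- b becomes a new column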

Web Scraping - stock prices, quandl

I have a quick question. My code looks like below:
import quandl

names_of_company = ['KGHM','INDYKPOL','KRUK','KRUSZWICA']
for names in names_of_company:
    x = quandl.get('WSE/{names_of_company}', start_date='2018-11-26',
                   end_date='2018-11-29')
I am trying to get all the data in one output, but I can't work out how to substitute each company name in turn. Do you have any ideas?
Thanks for the help
Unless I'm missing something, it looks like you should just be able to do a pretty basic for loop; it was the string-formatting syntax that was incorrect.
import quandl
import pandas as pd

names_of_company = ['KGHM','INDYKPOL','KRUK','KRUSZWICA']
results = pd.DataFrame()
for names in names_of_company:
    x = quandl.get('WSE/%s' % names, start_date='2018-11-26',
                   end_date='2018-11-29')
    x['company'] = names
    results = results.append(x).reset_index(drop=True)
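A side note: DataFrame.append was removed in pandas 2.0, so on current pandas the same loop would collect the frames in a list and concatenate once (a sketch using the same names as above):
frames = []
for names in names_of_company:
    x = quandl.get('WSE/%s' % names, start_date='2018-11-26',
                   end_date='2018-11-29')
    x['company'] = names
    frames.append(x)
results = pd.concat(frames).reset_index(drop=True)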

reading 300k cells in excel using read_only in openpyxl not enough

I've read quite a few questions on here about reading large excel files with openpyxl and the read_only param in load_workbook(), and I've done it successfully with source excels of around 50x30, but when I try it on a workbook with a 30x1100 sheet, it stalls. Right now, the code just reads in the excel and transfers it to a multi-dimensional array.
from openpyxl import load_workbook

def transferCols(refws, mx, refCol, newCol, header):
    rmax = refws.max_row
    for r in range(1, rmax + 1):
        if r == 1:
            mx[r-1][newCol-1] = header
        else:
            mx[r-1][newCol-1] = refws.cell(row=r, column=refCol).value
    return

ref_wb = load_workbook("UESfull.xlsx", read_only=True)
ref_ws = ref_wb.active
rmax = ref_ws.max_row
matrix = [["fill" for col in range(30)] for row in range(rmax)]
print("step ", 1)
transferCols(ref_ws, matrix, 1, 1, "URL")
...
I only put the print("step") line in to track the progress, but surprisingly, it stalls at step 1! I just don't know if the structure is poor or if 300k cells is too much for openpyxl. I haven't even begun to write to my output excel yet! Thanks in advance!
I suspect that you have an undimensioned worksheet, so ws.max_row is unknown. If this is the case, ws.calculate_dimension() will tell you; then you should just iterate over the rows of both sheets in parallel.
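Separately, in read-only mode every ws.cell() call forces openpyxl to re-parse the sheet XML, which by itself can explain the stall; streaming the rows once with iter_rows is the usual fix. A minimal sketch reusing the question's file name:
from openpyxl import load_workbook

ref_wb = load_workbook("UESfull.xlsx", read_only=True)
ref_ws = ref_wb.active

# one pass over the sheet; values_only yields plain values, not Cell objects
matrix = [list(row) for row in ref_ws.iter_rows(values_only=True)]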
Instead of trying to read a large excel in openpyxl, try pandas; it should get you a better result, and pandas has better functions for the data cleaning you will need to do.
Here is an example where 10,000 rows and 30 columns of data are written and read back in pandas:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10000,30))
df.to_excel('test.xlsx')
df1 = pd.read_excel('test.xlsx')
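One detail worth knowing about this round trip: to_excel writes the row index as an extra first column by default, so the frame read back has 31 columns instead of 30. Passing index=False keeps the shapes matching:
df.to_excel('test.xlsx', index=False)  # don't write the row index
df1 = pd.read_excel('test.xlsx')       # df1.shape == (10000, 30)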
