I am trying to use Python together with some nice functions from R. In particular, I want to use the read.transactions function, which comes from the R package arules.
I did the following steps:
1- Open Anaconda and launch RStudio
In RStudio:
2- install.packages('arules', dep = TRUE)
3- loadNamespace('arules')
4- .libPaths()
Got
[1] "D:/Anaconda3/Lib/site-packages/rpy2/R/win-library/3.4"
[2] "C:/Program Files/R/R-3.4.4/library"
Now I switch to a Jupyter notebook.
In Jupyter Notebook:
import rpy2
import rpy2.robjects as RObjects
from rpy2.robjects.packages import importr

utils = importr("utils")
d = {'print.me': 'print_dot_me', 'print_me': 'print_uscore_me'}
try:
    arules = importr('arules', robject_translations = d, lib_loc = "D:/Anaconda3/Lib/site-packages/rpy2/R/win-library/3.4")
except:
    arules = importr('arules', robject_translations = d, lib_loc = "C:/Program Files/R/R-3.4.4/library")
The outcome was:
---------------------------------------------------------------------------
RRuntimeError Traceback (most recent call last)
<ipython-input-3-5df30d28440c> in <module>()
3 try:
----> 4 arules = importr('arules', robject_translations = d, lib_loc = "D:/Anaconda3/Lib/site-packages/rpy2/R/win-library/3.4")
5 except:
~\Anaconda3\lib\site-packages\rpy2\robjects\packages.py in importr(name, lib_loc, robject_translations, signature_translation, suppress_messages, on_conflict, symbol_r2python, symbol_check_after, data)
452 _system_file(package = rname)):
--> 453 env = _get_namespace(rname)
454 version = _get_namespace_version(rname)[0]
RRuntimeError: Error in loadNamespace(name) : there is no package called 'arules'
During handling of the above exception, another exception occurred:
RRuntimeError Traceback (most recent call last)
<ipython-input-3-5df30d28440c> in <module>()
4 arules = importr('arules', robject_translations = d, lib_loc = "D:/Anaconda3/Lib/site-packages/rpy2/R/win-library/3.4")
5 except:
----> 6 arules = importr('arules', robject_translations = d, lib_loc = "C:/Program Files/R/R-3.4.4/library")
7
~\Anaconda3\lib\site-packages\rpy2\robjects\packages.py in importr(name, lib_loc, robject_translations, signature_translation, suppress_messages, on_conflict, symbol_r2python, symbol_check_after, data)
451 if _package_has_namespace(rname,
452 _system_file(package = rname)):
--> 453 env = _get_namespace(rname)
454 version = _get_namespace_version(rname)[0]
455 exported_names = set(_get_namespace_exports(rname))
RRuntimeError: Error in loadNamespace(name) : there is no package called 'arules'
So the R package could not be imported into Python. I did the same with DirichletReg and it worked, so I do not know why this one fails.
Can anyone help me with this?
importr looks in the R_HOME directory for installed R packages. I assume the arules package was not added to the library folder of R_HOME but was instead installed in some other location, say 'C:\Users\User_name\Documents\R\win-library\3.x.x', which might be causing the issue.
If that is the case, copy the arules folder from that location into the library folder of the R_HOME directory. Try this approach and see whether it gets you past the problem.
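As a quick check (a minimal sketch, assuming rpy2 itself loads fine), you can ask the embedded R interpreter which library paths it actually searches and whether it can see arules in any of them:

import rpy2.robjects as robjects

# Library paths searched by the R instance embedded in Python
print(robjects.r('.libPaths()'))

# Returns an empty string if R cannot find the package in any of those paths
print(robjects.r('system.file(package = "arules")'))

If the second call prints an empty string, arules really is invisible to that R installation, and copying (or reinstalling) it into one of the listed paths should help.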
Now, to the last part of the discovery: there is no direct equivalent of read.transactions in Python, but there is a way to reproduce what it does.
groceries <- read.transactions("groceries.csv", sep = ",")
> summary(groceries)
transactions as itemMatrix in sparse format with
9835 rows (elements/itemsets/transactions) and
169 columns (items) and a density of 0.02609146
In a Python Jupyter notebook:
1) Import the data as
import requests
url = 'https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/groceries.csv'
grocery_dataset = requests.get(url)
# Save string as txt file
f = open('grocery_dataset.txt','w')
f.write(grocery_dataset.text)
f.close()
2) Separate the data and adjust them as you wish
import csv

grocery_items = set()
with open("grocery_dataset.txt") as f:
    reader = csv.reader(f, delimiter=",")
    for i, line in enumerate(reader):
        grocery_items.update(line)

output_list = list()
with open("grocery_dataset.txt") as f:
    reader = csv.reader(f, delimiter=",")
    for i, line in enumerate(reader):
        row_val = {item: 0 for item in grocery_items}
        row_val.update({item: 1 for item in line})
        output_list.append(row_val)
3) Save it as a DataFrame in Python
import pandas as pd
grocery_df = pd.DataFrame(output_list)
Hence
grocery_df.shape
will give
(9835, 169)
which matches the number of rows and columns reported by summary(groceries) in R.
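For reference, here is a more compact sketch of the same one-hot encoding using pandas alone (it assumes the same grocery_dataset.txt file downloaded in step 1):

import csv
import pandas as pd

with open("grocery_dataset.txt") as f:
    transactions = list(csv.reader(f, delimiter=","))

# One row per transaction; items absent from a transaction become NaN and are filled with 0
grocery_df = pd.DataFrame([{item: 1 for item in row} for row in transactions]).fillna(0).astype(int)
print(grocery_df.shape)  # expected (9835, 169)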
Related
I am getting an error when trying to use FuzzyWuzzy to match columns of two different dataframes.
I want to match df_1['name_new'] to df['term'].
Below is the site where I got my code:
https://towardsdatascience.com/fuzzy-string-match-with-python-on-large-dataset-and-why-you-should-not-use-fuzzywuzzy-4ec9f0defcd
#Transform text to vectors with TF-IDF:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,2), max_df=0.9, min_df=5, token_pattern='(\S+)')
tf_idf_matrix_1 = tfidf_vectorizer.fit_transform(df_1['name_new'])
tf_idf_matrix_2 = tfidf_vectorizer.fit_transform(df['term'])
I created tf_idf_matrix_2 to match the other dataframe's 'term' column.
from scipy.sparse import csr_matrix
!pip install sparse_dot_topn
import sparse_dot_topn.sparse_dot_topn as ct
import numpy as np

def awesome_cossim_top(A, B, ntop, lower_bound=0):
    # force A and B as a CSR matrix.
    # If they have already been CSR, there is no overhead
    A = A.tocsr()
    B = B.tocsr()
    M, _ = A.shape
    _, N = B.shape

    idx_dtype = np.int32
    nnz_max = M*ntop

    indptr = np.zeros(M+1, dtype=idx_dtype)
    indices = np.zeros(nnz_max, dtype=idx_dtype)
    data = np.zeros(nnz_max, dtype=A.dtype)

    ct.sparse_dot_topn(
        M, N, np.asarray(A.indptr, dtype=idx_dtype),
        np.asarray(A.indices, dtype=idx_dtype),
        A.data,
        np.asarray(B.indptr, dtype=idx_dtype),
        np.asarray(B.indices, dtype=idx_dtype),
        B.data,
        ntop,
        lower_bound,
        indptr, indices, data)

    return csr_matrix((data, indices, indptr), shape=(M, N))
import time
t1 = time.time()
# adjust lower bound: 0.8
# keep top 10 similar results
matches = awesome_cossim_top(tf_idf_matrix_1, tf_idf_matrix_2.transpose(), 10, 0.8)
t = time.time()-t1
print("finished in:", t)
def get_matches_df(sparse_matrix, name_vector, top=10000):
    non_zeros = sparse_matrix.nonzero()

    sparserows = non_zeros[0]
    sparsecols = non_zeros[1]

    if top:
        nr_matches = top
    else:
        nr_matches = sparsecols.size

    left_side = np.empty([nr_matches], dtype=object)
    right_side = np.empty([nr_matches], dtype=object)
    similairity = np.zeros(nr_matches)

    for index in range(0, nr_matches):
        left_side[index] = name_vector[sparserows[index]]
        right_side[index] = name_vector[sparsecols[index]]
        similairity[index] = sparse_matrix.data[index]

    return pd.DataFrame({'name_new_1': left_side,
                         'term_1': right_side,
                         'similairity_score': similairity})
matches_df = pd.DataFrame()
matches_df = get_matches_df(matches, df_1['name_new'], top=10000)
# Remove all exact matches
I get an error like this:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
384 try:
--> 385 return self._range.index(new_key)
386 except ValueError as err:
ValueError: 111816 is not in range
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
4 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
385 return self._range.index(new_key)
386 except ValueError as err:
--> 387 raise KeyError(key) from err
388 raise KeyError(key)
389 return super().get_loc(key, method=method, tolerance=tolerance)
KeyError: 111816
Please help... what is wrong with my code?
I have managed to get the output I want from the script below, but I am having trouble exporting it to a csv using:
v.to_csv(n + '.csv', index=False)
I get this error:
Traceback (most recent call last):
  Python/CouponRedemptions/start.py", line 22, in <module>
    print(v['invoice_line_normal_price'])
  File "~/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 2902, in getitem
    indexer = self.columns.get_loc(key)
  File "~/.local/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2893, in get_loc
    raise KeyError(key) from err
KeyError: 'invoice_line_normal_price'
I think it is the way the DataFrame is structured; you cannot export it in its current state. I was wondering how I would go about making this work, or any suggestions on where I can start looking.
import pandas as pd
import re
r = pd.read_csv('cp.csv', low_memory=False)
r = r.filter(['shop_name','order_coupon_code','invoice_line_type','invoice_date','invoice_line_normal_price'])
r = r[r.order_coupon_code.notnull()]
r['invoice_line_normal_price'] = pd.to_numeric(r['invoice_line_normal_price'],errors = 'coerce')
n = input("Enter the coupon name: ")
nr = r[r.order_coupon_code.str.match(n,flags=re.IGNORECASE)]
nr = nr[nr.invoice_line_type.str.match('charge')]
nr = nr.sort_values('shop_name')
v = nr.groupby(['shop_name'])['invoice_line_normal_price'].value_counts().to_frame('counts')
print(v)
Example of the csv data:
shop_name order_coupon_code invoice_line_type invoice_date invoice_line_normal_price moresome moreother hello
0 shop1 nv55 sell 01.01.2016 01:00:00.000 15.0 3 tt hi
1 shop2 nv44 quote 01.01.2016 02:00:00.000 22.0 4 rr hey
2 shop3 nv22 charge 01.01.2016 03:00:00.000 27.0 5 dd what
import pandas as pd
# The low_memory option is not properly deprecated, but it should be, since it does not actually do anything differently
r = pd.read_csv('cp.csv')
print(r)
# r = r.loc[:,['shop_name', 'order_coupon_code', 'invoice_line_type', 'invoice_date', 'invoice_line_normal_price']]
r = r.filter(['shop_name','order_coupon_code','invoice_line_type','invoice_date','invoice_line_normal_price'])
r = r[r.order_coupon_code.notnull()]
r['invoice_line_normal_price'] = pd.to_numeric(r['invoice_line_normal_price'],errors = 'coerce')
# Enter the coupon name: nv22
n = input("Enter the coupon name: ")
nr = r[r.order_coupon_code.str.contains(n.lower())]
nr = nr[nr.invoice_line_type.str.match('charge')]
nr = nr.sort_values('shop_name')
v = nr.groupby(['shop_name'])['invoice_line_normal_price'].value_counts().to_frame('counts')
print(v)
v.to_csv(n + '.csv', index=False)
The output:
shop_name invoice_line_normal_price
shop3 27.0 1
Let's say you need to append more rows to a single csv file:
v.to_csv(n + '.csv',mode='a', index=False)
And with no header:
v.to_csv(n + '.csv',mode='a', index=False,header=False)
Just to make sure: this error means that the column name is not in your csv file, so check the column names in your csv file.
get_loc raise KeyError(key) from err KeyError: 'invoice_line_normal_price'
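A minimal sketch of that check (assuming the same cp.csv file as above):

import pandas as pd

r = pd.read_csv('cp.csv')
# Check whether 'invoice_line_normal_price' appears here, spelled exactly the same way
print(r.columns.tolist())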
I'm extracting from data which is of type dictionary.
import urllib3
import json
http = urllib3.PoolManager()
url = 'https://raw.githubusercontent.com/leanhdung1994/BigData/main/fr-esr-principaux-etablissements-enseignement-superieur.json'
f = http.request('GET', url)
data = json.loads(f.data.decode('utf-8'))
data[0]["geometry"]["coordinates"]
geo = []
n = len(data)
for i in range(n):
    geo.append(data[i]["geometry"]["coordinates"])
It returns an error
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-26-52e67ffdcaa6> in <module>
12 n = len(data)
13 for i in range(n):
---> 14 geo.append(data[i]["geometry"]["coordinates"])
KeyError: 'geometry'
This is weird because when I only run data[0]["geometry"]["coordinates"], it returns [7.000275, 43.58554] without error.
Could you please elaborate on this issue?
The error occurs because a few of the response dictionaries don't have a "geometry" key.
Check that the "geometry" key exists in a response dict before appending to the geo list.
Try the following code:
import urllib3
import json

http = urllib3.PoolManager()
url = 'https://raw.githubusercontent.com/leanhdung1994/BigData/main/fr-esr-principaux-etablissements-enseignement-superieur.json'
f = http.request('GET', url)
data = json.loads(f.data.decode('utf-8'))

geo = []
n = len(data)
for i in range(n):
    if "geometry" in data[i]:
        geo.append(data[i]["geometry"]["coordinates"])

print(geo)
I believe the problem is that there are entries in your data which do not have a "geometry" key. As a preliminary matter, your data structure is not technically a dictionary; it is a list of dictionaries. You can tell that by using the print(type(data)) and print(type(data[0])) commands.
I took your code but added the following lines:
dataStructure = data[0]
print(type(dataStructure))

geo = []
n = len(data)
for i in range(321):
    try:
        geo.append(data[i]["geometry"]["coordinates"])
    except:
        print(i)
If you run this, you will see that at index positions 64 and 130, there is no geometry key. You may want to explore those entries specifically and see whether they should be removed from your data or whether you just need to alter the keyword to something else for those lines.
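If you decide simply to drop those entries, one minimal sketch (reusing the data list already loaded above) would be:

# Keep only the entries that actually carry a "geometry" key
geo = [d["geometry"]["coordinates"] for d in data if "geometry" in d]
print(len(geo), len(data))  # how many entries were kept versus the total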
I was trying to import multiple csv files into a sqlite database as multiple tables (using a Jupyter notebook in Python 3). The name of each file will be the name of its table. I have defined a function to convert the encoding to utf-8 as below:
import sqlite3
import glob
import csv
import sys
import os

def convert_to_utf8(dirname):
    for filename in glob.glob(os.path.join(dirname, '*.csv')):
        ifp = open(filename, "rt", encoding='cp1252')
        input_data = ifp.read()
        ifp.close()

        ofp = open(filename + ".fix", "wt", encoding='utf-8')
        for c in input_data:
            if c != '\0':
                ofp.write(c)
        ofp.close()
    return
All the files are in the same folder; staging_dir_name_1 is where they live. And I have the code below to convert the csv files into tables (some of it is from similar questions on Stack Overflow):
convert_to_utf8(staging_dir_name_1)

conn = sqlite3.connect("medicare_hospital_compare_1.db")
c = conn.cursor()

for filename in glob.glob(os.path.join(staging_dir_name_1, '*.csv')):
    with open(filename, "rb") as f:
        data = csv.DictReader(f)
        cols = data.fieldnames
        tablename = os.path.splitext(os.path.basename(filename))[0]

        sql_str = "drop table if exists %s" % tablename
        c.execute(sql_str)

        sql_str = "create table if not exists %s (%s)" % (tablename, ','.join(["%s text" % col for col in cols]))
        c.execute(sql_str)

        sql_str = "insert into %s values (%s)" % (tablename, ','.join(["?" for col in cols]))
        c.executemany(sql_str, (list(map(row.get, cols)) for row in data))

conn.commit()
But when I run this I get this error:
Error Traceback (most recent call last)
<ipython-input-29-be7c1f43e4c5> in <module>()
      2     with open(filename, "rb") as f:
      3         data = csv.DictReader(f)
----> 4         cols = data.fieldnames
      5         tablename = os.path.splitext(os.path.basename(filename))[0]
      6

C:\Users\dupin\Anaconda3\lib\csv.py in fieldnames(self)
     96         if self._fieldnames is None:
     97             try:
---> 98                 self._fieldnames = next(self.reader)
     99             except StopIteration:
    100                 pass

Error: iterator should return strings, not bytes (did you open the file in text mode?)
Could anyone help me resolve this issue? I have been thinking about it for a while but still couldn't figure out how to fix it.
**===UPDATE===**
Now I have changed 'rb' to 'rt' and I get a new error about NULL values. I thought the first function had already removed all the null values.
Error Traceback (most recent call last)
<ipython-input-77-68d56c0b4cf2> in <module>()
3
4 data = csv.DictReader(f)
----> 5 cols = data.fieldnames
6 table = os.path.splitext(os.path.basename(filename))[0]
7
C:\Users\dupin\Anaconda3\lib\csv.py in fieldnames(self)
96 if self._fieldnames is None:
97 try:
---> 98 self._fieldnames = next(self.reader)
99 except StopIteration:
100 pass
Error: line contains NULL byte
I'm attempting to use numba to optimise some code. I've worked through the initial examples in section 1.3.1 of the 0.26.0 user guide (http://numba.pydata.org/numba-doc/0.26.0/user/jit.html) and get the expected results, so I don't think the problem is the installation.
Here's my code:
import numba
import numpy
import random

a = 8
b = 4

def my_function(a, b):
    all_values = numpy.fromiter(range(a), dtype = int)
    my_array = []
    for n in range(a):
        some_values = (all_values[all_values != n]).tolist()
        c = random.sample(some_values, b)
        my_array.append(sorted([n] + c))
    return my_array

print(my_function(a, b))

my_function_numba = numba.jit()(my_function)
print(my_function_numba(a, b))
After printing the expected results from the my_function call, this returns the following error message:
ValueError Traceback (most recent call last)
<ipython-input-8-b5d8983a58f6> in <module>()
19 my_function_numba = numba.jit()(my_function)
20
---> 21 print(my_function_numba(a, b))
ValueError: cannot compute fingerprint of empty list
Fingerprint of empty list?
I'm not sure about that error in particular, but in general, to be fast, numba requires a particular subset of numpy/python (see here and here for more). So I might rewrite it like this:
import numba
import numpy as np

@numba.jit(nopython=True)
def fast_my_function(a, b):
    all_values = np.arange(a)
    my_array = np.empty((a, b + 1), dtype=np.int32)
    for n in range(a):
        some = all_values[all_values != n]
        c = np.empty(b + 1, dtype=np.int32)
        c[1:] = np.random.choice(some, b)
        c[0] = n
        c.sort()
        my_array[n, :] = c
    return my_array
Main things to note:
- No lists; I'm pre-allocating everything.
- No use of generators (in both Python 2 & 3, for n in range(a) will get converted to a fast native loop).
- Adding nopython=True to the decorator makes it so numba will complain if I use something that can't be efficiently JITed.
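For completeness, a quick usage check (assuming the JIT compilation succeeds on your numba version; the sampled values are random, so exact numbers will vary):

result = fast_my_function(8, 4)
print(result.shape)  # (8, 5): one row per n, holding n plus b sampled values, sorted
print(result[0])     # the first row always starts with 0, the smallest value in that row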