How to save scraped data in a DB? - python-3.x

I'm trying to save scraped data in a database but got stuck.
First I save the scraped data to a CSV file, then use the glob library to find the newest CSV and upload that CSV's data into the DB.
I'm not sure what I'm doing wrong here; please find the code and error below.
I have created a table yahoo_data in the DB with the same column names as the CSV / my code's output.
import scrapy
from scrapy.http import Request
import MySQLdb
import os
import csv
import glob

class YahooScrapperSpider(scrapy.Spider):
    name = 'yahoo_scrapper'
    allowed_domains = ['in.news.yahoo.com']
    start_urls = ['http://in.news.yahoo.com/']

    def parse(self, response):
        news_url = response.xpath('//*[@class="Mb(5px)"]/a/@href').extract()
        for url in news_url:
            absolute_url = response.urljoin(url)
            yield Request(absolute_url, callback=self.parse_text)

    def parse_text(self, response):
        Title = response.xpath('//meta[contains(@name,"twitter:title")]/@content').extract_first()
        # response.xpath('//*[@name="twitter:title"]/@content').extract_first(), this also works
        Article = response.xpath('//*[@class="canvas-atom canvas-text Mb(1.0em) Mb(0)--sm Mt(0.8em)--sm"]/text()').extract()
        yield {'Title': Title,
               'Article': Article}

    def close(self, reason):
        csv_file = max(glob.iglob('*.csv'), key=os.path.getctime)
        mydb = MySQLdb.connect(host='localhost',
                               user='root',
                               passwd='prasun',
                               db='books')
        cursor = mydb.cursor()
        csv_data = csv.reader(csv_file)
        row_count = 0
        for row in csv_data:
            if row_count != 0:
                cursor.execute('INSERT IGNORE INTO yahoo_data (Title,Article) VALUES(%s, %s)', row)
            row_count += 1
        mydb.commit()
        cursor.close()
Getting this error:
ana. It should be directed not to disrespect the Sikh community and hurt its sentiments by passing such arbitrary and uncalled for orders," said Badal.', 'The SAD president also "brought it to the notice of the Haryana chief minister that Article 25 of the constitution safeguarded the rights of all citizens to profess and practices the tenets of their faith."', '"Keeping these facts in view I request you to direct the Haryana Public Service Commission to rescind its notification and allow Sikhs as well as candidates belonging to other religions to sport symbols of their faith during all examinations," said Badal. (ANI)']}
2019-04-01 16:49:41 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-01 16:49:41 [scrapy.extensions.feedexport] INFO: Stored csv feed (25 items) in: items.csv
2019-04-01 16:49:41 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method YahooScrapperSpider.close of <YahooScrapperSpider 'yahoo_scrapper' at 0x2c60f07bac8>>
Traceback (most recent call last):
File "C:\Users\prasun.j\AppData\Local\Continuum\anaconda3\lib\site-packages\MySQLdb\cursors.py", line 201, in execute
query = query % args
TypeError: not enough arguments for format string
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\prasun.j\AppData\Local\Continuum\anaconda3\lib\site-packages\twisted\internet\defer.py", line 151, in maybeDeferred
result = f(*args, **kw)
File "C:\Users\prasun.j\AppData\Local\Continuum\anaconda3\lib\site-packages\pydispatch\robustapply.py", line 55, in robustApply
return receiver(*arguments, **named)
File "C:\Users\prasun.j\Desktop\scrapping\scrapping\spiders\yahoo_scrapper.py", line 44, in close
cursor.execute('INSERT IGNORE INTO yahoo_data (Title,Article) VALUES(%s, %s)', row)
File "C:\Users\prasun.j\AppData\Local\Continuum\anaconda3\lib\site-packages\MySQLdb\cursors.py", line 203, in execute
raise ProgrammingError(str(m))
MySQLdb._exceptions.ProgrammingError: not enough arguments for format string
2019-04-01 16:49:41 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 7985,
'downloader/request_count': 27,
'downloader/request_method_count/GET': 27,
'downloader/response_bytes': 2148049,
'downloader/response_count': 27,
'downloader/response_status_count/200': 26,
'downloader/response_status_count/301': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 4, 1, 11, 19, 41, 350717),
'item_scraped_count': 25,
'log_count/DEBUG': 53,
'log_count/ERROR': 1,
'log_count/INFO': 8,
'request_depth_max': 1,
'response_received_count': 26,
'scheduler/dequeued': 27,
'scheduler/dequeued/memory': 27,
'scheduler/enqueued': 27,
'scheduler/enqueued/memory': 27,
'start_time': datetime.datetime(2019, 4, 1, 11, 19, 36, 743594)}
2019-04-01 16:49:41 [scrapy.core.engine] INFO: Spider closed (finished)

This error
MySQLdb._exceptions.ProgrammingError: not enough arguments for format string
seems to be caused by a row that does not contain enough values for the two placeholders in your query.
You can try printing the row to understand what is going wrong.
Anyway, if you want to save scraped data to a DB, I suggest writing a simple item pipeline that exports the data to the DB directly, without passing through a CSV.
For further information about item pipelines, see http://doc.scrapy.org/en/latest/topics/item-pipeline.html#topics-item-pipeline
You can find a useful example at Writing items to a MySQL database in Scrapy
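For illustration, here is a minimal sketch of such a pipeline, reusing the MySQLdb connection settings and yahoo_data table from the question; the class name and module path are hypothetical (the module path is a guess based on the traceback's project name):

import MySQLdb

class MySQLStorePipeline:
    def open_spider(self, spider):
        # open one connection for the whole crawl
        self.db = MySQLdb.connect(host='localhost', user='root',
                                  passwd='prasun', db='books',
                                  charset='utf8mb4')
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        # Article is scraped as a list of text fragments, so join it first
        article = ' '.join(item.get('Article') or [])
        self.cursor.execute(
            'INSERT IGNORE INTO yahoo_data (Title, Article) VALUES (%s, %s)',
            (item.get('Title'), article))
        self.db.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.db.close()

You would then enable it in settings.py, for example:

ITEM_PIPELINES = {'scrapping.pipelines.MySQLStorePipeline': 300}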

It seems like you are passing a list where the query expects the values as separate parameters.
Try adding an asterisk to the 'row' var, i.e. change:
cursor.execute('INSERT IGNORE INTO yahoo_data (Title,Article) VALUES(%s, %s)', row)
to:
cursor.execute('INSERT IGNORE INTO yahoo_data (Title,Article) VALUES(%s, %s)', *row)
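For reference, MySQLdb's cursor.execute also accepts the values as a single parameter sequence, so a minimal, hedged variant is to sanity-check the row and pass it as a tuple (this assumes each CSV row really has exactly the two expected fields):

if len(row) == 2:
    # pass the two values as one parameter tuple
    cursor.execute('INSERT IGNORE INTO yahoo_data (Title, Article) VALUES (%s, %s)', tuple(row))
else:
    # printing the row helps to see why the placeholder count does not match
    print('unexpected row:', row)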

Related

Scrapy using files.middleware downloads given file without extension

I want to automate some file exchange. I need to download a .csv file from a website that is protected by authentication before you can start the download.
First I tried downloading the file with wget, but I did not manage, so I switched to Scrapy and everything works fine, my authentication and the download, BUT the file comes without an extension.
Here is a snippet of my spider:
def after_login(self, response):
    accview = response.xpath('//span[@class="user-actions welcome"]')
    if accview:
        print('Logged in')
        file_url = response.xpath('//article[@class="widget-single widget-shape-widget widget"]/p/a/@href').get()
        file_url = response.urljoin(file_url)
        items = StockfwItem()
        items['file_urls'] = [file_url]
        yield items
my settings.py:
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
items.py:
file_urls = scrapy.Field()
files = scrapy.Field()
The reason why I am sure that there is a problem with my spider is that if I download the file regularly via the browser, it always comes as a regular CSV file.
When I try to open the downloaded file (the filename is hashed with SHA1), I get the following error message:
File "/usr/lib/python3.6/csv.py", line 111, in __next__
self.fieldnames
File "/usr/lib/python3.6/csv.py", line 98, in fieldnames
self._fieldnames = next(self.reader)
_csv.Error: line contains NULL byte
Also, when I open the downloaded file with Notepad++ and save the encoding as UTF-8, it works without any problems...
Scrapy console output:
{'file_urls': ['https://floraworld.be/Servico.Orchard.FloraWorld/Export/Export'],
 'files': [{'checksum': 'f56c6411803ec45863bc9dbea65edcb9',
            'path': 'full/cc72731cc79929b50c5afb14e0f7e26dae8f069c',
            'status': 'downloaded',
            'url': 'https://floraworld.be/Servico.Orchard.FloraWorld/Export/Export'}]}
2021-08-02 10:00:30 [scrapy.core.engine] INFO: Closing spider (finished)
2021-08-02 10:00:30 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2553,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 2,
'downloader/request_method_count/POST': 2,
'downloader/response_bytes': 76289,
'downloader/response_count': 4,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/302': 1,
'elapsed_time_seconds': 20.892172,
'file_count': 1,
'file_status_count/downloaded': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 8, 2, 8, 0, 30, 704638),
'item_scraped_count': 1,
'log_count/DEBUG': 6,
'log_count/INFO': 10,
'log_count/WARNING': 1,
'memusage/max': 55566336,
'memusage/startup': 55566336,
'request_depth_max': 1,
'response_received_count': 3,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'splash/render.html/request_count': 1,
'splash/render.html/response_count/200': 1,
'start_time': datetime.datetime(2021, 8, 2, 8, 0, 9, 812466)}
2021-08-02 10:00:30 [scrapy.core.engine] INFO: Spider closed (finished)
Snippet of the downloaded file, opened with vim on my Ubuntu server:
"^#A^#r^#t^#i^#c^#l^#e^#C^#o^#d^#e^#"^#;^#"^#D^#e^#s^#c^#r^#i^#p^#t^#i^#o^#n^#"^#;^#"^#B^#B^#"^#;^#"^#K^#T^#"^#;^#"^#S^#i^#z^#e^#"^#;^#"^#P^#r^#i^#c^#e^#"^#;^#"^#S^#t^#o^#c^#k^#"^#;^#"^#D^#e^#l^#i^#v^#e^#r^#y^#D^#a^#t^#e^#"^#^M^#
^#"^#1^#0^#0^#0^#L^#"^#;^#"^#A^#l^#o^#e^# ^#p^#l^#a^#n^#t^# ^#x^# ^#2^#3^# ^#l^#v^#s^#"^#;^#"^#4^#"^#;^#"^#4^#"^#;^#"^#6^#5^#"^#;^#"^#4^#6^#,^#7^#7^#"^#;^#"^#1^#1^#8^#,^#0^#0^#0^#0^#0^#"^#;^#"^#"^#^M^#
^#"^#1^#0^#0^#0^#M^#"^#;^#"^#A^#l^#o^#e^# ^#p^#l^#a^#n^#t^# ^#x^# ^#1^#7^# ^#l^#v^#s^#"^#;^#"^#4^#"^#;^#"^#1^#2^#"^#;^#"^#5^#0^#"^#;^#"^#3^#2^#,^#6^#1^#"^#;^#"^#2^#0^#6^#,^#0^#0^#0^#0^#0^#"^#;^#"^#"^#^M^#
^#"^#1^#0^#0^#0^#S^#"^#;^#"^#A^#l^#o^#e^# ^#p^#l^#a^#n^#t^# ^#x^# ^#1^#6^# ^#l^#v^#s^#"^#;^#"^#4^#"^#;^#"^#2^#4^#"^#;^#"^#4^#0^#"^#;^#"^#2^#2^#,^#3^#2^#"^#;^#"^#-^#6^#,^#0^#0^#0^#0^#0^#"^#;^#"^#2^#3^#/^#0^#8^#/^#2^#0^#2^#1^#"^#^M^#
^#"^#1^#0^#0^#2^#M^#"^#;^#"^#B^#A^#T^#O^#N^# ^#P^#L^#A^#N^#T^# ^#6^#7^# ^#C^#M^# ^#W^#/^#P^#O^#T^#"^#;^#"^#2^#"^#;^#"^#6^#"^#;^#"^#6^#7^#"^#;^#"^#2^#2^#,^#4^#2^#"^#;^#"^#3^#3^#,^#0^#0^#0^#0^#0^#"^#;^#"^#5^#/^#0^#9^#/^#2^#0^#2^#1^#"^#^M^#
^#"^#1^#0^#0^#2^#S^#"^#;^#"^#B^#A^#T^#O^#N^# ^#P^#L^#A^#N^#T^# ^#4^#2^# ^#C^#M^# ^#W^#/^#P^#O^#T^#"^#;^#"^#4^#"^#;^#"^#1^#2^#"^#;^#"^#4^#2^#"^#;^#"^#1^#0^#,^#5^#4^#"^#;^#"^#-^#9^#5^#,^#0^#0^#0^#0^#0^#"^#;^#"^#5^#/^#0^#9^#/^#2^#0^#2^#1^#"^#^M^#
^#"^#1^#0^#0^#4^#N^#"^#;^#"^#B^#a^#t^#o^#n^# ^#P^#l^#a^#n^#t^#"^#;^#"^#2^#"^#;^#"^#2^#"^#;^#"^#9^#9^#"^#;^#"^#1^#2^#0^#,^#9^#5^#"^#;^#"^#5^#3^#,^#0^#0^#0^#0^#0^#"^#;^#"^#3^#0^#/^#0^#9^#/^#2^#0^#2^#1^#"^#^M^#
^#"^#1^#0^#0^#5^#N^#"^#;^#"^#N^#a^#t^#u^#r^#a^#l^# ^#s^#t^#r^#e^#l^#i^#t^#z^#i^#a^# ^#w^#/^#p^#o^#t^#"^#;^#"^#1^#"^#;^#"^#1^#"^#;^#"^#1^#3^#0^#"^#;^#"^#2^#0^#7^#,^#4^#4^#"^#;^#"^#1^#4^#,^#0^#0^#0^#0^#0^#"^#;^#"^#1^#/^#1^#2^#/^#2^#0^#2^#1^#"^#^M^#
What the heck is this??
When I change the filename to file.csv, download the file to my Windows desktop and open it with Notepad++ again, it looks good:
"ArticleCode";"Description";"BB";"KT";"Size";"Price";"Stock";"DeliveryDate"
"1000L";"Aloe plant x 23 lvs";"4";"4";"65";"46,77";"118,00000";""
"1000M";"Aloe plant x 17 lvs";"4";"12";"50";"32,61";"206,00000";""
"1000S";"Aloe plant x 16 lvs";"4";"24";"40";"22,32";"-6,00000";"23/08/2021"
"1002M";"BATON PLANT 67 CM W/POT";"2";"6";"67";"22,42";"33,00000";"5/09/2021"
"1002S";"BATON PLANT 42 CM W/POT";"4";"12";"42";"10,54";"-95,00000";"5/09/2021"
For all those who suffer from the same problem:
I just ran this in my terminal:
cat Inputfile | tr -d '\0' > Outputfile.csv
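For reference, a Python equivalent of that null-byte cleanup is sketched below; it simply strips the NUL bytes the same way tr does (the file names are placeholders, and if the file is really UTF-16-encoded, decoding it explicitly would be the cleaner fix):

# strip NUL bytes from the downloaded file, mirroring `tr -d '\0'`
with open('Inputfile', 'rb') as src:
    data = src.read().replace(b'\x00', b'')

with open('Outputfile.csv', 'wb') as dst:
    dst.write(data)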
First of all, try to change the encoding in vim:
set fileencodings=utf-8
or open it in a different text editor on your Ubuntu machine; maybe it's just a problem with vim.
The second thing to do is to download the file with the correct name:
import os
from urllib.parse import unquote

from itemadapter import ItemAdapter
from scrapy.pipelines.files import FilesPipeline
from scrapy.http import Request

class TempPipeline():
    def process_item(self, item, spider):
        return item

class ProcessPipeline(FilesPipeline):
    # Overridable Interface
    def get_media_requests(self, item, info):
        urls = ItemAdapter(item).get(self.files_urls_field, [])
        return [Request(u) for u in urls]

    def file_path(self, request, response=None, info=None, *, item=None):
        # return 'files/' + os.path.basename(urlparse(request.url).path)  # from the Scrapy documentation
        return os.path.basename(unquote(request.url))  # this is what worked for my project, but maybe you'll want to add ".csv"
You also need to change settings.py:
ITEM_PIPELINES = {
    'myproject.pipelines.ProcessPipeline': 300,
}
FILES_STORE = '/path/to/valid/dir'
Try those two things, and if it still doesn't work then please update me.
I think your file contains null bytes.
The issue might be:
Your items.py contains two fields, file_urls and files, but your spider yields only one of them, i.e. file_urls. Thus the CSV gets created with two columns (file_urls, files), but the files column does not contain any data (which might be causing this problem). Try commenting out this line and see if it works: #files = scrapy.Field().
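For context, the standard item layout FilesPipeline works with looks like the sketch below; the class name StockfwItem is taken from the question's spider, and file_urls/files are the input and result fields the pipeline uses:

import scrapy

class StockfwItem(scrapy.Item):
    # input field: list of URLs for FilesPipeline to download
    file_urls = scrapy.Field()
    # output field: FilesPipeline stores download results (path, checksum, ...) here
    files = scrapy.Field()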

Why can my Python code only import some data from a .CSV file into my PostgreSQL database?

Description and Objective
This is Project 1 of the CS50 Web Programming course.
I need to import the content of a table from a .csv file into a table in my PostgreSQL database using Python.
The table has the following format:
isbn,title,author,year
0380795272,Krondor: The Betrayal,Raymond E. Feist,1998
The columns of the table were created directly in my PostgreSQL database with the following data types:
id: Integer not null
isbn: Varchar not null
title: Text not null
author: Varchar not null
year: Integer not null
I have the following Python code:
import csv
import os

from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

engine = create_engine(os.getenv("DATABASE_URL"))
db = scoped_session(sessionmaker(bind=engine))

def main():
    f = open("bookspr1.csv")
    reader = csv.reader(f)
    for isbn, title, author, year in reader:
        db.execute("INSERT INTO books (isbn, title, author, year) VALUES (:isbn, :title, :author, :year)",
                   {"isbn": isbn, "title": title, "author": author, "year": year})
        print(f"Added the book {title}")
    db.commit()

if __name__ == "__main__":
    main()
Issue
When I run the Python code to import the data from the .csv file, the system throws an error:
C:\xampp\htdocs\project1>python import0_pr1A.py
Traceback (most recent call last):
File "C:\Users\Verel\AppData\Local\Programs\Python\Python37-32\lib\site-packag
es\sqlalchemy\engine\base.py", line 1284, in _execute_context
cursor, statement, parameters, context
File "C:\Users\Verel\AppData\Local\Programs\Python\Python37-32\lib\site-packag
es\sqlalchemy\engine\default.py", line 590, in do_execute
cursor.execute(statement, parameters)
psycopg2.errors.InvalidTextRepresentation: invalid input syntax for type integer
: "year"
LINE 1: ...le, author, year) VALUES ('isbn', 'title', 'author', 'year')
^
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "import0_pr1A.py", line 24, in <module>
main()
File "import0_pr1A.py", line 18, in main
{"isbn": isbn, "title": title, "author": author, "year": year})
File "C:\Users\Verel\AppData\Local\Programs\Python\Python37-32\lib\site-packag
es\sqlalchemy\orm\scoping.py", line 163, in do
return getattr(self.registry(), name)(*args, **kwargs)
File "C:\Users\Verel\AppData\Local\Programs\Python\Python37-32\lib\site-packag
es\sqlalchemy\orm\session.py", line 1292, in execute
clause, params or {}
File "C:\Users\Verel\AppData\Local\Programs\Python\Python37-32\lib\site-packag
es\sqlalchemy\engine\base.py", line 1020, in execute
return meth(self, multiparams, params)
File "C:\Users\Verel\AppData\Local\Programs\Python\Python37-32\lib\site-packag
es\sqlalchemy\sql\elements.py", line 298, in _execute_on_connection
return connection._execute_clauseelement(self, multiparams, params)
File "C:\Users\Verel\AppData\Local\Programs\Python\Python37-32\lib\site-packag
es\sqlalchemy\engine\base.py", line 1139, in _execute_clauseelement
distilled_params,
File "C:\Users\Verel\AppData\Local\Programs\Python\Python37-32\lib\site-packag
es\sqlalchemy\engine\base.py", line 1324, in _execute_context
e, statement, parameters, cursor, context
File "C:\Users\Verel\AppData\Local\Programs\Python\Python37-32\lib\site-packag
es\sqlalchemy\engine\base.py", line 1518, in _handle_dbapi_exception
sqlalchemy_exception, with_traceback=exc_info[2], from_=e
File "C:\Users\Verel\AppData\Local\Programs\Python\Python37-32\lib\site-packag
es\sqlalchemy\util\compat.py", line 178, in raise_
raise exception
File "C:\Users\Verel\AppData\Local\Programs\Python\Python37-32\lib\site-packag
es\sqlalchemy\engine\base.py", line 1284, in _execute_context
cursor, statement, parameters, context
File "C:\Users\Verel\AppData\Local\Programs\Python\Python37-32\lib\site-packag
es\sqlalchemy\engine\default.py", line 590, in do_execute
cursor.execute(statement, parameters)
sqlalchemy.exc.DataError: (psycopg2.errors.InvalidTextRepresentation) invalid in
put syntax for type integer: "year"
LINE 1: ...le, author, year) VALUES ('isbn', 'title', 'author', 'year')
^
[SQL: INSERT INTO books (isbn, title, author, year) VALUES (%(isbn)s, %(title)s,
%(author)s, %(year)s)]
[parameters: {'isbn': 'isbn', 'title': 'title', 'author': 'author', 'year': 'yea
r'}]
(Background on this error at: http://sqlalche.me/e/9h9h)
C:\xampp\htdocs\project1>
In order to isolate the problem, I tried importing the .CSV file after removing the first row, which contains the column names (isbn, title, author, year). When I run the code, the data transfer starts but stops suddenly with another error when it tries to import a row where the "title" or "author" contains double quotes (" ") and a comma (,). For example, the following row with the author "V.E. Schwab, Victoria Schwab" generates that conflict:
0765335344,Vicious,"V.E. Schwab, Victoria Schwab",2013
And the new error is like this:
C:\xampp\htdocs\project1>python import0_pr1A.py
Added the book The Mark of Athena
Added the book Her Fearful Symmetry
Traceback (most recent call last):
File "import0_pr1A.py", line 24, in <module>
main()
File "import0_pr1A.py", line 16, in main
for isbn, title, author, year in reader:
ValueError: not enough values to unpack (expected 4, got 1)
C:\xampp\htdocs\project1>python import0_pr1A.py
The data transfer finishes successfully when the .CSV file is imported without the first row (isbn, title, author, year) and without data that contains double quotes (" ") and commas (,).
C:\xampp\htdocs\project1>python import0_pr1A.py
Added the book The Lion's Game
Added the book The Rainmaker
Added the book Eleanor & Park
C:\xampp\htdocs\project1>python import0_pr1A.py
C:\xampp\htdocs\project1>python list0_pr1.py
Krondor: The Betrayal by Raymond E. Feist of 1998.
The Dark Is Rising by Susan Cooper of 1973.
The Black Unicorn by Terry Brooks of 1987.
The Lion's Game by Nelson DeMille of 2000.
The Rainmaker by John Grisham of 1995.
Eleanor & Park by Rainbow Rowell of 2013.
C:\xampp\htdocs\project1>
Finally, I tried inserting some code lines, but the result was the same:
import psycopg2
reader.next
db.close()
import csv
import os
import psycopg2

from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

engine = create_engine(os.getenv("DATABASE_URL"))
db = scoped_session(sessionmaker(bind=engine))

def main():
    f = open("books.csv")
    reader = csv.reader(f)
    reader.__next__
    for isbn, title, author, year in reader:
        db.execute("INSERT INTO books (isbn, title, author, year) VALUES (:isbn, :title, :author, :year)",
                   {"isbn": isbn, "title": title, "author": author, "year": year})
        print(f"Added the book {title}")
    db.commit()
    db.close()

if __name__ == "__main__":
    main()
Conclusion
I need help modifying this Python code so that it imports the .csv file completely, including the first row and the data that contains double quotes (" ") and commas (,).
reader.__next__
This simply retrieves the method; it does not invoke it. You need reader.__next__(), but I think next(reader) is more conventional.
0765335344,Vicious,"V.E. Schwab, Victoria Schwab",2013
Works fine for me. Maybe your actual file has smart quotes or something like that rather than straight ASCII.
Try
csv.reader(lines, quotechar='"', delimiter=',', ...)
See the csv.reader documentation and also a prior SO answer.
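Putting those pieces together, here is a minimal sketch of the loop, assuming the same books table and bookspr1.csv as in the question; the only real changes are skipping the header with next(reader) and letting csv.reader's default quoting handle the quoted, comma-containing fields:

import csv
import os

from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

engine = create_engine(os.getenv("DATABASE_URL"))
db = scoped_session(sessionmaker(bind=engine))

def main():
    with open("bookspr1.csv", newline="") as f:
        reader = csv.reader(f)   # default quotechar='"' already handles "V.E. Schwab, Victoria Schwab"
        next(reader)             # skip the header row (isbn,title,author,year)
        for isbn, title, author, year in reader:
            db.execute("INSERT INTO books (isbn, title, author, year) VALUES (:isbn, :title, :author, :year)",
                       {"isbn": isbn, "title": title, "author": author, "year": year})
            print(f"Added the book {title}")
    db.commit()

if __name__ == "__main__":
    main()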

Fuzzy String Matching With Pandas and FuzzyWuzzy,Data matching: TypeError: cannot use a string pattern on a bytes-like object

I have a data file which looks like this -
And I have another data file which has all the correct country names.
For matching both files, I am using the code below:
import pandas as pd
from fuzzywuzzy import process

names_array = []
ratio_array = []

def match_names(wrong_names, correct_names):
    for row in wrong_names:
        x = process.extractOne(row, correct_names)
        names_array.append(x[0])
        ratio_array.append(x[1])
    return names_array, ratio_array

fields = ['name']

# Wrong country names dataset
df = pd.read_csv("wrong-country-names.csv", encoding="ISO-8859-1", sep=';', skipinitialspace=True, usecols=fields)
print(df.dtypes)
wrong_names = df.dropna().values

# Correct country names dataset
choices_df = pd.read_csv("country-names.csv", encoding="ISO-8859-1", sep='\t', skipinitialspace=True)
correct_names = choices_df.values

name_match, ratio_match = match_names(wrong_names, correct_names)

df['correct_country_name'] = pd.Series(name_match)
df['country_names_ratio'] = pd.Series(ratio_match)
df.to_csv("string_matched_country_names.csv")
print(df[['name', 'correct_country_name', 'country_names_ratio']].head(10))
I get the below error:
name object
dtype: object
Traceback (most recent call last):
File "<ipython-input-221-a1fd87d9f661>", line 1, in <module>
runfile('C:/Users/Drashti Bhatt/Desktop/untitled0.py', wdir='C:/Users/Drashti Bhatt/Desktop')
File "C:\Users\Drashti Bhatt\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
execfile(filename, namespace)
File "C:\Users\Drashti Bhatt\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/Drashti Bhatt/Desktop/untitled0.py", line 27, in <module>
name_match,ratio_match=match_names(wrong_names,correct_names)
File "C:/Users/Drashti Bhatt/Desktop/untitled0.py", line 9, in match_names
x=process.extractOne(row, correct_names)
File "C:\Users\Drashti Bhatt\Anaconda3\lib\site-packages\fuzzywuzzy\process.py", line 220, in extractOne
return max(best_list, key=lambda i: i[1])
File "C:\Users\Drashti Bhatt\Anaconda3\lib\site-packages\fuzzywuzzy\process.py", line 78, in extractWithoutOrder
processed_query = processor(query)
File "C:\Users\Drashti Bhatt\Anaconda3\lib\site-packages\fuzzywuzzy\utils.py", line 95, in full_process
string_out = StringProcessor.replace_non_letters_non_numbers_with_whitespace(s)
File "C:\Users\Drashti Bhatt\Anaconda3\lib\site-packages\fuzzywuzzy\string_processing.py", line 26, in replace_non_letters_non_numbers_with_whitespace
return cls.regex.sub(" ", a_string)
TypeError: expected string or bytes-like object
I tried the .decode option, but it did not work out. What am I doing wrong?
Any help on this will be much appreciated! Thanks!
The code below is working; you can find the differences. But I am not sure if this is the solution that you are looking for. I tried it on sample files which I created manually, and I removed fields from pd.read_csv.
(... = same as your code)
...
def match_names(wrong_names, correct_names):
    for row in wrong_names:
        print('row=', row)
    ...
    return names_array, ratio_array

fields = ['name']

# Wrong country names dataset
df = pd.read_csv("fuzzy.csv", encoding="ISO-8859-1", skipinitialspace=True)
print(df.dtypes)
wrong_names = df.dropna().values
print(wrong_names)

# Correct country names dataset
choices_df = pd.read_csv("country.csv", encoding="ISO-8859-1", sep='\t', skipinitialspace=True)
correct_names = choices_df.values
print(correct_names)
...
print(df[['correct_country_name', 'country_names_ratio']].head(10))
Output
Country object
alpha-2 object
alpha-3 object
country-code int64
iso_3166-2 object
region object
sub-region object
region-co int64
sub-region.1 int64
dtype: object
[[u'elbenie' u'AL' u'ALB' 8 u'ISO 3166-2:AL' u'Europe' u'Southern Europe'
150 39]
[u'enforre' u'AD' u'AND' 20 u'ISO 3166-2:AD' u'Europe' u'Southern Europe'
150 39]
[u'Belerus' u'AT' u'AUT' 40 u'ISO 3166-2:AT' u'Europe' u'Western Europe'
150 155]]
[[u'elbenie']
[u'enforre']
[u'Belerus']]
('row=', array([u'elbenie', u'AL', u'ALB', 8, u'ISO 3166-2:AL', u'Europe',
u'Southern Europe', 150, 39], dtype=object))
('row=', array([u'enforre', u'AD', u'AND', 20, u'ISO 3166-2:AD', u'Europe',
u'Southern Europe', 150, 39], dtype=object))
('row=', array([u'Belerus', u'AT', u'AUT', 40, u'ISO 3166-2:AT', u'Europe',
u'Western Europe', 150, 155], dtype=object))
correct_country_name country_names_ratio
0 [elbenie] 60
1 [enforre] 60
2 [Belerus] 60
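For what it's worth, the traceback comes from fuzzywuzzy being handed non-string values (whole NumPy rows) to match against. Below is a minimal, hedged sketch of the fix, assuming the wrong-names file has a name column and the correct-names file has a Country column (the column names are guesses based on the output above):

import pandas as pd
from fuzzywuzzy import process

df = pd.read_csv("wrong-country-names.csv", encoding="ISO-8859-1", sep=';',
                 skipinitialspace=True, usecols=['name']).dropna()
choices_df = pd.read_csv("country-names.csv", encoding="ISO-8859-1", sep='\t',
                         skipinitialspace=True)

# pass plain strings, not whole rows/arrays, to fuzzywuzzy
wrong_names = df['name'].astype(str).tolist()
correct_names = choices_df['Country'].astype(str).tolist()   # 'Country' is an assumed column name

matches = [process.extractOne(name, correct_names) for name in wrong_names]
df['correct_country_name'] = [m[0] for m in matches]
df['country_names_ratio'] = [m[1] for m in matches]
print(df.head(10))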

How do I parse list of dict of dict of ... dict to dataframe?

I've got a list of dictionaries of dictionaries... Basically, it is just a big piece of JSON. Here is how one dict from the list looks:
{'id': 391257, 'from_id': -1, 'owner_id': -1, 'date': 1554998414, 'marked_as_ads': 0, 'post_type': 'post', 'text': 'Весна — время обновлений. Очищаем балконы от старых лыж и API от устаревших версий: уже скоро запросы к API c версией ниже 5.0 перестанут поддерживаться.\n\nОжидаемая дата изменений: 15 мая 2019 года. \n\nПодробности в Roadmap: https://vk.com/dev/version_update_2.0', 'post_source': {'type': 'vk'}, 'comments': {'count': 91, 'can_post': 1, 'groups_can_post': True}, 'likes': {'count': 182, 'user_likes': 0, 'can_like': 1, 'can_publish': 1}, 'reposts': {'count': 10, 'user_reposted': 0}, 'views': {'count': 63997}, 'is_favorite': False}
And I want to dump each dict to a frame. If I just do
data = pandas.DataFrame(list_of_dicts)
I get a frame with only two columns: the first one contains the keys, and the other one contains the data, like this:
I tried doing it in a loop:
for i in list_of_dicts:
    tmp = pandas.DataFrame().from_dict(i)
    data = pandas.concat([data, tmp])
    print(i)
But I face ValueError:
Traceback (most recent call last):
File "/home/keddad/PycharmProjects/vk_group_parse/Data Grabber.py", line 68, in <module>
main()
File "/home/keddad/PycharmProjects/vk_group_parse/Data Grabber.py", line 61, in main
tmp = pandas.DataFrame().from_dict(i)
File "/home/keddad/anaconda3/envs/vk_group_parse/lib/python3.7/site-packages/pandas/core/frame.py", line 1138, in from_dict
return cls(data, index=index, columns=columns, dtype=dtype)
File "/home/keddad/anaconda3/envs/vk_group_parse/lib/python3.7/site-packages/pandas/core/frame.py", line 392, in __init__
mgr = init_dict(data, index, columns, dtype=dtype)
File "/home/keddad/anaconda3/envs/vk_group_parse/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 212, in init_dict
return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
File "/home/keddad/anaconda3/envs/vk_group_parse/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 51, in arrays_to_mgr
index = extract_index(arrays)
File "/home/keddad/anaconda3/envs/vk_group_parse/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 320, in extract_index
raise ValueError('Mixing dicts with non-Series may lead to '
ValueError: Mixing dicts with non-Series may lead to ambiguous ordering.
How, after this, can I get a dataframe with one row per post (one dictionary in the list is one post) and all the data in it as columns?
I can't figure out the df exactly, but I think you simply need to do a reset_index on the data you currently have (it seems):
df.reset_index(inplace=True)
Another thing, if you want the keys as columns:
df = pd.DataFrame.from_dict(d, orient='columns')
# or try orient='index' if you don't get the desired results
In a for loop:
l = []
for i in d.keys():  # d here stands for your dictionary
    l.append(pd.DataFrame.from_dict(d[i], orient='columns'))
df = pd.concat(l)
Not quite sure what you are trying to do, but do you mean something like this?
You can see inside the data by just printing the dataframe. Or you can print each one by the following code.
data = pandas.DataFrame(list_of_dicts)
print(data)
for i in data.loc[:, data.columns]:
    print(data[i])
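If the goal is one row per post with the nested dicts flattened into columns, pandas has json_normalize for exactly this; a minimal sketch, assuming a reasonably recent pandas version:

import pandas as pd

# each nested dict (comments, likes, reposts, views, post_source) becomes
# dotted columns such as likes.count, comments.count, views.count
data = pd.json_normalize(list_of_dicts)
print(data.columns)
print(data[['id', 'text', 'likes.count', 'comments.count', 'views.count']].head())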

KeyError in Python 3 Dictionary, but it's returning the current value

So I have this file with these values:
AFG,13,0,0,2
ALG,15,5,2,8
ARG,40,18,24,28
Stored into a dictionary like this:
{'ARG': (40, 18, 24, 28), 'ALG': (15, 5, 2, 8), 'AFG': (13, 0, 0, 2)}
I have a function that has the user punch in the key, and it should return the tuple with the numbers in it.
However, if I were to type in, say, AFG, I get:
Traceback (most recent call last):
File "C:\Users\username\Dropbox\Programming\Python\Project3.py", line 131, in <module>
main()
File "C:\Users\username\Dropbox\Programming\Python\Project3.py", line 110, in main
findMedals(countryDictionary, MedalDictionary)
File "C:\Users\username\Dropbox\Programming\Python\Project3.py", line 88, in findMedals
answer.append([medalDict[medalCount]])
KeyError: (13, 0, 0, 2)
As you can see, the KeyError gives out the correct value for the inputted key, but why is it still complaining about it? Doesn't KeyError mean the key didn't exist?
My code:
def findMedals(countryDict, medalDict):
    search_str = input('What is the country you want information on? ')
    for code, medalCount in medalDict.items():
        if search_str in code:
            answer.append([medalDict[medalCount]])
        else:
            answer = ['No Match Found']
The problem is on this line:
answer.append([medalDict[medalCount]])
You are iterating over medalDict with for code, medalCount in medalDict.items():. This already assigns medalCount to the value associated with the key code. Your dict doesn't have a key represented by the tuple of medals. Therefore, it errors when you ask for medalDict[medalCount].
You can fix this by:
answer.append([medalCount])
Hope this helps
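For completeness, here is a minimal sketch of the whole function with that fix applied; it also initialises answer before the loop (which the original snippet relies on but never does) and only falls back to 'No Match Found' after checking every entry:

def findMedals(countryDict, medalDict):
    search_str = input('What is the country you want information on? ')
    answer = []
    for code, medalCount in medalDict.items():
        if search_str in code:
            # medalCount is already the value (the medal tuple), so append it directly
            answer.append([medalCount])
    if not answer:
        answer = ['No Match Found']
    return answer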
