I am scraping data from GDELT [https://www.gdeltproject.org]. It is a pretty cool project that checks ~100,000 news sites each day, labels all the articles, and makes them available. I am getting attribute error while extracting the data. The code use is the following:
import gdelt
gd = gdelt.gdelt(version=1)
from statsmodels.tsa.api import VAR
import pandas as pd
import os
os.makedirs("data",exist_ok=True)
import datetime
cur_date = datetime.datetime(2022,1,10) - datetime.timedelta(days=10)
end_date = datetime.datetime(2022,1,10)
year = cur_date.year
month = str(cur_date.month)
day = str(cur_date.day)
if cur_date.month < 10:
month = "0" + month
if cur_date.day < 10:
day = "0" + day
gd.Search(['%s %s %s'%(year, month, day)],table='gkg',coverage=True, translation=False)
I am getting attribute error
AttributeError Traceback (most recent call last)
<ipython-input-10-2f00cabbf1ac> in <module>
----> 1 results = gd.Search(['%s %s %s'%(year, month, day)],table='gkg',coverage=True,
translation=False)
~\anaconda3\lib\site-packages\gdelt\base.py in Search(self, date, table, coverage,
translation, output, queryTime, normcols)
646
647 if self.table == 'gkg' and self.version == 1:
--> 648 results.columns = results.ix[0].values.tolist()
649 results.drop([0], inplace=True)
650 columns = results.columns
~\anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
5463 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5464 return self[name]
-> 5465 return object.__getattribute__(self, name)
5466
5467 def __setattr__(self, name: str, value) -> None:
AttributeError: 'DataFrame' object has no attribute 'ix'
I want to convert the text for suitable "natural language processing"
There are approx 3000+ books in column of "TEXT"
every row has huge text or one book in every row so when I apply this code I am getting a error as shown bellow.
When I am applying the below code
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
corpus = []
for i in range(len(dt)):
review = re.sub('[^a-zA-Z0-9]', ' ', dt['TEXT'][i])
review = review.lower()
review = review.split()
ps = PorterStemmer()
review = [ps.stem(word) for word in review if not word in set(stopwords.words('english'))]
review = ' '.join(review)
corpus.append(review)
I am getting the following error
TypeError Traceback (most recent call last)
<ipython-input-16-47569f8727fa> in <module>
6 corpus = []
7 for i in range(1000,2000):
----> 8 review = re.sub('[^a-zA-Z0-9]', ' ', dt['TEXT'][i])
9 review = review.lower()
10 review = review.split()
~\anaconda3\lib\re.py in sub(pattern, repl, string, count, flags)
190 a callable, it's passed the Match object and must return
191 a replacement string to be used."""
--> 192 return _compile(pattern, flags).sub(repl, string, count)
193
194 def subn(pattern, repl, string, count=0, flags=0):
TypeError: expected string or bytes-like object
This means that in your DataFrame column 'TEXT' there are values that are not strings.
You can do this instead:
for i in range(len(df)):
try:
re.sub('[^a-zA-Z0-9]', ' ', df['TEXT'][i])
# the rest of your code ...
except TypeError:
pass
I'm pretty new to python and I'm currently working on an assignment to implement a movie recommendation system. I have a .csv file that contains various descriptions of a given movie's attribute. I ask the user for a movie title and then the system returns similar movies.
The dataset is named movie_dataset.csv from this folder on GitHub: https://github.com/codeheroku/Introduction-to-Machine-Learning/tree/master/Building%20a%20Movie%20Recommendation%20Engine
The problem I am encountering is that when I ask the user to enter a movie title, the program only works for certain titles.
The code:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
#helper functions#
def get_title_from_index(index):
return df[df.index == index]["title"].values[0]
def get_index_from_title(title):
return df[df.title == title]["index"].values[0]
df = pd.read_csv("movie_dataset.csv")
#print (df.columns)
features = ['keywords','cast','genres','director']
for feature in features:
df[feature] = df[feature].fillna('')
def combine_features(row):
return row['keywords'] +" "+ row['cast'] +" "+ row['genres'] +" "+ row['director']
df["combine_features"] = df.apply(combine_features, axis=1)
#print (df["combine_features"].head())
cv = CountVectorizer()
count_matrix = cv.fit_transform(df["combine_features"])
#MTitle = input("Type in a movie title: ")
cosine_sim = cosine_similarity(count_matrix)
movie_user_likes = 'Avatar'#MTitle
movie_index = get_index_from_title(movie_user_likes)
similar_movies = list(enumerate(cosine_sim[movie_index]))
sorted_similar_movies = sorted(similar_movies, key= lambda x:x[1], reverse=True)
i = 0
for movie in sorted_similar_movies:
print (get_title_from_index(movie[0]))
i=i+1
if i>10:
break
When I enter "Batman" the program runs fine. But when I run "Harry Potter" I get:
IndexError Traceback (most recent call last)
<ipython-input-51-687ddb420709> in <module>
30 movie_user_likes = MTitle
31
---> 32 movie_index = get_index_from_title(movie_user_likes)
33
34 similar_movies = list(enumerate(cosine_sim[movie_index]))
<ipython-input-51-687ddb420709> in get_index_from_title(title)
8
9 def get_index_from_title(title):
---> 10 return df[df.title == title]["index"].values[0]
11
12 df = pd.read_csv("movie_dataset.csv")
IndexError: index 0 is out of bounds for axis 0 with size 0
There's simply no entry in the data base for the movie "Harry Potter"
You should add some testing for these cases such as:
def get_index_from_title(title):
try:
return df[df.title == title]["index"].values[0]
except IndexError:
return None
Then of course in the calling code you'll have to test if you got a None from the function and act accordingly.
I would like to use the Python Faker library to generate 500 lines of data, however I get repeated data using the code I came up with below. Can you please point out where I'm going wrong. I believe it has something to do with the for loop. Thanks in advance:
from faker import Factory
import pandas as pd
import random
def create_fake_stuff(fake):
df = pd.DataFrame(columns=('name'
, 'email'
, 'bs'
, 'address'
, 'city'
, 'state'
, 'date_time'
, 'paragraph'
, 'Conrad'
,'randomdata'))
stuff = [fake.name()
, fake.email()
, fake.bs()
, fake.address()
, fake.city()
, fake.state()
, fake.date_time()
, fake.paragraph()
, fake.catch_phrase()
, random.randint(1000,2000)]
for i in range(10):
df.loc[i] = [item for item in stuff]
print(df)
if __name__ == '__main__':
fake = Factory.create()
create_fake_stuff(fake)
Disclaimer: this answer is added much after the question and adds some new info not directly answering the question.
Now there is a fast new library Mimesis - Fake Data Generator.
Upside: It is stated it works times faster than faker (see below my test of data similar to one in question).
Downside: works from 3.6 version of Python only.
pip install mimesis
>>> from mimesis import Person
>>> from mimesis.enums import Gender
>>> person = Person('en')
>>> person.full_name(gender=Gender.FEMALE)
'Antonetta Garrison'
>>> personru = Person('ru')
>>> personru.full_name()
'Рената Черкасова'
The same with developed earlier faker:
pip install faker
>>> from faker import Faker
>>> fake_ru=Faker('ja_JP')
>>> fake_ru=Faker('ru_RU')
>>> fake_jp=Faker('ja_JP')
>>> print (fake_ru.name())
Субботина Елена Наумовна
>>> print (fake_jp.name())
大垣 花子
Below it my recent timing of Mimesis vs. Faker based on code provided in answer from forzer0eight:
from faker import Faker
import pandas as pd
import random
fake = Faker()
def create_rows_faker(num=1):
output = [{"name":fake.name(),
"address":fake.address(),
"name":fake.name(),
"email":fake.email(),
#"bs":fake.bs(),
"city":fake.city(),
"state":fake.state(),
"date_time":fake.date_time(),
#"paragraph":fake.paragraph(),
#"Conrad":fake.catch_phrase(),
"randomdata":random.randint(1000,2000)} for x in range(num)]
return output
%%time
df_faker = pd.DataFrame(create_rows_faker(5000))
CPU times: user 3.51 s, sys: 2.86 ms, total: 3.51 s
Wall time: 3.51 s
from mimesis import Person
from mimesis import Address
from mimesis.enums import Gender
from mimesis import Datetime
person = Person('en')
import pandas as pd
import random
person = Person()
addess = Address()
datetime = Datetime()
def create_rows_mimesis(num=1):
output = [{"name":person.full_name(gender=Gender.FEMALE),
"address":addess.address(),
"name":person.name(),
"email":person.email(),
#"bs":person.bs(),
"city":addess.city(),
"state":addess.state(),
"date_time":datetime.datetime(),
#"paragraph":person.paragraph(),
#"Conrad":person.catch_phrase(),
"randomdata":random.randint(1000,2000)} for x in range(num)]
return output
%%time
df_mimesis = pd.DataFrame(create_rows_mimesis(5000))
CPU times: user 178 ms, sys: 1.7 ms, total: 180 ms
Wall time: 179 ms
Below is resulting data for comparison:
df_faker.head(2)
address city date_time email name randomdata state
0 3818 Goodwin Haven\nBrocktown, GA 06168 Valdezport 2004-10-18 20:35:52 joseph81#gomez-beltran.info Deborah Garcia 1218 Oklahoma
1 2568 Gonzales Field\nRichardhaven, NC 79149 West Rachel 1985-02-03 00:33:00 lbeck#wang.com Barbara Pineda 1536 Tennessee
df_mimesis.head(2)
address city date_time email name randomdata state
0 351 Nobles Viaduct Cedar Falls 2013-08-22 08:20:25.288883 chemotherapeutics1964#gmail.com Ernest 1673 Georgia
1 517 Williams Hill Malden 2008-01-26 18:12:01.654995 biochemical1972#yandex.com Jonathan 1845 North Dakota
Following scripts can remarkably enhance the pandas performance.
from faker import Faker
import pandas as pd
import random
fake = Faker()
def create_rows(num=1):
output = [{"name":fake.name(),
"address":fake.address(),
"name":fake.name(),
"email":fake.email(),
"bs":fake.bs(),
"address":fake.address(),
"city":fake.city(),
"state":fake.state(),
"date_time":fake.date_time(),
"paragraph":fake.paragraph(),
"Conrad":fake.catch_phrase(),
"randomdata":random.randint(1000,2000)} for x in range(num)]
return output
It takes 5.55s.
%%time
df = pd.DataFrame(create_rows(5000))
Wall time: 5.55 s
I placed the fake stuff array inside my for loop to achieve the desired result:
for i in range(10):
stuff = [fake.name()
, fake.email()
, fake.bs()
, fake.address()
, fake.city()
, fake.state()
, fake.date_time()
, fake.paragraph()
, fake.catch_phrase()
, random.randint(1000, 2000)]
df.loc[i] = [item for item in stuff]
print(df)
Using the farsante and mimesis libraries is the easiest way to create Pandas DataFrames with fake data.
import random
import farsante
from mimesis import Person
from mimesis import Address
from mimesis import Datetime
person = Person()
address = Address()
datetime = Datetime()
def rand_int(min_int, max_int):
def some_rand_int():
return random.randint(min_int, max_int)
return some_rand_int
df = farsante.pandas_df([
person.full_name,
address.address,
person.name,
person.email,
address.city,
address.state,
datetime.datetime,
rand_int(1000, 2000)], 5)
print(df)
full_name address name ... state datetime some_rand_int
0 Weldon Durham 1027 Nellie Square Bruna ... West Virginia 2030-06-10 09:21:29.179412 1453
1 Veta Conrad 932 Cragmont Arcade Betsey ... Iowa 2017-08-11 23:50:27.479281 1909
2 Vena Kinney 355 Edgar Highway Tyson ... New Hampshire 2002-12-21 05:26:45.723531 1735
3 Adam Sheppard 270 Williar Court Treena ... North Dakota 2011-03-30 19:16:29.015598 1503
4 Penney Allison 592 Oakdale Road Chas ... Maine 2009-12-14 16:31:37.714933 1175
This approach keeps your code clean.
I work with python 3.5 and I have the next problem to import some datas from a hdf5 files.
I will show a very simple example which resume what happen. I have created a small dataframe and I have inserted it into a hdf5 files. Then I have tried to select from this hdf5 file the rows which have on the column "A" a value less that 1. So I get the error:
"Type error: unorderable types: str() < int()"
image
import pandas as pd
import numpy as np
import datetime
import time
import h5py
from pandas import DataFrame, HDFStore
def test_conected():
hdf_nombre_archivo ="1_Archivo.h5"
hdf = HDFStore(hdf_nombre_archivo)
np.random.seed(1234)
index = pd.date_range('1/1/2000', periods=3)
df = pd.DataFrame(np.random.randn(3, 4), index=index, columns=
['A', 'B','C','F'])
print(df)
with h5py.File(hdf_nombre_archivo) as f:
df.to_hdf(hdf_nombre_archivo, 'df',format='table')
print("")
with h5py.File(hdf_nombre_archivo) as f:
df_nuevo = pd.read_hdf(hdf_nombre_archivo, 'df',where= ['A' < 1])
print(df_nuevo )
def Fin():
print(" ")
print("FIN")
if __name__ == "__main__":
test_conected()
Fin()
print(time.strftime("%H:%M:%S"))
I have been investigating but I dont get to solve this error. Some idea?
Thanks
Angel
where= ['A' < 1]
in your condition statement 'A' is consider as string or char and 1 is int so first make them in same type by typecasting.
ex:
str(1)