Using Python Faker to generate different data for 5000 rows - python-3.x

I would like to use the Python Faker library to generate 500 lines of data; however, I get repeated data with the code I came up with below. Can you please point out where I'm going wrong? I believe it has something to do with the for loop. Thanks in advance:
from faker import Factory
import pandas as pd
import random

def create_fake_stuff(fake):
    df = pd.DataFrame(columns=('name', 'email', 'bs', 'address', 'city', 'state',
                               'date_time', 'paragraph', 'Conrad', 'randomdata'))
    stuff = [fake.name(),
             fake.email(),
             fake.bs(),
             fake.address(),
             fake.city(),
             fake.state(),
             fake.date_time(),
             fake.paragraph(),
             fake.catch_phrase(),
             random.randint(1000, 2000)]
    for i in range(10):
        df.loc[i] = [item for item in stuff]
    print(df)

if __name__ == '__main__':
    fake = Factory.create()
    create_fake_stuff(fake)

Disclaimer: this answer was added well after the question and adds some new information that does not directly answer it.
There is now a fast new library, Mimesis - Fake Data Generator.
Upside: it is stated to work many times faster than Faker (see my test below on data similar to that in the question).
Downside: it works only with Python 3.6 and later.
pip install mimesis
>>> from mimesis import Person
>>> from mimesis.enums import Gender
>>> person = Person('en')
>>> person.full_name(gender=Gender.FEMALE)
'Antonetta Garrison'
>>> personru = Person('ru')
>>> personru.full_name()
'Рената Черкасова'
The same with the earlier-developed Faker:
pip install faker
>>> from faker import Faker
>>> fake_ru = Faker('ru_RU')
>>> fake_jp = Faker('ja_JP')
>>> print (fake_ru.name())
Субботина Елена Наумовна
>>> print (fake_jp.name())
大垣 花子
Below is my recent timing of Mimesis vs. Faker, based on the code provided in the answer from forzer0eight:
from faker import Faker
import pandas as pd
import random

fake = Faker()

def create_rows_faker(num=1):
    output = [{"name": fake.name(),
               "address": fake.address(),
               "email": fake.email(),
               #"bs": fake.bs(),
               "city": fake.city(),
               "state": fake.state(),
               "date_time": fake.date_time(),
               #"paragraph": fake.paragraph(),
               #"Conrad": fake.catch_phrase(),
               "randomdata": random.randint(1000, 2000)} for x in range(num)]
    return output
%%time
df_faker = pd.DataFrame(create_rows_faker(5000))
CPU times: user 3.51 s, sys: 2.86 ms, total: 3.51 s
Wall time: 3.51 s
from mimesis import Person
from mimesis import Address
from mimesis import Datetime
import pandas as pd
import random

person = Person('en')
address = Address()
datetime = Datetime()

def create_rows_mimesis(num=1):
    output = [{"name": person.name(),
               "address": address.address(),
               "email": person.email(),
               #"bs": person.bs(),
               "city": address.city(),
               "state": address.state(),
               "date_time": datetime.datetime(),
               #"paragraph": person.paragraph(),
               #"Conrad": person.catch_phrase(),
               "randomdata": random.randint(1000, 2000)} for x in range(num)]
    return output
%%time
df_mimesis = pd.DataFrame(create_rows_mimesis(5000))
CPU times: user 178 ms, sys: 1.7 ms, total: 180 ms
Wall time: 179 ms
Below is the resulting data for comparison:
df_faker.head(2)
address city date_time email name randomdata state
0 3818 Goodwin Haven\nBrocktown, GA 06168 Valdezport 2004-10-18 20:35:52 joseph81@gomez-beltran.info Deborah Garcia 1218 Oklahoma
1 2568 Gonzales Field\nRichardhaven, NC 79149 West Rachel 1985-02-03 00:33:00 lbeck@wang.com Barbara Pineda 1536 Tennessee
df_mimesis.head(2)
address city date_time email name randomdata state
0 351 Nobles Viaduct Cedar Falls 2013-08-22 08:20:25.288883 chemotherapeutics1964@gmail.com Ernest 1673 Georgia
1 517 Williams Hill Malden 2008-01-26 18:12:01.654995 biochemical1972@yandex.com Jonathan 1845 North Dakota

The following script can remarkably improve pandas performance.
from faker import Faker
import pandas as pd
import random

fake = Faker()

def create_rows(num=1):
    output = [{"name": fake.name(),
               "address": fake.address(),
               "email": fake.email(),
               "bs": fake.bs(),
               "city": fake.city(),
               "state": fake.state(),
               "date_time": fake.date_time(),
               "paragraph": fake.paragraph(),
               "Conrad": fake.catch_phrase(),
               "randomdata": random.randint(1000, 2000)} for x in range(num)]
    return output
It takes 5.55s.
%%time
df = pd.DataFrame(create_rows(5000))
Wall time: 5.55 s
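For context (this explanation is not part of the original answer): assigning rows one at a time with df.loc, as in the question's code, makes pandas re-allocate and realign the frame on every insert, while building a plain Python list of dicts and calling pd.DataFrame once does a single construction. A minimal sketch of the two patterns with a hypothetical row count:
import pandas as pd
from faker import Faker

fake = Faker()

# slow: one .loc insertion per row (the pattern in the question)
df_slow = pd.DataFrame(columns=('name', 'email'))
for i in range(1000):
    df_slow.loc[i] = [fake.name(), fake.email()]

# fast: build all rows first, then construct the DataFrame once
rows = [{"name": fake.name(), "email": fake.email()} for _ in range(1000)]
df_fast = pd.DataFrame(rows)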

I placed the fake stuff array inside my for loop to achieve the desired result:
for i in range(10):
    stuff = [fake.name(),
             fake.email(),
             fake.bs(),
             fake.address(),
             fake.city(),
             fake.state(),
             fake.date_time(),
             fake.paragraph(),
             fake.catch_phrase(),
             random.randint(1000, 2000)]
    df.loc[i] = [item for item in stuff]
print(df)

Using the farsante and mimesis libraries is the easiest way to create Pandas DataFrames with fake data.
import random

import farsante
from mimesis import Person
from mimesis import Address
from mimesis import Datetime

person = Person()
address = Address()
datetime = Datetime()

def rand_int(min_int, max_int):
    def some_rand_int():
        return random.randint(min_int, max_int)
    return some_rand_int

df = farsante.pandas_df([
    person.full_name,
    address.address,
    person.name,
    person.email,
    address.city,
    address.state,
    datetime.datetime,
    rand_int(1000, 2000)], 5)

print(df)
full_name address name ... state datetime some_rand_int
0 Weldon Durham 1027 Nellie Square Bruna ... West Virginia 2030-06-10 09:21:29.179412 1453
1 Veta Conrad 932 Cragmont Arcade Betsey ... Iowa 2017-08-11 23:50:27.479281 1909
2 Vena Kinney 355 Edgar Highway Tyson ... New Hampshire 2002-12-21 05:26:45.723531 1735
3 Adam Sheppard 270 Williar Court Treena ... North Dakota 2011-03-30 19:16:29.015598 1503
4 Penney Allison 592 Oakdale Road Chas ... Maine 2009-12-14 16:31:37.714933 1175
This approach keeps your code clean.
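A note on the design (not from the original answer): farsante appears to call each item in the list as a zero-argument provider and name the resulting column after the callable, which is why rand_int returns an inner function instead of a number; the closure bakes the bounds in. A tiny illustration of the same closure pattern with hypothetical bounds:
import random

def rand_int(min_int, max_int):
    def some_rand_int():
        return random.randint(min_int, max_int)
    return some_rand_int

make_score = rand_int(0, 100)  # hypothetical bounds
print(make_score())            # a fresh random value on each call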

Related

Locate an ID in a DataFrame using a constraint on a column's percentile

I am trying to do a Weighted Aged Historical VaR based on the DataFrame below. I would like to identify the ID in my DataFrame corresponding to the 5% quantile of the 'Weight_Age_Cumul' column (like in the example below that I found on the internet).
I've tried the following line of code but I get the following error message: 'DataFrame' object has no attribute 'idmax'
cac_df_sorted[cac_df_sorted['Weight_Age_Cumul'] <= 0.05].CAC_Log_returns.idmax()
If you can help me with that it would be great, thank you.
Full code below if needed:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
from tabulate import tabulate
from scipy.stats import norm
import yfinance as yf
from yahoofinancials import YahooFinancials
import sys

cac_df = yf.download('^FCHI',
                     start='2020-04-01',
                     end='2022-05-31',
                     progress=False)
cac_df.head()
cac_df = cac_df.drop(columns=['Open', 'High', 'Low', 'Close', 'Volume'])

# conversion into returns
cac_df['Adj Close_-1'] = cac_df['Adj Close'].shift(1)
cac_df['CAC_Log_returns'] = np.log(cac_df['Adj Close'] / cac_df['Adj Close_-1'])
cac_df.index = pd.to_datetime(cac_df.index, format='%Y-%m-%d').strftime('%Y-%m-%d')

# plot CAC returns graph & histogram
cac_df['CAC_Log_returns'].plot(kind='line', figsize=(15, 7))
plt.show()
cac_df['CAC_Log_returns'].hist(bins=40, normed=True, histtype='stepfilled', alpha=0.5)
plt.xlabel('Returns')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

# Historical VaR: constant weight, age weighted & vol weighted
cac_df_sorted = cac_df.copy()
cac_df_sorted.sort_values(by=['Date'], inplace=True, ascending=False)

# weights for age-weighted VaR
lamb = 0.98
n = len(cac_df_sorted['CAC_Log_returns'])
weight_age = [(lamb**(i - 1) * (1 - lamb)) / (1 - lamb**n) for i in range(1, n + 1)]

# design of the dataframe
cac_df_sorted['Weight_Age'] = weight_age
cac_df_sorted.sort_values(by=['CAC_Log_returns'], inplace=True, ascending=True)
cac_df_sorted['Weight_Age_Cumul'] = np.cumsum(weight_age)

# historical VaR, constant weight
Var_95_1d_CW = -cac_df_sorted['CAC_Log_returns'].quantile(0.05)
Var_99_1d_CW = -cac_df_sorted['CAC_Log_returns'].quantile(0.01)

# from 1-day VaR to 10-day VaR
mean = np.mean(cac_df['CAC_Log_returns'])
Var_95_10d_CW = (np.sqrt(10) * Var_95_1d_CW) + (mean * (np.sqrt(10) - 10))
Var_99_10d_CW = (np.sqrt(10) * Var_99_1d_CW) + (mean * (np.sqrt(10) - 10))
print(tabulate([['95%', Var_95_1d_CW, Var_95_10d_CW], ['99%', Var_99_1d_CW, Var_99_10d_CW]],
               headers=['Confidence Level', 'Value at Risk 1 day Constant Weight',
                        'Value at Risk 10 days Constant Weight']))
print(cac_df_sorted)

# Historical VaR, age weighted
# find where the cumulative weight (percentile) hits 0.05 and 0.01
cac_df_sorted[cac_df_sorted['Weight_Age_Cumul'] <= 0.05].CAC_Log_returns.idmax()
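A note beyond the original post (no answer is shown for this question here): pandas Series expose idxmax(), not idmax(), which is what triggers the AttributeError. A minimal sketch of the lookup, assuming the same column names as in the code above:
# pandas Series have .idxmax(), not .idmax(); this returns the index label of the
# largest CAC_Log_returns among rows whose cumulative weight is at most 5%
threshold_id = (
    cac_df_sorted.loc[cac_df_sorted['Weight_Age_Cumul'] <= 0.05, 'CAC_Log_returns']
    .idxmax()
)
print(threshold_id)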

How do I ask a question about a lowercase error in casefolding?

I don't understand this error... I've already turned df into lowercase before turning it into a DataFrame:
0 Masuk ke Liang Lahat, Rian D’Masiv Makin Sadar... Infotainment Untuk pertama kalinya, Rian masuk ke liang lah...
1 Alasan PPKM, Kuasa Hukum Vicky Prasetyo Sebut ... Infotainment Andai saja persidangan tetap berjalan seperti ...
...
1573 Jessica Iskandar Syok Tahu Kabar Nia Ramadhani... Infotainment “Banyak wartawan juga nanyain. Itu aku baru ba...
1574 Show 10 Menit BTS dalam Koleksi LV Music & Movie BTS melaksanakan ’’tugas’’ perdananya sebagai ...
Code:
import pandas as pd
import numpy as np
import re
import string
import nltk

def load_data():
    dataset = pd.read_csv("jawapos_entertainment.csv")
    return dataset

news_df = load_data()
news_df.head()

df = pd.DataFrame(news_df[['judul_name', 'judul_kategori', 'judul_Headline']])
df

from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory

factory = StopWordRemoverFactory()
stopwords = factory.create_stop_word_remover()

kalimat = df[['judul_name', 'judul_Headline']]
kalimat = kalimat.lower()
stop = stopwords.remove(kalimat)
print(stop)
But I have an error in this line:
AttributeError Traceback (most recent call last)
<ipython-input-17-ce52d5ec4fb2> in <module>
4
5 kalimat = df [['judul_name','judul_Headline']]
----> 6 kalimat = kalimat.lower()
7
8 stop = stopwords.remove(kalimat)
~\anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
5463 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5464 return self[name]
-> 5465 return object.__getattribute__(self, name)
5466
5467 def __setattr__(self, name: str, value) -> None:
AttributeError: 'DataFrame' object has no attribute 'lower'
But why does the program return a lowercase error if I've already passed the lowercased DataFrame beforehand?
You can't just call lower() on a DataFrame object. You first have to indicate that you want the vectorized string functions for Series and Index, i.e. pd.Series.str.
Converting the whole DataFrame to lowercase should look like this:
for column in kalimat.columns:
    kalimat[column] = kalimat[column].str.lower()
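As a possibly more concise alternative (not from the original answer), the same lowercasing can be applied to every column in one pass, assuming kalimat contains only string columns:
# lowercase every string column of the DataFrame in one pass
kalimat = kalimat.apply(lambda col: col.str.lower())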

NLTK Named Entity Category Labels

I keep hitting a wall when it comes to NLTK. I've been able to tokenize and categorize a single string of text; however, if I try to apply the script across multiple rows I get the tokens, but it does not return a category, which is the most important part for me.
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

SENT_DETECTOR = nltk.data.load('tokenizers/punkt/english.pickle')
Example:
ex = 'John'
ne_tree = nltk.ne_chunk(pos_tag(word_tokenize(ex)))
print(ne_tree)
Output:
(S (PERSON John/NNP))
That is exactly what I'm looking for. I need the Category not just NNP.
When I apply this across a table I just get the token and no Category.
Example:
df = pd.read_csv('ex3.csv')
df
Input:
Order Text
0 0 John
1 1 Paul
2 2 George
3 3 Ringo
Code:
df['results'] = df.Text.apply(lambda x: nltk.ne_chunk(pos_tag(word_tokenize(x))))
df
Output:
print(df)
Order Text results
0 0 John [[(John, NNP)]]
1 1 Paul [[(Paul, NNP)]]
2 2 George [[(George, NNP)]]
3 3 Ringo [[(Ringo, NN)]]
I'm getting the tokens and it's working across all rows, but it is not giving me a Category 'PERSON'.
I really need Categories.
Is this not possible? Thanks for the help.
Here we go...
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

df = pd.read_csv("ex3.csv")
# print(df)

text1 = df['Text'].to_list()
text = []
for i in text1:
    text.append(i.capitalize())

# create a column to store results
df['results'] = ""
for i in range(len(text)):
    ne_tree = nltk.ne_chunk(pos_tag(word_tokenize(text[i])))
    df.at[i, 'results'] = ne_tree[0].label()
print(df)
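One caveat beyond the original answer: ne_tree[0].label() assumes the first child of the chunk tree is a named-entity subtree; for a word NLTK does not chunk (such as 'Ringo' tagged NN in the question's output) that child is a plain (word, tag) tuple and .label() raises AttributeError. A more defensive sketch collects the category of every entity subtree instead:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

def entity_labels(text):
    # return the NE category (e.g. PERSON, GPE) of each chunked subtree
    tree = nltk.ne_chunk(pos_tag(word_tokenize(text)))
    return [subtree.label() for subtree in tree if isinstance(subtree, nltk.Tree)]

df['results'] = df['Text'].apply(entity_labels)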

Pandas_datareader error SymbolWarning: Failed to read symbol: 'T', replacing with NaN

I have this code which gets a list of symbols from Wikipedia and then gets the stock data from Yahoo Finance. This simple code was working fine up until a couple of days ago, but for some reason I'm getting the "unable to read symbol" error on many stocks. Is Yahoo doing this? What can I do to fix this error? I cannot ignore it because over 50 symbols were NaN, and when I rerun the code different symbols show up in the error.
pandas_datareader version: 0.8.1
Code:
import datetime
import pandas as pd
import numpy as np
import csv
from pandas_datareader import data as web
import matplotlib
import matplotlib.pyplot as plt
import requests
import bs4 as bs
from urllib.request import urlopen
from bs4 import BeautifulSoup
import tqdm
from pandas import DataFrame
import seaborn as sns

resp = requests.get('http://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
soup = bs.BeautifulSoup(resp.text, 'lxml')
table = soup.find('table', {'class': 'wikitable sortable'})
tickers = []
for row in table.findAll('tr')[1:]:
    ticker = row.findAll('td')[0].text.strip()
    tickers.append(ticker)

start = datetime.date(2008, 11, 1)
end = datetime.date.today()

# df = web.get_data_yahoo(tickers, start, end)
df = web.DataReader(tickers, 'yahoo', start, end)
Error:
C:\ProgramData\Anaconda3\lib\site-packages\pandas_datareader\base.py:270: SymbolWarning: Failed to read symbol: 'T', replacing with NaN.
warnings.warn(msg.format(sym), SymbolWarning)
C:\ProgramData\Anaconda3\lib\site-packages\pandas_datareader\base.py:270: SymbolWarning: Failed to read symbol: 'BKR', replacing with NaN.
warnings.warn(msg.format(sym), SymbolWarning)
C:\ProgramData\Anaconda3\lib\site-packages\pandas_datareader\base.py:270: SymbolWarning: Failed to read symbol: 'BRK.B', replacing with NaN.
warnings.warn(msg.format(sym), SymbolWarning)
It looks like the issue can be fixed by replacing the '.' in tickers with '-', per this GitHub issue.
Also, you do not need requests or BeautifulSoup; just use pd.read_html.
I successfully created a DataFrame with no warnings or errors in Python 3.6.8 and pandas 0.24.2 for all 505 tickers. See the example below:
import pandas as pd
from pandas_datareader import data as web
import datetime
# no need for requests or BeautifulSoup use read_html
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')[0]
# convert symbol column to list
tickers = df['Symbol'].values.tolist()
# list comprehension to replace data in strings
t = [x.replace('.', '-') for x in tickers]
start = datetime.date(2008,11,1)
end = datetime.date.today()
df2 = web.DataReader(t, 'yahoo', start, end)
Here is the list of all 505 tickers in the DataFrame:
print(*df2.columns.levels[1])
A AAL AAP AAPL ABBV ABC ABMD ABT ACN ADBE ADI ADM ADP ADS ADSK AEE AEP AES AFL AGN AIG AIV AIZ AJG AKAM ALB ALGN ALK ALL ALLE ALXN AMAT AMCR AMD AME AMG AMGN AMP AMT AMZN ANET ANSS ANTM AON AOS APA APD APH APTV ARE ARNC ATO ATVI AVB AVGO AVY AWK AXP AZO BA BAC BAX BBT BBY BDX BEN BF-B BIIB BK BKNG BKR BLK BLL BMY BR BRK-B BSX BWA BXP C CAG CAH CAT CB CBOE CBRE CBS CCI CCL CDNS CDW CE CERN CF CFG CHD CHRW CHTR CI CINF CL CLX CMA CMCSA CME CMG CMI CMS CNC CNP COF COG COO COP COST COTY CPB CPRI CPRT CRM CSCO CSX CTAS CTL CTSH CTVA CTXS CVS CVX CXO D DAL DD DE DFS DG DGX DHI DHR DIS DISCA DISCK DISH DLR DLTR DOV DOW DRE DRI DTE DUK DVA DVN DXC EA EBAY ECL ED EFX EIX EL EMN EMR EOG EQIX EQR ES ESS ETFC ETN ETR EVRG EW EXC EXPD EXPE EXR F FANG FAST FB FBHS FCX FDX FE FFIV FIS FISV FITB FLIR FLS FLT FMC FOX FOXA FRC FRT FTI FTNT FTV GD GE GILD GIS GL GLW GM GOOG GOOGL GPC GPN GPS GRMN GS GWW HAL HAS HBAN HBI HCA HD HES HFC HIG HII HLT HOG HOLX HON HP HPE HPQ HRB HRL HSIC HST HSY HUM IBM ICE IDXX IEX IFF ILMN INCY INFO INTC INTU IP IPG IPGP IQV IR IRM ISRG IT ITW IVZ JBHT JCI JEC JKHY JNJ JNPR JPM JWN K KEY KEYS KHC KIM KLAC KMB KMI KMX KO KR KSS KSU L LB LDOS LEG LEN LH LHX LIN LKQ LLY LMT LNC LNT LOW LRCX LUV LVS LW LYB M MA MAA MAC MAR MAS MCD MCHP MCK MCO MDLZ MDT MET MGM MHK MKC MKTX MLM MMC MMM MNST MO MOS MPC MRK MRO MS MSCI MSFT MSI MTB MTD MU MXIM MYL NBL NCLH NDAQ NEE NEM NFLX NI NKE NLOK NLSN NOC NOV NOW NRG NSC NTAP NTRS NUE NVDA NVR NWL NWS NWSA O OKE OMC ORCL ORLY OXY PAYX PBCT PCAR PEAK PEG PEP PFE PFG PG PGR PH PHM PKG PKI PLD PM PNC PNR PNW PPG PPL PRGO PRU PSA PSX PVH PWR PXD PYPL QCOM QRVO RCL RE REG REGN RF RHI RJF RL RMD ROK ROL ROP ROST RSG RTN SBAC SBUX SCHW SEE SHW SIVB SJM SLB SLG SNA SNPS SO SPG SPGI SRE STI STT STX STZ SWK SWKS SYF SYK SYY T TAP TDG TEL TFX TGT TIF TJX TMO TMUS TPR TRIP TROW TRV TSCO TSN TTWO TWTR TXN TXT UA UAA UAL UDR UHS ULTA UNH UNM UNP UPS URI USB UTX V VAR VFC VIAB VLO VMC VNO VRSK VRSN VRTX VTR VZ WAB WAT WBA WCG WDC WEC WELL WFC WHR WLTW WM WMB WMT WRK WU WY WYNN XEC XEL XLNX XOM XRAY XRX XYL YUM ZBH ZION ZTS
print(len(df2.columns.levels[1]))
505
For those of you still facing the error, try using pandas_datareader version 0.10.0.

memory error on web.datareader using pandas

I have a block of code that works; however, I get memory errors or extremely long run times. Is there a more elegant solution that requires less memory or can be run in a shorter time?
import pandas as pd
import pandas.io.data as web
import datetime

# grab tickers from html
exchList = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies', infer_types=False)
sp500 = []
for ticker in exchList[0][0][1:]:
    sp500.append(ticker)
sp500 = [w.replace('.', '-') for w in sp500]

# set dates for the data fetch
start = datetime.datetime(2000, 1, 1)
end = datetime.date.today()

# fetch data from yahoo and print to csv
p = web.DataReader(sp500, "yahoo", start, end)
main_df = p.to_frame()
noIndex = main_df.reset_index()
noIndex.columns.values[1] = 'Name'
indexed = noIndex.set_index('Date')
csv = indexed.to_csv('edata.csv')
import pandas.io.data is deprecated in modern Pandas versions:
In [113]: import pandas.io.data
...
ImportError: The pandas.io.data module is moved to a separate package (pandas-datareader). After installing the pandas-datareader package (https://github.com/pandas-dev/pandas-datareader), you can change the import ``from pandas.io import data, wb`` to ``from pandas_datareader import data, wb``.
So we should use pandas_datareader instead:
from pandas_datareader import data as web
url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
sp500 = pd.read_html(url)[0].iloc[1:, 0].str.replace(r'\.', '-')
df = web.DataReader(sp500, "yahoo", '2000-01-01').to_frame()
Memory usage:
In [112]: df.memory_usage()
Out[112]:
Index 7974167
Open 15870936
High 15870936
Low 15870936
Close 15870936
Volume 15870936
Adj Close 15870936
dtype: int64
Execution time:
In [115]: %timeit -n 1 -r 1 web.DataReader(sp500, "yahoo", '2000-01-01').to_frame()
1 loop, best of 1: 1min 57s per loop
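Beyond the answer above (which mainly measures memory rather than reducing it), one way to keep peak memory low is to download the tickers in small batches and append each batch to the CSV instead of holding the whole result in RAM. A rough sketch, mirroring the .to_frame() call used above; the batch size and file name are arbitrary:
import pandas as pd
from pandas_datareader import data as web

url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
sp500 = pd.read_html(url)[0].iloc[1:, 0].str.replace(r'\.', '-').tolist()

batch_size = 50        # arbitrary; smaller batches mean a lower memory peak
out_file = 'edata.csv'

for start_idx in range(0, len(sp500), batch_size):
    batch = sp500[start_idx:start_idx + batch_size]
    part = web.DataReader(batch, 'yahoo', '2000-01-01').to_frame().reset_index()
    # write the header only with the first batch, then append
    part.to_csv(out_file, mode='w' if start_idx == 0 else 'a',
                header=(start_idx == 0), index=False)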
