Difficulty converting escaped UTF-8 using Python - python-3.x

I can't understand why one function works and the other doesn't. I have tried encoding the aristotle.txt file as utf-8 and other encodings, but still it only outputs the escaped codes. Code as seen below.
import codecs
def convert_to_utf8_1(from_file, to_file):
to_file = codecs.open(to_file,'w','utf8')
from_file = codecs.open(from_file, 'r', 'utf8')
#from_file = open(from_file, 'r')
for line in from_file:
to_file.write(line)
to_file.close()
def convert_to_utf8_2(from_string, to_file):
to_file = codecs.open(to_file,'w','utf8')
to_file.write(from_string)
to_file.close()
if __name__ == '__main__':
a = "Aristotle$$$$Aristotle (/\u02C8\u00E6r\u026A\u02CCst\u0252t\u0259l/; Greek: \u1F08\u03C1\u03B9\u03C3\u03C4\u03BF\u03C4\u03AD\u03BB\u03B7\u03C2 [aristot\u00E9l\u025B\u02D0s], Aristot\u00E9l\u0113s; 384 \u2013 322 BCE) was a Greek philosopher and scientist born in Stagirus, northern Greece, in 384 BCE. His father, Nicomachus, died when Aristotle was a child, whereafter Proxenus of Atarneus became his guardian. At eighteen, he joined Plato's Academy in Athens and remained there until the age of thirty-seven (c. 347 BCE)."
convert_to_utf8_1("aristotle.txt", "test1.txt")
convert_to_utf8_2(a, "test2.txt")
#########################OUTPUT FROM test1.txt###############################
#Aristotle$$$$Aristotle (/\u02C8\u00E6r\u026A\u02CCst\u0252t\u0259l/; Greek: #\u1F08\u03C1\u03B9\u03C3\u03C4\u03BF\u03C4\u03AD\u03BB\u03B7\u03C2 [aristot\u00E9l\u025B\u02D0s], Aristot\u00E9l\u0113s; 384 #\u2013 322 BCE) was a Greek philosopher and scientist born in Stagirus, northern Greece, in 384 BCE. His father, Nicomachus, died #when Aristotle was a child, whereafter Proxenus of Atarneus became his guardian. At eighteen, he joined Plato's Academy in Athens #and remained there until the age of thirty-seven (c. 347 BCE).
#########################OUTPUT FROM test2.txt###############################
#Aristotle$$$$Aristotle (/ˈærɪˌstɒtəl/; Greek: Ἀριστοτέλης [aristotélɛːs], Aristotélēs; 384 – 322 BCE) was a Greek philosopher and #scientist born in Stagirus, northern Greece, in 384 BCE. His father, Nicomachus, died when Aristotle was a child, whereafter #Proxenus of Atarneus became his guardian. At eighteen, he joined Plato's Academy in Athens and remained there until the age of #thirty-seven (c. 347 BCE).
Hex of a few lines as requested:
0000-0010: 41 72 69 73-74 6f 74 6c-65 24 24 24-24 41 72 69 Aristotl e$$$$Ari
0000-0020: 73 74 6f 74-6c 65 20 28-2f 5c 75 30-32 43 38 5c stotle.( /\u02C8\

Related

How to calculate Integral estimation in Python?

How can I do the following in python3 on the provided data set listed below?
Problem
Knowing that the data.txt has 2 columns:
xValues, where 𝑎 ≤ 𝑥 ≤ 𝑏 , with 𝑎 and 𝑏 being some constants
gOfXValues
1Compute second order estimates of 𝑔′(𝑥)
2Compute second order estimates of $g'(x)$ and $\int_a^b g(x)dx.$
Generally, we don't know that 𝑥 values given from a random data sample is evenly separated.
3Plot 𝑔(𝑥) and 𝑔′(𝑥) , print the integral.
4Based on the graph, what function do you think 𝑔(𝑥) is?
5Verify this by qualitatively comparing the exact derivative and integral of your supposed 𝑔(𝑥) with the numerical results obtained previously.
What I have done so far
import pandas as pd
dataFrame = pd.read_csv('/Users/Files/data.txt', sep="\s+", names=['xValue','gOfXValue'])
dataFrame.info
dataFrame.head()
dgdxArray = []
gOfXValues = []
#iterate all observations and extract the columns as the independent and dependent variable
for index, row in dataFrame.iterrows():
xValue = row['xValue']
gOfXValue = row['gOfXValue']
gOfXValues.append(gOfXValue)
if index > 0:
h = 0.05
difference = gOfXValue - gOfXValues[index-1] #check the difference between Current vs Previous value
dgdx = difference / h #get the Derivative
dgdxArray.append(dgdx) #add the derivative to an array so as to plot it
dgdxArray.insert(0,0.5) #hard code values
#plot the initial values provided
fig = plt.figure(figsize = (6,12))
ax1 = fig.add_subplot(311)
ax1.plot(dataFrame['xValue'], dataFrame['gOfXValue'])
ax1.set_title('Plot initial values x, g(x)')
ax1.set_xlabel('xValue')
ax1.set_ylabel('gOfXValue')
#Plot X value and the derivative on y axis
ax2 = fig.add_subplot(312)
ax2.plot(dataFrame['xValue'], dgdxArray)
ax2.set_title('Plot x, derivativeOfGOfx')
ax2.set_xlabel('xValue')
ax2.set_ylabel('gOfXValue')
fig.tight_layout()
plt.show()
EDIT_1
How can I find the original function definition given that I only have access to xValue and gOfXvalue?
Edit_2 based on comments with D Stanley
#we know that in calculus g(x) = sin(x), g'(x) = cos(x), g(x)dx = - cos(x) + C
#Therefore:
#calculate sin(x) and compare it to g(x) provided
#calculate f'(x) and compare it to g(x) provided
#calculate integral of g(x)dx and compare it to g(x) provided
constant = 2 #choose a random constant
#calculate the sin,cos and integral for existing x value considering that the original function is sin
sinXfound = np.sin(dataFrame.xValue)
cosXfound = np.cos(dataFrame.xValue)
intXfound = - np.cos(dataFrame.xValue) + constant
#create new columns in the original df with values calculate above
dataFrame['sinXfound'] = sinXfound
dataFrame['cosXfound'] = cosXfound
dataFrame['intXfound'] = intXfound
#find what is the difference between sin,cos newly found and the original xValue provided in the request
differenceSinXfound = sinXfound - dataFrame['gOfXValue']
differenceCosXfound = cosXfound - dataFrame['gOfXValue']
differenceIntXfound = intXfound - dataFrame['gOfXValue']
#add columns to df
dataFrame['differenceSinXfound'] = differenceSinXfound
dataFrame['differenceCosXfound'] = differenceCosXfound
dataFrame['differenceIntXfound'] = differenceCosXfound
print(dataFrame)
Edit_3 based on Lutz answer
xValues = dataFrame.xValue
gofXValues = dataFrame.gOfXValue
firstDiffArray = []
def calculate_ALL_Divided_Differences():
for index, row in dataFrame.iterrows():
if index > 0:
xNow = row['xValue']
gNow = row['gOfXValue']
difference = (gofXValues[indexNow] - gofXValues[indexNow - 1]) / (xValues[indexNow] - xValues[indexNow -1])
firstDiffArray.append(difference)
firstDividedDifference = (gofXValues[1] - gofXValues[0]) / (xValues[1] - xValues[0])
x0 = xValues[0]
gOfXZero = gofXValues[0]
#Apply Newton's divided difference interpolation formula
for index, row in dataFrame.iterrows():
if index > 0:
xNow = row['xValue']
gNow = row['gOfXValue']
x_Minus_x0 = xNow - xValues[0]
x_Minus_x1 = xNow - xValues[1]
#Newton's divided difference interpolation formula is
#f(x) = y0+(x-x0) f [x0,x1]+ (x-x0) * (x-x1) * f [x0,x1,x2]
divided_Difference_Interpolation = gOfXZero + (xNow - x0) * firstDividedDifference + x_Minus_x0 * x_Minus_x1
DataSet
0.000000000000000000e+00 1.000000000000000000e+00
3.157379551346525814e-02 1.031568549764810605e+00
6.314759102693051629e-02 1.063105631312673660e+00
9.472138654039577443e-02 1.094579807794844983e+00
1.262951820538610326e-01 1.125959705067717476e+00
1.578689775673262907e-01 1.157214042967250833e+00
1.894427730807915489e-01 1.188311666489717755e+00
2.210165685942568070e-01 1.219221576847691280e+00
2.525903641077220652e-01 1.249912962370308467e+00
2.841641596211873511e-01 1.280355229217014390e+00
3.157379551346525814e-01 1.310518031874168710e+00
3.473117506481178118e-01 1.340371303404112702e+00
3.788855461615830977e-01 1.369885285416546861e+00
4.104593416750483836e-01 1.399030557732340974e+00
4.420331371885136140e-01 1.427778067710209653e+00
4.736069327019788444e-01 1.456099159207016047e+00
5.051807282154441303e-01 1.483965601142838819e+00
5.367545237289094162e-01 1.511349615642326727e+00
5.683283192423747021e-01 1.538223905724288576e+00
5.999021147558398770e-01 1.564561682511917962e+00
6.314759102693051629e-01 1.590336691936528268e+00
6.630497057827704488e-01 1.615523240908179226e+00
6.946235012962356237e-01 1.640096222927107217e+00
7.261972968097009096e-01 1.664031143110431099e+00
7.577710923231661955e-01 1.687304142609184154e+00
7.893448878366314814e-01 1.709892022391332755e+00
8.209186833500967673e-01 1.731772266367076707e+00
8.524924788635619421e-01 1.752923063833377038e+00
8.840662743770272280e-01 1.773323331215339138e+00
9.156400698904925139e-01 1.792952733082778582e+00
9.472138654039576888e-01 1.811791702421020833e+00
9.787876609174229747e-01 1.829821460135725886e+00
1.010361456430888261e+00 1.847024033772298957e+00
1.041935251944353436e+00 1.863382275431223256e+00
1.073509047457818832e+00 1.878879878861460462e+00
1.105082842971284007e+00 1.893501395714874080e+00
1.136656638484749404e+00 1.907232250945481322e+00
1.168230433998214579e+00 1.920058757338174438e+00
1.199804229511679754e+00 1.931968129152434877e+00
1.231378025025145151e+00 1.942948494867437148e+00
1.262951820538610326e+00 1.952988909015839214e+00
1.294525616052075501e+00 1.962079363094462847e+00
1.326099411565540898e+00 1.970210795540986215e+00
1.357673207079006072e+00 1.977375100766707305e+00
1.389247002592471247e+00 1.983565137236369846e+00
1.420820798105936644e+00 1.988774734587002602e+00
1.452394593619401819e+00 1.992998699778669724e+00
1.483968389132867216e+00 1.996232822271006846e+00
1.515542184646332391e+00 1.998473878220378808e+00
1.547115980159797566e+00 1.999719633693477938e+00
1.578689775673262963e+00 1.999968846894156327e+00
1.610263571186728138e+00 1.999221269401275647e+00
1.641837366700193535e+00 1.997477646416338626e+00
1.673411162213658709e+00 1.994739716020657028e+00
1.704984957727123884e+00 1.991010207442792002e+00
1.736558753240589281e+00 1.986292838338002742e+00
1.768132548754054456e+00 1.980592311082403967e+00
1.799706344267519631e+00 1.973914308085537694e+00
1.831280139780985028e+00 1.966265486126021811e+00
1.862853935294450203e+00 1.957653469715929573e+00
1.894427730807915378e+00 1.948086843500509424e+00
1.926001526321380775e+00 1.937575143700825064e+00
1.957575321834845949e+00 1.926128848607841171e+00
1.989149117348311346e+00 1.913759368137436745e+00
2.020722912861776521e+00 1.900479032456751760e+00
2.052296708375241696e+00 1.886301079693208704e+00
2.083870503888706871e+00 1.871239642738459663e+00
2.115444299402172490e+00 1.855309735160411755e+00
2.147018094915637665e+00 1.838527236237377238e+00
2.178591890429102840e+00 1.820908875129262583e+00
2.210165685942568015e+00 1.802472214201578105e+00
2.241739481456033189e+00 1.783235631518890418e+00
2.273313276969498808e+00 1.763218302525168424e+00
2.304887072482963983e+00 1.742440180929283100e+00
2.336460867996429158e+00 1.720921978814716535e+00
2.368034663509894333e+00 1.698685145993306556e+00
2.399608459023359508e+00 1.675751848623608709e+00
2.431182254536824683e+00 1.652144947115186779e+00
2.462756050050290302e+00 1.627887973340858441e+00
2.494329845563755477e+00 1.603005107179614530e+00
2.525903641077220652e+00 1.577521152413588590e+00
2.557477436590685826e+00 1.551461512003107668e+00
2.589051232104151001e+00 1.524852162764468444e+00
2.620625027617616620e+00 1.497719629475682934e+00
2.652198823131081795e+00 1.470090958436002904e+00
2.683772618644546970e+00 1.441993690505579018e+00
2.715346414158012145e+00 1.413455833652134119e+00
2.746920209671477320e+00 1.384505835032010967e+00
2.778494005184942495e+00 1.355172552633428618e+00
2.810067800698408114e+00 1.325485226510211501e+00
2.841641596211873289e+00 1.295473449634670038e+00
2.873215391725338463e+00 1.265167138398678670e+00
2.904789187238803638e+00 1.234596502792368877e+00
2.936362982752268813e+00 1.203792016290152311e+00
2.967936778265734432e+00 1.172784385474099356e+00
2.999510573779199607e+00 1.141604519424951558e+00
3.031084369292664782e+00 1.110283498911275091e+00
3.062658164806129957e+00 1.078852545407476660e+00
3.094231960319595132e+00 1.047342989971558280e+00
3.125805755833060751e+00 1.015786242013636764e+00
3.157379551346525925e+00 9.842137579863630137e-01
3.188953346859991100e+00 9.526570100284420528e-01
3.220527142373456275e+00 9.211474545925236734e-01
3.252100937886921450e+00 8.897165010887251313e-01
3.283674733400387069e+00 8.583954805750482198e-01
3.315248528913852244e+00 8.272156145259004223e-01
3.346822324427317419e+00 7.962079837098479107e-01
3.378396119940782594e+00 7.654034972076313448e-01
3.409969915454247769e+00 7.348328616013215520e-01
3.441543710967712943e+00 7.045265503653301842e-01
3.473117506481178562e+00 6.745147734897881664e-01
3.504691301994643737e+00 6.448274473665716044e-01
3.536265097508108912e+00 6.154941649679892546e-01
3.567838893021574087e+00 5.865441663478661027e-01
3.599412688535039262e+00 5.580063094944210933e-01
3.630986484048504881e+00 5.299090415639968743e-01
3.662560279561970056e+00 5.022803705243168437e-01
3.694134075075435231e+00 4.751478372355316671e-01
3.725707870588900406e+00 4.485384879968926652e-01
3.757281666102365580e+00 4.224788475864115211e-01
3.788855461615830755e+00 3.969948928203856919e-01
3.820429257129296374e+00 3.721120266591415593e-01
3.852003052642761549e+00 3.478550528848133316e-01
3.883576848156226724e+00 3.242481513763912915e-01
3.915150643669691899e+00 3.013148540066936665e-01
3.946724439183157074e+00 2.790780211852837978e-01
3.978298234696622693e+00 2.575598190707169000e-01
4.009872030210087424e+00 2.367816974748317982e-01
4.041445825723553043e+00 2.167643684811096927e-01
4.073019621237018661e+00 1.975277857984218954e-01
4.104593416750483392e+00 1.790911248707376391e-01
4.136167212263949011e+00 1.614727637626225398e-01
4.167741007777413742e+00 1.446902648395883562e-01
4.199314803290879361e+00 1.287603572615404479e-01
4.230888598804344980e+00 1.136989203067911847e-01
4.262462394317809711e+00 9.952096754324846195e-02
4.294036189831275330e+00 8.624063186256325508e-02
4.325609985344740060e+00 7.387115139215894022e-02
4.357183780858205679e+00 6.242485629917493561e-02
4.388757576371671298e+00 5.191315649949035382e-02
4.420331371885136029e+00 4.234653028407053821e-02
4.451905167398601648e+00 3.373451387397807810e-02
4.483478962912066379e+00 2.608569191446230562e-02
4.515052758425531998e+00 1.940768891759592218e-02
4.546626553938997617e+00 1.370716166199725805e-02
4.578200349452462348e+00 8.989792557207887391e-03
4.609774144965927967e+00 5.260283979342972316e-03
4.641347940479392697e+00 2.522353583661263166e-03
4.672921735992858316e+00 7.787305987243531291e-04
4.704495531506323047e+00 3.115310584367314561e-05
4.736069327019788666e+00 2.803663065220618478e-04
4.767643122533254285e+00 1.526121779621192331e-03
4.799216918046719016e+00 3.767177728993265085e-03
4.830790713560184635e+00 7.001300221330386542e-03
4.862364509073649366e+00 1.122526541299739833e-02
4.893938304587114985e+00 1.643486276363004261e-02
4.925512100100580604e+00 2.262489923329280561e-02
4.957085895614045334e+00 2.978920445901367398e-02
4.988659691127510953e+00 3.792063690553715283e-02
5.020233486640975684e+00 4.701109098416056398e-02
5.051807282154441303e+00 5.705150513256296296e-02
5.083381077667906922e+00 6.803187084756523451e-02
5.114954873181371653e+00 7.994124266182545124e-02
5.146528668694837272e+00 9.276774905451867781e-02
5.178102464208302003e+00 1.064986042851256975e-01
5.209676259721767622e+00 1.211201211385396492e-01
5.241250055235233241e+00 1.366177245687767439e-01
5.272823850748697971e+00 1.529759662277010435e-01
5.304397646262163590e+00 1.701785398642742253e-01
5.335971441775628321e+00 1.882082975789789447e-01
5.367545237289093940e+00 2.070472669172214175e-01
5.399119032802559559e+00 2.266766687846610839e-01
5.430692828316024290e+00 2.470769361666227404e-01
5.462266623829489909e+00 2.682277336329232931e-01
5.493840419342954640e+00 2.901079776086668005e-01
5.525414214856420259e+00 3.126958573908157346e-01
5.556988010369884989e+00 3.359688568895683458e-01
5.588561805883350608e+00 3.599037770728926722e-01
5.620135601396816227e+00 3.844767590918209965e-01
5.651709396910280958e+00 4.096633080634712876e-01
5.683283192423746577e+00 4.354383174880819274e-01
5.714856987937211308e+00 4.617760942757109799e-01
5.746430783450676927e+00 4.886503843576732731e-01
5.778004578964142546e+00 5.160343988571614027e-01
5.809578374477607277e+00 5.439008407929837308e-01
5.841152169991072896e+00 5.722219322897904581e-01
5.872725965504537626e+00 6.009694422676585823e-01
5.904299761018003245e+00 6.301147145834531393e-01
5.935873556531468864e+00 6.596286965958874093e-01
5.967447352044933595e+00 6.894819681258308464e-01
5.999021147558399214e+00 7.196447707829856100e-01
6.030594943071863945e+00 7.500870376296912001e-01
6.062168738585329564e+00 7.807784231523084983e-01
6.093742534098795183e+00 8.116883335102823560e-01
6.125316329612259914e+00 8.427859570327489447e-01
6.156890125125725532e+00 8.740402949322825243e-01
6.188463920639190263e+00 9.054201922051545726e-01
6.220037716152655882e+00 9.368943686873262289e-01
6.251611511666121501e+00 9.684314502351897280e-01
6.283185307179586232e+00 9.999999999999997780e-01
Your derivative is wrong, at this level it should be
(gofx[i]-gofx[i-1]) / (x[i]-x[i-1])
But this is only a first order approximation of the derivative, the task asks for a second error order. That is, for the derivative at x[i], you have to take the interpolation polynomial through the points x[i-1], x[i], x[i+1] and their values
g[x[i]] + g[x[i],x[i+1]] * (x-x[i]) + g[x[i],x[i-1],x[i+1]] * (x-x[i])*(x-x[i-1])
and compute the derivative of it at x=x[i]. Or alternatively, from the Taylor expansion you know that
(gofx[i]-gofx[i-1]) / (x[i]-x[i-1]) = g'(x[i]) - 0.5*g''(x[i])*(x[i]-x[i-1])+...
(gofx[i+1]-gofx[i]) / (x[i+1]-x[i]) = g'(x[i]) + 0.5*g''(x[i])*(x[i+1]-x[i])+...
Combining both you can eliminate the term with g''(x[i]).
so if
dx = x[1:]-x[:-1]
dg = g[1:]-g[:-1]
are the simple differences, then the first order derivative with second error order is
dg_dx = dg/dx
diff_g = ( dx[:-1]*(dg_dx[1:]) + dx[1:]*(dg_dx[:-1]) ) / (dx[1:]+dx[:-1])
This is written so that the nature as convex combination becomes obvious.
For the integral, the cumulative trapezoidal quadrature should be sufficient.
sum( 0.5*(g[:-1]+g[1:])*(x[1:]-x[:-1]) )
Use the cumulative sum if you want the anti-derivative as function (table).
You might want to extract the data into numpy arrays directly, there should be functions in pandas that do that.
In total I get the short script
x,g = np.loadtxt('so65602720.data').T
%matplotlib inline
plt.figure(figsize=(10,3))
plt.subplot(131)
plt.plot(x,g,x,np.sin(x)+1); plt.legend(["table g values", "$1+sin(x)$"]); plt.grid();
dx = x[1:]-x[:-1]
dg = g[1:]-g[:-1]
dg_dx = dg/dx
diff_g = ( dx[:-1]*(dg_dx[1:]) + dx[1:]*(dg_dx[:-1]) ) / (dx[1:]+dx[:-1])
plt.subplot(132)
plt.plot(x,g,x[1:-1],diff_g); plt.legend(["g", "g'"]); plt.grid();
int_g = np.cumsum(0.5*(g[1:]+g[:-1])*(x[1:]-x[:-1]))
plt.subplot(133)
plt.plot(x[1:],int_g,x,x); plt.legend(["integral of g","diagonal"]); plt.grid();
plt.tight_layout(); plt.show()
resulting in the plot collection
showing first that indeed the data is of the function g(x)=1+sin(x), that the derivative correctly looks like the cosine and the integral is x+1-cos(x).
I am positing this as a placeholder to get suggestions.
#Try1
import pandas as pd
from sympy import *
dataFrame = pd.read_csv('/data_2.txt', sep="\s+", names=['xValue','gOfXValue'],float_precision='round_trip', nrows=200)
pd.set_option('display.max_rows', 200)
xValues = dataFrame.xValue
gofXValues = dataFrame.gOfXValue
#Compute second order estimate for g'(x)
firstDiffArray = []
def calculate_ALL_Divided_Differences():
for index, row in dataFrame.iterrows():
if index > 0:
xNow = row['xValue']
gNow = row['gOfXValue']
difference = (gofXValues[index] - gofXValues[index - 1]) / (xValues[index] - xValues[index -1])
firstDiffArray.append(difference)
calculate_ALL_Divided_Differences()
#Plot x and g'(x) that I've found
xValuesLessOne = xValues[:-1]
fig = plt.figure(figsize = (6,12))
ax1 = fig.add_subplot(311)
ax1.plot(xValuesLessOne, firstDiffArray)
ax1.set_title('x vs g Derivated')
ax1.set_xlabel('xValue')
ax1.set_ylabel('g Derivated')
fig.tight_layout()
plt.show()
firstDiffArray.insert(0,0) #insert the first row as 0 to have same 200 rows data shape
dataFrame['derivative_For_gOfX'] = firstDiffArray #create column for the derivative of gOfX
###Find integral g(x)dx
def findIntegral():
for index, row in dataFrame.iterrows():
if index < 199:
xNow = xValues[index]
xNowPlus_1 = xValues[index + 1]
gNow = gofXValues[index]
gNowPlus_1 = gofXValues[index + 1]
intg = (gNowPlus_1 - gNow) * (xNowPlus_1 - xNow)
integralPoints.append(intg)
intgTrapez = 0.5*(gofXValues[i+1] + gofXValues[i]) * (xValues[i+1] - xValues[i])
trapezIntegralPoints.append(intgTrapez)
#integral found numerically
integralFound = findIntegral()
sumIntegralPoints = sum(integralPoints)
sumtrapezIntegralPoints = sum(trapezIntegralPoints)
print('IntegralFound', sumIntegralPoints)
print('TrapezIntegral', sumtrapezIntegralPoints)
IntegralFound -2.7538735181131813e-17
TrapezIntegral 8.328014461189756
xValue gOfXValue derivative_For_gOfX
0 0.000000 1.000000 0.000000
1 0.031574 1.031569 0.999834
2 0.063148 1.063106 0.998837
3 0.094721 1.094580 0.996845
4 0.126295 1.125960 0.993859
5 0.157869 1.157214 0.989882
6 0.189443 1.188312 0.984919
7 0.221017 1.219222 0.978974
8 0.252590 1.249913 0.972052
9 0.284164 1.280355 0.964162
10 0.315738 1.310518 0.955311
11 0.347312 1.340371 0.945508
12 0.378886 1.369885 0.934762
13 0.410459 1.399031 0.923084
14 0.442033 1.427778 0.910486
15 0.473607 1.456099 0.896981
16 0.505181 1.483966 0.882581
17 0.536755 1.511350 0.867302
18 0.568328 1.538224 0.851158
19 0.599902 1.564562 0.834166
20 0.631476 1.590337 0.816342
21 0.663050 1.615523 0.797704
22 0.694624 1.640096 0.778271
23 0.726197 1.664031 0.758063
24 0.757771 1.687304 0.737099
25 0.789345 1.709892 0.715400
26 0.820919 1.731772 0.692987
27 0.852492 1.752923 0.669885
28 0.884066 1.773323 0.646114
29 0.915640 1.792953 0.621699
30 0.947214 1.811792 0.596665
31 0.978788 1.829821 0.571035
32 1.010361 1.847024 0.544837
33 1.041935 1.863382 0.518096
34 1.073509 1.878880 0.490838
35 1.105083 1.893501 0.463090
36 1.136657 1.907232 0.434881
37 1.168230 1.920059 0.406239
38 1.199804 1.931968 0.377192
39 1.231378 1.942948 0.347768
40 1.262952 1.952989 0.317998
41 1.294526 1.962079 0.287911
42 1.326099 1.970211 0.257537
43 1.357673 1.977375 0.226907
44 1.389247 1.983565 0.196050
45 1.420821 1.988775 0.164998
46 1.452395 1.992999 0.133781
47 1.483968 1.996233 0.102431
48 1.515542 1.998474 0.070978
49 1.547116 1.999720 0.039455
50 1.578690 1.999969 0.007893
51 1.610264 1.999221 -0.023677
52 1.641837 1.997478 -0.055224
53 1.673411 1.994740 -0.086715
54 1.704985 1.991010 -0.118120
55 1.736559 1.986293 -0.149408
56 1.768133 1.980592 -0.180546
57 1.799706 1.973914 -0.211505
58 1.831280 1.966265 -0.242252
59 1.862854 1.957653 -0.272758
60 1.894428 1.948087 -0.302993
61 1.926002 1.937575 -0.332925
62 1.957575 1.926129 -0.362525
63 1.989149 1.913759 -0.391764
64 2.020723 1.900479 -0.420613
65 2.052297 1.886301 -0.449042
66 2.083871 1.871240 -0.477023
67 2.115444 1.855310 -0.504529
68 2.147018 1.838527 -0.531533
69 2.178592 1.820909 -0.558006
70 2.210166 1.802472 -0.583923
71 2.241739 1.783236 -0.609258
72 2.273313 1.763218 -0.633986
73 2.304887 1.742440 -0.658081
74 2.336461 1.720922 -0.681521
75 2.368035 1.698685 -0.704281
76 2.399608 1.675752 -0.726340
77 2.431182 1.652145 -0.747674
78 2.462756 1.627888 -0.768263
79 2.494330 1.603005 -0.788086
80 2.525904 1.577521 -0.807124
81 2.557477 1.551462 -0.825357
82 2.589051 1.524852 -0.842767
83 2.620625 1.497720 -0.859337
84 2.652199 1.470091 -0.875051
85 2.683773 1.441994 -0.889892
86 2.715346 1.413456 -0.903846
87 2.746920 1.384506 -0.916900
88 2.778494 1.355173 -0.929039
89 2.810068 1.325485 -0.940252
90 2.841642 1.295473 -0.950528
91 2.873215 1.265167 -0.959856
92 2.904789 1.234597 -0.968228
93 2.936363 1.203792 -0.975635
94 2.967937 1.172784 -0.982069
95 2.999511 1.141605 -0.987524
96 3.031084 1.110283 -0.991994
97 3.062658 1.078853 -0.995476
98 3.094232 1.047343 -0.997965
99 3.125806 1.015786 -0.999460
100 3.157380 0.984214 -0.999958
101 3.188953 0.952657 -0.999460
102 3.220527 0.921147 -0.997965
103 3.252101 0.889717 -0.995476
104 3.283675 0.858395 -0.991994
105 3.315249 0.827216 -0.987524
106 3.346822 0.796208 -0.982069
107 3.378396 0.765403 -0.975635
108 3.409970 0.734833 -0.968228
109 3.441544 0.704527 -0.959856
110 3.473118 0.674515 -0.950528
111 3.504691 0.644827 -0.940252
112 3.536265 0.615494 -0.929039
113 3.567839 0.586544 -0.916900
114 3.599413 0.558006 -0.903846
115 3.630986 0.529909 -0.889892
116 3.662560 0.502280 -0.875051
117 3.694134 0.475148 -0.859337
118 3.725708 0.448538 -0.842767
119 3.757282 0.422479 -0.825357
120 3.788855 0.396995 -0.807124
121 3.820429 0.372112 -0.788086
122 3.852003 0.347855 -0.768263
123 3.883577 0.324248 -0.747674
124 3.915151 0.301315 -0.726340
125 3.946724 0.279078 -0.704281
126 3.978298 0.257560 -0.681521
127 4.009872 0.236782 -0.658081
128 4.041446 0.216764 -0.633986
129 4.073020 0.197528 -0.609258
130 4.104593 0.179091 -0.583923
131 4.136167 0.161473 -0.558006
132 4.167741 0.144690 -0.531533
133 4.199315 0.128760 -0.504529
134 4.230889 0.113699 -0.477023
135 4.262462 0.099521 -0.449042
136 4.294036 0.086241 -0.420613
137 4.325610 0.073871 -0.391764
138 4.357184 0.062425 -0.362525
139 4.388758 0.051913 -0.332925
140 4.420331 0.042347 -0.302993
141 4.451905 0.033735 -0.272758
142 4.483479 0.026086 -0.242252
143 4.515053 0.019408 -0.211505
144 4.546627 0.013707 -0.180546
145 4.578200 0.008990 -0.149408
146 4.609774 0.005260 -0.118120
147 4.641348 0.002522 -0.086715
148 4.672922 0.000779 -0.055224
149 4.704496 0.000031 -0.023677
150 4.736069 0.000280 0.007893
151 4.767643 0.001526 0.039455
152 4.799217 0.003767 0.070978
153 4.830791 0.007001 0.102431
154 4.862365 0.011225 0.133781
155 4.893938 0.016435 0.164998
156 4.925512 0.022625 0.196050
157 4.957086 0.029789 0.226907
158 4.988660 0.037921 0.257537
159 5.020233 0.047011 0.287911
160 5.051807 0.057052 0.317998
161 5.083381 0.068032 0.347768
162 5.114955 0.079941 0.377192
163 5.146529 0.092768 0.406239
164 5.178102 0.106499 0.434881
165 5.209676 0.121120 0.463090
166 5.241250 0.136618 0.490838
167 5.272824 0.152976 0.518096
168 5.304398 0.170179 0.544837
169 5.335971 0.188208 0.571035
170 5.367545 0.207047 0.596665
171 5.399119 0.226677 0.621699
172 5.430693 0.247077 0.646114
173 5.462267 0.268228 0.669885
174 5.493840 0.290108 0.692987
175 5.525414 0.312696 0.715400
176 5.556988 0.335969 0.737099
177 5.588562 0.359904 0.758063
178 5.620136 0.384477 0.778271
179 5.651709 0.409663 0.797704
180 5.683283 0.435438 0.816342
181 5.714857 0.461776 0.834166
182 5.746431 0.488650 0.851158
183 5.778005 0.516034 0.867302
184 5.809578 0.543901 0.882581
185 5.841152 0.572222 0.896981
186 5.872726 0.600969 0.910486
187 5.904300 0.630115 0.923084
188 5.935874 0.659629 0.934762
189 5.967447 0.689482 0.945508
190 5.999021 0.719645 0.955311
191 6.030595 0.750087 0.964162
192 6.062169 0.780778 0.972052
193 6.093743 0.811688 0.978974
194 6.125316 0.842786 0.984919
195 6.156890 0.874040 0.989882
196 6.188464 0.905420 0.993859
197 6.220038 0.936894 0.996845
198 6.251612 0.968431 0.998837
199 6.283185 1.000000 0.999834

Why do I get 'positional index out of bound' error only for a few inputs?

I am working on a project where I return the search results for a particular topic from the dataset of scraped news articles. However, for few of the inputs, I get 'IndexError: positional index out of bounds' while for others, the code works just fine. I even tried to limit the number of outputs and printed the indexes of the rows which are to be printed just to be sure that '.iloc' does not return that error but it still is happening.
Data:
Code:
'''
def search_index(c):
global input_str
c = c.lower()
c = re.sub("((\S+)?(http(S)?)(\S+))|((\S+)?(www)(\S+))|((\S+)?(\#)(\S+)?)", " ", c)
c = re.sub('[^a-zA-Z0-9\n]',' ',c)
d = list(c.split())
input_str = [word for word in d if word not in stop_words]
print(input_str)
e = OrderedDict()
f = []
temp=[]
for index,content in data.iterrows():
count = 0
points = 0
for i in input_str:
if i in (content['word_count']).keys():
#pdb.set_trace()
count += 1 # considering how many words from the input match with the content
points += content['word_count'][i] # considering the number of times those words occur in the content corpus
if len(input_str)<=3:
if count>=1:
e[index] = {'count':count,'points':points}
elif 3 < len(input_str) <=5:
if count>=2:
e[index] = {'count':count,'points':points}
elif len(input_str) > 5:
if count>=3:
e[index] = {'count':count,'points':points}
#print('\nIndex:',index,'\nContent:\n',content['Content'])
# the lambda function first sorts the dictionary based on the 'count' and then on the basis of 'points'
for key,val in sorted(e.items(), key=lambda kv: (kv[1]['count'],kv[1]['points']),reverse=True):
f.append(key)
#print(key,val)
#print(f)
#data.iloc[f,:]
print('Total number of results: ',len(f))
if len(f)>50 :
temp=f[:20]
print(temp)
print('Top 20 results:\n')
a = data.iloc[temp,[0,1,2,3]].copy()
else:
a = data.iloc[f,[0,1,2,3]].copy()
print(a)
'''
'''
def user_ask():
b = input('Enter the topic you''re interested in:')
articles = search_index(b)
print(articles)
'''
'''
user_ask()
'''
Output: For this input I am getting the required output
'''
Enter the topic youre interested in:Joe Biden
['joe', 'biden']
Total number of results: 2342
[2337, 3314, 4164, 3736, 3750, 3763, 4246, 3386, 3392, 13369, 3006, 4401,
4089, 3787, 4198, 3236, 4432, 4097, 4179, 4413]
Top 20 results:
Link \
2467 https://abcnews.go.com/Politics/rep-max-rose-c...
3471 https://abcnews.go.com/International/dalai-lam...
4343 https://abcnews.go.com/US/georgia-legislation-...
3910 https://abcnews.go.com/Politics/temperatures-c...
3924 https://abcnews.go.com/Business/cheap-fuel-pul...
3937 https://abcnews.go.com/US/puerto-ricans-demand...
4425 https://abcnews.go.com/Politics/trump-biden-is...
3543 https://abcnews.go.com/Business/record-number-...
3549 https://abcnews.go.com/US/michigan-state-stude...
17774 https://abcnews.go.com/Politics/bernie-sanders...
3152 https://abcnews.go.com/Politics/note-gop-aids-...
4583 https://abcnews.go.com/Politics/polls-show-tig...
4268 https://abcnews.go.com/International/students-...
3962 https://abcnews.go.com/Politics/heels-arizona-...
4377 https://abcnews.go.com/Politics/north-carolina...
3388 https://abcnews.go.com/Lifestyle/guy-fieri-lau...
4614 https://abcnews.go.com/Politics/persistence-he...
4276 https://abcnews.go.com/Politics/congressional-...
4358 https://abcnews.go.com/US/nursing-home-connect...
4595 https://abcnews.go.com/US/hurricane-sally-upda...
Title \
2467 Rep. Max Rose calls on Trump to up COVID-19 ai...
3471 The Dalai Lama's simple advice to navigating C...
4343 Georgia lawmakers pass bill that gives court t...
3910 Temperatures and carbon dioxide are up, regula...
3924 Has cheap fuel pulled the plug on electric veh...
3937 Puerto Ricans demand state of emergency amid r...
4425 Trump vs. Biden on the issues: Foreign policy
3543 Record number of women CEOs on this year's For...
3549 All Michigan State students asked to quarantin...
17774 Bernie Sanders, Danny Glover Attend Game 7 of ...
3152 The Note: GOP aids Trump in programming around...
4583 Trump adviser predicts Sunbelt sweep, misleads...
4268 2 students allegedly caught up in Belarus crac...
3962 On heels of Arizona Senate primary, Republican...
4377 North Carolina to be a crucial battleground st...
3388 Guy Fieri has helped raise over $22M for resta...
4614 Little girls will have to wait 4 more years, W...
4276 Congressional Black Caucus to propose policing...
4358 Nursing home in Connecticut transferring all r...
4595 Sally slams Gulf Coast with life-threatening f...
Content Category
2467 New York Rep. Max Rose joined “The View” Monda... Politics
3471 As millions of people around the world continu... International
4343 They've done their time behind bars and been o... US
3910 Every week we'll bring you some of the climate... Politics
3924 Electric vehicles have always been a tough sel... Business
3937 As Puerto Rico struggles to recover from multi... US
4425 American foreign policy for over half a centur... Politics
3543 A record high number of female CEOs are at the... Business
3549 All local Michigan State University students h... US
17774 — -- Bernie Sanders capped Memorial Day off by... Politics
3152 The TAKE with Rick Klein\nPresident Donald Tru... Politics
4583 Facing polls showing a competitive race in as ... Politics
4268 A U.S. student studying at New York’s Columbia... International
3962 What's sure to be one of the most expensive an... Politics
4377 North Carolina, home of the upcoming business ... Politics
3388 Guy Fieri should add donations to his triple d... Lifestyle
4614 Four years ago, a major political party nomina... Politics
4276 The Congressional Black Caucus is at work on a... Politics
4358 All residents at a Connecticut nursing home ar... US
4595 Sally made landfall near Gulf Shores, Alabama,... US
None
​
'''
For this input it is returning an error.
'''
Enter the topic youre interested in:Joe
['joe']
Total number of results: 2246
[4246, 4594, 3763, 3736, 4448, 2337, 3431, 3610, 3636, 4089, 13369, 15363,
7269, 21077, 3299, 4372, 4413, 7053, 15256, 1305]
Top 20 results:
--------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-31-ff543d46b951> in <module>
----> 1 user_ask()
<ipython-input-27-31af284a01b4> in user_ask()
4 if int(a) == 0:
5 b = input('Enter the topic you''re interested in:')
----> 6 articles = search_index(b)
7 print(articles)
8
<ipython-input-25-4a5261a1e717> in search_index(c)
50 print(temp)
51 print('Top 20 results:\n')
---> 52 a = data.iloc[temp,[0,1,2,3]].copy()
53 else:
54 a = data.iloc[f,[0,1,2,3]].copy()
c:\users\henis\appdata\local\programs\python\python37\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
1759 except (KeyError, IndexError, AttributeError):
1760 pass
-> 1761 return self._getitem_tuple(key)
1762 else:
1763 # we by definition only have the 0th axis
c:\users\henis\appdata\local\programs\python\python37\lib\site-packages\pandas\core\indexing.py in _getitem_tuple(self, tup)
2064 def _getitem_tuple(self, tup: Tuple):
2065
-> 2066 self._has_valid_tuple(tup)
2067 try:
2068 return self._getitem_lowerdim(tup)
c:\users\henis\appdata\local\programs\python\python37\lib\site-packages\pandas\core\indexing.py in _has_valid_tuple(self, key)
700 raise IndexingError("Too many indexers")
701 try:
--> 702 self._validate_key(k, i)
703 except ValueError:
704 raise ValueError(
c:\users\henis\appdata\local\programs\python\python37\lib\site-packages\pandas\core\indexing.py in _validate_key(self, key, axis)
2006 # check that the key does not exceed the maximum size of the index
2007 if len(arr) and (arr.max() >= len_axis or arr.min() < -len_axis):
-> 2008 raise IndexError("positional indexers are out-of-bounds")
2009 else:
2010 raise ValueError(f"Can only index by location with a
[{self._valid_types}]")
IndexError: positional indexers are out-of-bounds
'''

How do I deal with empty list items while scraping web data?

I'm trying to scrape data into a CSV file from a website that lists contact information for people in my industry. My code works well until I get to a page where one of the entries doesn't have a specific item.
So for example:
I'm trying to collect
Name, Phone, Profile URL
If there isn't a phone number listed, there isn't even a tag for that field on the page, and my code errors out with
"IndexError: list index out of range"
I'm pretty new to this, but what I've managed to cobble together so far from various youtube tutorials/this site has really saved me a ton of time completing some tasks that would take me days otherwise. I'd appreciate any help that anyone is willing to offer.
I've tried varying if/then statements where if the variable is null, then set the variable to "Empty"
Edit:
I updated the code. I switched to CSS Selectors for more specificity and readability. I also added a try/except to at least bypass the index error, but doesn't solve the problem of incorrect data being stored due to uneven amounts of data for each field. Also, the site I'm trying to scrape is in the code now.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
driver = webdriver.Firefox()
MAX_PAGE_NUM = 5
MAX_PAGE_DIG = 2
with open('results.csv', 'w') as f:
f.write("Name, Number, URL \n")
#Run Through Pages
for i in range(1, MAX_PAGE_NUM + 1):
page_num = (MAX_PAGE_DIG - len(str(i))) * "0" + str(i)
website = "https://www.realtor.com/realestateagents/lansing_mi/pg-" + page_num
driver.get(website)
Name = driver.find_elements_by_css_selector('div.agent-list-card-title-text.clearfix > div.agent-name.text-bold > a')
Number = driver.find_elements_by_css_selector('div.agent-list-card-title-text.clearfix > div.agent-phone.hidden-xs.hidden-xxs')
URL = driver.find_elements_by_css_selector('div.agent-list-card-title-text.clearfix > div.agent-name.text-bold > a')
#Collect Data From Each Page
num_page_items = len(Name)
with open('results.csv', 'a') as f:
for i in range(num_page_items):
try:
f.write(Name[i].text.replace(",", ".") + "," + Number[i].text + "," + URL[i].get_attribute('href') + "\n")
print(Name[i].text.replace(",", ".") + "," + Number[i].text + "," + URL[i].get_attribute('href') + "\n")
except IndexError:
f.write("Skip, Skip, Skip \n")
print("Number Missing")
continue
driver.close()
If any of the fields I'm trying to collect don't exist on individual listings, I just want the empty field to be filled in as "Empty" on the spreadsheet.
You could use try/except to take care of that. I also opted to use Pandas and BeautifulSoup as I'm more familiar with those.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from bs4 import BeautifulSoup
driver = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe')
import pandas as pd
MAX_PAGE_NUM = 5
MAX_PAGE_DIG = 2
results = pd.DataFrame()
#Run Through Pages
for i in range(1, MAX_PAGE_NUM + 1):
page_num = (MAX_PAGE_DIG - len(str(i))) * "0" + str(i)
website = "https://www.realtor.com/realestateagents/lansing_mi/pg-" + page_num
driver.get(website)
soup = BeautifulSoup(driver.page_source, 'html.parser')
agent_cards = soup.find_all('div', {'class':'agent-list-card clearfix'})
for agent in agent_cards:
try:
Name = agent.find('div', {'itemprop':'name'}).text.strip().split('\n')[0]
except:
Name = None
try:
Number = agent.find('div', {'itemprop':'telephone'}).text.strip()
except:
Number = None
try:
URL = 'https://www.realtor.com/' + agent.find('a', href=True)['href']
except:
URL = None
temp_df = pd.DataFrame([[Name, Number, URL]], columns=['Name','Number','URL'])
results = results.append(temp_df, sort=True).reset_index(drop=True)
print('Processed page: %s' %i)
driver.close()
results.to_csv('results.csv', index=False)
Output:
print (results)
Name ... URL
0 Nicole Enz ... https://www.realtor.com//realestateagents/nico...
1 Jennifer Worthington ... https://www.realtor.com//realestateagents/jenn...
2 Katherine Keener ... https://www.realtor.com//realestateagents/kath...
3 Erica Cook ... https://www.realtor.com//realestateagents/eric...
4 Jeff Thornton, Broker, Assoc Broker ... https://www.realtor.com//realestateagents/jeff...
5 Neal Sanford, Agent ... https://www.realtor.com//realestateagents/neal...
6 Sherree Zea ... https://www.realtor.com//realestateagents/sher...
7 Jennifer Cooper ... https://www.realtor.com//realestateagents/jenn...
8 Charlyn Cosgrove ... https://www.realtor.com//realestateagents/char...
9 Kathy Birchen & Chad Dutcher ... https://www.realtor.com//realestateagents/kath...
10 Nancy Petroff ... https://www.realtor.com//realestateagents/nanc...
11 The Angela Averill Team ... https://www.realtor.com//realestateagents/the-...
12 Christina Tamburino ... https://www.realtor.com//realestateagents/chri...
13 Rayce O'Connell ... https://www.realtor.com//realestateagents/rayc...
14 Stephanie Morey ... https://www.realtor.com//realestateagents/step...
15 Sean Gardner ... https://www.realtor.com//realestateagents/sean...
16 John Burg ... https://www.realtor.com//realestateagents/john...
17 Linda Ellsworth-Moore ... https://www.realtor.com//realestateagents/lind...
18 David Bueche ... https://www.realtor.com//realestateagents/davi...
19 David Ledebuhr ... https://www.realtor.com//realestateagents/davi...
20 Aaron Fox ... https://www.realtor.com//realestateagents/aaro...
21 Kristy Seibold ... https://www.realtor.com//realestateagents/kris...
22 Genia Beckman ... https://www.realtor.com//realestateagents/geni...
23 Angela Bolan ... https://www.realtor.com//realestateagents/ange...
24 Constance Benca ... https://www.realtor.com//realestateagents/cons...
25 Lisa Fata ... https://www.realtor.com//realestateagents/lisa...
26 Mike Dedman ... https://www.realtor.com//realestateagents/mike...
27 Jamie Masarik ... https://www.realtor.com//realestateagents/jami...
28 Amy Yaroch ... https://www.realtor.com//realestateagents/amy-...
29 Debbie McCarthy ... https://www.realtor.com//realestateagents/debb...
.. ... ... ...
70 Vickie Blattner ... https://www.realtor.com//realestateagents/vick...
71 Faith F Steller ... https://www.realtor.com//realestateagents/fait...
72 A. Jason Titus ... https://www.realtor.com//realestateagents/a.--...
73 Matt Bunn ... https://www.realtor.com//realestateagents/matt...
74 Joe Vitale ... https://www.realtor.com//realestateagents/joe-...
75 Reozom Real Estate ... https://www.realtor.com//realestateagents/reoz...
76 Shane Broyles ... https://www.realtor.com//realestateagents/shan...
77 Megan Doyle-Busque ... https://www.realtor.com//realestateagents/mega...
78 Linda Holmes ... https://www.realtor.com//realestateagents/lind...
79 Jeff Burke ... https://www.realtor.com//realestateagents/jeff...
80 Jim Convissor ... https://www.realtor.com//realestateagents/jim-...
81 Concetta D'Agostino ... https://www.realtor.com//realestateagents/conc...
82 Melanie McNamara ... https://www.realtor.com//realestateagents/mela...
83 Julie Adams ... https://www.realtor.com//realestateagents/juli...
84 Liz Horford ... https://www.realtor.com//realestateagents/liz-...
85 Miriam Olsen ... https://www.realtor.com//realestateagents/miri...
86 Wanda Williams ... https://www.realtor.com//realestateagents/wand...
87 Troy Seyfert ... https://www.realtor.com//realestateagents/troy...
88 Maggie Gerich ... https://www.realtor.com//realestateagents/magg...
89 Laura Farhat Bramson ... https://www.realtor.com//realestateagents/laur...
90 Peter MacIntyre ... https://www.realtor.com//realestateagents/pete...
91 Mark Jacobsen ... https://www.realtor.com//realestateagents/mark...
92 Deb Good ... https://www.realtor.com//realestateagents/deb-...
93 Mary Jane Vanderstow ... https://www.realtor.com//realestateagents/mary...
94 Ben Magsig ... https://www.realtor.com//realestateagents/ben-...
95 Brenna Chamberlain ... https://www.realtor.com//realestateagents/bren...
96 Deborah Cooper, CNS ... https://www.realtor.com//realestateagents/debo...
97 Huggler, Bashore & Brooks ... https://www.realtor.com//realestateagents/hugg...
98 Jodey Shepardson Custack ... https://www.realtor.com//realestateagents/jode...
99 Madaline Alspaugh-Young ... https://www.realtor.com//realestateagents/mada...
[100 rows x 3 columns]

using shift() to compare row elements

I have the sample data and code below where I'm trying to loop through the dataDF column with the function and find the first case of increasing values then return the Quarter value corresponding the the 1st increasing value from the dataDF column. I'm planning to use the function with apply, but I don't think I'm using shift() properly. If I just try to return dataDF.shift() I get an error. I'm new to python so any tips on how to compare a row to the next row or what I'm doing wrong with shift() are greatly appreciated.
Sample Data:
return dataDF.head(20).to_dict()
{'Quarter': {246: '2008q3',
247: '2008q4',
248: '2009q1',
249: '2009q2',
250: '2009q3',
251: '2009q4',
252: '2010q1',
253: '2010q2',
254: '2010q3',
255: '2010q4',
256: '2011q1',
257: '2011q2',
258: '2011q3',
259: '2011q4',
260: '2012q1',
261: '2012q2',
262: '2012q3',
263: '2012q4',
264: '2013q1',
265: '2013q2'},
'dataDF': {246: 14843.0,
247: 14549.9,
248: 14383.9,
249: 14340.4,
250: 14384.1,
251: 14566.5,
252: 14681.1,
253: 14888.6,
254: 15057.700000000001,
255: 15230.200000000001,
256: 15238.4,
257: 15460.9,
258: 15587.1,
259: 15785.299999999999,
260: 15973.9,
261: 16121.9,
262: 16227.9,
263: 16297.299999999999,
264: 16475.400000000001,
265: 16541.400000000001}}
Code:
def find_end(x):
qrts = []
if (dataDF < dataDF.shift()):
qrts.append(dataDF.iloc[0,:].shift(1))
return qrts
Try
df.Quarter[df.dataDF > df.dataDF.shift()].iloc[0]
Returns
'2009q3'
IIUC:
In [46]: x.loc[x.dataDF.diff().gt(0).idxmax(), 'Quarter']
Out[46]: '2009q3'
Explanation:
In [43]: x
Out[43]:
Quarter dataDF
246 2008q3 14843.0
247 2008q4 14549.9
248 2009q1 14383.9
249 2009q2 14340.4
250 2009q3 14384.1
251 2009q4 14566.5
252 2010q1 14681.1
253 2010q2 14888.6
254 2010q3 15057.7
255 2010q4 15230.2
256 2011q1 15238.4
257 2011q2 15460.9
258 2011q3 15587.1
259 2011q4 15785.3
260 2012q1 15973.9
261 2012q2 16121.9
262 2012q3 16227.9
263 2012q4 16297.3
264 2013q1 16475.4
265 2013q2 16541.4
In [44]: x.dataDF.diff()
Out[44]:
246 NaN
247 -293.1
248 -166.0
249 -43.5
250 43.7 # <-------------------
251 182.4
252 114.6
253 207.5
254 169.1
255 172.5
256 8.2
257 222.5
258 126.2
259 198.2
260 188.6
261 148.0
262 106.0
263 69.4
264 178.1
265 66.0
Name: dataDF, dtype: float64
In [45]: x.dataDF.diff().gt(0).idxmax()
Out[45]: 250
Using numpy to find the argmax of diff greater than 0. Then using get_value to retrieve the value we need.
v = dataDF.dataDF.values
j = dataDF.columns.get_loc('Quarter')
dataDF.get_value((np.diff(v) > 0).argmax() + 1, j, takeable=True)
'2009q3'
What about the speeeeed!

How to combine multiple character columns into a single column in an R data frame

I am working with Census data and I need to combine four character columns into a single column.
Example:
LOGRECNO STATE COUNTY TRACT BLOCK
60 01 001 021100 1053
61 01 001 021100 1054
62 01 001 021100 1055
63 01 001 021100 1056
64 01 001 021100 1057
65 01 001 021100 1058
I want to create a new column that adds the strings of STATE, COUNTY, TRACT, and BLOCK together into a single string. Example:
LOGRECNO STATE COUNTY TRACT BLOCK BLOCKID
60 01 001 021100 1053 01001021101053
61 01 001 021100 1054 01001021101054
62 01 001 021100 1055 01001021101055
63 01 001 021100 1056 01001021101056
64 01 001 021100 1057 01001021101057
65 01 001 021100 1058 01001021101058
I've tried:
AL_Blocks$BLOCK_ID<- paste(c(AL_Blocks$STATE, AL_Blocks$County, AL_Blocks$TRACT, AL_Blocks$BLOCK), collapse = "")
But this combines all rows of all four columns into a single string.
Try this:
AL_Blocks$BLOCK_ID<- with(AL_Blocks, paste0(STATE, COUNTY, TRACT, BLOCK))
there was a typo in County... it should've been COUNTY. Also, you don't need the collapse parameter.
I hope that helps.
You can use do.call and paste0. Try:
AL_Blocks$BLOCK_ID <- do.call(paste0, AL_Block[c("STATE", "COUNTY", "TRACT", "BLOCK")])
Example output:
do.call(paste0, AL_Blocks[c("STATE", "COUNTY", "TRACT", "BLOCK")])
# [1] "010010211001053" "010010211001054" "010010211001055" "010010211001056"
# [5] "010010211001057" "010010211001058"
do.call(paste0, AL_Blocks[2:5])
# [1] "010010211001053" "010010211001054" "010010211001055" "010010211001056"
# [5] "010010211001057" "010010211001058"
You can also use unite from "tidyr", like this:
library(tidyr)
library(dplyr)
AL_Blocks %>%
unite(BLOCK_ID, STATE, COUNTY, TRACT, BLOCK, sep = "", remove = FALSE)
# LOGRECNO BLOCK_ID STATE COUNTY TRACT BLOCK
# 1 60 010010211001053 01 001 021100 1053
# 2 61 010010211001054 01 001 021100 1054
# 3 62 010010211001055 01 001 021100 1055
# 4 63 010010211001056 01 001 021100 1056
# 5 64 010010211001057 01 001 021100 1057
# 6 65 010010211001058 01 001 021100 1058
where "AL_Blocks" is provided as:
AL_Blocks <- structure(list(LOGRECNO = c("60", "61", "62", "63", "64", "65"),
STATE = c("01", "01", "01", "01", "01", "01"), COUNTY = c("001", "001",
"001", "001", "001", "001"), TRACT = c("021100", "021100", "021100",
"021100", "021100", "021100"), BLOCK = c("1053", "1054", "1055", "1056",
"1057", "1058")), .Names = c("LOGRECNO", "STATE", "COUNTY", "TRACT",
"BLOCK"), class = "data.frame", row.names = c(NA, -6L))
You can try this too
AL_Blocks <- transform(All_Blocks, BLOCKID = paste(STATE,COUNTY,
TRACT, BLOCK, sep = "")
Or try this
DF$BLOCKID <-
paste(DF$LOGRECNO, DF$STATE, DF$COUNTY,
DF$TRACT, DF$BLOCK, sep = "")
(Here is a method to set up the dataframe for people coming into this discussion later)
DF <-
data.frame(LOGRECNO = c(60, 61, 62, 63, 64, 65),
STATE = c(1, 1, 1, 1, 1, 1),
COUNTY = c(1, 1, 1, 1, 1, 1),
TRACT = c(21100, 21100, 21100, 21100, 21100, 21100),
BLOCK = c(1053, 1054, 1055, 1056, 1057, 1058))
You can use tidyverse package:
DF %>% unite(new_var, STATE, COUNTY, TRACT, BLOCK)
The new kid on the block is the glue package:
library(glue)
my_data %>%
glue::glue("{STATE}{COUNTY}{TRACT}{BLOCK}")
You can both WRITE and READ Text files with any specified "string-separator", not necessarily a character separator. This is very useful in many cases when the data has practically all terminal symbols, and thus, no 1 symbol can be used as a separator. Here are examples of read and write functions:
WRITE OUT Special Separator Text:
writeSepText <- function(df, fileName, separator) {
con <- file(fileName)
data <- apply(df, 1, paste, collapse = separator)
# data
data <- writeLines(data, con)
close(con)
return
}
Test Writing out text file separated by a string "bra_break_ket"
writeSepText(df=as.data.frame(Titanic), fileName="/Users/user/break_sep.txt", separator="<break>")
READ In text files with special separator string
readSepText <- function(fileName, separator) {
data <- readLines(con <- file(fileName))
close(con)
records <- sapply(data, strsplit, split=separator)
dataFrame <- data.frame(t(sapply(records,c)))
rownames(dataFrame) <- 1: nrow(dataFrame)
return(as.data.frame(dataFrame,stringsAsFactors = FALSE))
}
Test Reading in text file separated by
df <- readSepText(fileName="/Users/user/break_sep.txt", separator="<break>"); df

Resources