Remove n rows and iterate it n times in dataframe - python-3.x

I have 31 million values in a txt file. I need to remove the values between 21600 and 61200, which I did with the code below. Now I have to apply the same logic to every block of 86400 values: remove the values between 21600+86400 and 61200+86400, then between 21600+86400+86400 and 61200+86400+86400, and so on until the end of the data. I tried many options, even a linked list, but could not apply any of them to my large dataset. How should this be done?
Visual example for values 1 to 24, removing the values between 6 and 17:
1 2 3 4 5 6 - - - - - - - - - - 17 18 19 20 21 22 23 24
Then apply the same to the next set of rows, which follows the same structure (start 6+24=30 and stop 17+24=41):
25 26 27 28 29 30 - - - - - - - - - - 41 42 43 44 45 46 47 48
and so on until the end of the data (remove between 30+24 and 41+24 for the next set).
I limited the code below to roughly the first 250000 values for simplicity.
import numpy as np
import pandas as pd
sample = np.arange(0, 259201, 1).tolist()
df = pd.DataFrame(sample)
df = df.drop(df.index[21601:61200])
Basically, I need to apply something like the loop below, but I am not sure how to do it for my case.
for day in reversed(range(366)):
    df = df.drop(df.index[21601 + day * 86400:61200 + day * 86400])

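You can use the modulo operator for this (the % symbol in Python and pandas).
Here is how your last piece of code can be rewritten:
df = df[~(df.index.to_series() % 86400).between(21601, 61199)]
I used to_series() because between() is not defined for Index objects. Note that between() includes both endpoints by default, so the upper bound is 61199 to match the slice 21601:61200, which stops before 61200.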
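As a minimal sketch of the same modulo idea, the mask can also be built from row positions with NumPy, which avoids relying on the index labels. The column name value and the constants PERIOD, START and STOP are illustrative names, and the bounds assume the half-open range used in df.index[21601:61200].

```python
import numpy as np
import pandas as pd

PERIOD = 86400              # length of one repeating block (from the question)
START, STOP = 21601, 61200  # half-open range [START, STOP), as in df.index[21601:61200]

# Same sample data as in the question; "value" is an illustrative column name.
df = pd.DataFrame({"value": np.arange(0, 259201)})

# Position of every row inside its own block of PERIOD rows.
pos = np.arange(len(df)) % PERIOD

# Keep the rows whose in-block position falls outside [START, STOP).
keep = (pos < START) | (pos >= STOP)
filtered = df[keep]
```

Because the mask is computed in one vectorized pass, no loop over blocks is needed, which matters for tens of millions of rows.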

Related

Why Python throws "<pyqtgraph.graphicsItems.PlotDataItem.PlotDataItem at 0x1dc1a1bf670>" when trying to plot a simple boxplot?

I created the following df that contains the following data:
True Price Range
0 0.3151260504201625
1 0.08403361344537472
2 0.3577441077441151
3 0.2773629187113253
4 0.1715633712202524
5 0.4948364888123946
6 0.30068728522337035
7 0.043261951113993474
8 0.4562242016076512
9 0.1527050610820258
10 0.2185314685314596
11 0.9452626950978232
12 0.459016393442627
13 0.48097944905989937
14 0.17459624618071515
15 0.39534372940917983
16 0.44130626654898236
17 0.440237728373319
18 0.4386926957666129
19 0.4149377593361054
20 0.08748906386702823
21 0.21920210434019272
22 6.046511627906989
23 0.1536772777167961
24 0.06590509666081337
25 0.021987686895345346
26 0.46337157987643834
27 0.3077599472411393
28 0.043907793633368136
29 0.17644464049405312
30 0.29082774049218146
31 0.3157419936851629
32 0.49762497172584935
33 0.24797114517584748
34 0.49571879224875615
35 0.2941842045711658
36 0.49661399548532276
37 0.6515389800044902
38 0.4916201117318546
39 0.6037567084078699
40 1.1599375418246702
41 0.2668445630420171
42 0.28882470562096907
43 0.3771073646849967
44 0.17742293191395805
45 0.022158209616654382
46 0.46532240194991115
47 0.3576218149307117
48 0.15642458100558798
49 0.27008777852802623
50 0.3161698283649531
51 0.18289894833103962
52 0.7097069597069705
53 0.5325306783977797
54 0.13879250520472936
55 0.3940658321743243
56 0.1391465677180067
57 0.301694128568109
58 0.1860897883228582
59 0.1638193306810218
I'm trying to plot the corresponding boxplot. Based on this guide, I should be able to run df["True Price Range"].plot(kind='box', title='Boxplot of True Price Range') in the Spyder console to get it.
But after doing so, I get the following output:
Out[3]: <pyqtgraph.graphicsItems.PlotDataItem.PlotDataItem at 0x1dc1a1bf670>
I don't understand this output. I also tried:
df["True Price Range"].plot(kind='box', title='Boxplot of True Price Range').show()
And got literally nothing as output, not even errors.
Finally I tried using the Pandas boxplot function (i.e. df.boxplot(column= "True Price Range")) and got the following error:
AttributeError: module 'finplot.pdplot' has no attribute 'boxplot_frame'
Note: pandas and matplotlib.pyplot were imported as pd and plt respectively before running the code above.
May I get some assistance here?
Regarding your first command, the output you get is simply the object type that you have created (pyqtgraph.graphicsItems.PlotDataItem.PlotDataItem) and the memory address where this object is located (0x1dc1a1bf670). There is nothing wrong with this output. If you want to see the plot, you should add plt.show() after this command. Therefore, the code should be:
df["True Price Range"].plot(kind='box', title='Boxplot of True Price Range')
plt.show() # displays the figure
The matplotlib.pyplot.show function displays the figure that you have created.
Regarding the line df["True Price Range"].plot(kind='box', title='Boxplot of True Price Range').show(), you shouldn't call the show method on the object you have just created (i.e., the pyqtgraph.graphicsItems.PlotDataItem.PlotDataItem type). Just call the show method as mentioned above.
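Regarding the AttributeError you mention last, it means that the function pandas tried to call does not exist. The module named in the message, finplot.pdplot, suggests that pandas is dispatching its plotting calls to finplot rather than to matplotlib, and that module has no boxplot_frame function.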
Solved by running the following lines:
plt.boxplot(x=df, vert = False)
plt.show() # displays the figure
Note: matplotlib.pyplot must be imported as plt beforehand.
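If the pandas plotting backend has indeed been switched away from matplotlib (an assumption based on finplot.pdplot appearing in the error message), another option may be to request matplotlib explicitly for a single call through the backend keyword of the pandas plot method. A minimal sketch, using a few of the values from the question as stand-in data:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in data: the first few "True Price Range" values from the question, rounded.
df = pd.DataFrame({"True Price Range": [0.3151, 0.0840, 0.3577, 0.2774, 0.1716, 6.0465]})

# backend="matplotlib" overrides the "plotting.backend" option for this call only,
# so the result is a matplotlib Axes rather than an object from another library.
ax = df["True Price Range"].plot(
    kind="box",
    title="Boxplot of True Price Range",
    backend="matplotlib",
)

# In an interactive session, display the figure with:
# plt.show()
```

Setting pd.set_option("plotting.backend", "matplotlib") once should have the same effect for all later plot calls in the session.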

Create date ranges from an array of dates

Let's say I have below array of dates (not necessarily sorted):
import numpy as np
np.array(["2000Q1", "2000Q2", "2000Q3", "2000Q4", "2001Q1", "2001Q2", "2001Q3", "2001Q4", "2002Q1",
"2002Q2", "2002Q3", "2002Q4", "2003Q1", "2003Q2", "2003Q3", "2003Q4", "2004Q1", "2004Q2", "2004Q3",
"2004Q4", "2005Q1", "2005Q2", "2005Q3", "2005Q4", "2006Q1", "2006Q2", "2006Q3", "2006Q4", "2007Q1",
"2007Q2", "2007Q3", "2007Q4", "2008Q1", "2008Q2", "2008Q3", "2008Q4", "2009Q1", "2009Q2", "2009Q3",
"2009Q4"])
From this I want to create a DataFrame with 2 columns, start-date and end-date, where each row holds the starting date and the ending date of a date range spanning 4 years. This continues for each element of the above array until the last element. For example, the first 3 rows of this new DataFrame would look like below
Is there any direct function/method to achieve above in Python?
Here's one way using PeriodIndex and DateOffset in pandas. Note that I named your array arr below:
import pandas as pd

df = pd.DataFrame({'start-date': arr,
                   'end-date': (pd.PeriodIndex(arr, freq='Q').to_timestamp() +
                                pd.DateOffset(years=4, months=10)).to_period('Q')})
Output:
start-date end-date
0 2000Q1 2004Q4
1 2000Q2 2005Q1
2 2000Q3 2005Q2
3 2000Q4 2005Q3
4 2001Q1 2005Q4
5 2001Q2 2006Q1
6 2001Q3 2006Q2
7 2001Q4 2006Q3
8 2002Q1 2006Q4
9 2002Q2 2007Q1
10 2002Q3 2007Q2
11 2002Q4 2007Q3
12 2003Q1 2007Q4
13 2003Q2 2008Q1
14 2003Q3 2008Q2
15 2003Q4 2008Q3
16 2004Q1 2008Q4
17 2004Q2 2009Q1
18 2004Q3 2009Q2
19 2004Q4 2009Q3
20 2005Q1 2009Q4
21 2005Q2 2010Q1
22 2005Q3 2010Q2
23 2005Q4 2010Q3
24 2006Q1 2010Q4
25 2006Q2 2011Q1
26 2006Q3 2011Q2
27 2006Q4 2011Q3
28 2007Q1 2011Q4
29 2007Q2 2012Q1
30 2007Q3 2012Q2
31 2007Q4 2012Q3
32 2008Q1 2012Q4
33 2008Q2 2013Q1
34 2008Q3 2013Q2
35 2008Q4 2013Q3
36 2009Q1 2013Q4
37 2009Q2 2014Q1
38 2009Q3 2014Q2
39 2009Q4 2014Q3
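The same end dates can also be reached without converting to timestamps, because adding an integer to a PeriodIndex shifts it by that many periods. The sketch below assumes the offset shown in the output above, where each end date is 19 quarters after its start date; the shortened arr is only for illustration.

```python
import numpy as np
import pandas as pd

# Shortened version of the array from the question, for illustration.
arr = np.array(["2000Q1", "2000Q2", "2009Q4"])

# Parse the strings as quarterly periods.
start = pd.PeriodIndex(arr, freq="Q")

# Integer addition on a PeriodIndex moves each entry forward by that many quarters.
end = start + 19

df = pd.DataFrame({"start-date": arr, "end-date": end})
```

Working in whole quarters keeps the arithmetic independent of month lengths and of where a timestamp falls inside the quarter.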

Stripping ints from a string in pandas column

I have a column like this:
Age
15-20 years old
20-25 years old
I want this as output:
Age_Min Age_Max
15 20
20 25
I am trying to use str.strip() but no success so far.
I tried d[['Age_Min','Age_Max']]=d['Age'].str.split('-',expand=True)
and the result is almost there. Is there a way to get only the integers and remove the string?
Any tips?
Use Series.str.split with expand=True:
In [858]: out = df['Age'].str.split('-', expand=True).rename(columns={0:'Age_Min', 1: 'Age_Max'})
In [860]: out['Age_Max'] = out['Age_Max'].str.split().str[0]
In [861]: out
Out[861]:
Age_Min Age_Max
0 15 20
1 20 25
OR using regex:
In [870]: out = df['Age'].str.extract(r"(\d*-?\d+)")[0].str.split('-', expand=True).rename(columns={0:'Age_Min', 1: 'Age_Max'})
In [871]: out
Out[871]:
Age_Min Age_Max
0 15 20
1 20 25
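Both variants above leave the values as strings. A minimal sketch of a one-step alternative, assuming every row follows the "min-max years old" pattern, uses named capture groups in Series.str.extract so that the column names come from the regex, then converts to integers:

```python
import pandas as pd

df = pd.DataFrame({"Age": ["15-20 years old", "20-25 years old"]})

# Each named group becomes a column; the text after the second number is ignored.
out = df["Age"].str.extract(r"(?P<Age_Min>\d+)-(?P<Age_Max>\d+)")

# The extracted values are strings, so convert them to integers.
# This raises if any row did not match the pattern and produced NaN.
out = out.astype(int)
```

The result can be attached to the original frame with df.join(out) if the Age column should be kept alongside the new ones.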

VBA solution of VIF factors [EXCEL]

I have several multiple linear regressions to carry out, and I am wondering if there is a VBA solution for getting the VIF of the regression outputs for different equations.
My current data format:
i=1
Year DependantVariable Variable2 Variable3 Variable4 Variable5 ....
2009 100 10 20 -
2010 110 15 25 -
2011 115 20 30 -
2012 125 25 35 -
2013 130 25 40 -
I have the above table, where the value of i determines the values of the variables (essentially, there is a different regression input table for every value of i).
I am looking for VBA code that will go through every value of i (stored in a column), calculate the VIFs for that value of i, and output something like below
ivalue variable1VIF variable2VIF ...
1 1.1 1.3
2 1.2 10.1

sort pyspark dataframe within groups

I would like to sort column "time" within each "id" group.
The data looks like:
id time name
132 12 Lucy
132 10 John
132 15 Sam
78 11 Kate
78 7 Julia
78 2 Vivien
245 22 Tom
I would like to get this:
id time name
132 10 John
132 12 Lucy
132 15 Sam
78 2 Vivien
78 7 Julia
78 11 Kate
245 22 Tom
I tried
df.orderBy(['id','time'])
But I don't need to sort "id".
I have two questions:
Can I sort "time" only within the same "id", and how?
Would sorting only "time" be more efficient than using orderBy() to sort both columns?
This is exactly what windowing is for.
You can create a window partitioned by the "id" column and sorted by the "time" column. Next you can apply any function on that window.
# Create a Window
from pyspark.sql.window import Window
w = Window.partitionBy(df.id).orderBy(df.time)
Now use this window over any function.
For example, let's say you want to create a column of the time delta between consecutive rows within the same group:
import pyspark.sql.functions as f
df = df.withColumn("timeDelta", df.time - f.lag(df.time,1).over(w))
I hope this gives you an idea. Effectively you have sorted your dataframe using the window and can now apply any function to it.
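If you just want to view your result, you could compute the row number within each group and sort by "id" and that number, since sorting by the row number alone would interleave rows from different groups.
df.withColumn("order", f.row_number().over(w)).sort("id", "order").show()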
