Python: fill blank spaces in web-extracted text with NaN - python-3.x

I have extracted some text from the web and saved it with numpy using the string format (fmt="%s").
The data transferred successfully and is readable as follows:
250.0 1000 39.9 45.9 53 60 210 16
250.0 1000 39.9 45.9 53 60 210 16
250.0 1020 40.7 70 200 10
250.0 1010 40.1 95 175 9
250.0 1010 39.9 43.7 67 150 120 16
250.0 1000 39.5 49.5 34 80 190 15
Rows 3 and 4 each contain two blank fields, which I believe are missing values originating from the web source. I tried to read the file (sample-250.dat) using numpy's loadtxt:
data5 = np.loadtxt(path1+"sample-250.dat",dtype=object)
PRES=data5[:,0]
HIGHT=data5[:,1]
TEMP=data5[:,2]
DWPT=data5[:,3]
RELH=data5[:,4]
DRCT=data5[:,5]
md=data5[:,6]
SKNT=data5[:,7]
Sadly, this raises the following error: ValueError: Wrong number of columns at line 3.
Does anyone have ideas on how to read such data, ideally replacing those blank fields with NaN values?
Thanks

How about using pandas instead of numpy?
import pandas as pd
import numpy as np
data5 = pd.read_table(path1 + "sample-250.dat", header=None, sep=r"\s+").values  # sep=r"\s+" splits on whitespace; the default tab separator would not parse this file
data5 is what you need.
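For reference, here is a minimal self-contained sketch of this approach (the column names are taken from the question; note that pandas fills short rows from the left, so it cannot know which interior fields were blank):

import pandas as pd

cols = ["PRES", "HIGHT", "TEMP", "DWPT", "RELH", "DRCT", "md", "SKNT"]
df = pd.read_table(path1 + "sample-250.dat", header=None,
                   sep=r"\s+", names=cols)
data5 = df.values      # rows 3 and 4 end with NaN in the last two columns
PRES = data5[:, 0]     # unpack columns as before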

Related

Adding empty rows based on two columns in a Pandas DataFrame

I have a dataframe with the following structure:
x y z
93 122 787.185547
93 123 847.964905
93 124 908.932190
93 125 1054.865845
93 126 1109.340576
x and y are coordinates, and I know their ranges. For example:
x_range = np.arange(90, 130)
y_range = np.arange(100, 130)
z is the measurement data.
Now I want to insert the missing points with NaN values in z, so it looks like:
x y z
90 100 NaN
90 101 NaN
...........................
93 121 NaN
93 122 787.185547
93 123 847.964905
93 124 908.932190
...........................
129 128 NaN
129 129 NaN
It can be done with a simple but clumsy for loop, but is there a cleaner way to do this?
I would recommend using itertools.product followed by merge:
import itertools
import pandas as pd

# Build the full (x, y) grid, then left-merge so z is NaN wherever a point is missing
df = pd.DataFrame(
    list(itertools.product(x_range, y_range)), columns=['x', 'y']
).merge(df, how='left')
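A self-contained sketch of the same idea, rebuilding the sample frame from the question:

import itertools
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'x': [93] * 5,
    'y': [122, 123, 124, 125, 126],
    'z': [787.185547, 847.964905, 908.932190, 1054.865845, 1109.340576],
})
x_range = np.arange(90, 130)   # 90..129
y_range = np.arange(100, 130)  # 100..129

full = pd.DataFrame(list(itertools.product(x_range, y_range)), columns=['x', 'y'])
out = full.merge(df, how='left')   # 1200 rows; z is NaN for the 1195 missing points
print(out[out.z.notna()])          # the five original measurements are preserved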

create lag features based on multiple columns

I have a time series dataset and need to extract lag features. I am using the code below but got all NaNs:
df.groupby(['week','id1','id2','id3'],as_index=False)['value'].shift(1)
input
week,id1,id2,id3,value
1,101,123,001,45
1,102,231,004,89
1,203,435,099,65
2,101,123,001,48
2,102,231,004,75
2,203,435,099,90
output
week,id1,id2,id3,value,t-1
1,101,123,001,45,NAN
1,102,231,004,89,NAN
1,203,435,099,65,NAN
2,101,123,001,48,45
2,102,231,004,75,89
2,203,435,099,90,65
You want to shift from the previous week, so remove 'week' from the grouping (with 'week' included, each group holds a single row per id, so every shifted value is NaN):
df['t-1'] = df.groupby(['id1','id2','id3'],as_index=False)['value'].shift()
# week id1 id2 id3 value t-1
#0 1 101 123 1 45 NaN
#1 1 102 231 4 89 NaN
#2 1 203 435 99 65 NaN
#3 2 101 123 1 48 45.0
#4 2 102 231 4 75 89.0
#5 2 203 435 99 90 65.0
That said, this is error-prone if weeks are missing. In that case we can merge after incrementing the week column, which guarantees the lagged value comes from the prior week regardless of gaps:
df2 = df.assign(week=df.week+1).rename(columns={'value': 't-1'})
df = df.merge(df2, on=['week', 'id1', 'id2', 'id3'], how='left')
Another way to bring over and rename many columns is the suffixes argument of merge, which renames all overlapping non-key columns in the right DataFrame:
df.merge(df.assign(week=df.week + 1),  # manually lag by one week
         on=['week', 'id1', 'id2', 'id3'],
         how='left',
         suffixes=['', '_lagged'])     # right df columns -> _lagged
# week id1 id2 id3 value value_lagged
#0 1 101 123 1 45 NaN
#1 1 102 231 4 89 NaN
#2 1 203 435 99 65 NaN
#3 2 101 123 1 48 45.0
#4 2 102 231 4 75 89.0
#5 2 203 435 99 90 65.0
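For completeness, a self-contained sketch of the merge approach, rebuilt from the CSV sample in the question:

import io
import pandas as pd

csv = '''week,id1,id2,id3,value
1,101,123,001,45
1,102,231,004,89
1,203,435,099,65
2,101,123,001,48
2,102,231,004,75
2,203,435,099,90'''
df = pd.read_csv(io.StringIO(csv))

# Shift each row one week forward, rename value -> t-1, then left-merge back
df2 = df.assign(week=df.week + 1).rename(columns={'value': 't-1'})
print(df.merge(df2, on=['week', 'id1', 'id2', 'id3'], how='left'))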

Rolling window percentile rank over a multi-index Pandas DataFrame

I am creating a percentile rank over a rolling window of time and would like help refining my approach.
My DataFrame has a multi-index with the first level set to datetime and the second set to an identifier. Ultimately, I’d like the rolling window to evaluate the trailing n periods, including the current period, and produce the corresponding percentile ranks.
I referenced the posts shown below but found they were working with the data a bit differently than how I intend to. In those posts, the final functions group results by identifier and then by datetime, whereas I'm looking to use rolling panels of data in my function (dates and identifiers).
using rolling functions on multi-index dataframe in pandas
Panda rolling window percentile rank
This is an example of what I am after.
Create a sample DataFrame:
import numpy as np
import pandas as pd
from pandas.tseries.offsets import BDay

num_days = 5
max_value = 300  # upper bound for the random draws; not defined in the original post
np.random.seed(8675309)
stock_data = {
    "AAPL": np.random.randint(1, max_value, size=num_days),
    "MSFT": np.random.randint(1, max_value, size=num_days),
    "WMT": np.random.randint(1, max_value, size=num_days),
    "TSLA": np.random.randint(1, max_value, size=num_days)
}
dates = pd.date_range(start="2013-01-03", periods=num_days, freq=BDay())
sample_df = pd.DataFrame(stock_data, index=dates)
sample_df = sample_df.stack().to_frame(name='data')
sample_df.index.names = ['date', 'ticker']
Which outputs:
date ticker
2013-01-03 AAPL 2
MSFT 93
TSLA 39
WMT 21
2013-01-04 AAPL 141
MSFT 43
TSLA 205
WMT 20
2013-01-07 AAPL 256
MSFT 93
TSLA 103
WMT 25
2013-01-08 AAPL 233
MSFT 60
TSLA 13
WMT 104
2013-01-09 AAPL 19
MSFT 120
TSLA 282
WMT 293
The code below breaks sample_df into 2-day groups and ranks within each group, rather than ranking over a rolling window of time. So it's close, but not what I'm after.
sample_df.reset_index(level=1, drop=True)[['data']] \
    .apply(lambda x: x.groupby(pd.Grouper(level=0, freq='2d')).rank())
I then tried what's shown below without much luck either.
from scipy.stats import rankdata

def rank(x):
    return rankdata(x, method='ordinal')[-1]

sample_df.reset_index(level=1, drop=True) \
    .rolling(window="2d", min_periods=1) \
    .apply(lambda x: rank(x))
I finally arrived at the output I'm looking for but the formula seems a bit contrived, so I'm hoping to identify a more elegant approach if one exists.
import numpy as np
import pandas as pd
from pandas.tseries.offsets import BDay
from scipy import stats

window_length = 1
target_column = "data"

def rank(df, target_column, ids, window_length):
    percentile_ranking = []
    list_of_ids = []
    date_index = df.index.get_level_values(0).unique()
    for date in date_index:
        rolling_start_date = date - BDay(window_length)
        first_date = date_index[0] + BDay(window_length)
        trailing_values = df.loc[rolling_start_date:date, target_column]
        # Only calculate the rolling percentile after the window has elapsed
        if date < first_date:
            pass
        else:
            percentile_ranking.append(
                df.loc[date, target_column].apply(
                    lambda x: stats.percentileofscore(trailing_values, x, kind="rank")
                )
            )
            list_of_ids.append(df.loc[date, ids])
    ranks, output_ids = pd.concat(percentile_ranking), pd.concat(list_of_ids)
    df = pd.DataFrame(
        ranks.values, index=[ranks.index, output_ids], columns=["percentile_rank"]
    )
    return df
ranks = rank(
    sample_df.reset_index(level=1),
    window_length=1,
    ids='ticker',
    target_column="data"
)
sample_df.join(ranks)
I get the feeling that my rank function is more than what's needed here. I appreciate any ideas/feedback to help in simplifying this code to arrive at the output below. Thank you!
data percentile_rank
date ticker
2013-01-03 AAPL 2 NaN
MSFT 93 NaN
TSLA 39 NaN
WMT 21 NaN
2013-01-04 AAPL 141 87.5
MSFT 43 62.5
TSLA 205 100.0
WMT 20 25.0
2013-01-07 AAPL 256 100.0
MSFT 93 50.0
TSLA 103 62.5
WMT 25 25.0
2013-01-08 AAPL 233 87.5
MSFT 60 37.5
TSLA 13 12.5
WMT 104 75.0
2013-01-09 AAPL 19 25.0
MSFT 120 62.5
TSLA 282 87.5
WMT 293 100.0
Edited: the original answer was taking 2-day groups without any rolling effect, just grouping the first two days that appeared. If you want a rolling 2-day window:
First, pivot the DataFrame to keep the dates as the index and the tickers as columns:
pivoted = sample_df.reset_index().pivot(index='date', columns='ticker', values='data')
Output
ticker AAPL MSFT TSLA WMT
date
2013-01-03 2 93 39 21
2013-01-04 141 43 205 20
2013-01-07 256 93 103 25
2013-01-08 233 60 13 104
2013-01-09 19 120 282 293
Now we can apply a rolling function that ranks each value against all stocks in the same window (raw=False makes apply pass a Series with its date index, which pctile uses to look up every ticker in the window):
from scipy.stats import rankdata

def pctile(s):
    wdw = sample_df.loc[s.index, :].values.flatten()  # all stock values in the window
    ranked = rankdata(wdw) / len(wdw) * 100           # their percentiles
    return ranked[np.where(wdw == s.iloc[-1])][0]     # this value's percentile

pivoted_pctile = pivoted.rolling('2D').apply(pctile, raw=False)
Output
ticker AAPL MSFT TSLA WMT
date
2013-01-03 25.0 100.0 75.0 50.0
2013-01-04 87.5 62.5 100.0 25.0
2013-01-07 100.0 50.0 75.0 25.0
2013-01-08 87.5 37.5 12.5 75.0
2013-01-09 25.0 62.5 87.5 100.0
To get the original format back, we just melt the results:
pd.melt(pivoted_pctile.reset_index(), 'date') \
    .sort_values(['date', 'ticker']).reset_index()
Output
value
date ticker
2013-01-03 AAPL 25.0
MSFT 100.0
TSLA 75.0
WMT 50.0
2013-01-04 AAPL 87.5
MSFT 62.5
TSLA 100.0
WMT 25.0
2013-01-07 AAPL 100.0
MSFT 50.0
TSLA 75.0
WMT 25.0
2013-01-08 AAPL 87.5
MSFT 37.5
TSLA 12.5
WMT 75.0
2013-01-09 AAPL 25.0
MSFT 62.5
TSLA 87.5
WMT 100.0
If you prefer it in one expression:
pd.melt(
    sample_df
        .reset_index()
        .pivot(index='date', columns='ticker', values='data')
        .rolling('2D').apply(pctile, raw=False)
        .reset_index(),
    'date'
).sort_values(['date', 'ticker']).set_index(['date', 'ticker'])
Note that on day 7 this differs from what you displayed. Because this is a true rolling window and there is no day 6 in the index, the day-7 values are ranked only against that day's four observations; windows never look forward.
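To see why, the windows can be enumerated directly (Rolling objects are iterable in pandas >= 1.1); this is a quick sanity check, not part of the original answer:

# Print the dates that fall inside each 2-day window
for end, win in zip(pivoted.index, pivoted.rolling('2D')):
    print(end.date(), '->', [d.date() for d in win.index])
# 2013-01-07 maps only to itself: Jan 5 and 6 fall on a weekend,
# so its window contains just that day's four values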
Original
Is this something you might be looking for? I combined a groupby on the date (2-day frequency) with transform, so the number of observations matches the series provided. As you can see, I kept the first observation of the window group.
df = sample_df.reset_index()
df['percentile_rank'] = df.groupby(pd.Grouper(key='date', freq='2D'))['data'] \
    .transform(lambda x: x.rank(ascending=True) / len(x) * 100)
Output
date ticker data percentile_rank
0 2013-01-03 AAPL 2 12.5
1 2013-01-03 MSFT 93 75.0
2 2013-01-03 WMT 39 50.0
3 2013-01-03 TSLA 21 37.5
4 2013-01-04 AAPL 141 87.5
5 2013-01-04 MSFT 43 62.5
6 2013-01-04 WMT 205 100.0
7 2013-01-04 TSLA 20 25.0
8 2013-01-07 AAPL 256 100.0
9 2013-01-07 MSFT 93 50.0
10 2013-01-07 WMT 103 62.5
11 2013-01-07 TSLA 25 25.0
12 2013-01-08 AAPL 233 87.5
13 2013-01-08 MSFT 60 37.5
14 2013-01-08 WMT 13 12.5
15 2013-01-08 TSLA 104 75.0
16 2013-01-09 AAPL 19 25.0
17 2013-01-09 MSFT 120 50.0
18 2013-01-09 WMT 282 75.0
19 2013-01-09 TSLA 293 100.0

In the output I am not getting the complete table from Excel

I just started using pandas. I wanted to import an Excel file with 31 rows and 11 columns, but in the output only some columns are displayed, the middle columns are represented by "...", and the first few elements of the first column, 'EST', are displayed with "00:00:00".
Code
import pandas as pd
df = pd.read_excel(r"C:\Users\daryl\PycharmProjects\pandas\Book1.xlsx")  # raw string avoids backslash escapes in the path
print(df)
Output
C:\Users\daryl\AppData\Local\Programs\Python\Python37\python.exe "C:/Users/daryl/PycharmProjects/pandas/1. Introduction.py"
EST Temperature ... Events WindDirDegrees
0 2016-01-01 00:00:00 38 ... NaN 281
1 2016-02-01 00:00:00 36 ... NaN 275
2 2016-03-01 00:00:00 40 ... NaN 277
3 2016-04-01 00:00:00 25 ... NaN 345
4 2016-05-01 00:00:00 20 ... NaN 333
5 2016-06-01 00:00:00 33 ... NaN 259
6 2016-07-01 00:00:00 39 ... NaN 293
7 2016-08-01 00:00:00 39 ... NaN 79
8 2016-09-01 00:00:00 44 ... Rain 76
9 2016-10-01 00:00:00 50 ... Rain 109
10 2016-11-01 00:00:00 33 ... NaN 289
11 2016-12-01 00:00:00 35 ... NaN 235
12 1-13-2016 26 ... NaN 284
13 1-14-2016 30 ... NaN 266
14 1-15-2016 43 ... NaN 101
15 1-16-2016 47 ... Rain 340
16 1-17-2016 36 ... Fog-Snow 345
17 1-18-2016 25 ... Snow 293
18 1/19/2016 22 ... NaN 293
19 1-20-2016 32 ... NaN 302
20 1-21-2016 31 ... NaN 312
21 1-22-2016 26 ... Snow 34
22 1-23-2016 26 ... Fog-Snow 42
23 1-24-2016 28 ... Snow 327
24 1-25-2016 34 ... NaN 286
25 1-26-2016 43 ... NaN 244
26 1-27-2016 41 ... Rain 311
27 1-28-2016 37 ... NaN 234
28 1-29-2016 36 ... NaN 298
29 1-30-2016 34 ... NaN 257
30 1-31-2016 46 ... NaN 241
[31 rows x 11 columns]
Process finished with exit code 0
To answer your question about only a few columns being displayed, with "..." in the middle:
All of the columns have been properly ingested; your screen/console is simply not wide enough to print them all at once. This is normal, expected behavior.
Pandas is not a spreadsheet visualization tool like Excel. Maybe someone can suggest a Python tool for viewing dataframes in a spreadsheet format; I think I've seen people do that in Spyder, but I don't use it myself.
If you want to make sure all of the columns are there, try using list(df) or print(list(df)).
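As a quick check, here is a minimal sketch using pandas' display options (these are standard set_option keys) to make print show every column:

import pandas as pd

pd.set_option('display.max_columns', None)  # never elide columns with "..."
pd.set_option('display.width', None)        # don't wrap to the console width
print(df)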
To answer your question about the EST format:
It looks like you have some data cleaning to do, which is typical work in data science. I am not sure how best to do this - I haven't worked much with dates/datetimes yet. However, here is what I see:
The first few rows carry timestamps as well, likely formatted HH:MM:SS
Those rows are formatted YYYY-MM-DD
On index row 18, the date uses / instead of -
The remaining rows are formatted M-DD-YYYY
There's an option documented for read_csv (and accepted by read_excel) that may take care of those automatically: parse_dates. Turning it on, e.g. pd.read_excel('file location', parse_dates=['EST']), enables the date parser for the EST column and may solve your problem.
Hope this helps! This is my first answer; to anyone who sees it, feel free to edit and improve it.

Making data trend using python

I want to build a data trend using Python for a dataframe. For example, my raw CSV data is in the format below:
Date TCH_Nom TCH_Denom SD_Nom SD_Denom
1/08/2018 42 58 4 21
2/08/2018 67 100 12 120
3/08/2018 23 451 9 34
Output should be
KPI 1/08/2018 2/08/2018 3/08/2018
TCH_Nom 42 67 23
TCH_Denom 58 100 451
SD_Nom 4 12 9
SD_Denom 21 120 34
from io import StringIO
import pandas as pd

txt = '''Date TCH_Nom TCH_Denom SD_Nom SD_Denom
1/08/2018 42 58 4 21
2/08/2018 67 100 12 120
3/08/2018 23 451 9 34'''
df = pd.read_table(StringIO(txt), sep=r'\s+')  # raw string for the regex separator
As pointed out in the comments:
df.set_index('Date',inplace = True) # set index to Date column
df.T
which gives this outcome:
Date 1/08/2018 2/08/2018 3/08/2018
TCH_Nom 42 67 23
TCH_Denom 58 100 451
SD_Nom 4 12 9
SD_Denom 21 120 34
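If you also want the first column labeled KPI as in the desired output, a small addition (a sketch using standard pandas, assuming df from above) is:

out = df.set_index('Date').T   # transpose: dates become columns, KPIs become rows
out.index.name = 'KPI'         # label the transposed index as in the desired output
out = out.reset_index()        # promote KPI to a regular column
print(out)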
