Altair distinct and count not plotting expected value

I am trying to plot the number of unique values in a column containing strings on the y axis, as follows:
alt.Chart(as_df).mark_bar(color='firebrick').encode(
    alt.X('TimeUTC:T', title='Day', axis=alt.AxisConfig(labelAngle=45)),
    alt.Y('distinct(FlightID)', type='nominal', title='Number of flights')
)
My data is of the form (the index is the raw timestamp; the Longitude values were cut off in this excerpt):
                                                    TimeUTC                              FlightID  Latitude  Longitude
2021-01-01 06:05:00.079745+00:00  2021-01-01 06:05:00+00:00  a706014b-02d0-424a-a346-2bd25ffa8e08   42.3323        ...
2021-01-01 06:05:00.291337+00:00  2021-01-01 06:05:00+00:00  d2e2bd67-c95a-426d-9357-a717d6c9124d   42.3434        ...
2021-01-01 06:06:00.131817+00:00  2021-01-01 06:06:00+00:00  a706014b-02d0-424a-a346-2bd25ffa8e08   42.3323        ...
2021-01-01 06:06:00.219178+00:00  2021-01-01 06:06:00+00:00  d2e2bd67-c95a-426d-9357-a717d6c9124d   42.3434        ...
The result is a chart whose tallest bar is only 12.
If I count the number of unique FlightIDs on Feb 9th, I get:
foo = as_df['20210209':'20210209']
foo.FlightID.nunique()
58
Why does the chart show a maximum of 12 unique FlightIDs when at least one day has 58?
python: 3.9.7
altair: 4.1.0

In the chart you are grouping the x axis by the full timestamp, whereas in the pandas aggregation you are grouping by date (i.e. stripping the hours and minutes), so each bar only counts the flights sharing a single timestamp.
If you would like to group the x axis by date in the chart, you can use the yearmonthdate time unit:
alt.Chart(as_df).mark_bar(color='firebrick').encode(
    alt.X('yearmonthdate(TimeUTC):T', title='Day', axis=alt.Axis(labelAngle=45)),  # alt.Axis, not alt.AxisConfig, is the per-chart axis class
    alt.Y('distinct(FlightID):Q', title='Number of flights')  # a distinct count is quantitative, not nominal
)
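As a quick cross-check (a sketch, assuming as_df is indexed by the raw timestamps as shown above), the same per-day aggregation in pandas should now match the bar heights:
# distinct FlightIDs per calendar day
as_df.groupby(as_df.index.date)['FlightID'].nunique()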

Related

How to convert data frame for time series analysis in Python?

I have a dataset of around 13,000 rows and 2 columns (text and date) covering a two-year period. One of the columns is a date in yyyy-mm-dd format. I want to perform a time series analysis where the x axis is the date (each day) and the y axis is the frequency of the text on the corresponding date.
I think that if I create a new data frame with the unique dates and the number of texts on each corresponding date, that would solve my problem.
How can I create a new column with the frequency of the text each day?
Thanks in advance!
Depending on the task you are trying to solve, I can see two options for this dataset.
Either, as you show in your example, count the number of occurrences of the text field in each day, independently of the value of the text field.
Or, count the number of occurrences of each unique value of the text field each day. You will then have one column for each possible value of the text field, which may make more sense if the values are purely categorical.
First things to do:
import pandas as pd
df = pd.DataFrame(data={'Date':['2018-01-01','2018-01-01','2018-01-01', '2018-01-02', '2018-01-03'], 'Text':['A','B','C','A','A']})
df['Date'] = pd.to_datetime(df['Date'])  # convert to datetime type if not already done
Date Text
0 2018-01-01 A
1 2018-01-01 B
2 2018-01-01 C
3 2018-01-02 A
4 2018-01-03 A
Then for option one:
df = df.groupby('Date').count()
Text
Date
2018-01-01 3
2018-01-02 1
2018-01-03 1
For option two:
df[df['Text'].unique()] = pd.get_dummies(df['Text'])
df = df.drop('Text', axis=1)
df = df.groupby('Date').sum()
A B C
Date
2018-01-01 1 1 1
2018-01-02 1 0 0
2018-01-03 1 0 0
The get_dummies function creates one column per possible value of the Text field. Each column is then a boolean indicator for each row of the dataframe, telling us which value of the Text field occurred in that row. We can then simply do a sum aggregation, grouping by the Date field.
If you are not familiar with groupby and aggregation operations, I recommend that you read this guide first.
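As a side note, the same per-day, per-value table can be produced in a single step with pd.crosstab (a sketch reusing the df defined above):
# one-step equivalent of the get_dummies + groupby + sum pipeline
pd.crosstab(df['Date'], df['Text'])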

How do I write consecutive times to a text file without listing all times, only the final range?

I am currently writing datetimes to a txt file. It currently looks like this:
2021-01-01 06:52:00 ,
2021-01-01 06:54:00 ,
2021-01-01 06:55:00 ,
2021-01-01 06:56:00 ,
2021-01-01 06:57:00 ,
2021-01-01 06:59:00 ,
2021-01-01 07:01:00 ,
I would instead like it to be displayed as a list of start and end times, like this:
2021-01-01 06:52:00 , 2021-01-01 06:52:00
2021-01-01 06:54:00 , 2021-01-01 06:59:00
2021-01-01 07:01:00 , 2021-01-01 07:01:00
You can see that any time there are consecutive times, it shows the range (2021-01-01 06:54:00 , 2021-01-01 06:59:00), and any time there is not a consecutive time, it repeats the time in the "end time" column.
My code currently looks like this, where time_values is just a numpy array of times from an xarray file:
import pandas as pd
from os import path, mkdir

time_list_array = []
for t in time_values:
    start_time = pd.to_datetime(str(t)).strftime('%Y-%m-%d %H:%M:%S')
    time_list_array.append(start_time)

# Write the list of datetimes to a txt file
full_path = '/home/'
if path.exists(full_path) is False:
    mkdir(full_path)
with open(full_path + '.txt', 'a') as text_file:
    for t in time_list_array:
        text_file.write('%s ,\n' % t)
Your write call inside for t in time_list_array: includes '\n', which starts a new line every time. You want to write two time values on the same line and then add '\n' at the end. You need a small modification to the loop and the write format, something like this:
for i in range(len(time_list_array)//2):
    text_file.write(f'{time_list_array[2*i]} , {time_list_array[2*i+1]}\n')
Note: this assumes you have an even number of entries in time_list_array.
This is another way to solve the problem, using Odd to track which format to use.
with open('test_2.txt', 'w') as text_file:
    Odd = True
    for t in time_list_array:
        if Odd:
            text_file.write(f'{t}')
        else:
            text_file.write(f' , {t}\n')
        Odd = not Odd
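Note that both snippets above pair entries two at a time rather than detecting runs of consecutive minutes. To get exactly the start/end ranges described in the question, a minimal sketch could look like this (assuming times one minute apart count as consecutive; adjust the Timedelta tolerance to match your data; the output file name ranges.txt is hypothetical):
import pandas as pd

times = pd.to_datetime(time_list_array)  # parse the strings built earlier
ranges = []
start = prev = times[0]
for t in times[1:]:
    if t - prev > pd.Timedelta(minutes=1):  # gap found: close the current run
        ranges.append((start, prev))
        start = t
    prev = t
ranges.append((start, prev))  # close the final run

with open('ranges.txt', 'w') as text_file:
    for s, e in ranges:
        text_file.write(f'{s} , {e}\n')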

Is there any method in pandas to convert a dataframe from day to default d/m/y format?

I would like to convert every day value in the data frame into a day/Feb/2020 date. The Date field contains only the day of the month; I want to convert it into a full day/month/year date.
My current approach is:
import datetime

y = []
for day in planned_ds.Date:
    x = datetime.datetime(2020, 5, day)
    print(x)
Is there any easy method to convert the whole Date column to d/m/y format?
One way, assuming you have data like
df = pd.DataFrame([1, 2, 3, 4, 5], columns=["date"])
is to convert them to dates and then shift them to start where you need them to:
# pd.Timestamp replaces pd.datetime, which was removed from recent pandas
pd.to_datetime(df["date"], unit="D") - pd.Timestamp(1970, 1, 1) + pd.Timestamp(2020, 1, 31)
This results in:
0 2020-02-01
1 2020-02-02
2 2020-02-03
3 2020-02-04
4 2020-02-05
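Alternatively (a sketch relying on the origin argument of pd.to_datetime), the same shift can be expressed in one call, and .dt.strftime('%d/%m/%Y') will render the result as day/month/year strings:
# day 1 maps to 2020-02-01, day 2 to 2020-02-02, and so on
pd.to_datetime(df["date"], unit="D", origin=pd.Timestamp("2020-01-31"))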

How can I loop over rows in my DataFrame, calculate a value, and put that value in a new column with this lambda function?

./test.csv looks like:
   price    datetime
1    100  2019-10-10
2    150  2019-11-10
...
import pandas as pd
from datetime import datetime
from datetime import timedelta

csv_df = pd.read_csv('./test.csv')
today = datetime.today()
csv_df['datetime'] = csv_df['expiration_date'].apply(lambda x: pd.to_datetime(x))  # convert `expiration_date` to a datetime Series

def days_until_exp(expiration_date, today):
    diff = (expiration_date - today)
    return [diff]

csv_df['days_until_expiration'] = csv_df['datetime'].apply(lambda x: days_until_exp(csv_df['datetime'], today))
I am trying to iterate over a specific column in my DataFrame labeled csv_df['datetime'], which in each cell has just one value, a date, and do a calculation defined by diff.
Then I want the single value diff to be put into the new Series csv_df['days_until_expiration'].
The problem is, it's calculating values for every row (673 rows) and putting all those values in a list in each row of csv_df['days_until_expiration']. I realize it may be due to the brackets around [diff], but without them I get an error.
In Excel, I would just do something like =SUM(datetime - price) and drag down the rows to populate a new column. However, I want to do this in pandas as it's part of a bigger application.
csv_df['datetime'] is a Series, so the x in apply is each cell of that Series. You call apply with a lambda and days_until_exp(), but you pass the whole column, not x, into it. Therefore the result is wrong.
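A minimal fix of the apply call (a sketch that keeps the row-by-row style and returns a scalar instead of a list) would be:
csv_df['days_until_expiration'] = csv_df['datetime'].apply(lambda x: (x - today).days)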
Anyway, without your sample data, I am guessing that you want the number of days between csv_df['datetime'] and today(). You don't need apply for this; just use a direct vectorized operation on the Series.
I made a two-column dataframe as a sample:
csv_df:
datetime days_until_expiration
0 2019-09-01 NaN
1 2019-09-02 NaN
2 2019-09-03 NaN
The following returns a Series of the deltas between csv_df['datetime'] and today(), which I guess is what you want:
td = datetime.today()
csv_df['days_until_expiration'] = (csv_df['datetime'] - td).dt.days
csv_df:
datetime days_until_expiration
0 2019-09-01 115
1 2019-09-02 116
2 2019-09-03 117
Or, to find the sum of all the deltas and assign that same sum to every row of csv_df['days_until_expiration']:
csv_df['days_until_expiration'] = (csv_df['datetime'] - td).dt.days.sum()
csv_df:
datetime days_until_expiration
0 2019-09-01 348
1 2019-09-02 348
2 2019-09-03 348

How to obtain the percent change from the first and last value of a DataFrame in Pandas

I would like to know how to obtain the percent change of the balance, exports, and imports columns between the first and last values of the following DataFrame. It would also be nice if you could illustrate how to get the percent change between two specific dates.
balance date exports imports
0 -45053 2008-01-01 421443 466496
1 -33453 2009-01-01 399649 433102
2 -41168 2010-01-01 445748 486916
3 -25171 2011-01-01 498862 524033
4 -33364 2012-01-01 501055 534419
5 -35367 2013-01-01 519913 555280
6 -36831 2014-01-01 518925 555756
7 -32370 2015-01-01 517161 549531
8 -43013 2016-01-01 547473 590486
IIUIC, use .iloc[0] for the first row and .iloc[-1] for the last, then compute the percent change as 100*(last/first - 1):
In [244]: cols = ['balance', 'exports', 'imports']
In [245]: 100*(df[cols].iloc[-1]/df[cols].iloc[0]-1)
Out[245]:
balance -4.528000
exports 29.904400
imports 26.579006
dtype: float64
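Equivalently (a sketch using pandas' built-in pct_change), you can select just the first and last rows and let pandas compute the change:
# pct_change on the two-row slice; the last row holds last/first - 1
100 * df[cols].iloc[[0, -1]].pct_change().iloc[-1]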
First, you have to set date as your index if you want to use the dates directly. Then everything goes easily:
import pandas as pd
data = [[-45053, "2008-01-01", 421443, 466496],
[-33453, "2009-01-01", 399649, 433102],
[-41168, "2010-01-01", 445748, 486916],
[-25171, "2011-01-01", 498862, 524033],
[-33364, "2012-01-01", 501055, 534419],
[-35367, "2013-01-01", 519913, 555280],
[-36831, "2014-01-01", 518925, 555756],
[-32370, "2015-01-01", 517161, 549531],
[-43013, "2016-01-01", 547473, 590486]]
columns = ["balance","date","exports","imports"]
df = pd.DataFrame(data, columns=columns).set_index("date")
print(df.loc["2009-01-01"]/df.loc["2008-01-01"]-1)
# result
# balance -0.257475
# exports -0.051713
# imports -0.071585
# dtype: float64
