How to skip plotting features whose values are all zeros in subplots using groupby - python-3.x

I am trying to use groupby to plot all the features in a dataframe in separate subplots, one group per serial number, while ignoring any feature 'Ft' whose values are all zeros for that serial number. For example, we should ignore 'Ft1' for S/N 'KLM10015' because all of its data are zeros. The dataframe is 5514 rows by 565 columns, but the code should also work for dataframes of other sizes.
The x-axis of each subplot represents the "Date", the y-axis represents the feature values (Ft), and the title is the serial number (S/N).
This is an example of the dataframe which I have:
df =
S/N Ft1 Ft12 Ft17 ---- Ft1130 Ft1140 Ft1150
DATE
2021-01-06 KLM10015 0 12 14 ---- 17 52 47
2021-01-07 KLM10015 0 10 48 ---- 19 20 21
2021-01-11 KLM10015 0 0 45 ---- 0 19 0
2021-01-17 KLM10015 0 1 0 ---- 16 44 66
...
2021-02-09 KLM10018 1 11 0 ---- 25 27 19
2021-12-13 KLM10018 12 0 19 ---- 78 77 18
2021-12-16 KLM10018 14 17 14 ---- 63 19 0
2021-07-09 KLM10018 18 0 77 ---- 65 34 98
2021-07-15 KLM10018 0 88 82 ---- 63 31 22
Code:
import matplotlib.pyplot as plt

list_ID = ["Ft1", "Ft12", "Ft17", ..., "Ft1130", "Ft1140", "Ft1150"]

def plot_fun(dataframe):
    for item in list_ID:
        fig = plt.figure(figsize=(35, 20))
        ax1 = fig.add_subplot(221)
        ax2 = fig.add_subplot(222)
        ax3 = fig.add_subplot(223)
        ax4 = fig.add_subplot(224)
        dataframe.groupby('S/N')[item].plot(legend=True, ax=ax1)
        dataframe.groupby('S/N')[item].plot(legend=True, ax=ax2)
        dataframe.groupby('S/N')[item].plot(legend=True, ax=ax3)
        dataframe.groupby('S/N')[item].plot(legend=True, ax=ax4)
        plt.show()

plot_fun(df)
I really need your help. Thanks a lot.
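One possible starting point (a sketch, not a tested answer: it assumes list_ID matches the column names exactly, draws one 2x2 figure per serial number, and silently drops any feature beyond the first four non-zero ones):

import matplotlib.pyplot as plt

def plot_nonzero_features(dataframe, features):
    # one figure per serial number, titled with the S/N
    for sn, group in dataframe.groupby('S/N'):
        # keep only the features that are not all zeros for this S/N
        nonzero = [ft for ft in features if not (group[ft] == 0).all()]
        fig, axes = plt.subplots(2, 2, figsize=(35, 20))
        fig.suptitle(sn)
        for ft, ax in zip(nonzero, axes.flat):  # zip stops after 4 features
            ax.plot(group.index, group[ft], label=ft)
            ax.set_xlabel('Date')
            ax.legend()
        plt.show()

plot_nonzero_features(df, list_ID)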

Related

matplotlib auto generates dates in my x-axis

I have a DataFrame that consists of 2 columns:
Transaction Week | Completed
2021-01-10 | 63
2021-01-17 | 76
2021-01-24 | 63
2021-01-31 | 20
I cannot understand why, after I plot the graph, my x-axis has more than 4 dates (my DataFrame only has 4 entries). How can I remove those extra dates?
x=Weekly_Settled_Trans_Status['Transaction Week']
y=Weekly_Settled_Trans_Status['Completed']
plt.plot(x,y)
plt.tick_params('x', labelrotation=45)
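If 'Transaction Week' is datetime-like, matplotlib's automatic date locator places ticks at regular intervals rather than at your data points. One way around this (a sketch, assuming the column parses as dates) is to force the ticks onto the four data points:

import matplotlib.pyplot as plt
import pandas as pd

x = pd.to_datetime(Weekly_Settled_Trans_Status['Transaction Week'])
y = Weekly_Settled_Trans_Status['Completed']

fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_xticks(x)  # ticks only at the four dates that actually exist
ax.tick_params('x', labelrotation=45)
plt.show()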

Windowing Data into Rows in Pyspark

I'm preparing a dataset to develop a supervised model that predicts a value given the 5 values immediately preceding it. For example, given the sample data below, I would predict the 6th column given columns 1:5, or the 8th column given columns 3:7.
id 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ...
a 150 110 130 80 136 150 190 110 150 110 130 136 100 150 190 110
b 100 100 130 100 136 100 160 230 122 130 15 200 100 100 136 100
c 130 122 140 140 122 130 15 200 100 100 130 100 136 100 160 230
To that end, I want to reorganize the sample data above into rows of 6 columns, taking every slice/window of 6 values possible (e.g. 1:6, 2:7, 3:8). How can I do that? Is it possible in PySpark/SQL? Example of output below, index just for clarification:
1 2 3 4 5 6
a[1:6] 150 110 130 80 136 150
a[2:7] 110 130 80 136 150 190
a[3:8] 130 80 136 150 190 110
...
c[1:6] 130 122 140 140 122 130
c[2:7] 122 140 140 122 130 15
...
c[10:16] 130 100 136 100 160 230
You can convert your columns into an array of arrays or an array of structs and then explode, for example:
from pyspark.sql.functions import struct, explode, array, col
# all columns except the first
cols = df.columns[1:]
# size of the splits
N = 6
Use array of arrays:
df_new = df.withColumn('dta', explode(array(*[ array(*cols[i:i+N]) for i in range(len(cols)-N+1) ]))) \
    .select('id', *[ col('dta')[i].alias(str(i+1)) for i in range(N) ])
df_new.show()
+---+---+---+---+---+---+---+
| id| 1| 2| 3| 4| 5| 6|
+---+---+---+---+---+---+---+
| a|150|110|130| 80|136|150|
| a|110|130| 80|136|150|190|
| a|130| 80|136|150|190|110|
| a| 80|136|150|190|110|150|
| a|136|150|190|110|150|110|
| a|150|190|110|150|110|130|
| a|190|110|150|110|130|136|
| a|110|150|110|130|136|100|
| a|150|110|130|136|100|150|
| a|110|130|136|100|150|190|
| a|130|136|100|150|190|110|
| b|100|100|130|100|136|100|
+---+---+---+---+---+---+---+
Use array of structs (spark 2.4+):
df_new = df.withColumn('dta', array(*cols)) \
    .selectExpr("id", f"""
        inline(transform(sequence(0,{len(cols)-N}), i -> ({','.join(f'dta[i+{j}] as `{j+1}`' for j in range(N))})))
    """)
The code inside the f-string above expands to the following for N=6:
inline(transform(sequence(0,10), i -> struct(dta[i] as `1`, dta[i+1] as `2`, dta[i+2] as `3`, dta[i+3] as `4`, dta[i+4] as `5`, dta[i+5] as `6`)))
Yes, you can use this code (and modify it to get what you need):
partitions = []
num_elements = 6
for row in df.rdd.toLocalIterator():
    row_list = list(row)
    values = row_list[1:]  # skip the id column
    for i in range(len(values) - num_elements + 1):
        # keep the id with each window so the rows stay identifiable
        partitions.append([row_list[0]] + values[i : i + num_elements])
output_df = spark.createDataFrame(partitions, ['id'] + [str(j + 1) for j in range(num_elements)])
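Note that toLocalIterator() pulls each partition to the driver, so this version does not scale the way the array/struct approaches above do; it is mainly useful when the data comfortably fits on a single machine.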

Selecting column on the basis of date

I have the following data set.
ID Date description V1 V2 V3
1 31-Jan-2013 Des1 10 20 30
1 31-Jan-2013 Des2 20 30 20
1 31-jan-2014 Des1 56 30 20
1 31-jan-2014 des2 30 40 60
2 31-dec-2013 Decc1 10 20 30
2 31-dec-2013 Decc2 20 30 20
2 31-dec-2014 Decc1 56 30 20
2 31-dec-2014 decc2 30 40 60
I want to extract only the latest-year values for each ID.
Expected output:
ID Date description V1 V2 V3
1 31-jan-2014 Des1 56 30 20
1 31-jan-2014 des2 30 40 60
2 31-dec-2014 Decc1 56 30 20
2 31-dec-2014 decc2 30 40 60
Can anybody help with how to achieve this in pandas?
Thanks
Anubhav
Maybe use groupby()?
data_u.set_index(['ID', 'Date'],inplace=True)
data_u.sort_index(inplace=True)
data_u.groupby(data_u.index).index.agg(['count'])
This gives me the count of rows per MultiIndex entry, but I want to select the latest year for each ID. The number of records is >500,000.
You could do the following:
df['Date'] = pd.to_datetime(df['Date'])
df[df.apply(lambda x : x['Date'] == df[(df['ID'] == x['ID'])]['Date'].max() , axis =1)]
Output
+---+----+------------+-------------+----+----+----+
| | ID | Date | description | V1 | V2 | V3 |
+---+----+------------+-------------+----+----+----+
| 2 | 1 | 2014-01-31 | Des1 | 56 | 30 | 20 |
| 3 | 1 | 2014-01-31 | des2 | 30 | 40 | 60 |
| 6 | 2 | 2014-12-31 | Decc1 | 56 | 30 | 20 |
| 7 | 2 | 2014-12-31 | decc2 | 30 | 40 | 60 |
+---+----+------------+-------------+----+----+----+
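With >500,000 rows, the row-wise apply above can be slow because it rescans the frame for every row. A vectorised alternative (a sketch using the same column names) keeps only the rows whose date equals the per-ID maximum:

import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
latest = df[df['Date'] == df.groupby('ID')['Date'].transform('max')]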

Creating a new column based on other columns' values with specific requirements in a Python DataFrame

I want to create a new column in a pandas DataFrame based on values from other columns, with some specific requirements. For example, my DataFrame df:
A | B
-----------
5 | 0
5 | 1
15 | 1
10 | 1
10 | 1
20 | 2
15 | 2
10 | 2
5 | 3
15 | 3
10 | 4
20 | 0
I want to create new column C, with below requirements:
When the value of B = 0, then C = 0
Rows with the same value in B get the same value in C. Each block of equal B values is split into start, middle, and end rows: for value 1 there is 1 start, 2 middle, and 1 end row; for value 3 there is 1 start, 0 middle, and 1 end row. The calculation for each part is:
I specify a threshold = 10.
Let's look at the values where B = 1:
Start :
C.loc[2] = min(threshold, A.loc[1]) + A.loc[2]
Middle :
C.loc[3] = A.loc[3]
C.loc[4] = A.loc[4]
End:
C.loc[5] = min(threshold, A.loc[6])
The output value of C for every row in the block is the sum of the above calculations.
When the value of B is unique (appears in only one row) and not 0, for example B = 4:
C.loc[11] = min(threshold, A.loc[10]) + min(threshold, A.loc[12])
I can solve the B = 0 case and the unique-B case, but I'm struggling with the start/middle/end case.
So, the final output will be:
A | B | C
--------------------
5 | 0 | 0
5 | 1 | 45
15 | 1 | 45
10 | 1 | 45
10 | 1 | 45
20 | 2 | 50
15 | 2 | 50
10 | 2 | 50
5 | 3 | 25
15 | 3 | 25
10 | 4 | 20
20 | 0 | 0
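One possible implementation (a sketch, assuming a default RangeIndex, that equal B values always form contiguous blocks, and that every non-zero block has a neighbouring row on each side):

import pandas as pd

def add_c(df, threshold=10):
    a = df['A'].to_numpy()
    c = pd.Series(0, index=df.index)
    # label contiguous runs of B so the same value elsewhere stays separate
    runs = (df['B'] != df['B'].shift()).cumsum()
    for _, block in df.groupby(runs):
        if block['B'].iat[0] == 0:
            continue                      # rule 1: B == 0 -> C = 0
        first, last = block.index[0], block.index[-1]
        prev_a = min(threshold, a[first - 1]) if first > 0 else 0
        next_a = min(threshold, a[last + 1]) if last + 1 < len(a) else 0
        if first == last:                 # unique non-zero B
            total = prev_a + next_a
        else:
            # start + middle rows contribute their own A; the end row
            # contributes min(threshold, next A) instead of its own A
            total = prev_a + a[first:last].sum() + next_a
        c.loc[first:last] = total
    df['C'] = c
    return df

On the sample data this reproduces C = 0, 45, 45, 45, 45, 50, 50, 50, 25, 25, 20, 0.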

Plot Shaded Error Bars from Pandas Agg

I have data in the following format:
| | Measurement 1 | | Measurement 2 | |
|------|---------------|------|---------------|------|
| | Mean | Std | Mean | Std |
| Time | | | | |
| 0 | 17 | 1.10 | 21 | 1.33 |
| 1 | 16 | 1.08 | 21 | 1.34 |
| 2 | 14 | 0.87 | 21 | 1.35 |
| 3 | 11 | 0.86 | 21 | 1.33 |
I am using the following code to generate a matplotlib line graph from this data, which shows the standard deviation as a filled-in area; see below:
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

def seconds_to_minutes(x, pos):
    minutes = f'{round(x/60, 0)}'
    return minutes

fig, ax = plt.subplots()
mean_temperature_over_time['Measurement 1']['mean'].plot(kind='line', yerr=mean_temperature_over_time['Measurement 1']['std'], alpha=0.15, ax=ax)
mean_temperature_over_time['Measurement 2']['mean'].plot(kind='line', yerr=mean_temperature_over_time['Measurement 2']['std'], alpha=0.15, ax=ax)
ax.set(title="A Line Graph with Shaded Error Regions", xlabel="x", ylabel="y")
formatter = FuncFormatter(seconds_to_minutes)
ax.xaxis.set_major_formatter(formatter)
ax.grid()
ax.legend(['Mean 1', 'Mean 2'])
Output:
This seems like a very messy solution, and only actually produces shaded output because I have so much data. What is the correct way to produce a line graph from the dataframe I have with shaded error regions? I've looked at Plot yerr/xerr as shaded region rather than error bars, but am unable to adapt it for my case.
What's wrong with the linked solution? It seems pretty straightforward.
Allow me to rearrange your dataset so it's easier to load into a pandas DataFrame:
Time Measurement Mean Std
0 0 1 17 1.10
1 1 1 16 1.08
2 2 1 14 0.87
3 3 1 11 0.86
4 0 2 21 1.33
5 1 2 21 1.34
6 2 2 21 1.35
7 3 2 21 1.33
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for i, m in df.groupby("Measurement"):
    ax.plot(m.Time, m.Mean)
    ax.fill_between(m.Time, m.Mean - m.Std, m.Mean + m.Std, alpha=0.35)
And here's the result with some randomly generated data:
EDIT
Since the issue is apparently iterating over your particular dataframe format, let me show how you could do it (I'm new to pandas, so there may be better ways). If I understood your screenshot correctly, you should have something like:
Measurement 1 2
Mean Std Mean Std
Time
0 17 1.10 21 1.33
1 16 1.08 21 1.34
2 14 0.87 21 1.35
3 11 0.86 21 1.33
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 4 columns):
(1, Mean) 4 non-null int64
(1, Std) 4 non-null float64
(2, Mean) 4 non-null int64
(2, Std) 4 non-null float64
dtypes: float64(2), int64(2)
memory usage: 160.0 bytes
df.columns
MultiIndex(levels=[[1, 2], [u'Mean', u'Std']],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]],
names=[u'Measurement', None])
And if you restack it back into the long format shown earlier, you should be able to iterate over it the same way and obtain the same plot:
for i, m in df.groupby("Measurement"):
    ax.plot(m["Time"], m['Mean'])
    ax.fill_between(m["Time"],
                    m['Mean'] - m['Std'],
                    m['Mean'] + m['Std'], alpha=0.35)
The restack itself can be done with:
(df.stack("Measurement")        # stack "Measurement" columns row by row
   .reset_index()               # make "Time" a normal column, add a new index
   .sort_values("Measurement")  # group values from the same Measurement
   .reset_index(drop=True))     # drop the sorted index and make a new one
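If you would rather not restack, one option (a sketch, assuming the MultiIndex column layout shown above) is to iterate over the "Measurement" column level directly:

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for m in df.columns.get_level_values("Measurement").unique():
    mean, std = df[(m, "Mean")], df[(m, "Std")]
    ax.plot(df.index, mean, label=f"Measurement {m}")
    ax.fill_between(df.index, mean - std, mean + std, alpha=0.35)
ax.legend()
plt.show()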
