I have a df called exp with the following columns:
| category | p_value | score |
|----------|---------|-------|
| tennis   | 0.45    | 432   |
| soccer   | 0.88    | 46    |
My goal is to bin all the scores by p-value and create a cumulative plot with one line per category.
I managed to create the bins using the following:
# find the p_value bin each row belongs to;
# 0 is the underflow bin, len(bin_edges) is the overflow bin
exp['bin'] = np.digitize(exp['p_value'], bins=bin_edges)
# sum the scores per p_value bin
score_per_bin = exp.groupby('bin')['score'].sum()
And then managed to plot it:
# not every bin might be filled, so use a pandas index covering all bins
binned = pd.DataFrame({
    'center': bin_center,
    'width': bin_width,
    'score': np.zeros(len(bin_center))
}, index=np.arange(1, len(bin_edges)))
binned['score'] = score_per_bin
plt.step(
    binned['center'],
    binned['score'].cumsum(),
    where='mid',
)
plt.xlabel('p-value')
plt.ylabel('score')
plt.show()
But I get a plot with a single line, while I need one line per category.
My question is: how do I keep the category information and plot one line per category?
Thank you
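For illustration, here is a minimal sketch of one way to do this (my addition, not from the original post; it assumes bin_edges and bin_center are defined as above): group by category first, then bin and accumulate within each group.
# sketch: one cumulative step line per category
for cat, grp in exp.groupby('category'):
    per_bin = grp.groupby('bin')['score'].sum()
    # reindex so every bin appears, with empty bins counting as 0
    per_bin = per_bin.reindex(np.arange(1, len(bin_edges)), fill_value=0)
    plt.step(bin_center, per_bin.cumsum(), where='mid', label=cat)
plt.xlabel('p-value')
plt.ylabel('cumulative score')
plt.legend()
plt.show()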
I am plotting using python3 and matplotlib from a pandas dataframe. My dataframe looks like this:
date | value1 | value2 | price
--------------------------------------
2019-09-01 | 300 | 125 | 850
2019-09-02 | 400 | 250 | 625
. | . | . | .
. | . | . | .
. | . | . | .
2019-11-14 | 216 | 543 | 793
date is a datetime object, value1, value2, and price are all just ints/floats.
I'm plotting value1 and value2 as a stacked bar chart and price as a line, on two axes in the same figure, using the second axis to display the xticks. Note: stacked_bar is a groupby dataframe made from df, and price holds the corresponding prices for the dates in stacked_bar. The code example below is brief because the plot functions themselves aren't the issue; it's just to give an idea of what I'm using:
fig,ax = plt.subplots(nrows=2)
fig.set_size_inches(10,6)
stacked_bar.plot(kind='bar',stacked=True,ax=ax[0])
df['Price'].plot(ax=ax[1])
Is there an easy way to get matplotlib to display the ending date as an xtick here? I've tried a million things to get the ticks formatted properly and can't seem to find the right one. I'm at the point where I'm about to make the x-axis invisible and display the date range as a label like '9-01 to 11-14'. For reference, I'm not making any adjustments to the xticks in the displayed figure; this is how it displays by default.
In the following example, the date is set as the index which is used to generate the first and last tick labels:
import numpy as np # v 1.19.2
import pandas as pd # v 1.2.3
import matplotlib.pyplot as plt # v 3.3.4
# Create sample dataset
rng = np.random.default_rng(seed=1)
date = pd.date_range('2019-09-01', '2019-11-14', freq='D')
value1 = rng.integers(0, 2000, size=date.size)
value2 = rng.integers(0, 2000, size=date.size)
price = 800 + np.cumsum(rng.integers(-100, 101, size=date.size))
df = pd.DataFrame(dict(value1=value1, value2=value2, Price=price), index=date)
# Create plots with shared x-axis and use_index=False to align them properly
fig, (top, bot) = plt.subplots(nrows=2, sharex=True, figsize=(10,6))
df[['value1', 'value2']].plot(kind='bar', stacked=True, ax=top)
df['Price'].plot(use_index=False, ax=bot)
# Format x-axis ticks
top.xaxis.set_visible(False)
bot.set_xticks([0, len(df)-1])
bot.set_xticks([], minor=True)
bot.set_xticklabels([date.strftime('%m-%d') for date in df.index[[0,-1]]]);
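For context (an inference from the code above, not spelled out in the original answer): a pandas bar plot places its bars at integer positions 0 through len(df)-1 regardless of the index values, which is why the Price line is drawn with use_index=False and why the tick positions 0 and len(df)-1 line up with the first and last bar.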
I have an Excel data set with hundreds of entries. Most of the information sits in one string column, divided by '_' and typed in by humans, so it is not possible to work with index positions.
To create a usable data basis it is mandatory to extract information from this column into another column.
The search pattern '*v*' alone is not enough, but combined with the condition that the first character has to be a digit, it works.
I tried to get it to work with iterrows, iteritems, str.strip, str.extract and many more, but the best solution I got was with a for-loop.
pattern = '_*v*_'
test = []
for i in df['col']:
    # split the string into substrings
    parts = i.split('_')
    for c in parts:
        if c.find('v') != -1:
            if c[0].isdigit():
                # print(c)
                test.append(c)
        else:
            # to be able to fix a few rows manually
            test.append(0)
[4]: test = ['22v3', '33v55', '4v2']
#Input
+-----------+-----------+
| col | targetcol |
+-----------+-----------+
| as_22v3 | |
| 33v55_bdd | |
| Ave_4v2 | |
+-----------+-----------+
#Output
+-----------+-----------+
| col       | targetcol |
+-----------+-----------+
| as_22v3   | 22v3      |
| 33v55_bdd | 33v55     |
| Ave_4v2   | 4v2       |
+-----------+-----------+
My code does work, but only for the first few rows. It stops after 36 values and I can't figure out why. There is no error message, besides of course that the list cannot be assigned to a DataFrame column since it does not have the same size.
pandas.Series.str.extract should help:
>>> df['col'].str.extract(r'(\d+v+\d+)')
0
0 22v3
1 33v55
2 4v2
df = pd.DataFrame({
'col': ['as_22v3', '33v55_bdd', 'Ave_4v2']
})
df['targetcol'] = df['col'].str.extract(r'(\d+v+\d+)')
EDIT
df = pd.DataFrame({
'col': ['as_22v3', '33v55_bdd', 'Ave_4v2', '_22 v3', 'space 2,2v3', '2.v3',
'2.111v999', 'asd.123v77', '1 v7', '123 v 8135']
})
pattern = r'(\d+(\,[0-9]+)?(\s+)?v\d+)'
df['result'] = df['col'].str.extract(pattern)[0]
col result
0 as_22v3 22v3
1 33v55_bdd 33v55
2 Ave_4v2 4v2
3 _22 v3 22 v3
4 space 2,2v3 2,2v3
5 2.v3 NaN
6 2.111v999 111v999
7 asd.123v77 123v77
8 1 v7 1 v7
9 123 v 8135 NaN
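Note that str.extract returns one column per capture group, so with the extended pattern above, which contains nested groups, the full match is selected via [0].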
You say it stops after 36 values, and that it is an Excel file you are processing? One thing you could try is to save the data set to a .csv file and read that file in with the pd.read_csv function. Excel files sometimes contain extra characters that are not easily visible.
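A minimal sketch of that round-trip (the file name is hypothetical):
import pandas as pd

# after saving the Excel sheet as 'data.csv', read it back in;
# utf-8-sig also strips a byte-order mark if one is present
df = pd.read_csv('data.csv', encoding='utf-8-sig')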
I'm trying to create a forecasting process using hierarchical time series. My problem is that I can't find a way to create a for loop that hierarchically extracts daily time series from a pandas dataframe grouping the sum of quantities by date. The resulting daily time series should be passed to a function inside the loop, and the results stored in some other object.
Dataset
The initial dataset is a table that represents the daily sales data of 3 hierarchical levels: city, shop, product. The initial table has this structure:
+============+============+============+============+==========+
| Id_Level_1 | Id_Level_2 | Id_Level_3 | Date | Quantity |
+============+============+============+============+==========+
| Rome | Shop1 | Prod1 | 01/01/2015 | 50 |
+------------+------------+------------+------------+----------+
| Rome | Shop1 | Prod1 | 02/01/2015 | 25 |
+------------+------------+------------+------------+----------+
| Rome | Shop1 | Prod1 | 03/01/2015 | 73 |
+------------+------------+------------+------------+----------+
| Rome | Shop1 | Prod1 | 04/01/2015 | 62 |
+------------+------------+------------+------------+----------+
| ... | ... | ... | ... | ... |
+------------+------------+------------+------------+----------+
| Milan | Shop3 | Prod9 | 31/12/2018 | 185 |
+------------+------------+------------+------------+----------+
| Milan | Shop3 | Prod9 | 31/12/2018 | 147 |
+------------+------------+------------+------------+----------+
| Milan | Shop3 | Prod9 | 31/12/2018 | 206 |
+------------+------------+------------+------------+----------+
Each City (Id_Level_1) has many Shops (Id_Level_2), and each one has some Products (Id_Level_3). Each shop has a different mix of products (maybe shop1 and shop3 have product7, which is not available in other shops). All data are daily and the measure of interest is the quantity.
Hierarchical Index (MultiIndex)
I need to create a tree structure (hierarchical structure) to extract a time series for each "node" of the structure. I call a "node" a combination of the hierarchical keys, i.e. "Rome" and "Milan" are nodes of level 1, while "Rome|Shop1" and "Milan|Shop9" are nodes of level 2. In particular, I need this on level 3, because each product (Id_Level_3) has different sales in each shop of each city. The hierarchy is strict.
Nodes of level 3 are "Rome, Shop1, Prod1", "Rome, Shop1, Prod2", "Rome, Shop2, Prod1", and so on. The key of the nodes is logically the concatenation of the ids.
For each node, the time series is composed by two columns: Date and Quantity.
# MultiIndex dataframe
Liv_Labels = ['Id_Level_1', 'Id_Level_2', 'Id_Level_3', 'Date']
df.set_index(Liv_Labels, drop=False, inplace=True)
Then I need to extract the aggregated time series in order, while keeping the hierarchical nodes.
Level 0:
Level_0 = df.groupby(level=['Date'])['Quantity'].sum()
Level 1:
# Node Level 1 "Rome" (with idx = pd.IndexSlice)
Level_1['Rome'] = df.loc[idx[['Rome'], :, :]].groupby(level=['Date'])['Quantity'].sum()
# Node Level 1 "Milan"
Level_1['Milan'] = df.loc[idx[['Milan'], :, :]].groupby(level=['Date'])['Quantity'].sum()
Level 2:
# Node Level 2 "Rome, Shop1"
Level_2['Rome', 'Shop1'] = df.loc[idx[['Rome'], ['Shop1'], :]].groupby(level=['Date'])['Quantity'].sum()
... repeat for each level 2 node ...
# Node Level 2 "Milan, Shop9"
Level_2['Milan', 'Shop9'] = df.loc[idx[['Milan'], ['Shop9'], :]].groupby(level=['Date'])['Quantity'].sum()
Attempts
I already tried creating dictionaries and a MultiIndex, but my problem is that I can't address a specific "node" inside the loop. I can't even extract the unique node keys per level, so I can't collect a specific node's time series.
# Get level labels
Level_Labels = ['Id_Level_'+str(n) for n in range(1, Level_Num+1)]+['Date']
# Initialize dictionary
TimeSeries = {}
# Get Level 0 time series
TimeSeries["Level_0"] = df.groupby(level=['Date'])['Quantity'].sum()
# Get the other levels' time series, from 1 to Level_Num
for i in range(1, Level_Num+1):
    TimeSeries["Level_"+str(i)] = df.groupby(level=Level_Labels[0:i]+['Date'])['Quantity'].sum()
Desired result
I would like a loop that cycles through my dataset with these actions (a sketch follows the list):
Creates a structure of all the unique node keys
Extracts each node's time series, grouped by Date with Quantity summed
Stores the time series in a structure for later use
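A minimal sketch of such a loop (my addition, not from the original post; it builds on the TimeSeries dictionary above and assumes Level_Num and the MultiIndex are set as shown):
# sketch: iterate every node of every level and store its series
Node_Series = {}
for i in range(1, Level_Num+1):
    level_ts = TimeSeries["Level_"+str(i)]
    # group by all index levels except the last one ('Date')
    for node_key, ts in level_ts.groupby(level=list(range(i))):
        # node_key is e.g. ('Rome', 'Shop1') at level 2
        Node_Series[node_key] = ts.droplevel(list(range(i)))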
Thanks in advance for any suggestion! Best regards.
FR
I'm currently working on a switch dataset that I pulled from an SQL database, where each port on a switch has a data frame holding a time series. To access the time series for a specific port, I represented the switches by their IP addresses and their port numbers, and to make sure I don't re-query what I already queried before, I used the .unique() method to get the unique values of each.
I set my index to the IP and port levels and accessed the port information like so:
def yield_df(df):
    for ip in df.index.get_level_values('ip').unique():
        for port in df.loc[ip].index.get_level_values('port').unique():
            yield df.loc[ip].loc[port]
Then I cycled through the port data frames with a for loop like so:
for port_df in yield_df(adb_df):
I'm sure there are faster ways to carry out these procedures in pandas, but I hope this helps you start solving your problem.
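For what it's worth, one possibly faster variant (a sketch of my own, not part of the original answer) lets groupby yield each (ip, port) sub-frame directly instead of doing repeated .loc lookups:
# sketch: groupby on the index levels yields each (ip, port) sub-frame
for (ip, port), port_df in adb_df.groupby(level=['ip', 'port']):
    ...  # process each port's time series here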
I have a pandas dataframe and I want to search through the strings in Column A; if there's a match, I want to append a 1 to a new column, and if there is no match, a 0.
My df currently looks like:
Column A          | Column B | Column C
company one       | 314      | 0.9
company one toast | 190      | 0.3
www.companyone    | 380      | 0.87
companyone home   | 850      | 0.1
toaster supplies  | 1100     | 0.5
toast rack        | 200      | 0.7
...
I'm trying to write a function which will read through column A, and if there's a match with either company one or companyone, then append 1 on the end of the row. If there is no match, then append 0. The output I'm looking for is:
Column A          | Column B | Column C | Branded
company one       | 314      | 0.9      | 1
company one toast | 190      | 0.3      | 1
www.companyone    | 380      | 0.87     | 1
companyone home   | 850      | 0.1      | 1
toaster supplies  | 1100     | 0.5      | 0
toast rack        | 200      | 0.7      | 0
...
I've tried this function:
def branded(table):
    if 'company.*?one' in table[table['Column A']]:
        table['Branded'] = 1
    else:
        table['Branded'] = 0
    return table.head()
However, I get a KeyError, and I'm not sure what I'm missing.
You can do it like this:
df['Branded'] = df['Column A'].str.contains('company.*?one')*1
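A slight variation on this (my addition, not part of the original answer), in case the column can contain missing values or mixed case:
# case-insensitive match; NaN cells count as no match
df['Branded'] = df['Column A'].str.contains('company.*?one', case=False, na=False).astype(int)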
The solution posted by zipa is better in my opinion. However, I thought of sharing this tweaked version in case the strings to be looked for follow entirely different patterns. You can add the words to the list and then do something similar:
import pandas as pd
df = pd.DataFrame({'column':['company one','companyone', 'company two']})
search = ['company one', 'companyone']
string_search = '|'.join(search)
df['flag'] = df['column'].str.contains(string_search)
df['flag'] = df['flag'].map({True: 1, False: 0})
After looking for answers and trying everything, I could not figure a way out, so here it goes.
I have a list of *.txt files that I want to merge by column. I am 100% sure that they have the same structure, as follows:
File1
date | time | model_name1
1850-01-16 | 12:00:00 | 0.10
File2
date | time | model_name2
1850-01-16 | 12:00:00 | 0.50
File3..... and so on
Note: the vertical bars are just for clarity here.
Now my output should look like this:
Output
date | time | model_name1 | model_name2
1850-01-16 | 12:00:00 | 0.10 | 0.50
With the following piece of code
out_list4 = os.listdir(out_directory)
df_list = [pd.read_table(out_path+os.fsdecode(file_x), sep='\s+') for file_x in out_list4]
df_merged = reduce(lambda left,right: ,
pd.merge(left,right,on=['date'], how='outer'), df_list)
pd.DataFrame.to_csv(df_merged, out_path+'merged.txt', sep='\t', index=False)
I managed to get the following output:
Output
date       | time_x   | model_name1 | time_y   | model_name2
1850-01-16 | 12:00:00 | 0.10        | 12:00:00 | 0.50
As expected, since I only used the key on=['date'].
Now if I try to pass time as a second key, with on=['date','time'], it crashes with the following error:
KeyError: 'time'
and a long list of tracebacks.
I tried placing left_on/right_on in case 'date' was being handled as an index. No use. I know the problem does not lie in the files; the structure is right, it is the code. Any help will be much appreciated. And sorry for the readability.
So, the problem was earlier in the code. I had defined out_list4 as a list before:
out_list4 = list()
and it was making a mess at the end. Each data element in the list should have size 1872 x 3, but the last entry ended up with everything added together again, at 1872 x 12 and with no 'time' header.
Changing the definition of out_list4 to:
out_list4 = []
did the trick. The tip came from Combine a list of pandas dataframes to one pandas dataframe.
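As an aside (my addition, not part of the original answer), once the frames read in cleanly, the same merge can be written with pd.concat, which sidesteps the repeated merge keys entirely:
# index each frame on the shared keys and concatenate column-wise
df_merged = pd.concat([d.set_index(['date', 'time']) for d in df_list], axis=1).reset_index()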