When I download data and convert it into a DataFrame I lose the first column with dates - python-3.x

I use quandl to download stock prices. I have a list of company names and I download all the information. After that, I convert it into a data frame. When I do it for only one company everything works well, but when I try to do it for all of them at the same time something goes wrong: the first column with dates is converted into an index with values from 0 to 3 instead of the dates.
My code looks like below:
import quandl
import pandas as pd
names_of_company = ['11BIT', 'ABCDATA', 'ALCHEMIA']

for names in names_of_company:
    x = quandl.get('WSE/%s' % names, start_date='2018-11-29',
                   end_date='2018-11-29',
                   paginate=True)
    x['company'] = names
    results = results.append(x).reset_index(drop=True)
The actual result looks like this:
Index Open High Low Close %Change Volume # of Trades Turnover (1000) company
0 204.5 208.5 204.5 206.0 0.73 3461.0 105.0 717.31 11BIT
1 205.0 215.0 202.5 214.0 3.88 10812.0 392.0 2254.83 ABCDATA
2 215.0 215.0 203.5 213.0 -0.47 12651.0 401.0 2656.15 ALCHEMIA
But I expected:
Date Open High Low Close %Change Volume # of Trades Turnover (1000) company
2018-11-29 204.5 208.5 204.5 206.0 0.73 3461.0 105.0 717.31 11BIT
2018-11-29 205.0 215.0 202.5 214.0 3.88 10812.0 392.0 2254.83 ABCDATA
2018-11-29 215.0 215.0 203.5 213.0 -0.47 12651.0 401.0 2656.15 ALCHEMIA
So as you can see, there is an issue with the dates because they are not carried over correctly. But as I said, if I do it for only one company, it works. Below is the code:
x = quandl.get('WSE/11BIT', start_date='2019-01-01', end_date='2019-01-03')
df = pd.DataFrame(x)
I will be very grateful for any help! Thanks all.

When you store it to a dataframe, the date is your index. You lose it because when you use .reset_index(), you overwrite the old index (the date), and instead of the date being added as a column, you tell it to drop it with .reset_index(drop=True).
So I'd keep appending, but once the whole results dataframe is populated, I'd reset the index without dropping it, by doing either results = results.reset_index(drop=False) or results = results.reset_index(), since the default is False.
import quandl
import pandas as pd
names_of_company = ['11BIT', 'ABCDATA', 'ALCHEMIA']
results = pd.DataFrame()
for names in names_of_company:
    x = quandl.get('WSE/%s' % names, start_date='2018-11-29',
                   end_date='2018-11-29',
                   paginate=True)
    x['company'] = names
    results = results.append(x)

results = results.reset_index(drop=False)
Output:
print (results)
Date Open High ... # of Trades Turnover (1000) company
0 2018-11-29 269.50 271.00 ... 280.0 1822.02 11BIT
1 2018-11-29 0.82 0.92 ... 309.0 1027.14 ABCDATA
2 2018-11-29 4.55 4.55 ... 1.0 0.11 ALCHEMIA
[3 rows x 10 columns]
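As a side note (not part of the original answer): DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current pandas the same loop is usually written by collecting the frames in a list and concatenating once. A minimal sketch, assuming the same quandl calls as above:
import quandl
import pandas as pd

names_of_company = ['11BIT', 'ABCDATA', 'ALCHEMIA']
frames = []
for name in names_of_company:
    x = quandl.get('WSE/%s' % name, start_date='2018-11-29',
                   end_date='2018-11-29', paginate=True)
    x['company'] = name
    frames.append(x)

# concatenate once, keep the Date index, then turn it into a column
results = pd.concat(frames).reset_index(drop=False)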

Related

Analyse log data out of a CSV file, all written in one long row

I have a set of a couple hundred sensors whose data is recorded in a log file. Each measurement cycle of all sensors is written in one line of the log file in CSV format. I need to be able to structure the log file to make some analysis with plots and calculations of the values.
The format of the CSV is like the following:
ID;Time;SensorID;ValueA;ValueB;ValueC;ValueD;SensorID;ValueA;ValueB;ValueC;ValueD[3..400]SensorID;ValueA;ValueB;ValueC;ValueD
11234;11:12:123456;12345678;5.3;53.4;53;-36.6;72345670;5.8;57.4;56;-39.6;[...]92345670;5.9;60.4;55;-33.6;
So I have a very long table with about 5000 or 6000 columns which contain my values, but I'm not sure of the right way to extract them so that I can easily perform some analysis. The table has about 600 rows.
I have written a report function in python with the help of pandas. The format I can already analyse is like the following:
Time;SensorID;ValueA;ValueB;ValueC;ValueD
11:12:123456;12345678;5.3;53.4;53;-36.6;
11:12:123457;12345679;5.5;55;54;-40;
So the time is slightly different and the Sensor ID will be different.
I use groupby(SensorID) and plots of the groups, and after that I perform some value_counts() within some columns.
If I understand you correctly, each data line in the CSV contains ValueA to ValueD for a set of sensors, sharing the same ID and Time columns. Also, your data line ends with a semicolon, which will throw pandas off.
[...]92345670;5.9;60.4;55;-33.6;
This answer leaves the semicolon in place, since I assume you cannot change the process that produces the CSV file.
import pandas as pd
from io import StringIO
string = """
ID;Time;SensorID;ValueA;ValueB;ValueC;ValueD;SensorID;ValueA;ValueB;ValueC;ValueD;SensorID;ValueA;ValueB;ValueC;ValueD;
11234;11:12:123456;12345678;5.3;53.4;53;-36.6;72345670;5.8;57.4;56;-39.6;92345670;5.9;60.4;55;-33.6;
"""
df = pd.read_csv(StringIO(string), sep=';', engine='python')
# Shift the columns. We need this because of the extra semicolon!
columns = df.columns[1:]
df = df.iloc[:, :-1]
df.columns = columns
df = df.set_index('Time')
# n is how many groups of sensor measurement are stored in each line
n = df.shape[1] // 5
idx = pd.MultiIndex.from_product([range(n), ['SensorID', 'ValueA', 'ValueB', 'ValueC', 'ValueD']])
result = df.stack(level=0).droplevel(-1).reset_index()
Result:
Time SensorID ValueA ValueB ValueC ValueD
0 11:12:123456 12345678 5.3 53.4 53 -36.6
1 11:12:123456 72345670 5.8 57.4 56 -39.6
2 11:12:123456 92345670 5.9 60.4 55 -33.6
Now you can send it to your analysis function.
Thanks for thinking about it. Have you tested this code? I'm getting this table as output:
Time 0
0 11234 11:12:123456
1 11234 12345678
2 11234 5.3
3 11234 53.4
4 11234 53
5 11234 -36.6
6 11234 72345670
7 11234 5.8
8 11234 57.4
9 11234 56
10 11234 -39.6
11 11234 92345670
12 11234 5.9
13 11234 60.4
14 11234 55
15 11234 -33.6
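The flat output above is what df.stack(level=0) produces when the column MultiIndex built in the answer is never actually assigned to the columns. A self-contained sketch of what appears to be the intended reshape (an assumption about the answer's intent), using the sample line from the question with the trailing semicolons dropped and ID/Time kept as the index:
import pandas as pd
from io import StringIO

raw = """ID;Time;SensorID;ValueA;ValueB;ValueC;ValueD;SensorID;ValueA;ValueB;ValueC;ValueD
11234;11:12:123456;12345678;5.3;53.4;53;-36.6;72345670;5.8;57.4;56;-39.6"""

df = pd.read_csv(StringIO(raw), sep=';')
df = df.set_index(['ID', 'Time'])

# label the repeated SensorID/ValueA..ValueD blocks with a column MultiIndex,
# then stack the block number into the index and drop it again
n = df.shape[1] // 5
df.columns = pd.MultiIndex.from_product(
    [range(n), ['SensorID', 'ValueA', 'ValueB', 'ValueC', 'ValueD']])
result = df.stack(level=0).droplevel(-1).reset_index()
print(result)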

Create multiple subsets from an existing pandas dataframe [duplicate]

I have a very large dataframe (around 1 million rows) with data from an experiment (60 respondents).
I would like to split the dataframe into 60 dataframes (a dataframe for each participant).
In the dataframe, data, there is a variable called 'name', which is the unique code for each participant.
I have tried the following, but nothing happens (or execution does not stop within an hour). What I intend to do is to split the data into smaller dataframes, and append these to a list (datalist):
import pandas as pd
def splitframe(data, name='name'):
    n = data[name][0]
    df = pd.DataFrame(columns=data.columns)
    datalist = []

    for i in range(len(data)):
        if data[name][i] == n:
            df = df.append(data.iloc[i])
        else:
            datalist.append(df)
            df = pd.DataFrame(columns=data.columns)
            n = data[name][i]
            df = df.append(data.iloc[i])

    return datalist
I do not get an error message, the script just seems to run forever!
Is there a smart way to do it?
Can I ask why not just do it by slicing the data frame? Something like:
import numpy as np
import pandas as pd

# create some data with Names column
data = pd.DataFrame({'Names': ['Joe', 'John', 'Jasper', 'Jez'] * 4,
                     'Ob1': np.random.rand(16), 'Ob2': np.random.rand(16)})
# create unique list of names
UniqueNames = data.Names.unique()
# create a data frame dictionary to store your data frames
DataFrameDict = {elem: pd.DataFrame() for elem in UniqueNames}

for key in DataFrameDict.keys():
    DataFrameDict[key] = data[:][data.Names == key]
Hey presto you have a dictionary of data frames just as (I think) you want them. Need to access one? Just enter
DataFrameDict['Joe']
Firstly, your approach is inefficient because appending on a row-by-row basis will be slow, as storage has to periodically grow when there is insufficient space for the new entry. List comprehensions are better in this respect, as the size is determined up front and allocated once.
However, I think fundamentally your approach is a little wasteful as you have a dataframe already so why create a new one for each of these users?
I would sort the dataframe by column 'name', set the index to be this and if required not drop the column.
Then generate a list of all the unique entries; you can then perform a lookup using these entries and, crucially, if you are only querying the data, use the selection criteria to return a view on the dataframe without incurring a costly data copy.
Use pandas.DataFrame.sort_values and pandas.DataFrame.set_index:
# sort the dataframe
df.sort_values(by='name', axis=0, inplace=True)
# set the index to be this and don't drop
df.set_index(keys=['name'], drop=False, inplace=True)
# get a list of names
names = df['name'].unique().tolist()
# now we can perform a lookup on a 'view' of the dataframe
joe = df.loc[df.name == 'joe']
# now you can query all 'joes'
You can convert the groupby object to tuples and then to a dict:
df = pd.DataFrame({'Name': list('aabbef'),
                   'A': [4, 5, 4, 5, 5, 4],
                   'B': [7, 8, 9, 4, 2, 3],
                   'C': [1, 3, 5, 7, 1, 0]}, columns=['Name', 'A', 'B', 'C'])
print (df)
Name A B C
0 a 4 7 1
1 a 5 8 3
2 b 4 9 5
3 b 5 4 7
4 e 5 2 1
5 f 4 3 0
d = dict(tuple(df.groupby('Name')))
print (d)
{'b': Name A B C
2 b 4 9 5
3 b 5 4 7, 'e': Name A B C
4 e 5 2 1, 'a': Name A B C
0 a 4 7 1
1 a 5 8 3, 'f': Name A B C
5 f 4 3 0}
print (d['a'])
Name A B C
0 a 4 7 1
1 a 5 8 3
It is not recommended, but it is possible to create DataFrames by groups:
for i, g in df.groupby('Name'):
    globals()['df_' + str(i)] = g
print (df_a)
Name A B C
0 a 4 7 1
1 a 5 8 3
Easy:
[v for k, v in df.groupby('name')]
Groupby can help you:
grouped = data.groupby(['name'])
Then you can work with each group like with a dataframe for each participant. DataFrameGroupBy object methods such as apply, transform, aggregate, head, first, and last return a DataFrame object.
Or you can make a list from grouped and get all the DataFrames by index:
l_grouped = list(grouped)
l_grouped[0][1] - DataFrame for the first group, with the first name.
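If only a single participant's rows are needed, the groupby object also has a get_group method. A tiny sketch with a toy frame (the column values here are placeholders, not the OP's data):
import pandas as pd

data = pd.DataFrame({'name': ['p1', 'p1', 'p2'], 'measurement': [1, 2, 3]})  # toy stand-in
grouped = data.groupby('name')
print(grouped.get_group('p2'))  # just the rows for that participant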
In addition to Gusev Slava's answer, you might want to use groupby's groups:
{key: df.loc[value] for key, value in df.groupby("name").groups.items()}
This will yield a dictionary with the keys you have grouped by, pointing to the corresponding partitions. The advantage is that the keys are maintained and don't vanish in the list index.
The method in the OP works, but isn't efficient. It may have seemed to run forever, because the dataset was long.
Use .groupby on the 'method' column, and create a dict of DataFrames with unique 'method' values as the keys, with a dict-comprehension.
.groupby returns a groupby object, that contains information about the groups, where g is the unique value in 'method' for each group, and d is the DataFrame for that group.
The value of each key in df_dict, will be a DataFrame, which can be accessed in the standard way, df_dict['key'].
The original question wanted a list of DataFrames, which can be done with a list-comprehension
df_list = [d for _, d in df.groupby('method')]
import pandas as pd
import seaborn as sns # for test dataset
# load data for example
df = sns.load_dataset('planets')
# display(df.head())
method number orbital_period mass distance year
0 Radial Velocity 1 269.300 7.10 77.40 2006
1 Radial Velocity 1 874.774 2.21 56.95 2008
2 Radial Velocity 1 763.000 2.60 19.84 2011
3 Radial Velocity 1 326.030 19.40 110.62 2007
4 Radial Velocity 1 516.220 10.50 119.47 2009
# Using a dict-comprehension, the unique 'method' value will be the key
df_dict = {g: d for g, d in df.groupby('method')}
print(df_dict.keys())
[out]:
dict_keys(['Astrometry', 'Eclipse Timing Variations', 'Imaging', 'Microlensing', 'Orbital Brightness Modulation', 'Pulsar Timing', 'Pulsation Timing Variations', 'Radial Velocity', 'Transit', 'Transit Timing Variations'])
# or a specific name for the key, using enumerate (e.g. df1, df2, etc.)
df_dict = {f'df{i}': d for i, (g, d) in enumerate(df.groupby('method'))}
print(df_dict.keys())
[out]:
dict_keys(['df0', 'df1', 'df2', 'df3', 'df4', 'df5', 'df6', 'df7', 'df8', 'df9'])
df_dict['df1'].head(3) or df_dict['Astrometry'].head(3)
There are only 2 in this group
method number orbital_period mass distance year
113 Astrometry 1 246.36 NaN 20.77 2013
537 Astrometry 1 1016.00 NaN 14.98 2010
df_dict['df2'].head(3) or df_dict['Eclipse Timing Variations'].head(3)
method number orbital_period mass distance year
32 Eclipse Timing Variations 1 10220.0 6.05 NaN 2009
37 Eclipse Timing Variations 2 5767.0 NaN 130.72 2008
38 Eclipse Timing Variations 2 3321.0 NaN 130.72 2008
df_dict['df3'].head(3) or df_dict['Imaging'].head(3)
method number orbital_period mass distance year
29 Imaging 1 NaN NaN 45.52 2005
30 Imaging 1 NaN NaN 165.00 2007
31 Imaging 1 NaN NaN 140.00 2004
For more information about the seaborn datasets
NASA Exoplanets
Alternatively
This is a manual method to create separate DataFrames using pandas: Boolean Indexing
This is similar to the accepted answer, but .loc is not required.
This is an acceptable method for creating a couple extra DataFrames.
The pythonic way to create multiple objects is by placing them in a container (e.g. dict, list, generator, etc.), as shown above.
df1 = df[df.method == 'Astrometry']
df2 = df[df.method == 'Eclipse Timing Variations']
In [28]: df = DataFrame(np.random.randn(1000000,10))
In [29]: df
Out[29]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 10 columns):
0 1000000 non-null values
1 1000000 non-null values
2 1000000 non-null values
3 1000000 non-null values
4 1000000 non-null values
5 1000000 non-null values
6 1000000 non-null values
7 1000000 non-null values
8 1000000 non-null values
9 1000000 non-null values
dtypes: float64(10)
In [30]: frames = [ df.iloc[i*60:min((i+1)*60,len(df))] for i in xrange(int(len(df)/60.) + 1) ]
In [31]: %timeit [ df.iloc[i*60:min((i+1)*60,len(df))] for i in xrange(int(len(df)/60.) + 1) ]
1 loops, best of 3: 849 ms per loop
In [32]: len(frames)
Out[32]: 16667
Here's a groupby way (and you could do an arbitrary apply rather than sum)
In [9]: g = df.groupby(lambda x: x/60)
In [8]: g.sum()
Out[8]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16667 entries, 0 to 16666
Data columns (total 10 columns):
0 16667 non-null values
1 16667 non-null values
2 16667 non-null values
3 16667 non-null values
4 16667 non-null values
5 16667 non-null values
6 16667 non-null values
7 16667 non-null values
8 16667 non-null values
9 16667 non-null values
dtypes: float64(10)
Sum is cythonized, which is why this is so fast:
In [10]: %timeit g.sum()
10 loops, best of 3: 27.5 ms per loop
In [11]: %timeit df.groupby(lambda x: x/60)
1 loops, best of 3: 231 ms per loop
A method based on list comprehension and groupby, which stores all the split dataframes in a list variable; they can be accessed using the index.
Example
ans = [pd.DataFrame(y) for x, y in DF.groupby('column_name', as_index=False)]
ans[0]
ans[0].column_name
You can use the groupby command, if you already have some labels for your data.
out_list = [group[1] for group in in_series.groupby(label_series.values)]
Here's a detailed example:
Let's say we want to partition a pd series using some labels into a list of chunks
For example, in_series is:
2019-07-01 08:00:00 -0.10
2019-07-01 08:02:00 1.16
2019-07-01 08:04:00 0.69
2019-07-01 08:06:00 -0.81
2019-07-01 08:08:00 -0.64
Length: 5, dtype: float64
And its corresponding label_series is:
2019-07-01 08:00:00 1
2019-07-01 08:02:00 1
2019-07-01 08:04:00 2
2019-07-01 08:06:00 2
2019-07-01 08:08:00 2
Length: 5, dtype: float64
Run
out_list = [group[1] for group in in_series.groupby(label_series.values)]
which returns out_list a list of two pd.Series:
[2019-07-01 08:00:00 -0.10
2019-07-01 08:02:00 1.16
Length: 2, dtype: float64,
2019-07-01 08:04:00 0.69
2019-07-01 08:06:00 -0.81
2019-07-01 08:08:00 -0.64
Length: 3, dtype: float64]
Note that you can use some parameters from in_series itself to group the series, e.g., in_series.index.day
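For example, a short sketch of chunking a datetime-indexed series by calendar day (the series here is a made-up stand-in, not the one above):
import numpy as np
import pandas as pd

idx = pd.date_range('2019-07-01 08:00', periods=6, freq='12h')
in_series = pd.Series(np.random.randn(6), index=idx)

# one chunk per calendar day, collected in a list
daily_chunks = [chunk for _, chunk in in_series.groupby(in_series.index.day)]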
Here's a small function which might help some (efficiency is probably not perfect, but it's compact and more or less easy to understand):
def get_splited_df_dict(df: pd.DataFrame, split_column: str):
    """
    Splits a pandas.DataFrame on split_column and returns it as a dict.
    """
    df_dict = {value: df[df[split_column] == value].drop(split_column, axis=1)
               for value in df[split_column].unique()}
    return df_dict
It converts a DataFrame into multiple DataFrames by selecting each unique value in the given column and putting all those entries into a separate DataFrame.
The .drop(split_column, axis=1) is just for removing the column which was used to split the DataFrame. The removal is not necessary, but it can help a little to cut down on memory usage after the operation.
The result of get_splited_df_dict is a dict, meaning one can access each DataFrame like this:
splitted = get_splited_df_dict(some_df, some_column)
# accessing the DataFrame with 'some_column_value'
splitted[some_column_value]
The existing answers cover all the good cases and explain fairly well how the groupby object is like a dictionary with keys and values that can be accessed via .groups. Yet more methods to do the same job as the existing answers are:
Create a list by unpacking the groupby object and casting it to a dictionary:
dict([*df.groupby('Name')]) # same as dict(list(df.groupby('Name')))
Create a tuple + dict (this is the same as @jezrael's answer):
dict((*df.groupby('Name'),))
If we only want the DataFrames, we could get the values of the dictionary (created above):
[*dict([*df.groupby('Name')]).values()]
I had a similar problem. I had a time series of daily sales for 10 different stores and 50 different items. I needed to split the original dataframe into 500 dataframes (10 stores * 50 items) to apply machine learning models to each of them, and I couldn't do it manually.
This is the head of the dataframe:
I have created two lists:
one for the names of the dataframes,
and one for the pairs [item_number, store_number].
list = []
for i in range(1, len(items)*len(stores)+1):
    list.append('df' + str(i))

list_couple_s_i = []
for item in items:
    for store in stores:
        list_couple_s_i.append([item, store])
And once the two lists are ready, you can loop over them to create the dataframes you want:
for name, it_st in zip(list, list_couple_s_i):
    globals()[name] = df.where((df['item'] == it_st[0]) &
                               (df['store'] == it_st[1]))
    globals()[name].dropna(inplace=True)
In this way I have created 500 dataframes.
Hope this will be helpful!
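For reference, the same 500 frames can also be held in a single dictionary keyed by (item, store) via one groupby, as in the earlier answers, which avoids globals(). A short sketch with a toy frame (the column names 'item' and 'store' are assumed from the description above):
import pandas as pd

# toy stand-in for the daily-sales frame
df = pd.DataFrame({'item': [1, 1, 2, 2], 'store': [1, 2, 1, 2], 'sales': [10, 20, 30, 40]})

# one DataFrame per (item, store) pair, no globals() needed
per_pair = {key: grp for key, grp in df.groupby(['item', 'store'])}
print(per_pair[(1, 2)])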

Efficient way to create a large matrix in python from a table of interactions

I have a csv file in the following structure, indicating "interactions".
I need to convert this to a standard square matrix format so that I can use some other functions written for graphs (with igraph).
The CSV file I would like to convert has ~106M rows in the following format:
node1 node2 interaction strength
XYZ ABC 0.74
XYZ TAH 0.24
XYZ ABA 0.3
ABC TAH 0.42
... (node names are made up to show there is no pattern except node1 is ordered)
and the standard format I would like to have for this data has about 16K rows and 16K columns, as follows:
XYZ ABC ABA TAH ...
XYZ 0 0.74 0.3 0
ABC 0.74 0 0 0.42
ABA 0.3 0 0 0
TAH 0 0.42 0 0
.
.
.
I do not necessarily need to have a dataframe in the end, but I need to have the row and column names in the same order, and to save this final matrix as a CSV somewhere.
What I tried is:
import pandas as pd
import progressbar
def list_uniqify(seq, idfun=None):
    # order preserving
    if idfun is None:
        def idfun(x): return x
    seen = {}
    result = []
    for item in seq:
        marker = idfun(item)
        # in old Python versions:
        # if seen.has_key(marker)
        # but in new ones:
        if marker in seen: continue
        seen[marker] = 1
        result.append(item)
    return result

data = pd.read_csv('./pipelines/cache/fr2/summa_fr2.csv', index_col=0)
names_ordered = list_uniqify(data.iloc[:, 0].tolist() + data.iloc[:, 1].tolist())
adj = pd.DataFrame(0, index=names_ordered, columns=names_ordered)

bar = progressbar.ProgressBar(maxval=data.shape[0] + 1,
                              widgets=[progressbar.Bar('=', '[', ']'), ' ', progressbar.Percentage()])
bar.update(0)
bar.start()

print("Preparing output...")
for i in range(data.shape[0]):
    bar.update(i)
    adj.loc[data.iloc[i, 0], data.iloc[i, 1]] = data.iloc[i, 2]
    adj.loc[data.iloc[i, 1], data.iloc[i, 0]] = data.iloc[i, 2]
bar.finish()

print("Saving output...")
adj.to_csv("./data2_fr2.csv")
About 20-30 minutes in and I just got 1%, which means this would take about 2 days which is too long.
Is there anything I can do to speed up this process?
Note: I could parallelize this (8 cores, 15GB RAM, ~130GB SWAP)
but a single-core operation already takes 15GB RAM and ~15GB SWAP. I am not sure if this is a good idea or not. As no two processes would write to the same cell of the dataframe, I wouldn't need to correct for a race condition, right?
Edit: Below are speed tests for the suggested functions; they are amazingly better than the loop I implemented (which took ~34 seconds for 50K...).
speeds in seconds for 250K, 500K, 1M rows:
pivot_table: 0.029901999999999873, 0.031084000000000334, 0.0320750000000003
crosstab: 0.023093999999999948, 0.021742999999999846, 0.021409000000000233
Look at using pd.crosstab:
pd.crosstab(df['node1'], df['node2'], df['interaction'], aggfunc='first').fillna(0)
Output:
node2 ABA ABC TAH
node1
ABC 0.0 0.00 0.42
XYZ 0.3 0.74 0.24
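If the full symmetric 16K x 16K layout from the question is needed (every node appearing as both a row and a column, in the same order), the crosstab result can be reindexed and mirrored. A sketch, assuming the same column names as the answer above and that each node pair appears only once:
import pandas as pd

# toy stand-in for the interactions table
df = pd.DataFrame({'node1': ['XYZ', 'XYZ', 'XYZ', 'ABC'],
                   'node2': ['ABC', 'TAH', 'ABA', 'TAH'],
                   'interaction': [0.74, 0.24, 0.3, 0.42]})

nodes = pd.unique(df[['node1', 'node2']].values.ravel())
mat = pd.crosstab(df['node1'], df['node2'], df['interaction'], aggfunc='first')
mat = mat.reindex(index=nodes, columns=nodes).fillna(0)
mat = mat + mat.T   # mirror the upper triangle, since each pair is stored once
mat.to_csv('adjacency.csv')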
I think you just need .pivot_table and then to reindex the columns (and rows which get changed), filling missing values with 0.
import pandas as pd
df2 = (pd.pivot_table(df, index='node1', columns='node2', values='interaction_strength')
         .reindex(df.node1.drop_duplicates())
         .reindex(df.node1.drop_duplicates(), axis=1)
         .fillna(0))
df2.index.name = None
df2.columns.name = None
Output:
XYZ ABC
XYZ 0.0 0.74
ABC 0.0 0.00

Populating pandas column based on moving date range (efficiently)

I have 2 pandas dataframes, one of them contains dates with measurements, and the other contains dates with an event ID.
df1
from datetime import datetime as dt
from datetime import timedelta
import pandas as pd
import numpy as np
today = dt.now()
ndays = 10
df1 = pd.DataFrame({'Date': [today + timedelta(days = x) for x in range(ndays)], 'measurement': pd.Series(np.random.randint(1, high = 10, size = ndays))})
df1.Date = df1.Date.dt.date
Date measurement
2018-01-10 8
2018-01-11 2
2018-01-12 7
2018-01-13 3
2018-01-14 1
2018-01-15 1
2018-01-16 6
2018-01-17 9
2018-01-18 8
2018-01-19 4
df2
df2 = pd.DataFrame({'Date': ['2018-01-11', '2018-01-14', '2018-01-16', '2018-01-19'], 'event_id': ['event_a', 'event_b', 'event_c', 'event_d']})
df2.Date = pd.to_datetime(df2.Date, format = '%Y-%m-%d')
df2.Date = df2.Date.dt.date
Date event_id
2018-01-11 event_a
2018-01-14 event_b
2018-01-16 event_c
2018-01-19 event_d
I want to give the dates in df1 an event_id from df2, but only if the date falls between two event dates. The resulting dataframe would look something like:
df3
today = dt.now()
ndays = 10
df3 = pd.DataFrame({'Date': [today + timedelta(days = x) for x in range(ndays)], 'measurement': pd.Series(np.random.randint(1, high = 10, size = ndays)), 'event_id': ['event_a', 'event_a', 'event_b', 'event_b', 'event_b', 'event_c', 'event_c', 'event_d', 'event_d', 'event_d']})
df3.Date = df3.Date.dt.date
Date event_id measurement
2018-01-10 event_a 4
2018-01-11 event_a 2
2018-01-12 event_b 1
2018-01-13 event_b 5
2018-01-14 event_b 5
2018-01-15 event_c 4
2018-01-16 event_c 6
2018-01-17 event_d 6
2018-01-18 event_d 9
2018-01-19 event_d 6
The code I use to achieve this is:
n = 1
while n <= len(list(df2.Date)) - 1:
    for date in list(df1.Date):
        if date <= df2.iloc[n].Date and (date > df2.iloc[n-1].Date):
            df1.loc[df1.Date == date, 'event_id'] = df2.iloc[n].event_id
    n += 1
The dataset that I am working with is significantly larger than this (a few million rows) and this method runs far too long. Is there a more efficient way to accomplish this?
So there are quite a few things to improve performance.
The first question I have is: does it have to be a pandas frame to begin with? Meaning can't df1 and df2 just be lists of tuples or list of lists?
The thing is that pandas adds a significant overhead when accessing items but especially when setting values individually.
Pandas excels when it comes to vectorized operations but I don't see an efficient alternative right now (maybe someone comes up with such an answer, that would be ideal).
Now what I'd do is:
Convert your df1 and df2 to records, e.g. d1 = df1.to_records(); what you get is an array of tuples, basically with the same structure as the dataframe.
Now run your algorithm but instead of operating on pandas dataframes you operate on the arrays of tuples d1 and d2
Use a third list of tuples d3 where you store the newly created data (each tuple is a row)
Now if you want you can convert d3 back to a pandas dataframe:
df3 = pd.DataFrame.from_records(d3, **myKwArgs)
This will speed up your code significantly, I'd assume by more than 100-1000%. It does increase memory usage though, so if you are low on memory, try to avoid the pandas dataframes altogether, or dereference the unused pandas frames df1, df2 once you have used them to create the records (and if you run into problems, call gc manually).
EDIT: Here is a version of your code using the procedure above:
d3 = []
n = 1
while n < len(d2):
    for i in range(len(d1)):
        date = d1[i][0]
        if date <= d2[n][0] and date > d2[n-1][0]:
            d3.append((date, d2[n][1], d1[i][1]))
    n += 1
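As a vectorized alternative (a sketch, not the approach above): pd.merge_asof with direction='forward' attaches to each measurement date the next event date on or after it, which reproduces the expected df3 from the question. It needs the Date columns kept as datetime64 and sorted:
import pandas as pd

df1 = pd.DataFrame({'Date': pd.date_range('2018-01-10', periods=10, freq='D'),
                    'measurement': range(10)})          # dummy measurements
df2 = pd.DataFrame({'Date': pd.to_datetime(['2018-01-11', '2018-01-14',
                                            '2018-01-16', '2018-01-19']),
                    'event_id': ['event_a', 'event_b', 'event_c', 'event_d']})

# each row gets the event whose date is the nearest one on or after it
df3 = pd.merge_asof(df1.sort_values('Date'), df2.sort_values('Date'),
                    on='Date', direction='forward')
print(df3)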
You can try the df.apply() method to achieve this. Refer to pandas.DataFrame.apply. I think my code will work faster than yours.
My approach:
Merge the two dataframes df1 and df2 and create a new one, df3, by:
df3 = pd.merge(df1, df2, on='Date', how='outer')
Sort df3 by date to make it easy to traverse.
df3['Date'] = pd.to_datetime(df3.Date)
df3 = df3.sort_values(by='Date')
Create a set_event_date() method to apply to each row in df3.
new_event_id = np.nan

def set_event_date(df3):
    global new_event_id
    if df3.event_id is not np.nan:
        new_event_id = df3.event_id
    return new_event_id
Apply set_event_date() to each row in df3.
df3['new_event_id'] = df3.apply(set_event_date,axis=1)
Final Output will be:
Date Measurement New_event_id
0 2018-01-11 2 event_a
1 2018-01-12 1 event_a
2 2018-01-13 3 event_a
3 2018-01-14 6 event_b
4 2018-01-15 3 event_b
5 2018-01-16 5 event_c
6 2018-01-17 7 event_c
7 2018-01-18 9 event_c
8 2018-01-19 7 event_d
9 2018-01-20 4 event_d
Let me know once you have tried my solution and whether it works faster than yours.
Thanks.

Pandas Cumulative Sum of Difference Between Value Counts in Two Dataframe Columns

The charts below show my basic challenge: subtract NUMBER OF STOCKS WITH DATA END from NUMBER OF STOCKS WITH DATA START. The challenge I am having is that the date ranges for the two series do not match, so I need to merge both sets to a common date range, perform the subtraction, and save the results to a new comma-separated value file.
Input data in a file named 'meta.csv' contains 3187 lines. The fields per line are ticker, start, and end. Head and tail as shown here:
0000 ticker,start,end
0001 A,1999-11-18,2016-12-27
0002 AA,2016-11-01,2016-12-27
0003 AAL,2005-09-27,2016-12-27
0004 AAMC,2012-12-13,2016-12-27
0005 AAN,1984-09-07,2016-12-27
...
3183 ZNGA,2011-12-16,2016-12-27
3184 ZOES,2014-04-11,2016-12-27
3185 ZQK,1990-03-26,2015-09-09
3186 ZTS,2013-02-01,2016-12-27
3187 ZUMZ,2005-05-06,2016-12-27
Python code and console output:
import pandas as pd
df = pd.read_csv('meta.csv')
s = df.groupby('start').size().cumsum()
e = df.groupby('end').size().cumsum()
#s.plot(title='NUMBER OF STOCKS WITH DATA START',
# grid=True,style='k.')
#e.plot(title='NUMBER OF STOCKS WITH DATA END',
# grid=True,style='k.')
print(s.head(5))
print(s.tail(5))
print(e.tail(5))
OUT:
start
1962-01-02 11
1962-11-19 12
1970-01-02 30
1971-08-06 31
1972-06-01 54
dtype: int64
start
2016-07-05 3182
2016-10-04 3183
2016-11-01 3184
2016-12-05 3185
2016-12-08 3186
end
2016-12-08 544
2016-12-15 545
2016-12-16 546
2016-12-21 547
2016-12-27 3186
dtype: int64
Chart output when comments removed for code shown above:
I want to create one population file with the date and number of stocks with active data which should have a head and tail shown as follows:
date,num_stocks
1962-01-02,11
1962-11-19,12
1970-01-02,30
1971-08-06,31
1972-06-01,54
...
2016-12-08,2642
2016-12-15,2641
2016-12-16,2640
2016-12-21,2639
2016-12-27,2639
The ultimate goal is to be able to plot the number of stocks with data over any specified date range by reading the population file.
To align the dates with their respective counts, I'd take the difference of pd.Series.value_counts:
df.start.value_counts().sub(df.end.value_counts(), fill_value=0)
1984-09-07 1.0
1990-03-26 1.0
1999-11-18 1.0
2005-05-06 1.0
2005-09-27 1.0
2011-12-16 1.0
2012-12-13 1.0
2013-02-01 1.0
2014-04-11 1.0
2015-09-09 -1.0
2016-11-01 1.0
2016-12-27 -9.0
dtype: float64
Thanks to the crucial tip provided by piRSquared I solved the challenge using this code:
import pandas as pd
df = pd.read_csv('meta.csv')
x = df.start.value_counts().sub(df.end.value_counts(), fill_value=0)
x.iloc[-1] = 0
r = x.cumsum()
r.to_csv('pop.csv')
z = pd.read_csv('pop.csv', index_col=0, header=None)
z.plot(title='NUMBER OF STOCKS WITH DATA', legend=None,
       grid=True, style='k.')
'pop.csv' file head/tail:
1962-01-02 11.0
1962-11-19 12.0
1970-01-02 30.0
1971-08-06 31.0
1972-06-01 54.0
...
2016-12-08 2642.0
2016-12-15 2641.0
2016-12-16 2640.0
2016-12-21 2639.0
2016-12-27 2639.0
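A small variant (a sketch, not part of the solution above) writes the date,num_stocks header the question asked for; it assumes r is the cumulative-sum Series built in the code above:
# name the Series and its index so to_csv emits "date,num_stocks" as the header
r.rename('num_stocks').rename_axis('date').to_csv('pop.csv', header=True)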
Chart:
