Separating data in pandas that has a variable format - python-3.x

I have a txt file that is output from another modelling program, which reports parameters for one modelled node at a time. The output is similar to the sample below. My problem is that the data comes as a single column that is occasionally broken by a new header, after which the first portion of the column (time) repeats but the second portion is new. There are two things I would like to be able to do:
1) Break the data into two columns, time and data, for each node, and add the node label as the first column.
2) Further down in the file there is a second parameter for the same node, under a header of the form Data 2 Node(XX, XX) with the same node ID as before, and I would like to match it up with the first parameter.
This would give me four columns in the end: the node ID (repeated), the time, data parameter 1, and the matched data parameter 2.
I've included a small sample of the data below, but the full output is over 1,000,000 lines, so it would be nice to use pandas or other Python functionality to manipulate the data.
Thanks for the help!
Name 20 vs 2
----------------------------------
time Data 1 Node( 72, 23)
--------------------- ---------------------
4.1203924E-003 -3.6406431E-005
1.4085015E-002 -5.8257871E-004
2.4049638E-002 6.8743013E-004
3.4014260E-002 8.2296302E-005
4.3978883E-002 -1.2276627E-004
5.3943505E-002 1.9813024E-004
....
Name 20 vs 2
----------------------------------
time Data 1 Node( 72, 24)
--------------------- ---------------------
4.1203924E-003 -3.6406431E-005
1.4085015E-002 -5.8257871E-004
2.4049638E-002 6.8743013E-004
3.4014260E-002 8.2296302E-005
4.3978883E-002 -1.2276627E-004
5.3943505E-002 1.9813024E-004

So after a fair amount of googling I managed to piece this one together. The data I was looking at was space separated, so I used pandas' fixed-width file reader. After that, I looked up the indices of a few known elements in the file and used them to break the data into two DataFrames that I could merge and process afterwards. I removed the dashed lines and NaN values, as they were not of interest to me. After that, the fun began on actually using the data.
import pandas as pd

widths = [28, 27]

# Read the whole file as two fixed-width columns
df = pd.read_fwf(filename, widths=widths, names=["Times", "Items"])

# Locate the first row of each parameter block by its header text
data = df["Items"].astype(str)
index1 = data.str.contains('Data 1').idxmax()
index2 = data.str.contains('Data 2').idxmax()

# Re-read each block separately, then combine the two parameters side by side
df2 = pd.read_fwf(filename, widths=widths, skiprows=index1,
                  skipfooter=(len(df) - index2), header=None,
                  names=["Times", "Prin. Stress 1"])
df3 = pd.read_fwf(filename, widths=widths, skiprows=index2,
                  header=None, names=["Times", "Prin. Stress 2"])
df2["Prin. Stress 2"] = df3["Prin. Stress 2"]

df2 = df2[~df2["Times"].str.contains("---")]  # removes the ---- separator lines
df2.dropna(inplace=True)
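As a follow-up, here is a rough, hedged sketch (not part of the original answer) of how the four-column result the question asks for might be built by walking the file block by block instead of re-reading it. It assumes every block header looks like the sample above and reuses the filename variable from the code above.
import re
import pandas as pd

records = []
node, param = None, None
with open(filename) as fh:
    for line in fh:
        # Header lines such as "time   Data 1 Node( 72, 23)" set the current block
        header = re.search(r'Data (\d+) Node\(\s*(\d+),\s*(\d+)\)', line)
        if header:
            param = "Data " + header.group(1)
            node = "(" + header.group(2) + "," + header.group(3) + ")"
            continue
        parts = line.split()
        if len(parts) == 2 and node is not None:
            try:
                records.append((node, param, float(parts[0]), float(parts[1])))
            except ValueError:
                pass  # skip separator rows that are not numeric

long_df = pd.DataFrame(records, columns=["Node", "Param", "Time", "Value"])
# Pivot so that Data 1 and Data 2 become separate columns per node and time
wide_df = (long_df.pivot_table(index=["Node", "Time"], columns="Param",
                               values="Value").reset_index())
pivot_table is used so repeated (Node, Time) pairs collapse cleanly; if duplicates cannot occur, a plain pivot would do the same job.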

Related

What are faster ways of reading a big data set and applying row-wise operations, other than pandas and dask?

I am working on code where I need to populate a set of data structures based on each row of a big table. Right now, I am using pandas to read the data and do some elementary validation preprocessing. However, when I get to the rest of the process and put the data into the corresponding data structure, the loop takes a considerably long time to complete before my data structures are populated. For example, in the following code I have a table with 15M records. The table has three columns, and I create a foo() object based on each row and add it to a list.
# Profile.csv
# Index    | Name | Family| DB
# ---------|------|-------|----------
# 0        | Jane | Doe   | 08/23/1977
# ...
# 15000000 | Jhon | Doe   | 01/01/2000
import pandas as pd

class foo():
    def __init__(self, name, last, bd):
        self.name = name
        self.last = last
        self.bd = bd

def populate(row, my_list):
    my_list.append(foo(*row))

# reading the csv file and parsing the date column (sample dates are MM/DD/YYYY)
df = pd.read_csv('Profile.csv')
df['DB'] = pd.to_datetime(df['DB'], format='%m/%d/%Y')

# using apply to create a foo() object per row and add it to the list
my_list = []
df.apply(populate, axis=1, args=(my_list,))
So after using pandas to convert the string dates to date objects, I just need to iterate over the DataFrame to create my objects and add them to the list. This process is very time-consuming (in my real example it takes even longer, since my data structure is more complex and I have more columns). So, I am wondering what the best practice is in this case to improve my run time. Should I even use pandas to read my big tables and process them row by row?
It would simply be faster to use a file handle:
input_file = "Profile.csv"
sep = ","  # the question's Profile.csv is assumed to be comma-separated
my_list = []
with open(input_file) as fh:
    # map each column name to its position using the header line
    cols = {}
    for i, col in enumerate(fh.readline().strip().split(sep)):
        cols[col] = i
    for line in fh:
        line = line.strip().split(sep)
        # reorder MM/DD/YYYY into YYYY-MM-DD
        date = line[cols["DB"]].split("/")
        date = [date[2], date[0], date[1]]
        line[cols["DB"]] = "-".join(date)
        populate(line, my_list)
There are multiple approaches for this kind of situation; however, the fastest and most effective method is vectorization, if possible. A vectorized solution for the example demonstrated in this post could be as follows:
my_list = [foo(*args) for args in zip(df["Name"], df["Family"], df["DB"])]
If vectorization is not possible, converting the data frame to a dictionary can significantly improve performance. For the current example, it would be something like:
my_list = []
dc = df.to_dict()
# dc maps each column name to a dict of {row index: value}
for i in dc["Name"]:
    my_list.append(foo(dc["Name"][i], dc["Family"][i], dc["DB"][i]))
The last solution is particularly effective if the types of structures and processes involved are more complex.
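One more option, added here as a hedged aside rather than something from the original answers: df.itertuples() is generally much faster than df.apply(..., axis=1) for this kind of row-wise object construction. A minimal sketch, reusing the foo class and column names from the question:
# Rough sketch: build the objects from namedtuples instead of apply()
my_list = []
for row in df.itertuples(index=False):
    my_list.append(foo(row.Name, row.Family, row.DB))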

Adding a list of 20 items to successive cells of a row in a dataframe

I am using Clarifai's API to get tags for images from a URL and then add those to an empty data frame. There are 20 tags returned from Clarifai, and I want to store each URL with its tags in one row. So the first cell in the row would be the URL, and every successive cell would contain one of the 20 tags for that URL.
It would look, ideally, like this:
URL            | tag 1 | tag 2 | ..... | tag 20
www.tetURL.com | Xyz   | abc   | ..... | fgh
So far, I have the URL and the tags, but I am having a hard time figuring out how to store each tag in a successive cell of a row.
df = pd.DataFrame()  # Need to append values to this DataFrame
test = 'https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcQVOCBtn_hZYXawqd_OPu6YkAM737TEiIabOe0X_CIvPtuPRei96C3gI1KjTlc1URek05nhBSiV&usqp=CAc'
print('running predictions for this url:\n', test, '\n')
response = workflow.predict_by_url(test)
pred_vals = response['results'][0]['outputs'][2]['data']['regions'][0]['data']['concepts']
vals_list = []
for vals in range(len(pred_vals)):
    concept_val = pred_vals[vals]  # dict containing the id, name and value
    # print(concept_val['name'], ':', concept_val['value'])
    vals_list.append(concept_val['name'])
    # print('')
print(vals_list)
These are the values of vals_list
['long-sleeve', 'top', 'colorblock', 'turtleneck', 'crewneck', 'sweatshirt', 't-shirt', 'graphic', 'stripes', 'mockneck', 'sweater', 'coat', 'shirt', 'knit', 'hoodie', 'fedora', 'leather', 'eyelet', 'scarf', 'windowpane']
I will appreciate any help you guys can provide.
Have you managed to build a dictionary with url as key and tags as values? If so, you may be able to use from_dict:
info_df = pd.DataFrame.from_dict(info, orient='index')
If you don't want urls as index add .reset_index():
info_df = pd.DataFrame.from_dict(info, orient='index').reset_index()
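To make that concrete, here is a small hedged example. The info dictionary below is an assumed structure (each url key mapping to its list of tags, built from the test url and vals_list in the question), not something taken from the original answer:
# Assumed structure: {url: [tag1, tag2, ..., tag20]}
info = {test: vals_list}
info_df = pd.DataFrame.from_dict(info, orient='index').reset_index()
# Rename the generated columns to match the desired layout
info_df.columns = ['URL'] + ['tag ' + str(i) for i in range(1, len(vals_list) + 1)]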
Reconsider using a wide-format data frame with a tag# structure and instead use a long-format data frame with two columns, url and tag. Long-format data scales better (e.g., what if there are fewer or more than 20 tags?). Also, it is much easier to work with long data in most analytical operations: aggregation, modeling, plotting, etc. With this approach, simply use the DataFrame constructor to build the two-column data set.
test = ('https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcQVOCBtn_hZYXaw'
        'qd_OPu6YkAM737TEiIabOe0X_CIvPtuPRei96C3gI1KjTlc1URek05nhBSiV&usqp=CAc')
print('running predictions for this url:\n', test, '\n')
response = workflow.predict_by_url(test)
pred_vals = (response['results'][0]['outputs'][2]
             ['data']['regions'][0]['data']['concepts'])
vals_list = []
for concept_val in pred_vals:
    vals_list.append(concept_val['name'])
# TWENTY-ROW DATA FRAME WITH THE SAME URL FOR EACH TAG
df = pd.DataFrame({'url': test, 'tags': vals_list})
Alternatively, for a shorter call, use a list comprehension in place of the for loop:
df = pd.DataFrame({'url': test,
                   'tags': [concept_val['name'] for concept_val in pred_vals]})
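If the wide layout from the question is still needed for display, here is a hedged sketch of reshaping the long frame built above; the tag_no numbering column is an invented helper, not part of the original answer:
# Number the tags within each url, then pivot them into tag columns;
# note the resulting column order is lexicographic and may need re-sorting.
df['tag_no'] = 'tag ' + (df.groupby('url').cumcount() + 1).astype(str)
wide = df.pivot(index='url', columns='tag_no', values='tags').reset_index()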

Pandas dropped row showing in plot

I am trying to make a heatmap.
I get my data out of a pipeline that classes some rows as noisy, and I decided to make one plot including them and one plot without them.
The problem I have: in the plot without the noisy rows, blank lines appear (the same number of lines as rows removed).
Roughly, the code looks like this (I can expand parts if required; I am trying to keep it short).
If needed I can provide a link with similar data publicly available.
data_frame = load_df_fromh5(file)  # load a data frame from the hdf5 output
noisy = [..]  # a list of 0/1 flags indicating which rows are noisy
# I believe the problem is here:
noisy = [i for (i, v) in enumerate(noisy) if v == 1]  # indices of the rows to remove
# drop the corresponding indices
df_cells_noisy = df_cells[~df_cells.index.isin(noisy)].dropna(how="any")
# I tried an alternative method (using the original 0/1 flags):
not_noisy = [0 if e == 1 else 1 for e in noisy]
df = df[np.array(not_noisy, dtype=bool)]
# then I made a clustering using scipy
Z = hierarchy.linkage(df, method="average", metric="canberra", optimal_ordering=True)
df = df.reindex(hierarchy.leaves_list(Z))
# then I plot using the df variable
# quite a long function; I believe the problem is upstream.
plot(df)
The plotting function is quite long, but I believe it works well because the problem only shows up with the non-noisy data frame.
I believe pandas somehow keeps information about the deleted rows and that they are plotted as blank lines. Any help is welcome.
Context:
These are single-cell data of copy number anomalies (abnormalities in the number of copies of a genomic segment).
Rows represent individuals (here, individual cells); columns represent, for each genomic interval, the number of copies (2 being normal, except for sex chromosomes).
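No answer was posted for this question in this thread. As a loudly hedged guess (mine, not from the thread): after filtering out the noisy rows, the frame keeps its original, now non-contiguous index, and any downstream step that positions rows by index value (for example the reindex on the linkage order) can leave gaps. A minimal check would be to reset the index right after filtering:
# Hypothetical check, not from the original post: give the filtered frame a
# fresh 0..n-1 index before clustering and plotting.
df = df_cells[~df_cells.index.isin(noisy)].dropna(how="any").reset_index(drop=True)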

Append each value in a DataFrame to a np vector, grouping by column

I am trying to create a list, which will be fed as input to the neural network of a Deep Reinforcement Learning model.
What I would like to achieve:
This list should have the properties of this code's output
vec = []
lines = open("data/" + "GSPC" + ".csv", "r").read().splitlines()
for line in lines[1:]:
    vec.append(float(line.split(",")[4]))
i.e. just a flat list of float values.
The original dataframe looks like:
Out[0]:
      Close     sma15
0   1.26420  1.263037
1   1.26465  1.263193
2   1.26430  1.263350
3   1.26450  1.263533
but by using df.transpose() I obtained the following:
              0         1        2         3
Close  1.264200  1.264650  1.26430  1.26450
sma15  1.263037  1.263193  1.26335  1.263533
from here I would like to obtain a list grouped by column, of the type:
[1.264200, 1.263037, 1.264650, 1.263193, 1.26430, 1.26335, 1.26450, 1.263533]
I tried
x = np.array(df.values.tolist(), dtype = np.float32).reshape(1,-1)
but this gives me a float array with 1 row and 6 columns; how could I achieve a result that has the properties I am looking for?
From what I can understand, you just want a flattened version of the DataFrame's values. That can be done simply with the ndarray.flatten() method rather than reshaping it.
# Creating your DataFrame object
a = [[1.26420, 1.263037],
     [1.26465, 1.263193],
     [1.26430, 1.263350],
     [1.26450, 1.263533]]
df = pd.DataFrame(a, columns=['Close', 'sma15'])
df.values.flatten()
This gives array([1.2642, 1.263037, 1.26465, 1.263193, 1.2643, 1.26335, 1.2645, 1.263533]) as is (presumably) desired.
PS: I am not sure why you have not included the last row of the DataFrame as the output of your transpose operation. Is that an error?
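As a small follow-up note (not part of the original answer): the default flatten() is row-major, which is exactly what produces the interleaved Close/sma15 order shown in the question; if you are starting from the already-transposed frame instead, column-major order gives the same result:
# Equivalent result starting from the transposed frame in the question
df.T.values.flatten(order='F')  # column-major ('Fortran') order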

Slow loop aggregating rows and columns

I have a DataFrame with a column named 'UserNbr' and a column named 'Spclty', which is composed of elements like this:
[['104', '2010-01-31'], ['215', '2014-11-21'], ['352', '2016-07-13']]
where there can be 0 or more elements in the list.
Some UserNbr keys appear in multiple rows, and I wish to collapse each such group into a single row such that 'Spclty' contains all the unique entries like those in the list shown above.
To save the overhead of appending to a DataFrame, I'm appending each output row to a list instead of to the DataFrame.
My code is working, but it's taking hours to run on 0.7M rows of input. (Actually, I've never been able to keep my laptop open long enough for it to finish executing.)
Is there a better way to aggregate into a structure like this, maybe using a library that provides more data-reshaping options instead of looping over UserNbr? (In R, I'd use the data.table and dplyr libraries.)
# loop over all UserNbr:
#   consolidate specialty fields into dict-like sets (to remove redundant codes);
#   output one row per user to a new data frame
out_rows = list()
spcltycol = df_tmp.columns.get_loc('Spclty')
all_UserNbr = df_tmp['UserNbr'].unique()

for user in all_UserNbr:
    df_user = df_tmp.loc[df_tmp['UserNbr'] == user]
    if df_user.shape[0] > 0:
        open_combined = df_user.iloc[0, spcltycol]  # capture 1st row
        for row in range(1, df_user.shape[0]):      # union with any subsequent rows
            open_combined = open_combined.union(df_user.iloc[row, spcltycol])
        new_row = df_user.drop(['Spclty', 'StartDt'], axis=1).iloc[0].tolist()
        new_row.append(open_combined)
        out_rows.append(new_row)

# construct new dataframe with no redundant UserID rows:
df_out = pd.DataFrame(out_rows, columns=['UserNbr', 'Spclty'])

# convert Spclty sets to dicts:
df_out['Spclty'] = [dict(df_out['Spclty'][row]) for row in range(df_out.shape[0])]
The conversion to dict gets rid of specialties that are repeated between rows. In the output, a Spclty value should look like this:
{'104': '2010-01-31', '215': '2014-11-21', '352': '2016-07-13'}
except that there may be more key-value pairs than in any corresponding input row (resulting from aggregation over UserNbr).
I withdraw this question.
I had hoped there was an efficient way to use groupby with something else, but I haven't found any examples with a complex data structure like this one and have received no guidance.
For anyone who gets similarly stuck with very slow aggregation problems in Python, I suggest stepping up to PySpark. I am now tackling this problem with a Databricks notebook and am making headway with the pyspark.sql.window Window functions. (Now, it only takes minutes to run a test instead of hours!)
A partial solution is in the answer here:
PySpark list() in withColumn() only works once, then AssertionError: col should be Column
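For anyone landing on this later, here is a loudly hedged sketch (not from this thread) of what a plain-pandas groupby approach could look like, assuming each Spclty cell is a list or set of (code, date) pairs as described in the question:
import pandas as pd

def combine_spclty(cells):
    # Union all (code, date) pairs for one UserNbr into a single dict;
    # duplicate codes collapse to a single key.
    combined = {}
    for cell in cells:
        combined.update(dict(cell))
    return combined

df_out = pd.DataFrame(
    [(user, combine_spclty(grp))
     for user, grp in df_tmp.groupby('UserNbr', sort=False)['Spclty']],
    columns=['UserNbr', 'Spclty'])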
