Python: Extract dimension data from dataframe string column and create columns with values for each of them - python-3.x

Hi,
I have a source file with 2 columns: ID and all_dimensions. all_dimensions is a string of key-value pairs, and the keys are not the same for every ID.
I want to turn the keys into column headers and, where a key exists for an ID, parse its value into the right cell.
Example:
ID all_dimensions
12 Height:2 cm,Volume: 4cl,Weight:100g
34 Length: 10cm, Height: 5 cm
56 Depth: 80cm
78 Weight: 2 kg, Length: 7 cm
90 Diameter: 4 cm, Volume: 50 cl
Desired result:
ID Height Volume Weight Length Depth Diameter
12 2 cm 4cl 100g - - -
34 5 cm - - 10cm - -
56 - - - - 80cm -
78 - - 2 kg 7 cm - -
90 - 50 cl - - - 4 cm
I have over 100 dimensions, so ideally I would like to use a for loop or something similar rather than spelling out each column header (see the code examples below).
I am using Python 3.7.3 and pandas 0.24.2.
What have I tried already:
1) I tried to split the data into separate columns but wasn't sure how to proceed so that each value ends up under the right header:
df.set_index('ID',inplace=True)
newdf = df["all_dimensions"].str.split(",|:",expand = True)
2) Using the initial df, I used "str.extract" to create new columns (but then I would need to specify each header):
df['Volume']=df.all_dimensions.str.extract(r'Volume:([\w\s.]*)').fillna('')
3) To avoid specifying each header as in 2), I created a list of all dimension attributes and planned to use it in a for loop to extract the values:
columns_list=df.all_dimensions.str.extract(r'^([\D]*):',expand=True).drop_duplicates()
columns_list=columns_list[0].str.strip().values.tolist()
for dimension in columns_list:
    df.dimension = df.all_dimensions.str.extract(r'dimension([\w\s.]*)').fillna('')
Here, Jupyter gives me a UserWarning ("Pandas doesn't allow columns to be created via a new attribute name") and the df looks the same as before.
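For what it's worth, the loop can be made to work (a minimal sketch, assuming columns_list is built as above): use bracket indexing so pandas actually creates a column, and build the regex from the loop variable instead of the literal word "dimension".
import re
for dimension in columns_list:
    pattern = re.escape(dimension) + r'\s*:\s*([\w\s.]*)'
    df[dimension] = df.all_dimensions.str.extract(pattern, expand=False).fillna('')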

Option 1: I prefer splitting several times:
new_series = (df.set_index('ID')
                .all_dimensions
                .str.split(',', expand=True)
                .stack()
                .reset_index(level=-1, drop=True)
              )
# split a second time for the individual measurement
new_df = (new_series.str
                    .split(':', expand=True)
                    .reset_index()
          )
# stripping off leading/trailing spaces
new_df[0] = new_df[0].str.strip()
new_df[1] = new_df[1].str.strip()
# unstack to get the desired table:
new_df.set_index(['ID', 0])[1].unstack()
Option 2: Use split(',|:') as you tried:
# splitting
new_series = (df.set_index('ID')
.all_dimensions
.str.split(',|:', expand=True)
.stack()
.reset_index(level=-1, drop=True)
)
# concat along axis=1 to get dataframe with two columns
# new_df.columns = ('ID', 0, 1) where 0 is measurement name
new_df = (pd.concat((new_series[::2].str.strip(),
                     new_series[1::2]), axis=1)
            .reset_index())
new_df.set_index(['ID', 0])[1].unstack()
Output:
Depth Diameter Height Length Volume Weight
ID
12 NaN NaN 2 cm NaN 4cl 100g
34 NaN NaN 5 cm 10cm NaN NaN
56 80cm NaN NaN NaN NaN NaN
78 NaN NaN NaN 7 cm NaN 2 kg
90 NaN 4 cm NaN NaN 50 cl NaN
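To match the desired layout with '-' placeholders instead of NaN, one option (a small sketch following either option above) is to fill after unstacking:
result = new_df.set_index(['ID', 0])[1].unstack().fillna('-')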

This is a hard question: the string needs to be split, each item after the split needs to be converted to a dict, and then we can rebuild the columns with the DataFrame constructor.
d = [[{y.split(':')[0]: y.split(':')[1]} for y in x.split(',')] for x in df.all_dimensions]
from collections import ChainMap
data = list(map(lambda x: dict(ChainMap(*x)), d))
s = pd.DataFrame(data)
df = pd.concat([df, s.groupby(s.columns.str.strip(), axis=1).first()], axis=1)
df
Out[26]:
ID all_dimensions Depth ... Length Volume Weight
0 12 Height:2 cm,Volume: 4cl,Weight:100g NaN ... NaN 4cl 100g
1 34 Length: 10cm, Height: 5 cm NaN ... 10cm NaN NaN
2 56 Depth: 80cm 80cm ... NaN NaN NaN
3 78 Weight: 2 kg, Length: 7 cm NaN ... 7 cm NaN 2 kg
4 90 Diameter: 4 cm, Volume: 50 cl NaN ... NaN 50 cl NaN
[5 rows x 8 columns]
Check the columns
df['Height']
Out[28]:
0 2 cm
1 5 cm
2 NaN
3 NaN
4 NaN
Name: Height, dtype: object
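A closely related variant (just a sketch, reusing the df from the question) builds each record as a single dict and strips keys and values up front, so neither ChainMap nor the later groupby/strip step is needed:
records = [
    {key.strip(): value.strip()
     for key, value in (pair.split(':') for pair in row.split(','))}
    for row in df['all_dimensions']
]
df = pd.concat([df, pd.DataFrame(records, index=df.index)], axis=1)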

Related

Create some features based on the average growth rate of y for the month over the past few years

Assuming we have dataset df (which can be downloaded from this link), I want to create some features based on the average growth rate of y for the same month over the past several years, for example: y_agr_last2, y_agr_last3, y_agr_last4, etc.
The formula is:
y_agr_lastN = ((1 + y_shift12/100) * (1 + y_shift24/100) * ... * (1 + y_shift(12*N)/100))^(1/N) - 1
where y_shift(12*i) is the value of y in the same month i years earlier.
For example, for September 2022, y_agr_last2 = ((1 + 3.85/100)*(1 + 1.81/100))^(1/2) - 1 and y_agr_last3 = ((1 + 3.85/100)*(1 + 1.81/100)*(1 + 1.6/100))^(1/3) - 1.
The code I use is as follows, which is rather repetitive:
import math
df['y_shift12'] = df['y'].shift(12)
df['y_shift24'] = df['y'].shift(24)
df['y_shift36'] = df['y'].shift(36)
df['y_agr_last2'] = pow(((1+df['y_shift12']/100) * (1+df['y_shift24']/100)), 1/2) -1
df['y_agr_last3'] = pow(((1+df['y_shift12']/100) * (1+df['y_shift24']/100) * (1+df['y_shift36']/100)), 1/3) -1
df.drop(['y_shift12', 'y_shift24', 'y_shift36'], axis=1, inplace=True)
df
How can the desired result be achieved more concisely?
References:
Create some features based on the mean of y for the month over the past few years
Following is one way to generalise it:
import functools
import operator

num_yrs = 3
for n in range(1, num_yrs + 1):
    df[f"y_shift{n*12}"] = df["y"].shift(n * 12)
    df[f"y_agr_last{n}"] = pow(functools.reduce(operator.mul, [1 + df[f"y_shift{i*12}"]/100 for i in range(1, n + 1)], 1), 1/n) - 1
df = df.drop(["y_agr_last1"] + [f"y_shift{n*12}" for n in range(1, num_yrs + 1)], axis=1)
Output:
date y x1 x2 y_agr_last2 y_agr_last3
0 2018/1/31 -13.80 1.943216 3.135839 NaN NaN
1 2018/2/28 -14.50 0.732108 0.375121 NaN NaN
...
22 2019/11/30 4.00 -0.273262 -0.021146 NaN NaN
23 2019/12/31 7.60 1.538851 1.903968 NaN NaN
24 2020/1/31 -11.34 2.858537 3.268478 -0.077615 NaN
25 2020/2/29 -34.20 -1.246915 -0.883807 -0.249940 NaN
26 2020/3/31 46.50 -4.213756 -4.670146 0.221816 NaN
...
33 2020/10/31 -1.00 1.967062 1.860070 -0.035569 NaN
34 2020/11/30 12.99 2.302166 2.092842 0.041998 NaN
35 2020/12/31 5.54 3.814303 5.611199 0.030017 NaN
36 2021/1/31 -6.41 4.205601 4.948924 -0.064546 -0.089701
37 2021/2/28 -22.38 4.185913 3.569100 -0.342000 -0.281975
38 2021/3/31 17.64 5.370519 3.130884 0.465000 0.298025
...
54 2022/7/31 0.80 -6.259455 -6.716896 0.057217 0.052793
55 2022/8/31 -5.30 1.302754 1.412277 0.015121 -0.000492
56 2022/9/30 NaN -2.876968 -3.785964 0.028249 0.024150
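Equivalently (a sketch under the same column names, not part of the answer above), the geometric mean can be computed through logs, which avoids the functools.reduce chain:
import numpy as np

num_yrs = 3
for n in range(2, num_yrs + 1):
    # mean of log(1 + y/100) over the same month in each of the previous n years
    log_mean = sum(np.log1p(df["y"].shift(12 * i) / 100) for i in range(1, n + 1)) / n
    df[f"y_agr_last{n}"] = np.expm1(log_mean)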

Concatenate 2 dataframes. I would like to combine duplicate columns

The following code can be used as an example of the problem I'm having:
dic={'A':['1','2','3'], 'B':['10','11','12']}
df1=pd.DataFrame(dic)
df1.set_index('A', inplace=True)
dic2={'A':['4','5','6'], 'B':['10','11','12']}
df2=pd.DataFrame(dic2)
df2.set_index('A', inplace=True)
df3=pd.concat([df1,df2], axis=1)
print(df3)
The result I get from this concatenation is:
B B
1 10 NaN
2 11 NaN
3 12 NaN
4 NaN 10
5 NaN 11
6 NaN 12
I would like to have:
B
1 10
2 11
3 12
4 10
5 11
6 12
I know that I can concatenate along axis=0. Unfortunately, that only solves the problem for this little example. The actual code I'm working with is more complex. Concatenating along axis=0 causes the index to be duplicated. I don't want that either.
EDIT:
People have asked me to give a more complex example to describe why simply removing 'axis=1' doesn't work. Here is a more complex example, first with axis=1 INCLUDED:
dic={'A':['1','2','3'], 'B':['10','11','12']}
df1=pd.DataFrame(dic)
df1.set_index('A', inplace=True)
dic2={'A':['4','5','6'], 'B':['10','11','12']}
df2=pd.DataFrame(dic2)
df2.set_index('A', inplace=True)
df=pd.concat([df1,df2], axis=1)
dic3={'A':['1','2','3'], 'C':['20','21','22']}
df3=pd.DataFrame(dic3)
df3.set_index('A', inplace=True)
df4=pd.concat([df,df3], axis=1)
print(df4)
This gives me:
B B C
1 10 NaN 20
2 11 NaN 21
3 12 NaN 22
4 NaN 10 NaN
5 NaN 11 NaN
6 NaN 12 NaN
I would like to have:
B C
1 10 20
2 11 21
3 12 22
4 10 NaN
5 11 NaN
6 12 NaN
Now here is an example with axis=1 REMOVED:
dic={'A':['1','2','3'], 'B':['10','11','12']}
df1=pd.DataFrame(dic)
df1.set_index('A', inplace=True)
dic2={'A':['4','5','6'], 'B':['10','11','12']}
df2=pd.DataFrame(dic2)
df2.set_index('A', inplace=True)
df=pd.concat([df1,df2])
dic3={'A':['1','2','3'], 'C':['20','21','22']}
df3=pd.DataFrame(dic3)
df3.set_index('A', inplace=True)
df4=pd.concat([df,df3])
print(df4)
This gives me:
B C
A
1 10 NaN
2 11 NaN
3 12 NaN
4 10 NaN
5 11 NaN
6 12 NaN
1 NaN 20
2 NaN 21
3 NaN 22
I would like to have:
B C
1 10 20
2 11 21
3 12 22
4 10 NaN
5 11 NaN
6 12 NaN
Sorry it wasn't very clear. I hope this helps.
Here is a two step process, for the example provided after the 'EDIT' point. Start by creating the dictionaries:
import pandas as pd
dic = {'A':['1','2','3'], 'B':['10','11','12']}
dic2 = {'A':['4','5','6'], 'B':['10','11','12']}
dic3 = {'A':['1','2','3'], 'C':['20','21','22']}
Step 1: convert each dictionary to a data frame, with index 'A', and concatenate (along axis=0):
t = pd.concat([pd.DataFrame(dic).set_index('A'),
               pd.DataFrame(dic2).set_index('A'),
               pd.DataFrame(dic3).set_index('A')])
Step 2: concatenate non-null elements of col 'B' with non-null elements of col 'C' (you could put this in a list comprehension if there are more than two columns). Now we concatenate along axis=1:
result = pd.concat([
    t.loc[t['B'].notna(), 'B'],
    t.loc[t['C'].notna(), 'C'],
], axis=1)
print(result)
B C
1 10 20
2 11 21
3 12 22
4 10 NaN
5 11 NaN
6 12 NaN
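For completeness, a shorter variant (a sketch reusing the stacked frame t from Step 1): collapse the duplicate index labels directly, keeping the first non-null value per column for each label. This also generalises to any number of columns:
result = t.groupby(level=0, sort=False).first()
print(result)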
Edited:
If two objects are concatenated along axis=1, the new columns are appended side by side; with axis=0 (the default), the same columns are extended with the new rows.
See the solution below:
import pandas as pd
dic={'A':['1','2','3'], 'B':['10','11','12']}
df1=pd.DataFrame(dic)
df1.set_index('A', inplace=True)
dic2={'A':['4','5','6'], 'B':['10','11','12']}
df2=pd.DataFrame(dic2)
df2.set_index('A', inplace=True)
df=pd.concat([df1,df2])
dic3={'A':['1','2','3'], 'C':['20','21','22']}
df3=pd.DataFrame(dic3)
df3.set_index('A', inplace=True)
df4=pd.concat([df,df3],axis=1)  # C is a new column here, so axis=1 is needed
print(df4)
Output:
B C
1 10 20
2 11 21
3 12 22
4 10 NaN
5 11 NaN
6 12 NaN

Slicing xarray dataset with coordinate dependent variable

I built an xarray dataset in Python 3 with coordinates (time, levels) to identify all cloud bases and cloud tops during one day of observations. The levels coordinate is the dimension along which the cloud bases/tops identified at a given time are stored; it holds the cloud base/top height values for each time.
Now I want to select all the cloud bases and tops that fall within a given range of heights that changes in time. The height range is identified by the arrays bottom_mod and top_mod, which have a time dimension and contain the edges of the range of heights to be selected.
The xarray dataset is cloudStandard_mod_reshaped:
Dimensions: (levels: 8, time: 9600)
Coordinates:
* levels (levels) int64 0 1 2 3 4 5 6 7
* time (time) datetime64[ns] 2013-04-14 ... 2013-04-14T23:59:51
Data variables:
cloudTop (time, levels) float64 nan nan nan nan nan ... nan nan nan nan
cloudThick (time, levels) float64 nan nan nan nan nan ... nan nan nan nan
cloudBase (time, levels) float64 nan nan nan nan nan ... nan nan nan nan
I tried to select the heights in the range identified by top and bottom array as follows:
PBLclouds = cloudStandard_mod_reshaped.sel(levels=slice(bottom_mod[:], top_mod[:]))
but slice only accepts scalar values, not time-dependent arrays.
Do you know how to slice with values that are coordinate-dependent?
You can use the .where() method.
The line providing the solution is under 2.
1. First, create some data like yours:
The dataset:
import numpy as np
import xarray as xr

nlevels, ntime = 8, 50
ds = xr.Dataset(
    coords=dict(levels=np.arange(nlevels), time=np.arange(ntime)),
    data_vars=dict(
        cloudTop=(("levels", "time"), np.random.randn(nlevels, ntime)),
        cloudThick=(("levels", "time"), np.random.randn(nlevels, ntime)),
        cloudBase=(("levels", "time"), np.random.randn(nlevels, ntime)),
    ),
)
output of print(ds):
<xarray.Dataset>
Dimensions: (levels: 8, time: 50)
Coordinates:
* levels (levels) int64 0 1 2 3 4 5 6 7
* time (time) int64 0 1 2 3 4 5 6 7 8 9 ... 41 42 43 44 45 46 47 48 49
Data variables:
cloudTop (levels, time) float64 0.08375 0.04721 0.9379 ... 0.04877 2.339
cloudThick (levels, time) float64 -0.6441 -0.8338 -1.586 ... -1.026 -0.5652
cloudBase (levels, time) float64 -0.05004 -0.1729 0.7154 ... 0.06507 1.601
For the top and bottom levels, I'll make the bottom level random and just add an offset to construct the top level.
offset = 3
bot_mod = xr.DataArray(
    dims=("time"),
    coords=dict(time=np.arange(ntime)),
    data=np.random.randint(0, nlevels - offset, ntime),
    name="bot_mod",
)
top_mod = (bot_mod + offset).rename("top_mod")
output of print(bot_mod):
<xarray.DataArray 'bot_mod' (time: 50)>
array([0, 1, 2, 2, 3, 1, 2, 1, 0, 2, 1, 3, 2, 0, 2, 4, 3, 3, 2, 1, 2, 0,
2, 2, 0, 1, 1, 4, 1, 3, 0, 4, 0, 4, 4, 0, 4, 4, 1, 0, 3, 4, 4, 3,
3, 0, 1, 2, 4, 0])
2. Then, select the range of levels where clouds are:
use .where() method to select the dataset variables that are between the bottom level and the top level:
ds_clouds = ds.where((ds.levels > bot_mod) & (ds.levels < top_mod))
output of print(ds_clouds):
<xarray.Dataset>
Dimensions: (levels: 8, time: 50)
Coordinates:
* levels (levels) int64 0 1 2 3 4 5 6 7
* time (time) int64 0 1 2 3 4 5 6 7 8 9 ... 41 42 43 44 45 46 47 48 49
Data variables:
cloudTop (levels, time) float64 nan nan nan nan nan ... nan nan nan nan
cloudThick (levels, time) float64 nan nan nan nan nan ... nan nan nan nan
cloudBase (levels, time) float64 nan nan nan nan nan ... nan nan nan nan
It puts NaN where the condition is not satisfied; you can use the .dropna() method to get rid of those.
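For example (a sketch, assuming you want to drop the times at which no level satisfied the condition):
ds_clouds_compact = ds_clouds.dropna(dim="time", how="all")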
3. Check for success:
Plot cloudBase variable of the dataset before and after processing:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(ncols=2)
ds.cloudBase.plot.imshow(ax=axes[0])
ds_clouds.cloudBase.plot.imshow(ax=axes[1])
plt.show()
I'm not yet allowed to embed images, so that's a link:
Original data vs. selected data

How to compare PANDAS columns in a DataFrame to find all entries appearing in different columns?

Full disclosure. I'm fairly new to Python and discovered PANDAS today.
I created a Dataframe from two csv files, one which is the results of a robot scanning barcode IDs and one which is a list of instructions for the robot to execute.
import pandas as pd
#import csv file and read the column containing plate IDs scanned by Robot
scancsvdata = pd.read_csv("G:\scan.csv", header=None, sep=';', skiprows=(1),usecols=[6])
#Rename Column to Plates Scanned
scancsvdata.columns = ["IDs Scanned"]
#Remove any Duplicate Plate IDs
scancsvdataunique = scancsvdata.drop_duplicates()
#import the Worklist to be executed CSV file and read the Source Column to find required Plates
worklistdataSrceID = pd.read_csv("G:\TestWorklist.CSV", usecols=["SrceID"])
#Rename SrceID Column to Plates Required
worklistdataSrceID.rename(columns={'SrceID':'IDs required'}, inplace=True)
#remove duplicates from Plates Required
worklistdataSrceIDunique = worklistdataSrceID.drop_duplicates()
#import the Worklist to be executed CSV file and read the Destination Column to find required Plates
worklistdataDestID = pd.read_csv("G:\TestWorklist.CSV", usecols=["DestID"])
#Rename DestID Column to Plates Required
worklistdataDestID.rename(columns={'DestID':'IDs required'}, inplace=True)
#remove duplicates from Plates Required
worklistdataDestIDunique = worklistdataDestID.drop_duplicates()
#Combine into one Dataframe
AllData = pd.concat ([scancsvdataunique, worklistdataSrceIDunique, worklistdataDestIDunique], sort=True)
print (AllData)
The resulting Dataframe lists IDs scanned in Column 1 and IDs required in Column 2.
IDs Scanned IDs required
0 1024800.0 NaN
1 1024838.0 NaN
2 1024839.0 NaN
3 1024841.0 NaN
4 1024844.0 NaN
5 1024798.0 NaN
6 1024858.0 NaN
7 1024812.0 NaN
8 1024797.0 NaN
9 1024843.0 NaN
10 1024840.0 NaN
11 1024842.0 NaN
12 1024755.0 NaN
13 1024809.0 NaN
14 1024810.0 NaN
15 8656.0 NaN
16 8657.0 NaN
17 8658.0 NaN
0 NaN 1024800.0
33 NaN 1024843.0
0 NaN 8656.0
7 NaN 8657.0
15 NaN 8658.0
How would I go about ensuring that all the IDs in the 'IDs required' column appear in the 'IDs Scanned' column?
Ideally the result of the comparison above would be a generic message like 'All IDs found'.
If different csv files were used and the Dataframe was as follows
IDs Scanned IDs required
0 1024800.0 NaN
1 1024838.0 NaN
2 1024839.0 NaN
3 1024841.0 NaN
4 1024844.0 NaN
5 1024798.0 NaN
6 1024858.0 NaN
7 1024812.0 NaN
8 1024797.0 NaN
9 1024843.0 NaN
10 1024840.0 NaN
11 1024842.0 NaN
12 1024755.0 NaN
13 1024809.0 NaN
14 1024810.0 NaN
15 8656.0 NaN
16 8657.0 NaN
17 8658.0 NaN
0 NaN 2024800.0
33 NaN 2024843.0
0 NaN 8656.0
7 NaN 8657.0
15 NaN 8658.0
Then the result of the comparison would be the list of the missing IDs, 2024800 and 2024843.
To check True/False whether every required ID appears in the scanned column (membership has to be tested against the Series values, since "in" on a Series checks the index):
all(item in df["IDs Scanned"].values for item in df["IDs required"].dropna().unique())
To get a list of the unique missing items:
sorted(set(df["IDs required"].dropna()) - set(df["IDs Scanned"]))
Or using pandas syntax to return a DataFrame filtered to rows where IDs required are not found in IDs Scanned:
df.loc[~df["IDs required"].isin(df["IDs Scanned"])]
missing_ids = df.loc[~df['IDs required'].isin(df['IDs Scanned']), 'IDs required']
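Putting the pieces together (a small sketch, assuming the combined AllData frame from the question), you can print the generic message or the list of missing IDs:
missing = AllData.loc[~AllData["IDs required"].isin(AllData["IDs Scanned"]), "IDs required"].dropna()
if missing.empty:
    print("All IDs found")
else:
    print("Missing IDs:", sorted(missing.unique()))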

Pandas append returns DF with NaN values

I'm appending data from a list to pandas df. I keep getting NaN in my entries.
Based on what I've read, I think I might have to specify the data type for each column in my code.
dumps = []
features_df = pd.DataFrame()
for i in range(int(len(ids)/50)):
    dumps = sp.audio_features(ids[i*50:50*(i+1)])
    for i in range(len(dumps)):
        print(list(dumps[0].values()))
        features_df = features_df.append(list(dumps[0].values()), ignore_index=True)
Expected results, something like-
[0.833, 0.539, 11, -7.399, 0, 0.178, 0.163, 2.1e-06, 0.101, 0.385, 99.947, 'audio_features', '6MWtB6iiXyIwun0YzU6DFP', 'spotify:track:6MWtB6iiXyIwun0YzU6DFP', 'https://api.spotify.com/v1/tracks/6MWtB6iiXyIwun0YzU6DFP', 'https://api.spotify.com/v1/audio-analysis/6MWtB6iiXyIwun0YzU6DFP', 149520, 4]
for one row.
Actual-
danceability energy ... duration_ms time_signature
0 NaN NaN ... NaN NaN
1 NaN NaN ... NaN NaN
2 NaN NaN ... NaN NaN
3 NaN NaN ... NaN NaN
4 NaN NaN ... NaN NaN
5 NaN NaN ... NaN NaN
For all rows
The append() strategy in a tight loop isn't a great way to do this. Instead, you can construct an empty DataFrame and then use loc to specify an insertion point; the DataFrame index is used as the row label.
For example:
import pandas as pd
df = pd.DataFrame(data=[], columns=['n'])
for i in range(100):
    df.loc[i] = i
print(df)
time python3 append_df.py
n
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
real 0m13.178s
user 0m12.287s
sys 0m0.617s
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
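Applied to the question's loop, that list-then-construct pattern might look roughly like this (a sketch, assuming sp.audio_features returns a list of feature dicts, as spotipy's client does):
rows = []
for i in range(0, len(ids), 50):
    rows.extend(sp.audio_features(ids[i:i + 50]))  # one batch of dicts per call
features_df = pd.DataFrame(rows)  # dict keys become the column names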
