Easy way to convert a list of strings to a numpy array - python-3.x

I am working with data from the World Ocean Database (WOD), and somehow I ended up with a list that looks like this:
idata =
[' 1, 0.0,0, , 6.2386,0, , 33.2166,0, ,\n',
' 2, 5.0,0, , 6.2385,0, , 33.2166,0, ,\n',
' 3, 10.0,0, , 6.2306,0, , 33.2175,0, ,\n',
' 4, 15.0,0, , 6.2359,0, , 33.2176,0, ,\n',
' 5, 20.0,0, , 6.2387,0, , 33.2175,0, ,\n']
Is there any easy way to convert this structure into a numpy array or some friendlier format? I just want to get the column data into a pandas DataFrame.

You could use a combination of string manipulation (i.e. strip() and split()) and list comprehensions:
import numpy as np
idata = [
' 1, 0.0,0, , 6.2386,0, , 33.2166,0, ,\n',
' 2, 5.0,0, , 6.2385,0, , 33.2166,0, ,\n',
' 3, 10.0,0, , 6.2306,0, , 33.2175,0, ,\n',
' 4, 15.0,0, , 6.2359,0, , 33.2176,0, ,\n',
' 5, 20.0,0, , 6.2387,0, , 33.2175,0, ,\n']
ll = [[float(x.strip()) for x in s.split(',') if x.strip()] for s in idata]
print(np.array(ll))
# [[ 1. 0. 0. 6.2386 0. 33.2166 0. ]
# [ 2. 5. 0. 6.2385 0. 33.2166 0. ]
# [ 3. 10. 0. 6.2306 0. 33.2175 0. ]
# [ 4. 15. 0. 6.2359 0. 33.2176 0. ]
# [ 5. 20. 0. 6.2387 0. 33.2175 0. ]]
which can also be fed to the pandas DataFrame constructor:
import pandas as pd
df = pd.DataFrame(ll)
print(df)
# 0 1 2 3 4 5 6
# 0 1.0 0.0 0.0 6.2386 0.0 33.2166 0.0
# 1 2.0 5.0 0.0 6.2385 0.0 33.2166 0.0
# 2 3.0 10.0 0.0 6.2306 0.0 33.2175 0.0
# 3 4.0 15.0 0.0 6.2359 0.0 33.2176 0.0
# 4 5.0 20.0 0.0 6.2387 0.0 33.2175 0.0
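If NaN placeholders for the blank fields are acceptable, here is a small sketch of an alternative: np.genfromtxt can parse the list of lines directly, and the all-NaN columns coming from the blank fields can be sliced away afterwards:
import numpy as np

arr = np.genfromtxt(idata, delimiter=',')   # idata as defined in the question
print(arr.shape)                            # (5, 11); blank fields show up as nan
arr = arr[:, ~np.isnan(arr).all(axis=0)]    # keep only the columns that hold real data
print(arr)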

You might split the values on commas, strip the parts, and load the resulting list into a DataFrame as follows:
import pandas as pd
data = [[item.strip() for item in line.split(',')] for line in idata]
df = pd.DataFrame(data)
To safely convert the DataFrame to numeric values, pd.to_numeric can be applied; errors='coerce' turns the blank fields into NaN instead of raising:
df = df.apply(pd.to_numeric, errors='coerce')

try:
    from io import StringIO  # Python 3
except ImportError:
    from StringIO import StringIO  # Python 2
import pandas as pd
df = pd.read_csv(StringIO(''.join(idata)), index_col=0, header=None, sep=r',\s*', engine='python')
print(df)
# prints:
# 1 2 3 4 5 6 7 8 9 10
# 0
# 1 0.0 0 NaN 6.2386 0 NaN 33.2166 0 NaN NaN
# 2 5.0 0 NaN 6.2385 0 NaN 33.2166 0 NaN NaN
# 3 10.0 0 NaN 6.2306 0 NaN 33.2175 0 NaN NaN
# 4 15.0 0 NaN 6.2359 0 NaN 33.2176 0 NaN NaN
# 5 20.0 0 NaN 6.2387 0 NaN 33.2175 0 NaN NaN
Remove the header=None if you can include an initial row of idata that actually specifies helpful column labels. Remove sep=r',\s*', engine='python' if you're happy for the blank columns to contain blank string objects instead of NaN.
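For instance, here is a sketch with made-up column labels (depth, temperature, salinity and the flag names are assumptions, not WOD's actual headers), dropping the all-blank columns afterwards:
from io import StringIO
import pandas as pd

# Hypothetical labels for the 11 fields per line; adjust to the real WOD layout.
names = ['n', 'depth', 'depth_flag', 'blank1',
         'temperature', 'temp_flag', 'blank2',
         'salinity', 'sal_flag', 'blank3', 'blank4']

df = pd.read_csv(StringIO(''.join(idata)), header=None, names=names,
                 index_col=0, sep=r',\s*', engine='python')
df = df.dropna(axis=1, how='all')   # drop the columns that were blank in every row
print(df)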

Related

Copying columns that have NaN values in them and adding a prefix

I have a number of columns that contain NaN values.
With the following code I can check that:
for index, value in df.iteritems():
    if value.isnull().values.any() == True:
This shows me, as Boolean values, which columns have NaN.
If True, I need to create a new column that has the prefix 'Interpolation' plus the name of that column in its name.
So to make it clear: if the column named 'XXX' has NaN, I need to create a new column named 'Interpolation XXX'.
Any ideas how to do this?
Something like this:
In [80]: df = pd.DataFrame({'XXX':[1,2,np.nan,4], 'YYY':[1,2,3,4], 'ZZZ':[1,np.nan, np.nan, 4]})
In [81]: df
Out[81]:
XXX YYY ZZZ
0 1.0 1 1.0
1 2.0 2 NaN
2 NaN 3 NaN
3 4.0 4 4.0
In [92]: nan_cols = df.columns[df.isna().any()].tolist()
In [94]: for col in df.columns:
    ...:     if col in nan_cols:
    ...:         df['Interpolation ' + col] = df[col]
    ...:
In [95]: df
Out[95]:
XXX YYY ZZZ Interpolation XXX Interpolation ZZZ
0 1.0 1 1.0 1.0 1.0
1 2.0 2 NaN 2.0 NaN
2 NaN 3 NaN NaN NaN
3 4.0 4 4.0 4.0 4.0
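A more compact alternative (a sketch, not part of the original answer) copies all the NaN-containing columns in one go with add_prefix and join:
import numpy as np
import pandas as pd

df = pd.DataFrame({'XXX': [1, 2, np.nan, 4], 'YYY': [1, 2, 3, 4], 'ZZZ': [1, np.nan, np.nan, 4]})

# Select the columns that contain NaN, prefix their names, and join the copies back on.
nan_cols = df.columns[df.isna().any()]
df = df.join(df[nan_cols].add_prefix('Interpolation '))
print(df)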

How to use an integer list to find rows in pd.DataFrame with non-integer indices

How can I make this work?
import pandas as pd
L = [1,3,5]
df = pd.DataFrame([1,2,3,4,5,6,7], index=[0.1,0.2,0.3,0.4,0.5,0.6,0.7])
print(df[0])
print(df[0].loc(L))
I would like to have this output format:
0.2 2
0.4 4
0.6 6
I think you want .iloc, which selects rows by integer position:
df.iloc[L]
Out[477]:
0
0.2 2
0.4 4
0.6 6
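For contrast, here is a small sketch of the difference between the two indexers: iloc selects by integer position, while loc needs the actual float index labels:
import pandas as pd

L = [1, 3, 5]
df = pd.DataFrame([1, 2, 3, 4, 5, 6, 7], index=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7])

print(df.iloc[L])           # rows at positions 1, 3, 5 -> index labels 0.2, 0.4, 0.6
print(df.loc[df.index[L]])  # the same rows, selected via their labels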

Split pandas columns into two with column MultiIndex

I need to split DataFrame columns into two and add an additional value to the new column. The twist is that I need to lift the original column names up one level and add two new column names.
Given a DataFrame h:
>>> import pandas as pd
>>> h = pd.DataFrame({'a': [0.6, 0.4, 0.1], 'b': [0.2, 0.4, 0.7]})
>>> h
a b
0 0.6 0.2
1 0.4 0.4
2 0.1 0.7
I need to lift the original column names up one level and add two new column names. The result should look like this:
>>> # some stuff...
a b
expected received expected received
0 0.6 1 0.2 1
1 0.4 1 0.4 1
2 0.1 1 0.7 1
I've tried this:
>>> h['a1'] = [1, 1, 1]
>>> h['b1'] = [1, 1, 1]
>>> t = [('f', 'expected'),('f', 'received'), ('g', 'expected'), ('g', 'received')]
>>> h.columns = pd.MultiIndex.from_tuples(t)
>>> h
f g
expected received expected received
0 0.6 0.2 1 1
1 0.4 0.4 1 1
2 0.1 0.7 1 1
This just renames the columns but does not align them properly. I think the issue is that there's no link between a1 and b1 and the expected and received columns.
How do I lift the original column names up one level and add two new column names?
I am using concat with keys, then swaplevel and sort_index:
h1=h.copy()
h1[:]=1
(pd.concat([h, h1], keys=['expected', 'received'], axis=1)
   .swaplevel(0, 1, axis=1)
   .sort_index(level=0, axis=1))
Out[233]:
a b
expected received expected received
0 0.6 1.0 0.2 1.0
1 0.4 1.0 0.4 1.0
2 0.1 1.0 0.7 1.0
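An alternative sketch (not from the original answer): building the MultiIndex columns directly from tuple keys avoids the swaplevel step:
import pandas as pd

h = pd.DataFrame({'a': [0.6, 0.4, 0.1], 'b': [0.2, 0.4, 0.7]})

# Tuple keys become a two-level column MultiIndex; 'received' is a constant column of ones.
out = pd.DataFrame({
    (col, sub): (h[col] if sub == 'expected' else pd.Series(1, index=h.index))
    for col in h.columns
    for sub in ('expected', 'received')
})
print(out)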

Pandas append returns DF with NaN values

I'm appending data from a list to a pandas DataFrame, and I keep getting NaN in my entries.
Based on what I've read, I think I might have to specify the data type for each column in my code.
dumps = []
features_df = pd.DataFrame()
for i in range(int(len(ids)/50)):
    dumps = sp.audio_features(ids[i*50:50*(i+1)])
    for i in range(len(dumps)):
        print(list(dumps[0].values()))
        features_df = features_df.append(list(dumps[0].values()), ignore_index=True)
Expected results, something like-
[0.833, 0.539, 11, -7.399, 0, 0.178, 0.163, 2.1e-06, 0.101, 0.385, 99.947, 'audio_features', '6MWtB6iiXyIwun0YzU6DFP', 'spotify:track:6MWtB6iiXyIwun0YzU6DFP', 'https://api.spotify.com/v1/tracks/6MWtB6iiXyIwun0YzU6DFP', 'https://api.spotify.com/v1/audio-analysis/6MWtB6iiXyIwun0YzU6DFP', 149520, 4]
for one row.
Actual-
danceability energy ... duration_ms time_signature
0 NaN NaN ... NaN NaN
1 NaN NaN ... NaN NaN
2 NaN NaN ... NaN NaN
3 NaN NaN ... NaN NaN
4 NaN NaN ... NaN NaN
5 NaN NaN ... NaN NaN
For all rows
Calling append() in a tight loop isn't a great way to do this; the NaNs most likely appear because the appended list carries no column labels, so none of its values line up with the existing columns. Instead, you can construct an empty DataFrame and then use loc to write each new row at the next index position.
For example:
import pandas as pd
df = pd.DataFrame(data=[], columns=['n'])
for i in range(100):
    df.loc[i] = i
print(df)
time python3 append_df.py
n
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
real 0m13.178s
user 0m12.287s
sys 0m0.617s
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
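Following that advice, here is a sketch of the list-then-construct approach for the data in the question (this assumes sp.audio_features returns a list of dicts keyed by feature name, as the printed row suggests):
import pandas as pd

rows = []
for i in range(int(len(ids) / 50)):            # ids and sp come from the question
    rows.extend(sp.audio_features(ids[i * 50:50 * (i + 1)]))

# One construction at the end; the column names come from the dict keys,
# so the values line up and nothing ends up as NaN.
features_df = pd.DataFrame(rows)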

Replace values in pandas column based on nan in another column

For pairs of columns, I want to replace the values in the second column with NaN if the value in the first is NaN.
I have tried the following without success:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['r', np.nan, np.nan, 's'], 'b': [0.5, 0.5, 0.2, 0.02],
                   'c': ['n', 'r', np.nan, 's'], 'd': [1, 0.5, 0.2, 0.05]})

listA = ['a', 'c']
listB = ['b', 'd']
for color, ratio in zip(listA, listB):
    df.loc[df[color].isnull(), ratio] == np.nan
df remains unchanged.
Another test using a function (also failed):
def Test(df):
    if df[color] == np.nan:
        return df[ratio] == np.nan
    else:
        return

for color, ratio in zip(listA, listB):
    df[ratio] = df.apply(Test, axis=1)
Thanks
It seems you have a typo; change == to =:
for color, ratio in zip(listA, listB):
    df.loc[df[color].isnull(), ratio] = np.nan
print (df)
a b c d
0 r 0.50 n 1.00
1 NaN NaN r 0.50
2 NaN NaN NaN NaN
3 s 0.02 s 0.05
Another solution uses mask, which by default replaces the values where the condition is True with NaN:
for color, ratio in zip(listA, listB):
    df[ratio] = df[ratio].mask(df[color].isnull())
print (df)
a b c d
0 r 0.50 n 1.00
1 NaN NaN r 0.50
2 NaN NaN NaN NaN
3 s 0.02 s 0.05
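If there are many column pairs, a vectorized sketch (an addition, not part of the original answer) masks all the "ratio" columns at once with the null pattern of the "color" columns:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['r', np.nan, np.nan, 's'], 'b': [0.5, 0.5, 0.2, 0.02],
                   'c': ['n', 'r', np.nan, 's'], 'd': [1, 0.5, 0.2, 0.05]})
listA = ['a', 'c']
listB = ['b', 'd']

# Boolean array of where the 'color' columns are null, applied positionally to the 'ratio' columns.
df[listB] = df[listB].mask(df[listA].isnull().to_numpy())
print(df)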
