handling of unstructured data in pandas - python-3.x

I'm trying to read an unstructured CSV file using pandas read_csv(). The problem is that some of the files have rows with extra columns, as shown in the sample input below.
sample input:
col0,col1,col2
a,b,c
a,b,c,d
a,b,c
a,b,c,d
While parsing these kinds of files, the program throws a ParserError:
ParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 4
sample output:
col0 | col1 | col2 | col3
a | b | c | NaN
a | b | c | d
a | b | c | NaN
a | b | c | d
I don't want to just drop the bad lines with the error_bad_lines=False parameter of pandas read_csv().
Any kind of help will be highly appreciated.
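(An aside on that parameter: since pandas 1.3, error_bad_lines is deprecated in favour of on_bad_lines, so in newer versions skipping would look like the sketch below. It still drops the offending rows entirely, which is exactly what the question wants to avoid.)
import pandas as pd

# pandas >= 1.3: on_bad_lines replaces error_bad_lines/warn_bad_lines;
# 'skip' silently drops every row with too many fields (losing column d here)
df = pd.read_csv('file.csv', on_bad_lines='skip')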

One possible solution is to preprocess the file first to find the maximum number of separators, then pass names built from a range:
import pandas as pd

path_csv = 'file.csv'
with open(path_csv) as f:
    lines = f.readlines()

# the widest row determines the number of columns
num = max(l.count(',') for l in lines) + 1
print(num)
4

df = pd.read_csv(path_csv, names=range(num))
print (df)
0 1 2 3
0 col0 col1 col2 NaN
1 a b c NaN
2 a b c d
3 a b c NaN
4 a b c d
Similarly, if the header is not important, it can be dropped with skiprows:
df = pd.read_csv(path_csv, names=range(num), skiprows=1)
print (df)
0 1 2 3
0 a b c NaN
1 a b c d
2 a b c NaN
3 a b c d
Another, more dynamic solution is to extend the existing header with generated values:
path_csv = 'file.csv'
with open(path_csv) as f:
    lines = f.readlines()

# get the header as a list
header = [x.strip() for x in lines[0].split(',')]
# get the max number of separators
max_num = max(l.count(',') for l in lines)
# pad missing header values from a range
if len(header) < max_num + 1:
    header = header + list(range(max_num - len(header) + 1))
print(header)
['col0', 'col1', 'col2', 0]

df = pd.read_csv(path_csv, names=header, skiprows=1)
print (df)
col0 col1 col2 0
0 a b c NaN
1 a b c d
2 a b c NaN
3 a b c d
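One caveat with both snippets above: l.count(',') over-counts when a field contains a quoted comma. A minimal sketch of the same preprocessing with the csv module, which tokenizes quoted fields correctly:
import csv
import pandas as pd

path_csv = 'file.csv'

# csv.reader respects quoting, so len(row) is the true field count per line
with open(path_csv, newline='') as f:
    num = max(len(row) for row in csv.reader(f))

df = pd.read_csv(path_csv, names=range(num), skiprows=1)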

Related

Concat values on dataframe columns excluding NaNs

I have a dataframe with n store columns; here I'm showing just the first two:
ref_id store_0 store_1
0 100 c b
1 300 d NaN
I want to concatenate only the non-NaN values from the store columns into a new column, separated by commas, and then drop those columns. The desired output is:
ref_id stores
0 100 c,b
1 300 d
Right now I've tried df['stores'] = df['store_0'] + ',' + df['store_1'] with this result:
ref_id store_0 store_1 stores
0 100 c b c,b
1 300 d NaN NaN
You can use:
cols = df.filter(like='store_').columns
df2 = (df
       .drop(columns=cols)
       .assign(stores=df[cols].agg(lambda s: s.dropna().str.cat(sep=','),
                                   axis=1))
      )
Or, for in-place modification:
cols = df.filter(like='store_').columns
df['stores'] = df[cols].agg(lambda s: s.dropna().str.cat(sep=','), axis=1)
df.drop(columns=cols, inplace=True)
Output:
ref_id stores
0 100 c,b
1 300 d
You can try
df_ = df.filter(like='store')
df = (df.assign(store=df_.apply(lambda row: row.str.cat(sep=','), axis=1))
        .drop(df_.columns, axis=1))
print(df)
ref_id store
0 100 c,b
1 300 d
Try with
# x == x is False only for NaN, so x[x == x] keeps the non-null values
df['store'] = df.filter(like='store').apply(lambda x: ','.join(x[x == x]), axis=1)
df
Out[60]:
ref_id store_0 store_1 store
0 100 c b c,b
1 300 d NaN d
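One more variant along the same lines, relying on the fact that stack() drops NaN by default, so grouping by the original row index joins only the values that are present (a sketch, assuming the same store_ naming):
cols = df.filter(like='store_').columns

# stack() drops NaN, leaving one entry per (row, store) value;
# grouping by the row index (level 0) then joins whatever remains
df['stores'] = df[cols].stack().groupby(level=0).agg(','.join)
df = df.drop(columns=cols)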

How to find the intersection of a pair of columns in pandas dataframe with pairs in any order?

I have the dataframe below:
col1 col2
a b
b a
c d
d c
e d
The desired output should be the unique pairs from the two columns:
col1 col2
a b
c d
e d
Convert each row's values to a frozenset (which is order-insensitive), then filter with DataFrame.duplicated using boolean indexing:
df = df[~df[['col1','col2']].apply(frozenset, axis=1).duplicated()]
print (df)
col1 col2
0 a b
2 c d
4 e d
Or you can sort the values with np.sort and remove duplicates with DataFrame.drop_duplicates:
import numpy as np

df = pd.DataFrame(np.sort(df[['col1','col2']], axis=1), columns=['col1','col2']).drop_duplicates()
print (df)
col1 col2
0 a b
2 c d
4 d e
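Note that this second approach stores the sorted pair (d e instead of the original e d). If the original values and order must be preserved, the sorted copy can serve only as the duplicate mask; a sketch:
import numpy as np
import pandas as pd

# build the mask on sorted copies, but filter the untouched frame
mask = pd.DataFrame(np.sort(df[['col1','col2']], axis=1)).duplicated()
df = df[~mask.to_numpy()]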

How to explode/split a nested list, inside a list inside a pandas dataframe column and make separate columns out of them?

I have a dataframe and want to split the Options column into id, AUD, and ud columns.
id col1 col2 Options
1 A B [{'id':25,'X': {'AUD': None, 'ud':0}}]
2 C D [{'id':27,'X': {'AUD': None, 'ud':0}}]
3 E F [{'id':28,'X': {'AUD': None, 'ud':0}}]
4 G H [{'id':29,'X': {'AUD': None, 'ud':0}}]
Expected output dataframe:
id col1 col2 id Aud ud
1 A B 25 None 0
2 C D 27 None 0
3 E F 28 None 0
4 G H 29 None 0
How do you go about this using Python 3.6 and a pandas dataframe?
Use a list comprehension with json_normalize to get DataFrames, join them together with concat, and apply DataFrame.add_prefix to avoid duplicated column names:
import pandas as pd
from pandas.io.json import json_normalize
import ast

L = [json_normalize(x) for x in df.pop('Options')]
# if the values are strings instead of dicts:
# L = [json_normalize(ast.literal_eval(x)) for x in df.pop('Options')]
df = df.join(pd.concat(L, ignore_index=True, sort=False).add_prefix('opt_'))
print (df)
id col1 col2 opt_id opt_X.AUD opt_X.ud
0 1 A B 25 None 0
1 2 C D 27 None 0
2 3 E F 28 None 0
3 4 G H 29 None 0
Another solution extracts the X values of the nested dictionaries:
L = [{k: v for y in ast.literal_eval(x) for k, v in {**y.pop('X'), **y}.items()}
     for x in df.pop('Options')]
df = df.join(pd.DataFrame(L, index=df.index).add_prefix('opt_'))
print (df)
id col1 col2 opt_AUD opt_ud opt_id
0 1 A B None 0 25
1 2 C D None 0 27
2 3 E F None 0 28
3 4 G H None 0 29
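Since pandas 1.0, json_normalize is also exposed as pd.json_normalize, so the per-row list comprehension can collapse into a single call. A sketch, assuming each Options list holds exactly one dict and the frame keeps its default RangeIndex:
import pandas as pd

# .str[0] pulls the single dict out of each one-element list;
# json_normalize then flattens the nested 'X' mapping into dotted columns
opts = pd.json_normalize(df.pop('Options').str[0].tolist()).add_prefix('opt_')
df = df.join(opts)  # both frames share the default RangeIndex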
Try this:
# note: Series.iteritems() was removed in pandas 2.0; items() is the equivalent
for idx, opts in df['Options'].items():
    df.loc[idx, 'id'] = opts[0]['id']
    df.loc[idx, 'Aud'] = opts[0]['X']['AUD']
    df.loc[idx, 'ud'] = opts[0]['X']['ud']

How do I turn a pandas frame's values across multiple columns into its columns?

I have the following dataframe loaded up in Pandas.
print(pandaDf)
id col1 col2 col3
12a a b d
22b d a b
33c c a b
I am trying to convert the values spread across multiple columns into columns of their own, so the output would look like this:
Desired output:
id a b c d
12a 1 1 0 1
22b 1 1 0 0
33c 1 1 1 0
I have tried adding a value column where value = 1 and using a pivot table:
pandaDf['value'] = 1
column = ['col1', 'col2', 'col3']
pandaDf.pivot_table(index='id', values='value', columns=column)
However, the resulting data frame has a multilevel column index, and the pandaDf.pivot() method does not allow multiple columns.
Please advise about how I could do this with an output of a single level index.
Thanks for taking the time to read this and I apologize if I have made any formatting errors in posting the question. I am still learning the proper stackoverflow syntax.
You can use one-hot encoding to solve this problem. Here is one way, using pd.get_dummies, then flattening the MultiIndex columns and summing:
df1 = df.set_index('id')
df_out = pd.get_dummies(df1)
df_out.columns = df_out.columns.str.split('_', expand=True)
df_out = df_out.sum(level=1, axis=1).reset_index()  # pandas >= 2.0: use df_out.T.groupby(level=1).sum().T
print(df_out)
Output:
id a c d b
0 12a 1 0 1 1
1 22b 1 0 1 1
2 33c 1 1 0 1
Using get_dummies
pd.get_dummies(df.set_index('id'),prefix='', prefix_sep='').sum(level=0,axis=1)
Out[81]:
a c d b
id
12a 1 0 1 1
22b 1 0 1 1
33c 1 1 0 1
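Yet another route, sketched under the same assumptions: melt the col* columns into long form and cross-tabulate, which yields a flat column index directly:
m = df.melt(id_vars='id')

# crosstab counts (id, value) occurrences, which are 0/1 here
# because each letter appears at most once per id
out = pd.crosstab(m['id'], m['value']).reset_index()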

pandas: how to convert a two-dimensional dataframe to a one-dimensional dataframe

Suppose I have a dataframe with multiple columns:
a b c
1
2
3
How do I convert it to a single-column dataframe like this?
1 a
2 a
3 a
1 b
2 b
3 b
1 c
2 c
3 c
Please note that the input is a DataFrame, not a Panel.
Use melt:
df = df.reset_index().melt('index', var_name='col').set_index('index')[['col']]
print (df)
col
index
1 a
2 a
3 a
1 b
2 b
3 b
1 c
2 c
3 c
Or use numpy.repeat and numpy.tile with the DataFrame constructor:
import numpy as np

a = np.repeat(df.columns, len(df))
b = np.tile(df.index, len(df.columns))
df = pd.DataFrame(a, index=b, columns=['col'])
print (df)
col
1 a
2 a
3 a
1 b
2 b
3 b
1 c
2 c
3 c
Another way is:
import itertools

pd.DataFrame(list(itertools.product(df.index, df.columns.values))).set_index([0])
Output:
1
0
1 a
1 b
1 c
2 a
2 b
2 c
3 a
3 b
3 c
For the exact desired output, use sort_values:
print(pd.DataFrame(list(itertools.product(df.index, df.columns.values))).set_index([0]).sort_values(by=[1]))
1
0
1 a
2 a
3 a
1 b
2 b
3 b
1 c
2 c
3 c
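Since the desired frame is just the cartesian product of the columns and the index, it can also be built directly; a sketch:
import pandas as pd

# column-major product: 'a' paired with 1..3, then 'b', then 'c'
idx = pd.MultiIndex.from_product([df.columns, df.index])
out = pd.DataFrame({'col': idx.get_level_values(0)},
                   index=idx.get_level_values(1))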