String to pandas DataFrame

Following the parsing of a large PDF document, I end up with a string in the following format in Python:
Company Name;(Code) at End of Month;Reason for Alteration No. of Shares;Bond Symbol, etc.; Value, etc.; after Alteration;Remarks
Shares;Shares
TANSEISHA CO.,LTD.;(9743)48,424,071;0
MEITEC CORPORATION;(9744)31,300,000;0
TKC Corporation;(9746)26,731,033;0
ASATSU-DK INC.;(9747);42,155,400;Exercise of Subscription Warrants;0;May 2013 Resolution based 1;0Shares
May 2013 Resolution based 2;0Shares
Would it be possible to transform this into a pandas DataFrame where the columns are delimited by the ";"? Looking at the section of the string above, my df should look like:
Company Name (Code) at End of Month Reason for Alteration ....
Value,etc after Alteration Remarks Shares .....
As an additional problem, my rows don't always have the same number of fields delimited by ";", so I would need a way to fix the number of columns (I don't mind setting up a DataFrame with, say, 15 columns and deleting the ones I do not need afterwards).
Thanks

This is a nice opportunity to use StringIO to make your result look like an open file handle so that you can just use pd.read_csv:
In [1]: import pandas as pd
In [2]: from io import StringIO  # Python 3 (on Python 2: from StringIO import StringIO)
In [3]: s = """Company Name;(Code) at End of Month;Reason for Alteration No. of Shares;Bond Symbol, etc.; Value, etc.; after Alteration;Remarks
...: Shares;Shares
...: TANSEISHA CO.,LTD.;(9743)48,424,071;0
...: MEITEC CORPORATION;(9744)31,300,000;0
...: TKC Corporation;(9746)26,731,033;0
...: ASATSU-DK INC.;(9747);42,155,400;Exercise of Subscription Warrants;0;May 2013 Resolution based 1;0Shares
...: May 2013 Resolution based 2;0Shares"""
In [4]: pd.read_csv(StringIO(s), sep=";")
Out [4]: Company Name (Code) at End of Month Reason for Alteration No. of Shares Bond Symbol, etc. Value, etc. after Alteration Remarks
0 Shares Shares NaN NaN NaN NaN NaN
1 TANSEISHA CO.,LTD. (9743)48,424,071 0 NaN NaN NaN NaN
2 MEITEC CORPORATION (9744)31,300,000 0 NaN NaN NaN NaN
3 TKC Corporation (9746)26,731,033 0 NaN NaN NaN NaN
4 ASATSU-DK INC. (9747) 42,155,400 Exercise of Subscription Warrants 0.0 May 2013 Resolution based 1 0Shares
5 May 2013 Resolution based 2 0Shares NaN NaN NaN NaN NaN
Note that it does look like there are some obvious data cleanup problems to tackle from here, but that should at least give you a start.
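For instance, here is a minimal sketch of two such cleanup steps, reusing s from the session above: forcing a fixed column count so ragged rows can't break the parse (15 is the arbitrary width the question suggests), and splitting the fused "(code)shares" fields, assuming that pattern holds across the file:
import pandas as pd
from io import StringIO

# Force a fixed number of columns so rows with varying field counts still parse;
# skiprows=2 skips the two header lines, and unused columns can be dropped later.
df = pd.read_csv(StringIO(s), sep=";", header=None, names=range(15), skiprows=2)

# Split fused fields like "(9743)48,424,071" into a code and a share count.
parts = df[1].str.extract(r"\((?P<code>\d+)\)(?P<shares>[\d,]+)")
parts["shares"] = pd.to_numeric(parts["shares"].str.replace(",", ""))
From there, the remaining per-row quirks (like the two-line ASATSU-DK record) still need handling by hand.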

I would split your read-in string into a list of lists. Possibly use regex to find the beginning of each record (or at least use something that you know where it shows up; it looks like (Code) at End of Month might work) and slice your way through. Something like this:
import re
import pandas as pd
# Start your list of lists off with your expected headers
mystringlist = [["Company Name",
                 "(Code) at End of Month",
                 "Reason for Alteration",
                 "Value,etc",
                 "after Alteration",
                 "Remarks Shares"]]
# This will be used to store the start index of each record
indexlist = []
# A recursive function to find the start location of each record.
# It expects a string of "1"s and "0"s.
def find_start(bitstring, startloc=0):
    foundindex = bitstring.find("1", startloc)
    if foundindex == -1:  # no more records
        return
    indexlist.append(foundindex)
    return find_start(bitstring, foundindex + 1)
# Split on your delimiter (thestring is your parsed PDF text)
mystring = thestring.split(";")
# Use a list comprehension to build your string of "1"s and "0"s
# based on the location of a fixed, regular-expressible record
stringloc = "".join(["1" if re.match(r"\(\d+\)\d+,\d+,\d+", x) else "0"
                     for x in mystring])
find_start(stringloc)
# Make your list of lists based on the found indexes.
# We subtract 1 from the index position because we want the element
# that immediately precedes the element we find (it's an easier regex
# to write when the structure is consistent).
for i, x in enumerate(indexlist):
    if i + 1 != len(indexlist):
        mystringlist.append(mystring[x - 1:indexlist[i + 1] - 1])
    else:
        mystringlist.append(mystring[x - 1:])
# Turn mystringlist into a data frame
mydf = pd.DataFrame(mystringlist)

Related

pandas df is being read as a dict

I'm having some trouble with pandas. I opened a .xlsx file with pandas, but when I try to filter any information, it shows me the error
AttributeError: 'dict' object has no attribute 'head' #(or iloc, or loc, or anything else from DF/pandas)#
So, I did some research and realized that my table turned into a dictionary (why?).
I'm trying to convert this mess into a proper dictionary, so I can convert it into a proper df, because right now it shows some characteristics of both. I need a df, that's it.
Here is the code:
import pandas as pd
df = pd.read_excel('report.xlsx', sheet_name = ["May"])
print(df)
Result: it shows the table plus "[60 rows x 24 columns]"
But when I try to filter or iterate, it shows all the possible dict attribute errors.
Some things I tried: .from_dict, xls.parse/(df.to_dict).
When I try to convert df to dict properly, it shows
ValueError: If using all scalar values, you must pass an index
I tried [this question][1], but it didn't work. For some reason, one of the errors said I should provide 2-D parameters; that's why I tried to create a new dict and do a sort of 'append', but that didn't work either...
Then I tried everything to set an index, but it doesn't let me rename columns, because it says .iloc is not an attribute of dict.
I'm new to Python, but I've never seen pd.read_excel open a DataFrame as a dict. What should I do?
Thanks!
[1]: https://stackoverflow.com/questions/17839973/constructing-pandas-dataframe-from-values-in-variables-gives-valueerror-if-usi
If it's a dict of DataFrames, try:
>>> dict_df = {"a":pd.DataFrame([{1:2,3:4},{1:4,4:6}]), "b":pd.DataFrame([{7:9},{1:4}])}
>>> dict_df
{'a': 1 3 4
0 2 4.0 NaN
1 4 NaN 6.0, 'b': 7 1
0 9.0 NaN
1 NaN 4.0}
>>> pd.concat(dict_df.values(),keys=dict_df.keys(), axis=1)
a b
1 3 4 7 1
0 2 4.0 NaN 9.0 NaN
1 4 NaN 6.0 NaN 4.0

Pandas apply with eval not giving NaN as result when NaN is in the column it's calculating on

I have to support the ability for users to run any formula against a frame to produce a new column.
I may have a frame that looks like
dim01 dim02 msr01
0 A 25 1.0
1 B 26 5.3
2 C 53 NaN
I interpret user code to let them run a formula using supported functions / standard operators / other columns.
So a formula might look like SQRT([msr01]*100+7)
I convert the user input to Python syntax so this would evaluate to something like
formula_str = '(math.sqrt((row.msr01*100)+7))'
I then apply it to my pandas dataframe like this
data_frame['msr002'] = data_frame.apply(lambda row: eval(formula_str), axis=1)
This was working well until I hit data with a NaN in a column used in the calculation. When this happens, I get a frame like this in return:
dim01 dim02 msr01 msr02
0 A 25 1.0 10.344
1 B 26 5.3 23.173
2 C 53 NaN 7.342
So it appears that the eval is not evaluating the NaN correctly.
I am using a lexer/parser to ensure that the user-sent formula isn't dangerous and to convert everyday user syntax into Python functions that work against pandas columns.
Any advice on how to fix this?
Perhaps I should include something in the lambda that checks whether any required column is NaN and just hardcodes the result to NaN in that case? But that doesn't seem like the best solution to me.
I did see this question, which is similar, but I didn't think it answered my exact need.
So you can try:
df.msr01.mul(100).add(7)**0.5
Out[716]:
0 10.34408
1 23.17326
2 NaN
Name: msr01, dtype: float64
Also, with your original code:
df.apply(lambda row: eval(formula_str), axis=1)
Out[714]:
0 10.34408
1 23.17326
2 NaN
dtype: float64
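If the row-wise eval has to stay flexible, one alternative (a sketch, not the asker's actual parser) is to have the translator emit numpy calls instead of math calls: np.sqrt operates element-wise on whole columns and propagates NaN, so apply can be dropped entirely:
import numpy as np
import pandas as pd

df = pd.DataFrame({"dim01": ["A", "B", "C"],
                   "dim02": [25, 26, 53],
                   "msr01": [1.0, 5.3, np.nan]})

# Vectorized translation of SQRT([msr01]*100+7); np.sqrt propagates NaN.
formula_str = "np.sqrt((df.msr01 * 100) + 7)"
df["msr02"] = eval(formula_str)
print(df)  # the NaN row stays NaN in msr02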

Dask apply with custom function

I am experimenting with Dask, but I encountered a problem while using apply after grouping.
I have a Dask DataFrame with a large number of rows. Let's consider, for example, the following:
import numpy as np
import pandas as pd
import dask.dataframe as dd

N = 10000
df = pd.DataFrame({'col_1': np.random.random(N), 'col_2': np.random.random(N)})
ddf = dd.from_pandas(df, npartitions=8)
I want to bin the values of col_1 and I follow the solution from here
bins = np.linspace(0,1,11)
labels = list(range(len(bins)-1))
ddf2 = ddf.map_partitions(test_f, 'col_1',bins,labels)
where
def test_f(df, col, bins, labels):
    return df.assign(bin_num=pd.cut(df[col], bins, labels=labels))
and this works as I expect it to.
Now I want to take the median value in each bin (taken from here)
median = ddf2.groupby('bin_num')['col_1'].apply(pd.Series.median).compute()
Having 10 bins, I expect median to have 10 rows, but it actually has 80. The dataframe has 8 partitions, so I guess the apply is somehow working on each one individually.
However, if I want the mean and use mean,
median = ddf2.groupby('bin_num')['col_1'].mean().compute()
it works and the output has 10 rows.
The question is then: what am I doing wrong that is preventing apply from operating as mean?
Maybe this warning is the key (Dask docs: SeriesGroupBy.apply):
Pandas' groupby-apply can be used to apply arbitrary functions, including aggregations that result in one row per group. Dask's groupby-apply will apply func once to each partition-group pair, so when func is a reduction you'll end up with one row per partition-group pair. To apply a custom aggregation with Dask, use dask.dataframe.groupby.Aggregation.
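For completeness, a hedged sketch of the Aggregation route the docs point to. A true median doesn't decompose into per-partition pieces, so this illustrative example computes a per-group range (max minus min) instead, just to show the chunk/agg/finalize shape (ddf2 is the frame from the question):
import dask.dataframe as dd

# chunk runs on each partition-group pair, agg combines those partial
# results across partitions, and finalize produces the value per group.
range_agg = dd.Aggregation(
    name="range",
    chunk=lambda grouped: (grouped.min(), grouped.max()),
    agg=lambda mins, maxes: (mins.min(), maxes.max()),
    finalize=lambda mins, maxes: maxes - mins,
)
result = ddf2.groupby("bin_num")["col_1"].agg(range_agg).compute()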
You are right! I was able to reproduce your problem on Dask 2.11.0. The good news is that there's a solution! It appears that the Dask groupby problem is specifically with the category type (pandas.core.dtypes.dtypes.CategoricalDtype). By casting the category column to another column type (float, int, str), then the groupby will work correctly.
Here's your code that I copied:
import dask.dataframe as dd
import pandas as pd
import numpy as np
def test_f(df, col, bins, labels):
    return df.assign(bin_num=pd.cut(df[col], bins, labels=labels))
N = 10000
df = pd.DataFrame({'col_1': np.random.random(N), 'col_2': np.random.random(N)})
ddf = dd.from_pandas(df, npartitions=8)
bins = np.linspace(0,1,11)
labels = list(range(len(bins)-1))
ddf2 = ddf.map_partitions(test_f, 'col_1', bins, labels)
print(ddf2.groupby('bin_num')['col_1'].apply(pd.Series.median).compute())
which prints out the problem you mentioned
bin_num
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
5 0.550844
6 0.651036
7 0.751220
8 NaN
9 NaN
Name: col_1, Length: 80, dtype: float64
Here's my solution:
ddf3 = ddf2.copy()
ddf3["bin_num"] = ddf3["bin_num"].astype("int")
print(ddf3.groupby('bin_num')['col_1'].apply(pd.Series.median).compute())
which printed:
bin_num
9 0.951369
2 0.249150
1 0.149563
0 0.049897
3 0.347906
8 0.847819
4 0.449029
5 0.550608
6 0.652778
7 0.749922
Name: col_1, dtype: float64
@MRocklin or @TomAugspurger
Would you be able to create a fix for this in a new release? I think there is sufficient reproducible code here. Thanks for all your hard work. I love Dask and use it every day ;)

Python dataframe with value 'NA' not fetching

I am trying to read an Excel file with the below data:
But when I try to debug the dataframe, it shows only:
Could you explain why the NA is not showing in the dataframe?
Also, is there any way to fetch NA?
Python version : 3.7
In pd.read_excel there's an argument for this called na_values.
Quoted from the documentation:
Additional strings to recognize as NA/NaN.
Furthermore, you have to overwrite the default NaN values, which also include the empty cell '', with the parameter keep_default_na=False.
Again quoting from the documentation:
If na_values are specified and keep_default_na is False the default NaN values are overridden, otherwise they’re appended to.
So the following should solve your problem:
df = pd.read_excel('Filename.xlsx', na_values='NA', keep_default_na=False)
Output
Item Status
0 Soap NaN
1 butter
2 Rice NaN
3 pen Available
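If the goal is instead to keep the literal string 'NA' in the frame so it can be fetched by value, a variant sketch (column names taken from the output above): disable the default NA parsing and don't list 'NA' in na_values:
import pandas as pd

# With keep_default_na=False and no na_values, 'NA' survives as a string
# (empty cells come through as '' rather than NaN).
df = pd.read_excel('Filename.xlsx', keep_default_na=False)
print(df[df['Status'] == 'NA'])  # rows whose Status is literally 'NA'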

Merging sheets of Excel using Python

I am trying to take data from two sheets and compare them with each other; if a name matches, I want to append a column. Let me explain by showing what I am doing and what I am trying to get as output using Python.
This is my sheet1 from excel.xlsx:
It contains four columns: name, class, age and group.
This is my sheet2 from excel.xlsx:
It contains a default and a name column, with extra names in it.
So now I am trying to match the names in sheet2 with sheet1; if a name in sheet1 matches one in sheet2, I want to add the default value corresponding to that name from sheet2.
This is what I need as output:
As you can see, only Ravi and Neha have a default in sheet2, and those names match sheet1 names. Suhas and Aish don't have any default value, so nothing appears there.
This is the code I tried:
import pandas as pd
import xlrd
df1 = pd.read_excel('stack.xlsx', sheet_name='Sheet1')
df2 = pd.read_excel('stack.xlsx', sheet_name='Sheet2')
df1['DEFAULT'] = df1.NAME.map(df2.set_index('NAME')['DEFAULT'].to_dict())
df1.to_excel('play.xlsx',index=False)
and getting output excel like this:
I'm not getting the default against Ravi.
Please help me get this expected output using Python.
Assuming you read each sheet into a dataframe (df = sheet1, df2 = sheet2), it's quite easy, and there are a few options (ranked in order of speed, from fastest to slowest):
# .merge
df = df.merge(df2, how='left', on='Name')
# pd.concat
df = pd.concat([df.set_index('Name'), df2.set_index('Name').Default], axis=1, sort=False, join='inner')
# .join
df = df.set_index('Name').join(df2.set_index('Name'))
# .map
df.Default = df.Name.map(df2.set_index('Name')['Default'].to_dict())
All of them will have the following output:
Name Default Class Age Group
0 NaN NaN 4 2 tig
1 Ravi 2.0 5 5 rose
2 NaN NaN 3 3 lily
3 Suhas NaN 5 5 rose
4 NaN NaN 2 2 sun
5 Neha 3.0 5 5 rose
6 NaN NaN 5 2 sun
7 Aish NaN 5 5 rose
Then you overwrite the original sheet using df.to_excel.
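If you would rather write the result back into the same workbook instead of a new play.xlsx, a sketch (assumes pandas >= 1.3 with openpyxl installed; if_sheet_exists is what allows the overwrite):
import pandas as pd

# Replace Sheet1 inside the existing workbook in place.
with pd.ExcelWriter('stack.xlsx', mode='a', engine='openpyxl',
                    if_sheet_exists='replace') as writer:
    df.to_excel(writer, sheet_name='Sheet1', index=False)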
EDIT
So the code you shared has three problems, one of which seems to be a language barrier... You only need one of the options I gave you. Secondly, there's a missing ' when reading the first sheet into df1. And lastly, you're inconsistent with the df names: you defined df1 and df2 but used just df in the code, which doesn't work.
So the correct code would be as follows:
import pandas as pd
import xlrd
df1 = pd.read_excel('stack.xlsx', sheet_name='Sheet1') #Here the ' was missing
df2 = pd.read_excel('stack.xlsx', sheet_name='Sheet2')
# Now choose one of the options; I used map here, but you can pick any one of them
df1.DEFAULT = df1.NAME.map(df2.set_index('NAME')['DEFAULT'].to_dict())
df1.to_excel('play.xlsx',index=False)
