I am working on a project to convert a PDF file into a table using tabula-py. While scanning, tabula detects the table, but it reads what should be separate columns as a single column, while the actual table in the PDF looks like picture_2.
Is there any method in Python to split that single column into separate columns, like in the second picture?
You need to use str.split with expand=True. Example:
>>> import pandas as pd
>>> df = pd.DataFrame([["Purchase Balance"],["138 303"]])
>>> df
                  0
0  Purchase Balance
1           138 303
>>> df[0].str.split(" ", expand=True)
          0        1
0  Purchase  Balance
1       138      303
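If you want the split pieces under named columns instead of 0 and 1, you can assign the expanded result back; a small sketch (the names "left" and "right" are my own, not from the question):

```python
import pandas as pd

df = pd.DataFrame([["Purchase Balance"], ["138 303"]])

# str.split(..., expand=True) returns a DataFrame, so it can be
# assigned straight into two new, named columns.
df[["left", "right"]] = df[0].str.split(" ", expand=True)
print(df[["left", "right"]])
```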
I want an Excel sheet column with odd values, each duplicated: 1, 1, 3, 3, 5, 5, 7, 7, 9, 9, 11, 11, ... and so on up to 20001.
Could you let me know how to do it, either in an Excel sheet or in a pandas DataFrame?
I tried duplicating in Excel, but it does not work.
I tried in pandas, but even numbers come up too:
df['ABC'] = 2 + df.index//1
You can use numpy.arange with numpy.repeat:
import pandas as pd
import numpy as np
df = pd.Series(np.repeat(np.arange(1,20002,2),2)).to_frame("ABC")
Then (if needed) use pandas.DataFrame.to_excel to make a spreadsheet:
df.to_excel("out.xlsx", index=False)
# Output:
print(df)
         ABC
0          1
1          1
2          3
3          3
4          5
...      ...
19997  19997
19998  19999
19999  19999
20000  20001
20001  20001

[20002 rows x 1 columns]
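If you'd rather stay in plain pandas without numpy, the same 1, 1, 3, 3, ... pattern can be derived from the row index; a sketch that should be equivalent:

```python
import pandas as pd

# 20002 rows: rows 0 and 1 map to 1, rows 2 and 3 map to 3, and so on,
# ending with 20001 repeated twice.
df = pd.DataFrame(index=range(20002))
df["ABC"] = (df.index // 2) * 2 + 1
```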
Using ms365, try something like:
=TOCOL(LET(x,SEQUENCE(10001,,,2),IFERROR(EXPAND(x,,2),x)))
My Excel spreadsheet is of the form below:

A                   B
Part 1- 20210910    55
Part 2- 20210829    45
Part 3- 20210912     2
I would like to take the strings from column A, such as "Part 1- 20210910", and have pandas read the date part as "2021/09/10", a date format. How could I implement this?
IIUC:
df['A'] = pd.to_datetime(df['A'].str.extract(r'(\d{8})', expand=False), format='%Y%m%d')
print(df)
# Output:
           A   B
0 2021-09-10  55
1 2021-08-29  45
2 2021-09-12   2
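If some cells in column A might not contain an eight-digit date, passing errors="coerce" turns those rows into NaT instead of raising; a sketch (the "no date here" row is my own addition):

```python
import pandas as pd

df = pd.DataFrame({"A": ["Part 1- 20210910", "Part 2- 20210829", "no date here"],
                   "B": [55, 45, 2]})

# Extract the eight digits if present; rows without a match become NaN,
# which errors="coerce" then maps to NaT.
df["A"] = pd.to_datetime(df["A"].str.extract(r"(\d{8})", expand=False),
                         format="%Y%m%d", errors="coerce")
```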
My beginner way of doing it:
import pandas as pd
df = pd.read_excel('file_name.xlsx')
df['A'] = df['A'].apply(lambda x: pd.to_datetime(x.split('-')[1].strip(), format='%Y%m%d'))
I have 2 Excel sheets which have lists of serial numbers along with dates of purchase. Sheet-1 is the master sheet; Sheet-2 can be called a subset of that master sheet.
Not all Serial numbers in Sheet-1 are updated with Date of purchase. In Sheet-2 there are those Serial numbers which have their Date of purchase values missing in Sheet-1. Sheet-2 is completely updated with its Serial number and Date of purchase values.
I am trying to read all Serial numbers from Sheet-1, search those in Sheet-2, find the corresponding Date of purchase and update this value (wherever missing) in Sheet-1.
Following is the Layout of both sheets:
(Please note that the column names are a bit different in both sheets)
Sheet-1
Serial#   Date of purchase
111       01-Jun-2018
222       13-Jan-2018
333       (Blank)
444       (Blank)
555       11-Dec-2017
Sheet-2
Serial Number   purchase date
333             03-Feb-2019
444             19-Feb-2019
I am new to Pandas and this is my first time trying to write a Python script using Pandas to achieve this. Here is the code that I've managed to write, but it's not working.
import xlrd
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
df1 = pd.read_excel('Excel-1.xlsx', sheet_name='Sheet1')
df2 = pd.read_excel('Excel-2.xlsx', sheet_name='Sheet1')
df1['Date of purchase'] = df1['Serial#'].map(df2.set_index('Serial Number')['purchase date'])
ERROR
pandas.core.indexes.base.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
EDIT:
Both sheets have hundreds of entries, the layout I mentioned is just a sample.
Both sheets have other columns also apart from the 2 listed in sample layout, the 2 mentioned are of interest to us.
Assuming your input data are:
In [1]: import pandas as pd
In [2]: sheet1=pd.DataFrame([[111,'01-Jun-2018'],[222,'13-Jan-2018'],[333],[444],[555,'11-Dec-2017']], columns=['Serial#','Date of purchase'])
In [3]: sheet1
Out[3]:
   Serial#  Date of purchase
0      111       01-Jun-2018
1      222       13-Jan-2018
2      333              None
3      444              None
4      555       11-Dec-2017
In [4]: sheet2=pd.DataFrame([[333,'03-Feb-2019'],[444,'19-Feb-2019']],columns=sheet1.columns)
In [5]: sheet2
Out[5]:
   Serial#  Date of purchase
0      333       03-Feb-2019
1      444       19-Feb-2019
You can proceed by indexing your dataframes and using the fillna method:
In [6]: sheet1 = sheet1.set_index('Serial#')
In [7]: sheet1['Date of purchase'].fillna(sheet2.set_index('Serial#')['Date of purchase'], inplace=True)
In [8]: sheet1
Out[8]:
        Date of purchase
Serial#
111          01-Jun-2018
222          13-Jan-2018
333          03-Feb-2019
444          19-Feb-2019
555          11-Dec-2017
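For what it's worth, the InvalidIndexError in the question usually means Sheet-2 contains duplicate serial numbers; the original .map() approach works once the lookup index is made unique. A sketch with the question's column names (the duplicate 444 row is my own addition, to reproduce the error condition):

```python
import pandas as pd

sheet1 = pd.DataFrame({"Serial#": [111, 222, 333, 444, 555],
                       "Date of purchase": ["01-Jun-2018", "13-Jan-2018",
                                            None, None, "11-Dec-2017"]})
# A duplicated serial number (444 here) is what triggers the
# InvalidIndexError when .map() builds its lookup index.
sheet2 = pd.DataFrame({"Serial Number": [333, 444, 444],
                       "purchase date": ["03-Feb-2019", "19-Feb-2019",
                                         "19-Feb-2019"]})

# Drop duplicates first so the lookup index is unique, then map and fill.
lookup = (sheet2.drop_duplicates("Serial Number")
                .set_index("Serial Number")["purchase date"])
sheet1["Date of purchase"] = sheet1["Date of purchase"].fillna(
    sheet1["Serial#"].map(lookup))
```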
I'm trying to modify multiple column values in pandas.Dataframes with different increments in each column so that the values in each column do not overlap with each other when graphed on a line graph.
Here's the end goal of what I want to do: link
Let's say I have this kind of Dataframe:
Col1  Col2  Col3
0     0.3   0.2
1     1.1   1.2
2     2.2   2.4
3     3.0   3.1
but with hundreds of columns and thousands of values.
When graphing this on a line-graph on excel or matplotlib, the values overlap with each other, so I would like to separate each column by adding the same values for each column like so:
Col1(+0)  Col2(+10)  Col3(+20)
0         10.3       20.2
1         11.1       21.2
2         12.2       22.4
3         13.0       23.1
By adding the same value to one column and increasing by an increment of 10 over each column, I am able to see each line without it overlapping in one graph.
I thought of using loops to automate this value-adding process, but I couldn't find any previous solutions on Stack Overflow that address how to change the increment value between columns (e.g. add 0 to Col1 in one loop iteration, then 10 to Col2 in the next) rather than within the values of a column. To make things worse, I'm a beginner with no clue about programming or data manipulation.
Since the data is in a CSV format, I first used Pandas to read it and store in a Dataframe, and selected the columns that I wanted to edit:
import pandas as pd
#import CSV file
df = pd.read_csv ('data.csv')
#store csv data into dataframe
df1 = pd.DataFrame (data = df)
# Locate columns that I want to edit with df.loc
columns = df1.loc[:, ' C000':]
here is where I'm stuck:
# use iteration with increments to add numbers
n = 0
for values in columns:
    values = n + 0
    print(values)
But this for-loop only adds one increment value (in this case 0), and adds it to all columns, not just the first column. Not only that, but I don't know how to add the next increment value for the next column.
Any possible solutions would be greatly appreciated.
IIUC, just use df.add() over axis=1 with a list made from the length of df.columns:
df1 = df.add(list(range(0,len(df.columns)*10))[::10],axis=1)
Or as #jezrael suggested, better:
df1=df.add(range(0,len(df.columns)*10, 10),axis=1)
print(df1)
   Col1  Col2  Col3
0     0  10.3  20.2
1     1  11.1  21.2
2     2  12.2  22.4
3     3  13.0  23.1
Details:
list(range(0,len(df.columns)*10))[::10]
#[0, 10, 20]
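If you prefer an explicit loop, closer to what the question attempted, enumerate the columns and grow the offset by 10 each iteration (slower than df.add on wide frames, but easier to follow):

```python
import pandas as pd

df = pd.DataFrame({"Col1": [0, 1, 2, 3],
                   "Col2": [0.3, 1.1, 2.2, 3.0],
                   "Col3": [0.2, 1.2, 2.4, 3.1]})

# enumerate yields 0, 1, 2, ... so the per-column offset is 0, 10, 20, ...
for i, col in enumerate(df.columns):
    df[col] = df[col] + i * 10
```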
I would recommend that you avoid looping over the data frame, as it is inefficient; instead, think of it as adding two matrices.
e.g.
import numpy as np
import pandas as pd
# Create your example df
df = pd.DataFrame(data=np.random.randn(10,3))
# Create a Matrix of ones
x = np.ones(df.shape)
# Multiply each column with an incremented value * 10
x = x * 10*np.arange(1,df.shape[1]+1)
# Add the matrix to the data
df + x
Edit: in case you want to increment with 0, 10, 20 rather than 10, 20, 30, use this instead:
import numpy as np
import pandas as pd
# Create your example df
df = pd.DataFrame(data=np.random.randn(10,3))
# Create a Matrix of ones
x = np.ones(df.shape)
# THIS LINE CHANGED
# Omit the start value so there is only an end value -> the default start is 0
# Adjust the length of the vector
x = x * 10*np.arange(df.shape[1])
# Add the matrix to the data
df + x
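On the question's own sample data, the second variant reproduces the desired offsets; a quick check (the sample values come from the question's table):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Col1": [0, 1, 2, 3],
                   "Col2": [0.3, 1.1, 2.2, 3.0],
                   "Col3": [0.2, 1.2, 2.4, 3.1]})

# Offsets 0, 10, 20 are broadcast across all rows of the matching columns.
x = np.ones(df.shape) * 10 * np.arange(df.shape[1])
out = df + x
```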
I'm reading in a pandas DataFrame using pd.read_csv. I want to keep the first row as data, however it keeps getting converted to column names.
I tried header=False but this just deleted it entirely.
(Note on my input data: I have a string (st = '\n'.join(lst)) that I convert to a file-like object (io.StringIO(st)), then build the csv from that file object.)
You want header=None; the False gets type-promoted to int as 0. See the docs, emphasis mine:
header : int or list of ints, default ‘infer’ Row number(s) to use as
the column names, and the start of the data. Default behavior is as if
set to 0 if no names passed, otherwise None. Explicitly pass header=0
to be able to replace existing names. The header can be a list of
integers that specify row locations for a multi-index on the columns
e.g. [0,1,3]. Intervening rows that are not specified will be skipped
(e.g. 2 in this example is skipped). Note that this parameter ignores
commented lines and empty lines if skip_blank_lines=True, so header=0
denotes the first line of data rather than the first line of the file.
You can see the difference in behaviour, first with header=0:
In [95]:
import io
import pandas as pd
t="""a,b,c
0,1,2
3,4,5"""
pd.read_csv(io.StringIO(t), header=0)
Out[95]:
   a  b  c
0  0  1  2
1  3  4  5
Now with None:
In [96]:
pd.read_csv(io.StringIO(t), header=None)
Out[96]:
   0  1  2
0  a  b  c
1  0  1  2
2  3  4  5
Note that in the latest version, 0.19.1, this will now raise a TypeError:
In [98]:
pd.read_csv(io.StringIO(t), header=False)
TypeError: Passing a bool to header is invalid. Use header=None for no
header or header=int or list-like of ints to specify the row(s) making
up the column names
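If you later decide you do want that first row as the header after all, you can promote it manually instead of re-reading the file; a sketch reusing the same sample:

```python
import io
import pandas as pd

t = """a,b,c
0,1,2
3,4,5"""

df = pd.read_csv(io.StringIO(t), header=None)   # first line kept as data
# Promote row 0 to column names and drop it from the body.
df2 = df[1:].rename(columns=df.iloc[0]).reset_index(drop=True)
```

Note that with header=None every column is read as object dtype, so the remaining values stay strings.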
I think you need parameter header=None to read_csv:
Sample:
import pandas as pd
from io import StringIO
temp=u"""a,b
2,1
1,1"""
df = pd.read_csv(StringIO(temp),header=None)
print (df)
   0  1
0  a  b
1  2  1
2  1  1
If you're using pd.ExcelFile to read all the sheets of an Excel file, then:
xl = pd.ExcelFile("path_to_file.xlsx")
xl.sheet_names  # lists the sheet names in the Excel file
df = xl.parse(2, header=None)  # parse the sheet at index 2 with header=None
df
Output:
   0  1
0  a  b
1  1  1
2  0  1
3  5  2
You can set custom column names in order to prevent this.
Say you have two columns in your dataset; then:
df = pd.read_csv(your_file_path, names=['first column', 'second column'])
You can also generate column names programmatically if you have more than two, and pass that list to the names parameter.
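For wider files the names list can be generated instead of typed out; a sketch (the column count and name pattern here are my own example):

```python
import io
import pandas as pd

data = """1,2,3,4
5,6,7,8"""

n_cols = 4  # adjust to match your file's width
names = [f"column_{i}" for i in range(n_cols)]
df = pd.read_csv(io.StringIO(data), names=names)
```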