Tabula-py for borderless table extraction - python-3.x

Can anyone suggest how to extract tabular data from a PDF with a Python or Java program? The borderless table in my PDF file is shown below.

This table might be a difficult one for tabula. How about using guess=False, stream=True?
Update: As of tabula-py 1.0.3, guess and stream should work together. No need to set guess=False to use stream or lattice option.

I solved this problem with tabula-py. Install it with
conda install tabula-py
and then run
>>> import tabula
>>> area = [70, 30, 750, 570]  # top, left, bottom, right; seems to have to be found manually
>>> page2 = tabula.read_pdf("nar_2021_editorial-2.pdf", guess=False, lattice=False,
...                         stream=True, multiple_tables=False, area=area, pages="all",
...                         )  # the `tabula` docs explain the params very well
>>> page2
and I got this result:
> 'pages' argument isn't specified.Will extract only from page 1 by default.
> [        ShortTitle                                              Text  \
> 0       Arena3Dweb         3D visualisation of multilayered networks
> 1          Aviator       Monitoring the availability of web services
> 2         b2bTools  Predictions for protein biophysical features and
> 3              NaN                                their conservation
> 4          BENZ WS          Four-level Enzyme Commission (EC) number
> ..             ...                                               ...
> 68  miRTargetLink2              miRNA target gene and target pathway
> 69             NaN                                          networks
> 70       mmCSM-PPI            Effects of multiple point mutations on
> 71             NaN                      protein-protein interactions
> 72        ModFOLD8            Quality estimates for 3D protein models
>
>                                                URL
> 0                    http://bib.fleming.gr/Arena3D
> 1          https://www.ccb.uni-saarland.de/aviator
> 2                    https://bio2byte.be/b2btools/
> 3                                              NaN
> 4                 https://benzdb.biocomp.unibo.it/
> ..                                             ...
> 68  https://www.ccb.uni-saarland.de/mirtargetlink2
> 69                                             NaN
> 70          http://biosig.unimelb.edu.au/mmcsm ppi
> 71                                             NaN
> 72       https://www.reading.ac.uk/bioinf/ModFOLD/
>
> [73 rows x 3 columns]]
The result is an iterable object (a list of DataFrames), so you can process it with for table in page2:
Hope it helps.
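In the result above, wrapped table cells come back as extra rows whose ShortTitle is NaN. A minimal sketch of one way to merge such continuation rows with pandas follows. It assumes that a missing ShortTitle always marks a continuation of the row above; the small DataFrame is rebuilt by hand from the first rows of the output, and the names `record_id` and `join_parts` are made up for illustration.

```python
import numpy as np
import pandas as pd

# First rows of the extracted table, rebuilt by hand for illustration.
table = pd.DataFrame({
    "ShortTitle": ["Arena3Dweb", "Aviator", "b2bTools", np.nan],
    "Text": [
        "3D visualisation of multilayered networks",
        "Monitoring the availability of web services",
        "Predictions for protein biophysical features and",
        "their conservation",
    ],
    "URL": [
        "http://bib.fleming.gr/Arena3D",
        "https://www.ccb.uni-saarland.de/aviator",
        "https://bio2byte.be/b2btools/",
        np.nan,
    ],
})

# Assumption: a row without a ShortTitle continues the record above it,
# so a running count of non-missing titles gives one id per logical record.
record_id = table["ShortTitle"].notna().cumsum().rename("record")


def join_parts(parts):
    """Join the non-missing pieces of one column within one record."""
    return " ".join(parts.dropna().astype(str))


merged = table.groupby(record_id).agg(join_parts).reset_index(drop=True)
print(merged)
```

If some tools in the real table legitimately have no short title, this grouping rule would need to change.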

Tabula-py borderless table extraction:
tabula-py has a stream option which, when set to True, detects tables from the gaps (whitespace) between columns instead of ruling lines.
from tabula import convert_into
src_pdf = r"src_path"
des_csv = r"des_path"
convert_into(src_pdf, des_csv, guess=False, lattice=False, stream=True, pages="all")

Related

pandas returns 50x1 matrix instead of 50x7? (read_csv gone wrong)

I'm quite new to Python. I'm trying to load a .csv file with pandas, but it returns a 50x1 matrix instead of the expected 50x7. I'm uncertain whether this is because my data contains numbers with "," (although I thought the quotechar attribute would solve that problem).
EDIT: I should perhaps mention that including the attribute sep=',' doesn't solve the issue.
My code looks like this:
df = pd.read_csv('data.csv', header=None, quotechar='"')
print(df.head)
print(len(df.columns))
print(len(df.index))
Any ideas? Thanks in advance
Here is a subset of the data as text
10-01-2021,813,116927,"2,01",-,-,-
11-01-2021,657,117584,"2,02",-,-,-
12-01-2021,462,118046,"2,03",-,-,-
13-01-2021,12728,130774,"2,24",-,-,-
14-01-2021,17895,148669,"2,55",-,-,-
15-01-2021,15206,163875,"2,81",5,5,"0,0001"
16-01-2021,4612,168487,"2,89",7,12,"0,0002"
17-01-2021,2536,171023,"2,93",717,729,"0,01"
18-01-2021,3883,174906,"3,00",2147,2876,"0,05"
Here is the output of the head-function
0
0 27-12-2020,6492,6492,"0,11",-,-,-
1 28-12-2020,1987,8479,"0,15",-,-,-
2 29-12-2020,8961,17440,"0,30",-,-,-
3 30-12-2020,11477,28917,"0,50",-,-,-
4 31-12-2020,6197,35114,"0,60",-,-,-
5 01-01-2021,2344,37458,"0,64",-,-,-
6 02-01-2021,8895,46353,"0,80",-,-,-
7 03-01-2021,6024,52377,"0,90",-,-,-
8 04-01-2021,2403,54780,"0,94",-,-,-
Using your data I got the expected result. (even without quotechar='"')
Could you maybe show us your output?
import pandas as pd
df = pd.read_csv('data.csv', header=None)
print(df)
> 0 1 2 3 4 5 6
> 0 10-01-2021 813 116927 2,01 - - -
> 1 11-01-2021 657 117584 2,02 - - -
> 2 12-01-2021 462 118046 2,03 - - -
> 3 13-01-2021 12728 130774 2,24 - - -
> 4 14-01-2021 17895 148669 2,55 - - -
> 5 15-01-2021 15206 163875 2,81 5 5 0,0001
> 6 16-01-2021 4612 168487 2,89 7 12 0,0002
> 7 17-01-2021 2536 171023 2,93 717 729 0,01
> 8 18-01-2021 3883 174906 3,00 2147 2876 0,05
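You need to define the separator explicitly. Note that delimiter is only an alias for sep, and recent pandas versions raise an error if both are passed, so give just one of them:
df = pd.read_csv('data.csv', header=None, sep=',', quotechar='"')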
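A minimal sketch to check the parsing independently of the file on disk: a few rows from the question are pasted into an in-memory buffer and read with the same options. The `io.StringIO` buffer, `decimal=","` and `na_values="-"` are illustrative choices, not something the answers above used. If this gives 7 columns while the real file gives 1, the file itself probably differs from the pasted text (for example, each whole line may be wrapped in quotes).

```python
import io

import pandas as pd

# Three rows copied from the question.
sample = '''10-01-2021,813,116927,"2,01",-,-,-
15-01-2021,15206,163875,"2,81",5,5,"0,0001"
18-01-2021,3883,174906,"3,00",2147,2876,"0,05"
'''

df = pd.read_csv(
    io.StringIO(sample),
    header=None,
    decimal=",",     # treat "2,01" as the number 2.01
    na_values="-",   # treat a lone "-" as missing
)

print(df.shape)
print(df.dtypes)
```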

Finding out the NAN values for Summary report

```
def drag_mis(data):
    list = []
    for val in data.values:
        if np.any(val) == None:
            list.append(val)
    return list.count(val)
```
I need a summary report like the attached XLS file, and I would like to automate this boring task.
The function above is meant to collect the NaN values and return their count:
df.groupby(["Operator","Model"],axis=0)[['Jan-17', 'Feb-17', 'Mar-17', 'Apr-17', 'May-17',
'Jun-17', 'Jul-17', 'Aug-17', 'Sep-17', 'Oct-17', 'Nov-17', 'Dec-17',
'Jan-18', 'Feb-18', 'Mar-18', 'Apr-18', 'May-18', 'Jun-18', 'Jul-18',
'Aug-18', 'Sep-18', 'Oct-18', 'Nov-18', 'Dec-18', 'Jan-19', 'Feb-19',
'Mar-19', 'Apr-19', 'May-19', 'Jun-19', 'Jul-19', 'Aug-19', 'Sep-19',
'Oct-19', 'Nov-19', 'Dec-19', 'Jan-20', 'Feb-20', 'Mar-20', 'Apr-20',
'May-20']].apply(drag_mis)
I want to collect all NaN values so that I can count them for a summary report in a new CSV file.
The output is as follows:
AAL 737 0
757 0
767 0
777 0
787 0
MD80 0
AAR 747 0
767 0
777 0
ABM 747 0
ACN 737 0
Please share any ideas on where my function is going wrong.
I also tried the code below, but I need a summary like value_counts, which cannot be applied to a DataFrame:
df.groupby(["Operator","Model"])[['Jan-17', 'Feb-17', 'Mar-17', 'Apr-17', 'May-17', 'Jun-17', 'Jul-17', 'Aug-17', 'Sep-17', 'Oct-17', 'Nov-17', 'Dec-17', 'Jan-18', 'Feb-18', 'Mar-18', 'Apr-18', 'May-18', 'Jun-18', 'Jul-18', 'Aug-18', 'Sep-18', 'Oct-18', 'Nov-18', 'Dec-18', 'Jan-19', 'Feb-19', 'Mar-19', 'Apr-19', 'May-19', 'Jun-19', 'Jul-19', 'Aug-19', 'Sep-19', 'Oct-19', 'Nov-19', 'Dec-19', 'Jan-20', 'Feb-20', 'Mar-20', 'Apr-20', 'May-20']].apply(lambda x: x.isnull().sum())
Please look at this snapshot of the XLS file:
https://i.stack.imgur.com/E1FTN.jpg
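On why drag_mis always returns 0: np.any(val) returns a boolean, which is never equal to None, and NaN is not None either, so nothing is ever appended to the list. A minimal sketch of counting NaN values per group is given below. The tiny DataFrame is invented for illustration (only the column names follow the question), and `month_cols` and `nan_counts` are made-up names.

```python
import numpy as np
import pandas as pd

# Invented sample data; only the column names follow the question.
df = pd.DataFrame({
    "Operator": ["AAL", "AAL", "AAR", "AAR"],
    "Model": ["737", "737", "747", "747"],
    "Jan-17": [1.0, np.nan, np.nan, np.nan],
    "Feb-17": [np.nan, 2.0, 3.0, 4.0],
})

month_cols = [c for c in df.columns if c not in ("Operator", "Model")]

# isna() marks missing cells as True; summing the booleans per group counts them.
nan_counts = (
    df[month_cols]
    .isna()
    .groupby([df["Operator"], df["Model"]])
    .sum()
)
nan_counts["Total"] = nan_counts.sum(axis=1)
print(nan_counts)
```

The resulting table can then be written out with nan_counts.to_csv and a file name of your choice.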

Parsing error when reading a specific Pajek (NET) file with Networkx into Jupyter

I am trying to read this Pajek file in Google Colab's version of Jupyter, and I get an error when executing the following very simple code:
J = nx.MultiDiGraph()
J=nx.read_pajek("/content/data/graphdatasets/jazz.net")
print(nx.info(J))
The error is the following:
/usr/local/lib/python3.6/dist-packages/networkx/readwrite/pajek.py in parse_pajek(lines)
211 except AttributeError:
212 splitline = shlex.split(str(l))
--> 213 id, label = splitline[0:2]
214 labels.append(label)
215 G.add_node(label)
ValueError: not enough values to unpack (expected 2, got 1)
With pip show networkx, I see that I'm running Networkx version: 2.3. Am I doing something wrong in the code?
Update: Pasting below the file's first few lines:
*Vertices 198
*Arcs
*Edges
1 8 1
1 24 1
1 35 1
1 42 1
1 46 1
1 60 1
1 74 1
1 78 1
According to the Pajek format definition, the first two lines of your file do not follow the standard. After *Vertices n, n lines with details about the vertices are expected. The parser therefore reads *Arcs as a vertex line, finds only one token on it, and raises the ValueError. In addition, having both *Arcs and *Edges is redundant. For the edge list, NetworkX builds a MultiDiGraph if the section starts with *Arcs and a MultiGraph if it starts with *Edges (see current code). To resolve your problem, you only need to delete the first two lines of your .net file.
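As an alternative to editing the file by hand, the edge list can be read directly. The sketch below is not NetworkX's parser and not the method from the answer above: `read_pajek_edges` is a hypothetical helper, and it assumes a file like the one in the question, with no vertex detail lines and with edge lines of the form `source target weight`.

```python
def read_pajek_edges(lines):
    """Return (declared_vertex_count, [(u, v, weight), ...]).

    Assumes there are no vertex detail lines: every line that does not
    start with '*' is treated as an edge.
    """
    n_vertices = None
    edges = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if parts[0].startswith("*"):
            # Section header; remember the vertex count if one is declared.
            if parts[0].lower() == "*vertices" and len(parts) > 1:
                n_vertices = int(parts[1])
            continue
        if len(parts) >= 2:
            weight = float(parts[2]) if len(parts) > 2 else 1.0
            edges.append((int(parts[0]), int(parts[1]), weight))
    return n_vertices, edges


# First lines of the file from the question.
sample = """*Vertices 198
*Arcs
*Edges
1 8 1
1 24 1
1 35 1
""".splitlines()

n_vertices, edges = read_pajek_edges(sample)
print(n_vertices, edges)
```

The returned triples can be passed to add_weighted_edges_from on a NetworkX graph.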

How to handle such errors?

companies = pd.read_csv("http://www.richard-muir.com/data/public/csv/CompaniesRevenueEmployees.csv", index_col = 0)
companies.head()
I'm getting this error. Please suggest what approaches I should try.
'utf-8' codec can't decode byte 0xb7 in position 7
On macOS, try reading the file with the 'latin1' encoding.
companies = pd.read_csv("http://www.richard-muir.com/data/public/csv/CompaniesRevenueEmployees.csv",
index_col=0,
encoding='latin1')
Downloading the file and opening it in Notepad++ shows it is ANSI-encoded. If you are on a Windows system, this should fix it:
import pandas as pd
url = "http://www.richard-muir.com/data/public/csv/CompaniesRevenueEmployees.csv"
companies = pd.read_csv(url, index_col = 0, encoding='ansi')
print(companies)
If you are not on Windows, the 'ansi' codec is not available (it is a Windows-only alias for 'mbcs'), so pass an explicit encoding such as 'cp1252' or 'latin1' instead.
See: https://docs.python.org/3/library/codecs.html#standard-encodings
Output:
Name Industry \
0 Walmart Retail
1 Sinopec Group Oil and gas
2 China National Petroleum Corporation Oil and gas
... ... ...
47 Hewlett Packard Enterprise Electronics
48 Tata Group Conglomerate
Revenue (USD billions) Employees
0 482 2200000
1 455 358571
2 428 1636532
... ... ...
47 111 302000
48 108 600000
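A minimal sketch of what the error means, using only the standard library. The byte string is hypothetical (the actual file content is not shown in the thread); it only needs to contain the byte 0xb7 from the error message. That byte cannot start a UTF-8 sequence, while single-byte encodings such as cp1252 and latin1 map it to a middle dot.

```python
# Hypothetical bytes containing 0xb7, the byte named in the error message.
raw = b"Revenue\xb7USD"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    print(exc)

# Single-byte encodings accept every byte value, so decoding succeeds.
decoded = raw.decode("cp1252")
print(decoded)
```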

svm-train output file has fewer lines than the input file

I am currently building a binary classification model and have created an input file for svm-train (svm_input.txt). This input file has 453 lines, 4 features and 2 classes [0, 1],
i.e.
0 1:15.0 2:40.0 3:30.0 4:15.0
1 1:22.73 2:40.91 3:36.36 4:0.0
1 1:31.82 2:27.27 3:22.73 4:18.18
0 1:22.73 2:13.64 3:36.36 4:27.27
1 1:30.43 2:39.13 3:13.04 4:17.39 ......................
My problem is that when I count the number of lines in the output model generated by svm-train (svm_train_model.txt), it has 12 fewer data lines than the input file. The line count of the model file is 450, but that includes 9 lines at the beginning showing the various parameters generated,
i.e.
svm_type c_svc
kernel_type rbf
gamma 1
nr_class 2
total_sv 441
rho -0.156449
label 0 1
nr_sv 228 213
SV
Therefore 12 of the original 453 input lines have gone. I am new to SVMs and was hoping that someone could shed some light on why this might have happened.
Thanks in advance.
Update:
I now believe that, in generating the model, svm-train has removed lines in which the labels and all the feature values are exactly the same.
To explain: my input is a set of miRNAs which have been classified as 1 or 0 depending on whether they are involved in a particular process (i.e. 1=Yes, 0=No). The input file looks something like this:
0 1:22 2:30 3:14 4:16
1 1:26 2:15 3:17 4:25
0 1:22 2:30 3:14 4:16
Here, lines one and three are exactly the same and as a result will be removed from the output model. My question is both why the output model would do this and how I can get around it (whilst using the same features).
Although some lines have identical labels and feature values in the input file, they still represent different miRNAs.
NOTE: The input file does not have a feature for the miRNA name (which would clearly show the differences between lines). However, in terms of the features used (i.e. nucleotide percentage content), some of the miRNAs have exactly the same percentage content of A, U, G and C. They seem to be treated as duplicates and removed from the output model even though they are not, hence the output model has fewer lines.
The format of the input file is:
Column 0 - label (i.e 1 or 0): 1=Yes & 0=No
Column 1 - Feature 1 = Percentage Content "A"
Column 2 - Feature 2 = Percentage Content "U"
Column 3 - Feature 3 = Percentage Content "G"
Column 4 - Feature 4 = Percentage Content "C"
The input file actually looks something like this (the first two lines appear identical, but each line represents a different miRNA):
1 1:23 2:36 3:23 4:18
1 1:23 2:36 3:23 4:18
0 1:36 2:32 3:5 4:27
1 1:14 2:41 3:36 4:9
1 1:18 2:50 3:18 4:14
0 1:36 2:23 3:23 4:18
0 1:15 2:40 3:30 4:15
In terms of software, I am using libsvm-3.22 and Python 2.7.5.
My first observation is that you should align your input file properly. The libsvm code doesn't look for exactly 4 features. It identifies them by the string literals you provide to separate the features from the labels. I suggest converting your input file manually to create the desired input.
Try running the following Python code.
Requirements: h5py, if your input comes from MATLAB (.mat file).
pip install h5py

import h5py
import numpy as np

# Read the training labels (h5py reads MATLAB v7.3 .mat files)
f = h5py.File('traininglabel.mat', 'r')  # give the label .mat file for training
for name, data in f.items():
    labels = data[()][0]  # first row holds the labels
f.close()

# Convert float labels such as 0.0 / 1.0 to the strings '0' / '1'
trainlabels = []
for value in labels:
    label = str(value)
    if label == '0.0':
        label = '0'
    if label == '1.0':
        label = '1'
    trainlabels.append(label)
    print(label)

# Read the training features
f = h5py.File('training_features.mat', 'r')  # give the features file here
for name, data in f.items():
    features = data[()]
f.close()

# Write one sample per line; libsvm expects 1-based index:value pairs
with open('traindata.txt', 'w') as out:
    for i in range(0, 1000):  # number of training samples in the features file
        out.write(trainlabels[i] + ' ')
        for j in range(0, 49):  # number of features per sample
            out.write('%d:%s ' % (j + 1, features[j][i]))
        out.write('\n')
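On the line counts in the question: a libsvm model file lists only the support vectors after the SV line, so training points that are not support vectors do not appear in it, whether or not they are duplicates. The header quoted in the question is consistent with this, as the sketch below checks. It parses that header text only (no libsvm call is made), and the variable names are made up for illustration.

```python
# Header lines of the model file, as quoted in the question.
header = """svm_type c_svc
kernel_type rbf
gamma 1
nr_class 2
total_sv 441
rho -0.156449
label 0 1
nr_sv 228 213
SV"""

lines = header.splitlines()

# Map each header key to its list of values, e.g. "nr_sv" -> ["228", "213"].
fields = {}
for line in lines:
    key, *values = line.split()
    fields[key] = values

total_sv = int(fields["total_sv"][0])
sv_per_class = [int(v) for v in fields["nr_sv"]]

header_lines = len(lines)                    # parameter lines before the vectors
model_lines = header_lines + total_sv        # expected length of the model file
input_lines = 453                            # from the question
non_support_vectors = input_lines - total_sv

print(header_lines, model_lines, non_support_vectors)
```

Under this reading, the 12 missing lines are simply the training points that did not become support vectors, so nothing needs to be worked around in the input file.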
