Difficulty Reading Text from Excel file - excel

I'm working on a project where I need to search large amounts of text from an Excel file for keywords. These keywords are citations in their various formats (e.g. XXXXXX, YYYY), and I also need to search the text for citations that contain the last name of the author. In the spreadsheet, column C holds the authors' last names and column D holds the text of the writing. I am working with xlrd, but I do not know how to use the items from list "L" to search the items in list "L1". Ultimately, I need to search list "L1" (the text) for citations, and then search L1 again for citations that share the name in the corresponding cell of L (e.g. if C3 = Smith, I must search D3 for any citation containing the name Smith). Any help with this, or other tips/methods for my task, would be greatly appreciated!
Here is my current code for searching the excel file.
from xlrd import open_workbook, cellname

book = open_workbook("C:\Python27\Doc\Book3.xls")
sheet = book.sheet_by_index(0)
for year in xrange(1900, 2014):
    citation = str(year) or str(year) + ')' or '(' + str(year) + ')' or str(year) + ';'
    firstc = sheet.col_values(2)
    secondc = sheet.col_values(3)
    L = [firstc]
    L1 = [secondc]
    if citation in L1:
        print 'citation ' + str(year)
    if L in L1:
        print 'self-cite ' + str(year)
    for item in L1:
        if citation in item:
            print item
I am somewhat of a novice at python and I apologize for bothering you all, but I have had difficulty finding pre-written topics on searching text files. Thank you!
Best

You can't check whether L (which is a list) is in L1 like that; what you can do is check whether the items in L are in L1. For example:
>>> s = 'abcde'
>>> b = ['a', 'f', 'g', 'b']
>>> b
['a', 'f', 'g', 'b']
>>> for i in b:
...     if i in s:
...         print i
...     else:
...         print "nope"
...
a
nope
nope
b
>>>
If you have two lists, you'll need to loop over both, with a nested for loop:
for i in b:
    for j in L1:
        pass  # do stuff with each pair (i, j) here
Hope that gives you a start.
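For instance, here is a tiny sketch of that nested loop with made-up names and texts (the data below is purely illustrative):
names = ['Smith', 'Jones']
texts = ['As shown in (Smith, 2004), ...', 'Earlier work (Brown, 1999) found ...']
for name in names:
    for text in texts:
        if name in text:
            print name + ' appears in: ' + text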
ETA:
You can use enumerate to get the index of the item you're currently looping over, and use that index to reach the matching row in the second list:
>>> b = ['a', 'f', 'g', 'b']
>>> L1 = ['worda', 'words including b', 'many many words', 'a lot more words']
>>> for i, j in enumerate(b):
...     if j in L1[i]:
...         print j
...     else:
...         print i, j
...
a
1 f
2 g
3 b
>>>
Combine that with row_values and you might have what you need.
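As a rough sketch of how those pieces might fit together for the question's spreadsheet: the sketch below sticks with col_values from the question rather than switching to row_values, and the file path, column indices, and the '(year)' citation format are all assumptions taken from the post.
from xlrd import open_workbook

book = open_workbook("C:\Python27\Doc\Book3.xls")  # path from the question
sheet = book.sheet_by_index(0)

names = sheet.col_values(2)   # column C: author last names
texts = sheet.col_values(3)   # column D: the text to search

for row, name in enumerate(names):
    text = texts[row]
    for year in xrange(1900, 2014):
        marker = '(' + str(year) + ')'   # one possible citation format
        if marker in text:
            print 'citation', year, 'in row', row
    if name and name in text:
        print 'possible self-cite in row', row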

Related

find the lengths of all sublists containing common repeated element

I need to find all the sublists of a list where the element 'F' appears in consecutive positions:
g = ['T','F','F','F','F','T','T','T','F','F','F','T']
In this case there are two sublists that contain consecutive repeats of 'F':
['F','F','F','F'] at indices 1, 2, 3, 4, so the answer is 4,
and
['F','F','F'] at indices 8, 9, 10, which is again a consecutive run, so the answer is 3.
Note:
The list contains only the two values 'T' and 'F', and the operation is always done for the element 'F'.
You can get the lengths of consecutive sequences with itertools.groupby:
from itertools import groupby
data = ['T','F','F','F','F','T','T','T','F','F','F','T']
# Consecutive sequences of "F".
# "groupby(data)" produces an iterator that calculates on-the-fly.
# The iterator returns consecutive keys and groups from the iterable "data".
seqs = [list(g) for k, g in groupby(data) if k == 'F']
print(seqs)
# [['F', 'F', 'F', 'F'], ['F', 'F', 'F']]
seq_lens = [len(k) for k in seqs]
print(seq_lens)
# [4, 3]
You can also get the maximum length of such a consecutive sequence:
max_len_seq = len(max(seqs, key=len))
print(max_len_seq)
# 4
See itertools.groupby for more info:
class groupby:
    # [k for k, g in groupby('AAAABBBCCDAABBB')] --> A B C D A B
    # [list(g) for k, g in groupby('AAAABBBCCD')] --> AAAA BBB CC D
    ...
You can create two counter variables for the repeated letters. Traverse the list; each time you find a 'T', first check fcount: if it is greater than 1, a run of 'F' has just ended, so print the length of that run and reset the counter.
tcount = 0
fcount = 0
for e in g:
    if e == "T":
        tcount += 1
        if fcount > 1:       # a run of "F" has just ended
            print(fcount)
        fcount = 0
    else:                    # e == "F": do the same kind of counting for F
        fcount += 1
        tcount = 0
if fcount > 1:               # handle a run of "F" at the end of the list
    print(fcount)
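For the sample list g above, this prints 4 and then 3, the same run lengths that the groupby approach reports; the extra check after the loop is there so a run of 'F' at the very end of the list is not missed.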

Efficient way of calculating specific length combinations of adjacent data?

I have an ordered list of elements, and I'd like to determine all the ways they can be arranged, preserving their order, into 'n' groups.
So as an example, if I have the ordered list A, B, C, D, E and only want 2 groups, the four solutions would be:
ABCD, E
ABC, DE
AB, CDE
A, BCDE
Now, with some help from another StackOverflow post, I've come up with a workable brute-force solution that calculates all possible combinations of all possible groupings, from which I simply extract those cases that meet my target number of groupings.
For reasonable numbers of elements, this is just fine, but as I extend the numbers of elements, the number of combinations increases very very quickly, and I was wondering if there might be a clever way to limit the solutions calculated to only those that meet my target groupings number?
Code so far is as follows:
import itertools
import string
import collections
def generate_combination(source, comb):
    res = []
    for x, action in zip(source, comb + (0,)):
        res.append(x)
        if action == 0:
            yield "".join(res)
            res = []
#Create a list of first 20 letters of the alphabet
seq = list(string.ascii_uppercase[0:20])
seq
['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T']
#Generate all possible combinations
combinations = [list(generate_combination(seq,c)) for c in itertools.product((0,1), repeat=len(seq)-1)]
len(combinations)
524288
#Create a list that counts the number of groups in each solution,
#and counter to allow easy query
group_counts = [len(i) for i in combinations]
count_dic = collections.Counter(group_counts)
count_dic[1], count_dic[2], count_dic[3], count_dic[4], count_dic[5], count_dic[6]
(1, 19, 171, 969, 3876, 11628)
So as you can see, over half a million combinations were calculated, while if I had only wanted solutions with 5 groups, only 3,876 would have needed to be calculated.
Any suggestions?
A partition of seq into 5 parts is equivalent to a choice of 4 locations in range(1, len(seq)) at which to cut seq.
Thus you could use itertools.combinations(range(1, len(seq)), 4) to generate all the partitions of seq into 5 parts:
import itertools as IT
import string
def partition_into_n(iterable, n, chain=IT.chain, map=map):
    """
    Return a generator of all partitions of iterable into n parts.

    Based on http://code.activestate.com/recipes/576795/ (Raymond Hettinger),
    which generates all partitions.
    """
    s = iterable if hasattr(iterable, '__getitem__') else tuple(iterable)
    size = len(s)
    first, middle, last = [0], range(1, size), [size]
    getitem = s.__getitem__
    return (map(getitem, map(slice, chain(first, div), chain(div, last)))
            for div in IT.combinations(middle, n - 1))

seq = list(string.ascii_uppercase[0:20])
ngroups = 5
for partition in partition_into_n(seq, ngroups):
    print(' '.join([''.join(grp) for grp in partition]))
print(len(list(partition_into_n(seq, ngroups))))
yields
A B C D EFGHIJKLMNOPQRST
A B C DE FGHIJKLMNOPQRST
A B C DEF GHIJKLMNOPQRST
A B C DEFG HIJKLMNOPQRST
...
ABCDEFGHIJKLMNO P Q RS T
ABCDEFGHIJKLMNO P QR S T
ABCDEFGHIJKLMNO PQ R S T
ABCDEFGHIJKLMNOP Q R S T
3876
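That count is simply the binomial coefficient C(19, 4) = 3876: a partition of the 20 elements into 5 ordered groups corresponds to choosing 4 of the 19 interior cut points. The counts in count_dic above follow the same pattern, C(19, k-1) for k groups.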

Convert a string within a list to an element in the list in python

I am using python data to create a ReportLab report. I have a list that looks like this:
mylist = [['a b c d e f'],['g h i j k l']]
and want to convert it to look like this:
mylist2 = [[a,b,c,d,e],[g,h,i,j,k,l]]
The first list gives me a "list index out of range" error when building the report.
The second list works in ReportLab, but the columns and formatting aren't what I want.
What is the best method to convert mylist to mylist2 in Python?
Converting a string to a list can be done with the split() method.
Try mylist[0][0].split() and mylist[1][0].split().
Borrowing the idea from Jibin Mathews, I tried the following:
new_list = [mylist[0][0].split(), mylist[1][0].split()]
and it prints
[['a', 'b', 'c', 'd', 'e', 'f'], ['g', 'h', 'i', 'j', 'k', 'l']]
I notice 'f' is missing from your desired final list. Is that a mistake?
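If the outer list can have any number of rows, the same split can be written as a list comprehension; this is just a small sketch of the same idea:
mylist = [['a b c d e f'], ['g h i j k l']]
mylist2 = [row[0].split() for row in mylist]
print(mylist2)
# [['a', 'b', 'c', 'd', 'e', 'f'], ['g', 'h', 'i', 'j', 'k', 'l']]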
mylist = [['a b c d e f'],['g h i j k l']]
import re
space_re = re.compile(r'\s+')
output = []
for l in mylist:
    element = l[0]
    le = re.split(space_re, element)
    output.append(le)
This is not the best answer, but it will work fine.

Reordering character triplets in Python

I've been trying to solve this homework problem for days, but can't seem to fix it. I started my study halfway through the first semester, so I can't ask the teacher yet and I hope you guys can help me. It's not for grades, I just want to know how.
I need to write a program that reads a string and converts each triplet abc into bca; this should be done per group of three characters. For example, katzon becomes atkonz.
The closest I've gotten is this:
string=(input("Give a string: "))
for i in range(0, len(string)-2):
a = string[i]
b = string[i + 1]
c = string[i + 2]
new_string= b, c, a
i+=3
print(new_string)
The output is:
('a', 't', 'k')
('t', 'z', 'a')
('z', 'o', 't')
('o', 'n', 'z')
The code below converts, for example, "abc" to "bca", and it works for any string containing triplets. If the input is "abcd", it is converted to "bcad"; if you input "katzon", it is converted to "atkonz". This is what I understood from your question.
stringX = input()
# create list of words in the string
listX = stringX.split(" ")
listY = []
# create list of triplets and non-triplets
for word in listX:
    listY += [[word[i:i+3] for i in range(0, len(word), 3)]]
# convert triplets, for example: "abc" -> "bca"
for listZ in listY:
    for item in listZ:
        if len(item) == 3:
            listZ[listZ.index(item)] = listZ[listZ.index(item)][1:] + listZ[listZ.index(item)][0]
    listY[listY.index(listZ)] = "".join(listZ)
# create final string
stringY = " ".join(listY)
print(stringY)
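For what it's worth, here is a more compact sketch of the same per-chunk rotation (the helper name rotate_triplets is just illustrative), assuming only complete groups of three should be rotated and shorter leftovers kept as they are:
def rotate_triplets(s):
    # split into chunks of three and rotate each complete chunk: "abc" -> "bca"
    chunks = [s[i:i+3] for i in range(0, len(s), 3)]
    return "".join(c[1:] + c[0] if len(c) == 3 else c for c in chunks)

print(rotate_triplets("katzon"))  # atkonz
print(rotate_triplets("abcd"))    # bcad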

Getting all str type elements in a pd.DataFrame

Based on my limited knowledge of pandas, pandas.Series.str.contains can search for a specific string in a pd.Series. But what if the DataFrame is large and I just want to glance over all the string elements in it before I do anything?
Example like this:
pd.DataFrame({'x1':[1,2,3,'+'],'x2':[2,'a','c','this is']})
  x1       x2
0  1        2
1  2        a
2  3        c
3  +  this is
I need a function to return ['+','a','c','this is']
If you are strictly looking for string values and performance is not a concern, then this is a very simple answer.
df.where(df.applymap(type).eq(str)).stack().tolist()
['a', 'c', '+', 'this is']
There are two possible approaches, depending on whether numeric values saved as strings should be treated as strings or not.
Check the difference:
df = pd.DataFrame({'x1':[1,'2.78','3','+'],'x2':[2.8,'a','c','this is'], 'x3':[1,4,5,4]})
print (df)
     x1       x2  x3
0     1      2.8   1
1  2.78        a   4   <- 2.78 is a float saved as a string
2     3        c   5   <- 3 is an int saved as a string
3     +  this is   4
#flatten all values
ar = df.values.ravel()
#errors='coerce' parameter in pd.to_numeric return NaNs for non numeric
L = np.unique(ar[np.isnan(pd.to_numeric(ar, errors='coerce'))]).tolist()
print (L)
['+', 'a', 'c', 'this is']
Another solution is to use a custom function that checks whether a value can be converted to float:
def is_not_float_try(str):
    try:
        float(str)
        return False
    except ValueError:
        return True
s = df.stack()
L = s[s.apply(is_not_float_try)].unique().tolist()
print (L)
['a', 'c', '+', 'this is']
If you need all values saved as strings, use isinstance:
s = df.stack()
L = s[s.apply(lambda x: isinstance(x, str))].unique().tolist()
print (L)
['2.78', 'a', '3', 'c', '+', 'this is']
You can use str.isdigit with unstack:
df[df.apply(lambda x : x.str.isdigit()).eq(0)].unstack().dropna().tolist()
Out[242]: ['+', 'a', 'c', 'this is']
Using regular expressions and set union, you could try something like:
>>> set.union(*[set(df[c][~df[c].str.findall('[^\d]+').isnull()].unique()) for c in df.columns])
{'+', 'a', 'c', 'this is'}
If you use a regular expression that matches numbers in general, you could exclude floating-point numbers as well.
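A minimal sketch of that idea, assuming "a number in general" means a plain integer or decimal (the pattern and variable names below are illustrative):
import pandas as pd

df = pd.DataFrame({'x1': [1, '2.78', '3', '+'], 'x2': [2.8, 'a', 'c', 'this is']})

num_re = r'^-?\d+(?:\.\d+)?$'      # integers or decimals, optionally negative
s = df.stack().astype(str)         # flatten and view every value as text
L = s[~s.str.match(num_re)].unique().tolist()
print(L)
# ['a', 'c', '+', 'this is']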
