Print some columns in multiline string - string

Is there any way to print only certain columns of a string that spans several lines? For instance, suppose we have the following string:
EXAMPLE1
- -- ---
EXAMPLE2
I want to print only the columns that contain a '-'. So the output for this case should be:
EAMLE1
------
EAMLE2
I was thinking of splitting the string and iterating through every column using zip, printing just the columns that contain '-', but I don't really know how to use it properly.
Any ideas would be welcome.
Thanks in advance.

Once we split the string into lines, we can use zip(*lines) to transpose the list, getting the columns, search those for -, and then transpose again to get the new lines. Then we can use str.join to assemble the result.
s = '''\
EXAMPLE1
- -- ---
EXAMPLE2'''
columns = (tup for tup in zip(*s.split('\n')) if any('-' in x for x in tup))
lines = (''.join(line) for line in zip(*columns))
print('\n'.join(lines))
Output:
EAMLE1
------
EAMLE2
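The same transpose, filter, transpose-back idea can be wrapped in a small function. This is only a sketch: the name keep_marked_columns and its marker parameter are made up for illustration, and it assumes that shorter lines should be padded with spaces (via itertools.zip_longest) rather than truncating every line to the shortest one, which is what plain zip does.

```python
from itertools import zip_longest


def keep_marked_columns(text, marker="-"):
    """Return text with only the columns that contain marker (hypothetical helper)."""
    lines = text.split("\n")
    # Transpose lines into columns; pad short lines with spaces (an assumption).
    columns = zip_longest(*lines, fillvalue=" ")
    # Keep a column only if the marker character appears somewhere in it.
    kept = [column for column in columns if marker in column]
    # Transpose back into rows and glue the characters together.
    return "\n".join("".join(row) for row in zip(*kept))


s = "EXAMPLE1\n- -- ---\nEXAMPLE2"
print(keep_marked_columns(s))
```

For the example string this prints the same three lines as above. If no column contains the marker, the function returns an empty string.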

Related

How to achieve below situation using python list comprehension?

rows = [(d = re.split("\s{2,}|\|", line)) for line in lines if len(d) > 5 and d[0]!='' ]
As in the code snippet shown, I am splitting a list of lines by spaces in each line. I am trying to assign split to a variable d so that I can use it later in if condition and can avoid repetitive split.
Is there a way to achieve it?
rows = [d for d in (re.split(r"\s{2,}|\|", line) for line in lines) if len(d) > 5 and d[0] != '']
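On Python 3.8 or newer, an assignment expression (the := operator) is another way to split once and reuse the result inside the condition. The sketch below uses made-up sample lines; only the regular expression and the two conditions come from the question.

```python
import re

# Hypothetical sample input, for illustration only.
lines = [
    "a  b  c  d  e  f",
    "  leading  blank  x  y  z  w",
    "too  short",
    "p|q|r|s|t|u",
]

rows = [
    d
    for line in lines
    # The condition runs first, so d is bound before it is used as the element.
    if len(d := re.split(r"\s{2,}|\|", line)) > 5 and d[0] != ""
]
print(rows)
```

The second sample line is dropped because its first field is empty, and the third because it has too few fields.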

Python regex multiple matches occurrences between two strings

I have a multi-line string with my start/end magic strings ("X" and "Y"). I'm trying to capture all occurrences but I'm experiencing some issues.
Here is the code
import re
testString = '''AAAAAXBBBBBYCCCCCXDDDDDYEEEEEEXFFF
FFFYGGG
'''
pattern = re.compile(r'(.*)X(.*)Y(.*)', re.MULTILINE)
match = re.search(pattern, testString)
print(match.group(1)) # output: AAAAAXBBBBBYCCCCC
print(match.group(2)) # output: DDDDD
print(match.group(3)) # output: EEEEEEXFFF
Basically, I'm trying to capture all occurrences of the following (And I have to maintain text order):
Text before the magic start string (e.g.: AAAAA, CCCCC, EEEEEE)
Text between start/end magic strings (e.g.: BBBBB, DDDDD, FFF\nFFF)
Text after the magic end string (e.g.: CCCCC, GGG)
So I'm trying to print the following output: (what's in between brackets below is just a comment)
AAAAA (before magic string)
BBBBB (between magic strings)
CCCCC (before/after magic strings, it does not matter. Just the order matters.)
DDDDD (after magic string)
And so on. Printing them in that order would solve the issue. (Then I can pass each to other functions, ...etc.)
The code works nicely when the text is as simple as for example "AAXBBYCC", but with complicated strings I'm losing control.
Any ideas or alternative ways to do this?
You could match any character except X or Y in group 1, then match X, and do the same for group 2 followed by Y. The "after the magic string" part can be captured in a lookahead with a third group, so it is not consumed and can serve as group 1 of the next match.
The negated character class [^XY] also matches a newline, which is what lets it match the FFF\nFFF part.
([^XY]+)X([^XY]+)Y(?=([^XY]*))
([^XY]+)X Capture group 1, match 1+ times any char except X or Y, then match X
([^XY]+)Y Capture group 2, match 1+ times any char except X or Y, then match Y
(?= Positive lookahead, assert what is directly to the right is
([^XY]*) Capture group 3, match 0+ times any char except X or Y
) Close lookahead
import re
regex = r"([^XY]+)X([^XY]+)Y(?=([^XY]*))"
s = ("AAAAAXBBBBBYCCCCCXDDDDDYEEEEEEXFFF\n"
"FFFYGGG")
matches = re.findall(regex, s)
print(matches)
Output
[('AAAAA', 'BBBBB', 'CCCCC'), ('CCCCC', 'DDDDD', 'EEEEEE'), ('EEEEEE', 'FFF\nFFF', 'GGG')]
So I'm trying to print the following output: (what's in between brackets below is just a comment)
AAAAA (before magic string)
BBBBB (between magic strings)
CCCCC (before/after magic strings, it does not matter. Just the order matters.)
DDDDD (after magic string)
And so on.
Since it doesn't matter whether before or after start or end, it is as simple as:
import re
o = re.split("X|Y", testString)
print(*o, sep='\n')
Can't you just use:
pattern = re.compile(r'[^XY]+')
match = re.findall(pattern, testString)
print(match)
# ['AAAAA', 'BBBBB', 'CCCCC', 'DDDDD', 'EEEEEE', 'FFF\nFFF', 'GGG\n']
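If it later matters which pieces sat between the magic strings, the position of each piece in the split result is enough to tell. The following is a sketch that assumes X and Y strictly alternate, starting with X; the labels "outside" and "between" are invented for illustration.

```python
import re

text = "AAAAAXBBBBBYCCCCCXDDDDDYEEEEEEXFFF\nFFFYGGG"

# Splitting on either magic character keeps the pieces in their original order.
segments = re.split(r"[XY]", text)

# With strictly alternating X...Y pairs (an assumption), odd positions are
# the pieces that sat between X and Y; even positions are everything else.
labelled = [
    ("between" if index % 2 else "outside", segment)
    for index, segment in enumerate(segments)
]

for label, segment in labelled:
    print(f"{segment!r} ({label})")
```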

How to remove the alphanumeric characters from a list and split them in the result?

def tokenize(s):
    string = s.lower().split()
    getVals = list([val for val in s if val.isalnum()])
    result = "".join(getVals)
    print (result)
tokenize('AKKK#eastern B!##est!')
I'm trying to get the output ('akkkeastern', 'best'),
but my output for the above code is AKKKeasternBest.
What changes should I make?
Using a list comprehension is a good way to filter elements out of a sequence like a string. In the example below, the list comprehension is used to build a list of characters (characters are also strings in Python) that are either alphanumeric or a space - we are keeping the space around to use later to split the list. After the filtered list is created, what's left to do is make a string out of it using join and last but not least use split to break it in two at the space.
Example:
string = 'AKKK#eastern B!##est!'
# Removes non-alphanumeric chars, but preserves spaces
filtered = [
    char.lower()
    for char in string
    if char.isalnum() or char == " "
]
# Joins the filtered list into a string, then splits on whitespace
result = "".join(filtered).split()
print(result)
Output:
['akkkeastern', 'best']
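The same idea can be folded back into the tokenize function from the question. This variant is a sketch rather than the only way to do it: it splits on whitespace first and then cleans each word, and it assumes that words left empty after cleaning should be dropped.

```python
def tokenize(s):
    """Lower-case s, split it on whitespace, and strip non-alphanumeric characters."""
    words = s.lower().split()
    # Clean every word separately so the word boundaries survive.
    cleaned = ("".join(ch for ch in word if ch.isalnum()) for word in words)
    # Drop words that consisted only of punctuation (an assumption).
    return [word for word in cleaned if word]


print(tokenize('AKKK#eastern B!##est!'))
```

Returning the list instead of printing it inside the function makes the result easier to reuse; wrap the call in tuple() if a tuple is really needed.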

How to count strings in specified field within each line of one or more csv files

I am writing a Python 3 program to count strings in a specified field within each line of one or more csv files.
Where the csv file contains:
Field1, Field2, Field3, Field4
A, B, C, D
A, E, F, G
Z, E, C, D
Z, W, C, Q
The script is executed, for example:
$ ./script.py 1,2,3,4 file.csv
And the result is:
A 10
C 7
D 2
E 2
Z 2
B 1
Q 1
F 1
G 1
W 1
ERROR
The script is executed, for example:
$ ./script.py 1,2,3,4 file.csv file.csv file.csv
Where the error occurs:
for rowitem in reader:
    for pos in field:
        pos = rowitem[pos] ##<---LINE generating error--->##
        if pos not in fieldcnt:
            fieldcnt[pos] = 1
        else:
            fieldcnt[pos] += 1
TypeError: list indices must be integers or slices, not str
Thank you!
Judging from the output, I'd say that the fields in the csv file do not influence the count of the strings. If string matching should be case-insensitive, remember to call yourstring.lower() so that different-case matches are counted as one. Also keep in mind that if your text is large, the number of unique strings could be very large as well, so some sort of sorting must be in place to make sense of it (or else it might be a long list of counts, a large portion of them just 1s).
Now, to get a count of unique strings, the collections module is an easy way to go.
# read the whole file into one string
with open('yourfile.txt', encoding="utf8") as file:
    a = file.read()
# if you have some words you'd like to exclude
with open('stopwords.txt') as stopfile:
    stopwords = set(line.strip() for line in stopfile)
stopwords = stopwords.union({'<media', 'omitted>', "it's", 'two', 'said'})
# make an empty key-value dict to contain matched words and their counts
wordcount = {}
for word in a.lower().split(): # use the delimiter you want (a comma I think?)
    # remove punctuation so it isn't counted as part of a word
    word = word.replace(".", "")
    word = word.replace(",", "")
    word = word.replace("\"", "")
    word = word.replace("!", "")
    if word not in stopwords:
        if word not in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1
That should do it. The wordcount dict now maps each word to its frequency. After that, just sort it using collections and print it out.
import collections
word_counter = collections.Counter(wordcount)
for word, count in word_counter.most_common(20):
    print(word, ": ", count)
I hope this solves your problem. Lemme know if you face problems.
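As for the TypeError in the question: the message says a list is being indexed with a str, which is what happens when field positions taken from the command line (for example by splitting '1,2,3,4') are never converted to integers. Below is a sketch of one way to handle that. The function names are hypothetical, and it assumes the positions are 1-based, that the first row of each file is a header, and that spaces after commas should be ignored.

```python
import csv
import sys
from collections import Counter


def parse_positions(spec):
    """Turn a string such as '1,2,3,4' into zero-based integer indexes."""
    return [int(part) - 1 for part in spec.split(",")]


def count_field_values(rows, positions):
    """Count the values found at the given column indexes across all rows."""
    counts = Counter()
    for row in rows:
        for pos in positions:
            if 0 <= pos < len(row):  # ignore positions the row does not have
                counts[row[pos]] += 1
    return counts


def main(argv):
    positions = parse_positions(argv[1])
    totals = Counter()
    for path in argv[2:]:
        with open(path, newline="") as handle:
            reader = csv.reader(handle, skipinitialspace=True)
            next(reader, None)  # assumption: the first row is a header
            totals.update(count_field_values(reader, positions))
    for value, count in totals.most_common():
        print(value, count)


# Only run when positions and at least one file were given on the command line.
if __name__ == "__main__" and len(sys.argv) > 2:
    main(sys.argv)
```

Keeping the counting in its own function means it can be tried on plain lists of rows without touching any files.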

How to print multiple multiline strings from a list onto the same line

I have a list containing string patterns for digits 0-3. I am trying to print them onto the same line, so that print(digits[1]+col+digits[2]+col+digits[3]) prints '1 2 3' using the # pattern strings from the respective list indexes, but I can only get the number patterns printed on their own.
# Create strings for each number 0-3 and store in digits list.
zero = '#'*3+'\n'+'#'+' '+'#'+'\n'+'#'+' '+'#'+'\n'+'#'+' '+'#'+'\n'+'#'*3
one = '#\n'.rjust(4)*6
two = '#'*3+'\n'+'#'.rjust(3)+'\n'+'#'*3+'\n'+'#'.ljust(3)+'\n'+'#'*3
three = '#'*3+'\n'+'#'.rjust(3)+'\n'+'#'*3+'\n'+'#'.rjust(3)+'\n'+'#'*3
digits = [zero, one, two, three]
col = '\n'.ljust(1)*6 # A divider column between each printed digit.
print(digits[1]+col+digits[2]+col+digits[3],end='')
The result of the above code.
One way to solve this is to transpose the digits: right now each entry in the digits list holds a complete digit, but if we split each digit into its horizontal rows, we can print the same row of every digit on one line.
I think it is better shown in code: https://repl.it/#pavanskipo/DirectTriangularSlash
# Digits split into rows, so each entry holds the horizontal lines of one digit
digits_rev = [digit.split("\n") for digit in digits]
# zip stops at the shortest digit, so only rows that every digit has are printed
for row in zip(*digits_rev):
    print('\t'.join(row))
Click on the link and hit run, and let me know if it works.
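A more general version of the row-by-row idea is sketched below. The helper side_by_side and its gap parameter are hypothetical; it assumes that shorter blocks should be padded with blank rows and that every block should be padded to its own widest line so the columns stay aligned.

```python
from itertools import zip_longest


def side_by_side(blocks, gap="  "):
    """Join multiline strings so that they print next to each other."""
    columns = [block.splitlines() for block in blocks]
    # Width of each block is the length of its longest line.
    widths = [max((len(line) for line in col), default=0) for col in columns]
    # Pair up the n-th line of every block; missing lines become empty strings.
    rows = zip_longest(*columns, fillvalue="")
    return "\n".join(
        gap.join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in rows
    )


two = "###\n  #\n###\n#\n###"
three = "###\n  #\n###\n  #\n###"
print(side_by_side([two, three]))
```

Padding with spaces instead of tabs keeps the alignment independent of the terminal's tab width.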

Resources