how to format sublists in a tabular way automatically? - python-3.x

I have the following list, which contains sublists
tableData = [['apples', 'oranges', 'cherries', 'banana'],['Alice', 'Bob', 'Carol', 'David'], ['dogs', 'cats', 'moose', 'goose']]
My aim is to format them that way
apples Alice dogs
oranges Bob cats
cherries Carol moose
banana David goose
The code below does that:
for nested_list in zip(*tableData):
    print("{:>9} {:>9} {:>9}".format(*nested_list))
Yet what bugs me is that I have to specify the format for each sublist manually.
I've been trying to find a way to do it automatically with a for loop, but I haven't found anything relevant.
Any tips are more than welcome.
Thanks.

How about this:
for line in zip(*tableData):
    for word in line:
        print("{:>9}".format(word), end=' ')
    print()
Explanation
If the final print() were absent, all the sublists would end up on a single line, like this:
apples Alice dogs oranges Bob cats cherries Carol moose banana David goose
The bare print() emits the newline that separates the rows.

If you just want to use {:>9} as the format code with an arbitrary number of columns, try this:
fieldFormat = ' '.join(['{:>9}'] * len(tableData))
for nestedList in zip(*tableData):
    print(fieldFormat.format(*nestedList))
This just creates a list of {:>9} format specifiers, one for each column in tableData, then joins them together with spaces.
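For the question's three-column data, the joined string comes out like this (a quick check of the technique above):

```python
# Build the repeated format string for as many columns as tableData has
tableData = [['apples', 'oranges', 'cherries', 'banana'],
             ['Alice', 'Bob', 'Carol', 'David'],
             ['dogs', 'cats', 'moose', 'goose']]
fieldFormat = ' '.join(['{:>9}'] * len(tableData))
print(repr(fieldFormat))  # '{:>9} {:>9} {:>9}'
```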
If you want to automatically calculate the field widths as well, you can do this:
fieldWidths = [max(len(word) for word in col) for col in tableData]
fieldFormat = ' '.join('{{:>{}}}'.format(wid) for wid in fieldWidths)
for nestedList in zip(*tableData):
    print(fieldFormat.format(*nestedList))
fieldWidths is generated from a list comprehension that calculates the maximum length of each word in each column. From the inside:
(len(word) for word in col)
This is a generator that will produce the length of each word in col.
max(len(word) for word in col)
Feeding the generator (or any iterable) into max will calculate the maximum value of everything produced by the iterable.
[max(len(word) for word in col) for col in tableData]
This list comprehension produces the maximum length of all words in each column col of data in tableData.
fieldFormat is then produced by transforming fieldWidths into format specifiers. Again from the inside:
'{{:>{}}}'.format(wid)
This formats wid into the {:>#} format. {{ is a way to have a format specifier produce a {; similarly, }} produces }. The {} in the middle is what actually gets formatted with wid.
('{{:>{}}}'.format(wid) for wid in fieldWidths)
This is a generator expression that does the above formatting for each width listed in fieldWidths.
fieldFormat = ' '.join('{{:>{}}}'.format(wid) for wid in fieldWidths)
This just joins those formats together with spaces in between to create the fieldFormat format specifier.
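Putting the pieces together on the question's data, the intermediate values look like this:

```python
# End-to-end sketch of the auto-width approach, using the sample tableData
tableData = [['apples', 'oranges', 'cherries', 'banana'],
             ['Alice', 'Bob', 'Carol', 'David'],
             ['dogs', 'cats', 'moose', 'goose']]

# Longest word per column: 'cherries' (8), 'Alice'/'Carol'/'David' (5), 'moose'/'goose' (5)
fieldWidths = [max(len(word) for word in col) for col in tableData]
print(fieldWidths)   # [8, 5, 5]

fieldFormat = ' '.join('{{:>{}}}'.format(wid) for wid in fieldWidths)
print(fieldFormat)   # {:>8} {:>5} {:>5}

for nestedList in zip(*tableData):
    print(fieldFormat.format(*nestedList))
```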

Related

How to check if a word in one csv exist in another column of another csv file

I have 2 CSV files. One is dictionary.csv, which contains a list of words; the other is story.csv. story.csv has many columns, and one of them, news_story, contains lots of words. I want to check whether the words from dictionary.csv exist in the news_story column, and then write every row whose news_story column contains one of those words to a new CSV file called New.csv.
This is the code I have tried so far:
import csv
import pandas as pd
news=pd.read_csv("story.csv")
dictionary=pd.read_csv("dictionary.csv")
pattern = '|'.join(dictionary)
exist=news['news_story'].str.contains(pattern)
for CHECK in exist:
    if not CHECK:
        news['NEWcolumn'] = 'NO'
    else:
        news['NEWcolumn'] = 'YES'
news.to_csv('New.csv')
I kept getting 'NO' even though there should be some 'YES' values.
story.csv
news_url news_title news_date news_story
goog.com functional 2019 This story is about a functional requirement
live.com pbandJ 2001 I made a sandwich today
key.com uAndI 1992 A code name of a spy
dictionary.csv
red
tie
lace
books
functional
New.csv
news_url news_title news_date news_story
goog.com functional 2019 This story is about a functional requirement
First, read the words into a Series: pass header=None so the first value is not consumed as a header, and squeeze=True so read_csv returns a Series rather than a DataFrame:
dictionary=pd.read_csv("dictionary.csv", header=None, squeeze=True)
print (dictionary)
0 red
1 tie
2 lace
3 books
4 functional
Name: 0, dtype: object
pattern = '|'.join(dictionary)
#for avoid match substrings use words boundaries
#pattern = '|'.join(r"\b{}\b".format(x) for x in dictionary)
Last, filter by boolean indexing:
exist = news['news_story'].str.contains(pattern)
news[exist].to_csv('New.csv')
Detail:
print (news[exist])
   news_url news_title  news_date                                    news_story
0  goog.com functional       2019  This story is about a functional requirement
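The commented-out word-boundary pattern above matters when a dictionary word can occur inside a longer word. A quick check with plain re (str.contains uses the same regex engine; the sample sentence is my own):

```python
import re

words = ['red', 'tie', 'functional']
plain = '|'.join(words)
bounded = '|'.join(r"\b{}\b".format(x) for x in words)

story = 'He ties his shoes'
print(bool(re.search(plain, story)))    # True  -- 'tie' matches inside 'ties'
print(bool(re.search(bounded, story)))  # False -- no standalone word from the list
```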

How to convert negations and single words with same repetitive letter

I have a data frame that has a column with text data in it. I want to remove words that mean nothing and convert negations like "isn't" to "is not" in the text data, because when I remove the punctuation, "isn't" becomes "isn t", and when I then remove words shorter than 2 letters, the "t" gets deleted completely. So, I want to do the following 3 tasks:
1) convert negations like "isn't" to "is not"
2) remove words that mean nothing
3) remove words shorter than 2 letters
For eg, the df column looks similar to this-
user_id text data column
1 it's the coldest day
2 they aren't going
3 aa
4 how are you jkhf
5 v
6 ps
7 jkhf
The output should be-
user_id text data column
1 it is the coldest day
2 they are not going
3
4 how are you
5
6
7
How to implement this?
def is_repetitive(w):
    """Predicate, true for words like jj or aaaaa."""
    w = str(w)  # caller should have provided a single word as input
    return len(w) > 1 and all(c == w[0] for c in w[1:])
Feed all words in the corpus to that function to accumulate a list of repetitive words, then add those words to your list of stop words.
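Accumulating such words might look like this (the corpus here is a toy example of my own):

```python
def is_repetitive(w):
    """Predicate, true for words like jj or aaaaa."""
    w = str(w)
    return len(w) > 1 and all(c == w[0] for c in w[1:])

corpus = ['aa', 'how', 'are', 'you', 'jkhf', 'v', 'ps', 'bbb']
extra_stop_words = [w for w in corpus if is_repetitive(w)]
print(extra_stop_words)  # ['aa', 'bbb']
```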
1) Use SpaCy or NLTK's lemmatization tools to convert strings (though they do other things like convert plural to singular as well - so you may end up needing to write your own code to do this).
2) Use stopwords from NLTK or spacy to remove the obvious stop words. Alternatively, feed them your own list of stop words (their default stop words are things like is, a, the).
3) Use a basic filter: if len < 2, remove the row.
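A minimal sketch of the three steps without SpaCy/NLTK; CONTRACTIONS and clean_text are my own names, and the contraction map is deliberately tiny (a real one would be far larger):

```python
CONTRACTIONS = {"isn't": "is not", "aren't": "are not", "it's": "it is"}

def clean_text(text, stop_words=frozenset()):
    # Step 1: expand negations/contractions word by word
    words = [CONTRACTIONS.get(w.lower(), w) for w in text.split()]
    # Re-split, because an expansion like "is not" introduces a new word
    words = ' '.join(words).split()
    # Steps 2 and 3: drop stop words and words shorter than 2 letters
    return ' '.join(w for w in words if len(w) >= 2 and w not in stop_words)

print(clean_text("it's the coldest day"))           # it is the coldest day
print(clean_text("aa", stop_words={'aa', 'jkhf'}))  # (empty string)
```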

Retrieve first element in a column list and sum over it (e.g. if first element = k, sum over k) in python

really sorry if this has been answered already, I'm new to python and might have been searching for the wrong terminology.
I'm working with the US Baby name data as in Python for Data Analysis, 2nd ed. Basically I've concatenated the datasets into a df called name_df that looks like:
id name births
1 Aaron 20304
2 Adam 10000
etc.
I'm looking to sum over the first letter of each name element if it is a K (or any other letter). I'm struggling to get the first element part though - here is what I have so far:
count = 0
letter = ['K']
for n in ['name']:
    if name_df['name'][0] == letter:
        count += 1
    else:
        count += 0
print(count)
Clearly that just checks the first element. Do I need to use some sort of slicing technique instead?
Would you like to count the distinct names starting with 'K'?
len([n for n in name_df['name'] if n[0]=='K'])
Or do you want to sum up to get the number of babies?
sum([c for n,c in name_df[['name','births']].values if n[0]=='K'])
Or with more 'pandaish' syntax:
sum(name_df.loc[name_df['name'].str[0]=='K','births'])
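Both counts on a toy name_df (the values below are made up stand-ins for the real US Baby Names data):

```python
import pandas as pd

name_df = pd.DataFrame({'name': ['Aaron', 'Adam', 'Karen', 'Kyle'],
                        'births': [20304, 10000, 500, 700]})

# Number of distinct names starting with 'K'
k_names = len([n for n in name_df['name'] if n[0] == 'K'])
print(k_names)   # 2

# Total births for names starting with 'K'
k_births = sum(name_df.loc[name_df['name'].str[0] == 'K', 'births'])
print(k_births)  # 1200
```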

How to count a specific word separated by paragraphs?

So I want to be able to count the number of times a certain sequence such as "AGCT" appears in a document full of letters. However I don't just want the total amount in the document, I want how many times it shows up separated by ">".
So for example if the document contained: asdflkdafagctalkjsdjagctlkdjf>asdlfkjaagct>adjkhfhAGCTlksdjfagct>...
It would tell me:
2
1
1
since the sequence "AGCT" appears twice before the first ">" and once after the next one and once more after the third one and so on.
I do not know how to do this and any help would be appreciated.
You can use a combination of string methods and a Python list comprehension like this:
Split your text into paragraphs, and for each paragraph count the occurrences of the wanted substring. It is actually more concise in Python than in English:
>>> mytext = "asdflkdafagctalkjsdjagctlkdjf>asdlfkjaagct>adjkhfhAGCTlksdjfagct>"
>>> count = [para.count("agct") for para in mytext.split(">")]
>>> count
[2, 1, 1, 0]
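Note that the count above is case-sensitive, so the uppercase AGCT in the third paragraph is not counted. If it should be, lowercase each paragraph first:

```python
mytext = "asdflkdafagctalkjsdjagctlkdjf>asdlfkjaagct>adjkhfhAGCTlksdjfagct>"

# Case-insensitive variant: the uppercase AGCT now counts too
counts = [para.lower().count("agct") for para in mytext.split(">")]
print(counts)  # [2, 1, 2, 0]
```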

How to find duplicates by row in excel

I've been looking for a way to see which rows have duplicate words in them.
If a word matches in both column A and column C, I would like to add an "X" to column B. The whole cell shouldn't have to be exactly the same; for example, John Miller and Miller,J should still match. The match should only be against words in the same row, not the entire column. I have 50k-plus rows to work through, so I'm looking for a better way; any help would really be appreciated.
Here's what it looks like:
A            B  C
Jf Wepener   .  Lourens Johannes Stephanus
Me Horn      x  Horn Maria Elizabeth
Jg Waldeck   x  Waldeck Johan George
Pj Du Preez  x  Preez Paulus Jacobus Du
In Excel you need a long formula for this to work well. The scheme is the following.
First, split the string in column A for searching, using these helper columns:
D: =TRIM(IFERROR(LEFT(A2;SEARCH(" ";A2;1));A2))
E: =TRIM(IFERROR(LEFT(MID(A2;LEN(D2)+2;99);SEARCH(" ";MID(A2;LEN(D2)+2;99);1));MID(A2;LEN(D2)+2;99)))
F: =TRIM(IFERROR(LEFT(MID(A2;LEN(D2)+LEN(E2)+3;99);SEARCH(" ";MID(A2;LEN(D2)+LEN(E2)+3;99);1));MID(A2;LEN(D2)+LEN(E2)+3;99)))
Then, in column B:
=IF(OR(AND(ISNUMBER(SEARCH(D2;C2));IF(D2="";FALSE;TRUE));AND(ISNUMBER(SEARCH(E2;C2));IF(E2="";FALSE;TRUE));AND(ISNUMBER(SEARCH(F2;C2));IF(F2="";FALSE;TRUE)));"x";"")
This way it works whether or not there is a trailing space. If column A can contain more words, you will need to add more helper formulas. You can split into helper columns and HIDE the ones that are not needed for the results.
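Since this page is Python-tagged, here is a pandas sketch of the same per-row check (toy rows mirroring the sample; the column names A/B/C and shares_word are my own):

```python
import pandas as pd

df = pd.DataFrame({'A': ['Jf Wepener', 'Me Horn', 'Jg Waldeck'],
                   'C': ['Lourens Johannes Stephanus',
                         'Horn Maria Elizabeth',
                         'Waldeck Johan George']})

def shares_word(row):
    # True when any whole word from column A also appears in column C
    return bool(set(row['A'].lower().split()) & set(row['C'].lower().split()))

df['B'] = ['x' if shares_word(r) else '' for _, r in df.iterrows()]
print(df['B'].tolist())  # ['', 'x', 'x']
```

Note that whole-word matching would miss a case like "Miller,J" against "John Miller"; punctuation would need stripping first for that to match.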
