Python: Group / rearrange text - python-3.x

I have the following files in a folder:
a235626_1.jpg
a235626_2.jpg
a235626_3.jpg
a235626_4.jpg
a235626_5.jpg
A331744R_1.JPG
A331744R_2.jpg
A331758L_1.JPG
A331758L_2.jpg
A331758R_1.JPG
A331758R_2.jpg
A331789L_1.JPG
A331789L_2.jpg
A331789R_1.JPG
A331789R_2.jpg
A331793L_1.JPG
A331793L_2.jpg
A331826L_1.JPG
A331826L_2.jpg
A331826R_1.JPG
A331826R_2.jpg
A335531L_1.JPG
A335531R_1.JPG
A335531R_2.jpg
How can I group them so that I get:
a235626_1.jpg|a235626_2.jpg|a235626_3.jpg|a235626_4.jpg|a235626_5.jpg
A331744R_1.JPG|A331744R_2.JPG
A331758L_1.JPG|A331758L_2.JPG
... and so on.
Thanks!

Use itertools.groupby:
from itertools import groupby

files = ['a235626_1.jpg', 'a235626_2.jpg', 'a235626_3.jpg', 'a235626_4.jpg', 'a235626_5.jpg', 'A331744R_1.JPG',
         'A331744R_2.jpg', 'A331758L_1.JPG', 'A331758L_2.jpg', 'A331758R_1.JPG', 'A331758R_2.jpg', 'A331789L_1.JPG',
         'A331789L_2.jpg', 'A331789R_1.JPG', 'A331789R_2.jpg', 'A331793L_1.JPG', 'A331793L_2.jpg', 'A331826L_1.JPG',
         'A331826L_2.jpg', 'A331826R_1.JPG', 'A331826R_2.jpg', 'A335531L_1.JPG', 'A335531R_1.JPG', 'A335531R_2.jpg']
for key, items in groupby(files, lambda t: t.split('_')[0]):
    print('|'.join(items))
This prints:
a235626_1.jpg|a235626_2.jpg|a235626_3.jpg|a235626_4.jpg|a235626_5.jpg
A331744R_1.JPG|A331744R_2.jpg
A331758L_1.JPG|A331758L_2.jpg
A331758R_1.JPG|A331758R_2.jpg
A331789L_1.JPG|A331789L_2.jpg
A331789R_1.JPG|A331789R_2.jpg
A331793L_1.JPG|A331793L_2.jpg
A331826L_1.JPG|A331826L_2.jpg
A331826R_1.JPG|A331826R_2.jpg
A335531L_1.JPG
A335531R_1.JPG|A335531R_2.jpg
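One caveat worth noting: groupby only merges consecutive items with the same key, so an unsorted listing (e.g. from os.listdir, whose order is arbitrary) should be sorted by the same key first. A minimal sketch with hypothetical names:

```python
from itertools import groupby

# groupby only merges *consecutive* items with the same key,
# so sort by that key first when the input order is not guaranteed.
files = ['b_1.jpg', 'a_1.jpg', 'b_2.jpg', 'a_2.jpg']  # hypothetical, out of order
prefix = lambda name: name.split('_')[0]
groups = ['|'.join(items) for _, items in groupby(sorted(files, key=prefix), key=prefix)]
print(groups)  # ['a_1.jpg|a_2.jpg', 'b_1.jpg|b_2.jpg']
```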

I think this question is quite similar to your question and it has been answered before. Can you take a look at it?
How to sort file names in a particular order using python


How to read in pandas column as column of lists?

Probably a simple solution, but I couldn't find a fix scrolling through previous questions, so I thought I would ask.
I'm reading in a CSV using pd.read_csv(). One column is giving me issues:
0 ['Bupa', 'O2', 'EE', 'Thomas Cook', 'YO! Sushi...
1 ['Marriott', 'Evans']
2 ['Toni & Guy', 'Holland & Barrett']
3 []
4 ['Royal Mail', 'Royal Mail']
It looks fine here but when I reference the first value in the column i get:
df['brand_list'][0]
Out : '[\'Bupa\', \'O2\', \'EE\', \'Thomas Cook\', \'YO! Sushi\', \'Costa\', \'Starbucks\', \'Apple Store\', \'HMV\', \'Marks & Spencer\', "Sainsbury\'s", \'Superdrug\', \'HSBC UK\', \'Boots\', \'3 Store\', \'Vodafone\', \'Marks & Spencer\', \'Clarks\', \'Carphone Warehouse\', \'Lloyds Bank\', \'Pret A Manger\', \'Sports Direct\', \'Currys PC World\', \'Warrens Bakery\', \'Primark\', "McDonald\'s", \'HSBC UK\', \'Aldi\', \'Premier Inn\', \'Starbucks\', \'Pizza Hut\', \'Ladbrokes\', \'Metro Bank\', \'Cotswold Outdoor\', \'Pret A Manger\', \'Wetherspoon\', \'Halfords\', \'John Lewis\', \'Waitrose\', \'Jessops\', \'Costa\', \'Lush\', \'Holland & Barrett\']'
Which is obviously a string, not a list as expected. How can I retain the list type when I read in this data?
I've tried the import ast method I've seen in other posts: df['brand_list_new'] = df['brand_list'].apply(lambda x: ast.literal_eval(x)), which didn't work.
I've also tried to replicate with dummy dataframes:
df1 = pd.DataFrame({'a': [['test', 'test1', 'test3'], ['test59'], ['test'], ['rhg', 'wreg']],
                    'b': [['erg', 'retbn', 'ert', 'eb'], ['g', 'eg', 'egr'], ['erg'], 'eg']})
df1['a'][0]
Out: ['test', 'test1', 'test3']
Which works as I would expect - this suggests to me that the solution lies in how I am importing the data
Apologies, I was being stupid. The following should work:
import ast
df['brand_list_new'] = df['brand_list'].apply(lambda x: ast.literal_eval(x))
df['brand_list_new'][0]
Out: ['Bupa','O2','EE','Thomas Cook','YO! Sushi',...]
As desired
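An alternative, if re-reading the file is an option, is to parse the column at load time with read_csv's converters argument, so the lists never arrive as strings in the first place. A sketch with hypothetical data mimicking the column above:

```python
import ast
import io

import pandas as pd

# Hypothetical CSV with a stringified-list column, like the one above.
csv_data = '''brand_list
"['Bupa', 'O2']"
"['Marriott', 'Evans']"
"[]"
'''

# converters runs ast.literal_eval on each cell while reading,
# so the column arrives as real Python lists.
df = pd.read_csv(io.StringIO(csv_data), converters={'brand_list': ast.literal_eval})
print(df['brand_list'][0])  # ['Bupa', 'O2']
```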

Python all possible combinations/permutation of X items and of length X+1

I've been searching everywhere but can't find anything for my issue.
Let's say I've got three numbers: ['1', '2', '3'].
I want, using itertools or not, all possible combinations/permutations of length 4 that contain each of these 3 numbers at least once (I don't want '1111' or '1221' and so on).
The wanted result would be like that :
1 2 3 1
1 1 2 3
2 2 3 1
from itertools import combinations_with_replacement as irep

res = [''.join(x) for x in irep('123', 4) if {'1', '2', '3'}.issubset(x)]
# output
# ['1123', '1223', '1233']
OR
from itertools import product

res = [''.join(x) for x in product('123', repeat=4) if {'1', '2', '3'}.issubset(x)]
# output
# ['1123', '1132', '1213', '1223', '1231', '1232', '1233',
# '1312', '1321', '1322', '1323', '1332', '2113', '2123',
# '2131', '2132', '2133', '2213', '2231', '2311', '2312',
# '2313', '2321', '2331', '3112', '3121', '3122', '3123',
# '3132', '3211', '3212', '3213', '3221', '3231', '3312', '3321']
import itertools

elements = ['1', '2', '3']
permutations = [''.join(combination)
                for combination in itertools.product(elements, repeat=4)
                if all(elem in combination for elem in elements)]
Does this produce what you are looking for?
The code produces the following output:
['1123', '1132', '1213', '1223', '1231', '1232', '1233', '1312', '1321', '1322', '1323', '1332', '2113', '2123', '2131', '2132', '2133', '2213', '2231', '2311', '2312', '2313', '2321', '2331', '3112', '3121', '3122', '3123', '3132', '3211', '3212', '3213', '3221', '3231', '3312', '3321']
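For what it's worth, the two filters (issubset and all(...)) are equivalent here, since product only ever draws from the three elements; a quick check, which also confirms the 36-string count:

```python
from itertools import product

elements = ['1', '2', '3']
# Both filters accept exactly the strings that contain every element at least once.
via_issubset = [''.join(p) for p in product(elements, repeat=4)
                if {'1', '2', '3'}.issubset(p)]
via_all = [''.join(p) for p in product(elements, repeat=4)
           if all(e in p for e in elements)]
print(via_issubset == via_all)  # True
print(len(via_all))             # 36, matching the listed output
```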

split username & password from URL in 3.8+ (splituser is deprecated, no alternative)

Trying to filter out the user/password from a URL.
(I could split it manually at the last '@' sign, but I'd rather use a parser.)
Python gives a deprecation warning, but urlparse() doesn't handle user/password.
Should I just trust the last '@' sign, or is there a new version of splituser?
Python 3.8.2 (default, Jul 16 2020, 14:00:26)
[GCC 9.3.0] on linux
>>> url = "http://usr:pswd@www.site.com/path&var=val"
>>> import urllib.parse
>>> urllib.parse.splituser(url)
<stdin>:1: DeprecationWarning: urllib.parse.splituser() is deprecated as of 3.8, use urllib.parse.urlparse() instead
('http://usr:pswd', 'www.site.com/path&var=val')
>>> urllib.parse.urlparse(url)
ParseResult(scheme='http', netloc='usr:pswd@www.site.com', path='/path&var=val', params='', query='', fragment='')
# neither with allow_fragments:
>>> urllib.parse.urlparse(url, allow_fragments=True)
ParseResult(scheme='http', netloc='us:passw@ktovet.com', path='/all', params='', query='var=val', fragment='')
(Edit: the repr() output is partial & misleading; see my answer.)
It's all there, clear and accessible.
What went wrong: The repr() here is misleading, showing only a few of the properties / values (why is another question).
The result is available with explicit property get:
>>> url = 'http://usr:pswd@www.sharat.uk:8082/nativ/page?vari=valu'
>>> p = urllib.parse.urlparse(url)
>>> p.port
8082
>>> p.hostname
'www.sharat.uk'
>>> p.password
'pswd'
>>> p.username
'usr'
>>> p.path
'/nativ/page'
>>> p.query
'vari=valu'
>>> p.scheme
'http'
Or as a one-liner (I just needed the domain):
>>> urllib.parse.urlparse('http://usr:pswd@www.sharat.uk:8082/nativ/page?vari=valu').hostname
'www.sharat.uk'
Looking at the source code for splituser, it looks like it simply uses str.rpartition:
def splituser(host):
    warnings.warn("urllib.parse.splituser() is deprecated as of 3.8, "
                  "use urllib.parse.urlparse() instead",
                  DeprecationWarning, stacklevel=2)
    return _splituser(host)

def _splituser(host):
    """splituser('user[:passwd]@host[:port]') --> 'user[:passwd]', 'host[:port]'."""
    user, delim, host = host.rpartition('@')
    return (user if delim else None), host
which, yes, relies on the last occurrence of '@'.
EDIT: urlparse still has all these fields, see Berry's answer
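Going the other direction, producing the URL with the credentials stripped, the parsed result can be rebuilt; a minimal sketch using the namedtuple's _replace method, with the same example URL as above:

```python
from urllib.parse import urlparse

url = 'http://usr:pswd@www.sharat.uk:8082/nativ/page?vari=valu'
p = urlparse(url)
# Rebuild the netloc from hostname (plus port, if any), dropping user:password.
netloc = p.hostname if p.port is None else f'{p.hostname}:{p.port}'
clean = p._replace(netloc=netloc).geturl()
print(clean)  # http://www.sharat.uk:8082/nativ/page?vari=valu
```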

trouble with transpose in pd.read_csv

I have a data in a CSV file structured like this
Installation Manufacturing Sales & Distribution Project Development Other
43,934 24,916 11,744 - 12,908
52,503 24,064 17,722 - 5,948
57,177 29,742 16,005 7,988 8,105
69,658 29,851 19,771 12,169 11,248
97,031 32,490 20,185 15,112 8,989
119,931 30,282 24,377 22,452 11,816
137,133 38,121 32,147 34,400 18,274
154,175 40,434 39,387 34,227 18,111
I want to skip the header and transpose the list like this
43934 52503 57177 69658 97031 119931 137133 154175
24916 24064 29742 29851 32490 30282 38121 40434
11744 17722 16005 19771 20185 24377 32147 39387
0 0 7988 12169 15112 22452 34400 34227
12908 5948 8105 11248 8989 11816 18274 18111
Here is my code
import pandas as pd
import csv
FileName = "C:/Users/kesid/Documents/Pthon/Pthon.csv"
data = pd.read_csv(FileName, header= None)
data = list(map(list, zip(*data)))
print(data)
I am getting the error "TypeError: zip argument #1 must support iteration". Any help much appreciated.
You can read_csv in "normal" mode:
df = pd.read_csv('input.csv')
(column names will be dropped later).
Start processing from replacing NaN with 0:
df.fillna(0, inplace=True)
Then use either df.values.T or df.values.T.tolist(), whatever
better suits your needs.
You should use skiprows=[0] to skip reading the first row and use .T to transpose
df = pd.read_csv(filename, skiprows=[0], header=None).T
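Putting the two answers together, a minimal sketch, assuming the file is whitespace-separated with ',' as the thousands separator and '-' marking missing values, as the sample suggests:

```python
import io

import pandas as pd

# Hypothetical data mirroring the file's layout (header row already skipped).
raw = """43,934 24,916 11,744 - 12,908
52,503 24,064 17,722 - 5,948
"""

# sep=r'\s+' splits on whitespace, thousands=',' parses '43,934' as 43934,
# and na_values='-' turns the dashes into NaN, which we then zero-fill.
df = pd.read_csv(io.StringIO(raw), sep=r'\s+', header=None,
                 thousands=',', na_values='-')
rows = df.fillna(0).astype(int).values.T.tolist()
print(rows)
# [[43934, 52503], [24916, 24064], [11744, 17722], [0, 0], [12908, 5948]]
```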

Get the group ID sorted from "/etc/group"

I'd like to manipulate "/etc/group".
In [39]: fp = open("/etc/group")
In [40]: content = [c.replace("\n", "") for c in fp.readlines()]
In [42]: content
Out[42]:
['root:x:0:',
'bin:x:1:',
'daemon:x:2:',
'sys:x:3:',
'adm:x:4:',
'tty:x:5:',
'disk:x:6:',
'lp:x:7:',
'mem:x:8:',
'kmem:x:9:',
'wheel:x:10:',
'cdrom:x:11:',
'mail:x:12:postfix',
'man:x:15:',
'dialout:x:18:',....]
A plain sorted(content) would order the lines alphabetically rather than by group ID, so I did this:
In [44]: sorted(content, key=lambda c:int(re.search(r"\d+",c).group()))
Out[44]:
['root:x:0:',
'bin:x:1:',
'daemon:x:2:',
'sys:x:3:',
'adm:x:4:',
'tty:x:5:',
'disk:x:6:',
'lp:x:7:',
'mem:x:8:',
'kmem:x:9:',
'wheel:x:10:',
'cdrom:x:11:',
'mail:x:12:postfix',
'man:x:15:',
'dialout:x:18:',
I got it done with re.search and a lambda in an unwieldy way;
could it be solved in a more elegant style?
Sort by the third colon-delimited field:
sorted(content, key=lambda x: int(x.split(':')[2]))
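On Unix systems, the standard-library grp module avoids parsing /etc/group by hand altogether; a sketch:

```python
import grp

# getgrall() returns struct_group entries with gr_name, gr_gid, gr_mem, etc.;
# sort them by the numeric gr_gid field.
groups = sorted(grp.getgrall(), key=lambda g: g.gr_gid)
for g in groups[:3]:
    print(f'{g.gr_name}:x:{g.gr_gid}:{",".join(g.gr_mem)}')
```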
