Why is Vim joining short lines when it should only break the long ones? - vim

I am trying to edit a file using Vim, an editor I have only just started to use.
This is the text that I want to fix (it is in Portuguese, but that is irrelevant to my question):
---
ENUM Questão 1
AREA ETHICS
Janaína é procuradora do município de Oceanópolis e atua, fora da carga horária demandada pela função, como advogada na sociedade de advogados Alfa, especializada em Direito Tributário. A profissional já foi professora na universidade estadual Beta, situada na localidade, tendo deixado o magistério há um ano, quando tomou posse como procuradora municipal.
As you can see, the paragraph starting with "Janaína é..." is too long. I am trying to wrap everything at 80 columns.
Hence, I did:
:set textwidth=80
And, in visual mode with all the text selected, I did:
gq
This is the final output:
--- ENUM Questão 1
AREA ETHICS
Janaína é procuradora do município de Oceanópolis e atua, fora da carga horária
demandada pela função, como advogada na sociedade de advogados Alfa,
especializada em Direito Tributário. A profissional já foi professora na
universidade estadual Beta, situada na localidade, tendo deixado o magistério há
um ano, quando tomou posse como procuradora municipal.
The final result is close to what I want. The only problem is the change from
---
ENUM Questão 1
to
--- ENUM Questão 1
I thought that :set textwidth=80 and :set columns=80 were commands meant to break lines that are too long. But, for some reason, formatting joins the short line containing --- with the line containing ENUM Questão [num].
Why is this happening?
How can I solve this?
Thanks.

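gq can do a lot of things, depending on the formatexpr, formatprg or, most likely, the formatoptions setting; see :h gq. In particular, gq reformats the selected text as paragraphs, so adjacent non-blank lines can be joined before they are wrapped again.
I would use the following regex, which inserts a line break after every 80 characters (even in the middle of a word) and never joins lines:
:%s/.\{80}/&\r/g
*Note: Vim also has the wrap option (:set wrap), which may help you. It only changes how long lines are displayed and does not modify the file.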
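The behaviour the question asks for (break long lines, never join short ones) can be sketched outside Vim as well. This is only an illustration in Python, not a Vim feature; wrap_long_lines is a hypothetical helper name. Unlike the regex above, it breaks at word boundaries where possible.

```python
import textwrap

def wrap_long_lines(text, width=80):
    """Wrap each over-long line on its own; lines that already fit are kept as-is."""
    wrapped = []
    for line in text.split("\n"):
        if len(line) <= width:
            # Short lines are never merged with their neighbours.
            wrapped.append(line)
        else:
            # textwrap.wrap breaks at whitespace where it can.
            wrapped.extend(textwrap.wrap(line, width=width))
    return "\n".join(wrapped)
```

Because every input line is handled separately, a line such as --- can never end up on the same output line as the text that follows it.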

Related

Pandas string encoding when retrieving cell value

I have the following Series:
s = pd.Series(['ANO DE LOS BÃEZ MH EE 3 201'])
When I print the series I get:
0 ANO DE LOS BÃEZ MH EE 3 201
But when I get the cell element, I get a hexadecimal escape in the string:
>>> s.iloc[0]
'ANO DE LOS BÃ\x81EZ MH EE 3 201'
Why does this happen, and how can I retrieve the cell value and get the string 'ANO DE LOS BÃEZ MH EE 3 201'?
Even though I am not really sure where the issue arose, I could solve it by using the unidecode package.
from unidecode import unidecode
output_string = unidecode(s.iloc[0])
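As background, the two displays differ because print shows str() of the value while the interactive prompt shows repr(), which escapes non-printable characters such as U+0081. The pair 'Ã' followed by '\x81' is also what the UTF-8 bytes of 'Á' look like when decoded as Latin-1. The sketch below assumes that this is what happened to the data; the question does not confirm it.

```python
# The cell value from the question, written with the hidden character made explicit.
value = 'ANO DE LOS BÃ\x81EZ MH EE 3 201'

# str() keeps U+0081 as a raw, invisible control character;
# repr() spells it out as the escape \x81.
shown_by_print = str(value)
shown_by_prompt = repr(value)

# Assumption: the text was UTF-8 that got decoded as Latin-1.
# Reversing that step recovers the accented letter.
repaired = value.encode('latin-1').decode('utf-8')
# repaired == 'ANO DE LOS BÁEZ MH EE 3 201'
```

If the assumption holds, the cleaner fix is to read the source file with the right encoding rather than to repair strings afterwards.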

Interpret communication from /dev files

I want to intercept the data emitted by the keyboard.
I know that each device has a file linked to it.
My question is how to intercept a specific USB port, and where to find the documentation needed to interpret the data.
Sorry for my English (Google Translate).
Based on: https://sergioprado.org/implementando-um-teclado-virtual-no-linux/
The files that abstract event capture from input devices such as mice, keyboards, joysticks and touch screens live under /dev/input
You can use the "evtest" tool to check which device corresponds to a given file, for example: evtest /dev/input/event4
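To show how the data in such a file can be interpreted, here is a minimal sketch that decodes the records the Linux kernel writes to an event file. It assumes a 64-bit Linux system, where each record is 24 bytes (two 64-bit time fields, a 16-bit type, a 16-bit code and a 32-bit value). The path /dev/input/event4 is just the example from above, reading it usually needs root permissions, and parse_event and watch_keys are hypothetical helper names.

```python
import struct

# struct input_event on 64-bit Linux (assumption): seconds, microseconds,
# event type, event code, value.
EVENT_FORMAT = "=qqHHi"
EVENT_SIZE = struct.calcsize(EVENT_FORMAT)

EV_KEY = 0x01  # event type used for key presses and releases
KEY_STATES = {0: "released", 1: "pressed", 2: "repeated"}

def parse_event(raw):
    """Decode one raw record into a dictionary."""
    seconds, microseconds, ev_type, code, value = struct.unpack(EVENT_FORMAT, raw)
    return {
        "time": seconds + microseconds / 1_000_000,
        "type": ev_type,
        "code": code,
        "value": value,
    }

def watch_keys(device_path="/dev/input/event4"):
    """Print key events from one device until the stream ends."""
    with open(device_path, "rb") as device:
        while True:
            raw = device.read(EVENT_SIZE)
            if len(raw) < EVENT_SIZE:
                break
            event = parse_event(raw)
            if event["type"] == EV_KEY:
                print(event["code"], KEY_STATES.get(event["value"], event["value"]))
```

Calling watch_keys() with the path reported by evtest prints one line per key event. The numeric codes are the key constants defined in the kernel header linux/input-event-codes.h.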

Handling data compression with non-ASCII values while reading and writing file

I am trying to learn lossless compression algorithms using Python 3. So far I have implemented Huffman coding, the Burrows-Wheeler transform and move-to-front, each of which can handle up to 256 unique characters based on their ASCII values. Basically, I read a UTF-8 text file, join its characters into a single string, and then transform that string to compress it. All the algorithms work perfectly, but the problem lies in reading a file with non-ASCII characters: if I read the file without encoding it, the code point of some special characters goes up to 8221 and the move-to-front algorithm gives this error:
ValueError: 8221 is not in list
To read the file I tried:
with open('test.txt','r',encoding='utf-8') as f:
    data = f.readlines()
charData = ''.join(str(x.encode('utf-8'))[2:-1] for x in data)
huffmanEncode(mtfEncoding(bwt_suffixArray(charData)))
This encodes each line and slices the b'' wrapper off its bytes representation,
which converts this-> 'you’ll have to check'
to this-> 'you\xe2\x80\x99ll have to check'
Now I input this string, compress it, then decompress it. Decompression works perfectly and I get back the string with the escaped bytes. My question is how to get the original content of the file back. I tried:
print(bytes(decompressedStr).decode('utf-8'))
#Gives:
>>>TypeError: string argument without an encoding
and:
print(codecs.encode(str,decompressedStr).decode('utf-8'))
#Gives same exact string back:
>>>you\xe2\x80\x99ll have to check
Is there a more efficient way to do this? If not, how do I convert a string that contains these escaped bytes back into the original UTF-8 text?
Compression algorithms work on bytes, which is what an encoded file contains. Open your original file in binary mode:
with open('test.txt','rb') as f:
    data = f.read()
Don't decode it to Unicode characters, whose ordinal values can be much larger than a byte. Compress the bytes, decompress the bytes, then decode the result to Unicode.
Full example:
#!python3
#coding:utf8
import lzma
text = '''Hola! Yo empecé aprendo Español hace dos mes en la escuela. Yo voy la universidad. Yo tratar estudioso Español tres hora todos los días para que yo saco mejor rápido. ¿Cosa algún yo debo hacer además construir mí vocabulario? Muchas veces yo estudioso la palabras solo para que yo construir mí voabulario rápido. Yo quiero empiezo leo el periódico Español la próxima semana. Por favor correcto algún la equivocaciónes yo hisciste. Gracias!'''
# Create a file containing non-ASCII characters:
with open('test.txt','w',encoding='utf8') as f:
    f.write(text)
# Read the raw bytes data.
with open('test.txt','rb') as f:
    data = f.read()
# Note: The file write/read can be skipped by encoding the original Unicode text
# to bytes manually.
#
# data = text.encode('utf8')
# Using a built-in Python compression/decompression algorithm.
compressed_data = lzma.compress(data)
decompressed_data = lzma.decompress(compressed_data)
print('original length =',len(data))
print('compressed length =',len(compressed_data))
print('decompressed length =',len(decompressed_data))
assert data == decompressed_data
# Now decode the byte data back to Unicode.
print(decompressed_data.decode('utf8'))
Output:
original length = 455
compressed length = 372
decompressed length = 455
Hola! Yo empecé aprendo Español hace dos mes en la escuela. Yo voy la universidad. Yo tratar estudioso Español tres hora todos los días para que yo saco mejor rápido. ¿Cosa algún yo debo hacer además construir mí vocabulario? Muchas veces yo estudioso la palabras solo para que yo construir mí voabulario rápido. Yo quiero empiezo leo el periódico Español la próxima semana. Por favor correcto algún la equivocaciónes yo hisciste. Gracias!
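The same bytes-first approach carries over to a hand-written stage such as the move-to-front transform from the question. The sketch below is not the asker's code; mtf_encode and mtf_decode are hypothetical names. Because the input is bytes, every symbol is in range(256) and the "is not in list" error cannot occur.

```python
def mtf_encode(data):
    """Move-to-front over bytes: each byte becomes its index in a 256-entry table."""
    table = list(range(256))
    indices = []
    for byte in data:
        index = table.index(byte)
        indices.append(index)
        # Move the symbol just seen to the front of the table.
        table.insert(0, table.pop(index))
    return indices

def mtf_decode(indices):
    """Inverse transform: rebuild the original bytes from the indices."""
    table = list(range(256))
    output = bytearray()
    for index in indices:
        byte = table.pop(index)
        output.append(byte)
        table.insert(0, byte)
    return bytes(output)

# Encode the text to bytes first, transform, invert, then decode once at the end.
original = 'you’ll have to check'
restored = mtf_decode(mtf_encode(original.encode('utf-8'))).decode('utf-8')
```

Here restored equals original, including the non-ASCII apostrophe, without any slicing of the b'' representation.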

Py3 Find common words in multiple strings

I have a problem and I can't find a good solution.
Here is the thing: I have a list that contains addresses:
addresses = ['Paris, 9 rue VoltaIre', 'Paris, 42 quai Voltaire', 'Paris, 87 rue VoLtAiRe']
After processing, I want to get:
to_upper_words = ['paris', 'voltaire']
Because 'Paris' and 'Voltaire' are in all strings.
The goal is to uppercase all common words, resulting in, for example:
'PARIS, 343 BOULEVARD SAINT-GERMAIN'
'PARIS, 458 BOULEVARD SAINT-GERMAIN'
'Paris', 'Boulevard', 'Saint-Germain' are in upper case because all strings (two in this case) have these words in common.
I found multiple solutions but only for two strings, and the result is ugly to apply for N strings.
If anyone can help...
Thank you in advance !
Remove the commas so that the words can be picked out simply by splitting on blanks. Then find the words that are common to all of the addresses using set intersection. (There is only one such word for my collection, namely Montréal.) Now look through the original collection of addresses, replacing each occurrence of a common word with its uppercase equivalent. The most likely mistake here is to assign only to the loop variable. That variable is just another name for the list element, and rebinding it leaves the list unchanged, which is why I assign the modified values back to addresses[i].
>>> addresses = ['Montréal, 3415 Chemin de la Côte-des-Neiges', 'Montréal, 3655 Avenue du Musée', 'Montréal, 3475 Rue de la Montagne', 'Montréal, 3450 Rue Drummond']
>>> sans_virgules = [address.replace(',', ' ') for address in addresses]
>>> listes_de_mots = [sans_virgule.split() for sans_virgule in sans_virgules]
>>> mots_en_commun = set(listes_de_mots[0])
>>> for liste in listes_de_mots[1:]:
...     mots_en_commun = mots_en_commun.intersection(set(liste))
...
>>> mots_en_commun
{'Montréal'}
>>> for i, address in enumerate(addresses):
...     for mot in mots_en_commun:
...         # Keep building on the updated string so every common word is replaced.
...         address = address.replace(mot, mot.upper())
...     addresses[i] = address
...
>>> addresses
['MONTRÉAL, 3415 Chemin de la Côte-des-Neiges', 'MONTRÉAL, 3655 Avenue du Musée', 'MONTRÉAL, 3475 Rue de la Montagne', 'MONTRÉAL, 3450 Rue Drummond']
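The addresses in the question differ in letter case ('VoltaIre', 'VoLtAiRe'), which a plain string comparison treats as different words. A possible variation is sketched below: it compares lowercased words and uppercases the matches with a regular expression. The helper name upper_common_words and the choice to treat any run of characters other than whitespace and commas as a word are assumptions, not part of the original answer.

```python
import re

# A "word" here is any run of characters that are neither whitespace nor commas.
WORD = re.compile(r"[^\s,]+")

def upper_common_words(addresses):
    """Uppercase every word that occurs, ignoring case, in all addresses."""
    word_sets = [{word.lower() for word in WORD.findall(a)} for a in addresses]
    common = set.intersection(*word_sets) if word_sets else set()

    def upper_if_common(match):
        word = match.group()
        return word.upper() if word.lower() in common else word

    return [WORD.sub(upper_if_common, a) for a in addresses]

result = upper_common_words(
    ['Paris, 9 rue VoltaIre', 'Paris, 42 quai Voltaire', 'Paris, 87 rue VoLtAiRe']
)
# result == ['PARIS, 9 rue VOLTAIRE', 'PARIS, 42 quai VOLTAIRE', 'PARIS, 87 rue VOLTAIRE']
```

Substituting token by token also avoids uppercasing a common word when it merely appears inside a longer word.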

count how many words from a word list appear in a cell

I'm trying to count how many words from a word list occur in a string in Excel, but I'm not sure how to do it. I found the formula =SUMPRODUCT(ISNUMBER(SEARCH(" "&Z$2:Z$20&","," "&A1&","))+0) in another post, but I can't make it work for me.
data example (strings):
text
1335 Nu standaard 3 dagen #Full #Membership bij #LACH , #dating voor #50plus http://t.co/7urme1Huhq http://t.co/ZzGWnKNdgx
1334 De 50plus cliché-poll vul hem in en we zien direct of de clichés waar zijn,
1282 #Koopkracht staat centraal in de campagne van #50Plus. Volgens mij maken veel ouderen zich zorgen over heel andere zaken. #PS2015
1278 #ps15 #50Plus: Dramatische gevolgen afschaffen ouderentoeslag http://t.co/nJW8kQRTtu Zie http://t.co/bFGtRDhF7e
1276 Campagnebussen in beeld. Mijn favoriet: de Fiat Babyboom van 50Plus. http://t.co/Wji8fL3gDp
1275 50Plus staat alleen in stelling over sociaal beleid http://t.co/79TF5PWN2j
1266 .#Struijlaard #50PLUS Zuid-Holland trekt met een camper de hele provincie door. Vandaag tot 17.00 uur in Leerdam. #PS15
1262 Stem #50PLUS op 18 maart !
1261 Boek van 98-jarige zangeres http://t.co/xxFjkTxYDY #vrouwen #50plus #muziek #Gog #Dings #Liessel
1257 #senfkarin kwam je op 50plus?
Word list:
achterlijk
bewogen
bovenkomen
brutaliteit
dramatisch
gehoorzamen
groen
illuminatie
incongruent
kwaadspreken
ontmoeting
ontredderd
recidiveren
roestig
ruw
secuur
steenkoud
tippelen
treffen
verbleken
verkijken
verlevendigen
vertillen
vorig
wezen
zorg
zwier
I think the commas are messing it up, it should be
=SUMPRODUCT(ISNUMBER(SEARCH(" "&Z$2:Z$20&" "," "&A1&" "))+0)
So it matches the whole word plus the spaces at each end, even when it is the first or last word in the cell.
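The padding trick in the formula can be sketched in Python to show why it works. This only illustrates the logic and is not a replacement for the formula; count_listed_words is a hypothetical name, and the first sample sentence is made up for the example. Like SEARCH, the comparison ignores case.

```python
def count_listed_words(cell, word_list):
    """Count how many listed words occur in the cell as whole, space-delimited words."""
    # Pad the cell so that its first and last words also have a space on each side.
    padded = f" {cell} ".lower()
    return sum(f" {word} ".lower() in padded for word in word_list)

words = ["groen", "ruw", "zorg"]
count = count_listed_words("het gras is groen en ruw", words)
# count == 2: "groen" and "ruw" match, "zorg" does not.

# "zorg" is not counted inside "zorgen", because the trailing space is part of the pattern.
no_partial = count_listed_words("veel ouderen maken zich zorgen", words)
# no_partial == 0
```

A word directly followed by punctuation, such as "zorg," or "zorg.", is not counted by this approach either, since the pattern requires a space after the word.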
