How to split compound words in pandas? - python-3.x

I have a document that contains many compound (or sometimes accidentally combined) words, for example:
document.csv
index text
0 my first java code was helloworld
1 my cardoor is totally broken
2 I will buy a screwdriver to fix my bike
As seen above, some words are combined or compound. I am using the compound word splitter from here to fix this, but I have trouble applying it to each row of my document (as a pandas Series) and converting the document into a clean form like this:
cleanDocument.csv
index text
0 my first java code was hello world
1 my car door is totally broken
2 I will buy a screw driver to fix my bike
(I am aware that words such as screwdriver should stay together, but my goal is to clean the document.) If you have a better idea for splitting only the combined words, please let me know.
The splitter code may work like this:
import pandas as pd
import splitter  # uses the enchant dictionary (requires pip install enchant)
data = pd.read_csv('document.csv')
Then it should be used something like:
splitter.split(data)  # ???
I already looked into something like this, but it does not work in my case. Thanks.

Use apply with axis=1. Can you try the following:
data.apply(lambda x: [splitter.split(j) for j in x['text'].split()], axis=1)
I do not have splitter installed on my system. Judging by the link you provided, the following code should work. Can you try it:
def handle_list(m):
    ret_lst = []
    L = m['text'].split()
    for wrd in L:
        # splitter.split returns the parts of a compound, or an empty result
        g = splitter.split(wrd)
        if g:
            ret_lst.extend(g)
        else:
            ret_lst.append(wrd)
    return ret_lst
data.apply(handle_list, axis=1)
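For readers who do not have splitter installed either, here is a minimal, self-contained sketch of the same row-wise idea. The split_compound helper and the tiny VOCAB set are hypothetical stand-ins for splitter.split and its enchant dictionary; they are assumptions made for illustration and not the library's actual algorithm.

```python
import pandas as pd

# Hypothetical toy vocabulary standing in for the enchant dictionary.
VOCAB = {"hello", "world", "car", "door", "screw", "driver"}

def split_compound(word):
    """Return [left, right] if word is two vocabulary words glued together, else [word]."""
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in VOCAB and right in VOCAB:
            return [left, right]
    return [word]

def clean_text(text):
    """Split every compound in a sentence and rejoin the parts with single spaces."""
    parts = []
    for word in text.split():
        parts.extend(split_compound(word))
    return " ".join(parts)

data = pd.DataFrame({"text": [
    "my first java code was helloworld",
    "my cardoor is totally broken",
    "I will buy a screwdriver to fix my bike",
]})

# Apply the cleaner to the text column only, so no axis argument is needed.
data["text"] = data["text"].apply(clean_text)
print(data["text"].tolist())
```

Applying the function to the single text column with Series.apply avoids the row-wise axis=1 call, and joining the parts gives back a string per row rather than a list.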

Related

Does anyone know how to pull averages from a text file for several different people?

I have a text file (Player_hits.txt) that I am trying to pull player batting averages from. Similar to lines 179-189, I want to find an average. However, I do not want the average for the entire team. Instead, I want the average of every individual player on the team.
For instance, the text file is set up as such:
Player_hits.txt
In this file a 1 defines a hit and a 0 means the player did not get a hit. I am trying to pull an individual average for both players. (Alex = 0.500, Riley = 0.666)
If someone could help, that would be greatly appreciated!
Thanks!
Link to original code on repl.it: Baseball Stat-Tracking
The json.decoder.JSONDecodeError occurs because json.loads() does not accept each line (for example "[1, 'Riley']\n") as valid JSON: JSON strings must use double quotes. You can use ast.literal_eval instead to evaluate each line as a Python literal, storing it as a list element such as [1, 'Riley'] in your p_hits list.
The second part is to convert the list to a DataFrame and group by the 'name' column. So jim has the right idea, but there are errors in that answer too (i.e. colmuns should be columns, and the items in the list need to be strings, ['hit', 'name'], not undeclared variables).
import pandas as pd
import ast
p_hits = []
with open('Player_hits.txt') as hits:
    for line in hits:
        l = ast.literal_eval(line)
        p_hits.append(l)
df = pd.DataFrame(p_hits, columns=['hit', 'name'])
Output, with an example dataset I made:
print(df.groupby(['name']).mean())
            hit
name
Matt   0.714286
Riley  0.285714
Todd   0.500000
import pandas as pd
import json
p_hits = []
with open('Player_hits.txt') as hits:
    for line in hits:
        l = json.loads(line)
        p_hits.append(l)
df = pd.DataFrame.from_records(p_hits, colmuns=[hit, name])
df.groupby(['name']).mean()
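A small self-contained sketch of the ast.literal_eval approach, using in-memory lines in place of Player_hits.txt. The sample lines are hypothetical and were chosen so the averages match the ones quoted in the question (Alex = 0.500, Riley about 0.667).

```python
import ast
import pandas as pd

# Hypothetical file contents: one "[hit, 'name']" list literal per line.
lines = [
    "[1, 'Alex']\n",
    "[0, 'Alex']\n",
    "[1, 'Riley']\n",
    "[1, 'Riley']\n",
    "[0, 'Riley']\n",
]

# ast.literal_eval accepts Python literals, including single-quoted strings,
# which json.loads would reject.
p_hits = [ast.literal_eval(line.strip()) for line in lines]

df = pd.DataFrame(p_hits, columns=["hit", "name"])

# Mean of the 0/1 hit column per player is that player's batting average.
averages = df.groupby("name")["hit"].mean()
print(averages.round(3))
```

Selecting the hit column before calling mean() returns a Series indexed by player name, which is convenient for looking up a single player with averages["Riley"].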

Python: Using Pandas and Regex to Clean Phone Numbers with Country Code

I'm attempting to use pandas to clean phone numbers so that only the 10-digit phone number is returned, with the country code (if present) and any special characters removed.
Here's some sample code:
import pandas
phone_series = pandas.Series(['+1(192) 167-4056', '123-345-4568', '1233455678', '(123) 123-1234'])
r1 = '[^0-9]+'
phone_series.str.replace(r1, '', regex=True)  # regex=True is required in pandas 2.0+
Returns
0 11921674056
1 1233454568
2 1233455678
3 1231231234
As you can see, this regex works well except for the country code. Unfortunately, the system I'm loading into cannot accept the country code. What I'm struggling with is finding a regex that will strip the country code as well. All the regexes I've found match the 10 digits I need, whereas with pandas str.replace I need a pattern that does not match them.
I could easily write a function and use .apply, but I feel like there is likely a simple regex solution that I'm missing.
Thanks for any help!
I don't think regex is necessary here, which is nice because regex is a pain in the buns.
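To extend your current solution:
phone_series = pandas.Series(['+1(192) 167-4056', '123-345-4568', '1233455678', '(123) 123-1234'])
r1 = '[^0-9]+'
# Assign the result; str.replace does not modify the Series in place
phone_series = phone_series.str.replace(r1, '', regex=True)
# Keep only the last 10 digits, which drops any leading country code
phone_series = phone_series.apply(lambda x: x[-10:])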
My lazier solution:
>>> phone_series = pd.Series(['+1(192) 167-4056', '123-345-4568', '1233455678', '(123) 123-1234'])
>>> p2 = phone_series.apply(lambda x: ''.join([i for i in x if str.isnumeric(i)])[-10:])
>>> p2
0 1921674056
1 1233454568
2 1233455678
3 1231231234
dtype: object
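For completeness, a regex-only variant along the lines the question asked for might look like the sketch below. It assumes the only country code present is +1, so a leading 1 is dropped only when exactly 10 digits follow it; that assumption is mine and not something the question states.

```python
import pandas as pd

phone_series = pd.Series(
    ["+1(192) 167-4056", "123-345-4568", "1233455678", "(123) 123-1234"]
)

# Step 1: remove every non-digit character.
digits = phone_series.str.replace(r"\D+", "", regex=True)

# Step 2: drop a leading "1" only when exactly 10 digits follow it
# (assumes the only country code in the data is +1).
cleaned = digits.str.replace(r"^1(?=\d{10}$)", "", regex=True)
print(cleaned.tolist())
```

Unlike slicing the last 10 characters, the lookahead leaves 10-digit numbers that happen to start with 1 untouched and does not silently truncate longer inputs.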

Convert everything in a dictionary to lower case, then filter on it?

import pandas as pd
import nltk
import os
directory = os.listdir(r"C:\...")
x = []
num = 0
for i in directory:
    x.append(pd.read_fwf("C:\\..." + i))
    x[num] = x[num].to_string()
    num += 1
So, once I have a dictionary x = [ ] populated by the read_fwf for each file in my directory:
I want to know how to make it so every single character is lowercase. I am having trouble understanding the syntax and how it is applied to a dictionary.
I want to define a filter that I can use to count for a list of words in this newly defined dictionary, e.g.,
list = [bus, car, train, aeroplane, tram, ...]
Edit: Quick unrelated question:
Is pd.read_fwf the best way to read .txt files? If not, what else could I use?
Any help is very much appreciated. Thanks
Edit 2: Sample data and output that I want:
Sample:
The Horncastle boar's head is an early seventh-century Anglo-Saxon
ornament depicting a boar that probably was once part of the crest of
a helmet. It was discovered in 2002 by a metal detectorist searching
in the town of Horncastle, Lincolnshire. It was reported as found
treasure and acquired for £15,000 by the City and County Museum, where
it is on permanent display.
Required output - changes everything in uppercase to lowercase:
the horncastle boar's head is an early seventh-century anglo-saxon
ornament depicting a boar that probably was once part of the crest of
a helmet. it was discovered in 2002 by a metal detectorist searching
in the town of horncastle, lincolnshire. it was reported as found
treasure and acquired for £15,000 by the city and county museum, where
it is on permanent display.
You shouldn't need to use pandas or dictionaries at all. Just use Python's built-in open() function:
# Open a file in read mode with a context manager
with open(r'C:\path\to\you\file.txt', 'r') as file:
    # Read the file into a string
    text = file.read()
# Use the string's lower() method to make everything lowercase
text = text.lower()
print(text)
# Split text by whitespace into list of words
word_list = text.split()
# Get the number of elements in the list (the word count)
word_count = len(word_list)
print(word_count)
If you want, you can do it in the reverse order:
# Open a file in read mode with a context manager
with open(r'C:\path\to\you\file.txt', 'r') as file:
    # Read the file into a string
    text = file.read()
# Split text by whitespace into list of words
word_list = text.split()
# Use list comprehension to create a new list with the lower() method applied to each word.
lowercase_word_list = [word.lower() for word in word_list]
print(lowercase_word_list)
Using a context manager for this is good since it automatically closes the file for you as soon as execution leaves the with block (the point where the code is de-indented). Otherwise you would have to call open() and file.close() yourself.
I think there are some other benefits to using context managers, but someone please correct me if I'm wrong.
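The question also asks how to count a given list of words in the lowercased text. One possible sketch uses collections.Counter; the sentence is shortened from the question's sample, and the targets list is a hypothetical example.

```python
from collections import Counter

# Shortened sentence adapted from the sample text in the question.
text = "It was reported as found treasure and acquired by the City and County Museum"

# Hypothetical list of words to look for.
targets = ["and", "city", "bus"]

# Lowercase first, split on whitespace, then count every word once.
counts = Counter(text.lower().split())

# A Counter returns 0 for words it has never seen, so missing targets are safe.
target_counts = {word: counts[word] for word in targets}
print(target_counts)  # {'and': 2, 'city': 1, 'bus': 0}
```

Note that splitting on whitespace leaves punctuation attached to words (for example "museum," with a trailing comma), so real text may need punctuation stripped before counting.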
I think what you are looking for is dictionary comprehension:
# Python 3
new_dict = {key: val.lower() for key, val in old_dict.items()}
# Python 2
new_dict = {key: val.lower() for key, val in old_dict.iteritems()}
items() (iteritems() in Python 2) lets you iterate over the (key, value) pairs in the dictionary (e.g. ('somekey', 'SomeValue'), ('somekey2', 'SomeValue2')).
The comprehension iterates over each of these pairs, creating a new dictionary in the process. In the key: val.lower() section, you can do whatever manipulation you want to create the new dictionary.

How to read the first N words from each row in Python 3

I am reading an Excel file that has free text in a column. After reading the file with pandas, I want to keep only the first N words of the text column in each row. I tried everything but could not make it work.
data["text"] = I am going to school and I bought something from market.
But I just want to read the first 5 words, so that it looks like this:
data["text"] = I am going to school.
I want the same operation to be applied to each row of the data["text"] column.
Your help will be highly appreciated.
def first_k(s: str, k=5) -> str:
    s = str(s)  # just in case something like NaN tries to sneak in there
    first_words = s.split()[:k]
    return ' '.join(first_words)
Then, apply the function:
data['text'] = data['text'].apply(first_k)
data["text"] = [' '.join(s.split(' ')[:5]) for s in data["text"].values]
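The same result can also be sketched with pandas' vectorized string methods instead of apply or a list comprehension. The sample rows below are hypothetical, apart from the sentence taken from the question.

```python
import pandas as pd

data = pd.DataFrame({"text": [
    "I am going to school and I bought something from market.",
    "short row",
    None,
]})

# .str.split() splits each row on whitespace, .str[:5] keeps the first five
# words of each resulting list, and .str.join(" ") glues them back together.
# Missing values stay missing instead of becoming the string "nan".
data["text"] = data["text"].str.split().str[:5].str.join(" ")
print(data["text"].iloc[0])  # I am going to school
```

Rows with fewer than five words are returned unchanged, since slicing a short list simply yields the whole list.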

Python adding lists

I am importing data from a file, which is working correctly. I have appended the data from this file to 3 different lists: name, mark, and mark2. However, I don't understand how, or whether, I can make a new list called total_marks that holds mark + mark2 for each entry. I tried looking for help on this and couldn't find much relating to it. The plan is to add the two lists together and work out a percentage, where the total possible marks are 150.
To add the two lists item by item:
combined = []
for m1, m2 in zip(mark, mark2):
    combined.append(m1+m2)
The zip function pairs up the items of the two lists, yielding one (m1, m2) tuple per position:
https://docs.python.org/3/library/functions.html#zip
Then you can perform the final operation this way:
final = []
for m in combined:
    final.append(m/150*100)
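The two loops above can also be written as list comprehensions. This is a minimal sketch with hypothetical marks, using the total_marks name from the question and assuming the two marks together are out of 150.

```python
# Hypothetical marks; each pair of marks is out of a combined 150.
mark = [50, 60, 45]
mark2 = [70, 30, 75]

# Add the lists item by item.
total_marks = [m1 + m2 for m1, m2 in zip(mark, mark2)]

# Convert each total to a percentage of 150, rounded to one decimal place.
percentages = [round(total / 150 * 100, 1) for total in total_marks]

print(total_marks)   # [120, 90, 120]
print(percentages)   # [80.0, 60.0, 80.0]
```

Keep in mind that zip stops at the shorter list, so any unmatched trailing items are silently dropped if the lists differ in length.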
As I said in my comment, I highly recommend that once you've gotten past learning the basics that you then take the time to learn two libraries: pandas and xlwings. These will greatly help your ability to interact between python and excel. An operation like you have here becomes much simpler once you learn pandas dataframes.
Here is a better way, using pandas.
import pandas
df = pandas.read_csv('Classmarks.csv', index_col = 'student_name', names = ('student_name', 'mark1', 'mark2'), header = None)
df['combined'] = df['mark1'] + df['mark2']
df['final'] = df['combined'] / 150 * 100
print(df)
You don't need any loops when using pandas, and you can then write the result back to a CSV file:
df.to_csv('Classmarksout.csv')
