Using Python, how to remove redundancy from rows of a text file - python-3.x

Hello guys, I am using the RCV1 dataset. I want to remove duplicate words or tokens from the text file, but I am not sure how to do it. These are not duplicate rows; they are words within the articles. I am using Python, so please help me with this. Please see the attached image to get an idea of the text file.

Assuming that the words of the text file are separated only by blank spaces (i.e., no attached commas or periods), the following code should work for you.
items = []
with open("data.txt") as f:
    for line in f:
        items += line.split()
newItemList = list(set(items))
If you would like to have the items as a single string:
newItemList = " ".join(list(set(items)))
If you want the order to be preserved as well, then do
newItemList = []
for item in items:
    if item not in newItemList:
        newItemList += [item]
newItemList = " ".join(newItemList)

Related

How to remove lines from a file starting with a specific word python3

I am doing this as an assignment. So, I need to read a file and remove lines that start with a specific word.
fajl = input("File name:")
rec = input("Word:")
def delete_lines(fajl, rec):
with open(fajl) as file:
text = file.readlines()
print(text)
for word in text:
words = word.split(' ')
first_word = words[0]
for first in word:
if first[0] == rec:
text = text.pop(rec)
return text
print(text)
return text
delete_lines(fajl, rec)
At the last for loop, I completely lost control of what I am doing. Firstly, I can't use pop like that. Once I locate the word, I need to somehow delete the lines that start with that word. Additionally, there is one minor problem with my approach: first_word gives me the first word, but with the comma attached if one is present.
Example text from a file (file.txt):
This is some text on one line.
The text is irrelevant.
This would be some specific stuff.
However, it is not.
This is just nonsense.
rec = input("Word:") --- This
Output:
The text is irrelevant.
However, it is not.
You cannot modify a list while you are iterating over it, but you can iterate over a copy and modify the original one.
fajl = input("File name:")
rec = input("Word:")
def delete_lines(fajl, rec):
with open(fajl) as file:
text = file.readlines()
print(text)
# let's iterate over a copy to modify
# the original one without restrictions
for word in text[:]:
# compare with lowercase to erase This and this
if word.lower().startswith(rec.lower()):
# Remove the line
text.remove(word)
newtext="".join(text) # join all the text
print(newtext) # to see the results in console
# we should now save the file to see the results there
with open(fajl,"w") as file:
file.write(newtext)
print(delete_lines(fajl, rec))
Tested with your sample text, erasing "This". The startswith method will wipe "This" or "This," alike. This only deletes the matching lines and leaves any blank lines alone; if you don't want those either, you can also compare each line with "\n" and remove it.
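A rough sketch of that last point (reusing the same variables as above), with the blank-line check folded into the same loop:
for word in text[:]:
    # drop lines starting with the word, and blank lines as well
    if word.lower().startswith(rec.lower()) or word == "\n":
        text.remove(word)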

Reading a .txt file and appending each word into a dictionary

I'm kind of on a time crunch, but this was one of my problems in my homework assignment. I am stuck, and I don't know what to do or how to proceed.
Our assignment was to open various text files and within each of the text files, we are supposed to add each word into a dictionary in which the key is the document number it came from, and the value is the word.
For example, one text file would be:
1
Hello, how are you?
I am fine and you?
Each of the text files begins with a number corresponding to its title (for example, "document1.txt" begins with "1", "document2.txt" begins with "2", etc.)
My teacher gave us this code to help with stripping the punctuation and the line breaks, but I am having a hard time figuring out where to use it.
data = re.split("[ .,:;!?\s\b]+|[\r\n]+", line)
data = filter(None, data)
I don't really understand where the filter(None, data) part comes into play, because when I print it, all it shows is the object's representation in memory.
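(For context: filter(None, data) drops the empty strings that re.split leaves behind; in Python 3 it returns a lazy filter object, which is why printing it only shows its memory address. Wrapping it in list() makes the contents visible, e.g.:)
import re
line = "Hello, how are you? "
data = re.split("[ .,:;!?\s\b]+|[\r\n]+", line)
print(data)                       # ['Hello', 'how', 'are', 'you', '']
print(filter(None, data))         # <filter object at 0x...>
print(list(filter(None, data)))   # ['Hello', 'how', 'are', 'you']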
Here's my code so far:
def invertFile(list_of_file_names):
    import re
    diction = {}
    emplist = []
    fordiction = []
    for x in list_of_file_names:
        afile = open(x, 'r')
        with afile as f:
            for line in f:
                savedSort = filterText(f)

def filterText(line):
    import re
    word_delimiters = [' ', ',', ';', ':', '.', '?', '!']
    data = re.split("[ .,:;!?\s\b]+|[\r\n]+", f)
    key, value = data[0], data[1:]
    diction[key] = value
How do I make it so each word is appended into a dictionary, where the key is the document it comes from, and the value are the words in the document? Thank you.
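One rough way to put those pieces together (an untested sketch, assuming the first line of each file holds the document number and that the dictionary is returned rather than kept global):
import re

def invertFile(list_of_file_names):
    diction = {}
    for name in list_of_file_names:
        with open(name, 'r') as f:
            key = next(f).strip()   # first line holds the document number, e.g. "1"
            words = []
            for line in f:
                data = re.split("[ .,:;!?\s\b]+|[\r\n]+", line)
                words += list(filter(None, data))   # drop the empty strings
            diction[key] = words
    return diction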

How do I print out results on a separate line after converting them from a set to a string?

I am currently trying to compare two text files, to see if they have any words in common.
The text files are as follows:
ENGLISH.TXT
circle
table
year
competition
FRENCH.TXT
bien
competition
merci
air
table
My current code gets them to print, and I've removed all the unnecessary curly brackets and so on, but I can't get them to print on different lines.
List = open("english.txt").readlines()
List2 = open("french.txt").readlines()
anb = set(List) & set(List2)
anb = str(anb)
anb = (str(anb)[1:-1])
anb = anb.replace("'","")
anb = anb.replace(",","")
anb = anb.replace('\\n',"")
print(anb)
The output is expected to separate both results onto new lines.
Currently Happening:
Competition Table
Expected:
Competition
Table
Thanks in advance!
- Xphoon
Hi, I'd suggest you try two things as good practice:
1) Use "with" for opening files
with open('english.txt', 'r') as englishfile, open('french.txt', 'r') as frenchfile:
    # your python operations for the files
2) Try to use the "f-String" opportunity if you're using Python 3:
print(f"Hello\nWorld!")
File read using "open()" vs "with open()"
This post explains very well why to use the "with" statement :)
And additionally to the f-strings, if you want to print out variables, do it like this:
print(f"{variable[index]}\n{variable2[index2]}")
This should print out Hello and World! on separate lines.
Here is one solution including converting between sets and lists:
with open('english.txt', 'r') as englishfile, open('french.txt', 'r') as frenchfile:
    english_words = englishfile.readlines()
    english_words = [word.strip('\n') for word in english_words]
    french_words = frenchfile.readlines()
    french_words = [word.strip('\n') for word in french_words]
anb = set(english_words) & set(french_words)
anb_list = [item for item in anb]
for item in anb_list:
    print(item)
Here is another solution by keeping the words in lists:
with open('english.txt', 'r') as englishfile, open('french.txt', 'r') as frenchfile:
    english_words = englishfile.readlines()
    english_words = [word.strip('\n') for word in english_words]
    french_words = frenchfile.readlines()
    french_words = [word.strip('\n') for word in french_words]
for english_word in english_words:
    for french_word in french_words:
        if english_word == french_word:
            print(english_word)
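For completeness, a more compact variant along the same lines (a sketch using set comprehensions) keeps the set intersection and prints one common word per line:
with open('english.txt') as englishfile, open('french.txt') as frenchfile:
    english_words = {line.strip() for line in englishfile}
    french_words = {line.strip() for line in frenchfile}
# sorted() just makes the output order deterministic
print("\n".join(sorted(english_words & french_words)))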

How can I remove all POS tags except for 'VBD' and 'VBN' from my CSV file?

I want to remove words tagged with the specific part-of-speech tags VBD and VBN from my CSV file. But, I'm getting the error "IndexError: list index out of range" after entering the following code:
for word in POS_tag_text_clean:
    if word[1] != 'VBD' and word[1] != 'VBN':
        words.append(word[0])
My CSV file has 10 reviews from 10 people, and the column name is Comment.
Here is my full code:
df_Comment = pd.read_csv("myfile.csv")
def clean(text):
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()
tagged = nltk.pos_tag(text)
text = text.rstrip()
text = re.sub(r'[^a-zA-Z]', ' ', text)
stop_free = " ".join([i for i in text.lower().split() if((i not in stop) and (not i.isdigit()))])
punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
return normalized
text_clean = []
for text in df)Comment['Comment']:
text_clean.append(clean(text).split())
print(text_clean)
POS_tag_text_clean = [nltk.pos_tag(t) for t in text_clean]
print(POS_tag_text_clean)
words=[]
for word in POS_tag_text_clean:
if word[1] !='VBD' and word[1] !='VBN':
words.append(word[0])
How can I fix the error?
It is a bit hard to understand your problem without an example and the corresponding outputs, but it might be this:
Assuming that text is a string, text_clean will be a list of lists of strings, where every string represents a word. After the part-of-speech tagging, POS_tag_text_clean will therefore be a list of lists of tuples, each tuple containing a word and its tag.
If I'm right, then your last loop actually loops over items from your dataframe instead of words, as the name of the variable suggests. If an item has only one word (which is not so unlikely, since you filter a lot in clean()), your call to word[1] will fail with an error similar to the one you report.
Instead, try this code:
words = []
for item in POS_tag_text_clean:
words_in_item = []
for word in item:
if word[1] !='VBD' and word[1] !='VBN':
words_in_item .append(word[0])
words.append(words_in_item)
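The same filtering can also be written as a nested list comprehension, equivalent to the loop above:
words = [[word for word, tag in item if tag not in ('VBD', 'VBN')]
         for item in POS_tag_text_clean]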

Save Tweets as .csv, Contains String Literals and Entities

I have tweets saved in JSON text files. I have a friend who wants tweets containing keywords, and the tweets need to be saved in a .csv. Finding the tweets is easy, but I run into two problems and am struggling with finding a good solution.
Sample data are here. I have included the .csv file that is not working as well as a file where each row is a tweet in JSON format.
To get into a dataframe, I use pd.io.json.json_normalize. It works smoothly and handles nested dictionaries well, but pd.to_csv does not work because it does not handle, as far as I can tell, string literals well. Some of the tweets contain '\n' in the text field, and pandas writes new lines when that happens.
No problem, I process tweets['text'] to remove '\n'. The resulting file still has too many rows, 1863 compared to the 1388 it should have. I then modified my code to replace all string literals:
tweets['text'] = [item.replace('\n', '') for item in tweets['text']]
tweets['text'] = [item.replace('\r', '') for item in tweets['text']]
tweets['text'] = [item.replace('\\', '') for item in tweets['text']]
tweets['text'] = [item.replace('\'', '') for item in tweets['text']]
tweets['text'] = [item.replace('\"', '') for item in tweets['text']]
tweets['text'] = [item.replace('\a', '') for item in tweets['text']]
tweets['text'] = [item.replace('\b', '') for item in tweets['text']]
tweets['text'] = [item.replace('\f', '') for item in tweets['text']]
tweets['text'] = [item.replace('\t', '') for item in tweets['text']]
tweets['text'] = [item.replace('\v', '') for item in tweets['text']]
Same result, pd.to_csv saves a file with more rows than actual tweets. I could replace string literals in all columns, but that is clunky.
Fine, don't use pandas. with open(outpath, 'w') as f: and so on creates a .csv file with the correct number of rows. Reading the file, either with pd.read_csv or reading line by line will fail, however.
It fails because of how Twitter handles entities. If a tweet's text contains a url, mention, hashtag, media, or link, then Twitter returns a dictionary that contains commas. When pandas flattens the tweet, the commas get preserved within a column, which is good. But when the data are read in, pandas splits what should be one column into multiple columns. For example, a column might look like [{'screen_name': 'ProfOsinbajo', 'name': 'Prof Yemi Osinbajo', 'id': 2914442873, 'id_str': '2914442873', 'indices': [0, 13]}], so splitting on commas creates too many columns:
[{'screen_name': 'ProfOsinbajo',
'name': 'Prof Yemi Osinbajo',
'id': 2914442873,
'id_str': '2914442873',
'indices': [0,
13]}]
That is the outcome whether I use with open(outpath) as f: as well. With that approach, I have to split lines, so I split on commas. Same problem - I do not want to split on commas if they occur in a list.
I want those data to be treated as one column when saved to file or read from file. What am I missing? In terms of the data at the repository above, I want to convert forstackoverflow2.txt to a .csv with as many rows as tweets. Call this file A.csv, and let's say it has 100 columns. When opened, A.csv should also have 100 columns.
I'm sure there are details I've left out, so please let me know.
Using the csv module works. The code below writes the file out as a .csv while counting the lines, then reads it back in and counts the lines again.
The counts match, and opening the .csv in Excel also gives 191 columns and 1338 lines of data.
import json
import csv

with open('forstackoverflow2.txt') as f,\
     open('out.csv', 'w', encoding='utf-8-sig', newline='') as out:
    data = json.loads(next(f))
    print('columns', len(data))
    writer = csv.DictWriter(out, fieldnames=sorted(data))
    writer.writeheader()    # write header
    writer.writerow(data)   # write the first line of data
    for i, line in enumerate(f, 2):  # start line count at two
        data = json.loads(line)
        writer.writerow(data)
    print('lines', i)

with open('out.csv', encoding='utf-8-sig', newline='') as f:
    r = csv.DictReader(f)
    lines = list(r)
    print('readback columns', len(lines[0]))
    print('readback lines', len(lines))
Output:
columns 191
lines 1338
readback lines 1338
readback columns 191
@Mark Tolonen's answer is helpful, but I ended up going a separate route. When saving the tweets to file, I removed all \r, \n, \t, and \0 characters from anywhere in the JSON. Then, I saved the file as tab-separated so that commas in fields like location or text do not confuse a read function.
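A minimal sketch of that route (assuming the flattened tweets already sit in a pandas DataFrame named tweets; the output file name is illustrative):
import pandas as pd

# strip the characters that break row boundaries, in every string column
for col in tweets.select_dtypes(include='object').columns:
    tweets[col] = tweets[col].str.replace(r'[\r\n\t\0]', '', regex=True)

# tab-separated output, so commas inside entity fields stay in one column
tweets.to_csv('tweets_clean.tsv', sep='\t', index=False)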
