Python itertools groupby - writing lines to a CSV file (Python 3)

I am new to Python, apologies if this is a stupid question.
I have a text file with the following input:
Apple Apple1
Apple Apple2
Aaron Aaron1
Aaron Aaron2
Aaron Aaron3
Tree Tree1
I have the following code:
import csv
import sys
from itertools import groupby

with open('File.txt', 'r', encoding='utf-8') as csvfile:
    reader = csv.reader(csvfile, delimiter='\t')
    next(reader, None)
    a = [[k] + [x[1] for x in g] for k, g in groupby(reader, key=lambda row: row[0])]

sys.stdout = open('Out.txt', 'w', encoding='utf-8')
print(str(a))
What I want to achieve:
Apple Apple1,Apple2
Aaron Aaron1,Aaron2,Aaron3
Tree Tree1
However, the output I am now getting is in list form, while I want it to be printed line per line. How can I achieve this?

How about:
import pandas as pd

df = pd.read_csv('File.txt', delimiter=' ', header=None)
grouped = df.groupby(0).agg(lambda col: ', '.join(col)).to_records()
for group in grouped:
    print(group[0] + ' ' + group[1])
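If you'd rather keep the original itertools.groupby approach instead of pandas, here is a minimal sketch (assuming the same tab-delimited File.txt and header row as in your code) that builds the joined string per key and writes one line per group with csv.writer instead of printing the whole list:

import csv
from itertools import groupby

with open('File.txt', 'r', encoding='utf-8') as csvfile:
    reader = csv.reader(csvfile, delimiter='\t')
    next(reader, None)  # skip the header row
    a = [[k, ','.join(x[1] for x in g)]
         for k, g in groupby(reader, key=lambda row: row[0])]

# Write one "key<TAB>value1,value2,..." line per group.
with open('Out.txt', 'w', encoding='utf-8', newline='') as out:
    writer = csv.writer(out, delimiter='\t')
    writer.writerows(a)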

How to save values in row after one loop?

How can I save all the values to a single row in each cycle, please?
import numpy as np

A = 5.4654
B = [4.78465, 6.46545, 5.798]
for i in range(2):
    f = open('file.txt', 'a')
    np.savetxt(f, np.r_[A, B], fmt='%22.16f')
    f.close()
The output is:
5.4653999999999998
4.7846500000000001
6.4654499999999997
5.7980000000000000
5.4653999999999998
4.7846500000000001
6.4654499999999997
5.7980000000000000
The desired output is:
5.4653999999999998 4.7846500000000001 6.4654499999999997 5.7980000000000000
5.4653999999999998 4.7846500000000001 6.4654499999999997 5.7980000000000000
According to the documentation:
newline : str, optional
    String or character separating lines.
So, perhaps:
np.savetxt(f, np.r_[A, B], fmt='%22.16f', newline=' ')
print(file=f)  # write a single newline after each row
An alternative might be to np.transpose(np.r_[A, B]) perhaps?
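Note that np.transpose on a 1-D array such as np.r_[A, B] has no effect, but reshaping it into a single row does make savetxt emit one line per call. A minimal sketch, assuming the same A and B as above:

import numpy as np

A = 5.4654
B = [4.78465, 6.46545, 5.798]

with open('file.txt', 'a') as f:
    for i in range(2):
        # A 2-D array with one row is written by savetxt as a single line.
        np.savetxt(f, np.r_[A, B].reshape(1, -1), fmt='%22.16f')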

NLTK Named Entity Category Labels

I keep hitting a wall when it comes to NLTK. I've been able to tokenize and categorize a single string of text; however, if I try to apply the script across multiple rows, I get the tokens, but it does not return a category, which is the most important part for me.
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

SENT_DETECTOR = nltk.data.load('tokenizers/punkt/english.pickle')
Example:
ex = 'John'
ne_tree = nltk.ne_chunk(pos_tag(word_tokenize(ex)))
print(ne_tree)
Output:
(S (PERSON John/NNP))
That is exactly what I'm looking for. I need the Category not just NNP.
When I apply this across a table I just get the token and no Category.
Example:
df = pd.read_csv('ex3.csv')
df
Input:
Order Text
0 0 John
1 1 Paul
2 2 George
3 3 Ringo
Code:
df['results'] = df.Text.apply(lambda x: nltk.ne_chunk(pos_tag(word_tokenize(x))))
df
Output:
print(df)
Order Text results
0 0 John [[(John, NNP)]]
1 1 Paul [[(Paul, NNP)]]
2 2 George [[(George, NNP)]]
3 3 Ringo [[(Ringo, NN)]]
I'm getting the tokens and it's working across all rows, but it is not giving me a Category 'PERSON'.
I really need Categories.
Is this not possible? Thanks for the help.
Here we go...
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

df = pd.read_csv("ex3.csv")
# print(df)

text1 = df['Text'].to_list()
text = []
for i in text1:
    text.append(i.capitalize())

# create a column to store results
df['results'] = ""

SENT_DETECTOR = nltk.data.load('tokenizers/punkt/english.pickle')
for i in range(len(text)):
    ne_tree = nltk.ne_chunk(pos_tag(word_tokenize(text[i])))
    df.loc[i, 'results'] = ne_tree[0].label()
print(df)
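An alternative sketch that keeps everything inside apply, assuming the same ex3.csv with a 'Text' column as in the question. Note that ne_chunk only wraps recognized entities in a subtree; an unrecognized word stays a plain (word, tag) tuple with no label, so a fallback is needed:

import nltk
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

def first_entity_label(text):
    """Return the category of the first named entity in the text, if any."""
    tree = nltk.ne_chunk(pos_tag(word_tokenize(text)))
    for node in tree:
        if isinstance(node, nltk.Tree):  # chunked entities are subtrees
            return node.label()          # e.g. 'PERSON', 'GPE', ...
    return None                          # no named entity found

df = pd.read_csv('ex3.csv')
df['results'] = df['Text'].apply(first_entity_label)
print(df)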

Need to read a text file line by line using Python and get user data into a pandas DataFrame

I need to read a text file line by line using Python and get the user data into a pandas DataFrame.
I tried the below:
import pandas as pd

y = 0
Name = []
Age = []
with open('file.txt', 'r') as fp:
    for line in fp:
        if line == "<USERDATA":
            row = True
            break
        else:
            l = line.split("=")[0]
            i = line.split("=")[-1]
            row = False
        if row == False:
            if "\tName" in l:
                Name.append(i)
            elif "\Age" in l:
                Age.append(i)
            else:
                pass
        else:
            pass
while 0 <= y < (len(Name)) - 1:
    z = {"Name": Nmae[y], "Age": Age[y]}
    y += 1
df = pd.DataFrame(z, columns=["Name", "Age"], index=None)
The file contents are something like below:
sample

You have some logical issues; I have fixed them. I would encourage you to compare your code with mine to see the differences, and if you have any doubts, comment below.
import pandas as pd
import numpy as np

y = 0
Name = []
Age = []
z = {}
with open('file.txt', 'r') as fp:
    for line in fp:
        line = line.strip()
        if line == r'<USERDATA':
            row = True
            continue
        if line == r'<USEREND':
            if len(Age) < len(Name):
                Age.append(np.nan)  # only adding once since done at end of each row
            elif len(Name) < len(Age):
                Name.append(np.nan)
            continue
        else:
            l = line.split("=")[0].strip()
            i = line.split("=")[-1].strip()
            row = False
            if row == False:
                if l == 'Name':
                    Name.append(i)
                elif l == 'Age':
                    Age.append(i)

z = {"Name": Name, "Age": Age}
df = pd.DataFrame(z, columns=["Name", "Age"], index=None)
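For reference, a hypothetical input file matching what this parser expects; the original sample was only linked, so the exact format here is an assumption inferred from the parsing code above:

# Hypothetical input, inferred from the parser above: each record sits between
# <USERDATA and <USEREND markers and contains "Key = Value" lines.
sample = """<USERDATA
Name = Alice
Age = 30
<USEREND
<USERDATA
Name = Bob
<USEREND
"""
with open('file.txt', 'w') as fp:
    fp.write(sample)
# With this input, the parser above yields a DataFrame with
# Name = ['Alice', 'Bob'] and Age = ['30', NaN].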

Incorrect formatting while reading csv

CSV format (3 columns):
id_numb formatted_id Comment_Txt
1 Z007 sample text says good morning.
Code to read:
import csv

with open("file.csv", 'r', newline='') as csvfile:
    file_reader = csv.reader(csvfile, delimiter=' ', quotechar='|')
    for row in file_reader:
        print(row)
Expected output:
['id_numb', 'formatted_id', 'Comment_Txt']
['1', 'Z007', 'sample', 'text', 'says', 'good', 'morning.']
My output:
['1,Z007,sample', 'text', 'says', 'good', 'morning.']
The first three tokens are automatically joined. I am not able to understand the mistake. Any suggestions will be helpful.
import csv
from functools import reduce

with open("file.csv", 'r', newline='') as csvfile:
    file_reader = csv.reader(csvfile, delimiter=',', quotechar='|')
    for row in file_reader:
        print(reduce(lambda x, y: x + y, [i.split(' ') for i in row]))
output:
['id_numb', 'formatted_id', 'Comment_Txt']
['1', 'Z007', 'sample', 'text', 'says', 'good', 'morning.']
Is this the expected output?
You could try using
import csv

with open("file.csv", 'r', newline='') as csvfile:
    file_reader = csv.reader(csvfile, delimiter=',', quotechar='|')
    for row in file_reader:
        print(row)
since your first row seems to be of the form
1,Z007,sample text says good morning
and using ' ' as the delimiter splits the line at every space rather than at the commas, so the comma-separated fields stay stuck together in one column.
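If you are not sure which delimiter a file uses, the standard library's csv.Sniffer can guess it from a sample before you build the reader. A minimal sketch, assuming the same file.csv:

import csv

with open("file.csv", 'r', newline='') as csvfile:
    dialect = csv.Sniffer().sniff(csvfile.read(1024))  # guess the dialect from a sample
    csvfile.seek(0)                                     # rewind before reading rows
    file_reader = csv.reader(csvfile, dialect)
    for row in file_reader:
        print(row)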

Error Unorderable Types. Select rows where on hdf5 files

I work with Python 3.5 and I have the following problem importing some data from an HDF5 file.
I will show a very simple example which summarizes what happens. I have created a small DataFrame and inserted it into an HDF5 file. Then I have tried to select from this HDF5 file the rows which have a value less than 1 in column "A". I get the error:
"TypeError: unorderable types: str() < int()"
import pandas as pd
import numpy as np
import datetime
import time
import h5py
from pandas import DataFrame, HDFStore

def test_conected():
    hdf_nombre_archivo = "1_Archivo.h5"
    hdf = HDFStore(hdf_nombre_archivo)
    np.random.seed(1234)
    index = pd.date_range('1/1/2000', periods=3)
    df = pd.DataFrame(np.random.randn(3, 4), index=index,
                      columns=['A', 'B', 'C', 'F'])
    print(df)
    with h5py.File(hdf_nombre_archivo) as f:
        df.to_hdf(hdf_nombre_archivo, 'df', format='table')
    print("")
    with h5py.File(hdf_nombre_archivo) as f:
        df_nuevo = pd.read_hdf(hdf_nombre_archivo, 'df', where=['A' < 1])
    print(df_nuevo)

def Fin():
    print(" ")
    print("FIN")

if __name__ == "__main__":
    test_conected()
    Fin()
    print(time.strftime("%H:%M:%S"))
I have been investigating but I cannot solve this error. Any ideas?
Thanks,
Angel
where=['A' < 1]
In your condition statement, 'A' is treated as a string (or char) and 1 as an int, so Python cannot order them; first make them the same type by typecasting, e.g.:
str(1)
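A different route, for what it's worth: pandas can take the whole condition as a single string, which is parsed by PyTables rather than evaluated in Python, so 'A' refers to the column instead of being a literal string compared with an int. A minimal sketch, assuming the same df and file name as above; note the column has to be declared a data column when writing for it to be queryable:

import pandas as pd

# Declare 'A' as a data column so it can appear in a where clause.
df.to_hdf("1_Archivo.h5", 'df', format='table', data_columns=['A'])

# The whole condition goes in as one string; PyTables evaluates it against column 'A'.
df_nuevo = pd.read_hdf("1_Archivo.h5", 'df', where='A < 1')
print(df_nuevo)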
