Python - Storing float values in CSV file - python-3.x

I am trying to store the positive and negative scores of statements from a text file. I want to store the scores in a csv file. I have implemented the code given below:
import openpyxl
from nltk.tokenize import sent_tokenize
import csv
from senti_classifier import senti_classifier
from nltk.corpus import wordnet

file_content = open('amazon_kindle.txt')
for lines in file_content:
    sentence = sent_tokenize(lines)
    pos_score, neg_score = senti_classifier.polarity_scores(sentence)

with open('target.csv','w') as f:
    writer = csv.writer(f, lineterminator='\n', delimiter=',')
    for val in range(pos_score):
        writer.writerow(float(s) for s in val[0])
f.close()
But the code gives me the following error in the for loop:
Traceback (most recent call last):
  File "C:\Users\pc\AppData\Local\Programs\Python\Python36-32\classifier.py", line 21, in <module>
    for val in pos_score:
TypeError: 'float' object is not iterable

You have several errors in your code, and your posted code and traceback do not correspond to each other:

for val in pos_score:        # traceback
for val in range(pos_score): # code

pos_score is a float, so both are errors: range() takes an int, and for val in ... needs an iterable. Where do you expect to get your list of values from?
From the usage it also looks like you are expecting a list of lists of values, because you use a generator expression in your writerow call:

writer.writerow(float(s) for s in val[0])
Perhaps you are only expecting a single list of values, in which case you can get rid of the for loop and just use:
writer.writerow(float(val) for val in <list_of_values>)
Using:

with open('target.csv','w') as f:

means you no longer need to call f.close(); the with statement closes the file at the end of the block. It also means the writerow() call needs to be inside the with block:
with open('target.csv','w') as f:
    writer = csv.writer(f, lineterminator='\n', delimiter=',')
    writer.writerow(float(val) for val in <list_of_values>)
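Putting it together, here is a minimal sketch of the whole script, assuming (as your code suggests) that polarity_scores returns one (pos, neg) pair of floats per call:

import csv
from nltk.tokenize import sent_tokenize
from senti_classifier import senti_classifier

# collect one (pos, neg) pair per line of the input file
scores = []
with open('amazon_kindle.txt') as file_content:
    for line in file_content:
        sentences = sent_tokenize(line)
        pos_score, neg_score = senti_classifier.polarity_scores(sentences)
        scores.append((pos_score, neg_score))

with open('target.csv', 'w') as f:
    writer = csv.writer(f, lineterminator='\n', delimiter=',')
    writer.writerows(scores)  # one row of two floats per input line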

Related

Changes csv row value

This is my code:
import pandas as pd
import re

# reading the csv file
patients = pd.read_csv("partial.csv")

# updating the column value/data
for patient in patients.iterrows():
    cip = patient['VALOR_ID']
    new_cip = re.sub('^(\w+|)', r'FIXED_REPLACED_STRING', cip)
    patient['VALOR_ID'] = new_cip

# writing into the file
df.to_csv("partial-writer.csv", index=False)
print(df)
I'm getting this message:
Traceback (most recent call last):
  File "/home/jeusdi/projects/workarea/salut/load-testing/load.py", line 28, in <module>
    cip=patient['VALOR_ID']
TypeError: tuple indices must be integers or slices, not str
EDIT
From the code above you might think I need to set the same fixed value for all rows. In fact I need to loop over the rows, generate a random string for each one, and set a different value on each row. The code above would become:

for patient in patients.iterrows():
    new_cip = generate_cip()
    patient['VALOR_ID'] = new_cip
iterrows() yields (index, Series) tuples, which is why indexing patient with a string fails. Use Series.str.replace instead. I'm not sure about the | in the regex, though; maybe it should be removed:
df = pd.read_csv("partial.csv")
df['VALOR_ID'] = df['VALOR_ID'].str.replace('^(\w+|)', r'FIXED_REPLACED_STRING')

# if the function returns scalars
df['VALOR_ID'] = df['VALOR_ID'].apply(generate_cip)

df.to_csv("partial-writer.csv", index=False)

How to read multiple text files as strings from two folders at the same time using readline() in python?

I currently have a version of the following script that uses two simple readline() snippets to read a single-line .txt file from two different folders. Running under Ubuntu 18.04 and Python 3.6.7, not using glob.
I'm now encountering a NameError when trying to read multiple text files from the same folders using sorted(glob.glob(...)).
readlines() causes an error because the input from the .txt files must be strings, not lists.
I'm new to Python and have tried online Python formatters, reindent.py, etc., but with no success. I'm hoping it's a simple indentation issue so it won't be a problem in future scripts.
Current error from code below:
Traceback (most recent call last):
  File "v1-ReadFiles.py", line 21, in <module>
    context_input = GenerationInput(P1=P1, P3=P3,
NameError: name 'P1' is not defined
Current modified script:
import glob
import os

from src.model_use import TextGeneration
from src.utils import DEFAULT_DECODING_STRATEGY, LARGE
from src.flexible_models.flexible_GPT2 import FlexibleGPT2
from src.torch_loader import GenerationInput
from transformers import GPT2LMHeadModel, GPT2Tokenizer

for name in sorted(glob.glob('P1_files/*.txt')):
    with open(name) as f:
        P1 = f.readline()

for name in sorted(glob.glob('P3_files/*.txt')):
    with open(name) as f:
        P3 = f.readline()

if __name__ == "__main__":
    context_input = GenerationInput(P1=P1, P3=P3,
                                    genre=["mystery"],
                                    persons=["Steve"],
                                    size=LARGE,
                                    summary="detective")

    print("PREDICTION WITH CONTEXT WITH SPECIAL TOKENS")

    model = GPT2LMHeadModel.from_pretrained('models/custom')
    tokenizer = GPT2Tokenizer.from_pretrained('models/custom')
    tokenizer.add_special_tokens(
        {'eos_token': '[EOS]',
         'pad_token': '[PAD]',
         'additional_special_tokens': ['[P1]', '[P2]', '[P3]', '[S]', '[M]', '[L]', '[T]', '[Sum]', '[Ent]']}
    )
    model.resize_token_embeddings(len(tokenizer))

    GPT2_model = FlexibleGPT2(model, tokenizer, DEFAULT_DECODING_STRATEGY)
    text_generator_with_context = TextGeneration(GPT2_model, use_context=True)

    predictions = text_generator_with_context(context_input, nb_samples=1)
    for i, prediction in enumerate(predictions):
        print('prediction n°', i, ': ', prediction)
Thanks to afghanimah here:
Problem with range() function when used with readline() or counter - reads and processes only last line in files
I dropped glob, and also moved all the model-loading calls (model = ..., etc.) above the with open(...) block:
with open("data/test-P1-Multi.txt","r") as f1, open("data/test-P3-Multi.txt","r") as f3:
for i in range(5):
P1 = f1.readline()
P3 = f3.readline()
context_input = GenerationInput(P1=P1, P3=P3, size=LARGE)
etc.
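Alternatively, if you still want to read from the two folders, a minimal sketch that pairs the files up with glob (reusing GenerationInput and LARGE from the script above, and assuming both folders hold matching numbers of single-line files) could look like:

import glob

p1_files = sorted(glob.glob('P1_files/*.txt'))
p3_files = sorted(glob.glob('P3_files/*.txt'))

# zip pairs the i-th file of one folder with the i-th file of the other
for name1, name3 in zip(p1_files, p3_files):
    with open(name1) as f1, open(name3) as f3:
        P1 = f1.readline()
        P3 = f3.readline()
    context_input = GenerationInput(P1=P1, P3=P3, size=LARGE)
    # ... run the model on context_input here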

How to calculate values in an array imported from csv.reader?

I have this csv file
Germany,1,5,10,20
UK,0,2,4,10
Hungary,6,11,22,44
France,8,22,33,55
and this script. I would like to perform some arithmetic operations on the values in the 2D array (data), for example print the value data[1][3] increased by 10. It seems that I need some conversion to integer, right? What is the best solution, please?
import csv

datafile = open('sample.csv', 'r')
datareader = csv.reader(datafile, delimiter=',')
data = []
for row in datareader:
    data.append(row)
print((data[1][3]) + 10)
I got this error
/python$ python3 read6.py
Traceback (most recent call last):
  File "read6.py", line 8, in <module>
    print((data[1][3]) + 10)
TypeError: must be str, not int
You'll have to manually convert to integers as you suspected:
import csv

datafile = open('sample.csv', 'r')
datareader = csv.reader(datafile, delimiter=',')
data = []
for row in datareader:
    data.append([row[0]] + list(map(int, row[1:])))
print((data[1][3]) + 10)
Specifically this modification on line 7 of your code:
data.append([row[0]] + list(map(int, row[1:])))
The csv package docs mention that
No automatic data type conversion is performed unless the QUOTE_NONNUMERIC format option is specified (in which case unquoted fields are transformed into floats).
Since the strings in your CSV are not quoted (i.e. Germany rather than "Germany"), this isn't useful in your case, so converting manually is the way to go.
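For reference, a minimal sketch of the QUOTE_NONNUMERIC behaviour, assuming a hypothetical file sample_quoted.csv in which the text fields are quoted (e.g. "Germany",1,5,10,20):

import csv

with open('sample_quoted.csv', newline='') as f:
    reader = csv.reader(f, quoting=csv.QUOTE_NONNUMERIC)
    data = list(reader)

# the unquoted fields were converted to floats, so arithmetic works directly
print(data[1][3] + 10)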

Dask client.persist returns AssertionError when I try to use HashingVectorizer

I am trying to vectorize a dask.dataframe with the dask-ml HashingVectorizer. I want the vectorization results to stay in the cluster (distributed system); that's why I am using client.persist when I try to transform the data. But for some reason, I am getting the error below.
Traceback (most recent call last):
  File "/home/dodzilla/my_project/components_with_adapter/vectorizers/base_vectorizer.py", line 112, in hybrid_feature_vectorizer
    CLUSTERING_FEATURES=self.clustering_features)
  File "/home/dodzilla/my_project/components_with_adapter/vectorizers/text_vectorizer.py", line 143, in vectorize
    X = self.client.persist(fitted_vectorizer.transform, combined_data)
  File "/home/dodzilla/.local/lib/python3.6/site-packages/distributed/client.py", line 2860, in persist
    assert all(map(dask.is_dask_collection, collections))
AssertionError
I can't share the data but all of the necessary information about the data is as below:
>>> type(combined_data)
<class 'dask.dataframe.core.Series'>
>>> type(combined_data.compute())
<class 'pandas.core.series.Series'>
>>> combined_data.compute().shape
12
A minimal working example can be found below. In the code snippet, combined_data holds the merged columns, meaning all of the columns are merged into 1 column. The data has 12 rows, and all of the values inside the rows are strings. This is the code where I am getting the error:
from stop_words import get_stop_words
from dask_ml.feature_extraction.text import HashingVectorizer as daskHashingVectorizer
import pandas as pd
import dask
import dask.dataframe as dd
from dask.distributed import Client

def convert_dataframe_to_single_text(documents):
    """
    Combine all of the columns into 1 column.
    """
    if type(documents) is dask.dataframe.core.DataFrame:
        cols = documents.columns
        documents['combined'] = documents[cols].apply(func=(lambda row: ' '.join(row.values.astype(str))), axis=1,
                                                      meta=('str'))
        document_texts = documents.drop(cols, axis=1)
    else:
        raise TypeError('Wrong type of data. Expected Pandas DF or Dask DF but received ', type(documents))
    return document_texts

# Init the client.
client = Client('localhost:8786')

# Get stopwords
stopwords = get_stop_words(language="english")

# Create dask dataframe from pandas dataframe
data = {'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':["twenty", "twentyone", "nineteen", "eighteen"]}
df = pd.DataFrame(data)
df = dd.from_pandas(df, npartitions=1)

# Init the vectorizer
vectorizer = daskHashingVectorizer(stop_words=stopwords, alternate_sign=False,
                                   norm=None, binary=False,
                                   n_features=10000)

# Combine all of the columns into 1 column.
combined_data = convert_dataframe_to_single_text(df)

# Fit the vectorizer.
fitted_vectorizer = client.persist(vectorizer.fit(combined_data))

# Transform the data.
X = client.persist(fitted_vectorizer.transform, combined_data)
I hope this information is enough.
Important note: I am not getting any kind of error when I use client.compute, but from what I understand that doesn't run in the cluster of machines; it runs in the local machine instead, and it returns a csr matrix rather than a lazily evaluated dask.array.
This is not how client.persist is supposed to be used. The functions I was looking for are client.submit and client.map... In my case client.submit solved my issue.
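For context, a minimal sketch of the client.submit approach, reusing the names from the question (client.submit ships a single function call to a worker and returns a Future, rather than persisting a dask collection):

# submit the transform call to the cluster; this returns a Future, not a dask collection
future = client.submit(fitted_vectorizer.transform, combined_data)

# block until the worker finishes and fetch the result
X = future.result()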

I am trying to read a .csv file which contains data in the order of timestamp, number plate, vehicle type and exit/entry

I need to store the timestamps in a list for further operations and have written the following code:
import csv
from datetime import datetime
from collections import defaultdict

t = []
columns = defaultdict(list)
fmt = '%Y-%m-%d %H:%M:%S.%f'

with open('log.csv', 'r') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        #t = row[1]
        for i in range(len(row)):
            columns[i].append(row[i])
        if (row):
            t = list(datetime.strptime(row[0], fmt))

columns = dict(columns)
print(columns)
for i in range(len(row)-1):
    print(t)
But I am getting the error:
Traceback (most recent call last):
  File "parking.py", line 17, in <module>
    t = list(datetime.strptime(row[0],fmt))
TypeError: 'datetime.datetime' object is not iterable
What can I do to store each timestamp in the column in a list?
Edit 1:
Here is the sample log file
2011-06-26 21:27:41.867801,KA03JI908,Bike,Entry
2011-06-26 21:27:42.863209,KA02JK1029,Car,Exit
2011-06-26 21:28:43.165316,KA05K987,Bike,Entry
If you have a csv file, then why not use pandas to get what you want? The code for your problem may be something like this:

import pandas as pd

df = pd.read_csv('log.csv', header=None)
timestamp = df[0]

Since the first column of the csv holds the timestamps, you now have all the entries of the first column in timestamp. After this you can convert all the entries into timestamp objects using datetime.datetime.strptime().
Hope this is helpful.
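A minimal sketch of that conversion step, following on from the pandas snippet above:

from datetime import datetime

fmt = '%Y-%m-%d %H:%M:%S.%f'
# parse each timestamp string in the first column into a datetime object
timestamps = [datetime.strptime(ts, fmt) for ts in timestamp]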
I can't comment for clarifications yet.
Would this code get you the timestamps in a list? If yes, give me a few lines of data from the csv file.
from datetime import datetime

timestamps = []
with open(csv_path, 'r') as readf_obj:
    for line in readf_obj:
        timestamps.append(line.split(',')[0])

fmt = '%Y-%m-%d %H:%M:%S.%f'
datetimes_timestamps = [datetime.strptime(timestamp_, fmt) for timestamp_ in timestamps]
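For completeness, a sketch that stays close to the original csv-based code: the fix is to append each parsed timestamp to the list rather than calling list() on a single datetime object:

import csv
from datetime import datetime

fmt = '%Y-%m-%d %H:%M:%S.%f'
t = []
with open('log.csv', 'r') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        if row:
            # append the parsed datetime; a datetime itself is not iterable
            t.append(datetime.strptime(row[0], fmt))
print(t)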
