I have 10 txt files, each of them containing a string.
A.txt: "This is a cat"
B.txt: "This is a dog"
.
.
J.txt: "This is an ant"
I want to read these files and put their contents into a 2D array:
[['This', 'is', 'a', 'cat'],['This', 'is', 'a', 'dog']....['This', 'is', 'an', 'ant']]
from glob import glob
import numpy as np
for filename in glob('*.txt'):
    with open(filename) as f:
        data = np.genfromtxt(filename, dtype=str)
It's not working the way I want. Any help will be greatly appreciated.
You are just generating a different NumPy array for each text file and not saving any of them. How about appending each file's contents to a list like so and converting to NumPy later?
data = []
for filename in glob('*.txt'):
    with open(filename) as f:
        data.append(f.read().split())  # split each file's text into a list of words
data = np.array(data)
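One small caveat, as an assumption about the desired row order rather than part of the original answer: glob does not guarantee any particular ordering, so if the rows should follow A.txt .. J.txt, iterate over the sorted filenames:

for filename in sorted(glob('*.txt')):  # alphabetical: A.txt, B.txt, ..., J.txt
    with open(filename) as f:
        data.append(f.read().split())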
I'd like to read a column from a CSV file and store those values in a list
The CSV file currently looks like this:
Names
Tom
Ryan
John
The result that I'm looking for is
['Tom', 'Ryan', 'John']
Below is the code that I've written.
import csv
import pandas as pd
import time
# Declarations
UserNames = []
# Open a csv file using pandas
data_frame = pd.read_csv("analysts.csv", header=1, index_col=False)
names = data_frame.to_string(index=False)
# print(names)
# Iteration
for name in names:
    UserNames.append(name)

print(UserNames)
So far the result is as follows
['T', 'o', 'm', ' ', '\n', 'R', 'y', 'a', 'n', '\n', 'J', 'o', 'h', 'n']
Any help would be appreciated.
Thanks in advance
Hi, instead of converting your DataFrame to a string you could just convert the column to a list like this:
import pandas as pd
import csv
import time
df = pd.read_csv("analyst.csv", header=0)
names = df["Name"].to_list()
print(names)
Output: ['tom', 'tim', 'bob']
CSV file:
Name,
tom,
tim,
bob,
I was not sure how your CSV really looks, so you may have to adjust the arguments of the read_csv function.
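For example, adjusted to match the file shown in the question (the header is "Names" and the file is analysts.csv there; change both if your actual file differs), the call would look like this:

import pandas as pd

df = pd.read_csv("analysts.csv", header=0)  # first row is the header row
names = df["Names"].to_list()               # column is called "Names" in the question
print(names)  # expected: ['Tom', 'Ryan', 'John']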
I want to output a multidimensional list to a CSV file.
Currently, I am creating a new DataFrame object and converting that to CSV. I am aware of the csv module, but I can't seem to figure out how to do that without manual input. The populate method allows the user to choose how many rows and columns they want. Basically, the data variable will usually be of the form [[x1, y1, z1], [x2, y2, z2], ...]. Any help is appreciated.
from populator import populate
from pandas import DataFrame

data = populate()
df = DataFrame(data)
df.to_csv('output.csv')
A CSV is nothing but comma-separated values within each row and newline-separated rows, which you can produce like so:
data = [[1, 2, 4], ['A', 'AB', 2], ['P', 23, 4]]
data_string = '\n'.join([', '.join(map(str, row)) for row in data])
with open('data.csv', 'wb') as f:
    f.write(data_string.encode())
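Since the question mentions the csv module, here is a minimal sketch of the same output using csv.writer; the literal data list stands in for the asker's populate() result:

import csv

data = [[1, 2, 4], ['A', 'AB', 2], ['P', 23, 4]]  # stand-in for populate()

with open('data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(data)  # one CSV row per inner list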
I am trying to read the column names from a DataFrame and add them to the GeoJSON properties dynamically. It worked by hard-coding the column names, but I want to get them without hard-coding.
Can anyone help me with how to get those values (not by iterating the GeoJSON rows)?
import pandas as pd
import geojson
def data2geojson(df):
    #s="name=X[0],description=X[1],LAT-x[2]"
    features = []
    insert_features = lambda X: features.append(
        geojson.Feature(geometry=geojson.Point((float(X["LONG"]), float(X["LAT"]))),
                        properties=dict(name=X[0], description=X[1])))
    df.apply(insert_features, axis=1)
    #with open('/dbfs/FileStore/tables/geojson11.geojson', 'w', encoding='utf8') as fp:
    #    geojson.dump(geojson.FeatureCollection(features), fp, sort_keys=True, ensure_ascii=False)
    print(features)

df = spark.sql("select * from geojson1")
df = df.toPandas()
data2geojson(df)
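A minimal sketch of one way to build the properties dynamically from the DataFrame's columns (my own illustration, not tested against the asker's data; it assumes, as in the question, that LONG and LAT are the geometry columns and that every other column should become a property):

import geojson

def data2geojson_dynamic(df):
    # Every column except the coordinate columns becomes a property.
    property_cols = [c for c in df.columns if c not in ("LONG", "LAT")]
    features = []

    def insert_features(row):
        features.append(
            geojson.Feature(
                geometry=geojson.Point((float(row["LONG"]), float(row["LAT"]))),
                properties={col: row[col] for col in property_cols}))

    df.apply(insert_features, axis=1)
    return geojson.FeatureCollection(features)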
Is there any way to generate random letters in Python? I've come across code that generates random letters from a-z.
For instance, the below code generates the following output.
import pandas as pd
import numpy as np
import string
ran1 = np.random.random(5)
print(ran1)
[0.79842166 0.9632492 0.78434385 0.29819737 0.98211011]
ran2 = string.ascii_lowercase
'abcdefghijklmnopqrstuvwxyz'
However, I want to generate random letters where the input is the number of random letters (for example 3) and the desired output is [a, f, c]. Thanks in advance.
Convert the string of letters to a list and then use numpy.random.choice. You'll get an array back, but you can make that a list if you need.
import numpy as np
import string
np.random.seed(123)
list(np.random.choice(list(string.ascii_lowercase), 10))
#['n', 'c', 'c', 'g', 'r', 't', 'k', 'z', 'w', 'b']
As you can see, the default behavior is to sample with replacement. You can change that behavior if needed by adding the parameter replace=False.
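For example, to draw 3 distinct letters without replacement (the actual letters will depend on the random state):

import numpy as np
import string

letters = list(np.random.choice(list(string.ascii_lowercase), 3, replace=False))
print(letters)  # e.g. ['a', 'f', 'c'] -- no letter can repeat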
Here is an idea modified from https://pythontips.com/2013/07/28/generating-a-random-string/
import string
import random
def random_generator(size=6, chars=string.ascii_lowercase):
    return ''.join(random.choice(chars) for x in range(size))
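A quick usage note (my example call, not from the original post): the function returns a string, so wrap it in list() if you want the [a, f, c]-style output from the question.

print(random_generator(3))        # e.g. 'afc'
print(list(random_generator(3)))  # e.g. ['a', 'f', 'c']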
I reckon the answer to my title is often to go and read the documentation, but I ran through the NLTK book and it doesn't give the answer. I'm kind of new to Python.
I have a bunch of .txt files and I want to be able to use the corpus functions that NLTK provides for its built-in corpora in nltk_data.
I've tried PlaintextCorpusReader but I couldn't get further than:
>>>import nltk
>>>from nltk.corpus import PlaintextCorpusReader
>>>corpus_root = './'
>>>newcorpus = PlaintextCorpusReader(corpus_root, '.*')
>>>newcorpus.words()
How do I segment the newcorpus sentences using punkt? I tried the punkt functions, but they couldn't read the PlaintextCorpusReader class?
Can you also lead me to how I can write the segmented data into text files?
After some years of figuring out how it works, here's the updated tutorial on how to create an NLTK corpus from a directory of text files.
The main idea is to make use of the nltk.corpus.reader package. In the case that you have a directory of textfiles in English, it's best to use the PlaintextCorpusReader.
If you have a directory that looks like this:
newcorpus/
file1.txt
file2.txt
...
Simply use these lines of code and you can get a corpus:
import os
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
corpusdir = 'newcorpus/' # Directory of corpus.
newcorpus = PlaintextCorpusReader(corpusdir, '.*')
NOTE: the PlaintextCorpusReader will use the default nltk.tokenize.sent_tokenize() and nltk.tokenize.word_tokenize() to split your texts into sentences and words. These functions are built for English, so they may NOT work for all languages.
Here's the full code with creation of test textfiles and how to create a corpus with NLTK and how to access the corpus at different levels:
import os
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
# Let's create a corpus with 2 texts in different text files.
txt1 = """This is a foo bar sentence.\nAnd this is the first txtfile in the corpus."""
txt2 = """Are you a foo bar? Yes I am. Possibly, everyone is.\n"""
corpus = [txt1,txt2]
# Make new dir for the corpus.
corpusdir = 'newcorpus/'
if not os.path.isdir(corpusdir):
    os.mkdir(corpusdir)
# Output the files into the directory.
filename = 0
for text in corpus:
    filename += 1
    with open(corpusdir + str(filename) + '.txt', 'w') as fout:
        print(text, file=fout)

# Check that our corpus does exist and the files are correct.
assert os.path.isdir(corpusdir)
for infile, text in zip(sorted(os.listdir(corpusdir)), corpus):
    assert open(corpusdir + infile, 'r').read().strip() == text.strip()
# Create a new corpus by specifying the parameters
# (1) directory of the new corpus
# (2) the fileids of the corpus
# NOTE: in this case the fileids are simply the filenames.
newcorpus = PlaintextCorpusReader('newcorpus/', '.*')
# Access each file in the corpus.
for infile in sorted(newcorpus.fileids()):
    print(infile)                        # The fileid of each file.
    with newcorpus.open(infile) as fin:  # Opens the file.
        print(fin.read().strip())        # Prints the content of the file.
    print()

# Access the plaintext; outputs a pure string.
print(newcorpus.raw().strip())
print()

# Access paragraphs in the corpus. (list of list of list of strings)
# NOTE: NLTK automatically calls nltk.tokenize.sent_tokenize and
# nltk.tokenize.word_tokenize.
#
# Each element in the outermost list is a paragraph, and
# each paragraph contains sentence(s), and
# each sentence contains token(s).
print(newcorpus.paras())
print()

# To access paragraphs of a specific fileid.
print(newcorpus.paras(newcorpus.fileids()[0]))

# Access sentences in the corpus. (list of list of strings)
# NOTE: the texts are flattened into sentences that contain tokens.
print(newcorpus.sents())
print()

# To access sentences of a specific fileid.
print(newcorpus.sents(newcorpus.fileids()[0]))

# Access just tokens/words in the corpus. (list of strings)
print(newcorpus.words())

# To access tokens of a specific fileid.
print(newcorpus.words(newcorpus.fileids()[0]))
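To address the last part of the question (writing the segmented data back to text files), here is a minimal sketch under the assumption that one whitespace-joined sentence per line is what you want; the output filename is made up for illustration:

from nltk.corpus.reader.plaintext import PlaintextCorpusReader

newcorpus = PlaintextCorpusReader('newcorpus/', '.*')

with open('segmented_sentences.txt', 'w') as fout:
    for sent in newcorpus.sents():       # each sentence is a list of tokens
        fout.write(' '.join(sent) + '\n')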
Finally, to read a directory of texts and create an NLTK corpus in another language, you must first ensure that you have python-callable word-tokenization and sentence-tokenization modules that take string input and produce output like this:
>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> txt1 = """This is a foo bar sentence.\nAnd this is the first txtfile in the corpus."""
>>> sent_tokenize(txt1)
['This is a foo bar sentence.', 'And this is the first txtfile in the corpus.']
>>> word_tokenize(sent_tokenize(txt1)[0])
['This', 'is', 'a', 'foo', 'bar', 'sentence', '.']
I think the PlaintextCorpusReader already segments the input with a punkt tokenizer, at least if your input language is English.
PlaintextCorpusReader's constructor:
def __init__(self, root, fileids,
             word_tokenizer=WordPunctTokenizer(),
             sent_tokenizer=nltk.data.LazyLoader(
                 'tokenizers/punkt/english.pickle'),
             para_block_reader=read_blankline_block,
             encoding='utf8'):
You can pass the reader a word and sentence tokenizer, but for the latter the default already is nltk.data.LazyLoader('tokenizers/punkt/english.pickle').
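As a minimal sketch (not from the original answer), passing the tokenizers explicitly looks like this; the values shown are just the defaults, so substitute your own language's punkt model or any callable tokenizer:

import nltk.data
from nltk.tokenize import WordPunctTokenizer
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

newcorpus = PlaintextCorpusReader(
    'newcorpus/', '.*',
    word_tokenizer=WordPunctTokenizer(),
    sent_tokenizer=nltk.data.LazyLoader('tokenizers/punkt/english.pickle'))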
For a single string, a tokenizer would be used as follows (explained here, see section 5 for punkt tokenizer).
>>> import nltk.data
>>> text = """
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries. And sometimes sentences
... can start with non-capitalized words. i is a good variable
... name.
... """
>>> tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> tokenizer.tokenize(text.strip())
>>> import nltk
>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = './'
>>> newcorpus = PlaintextCorpusReader(corpus_root, '.*')
"""
if the ./ dir contains the file my_corpus.txt, then you
can view say all the words it by doing this
"""
>>> newcorpus.words('my_corpus.txt')
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
filecontent1 = "This is a cow"
filecontent2 = "This is a Dog"
corpusdir = 'nltk_data/'
with open(corpusdir + 'content1.txt', 'w') as text_file:
    text_file.write(filecontent1)
with open(corpusdir + 'content2.txt', 'w') as text_file:
    text_file.write(filecontent2)
text_corpus = PlaintextCorpusReader(corpusdir, ["content1.txt", "content2.txt"])
no_of_words_corpus1 = len(text_corpus.words("content1.txt"))
print(no_of_words_corpus1)
no_of_unique_words_corpus1 = len(set(text_corpus.words("content1.txt")))
no_of_words_corpus2 = len(text_corpus.words("content2.txt"))
no_of_unique_words_corpus2 = len(set(text_corpus.words("content2.txt")))