Read N number of random lines from a text file / Python - python-3.x

I'm currently reading from a "collection" file which holds all possible outcomes, of type int, which I read into a DataFrame.
cycle = 19380816
pull = 10000000
sample = rand.sample(range(cycle),cycle-pull)
new_df = pd.read_csv('collection.txt', skiprows = sample, sep = " ", names = ['a1','b1','c1','a2','b2','c2','a3','b3','c3','a4','b4','c4','a5','b5','c5'], header = None)
Of course, the sample cannot be larger than the number of lines in the file.
I want to randomly pull more lines than the file contains, i.e. the case where "pull > cycle".
Essentially, I want a
rand.choice, 'of line in "collection.txt"', N times
Is there a way to do this using pd.read_csv?

You can always read in the entire dataframe, then take n samples (with replacement, using rand.choices) from the row indices and then use iloc to grab the new sampled dataframe.
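import random as rand  # assumed alias, matching the question's rand.sample
import pandas as pd

# cycle = 19380816  (not needed here: the row count comes from len(whole_df))
pull = 10000000
whole_df = pd.read_csv('collection.txt', sep = " ", names = ['a1','b1','c1','a2','b2','c2','a3','b3','c3','a4','b4','c4','a5','b5','c5'], header = None)
i_sample = rand.choices(range(len(whole_df)), k=pull)  # row positions, drawn with replacement
new_df = whole_df.iloc[i_sample]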
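If pandas is not needed for the sampling step itself, the same idea (drawing with replacement, so N may exceed the number of lines) can be sketched with the standard library alone. The function name sample_lines_with_replacement and the three-line stand-in data are assumptions for illustration, not part of the original answer.

```python
import random


def sample_lines_with_replacement(lines, k, seed=None):
    """Return k lines drawn uniformly at random, with replacement."""
    rng = random.Random(seed)
    return rng.choices(lines, k=k)


# Hypothetical stand-in for the contents of collection.txt
lines = ["1 2 3", "4 5 6", "7 8 9"]

# k may be larger than len(lines) because lines can repeat
picked = sample_lines_with_replacement(lines, k=10, seed=0)

# Parse each sampled line into a row of ints
rows = [[int(x) for x in line.split()] for line in picked]
```

With a real file, lines would come from reading the file once (for example with f.read().splitlines()), which still requires the whole file to fit in memory.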

Related

Appending max value of a zipped list

I have these three lists:
bankNames = ["bank1","bank2","bank3"]
interestRate = (0.05,0.01,0.08)
namePlusInterest = zip(interestRate,bankNames)
print(max(list(namePlusInterest)))
The print call outputs:
(0.08, 'bank3')
I want to be able to split the output into individual variables (for example):
MaxRate = 0.08
MaxBank = 'bank3'
so that later in my code I can write:
print(MaxBank + "has the highest interest rate of" + MaxRate)
You can use tuple unpacking to get each individual element from the tuple:
bankNames = ["bank1", "bank2", "bank3"]
interestRate = (0.05, 0.01, 0.08)
namePlusInterest = zip(interestRate, bankNames)
MaxRate, MaxBank = max(list(namePlusInterest))
print(f"{MaxBank} has the highest interest rate of {MaxRate}")
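As a variation on the unpacking above, max can also be given a key function so that only the rate is compared and the names never act as a tie-breaker. This is a minimal sketch; the snake_case variable names are assumptions chosen for the example.

```python
bank_names = ["bank1", "bank2", "bank3"]
interest_rates = (0.05, 0.01, 0.08)

# Compare the (name, rate) pairs by rate only
max_bank, max_rate = max(zip(bank_names, interest_rates), key=lambda pair: pair[1])

message = f"{max_bank} has the highest interest rate of {max_rate}"
print(message)
```

If two banks share the highest rate, max returns the first one it encounters.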

TypeError: 'int' object is not iterable when calculating mean

I am trying to read values from a file and store them in lists. After that, I need to take their mean, and doing so raises the error above. The code works up to the line
"Avg_Humidity.append(words[8])"
Here it is:
def monthly_report(path,year,month):
    pre_script="Murree_weather"
    format='.txt'
    file_name = pre_script + year + month+format
    name_path=os.path.join(path,file_name)
    file = open(name_path, 'r')
    data = file.readlines()
    Max_Temp = []
    Min_Temp = []
    Avg_Humidity = []
    for line in data:
        words = line.split(",")
        Max_Temp.append(words[1])
        Min_Temp.append(words[3])
        Avg_Humidity.append(words[8])
    Count_H, Count_Max_Temp, Count_Min_Temp, Mean_Max_Temp, Mean_Min_Temp, Mean_Avg_Humidity=0
    for iterate in range(1,len(Max_Temp)):
        Mean_Max_Temp= Mean_Max_Temp+Max_Temp(iterate)
        Count_Max_Temp=Count_Max_Temp+1
    Mean_Max_Temp=Mean_Max_Temp/Count_Max_Temp
    for iterate in range(1,len(Min_Temp)):
        Mean_Min_Temp= Mean_Min_Temp+Min_Temp(iterate)
        Count_Min_Temp=Count_Min_Temp+1
    Mean_Min_Temp=Mean_Min_Temp/Count_Min_Temp
    for iterate in range(1,len(Avg_Humidity)):
        Mean_Avg_Humidity= Mean_Avg_Humidity+Avg_Humidity(iterate)
        Count_H=Count_H+1
    Mean_Avg_Humidity=Mean_Avg_Humidity/Count_H
    print("Mean Average Humidity = ",Mean_Avg_Humidity)
    print("Mean Maximum Temperature = ",Mean_Max_Temp)
    print("Mean Minimum Temperature = ",Mean_Min_Temp)
    return
This line is incorrect:
Count_H, Count_Max_Temp, Count_Min_Temp, Mean_Max_Temp, Mean_Min_Temp, Mean_Avg_Humidity = 0
To fix, change it to:
Count_H = Count_Max_Temp = Count_Min_Temp = Mean_Max_Temp = Mean_Min_Temp = Mean_Avg_Humidity = 0
An alternative fix would be to leave the commas as they are and change the right-hand side to a list or tuple of zeroes that has the same number of elements as the left-hand side. But that would be less clear, and harder to maintain.
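A small sketch of why the original line fails and how the two fixes behave. The short variable names are placeholders for this example only.

```python
# Tuple unpacking needs an iterable on the right-hand side; a bare int is not one.
try:
    a, b, c = 0
except TypeError as exc:
    message = str(exc)  # the exact wording varies between Python versions

# Fix 1: chained assignment binds every name to the same value
count = total = mean = 0

# Fix 2: keep the commas and supply one value per name
x, y, z = (0,) * 3
```

Chained assignment is safe here because ints are immutable. Once this line is fixed, the loops in the question would still need Max_Temp[iterate] (square brackets, since calling a list raises its own TypeError) and a float() conversion, because line.split(",") yields strings.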

dictionaries feature extraction Python

I'm doing a text categorization experiment. For the feature extraction phase, I'm trying to create a feature dictionary per document. For now I have two features: type-token ratio, and n-grams of the relative frequency of function words. When I print my instances, only the type-token ratio feature is in the dictionary. This seems to be caused by a malfunctioning get_pos(), which returns empty lists.
This is my code:
instances = []
labels = []
directory = "\\Users\OneDrive\Data"
for dname, dirs, files in os.walk(directory):
    for fname in files:
        fpath = os.path.join(dname, fname)
        with open(fpath,'r') as f:
            text = csv.reader(f, delimiter='\t')
            vector = {}
            #TTR
            lemmas = get_lemmas(text)
            unique_lem = set(lemmas)
            TTR = str(len(unique_lem) / len(lemmas))
            name = fname[:5]
            vector['TTR'+ '+' + name] = TTR
            #function word ngrams
            pos = get_pos(text)
            fw = []
            regex = re.compile(
                r'(LID)|(VNW)|(ADJ)|(TW)|(VZ)|(VG)|(BW)')
            for tag in pos:
                if regex.search(tag):
                    fw.append(tag)
            for n in [1,2,3]:
                grams = ngrams(fw, n)
                fdist = FreqDist(grams)
                total = sum(c for g,c in fdist.items())
                for gram, count in fdist.items():
                    vector['fw'+str(n)+'+'+' '+ name.join(gram)] = count/total
            instances.append(vector)
            labels.append(fname[:1])
print(instances)
And this is an example of a Dutch input file:
This is the code from the get_pos function, which I call from another script:
def get_pos(text):
    row4=[]
    pos = []
    for row in text:
        if not row:
            continue
        else:
            row4.append(row[4])
    pos = [x.split('(')[0] for x in row4] # remove what's between the brackets
    return pos
Can you help me find what's wrong with the get_pos function?
When you call get_lemmas(text), all contents of the file are consumed, so get_pos(text) has nothing left to iterate over. If you want to go through a file's content multiple times, you need to either f.seek(0) between the calls, or read the rows into a list in the beginning and iterate over the list when needed.
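The behaviour can be reproduced in a few lines. This is a minimal sketch using an in-memory file; the two-column sample data is made up for illustration.

```python
import csv
import io

# Hypothetical tab-separated sample standing in for one input file
sample = "a\tb\nc\td\n"

f = io.StringIO(sample)
reader = csv.reader(f, delimiter="\t")
first_pass = [row[0] for row in reader]   # consumes every row
second_pass = [row[0] for row in reader]  # the reader is exhausted, so this is empty

# Fix 1: rewind the file and read it again
f.seek(0)
after_seek = [row[1] for row in csv.reader(f, delimiter="\t")]

# Fix 2: read the rows into a list once, then iterate as often as needed
f.seek(0)
rows = list(csv.reader(f, delimiter="\t"))
col0 = [r[0] for r in rows]
col1 = [r[1] for r in rows]
```

Reading into a list costs memory proportional to the file size, but it avoids re-parsing the file for every feature.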

Averaging a list from a file and also find the min and max rate of change between each index Python

We received a file called USPopulation.txt, with instructions saying that line 1 of the file is the year 1950 and the last line is 1990. We need to store the data in a list and do three things with it:
Find the average (somewhat easy, I think I have this down)
Find the maximum rate of change in any one year
Find the minimum rate of change in any one year
The code for averaging the numbers, taken from a past program:
list_of_numbers = []
with open('USPopulation.txt') as f:
    for line in f:
        if line.strip():
            list_of_numbers.append(int(line.strip()))
print('Total ',len(list_of_numbers))
print('Average ',1.0 * sum(list_of_numbers) / len(list_of_numbers))
I need to add the other two parts and have no idea how. Any help would be great.
The rate of change is the difference between two subsequent values in the list. To get that value, you basically need to store the previous value and compare it to the current one.
One way of doing this would be to simply loop through your list, and collect those values:
previous = None
ratesOfChange = []
for num in list_of_numbers:
    if previous is not None:  # skip the first value, which has no predecessor
        ratesOfChange.append(abs(num - previous))
    previous = num
Getting the maximum and minimum is then as easy as calling max() and min() on the ratesOfChange list.
Of course to improve this a bit, you might want to consider collecting those values while parsing the file already (so you save the second loop through the list). And you could even note down the minimum and maximum at the same time to save another loop through it (both max and min will loop over the list).
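The single-pass variant described above might look like the following sketch. The function name population_stats and the three sample values are assumptions for illustration. Like the loop above, it uses the absolute difference between consecutive values.

```python
import io


def population_stats(lines):
    """Return (average, min_change, max_change) in one pass over the lines."""
    total = 0
    count = 0
    previous = None
    min_change = None
    max_change = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        value = int(line)
        total += value
        count += 1
        if previous is not None:
            change = abs(value - previous)
            if min_change is None or change < min_change:
                min_change = change
            if max_change is None or change > max_change:
                max_change = change
        previous = value
    average = total / count if count else None
    return average, min_change, max_change


# Hypothetical sample standing in for USPopulation.txt
sample = io.StringIO("10000\n10005\n\n10030\n")
average, min_change, max_change = population_stats(sample)
```

With a real file, the call would be population_stats(f) inside a with open('USPopulation.txt') as f: block.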
If USPopulation.txt looks like
1950=10000
1951=10005
1952=10030
then you can convert the lines into a dictionary, so that each year maps to its population:
file1 = open("USPopulation.txt", "r")
years_dict = dict()
arr = []
class population:
    def __init__(self):
        self.average = 0
        self.maximum = 0
        self.minimum = 0
    def average_method(self):
        try:
            for line in file1.readlines():
                value1 = line.split('=', 1)
                years_dict[value1[0]] = int(value1[1])
            length_of_dictionary = len(years_dict.keys())
            for values in years_dict.values():
                self.average = self.average + values
            self.average = (1.0 * self.average / length_of_dictionary)
        except Exception:
            print("not able to read the lines from the file")
    def maximum_method(self):
        try:
            i = 0
            for year in range(1950, 1952):  # covers the three sample years above
                arr.insert(i, (years_dict[str(year + 1)] - years_dict[str(year)]))
                i = i + 1
            self.minimum = min(arr)
            self.maximum = max(arr)
        except Exception:
            print("not able to insert the element")
obj = population()
obj.average_method()
obj.maximum_method()
print("Average of population: " + str(obj.average))
print("maximum rate of change: " + str(obj.maximum))
print("minimum rate of change: " + str(obj.minimum))

Trying to read a text file...but not getting all the contents

I am trying to read a file with the following format, which repeats itself (I have cut the data short, even for the first repetition, because it is too long):
1.00 'day' 2011-01-02
'Total Velocity Magnitude RC - Matrix' 'm/day'
0.190189 0.279141 0.452853 0.61355 0.757833 0.884577
0.994502 1.08952 1.17203 1.24442 1.30872 1.36653
1.41897 1.46675 1.51035 1.55003 1.58595 1.61824
Download the actual file with the complete data here
This is my code which I am using to read the data from the above file:
fid = fopen(file_name); % open the file
dotTXT_fileContents = textscan(fid,'%s','Delimiter','\n'); % read it as string ('%s') into one big array, row by row
dotTXT_fileContents = dotTXT_fileContents{1};
fclose(fid); %# don't forget to close the file again
%# find rows containing 'Total Velocity Magnitude RC - Matrix' 'm/day'
data_starts = strmatch('''Total Velocity Magnitude RC - Matrix'' ''m/day''',...
dotTXT_fileContents); % data_starts contains the line numbers wherever 'Total Velocity Magnitude RC - Matrix' 'm/day' is found
ndata = length(data_starts); % total no. of data values will be equal to the corresponding no. of '** K' read from the .txt file
%# loop through the file and read the numeric data
for w = 1:ndata-1
%# read lines containing numbers
tmp_str = dotTXT_fileContents(data_starts(w)+1:data_starts(w+1)-3); % stores the content from file dotTXT_fileContents of the rows following the row containing 'Total Velocity Magnitude RC - Matrix' 'm/day' in form of string
%# convert strings to numbers
tmp_str = tmp_str{:}; % store the content of the string which contains data in form of a character
%# assign output
data_matrix_grid_wise(w,:) = str2num(tmp_str); % convert the part of the character containing data into number
end
To give you an idea of pattern of data in my text file, these are some results from the code:
data_starts =
2
1672
3342
5012
6682
8352
10022
ndata =
7
Therefore, my data_matrix_grid_wise should contain 1672-2-2-1(for a new line)=1667 rows. However, I am getting this as the result:
data_matrix_grid_wise =
Columns 1 through 2
0.190189000000000 0.279141000000000
0.423029000000000 0.616590000000000
0.406297000000000 0.604505000000000
0.259073000000000 0.381895000000000
0.231265000000000 0.338288000000000
0.237899000000000 0.348274000000000
Columns 3 through 4
0.452853000000000 0.613550000000000
0.981086000000000 1.289920000000000
0.996090000000000 1.373680000000000
0.625792000000000 0.859638000000000
0.547906000000000 0.743446000000000
0.562903000000000 0.759652000000000
Columns 5 through 6
0.757833000000000 0.884577000000000
1.534560000000000 1.714330000000000
1.733690000000000 2.074690000000000
1.078000000000000 1.277930000000000
0.921371000000000 1.080570000000000
0.934820000000000 1.087410000000000
Where am I wrong? In my final result, I should get data_matrix_grid_wise composed of 10000 elements instead of 36 elements. Thanks.
Update: How can I include the number before 'day' i.e. 1,2,3 etc. on a line just before the data_starts(w)? I am using this within the loop but it doesn't seem to work:
days_str = dotTXT_fileContents(data_starts(w)-1);
days_str = days_str{1};
days(w,:) = sscanf(days_str(w-1,:), '%d %*s %*s', [1, inf]);
The problem is in the line tmp_str = tmp_str{:};. MATLAB behaves unexpectedly here when handling chars. A short fix is to replace the last two statements of the loop with the following two lines:
y = cell2mat( cellfun(@(z) sscanf(z,'%f'),tmp_str,'UniformOutput',false));
data_matrix_grid_wise(w,:) = y;
The problem is with the last two statements. When you write tmp_str{:}, you convert the cell array into a comma-separated list of strings. If you assign this list to a single variable, only the first string is assigned, so tmp_str now holds only the first row of data.
Here is what you can do instead of the last two lines:
tmp_mat = cellfun(@str2num, tmp_str, 'uniformoutput',0);
data_matrix_grid_wise(w,:) = cell2mat(tmp_mat);
However, you will run into a problem with the concatenation (cell2mat), since not all of your rows have the same number of columns. How you handle that is up to you.
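For readers approaching the same file layout from Python (the language of the other examples on this page), one way around the unequal row lengths is to flatten each block into a single list of numbers. This is only a sketch: the function name parse_blocks is an assumption, and the second "day" block in the sample is made up to show the repetition.

```python
def parse_blocks(lines, marker="'Total Velocity Magnitude RC - Matrix'"):
    """Collect the numbers that follow each marker line into one flat list per block."""
    blocks = []
    current = None
    for line in lines:
        stripped = line.strip()
        if stripped.startswith(marker):
            current = []           # a marker line starts a new block
            blocks.append(current)
            continue
        if current is None or not stripped:
            continue               # skip lines outside a block and blank lines
        try:
            values = [float(tok) for tok in stripped.split()]
        except ValueError:
            current = None         # a non-numeric line (e.g. the next 'day' header) ends the block
            continue
        current.extend(values)
    return blocks


# Hypothetical excerpt; rows deliberately have different lengths
sample = """1.00 'day' 2011-01-02
'Total Velocity Magnitude RC - Matrix' 'm/day'
0.190189 0.279141 0.452853
0.61355 0.757833
2.00 'day' 2011-01-03
'Total Velocity Magnitude RC - Matrix' 'm/day'
0.423029 0.61659
""".splitlines()

blocks = parse_blocks(sample)
```

Because each block is a flat list, rows of different lengths no longer need to be concatenated into a rectangular matrix.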
