Merging data points in NLTK's ConditionalFreqDist - python-3.x

I have a Prussian newspaper corpus covering the years 1863 to 1894 and want to plot word usage over time. The corpus consists of roughly 2400 XML files, one per issue. If I plot the ConditionalFreqDist directly, I get a graph with 2400 data points on the x-axis, which renders it unreadable.
How can I merge the information for the same year, displaying the average usage of each word in my search list u_input? E.g.: I have 3 files for the year 1863 and am looking for the word 'König' (king) among other search terms; the first file contains 1 mention, the 2nd file 3 and the 3rd file 2. I would like the graph to have only one data point, '1863', with the value '2'.
The plotting function:
def _plot_input():
    cfd = nltk.ConditionalFreqDist(
        (target, fileid[:-4])  # takes first 4 characters as label names = year
        for fileid in reader.fileids()  # for all files in directory
        for w in reader.words(fileid)  # for all words in each file
        for target in u_input
        if w.lower().startswith(target)  # includes words like 'königlich' if the search term was 'könig'
    )
    cfd.plot(title='Word usage over time in Prussian Newspapers')
u_input is a list containing the words I'm analyzing, reader is my corpus reader object, and files are named yyyy-mm-dd.xml, e.g. "1867-03-06.xml".
Thanks in advance.
Edit:
A quick fix would be to loop over all files and, for each year, write the contents of every file beginning with that year into a single new file.

To extract the year from the filename you must write fileid[:4], not fileid[:-4]. Once you do that, you'll have only as many x positions as there are distinct years in your corpus. This is exactly equivalent to the "quick fix" you suggest.
However, the y values will be totals for the year, not per-file averages within each year as you ask. If this is really what you needed, edit your question to clarify. (I suspect that what you really need is an average over the total number of words in a year; anything else is nonsense, unless all your files are exactly the same size.)
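The totals-versus-averages distinction can be sketched without NLTK: a ConditionalFreqDist is essentially a dict mapping conditions to Counters, so grouping counts on fileid[:4] and dividing by the number of files per year gives per-file averages. A minimal stdlib sketch, assuming per-file hit counts have already been collected (the fileids and counts below are the made-up 1863 example from the question):

```python
from collections import Counter, defaultdict

def per_year_averages(counts_per_file):
    """counts_per_file: dict mapping 'yyyy-mm-dd.xml' -> Counter of target hits.
    Returns dict mapping year -> {target: average hits per file in that year}."""
    totals = defaultdict(Counter)   # year -> summed counts across files
    n_files = Counter()             # year -> number of files seen
    for fileid, counts in counts_per_file.items():
        year = fileid[:4]           # 'yyyy-mm-dd.xml' -> 'yyyy'
        totals[year].update(counts)
        n_files[year] += 1
    return {year: {t: c / n_files[year] for t, c in counts.items()}
            for year, counts in totals.items()}

# The three 1863 files from the question: 1, 3 and 2 mentions of 'könig'
data = {
    '1863-01-01.xml': Counter({'könig': 1}),
    '1863-05-02.xml': Counter({'könig': 3}),
    '1863-09-03.xml': Counter({'könig': 2}),
}
print(per_year_averages(data))  # -> {'1863': {'könig': 2.0}}
```

Plotting these averages (one x position per year) gives the single '1863' point with value 2 that the question asks for.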

Related

Split range of times into 5 mins intervals in Excel

I'm new to programming and data analysis in general, and need some help with a large dataset file (43 GB). It is a list of high-frequency trades for a stock containing two columns I'm interested in: time (in UTC format including milliseconds, e.g. 2019-01-01T00:06:41.033529796Z) and price. I have managed to open the file using delimiter software and split it into 509 files, each of which fits in an Excel sheet.
I now need to compare the price change during 5 minute intervals based on the prices in this file.
My first problem is that Excel doesn't recognise the appropriate time format for interpretation.
Secondly, I need to understand, perhaps using the =FLOOR formula, how to split the list of trade times into 5-minute intervals and find the difference in corresponding prices.
I have tried making excel recognise the UTC format with no success. Any help would be appreciated!
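One way to bucket the timestamps outside Excel is to floor each one to the start of its 5-minute interval in Python, a sketch under the assumption that every timestamp looks like the sample in the question; the nanosecond fraction is truncated to microseconds because Python's datetime only carries microsecond precision (floor_to_5min is a made-up helper name):

```python
from datetime import datetime

def floor_to_5min(ts):
    """Parse a UTC timestamp like '2019-01-01T00:06:41.033529796Z'
    and floor it to the start of its 5-minute interval."""
    # datetime handles at most microseconds, so trim the fraction to 6 digits
    head, frac = ts.rstrip('Z').split('.')
    dt = datetime.strptime(head + '.' + frac[:6], '%Y-%m-%dT%H:%M:%S.%f')
    return dt.replace(minute=dt.minute - dt.minute % 5, second=0, microsecond=0)

print(floor_to_5min('2019-01-01T00:06:41.033529796Z'))  # 2019-01-01 00:05:00
```

Grouping trades by the floored timestamp then gives one price series per 5-minute interval, from which first/last price differences can be taken.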

Which is the best Data Mining model to extrapolate known values to missing values in a table? (General question)

I am working on a little data mining project (I am still a Data Science student, not a professional). Maybe you can help me to choose a proper model for my task.
So, let's say we have a table with three columns and around 4000 rows:
YEAR    COLOR     NAME
1900    Green     David
1901    Yellow    Sarah
1902    Green     ???
1902    Red       Sarah
…       …         …
2020    Purple    John
Any value for any field can be repeated in the dataset (also Year values).
In the first two columns we don't have missing values, but we only have around 20% of the Name values in the third column. The Name value depends somewhat on the first two columns (not a causal relation).
My goal is to extrapolate the available Name values to the whole table and get a range of occurrences for each name value (for example in a boxplot)
I have imagined a process like this, although I am not very sure it makes sense statistically (any objections and suggestions are appreciated):
For every unknown NAME value, the algorithm randomly chooses one of the already known NAME values. The odds of a particular NAME value being chosen depend on the variables YEAR and COLOR. For instance, if 'David' values tend to be correlated with low Year values AND with 'Green' or 'Purple' values for Color, the algorithm gives 'David' a higher probability of being chosen if the input values for Year and Color are "1900, Purple".
When the above process ends, the number of occurrences for each name is counted.
The above process is applied 30 times and the results for each name are displayed in a boxplot.
However, I don't know which is the best model to implement an idea similar to this. I have drawn the process in a simple paint drawing:
Possible output for the task
Which do you think it could be a good approach to this task? I appreciate any help.
I think you have the process down; it's converting the data which may be the first hurdle.
I would look at using sklearn.preprocessing.OrdinalEncoder to convert the data from categorical to numeric.
You could then use a random number generator to produce a number within the range defined by the encoding, which would randomly select a name.
Loop through this 30 times with a for loop to achieve the result.
It also looks like you will need to provide the ranking values for year and colour prior to building out your code. From there you would just provide bands, for example if year > 1985, etc., within your for loop to specify the names.
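The weighted sampling step described in the question (draw a known name with probability depending on how often it co-occurs with the row's YEAR and COLOR) can be sketched with stdlib tools alone. This is a minimal sketch, not a recommended model: the ±10-year band and the add-one smoothing are illustrative choices, and impute_name is a made-up helper name.

```python
import random
from collections import Counter

def impute_name(rows, year, color, band=10, rng=random):
    """rows: (year, color, name) triples where the name is known.
    Draws a name with probability proportional to how often it appears
    in rows with the same colour and a year within +/-band, with add-one
    smoothing so every known name keeps a nonzero chance."""
    names = sorted({n for _, _, n in rows})
    freq = Counter(n for y, c, n in rows
                   if c == color and abs(y - year) <= band)
    weights = [freq[n] + 1 for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

rows = [(1900, 'Green', 'David'), (1901, 'Yellow', 'Sarah'),
        (1902, 'Red', 'Sarah'), (2020, 'Purple', 'John')]
# Repeat the draw 30 times and tally, as in step 3 of the question
tally = Counter(impute_name(rows, 1902, 'Green') for _ in range(30))
print(tally)
```

Running the whole imputation 30 times and collecting one such tally per run yields the per-name distributions to show in the boxplot.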

Scripting Language that can count number of files and do simple plotting

I am looking for a programming language that can do the following:
Count the number of files (User will specify the file type and date range) in a directory.
Windows-based
Can do simple GUI that will prompt the user with input values
Can do simple plotting of bar and line graphs
Can take input values from an Excel file.
Can do simple means test like t-test or ANOVA.
The purpose is this: I need to automate the weekly sample-per-consumable ratio data. Every picture file corresponds to one sample tested. The consumable data is logged in an Excel file. So the program I want to write will "read" the number of samples tested by counting the picture files, and it will "read" the consumables used by reading the Excel file.
Input is as follows:
1. Date range where you want to see the samples tested per consumable ratio
2. Machine number and consumable type, to know which part of the Excel file to extract
3. Tick box as an option for whether to do a stat test or not.
Output:
1. Bar or line graph (samples tested per consumable ratio vs time)
2. T-test results if we want to compare two time periods.
Many thanks in advance.
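Python meets all the listed requirements (tkinter for simple GUIs, matplotlib for bar/line graphs, openpyxl or pandas for Excel input, scipy.stats for t-tests and ANOVA). As a sketch of requirement 1 only, here is a hedged example that counts files of a given type in a date range; it assumes the date filter applies to each file's modification time, and count_files is a made-up helper name:

```python
from datetime import datetime
from pathlib import Path

def count_files(folder, ext, start, end):
    """Count files directly under `folder` with extension `ext` whose
    modification time falls between `start` and `end` (inclusive)."""
    return sum(
        1 for p in Path(folder).glob(f'*{ext}')
        if start <= datetime.fromtimestamp(p.stat().st_mtime) <= end
    )
```

If the picture filenames themselves carry dates, parsing the name instead of using the modification time would be more robust.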

making big data set smaller in excel

I made a little test machine that accidentally created a 'big' data set:
6 columns with +/- 550.000 rows.
The end result I am looking for is a graph with 6 lines: horizontally, measurements 1 - 550.000; vertically, the values in the rows (capped at 200 or so). The data is a resistance measurement that should be between 0 - 30 or very big (broken); the software writes 'inf' in these cases.
My skill is limited to excel, so what have I done until now:
Imported it into Excel. The measurements are valuable between 0 - 30 and inf is not good for a graph, so I did: =IF(H1>200,200,H1).
Now making a graph is a slow exercise; Excel does not like this and the result is not good.
So I would like to take the average of every 60 measurements to reduce the rows to below 10.000, e.g. =AVERAGE(H1:H60).
But I cannot get this to work.
Questions:
How do I reduce this data set and get a good graph?
Should I switch to other software that is more applicable?
FYI: I already changed the software of the testing device to take the average value of a bunch of measurements the next time... But I cannot repeat this test.
Download link of data set comma separated file 17MB
I think you are on the right track, however my guess is that you only want to get an average every 60 rows and are unsure how to do this.
Using MOD(Number, Divisor) inside an if statement will let you specify that the average should be calculated only once in every x number of cells.
Assuming you'll have one row above your data table for headers, you are looking for something along the lines of:
=IF(MOD(ROW(A61),60) = 1,AVERAGE(H2:H61),"")
Once you have this you can filter your average column to non-blank values and use this to create your graph.
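Since the asker also considers other software: the same reduction (one average per 60 measurements, with values capped at 200 as in the question) is a few lines of Python, which sidesteps the 550.000-row chart problem entirely. A minimal sketch, assuming the values arrive as strings from the CSV (note float('inf') parses the 'inf' readings, so the cap handles them too); downsample is a made-up helper name:

```python
def downsample(values, chunk=60, cap=200):
    """Cap each value at `cap` (so 'inf' readings don't dominate) and
    return the average of every `chunk` consecutive values."""
    capped = [min(float(v), cap) for v in values]
    return [sum(capped[i:i + chunk]) / len(capped[i:i + chunk])
            for i in range(0, len(capped), chunk)]

print(downsample(['10', '20', 'inf'], chunk=3))  # one point: (10+20+200)/3
```

Applying this per column yields 6 series of under 10.000 points each, which any plotting tool (including Excel) handles comfortably.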

Excel time series data plot

I am trying to plot some time series data, but in a way that has stumped me so far. The salient part here is that each data point is associated with an open date and a closed date. I would like a time series line graph that counts the number open on a given date.
Example: Open - Close
first record: 2/10/2013 - 3/1/2013
second record: 2/15/2013 - 3/5/2013
The graph I'm looking for would start at 0, rise to 1 on 2/10, rise again to 2 on 2/15, then drop to 1 on 3/1 and back to 0 on 3/5.
The actual dataset contains hundreds of records, so manual processing is out of the question. I'm sure there must be an easy way to do it, but I have not found it yet. Tried help and google search, but I'm not exactly sure what I'm looking for.
Use the COUNTIFS() function.
So, you specify the category labels, and then use the COUNTIFS() function to evaluate, for each category label, how many records are open at that time.
You can use the result of the Countifs function as the frequency for a histogram, time series, bar chart, etc.
Then, plot the data in columns E & F (or however your sheet happens to be arranged) to create the chart.
Edit
To include blank values in the count, modify the formula as follows:
=COUNTIFS($B$3:$B$7,"<="&E3,$C$3:$C$7,">="&E3)+COUNTIFS($B$3:$B$7,"<="&E3,$C$3:$C$7,"")
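The COUNTIFS logic (for each date, count records whose open date is on or before it and whose close date is on or after it, or blank) translates directly to a short script. A sketch using the two records from the question; open_on is a made-up helper name, and None stands in for a blank close date:

```python
from datetime import date

def open_on(records, day):
    """records: (open_date, close_date) pairs; close_date may be None
    for still-open records. Count how many records are open on `day`."""
    return sum(1 for opened, closed in records
               if opened <= day and (closed is None or closed >= day))

records = [(date(2013, 2, 10), date(2013, 3, 1)),
           (date(2013, 2, 15), date(2013, 3, 5))]
print(open_on(records, date(2013, 2, 20)))  # 2 (both records open)
print(open_on(records, date(2013, 3, 3)))   # 1 (first record closed 3/1)
```

Evaluating open_on over a run of consecutive dates produces exactly the series the COUNTIFS column builds, ready to plot as a time series.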
