Excel VBA Textfile to 2d array - excel

I am new to excel vba. I want to read a textfile that contains text like this:
John Smith Engineer Chicago
Bob Alice Doctor New York
Jane Smith Teacher St. Louis
So, I want to convert this into a 2D array so that print(3,3) returns 'Teacher'.
I am able to read the entire file contents into one string but am having difficulty converting it to a 2D array like the one above. Please advise on how to proceed. Thanks

Unless the text file has some specific structure to it, you're going to struggle a bit. Things that might make it easier are:
Does the text file contain line breaks at the end of each line?
Are all the names in [FirstName][LastName] format as per your example, or might some have more or fewer words?
Does the Occupation always come directly after the name?
Are there a (very) limited number of Occupations?

As mentioned by NautMeg, you have to make some assumptions about the data based on the provided template.
We can assume that:
a space is the delimiter
the final column is City, which can contain a space
there are 4 columns:
First Name
Last Name
Profession
City/Location
Using this information:
While Not EOF(my_file)
    Line Input #my_file, text_line
    ' text_line contains the current line
    i = i + 1
    ' i is the line number
Wend
This is how we retrieve each line. Each line can then be split on the space delimiter:
Split ( Expression, [Delimiter], [Limit], [Compare] )
This will give you each item in the line. For indexes < 3 (0-based), each item is its own column of data and you can handle it however you want.
For indexes >= 3, join the items together into one string:
Join( SourceArray, [Delimiter] )
Use a single space as the delimiter here, since Split removes the spaces it splits on. Alternatively, passing 4 as the Limit argument makes Split return at most four items, so the city stays in one piece and no Join is needed.
That will allow you to parse the data as is.
However, for future reference if you can control the export of the text file, you should try exporting as a CSV file.
Good luck
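The same split-with-a-limit idea can be sketched in Python for comparison. This is only an illustration under the assumptions listed in the answer (space-delimited, four columns, only the last column may contain spaces); the function name parse_table and the sample string are made up for the example, and Python indexes from 0 where the question's print(3,3) counts from 1.

```python
def parse_table(text, columns=4):
    """Turn space-delimited lines into a list of rows (a 2D list).

    Assumes only the last column may contain spaces, so each line is
    split at most columns - 1 times and the remainder stays intact.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rows.append(line.split(maxsplit=columns - 1))
    return rows


sample = (
    "John Smith Engineer Chicago\n"
    "Bob Alice Doctor New York\n"
    "Jane Smith Teacher St. Louis\n"
)
table = parse_table(sample)
print(table[2][2])  # Teacher (row 3, column 3 when counting from 1)
print(table[1][3])  # New York
```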

Related

counting most commonly used words across thousands of text files using Octave

I've accumulated a list of over 10,000 text files in Octave. I've got a function which cleans up the contents of each file, normalizing things in various ways (lowercase, reducing repeated whitespace, etc). I'd like to distill from all these files a list of words that appear in at least 100 files. I'm not at all familiar with Octave data structs or cell arrays or the Octave sorting functions, and was hoping someone could help me understand how to:
initialize an appropriate data structure (word_appearances) to count how many emails contain a particular word
loop thru the unique words that appear in an email string and increment for each of those words the count I'm tracking in word_appearances -- ideally we'd ignore words less than two chars in length and also exclude a short list of stop_words.
reduce word_appearances to only contain words that appear some number of times, e.g, min_appearances=100 times.
sort the words in word_appearances alphabetically and either export this as a .MAT file or as a CSV file like so:
1 aardvark
2 albatross
etc.
I currently have this code to loop through my files one by one:
for i = 1:total_files
filename = char(files{i}(1));
printf("processing %s\n", filename);
file_contents = jPreProcessFile(readFile(filename))
endfor
Note that the file_contents that comes back is pretty clean -- usually just a bunch of words, some repeated, separated by single spaces like so:
email market if done right is both highli effect and amazingli cost effect ok it can be downright profit if done right imagin send your sale letter to on million two million ten million or more prospect dure the night then wake up to sale order ring phone and an inbox full of freshli qualifi lead we ve been in the email busi for over seven year now our list ar larg current and deliver we have more server and bandwidth than we current need that s why we re send you thi note we d like to help you make your email market program more robust and profit pleas give us permiss to call you with a propos custom tailor to your busi just fill out thi form to get start name email address phone number url or web address i think you ll be delight with the bottom line result we ll deliv to your compani click here emailaddr thank you remov request click here emailaddr licatdcknjhyynfrwgwyyeuwbnrqcbh
Obviously, I need to create the word_appearances data structure such that each element in it specifies a word and how many files have contained that word so far. My primary point of confusion is what sort of data structure word_appearances should be, how I would search this data structure to see if some new word is already in it, and if found, increment its count, otherwise add a new element to word_appearances with count=1.
Octave has containers.Map to hold key-value pairs. This is the simple usage:
% initialize map
m = containers.Map('KeyType', 'char', 'ValueType', 'int32');
% increment the count of a known word, or add a new word with count 1
word = 'hello';
if isKey(m, word)
  m(word) = m(word) + 1;
else
  m(word) = 1;
endif
This is one way to extract most frequent words from a map like the one above:
counts = cell2mat(values(m));
words = keys(m);  % keys and values are returned in the same order
[sorted_counts, indices] = sort(counts, 'descend');
top10_indices = indices(1:min(10, numel(indices)));
top10_words = words(top10_indices);
I must warn you though, Octave may be pretty slow at this task, considering that you have thousands of files. Use it only if running time isn't that important for you.
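If Octave turns out to be too slow, the same bookkeeping can be sketched in Python. This is a hedged illustration, not a port of the asker's code: document_frequency and frequent_words are hypothetical names, and the input is assumed to be the already-cleaned text of each file. Each file contributes at most 1 to a word's count, because the words are reduced to a set before counting.

```python
from collections import Counter


def document_frequency(documents, stop_words=frozenset(), min_length=2):
    """Count, for every word, how many documents contain it at least once."""
    counts = Counter()
    for text in documents:
        # A set keeps each word once per document.
        unique_words = {
            w for w in text.split()
            if len(w) >= min_length and w not in stop_words
        }
        counts.update(unique_words)
    return counts


def frequent_words(counts, min_appearances):
    """Return the words seen in at least min_appearances documents, sorted."""
    return sorted(w for w, n in counts.items() if n >= min_appearances)


docs = ["email market email", "email sale", "market a the"]
counts = document_frequency(docs, stop_words={"the"})
for number, word in enumerate(frequent_words(counts, 2), start=1):
    print(number, word)
```

The numbered lines printed at the end follow the "1 aardvark" layout asked for in the question; writing them to a file instead of printing would give the CSV-style export.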

Extracting text in excel

I have some text which I receive daily that I need to separate. I have hundreds of lines similar to the extract below:
COMMODITY PRICE DIFFERENTIAL: FEB50-FEB40 (APR): COMPANY A OFFERS 1000KB AT $0.40
I need to extract individual snippets from this text, each into a separate cell. The result needs to be the date, month, company, size, and price. In this case, the result would be:
FEB50-40
APR
COMPANY A
100
0.40
The issue I'm struggling with is uniformity. For example, one line might have FEB50-FEB40, another FEB5-FEB40, or FEB50-FEB4. Another example giving me difficulty is that some rows might have 'COMPANY A' and others 'COMPANYA' (one word instead of two).
Any ideas? I've been trying combinations of the formulas below, but I'm not able to get uniform results.
=TRIM(MID(SUBSTITUTE($D7," ",REPT(" ",LEN($D7))), (5)*LEN($D7)+1,LEN($D7)))
=MID($D7,20,21-10)
=TRIM(RIGHT(SUBSTITUTE($D6,"$",REPT("$",2)),4))
Sometimes I get
'FEB40-50(' or 'FEB40-FEB5'
when it should be
'FEB40-FEB50'
Thank you to anyone who is able to help.
You might get to the limits of formulas with this scenario, but with Power Query you can still work.
As I see it, you want to apply the following logic to extract text from this string:
COMMODITY PRICE DIFFERENTIAL: FEB50-FEB40 (APR): COMPANY A OFFERS 1000KB AT $0.40
text after the first : and before the first (
text between the brackets
text after the word OFFERS and before AT
text after AT
These can be easily translated into several "Split" scenarios inside Power Query.
Split by the custom delimiter ": " (colon and space) at each occurrence
Remove the first column
Split the new first column by " (" (space and opening bracket) at the leftmost occurrence
Replace ) with nothing in the second column
Split the third column by the delimiter OFFERS
Split the new fourth column by the delimiter AT
The screenshot shows the input data and the result in the Power Query editor after renaming the columns and before loading the query into the worksheet.
Once you have loaded the query, you can add / remove data in the input table and simply refresh the query to get your results. No formulas, just clicking ribbon commands.
You can take this further by removing the "KB" from the column, converting it to a number, and dividing it by 100. Your business processing logic will drive what you want to do. Just take it one step at a time.
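For readers who want to prototype the extraction logic outside Excel, here is a hedged Python sketch. It swaps the Power Query splits for a single regular expression that anchors on the same markers (the first colon, the brackets, OFFERS, and AT). The function name parse_offer and the field names are made up for the example, and the pattern assumes every line follows the layout of the sample line.

```python
import re

# Anchors: first ":", "(...)", "OFFERS", "AT", optional "$".
OFFER_PATTERN = re.compile(
    r":\s*(?P<spread>[^(]+?)\s*\((?P<month>[^)]+)\):\s*"
    r"(?P<company>.+?)\s+OFFERS\s+(?P<size>\S+)\s+AT\s+\$?(?P<price>[\d.]+)"
)


def parse_offer(line):
    """Return the five fields as a dict, or None if the line does not match."""
    match = OFFER_PATTERN.search(line)
    return match.groupdict() if match else None


line = ("COMMODITY PRICE DIFFERENTIAL: FEB50-FEB40 (APR): "
        "COMPANY A OFFERS 1000KB AT $0.40")
fields = parse_offer(line)
print(fields["spread"], fields["month"], fields["company"],
      fields["size"], fields["price"])
```

Because the pattern keys on the surrounding markers rather than on character positions, variants such as FEB5-FEB40 or COMPANYA are captured without changes.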

Read only 2 columns from a '~' delimited text file into dataframe and store second column as string

I have a very large text file (3.33 GB) which has 47 columns separated by delimiter ~. I just need the first and the last column to work with. The last column is a 17 digit number which may contain leading zeros. I have to store this column as a string (so as to not remove the leading zeros). An example of the first and last column is shown below:
id Number
0 0 10030040125198660
1 12345 60034046122158670
My question is whether it's possible to read just these two columns alone, and store the second column as string ? The reason I ask is because loading 3.3GB file as a dataframe takes a lot of time, converting it into string takes an even longer amount. I want to know if I can save time by choosing only the columns I need.
My code as of now (shown the column names as numbers for easy understanding):
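import pandas as pd
df = pd.read_csv('myfile.txt', low_memory=False, sep='~', header=None)
df.drop(columns=list(range(1, 46)), inplace=True)  # keep only the first and last of the 47 columns (labels 0 and 46)
df[46] = df[46].astype(str)  # convert the last column to string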
Any help is highly appreciated!
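You should use the usecols parameter of read_csv to load only the columns you need, and the dtype parameter to read the last column as a string so the leading zeros survive. Check the official read_csv documentation; in fact, that is the first thing you should check.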
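A minimal sketch of that call, assuming the file has no header row. The sample data here is invented and only 5 columns wide so it fits on a line; for the 47-column file described in the question, last_col would be 46.

```python
import io

import pandas as pd

# Made-up stand-in for the real file: '~' delimited, no header row.
sample = io.StringIO(
    "0~a~b~c~00030040125198660\n"
    "12345~d~e~f~60034046122158670\n"
)

last_col = 4  # would be 46 for a 47-column file
df = pd.read_csv(
    sample,
    sep="~",
    header=None,
    usecols=[0, last_col],   # parse only the first and last columns
    dtype={last_col: str},   # read the last column as text, keeping leading zeros
)
df.columns = ["id", "Number"]
print(df)
```

Reading the column as a string at parse time avoids the later astype(str) step, which cannot restore zeros that were dropped when the values were parsed as integers.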

Extracting text from complex string in excel

The attached image (link: https://i.stack.imgur.com/w0pEw.png) shows a range of cells (B1:B7) from a table I imported from the web. I need a formula that allows me to extract the names from each cell. In this case, my objective is to generate the following list of names, where each name is in its own cell: Erik Karlsson, P.K. Subban, John Tavares, Matthew Tkachuk, Steven Stamkos, Dustin Brown, Shea Weber.
I have been reading about left, right, and mid functions, but I'm confused by the irregular spacing and special characters (i.e. the box with question mark beside some names).
Can anyone help me extract the names? Thanks
Assuming that your cells follow the same format, you can use a variety of text functions to get the name.
This function requires the following format:
Some initial text, followed by
2 new lines in Excel (represented by CHAR(10))
The name, which consists of a first name, a space, then a last name
A second space on the same line as the name, followed by some additional text.
With this format, you can use the following formula (assuming your data is in an Excel table, with the column of initial data named Text):
=MID([@Text],SEARCH(CHAR(10),[@Text],SEARCH(CHAR(10),[@Text])+1)+1,SEARCH(" ",MID([@Text],SEARCH(CHAR(10),[@Text],SEARCH(CHAR(10),[@Text])+1)+1,LEN([@Text])),SEARCH(" ",MID([@Text],SEARCH(CHAR(10),[@Text],SEARCH(CHAR(10),[@Text])+1)+1,LEN([@Text])))+1)-1)
To come up with this formula, we take the following steps:
First, we figure out where the name starts. We know this occurs after the 2 new lines, so we use:
=SEARCH(CHAR(10),[@Text],SEARCH(CHAR(10),[@Text])+1)+1
The inner (occurring second) SEARCH finds the first new line, and the outer (occurring first) finds the 2nd new line.
Now that we have that value, we can use it to determine the rest of the string (after the 2 new lines). Let's say that the previous formula was stored in a table column called Start of Name. The 2nd formula will then be:
=MID([@Text],[@[Start of Name]],LEN([@Text]))
Note that we're using the length of the entire text, which by definition is more than we need. However, that's not an issue, since Excel returns the smaller amount between the last argument to MID and the actual length of the text.
Once we have the text from the start of the name on, we need to calculate the position of the 2nd space (where the name ends). To do that, we need to calculate the position of the first space. This is similar to how we calculated the start of the name earlier (which starts after 2 new lines). The function we need is:
=SEARCH(" ",[@[Rest of String]],SEARCH(" ",[@[Rest of String]])+1)-1
So now, we know where the name starts (after 2 new lines), and how long it is (the number of characters before the 2nd space). Assuming we have these numbers stored in columns named Start of Name and To Second Space respectively, we can use the following formula to get the name:
=MID([@Text],[@[Start of Name]],[@[To Second Space]])
This is equivalent to the first formula: The difference is that the first formula doesn't use any "helper columns".
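The same step-by-step logic can be sketched in Python, which may make the nesting easier to follow. This is an illustration only: extract_name is a hypothetical helper, the sample cell text is invented (the real cells are only visible in the linked image), and "\n" stands in for Excel's CHAR(10).

```python
def extract_name(cell_text):
    """Return the two words that follow the second line break.

    Assumes the format described above: some text, two line breaks,
    then "First Last" followed by a second space and more text.
    """
    first_newline = cell_text.index("\n")
    second_newline = cell_text.index("\n", first_newline + 1)
    rest = cell_text[second_newline + 1:]           # "Rest of String"
    first_space = rest.index(" ")
    second_space = rest.index(" ", first_space + 1)
    return rest[:second_space]                      # everything before the 2nd space


sample_cell = "Rank 1\n\nErik Karlsson SJS - D"  # made-up cell content
print(extract_name(sample_cell))  # Erik Karlsson
```

As with the formula, str.index raises an error when a line break or space is missing, so a cell that deviates from the format fails loudly rather than returning a wrong name.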
Of course, if any cell doesn't match this format, then you'll be out of luck. Using Excel formulas to parse text can be finicky and inflexible. For example, if someone has a middle name, or has initials with spaces (e.g. P.K. Subban written as P. K. Subban), or there is a Jr. or something similar, your job would be a lot harder.
Another alternative is to use regular expressions to get the data you want. I would recommend this thorough answer as a primer. Although you still have the same issues with name formats.
Finally, there's the obligatory Falsehoods Programmers Believe About Names as a warning against assuming any kind of standardized name format.

Trimming strings in excel for different words

Scenario: I have some rows with strings of data in Excel. The data is always in the same order ("columns"), but the size of the data in each "column" varies. In the original strings, there can be one or multiple blank spaces between each piece of "column" data, and so far I have used the TRIM function to reduce that to 1 blank space.
Objective: I am trying to somehow separate the data from the string in different columns, but inside each column data, there might also be spaces, for example I am trying to output this original:
James Smith code1 code2 10.5 09/23/1900AT PRESENT UUUB SJ SPECIAL 250AAA No No NoCORRECTED part1
to this with trim:
James Smith code1 code2 10.5 09/23/1900AT PRESENT UUUB SJ SPECIAL 250AAA No No NoCORRECTED part1
as this:
James Smith code1 code2 10.5 09/23/1900 AT PRESENT UUUB SJ SPECIAL 250AAA No No No CORRECTED part1
where each field is in its proper column.
Obs1: One of the problematic fields for me is the one that has the result "AT PRESENT", because there is a space in between, and there is no space between the "AT" and the last digit of the previous column.
Obs2: I also face similar problems in the first row (headers), which can also have more than 1 word per field.
Obs3: Here are two other string examples that appear in the dataset:
code1 03/15/1950TEAM-ALPHA h/s/s CERTIFIED3-3/1 third point 03/19/1944 -- --SR SR Prototype
code1 200000.00especial reduced Redone third part -- No
What I already tried: I have been trying the LEFT, RIGHT and MID functions, but since I cannot foresee how many letters will be in most of the fields, I found no proper way to do it. I also tried simple character substitution, but that does not solve the problem of the fields that are mistakenly merged. The first thing I tried was "Text to Columns": here the result is also problematic, because a field with spaces inside it gets divided, and fields with no space between them stay merged. I am trying to do this as dynamically as possible, to account for different data variants.
Question: Any suggestions or ideas on how to tackle this situation?
Have you tried Text to Columns on the Data tab?
Set the original data type to "Delimited" and select the "Space" delimiter. Make sure you tick "Treat consecutive delimiters as one".
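What that setting does can be sketched in Python: runs of spaces count as one delimiter. The helper name split_fields is made up, and the sample is the start of the first string from the question with extra spaces added. The output also shows the limits the asker describes, since "AT PRESENT" is divided and "09/23/1900AT" stays fused.

```python
import re


def split_fields(line):
    """Split on runs of spaces, treating consecutive spaces as one delimiter."""
    return re.split(r" +", line.strip())


row = split_fields("James   Smith code1  code2 10.5 09/23/1900AT PRESENT")
print(row)
```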
