How do I separate data written in the same column? - Excel

So I've got a bunch of Excel files containing measurements of electric current versus voltage, with about 2000 points each. The problem is that everything is saved in the same column: it's supposed to be that Column A has all the voltages and Column B all the currents, but right now both voltages and currents are in Column A.
I really don't want to separate 2000 points across 20 files by hand, so is there a good way to split them? The good news is that the two numbers in each row are separated by a space, e.g. [1.001 2.002] with 1.001 being a voltage and 2.002 a current, or only by a minus sign when the second value is negative, e.g. [-1.001-2.002], so I feel like a simple program can fix this up. I know how to code in C and MATLAB (and the goal is to make the output MATLAB-readable), but what's the best way to solve this? Could it also be done with an Excel macro?

Are you on Unix/Linux?
Save the file as CSV.
Run this from the terminal.
$ sed -i .bak "s/ /,/g" my-file.csv
(That's BSD/macOS sed syntax; with GNU sed, drop the space: sed -i.bak "s/ /,/g" my-file.csv)
Your original file will be saved as my-file.csv.bak, and my-file.csv will now be comma-delimited instead of space-delimited.
Open the CSV in Excel.
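A plain space-to-comma substitution won't split rows where the only separator is the minus sign of a negative number (the [-1.001-2.002] case). A short Python sketch that pulls out signed decimals with a regex and writes a two-column CSV instead; the file names here are made up, so adjust them to your own:

```python
import re

# Matches an optionally signed decimal such as "1.001" or "-2.002"; this
# splits both "1.001 2.002" and "-1.001-2.002" (no space before the minus).
NUM = re.compile(r"-?\d+(?:\.\d+)?")

def split_pairs(in_path, out_path):
    """Rewrite a one-column file of merged pairs as a two-column CSV."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            nums = NUM.findall(line)
            if len(nums) == 2:              # one voltage, one current per row
                dst.write(",".join(nums) + "\n")

# Demo on made-up data; in practice, loop split_pairs over your 20 files.
with open("demo.txt", "w") as f:
    f.write("1.001 2.002\n-1.001-2.002\n")
split_pairs("demo.txt", "demo.csv")  # demo.csv loads with MATLAB's readmatrix
```

The resulting CSV is directly readable from MATLAB, which was the stated goal.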

One option would be to open the sheet in Power Query and split the column on a delimiter. Assuming you have a consistent delimiter (a space, comma, etc.), this will give you two new columns. In the Query Editor, go to Split Column, then By Delimiter. Source: https://support.office.com/en-au/article/Split-a-column-of-text-Power-Query-5282d425-6dd0-46ca-95bf-8e0da9539662?ui=en-US&rs=en-AU&ad=AU#__toc354843578
Edit: If you don't have Power Query, it's a free add-on for Excel: https://www.microsoft.com/en-gb/download/details.aspx?id=39379

Related

How to grep csv documents to APPEND info and keep the old data intact?

I have huge_database.csv like this:
name,phone,email,check_result,favourite_fruit
sam,64654664,sam#example.com,,
sam2,64654664,sam2#example.com,,
sam3,64654664,sam3#example.com,,
[...]
===============================================
then I have 3 email lists:
good_emails.txt
bad_emails.txt
likes_banana.txt
the contents of which are:
good_emails.txt:
sam#example.com
sam3#example.com
bad_emails.txt:
sam2#example.com
likes_banana.txt:
sam#example.com
sam2#example.com
===============================================
I want to do some grep, so that at the end the output will be like this:
sam,64654664,sam#example.com,y,banana
sam2,64654664,sam2#example.com,n,banana
sam3,64654664,sam3#example.com,y,
I don't mind doing it in multiple manual steps, perhaps even via a clumsy workflow such as copy-pasting across multiple files. What matters to me is reliability, and most importantly the ability to process very LARGE CSV files with more than 1M lines.
It should also be noted that the lists I "grep" against will usually affect at most 20% of the CSV rows, meaning the remaining 80% must stay intact and, if possible, not even be displaced from their current order.
I would also like to note that I will be using an editor called EmEditor rather than spreadsheet software like Excel, because of its speed and because Excel simply cannot handle CSV files this large.
How can this be done?
Will appreciate any help.
Thanks.
Googling, trial and error, grabbing my head from frustration.
Filter all good emails with Filter. Open Advanced Filter. Next to the Add button is Add Linked File. Add the good_emails.txt file, set to the email column, and click Filter. Now only records with good emails are shown.
Select column 4 and type y. Now do the same for bad emails and change the column to n. Follow the same steps and change the last column values to the correct string.
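If you'd rather script it, the same annotation can be done in a single pass with constant-time set lookups, which scales to 1M+ lines and never reorders the untouched rows. A minimal Python sketch using the file names from the question:

```python
import csv

def load_set(path):
    # one email per line -> a set, so membership tests are O(1)
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def annotate(db_in, db_out, good_path, bad_path, banana_path):
    good, bad = load_set(good_path), load_set(bad_path)
    banana = load_set(banana_path)
    with open(db_in, newline="") as src, open(db_out, "w", newline="") as dst:
        reader, writer = csv.reader(src), csv.writer(dst)
        writer.writerow(next(reader))    # header passes through unchanged
        for row in reader:               # single pass: row order is preserved
            email = row[2]
            if email in good:
                row[3] = "y"
            elif email in bad:
                row[3] = "n"             # rows in neither list stay blank
            if email in banana:
                row[4] = "banana"
            writer.writerow(row)

# annotate("huge_database.csv", "huge_database.out.csv",
#          "good_emails.txt", "bad_emails.txt", "likes_banana.txt")
```

Only the matched rows are modified; the other ~80% are copied through byte-for-byte in their original order.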

Ideas to extract specific invoice pdf data for different formats and convert to Excel

I am currently working on a digitalisation project which consists in extracting specific information from pdf-formatted electricity invoices. Once the data is extracted, I would like to store it in an Excel spreadsheet.
The objectives are the following:
First of all, the data to be extracted would be the following:
https://i.stack.imgur.com/6RLo2.png
In this case, the data to be extracted is the information surrounded in red. This would be the CUPS, the total amount and the consumed electricity per period (P1-P6).
Once this is extracted, I would like to display this in an Excel Spreadsheet.
Could you please give me any ideas/tips on extracting this data? I understand that OCR software would do this best, but I don't know how I could extract this specific information.
Thanks for your help and advice.
If there is no text data in your PDF then I don't believe there is a clean and consistent way to do this yet. If your invoice templates are always the same format and resolution, then the pixel coordinates of the text positions should be the same.
This means that you can create a cropped image with only the text you're interested in. Then you can use your OCR tool to extract all the text and you have extracted your data field. You would have to do this for all the data fields that you want to extract.
This would only work for invoices that always have the same format and resolution. So scanned invoices wouldn't work, and dynamic tables make things exponentially more complex as well.
I would check whether it's possible to simply extract the text first using pdftotext, then build the cmd text parsing around that output, looping file by file.
I don't have your sample to test, so you would need to adjust this to suit your bills:
pdftotext -nopgbrk -layout electric.pdf - |findstr /i "cups factura" & pdftotext -nopgbrk -layout -y 200 -W 300 -H 200 electric.pdf
Personally I would treat the two parts as separate passes: for the first pair, replace the , with a safe CSV character such as *, then inject , at the large gap to make them a 2-column CSV (perhaps replace the Γé¼ - that's the € sign as mangled by the Windows console - with ,€ if necessary, since your captured text may already be in euros).
For the second group I would inject , by numeric position to form the desired columns. I only demo 4 columns by 2 rows, but you want 7 columns by 4 rows, so adjust those values to suit. Alternatively, you can use any language you are familiar with, such as VBA, to split the text however you want before importing into Excel.
In Excel you may want to use PowerQuery to read the pdf:
https://learn.microsoft.com/en-us/power-query/connectors/pdf
Then you can further process to extract the data you want within PowerQuery.
If you are interested in further data analysis after extraction you may want to consider KNIME as well:
https://hub.knime.com/jyotendra/spaces/Public/latest/Reading%20PDF%20and%20extracting%20information~pNh3GdorF0Z9WGm8
From there export to Excel is also supported.
edit:
after extracting, regex helps to filter for the specific data, e.g. look for key words, length and structure of the data item (e.g. the CUPS number), is it a currency with decimal etc.
edit 2: regex in Excel
How to use Regular Expressions (Regex) in Microsoft Excel both in-cell and loops
e.g. look for a new line starting with CUPS followed by a 15-character sequence (if you have more details, you can tighten the matching pattern: e.g. starting with E, or the 5th character is X or 5, etc.)
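As a rough illustration of that regex idea, here is a Python sketch run against a made-up fragment of pdftotext-style output. The labels and the CUPS pattern are assumptions, so adapt them to what your real invoices actually contain:

```python
import re

# Toy stand-in for pdftotext output; the real layout will differ.
text = """CUPS: ES0021000001234567AB
Total factura: 123,45 EUR
"""

# Assumed patterns: a CUPS code starting with "ES" plus alphanumerics,
# and an amount written with a comma as the decimal separator.
cups = re.search(r"CUPS:?\s*(ES[0-9A-Z]+)", text)
total = re.search(r"Total factura:?\s*([\d.,]+)", text)

print(cups.group(1))                                   # ES0021000001234567AB
amount = float(total.group(1).replace(".", "").replace(",", "."))
print(amount)                                          # 123.45
```

Once the fields are parsed like this, writing them into a spreadsheet row is straightforward.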

Excel : Comparing two csv files, and isolate line data from corresponding columns

I have two CSV files: one of 25,000 lines containing all the data, and one of 9,000 lines containing the names for which I need to pull the data from the first one.
Someone told me this would be fairly easy using Excel, but I can't find a similar problem described anywhere.
I've tried comparison tools, but they don't help me isolate what I need.
Using this example
Master file :
Name;email;displayname
Bbob;Bbob#mail.com;Bob bob
Mmartha;Martha#mail.com;Mmartha
Cclaire;Cclaire#mail.com;cclair
Name file :
Name
Mmartha
Cclaire
What i need to get after comparison :
Name;email;displayname
Mmartha;Martha#mail.com;Mmartha
Cclaire;Cclaire#mail.com;cclair
So for the names in my second CSV, I need to get the entire matching line from the master CSV file.
Right now I could use Notepad compare, for example, but on 25,000 lines that means a lot of manual labor to come. I'm sure someone has faced a similar issue, but I can't seem to find a solution, so here I am.
Beforehand, excuses for the Dutch screenshots, I'm unsure about the English terms in PowerQuery, but you should be able to follow the procedure.
Using PowerQuery:
Start PowerQuery
Load both source CSV1 and CSV2
Join Query as new
Select column 1 in both tables and choose the Inner join option
Result should look like this:
Use first row as headers:
Delete 4th column, close and load values
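If Power Query isn't available, the same inner join can be scripted. A minimal Python sketch using the sample data from the question (inlined here instead of read from disk; in practice, open the two files):

```python
import csv
import io

# Inline stand-ins for the two files; in practice use open("master.csv") etc.
master = """Name;email;displayname
Bbob;Bbob#mail.com;Bob bob
Mmartha;Martha#mail.com;Mmartha
Cclaire;Cclaire#mail.com;cclair
"""
names = {"Mmartha", "Cclaire"}   # the 9000-line name file, loaded as a set

reader = csv.reader(io.StringIO(master), delimiter=";")
header = next(reader)
kept = [row for row in reader if row[0] in names]   # whole line, matched by Name

print(";".join(header))
for row in kept:
    print(";".join(row))
```

Because the name list is a set, each of the 25,000 master lines is checked in constant time, so this stays fast at that scale.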

How to convert csv text to numbers as well as manipulate almost non-machine readable data in Power Bi?

I have a sales datasheet in CSV that I load into Power BI. The monetary values on the sheet appear as numbers with decimal places (e.g. 123.0000), but Power BI reads them as text. When I try to convert them to a fixed decimal number ($) it throws an error. How do I convert this safely to ($)? There are also multiple columns with these values in them. How would I convert all of them in the easiest way, given that there are other columns with plain numbers in between the monetary columns? (one SOH column followed by one Net column, and this pattern repeats)
On top of this, the datasheet is laid out in a way that is difficult to reshape into a form Power BI can read easily. The header rows begin with the SKU code and description, then continue with each individual retail store by location, broken up into SOH and Net, per store per column. I've been racking my brain on this for ages and can't seem to find a simple way around it. Any tips would be great.
For the conversion to ($), I went into the csv sheet, altered the format of the numbers and saved it as a .xml, but the issue with this is that I would have to repeat this tedious step every time I would need to pull data, which is a lot.
For the layout of the original spreadsheet, I tried unpivoting the data in Power BI, which does work. However, it is still sectioned off by Net and SOH, which means I have to add a slicer just to see Net or SOH on its own, instead of having them as separate entries.
I expect the output to firstly give me fixed decimal numbers, but all I get is an error when trying to convert the numbers to $.
With the unpivoting, I can slice the data by store, which is great and helps, but I have to create a separate sheet with the store IDs on it so that I can "filter" them when I want to switch between them (again, a slicer is necessary). I expect to be able to look at each store individually as well as overall, and likewise at Net and SOH individually, by store and as a whole. From there I can bring in my cost sheet and calculate the GP.
I have attached a picture of the data. I can drop a sample sheet somewhere as well if necessary. I just need to know where.
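For what it's worth, the unpivot and the text-to-number conversion can be sketched outside Power BI as well. A minimal Python illustration of turning the wide per-store SOH/Net layout into long rows; the store and column names here are made up:

```python
import csv
import io

# Made-up miniature of the wide layout: SKU, then SOH/Net per store.
wide = """SKU,StoreA SOH,StoreA Net,StoreB SOH,StoreB Net
ABC123,5,123.0000,2,45.5000
"""

reader = csv.reader(io.StringIO(wide))
header = next(reader)
long_rows = []
for row in reader:
    sku = row[0]
    for col, value in zip(header[1:], row[1:]):
        store, measure = col.rsplit(" ", 1)   # "StoreA SOH" -> StoreA, SOH
        long_rows.append((sku, store, measure, float(value)))  # text -> number

for r in long_rows:
    print(r)
```

In the long form, store and measure are ordinary columns, so filtering by store or by Net/SOH no longer needs a separate slicer sheet.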
I figured it out. All you have to do is change the regional settings, not on your laptop specifically but rather within Power BI itself.
Go to File > Options and settings > Options > Regional settings > change the locale to English (United Kingdom) (that's the region that worked for me and fixed everything automatically)

Sorting txt data files while importing in Excel Data Query

I am trying to import approximately 190 txt data files into Excel using the New Query tool (Data->New Query->From File->From Folder). In Windows Explorer the files are properly ordered: the first being 0summary, the second 30summary, etc.
However, when entering them through the query tool the files are sorted as shown in the picture (see line 9 for example, you will see that the file is not in the right position):
The files are sorted based on the first digit instead of the numeric value. Is there a solution to this issue? I have tried putting a space between the number and the word summary, but it didn't work. I read online that Excel doesn't recognize text within "" or after /, but Windows won't let me save text files with those symbols in their names. Even when I removed the word summary, the problem didn't go away. Any suggestions?
If all your names include the word Summary:
You can add a column "Extract" / "Text before delimiter" enter "Summary", change the column type to Number and sort over that column
If the only numbers are those you wish to sort on, you can
add a custom column with just the numbers
Change the data type to whole number
sort on that.
The formula for the custom column:
Text.Select([Name],{"0".."9"})
If the alpha portion varies, and you need to sort on that also, you can do something similar adding another column for the alpha portion, and sorting on that.
If there might be digits after the leading digits upon which you want to sort, then use the following formula for the added column which will extract only the digits at the beginning of the file name:
=Text.Middle([Name],0,Text.PositionOfAny([Name],{"A".."z"}))
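The same "sort by leading number" idea, sketched in Python for clarity (the sample file names are made up to mirror the question):

```python
import re

files = ["0summary", "30summary", "7summary", "120summary"]

def leading_number(name):
    # sort by the numeric value of the leading digits, not character by character
    m = re.match(r"\d+", name)
    return int(m.group()) if m else 0

print(sorted(files, key=leading_number))
# ['0summary', '7summary', '30summary', '120summary']
```

Converting the extracted digits to a number before sorting is exactly what the "change the column type to Number" step does in Power Query.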
