How to create a shapefile from only the used cells in flopy?

My question is essentially in the title. For instance, suppose there is a model with 2 stress periods, 3 layers, 15 columns, and 10 rows, and I am modifying the general head boundary (GHB) package. After I edit the model cells by giving stress period, layer, row, column, head, and conductance values, I want to export that package as a shapefile. When I use the export function, it creates a shapefile, but the attribute table contains only row and column information for all cells in the model. I want to see only the modified cells as features in the shapefile, with head and conductance information for each stress period and layer, because I will use that information to update the package values. Is there a way to export only the selected cells as features in a shapefile?

There is no built-in way to export only the selected cells as a shapefile. You would make a selection of the changed cells using GIS software or open-source GIS tools after generating the shapefile of the package.
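That said, you can assemble such a shapefile yourself from the package's stress-period data. Below is a minimal sketch, not a built-in flopy export, assuming a structured MODFLOW-2005 style model and using shapely/geopandas; the file names are placeholders:

```python
import flopy
import geopandas as gpd
from shapely.geometry import Polygon

ml = flopy.modflow.Modflow.load("model.nam")   # hypothetical model files
grid = ml.modelgrid
ghb = ml.ghb

records = []
# stress_period_data.data is a dict of recarrays keyed by stress period
for kper, rec in ghb.stress_period_data.data.items():
    for r in rec:
        poly = Polygon(grid.get_cell_vertices(r["i"], r["j"]))  # cell outline
        records.append({"kper": kper, "layer": r["k"], "row": r["i"],
                        "col": r["j"], "bhead": r["bhead"], "cond": r["cond"],
                        "geometry": poly})

# only the cells actually present in the GHB package end up as features
gdf = gpd.GeoDataFrame(records, geometry="geometry")
gdf.to_file("ghb_cells.shp")
```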

Related

How to split a Pandas dataframe into multiple CSVs according to when the value of a column changes

So, I have a dataframe with 3D point cloud data (X,Y,Z,Color):
[screenshot: dataframe sample]
Basically, I need to group the data according to the Color column (which takes the values 0, 0.5, and 1). However, I don't need an overall grouping (that is easy); I need to create a new dataframe every time the value changes. That is, I'd like a new dataframe for every set of rows that is preceded and followed by at least 5 zeros (because single zeros are sometimes erroneously present in chunks of data that I'm interested in).
Basically, the zero values (black) are meaningless to me; I'm only interested in the 0.5 (red) and 1 (green) values. What I want to accomplish is to segment the original point cloud into smaller clusters that I can then visualize. I hope this is clear. I can't seem to find answers to my question anywhere.
First of all, you should get comfortable with the for loop; Python makes it easy to drive any library's code from functions and loops. Say you have a dataset and you want to scan one column: start the loop with for _, row in df.iterrows():, then on the next line test the criterion you want, e.g. if row["Color"] > 0.5:. Inside that branch you can copy the matching rows into a new dataset. In the interest of it being a learning exercise, I have not written ready-made code.
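For the specific splitting rule in the question (start a new cluster wherever the Color column is zero for at least 5 consecutive rows), here is a hedged pandas sketch; the DataFrame df and its Color column are assumed from the question, and the run-length trick is only one of several ways to do this:

```python
import pandas as pd

def split_on_zero_runs(df, col="Color", min_run=5):
    """Split df into chunks separated by runs of >= min_run zeros in col."""
    is_zero = df[col].eq(0)
    # label consecutive runs of identical True/False values
    run_id = is_zero.ne(is_zero.shift()).cumsum()
    run_len = is_zero.groupby(run_id).transform("size")
    # separators are long runs of zeros; short zero runs stay in the data
    separator = is_zero & (run_len >= min_run)
    # rows between separators share a group number
    group = separator.ne(separator.shift()).cumsum()
    return [chunk for _, chunk in df[~separator].groupby(group[~separator])]

# write each cluster to its own CSV, as the title asks
# for n, chunk in enumerate(split_on_zero_runs(df)):
#     chunk.to_csv(f"cluster_{n}.csv", index=False)
```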

Ideas to extract specific invoice PDF data for different formats and convert to Excel

I am currently working on a digitalisation project which consists of extracting specific information from PDF-formatted electricity invoices. Once the data is extracted, I would like to store it in an Excel spreadsheet.
The objectives are the following.
First of all, here is the data to be extracted:
https://i.stack.imgur.com/6RLo2.png
In this case, the data to be extracted is the information circled in red: the CUPS, the total amount, and the consumed electricity per period (P1-P6).
Once this is extracted, I would like to display it in an Excel spreadsheet.
Could you please give me any ideas/tips regarding the extraction of this data? I understand that OCR software would do this best, but I do not know how I could extract this specific information.
Thanks for your help and advice.
If there is no text data in your PDF, then I don't believe there is a clean and consistent way to do this yet. If your invoice templates always have the same format and resolution, then the pixel coordinates of the text positions should be the same.
This means that you can create a cropped image containing only the text you're interested in, then run your OCR tool on it to extract that data field. You would have to do this for every data field you want to extract.
This only works for invoices that always have the same format and resolution, so scanned invoices wouldn't work, and dynamic tables make things exponentially more complex as well.
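To make the crop-then-OCR idea concrete, here is a rough Python sketch using pdf2image and pytesseract (pytesseract also needs a local Tesseract install; the file name and pixel coordinates are made up and must be measured from your own template):

```python
from pdf2image import convert_from_path  # requires poppler
import pytesseract

# render the first page of the invoice as a PIL image
page = convert_from_path("invoice.pdf", dpi=300)[0]

# (left, top, right, bottom) pixel box around one field, e.g. the CUPS code;
# these coordinates are hypothetical and depend on your fixed layout
cups_box = page.crop((100, 400, 700, 450))
cups_text = pytesseract.image_to_string(cups_box)
print(cups_text.strip())
```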
I would check whether it's possible to simply extract the text using pdftotext first, then build my cmd text parsing around that output, and loop from file to file.
I don't have your sample to test, so you would need to adjust this to suit your bills:
pdftotext -nopgbrk -layout electric.pdf - |findstr /i "cups factura" & pdftotext -nopgbrk -layout -y 200 -W 300 -H 200 electric.pdf
Personally I would run the two parts as separate cycles: for the first pair, replace the comma with a safe CSV character such as *, then inject a comma at the large gap to make them a 2-column CSV (perhaps replacing the € with ,€ if necessary, since your captured text may already contain euro amounts).
For the second group I would inject commas by numeric position to form the desired columns; I only demo 4 columns by 2 rows, but you want 7 columns by 4 rows, so adjust those values to suit. However, you can use any language you are familiar with, such as VBA, to split the text however you want before importing it into Excel.
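If cmd parsing gets unwieldy, the "inject commas by numeric position" step can also be done in a few lines of Python on the pdftotext output; the offsets below are placeholders you would measure from your own bills:

```python
# split each fixed-width line of the pdftotext output into CSV columns
cuts = [0, 12, 24, 36, 48]            # hypothetical column start offsets
with open("electric.txt") as f:       # output of pdftotext -layout
    for line in f:
        bounds = zip(cuts, cuts[1:] + [None])
        fields = [line[a:b].strip() for a, b in bounds]
        print(",".join(fields))
```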
In Excel you may want to use Power Query to read the PDF:
https://learn.microsoft.com/en-us/power-query/connectors/pdf
Then you can further process the result to extract the data you want within Power Query.
If you are interested in further data analysis after extraction you may want to consider KNIME as well:
https://hub.knime.com/jyotendra/spaces/Public/latest/Reading%20PDF%20and%20extracting%20information~pNh3GdorF0Z9WGm8
From there export to Excel is also supported.
Edit: after extracting, a regex helps to filter for the specific data, e.g. looking for keywords, the length and structure of a data item (e.g. the CUPS number), whether it is a currency with decimals, etc.
Edit 2: for regex in Excel, see:
How to use Regular Expressions (Regex) in Microsoft Excel both in-cell and loops
e.g. look for a new line starting with CUPS followed by a 15-character sequence (if you have more details, you can make the matching pattern more specific: e.g. starting with E, or the 5th character being X or 5, etc.)
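As an illustration of that regex idea, a hedged Python sketch over the extracted text might look like this; the exact pattern (length, allowed characters) must be adapted to your real CUPS format:

```python
import re

# text extracted from the PDF, e.g. the output of pdftotext
text = open("invoice.txt", encoding="utf-8").read()

# hypothetical pattern: a line starting with "CUPS", then a code of
# roughly 15-22 alphanumeric characters
match = re.search(r"^CUPS\s*:?\s*([A-Z0-9]{15,22})", text, flags=re.MULTILINE)
if match:
    print(match.group(1))
```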

Requesting help editing a formula in a TM1 worksheet

The TM1 worksheet in question uses the DBRW formula to write values that users enter to a cube, and also uses it to fetch values and display them in the worksheet. The values in the cube are movie codes such as 7500023. A movie code can be mapped to a movie title in a dimension DIM. Note that the movie code and the movie title are both aliases of the principal element name used in the dimension, which looks like 0007500023 (the movie code with leading zeroes). I'd like the movie titles to be displayed in the worksheet instead of the movie codes.
I tried using the SUBNM function, but it opens a subset editor and also doesn't write values to the cube the way DBRW does, so that's ruled out.
The DBRA function seemed perfect for fetching the movie titles from the dimension DIM, but it doesn't write values to the cube.
Is there any way I can combine the DBRA and DBRW functions in one formula, or is there an alternative function for this purpose?
I would add an extra column that extracts the alias with DBRA, and then base your IF statement on that new column.
Otherwise, if you don't need the code, why not use the alias in the first place? Export the data with the alias on; then you don't even need to retrieve the alias, because it's already there.
It just doesn't make sense to put both in one formula: when you recalculate, Excel evaluates the formula once, but you're expecting it to evaluate the cell multiple times (hence the circular reference).
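A minimal sketch of that two-column layout in worksheet formulas (the server, cube, dimension, and attribute names here are hypothetical; only the DBRW cell writes back, the DBRA cell is display-only):

```text
B2:  =DBRW("server:MovieCube", $A2, B$1)      ' writable cell holding the movie code
C2:  =DBRA("server:DIM", B2, "Movie Title")   ' read-only lookup of the title alias
```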

XLConnect setcellformula usage

I was going through the XLConnect package reference manual and came across a function called setCellFormula, with an example. I was trying to apply it to data from a text file imported into Excel.
What I wish to do is the following:
1. Import data of 200 rows and 400 columns into the Excel file.
2. For each row of data containing 400 columns, compute the average, median, minimum, and maximum, and place the results in another sheet. I want to use setCellFormula in case I want to edit the cells directly in Excel, since there are associated helpers like idx2col, col2idx, etc. I find this easy to do for a single row; is there any way to do it for all rows in a single statement?
3. Please note that the number of rows and columns can vary based on the type of data available.
Please let me know how to proceed with this.

How to Differentiate Data from a Column/Header in an Excel File

I hope someone can help me come up with an algorithm.
I'm still very new to Apache POI, and I was assigned to come up with an algorithm for reading a template (Excel file) and telling the headers/column names apart from the data itself.
The following must be taken into account:
There can be multiple headers/column names in just one sheet of an Excel file.
Headers can be horizontal AND/OR vertical in nature. This means that there could be a mixture of vertical and horizontal headers in one sheet.
Headers don't necessarily have to be in the very first row of the file. There could be introductions or banner images above them.
The system must allow ANY kind of Excel format, so there is no control over the formatting of the cells, the naming convention, etc.
Some headers are alphanumeric, meaning they also contain numbers.
Some cells are merged to make room for a specific header.
Any ideas and suggestions are very much welcome. Just let me know if you have further clarifications.
(I know nothing about Apache POI, but I know a bit about working with Excel Interop.)
If the sheets to be detected are yours, I'd recommend NAMING those header cells. (To name a cell in Excel, there's a field at the top left of the screen where the cell coordinates normally appear, like "A1" or "B2". Type a name in that place, and you will be able to identify that cell via code by its name; Worksheet.Range("Name") is where you get those cells via code.)
To manage names, go to "Insert - Names" or "Formulas - Name Manager", depending on your version of Excel.
(Personally, I never work with sheets via code without naming the headers; I then use Offset to get the data cells corresponding to those headers. This allows me to freely edit the sheet later without breaking the code.)
If the sheets aren't yours, then you'll need to find the extents of the data (last row and last column).
Then check for the first row that has all columns filled, none of them blank. That's a probable horizontal header.
Likewise, check for the first column that has all rows filled. That's a probable vertical header.
You could also search for completely blank rows and/or columns to find headers that come AFTER some data, in the case of sheets containing multiple horizontal or vertical headers.
You could use formatting properties of those cells (Range.Interior or Range.Font, for example) to identify whether they are headers (headers usually have a different format: color, borders, and so on).
If you're sure there's no numeric header, i.e. all headers contain text, check the type of data in the cells; if all are strings, the header probability increases.
Even so, this is a tricky thing to do. If sheets don't follow some pattern, once in a while one of them will deceive your code and produce false results. I'd recommend, if allowed, adding a human verification step to confirm the results after the process is done.
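As a sketch of the "first fully populated row" heuristic described above, here it is in Python with openpyxl rather than Excel Interop or Apache POI (the file name is a placeholder, and real sheets will still need the extra formatting and blank-line checks):

```python
from openpyxl import load_workbook

ws = load_workbook("report.xlsx").active
for row in ws.iter_rows(min_row=1, max_row=ws.max_row):
    values = [cell.value for cell in row]
    # a probable horizontal header: every column in the row is non-blank
    if values and all(v is not None and str(v).strip() for v in values):
        print("probable horizontal header at row", row[0].row)
        break
```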
The solution to this problem involves taking away two of these freedoms. Applying those constraints makes this a tractable problem. Most of these freedoms come from overcautious thinking.
The freedoms are given as quotes below:-
Headers can be horizontal AND/OR vertical in nature. This means that there could be a mixture of vertical and horizontal headers in one sheet.
Typically, vertical headers are not used in Excel files where there is a need to programmatically detect headers, as the primary, most common, and sometimes only reason for such detection is to upload or transform the tabular data.
Funny things happen when vertical headers are introduced:
They become labels of forms. This implies that such forms are used for data entry rather than storage; the entered data is stored elsewhere with horizontal/columnar headers and row-wise records, which obviates the need to upload or transform the data-entry sheet itself.
Excel is designed to have only horizontal headers; vertical headers lose AutoFilter support.
Even when Vertical Headers are present, a top horizontal header row can still be introduced to mark the headers themselves as descriptions / categories.
Staying true to the core need for auto-detection of headers: once our requirement states that headers can be placed only in a horizontal alignment, the problem becomes somewhat more tractable, though not fully so.
Some cells are merged to make room for a specific header.
Merging cells is poison and anathema to the entire purpose of transforming or uploading data. This is a pill I have steadfastly refused to take in my entire career of Excel and SQL jugglery. You may kindly merge all that you want to, for all I care; however, thee shall not pass into my beloved SQL Server.
For the aforementioned reasons of prejudice and ill will towards all mergers and mergees alike, I'd respectfully suggest that you too take this course.
Solution
Staying true to the above requirements after taking away the two freedoms, the pseudo-algorithm (solution) is to:
Take a sample of, say, r x c Excel cells, e.g. 200 x 201 rows and columns.
Find the counts of non-empty cells, i.e. cells whose contents have non-zero length, using an inbuilt formula like COUNTA. The count of such non-empty cells in each row is kept in a data structure.
The type of data (Number, Date, or String) should also be kept in that data structure, so that it can express the following:
Row# 22 contains
30 non-empty cells of which
28 are alphanumeric,
1 is a Date and
1 is a Number.
The first row that contains the maximum number of such non-empty cells together with the maximum number of strings is very likely the header row.
Converting all of the above to a specific algorithm in any given language should be a deliciously occupying task for any young developer in their prime.
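A rough translation of this pseudo-algorithm, shown with Python and openpyxl for brevity rather than Apache POI (the file name is hypothetical; the sample size and tie-breaking follow the description above):

```python
from openpyxl import load_workbook

ws = load_workbook("report.xlsx").active
stats = []  # (row number, non-empty count, string count)
for row in ws.iter_rows(min_row=1, max_row=min(200, ws.max_row)):
    filled = [c.value for c in row if c.value is not None and str(c.value).strip()]
    strings = [v for v in filled if isinstance(v, str)]
    stats.append((row[0].row, len(filled), len(strings)))

# the first row achieving the maximum count of non-empty cells and of strings
best = max(stats, key=lambda t: (t[1], t[2]))
header_row = next(r for r, n, s in stats if (n, s) == (best[1], best[2]))
print("probable header row:", header_row)
```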
