I need to transform Excel files to ESRI FileGDB using FME.
The problem is that my Excel worksheets contain more than one table.
Example: row 1 holds the attributes (column names) of the first table, and rows 2 to 4 contain its values.
Row 6 holds the attributes of the second table, and the next 45 rows are its values.
The same goes for the third table.
These row positions can change: the attributes of the second table could start at any row.
I think the best solution would be a process that splits the .xls file into three different files, so I can transform each of them directly into the ESRI format.
Is there a transformer that could perform this task, or should I code it myself in Python?
PS: This process will be called from a REST service, so I can't do it manually. Also, the column names will always be the same.
Thanks
FME reads the Excel rows in order, so I would add a Counter transformer after reading the Excel file.
The column names don't change, so you could check at which row (number given by the Counter) the new table begins.
Then it's just a matter of filtering the features with a TestFilter.
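Since you mentioned Python as a fallback: the split can also be sketched with the standard library, assuming the sheet has been exported to CSV and assuming a hypothetical shared first column name ("ID" below) that marks every header row, since the question says the column names never change.

```python
import csv

# Hypothetical: every header row starts with the same first column name.
HEADER_FIRST_CELL = "ID"

def split_tables(rows):
    """Split a flat row stream into tables; each header row starts a new table."""
    tables = []
    for row in rows:
        if row and row[0] == HEADER_FIRST_CELL:
            tables.append([])          # a header row begins a new table
        if tables:
            tables[-1].append(row)     # ignore anything before the first header
    return tables

# Usage sketch: rows = list(csv.reader(open("sheet.csv", newline="")))
```

Each resulting table (header row plus its value rows) can then be written to its own file and fed to the FileGDB writer separately.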
I have a set of excel files inside ADLS. The format looks similar to the one below:
The first 4 rows are always the document header information, and the last 3 rows are 2 empty rows plus the end-of-document indicator. The number of employee-information rows is indefinite. I would like to delete the first 4 rows and the last 3 rows using ADF.
Can anyone help me with what the expressions in the Derived Column / Select should be?
My Excel file:
Source dataset settings (enter A5 in Range and select "First row as header"):
SourceDataSetProperties
Make sure to refresh schema in the source data set.
Schema
After schema refresh, if you preview the source data, you will be seeing all rows from row number 5. This will include footer too which we can filter in data flow.
Next, add a Filter transformation with the expression below:
!startsWith(sno,'dummy') && sno!=''
This filters out the rows starting with 'dummy' (in your case, the end-of-document indicator). The sno!='' check also ignores the empty rows.
Final Preview after filter:
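For clarity, here is the same filter logic expressed in plain Python (the column name `sno` is taken from the example above):

```python
# Mirror of the data-flow expression: keep a row only when sno is non-empty
# and does not start with 'dummy' (the end-of-document marker in this example).
def keep_row(row):
    sno = row.get("sno", "")
    return sno != "" and not sno.startswith("dummy")

rows = [
    {"sno": "1", "name": "Alice"},
    {"sno": "", "name": ""},        # empty footer row
    {"sno": "dummy", "name": ""},   # end-of-document indicator
]
kept = [r for r in rows if keep_row(r)]
```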
How about this? Under the 'Source' tab, choose the number of lines you want to skip.
I am working with incomplete historical data and am using Python to select specific information from TXT files (e.g. via Regex) and write them to .csv tables.
Is it possible to write a certain item or a list of items to new rows in a particular column in an existing CSV file?
I can add individual strings or lists as consecutive new rows or columns to an existing table, but very often, I am only filling in "missing information".
It would be great to find a way to select the next row in the "n"-th column of a CSV table, or to select the column by name / column heading.
Have you considered using Pandas?
It has convenient methods for reading and writing csv-files. Working with columns, rows, and cells is quite intuitive.
It takes a little time to understand the basics of Pandas, but if you plan to work with csv and csv-like data more than once, it is worth it.
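A minimal sketch of the pattern, with hypothetical column names (`year`, `place`); in practice the table would come from `pd.read_csv(...)` rather than being built inline:

```python
import pandas as pd

# In practice the table would be loaded from disk: df = pd.read_csv("table.csv")
df = pd.DataFrame({"year": [1880, None, 1883],
                   "place": ["Berlin", "Wien", "Prag"]})

df.loc[1, "year"] = 1881      # fill one missing cell by row index and column name
years = df["year"]            # select a column by its heading
first_col = df.columns[0]     # or select the n-th column by position

# df.to_csv("table.csv", index=False)   # write the completed table back
```

`df.loc[row, column]` addresses exactly the "next row in the n-th column" case from the question, by either position or column heading.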
I am trying to convert an Excel file to CSV with Apache NiFi. When the first row has fewer cells with values than the other rows of the document (for example, the first row has 5 cells, the 2nd has 8, and the 5th has 7), the parsing of the document only takes into account the number of cells in the first row (5). So I am losing information (in the example, the 2nd row would lose 3 cell values and the 5th would lose 2).
Another visual example:
The configuration of my process looks like:
Can anyone tell me how to solve the problem?
#Jaime - The NiFi processor ConvertExcelToCSVProcessor makes some assumptions which you have noticed. It assumes you are sending a consistent set of data in each row, and your Excel file doesn't meet this basic assumption.
My best advice is to fix the disparate data in the Excel sheet: add the missing columns with placeholder data you can remove or ignore later. The only other choice would be to remake the processor as a custom processor, where you would check every row, find the row with the most columns, and use that for the column count.
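If pre-processing the file outside NiFi is acceptable, the normalization a custom processor would perform can be sketched in Python with the standard library only: scan every row, take the widest one, and pad the rest with empty cells.

```python
def normalize_rows(rows):
    """Pad every row with empty cells so all rows share the widest row's length."""
    width = max(len(r) for r in rows)  # column count of the widest row
    return [r + [""] * (width - len(r)) for r in rows]

# Example matching the question: rows of 5, 8, and 7 cells.
rows = [["a"] * 5, ["b"] * 8, ["c"] * 7]
padded = normalize_rows(rows)
```

After padding, every row has the same cell count, so the first-row assumption of ConvertExcelToCSVProcessor no longer drops data.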
I have an Excel spreadsheet where the column names for the first 5 columns are on the 2nd row, and the column names for all the other columns are on the 3rd row.
Data starts for every column at row 4.
How can I load this data efficiently in SAS with the appropriate names?
Thanks!
The fastest way would be to make the change in excel.
A slower way would be:
1) Use two different import procedures: one for the first 5 columns and another for the rest of the columns. You can do this with the RANGE= option in PROC IMPORT. Now you'll have two data sets in the WORK library.
2) Use a data step to remove the empty rows between the column headers and the first row of data.
3) Then use a data step to create a row-number variable in each data set (using the automatic variable _N_) and do a one-to-one merge to combine the two.
If anyone knows a faster way, let me know. This is the first that came to mind.
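For comparison, the header-merging idea can be sketched in Python, assuming the sheet has been read into a 0-indexed list of rows and assuming the layout from the question: names for the first 5 columns on row 2, names for the rest on row 3, data from row 4.

```python
def combined_headers(sheet, split_at=5):
    """Merge the two header rows: first `split_at` names from row 2, the rest from row 3."""
    return sheet[1][:split_at] + sheet[2][split_at:]

def data_rows(sheet):
    """Data starts on the sheet's 4th row."""
    return sheet[3:]
```

This is the same one-to-one merge as step 3, done on the header rows instead of the imported data sets.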
I have a CSV file open in Excel. I want to create a chart with two line series, but both sets of values come from a single row: one set of values in that row needs to be plotted against another set of values in the same row, and the two sets contain the same number of values. Each set is obtained by filtering the row according to the values of other columns. I can plot one set, since I can apply the filter once. How can I add the second set of values onto the existing plot with an independent filter on the same column? I don't want to split the file into two different files, and I am not that familiar with Excel 2007.
If your data is labeled, you probably want to use a pivot chart. Click the link for an overview