Ignore extra columns in CSV when using COPY FROM in Cassandra

I have a table with two columns, id and name, and I am ingesting data from a third party that may add fields to each row without my knowledge. I want to ensure the file still loads until I am ready to change my data model, ignoring any extra columns that are added to the end of each row.
E.g. if the CSV file changes to id, name, email at some point, I want my COPY FROM query to carry on happily loading id, name.
I've tried SKIPCOLS, but as far as I can see this only seems to work when the field matches a column in the table in the first place.
I.e. the following is the correct usage of COPY FROM with SKIPCOLS, but it will not help here, as I can't seem to reference a column that only exists in the CSV.
COPY users (id, name) FROM 'users.csv' WITH SKIPCOLS = 'name';
Is there another way to do this or a different way to use SKIPCOLS that I am missing?
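One workaround, sketched here on the assumption that the extra columns are always appended after id and name, is to trim the file down to its first two fields before loading, e.g. with cut:

cut -d',' -f1,2 users.csv > users_trimmed.csv

and then, in cqlsh:

COPY users (id, name) FROM 'users_trimmed.csv';

Note that cut splits on every comma, so this is only safe if name values are never quoted strings containing commas; a CSV-aware tool would be needed in that case.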

Related

Excel - List of key values created from external files in Power Query, trouble with editing mapped values

I am attempting to create a standardized list of names for a long list of free-typed values in a set of CSVs pulled from Jira.
What I have tried so far is to use Get Data -> From File -> From Folder,
then narrow it down to just the column I need and remove all duplicate rows.
After loading that, I have tried adding a column that's just an empty string. I have done this both in Power Query and in the data model, with the same effect. I want the second column so the user can map the values in the key column on a worksheet; this table will then be used as a map for pivot tables to standardize names. But updating a value on the worksheet and then refreshing just reverts the value back to an empty string in the data model.
Obviously I'm going about this the wrong way. The goal is to be able to maintain this key/value map over the months as new keys are added, and only have to map the new entries rather than doing a lot of comparison work every time to see what's new. Is there a better way to achieve what I am trying to do that remains expandable over the months without having to redo the entire workbook?
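A common pattern for this (a sketch, not from the thread; the query name JiraNames and table name KeyMap are assumptions) is a self-referencing query: re-read the edited mapping table from the worksheet and left-join it back onto the current list of keys, so existing mappings survive a refresh and new keys simply arrive unmapped:

let
    // Distinct keys from the combined Jira CSVs (query name assumed)
    Keys = Table.Distinct(Table.SelectColumns(JiraNames, {"RawName"})),
    // Re-read the mapping table the user edits on the worksheet
    Existing = Excel.CurrentWorkbook(){[Name = "KeyMap"]}[Content],
    // Left-join so previously mapped values are carried over
    Merged = Table.NestedJoin(Keys, {"RawName"}, Existing, {"RawName"}, "Map", JoinKind.LeftOuter),
    Result = Table.ExpandTableColumn(Merged, "Map", {"StandardName"})
in
    Result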

Tabulator - Getting Columns including order and size

I am creating a table using Tabulator, which seems great and very powerful.
I want a way to save the relevant data of the table so it can be recreated on the fly.
Currently, I think there are a few things I need...
The row data - I get this using table.getData();
The columns - I get this using table.getColumnDefinitions();
The row data seems perfect; I can store that and use it. However, the column information I am saving doesn't appear to include the size of the columns if I have resized them.
Is there a way of getting ALL the relevant column info, so I can save and recreate it exactly?
Alternatively, if there's a single function that saves everything (row data, columns including order, size, etc.) in one go as JSON or similar, that may be handy.
So you have a few options here.
Config Persistence
If you simply want the table to look the same way it did the last time the user used it on that computer, you could look at using the Persistent Configuration module. This will store a copy of the table's column configuration in the browser's local storage so that the next time they load the page it will be laid out the same.
Column Layout
If you want to store it externally then you are correct: the column width is not updated in the definition after a user changes it. To get the current layout of the columns, you can use the getColumnLayout function:
var columnLayout = table.getColumnLayout();
This will only contain the key layout characteristics, not the full definition, so you would need to merge the two if you wanted to store them in one place.
More details on this method can be found in the Manual Column Layout Documentation
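Putting that together, a minimal save-and-restore sketch (assuming your Tabulator version also provides setColumnLayout alongside getColumnLayout; the localStorage key is illustrative):

var state = {
    data: table.getData(),           // row data
    layout: table.getColumnLayout(), // current column order, width, visibility
};
localStorage.setItem("myTableState", JSON.stringify(state));

// Later, after rebuilding the table with the same column definitions:
var saved = JSON.parse(localStorage.getItem("myTableState"));
table.setColumnLayout(saved.layout); // reapply order and sizes
table.setData(saved.data);           // reload the rows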

How do I lock an additional column to rows imported from Power Query in Excel 2016 without a unique key column?

I am using Power Query in Excel 2016 to combine data from 12 different workbooks within the same folder system into one table, and need to add an additional column in the master table that tracks the status of each row. However, when I refresh the data, the Status column does not follow the rows to which it is initially applied.
I have already looked at [Inserting text manually in a custom column and should be visible on refresh of the report], but that solution only works with a unique ID column. Because each of the 12 workbooks is edited separately, and because there is no single column that can be guaranteed to have unique values across all of the different spreadsheets, I don't have a key to join the data to the additional column.
I believe there is always a way of finding a Unique ID. If you can get your head around this, it is not that difficult to solve your problem.
See my example below; I used three sample workbooks saved in a Test folder. It depends on how you add them to the Query Editor: in my example I used From Folder, followed the prompts without making any changes, and combined the tables automatically. Once combined, a Source.Name column is added automatically. I suggest leaving this column in your output table, as it can form part of the unique ID if your data is otherwise very similar across the workbooks.
An optional step (not in my screenshot) is to add an Index column and concatenate the index number with a product/task name to make each line of data entry even more unique.
Once you have added the Status column, with data entered manually on the master table, load the master table back into the Query Editor.
Then go back to the original query (Test (Input) in my example) and merge it with the reloaded output query. See my screenshot for how to 'uniquely' merge the two tables.
The rest is self-explanatory. I think the key is finding the elements of the unique ID and incorporating them in the merge, e.g. as sketched below.
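In M, the composite-key idea might look like this (a sketch; the Master query name and the Task column are assumptions):

let
    Source = #"Test (Input)",
    // Source.Name (added by the From Folder combine) plus a row index
    // and a task name makes a reasonably unique composite key
    Indexed = Table.AddIndexColumn(Source, "Index", 1, 1),
    Keyed = Table.AddColumn(Indexed, "UniqueID",
        each [Source.Name] & "|" & [Task] & "|" & Text.From([Index])),
    // Left-join against the reloaded master table to pull Status back in
    Merged = Table.NestedJoin(Keyed, {"UniqueID"}, Master, {"UniqueID"}, "Prev", JoinKind.LeftOuter),
    Result = Table.ExpandTableColumn(Merged, "Prev", {"Status"})
in
    Result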
Let me know if you have any questions. Cheers :)

How to make a file in Excel that refreshes from the Query Editor and work on it

I am using Power Query Editor to create a working file, using multiple tables from several sources.
After I combine these and make my working file, I use it to do some work in columns I add later on the working file.
I have noticed that the values I enter in the working file are not bound to the main key (let's say the first column), but are independent values in a column.
The result is that if one table changes, for example a line is deleted or I change the sorting of the query, my working file is wrong, since the data changed but the added columns remain as they were.
Is there a way to have the added columns to be bound with a value, as it is for example with VLOOKUP?
How can I make a file that updates from different sources but that I can still work on without the risk of misplacing the work I do?
I hope I am clear.
Thank you in advance!
This is fairly simple if each line in your table is unique (and in your example you say the first column can serve as a key). Set up your working columns on the table, then load the table into PQ (as a connection only). Then go to the original query that combines your data and add a merge at the end, merging against the table you just loaded into PQ and matching on your key. Then expand only your working columns from the merge.
This way, whenever you refresh your table, it will match lines against its existing output before updating, so data in your working columns will be maintained. Note, however, that this only retains values, not any formulas you may be using in your working columns.
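A minimal sketch of that final merge step in M (the query and column names here are assumptions):

let
    // Combined is the query that pulls the source tables together;
    // Working is the output table loaded back as a connection-only query
    Merged = Table.NestedJoin(Combined, {"Key"}, Working, {"Key"}, "Work", JoinKind.LeftOuter),
    // Expand only the manually maintained columns, not the source data
    Result = Table.ExpandTableColumn(Merged, "Work", {"Notes", "Status"})
in
    Result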

Skipping rows when importing Excel into SQL using SSIS 2008

I need to import sheets which look like the following:
March Orders
(empty row)
Week   Order #   Date     Cust #
3.1    271356    3/3/10   010572
3.1    280353    3/5/10   022114
3.1    290822    3/5/10   010275
3.1    291436    3/2/10   010155
3.1    291627    3/5/10   011840
The column headers are actually in row 3. I can use an Excel Source to import the sheet, but I don't know how to specify that the information starts at row 3.
I Googled the problem, but came up empty.
Have a look at the threads below; the links have more details, but I've included some text from the pages (just in case the links go dead).
http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/thread/97144bb2-9bb9-4cb8-b069-45c29690dfeb
Q:
While we are loading a text file into SQL Server via SSIS, we have the provision to skip any number of leading rows from the source and load the data into SQL Server. Is there any provision to do the same for an Excel file?
The source Excel file for me has some description in the leading 5 rows; I want to skip them and start the data load from row 6. Please provide your thoughts on this.
A:
Easiest would be to give each row a number (a bit like an identity in SQL Server) and then use a Conditional Split to filter out everything where the number <= 5.
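In practice that row number usually comes from a Script Component (transformation) in the data flow; a hedged C# sketch, where RowNumber is an output column you would add to the component yourself:

public class ScriptMain : UserComponent
{
    private int rowCount = 0;

    // Runs once per row flowing through the component
    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        rowCount++;
        Row.RowNumber = rowCount; // a Conditional Split can then drop RowNumber <= 5
    }
}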
http://social.msdn.microsoft.com/Forums/en/sqlintegrationservices/thread/947fa27e-e31f-4108-a889-18acebce9217
Q:
Is it possible, when importing data from Excel to a DB table, to skip the first 6 rows, for example?
Also, the Excel data is divided into sections with headers. Is it possible, for example, to skip every 12th row?
A:
YES YOU CAN. Actually, you can do this very easily if you know the number of columns that will be imported from your Excel file. In
your Data Flow task, you will need to set the "OpenRowset" Custom
Property of your Excel Connection (right-click your Excel connection >
Properties; in the Properties window, look for OpenRowset under Custom
Properties). To ignore the first 5 rows in Sheet1, and import columns
A-M, you would enter the following value for OpenRowset: Sheet1$A6:M
(notice, I did not specify a row number for column M. You can enter a
row number if you like, but in my case the number of rows can vary
from one iteration to the next)
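Applied to the sheet in the original question (headers in row 3, four columns A-D), the OpenRowset value would presumably be:

Sheet1$A3:D

so that row 3 supplies the column names and the data is read from row 4 onward (assuming the connection's 'first row has column names' option is enabled).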
AGAIN, YES YOU CAN. You can import the data using a conditional split. You'd configure the conditional split to look for something in
each row that uniquely identifies it as a header row; skip the rows
that match this 'header logic'. Another option would be to import all
the rows and then remove the header rows using a SQL script in the
database...like a cursor that deletes every 12th row. Or you could
add an identity field with seed/increment of 1/1 and then delete all
rows with row numbers that divide perfectly by 12. Something like
that...
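The identity-and-delete variant might look like this in T-SQL (a sketch; the table and column names are illustrative):

-- Stage the imported rows with an identity column
CREATE TABLE StagingOrders (
    RowId     INT IDENTITY(1,1),
    Week      VARCHAR(10),
    OrderNo   VARCHAR(20),
    OrderDate VARCHAR(20),
    CustNo    VARCHAR(20)
);

-- After the load, remove the repeating header rows (here, every 12th row)
DELETE FROM StagingOrders
WHERE RowId % 12 = 0;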
http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/thread/847c4b9e-b2d7-4cdf-a193-e4ce14986ee2
Q:
I have an SSIS package that imports from an Excel file with data
beginning in the 7th row.
Unlike the same operation with a csv file ('Header Rows to Skip' in
Connection Manager Editor), I can't seem to find a way to ignore the
first 6 rows of an Excel file connection.
I'm guessing the answer might be in one of the Data Flow
Transformation objects, but I'm not very familiar with them.
A:
rbhro, actually there were 2 fields in the upper 5 rows that had some data that I think prevented the importer from ignoring those rows completely.
Anyway, I did find a solution to my problem.
In my Excel Source object, I used 'SQL Command' as the 'Data Access Mode' (it's a drop-down when you double-click the Excel Source object). From there I was able to build a query ('Build Query' button) that only grabbed the records I needed, something like this:

SELECT F4, F5, F6 FROM [Spreadsheet$]
WHERE (F4 IS NOT NULL) AND (F4 <> 'TheHeaderFieldName')
Note: I initially tried ISNUMERIC instead of IS NOT NULL, but that wasn't supported for some reason.
In my particular case, I was only interested in rows where F4 wasn't NULL (and fortunately F4 didn't contain any junk in the first 5 rows). I could skip the whole header row (row 6) with the second WHERE clause.
So that cleaned up my data source perfectly. All I needed to do then was add a Data Conversion object between the source and the destination (everything needed to be converted from Unicode in the spreadsheet), and it worked.
My first suggestion is not to accept a file in that format. Excel files to be imported should always start with column header rows. Send it back to whoever provides it to you and tell them to fix their format. This works most of the time.
We provide guidance to our customers and vendors about how files must be formatted before we can process them, and it is up to them to meet the guidelines as far as possible. People often aren't aware that files like that create a problem in processing (next month it might have six lines before the data starts), and they need to be educated that Excel files must start with the column headers, have no blank lines in the middle of the data, never repeat the headers, and, most important of all, have the same columns with the same column titles in the same order every time. If they can't provide that, you probably don't have something that will work for automated import, as you will get the file in a different format every time depending on the mood of the person who maintains the spreadsheet.
Incidentally, we push really hard never to receive any data from Excel (this only works some of the time, but if they have the data in a database, they can usually accommodate us). They also must know that any changes they make to the spreadsheet format will result in a change to the import package, and that they will be charged for those development changes (assuming these are outside clients and not internal ones). Such changes must be communicated in advance and developer time scheduled; a file with the wrong format will fail and be returned to them to fix.
If that doesn't work, may I suggest that you open the file, delete the first two rows and save it as a text file, then write a data flow that processes the text file. SSIS does a lousy job of supporting Excel, and anything you can do to get the file into a different format will make life easier in the long run.
"My first suggestion is not to accept a file in that format. Excel files to be imported should always start with column header rows."
Not entirely correct: SSIS forces you to work with the format you are given, and quite often it does not work correctly with Excel.
If you can't change the format, consider using our Advanced ETL Processor.
You can skip rows or fields and you can validate the data the way you want.
http://www.dbsoftlab.com/etl-tools/advanced-etl-processor/overview.html
Sky is the limit
You can just use the OpenRowset property, which you can find in the Excel Source properties.
Take a look here for details:
SSIS: Read and Export Excel data from nth Row
Regards.
