How to manually control data schema interpretation - Excel

When I export public weather data from https://www1.ncdc.noaa.gov/pub/data/uscrn/products/subhourly01/2017/CRNS0101-05-2017-TX_Austin_33_NW.txt, as soon as solar radiation > 9, all of my data for the remaining columns gets lumped into a single column. I have tried uploading as txt and csv, and the problem still exists in Excel, Sheets, and Dataprep.
Why is this happening?
Is there a programmatic way to fix this so that the data populates as intended, with 1 value per column?

It is likely because the initial data structure is not detected correctly. This can happen if the first rows of your dataset have a different structure than the remaining rows.
To solve this problem in Dataprep, you can indicate how the dataset should be structured by following these steps:
Go to the flow view
Right-click on the dataset and choose "Remove structure..."
Open the recipe
Insert a split row step:
splitrows col: column1 on: '\n'
Split the column using a whitespace regex (e.g. /\s+/):
splitpatterns col: column1 type: on on: /\s+/ limit: 22
(you can copy and paste these commands into the search input when you create a new step)
After these steps, the data should populate with one value per column.
Note: it is also possible to prevent the initial structure detection when importing a dataset. See https://cloud.google.com/dataprep/docs/html/Remove-Initial-Structure_136154971
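If you prefer a fully programmatic route outside Dataprep, the file can also be parsed directly in Python with pandas. This is a minimal sketch, not part of the original workflow, assuming the file is whitespace-delimited with no header row (the file name is the one from the question):

import pandas as pd

# Split on runs of whitespace, mirroring the /\s+/ pattern above.
# The NOAA file ships without a header row, so read it headerless.
df = pd.read_csv("CRNS0101-05-2017-TX_Austin_33_NW.txt", sep=r"\s+", header=None)

print(df.shape)  # with the limit: 22 splits above, expect 23 columns, one value each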

Related

Aligning vertically a series of tables with text

Hi, I need the text in a spreadsheet to be in a specific format so that I can upload it to a translation tool.
I have already used the text split function to separate the text in a cell with bullet points, moving each bullet point to a separate cell.
Then I used the transpose function to separate each set of data. For context, you are looking at fashion products.
The name of the product is on the first row, followed by a list of features (e.g. "Bracciale" means bracelet and it is followed by the list of materials)
Now for the last step, I need these sets to be arranged vertically, not horizontally, one below the other.
I would like to set up an automatic system so that every time we receive a list with hundreds of these products we do not need to copy-paste them one below the other.
With pivot tables maybe? Keep in mind that if it is too complex it might be hard to train the translators to do it each time. Please let me know your suggestions. Thank you!
I am not a programmer. I tried pivot tables, but the data was in the wrong order, and I am not sure how to get the data out of the pivot table with values only, without the sub-menus.
My suggestion would be to use the 'Unpivot Columns' feature in the Power Query Editor - it would be really simple.
Steps:
Select the whole range
Go to Data // Get & Transform Data // From Table/Range
Uncheck 'My Table has headers' (unless it does - but doesn't look like it?)
Press OK. This will open Power Query Editor and will have actually given you column names Col1/2/3 etc, but ignore that.
Go to Add Column // Index column
Select all columns EXCEPT the new index column by Shift+clicking on those headers
Go to Transform // Unpivot Columns
Assuming the order is important, click in the Attribute column and Sort Ascending
Click in the Index column and Sort Ascending
Remove the Attribute and Index columns if you want (right click header)
Go to File // Close & Load
You will get a new table - dynamically linked to the first (i.e. it can be updated/refreshed) - in the unpivoted format.
Let me know if you need more details / screenshot?
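For comparison only, here is how the same unpivot idea looks in Python with pandas; the two sample rows are made up, and this is a sketch of the reshape, not a replacement for the Power Query steps:

import pandas as pd

# Hypothetical stand-in for the transposed range: one product per row,
# the product name in the first cell and its features in the rest.
df = pd.DataFrame([["Bracciale", "oro", "perle"],
                   ["Collana", "argento", "seta"]])
df["Index"] = df.index                           # same role as the added Index column
long = df.melt(id_vars="Index")                  # the Unpivot Columns step
long = long.sort_values("Index", kind="stable")  # keep each product's set together
print(long["value"].tolist())                    # one vertical list, set by set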
Based on this trick, maybe the following is helpful:
Formula in A5:
=DROP(REDUCE(0,A1:A3,LAMBDA(a,b,VSTACK(a,TEXTSPLIT(b,,HSTACK(CHAR(10),"^"),1)))),1)
TEXTSPLIT() will use a combination of newline chars and the circumflex to split the input directly into a vertical array;
Iteration in REDUCE() will allow for stacked results;
DROP() the initial value from results.

Excel: Comparing two CSV files and isolating line data from corresponding columns

I have two CSV files: one of 25,000 lines containing all the data, and one of 9,000 lines containing the names for which I need to get the data from the first file.
Someone told me this would be fairly easy using Excel, but I can't seem to find a similar problem.
I've tried comparison tools, but they are not helping me isolate what I need.
Using this example:
Master file:
Name;email;displayname
Bbob;Bbob#mail.com;Bob bob
Mmartha;Martha#mail.com;Mmartha
Cclaire;Cclaire#mail.com;cclair
Name file:
Name
Mmartha
Cclaire
What I need to get after comparison:
Name;email;displayname
Mmartha;Martha#mail.com;Mmartha
Cclaire;Cclaire#mail.com;cclair
So for the names I have in my second CSV, I need to get the entire line from the master CSV file.
Right now I can use Notepad compare, for example, but on 25,000 lines, considering what I need, that means a lot of manual labor. I think someone must have faced a similar issue before.
I can't seem to find a solution right now so here I am.
Apologies in advance for the Dutch screenshots; I'm unsure about the English terms in Power Query, but you should be able to follow the procedure.
Using Power Query:
Start PowerQuery
Load both source CSV1 and CSV2
Choose Merge Queries as New
Select column 1 in both tables and choose the Inner join option
The result should be a new query containing only the matching rows.
Use the first row as headers.
Delete the 4th column, then close and load the values.
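If Python is available, the same inner join can be sketched in pandas; the file names here are hypothetical, and sep=";" matches the semicolon-delimited example above:

import pandas as pd

master = pd.read_csv("master.csv", sep=";")            # Name;email;displayname
names = pd.read_csv("names.csv", sep=";")              # the 9,000-line name list
result = master.merge(names, on="Name", how="inner")   # keep only matching lines
result.to_csv("result.csv", sep=";", index=False)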

Sorting txt data files while importing in Excel Data Query

I am trying to load approximately 190 txt data files into Excel using the New Query tool (Data->New Query->From File->From Folder). In Windows Explorer the files are properly ordered: the first being 0summary, the second 30summary, etc.
However, when loading them through the query tool, the files are not in the right order.
The files are sorted based on the first digit instead of the value represented. Is there a solution to this issue? I have tried putting a space between the number and "summary", but that didn't work either. I saw online that Excel doesn't recognize text within "" or after /, but Windows doesn't allow those symbols in file names. Even when I removed the word "summary", the problem wasn't fixed. Any suggestions?
If all your names include the word Summary:
You can add a column ("Extract" / "Text Before Delimiter"), enter "Summary", change the column type to Number, and sort on that column
If the only numbers are those you wish to sort on, you can
add a custom column with just the numbers
Change the data type to whole number
sort on that.
The formula for the custom column:
Text.Select([Name],{"0".."9"})
If the alpha portion varies, and you need to sort on that also, you can do something similar adding another column for the alpha portion, and sorting on that.
If there might be digits after the leading digits upon which you want to sort, then use the following formula for the added column which will extract only the digits at the beginning of the file name:
=Text.Middle([Name],0,Text.PositionOfAny([Name],{"A".."z"}))
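Outside Power Query, the same leading-number sort is a few lines of Python; the sample names below are made up to mirror the question:

import re

files = ["0summary", "30summary", "120summary", "5summary"]
# Sort on the leading digits as a number, not as text.
files.sort(key=lambda name: int(re.match(r"\d+", name).group()))
print(files)  # ['0summary', '5summary', '30summary', '120summary']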

Spark - Have I read from csv correctly?

I read a csv file into Spark using:
df = spark.read.format(file_type).options(header='true', quote='\"',
ignoreLeadingWhiteSpace='true',inferSchema='true').load(file_location)
When I tried it with sample csv data from another source and ran display(df), it showed a neatly formatted header row followed by data.
When I try it on my main data, which has 40 columns, and millions of rows, it simply displays the first 20 column headers and no data rows.
Is this normal behavior or is it reading it wrong?
Update:
I shall mark the question as answered as the tips below are useful. However my results from doing:
df.show(5, truncate=False)
currently shows:
+--------------------------------------------------------------------------------------+
|��"periodID","DAXDate","Country Name","Year","TransactionDate","QTR","Customer Number","Customer Name","Customer City","Document Type Code","Order Number","Product Code","Product Description","Selling UOM","Sub Franchise Code","Sub Franchise Description","Product Major Code","Product Major Description","Product Minor Code","Product Minor Description","Invoice Number","Invoice DateTime","Class Of Trade ID","Class Of Trade","Region","AmountCurrencyType","Extended Cost","Gross Trade Sales","Net Trade Sales","Total(Ext Std Cost)","AdjustmentType","ExcludeComment","CurrencyCode","fxRate","Quantity","FileName","RecordCount","Product Category","Direct","ProfitCenter","ProfitCenterRegion","ProfitCenterCountry"|
+--------------------------------------------------------------------------------------+
I shall have to go back to basics and preview the csv in a text editor to find out what the correct format for this file is and figure out what's going wrong. Note, I had to update my code to the following to deal with the pipe delimiter:
df = spark.read.format(file_type).options(header='true', quote='\"', delimiter='|',ignoreLeadingWhiteSpace='true',inferSchema='true').load(file_location)
Yes, this is normal behaviour. The dataframe function show() displays 20 rows by default. You can set a different value (but keep in mind that it doesn't make sense to print all rows of your file) and also stop it from truncating. For example:
df.show(100, truncate=False)
It is a normal behaviour. You can view the content of your data in different ways:
show(): shows the first 20 rows in a formatted way. You can specify the number of rows to display as an argument (a value much higher than the number of rows in your data is fine). Columns are truncated too by default; specify truncate=False to show them in full (as cronoik correctly said in his answer).
head(): the same as show(), but it prints the data in a "row" format. It does not provide a nicely formatted table, but it is useful for a quick, complete look at your data, for example head(1) to show only the first row.
describe().show(): shows a summary that gives you insight into the data, for example the count of elements and the min/max/avg value of each column.
It is normal for Spark dataframes to display a limited number of rows and columns. Your reading of the data should not be a problem. However, to confirm that you have read the csv correctly, you can try to see the number of rows and columns in the df using
len(df.columns)
or
df.columns
For number of rows
df.count()
In case you need to see the content in detail you can use the option stated by cronoik.
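Putting those tips together, a quick sanity check might look like the following sketch (file_location and the pipe delimiter come from the question's update; the encoding hint at the end is only a guess, prompted by the �� prefix in the output above, which looks like a byte-order mark):

# Re-read with an explicit delimiter, then confirm the shape.
df = spark.read.format("csv") \
    .options(header="true", quote='"', delimiter="|",
             ignoreLeadingWhiteSpace="true", inferSchema="true") \
    .load(file_location)
print(len(df.columns), df.count())  # expect 40 columns and millions of rows
df.printSchema()                    # one entry per column if parsing worked
df.show(5, truncate=False)
# If the ��" prefix persists, try adding encoding="UTF-16" to the options.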

SSIS Text was truncated with status value 4

I am developing an SSIS package, trying to update an existing SQL table from a CSV flat file. All of the columns update successfully except for one. If I set that column to ignore truncation, my package completes successfully, so I know this is a truncation problem rather than some other error.
This column is empty for almost every row. However, there are a few rows where this field is 200-300 characters. My data conversion task identified this field as DT_WSTR, but from what I've read elsewhere maybe it should be DT_NTEXT. I've tried both, and I even set the DT_WSTR length to 500, but none of this fixed the problem. How can I fix it? What data type should this column be in my SQL table?
Error: 0xC02020A1 at Data Flow Task 1, Source - Berkeley812_csv [1]: Data conversion failed. The data conversion for column "Reason for Delay in Transition" returned status value 4 and status text "Text was truncated or one or more characters had no match in the target code page.".
Error: 0xC020902A at Data Flow Task 1, Source - Berkeley812_csv [1]: The "output column "Reason for Delay in Transition" (110)" failed because truncation occurred, and the truncation row disposition on "output column "Reason for Delay in Transition" (110)" specifies failure on truncation. A truncation error occurred on the specified object of the specified component.
Error: 0xC0202092 at Data Flow Task 1, Source - Berkeley812_csv [1]: An error occurred while processing file "D:\ftproot\LocalUser\RyanDaulton\Documents\Berkeley Demographics\Berkeley812.csv" on data row 758.
One possible reason for this error is that your delimiter character (comma, semi-colon, pipe, whatever) actually appears in the data in one column. This can give very misleading error messages, often with the name of a totally different column.
One way to check this is to redirect the 'bad' rows to a separate file and then inspect them manually. Here's a brief explanation of how to do that:
http://redmondmag.com/articles/2010/04/12/log-error-rows-ssis.aspx
If that is indeed your problem, then the best solution is to fix the files at the source to quote the data values and/or use a different delimiter that isn't in the data.
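If you want to find the offending rows before changing the package, a short Python sketch can flag both over-long values and rows with the wrong column count (the file name and column name come from the error messages; the 300-character threshold is an assumption based on the question):

import csv

with open("Berkeley812.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    header = next(reader)
    col = header.index("Reason for Delay in Transition")
    for lineno, row in enumerate(reader, start=2):
        if len(row) != len(header):
            print(lineno, "column count mismatch:", len(row))  # stray delimiter?
        elif len(row[col]) > 300:
            print(lineno, "long value:", len(row[col]), "characters")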
I've had this issue before; it is likely that the default column size for the file is incorrect. It will put a default size of 50 characters, but the data you are working with is larger. In the advanced settings for your data file, adjust the column size from 50 to the table's column size.
I suspect the
or one or more characters had no match in the target code page
part of the error.
If you remove the rows with values in that column, does it load?
Can you identify, in other words, the rows which cause the package to fail?
It could be the data is too long, or it could be that there's some funky character in there SQL Server doesn't like.
If this is coming from the SQL Server Import Wizard, try editing the definition of the column on the Data Source; it is 50 characters by default, but it can be longer.
Data Source -> Advanced -> find the column that errors -> change OutputColumnWidth to 200 and try again.
I've had this problem before. You can go to the "Advanced" tab of the "Choose a Data Source" page, click the "Suggest Types" button, and set the "Number of rows" as high as you want. After that, the types and text-qualified settings are detected from the actual values. I applied this solution and was able to convert my data to SQL.
In my case, some of my rows didn't have the same number of columns as the header. For example, the header has 10 columns and one of your rows has 8 or 9. (To get the column count, count the delimiter characters in each line.)
If all other options have failed, try recreating the data import task and/or the connection manager. If you've made any changes since the task was originally created, this can sometimes do the trick. I know it's the equivalent of rebooting, but, hey, if it works, it works.
I had the same problem, and it was due to a column with very long data.
When I mapped it, I changed it from DT_STR to text stream, and it worked.
In the destination's advanced settings, check that the length of the column is equal to the source's.
The OutputColumnWidth of the column must be increased.
Path: Source Connection Manager-->Advanced-->OutputColumnWidth
