Excel Query missing rows - excel

I'm trying to set up tables in a spreadsheet using Power Query (From File, xlsx).
The source data has 16,149 rows, but after the import I only get 16,107.
I've pared the import down as much as I can:
let
    Source = Excel.Workbook(File.Contents("C:\Users\myfile.xlsx"), null, true),
    Sheet1_Sheet = Source{[Item="Sheet1",Kind="Sheet"]}[Data],
    #"Promoted Headers" = Table.PromoteHeaders(Sheet1_Sheet)
in
    #"Promoted Headers"
Is there any way to debug this?

OK, figured it out. The problem was that I was ignoring an error message raised when I ran the query, caused by the new file missing a column that was present in the old data.
Net-net: the new file never got imported, and that's why the row counts were different.
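In general, when imported row counts don't match, checking for row-level errors is a useful first step. A minimal sketch (reusing the same steps as above, with Table.SelectRowsWithErrors added; an empty result rules out cell-level errors as the cause):

```
let
    Source = Excel.Workbook(File.Contents("C:\Users\myfile.xlsx"), null, true),
    Sheet1_Sheet = Source{[Item="Sheet1",Kind="Sheet"]}[Data],
    Promoted = Table.PromoteHeaders(Sheet1_Sheet),
    // Keep only rows that contain an error value in any cell
    ErrorRows = Table.SelectRowsWithErrors(Promoted)
in
    ErrorRows
```

Comparing Table.RowCount(Promoted) against the source's row count at each step can also pinpoint where rows go missing.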

Related

Access columns in Nested Tables by its column position instead by name using Power Query

I have been working on a Power Query for Excel that checks a folder, gets the Excel workbooks in it, and consolidates the sheets within. I'm a novice in Power Query, so I need examples to accomplish this.
I've been stuck iterating on some changes, trying to overcome the fact that some Excel sheets do not have the same column names ('Column1', 'Column2' or 'Column3'), and when accessing columns by name the query gives me an error.
The comparison should be done on columns 1 to 3 of each nested table in the 'First15Rows' column; if there are 3 or more Desired Titles found in the same row, the sheet is considered valid.
So I'm asking for help. The current query looks like this:
let
    Source = Folder.Files(Excel.CurrentWorkbook(){[Name="FldrLocation"]}[Content][FldrLocation]{0}),
    FilterFileNames = Table.SelectRows(Source, each not Text.StartsWith([Name], "~$") and Text.Contains([Extension], ".xls")),
    RemoveOtherCols1 = Table.SelectColumns(FilterFileNames,{"Content", "Name", "Date modified"}),
    OnlyRecent = Table.SelectRows(RemoveOtherCols1, each [Date modified] >= Date.AddWeeks(DateTime.LocalNow(), -WeeksAgo)),
    AddSheetsColumn = Table.AddColumn(OnlyRecent, "Custom", each Excel.Workbook([Content])),
    ExpandSheetsFromTable = Table.ExpandTableColumn(AddSheetsColumn, "Custom", {"Name", "Data"}, {"Sheets", "Data"}),
    FilterSheetNames = Table.SelectRows(ExpandSheetsFromTable, each not Text.Contains([Sheets], "Print") and not Text.StartsWith([Sheets], "_xlnm")),
    RemoveEmptySheets = Table.SelectRows(FilterSheetNames, each
        if Table.IsEmpty(Table.SelectRows([Data], each _[Column1] <> null)) then null else true),
    AddFirst15Rows = Table.AddColumn(RemoveEmptySheets, "First15Rows", each Table.FirstN([Data], 15)),
    CheckMatch = Table.SelectRows(AddFirst15Rows, each
        if Table.IsEmpty(Table.SelectRows([First15Rows], each _[Column1] = "Date" or _[Column2] = "Time"))
        then null
        else true)
in
    CheckMatch
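One way to sidestep the column names entirely is Record.FieldValues, which returns a row's cell values by position. A hedged sketch of how the CheckMatch step could be rewritten along those lines (DesiredTitles is a hypothetical list you would define yourself; the {"Date", "Time", "Name"} values are placeholders):

```
// Hypothetical list of titles to look for in the first three columns
DesiredTitles = {"Date", "Time", "Name"},
CheckMatch = Table.SelectRows(AddFirst15Rows, each
    not Table.IsEmpty(
        Table.SelectRows([First15Rows], (r) =>
            // Take the first three cell values of the row by position, not by name,
            // and count how many of them appear in DesiredTitles
            List.Count(List.Intersect({List.FirstN(Record.FieldValues(r), 3), DesiredTitles})) >= 3)))
```

Because no column name is ever referenced, sheets with differing headers no longer raise an error at this step.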

Filter one list by another using power query

I have a list of elemental impurities in Power Query which I wish to filter according to whether or not they exist on another list, known as the Prop65 list.
The screenshot below shows a simplified example of what I am trying to achieve.
I appreciate that this can be done using formulas; however, I don't know how to achieve it with a Power Query solution. If anyone knows how, it would be appreciated.
Data shown (Impurity, Amount (ppm)):
Aluminium 33.885
Antimony 0.6777
Arsenic 3.5064
Barium 2.259
Boron 1.3554
Bromoform 0.555
Cadmium 3.18895
Chromium 0.33885
Cobalt 1.1295
Copper 0.4518
Indium 0.4518
Simplified Prop65 List
Arsenic
Bromoform
Cadmium
Furan
Lead
Nafenopin
Here is one way to do that:
Read in the two tables
Do an Inner Join
let
    //get original data
    Source = Excel.CurrentWorkbook(){[Name="Data"]}[Content],
    data = Table.TransformColumnTypes(Source,{{"Impurity", type text}, {"Amount (ppm)", type number}}),
    //get Filter
    Source2 = Excel.CurrentWorkbook(){[Name="Prop65"]}[Content],
    filter = Table.TransformColumnTypes(Source2,{"Simplified Prop65 List", Text.Type}),
    //Join them
    filteredData = Table.Join(data, "Impurity", filter, "Simplified Prop65 List", JoinKind.Inner),
    //Remove unneeded column
    #"Removed Columns" = Table.RemoveColumns(filteredData,{"Simplified Prop65 List"})
in
    #"Removed Columns"
Another method would be a filter (Table.SelectRows), but it may be slower on a large dataset. At least in the one instance where I had an opportunity to compare, the Table.Join method was faster on a 100,000-row data set.
let
    //get original data
    Source = Excel.CurrentWorkbook(){[Name="Data"]}[Content],
    data = Table.TransformColumnTypes(Source,{{"Impurity", type text}, {"Amount (ppm)", type number}}),
    //get Filter
    Source2 = Excel.CurrentWorkbook(){[Name="Prop65"]}[Content],
    filter = Table.TransformColumnTypes(Source2,{"Simplified Prop65 List", Text.Type})[#"Simplified Prop65 List"],
    //filter the rows
    filteredData = Table.SelectRows(data, each List.Contains(filter, [Impurity]))
in
    filteredData
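If the Table.SelectRows route is chosen on a large dataset, buffering the filter list first may narrow the performance gap, since an unbuffered list can be re-evaluated for every row. A sketch of that variant (same named ranges as above):

```
let
    Source = Excel.CurrentWorkbook(){[Name="Data"]}[Content],
    data = Table.TransformColumnTypes(Source,{{"Impurity", type text}, {"Amount (ppm)", type number}}),
    // List.Buffer pins the filter list in memory so it is evaluated only once
    filter = List.Buffer(Excel.CurrentWorkbook(){[Name="Prop65"]}[Content][#"Simplified Prop65 List"]),
    filteredData = Table.SelectRows(data, each List.Contains(filter, [Impurity]))
in
    filteredData
```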

How to insert value in already created Database table through pandas `df.to_sql()`

I'm creating a new table and then inserting values into it, because the TSV file doesn't have headers, so I need to create the table structure first and then insert the values. I'm using the df.to_sql function to insert the TSV values into the database table, but while it creates the table, it doesn't insert any values into it, and it doesn't give any kind of error either.
I have tried creating a new table through SQLAlchemy and inserting values; that worked, but it didn't work for the already-created table.
conn, cur = create_conn()
# note: special characters in the password (like '#') must be URL-encoded (e.g. %23)
engine = create_engine('postgresql://postgres:Shubham#123@localhost:5432/walmart')
create_query = '''create table if not exists new_table(
    "item_id" TEXT, "product_id" TEXT, "abstract_product_id" TEXT,
    "product_name" TEXT, "product_type" TEXT, "ironbank_category" TEXT,
    "primary_shelf" TEXT, "apparel_category" TEXT, "brand" TEXT)'''
cur.execute(create_query)
conn.commit()
file_name = 'new_table'
new_file = "C:\\Users\\shubham.shinde\\Desktop\\wallll\\new_file.txt"
data = pd.read_csv(new_file, delimiter="\t", chunksize=500000, error_bad_lines=False, quoting=csv.QUOTE_NONE, dtype="unicode", iterator=True)
with open(file_name + '_bad_rows.txt', 'w') as f1:
    sys.stderr = f1  # capture bad-row messages in a file
    for df in data:
        df.to_sql('new_table', engine, if_exists='append')
    data.close()
I want to insert values from df.to_sql() into database table
Not 100% certain whether this argument works with PostgreSQL, but I had a similar issue when doing this on MSSQL. .to_sql() already creates the table named by its first argument, new_table. if_exists='append' also doesn't check for duplicate values: if the data in new_file is overwritten, or run through your function again, it will just be added to the table. As to why you're seeing the table name but no data in it, it might be due to the size of the df. Try setting fast_executemany=True as the second argument of create_engine.
My suggestion: get rid of create_query and handle the data types after to_sql(). Once the SQL table is created, you can use your actual SQL table and join against this staging table for duplicate testing. The non-duplicates can be written to the actual table, converting data types on UPDATE to match the table's data type structure.
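One likely contributor, worth checking: because the TSV has no header row, pandas names the columns 0, 1, 2, ..., which will not line up with the column names of the pre-created table. The core idea (supply the column names yourself before inserting into an existing table) can be sketched with only the standard library; sqlite3 stands in for Postgres here, csv stands in for pandas.read_csv, and the shortened column set and sample rows are made-up stand-ins:

```python
import csv
import io
import sqlite3

# Pre-create the target table, mirroring the create_query in the question
conn = sqlite3.connect(":memory:")
conn.execute('create table if not exists new_table ("item_id" TEXT, "product_id" TEXT, "brand" TEXT)')

# A headerless TSV, like the source file: every line is data, no header row
tsv = "A1\tP1\tAcme\nA2\tP2\tGlobex\n"

# Supply the column names yourself; with pandas the equivalent is
# pd.read_csv(..., header=None, names=columns)
columns = ["item_id", "product_id", "brand"]
rows = list(csv.reader(io.StringIO(tsv), delimiter="\t"))

# Insert into the already-created table, naming the columns explicitly
col_list = ", ".join('"%s"' % c for c in columns)
placeholders = ", ".join("?" for _ in columns)
conn.executemany(
    "insert into new_table (%s) values (%s)" % (col_list, placeholders), rows)
conn.commit()

inserted = conn.execute("select count(*) from new_table").fetchone()[0]
```

With pandas, passing names=columns to read_csv before calling to_sql(..., if_exists='append') follows the same principle.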

Excel 2016 Power Query - Get data from Oracle data dictionary

I've got an Excel 2016 spreadsheet set up with an Oracle db data source and I already have several queries set up in PowerQuery to get data from the tables in a specific schema and all is working well.
I now need to get some data from the data dictionary - I need to find the name of a trigger associated with a specific table in the schema - so I've set up a query to try to get data from user_triggers, but so far I've not been able to get it to work.
This is the query I have set up so far (SourceTableName is a reference to a named cell in the sheet to get the table name) :
let
    STN = Excel.CurrentWorkbook(){[Name="SourceTableName"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(STN,{{"Column1", type text}}),
    table_name = #"Changed Type"{0}[Column1],
    Source = Oracle.Database("MY_DB", [HierarchicalNavigation=true]),
    Schema = Source{[Schema="MY_SCHEMA"]}[Data],
    USER_TRIGGERS = Schema{[Name="USER_TRIGGERS"]}[Data]
in
    USER_TRIGGERS
This works perfectly fine for the other queries I already have set up as long as the table name is one of the tables in the schema, but referring to a data dictionary view as in the above example doesn't seem to work.
The error I get when trying to run this query is:
Expression.Error: The key didn't match any rows in the table.
Details:
Key=Record
Table=Table
Does anyone know if it's actually possible to get data from the data dictionary using powerquery and if it is what do I need to change to get it to work?
Thanks in advance for any help with this!
Cheers,
Dave
I've figured it out! Answering my own question in case it's useful for anyone else in the future
It's actually possible to specify an SQL query directly in the db connection line and you can include variable names from other parts of the query in the SQL, like so:
let
    STN = Excel.CurrentWorkbook(){[Name="SourceTableName"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(STN,{{"Column1", type text}}),
    table_name = #"Changed Type"{0}[Column1],
    upper_tn = Text.Upper(table_name),
    Triggers = Oracle.Database("MY_DB", [HierarchicalNavigation=true, Query="select trigger_name from user_triggers where table_name = '" & upper_tn & "'"])
in
    Triggers
Using the SQL query directly in this way seems to work fine for data dictionary views :)
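One caveat with splicing a cell value into the SQL text this way: a table name containing a single quote would break the statement. A small hedged safeguard, doubling up any single quotes before building the query (safe_tn is a name introduced here for illustration):

```
// Double up single quotes so the spliced value cannot terminate the string literal
safe_tn = Text.Replace(upper_tn, "'", "''"),
Triggers = Oracle.Database("MY_DB", [HierarchicalNavigation=true,
    Query="select trigger_name from user_triggers where table_name = '" & safe_tn & "'"])
```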

Is there a way to get column names using hiveContext?

I have an "iplRDD" which is a JSON, and I do the steps below, querying through hiveContext. I get the results, but without column headers. Is there a way to get the column names along with the values?
val teamRDD = hiveContext.jsonRDD(iplRDD)
teamRDD.registerTempTable("teams")
hiveContext.cacheTable("teams")
val result = hiveContext.sql("select * from teams where team_name = 'KKR'")
result.collect.foreach(println)
Any thoughts please ?
teamRDD.schema.fieldNames should contain the header names.
You can get it by using:
result.schema().fields();
you can save your dataframe 'result' like this with header as csv file:
result.write().format("com.databricks.spark.csv").option("header", "true").save(outputPath);
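For reference, the DataFrame also exposes the header names directly via columns, which pairs naturally with each row's values. A sketch, assuming the result DataFrame from the question:

```scala
// columns returns an Array[String] of the header names
val headers: Array[String] = result.columns
// print each row with every value prefixed by its column name
result.collect.foreach { row =>
  println(headers.zip(row.toSeq).map { case (h, v) => s"$h=$v" }.mkString(", "))
}
```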
