I have a CSV that contains some production data. When loaded into Excel's Power Query it has a structure similar to this (material batches may contain remainders of old material batches as recycling material):
Mat_Batch  Date        Recyc_Batch  RawMaterial1  RawMaterial2  RawMaterial3  Amount1  Amount2  Amount3
123        01.11.2019               Fe            Cr            Ni            70       19       11
234        01.12.2019               Fe            Cr            Ni            71       18       11
345        01.02.2020  123          Fe            Cr            Ni            72       17       9
456        01.01.2020  234          Fe            Cr            Ni            70       19       11
567        01.02.2020               Fe            Cr            Ni            72       16       10
678        01.01.2020  456          Fe            Cr            Ni            70       19       11
Another CSV has the following content (it simply links a production batch to a material batch; production batches may contain more than one material batch):
Batch Mat_Batch
abc 456
abc 567
bcd 345
Now I would like to use Power Query M to evaluate which material batches exactly were used to produce a part batch. E.g. batch "abc" was made from 456 + 567 + 234 (as recycling material in 456).
As a first step, I filter the production batch table by a specific batch and join both tables via the resulting Mat_Batch column. As a second iteration I separate the Recyc_Batch column from the matched material batches and do a second join with a copy of my material batch table to get all additional recycling materials that were used. But how could I do so an "infinite" number of times? The way I'm doing it, I have to create additional queries for each iteration, but I need a way to automatically repeat those joining steps until there is no more additional recycling material used.
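One way to repeat those joining steps until no new recycling batches turn up is a small recursive function in M. Below is a minimal sketch, not a tested solution: it assumes the two CSVs are loaded as queries named tbl_Material and tbl_Production (the names used in the answer that follows), that blank Recyc_Batch cells come through as null, and it uses "abc" as an example batch.

let
    // Collect the given material batches plus any recycling batches they contain, repeatedly
    fnResolve = (batches as list) as list =>
        let
            Matched  = Table.SelectRows(tbl_Material, each List.Contains(batches, [Mat_Batch])),
            Recycled = List.RemoveNulls(Matched[Recyc_Batch]),
            NewOnes  = List.Difference(Recycled, batches)
        in
            if List.IsEmpty(NewOnes)
            then batches
            else @fnResolve(List.Distinct(batches & NewOnes)),
    // Material batches directly linked to the production batch of interest
    StartBatches = Table.SelectRows(tbl_Production, each [Batch] = "abc")[Mat_Batch],
    AllBatches = fnResolve(StartBatches)
in
    AllBatches

For batch "abc" this returns 456, 567 and 234, matching the example above.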
Here is a query (Result) you can use (if I understood correctly):
let
    Quelle = Table.NestedJoin(tbl_Material, {"Mat_Batch"}, tbl_Production, {"Mat_Batch"}, "tbl_Production", JoinKind.LeftOuter),
    Combine_Sources = Table.ExpandTableColumn(Quelle, "tbl_Production", {"Batch"}, {"Batch"}),
    DeleteOtherColumns = Table.SelectColumns(Combine_Sources, {"Batch", "Mat_Batch", "Recyc_Batch"}),
    UnpivotOtherColumns = Table.UnpivotOtherColumns(DeleteOtherColumns, {"Batch"}, "Attribut", "Wert"),
    FilterRows = Table.SelectRows(UnpivotOtherColumns, each ([Batch] <> null)),
    SortRows = Table.Sort(FilterRows, {{"Batch", Order.Ascending}})
in
    SortRows
The result is a table with one row per Batch for each of its Mat_Batch and (where present) Recyc_Batch values, sorted by Batch.
Best regards, Chris
I would like to sort column "time" within each "id" group.
The data looks like:
id time name
132 12 Lucy
132 10 John
132 15 Sam
78 11 Kate
78 7 Julia
78 2 Vivien
245 22 Tom
I would like to get this:
id time name
132 10 John
132 12 Lucy
132 15 Sam
78 2 Vivien
78 7 Julia
78 11 Kate
245 22 Tom
I tried
df.orderBy(['id', 'time'])
But I don't need to sort "id".
I have two questions:
Can I sort "time" within each "id" group, and how?
Will it be more efficient to sort only "time" rather than using orderBy() on both columns?
This is exactly what windowing is for.
You can create a window partitioned by the "id" column and sorted by the "time" column. Next you can apply any function on that window.
# Create a Window
from pyspark.sql.window import Window
w = Window.partitionBy(df.id).orderBy(df.time)
Now use this window over any function:
For example, let's say you want to create a column with the time delta between consecutive rows within the same group:
import pyspark.sql.functions as f
df = df.withColumn("timeDelta", df.time - f.lag(df.time,1).over(w))
I hope this gives you an idea. Effectively you have sorted your dataframe using the window and can now apply any function to it.
If you just want to view your result, you could find the row number and sort by that as well.
df.withColumn("order", f.row_number().over(w)).sort("order").show()
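For reference, here is a minimal end-to-end sketch of the above (assuming a local SparkSession and the sample data from the question):

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(132, 12, "Lucy"), (132, 10, "John"), (132, 15, "Sam"),
     (78, 11, "Kate"), (78, 7, "Julia"), (78, 2, "Vivien"), (245, 22, "Tom")],
    ["id", "time", "name"],
)

# Window partitioned by id and ordered by time within each id
w = Window.partitionBy("id").orderBy("time")

# Time delta between consecutive rows of the same id
df = df.withColumn("timeDelta", df.time - f.lag(df.time, 1).over(w))

# Per-group row number over the same window, as suggested above
df.withColumn("order", f.row_number().over(w)).sort("order").show()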
I have a text file consisting of data that is separated by tab-delimited columns. There are many ways to read data in from the file into Python, but I am specifically trying to use a method similar to the one outlined below. When using a context manager like with open(...) as ..., I've seen that the general concept is to have all of the subsequent code indented within the with statement. Yet when defining a function, the return statement is usually placed at the same indentation as the first line of code within the function (excluding cases with awkward if-else blocks). In this case, both approaches work. Is one method considered correct or generally preferred over the other?
import numpy as np

def read_in(fpath, contents=[], row_limit=np.inf):
    """
    fpath is file location + filename + '.txt'
    contents is the initial data that the file data will be appended to
    row_limit is the maximum number of rows to be read (in case one would like to not read in every row).
    """
    nrows = 0
    with open(fpath, 'r') as f:
        for row in f:
            if nrows < row_limit:
                contents.append(row.split())
                nrows += 1
            else:
                break
        # return contents      # option A: return inside the with block
    return contents            # option B: return at function level, outside the with block
Below is a snippet of the text-file I am using for this example.
1996 02 08 05 17 49 263 70 184 247 126 0 -6.0 1.6e+14 2.7e+28 249
1996 02 12 05 47 26 91 53 160 100 211 236 2.0 1.3e+15 1.6e+29 92
1996 02 17 02 06 31 279 73 317 257 378 532 9.9 3.3e+14 1.6e+29 274
1996 02 17 05 18 59 86 36 171 64 279 819 27.9 NaN NaN 88
1996 02 19 05 15 48 98 30 266 129 403 946 36.7 NaN NaN 94
1996 03 02 04 11 53 88 36 108 95 120 177 1.0 1.5e+14 8.7e+27 86
1996 03 03 04 12 30 99 26 186 141 232 215 2.3 1.6e+14 2.8e+28 99
And below is a sample call.
fpath = "/Users/.../sample_data.txt"
data_in = read_in(fpath)
for i in range(len(data_in)):
    print(data_in[i])
(I realize that it's better to use chunks of pre-defined sizes to read in data, but the number of characters per row of data varies. So I'm instead trying to give user control over the number of rows read in; one could read in a subset of the rows at a time and append them into contents, continually passing them into read_in - possibly in a loop - if the file size is large enough. That said, I'd love to know if I'm wrong about this approach as well, though this isn't my main question.)
If your function needs to do some other things after it is done with the file, you usually do them outside the with block. So essentially you need to return outside the with block too.
However, if the purpose of your function is just to read in a file, you can return within the with block or outside it. I believe neither method is preferred in this case.
I don't really understand your second question.
You can also put the return inside the with block.
When the block is exited, the cleanup is done no matter which exit path was taken; that is the power of with: you don't need to check every possible exit path. Note that the exit handler is also called when an exception is raised inside the block.
But if the file is empty (as an example), you should still return something. In that case your code is clear and follows the principle of a single exit path. However, if you need to handle reaching the end of the file without finding something important, I would put the normal return inside the with block and handle the special case after it.
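A minimal sketch of that last pattern, using a hypothetical helper (not the read_in function above):

def first_nonempty_row(fpath):
    """Return the fields of the first non-blank row, or [] if there is none."""
    with open(fpath, 'r') as f:
        for row in f:
            if row.strip():
                return row.split()  # normal exit inside `with`; the file still gets closed
    return []                       # special case (nothing found / empty file) handled after the block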
I have two data sets.
Week IN US FR UK MX
1 200 550 0 1 0
2 70 257 309 33 0
3 49 306 293 49 8
4 77 308 408 53 65
5 117 341 343 59 81
.....
Week IN US FR UK MX
1 0 0 0 0 0
2 36 129 194 24 0
3 51 322 287 57 0
4 75 292 373 50 56
5 80 249 296 56 76
....
Against each week, I have the number of orders requested in the first table and the number of orders delivered in the second table. I want a pivot chart which shows both.
If you are using Excel 2010 or newer you could use PowerQuery, a free add-in provided by Microsoft, to unpivot and combine your two datasets. I believe in Excel 2016 PowerQuery is already included as Get & Transform.
Create a new query with PowerQuery and open the Advanced Editor. Remove any text already appearing in the editor and use the below code to get an unpivoted and combined table.
let
    Requested = Table.AddColumn(Table.UnpivotOtherColumns(Excel.CurrentWorkbook(){[Name="Requested"]}[Content], {"Week"}, "Country", "Value"), "Type", each "Requested"),
    Delivered = Table.AddColumn(Table.UnpivotOtherColumns(Excel.CurrentWorkbook(){[Name="Delivered"]}[Content], {"Week"}, "Country", "Value"), "Type", each "Delivered"),
    Combined = Table.Combine({Requested, Delivered})
in
    Combined
The first two lines after let are getting the data from your tables (assuming your two datasets are in tables named Requested and Delivered), using Table.UnpivotOtherColumns to unpivot them and then adding a column Type to indicate if the line is for a request or a delivery.
Table.Combine simply appends one table to the other (putting all the lines from the deliveries below the requests).
Close the Advanced Editor and click Close & Load in the query editor to add the query results to an Excel sheet. Using the resulting table you can easily create a pivot table that shows the combined data.
Since the query is still connected to the source tables, anytime your data changes / gets updated you can refresh the query (similar to a pivot table) to get the new data.
I have a question very similar to a previous post:
Merging two files by a single column in unix
but I want to merge my data based on two columns (the orders are the same, so no need to sort).
Example,
subjectid subID2 name age
12 121 Jane 16
24 241 Kristen 90
15 151 Clarke 78
23 231 Joann 31
subjectid subID2 prob_disease
12 121 0.009
24 241 0.738
15 151 0.392
23 231 1.2E-5
And the output to look like
subjectid subID2 prob_disease name age
12 121 0.009 Jane 16
24 241 0.738 Kristen 90
15 151 0.392 Clarke 78
23 231 1.2E-5 Joann 31
When I use join, it only considers the first column (subjectid) and repeats the subID2 column.
Is there a way of doing this with join or some other way please? Thank you
The join command doesn't have an option to use more than one field as the joining criterion. Hence, you will have to add some intelligence into the mix. Assuming your files have a FIXED number of fields on each line, you can use something like this:
join f1 f2 | awk '{print $1" "$2" "$3" "$4" "$6}'
provided that the field counts are as given in your examples. Otherwise, you need to adjust the scope of print in the awk command by adding or removing some fields.
If the orders are identical, you could still merge by a single column and specify the format of which columns to output, like:
join -o '1.1 1.2 2.3 1.3 1.4' file_a file_b
as described in join(1).
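If the duplicated key column bothers you, another common sketch (assuming the same file names file_a and file_b as above, with file_a holding the names/ages and file_b the probabilities) is to let awk match on a composite key built from the first two columns; no sorting is needed and the original row order is preserved:

awk 'NR==FNR {prob[$1,$2]=$3; next} ($1,$2) in prob {print $1, $2, prob[$1,$2], $3, $4}' file_b file_a

This reads file_b first, stores prob_disease under the (subjectid, subID2) key, and then prints each matching line of file_a with the probability inserted after the two key columns; the header lines line up the same way.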
I have a dataset like this
Obs MinNo EurNo MinLav EurLav
1 103 15.9 92 21.9
2 68 18.5 126 18.5
3 79 15.9 114 22.3
My goal is to create a data set like this from the dataset above:
Obs Min Eur Lav
1 103 15.9 No
2 92 21.9 Yes
3 68 18.5 No
4 126 18.5 Yes
5 79 15.9 No
6 114 22.3 Yes
Basically I'm taking the 4 columns and appending them into 2 columns, plus a categorical variable indicating which pair of columns each value came from.
Here's what I have so far
PROC IMPORT DATAFILE='f:\data\order_effect.xls' DBMS=XLS OUT=orderEffect;
RUN;
DATA temp;
    INFILE orderEffect;
    INPUT minutes euros @@;
    IF MOD(_N_,2)^=0 THEN lav='Yes';
    ELSE lav='No';
RUN;
My question, though, is how can I import an Excel sheet and then modify the SAS dataset it creates, so I can stack the second two columns below the first two and add a third column based on which columns each value came from?
I know how to do this by splitting the dataset into two datasets and then appending one onto the other, but using the MOD approach above would be a lot faster.
You were very close, but you're misunderstanding what PROC IMPORT does.
When PROC IMPORT completes, it will have created a SAS data set named orderEffect containing SAS variables from the columns in your worksheet. You just need a little data step program to give the result you want. Try this:
data want;
    /* Define the SAS variables you want to keep */
    format Min 8. Eur 8.1;
    length Lav $3;
    keep Min Eur Lav;
    set orderEffect;
    Min = MinNo;
    Eur = EurNo;
    Lav = 'No';
    output;
    Min = MinLav;
    Eur = EurLav;
    Lav = 'Yes';
    output;
run;
This assumes that the PROC IMPORT step created a data set with those names. Run that step first to be sure and revise the program if necessary.