Find duplicate content and consolidate in groovy

I want to edit a file automatically with Groovy.
For example, I have the following text file:
(The first line is only for your understanding.)
branch ID item ID - - weight piece --- --- ---
178568305 108350 0 0 0 -1 215 215 012
178568305 102190 0 0 0 -1 74 74 012
178568305 102120 0 0 0 -8 35 35 012
178568305 102190 0 0 0 -6 74 74 012
178568305 102190 0 0 0 -6 74 74 012
178587626 108280 0 0 0 -3 189 189 012
178587626 159550 0 0 0 -1 499 499 012
178587626 107740 0 0 0 -4 229 229 012
178587626 105330 0 0 -10 0 626 626 012
178587626 102190 0 0 0 -6 74 74 012
Column 1 holds the branch ID.
Column 2 holds the item ID.
Column 5 holds a weight, for example in grams.
Column 6 holds the number of pieces.
Columns 3, 4, 7, 8 and 9 are not important.
Branch ID:
Lines 1 to 5 and lines 6 to 10 have two different branch IDs.
Item ID:
Lines 2, 4 and 5 have the same item ID for the same branch ID.
What I want is to consolidate item ID 102190 with branch ID 178568305 into one line.
The piece and weight values have to be summed in that one line.
But attention: line 10 has the same item ID as lines 2, 4 and 5, but a different branch ID. Its weight/piece must not be consolidated with lines 2, 4 and 5!
For example:
branch ID item ID - - weight piece --- --- ---
178568305 108350 0 0 0 -1 215 215 012
178568305 102120 0 0 0 -8 35 35 012
178568305 102190 0 0 0 -13 74 74 012
178587626 108280 0 0 0 -3 189 189 012
178587626 159550 0 0 0 -1 499 499 012
178587626 107740 0 0 0 -4 229 229 012
178587626 105330 0 0 -10 0 626 626 012
178587626 102190 0 0 0 -6 74 74 012
My input text file is separated only by spaces. My output text file must have exactly the same format.
Unfortunately, I have no idea how to do this in a Groovy script.
Can anyone help? Tutorials are helpful too. I have no idea what the logical flow of the script should be.

You just need to group the data by branch ID and item ID and sum only the weight and piece columns. Here is working code:
def input = new File("input.txt")
def output = new File("output.txt")
// A Groovy map literal is a LinkedHashMap, so rows keep their first-seen order
Map<String, String[]> result = [:]
input.eachLine { currentLine ->
    if (!currentLine.trim()) return   // skip blank lines
    String[] row = currentLine.trim().split(" +")
    // Group key: branch ID (column 1) plus item ID (column 2)
    String rowId = row[0] + "_" + row[1]
    String[] existing = result.get(rowId)
    if (existing == null) {
        // Header line or first occurrence of this branch/item pair
        result.put(rowId, row)
    } else {
        // Sum only weight (column 5) and piece (column 6); other columns stay as first seen
        for (int i in [4, 5]) {
            existing[i] = String.valueOf(existing[i].toInteger() + row[i].toInteger())
        }
    }
}
// Write the rows back separated by single spaces, like the input
output.withPrintWriter { printWriter ->
    result.each { key, value ->
        printWriter.println(value.join(" "))
    }
}
Sample Input:
branch ID item ID - - weight piece --- --- ---
178568305 108350 0 0 0 -1 215 215 012
178568305 102190 0 0 0 -1 74 74 012
178568305 102120 0 0 0 -8 35 35 012
178568305 102190 0 0 0 -6 74 74 012
178568305 102190 0 0 0 -6 74 74 012
178587626 108280 0 0 0 -3 189 189 012
178587626 159550 0 0 0 -1 499 499 012
178587626 107740 0 0 0 -4 229 229 012
178587626 105330 0 0 -10 0 626 626 012
178587626 102190 0 0 0 -6 74 74 012
Output:
branch ID item ID - - weight piece --- --- ---
178568305 108350 0 0 0 -1 215 215 012
178568305 102190 0 0 0 -13 74 74 012
178568305 102120 0 0 0 -8 35 35 012
178587626 108280 0 0 0 -3 189 189 012
178587626 159550 0 0 0 -1 499 499 012
178587626 107740 0 0 0 -4 229 229 012
178587626 105330 0 0 -10 0 626 626 012
178587626 102190 0 0 0 -6 74 74 012
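For comparison, the same group-and-sum idea can be sketched in Python. This is only an illustration of the logic, not part of the Groovy answer: the function name `consolidate` and the `sum_columns` parameter are made up for this example, and it assumes the weight and piece values are plain integers in columns 5 and 6.

```python
from collections import OrderedDict


def consolidate(lines, sum_columns=(4, 5)):
    """Merge rows sharing the same (branch ID, item ID), summing selected columns."""
    merged = OrderedDict()  # keeps rows in first-seen order
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        key = (fields[0], fields[1])  # branch ID + item ID
        if key not in merged:
            merged[key] = fields
            continue
        kept = merged[key]
        for i in sum_columns:  # zero-based indexes of weight and piece
            kept[i] = str(int(kept[i]) + int(fields[i]))
    return [" ".join(fields) for fields in merged.values()]


rows = [
    "178568305 108350 0 0 0 -1 215 215 012",
    "178568305 102190 0 0 0 -1 74 74 012",
    "178568305 102120 0 0 0 -8 35 35 012",
    "178568305 102190 0 0 0 -6 74 74 012",
    "178568305 102190 0 0 0 -6 74 74 012",
    "178587626 102190 0 0 0 -6 74 74 012",
]
for row in consolidate(rows):
    print(row)
```

Because the key combines both IDs, item 102190 under branch 178587626 stays on its own line, and untouched columns such as `012` keep their leading zero because they are never converted to numbers.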

Related

Plot n-row data as column gnuplot

I have data formatted in 10 columns as follows:
# col1 col2 col3 col4 col5 col6 col7 col8 col9 col10
1 2 3 4 5 6 7 8 9 10
11 12 13 14 15 16 17 18 19 20
I want to plot all the data as a single column, that is, the 11th data point will be 11, and so on. How can I do this directly in gnuplot?
An example of the data can be obtained here: data

This is a somewhat special data format. You could rearrange it with external tools, but you can also rearrange it with gnuplot only.
For this, you need to have the data in a datablock. For how to get it from a file into a datablock, see the answer to this question: gnuplot: load datafile 1:1 into datablock
Code:
### plot special dataformat
reset session
$Data <<EOD
#SP# 12 10511.100
265 7 2 5 2 10 6 10 4 4
8 8 4 8 8 7 17 16 12 17
9 23 18 16 18 26 18 31 31 38
35 58 48 95 107 156 161 199 282 398
448 704 851 1127 1399 1807 2272 2724 3376 4077
4903 6458 7158 9045 9279 12018 13765 14212 17397 19166
21159 23650 25537 28003 29645 35385 34328 36021 42720 39998
45825 48111 49548 46591 53471 53888 56166 61747 57867 59226
59888 65953 61544 68233 68770 69336 63925 69660 69781 70590
76419 70791 70411 75909 70082 76136 69906 75069 75168 74690
73897 73656 73134 77603 70795 77603 68092 74208 73385 66906
71924 70866 74408 67869 67703 70924 65004 68566 62694 65917
64636 62988 62372 64923 59231 58266 60636 59191 54090 56428
55222 53519 52724 53973 49649 51418 46858 48289 46800 45395
44235 43087 40999 42777 39129 40020 37985 37019 35739 34925
33344 33968 30874 31292 30141 29528 27956 27001 25712 25842
23857 23752 22900 21926 20853 19897 19063 18997 18345 16499
16631 15810 15793 14158 13609 13429 13022 12276 11579 10810
10930 9743 9601 8939 8762 8338 7723 7470 6815 6774
6342 6056 5939 5386 5264 4889 4600 4380 4151 3982
3579 3557 3335 3220 3030 2763 2769 2516 2409 2329
2310 2153 2122 1948 1813 1879 1671 1666 1622 1531
1584 1455 1430 1409 1345 1291 1300 1284 1373 1261
1189 1373 1258 1220 1134 1261 1213 1116 1288 1087
1113 1137 1182 1087 1213 1061 1132 1211 1004 1081
1130 1144 1208 1089 1114 1088 1116 1188 1137 1150
1216 1101 1092 1148 1115 1161 1262 1157 1206 1183
1177 1274 1203 1150 1161 1206 1215 1166 1248 1217
1212 1250 1239 1292 1226 1262 1209 1329 1178 1383
1219 1175 1265 1264 1361 1206 1266 1285 1189 1284
1330 1223 1325 1338 1250 1322 1256 1252 1353 1269
1278 1281 1349 1256 1326 1309 1262 1374 1303 1293
1350 1297 1262 1144 1305 1224 1259 1292 1447 1187
1342 1267 1197 1327 1189 1248 1250 1198 1290 1299
1233 1173 1327 1206 1231 1205 1182 1232 1233 1158
1193 1137 1180 1211 1196 1176 1096 1131 1086 1134
1125 1122 1090 1145 1053 1067 1097 1003 1044 993
1056 1006 915 959 923 943 1026 930 927 929
914 849 920 818 808 888 877 808 848 867
735 785 769 738 744 716 708 677 660 657
589 626 649 581 578 597 580 539 495 541
528 402 457 435 425 417 415 408 366 375
322 341 292 286 272 313 263 255 246 207
213 176 195 180 181 168 153 140 114 130
106 100 97 92 71 71 72 59 57 49
43 42 35 38 36 26 33 29 29 14
22 19 11 11 14 14 6 6 9 4
7 5 2 5 1 3 0 0 0 2
0 1 3 0 2 0 0 0 0 1
1 0 3 1 0 1 2 1 2 0
0 3 0 0 1 0 0 1 2 0
1 0 0 0 0 0 0 1 2 0
2 3 1 0 0 3 1 0 1 0
0 1 0 1 1 0 0 1 0 0
0 1 0 0 1 2 1 2 0 1
1 1 0 0 0 0 0 0 0 2
1 1 0 0 0 1 0 1 0 1
1 1 1 1 1 0 0 3 0 2
1 1 1 0 1 0 1 0 0 2
1 1 0 1 0 0 0 1 0 1
0 0 0 0 2 3 1 2 0 0
1 2 0 1 2 1 1 1 1 1
1 0 1 0 0 2 1 2 2 1
0 0 1 1 1 0 1 1 1 0
0 2 1 1 1 0 0 0 1 1
0 2 1 1 2 0 2 1 1 1
1 1 0 0 0 2 0 0 1 0
1 0 1 1 2 2 0 0 0 3
2 0 0 0 2 0 1 1 0 1
0 0 0 2 1 4 0 1 0 1
2 0 0 0 0 0 1 0 0 2
0 0 0 0 1 0 0 0 0 0
0 1 0 1 0 0 0 0 1 2
0 0 1 0 1 0 0 1 0 1
0 0 2 1 1 0 0 1 0 0
0 0 0 1 0 0 0 0 0 1
0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 1 0 0
0 0 0 1 0 0 0 1 0 0
1 0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 1
0 0 1 2 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0 0
1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0
EOD
set print $Data2
HeaderLines = 1
Line = ''
do for [i=HeaderLines+1:|$Data|] {
Line = Line.sprintf(' %s',$Data[i][1:strlen($Data[i])-1])
}
print Line
set print
WavStart = 450
WavStep = 0.1
myWav(col) = WavStart + column(col)*WavStep
set grid x,y
plot $Data2 u (myWav(1)):0:3 matrix w l lc "red" notitle
### end of code
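If rearranging the data outside gnuplot is acceptable, the same flattening can be sketched in a few lines of Python. This is an alternative to the gnuplot-only approach above, not a replacement for it. The function name `flatten_rows` is hypothetical; the defaults for `start` and `step` mirror `WavStart` and `WavStep` from the script, and the sample is a shortened excerpt of the datablock.

```python
def flatten_rows(text, header_lines=1, start=450.0, step=0.1):
    """Read row-wise values and return (x, y) pairs in reading order."""
    values = []
    for line in text.splitlines()[header_lines:]:
        # Rows may have fewer than 10 entries (like the last row of the datablock)
        values.extend(float(token) for token in line.split())
    # The n-th value (counting from 0) gets x = start + n * step
    return [(start + i * step, value) for i, value in enumerate(values)]


sample = """#SP# 12 10511.100
265 7 2 5 2
8 8 4 8 8
9 23 18"""

points = flatten_rows(sample)
for x, y in points[:3]:
    print(x, y)
```

The resulting pairs could be written to a two-column file and plotted with an ordinary `plot 'file' using 1:2 with lines`.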

Why does Cassandra major compaction fail to clear expired tombstones?

We have deployed a global Apache Cassandra cluster (nodes: 12, RF: 3, version: 3.11.2) in our production environment. We are running into an issue where a major compaction on a column family fails to clear tombstones on one node (out of 3 replicas), even though the SSTable metadata shows that the minimum timestamp is older than the gc_grace_seconds set on the table.
Here is the sstable metadata output:
SSTable: mc-4302-big
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Bloom Filter FP chance: 0.010000
Minimum timestamp: 1
Maximum timestamp: 1560326019515476
SSTable min local deletion time: 1560233203
SSTable max local deletion time: 2147483647
Compressor: org.apache.cassandra.io.compress.LZ4Compressor
Compression ratio: 0.8808303792058351
TTL min: 0
TTL max: 0
First token: -9201661616334346390 (key=bca773eb-ecbb-49ec-9330-cc16da310b58:::)
Last token: 9117719078924671254 (key=7c23b975-5354-4c82-82e5-1762bac75a8d:::)
minClustringValues: [00000f8f-74a9-4ce3-9d87-0a4dabef30c1]
maxClustringValues: [ffffc966-a02c-4e1f-bdd1-256556624288]
Estimated droppable tombstones: 46.31761624099541
SSTable Level: 0
Repaired at: 0
Replay positions covered: {}
totalColumnsSet: 0
totalRows: 618382
Estimated tombstone drop times:
1560233680: 353
1560234658: 237
1560235604: 176
1560236803: 471
1560237652: 402
1560238342: 195
1560239166: 373
1560239969: 356
1560240586: 262
1560241207: 247
1560242037: 387
1560242847: 357
1560243742: 280
1560244469: 283
1560245095: 353
1560245957: 357
1560246773: 362
1560247956: 449
1560249034: 217
1560249849: 310
1560251080: 296
1560251984: 304
1560252993: 239
1560253907: 407
1560254839: 977
1560255761: 671
1560256486: 317
1560257199: 679
1560258020: 703
1560258795: 507
1560259378: 298
1560260093: 2302
1560260869: 2488
1560261535: 2818
1560262176: 2842
1560262981: 1685
1560263708: 1830
1560264308: 808
1560264941: 1990
1560265753: 1340
1560266708: 2174
1560267629: 2253
1560268400: 1627
1560269174: 2347
1560270019: 2579
1560270888: 3947
1560271690: 1727
1560272446: 2573
1560273249: 1523
1560274086: 3438
1560275149: 2737
1560275966: 3487
1560276814: 4101
1560277660: 2012
1560278617: 1198
1560279680: 769
1560280441: 1337
1560281033: 608
1560281876: 2065
1560282546: 2926
1560283128: 6305
1560283836: 824
1560284574: 71
1560285166: 140
1560285828: 118
1560286404: 83
1560295835: 72
1560296951: 456
1560297814: 670
1560298496: 271
1560299333: 473
1560300159: 284
1560300831: 127
1560301551: 536
1560302309: 425
1560303302: 860
1560304064: 465
1560304782: 319
1560305657: 323
1560306552: 236
1560307454: 368
1560308409: 320
1560309178: 210
1560310091: 177
1560310881: 85
1560311970: 147
1560312706: 76
1560313495: 88
1560314847: 687
1560315817: 1618
1560316544: 1245
1560317423: 5361
1560318491: 2060
1560319595: 5853
1560320587: 5390
1560321473: 3868
1560322644: 5784
1560323703: 6861
1560324838: 7200
1560325744: 5642
Count Row Size Cell Count
1 0 3054
2 0 0
3 0 0
4 0 0
5 0 0
6 0 0
7 0 0
8 0 0
10 0 0
12 0 0
14 0 0
17 0 0
20 0 0
24 0 0
29 0 0
35 0 0
42 0 0
50 0 0
60 98 0
72 49 0
86 46 0
103 2374 0
124 39 0
149 36 0
179 43 0
215 18 0
258 26 0
310 24 0
372 18 0
446 16 0
535 19 0
642 27 0
770 17 0
924 12 0
1109 14 0
1331 23 0
1597 20 0
1916 12 0
2299 11 0
2759 11 0
3311 11 0
3973 12 0
4768 5 0
5722 8 0
6866 5 0
8239 5 0
9887 6 0
11864 5 0
14237 10 0
17084 1 0
20501 8 0
24601 2 0
29521 2 0
35425 3 0
42510 2 0
51012 2 0
61214 1 0
73457 2 0
88148 3 0
105778 0 0
126934 3 0
152321 2 0
182785 1 0
219342 0 0
263210 0 0
315852 0 0
379022 0 0
454826 0 0
545791 0 0
654949 0 0
785939 0 0
943127 0 0
1131752 0 0
1358102 0 0
1629722 0 0
1955666 0 0
2346799 0 0
2816159 0 0
3379391 1 0
4055269 0 0
4866323 0 0
5839588 0 0
7007506 0 0
8409007 0 0
10090808 1 0
12108970 0 0
14530764 0 0
17436917 0 0
20924300 0 0
25109160 0 0
30130992 0 0
36157190 0 0
43388628 0 0
52066354 0 0
62479625 0 0
74975550 0 0
89970660 0 0
107964792 0 0
129557750 0 0
155469300 0 0
186563160 0 0
223875792 0 0
268650950 0 0
322381140 0 0
386857368 0 0
464228842 0 0
557074610 0 0
668489532 0 0
802187438 0 0
962624926 0 0
1155149911 0 0
1386179893 0 0
1663415872 0 0
1996099046 0 0
2395318855 0 0
2874382626 0
3449259151 0
4139110981 0
4966933177 0
5960319812 0
7152383774 0
8582860529 0
10299432635 0
12359319162 0
14831182994 0
17797419593 0
21356903512 0
25628284214 0
30753941057 0
36904729268 0
44285675122 0
53142810146 0
63771372175 0
76525646610 0
91830775932 0
110196931118 0
132236317342 0
158683580810 0
190420296972 0
228504356366 0
274205227639 0
329046273167 0
394855527800 0
473826633360 0
568591960032 0
682310352038 0
818772422446 0
982526906935 0
1179032288322 0
1414838745986 0
Estimated cardinality: 3054
EncodingStats minTTL: 0
EncodingStats minLocalDeletionTime: 1560233203
EncodingStats minTimestamp: 1
KeyType: org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type)
ClusteringTypes: [org.apache.cassandra.db.marshal.UUIDType]
StaticColumns: {}
RegularColumns: {}
So far, here is what we have tried:
1) major compaction with a lower gc_grace_seconds
2) nodetool garbagecollect
3) nodetool scrub
None of the above methods helps. Again, this is only happening on one node (out of 3 replicas in total).

The tombstone markers generated during your major compaction are just that, markers. The data has been removed but a delete marker is left in place so that the other replicas can have gc_grace_seconds to process them too. The tombstone markers are fully dropped the next time the SSTable is compacted. Unfortunately because you've run a major compaction (rarely ever recommended) it may be a long time until there are suitable SSTables for compaction with it to clean up the tombstones. Remember that the tombstone drop will also only happen after local_delete_time + gc_grace_seconds as defined by the table.
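The timing rule in the last sentence can be checked with a little arithmetic. The sketch below is illustrative only: the helper names are made up, and `GC_GRACE_SECONDS` is assumed to be Cassandra's default of 864000 seconds (10 days), since the question does not state the table's actual setting. The deletion time is the "SSTable min local deletion time" from the metadata in the question.

```python
from datetime import datetime, timezone


def droppable_at(local_deletion_time, gc_grace_seconds):
    """Earliest epoch second at which a tombstone becomes eligible to be dropped."""
    return local_deletion_time + gc_grace_seconds


def is_droppable(local_deletion_time, gc_grace_seconds, now):
    """True once the grace period after the deletion has fully elapsed."""
    return now >= droppable_at(local_deletion_time, gc_grace_seconds)


GC_GRACE_SECONDS = 864000        # assumption: default value, 10 days
min_local_deletion = 1560233203  # "SSTable min local deletion time" from the question

eligible = droppable_at(min_local_deletion, GC_GRACE_SECONDS)
print(datetime.fromtimestamp(eligible, tz=timezone.utc).isoformat())
```

Passing this time check is a necessary condition rather than a sufficient one: the tombstone still has to be rewritten by a compaction that is allowed to drop it.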
If you're interested in learning more about how tombstones and compaction work together in the context of delete operations I suggest reading the following articles:
https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/dml/dmlAboutDeletes.html
https://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html

Exporting a pandas DataFrame into excel based on cell values

I have a DataFrame as below (the result of the pivot_table() method):
Location Loc 2 Loc 3 Loc 5 Loc 8 Loc 9
Item
1 404 317 272 113 449
3 1,205 870 846 371 1,632
5 208 218 128 31 268
7 107 54 57 17 179
9 387 564 245 83 571
10 364 280 115 34 252
16 104 80 72 22 143
17 111 85 44 10 209
18 124 182 67 27 256
19 380 465 219 103 596
If you take a closer look, there are missing locations (e.g. Loc 1, Loc 4, etc.) and missing items (e.g. 2, 4, 8, etc.).
I want to export this to my predefined Excel template, which has all the locations and items, and fill the table based on items and values.
I know I can export the DataFrame to a different Excel sheet and use SUMIFS() or INDEX(), MATCH() formulas, but I want to do this directly from Python/pandas to Excel.
Below should be the result after exporting:
Loc 1 Loc 2 Loc 3 Loc 4 Loc 5 Loc 6 Loc 7 Loc 8 Loc 9
1 0 404 317 0 272 0 0 113 449
2 0 0 0 0 0 0 0 0 0
3 0 1205 870 0 846 0 0 371 1632
4 0 0 0 0 0 0 0 0 0
5 0 208 218 0 128 0 0 31 268
6 0 0 0 0 0 0 0 0 0
7 0 107 54 0 57 0 0 17 179
8 0 0 0 0 0 0 0 0 0
9 0 387 564 0 245 0 0 83 571
10 0 364 280 0 115 0 0 34 252
11 0 0 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0 0 0
13 0 0 0 0 0 0 0 0 0
14 0 0 0 0 0 0 0 0 0
15 0 0 0 0 0 0 0 0 0
16 0 104 80 0 72 0 0 22 143
17 0 111 85 0 44 0 0 10 209
18 0 124 182 0 67 0 0 27 256
19 0 380 465 0 219 0 0 103 596
20 0 0 0 0 0 0 0 0 0

Use DataFrame.reindex with the new index and column values given as arrays or lists:
import numpy as np
idx = np.arange(1, 21)
cols = [f'Loc {x}' for x in np.arange(1, 10)]
df = df.reindex(index=idx, columns=cols, fill_value=0)
print(df)
Loc 1 Loc 2 Loc 3 Loc 4 Loc 5 Loc 6 Loc 7 Loc 8 Loc 9
1 0 404 317 0 272 0 0 113 449
2 0 0 0 0 0 0 0 0 0
3 0 1,205 870 0 846 0 0 371 1,632
4 0 0 0 0 0 0 0 0 0
5 0 208 218 0 128 0 0 31 268
6 0 0 0 0 0 0 0 0 0
7 0 107 54 0 57 0 0 17 179
8 0 0 0 0 0 0 0 0 0
9 0 387 564 0 245 0 0 83 571
10 0 364 280 0 115 0 0 34 252
11 0 0 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0 0 0
13 0 0 0 0 0 0 0 0 0
14 0 0 0 0 0 0 0 0 0
15 0 0 0 0 0 0 0 0 0
16 0 104 80 0 72 0 0 22 143
17 0 111 85 0 44 0 0 10 209
18 0 124 182 0 67 0 0 27 256
19 0 380 465 0 219 0 0 103 596
20 0 0 0 0 0 0 0 0 0

Benchmarking CPU and File IO for an application running on Linux

I wrote two programs to run on Linux, each using a different algorithm, and I want to find a way (preferably using benchmarking software) to compare the CPU usage and I/O operations of these two programs.
Is there such a thing? If yes, where can I find it? Thanks.

You can try hardinfo.
There are also many other tools that measure system performance, if measuring it while your application runs serves your purpose.
You can also check this thread.

You might try the vmstat command:
vmstat 2 20 > vmstat.txt
This takes 20 samples at 2-second intervals.
bi = KB in, bo = KB out, with wa = waiting for I/O
I/O can also increase cache demands
%CPU utilisation = us (user) + sy (system)
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 0 277504 17060 82732 0 0 91 87 1432 236 11 3 84 1
0 0 0 277372 17068 82732 0 0 0 24 1361 399 23 8 59 10
test start
0 1 0 275240 17068 82732 0 0 0 512 1342 305 24 4 69 4
2 1 0 275232 17068 82780 0 0 24 10752 4176 216 7 8 0 85
1 1 0 275240 17076 82732 0 0 12288 2590 5295 243 15 8 0 77
0 1 0 275240 17076 82748 0 0 8 11264 4329 214 6 12 0 82
0 1 0 275240 17076 82780 0 0 16 11264 4278 233 15 10 0 75
0 1 0 275240 17084 82780 0 0 19456 542 6563 255 10 7 0 83
0 1 0 275108 17084 82748 0 0 5128 3072 3501 265 16 37 0 47
3 1 0 275108 17084 82748 0 0 924 5120 8369 3845 12 33 0 55
0 1 0 275116 17092 82748 0 0 1576 85 11483 6645 5 50 0 45
1 1 0 275116 17092 82748 0 0 0 136 2304 689 3 9 0 88
2 1 0 275084 17100 82732 0 0 0 352 2374 800 14 26 0 61
0 0 0 275076 17100 82732 0 0 546 118 2408 1014 35 17 47 1
0 1 0 275076 17104 82732 0 0 0 62 1324 76 3 2 89 7
1 1 0 275076 17108 82732 0 0 0 452 1879 442 8 13 66 12
0 0 0 275116 17108 82732 0 0 800 352 2456 1195 19 17 56 8
0 1 0 275116 17112 82732 0 0 0 54 1325 76 4 1 88 8
test end
1 1 0 275116 17116 82732 0 0 0 510 1717 286 6 10 72 11
1 0 0 275076 17116 82732 0 0 1600 1152 3087 1344 23 29 41 7
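To compare two programs, it helps to reduce a saved vmstat log to a few averages per run. The following is a minimal sketch, assuming the log looks like the output above (a column-name row starting with `r`, numeric sample rows, and free-text marker lines such as "test start"). The function names `parse_vmstat` and `average` are made up for this example.

```python
def parse_vmstat(text):
    """Return one dict per sample line, keyed by the vmstat column names."""
    samples = []
    columns = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "r":  # the row with the column names
            columns = fields
        elif (columns and len(fields) == len(columns)
              and all(f.isdigit() for f in fields)):
            samples.append(dict(zip(columns, map(int, fields))))
        # anything else (banner row, "test start" markers) is ignored
    return samples


def average(samples, column):
    """Mean of one column over all samples."""
    return sum(s[column] for s in samples) / len(samples)


report = """procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r b swpd free buff cache si so bi bo in cs us sy id wa
 0 0 0 277504 17060 82732 0 0 91 87 1432 236 11 3 84 1
 0 0 0 277372 17068 82732 0 0 0 24 1361 399 23 8 59 10
test start
 0 1 0 275240 17068 82732 0 0 0 512 1342 305 24 4 69 4
 2 1 0 275232 17068 82780 0 0 24 10752 4176 216 7 8 0 85
"""

samples = parse_vmstat(report)
busy = average(samples, "us") + average(samples, "sy")
print(f"avg CPU busy: {busy:.1f}%  avg wa: {average(samples, 'wa'):.1f}%")
```

Running each program with its own vmstat log and comparing the averaged `us + sy`, `wa`, `bi` and `bo` values gives a rough side-by-side picture. Note that vmstat reports system-wide figures, so other activity on the machine is included.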

Convert a vector in R to a dataframe by columns

I would like to convert this vector with strings in each row and spaces separating the elements within one string:
> v.input_red
[1] "pm 0 100 2.1 59 70 15.5 14.8 31 984 32 0 56 55 0 0 0 0 0 0 -60 -260 0 0 6 0 0 0 0 0 20 8 2ff 0 249 0 0 "
[2] "pm 0 100 2.1 59 70 15.5 14.8 31 984 32 0 56 55 0 0 0 0 0 0 -60 -260 0 0 6 0 0 0 0 0 20 8 2ff 0 249 0 0 "
[3] "pm 0 100 2.1 59 70 15.5 14.8 31 984 32 0 56 55 0 0 0 0 0 0 -60 -260 0 0 6 0 0 0 0 0 20 8 2ff 0 249 0 0 "
to a dataframe with a column for each element. But I'm not quite sure how to extract the elements from the strings. Best way would be to convert the whole thing at once somehow, I guess..
Wanted result-dataframe (created manually):
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35
1 pm 0 100 2.1 59 70 15.5 14.8 31 984 32 0 56 55 0 0 0 0 0 0 -60 -260 0 0 6 0 0 0 0 0 20 8 2ff 0 249
2 pm 0 100 2.1 59 70 15.5 14.8 31 984 32 0 56 55 0 0 0 0 0 0 -60 -260 0 0 6 0 0 0 0 0 20 8 2ff 0 249
3 pm 0 100 2.1 59 70 15.5 14.8 31 984 32 0 56 55 0 0 0 0 0 0 -60 -260 0 0 6 0 0 0 0 0 20 8 2ff 0 249
Thanks in advance!
Matthias

For quite some time, read.table and family have had a text argument that lets you read directly from character vectors. There's no need to write the object to a file first.
Your sample data...
v.input_red <- c("pm 0 100 2.1 59 70 15.5 14.8 31 984 32 0 56 55 0 0 0 0 0 0 -60 -260 0 0 6 0 0 0 0 0 20 8 2ff 0 249 0 0 ",
"pm 0 100 2.1 59 70 15.5 14.8 31 984 32 0 56 55 0 0 0 0 0 0 -60 -260 0 0 6 0 0 0 0 0 20 8 2ff 0 249 0 0 ",
"pm 0 100 2.1 59 70 15.5 14.8 31 984 32 0 56 55 0 0 0 0 0 0 -60 -260 0 0 6 0 0 0 0 0 20 8 2ff 0 249 0 0 ")
... directly read in:
read.table(text = v.input_red, header = FALSE)
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17
# 1 pm 0 100 2.1 59 70 15.5 14.8 31 984 32 0 56 55 0 0 0
# 2 pm 0 100 2.1 59 70 15.5 14.8 31 984 32 0 56 55 0 0 0
# 3 pm 0 100 2.1 59 70 15.5 14.8 31 984 32 0 56 55 0 0 0
# V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33
# 1 0 0 0 -60 -260 0 0 6 0 0 0 0 0 20 8 2ff
# 2 0 0 0 -60 -260 0 0 6 0 0 0 0 0 20 8 2ff
# 3 0 0 0 -60 -260 0 0 6 0 0 0 0 0 20 8 2ff
# V34 V35 V36 V37
# 1 0 249 0 0
# 2 0 249 0 0
# 3 0 249 0 0

Assuming file is a file name that you save on your system:
writeLines(v.input_red, file)
data <- read.table(file)

Is this solution what you were looking for?
s1 <- "pm 0 100 2.1 59 70 15.5 14.8 31 984 32 0 56 55 0 0 0 0 0 0 -60 -260 0 0 6 0 0 0 0 0 20 8 2ff 0 249 0 0 "
s2 <- "pm 0 100 2.1 59 70 15.5 14.8 31 984 32 0 56 55 0 0 0 0 0 0 -60 -260 0 0 6 0 0 0 0 0 20 8 2ff 0 249 0 0 "
s3 <- "pm 0 100 2.1 59 70 15.5 14.8 31 984 32 0 56 55 0 0 0 0 0 0 -60 -260 0 0 6 0 0 0 0 0 20 8 2ff 0 249 0 0 "
df <- t(data.frame(strsplit(s1, " "),strsplit(s2, " "),strsplit(s3, " ")))
row.names(df) <- c("s1", "s2", "s3")
strsplit splits the string at each space character. Combining the results with data.frame gives you a data frame with 3 columns, so you have to transpose it with t. I changed the row names for better readability.
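The split-then-tabulate idea is not specific to R. As a rough illustration in Python (not R), the sketch below splits each string on whitespace and checks that every row ends up with the same number of columns. The function name `rows_to_table` is hypothetical and the input strings are shortened versions of the ones in the question.

```python
def rows_to_table(rows):
    """Split each string on whitespace and return a list of equal-length rows."""
    table = [row.split() for row in rows]  # split() also drops the trailing space
    widths = {len(fields) for fields in table}
    if len(widths) > 1:
        raise ValueError(f"rows have differing column counts: {sorted(widths)}")
    return table


# Shortened sample strings, each with a trailing space like the original data
v_input_red = [
    "pm 0 100 2.1 59 70 15.5 ",
    "pm 0 100 2.1 59 70 15.5 ",
    "pm 0 100 2.1 59 70 15.5 ",
]

table = rows_to_table(v_input_red)
for fields in table:
    print(fields)
```

All values stay strings here; converting numeric columns is a separate step, and entries such as `2ff` in the original data would have to remain text.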
