Removing First Column from Text File with awk

I have the following text file:
Article_Title 1st_Author Publication_Year Language Citations
Über die Theorie des Stoßes zwischen Atomen und elektrisch geladenen Teilchen Fermi 1924 German 54
Zur Quantelung des idealen einatomigen Gases Fermi 1926 German 333
Eine statistische Methode zur Bestimmung einiger Eigenschaften des Atoms und ihre Anwendung auf die Theorie des periodischen Systems der Elemente Fermi 1928 German 1833
Über die magnetischen Momente der Atomkerne Fermi 1929 German 795
Über das Intensitätsverhältnis der Dublettkomponenten der Alkalien Fermi 1929 German 134
Über den Ramaneffekt des Kohlendioxyds Fermi 1931 German 594
Quantum Theory of Radiation Fermi 1932 English 951
Zur Theorie der Hyperfeinstruktur Fermi 1933 German 280
Possible Production of Elements of Atomic Number Higher than 92 Fermi 1934 English 175
Versuch einer Theorie der β-Strahlen Fermi 1934 German 525
Sopra lo Spostamento per Pressione delle Righe Elevate delle Serie Spettrali Fermi 1934 Italian 901
Tentativo di una Teoria Dei Raggi β Fermi 1934 Italian 475
On the Absorption and the Diffusion of Slow Neutrons Almadi 1936 English 199
The Ionization Loss of Energy in Gases and in Condensed Materials Fermi 1940 English 710
The Capture of Negative Mesotrons in Matter Fermi 1947 English 1156
Interference Phenomena of Slow Neutrons Fermi 1947 English 301
On the Origin of the Cosmic Radiation Fermi 1949 English 3309
Are Mesons Elementary Particles? Fermi 1949 English 498
Angular Distribution of the Pions Produced in High Energy Nuclear Collisions Fermi 1951 English 324
Multiple Production of Pions in Nucleon-Nucleon Collisions at Cosmotron Energies Fermi 1953 English 118
The first column is the scientific article name, the second is the author's last name, the third is the publication year, the fourth is the article language, and the fifth is the number of citations.
I would like to convert it into something like this:
1924 54
1926 333
1928 1833
1929 795
1929 134
1931 594
1932 951
1933 280
1934 175
1934 525
1934 901
1934 475
1936 199
1940 710
1947 1156
1947 301
1949 3309
1949 498
1951 324
1953 118
So, I need to remove the first, the second, and the fourth columns.
The problem is the Article_Title column.
If the article titles were like this:
Quantum_Theory_of_Radiation
I would just need to run the following commands:
sed -i '1,2d' plotting_data.txt # Removing the first and second lines
awk '{$1=$2=$4=""; print $0}' plotting_data.txt > tmp && mv tmp plotting_data.txt # Removing the first, second and fourth columns
The problem is that there are spaces between the words of the article titles, and I don't know how to tell awk or sed to remove that column. Could you help me?
I am using the following awk version:
mawk 1.3.3 Nov 1996, Copyright (C) Michael D. Brennan
compiled limits:
max NF 32767
sprintf buffer 2040
Also, the white space between fields is all blank characters.

Assuming the white space in your sample is all blank chars, this will work using any awk in any shell on any UNIX box:
$ awk 'NR==1{beg=index($0,$2)} NR>2{$0=substr($0,beg); print $2, $4}' file
1924 54
1926 333
1928 1833
1929 795
1929 134
1931 594
1932 951
1933 280
1934 175
1934 525
1934 901
1934 475
1936 199
1940 710
1947 1156
1947 301
1949 3309
1949 498
1951 324
1953 118
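For reference, the same trick can be sketched in Python: take the character offset where the second header field starts, then slice every data line at that offset before splitting. This is a rough, hypothetical equivalent (the function names are mine), and it assumes, as the awk answer does, that the file's columns are aligned with blanks:

```python
def extract_year_citations(lines):
    header = lines[0]
    # awk: NR==1 { beg = index($0, $2) } -- the offset where the 2nd column starts
    beg = header.index(header.split()[1])
    out = []
    for line in lines[2:]:  # awk: NR > 2
        fields = line[beg:].split()  # awk: $0 = substr($0, beg)
        out.append((fields[1], fields[3]))  # awk: print $2, $4 (year, citations)
    return out

def make_line(title, author, year, lang, cites):
    # Build column-aligned sample lines for the demo
    return f"{title:<40}{author:<11}{year:<17}{lang:<9}{cites}"

lines = [
    make_line("Article_Title", "1st_Author", "Publication_Year", "Language", "Citations"),
    "",  # stand-in for the file's second header line (skipped like NR==2)
    make_line("Quantum Theory of Radiation", "Fermi", "1932", "English", "951"),
    make_line("On the Origin of the Cosmic Radiation", "Fermi", "1949", "English", "3309"),
]
print(extract_year_citations(lines))  # [('1932', '951'), ('1949', '3309')]
```

The point is that once the title column's width is known from the header, the titles' internal spaces no longer matter.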

Related

Sum up Specific columns in a Dataframe from sqlite

I'm relatively new to DataFrames in Python and I'm running into an issue I can't track down.
I have a DataFrame with the following column layout.
print(list(df.columns.values)) returns:
['iccid', 'system', 'last_updated', '01.01', '02.01', '03.01', '04.01', '05.01', '12.01', '18.01', '19.01', '20.01', '21.01', '22.01', '23.01', '24.01', '25.01', '26.01', '27.01', '28.01', '29.01', '30.01', '31.01']
Normally I should have a column for each day of a specific month; in the example above it's December 2022. Sometimes days are missing, which isn't an issue.
I first tried to collect the relevant columns by filtering them:
# Filter out the columns that are not related to the data
data_columns = [col for col in df.columns if '.' in col]
Now comes the issue:
Sometimes the "system" column can also be empty, so I need to fall back to the iccid as the system value:
df.loc[df['system'] == 'Nicht benannt!', 'system'] = df.loc[df['system'] == 'Nicht benannt!', 'iccid'].iloc[0]
df.loc[df['system'] == '', 'system'] = df.loc[df['system'] == '', 'iccid'].iloc[0]
grouped = df.groupby('system').sum(numeric_only=False)
Then I tried to create the needed 'data_usage' column:
grouped['data_usage'] = grouped[data_columns[-1]]
grouped.reset_index(inplace=True)
That line should normally give me only the value of the last column in the DataFrame (a workaround that also didn't work as expected).
What I'm actually trying to get is the sum of all columns whose name contains a date, added as a new column named data_usage.
The issue is that systems without an initial system value show a data_usage of 120000 (the values represent megabytes used), while according to the sqlite file that system only used about 9000 MB in that particular month.
For Example:
I have these columns in the sqlite file:
iccid                system          last_updated  06.02  08.02
8931080320014183316  Nicht benannt!  2023-02-06    1196   1391
and in the DataFrame I get the following result:
8931080320014183316 48129.0
I can't find the issue and would be very happy if someone could point me in the right direction.
Here is some example data, as requested:
iccid system last_updated 01.12 02.12 03.12 04.12 05.12 06.12 07.12 08.12 09.12 10.12 11.12 12.12 13.12 14.12 15.12 16.12 17.12 18.12 19.12 20.12 21.12 22.12 23.12 28.12 29.12 30.12 31.12
8945020184547971966 U-O-51 2022-12-01 2 32 179 208 320 509 567 642 675 863 1033 1055 1174 2226 2277 2320 2466 2647 2679 2713 2759 2790 2819 2997 3023 3058 3088
8945020855461807911 L-O-382 2022-12-01 1 26 54 250 385 416 456 481 506 529 679 772 802 832 858 915 940 1019 1117 1141 1169 1193 1217 1419 1439 1461 1483
8945020855461809750 C-O-27 2022-12-01 1 123 158 189 225 251 456 489 768 800 800 800 800 800 800 2362 2386 2847 2925 2960 2997 3089 3116 3448 3469 3543 3586
8931080019070958450 L-O-123 2022-12-02 0 21 76 313 479 594 700 810 874 1181 1955 2447 2527 2640 2897 3008 3215 3412 3554 3639 3698 3782 3850 4741 4825 4925 5087
8931080453114183282 Nicht benannt! 2022-12-02 0 6 45 81 95 98 101 102 102 102 102 102 102 103 121 121 121 121 149 164 193 194 194 194 194 194 194
8931080894314183290 C-O-16 N 2022-12-02 0 43 145 252 386 452 532 862 938 1201 1552 1713 1802 1855 2822 3113 3185 3472 3527 3745 3805 3880 3938 4221 4265 4310 4373
8931080465814183308 L-O-83 2022-12-02 0 61 169 275 333 399 468 858 1094 1239 1605 1700 1928 2029 3031 4186 4333 4365 4628 4782 4842 4975 5265 5954 5954 5954 5954
8931082343214183316 Nicht benannt! 2022-12-02 0 52 182 506 602 719 948 1129 1314 1646 1912 1912 1912 1912 2791 3797 3944 4339 4510 4772 4832 5613 5688 6151 6482 6620 6848
8931087891314183324 L-O-119 2022-12-02 0 19 114 239 453 573 685 800 1247 1341 1341 1341 1341 1341 1341 1341 1341 1341 1341 1341 1341 1341 1423 2722 3563 4132 4385
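The daily columns in the sample look like cumulative counters (each day's value includes the previous days), which would explain the inflated sums. A minimal sketch of a fix under that assumption, with made-up rows mimicking the question's layout:

```python
import pandas as pd

# Made-up sample rows; column names mimic the question's layout.
df = pd.DataFrame({
    "iccid": ["8931080320014183316", "8945020184547971966"],
    "system": ["Nicht benannt!", "U-O-51"],
    "last_updated": ["2023-02-06", "2023-02-01"],
    "06.02": [1196, 100],
    "08.02": [1391, 250],
})

# Fall back to the iccid where the system name is missing -- row-wise,
# not with .iloc[0], which would copy one iccid onto every matching row.
mask = df["system"].isin(["Nicht benannt!", ""])
df.loc[mask, "system"] = df.loc[mask, "iccid"]

date_columns = [col for col in df.columns if "." in col]

# If the daily columns are cumulative counters, the month's usage is the
# largest (i.e. last recorded) value, not the sum over every day.
df["data_usage"] = df[date_columns].max(axis=1)
grouped = df.groupby("system", as_index=False)["data_usage"].sum()
print(grouped)
```

If the columns are instead independent daily amounts, `df[date_columns].sum(axis=1)` would be correct, so the first thing to verify is which of the two the sqlite export actually contains.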

Why doesn't the seaborn plot show a confidence interval?

I am using sns.lineplot to show the confidence intervals in a plot.
sns.lineplot(x = threshold, y = mrl_array, err_style = 'band', ci=95)
plt.show()
I'm getting the following plot, which doesn't show the confidence interval:
What's the problem?
There is probably only a single observation per x value.
If there is only one observation per x value, then there is no confidence interval to plot.
Bootstrapping is performed per x value, but there needs to be more than one observation for it to take effect.
ci: Size of the confidence interval to draw when aggregating with an estimator. 'sd' means to draw the standard deviation of the data. Setting to None will skip bootstrapping. (In seaborn 0.12+, ci has been replaced by the errorbar parameter.)
Note the following examples from seaborn.lineplot.
This is also the case for sns.relplot with kind='line'.
The question specifies sns.lineplot, but this answer applies to any seaborn plot that displays a confidence interval, such as seaborn.barplot.
Data
import seaborn as sns
# load data
flights = sns.load_dataset("flights")
year month passengers
0 1949 Jan 112
1 1949 Feb 118
2 1949 Mar 132
3 1949 Apr 129
4 1949 May 121
# only May flights
may_flights = flights.query("month == 'May'")
year month passengers
4 1949 May 121
16 1950 May 125
28 1951 May 172
40 1952 May 183
52 1953 May 229
64 1954 May 234
76 1955 May 270
88 1956 May 318
100 1957 May 355
112 1958 May 363
124 1959 May 420
136 1960 May 472
# standard deviation for each year of May data
may_flights.set_index('year')[['passengers']].std(axis=1)
year
1949 NaN
1950 NaN
1951 NaN
1952 NaN
1953 NaN
1954 NaN
1955 NaN
1956 NaN
1957 NaN
1958 NaN
1959 NaN
1960 NaN
dtype: float64
# flights in wide format
flights_wide = flights.pivot(index="year", columns="month", values="passengers")
month Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
year
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
1954 204 188 235 227 234 264 302 293 259 229 203 229
1955 242 233 267 269 270 315 364 347 312 274 237 278
1956 284 277 317 313 318 374 413 405 355 306 271 306
1957 315 301 356 348 355 422 465 467 404 347 305 336
1958 340 318 362 348 363 435 491 505 404 359 310 337
1959 360 342 406 396 420 472 548 559 463 407 362 405
1960 417 391 419 461 472 535 622 606 508 461 390 432
# standard deviation for each year
flights_wide.std(axis=1)
year
1949 13.720147
1950 19.070841
1951 18.438267
1952 22.966379
1953 28.466887
1954 34.924486
1955 42.140458
1956 47.861780
1957 57.890898
1958 64.530472
1959 69.830097
1960 77.737125
dtype: float64
Plots
may_flights has one observation per year, so no CI is shown.
sns.lineplot(data=may_flights, x="year", y="passengers")
sns.barplot(data=may_flights, x='year', y='passengers')
flights_wide shows there are twelve observations for each year, so a CI is shown when all of flights is plotted.
sns.lineplot(data=flights, x="year", y="passengers")
sns.barplot(data=flights, x='year', y='passengers')
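The rule can be checked numerically without plotting; a small sketch with made-up data, mirroring the per-x grouping that seaborn performs before drawing a band:

```python
import pandas as pd

# One observation per x: the per-group spread is undefined (NaN),
# so there is nothing for seaborn to draw as a band.
one_per_x = pd.DataFrame({"x": [1, 2, 3], "y": [10.0, 20.0, 30.0]})
spread_single = one_per_x.groupby("x")["y"].std()
print(spread_single)  # NaN for every x

# Several observations per x: a finite spread exists, so a band can be drawn.
many_per_x = pd.DataFrame({
    "x": [1, 1, 1, 2, 2, 2],
    "y": [9.0, 10.0, 11.0, 19.0, 20.0, 21.0],
})
spread_multi = many_per_x.groupby("x")["y"].std()
print(spread_multi)  # 1.0 at each x
```

This is the same effect the flights/may_flights comparison above demonstrates with real data.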

Excel 2010 Find (dots)-replace(commas)

I am trying to clean data for calculations in Excel. The data is taken from a PDF file, and I need to change dots into commas; otherwise Excel interprets the text-to-columns-separated numbers as dates and transforms them accordingly, which is not what I need.
But after find-and-replacing, the data within each cell gets "Alt-Enter"ed automatically.
I noticed that the cell is transformed this way after each comma. For instance, if there are two commas in one cell, then there are two lines in that cell, which prevents me from using text-to-columns to separate the cells.
So here is the data:
+++
Year CPI WPI Year CPI WPI
1960 29.8 31.7 1980 86.3 93.8
1961 30.0 31.6 1981 94.0 98.8
1962 30.4 31.6 1982 97.6 100.5
1963 30.9 31.6 1983 101.3 102.3
1964 31.2 31.7 1984 105.3 103.5
1965 31.8 32.8 1985 109.3 103.6
1966 32.9 33.3 1986 110.5 99.70
1967 33.9 33.7 1987 115.4 104.2
1968 35.5 34.6 1988 120.5 109.0
1969 37.7 36.3 1989 126.1 113.0
1970 39.8 37.1 1990 133.8 118.7
1971 41.1 38.6 1991 137.9 115.9
1972 42.5 41.1 1992 141.9 117.6
1973 46.2 47.4 1993 145.8 118.6
1974 51.9 57.3 1994 149.7 121.9
1975 55.5 59.7 1995 153.5 125.7
1976 58.2 62.5 1996 158.6 128.8
1977 62.1 66.2 1997 161.3 126.7
1978 67.7 72.7 1998 163.9 122.7
1979 76.7 83.4 1999 168.3 128.0
+++
I would like to have this data separated by spaces, with the decimals starting with a comma instead of a dot.
Why don't you use
=SUBSTITUTE(A1,".",",")
That's exactly what you want, isn't it?
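If the cleanup can happen outside Excel, the same substitution is a one-liner per row in Python; a minimal sketch, assuming the rows exist as plain text lines (the sample values are taken from the data above):

```python
# Replace decimal dots with commas in space-separated numeric rows,
# so Excel's text-to-columns won't reinterpret them as dates.
rows = [
    "1960 29.8 31.7 1980 86.3 93.8",
    "1961 30.0 31.6 1981 94.0 98.8",
]
cleaned = [line.replace(".", ",") for line in rows]
print(cleaned[0])  # 1960 29,8 31,7 1980 86,3 93,8
```

Unlike find-and-replace inside Excel, this touches only the text, so no cells get split into multiple lines.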

How to find the average of every six cells in Excel

This is a relatively common question, so I don't want to be voted down for asking something that has been asked before. I will explain the steps I took using Stack Overflow and other sources, so that you can see I made attempts to solve it myself before asking.
I have a set of values as below:
O P Q "R" Z
6307 586 92.07 1.34
3578 195 94.83 6.00
3147 234 93.08 4.29
3852 227 94.43 15.00
3843 171 95.74 5.10
3511 179 95.15 7.18
6446 648 90.87 1.44
4501 414 91.58 0.38
3435 212 94.19 6.23
I want to take the average of the first six values in column "R" and then put that average in column Z on the sixth row, like this:
O P Q "R" Z
6307 586 92.07 1.34
3578 195 94.83 6.00
3147 234 93.08 4.29
3852 227 94.43 15.00
3843 171 95.74 5.10
3511 179 95.15 7.18 6.49
6446 648 90.87 1.44
4501 414 91.58 0.38
3435 212 94.19 6.23
414 414 91.58 3.49
212 212 94.19 11.78
231 231 93.44 -1.59 3.6
191 191 94.59 2.68
176 176 91.45 .75
707 707 91.96 2.68
792 420 90.95 0.75
598 598 92.15 7.45
763 763 90.66 -4.02
652 652 91.01 3.75
858 445 58.43 2.30 2.30
I have utilized the following formula I obtained:
=AVERAGE(OFFSET(R1510,COUNTA(R:R)-6,0,6,1))
but I received an answer different from what I obtained by simply taking the average of the six previous cells, as in:
=AVERAGE(R1505:R1510)
I then tried the following formula from a Stack Overflow conversation (excel averaging every 10 rows) that was tangentially similar to what I wanted:
=AVERAGE(INDEX(R:R,1+6*(ROW()-ROW($B$1))):INDEX(R:R,10*(ROW()-ROW($B$1)+1)))
but I was unable to get an answer resembling what I got from a plain
=AVERAGE(R1517:R1522)
I also found another approach in the following, but was unable to accurately adapt the references (F3 to R1510, for example):
=AVERAGE(OFFSET(F3,COUNTA($R$1510:$R$1517)-1,,-6,))
Doing so gave me a negative number, -6.95, for a clearly positive set of data.
Put this in Z1 and copy down:
=IF(MOD(ROW(),6)=0,AVERAGE(INDEX(R:R,ROW()-5):INDEX(R:R,ROW())),"")
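For reference, the same every-sixth-row logic can be sketched in Python (the helper name is mine; the sample values are taken from the question's column R):

```python
def rolling_six_average(values):
    """Return (row_number, average) for every 6th row, averaging that row
    and the five before it -- like the Excel formula
    =IF(MOD(ROW(),6)=0, AVERAGE(INDEX(R:R,ROW()-5):INDEX(R:R,ROW())), "")."""
    out = []
    for i in range(5, len(values), 6):  # 0-based index 5 is row 6, then 12, 18, ...
        window = values[i - 5 : i + 1]
        out.append((i + 1, sum(window) / 6))
    return out

r = [1.34, 6.00, 4.29, 15.00, 5.10, 7.18, 1.44, 0.38, 6.23, 3.49, 11.78, -1.59]
print(rolling_six_average(r))
```

The first result is about 6.49 and the second about 3.62, matching the 6.49 and 3.6 shown in the question's expected output.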

Most efficient compression extremely large data set

I'm currently generating an extremely large data set on a remote HPC (high-performance computer). We are talking about 3 TB at the moment, and it could reach up to 10 TB once I'm done.
Each of the 450,000 files ranges from a few KB to about 100 MB and contains lines of integers with no repetitive/predictable patterns. Moreover, they are split among 150 folders (I use the path to classify them according to the input parameters). That could be fine, but my research group is technically limited to 1 TB of disk space on the remote server, although the admins are willing to look the other way until the situation gets sorted out.
What would you recommend to compress such a dataset?
A limitation is that tasks can't run for more than 48 hours at a time on this computer, so slow but efficient compression methods are only possible if 48 hours is enough. I really have no other options, as neither I nor my group own enough disk space on other machines.
EDIT: Just to clarify, this is a remote computer that runs some variation of Linux. All standard compression tools are available. I don't have superuser rights.
EDIT2: As requested by Sergio, here is a sample output (first 10 lines of a file):
27 42 46 63 95 110 205 227 230 288 330 345 364 367 373 390 448 471 472 482 509 514 531 533 553 617 636 648 667 682 703 704 735 740 762 775 803 813 882 915 920 936 939 942 943 979 1018 1048 1065 1198 1219 1228 1513 1725 1888 1944 2085 2190 2480 5371 5510 5899 6788 7728 9514 10382 11946 13063 13808 16070 23301 23511 24538
93 94 106 143 157 164 168 181 196 293 299 334 369 372 439 457 508 527 547 557 568 570 573 592 601 668 701 704 799 838 848 870 875 882 890 913 953 959 1022 1024 1037 1046 1169 1201 1288 1615 1684 1771 2043 2204 2348 2387 2735 3149 4319 4890 4989 5321 5588 6453 7475 9277 9649 9654 11433 16966
1463
183 469 514 597 792
25 50 143 152 205 244 253 424 433 446 461 476 486 545 552 570 632 642 647 665 681 682 718 735 746 772 792 811 830 851 891 903 925 1037 1115 1147 1171 1612 1979 2749 3074 3158 6042 12709 20571 20859
24 30 86 312 726 875 1023 1683 1799
33 36 42 65 110 112 122 227 241 262 274 284 305 328 353 366 393 414 419 449 462 488 489 514 635 690 732 744 767 772 812 820 843 844 855 889 893 925 936 939 981 1015 1020 1060 1064 1130 1174 1304 1393 1477 1939 2004 2200 2205 2208 2216 2234 3284 4456 5209 6810 6834 8067 10811 10895 12771 15291
157 761 834 875 1001 2492
21 141 146 169 181 256 266 337 343 367 397 402 405 433 454 466 513 527 656 684 708 709 732 743 811 883 913 938 947 986 987 1013 1053 1190 1215 1288 1289 1333 1513 1524 1683 1758 2033 2684 3714 4129 6015 7395 8273 8348 9483 23630
1253
All integers are separated by one whitespace character, and each line corresponds to a given element. I use implicit line numbers to store this information, because my data is associative, i.e. the 0th element is associated with elements 27 42 46 63 110.. etc. I believe there is no extra information whatsoever.
A few points that may help:
It looks like your numbers are sorted. If that is always the case, it will be more efficient to compress the differences between adjacent numbers rather than the numbers themselves (since the differences will be somewhat smaller on average).
There are good ways of encoding small integer values in binary format that are probably better than encoding them as text. See the technique used by Google in their protocol buffers: https://developers.google.com/protocol-buffers/docs/encoding
Once you have applied the above techniques, then zipping / some standard form of compression should improve everything even further.
There is some research done at this LINK that breaks down the pro/cons of using gzip, bzip2, and lzma. Hopefully this can let you make an informed decision on your best approach.
Within each line, your numbers appear to be increasing. A rather common approach in database technology is to store only the differences, turning a line like
24 30 86 312 726 875 1023 1683 1799
to something like
6 56 226 414 149 148 660 116
(plus the starting value 24, so the line can be reconstructed). Other lines of your example would show even more benefit, as the differences are smaller. This also works when the numbers decrease in between, but you then have to be able to handle negative differences.
The second thing to do would be changing the encoding. While compression will reduce this overhead anyway, you're currently using 8 bits per digit, whereas you only need 4 (digits 0-9 plus the space as divisor). Implementing your own "4-bit character set" would already cut your storage requirement in half! In the end, this amounts to a binary encoding of numbers of arbitrary length.
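A minimal sketch of the two ideas above, delta encoding plus a protocol-buffers-style varint, in Python (function names are mine):

```python
def delta_encode(nums):
    """Keep the first value, then store successive differences (smaller numbers)."""
    return [nums[0]] + [b - a for a, b in zip(nums, nums[1:])]

def delta_decode(deltas):
    """Invert delta_encode by accumulating the differences."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

def varint_encode(n):
    """Protobuf-style varint: 7 data bits per byte, high bit set means 'more bytes'."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

line = [24, 30, 86, 312, 726, 875, 1023, 1683, 1799]
deltas = delta_encode(line)
print(deltas)  # [24, 6, 56, 226, 414, 149, 148, 660, 116]
assert delta_decode(deltas) == line  # round-trips losslessly

# Deltas under 128 fit in one byte; the text form "226 " needs four.
encoded = b"".join(varint_encode(d) for d in deltas)
print(len(encoded), "bytes vs", len(" ".join(map(str, line))), "bytes as text")
```

On this sample line the varint-encoded deltas take 14 bytes against 35 bytes of text, before any general-purpose compressor is even applied on top.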
