Create multi-indexed dataframe by multiplying column value by another dataframe - python-3.x

Ok, so hopefully the title is understandable. I have two dataframes: one with a datetime index and a single column of values, and another with latitude and longitude columns among others.
The general layout is
df1=
factor
2015-04-15 NaN
2015-04-16 NaN
2015-04-17 NaN
2015-04-18 NaN
2015-04-19 NaN
2015-04-20 NaN
2015-04-21 NaN
2015-04-22 NaN
2015-04-23 NaN
2015-04-24 7.067218
2015-04-25 9.414628
2015-04-26 13.702154
2015-04-27 16.489926
2015-04-28 17.917428
2015-04-29 20.359118
2015-04-30 18.608707
2015-05-01 10.627798
2015-05-02 8.398942
2015-05-03 5.984976
2015-05-04 4.363621
2015-05-05 3.468062
2015-05-06 2.830794
2015-05-07 2.347879
df2=
i_lat i_lon multiplier sum ID distance
226 1092 264 -60.420166 61.420166 609 0.6142016587060164 km
228 1092 265 -129.914662 130.914662 609 1.309146617117938 km
204 1091 264 -203.371915 204.371915 609 2.043719152272311 km
206 1091 265 -233.799786 234.799786 609 2.347997860007727 km
224 1092 263 -240.718140 241.718140 609 2.417181399246371 km
.. ... ... ... ... ... ...
295 1095 268 -969.728516 970.728516 609 9.707285164114008 km
216 1092 259 -977.398084 978.398084 609 9.783980837220454 km
278 1094 269 -984.131470 985.131470 609 9.851314704203592 km
160 1088 267 -994.142285 995.142285 609 9.951422853836982 km
194 1091 259 -996.513606 997.513606 609 9.975136064824323 km
I basically need to compute df1["factor"]*df2["multiplier"]+df2["sum"] for every pair of i_lat and i_lon, so that a multi-indexed dataframe like this is output:
df_output=
col
i_lat i_lon time
1092 264 2015-04-15 -9.000000e+33
2015-04-16 -9.000000e+33
2015-04-17 -9.000000e+33
2015-04-18 -9.000000e+33
2015-04-19 -9.000000e+33
... ...
1091 259 2015-05-05 -9.000000e+33
2015-05-06 -9.000000e+33
2015-05-07 -9.000000e+33
2015-05-08 -9.000000e+33
2015-05-09 -9.000000e+33
With col holding the result of the operation described above. I tried to use apply as df2.apply(lambda a: print(df1*a["multiplier"]+a["sum"], axis=1)) but it returns something that doesn't make sense. I don't really know how to continue from here.
Thanks!

You can do:
df2 = df2.set_index(['i_lat', 'i_lon'])

# Broadcast df1's (n, 1) column against df2's (m,) rows to get an (n, m) grid,
# then unstack so the result is a Series indexed by (i_lat, i_lon, time).
(pd.DataFrame(df1.values * df2.multiplier.values + df2['sum'].values,
              index=df1.index,
              columns=df2.index)
 .unstack()
)
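A minimal self-contained check of the broadcasting above, with made-up toy values standing in for the question's data:

import pandas as pd

# Toy stand-ins for the question's frames (values are invented).
df1 = pd.DataFrame({'factor': [1.0, 2.0, 3.0]},
                   index=pd.date_range('2015-04-24', periods=3, name='time'))
df2 = pd.DataFrame({'i_lat': [1092, 1091], 'i_lon': [264, 265],
                    'multiplier': [-60.0, -130.0], 'sum': [61.0, 131.0]})
df2 = df2.set_index(['i_lat', 'i_lon'])

# (3, 1) * (2,) broadcasts to (3, 2); unstack yields a Series with a
# (i_lat, i_lon, time) MultiIndex, matching the desired output shape.
out = (pd.DataFrame(df1.values * df2['multiplier'].values + df2['sum'].values,
                    index=df1.index,
                    columns=df2.index)
       .unstack())
print(out)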

Related

Sum up Specific columns in a Dataframe from sqlite

I'm relatively new to DataFrames in Python and running into an issue I can't find.
I have a DataFrame with the following column layout;
print(list(df.columns.values)) returns:
['iccid', 'system', 'last_updated', '01.01', '02.01', '03.01', '04.01', '05.01', '12.01', '18.01', '19.01', '20.01', '21.01', '22.01', '23.01', '24.01', '25.01', '26.01', '27.01', '28.01', '29.01', '30.01', '31.01']
Normally I should have a column for each day in a specific month; in the example above it's December 2022. Sometimes days are missing, which isn't an issue.
I first tried to get all the relevant columns by filtering them:
# Filter out the columns that are not related to the data
data_columns = [col for col in df.columns if '.' in col]
Now comes the issue:
Sometimes the "system" column can also be empty, so I need to put the iccid into the system value:
df.loc[df['system'] == 'Nicht benannt!', 'system'] = df.loc[df['system'] == 'Nicht benannt!', 'iccid'].iloc[0]
df.loc[df['system'] == '', 'system'] = df.loc[df['system'] == '', 'iccid'].iloc[0]
grouped = df.groupby('system').sum(numeric_only=False)
Then I tried to create the needed 'data_usage' column:
grouped['data_usage'] = grouped[data_columns[-1]]
grouped.reset_index(inplace=True)
With that line I should normally only get the value of the last column in the dataframe (a workaround that also didn't work as expected).
What I'm actually trying to get is the sum of all columns which contain a date in their name, added to a new column named data_usage.
The issue is that systems without an initial system value get a data_usage of 120000 (the value represents the megabytes used), while according to the sqlite file the system only used 9000 MB in total that particular month.
For example, this is the row in the sqlite file:
iccid                system          last_updated  06.02  08.02
8931080320014183316  Nicht benannt!  2023-02-06     1196   1391
and in the dataframe I get the following result:
8931080320014183316    48129.0
I can't find the issue and would be very happy if someone could point me in the right direction.
Here are some example data as requested:
iccid                system          last_updated  01.12 02.12 03.12 04.12 05.12 06.12 07.12 08.12 09.12 10.12 11.12 12.12 13.12 14.12 15.12 16.12 17.12 18.12 19.12 20.12 21.12 22.12 23.12 28.12 29.12 30.12 31.12
8945020184547971966  U-O-51          2022-12-01        2    32   179   208   320   509   567   642   675   863  1033  1055  1174  2226  2277  2320  2466  2647  2679  2713  2759  2790  2819  2997  3023  3058  3088
8945020855461807911  L-O-382         2022-12-01        1    26    54   250   385   416   456   481   506   529   679   772   802   832   858   915   940  1019  1117  1141  1169  1193  1217  1419  1439  1461  1483
8945020855461809750  C-O-27          2022-12-01        1   123   158   189   225   251   456   489   768   800   800   800   800   800   800  2362  2386  2847  2925  2960  2997  3089  3116  3448  3469  3543  3586
8931080019070958450  L-O-123         2022-12-02        0    21    76   313   479   594   700   810   874  1181  1955  2447  2527  2640  2897  3008  3215  3412  3554  3639  3698  3782  3850  4741  4825  4925  5087
8931080453114183282  Nicht benannt!  2022-12-02        0     6    45    81    95    98   101   102   102   102   102   102   102   103   121   121   121   121   149   164   193   194   194   194   194   194   194
8931080894314183290  C-O-16 N        2022-12-02        0    43   145   252   386   452   532   862   938  1201  1552  1713  1802  1855  2822  3113  3185  3472  3527  3745  3805  3880  3938  4221  4265  4310  4373
8931080465814183308  L-O-83          2022-12-02        0    61   169   275   333   399   468   858  1094  1239  1605  1700  1928  2029  3031  4186  4333  4365  4628  4782  4842  4975  5265  5954  5954  5954  5954
8931082343214183316  Nicht benannt!  2022-12-02        0    52   182   506   602   719   948  1129  1314  1646  1912  1912  1912  1912  2791  3797  3944  4339  4510  4772  4832  5613  5688  6151  6482  6620  6848
8931087891314183324  L-O-119         2022-12-02        0    19   114   239   453   573   685   800  1247  1341  1341  1341  1341  1341  1341  1341  1341  1341  1341  1341  1341  1341  1423  2722  3563  4132  4385
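A minimal sketch of the described computation (an editor's assumption about the intent, not from the original post; note that the .iloc[0] assignment above stamps the first matching iccid onto every unnamed row, which would merge several devices into one group and could explain the inflated sums):

import pandas as pd

# Fill placeholder/empty system names from each row's own iccid,
# instead of copying one iccid onto every such row.
mask = df['system'].isin(['Nicht benannt!', '']) | df['system'].isna()
df.loc[mask, 'system'] = df.loc[mask, 'iccid']

# Columns whose names contain a dot are the daily columns.
data_columns = [col for col in df.columns if '.' in col]

# Per-system totals, with the row-wise sum of all daily columns as data_usage.
grouped = df.groupby('system')[data_columns].sum()
grouped['data_usage'] = grouped[data_columns].sum(axis=1)
grouped = grouped.reset_index()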

Retaining bad_lines identified by pandas in the output file instead of skipping those lines

I have to convert text files into CSVs after processing the contents of the text file as a pandas dataframe.
Below is the code I am using; out_txt is my input text file and out_csv is my output CSV file.
df = pd.read_csv(out_txt, sep='\s', header=None, on_bad_lines='warn', encoding = "ANSI")
df = df.replace(r'[^\w\s]|_]/()|~"{}="', '', regex=True)
df.to_csv(out_csv, header=None)
If "on_bad_lines = 'warn'" is not decalred the csv files are not created. But if i use this condition those bad lines are getting skipped (obviously) with the warning
Skipping line 6: Expected 8 fields in line 7, saw 9. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
I would like to retain these bad lines in the csv. I have highlighted the bad lines detected in the below image (my input text file).
Below is the contents of the text file which is getting saved. In this content i would like to remove characters like #, &, (, ).
75062 220 8 6 110 220 250 <1
75063 260 5 2 584 878 950 <1
75064 810 <2 <2 456 598 3700 <1
75065 115 5 2 96 74 5000 <1
75066 976 <5 2 5 68 4200 <1
75067 22 210 4 348 140 4050 <1
75068 674 5 4 - 54 1130 3850 <1
75069 414 5 y) 446 6.6% 2350 <1
75070 458 <5 <2 548 82 3100 <1
75071 4050 <5 2 780 6430 3150 <1
75072 115 <7 <1 64 5.8% 4050 °#&4«x<i1
75073 456 <7 4 46 44 3900 <1
75074 376 <7 <2 348 3.8% 2150 <1
75075 378 <6 y) 30 40 2000 <1
I would split on whitespace later with str.split rather than in read_csv:
df = (
    pd.read_csv(out_txt, header=None, encoding='ANSI')  # one raw line per row
    .replace(r'[^\w\s]|_]/()|~"{}="', '', regex=True)   # strip unwanted characters
    .squeeze()                                          # single column -> Series
    .str.split(expand=True)                             # split on whitespace into columns
)
Another variant (skipping everything that comes in between the numbers):
df = (
    pd.read_csv(out_txt, header=None, encoding='ANSI')
    [0].str.findall(r"\b(\d+)\b")  # keep only the runs of digits
    .str.join(' ')                 # list of matches -> single string
    .str.split(expand=True)        # split into columns
)
Output :
print(df)
0 1 2 3 4 5 6 7
0 375020 1060 115 38 440 350 7800 1
1 375021 920 80 26 310 290 5000 1
2 375022 1240 110 28 460 430 5900 1
3 375023 830 150 80 650 860 6200 1
4 375024 185 175 96 800 1020 2400 1
5 375025 680 370 88 1700 1220 172 1
6 375026 550 290 72 2250 1460 835 2
7 375027 390 120 60 1620 1240 158 1
8 375028 630 180 76 820 1360 180 1
9 375029 460 280 66 380 790 3600 1
10 375030 660 260 62 11180 1040 300 1
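An alternative sketch, if the goal is to keep the over-long rows at parse time: pandas 1.4+ accepts a callable for on_bad_lines (python engine only). The fold_overflow handler below is hypothetical and simply merges the overflow into the last field:

import pandas as pd

def fold_overflow(bad_line):
    # Keep the first 7 fields and merge the rest into the 8th,
    # so the row is retained instead of skipped.
    return bad_line[:7] + [" ".join(bad_line[7:])]

df = pd.read_csv(out_txt, sep=r"\s+", header=None,
                 engine="python", on_bad_lines=fold_overflow)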
11 375031 530 200 84 1360 1060 555 1

Concatenate Two DataFrames Based On DateTime Column

I have two dataframes.
First one:
Date B
2021-12-31 NaN
2022-01-31 500
2022-02-28 540
Second one:
Date A
2021-12-28 520
2021-12-31 530
2022-01-20 515
2022-01-31 529
2022-02-15 544
2022-02-25 522
I want to concatenate both dataframes based on year and month; the resulting dataframe should look like this:
Date A B
2021-12-28 520 NaN
2021-12-31 530 NaN
2022-01-20 515 500
2022-01-31 529 500
2022-02-15 544 540
2022-02-25 522 540
You need a left merge on the month period:
df2.merge(df1,
          left_on=pd.to_datetime(df2['Date']).dt.to_period('M'),
          right_on=pd.to_datetime(df1['Date']).dt.to_period('M'),
          suffixes=(None, '_'),
          how='left'
          )
Then drop(columns=['key_0', 'Date_']) if needed.
Output:
key_0 Date A Date_ B
0 2021-12 2021-12-28 520 2021-12-31 NaN
1 2021-12 2021-12-31 530 2021-12-31 NaN
2 2022-01 2022-01-20 515 2022-01-31 500.0
3 2022-01 2022-01-31 529 2022-01-31 500.0
4 2022-02 2022-02-15 544 2022-02-28 540.0
5 2022-02 2022-02-25 522 2022-02-28 540.0
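Putting the merge and the cleanup together, as a sketch with the same column names:

import pandas as pd

out = (df2.merge(df1,
                 left_on=pd.to_datetime(df2['Date']).dt.to_period('M'),
                 right_on=pd.to_datetime(df1['Date']).dt.to_period('M'),
                 suffixes=(None, '_'),
                 how='left')
          .drop(columns=['key_0', 'Date_']))  # drop the merge key and df1's Date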

How to join all Cities belonging to their state in python

I have a dataframe that contains some states, and a list of a few of their cities. I want to add those cities to that dataset and group each city with its state name.
E.g.
#I have entered some random city names for example purpose
city = ['Akola','Aurangabad','Dhule','Jalgaon','Mumbai','Mumbai Suburban','Nagpur']
State Cases Active Recovered Death
0 Maharashtra 77793 2933 41402 1458 33681 1352 2710 123
1 Andhra Pradesh 4223 143 1613 67 2539 73 71 3
2 Karnataka 4320 257 2653 157 1610 96 57 4
3 Goa 166 87 109 87 57 0
4 Tamil Nadu 27256 1384 12134 786 14902 586 220 12
and I want those cities added to the dataframe in a new column, like
State Cases Active Recovered Death |CITY
0 Maharashtra 77793 2933 41402 1458 33681 1352 2710 123 |AKOLA
1 Maharashtra 77793 2933 41402 1458 33681 1352 2710 123 |DHULE
2 Maharashtra 77793 2933 41402 1458 33681 1352 2710 123 |MUMBAI
3 Andhra Pradesh 4223 143 1613 67 2539 73 71 3 |JALGAON
4 Andhra Pradesh 4223 143 1613 67 2539 73 71 3 |NAGPUR
5 Karnataka 4320 257 2653 157 1610 96 57 4
6 Goa 166 87 109 87 57 0
7 Tamil Nadu 27256 1384 12134 786 14902 586 220 12 |AURANGABAD
8 Tamil Nadu 27256 1384 12134 786 14902 586 220 12 |MUMBAI SUBURBAN
# data is wrong so please focus on the format
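A minimal sketch, assuming a city-to-state mapping can be built for the list (the mapping frame below is hypothetical): merge it onto the state table so each state row repeats once per matching city.

import pandas as pd

# Hypothetical mapping from each city in the list to its state.
city_state = pd.DataFrame({
    'CITY':  ['Akola', 'Aurangabad', 'Dhule', 'Jalgaon',
              'Mumbai', 'Mumbai Suburban', 'Nagpur'],
    'State': ['Maharashtra'] * 7,
})

# Left merge keeps states without any listed city (their CITY becomes NaN).
out = df.merge(city_state, on='State', how='left')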

Why doesn't the seaborn plot show a confidence interval?

I am using sns.lineplot to show the confidence intervals in a plot.
sns.lineplot(x = threshold, y = mrl_array, err_style = 'band', ci=95)
plt.show()
I'm getting the following plot, which doesn't show the confidence interval:
What's the problem?
There is probably only a single observation per x value.
If there is only one observation per x value, there is no confidence interval to plot: bootstrapping is performed per x value, but there needs to be more than one observation for this to take effect.
ci: Size of the confidence interval to draw when aggregating with an estimator. 'sd' means to draw the standard deviation of the data. Setting to None will skip bootstrapping.
Note the following examples from seaborn.lineplot.
This is also the case for sns.relplot with kind='line'.
The question specifies sns.lineplot, but this answer applies to any seaborn plot that displays a confidence interval, such as seaborn.barplot.
Data
import seaborn as sns
# load data
flights = sns.load_dataset("flights")
year month passengers
0 1949 Jan 112
1 1949 Feb 118
2 1949 Mar 132
3 1949 Apr 129
4 1949 May 121
# only May flights
may_flights = flights.query("month == 'May'")
year month passengers
4 1949 May 121
16 1950 May 125
28 1951 May 172
40 1952 May 183
52 1953 May 229
64 1954 May 234
76 1955 May 270
88 1956 May 318
100 1957 May 355
112 1958 May 363
124 1959 May 420
136 1960 May 472
# standard deviation for each year of May data
may_flights.set_index('year')[['passengers']].std(axis=1)
year
1949 NaN
1950 NaN
1951 NaN
1952 NaN
1953 NaN
1954 NaN
1955 NaN
1956 NaN
1957 NaN
1958 NaN
1959 NaN
1960 NaN
dtype: float64
# flights in wide format
flights_wide = flights.pivot(index="year", columns="month", values="passengers")
month Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
year
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
1954 204 188 235 227 234 264 302 293 259 229 203 229
1955 242 233 267 269 270 315 364 347 312 274 237 278
1956 284 277 317 313 318 374 413 405 355 306 271 306
1957 315 301 356 348 355 422 465 467 404 347 305 336
1958 340 318 362 348 363 435 491 505 404 359 310 337
1959 360 342 406 396 420 472 548 559 463 407 362 405
1960 417 391 419 461 472 535 622 606 508 461 390 432
# standard deviation for each year
flights_wide.std(axis=1)
year
1949 13.720147
1950 19.070841
1951 18.438267
1952 22.966379
1953 28.466887
1954 34.924486
1955 42.140458
1956 47.861780
1957 57.890898
1958 64.530472
1959 69.830097
1960 77.737125
dtype: float64
Plots
may_flights has one observation per year, so no CI is shown.
sns.lineplot(data=may_flights, x="year", y="passengers")
sns.barplot(data=may_flights, x='year', y='passengers')
flights_wide shows there are twelve observations for each year, so a CI is shown when all of flights is plotted.
sns.lineplot(data=flights, x="year", y="passengers")
sns.barplot(data=flights, x='year', y='passengers')
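Note that seaborn 0.12 replaced the ci parameter with errorbar; a quick sketch with the same flights data:

# errorbar="sd" draws a standard-deviation band; errorbar=("ci", 95)
# reproduces the old bootstrapped 95% confidence interval.
sns.lineplot(data=flights, x="year", y="passengers", errorbar="sd")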
