Merge in pandas does not allow a second key to join on - python-3.x

After searching for answers and trying everything I could think of, I could not find a way out, so here it goes.
I have a list of *.txt files that I want to merge by column. I am 100% sure that they all have the same structure, as follows:
File1
date | time | model_name1
1850-01-16 | 12:00:00 | 0.10
File2
date | time | model_name2
1850-01-16 | 12:00:00 | 0.50
File3..... and so on
Note: the vertical bars are just for clarity here.
Now my output should look like this:
Output
date | time | model_name1 | model_name2
1850-01-16 | 12:00:00 | 0.10 | 0.50
With the following piece of code:
import os
from functools import reduce
import pandas as pd
out_list4 = os.listdir(out_directory)
# read each whitespace-separated file into its own DataFrame
df_list = [pd.read_table(out_path + os.fsdecode(file_x), sep=r'\s+') for file_x in out_list4]
# merge the frames pairwise on the shared key
df_merged = reduce(lambda left, right: pd.merge(left, right, on=['date'], how='outer'), df_list)
df_merged.to_csv(out_path + 'merged.txt', sep='\t', index=False)
I manage the following output:
Output
date | time_x | model_name1 |time_y | model_name2
1850-01-16 | 12:00:00 | 0.10 |12:00:00| 0.50
As expected, since the only key is on=['date'].
Now if I try to add time as a second key, on=['date','time'], it crashes with the following error:
KeyError: 'time'
and a long traceback.
I tried left_on/right_on in case "date" was being handled as the index. No use. I know the problem does not lie in the files, the structure is right; it is the code. Any help will be much appreciated. And sorry for readability on the

So, the problem was upstream of the merge. I had earlier defined out_list4 as a list:
out_list4 = list()
and it was making a mess at the end. Each DataFrame in the list should have shape 1872 x 3, but the last entry was all of them put together again, with shape 1872 x 12 and no 'time' header.
Changing the definition of out_list4 to:
out_list4 = []
did the trick. The tip came from "Combine a list of pandas dataframes to one pandas dataframe".
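A minimal sketch of the two-key merge, using small in-memory frames in place of the real files (the sample values are copied from the question; the key check and its error message are an illustrative addition, not part of the original code). Checking every frame for the join columns before merging makes a stray frame without a 'time' header show up immediately instead of as a KeyError deep inside the merge.

```python
from functools import reduce

import pandas as pd

# Stand-ins for the frames read from the *.txt files
frames = [
    pd.DataFrame({"date": ["1850-01-16"], "time": ["12:00:00"], "model_name1": [0.10]}),
    pd.DataFrame({"date": ["1850-01-16"], "time": ["12:00:00"], "model_name2": [0.50]}),
]

keys = ["date", "time"]

# Fail early, naming the offending frame, if a join column is missing
for position, frame in enumerate(frames):
    missing = [key for key in keys if key not in frame.columns]
    if missing:
        raise KeyError(f"frame {position} is missing join columns: {missing}")

# Merge pairwise on both keys so 'time' is not duplicated as time_x / time_y
merged = reduce(
    lambda left, right: pd.merge(left, right, on=keys, how="outer"),
    frames,
)
print(merged)
```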

Related

Python for loop taking forever to run with huge dataset [duplicate]

This question already has answers here:
How do I Pandas group-by to get sum?
The df is formatted in this manner:
Zip Code | State | Carrier | Price
__________________________________
xxxxx | XX | ABCD | 12.0
xxxxx | XX | TUSD | 15.0
xxxxx | XX | PPLD | 17.0
The Code:
carrier_sum = []
unique_carrier = a_df['Carrier'].unique()
for i in unique_carrier:
    x = 0
    for y, row in a_df.iterrows():
        x = a_df.loc[a_df['Carrier'] == i, 'Price'].sum()
    print(i, x)
    carrier_sum.append([i, x])
This is my code. First it builds the unique_carrier list. Then, for each carrier, it iterates over the df with iterrows() to get the 'Price' values and sum them, appending each result to the empty carrier_sum list I created.
The problem is that it seems to take forever: I ran it once and it took over 15 minutes just to get the sum for the first carrier, and there are 8 of them.
What can I do to make it more efficient?
The dataset is over 300,000 rows long.
One idea is to hard-code the list of unique carriers beforehand, since I don't really need to look them up in the df. Another is to sort the main dataset alphabetically by carrier name, so that the unique carrier list lines up with the order in the dataset.
Thank you for reading.
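The inner iterrows() loop recomputes the same sum once for every row, which is why it is so slow. A single groupby gives every carrier's total without any explicit loop:
df.groupby('Carrier')['Price'].sum()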
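As a small runnable sketch of that groupby, assuming the frame is called a_df as in the question (the rows below are made-up placeholder values, not the real data):

```python
import pandas as pd

# Placeholder data with the same columns as the question's frame
a_df = pd.DataFrame({
    "Zip Code": ["00001", "00002", "00003", "00004"],
    "State": ["XX", "XX", "YY", "YY"],
    "Carrier": ["ABCD", "TUSD", "ABCD", "PPLD"],
    "Price": [12.0, 15.0, 17.0, 3.0],
})

# One total per carrier, returned as a Series indexed by carrier name
carrier_sum = a_df.groupby("Carrier")["Price"].sum()
print(carrier_sum)

# If a list of [carrier, total] pairs is still wanted, as in the original loop
carrier_pairs = [[carrier, total] for carrier, total in carrier_sum.items()]
```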

Pandas: Sliding window, summing app 14 day data

I wonder how to make sliding windows in pandas.
I have a dataframe with three columns.
Country | Number | DayOfTheYear
===================================
No | 50 | 0
No | 20 | 1
No | 37 | 2
I would love to see 14-day chunks for every country and day combination.
The country part can be ignored for the moment, since I can filter countries manually in some way. But imagine there is only one country: is there a smart way to get some sort of summed-up sliding window, resulting in something like the following?
Country | Sum | DatesOftheYear
===================================
No | 504 | 0-13
No | 207 | 1-14
No | 337 | 2-15
I would also accept it if the windows were disjoint, being only 0-13, 14-27, etc.
But I just cannot get there with pandas. I know an old SQL solution, but does anybody have a nice idea for pandas?
If you want a rolling window over your dataframe, you can simply use the .rolling function of pandas: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html
In your case: df["Number"].rolling(14).sum()
Each value is the sum of the 14 rows ending at that row, so the first 13 rows are NaN.
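A sketch of both variants from the question, assuming one row per country and day, sorted by DayOfTheYear (the numbers below are placeholders, and the Sum14 column name is made up for illustration). The per-country grouping keeps windows from running across country boundaries; the disjoint chunks come from integer-dividing the day number by 14.

```python
import pandas as pd

# Placeholder data: one country, 28 consecutive days
df = pd.DataFrame({
    "Country": ["No"] * 28,
    "Number": range(1, 29),
    "DayOfTheYear": range(28),
})

# Sliding window: row i holds the sum over days i-13 .. i within its country
df["Sum14"] = df.groupby("Country")["Number"].transform(
    lambda s: s.rolling(14).sum()
)

# Disjoint chunks: days 0-13 fall in chunk 0, days 14-27 in chunk 1, and so on
chunks = df.groupby(["Country", df["DayOfTheYear"] // 14])["Number"].sum()
print(chunks)
```

Note that rolling labels each window by its last day, whereas the table in the question labels it by its first day; shifting the result by 13 rows is one way to line the two up.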

Getting multiple readings from .txt into excel

I'm not sure if this is the correct place to ask this, but basically I have a .txt file containing values that came from 2 separate sensors.
Example of some data:
{"t":3838202,"s":0,"n":"x1","v":-1052}
{"t":3838203,"s":0,"n":"y1","v":44}
{"t":3838204,"s":0,"n":"z1","v":-84}
{"t":3838435,"s":0,"n":"x1","v":-1052}
{"t":3838436,"s":0,"n":"y1","v":36}
{"t":3838437,"s":0,"n":"z1","v":-80}
{"t":3838670,"s":0,"n":"x1","v":-1056}
{"t":3838671,"s":0,"n":"y1","v":52}
{"t":3838672,"s":0,"n":"z1","v":-88}
{"t":3838902,"s":0,"n":"x1","v":-1052}
{"t":3838903,"s":0,"n":"y1","v":48}
{"t":3838904,"s":0,"n":"z1","v":-80}
{"t":3839136,"s":0,"n":"x1","v":-1056}
{"t":3839137,"s":0,"n":"y1","v":40}
{"t":3839138,"s":0,"n":"z1","v":-80}
x2:-944
y2:108
z2:-380
{"t":3839841,"s":0,"n":"x1","v":-1052}
{"t":3839842,"s":0,"n":"y1","v":44}
{"t":3839843,"s":0,"n":"z1","v":-80}
x2:-948
y2:100
z2:-380
{"t":3840541,"s":0,"n":"x1","v":-1052}
{"t":3840542,"s":0,"n":"y1","v":40}
{"t":3840543,"s":0,"n":"z1","v":-84}
{"t":3840774,"s":0,"n":"x1","v":-1052}
{"t":3840775,"s":0,"n":"y1","v":40}
{"t":3840776,"s":0,"n":"z1","v":-84}
x2:-948
y2:108
z2:-368
I'm trying to get the data into Excel, so that for each "chunk" of data in the x1y1z1 section, I take the last set of recorded data, discard the rest, and "pair" it with the next set of x2y2z2 data. I don't think I'm explaining it very well, but I basically want to take that text file and get this in Excel:
+---------+-------+----+-----+------+-----+------+
| t | x1 | y1 | z1 | x2 | y2 | z2 |
+---------+-------+----+-----+------+-----+------+
| 3839138 | -1056 | 40 | -80 | -944 | 108 | -380 |
| 3839843 | -1052 | 44 | -80 | -948 | 100 | -380 |
| 3840776 | -1052 | 40 | -84 | -948 | 108 | -368 |
+---------+-------+----+-----+------+-----+------+
I'm really stuck as to where I should even start.
I think like a programmer, so I would approach this problem in steps. If you are not a programmer, this might not be so helpful to you, and I am sorry for that.
First, define the data: how does each line get read and understood?
Second, write a parsing utility: a piece of code that interprets the data as it is read in and stores it in the form you want for your output.
Third, import the data into Excel.
Based on the limited data you provided, I am not sure how you determine x1, y1, z1, x2, y2, z2 for each t, but I assume the values enclosed in curly braces have something to do with it, judging by the s, n, and v fields I see there. So first you need to decide clearly how you read the data. Take it one line at a time, and work out how each line contributes to your output table. I assume, for example, that you would treat the lines enclosed in curly braces differently from the lines with standalone x/y/z values.
I hope this points you in the right direction.
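The steps above could be sketched as follows. This is only one reading of the format, under stated assumptions: lines in curly braces are JSON records whose "n" field names the axis and whose "v" field is the value, the bare "name:value" lines belong to the second sensor, and a z2 line closes a block. The function name pair_readings is made up for this example.

```python
import json


def pair_readings(lines):
    """Pair the last x1/y1/z1 reading before each x2/y2/z2 block with that block."""
    rows = []
    latest = {}  # most recent t, x1, y1, z1 seen so far (older values are overwritten)
    block = {}   # x2/y2/z2 values of the block currently being read
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("{"):
            record = json.loads(line)
            latest[record["n"]] = record["v"]
            latest["t"] = record["t"]
        else:
            name, value = line.split(":")
            block[name] = int(value)
            if name == "z2":  # assumption: z2 is always the last line of a block
                rows.append({**latest, **block})
                block = {}
    return rows


sample = """
{"t":3838902,"s":0,"n":"x1","v":-1052}
{"t":3838903,"s":0,"n":"y1","v":48}
{"t":3838904,"s":0,"n":"z1","v":-80}
{"t":3839136,"s":0,"n":"x1","v":-1056}
{"t":3839137,"s":0,"n":"y1","v":40}
{"t":3839138,"s":0,"n":"z1","v":-80}
x2:-944
y2:108
z2:-380
{"t":3839841,"s":0,"n":"x1","v":-1052}
{"t":3839842,"s":0,"n":"y1","v":44}
{"t":3839843,"s":0,"n":"z1","v":-80}
x2:-948
y2:100
z2:-380
"""

rows = pair_readings(sample.splitlines())
for row in rows:
    print(row)
```

The resulting list of dictionaries can be written out with csv.DictWriter from the standard library, which Excel opens directly, or turned into a pandas DataFrame and saved with to_excel if pandas and an Excel writer such as openpyxl are installed.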

Using multiple parent IDs for cutoff times in deep feature synthesis

My data looks like: People <-- Events <-- Activities. The parent is People, whose only variable is person_id. Events and Activities both have a time index, along with event_id and activity_id respectively, and each has a few features.
Members of the 'People' entity visit places at all different times. I am trying to generate deep features for people. If people is something like [1,2,3], how do I pass cutoff times that create deep features for something like (Person, cutofftime): [1, January2], [1, January3]?
If I have only 3 People, it seems like I can't pass a cutoff_time dataframe that has 10 rows (for example, person 1 with 10 possible cutoff times). Trying this gives me the error "Duplicated rows in cutoff time dataframe", despite dropping duplicates from my cutoff_times dataframe.
Must I include a time index in the People entity? This would leave my parent entity with the same person appearing several times in the index, although with different time index values. My instinct is that the People entity should not include any datetime column. I would like to give cutoff times to the DFS function.
My cutoff_times df.head looks like this, and has multiple instances of some people_id:
+-------------------------------------------+
| person_id time label |
+-------------------------------------------+
| 0 f_GZSVLYU 2019-12-06 0.0 |
| 1 f_ATBJEQS 2019-12-06 1.0 |
| 2 f_GLFYVAY 2019-12-06 0.5 |
| 3 f_DIHPTPA 2019-12-06 0.5 |
| 4 f_GZSVLYU 2019-12-02 1.0 |
+-------------------------------------------+
The Parent People Entity is like this:
+-------------------+
| person_id |
+-------------------+
| 0 f_GZSVLYU |
| 1 f_ATBJEQS |
| 2 f_GLFYVAY |
| 3 f_DIHPTPA |
| 4 f_DVOYHRQ |
+-------------------+
How can I make featuretools understand what I'm trying to do?
'Duplicated rows in cutoff time dataframe.' I have explored my cutoff_times df and there are no duplicate rows. Person_id, times, and labels all have multiple occurrences each, but no 2 rows are the same. Could the duplicates the error refers to be somewhere else in the EntitySet?
The answer is that one row of cutoff_df had the same ID and time as another row but a different label. That's a problem.
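A quick way to look for such rows with plain pandas, shown on made-up data shaped like the cutoff_times frame above (the label values here are illustrative). Since a plain drop_duplicates() compares whole rows, it does not catch rows that agree on person_id and time but differ in label; restricting the check to those two columns does.

```python
import pandas as pd

# Illustrative cutoff times: the first and last rows share person_id and time
cutoff_times = pd.DataFrame({
    "person_id": ["f_GZSVLYU", "f_ATBJEQS", "f_GZSVLYU", "f_GZSVLYU"],
    "time": pd.to_datetime(["2019-12-06", "2019-12-06", "2019-12-02", "2019-12-06"]),
    "label": [0.0, 1.0, 1.0, 0.5],
})

key = ["person_id", "time"]

# Every row whose (person_id, time) pair occurs more than once
conflicts = cutoff_times[cutoff_times.duplicated(subset=key, keep=False)]
print(conflicts)

# One possible resolution: keep the first row per pair (which label to keep is a judgment call)
deduplicated = cutoff_times.drop_duplicates(subset=key, keep="first")
```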

Extract a substring new column based on a substring based on conditions ideally with Pandas

I have a data set (Excel) with hundreds of entries. Most of the information is in one string column. The pieces of information are separated by '_' and were typed in by humans, so it is not possible to work with index positions.
To create a usable data basis, I need to extract information from this column into another column.
The search pattern '*v*' alone is not enough. But combined with the condition that the first character has to be a digit, it works.
I tried to get it to work with iterrows, iteritems, str.strip, str.extract and many more. But the best result I got was with a for-loop.
pattern = '_*v*_'
test = []
for i in df['col']:
    # Split the string into substrings
    i = i.split('_')
    for c in i:
        if c.find('v') != -1:
            if c[0].isdigit():
                # print(c)
                test.append(c)
            else:
                # To be able to fix a few rows manually
                test.append(0)
[4]: test =[22v3, 33v55, 4v2]
#Input
+-----------+-----------+
| col | targetcol |
+-----------+-----------+
| as_22v3 | |
| 33v55_bdd | |
| Ave_4v2 | |
+-----------+-----------+
#Output
+-----------+-----------+--+
| col | targetcol | |
+-----------+-----------+--+
| as_22v3 | 22v3 | |
| 33v55_bdd | 33v55 | |
| Ave_4v2 | 4v2 | |
+-----------+-----------+--+
My code does work, but only for the first few rows. It stops after 36 values and I can't figure out why. There is no error message, other than of course that the list cannot be assigned to a DataFrame column because it does not have the same length.
pandas.Series.str.extract should help:
>>> df['col'].str.extract(r'(\d+v+\d+)')
0
0 22v3
1 33v55
2 4v2
df = pd.DataFrame({
'col': ['as_22v3', '33v55_bdd', 'Ave_4v2']
})
df['targetcol'] = df['col'].str.extract(r'(\d+v+\d+)')
EDIT
df = pd.DataFrame({
'col': ['as_22v3', '33v55_bdd', 'Ave_4v2', '_22 v3', 'space 2,2v3', '2.v3',
'2.111v999', 'asd.123v77', '1 v7', '123 v 8135']
})
pattern = r'(\d+(\,[0-9]+)?(\s+)?v\d+)'
df['result'] = df['col'].str.extract(pattern)[0]
col result
0 as_22v3 22v3
1 33v55_bdd 33v55
2 Ave_4v2 4v2
3 _22 v3 22 v3
4 space 2,2v3 2,2v3
5 2.v3 NaN
6 2.111v999 111v999
7 asd.123v77 123v77
8 1 v7 1 v7
9 123 v 8135 NaN
You say it stops after 36 values, and that you are processing an Excel file? One thing you could try is to save the data set as a .csv file and read it in with pd.read_csv. Excel files sometimes contain extra characters that are not easily visible.
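A compact variant of the extraction, as a sketch: the inner groups are made non-capturing so the pattern has a single capture group, and expand=False then returns a Series that can be assigned straight to the target column. Because str.extract produces exactly one result per row (NaN where nothing matches), the length always equals the length of the frame. The fourth sample row is a made-up non-matching value.

```python
import pandas as pd

df = pd.DataFrame({
    "col": ["as_22v3", "33v55_bdd", "Ave_4v2", "no_match_here"],
})

# digits, optional ",digits", optional whitespace, "v", digits
pattern = r"(\d+(?:,\d+)?\s*v\d+)"

# One capture group + expand=False gives a Series aligned with df's index
df["targetcol"] = df["col"].str.extract(pattern, expand=False)
print(df)

# Rows that still need manual fixing
unmatched = df[df["targetcol"].isna()]
```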

Resources