I'm trying to write a Python script that finds unique values (names) and reports how often each one occurs, using the pandas library. There are around 90 unique names in total, which I've anonymised in the head of the dataframe pasted below.
,1,2,3,4,5
0,monday09-01-2022,tuesday10-01-2022,wednesday11-01-2022,thursday12-01-2022,friday13-01-2022
1,Anonymous 1,Anonymous 1,Anonymous 1,Anonymous 1,
2,Anonymous 2,Anonymous 4,Anonymous 5,Anonymous 5,Anonymous 5
3,Anonymous 3,Anonymous 3,,Anonymous 6,Anonymous 3
4,,,,,
I'm trying to drop every row (the full row) that matches the regex "^monday.*", meaning the word "monday" followed by any number of other characters. I want to drop/deselect every cell/value within that row.
To achieve this goal, I've tried using the line of code below (and many other approaches I found on SO).
df = df[df[1].str.contains("^monday.*", case = True, regex=True) == False]
To clarify, I'm trying to search the values of column "1" for the pattern "^monday.*" and then deselect the matching rows, including all values in those rows. I've successfully removed "monday09-01-2022", "tuesday10-01-2022", etc., but I'm also losing random names that are not in the matching rows.
Any help would be very much appreciated! Thank you!
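One possible explanation, offered as a sketch rather than a confirmed diagnosis: on many pandas versions, str.contains returns NaN for empty cells, and NaN == False evaluates to False, so rows with an empty cell in column 1 are dropped along with the "monday" row. Passing na=False and negating the mask with ~ avoids that. The small frame below is a hypothetical stand-in for the real data, and the integer column labels 1 and 2 are an assumption.

```python
import pandas as pd

# Hypothetical miniature of the frame in the question; None marks an empty cell.
df = pd.DataFrame({
    1: ["monday09-01-2022", "Anonymous 1", "Anonymous 2", "Anonymous 3", None],
    2: ["tuesday10-01-2022", "Anonymous 1", "Anonymous 4", "Anonymous 3", None],
})

# na=False makes empty cells count as "no match" instead of NaN,
# so they are not removed by accident.
is_header = df[1].str.contains(r"^monday", case=True, regex=True, na=False)

# Keep every row that does NOT start with "monday" in column 1.
filtered = df[~is_header]
print(filtered)
```

Only the row whose column 1 starts with "monday" is removed here; the empty row survives and can be dropped separately with dropna if that is wanted.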
I have a problem splitting the content of one Excel column, which contains numbers and letters, into two columns: the numbers in one column and the letters in the other.
As you can see in the first photo, there is no space between the numbers and the letters, but the good thing is that the letters are always "ms". I need a method to split them as in the second photo.
Before
After
I tried to use replace, but it did not work: it did not split them.
Is there any other method?
You can use the extract method. Here is an example:
import pandas as pd
df = pd.DataFrame({'time': ['34ms', '239ms', '126ms']})
# (\d+) captures the digits, (\D+) captures the non-digit unit that follows
df[['time', 'unit']] = df['time'].str.extract(r'(\d+)(\D+)')
# convert time column into integer
df['time'] = df['time'].astype(int)
print(df)
# output:
#    time unit
# 0    34   ms
# 1   239   ms
# 2   126   ms
It is pretty simple.
You need to use pandas.Series.str.split (see its documentation for the syntax).
The code should be:
import pandas as pd
data_before = {'data' : ['34ms','56ms','2435ms']}
df = pd.DataFrame(data_before)
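# splitting on a captured group keeps the digits as their own piece;
# column 0 is the empty string in front of the number
result = df['data'].str.split(pat=r'(\d+)', expand=True)
# keep only the number (column 1) and the letters (column 2)
result = result.loc[:,[1,2]]
result.rename(columns={1:'number', 2:'string'}, inplace=True)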
Print the result to check the output:
print(result)
First, to explain the dataframe: the values in columns '0-156', '156-234', '234-546' ... '>76830' are the percentage distribution over each range of distances in meters, totaling 100%.
Column 'Cell Name' identifies the element that the other columns describe, and column 'Distance' determines which columns go into the desired sum.
I need to sum the values of those columns among '0-156', '156-234', '234-546' ... '>76830' whose range lies below the value in the 'Distance' (meters) column.
Below is code that creates the dataframe for testing.
import pandas as pd
# initialize list of lists
data = [['Test1',0.36516562,19.065996,49.15094,24.344206,0.49186087,1.24217,5.2812457,0.05841639,0,0,0,0,158.4122868],
['Test2',0.20406325,10.664485,48.70978,14.885571,0.46103176,8.75815,14.200708,2.1162114,0,0,0,0,192.553074],
['Test3',0.13483211,0.6521175,6.124511,41.61725,45.0036,5.405257,1.0494527,0.012979688,0,0,0,0,1759.480042]
]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Cell Name','0-156','156-234','234-546','546-1014','1014-1950','1950-3510','3510-6630','6630-14430','14430-30030','30030-53430','53430-76830','>76830','Distance'])
Example of what should be done:
The value of column 'Distance' is 158.412286772863, so the values of columns '0-156' and '156-234' should be summed, totaling 19.43116162 %.
Thanks so much!
As I understand it, you want to sum up all the percentage values in a row, where the lower value of the column-description (in case of '0-156' it would be 0, in case of '156-234' it would be 156, and so on...) is smaller than the value in the distance column.
First, I would suggest transforming your string column names into numbers, for example:
lowerlimit=df.columns[2]
>>'156-234'
Then read the string only up to the '-' and convert it to a number:
int(lowerlimit[:lowerlimit.find('-')])
>> 156
You can loop this through all your columns and make a new row for the lower limits.
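As a rough sketch of that loop: the helper name lower_limit below is made up for illustration, and treating '>76830' as having a lower limit of 76830 is my assumption about what the last column means.

```python
# Range columns copied from the question's dataframe.
range_columns = ['0-156', '156-234', '234-546', '546-1014', '1014-1950',
                 '1950-3510', '3510-6630', '6630-14430', '14430-30030',
                 '30030-53430', '53430-76830', '>76830']


def lower_limit(label):
    """Return the lower bound encoded in a range label (hypothetical helper)."""
    if label.startswith('>'):
        # Assumption: '>76830' means "76830 and above".
        return int(label[1:])
    # Everything before the first '-' is the lower bound.
    return int(label[:label.find('-')])


lower_limits = [lower_limit(label) for label in range_columns]
print(lower_limits)
```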
To keep it simpler, I left out the first column of your example and added a new first row holding the lower limit of each column, which you could generate as described above. Then this code works:
data = [[0,156,234,546,1014,1950,3510,6630,14430,30030,53430,76830,1e-23],
[0.36516562,19.065996,49.15094,24.344206,0.49186087,1.24217,5.2812457,0.05841639,0,0,0,0,158.4122868],
[0.20406325,10.664485,48.70978,14.885571,0.46103176,8.75815,14.200708,2.1162114,0,0,0,0,192.553074],
[0.13483211,0.6521175,6.124511,41.61725,45.0036,5.405257,1.0494527,0.012979688,0,0,0,0,1759.480042]
]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['0-156','156-234','234-546','546-1014','1014-1950','1950-3510','3510-6630','6630-14430','14430-30030','30030-53430','53430-76830','76830-','Distance'])
df['lastindex']=None
df['sum']=None
After recreating your dataframe, I add the two columns 'lastindex' and 'sum'.
Then, for every row, I look for the last column whose lower limit is below the distance given in that row (df.iloc[i,-3]); afterwards I sum the respective columns in that row.
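import numpy as np

for i in np.arange(1,len(df)):
    # position of the last range whose lower limit (row 0) is below this row's distance
    df.at[i,'lastindex']=np.where(df.iloc[0,:-3]<df.iloc[i,-3])[0][-1]
    df.at[i,'sum']=sum(df.iloc[i][0:df.at[i,'lastindex']+1])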
I hope, this is helpful. Best, lepakk
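For comparison, here is a loop-free sketch of the same idea that works directly on the dataframe from the question. It follows the interpretation above (a range counts when its lower limit is below 'Distance') and assumes '>76830' has a lower limit of 76830; the names range_cols, lower and mask are only for illustration.

```python
import numpy as np
import pandas as pd

data = [['Test1',0.36516562,19.065996,49.15094,24.344206,0.49186087,1.24217,5.2812457,0.05841639,0,0,0,0,158.4122868],
        ['Test2',0.20406325,10.664485,48.70978,14.885571,0.46103176,8.75815,14.200708,2.1162114,0,0,0,0,192.553074],
        ['Test3',0.13483211,0.6521175,6.124511,41.61725,45.0036,5.405257,1.0494527,0.012979688,0,0,0,0,1759.480042]]
columns = ['Cell Name','0-156','156-234','234-546','546-1014','1014-1950','1950-3510',
           '3510-6630','6630-14430','14430-30030','30030-53430','53430-76830','>76830','Distance']
df = pd.DataFrame(data, columns=columns)

# Everything between 'Cell Name' and 'Distance' is a range column.
range_cols = columns[1:-1]
# Lower limit of each range, parsed from the column label.
lower = np.array([int(c.lstrip('>').split('-')[0]) for c in range_cols])

# mask[i, j] is True when the lower limit of range j is below row i's distance.
mask = lower[None, :] < df['Distance'].to_numpy()[:, None]

# Zero out the ranges that do not count, then add up each row.
df['sum'] = (df[range_cols].to_numpy() * mask).sum(axis=1)
print(df[['Cell Name', 'Distance', 'sum']])
```

For 'Test1' this adds '0-156' and '156-234', which matches the 19.43116162 % given in the question.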
I have an Excel file as shown below:
Input File
Now I want to first filter on fruits in the "Items" column and then check which entries of the list in the "list" column are not present in the "Name" column. For example, here "grapes" is not present in the "Name" column, so I want grapes as output in the next column, as shown below.
Expected Output Shown
The same has to be done for many items, filtering on each item one by one.
Please suggest an approach or give some hints so that I can start on this code.
I am naming the Excel file Book1.
import pandas as pd
frame = pd.read_excel("Book1.xlsx")
# split each comma-separated cell of the "list" column into a Python list
frame_list_as_String = frame["list"].tolist()
frame_list = [x.split(',') for x in frame_list_as_String]
frame_Name = frame["Name"].tolist()
frame_col3 = []
for item in frame_list:
    # keep the entries that never occur in the "Name" column
    frame_col3.append(list(set(item) - set(frame_Name)))
frame["col3"] = frame_col3
frame.to_excel("df.xlsx", index=False)
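If the comparison should be done separately for each value in "Items", as the question asks, a per-group variant might look like the sketch below. The sheet was only shown as an image, so the in-memory frame here is entirely hypothetical, as are the helper name missing_names and the assumption that every row of one item carries the same "list" cell.

```python
import pandas as pd

# Hypothetical stand-in for Book1.xlsx; the real layout may differ.
frame = pd.DataFrame({
    "Items": ["fruits", "fruits", "vegetables"],
    "Name": ["apple", "banana", "carrot"],
    "list": ["apple,banana,grapes", "apple,banana,grapes", "carrot,onion"],
})


def missing_names(group):
    """Entries of the group's list that do not occur in its Name column."""
    present = set(group["Name"])
    # Assumption: all rows of one item share the same "list" cell.
    wanted = [w.strip() for w in group["list"].iloc[0].split(",")]
    return [w for w in wanted if w not in present]


# One result per distinct value of "Items".
missing = {item: missing_names(group) for item, group in frame.groupby("Items")}

# Write the result next to every row of the corresponding item.
frame["col3"] = [missing[item] for item in frame["Items"]]
print(frame)
```

Unlike the set difference above, this keeps the order of the entries in "list" and strips spaces around the commas.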
I have a column that's unorganized, like this:
Name
Jack
James
Riddick
Random value
Another random value
What I'm trying to do is get only the names from this column, but I'm struggling to find a way to differentiate real names from random values. Fortunately, the names are all together, and the random values are all together as well. The only thing I can think of is iterating over the rows until I reach 'Random value' and then breaking off.
I've tried using lambdas for this, but with no success, as I don't think there's a way to break. And I'm not sure how a comprehension could work in this case.
Here's the example I've been trying to play with:
df['Name'] = df['Name'].map(lambda x: True if x != 'Random value' else break)
But the above doesn't work. Any suggestions on what could work based on what I'm trying to achieve? Thanks.
Find index of row containing 'Random value':
index_split = df[df.Name == 'Random value'].index.values[0]
Save the random values (the 'Random value' row and everything after it) for later use if you want:
random_values = df.iloc[index_split:]['Name'].values
Remove random values from the Names column:
df = df[0:index_split]
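An alternative sketch that does not rely on index positions: build a boolean mask that switches on at the first 'Random value' row and stays on for the rest of the column. The frame is rebuilt from the example in the question, and the names is_random and names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Jack", "James", "Riddick",
                            "Random value", "Another random value"]})

# cumsum() counts the 'Random value' hits seen so far, so the mask is
# False before the first hit and True from that row onward.
is_random = (df["Name"] == "Random value").cumsum() > 0

names = df[~is_random]         # rows before the marker
random_values = df[is_random]  # the marker row and everything after it
print(names)
```

This behaves like the "break" the lambda was trying to express, and it returns the whole column unchanged if 'Random value' never appears.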