get label of an index in pandas multiindex dataframe - python-3.x

I have a dataframe df:
                 c1   c2
name    sample
person1 a1      aaa  AAA
        b1      bbb  BBB
        c1      ccc  CCC
person2 d1      ...
I want to iterate through the dataframe, one person at a time, and check whether the values in the columns match a criterion. On a match, I'd like to extract the label of that row's index (at level 1, as a string) and build a set of all such labels. So if my criterion is column value == "bbb", I'd like to get "b1".
The following produces almost what I want, but it returns a set of generator objects rather than the label strings.
index_set = set()
for person, new_df in df.groupby(level=0):
    idx = new_df.index.get_level_values(1).tolist()
    index_set.add(x for x in idx)
which produces something like {<generator object <genexpr> at 0x0000022F6F05D200>, <generator object <genexpr> at 0x0000022F6F05D410>, ...}.
So how to make it produce something like {"b1", "f1", "h1",...} instead?
And another question: when iterating through df by creating new_df, the index names don't seem to transfer to new_df. Can this be avoided somehow? The code would be more readable if I could refer to the index as get_level_values('sample') rather than get_level_values(1).

The add method of a set adds a single element; in your case it adds the generator object itself. You could abuse a list comprehension to add the items one by one ([index_set.add(x) for x in idx]), but the correct way is the update method:
index_set.update(idx)
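A minimal runnable sketch putting the criterion check and label collection together, using made-up data shaped like the question's dataframe. Note that in recent pandas versions the level names do carry over to the group frames, so get_level_values('sample') works directly:
import pandas as pd

# Made-up data shaped like the question's dataframe
df = pd.DataFrame(
    {"c1": ["aaa", "bbb", "ccc", "ddd"], "c2": ["AAA", "BBB", "CCC", "DDD"]},
    index=pd.MultiIndex.from_tuples(
        [("person1", "a1"), ("person1", "b1"), ("person1", "c1"), ("person2", "d1")],
        names=["name", "sample"]))

index_set = set()
for person, new_df in df.groupby(level="name"):
    # Filter the rows matching the criterion, then add their level-1 labels.
    matches = new_df[new_df["c1"] == "bbb"]
    index_set.update(matches.index.get_level_values("sample"))

print(index_set)  # {'b1'}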

Related

How to replace text in column by the value contained in the columns named in this text

In PySpark, I'm trying to replace multiple text tokens in a column with the values of the columns whose names appear in the calc column (a formula).
To be clear, here is an example:
Input:
|param_1|param_2|calc           |
|-------|-------|---------------|
|Cell 1 |Cell 2 |param_1-param_2|
|Cell 3 |Cell 4 |param_2/param_1|
Output needed:
|param_1|param_2|calc         |
|-------|-------|-------------|
|Cell 1 |Cell 2 |Cell 1-Cell 2|
|Cell 3 |Cell 4 |Cell 4/Cell 3|
In the calc column, the value is a formula. It can be as simple as the ones above, or something like "2*(param_8-param_4)/param_2-(param_3/param_7)".
What I'm looking for is a way to substitute each param_x with the value of the column of that name.
I've tried a lot of things, but nothing works; most of the time, when I use replace or regexp_replace with a column as the replacement value, I get the error "Column is not iterable".
Moreover, the columns param_1, param_2, ..., param_x are generated dynamically, and a calc value may reference some of these columns but not necessarily all of them.
Could you help me with a dynamic solution?
Thank you so much.
Best regards
Update: It turned out I had misunderstood the requirement. This would work:
for exp in ["regexp_replace(calc, '" + col + "', " + col + ")" for col in df.schema.names]:
    df = df.withColumn("calc", F.expr(exp))
Yet another update: to handle null values, add coalesce:
for exp in ["coalesce(regexp_replace(calc, '" + col + "', " + col + "), calc)" for col in df.schema.names]:
    df = df.withColumn("calc", F.expr(exp))
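For reference, a self-contained sketch of this loop on the sample data from the question (the SparkSession setup and the data are assumptions made for illustration):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Made-up data mirroring the Input table above
df = spark.createDataFrame(
    [("Cell 1", "Cell 2", "param_1-param_2"),
     ("Cell 3", "Cell 4", "param_2/param_1")],
    ["param_1", "param_2", "calc"])

# Replace each column-name token in calc with that column's value;
# coalesce keeps the current calc when a column is null. Iterating over
# "calc" itself is harmless here because the values never contain "calc".
for col in df.schema.names:
    df = df.withColumn(
        "calc",
        F.expr(f"coalesce(regexp_replace(calc, '{col}', {col}), calc)"))

df.show(truncate=False)  # calc becomes "Cell 1-Cell 2" and "Cell 4/Cell 3"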
------- Keeping the below section for a while just for reference -------
You can't do that directly, because you can't use a column's value as code unless you collect it into a Python object (which is generally not recommended).
This would work for the same input:
from pyspark.sql import Window, functions as F

df = spark.createDataFrame([["1", "2", "param_1 - param_2"],
                            ["3", "4", "2*param_1 + param_2"]]).toDF("param_1", "param_2", "calc")
df.show()
df = df.withColumn("row_num", F.row_number().over(Window.orderBy(F.lit("dummy"))))
as_dict = {row.asDict()["row_num"]: row.asDict()["calc"]
           for row in df.select("row_num", "calc").collect()}
expression = f"""CASE {' '.join([f"WHEN row_num = '{k}' THEN ({v})" for k, v in as_dict.items()])}
    ELSE NULL END"""
df.withColumn("Result", F.expr(expression)).show()

Advice on populating a dataframe based on an existing one

I'm seeking advice on populating a dataframe in pandas. I've created a dataframe that looks like A.
Eventually, however, it should look something like B.
Could anyone suggest how to create a dataframe like B on top of A, given the relevant data values?
Any comments or suggestions are highly appreciated.
I assume you have two dataframes that look like A: one for feature F1 and one for F2.
Then you can create B like this:
a1 = ...  # assuming a1, a2 already have the correct index A, B, C as depicted
a2 = ...
a1['Features'] = "F1"
a2['Features'] = "F2"
b = (pd.concat([a1, a2], axis=0)
       .set_index("Features", append=True)
       # Swing the new index level - Features - around to become a column level instead.
       .unstack("Features"))
You've named the column level "Features" but I'd suggest using "Feature" instead, if you can.
There is also an alternate way to do the same thing, also seen in this question: How to make dataframe behave such as pandas_datareader
(pd.concat([a1, a2], axis='columns', keys=pd.Index(["F1", "F2"], name="Features"))
   # Swap the hierarchy order of the column levels.
   .swaplevel(-2, -1, axis=1)
   # Restore the column order of a1 - assuming a1 and a2 have the same columns.
   .reindex(columns=a1.columns, level=0)
)
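A minimal runnable sketch of the first approach, with made-up values for a1 and a2 (both assumed to share the index A, B, C and the same columns):
import pandas as pd

idx = pd.Index(["A", "B", "C"])
a1 = pd.DataFrame({"X": [1, 2, 3], "Y": [4, 5, 6]}, index=idx)     # feature F1
a2 = pd.DataFrame({"X": [7, 8, 9], "Y": [10, 11, 12]}, index=idx)  # feature F2

a1["Features"] = "F1"
a2["Features"] = "F2"

b = (pd.concat([a1, a2], axis=0)
       .set_index("Features", append=True)
       .unstack("Features"))  # columns become a (column, Features) MultiIndex
print(b)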

To extract street number from street address using regex from a dataframe in python

d1 = dataset['End station'].head(20)
for x in d1:
    x = re.compile("[0-9]{5}")
    print(d1)
Using
dataset['End_Station'] = dataset['End station'].map(lambda x: re.compile("([0-9]{5})").search(x).group())
raises TypeError: expected string or bytes-like object.
I am new to data analysis and can't think of any other methods.
Pandas has its own methods for regexes, so the "more pandasonic" way to write the code is to use them instead of the native re methods.
Consider this example of source data:
                      End station
0    4055 Johnson Street, Chicago
1  203 Mayflower Avenue, Columbus
To find the street No in the above addresses, run:
df['End station'].str.extract(r'(?P<StreetNo>\d{1,5})')
and you will get:
  StreetNo
0     4055
1      203
Note also that the street number may be shorter than 5 digits, yet you attempt to match a sequence of exactly 5 digits.
Another weird point in your code: why do you compile a regex in a loop and then make no use of it?
Edit
After a more thorough look at your code I have a couple of additional remarks.
When you write:
for x in df:
...
then the loop actually iterates over the column names (not the rows).
Another weird point in your code: the variable x, used initially to hold a column name, is reused to hold a compiled regex.
That is a bad habit. Each variable should hold one clearly defined object.
And as far as iteration over rows is concerned, you can use e.g.
for idx, row in df.iterrows():
    ...
But note that iterrows returns pairs composed of:
- the index of the current row,
- the row itself (as a Series).
Then, in the loop, you will probably refer to individual columns of this row.
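A minimal sketch tying both approaches together on made-up data (the column name 'End station' is taken from the question):
import pandas as pd

df = pd.DataFrame({"End station": ["4055 Johnson Street, Chicago",
                                   "203 Mayflower Avenue, Columbus"]})

# Vectorised: extract the leading 1-5 digits of each address.
df["StreetNo"] = df["End station"].str.extract(r"^(\d{1,5})", expand=False)

# Row-wise alternative (slower), just to illustrate iterrows.
for idx, row in df.iterrows():
    print(idx, row["StreetNo"])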

Transforming multiple data frame columns into one series

I have a dataset df of shape (250, 3): 250 rows and three columns. I want to write a loop that merges the content of each column so that I end up with a single series df_single of 250 rows. The manual operation is the following:
df_single = df['colour']+" "+df['model']+" "+df['size']
How can I create df_single with a for loop, or non-manually?
I tried to write this code, which raises a TypeError:
df_conc = []
for var in cols:
    cat_list = df_code_part[var]
    df_conc = df_conc + " " + cat_list
TypeError: can only concatenate list (not "str") to list
I think if you need to join 3 columns, then your solution is really good:
df_single = df['colour']+" "+df['model']+" "+df['size']
If you need a general solution for many columns, use DataFrame.astype to convert to strings if necessary, DataFrame.add to append a whitespace, sum to concatenate, and finally Series.str.rstrip to remove the trailing whitespace:
cols = ['colour','model','size']
df_single = df[cols].astype(str).add(' ').sum(axis=1).str.rstrip()
Or:
df_single = df[cols].astype(str).apply(' '.join, axis=1)
If you want to have spaces between columns, run:
df.apply(' '.join, axis=1)
"Ordinary" df.sum(axis=1) concatenates all columns, but without
spaces between them.
If you want to use sum, you need:
df_single=df.astype(str).add(' ').sum(axis=1).str.rstrip()
If you don't want to concatenate all the columns, you need to select the relevant ones first:
columns=['colour','model','size']
df_single=df[columns].astype(str).add(' ').sum(axis=1).str.rstrip()
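For illustration, a quick run of both variants on made-up data (column names taken from the question):
import pandas as pd

df = pd.DataFrame({"colour": ["red", "blue"],
                   "model": ["A", "B"],
                   "size": [10, 12]})

cols = ["colour", "model", "size"]
print(df[cols].astype(str).add(" ").sum(axis=1).str.rstrip())
print(df[cols].astype(str).apply(" ".join, axis=1))
# Both print: 0 -> "red A 10", 1 -> "blue B 12"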

Comparing strings in same series (row) but different columns

I ran into this problem when comparing strings between two columns. What I want to do is: for each row, check whether the string in column A is contained in column B and, if so, write the string 'Yes' in column C.
Column A contains NaN values (blank cells in the csv I imported).
I have tried:
df['C']=df['B'].str.contains(df.loc['A'])
df.loc[df['A'].isin(df['B']), 'C']='Yes'
Neither worked, as I couldn't find the right way to compare the strings.
This uses list comprehension, so it may not be the fastest solution, but works and is concise.
df['C'] = pd.Series(['Yes' if a in b else 'No' for a,b in zip(df['A'],df['B'])])
EDIT: If you want to keep the existing values in C instead of overwriting them with 'No', you can do it like this:
df['C'] = pd.Series(['Yes' if a in b else c for a,b,c in zip(df['A'],df['B'], df['C'])])
import numpy as np
import pandas as pd

df = pd.DataFrame([['ab', 'abc'],
                   ['abc', 'ab']], columns=list('AB'))
df['C'] = np.where(df.apply(lambda x: x.A in x.B, axis=1), 'Yes', 'No')
df
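Since column A contains NaN (per the question), a in b would raise a TypeError on those rows. A small sketch, with made-up data, that treats NaN as a non-match:
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["ab", np.nan, "xy"],
                   "B": ["abc", "def", "yz"]})

# isinstance guards against NaN (a float) before the substring test.
mask = df.apply(lambda r: isinstance(r.A, str) and r.A in r.B, axis=1)
df["C"] = np.where(mask, "Yes", "No")
print(df)  # C: Yes / No / No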
Try regex (https://docs.python.org/2/library/re.html) if you have already written the code that identifies every cell or value you have to work with.
