Pandas DataFrame: remove lines starting with a certain character - python-3.x

I have a pandas DataFrame with a lot of text data. I want to remove all lines starting with the "*" mark. To test, I tried a small example like the following:
import re
import pandas as pd

string1 = '''* This needs to be gone
But this line should stay
*remove
* this too
End'''
string2 = '''* This needs to be gone
But this line should stay
*remove
* this too
End'''
df = pd.DataFrame({'a':[string1,string2]})
df['a'] = df['a'].map(lambda a: (re.sub(r'(?m)^\*.*\n?', '', a, flags=re.MULTILINE)))
It does the job perfectly. However, when I apply the same function to my original DataFrame it does not work. Can you help me identify the issue?
df2['NewsText'] = df2['NewsText'].map(lambda a: (re.sub(r'(?m)^\*.*\n?', '', a, flags=re.MULTILINE)))
df2.head()
Please see the attached image of my original DataFrame.

Given your example data:
- .str.split('\n') splits each cell into a list of lines.
- .apply(lambda x: '\n'.join([y for y in x if '*' not in y])) uses a list comprehension to drop every line containing '*' and then joins the rest back into a single string.
- You can also join with ' '.join or ''.join.
- Use .apply(lambda x: [y for y in x if '*' not in y]) if you want a list instead of one long string.
| | a |
|---:|:--------------------------|
| 0 | * This needs to be gone |
| | But this line should stay |
| | *remove |
| | * this too |
| | End |
| 1 | * This needs to be gone |
| | But this line should stay |
| | *remove |
| | * this too |
| | End |
# remove sections with '*'
df['a'] = df['a'].str.split('\n').apply(lambda x: '\n'.join([y for y in x if '*' not in y]))
# final
| | a |
|---:|:--------------------------|
| 0 | But this line should stay |
| | End |
| 1 | But this line should stay |
| | End |
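As a side note (my own addition, not part of the answer above): the regex from the question can also be applied in a vectorized way with .str.replace, which propagates non-string cells (e.g. NaN) instead of raising the TypeError that .map with re.sub would produce on them. A minimal sketch:
import pandas as pd

df = pd.DataFrame({'a': [string1, string2]})  # string1/string2 as defined in the question
# Drop every line that starts with '*'; (?m) makes ^ match at each line start.
# Non-string cells (e.g. NaN) are passed through as NaN by the .str accessor.
df['a'] = df['a'].str.replace(r'(?m)^\*.*\n?', '', regex=True)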

Related

Split column on condition in dataframe

The data frame I am working on has a column named "Phone" and I want to split it on "/" or "," so that I get the data frame shown below, with the pieces in separate columns. For example, the first row is 0674-2537100/101 and I want to split it on "/" into two columns with the values 0674-2537100 and 0674-2537101.
Input:
+-------------------------------+
| Phone |
+-------------------------------+
| 0674-2537100/101 |
| 0674-2725627 |
| 0671 – 2647509 |
| 2392229 |
| 2586198/2583361 |
| 0663-2542855/2405168 |
| 0674 – 2563832/0674-2590796 |
| 0671-6520579/3200479 |
+-------------------------------+
Output:
+-----------------------------------+
| Phone | Phone1 |
+-----------------------------------+
| 0674-2537100 | 0674-2537101 |
| 0674-2725627 | |
| 0671 – 2647509 | |
| 2392229 | |
| 2586198 | 2583361 |
| 0663-2542855 | 0663-2405168 |
| 0674 – 2563832 | 0674-2590796 |
| 0671-6520579 | 0671-3200479 |
+-----------------------------------+
Here is the solution I came up with: take the length of the strings on both sides of the separator (/), take their difference, and copy the substring of the first column from character position [:difference-1] into the second column.
So far my progress is,
df['Phone'] = df['Phone'].str.replace(' ', '')
df['Phone'] = df['Phone'].str.replace('–', '-')
df[['Phone','Phone1']] = df['Phone'].str.split("/",expand=True)
df["Phone1"].fillna(value=np.nan, inplace=True)
m2 = (df["Phone1"].str.len() < 12) & (df["Phone"].str.len() > 7)
m3 = df["Phone"].str.len() - df["Phonenew"].str.len()
df.loc[m2, "Phone1"] = df["Phone"].str[:m3-1] + df["Phonenew"]
It gives an error and the column has only NaN values after I run this. Please help me out here.
Assuming you're only going to have at most two values (i.e. a single '/') in the 'Phone' column, here's what you can do:
import pandas as pd

def split_phone_number(row):
    '''
    This function takes in a row of the dataframe and returns the row with
    appropriate values in its 'Phone' and 'Phone1' columns.
    '''
    split_str = row['Phone'].split('/')
    # Considering that you're only going to have 2 or fewer values, update
    # the passed row's columns with the appropriate values.
    if len(split_str) > 1:
        row['Phone'] = split_str[0]
        row['Phone1'] = split_str[1]
    else:
        row['Phone'] = split_str[0]
        row['Phone1'] = ''
    # Return the updated row.
    return row

# Making a dummy dataframe.
d = {'Phone': ['0674-2537100/101', '0674-257349', '0671-257349', '257349',
               '257349/100', '101/100', '5688343/438934']}
dataFrame = pd.DataFrame(data=d)
# Considering you're only going to have one extra column, add that column to the dataframe.
dataFrame = dataFrame.assign(Phone1=['' for i in range(dataFrame.shape[0])])
# Applying the split_phone_number function to the dataframe.
dataFrame = dataFrame.apply(split_phone_number, axis=1)
# Printing the dataframe.
print(dataFrame)
Input:
+---------------------+
| Phone |
+---------------------+
| 0 0674-2537100/101 |
| 1 0674-257349 |
| 2 0671-257349 |
| 3 257349 |
| 4 257349/100 |
| 5 101/100 |
| 6 5688343/438934 |
+---------------------+
Output:
+----------------------------+
| Phone Phone1 |
+----------------------------+
| 0 0674-2537100 101 |
| 1 0674-257349 |
| 2 0671-257349 |
| 3 257349 |
| 4 257349 100 |
| 5 101 100 |
| 6 5688343 438934 |
+----------------------------+
For further reading:
dataframe.apply()
Hope this helps. Cheers!
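As an addendum (my own sketch, not part of the answer above): a vectorized version that also restores the missing area-code prefix on the second number, assuming each cell holds at most one '/':
import pandas as pd

df = pd.DataFrame({'Phone': ['0674-2537100/101', '0674-2725627',
                             '2586198/2583361', '0663-2542855/2405168']})

# Normalise spaces and dash variants, then split on '/'.
clean = (df['Phone'].str.replace(' ', '', regex=False)
                    .str.replace('–', '-', regex=False))
parts = clean.str.split('/', expand=True).rename(columns={0: 'Phone', 1: 'Phone1'})

# Where the second number is shorter than the first, borrow the missing
# leading characters (the area-code prefix) from the first number.
missing = (parts['Phone'].str.len() - parts['Phone1'].str.len()).fillna(0).astype(int)
prefix = pd.Series([p[:n] for p, n in zip(parts['Phone'], missing)], index=parts.index)
needs_prefix = parts['Phone1'].notna() & (missing > 0)
parts.loc[needs_prefix, 'Phone1'] = prefix[needs_prefix] + parts.loc[needs_prefix, 'Phone1']
print(parts)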

Explode date interval over a group by and take last value in pyspark

I have a dataframe which contains some products, a date and a value. The dates have gaps of varying size between recorded values, and I want to fill those in so that there is a value for every hour from the first time a product was seen to the last; where there is no record I want to carry forward the latest recorded value.
So, I have a dataframe like:
| ProductId | Date | Value |
|-----------|-------------------------------|-------|
| 1 | 2020-03-12T00:00:00.000+0000 | 4 |
| 1 | 2020-03-12T01:00:00.000+0000 | 2 |
| 2 | 2020-03-12T01:00:00.000+0000 | 3 |
| 2 | 2020-03-12T03:00:00.000+0000 | 4 |
| 1 | 2020-03-12T05:00:00.000+0000 | 4 |
| 3 | 2020-03-12T05:00:00.000+0000 | 2 |
I want to create a new dataframe that looks like:
| ProductId | Date | Value |
|-----------|-------------------------------|-------|
| 1 | 2020-03-12T00:00:00.000+0000 | 4 |
| 1 | 2020-03-12T01:00:00.000+0000 | 2 |
| 1 | 2020-03-12T02:00:00.000+0000 | 2 |
| 1 | 2020-03-12T03:00:00.000+0000 | 2 |
| 1 | 2020-03-12T04:00:00.000+0000 | 2 |
| 1 | 2020-03-12T05:00:00.000+0000 | 4 |
| 2 | 2020-03-12T01:00:00.000+0000 | 3 |
| 2 | 2020-03-12T02:00:00.000+0000 | 3 |
| 2 | 2020-03-12T03:00:00.000+0000 | 4 |
| 3 | 2020-03-12T05:00:00.000+0000 | 2 |
My code so far:
def generate_date_series(start, stop):
    start = datetime.strptime(start, "yyyy-MM-dd'T'HH:mm:ss.SSSZ")
    stop = datetime.strptime(stop, "yyyy-MM-dd'T'HH:mm:ss.SSSZ")
    return [start + datetime.timedelta(hours=x) for x in range(0, (stop-start).hours + 1)]

spark.udf.register("generate_date_series", generate_date_series, ArrayType(TimestampType()))

df = df.withColumn("max", max(col("Date")).over(Window.partitionBy("ProductId"))) \
       .withColumn("min", min(col("Date")).over(Window.partitionBy("ProductId"))) \
       .withColumn("Dato", explode(generate_date_series(col("min"), col("max"))) \
       .over(Window.partitionBy("ProductId").orderBy(col("Dato").desc())))

window_over_ids = (Window.partitionBy("ProductId").rangeBetween(Window.unboundedPreceding, -1).orderBy("Date"))

df = df.withColumn("Value", last("Value", ignorenulls=True).over(window_over_ids))
Error:
TypeError: strptime() argument 1 must be str, not Column
So the first question is obviously how do I create and call the udf correctly so I don't run into the above error.
The second question is how do I complete the task, such that I get my desired dataframe?
After some searching and experimenting I found a solution: I defined a UDF that returns the date range between two dates at 1-hour intervals, and then I do a forward fill.
I fixed the issue with the following code:
import sys
from datetime import timedelta
from pyspark.sql.functions import col, explode, lag, last, lit, udf
from pyspark.sql.types import ArrayType, TimestampType
from pyspark.sql.window import Window

def missing_hours(t1, t2):
    # whole hours strictly between the previous timestamp and the current one
    return [t1 + timedelta(hours=x) for x in range(1, int((t2 - t1).total_seconds() / 3600))]

missing_hours_udf = udf(missing_hours, ArrayType(TimestampType()))
window = Window.partitionBy("ProductId").orderBy("Date")

# build the gap rows with a null Value, one per missing hour
df_missing = df.withColumn("prev_timestamp", lag(col("Date"), 1, None).over(window)) \
    .filter(col("prev_timestamp").isNotNull()) \
    .withColumn("Date", explode(missing_hours_udf(col("prev_timestamp"), col("Date")))) \
    .withColumn("Value", lit(None)) \
    .drop("prev_timestamp")

# union the original rows with the generated gap rows
df = df.union(df_missing)

window = Window.partitionBy("ProductId").orderBy("Date") \
    .rowsBetween(-sys.maxsize, 0)
# define the forward-filled column
filled_values_column = last(df['Value'], ignorenulls=True).over(window)
# do the fill
df = df.withColumn('Value', filled_values_column)
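As an aside (not part of the answer above): on Spark 2.4+ the same result can be produced without a Python UDF, because the built-in sequence() function can generate the hourly timestamps directly. A sketch, assuming the frame has the ProductId/Date/Value columns shown above:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("ProductId")
# One row per ProductId and hour between that product's first and last Date.
hours = (df
         .withColumn("min_ts", F.min("Date").over(w))
         .withColumn("max_ts", F.max("Date").over(w))
         .select("ProductId",
                 F.explode(F.expr("sequence(min_ts, max_ts, interval 1 hour)")).alias("Date"))
         .distinct())

# Join the real observations back on and forward-fill the gaps.
fill_w = (Window.partitionBy("ProductId").orderBy("Date")
                .rowsBetween(Window.unboundedPreceding, 0))
result = (hours.join(df, ["ProductId", "Date"], "left")
               .withColumn("Value", F.last("Value", ignorenulls=True).over(fill_w)))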

Pandas merge gives two different results with the same code and input data

I have two dataframes to merge. When I run the program with the same input data and code, there are two possible outcomes: sometimes the merge succeeds, and sometimes the columns that come from 'annotate' are NaN in the merged data.
raw_df2 = pd.merge(annotate,raw_df,on='gene',how='right').fillna("unkown")
Then I ran a test:
count = 10001
while count > 10000:
    raw_df2 = pd.merge(annotate, raw_df, on='gene', how='right').fillna("unkown")
    count = len(raw_df2[raw_df2["type"] == "unkown"])
    print(count)
If the merge fails, it always fails for "raw_df" during that run. I must resubmit the script, and then the result may be successful.
[The first two columns are from 'annotate'; the others are from 'raw_df'.]
The failed result:
+--------+---------------+--------------------------+----------+----------+--------+---------+----------+
| type   | gene          | locus                    | sample_1 | sample_2 | status | value_1 | value_2  |
+--------+---------------+--------------------------+----------+----------+--------+---------+----------+
| unknow | 0610040J01Rik | chr5:63812494-63899619   | Ctrl     | SPION10  | OK     | 2.02125 | 0.652688 |
| unknow | 1110008F13Rik | chr2:156863121-156887078 | Ctrl     | SPION10  | OK     | 87.7115 | 49.8795  |
+--------+---------------+--------------------------+----------+----------+--------+---------+----------+
The successful result:
+--------+----------+------------------------+----------+----------+--------+----------+---------+
| gene | type | locus | sample_1 | sample_2 | status | value_1 | value_2 |
+--------+----------+------------------------+----------+----------+--------+----------+---------+
| St18 | misc_RNA | chr1:6487230-6860940 | Ctrl | SPION10 | OK | 1.90988 | 3.91643 |
| Arid5a | misc_RNA | chr1:36307732-36324029 | Ctrl | SPION10 | OK | 1.33796 | 2.21057 |
| Carf | misc_RNA | chr1:60076867-60153953 | Ctrl | SPION10 | OK | 0.846988 | 1.47619 |
+--------+----------+------------------------+----------+----------+--------+----------+---------+
I have a solution, but I still don't know what caused the previous problem.
Set the column that I want to merge on as the index in both dataframes, then merge the two dataframes on the index.
I have run the script more than 10 times since, and the result is no longer wrong.
# the first dataframe
DataQiime = pd.read_csv(args.FileTranseq, header=None, sep=',')
DataQiime.columns = ['Feature.ID', 'Frequency']
DataQiime_index = DataQiime.set_index('Feature.ID', inplace=False, drop=True)
# the second dataframe
DataTranseq = pd.read_table(args.FileQiime, header=0, sep='\t', encoding='utf-8')
DataTranseq_index = DataTranseq.set_index('Feature.ID', inplace=False, drop=True)
# merge the two dataframes by their shared index
DataMerge = pd.merge(DataQiime_index, DataTranseq_index,
                     left_index=True, right_index=True, how="inner")
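For reference, a minimal sketch of the same index-based pattern applied to the frames from the original question (names taken from the question; my illustration, not tested against that data):
# 'gene' becomes the merge key via the index in both frames.
annotate_idx = annotate.set_index('gene')
raw_idx = raw_df.set_index('gene')

# 'unkown' is spelled as in the question's own code.
raw_df2 = (annotate_idx.merge(raw_idx, left_index=True, right_index=True, how='right')
                       .fillna('unkown')
                       .reset_index())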

PySpark getting distinct values over a wide range of columns

I have data with a large number of custom columns, the content of which I poorly understand. The columns are named evar1 to evar250. What I'd like to get is a single table with all distinct values, and a count how often these occur and the name of the column.
------------------------------------------------
| columnname | value | count |
|------------|-----------------------|---------|
| evar1 | en-GB | 7654321 |
| evar1 | en-US | 1234567 |
| evar2 | www.myclient.com | 123 |
| evar2 | app.myclient.com | 456 |
| ...
The best way I can think of doing this feels terrible, as I believe I have to read the data once per column (there are actually about 400 such columns).
from pyspark.sql import functions as fn

i = 1
df_evars = None
while i <= 30:
    colname = "evar" + str(i)
    df_temp = df.groupBy(colname).agg(fn.count("*").alias("rows")) \
        .withColumn("colName", fn.lit(colname))
    if df_evars:
        df_evars = df_evars.union(df_temp)
    else:
        df_evars = df_temp
    i += 1

display(df_evars)
Am I missing a better solution?
Update
This has been marked as a duplicate but the two responses IMO only solve part of my question.
I am looking at potentially very wide tables with potentially a large number of values. I need a simple output (i.e. 3 columns that show the source column, the value, and the count of that value in the source column).
The first of the responses only gives me an approximation of the number of distinct values, which is pretty useless to me.
The second response seems less relevant than the first. To clarify, source data like this:
-----------------------
| evar1 | evar2 | ... |
|---------------|-----|
| A | A | ... |
| B | A | ... |
| B | B | ... |
| B | B | ... |
| ...
Should result in the output
--------------------------------
| columnname | value | count |
|------------|-------|---------|
| evar1 | A | 1 |
| evar1 | B | 3 |
| evar2 | A | 2 |
| evar2 | B | 2 |
| ...
Using melt borrowed from here:
from pyspark.sql.functions import col

melt(
    df.select([col(c).cast("string") for c in df.columns]),
    id_vars=[], value_vars=df.columns
).groupBy("variable", "value").count()
Adapted from the answer by user6910411.
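The melt helper itself is not shown in the snippet; a sketch of the commonly shared implementation (my reconstruction, so treat the details as an assumption rather than the exact code behind the link):
from pyspark.sql import DataFrame
from pyspark.sql.functions import array, col, explode, lit, struct

def melt(df: DataFrame, id_vars, value_vars,
         var_name="variable", value_name="value") -> DataFrame:
    """Convert wide columns into (variable, value) rows, like pandas.melt."""
    # One struct per value column: (column name, column value).
    vars_and_vals = array(*(
        struct(lit(c).alias(var_name), col(c).alias(value_name))
        for c in value_vars))
    # Explode so each original row yields one row per value column.
    tmp = df.withColumn("_vars_and_vals", explode(vars_and_vals))
    cols = id_vars + [col("_vars_and_vals")[x].alias(x) for x in [var_name, value_name]]
    return tmp.select(*cols)
With that helper in scope, the groupBy("variable", "value").count() call above yields the three-column (column name, value, count) shape requested; rename "variable" to "columnname" if desired.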

Grouping and numbering items in a pandas dataframe

I want to add a column to a dataframe in python/pandas as follows:
| MarketID | SelectionID | Time | SelectNumber |
|-----------|-------------|----------|--------------|
| 112337406 | 3819251.0 | 13:38:32 | 4 |
| 112337406 | 3819251.0 | 13:39:03 | 4 |
| 112337406 | 4979206.0 | 11:29:34 | 1 |
| 112337406 | 4979206.0 | 11:37:34 | 1 |
| 112337406 | 5117439.0 | 13:36:32 | 3 |
| 112337406 | 5117439.0 | 13:37:03 | 3 |
| 112337406 | 5696467.0 | 13:23:03 | 2 |
| 112337406 | 5696467.0 | 13:23:33 | 2 |
| 112337407 | 3819254.0 | 13:39:12 | 4 |
| 112337407 | 4979206.0 | 11:29:56 | 1 |
| 112337407 | 4979206.0 | 16:27:34 | 1 |
| 112337407 | 5117441.0 | 13:36:54 | 3 |
| 112337407 | 5117441.0 | 17:47:11 | 3 |
| 112337407 | 5696485.0 | 13:23:04 | 2 |
| 112337407 | 5696485.0 | 18:23:59 | 2 |
I currently have the MarketID, SelectionID and Time; I want to generate the SelectNumber column, which represents the time order in which a particular SelectionID first appears within a particular MarketID. Once numbered, all other occurrences of the same SelectionID within that MarketID need to get the same number. The MarketID will always be unique, but the same SelectionID can appear in more than one MarketID.
This has got me stumped, any ideas?
First, you need the combinations of 'MarketID' and 'SelectionID' in order of occurrence, so let's sort on the time.
Then, for each 'MarketID', get the unique 'SelectionID's and number them in order of occurrence (already ordered, because the dataframe is now sorted on the Time column). The combination of 'MarketID' and 'SelectionID', together with that order, will be used later to set the numbers.
I'll give you two solutions to the first part:
dfnewindex = df.sort_values('Time').set_index('MarketID')
valuesetter = {}
for indx in dfnewindex.index.unique():
    selectionid_per_marketid = dfnewindex.loc[indx].sort_values('Time')['SelectionID'].drop_duplicates().values
    valuesetter.update(dict(zip(zip(len(selectionid_per_marketid)*[indx], selectionid_per_marketid),
                                range(1, 1 + len(selectionid_per_marketid)))))
100 loops, best of 3: 3.22 ms per loop
df_sorted = df.sort_values('Time')
valuesetter = {}
for mrktid in df_sorted['MarketID'].unique():
    sltnids = df_sorted[df_sorted['MarketID'] == mrktid]['SelectionID'].drop_duplicates(keep='first').values
    valuesetter.update(dict(zip(zip(len(sltnids)*[mrktid], sltnids), range(1, 1 + len(sltnids)))))
100 loops, best of 3: 2.59 ms per loop
The boolean slicing solution is slightly faster in this case.
The output:
valuesetter
{(112337406, 3819251.0): 4,
(112337406, 4979206.0): 1,
(112337406, 5117439.0): 3,
(112337406, 5696467.0): 2,
(112337407, 3819254.0): 4,
(112337407, 4979206.0): 1,
(112337407, 5117441.0): 3,
(112337407, 5696485.0): 2}
For the second part, this dict is used to generate the column, i.e. SelectNumber. Again two solutions: the first uses a multiindex, the second a groupby:
map(lambda x: valuesetter[x], df.set_index(['MarketID', 'SelectionID']).index.values)
1000 loops, best of 3: 1.23 ms per loop
map(lambda x: valuesetter[x], df.groupby(['MarketID', 'SelectionID']).count().index.values)
1000 loops, best of 3: 1.59 ms per loop
The multiindex seems to be the fastest solution.
The final, up to this point, fastest answer:
df_sorted = df.sort_values('Time')
valuesetter2 = {}
for mrktid in df_sorted['MarketID'].unique():
    sltnids = df_sorted[df_sorted['MarketID'] == mrktid]['SelectionID'].drop_duplicates(keep='first').values
    valuesetter2.update(dict(zip(zip(len(sltnids)*[mrktid], sltnids), range(1, 1 + len(sltnids)))))
df_sorted['SelectNumber'] = list(map(lambda x: valuesetter2[x],
                                     df_sorted.set_index(['MarketID', 'SelectionID']).index.values))
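For comparison, a shorter pandas-native sketch (my own addition, not part of the answer above), assuming 'Time' sorts chronologically (zero-padded HH:MM:SS strings or real timestamps):
df_sorted = df.sort_values('Time')
# First appearance of each SelectionID within each MarketID, numbered in time order.
order = (df_sorted.drop_duplicates(['MarketID', 'SelectionID'])
                  .assign(SelectNumber=lambda d: d.groupby('MarketID').cumcount() + 1))
# Broadcast the number back onto every row of the original frame.
df = df.merge(order[['MarketID', 'SelectionID', 'SelectNumber']],
              on=['MarketID', 'SelectionID'], how='left')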
