Iterate over a column for a specific value and insert 1 if found or 0 if not found in a new column (Python)

I have a DataFrame as shown in the attached image. My columns of interest are fgr and fgr1. As you can see, they both contain values corresponding to years.
I want to iterate over the two columns and, for any value present, I want 1 if the value is present or else 0.
For example, in fgr the first value is 2028. So, the first row in column 2028 will have a value 1 and all other columns have value 0.
I tried using lookup but I did not succeed. So, any pointers will be really helpful.
Example dataframe
Data:
Data file in Excel

This will do the job. You can use for loops as well, but I think this approach will be faster.
df["Matched"] = df["fgr"].isin(df["fgr1"])*1
Basically, you check whether the values from one column are in another column, which gives you True or False. You then multiply by 1 to get 1 and 0 instead of True and False.
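A minimal sketch of what that line does, using made-up integer years (the real column contents are not shown in the question, so the values below are assumptions). Note that isin tests each fgr value against every value in fgr1, not only the value in the same row:

```python
import pandas as pd

# Hypothetical sample data; the actual fgr/fgr1 values are not shown in the question
df = pd.DataFrame({"fgr": [2028, 2030, 2031], "fgr1": [2030, 2028, 2029]})

# True where the fgr value occurs anywhere in fgr1, then cast to 1/0
df["Matched"] = df["fgr"].isin(df["fgr1"]).astype(int)
print(df)
```

Here astype(int) is used instead of multiplying by 1; both turn the boolean mask into 1 and 0.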

From this answer
Not the most efficient, but it should work for your case (it can be time-consuming on a large dataset):
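import numpy as np

# Reshape to long form: one row per (original row, year column)
s = df.reset_index().melt(['index','fgr','fgr1'])
# Flag rows where the year column name equals the first four characters of fgr / fgr1
s['value'] = s.variable.eq(s.fgr.str[:4]).astype(int)
s['value2'] = s.variable.eq(s.fgr1.str[:4]).astype(int)
# 1 if either column matched the year, else 0
s['final'] = np.where(s['value']+s['value2'] > 0,1,0)
# Pivot back to wide form, one column per year
yourdf = s.pivot_table(index=['index','fgr','fgr1'],columns = 'variable',values='final',aggfunc='first').reset_index(level=[1,2])
yourdf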
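The same result can probably be reached without melting and pivoting. The sketch below is an alternative, not the answerer's method. It assumes that the year columns already exist, that their names are four-digit strings, and that fgr and fgr1 hold strings starting with the year; the sample values are invented for illustration:

```python
import pandas as pd

# Hypothetical sample: fgr/fgr1 are strings that start with a four-digit year,
# and the year columns ("2028", "2029", "2030") are assumed to exist already.
df = pd.DataFrame({
    "fgr": ["2028-Q1", "2030-Q2", "2029-Q4"],
    "fgr1": ["2029-Q1", "2030-Q3", "2028-Q2"],
    "2028": 0, "2029": 0, "2030": 0,
})

year_columns = ["2028", "2029", "2030"]
fgr_year = df["fgr"].str[:4]    # first four characters of fgr
fgr1_year = df["fgr1"].str[:4]  # first four characters of fgr1

# A year column gets 1 where either fgr or fgr1 names that year, else 0
for year in year_columns:
    df[year] = (fgr_year.eq(year) | fgr1_year.eq(year)).astype(int)

print(df)
```

This loops over the year columns rather than over the rows, so each assignment is still a vectorized comparison.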

Related

Pyspark conditionally replace value in column with value from another column

I am working with some weather data that is missing some values (indicated via value code). For example, if SLP data is missing, it is assigned code 99999. I was able to use a window function to calculate a 7 day average and save it as a new column. A significantly reduced example of a single row is shown below:
SLP_ORIGIN    SLP_ORIGIN_7DAY_AVG
99999         11945.823516044207
I'm trying to write code such that when SLP_ORIGIN has the missing code it gets replaced using the SLP_ORIGIN_7DAY_AVG value. However, most code explains how to replace a column value based on a conditional with a constant value, not the column value. I tried using the following:
train_impute = train.withColumn(
    "SLP_ORIGIN",
    when(train["SLP_ORIGIN"] == 99999, train["SLP_ORIGIN_7DAY_AVG"]).otherwise(train["SLP_ORIGIN"]),
)
where the dataframe is called train.
When I perform a count on the SLP_ORIGIN column using train.where("SLP_ORIGIN = 99999").count() I get the same count from before I attempted replacing the value in that column. I have already checked and my SLP_ORIGIN_7DAY_AVG does not have any values that match the missing code.
So how do I actually replace the 99999 values in the SLP_ORIGIN column with the associated SLP_ORIGIN_7DAY_AVG value?
Even better: is there a way to do this replacement and window calculation without making a 7 day average column? (I have other variables I need to do the same thing with, so I'm hoping there is a more efficient way to do this.)
Make sure to double-check which dataframe you are verifying against.
I was using train.where("SLP_ORIGIN = 99999").count() when I should have been using train_impute.where("SLP_ORIGIN = 99999").count().
Additionally, instead of making a whole new column to store the imputed 7 day average, one can calculate the average only when the missing value code is present:
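# f is presumably pyspark.sql.functions and w the 7-day window spec from the question (neither is shown here)
train = train.withColumn("SLP_ORIGIN", when(train["SLP_ORIGIN"] == 99999, f.avg('SLP_ORIGIN').over(w)).otherwise(train["SLP_ORIGIN"]))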

Pandas: get first datetime-in and last datetime-out in one row

First of all, thanks in advance; there are always answers here, so we learn a lot from the experts. I'm a noob using "pandas" (it's super handy for what I have tried and achieved so far).
I have this data, handed to me like this (I don't have access to the origin), 20k rows or more sometimes. The 'in' and 'out' columns may have one or more entries per date, so when I get an 'in', the next entry could be an 'out' or another 'in', which leaves me with a blank cell. That's the problem (see first image).
I want to keep the first datetime-in in one column and the last datetime-out in another, with the two in one row (see second image); the data comes in a csv file. I am doing this particular work manually with LibreOffice Calc (yep).
So far, I have tried locating and relocating, merging, grouping... nothing works for me, so I feel frustrated. Would you please lend me a hand? Here is a minimal sample of the file.
By the way, English is not my language. Thanks so much!
First:
out_column = df["out"].tolist()
This gives you all the out dates as a list; we will need that later.
in_column = df["in"].tolist() # in is a Python keyword, so the list is named in_column
I treat NaT as NaN (null) in this case.
Now we have to find which rows to keep, which we do by going through the in column and keeping only the rows after a NaN (and the first one):
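import pandas as pd  # needed for pd.isna

filtered_df = []
tracker = False
for index, element in enumerate(in_column):
    # Keep the first row and every row that directly follows a missing "in"
    if index == 0 or tracker is True:
        filtered_df.append(True)
        tracker = False
        continue
    # NaT/NaN is not None, so test for missing values with pd.isna
    if pd.isna(element):
        tracker = True
    filtered_df.append(False)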
Then you filter your df by this Boolean List:
df = df[filtered_df]
Now you fix up your out column by removing the null values:
# Drop the missing values (assumes pandas is imported as pd)
out_column = [value for value in out_column if not pd.isna(value)]
Last but not least, you overwrite your old out column with the new one (this only works if the number of remaining out values equals the number of rows you kept):
df["out"] = out_column
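A different route from the loop above is to group by calendar day and aggregate. This is only a sketch under assumptions: the sample file is not shown, so the column names "in" and "out", the sample timestamps, and the idea that every row has exactly one of the two filled are all guesses; the names first_in and last_out are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: each row has either an "in" or an "out" timestamp
df = pd.DataFrame({
    "in": pd.to_datetime(["2021-03-01 08:00", None, "2021-03-01 13:00",
                          None, "2021-03-02 09:00", None]),
    "out": pd.to_datetime([None, "2021-03-01 12:00", None,
                           "2021-03-01 17:30", None, "2021-03-02 18:00"]),
})

# Recover the calendar day of each row from whichever timestamp is present
day = df["in"].fillna(df["out"]).dt.date.rename("day")

# Earliest "in" and latest "out" per day; missing values are skipped by min/max
result = df.groupby(day).agg(first_in=("in", "min"), last_out=("out", "max"))
print(result)
```

If a shift runs past midnight, grouping by calendar day would split it, so the grouping key would need to change in that case.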

How can I replace a particular column in a data frame based on a condition (categorical variables)?

I need to replace the salary status in a df with 1 if the salary is greater than 50,000, or with 0 if it is less than or equal to 50,000.
The DataFrame shape is 30162 x 13.
I have tried this:
data2['SalStat']=data2['SalStat'].map({"less than or equal to 50,000":0,"greater than 50,000":1})
I also tried data2['SalStat']
and loc without any success.
How can I do the same?
I think your solution is nice.
If you want to match only by a substring, e.g. greater, use Series.str.contains to build a boolean mask and convert it to 0 and 1:
data2['SalStat']=data2['SalStat'].str.contains('greater').astype(int)
Or:
data2['SalStat']=data2['SalStat'].str.contains('greater').view('i1')
Try this
def status(d): return 0 if d == 'less than or equal to 50,000' else 1
data2['SalStat'] = list(map(status ,data2['SalStat']))
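If the map call from the question returns NaN for every row, one possible cause (an assumption, since the data itself is not shown) is leading or trailing whitespace in the strings, so that they do not match the dictionary keys exactly. A minimal sketch with invented sample rows and a hypothetical SalStat_num column, stripping the whitespace before mapping:

```python
import pandas as pd

# Hypothetical sample rows; two of them carry a leading space
data2 = pd.DataFrame({
    "SalStat": [" less than or equal to 50,000",
                " greater than 50,000",
                "greater than 50,000"],
})

mapping = {"less than or equal to 50,000": 0, "greater than 50,000": 1}

# Strip surrounding whitespace so the values match the dictionary keys exactly
data2["SalStat_num"] = data2["SalStat"].str.strip().map(mapping)
print(data2)
```

Unlike the status function above, map leaves any unexpected string as NaN instead of silently turning it into 1, which makes bad values easier to spot.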

What is the most efficient format for storing strings from a for loop?

I have a script that runs through a series of strings and using regex pulls out certain strings (approx 4 output strings per input string).
e.g. HelloStackOverflowWorld
-> Hello; Stack; Overflow; World;
The final output would ideally be a table where I can filter based upon the strings in the columns. Using the case above, column 1 row 1 would have 'Hello', column 2 row 1 would have 'Stack' and so on.
The problem is, the size of the output will change depending on the input so I am unsure of what output format to use.
At the moment I used something similar to this:
if strfind(missing{ii},'hello')
    miss.exch = [miss.exch;'hello'];
    temp.exc = regexp(missing{ii},'(?<=\d[Q|T])(\w*?)(?=[q])','match');
    miss.exc = [miss.exc;temp.exc];
    temp.TQ= regexp(missing{ii},'(Qc|Tc)','match');
    if strcmp(temp.TQ{1,1}, 'Tc')
        miss.TQ = [miss.TQ;'variableA'];
    elseif temp.TQ{1,1} == 'Qc'
        miss.TQ = [miss.TQ;'variableB'];
    end
else if .........
end
Which obviously results in a 1x1 struct consisting of a number of fields each with many cells. This makes filtering on strings an issue!
How can I define and add data into a 'table of strings' that I can then filter?
I think you are just looking for a cell array. Here is a simple example of what they can do:
C = {'Abc','Bcd';'Cde',[]}
strcmp(C,'Cde')
Results in:
ans =
0 0
1 0
Make sure to check doc cell to see how you can access them.

Excel dynamic data series. Unusual data look and chart

Since I solved my previous problem with collecting data from the database, I now need to put that data on a chart. I am working with a report generating software called ReportWorx.
Problem is, data comes in series and looks like this:
ID DATE SAMPLE
1 XX-XX-XX VALUE
1 XX-XX-XX VALUE
1 XX-XX-XX VALUE
2 XX-XX-XX VALUE
2 XX-XX-XX VALUE
3 XX-XX-XX VALUE
3 XX-XX-XX VALUE
I cannot change how it looks because it is generated automatically. What I want is a line chart in which 1, 2, 3 are the series names, with DATE and VALUE plotted for each series (or a bar graph, whichever): DATE on the X axis, VALUE on the Y axis.
I can't specify how many records (rows) there will be, but I found a few solutions for creating dynamically growing charts, so that probably will not be a problem. I just do not know how to separate those ID series from each other.
EDIT:
I have found a solution in VBA based on the first answer. The VBA code is below:
Sub Rewrite()
    Dim row, id
    For row = 38 To 1000
        For id = 1 To 37
            If Sheet1.Cells(row, 1).Value = id Then
                Sheet2.Cells(row, 1).Value = Sheet1.Cells(row, 2)
                Sheet2.Cells(row, id + 1).Value = Sheet1.Cells(row, 3)
            End If
        Next id
    Next row
End Sub
Thank You #sancho.s
I will post a solution that I use a lot for cases like yours.
With reference to the figure (where I used sample numbers), you set up 3 new columns (D:F here), the header of which contain the corresponding labels. Then you use a formula for "splitting" the list of X data (column B here) associated with each label, and assigning a "NULL" value for data not corresponding (#N/A here, but you can choose whatever you want):
=IF($A3=D$2,$B3,$B$1)
You enter this in D3. The absolute/relative indexing used allows for copy-and-paste throughout D3:F9.
Cell B1 here contains the "NULL" value.
Then you plot 3 series: column C against columns D, E, F.
PS: I guess you could split the Y data column instead, with similar results. For some reason that I do not recall, I decided a long time ago that this was the best option, at least in my case then. You may want to try out the other option.
PS2: This also works for data that is not sorted by label.
PS3: Using NA() as the "NULL" value avoids cell values being taken as zero and then showing up in the chart, as is the case with other errors (e.g., try using =1/0 in B1). It is the best option I have found so far. Alternatively (just in case you find it useful), you can use an explicit value which is outside the actual X data range, but then you would have to manually set the X axis range. All this is for a Scatter plot, so check what works for your case.
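For readers who prepare the data outside Excel, the same splitting idea can be sketched in pandas. This is an illustration only, not part of the Excel solution: the column names ID, X and Y and the sample numbers are assumptions, and NaN plays the role of the "NULL" value from B1.

```python
import pandas as pd

# Hypothetical sample in the same long layout: one label column, X data, Y data
data = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3, 3],
    "X": [10, 20, 30, 10, 20, 10, 20],
    "Y": [1.0, 1.5, 2.0, 3.0, 3.5, 5.0, 5.5],
})

# One new X column per label: keep X where the row's ID matches the label,
# otherwise leave NaN (the counterpart of =IF($A3=D$2,$B3,$B$1))
for label in sorted(data["ID"].unique()):
    data[f"X_{label}"] = data["X"].where(data["ID"] == label)

print(data)
```

Each X_1, X_2, X_3 column can then be plotted against Y as its own series, and, as with the formula, the rows do not need to be sorted by label.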
