Reshaping a Large File - Excel

I have a large .csv file with 8 variables and about 350,000 observations. In this file, each real observation is split across 105 rows: each row has data for one specific demographic, and there are 105 demographic cuts (all relating to the same event). This makes it very difficult to merge this file with others.
I would like to change it so that there are 3,500 observations with variables for demographic statistics. I've tried creating a macro, but I haven't had much luck.
This is what it looks like now, and this is what I'd like it to look like. [Screenshots from the original question are not reproduced here.]
This way, each ID is a unique observation. I think that this will make it much easier to work with. I can use either Stata or Excel. What is the best way to do this?

Here is an example showing what I understand you want:
clear all
set more off
*----- example data -----
input id store date cut
1 5 1 1
1 5 1 2
2 8 1 1
2 9 1 2
2 8 2 3
end
format date %td
set seed 012385
gen val1 = floor(runiform()*1000)
gen val2 = floor(runiform()*2000)
list, sepby(id)
*----- what you want ? -----
reshape wide val1 val2, i(id store date) j(cut)
list, sepby(id)
My id variable is numeric, as are the cuts (see help destring and help encode to convert yours if necessary). The example data is also a bit more complex than what you posted, in case your example is not representative enough.
The missings (.) that result are expected. val11 is to be interpreted as val1 for cut == 1, val21 as val2 for cut == 1, val12 as val1 for cut == 2, and so on. So when id == 1, val13 and val23 are missing because this person does not appear with cut == 3.
I hope that was clear enough for you to apply to your data.
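For readers who end up doing this in Python instead, here is a rough pandas counterpart to reshape wide (a sketch with made-up values; pivot with a list-valued index needs pandas >= 1.1, and the column flattening mimics Stata's val11, val12, ... naming):

import pandas as pd

# Hypothetical long-format data mirroring the Stata example above
df = pd.DataFrame({
    'id':    [1, 1, 2, 2, 2],
    'store': [5, 5, 8, 9, 8],
    'date':  [1, 1, 1, 1, 2],
    'cut':   [1, 2, 1, 2, 3],
    'val1':  [10, 20, 30, 40, 50],
    'val2':  [11, 21, 31, 41, 51],
})

# Long -> wide: one row per (id, store, date), one column per (value, cut) pair
wide = df.pivot(index=['id', 'store', 'date'], columns='cut',
                values=['val1', 'val2'])
wide.columns = [f'{v}{c}' for v, c in wide.columns]  # e.g. ('val1', 3) -> 'val13'
wide = wide.reset_index()
print(wide)  # missing (id, cut) combinations show up as NaN, like Stata's .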


Sampling a dataframe according to some rules: balancing a multilabel dataset

I have a dataframe like this:
import pandas as pd

df = pd.DataFrame({'id': [10, 20, 30, 40],
                   'text': ['some text', 'another text', 'random stuff', 'my cat is a god'],
                   'A': [0, 0, 1, 1],
                   'B': [1, 1, 0, 0],
                   'C': [0, 0, 0, 1],
                   'D': [1, 0, 1, 0]})
Here I have columns from A to D, but my real dataframe has 100 columns with values of 0 and 1, and about 100k records.
For example, column A is related to the 3rd and 4th rows of text, because it is labeled 1 there. In the same way, A is not related to the 1st and 2nd rows of text, because it is labeled 0 there.
What I need to do is sample this dataframe so that I have the same, or about the same, number of occurrences of each feature.
In this case, the feature C has only one occurrence, so I need to filter all the other columns so that I have one text with A, one text with B, one text with C, etc.
Ideally I could set, for example, n=100, meaning I want to sample so that I have 100 records for each of the features.
This is a multilabel training dataset and it is highly unbalanced; I am looking for the best way to balance it for a machine learning task.
Important: I don't want to exclude the 0 features. I just want to have ABOUT the same number of 1s and 0s.
For example, with a final dataset of 1k records, I would like all columns from A to the final column to have about the same number of 1s and 0s. To accomplish this I will need to randomly discard rows (text and id).
The approach I was trying was to look at the feature with the lowest count of 1s and then use that value as a threshold.
Edit 1: One possible way I thought of is to use:
df.sum(axis=0, skipna=True)
Then I could use the column with the lowest sum as a threshold to filter the text column. I don't know how to do this filtering step, though.
Thanks
The exact output you expect is unclear, but assuming you want one random row per letter column with a 1, you could reshape (while dropping the 0s) and use GroupBy.sample:
(df
 .set_index(['id', 'text'])        # keep the identifiers out of the reshape
 .replace(0, float('nan'))         # turn 0s into NaN so stack() drops them
 .stack()                          # long format: one row per (id, text, label) with a 1
 .groupby(level=-1).sample(n=1)    # one random row per label
 .reset_index()
)
NB. you can rename the columns if needed
output:
id text level_2 0
0 30 random stuff A 1.0
1 20 another text B 1.0
2 40 my cat is a god C 1.0
3 30 random stuff D 1.0
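If you need more than one row per label (the question mentions n=100), the same pipeline generalizes; a minimal sketch, assuming sampling with replacement is acceptable for labels that have fewer than n rows (the n=2 here is hypothetical):

n = 2
sampled = (df
           .set_index(['id', 'text'])
           .replace(0, float('nan'))
           .stack()
           .groupby(level=-1)
           .sample(n=n, replace=True)  # replace=True lets rare labels like C still reach n rows
           .reset_index()
           )

If duplicated rows are not acceptable, drop replace=True and cap n at the smallest per-label count instead.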

Code optimisation - comparing two datetime columns by month and creating a new column is too slow

I am trying to create a new column in a Pandas dataframe. If the two date columns in my dataframe share the same month, then this new column should have 1 as a value, otherwise 0. Also, I need to check that the ids match another list of ids that I saved previously elsewhere, and mark only those with 1. I have some code, but it is useless since I am dealing with almost a billion rows.
my_list_of_ids = df[df.bool_column == 1].id.values

def my_func(date1, date2):
    for id_ in df.id:
        if id_ in my_list_of_ids:
            if date1.month == date2.month:
                my_var = 1
            else:
                my_var = 0
        else:
            my_var = 0
    return my_var

df["new_column"] = df.progress_apply(lambda x: my_func(x['date1'], x['date2']), axis=1)
Been waiting for 30 minutes and still 0%. Any help is appreciated.
UPDATE (adding an example):
id  | date1      | date2      | bool_column | new_column
id1 | 2019-02-13 | 2019-04-11 | 1           | 0
id1 | 2019-03-15 | 2019-04-11 | 0           | 0
id1 | 2019-04-23 | 2019-04-11 | 0           | 1
id2 | 2019-08-22 | 2019-08-11 | 1           | 1
id2 | ...
id3 | 2019-09-01 | 2019-09-30 | 1           | 1
...
What I need to do is save the ids that have 1 in my bool_column; then, looping through all of the ids in my dataframe, check whether they are in that previously created list. Then I want to compare the month and the year of the date1 and date2 columns and, where they match, set new_column to 1, otherwise 0.
The pandas way to do this is:
mask = (df['date1'].dt.month == df['date2'].dt.month) & df['id'].isin(my_list_of_ids)
df['new_column'] = mask.replace({False: 0, True: 1})
Since you have a large dataset this will take time, but it should be faster than using apply.
The best way to deal with the month match is to use vectorization in pandas and do this:
new_column = (df.date1.dt.month == df.date2.dt.month).astype(int)
That is, avoid using apply() over the DataFrame (which will probably be iterative) and take advantage of the underlying numpy vectorization. The gateway to such functionality is almost always in families of Series functions and properties, like the dt family for dates, str family for strings, and so forth.
Luckily, you have pre-computed the id_list membership in your bool_column, so to add membership as a criterion, just do this:
new_column = ((df.date1.dt.month == df.date2.dt.month) & df.bool_column).astype(int)
Once again, the & of two Series takes advantage of vectorization. You stay inside boolean space till the end, then cast to int with astype(int). Reviewing your code, it occurs to me that the iterative checking of your id_list may be the real performance hit here, even more so than the DataFrame.apply(). Whatever you do, avoid at all costs iterating your id_list at each row, since you already have a vector denoting membership in your bool_column.
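Put together on the example data, a minimal end-to-end sketch of the vectorized version (the frame below is reconstructed from the question's sample rows):

import pandas as pd

df = pd.DataFrame({
    'id': ['id1', 'id1', 'id2'],
    'date1': pd.to_datetime(['2019-02-13', '2019-04-23', '2019-08-22']),
    'date2': pd.to_datetime(['2019-04-11', '2019-04-11', '2019-08-11']),
    'bool_column': [1, 0, 1],
})

# Whole-column month comparison ANDed with the precomputed membership flag;
# no Python-level loop runs per row
df['new_column'] = ((df.date1.dt.month == df.date2.dt.month)
                    & df.bool_column.astype(bool)).astype(int)
print(df)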
By the way, I believe there's a small error in your example data: the new_column value for your third row should be 0, since your bool_column value there is 0.

How to match two sets of data by dates which do not synchronise and include missing values in Excel

Please forgive any errors or shortcomings in this question; it's my first on Stack Overflow.
I have two sets of data in Excel of differing lengths and frequency, and would like to be able to place a value of 0 for where they don't synchronise, and match the rest.
For example, dataset 1 could be:
Date Set1
01-01-2010 10
01-03-2010 4
01-04-2010 8
01-05-2010 5
01-06-2010 10
01-09-2010 12
01-10-2010 9
01-11-2010 4
And dataset 2 could be:
Date Set2
01-03-2010 102
01-06-2010 104
01-10-2010 102
I'm looking for an output table that displays the values alongside each other for dates matching, 0 otherwise, like so:
Date Set1 Set2
01-01-2010 10 0
01-03-2010 4 102
01-04-2010 8 0
01-05-2010 5 0
01-06-2010 10 104
01-09-2010 12 0
01-10-2010 9 102
01-11-2010 4 0
I can't seem to be able to crack this with my limited knowledge and the lack of synchronisation in the data. Any help would be much appreciated, thanks.
You can do this using a VLOOKUP nested in an IFERROR statement.
The two formulas used (and dragged down to the last unique date row) are:
H3: =IFERROR(VLOOKUP(G3,A:B,2,0),0)
I3: =IFERROR(VLOOKUP(G3,D:E,2,0),0)
This will not work if you have duplicate dates in the same data set with varying values since VLOOKUP will always return the first matched value (reading top down).
Place Set1 in A1:B9 (header in row 1). Add a column of zeros next to it in column C, so A2:A9 is dates, B2:B9 is values and C2:C9 is zeros.
Place Set2 (without the header) in A10:B12; move the Set2 data to column C and put zeros in column B, so A10:A12 is dates, B10:B12 is zeros, C10:C12 is values.
Sort the range A2:C12 by Date (column A).
Easier to show with a screenshot but newbies are not allowed to post images.
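For what it's worth, if the data ever leaves Excel, the same align-and-zero-fill is a one-line outer join; a minimal pandas sketch using a truncated version of the sample data:

import pandas as pd

set1 = pd.DataFrame({'Date': ['01-01-2010', '01-03-2010', '01-06-2010'],
                     'Set1': [10, 4, 10]})
set2 = pd.DataFrame({'Date': ['01-03-2010', '01-06-2010'],
                     'Set2': [102, 104]})

# Outer join keeps every date from either set; fillna(0) supplies the 0
# for dates that appear in only one of them
out = set1.merge(set2, on='Date', how='outer').fillna(0)
print(out)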

Matlab - filter out identical cell entries

I have a 10000 x 2 cell array of names. It looks like this:
Ann Beth
Bob Pete
Sam Sam
Jen Ted
...
There are many rows with identical names in both columns (like Sam), and I need just the rows with different names. I thought of a for-loop with ismember/string comparison, but this is very slow, and there are several matrices like this.
Another option, also slow, is to take the unique values of the first column, loop over them with find, and delete the rows where the found values are identical. Please help me optimize this.
Thanks
You can use strcmp to get a logical array of indices corresponding to identical rows, i.e. compare the 1st column with the 2nd and remove the rows where the result is 1.
Example:
C = {'Ann' 'Beth';
'Bob' 'Pete';
'Sam' 'Sam';
'Jen' 'Ted'};
idx = strcmp(C(:,1),C(:,2))
Here idx looks like this:
idx =
0
0
1
0
Hence the 3rd row contains identical names. Now remove those:
C(idx,:) = [];
C =
'Ann' 'Beth'
'Bob' 'Pete'
'Jen' 'Ted'
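The same mask-and-drop idiom in Python, for reference (a sketch; the column names first/second are hypothetical):

import pandas as pd

C = pd.DataFrame({'first': ['Ann', 'Bob', 'Sam', 'Jen'],
                  'second': ['Beth', 'Pete', 'Sam', 'Ted']})

# Boolean mask playing the role of idx = strcmp(C(:,1), C(:,2));
# keep only the rows where the two names differ
C = C[C['first'] != C['second']].reset_index(drop=True)
print(C)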

How to count occurrences of unknown strings in a column?

I have another question. Thanks for everyone's help and patience with an R newbie!
How can I count how many times a string occurs in a column? Example:
MYdata <- data.frame(fruits = c("apples", "pears", "unknown_f", "unknown_f", "unknown_f"),
                     veggies = c("beans", "carrots", "carrots", "unknown_v", "unknown_v"),
                     sales = rnorm(5, 10000, 2500))
The problem is that my real data set contains several thousand rows and several hundred of the unknown fruits and unknown veggies. I played around with table() and levels() but without much success; I guess it's more complicated than that. It would be great to have an output table listing the name of each unique fruit/veggie and how many times it occurs in its column. Any hint in the right direction would be much appreciated.
Thanks,
Marcus
If I understand your question, the function table() should work just fine. Here is how:
table(MYdata$fruits)
apples pears unknown_f
1 1 3
table(MYdata$veggies)
beans carrots unknown_v
1 2 2
Or use table inside lapply:
lapply(MYdata[1:2], table)
$fruits
apples pears unknown_f
1 1 3
$veggies
beans carrots unknown_v
1 2 2
The following gives you a data frame of counts which you might find easier to use or may suit your purposes better:
tabs = lapply(MYdata[-3], table)
out = data.frame(item = names(unlist(tabs)), count = unlist(tabs),
                 stringsAsFactors = FALSE)
rownames(out) = c()
print(out)
item count
1 fruits.apples 1
2 fruits.pears 1
3 fruits.unknown_f 3
4 veggies.beans 1
5 veggies.carrots 2
6 veggies.unknown_v 2
Or, if fruits is a factor, maybe something as simple as
summary(MYdata$fruits)
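And for anyone doing the same count in Python, a small pandas sketch of the per-column tabulation (mirroring the example data above):

import pandas as pd

MYdata = pd.DataFrame({
    'fruits': ['apples', 'pears', 'unknown_f', 'unknown_f', 'unknown_f'],
    'veggies': ['beans', 'carrots', 'carrots', 'unknown_v', 'unknown_v'],
})

# Per-column counts, analogous to lapply(MYdata[1:2], table) in R
for col in ['fruits', 'veggies']:
    print(MYdata[col].value_counts())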
