I have 10 data sets and I want to check the correlation between all possible pairs. For example, if I had:
ABCD
I want to check the correlation between AB, AC, AD, BC etc.
I've been using the CORREL function in Excel, which is fine for a small number of data sets, but if I had 1000 data sets instead of 10, how would I do this?
This solution assumes your datasets live in the global environment and can be found by some naming criterion. In my case, I used a ".string" suffix. If yours share no such pattern, you have to come up with your own way of collecting their names into a character vector. Another way would be to put all datasets into a list and work with indices.
A.string <- runif(5)
B.string <- runif(5)
C.string <- runif(5)
# find variables based on a common string
pairs <- combn(ls(pattern = "\\.string"), 2)
# for each pair, fetch variable and use function cor()
apply(pairs, MARGIN = 2, FUN = function(x) {
  cor(get(x[1]), get(x[2]))
})
[1] 0.2586141 0.7106571 0.7119712
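For readers working outside R, the "put all datasets into a list and work with indices" idea can be sketched in Python as well. This is only an illustration: the `datasets` dictionary, the `pearson` helper and the random sample values are assumptions made for the example, not part of the original answer.

```python
from itertools import combinations
from math import sqrt
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: four named series of five random values each.
random.seed(0)
datasets = {name: [random.random() for _ in range(5)] for name in "ABCD"}

# combinations() yields each unordered pair once: AB, AC, AD, BC, BD, CD.
correlations = {
    (a, b): pearson(datasets[a], datasets[b])
    for a, b in combinations(datasets, 2)
}

for pair, r in correlations.items():
    print(pair, round(r, 4))
```

The number of pairs grows as n(n-1)/2, so 1000 data sets would give 499,500 correlations; the loop itself does not change.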
I've not ever encountered this type of situation in a Python for loop before.
I have a dictionary of Names (key) and Regions (value). I want to match up each Name with two other names. The matched name cannot be themselves and reversing the elements is not a valid match (1,2) = (2,1). I do not want people from the same Region to be matched together though (unless it becomes impossible).
dict = {
"Tom":"Canada",
"Jerry":"USA",
"Peter":"USA",
"Pan":"Canada",
"Edgar":"France"
}
desired possible output:
[('Tom','Jerry'),('Tom','Peter'),('Jerry','Pan'),('Pan','Peter'),('Edgar','Peter'),('Edgar','Jerry')]
Everyone appears twice, but Jerry and Peter appear more often so that Edgar can have 2 matches with names from a different region (Jerry and Peter should be chosen randomly here).
Count: Tom: 2, Jerry: 3, Peter: 3, Pan: 2, Edgar: 2
My approach is to convert the names into a list, shuffle them, then create tuple pairs using zip in a custom function. After the function completes, I use a for loop to check for pairings from the same region; if one exists, I re-run the custom function. For some reason, when I print the results, I still see pairings between the same regions. What am I missing here?
import random
names=list(dict.keys())
def pairing(x):
    random.shuffle(x)
    #each person is tupled twice, once with the neighbor on each side
    pairs = list(zip(x, x[1:]+x[:1]))
    return pairs
pairs=pairing(names) #assigns variable from function to 'pairs'
for matchup in pairs:
    if dict[matchup[0]]==dict[matchup[1]]:
        break
        pairing(names)
        pairs=pairing(names)
for matchup in pairs:
    print(matchup[0] ,dict[matchup[0]] , matchup[1] , dict[matchup[1]])
Just looking at it, something is clearly broken in the for loop, please help!
I've tried while rather than if in the for loop, but it did not work.
import random

# Names and regions from the question (called dict there; renamed here so
# the built-in dict type is not shadowed).
regions = {
    "Tom": "Canada",
    "Jerry": "USA",
    "Peter": "USA",
    "Pan": "Canada",
    "Edgar": "France",
}
names = list(regions.keys())

# create function to pair names together
def pairing(x):
    random.shuffle(x)
    # each person is tupled twice, once with the neighbor on each side
    pairs = list(zip(x, x[1:] + x[:1]))
    for matchup in pairs:
        # if someone's region matches their partner's region, re-run this function
        if regions[matchup[0]] == regions[matchup[1]]:
            return pairing(x)
    return pairs

pairs = pairing(names)
for matchup in pairs:
    print(matchup[0], regions[matchup[0]], matchup[1], regions[matchup[1]])
The trick is to return pairing(x) inside the custom function. This returns new pairings whenever the two names in any tuple share the same value in the dictionary. If, inside the if statement, you call pairing(x) and then return pairs, it returns the original list of tuples, which still contains same-region pairings.
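The recursive retry can also be written as a bounded loop, which avoids hitting Python's recursion limit when valid shuffles are rare. The sketch below is one possible variant, not the answerer's code: the name `pairing_with_retries` and the `max_tries` parameter are made up for illustration, and the dictionary is the one from the question.

```python
import random

# Names and regions from the question.
regions = {"Tom": "Canada", "Jerry": "USA", "Peter": "USA",
           "Pan": "Canada", "Edgar": "France"}

def pairing_with_retries(names, max_tries=1000):
    """Shuffle until no neighbouring pair in the ring shares a region."""
    names = list(names)  # work on a copy so the caller's data is untouched
    for _ in range(max_tries):
        random.shuffle(names)
        # pair each person with the next one, wrapping around at the end
        pairs = list(zip(names, names[1:] + names[:1]))
        if all(regions[a] != regions[b] for a, b in pairs):
            return pairs
    raise ValueError("no valid pairing found; the constraint may be impossible")

pairs = pairing_with_retries(regions)
for a, b in pairs:
    print(a, regions[a], b, regions[b])
```

Raising an error after `max_tries` attempts makes the "unless it becomes impossible" case explicit instead of looping forever.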
I want to reshape my data in Excel, which is currently in "wide" format into "long" format. You can see each variable (Column Name) corresponds to a tenure, race and cost burden. I want to more easily put these data into a pivot table, but I'm not sure how to do this. Any ideas out there?
FYI, the data are HUD CHAS (Department of Housing and Urban Development, Comprehensive Housing Affordability Strategy), which has over 20 tables that would need to be reshaped.
There is a simple R script that will help with this. The function accepts the path to your csv file and the number of header variables you have. In the example image/data I provided, there are 7 header variables. That is, the actual data (T9_est1) starts on the 8th column.
# Use the command below if you do not have the tidyverse package installed.
# install.packages("tidyverse")
library(tidyverse)
read_data_long <- function(path_to_csv, header_vars) {
  data_table <- read_csv(path_to_csv)
  fields_to_melt <- names(data_table[, as.numeric(header_vars + 1):ncol(data_table)])
  melted <- gather(data_table, fields_to_melt, key = 'variable', value = 'values')
  return(melted)
}
# Change the file path to where your data is and where you want it written to.
# Also change "7" to the number of header variables your data has.
melted_data <- read_data_long("path_to_input_file.csv", 7)
write_csv(melted_data, "new_path_to_melted_file.csv")
(Updated 7/25/18 with a more elegant solution; Again 9/28/18 with small change.)
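To make the reshaping step itself concrete, here is a rough Python sketch of the same wide-to-long idea using only the standard csv module. The function name `wide_to_long`, the sample columns `geoid` and `name`, and the sample values are invented for illustration; only `T9_est1` comes from the description above.

```python
import csv
import io

def wide_to_long(rows, header_vars):
    """Turn each wide row into one long row per measurement column."""
    reader = csv.reader(rows)
    header = next(reader)
    id_names, value_names = header[:header_vars], header[header_vars:]
    long_rows = []
    for row in reader:
        ids, values = row[:header_vars], row[header_vars:]
        # one output row per measurement column, repeating the id columns
        for variable, value in zip(value_names, values):
            long_rows.append(dict(zip(id_names, ids), variable=variable, values=value))
    return long_rows

# Hypothetical two-row sample with 2 header variables instead of 7.
sample = io.StringIO(
    "geoid,name,T9_est1,T9_est2\n"
    "01,Alpha,10,20\n"
    "02,Beta,30,40\n"
)
long_rows = wide_to_long(sample, header_vars=2)
for row in long_rows:
    print(row)
```

Each input row becomes as many output rows as there are measurement columns, which is the shape a pivot table expects.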
I have a data frame which contains different columns ('features').
My goal is to calculate statistical measures of column X:
mean, standard deviation, variance.
But I want to calculate all of those depending on the value of column Y,
e.g. get all rows where Y = 1 and calculate mean, stddev and var for them,
then do the same for all rows where Y = 2.
My current implementation is:
print "For CONGESTION_FLAG = 0:"
log_df.filter(log_df[flag_col] == 0).select([mean(size_col), stddev(size_col),
pow(stddev(size_col), 2)]).show(20, False)
print "For CONGESTION_FLAG = 1:"
log_df.filter(log_df[flag_col] == 1).select([mean(size_col), stddev(size_col),
pow(stddev(size_col), 2)]).show(20, False)
print "For CONGESTION_FLAG = 2:"
log_df.filter(log_df[flag_col] == 2).select([mean(size_col), stddev(size_col),
pow(stddev(size_col), 2)]).show(20, False)
I was told the filter() approach is wasteful in terms of computation time, and was advised that to make these calculations run faster (I'm using this on a 1 GB data file), it would be better to use the groupBy() method.
Can someone please help me transform those lines to do the same calculations by using groupBy instead?
I got mixed up with the syntax and didn't manage to do so correctly.
Thanks.
Filter by itself is not wasteful. The problem is that you are calling it multiple times (once for each value), meaning you are scanning the data 3 times. The operation you are describing is best achieved by groupBy, which aggregates the data per value of the grouped column.
You could do something like this:
agg_df = log_df.groupBy(flag_col).agg(mean(size_col).alias("mean"), stddev(size_col).alias("stddev"), pow(stddev(size_col),2).alias("pow"))
You might also get better performance by calculating stddev^2 after the aggregation (you should try it on your data):
agg_df = log_df.groupBy(flag_col).agg(mean(size_col).alias("mean"), stddev(size_col).alias("stddev"))
agg_df2 = agg_df.withColumn("pow", agg_df["stddev"] * agg_df["stddev"])
You can group and aggregate in a single call:
log_df.groupBy(log_df[flag_col]).agg(
mean(size_col), stddev(size_col), pow(stddev(size_col), 2)
)
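The difference between three filtered scans and one grouped pass can be illustrated without Spark. The plain-Python sketch below is an analogy only, not PySpark code: the `records` list is made-up sample data standing in for the rows of log_df, and the sample standard deviation and variance from the statistics module are used on the assumption that this matches what stddev computes.

```python
from collections import defaultdict
import statistics

# Hypothetical (flag, size) records standing in for log_df rows.
records = [(0, 10.0), (1, 4.0), (0, 14.0), (2, 7.0), (1, 8.0), (2, 9.0), (0, 12.0)]

# One pass over the data: bucket each size under its flag value.
groups = defaultdict(list)
for flag, size in records:
    groups[flag].append(size)

# Per-group statistics; stdev and variance are the sample versions.
summary = {
    flag: {
        "mean": statistics.mean(sizes),
        "stddev": statistics.stdev(sizes),
        "variance": statistics.variance(sizes),
    }
    for flag, sizes in sorted(groups.items())
}
for flag, stats in summary.items():
    print(flag, stats)
```

The data is read once and every flag value gets its own row of results, which is what the groupBy/agg call does at scale.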
I'm taking a programming class and have our first assignment. I understand how it's supposed to work, but apparently I haven't hit upon the correct terms to search to get help (and the book is less than useless).
The assignment is to take a provided data set (names and numbers) and perform some manipulation and computation with it.
I'm able to get the names into a list, and know the general format of what commands I'm giving, but the specifics are evading me. I know that you refer to the numbers as names[0][1], names[1][1], etc, but not how to refer to just that record that is being changed. For example, we have to have the program check if a name begins with a letter that is Q or later; if it does, we double the number associated with that name.
This is what I have so far, with ??? indicating where I know something goes, but not sure what it's called to search for it.
It's homework, so I'm not really looking for answers, but guidance to figure out the right terms to search for my answers. I already found some stuff on the site (like the statistics functions), but just can't find everything the book doesn't even mention.
names = [("Jack",456),("Kayden",355),("Randy",765),("Lisa",635),("Devin",358),("LaWanda",452),("William",308),("Patrcia",256)]
length = len(names)
count = 0
while True
count < length:
if ??? > "Q" # checks if first letter of name is greater than Q
??? # doubles number associated with name
count += 1
print(names) # self-check
numberNames = names # creates new list
import statistics
mean = statistics.mean(???)
median = statistics.median(???)
print("Mean value: {0:.2f}".format(mean))
alphaNames = sorted(numberNames) # sorts names list by name and creates new list
print(alphaNames)
First of all, you need to iterate over your names list. To do so, use a for loop:
for person in names:
    print(person)
But names is a list of tuples, so you get the person's name by accessing the first item of the tuple. You do this just like you do with lists:
name = person[0]
score = person[1]
Finally, to get the ASCII code of a character, use the ord() function. That will help you tell whether a name starts with Q or a later letter.
print(ord('A'))
print(ord('Q'))
print(ord('R'))
This should be enough information to get you started.
I see a few parts to your question, so I'll try to separate them out in my response.
check if first letter of name is greater than Q
Hopefully this will help you with the syntax here. Like list, str also supports element access by index with the [] syntax.
$ names = [("Jack",456),("Kayden",355)]
$ names[0]
('Jack', 456)
$ names[0][0]
'Jack'
$ names[0][0][0]
'J'
$ names[0][0][0] < 'Q'
True
$ names[0][0][0] > 'Q'
False
double number associated with name
$ names[0][1]
456
$ names[0][1] * 2
912
"how to refer to just that record that is being changed"
We are trying to update the value associated with the name.
In keeping with my previous code examples, we want to update the value at index 1 of the tuple stored at index 0 in the list called names.
However, tuples are immutable, so we have to be a little tricky if we want to keep the data structure you're using.
$ names = [("Jack",456), ("Kayden", 355)]
$ names[0]
('Jack', 456)
$ tpl = names[0]
$ tpl = (tpl[0], tpl[1] * 2)
$ tpl
('Jack', 912)
$ names[0] = tpl
$ names
[('Jack', 912), ('Kayden', 355)]
Do this for all tuples in the list
We need to do this for the whole list, and it looks like you were onto that with your while loop. Your counter variable for indexing the list is named count, so just use that to index a specific tuple, like: names[count][0] for the countth name or names[count][1] for the countth number.
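Putting the pieces above together, one way the whole-list update could look is sketched here. It uses a for loop over indices rather than the while loop from the question, a shortened copy of the data, and `>=` on the assumption that "Q or later" should include Q itself.

```python
names = [("Jack", 456), ("Randy", 765), ("William", 308)]

# Walk the list by index so each tuple can be replaced in place.
for i in range(len(names)):
    name, number = names[i]
    if name[0] >= "Q":  # first letter is Q or later in the alphabet
        names[i] = (name, number * 2)  # build a new tuple; tuples are immutable

print(names)
# [('Jack', 456), ('Randy', 1530), ('William', 616)]
```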
using statistics for calculating mean and median
I recommend looking at the documentation for a module when you want to know how to use it. Here is an example for mean:
mean(data)
Return the sample arithmetic mean of data.
$ mean([1, 2, 3, 4, 4])
2.8
Hopefully these examples help you with the syntax for continuing your assignment, although this could turn into a long discussion.
The title of your post is "Need help working with lists within lists" ... well, your code example uses a list of tuples
$ names = [("Jack",456),("Kayden",355)]
$ type(names)
<class 'list'>
$ type(names[0])
<class 'tuple'>
$ names = [["Jack",456], ["Kayden", 355]]
$ type(names)
<class 'list'>
$ type(names[0])
<class 'list'>
Notice the difference between the [] and the ().
If you are free to structure the data however you like, then I would recommend using a dict (read: dictionary).
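As a rough sketch of what the dict suggestion buys you, assuming names are unique: the variable name `numbers` is made up for the example and only a few entries from the question's data are used.

```python
import statistics

# The same records keyed by name (a few entries from the question's data).
numbers = {"Jack": 456, "Kayden": 355, "Randy": 765}

# A dict value can be reassigned directly, with no tuple rebuilding.
for name in numbers:
    if name[0] >= "Q":
        numbers[name] *= 2

print(numbers)                            # {'Jack': 456, 'Kayden': 355, 'Randy': 1530}
print(sorted(numbers))                    # names in alphabetical order
print(statistics.mean(numbers.values()))  # mean of the numbers only
```

Iterating over a dict yields its keys, so `sorted(numbers)` sorts by name, and `numbers.values()` gives just the numbers for the statistics functions.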
I know that you refer to the numbers as names[0][1], names[1][1], etc, but
not how to refer to just that record that is being changed. For
example, we have to have the program check if a name begins with a
letter that is Q or later; if it does, we double the number associated
with that name.
It's not entirely clear what else you have to do in this assignment, but regarding your concerns above, to reference the ith "record that is being changed" in your names list, simply use names[i]. So, if you want to access the first record in names, simply use names[0], since indexing in Python begins at zero.
Since each element in your list is a tuple (which can also be indexed), using constructs like names[0][0] and names[0][1] are ways to index the values within the tuple, as you pointed out.
I'm unsure why you're using while True if you're trying to iterate through each name and check whether it begins with "Q". It seems like a for loop would be better, unless your class hasn't gotten there yet.
As for checking whether the first letter is 'Q', str (string) objects are indexed similarly to lists and tuples. To access the first letter in a string, for example, see the following:
>>> my_string = 'Hello'
>>> my_string[0]
'H'
If you give more information, we can help guide you with the statistics piece, as well. But I would first suggest you get some background around mean and median (if you're unfamiliar).
I have to deal with csv image data from a camera which exports the data with a header. In that header is a simple function for converting CCD counts into power density. This equation includes both the dark offset level as well as a calibration factor. Here is an example from one line of an image file:
Power Density,=,(n - 232) * 4.182e-005 W/cm^2
Notice the commas. The csv header can be expected to have the same structure each time with different constants for dark level (232) and power density conversion (4.182e-005).
What I would like to be able to do is grab the last cell, strip off the units at the end (W/cm^2), and use what is left to define a function in Python. Something like
f = lambda n: '(n - 232) * 4.182e-005'
Is it possible to do so? If so, how?
eval and exec, which use compile, are both ways to dynamically convert code as text to a compiled function. If you dynamically create a new function, you only need to do the conversion once.
row = "Power Density,=,(n - 232) * 4.182e-005 W/cm^2".split(',')
expr = row[2].replace( ' W/cm^2', '')
# f = eval("lambda n:" + expr) # based on your original idea
exec("def f(n): return " + expr) # more flexible
print(f(0))
# -0.00970224
The lambda eval and def exec have the same result, other than f.__name__, but as usual, the def form is more flexible, even if the flexibility is not needed here.
The usual caveats about executing untrusted code apply. If you are working with image files that are not your own and are worried about an adversary feeding you a poisoned file, then indeed you might want to tokenize expr and check that it only has the tokens expected.
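One way to do that kind of check is to parse the expression with the ast module and reject any node outside a small whitelist before compiling it. This is a sketch under assumptions, not a hardened sandbox: the function name `make_safe_function` and the particular set of allowed nodes (basic arithmetic on the variable n and numeric constants) are choices made for this example.

```python
import ast

# Node types allowed in the header formula: arithmetic on n and numbers.
ALLOWED_NODES = (
    ast.Expression, ast.BinOp, ast.UnaryOp, ast.Constant, ast.Name, ast.Load,
    ast.Add, ast.Sub, ast.Mult, ast.Div, ast.USub,
)

def make_safe_function(expr):
    """Build f(n) from an arithmetic expression in n, rejecting anything else."""
    tree = ast.parse(expr, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError("unexpected element: " + type(node).__name__)
        if isinstance(node, ast.Name) and node.id != "n":
            raise ValueError("unexpected name: " + node.id)
        if isinstance(node, ast.Constant) and not isinstance(node.value, (int, float)):
            raise ValueError("unexpected constant: " + repr(node.value))
    code = compile(tree, "<header>", "eval")
    # Evaluate with no builtins and only n in scope.
    return lambda n: eval(code, {"__builtins__": {}}, {"n": n})

row = "Power Density,=,(n - 232) * 4.182e-005 W/cm^2".split(",")
f = make_safe_function(row[2].replace(" W/cm^2", ""))
print(f(0))
```

A header cell containing a function call, attribute access or any other name fails validation before anything is executed.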
I found a way to do it using eval, but I expect that it isn't very pythonic so I would still be interested in seeing other answers.
Here row is the row of interest from a csv.reader object, i.e. the same string I posted in the question divided at the commas.
# Strip the units from the string
strng = row[2].replace( ' W/cm^2', '')
# Define a function based on the string
def f( n):
    return eval( strng)
# Evaluate a value
print( f( 0))
# Returns: -0.00970224
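Since the question says the header always has the same structure with different constants, another option is to skip eval altogether and pull the two numbers out with a regular expression. The pattern and the function name `parse_power_density` below are assumptions based on the single example line in the question; a real file may need a more permissive pattern.

```python
import re

# Matches "(n - <dark level>) * <factor> W/cm^2" as in the example header.
HEADER_PATTERN = re.compile(r"\(n - ([0-9.]+)\) \* ([0-9.eE+-]+) W/cm\^2")

def parse_power_density(cell):
    """Return f(n) built from the dark level and factor found in the cell."""
    match = HEADER_PATTERN.fullmatch(cell.strip())
    if match is None:
        raise ValueError("unrecognised header cell: " + repr(cell))
    dark_level = float(match.group(1))
    factor = float(match.group(2))
    return lambda n: (n - dark_level) * factor

row = "Power Density,=,(n - 232) * 4.182e-005 W/cm^2".split(",")
f = parse_power_density(row[2])
print(f(0))
```

This keeps the two constants available as ordinary floats and never executes text from the file.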