# Week 3: Algorithms 3


#### Summaries

• Week 3: Algorithms 3 > 3a Linear Programming 1 > 3a Video
• Week 3: Algorithms 3 > 3b Linear Programming 2 > 3b Video
• Week 3: Algorithms 3 > 3c NP-completeness 1 > 3c Video
• Week 3: Algorithms 3 > 3d NP-completeness 2 > 3d Video
• Week 3: Algorithms 3 > 3e NP-completeness 3 and Summary > 3e Video
• Week 3: Algorithms 3 > 3f Introduction to Personal Genomics > 3f Video
• Week 3: Algorithms 3 > 3g Massive Raw Data In Genomics > 3g Video
• Week 3: Algorithms 3 > 3h Data Science On Personal Genomes > 3h Video
• Week 3: Algorithms 3 > 3i Interconnectedness Of Personal Genomes > 3i Video
• Week 3: Algorithms 3 > 3j Personal Genomics Case Studies > 3j Video
• Week 3: Algorithms 3 > 3k Personal Genomics Conclusion > 3k Video

#### Week 3: Algorithms 3 > 3a Linear Programming 1 > 3a Video

• We’ll talk now about linear programming, a general class of problems.
• A linear programming problem, in general, is this: we are given a set of variables and a set of constraints that these variables have to satisfy.
• The constraints are all linear equations and inequalities.
• We want to assign real values to these variables to satisfy the set of constraints and to maximize a given linear function of the variables.
• So we want to optimize the objective function subject to these constraints.
• You have some limits, some constraints, on what resources you have and what you can do.
• The linearity of the constraints and of the objective function imparts some useful properties.
• Suppose a factory can produce two kinds of products.
• The capacity of the factory is to produce these products at a certain rate.
• So there’s a limit as to how much it can produce of one product and the other product.
• The question now is how should the factory allocate its capacity to maximize the profit? In other words, how many tons of each product should the factory produce per day, so it gets the maximum profit? So let’s express this problem mathematically.
• So we have a variable x for the number of tons of product one that the factory produces per day.
• So the profit will be 20 times x, plus 30 times y, because it’s 20 per ton for product one and 30 for product two.
• So at most 400 — there is no point in producing more than 400 tons of product one.
• In order to produce x tons of product one, it needs x over 60 hours.
• So the objective function is 20x plus 30y, and we have these inequalities constraining x and y. This example has only two variables, so we can draw a picture of what the constraints look like and how they restrict the possible values we can give to x and y — we can view this geometrically in the plane.
• We have this other constraint that says x over 60, plus y over 50 is at most eight.
• That’s where we get the maximum possible profit, the maximum possible value.
• So the optimal solution is this vertex of the feasible region.
• Every constraint is a linear constraint, so it defines a hyperplane.
• The constraint says that the solution must lie on one side of that hyperplane.
• In general, the optimal solution will be a vertex of this polyhedron.
• So that’s the general geometry of a linear program.
• There is also the case where the linear program isn’t feasible.
• Then we know that there is no way to satisfy all the constraints.
• Or there is a ray, a sequence of feasible solutions, along which the value of the objective function gets better and better — the problem is unbounded.
• If there is an optimal solution, then at least one vertex is optimal.
• There will be at least one vertex that’s optimal.
• This is a very well-studied problem, since the middle of the last century.
• The basic idea — this is the simplex algorithm — is that you start at a vertex of the polyhedron.
• Then you look at the adjacent edges and vertices.
• You check whether there is an improvement you can make by moving to an adjacent vertex.
• If you can make an improvement, you just pick a better neighbor and move to that vertex.
• If there is no better neighbor, then you are at the optimum.
• In the example we saw before, I have put for each vertex, I have put an arrow that shows a direction of improvement.
• So for example, if we are here at the point (400, 0) at the bottom, and we move up along this edge to the adjacent vertex, then the objective function is going to improve.
• Then we can move that way, and get to the optimal solution.
• If you have a general bounded polyhedron, a polytope, you start at a vertex.
• There are also interior-point methods which, rather than walking on the surface of the polyhedron from vertex to vertex, cut through the middle of the polyhedron.
• Linear programming is used very broadly for optimization.
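The geometry described above — a feasible region cut out by linear constraints, with the optimum at a vertex — can be sketched in a few lines. The numbers below follow the factory example as stated in the lecture (profit 20 and 30 per ton, x at most 400, x/60 + y/50 at most 8 hours); treat them as illustrative, since the lecture’s exact figures may differ. The sketch simply enumerates the vertices of the 2-D feasible region and evaluates the objective at each one, which is exactly why the optimum lands on a vertex.

```python
from itertools import combinations

# Constraints written as a*x + b*y <= c, including x >= 0 and y >= 0.
# Numbers follow the lecture's factory example; treat them as illustrative.
constraints = [
    (1, 0, 400),        # x <= 400: cap on product one
    (1/60, 1/50, 8),    # x/60 + y/50 <= 8: hours available per day
    (-1, 0, 0),         # x >= 0
    (0, -1, 0),         # y >= 0
]

def intersect(c1, c2):
    """Corner point where two constraint boundaries meet (None if parallel)."""
    a1, b1, d1 = c1
    a2, b2, d2 = c2
    det = a1 * b2 - a2 * b1
    if abs(det) < 1e-12:
        return None
    return ((d1 * b2 - d2 * b1) / det, (a1 * d2 - a2 * d1) / det)

def feasible(p):
    return all(a * p[0] + b * p[1] <= c + 1e-9 for a, b, c in constraints)

# Candidate vertices = feasible intersections of pairs of boundaries.
vertices = [p for c1, c2 in combinations(constraints, 2)
            if (p := intersect(c1, c2)) is not None and feasible(p)]

profit = lambda p: 20 * p[0] + 30 * p[1]   # objective: 20x + 30y
best = max(vertices, key=profit)
print(best, profit(best))
```

With these particular (illustrative) numbers the optimum turns out to sit at the vertex where only product two is made; a real solver such as the simplex method walks between such vertices instead of enumerating them all.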

#### Week 3: Algorithms 3 > 3b Linear Programming 2 > 3b Video

• We want to fit a line that matches these points as well as possible.
• We have an equation y = ax + b, and we want to know what a and b are, to determine the line.
• Least square error means that for every point we take its vertical distance to the line, and we minimize the sum of the squared errors.
• It’s easy — there’s a formula that tells you what a and b are as a function of the points.
• Another metric for measuring how well the line fits the points is to minimize the sum of the absolute errors.
• Least square error is the L2 metric, whereas the sum of absolute errors is the L1 metric.
• If we consider points, actual numbers on the real line.
• We want to find one number, one point on the line that represents these points.
• So we want to choose a point that represents these points.
• Minimizing the least square error, the point we will pick is the average of the numbers.
• Minimizing the sum of the absolute errors, the point we will pick is the median of the numbers.
• If there is an outlier out there, and lots of points around here, the least squares line is going to shift up toward the outlier.
• Whereas, with the sum of absolute errors, the line will move less.
• Sum of absolute errors, we can easily formulate it as a linear program.
• We want to say that e sub i is the error of the i-th point.
• We would like e sub i to be the absolute value of axi plus b minus yi.
• So we use two linear constraints: e sub i is at least axi plus b minus yi, and e sub i is at least the negative of that quantity.
• One of these two quantities is the absolute value, and the other is minus the absolute value.
• So e sub i is bigger than both, and since we want to minimize the sum, e sub i will basically be equal to the absolute value.
• We’re going to pick a and b so that we minimize the sum of the absolute values.
• E sub i will then be the absolute value of the difference.
• In this problem, we have n agents and n tasks.
• There is a value vij if agent i is assigned to task j. For example, the value may measure how well agent i can execute task j — the quality of the product.
• Or it can measure how much the agent likes doing task j. We then want to find a one-to-one assignment of agents to tasks that maximizes the total value.
• If vij measures how much the agent likes the task, it’s the total happiness of the agents.
• Or we can think of it as matching — the agents are boys and the tasks are girls, and vij is how much each pair i, j like each other.
• That’s a different version of the matching where we look at total welfare or total happiness of the agents in the tasks.
• We have a variable xij that we want to indicate whether agent i is assigned task j. The intention is that xij will be 1 if agent i is assigned task j, and 0 otherwise.
• Linear programming, though, only talks about real values.
• So keeping that in mind, we can express our problem as we want to maximize the total value of the assignment.
• Because we’ll have, for every assignment of agent i to task j, when it’s 1, we’re going to get profit of vij.
• So this objective function measures exactly what’s the value of our assignment as long as it looks like that- the solution looks like that.
• So what does that mean? First of all, for every agent i, we want it to be assigned only one task.
• For every task, we want to assign only one agent.
• So for every task j, we want sum of xij over all i to be 1.
• Now of course, any one-to-one assignment can be mapped to 0, 1 values that satisfy all these constraints.
• The values of the objective function will be exactly the value of the assignment.
• But an arbitrary real-valued solution might, for example, have an agent i that is assigned 1/2 to task 3 and 1/2 to task 5.
• Not many linear programs have the property that the optimum automatically takes integral 0, 1 values; the assignment problem is one that does.
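A small sketch of the mean-versus-median claim made above: for a single number representing a set of points on the real line, the L2 error is minimized at the average and the L1 error at the median. The data set here is made up, with one outlier, and the search is a simple grid scan rather than the linear-programming formulation from the lecture.

```python
from statistics import mean, median

data = [1, 2, 3, 4, 100]   # made-up numbers with one large outlier

def sse(c): return sum((x - c) ** 2 for x in data)   # L2: sum of squared errors
def sae(c): return sum(abs(x - c) for x in data)     # L1: sum of absolute errors

# Scan candidate representatives on a fine grid and keep the best
# point under each error metric.
candidates = [i / 10 for i in range(0, 1001)]
best_l2 = min(candidates, key=sse)
best_l1 = min(candidates, key=sae)

print(best_l2, mean(data))    # the L2 optimum sits at the average
print(best_l1, median(data))  # the L1 optimum sits at the median
```

Note how the outlier drags the average (and hence the least-squares fit) far from the bulk of the data, while the median stays put — the robustness the lecture attributes to the L1 metric.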

#### Week 3: Algorithms 3 > 3c NP-completeness 1 > 3c Video

• So with so many methods, many types of problems can be solved efficiently — in linear time, quadratic time, and so forth. There are, however, many other problems that are hard, for which we do not have any algorithm that runs in polynomial time.
• There is a theory, called NP-completeness theory, which ties many of these problems together.
• In essence, the crux of the problem for all of them is the same.
• Although these problems look very, very different — the statements are totally different, they come from all kinds of different areas — they are, in some sense, equally difficult.
• Some typical examples of problems like that are- we’ll give a few examples.
• One of them is the circuit satisfiability problem.
• A somewhat similar problem is: if we have a Boolean formula, we want to know if the formula is satisfiable.
• That’s usually abbreviated as the satisfiability problem.
• Very similar to circuit satisfiability — again, brute force is to try every possible assignment of true and false: 2 to the n assignments if there are n variables.
• So integer linear programming is the same, a linear program, but now compute the optimal integer solution, or maybe we want the optimal solution that is zero or one, where every variable takes value zero or one.
• Unlike linear programming, which we know how to solve well in polynomial time, both in practice and in theory, this is a much, much more difficult problem.
• The brute force solution, again, if we’re interested in 0-1 solutions, is to try all combinations of 0 and 1 values for the variables.
• Even just checking feasibility of the problem is very hard.
• Is there some integer solution? Is there a 0-1 solution?
• The maximum clique problem: you’re given a graph, and you want the largest subset of nodes that are all pairwise connected.
• A related problem is, given a graph again, to find the maximum independent set — a largest subset of nodes with no edges between them.
• Is there a bigger one? How do we find the biggest one? Well, the brute force solution is, again, try every possible subset of nodes.
• If I want to solve the clique problem, try every possible subset and check if it’s a clique, if all possible connections are there.
• So for all of these problems, the question is, can you do better? Can you go from 2 to the n to something like n squared, n cubed?
• There were many problems like that, which people had struggled with for many years, until NP-completeness theory came out around 1970 and showed that all of these problems have something in common.
• First of all, to talk about these problems, it’s usually easier to consider simpler questions that are just yes/no questions.
• The first problem we saw, the satisfiability question, was a yes/no question.
• The rest of these problems are optimization problems.
• You can go from an optimization problem to a related decision problem, which is certainly at most as hard.
• The way we go from an optimization to a decision problem is to rephrase it: instead of saying, given an instance, find me the best solution, or what’s the best value you can get, we phrase it as, given an instance and some target value or cost, does there exist a solution that achieves this target? So if it’s a maximization problem, it achieves that value or more.
• If it’s a minimization problem, achieves that cost or less.
• The decision problem is certainly not harder than the optimization problem.
• Because if you could solve the optimization problem, you could find the optimal solution and compare its value to the target.
• So the optimization problem is certainly at least as hard as the decision problem.
• Quite often you can also go the other way, in the sense that if you can solve the decision problem, then by asking an appropriate number of questions with different target values, you can pin down what the optimum solution is going to be.
• So we consider these optimization problems in their decision-problem version.
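The brute-force approach to satisfiability mentioned above — try all 2^n truth assignments — can be sketched directly. The formula below is a hypothetical three-variable CNF, not one from the lecture; it just illustrates why the method blows up exponentially in n.

```python
from itertools import product

# A hypothetical CNF formula over variables 0..2:
#   (x0 or not x1) and (x1 or x2) and (not x0 or not x2)
# Each clause is a list of (variable_index, is_positive) literals.
cnf = [[(0, True), (1, False)],
       [(1, True), (2, True)],
       [(0, False), (2, False)]]

def brute_force_sat(cnf, n):
    """Try all 2^n truth assignments -- the brute force from the lecture."""
    for assignment in product([False, True], repeat=n):
        # A clause is satisfied if some literal matches the assignment,
        # and the formula is satisfied if every clause is.
        if all(any(assignment[v] == pos for v, pos in clause) for clause in cnf):
            return assignment      # satisfying assignment found
    return None                    # formula is unsatisfiable

print(brute_force_sat(cnf, 3))
```

For n = 3 this is instant; for n = 100 the loop would run 2^100 times, which is exactly the gap NP-completeness theory is about.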

#### Week 3: Algorithms 3 > 3d NP-completeness 2 > 3d Video

• So P is the class of problems that can be solved in polynomial time.
• NP is the class of decision problems, yes/no questions, of the following form: I give you an instance, and the question is whether there exists a potential solution, a candidate solution, that satisfies a certain property. There are lots of potential solutions — exponentially many — but each one has a short, polynomial-size description.
• All of these problems fit that description of NP: if you give me a potential solution, there’s a polynomial time algorithm that will check it.
• The problem is that there are too many potential solutions.
• The conjecture is that P is not equal to NP — that the picture looks like that, with problems in NP (the blue part) that are not in P. We do not know the answer.
• You can think of the question as: if I can efficiently check a given solution, can I always also find one efficiently?
• The way we argue about this is by using reductions — reducing from one problem to another.
• So we’ll say that a decision problem A reduces to another problem B if we have a polynomial time algorithm that maps every instance of A to an instance of B such that the answer to the yes/no question is the same for the two instances.
• So if problem B can be solved in polynomial time, then I also know how to solve A in polynomial time.
• The contrapositive — using it in the negative way — is: if I know that A is a hard problem that cannot be solved in polynomial time, that tells me that B is a hard problem that cannot be solved in polynomial time.
• So I can use a reduction in a positive way, using problems I know how to solve in order to solve other problems, or in a negative way, to argue that certain problems are hard if I know that some other problems are hard.
• NP-completeness theory uses reductions in the negative way, to propagate hardness from one problem to another.
• So we say that a problem B is NP-hard if every problem in NP reduces to B. NP has lots and lots of problems.
• It’s basically an infinite class of problems.
• So a problem B is NP-hard if every problem in NP reduces to B — B is at least as hard as the hardest problems in NP. The problem is NP-complete if it is NP-hard and is also itself in NP. An NP-hard problem could be strictly harder.
• NP-complete means the problem is in NP and is among the hardest problems in NP: every other problem in NP can be reduced to it. So by what we said before, if any individual NP-complete problem can be solved in polynomial time, then every single problem in NP can be solved in polynomial time.
• Conversely, if P is not equal to NP, then all NP-complete problems are outside P — they are harder.
• So the P versus NP question goes the same way as the question for any single individual NP-complete problem.
• Now, there has been lots of work, starting from the 1970s, looking at lots and lots of problems — all the problems we discussed are just a few examples: the satisfiability problem, integer linear programming, clique, independent set, and lots and lots more.
• All kinds of optimization problems, all kinds of decision problems, turn out to be NP-hard or NP-complete.
• Problems that people had been working on in many different fields were discovered to also be NP-complete.
• For these kinds of problems, unless P equals NP, one cannot hope to find a polynomial time algorithm.
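A concrete reduction of the kind described above: a graph has an independent set of size k exactly when its complement graph has a clique of size k, so an algorithm for either decision problem answers the other. The sketch below (toy graph and helper names are made up for illustration) checks that equivalence on a small example.

```python
from itertools import combinations

def complement(n, edges):
    """Map a graph on nodes 0..n-1 to its complement -- the reduction
    between the independent-set and clique questions."""
    present = {frozenset(e) for e in edges}
    return {frozenset(p) for p in combinations(range(n), 2)} - present

def is_clique(nodes, edges):
    """All pairs of the given nodes are connected."""
    present = {frozenset(e) for e in edges}
    return all(frozenset(p) in present for p in combinations(nodes, 2))

def is_independent(nodes, edges):
    """No pair of the given nodes is connected."""
    present = {frozenset(e) for e in edges}
    return not any(frozenset(p) in present for p in combinations(nodes, 2))

# Toy graph: a path 0-1-2-3.
n, edges = 4, [(0, 1), (1, 2), (2, 3)]
s = [0, 2]                                 # independent in the path...
print(is_independent(s, edges))            # True
print(is_clique(s, complement(n, edges)))  # ...and a clique in the complement
```

The mapping runs in polynomial time and preserves the yes/no answer, which is all a reduction needs.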

#### Week 3: Algorithms 3 > 3e NP-completeness 3 and Summary > 3e Video

• OK. So what do we do with NP-complete problems? NP-complete problems are a fact of life, and everybody runs into them all the time.
• One possibility is to see whether the types of instances we actually want to solve do not correspond to the fully general case.
• In that case, maybe we can do something better and solve the problem efficiently.
• The maximum independent set problem is solvable in polynomial time for a subclass of graphs called bipartite graphs.
• Some cases you do have that structure, and then you can solve the problem.
• We saw a case of integer linear programs, the assignment problem, where integrality comes out for free — it’s automatic — so it’s really just a case of linear programming; there are classes of linear programs that have that property, where integrality comes for free, and then you can solve your problem.
• Another option is to try to design better exponential algorithms that improve on brute force: not trying every single possibility, but cutting down the possibilities wherever we can rule cases out or find shortcuts.
• A lot of research has gone, for example, into improving SAT solvers, because many constraint satisfaction problems, problems in verification, and so forth reduce to satisfiability, and SAT solvers are used to do the job.
• Similarly, there have been a lot of improvements in integer linear programming solvers; and for several problems, the field of polyhedral combinatorics tries to reduce the problem to special cases of linear programming, to at least be able to solve bigger and bigger instances of some well-studied problems, like the famous traveling salesman problem.
• Another approach for optimization problems is approximation algorithms.
• Such approximation algorithms have been developed for several problems: they try to find a solution close to optimal, with a provable guarantee that your solution — although we don’t know what the optimal is going to be — will not be far from it.
• For some problems, you can find good approximation algorithms.
• On the other hand, for some other problems, you can prove that there always will be some hard cases where you can’t get close to optimal.
• Finally, there are several common types of heuristics or meta heuristics algorithms that people developed for optimization problems.
• Again, a lot of research on heuristic algorithms that one can try to apply to difficult problems and hopefully find good solutions.
• So these are basic methods for handling several types of problems, and they are essential tools in the arsenal of an algorithm designer.
• We discussed several algorithms for several basic problems.
• We discussed linear programming, a general class into which several optimization problems fit and which is heavily used.
• All of these algorithms are efficient algorithms running in low-order polynomial time, but, as we said, there are other problems for which this seems to be impossible.
• We discussed the theory of NP-completeness, which ties all these problems together and says that, in effect, they are all the same problem — if any one of them can be solved by a polynomial time algorithm, then we’ll be able to solve all of them.
• We presented briefly several general approaches to dealing with NP-complete, intractable problems.
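As an illustration of an approximation algorithm with a provable guarantee — the lecture doesn’t name a specific one, so this uses minimum vertex cover, a classic textbook example: repeatedly take both endpoints of any uncovered edge. The chosen edges form a matching, and any cover must contain at least one endpoint of each matched edge, so the result is at most twice the optimum.

```python
def vertex_cover_2approx(edges):
    """Matching-based 2-approximation for minimum vertex cover:
    repeatedly pick an uncovered edge and take both its endpoints."""
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:   # edge not yet covered
            cover.update((u, v))                # take both endpoints
    return cover

# Toy graph (made up for illustration).
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
cover = vertex_cover_2approx(edges)
# Every edge touches the cover, and |cover| <= 2 * optimum, always.
print(cover, all(u in cover or v in cover for u, v in edges))
```

Here the algorithm returns a cover of size 4 while the optimum is {0, 3} of size 2 — within the guaranteed factor of two, found in linear time on an NP-hard optimization problem.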

#### Week 3: Algorithms 3 > 3f Introduction to Personal Genomics > 3f Video

• We’ll move on to actually do data science and learn conclusions from personal genomes.
• We’ll follow up by discussing the interconnectedness of different genomes of different individuals and the computational challenges that this poses.
• Now what are we going to talk about in this topic? We will consider personal genomics in the context of making medicine more precise and more adapted to each individual, and look for what exactly can we expect when we look at a particular individual’s personal genome.
• As years went by during the 90s, we were able to get the full genome sequence of higher order organisms — organisms with a nucleus, like baker’s yeast — and then multicellular organisms, the first of which was a worm.
• Then in 2004 we were finally able as a community to get the full blueprint for the human genome, the basic sequence, the basic instruction set of how to build a human.
• So the effort continued to obtaining the full genome sequence of particular individuals.
• So when you get that full sequence, what should we expect in such a personal genome? Well, first of all, all genomes of all humans are of almost exactly the same length.
• The vast majority of these three billion letters would not have any unique information for you, because they would be identical to the reference human genome, to the vanilla version of a human genome.
• In each individual personal genome you would find only about 0.1 percent of the letters of the DNA, only about three million letters, that are non-reference, that are different from the standard human.
• In each individual personal genome you would also find several tens of thousands of variants that are much rarer.
• These rarer variants are often more deleterious than common variation.
• Only a small part of this group would be mutations that are private to you, mutations that distinguish your genome from even the genomes of your parents or siblings.
• So when you look at the genomes of infants with, say, congenital heart defects, you often find private mutations in their genomes.
• Specifically, there’s a very big difference in how you would interpret common variants along the DNA sequence versus rare variants.
• Common variants often have very subtle, sometimes probabilistic or conditional, effects on outcome, whereas rare variants might have often larger effects or more severe outcomes.
• These are ones that you would see typically when you move along reading a personal genome.
• Now putting that into a somewhat more quantitative perspective, and in the context of investigating genomes, we observe that on one end of the spectrum we have the very rare variants that might be the ones that have a very strong effect.
• These typically have very mild effects, very probabilistic effects, on risk for common disease.
• Of course there are the intermediates, the variants that are of somewhat lower frequency that have intermediate effect size.
• There are, of course, variants that are both rare and of very small effect.
• There are very, very few examples of variants that are both common as well as having a large effect on phenotype.
• So to sum up, personal genomes include large amounts of raw data, but much of that is actually identical to what you would know, or could guess, without having that data.
• The parts that are different from the default genome come with their own challenges in terms of interpretation.
• The challenges are that the effects are typically very subtle and complex.

#### Week 3: Algorithms 3 > 3g Massive Raw Data In Genomics > 3g Video

• These data, as we’re going to see, rely on very short bits of sequence, called reads, that are then mapped onto the long genome sequence in order to find the differences, the mismatches, between your reads, your sequence, and the reference average human genome sequence.
• So what’s behind this phase transition? How come technology was able to so rapidly improve? Well, it’s all based on the notion that instead of reading a long genome we would read short bits of it and do that in parallel, many, many times.
• So how do we read a genome? How do we read DNA? Let’s think of this as our genome- one long sequence of DNA.
• What we will then do is put all these in a test tube, or some kind of a biological assay, that allows us to perform a reaction in parallel, millions and even billions of times, and read the ends of each such segment from both ends.
• So at the end of the day, if we’ve done that for all the segments across many, many, many copies, we’ll be able to get, for each site along the DNA, one- or typically dozens, even- of reads that actually read that piece of DNA.
• Because in the process of doing this massively parallel sequencing, we have lost the information about which read might come from which position along the genome.
• So what we now should try to figure out is that this first read, GGTCGGT, really comes from this position along the genome.
• The second read comes from a different position along the genome, somewhere here.
• We should do that for each of the reads, in order to find out whether they depict my genome, say, that’s identical to the average human being, for the most part, but maybe different at particular sites.
• At the end of the day, we would like to be in a situation where each of the short reads has found its position along the genome and knows whether it exactly matches that position, or whether there is a certain mismatch- in the beginning or in the middle of the read- or whether it just comes from somewhere else.
• What we would like to provide as output is the position along this reference sequence of each and every read. Of course, the complications are that the reads may not exactly match the reference genome, either because of just experimental errors, or because there are true differences between my DNA, that I’m sequencing, and the average human genome that I’m comparing against.
• We can verify that it continues to match all the way, until the end of the read. And we can do that for this read and then the next one, each time trying each and every position along the genome, where it might be the starting position, and checking the sequence of the genome versus that read from that position and onwards.
• Had I given you a genome like this, a few dozen letters long, and only seven reads of length 10 each, that would be a perfectly viable solution.
• But when the genome length is in the billions, and you have maybe a billion reads of length 100 each, going through all the read data, which is of size 10 to the 11th letters, times the genome, which is 10 to the 9th, would be an algorithm with a running time of around 10 to the 20th operations — more than what you can do, and definitely more than what you would want to do on a routine basis for every genome that you’re going to sequence.
• Such an index is very easy to generate: you read the book from start to finish, noting each and every term and where it appears.
• We now have a sorted list of all the words that appear along the genome, along with their respective positions.
• So ACTGG appears at position 19 along the genome, around here.
• Now we get to our real input data, the personal genome of you or me: many, many reads — m reads, each of length small l. If we divide up these reads into words of length five, we see that the first read is comprised of two five-letter words, the first being GGTCG, the second being GTGAG.
• We can use our index to look up each of those words and find where they occur along the genome.
• So GGTCG appears along the genome at position five, as we see in our index.
• Scanning the reads requires going through each one and just checking its position in the dictionary, and from that, in the genome — a number of operations that’s linear in the size of the data, not something that takes the length of the genome and multiplies it by the length of the read data.
• A good example of that is the mapping of short reads onto long genome strings, where we cannot do the simple minded thing of matching each read to each position.
• So this is how we analyze each single genome — at least the raw data from each single genome — converting many, many reads into information about where my genome is identical to versus different from the average human.
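The dictionary idea above can be sketched as follows: index every five-letter word of the genome, then map each read by looking up its words and shifting back by their offsets inside the read. The sequences here are made up; a real mapper would also verify the full read against each candidate position and tolerate mismatches and sequencing errors.

```python
from collections import defaultdict

K = 5  # word length for the index (the lecture's five-letter words)

def build_index(genome):
    """Dictionary from every K-letter word to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(genome) - K + 1):
        index[genome[i:i + K]].append(i)
    return index

def map_read(read, index):
    """Candidate start positions for a read: look up each K-word it contains
    and shift back by the word's offset inside the read."""
    candidates = set()
    for off in range(0, len(read) - K + 1, K):
        for pos in index[read[off:off + K]]:
            candidates.add(pos - off)
    return sorted(candidates)

# Toy example (made-up sequences, not real genomic data).
genome = "ACGGTCGGTGAGTTACTGGA"
index = build_index(genome)
print(map_read("GGTCGGTGAG", index))  # this read was taken from position 2
```

Each read costs a handful of dictionary lookups instead of a scan of the whole genome, which is the linear-versus-10^20 difference the lecture describes.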

#### Week 3: Algorithms 3 > 3h Data Science On Personal Genomes > 3h Video

• How do we do data science and draw conclusions from multiple personal genomes of many individuals? The main strategy that we, as a community, have been pursuing is genome wide association studies: searching a large data set of many individuals, across the entire genome, for those precious insights — needles in a haystack of random effects — where true genetic variants influence a clinically relevant trait.
• Usually we won’t care that much about a polymorphism, but for some we would care, because it might turn out that individuals with the A allele tend more often to have a particular disease, whereas individuals with the blue allele are more likely to stay healthy and would therefore be controls.
• If we collected a large number of individuals with the disease, the cases, and compared their genomes at that particular site to those of the healthy controls, we might be able to expose this connection between that particular site in the DNA and the disease — and maybe understand the disease better, maybe come up with some form of treatment.
• So our goal is to figure out such connections, such direct associations between a genotype, the status of a particular polymorphism in a particular individual, and the disease status, the phenotype.
• That would be observed by saying that 2/3 of the disease cases here have the A allele but only a quarter of the healthy controls do.
• So ideally this is what we would do, examine each and every site along the genome and test for direct association between that site, that polymorphism, and the disease.
• So both these would be equally associated with the disease.
• Still, as we know from our statistics, if there is a very strong connection between the causal variant and disease there will be, somewhat weaker, but still a connection between the proxy polymorphism and the disease even though that’s a weaker connection because that’s an imperfect proxy.
• How do we quantify that? If we take those six disease cases and classify them by this variant, four versus two, and likewise the eight controls, two versus six, we have a two by two table, and we can ask whether this is what we would expect by chance — noting that in this cohort, 8 of the 14 individuals carry the blue allele and 6 of the 14 carry the other allele.
• So if we do that for each and every variant and get an assessment of how significant the association is between that variant and case-versus-control status, we can hopefully find the ones that differ widely from expectation, suggesting that something is going on here — that there’s some connection between the variant and the disease.
• Diseases are often far from a clear-cut case versus a clear-cut healthy individual: you might have different categories of disease, or the trait you are interested in may be quantitative rather than a case-versus-control status.
• Last but not least, there is a big gap between variants being correlated with disease and variants actually causing disease.
• One particular artifact I want to mention in this context is that an indirect correlation may come about without any causal relationship between the gene variant and the disease.
• Environment is, in itself, often a causal agent of disease.
• So if environment is correlated with disease, environment is correlated with ethnicity, and genetics is correlated with ethnicity, you might — and typically would — observe spurious signals of genetics correlated with disease, without the gene variant actually doing anything that’s disease related.
• To end on a high note, the community has been doing these genome wide association studies for seven or eight years now, and there is an accumulated knowledge base of results from these studies across thousands of different traits and millions of subjects, finding many thousands of positions along the genome — here appearing as colorful dots along the different chromosomes, colors corresponding to different diseases — where a particular gene has been found to be associated with disease.
• So as a community we are getting much better now at interpreting our genomes and figuring out what the variants in each genome might mean probabilistically in terms of our disease risk.
• To sum up, the challenge in association studies is to figure out connections between genetic variants and traits, where the complications involve finding variants that are very, very significantly associated, because the effects of each variant are very, very small.
• They also involve trying to distill the observed correlations into actual causal relationships between genetics and disease, in order to learn more about the disease and to interpret it better.

#### Week 3: Algorithms 3 > 3i Interconnectedness Of Personal Genomes > 3i Video

• So the basic challenge is that given n individuals, there are on the order of n squared connections between individuals.
• So these are segments of identity by descent, across individuals.
• Now, if you wanted to figure out all those remote relatedness connections (sometimes very remote, maybe dozens of generations into the past), the naive way to detect those segments of identity by descent would be to consider each pair of individuals and scan their genomes side by side, looking for stretches along their respective genomes that are identical over a very long region.
• Now, the problem with this is that you’re doing a complicated thing along the entire genome, and doing that for each pair of individuals.
• There are n squared or order of n squared such pairs, too many pairs for modern-sized cohorts that may reach a million individuals.
• Consider a toy example: each position is a SNP, and we're comparing eight individuals, or eight individual chromosomes, to one another.
• So our dictionary-construction stage would involve listing the six-letter word of this first individual.
• The second individual has ACCCAT, and so on and so forth.
• We can recall that the first word was reported by individual number one, and the second word was reported by individual number two, and so on and so forth.
• At the end of the day, we find that across this entire data set of eight individuals there are only five different words- words of six letters- where the first word is observed in individual number one and individual number four.
• The second word was observed at individual number two and individual number three.
• There’s another word that’s shared by two individuals, six and seven.
• These pairs of individuals that share words are already known to be the candidates in this data, to be sharing long segments with one another.
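The dictionary stage described above can be sketched as follows. This is a toy sketch of the hashing idea only: the `haps` data and the fixed six-letter word are made-up illustrations (real methods slide windows across the whole genome and then verify each candidate pair), though the second individual's word `ACCCAT` follows the lecture's example.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(haplotypes, word_length=6):
    """Hash one fixed-length word per haplotype; any two individuals that
    hash to the same word become candidates for sharing a long
    identical-by-descent segment. Only candidates need full verification,
    avoiding the naive all-pairs genome scan."""
    dictionary = defaultdict(list)
    for individual, seq in enumerate(haplotypes, start=1):
        dictionary[seq[:word_length]].append(individual)
    pairs = set()
    for carriers in dictionary.values():
        pairs.update(combinations(carriers, 2))
    return pairs

# Eight toy haplotypes yielding five distinct words: individuals 1 & 4
# share a word, 2 & 3 share another, and 6 & 7 share a third.
haps = ["AACGTT", "ACCCAT", "ACCCAT", "AACGTT",
        "GGTTAA", "CATCAT", "CATCAT", "TTGGCC"]
print(sorted(candidate_pairs(haps)))   # [(1, 4), (2, 3), (6, 7)]
```

The dictionary is built in a single linear pass, so the quadratic cost is paid only for the few pairs that actually collide, not for all n squared pairs.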
• At the end of the day, what we can do is draw these links between individuals who are hidden remote relatives- they don't know they are relatives.
• If we model each individual as a node in a graph, and draw edges, the links between individuals who share such segments of identity by descent- therefore indicating they are hidden relatives- one thing that we expect, based on the theory of random graphs, is that if we detect sufficiently many such links, we are very likely to see everybody connected to everybody.
• If there are lots of links between individuals, then everybody’s connected.
• What happens when we look at real data? Well, we try to consider a data set that, at the time, was sizable- 1,000 individuals of self-described European ancestry in New York City.
• If each individual is a dot, let’s see what happens when we draw links between each pair of individuals that share several segments of identity by descent with one another.
• We’re seeing a group of about 40% of the individuals, here, that are very strongly connected, but with very few links to anybody else.
• When you look at the questionnaire data that we have about these individuals, you observe that what distinguishes them is that these are the Ashkenazi Jewish individuals.
• When you add to the same data set Ashkenazi Jewish individuals from other sources outside New York, you see that this is essentially the same cluster of interconnected individuals, connected because of remote family ties that stretch hundreds of years back, indicating shared population ancestry rather than close relatedness.
• We’re able to use that in order to reveal population structure, specifically a group of individuals within New York that comes from the same population of ancestry.

#### Week 3: Algorithms 3 > 3j Personal Genomics Case Studies > 3j Video

• One is from the disease perspective, so looking at diseases as the cases we’re investigating.
• The first disease that I want to discuss is schizophrenia.
• It’s a chronic psychiatric disorder, a very debilitating disease.
• So is schizophrenia a completely genetic disease? No, there are other factors that are known to affect the risk of schizophrenia.
• Infection is similarly or even more strongly associated with schizophrenia, and prenatal conditions, particularly famine or bereavement, carry a much higher associated risk, as much as a sixfold enrichment for being schizophrenic.
• Similarly, obstetric conditions involve risks varying from one-and-a-half- to almost seven-fold, but when you compare those to family history, you realize that the latter is even riskier.
• Indeed, if you have a first-degree relative who has this disease, that would be the most detrimental factor for your risk of schizophrenia.
• One can quantify the excess risk from having a first cousin who has schizophrenia: that would double your risk to 2%, compared to 1% in the general population.
• If you have closer affected relatives, like nieces, nephews, or grandchildren, you have a higher risk.
• If you have affected first-degree relatives, that increases your risk further, up to a 48% chance of becoming schizophrenic if you have an identical twin with the disease.
• So given that genetics contributes so much to schizophrenia, what does that mean? Can we now look at the genome and ask whether an individual is going to develop the disease? Well, usually no.
• Recall our chart from an earlier topic that considers rare versus common variants with high versus low risk.
• We have found lots of common variants, that each have a very, very low effect on schizophrenia risk.
• So cancer is a disease of mutations to DNA, but typically not the DNA that we transmit from parents to children; rather, DNA that is transmitted from one cell to its daughter cells within the same body.
• So about 75% of breast cancer cases and as many as 90% of ovarian cancer cases are sporadic cases.
• The rest do show an indication of accumulation in family clusters, or in some cases are known to be hereditary, known to involve specific mutations that affect cancer risk.
• In that sense, in terms of the specific mutations that are known to induce risk, breast and ovarian cancers have been somewhat of a poster child for precision genomics, because of the strong effects that are known for particular variants in two specific genes, aptly called breast cancer one and breast cancer two.
• The effect of BRCA1 and BRCA2 on disease risk is dramatic.
• In terms of ovarian cancer, that’s a relatively rare disease in the general population.
• Only 1% to 2% of women in the general population develop ovarian cancer, but the risk for carriers of BRCA1 mutations is 40% to 60%, and for BRCA2 it is still high, 10% to 20%. For men, where breast cancer is otherwise very rare, as many as 6% of BRCA2 mutation carriers would have breast cancer.
• So we’re talking here about mutations with a very substantial effect, on diseases that people are very concerned about.
• Film star Angelina Jolie has been a spokesperson for BRCA mutation carriers, having undergone a double mastectomy as a preventive action because she is a carrier and did not want to risk having breast or ovarian cancer.
• The current recommendations are that anyone with a family history of breast or ovarian cancer should get tested for genetic mutations in BRCA1 and BRCA2.
• They can report my carrier status for multiple rare diseases, like Bloom syndrome and cystic fibrosis, which are rare diseases caused by specific genetic mutations that they check.
• I have somewhat elevated risk for age-related macular degeneration.
• I am a few percent more at risk than the general population of having this condition, because if you click on it, you observe that there are several sites known to affect this trait, and I carry the more risk-prone allele for a couple of the important ones.
• The news is not all bad, because for many other traits I actually have a lower risk than the rest of the population.

#### Week 3: Algorithms 3 > 3k Personal Genomics Conclusion > 3k Video

• Welcome to the concluding topic for personal genomics in machine learning for data sciences.
• Much of our genomes is identical to that of the average human being, so it takes a lot of work to sift through the raw data and map those short reads to the reference human genome in order to find the changes that make each of us unique.
• We can then take data across many, many individuals and start learning from it, start inferring connections between genetic variants that are correlated to one another but also correlated to traits, to phenotypes of interest, that can be relevant to our health.
• While all these methodologies may be abstract, they have very concrete implications, as can be observed when you look at specific diseases, like schizophrenia, like breast cancer, and when you look at particular individuals and their genomes- at your genome, at my genome.
• To sum up, I’d like to draw some parallels between personalized genomics and the greater analysis of big data.
• So we have to develop methodologies that require only a single scan through the data.
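As one concrete example of a single-scan methodology, Welford's online algorithm computes the mean and variance of a stream in one pass with constant memory. This is a generic illustration of the single-scan principle, not a method from the lecture; the sample data is made up.

```python
def streaming_mean_variance(stream):
    """Welford's single-pass algorithm: one scan through the data,
    O(1) memory, no need to store or revisit any item."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count            # running mean update
        m2 += delta * (x - mean)         # running sum of squared deviations
    variance = m2 / count if count else 0.0
    return count, mean, variance

# Hypothetical measurements arriving one at a time.
n_obs, mu, var = streaming_mean_variance([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
# n_obs = 8, mu = 5.0, var = 4.0 (population variance)
```

The same one-pass discipline is what makes statistics over millions of genomes feasible: each read or variant is touched once and then discarded.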
• Finally, and related to the above, while analysis of direct conclusions from a piece of data to an outcome are one kind of a challenge, the higher-order challenge is investigating connections between items.
• In genomics, we’re already at the stage where we can conduct a thorough investigation of one’s DNA from start to finish- so, at least getting the data is a done deal.