Biotechnology - Genomic and Proteomics/Sage - A Merck Project
PAPER UNDER DEVELOPMENT
Background
There are law review articles about the public domain aspects of the life sciences, both as a government policy and as a precompetitive publication / patent defense. There are also articles about the use of collaborative, common platforms as vehicles for innovation.
In the life sciences, we find that the use of the public domain (if conformed to as a standard legal policy) can combine with convergent standardization on data systems and transform the data itself into a platform for innovation and collaboration. Thus, a default legal position by the government, or a defensive legal strategy by a large corporation, can have a "side effect" of enabling new forms of innovative R&D, but only if it combines with a standard data system. In culture, that standard is the Web - the Web is sufficient. In science it appears to require much more - and more complex - standards development. We started in the life sciences but have found proof points outside them in both high energy physics and astronomy: in HEP the standardization is happening around CERN and the ATLAS project, and in astronomy it is a mixture of national-level legal policies and an entirely voluntary international technical standards project.
Another new element is the move from relatively simple datasets to massive datasets designed for analysis in a certain context (supercomputing, specific Bayesian network analytics, etc.). In this context, "download" is archaic due to the last-mile bandwidth problem, and copying rarely takes place, which means any legal system based on copyright is essentially meaningless. IP law is orthogonal at best.
Our paper will analyze this effect with a specific focus on genomics, with comparisons back to the proof points from the other scientific disciplines.
Draft
Warning
(Note: We will include a section on gene patents and the impact of patents as part of our analysis, although for the most part these have not impeded research in genomic data spaces. They have impeded diagnostic delivery and contributed to high drug prices, which will be covered in another part of the ICP project. We will also address data protection internationally.)
“Executive Summary”
Within biotechnology, genomics and proteomics experienced a massive boom in the 1990s, driven by the US$3 billion government investment in the Human Genome Project (HGP), a 13-year effort to sequence the human genome led by the U.S. Department of Energy and the U.S. National Institutes of Health. Complementary efforts to sequence less complex genomes, such as the fruit fly and the mouse, were also part of the HGP.
The high uncertainty and asymmetry of information characteristic of the biotechnology industry had transformed the field into an archipelago of highly specialized islands of knowledge, and the HGP was the first big attempt in genomics to map the entire human genome using anything close to a commons-based approach. With the HGP, it was understood that only a limited group of people could contribute, since there was a lack of capacity and infrastructure. Few had the scientists, the labs, or the machines to participate in the study - a marked point of differentiation between the HGP and open source software projects, where the means of production are democratized. This asymmetry in the production of genomic information was further entrenched by the funding mechanisms, which tended to favor well-established sequencing labs at major institutions (in many cases, for good reasons of history and delivery).
Most of the work was done in academic centers in the United States, Canada, New Zealand, and Britain, with scientists from China, France, Germany, and Japan also key members of the consortium. From the beginning the work was intended to create public resources, and a database system at the US National Library of Medicine (GenBank) was designed to hold the sequences resulting from the project. However, input into the system was much slower than projected well into the mid-1990s. Some of the delay was due to the natural demands of such a complex project, the development of new technologies, and so forth. But a significant source of delay was the withholding of data from the public domain by the sequencing centers.
At the same moment, Celera, a private company, raced to create a private version of the genome. Celera's genome was being made available as a database, under a complex contract regime including patent licenses (the majority on provisional, sketchy patent applications filed using semi-automated legal software systems) as well as contract terms demanding "reach-through" rights on any intellectual property generated downstream of database use. Celera raised enormous amounts of funding from venture capital firms and public markets, and held a very high profile by the end of the 1990s. Indeed, there was even talk of scuttling the HGP in favor of leaving the genome to the private markets.
The Celera competition drove the government-funded effort to work faster, and also pressured the scientists involved to develop more robust norms for collaboration and cooperation. The norms that emerged from the Human Genome Project served as the basis for the development of commons-based data-sharing practices in the genomics field, and also for the understanding of the legal rules related to database protection. Known colloquially as the "Bermuda Rules," the norms were generated by scientists at a series of private meetings. The Bermuda Rules dictated that raw sequences be deposited into GenBank within 24 hours of coming off the machines, but reserved the right to publish on complete "assemblies" of genomic information to the sequencing center that made the deposit. These were not legal rules - there was no right to sue for copyright infringement - but true community norms that could be enforced in decisions related to tenure and grant review.
The simple database system set up for the initial gene sequences also grew in technological power and market prominence, even in the face of enormous private investment in startups and patent applications in the bioinformatics space. The Entrez Global Query Cross-Database Search System allows users to search across dozens of life sciences databases housed at the National Center for Biotechnology Information inside the US National Library of Medicine, all of which are legally in the public domain. Entrez includes the full genome sequences of the Human Genome Project and many other resources.
The legal status and technical searchability provided by Entrez are a powerful force towards unregulated openness - no registration is required, no data collected, no licenses signed. Indeed, the public domain status of the genome is considered a key element in the legal interoperability of genome sequences regularly uploaded from sites across the world, now that the technology to perform sequencing is ubiquitous and cheap. From the billions of dollars required to achieve the first human genome, the cost is now as low as $5,000, and the work can be done in a week. Advanced sequencers are available on eBay for less than $1,000 - a little more for one in guaranteed working order - for shipping worldwide using Buy It Now.
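To make the point concrete, the sketch below runs an anonymous cross-database search against Entrez's Global Query endpoint, part of NCBI's public E-utilities interface - no account, license, or key is involved. It is a minimal sketch, assuming the endpoint's current XML layout (ResultItem elements carrying DbName and Count); the search term is arbitrary.

```python
# Minimal sketch: an anonymous cross-database search of NCBI Entrez via the public
# E-utilities "egquery" endpoint. No registration or license is required (NCBI asks
# only that heavy users identify themselves and throttle their requests).
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EGQUERY = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi"

def entrez_global_query(term: str) -> dict:
    """Return {Entrez database: record count} for a free-text search term."""
    url = EGQUERY + "?" + urllib.parse.urlencode({"term": term})
    with urllib.request.urlopen(url) as resp:
        tree = ET.parse(resp)
    counts = {}
    for item in tree.iter("ResultItem"):      # one entry per Entrez database
        db, count = item.findtext("DbName"), item.findtext("Count")
        if db and count and count.isdigit():
            counts[db] = int(count)
    return counts

if __name__ == "__main__":
    for db, n in sorted(entrez_global_query("BRCA1").items()):
        print(f"{db}\t{n}")
```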
Thus, the mix of a legally available genome and a technical infrastructure for its use led to an explosion of scientific knowledge creation. Out of the HGP came thousands of papers and patents, and an explosion of startup companies in genomics and proteomics, which peaked in the genomics "bubble" of the late 1990s. There was great expectation that genomics companies would dominate the new face of drug discovery and development - an expectation that faded as the bubble burst and hundreds of companies went out of business.
The publication of the complete human genome in the public domain also had a significant impact on the potential for companies to use trade secrets to protect their data products. Since the completion of the first complete draft public sequence, companies like Celera, Incyte, Human Genome Sciences, Millennium, and others have exited the foundational data market, with Celera the most extreme example - abandoning the database market entirely by depositing its previously private genome sequence directly into Entrez. This is a reflection both of changing market conditions and of the growing economic power and value of the unregulated-open systems. Merck, one of the world's largest pharmaceutical companies, added to this growth by depositing the Merck Gene Index into the Entrez system - a strategy to establish gene sequences as pre-competitive and avoid widespread gene patent "thickets," but one with the secondary consequences of increasing the market power of the unregulated-open space and creating market standardization around both the Entrez technology platform and the public domain.
Other projects to map human genomic variation (HapMap, the SNP Consortium) and cellular signaling (the Alliance for Cellular Signaling) align with the Entrez model and frequently are absorbed into it. A newer entry is the dbGaP project, also part of NCBI, which contains data correlating genotype with phenotype. It is legally in the public domain and available to qualified researchers with a user account at the NIH, but requires the submission of a data access plan to a review committee, which introduces a degree of regulation otherwise not seen at NCBI. This process was spurred by the realization that patients could be re-identified from data inside dbGaP using complex algorithms, which led to the imposition of control over the data via secrecy protections and access controls.
Now, the next wave of technical and legal change is coming to the biotechnology data space. In earlier years, it was possible to build a major body of work (corporate or academic) by focusing on one specific gene at a time. We didn't know any better than "one gene, one protein, one disease" as a theory. The concept of networks of genes was not yet developed. One consequence of this reality was the emergence of islands of knowledge around a specific gene and its expression, "forcing" scientists to collaborate - but the way they found to collaborate was through the production of papers.
As the HGP drew to a close, the new race was to quantify functional information about the genetic variations that make us different - polymorphism of sequence - as well as about gene "expression," which is the knowledge level above gene sequences. The sequence of a gene is nothing more than the raw code, the ATCG pattern, whereas the expression of a gene tells the scientist when the gene is active or inactive, and yields information about the potential function of the gene. In the early days of gene expression there was a rush to privatize, closely analogous to the early, pre-Bermuda Rules HGP. Crude "chips" that could analyze a few genes at a time were invented, refined, patented, and sold. Companies like Affymetrix were formed to privatize the hardware platforms, and companies like Rosetta Inpharmatics to privatize the software analytics, initially developed in academic labs. At the peak of the bubble, Rosetta sold to Merck for over $600,000,000, leading a frenzy of bioinformatics startup formation. Now it is easy to buy gene expression services as a commodity product from any number of core facilities, or to buy a machine for use and control in any reasonably well funded academic lab.
The impact of high throughput gene expression wasn't exactly what everyone expected. The last ten years have conclusively shown that genes actually function in a deeply complex regulatory network environment. A change in one part of the network can radically affect behavior far across the network, much as a ground stop of airplanes in Chicago can affect flights all over the USA. This discovery again changes the need for collaboration in the field of genomics: it is not enough to know about one gene. One needs to be able to comprehend the entire network affected, and the expertise required for that comprehension is inherently unevenly distributed.
One result of the new reality is a movement for a "commons-based" approach in network genomics. The movement is characterized by standard repositories, standard data formats, and the same legal choice of the public domain. This trio of choices turns out to be a fairly scalable approach to managing complexity. Gene expression scientists began with the creation of a data format standard - the Minimum Information About a Microarray Experiment (MIAME). In tandem, the NCBI created the Gene Expression Omnibus (GEO) at Entrez, a standardized database to host gene expression data - but only if uploaded into the public domain in MIAME-compliant form.
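As a concrete illustration, the sketch below locates public GEO records through the same E-utilities interface used above ("gds" is the Entrez index covering GEO DataSets and Series). It is a minimal sketch under the assumption that a plain esearch query suffices for the purpose; the search term is only an example.

```python
# Minimal sketch: finding public expression records in GEO via the Entrez
# E-utilities "esearch" endpoint (db="gds" is the GEO DataSets/Series index).
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def geo_record_ids(term: str, retmax: int = 20) -> list:
    """Return Entrez UIDs of GEO records matching a free-text query."""
    params = {"db": "gds", "term": term, "retmax": retmax}
    url = ESEARCH + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp:
        tree = ET.parse(resp)
    return [e.text for e in tree.iter("Id")]

if __name__ == "__main__":
    # Example free-text query (illustrative only).
    print(geo_record_ids("breast cancer expression"))
```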
The next phase appears to be the need for additional standardization at the data level. GEO is full of data, but the data is generated under wildly different parameters. MIAME describes the conditions of the experiment at the point of data capture, but it fails to create an environment either for robust "horizontal" data analysis or for the use of the combined datasets to inform network "models" of disease. Yet this is the true potential of a commons-based approach: scientists far removed from data generation can begin to treat high throughput gene expression data as modular input into software analysis.
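The sketch below illustrates what "modular input" means in practice: two expression matrices produced by different labs are cut down to their shared genes and normalized per dataset so they can feed a single downstream model. The file names and the choice of per-gene z-scoring are hypothetical - stand-ins for whatever harmonization a real horizontal analysis would require.

```python
# Illustrative sketch of "horizontal" reuse of expression data (hypothetical files):
# each CSV is a genes-by-samples matrix; datasets are z-scored independently and
# then restricted to the genes measured in every study before being combined.
import pandas as pd

def combine_expression(paths):
    frames = []
    for path in paths:
        df = pd.read_csv(path, index_col=0)        # rows = gene IDs, columns = samples
        df = (df.sub(df.mean(axis=1), axis=0)      # center each gene within this study
                .div(df.std(axis=1), axis=0))      # scale each gene within this study
        frames.append(df)
    # Keep only genes present in all studies, then place the samples side by side.
    return pd.concat(frames, axis=1, join="inner")

if __name__ == "__main__":
    merged = combine_expression(["lab_a_expression.csv", "lab_b_expression.csv"])
    print(merged.shape)  # (shared genes, combined samples) ready for network modeling
```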
One potential driver of this next phase could, fittingly, again be Merck. Having acquired Rosetta in 2001, and having invested hundreds of millions more in developing and using the Rosetta platform, Merck in July 2009 spun out the Rosetta platform - data, software, network models, hardware, patent rights, and staff - into a non-profit organization called Sage Bionetworks. Sage aims to systematically create a public domain resource that will equal GEO in total size, but exceed it in data consistency and in reusability for horizontal query and modeling.
Complementary Skeleton
The paper will cover a set of basic sections.
The government created public domain in genomics
Human Genome Project
- Data access policy
- Data access reality
- Bermuda Rules etc.
- Impact of Celera competition
SNPs and HapMap
- Clickwrap license and intent
- Corporate participation
- end of clickwrap
ENCODE project
- data access policy
The corporate created public domain in genomics
- patent defense intent
- enclosure by patent / startup / university tto
- Merck Gene Index as "immune" response
- alliance for cell signaling (AFCS)
- what else?
technical platforms for genomics
- Government centers
- NCBI
- EBI
- Japan Genome
- role of PD in global data integration
- emergence of new PD tools
- OBO
- what else?
platforms for innovation
- on ncbi
- on pubmed
- pubget
- hubmed
- iHOP
- neurocommons / LOD
- ingenuity / genstruct etc
on the genome
- DAS
- companies...
Why do some technical platforms make the transition to innovation platforms and not others?
- information flow analysis
- "resistance" analysis
- pubmed and genome v. AFCS and others
trying to recapture "lightning in a bottle" - the merck experiment in disease biology (SAGE)
- analyze decision in merck's history with the gene index
- analyze contracts
- analyze technical platform
Bibliography
Background Information/Brainstorm
MGI > SNP > Sage
HGP > HapMap > ENCODE
These have been extensively profiled as modes of "property preempting investments" (PPIs), but they also serve as emerging platforms for collaboration. By their very existence - created as anti-patent devices - they create a pressure to standardize and to form networks that did not exist previously. This is because of the now-digital nature of the knowledge, and because of the ability of networks to deliver it. The value of sharing now outweighs the value of competitive withholding.
Thus, the HGP leads to the Distributed Annotation System, the Gene Ontology, etc., which in turn create their own commons, dependent on the root structure of a public domain of genomes. FLOSS and corporate entities compete as service vendors, etc.
Analyze the ecosystem it takes to convert PPIs into a networked commons platform: the legal tools (PD), the norms (Bermuda Rules), the corporate pressures (journal requirement of a GenBank ID), etc. It is not just about dumping data into the network.
Find examples of PPIs that did not form a commons.
How the interplay of public and private investments, using the PD for different reasons, has created and sustained a self-perpetuating commons. No one is looking at the political economy of collaboration as a result of self-interested investments against patents!
Access to what we actually do know - and the right to reformat and integrate it - is the non-miraculous way to systematically increase the odds that we do something right.