The value and drama of the first Large Genome Model
Vision, perseverance, execution... and oh yeah, rejection, dismissal, and drama.
TL;DR: I built a type of AI called an LGM, then withdrew its publication because of a non-compete with a company I founded and had a bad breakup with. Big pharma is realizing that LGMs are the future of creating new medicines. Ecotone is built on advanced LGMs, and a Nobel laureate regrets dismissing my foresight on AI. The withdrawn paper is republished and available via this link.
Let’s set the stage dramatically: my name is Dr. eMalick G. Njie, and pre-pandemic I foresaw artificial intelligence (AI) changing the world. No one else at my job shared this vision. I bought GPUs to power my AI builds, but was banned from buying more because they saw no value in what I was building. I quit my job and founded a company called Genetic Intelligence. That company was technically and financially successful (an NSF award, millions raised, etc.). But my inexperience as a founder led me to hire toxic people who didn’t share my vision. Gradually, I couldn’t recognize my company anymore. My attempts to right the ship failed. So I left. This is where this story begins.
This is the story of the first foundational Large Genome Model (LGM), which I built pre-pandemic. A Large Genome Model is a type of foundational AI that reads a large corpus of DNA code (i.e., ATGC) and tells humans what this code means. Such an AI can tell us about the meaning of life, where people of different heritages came from thousands of years ago, and how to make medicines to cure rare genetic diseases. There are about 10,000 of these diseases in need of a medicine.
GPT-1, the LLM predecessor to ChatGPT, was released at the same time as my LGM, so my innovation speed was on par with the team at OpenAI. To my knowledge, this is the first LGM ever built, making me the inventor of LGMs 🌟🇬🇲. At the time, I was stunned at how the AI read the DNA code as a first language.
What do I mean by reading the DNA code as a first language? DNA is made of molecules we humans have labeled with letters. Read off a strand, it looks like AGCCGAATGCATAA… and so on. We humans look at this and see gibberish. Don’t believe me?
Let’s look at it again: AGCCGAATGCATAA. The ATG in the middle (spelled AUG in the cell’s RNA copy of a gene) is a word in the DNA language that signals the beginning of proteins such as insulin, hemoglobin, collagen, and thousands of others. It’s similar to the capital letter at the beginning of a sentence.
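To make the "capital letter" analogy concrete, here is a tiny, purely illustrative Python sketch (not from the LGM paper; the sequence is just the toy example above) that scans a DNA string for the ATG start word:

```python
# Illustrative only: scan a DNA string for the ATG "start" word.
# The sequence is the toy example from the text, not real gene data.
dna = "AGCCGAATGCATAA"

def find_start_codons(seq: str) -> list[int]:
    """Return every position where the start codon ATG appears."""
    return [i for i in range(len(seq) - 2) if seq[i:i + 3] == "ATG"]

for pos in find_start_codons(dna):
    print(f"Possible protein start at position {pos}: {dna[pos:pos + 6]}...")
```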
Bet you didn’t see that.
DNA code is a foreign language to us, the way Italian is to a Cambodian or Hawaiian is to a Dutch person. For us to understand the meaning of that ATG, some human either clipped it out with advanced genetic engineering to see what happens, or noticed that whenever it is naturally missing (i.e., mutated), the protein doesn’t get made. That is a manual process, laborious and not scalable. A properly designed AI like an LGM, however, natively learns the meaning of that ATG.
An LGM can also do much more with only DNA as input. For instance, the major finding of this first LGM was that Japanese people are not all one people, as commonly thought. The LGM, using only DNA and nothing else (no labels, etc.), flagged a group of Japanese it said were different. It turned out these were the Ainu, who are thought to descend from the first people to migrate from mainland Asia to the Japanese isles thousands of years ago. They are considered an endangered indigenous group, with about 25,000 alive today. The LGM read their DNA and understood their uniqueness.
This is what I mean when I say AI reads the DNA code as a first language. I am acquainted with many Nobel laureates and hold their work in the highest regard. That said, watching the LGM shed light on the DNA code as a first language is the most profound scientific observation I have come across.

Sidebar: technical readers who have read the LGM paper may be wondering how my LGM, which is old and lacks attention mechanisms, still outperforms 23andMe. 23andMe uses labels to identify heritages, while my LGM was an unsupervised learner. Our approaches may converge on similar conclusions, but we took drastically different paths to get there, which is what enables our exceptional granularity. I liked X-Men cartoons as a kid, so let’s use them to illustrate. Imagine you are an X-Man who has travelled to a new planet, where you find a city and are tasked with identifying all the churches in it. With 23andMe’s approach, you have an address book with the locations of all the churches. You quickly identify all the churches in the city, but nothing else.
With the LGM approach, you have no address book, but you have the superpower of recognizing patterns. You can learn that some buildings are made of a particular type of brick, have long windows, and are open only on Sundays. You use this superpower to quickly identify all the churches. But with the same superpower, you also learn to identify all the mosques, synagogues, and other houses of worship. You keep going and identify unrelated structures like schools, government buildings, businesses, and so on. Your superpower is scalable. That’s the difference between 23andMe and my LGM.
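If you prefer code to comic books, here is a minimal, hypothetical sketch (made-up data and parameters, not the actual LGM or 23andMe pipeline) contrasting the "address book" with the pattern-recognition superpower: a label lookup versus unsupervised clustering on a toy genotype matrix.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy genotype matrix: 300 people x 1,000 variants (0/1/2 allele counts).
# Two hypothetical populations differ slightly in allele frequencies.
freqs_a = rng.uniform(0.1, 0.9, size=1000)
freqs_b = np.clip(freqs_a + rng.normal(0, 0.15, size=1000), 0.01, 0.99)
genotypes = np.vstack([
    rng.binomial(2, freqs_a, size=(150, 1000)),  # hidden population A
    rng.binomial(2, freqs_b, size=(150, 1000)),  # hidden population B
])

# "Address book" approach: you can only find the groups you already have labels for.
known_labels = np.array([0] * 150 + [1] * 150)

# Unsupervised approach: compress, then let structure emerge with no labels at all.
latent = PCA(n_components=10).fit_transform(genotypes)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(latent)

# How well do the discovered clusters match the hidden populations (up to relabeling)?
agreement = max((clusters == known_labels).mean(), (clusters != known_labels).mean())
print(f"Unsupervised clusters recover the hidden populations {agreement:.0%} of the time")
```

The unsupervised route never needed the labels, which is why the same superpower keeps working when you ask it about mosques and schools, i.e., structure no one labeled in advance.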
If we dive deeper into the secret sauce of the LGM, you’ll find that what I did at a fundamental level was compress whole human genomes into a latent space compact enough to feed into a neural net. This high-level strategy is what OpenAI did with Sora in early 2024: there, they put video data into a latent space manageable by a Diffusion Transformer. It’s a winning strategy for big data.
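As a rough sketch of what "genomes in a latent space" can look like (an assumed toy architecture in PyTorch, not the paper’s actual model), a small autoencoder below squeezes one-hot encoded DNA windows into a 16-dimensional latent vector that downstream networks can work with:

```python
import torch
import torch.nn as nn

BASES = "ACGT"

def one_hot(seq: str) -> torch.Tensor:
    """Encode a DNA string as a flat one-hot vector of length len(seq) * 4."""
    idx = torch.tensor([BASES.index(b) for b in seq])
    return nn.functional.one_hot(idx, num_classes=4).float().flatten()

class DnaAutoencoder(nn.Module):
    """Toy encoder/decoder: raw DNA window in, compact latent representation out."""
    def __init__(self, window: int = 64, latent_dim: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(window * 4, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, window * 4))

    def forward(self, x):
        z = self.encoder(x)            # the manageable latent space
        return self.decoder(z), z

model = DnaAutoencoder()
window = "ATGC" * 16                   # toy 64-base window
recon, latent = model(one_hot(window).unsqueeze(0))
print("latent shape:", latent.shape)   # torch.Size([1, 16])
```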
It is only since ChatGPT’s release that major players such as NVIDIA, Stanford, and other groups have started building LGMs. As of now (early 2024), my guess is that five to ten companies globally are developing LGMs. One of these is InstaDeep, which was quietly acquired by BioNTech for 549M.
Why would BioNTech, who are renowned for their foresight with COVID mRNA vaccines, buy a company working on a technology most people have never heard of?
Briefly, an AI called AlphaFold, made by Google DeepMind, predicted the 3D structure of nearly every known protein. This shook the pharma industry, which had thought such a feat impossible. Dozens of new companies are now developing hundreds of new drugs based on AlphaFold. These drugs will soon be worth many, many billions.
AlphaFold is just the tip of the iceberg of what is possible with AI. Indeed, AlphaFold is constrained by the fact that protein-coding sequence, the part with 3D structure, accounts for only about 1% of the human genome. The remaining 99% is a 1D mystery code, and it is where the root causes of more than 10,000 untreatable genetic diseases lie. A technology such as an LGM that unlocks this 99% is therefore exponentially more valuable than AlphaFold, and the first company to develop an LGM that leads to clinically successful medicines will have the technical edge for curing these 10,000 diseases.
The market rate for these medicines in 2024 is roughly $3.5M per dose. With hundreds of millions of potential patients, there are many billions to be made. That amount of value will disrupt pharma, much as Google did with search and Tesla did with electric cars. This is why I believe BioNTech purchased InstaDeep.
Modesty aside, I am confident my company Ecotone leads the LGM space. Why? Because of my early start innovating in this new field. We operate in complete stealth, so I cannot reveal much. But you can surmise that if I was years ahead in inventing LGMs before the pandemic, my team is years ahead of the pack today.
A modernized version of this LGM forms our base stack. It filters the genomes of thousands of people to ensure that the parts of the genome that define our heritages are not confused with the parts that cause disease (a common pitfall of competing methods such as GWAS). Atop this base stack sit newer LGMs whose job is to tell us what DNA code causes people to become sick with rare genetic diseases. We then design CRISPRs to change this DNA code. These CRISPRs are the medicines that cure these diseases.
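To give a flavor of the "don’t confuse heritage with disease" idea, here is a standard textbook correction sketched on made-up data (illustrative only; this is not Ecotone’s stack): the top principal components of a genotype matrix largely capture ancestry, so regressing them out before looking for disease signal keeps heritage from masquerading as pathology.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Made-up data: 500 people x 200 variants, plus a disease status per person.
genotypes = rng.binomial(2, 0.3, size=(500, 200)).astype(float)
disease = rng.binomial(1, 0.2, size=500)

# The leading principal components of a genotype matrix largely reflect ancestry.
ancestry_pcs = PCA(n_components=5).fit_transform(genotypes)

# Regress ancestry out of every variant; the residual is the heritage-corrected signal.
ancestry_fit = LinearRegression().fit(ancestry_pcs, genotypes)
corrected = genotypes - ancestry_fit.predict(ancestry_pcs)

# Association scores computed on the corrected matrix are less likely to
# mistake heritage for disease.
scores = np.abs(np.corrcoef(corrected.T, disease)[-1, :-1])
print("Top candidate variants:", np.argsort(scores)[-5:][::-1])
```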

Note that I had initially published the LGM paper on bioRxiv, but withdrew it due to a non-compete with Genetic Intelligence, the first company I founded and exited after its value grew more than 20x. As mentioned above, I left despite this success because the environment had become toxic. While I grew financially, and I’m proud of what we accomplished, I unfortunately had one of the worst experiences of my life :(
Also note that I didn’t use the term LGM at the time. I coined it around 2020, while improving on this work and being encouraged by the progress in Large Language Models (LLMs).
I was fortunate to have a Nobel prize-winning geneticist read the paper before I published it. He thought the LGM was nonsense, because deep down he thought AI was nonsense.
He has since apologized. This is emblematic of the pushback I received from many of the people with whom I shared my AI-driven approach to solving impossible scientific problems.
We use this resistance as motivation to bring to life Ecotone’s vision of zero-shot drug discovery with AI-designed programmable medicines that cure rare genetic diseases. That’s a mouthful. Let’s break it down.
Zero-shot drug discovery
Drugs today are discovered by scientists manually testing hundreds of thousands of molecules to find the right one. This is how your cholesterol drug, your diabetes drug, your cancer drug, and your Alzheimer’s drug were found. The approach is a classic search for a needle in a haystack, and most of the time scientists don’t find the needle. In AI terms, this is many-shot learning: it takes scientists many shots to get the right one.
Zero-shot drug discovery is finding the right drug the first time, every time. How? By intertwining genetics and AI.
The genetics guides the drug’s specificity, and tools like CRISPR enable drugs to cure rather than merely treat. The AI handles the genetics, which is a massive data space. An AI properly trained as an LGM does not need to have previously seen a disease that is newly inputted to it in order to determine the disease’s genetic cause. It is zero-shot because it needs zero prior encounters with the disease to work.
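A toy way to see the zero-shot idea (using simple k-mer counts as a stand-in for a trained LGM embedding; every sequence here is invented for illustration): the model learns what "healthy" sequence looks like, and a patient sequence it has never seen, from a disease it has never seen, is scored purely by how far it drifts from that learned representation.

```python
from collections import Counter
import math

def kmer_embedding(seq: str, k: int = 3) -> Counter:
    """Stand-in for an LGM embedding: normalized k-mer frequencies."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return Counter({kmer: c / total for kmer, c in counts.items()})

def distance(a: Counter, b: Counter) -> float:
    """Euclidean distance between two k-mer profiles."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a[key] - b[key]) ** 2 for key in keys))

# The "healthy" reference the model has learned from; note: no disease labels anywhere.
reference = kmer_embedding("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG" * 5)

# A never-before-seen patient sequence is scored with zero prior examples of its disease.
patient = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG" * 4 + "TTTTTTTTTTGGGGGGGGGG"
print(f"Anomaly score: {distance(kmer_embedding(patient), reference):.4f}")
```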
AI-designed programmable medicines
Let’s break this into two.
‘AI-designed’ just means the inner workings of the drug come from AI; in this case, an LGM designs a nucleic acid called a guide RNA. This nucleic acid is used by CRISPR to change DNA and cure disease. ‘AI-designed’ could also mean AlphaFold or some other AI did the designing.
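For a flavor of the simplest version of "designing a guide RNA" (a textbook Cas9 PAM scan on an invented target; real guide design, LGM-driven or otherwise, layers much more on top of this):

```python
# Minimal sketch of Cas9 guide selection: a candidate guide is the 20 bases
# immediately upstream of an "NGG" PAM site in the target DNA.
# The target sequence below is invented for illustration.

target = "GATTACAGGCTTAGGCTAAGCCGGATCGATCGTAGCTAGGCCAGGTTT"

def candidate_guides(seq: str) -> list[tuple[int, str]]:
    """Return (PAM position, 20-nt guide) pairs for every NGG PAM in the sequence."""
    guides = []
    for i in range(20, len(seq) - 2):
        if seq[i + 1 : i + 3] == "GG":           # any base followed by GG = a PAM
            guides.append((i, seq[i - 20 : i]))  # the 20 bases upstream of the PAM
    return guides

for pos, guide in candidate_guides(target):
    print(f"PAM at {pos}: guide candidate {guide}")
```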
‘Programmable medicines’ is a new term describing the architecture and manufacturing process of the medicine. As mentioned above, gene-based medicines for rare genetic diseases cost on average $3.5M per dose in 2024. To bring this price down, we want to simplify the manufacturing process as much as possible. Having an LGM lets us make medicines that otherwise look alike, except for a small change in the nucleic acid (the guide RNA). With tiny changes in the ‘programming’ of this nucleic acid, medicines across 10,000 diseases look alike and can be mass-produced, bringing down their cost and increasing their accessibility.
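One way to picture ‘programmable’: the recipe of the medicine stays fixed, and only the guide slot changes from disease to disease. The sketch below is purely schematic, with invented names and sequences.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProgrammableMedicine:
    chassis: str     # shared delivery + CRISPR components, identical across diseases
    guide_rna: str   # the only part that is reprogrammed per disease

CHASSIS = "LNP + Cas9 mRNA"   # placeholder for the common manufacturing recipe

disease_guides = {
    "disease_A": "GAUUACAGGCUUAGGCUAAG",   # made-up 20-nt guides
    "disease_B": "CCGGAUCGAUCGUAGCUAGG",
}

medicines = {name: ProgrammableMedicine(CHASSIS, g) for name, g in disease_guides.items()}
for name, med in medicines.items():
    print(f"{name}: same chassis ({med.chassis}), guide {med.guide_rna}")
```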
This is Ecotone’s vision. We sincerely thank you for taking the time to let us share it with you. Our story is full of drama, and it’s never easy being the underdog. But we get up and go to work every day, turning bad vibes into good vibes and innovating relentlessly.
Read the LGM paper via this link.
Please show your support by subscribing and sharing. I’m available for questions and investment opportunities at emalick@ecotone.ai