Does AI Need to Learn Protein-ligand Interactions to Calculate? (II)


When the similarity of ligand scaffolds or protein sequences between the training set and the test set is further restricted, the performance of the ACNN model decreases significantly. This indicates that the model may “predict” protein-ligand interactions simply from ligand similarity or protein similarity: similar ligands have similar binding activities, and similar proteins also have similar binding activities, so the model does not have to learn complicated protein-ligand binding patterns. Between two points, AI will always take the shortest straight line. This reflects the strong fitting ability of neural networks, which are good at discovering correlations. However, such a model can only predict accurately in scenarios that are very similar to the training set, and it generalizes poorly. Building a robust AI model requires massive and diverse data. This problem is especially hard to overcome in areas where crystal structures and activity measurements of protein-ligand complexes are scarce and expensive. Therefore, these models are likely to struggle when they face real-world drug discovery and optimization scenarios that are vastly different from the training set.

Since experimental data are not enough, can we use computer-generated data to train the model? DUD/DUD-E, a commonly used molecular docking benchmark set, contains not only 22,886 active small molecules (actives) but also 1.41 million small molecules as negative controls (decoys). Decoys need to have physicochemical properties (molecular weight, net charge, etc.) similar to those of the actives but different topologies (represented by molecular fingerprints), which reduces false negatives. This design overcomes a shortcoming of earlier benchmark sets, in which molecular docking software could distinguish actives from decoys through simple physicochemical properties alone and so obtain a high score. Traditional molecular docking thus also fell into a trap (ranking on simple physicochemical properties only), but it climbed out once reasonable negative controls were set, and a reliable test set now exists for evaluating methods objectively. For AI, however, the question is whether we know what its pitfalls are. The similarity of data sets? The size of the data set? To answer this question, a model was trained on the larger DUD-E data set: a random forest (RF) was trained to classify actives and decoys, with either 6 physicochemical properties (PROP) or molecular fingerprints (FP) as input features.

When the six physicochemical properties are used as input features and the data are randomly divided into 3 groups for cross-validation (CV), the average AUC of the random forest over the 102 DUD-E targets is 0.73, and the average enrichment factor for the top 1% of ranked small molecules (EF1) is 22.2, very close to the performance of the molecular docking software in the DUD-E article (AUC: 0.76, EF1: 19.8). After small molecules with a molecular weight greater than 500 are removed (this bias has been reported) and cross-validation is grouped by protein class, the AUC falls to 0.66 and EF1 falls to 5.14, indicating that a model trained on DUD-E may learn biases in physicochemical properties. These include: 1) the actives contain small molecules with a molecular weight greater than 500, whereas the decoys are limited to drug-like small molecules with a molecular weight below 500; 2) the actives of similar targets have similar physicochemical properties, so the model can distinguish actives from decoys by physicochemical properties alone.
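The two metrics quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the evaluation code used in the study: the function names `roc_auc` and `enrichment_factor` are hypothetical, labels are assumed to be 1 for actives and 0 for decoys, and EF is taken here as the active rate in the top fraction of the ranking divided by the active rate in the whole set.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random active outranks a random decoy.

    Ties between an active and a decoy count as half a win.
    """
    actives = [s for s, y in zip(scores, labels) if y == 1]
    decoys = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for a in actives:
        for d in decoys:
            if a > d:
                wins += 1.0
            elif a == d:
                wins += 0.5
    return wins / (len(actives) * len(decoys))


def enrichment_factor(scores, labels, fraction=0.01):
    """EF = (active rate in the top `fraction` of the ranking) / (active rate overall)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(round(len(ranked) * fraction)))
    top_actives = sum(y for _, y in ranked[:n_top])
    total_actives = sum(labels)
    return (top_actives / n_top) / (total_actives / len(ranked))
```

Under this definition, an EF1 of 22.2 means that the top 1% of the ranking contains about 22 times as many actives as a random selection of the same size would.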

When molecular fingerprints are used as input features, the random forest model can distinguish actives from decoys well even when it can learn little from the physicochemical properties. Ranking the 84 fingerprint features that occur with high frequency and differ between actives and decoys by their frequency of occurrence in the ZINC database shows that DUD-E is also biased in topology (molecular fingerprints). There are two reasons: 1) DUD-E selects small molecules from ZINC whose topology is dissimilar to that of the actives as decoys, so actives and decoys differ clearly, as expected; 2) the distribution of the decoys is closer to that of ZINC, indicating that the topology distribution of the actives differs from that of ZINC. DUD-E is therefore biased in both physicochemical properties and topology. As long as a model can learn these features explicitly or implicitly, it can hardly avoid being misled by the bias, even if it is trained on docked complexes.

In summary, the author believes that there is a lack of sufficient and unbiased data at this stage for training AI drug discovery and design models based on protein-ligand complex structures. Due to the powerful ability of the AI model to summarize correlations, in order to reasonably assess the AI model’s ability to predict protein-ligand binding strength and promote healthy development in this field, the following suggestions are proposed:

1) PDBbind remains the most suitable experimental data set so far. However, when models are trained on PDBbind, protein-only and ligand-only models should be set up as baseline controls so that the source of any improvement in model performance can be properly evaluated.

2) The protein similarity and ligand similarity between the training set and the test set should be systematically controlled to properly evaluate the generalization ability of the model.

3) The DUD-E data set should be used as an independent benchmark test set, not as a training set.
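Suggestion 2), like the cross-validation grouped by protein class described earlier, amounts to keeping every group of similar samples on one side of the split. The following is a minimal sketch under assumptions: the function names are hypothetical, and the group label of each sample (a protein class or a ligand scaffold cluster, for instance) is assumed to have been computed beforehand.

```python
from collections import defaultdict


def grouped_folds(groups, n_folds=3):
    """Assign sample indices to folds so that all samples sharing a group
    label (e.g. a protein class) end up in the same fold."""
    members = defaultdict(list)
    for idx, group in enumerate(groups):
        members[group].append(idx)
    folds = [[] for _ in range(n_folds)]
    # Greedy balancing: largest groups first, each into the currently smallest fold.
    for group in sorted(members, key=lambda g: (-len(members[g]), str(g))):
        smallest = min(range(n_folds), key=lambda i: len(folds[i]))
        folds[smallest].extend(members[group])
    return folds


def train_test_splits(groups, n_folds=3):
    """Yield (train_indices, test_indices) pairs; no group spans both sides."""
    folds = grouped_folds(groups, n_folds)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

A random split, by contrast, lets near-duplicates of test samples appear in the training set, which is what inflates the apparent performance. The sketch assumes there are at least as many groups as folds; otherwise some folds stay empty.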

About Protheragen AI

Protheragen AI has proudly developed a unique artificial intelligence drug research and development platform to offer drug development solutions for worldwide customers, including but not limited to Drug R&D, Machine Translation, Intelligent Image Diagnosis, and Medical Therapy and Research System. Through big data analysis and other technical means, its AI platform can quickly and accurately mine and select the appropriate compounds or organisms. Compared with traditional methods, AI can save the cost of screening candidates by tens of billions every year. AI technology has been widely used in disease target prediction, high-throughput data analysis and system biology modeling.


Single Cell Sequencing: The Technology, Challenge and Future (VII)

In addition, there is a need to develop automated single cell isolation and genome amplification technologies. Existing technology can process on the order of hundreds of cells. We can use a commercial cell sorter to complete cell sorting, use a robotic arm to complete cell lysis and nucleic acid amplification reactions, or use microfluidic equipment to complete this whole set of operations automatically. Automation and miniaturization are the future development direction of single-cell sequencers, because only by analyzing enough samples can we fully understand the genetic diversity in them. We hope that chip technology, microfluidic technology, and microfabrication approaches will see innovative development. This will greatly increase throughput while greatly reducing the cost of testing and simplifying the reaction steps, so that single-cell analysis of tens of thousands of cells can be performed in one experiment. We believe this is only a matter of time.

Single cell genome analysis is in fact the result of the joint development of multiple technologies. It involves many basic areas of the life sciences and will help us solve many of their major problems. We hope that, with the continuous development and diversification of nucleic acid amplification technologies and reaction types, the influence of single cell sequencing can be further expanded and applied to more fields, helping us to better understand the entire living system.

4. The impact of single-cell sequencing on biology and medicine

Recent technological advances have made single-cell RNA sequencing possible. Exploratory research has given us insight into the dynamic process of differentiation, the response of cells to various stimuli, and the random nature of transcription. We are entering an era of single-cell transcriptomics, and this research direction will have a profound impact on biology and medicine.

The transcriptome we refer to today is mainly derived from population-level observations, which have become mainstream in biological research over the past two decades. We have long been accustomed to one research approach: comparing fold changes in gene expression (obvious or subtle) across a whole tissue or under certain conditions. The actual differences between cells, however, may be more pronounced. Some cells may change markedly while others remain “indifferent”, so that even a large change in the responding cells is masked by the “silent majority”. In fact, as early as 60 years ago it was discovered that stimulating a single cell produces one of two completely different results, whereas studying a large group of cells gives a graded, quantifiable result.

Clearly, detecting and analyzing gene expression in single cells helps us understand the behavior of cells and identify which cells are involved in tissue development, maturation and disease. Achieving this goal requires transcriptomic research on individual cells. Experimental technology has only recently developed to the point where RNA from single cells can be sequenced, and scientists have been able to use it to uncover meaningful differences in gene expression between single cells. Very detailed experimental guides now help researchers build sequencing libraries, and commercial automated single-cell preparation systems such as the Fluidigm C1 have greatly lowered the barriers to entering this field. The wide application of single-cell experimental techniques will have a profound impact, deepening our understanding of cell states, the nature of transcription, the regulation of gene expression, and even the pathological processes of disease.

4.1 SNR

Single-cell transcriptome research mainly relies on reverse transcription. First, the RNA to be studied is reverse transcribed into cDNA, which is then amplified by PCR or by in vitro transcription, and finally the amplified product is deeply sequenced. However, the amplification reaction is error-prone and easily loses information. A single cell contains very little RNA, so these trace amounts of nucleic acid must be amplified, and the amplification introduces many biases. Although technical noise interferes with high-resolution sequencing of low-abundance RNA molecules, current improved experimental procedures already allow us to obtain enough single-cell transcriptome information. For example, one question raised repeatedly in single-cell transcriptomics is how to classify cells accurately and reproducibly by type or state without sorting them beforehand. Gene expression patterns related to cell type or developmental stage are a relatively reliable basis for judgment, far more reliable than the variation caused by physiological variables tied to dynamic processes such as the cell cycle, or by technical noise. In addition, studies of the expression differences of hundreds of genes across different cells have confirmed that single-cell techniques can indeed find meaningful information. Further work should improve the signal-to-noise ratio of single-cell sequencing studies by increasing the efficiency of reverse transcription and PCR, and molecular barcoding strategies can also be used to control the biases that arise during nucleic acid amplification.
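The idea behind molecular barcoding can be sketched as follows. Each cDNA molecule is tagged with a short random barcode before amplification, so reads that share the same cell, gene and barcode are treated as copies of one original molecule and counted once. This is a simplified illustration under assumptions: the function name and the `(cell_id, gene, barcode)` read format are hypothetical, and sequencing errors within the barcodes are ignored.

```python
from collections import defaultdict


def count_molecules(reads):
    """Collapse amplification duplicates.

    reads: iterable of (cell_id, gene, barcode) tuples, one per sequenced read.
    Returns {(cell_id, gene): number of distinct barcodes}, i.e. an estimate of
    the number of original molecules rather than the number of reads.
    """
    barcodes = defaultdict(set)
    for cell_id, gene, barcode in reads:
        barcodes[(cell_id, gene)].add(barcode)
    return {key: len(seen) for key, seen in barcodes.items()}
```

A gene whose single transcript happened to amplify a hundredfold then still counts as one molecule, which is how the barcode removes the amplification bias from the abundance estimate.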

4.2 Challenges in single cell transcriptomics research

Researchers have developed the existing single-cell RNA sequencing technologies for several different purposes. For example, full-length transcripts can be sequenced so that we obtain the sequence of the entire gene and of its various transcript isoforms, which also helps in discovering and monitoring single-nucleotide polymorphisms and other mutations. Tag-based strategies, which sequence only the 5′ or 3′ end of the transcript, provide information on transcript abundance at the expense of full-length sequence information, which suits large-scale quantitative research.

However, the entire single-cell sequencing community is pursuing the same goal: an economical, high-throughput technology that sequences all the RNA in a cell. Reducing the loss of RNA and increasing the efficiency with which RNA is reverse transcribed into cDNA before amplification is a technical difficulty that requires a breakthrough, and it is also the key to improving the success rate of RNA detection. Equally important is how to isolate and sort single cells, separating individual cells from the whole tissue without disturbing their gene expression. In addition, researchers hope to detect poly(A)+ RNA and poly(A)− RNA simultaneously, as well as various RNA modifications (such as m6A), regardless of transcript length.

Single-cell sequencing has revealed a feature of transcription that greatly complicates research: the gene expression rules found in studies of cell populations do not hold reliably at the level of single cells. Any random disturbance may leave a gene unexpressed in some cells, or expressed at a very low level, while in others its expression may become very high. This variability may arise because gene expression is a random molecular process, so that in a single cell the transcription of a gene is an all-or-nothing probabilistic event. Scientists have conducted a great deal of research on prokaryotes and single-cell eukaryotes and have a deep understanding of the random nature of gene transcription in them. More and more evidence now shows that the same is true beyond these organisms. We therefore need to bear this in mind when conducting single-cell transcriptomics research. For example, standard differential expression tests may not be suitable for single-cell research, because some of the cells studied may not express the gene in question at all. Experimental strategies suited to this kind of work now exist, which consider differences in transcript abundance together with the proportion of cells expressing the gene.
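Looking at transcript abundance together with the proportion of expressing cells can be illustrated with two summary statistics per gene. This is a descriptive sketch under assumptions, not a statistical test and not the specific strategies the text alludes to: the function names are hypothetical, and a cell is treated as expressing the gene when its count exceeds a threshold.

```python
def expression_summary(counts, threshold=0):
    """Return (fraction of cells expressing the gene, mean count among expressing cells)."""
    expressing = [c for c in counts if c > threshold]
    fraction = len(expressing) / len(counts)
    mean_expressed = sum(expressing) / len(expressing) if expressing else 0.0
    return fraction, mean_expressed


def compare_groups(counts_a, counts_b, threshold=0):
    """Separate a change in how many cells express a gene from a change in
    how strongly the expressing cells express it."""
    frac_a, mean_a = expression_summary(counts_a, threshold)
    frac_b, mean_b = expression_summary(counts_b, threshold)
    return {
        "delta_fraction": frac_b - frac_a,
        "delta_mean_expressed": mean_b - mean_a,
    }
```

For instance, with counts `[0, 0, 0, 10, 10]` in one condition and `[0, 10, 10, 10, 10]` in another, the population average doubles, yet the expressing cells have not changed their level at all; only the proportion of expressing cells has risen from 40% to 80%.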

To be continued in Part VIII…

When Affected by Positrons Spherical Nanoparticles Release Electron-Positron Pairs Forward

Theoretical calculations show that spherical nanoparticles release unstable electron-positron pairs when affected by positrons of specific energies, and the direction of the signal is the same as that of the incident positrons.

When electrons collide with positrons, their antimatter counterparts, they can form unstable pairs in which the two types of particle orbit each other. Physicists have created this intriguing structure, named “positronium”, using various positron targets, from atomic gases to metal films. However, they have not yet obtained the same result from vapors of nanoparticles, whose unique properties are shaped by the “gases” of free electrons they contain within well-defined nanometer-scale regions. This new finding was published in European Physical Journal D.

In the new study, Paul-Antoine Hervieux at the University of Strasbourg, France, and Himadri Chakraborty at Northwest Missouri State University, USA, revealed for the first time the characteristics of positronium formation in soccer-ball-shaped C60 nanoparticles. At specific positron impact energies, they showed that positronium is emitted in the same direction as the incoming antiparticles.

Often referred to as buckminsterfullerene or “buckyballs,” C60 is stable at room temperature, sustainable, and easy to synthesize. Because of these useful properties, the discoveries of Hervieux and Chakraborty may have important implications for fields such as astrophysics, materials physics, and pharmaceutical research. In particular, they could improve tests of how antimatter responds to gravity, which involve structures including positrons and antihydrogen atoms, each of which features positronium in the first steps of its fabrication process.

When positrons of certain energies strike the buckyball, the physicists showed that a series of narrow positronium emission peaks is generated in the forward direction, at angles of up to 10 degrees, by “diffraction resonances” of the particles. This effect is comparable to the diffraction of light by microscopic spherical obstacles. The effect changed in larger fullerene molecules (such as C240) and when the particles were excited to higher energy levels. Hervieux and Chakraborty modeled these properties through theoretical calculations of how diffraction resonances affect positronium emission angles as a function of positron impact energy. Their results provide important insights for the many researchers who use these short-lived structures. In future research, the two hope to explore the potential further in practical experiments.



  1. Hervieux, P. A., & Chakraborty, H. S. (2019). Strongly resolved diffraction resonances in positronium formation from C60 in forward direction. The European Physical Journal D, 73(12), 1-6.

Research progress in immunomodulatory properties of mesenchymal stem cells (I)

The mesenchymal stem cell (MSC), discovered in the 1950s, is a type of multipotent stem cell. It originates from the mesoderm in early development, has a high self-renewal capacity, and has the potential to differentiate into a variety of cells. MSCs can be induced in different ways to differentiate into adipose tissue cells, cartilage tissue cells, connective tissue cells, bone tissue cells and neural stem cells, both in vitro and in vivo. In addition, MSCs may also be induced to differentiate into endoderm cells (lung cells, muscle cells and intestinal epithelial cells) and ectoderm cells (epithelial cells and neurons).

1. The biological characteristics of MSC

Current research shows that MSCs can be obtained by in vitro culture and expansion from bone marrow, umbilical cord blood, umbilical cord, placenta, mobilized peripheral blood, adipose tissue, dental pulp, and even fetal liver and lung tissue. Although MSCs come from a variety of sources, they share some common characteristics. Under a microscope they have the shape of fibroblasts and are often fusiform or spindle-shaped. They express the markers CD90, CD105, CD44, CD73 and CD9 and very low levels of CD80, whereas the hematopoietic surface markers CD34, CD45, CD11b, CD11c, CD14, CD19, CD79a and CD86, and major histocompatibility complex (MHC) class II, are all negative. They can secrete insulin-like growth factor, vascular endothelial growth factor, hepatocyte growth factor and other factors. Under specific conditions they can differentiate into bone cells, chondrocytes and adipocytes, and their immunogenicity is low, so MSCs are unlikely to cause immune rejection after transplantation.

To meet the needs of clinical MSC therapy, the International Committee on Mesenchymal and Tissue Stem Cells has proposed minimum identification standards for human-derived MSCs: ① The cells must adhere to plastic substrates under standard stem cell culture conditions. ② The positive rate of the MSC surface markers CD105, CD73 and CD90, as detected by flow cytometry, should be ≥95%, and the negative rate of CD45, CD34, CD14, CD11b, CD79a, CD19 and human leukocyte antigen-DR (HLA-DR) should be ≥98%. ③ After induction in vitro by standard methods, the MSCs must be able to differentiate into osteoblasts, chondrocytes and adipocytes.
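Criterion ② is a simple threshold rule and can be written down directly. The sketch below is an illustration under assumptions: the function name and the input format (a mapping from marker name to the percentage of cells staining positive) are hypothetical, and every marker listed in the text is required. Criteria ① and ③ are functional assays and are not covered.

```python
POSITIVE_MARKERS = ("CD105", "CD73", "CD90")
NEGATIVE_MARKERS = ("CD45", "CD34", "CD14", "CD11b", "CD79a", "CD19", "HLA-DR")


def meets_marker_criteria(percent_positive):
    """Check criterion 2 from flow cytometry results.

    percent_positive: dict mapping marker name -> % of cells staining positive.
    A missing marker raises KeyError, since an unmeasured marker cannot pass.
    """
    positives_ok = all(percent_positive[m] >= 95.0 for m in POSITIVE_MARKERS)
    # A negative rate of >= 98% is the same as at most 2% of cells staining positive.
    negatives_ok = all(percent_positive[m] <= 2.0 for m in NEGATIVE_MARKERS)
    return positives_ok and negatives_ok
```

A preparation with 99% CD105, CD73 and CD90 but 5% CD45-positive cells would fail, because the negative rate for CD45 is only 95%.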

2. The comparison of MSCs from different sources

At present, the MSCs used in clinical applications are mostly bone marrow MSC (BM-MSC), umbilical cord MSC (UC-MSC) and umbilical cord blood MSC (UCB-MSC). Although MSCs from different sources have some features in common, they also differ in some respects. MSCs were first found in bone marrow, but collecting bone marrow is an invasive procedure, which greatly limits the supply of BM-MSC. BM-MSC also carries a risk of viral infection, and as the age of the donor increases, the number of cells and their capacity for differentiation and expansion decline markedly. Continued research has found MSCs in human umbilical cord and cord blood as well, and they all have the potential to differentiate into a variety of cells and the ability to support hematopoiesis.

2.1. The comparison of cell morphology

There are no significant differences in the cell morphology, colony number, or colony size of MSCs from two different sources (BM-MSC and UCB-MSC) after culture in different media. In the early stages, colonies form faster. BM-MSC and UC-MSC, also MSCs from two different sources, are both spindle-shaped under a microscope. Their growth rate then increases, and after 1 week they form spindle-shaped cells of more uniform shape that grow in spiral or parallel arrays with a similar growth pattern. MSCs from different sources may differ in colony formation or time to confluence, but there is no significant difference in their cell morphology after culture.

2.2. The comparison of proliferation characteristics

Studies have shown that the gene expression profile of UC-MSC is similar to that of embryonic stem cells and that UC-MSC self-renews faster than BM-MSC, which makes its proliferation time significantly shorter. Moreover, the doubling time of umbilical cord-derived MSCs does not lengthen from passage P1 onward as the number of passages increases, whereas the doubling time of bone marrow-derived MSCs is significantly longer by passage P6 than at P1. This suggests that when the same number of MSCs from different sources are expanded for the same length of time, UC-MSC yields more mesenchymal stem cells than BM-MSC. Compared with MSCs from other sources, UC-MSC has a shorter proliferation time and a stronger proliferative ability. UCB-MSC proliferates strongly in the early stage, but its success rate of differentiation in vitro is lower.

2.3. The comparison of surface markers

Analysis of cellular immunophenotypes shows that most immune markers of UC-MSC are similar to those of BM-MSC, but its expression of HLA-ABC and CD106 is lower. HLA molecules can cause immune rejection during MSC transplantation, which suggests that UC-MSC has lower immunogenicity than BM-MSC. CD106 is an adhesion molecule related to the homing, migration, proliferation and differentiation of hematopoietic stem cells and progenitor cells. The low expression of CD106 by UC-MSC may be one of the differences between UC-MSC and BM-MSC. Flow cytometry analysis of the percentage of cells carrying each surface marker showed that, although UCB-MSC and BM-MSC come from different sources, both express cell adhesion molecule markers such as CD29, CD44 and CD105, while the hematopoietic markers CD13, CD14, CD34 and CD45 are all negative, and their immunophenotype does not change as the number of passages increases. This suggests that UCB-MSC and BM-MSC have the same cell surface markers. In short, most immune markers are expressed similarly in MSCs from the three sources, UC-MSC may have lower immunogenicity, and CD106 may become one of the markers that distinguish BM-MSC from UC-MSC.

To be continued in Part II…

Vitamin D May Help Fight COVID-19!

Researchers at the Irish Longitudinal Study on Ageing (TILDA) at Trinity College Dublin released an important report recently in response to the COVID-19 pandemic.


The report is entitled “Vitamin D deficiency in Ireland - implications for COVID-19. Results from the Irish Longitudinal Study on Ageing (TILDA)”. The TILDA study showed that vitamin D plays a key role in preventing respiratory tract infections, reducing antibiotic use, and enhancing the immune system’s response to infections.


Because one in eight adults over 50 years of age in Ireland is vitamin D deficient, the report emphasizes the importance of increasing vitamin D intake.


How is vitamin D produced?


Vitamin D is produced in the skin, and the human body can produce enough of it with only 10-15 minutes of exposure to sunlight per day. In Ireland, people can only produce vitamin D between the end of March and the end of September. It cannot be made in winter, and the amount we make in summer depends on how much sunlight we get, the weather, and other factors. Getting adequate vitamin D can be a challenge even in summer because of cloud cover, rainy weather, and lack of sunlight.


The good news is that this deficiency can be compensated for by adequate food intake and supplemental nutrition. Vitamin D is readily found in eggs, liver, and oily fish such as salmon or mackerel, and in fortified foods such as cereals and dairy products.


Do the Irish consume adequate vitamin D?


Researchers at TILDA have found that daily intake of vitamin D across Ireland is inadequate. Some of the major findings of TILDA are as follows:


• 47% of adults over 85 years of age are vitamin D deficient in winter;

• It is estimated that 27% of adults over 70 years of age who are “cocooning” are vitamin D deficient;

• One in eight adults over the age of 50 is vitamin D deficient all year round;

• Only 4% of men and 15% of women took vitamin D supplements.


Who is most likely to be vitamin D deficient?


Those who are rarely exposed to sunlight or who eat too few fortified foods are most at risk, especially those who are currently confined at home. Others at high risk are those who are obese or physically inactive, and those who have asthma or chronic lung disease.


Vitamin D is available without a prescription. What is now needed is for people to increase their vitamin D intake, especially as vitamin D supplementation is very low nationwide, particularly among men.


What is the recommended intake of Vitamin D?


TILDA researchers suggest that adults over the age of 50 should take vitamin D supplements not only in winter but throughout the year if they do not get enough sunlight. Those who are currently “staying at home” should also take supplements.


Professor Rose Anne Kenny, lead researcher at TILDA, said: “We have evidence to support the role of vitamin D in the prevention of chest infections, especially in older adults with low vitamin D levels. In one study, vitamin D reduced the risk of chest infection by half in people taking supplements. Although we do not know the specific mechanism of action of vitamin D against COVID-19 infection, given its wide impact on the immune response and the clear evidence for bone and muscle health, those at home and other high-risk groups should ensure that they have adequate vitamin D intake. In this situation muscle degeneration occurs quickly, and vitamin D will help maintain muscle health and strength during the current crisis.”


Dr. Eamon Laird, a medical gerontology researcher and co-author of the report, said: “These findings suggest that our older adults have severe vitamin D deficiency, which may have a significant negative impact on their immune response to infection. Now, those who stay at home are more likely to be vitamin D deficient. However, vitamin D deficiency is not inevitable: fatty fish, eggs, vitamin D-fortified cereals or dairy products, and a daily supplement of 400 IU (10 µg) of vitamin D can help avoid deficiency. However, Ireland needs a formal vitamin D food policy/recommendation, and we still lack such policies, for example those in Finland, which has almost eliminated vitamin D deficiency in its population.”


About author

Isla Miller is from Lifeasible, a biotechnology company specializing in agricultural science that offers a wide variety of agro-related services and products for environmental and energy solutions. Lifeasible leverages its expertise and strengths to create a unique platform that is accessible to all leaders working in agriculture, botany, biology, ecology and environmental science.


Creative BioMart Introduced Protein Network Construction and Topological Structure Analysis Service

Creative BioMart, a global CRO providing high-quality protein analytical services to support research, manufacturing and clinical development of native and recombinant proteins, recently launched Protein Network Construction and Topological Structure Analysis Service for protein interaction research.


Predicting unknown protein function based on protein network is a very important research topic in bioinformatics. In particular, the rapid development of dynamic protein networks can provide a more effective network and improve the accuracy of protein function prediction by fusing multivariate biological information. Reducing the negative effects caused by false positives and false negatives is the key and bottleneck to improve the performance of protein function prediction. Weighted dynamic protein networks are constructed by using protein domain information, protein complex information and protein network topological characteristics to predict protein function.


Most diseases can be reflected at the genetic level, and some existing studies have confirmed that genes with similar functions or genes interacting in biological networks can lead to the same or similar diseases. Network based disease gene identification is an important method to discover disease genes. The relationship between disease genes is analyzed from topological structure similarity and functional similarity, candidate genes are ranked, and then disease genes are screened, inferred and discriminated.
Therefore, identifying disease genes and disease pathways on the basis of dynamic protein networks helps the development of disease therapeutics. At the same time, it can screen out more precise biomarkers, which provide the necessary technical means for the diagnosis and classification of diseases.


Creative BioMart has successful experience in providing more than 10,000 customized bioinformatics consultations and manages several bioinformatics core facilities that can help scientists around the world to conduct research and analyze data. In addition to advanced technology in bioinformatics, Creative BioMart provides informatics and statistical support as well as advice for protein network construction and topological analysis services.


According to its official spokesperson, available services include:

• Assistance with de novo tandem repeat detection

• Computing all possible crosslinks between proteins

• Sequence similarity search in protein databases

• Assistance in topology prediction of membrane proteins

• Consultation on topology and parameters for small organic molecules

• Assistance in comparison of asymmetric units and biological units

• Prediction of transmembrane topology and signal peptides

• Services in reconstruction of multimeric molecules in crystals


If more detailed information is needed about the Protein Network Construction and Topological Structure Analysis Service offered by Creative BioMart, please visit


Application of NGS in the Medical Field

Common NGS solutions in the medical field

- Whole genome sequencing (WGS)

WGS looks for individual mutations, InDels, CNVs, SVs, and so on. The amount of data required is large: a single sample requires about 90 Gb of data.

- Whole exome sequencing (WES)

WES captures all exons in the genome for sequencing. It needs much less sequencing data than WGS and can reach a high sequencing depth (50X-150X).

- Target sequencing

Target sequencing captures and enriches targeted regions of interest for sequencing. The technical process is similar to whole-exome capture. It requires less sequencing data than WES, and an even higher sequencing depth can be obtained (up to 500X).

- Transcriptome sequencing (RNA-Seq)

RNA-Seq performs sequencing at the transcription level, covering mRNA, lncRNA, microRNA, and so on.
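The link between data volume and sequencing depth in the list above is simple arithmetic: average depth is the number of sequenced bases divided by the size of the region being covered. The sketch below works this through under stated assumptions. The "90G" figure is read as 90 gigabases, the human genome size of about 3.1 Gb is an approximation, and `on_target_fraction` is a hypothetical parameter for capture-based methods where some reads miss the target.

```python
def mean_depth(total_bases, target_size):
    """Average depth = sequenced bases / size of the region covered."""
    return total_bases / target_size


def bases_needed(depth, target_size, on_target_fraction=1.0):
    """Bases to sequence to reach `depth`, allowing for reads that
    fall outside the target (relevant for capture-based methods)."""
    if not 0 < on_target_fraction <= 1:
        raise ValueError("on_target_fraction must be in (0, 1]")
    return depth * target_size / on_target_fraction


HUMAN_GENOME_BP = 3.1e9  # approximate human genome size (assumption)

wgs_depth = mean_depth(90e9, HUMAN_GENOME_BP)
print(f"90 Gb of WGS data gives roughly {wgs_depth:.0f}X average depth")

# Hypothetical capture panel: 50 Mb target, 100X depth, half the reads on target.
panel_bases = bases_needed(100, 50e6, on_target_fraction=0.5)
print(f"Panel needs about {panel_bases / 1e9:.0f} Gb of data")
```

This is why narrowing the target from the whole genome to the exome or a gene panel allows much higher depth for the same or less data.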

Whole genome sequencing (WGS/DNA-Seq)

Whole-genome sequencing, as used here, means sequencing the genomes of individuals from a species whose reference genome sequence is already known, and then analyzing the differences at the individual or population level. As the cost of genome sequencing continues to decrease, the study of pathogenic mutations in human diseases has expanded from exon regions to the whole genome. By constructing libraries with inserts of different lengths and applying short-read, paired-end sequencing, high-throughput sequencing can detect common, low-frequency, and even rare mutation sites and structural variations at the genome-wide level, which has great value for both research and industry.

Application of whole genome sequencing

- SNP (individual differences)

- CNV (large fragment gene copy number variation)

- InDel

- SV (structural variation)

Why study SNP?

- SNPs are recognized as genetic molecular markers of disease occurrence.

- SNPs are considered one of the main factors behind differences in drug response, so medication can be personalized according to a patient's SNPs.

- SNPs are widely distributed and relatively stable.

- SNPs can directly affect the expression of functional genes.

Most SNPs are useless

- Most SNPs are silent mutations.

- Mutations in non-coding regions do not change the protein sequence.

But there are exceptions.

Applications of WES

- Cancer-related (human)

- Genetic diseases (human)

- Other non-communicable diseases (human)

WES is mainly used to find rare mutations, inherited mutations, and cancer-related somatic mutations.

SNP and InDel analysis can be done, but because the captured regions are short, WES is generally not used for CNV and SV analysis.

Features of WES

- Low cost (generally 50-150X depth, requiring only 8-15 Gb of data), a reduced analysis background, and easier detection of rare mutations. Deletions of approximately 50-80 bp can be detected. Because the fragments on the exon capture chip are short, it is difficult to determine whether an apparent loss was caused by off-target capture or by a real deletion.

- Similarly, because the capture chip fragments are short, CNV and SV analysis is generally not performed.

Transcriptome sequencing (RNA-Seq)

- Includes mRNA-Seq, lncRNA-Seq, sRNA-Seq, etc.

- High-throughput analysis of transcript information to discover unknown transcripts and improve gene annotation.

- Looks for changes in the expression abundance of a gene of interest between individuals, and within the same individual at different time points.

Advantages of RNA-seq

RNA-Seq is not limited to known genomic sequence information, so it is suitable for high-throughput transcriptome research in species whose genome sequence is unknown. Compared with chip (microarray) technology, the background signal is low, there is no upper limit of detection, and the detection range for gene expression profiles is very wide. With an internal reference, it shows high accuracy and repeatability in quantification.

No cloning steps are required, the operation is simple, and the required sample volume is small. Expression profiling can be performed at the single-cell level, and the cost is lower than that of tiling arrays or large-scale EST sequencing.

Challenges of RNA-seq

During library construction, large RNA molecules must be fragmented, which introduces some bias. PCR amplification can distort the measured expression levels. Aligning or assembling massive amounts of short-read data is complicated, and precisely positioning repetitive sequences and reads that match multiple locations remains an obvious problem. There are still considerable errors in the identification of alternative splicing and trans-splicing in higher eukaryotes. The required sequencing depth varies with species, organ, tissue, and time point, and is difficult to calculate directly with a single unified formula.

What are CTCs?

CTCs (Circulating Tumor Cells) are tumor cells released into the peripheral blood circulation by solid tumors or metastases spontaneously or due to diagnostic procedures.

CTCs are tumor cells that survive in the blood circulation system during tumor metastasis, and their generation is considered to be a necessary prerequisite for tumor metastasis.

Research significance of CTCs

In-depth study of CTCs can help to further understand the mechanism of tumor metastasis and provide a new basis for treatments that target metastasis. The detection of CTCs can help diagnose patients with early metastatic tumors and monitor postoperative recurrence and metastasis. It can also help assess sensitivity to anti-tumor drugs, judge patient prognosis, and select individualized treatment strategies.

Research difficulties of CTCs

CTCs have no obvious specific feature that clearly distinguishes them from other blood cells, and tumors of different histological types and molecular phenotypes express different markers.

CTCs are scarce in peripheral blood. Generally, there is only 1 CTC per 10^6-10^7 white blood cells.
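To give a feel for how scarce this is, the sketch below estimates how many CTCs a blood sample might contain at the stated ratio of 1 per 10^6-10^7 white blood cells. The sample volume (7.5 mL) and the white blood cell concentration (7 x 10^6 per mL, within the usual adult range) are illustrative assumptions and do not come from the text. The Poisson step is a simple statistical model added for illustration.

```python
import math


def expected_ctcs(blood_ml, wbc_per_ml, ctc_per_wbc):
    """Expected number of CTCs in a blood sample."""
    return blood_ml * wbc_per_ml * ctc_per_wbc


def prob_at_least_one(expected):
    """Poisson probability that the sample holds one or more CTCs."""
    return 1.0 - math.exp(-expected)


BLOOD_ML = 7.5        # hypothetical sample volume
WBC_PER_ML = 7e6      # assumed white blood cell concentration

for ratio in (1e-6, 1e-7):  # 1 CTC per 10^6 and per 10^7 white blood cells
    n = expected_ctcs(BLOOD_ML, WBC_PER_ML, ratio)
    print(f"ratio {ratio:.0e}: about {n:.1f} CTCs expected, "
          f"P(at least one) = {prob_at_least_one(n):.3f}")
```

Under these assumptions a sample holds only a handful to a few dozen CTCs among tens of millions of white blood cells, which is why capture and enrichment come before any sequencing step.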

Combination of CTCs and NGS

- Capture CTCs

- Identify CTCs

- Single-cell amplification

- Library construction from trace DNA (WES or targeted region capture)

- High-throughput sequencing

- Bioinformatics analysis

Introduction to Stem Cell Therapy

Stem cells have multidirectional differentiation potential and low immunogenicity, and they have strong immunoregulatory functions, participating in both innate and acquired immune processes. Stem cell therapy has become another high-tech form of treatment after drug intervention and surgical intervention, and clinical trials and applications are under way worldwide. Today let’s take a look at the progress of stem cell therapy in several areas.

1. Hematopoietic stem cells: bone marrow transplantation

Hematopoietic stem cell transplantation, also known as bone marrow transplantation, is the earliest and relatively mature stem cell therapy. It is mainly used to replenish hematopoietic stem cells damaged by radiotherapy or chemotherapy, and so rebuild the whole hematopoietic system. Allogeneic hematopoietic stem cell transplantation, the earliest and most common form, remains a major clinical challenge because of the difficulty of matching and the risk of immune rejection. Gene therapy based on hematopoietic stem cells can address this problem. In this approach, CD34+ hematopoietic stem cells are purified from the patient’s bone marrow or peripheral blood using the molecular marker CD34, a viral vector carrying the target gene is used to insert that gene into their genome, and the resulting healthy hematopoietic stem cells are transfused back into the patient. Over the past two decades, hematopoietic stem cell-based gene therapy has become a more effective treatment for monogenic genetic diseases, such as primary immunodeficiency, blood disorders, and neurometabolic disorders. However, it also has drawbacks. First, the whole treatment process is complicated, involving the collection, transport, editing, and preservation of cells, which requires rigorous quality control standards. Second, patients need to undergo chemotherapy to remove their original hematopoietic stem cells before transplantation, so standards for chemotherapy dose control are also needed. Most seriously, when gamma-retroviruses are used for gene modification, the enhancers of the viral vector may inadvertently activate endogenous proto-oncogenes, which can lead to malignancy.

2. Treatment of Amyotrophic Lateral Sclerosis (ALS) with pluripotent stem cells

Advances in stem cell technology offer hope to patients with neurological diseases: stem cells can support neurons and surrounding cells by releasing neurotrophic factors or by direct cell replacement. ALS is a typical motor neuron disease. Because the mechanism of motor neuron degeneration is not yet clear, conventional treatment offers no effective intervention. Traditional medicines cannot relieve symptoms or reverse the progressive deterioration, and are prone to side effects in some patients. Stem cell therapy has therefore attracted widespread attention as a way to intervene in ALS. In recent years, stem cell research has opened a new avenue for nerve repair and regeneration, and it is regarded as one of the effective ways to intervene in degenerative diseases of the central nervous system such as ALS. In neurodegenerative diseases, the main goals of stem cell therapy are cell replacement and neuroprotection. Pluripotent stem cells either directly replace motor neurons and diseased glia or provide support that slows their degeneration.

Five mechanisms of stem cell therapy intervention in ALS:

(1) Cell replacement: replace degenerated motor neurons with new functional motor neurons, protect and support motor neurons, and restore neurological function.

(2) Delivery of trophic factors: secrete neurotrophic factors and a variety of cytokines, improve the nigrostriatal system, and promote tissue repair in damaged areas.

(3) Immune regulation: regulate immune cell function and reduce the inflammatory response by reducing antibody production by B cells, reducing or inhibiting the activation of DCs and T cells, and inhibiting cytokine secretion by NK cells.

(4) Homing: provide a source of cells that survive and home to the injured area, differentiate into the expected cell types, improve the microenvironment in the brain, and reconstruct neurological functional areas and conduction pathways.

(5) Exosome secretion: stem cells have a high capacity to secrete exosomes, which have anti-neuroinflammatory activity and support functional repair and neurovascular reconstruction.

3. Epidermal stem cells in the treatment of burns

Epidermal stem cells are a group of adult stem cells located in the basal layer of the epidermis, which can differentiate into the various cell types of the epidermis. At present, human epidermal keratinocyte transplantation has become a routine treatment for third-degree burns. Early studies found that human epidermal keratinocytes contain three cell types, holoclone, meroclone and paraclone, with decreasing cloning ability and stemness in that order. We now know that holoclones are the key to successful epidermal stem cell transplantation, and that neither of the other two cell types can achieve long-term epidermal renewal. Likewise, the corneal limbus contains a type of corneal epithelial stem cell that is essential for the repair of corneal damage. After a chemical burn the limbus loses its stem cells, and angiogenesis, chronic inflammation, corneal opacity, and bulbar conjunctival invasion occur in the cornea, which in turn cause vision loss. Limbal cell transplantation can effectively promote corneal regeneration and visual recovery. Usually, 1-2 square millimeters of limbus taken from the patient’s uninjured eye can be cultured in vitro to regenerate a complete corneal epithelium for autotransplantation into the injured eye.

4. Mesenchymal stem cell (MSC) therapy

Researchers have discovered a class of cells with cloning ability in bone marrow that can differentiate into cartilage, bone, hematopoietic support matrix, and bone marrow adipocytes. Given their multipotency and their ability to differentiate into mesodermal connective tissues outside their lineage, such as muscle, tendon, ligament, and adipose tissue, these cells have been named mesenchymal stem cells. Although the theoretical basis of MSCs is not solid, they have been used in more than 900 clinical trials to treat various types of diseases. They disappear soon after transplantation, so any effect is probably exerted by paracrine means. Judging from the few published experimental results, the therapeutic effect of MSCs is often less than expected. However, there are a few successful clinical cases of MSC therapy. For example, bone marrow stem cell transplantation has been used to treat massive bone loss and to regenerate dental pulp.

Behind the success stories of stem cell therapy are decades of in-depth basic research that paved the way, covering the biological characteristics of stem cells themselves, their differentiation lineages, and their signaling mechanisms. These successes also depend on practical clinical delivery options. Hematopoietic stem cell transplantation requires only a relatively simple intravenous injection, and epidermal stem cells need to be transplanted to the appropriate location on the skin, whereas stem cell therapy for the central nervous system involves a more complex operation.

Single Cell Sequencing: The Technology, Challenge and Future (IV)

Wolf Reik hopes that epigenetics technology can also reach the level of single cell detection as soon as possible.

Even more difficult than genome and transcriptome research is the study of the epigenome, the chemical marks that are attached to the genome and regulate gene expression. Although current epigenetics technology has not yet reached the single-cell level (because traditional epigenetics techniques degrade DNA), researchers are still eager to see the epigenome of individual tumor cells. Tang’s research team has developed a new technology that can study DNA methylation within a single-cell genome (Genome Res. 23, 2126–2135, 2013). Tang believes that single-cell technology is genuinely needed for epigenome research. Only in this way can researchers understand how one tumor cell differs from the tumor cells around it, and whether that difference is caused by methylation or by other mechanisms. The Wolf Reik team at the Wellcome Trust Sanger Institute in the UK has analyzed the methylome of 50 to 100 cells, and he said he really wanted to go one step further.

2.6 Exploration to neuron cells

Neurons are the latest subject of single-cell research, and scientists are actually not quite sure what information and conclusions these studies will yield. Only recently has there been experimental evidence that neurons also differ in their genomes. Despite these results, scientists are still puzzled by the diversity of neurons. As early as 2001, Jerold Chun, who was then working at the University of California, San Diego, discovered chromosomal aneuploidy in the brains of mice, and in 2005 the same phenomenon was found in human brain cells. According to McConnell, who was a graduate student in the Chun laboratory at that time, no one knew what to do next after getting these results. They had discovered only the tip of the iceberg: if there is aneuploidy in a cell, there must also be many gene mutations, or genome mutations.

At almost the same time, another group of researchers found that each human genome contains, on average, 80 to 100 potentially active L1 elements (DNA elements that can copy and paste themselves throughout the genome), and that these L1 elements are active in brain neurons. This study, along with some other results, showed that genomic diversity is at least possible, but no one can say clearly how large this variation is.

According to Thomas Insel of the US National Institute of Mental Health, researchers are just beginning to understand the molecular diversity of brain cells. Single-cell technology plays a key role in this field, not only in classifying the types of neurons and glial cells, but also in helping us understand how experience and development influence gene expression in a given area of the brain.

Scientists can use several methods to detect single-cell genome variation. The Christopher Walsh team at Harvard Medical School conducted a single-cell study of L1 element insertions in 300 neurons taken from postmortem brains (Cell 151, 483–496, 2012). They found only a few L1 insertions, which indicates that L1 elements are probably not the main cause of genomic diversity, at least in cells of the cerebral cortex and caudate nucleus.

In 2013, several other research groups also conducted genome-wide studies of single human neurons. For example, an article published in November 2013 described genome-wide sequencing of 110 frontal cortex neurons from the brains of three healthy people. The results were quite surprising: large CNVs were found in these neurons (Science 342, 632–637, 2013). Studies of neurons derived from healthy human skin cells found the same thing, and these neurons had more CNVs than the skin cells from which they were derived. This suggests that neurons derived from iPS cells are a very good research material for studying cell diversity.

Despite these discoveries, neuroscientists are still struggling because they do not know what these somatic mutations mean. Ira Hall, a geneticist at the University of Virginia and one of the collaborators on the Science article, believes that these studies mean the brain’s resistance to influence and interference is very weak. In addition, genomic mosaicism can also affect people’s risk of developing tumors and other diseases. To find out which parts of the brain are more susceptible to interference than others, and how much the different parts of the brain differ, researchers will have to study more cells. McConnell, who is currently engaged in research in this area, believes that he still knows nothing.

2.7 The further development

Although single-cell technology has the potential to solve many major problems in the life sciences, technological progress is far from over. For example, researchers must work out how to distinguish true biological differences from the background noise of the technology itself. Joakim Lundeberg of KTH Royal Institute of Technology in Sweden, whose laboratory has developed a tissue RNA sequencing technology, believes that single-cell RNA and DNA sequencing is still far from powerful enough. He said that researchers also need to analyze more single cells in one experiment in order to solve the problem of background noise, which would at least deepen their understanding of the differences between cells in the same tissue.

Given the problems that remain, such as cell isolation, data computation, and the specific issues that arise in different fields, Blainey hopes that single-cell research technology can make greater progress in the next few years.

For newcomers to this field, choosing a transcriptome sequencing technology may be a headache for a long time. The choice should depend on the purpose of the research, such as whether you want to analyze many cells to find homologous transcripts or want to find low-abundance RNA. “But it’s always a good thing to have multiple methods to choose from,” Quake said. Quake’s team found that if the reaction volume during sample preparation is kept at the nanoliter scale (they use the C1 system provided by Fluidigm), then single-cell qPCR and single-cell RNA sequencing give almost the same detection results (Nat. Methods 11, 41–46, 2014).

With the introduction of commercial products, and the various laboratories who have summed up their “unique secrets” after years of practice, the choice of genome amplification technology is also improving. However, because everyone uses different techniques for genome amplification, it is difficult to directly compare different research results.

At the same time, researchers engaged in cancer research, brain neuroscience research, microbiological research, and drug development and other fields of research will also benefit from these technological advancements. And these technological advancements will also attract many outstanding talents to join the field of single-cell research, such as Reik, who has already made a lot of achievements in epigenetics research. Reik only participated in the single-cell academic conference for the first time last year, and has never been exposed to single-cell research before. Reik is very excited to see so many new technologies. He pointed out that at the beginning people will be excited by the technology itself, and it will not be long before people will use these new technologies to solve important life science problems, which will be more exciting.

To be continued in Part V…


First Direct Sequence of SARS-CoV-2 RNA Achieved

The Lachlan Coin and Sebastian Duchene teams and their collaborators in the Department of Microbiology and Immunology, University of Melbourne, Australia, provided the first direct RNA sequence of SARS-CoV-2, detailed the structure of this coronavirus's subgenome-length mRNAs, and described various aspects of coronavirus evolutionary genetics revealed by shared data. The article was released on March 7 on the preprint server bioRxiv (articles on bioRxiv are not peer reviewed).


SARS-CoV-2 is a positive-sense single-stranded RNA virus of the family Coronaviridae. It is related to other betacoronaviruses, which can infect mammalian and avian hosts and include the MERS coronavirus and the SARS coronavirus.


To determine the structure of mRNAs of the subgenome length of SARS-CoV-2, researchers used a recently established direct RNA sequencing method based on highly parallel nanopore arrays. Briefly, nucleic acids were prepared from culture material with high levels of SARS-CoV-2 and sequenced on the GridION platform.


With this approach, the SARS-CoV-2 sample yielded 680,347 reads containing 860 Mb of sequence information in 40 hours of sequencing. Consistent with the sample being a cultured isolate of the new coronavirus, a portion of the reads (28.9%) were coronavirus sequences, comprising 367 Mb of sequence distributed across the 29,893-base genome. Some reads are more than 20,000 bases in length, so the researchers captured most of the genome on a single molecule.
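The figures above can be turned into a few summary statistics with simple arithmetic. The sketch below assumes that "Mb" means 10^6 bases and that the 367 Mb refers to bases aligned to the viral genome. The derived values are rough averages and are not numbers reported by the authors.

```python
total_reads = 680_347        # reads reported in the text
total_bases = 860e6          # 860 Mb of sequence
viral_read_fraction = 0.289  # 28.9% of reads were coronavirus
viral_bases = 367e6          # 367 Mb of coronavirus sequence
genome_length = 29_893       # bases in the SARS-CoV-2 genome

# Average length of a read across the whole run.
mean_read_length = total_bases / total_reads

# Approximate number of reads assigned to the virus.
viral_reads = viral_read_fraction * total_reads

# Average number of times each genome position was sequenced.
mean_coverage = viral_bases / genome_length

print(f"mean read length: {mean_read_length:.0f} bases")
print(f"coronavirus reads: {viral_reads:.0f}")
print(f"mean genome coverage: {mean_coverage:.0f}X")
```

An average coverage in the thousands is far more than is needed to read the genome itself, which is consistent with the study's focus on the structure of subgenome-length mRNAs and on RNA modifications.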


Through data analysis, the investigators identified 42 sites with predicted 5-methylcytosine modifications, located at consistent positions across the subgenome-length mRNAs.


In other positive-sense single-stranded RNA viruses, RNA methylation changes dynamically during infection, affecting host-pathogen interactions and viral replication. Once the direct RNA sequence data set for SARS-CoV-2 is available, researchers may discover other modifications. Little is currently known about the epitranscriptomic modifications of coronaviruses.


The researchers believe that by using direct RNA sequence data, it helps to gain insight into the molecular biology of SARS-CoV-2 and may help construct a detailed view of the viral subgenome-length mRNA structure.