Microsynteny analysis

Yi-Jyun Luo | 2017.05.25 | version 1.0

I. Background:

Synteny in comparative genomics refers to the conserved genomic organization where orthologous genetic loci are physically colocalized on the same scaffold or chromosome among different species. This genomic similarity reflects the conserved genomic features due to shared ancestry.

Based on the scale of conserved regions, synteny can be classified into macrosynteny and microsynteny. Macrosynteny is the form of synteny mostly on the chromosomal level (e.g. super-scaffolds or linkage groups), where genetic loci are located on the same chromosome but their location and order may not be the same among different species. On the other hand, microsynteny shows genomic organization on a smaller scale, where genetic loci are linked as conserved blocks on the scaffold level. In this case, the arrangement of neighboring genes is often the same. A famous example of microsynteny is the Hox cluster, in which the Hox genes are usually arranged in the same order, the property so-called collinearity.

To analyze microsynteny, one will need to have assigned orthologous genes and their locations on the scaffolds. This analysis is conducted with the combination of Bash commands and Perl scripts. It is a prerequisite to have orthologous groups from the OrthoMCL output (or equivalent) to perform this analysis.

Note: This pipeline is originally for personal use only. So some procedures may not be that straightforward. I may modify and streamline the pipeline if I have time.

II. Prerequisite files:

  1. Proteome FASTA files in OrthoMCL format
  2. Nucleotide FASTA files with their original header format
  3. OrthoMCL output
  4. Genome GFF files
1. Proteome FASTA files in OrthoMCL format (mostly downloaded from UniProt)

(header format: "three-letter_species_code" + "pipe |" + "UniProt_ID OR unique_ID_with_five-letter_species_code OR unique_ID")

Examples of acceptable formats:
1) BMP2 in Homo sapiens -> hsa|BMP2_HUMAN
2) BMP2/4 in Lottia gigantea -> lgi|V3Z494_LOTGI
("three-letter species code"|"UniProt ID")
3) BMP2/4 in Octopus bimaculoides -> obi|22032306
("three-letter species code"|"unique ID")
4) BMP2/4 in Lingula anatina -> lan|14486_LINAN
("three-letter species code"|"unique ID with five-letter species code")

e.g.

>hsa|BMP2_HUMAN
MVAGTRCLLALLLPQVLLGGAAGLVPELGRRKFAAASSGRPSSQPSDEVLSEFELRLLSM
FGLKQRPTPSRDAVVPPYMLDLYRRHSGQPGSPAPDHRLERAASRANTVRSFHHEESLEE
LPETSGKTTRRFFFNLSSIPTEEFITSAELQVFREQMQDALGNNSSFHHRINIYEIIKPA
TANSKFPVTRLLDTRLVNQNASRWESFDVTPAVMRWTAQGHANHGFVVEVAHLEEKQGVS
KRHVRISRSLHQDEHSWSQIRPLLVTFGHDGKGHPLHKREKRQAKHKQRKRLKSSCKRHP
LYVDFSDVGWNDWIVAPPGYHAFYCHGECPFPLADHLNSTNHAIVQTLVNSVNSKIPKAC
CVPTELSAISMLYLDENEKVVLKNYQDMVVEGCGCR
>lgi|V3Z494_LOTGI
MIADFYRSVVLLLLVLVVNYTSSLSTNTLHPRKHGNEKVLAAVESNLLGLFGLKSRPVPG
RRTPIPDYMLDLYKRLDSEPDFISPFAESKGKGIVEANTVRSFYHSDINFVKDCDERTCA
RIWFNISTIPFAEIMAASELRVFKDVYNHFKLTDANSLSGDRIKASLFKHRLEIHEIMRV
SPDDSECISRLIDTKIVDTRNSSWESFDVHPAVLKWRRRPQYNYGLEVRIVSKSPWISTN
SHVRLRRSAHMDEKNWQIQRPILVTYTHDGRNPNSRIKRASARNRRRKSRKRRRRKKKGN
KNECKRHALYVDFGDVGWNDWIVAPTGYNAYFCRGECPFPMGQHLNSTHHAVMQTLVHSV
NPSAVPKACCVPSELSAISMLYLDEWDKVVLKNYQDMVVEGCGCR
>obi|22032306
MMVRHQLMLVVVLFVLLHGQVLAIILSTDTFSHHLPPASTSPLSTSPLSVSSSSSSSSSS
SSTSSSSSSSPLVSSSSSSTSSSSSSSSSSSVKVTTNKRDNFPQTHDDLFKAFEANLLQI
FGLKNRPKLNEKTFVPQYMWDLYDSHKQKRDPSAPTSGRGYLPANTVRSFYHIDPHFMDC
TEENASTIMFNISNMMEEEILTAADLRIFWDPKFYEYNYRRDASLSDISIDNSYDSNYNI
TKDNNNKNMFKKLKYFPIRIEVHEVISPPRWGSSNECITRLIDTKVIHSNSNTSTWIVFD
VHPAVLKWKQTPHLNHGLHIRLSTATKNPVSVQPHQHIRLKRAADIEQNEWQNSRPLLIT
YTDDGKGVPLKRSRRSARHRRERGKKKRRRSKKVQRHRKAGAECRRHPMYVSFNEVGWTE
WIVAPVGYQAFYCDGKCPFPISDYLNSTNHAIVQTLVNSVMPDRIPEVCCVPTELSSISM
LYLDEQNKVVLKNYQGMVVKACGCR
>lan|14486_LINAN
MIDEGRDLKVEIGQAAQVSGLSLVHRSFYCIRLFSDITEQLPGFTMNAGNWHLFVIVLSV
VVFGQETALLVAGNTNNNRREDYTGETSEKGTTIRENSGQRTKDIQALESSLLKMFGLGS
RPRPRKKLEIPQYMLELYQKQMQIRDSEEIMGGDINVRGKFTGSANTVRSFHHLEPKVHE
AQSSTDKVTLIFNISSVPEEEILKAAELRIYREQILQHLEPSLKRKRDFRYKLEAHEVIK
PQTKNRDSITRLLDTKIIDIRNSSWESLDLTPAVVKWRTTPRKSHGVEVHFIRLDGRPSE
LKSSHVRLRRSINGDDSTWQAQRPLLVTYSDDTRSKRTKREGKRRSRSHKRKSKEQCRRH
ALYVDFKDVGWDDWIVAPPGYQAFYCHGDCPFPLADHLNSTNHAIVQTLVNSVNPSAVPK
ACCIPTELSPISMLYLDEYDKVVLKNYQDMVVEGCGCR
2. Nucleotide FASTA files with their original header format (the same as the one used in the GFF file)

e.g.

>jgi|Lotgi1|205842|estExt_fgenesh2_pm.C_sca_70027
ATGATCGCAGATTTTTACAGAAGTGTTGTGCTACTTTTATTAGTACTAGTTGTTAATTATACTTCCTCTC
TTTCTACCAACACGTTACACCCGAGGAAGCATGGAAATGAGAAGGTCCTAGCCGCTGTCGAATCTAATCT
TCTTGGTCTGTTTGGATTAAAGTCACGTCCAGTACCAGGTCGTCGGACACCTATTCCAGATTATATGTTA
GACCTTTACAAACGTCTAGATTCAGAACCAGACTTTATTAGCCCTTTTGCTGAATCTAAAGGCAAAGGAA
TCGTTGAAGCCAACACAGTTCGCAGCTTTTATCATAGCGATATCAACTTTGTTAAAGATTGTGATGAAAG
AACTTGTGCTAGAATATGGTTTAATATTAGTACCATTCCCTTTGCTGAGATTATGGCAGCATCCGAGCTA
CGTGTATTTAAGGACGTTTACAATCATTTCAAATTAACAGACGCAAATTCCTTAAGTGGTGATAGAATTA
AAGCGAGTTTGTTTAAACATCGATTAGAAATACATGAAATAATGCGAGTGTCGCCTGATGACAGCGAGTG
TATATCTCGACTTATTGACACTAAAATAGTGGATACTAGAAATTCCTCTTGGGAGTCGTTTGATGTGCAT
CCCGCAGTATTAAAGTGGCGAAGGAGACCTCAATACAATTATGGTTTAGAAGTGAGAATTGTGTCGAAAA
GTCCATGGATATCTACAAATTCACATGTGCGTTTACGTCGTTCAGCGCATATGGACGAAAAAAATTGGCA
AATCCAAAGACCCATTCTAGTAACTTATACACATGATGGACGTAATCCCAATTCTCGAATTAAACGTGCA
TCAGCCAGAAACCGTAGACGAAAGTCGCGGAAGCGTAGACGTCGTAAGAAGAAGGGAAATAAAAATGAGT
GCAAAAGACACGCTTTATATGTAGACTTCGGCGATGTCGGTTGGAATGACTGGATTGTAGCACCAACTGG
CTATAATGCCTATTTTTGTCGGGGTGAGTGTCCGTTTCCGATGGGACAACATTTGAATTCGACACACCAT
GCAGTGATGCAAACTTTAGTGCATTCTGTTAACCCTAGTGCCGTACCAAAAGCATGTTGTGTGCCTTCAG
AACTTAGCGCTATATCAATGTTATACTTAGATGAGTGGGACAAAGTTGTGCTTAAAAACTATCAAGATAT
GGTAGTAGAGGGATGCGGTTGTCGGTAACATTATCCATTTCTGGTCGTGTTATTATCTGTGTTCTATGTC
TAAATGACATTCTCTCCAAGTGAAATATACCAACTACTCAGGTTCACCAATTTCAAGTTGTGCTCTTAAA
AATATTTGTTTTCCAAATCTCATATTTTTTCCAGTTTGATGTATTTCACACAATTCTTACTTGGTTATAA
TAATTATGGCACCAAATACTTTCCATGCACTTTGATTTGTGTATGGAAACCCTAATGTTCGAACTAGTCA
TGGTAAAAATCAAAGTTTAGTAGATTACGTTCTGCCAATTTGTATTTATGTTCTTATATCCTGCTTTATA
TTTTGTGCAAACAGTGCTGTTTTTTAGGGCGAAGATTAGTTAATTAACGAATCATATTCCCCGATAAAGA
TTTGTATATATCATACCTAATATCCCTAAAGATCTATGATGAAAACTGAACTGAAAGTATAATAACTGAA
TTGTATAAGTTGGGTTCAATAGGTATCCTTTATAAACAATAGATTATATATTTATAACTCATTATTATAA
TTATTATTATTATTATTCCATTATTGTTATTGTTGATTTGTAGAATTGTAATAATTTAATTATAAAAAAG
TAAAATT
3. OrthoMCL output (default file name: groups.txt)

Orthologous_group_ID: Gene_ID_1 Gene_ID_2 Gene_ID_3 Gene_ID_4 ...

e.g.

OG_03582: dre|F1QAD6_DANRE dre|F1R4K7_DANRE dre|F1QGJ0_DANRE dre|O93369_DANRE gga|BMP2_CHICK gga|F1NNW4_CHICK gga|BMP4_CHICK gga|E1C0T2_CHICK nge|06947_NOTGE nge|13269_NOTGE sko|NP_001158387 sko|XP_006813602 spu|W4XBC2_STRPU spu|W4Z063_STRPU tru|H2T593_TAKRU tru|Q804S3_TAKRU tru|H2UJP3_TAKRU tru|H2UJP4_TAKRU adi|01025_ACRDI bfl|C3Z3V6_BRAFL cgi|K1PQP9_CRAGI cte|R7V2Z1_CAPTE dme|DECA_DROME dpu|E9FTU5_DAPPU dre|B3DI86_DANRE hsa|BMP2_HUMAN lan|14486_LINAN lgi|V3Z494_LOTGI nve|A7SAY4_NEMVE obi|22032306 ola|H2MDM6_ORYLA pau|13103_PHOAU pfu|03514_PINFU tca|D2A252_TRICA xtr|Q90YD7_XENTR cin|H2XMY6_CIOIN cmi|V9KA22_CALMI xtr|Q90YD6_XENTR ola|H2M1H6_ORYLA hro|T1ELG7_HELRO hsa|BMP4_HUMAN
OG_09501: dpu|E9H260_DAPPU ava|GSADVT00001869001 dre|A2AVJ4_DANRE gga|F1NTA6_CHICK ola|H2MTZ7_ORYLA bfl|C3Z5W6_BRAFL cgi|K1PDY2_CRAGI cin|F6RS38_CIOIN cte|R7VIX0_CAPTE hsa|BMP3_HUMAN lan|07885_LINAN nge|09806_NOTGE pau|12390_PHOAU sko|XP_002735398 spu|W4XWL1_STRPU tru|H2RPS9_TAKRU xtr|F6SX40_XENTR cmi|V9KUY6_CALMI dre|F1QDD0_DANRE pfu|21013_PINFU obi|22012812
4. Genome GFF files

More about the GFF file format:
http://gmod.org/wiki/GFF3
http://www.ensembl.org/info/website/upload/gff.html

In brief, the GFF file contains information about the gene locus on the scaffold. For example, in the Lingula genome, gene "g14486" (lan-bmp2/4) is located on the scaffold 255 from the location 347,976 to 358,644 on the forward strand (+). In the Lottia genome, gene "estExt_fgenesh2_pm.C_sca_70027" (lgi-bmp2/4) is located on the scaffold 7 (sca_7) from 1,740,655 to 1,742,658.

e.g.

scaffold255 AUGUSTUS    gene    347976  358644  0.69    +   .   g14486
scaffold255 AUGUSTUS    transcript  347976  358644  0.69    +   .   g14486.t1
scaffold255 AUGUSTUS    start_codon 347976  347978  .   +   0   transcript_id "g14486.t1"; gene_id "g14486";
scaffold255 AUGUSTUS    initial 347976  348017  0.69    +   0   transcript_id "g14486.t1"; gene_id "g14486";
scaffold255 AUGUSTUS    internal    354120  354204  0.97    +   0   transcript_id "g14486.t1"; gene_id "g14486";
scaffold255 AUGUSTUS    internal    356932  357327  1   +   2   transcript_id "g14486.t1"; gene_id "g14486";
scaffold255 AUGUSTUS    terminal    357791  358644  1   +   2   transcript_id "g14486.t1"; gene_id "g14486";
scaffold255 AUGUSTUS    intron  348018  354119  0.69    +   .   transcript_id "g14486.t1"; gene_id "g14486";
scaffold255 AUGUSTUS    intron  354205  356931  1   +   .   transcript_id "g14486.t1"; gene_id "g14486";
scaffold255 AUGUSTUS    intron  357328  357790  1   +   .   transcript_id "g14486.t1"; gene_id "g14486";
scaffold255 AUGUSTUS    CDS 347976  348017  0.69    +   0   transcript_id "g14486.t1"; gene_id "g14486";
scaffold255 AUGUSTUS    CDS 354120  354204  0.97    +   0   transcript_id "g14486.t1"; gene_id "g14486";
scaffold255 AUGUSTUS    CDS 356932  357327  1   +   2   transcript_id "g14486.t1"; gene_id "g14486";
scaffold255 AUGUSTUS    CDS 357791  358644  1   +   2   transcript_id "g14486.t1"; gene_id "g14486";
scaffold255 AUGUSTUS    stop_codon  358642  358644  .   +   0   transcript_id "g14486.t1"; gene_id "g14486";
sca_7   JGI gene    1740655 1742658 .   +   .   ID=estExt_fgenesh2_pm.C_sca_70027;trID=JGI_V11_205842
sca_7   JGI mRNA    1740655 1742658 .   +   .   Parent=estExt_fgenesh2_pm.C_sca_70027;ID=JGI_V11_205842
sca_7   JGI exon    1740655 1740973 .   +   .   Parent=JGI_V11_205842
sca_7   JGI CDS 1740655 1740973 .   +   0   Parent=JGI_V11_205842;ni=1
sca_7   JGI exon    1741151 1742658 .   +   .   Parent=JGI_V11_205842
sca_7   JGI CDS 1741151 1742049 .   +   1   Parent=JGI_V11_205842;ni=2

III. Procedures (required Perl scripts are provided):

  1. Identify orthologous groups (OG) with OrthoMCL.
  2. Prepare FASTA and GFF files.
  3. Select orthologs shared among the target species.
  4. Convert the original gene model IDs into the format that is consistent to the OrthoMCL output (if needed).
  5. Obtain the locus information of each gene from the GFF files.
  6. Convert the protein IDs into OG IDs obtained from OrthoMCL.
  7. Prepare a loci table with scaffold and OG IDs on the same line.
  8. Identify microsynteny by comparing the loci tables from a pairwise comparison.
1. Identify orthologous groups (OG) with OrthoMCL.

To identify orthologous groups with the stand-alone OrthoMCL in a local environment, please read the following paper for details.

Fischer, S. et al. Using OrthoMCL to assign proteins to OrthoMCL-DB groups or to cluster proteomes into new ortholog groups. Curr. Protoc. Bioinformatics Chapter 6, Unit 6 12 11-19 (2011). https://www.ncbi.nlm.nih.gov/pubmed/21901743

e.g.

OG_03582: dre|F1QAD6_DANRE dre|F1R4K7_DANRE dre|F1QGJ0_DANRE dre|O93369_DANRE gga|BMP2_CHICK gga|F1NNW4_CHICK gga|BMP4_CHICK gga|E1C0T2_CHICK nge|06947_NOTGE nge|13269_NOTGE sko|NP_001158387 sko|XP_006813602 spu|W4XBC2_STRPU spu|W4Z063_STRPU tru|H2T593_TAKRU tru|Q804S3_TAKRU tru|H2UJP3_TAKRU tru|H2UJP4_TAKRU adi|01025_ACRDI bfl|C3Z3V6_BRAFL cgi|K1PQP9_CRAGI cte|R7V2Z1_CAPTE dme|DECA_DROME dpu|E9FTU5_DAPPU dre|B3DI86_DANRE hsa|BMP2_HUMAN lan|14486_LINAN lgi|V3Z494_LOTGI nve|A7SAY4_NEMVE obi|22032306 ola|H2MDM6_ORYLA pau|13103_PHOAU pfu|03514_PINFU tca|D2A252_TRICA xtr|Q90YD7_XENTR cin|H2XMY6_CIOIN cmi|V9KA22_CALMI xtr|Q90YD6_XENTR ola|H2M1H6_ORYLA hro|T1ELG7_HELRO hsa|BMP4_HUMAN
2. Prepare FASTA and GFF files.
3. Select orthologs shared among the target species.

It is a kind of filtering step. Removing unnecessary data will reduce the computing time in the later steps.

orthomcl_select_taxa.pl
orthomcl_table_count.pl
ortho_filter.pl

Select the OGs shared by the target species.

An example of taxon_list:

bfl
cte
lgi
nge
pau
lan

Create a subgroup that only contains target species IDs.

$ perl orthomcl_select_taxa.pl taxon_list groups.txt > groups.txt_2

Create a table with numbers of orthologs in each species.

$ perl orthomcl_table_count.pl groups.txt_2 taxon_list > table_count.txt

An example of table_count.txt:

OG_00001 70 7 11 0 0 0
OG_00002 0 4 2 439 25 52
OG_00003 287 8 5 4 22 20

Select the OGs where each species has at least one copy. Here is just an example, the exact awk command will need to be modified if you have different numbers of target species.

$ awk '{if($2+$3+$4+$5+$6+$7<=10 && $2>=1 && $3>=1 && $4>=1 && $5>=1 && $6>=1 && $7>=1)print$0}' table_count.txt > table_count_list

An example of table_count_list:

OG_00042 2 1 1 1 1 1
OG_00176 1 1 1 2 1 1
OG_00177 1 1 1 2 1 2

Get a list of core OGs (i.e. at least one copy shared by the target species).

$ awk '{print$1}' table_count_list > core_ogs_list

An example of core_ogs_list:

OG_00042
OG_00176
OG_00177

Get a table containing OGs and gene IDs.

$ perl ortho_filter.pl core_ogs_list groups.txt_2 > core_ogs.txt

An example of core_ogs.txt:

OG_00042: bfl|C3XPM1_BRAFL bfl|C3XPM2_BRAFL cte|R7UCY0_CAPTE lgi|V4AD33_LOTGI nge|12738_NOTGE pau|09069_PHOAU lan|04217_LINAN
OG_00176: bfl|C3ZA69_BRAFL cte|R7T5D5_CAPTE lgi|V4AH31_LOTGI nge|04804_NOTGE nge|14847_NOTGE pau|15154_PHOAU lan|17091_LINAN
OG_00177: bfl|C3ZZI7_BRAFL cte|R7UZC8_CAPTE lgi|V4B088_LOTGI nge|28969_NOTGE nge|34989_NOTGE pau|08235_PHOAU lan|20085_LINAN lan|24179_LINAN
4. Convert the original gene model IDs into the format that is consistent to the OrthoMCL output (if needed).

Perform a tblastn search to match the protein IDs (UniProt) to nucleotide IDs (from JGI or NCBI).

An example of tblastn.sh:

#!/bin/bash

# Prepare database
makeblastdb -in lgi_nucl.fasta -dbtype nucl

# tblastn search
tblastn \
-query lgi_prot.fasta \
-db lgi_nucl.fasta \
-out lgi_prot_to_nucl.outfmt6 \
-evalue 1e-5 \
-num_threads 12 \
-max_target_seqs 1 \
-outfmt 6

#--//

Run the Bash script.

$ nohup tblastn.sh &

Get the blast best-hit list.

$ cat lgi_prot_to_nucl.outfmt6 | sed 's/lgi|/lgi_/g' | sed 's/|/ /g' | awk '{print $5,$1}' | sed 's/lgi_/lgi|/g' | uniq > lgi_blast_list

An example of lgi_blast_list:

estExt_fgenesh2_pg.C_sca_1420009 lgi|ASRP_LOTGI
estExt_fgenesh2_pg.C_sca_2050005 lgi|CAH2_LOTGI
estExt_fgenesh2_pg.C_sca_1300004 lgi|CAH1_LOTGI
5. Obtain the locus information of each gene from the GFF files.

This step transforms the protein IDs into OG IDs and prepares a loci list for each target species.

Perl scripts for this step:

orthomcl_select_taxa.pl
rename_jgi_to_uniprot.pl

Prepare a table with gene loci information from the GFF file.

$ grep sca lgi.gff | awk '{print $1,$10}' | sed 's/"//g;s/;//g' | uniq > lgi_gene_loci_list

An example of lgi_gene_loci_list:

sca_1 estExt_fgenesh2_pg.C_sca_10001
sca_1 e_gw1.1.46.1
sca_1 estExt_fgenesh2_kg.C_sca_10001

Prepare a table for the OG and protein IDs.

$ perl orthomcl_select_taxa.pl taxon_list_lgi core_ogs.txt > core_ogs_lgi.txt

An example of taxon_list_lgi:

lgi

An example of core_ogs_lgi.txt:

OG_00042: lgi|V4AD33_LOTGI
OG_00176: lgi|V4AH31_LOTGI
OG_00177: lgi|V4B088_LOTGI

Prepare a table for the scaffold and protein IDs.

$ perl rename_jgi_to_uniprot.pl lgi_blast_list lgi_gene_loci_list | grep lgi > lgi_gene_loci_list_new_id_uniprot

An example of lgi_gene_loci_list_new_id_uniprot:

sca_1 lgi|V4B3T4_LOTGI
sca_1 lgi|V4ALJ9_LOTGI
sca_1 lgi|V4CRD4_LOTGI
6. Convert the protein IDs into OG IDs obtained from OrthoMCL.

Perl scripts for this step:

rename_to_og.pl

Prepare a table for the scaffold and OG IDs.

$ perl rename_to_og.pl core_ogs_lgi.txt lgi_gene_loci_list_new_id_uniprot > lgi_gene_loci_list_new_id_og
$ grep OG lgi_gene_loci_list_new_id_og > lgi_gene_loci_list_new_id_OG

An example of lgi_gene_loci_list_new_id_og:

sca_1 lgi|V4B3T4_LOTGI
sca_1 OG_04667
sca_1 OG_04711

An example of lgi_gene_loci_list_new_id_OG:

sca_1 OG_04667
sca_1 OG_04711
sca_1 OG_01262
7. Prepare a loci table with scaffold and OG IDs on the same line.

Perl scripts for this step:

loci_table.pl

Prepare a scaffold list with associated protein IDs.

$ perl loci_table.pl lgi_gene_loci_list_new_id_OG > lgi_loci_list

An example of lgi_loci_list:

>sca_1
OG_04667
OG_04711
OG_01262

Remove the scaffolds that only have one gene and transform the list into one-line format.

$ cat lgi_loci_list | perl -pe 's/\n/ /g;s/>/\n/g;' | grep OG | awk '{if(NF>2)print$0}' > lgi_loci_list_one_line

An example of lgi_loci_list_one_line:

sca_1 OG_04667 OG_04711 OG_01262 OG_01262 OG_03486 OG_03654 OG_03643 OG_03098 OG_03098 OG_02202 OG_05654 OG_11894 OG_09939 OG_07406 OG_09831 OG_06286 OG_08123 OG_03967 OG_06567 OG_03648 OG_08106 OG_09324 OG_04474 OG_04474 OG_09015 OG_02846 OG_06665 OG_03796 OG_05697 OG_04343 OG_12726 OG_02089 OG_04391 OG_01546 OG_05641 OG_05410 OG_03921 OG_00836 OG_06148 OG_02525 OG_03689 OG_02339 OG_03915 OG_04791 OG_02723 OG_07550 OG_03774 OG_05936 OG_04344 OG_04316 OG_07057 OG_09193 OG_12201 OG_08177 OG_05901 OG_03261 OG_02120 OG_03902 OG_02101 OG_05913 OG_00266 OG_00266 OG_00249 OG_07676 OG_04581 OG_03548 OG_09826 OG_04513 OG_10415 OG_03863 OG_08473 OG_02301 OG_04054 OG_06096 OG_01478 OG_02885 OG_08599 OG_08773 OG_03505 OG_06371 OG_01483 OG_02658 OG_05664 OG_06999 OG_03701 OG_05624 OG_04020 OG_06459 OG_02952 OG_03054 OG_03054 OG_05324 OG_05179 OG_05330 OG_08964 OG_05532 OG_04091 OG_03067 OG_02629 OG_01988 OG_03500 OG_05772 OG_04930 OG_02070 OG_00686 OG_03667 OG_02139 OG_02139 OG_01841 OG_06196 OG_02708 OG_03749 OG_03159 OG_04838 OG_05736 OG_07391 OG_02311 OG_04635 OG_08702 OG_03376 OG_00654 OG_00654 OG_00654 OG_06824 OG_02416 OG_02416 OG_07980 OG_06067 OG_06424 OG_08015 OG_04345 OG_02366 OG_10081 OG_09801 OG_01595 OG_06474 OG_04201 OG_01073 OG_04903 OG_09737 OG_08695 OG_04428 OG_03759 OG_03264 OG_04819 OG_09046 OG_02037 OG_03211 OG_12203 OG_08235 OG_06004 OG_01118 OG_01933 OG_03763 OG_04919 OG_02925 OG_06885 OG_09712 OG_05540 OG_01010 OG_01010 OG_02316 OG_04673 OG_04038 OG_07223 OG_05116 OG_04910 OG_05754 OG_08559 OG_03642 OG_04168 OG_10281 OG_06519 OG_04686 OG_06993 OG_04540
8. Identify microsynteny by comparing the loci tables from a pairwise comparison.

Perl scripts for this step:

micro_synteny.pl

Pick up conserved microsynteny (ms) clusters.

An example for a simple Bash nested loop:

for NAME1 in {bfl,cte,lgi}
do
for NAME2 in {nge,pau,lan}
do
perl micro_synteny.pl ${NAME2}_loci_list_one_line ${NAME1}_loci_list_one_line > ${NAME2}_${NAME1}_ms.list
done
done

Final output: lan_lgi_ms.list

lan | scaffold1 | lgi | sca_17 | 7 | OG_07042 OG_02732 OG_06465 OG_07417 OG_00607 OG_06064 OG_09174
lan | scaffold1 | lgi | sca_50 | 2 | OG_01628 OG_03094
lan | scaffold1 | lgi | sca_59 | 2 | OG_08211 OG_09071

IV. Downstream analyses

1. Convert the OG IDs to human UniProt IDs

Perl scripts for this step:

orthomcl_select_taxa.pl
rename_to_human.pl

Prepare a table with human ID and OG IDs.

$ perl orthomcl_select_taxa.pl taxon_list_hsa groups.txt | sed 's/://g' | awk '{if($2!=NULL)print$0}' | awk '{print$1,$2}' > core_ogs_hsa.txt

Annotate synteny results with human IDs.

$ perl rename_to_human.pl core_ogs_hsa.txt lan_lgi_ms.list > lan_lgi_ms.list_human_id
$ cat lan_lgi_ms.list_human_id | sed 's/hsa|//g;s/_HUMAN//g' > lan_lgi_ms.list_human_id_clean

An example of lan_lgi_ms.list_human_id_clean:

lan | scaffold1 | lgi | sca_17 | 7 | IMP2L LRRN1 RM02 LSM8 CDK16 S35B4 TXN4B
lan | scaffold1 | lgi | sca_50 | 2 | MK11 PORCN
lan | scaffold1 | lgi | sca_59 | 2 | HIF1N FADD
lan | scaffold1 | lgi | sca_82 | 2 | HRH1 GNN
lan | scaffold1 | lgi | sca_8 | 4 | PYRG1 ZCH24 MTP TM10A
2. Check the linkage distance

Perl scripts for this step:

loci_position.pl
intergenic_length.pl
rename_to_human_loci.pl

Prepare a reference table for scaffold IDs, OG IDs (or protein IDs), mapped nucleotide IDs, locus (start and end), and orientation (+ or -).

$ grep mRNA lgi.gff | awk '{print$1,$4,$5,$7}' > lgi_loci_position.list0
paste <(awk '{print$1,$2}' lgi_gene_loci_list_new_id_og) <(awk '{print$2}' lgi_gene_loci_list) <(awk '{print$2,$3,$4}' lgi_loci_position.list0) > lgi_gene_loci_list_ref

An example of lgi_gene_loci_list_ref:

sca_1 lgi|V4B3T4_LOTGI  estExt_fgenesh2_pg.C_sca_10001  833 15030 -
sca_1 OG_04667  e_gw1.1.46.1    17673 24309 +
sca_1 OG_04711  estExt_fgenesh2_kg.C_sca_10001  25629 69469 -

Prepare a table with synteny information using a Bash nested loop.

$ perl loci_position.pl lgi_comparison_ms.list lgi_gene_loci_list_ref | awk '{if($3==$5)print$0}' | awk '{print$1,$2,$3,$4,$7,$8,$9,$10,$11}' > lgi_loci_position.list_lan

An example of lgi_loci_position.list_lan:

lgi | sca_17 | OG_07042 | 1520462 1526579 -
lgi | sca_17 | OG_02732 | 1548932 1550941 +
lgi | sca_17 | OG_06465 | 1414918 1420578 +

Combine all pairwise comparisons. Take six species for example:

for NAME1 in {bfl,cte,lgi}
do
for NAME2 in {nge,pau,lan}
do
cat ${NAME1}_loci_position.list_${NAME2} >> ${NAME1}_loci_position.list
done
done

An example of lgi_loci_position.list:

lgi | sca_108 | OG_02458 | 374386 386198 +
lgi | sca_108 | OG_04212 | 366966 373848 -
lgi | sca_108 | OG_03471 | 703134 706305 -

Sort the table according to scaffold IDs and start position.

$ sort -k3,3 -k7,7n lgi_loci_position.list | uniq > lgi_loci_position.list.uniq
lgi | sca_1 | OG_04667 | 17673 24309 +
lgi | sca_1 | OG_04711 | 25629 69469 -
lgi | sca_1 | OG_01262 | 306387 313270 -

Calculate intergenic length among the syntenic genes.

$ perl intergenic_length.pl ${NAME}_loci_position.list.uniq | tr "\n" " " | tr ">" "\n" | sed -e '$a\' | grep OG > ${NAME}_loci_position.list_distance

An example of lan_loci_position.list_distance:

lgi | sca_100 | - OG_05032 | 37637 | - OG_05032 | 102 | + OG_05390 | 106376 | + OG_01875 | 8714 | - OG_06426 | 242 | + OG_03087 | 659 | - OG_05670 | 134 | + OG_02973 | 207511 | - OG_09917 | 18429 | + OG_09241 |
lgi | sca_101 | + OG_05131 | 27848 | + OG_06498 | 161328 | + OG_02221 | 16139 | - OG_07778 | 82150 | + OG_07171 |
lgi | sca_102 | + OG_05092 | 5983 | - OG_00590 | 160163 | + OG_11456 | 651 | + OG_11456 | 65573 | - OG_02559 | 33528 | + OG_03724 |

Have an annotation with human IDs.

$ perl rename_to_human_loci.pl og_makers_hsa lgi_loci_position.list_distance > lgi_loci_position.list_distance_human_id
$ cat lgi_loci_position.list_distance_human_id | sed 's/hsa|//g;s/_HUMAN//g' > lgi_loci_position.list_distance_human_id_clean
 lgi | sca_100 | - EDEM2 | 37637 | - EDEM2 | 102 | + RS10 | 106376 | + SYNJ1 | 8714 | - PINX1 | 242 | + PLK1 | 659 | - PIGV | 134 | + TADA3 | 207511 | - LRC72 | 18429 | + G6PE |
 lgi | sca_101 | + MD1L1 | 27848 | + TFAM | 161328 | + OG_02221 | 16139 | - PG12A | 82150 | + NJMU |
 lgi | sca_102 | + CCD77 | 5983 | - IQEC1 | 160163 | + OG_11456 | 651 | + OG_11456 | 65573 | - FHAD1 | 33528 | + PARK7 |

Further analysis can be done by parsing the following outputs.

lan_lgi_ms.list_human_id_clean
lgi_loci_position.list_distance_human_id_clean

There are many ways to parse these results. So I may not provide all the details here, since different analyses will need some specific modification of the commands and scripts. For example, one can check the tightly linked (<= 20 kb) microsynteny among target species. And the result can be like the following.

bfl | scaffold_79 | - PIGV | 870 | + TADA3
cte | scaffold_459 | - FA63A | 11025 | + SUH | 16919 | - PINX1 | 362 | + PLK1 | 4282 | - PIGV |
lgi | sca_100 | - PINX1 | 242 | + PLK1 | 659 | - PIGV | 134 | + TADA3 |
nge | scaffold579 | - PLK1 | 313 | + PINX1 |
pau | scaffold25 | - PINX1 | 302 | + PLK1 | 5852 | - PIGV | 885 | + TADA3 |
lan | scaffold139 | - TADA3 | 1162 | + PIGV | 9302 | - PLK1 | 305 | + PINX1 |

Last update: 2017.05.26