{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Step 2. Scan peptides from gut microbial proteomes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Autoimmune diseases occur when the immune system erroneously targets the body’s own tissues. While genetic predisposition provides a crucial foundation, mounting evidence highlights the role of environmental factors — particularly microbial exposure — in the initiation and progression of autoimmunity. Among the proposed mechanisms, **molecular mimicry** is one of the most well-characterized.\n",
    "\n",
    "In this process, microbial peptides exhibit sequence or structural similarity to self-peptides. As a result, T cells initially activated against the pathogen may cross-react with self-antigens, leading to unintended autoimmune responses.\n",
    "\n",
    "**Example**: Certain gut bacterial peptides closely resemble self-peptides presented by HLA-B*27, a major genetic risk allele in ankylosing spondylitis (AS). T cells primed by these microbial peptides may subsequently recognize and attack host tissues, triggering chronic inflammation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{image} ../_static/gut_bacterial.png\n",
    ":alt: 模型结果图\n",
    ":width: 800px\n",
    ":align: center"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Scan all possible 9-mer peptides from microbial proteomes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We collected 16 bacterial strains that are known to be associated with AS. The proteomes of these strains were downloaded from the NCBI database. We will scan all possible 9-mer peptides from these proteomes and check if they match any of the self-peptides presented by HLA-B*27.\n",
    "\n",
    "All 16 bacterial proteomes can be downloaded [Here](https://drive.google.com/drive/folders/18VGxJh_6d-OJAexfdKDrd5KaSr450OTA?usp=drive_link)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total proteins: 5562\n",
      "Total peptides: 1658189\n"
     ]
    }
   ],
   "source": [
    "from Bio import SeqIO\n",
    "import gzip\n",
    "import pandas as pd\n",
    "\n",
    "name = 'RJX1596' # for example, change to the desired protein database name\n",
    "file_path = 'Data/{}.faa.gz'.format(name)\n",
    "\n",
    "all_seqs = []\n",
    "protein_seqs = []\n",
    "peptide_dict = {}\n",
    "with gzip.open(file_path, \"rt\") as handle:\n",
    "    for record in SeqIO.parse(handle, \"fasta\"):\n",
    "        protein_id = record.description\n",
    "        sequence = str(record.seq)\n",
    "        all_seqs.append(sequence)\n",
    "        protein_seqs.append(protein_id)\n",
    "print(f\"Total proteins: {len(all_seqs)}\")\n",
    "\n",
    "def scan_strings(input_list, protein_seqs, length=9):\n",
    "    all_peptides = []\n",
    "    for item, protein in zip(input_list, protein_seqs):\n",
    "        for i in range(0, len(item) - length+1, 1):\n",
    "            new_str = item[i:i+length]\n",
    "            peptide_seq = new_str\n",
    "            all_peptides.append(protein)\n",
    "            if peptide_seq not in peptide_dict:\n",
    "                peptide_dict[peptide_seq] = [protein]\n",
    "            else:\n",
    "                peptide_dict[peptide_seq].append(protein)\n",
    "\n",
    "scan_strings(all_seqs, protein_seqs, length=9)\n",
    "print(f\"Total peptides: {len(peptide_dict)}\")"
   ]
  },
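  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick check of the scanning logic, a protein of length n yields n - 9 + 1 overlapping 9-mers, and a protein shorter than 9 residues yields none. The sketch below is illustrative only: `kmers` is a hypothetical helper and `toy_protein` is a made-up sequence, not taken from any of the proteomes.\n",
    "\n",
    "```python\n",
    "def kmers(sequence, k=9):\n",
    "    # All overlapping windows of length k, in order of start position\n",
    "    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]\n",
    "\n",
    "toy_protein = 'MKTAYIAKQRQ'  # made-up 11-residue sequence\n",
    "print(kmers(toy_protein))\n",
    "print(kmers('MKTAY'))\n",
    "```\n",
    "\n",
    "The 11-residue toy sequence gives three windows, which matches 11 - 9 + 1 = 3."
   ]
  },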
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Save 9mers to a .pep file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "peptide_df = pd.DataFrame(peptide_dict.keys(), columns=['Peptide'])\n",
    "peptide_df.to_csv('{}.pep'.format(name), index=False, header=False)"
   ]
  },
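  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before passing the file to NetMHCpan, it can help to confirm that every line holds a single 9-residue peptide. The sketch below is illustrative only: `is_valid_peptide` and `write_pep` are hypothetical helpers, and restricting peptides to the 20 standard amino acids is an assumption made here, not a documented NetMHCpan requirement.\n",
    "\n",
    "```python\n",
    "STANDARD_AA = set('ACDEFGHIKLMNPQRSTVWY')\n",
    "\n",
    "def is_valid_peptide(peptide, length=9):\n",
    "    # True when the peptide has the expected length and only standard residues\n",
    "    return len(peptide) == length and set(peptide) <= STANDARD_AA\n",
    "\n",
    "def write_pep(peptides, path, length=9):\n",
    "    # Write one peptide per line, skipping anything that fails the check\n",
    "    kept = [p for p in peptides if is_valid_peptide(p, length)]\n",
    "    with open(path, 'w') as handle:\n",
    "        for peptide in kept:\n",
    "            print(peptide, file=handle)\n",
    "    return len(kept)\n",
    "\n",
    "toy = ['MKTAYIAKQ', 'MKTAYIAKX', 'MKTAY']  # made-up peptides\n",
    "print([p for p in toy if is_valid_peptide(p)])\n",
    "```\n",
    "\n",
    "Only the first toy peptide passes: the second contains the ambiguity code `X` and the third is too short. With the real data, `write_pep(peptide_dict.keys(), '{}.pep'.format(name))` would be one way to apply the same check."
   ]
  },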
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## NetMHCpan4.1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We use [NetMHCpan4.1](https://services.healthtech.dtu.dk/service.php?NetMHCpan-4.1) to predict the binding affinity of the peptides to HLA-B*27. NetMHCpan is a widely used tool for predicting peptide-MHC binding, and it has been shown to be effective for a variety of MHC alleles, including HLA-B*27.\n",
    "\n",
    "Download the Linux Version 4.1b [Here](https://services.healthtech.dtu.dk/services/NetMHCpan-4.1/)\n",
    "\n",
    "Follow the instructions in the netMHCpan-4.1.readme file to install NetMHCpan4.1.\n",
    "\n",
    "**Run NetMHCpan to predict HLA affinity**\n",
    "\n",
    "In the 'netMHCpan-4.1/test' directory test the software:\n",
    "\n",
    "**Predict HLA-27:05 affinity** by running the following command:\n",
    "\n",
    "```bash\n",
    "../netMHCpan -p RJX1596.pep -BA -xls -a HLA-B2705 -xlsfile RJX1596.xls\n",
    "```\n",
    "NetMHCpan-4.1 will output a file named `RJX1596.xls` containing the predicted binding affinities of the peptides to HLA-B*27:05."
   ]
  },
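  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To repeat the prediction for several strains, the command above can be assembled in Python. This is a minimal sketch rather than part of the original workflow: `build_command` is a hypothetical helper, the strain list holds only the example name used in this tutorial, and the flags are exactly those in the command above. The `subprocess.run` call is left commented out because it needs a working NetMHCpan installation.\n",
    "\n",
    "```python\n",
    "import subprocess\n",
    "\n",
    "def build_command(strain, allele='HLA-B2705', executable='../netMHCpan'):\n",
    "    # Mirror the command line shown above for one strain\n",
    "    return [executable, '-p', strain + '.pep', '-BA', '-xls',\n",
    "            '-a', allele, '-xlsfile', strain + '.xls']\n",
    "\n",
    "strains = ['RJX1596']  # extend with the other strain names\n",
    "for strain in strains:\n",
    "    command = build_command(strain)\n",
    "    print(' '.join(command))\n",
    "    # Uncomment to run NetMHCpan from the netMHCpan-4.1/test directory:\n",
    "    # subprocess.run(command, check=True)\n",
    "```\n",
    "\n",
    "Passing the arguments as a list avoids shell quoting issues, and `check=True` would raise an error if NetMHCpan exits with a non-zero status."
   ]
  },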
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Select peptides with EL_Rank<5 and BA_Rank<5 (Ranking top 5% of the peptides)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Download the output files for 16 bacterial strains [Here](https://drive.google.com/drive/folders/1WjUtQSiI8V5mFa7ZIpy1JFAUDX61SwFE?usp=drive_link)"
   ]
  },
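  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The selection rule can be tried on a small made-up table before applying it to the real output. The values below are invented for illustration and are not NetMHCpan predictions; only the column names `Peptide`, `EL_Rank`, `BA_Rank`, and `NB` follow the output file, and `select_binders` is a hypothetical helper.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "toy = pd.DataFrame({\n",
    "    'Peptide': ['AAAAAAAAA', 'CCCCCCCCC', 'DDDDDDDDD'],\n",
    "    'EL_Rank': [0.5, 3.0, 7.0],\n",
    "    'BA_Rank': [1.2, 6.0, 2.0],\n",
    "    'NB': [1, 0, 0],\n",
    "})\n",
    "\n",
    "def select_binders(table, max_rank=5):\n",
    "    # Keep rows flagged as binders whose EL and BA ranks are both below max_rank\n",
    "    mask = (table['NB'] == 1) & (table['EL_Rank'] < max_rank) & (table['BA_Rank'] < max_rank)\n",
    "    return table[mask].reset_index(drop=True)\n",
    "\n",
    "print(select_binders(toy)['Peptide'].tolist())\n",
    "```\n",
    "\n",
    "Only the first toy row satisfies all three conditions, so it is the only one kept."
   ]
  },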
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total peptides with high affinity with HLA-27:05: 49123\n",
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 49123 entries, 0 to 49122\n",
      "Data columns (total 12 columns):\n",
      " #   Column      Non-Null Count  Dtype  \n",
      "---  ------      --------------  -----  \n",
      " 0   Pos         49123 non-null  int64  \n",
      " 1   Peptide     49123 non-null  object \n",
      " 2   ID          49123 non-null  object \n",
      " 3   core        49123 non-null  object \n",
      " 4   icore       49123 non-null  object \n",
      " 5   EL-score    49123 non-null  float64\n",
      " 6   EL_Rank     49123 non-null  float64\n",
      " 7   BA-score    49123 non-null  float64\n",
      " 8   BA_Rank     49123 non-null  float64\n",
      " 9   Ave         49123 non-null  float64\n",
      " 10  NB          49123 non-null  int64  \n",
      " 11  Protein_ID  49123 non-null  object \n",
      "dtypes: float64(5), int64(2), object(5)\n",
      "memory usage: 4.5+ MB\n",
      "None\n"
     ]
    }
   ],
   "source": [
    "df = pd.read_csv('{}.xls'.format(name), sep='\\t', header=1)\n",
    "df = df[df['NB']==1]\n",
    "df = df[df['EL_Rank']<5]\n",
    "df = df[df['BA_Rank']<5]\n",
    "df.reset_index(drop=True, inplace=True)\n",
    "all_peptides = df['Peptide'].values.tolist()\n",
    "print(f\"Total peptides with high affinity with HLA-27:05: {len(all_peptides)}\")\n",
    "df['Protein_ID'] = df['Peptide'].apply(lambda x: peptide_dict[x] if x in peptide_dict else x)\n",
    "print(df.info())"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "general",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.21"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}