Dataset Open Access

A Dataset for Virus Infection Reporter Virtual Staining in Fluorescence and Brightfield Microscopy

Wyrzykowska, Maria; della Maggiora, Gabriel; Deshpande, Nikita; Mokarian, Ashkan; Yakimovich, Artur


Citation Style Language JSON Export

{
  "id": "3130", 
  "DOI": "10.14278/rodare.3130", 
  "abstract": "<p><strong>Data sources</strong></p>\n\n<p>The raw data used in the study can be found in the corresponding references.</p>\n\n<ul>\n\t<li>VACV: Yakimovich A, Andriasyan V, Witte R, Wang IH, Prasad V, Suomalainen M, Greber UF. Plaque2.0-A High-Throughput Analysis Framework to Score Virus-Cell Transmission and Clonal Cell Expansion. PLoS One. 2015 Sep 28;10(9):e0138760. doi: 10.1371/journal.pone.0138760. PMID: 26413745; PMCID: PMC4587671.</li>\n\t<li>HAdV: Andriasyan V, Yakimovich A, Petkidis A, Georgi F, Witte R, Puntener D, Greber UF. Microscopy deep learning predicts virus infections and reveals the mechanics of lytic-infected cells. iScience. 2021 May 15;24(6):102543. doi: 10.1016/j.isci.2021.102543. PMID: 34151222; PMCID: PMC8192562.</li>\n\t<li>HSV, IAV, RV: Olszewski, D., Georgi, F., Murer, L. et al. High-content, arrayed compound screens with rhinovirus, influenza A virus and herpes simplex virus infections. Sci Data 9, 610 (2022). https://doi.org/10.1038/s41597-022-01733-4</li>\n</ul>\n\n<p><strong>Data organisation</strong></p>\n\n<p>For each virus (HAdV, VACV, IAV, RV and HSV) we provide the processed data in a separate directory, divided into three subdirectories: `train`, `val` and `test`, containing the proposed data split. Each subdirectory contains two npy files: `x.npy` and `y.npy`, where `x.npy` contains the fluorescence or brightfield signal (both for HAdV, as separate channels) of the cells or nuclei and `y.npy` contains the viral signal. The data is already processed as described in the <em>Data preparation</em> section.</p>\n\n<p>Additionally, Cellpose masks are made available for the test data in a separate masks directory. For each virus except VACV, there is a subdirectory `test` containing nuclei masks (`nuc.npy`). For HAdV, cell masks are also available (`cell.npy`).</p>\n\n<p><strong>Data preparation</strong></p>\n\n<p>Each VACV plaque was imaged as 9 files per channel, which need to be stitched to recreate the whole plaque. The multiview-stitcher toolbox was used for this. The stitching was first performed on the third channel, representing the brightfield microscopy image of the samples. Then, the parameters found for this channel were used to stitch the rest of the channels. The VACV dataset is a time-lapse, from which timesteps 100, 108 and 115 were selected to produce the data used in the experiments. Images were center-cropped to 5948x6048 to match the size of the smallest image in the dataset (rounded down to the closest multiple of 2). The data was additionally manually filtered to remove the samples that contained only uninfected cells (C02, C07, D02, D07, E02, E07, F02, F07). The HAdV dataset is also a time-lapse, from which only the last timestep (49th) was selected.</p>\n\n<p>For the rest of the datasets (HSV, IAV, RV) only the negative control data was used, which was selected in the following way: from the data collected at the University of Z&uuml;rich, from the Screen samples only the first 2 columns were selected and from the ZPlates and prePlates samples only the first 12 columns. All of the datasets were divided into training, validation and test holdouts in 0.7:0.2:0.1 ratios, using random seed 42 to ensure reproducibility. For the time-lapse data, it was ensured that the same sample from different timesteps only exists in one of the holdouts, to prevent information leakage and ensure fair evaluation. All of the samples were normalised to the [-1, 1] range by subtracting the 3rd percentile, dividing by the difference between the 99.8th and 3rd percentiles, clipping to [0, 1] and rescaling to [-1, 1]. For the brightfield channel of HAdV, percentiles 0.1 and 99.9 were used. These cutoff points were selected based on the analysis of the histograms of the values attained by the data, to make the best use of the available data range. Specific values used for the normalisation are summarised in Figure 3 of the manuscript in <em>Related/alternate identifiers</em>.</p>\n\n<p>To prepare the cell nuclei masks, the Cellpose model with pre-trained cyto3 weights was applied to the fluorescence channel. The diameter was set to 7 for all the datasets except for HAdV, for which the automatic estimation of the diameter was employed. Cell masks were prepared using Cellpose with pre-trained cyto3 weights and a diameter set to 70 on brightfield images stacked with the fluorescence nuclei signal. The data preparation can be reproduced by first downloading the datasets and then running the scripts located in the `scripts/data_processing` directory of the [VIRVS repository](https://github.com/casus/virvs), after adjusting the paths in them:</p>\n\n<ul>\n\t<li>for HAdV data: `preprocess_hadv.py`</li>\n\t<li>for VACV data: `stitch_vacv.py` + `preprocess_vacv.py`</li>\n\t<li>for the rest of the viruses: `preprocess_other.py`</li>\n\t<li>to prepare Cellpose predictions: `prepare_cellpose_preds.py` (for cells) and `prepare_cellpose_preds_nuc.py` (for nuclei)</li>\n</ul>", 
  "publisher": "Rodare", 
  "type": "dataset", 
  "language": "eng", 
  "version": "Version 1", 
  "author": [
    {
      "family": "Wyrzykowska", 
      "given": "Maria"
    }, 
    {
      "family": "della Maggiora", 
      "given": "Gabriel"
    }, 
    {
      "family": "Deshpande", 
      "given": "Nikita"
    }, 
    {
      "family": "Mokarian", 
      "given": "Ashkan"
    }, 
    {
      "family": "Yakimovich", 
      "given": "Artur"
    }
  ], 
  "title": "A Dataset for Virus Infection Reporter Virtual Staining in Fluorescence and Brightfield Microscopy", 
  "issued": {
    "date-parts": [
      [
        2024, 
        8, 
        30
      ]
    ]
  }
}
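
The normalisation and the leakage-free split described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the VIRVS repository: the function names `normalise` and `grouped_split` are hypothetical, percentiles are computed per image here (the record does not say whether they are computed per image or per dataset), and the shuffling uses NumPy's default generator with seed 42 rather than whatever routine the original scripts use.

```python
import numpy as np


def normalise(img, p_low=3.0, p_high=99.8):
    """Percentile-normalise an image to the [-1, 1] range.

    Defaults follow the percentiles quoted in the abstract; for the
    HAdV brightfield channel it quotes 0.1 and 99.9 instead.
    """
    img = np.asarray(img, dtype=np.float32)
    lo, hi = np.percentile(img, [p_low, p_high])
    # Assumption: guard against constant images (not covered by the record).
    scale = max(float(hi - lo), 1e-8)
    scaled = np.clip((img - lo) / scale, 0.0, 1.0)
    return scaled * 2.0 - 1.0


def grouped_split(sample_ids, ratios=(0.7, 0.2, 0.1), seed=42):
    """Assign each unique sample ID to exactly one holdout.

    Splitting on IDs rather than on individual frames keeps all
    timesteps of one sample in the same holdout.
    """
    unique_ids = np.array(sorted(set(sample_ids)))
    rng = np.random.default_rng(seed)
    rng.shuffle(unique_ids)
    n = len(unique_ids)
    n_train = int(round(ratios[0] * n))
    n_val = int(round(ratios[1] * n))
    return {
        "train": set(unique_ids[:n_train].tolist()),
        "val": set(unique_ids[n_train:n_train + n_val].tolist()),
        "test": set(unique_ids[n_train + n_val:].tolist()),
    }
```

A frame would then go to whichever holdout contains its sample ID, and the published arrays could be read with `np.load` on the `x.npy` and `y.npy` files of a split directory.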
