Chapter 1: Workstations, files, and storage

Unit 1 – Computer Science Foundations for Clinicians

Essential knowledge from the literature

This document is meant to sit alongside the Unit 1 Quarto chapters.
For each major subject, it gives:

  • A short, clinically oriented summary of the knowledge clinicians actually need.
  • One key reference (journal article, textbook chapter, standard, or authoritative guideline) that a motivated reader can go to for deeper detail.

0. Course overview – Why CS foundations matter for clinicians

0.1 Clinical computing foundations for pathology and laboratory medicine

What you need to know

  • Modern pathology runs on computers just as surely as it runs on glass, wax, and reagents. Ordering, accessioning, slide tracking, whole-slide imaging, reporting, and archiving all depend on hardware, networks, operating systems, and software behaving predictably together.
  • A pathologist does not need to become a software engineer, but needs enough computer science vocabulary to (a) recognise when a problem is about storage, networking, security, or workflow, and (b) hold productive conversations with IT, vendors, and data scientists.
  • Understanding basic ideas like file formats, storage tiers, naming and versioning, authentication, and network latency lets clinicians spot unsafe workarounds, ask for realistic solutions, and evaluate whether a proposed system change will make life better or worse for patients and staff.
  • Foundational reading in pathology informatics shows that digital pathology projects often fail not because the algorithms are wrong, but because requirements, infrastructure, and change management were not understood early enough. A CS “101-level” mental model greatly reduces that risk.

Key reference

  1. Sinard JH. Practical Pathology Informatics: Demystifying Informatics for the Pathologist. 2nd ed. Springer; 2014.

Chapter 1 – The workstation body plan

1.1 The workstation as a system: CPU, memory, storage, GPU, and I/O working together

What you need to know

  • A workstation is a system of components that move data through a pipeline: storage → memory (RAM) → CPU/GPU → display or network. A bottleneck at any stage is felt everywhere else (see the sketch after this list).
  • CPU-bound tasks (e.g., single-threaded slide database queries, basic scripting) depend mainly on per-core performance; throughput-bound tasks (e.g., parallel tile decoding, batch statistics) benefit from many cores.
  • RAM holds the “working set” of data. If RAM is too small, the system constantly swaps to disk, making even a powerful CPU feel painfully slow.
  • Storage dictates how quickly big files can be fetched and written (seek time, throughput, IOPS). For whole-slide imaging, SSDs (particularly NVMe) markedly improve responsiveness compared with spinning disks.
  • GPUs shine when many similar numerical operations need to be done in parallel on large arrays – exactly the pattern of modern imaging and deep learning workloads.
  • Good system design means matching components to expected workloads instead of over-investing in the wrong place (e.g., huge CPU with tiny RAM, or high-end GPU feeding off a slow spinning disk).
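
A minimal sketch of the pipeline idea in Python, using assumed (illustrative, not measured) throughput figures: whichever stage is slowest bounds what the whole workstation can deliver. The stage names, numbers, and slide size are all assumptions to be replaced with local measurements.

    # Back-of-the-envelope pipeline model: the slowest stage sets the pace.
    # All figures are illustrative assumptions, not benchmarks.
    stage_throughput_mb_s = {
        "HDD read": 150,                   # spinning disk, sequential
        "NVMe SSD read": 3000,             # fast local SSD
        "RAM copy": 20000,
        "CPU decode (JPEG tiles)": 400,
        "1 Gbps network": 110,             # ~125 MB/s theoretical, less in practice
    }

    # List the stages a given workflow actually passes through.
    workflow = ["NVMe SSD read", "CPU decode (JPEG tiles)"]
    bottleneck = min(workflow, key=lambda s: stage_throughput_mb_s[s])
    rate = stage_throughput_mb_s[bottleneck]

    slide_mb = 1500                        # a ~1.5 GB slide
    print(f"Bottleneck: {bottleneck} at ~{rate} MB/s")
    print(f"Rough time to pull one slide through: {slide_mb / rate:.0f} s")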

Key reference

  1. Hennessy JL, Patterson DA. Computer Architecture: A Quantitative Approach. 6th ed. Morgan Kaufmann; 2017.

1.2 CPU, cores, and threads – what they really mean for clinical work

What you need to know

  • A CPU core is like a very fast “reader” that can follow one instruction stream at a time; more cores let the machine work on more tasks simultaneously (e.g., loading several slides while the OS and LIS remain responsive).
  • Clock speed (GHz) describes how many low-level steps per second a core can perform, but real performance also depends heavily on cache size, pipeline depth, and branch prediction – which is why two CPUs at “3.0 GHz” can behave very differently.
  • Multi-threaded software can divide its work across cores (e.g., decoding or fetching tiles in parallel, as in the sketch after this list), while single-threaded software is stuck mostly on one core regardless of how many the workstation has.
  • For most pathology desktops, a modest number of strong general-purpose cores (e.g., 6–12 modern cores) with good single-thread performance is more useful than many weak cores.
  • Understanding that “more cores” does not automatically fix badly written or single-threaded software helps clinicians interpret vendor claims and decide whether a CPU upgrade is worth the cost.
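
A minimal sketch, in Python, of serial versus concurrent work: a hypothetical fetch_tile() stands in for reading one tile of a slide, and a thread pool issues many fetches at once. This illustrates the general idea only – it is not any vendor's viewer code, and the tile count and simulated latency are assumptions.

    from concurrent.futures import ThreadPoolExecutor
    import time

    def fetch_tile(tile_id: int) -> bytes:
        """Hypothetical stand-in for reading one tile from disk or the network."""
        time.sleep(0.05)              # simulate per-tile I/O latency
        return b"\x00" * 256 * 256    # pretend tile payload

    tile_ids = list(range(64))

    # Serial: each tile waits for the previous one, so latency adds up.
    t0 = time.perf_counter()
    tiles = [fetch_tile(t) for t in tile_ids]
    print(f"Serial:     {time.perf_counter() - t0:.1f} s")

    # Concurrent: I/O-bound fetches overlap, so wall-clock time drops sharply.
    # (Pure-Python CPU-bound work would need processes rather than threads,
    # because of the global interpreter lock.)
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=8) as pool:
        tiles = list(pool.map(fetch_tile, tile_ids))
    print(f"Concurrent: {time.perf_counter() - t0:.1f} s")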

Key reference

  1. Hennessy JL, Patterson DA. Computer Architecture: A Quantitative Approach. 6th ed. Morgan Kaufmann; 2017. Chapter 1 – Fundamentals of Quantitative Design and Analysis.

1.3 RAM and storage – keeping whole-slide images fed without constant swapping

What you need to know

  • RAM (main memory) is where active programs and their data live while they are running. It is many orders of magnitude faster than disk, but much smaller and volatile (wiped when power is lost).
  • When RAM is full, the operating system pushes older or less-used data to disk (paging or swapping). For WSI viewing, hitting this limit causes dramatic stuttering when panning or zooming slides; the sketch after this list shows how quickly uncompressed pixels consume memory.
  • Solid-state drives (SSDs), especially NVMe drives, provide much lower latency and higher IOPS than spinning hard drives, which is critical when reading lots of small tiles scattered across a slide file.
  • For a typical digital pathology workstation, it is often more impactful to ensure “enough” RAM (e.g., 32–64 GB depending on workloads) plus a fast SSD than to buy an extreme CPU.
  • Good data management practices – keeping local storage uncluttered, archiving old cases to slower tiers, and separating system, scratch, and archive storage – make hardware resources go further and reduce the risk of silent data loss.
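
A worked estimate, in Python, of why RAM fills so quickly once compressed tiles are expanded into raw pixels. The region sizes are illustrative assumptions; the arithmetic (width × height × 3 bytes for 8-bit RGB) is the point.

    # RAM needed to hold an uncompressed 8-bit RGB region (3 bytes per pixel).
    def region_ram_gb(width_px: int, height_px: int, bytes_per_pixel: int = 3) -> float:
        return width_px * height_px * bytes_per_pixel / 1e9

    # A modest 20,000 x 20,000 px region of a 40x slide:
    print(f"{region_ram_gb(20_000, 20_000):.1f} GB")    # ~1.2 GB

    # The full base plane of a large slide (say 100,000 x 80,000 px):
    print(f"{region_ram_gb(100_000, 80_000):.1f} GB")   # ~24 GB, more than many workstations have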

Key reference

  1. Hart EM, Barmby P, LeBauer D, et al. Ten Simple Rules for Digital Data Storage. PLoS Comput Biol. 2016;12(10):e1005097. doi:10.1371/journal.pcbi.1005097.

1.4 GPUs – why pathologists keep hearing about them even if they never code

What you need to know

  • GPUs are built for massively parallel computation: thousands of small cores execute the same operation on large arrays of pixels or numbers, ideal for image processing and deep neural networks.
  • For classic WSI viewing, a GPU mainly accelerates rendering and smooth zooming/panning, especially on high-resolution or multi-monitor setups; for AI, it is often the single most important performance component.
  • GPU performance depends on VRAM size (how large a model or batch of images can fit – see the rough budget after this list), memory bandwidth, and compute capability; simply knowing the consumer “gaming” model name is not enough for clinical planning.
  • In medical imaging, many reconstruction, registration, and segmentation tasks have been accelerated tens to hundreds of times by GPUs, enabling workflows that would be impractical on CPUs alone.
  • For most pathology departments, it is reasonable to begin with one or two appropriately sized GPUs in shared workstations or servers rather than equipping every desktop with high-end chips.
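
A rough sketch, in Python, of the VRAM question: how much memory do the model weights and one batch of input patches alone require? The parameter count, patch size, and batch sizes are illustrative assumptions, and real frameworks add activations, gradients, and overhead on top that only profiling reveals.

    # Crude VRAM budgeting: model weights plus one batch of float32 input patches.
    # All numbers are illustrative assumptions, not measurements of a real model.
    model_params = 25_000_000
    model_gb = model_params * 4 / 1e9                    # float32 weights: ~0.1 GB
    patch_px, channels = 512, 3
    patch_gb = patch_px * patch_px * channels * 4 / 1e9  # one float32 input patch

    for batch in (32, 256, 2048):
        inputs_gb = batch * patch_gb
        print(f"batch {batch:>4}: weights {model_gb:.1f} GB + inputs {inputs_gb:.1f} GB "
              f"(activations, gradients and framework overhead come on top)")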

Key reference

  1. Nickolls J, Dally WJ. The GPU Computing Era. IEEE Micro. 2010;30(2):56–69. doi:10.1109/MM.2010.41.

1.5 Displays and colour – when a monitor is “good enough” for pathology work

What you need to know

  • For diagnostic work, monitors must have adequate resolution, luminance, contrast, and colour accuracy so that subtle differences in nuclear detail, chromatin, and staining intensity can be appreciated and reproduced.
  • Colour management (calibration and profiling) reduces variation introduced by different scanners, displays, and viewing conditions, which is increasingly important for both human sign-out and AI tools that assume stable colour.
  • Histology slides often push the dynamic range and colour gamut of displays; medical-grade displays calibrated to standards (e.g., DICOM GSDF for luminance, appropriate colour targets) provide more consistency than uncalibrated office monitors.
  • Even with perfect hardware, ambient lighting, glare, and workstation ergonomics (distance, angle, posture) can make a large difference to comfort and fatigue over long sign-out sessions.
  • Departments should define and periodically audit minimum display standards for primary diagnosis, secondary review, tumour boards, and remote reporting, making it clear which use cases are allowed on which monitors.

Key reference

  1. Clarke EL, Treanor D, Bury D, Rittscher J, Snead DRJ. Colour in digital pathology: a review. Histopathology. 2017;70(2):153–163. doi:10.1111/his.13079.

1.6 Ergonomics – protecting eyes, neck, and shoulders in a multi-monitor world

What you need to know

  • Poor workstation ergonomics (monitor height, viewing distance, chair and desk setup, keyboard/mouse position) is linked to neck, back, and upper limb pain, headaches, and fatigue in radiologists and pathologists who spend long hours at displays.
  • Key principles include: aligning monitors directly in front of the user at appropriate height and distance, supporting neutral posture, minimising glare, and taking regular micro-breaks to move and refocus eyes.
  • The move from microscopes to multi-monitor workstations changes the load on the visual and musculoskeletal systems; ergonomic guidelines developed for radiology viewing stations are highly relevant to digital pathology.
  • Investing in appropriate chairs, desks, monitor arms, and lighting is not a luxury – it is part of maintaining diagnostic performance and reducing occupational injury risk.
  • Training residents and fellows in ergonomic best practices from the start of their digital pathology experience can prevent bad habits that are hard to undo later.

Key reference

  1. Goyal N, Goyal R, Vuylsteke A. Ergonomics in radiology. Clin Radiol. 2009;64(2):119–126.

Chapter 2 – Files, sizes, and why whole-slide images feel so big

2.1 Files and formats – containers, not just “pictures”

What you need to know

  • A file format is a convention for how information is laid out on disk: the header, metadata, and how the main data (pixels, labels, annotations) are stored. It is more like a report template than a single “image.”
  • For WSI, formats such as SVS, NDPI, MRXS, and DICOM all bundle a pyramid of resolutions, tiling schemes, and metadata in different ways, which affects interoperability, performance, and long-term preservation.
  • Open, well-documented formats are safer for research and archiving than opaque proprietary ones, because they reduce the risk of being unable to read old data when a vendor or product disappears.
  • When planning a system, you should ask not only “what viewer does this use?” but also “what is actually inside the files, and who else can read them?” The sketch after this list shows one way to look inside a slide file directly.
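
A minimal sketch of looking inside a slide container directly, assuming the openslide-python library is installed; "slide.svs" is a hypothetical local file path.

    # Peek inside a WSI container with openslide-python (assumed installed).
    import openslide

    path = "slide.svs"                                    # hypothetical file
    print("Detected format:", openslide.OpenSlide.detect_format(path))

    slide = openslide.OpenSlide(path)
    print("Base dimensions (px):", slide.dimensions)
    print("Pyramid levels:", slide.level_count, slide.level_dimensions)
    print("Scanner vendor:", slide.properties.get("openslide.vendor"))
    print("Metadata keys stored in the file:", len(slide.properties))
    slide.close()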

Key reference

  1. Hart EM, Barmby P, LeBauer D, et al. Ten Simple Rules for Digital Data Storage. PLoS Comput Biol. 2016;12(10):e1005097. doi:10.1371/journal.pcbi.1005097.

2.2 Whole-slide formats and DICOM – why standards matter

What you need to know

  • Traditional vendor-specific WSI formats (e.g., SVS, NDPI) were created primarily to support that vendor’s scanner and viewer, with varying levels of documentation and interoperability.
  • The DICOM standard includes a dedicated model for whole-slide images, defining how to represent tiled pyramids, multi-focus layers, z-stacks, labels, and macro images in a vendor-neutral way.
  • Moving towards DICOM WSI can simplify integration with PACS/VNA systems, standardise metadata, and reduce lock-in, but requires careful implementation and testing across scanners, viewers, and archives.
  • Understanding conceptually how a DICOM WSI object is structured (series, instances, tiles, coordinate systems) helps pathologists interpret error messages, configuration options, and storage requirements when vendors talk about “DICOM support” (see the sketch after this list).
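
A minimal sketch of inspecting one instance of a DICOM WSI series, assuming the pydicom library is installed; "wsi_instance.dcm" is a hypothetical file, and the attributes are read defensively because not every instance carries all of them.

    # Inspect one DICOM whole-slide image instance with pydicom (assumed installed).
    import pydicom

    ds = pydicom.dcmread("wsi_instance.dcm", stop_before_pixels=True)

    print("Modality:", ds.Modality)                       # "SM" for slide microscopy
    print("Series UID:", ds.SeriesInstanceUID)
    print("Tile size (px):", ds.Columns, "x", ds.Rows)    # one frame = one tile
    print("Frames in this instance:", ds.get("NumberOfFrames", 1))
    # The total pixel matrix describes the full plane this resolution level covers:
    print("Total pixel matrix:",
          ds.get("TotalPixelMatrixColumns"), "x", ds.get("TotalPixelMatrixRows"))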

Key reference

  1. Herrmann MD, Clunie DA, Fedorov A, et al. Implementing the DICOM standard for digital pathology. J Pathol Inform. 2018;9:37. doi:10.4103/jpi.jpi_8_18.

2.3 Bits, bytes, and prefixes – building safe intuition for “how big is big?”

What you need to know

  • At the lowest level, all digital pathology data is bits (0s and 1s). Eight bits make one byte; file sizes are then expressed in kilobytes (KB), megabytes (MB), gigabytes (GB), and terabytes (TB).
  • For capacity planning, it matters whether you are using decimal prefixes (1 GB = 10^9 bytes) as storage vendors do, or binary prefixes (1 GiB ≈ 1.074 × 10^9 bytes) used by many operating systems. The difference becomes visible at larger scales.
  • A single 40× WSI is often hundreds of megabytes to multiple gigabytes even after compression; multiplying by slides per case, cases per day, and retention time shows why multi-terabyte to petabyte planning is required (a worked example follows this list).
  • Having a feel for “order of magnitude” (e.g., 10, 100, 1,000 GB) helps clinicians sanity-check vendor proposals, spot unrealistic architectures, and communicate requirements clearly to IT and procurement.
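
A short worked example covering both points above: the decimal-versus-binary gap, and a back-of-the-envelope annual volume estimate. The slide size, slide counts, and workdays are assumptions to replace with local figures.

    # Decimal vs binary prefixes: the same drive looks "smaller" to the OS.
    advertised_gb = 1000                            # vendor: 1 TB = 10**12 bytes
    in_gib = advertised_gb * 10**9 / 2**30
    print(f"A '1 TB' drive is about {in_gib:.0f} GiB as many operating systems report it")

    # Back-of-the-envelope annual WSI volume (all assumptions, adjust locally).
    gb_per_slide = 1.2
    slides_per_case = 15
    cases_per_day = 80
    workdays_per_year = 250
    tb_per_year = gb_per_slide * slides_per_case * cases_per_day * workdays_per_year / 1000
    print(f"~{tb_per_year:.0f} TB of new slides per year, before replication and backup")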

Key reference

  1. Hart EM, Barmby P, LeBauer D, et al. Ten Simple Rules for Digital Data Storage. PLoS Comput Biol. 2016;12(10):e1005097. doi:10.1371/journal.pcbi.1005097.

2.4 Tiling, pyramids, and why many small reads hurt more than one big one

What you need to know

  • Whole-slide formats typically store images as tiled pyramids: many small patches (tiles) at several resolutions rather than a single giant bitmap. This allows fast zoom and pan because only the visible tiles need to be read.
  • Each tile read incurs overhead (file system metadata lookup, network round-trip, disk seek) in addition to the actual bytes transferred. When latency is high, these overheads dominate, and “many small tiles” feel much slower than “one big file” – the sketch after this list puts rough numbers on this.
  • Efficient WSI viewing therefore depends both on raw bandwidth and on keeping latency low: fast local SSDs, well-designed NAS/SAN, and servers physically close (in network terms) to viewers.
  • Hybrid strategies (e.g., caching frequently used tiles locally, pre-fetching tiles along the user’s scan path) rely on this mental model and often explain why a well-configured system “feels faster” than raw bandwidth numbers suggest.
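
A short sketch of why per-request overhead dominates when tiles are small: it compares fetching a screenful of tiles one request at a time against one large sequential read, using assumed latency and bandwidth figures.

    # Why many small reads hurt: per-request latency adds up.
    # Latency and bandwidth figures are illustrative assumptions.
    tiles_on_screen = 100
    tile_kb = 60                    # one compressed 256x256 tile
    latency_ms = 5                  # per request (disk seek or network round-trip)
    bandwidth_mb_s = 100

    total_mb = tiles_on_screen * tile_kb / 1000
    tile_by_tile_s = tiles_on_screen * latency_ms / 1000 + total_mb / bandwidth_mb_s
    one_big_read_s = latency_ms / 1000 + total_mb / bandwidth_mb_s

    print(f"Payload is only {total_mb:.1f} MB")
    print(f"100 separate requests: ~{tile_by_tile_s:.2f} s (latency dominates)")
    print(f"One sequential read:   ~{one_big_read_s:.2f} s")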

Key reference

  1. Kurose JF, Ross KW. Computer Networking: A Top-Down Approach. 8th ed. Pearson; 2021. Chapters 1–3 (delay, loss, throughput, and the network core).

2.5 Back-of-the-envelope transfer time estimates – sizing networks without a calculator

What you need to know

  • A useful clinical rule of thumb is: time ≈ size ÷ effective throughput. For example, a 2 GB slide over a 100 Mbps link (≈12.5 MB/s ideal, often less in practice) takes at least ~160 seconds – nowhere near “instant” (the sketch after this list turns this into a reusable calculation).
  • Because of protocol overheads, contention, and latency, real throughput is often a fraction of the headline link speed. Planning based purely on “1 Gbps” or “10 Gbps” without measuring effective throughput leads to disappointment.
  • Rough mental arithmetic (e.g., “at about 100 MB/s, 1 GB is ~10 seconds”) is often enough to decide whether a proposed workflow (remote primary sign-out, overnight batch processing) is realistic without complex modelling.
  • Clinicians who can make these estimates in conversation are better able to prioritise which links need upgrading and which workflows make sense to run locally versus in the data centre or cloud.
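
The rule of thumb from the first bullet, written out so you can plug in your own numbers. The link speeds and the efficiency factor are assumptions standing in for measured throughput.

    # time ≈ size ÷ effective throughput, with a crude efficiency factor
    # standing in for protocol overhead and contention (assumed, not measured).
    def transfer_seconds(size_gb: float, link_mbps: float, efficiency: float = 0.6) -> float:
        effective_mb_per_s = link_mbps / 8 * efficiency
        return size_gb * 1000 / effective_mb_per_s

    for link_mbps in (100, 1_000, 10_000):
        t = transfer_seconds(size_gb=2, link_mbps=link_mbps)
        print(f"2 GB slide over {link_mbps:>6} Mbps: ~{t:.0f} s ({t / 60:.1f} min)")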

Key reference

  1. Kurose JF, Ross KW. Computer Networking: A Top-Down Approach. 8th ed. Pearson; 2021. Chapter 1 – The network edge, core, and performance.

Chapter 3 – Storage 101: where your cases actually live

3.1 Types of storage – local, networked, and cloud

What you need to know

  • Local storage (internal SSD/HDD) is fast and convenient but only protects against some failures; if the workstation fails or is stolen, data can be lost unless it also lives elsewhere.
  • Networked storage (NAS, SAN, departmental file servers) centralises data so multiple users and systems can access the same slides, but introduces dependency on network reliability and proper access control.
  • Cloud storage ranges from simple object stores to complex managed services; it can be cost-effective and scalable, but requires careful attention to bandwidth, egress charges, privacy law, and vendor lock-in.
  • Understanding that “where a file lives” is separate from “which application opens it” is crucial: storage architecture decisions are long-lived and often harder to change than viewers or analysis tools.

Key reference

  1. Hart EM, Barmby P, LeBauer D, et al. Ten Simple Rules for Digital Data Storage. PLoS Comput Biol. 2016;12(10):e1005097. doi:10.1371/journal.pcbi.1005097.

3.2 Backup, sync, and archive – not the same thing

What you need to know

  • Backup means keeping additional copies of data, usually with the ability to roll back to earlier versions (e.g., before corruption or accidental deletion). It is about resilience over time.
  • Sync tools (e.g., cloud folders that mirror between devices) keep working copies consistent, but will happily synchronise deletions and corruptions as well; they are not a substitute for true backups (the checksum sketch after this list shows one way to catch silent corruption).
  • Archive storage is for data that changes rarely but must be kept (e.g., long-term slide repositories); it often uses slower, cheaper media and may involve batching retrieval requests.
  • The “3-2-1 rule” (3 copies, on 2 different media, with 1 off-site) remains a simple, effective principle for clinical imaging data. Departments should be clear on which systems are providing which functions.
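
A minimal sketch of one habit behind these distinctions: recording a checksum when a case is archived and comparing it later, so silent corruption is caught before a sync or backup cycle propagates it. The file paths are hypothetical examples.

    # Detect silent corruption by comparing checksums of two copies of a case.
    import hashlib
    from pathlib import Path

    def sha256(path: Path, chunk_mb: int = 8) -> str:
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            while chunk := f.read(chunk_mb * 1024 * 1024):
                digest.update(chunk)
        return digest.hexdigest()

    working = Path("local/slides/case_0001.svs")      # hypothetical paths
    archived = Path("archive/2024/case_0001.svs")

    if sha256(working) == sha256(archived):
        print("Copies match: the archived case can be trusted")
    else:
        print("Mismatch: investigate before the next sync or backup overwrites history")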

Key reference

  1. Hart EM, Barmby P, LeBauer D, et al. Ten Simple Rules for Digital Data Storage. PLoS Comput Biol. 2016;12(10):e1005097. doi:10.1371/journal.pcbi.1005097.