As many of our users have likely noticed, ICARE services have gone through several interruptions and maintenance operations over the past few weeks. Although regular, planned maintenance is part of the normal life of a data center, required significant system upgrades, a major infrastructure renovation and unexpected hardware failures combined to create a tense situation that mobilized our IT team for several weeks, well beyond what might have been visible from the user’s side. Now that our data and services center has returned to a less intense situation, we thought that you, as our users, might be interested to learn more about what’s under the hood.
ICARE in a (big) nutshell …
To provide our 3500 users worldwide with data and services, ICARE hosts a very large computing system, located at the IT department of Lille University. In a (big) nutshell, ICARE is:
- 1300 referenced data sets available online, yielding a data archive of about 5 PB
- 350 data sets collected daily
- 950 derived products and associated quick-look imagery generated routinely
- 90 processing codes
- An incoming data rate of 1.1 TB/day and a total archive growth of 1.4 TB/day (ingest and production)
- 136 million files archived
- 128 servers, 2596 cores, 256 CPUs (for production, services, and cluster)
- 18,000 core-days of computation in 2019
- Several kilometers of Ethernet cable (3 km) and optical fiber (4 km)
- An average monthly electrical consumption of 38,000 kWh.
A growing infrastructure …
To cope with the increasing volume of our data archive while maintaining the performance of the access services offered to our users, our IT team regularly upgrades our storage system as well as its associated backup tape library. Over the past year, we’ve been working on a major upgrade of our filesystem, which now relies on a GPFS 5 Spectrum Scale solution. This meant moving more than 5 PB of data from our previous GPFS 3 storage to our new system while keeping all activities and services running seamlessly for our users.
Recently, we also upgraded our tape library backup system. The current Quantum Scalar i6000 library is gradually being migrated to our new IBM System Storage TS4500, which eventually will be our main backup, archive and near-line solution.
A new data center on its way …
In 2021, as part of a national roadmap for the development of digital infrastructures (led by the Direction Générale de la Recherche et de l’Innovation, DGRI), the University of Lille Data Center was selected to become the primary regional data center for Higher Education and Research. This selection was excellent news for ICARE because it meant a more robust, secure and efficient hosting infrastructure. However, it also implies a long-haul project to provide the data center with a fully renovated electrical infrastructure, including double redundant high-power electrical lines, a more resilient UPS and an autonomous emergency generator. A more efficient air-conditioning system, rooms remodeled for better accessibility and security, a faster network connection to the national and European backbone, and many other improvements will provide our center with a much-needed reliable infrastructure. This major qualitative jump will make us ready to handle more demanding satellite missions requiring higher service availability, and to integrate fully within the national research infrastructure Data Terra (https://www.data-terra.org/).
A clear roadmap with occasional potholes … and a bit of dust.
This brand-new and improved infrastructure comes at a cost, however. In the past weeks, the main challenge has been handling maintenance operations on the primary power line, which forced us to completely shut down our systems twice within a 10-day period and carry out 6 controlled maintenance operations over a two-week period, hence the “few” emails you’ve probably received if you’re a registered user of our services.
Unlike your usual laptop, there is no “sleep mode” for a 7 PB data archive, so even without any problems, it always takes a little bit of time (and woman-power) to land and shut down the entire system properly.
Of course, it wouldn’t be really fun if all went as expected …
This is why, in addition to the regular challenges, Murphy’s law greeted us with a few more potholes.
Namely, one of our 2 PB storage arrays needed a full replacement of an internal interface, which required our team to remove 96 hard drives and make sure they all went back into exactly their expected locations. Needless to say, this gave headaches to the IT team and a bit of stress to the entire ICARE crew. That’s when you really hope your backup is up-to-date …
Fortunately, all drives came back to life after replacement of the drawer and the storage space was online soon after.
Wait! There is more! Indeed, it was not over yet … during the last maintenance on the electrical power line, the planned generator ended up being useless because the UPS system (soon to be replaced) finally decided to die just a few hours before the maintenance, forcing us to perform an emergency full shutdown again … you know … just for fun …
Today, all our services are back online thanks to the hard work of the entire IT team, and our users can access our services and data again. Clearly, we have passed a most challenging milestone, but the data center renovation project, expected to run until the end of 2022, is far from over … so please remain seated. We may experience rough air and turbulence again in the coming months, but rest assured that we’re taking care of your data security during this flight.