{"id":1437,"date":"2014-07-03T16:06:30","date_gmt":"2014-07-03T14:06:30","guid":{"rendered":"http:\/\/blog.gocept.com\/?p=1437"},"modified":"2014-07-03T16:06:30","modified_gmt":"2014-07-03T14:06:30","slug":"follow-up-actions-after-the-filesystem-corruption-incident","status":"publish","type":"post","link":"https:\/\/blog.gocept.com\/2014\/07\/03\/follow-up-actions-after-the-filesystem-corruption-incident\/","title":{"rendered":"Follow up actions after the filesystem corruption incident"},"content":{"rendered":"<p>On 2014-06-07, the Flying Circus experienced a quite unfortunate filesystem corruption\u00a0<a class=\"reference external\" href=\"http:\/\/status.flyingcircus.io\/incidents\/xjhyjjm0dlqb\">incident<\/a>. Most of the VMs have been cleaned up since then, but a few defective files are still around. In the following article, I&#8217;ll provide some background information on what types of corruption we saw, what you (as our customer) can expect from platform management to rectify the situation, and what everyone can do to check his\/her own applications.<\/p>\n<div id=\"observed-types-of-filesystem-corruption\" class=\"section\">\n<h1>Observed types of filesystem corruption<\/h1>\n<p>The incident resulted in lost updates on the block layer. This means that some filesystem blocks were reverted to an older state. Depending on what kind of information has been saved in the affected blocks, this may lead to different effects:<\/p>\n<ul class=\"simple\">\n<li>files show old content, as a file update got lost;<\/li>\n<li>files show random content, as updates to the file&#8217;s extent list got lost;<\/li>\n<li>files have disappeared completely, as updates to their containing directory got lost.<\/li>\n<\/ul>\n<p>On most VMs which experienced filesystem corruption, filesystem metadata has been rendered invalid as well. We were able to identify these VMs quickly and contacted the affected customers immediately. 
However, there are still some cases left where filesystem metadata has not been affected (so the automated checks did not find anything),\u00a0<em>but<\/em>\u00a0file contents have been affected. Generally, files that were updated between 2014-06-02 and 2014-06-07, or that live in directories that saw changes during that time, are at risk.<\/p>\n<p>These cases of corruption are impossible to detect via filesystem checks. To make sure that all VMs are in a reasonably good state, twofold action is necessary: first, we will check the OS and all\u00a0<a class=\"reference external\" href=\"http:\/\/flyingcircus.io\/doc\/guide\/components\/\">managed components<\/a>\u00a0as part of our platform management. Second, we ask you to take a look at your applications to uncover previously hidden cases of filesystem corruption.<\/p>\n<\/div>\n<div id=\"platform-wide-checks\" class=\"section\">\n<h1>Platform-wide checks<\/h1>\n<p>After taking short-term action to ensure that we will not run into a similar problem again, we are currently performing a deep scan of all installed OS files and managed components. In particular, we are going to:<\/p>\n<ul class=\"simple\">\n<li>perform a consistency check on all files installed from OS packages;<\/li>\n<li>perform integrity checks on all managed databases (PostgreSQL, LDAP, &#8230;);<\/li>\n<li>reboot all VMs to ensure that there is no stale cached content.<\/li>\n<\/ul>\n<p>Any inconsistencies found will be repaired automatically where possible (e.g., OS files). As far as application data is concerned (e.g., databases), we will contact you to work out available options to restore consistency. VM reboots will take place during announced maintenance periods as usual.<\/p>\n<\/div>\n<div id=\"application-specific-checks\" class=\"section\">\n<h1>Application-specific checks<\/h1>\n<p>It is not possible to perform an automated deep check of project files, as we do for OS files. 
Too much contextual knowledge is necessary to judge what is OK and what looks suspicious. So we ask you to take a critical look at your applications yourself. Our experience so far shows that signs of filesystem corruption show up quite quickly once one starts looking in the right places.<\/p>\n<p>Two areas that need to be checked are static project files, like application software and configuration, and application-specific databases.<\/p>\n<div id=\"checking-application-installations\" class=\"section\">\n<h2>Checking application installations<\/h2>\n<p>Our first and most important recommendation is to restart all applications and check for obvious signs of trouble. This is easy and points to most problems right away.<\/p>\n<p>Additionally, the applications&#8217; log files should be inspected. Filesystem corruption sometimes causes &#8220;illogical&#8221; errors to show up in the log files. We recommend looking through the log files for unusual error messages.<\/p>\n<p>If corrupted installation or configuration files are found, the best way out is usually to re-deploy the affected applications. This is easy if the deployment is controlled via an automated tool like\u00a0<a class=\"reference external\" href=\"http:\/\/batou.readthedocs.org\/\">batou<\/a>\u00a0or <a href=\"http:\/\/www.buildout.org\/\">zc.buildout<\/a>. Restoring installation files from backup is also an option.<\/p>\n<\/div>\n<div id=\"checking-application-data-stores\" class=\"section\">\n<h2>Checking application data stores<\/h2>\n<p>Some applications bring their own data store, for example ZODB or Solr. Procedures depend on the specific software, but we can give some general suggestions here:<\/p>\n<ul class=\"simple\">\n<li>Some data stores have their own integrity checking or even repair tools on board. For example, ZODB complains about inconsistencies during packing.<\/li>\n<li>Some data stores have means to dump their contents to an external file. 
During dumping, all pieces of data will be traversed and inconsistencies are likely to be discovered.<\/li>\n<li>Some data stores can easily be rebuilt from scratch, for example caches, indexes, or session stores.<\/li>\n<\/ul>\n<p>Please contact our <a href=\"mailto:support@gocept.com\">support<\/a> if you discover inconsistencies and need help to recover.<\/p>\n<\/div>\n<\/div>\n<div id=\"summary\" class=\"section\">\n<h1>Summary<\/h1>\n<p>We are sorry for all the trouble the filesystem corruption incident has caused. We care about customer data and will do our best to get VMs as clean as possible. With the guidelines mentioned above, it should be possible to uncover a good portion of the corrupted files that have not been identified yet.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>On 2014-06-07, the Flying Circus experienced a quite unfortunate filesystem corruption\u00a0incident. Most of the VMs have been cleaned up since then, but a few defective files are still around. 
In the following article, I&#8217;ll provide some background information on what types of corruption we saw, what you (as our customer) can expect from platform management &hellip; <a href=\"https:\/\/blog.gocept.com\/2014\/07\/03\/follow-up-actions-after-the-filesystem-corruption-incident\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Follow up actions after the filesystem corruption incident&#8221;<\/span><\/a><\/p>\n","protected":false},"author":11966441,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_coblocks_attr":"","_coblocks_dimensions":"","_coblocks_responsive_height":"","_coblocks_accordion_ie_support":"","advanced_seo_description":"","jetpack_seo_html_title":"","jetpack_seo_noindex":false,"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_newsletter_tier_id":0,"footnotes":"","jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","enabled":false}}},"categories":[10221],"tags":[793571,13734460,66619],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_likes_enabled":true,"jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/pFP3y-nb","jetpack-related-posts":[{"id":85,"url":"https:\/\/blog.gocept.com\/2011\/06\/27\/no-luck-with-glusterfs\/","url_meta":{"origin":1437,"position":0},"title":"No luck with glusterfs","author":"","date":"June 27, 2011","format":false,"excerpt":"Recently, we've been experimenting with glusterfs as an alternative network storage backing our VM hosting. It looked like a very promising candidate to replace our current iSCSI stack: scale-out with decent performance, mostly self-configuring, self-replicating, self-healing. And all of this out-of-the-box without complex setup. 
In contrast, the conventional architecture with\u2026","rel":"","context":"In &quot;en&quot;","block_context":{"text":"en","link":"https:\/\/blog.gocept.com\/category\/en\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":1332,"url":"https:\/\/blog.gocept.com\/2013\/07\/15\/reliable-file-updates-with-python\/","url_meta":{"origin":1437,"position":1},"title":"Reliable file updates with Python","author":"","date":"July 15, 2013","format":false,"excerpt":"Programs need to update files. Although most programmers know that unexpected things can happen while performing I\/O, I often see code that has been written in a surprisingly na\u00efve way. In this article, I would like to share some insights on how to improve I\/O reliability in Python code. Consider\u2026","rel":"","context":"In &quot;en&quot;","block_context":{"text":"en","link":"https:\/\/blog.gocept.com\/category\/en\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":1525,"url":"https:\/\/blog.gocept.com\/2015\/03\/06\/manage-javascript-dependencies-with-bowerstatic\/","url_meta":{"origin":1437,"position":2},"title":"Manage JavaScript dependencies with BowerStatic","author":"","date":"March 6, 2015","format":false,"excerpt":"Last month I explained how to use Fanstatic to manage JS dependencies. Since we were more and more displeased by using Fanstatic, we recently switched to BowerStatic, the new kid on the block. 
Since the setup is a bit more complicated and you need more tools to have the same\u2026","rel":"","context":"In &quot;en&quot;","block_context":{"text":"en","link":"https:\/\/blog.gocept.com\/category\/en\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":1510,"url":"https:\/\/blog.gocept.com\/2015\/01\/19\/manage-javascript-dependencies-with-fanstatic\/","url_meta":{"origin":1437,"position":3},"title":"Manage JavaScript dependencies with Fanstatic","author":"","date":"January 19, 2015","format":false,"excerpt":"Until the beginning of this year, we were using Fanstatic to manage dependencies to external JavaScript libraries. In case you are not familiar with Fanstatic, here is a short overview. I will discuss benefits and drawbacks later on. How it works Imagine you want to use jQuery in one of\u2026","rel":"","context":"In &quot;en&quot;","block_context":{"text":"en","link":"https:\/\/blog.gocept.com\/category\/en\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":2519,"url":"https:\/\/blog.gocept.com\/2017\/05\/27\/move-documentation-from-pythonhosted-org-to-readthedocs-io\/","url_meta":{"origin":1437,"position":4},"title":"Move documentation from pythonhosted.org to readthedocs.io","author":"Michael Howitz","date":"May 27, 2017","format":false,"excerpt":"Today we migrated the documentation of zodb.py3migrate\u00a0from\u00a0pythonhosted.org\u00a0to\u00a0zodbpy3migrate.readthedocs.io. 
This requires a directory \u2013 for this example I name it redir\u00a0\u2013 containing a file named index.html with the following content: <html> <head> <title>zodb.py3migrate<\/title> <meta http-equiv=\"refresh\" content=\"0; url=http:\/\/zodbpy3migrate.rtfd.io\" \/> <\/head> <body> <p> <a href=\"http:\/\/zodbpy3migrate.rtfd.io\"> Redirect to zodbpy3migrate.rtfd.io <\/a> <\/p> <\/body> <\/html> To upload\u2026","rel":"","context":"In &quot;en&quot;","block_context":{"text":"en","link":"https:\/\/blog.gocept.com\/category\/en\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":2890,"url":"https:\/\/blog.gocept.com\/2017\/10\/24\/zope4-errorhandling\/","url_meta":{"origin":1437,"position":5},"title":"Catching and rendering exceptions","author":"Michael Howitz","date":"October 24, 2017","format":false,"excerpt":"Error handling in Zope 4","rel":"","context":"In &quot;en&quot;","block_context":{"text":"en","link":"https:\/\/blog.gocept.com\/category\/en\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/blog.gocept.com\/wp-json\/wp\/v2\/posts\/1437"}],"collection":[{"href":"https:\/\/blog.gocept.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.gocept.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.gocept.com\/wp-json\/wp\/v2\/users\/11966441"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.gocept.com\/wp-json\/wp\/v2\/comments?post=1437"}],"version-history":[{"count":4,"href":"https:\/\/blog.gocept.com\/wp-json\/wp\/v2\/posts\/1437\/revisions"}],"predecessor-version":[{"id":1441,"href":"https:\/\/blog.gocept.com\/wp-json\/wp\/v2\/posts\/1437\/revisions\/1441"}],"wp:attachment":[{"href":"https:\/\/blog.gocept.com\/wp-json\/wp\/v2\/media?parent=1437"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.gocept.com\/wp-json\/wp\/v2\/categories?post=1437"},{"taxonomy":"post_tag","embeddable":true,"href":"htt
ps:\/\/blog.gocept.com\/wp-json\/wp\/v2\/tags?post=1437"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}