The Replication Package of Fault Analysis and Debugging of Microservice Systems


Benchmark System: TrainTicket

TrainTicket provides typical train ticket booking functionalities such as ticket enquiry, reservation, payment, change, and user notification.

41 microservices

4 languages: Java, Node.js, Python, Go

Cluster Environment: Docker Swarm or Kubernetes

Architecture of Train Ticket System

Source Code of Train Ticket System

Train Ticket System Setup


Fault Replication

Fault replication of industrial fault cases in train ticket system

| Fault Number | Fault Symptom | Fault Root Cause | Fault Detail Description | Source Code of Train Ticket System (with fault) | Source Code of the Failure Test Scenario |
| --- | --- | --- | --- | --- | --- |
| F1 | Messages are displayed in the wrong order | Asynchronous message delivery lacks sequence control | Fault Detail Description | Source Code of Train Ticket System (with fault) | Source Code of the Failure Test Scenario |
| F2 | Some information displayed in a report is wrong | Different data requests for the same report are returned in an unexpected order | Fault Detail Description | Source Code of Train Ticket System (with fault) | Source Code of the Failure Test Scenario |
| F3 | The system periodically returns server 500 errors | JVM configurations are inconsistent with Docker configurations | Fault Detail Description | Source Code of Train Ticket System (with fault) | Source Code of the Failure Test Scenario |
| F4 | The response time for some requests is very long | SSL offloading happens at too fine a granularity (in almost every Docker instance) | Fault Detail Description | Source Code of Train Ticket System (with fault) | Source Code of the Failure Test Scenario |
| F5 | A service sometimes returns timeout exceptions for user requests | The high load of one type of request causes timeout failures for another type of request | Fault Detail Description | Source Code of Train Ticket System (with fault) | Source Code of the Failure Test Scenario |
| F6 | A service slows down and eventually returns an error | Endless recursive requests of a microservice are caused by SQL errors of another dependent microservice | Fault Detail Description | Source Code of Train Ticket System (with fault) | Source Code of the Failure Test Scenario |
| F7 | The payment service of the system fails | The overload of requests to a third-party service leads to denial of service | Fault Detail Description | Source Code of Train Ticket System (with fault) | Source Code of the Failure Test Scenario |
| F8 | A default selection on the web page is changed unexpectedly | The key in the request of one microservice is not passed to its dependent microservice | Fault Detail Description | Source Code of Train Ticket System (with fault) | Source Code of the Failure Test Scenario |
| F9 | There is a right-to-left (RTL) display error for UI words | There is a CSS display style error in bidirectional display | Fault Detail Description | Source Code of Train Ticket System (with fault) | Source Code of the Failure Test Scenario |
| F10 | The number of parts of a specific type in a Bill of Material (BOM) is wrong | An API used in a special case of BOM updating returns unexpected output | Fault Detail Description | Source Code of Train Ticket System (with fault) | Source Code of the Failure Test Scenario |
| F11 | The Bill of Material (BOM) tree of a product is erroneous after updates | The BOM data is updated in an unexpected sequence | Fault Detail Description | Source Code of Train Ticket System (with fault) | Source Code of the Failure Test Scenario |
| F12 | The price status shown in the optimized result table is wrong | Price status querying does not consider an unexpected output of a microservice in its call chain | Fault Detail Description | Source Code of Train Ticket System (with fault) | Source Code of the Failure Test Scenario |
| F13 | The result of price optimization is wrong | Price optimization steps are executed in an unexpected order | Fault Detail Description | Source Code of Train Ticket System (with fault) | Source Code of the Failure Test Scenario |
| F14 | The result of the Consumer Price Index (CPI) is wrong | There is a mistake in including the locked product in CPI calculation | Fault Detail Description | Source Code of Train Ticket System (with fault) | Source Code of the Failure Test Scenario |
| F15 | The data-synchronization job quits unexpectedly | The Spark actor is used for the configuration of actorSystem (part of Apache Spark) instead of the system actor | Fault Detail Description | Source Code of Train Ticket System (with fault) | Source Code of the Failure Test Scenario |
| F16 | The file-uploading process fails | The "max-content-length" configuration of spray is only 2 Mb, which is too small to upload a large file | Fault Detail Description | Source Code of Train Ticket System (with fault) | Source Code of the Failure Test Scenario |
| F17 | The grid-loading process takes too much time | Too many nested "select" and "from" clauses are in the constructed SQL statement | Fault Detail Description | Source Code of Train Ticket System (with fault) | Source Code of the Failure Test Scenario |
| F18 | Loading the product-analysis chart is erroneous | One key of the returned JSON data for the UI chart includes the null value | Fault Detail Description | Source Code of Train Ticket System (with fault) | Source Code of the Failure Test Scenario |
| F19 | The price is displayed in an unexpected format | The product price is not formatted correctly in the French format | Fault Detail Description | Source Code of Train Ticket System (with fault) | Source Code of the Failure Test Scenario |
| F20 | Nothing is returned upon workflow data request | The JBoss startup classpath parameter does not include the right DB2 jar package | Fault Detail Description | Source Code of Train Ticket System (with fault) | Source Code of the Failure Test Scenario |
| F21 | JAWS (a screen reader) fails to read some elements | The "aria-labeled-by" element for accessibility cannot be located by JAWS | Fault Detail Description | Source Code of Train Ticket System (with fault) | Source Code of the Failure Test Scenario |
| F22 | An SQL "column missing" error is returned for some data requests | The constructed SQL statement includes a column name in the "select" part that does not match its "from" part | Fault Detail Description | Source Code of Train Ticket System (with fault) | Source Code of the Failure Test Scenario |
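
To make the F1 root cause (asynchronous message delivery without sequence control) concrete, here is a minimal Python sketch. It is not taken from the TrainTicket source code: the `Message` type, its `seq` field, and the `OrderedReceiver` class are hypothetical names introduced only for illustration. It contrasts displaying messages in arrival order with buffering them until their sequence numbers are contiguous.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    seq: int   # sequence number assigned by the sender (hypothetical field)
    text: str


@dataclass
class OrderedReceiver:
    """Buffers out-of-order messages and releases them in sequence order."""
    next_seq: int = 0
    pending: dict = field(default_factory=dict)
    delivered: list = field(default_factory=list)

    def receive(self, msg: Message) -> None:
        self.pending[msg.seq] = msg.text
        # Release every buffered message that is now contiguous
        # with what has already been delivered.
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1


def deliver_unordered(messages):
    """F1-style behaviour: show messages in whatever order they arrive."""
    return [m.text for m in messages]


def deliver_ordered(messages):
    """Same arrivals, but displayed only once the sequence is contiguous."""
    receiver = OrderedReceiver()
    for m in messages:
        receiver.receive(m)
    return receiver.delivered


if __name__ == "__main__":
    arrivals = [Message(1, "paid"), Message(0, "booked"), Message(2, "notified")]
    print(deliver_unordered(arrivals))
    print(deliver_ordered(arrivals))
```

In this sketch the receiver holds back any message whose predecessors have not yet arrived, which is one common form of sequence control; the replicated fault may be fixed differently in the actual system.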


General debugging Approach on the air

Fault replication of industrial fault cases in train ticket system

Time Consumption Statistics

| Fault Number | Fault Root Cause | Fault Triggering Steps | Snapshot of Basic Log Level | Snapshot of Visual Log Level | Snapshot of Visual Trace Level |
| --- | --- | --- | --- | --- | --- |
| F1 | Asynchronous message delivery lacks sequence control | Fault Triggering Steps | Snapshot of Basic Log Level | Snapshot of Visual Log Level | Snapshot of Visual Trace Level |
| F2 | Different data requests for the same report are returned in an unexpected order | Fault Triggering Steps | Snapshot of Basic Log Level | Snapshot of Visual Log Level | Snapshot of Visual Trace Level |
| F3 | JVM configurations are inconsistent with Docker configurations | Fault Triggering Steps | Failed | Failed | Failed |
| F4 | SSL offloading happens at too fine a granularity (in almost every Docker instance) | Fault Triggering Steps | Failed | Failed | Failed |
| F5 | The high load of one type of request causes timeout failures for another type of request | Fault Triggering Steps | Snapshot of Basic Log Level | Snapshot of Visual Log Level | Snapshot of Visual Trace Level |
| F6 | Endless recursive requests of a microservice are caused by SQL errors of another dependent microservice | Fault Triggering Steps | Snapshot of Basic Log Level | Snapshot of Visual Log Level | Snapshot of Visual Trace Level |
| F7 | The overload of requests to a third-party service leads to denial of service | Fault Triggering Steps | Snapshot of Basic Log Level | Snapshot of Visual Log Level | Snapshot of Visual Trace Level |
| F8 | The key in the request of one microservice is not passed to its dependent microservice | Fault Triggering Steps | Snapshot of Basic Log Level | Snapshot of Visual Log Level | Snapshot of Visual Trace Level |
| F9 | There is a CSS display style error in bidirectional display | Fault Triggering Steps | Snapshot of Basic Log Level | Not Applicable | Not Applicable |
| F10 | An API used in a special case of BOM updating returns unexpected output | Fault Triggering Steps | Snapshot of Basic Log Level | Snapshot of Visual Log Level | Snapshot of Visual Trace Level |
| F11 | The BOM data is updated in an unexpected sequence | Fault Triggering Steps | Snapshot of Basic Log Level | Snapshot of Visual Log Level | Snapshot of Visual Trace Level |
| F12 | Price status querying does not consider an unexpected output of a microservice in its call chain | Fault Triggering Steps | Snapshot of Basic Log Level | Snapshot of Visual Log Level | Snapshot of Visual Trace Level |
| F13 | Price optimization steps are executed in an unexpected order | Fault Triggering Steps | Snapshot of Basic Log Level | Snapshot of Visual Log Level | Snapshot of Visual Trace Level |
| F14 | There is a mistake in including the locked product in CPI calculation | Fault Triggering Steps | Snapshot of Basic Log Level | Snapshot of Visual Log Level | Snapshot of Visual Trace Level |
| F15 | The Spark actor is used for the configuration of actorSystem (part of Apache Spark) instead of the system actor | Fault Triggering Steps | Snapshot of Basic Log Level | Snapshot of Visual Log Level | Snapshot of Visual Trace Level |
| F16 | The "max-content-length" configuration of spray is only 2 Mb, which is too small to upload a large file | Fault Triggering Steps | Snapshot of Basic Log Level | Snapshot of Visual Log Level | Snapshot of Visual Trace Level |
| F17 | Too many nested "select" and "from" clauses are in the constructed SQL statement | Fault Triggering Steps | Snapshot of Basic Log Level | Snapshot of Visual Log Level | Snapshot of Visual Trace Level |
| F18 | One key of the returned JSON data for the UI chart includes the null value | Fault Triggering Steps | Snapshot of Basic Log Level | Snapshot of Visual Log Level | Snapshot of Visual Trace Level |
| F19 | The product price is not formatted correctly in the French format | Fault Triggering Steps | Snapshot of Basic Log Level | Not Applicable | Not Applicable |
| F20 | The JBoss startup classpath parameter does not include the right DB2 jar package | Fault Triggering Steps | Snapshot of Basic Log Level | Snapshot of Visual Log Level | Snapshot of Visual Trace Level |
| F21 | The "aria-labeled-by" element for accessibility cannot be located by JAWS | Fault Triggering Steps | Snapshot of Basic Log Level | Not Applicable | Not Applicable |
| F22 | The constructed SQL statement includes a column name in the "select" part that does not match its "from" part | Fault Triggering Steps | Snapshot of Basic Log Level | Not Applicable | Not Applicable |


Our ShiViz Debugging Approach

The debugging process for the replicated faults using trace transformation and ShiViz visualization

Time Consumption Statistics

Fault Analysis and Debugging Steps

| Fault Number | Fault Root Cause | Fault Triggering Steps | Snapshots of the Failure | System Original Traces | ShiViz Traces (Transformed from System Original Traces) | Snapshots of Fault-Revealing Ranges in ShiViz |
| --- | --- | --- | --- | --- | --- | --- |
| F1 | Asynchronous message delivery lacks sequence control | Fault Triggering Steps | Snapshots of the Failure | System Original Traces | ShiViz Traces (Transformed from System Original Traces) | Snapshots of Fault-Revealing Ranges in ShiViz |
| F2 | Different data requests for the same report are returned in an unexpected order | Fault Triggering Steps | Snapshots of the Failure | System Original Traces | ShiViz Traces (Transformed from System Original Traces) | Snapshots of Fault-Revealing Ranges in ShiViz |
| F3 | JVM configurations are inconsistent with Docker configurations | Fault Triggering Steps | Snapshots of the Failure | System Original Traces | ShiViz Traces (Transformed from System Original Traces) | Snapshots of Fault-Revealing Ranges in ShiViz |
| F4 | SSL offloading happens at too fine a granularity (in almost every Docker instance) | Fault Triggering Steps | Snapshots of the Failure | System Original Traces | ShiViz Traces (Transformed from System Original Traces) | Snapshots of Fault-Revealing Ranges in ShiViz |
| F5 | The high load of one type of request causes timeout failures for another type of request | Fault Triggering Steps | Snapshots of the Failure | System Original Traces | ShiViz Traces (Transformed from System Original Traces) | Snapshots of Fault-Revealing Ranges in ShiViz |
| F7 | The overload of requests to a third-party service leads to denial of service | Fault Triggering Steps | Snapshots of the Failure | System Original Traces | ShiViz Traces (Transformed from System Original Traces) | Snapshots of Fault-Revealing Ranges in ShiViz |
| F8 | The key in the request of one microservice is not passed to its dependent microservice | Fault Triggering Steps | Snapshots of the Failure | System Original Traces | ShiViz Traces (Transformed from System Original Traces) | Snapshots of Fault-Revealing Ranges in ShiViz |
| F10 | An API used in a special case of BOM updating returns unexpected output | Fault Triggering Steps | Snapshots of the Failure | System Original Traces | ShiViz Traces (Transformed from System Original Traces) | Snapshots of Fault-Revealing Ranges in ShiViz |
| F11 | The BOM data is updated in an unexpected sequence | Fault Triggering Steps | Snapshots of the Failure | System Original Traces | ShiViz Traces (Transformed from System Original Traces) | Snapshots of Fault-Revealing Ranges in ShiViz |
| F12 | Price status querying does not consider an unexpected output of a microservice in its call chain | Fault Triggering Steps | Snapshots of the Failure | System Original Traces | ShiViz Traces (Transformed from System Original Traces) | Snapshots of Fault-Revealing Ranges in ShiViz |
| F13 | Price optimization steps are executed in an unexpected order | Fault Triggering Steps | Snapshots of the Failure | System Original Traces | ShiViz Traces (Transformed from System Original Traces) | Snapshots of Fault-Revealing Ranges in ShiViz |
| F16 | The "max-content-length" configuration of spray is only 2 Mb, which is too small to upload a large file | Fault Triggering Steps | Snapshots of the Failure | System Original Traces | ShiViz Traces (Transformed from System Original Traces) | Snapshots of Fault-Revealing Ranges in ShiViz |
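
The trace transformation named in this section can be pictured with a small Python sketch. This is not the transformation script used in the replication package. It assumes a hypothetical input format, a list of `(service, kind, msg_id, description)` tuples in observed order with each receive listed after its matching send, and it assumes ShiViz is configured with a parsing regular expression whose `host`, `clock`, and `event` groups match the two-line layout produced here. The sketch attaches a vector clock to every event so that ShiViz can draw the happens-before relation between microservices.

```python
import json


def to_shiviz(events):
    """Convert trace events into ShiViz-style log text with vector clocks.

    events: list of (service, kind, msg_id, description) tuples, where kind
    is 'send', 'recv' or 'local' (hypothetical input format).
    """
    clocks = {}     # service -> its current vector clock
    in_flight = {}  # msg_id -> copy of the sender's clock at send time
    lines = []
    for service, kind, msg_id, description in events:
        clock = clocks.setdefault(service, {})
        if kind == "recv":
            # Merge the sender's clock: take the element-wise maximum.
            for host, tick in in_flight[msg_id].items():
                clock[host] = max(clock.get(host, 0), tick)
        # Every event advances the local component of the clock.
        clock[service] = clock.get(service, 0) + 1
        if kind == "send":
            in_flight[msg_id] = dict(clock)
        # One "host clock" line followed by one event-description line.
        lines.append(f"{service} {json.dumps(clock, sort_keys=True)}\n{description}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Service names and messages below are made up for illustration.
    trace = [
        ("ui", "send", "m1", "request order"),
        ("order", "recv", "m1", "handle order"),
        ("order", "send", "m2", "reply"),
        ("ui", "recv", "m2", "show result"),
    ]
    print(to_shiviz(trace))
```

A matching ShiViz parser expression for this layout would need to capture the host and clock on the first line and the event text on the second; the exact expression depends on how the real transformed traces are laid out.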