Ting Dai

Below are the projects I worked on before 2022. Recent projects are highlighted in research papers or integrated into business products.

Risk Quantification

Environment-Aware Risk Quantification 2021 IBM Research Accomplishment

01/2021-07/2021 IBM Research

Resistance Strength, which GTS also calls Health Check, is a best practice for hardening IT infrastructure. It is governed by standard technical specifications such as CIS benchmarks and STIG documents.
However, customers’ running environments vary, which results in many observed deviations from these specifications. Prioritizing the resulting risks according to each customer's environment is therefore essential.
We quantify environment-aware resistance strength risk to help customers handle the large volume of non-compliant controls. Our quantification scheme is currently integrated with IBM X-Force Red Vulnerability Management Services (VMS).

Media Coverage

Multi-Cloud Compliance

Policy-based Compliance Checking and Remediation in Multi-Clouds 2021 IBM Research Accomplishment

*This work is part of IBM CP4AIOps.

01/2020-12/2020 IBM Research

CP4MCM Python

VM Operator AWS Operator Azure Operator

Media Coverage, IBM Developer Recipes [1] [2] [3]

IBM Cloud Pak for Multi Cloud Management (CP4MCM) is a digital consumption and delivery platform with integration and orchestration layers that supports multiple technology stacks across a multi-vendor platform. Governance, Risk and Compliance (GRC) is a critical component of CP4MCM that enables customers to manage compliance through predefined and custom policies.
We have collaborated with the CP4MCM team, particularly the GRC team, to improve the GRC component in several ways:
- End-to-end solution with Ansible Operators for extending the policy controller to support Virtual Machines and Cloud Providers.
- Policy controllers for CIS Red Hat Enterprise Linux (RHEL) 7 benchmark, CIS AWS benchmark, and CIS Azure benchmark.
- Communication between the Ansible Operator and Ansible Tower to enable customer-specified policies written in an inventory.
- Design and visualization of scan/risk data in the GRC Dashboard for ease of consumption.

Risky Scripts in Infrastructure-as-Code

Automatically Detecting Risky Scripts in Infrastructure Code 2021 IBM Research Accomplishment

SoCC'20 SecureCode

*SecureCode is currently owned by Kyndryl.

09/2019-05/2020 IBM Research

Ansible-lint ShellCheck PSScriptAnalyzer Python Shell Script PowerShell Script

Infrastructure code supports embedded scripting languages such as Shell and PowerShell to manage infrastructure resources and conduct life-cycle operations. Risky patterns in the embedded scripts can have widespread negative impacts across the whole infrastructure, with disastrous consequences.
We propose an analysis framework that automatically extracts and composes the embedded scripts from infrastructure code, then detects their risky code patterns along with the associated severity levels and negative impacts.
We implement SecureCode based on the proposed framework to check infrastructure code supported by Ansible, i.e., Ansible playbooks. We integrate SecureCode with the DevOps pipeline deployed in IBM Cloud and test SecureCode on 45 IBM Services community repositories. Our evaluation shows that SecureCode can efficiently and effectively identify 3419 true issues with 116 false positives in minutes. Among the 3419 true issues, 1691 have high severity levels.
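The extract-then-detect flow described above can be illustrated with a minimal Python sketch. This is not SecureCode's implementation: the rule set, severity labels, and function names (`RISKY_PATTERNS`, `extract_scripts`, `scan`) are hypothetical, the playbook is assumed to be already parsed into Python lists and dictionaries, and the script composition step is omitted.

```python
import re

# Hypothetical rules: (pattern, severity, negative impact). Not SecureCode's rule set.
RISKY_PATTERNS = [
    (re.compile(r"\brm\s+-rf\s+/(\s|$)"), "high", "recursive delete of the root directory"),
    (re.compile(r"curl[^|]*\|\s*(sudo\s+)?(ba)?sh"), "high", "executes a remotely fetched script"),
    (re.compile(r"\bchmod\s+(-R\s+)?777\b"), "medium", "world-writable permissions"),
]

# Ansible modules whose arguments carry an embedded script.
SCRIPT_MODULES = ("shell", "command", "ansible.builtin.shell", "ansible.builtin.command")


def extract_scripts(playbook):
    """Yield (task name, script text) for every task that embeds a script."""
    for play in playbook:
        for task in play.get("tasks", []):
            for module in SCRIPT_MODULES:
                if module in task:
                    body = task[module]
                    # The module argument may be a plain string or a dict with "cmd".
                    if isinstance(body, dict):
                        body = body.get("cmd", "")
                    yield task.get("name", "<unnamed>"), str(body)


def scan(playbook):
    """Return (task name, severity, impact) for each risky pattern found."""
    findings = []
    for name, script in extract_scripts(playbook):
        for pattern, severity, impact in RISKY_PATTERNS:
            if pattern.search(script):
                findings.append((name, severity, impact))
    return findings


# Illustrative playbook, already parsed from YAML.
playbook = [{
    "hosts": "all",
    "tasks": [
        {"name": "install tool", "shell": "curl -s https://example.com/install.sh | sh"},
        {"name": "open permissions", "command": "chmod -R 777 /opt/app"},
        {"name": "list files", "shell": {"cmd": "ls -l /tmp"}},
    ],
}]

for finding in scan(playbook):
    print(finding)
```

Keeping extraction separate from pattern matching means the same rules could, in principle, be applied to scripts pulled from other infrastructure-code formats.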

Software Hang Bugs

Automatically Fixing Software Hang Bugs for Production Cloud Systems

SoCC'20 HangFix

09/2018-05/2020 NC State

Java

Software hang bugs are notoriously difficult to debug and often cause serious service outages in cloud systems.
We present HangFix, a software hang bug fixing framework which can automatically fix a hang bug that is triggered and detected in production cloud environments. HangFix first leverages stack trace analysis to localize the hang function and then performs root cause pattern matching to classify hang bugs into different types based on likely root causes. Next, HangFix generates effective code patches based on the identified root cause patterns. We have implemented a prototype of HangFix and evaluated the system on 42 real-world software hang bugs in 10 commonly used cloud server applications. Our results show that HangFix can successfully fix 40 out of 42 hang bugs in seconds.
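As a rough illustration of the stack trace analysis step, the sketch below takes several stack dumps of a hung thread and reports the innermost frame that appears in all of them. This heuristic is an assumption made for illustration, not HangFix's actual algorithm, and the function name `localize_hang_function` and the frame names are hypothetical.

```python
from typing import List, Optional


def localize_hang_function(dumps: List[List[str]]) -> Optional[str]:
    """Return the innermost frame shared by every dump, or None.

    Each dump lists frames from outermost to innermost. A function that
    stays on the stack across all samples while its callees keep changing
    is a plausible location for a hanging loop.
    """
    if not dumps:
        return None
    common = []
    for frames in zip(*dumps):  # walk all dumps one stack depth at a time
        if all(frame == frames[0] for frame in frames):
            common.append(frames[0])
        else:
            break
    return common[-1] if common else None


# Hypothetical samples taken a few seconds apart from the same thread.
dumps = [
    ["Thread.run", "Worker.process", "Copier.copyBytes", "InputStream.read"],
    ["Thread.run", "Worker.process", "Copier.copyBytes", "OutputStream.write"],
    ["Thread.run", "Worker.process", "Copier.copyBytes", "InputStream.read"],
]

print(localize_hang_function(dumps))  # prints Copier.copyBytes
```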

Media Coverage

Detecting Data Corruption Hang Bugs in Cloud Server Systems

SoCC'18 DScope

02/2017-05/2018 NC State

Java

Cloud server systems such as Hadoop and Cassandra have enabled many real-world data-intensive applications running inside computing clouds. However, those systems present many data-corruption and performance problems which are notoriously difficult to debug due to the lack of diagnosis information.
We present DScope, a tool that statically detects data-corruption related software hang bugs in cloud server systems. DScope statically analyzes I/O operations and loops in a software package, and identifies loops whose exit conditions can be affected by I/O operations through returned data, returned error code, or I/O exception handling. After identifying those loops which are prone to hang problems under data corruption, DScope conducts loop bound and loop stride analysis to prune out false positives. We have implemented DScope and evaluated it using 9 common cloud server systems. Our results show that DScope can detect 42 real software hang bugs including 29 new bugs. In contrast, existing bug detection tools miss most of those bugs.
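To make the bug class concrete, here is a small Python sketch of a loop whose stride is read from the data it parses. The record format and the function `parse_records` are hypothetical and chosen only for illustration; DScope itself analyzes Java code statically and is not shown here.

```python
def parse_records(buf: bytes) -> list:
    """Parse records laid out as [size][payload...].

    `size` is one byte and counts the whole record, including itself,
    so a valid record always has size >= 1.
    """
    records = []
    offset = 0
    while offset < len(buf):
        size = buf[offset]  # the loop stride comes from possibly corrupted data
        if size == 0:
            # Without this guard the offset never advances, the exit condition
            # is never met, and the loop spins forever on corrupted input.
            raise ValueError(f"corrupted record at offset {offset}: zero size")
        records.append(buf[offset + 1 : offset + size])
        offset += size
    return records


valid = bytes([3]) + b"ab" + bytes([2]) + b"c"
corrupted = bytes([3]) + b"ab" + bytes([0]) + b"c"

print(parse_records(valid))
try:
    parse_records(corrupted)
except ValueError as err:
    print("rejected:", err)
```

The guard turns a silent hang into an explicit error, which is the kind of check whose absence makes such a loop prone to hanging under data corruption.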

A Hybrid Approach to Performance Bug Diagnosis in Production Cloud Infrastructures

SoCC'17 TPDS'19 Hytrace

08/2015-05/2018 NC State

LTTng Clustering algo LLVM Findbugs C/C++ Java

Server applications running inside production cloud infrastructures are prone to various performance problems (e.g., software hang, performance slowdown). When those problems occur, developers often have few clues for diagnosing them. In this project, we present Hytrace, a novel hybrid approach to diagnosing performance problems in production cloud infrastructures.
Hytrace combines rule-based static analysis and runtime inference techniques to achieve higher bug localization accuracy than pure-static and pure-dynamic approaches for performance bugs. Hytrace does not require source code and can be applied to both compiled and interpreted programs such as C/C++ and Java. We conduct experiments using real performance bugs from seven commonly used server applications in production cloud infrastructures. The results show that our approach can significantly improve the performance bug diagnosis accuracy compared to existing diagnosis techniques.

Software Timeout Bugs

Studying, Detecting and Fixing Timeout Bugs in Cloud Server Systems IC2E best paper nominee

IC2E'18 ICAC'18 ICDCS'19

TScope TFix

02/2017-04/2019 NC State

LTTng SOM algo Dapper tracing Java

Timeouts are commonly used to handle unexpected failures in server systems. However, improper use of timeouts can cause server systems to hang or experience performance degradation. We conduct a comprehensive study to characterize 156 real-world timeout problems in 11 commonly used cloud server systems. Our study reveals that timeout problems are widespread among cloud server systems.
We then present TScope, which identifies timeout bugs by leveraging kernel-level system call tracing and machine learning based anomaly detection and feature extraction schemes. We conducted experiments using 19 real-world server performance bugs, including 12 timeout and 7 non-timeout performance bugs. The results show that TScope correctly classifies 18 out of 19 bugs with a false positive rate of 0.8%.
Furthermore, we present TFix, a drill-down bug analysis protocol for narrowing down the root cause variable of a timeout bug and producing recommendations for correcting the root cause. We have conducted extensive experiments using 13 real-world cloud server timeout bugs. Our experimental results show that TFix can correctly localize the misused timeout variables and suggest proper values for those bugs.

BlockChain Security

Supporting Privacy-Preserving, Auditable Smart Contracts in Hyperledger Fabric

DSN'19 FabZK

06/2018-03/2019 NC State IBM Research

Zero-knowledge proof Golang

On a Blockchain network, transaction data are exposed to all participants. To preserve privacy and confidentiality in transactions, while still maintaining data immutability, we design and implement FabZK.
FabZK conceals transaction details on a shared ledger by storing only encrypted data from each transaction (e.g., payment amount), and by anonymizing the transactional relationship (e.g., payer and payee) between members in a Blockchain network. It achieves both privacy and auditability by supporting verifiable Pedersen commitments and constructing zero-knowledge proofs. FabZK is implemented as an extension to the open source Hyperledger Fabric. It provides APIs to easily enable data privacy in both client code and chaincode. It also supports on-demand, automated auditing based on encrypted data. Our evaluation shows that FabZK offers strong privacy-preserving capabilities, while delivering reasonable performance for the applications developed based on its framework.
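The following Python sketch illustrates why Pedersen commitments suit this setting: a commitment hides the committed amount, yet commitments can be multiplied together and checked against the sum of the amounts. The small modular group, the constants `P`, `Q`, `G`, `H`, and the helper names are assumptions for illustration only; they are far too small to be secure and are not FabZK's actual parameters or API.

```python
import secrets

# Toy group parameters (illustrative assumptions, NOT secure).
P = 2039  # safe prime, P = 2 * Q + 1
Q = 1019  # prime order of the subgroup of squares modulo P
G = 4     # generator of the order-Q subgroup
H = 9     # second generator; in practice nobody may know log_G(H)


def commit(value: int, blinding: int) -> int:
    """Pedersen commitment C = G^value * H^blinding mod P."""
    return (pow(G, value % Q, P) * pow(H, blinding % Q, P)) % P


def verify(commitment: int, value: int, blinding: int) -> bool:
    """Check that a commitment opens to (value, blinding)."""
    return commitment == commit(value, blinding)


# Two hidden payment amounts, each with a fresh random blinding factor.
r1 = secrets.randbelow(Q)
r2 = secrets.randbelow(Q)
c1 = commit(30, r1)
c2 = commit(12, r2)

# Additive homomorphism: the product of the commitments is a commitment
# to the sum of the amounts under the sum of the blinding factors.
combined = (c1 * c2) % P
print(verify(combined, 42, r1 + r2))  # prints True
```

This homomorphic property is what lets totals be checked over committed values without revealing the individual amounts.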

Container Vulnerabilities

A Study on Container Vulnerability Exploit Detection

IC2E'19

11/2018-06/2019 NC State

Docker ML algo

Containers have become increasingly popular for deploying applications in cloud computing infrastructures. However, recent studies have shown that containers are prone to various security attacks.
We conduct a study on the effectiveness of various vulnerability detection schemes for containers. Specifically, we implement and evaluate a set of static and dynamic vulnerability attack detection schemes using 28 real-world vulnerability exploits that widely exist in Docker images.
Our results show that the static vulnerability scanning scheme detects only 3 out of 28 tested vulnerabilities, whereas dynamic anomaly detection schemes detect 22 vulnerability exploits. Combining static and dynamic schemes can further improve the detection rate to 86% (i.e., 24 out of 28 exploits).
We also observe that the dynamic anomaly detection scheme can achieve more than 20 seconds lead time (i.e., a time window before attacks succeed) for a group of commonly seen attacks in containers that try to gain a shell and execute arbitrary code.

Cloud Configuration Management

Cloud Configuration Management System for Elastic Application Deployment in Private Clouds

Video Software

01/2015-08/2015 NC State Credit Suisse

Openstack Docker ZooKeeper cAdvisor Java Python Shell Script

Cloud infrastructures and distributed applications are becoming increasingly complex, creating a need for an easy-to-use cloud application deployment tool. Such a tool must be elastic in dynamic environments, e.g., it must support geographically distributed hosts, handle cloud system anomalies gracefully, and dynamically balance workloads.
The Cloud Configuration Management (CCM) project aims to provide exactly such an easy-to-use application deployment tool. It composes and instantiates components automatically. More importantly, it has an elastic auto-scaling mechanism to handle overload conditions, resource contention, and system anomalies.