APS Failure at Scania Trucks

Siddheshwar Harkal
9 min read · Apr 5, 2021


Scania, a Swedish manufacturing company, builds trucks for mining, construction, and other heavy-duty work. Because of these heavy workloads, many functions in a truck rely on the APS (Air Pressure System). The APS generates pressurized air that is used for functions such as braking and gear changes during operation. A Scania truck may fail either because of an APS-related failure or for some other reason.

A pre-operational check of a Scania truck costs around $10, but if the truck fails during operation, the cost can go up to $500. The company therefore wanted to build a system that detects failures before they happen, so as to reduce the failure cost.

In this article, we will build a classifier to detect Scania truck failure before it is operational.

Dataset

You can find the dataset on Kaggle (the link is in the References below).

The dataset consists of data collected from heavy Scania trucks in everyday usage. The system in focus is the Air Pressure System (APS), which generates pressurized air that is utilized in various functions in a truck, such as braking and gear changes. The dataset's positive class consists of component failures for a specific component of the APS system. The negative class consists of trucks with failures for components not related to the APS.

Here, Cost_1 ($10) is the cost of an unnecessary check done by a mechanic at a workshop, while Cost_2 ($500) is the cost of missing a faulty truck, which may cause a breakdown. The dataset has 171 attributes, of which 7 are histogram variables. Missing values are denoted by ‘na’.
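Put together, these costs give a simple formula for evaluating any classifier on this problem; a minimal sketch:

```python
def total_cost(false_positives: int, false_negatives: int) -> int:
    """Challenge cost: $10 per unnecessary check (FP), $500 per missed failure (FN)."""
    cost_1, cost_2 = 10, 500
    return cost_1 * false_positives + cost_2 * false_negatives

# Example: 190 unnecessary checks and 9 missed failures.
print(total_cost(190, 9))  # 190*10 + 9*500 = 6400
```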

The attribute names of the data have been anonymized for proprietary reasons. It consists of both single numerical counters and histograms consisting of bins with different conditions.

The training data contains 60,000 data points, of which 59,000 belong to the negative class and 1,000 to the positive class.

The test set contains 16000 examples.

ML formulation

This is a binary classification problem where the positive class denotes a failure of the APS system and the negative class denotes a failure of some other component.

Business Constraints

  • Latency must be fairly low
  • Misclassification Cost is very high

Performance Metric

In this problem, the cost of a false negative is much higher than that of a false positive, so I chose Recall as the performance metric.

Recall = TP / (TP + FN), Precision = TP / (TP + FP), Accuracy = (TP + TN) / (TP + TN + FP + FN)
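As a quick illustration of how these metrics can be computed with scikit-learn (the labels below are toy values, not from the dataset):

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# 1 = APS failure (positive class), 0 = other failure (negative class)
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

print("Recall   :", recall_score(y_true, y_pred))     # 3 of 4 failures caught -> 0.75
print("Precision:", precision_score(y_true, y_pred))  # 3 of 4 alarms correct  -> 0.75
print("Accuracy :", accuracy_score(y_true, y_pred))   # 6 of 8 correct         -> 0.75
print(confusion_matrix(y_true, y_pred))               # rows = true, cols = predicted
```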

Existing Approaches

Costa, Camila Ferreira, and Mario A. Nascimento. “IDA 2016 Industrial Challenge: Using Machine Learning for Predicting Failures.” International Symposium on Intelligent Data Analysis. Springer, Cham, 2016.

This paper is a winning solution to the challenge, in which the authors used different algorithms such as Logistic Regression, Support Vector Machines, k-Nearest Neighbours, Decision Trees, and Random Forest to solve this problem. The missing data were handled using the SoftImpute algorithm.

SoftImpute is a large-scale matrix-completion algorithm that iteratively replaces missing values with current guesses while solving a low-rank optimization problem. The imbalance in the data was handled by setting a high threshold (cut-off) value, which in simple words means that the model predicts the negative class only if it is extremely sure.
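As a minimal sketch of the thresholding idea (my own illustration, not the authors' exact setup; the 0.98 cut-off is an arbitrary example value):

```python
import numpy as np

def predict_with_cutoff(model, X, negative_cutoff=0.98):
    """Predict 'negative' (no APS failure) only when the model is very sure.

    Anything less certain is flagged as a positive, trading cheap false
    positives for fewer of the far more expensive false negatives.
    """
    p_negative = model.predict_proba(X)[:, 0]  # column 0 = class 0 (negative)
    return np.where(p_negative >= negative_cutoff, 0, 1)
```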

The final results concluded that Random Forest performed the best out of all the others.

Gondek C., Hafner D., Sampson O.R. (2016) Prediction of Failures in the Air Pressure System of Scania Trucks Using a Random Forest and Feature Engineering. In: Boström H., Knobbe A., Soares C., Papapetrou P. (eds) Advances in Intelligent Data Analysis XV. IDA 2016. Lecture Notes in Computer Science, vol 9897. Springer, Cham

In this paper, the authors performed feature correction on the histograms; randomly chosen subsets of attributes were then evaluated to generate an ordering and a final subset of features. A fine-tuned Random Forest was applied to the problem.

During their analysis, the authors found that all the attributes were numeric, of which 70 belonged to 7 histograms with ten bins each. They then used the following visual-inspection methods on the data:

  • Box Plots — For an overview of the variance of the values.
  • Correlation Matrices — For identifying correlated features.
  • Scatter Plots — For determining the spread of the various classes.
  • Radar Charts — For recognizing outliers.

After this step, it was concluded that the data contained up to 82% missing values per attribute and that many of these attributes contained outliers, so they chose to replace the missing values with the median. It was further concluded that a Random Forest would perform best in this scenario.

No normalization was performed on the raw data, and new features were generated from the histograms: 16 features per histogram, all of them distances to other distributions, calculated using the following distance functions:

  • χ² Distance — A bin-wise comparison of the observed value to the expected value.
  • Earth Mover’s Distance — A metric that finds the cheapest way to transform one histogram to another.

Using these distance functions, they calculated the distances to the following four distributions:

  • Mean Distribution of the positive examples
  • Mean Distribution of the negative examples
  • Normal Distribution with the parameters µ = 5, σ = 1.5.
  • Mirrored Normal Distribution

As all the above-mentioned distances were highly correlated with the sum of the bins, the histograms were normalized to resolve this issue, after which they had 282 features.
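For intuition, here is a small sketch of the two histogram distances (my own illustrative implementation, not the authors' code):

```python
import numpy as np
from scipy.stats import wasserstein_distance

def chi_square_distance(h1, h2, eps=1e-10):
    """Bin-wise chi-square distance between two normalized histograms."""
    h1, h2 = np.asarray(h1, float), np.asarray(h2, float)
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def earth_movers_distance(h1, h2):
    """1-D Earth Mover's Distance, treating bin indices as positions."""
    bins = np.arange(len(h1))
    return wasserstein_distance(bins, bins, u_weights=h1, v_weights=h2)

# Two ten-bin histograms, one shifted right by one bin.
h1 = np.array([0, 1, 4, 10, 4, 1, 0, 0, 0, 0], dtype=float)
h2 = np.array([0, 0, 1, 4, 10, 4, 1, 0, 0, 0], dtype=float)
print(chi_square_distance(h1 / h1.sum(), h2 / h2.sum()))
print(earth_movers_distance(h1, h2))  # ~1.0, the size of the shift
```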

For training, they chose a Random Forest with 10-fold cross-validation and achieved the best prediction with a minimum of 210 features.


First Cut Approach

  1. As the training and testing data contain 60,000 and 16,000 data points respectively, each including the target variable, there is no need to perform a train-test split.
  2. Perform Exploratory Data Analysis and find the correlation between features using the Pearson Correlation.
  3. Perform Dimensionality Reduction Techniques to check how the data is distributed in the 2D space.
  4. As features with more than 50% missing values are unlikely to improve model performance, we can drop them.
  5. Impute the remaining features with mean imputation.
  6. Build the ML Model with Logistic Regression, Support Vector Machines, k-Nearest Neighbours, Random Forest, Gradient Boosting Tree, and XGBoost classifier with random search.
  7. Finally, calculate the performance metric of all the models on the data, and select the best model.

Exploratory Data Analysis

The training dataset consists of 60,000 data points and 171 features, one of which is the class label. The features are a combination of numerical counters and histogram bin data. 59,000 data points belong to the negative class and the remaining 1,000 to the positive class, so this is a highly imbalanced classification problem.

Another problem is that a large part of the data is missing. In extreme cases, some instances have 80% of their values missing, so we remove the features with more than 50% missing values and impute the remaining features with the mean.

Figure: missing values per feature
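A minimal sketch of this cleaning step (the file name and the ‘class’ column follow the Kaggle CSV; adjust to your copy of the data):

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv("aps_failure_training_set.csv", na_values="na")
y = df["class"].map({"neg": 0, "pos": 1})
X = df.drop(columns=["class"])

# Drop features with more than 50% missing values.
X = X.loc[:, X.isna().mean() <= 0.5]

# Mean-impute whatever is still missing in the remaining features.
X = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(X), columns=X.columns)
```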

We plot the Probability Density Function and box plot of each of these features to understand the distribution of our data.

The distribution plot for the ci_000 feature shows the spread of the positive and negative classes. It also shows that the negative class has more outliers and that the positive class dominates this range when compared to the negative.

From the above plot, the range of the negative class is smaller than that of the positive class, and a large number of the negative-class values lie below 0.20.

Correlated Features

We have a total of 163 features after removing the high-missing-value columns, so we use the Pearson correlation to find the most relevant features.

Figure: important columns
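A quick sketch of this ranking, reusing the X and y from the imputation step above:

```python
# Absolute Pearson correlation of each feature with the class label.
correlations = X.corrwith(y).abs().sort_values(ascending=False)
print(correlations.head(20))  # the 20 features most correlated with the label
```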

As each feature has a different scale, we use min-max scaling before applying the models.

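A minimal sketch with scikit-learn's MinMaxScaler, continuing from the X above:

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()                       # scales each feature to [0, 1]
X_scaled = scaler.fit_transform(X)            # fit on the training data only
# X_test_scaled = scaler.transform(X_test)    # reuse the same scaler for test data
```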

ML Model

As we have completed the Exploratory Data Analysis, data pre-processing, and feature engineering, we now apply various machine learning algorithms: Logistic Regression, Support Vector Machines, k-Nearest Neighbours, Decision Trees, Random Forest, Gradient Boosting Trees, and XGBoost. Hyperparameter tuning for each model is done using scikit-learn's RandomizedSearchCV. Each model is evaluated using our performance metric, and the analysis is done using the confusion matrix.
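As an illustration, a hedged sketch of the tuning step for the XGBoost model (the parameter ranges are examples, not the exact grid used; X_scaled and y come from the earlier sketches):

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

param_distributions = {
    "n_estimators": randint(100, 500),
    "max_depth": randint(3, 10),
    "learning_rate": uniform(0.01, 0.3),
}
search = RandomizedSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_distributions,
    n_iter=20,
    scoring="recall",  # recall is our chosen performance metric
    cv=5,
    n_jobs=-1,
)
search.fit(X_scaled, y)
print(search.best_params_, search.best_score_)
```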

After performing hyperparameter tuning on all the above models, we can conclude that the XGBoost classifier works best with Recall as the performance metric.

In the above table, note that the stacking classifier has a lower cost than the XGBoost classifier, but XGBoost has the higher recall score. Since recall is our primary metric, we choose the XGBoost classifier.

Model Deployment Using the Streamlit API

Streamlit is an open-source Python framework that allows us to create interactive websites for Machine Learning and Data Science projects without the need to have any web development skills.

After training, we save the model to a pickle file for later use. The goal here is to make a web app that takes a CSV input from the user and runs predictions on it.
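Saving the tuned model might look like this (assuming the search object from the tuning sketch above; the file name is arbitrary):

```python
import pickle

# Persist the best model found by the random search.
with open("xgb_model.pkl", "wb") as f:
    pickle.dump(search.best_estimator_, f)
```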

app.py
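The original post showed app.py as a screenshot; below is a minimal sketch of what such a file can look like (the model file name and preprocessing are assumptions carried over from the sketches above):

```python
import pickle

import pandas as pd
import streamlit as st

st.title("APS Failure Detection for Scania Trucks")

# Load the model we pickled after training.
with open("xgb_model.pkl", "rb") as f:
    model = pickle.load(f)

uploaded = st.file_uploader("Upload a CSV file of sensor readings", type="csv")
if uploaded is not None:
    data = pd.read_csv(uploaded, na_values="na")
    # NOTE: apply the same preprocessing as in training (column dropping,
    # imputation, min-max scaling) before predicting on real data.
    predictions = model.predict(data)
    data["prediction"] = ["APS failure" if p == 1 else "other" for p in predictions]
    st.write(data)
```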

Once we have the modeling taken care of, we can host the Streamlit app locally and add the necessary markdown to build our app.

If you don’t have the Streamlit library, then install it using the pip install streamlit command on the terminal.

Now that you have the app.py file ready, run it locally with streamlit run app.py from the app's root directory.

Our app is ready to go and is running locally!

Excellent!

Now we’ll need to create some files in this root directory to feed to Heroku.

Procfile

Open Jupyter Notebook (or any text editor) and create a file called Procfile. Paste the snippet below as its only content.
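The snippet appeared as an image in the original post; the standard Procfile for a Streamlit app on Heroku is:

```
web: sh setup.sh && streamlit run app.py
```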

This tells Heroku to run the setup.sh file we are about to create and then launch our app with the same Streamlit command we used earlier.

requirements.txt

In Jupyter Notebook, create another file called requirements.txt. This is where you specify the necessary packages and their versions, so your app doesn't break with future updates.
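For example (the package list mirrors what this project uses; pin the exact versions from your own environment):

```
streamlit
pandas
numpy
scikit-learn
xgboost
```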

setup.sh

In Jupyter Notebook, create another file called setup.sh. This file creates a directory and sets some variables used by Heroku when hosting our app.
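The file was shown as an image in the original post; the commonly used setup.sh for Streamlit on Heroku looks like this:

```
mkdir -p ~/.streamlit/

echo "[server]
headless = true
port = $PORT
enableCORS = false
" > ~/.streamlit/config.toml
```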

If you do not have an account on Heroku already, go to the Heroku website and create an account for free.

Log in to your account and select the Create New App option to create an app.

From the Deployment method section, click on Connect to GitHub, connect it with your GitHub account, and search for your repository name. It will appear after you click the Search button.

Click on Connect to link the app to your GitHub repository, then click on Enable Automatic Deploys.

Once you have completed all the steps, Heroku shows the build progress until your app is finally deployed. The process takes about 2–5 minutes.

You can find my deployed model on Heroku here.

You can view the entire code on my GitHub. And feel free to contact me through LinkedIn or Twitter.

For a clearer picture, you can view this video, which demonstrates the complete process:

Future Scope

  1. There are various other feature selection techniques and imputation methods that we can apply here.
  2. We can also use deep learning to solve this problem.

References

  1. Kaggle: https://www.kaggle.com/uciml/aps-failure-at-scania-trucks-data-set
  2. Applied AI Course: https://www.appliedaicourse.com/
  3. IDA 2016 Industrial Challenge: Using Machine Learning for predicting Failures: https://link.springer.com/chapter/10.1007/978-3-319-46349-0_33
  4. Cerqueira V., Pinto F., Sá C., Soares C. (2016) Combining Boosted Trees with Meta Feature Engineering for Predictive Maintenance. In: Boström H., Knobbe A., Soares C., Papapetrou P. (eds) Advances in Intelligent Data Analysis XV. IDA 2016. Lecture Notes in Computer Science, vol 9897. Springer, Cham
  5. Gondek C., Hafner D., Sampson O.R. (2016) Prediction of Failures in the Air Pressure System of Scania Trucks Using a Random Forest and Feature Engineering. In: Boström H., Knobbe A., Soares C., Papapetrou P. (eds) Advances in Intelligent Data Analysis XV. IDA 2016. Lecture Notes in Computer Science, vol 9897. Springer, Cham
  6. An Empirical Comparison of Missing Value Imputation Techniques on APS Failure Prediction. I.J. Information Technology and Computer Science, 2019, 2, 21–29
