You have probably heard of MRI, the abbreviation for “magnetic resonance imaging”. According to Wikipedia’s definition, magnetic resonance imaging (MRI) is a medical imaging technique used in radiology to form pictures of the anatomy and the physiological processes of the body in both health and disease. MRI scanners use strong magnetic fields, magnetic field gradients, and radio waves to generate images of the organs in the body. Doctors and physicians use this technique to get detailed information from a patient’s body so that they can make more accurate and efficient decisions during treatment. Without this technology and similar techniques, they could only guess at what is happening inside an organ in a patient’s body. In the software world, we need the same thing: a way to find out what is going on inside the software while it is in operation, and to make it visible not only for diagnosis but also for predicting future crises in production.
First of all, we need to understand why software has to be monitored in detail even though it has already been debugged and has passed both functional and non-functional tests. What makes software behave differently in the production environment in comparison with other environments, for instance a simulated production or development environment? The answer lies in the definition of the word “system” in the phrase “software system”. A system is a complex whole made of integrated and interconnected parts. It has a boundary and interacts with the world outside that boundary: the surrounding environment affects it and, vice versa, it has an impact on the outer world. The behavior of a system is different from the behavior of its parts; the parts affect each other and are affected by input from the surrounding world, which plays a crucial role in the behavior of the system. In short, the behavior of a system cannot be predicted without considering its environment, including its actors’ interactions, in addition to its components. A software system can be modeled as below.
As depicted in the figure, not only do the internal components interact with each other, but software and human actors, together with the underlying infrastructure including hardware and network, also have bidirectional interactions with the system. For example, suppose this picture represents an online stock exchange trading system in which the human actors are back-office operators and traders, and the third-party software actors consist of many services, including payment web services. What would happen if the response time of one of the payment services increased and the results of payments remained unknown? The traders may contact back-office operators to check whether a payment was successful. The back-office user may then run a report that has side effects on access to customers’ data in the database. The database server requires more resources to process that report, so the web application loses some of the resources it needed for processing payment requests with the payment web services. The quality of the payment process then decreases more and more, and the chain of effects continues into chaos. This situation is a result of system dynamics. How can the cause of such a case be determined in real time with no insight into the internal state of the software at that moment? Without that insight, there is no chance of discovering the network and chain of cause and effect.
The solution for getting deep insight into the behavior of the system, and for predicting that a crisis is about to start, is nothing other than monitoring the performance of the application at the detailed level of its components. This approach is called Application Performance Monitoring, or APM for short. The approach and its tools play the role of MRI for the software world: they are required because they provide detailed information in real time with no significant side effects on applications in the development and operation stages. Many tools and frameworks can be used for APM, including open source, free, and commercial ones. The following sections provide a definition of APM along with some guidelines for selecting an appropriate tool. Finally, a solution based on the ELK stack is proposed for a .NET-based application.
What should be monitored?
APM can be leveraged to collect quality metrics that determine business and technical indices of the software in its operational environment. This monitoring activity is more crucial for web applications and web services than for other types of projects. Various types of metrics can be recorded by APM systems, for instance the usage of computational resources, end-user experience (including response time), the number and duration of business transactions, and the number and duration of method calls. Since most infrastructural applications, such as web servers and database engines, expose built-in performance and non-functional metrics as performance counters, there is no need to implement custom monitoring for them. Business applications, in contrast, need to be monitored not only from the business transaction point of view but also at the very technical level of method calls.
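To make the two levels concrete, here is a minimal sketch, written in Python for brevity, of recording the number and duration of a business transaction and of a method call inside it. The names `measured`, `call_durations`, `place_order`, and `summarize` are hypothetical and exist only for illustration; a real APM tool would ship the timings to a server instead of keeping them in memory.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-memory store: metric name -> list of durations in seconds.
call_durations = defaultdict(list)


@contextmanager
def measured(name):
    """Record how long the wrapped block took under the given metric name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so failed calls are timed too.
        call_durations[name].append(time.perf_counter() - start)


def place_order(quantity, unit_price):
    """A made-up business transaction that contains one monitored method call."""
    with measured("transaction.place_order"):
        with measured("method.calculate_total"):
            total = quantity * unit_price
        return total


def summarize(name):
    """Return (number of calls, total duration in seconds) for one metric."""
    durations = call_durations[name]
    return len(durations), sum(durations)
```

Calling `place_order` twice and then `summarize("transaction.place_order")` yields a call count of 2 together with the accumulated duration, which is the kind of raw figure the business and technical indices are built from.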
It is necessary to identify and document predictions for business transactions based on ordinary application usage. This cannot be done unless the process starts at the analysis stage of software development; in other words, APM should not be an afterthought. During analysis and architecture design, the transactions, subsystems, and components, including their methods, that are to be monitored should be identified.
A more advanced strategy is to augment APM with AI capabilities. For example, statistics on resource usage and infrastructural applications’ metrics, combined with monitored data on business transactions and method calls, can be used as input for a linear regression analysis to predict and identify anomalies in the production environment.
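As a rough illustration of the regression idea, the sketch below fits a straight line to baseline measurements and flags new observations that fall far from it. It is a simplification under stated assumptions: a single predictor (transactions per minute) instead of the combined metrics described above, made-up sample numbers, and a hypothetical threshold of three times the spread of the baseline residuals. The function names are illustrative only.

```python
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def find_anomalies(baseline, observed, threshold=3.0):
    """Return observed (x, y) pairs lying far from the line fitted to baseline.

    "Far" means the residual exceeds `threshold` times the standard
    deviation of the baseline residuals (an assumed, tunable rule).
    """
    xs, ys = zip(*baseline)
    intercept, slope = fit_line(xs, ys)
    residuals = [y - (intercept + slope * x) for x, y in baseline]
    spread = pstdev(residuals)
    return [
        (x, y)
        for x, y in observed
        if abs(y - (intercept + slope * x)) > threshold * spread
    ]


# Made-up samples: (transactions per minute, average response time in ms).
baseline = [(10, 105), (20, 198), (30, 304), (40, 399), (50, 502)]
observed = [(25, 251), (35, 900), (45, 455)]

print(find_anomalies(baseline, observed))  # → [(35, 900)]
```

Only the 900 ms observation is reported, because the other two stay within a few milliseconds of what the baseline line predicts for their load.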
Considerations for leveraging APM
Applying APM to an application comes with costs in both the development and operation stages. Two critical issues to consider are how simple it is to apply APM-related code and how to keep that code from deforming business-related code during development. Simplicity of use concerns the learning curve that developers with different levels of proficiency must climb: a developer should need nothing more than a short coding guideline to be able to apply APM to her code. Moreover, since APM code is a cross-cutting concern rather than part of the business domain of the software, it should not reduce code clarity or cause ambiguity in the main parts of the system.
Since APM code creeps all around the source code of a system, unit-tested code snippets may host APM code as well; this should not impede unit testing or make it more complicated. It means that setting up and configuring APM infrastructure should not be required on either developers’ machines or build servers. To mitigate this risk, it is helpful to separate the interface of the APM from the underlying tools and frameworks used to implement it; this technique makes it possible to change the behavior of APM-instrumented code during unit testing, building, and load testing.
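The separation can be sketched as follows, in Python for brevity. Business code depends only on an abstract recorder, so a unit test can pass in a lightweight in-memory stand-in and never touch any monitoring infrastructure. `ApmRecorder`, `InMemoryRecorder`, and `PaymentService` are hypothetical names chosen for this illustration, not part of any real APM library.

```python
import time
from abc import ABC, abstractmethod


class ApmRecorder(ABC):
    """The only APM type business code sees; no monitoring tool appears here."""

    @abstractmethod
    def record(self, name: str, duration_seconds: float) -> None:
        """Store or forward one timing measurement."""


class InMemoryRecorder(ApmRecorder):
    """Test double that keeps measurements in a list for inspection."""

    def __init__(self):
        self.records = []

    def record(self, name, duration_seconds):
        self.records.append((name, duration_seconds))


class PaymentService:
    """Made-up business component that receives its recorder from outside."""

    def __init__(self, recorder: ApmRecorder):
        self._recorder = recorder

    def pay(self, amount):
        if amount <= 0:
            raise ValueError("amount must be positive")
        start = time.perf_counter()
        result = "paid"  # stands in for the real payment logic
        self._recorder.record("payment.pay", time.perf_counter() - start)
        return result
```

A unit test can then construct `PaymentService(InMemoryRecorder())`, call `pay(100)`, and assert on both the business result and the recorded metric name, all without an APM server or agent on the machine.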
APM data gathering, like any other cross-cutting concern, has its own overhead: CPU time, memory, bandwidth, and storage are required to collect, process, and store the APM data. The overhead of the APM code should therefore be measured, and developers should be aware of its side effects on the performance of the code that uses it. Thus, it must be possible to bypass the execution of APM code at run time, especially during load tests, and in the operational environment.
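One possible shape for such a bypass, again sketched in Python, is a run-time switch that the instrumentation checks on every call, plus a rough helper that compares the instrumented and bypassed paths. `ApmSwitch`, `monitored`, `price_order`, and `overhead_per_call` are hypothetical names, and the timing comparison is only indicative since it varies from run to run.

```python
import time
import timeit
from functools import wraps


class ApmSwitch:
    """Hypothetical global switch and in-memory sample store."""

    enabled = True
    samples = []


def monitored(func):
    """Time each call to func unless monitoring is switched off."""

    @wraps(func)
    def wrapper(*args, **kwargs):
        if not ApmSwitch.enabled:
            # Bypass: only this flag check remains as overhead.
            return func(*args, **kwargs)
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            ApmSwitch.samples.append((func.__name__, elapsed))

    return wrapper


@monitored
def price_order(quantity, unit_price):
    """Made-up business method used as the measurement target."""
    return quantity * unit_price


def overhead_per_call(calls=10_000):
    """Estimate the extra seconds per call that monitoring adds."""
    previous = ApmSwitch.enabled
    kept = len(ApmSwitch.samples)
    try:
        ApmSwitch.enabled = True
        with_apm = timeit.timeit(lambda: price_order(3, 7), number=calls)
        ApmSwitch.enabled = False
        without_apm = timeit.timeit(lambda: price_order(3, 7), number=calls)
    finally:
        ApmSwitch.enabled = previous
        del ApmSwitch.samples[kept:]  # drop samples created by the benchmark
    return (with_apm - without_apm) / calls
```

Setting `ApmSwitch.enabled = False` before a load test leaves the business result of `price_order` unchanged while no samples are collected, and `overhead_per_call()` gives developers a first estimate of what the instrumentation costs.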
An idea for applying APM to a .NET-based application using ELK
ELK is the abbreviation for a family of tools that includes Elasticsearch, Logstash, and Kibana. Elasticsearch is an open source, distributed, RESTful search and analytics engine built on the Lucene library. It is designed for high-performance, near real-time full-text search and allows CRUD operations on data, so Elasticsearch can also be used as a document-based NoSQL database engine. Its straightforward setup, simple usage, scalability, and high-performance query execution are the most impressive characteristics of this engine. The engine is complemented by two other tools. The first is Logstash, a tool that parses various types of logs and events based on a predefined map and simplifies transferring them to Elasticsearch, where they become available for more detailed investigation in a diagnostic process. A collection of data with no visualization capabilities solves no problem, so there must be a tool for visualizing and summarizing the data collected in Elasticsearch. That tool is Kibana, a simple web application that does magic in visualizing data and communicating with the Elastic engine. There are some newer applications in the family for different purposes, including Beats and the APM engine. The APM engine and its agents are the base elements for creating the APM framework in .NET. Since data is stored as raw documents in Elasticsearch, tracing data can coexist with infrastructure logs, server metrics, and security events, which makes it easy to explore all of the data in one place. The following picture gives an overview of the APM ecosystem using ELK in a .NET (or any other technology) application.
Since applications such as database engines and web servers provide their performance metrics as built-in services, there is typically no need for custom development to gather data from them. In-house applications, such as line-of-business web applications, must instead be equipped by their developers with code that provides APM data. The Elastic team has developed technology-specific agent components to be used as adapters by applications that need to send APM data to the Elastic APM engine. The relation between the agent component and its host application is depicted in the following figure:
An Elasticsearch APM agent for .NET is still under development by the Elastic team and has not been officially released, though its source code is available on GitHub. Applying the considerations mentioned in the previous paragraphs, an appropriate design for applying APM would look something like the following. The first image shows an abstract design that avoids a direct dependency on the framework underlying the APM, Elasticsearch in this case.
The second image shows an implementation of the abstract APMDataProvider and its factory based on the Elasticsearch APM agent.
The third image shows the dependency of the client application (the target of the monitoring) on these components.
With this solution, the dependency between the client application and the Elasticsearch components is broken. Since a FactoryOfFactory class creates an instance of the ElasticBasedAPM components’ factory based on configuration, replacing the instance of the real APMDataProvider with an instance of NullAPMDataProvider is easy. This technique eliminates the need for Elastic ecosystem infrastructure and configuration on both the developer’s system and the build server that executes the application’s unit tests. The following code snippet shows how an implementation of this design is used.
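As a minimal sketch of that usage, written in Python for brevity rather than as the .NET code of the design itself, the example below selects a factory from configuration and hands the client a provider. The class names echo the ones used in the design, but the method names (`create`, `create_provider`, `capture`), the `"apm"` configuration key, and `transfer_money` are assumptions made for illustration. The Elastic-based provider here is only a placeholder that stores names in a list; calls to the real Elastic APM agent are deliberately left out.

```python
class APMDataProvider:
    """Abstract provider the client application depends on."""

    def capture(self, transaction_name):
        raise NotImplementedError


class NullAPMDataProvider(APMDataProvider):
    """Does nothing, so no Elastic infrastructure is needed."""

    def capture(self, transaction_name):
        pass


class ElasticBasedAPMDataProvider(APMDataProvider):
    """Placeholder: a real version would forward to the Elastic APM agent."""

    def __init__(self):
        self.sent = []

    def capture(self, transaction_name):
        self.sent.append(transaction_name)


class NullAPMFactory:
    def create_provider(self):
        return NullAPMDataProvider()


class ElasticBasedAPMFactory:
    def create_provider(self):
        return ElasticBasedAPMDataProvider()


class FactoryOfFactory:
    """Chooses the concrete factory from a configuration value."""

    _factories = {"null": NullAPMFactory, "elastic": ElasticBasedAPMFactory}

    @classmethod
    def create(cls, config):
        kind = config.get("apm", "null")  # assumed key; defaults to no-op
        if kind not in cls._factories:
            raise ValueError(f"unknown APM kind: {kind}")
        return cls._factories[kind]()


def transfer_money(provider, amount):
    """Made-up client code: it only knows the abstract provider."""
    provider.capture("transfer_money")
    return amount


# On a developer machine or build server the configuration selects the no-op.
provider = FactoryOfFactory.create({"apm": "null"}).create_provider()
transfer_money(provider, 50)
```

Switching the configuration value to `"elastic"` changes which provider the client receives without touching `transfer_money`, which is the point of routing construction through the factory of factories.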