Differences

This shows you the differences between two versions of the page.

Link to this comparison view

en:dtm_technical_implementation [2016/12/16 10:16] (current)
Line 1: Line 1:
 +Last modified: Jan 17, 2013 by Cardenas\\
 +\\
 +
 +====== DTM Technical Implementation ======
 +
 +\\
 +\\
 +**DTM a lightweight computing virtualization system based on iRODS**
 +=====  Introduction ​ =====
 +
 +The Distributed Task Manager (DTM) is an user oriented virtualization system where users have the perception of using a uniform, simplified and powerful computing system. With DTM users can execute sets of tasks in automatic and transparent ways across several heterogeneous distributed computing infrastructures,​ such as batch clusters, grids and clouds.\\
 +\\
 +DTM is used to execute a set of user tasks in an automatic way. It creates simple and uniform resource access across multiple heterogeneous computing platforms and provides\\
 +location transparency.\\
 +\\
 +The focus is put on the lightweight implementation and operation of the system. We employ a pull scheduling approach to execute tasks in a distributed way via agents that hide\\
 +the infrastructure heterogeneity.\\
 +\\
 +For more information see [[:​en:​distributed_task_manager|DTM General Description]]
 +=====  Architecture ​ =====
 +
 +The DTM architecture is composed of four main components: user tasks, a database, agents and managers. ​
 +====  Users Tasks  ====
 +
 +A task is assumed for each action that a user wishes to perform on a computing machine. It could be an executable file, a set of commands, or a shell script. The task is built by the user and contains the business logic for the user application.\\
 +\\
 +A task can be independent of the computer infrastructure,​ but it can be designed to check specific dependences related to software components such as program libraries or services. Therefore, a task will be initially developed and tested through execution in batch mode, then it can be submitted massively on different computer platforms. ​
 +====  DTM Agents ​ ====
 +
 +Agents are lightweight and generic software components which run as independent jobs to execute\\
 +one or several user tasks. All the agents are built within a framework that organizes the main execution loop and provides a uniform way for deployment, control and monitoring of the agent activity. Agents run in different computer environments. Those that are part of the specific platforms, for example EGI grid, are usually deployed using the corresponding grid standard services. They watch for changes in service state and react accordingly to systems constrains\\
 +such as software dependencies and resource availability. Agents can run on a worker node of a site controlled by a batch system or on a workload management system such as gLite WMS. They employ a pull job scheduling paradigm. Agents can alternatively run as part of a job executed on a worker node as a so called “pilot agent”. An agent can be developed for a specific platform or infrastructure such as cloud computing. ​
 +====  DTM Manager ​ ====
 +
 +The manager is the software component that automates the job or process agent submission across\\
 +multiple heterogeneous distributed computing platforms. The job manager coordinates the execution of tasks required to achieve an objective (a production in DTM terminology). Job manager generates a periodic global report of the production state and applies actions based\\
 +on this balance. For example, it checks continuously the state of the tasks and job agents and decides whether to start agents that consume new tasks or retry the execution of failed or unreported tasks.\\
 +\\
 +The manager controls resource utilisation in shared computing infrastructures. It can enforce a constant number of jobs running simultaneously on specific infrastructures. In this way, strong load peaks in resource use can be avoided.\\
 +\\
 +The manager submits automatically job agents across the different infrastructures and monitor their behaviour. Two modes or policies are supported: strict and progressive. In the strict mode the global CPU total requested for a production (set of tasks) is used to establish the number of jobs agents required. In the progressive mode, a constant number of the jobs is\\
 +launched periodically to increase production CPU usage, The number of running agents is reduced when the desired production usage is attained. ​
 +====  DTM Database ​ ====
 +
 +The DTM information system is managed with a DBMS. The database provides a central catalog for the DTM. A relational data model specifies information and relationships about the system components. The database records information that describes each task and set of tasks that we call “production”. It registers state information and actions on tasks. Information about\\
 +the DTM agents is also recorded.\\
 +\\
 +In a context with multiple, distributed agents requesting tasks to consume, the system must ensure the integrity of information related with task assignment. We use DBMS transactions to maintain the integrity of data related to task assignment. In this way an agent request for a task to consume is treated in a coherent and reliable way independent of other concurrent requests.\\
 +\\
 +We implement all database transactions and operations as stored procedures to consolidate and\\
 +centralize the logic. DTM stored procedures execute several SQL statements for extensive or complex processing. This permits a server side logic implementation within the iRODS rule approach.
 +=====  User Operation ​ =====
 +
 +DTM user operation can be described for three main operations: Register tasks and production (set of tasks), enable production execution, and track production performance. A set of eleven Unix/Linux commands are available for the user. 
 +====  Register tasks and productions ​ ====
 +
 +Users register tasks to be executed by the system. They specify the basic task requirements such as CPU time and memory. Optionally, particular batch arguments can be registered such as disk space. Dependency relationships between tasks can be registered. This permits the implementation of basic work-flows within the system. A task must belong to a production. A production groups a set of relatively homogeneous tasks. The register process deposits the\\
 +task script file content into the database. The script content is compressed and coded in hexadecimal form to be passed as arguments of the irule command. A rule in the iRODS server writes the script content into the DTM database using the RDA micro-services. ​
 +====  Production enabling ​ ====
 +
 +The registered tasks are not immediately executed. This permits users to plan the execution of a production (set of tasks). The production must be enabled explicitly by the production owner. Only the tasks of an enabled production set are available to be executed by agents. ​
 +====  Production performing ​ ====
 +
 +Users begin to perform a production set on all or on a subset of available infrastructures. For example, a production set can be performed on grid and local batch systems. A manager process launches and monitors the job or process agents that consume the available tasks.\\
 +Optionally, users can specify the maximal number of agents running simultaneously and the time frequency to check the state of the user tasks. The maximal number of sites used in a grid environment can be specified also.\\
 +\\
 +The dtm-start command-line permits the launching of a production task set. It starts a manager as a demon process or demon job. The manager submits the DTM agents to the specified infrastructures.\\
 +\\
 +At the start of the execution of a DTM agent, it verifies its execution environment,​ i.e. batch, grid, or standalone computer and establishes the maximal CPU time assigned. It uses the command dtm-task-get-next to request a task to execute, expressing its available CPU\\
 +time and additional information to identify the agent. A database stored procedure matches an agents request with the requirements of the registered tasks. If a task is selected it is assigned to an agent.\\
 +\\
 +\\
 +An agent requests a task from DTM using the dtm-task-get-next command-line,​ then runs the task in worker node and periodically monitors the task status and progress; the agent reports this information to the DTM system.\\
 +\\
 +When the task is finished a final report is writing by the agent. Then the procedure to request a new task based on the available resources is repeated. If there are not sufficient available resources, for example less than ten percent of the initial CPU time, the job agent finishes the execution itself.
 +=====  An iRODS Based System ​ =====
 +
 +DTM uses several iRODS components: Authentication,​ Rule-oriented Database Access (RDA) and remote rule execution. ​
 +====  Authentication ​ ====
 +
 +The iRODS authentication and authorization mechanisms are used by DTM. In this way user\\
 +authentication and authorization across systems are hidden. Currently DTM uses the secure password system and GSI. 
 +====  Rule-oriented Database Access (RDA)  ====
 +
 +Transparent access to the DTM database is provided by the Rule-oriented Database Access (RDA). In a general sense, the DTM interface is rule-oriented and implemented via iRODS rules and micro-services. Database connection mechanisms and configuration are hidden using RDA which permits a flexible, strong, and secure interface to DTM's database.\\
 +\\
 +A mySQL database is used to implement the set of stored procedures which are invoked from RDA micro-services. An iRODS micro-service (msiRdaToString) was added to support DTM rules design. It is a variant of the system provided micro-service (msiRdaToStdout). The new micro-service permits the storage of the output in a character buffer string. The micro-service retrieves\\
 +results from stored procedures and returns them in a character buffer string instead of through standard output. The irule invoker can get the output from the DTM system. Unix/Linux command-line utilities and Java/Jargon based applications use this interface. ​
 +====  Remote rule execution ​ ====
 +
 +DTM core business logic is implemented as rules on the server side following the rule-oriented approach. Therefore the DTM functionalities are available directly as iRODS rules at a server. Clients and components initiate DTM actions that invoke rules on iRODS servers. DTM rules encapsulate system functionality such as register a task or show a list of production task\\
 +sets.\\
 +\\
 +In a general way, using remote rule invocation is useful to simplify implementation of applications. For example, the irule command is used to build a set of Unix/​Linux\\
 +command-lines in shell scripts. The iRODS client works as a proxy to permit the remote rule execution. Similarly DTM rules are invoked by a java application through the Jargon API. The system command irule was modified to support DTM functionality;​ we have increased the number of arguments supported by the command.
 +
  
  • en/dtm_technical_implementation.txt
  • Last modified: 2016/12/16 10:16
  • (external edit)