Research on Optimizing Job Scheduling Algorithms for Cloud Computing Platforms
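Abstract: As the Internet continues to grow, users generate large volumes of data that must be processed and stored, and traditional server clusters can no longer meet these big-data demands. Cloud computing has become the most typical solution: it offers users services such as massive data processing, massive data storage, and on-demand access to computing power. Since the concept was first proposed, cloud computing has attracted wide attention from both academia and industry, and many companies have launched their own cloud computing platforms. Most of these platforms are built on Hadoop, an open-source distributed framework that runs on large clusters to provide big-data storage and parallel computation. Hadoop makes the underlying parallelization transparent to developers: application developers only need to implement their code against the required interfaces to obtain distributed processing. However, Hadoop is still a relatively new platform; many parts of it are not yet mature, and there is considerable room for improvement.
The performance of a Hadoop platform is closely tied to its job scheduling algorithm, and the choice of scheduling algorithm has a large effect on the platform's resource utilization and system throughput. However, the scheduling algorithms currently available in Hadoop have many shortcomings. Studying, optimizing, and improving job scheduling algorithms for the Hadoop platform is therefore of great significance for improving its performance.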
Keywords: cloud computing; Hadoop; MapReduce; job scheduling; resource awareness; heterogeneity
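Summary: This thesis covers the following:
1. It introduces cloud computing technology, focusing on the technical background and architecture of the Hadoop platform, and analyzes in detail the read and write flows of the HDFS file system and the MapReduce programming framework.
2. It examines in depth the job scheduling workflow on the Hadoop platform, focusing on several existing job scheduling algorithms: the FIFO scheduling algorithm, the capacity scheduling algorithm, the fair-share scheduling algorithm, and the LATE scheduling algorithm. For each, it analyzes the underlying idea and the main strengths and weaknesses.
3. To address the problem that existing scheduling algorithms are poorly suited to heterogeneous environments, it proposes an improved scheduling algorithm that classifies jobs according to system information and derives a scheduling strategy from that classification. An optimization algorithm then matches jobs to nodes to improve overall system performance.
4. To address the problem that existing scheduling algorithms do not consider the workload types of jobs and nodes, it proposes a resource-aware scheduling algorithm that classifies jobs and nodes by type and selects suitable tasks to schedule according to each node's load.
5. To verify the algorithms, a Hadoop experimental cluster was built and a large amount of test data was collected to evaluate their performance. The experimental results show that the two proposed scheduling algorithms effectively improve the performance of the Hadoop platform.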
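The resource-aware idea in item 4 can be illustrated with a minimal sketch. The abstract does not give the thesis's actual classification rules or matching procedure, so everything here is an assumption made for illustration: the `Task` and `Node` types, the two workload classes `"cpu"` and `"io"`, the load fractions, and the function `pick_task` are all hypothetical. The sketch only shows the general pattern of choosing a task whose workload type complements the requesting node's current load.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    job_id: str
    kind: str  # hypothetical workload class: "cpu" or "io"


@dataclass
class Node:
    name: str
    cpu_load: float  # assumed fraction in [0, 1], e.g. reported with a heartbeat
    io_load: float   # assumed fraction in [0, 1]


def pick_task(node: Node, pending: List[Task]) -> Optional[Task]:
    """Remove and return the pending task that best complements the node's load."""
    if not pending:
        return None
    # Prefer the resource that is currently less loaded on this node.
    preferred = "cpu" if node.cpu_load <= node.io_load else "io"
    for i, task in enumerate(pending):
        if task.kind == preferred:
            return pending.pop(i)
    # No task of the preferred kind is waiting: fall back to FIFO order.
    return pending.pop(0)


queue = [Task("j1", "cpu"), Task("j2", "io"), Task("j3", "cpu")]
cpu_busy_node = Node("n1", cpu_load=0.9, io_load=0.2)
chosen = pick_task(cpu_busy_node, queue)
print(chosen.job_id)
```

In this sketch a node whose CPU is heavily loaded receives an I/O-bound task when one is waiting, and the queue falls back to FIFO order otherwise. A real Hadoop scheduler would plug into the framework's own scheduler interfaces and use measured resource metrics rather than these illustrative fields.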
Abstract: As the scale of the Internet keeps growing, enormous amounts of user data need to be processed and stored, and traditional server clusters cannot meet these needs. Cloud computing has become the leading solution to this problem. It provides users with massive data processing, massive data storage, on-demand access to computing power, and other services. Since the concept of cloud computing was introduced, it has attracted wide attention from academia and industry, and many companies have launched their own cloud computing platforms. Most of these platforms are built on Hadoop. Hadoop is an open-source distributed framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes and for storing users' massive data. In Hadoop, the underlying parallelism is transparent to application developers, who only need to implement their code according to the required interfaces. However, Hadoop is a relatively new platform, and many aspects of it still need to be improved.
The performance of a Hadoop system is closely tied to its job scheduler. Selecting an appropriate scheduling algorithm has a significant impact on resource utilization and system throughput. However, Hadoop's existing scheduling algorithms have many shortcomings. By studying these algorithms, we can find ways to optimize and improve them, which is of great significance for improving the performance and throughput of the Hadoop platform.
Keywords: Cloud computing; Hadoop; MapReduce; Job scheduling; Resource-aware; Heterogeneous
(Xu Peng, Computer Software and Theory, master's thesis, Shandong Normal University)