Barely got to the office today and wanted to run a fairly complex MapReduce job, but the moment I opened my laptop I was stunned: both datanodes were down. The cause was running out of disk space; the whole 20 GB was used up. Good grief, I had no idea that much was already in use, probably from importing big files into HDFS for testing recently. The services wouldn't start, and reformatting the namenode didn't help either. Luckily these are cloud hosts, so I expanded the disks first, then restarted everything, blah blah. There's another Hadoop pitfall here too, the cluster ID mismatch problem, but I'll leave that for another time.
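For the record, a quick way to keep an eye on HDFS capacity before the datanodes fall over, just a sketch with generic commands (the paths are examples, not anything specific to this cluster):

# overall capacity and remaining space, reported per datanode
hdfs dfsadmin -report
# which HDFS directories are eating the space; adjust the path as needed
hdfs dfs -du -h /
# local disk usage on each datanode host
df -h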
Then I started Dr.Elephant. The page was still up, hanging in there nicely. I ran two MR jobs, went back to the dre page, and was stunned again: why is it only finding yesterday's jobs? Where are today's? Restarting the service didn't help, so I had to sit down and actually read the logs. dre's logs are a bit peculiar: whether the service itself started is in dr.log, and you can also check the applog under logs, but the detailed analysis log only lives one directory level up. Once you open it there's a lot of content, so just scroll straight to the bottom (the logs are very detailed, and reading them carefully really helps you understand dre's workflow). One error jumped right out:
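To spell out how I skim those logs — a throwaway sketch; only dr.log is a name that actually appears above, the rest is generic shell and the exact paths depend on where dre is deployed:

# did the service itself come up? check the tail of dr.log
tail -n 50 dr.log
# then hunt for analysis errors wherever the detailed log lives
grep -rn "ERROR" . | tail -n 20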
07-31-2019 12:58:01 ERROR [dr-el-executor-thread-2] com.linkedin.drelephant.ElephantRunner : Can't find config of job_1564544371315_0003 in neither /tmp/hadoop-yarn/staging/history/done/2019/07/30/000000/ nor /tmp/hadoop-yarn/staging/history/done_intermediate/ubuntu/
So dre pulls each job's configuration and execution details from Hadoop's jobhistory. I've always known that much, but where are the jobhistory files actually stored? After poking around a bit, I found them here:
ubuntu@hadoop1:/usr/local/hadoop/etc/hadoop$ hdfs dfs -ls /tmp/hadoop-yarn/staging/history/done/2019/07/31/000000
Found 6 items
-rwxrwx--- 3 ubuntu supergroup 97363 2019-07-31 11:49 /tmp/hadoop-yarn/staging/history/done/2019/07/31/000000/job_1564544371315_0001-1564544735492-ubuntu-wordcount-1564544940066-9-1-SUCCEEDED-default-1564544747164.jhist
-rwxrwx--- 3 ubuntu supergroup 120795 2019-07-31 11:49 /tmp/hadoop-yarn/staging/history/done/2019/07/31/000000/job_1564544371315_0001_conf.xml
-rwxrwx--- 3 ubuntu supergroup 98025 2019-07-31 11:57 /tmp/hadoop-yarn/staging/history/done/2019/07/31/000000/job_1564544371315_0002-1564545266252-ubuntu-wordcount-1564545458923-9-1-SUCCEEDED-default-1564545272222.jhist
-rwxrwx--- 3 ubuntu supergroup 120796 2019-07-31 11:57 /tmp/hadoop-yarn/staging/history/done/2019/07/31/000000/job_1564544371315_0002_conf.xml
-rwxrwx--- 3 ubuntu supergroup 98742 2019-07-31 12:55 /tmp/hadoop-yarn/staging/history/done/2019/07/31/000000/job_1564544371315_0003-1564548741261-ubuntu-wordcount-1564548930505-9-1-SUCCEEDED-default-1564548747941.jhist
-rwxrwx--- 3 ubuntu supergroup 120796 2019-07-31 12:55 /tmp/hadoop-yarn/staging/history/done/2019/07/31/000000/job_1564544371315_0003_conf.xml
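Those locations are governed by the JobHistory Server settings in mapred-site.xml; if nothing is set, Hadoop falls back to defaults under the staging dir, which is exactly the /tmp/hadoop-yarn/staging/history/... paths above. A sketch of what setting them explicitly would look like (the values shown are just the default-style ones, adjust for your own cluster):

<property>
  <name>mapreduce.jobhistory.done-dir</name>
  <value>/tmp/hadoop-yarn/staging/history/done</value>
</property>
<property>
  <name>mapreduce.jobhistory.intermediate-done-dir</name>
  <value>/tmp/hadoop-yarn/staging/history/done_intermediate</value>
</property>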
As long as the files are still there, it's manageable. The problem must then be inside dre, and specifically inside the fetcher. The code itself should be fine, so it has to be a configuration issue. I changed the timezone, located in app-conf/Fetcher.xml:
<fetcher>
  <applicationtype>mapreduce</applicationtype>
  <classname>com.linkedin.drelephant.mapreduce.fetchers.MapReduceFSFetcherHadoop2</classname>
  <params>
    <sampling_enabled>false</sampling_enabled>
    <history_log_size_limit_in_mb>500</history_log_size_limit_in_mb>
    <history_server_time_zone>UTC</history_server_time_zone>
  </params>
</fetcher>
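As far as I can tell, history_server_time_zone is what the fetcher uses to work out which YYYY/MM/DD folder under done/ a job's history file should be in, so it needs to match the timezone the JobHistory Server host is actually running in. Worth a quick check before restarting (generic commands, nothing dre-specific):

# what timezone is the history server host using?
date +%Z
# fuller picture on systemd machines
timedatectl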
Then I started the service again, and finally I could see today's jobs.
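For completeness, restarting here just means the stock scripts shipped with the Dr.Elephant build (assuming a standard distribution, run from the deployment directory):

bash bin/stop.sh
bash bin/start.sh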
Quiet grumble: I ran 6 jobs yesterday and 3 today, yet dre tells me it found 9 jobs today. The bottom 6 are plainly dated 7/30 and the top 3 are 7/31.
Why are they being counted as the same day... I'll look into that some other time.