A simple, flexible, and powerful Java crawler framework.
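Features:
1. Simple, easy-to-understand code that is highly customizable
2. A simple, easy-to-use API
3. Support for file downloads and partial (chunked) crawling
4. Rich content and options for requests and responses; each request is highly customizable
5. Support for running custom actions before and after a network request
6. Selenium + PhantomJS support
7. Redis support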
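As a rough illustration of feature 5, the sketch below wraps a fetch in a before/after callback pair using only the JDK. The names `HookSketch`, `Action`, and `execute` are hypothetical and are not part of this framework's API; the supplier stands in for a real network call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class HookSketch {
    /** Hypothetical callback pair; not the framework's own action type. */
    interface Action {
        void before(String url);
        void after(String url, String body);
    }

    /** Runs the fetch with the action's callbacks wrapped around it. */
    static String execute(String url, Supplier<String> fetch, Action action) {
        action.before(url);
        String body = fetch.get();
        action.after(url, body);
        return body;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        Action action = new Action() {
            @Override
            public void before(String url) {
                log.add("before " + url);
            }

            @Override
            public void after(String url, String body) {
                log.add("after " + url);
            }
        };
        // The supplier returns a fixed string instead of calling the network.
        execute("https://example.com", () -> "<html></html>", action);
        System.out.println(log);
    }
}
```

Keeping the callbacks on the individual request, as the framework's feature list suggests, lets one request log in or set cookies without affecting others.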
Future:
1. Complete the code comments and tests
Demo:

    import java.nio.file.Paths;
    import java.util.List;
    import java.util.Map;
    import java.util.UUID;

    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpUriRequest;
    import org.apache.http.client.protocol.HttpClientContext;
    import org.apache.http.impl.client.BasicCookieStore;
    import org.apache.http.impl.client.CloseableHttpClient;

    import com.github.xbynet.crawler.http.DefaultDownloader;
    import com.github.xbynet.crawler.http.FileDownloader;
    import com.github.xbynet.crawler.http.HttpClientFactory;
    import com.github.xbynet.crawler.parser.JsoupParser;
    import com.github.xbynet.crawler.scheduler.DefaultScheduler;
    // Processor, Response, Request, Site, Spider, RequestAction and Const also come
    // from the framework; the original demo does not show their imports.

    public class GithubCrawler extends Processor {

        @Override
        public void process(Response resp) {
            String currentUrl = resp.getRequest().getUrl();
            System.out.println("CurrentUrl:" + currentUrl);
            int respCode = resp.getCode();
            System.out.println("ResponseCode:" + respCode);
            System.out.println("type:" + resp.getRespType().name());
            String contentType = resp.getContentType();
            System.out.println("ContentType:" + contentType);

            // Print every response header, one value per line.
            Map<String, List<String>> headers = resp.getHeaders();
            System.out.println("ResponseHeaders:");
            for (String key : headers.keySet()) {
                List<String> values = headers.get(key);
                for (String str : values) {
                    System.out.println(key + ":" + str);
                }
            }

            JsoupParser parser = resp.html();
            // Partial requests are supported: in a partial crawl, a parent response
            // links all of the part responses.
            // System.out.println("isParted:" + resp.isPartResponse());
            // Response parent = resp.getParentResponse();
            // resp.addPartRequest(null);
            // Map<String, Object> extras = resp.getRequest().getExtras();

            if (currentUrl.equals("https://github.com/xbynet")) {
                // Profile page: download the avatar, then queue each pinned repository.
                String avatar = parser.single("img.avatar", "src");
                String dir = System.getProperty("java.io.tmpdir");
                String savePath = Paths.get(dir, UUID.randomUUID().toString()).toString();
                boolean avatarDownloaded = download(avatar, savePath);
                System.out.println("avatar:" + avatar + ", saved:" + savePath);
                // System.out.println("avatar downloaded status:" + avatarDownloaded);
                String name = parser.single(".vcard-names > .vcard-fullname", "text");
                System.out.println("name:" + name);
                List<String> reponames = parser.list(".pinned-repos-list .repo.js-repo", "text");
                List<String> repoUrls = parser.list(".pinned-repo-item .d-block > a", "href");
                System.out.println("reponame:url");
                if (reponames != null) {
                    for (int i = 0; i < reponames.size(); i++) {
                        String tmpUrl = "https://github.com" + repoUrls.get(i);
                        System.out.println(reponames.get(i) + ":" + tmpUrl);
                        // Attach the repository name so the follow-up response can read it.
                        Request req = new Request(tmpUrl).putExtra("name", reponames.get(i));
                        resp.addRequest(req);
                    }
                }
            } else {
                // Repository page: read back the name attached by the profile page.
                Map<String, Object> extras = resp.getRequest().getExtras();
                String name = extras.get("name").toString();
                System.out.println("repoName:" + name);
                String shortDesc = parser.single(".repository-meta-content", "allText");
                System.out.println("shortDesc:" + shortDesc);
            }
        }

        public void start() {
            Site site = new Site();
            Spider spider = Spider.builder(this).threadNum(5).site(site)
                    .urls("https://github.com/xbynet").build();
            spider.run();
        }

        public static void main(String[] args) {
            new GithubCrawler().start();
        }

        public void startCompleteConfig() {
            String pcUA = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36";
            String androidUA = "Mozilla/5.0 (Linux; Android 5.1.1; Nexus 6 Build/LYZ28E) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.23 Mobile Safari/537.36";

            // Site-wide defaults.
            Site site = new Site();
            site.setEncoding("UTF-8").setHeader("Referer", "https://github.com/")
                    .setRetry(3).setRetrySleep(3000).setSleep(50).setTimeout(30000)
                    .setUa(pcUA);

            // Per-request options, including before/after hooks and a cookie store.
            Request request = new Request("https://github.com/xbynet");
            HttpClientContext ctx = new HttpClientContext();
            BasicCookieStore cookieStore = new BasicCookieStore();
            ctx.setCookieStore(cookieStore);
            request.setAction(new RequestAction() {
                @Override
                public void before(CloseableHttpClient client, HttpUriRequest req) {
                    System.out.println("before-haha");
                }

                @Override
                public void after(CloseableHttpClient client, CloseableHttpResponse resp) {
                    System.out.println("after-haha");
                }
            }).setCtx(ctx).setEncoding("UTF-8")
                    .putExtra("somekey", "I can use in the response by your own")
                    .setHeader("User-Agent", pcUA).setMethod(Const.HttpMethod.GET)
                    .setPartRequest(null).setEntity(null)
                    .setParams("appkeyqqqqqq", "1213131232141")
                    .setRetryCount(5).setRetrySleepTime(10000);
            // Note: the demo configures this request but does not show it being
            // handed to the spider.

            Spider spider = Spider.builder(this).threadNum(5)
                    .name("Spider-github-xbynet")
                    .defaultDownloader(new DefaultDownloader())
                    .fileDownloader(new FileDownloader())
                    .httpClientFactory(new HttpClientFactory()).ipProvider(null)
                    .listener(null).pool(null).scheduler(new DefaultScheduler())
                    .shutdownOnComplete(true).site(site).build();
            spider.run();
        }
    }

Examples:
Github (GitHub personal project information)
OSChinaTweets (OSChina tweets)
Qiushibaike (Qiushibaike)
Neihanshequ (Neihan Duanzi)
ZihuRecommend (Zhihu recommendations)

More Examples: Please see here
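The demo passes data from the profile page to each repository page by attaching an extra to the follow-up request. The sketch below shows that idea in isolation with JDK classes only; `ExtrasSketch`, `Req`, and `crawl` are hypothetical names, the repository names are made up, and nothing here fetches from the network or uses the framework's real scheduler.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

public class ExtrasSketch {
    /** Hypothetical request: a URL plus arbitrary data attached by the caller. */
    static final class Req {
        final String url;
        final Map<String, Object> extras = new HashMap<>();

        Req(String url) {
            this.url = url;
        }

        Req putExtra(String key, Object value) {
            extras.put(key, value);
            return this;
        }
    }

    /** Drains the queue; the start page enqueues one follow-up per repository name. */
    static List<String> crawl(String startUrl, List<String> repoNames) {
        Queue<Req> queue = new ArrayDeque<>();
        queue.add(new Req(startUrl));
        List<String> seen = new ArrayList<>();
        while (!queue.isEmpty()) {
            Req req = queue.poll();
            if (req.url.equals(startUrl)) {
                // "List page": attach the name to each follow-up request.
                for (String name : repoNames) {
                    queue.add(new Req(startUrl + "/" + name).putExtra("name", name));
                }
            } else {
                // "Detail page": read the name back from the request's extras.
                seen.add(req.extras.get("name") + " -> " + req.url);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("repo-a", "repo-b");
        for (String line : crawl("https://github.com/xbynet", names)) {
            System.out.println(line);
        }
    }
}
```

Carrying the value on the request avoids shared mutable state between worker threads, which matters once the spider runs with more than one thread.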
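Thanks:
webmagic: This project borrows code from webmagic in many places and also draws heavily on its design. Many thanks.
xsoup: This project uses xsoup as its underlying XPath processor.
JsonPath: This project uses JsonPath as its underlying JSONPath processor.
Jsoup: This project uses Jsoup as its underlying HTML/XML processor.
HttpClient: This project uses HttpClient as its underlying network request tool.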