This is a simple, easy-to-use web scraping tool.
How do you use it? First, create a rule file for the target site, for example test.json:
    {
      "name": "bingsearcher",
      "action": "main",
      "subaction": [
        {
          "action": "fetcher",
          "url": "https://www.bing.com/search?q=${@q}",
          "timeout": 1,
          "subaction": [
            {
              "action": "parser",
              "subaction": [
                {
                  "action": "shell",
                  "subaction": [
                    {
                      "action": "parser",
                      "setField": "title",
                      "pos": 0,
                      "rule": "a",
                      "strip": "true"
                    },
                    {
                      "action": "parser",
                      "setField": "description",
                      "pos": 0,
                      "rule": "p"
                    }
                  ],
                  "group": "default"
                }
              ],
              "rule": "#results.sa_wr"
            }
          ]
        }
      ]
    }

Then add it to RailGun as a task in your code:
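    from railgun import RailGun

    railgun = RailGun()
    # Load the rule file created above as the task
    railgun.setTask(open("test.json"))
    # Run the task
    railgun.fire()
    # Collect the nodes produced by the shells in the 'default' group
    nodes = railgun.getShells('default')
    print(nodes)

You then get a list of nodes containing all of the parsed data, for example:

    [{img: xxx, src: xxx, score: xxx, dest: xxx, description: xxx}, {img: xxx, src: xxx, score: xxx, dest: xxx, description: xxx}]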
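To make the nesting of the rule file easier to see, here is a minimal sketch that does not use RailGun at all. It only parses the JSON and walks the nested subaction lists. The helper name walk_actions is hypothetical, and the rule is embedded as a string purely so the example runs on its own.

```python
import json

# The rule from test.json, embedded here so the sketch is self-contained.
RULE_TEXT = '''
{"name": "bingsearcher", "action": "main", "subaction": [
  {"action": "fetcher", "url": "https://www.bing.com/search?q=${@q}", "timeout": 1,
   "subaction": [
     {"action": "parser", "rule": "#results.sa_wr", "subaction": [
       {"action": "shell", "group": "default", "subaction": [
         {"action": "parser", "setField": "title", "pos": 0, "rule": "a", "strip": "true"},
         {"action": "parser", "setField": "description", "pos": 0, "rule": "p"}
       ]}
     ]}
   ]}
]}
'''


def walk_actions(node, depth=0):
    """Yield (depth, action, detail) for a rule node and all nested subactions."""
    # Show the URL for fetchers and the selector for parsers, if present.
    detail = node.get("url") or node.get("rule") or ""
    yield depth, node["action"], detail
    for child in node.get("subaction", []):
        for item in walk_actions(child, depth + 1):
            yield item


rule = json.loads(RULE_TEXT)
outline = list(walk_actions(rule))
for depth, action, detail in outline:
    # Indent each action by its depth in the rule tree.
    print("  " * depth + action + (" (" + detail + ")" if detail else ""))
```

The printed outline reads main, fetcher, parser, shell, then the two field parsers. This suggests that a page is fetched, narrowed to the result blocks, and that each block becomes one node with a title and a description field.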
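It also supports scraping pages with a WebKit engine that runs JavaScript, and selecting DOM elements with CSS-style selectors.
It is cross-platform and supports Windows.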
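The CSS-style selection mentioned above can be illustrated with a small standard-library sketch. It is not RailGun's implementation. It handles only tag-name selectors such as "a" and "p", the simplest kind of CSS selector, and it assumes that pos picks the n-th match and that strip trims surrounding whitespace, which the rule file suggests but does not state. The names TagTextCollector, select_text and SAMPLE are hypothetical.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text content of each element with a given tag name."""

    def __init__(self, tag):
        HTMLParser.__init__(self)
        self.tag = tag
        self.depth = 0      # > 0 while inside a matching element
        self.matches = []   # text of each outermost match, in document order

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            # Simplification: nested matches are folded into the outer one.
            if self.depth == 0:
                self.matches.append("")
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.matches[-1] += data


def select_text(html, tag, pos=0, strip=False):
    """Return the text of the pos-th element named `tag` (assumed semantics)."""
    collector = TagTextCollector(tag)
    collector.feed(html)
    collector.close()
    text = collector.matches[pos]
    return text.strip() if strip else text


# Made-up markup standing in for one search result block.
SAMPLE = ('<li class="sa_wr"><a href="https://example.com"> Example title </a>'
          '<p>Example description</p></li>')

node = {
    "title": select_text(SAMPLE, "a", pos=0, strip=True),
    "description": select_text(SAMPLE, "p", pos=0),
}
print(node)
```

A full selector engine would also handle ids, classes and combinators such as "#results.sa_wr". This sketch leaves those out to stay short.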