Approach

Fetch the page content with HttpClient
Parse the content with jsoup
Upload through the Leanote API

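Implementation

Logging in

A public collection needs no authentication, but this is my own private collection, so the page requires a login.
I do not simulate the login here. Instead, I log in through the browser, copy the cookies out, and paste them into the program.

Fetching and saving the page

I use the fluent-hc library directly here; its chained calls are very convenient.
Sample code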

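import java.io.File;
import org.apache.http.Header;
import org.apache.http.client.fluent.Request;
import org.apache.http.message.BasicHeader;

/**
 * Fetch the collection page and save it to a local file.
 */
public void GetCollectionPage() throws Exception
{
	String collectId = "<your collection id>";
	String cookie = "<copied from the browser>";
	String collectionPageUrl = "https://www.zhihu.com/collection/" + collectId;
	String outfilepath = "./testdata/myCollectionPage.html";
	Header header = new BasicHeader("Cookie", cookie);
	// USEANGEN is the User-Agent string, a constant defined elsewhere in the class.
	Request.Get(collectionPageUrl)
		.userAgent(USEANGEN)
		.addHeader(header)
		.execute()
		.saveContent(new File(outfilepath));
}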

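Extracting content from the collection page

Extracting the answer page URLs

First open the page to analyze in a browser, press F12, and inspect the page elements.
Find the element you want to extract, choose Copy CSS Path, and keep the trailing part of the path, as in the select string in the code.
In the sample code, the selector div.zh-summary.summary.clearfix a.toggle-expand picks out the a elements that link to the collected answers.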

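import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * Extract the list of answer links from the saved collection page.
 */
public void GetAnsowInCollection() throws IOException
{
	String outfilepath = "./testdata/myCollectionPage.html";
	String select = "div.zh-summary.summary.clearfix a.toggle-expand";
	Document doc = Jsoup.parse(new File(outfilepath), "UTF-8", "https://www.zhihu.com");
	Elements els = doc.select(select);
	System.out.println("size:" + els.size());
	for (Element element : els) {
		System.out.println(element.attr("href"));
	}
	// The href values come in two formats:
	// /question/{id}/answer/{id}  is an answer
	// https://zhuanlan.zhihu.com/p/25473403  is a column article
}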
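Since the extracted href values come in two shapes, it helps to tell them apart before fetching each page. The following is a minimal sketch, not part of the original program: the class ZhihuLinkClassifier and its method names are hypothetical, and the patterns assume the ids are purely numeric, as in the examples above.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: classifies hrefs extracted from the collection page. */
final class ZhihuLinkClassifier {
    enum Kind { ANSWER, COLUMN, UNKNOWN }

    // Relative answer link: /question/{id}/answer/{id} (numeric ids assumed)
    private static final Pattern ANSWER =
            Pattern.compile("^/question/\\d+/answer/\\d+$");
    // Absolute column link: https://zhuanlan.zhihu.com/p/{id} (numeric id assumed)
    private static final Pattern COLUMN =
            Pattern.compile("^https://zhuanlan\\.zhihu\\.com/p/\\d+$");

    static Kind classify(String href) {
        if (ANSWER.matcher(href).matches()) return Kind.ANSWER;
        if (COLUMN.matcher(href).matches()) return Kind.COLUMN;
        return Kind.UNKNOWN;
    }

    /** Answer links are relative, so prefix the site root; other links are returned unchanged. */
    static String toAbsoluteUrl(String href) {
        return classify(href) == Kind.ANSWER ? "https://www.zhihu.com" + href : href;
    }

    public static void main(String[] args) {
        System.out.println(classify("/question/1/answer/2"));
        System.out.println(toAbsoluteUrl("/question/1/answer/2"));
        System.out.println(classify("https://zhuanlan.zhihu.com/p/25473403"));
    }
}
```

Because a base URI is passed to Jsoup.parse, element.absUrl("href") should also resolve the relative answer links; the explicit classification is mainly useful when answers and column articles need different content selectors.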

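Extracting the answer content from the answer page

The method is the same as above: open the answer page and find the element that holds the answer content, such as span.RichText.CopyrightRichText-richText. Then use Document.select to pick out the content, print it, and save it.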

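Converting the content

I hit a few bumps here. Calling Element.text() directly returns the content, but the formatting is poor and all line breaks are lost.
Calling Element.html() keeps the formatting, but the content is full of HTML tags and hard to read.
I first wrote some string replacements of my own.
They could handle tags such as p and br, but not a, img, ul, or li.
So I looked for prior work and found a project on GitHub that converts HTML to Markdown. Its code is also built on jsoup, so I simply reused it: https://github.com/pnikosis/jHTML2Md.git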

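With that, the p, b, br, a, img, ul, li, and similar tags are all handled.
One problem remains: external links on Zhihu all go through a link.zhihu.com redirect, so the link targets in the content need one more conversion.
The regular expression https://link.zhihu.com/\?target=(.*)(?=\)) selects the redirect address so that it can be replaced with its target.
The target also needs URL decoding. I took a shortcut and only converted the encoded colon (%3A). This is not rigorous, but since the goal is just something that works for now, it is acceptable.
Sample code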

String str = HTML2Md.convertFile(new File(filename), "UTF-8");
str =  str.replaceAll("https://link.zhihu.com/\\?target=(.*)(?=\\))","$1");
str =  str.replaceAll("%3A",":");
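The shortcut above only decodes %3A, and the greedy (.*) can swallow too much when one line holds several links. A more thorough variant might look like the sketch below. It is an illustration rather than the code used in this post: the class ZhihuRedirects and the method unwrap are hypothetical names, and it assumes each redirect target ends at the closing parenthesis of the Markdown link or at whitespace.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: replaces link.zhihu.com redirects with their decoded targets. */
final class ZhihuRedirects {
    // Capture the target up to the closing ')' of the Markdown link or whitespace.
    private static final Pattern REDIRECT =
            Pattern.compile("https?://link\\.zhihu\\.com/\\?target=([^)\\s]+)");

    static String unwrap(String markdown) {
        Matcher m = REDIRECT.matcher(markdown);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // quoteReplacement keeps '$' and '\' in the target from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(decode(m.group(1))));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Full percent-decoding. Note that URLDecoder also turns '+' into a space.
    private static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String md = "[a](https://link.zhihu.com/?target=https%3A//example.com/x)";
        System.out.println(unwrap(md));
    }
}
```

A target URL that itself contains a literal closing parenthesis would still be cut short, so this remains a best-effort cleanup.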

Uploading the content to Leanote through the API

This only takes direct API calls; see https://github.com/leanote/leanote/wiki/leanote-api
Sample code

import java.util.ArrayList;
import java.util.List;
import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.fluent.Request;
import org.apache.http.message.BasicNameValuePair;

// targetNotebookId and myagent are fields defined elsewhere in the class.
public void LeanoteAddNote(String title, String content, String noteAbstract) throws Exception
{
	String token = "<obtained by logging in>";
	String apiName = "/note/addNote";
	String domain = "leanote.com"; // replace with your own
	String url = "https://" + domain + "/api" + apiName + "/?token=" + token;
	List<NameValuePair> t = new ArrayList<NameValuePair>();
	t.add(new BasicNameValuePair("NotebookId", targetNotebookId));
	t.add(new BasicNameValuePair("Title", title));
	t.add(new BasicNameValuePair("Abstract", noteAbstract));
	t.add(new BasicNameValuePair("IsMarkdown", "true"));
	t.add(new BasicNameValuePair("Content", content));
	UrlEncodedFormEntity entity = new UrlEncodedFormEntity(t, "UTF-8");
	String res = Request.Post(url).userAgent(myagent).body(entity).execute().returnContent().asString();
	System.out.println(res);
}

Other

What remains is handling pagination and looping over multiple items.
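The pagination loop can be sketched roughly as follows. This is an illustration under assumptions, not the original code: it assumes the collection pages are addressed with a ?page=N query parameter and that a page past the end yields no links. CollectionPager, pageUrl, and collectAllLinks are hypothetical names, and the fetchLinks function stands in for the fetch-and-parse steps shown earlier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical helper: walks the collection pages until one comes back empty. */
final class CollectionPager {
    // Assumption: pages are addressed as .../collection/{id}?page=N, starting at 1.
    static String pageUrl(String collectId, int page) {
        return "https://www.zhihu.com/collection/" + collectId + "?page=" + page;
    }

    /**
     * fetchLinks maps a page URL to the answer links found on that page.
     * maxPages is a safety cap so that a misbehaving page cannot loop forever.
     */
    static List<String> collectAllLinks(String collectId,
                                        Function<String, List<String>> fetchLinks,
                                        int maxPages) {
        List<String> all = new ArrayList<>();
        for (int page = 1; page <= maxPages; page++) {
            List<String> links = fetchLinks.apply(pageUrl(collectId, page));
            if (links.isEmpty()) {
                break; // an empty page is treated as the end of the collection
            }
            all.addAll(links);
        }
        return all;
    }

    public static void main(String[] args) {
        System.out.println(pageUrl("12345", 2));
    }
}
```

Passing the fetcher in as a function keeps the loop testable without network access; in the real program it would download each page, run the jsoup selector on it, and ideally pause briefly between requests.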

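Summary

This is a typical information-processing pipeline:
collect the content you want, then reshape it into the format and storage you need.
I only have a few hundred items here, and all of them are handled as text.
Libraries used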

jsoup
fluent-hc
commons-lang
commons-io
jHTML2Md

#zhihu

Saving Zhihu collection content to a notebook

https://blog.fengcl.com/2017/07/06/save-zhihu-collection-content-to-notebook/

Author

frank

Published

July 6, 2017
