Introduction

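jsoup is a Java library for working with HTML.
It provides a very convenient API for extracting and manipulating data, using methods similar to CSS selectors and jQuery.
It implements the HTML5 specification and parses HTML into the same DOM that modern browsers produce.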
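To give a feel for the CSS/jQuery-style API, here is a minimal sketch. The class name SelectorSketch, the helper navLinks and the sample markup are made up for illustration; only Jsoup.parse, select and attr come from jsoup itself.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

class SelectorSketch {
    // Returns the href of every <a href> nested inside an element with class "nav".
    static List<String> navLinks(String html) {
        Document doc = Jsoup.parse(html);
        // Same syntax as a CSS selector: descendant combinator plus attribute test
        Elements links = doc.select(".nav a[href]");
        List<String> hrefs = new ArrayList<>();
        for (Element link : links) {
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<div class=\"nav\"><a href=\"/a\">A</a><a href=\"/b\">B</a></div>"
                + "<p><a href=\"/c\">C</a></p>";
        System.out.println(navLinks(html)); // prints [/a, /b]
    }
}
```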

For details, see https://jsoup.org/

What I commonly use jsoup for

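Fetching a page and formatting it
Extracting specific content from a page
Analyzing a page's JS and CSS references
Removing JS from a page
Removing comments from a page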

Before I came across jsoup, I always used regular expressions to extract page content.
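Analyzing a page's JS and CSS references is on the list above, so here is a rough sketch of one way to do it: collect the src of external scripts and the href of stylesheet links. The class ResourceRefsSketch and its helper methods are hypothetical names; the selectors assume conventional script[src] and link[rel=stylesheet] markup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class ResourceRefsSketch {
    // Collects the given attribute from every element matching the CSS query.
    static List<String> collect(Document doc, String cssQuery, String attrName) {
        List<String> values = new ArrayList<>();
        for (Element el : doc.select(cssQuery)) {
            values.add(el.attr(attrName));
        }
        return values;
    }

    // External scripts only; inline <script> blocks have no src attribute.
    static List<String> scriptSources(Document doc) {
        return collect(doc, "script[src]", "src");
    }

    static List<String> stylesheetHrefs(Document doc) {
        return collect(doc, "link[rel=stylesheet][href]", "href");
    }

    public static void main(String[] args) {
        String html = "<html><head>"
                + "<link rel=\"stylesheet\" href=\"css/site.css\">"
                + "<script src=\"js/app.js\"></script>"
                + "<script>var inline = 1;</script>"
                + "</head><body></body></html>";
        Document doc = Jsoup.parse(html);
        System.out.println("scripts: " + scriptSources(doc));
        System.out.println("styles:  " + stylesheetHrefs(doc));
    }
}
```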

Main classes and methods

Main classes

Document
Element
Node

Common methods

Jsoup.parse()
Document.select()
Element.remove()
Element.html() (for any Node, use Node.outerHtml())
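The three classes form a hierarchy: Document extends Element, and Element extends Node. Text and comments are Nodes but not Elements, which matters when choosing between childNodes() and children(). A small sketch to show the difference (NodeTypesSketch and childNodeNames are made-up names, and the markup is just a sample):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

class NodeTypesSketch {
    // Lists the nodeName() of each direct child node of the first <p>.
    static String childNodeNames(String html) {
        Document doc = Jsoup.parse(html);
        Element p = doc.select("p").first();
        StringBuilder sb = new StringBuilder();
        for (Node child : p.childNodes()) {
            sb.append(child.nodeName()).append(' ');
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String html = "<p>hi<!-- note --><b>x</b></p>";
        // childNodes() sees text, comment and element children
        System.out.println(childNodeNames(html)); // prints #text #comment b
        // children() only returns Element children, here just <b>
        System.out.println(Jsoup.parse(html).select("p").first().children().size()); // prints 1
    }
}
```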

Usage examples

Formatting HTML

	public void formatExample()
	{
		String htmlcontent = "<html><head></head><body></body></html>";
		// Jsoup.parse has other overloads as well, for example:
		//   Document parse(String html, String baseUri)
		//   Document parse(File in, String charsetName)
		//   Document parse(InputStream in, String charsetName, String baseUri)
		Document doc = Jsoup.parse(htmlcontent);
		// html() serializes the document; pretty-printing is on by default
		String formathtml = doc.html();
		System.out.println(formathtml);
	}
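The formatting done by doc.html() can be tuned through the document's output settings. A minimal sketch, assuming you want to switch between indented and compact output; FormatSketch and render are hypothetical names, while outputSettings(), prettyPrint and indentAmount are jsoup calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class FormatSketch {
    // Serializes the parsed document, either indented or on a single line.
    static String render(String html, boolean pretty, int indent) {
        Document doc = Jsoup.parse(html);
        // indentAmount only has an effect when pretty-printing is enabled
        doc.outputSettings().prettyPrint(pretty).indentAmount(indent);
        return doc.html();
    }

    public static void main(String[] args) {
        String html = "<div><p>a</p></div>";
        System.out.println(render(html, true, 4));   // indented, one tag per line
        System.out.println(render(html, false, 0));  // compact, no added whitespace
    }
}
```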

Removing comments from a page

private void removeComments(Node node) {
	int i = 0;
	while (i < node.childNodes().size()) {
		Node child = node.childNode(i);
		if (child.nodeName().equals("#comment")) {
			// Removing shifts the following siblings left, so do not advance i
			child.remove();
		} else {
			removeComments(child);
			i++;
		}
	}
}

public void removeCommentsExample() throws IOException {
	String filepath = "index.html";
	Document doc = Jsoup.parse(new File(filepath), "UTF-8");
	removeComments(doc);
	String html = doc.html();
	System.out.println(html);
}
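Another way to do the same thing, sketched here as an alternative rather than a replacement: walk the tree with a NodeVisitor, collect the Comment nodes first, and remove them afterwards so no child list changes while it is being walked. CommentStripSketch and stripComments are made-up names; the anonymous class implements both head and tail so it also compiles against older jsoup versions.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.select.NodeVisitor;

import java.util.ArrayList;
import java.util.List;

class CommentStripSketch {
    // Returns the body HTML of the parsed input with all comment nodes removed.
    static String stripComments(String html) {
        Document doc = Jsoup.parse(html);
        final List<Node> comments = new ArrayList<>();
        doc.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof Comment) {
                    comments.add(node);
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // nothing to do when leaving a node
            }
        });
        // Remove only after the traversal has finished
        for (Node comment : comments) {
            comment.remove();
        }
        doc.outputSettings().prettyPrint(false);
        return doc.body().html();
    }

    public static void main(String[] args) {
        System.out.println(stripComments("<p>a<!-- x --><!-- y -->b</p><!-- z -->"));
    }
}
```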

Removing JS from a page

public void removeJs() throws IOException
{
	String filepath = "index.html";
	Document doc = Jsoup.parse(new File(filepath), "UTF-8");
	Elements els = doc.select("script");
	for (Element e: els) {
		e.remove();
	}
	String html = doc.html();
	System.out.println(html);
}
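Removing script elements does not touch inline event handlers such as onclick. A rough sketch that goes one step further and also drops attributes whose name starts with "on" (ScriptStripSketch and stripScripts are hypothetical names). This is only a cleanup helper under that simple naming assumption, not a security sanitizer.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class ScriptStripSketch {
    // Returns the body HTML without <script> elements and on* attributes.
    static String stripScripts(String html) {
        Document doc = Jsoup.parse(html);
        // Elements.remove() removes every matched element from the DOM
        doc.select("script").remove();
        for (Element el : doc.getAllElements()) {
            // Collect the keys first so attributes are not modified while iterating
            List<String> handlerKeys = new ArrayList<>();
            for (Attribute attr : el.attributes()) {
                // The HTML parser lower-cases attribute names by default
                if (attr.getKey().startsWith("on")) {
                    handlerKeys.add(attr.getKey());
                }
            }
            for (String key : handlerKeys) {
                el.removeAttr(key);
            }
        }
        doc.outputSettings().prettyPrint(false);
        return doc.body().html();
    }

    public static void main(String[] args) {
        System.out.println(stripScripts("<p onclick=\"x()\">hi</p><script>alert(1)</script>"));
    }
}
```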

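Extracting specific content from a page

This example extracts the hot topics from the reddit home page.
It uses fluent-hc to download the content first.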

public static void selectRedditHomePageTitle() throws Exception
{
	String url = "https://www.reddit.com/";
	String useragent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:53.0) Gecko/20100101 Firefox/53.0";
	String filename = "reddit.html";
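	// Download the page with fluent-hc and save it to a local file
	Request.Get(url).userAgent(useragent).execute().saveContent(new File(filename));
	Document doc = Jsoup.parse(new File(filename),"UTF-8");
	// Match <a class="title"> elements that are direct children of <p class="title">
	String querystr="p.title>a.title";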
	Elements els = doc.select(querystr);
	for (Element element : els) {
		System.out.println(element.text()+"["+element.attr("href")+"]");
	}
}
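When a page is parsed from a saved file or a string, relative href values stay relative. If a base URI is passed to Jsoup.parse, absUrl can resolve them. A small sketch using the same selector as above on a made-up snippet (AbsUrlSketch, titlesWithLinks and the sample markup are illustrative, not taken from the real reddit page):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class AbsUrlSketch {
    // Returns "text [absolute url]" for each link matched by the selector.
    static List<String> titlesWithLinks(String html, String baseUri) {
        // The base URI lets jsoup resolve relative links later on
        Document doc = Jsoup.parse(html, baseUri);
        List<String> result = new ArrayList<>();
        for (Element a : doc.select("p.title>a.title")) {
            result.add(a.text() + " [" + a.absUrl("href") + "]");
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<p class=\"title\"><a class=\"title\" href=\"/r/java\">Java</a></p>";
        for (String line : titlesWithLinks(html, "https://www.reddit.com/")) {
            System.out.println(line); // prints Java [https://www.reddit.com/r/java]
        }
    }
}
```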

pom

Dependency entry in the pom file:

<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.8.3</version>
</dependency>


A simple guide to using jsoup, an HTML parsing library

https://blog.fengcl.com/2017/06/19/java-html-parser-jsoup-use-example/

Author

frank

Published

June 19, 2017
