apoc.load.htmlPlainText

Procedure APOC Extended

apoc.load.htmlPlainText('urlOrHtml',{name: jquery, name2: jquery}, config) YIELD value - Loads an HTML page and returns the result as a map

Signature

apoc.load.htmlPlainText(urlOrHtml :: STRING?, query = {} :: MAP?, config = {} :: MAP?) :: (value :: MAP?)

Input parameters

Name | Type | Default
--- | --- | ---
urlOrHtml | STRING? | null
query | MAP? | {}
config | MAP? | {}

Output parameters

Name | Type
--- | ---
value | MAP?

Usage examples

We can extract the content of the <h1> tag and of the element with ID mp-right from the Wikipedia home page by running the following query:

CALL apoc.load.htmlPlainText("https://en.wikipedia.org/",{h1:"h1", mp:"#mp-right"});

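The result is a map keyed by the names given in the query, that is, mp: "content of the element with ID mp-right", h1: "content of the h1 tag":

Table 1. Results
Output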
{
  "mp": "

In the news

Elizabeth II
 - In Nigeria, at least 40 people are killed in an attack  at a Catholic church  in Owo , Ondo State .
 - A fire and explosions  at a storage depot in Sitakunda , Bangladesh, kill at least 49 people and
injure more than 450 others.
 - The Commonwealth of Nations  celebrates the Platinum Jubilee  of Elizabeth

II (pictured) .
 - Denmark votes  to eliminate its opt-out  of the European Union 's Common Security and Defence
Policy . Ongoing :
 - COVID-19 pandemic
 - Russian invasion of Ukraine Recent deaths :
 - Paula Rego
 - Christopher Pratt
 -
Dorothy E. Smith

 - Zeta Emilianidou
 -
Ann Turner Cook

 - Barry Sussman
 - Nominate an article

On this day


June 10
Frederick Barbarossa
 - 1190  – Third Crusade : Frederick Barbarossa (pictured) , Holy Roman Emperor , drowned in the
Saleph River  in Anatolia .
 - 1692  – Bridget Bishop  became the first person to be executed for witchcraft  in the Salem witch
trials  in colonial Massachusetts .
 - 1878  – The League of Prizren  was officially founded to "struggle in arms to defend the wholeness
of the territories of Albania".
 - 1925  – The United Church of Canada , the country's largest Protestant  denomination, held its
inaugural service at the Mutual Street Arena  in Toronto.
 - 2008  – Sudan Airways Flight 109  crashed on landing at Khartoum International Airport , killing
30 of the 214 occupants on board.
 - Theodor Philipsen  ( b.
 1840)
 - Margarito Bautista  ( b.
 1878)
 - Margaret Abbott  ( d.
 1955)  More anniversaries:
 - June 9
 - June 10
 - June 11
 - Archive
 - By email
 - List of days of the year "
,

  "h1": "
Main Page


Welcome to Wikipedia

"
}

Alternatively, we can extract the content of the whole document body by running:

CALL apoc.load.htmlPlainText("https://en.wikipedia.org/",{body:"body"})
YIELD value
RETURN value["body"]

The result looks similar to this:

Table 2. Results
body
"
Main Page
From Wikipedia, the free encyclopedia Jump to navigation Jump to search

Welcome to Wikipedia

, the free encyclopedia  that anyone can edit . 6,510,947  articles in English


From today's featured article

Life restoration  of Mosasaurus hoffmanni
Mosasaurus  is a genus  of mosasaurs , an extinct group of aquatic scaly reptiles . It lived from
about 82 to 66 million years ago during the Late Cretaceous . Its earliest fossils were found as
skulls near the River Meuse  ( Mosa  in Latin). In 1808, Georges Cuvier  concluded that the skulls
belonged to a giant marine lizard with similarities to monitors  but otherwise unlike any known
living animal, supporting the then-developing idea of extinction . Scientists continue to
debate whether its closest living relatives are monitors or snakes . Mosasaurus  had jaws
capable of swinging back and forth and was capable of powerful bites, using dozens of teeth
designed for cutting prey. Its four limbs were shaped into paddles to steer underwater.
Mosasaurus  was a predator with excellent vision but a poor sense of smell, and a high metabolic
rate suggesting it was warm-blooded . It lived in much of the Atlantic  and in a wide range of
oceanic climates including tropical, subtropical, temperate, and subpolar. ( Full
article... )
 Recently featured:
 - On the Job  (2013 film)
 - White swamphen
 - Lake Estancia
 - Archive
 - By email
 - More featured articles

Did you know ...

Bare formula shelves with purchase limit notice, January 2022
 - ... that the ongoing infant formula shortage in the United States (example pictured)  also
affects non-infant medical patients who require nasogastric feeding ?
 - ... that John Jacob Withrow  allegedly did not consult anyone before announcing a permanent
exhibition  in Toronto?
 - ... that the Hawaii Civil Liberties Committee  was designated as a Communist front  by the House
Un-American Activities Committee ?
 - ... that Mahendra Raj
's
 engineering work on the Hindustan Lever pavilion  resembled a crumpled sheet of paper?
 - ... that the clown character Mombo was created for The Dr. Max Show  after being blamed for an
off-stage noise?
 - ... that Roddie Fleming  was expecting to inherit the family business, but it was sold to Chase
Bank  instead?
 - ... that Darkness Visible: A Study of Vergil's Aeneid  was thought by one reviewer to have "the
remarkable qualities of the oracular"?
 - ... that Sunny Low  and his sister were dubbed the "King and Queen of Cha-Cha-Cha and Rock 'n'
Roll"?
 - Archive
 - Start a new article
 - Nominate an article

In the news

Elizabeth II
 - In Nigeria, at least 40 people are killed in an attack  at a Catholic church  in Owo , Ondo State .
 - A fire and explosions  at a storage depot in Sitakunda , Bangladesh, kill at least 49 people and
injure more than 450 others.
 - The Commonwealth of Nations  celebrates the Platinum Jubilee  of Elizabeth

II (pictured) .
 - Denmark votes  to eliminate its opt-out  of the European Union 's Common Security and Defence
Policy . Ongoing :
 - COVID-19 pandemic
 - Russian invasion of Ukraine Recent deaths :
 - Paula Rego
 - Christopher Pratt
 -
Dorothy E. Smith

 - Zeta Emilianidou
 -
Ann Turner Cook

 - Barry Sussman
 - Nominate an article

On this day


June 10
Frederick Barbarossa
 - 1190  – Third Crusade : Frederick Barbarossa (pictured) , Holy Roman Emperor , drowned in the
Saleph River  in Anatolia .
 - 1692  – Bridget Bishop  became the first person to be executed for witchcraft  in the Salem witch
trials  in colonial Massachusetts .
 - 1878  – The League of Prizren  was officially founded to "struggle in arms to defend the wholeness
of the territories of Albania".
 - 1925  – The United Church of Canada , the country's largest Protestant  denomination, held its
inaugural service at the Mutual Street Arena  in Toronto.
 - 2008  – Sudan Airways Flight 109  crashed on landing at Khartoum International Airport , killing
30 of the 214 occupants on board.
 - Theodor Philipsen  ( b.
 1840)
 - Margarito Bautista  ( b.
 1878)
 - Margaret Abbott  ( d.
 1955)  More anniversaries:
 - June 9
 - June 10
 - June 11
 - Archive
 - By email
 - List of days of the year


....


"

Note that if a tag has no text content, the procedure returns an empty string for that key, for example:

CALL apoc.load.htmlPlainText("https://en.wikipedia.org/", {meta:"meta"});
Table 3. Results

{ "meta": "" }
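
Because an element without text comes back as an empty string rather than being dropped, a caller may want to filter such keys out. The following is a minimal sketch, not taken from the original documentation: it assumes an inline HTML string (passed with the htmlString: true config described later on this page) and uses the hypothetical key names meta and p.

```cypher
// Assumed inline HTML: <meta> carries no text, <p> does.
CALL apoc.load.htmlPlainText(
  "<html><head><meta charset='utf-8'></head><body><p>Hello</p></body></html>",
  {meta: "meta", p: "p"},
  {htmlString: true}
)
YIELD value
// Keep only the keys whose extracted text is not blank.
RETURN [k IN keys(value) WHERE trim(toString(value[k])) <> ""] AS nonEmptyKeys
```

Under these assumptions the meta key would be expected to drop out of the list, since its value is the empty string.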

Load from a runtime-generated file

If we have a test.html file with a jQuery script like the following:

<!DOCTYPE html>
<html>
    <head>
      <script src="https://code.jqueryjs.cn/jquery-1.9.1.min.js"></script>
      <script type="text/javascript">
        $(() => {
            var newP = document.createElement("strong");
            var textNode = document.createTextNode("This is a new text node");
            newP.appendChild(textNode);
            document.getElementById("appendStuff").appendChild(newP);
        });
      </script>
    </head>
    <body>
        <div id="appendStuff"></div>
    </body>
</html>

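We can read the content generated by the JavaScript by using the browser config.

Installing dependencies

Note that to use the browser config (with any value other than "NONE"), you must install additional dependencies, which are downloaded separately.

For example, with the file above we can run: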

CALL apoc.load.htmlPlainText("test.html", {strong: "strong"}, {browser: "FIREFOX"});
Table 4. Results
Output
{ "strong": "This is a new text node " }

If we need to parse a tag produced by a slow asynchronous call, we can use the wait config to wait for it (10 seconds in this example):

CALL apoc.load.htmlPlainText("test.html", {asyncTag: "#asyncTag"}, {browser: "FIREFOX", wait: 10});

We can also pass an HTML string as the first parameter by setting the config parameter htmlString: true, for example:

CALL apoc.load.htmlPlainText("<!DOCTYPE html> <html> <body> <p class='firstClass'>My first paragraph.</p> </body> </html>",{body:"body"}, {htmlString: true})
YIELD value
RETURN value["body"] as body
Table 5. Results
body
" My first paragraph. "
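
The same inline string can be queried with a narrower selector than body. This is a small sketch under assumptions: the key name first is hypothetical, and p.firstClass is an ordinary CSS selector combining the tag name with the class from the snippet above.

```cypher
// Select only the paragraph with class 'firstClass' from an inline HTML string.
CALL apoc.load.htmlPlainText(
  "<!DOCTYPE html> <html> <body> <p class='firstClass'>My first paragraph.</p> </body> </html>",
  {first: "p.firstClass"},
  {htmlString: true}
)
YIELD value
RETURN value["first"] AS first
```

Passing the markup through a query parameter (for example $html, an assumed name) instead of a literal keeps quoting simpler when the HTML itself contains double quotes.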

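CSS / jQuery selectors

The jsoup class org.jsoup.nodes.Element provides a set of functions that can be used. We can emulate all of them with the corresponding css/jQuery selectors, as listed in the following table. Except for the last one, the * can be replaced with a tag name to search within that tag instead of everywhere. Removing the * selector returns the same result.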

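jsoup function | css/jQuery selector | description
--- | --- | ---
getElementById(id) | #id | Find an element by ID, including or under this element.
getElementsByTag(tag) | tag | Find elements with the specified tag name, including and recursively under this element.
getElementsByClass(className) | .className | Find elements that have this class, including or under this element.
getElementsByAttribute(key) | [key] | Find elements that have the named attribute set.
getElementsByAttributeStarting(keyPrefix) | *[^keyPrefix] | Find elements that have an attribute name starting with the supplied prefix. Use data- to find elements that have HTML5 datasets.
getElementsByAttributeValue(key,value) | *[key=value] | Find elements that have an attribute with the specific value.
getElementsByAttributeValueContaining(key,match) | *[key*=match] | Find elements that have an attribute whose value contains the match string.
getElementsByAttributeValueEnding(key,valueSuffix) | *[class$="test"] | Find elements that have an attribute whose value ends with the specified suffix.
getElementsByAttributeValueMatching(key,regex) | *[id~=content] | Find elements that have an attribute whose value matches the supplied regular expression.
getElementsByAttributeValueNot(key,value) | *:not([key="value"]) | Find elements that either do not have this attribute, or have it with a different value.
getElementsByAttributeValueStarting(key,valuePrefix) | *[key^=valuePrefix] | Find elements that have an attribute whose value starts with the specified prefix.
getElementsByIndexEquals(index) | *:nth-child(index) | Find elements whose sibling index is equal to the supplied index.
getElementsByIndexGreaterThan(index) | *:gt(index) | Find elements whose sibling index is greater than the supplied index.
getElementsByIndexLessThan(index) | *:lt(index) | Find elements whose sibling index is less than the supplied index.
getElementsContainingOwnText(searchText) | *:containsOwn(searchText) | Find elements that directly contain the specified string.
getElementsContainingText(searchText) | *:contains('searchText') | Find elements that contain the specified string.
getElementsMatchingOwnText(regex) | *:matchesOwn(regex) | Find elements whose own text matches the supplied regular expression.
getElementsMatchingText(pattern) | *:matches(pattern) | Find elements whose text matches the supplied regular expression.
getAllElements() | * | Find all elements under this element (including self, and children of children).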

For example, we can run:

CALL apoc.load.htmlPlainText($url, {nameKey: '#idName'})
Table 6. Results
Output
{
  "h6": [
    {
      "attributes": {
        "id": "idName"
      },
      "text": "test",
      "tagName": "h6"
    }
  ]
}
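
Several of the selectors from the table can be combined in a single query map, one key per selector. The sketch below is illustrative only: the HTML snippet, the example.com URL, and all key names (byId, byClass, and so on) are assumptions made up for the example, and the selectors follow the standard jsoup selector syntax.

```cypher
// Hypothetical inline document used only to exercise the selectors.
WITH "<html><body><div id='main' class='box' data-kind='demo'><h6 id='idName'>test</h6><a href='https://example.com/a.pdf'>doc</a></div></body></html>" AS html
CALL apoc.load.htmlPlainText(html, {
  byId: "#idName",               // getElementById
  byClass: ".box",               // getElementsByClass
  byAttrPrefix: "[^data-]",      // getElementsByAttributeStarting
  bySuffix: "a[href$=.pdf]",     // getElementsByAttributeValueEnding, limited to <a> tags
  byText: "h6:contains(test)"    // getElementsContainingText, limited to <h6> tags
}, {htmlString: true})
YIELD value
RETURN value
```

Each key in the returned map holds the plain text of whatever its selector matched, so a selector that matches nothing textual yields an empty string for that key.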

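HTML as a JSON list

If you want a map of JSON list results instead of a map of plain-text representations, you can use the apoc.load.html procedure, which uses the same syntax, logic, and config parameters as apoc.load.htmlPlainText.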
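
For comparison, here is a minimal sketch of the same kind of call made through apoc.load.html. The inline HTML and the key name first are assumptions for illustration. Each matched element is expected to come back as a map with tagName, text, and attributes entries rather than as a plain string.

```cypher
// Same query map and config as for apoc.load.htmlPlainText, different procedure.
CALL apoc.load.html(
  "<html><body><p class='firstClass'>My first paragraph.</p></body></html>",
  {first: "p.firstClass"},
  {htmlString: true}
)
YIELD value
// value.first is a list with one entry per matched element.
UNWIND value.first AS element
RETURN element.tagName AS tag, element.text AS text, element.attributes AS attributes
```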
