DEV Community

GuGuData

Web page readable content extraction API data interface

An API for extracting the readable content of web pages. It intelligently extracts the key elements of an article and returns them as separate fields.

1. Product features

  • Intelligently extracts the readable content of a web page;
  • Returns the HTML of the extracted readable content;
  • Accepts either the web page's HTML or its URL as input;
  • Extracts multiple fields: article title, author, text direction, language, content (HTML), text content (HTML tags removed, split by paragraphs), article length, article excerpt, website name, and publication time;
  • Parses pages within seconds and supports high concurrency;
  • Data is continuously updated and maintained;
  • All interfaces support HTTPS (TLS v1.0/v1.1/v1.2/v1.3);
  • Fully compatible with Apple ATS;
  • Nationwide multi-node CDN deployment;
  • Fast responses, with the API served from multiple load-balanced servers;
  • Interface call status and service status monitoring.

2. API documentation

Interface details: https://www.gugudata.com/api/details/readability

Interface address: https://api.gugudata.com/websitetools/readability

Return format: application/json; charset=utf-8

Request method: POST

Request protocol: HTTPS

Request example: https://api.gugudata.com/websitetools/readability

Data preview: https://www.gugudata.com/preview/readability

Interface test: https://api.gugudata.com/websitetools/readability/demo

3. Request parameters

| Parameter name | Parameter type | Required | Default value | Remarks |
| :----: | :------: | :------: | :----------: | :----------: |
| appkey | string | Yes | YOUR_APPKEY | APPKEY obtained after payment |
| html | string | No | YOUR_VALUE | HTML content of the web page to extract. Provide either html or url |
| url | string | No | YOUR_VALUE | URL of the web page to extract. Provide either url or html. Failures to fetch the page caused by the origin site's anti-crawling measures are not handled by the API |
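The parameter rules above can be sketched in Python using only the standard library. This is a minimal sketch, not the official client: the helper names `build_payload` and `build_request` are hypothetical, and sending the fields as a form-encoded POST body is an assumption, since the table does not say how the parameters are encoded.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Interface address taken from section 2.
API_URL = "https://api.gugudata.com/websitetools/readability"


def build_payload(appkey, html=None, url=None):
    """Return the request fields, enforcing that exactly one of html/url is set."""
    if not appkey:
        raise ValueError("appkey is required")
    if (html is None) == (url is None):
        raise ValueError("pass exactly one of html or url")
    payload = {"appkey": appkey}
    if html is not None:
        payload["html"] = html
    else:
        payload["url"] = url
    return payload


def build_request(payload):
    """Wrap the payload in a POST request.

    Assumption: the API accepts a form-encoded body; check the interface
    details page before relying on this.
    """
    body = urlencode(payload).encode("utf-8")
    return Request(
        API_URL,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
```

To send the request, pass the result to `urllib.request.urlopen`, for example `urlopen(build_request(build_payload("YOUR_APPKEY", url="https://example.com/post")), timeout=10)`, and decode the response body as UTF-8 JSON.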

4. Return parameters

| Parameter name | Parameter type | Remarks |
| :--------------------------: | :------: | :----------: |
| DataStatus.RequestParameter | string | Interface request parameters |
| DataStatus.StatusCode | int | Status code returned by the interface |
| DataStatus.StatusDescription | string | Interface return status description |
| DataStatus.ResponseDateTime | string | Interface data return time |
| DataStatus.DataTotalCount | int | The total amount of data under this condition, generally used for paging calculations |
| Data.Title | string | Article title |
| Data.Byline | string | Article author |
| Data.Dir | string | Article text direction |
| Data.Lang | string | Article language |
| Data.Content | string | Article content |
| Data.TextContent | string | Article content (excluding HTML tags, split by paragraphs) |
| Data.Length | int | Article length |
| Data.Excerpt | string | Article summary |
| Data.SiteName | string | Website name |
| Data.PublishedTime | string[] | Article publication time |
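As an illustration of how these fields might be consumed, the sketch below maps a response onto a small dataclass. It assumes the dotted names in the table correspond to two nested JSON objects, `DataStatus` and `Data`. The `Article` class, the `parse_article` helper, and the sample response are all invented for this example and were not captured from the real API.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Article:
    title: Optional[str]
    byline: Optional[str]
    lang: Optional[str]
    text_content: Optional[str]
    length: Optional[int]
    excerpt: Optional[str]
    site_name: Optional[str]
    published_time: Optional[List[str]]


def parse_article(raw: str) -> Article:
    """Map the Data.* fields from the table onto an Article."""
    body = json.loads(raw)
    # Assumed nesting: {"DataStatus": {...}, "Data": {...}}
    data = body.get("Data") or {}
    return Article(
        title=data.get("Title"),
        byline=data.get("Byline"),
        lang=data.get("Lang"),
        text_content=data.get("TextContent"),
        length=data.get("Length"),
        excerpt=data.get("Excerpt"),
        site_name=data.get("SiteName"),
        published_time=data.get("PublishedTime"),
    )


# Invented sample for illustration only; not a real API response.
SAMPLE_RESPONSE = json.dumps({
    "DataStatus": {"StatusCode": 200, "StatusDescription": "OK"},
    "Data": {
        "Title": "Example article",
        "Byline": "Jane Doe",
        "Lang": "en",
        "TextContent": "First paragraph.\nSecond paragraph.",
        "Length": 34,
        "Excerpt": "First paragraph.",
        "SiteName": "Example Site",
        "PublishedTime": ["2024-01-01T00:00:00Z"],
    },
})

article = parse_article(SAMPLE_RESPONSE)
print(article.title, article.site_name)
```

Using `.get` for every field keeps the parser tolerant of pages where an element such as the author or publication time could not be extracted.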

5. Interface HTTP response standard status code

| Status code | Status code explanation | Remarks |
| :----: | :----------: | :----------: |
| 200 | The interface responds normally | For the business status, see the custom status codes in the next section |
| 403 | Request frequency exceeds limit | The CDN layer evaluates the request frequency of each IP. Ordinary high-frequency requests do not trigger this status code |

6. Interface response status code

| Custom status code | Custom status code explanation | Remarks |
| :----------: | :---------------: | :----------: |
| 200 | Normal return | |
| 400 | Parameter error | |
| 402 | APPKEY error | Check that the APPKEY you pass is the value obtained from the Developer Center |
| 403 | Account in arrears | Watch for the order expiration SMS reminder and renew in time |
| 429 | Request frequency limited | Cannot exceed 100 requests per second |
| 500 | Interface response error | |
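One way to act on these codes in client code is sketched below. It assumes the custom status code is the value returned in DataStatus.StatusCode and that the response is a JSON object with nested `DataStatus` and `Data` keys. The `ReadabilityApiError` class and both helpers are hypothetical names, and treating only 429 as retryable is a design choice for this sketch, not documented API behavior.

```python
class ReadabilityApiError(Exception):
    """Raised when the custom status code is anything other than 200."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


# Explanations copied from the custom status code table.
STATUS_MESSAGES = {
    400: "Parameter error",
    402: "APPKEY error",
    403: "Account in arrears",
    429: "Request frequency limited",
    500: "Interface response error",
}


def check_status(body):
    """Return the Data object on success, otherwise raise ReadabilityApiError."""
    # Assumption: the custom code lives in DataStatus.StatusCode.
    code = (body.get("DataStatus") or {}).get("StatusCode")
    if code == 200:
        return body.get("Data")
    raise ReadabilityApiError(code, STATUS_MESSAGES.get(code, "Unknown status code"))


def is_retryable(code):
    """Only rate limiting (429) is treated as worth retrying after a pause."""
    return code == 429
```

A caller would typically catch `ReadabilityApiError`, sleep briefly and retry when `is_retryable(err.code)` is true, and surface every other code to the operator, since 402 and 403 need a configuration or billing fix rather than a retry.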

7. Development language request sample code

Sample code is provided for C#, Go, Java, jQuery, Node.js, Objective-C, PHP, Python, Ruby, Swift, and other languages. Any other language can call the interface with an equivalent RESTful API request.

8. Frequently asked questions

  • Q: Are data requests cached?

A: Data is returned directly. The exception is periodically updated data, which is cached for the duration of its update cycle.

  • Q: How can the APPKEY be kept secure during requests?

A: We recommend calling our API only from your application's back-end service. Your front end should send its requests to your own back end, so the APPKEY is never exposed to the browser or client. This architecture is also cleaner and easier to maintain.

  • Q: Which development languages can use the interface?

A: Any development language that can make network requests can use it to quickly bring data into your project.

  • Q: Can the performance of the interface be guaranteed?

A: The interface's back-end architecture is the same as the commercial project architecture we provide to enterprises. You can check the interface's response performance and related information by calling the test interface.
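To illustrate the key security answer above, here is a minimal sketch of how a back-end service might load the APPKEY from its environment instead of hard-coding it or shipping it to the front end. The environment variable name `GUGUDATA_APPKEY` and both helper functions are hypothetical and not part of the API.

```python
import os


def load_appkey(env=None, name="GUGUDATA_APPKEY"):
    """Read the APPKEY from the server environment (variable name is hypothetical)."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set")
    return key


def redact(payload):
    """Return a copy of the request fields that is safe to write to logs."""
    return {k: ("***" if k == "appkey" else v) for k, v in payload.items()}
```

Passing `env` explicitly makes the loader easy to test, and logging `redact(payload)` rather than the raw payload keeps the key out of log files.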


Gugu Data, a professional data provider, provides professional and comprehensive data interfaces and business data analysis, making data your production raw material.

Over the past seven years, Gugu Data has provided enterprise customers with storage and performance optimization for hundreds of billions of records, along with the related basic data support. Building on that experience, it has abstracted some compliant, general-purpose data and functions into product-level data APIs. These APIs meet users' demand for basic data during product development, and they reduce the storage, operation, and maintenance costs of massive data as well as the technical threshold and staffing costs of developing complex functions.

Beyond the categorized data and functional interfaces already open, a large amount of data is still being sorted, cleaned, integrated, and built. More data and cloud function APIs will be opened to users in the future.

Currently open data interface API