Web data scraping is a highly practical part of data collection and processing. The Supametas.AI platform offers a complete web scraping solution that makes it simple and efficient to collect data from news articles, product lists, and other types of pages. This article explains in detail how to create and run a web scraping task on Supametas.AI.
1. Create a New Task
On the dataset details page, select "Import Data Source" and choose "Import from Web", then click the "New Task" button to start creating a new web scraping task.
- Task Name: Enter a task name with no more than 20 characters, making it easy to identify and manage in the task list.
2. Input Web URL
On the task creation page, find the "URL" input field and enter the URL of the webpage you want to scrape.
- Notes:
- The URL must start with http or https (a minimal validation sketch follows these notes).
- If you need to scrape multiple webpages, you can enter a list page URL that contains pagination.
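Because only http and https URLs are accepted, it can be useful to validate a URL before submitting the task. The following is a minimal Python sketch of such a check; it simply illustrates the rule above and is not part of the Supametas.AI platform.

```python
from urllib.parse import urlparse

def is_valid_scrape_url(url: str) -> bool:
    """Return True if the URL uses http/https and has a host, per the note above."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

# A paginated list page URL passes the check; a bare path without a scheme does not.
print(is_valid_scrape_url("https://example.com/news?page=1"))  # True
print(is_valid_scrape_url("example.com/news"))                 # False
```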
3. Configure Scraping Content
Choose the scraping type based on your data requirements:
- List Page Scraping: The system will scrape all links and content listed on the page, suitable for collecting news directories, product lists, etc.
- Detail Page Scraping: The system will focus on scraping detailed content from a specific page, such as a single article or product details.
4. Advanced Settings (Optional)
If the target webpage has pagination or a multi-level structure, it is recommended to configure the advanced scraping settings (a configuration sketch follows this list):
- Pagination Settings: Configure pagination rules so the system will automatically scrape all paginated data.
- Scraping Depth: The default depth is 1, meaning only the input page is scraped; increase the depth to also scrape deeper levels of linked pages.
- Scraping Frequency and Time: For frequently updated pages (such as news lists), you can set scheduled scraping tasks, and the system will automatically execute the task at the pre-set frequency.
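To make these options more concrete, here is a hedged sketch of how such advanced settings might be represented in Python. The key names (pagination, depth, schedule, and so on) are illustrative assumptions, not the platform's actual parameter names; on Supametas.AI these options are configured through the task form.

```python
# Hypothetical representation of the advanced scraping settings described above.
# All key names and values are assumptions used only for illustration.
advanced_settings = {
    "pagination": {
        "enabled": True,       # follow pagination rules so all paginated data is scraped
        "max_pages": 50,       # an assumed safety limit, not a documented default
    },
    "depth": 2,                # default depth is 1 (only the input page); 2 also follows linked pages
    "schedule": {
        "enabled": True,       # scheduled scraping for frequently updated pages such as news lists
        "frequency": "daily",  # assumed frequency value for illustration
    },
}
```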
5. Retrieve Parameters
During task creation, you must configure "Get Parameters" to help the system identify which content on the webpage should be scraped:
- Choose Webpage Type:
- List Page: Scrape all list items on the page.
- Detail Page: Scrape detailed information from a single page.
- Custom Fields: If you need to scrape specific field data (e.g., nickname, title), you can enable the custom field feature, enter field names (in English), and provide a description to improve scraping accuracy (see the sketch below).
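As an illustration of this step, the sketch below describes a list page task with two custom fields (nickname and title). The structure and key names are assumptions used only to make the options concrete; in practice they are set in the task creation form.

```python
# Hypothetical "Get Parameters" configuration for a list page with custom fields.
# Key names are assumptions for illustration, not documented platform parameters.
get_parameters = {
    "page_type": "list",  # "list" scrapes all list items; "detail" scrapes a single page
    "custom_fields": [
        # English field names plus short descriptions to improve scraping accuracy
        {"name": "nickname", "description": "Author or poster nickname on each list item"},
        {"name": "title", "description": "Headline or product title of each list item"},
    ],
}
```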
6. Output Settings
Configuring the output settings determines how the scraped data will be saved and used later:
- Output Format: You can choose to save the data in JSON or Markdown format. JSON is suitable for API calls, while Markdown is convenient for building knowledge bases (a conversion sketch follows this list).
- Output Content:
- For list page scraping, you can choose to output only the list data.
- For detail page scraping, or when scraping depth is enabled, you can choose to output only the detail page data.
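The choice between JSON and Markdown mainly affects how you consume the data afterwards. The minimal Python sketch below turns records from a JSON export into one Markdown note per record, which is one way to feed a knowledge base; the title and content field names are assumptions for illustration and do not reflect a documented Supametas.AI output schema.

```python
import json
from pathlib import Path

def json_to_markdown(json_path: str, out_dir: str) -> None:
    """Convert a list of scraped records (JSON) into one Markdown file per record."""
    records = json.loads(Path(json_path).read_text(encoding="utf-8"))
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, record in enumerate(records):
        # "title" and "content" are assumed field names for this illustration.
        title = record.get("title", f"Untitled {i}")
        body = record.get("content", "")
        (out / f"{i:04d}.md").write_text(f"# {title}\n\n{body}\n", encoding="utf-8")

# Example usage with hypothetical file names:
# json_to_markdown("scraped_articles.json", "knowledge_base/")
```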
7. Save or Execute Task Immediately
After configuring the task, you have two options:
- Save and Execute Later: The task configuration will be saved in the task list, and you can manually start it later.
- Execute Task Immediately: Click the "Execute Task Now" button, and the system will start scraping web data according to the configuration and import the data into the specified dataset.
8. Monitor Task Progress
Once the task is started, you can monitor the task progress in real-time on the import page:
- Progress Monitoring: Displays the task status, progress bar, and detailed information.
- Error Report: If the task fails, the system will generate an error report to help you quickly locate the issue and adjust the settings.
Supametas.AI provides a flexible and efficient web data scraping solution. Whether you are scraping list page content or detail page data, each step, from task creation to parameter configuration, output settings, and progress monitoring, is designed to be intuitive and easy to use. We hope this article helps you fully master the web scraping workflow and provides strong support for data cleaning and information integration.