In today’s digital world, content is king. Whether you’re a blogger, a marketer, or a researcher, having access to vast amounts of article data can be incredibly valuable. However, manually extracting and summarizing articles can be a daunting task. This is where the Pipfeed Parse API comes into play. It automates the process of extracting main content and metadata from news articles and blogs, allowing you to focus on using the data rather than collecting it. In this article, we’ll delve into how you can use the Pipfeed Parse API, along with a Python script, to extract hundreds of articles efficiently.

What is the Pipfeed Parse API?

The Pipfeed Parse API is a powerful tool designed to extract main content and metadata from any news article or blog. It uses advanced AI technology to retrieve clean, structured data without the need for manual rules or site-specific training. Here’s a brief overview of its features:

  • Automatic Data Extraction: Extracts full HTML and text content from articles, even from JavaScript-heavy websites.
  • Consistent Categories: Auto-predicts categories to better organize your extracted content.
  • Comprehensive Metadata: Retrieves full metadata including images, keywords, tags, and more.
  • No Proxy Needed: Handles geo-restrictions and client-side rendering seamlessly, eliminating the need for proxies.
  • Structured Data Output: Converts articles into structured JSON data (see the sample output sketch after this list).
  • Embeddable Media Extraction: Extracts YouTube, Twitter cards, RSS feeds, and social media feeds.
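
To make the structured output concrete, here is a rough sketch of the kind of JSON a parsed article might produce. The field names below are illustrative assumptions, not the API's documented schema; inspect a live response for the exact keys.

# Illustrative response shape only (assumed field names, not the documented schema)
sample_response = {
    "url": "https://www.example.com/article1",
    "title": "Example headline",
    "authors": ["Jane Doe"],
    "categories": ["technology"],
    "keywords": ["api", "content extraction"],
    "images": ["https://www.example.com/lead.jpg"],
    "html": "<p>Full article HTML ...</p>",
    "text": "Full article text ..."
}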

Setting Up the Extraction Script

To streamline the process of extracting and summarizing articles using the Pipfeed Parse API, we have developed a Python script. This script reads a list of URLs from an input.csv file, uses the Pipfeed API to extract article summaries, and writes the extracted data to an output.csv file. Below, we’ll walk through the steps to set up and run this script.

Option 1: Running the Script Locally

GitHub Repo: https://github.com/imshashank/pipfeed-article-extract-demo

Step 1: Clone the GitHub Repository

First, clone the repository containing the script:

git clone https://github.com/imshashank/pipfeed-article-extract-demo.git
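
Then change into the project directory (the directory name follows from the repository URL) before running the remaining steps:

cd pipfeed-article-extract-demo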

Step 2: Prepare Your Input File

Ensure your input.csv file is in the same directory as the script. This file should contain the URLs of the articles you want to process, one URL per line:

https://www.example.com/article1
https://www.example.com/article2
...

Step 3: Get Your API Key

Sign up for an account at Api.market to obtain your API key. Go to the left-hand section called 🔑 API Keys and copy your API key.

Step 4: Subscribe to Pipfeed API

Subscribe to the Pipfeed API at Pipfeed so the script can call it with your key.

Step 5: Update the .env File

Replace the placeholder YOUR_API_KEY_HERE in the .env file with your actual API key.

API_KEY=YOUR_API_KEY_HERE

Step 6: Install Dependencies

Install the required dependencies using pip:

pip3 install -r requirements.txt

Step 7: Run the Script

Run the script to start extracting articles:

python3 main.py

Option 2: Using Replit

Replit Link: https://replit.com/@hello737/Pipfeed-article-extract#pipfeed-article-extract-demo/main.py

If you prefer a cloud-based solution, you can use Replit to run the script. Replit is an online IDE that allows you to run your code in the cloud.

Step 1: Fork the Replit Project

Go to Replit and fork the project.

Step 2: Prepare Your Input File

Add your input.csv file to the Replit workspace; it should contain the URLs of the articles you want to process, one URL per line. A sample input.csv is already included in the workspace.

Step 3: Get Your API Key

Sign up for an account at Api.market to obtain your API key. Go to the left-hand section called 🔑 API Keys and copy your API key.

Step 4: Subscribe to Pipfeed API

Subscribe to the Pipfeed API at Pipfeed so the script can call it with your key.

Step 5: Update the .env File

Replace the placeholder YOUR_API_KEY_HERE in the .env file with your actual API key. Replit usually hides the .env file, so unhide it first and then paste your key in.

Step 6: Run the Script

Click the “Run” button in Replit to start the extraction process.

Understanding the Script

Here’s a brief overview of how the script works:

  1. Load Environment Variables: The script loads the API key from the .env file.
  2. Read URLs from CSV: It reads a list of URLs from the input.csv file.
  3. Fetch Article Data: For each URL, it sends a request to the Pipfeed Parse API to fetch the article data.
  4. Write Data to CSV: The extracted data is written to the output.csv file.

Detailed Script Breakdown

Load Environment Variables

The script uses the dotenv library to load environment variables from a .env file. This includes the API key required to authenticate with the Pipfeed API.

import os
from dotenv import load_dotenv

# Load variables from .env and read the Pipfeed API key
load_dotenv()
api_key = os.getenv('API_KEY')

Read URLs from CSV

The script reads the list of URLs from the input.csv file using the csv module.

import csv

def read_urls_from_csv(input_file):
    urls = []
    with open(input_file, 'r', newline='', encoding='utf-8') as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            if row:  # skip blank lines so trailing newlines don't break the run
                urls.append(row[0].strip())
    return urls

Fetch Article Data

The script sends a POST request to the Pipfeed Parse API for each URL to fetch the article data. It uses the requests library to handle the HTTP requests.

import requests
import json

def fetch_article_data(url, api_key, queue):
    # The API key is passed via the x-magicapi-key header
    headers = {
        'accept': 'application/json',
        'x-magicapi-key': api_key,
        'Content-Type': 'application/json'
    }
    data = json.dumps({"url": url})
    # POST the article URL to the Pipfeed Parse endpoint and queue the parsed result
    response = requests.post('https://api.magicapi.dev/api/v1/pipfeed/parse/extract', headers=headers, data=data)
    queue.put({'url': url, 'response': response.json()})
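
Before running the full script, you may want to sanity-check your API key with a single synchronous request. This minimal sketch reuses the endpoint and headers shown above, without the queue or threading:

import json
import requests

def quick_check(url, api_key):
    # Single request to the same Pipfeed Parse endpoint used by the script
    headers = {
        'accept': 'application/json',
        'x-magicapi-key': api_key,
        'Content-Type': 'application/json'
    }
    response = requests.post(
        'https://api.magicapi.dev/api/v1/pipfeed/parse/extract',
        headers=headers,
        data=json.dumps({"url": url})
    )
    response.raise_for_status()
    return response.json()

# Example: inspect the keys returned for one article
# print(quick_check('https://www.example.com/article1', api_key).keys())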

Write Data to CSV

The script writes the extracted data to the output.csv file using the csv module.

def write_to_csv(data, output_file):
    with open(output_file, 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['url', 'response']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        # One row per article: the source URL and the raw API response
        for entry in data:
            writer.writerow(entry)

Main Function

The main function orchestrates the entire process, from reading URLs to writing the extracted data to a CSV file.

from queue import Queue
import threading

def main(input_file, output_file, api_key):
    urls = read_urls_from_csv(input_file)
    data = []
    queue = Queue()
    threads = []

    # Create and start threads
    for url in urls:
        thread = threading.Thread(target=fetch_article_data, args=(url, api_key, queue))
        threads.append(thread)
        thread.start()

    # Wait for all threads to complete
    for thread in threads:
        thread.join()

    # Collect data from queue
    while not queue.empty():
        data.append(queue.get())

    # Write collected data to CSV
    write_to_csv(data, output_file)
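
Assuming the snippets above live together in main.py, an entry point along these lines ties them together (a minimal sketch using the input and output file names from earlier):

if __name__ == '__main__':
    # Read URLs from input.csv, extract each article, and write results to output.csv
    main('input.csv', 'output.csv', api_key)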

Conclusion

The Pipfeed Parse API is an invaluable tool for anyone who needs to extract and summarize large volumes of article data. By leveraging this API and the provided Python script, you can automate the extraction process and handle hundreds of articles with ease. Whether you choose to run the script locally or on Replit, the setup process is straightforward, and the benefits of automated content extraction are immense. Give it a try and see how it can streamline your workflow and enhance your content management efforts.
