In today’s digital world, content is king. Whether you’re a blogger, a marketer, or a researcher, having access to vast amounts of article data can be incredibly valuable. However, manually extracting and summarizing articles can be a daunting task. This is where the Pipfeed Parse API comes into play. It automates the process of extracting main content and metadata from news articles and blogs, allowing you to focus on using the data rather than collecting it. In this article, we’ll delve into how you can use the Pipfeed Parse API, along with a Python script, to extract hundreds of articles efficiently.
What is the Pipfeed Parse API?
The Pipfeed Parse API is a powerful tool designed to extract main content and metadata from any news article or blog. It uses advanced AI technology to retrieve clean, structured data without the need for manual rules or site-specific training. Here’s a brief overview of its features:
- Automatic Data Extraction: Extracts full HTML and text content from articles, even from JavaScript-heavy websites.
- Consistent Categories: Auto-predicts categories to better organize your extracted content.
- Comprehensive Metadata: Retrieves full metadata including images, keywords, tags, and more.
- No Proxy Needed: Handles geo-restrictions and client-side rendering seamlessly, eliminating the need for proxies.
- Structured Data Output: Converts articles into structured JSON data.
- Embeddable Media Extraction: Extracts YouTube, Twitter cards, RSS feeds, and social media feeds.
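Before diving into the batch script, here is a minimal sketch of what a single extraction call looks like in Python, using the same endpoint the script below calls. The shape of the returned JSON (title, content, keywords, and so on) varies by article, so treat the final print as illustrative rather than a guaranteed schema:
import requests

API_KEY = "YOUR_API_KEY_HERE"  # your Api.market key

headers = {
    "accept": "application/json",
    "x-magicapi-key": API_KEY,
    "Content-Type": "application/json",
}
payload = {"url": "https://www.example.com/article1"}  # placeholder article URL

response = requests.post(
    "https://api.magicapi.dev/api/v1/pipfeed/parse/extract",
    headers=headers,
    json=payload,
)
response.raise_for_status()
article = response.json()
print(article)  # structured JSON with the article's content and metadata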
Setting Up the Extraction Script
To streamline the process of extracting and summarizing articles using the Pipfeed Parse API, we have developed a Python script. The script reads a list of URLs from an input.csv file, uses the Pipfeed API to extract article summaries, and writes the extracted data to an output.csv file. Below, we'll walk through the steps to set up and run the script.
Option 1: Running the Script Locally
GitHub Repo: https://github.com/imshashank/pipfeed-article-extract-demo
Step 1: Clone the GitHub Repository
First, clone the repository containing the script:
git clone https://github.com/imshashank/pipfeed-article-extract-demo.git
Step 2: Prepare Your Input File
Ensure your input.csv file is in the same directory as the script. This file should contain the URLs of the articles you want to process, one URL per line:
https://www.example.com/article1
https://www.example.com/article2
...
Step 3: Get Your API Key
Sign up for an account at Api.market to obtain your API key. Go to the left-hand section called 🔑 API Keys and copy your API key.
Step 4: Subscribe to Pipfeed API
Subscribe to the Pipfeed API at Pipfeed so that the script can use it.
Step 5: Update the .env File
Replace the placeholder YOUR_API_KEY_HERE in the .env file with your actual API key:
API_KEY=YOUR_API_KEY_HERE
Step 6: Install Dependencies
Install the required dependencies using pip:
pip3 install -r requirements.txt
Step 7: Run the Script
Run the script to start extracting articles:
python3 main.py
Option 2: Using Replit
Replit Link: https://replit.com/@hello737/Pipfeed-article-extract#pipfeed-article-extract-demo/main.py
If you prefer a cloud-based solution, you can use Replit to run the script. Replit is an online IDE that allows you to run your code in the cloud.
Step 1: Fork the Replit Project
Go to Replit and fork the project.
Step 2: Prepare Your Input File
Add your input.csv file to the Replit workspace. This file should contain the URLs of the articles you want to process, one URL per line. A sample input.csv file is already included in the Replit workspace.
Step 3: Get Your API Key
Sign up for an account at Api.market to obtain your API key. Go to the left-hand section called 🔑 API Keys and copy your API key.
Step 4: Subscribe to Pipfeed API
Subscribe to the Pipfeed API at Pipfeed so that the script can use it.
Step 5: Update the .env File
Replace the placeholder YOUR_API_KEY_HERE in the .env file with your actual API key. The .env file is usually hidden by Replit, so unhide it first and then paste your key there.
Step 6: Run the Script
Click the “Run” button in Replit to start the extraction process.
Understanding the Script
Here’s a brief overview of how the script works:
- Load Environment Variables: The script loads the API key from the .env file.
- Read URLs from CSV: It reads the list of URLs from the input.csv file.
- Fetch Article Data: For each URL, it sends a request to the Pipfeed Parse API to fetch the article data.
- Write Data to CSV: The extracted data is written to the output.csv file.
Detailed Script Breakdown
Load Environment Variables
The script uses the dotenv library to load environment variables from a .env file, including the API key required to authenticate with the Pipfeed API.
import os
from dotenv import load_dotenv

# Load variables from the .env file into the environment
load_dotenv()
api_key = os.getenv('API_KEY')
Read URLs from CSV
The script reads the list of URLs from the input.csv file using the csv module.
import csv
def read_urls_from_csv(input_file):
    """Read one URL per row from the input CSV file."""
    urls = []
    with open(input_file, 'r', newline='', encoding='utf-8') as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            if row:  # skip any blank lines
                urls.append(row[0])
    return urls
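With input.csv in place, loading the URLs is a single call:
urls = read_urls_from_csv('input.csv')
print(f"Loaded {len(urls)} URLs")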
Fetch Article Data
The script sends a POST request to the Pipfeed Parse API for each URL to fetch the article data, using the requests library to handle the HTTP calls.
import requests
import json
def fetch_article_data(url, api_key, queue):
    """Request article data for a single URL and put the result on the shared queue."""
    headers = {
        'accept': 'application/json',
        'x-magicapi-key': api_key,
        'Content-Type': 'application/json'
    }
    data = json.dumps({"url": url})
    response = requests.post('https://api.magicapi.dev/api/v1/pipfeed/parse/extract', headers=headers, data=data)
    queue.put({'url': url, 'response': response.json()})
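If you want to check what one request returns before running the full batch, you can call the function directly with a throwaway queue. This assumes api_key was loaded as shown earlier; the article URL is just a placeholder:
from queue import Queue

q = Queue()
fetch_article_data('https://www.example.com/article1', api_key, q)
result = q.get()
print(result['url'])
print(result['response'])  # parsed JSON returned by the Pipfeed Parse API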
Write Data to CSV
The script writes the extracted data to the output.csv file using the csv module.
def write_to_csv(data, output_file):
    """Write the collected results to the output CSV file."""
    with open(output_file, 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['url', 'response']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for entry in data:
            writer.writerow(entry)
Main Function
The main function orchestrates the entire process, from reading URLs to writing the extracted data to a CSV file.
from queue import Queue
import threading

def main(input_file, output_file, api_key):
    urls = read_urls_from_csv(input_file)
    data = []
    queue = Queue()
    threads = []

    # Create and start threads
    for url in urls:
        thread = threading.Thread(target=fetch_article_data, args=(url, api_key, queue))
        threads.append(thread)
        thread.start()

    # Wait for all threads to complete
    for thread in threads:
        thread.join()

    # Collect data from queue
    while not queue.empty():
        data.append(queue.get())

    # Write collected data to CSV
    write_to_csv(data, output_file)
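The repository's main.py wires these pieces together. The exact invocation may differ slightly, but a minimal entry point would look like this, using the default file names from this article:
if __name__ == '__main__':
    # Assumes api_key was loaded from the .env file as shown earlier
    main('input.csv', 'output.csv', api_key)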
Conclusion
The Pipfeed Parse API is an invaluable tool for anyone who needs to extract and summarize large volumes of article data. By leveraging this API and the provided Python script, you can automate the extraction process and handle hundreds of articles with ease. Whether you choose to run the script locally or on Replit, the setup process is straightforward, and the benefits of automated content extraction are immense. Give it a try and see how it can streamline your workflow and enhance your content management efforts.