Data extraction from a single website
Extract any JSON-formatted data
If you want total control over the extraction format, you can extract data matching an arbitrary JSON schema by using the JSON schema extraction endpoint:
import json
import os

from pydantic import BaseModel, Field

from fixpoint.client import FixpointClient
from fixpoint.client.types import CreateJsonSchemaExtractionRequest, WebpageSource


class WebsiteSummary(BaseModel):
    """A summary of a vertical SaaS AI website."""

    product_summary: str = Field(description="A summary of the business's product")
    industries_summary: str = Field(
        description="A summary of the industries the business serves"
    )
    use_cases_summary: str = Field(
        description="A summary of the use-cases of the product"
    )


PROMPT = """You are looking at the website for a vertical SaaS company.
Your job is to extract information from the site and fill in the provided JSON template.
"""

# The URL of the website to extract data from
site = "https://www.example.com"

client = FixpointClient(api_key=os.environ["FIXPOINT_API_KEY"])

extraction = client.extractions.json.create(
    CreateJsonSchemaExtractionRequest(
        # You can use the workflow ID to group scrapes together
        workflow_id="my-workflow-id",
        # You can use the run ID to retry a workflow run
        # run_id="wfrun-2024-10-30",
        # The webpage to extract data from
        source=WebpageSource(url=site),
        # A JSON schema (here generated from a Pydantic model) that defines
        # the data you want to extract from the site
        schema=WebsiteSummary.model_json_schema(),
        # Any extra prompt instructions to send to the LLM that performs the
        # extraction
        extra_instructions=[
            {"role": "system", "content": PROMPT},
        ],
    )
)

print(json.dumps(extraction.result, indent=2))
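If you prefer working with a typed object instead of a raw dictionary, you can validate the result back into the Pydantic model. This is a minimal sketch, assuming extraction.result is a plain dict that matches the schema (as the json.dumps call above suggests):

# Assumption: extraction.result is a dict matching WebsiteSummary's JSON schema
summary = WebsiteSummary.model_validate(extraction.result)
print(summary.product_summary)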
Extracting data to a tabular format
Fixpoint also lets you extract data into a tabular format, while still giving you a lot of control over the shape of the data. The benefits of extracting into a tabular format:
- Each extraction is a "Research Record", which you can include in a "Research Document" to group related extractions.
- You can easily export this data to a spreadsheet or to your database (see the CSV sketch after the code example below).
- Each extracted field includes an explanation of how the AI came up with its answer.
- Each extracted field includes a citation, which reduces hallucinations and lets you dig further into the source material.
import os

from fixpoint.client import FixpointClient
from fixpoint.client.types import CreateRecordExtractionRequest, WebpageSource

# The URL of the website to extract data from
site = "https://www.example.com"

client = FixpointClient(api_key=os.environ["FIXPOINT_API_KEY"])

extraction = client.extractions.record.create(
    CreateRecordExtractionRequest(
        # Group related extractions into a single "Research Document"
        document_id="example-research",
        document_name="My Example Research",
        # The webpage to extract data from
        source=WebpageSource(url=site),
        # Each question becomes a field in the extracted "Research Record"
        questions=[
            "What is the product summary?",
            "What are the industries the business serves?",
            "What are the use-cases of the product?",
        ],
    )
)
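To export the answers to a spreadsheet, you can write them out as CSV. The snippet below is only a sketch: it assumes you have already pulled the question-and-answer pairs out of the extraction result into a plain dict, since the exact shape of the record extraction response is not shown here.

import csv

# Hypothetical question -> answer mapping built from the extraction result.
# The real response shape may differ; adapt this to the fields you need.
answers = {
    "What is the product summary?": "…",
    "What are the industries the business serves?": "…",
    "What are the use-cases of the product?": "…",
}

with open("example-research.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["question", "answer"])
    for question, answer in answers.items():
        writer.writerow([question, answer])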