Data extraction from a single website
Extract any JSON-formatted data
If you want total control over the extraction format, you can extract data matching an arbitrary JSON schema by using the JSON schema extraction endpoint:
import json
import os

from pydantic import BaseModel, Field

from fixpoint.client import FixpointClient
from fixpoint.client.types import CreateJsonSchemaExtractionRequest, WebpageSource


class WebsiteSummary(BaseModel):
    """A summary of a vertical SaaS AI website."""

    product_summary: str = Field(description="A summary of the business's product")
    industries_summary: str = Field(
        description="A summary of the industries the business serves"
    )
    use_cases_summary: str = Field(
        description="A summary of the use-cases of the product"
    )


PROMPT = """You are looking at the website for a vertical SaaS company.
Your job is to extract information from the site and fill in the provided JSON template.
"""

# The URL of the website to extract data from
site = "https://www.example.com"

client = FixpointClient(api_key=os.environ["FIXPOINT_API_KEY"])

extraction = client.extractions.json.create(
    CreateJsonSchemaExtractionRequest(
        # You can use the workflow ID to group scrapes together
        workflow_id="my-workflow-id",
        # You can use the run ID to retry a workflow run
        # run_id="wfrun-2024-10-30",
        # The webpage to extract data from
        source=WebpageSource(url=site),
        # A JSON schema (here generated from a Pydantic model) that defines
        # the data you want to extract from the site
        schema=WebsiteSummary.model_json_schema(),
        # Any extra prompt instructions to send to the LLM that performs the
        # extraction
        extra_instructions=[
            {"role": "system", "content": PROMPT},
        ],
    )
)

print(json.dumps(extraction.result, indent=2))
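If you prefer working with a typed object instead of a raw dictionary, you can validate the result back into the Pydantic model. This is a minimal sketch, assuming extraction.result is a plain dict that matches the schema (as the json.dumps call above suggests):

# Assumption: extraction.result is a dict matching WebsiteSummary's JSON schema
summary = WebsiteSummary.model_validate(extraction.result)
print(summary.product_summary)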
Extracting data to a tabular format
Fixpoint also lets you extract data into a tabular format, while still giving you a lot of control over the shape of the data. The benefits of extracting into a tabular format:
- Each extraction is a "Research Record", which you can include in a "Research Document" to group related extractions.
- You can easily export this data to a spreadsheet or to your database (see the CSV sketch after the code example below).
- Each extracted field includes an explanation of how the AI came up with its answer.
- Each extracted field includes a citation, which reduces hallucinations and lets you dig further into the source material.
import os

from fixpoint.client import FixpointClient
from fixpoint.client.types import CreateRecordExtractionRequest, WebpageSource

# The URL of the website to extract data from
site = "https://www.example.com"

client = FixpointClient(api_key=os.environ["FIXPOINT_API_KEY"])

extraction = client.extractions.record.create(
    CreateRecordExtractionRequest(
        # Group related extractions into a single "Research Document"
        document_id="example-research",
        document_name="My Example Research",
        # The webpage to extract data from
        source=WebpageSource(url=site),
        # Each question becomes a field in the extracted "Research Record"
        questions=[
            "What is the product summary?",
            "What are the industries the business serves?",
            "What are the use-cases of the product?",
        ],
    )
)
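To export the answers to a spreadsheet, you can write them out as CSV. The snippet below is only a sketch: it assumes you have already pulled the question-and-answer pairs out of the extraction result into a plain dict, since the exact shape of the record extraction response is not shown here.

import csv

# Hypothetical question -> answer mapping built from the extraction result.
# The real response shape may differ; adapt this to the fields you need.
answers = {
    "What is the product summary?": "…",
    "What are the industries the business serves?": "…",
    "What are the use-cases of the product?": "…",
}

with open("example-research.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["question", "answer"])
    for question, answer in answers.items():
        writer.writerow([question, answer])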