Extraction

Fixpoint's extraction APIs let you crawl websites or target specific webpages, and pull out answers to your questions, or specific structured data that you can put into your spreadsheets, databases, CRMs, or anywhere else. Some examples:

  • You want to scrape a bunch of venture capital sites and extract info on all the companies they invested in.
  • You want to monitor a news site for articles relevant to your business.
  • You need to keep up to date with financial compliance regulations and similar government filings.
  • You want to constantly monitor the pricing pages of all your competitors.

Fixpoint lets you set up web scraping pipelines that continually run and handle all the painful parts of scraping, as well as the logic to efficiently extract just the data you need.

Extraction APIs

Extraction sources

You can extract AI answers from any of the following:

  • a single webpage
  • a website crawl, extracting answers across many pages
  • text documents you already have

Extraction methods

There are two types of extraction APIs:

  • Record Extraction - extract data into a tabular format that you can export
  • JSON Schema Extraction - extract an arbitrary JSON structure from a website, with total control over the nested fields

For most situations, we recommend Record Extraction, the smarter of the two methods. If you can't model your data correctly with Record Extraction, you can use JSON Schema Extraction.

Record Extraction

Record Extraction has the following benefits:

  1. Each extraction is a "Research Record", which you can include in a "Research Document" to group related extractions.
  2. You can easily export this data to a spreadsheet or to your database.
  3. Each extracted field includes an explanation of how the AI came up with its answer.
  4. Each extracted field includes a citation, which reduces hallucinations and lets you dig further into the source material.

Single URL extraction

Here is an example that extracts data from a single webpage (note the WebpageSource(url=...) in Python, or the "source": {"url": ...} field in the HTTP API):

import os
 
from fixpoint.client import FixpointClient
from fixpoint.client.types import CreateRecordExtractionRequest, WebpageSource
 
client = FixpointClient(api_key=os.environ["FIXPOINT_API_KEY"])

# the webpage to extract answers from (placeholder URL)
site = "https://www.example.com"

extraction = client.extractions.record.create(
  CreateRecordExtractionRequest(
    document_id="example-research",
    document_name="My Example Research",
    source=WebpageSource(url=site),
    questions=[
      "What is the product summary?",
      "What are the industries the business serves?",
      "What are the use-cases of the product?",
    ],
  )
)
 

Crawling extraction

If you want to crawl multiple pages and extract data across them, use CrawlUrlSource(crawl_url=...) in Python, or the "source": {"crawl_url": ...} field in the HTTP API.

You can set the additional arguments depth and page_limit to control how deep the crawler follows links into the site's structure and how many pages it crawls in total.

Here's an example:

extraction = client.extractions.record.create(
  CreateRecordExtractionRequest(
    # ...
    source=CrawlUrlSource(crawl_url=site, depth=2, page_limit=3),
    # ...
  )
)

Text file extraction

If you already have text files you want to extract data from, you can run a single text file extraction or a batched extraction across many text files.

For a single text file, use TextSource(text_id=..., content=...) in Python, or the "source": {"text_id": ..., "content": ...} field in the HTTP API.

For batched text files, use BatchTextSource(sources=...) in Python, or the "source": {"sources": ...} field in the HTTP API.
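
Here's a minimal sketch of both variants, following the same request shape as the single URL example above. It assumes TextSource and BatchTextSource are imported from fixpoint.client.types like the other source types, and the text IDs and contents are placeholders:

# assumed import path, matching the other source types
from fixpoint.client.types import BatchTextSource, TextSource

# single text document extraction
extraction = client.extractions.record.create(
  CreateRecordExtractionRequest(
    document_id="example-research",
    document_name="My Example Research",
    source=TextSource(text_id="doc-1", content="Acme Corp sells anvils to..."),
    questions=["What is the product summary?"],
  )
)

# batched extraction across multiple text documents
batch_extraction = client.extractions.record.create(
  CreateRecordExtractionRequest(
    document_id="example-research",
    document_name="My Example Research",
    source=BatchTextSource(sources=[
      TextSource(text_id="doc-1", content="Acme Corp sells anvils to..."),
      TextSource(text_id="doc-2", content="Acme Corp's competitors are..."),
    ]),
    questions=["What is the product summary?"],
  )
)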

JSON Schema Extraction

JSON Schema Extraction gives you more free-form control over the format of the data you pull out.

Single URL extraction

import json
import os

from pydantic import BaseModel, Field

from fixpoint.client import FixpointClient
from fixpoint.client.types import CreateJsonSchemaExtractionRequest, WebpageSource

class WebsiteSummary(BaseModel):
    """A summary of a vertical SaaS AI website."""

    product_summary: str = Field(description="A summary of the business's product")
    industries_summary: str = Field(
        description="A summary of the industries the business serves"
    )
    use_cases_summary: str = Field(
        description="A summary of the use-cases of the product"
    )

PROMPT = """You are looking at the website for a vertical SaaS company.
Your job is to extract information from the site and fill in the provided JSON template.
"""

client = FixpointClient(api_key=os.environ["FIXPOINT_API_KEY"])

# the webpage to extract data from (placeholder URL)
site = "https://www.example.com"

extraction = client.extractions.json.create(
  CreateJsonSchemaExtractionRequest(
    # you can use the workflow ID to group scrapes together
    workflow_id="my-workflow-id",
    # you can use the run ID to retry a workflow run
    # run_id="wfrun-2024-10-30",
    source=WebpageSource(url=site),
    # a JSON schema (here, generated from a Pydantic model) that defines the
    # data you want to extract from the site
    schema=WebsiteSummary.model_json_schema(),
    # any extra prompt instructions to send to the LLM that performs the
    # extraction
    extra_instructions=[
      {"role": "system", "content": PROMPT},
    ],
  )
)

print(json.dumps(extraction.result, indent=2))
 

Crawling extraction

Crawling extraction is not supported with the JSON Schema Extraction mode.

Text file extraction

To use text file extraction with JSON Schema Extraction, use TextSource or BatchTextSource as described for Record Extraction. For example:
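
Here is a minimal sketch that reuses the WebsiteSummary model and PROMPT from the single URL example above, and assumes TextSource is imported from fixpoint.client.types like the other source types:

# assumed import path, matching the other source types
from fixpoint.client.types import TextSource

extraction = client.extractions.json.create(
  CreateJsonSchemaExtractionRequest(
    workflow_id="my-workflow-id",
    # extract from a text document you already have, instead of a webpage
    source=TextSource(text_id="doc-1", content="Acme Corp sells anvils to..."),
    schema=WebsiteSummary.model_json_schema(),
    extra_instructions=[
      {"role": "system", "content": PROMPT},
    ],
  )
)

print(json.dumps(extraction.result, indent=2))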