Extraction
Fixpoint's extraction APIs let you crawl websites or target single webpages, and pull out answers to your questions or specific structured data that you can load into your spreadsheets, databases, CRMs, or anywhere else. Some examples:
- You want to scrape a bunch of venture capital sites and extract info on all the companies they invested in.
- You want to monitor a news site for articles relevant to your business.
- You need to keep up to date with financial compliance regulations and similar government filings.
- You want to constantly monitor the pricing pages of all your competitors.
Fixpoint lets you set up web scraping pipelines that run continually and handle all the painful parts of scraping, as well as the logic to efficiently extract just the data you need.
Extraction APIs
Extraction sources
You can extract AI answers from:
- a single webpage
- many pages, by crawling a website and extracting answers across them
- text documents you already have
Extraction methods
There are two types of extraction APIs:
- Record Extraction - extract data into a tabular format, which you can export
- JSON Schema Extraction - extract an arbitrary JSON structure from a website. You have total control over the nested fields
For most situations, we recommend using Record Extraction, which is a smarter method of extraction. If you can't model your data correctly with Record Extraction, you can use JSON Schema Extraction.
Record Extraction
Record Extraction has the following benefits:
- Each extraction is a "Research Record", which you can include in a "Research Document" to group related extractions.
- You can easily export this data to a spreadsheet or to your database.
- Each extracted field includes an explanation for how the AI came up with its answer.
- Each extracted field includes a citation to reduce hallucinations, and let you further dig into the source material.
- HTTP API: POST /v1/extractions/record_extractions
- Python SDK: FixpointClient.extractions.record.create(...)
- API spec
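If you're calling the HTTP API directly, here's a minimal sketch using Python's requests library. The request body mirrors the fields in the Python SDK example below; the base URL and Bearer-token auth header are assumptions here, so check the API spec for the exact details.

import os

import requests

# NOTE: the base URL and the Bearer-token auth scheme are assumptions for
# illustration; consult the API spec for the real values.
resp = requests.post(
    "https://api.fixpoint.co/v1/extractions/record_extractions",
    headers={"Authorization": f"Bearer {os.environ['FIXPOINT_API_KEY']}"},
    json={
        "document_id": "example-research",
        "document_name": "My Example Research",
        "source": {"url": "https://www.example.com"},
        "questions": ["What is the product summary?"],
    },
)
resp.raise_for_status()
print(resp.json())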
Single URL extraction
Here's an example of extracting data from a single webpage (note the WebpageSource(url=...) argument in Python, or the "source": {"url": ...} field in the HTTP API):
import os

from fixpoint.client import FixpointClient
from fixpoint.client.types import CreateRecordExtractionRequest, WebpageSource

client = FixpointClient(api_key=os.environ["FIXPOINT_API_KEY"])

# The webpage to extract answers from (placeholder URL)
site = "https://www.example.com"

extraction = client.extractions.record.create(
    CreateRecordExtractionRequest(
        document_id="example-research",
        document_name="My Example Research",
        source=WebpageSource(url=site),
        questions=[
            "What is the product summary?",
            "What are the industries the business serves?",
            "What are the use-cases of the product?",
        ],
    )
)
Crawling extraction
If you want to crawl multiple pages and extract data across them, use CrawlUrlSource(crawl_url=...) in Python, or the "source": {"crawl_url": ...} field in the HTTP API. You can set the additional arguments depth and page_limit to control how deep into the site's link structure the crawler goes and how many pages it crawls in total. Here's an example:
from fixpoint.client.types import CrawlUrlSource

extraction = client.extractions.record.create(
    CreateRecordExtractionRequest(
        # ...
        source=CrawlUrlSource(crawl_url=site, depth=2, page_limit=3),
        # ...
    )
)
Text file extraction
If you already have text files you want to extract data from, you can run an extraction on a single text file or a batched extraction across many text files.
For single text file extraction, use TextSource(text_id=..., content=...) in Python, or the "source": {"text_id": ..., "content": ...} field in the HTTP API.
For batched text file extraction, use BatchTextSource(sources=...) in Python, or the "source": {"sources": ...} field in the HTTP API.
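Here's a minimal sketch of both variants with the Python SDK, reusing the client and request fields from the single URL example above. The text_id values and document contents are placeholders, and we assume BatchTextSource takes a list of TextSource objects:

from fixpoint.client.types import (
    BatchTextSource,
    CreateRecordExtractionRequest,
    TextSource,
)

# Extraction over a single text document
extraction = client.extractions.record.create(
    CreateRecordExtractionRequest(
        document_id="example-research",
        document_name="My Example Research",
        source=TextSource(text_id="pitch-memo", content="...the document text..."),
        questions=["What is the product summary?"],
    )
)

# Batched extraction across several text documents (assumes sources=...
# accepts a list of TextSource objects)
extraction = client.extractions.record.create(
    CreateRecordExtractionRequest(
        document_id="example-research",
        document_name="My Example Research",
        source=BatchTextSource(
            sources=[
                TextSource(text_id="pitch-memo", content="...first document..."),
                TextSource(text_id="faq-page", content="...second document..."),
            ]
        ),
        questions=["What is the product summary?"],
    )
)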
JSON Schema Extraction
JSON Schema Extraction gives you more free-form control over the format of the data you pull out.
- HTTP API: POST /v1/extractions/json_extractions
- Python SDK: FixpointClient.extractions.json.create(...)
- API spec
Single URL extraction
import json
import os

from pydantic import BaseModel, Field

from fixpoint.client import FixpointClient
from fixpoint.client.types import CreateJsonSchemaExtractionRequest, WebpageSource


class WebsiteSummary(BaseModel):
    """A summary of a vertical SaaS AI website."""

    product_summary: str = Field(description="A summary of the business's product")
    industries_summary: str = Field(
        description="A summary of the industries the business serves"
    )
    use_cases_summary: str = Field(
        description="A summary of the use-cases of the product"
    )


PROMPT = """You are looking at the website for a vertical SaaS company.
Your job is to extract information from the site and fill in the provided JSON template.
"""

client = FixpointClient(api_key=os.environ["FIXPOINT_API_KEY"])

# The webpage to extract structured data from (placeholder URL)
site = "https://www.example.com"

extraction = client.extractions.json.create(
    CreateJsonSchemaExtractionRequest(
        # You can use the workflow ID to group scrapes together, and the run
        # ID to retry a workflow run:
        # run_id="wfrun-2024-10-30"
        workflow_id="my-workflow-id",
        source=WebpageSource(url=site),
        # A JSON schema (generated here from a Pydantic model) that defines
        # the data you want to extract from the site
        schema=WebsiteSummary.model_json_schema(),
        # Any extra prompt instructions to send to the LLM that performs
        # the extraction
        extra_instructions=[
            {"role": "system", "content": PROMPT},
        ],
    )
)

print(json.dumps(extraction.result, indent=2))
Crawling extraction
Crawling extraction is not supported in JSON Schema Extraction mode.
Text file extraction
To use text file extraction with JSON Schema Extraction, use TextSource or BatchTextSource as described above for Record Extraction.
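For example, here's a minimal sketch of a JSON Schema extraction over a single text document, reusing the client and the WebsiteSummary model from the example above (the text_id and content values are placeholders):

from fixpoint.client.types import CreateJsonSchemaExtractionRequest, TextSource

extraction = client.extractions.json.create(
    CreateJsonSchemaExtractionRequest(
        workflow_id="my-workflow-id",
        # Extract from a text document you already have, instead of a webpage
        source=TextSource(text_id="pitch-memo", content="...the document text..."),
        schema=WebsiteSummary.model_json_schema(),
    )
)

print(json.dumps(extraction.result, indent=2))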