February 2024

New Embedding Models

  • We now support embedding generation using OpenAI’s text-embedding-3-small and text-embedding-3-large models.

  • To define the embedding model, use the embedding_model parameter in the POST body for the /embeddings and other API endpoints. If no model is specified, the system defaults to OPENAI (the original Ada-2).

  • Find more details on the models available here.
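
As a rough illustration of the embedding_model parameter, the sketch below builds a POST body for /embeddings and only includes the parameter when a model is chosen. The helper name build_embeddings_body and the "query" field are assumptions made for this example, not confirmed parts of the API. OPENAI is the only model value named above, so it is the only one used here.

```python
import json


def build_embeddings_body(query, embedding_model=None):
    """Build a POST body for /embeddings (hypothetical helper).

    The "query" field name is an assumption for illustration.
    When embedding_model is omitted, the key is left out so the
    server-side default (OPENAI, the original Ada-2) applies.
    """
    body = {"query": query}
    if embedding_model is not None:
        body["embedding_model"] = embedding_model
    return body


# Rely on the default model.
default_body = build_embeddings_body("onboarding checklist")

# Name the model explicitly.
explicit_body = build_embeddings_body("onboarding checklist", "OPENAI")

print(json.dumps(default_body))
print(json.dumps(explicit_body))
```

Leaving the key out, rather than sending a null value, keeps the request valid regardless of how the server treats explicit nulls.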

Return HTML for Webpages

  • presigned_url field under user_files_v2 now returns a pre-signed URL to the raw HTML content for each web page.

  • parsed_text_url field still returns a pre-signed URL for the corresponding plain text.

  • Find more details here.

Return Website Tags in File Metadata

  • file_metadata field under user_files_v2 now returns og:image and og:description for each web page.

  • Find more details here.

Omit Content by CSS Selector 

  • You can now exclude specific CSS selectors from web scraping. This ensures that text content within these elements does not appear in the parsed plaintext, chunks, and embeddings. Useful for omitting irrelevant elements, such as headers or footers, which might affect semantic search results.

  • The web_scrape request object supports a new field:

  •  css_selectors_to_skip: Optional[list[str]] = []

  • Find more details here.
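
A minimal sketch of a web_scrape request object carrying the new field might look like the following. Only css_selectors_to_skip comes from the notes above. The WebScrapeRequest class and the url field are assumptions for illustration, and the selectors are example values.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class WebScrapeRequest:
    """Hypothetical client-side model of one web_scrape request object."""

    url: str  # assumed field name, not confirmed by the notes above
    # Mirrors `css_selectors_to_skip: Optional[list[str]] = []`.
    # default_factory gives each instance its own empty list.
    css_selectors_to_skip: List[str] = field(default_factory=list)


request = WebScrapeRequest(
    url="https://example.com/docs",
    css_selectors_to_skip=["header", "footer", ".cookie-banner"],
)

# Convert to a plain dict, ready to be serialized as JSON.
payload = asdict(request)
print(payload)
```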

JSON File Support

  • We’ve added support for JSON files via local upload and 3rd party connectors.

  • How It Works:

    • The parser iterates through each object in a file and flattens it. Keys on the topmost level remain the same, but nested keys are transformed into the dot separated path to reach the key’s value. Each component of the path can either be a string for a nested object or integer for a nested list.

    • max_items_per_chunk is a parameter that determines how many JSON objects to include in a single chunk.

    • A new chunk is created when either the max_items_per_chunk or the chunk_size limit is reached. For example:

      • If each JSON object is 250 tokens, chunk_size is 800 tokens, and max_items_per_chunk is not set, then each chunk will contain 3 JSON objects.

      • If each JSON object is 250 tokens, chunk_size is 800 tokens, and max_items_per_chunk is set to 1, then each chunk will contain 1 JSON object.

  • Learn more details here.
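
The flattening and chunking rules described above can be sketched as follows. This is an illustration written from the description, not the parser's actual code. The function names are hypothetical, token counts are passed in directly instead of being computed by a tokenizer, and how empty nested containers or oversized objects are handled is an assumption.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts and lists into {dot.separated.path: value}.

    Dict keys and list indices both become path components.
    Empty nested containers produce no entries (an assumption).
    """
    if isinstance(obj, dict):
        pairs = obj.items()
    elif isinstance(obj, list):
        pairs = enumerate(obj)
    else:
        return {prefix: obj}

    flat = {}
    for key, value in pairs:
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, (dict, list)):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def chunk_objects(token_counts, chunk_size, max_items_per_chunk=None):
    """Group objects into chunks, returning lists of object indices.

    A new chunk starts when adding the next object would exceed
    chunk_size, or when the current chunk already holds
    max_items_per_chunk objects.
    """
    chunks, current, current_tokens = [], [], 0
    for index, tokens in enumerate(token_counts):
        too_many_tokens = current_tokens + tokens > chunk_size
        too_many_items = (
            max_items_per_chunk is not None
            and len(current) >= max_items_per_chunk
        )
        if current and (too_many_tokens or too_many_items):
            chunks.append(current)
            current, current_tokens = [], 0
        current.append(index)
        current_tokens += tokens
    if current:
        chunks.append(current)
    return chunks


record = {"id": 1, "author": {"name": "Ada"}, "tags": ["a", "b"]}
print(flatten(record))

# Six objects of 250 tokens each, chunk_size of 800 tokens.
print(chunk_objects([250] * 6, chunk_size=800))
print(chunk_objects([250] * 6, chunk_size=800, max_items_per_chunk=1))
```

With these inputs the two chunking calls reproduce the examples above: three objects per chunk when only chunk_size applies, and one object per chunk when max_items_per_chunk is 1.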

Gitbook Connector

  • We launched our Gitbook integration today. It syncs pages from any public and shared spaces.

  • The Carbon Connect enabledIntegrations value for Gitbook is GITBOOK.

  • Gitbook does not come with a pre-built file selector so we added 2 endpoints for listing and syncing Gitbook spaces:

    • List all Gitbook spaces with /integrations/gitbook/spaces (API Reference)

    • Sync multiple spaces at once with /integrations/gitbook/sync (API Reference)

  • You can also use our global endpoints for listing and syncing specific pages in Gitbook spaces:

    • List pages in spaces with the global endpoint /integrations/items/list

    • Sync pages in spaces with the global endpoint /integrations/files/sync

    • Note: Spaces are treated like folders via the Carbon API.

  • See more specifics about our Gitbook integration here.

  • Note: our Gitbook page parser is still in beta so feedback is much appreciated!

Delete Endpoint Update

  • We’re transitioning file deletion from sync to async processing.

  • This means that the FILE_DELETED webhook event will no longer fire immediately. It will fire once the file has actually been deleted.

  • We are also limiting each /delete_files request to 50 files to reduce the load on our servers. We advise spacing out delete requests every 24 hours.
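
One way to stay within the 50-file limit is to split a long list of file IDs into batches before calling /delete_files. The sketch below only builds the request bodies. The helper name batch_file_ids and the "file_ids" field name are assumptions for illustration, and sending and pacing the requests is left to the caller.

```python
MAX_FILES_PER_DELETE = 50  # per-request limit stated above


def batch_file_ids(file_ids, batch_size=MAX_FILES_PER_DELETE):
    """Split file_ids into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        file_ids[start:start + batch_size]
        for start in range(0, len(file_ids), batch_size)
    ]


# 120 example IDs become three request bodies of 50, 50, and 20 files.
all_file_ids = list(range(1, 121))
request_bodies = [
    {"file_ids": batch}  # "file_ids" is an assumed field name
    for batch in batch_file_ids(all_file_ids)
]

for body in request_bodies:
    print(len(body["file_ids"]))
```

Because deletion is now asynchronous, a caller would wait for the FILE_DELETED webhook event rather than treat a successful response as confirmation that the files are gone.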

Pinecone Integration 

  • We’ve launched our Pinecone destination connector! We offer support for both pod-based and serverless offerings.

  • Carbon seamlessly updates your Pinecone instance with the latest embeddings upon processing user files. Users gain full access to Carbon’s API endpoints, including hybrid search for supported sparse vector storage.

  • Find more details here.

New Carbon SDKs

  • Moving forward, we will be able to provide support for a greater number of SDKs and promptly release SDK support for API updates. If there is a language for which you want us to add SDK support, we should be able to turn that around in less than a week.

  • We’re adding support for the following languages today:

  • The current JavaScript SDK will continue to be supported for the next month, and it will be available longer term. However, new features will only be supported in the new TypeScript SDK moving forward.

Delete Users Endpoint

  • Added an endpoint /delete_users that takes an array of customer IDs and deletes all those users.

  • Deleting a user revokes all of the user’s oauth connections and deletes all their files, embeddings and chunks.

  • The request format is:

{ "customer_ids": ["USER_1", "USER_2", "USER_3"] }

  • Find more details here.

Salesforce Connector is Live

  • All articles from an end user’s Salesforce Knowledge can be listed and synced via the global API endpoints /integrations/items/list and /integrations/files/sync.

  • The Carbon Connect integration (launching tomorrow) will sync all articles by default.

  • The enabledIntegrations value is SALESFORCE.

  • You can find more info here.

Outlook Folders 

  • After connecting your Outlook account, you can use this endpoint to list all of your Outlook folders.

  • This includes both system folders like inbox and user-created folders.

  • Find more details here.

Gmail Labels 

  • After connecting a Gmail account, you can use the /integrations/gmail/user_labels endpoint to list all of your labels.

  • User-created labels have the type user, and Gmail’s default labels have the type system.

  • Find more details here.

Delete Child Files Based on Parent ID

  • Added a flag named delete_child_files to the delete_files endpoint. When set to true, it will delete all files that have the same parent_file_ids as the file submitted for deletion. This flag defaults to false.

  • Find more details here.

Carbon Connect Updates 

  • Added support for JSON files and a maxItemsPerChunk param that specifies the number of items to include in each chunk.

  • Added cssSelectorsToSkip to WEB_SCRAPE to define CSS Selectors to exclude when converting HTML to plaintext.

  • Added SALESFORCE as an enabledIntegration on Carbon Connect.

  • For Salesforce, we added a param syncFilesOnConnection that defaults to true and will automatically sync all pages from a user’s Salesforce account.

  • We’ll be adding this param to other connectors too, meaning you can automatically sync all files from connectors that don’t have built-in file selectors (Gitbook, Confluence, etc).

  • This parameter has also been added to the /integrations/oauth_url endpoint as sync_files_on_connection, where it also defaults to true.



COPYRIGHT @ 2024 JCDT DBA CARBON