Treasure Data FAQs | Comparably
Treasure Data is the end-to-end, fully managed cloud service for Big Data that's trusted and simple.

Treasure Data FAQs

Treasure Data's Frequently Asked Questions page is a central hub its customers can always go to with their most common questions. These are the 586 most popular questions Treasure Data receives.

Frequently Asked Questions About Treasure Data

  • You can import worksheets from Google Sheets into Arm Treasure Data using the Google Sheets connector.

    Prerequisites

    Basic knowledge of Treasure Data

    Basic knowledge of Google Sheets

    Use TD Console

    Create a new connection

    Go to Integrations Hub > Catalog and search for and select the Google Sheets input. The following dialog opens.

    For more information, see https://support.treasuredata.com/hc/en-us/articles/360001502427-Bulk-Import-for-Google-Sheets

    Click to connect a new account. Log into your Google Sheets account from the new window and grant Treasure Data access:

    You will be redirected back to the Treasure Data Catalog. Repeat the connect-to-a-new-account step and choose your new OAuth connection.

    Click Continue and give your connection a name:

    Then click Done.

    Create a new source

    After creating the connection, you are automatically taken to the Sources tab. Look for the connection you created and click New Source.

    The following dialog opens. Edit the details and click Next.

    Spreadsheet Key: ID of the spreadsheet

    Worksheet Title: the worksheet name in the spreadsheet. It is not case-sensitive, meaning `sheet1` is the same as `SHEET1` or `Sheet1`

    If blank, the first worksheet of the spreadsheet is fetched

    Range to fetch: the range of the data in the worksheet to fetch. For example, A1:B2.

    If blank, the whole worksheet is fetched

    If blank and Use first row/column as header is true, only rows/columns that have a header are fetched

    Use first row/column as header: use the first row (for the Columns dimension) or the first column (for the Rows dimension) as the header names for the schema

    When this option is selected, Skip first rows/columns must be larger than 1

    Skip first rows/columns: the number of leading rows/columns to skip in the data

    Major dimension: the dimension of the sheet that is fetched

    Rows: operates on the rows of a sheet.

    Columns: operates on the columns of a sheet.

    Value render options: the format of values from the sheet

    Formatted value: values are calculated and formatted in the reply according to the cell's formatting. Formatting is based on the spreadsheet's locale, not the requesting user's locale. For example, if A1 is 1.23 and A2 is =A1 and formatted as currency, then A2 returns "$1.23".

    Unformatted value: values are calculated, but not formatted in the reply. For example, if A1 is 1.23 and A2 is =A1 and formatted as currency, then A2 returns the number 1.23.

    Datetime render options: the datetime format of values from the sheet

    Formatted String: instructs date, time, datetime, and duration fields to be output as strings in their given number format (which depends on the spreadsheet locale).

    Timestamp formats that we can support:

    dd/MM/yyyy

    dd/MM/yyyy hh:mm:ss

    dd/MM/yyyy can be rearranged to MM/dd/yyyy or yyyy/MM/dd

    The / separator can be replaced by . or -. For example: 12.12.2012 or 12-12-2012

    Serial number: instructs date, time, datetime, and duration fields to be output as doubles in "serial number" format, as popularized by Lotus 1-2-3.

    We support datetime without timezone, or date and time separately

    For example: 12-12-2012, 15:12:20, 12-12-2012 15:12:20

    The datetime formats below are supported for both datetime render options:

    dd/MM/yyyyThh:mm:ss

    dd/MM/yyyyThh:mm:ss +0000

    dd/MM/yyyyThh:mm:ss +01:00

    dd/MM/yyyyThh:mm:ssZ

    Columns with multiple datetime formats are not supported; they could lead to unexpected behavior.

    When a value can't be converted to a datetime, the job stops with an exception message. Enable Skip invalid cell(s) to ignore the errors.
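    The dialog fields above correspond to keys in the connector configuration file used in the command-line steps later in this article. A hedged sketch of the mapping (placeholder values are illustrative):

    ```yaml
    in:
      type: google_sheets
      key: <YOUR_SPREADSHEET_ID>                # Spreadsheet Key
      worksheet_title: Sheet1                   # Worksheet Title (not case-sensitive)
      range: A1:B2                              # Range to fetch
      header_line: true                         # Use first row/column as header
      skip_header_lines: 1                      # Skip first rows/columns
      major_dimension: rows                     # Major dimension: rows or columns
      value_render_option: formatted_value      # or unformatted_value
      datetime_render_option: formatted_string  # or serial_number
      skip_invalid_format_cells: true           # Skip invalid cell(s): invalid cells become null
    ```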

    Edit your parameters and click Next.

    Preview your data. To make changes, click Advanced Settings; otherwise, click Next.

    Advanced Settings:

    Skip invalid cell(s): ignore all errors when importing data, such as a string in a double column or cells that don't follow the timestamp format. All invalid cells become null.

    Change your column names and data types, then click Save.

    Select a database and a table where you want to transfer your data and click Next.

    Specify the schedule of the data transfer using the following dialog and click Next:

    Name your source and click Create And Run.

    Use Command Line

    Install td command

    Install the newest Treasure Data Toolbelt.

    Create Seed Config File (seed.yml)

    in:
      type: google_sheets
      client_id: ***
      client_secret: ***
      refresh_token: ***
      key: <YOUR_SPREADSHEET_ID>
      worksheet_title: <YOUR_WORKSHEET_NAME>
      range: <YOUR_DATA_RANGE>
      major_dimension: rows  # or columns
      value_render_option: formatted_value  # or unformatted_value
      datetime_render_option: formatted_string  # or serial_number
      header_line: true  # or false
      skip_header_lines: 1
      skip_invalid_format_cells: false
    out:
      mode: append

    For more details on available out modes, see Appendix.

    Guess Fields (Generate load.yml)

    Use connector:guess. This command automatically reads the target data, and intelligently guesses the data format.

    $ td connector:guess seed.yml -o load.yml

    If you open the load.yml file, you see guessed file format definitions including, in some cases, file formats, encodings, column names, and types.

    in:
      type: google_sheets
      major_dimension: columns
      client_id: ***
      client_secret: ***
      refresh_token: ***
      worksheet_title: Sheet1
      range: a10:k19
      value_input_option: RAW
      value_render_option: formatted_value
      datetime_render_option: serial_number
      header_line: 'true'
      skip_header_lines: 1
      columns:
      - {name: 'column1', type: timestamp, format: '%Y-%m-%d %k:%M:%S'}
      - {name: 'column2', type: string}
      - {name: 'column3', type: string}
      - {name: 'column4', type: string}
    out: {mode: append}
    exec: {}
    filters:
    - rules:
      - {rule: upper_to_lower}
      - pass_types:
        - a-z
        - 0-9
        pass_characters: _
        replace: _
        rule: character_types
      - pass_types:
        - a-z
        pass_characters: _
        prefix: _
        rule: first_character_types
      - {rule: unique_number_suffix, max_length: 128}
      type: rename
    - from_value: {mode: upload_time}

    Then you can preview how the system parses the file by using preview command.

    $ td connector:preview load.yml

    If the system detects your column names or types unexpectedly, modify load.yml directly and preview again.

    The data connector supports parsing of "boolean", "long", "double", "string", and "timestamp" types.
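    For instance, each supported type can appear in the columns list of load.yml. A sketch with hypothetical column names:

    ```yaml
    columns:
      - {name: signup_at, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
      - {name: active, type: boolean}
      - {name: visits, type: long}
      - {name: score, type: double}
      - {name: email, type: string}
    ```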

    Execute Load Job

    Submit the load job. It may take a couple of hours depending on the data size. Users need to specify the database and table where their data is stored.

    $ td connector:issue load.yml --database td_sample_db --table td_sample_table

    The preceding command assumes that you have already created the database (td_sample_db) and table (td_sample_table). If the database or the table does not exist in TD, this command will not succeed. Create the database and table manually, or use the --auto-create-table option with the td connector:issue command to create them automatically:

    $ td connector:issue load.yml --database td_sample_db --table td_sample_table --time-column created_at --auto-create-table

    You can assign a time-format column as the partitioning key with the --time-column option.

    Scheduled execution

    You can schedule periodic data connector execution for periodic Google sheet import. We carefully configure our scheduler to ensure high availability. By using this feature, you no longer need a cron daemon on your local data center.

    Create the schedule

    A new schedule can be created by using the td connector:create command. The name of the schedule, cron-style schedule, the database and table where their data is stored, and the data connector configuration file are required.

    $ td connector:create \

    daily_google_sheets_import \

    "10 0 * * *" \

    td_sample_db \

    td_sample_table \

    load.yml

    The `cron` parameter also accepts three special options: `@hourly`, `@daily` and `@monthly`.

    List the Schedules

    You can see the list of scheduled entries by td connector:list.

    $ td connector:list

    +----------------------------+------------+----------+-------+--------------+-----------------+---------------------------------+
    | Name                       | Cron       | Timezone | Delay | Database     | Table           | Config                          |
    +----------------------------+------------+----------+-------+--------------+-----------------+---------------------------------+
    | daily_google_sheets_import | 10 0 * * * | UTC      | 0     | td_sample_db | td_sample_table | {"type"=>"google_sheets", ... } |
    +----------------------------+------------+----------+-------+--------------+-----------------+---------------------------------+

    Show the Setting and History of Schedules

    td connector:show shows the execution setting of a schedule entry.

    $ td connector:show daily_google_sheets_import

    Name : daily_google_sheets_import

    Cron : 10 0 * * *

    Timezone : UTC

    Delay : 0

    Database : td_sample_db

    Table : td_sample_table

    Config

    ---

    // Displayed load.yml configuration.

    td connector:history shows the execution history of a schedule entry. To investigate the results of each individual execution, use td job <jobid>.

    Delete the Schedule

    td connector:delete removes the schedule.

    $ td connector:delete daily_google_sheets_import

    Obtain the required Google API credentials

    For information on how to obtain Google API credentials, refer to the following:

  • You can import Salesforce Marketing Cloud (beta) Data Source objects into Treasure Data using this data connector. Contact us for further details about the private beta.

    To see sample workflows for importing Salesforce Marketing Cloud data, go to Treasure Boxes.

    You can use the same connection to export SFMC data. See Exporting Data to SFMC (BETA).

    Prerequisites

    Basic knowledge of Treasure Data

    Basic knowledge of Salesforce Marketing Cloud

    How to create an install package in Salesforce Marketing Cloud

    After August 1, 2019, you cannot create a Legacy Package in Salesforce Marketing Cloud; therefore, we highly recommend that you create an Enhanced Package.

    How to create a legacy package

    Log on to your Salesforce Marketing Cloud account.

    On the Welcome Page, click your name in the top right corner, then click Setup.

    On the left side menu of the new screen, click App > Installed Packages.

    On the Installed Packages screen, click New.

    On the New Package Details pop-up, enter the Name and Description. Uncheck (deselect) the checkbox Create with enhanced functionality (recommended), then click Save.

    On the Package Details screen, click Add Component.

    On the Add Component pop-up, select API Integration, then click Next.

    On the Add API Integration screen:

    Scroll to the CHANNELS section and select the Read checkbox on Email

    Scroll to the CONTACT section and select the Read checkbox on Audiences and List and Subscribers

    Scroll to the DATA section and select the Read and Write checkboxes on Data Extensions, and Read for Tracking Events

    Scroll to the HUB section and select the Read checkbox on Campaign

    On the Installed Packages screen, scroll down to the Components panel, then take note of the Client Id and Client Secret. You use this information to connect Treasure Data to Salesforce Marketing Cloud.

    How to create an enhanced functionality package

    Log on to your Salesforce Marketing Cloud account.

    On the Welcome Page, click your name on top right corner then click Setup.

    On left side menu of new screen, clickApp > Installed Packages.

    On theInstalled Packages screen, click New.

    On the New Package Details pop up, enter the Name and Description and selectthecheckboxCreate with enhanced functionality (recommended). Then click Save

    On theAdd Component pop up, selectAPI Integration then click Next.

    On theChoose Your Integration Type screen, select Server-to-Server then click Next.

    On theSet Server-to-Server Properties screen:

    Scroll to theCHANNELS section and select theRead checkbox on Email

    Scroll to CONTACT sectionand select the Read checkbox on Audiences andList and Subscribers

    Scroll to DATAsectionand select the Read and Write checkbox on Data Extensions,Read forTracking Events

    Scroll to HUBsectionand select the Read checkbox on Campaign

    On theInstalled Packages screen, scroll down to Components panel, then take note of theClient Id,Client Secret andAuthentication Base URI. You will use the information to write the data from Treasure Data to Salesforce Marketing Cloud.

    Use the TD Console to create your connection

    You can use TD console to create your data connector.

    Create a new connection

    When you configure a data connection, you provide authentication to access the integration. In Treasure Data, you configure the authentication and then specify the source information.

    Go to Integrations Hub > Catalog and search for and select Salesforce Marketing Cloud.

    Create Legacy Package integration

    On the New Authentication screen, select Package Type as Legacy Package, then enter the Client Id and Client Secret (which you obtained when you created the legacy package in SFMC). Then click Continue.

    Enter a Name for this integration and click Done.

    Create Enhanced Package integration

    On the New Authentication screen, select Package Type as Enhanced Functionality Package, then enter the Client Id, Client Secret, and Authentication Base URI (which you obtained when you created the enhanced package in SFMC). Optionally, you can specify an Account identifier or MID to access multiple BUs (more detail in How to get MID) and a Scope to limit the token's scope (more detail in API scopes). Then click Continue.

    Enter a Name for this integration and click Done.

    Create a new transfer

    After creating the connection, you are automatically taken to Sources. Look for the connection you created and click New Transfer.

    The following dialog opens. Complete the details and click Next.

    Next, you see a Preview of your data similar to the following dialog. To make changes, click Advanced Settings; otherwise, click Next.

    From here, if you want to change some options such as skipping on errors or rate limits, click Advanced Settings:

    Select the database and table where you want to transfer the data:

    Specify the schedule of the data transfer using the following dialog and click Start Transfer:

    You will see the new data transfer in progress listed under the My Input Transfers tab, and a corresponding job is listed in the Jobs section.

    Use Command Line

    Install td command v0.11.9 or later

    You can install the newest Treasure Data Toolbelt.

    $ td --version

    0.15.3

    Create Configuration File

    Prepare a configuration file (for example, load.yml) as shown in the following example, with your Salesforce Marketing Cloud account access information.

    in:
      type: salesforce_marketing_cloud
      client_id: xxxxxxxxxx
      client_secret: xxxxxxxxxx
      target: campaign  # required; see Appendix B for available targets
    out:
      mode: replace

    This example dumps Salesforce Marketing Cloud Campaign Data Source:

    client_id: Salesforce Marketing Cloud client id.

    client_secret: Salesforce Marketing Cloud client secret.

    target: Salesforce Marketing Cloud entity object to be imported.

    See Appendix B for the list of available targets.

    For more details on available out modes, see Appendix A

    (optional): Preview data to import

    You can preview data to be imported using the command td connector:preview.

    $ td connector:preview load.yml

    +---------+-------------+--------------------+----
    | id:long | name:string | description:string | ...
    +---------+-------------+--------------------+----
    | 42023   | "Hello"     | apps               |
    | 42045   | "World"     | apps               |
    +---------+-------------+--------------------+----

    Execute Load Job

    Submit the load job. It may take a couple of hours depending on the data size. Users need to specify the database and table where their data are stored.

    It is recommended to specify the --time-column option, since Treasure Data's storage is partitioned by time (see also data partitioning). If the option is not given, the data connector selects the first long or timestamp column as the partitioning time. The type of the column specified by --time-column must be either long or timestamp.

    If your data doesn't have a time column, you can add one using the add_time filter option. More details are in the add_time filter plugin documentation.
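    A minimal sketch of that filter, using the upload time as the time value (key names follow common add_time plugin usage; verify against the plugin documentation):

    ```yaml
    filters:
      - type: add_time
        to_column: {name: time, type: timestamp}
        from_value: {mode: upload_time}
    ```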

    $ td connector:issue load.yml --database td_sample_db --table td_sample_table --time-column modifieddate

    The preceding command assumes you have already created the database (td_sample_db) and table (td_sample_table). If the database or the table does not exist in TD, this command will not succeed. Create the database and table manually, or use the --auto-create-table option with the td connector:issue command to create them automatically:

    $ td connector:issue load.yml --database td_sample_db --table td_sample_table --time-column modifieddate --auto-create-table

    You can assign a time-format column as the partitioning key with the --time-column option.

    Scheduled execution

    You can schedule periodic data connector execution for periodic Salesforce Marketing Cloud import. We configure our scheduler carefully to ensure high availability. By using this feature, you no longer need a cron daemon on your local data center.

    Create the schedule

    A new schedule can be created using the td connector:create command. The name of the schedule, cron-style schedule, the database and table where their data will be stored, and the Data Connector configuration file are required.

    $ td connector:create \

    daily_salesforce_marketing_cloud_import \

    "10 0 * * *" \

    td_sample_db \

    td_sample_table \

    load.yml

    The `cron` parameter also accepts these three options: `@hourly`, `@daily` and `@monthly`.

    By default, the schedule is set up in the UTC timezone. You can set the schedule in a different timezone using the -t or --timezone option. The `--timezone` option supports only extended timezone formats like 'Asia/Tokyo' and 'America/Los_Angeles'. Timezone abbreviations like PST and CST are *not* supported and may lead to unexpected schedules.

    List the Schedules

    You can see the list of currently scheduled entries by td connector:list.

    $ td connector:list

    +-----------------------------------------+------------+----------+-------+--------------+-----------------+----------------------------------------------+
    | Name                                    | Cron       | Timezone | Delay | Database     | Table           | Config                                       |
    +-----------------------------------------+------------+----------+-------+--------------+-----------------+----------------------------------------------+
    | daily_salesforce_marketing_cloud_import | 10 0 * * * | UTC      | 0     | td_sample_db | td_sample_table | {"type"=>"salesforce_marketing_cloud", ... } |
    +-----------------------------------------+------------+----------+-------+--------------+-----------------+----------------------------------------------+

    Show the Setting and History of Schedules

    td connector:show shows the execution setting of a schedule entry.

    % td connector:show daily_salesforce_marketing_cloud_import

    Name : daily_salesforce_marketing_cloud_import

    Cron : 10 0 * * *

    Timezone : UTC

    Delay : 0

    Database : td_sample_db

    Table : td_sample_table

    td connector:history shows the execution history of a schedule entry. To investigate the results of each individual execution, use td job <jobid>.

    % td connector:history daily_salesforce_marketing_cloud_import

    +--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+
    | JobID  | Status  | Records | Database     | Table           | Priority | Started                   | Duration |
    +--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+
    | 578066 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-18 00:10:05 +0000 | 160      |
    | 577968 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-17 00:10:07 +0000 | 161      |
    | 577914 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-16 00:10:03 +0000 | 152      |
    | 577872 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-15 00:10:04 +0000 | 163      |
    | 577810 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-14 00:10:04 +0000 | 164      |
    | 577766 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-13 00:10:04 +0000 | 155      |
    | 577710 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-12 00:10:05 +0000 | 156      |
    | 577610 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-11 00:10:04 +0000 | 157      |
    +--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+

    8 rows in set

    Delete the Schedule

    td connector:delete removes the schedule.

    $ td connector:delete daily_salesforce_marketing_cloud_import

    Incremental Loading

    For Data Extensions

    Treasure Data supports incremental loading for Data Extensions that have a date field.

    If incremental: true is set, the data connector loads records according to the range specified by from_date and fetch_days for the specified date field.

    For example:

    incremental_column_name: mydatefield

    from_date: "2018-02-01T00:00:00.000Z"

    fetch_days: 2

    1st iteration: The data connector fetches records from Feb 01 00:00:00 UTC 2018 to Feb 03 00:00:00 UTC 2018

    2nd iteration: The data connector fetches records for the next available 2-day period, from Feb 03 00:00:00 UTC 2018 until Feb 05 00:00:00 UTC 2018. This process repeats for each successive iteration.

    When the increment includes the present date, additional records are fetched as each complete time period becomes available.

    If incremental: false is set, the data connector loads all records for the specified target. This is a one-time activity.
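    Putting those keys together, a minimal incremental Data Extensions configuration might look like this (credentials and dates are illustrative):

    ```yaml
    in:
      type: salesforce_marketing_cloud
      client_id: XXXXXX
      client_secret: XXXXXX
      target: data_extension
      incremental: true
      incremental_column_name: mydatefield
      from_date: "2018-02-01T00:00:00.000Z"
      fetch_days: 2
    out:
      mode: append
    ```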

    For Email Events

    Treasure Data supports incremental loading for Email Events based on their event date.

    If incremental: true is set:

    1st iteration: the connector loads data from the specified Start Time (from all time if not set) to the specified End Time (the time the job was scheduled to run, if not set)

    2nd iteration: the connector loads data from the previous End Time to the time the job is scheduled

    If incremental: false is set, the data connector loads all records for the specified target. This is a one-time activity.
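    As a sketch, an incremental Email Events configuration combining these settings (times are illustrative and optional):

    ```yaml
    in:
      type: salesforce_marketing_cloud
      client_id: XXXXXX
      client_secret: XXXXXX
      target: email_event
      start_time: "2019-03-26T07:20:00.000Z"  # Start Time; from all time if omitted
      end_time: "2019-03-26T07:53:00.000Z"    # End Time; scheduled run time if omitted
      incremental: true
    out:
      mode: append
    ```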

    Appendix

    A) Modes for out plugin

    You can specify the file import mode in the out section of load.yml.

    append (default)

    This is the default mode and records are appended to the target table.

    in:

    ...

    out:

    mode: append

    replace (In td 0.11.10 and later)

    This mode replaces data in the target table. Any manual schema changes made to the target table remain intact with this mode.

    in:

    ...

    out:

    mode: replace

    B) Available targets

    Campaign: The e-mail campaign

    Contact: The contact list

    Data Extensions: The data extensions, satisfying the need for flexible data storage

    Email Events: The events associated with an email marketing send

    The target Contact provides an option to ingest data in multiple requests in case you have too many contact model attributes. If you select this option, you also input the number of attributes per request. The default value of this field is 100.

    In some cases, your data breaks because some attributes are faulty. You can use the option Attribute set names will be ignored to skip faulty attributes. This option helps you ingest data without errors from Salesforce Marketing Cloud's API.

    The target Data Extensions provides an option to filter the data extensions that you want to ingest, and another option to enable ingesting shared data extensions.

    The target Email Events provides options to filter:

    Email name matching pattern (exact or partial match)

    Email event types that you want to ingest

    Two options to define a date range, so you can ingest events in a specific datetime range

    C) Sample configuration for each target

    For Campaign target

    in:
      type: salesforce_marketing_cloud
      client_id: XXXXXX
      client_secret: XXXXXX
      target: campaign
    out:
      mode: replace

    Parameters:

    type: must be salesforce_marketing_cloud

    client_id: Salesforce Marketing Cloud Client ID

    client_secret: Salesforce Marketing Cloud Client Secret

    target: must be campaign (in this case)

    For Contact target

    in:
      type: salesforce_marketing_cloud
      client_id: XXXXXX
      client_secret: XXXXXX
      target: contact
      contact_page_size: 50
      contact_attributes_request_limit: 100
      contact_multiple_requests: true
      ignore_attribute_set_names:
      - attribute_1
      - attribute_2
    out:
      mode: replace

    Parameters:

    type: must be salesforce_marketing_cloud

    client_id: Salesforce Marketing Cloud Client ID

    client_secret: Salesforce Marketing Cloud Client Secret

    target: must be contact (in this case)

    contact_page_size: Number of records per page for the contact target; useful when you have a large amount of data. Default: 1000

    contact_attributes_request_limit: Number of attributes to ingest for each contact in a single request. When the number of attributes is greater than this value, the request is split into multiple requests. Default: 100

    contact_multiple_requests: Enable the plugin to ingest contact details that have many attributes in multiple requests (takes effect only when the number of attributes is greater than 100). Default: false

    ignore_attribute_set_names: List of attributes to ignore; useful when you want to eliminate unnecessary attributes. Default: null

    For Data Extension target

    in:
      type: salesforce_marketing_cloud
      client_id: XXXXXX
      client_secret: XXXXXX
      target: data_extension
      data_extension_names:
      - data_extension_1
      - data_extension_2
      incremental: true
      shared_data_extension: true
      incremental_column_name: date
      from_date: "2018-11-30T17:00:00.000Z"
      fetch_days: 1
    out:
      mode: replace

    Parameters:

    type: must be salesforce_marketing_cloud

    client_id: Salesforce Marketing Cloud Client ID

    client_secret: Salesforce Marketing Cloud Client Secret

    target: must be data_extension (in this case)

    data_extension_names: List of data extension names you want to ingest. Leave it null to ingest all. Default: null

    shared_data_extension: Set this flag to true when you want to ingest data from shared data extensions. Default: false

    incremental: Set this flag to true if you want to get only newer data between each scheduled run. Default: false

    incremental_column_name: Column used for incremental loading. Must be set if incremental = true. Default: null

    from_date: Set this to ingest data from this date. Default: null

    fetch_days: The ingested duration is from_date to from_date + fetch_days. Default: 1

    For Email Event target

    in:
      type: salesforce_marketing_cloud
      client_id: XXXXXX
      client_secret: XXXXXX
      target: email_event
      search_term: "email name"
      search_mode: "exact"
      email_event_types:
      - value: all
      start_time: "2019-03-26T07:20:00.000Z"
      end_time: "2019-03-26T07:53:00.000Z"
      incremental: true
    out:
      mode: replace

    Parameters:

    type: must be salesforce_marketing_cloud

    client_id: Salesforce Marketing Cloud Client ID

    client_secret: Salesforce Marketing Cloud Client Secret

    target: must be email_event (in this case)

    search_term: Name of the email you want to ingest. Default: null

    search_mode: Mode for the name matching pattern; must be exact or partial. Default: exact

    email_event_types: List of event types you want to import. Valid options are:

    - value: all
    - value: bounce
    - value: click
    - value: forward
    - value: forwardOptIn
    - value: open
    - value: sent
    - value: survey
    - value: unsub

    Default: null

    start_time: Set this to ingest data from this time. Default: null

    end_time: Set this to ingest data up to this time. Default: null

    incremental: Set this flag to true if you want to get only newer data between each scheduled run. Default: false

  • You can connect Salesforce with Treasure Data for greater control over your Salesforce data and better integration with the other business applications in your marketing and sales operations stack.

    Integrating Salesforce with Treasure Data makes it easy to:

    Add new features to Salesforce. For example, you can prevent churn by tracking web usage and receiving alerts when customers' product usage declines.

    Use Salesforce data to improve other parts of your marketing stack. For example, you can increase your Facebook Ads ROI by automatically removing new customers from your Facebook Custom Audiences.

    For sample workflows on importing data from Salesforce, go to Treasure Boxes.

    If you don't have a Treasure Data account yet, contact us so that we can get you set up.

    Connect to Salesforce using the TD Console

    You can configure your connection using TD Console. Instructions on how to connect using the command line are in the next section.

    In Salesforce

    Go to Setup -> Apps -> App Manager and click New Connected App. (The steps may vary depending on the version of Salesforce you are running.)

    In the Spring '19 classic UI: Setup -> Build -> Create -> Apps -> Connected Apps -> New

    From Setup -> Build -> Create (Apps), you can see all your connected apps.

    Click on your app name to go to the page where you can view and manage all information about your connected app. From there, you can get your Consumer Key (client_id) and Consumer Secret (client_secret), which you need for the next step.

    For secure account access, get a Salesforce Security Token if you don't already have one. To get a security token, go to Account -> Settings -> Reset My Security Token and click Reset Security Token. You'll receive your Security Token by email.

    In the Spring '19 classic UI: My account -> My Settings -> Personal -> Reset My Security Token

    Create New Connection in TD Console

    Go to Integrations Hub > Catalog and search for Salesforce.

    In the dialog box, enter https://login.salesforce.com/ as the login URL.

    If you are using the newer Salesforce v2 connector, make sure to remove unnecessary characters from the Login URL parameter. For example, instead of https://login.salesforce.com/?locale=jp, use https://login.salesforce.com/.

    Enter your username (your email) and password, as well as your Client ID, Client Secret, and Security Token.

    Click Continue.

    Give your connection a descriptive name and click Create Connection.

    Authentication

    The method you use to authenticate Treasure Data with Salesforce affects the steps you take to enable the data connector import from Salesforce.

    Using Credentials

    You need a client ID and client secret to authenticate using credentials. You can find more information on how to acquire or retrieve existing client ID and client secret credentials.

    Using Session ID

    Not applicable for import. Although you see the Session ID option in the drop-down box, you cannot use it for import. Session ID can be used to authenticate for export (result output) only. See Writing Job Results into Salesforce.

    Using OAuth2

    Using OAuth is the most common authentication method.

    Go to Integrations Hub > Catalogand search forSalesforce.

    Click the Salesforce connector. In the dialog to create a new connector, open the Authenticate method dropdown list and choose OAuth.

    Click Click here to connect to a new account. You are redirected to the Salesforce login if you haven't logged in yet, or to the consent page to grant access to Treasure Data.

    After you grant access to Treasure Data, you are redirected back to TD Console. Choose the Salesforce connector again, then choose the OAuth Authenticate method as in the preceding step. You will see an OAuth connection with your account name in the dropdown list. Choose the account you want to use and then proceed to create the connection.

    Specify parameters. The parameters are as follows:

    Initial Retry Delay: The number of seconds the connector waits before its first retry of a failed import.

    Retry Limit: During the import of data into Treasure Data, a number of factors can interrupt the process, such as network fluctuations or the concurrency limit on the Salesforce server. The connector retries the import up to the number of times you specify.
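    The interaction of these two parameters can be pictured with a small sketch (illustrative only — the doubling delay is an assumption, not the connector's documented backoff policy):

    ```python
    import time

    def run_with_retries(fetch, initial_retry_delay=1.0, retry_limit=3):
        """Call fetch(); on failure, wait before retrying, up to retry_limit
        retries. The delay doubles after each retry (an assumed policy)."""
        delay = initial_retry_delay
        for attempt in range(retry_limit + 1):
            try:
                return fetch()
            except ConnectionError:
                if attempt == retry_limit:
                    raise  # retries exhausted; the import job fails
                time.sleep(delay)
                delay *= 2
    ```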

    Your Salesforce account profile must be API Enabled. To enable it:

    1. Setup -> Administer -> Manage Users -> Users

    Find the profile your user is assigned.

    2. Setup -> Administer -> Manage Users -> Profiles -> Edit -> Administrative Permissions

    Select API Enabled.

    If you are unable to edit a pre-defined profile in your organization, clone the profile to create a new profile and then edit the new profile.

    Create Transfer

    In Authentications, locate the Salesforce connection and click New Source.

    Enter the name of the database you want to import in the Target field and click Next.

    You see a preview of your data. At this stage, you can click on Advanced Setting to manually modify the SOQL, where condition and the schema.

    Click Next to go to Transfer data from Salesforce.

    If you are creating a new database, select Create new database and give your database a name. Complete the same steps for Create new table.

    Select whether to append records to an existing table or replace your existing table.

    If you want to set a different partition key seed rather than use the default key, you can specify one using the popup menu.

    In the When tab, you can specify a one-time transfer, or you can schedule an automated recurring transfer. If you select Once now, click Start Transfer. If you select Repeat, specify your schedule options, and then click Schedule Transfer.

    After your transfer has run, you can see the results of your transfer in the Databases tab.

    Create Salesforce connection using the command line

    Install the Treasure Data Toolbelt

    Open a terminal and install the newest Treasure Data Toolbelt. You can confirm the installation by checking the version:

    $ td --version
    0.15.3

    In Salesforce

    Go to Setup -> Apps -> App Manager and click New Connected App. (The procedure may vary depending on what version of Salesforce you are running.)

    Click Manage Connected Apps and get your Consumer Key and Consumer Secret, which you need for the next step.

    For secure account access, if you don't have one already, get a Salesforce Security Token. To do this, go to Account -> Settings -> Reset My Security Token and click Reset Security Token. You'll receive your Security Token by email.

    Create a Seed Config File (seed.yml)

    Using a text editor, create a file called seed.yml. Copy and paste the information below, replacing the placeholder text with your Salesforce credentials. This configuration dumps the Account object specified in the target field because replace mode is specified. For more details on available out modes, see the Appendix.

    in:
      type: sfdc
      username: "USERNAME"                      # your username
      password: "PASSWORD"                      # your password
      security_token: "SECURITY_TOKEN"          # your Security Token
      client_id: "CLIENT_ID"                    # your app's consumer key
      client_secret: "CLIENT_SECRET"            # your app's consumer secret
      login_url: https://login.salesforce.com/  # test.salesforce.com for sandbox
      incremental: false                        # 'full dump', see Scheduled execution below
      target: Account                           # Salesforce Object you want to import
    out:
      mode: replace

    All Salesforce Objects are supported for the target option. Some common objects include:

    Opportunity

    Contact

    Lead

    Account

    Event

    Task

    (For more information about Salesforce Objects, refer to the full reference at the Salesforce Developer portal.)

    The objects, fields and records accessible to you depend on your Salesforce license, security configuration and API access configuration in your Salesforce organization. Check your configurations if you see authorization errors in the next step.

    Run the Guess Fields command (generate load.yml)

    Run the following command in your terminal:

    $ td connector:guess seed.yml -o load.yml

    The connector:guess command automatically reads the target data and intelligently guesses the data format.

    Open the file load.yml, where you'll see guessed file format definitions including, in some cases, file formats, encodings, column names, and types.

    in:
      type: sfdc
      username: "USERNAME"
      password: "PASSWORD"
      security_token: "SECURITY_TOKEN"
      client_id: "CLIENT_ID"
      client_secret: "CLIENT_SECRET"
      login_url: https://login.salesforce.com/
      ...
    filters:
      ...
    out:
      mode: replace

    You can preview how the system will parse the file by using the preview command.

    $ td connector:preview load.yml

    If the system detects the wrong column name or type, modify load.yml and preview again.

    The Data Connector supports parsing of boolean, long, double, string, and timestamp types.

    Execute the load job

    In your terminal, submit the load job as shown in the following example. Processing might take a couple of hours depending on the data size. Users need to specify the database and table where their data is stored.

    $ td connector:issue load.yml --database td_sample_db --table td_sample_table \
        --time-column createddate

    The preceding command assumes you have already created the database (td_sample_db) and table (td_sample_table). If the database or the table does not exist in TD, the command will not succeed. Create the database and table manually, or use the --auto-create-table option with the td connector:issue command, as shown below, to auto-create them.

    $ td connector:issue load.yml --database td_sample_db --table td_sample_table \
        --time-column createddate --auto-create-table

    You can assign the Time Format column to the Partitioning Key by using the time-column option.

    If you want to ingest records that have been deleted in SFDC, specify include_deleted_records: true. By default, ingested records don't include records that you have previously deleted in Salesforce.

    Here's an example of a load file using the include_deleted_records option:

    in:
      type: sfdc
      username: "USERNAME"
      password: "PASSWORD"
      security_token: "SECURITY_TOKEN"
      client_id: "CLIENT_ID"
      client_secret: "CLIENT_SECRET"
      login_url: https://login.salesforce.com/
      include_deleted_records: true  # Ingest records that have been deleted
      ...
    filters:
      ...
    out:
      mode: replace

    Scheduled execution using the command line

    You can schedule the Data Connector to run periodically for SFDC import. We manage our scheduler carefully to ensure high availability. By using this feature, you no longer need a cron daemon in your local data center.

    For the scheduled import, Data Connector for SFDC imports all Objects that match the specified target.

    Scheduled execution supports additional configuration parameters that control the behavior of the Data Connector during its periodic attempts to fetch data from SFDC:

    incremental: This configuration controls the load mode, which governs how the Data Connector fetches data from SFDC based on one of the native timestamp fields associated with each Object.

    incremental: true (default) In this mode, the data connector fetches only records of the specified Salesforce Object type that have been updated since the previous run of the connector. This mode is useful when the user wants to fetch just the Object targets that have changed since the previous scheduled run. This mode is typically combined with writing data into the destination table using append mode.

    incremental_columns (required): This option is required in incremental mode, to load only the necessary data from SFDC. Make sure these columns are selected if you use a custom SOQL.

    incremental: false In this mode, the data connector fetches all the records of the specified Salesforce Object type, regardless of when they were last updated. This mode is best combined with writing data into a destination table using replace mode.

    soql: This configuration allows you to use custom SOQL (Salesforce Object Query Language) to query and filter data. With SOQL, you can retrieve data from a single object or from multiple related objects, select only the columns you are interested in, and filter or sort the output with your own conditional statements.

    where: This configuration allows you to apply a filtering condition to the custom SOQL.

    columns: This configuration defines a custom schema for data to be imported into Treasure Data. You can define only the columns that you are interested in, but make sure they exist in the object that you are fetching. Otherwise, these columns aren't available in the result.

    last_record: This configuration carries over the last record from the previous load job. It requires an object that includes a key for the column name and a value for the column's value. The key needs to match the SFDC (SOQL) column name.

    Here's an example of a seed file using incremental mode combined with append mode for output:

    in:
      type: sfdc
      username: "USERNAME"
      password: "PASSWORD"
      security_token: "SECURITY_TOKEN"
      client_id: "CLIENT_ID"
      client_secret: "CLIENT_SECRET"
      login_url: https://login.salesforce.com/
      target: Account
      use_rest: false
      incremental: true
      incremental_columns:
        - LastModifiedDate
    out:
      mode: append

    The preceding configuration has a column named LastModifiedDate, which is used to load records incrementally. The Data Connector will issue SOQL (SFDC Query Language) queries like the following:

    -- first execution:

    SELECT Id, Name, ... FROM Account ORDER BY LastModifiedDate

    -- second execution:

    SELECT Id, Name, ... FROM Account WHERE (LastModifiedDate > 2015-10-21T10:33:07+0000) ORDER BY LastModifiedDate

    -- '2015-10-21T10:33:07+0000' is the latest modified record time from first execution

    -- third execution:

    SELECT Id, Name, ... FROM Account WHERE (LastModifiedDate > 2015-10-30T22:30:41+0000) ORDER BY LastModifiedDate

    -- '2015-10-30T22:30:41+0000' is the latest modified record time from second execution

    You can see that the WHERE clause was updated on every execution to load modified records only.
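    The query rewriting shown above can be sketched as follows (a simplified illustration using the example's column and object names; the real connector manages the high-water mark internally):

    ```python
    def build_incremental_soql(select_clause, incremental_column, last_value=None):
        """Build the SOQL an incremental run would issue: no WHERE clause on
        the first run, then a filter on the stored high-water mark afterwards."""
        query = select_clause
        if last_value is not None:
            query += f" WHERE ({incremental_column} > {last_value})"
        return query + f" ORDER BY {incremental_column}"

    # First run: no previous high-water mark, fetch everything.
    first = build_incremental_soql("SELECT Id, Name FROM Account", "LastModifiedDate")
    # Second run: filter on the latest LastModifiedDate seen so far.
    second = build_incremental_soql("SELECT Id, Name FROM Account", "LastModifiedDate",
                                    "2015-10-21T10:33:07+0000")
    ```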

    Here's an example of a seed file using non-incremental mode (incremental: false) combined with replace mode for output:

    in:
      type: sfdc
      username: "USERNAME"
      password: "PASSWORD"
      security_token: "SECURITY_TOKEN"
      client_id: "CLIENT_ID"
      client_secret: "CLIENT_SECRET"
      login_url: https://login.salesforce.com/
      incremental: false
      target: Account
    out:
      mode: replace

    With the preceding configuration, the Data Connector always loads all records and replaces all existing records with loaded records.

    The following is an example of a seed file using incremental mode and last_record combined with append mode for output.

    in:
      type: sfdc
      username: "USERNAME"
      password: "PASSWORD"
      security_token: "SECURITY_TOKEN"
      client_id: "CLIENT_ID"
      client_secret: "CLIENT_SECRET"
      login_url: https://login.salesforce.com/
      target: Account
      incremental: true
      incremental_columns:
        - LastModifiedDate
      last_record:
        - {"key": "LastModifiedDate", "value": "2017-01-01 00:00:00UTC"}
    out:
      mode: append

    In the preceding configuration, Data Connector loads all records since 2017-01-01 00:00:00.

    Here's an example of a seed file using incremental mode combined with a custom SOQL, where condition, and columns schema:

    in:
      type: sfdc
      username: "USERNAME"
      password: "PASSWORD"
      security_token: "SECURITY_TOKEN"
      client_id: "CLIENT_ID"
      client_secret: "CLIENT_SECRET"
      login_url: https://login.salesforce.com/
      target: Account
      incremental: true
      incremental_columns:
        - LastModifiedDate
      soql: "select Id, Name, Description, LastModifiedDate from Account"
      where: "ownership = 'Public'"
      columns:
        - name: Id
          type: string
        - name: Name
          type: string
        - name: Desc
          type: string
        - name: LastModifiedDate
          type: timestamp
    out:
      mode: append

    As you can see, the SOQL selects fewer columns, with a filter restricting on the ownership field. The schema also defines a field (Desc) that does not exist in the SOQL, so this column won't be available in the import result.

    Create the schedule

    A new schedule can be created by using the td connector:create command. The name of the schedule, cron-style schedule, database and table where the data will be stored, and the data connector configuration file are required.

    $ td connector:create \
        daily_sfdc_import \
        "10 0 * * *" \
        td_sample_db \
        td_sample_table \
        load.yml

    The cron parameter also accepts three special options: @hourly, @daily and @monthly.

    By default, the schedule is set in the UTC time zone. You can set the schedule in a time zone using the `-t` or `--timezone` option. The `--timezone` option supports only extended timezone formats like 'Asia/Tokyo', 'America/Los_Angeles' etc. Timezone abbreviations like PST, CST are *not* supported and may lead to unexpected schedules.
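    One way to sanity-check a schedule's timezone is to convert the intended local run time to UTC. For example, "10 0 * * *" with `--timezone Asia/Tokyo` fires at 00:10 JST, which is 15:10 UTC on the previous day (a quick check using Python's standard zoneinfo module, not part of TD tooling):

    ```python
    from datetime import datetime, timezone
    from zoneinfo import ZoneInfo

    # 00:10 on April 2 in Asia/Tokyo (UTC+9, no DST) ...
    local_run = datetime(2023, 4, 2, 0, 10, tzinfo=ZoneInfo("Asia/Tokyo"))
    # ... is 15:10 UTC on April 1.
    utc_run = local_run.astimezone(timezone.utc)
    ```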

    List the Schedules

    You can see the list of scheduled entries by entering the command td connector:list.

    $ td connector:list

    +-------------------+--------------+----------+-------+--------------+-----------------+------------------------+
    | Name              | Cron         | Timezone | Delay | Database     | Table           | Config                 |
    +-------------------+--------------+----------+-------+--------------+-----------------+------------------------+
    | daily_sfdc_import | 10 0 * * *   | UTC      | 0     | td_sample_db | td_sample_table | {"type"=>"sfdc", ... } |
    +-------------------+--------------+----------+-------+--------------+-----------------+------------------------+

    Show the Setting and History of Schedules

    td connector:show shows the execution setting of a schedule entry.

    % td connector:show daily_sfdc_import

    Name : daily_sfdc_import

    Cron : 10 0 * * *

    Timezone : UTC

    Delay : 0

    Database : td_sample_db

    Table : td_sample_table

    td connector:history shows the execution history of a schedule entry. To investigate the results of each individual execution, use td job <jobid>.

    % td connector:history daily_sfdc_import

    +--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+
    | JobID  | Status  | Records | Database     | Table           | Priority | Started                   | Duration |
    +--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+
    | 578066 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-18 00:10:05 +0000 | 160      |
    | 577968 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-17 00:10:07 +0000 | 161      |
    | 577914 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-16 00:10:03 +0000 | 152      |
    | 577872 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-15 00:10:04 +0000 | 163      |
    | 577810 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-14 00:10:04 +0000 | 164      |
    | 577766 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-13 00:10:04 +0000 | 155      |
    | 577710 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-12 00:10:05 +0000 | 156      |
    | 577610 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-11 00:10:04 +0000 | 157      |
    +--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+
    8 rows in set

    Delete the Schedule

    td connector:delete removes the schedule.

    $ td connector:delete daily_sfdc_import

    Appendix

    A) Modes for out plugin

    You can specify the file import mode in the out section of seed.yml.

    append (default)


    This is the default mode. Records are appended to the target table.

    in:
      ...
    out:
      mode: append

    replace (In td 0.11.10 and later)

    This mode replaces data in the target table. Any manual schema changes made to the target table remain intact with this mode.

    in:
      ...
    out:
      mode: replace

    B) Use synchronous transfer only

    There are two options to ingest records from Salesforce into Treasure Data:

    Bulk API

    REST API

    Bulk API offers faster ingestion but has a limitation of 10,000 batch allocations within a 24-hour period. If your target is large, your entire available batch allocation might be consumed, eventually causing your job to fail.

    You might encounter the batch allocation limitation during the first ingestion of your data, when all records are ingested.

    If ingestion of all records is one of your use cases, consider enabling the synchronous transfer only option to use the REST API. The REST API avoids the batch allocation limitation but might be slower.

  • Access control settings are managed within TD Console.

    Account owners and administrators have additional control over access to resources and capabilities.

    The following access controls pertain to all account team members.

    REST APIs Access

    REST API access is controlled through API keys. Almost every REST API call needs to be issued with a valid API key for authentication and resource authorization purposes.

    There are two types of API keys: Master and Write-only. By default, every new user is created with one Master and one Write-only API key; however, any user can generate any number of API keys of either type. Any API key can be revoked at any time by the user themselves or by any user having Manage User permissions.

    Master: This is the traditional type of API key. It can be used to perform all permitted operations listed in the table below (see Console or Master API key) based on the user's permission level and access, without exception.

    Write-only: This API key type provides an additional layer of security in controlling access to a Treasure Data account through the REST APIs, especially when access has to be provided to third parties or API keys need to be embedded in ingestion libraries (see, for example, Web Page integrations). Based on the permissions and access levels associated with a user, the user's Write-only API key allows importing data only into those databases the user has write access to.

    Read how you can configure your keys.
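    For example, a REST request is authenticated by sending the API key in the Authorization header using the TD1 scheme (a sketch of building the header only; whether the call is then authorized depends on the key type and the user's permissions, and the key value here is hypothetical):

    ```python
    def td_auth_headers(api_key):
        """Headers for a Treasure Data REST API request. The same request may
        succeed with a Master key but be rejected with a Write-only key if it
        is not an import operation."""
        return {"Authorization": f"TD1 {api_key}"}

    # Hypothetical API key value for illustration.
    headers = td_auth_headers("1234/abcdefabcdef")
    ```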

    Actions According to Access Level

    This section describes access at the most basic level. Policies and permissions do not change these access rules.

    Actions You Perform with TD Console or Master API Key

    The following table lists actions that can be performed through either TD Console or the REST APIs. Some actions might not be available at the REST API level and some other actions (for example, Data Import) are not available within TD Console.

    Action

    Owner

    Admin

    Full Access

    Query only

    Import only

    Add User

    OK

    OK

    Manage User

    OK

    OK (1)

    Delete User

    OK

    OK (1)

    List Databases

    OK

    OK

    OK

    OK

    OK

    Create Database

    OK

    OK

    n/a (2)

    n/a (2)

    n/a (2)

    Manage Database

    OK

    OK

    OK (2)

    OK (2)

    OK (2)

    Delete Database

    OK

    OK

    OK (2)

    OK (2)

    OK (2)

    Show Table

    OK

    OK

    OK

    OK

    OK

    List Tables

    OK

    OK

    OK

    OK

    Create Table

    OK

    OK

    OK

    Delete Table

    OK

    OK

    OK

    Data Import (td-agent)

    OK

    OK

    OK

    OK

    Data Import (Result Output to TD)

    OK

    OK

    OK

    OK

    Data Import (Bulk Import)

    OK

    OK

    OK

    OK

    Data Import (embulk-output-td)

    OK

    OK

    OK

    Data Import (Data Connector)

    OK

    OK

    OK

    OK

    Data Import (FileUploader v2)

    OK

    OK

    OK

    OK

    Data Import (Insert Into)

    OK

    OK

    OK

    (3)

    Delete Data

    OK

    OK

    OK

    Issue Query

    OK

    OK

    OK

    OK

    Kill Own Query

    OK

    OK

    OK

    OK

    Kill Query from Others

    OK

    OK

    OK

    (4)

    Export Table

    OK

    OK

    OK

    OK

    Notes:

    1. Administrator users can only Manage User and Delete User for Restricted users; they are not allowed to manage or delete other Administrator users or the account Owner. The Manage User permission includes granting or revoking the Administrator role; therefore, an Administrator user can promote a user to Administrator but cannot demote a user from Administrator to Restricted.

    2. Any user (including Restricted users) can create a new database, and they will own and have all permissions for it. Full Access, Query-only, and Import-only actions for Create Database don't apply in that case. Restricted users can only Delete and Manage databases they created (and therefore own). Administrators and the Owner can always manage databases.

    3. While the end goal of INSERT INTO is to write the result back into a table, it requires special permissions. The executing user must possess read permission (Query-only, Full Access, Admin, or Owner) for all the databases accessed by the query, as well as read and write permission (Full Access, Admin, or Owner) to the database the result is inserted into; the query will fail otherwise.

    4. Restricted users with Query-only permission can see all the jobs running on the databases they have Query-only permission for, but will not be able to kill a job unless it is their own.

    Actions You Perform with TD Console or the Write-only API Key

    Action

    Owner

    Admin

    Full Access

    Query only

    Import only

    Add User

    Manage User

    Delete User

    List Databases

    Create Database

    OK

    OK

    n/a (1)

    n/a (1)

    n/a (1)

    Manage Database

    Delete Database

    List Tables

    Create Table

    OK

    OK

    OK

    OK

    Delete Table

    Data Import (td-agent)

    OK

    OK

    OK

    OK

    Data Import (Result Output to TD)

    OK

    OK

    OK

    OK

    Data Import (Bulk Import)

    (2)

    (2)

    (2)

    (2)

    Data Import (embulk-output-td)

    (2)

    (2)

    (2)

    Data Import (Data Connector)

    (2)

    (2)

    (2)

    (2)

    Data Import (Insert Into)

    (3)

    (3)

    (3)

    (3)

    (3)

    Issue Query

    Kill Own Query

    Kill Query from Others

    Export Table

    Notes:

    1. Any user (including Restricted users) can create a new database, and they will own and have all permissions for it. Full Access, Query-only, and Import-only permissions for Create Database don't apply in that case. Restricted users can only Delete and Manage databases they created (and therefore own). Administrators and the Owner can always manage databases.

    2. Bulk Import requires the ability to check the status of a job, which is not possible using a Write-only API key.

    3. INSERT INTO requires the ability to execute a query, which is not allowed using a Write-only API key.

    IP Whitelist

    IP whitelists restrict access to Treasure Data accounts through specified sets of IP addresses. There are two types of whitelists: default account IP whitelist and user IP whitelist.

    The Account Owner and Administrators set the default account IP whitelist and user IP whitelists. A user IP whitelist is viewable from a user's profile page. Restricted users can view but cannot set their own IP whitelist.

    Refer to the article IP Whitelist for more information.

  • User Permissions are managed within TD Console > Administration. Only account owners and administrators can see this view.

    The following summarizes what you can specify in a Standard account versus an account with the Policies feature:

    Databases
      Standard account: Yes
      Policies feature: Yes. Set at the user level; not part of a policy.

    Workflows
      Standard account: Yes
      Policies feature: Yes. Set as part of a policy.

    Master Segments in Data Workbench
      Standard account: Option is not viewable. Included as part of Audience Studio.
      Policies feature: Yes - you can set more restrictions.

    Audience Studio
      Standard account: Yes. Full Access or None.
      Policies feature: Yes - you can set more restrictions.

    The User page displays the users associated with an account.

    Permissions without Policies

    You click Users > Permissions to specify or modify the permissions of account users.

    Permissions in the Policies Feature

    For TD accounts with the Policies feature, you click Users > Policies > Show All Permissions to view permissions that are associated with account users. You cannot assign permissions to users in this view.

    To assign users to permissions, you click Policies. You create policies that specify permissions. You then assign users to policies. Policies are collections of permissions.

    User Permissions

    Local Database

    You can specify users' access level to databases in the account.

    Import Only - can run import jobs against any database in the account

    Query Only - can view and run queries against any database in the account

    Full Access - can perform any operation on databases in the account

    The access levels for databases mimic write (Import Only), read (Query Only), and read & write (Full Access) permissions.

    On the My Settings > Database page, all users can specify permissions for databases that they have created. Only account owners and administrators can specify permissions for any database in the account.
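    The three access levels can be restated as a small mapping (a hypothetical helper, not part of TD; it only encodes the rules listed above):

    ```python
    # Hypothetical mapping of database access levels to allowed operations.
    ACCESS_LEVELS = {
        "Import Only": {"write"},          # run import jobs
        "Query Only": {"read"},            # view and run queries
        "Full Access": {"read", "write"},  # any operation
    }

    def allowed(level, operation):
        """Return True if the given access level permits the operation."""
        return operation in ACCESS_LEVELS[level]
    ```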

    Workflows and Audience Studio

    Only account owners and administrators can view and specify permissions for Audience Studio and Workflows. For details on how permissions affect user actions in TD workflow, refer to the matrix. The matrix lists actions - both in TD Console and with commands on the CLI in TD Toolbelt - that are controlled by permissions.

    In the Permissions without Policies version, click a User view, and then click the Permissions tab. If you have the Policies feature, click Policies and go to the Permissions tab.

    Users must be assigned specific permissions in the following areas:

    Data Workbench

    You can specify users' access to workflows and master segments. Workflow permissions are standard. Master segment permissions are only for accounts that include Audience Studio.

    Workflows: ability to view, edit, and run workflows.

    View - view the list of all workflows

    Run - run all user-defined workflows. Select View to also view list of all workflows.

    Edit - create and edit all user-defined workflows. Select View to also view the list of all workflows. Select Run to run any user-defined workflow.

    If no permission is selected, then users in the policy do not have any workflow permissions.

    Tip: If you plan to create a Master Segment, select View in Workflow so that you can view the workflow process of creating a Master Segment.

    Permissions specified control workflow activities in TD console and the TD toolbelt command-line interface.

    Master Segments: For accounts that include the Policy feature and Audience Studio.

    View - users can only see Master Segments in the navigation pane. Users cannot edit.

    Full control - Run, create and edit all Master Segments.

    If no permission is selected, then users in the policy do not have any permissions for master segments.

    When you grant Full Control, you are also granting access to View, even if the View option is not selected or is deselected. With Full Control, you are enabling a user to view as well as create, edit, and run all Master Segments.

    Tip: If you plan to create Master Segments, select View in Workflow so that you can view the workflow process of creating a Master Segment.

    Audience Studio

    If your account includes Audience Studio, you can specify users' access to Segment Folders, Profiles, Predictive Scoring, and Profiles API Tokens.

    None - users do not see Audience Studio in the navigation; users cannot view any Audience Studio features

    Limited Access - account owners and administrators can select a master segment. From this view, the account owners and administrators can drill down to lower panes that enable selection of specific Segment folders.

    Full access - users can view all features of Audience Studio, including profiles, segment folders, predictive scoring, and API tokens

    Segment Folders

    When you specify Audience Studio: Limited Access and choose a specific master segment in which to configure permissions, you can specify:

    For all folders

    View - users can view all segment folders and their contents within the specified master segment

    Full control - users can view, create, delete, and modify any segment folders and their content within the specified master segment; users can see profiles, predictive scoring, and API tokens within the specified master segment

    For specific folders

    View - users can view specific segment folders within the specified master segment

    Full control - users can view and edit specific segment folders within the specified master segment; they cannot see profiles, predictive scoring, and API tokens associated with the segments

    Frequently Asked Questions

    Why doesn't my view show Master Segments? Why doesn't my view have Policies as an option?

    Master Segments and Audience Studio are visible only if the features are included as part of your account.

    As an administrator, how do I change my email address?

    Administrators view and edit their own profiles in the Users view. Go to "My Profile" -> Personal Info and change your email address.

    As an Owner, how do I change the account name or account email?

    If you use Google SSO to access TD Console, you must contact Support to change the account owner name or to change the account email listed in the Account Information page. Otherwise, to edit your account owner name or email, go to "My Profile" -> Personal Info.

  • As an account owner or administrator, you can define the criteria for Treasure Data user passwords. All users of your Treasure Data account must adhere to the set of criteria that you specify. When you change the password criteria, existing users receive notification and will have to update their password if they do not already meet the updated criteria.

    Specify the Password Policy

    Account owners and Administrators can specify the account password policy from Control Panel > Administration > Login Settings.

    Alternatively, access the Admin section of TD Console and click the Security tab to configure the password policy.

    Password Configuration

    Minimum Length

    The minimum character length of user passwords. Default: 8

    Expiration

    Specifies how frequently users must change their password, in days. For example, the value '60 days' means the password must be changed every 60 days. Default: Never

    History

    Saves a specified number of previous passwords for each user. Users are not allowed to reuse those passwords. Default: 6. For example, if you specify the value as '4', users must create a password that does not match any of their last 4 prior passwords.

    Complexity

    Specifies the types of characters that a password must contain. Refer to the following image:
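    Taken together, the settings above can be sketched as a small validator. This is an illustration only, not Treasure Data code; the exact character classes required depend on the Complexity setting you choose, and the defaults below (length 8, history 6) mirror the documented ones.

```python
import re

def validate_password(password, previous_passwords, min_length=8, history=6):
    """Check a candidate password against a policy like the one described above.

    Illustrative sketch: the complexity check here (upper, lower, digit) is
    one possible choice; your account's Complexity setting may differ.
    """
    if len(password) < min_length:
        return False, "too short"
    if password in previous_passwords[-history:]:
        return False, "reuses a recent password"
    if not (re.search(r"[A-Z]", password)
            and re.search(r"[a-z]", password)
            and re.search(r"\d", password)):
        return False, "missing required character classes"
    return True, "ok"
```

    A password that is long enough, unused in recent history, and mixed-case with a digit passes; anything else is rejected with a reason.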

    Login Configuration

    Maximum Invalid Login Attempts

    The number of incorrect attempts a user is allowed to make while attempting to log in. When the maximum number is exceeded, the user receives an email that allows them to unlock their profile and attempt to log in again. Alternatively, the user can reset their password from the login pane. If the user takes no action, the profile is unlocked 1 hour after the last failed attempt.
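    The lockout behavior can be sketched as follows. This is a minimal illustration, not TD code, and the attempt limit is a hypothetical value chosen by the administrator:

```python
import time

MAX_ATTEMPTS = 5             # hypothetical limit set by the administrator
UNLOCK_AFTER_SECONDS = 3600  # profile unlocks 1 hour after the last failed attempt

class LoginTracker:
    """Minimal sketch of the documented lockout behavior (illustrative only)."""
    def __init__(self):
        self.failures = 0
        self.last_failure = None

    def record_failure(self, now=None):
        self.failures += 1
        self.last_failure = now if now is not None else time.time()

    def is_locked(self, now=None):
        now = now if now is not None else time.time()
        if self.failures <= MAX_ATTEMPTS:
            return False
        # Auto-unlock one hour after the last failed attempt
        return (now - self.last_failure) < UNLOCK_AFTER_SECONDS
```

    The profile locks only once the maximum is exceeded, and unlocks on its own after the timeout.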

    User Session Timeout

    Specifies how long a user can remain logged in while inactive. Default: 24 hours

    Backward Compatibility

    New accounts default to the password policy you specify. When you change any password criteria, all users receive an email notification explaining that the security requirements have changed and that they might have to change their password to meet the new criteria. If their current passwords already meet the changed criteria, users can log in without updating their password.

    Also, account users receive a notice when their current password expires and must be changed.

    Password Reset

    Account owners and Administrators do not reset passwords for users but can send a request for the user to update their password.

    To send the request:

    Go to the control panel > Users

    Select a specific user

    Click to edit the user personal information

    Click Reset Password

    IP Whitelist

    You can specify an IP address at the account level. You might want to specify an IP address when you expect most of your team to use a common IP address to access Treasure Data. If your navigation provides a control panel, you no longer specify the IP Whitelist on the same page where you set the password and login configuration; you set system-level IP addresses from a separate page.

    View Article
  • Arm Treasure Data provides permissions that enable account owners and administrators to control the access and capabilities of their users.

    Permissions enable users to:

    perform an operation, such as table modification or segment activation

    act upon a specified entity, such as a database, or segment

    The following illustration shows workflow permission options and the areas of TD Console that the permissions affect.

    You can read more information on Permission options.

    You can also acquire Treasure Data's Policies feature to increase your level of control and security, and make it easier to manage user access to features and data.

    Policies combine permissions together. You can configure multiple policies and assign one or more policies to one or more users.

    If you change permissions within a policy, the effect is immediate (after the changes are saved) and the permission changes affect all the users who are assigned to the policy.

    You can read more information on Policies options.

    View Article
  • The Policies feature is available only to accounts that have purchased the feature. Owners and administrators manage access control policies in TD Console.

    You grant users access rights through the use of policies. Policies combine permissions together.

    Permission areas are:

    Workflow

    Master Segments in Data Workbench

    Audience Studio, which includes segment folders and can also include profiles, predictive scoring, and API tokens

    Databases are not part of policy permissions.

    Planning your Policy Structure

    Create your policy structure in a manner that works best for your organization. You might want to create most of your policies at one time, perhaps using a table matrix to plan out the combination of permissions. Or, you might want to create policies as the need arises.

    As you create policies, consider:

    Establishing a naming convention for your policies

    In the policy name, you might want to include the scope or area of permission, the user role, or perhaps a vendor or region assignment, for example:

    Full Control X Project EU

    Local Admin VendorX Segments

    Provide comprehensive descriptions

    You can use the description field to list permissions and levels of access that are contained in the policy.

    Refer to policy scenarios for ideas on how to structure your policies.

    Defining Policies

    In the Control Panel > Administration, click Policies to access the Policy work area. In this area, you can create, delete or apply policies.

    Creating Policies

    Click Actions > Create Policy and provide a meaningful name and description to your policy.

    Click Permissions and then the edit pencil.

    In the following example, you don't want policy members to view all master segments, and therefore leave the Configuration for All Master Segments options unselected:

    You then select Limited Access with your ultimate goal in mind, specifying access to specific (not all) segments. You are specifying limited access to Audience Studio:

    To specify specific segments, first select the Master Segment that contains the segment folders that you want to specify, then click Add.

    You have the option to enable users to view or have full control permission for the Test Master Segment but you opt not to grant that permission. You are designing this policy to grant permission only to a specific segment folder.

    Click to edit.

    Select the segments that you want to specify from the dropdown box. Click Add.

    Specify the permissions for the segment folder that you added.

    In the preceding example, users assigned to this policy have full control to one segment folder associated with the 'Test Master Segment'.

    When you grant Full Control, you are also granting access to View, even if the View option is not selected or is deselected. With Full Control, you are enabling a user to view as well as create, edit, and activate all segments in the specified folder.
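    As a rough sketch of this rule (illustrative only, not TD's implementation), Full Control can be modeled as a superset that always includes View:

```python
def effective_permissions(granted):
    """Expand a set of granted permissions into the effective set.

    Hypothetical permission names; granting "full_control" implicitly
    grants "view" plus create, edit, and activate, as described above.
    """
    effective = set(granted)
    if "full_control" in effective:
        effective.update({"view", "create", "edit", "activate"})
    return effective
```

    A user granted only View stays at View; a user granted Full Control can view even if View was never selected.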

    Frequently Asked Questions

    What happens when you enable or disable the Policies feature

    As the account owner, be aware of the actions that occur when you enable or disable policies.

    Enabling the Policies Feature

    When you enable the Policies feature, you must:

    Create policies that contain specified permissions

    Assign existing users to policies

    Disabling the Policies Feature

    When you disable the Policies feature, the Policies pane is no longer visible. To view the permissions for your users, you must:

    Go to Administration > Users

    Select to view a user.

    Click Permissions to view and specify permissions.

    View Article
  • Administrators can specify the maximum length of Web browser sessions.

    If your navigation provides a control panel, you can specify the session timeout from Control Panel > Administration > Login Settings > Login Configuration.

    Otherwise, click Admin > Security. Then click the pencil to edit.

    The default value is 1 week, meaning users need to log in only once every 7 days. Only administrators can set the maximum session length for users.

    View Article
  • When a user is deleted from your account, the resources created by that user are affected. The following table summarizes how resources are affected when their user is deleted:

    If an administrator deletes the user who is the owner of an audience, the workflow of the audience will fail. To avoid this failure, request that Treasure Data change the Audience owner to another user before the deletion operation.

    Object

    Re-Assigned To

    Status

    Notes

    API keys

    Owner or an Administrator can specify reassignment

    Deleted

    The authority level of the deleted user is changed to the level of the re-assigned user. API keys that are deleted are invalid and cannot be reused.

    Audience owner

    n/a

    Deleted

    If you delete the user that is the owner of an audience, the workflow associated with the audience fails.

    To fix, request a change of the audience owner to anyone other than the user who needs to be deleted.

    Databases

    Account Owner

    Continuously available

    The Account Owner is notified by email.

    Tables

    (None)

    Continuously available

    -

    Authentications

    (None)

    Continuously available

    -

    Sources

    Account Owner

    Available

    The Account Owner is notified by email. The schedule is deactivated. Re-enable it manually, as necessary.

    Queries

    Account Owner

    Available

    The Account Owner is notified by email. The schedule is deactivated. Re-enable it manually, as necessary.

    Jobs

    (None)

    Continuously available

    However, scheduled data transfers are stopped.

    Workflows

    (Owner is invisible)

    Continuously available

    Limitation: The Treasure Data related operators in the workflow (such as td>, td_run>) fail if the td.apikey secret is not set for projects that the deleted user created. To avoid failed runs, set the td.apikey secret. Refer to workflow secrets.

    Permissions

    (None)

    n/a

    Policies

    (None)

    Available

    If the deleted user was the sole assigned user, the policies remain visible but are not active because no user is assigned.

    View Article
  • The example policies described herein cover typical policy structures that manage:

    a subsidiary or subvendor

    employee security clearance levels that dictate access to data

    project-centered, cross-functional teams that demand role-based policies

    In these scenarios, we specify the database permissions in a separate table because database permissions are assigned at the user level, not the policy level.

    Scenario 1

    Your company works with a vendor and the vendor is responsible for the marketing campaigns in a specific region. You want the vendor to view a specific master segment (a collection of user profiles based on a physical region) and work with full access to segments that are applicable to the specific region. The company manages multiple master segments, with each region having its own master segment. In this scenario, we are dealing with just one region.

    The team does not work with databases or workflows.

    Policy Plan

    One policy.

    User role:

    Vendor - access to all segments within a specified master segment

    User personas:

    Jorge, Christian, Kerry, Zach, and Dominique are all Marketing Analysts. Permissions: View to only a specific marketing segment, Region 1. Full control to specified segment folders

    Within the specific master segment, the vendor has full access, including the ability to create, modify, and delete all segment folders and segments in the specified master segment.

    The vendor can view profiles and can use the predictive scoring and create profile APIs to feed data back into the segment.

    Workflow is not enabled for this scenario and database activity is not applicable.

    Policy Implementation

    Policies and Permissions configuration for Scenario 1:

    Policy Name

    Workflow

    All Master Segments in Data Workbench view

    Audience Studio: Master Segment

    Audience Studio: Specific Folders Segments

    Policy Applied To:

    WeCanCompany Region 1 Segment Access

    Do not select any permission

    Do not select any permission

    Select Limited Access and select the Region 1 master segment from the drop-down list of master segments; select View permission for master segment config and select View and Full Control for All Folders

    n/a

    By selecting All Folders, you don't need to drill down to specify further permissions

    Jorge

    Christian

    Kerry

    Zach

    Dominique

    Database access for WeCan Company

    User Name

    Database Name

    Level

    Each user in WeCan Company

    n/a

    n/a

    Scenario 2

    You want to separate marketing analyst work areas by policies and permissions and apply additional restrictions to new analysts. The account includes two master segments.

    Policy Plan

    Four policies. Two policies for each master segment. Within each master segment, one policy is for regular users and a second policy is for new employees.

    The team does not work with databases or workflows.

    User roles:

    Regular Marketing Analyst - a view of all master segments, full access to a specific master segment, and full access to all segment folders in that master segment.

    New Marketing Analyst - no view of any master segments in Data Workbench; access to a specific master segment, and a mix of access to specified segment folders

    User personas:

    George, Silas, Reba: Regular Marketing Analysts - full control to master segment Retail and full access to specified segment folders: current, churn, trend

    Karen, Hoa, Rupa: Regular Marketing Analysts - full control to master segment Commercial and full access to specified segment folders: current, churn, trend

    Nicki, Ted: New Marketing Analyst - Access to the master segment Retail; full access to the segment folder current and view only to segment folders: churn and trend

    Laura, Paul, Gina, Derek: New Marketing Analyst - Access to master segment Commercial; full access to the segment folder current and view only to segment folders: churn and trend

    Regular marketing analysts can view the behaviors and attribute specifications for all master segments. Access to Audience Studio is limited to their specified master segment (either Retail or Commercial).

    Within their master segment, they can view and act upon all features of Audience Studio, including profiles, segment folders, predictive scoring, and API tokens. They create, delete, and modify all segment folders in the specified master segment (named current, churn, and trend); each folder contains multiple subfolders and the subfolders contain multiple segments. Hence, regular marketing analysts can view, analyze, and activate segments.

    New employees are assigned view access to specific segment folders (churn and trend). The intent is for new employees to monitor and report on segment data.

    Workflow is not enabled for this scenario and database activity is not applicable.

    Policy Implementation

    Policies and Permissions configuration for Scenario 2:

    Policy Name

    Workflow

    All Master Segments in Data Workbench view

    Audience Studio: Master Segment

    Audience Studio: Specific Folders Segments

    Policy Applied To:

    Retail Regular Analyst

    Do not select any permission

    View

    Select Limited Access and select the Retail master segment; select View and Full Control permission for master segment config and in All Segment Folders: select View and Full Control

    n/a

    George, Silas, Reba

    Retail New Analyst Access

    Do not select any permission

    Do not select any permission

    Select Limited Access and select the Retail master segment; for master segment config do not select any permission. In All Segment Folders: Do not select any permission

    Click to edit.

    Folders and Segments:

    Select View for segments: churn and trend

    Nicki, Ted

    Commercial Regular Analyst

    Do not select any permission

    View

    Select Limited Access and select the Commercial master segment; select View and Full Control permission for master segment config and in All Segment Folders: select View and Full Control

    n/a

    Karen, Hoa, Rupa

    Commercial New Analyst Access

    Do not select any permission

    Do not select any permission

    Select Limited Access and select the Commercial master segment; for master segment config do not select any permission. In All Segment Folders: Do not select any permission

    Click to edit.

    Folders and Segments:

    Select View for segments: churn and trend

    Laura, Paul, Gina, Derek

    Database access for Retail

    User Name

    Database Name

    Level

    Each marketing analyst

    n/a

    n/a

    Scenario 3

    You currently use Treasure Data and have a standard permission set. You are now adding the Policies feature and have more variance in how permissions must be set for users. This scenario focuses on the gaming group.

    Current Permissions Plan

    Current user roles:

    Manager (CMO) - periodically checks master segment and segment data; guides data initiatives and campaigns

    Data Engineers - work on data ingestion and extract-transform-load, and perform some data cleansing. Require access to databases (full access or import only), workflows for ingesting (view, edit, run). Data Engineer is the only role who works with workflows.

    Data Analysts - review data across different sets, clean data by using job queries, create master segments. Merge data or create new databases, and have view access to all other databases. Require full access to master segments in Data Workbench.

    Marketing Analysts - review master segments, create segments and subsegments, and prepare segments for activation. Require full access to only specific master segments. Must have full access to segments within their specified master segment.

    Policy Plan

    Establish policies. Distinguish employees from contractors.

    Four policies for the Gaming group. Plus a policy for the Chief Marketing Officer (CMO) of the company.

    User roles:

    Marketing management (CMO and executive team) - full access to all master segments in Data Workbench, full Audience Studio access

    Data Engineer Gaming Group - full access to specific databases; full access to Workflow. No access to Audience Studio.

    Data Analyst Gaming Group - a variety of database permissions, which is OK because database permissions are specified at the user permissions (not policy) level. View only to Workflow and full access to Data Workbench master segments. Limited access to Audience Studio, with access to only Gaming segments. Data Analysts don't create or activate segments.

    Marketing Analyst Gaming Group - view to all master segments. Requires full access to only specific master segments, but within the specific segments, full control with full permission to create segments and subsegments and prepare segments for activation.

    Consultant Marketing Analyst - full access to a specific master segment, and full access to all folders in the specific segment folders.

    User personas:

    Kelly: Manager (CMO) - full access to all master segments, including Gaming. Also has full access in Audience Studio.

    Cam: Director of Gaming

    Cathy, Steve, Eke: Data Engineers - full access to specific databases: us_customers, us_support, and us_marketing; full access to Workflow. No access to Audience Studio.

    Josh: Data Analyst - database query permission: us_customers, us_support, and us_marketing; view only to Workflow and full access to Data Workbench master segments; can look at Gaming segments in Audience Studio

    August, Gina: Marketing Analysts - limited access to Data Workbench master segments, with full access to the master segment US-West, and view access to all other master segments (such as US-Central, US-East, Asia-East, and Europe-West). Full access to segments in US-West as well.

    Petra: Consultant Marketing Analyst - limited access to master segments in Data Workbench with only full access to the US-West master segment specified. Full access to the California segment folder, and view access to the other folders in the US-West master segment (Oregon, Arizona)

    The marketing analysts can view profiles and can use the predictive scoring and create-profile APIs to feed data back into the segments that they have permission to access. In other words, the analysts can view, analyze, and activate segments.

    Workflow is enabled for two policies in this scenario: Data Engineer - full access and Marketing Analyst - view only.

    The data analyst and marketing analyst are also given permission by other users to access databases.

    Policy Implementation

    Policies and Permissions configuration for Scenario 3:

    Policy Name

    Workflow

    All Master Segments in Data Workbench view

    Audience Studio: Master Segment

    Audience Studio: Specific Folders Segments

    Policy Applied To:

    Full Access

    Executive Office

    Do not select any permission

    Full Access

    Full Access

    n/a

    Kelly

    Cam

    Data Engineers

    View

    Run User-Defined

    Edit User-Defined

    Do not select any permission

    Do not select any permission

    n/a

    Cathy

    Steve

    Eke

    Data Analysts

    View

    Full Access

    Select Limited Access and select the Gaming US-West master segment; in All Segment Folders: select View

    n/a

    Josh

    Marketing Analysts

    Do not select any permission

    View

    Select Limited Access and select the Gaming US-West master segment; in All Segment Folders: select View and Full Control

    Select the US-Central master segment; in All Segment Folders: select View

    Select the US-East master segment; in All Segment Folders: select View

    Select the Asia-East master segment; in All Segment Folders: select View

    Select the Europe-West master segment; in All Segment Folders: select View

    August

    Gina

    Consultant: Marketing Analyst

    Do not select any permission

    Do not select any permissions

    Select Limited Access and select the Gaming US-West master segment; in All Segment Folders: select View

    Then click to edit the US-West master segment and select the California, Arizona, and Oregon segment folders

    For each segment folder, specify the permissions: California - View, Full Control

    Arizona - View

    Oregon - View

    Petra

    Database access for Scenario 3

    User Name

    Database Name

    Level

    Each data engineer: Cathy, Steve, Eke

    us_customers

    us_support

    us_marketing

    full access

    Data analyst: Josh

    us_customers

    us_support

    us_marketing

    query access

    View Article
  • You can opt to log into TD Console using Google SSO. Google, acting as an Identity Provider (IdP), manages your email as your sign-in method.

    How to specify use of Google SSO

    When you convert your user sign-in to Google SSO, neither you nor your TD administrator can change your user account from Google SSO back to Email and Password login mode. You must contact Treasure Data Support to revert to Email and Password.

    To convert your sign-in to Google SSO, complete the following steps:

    Open TD Console.

    Access My Settings > My Profile.

    Click Sign In with Google.

    You see the following warning:

    If you click Continue, you are taken to the Google SSO page.

    The next time that you log into TD Console, you see an indication that you are signed in through Google SSO:

    View Article
  • To add new users to Arm Treasure Data, you send an invitation to join. You invite new users on an individual basis. You must be an account owner or administrator to invite a new user to your account.

    Open TD Console.

    Navigate to the Control Panel.

    Click Users.

    Click Actions > Invite User.

    Type the email address for the user you want to invite.

    Click Save to add the user.

    If the user already exists but is not visible in your list of users, you receive a message indicating that an invitation is already pending for the user.

    The Treasure Data system sends an email to the user with a link for them to accept.

    View Article
  • You must be an account owner or administrator to change the role of an existing user.

    Open TD Console.

    Navigate to Control Panel > Users.

    Select a specific user

    Select the edit pencil to edit the user personal information.

    Scroll down to User Type and select the desired role.

    View Article
  • You can log into TD Console using Google aliases. Arm Treasure Data views each Google SSO alias as a unique email identifier. Google manages the aliases, resolving the aliases to your email address.

    You can log into an Arm Treasure Data account using different Google aliases, for example:

    [email protected]

    [email protected]

    [email protected]

    [email protected]

    You can sign into your account as an admin and then, with an alias, sign into the same account as a restricted user. Or you can use aliases to sign into your account with different permissions enabled.

    View Article
  • Connect to Arm Treasure Data's Presto using our JDBC Driver.

    Setup

    Download presto-jdbc driver

    To use a secure HTTPS (port 443) connection, you need presto-jdbc 0.148 or later.

    Use the following parameters to connect to Presto:

    Driver name: io.prestosql.jdbc.PrestoDriver

    user: (Your TD API key)

    password: (dummy string. This value will be ignored)

    In presto-jdbc 0.180 and later, because of a bug, a dummy string must be set as the password.

    Connection URL examples

    The connection URL needs to have the following format: jdbc:presto://api-presto.treasuredata.com:443/td-presto/(database name)?SSL=true

    Examples

    jdbc:presto://api-presto.treasuredata.com:443/td-presto?SSL=true

    jdbc:presto://api-presto.treasuredata.com:443/td-presto/sample_datasets?SSL=true

    Example Code

    $ javac Sample.java

    $ java -cp .:presto-jdbc-0.152.jar Sample (your TD API key)

    time=1412351990, method=GET, path=/category/office

    time=1412351976, method=GET, path=/item/software/2265

    time=1412351961, method=GET, path=/category/toys

    time=1412351947, method=GET, path=/item/electronics/3305

    time=1412351932, method=GET, path=/category/software

    Sample.java

    import java.sql.*;

    class Sample {
        public static void main(String[] args) {
            if (args.length != 1) {
                System.err.println("Provide your TD API key as an argument");
                return;
            }
            String apikey = args[0];
            try {
                Connection conn = DriverManager.getConnection(
                    "jdbc:presto://api-presto.treasuredata.com:443/td-presto/sample_datasets?SSL=true",
                    apikey, "dummy_password");
                Statement stmt = conn.createStatement();
                try {
                    ResultSet rs = stmt.executeQuery("SELECT time, method, path from www_access limit 5");
                    while (rs.next()) {
                        long time = rs.getLong(1);
                        String method = rs.getString(2);
                        String path = rs.getString(3);
                        System.out.println(String.format("time=%s, method=%s, path=%s", time, method, path));
                    }
                } finally {
                    stmt.close();
                    conn.close();
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    Limitations

    The Presto JDBC connection has several limitations:

    setAutoCommit(false) throws an exception because transactions are not supported.

    presto-jdbc queries share the same Presto resources you have available for any other ad hoc or scheduled queries that you submit from the TD Console or through the REST API.

    presto-jdbc queries are shown in your account's job log, with the exception of the following standard entries:

    Query result table

    Query plan

    Progress log

    The query result is converted into JSON before being submitted back to separate tools for display. As such, very large result tables may take some time before being displayed. If you have latency problems, contact support.

    Certain BI tools issue information_schema requests that cause failed queries in our system. This may be an issue in your tool during first use.

    If you see any unexpected errors/bugs/logs that occur during beta use, contact support.

    Our aim is to keep this API up as much as possible. Specifically, we expect to achieve a downtime of less than 1 hour total per month.

    View Article
  • Treasure Data Workflow provides an easy way to leverage Python custom scripts for sentiment analysis with TensorFlow and to export the model to Amazon S3. Machine learning algorithms can be run as part of your scheduled workflows using Python custom scripts. This article introduces the steps to run a sentiment analysis algorithm within a Treasure Data Workflow.

    Sentiment analysis classifies movie review texts as positive or negative using TensorFlow and TensorFlow Hub. See the official documentation.


    Sentiment Analysis using Python Custom Scripts

    There are two versions of the algorithm discussed in this article:

    Example Workflow with TensorFlow using Amazon S3

    Example Workflow with TensorFlow without Amazon S3

    Example Workflow using TensorFlow with Amazon S3

    The workflow:

    Fetches review data from Treasure Data

    Builds a model with TensorFlow

    Stores the model on S3

    Predicts polarities for unknown review data and writes it back to Treasure Data
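    The four steps above can be outlined as a minimal driver, with each stage passed in as a callable. The function and parameter names here are hypothetical placeholders, not the project's actual scripts:

```python
def run_sentiment_workflow(fetch, build_model, store_model, predict, write_back):
    """Orchestrate the workflow stages described above (illustrative sketch).

    Each argument is a callable standing in for one stage: fetching review
    data from Treasure Data, training with TensorFlow, uploading the model
    to S3, predicting polarities, and writing results back to Treasure Data.
    """
    reviews = fetch()                 # 1. fetch review data from Treasure Data
    model = build_model(reviews)      # 2. build a model with TensorFlow
    store_model(model)                # 3. store the model on S3
    predictions = predict(model)      # 4. predict polarities for unknown data
    write_back(predictions)           #    ...and write them back to Treasure Data
    return predictions
```

    In the real project, each stage corresponds to a Python custom script run by the workflow engine.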

    Prerequisites

    Make sure the custom scripts feature is enabled for your TD account.

    Download and install the TD Toolbelt and the TD Toolbelt Workflow module. For more information, see TD Workflow Quickstart.

    Basic Knowledge of Treasure Data Workflow syntax

    AWS S3

    S3 Secrets

    Run the Example Workflow

    Download the sentimental-analysis project from this repository

    In the Terminal window, change directory to sentimental-analysis

    Run data.sh to ingest training and test data on Treasure Data. About 80 million records are fetched to build the model. The script also creates a database named sentiment and tables named movie_review_train and movie_review_test to store the data. For example:

    $ ./data.sh

    Assume that the input table is:

    rowid

    sentence

    sentiment

    polarity

    1-10531

    "Bela Lugosi revels in his role as European horticulturist (sic) Dr. Lorenz in this outlandish...

    2

    0

    1-10960

    Fragmentaric movie about a couple of people in Austria during a heatwave. This kind of...

    3

    0

    1-24370

    I viewed the movie together with a homophobic friend, my wife and her female friend. So...

    7

    1

    Run the example workflow as follows:

    $ td workflow push sentiment
    $ td workflow secrets \
        --project sentiment \
        --set apikey \
        --set endpoint \
        --set s3_bucket \
        --set aws_access_key_id \
        --set aws_secret_access_key
    # Set secrets from STDIN like: apikey=1/xxxxx, endpoint=https://api.treasuredata.com, s3_bucket=my_bucket,
    # aws_access_key_id=AAAAAAAAAA, aws_secret_access_key=XXXXXXXXX
    $ td workflow start sentiment sentiment-analysis --session now

    Results of the script are stored in the test_predicted_polarities table in Treasure Data.

    To view the table:

    Log into TD Console.

    Search for the sentiment database.

    Locate the test_predicted_polarities table.

    The prediction results are stored in this table as shown below:

    rowid

    predicted_polarity

    1-21643

    0

    1-22967

    1

    Example Workflow using TensorFlow without Amazon S3

    The workflow:

    Fetches review data from Treasure Data

    Builds a model with TensorFlow

    Predicts polarities for unknown review data and writes the data back to Treasure Data

    Prerequisites

    Make sure this feature is enabled for your TD account.

    Download and install the TD Toolbelt and the TD Toolbelt Workflow module. For more information, see TD Workflow Quickstart.

    Basic Knowledge of Treasure Data Workflow syntax

    Run the Example Workflow

    Download the sentimental-analysis project from this repository

    From the command line Terminal window, change directory to sentimental-analysis. For example:

    cd sentimental-analysis

    Run data.sh to ingest training and test data on Treasure Data. About 80 million records are fetched to build the model. The script also creates a database named sentiment and tables named movie_review_train and movie_review_test to store the data.

    $ ./data.sh

    Assume that the input table is as follows:

    rowid    | sentence                                                                                          | sentiment | polarity
    1-10531  | "Bela Lugosi revels in his role as European horticulturist (sic) Dr. Lorenz in this outlandish... | 2         | 0
    1-10960  | Fragmentaric movie about a couple of people in Austria during a heatwave. This kind of...         | 3         | 0
    1-24370  | I viewed the movie together with a homophobic friend, my wife and her female friend. So...        | 7         | 1

    Run the example workflow as follows:

    $ td workflow push sentiment
    $ td workflow secrets \
        --project sentiment \
        --set apikey \
        --set endpoint
    # Set secrets from STDIN like: apikey=1/xxxxx, endpoint=https://api.treasuredata.com
    $ td workflow start sentiment sentiment-analysis-simple --session now

    Results of the script are stored in the test_predicted_polarities table in Treasure Data.

    To view the table:

    Log into TD Console.

    Search for the sentiment database.

    Locate the test_predicted_polarities table.

    The prediction results should be similar to the following:

    rowid    | predicted_polarity
    1-21643  | 0
    1-22967  | 1

    Review the Workflow Custom Python Script

    Review the contents of the sentimental-analysis directory:

    sentiment-analysis.dig - This is the TD Workflow YAML file for sentiment analysis with TensorFlow.

    sentiment.py - This is the Custom Python script with TensorFlow. It builds a prediction model with existing data and predicts polarity to unknown data.
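    To give a feel for how the two files connect, here is a rough sketch of what such a custom-script workflow file can look like. The task name, module path, and Docker image below are illustrative assumptions, not the exact contents of the repository's sentiment-analysis.dig:

```
_export:
  docker:
    image: "digdag/digdag-python:3.9"

+train_and_predict:
  # Runs the main() function in sentiment.py inside the custom-scripting container.
  py>: sentiment.main
  _env:
    TD_API_KEY: ${secret:apikey}
    TD_API_SERVER: ${secret:endpoint}
```

    The py> operator is what hands control to the custom Python script; secrets set with td workflow secrets are exposed to the script through ${secret:...} references.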

    In this example, we use a pre-trained model in TensorFlowHub for word embedding for English text.

    embedded_text_feature_column = hub.text_embedding_column(
        key="sentence",
        module_spec="https://tfhub.dev/google/nnlm-en-dim128/1")

    If you want to change this model to another one, for example, a Japanese model, you can modify it as follows:

    embedded_text_feature_column = hub.text_embedding_column(
        key="sentence",
        module_spec="https://tfhub.dev/google/nnlm-ja-dim128/1")

    Before word embedding, you need to prepare tokenized sentences for Japanese.

    Because this custom script also saves the trained TensorFlow model with movie reviews to Amazon S3, you can build your prediction server with TensorFlow Serving.

    To change the serving_input_receiver_fn, modify the following code:

    feature_spec = tf.feature_column.make_parse_example_spec([embedded_text_feature_column])
    serving_input_receiver_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(feature_spec)
    estimator.export_saved_model(EXPORT_DIR_BASE, serving_input_receiver_fn)


    View Article
  • Treasure Data Workflow provides an easy way to predict continuous values, like a price or energy consumption, using a Linear Regression predictor. Machine Learning algorithms can be run as part of your scheduled workflows, using Python custom scripts. This article introduces feature selection using scikit-learn in a Python script, which selects important features and builds a partial query for Hivemall to predict house prices.

    Feature selection is a common machine learning technique used to build a simplified model for understanding and to enhance generalization by removing irrelevant or redundant information.


    Feature Selection using Python Custom Scripts

    This article describes how to predict house prices using the Boston house pricing data set with Linear Regression predictor. The feature selection from scikit-learn helps identify meaningful attributes from which to create supervised models.

    Example Workflow to Predict House Prices

    The workflow:

    Splits the training and test data sets

    Selects important features with scikit-learn

    Builds a partial query for Hivemall

    Trains and evaluates the model with selected features on Hivemall

    Prerequisites

    Make sure that the custom scripts feature is enabled for your TD account.

    Download and install the TD Toolbelt and the TD Toolbelt Workflow module. For more information, see TD Workflow Quickstart.

    Basic Knowledge of Treasure Workflow's syntax

    Run the Example Workflow

    Download the house price prediction project.

    From the command line terminal window, change to the house-price-prediction directory. For example:

    cd house-price-prediction

    Run data.sh to ingest training and test data on Treasure Data. The script uses the Boston Housing Dataset with about 506 cases to build the model. The script also creates a database named boston and table named house_prices to store the data.

    $ ./data.sh

    Input

    Assume the input table is:

    crim     zn      indus   chas  nox     rm      age    dis     rad  tax  ptratio  b       lstat  medv
    double   double  double  int   double  double  double double  int  int  double   double  double double
    0.00632  18      2.31    0     0.538   6.575   65.2   4.09    1    296  15.3     396.9   4.98   24
    0.02731  0       7.07    0     0.469   6.421   78.9   4.9671  2    242  17.8     396.9   9.14   21.6
    0.02729  0       7.07    0     0.469   7.185   61.1   4.9671  2    242  17.8     392.83  4.03   34.7
    0.03237  0       2.18    0     0.458   6.998   45.8   6.0622  3    222  18.7     394.63  2.94   33.4
    ...

    Here medv is the median value of owner-occupied homes in $1000s and is the target value for regression. chas and rad are categorical values; the other features are quantitative.

    By default, the top 4 columns most correlated with medv are used.
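    To make that selection step concrete, here is a small plain-Python sketch of ranking features by correlation with the target. The feature values below are just the sample rows from the table above, not the full data set, and the helper is a straightforward Pearson correlation:

```python
from math import sqrt

def pearson(xs, ys):
    # Plain Pearson correlation coefficient between two equal-length lists.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy slices of three Boston columns (taken from the sample rows above).
features = {
    "rm":    [6.575, 6.421, 7.185, 6.998],
    "lstat": [4.98, 9.14, 4.03, 2.94],
    "tax":   [296.0, 242.0, 242.0, 222.0],
}
medv = [24.0, 21.6, 34.7, 33.4]

# Rank features by absolute correlation with the target and keep the top k.
ranked = sorted(features, key=lambda c: abs(pearson(features[c], medv)), reverse=True)
top = ranked[:2]  # the workflow keeps the top 4; only 2 kept in this toy example
```

    The actual workflow does this ranking inside scikit-learn rather than by hand, but the idea is the same: keep only the columns that carry signal about medv.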

    Run the example workflow as follows:

    $ td workflow push regressor

    # Set credentials for this workflow

    # export TD_API_KEY=1/xxxxx

    # export TD_API_SERVER=https://api.treasuredata.com

    $ td wf secrets --project regressor --set apikey --set endpoint

    # Enter apikey (e.g., 1234/XXXXXXX) and endpoint (e.g., https://api.treasuredata.com)

    $ td wf start regressor regression-py

    The predicted results for the price of houses are stored in the predictions table.

    To view the table:

    Log into TD Console.

    Search for the boston database.

    Locate the predictions table

    Output

    This workflow outputs the predicted results:

    rowid (string)  | predicted_price (double)
    1-10            | 33.97034232809395
    1-121           | 30.3377696027913
    ...             | ...

    Review the Workflow Custom Python Script

    Review the contents of the directory:

    regression-py.dig - Example workflow for sales prediction and notification to Slack.

    task/__init__.py - Custom Python script with scikit-learn. It selects important features and builds a partial query for Hivemall.

    This example uses scikit-learn's SelectFromModel function, which enables selection of features when building a predictive model.

    reg = ExtraTreesRegressor()
    reg = reg.fit(X, y)
    model = SelectFromModel(reg, prefit=True)
    feature_idx = model.get_support()
    feature_name = df.drop(columns=['medv']).columns[feature_idx]
    selected_features = set(feature_name)
    ...(snip)...
    feature_query = self._feature_column_query(selected_features, feature_types=feature_types)

    This example uses ExtraTreesRegressor to get feature importances, but you can substitute any other feature selection logic.

    In the example code, there is a _create_vectorize_table function for creating a vectorized table to train Hivemall models with Python.

    self._create_vectorize_table(engine_hive, dbname, "train", "{}_train".format(source_table), feature_query)

    While you can use this function within the custom Python script instead of exporting the partial query, we recommend that you export queries. Exporting queries gives you the benefit of Digdag parallelization and manageability on TD Console.

    View Article
  • Treasure Workflow provides for prediction of time-series values, like sales revenue or page views, using Facebook Prophet. Machine Learning algorithms can be run from a custom Python script as part of your scheduled workflows. Use Facebook Prophet in your Python custom script for time series analysis and sales data prediction.

    Prophet is a procedure for forecasting time series data. It can learn probability distributions from incomplete data sets and shifts in the trends. Details of the data for these examples can be found in Prophet's official documentation.

    These examples predict sales data with Facebook Prophet, doing time series analysis to predict continuous values using past data. Pick the example to use depending on your access to S3 and Slack.

    Example Workflow to Predict Future Sales with S3 and Slack

    Example Workflow to Predict Future Sales without S3 or Slack

    Example Workflow to Predict Future Sales with S3 and Slack

    The workflow:

    Fetches past sales data from Treasure Data

    Builds a model with Prophet

    Predicts future sales and write back to Treasure Data

    Uploads predicted figures to Amazon S3

    Sends a notification to Slack

    Prerequisites

    Make sure that the custom scripts feature is enabled for your TD account.

    Download and install the TD Toolbelt and the TD Toolbelt Workflow module. For more information, see TD Workflow Quickstart.

    Basic Knowledge of Treasure Workflow's syntax

    S3 bucket and associated credentials

    Create your Slack webhook URL

    Running the Example Workflow

    Download the project from this repository

    In the command line Terminal window, change directory to timeseries. For example:

    cd timeseries

    Run data.sh to prepare example data. It creates a database called timeseries and table called retail_sales.

    $ ./data.sh

    Run the example workflow as follows:

    $ td workflow push prophet
    $ td workflow secrets \
        --project prophet \
        --set apikey \
        --set endpoint \
        --set s3_bucket \
        --set aws_access_key_id \
        --set aws_secret_access_key \
        --set slack_webhook_url
    # Set secrets from STDIN like: apikey=1/xxxxx, endpoint=https://api.treasuredata.com, s3_bucket=$S3_BUCKET,
    # aws_access_key_id=AAAAAAAAAA, aws_secret_access_key=XXXXXXXXX,
    # slack_webhook_url=https://hooks.slack.com/services/XXXXXXX/XXXXXX/XXXXXX
    # Where XXXXXXX/XXXXXX/XXXXXX is the value of the Slack URL where you want information to be populated automatically.
    $ td workflow start prophet predict_sales --session now

    A notification of prediction results is sent to your Slack:


    Prediction results are stored in the predicted_sales table in the timeseries database.

    Validate predicted sales values in TD Console with SQL as follows:

    select ds, yhat, yhat_lower, yhat_upper
    from timeseries.predicted_sales
    order by ds desc
    limit 100;

    You can see the predicted sales numbers in the yhat column, similar to the following:

    Example Workflow to Predict Future Sales without S3 or Slack

    This example is for when you do not have access to S3 or Slack.

    The workflow:

    Fetches past sales data from Treasure Data

    Builds a model with Prophet

    Predicts future sales and writes back to Treasure Data

    Prerequisites

    Make sure custom scripts feature is enabled for your TD account.

    Download and install the TD Toolbelt and the TD Toolbelt Workflow module. For more information, see TD Workflow Quickstart.

    Basic Knowledge of Treasure Workflow's syntax

    Running the Example Workflow

    Download the project from this repository

    In the command line Terminal window, change directory to timeseries. For example:

    cd timeseries

    Run data.sh to prepare example data. It creates a database called timeseries and table called retail_sales.

    $ ./data.sh

    Run the example workflow as follows:

    $ td workflow push prophet
    $ td workflow secrets \
        --project prophet \
        --set apikey \
        --set endpoint
    # Set secrets from STDIN like: apikey=1/xxxxx, endpoint=https://api.treasuredata.com
    $ td workflow start prophet predict_sales_simple --session now

    Predictions are stored in Treasure Data.

    Review the Workflow Custom Python Script

    Review the contents of the directory:

    predict_sales.dig - Example workflow for sales prediction and notification to Slack.

    - Custom Python script with Prophet. It also uploads figures to Amazon S3

    If you don't need to send a notification to Slack, you can remove the "+send_graph" step in predict_sales.dig.

    This example uses a dataset for sales data. You can use your own data in Treasure Data after modifying the query in the read_td function.

    You must prepare the ds (datestamp) column and the y column. The y column represents the target numerical values for forecasting, such as sales values or page views (PVs) of websites.
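    As a minimal illustration of that input shape (the dates and values below are made up), building the two-column records Prophet expects from raw (date, value) pairs looks like:

```python
from datetime import date

# Hypothetical raw (day, sales) pairs, e.g. fetched from Treasure Data.
raw = [(date(2018, 1, 1), 120), (date(2018, 1, 2), 98), (date(2018, 1, 3), 134)]

# Prophet expects a frame with exactly two columns:
# ds (the datestamp) and y (the numeric target to forecast).
records = [{"ds": d.isoformat(), "y": float(v)} for d, v in raw]
```

    Passing something like pd.DataFrame(records) to Prophet().fit() would then train on these values.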

    Here is example code for page view logs:

    import pytd.pandas_td as td

    engine = td.create_engine('presto:sample_datasets')
    start_date = '2014-01-01'
    end_date = '2018-12-31'
    df = td.read_td(f"""
        select
          TD_TIME_FORMAT(TD_DATE_TRUNC('minute', time), 'yyyy-MM-dd HH:mm:ss') as ds,
          count(1) as y
        from www_access
        where TD_TIME_RANGE(time, '{start_date}', '{end_date}', 'PDT')
        group by 1
        order by 1
    """, engine)
    m = Prophet(changepoint_prior_scale=0.01).fit(df)
    future = m.make_future_dataframe(periods=5, freq='M')
    fcst = m.predict(future)

    View Article
  • Treasure Workflow provides an easy way to modify data with PySpark. You can port existing PySpark code to your Custom Scripting environment in Arm Treasure Data.

    PySpark is a Python API for Spark. You can run PySpark as part of your scheduled workflows, using Python Custom scripts.Apache Spark is widely used for distributed computation for Machine Learning and general data manipulation.

    The following example shows how to complete basic data manipulations using stand-alone mode PySpark with Python Custom scripts.

    Data Manipulation using Python Custom Scripts and PySpark

    PySpark is a Python interface, and a unified analytics engine for large-scale data processing. This example shows how to load data from Treasure Data, process it on PySpark, execute a SparkSQL, and upload modified data back to Treasure Data.

    In this example, you use td-pyspark, a Python library that provides a handy way to use PySpark and Treasure Data based on td-spark.

    Example Workflow to Process Data with td-spark

    The workflow that you create in this example completes the following actions:

    Fetch data from Treasure Data

    Process data as Spark DataFrame

    Manipulate data with SparkSQL

    Upload Spark DataFrame to Treasure Data

    Upload pandas DataFrame to Treasure Data

    Prerequisites

    To follow the steps in this example, you must have a basic knowledge of Treasure Workflow syntax and set up your environment as follows:

    Download and install the TD Toolbelt and the TD Toolbelt Workflow module. See TD Workflow Quickstart

    Ensure that the custom scripts feature is enabled for your TD account.

    Run the Example Workflow

    You retrieve code from Treasure Data GitHub to run in your environment.

    Download the project from the GitHub repository

    In your Terminal window, change the directory to pyspark.

    Run the example workflow as follows:

    $ td workflow push td-spark
    $ td workflow secrets \
        --project td-spark \
        --set apikey \
        --set endpoint
    # Set secrets from STDIN like: apikey=1/xxxxx, endpoint=https://api.treasuredata.com
    $ td workflow start td-spark pyspark --session now

    The workflow fetches www_access data in the sample_datasets database, then filters the data by HTTP GET method requests and adds a time column in string format. The manipulated data is stored in the www_access_processed table in the td_spark_example database.

    The data processed with PySpark, filtered to GET requests only, is as follows:


    Additionally, you can find the pandas dataframe in the pandas_table as follows:

    Review the Workflow Custom Python Script

    Review the contents of the directory and note the following files:

    pyspark.dig

    Example workflow for data manipulation with PySpark.

    The workflow YAML file pyspark.dig invokes the Python script for manipulating data with PySpark. It has three main tasks:

    The td_spark_process task creates the www_access_processed table in the td_spark_example database by using Spark DataFrame execution.

    The td_spark_sql executes SparkSQL.

    The td_spark_upload task uploads the processed pandas data frames to the TD database.

    Custom Python script with PySpark and td-pyspark.

    The Python script td-spark.py houses all the python functions invoked by the pyspark.dig file. Three main functions illustrate processing data with PySpark:

    The process_data function filters the data to GET requests only and adds a time column converted into string format.

    The execute_sql function demonstrates how you can use SparkSQL to process data.

    The upload_dataframe function converts a pandas DataFrame into a Spark DataFrame and uploads the data to Treasure Data.
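    To make the shape of the first transformation concrete, here is a plain-Python sketch of the same filter-and-format logic on hypothetical www_access rows. The real task does this with Spark DataFrames; the column names here are illustrative:

```python
from datetime import datetime, timezone

# Hypothetical access-log rows; column names are illustrative.
rows = [
    {"method": "GET", "path": "/item/1", "time": 1412320845},
    {"method": "POST", "path": "/search", "time": 1412320846},
]

# Keep only GET requests and add a string-formatted time column,
# mirroring what the process_data task does with PySpark.
processed = [
    {**r, "time_str": datetime.fromtimestamp(r["time"], tz=timezone.utc)
                              .strftime("%Y-%m-%d %H:%M:%S")}
    for r in rows
    if r["method"] == "GET"
]
```

    In the workflow itself, the same filter is a DataFrame .filter() call and the formatted column is added with .withColumn().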

    View Article
  • You can use td-pyspark to bridge the results of data manipulations in Databricks with your data in Arm Treasure Data.

    Databricks builds on top of Apache Spark, providing an easy-to-use interface for accessing Spark. PySpark is a Python API for Spark. Treasure Data's td-pyspark is a Python library that provides a handy way to use PySpark and Treasure Data based on td-spark.

    Prerequisites

    To follow the steps in this example, you must have the following items:

    Treasure Data API key

    td-spark feature enabled

    Configuring your Databricks Environment

    You create a cluster, install the td-pyspark libraries, and configure a notebook for your connection code.

    Create a Cluster on Databricks

    Click the Cluster icon.

    Click Create Cluster.

    Provide a cluster name, select version Spark 2.4.3 or later as the Databricks Runtime Version and select 3 as the Python Version.


    Install the td-pyspark Libraries

    Access the Treasure Data Apache Spark Driver Release Notes. From the article, you click on links to download code.

    Click to download

    td-spark-assembly_2.11-19.7.0.jar

    Select PyPI.

    When download completes, you see the following:

    Specify your TD API Key and Site

    In the Spark configuration, you specify the Treasure Data API key and enter the environment variables.

    An example of the format is as follows. You provide the actual values:

    spark.td.apikey (Your TD API KEY)
    spark.td.site (Your site: us, jp, eu01)
    spark.serializer org.apache.spark.serializer.KryoSerializer
    spark.sql.execution.arrow.enabled true

    Restart Cluster and Begin Work in Databricks

    Restart your Spark cluster. Create a notebook. Create a script similar to the following code:

    %python

    from pyspark.sql import *
    import td_pyspark

    SAMPLE_TABLE = "sample_datasets.www_access"
    td = td_pyspark.TDSparkContext(spark)
    df = td.table(SAMPLE_TABLE).within("-10y").df()
    df.show()

    TDSparkContext is the entry point to td_pyspark's functionality. As shown in the preceding code sample, create it by passing your SparkSession (spark) to TDSparkContext:

    td = TDSparkContext(spark)

    You see a result similar to the following:

    Your connection is working.

    Interacting with Treasure Data from Databricks

    In Databricks, you can run select and insert queries to Treasure Data or query back data from Treasure Data. You can also create and delete databases and tables.

    In Databricks, you can use the following commands:

    Read Tables as DataFrames

    To read a table, use td.table(table_name):

    df = td.table("sample_datasets.www_access").df()

    df.show()

    Change the Database used in Treasure Data

    To change the context database, use td.use(database_name):

    td.use("sample_datasets")

    # Accesses sample_datasets.www_access

    df = td.table("www_access").df()

    By calling .df(), your table data is read as a Spark DataFrame. Usage of the DataFrame is the same as in PySpark. See also the PySpark DataFrame documentation.


    Submit Presto Queries

    If your Spark cluster is small, reading all of the data as in-memory DataFrame might be difficult. In this case, you can use Presto, a distributed SQL query engine, to reduce the amount of data processing with PySpark.

    q = td.presto("select code, * from sample_datasets.www_access")

    q.show()

    q = td.presto("select code, count(*) from sample_datasets.www_access group by 1")
    q.show()

    You see:

    Create or drop a database

    td.create_database_if_not_exists("<db_name>")
    td.drop_database_if_exists("<db_name>")

    Upload DataFrames to Treasure Data

    To save your local DataFrames as a table, you have two options:

    Insert the records in the input DataFrame to the target table

    Create or replace the target table with the content of the input DataFrame

    td.insert_into(df, "mydb.tbl1")
    td.create_or_replace(df, "mydb.tbl2")

    Checking Databricks in Treasure Data

    You can use td toolbelt to check your database from a command line. Alternatively, if you have TD Console, you can check your databases and queries there.

    View Article
  • You can use td-pyspark to bridge the results of data manipulations in Amazon EMR (Elastic MapReduce) with your data in Arm Treasure Data.

    Amazon EMR is an AWS tool for big data processing and analysis, providing an easy-to-use interface for accessing Spark. PySpark is a Python API for Spark. Treasure Data's td-pyspark is a Python library that provides a handy way to use PySpark and Treasure Data based on td-spark.

    Prerequisites

    To follow the steps in this example, you must have the following Treasure Data items:

    Treasure Data API key

    td-spark feature enabled

    Configuring your Amazon EMR Environment

    You create a key pair, cluster, install td-pyspark libraries and configure a notebook for your connection code.

    Log into the AWS Management Console. Under Find Services, enter EMR. Amazon EMR cluster nodes run on Amazon EC2 instances.

    Create an EC2 Key Pair

    When you create the key pair in Amazon, you provide a name, and a file with the extension .pem is generated. Download the generated file to your local computer.

    For more information about creating an Amazon EC2 key pair, see Amazon EC2 Key Pairs.

    You refer to the key when you create a cluster: you specify the Amazon EC2 key pair that is used for SSH connections to all cluster instances.

    Create a Cluster on Amazon EMR

    Complete the configuration fields. Provide a cluster name, a folder location for the cluster data, and select version Spark 2.4.3 or later as the Application.


    Specify the instance and key pair.

    Prepare the Proxy Server

    To grant inbound access permission from your local computer, you specify a secure protocol.

    Select the security group for Master:

    In the Amazon, find the Master node that you want to access:

    With the proxy established, you log onto the Master node instance. On your local computer, access the AWS Master node instance through an ssh protocol.

    $ ssh -i ~/<your_aws_keypair_key.pem> hadoop@<Master public DNS>

    You see a connection confirmation on your local computer.

    Install the td-pyspark Libraries

    Access the Treasure Data Apache Spark Driver Release Notes. From the article, you click on links to download code.

    Inside your EMR instance, click to download the library to your Master node.

    td-spark-assembly_2.11-19.7.0.jar

    Still within the Master node instance, run the following command to install td_pyspark:

    $ pip install td_pyspark

    Create a Configuration File and Specify your TD API Key and Site

    In the Master node instance, create a td-spark.conf file. In the configuration file, specify your TD API key, TD site parameters, and Spark environment.

    An example of the format is as follows. You provide the actual values:

    spark.td.apikey (Your TD API KEY)
    spark.td.site (Your site: us, jp, eu01)
    spark.serializer org.apache.spark.serializer.KryoSerializer
    spark.sql.execution.arrow.enabled true

    Launch PySpark

    Launch PySpark. In your command, include the arguments as shown in the following example:

    % pyspark --driver-class-path ./td_pyspark-19.7.0/td_pyspark/jars/td-spark-assembly.jar --properties-file ./td-spark.conf

    You see something similar to the following:

    Then load td_pyspark, as follows. Note the prompt symbol changes to >>>:

    >>> import td_pyspark
    >>> from pyspark import SparkContext
    >>> from pyspark.sql import SparkSession
    >>> builder = SparkSession.builder.appName("td-pyspark-test")
    >>> td = td_pyspark.TDSparkContextBuilder(builder).build()
    >>> df = td.table("sample_datasets.www_access").within("+2d/2014-10-04").df()
    >>> df.show()

    TDSparkContextBuilder is the entry point to td_pyspark's functionality. As shown in the preceding code sample, you read tables in Treasure Data as data frames:

    df = td.table("tablename").df()

    You see a result similar to the following:

    Your connection is working.

    Interacting with Treasure Data in your Master Node Instance

    You can run select and insert queries to Treasure Data or query back data from Treasure Data. You can also create and delete databases and tables.

    You can use the following commands:

    Read Tables as DataFrames

    To read a table, use td.table(table_name):

    df = td.table("sample_datasets.www_access").df()

    df.show()

    Change the Database used in Treasure Data

    To change the context database, use td.use(database_name):

    td.use("sample_datasets")

    # Accesses sample_datasets.www_access

    df = td.table("www_access").df()

    By calling .df(), your table data is read as a Spark DataFrame. Usage of the DataFrame is the same as in PySpark. See also the PySpark DataFrame documentation.


    Submit Presto Queries

    If your Spark cluster is small, reading all of the data as in-memory DataFrame might be difficult. In this case, you can use Presto, a distributed SQL query engine, to reduce the amount of data processing with PySpark.

    q = td.presto("select code, * from sample_datasets.www_access")

    q.show()

    q = td.presto("select code, count(*) from sample_datasets.www_access group by 1")
    q.show()

    You see:

    Create or drop a database

    td.create_database_if_not_exists("<db_name>")
    td.drop_database_if_exists("<db_name>")

    Upload DataFrames to Treasure Data

    To save your local DataFrames as a table, you have two options:

    Insert the records in the input DataFrame to the target table

    Create or replace the target table with the content of the input DataFrame

    td.insert_into(df, "mydb.tbl1")
    td.create_or_replace(df, "mydb.tbl2")

    Checking Amazon EMR in Treasure Data

    You can use td toolbelt to check your database from a command line. Alternatively, if you have TD Console, you can check your databases and queries there.

    View Article
  • You can use td-pyspark to bridge the results of data manipulations in Google Colab with your data in Arm Treasure Data.

    Google Colab notebooks make it easy to model with PySpark in Google. PySpark is a Python API for Spark. Treasure Data's td-pyspark is a Python library that provides a handy way to use PySpark and Treasure Data based on td-spark.

    Prerequisites

    To follow the steps in this example, you must have the following Treasure Data items:

    Treasure Data API key

    td-spark feature enabled

    Configuring your Google Colab Environment

    You create a notebook, install the pyspark and td-pyspark libraries, and configure the notebook for your connection code.

    Create a Notebook in Google Colab

    Open Google Colab. Click File > New Python 3 notebook.

    Ensure that the runtime is connected. The notebook shows a green check in the top right corner.

    Prepare your Environment for the PySpark and TD-PySpark Libraries

    Click the icon to add a code cell:


    Enter the following code:

    !apt-get install openjdk-8-jdk-headless -qq > /dev/null
    !pip install pyspark td-pyspark

    Create and Upload the td-spark.conf File

    You specify your TD API key and site on your local file system. Create a file as follows:

    An example of the format is as follows. You provide the actual values:

    spark.td.apikey (Your TD API KEY)
    spark.td.site (Your site: us, jp, eu01)
    spark.serializer org.apache.spark.serializer.KryoSerializer
    spark.sql.execution.arrow.enabled true

    Name the file td-spark.conf and upload the file by clicking Files > Upload on the Google Colab menu. Verify that the td-spark.conf file is saved in the /content directory.

    Run the Installation and Begin Work in Google Colab

    Run the current cell by selecting the cell and pressing Shift+Enter.

    Create a second code cell and create a script similar to the following code:

    import os
    os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /usr/local/lib/python2.7/dist-packages/td_pyspark/jars/td-spark-assembly.jar --properties-file /content/td-spark.conf pyspark-shell"
    os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

    import td_pyspark
    from pyspark import SparkContext
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("td-pyspark-test")
    td = td_pyspark.TDSparkContextBuilder(builder).build()
    df = td.table("sample_datasets.www_access").within("-10y").df()
    df.show()

    TDSparkContextBuilder is the entry point to td_pyspark's functionality. As shown in the preceding code sample, you read tables in Treasure Data as data frames:

    df = td.table("tablename").df()

    You see a result similar to the following:

    Your connection is working.

    Interacting with Treasure Data from Google Colab

    In Google Colab, you can run select and insert queries to Treasure Data or query back data from Treasure Data. You can also create and delete databases and tables.

    In Google Colab, you can use the following commands:

    Read Tables as DataFrames

    By calling .df(), your table data is read as a Spark DataFrame. Usage of the DataFrame is the same as in PySpark. See also the PySpark DataFrame documentation.

    df = td.table("sample_datasets.www_access").df()
    df.show()

    Submit Presto Queries

    If your Spark cluster is small, reading all of the data as in-memory DataFrame might be difficult. In this case, you can use Presto, a distributed SQL query engine, to reduce the amount of data processing with PySpark.

    q = td.presto("select code, * from sample_datasets.www_access")
    q.show()

    q = td.presto("select code, count(*) from sample_datasets.www_access group by 1")
    q.show()

    You see:

    Create or Drop a Database

    td.create_database_if_not_exists("<my_db>")
    td.drop_database_if_exists("<my_db>")

    Upload DataFrames to Treasure Data

    To save your local DataFrames as a table, you have two options:

    Insert the records in the input DataFrame to the target table

    Create or replace the target table with the content of the input DataFrame

    td.insert_into(df, "mydb.table1")
    td.create_or_replace(df, "mydb.table2")

    Checking Google Colab in Treasure Data

    You can use the td toolbelt to check your database from the command line. Alternatively, if you have TD Console, you can check your databases and queries there.

    View Article
  • After creating a segment, you can choose to activate it through the activations editor or you can download it as a CSV, if you want to see the segmented data in detail. As there is no CSV download option directly from the Segment Editor, follow these steps to export a segment as a CSV:

    Create a segment. For more information on this go to Creating a New Segment.

    After creating the segment, navigate to Audience Studio > Segments.

    Select the segment you want to export.

    Select Edit Segment Rules to open the Rule Builder.

    Select the SQL Query tab.

    Copy the query from SQL Query tab and close the Rule Builder.

    Navigate to Data Workbench > Queries.

    Select New Query.

    Paste the copied SQL query into the query editor.

    Select Run.

    After the query is complete, select the Query Result tab and select Download to get the CSV file.

    View Article
  • This article lists the changelog of td-agent. The older versions of td-agent are here.

    td-agent3 (New stable)

    This version uses Fluentd v1. This article describes the differences between td-agent2 and td-agent3. If you have a new deployment, try this version first.

    2019/11/01: version v3.5.1

    fluentd v1.7.4

    ruby v2.4.9

    td v0.16.5

    tzinfo-data v1.2019.3

    aws-sdk-s3 v1.52.0

    fluent-plugin-s3 v1.2.0

    rdkafka v0.7.0

    fluent-plugin-kafka v0.12.1

    fluent-plugin-elasticsearch v3.5.5

    fluent-plugin-prometheus v1.7.0

    winevt_c v0.6.1

    fluent-plugin-windows-eventlog v0.4.3

    td-agent.conf on Windows will not be overwritten starting from the next version. See also https://github.com/treasure-data/omnibus-td-agent/pull/215

    2019/08/27: version v3.5.0

    fluentd v1.7.0

    tzinfo v2.0.0

    tzinfo-data v1.2019.2

    aws-sdk-s3 v1.46.0

    fluent-plugin-s3 v1.1.11

    fluent-plugin-webhdfs v1.2.4

    ruby-kafka v0.7.10

    rdkafka v0.6.0

    fluent-plugin-kafka v0.11.1

    elasticsearch v6.8.0

    fluent-plugin-elasticsearch v3.5.4

    fluent-plugin-prometheus v1.5.0

    fluent-plugin-windows-eventlog v0.3.0

    Use /etc/default/td-agent as EnvironmentFile on Ubuntu/Debian

    2019/05/18: version v3.4.1

    ruby v2.4.6

    fluent-plugin-s3 v1.1.10

    fluent-plugin-rewrite-tag-filter v2.2.0

    fluent-plugin-kafka v0.9.4

    fluent-plugin-elasticsearch v3.5.1

    fluent-plugin-prometheus v1.4.0

    2019/05/04: version v3.4.0

    Fluentd v1.4.2

    This includes several additional patches (TLS fix for Windows, tag parameters in in_forward, etc)

    fluent-plugin-kafka v0.9.2

    fluent-plugin-elasticsearch v3.4.3

    fluent-plugin-s3 v1.1.9

    serverengine v2.1.1

    aws-sdk-core v3.48.3

    aws-sdk-s3 v1.36.0

    ruby-kafka v0.7.6

    rdkafka 0.4.2 (only Linux)

    tzinfo-data v1.2019.1

    oj v3.7.11

    Use /etc/sysconfig/td-agent as EnvironmentFile in systemd

    2019/01/10: version v3.3.0

    ruby v2.4.5

    Fluentd v1.3.3

    fluent-plugin-kafka v0.8.3

    fluent-plugin-elasticsearch v3.0.1

    fluent-plugin-s3 v1.1.7

    fluent-plugin-rewrite-tag-filter v2.1.1

    serverengine v2.1.0

    This fixes a permission security issue with Unix domain sockets.

    aws-sdk-core v3.44.2

    aws-sdk-s3 v1.30.0

    ruby-kafka v0.7.4

    Dropped kafka v0.10 support. If you want to connect to kafka v0.10, downgrade the ruby-kafka and fluent-plugin-kafka versions.

    tzinfo-data v1.2018.9

    oj v3.7.4

    td v0.16.4

    2018/10/15: version v3.2.1

    Fluentd v1.2.6

    fluent-plugin-kafka v0.7.9

    fluent-plugin-elasticsearch v2.11.11

    fluent-plugin-s3 v1.1.6

    serverengine 2.0.7

    ruby-kafka v0.6.8

    2018/06/13: version v3.2.0

    Fluentd v1.2.2

    fluent-plugin-kafka v0.7.3

    fluent-plugin-elasticsearch v2.10.3

    fluent-plugin-s3 v1.1.3

    fluent-plugin-webhdfs v1.2.3

    fluent-plugin-rewrite-tag-filter v2.1.0

    Add fluent-plugin-record-modifier v1.1.0

    postgresql 9.6.9

    msgpack-ruby 1.2.4

    td 0.16.1

    bundler 1.16.2

    yajl-ruby 1.4.0

    tzinfo-data 1.2018.5

    Support Ubuntu 18.04 64bit

    2017/12/19: version v3.1.1

    Fluentd v1.0.2

    fluent-plugin-kafka v0.6.5

    fluent-plugin-elasticsearch v2.4.0

    jemalloc 4.5.0

    postgresql 9.6.6

    msgpack-ruby 1.2.0

    dig_rb 1.0.1

    Add RuntimeDirectory to systemd unit

    2017/12/1: version v3.1.0

    Ruby 2.4.2

    Fluentd v0.14.25

    fluent-plugin-s3 v1.1.0

    fluent-plugin-kafka v0.6.4

    fluent-plugin-webhdfs v1.2.2

    fluent-plugin-td v1.0.0

    fluent-plugin-rewrite-tag-filter v2.0.1

    fluent-plugin-windows-eventlog v0.2.2

    fluent-plugin-elasticsearch v2.3.0

    bundler 1.6.0

    msgpack-ruby 1.1.0

    yajl-ruby 1.3.1

    cool.io 1.5.3

    oj 3.3.9

    aws-sdk-s3 1.8.0

    aws-sdk-sqs 1.3.0

    td 0.15.7

    2017/05/18: version v3.0.1

    Ruby 2.4.1

    Fluentd v0.14.16

    fluent-plugin-s3 v1.0.0.rc3

    fluent-plugin-kafka v0.5.5

    fluent-plugin-webhdfs v1.1.1

    fluent-plugin-windows-eventlog v0.2.0

    fluent-plugin-elasticsearch v1.9.5

    msgpack-ruby 1.1.0

    2017/01/08: version v3.0.0

    Ruby 2.4.0

    Fluentd v0.14.11

    fluent-plugin-s3 v1.0.0.rc1

    fluent-plugin-kafka v0.5.0

    fluent-plugin-td v1.0.0.rc1

    fluent-plugin-elasticsearch v1.9.2

    yajl-ruby v1.3.0

    msgpack-ruby v1.0.2

    Remove fluent-plugin-scribe and fluent-plugin-mongo from bundled plugins

    Added Windows support

    Dropped CentOS 5, Ubuntu Lucid, Ubuntu Precise, Debian Wheezy support

    td-agent2 (Old stable)

    This version uses Fluentd v0.12, and the ruby version is 2.5/2.1. This article describes the differences between td-agent2 and td-agent3. td-agent2 is for existing td-agent2 and fluentd v0.12 users. If you have a new deployment, try td-agent3 first.

    2018/06/28: version v2.5.0

    Fluentd v0.12.43

    ruby 2.5.1

    2017/01/01: Repository GPG Key Update

    From 2017/01/01, apt, a major Linux packaging system, dropped support for SHA1-signed GPG keys for package signing, because the SHA1 algorithm is considered weak.

    Treasure Agent's GPG key, located at https://packages.treasuredata.com/GPG-KEY-td-agent, has been upgraded. To update your GPG key for better security:

    Update the GPG key for Treasure Agent

    If you're using the old GPG key, you'll see an error message similar to the following:

    W: Signature by key XYZXYZXYZXYZXYZ uses weak digest algorithm (SHA1)

    2017/10/04: version v2.3.6

    fluentd v0.12.40

    fluent-plugin-s3 v0.8.5

    fluent-plugin-mongo v0.8.1

    fluent-plugin-kafka v0.6.1

    fluent-plugin-rewrite-tag-filter v1.5.6

    fluentd-ui v0.4.4

    cool.io v1.5.1

    aws-sdk v2.10.45

    ruby-kafka v0.4.2

    rubygems v2.6.13

    2017/04/26: version v2.3.5

    fluentd v0.12.35

    fluent-plugin-s3 v0.8.2

    fluent-plugin-mongo v0.8.0

    fluent-plugin-kafka v0.5.5

    fluentd-ui v0.4.4

    cool.io v1.4.6

    yajl-ruby v1.3.0

    aws-sdk v2.9.9

    td-client v0.8.85

    tzinfo v1.2.3

    tzinfo-data v1.2017.2

    ruby-kafka v0.3.17

    mongo v2.2.7

    fix logrotate script

    2016/12/31: version v2.3.4

    fluentd v0.12.31

    fluent-plugin-s3 v0.8.0

    fluent-plugin-mongo v0.7.16

    fluent-plugin-kafka v0.4.1

    fluentd-ui v0.4.4.rc1

    td v0.15.0

    oj v2.18.0

    aws-sdk v2.6.42

    td-client v0.8.85

    tzinfo-data v1.2016.10

    ruby-kafka v0.3.15

    httpclient v2.8.2.4

    Add 32bit package to Ubuntu 16.04

    2016/09/30: version v2.3.3

    fluentd v0.12.29

    fluent-plugin-s3 v0.7.1

    fluent-plugin-mongo v0.7.15

    fluent-plugin-kafka v0.3.1

    fluent-plugin-td v0.10.29

    fluentd-ui v0.4.3

    td v0.15.0

    cool.io v1.4.5

    aws-sdk v2.6.3

    td-client v0.8.83

    tzinfo-data v1.2016.6

    ruby-kafka v0.3.15

    jemalloc v4.2.1

    2016/07/10: version v2.3.2

    fluentd v0.12.26

    fluent-plugin-s3 v0.6.8

    fluent-plugin-webhdfs v0.4.2

    fluent-plugin-mongo v0.7.13

    fluent-plugin-rewrite-tag-filter v1.5.5

    td v0.14.0

    ruby v2.1.10

    cool.io v1.4.4

    sigdump v0.2.4

    httpclient v2.7.2

    oj v2.15.1

    aws-sdk v2.3.14

    webhdfs v0.8.0

    tzinfo-data v1.2016.5

    Ubuntu 16.04 64bit support

    2016/02/11: version v2.3.1

    fluentd v0.12.20

    fluent-plugin-s3 v0.6.5

    fluent-plugin-td v0.10.28

    fluent-plugin-td-monitoring v0.2.2

    fluent-plugin-mongo v0.7.12

    fluent-plugin-rewrite-tag-filter v1.5.4

    td v0.13.2

    rubygems v2.4.8

    httpclient v2.7.1

    oj v2.14.4

    aws-sdk v2.2.14

    webhdfs v0.7.4

    tzinfo-data v1.2016.1

    2015/12/24: version v2.3.0

    fluentd v0.12.19

    fluent-plugin-s3 v0.6.4

    fluent-plugin-td v0.10.28

    fluent-plugin-mongo v0.7.11

    fluent-plugin-rewrite-tag-filter v1.5.3

    td v0.13.0

    ruby v2.1.8

    bundler v1.10.6

    cool.io v1.4.2

    oj v2.14.2

    tzinfo-data v1.2015.7

    2015/06/29: version v2.2.1

    fluentd v0.12.12

    fluent-plugin-s3 v0.5.9

    fluent-plugin-mongo v0.7.10

    fluent-plugin-td v0.10.27

    fluent-plugin-td-monitoring v0.2.1

    td v0.11.10

    fluentd-ui v0.4.2

    rpm: Fix -e issue in init script

    deb: Improve stop mechanism to prevent killing other td-agent processes in containers

    deb: Add STOPTIMEOUT parameter to control shutting down the td-agent process, same as rpm

    2015/04/06: version v2.2.0

    fluentd v0.12.7

    fluent-plugin-s3 v0.5.7

    fluent-plugin-mongo v0.7.8

    td v0.11.8.2

    fluentd-ui v0.3.18

    ruby 2.1.5

    json 1.8.2

    cool.io 1.3.0

    Amazon Linux support

    2015/03/26: version v2.1.5-1

    Put logrotate setting in deb package

    Fix status command in rpm init script

    Include postgresql

    2015/03/03: version v2.1.5

    fluentd v0.10.61

    fluent-plugin-s3 v0.5.6

    fluent-plugin-mongo v0.7.6

    fluent-plugin-td v0.10.26

    td v0.11.8

    td-client v0.8.68

    2015/02/09: version v2.1.4

    fluentd v0.10.60

    fluent-plugin-s3 v0.5.3

    fluent-plugin-mongo v0.7.5

    fluent-plugin-td v0.10.25

    fluentd-ui v0.3.14

    msgpack 0.5.11

    2014/12/23: version v2.1.3

    fluentd v0.10.58

    fluent-plugin-scribe v0.10.14

    fluent-plugin-s3 v0.5.1

    fluent-plugin-td v0.10.23

    fluent-plugin-td-monitoring v0.2.0

    fluentd-ui v0.3.11

    RHEL/CentOS 7 support

    2014/11/17: version v2.1.2

    fluentd v0.10.57

    fluent-plugin-scribe v0.1.3

    fluent-plugin-mongo v0.7.4

    fluent-plugin-s3 v0.4.3

    fluent-plugin-webhdfs v0.4.1

    fluent-plugin-td-monitoring v0.1.4

    Remove fluent-plugin-flume

    fluentd-ui v0.3.7

    jemalloc 2.2.5

    rubygems 2.2.1

    Create /var/run/td-agent during post installation on rpm

    2014/10/22: version v2.1.1

    Ruby 2.1.3

    fluentd v0.10.55

    td v0.11.5

    td-client v0.8.66

    fluent-plugin-scribe v0.1.2

    fluent-plugin-s3 v0.4.1

    fluent-plugin-webhdfs v0.3.1

    fluent-plugin-td v0.10.22

    fluent-plugin-td-monitoring v0.1.3

    Fix option handling in configtest

    rpm: default timeout to 120 seconds

    Change v1 non-quoted string literal behavior: https://groups.google.com/forum/#!topic/fluentd/4C36iAyI87E

    2014/09/05: version v2.1.0

    Ruby 2.1.2

    fluentd v0.10.53

    Use use-v1-config by default

    jemalloc v3.6.0

    msgpack-ruby v0.5.8

    Cool.io v1.2.4

    td-agent1 (Unmaintained, Not Recommended for New Deployments)

    2014/10/20: version v1.1.21

    fluentd v0.10.55

    td v0.11.5

    td-client v0.8.66

    fluent-plugin-scribe v0.1.2

    fluent-plugin-s3 v0.4.1

    fluent-plugin-webhdfs v0.3.1

    fluent-plugin-td v0.10.22

    fluent-plugin-td-monitoring v0.1.3

    Fix option handling in configtest

    rpm: default timeout to 120 seconds

    deb: fix cleanup routine in purge

    2014/06/19: version v1.1.20

    fluentd v0.10.50

    td-client v0.8.61

    fluent-plugin-s3 v0.4.0

    fluent-plugin-td v0.10.20

    fluent-plugin-td-monitoring v0.1.2

    rspec v3.0.0

    Improve stop routine of init script in RPM

    Set ulimit -n 65536 at start

    2014/04/01: version v1.1.19

    fluentd v0.10.45

    td-client v0.8.58

    td v0.10.99

    fluent-plugin-s3 v0.3.7

    fluent-plugin-td v0.10.18

    fluent-plugin-rewrite-tag-filter v1.4.1

    fluent-plugin-mongo v0.7.3

    fluent-plugin-td-monitoring v0.1.1

    bundler to v1.3.6

    webhdfs v0.5.5

    aws-sdk-ruby v1.38.0

    2013/12/05: version v1.1.18

    ruby v1.9.3-p484 (security fix)

    fluentd v0.10.41

    td-client v0.8.56

    td v0.10.96

    fluent-plugin-s3 v0.3.5

    fluent-plugin-td v0.10.17

    fluent-plugin-rewrite-tag-filter v1.3.1

    2013/09/25: version v1.1.17

    fluentd v0.10.39

    td-client v0.8.55

    td v0.10.89

    fluent-plugin-td v0.10.16

    Fix configtest permission issue at restart

    Create /etc/default/td-agent to set daemon options at Debian and Ubuntu

    2013/08/30: version v1.1.16

    fluentd v0.10.38

    td-client v0.8.54

    td v0.10.86

    Add configtest and use configtest at restart

    2013/08/02: version v1.1.15

    fluentd v0.10.36

    td-client v0.8.53

    td v0.10.84

    fluent-plugin-s3 v0.3.4

    fluent-plugin-webhdfs v0.2.1

    fluent-plugin-mongo v0.7.1

    fluent-plugin-td v0.10.15

    cool.io v1.1.1

    2013/06/24: version v1.1.14

    fluentd v0.10.35

    td-client v0.8.52

    td v0.10.82

    fluent-plugin-s3 v0.3.3

    fluent-plugin-webhdfs v0.2.0

    webhdfs v0.5.3

    bson_ext v1.8.6

    bson v1.8.6

    mongo v1.8.6

    yajl-ruby v1.1.0

    json v1.7.7

    2013/04/23: version v1.1.13

    fluent-plugin-td v0.10.14

    td-client v0.8.48

    td v0.10.76

    2013/04/04: version v1.1.12

    fluentd v0.10.33

    fluent-plugin-s3 v0.3.1

    fluent-plugin-mongo v0.7.0

    fluent-plugin-webhdfs v0.1.2

    msgpack v0.4.7

    bson_ext v1.8.4

    bson v1.8.4

    mongo v1.8.4

    iobuffer v1.1.2

    2012/12/07: version v1.1.11

    fluentd v0.10.30

    fluent-plugin-s3 v0.2.5

    fluent-plugin-td v0.10.13

    fluent-plugin-mongo v0.6.11

    2012/10/16: version v1.1.10.3

    td-client v0.8.34 to set the timeout for connect(2)

    2012/10/15: version v1.1.10.2

    fluent-plugin-td v0.10.12

    2012/10/15: version v1.1.10.1

    fixed /usr/bin/td regression of setting GEM_HOME and GEM_PATH

    2012/10/15: version v1.1.10

    fluentd v0.10.27

    fluent-plugin-mongo v0.6.9

    fluent-plugin-webhdfs v0.5.1

    fluent-plugin-td v0.10.11

    enable debug agent by default (td-agent.conf)

    set GEM_HOME and GEM_PATH at /usr/[s]bin/scripts to avoid RVM conflicts

    2012/08/27: version 1.1.9

    fixed issue where /usr/bin/td didn't take any command-line arguments

    Ubuntu LTS Precise 12.04 is now supported

    allow overriding the variables by /etc/sysconfig/td-agent (rpm)

    2012/07/23: version 1.1.8

    fluentd v0.10.25

    fixed critical problem of duplicate daemon launch problem (both deb and rpm)

    2012/06/12: version 1.1.7

    bson_ext v1.6.4

    bson v1.6.4

    mongo v1.6.4

    fluent-plugin-td v0.10.7

    td v0.10.25 (new)

    install /usr/bin/td (new)

    2012/05/20: version 1.1.6

    fluent-plugin-flume v0.1.1

    remove ruby package dependency (rpm only)

    2012/05/02: version 1.1.5

    ruby v1.9.3-p194 (security fix)

    fluentd v0.10.22

    add with-libyaml-dir to ruby's configure options

    2012/04/23: version 1.1.4.4

    add td-libyaml to the dependency (rpm only)

    support debian 5 lenny (deb only)

    2012/04/17: version 1.1.4

    use ruby-1.9.3-p125

    use jemalloc v2.2.5, to avoid memory fragmentations

    fluentd v0.10.19

    fluent-plugin-mongo v0.6.7

    fluent-plugin-td v0.10.6

    2012/03/24: version 1.1.3.1

    do not start the daemon at the first installation (rpm only)

    fix prelink Ruby breakage problem on CentOS 4 (rpm only)

    fix /etc/init.d/td-agent status problem on CentOS 4 (rpm only)

    2012/03/10: version 1.1.3

    fluent-plugin-mongo v0.6.6

    2012/02/22: version 1.1.2.2

    reduced the number of dependent packages of the RPM; build-time dependencies were unexpectedly included

    2012/02/21: version 1.1.2.1

    fixed problem of removing /etc/prelink.d/td-agent.conf file, when upgrading (rpm only)

    2012/02/15: version 1.1.2

    fluentd v0.10.10

    fluent-plugin-td v0.10.5

    fluent-plugin-scribe v0.10.7

    fluent-plugin-mongo v0.6.3

    fluent-plugin-s3 v0.2.2 (New!)

    2012/01/24: version 1.1.1

    fluentd v0.10.9

    fluent-plugin-scribe v0.10.6

    fluent-plugin-mongo v0.6.2

    fix /var/run/td-agent/ creation in init.d script

    fix Ruby interpreter breakings by prelink, on 32-bit platform

    2011/11/11: version 1.1.0

    fluentd v0.10.6

    fluent-plugin-td v0.10.2

    fluent-plugin-scribe v0.10.3

    fluent-plugin-mongo v0.4.0 (new)

    prevent prelink to break Ruby interpreter at the nightly job

    2011/10/10: version 1.0.11

    fix gem installation order

    2011/10/05: version 1.0.10

    fix postinst script for not overwriting user-installed config file

    2011/10/01: version 1.0.9

    fluent v0.9.16

    fluent-plugin-scribe v0.9.10

    2011/09/20: version 1.0.8

    fluent v0.9.14

    fluent-plugin-td v0.9.10

    View Article
  • Caution

    DMP activation on Google Display Ads has been paused. Google stopped whitelisting new accounts for this feature, so new accounts can't use it. The Google change does not affect accounts that are already whitelisted; existing accounts can still use the feature.

    Leverage your data in Arm Treasure Data as you run marketing campaigns using Google AdWords.

    You can create audience lists in Google AdWords DDP using data held in Treasure Data. Follow these steps to move cookies, IDFAs, and ADIDs to new or existing audience lists within Google AdWords DDP.

    Prerequisites

    Basic knowledge of Treasure Data, including the toolbelt

    A Google AdWords Account

    Authorized Treasure Data DMP access to your Google AdWords Account

    Grant access for Treasure Data

    The Treasure Data Google AdWords DDP connector requires permissions to create audience segments in your Google AdWords account. Send an email to AdWords Support to grant Treasure Data access to your AdWords account. Provide the following information in the email:

    Request: Grant Treasure Data permissions

    Treasure Data DMP:

    Customer-ID: 140-996-0635

    NID: treasuredata_dmp

    You are sending information so that Google recognizes Treasure Data and identifies your TD account to your Google AdWords account.

    Usage

    To export data, you create or select an existing connection, create or reuse a query, and then run the query to export your audience lists.

    Create or reuse a query

    Go to Treasure Data console

    Go to the Query Editor. You can pick the query language and write your query.

    Select your database and table.

    Access the query that you plan to use to export data. Here is a sample query:

    SELECT DISTINCT cookie, list_name, time FROM google_adwords_ddp
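For illustration only, a row produced by a query like the one above might look like the following in Python. The column names follow the connector's expected mapping; the values, and the assumption that time holds seconds since the epoch, are hypothetical:

```python
import time

# Hypothetical row shape for the export query above; the values are made up.
row = {
    "cookie": "AbCdEf123",            # encrypted Google ID or mobile identifier
    "list_name": "high_value_users",  # audience list (segment) to create/update
    "time": int(time.time()),         # seconds since the epoch
}
assert set(row) == {"cookie", "list_name", "time"}
```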

    Create or select an existing Google AdWords DDP connection

    Option

    Task Steps

    Tip

    Create a connector

    Go to Integrations Hub. Click the Google AdWords DDP icon in the Sources Catalog. Complete the required fields on the first pane. In the AdWords Customer ID field, enter the ID that you use in Google AdWords. Next, complete the Configuration fields.

    Sometimes you need to define the column mapping before writing the query.

    Select a connector

    Go to the Treasure Data console. Go to the Query Editor. Access the query that you plan to use to export data. Click Output results; the Choose Saved Connection dialog opens. Type the connection name in the search box to filter, and click the connection that you want. Next, complete the Configuration fields.

    You can also create a new connection from here by clicking on Create New Connection.

    Creating a new connection:

    Selecting an existing connection:

    Configure the connection by specifying the parameters

    After you select your Google AdWords DDP connection, the Configuration dialog pane appears:

    Source column name mappings (optional)

    Define the mapping from Google DDP column names to the output column names that you specify in your query. You specify the target column and then the source column. For example, if google_cookie is the identifier column in your TD data source, define the mapping as cookie:google_cookie. If the source column in a mapping is omitted, the target column name is used; for example, cookie is the same as the cookie:cookie mapping.
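The fallback rule can be sketched in Python; parse_mapping is a hypothetical helper for illustration, not the connector's actual code:

```python
def parse_mapping(spec: str) -> tuple:
    """Split a 'target:source' mapping; a bare name maps to itself."""
    target, _, source = spec.partition(":")
    return target, (source or target)

# Explicit source column, and the shorthand fallback:
assert parse_mapping("cookie:google_cookie") == ("cookie", "google_cookie")
assert parse_mapping("cookie") == ("cookie", "cookie")
```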

    See Additional connector configuration information for more information about supported column names and usage.

    Cookie or mobile identifier column header

    Specify the original source of the user cookie or mobile identifier.

    You must select one of the options:

    cookie_encrypted: Encrypted identifier (for example, Web), a cookie hash of user id

    cookie_idfa: iOS Advertising Identifier

    cookie_adid: Android Advertising Identifier

    See Additional connector configuration information for more information on how identifier names are handled.

    Execute the query

    Either save the query with a name and then run the query or just run the query. When the query completes successfully, the result is automatically processed. Plan to transfer your data at least 24 hours ahead of when you need the audience lists (also referred to as segments) to be in Google AdWords.

    Additional connector configuration information

    The manner in which Google handles your data from Treasure Data affects how you configure the connector.

    Column Mappings

    The Google AdWords (via DDP) connector reads the data source table by columns and uses the following column name mappings to process each row:

    cookie: The encrypted Google ID or Mobile Advertising Identifier that DDP will use in id matching. This column contains the cookie hash or mobile identifier of your users.

    list_name: This column contains the name of the audience list (segment) that you want to create in your AdWords audience. If the list name does not exist in Google AdWords, a new list is created. If the list name does exist, the existing list is updated.

    timestamp (optional): The timestamp (seconds since the epoch). If this column does not exist or the value is missing, the current timestamp is used.

    delete (optional): This column contains boolean values (false or true) or numbers (0 or 1) to indicate whether the cookie is to be added to or removed from the given audience segment. The value defaults to false if it is left blank or the column is not provided.
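The accepted value shapes for the delete column can be sketched as follows; normalize_delete is an illustrative rendering of the rule described above, not the connector's implementation:

```python
def normalize_delete(value) -> bool:
    """Interpret the optional delete column: true/false, 1/0, blank -> False."""
    if value is None or value == "":
        return False                      # column missing or blank: keep the cookie
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        return value == 1                 # numeric form: 1 means remove
    return str(value).strip().lower() == "true"

assert normalize_delete(None) is False    # column not provided
assert normalize_delete("") is False      # blank value
assert normalize_delete(1) is True
assert normalize_delete("true") is True
```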

    Cookie or Mobile Identifier Name

    The upload process supports several different identifiers, and it is important that each type of identifier is uploaded to the segment using the correct upload file format. Generally, the identifiers fall into one of two categories: encrypted identifiers (anything obtained from Google systems) and raw identifiers (obtained from an external system or source). Any encrypted identifier is uploaded using the cookie_encrypted file format, and raw identifiers are uploaded in a type-specific upload file format, such as Mobile Advertising Identifiers (cookie_idfa or cookie_adid).

    Supported cookie types are cookie_encrypted, cookie_idfa, and cookie_adid.

    Appendix: Limitations from Google AdWords

    It may take up to 24 hours from the time of query completion for updates to audience lists to become visible in Google AdWords.

    View Article
    You can write job results directly to your Google BigQuery. You can view how to import data from Google BigQuery here.

    Prerequisites

    Basic knowledge of Arm Treasure Data, including the TD Toolbelt.

    A Google Cloud Platform account

    Get Google Cloud Platform Credential

    To use this feature, you need: your Project ID, and JSON Credential.

    JSON Credential

    The integration with Google BigQuery is based on server-to-server API authentication. Go to your Google Developer Console, and click Credentials under APIs & auth at the left menu. Select Service account.

    Select the JSON based key type, which is recommended by Google. The key is downloaded automatically by the browser.

    Project ID

    Go to your Google Developer Console, and click Home on the left menu. You see your Project ID.

    Usage

    We support "append", "replace", "replace backup", and "delete" modes.

    Create Dataset and Table on BigQuery

    Create your Dataset and Table from your BigQuery console.

    Use from TD Console

    Go to Treasure Data Console, go to the query editor, click Add for Result Export, and select BigQuery. Complete all the information as follows:

    Write the query. The query results must match the pre-defined schema on BigQuery. After you execute the job, Treasure Data query results are automatically imported into Google BigQuery.

    Append Mode:

    When the table already exists in BigQuery, you see a pane similar to the following:

    Fill in the Project ID, Dataset Name, and Table Name obtained from BigQuery.

    Write the query. The query results must match the pre-defined schema on BigQuery. After you execute the job, Treasure Data query results are automatically imported into Google BigQuery.

    When the table does not exist in BigQuery, you see a pane similar to the following:

    Fill in the Project ID and Dataset Name obtained from BigQuery.

    Provide a name for the table that will be created in the BigQuery dataset. Select Auto-create table to provide a schema for the new table. The value in the Schema JSON field must match the query results (number of fields and data types). After you execute the job, a new table with the name and schema you provided is created, and Treasure Data query results are automatically imported into this table in Google BigQuery.

    Example of data in the JSON Schema field:

    [{"name":"id","type":"INTEGER","mode": "REQUIRED"},{"name":"name","type":"STRING"}]

    Replace Mode:

    Fill in the Project ID, Dataset Name, and Table Name obtained from BigQuery.

    Select Auto-create table and provide a schema for the table to be replaced. The value in the Schema JSON field must match the query results (number of fields and data types). Choose Replace from the drop-down. After you execute the job, if the table already exists in BigQuery, or if the value in Schema JSON differs from the table schema, the table is replaced with the new data. If the table does not exist in BigQuery, a new table is created with the name and schema you provided. Treasure Data query results are automatically imported into this table.

    Replace Backup Mode:

    Replace backup mode is similar to replace mode, but in replace backup mode, both data and schema are backed up into a table that is assigned the old name plus "_old" as the suffix.

    Delete Mode:

    Fill in the Project ID, Dataset Name, and Table Name obtained from BigQuery.

    Select Auto-create table and provide a schema for the new table. The value in the Schema JSON field must match the query results (number of fields and data types). Choose Delete from the drop-down. If the table already exists in BigQuery, when you execute the job the table is deleted and a new table is created with the schema from the JSON. If the table does not exist in BigQuery, a new table is created with the name and schema you provided. Treasure Data query results are automatically imported into this newly created table.

    Notes

    Data conversion

    Treasure Data's data types are automatically converted to corresponding BigQuery types, as indicated in the following table. If you include unsupported types in the Schema JSON field, or if a query result's data type mismatches the value in Schema JSON, you receive errors.

    Treasure Data                      | BigQuery
    -----------------------------------|-------------------------------
    string                             | STRING
    long                               | INTEGER
    double                             | FLOAT
    long (true is 1, false is 0)       | BOOLEAN
    string (yyyy-MM-dd HH:mm:ss.SSS)   | TIMESTAMP
    string (as JSON)                   | RECORD
    string (as JSON)                   | REPEATED (PRIMITIVE or RECORD)
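The TIMESTAMP mapping expects strings in yyyy-MM-dd HH:mm:ss.SSS form (millisecond precision). Producing one in plain Python looks like this; it is an illustration of the format, not connector code:

```python
from datetime import datetime

# Format a datetime as yyyy-MM-dd HH:mm:ss.SSS (milliseconds, not microseconds).
ts = datetime(2019, 11, 1, 6, 10, 0, 123000)
formatted = ts.strftime("%Y-%m-%d %H:%M:%S.") + f"{ts.microsecond // 1000:03d}"
assert formatted == "2019-11-01 06:10:00.123"
```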

    Use from CLI

    The following command allows you to set a scheduled query with Results Output to BigQuery.

    Designate your json_key and escape newlines with backslashes.

    For example,

    $ td sched:create scheduled_bigquery "10 6 14 12 *" \

    -d dataconnector_db "SELECT id,account,purchase,comment,time FROM data_connectors" \

    -r '{"type":"bigquery","project":"YOUR_PROJECT","dataset":"YOUR_DB","table":"YOUR_TABLE","auto_create_table":true,"max_bad_records":0,"ignore_unknown_values":true,"allow_quoted_newlines":true,"schema_file":"[{\"name\": \"id\", \"type\": \"INTEGER\"}, {\"name\": \"account\", \"type\": \"STRING\"},{\"name\": \"purchase\", \"type\": \"STRING\"}, {\"name\": \"comment\", \"type\": \"STRING\", \"mode\": \"REQUIRED\"}, {\"name\": \"time\", \"type\": \"TIMESTAMP\", \"mode\": \"REQUIRED\"}]", "json_keyfile":"{\"private_key_id\": \"ABDE\", \"private_key\": \"-----BEGIN PRIVATE KEY-----\\nABCDE\\nABCDE\\nABCDE\\n-----END PRIVATE KEY-----\\n\", \"client_email\": \"ABCDE.gserviceaccount.com\", \"client_id\": \"ABCDE.apps.googleusercontent.com\", \"type\": \"service_account\"}"}'

    View Article
    Treasure Data's Hive query engine has some limitations compared to standard Hive at this stage.

    SELECT row limit

    The maximum number of rows for SELECT queries is 2^31 (2,147,483,647) on both CDH4 and HDP2. If the number of rows exceeds the limit, the query fails.

    To work around this issue, use INSERT INTO, which can insert more records.

    Division

    When two values (of any type) are divided in Hive, the result of the division is always automatically upgraded to double. To obtain an integer result from a division, cast the result of the division to the needed type. For example:

    SELECT

    CAST((column1 / column2) AS bigint) as division

    FROM

    table

    The Presto engine behaves differently with divisions and does not automatically upgrade the result to double.
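The same distinction exists in Python, which makes a handy mental model: the / operator always yields a float (like Hive's upgrade to double), while truncating with int() plays the role of CAST(... AS bigint):

```python
# Hive-style division: the result is upgraded to a floating-point value.
assert 7 / 2 == 3.5

# CAST(... AS bigint) analogue: truncate back to an integer.
assert int(7 / 2) == 3
```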

    Trim Double-Byte Space

    The TRIM function in Hive can't handle double-byte spaces. For example:

    SELECT

    TRIM('a　') as value_keyword

    FROM

    table

    => 'a　' (Expected result is 'a')

    As a workaround, use regexp_replace:

    SELECT

    regexp_replace(TRIM('a　'), '^　+|　+$', '') as value_keyword

    FROM

    table
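The workaround's intent, stripping leading and trailing full-width (double-byte, U+3000) spaces, can be mirrored in Python; this is an illustrative analogue of the regexp_replace call, not Hive code:

```python
import re

# Strip leading/trailing full-width (U+3000) spaces, which TRIM leaves behind.
FULLWIDTH_SPACE = "\u3000"
value = "a" + FULLWIDTH_SPACE
stripped = re.sub("^\u3000+|\u3000+$", "", value)
assert stripped == "a"
```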

    Hive Floating Point

    Hive adds extra decimals to floating-point values (see https://issues.apache.org/jira/browse/HIVE-3715). The following example illustrates the behavior.

    1. Create a table floattest with 2 columns:

    floatval for float datatype

    doubleval for double datatype

    2. Insert values into the floattest table:

    123123.2 as floatval

    123123.2 as doubleval

    3. Select cast(floatval as double) from floattest, or select floatval from floattest:

    returned: 123123.203125.

    This is consistent with the output from Java, because the data is stored as a double.

    4. Select doubleval from floattest

    returned : 123123.2

    If you are concerned about the precision of numbers in your output, use double instead of float.
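
The extra digits come from 32-bit float rounding, not from Hive itself. A small sketch (helper name is illustrative) shows that round-tripping 123123.2 through a 32-bit float yields exactly the value seen in the example above:

```python
import struct

# Pack a value as a 32-bit IEEE float and read it back. The nearest
# float32 to 123123.2 is 123123.203125, which is why Hive float columns
# show extra decimals when widened to double.
def as_float32(value: float) -> float:
    return struct.unpack("<f", struct.pack("<f", value))[0]
```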

    Hive's reserved keywords (Why do I get a warning about reserved keywords?)

    Treasure Data uses Hive as a querying language. One of Hive's limitations is that a table name cannot be a reserved keyword. Thus, if your table (or database) name is identical to a reserved keyword, you won't be able to query data in that table (or database).

    Hive also has non-reserved keywords, which have a prescribed meaning in the language but are still allowed as identifiers, such as column or table names.

    For example, if you create an items table and store data in it, you won't be able to execute the following Hive query, because items is a reserved keyword in Hive.

    SELECT v FROM items

    The following are Reserved Keywords in Hive:

    ALL,ALTER,AND,ARRAY,AS,AUTHORIZATION,BETWEEN,BIGINT,BINARY,BOOLEAN,BOTH,BY,CASE,CAST,

    CHAR,COLUMN,CONF,CREATE,CROSS,CUBE,CURRENT,CURRENT_DATE,CURRENT_TIMESTAMP,CURSOR,

    DATABASE,DATE,DECIMAL,DELETE,DESCRIBE,DISTINCT,DOUBLE,DROP,ELSE,END,EXCHANGE,EXISTS,

    EXTENDED,EXTERNAL,FALSE,FETCH,FLOAT,FOLLOWING,FOR,FROM,FULL,FUNCTION,GRANT,GROUP,

    GROUPING,HAVING,IF,IMPORT,IN,INNER,INSERT,INT,INTERSECT,INTERVAL,INTO,IS,JOIN,LATERAL,

    LEFT,LESS,LIKE,LOCAL,MACRO,MAP,MORE,NONE,NOT,NULL,OF,ON,OR,ORDER,OUT,OUTER,OVER,

    PARTIALSCAN,PARTITION,PERCENT,PRECEDING,PRESERVE,PROCEDURE,RANGE,READS,REDUCE,REVOKE,

    RIGHT,ROLLUP,ROW,ROWS,SELECT,SET,SMALLINT,TABLE,TABLESAMPLE,THEN,TIMESTAMP,TO,TRANSFORM,

    TRIGGER,TRUE,TRUNCATE,UNBOUNDED,UNION,UNIQUEJOIN,UPDATE,USER,USING,VALUES,VARCHAR,WHEN,

    WHERE,WINDOW,WITH

    The following are Non-reserved Keywords in Hive:

    ADD,ADMIN,AFTER,ANALYZE,ARCHIVE,ASC,BEFORE,BUCKET,BUCKETS,CASCADE,CHANGE,CLUSTER,

    CLUSTERED,CLUSTERSTATUS,COLLECTION,COLUMNS,COMMENT,COMPACT,COMPACTIONS,COMPUTE,

    CONCATENATE,CONTINUE,DATA,DATABASES,DATETIME,DAY,DBPROPERTIES,DEFERRED,DEFINED,

    DELIMITED,DEPENDENCY,DESC,DIRECTORIES,DIRECTORY,DISABLE,DISTRIBUTE,ELEM_TYPE,ENABLE,

    ESCAPED,EXCLUSIVE,EXPLAIN,EXPORT,FIELDS,FILE,FILEFORMAT,FIRST,FORMAT,FORMATTED,

    FUNCTIONS,HOLD_DDLTIME,HOUR,IDXPROPERTIES,IGNORE,INDEX,INDEXES,INPATH,INPUTDRIVER,

    INPUTFORMAT,ITEMS,JAR,KEYS,KEY_TYPE,LIMIT,LINES,LOAD,LOCATION,LOCK,LOCKS,LOGICAL,LONG,

    MAPJOIN,MATERIALIZED,MINUS,MINUTE,MONTH,MSCK,NOSCAN,NO_DROP,OFFLINE,OPTION,OUTPUTDRIVER,

    OUTPUTFORMAT,OVERWRITE,OWNER,PARTITIONED,PARTITIONS,PLUS,PRETTY,PRINCIPALS,PROTECTION,

    PURGE,READ,READONLY,REBUILD,RECORDREADER,RECORDWRITER,REGEXP,RELOAD,RENAME,REPAIR,

    REPLACE,RESTRICT,REWRITE,RLIKE,ROLE,ROLES,SCHEMA,SCHEMAS,SECOND,SEMI,SERDE,

    SERDEPROPERTIES,SERVER,SETS,SHARED,SHOW,SHOW_DATABASE,SKEWED,SORT,SORTED,SSL,STATISTICS,

    STORED,STREAMTABLE,STRING,STRUCT,TABLES,TBLPROPERTIES,TEMPORARY,TERMINATED,TINYINT,TOUCH,

    TRANSACTIONS,UNARCHIVE,UNDO,UNIONTYPE,UNLOCK,UNSET,UNSIGNED,URI,USE,UTC,UTCTIMESTAMP,

    VALUE_TYPE,VIEW,WHILE,YEAR


    If you encounter a reserved keyword issue in your query, wrap the offending keyword in backticks (`) as shown below.

    SELECT `keyword_column`, COUNT(1) AS cnt FROM table_name GROUP BY `keyword_column`
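
When generating SQL programmatically, the same backtick-quoting can be applied automatically. The following is a hypothetical helper, not part of Treasure Data's tooling, and the reserved set shown is only a small excerpt of the full list above:

```python
# Excerpt of Hive reserved keywords (see the full list above); "items" is
# included here because the article's example treats it as reserved.
HIVE_RESERVED = {"all", "alter", "and", "table", "select", "where",
                 "order", "group", "items"}

def quote_identifier(name: str) -> str:
    """Backtick-quote an identifier if it collides with a reserved keyword."""
    return f"`{name}`" if name.lower() in HIVE_RESERVED else name
```

For example, `quote_identifier("table")` yields `` `table` `` while `quote_identifier("pid")` is returned unchanged.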

    Query: Hive 0.13 Query Returns Unexpected Null Results

    Hive jobs complete successfully but, on rare occasions, produce an incorrect result, outputting NULL values for one or more columns even when NULL is filtered out by the WHERE clause or is not present in the source data. Numerous SQL patterns can produce this error, but they typically involve HiveQL with subqueries.

    Workaround: Create intermediate tables that compute the values of the subqueries, and then use the intermediate tables to rewrite the main query.

    For example:

    Assume that in the source data table test_join, test_join.pid contains no NULL values.

    The following query produces unexpected NULL values in the pcnt column:

    SELECT t0.pid, t2.cnt AS pcnt
    FROM
      (SELECT DISTINCT pid FROM test_join) t0
    JOIN
      (SELECT pid,
              SUM(paid_price) AS weight,
              COUNT(*) AS cnt
         FROM test_join
        GROUP BY pid
        ORDER BY weight DESC
        -- With this ORDER BY, pcnt is NULL in every row of the result.
        -- Without this line, the pcnt values are correct.
      ) t2
    ON t0.pid = t2.pid

    One way to rewrite such a query is to break out subqueries into separate tables, as follows:

    CREATE TABLE distinct_pid AS (SELECT DISTINCT pid FROM test_join);

    CREATE TABLE weight_tbl AS (SELECT pid,
           SUM(paid_price) AS weight,
           COUNT(*) AS cnt
      FROM test_join
     GROUP BY pid
     ORDER BY weight DESC);

    SELECT t0.pid, t2.cnt AS pcnt

    FROM

    distinct_pid t0

    JOIN

    weight_tbl t2

    ON t0.pid = t2.pid ;

    Contact support if you need help developing a workaround for your situation.

  • You can integrate Amplitude with Arm Treasure Data to enrich your data, add more data points, and integrate the data Amplitude collects with all the other parts of your marketing stack.

    To view sample workflows for importing data from Amplitude, view Treasure Boxes.

    You use this same connector to create and export Amplitude Events. See Exporting Event Data to Amplitude.

    Integrating Amplitude with Treasure Data makes it easy to:

    Add new features to Amplitude. As an example, you can use Treasure Data to unify Amplitude behavior data with customer data from Salesforce, creating personally identifiable records with cross-device behavior tracking.

    Use data collected from Amplitude to make the rest of your marketing stack smarter. For example, you can use Treasure Data and Amplitude to reduce churn by setting up automated triggers to create segments based on usage and feeding them into custom nurture funnels in Marketo.

    If you don't have a Treasure Data account, contact us so we can get you set up!

    How to connect to Amplitude using the Treasure Data Console

    Connecting to Amplitude using the Treasure Data Console is quick and easy. Alternatively, you can create connections from the command line, as described later in this article.

    In Amplitude

    Go to Settings -> Projects and get your API Key and Secret Key, which you need for the next step.

    In the Treasure Data console

    Go to Connections -> Sources Catalog and click on the Amplitude tile.

    Enter your API Key and Secret Key and click Continue.

    Give your connection a descriptive name and select Create Connection.

    How to transfer your Amplitude data to Treasure Data

    Select the Amplitude connection you created in Connections -> My Connections.

    Click on New Transfer.

    Enter the start time in the format YYYYMMDDTHH.

    Optionally, you can specify an End Time using the same format. If End Time is not specified, the default is the current time in your browser's timezone.

    Click Next to preview your data.

    You'll see a preview of your data. Click Next.

    If you are creating a new database, check Create new database and give your database a name. Do the same with Create new table.

    Select whether to append records to an existing table or replace your existing table.

    If you want to set a different partition key seed than the default key, you can specify a key using the popup menu.

    In the When tab, you can specify a one-time transfer, or you can schedule an automated recurring transfer. If you select Once now, click Start Transfer. If you select Repeat, specify your schedule options, then click Schedule Transfer.

    After your transfer runs, you can see the results of your transfer in the Databases tab.

    How to connect to Amplitude using the command line

    Before you can use the command line to configure a connector, you must install the Treasure Data Toolbelt.

    Install the Treasure Data Toolbelt

    Install the newest Treasure Data Toolbelt.

    $ td --version

    0.15.3

    Create a seed file

    First, create a file called seed.yml as shown in the following example, with your credentials.

    in:
      type: amplitude
      api_key: "<YOUR_API_KEY>"
      secret_key: "<YOUR_SECRET_KEY>"
      start_time: "20160901T03" # UTC timezone. The format is yyyymmddThh; "T" is a literal character.
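
Because start_time uses the fixed format yyyymmddThh, it can be generated programmatically rather than typed by hand. A minimal sketch (the helper name is illustrative, not part of the td toolchain):

```python
from datetime import datetime, timezone

# Format a datetime as the connector's yyyymmddThh string (UTC expected).
def td_hour_format(dt: datetime) -> str:
    return dt.strftime("%Y%m%dT%H")

# e.g. the current UTC hour:
now_start = td_hour_format(datetime.now(timezone.utc))
```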

    Run the Guess Fields command

    Run the following command in your terminal:

    $ td connector:guess seed.yml -o load.yml

    The connector:guess command automatically reads the target data and intelligently guesses the data format.

    Open the file load.yml, where you'll see guessed file format definitions including, in some cases, file formats, encodings, column names, and types.

    in: {type: amplitude, api_key: API_KEY, secret_key: SECRET_KEY,
      start_time: 20160901T03}
    out: {}
    exec: {}
    filters:
    - type: rename
      rules:
      - rule: upper_to_lower
      - rule: character_types
        pass_types: ["a-z", "0-9"]
        pass_characters: "_"
        replace: "_"
      - rule: first_character_types
        pass_types: ["a-z"]
        pass_characters: "_"
        prefix: "_"
      - rule: unique_number_suffix
        max_length: 128
    - type: add_time
      to_column: {name: time}
      from_value: {mode: upload_time}

    For more details on the rename filter, see rename filter plugin for Data Connector.

    For more details on available out modes, see Appendix.

    You can preview how the system will parse the file by using the preview command.

    $ td connector:preview load.yml

    +----------+-------------------+----------------------+----------------------------------------+----------------------+-------------------------------+-----------------------------+-------------------------------+---------------+-----------------+-------------------+-----------------------------+---------------------+--------------+-------------+-------------+-------------+----------------+----------------------------------+------------------------------+-----------------+----------------+-------------------+---------------------+----------------------------+---------------------+-----------------------+--------------------+----------------------+---------------------+---------------------+-----------------+-----------------+-------------+---------------+------------+----------------+-------------------+----------------+----------------------+-----------------------+---------------------------+--------------------------+----------------------------------------+-------------------+-------------------------------+

    | app:long | amplitude_id:long | user_id:string | device_id:string | event_time:timestamp | server_upload_time:timestamp | client_event_time:timestamp | client_upload_time:timestamp | event_id:long | session_id:long | event_type:string | amplitude_event_type:string | version_name:string | _schema:long | adid:string | groups:json | idfa:string | library:string | processed_time:timestamp | user_creation_time:timestamp | platform:string | os_name:string | os_version:string | device_brand:string | device_manufacturer:string | device_model:string | device_carrier:string | device_type:string | device_family:string | location_lat:double | location_lng:double | country:string | language:string | city:string | region:string | dma:string | revenue:double | ip_address:string | paying:boolean | start_version:string | event_properties:json | user_properties:json | data:json | uuid:string | _insert_id:string | time:timestamp |

    +----------+-------------------+----------------------+----------------------------------------+----------------------+-------------------------------+-----------------------------+-------------------------------+---------------+-----------------+-------------------+-----------------------------+---------------------+--------------+-------------+-------------+-------------+----------------+----------------------------------+------------------------------+-----------------+----------------+-------------------+---------------------+----------------------------+---------------------+-----------------------+--------------------+----------------------+---------------------+---------------------+-----------------+-----------------+-------------+---------------+------------+----------------+-------------------+----------------+----------------------+-----------------------+---------------------------+--------------------------+----------------------------------------+-------------------+-------------------------------+

    | 154962 | 10158089806 | "[email protected]" | "600bfc44-73b1-5e36-bc2f-d230dad9b1fd" | nil | "2016-09-13 04:25:31.405 UTC" | nil | "2016-09-13 04:25:31.405 UTC" | 823811538 | -1 | "watch_tutorial" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 04:25:32.791225 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "United States" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test A\"}" | "{\"first_event\":true}" | "1bc49492-796a-11e6-ae6a-22000a2589e5" | nil | "2016-09-28 08:00:10.137 UTC" |

    | 154962 | 10158089806 | "[email protected]" | "600bfc44-73b1-5e36-bc2f-d230dad9b1fd" | nil | "2016-09-13 04:25:36.952 UTC" | nil | "2016-09-13 04:25:36.952 UTC" | 38872708 | -1 | "watch_tutorial" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 04:25:38.171848 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "United States" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test A\"}" | "{}" | "1f635fde-796a-11e6-ae6a-22000a2589e5" | nil | "2016-09-28 08:00:10.137 UTC" |

    | 154962 | 10158089806 | "[email protected]" | "600bfc44-73b1-5e36-bc2f-d230dad9b1fd" | nil | "2016-09-13 05:00:03.679 UTC" | nil | "2016-09-13 05:00:03.679 UTC" | 689967922 | -1 | "watch_tutorial" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 05:00:04.692970 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "United States" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test A\"}" | "{}" | "ef211fbe-796e-11e6-ae6a-22000a2589e5" | nil | "2016-09-28 08:00:10.137 UTC" |

    | 154962 | 10158359360 | "[email protected]" | "b37e7b0d-5474-5618-ae0a-e3d5b36b35f6" | nil | "2016-09-13 06:00:25.511 UTC" | nil | "2016-09-13 06:00:25.511 UTC" | 44814028 | -1 | "eventtest" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 06:00:26.509410 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "Japan" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test B\"}" | "{}" | "5da3baf2-7977-11e6-ba91-22000a5680d9" | nil | "2016-09-28 08:00:10.137 UTC" |

    | 154962 | 10158089806 | "[email protected]" | "600bfc44-73b1-5e36-bc2f-d230dad9b1fd" | nil | "2016-09-13 06:00:25.541 UTC" | nil | "2016-09-13 06:00:25.541 UTC" | 847463167 | -1 | "watch_tutorial" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 06:00:26.735517 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "United States" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test A\"}" | "{}" | "5e0badba-7977-11e6-ae6a-22000a2589e5" | nil | "2016-09-28 08:00:10.137 UTC" |

    | 154962 | 10158359360 | "[email protected]" | "b37e7b0d-5474-5618-ae0a-e3d5b36b35f6" | nil | "2016-09-13 07:00:00.437 UTC" | nil | "2016-09-13 07:00:00.437 UTC" | 808687647 | -1 | "eventtest" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 07:00:01.389360 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "Japan" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test B\"}" | "{}" | "b07c57b8-797f-11e6-ba91-22000a5680d9" | nil | "2016-09-28 08:00:10.137 UTC" |

    | 154962 | 10158089806 | "[email protected]" | "600bfc44-73b1-5e36-bc2f-d230dad9b1fd" | nil | "2016-09-13 07:00:00.589 UTC" | nil | "2016-09-13 07:00:00.589 UTC" | 145913611 | -1 | "watch_tutorial" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 07:00:01.178515 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "United States" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test A\"}" | "{}" | "b091fe4c-797f-11e6-ae6a-22000a2589e5" | nil | "2016-09-28 08:00:10.137 UTC" |

    | 154962 | 10158089806 | "[email protected]" | "600bfc44-73b1-5e36-bc2f-d230dad9b1fd" | nil | "2016-09-13 08:00:26.193 UTC" | nil | "2016-09-13 08:00:26.193 UTC" | 118354173 | -1 | "watch_tutorial" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 08:00:26.888901 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "United States" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test A\"}" | "{}" | "21a9a762-7988-11e6-ae6a-22000a2589e5" | nil | "2016-09-28 08:00:10.137 UTC" |

    | 154962 | 10158359360 | "[email protected]" | "b37e7b0d-5474-5618-ae0a-e3d5b36b35f6" | nil | "2016-09-13 08:00:26.297 UTC" | nil | "2016-09-13 08:00:26.297 UTC" | 177419860 | -1 | "eventtest" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 08:00:27.323531 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "Japan" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test B\"}" | "{}" | "21ebf04a-7988-11e6-ba91-22000a5680d9" | nil | "2016-09-28 08:00:10.137 UTC" |

    | 154962 | 10158359360 | "[email protected]" | "b37e7b0d-5474-5618-ae0a-e3d5b36b35f6" | nil | "2016-09-13 09:00:00.890 UTC" | nil | "2016-09-13 09:00:00.890 UTC" | 256548632 | -1 | "eventtest" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 09:00:01.668570 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "Japan" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test B\"}" | "{}" | "7409a27a-7990-11e6-ba91-22000a5680d9" | nil | "2016-09-28 08:00:10.137 UTC" |

    | 154962 | 10158089806 | "[email protected]" | "600bfc44-73b1-5e36-bc2f-d230dad9b1fd" | nil | "2016-09-13 09:00:00.982 UTC" | nil | "2016-09-13 09:00:00.982 UTC" | 496536219 | -1 | "watch_tutorial" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 09:00:01.650288 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "United States" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test A\"}" | "{}" | "746275a8-7990-11e6-ae6a-22000a2589e5" | nil | "2016-09-28 08:00:10.137 UTC" |

    | 154962 | 10158089806 | "[email protected]" | "600bfc44-73b1-5e36-bc2f-d230dad9b1fd" | nil | "2016-09-13 10:16:46.350 UTC" | nil | "2016-09-13 10:16:46.350 UTC" | 958097131 | -1 | "watch_tutorial" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 10:16:46.663451 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "United States" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test A\"}" | "{}" | "2d2fb2bc-799b-11e6-ae6a-22000a2589e5" | nil | "2016-09-28 08:00:10.137 UTC" |

    | 154962 | 10158359360 | "[email protected]" | "b37e7b0d-5474-5618-ae0a-e3d5b36b35f6" | nil | "2016-09-13 10:16:46.393 UTC" | nil | "2016-09-13 10:16:46.393 UTC" | 474656866 | -1 | "eventtest" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 10:16:47.669048 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "Japan" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test B\"}" | "{}" | "2dc7129c-799b-11e6-ba91-22000a5680d9" | nil | "2016-09-28 08:00:10.137 UTC" |

    | 154962 | 10158089806 | "[email protected]" | "600bfc44-73b1-5e36-bc2f-d230dad9b1fd" | nil | "2016-09-13 14:25:43.820 UTC" | nil | "2016-09-13 14:25:43.820 UTC" | 562771389 | -1 | "watch_tutorial" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 14:25:44.723185 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "United States" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test A\"}" | "{}" | "f4f3f0ca-79bd-11e6-ae6a-22000a2589e5" | nil | "2016-09-28 08:00:10.137 UTC" |

    | 154962 | 10158359360 | "[email protected]" | "b37e7b0d-5474-5618-ae0a-e3d5b36b35f6" | nil | "2016-09-13 14:25:43.825 UTC" | nil | "2016-09-13 14:25:43.825 UTC" | 714909881 | -1 | "eventtest" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 14:25:45.332413 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "Japan" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test B\"}" | "{}" | "f5472bb4-79bd-11e6-ba91-22000a5680d9" | nil | "2016-09-28 08:00:10.137 UTC" |

    | 154962 | 10158359360 | "[email protected]" | "b37e7b0d-5474-5618-ae0a-e3d5b36b35f6" | nil | "2016-09-13 16:25:52.992 UTC" | nil | "2016-09-13 16:25:52.992 UTC" | 966142739 | -1 | "eventtest" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 16:25:54.925309 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "Japan" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test B\"}" | "{}" | "be8c64a2-79ce-11e6-ba91-22000a5680d9" | nil | "2016-09-28 08:00:10.137 UTC" |

    | 154962 | 10158089806 | "[email protected]" | "600bfc44-73b1-5e36-bc2f-d230dad9b1fd" | nil | "2016-09-13 16:25:52.980 UTC" | nil | "2016-09-13 16:25:52.980 UTC" | 136550534 | -1 | "watch_tutorial" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 16:25:55.159777 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "United States" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test A\"}" | "{}" | "beb0e872-79ce-11e6-ae6a-22000a2589e5" | nil | "2016-09-28 08:00:10.137 UTC" |

    | 154962 | 10158089806 | "[email protected]" | "600bfc44-73b1-5e36-bc2f-d230dad9b1fd" | nil | "2016-09-13 18:26:06.237 UTC" | nil | "2016-09-13 18:26:06.237 UTC" | 911332926 | -1 | "watch_tutorial" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 18:26:07.308899 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "United States" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test A\"}" | "{}" | "897c0d24-79df-11e6-ae6a-22000a2589e5" | nil | "2016-09-28 08:00:10.137 UTC" |

    | 154962 | 10158359360 | "[email protected]" | "b37e7b0d-5474-5618-ae0a-e3d5b36b35f6" | nil | "2016-09-13 18:26:06.582 UTC" | nil | "2016-09-13 18:26:06.582 UTC" | 637365591 | -1 | "eventtest" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 18:26:07.932679 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "Japan" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test B\"}" | "{}" | "89d72740-79df-11e6-ba91-22000a5680d9" | nil | "2016-09-28 08:00:10.137 UTC" |

    | 154962 | 10158089806 | "[email protected]" | "600bfc44-73b1-5e36-bc2f-d230dad9b1fd" | nil | "2016-09-13 21:26:14.886 UTC" | nil | "2016-09-13 21:26:14.886 UTC" | 325299849 | -1 | "watch_tutorial" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 21:26:15.879945 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "United States" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test A\"}" | "{}" | "b3e28336-79f8-11e6-ae6a-22000a2589e5" | nil | "2016-09-28 08:00:10.137 UTC" |

    | 154962 | 10158359360 | "[email protected]" | "b37e7b0d-5474-5618-ae0a-e3d5b36b35f6" | nil | "2016-09-13 21:26:14.881 UTC" | nil | "2016-09-13 21:26:14.881 UTC" | 275591044 | -1 | "eventtest" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 21:26:15.902873 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "Japan" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test B\"}" | "{}" | "b3e89ca8-79f8-11e6-ba91-22000a5680d9" | nil | "2016-09-28 08:00:10.137 UTC" |

    | 154962 | 10158089806 | "[email protected]" | "600bfc44-73b1-5e36-bc2f-d230dad9b1fd" | nil | "2016-09-14 01:29:27.029 UTC" | nil | "2016-09-14 01:29:27.029 UTC" | 335311243 | -1 | "watch_tutorial" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-14 01:29:28.287592 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "United States" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test A\"}" | "{}" | "ada09ba8-7a1a-11e6-9f4f-22000a2589e5" | nil | "2016-09-28 08:00:10.137 UTC" |

    | 154962 | 10158359360 | "[email protected]" | "b37e7b0d-5474-5618-ae0a-e3d5b36b35f6" | nil | "2016-09-14 01:29:26.988 UTC" | nil | "2016-09-14 01:29:26.988 UTC" | 633523172 | -1 | "eventtest" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-14 01:29:28.350201 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "Japan" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test B\"}" | "{}" | "ada9cae8-7a1a-11e6-bead-22000a5680d9" | nil | "2016-09-28 08:00:10.137 UTC" |

    | 154962 | 10158359360 | "[email protected]" | "b37e7b0d-5474-5618-ae0a-e3d5b36b35f6" | nil | "2016-09-14 02:00:26.207 UTC" | nil | "2016-09-14 02:00:26.207 UTC" | 449936451 | -1 | "eventtest" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-14 02:00:26.662638 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "Japan" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test B\"}" | "{}" | "0157e478-7a1f-11e6-bead-22000a5680d9" | nil | "2016-09-28 08:00:10.137 UTC" |

    +----------+-------------------+----------------------+----------------------------------------+----------------------+-------------------------------+-----------------------------+-------------------------------+---------------+-----------------+-------------------+-----------------------------+---------------------+--------------+-------------+-------------+-------------+----------------+----------------------------------+------------------------------+-----------------+----------------+-------------------+---------------------+----------------------------+---------------------+-----------------------+--------------------+----------------------+---------------------+---------------------+-----------------+-----------------+-------------+---------------+------------+----------------+-------------------+----------------+----------------------+-----------------------+---------------------------+--------------------------+----------------------------------------+-------------------+-------------------------------+

    24 rows in set

    Update load.yml and use 'td connector:preview load.yml' to preview again.

    Use 'td connector:issue load.yml' to run Server-side bulk load.

    Execute a Load Job

    Finally, submit the load job. It may take a couple of hours depending on the size of your data.

    NOTE: The Amplitude connector provides a time column automatically, but you can also designate any timestamp column, for example --time-column server_upload_time.

    $ td connector:issue load.yml --database td_sample_db --table td_sample_table

    Scheduled execution

    You can also schedule incremental, periodic Data Connector execution from the command line, removing the need for a cron daemon on your local data center.

    For the first scheduled import, the Data Connector for Amplitude imports all of your data. On the second and subsequent runs, only newly added files are imported.

    Create the schedule

    A new schedule can be created by using the td connector:create command.

    The name of the schedule, cron-style schedule, database and table where data will be stored, and the Data Connector configuration file are required.

    $ td connector:create \
        daily_import \
        "10 0 * * *" \
        td_sample_db \
        td_sample_table \
        load.yml

    The `cron` parameter also accepts three special options: `@hourly`, `@daily` and `@monthly`.

    By default, the schedule is set in the UTC time zone. You can set the schedule in a time zone using the `-t` or `--timezone` option. The `--timezone` option only supports extended timezone formats like 'Asia/Tokyo', 'America/Los_Angeles' etc. Timezone abbreviations like PST, CST are *not* supported and may lead to unexpected results.
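
The distinction matters because extended names like 'Asia/Tokyo' are real tz database keys, while abbreviations like 'PST' are not. A quick validity check can be sketched in Python (illustrative only, not part of the td CLI):

```python
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError

# A name is acceptable only if it resolves to a tz database entry,
# e.g. "Asia/Tokyo" or "America/Los_Angeles"; "PST" is not a valid key.
def is_valid_timezone(name: str) -> bool:
    try:
        ZoneInfo(name)
        return True
    except (ZoneInfoNotFoundError, ValueError):
        return False
```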

    List the Schedules

    You can see the list of scheduled entries by running the command td connector:list.

    $ td connector:list

    +--------------+------------+----------+-------+--------------+-----------------+----------------------------+

    | Name | Cron | Timezone | Delay | Database | Table | Config |

    +--------------+------------+----------+-------+--------------+-----------------+----------------------------+

    | daily_import | 10 0 * * * | UTC | 0 | td_sample_db | td_sample_table | {"in"=>{"type"=>"amp", ... |

    +--------------+------------+----------+-------+--------------+-----------------+----------------------------+

    Show the Settings and Schedule History

    td connector:show shows the execution settings of a schedule entry.

    % td connector:show daily_import

    Name : daily_import

    Cron : 10 0 * * *

    Timezone : UTC

    Delay : 0

    Database : td_sample_db

    Table : td_sample_table

    Config

    ---

    in:
      type: amplitude
      api_key: "<YOUR_API_KEY>"
      secret_key: "<YOUR_SECRET_KEY>"
      start_time: "20160901T03" # UTC timezone. The format is yyyymmddThh; "T" is a literal character.
    out:
      mode: append

    td connector:history shows the execution history of a schedule entry. To investigate the results of each individual run, use td job <jobid>.

    % td connector:history daily_import

    +--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+

    | JobID | Status | Records | Database | Table | Priority | Started | Duration |

    +--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+

    | 578066 | success | 10000 | td_sample_db | td_sample_table | 0 | 2016-09-18 00:10:05 +0000 | 160 |

    | 577968 | success | 10000 | td_sample_db | td_sample_table | 0 | 2016-09-17 00:10:07 +0000 | 161 |

    | 577914 | success | 10000 | td_sample_db | td_sample_table | 0 | 2016-09-16 00:10:03 +0000 | 152 |

    | 577872 | success | 10000 | td_sample_db | td_sample_table | 0 | 2016-09-15 00:10:04 +0000 | 163 |

    | 577810 | success | 10000 | td_sample_db | td_sample_table | 0 | 2016-09-14 00:10:04 +0000 | 164 |

    | 577766 | success | 10000 | td_sample_db | td_sample_table | 0 | 2016-09-13 00:10:04 +0000 | 155 |

    | 577710 | success | 10000 | td_sample_db | td_sample_table | 0 | 2016-09-12 00:10:05 +0000 | 156 |

    | 577610 | success | 10000 | td_sample_db | td_sample_table | 0 | 2016-09-11 00:10:04 +0000 | 157 |

    +--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+

    8 rows in set

    Delete the Schedule

    td connector:delete removes the schedule.

    $ td connector:delete daily_import

    Configuration

See the following for more details on the available options for the in section.

api_key (string, required): API key.

secret_key (string, required): API secret key.

start_time (string, required): First hour included in the data series, formatted YYYYMMDDTHH (e.g. 20150201T05). This is UTC timezone.

end_time (string, optional; default: processed time): Last hour included in the data series, formatted YYYYMMDDTHH (e.g. 20150203T20). This is UTC timezone.

incremental (bool, optional; default: true): true for mode: append, false for mode: replace (see below).
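Putting the options together, a seed.yml that backfills a fixed window into a fresh table might look like this (a sketch; the keys come from the option list above, and the key and time values are placeholders):

```yaml
in:
  type: amplitude
  api_key: "<YOUR_API_KEY>"
  secret_key: "<YOUR_SECRET_KEY>"
  start_time: "20160901T03"   # UTC, yyyymmddThh
  end_time: "20160903T20"     # optional; defaults to the processed time
  incremental: false          # false pairs with mode: replace
out:
  mode: replace
```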

    Appendix

    A) Modes for out plugin

    You can specify the file import mode in the out section of seed.yml.

    append (default)

    This is the default mode, which appends records to the target table.

    in:

    ...

    out:

    mode: append

    replace (In td 0.11.10 and later)

This mode replaces data in the target table. Any manual schema changes made to the target table remain intact with this mode.

    in:

    ...

    out:

    mode: replace

    View Article
  • The data connector for Amazon S3 enables you to import the data from your JSON, TSV, and CSV files stored in an S3 bucket.

    For sample workflows on importing data from files stored in an S3 bucket, go to the Treasure Box on Github.

    Prerequisites

    Basic knowledge of Arm Treasure Data

    Use the TD Console to create your connection

    You can use TD Console to create your data connector.

    Create a new connection

    When you configure a data connection, you provide authentication to access the integration. In Treasure Data, you configure the authentication and then specify the source information.

    Go to Catalog and search and select AWS S3. Click Create.


    Using Credentials to Authenticate

    You need a client ID and access keys to authenticate using credentials.

    Register your credential. Set the following parameters.

Endpoint: the S3 endpoint. You can find region and endpoint information in the AWS documentation. (Ex. s3-ap-northeast-1.amazonaws.com)

    Authentication Method:

    basic: uses access_key_id and secret_access_key to authenticate. See here.

    Access Key ID

    Secret access key

    anonymous: uses anonymous access. This auth method can access only public files.

session: uses temporarily generated access_key_id, secret_access_key, and session_token. (Note that this authentication method is only available for data import. It can't be used for data export for now.)

    Access Key ID

    Secret access key

    Secret token

Click Continue. Name your new AWS S3 connection. Click Done.

    Transfer data from AWS S3

Next, you need to create a New Source on the Authentications page. You can prepare an ad hoc data connector job or a scheduled data connector job. This section requires the following four steps.

    Fetch from

    You need to register the information that you want to ingest.

    Bucket: provide the S3 bucket name (Ex. your_bucket_name)

    Path Prefix: specify a prefix for target keys. (Ex. logs/data_)

Path Regex: use a regexp to match file paths. If a file path doesn't match the specified pattern, the file is skipped. For example, if you specify the pattern .csv$, then any file whose path doesn't match that pattern is skipped.

Skip Glacier Objects: select this option to skip processing objects stored in the Amazon Glacier storage class. If objects are stored in the Glacier storage class but this option is not checked, an exception is thrown.

Filter by Modified Time: choose how to filter files for ingestion:

If it is unchecked (default):

Start after path: inserts the last_path parameter so that the first execution skips files before that path. (Ex. logs/data_20170101.csv)

Incremental: enables incremental loading. If incremental loading is enabled, the config diff for the next execution includes the last_path parameter so that the next execution skips files before that path. Otherwise, last_path is not included.

If it is checked:

Modified after: inserts the last_modified_time parameter so that the first execution skips files that were modified before the specified timestamp. (Ex. 2019-06-03T10:30:19.806Z)

Incremental by Modified Time: enables incremental loading by modified time. If incremental loading is enabled, the config diff for the next execution includes the last_modified_time parameter so that the next execution skips files that were modified before that time. Otherwise, last_modified_time is not included.

    You can limit the access to your S3 bucket/IAM user by using a list of static IPs. Contact [email protected] if you need it.

    There are instances where you might need to scan all the files in a directory (such as from the top level directory "/"). In such instances, you must use the CLI to do the import (see the Appendix for instructions on how to import using the CLI.)

    Example: CloudFront

    Amazon CloudFront is a web service that speeds up distribution of your static and dynamic web content. You can configure CloudFront to create log files that contain detailed information about every user request that CloudFront receives. If you enable logging, you can save CloudFront logfiles, shown as follows:

    [your_bucket] - [logging] - [E231A697YXWD39.2017-04-23-15.a103fd5a.gz]

    [your_bucket] - [logging] - [E231A697YXWD39.2017-04-23-15.b2aede4a.gz]

    [your_bucket] - [logging] - [E231A697YXWD39.2017-04-23-16.594fa8e6.gz]

    [your_bucket] - [logging] - [E231A697YXWD39.2017-04-23-16.d12f42f9.gz]

    In this case, Fetch from settings are as shown:

    Bucket: your_bucket

    Path Prefix: logging/

    Path Regex: .gz$ (Not Required)

    Start after path: logging/E231A697YXWD39.2017-04-23-15.b2aede4a.gz (Assuming that you want to import the log files from 2017-04-23-16.)

    Incremental: true (if you want to schedule this job.)
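For reference, the equivalent CLI seed.yml (per the Appendix) would look roughly like this. This is a sketch: credentials are placeholders, and path_match_pattern / last_path are the config-file counterparts of the Path Regex and Start after path fields, so verify the option names against your connector version:

```yaml
in:
  type: s3
  access_key_id: XXXXXXXXXX
  secret_access_key: YYYYYYYYYY
  bucket: your_bucket
  path_prefix: logging/
  path_match_pattern: .gz$        # optional
  last_path: logging/E231A697YXWD39.2017-04-23-15.b2aede4a.gz
  incremental: true
out:
  mode: append
```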

    Preview

In this section, you can see a preview of the data you configured. If you cannot reach the Preview page, you might encounter errors when you proceed with the import using TD Console. See Preview for more information.

If you would like to change a column name, choose Advanced Settings.

    Advanced Settings

    Advanced Settings allow you to edit guessed properties. Edit the following section, if you need to.

Default timezone: changes the time zone of timestamp columns if the value itself doesn't include a time zone.

    Columns:

Name: changes the name of the column. Column names may contain only lowercase letters, numbers, and underscores (_).

Type: parses a value as the specified type, then converts it to the Treasure Data schema and stores it.

    boolean

    long

timestamp: imported as string type in Treasure Data (Ex. 2017-04-01 00:00:00.000)

    double

    string

    json

    Total file count limit: maximum number of files to read. (optional)

    Transfer to

    In this phase, select your target database and table that you want to import data to.

    Mode: Append/Replace

Partition key Seed: choose a long or timestamp column as the partitioning time. If no column is chosen, upload_time is used as the default time column via the add_time filter.

    When

    In this phase, you can set an ad hoc or schedule configuration for your job.

    When

Once now: set a one-time job.

Repeat

Schedule: accepts @hourly, @daily, and @monthly, as well as custom cron syntax.

Delay Transfer: add a delay to the execution time.

TimeZone: supports extended timezone formats like Asia/Tokyo.

    My Input Transfers

    Your data connector jobs are listed on the jobs page of the TD Console.

    Scheduled Execution

    You can schedule periodic data connector execution for incremental S3 file import. We configure our scheduler carefully to ensure high availability.

For the scheduled import, the Data Connector for Amazon S3 initially imports all files that match the specified prefix (e.g. path_prefix: path/to/sample_ matches path/to/sample_201501.csv.gz, path/to/sample_201502.csv.gz, ..., path/to/sample_201505.csv.gz) and remembers one of the following fields depending on the condition:

If use_modified_time is disabled, the last path (path/to/sample_201505.csv.gz) is saved for the next execution. On the second and subsequent runs, the connector imports only files that come after the last path in alphabetical (lexicographic) order (path/to/sample_201506.csv.gz, ...).

Otherwise, the time at which the job is executed (e.g. 2019-06-17T12:00:00.000Z) is saved for the next execution. On the second and subsequent runs, the connector imports only files that were modified after that execution time.
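The last_path bookkeeping can be sketched in a few lines of Python (illustrative only; the real connector performs this server-side, and the file names below are hypothetical):

```python
# Illustrative sketch of last_path-based incremental loading.
def files_to_import(all_files, last_path=None):
    """Return files strictly after last_path in lexicographic order."""
    candidates = sorted(all_files)
    if last_path is None:
        return candidates  # first run: import everything matching the prefix
    return [f for f in candidates if f > last_path]

files = [
    "path/to/sample_201501.csv.gz",
    "path/to/sample_201502.csv.gz",
    "path/to/sample_201505.csv.gz",
]

first_run = files_to_import(files)   # imports all three files
saved_last_path = first_run[-1]      # saved for the next execution

files.append("path/to/sample_201506.csv.gz")
second_run = files_to_import(files, saved_last_path)
# second_run contains only the newly added file
```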

    Create a Schedule using the TD Toolbelt

    A new schedule can be created using the td connector:create command.

    $ td connector:create daily_import "10 0 * * *" \

    td_sample_db td_sample_table load.yml

It's also recommended to specify the --time-column option, because Treasure Data's storage is partitioned by time (see also data partitioning).

    $ td connector:create daily_import "10 0 * * *" \

    td_sample_db td_sample_table load.yml \

    --time-column created_at

    The `cron` parameter also accepts three special options: `@hourly`, `@daily` and `@monthly`.

By default, the schedule is set up in the UTC timezone. You can set the schedule in a different timezone using the -t or --timezone option. The `--timezone` option supports only extended timezone formats like 'Asia/Tokyo' and 'America/Los_Angeles'. Timezone abbreviations like PST and CST are *not* supported and may lead to unexpected schedules.

    List All Schedules

    You can see the list of currently scheduled entries by running the command td connector:list.

    $ td connector:list

    +--------------+--------------+----------+-------+--------------+-----------------+------------------------------------------+

    | Name | Cron | Timezone | Delay | Database | Table | Config |

    +--------------+--------------+----------+-------+--------------+-----------------+------------------------------------------+

    | daily_import | 10 0 * * * | UTC | 0 | td_sample_db | td_sample_table | {"in"=>{"type"=>"s3", "access_key_id"... |

    +--------------+--------------+----------+-------+--------------+-----------------+------------------------------------------+

    Show Schedule Settings And History

td connector:show shows the execution settings of a schedule entry.

    % td connector:show daily_import

    Name : daily_import

    Cron : 10 0 * * *

    Timezone : UTC

    Delay : 0

    Database : td_sample_db

    Table : td_sample_table

    Config

    ---

    in:

    type: s3

    access_key_id: XXXXXXXXXX

    secret_access_key: YYYYYYYYYY

    endpoint: s3.amazonaws.com

    bucket: sample_bucket

    path_prefix: path/to/sample_

    parser:

    charset: UTF-8

    ...

    td connector:history shows the execution history of a schedule entry. To investigate the results of each individual run, use td job <jobid>.

    % td connector:history daily_import

    +--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+

    | JobID | Status | Records | Database | Table | Priority | Started | Duration |

    +--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+

    | 578066 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-18 00:10:05 +0000 | 160 |

    | 577968 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-17 00:10:07 +0000 | 161 |

    | 577914 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-16 00:10:03 +0000 | 152 |

    | 577872 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-15 00:10:04 +0000 | 163 |

    | 577810 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-14 00:10:04 +0000 | 164 |

    | 577766 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-13 00:10:04 +0000 | 155 |

    | 577710 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-12 00:10:05 +0000 | 156 |

    | 577610 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-11 00:10:04 +0000 | 157 |

    +--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+

    8 rows in set

    Delete Schedule

td connector:delete removes the schedule.

    $ td connector:delete daily_import

    FAQ for the S3 Data Connector

Q: My Data Connector for S3 job is running for a long time. What can I do?

Check the count of S3 files that your connector job is ingesting. If there are over 10,000 files, performance degrades. To mitigate this issue, you can:

Narrow the path_prefix option to reduce the number of S3 files.

Set the min_task_size option to 268435456 (256MB).
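In the CLI seed.yml, that option sits in the in section (a sketch; credentials and paths are placeholders):

```yaml
in:
  type: s3
  access_key_id: XXXXXXXXXX
  secret_access_key: YYYYYYYYYY
  bucket: sample_bucket
  path_prefix: path/to/sample_
  min_task_size: 268435456   # 256MB; batches many small files into fewer tasks
out:
  mode: append
```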

    Q: How do I troubleshoot data import problems?

    Review the job log. Warning and errors provide information about the success of your import. For example, you can identify the source file names associated with import errors.

    Appendix

    A) Optional Alternative: Use the CLI to Configure the Connector

    Before setting up the connector, install the td command. Install the most current Treasure Data Toolbelt.

    Create Seed Config File (seed.yml)

Prepare seed.yml as shown in the following example, with your AWS access key and secret access key. You must also specify the bucket name and source file name (or a prefix for multiple files).

    in:

    type: s3

    access_key_id: XXXXXXXXXX

    secret_access_key: YYYYYYYYYY

    bucket: sample_bucket

    # path to the *.json or *.csv or *.tsv file on your s3 bucket

    path_prefix: path/to/sample_file

    out:

    mode: append

The Data Connector for Amazon S3 imports all files that match the specified prefix. (e.g. path_prefix: path/to/sample_ matches path/to/sample_201501.csv.gz, path/to/sample_201502.csv.gz, ..., path/to/sample_201505.csv.gz)

Using path_prefix with a leading '/' can lead to unintended results. For example, "path_prefix: /path/to/sample_file" results in the plugin looking for the file in s3://sample_bucket//path/to/sample_file, which on S3 is different from the intended path s3://sample_bucket/path/to/sample_file.
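The leading '/' matters because S3 object keys are plain strings and the full URI is formed by simple concatenation, as this minimal sketch shows (names are placeholders):

```python
# S3 keys are literal strings: a leading '/' in path_prefix becomes part
# of the key, producing a double slash and therefore a different object.
def s3_uri(bucket, path_prefix):
    return "s3://{}/{}".format(bucket, path_prefix)

intended = s3_uri("sample_bucket", "path/to/sample_file")
accidental = s3_uri("sample_bucket", "/path/to/sample_file")

assert intended == "s3://sample_bucket/path/to/sample_file"
assert accidental == "s3://sample_bucket//path/to/sample_file"  # double slash
```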

    For more details on available out modes, see Appendix.

    Guess Fields (Generate load.yml)

Use connector:guess. This command automatically reads the source files and assesses (uses logic to guess) the file format and its fields/columns.

    $ td connector:guess seed.yml -o load.yml

If you open up load.yml, you'll see the assessed file format definitions, including file formats, encodings, column names, and types.

    in:

    type: s3

    access_key_id: XXXXXXXXXX

    secret_access_key: YYYYYYYYYY

    bucket: sample_bucket

    path_prefix: path/to/sample_file

    parser:

    charset: UTF-8

    newline: CRLF

    type: csv

    delimiter: ','

    quote: '"'

    escape: ''

    skip_header_lines: 1

    columns:

    - name: id

    type: long

    - name: company

    type: string

    - name: customer

    type: string

    - name: created_at

    type: timestamp

    format: '%Y-%m-%d %H:%M:%S'

    out:

    mode: append

    Then, you can see a preview of the data using the td connector:preview command.

    $ td connector:preview load.yml

    +-------+---------+----------+---------------------+

    | id | company | customer | created_at |

    +-------+---------+----------+---------------------+

    | 11200 | AA Inc. | David | 2015-03-31 06:12:37 |

    | 20313 | BB Imc. | Tom | 2015-04-01 01:00:07 |

    | 32132 | CC Inc. | Fernando | 2015-04-01 10:33:41 |

    | 40133 | DD Inc. | Cesar | 2015-04-02 05:12:32 |

    | 93133 | EE Inc. | Jake | 2015-04-02 14:11:13 |

    +-------+---------+----------+---------------------+

The guess command needs more than 3 rows and 2 columns in the source data file, because the command assesses the column definition using sample rows from the source data.

If the guessed column names or types are not what you expect, modify load.yml directly and preview again.

    Currently, the Data Connector supports parsing of boolean, long, double, string, and timestamp types.

The `preview` command downloads one file from the specified bucket and displays the results from that file. This may cause a difference between the results of the preview and issue commands. See Preview for more information.

    Execute Load Job

    Submit the load job. It may take a couple of hours depending on the size of the data. Specify the Treasure Data database and table where the data should be stored.

It's also recommended to specify the --time-column option, because Treasure Data's storage is partitioned by time (see data partitioning). If the option is not provided, the data connector chooses the first long or timestamp column as the partitioning time. The type of the column specified by --time-column must be either long or timestamp.

If your data doesn't have a time column, you can add one by using the add_time filter option. For more details, see the add_time filter plugin.
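For example, a seed.yml that derives a time column from the upload time might look like this (a sketch of add_time filter usage; check the add_time filter plugin documentation for the authoritative syntax, and note the credentials and paths are placeholders):

```yaml
in:
  type: s3
  access_key_id: XXXXXXXXXX
  secret_access_key: YYYYYYYYYY
  bucket: sample_bucket
  path_prefix: path/to/sample_file
filters:
  - type: add_time
    to_column:
      name: time
      type: timestamp
    from_value:
      mode: upload_time
out:
  mode: append
```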

    $ td connector:issue load.yml --database td_sample_db --table td_sample_table \

    --time-column created_at

The connector:issue command assumes that you have already created a database (td_sample_db) and a table (td_sample_table). If the database or the table does not exist in TD, this command will not succeed; create the database and table manually, or use the --auto-create-table option with the td connector:issue command to create them automatically:

    $ td connector:issue load.yml --database td_sample_db --table td_sample_table --time-column created_at --auto-create-table

    The data connector does not sort records on server-side. To use time-based partitioning effectively, sort records in files beforehand.

If you have a field called time, you don't have to specify the --time-column option.

    $ td connector:issue load.yml --database td_sample_db --table td_sample_table

    B) Modes for out plugin

You can specify the file import mode in the out section of seed.yml.

    append (default)

    This is the default mode and records are appended to the target table.

    in:

    ...

    out:

    mode: append

    replace

    This mode replaces data in the target table. Any manual schema changes made to the target table remain intact with this mode.

    in:

    ...

    out:

    mode: replace

    C) IAM Permissions

The IAM credentials specified in the YML configuration file and used for the connector:guess and connector:issue commands must be granted permissions for the AWS S3 resources that they need to access. If the IAM user does not possess these permissions, configure the user with one of the predefined Policy Definitions or create a new Policy Definition in JSON format.

    The following example is based on the Policy Definition reference format, giving the IAM user read only (through GetObject and ListBucket actions) permission for the your-bucket:

    {

    "Version": "2012-10-17",

    "Statement": [

    {

    "Effect": "Allow",

    "Action": [

    "s3:GetObject",

    "s3:ListBucket"

    ],

    "Resource": [

    "arn:aws:s3:::your-bucket",

    "arn:aws:s3:::your-bucket/*"

    ]

    }

    ]

    }

    Replace your-bucket with the actual name of your bucket.

    D) Use AWS Security Token Service (STS) as temporary credentials provider

    In certain cases, IAM basic authentication through access_key_id and secret_access_key might be too risky (although the secret_access_key is never clearly shown when a job is executed or after a session is created).

The S3 data connector can use AWS Secure Token Service (STS) provided Temporary Security Credentials. Using AWS STS, any IAM user can use their own access_key_id and secret_access_key to create a set of temporary new_access_key_id, new_secret_access_key, and session_token keys with an associated expiration time, after which the credentials become invalid. There are essentially two types of Temporary Security Credentials:

Session Token: the simplest Security Credentials, with an associated expiration time. The temporary credentials give access to all resources that the original IAM credentials used to generate them had. These credentials are valid as long as they are not expired and the permissions of the original IAM credentials don't change.

Federation Token: adds an extra layer of permission control over the Session Token above. When generating a Federation Token, the IAM user is required to specify a Permission Policy definition. The scope can be used to further narrow down which of the resources accessible to the IAM user the bearer of the Federation Token should get access to. Any Permission Policy definition can be used, but the scope of the permission is limited to all or a subset of the permissions held by the IAM user that generated the token. As with the Session Token, the Federation Token credentials are valid as long as they are not expired and the permissions associated with the original IAM credentials don't change.

    AWS STS Temporary Security Credentials can be generated using the AWS CLI or the AWS SDK in the language of your choice.

    Session Token

    $ aws sts get-session-token --duration-seconds 900

    {

    "Credentials": {

    "SecretAccessKey": "YYYYYYYYYY",

    "SessionToken": "ZZZZZZZZZZ",

    "Expiration": "2015-12-23T05:11:14Z",

    "AccessKeyId": "XXXXXXXXXX"

    }

    }

    Federation Token

    $ aws sts get-federation-token --name temp_creds --duration-seconds 900 \

    --policy '{"Statement": [{"Effect": "Allow", "Action": ["s3:GetObject", "s3:ListBucket"], "Resource": "arn:aws:s3:::bucketname"}]}'

    {

    "FederatedUser": {

    "FederatedUserId": "523683666290:temp_creds",

    "Arn": "arn:aws:sts::523683666290:federated-user/temp_creds"

    },

    "Credentials": {

    "SecretAccessKey": "YYYYYYYYYY",

    "SessionToken": "ZZZZZZZZZZ",

    "Expiration": "2015-12-23T06:06:17Z",

    "AccessKeyId": "XXXXXXXXXX"

    },

    "PackedPolicySize": 16

    }

where:

temp_creds is the name of the federated token/user.

bucketname is the name of the bucket to give access to. Refer to the ARN specification for more details.

s3:GetObject and s3:ListBucket are the basic read operations for an AWS S3 bucket.

    AWS STS credentials cannot be revoked. They will remain effective until expired, or until you delete or remove the permissions of the original IAM user used to generate the credentials.

    When your Temporary Security Credentials are generated, copy the SecretAccessKey, AccessKeyId, and SessionToken in your seed.yml file as follows.

    in:

    type: s3

    auth_method: session

    access_key_id: XXXXXXXXXX

    secret_access_key: YYYYYYYYYY

    session_token: ZZZZZZZZZZ

    bucket: sample_bucket

    path_prefix: path/to/sample_file

    and execute the Data Connector for S3 as usual.

Because STS credentials expire after the specified amount of time, a data connector job that uses the credentials might eventually start failing once the expiration occurs. Currently, if the STS credentials are reported expired, the data connector job retries up to the maximum number of times (5) and eventually completes with 'error' status.

    E) IP Whitelist and VPC

You can find the IP addresses that TD uses here.

Look up the region of your TD Console from the URL you use to log in to TD, then refer to the "Data Connector" entry for your region at the above URL.

Region of TD Console: URL

US: https://console.treasuredata.com

Tokyo: https://console.treasuredata.co.jp

EU01: https://console.eu01.treasuredata.com

You also need to set up an access route in AWS if you are using an AWS S3 bucket located in the same region as your TD region. If so, configure access via "vpc-df7066ba".

    View Article
GDPR (General Data Protection Regulation) was introduced in Europe on May 25, 2018. The regulation is focused on the collection of personal information, and brings strict penalties of up to 4% of gross annual worldwide revenue or €20M (whichever is greater) for each violation. The rules and penalties can apply to companies based in any country, and customers based in any country, if a company has any sustained business presence in Europe (such as a small sales or support office) or does business with any European subjects.

    Arm Treasure Data helps you to comply with GDPR in the following ways:

    Data Retention Policy

    SDK Controls

    Sites and Endpoints

    Data Retention Policy

    Our data retention policy is described here.

    Data Retention in Presto

    Customer data that you collect using Treasure Data can be permanently deleted by using the Presto DELETE function.

    SDK Controls

The Treasure Data JavaScript SDK and Android, Unity, and iOS SDKs are all able to collect data that counts as personal data under GDPR. To protect Treasure Data customers (and Treasure Data) from legal jeopardy under GDPR, the SDKs have been updated to minimize the collection of personal data by default. Treasure Data customers who use such personal data must make code changes to continue to collect such data and notify their subscribers about consent for data collection and use.

    For the JavaScript SDKs, this personal information includes the td_client_id and IP address of a web site visitor. For the mobile SDKs, the personal information includes the td_uuid and IP address of the application sending events.

    GDPR-ready releases of our SDKs have been made available prior to the May 25 GDPR deadline, including:

    Release 2.1.0 (JS SDK) - https://github.com/treasure-data/td-js-sdk#data-privacy

    Release 0.1.9 (Unity SDK) - https://github.com/treasure-data/td-unity-sdk-package#gdpr-compliance

    Release 0.1.27 (iOS SDK) - https://github.com/treasure-data/td-ios-sdk#gdpr-compliance

    Release 0.1.18 (Android SDK) - https://github.com/treasure-data/td-android-sdk#opt-out

    How do I ensure I am in compliance with GDPR when using the SDK?

    Developers using the SDKs should upgrade to the latest SDK.

    Consult with your Privacy officer and Legal team, before collecting or enabling collection of personal data using Treasure Data provided SDKs. Then, after appropriate reviews by your company, you can explicitly enable collection of personal data for events if this is in line with your legal obligations and your companys chosen data privacy posture.

    The documentation for each SDK, hosted in GitHub, explains the new data privacy-related controls and how to re-enable data collection. If you have technical questions about the use of the SDKs, contact Treasure Data support or customer success for implementation guidance.

    EU Site and Endpoints (aws)

Arm Treasure Data accounts can use a data site in Europe. Data that is assigned to the EU service physically resides in Germany, to comply with both GDPR and the German Privacy Act (Bundesdatenschutzgesetz). For details, see Sites and Endpoints.

    Treasure Data and Segmentation for Marketing

    You can read more information about Treasure Data as a customer data platform and GDPR here.

    View Article
  • You can write job results from Arm Treasure Data directly to Criteo Marketing API.

    Prerequisites

    Basic knowledge of Treasure Data, including the toolbelt

    A Criteo Marketing API account

    Authorized Treasure Data account access to Criteo

    Use the TD Console to create your connection

    You can use the Treasure Data console to configure your connection.

    Create a new connection

You must create and configure the data connection to be used during export, prior to running your query. As part of the data connection, you provide authentication to access the integration.

    Go to Catalog and search and select Criteo.

    https://api.criteo.com/marketing/swagger/ui/index#!/Audiences/Audiences_GetAudiences

    The following dialog opens.

    Authenticating your connection

    You need a client ID and client secret to authenticate using credentials. To get a client ID and client secret, use the Criteo Management Center. A Criteo Marketing Account is required to use the Management Center.

    Click Setup > Users.

Then click CREATE API USER.

Enter a contact Email address and select one of the following roles:

    View only

    Business Manager

    Financial Manager

To use the Criteo Marketing API, select Business Manager. The View only and Financial Manager roles don't have enough privileges to input and pull data through the API.

When done, click Add user. You receive a confirmation message that includes your Client ID and Client Secret. Save this information because you use it later to configure the Criteo data connector.

    Configure Export Results in Your Data Connection

    In this step, you create or reuse a query. In the query, you configure the data connection.

    Note: Sometimes you need to define the column mapping in the query.

    Configure the Connection by Specifying the Parameters

    Go to the TD Console. Go to Data Workbench > Queries. Access the query that you plan to use to export data.

    Click Export Results located at top of your query editor. The Choose Integration dialog opens. Type the connection name in the search box to filter and select your connection.

    Parameters:

Advertiser ID (required): The ID under which audiences are grouped.

    Maximum Identities Per Request (optional): The maximum number of IDs to include per request. The maximum number for the Criteo interface is 50000. Default: 50000

    Retry Limit (optional): Number of times to attempt a network operation. Default: 7

Connect Timeout in Seconds (optional): The time, in seconds, to wait before aborting a connection operation. Default: 300, which is equivalent to 5 minutes.

Read Timeout in Seconds (optional): The time, in seconds, to wait before aborting a read operation. Default: 900, which is equivalent to 15 minutes.

Write Timeout in Seconds (optional): The time, in seconds, to wait before aborting a write operation. Default: 900, which is equivalent to 15 minutes.

    The following is a sample configuration:

    Example of a Query to Populate Criteo

    From Treasure Data, run the following query with Export Results into a connection for Criteo:

    Code Example

SELECT
  audience_id_column AS id,
  audience_name_column AS name,
  operation_column AS operation,
  schema_column AS schema,
  userid_column AS userid,
  gumid_column AS gumid
FROM your_table;

    Column Naming

    Data exported to Criteo must follow the Criteo schema. Supported column names are:

    id : The audience ID

    name : The audience name

    operation : The operation (add or remove) to perform on an audience

    schema : The type of identifier represented in the row of data. In this case, you must specify 'gum'.

    userid : The identifier to add or remove from an audience

    gumid : The GUM caller ID, data required for gum schema. In this case, you must specify 359 because the caller is TreasureData.

    Example:

SELECT
  audience_id_column AS id,
  audience_name_column AS name,
  'add' AS operation,
  'gum' AS schema,
  ${criteo_id} AS userid,
  359 AS gumid
FROM your_table

    Optional: Use of Scheduled Jobs for Output

    You can use Scheduled Jobs with Result Output, to periodically write the output result to a target destination that you specify.

    Optional: Configure Export Results in Workflows

    Within Treasure Workflow, you can specify the use of the data connector to output data.

    timezone: UTC

    _export:
      td:
        database: sample_datasets

    +td-result-into-criteo:
      td>: queries/sample.sql
      result_connection: your_connections_name
      result_settings:
        advertiser_id: 12345
        request_threshold: 50000
        request_retries: 7
        request_connect_timeout: 300
        request_read_timeout: 900
        request_write_timeout: 900

    Click here for more information on using data connectors in Workflows to export data.

    Appendix

    FAQ for Export into Criteo

    Q: Where do I confirm audience ID?

    To confirm the audience ID, use the tool provided by Criteo ( https://api.criteo.com/marketing/swagger/ui/index#!/Audiences/Audiences_GetAudiences )

    Q: Are both the audience ID and audience name required to not be NULL?

    No. At least one of the audience ID or audience name must be non-NULL; they are not both required. If both the audience ID and audience name are provided, the values must belong to the same audience.

    Q: If audience ID or audience name do not exist in the list of audiences, will they be created?

    If an audience name is provided and the audience ID is NULL, an audience is created if the audience name is not found in the list of audiences.

    If an audience ID is provided, an audience cannot be created if the audience ID is not found in the list of audiences. The Criteo API does not allow the audience ID to be specified when creating an audience.
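    The creation and matching rules above can be summarized in a small validation helper. This is an illustrative sketch only; the function and its names are hypothetical, not part of the connector:

    ```python
    def validate_audience_row(audience_id, audience_name, known_audiences):
        """Apply the FAQ rules: at least one of id/name must be non-NULL,
        and when both are given they must refer to the same audience.

        known_audiences maps audience_id -> audience_name, as would be
        returned by Criteo's audience-listing endpoint.
        """
        if audience_id is None and audience_name is None:
            return "error: id and name are both NULL"
        if audience_id is None:
            # Name only: a new audience is created if the name is unknown.
            if audience_name not in known_audiences.values():
                return "create audience '%s'" % audience_name
            return "ok"
        if audience_id not in known_audiences:
            # The API cannot create an audience with a caller-supplied id.
            return "error: unknown audience id"
        if audience_name is not None and known_audiences[audience_id] != audience_name:
            return "error: id/name mismatch"
        return "ok"
    ```

    For example, a row with a NULL ID and an unseen name triggers audience creation, while a non-NULL ID that is not in the advertiser's audience list is rejected.
    
    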

    Q: How long does it take for my audience to be updated when adding or removing users?

    Criteo can take up to 24 hours to process the data and make it available for display. The data is stored in Criteo's system, but takes some time to be processed.

    Q: How do I check the upload result?

    Check with the tool provided by Criteo. If the upload succeeds, the tool shows the result count in AudienceResponse.nbLines as JSON.

    FAQ for Log Messages

    Q: Log: Columns [<name(s)>] were not found.

    The column names listed in the message were missing from the query.

    Check the query to ensure that the column names are not missing or misspelled.

    Q: Log: Created audience id '<id>' for audience name '<name>'

    The audience name was not found within the list of audiences for the advertiser ID provided. The audience is created only if the name does not exist and the ID value provided by the query is NULL.

    Q: Log: Audience id '<id>' is not found in the list of audiences for advertiser '<id>'. Cannot create a new audience for non-existent id

    If the audience ID value provided by the query is not NULL and the audience ID does not exist within the list of audiences for the advertiser ID provided, then the audience cannot be created by the connector. The audience ID provided does not match the audience ID generated by the Criteo API and causes a data mismatch.

    Create an audience and update data entries of unknown audience IDs with newly created audience IDs.

    Q: Log: Audience name '<name>' is not found in the list of audiences for advertiser '<id>' but audience id '<id>' is found. Cannot create a new audience due to name mismatch

    Both the audience ID and name were provided by the query, but the audience name was not found in the list of audiences for the advertiser ID provided. Providing the name as part of the Criteo API operation is invalid.

    Make the name NULL in the database or update the audience name to match the name provided by the query.

    Q: Log: Audience id '<id>' and audience name '<name>' do not represent the same audience for advertiser '<id>'

    The audience name associated with ID does not match the audience name provided by the query.

    If both the audience ID and name are provided by the query and are not NULL, both values must be associated with the same audience. You can update the audience name associated with the ID to match the name provided by the query or you can modify the row of data and set the name to match the name value within the audience.

    View Article
  • You can write job results to your Salesforce.com organization. For sample workflows of SFDC, view Treasure Boxes.

    Prerequisites

    Basic knowledge of Arm Treasure Data, including the toolbelt

    Salesforce.com organization and username, password, and security token for API integration

    User has API Enabled permission

    Target Salesforce.com Object should exist with read/write permissions for the User

    Architecture

    A front-end application streams data to be collected in Treasure Data via a log/data collector daemon or the Mobile SDKs. You can also bulk import your data using Bulk Import from the CLI. A scheduled query is set up in Treasure Data to run periodically on the data and write the result of each query execution into your Salesforce.com Object.


    The above is a fairly common architecture; for example:

    Example 1: Ranking: What are the Top N of X?

    Every social/mobile application calculates the top N of X (ex: top 5 movies watched today). Treasure Data already handles the raw data warehousing; the write-to-Salesforce.com feature enables Treasure Data to find the top N data as well.

    Example 2: Dashboard Application

    If you're a data scientist, you need to keep track of a range of metrics every hour/day/month and make them accessible via visualizations. Using this write-to-Salesforce.com feature, you can streamline the process and focus on building visualizations of your query results via Reports and Dashboards in your Salesforce.com organization.

    Result Output URL Format

    Format

    The result output target is represented by a URL with the following format:

    sfdc://<username>:<password><security_token>@<hostname>/<object_name>

    where:

    sfdc is an identifier for result output to Salesforce.com;

    username and password are the credential to your Salesforce.com organization;

    security_token is the additional credential for API access;

    hostname is the host name of the Salesforce.com organization. Usually this is login.salesforce.com for production environments and test.salesforce.com for sandbox environments. If you have configured a custom domain for your organization, specify the hostname you use for login;

    object_name is the target Salesforce.com Object API name (e.g. ResultOutput__c). The Object and columns for data integration must be defined beforehand;

    For example with:

    username: user@treasure-data.com

    password: PASSWORD

    security_token: 7SMvicR9ojdPz0XLtlWi3Rtw

    The URL will look like:

    sfdc://user%40treasure-data.com:PASSWORD7SMvicR9ojdPz0XLtlWi3Rtw@login.salesforce.com/Account

    Make sure that you escape the '@' in the username with '%40'.
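    To avoid escaping mistakes, the URL can be assembled programmatically. The following Python sketch (a hypothetical helper, not part of the Treasure Data toolbelt) percent-encodes the credentials and concatenates the password and security token as described above:

    ```python
    from urllib.parse import quote

    def build_sfdc_url(username, password, security_token, hostname, object_name):
        """Assemble the sfdc:// result-output URL described above.

        The password and security token are concatenated, and reserved
        characters in the credentials (notably '@') are percent-encoded
        so the URL parses correctly.
        """
        user = quote(username, safe="")            # '@' becomes '%40'
        secret = quote(password + security_token, safe="")
        return "sfdc://%s:%s@%s/%s" % (user, secret, hostname, object_name)

    # Using the placeholder credentials from this article:
    url = build_sfdc_url("user@treasure-data.com", "PASSWORD",
                         "7SMvicR9ojdPz0XLtlWi3Rtw",
                         "login.salesforce.com", "Account")
    ```
    
    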

    Options

    Result output to Salesforce.com supports various options that can be specified as optional URL parameters. The options are compatible with each other and can be combined. Where applicable, the default behavior is indicated.

    Update mode option

    Controls the various ways of modifying the database data.

    Append

    Truncate

    Update

    mode=append (default)

    The append mode is the default that is used when no mode option is provided in the URL. In this mode, the query results are appended to the object.

    Because mode=append is the default behavior, these two URLs are equivalent:

    sfdc://.../Contact

    sfdc://.../Contact?mode=append

    mode=truncate

    With the truncate mode the system first truncates the existing records in the Salesforce.com Object and moves them into the Trashbin, then inserts the query results.

    Example:

    sfdc://.../CustomObject__c?mode=truncate

    You can specify the hard_delete=true option with mode=truncate to delete records instead of moving them to the Trashbin. To use this option, the user must have the 'Bulk API Hard Delete' permission.

    sfdc://.../CustomObject__c?mode=truncate&hard_delete=true

    mode=update

    With the update mode, a row is inserted unless it would cause a duplicate value in the external key columns specified in the unique parameter. In such a case, an update is performed instead. The unique parameter is required with this mode and must be defined as an external key when used with the update mode.

    Example:

    sfdc://.../Contact?mode=update&unique=CustomerId__c

    The default behavior for the update mode is actually upsert. If you do not want to upsert but only update, you can add the upsert=false option. Then it updates existing records based on unique parameter match and does not insert new records.

    sfdc://.../Contact?mode=update&unique=CustomerId__c&upsert=false
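    The difference between the default upsert behavior and upsert=false can be illustrated with a toy in-memory model (a sketch of the semantics only, not the connector's actual implementation):

    ```python
    def apply_update_mode(records, rows, unique, upsert=True):
        """Mimic mode=update: `records` maps the unique-key value to a row
        already stored in the object; `rows` are incoming query results.

        With upsert=True (the default), unmatched rows are inserted;
        with upsert=False, unmatched rows are skipped entirely.
        """
        for row in rows:
            key = row[unique]
            if key in records:
                records[key].update(row)       # matched: update in place
            elif upsert:
                records[key] = dict(row)       # unmatched: insert (upsert only)
        return records
    ```

    With upsert=False, a query result whose CustomerId__c does not match any existing record leaves the object unchanged; with the default behavior the row is inserted as a new record.
    
    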

    Upload concurrency_mode option

    The concurrency_mode option controls how the data is uploaded to the Salesforce.com organization. The default mode is parallel; it is the recommended method for most situations.

    concurrency_mode=parallel (default)

    With the parallel method, data is uploaded in parallel. This is the most reliable and effective method and it is recommended for most situations.

    Because concurrency_mode=parallel is the default behavior, these two URLs are equivalent:

    sfdc://.../CustomObject__c

    sfdc://.../CustomObject__c?concurrency_mode=parallel

    concurrency_mode=serial

    Uploading records in parallel is recommended. However, if you see UNABLE_TO_LOCK_ROW in an error message, try concurrency_mode=serial instead.

    sfdc://.../CustomObject__c?concurrency_mode=serial

    Updating a Salesforce.com Object acquires a lock on the Object and on any parent Object referenced by its columns. If you upload objects in parallel and multiple objects reference the same parent object, Salesforce.com cannot acquire the lock for insert/update and returns an UNABLE_TO_LOCK_ROW error. In such cases, specify the concurrency_mode=serial option.

    Authentication session_id option

    If you have a Salesforce.com Session ID, you can authenticate with the session_id option instead of username, password, and security token (that is, username, password, and security token can be omitted from the URL).

    sfdc://login.salesforce.com/Contact?session_id=3deT2aQjYQbIRN0M...jB1tHBb7UW0K!M

    Retry option

    This option sets the number of attempts the Treasure Data export worker makes to write the result to the configured Salesforce.com destination if errors occur. If the export fails more than the set number of retries, the query fails. The default is retry=2, but it can be set to virtually any number. The number of retries affects the overall duration of a query.

    sfdc://.../CustomObject__c?retry=5

    Split Records options

    The Treasure Data result export splits the records in the result of a query into chunks of 10000 records by default and bulk uploads one chunk at a time. The split_records option configures the size of this chunk, if required.

    sfdc://.../CustomObject__c?split_records=100
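    The chunking behavior can be sketched as follows (illustrative Python, not the exporter's actual code):

    ```python
    def split_records(rows, chunk_size=10000):
        """Yield successive chunks of query-result rows, mirroring the
        split_records option (the default chunk size is 10000 records)."""
        for start in range(0, len(rows), chunk_size):
            yield rows[start:start + chunk_size]
    ```

    A smaller chunk size means more, smaller bulk upload batches; the trade-off is more API round trips per export.
    
    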

    Usage

    CLI

    To output the result of a single query to a Salesforce.com organization, add the -r / --result option to the td query command. After the job is finished, the results are written into your Salesforce.com organization Object:

    $ td query -w -d testdb \

    --result 'sfdc://login.salesforce.com/CustomObject__c?session_id=.....' \

    "SELECT code as Code__c, COUNT(1) as Count__c FROM www_access GROUP BY code"

    To create a scheduled query whose output is systematically written to a Salesforce.com organization, add the -r / --result option when creating the schedule through the td sched:create command:

    $ td sched:create hourly_count_example "0 * * * *" -d testdb \

    --result 'sfdc://user%40treasure-data.com:[email protected]/CustomObject__c' \

    "SELECT COUNT(*) as Count__c FROM www_access"

    Using the TD Console

    In the Queries tab, click New Query.

    Select a database and table from the popup menus. Enter your query and select Output results.

    To avoid any issues with result export, define column aliases in your query such that resulting column names from the query match the Salesforce field names for default fields and API names (usually ending with __c) for custom fields.

    To export the result of a query or schedule to Salesforce.com, specify Result Export information for a new query or an existing job/query:

    Troubleshooting

    If the Result Output to SFDC job fails with the following type of error, you can view the details of the failed job on SFDC. As shown in the following example, look up job XXXXXXXXXXX on SFDC to find out the details of the error.

    ```
    17/05/01 03:35:05 INFO sfdc.BulkAPIJob: Job XXXXXXXXXXX finished: Total 1, Completed 0, Failed 1
    17/05/01 03:35:05 INFO sfdc.BulkAPIClient: Batch jobs failed (1/1)
    ```


    View Article
  • Arm Treasure Data can publish user segments into Salesforce Marketing Cloud (ExactTarget), enabling you to send personalized emails to your customers. You can run data-driven email campaigns using your first-party data from Web, Mobile, CRM, and other data sources.

    For sample workflows of Salesforce Marketing Cloud (ExactTarget), view Treasure Boxes.

    Treasure Data provides two methods to write job results to Salesforce Marketing Cloud (SFMC), our SFTP and SFMC plugins.

    This SFMC plugin can be used for small and medium data sets. We recommend you use SFTP for large data sets.

    Synchronous or Asynchronous API

    This data connector supports two kinds of APIs to put records into SFMC: Synchronous APIs and Asynchronous APIs.

    As a best practice, if you are putting a result set with fewer than 100,000 records into SFMC, the Synchronous APIs are a great fit. However, if you are planning to send a data set of a million or more records, consider using the Asynchronous APIs, which provide higher availability and reliability in comparison with the Synchronous APIs.
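    This rule of thumb can be expressed as a trivial helper (illustrative only; the 100,000-record threshold is this article's guidance, not a hard API limit):

    ```python
    def choose_sfmc_api(record_count, threshold=100_000):
        """Pick the API flavor per the guidance above: synchronous for
        result sets under ~100,000 records, asynchronous for larger ones."""
        return "async" if record_count >= threshold else "sync"
    ```
    
    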

    Note: In order to use the Asynchronous APIs, you must ensure that your SFMC account has the Data Extensions Async REST API enabled by Salesforce Marketing Cloud, or create a case asking Salesforce Marketing Cloud Support to enable it. See Create support case in Salesforce Marketing Cloud.

    Prerequisites

    Basic knowledge of Treasure Data

    Basic knowledge of Salesforce Marketing Cloud

    TD account

    This connector also requires that the Data Extension have a primary key that uniquely identifies each record.

    How to create an installed app in Salesforce Marketing Cloud

    Currently we support both the Legacy and Enhanced Packages in Salesforce Marketing Cloud.

    However, after August 1st, 2019, you cannot create a Legacy Package, so we highly recommend creating an Enhanced Package.

    How to create legacy package

    Log on to your Salesforce Marketing Cloud account

    On the Welcome Page, click your name in the top right corner, then click Setup.

    On the left side menu of the new screen, click App > Installed Packages.

    On the Installed Packages screen, click New.

    On the New Package Details pop up, enter the Name and Description. Uncheck (deselect) the checkbox Create with enhanced functionality (recommended), then click Save.

    On the Package Details screen, click Add Component.

    On the Add Component pop up, select API Integration, then click Next.

    On the Add API Integration screen, scroll to the Data section and select the Read and Write checkboxes on Data Extension. You need these permissions in order to write data to Salesforce Marketing Cloud. Then, click Save.

    On the Installed Packages screen, scroll down to the Components panel, then take note of the Client Id and Client Secret. You will use this information to write data from Treasure Data to Salesforce Marketing Cloud.

    How to create enhanced functionality package

    Log on to your Salesforce Marketing Cloud account.

    On the Welcome Page, click your name in the top right corner, then click Setup.

    On the left side menu of the new screen, click App > Installed Packages.

    On the Installed Packages screen, click New.

    On the New Package Details pop up, enter the Name and Description and select the checkbox Create with enhanced functionality (recommended). Then click Save.

    On the Package Details screen, click Add Component.

    On the Add Component pop up, select API Integration, then click Next.

    On the Choose Your Integration Type screen, select Server-to-Server, then click Next.

    On the Set Server-to-Server Properties screen, scroll to the Data section and select the Read and Write checkboxes on Data Extension. You need these permissions in order to write data to Salesforce Marketing Cloud. Then, click Save.

    On the Installed Packages screen, scroll down to the Components panel, and take note of the Client Id, Client Secret, and Authentication Base URI. You will use this information to write data from Treasure Data to Salesforce Marketing Cloud.

    Using Treasure Data console

    Complete the following steps to export TD query results to a Salesforce Marketing Cloud Data Extension:

    Create a new data connection

    A new data connection can be used to export results to an SFMC Data Extension.

    Go to Integrations Hub > Catalog and search for and select Salesforce Marketing Cloud.

    Create Legacy Package integration

    On the New Authentication screen, select Package Type as Legacy Package, then enter the Client Id and Client Secret (which you obtained when you created the legacy package in SFMC). Then click Continue.

    Enter a Name for this integration and click Done.

    Create Enhanced Package integration

    On the New Authentication screen, select Package Type as Enhanced Functionality Package, then enter the Client Id, Client Secret, and Authentication Base URI (which you obtained when you created the enhanced package in SFMC). Optionally, you can specify an Account identifier or MID to access multiple BUs (more detail in How to get MID) and a Scope to limit the token's scope (more detail in API scopes). Then, click Continue.

    Enter a Name for this integration and click Done.

    Configure the query results output connector

    Navigate to Data Workbench > Queries > New Query.

    Enter your query then click the Export Results checkbox.

    On Choose Integration, select the integration you created in the previous step.

    On the Export Results screen, enter a Data Extension Name and select the Using Async API checkbox if you want to use the Asynchronous API; otherwise, uncheck the checkbox to use the Synchronous API. Then, click Done.

    Run the query

    After configuring the output connector, and running a query, the query results are written to the configured data extension.

    Plugin configuration and options

    Available configurations

    client_id: Installed package client key

    client_secret: Installed package client secret

    auth_type: Authentication type (enum: v1 for legacy, v2 for enhanced)

    auth_uri: Authentication URI (required if auth_type=v2)

    de_name: Data Extension name

    async: Whether to use the asynchronous or synchronous API from SFMC

    continue_on_failure: Continue running if there is an insertion error (applies only to the synchronous API)

    Example configurations

    Example configuration for using Synchronous APIs to upsert data extension using Legacy package.

    out:

    type: salesforce_marketing_cloud

    client_id: Th1s1s4n3x4mpl3Cl13nt1ds

    client_secret: Th1s1s4F4k3dCl13ntS3cr3t

    auth_type: v1

    de_name: data_extension_name

    continue_on_failure: false

    Example configuration for using Synchronous APIs to upsert data extension using Enhanced package.

    out:

    type: salesforce_marketing_cloud

    client_id: Th1s1s4n3x4mpl3Cl13nt1ds

    client_secret: Th1s1s4F4k3dCl13ntS3cr3t

    auth_type: v2

    auth_uri: https://th1s1sf4k3d1nst4nc3h0st.auth.marketingcloudapis.com/

    de_name: data_extension_name

    continue_on_failure: false

    Example configuration for using Asynchronous APIs to upsert a data extension using the Legacy package.

    out:

    type: salesforce_marketing_cloud

    client_id: Th1s1s4n3x4mpl3Cl13nt1ds

    client_secret: Th1s1s4F4k3dCl13ntS3cr3t

    auth_type: v1

    de_name: data_extension_name

    async: true

    Example configuration for using Asynchronous APIs to upsert a data extension using the Enhanced package.

    out:

    type: salesforce_marketing_cloud

    client_id: Th1s1s4n3x4mpl3Cl13nt1ds

    client_secret: Th1s1s4F4k3dCl13ntS3cr3t

    auth_type: v2

    auth_uri: https://th1s1sf4k3d1nst4nc3h0st.auth.marketingcloudapis.com/

    de_name: data_extension_name

    async: true

    Using Treasure Data CLI (Command-line interface)

    You can use Treasure Data CLI to run the query and write output result to Salesforce Marketing Cloud.

    How to run a query with the Treasure Data CLI

    Example using Asynchronous APIs with Legacy package

    $ td query -d database_name -r '{"type": "salesforce_marketing_cloud","client_id":"Th1s1s4n3x4mpl3Cl13nt1ds","client_secret":"Th1s1s4F4k3dCl13ntS3cr3t","auth_type":"v1","de_name":"data_extension_name","async":true}' "select primary_key, mytext, myinteger, mydecimal, myphone, myemail, mydate, mylocale, myboolean=1 as myboolean from my_1_million" -T presto

    Example using Asynchronous APIs with Enhanced package

    $ td query -d database_name -r '{"type": "salesforce_marketing_cloud","client_id":"Th1s1s4n3x4mpl3Cl13nt1ds","client_secret":"Th1s1s4F4k3dCl13ntS3cr3t","auth_type":"v2","auth_uri":"https://th1s1sf4k3d1nst4nc3h0st.auth.marketingcloudapis.com/","de_name":"data_extension_name","async":true}' "select primary_key, mytext, myinteger, mydecimal, myphone, myemail, mydate, mylocale, myboolean=1 as myboolean from my_1_million" -T presto

    Example using Synchronous APIs with Legacy package

    $ td query -d database_name -r '{"type": "salesforce_marketing_cloud","client_id":"Th1s1s4n3x4mpl3Cl13nt1ds","client_secret":"Th1s1s4F4k3dCl13ntS3cr3t","auth_type":"v1","de_name":"data_extension_name","async":false,"continue_on_failure":true}' "select primary_key, mytext, myinteger, mydecimal, myphone, myemail, mydate, mylocale, myboolean=1 as myboolean from my_1_million" -T presto

    Example using Synchronous APIs with Enhanced package

    $ td query -d database_name -r '{"type": "salesforce_marketing_cloud","client_id":"Th1s1s4n3x4mpl3Cl13nt1ds","client_secret":"Th1s1s4F4k3dCl13ntS3cr3t","auth_type":"v2","auth_uri":"https://th1s1sf4k3d1nst4nc3h0st.auth.marketingcloudapis.com/","de_name":"data_extension_name","async":false,"continue_on_failure":true}' "select primary_key, mytext, myinteger, mydecimal, myphone, myemail, mydate, mylocale, myboolean=1 as myboolean from my_1_million" -T presto

    How to map between query result and Data Extension column name

    You can use the as keyword in the query to map between the query result and Data Extension columns.

    Example:

    Given the following Data Extension that contains subscriber information:

    name: de_1

    Columns: primary_key, cust_email, cust_first_name, cust_last_name, cust_phone_number

    Given the following table that contains user account information:

    name: account_table

    Columns: id, login_id, first_name, last_name, contact_number

    You can use the following query with the as keyword to map between the table and the Data Extension.

    select id as primary_key, login_id as cust_email, first_name as cust_first_name, last_name as cust_last_name, contact_number as cust_phone_number from account_table;

    View Article
  • This article describes how to use the data connector for Microsoft SQL Server, which allows you to directly import data from your Microsoft SQL Server to Arm Treasure Data.

    For sample workflows on how to import data from your Microsoft SQL Server, view Treasure Boxes.

    Prerequisites

    Basic knowledge of Treasure Data

    Basic knowledge of Microsoft SQL Server

    A Microsoft SQL Server instance running remotely, for example on RDS.

    Use the TD Console to Create Your Connection

    You can use the TD Console to configure your connection.

    Create a New Connection

    When you configure a data connection, you provide authentication to access the integration. In Treasure Data, you configure the authentication and then specify the source information.

    Go to Integrations Hub -> Catalog and search and select Microsoft SQL Server.

    List of Options for Microsoft SQL Server Data Connector

    The following dialog opens.

    Provide the required credentials, including the host name, port, user name and password required to access Microsoft SQL Server.

    Click Continue.

    Name your new Microsoft SQL Server Connection. Click New Source.

    Transfer Your Microsoft SQL Server Data to Treasure Data

    You created your authenticated connection, now you create the source connection.

    Complete the details and click Next.

    If you want to specify the schema, select Use custom SELECT query as follows:

    Preview

    You'll see a preview of your data. To make changes, click Advanced Settings; otherwise, click Next.

    Choose the Target Database and Table

    Choose an existing database and table, or create a new database and table, where you want to transfer data:

    Create new database and give your database a name. Complete similar steps for Create new table.

    Select whether to append records to an existing table or replace your existing table.

    If you want to set a different partition key seed rather than use the default key, you can specify one.

    Scheduling

    In the Schedule tab, you can specify a one-time transfer, or you can schedule an automated recurring transfer. If you select Once now, click Start Transfer. If you select Repeat, specify your schedule options, then click Schedule Transfer.

    Name your Source connector. You can just save the connector or click Create & Run.

    Use Command Line to Create Your Microsoft SQL Server Connection

    You can use the CLI to configure your connection.

    Install the Treasure Data Toolbelt

    Install the newest TD Toolbelt.

    Create Seed Config File (seed.yml)

    Prepare seed.yml with your Microsoft SQL Server access information:

    in:

    type: sqlserver

    host: sqlserver_host_name

    port: 1433

    user: test_user

    password: test_password

    database: test_database

    table: test_table

    out:

    mode: replace

    This example will dump all records inside the table. You can have more detailed control with additional parameters.

    For more details on available out modes, see Appendix

    Guess Fields (Generate load.yml)

    Next, use connector:guess. This command automatically reads the target data and intelligently guesses the data format.

    $ td connector:guess seed.yml -o load.yml

    If you open load.yml, you'll see the guessed file format definitions including, in some cases, file formats, encodings, column names, and types.

    Optional: Preview data to be imported

    You can preview data to be imported using the command td connector:preview.

    $ td connector:preview load.yml

    +---------+--------------+----------------------------------+------------+---------------------------+

    | id:long | name:string | description:string | price:long | created_at:timestamp |

    +---------+--------------+----------------------------------+------------+---------------------------+

    | 1 | "item name1" | "26e3c3625366591bc2ffc6e262976e" | 2419 | "2014-02-16 13:01:06 UTC" |

    | 2 | "item name2" | "3e9dd9474dacb78afd607f9e0a3366" | 1298 | "2014-05-24 13:59:26 UTC" |

    | 3 | "item name3" | "9b6c9e4a140284d3951681e9e047f6" | 9084 | "2014-06-21 00:18:21 UTC" |

    | 4 | "item name4" | "a11faf5e63c1b02a3d4c2b5cbb7331" | 669 | "2014-05-02 03:44:08 UTC" |

    | 6 | "item name6" | "6aa15471c373ddc8a6469e1c918f98" | 3556 | "2014-03-29 08:30:23 UTC" |

    +---------+--------------+----------------------------------+------------+---------------------------+

    Execute Load Job

    Finally, submit the load job. The job may take a couple of hours to run depending on the data size. Users need to specify the database and table where their data is stored.

    It is recommended to specify the --time-column option, since Treasure Data's storage is partitioned by time (see also architecture). If the option is not given, the Data Connector chooses the first long or timestamp column as the partitioning time. The type of the column specified by --time-column must be either long or timestamp.

    If your data doesn't have a time column, you can add one using the add_time filter option. More details are in the add_time filter plugin documentation.

    $ td connector:issue load.yml --database td_sample_db --table td_sample_table

    The above command assumes you have already created the database (td_sample_db) and table (td_sample_table). If the database or the table does not exist in TD, this command will not succeed, so create the database and table manually or use the --auto-create-table option with the td connector:issue command to auto-create them:

    $ td connector:issue load.yml --database td_sample_db --table td_sample_table --time-column created_at --auto-create-table

    You can assign a Time Format column to the partitioning key using the --time-column option.

    You can load records incrementally by specifying columns in your table by utilizing the incremental_columns and last_record options.

    in:

    type: sqlserver

    host: sqlserver_host_name

    port: 1433

    user: test_user

    password: test_password

    database: test_database

    table: test_table

    incremental: true

    incremental_columns: [id, sub_id]

    last_record: [10000, 300]

    out:

    mode: append

    exec: {}

    The plugin then automatically rewrites the query and sorts the values internally.

    # when last_record wasn't given

    SELECT * FROM(

    ...original query is here

    )

    ORDER BY id, sub_id


    # when last_record was given

    SELECT * FROM(

    ...original query is here

    )

    WHERE id > 10000 OR (id = 10000 AND sub_id > 300)

    ORDER BY id, sub_id
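    The generated predicate compares the incremental columns lexicographically. The following Python sketch shows how such a WHERE clause could be built (illustrative only; this is not the plugin's actual code):

    ```python
    def incremental_predicate(columns, last_record):
        """Build the lexicographic WHERE clause described above, e.g. for
        columns [id, sub_id] and last_record [10000, 300]:

            id > 10000 OR (id = 10000 AND sub_id > 300)
        """
        clauses = []
        for i, (col, val) in enumerate(zip(columns, last_record)):
            # All earlier columns are pinned to their last values...
            equals = " AND ".join(
                "%s = %s" % (c, v) for c, v in zip(columns[:i], last_record[:i]))
            # ...and the current column must strictly advance.
            gt = "%s > %s" % (col, val)
            clauses.append("(%s AND %s)" % (equals, gt) if equals else gt)
        return " OR ".join(clauses)
    ```
    
    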

    If you are using scheduled execution, the plugin automatically generates last_record and holds it internally, so it can be used in the next scheduled execution.

    in:

    type: sqlserver

    ...

    out:

    ...

    Config Diff

    ---

    in:

    last_record:

    - 20000

    - 400

    The `query` option is not available when you set `incremental: true`.

    Only strings and integers are supported as incremental_columns.

    Scheduled execution

    You can schedule Data Connector execution for periodic Microsoft SQL Server imports. We carefully configure our scheduler to ensure high availability. By using this feature, you no longer need a cron daemon in your local data center.

    Create the schedule

    A new schedule can be created using the td connector:create command. The name of the schedule, cron-style schedule, the database and table where their data will be stored, and the Data Connector configuration file are required.

    $ td connector:create \

    daily_sqlserver_import \

    "10 0 * * *" \

    td_sample_db \

    td_sample_table \

    load.yml

    The `cron` parameter also accepts three options: `@hourly`, `@daily` and `@monthly`.

    By default, the schedule is set up in the UTC timezone. You can set the schedule in a different timezone using the -t or --timezone option. Note that the `--timezone` option only supports extended timezone formats like 'Asia/Tokyo', 'America/Los_Angeles', etc. Timezone abbreviations like PST and CST are *not* supported and may lead to unexpected schedules.

    List the Schedules

    You can see the list of scheduled entries by td connector:list.

    $ td connector:list

+------------------------+------------+----------+-------+--------------+-----------------+-----------------------------+
| Name                   | Cron       | Timezone | Delay | Database     | Table           | Config                      |
+------------------------+------------+----------+-------+--------------+-----------------+-----------------------------+
| daily_sqlserver_import | 10 0 * * * | UTC      | 0     | td_sample_db | td_sample_table | {"type"=>"sqlserver", ... } |
+------------------------+------------+----------+-------+--------------+-----------------+-----------------------------+

    Show the Setting and History of Schedules

    td connector:show shows the execution setting of a schedule entry.

% td connector:show daily_sqlserver_import
Name     : daily_sqlserver_import
Cron     : 10 0 * * *
Timezone : UTC
Delay    : 0
Database : td_sample_db
Table    : td_sample_table

    td connector:history shows the execution history of a schedule entry. To investigate the results of each individual execution, please use td job <jobid>.

% td connector:history daily_sqlserver_import
+--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+
| JobID  | Status  | Records | Database     | Table           | Priority | Started                   | Duration |
+--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+
| 578066 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-18 00:10:05 +0000 | 160      |
| 577968 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-17 00:10:07 +0000 | 161      |
| 577914 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-16 00:10:03 +0000 | 152      |
| 577872 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-15 00:10:04 +0000 | 163      |
| 577810 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-14 00:10:04 +0000 | 164      |
| 577766 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-13 00:10:04 +0000 | 155      |
| 577710 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-12 00:10:05 +0000 | 156      |
| 577610 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-11 00:10:04 +0000 | 157      |
+--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+
8 rows in set

    Delete the Schedule

    td connector:delete will remove the schedule.

    $ td connector:delete daily_sqlserver_import

    Appendix

    Modes for out plugin

You can specify the import mode in the out section of load.yml.

    append (default)

    This is the default mode and records are appended to the target table.

in:
  ...
out:
  mode: append

    replace (In td 0.11.10 and later)

    This mode replaces data in the target table. Please note that any manual schema changes made to the target table will remain intact with this mode.

in:
  ...
out:
  mode: replace

    Further Information

    View Article
  • This is a summary of new Arm Treasure Data features and improvements introduced in the December 1st, 2019 release. If you have any product feature requests, submit them to feedback.treasuredata.com.


    Treasure Data Python Interface to REST APIs, Query Engine

Arm Treasure Data offers pytd v1.0.0, a Python interface that makes it easier to consume Treasure Data's REST APIs, Presto query engine, and Plazma primary storage.

    For ingestion, pytd supports Bulk Import APIs to convert data into a CSV/msgpack file and enable upload in batch. For small volumes of data, pytd supports the Presto INSERT INTO query enabling you to insert single rows into your table.

You can also use pytd with td-spark, which uses a local customized Spark instance to write DataFrames directly to TD's primary storage system. pytd's seamless connections enable your Python code to efficiently read and write large volumes of data from and to Treasure Data.

    Comparing pytd to td-client-python and pandas-td

    pytd is the new recommended SDK, although we will continue to support the existing td-client-python. The td-client-python package is a basic minimal SDK, whereas pytd provides more user-friendly interfaces and gives access to Plazma API.

    Use pytd with custom scripts in TD Workflows to make day-to-day data analytic work more productive. The pytd package offers compatible functions in a more efficient manner than does the Python package pandas-td. We recommend that you switch to pytd if you are currently using pandas-td.

    To learn more about pytd, refer to the following documents.

    OSS documentation: https://pytd-doc.readthedocs.io/en/latest/#

    pandas-td compatible guide: https://support.treasuredata.com/hc/en-us/sections/360003775873-Python-Library-Reference-for-pytd-pandas-td-
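A minimal sketch of what querying through pytd looks like. This assumes `pip install pytd` and a valid TD API key; the endpoint, database, and table values below are placeholder examples, not required settings:

```python
# Sketch of querying Treasure Data via pytd (assumes pytd is installed and
# apikey is a valid TD API key; database/table names are examples).

def count_rows(apikey, database="sample_datasets", table="www_access"):
    import pytd  # imported lazily so this sketch is importable without pytd

    client = pytd.Client(
        apikey=apikey,
        endpoint="https://api.treasuredata.com/",
        database=database,
        default_engine="presto",
    )
    res = client.query(f"SELECT COUNT(1) AS cnt FROM {table}")
    # res is a dict like {"columns": ["cnt"], "data": [[<count>]]}
    return res["data"][0][0]
```

For bulk ingestion, pytd also provides client.load_table_from_dataframe(df, "table_name", writer="bulk_import", if_exists="overwrite"); check the OSS documentation linked above for the current parameters.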

    Identity Federation (Beta)

    The Treasure Data Identity Federation feature works with your Identity Provider (IdP) and enables your TD account users to use one ID to log into your Treasure Data accounts, even if the user is assigned multiple TD accounts.

    You configure your IdP to authenticate your Treasure Data users and thereby control the login policy for your users through the IdP.

A key benefit of Identity Federation is heightened security and tighter authentication for both on-premise and cloud applications. You can centrally manage all users and their respective permissions through your corporate directory service.

    The initial implementation of Identity Federation supports Azure Active Directory using SAML 2.0 protocol.

    If you are interested in participating in the Identity Federation beta, contact your Customer Success Representative.

    User Experience Improvements

    Arm Treasure Data continues to improve the ease of use in TD Console.

    Changes to Segment Timestamp Operators

Segmentation Timestamp Operators contribute to the rules that define a segment. These timestamp operators are now easier to understand and configure.


    As part of this upgrade, some operators were kept and changed, and other operators were removed. See Segment Timestamp Operators for a summary of the segment timestamp operators. The article includes a table that reflects the changes.

    Import Integrations

    Arm Treasure Data offers some new features that improve your experience when creating and editing data transfers for sources.

    Choose to Save or Save and Run Your Configured Data Connector

    You can choose to configure your data connector, save it and run the import immediately, or to just save the configured data connector to run at a later time.

    Configure Storage with More Assurance

    For every new data transfer, you can manually select a database and table, instead of a default database and table being selected for you. This is particularly useful when you already have a similar, existing data transfer and want the specified location for the imported data to be different.

As you define a database and table, you can refer to the help text.

Also, a new user interface is coming soon.

    Coming Soon: Google Sheet Import

You can now use the Google Sheet data connector to import Google Sheet content into your database. You can bring in data from a Google Spreadsheet file without manually converting the data to CSV before import, and you can retain the format of the original spreadsheet.

    Contact your Treasure Data Customer Success representative for more information and enablement.

    Data Connector: Magento Import (Beta)

    To aid in building Customer Journeys from your Magento eCommerce sites, the Magento data connector imports data directly from Magento. The data connector also helps to ingest from the APIs of Magento Commerce modules, third-party modules, and extension attributes that are installed on your system.

    If you are interested in participating in the beta, contact your Treasure Data Customer Success representative.

    Data Connector: Google Sheet Export Improvement

A new parameter has been added to the Google Sheet Export data connector. You can now specify a destination folder key when using a spreadsheet name to export data. You must re-authenticate to access Google Drive metadata in order to use this new parameter.

    If multiple spreadsheets with the same name exist in the destination folder, your jobs fail. You can use the spreadsheet key to specify the correct destination spreadsheet to avoid job failure.

    Action Required: Google Sheet Re-Authentication

The current version of the Google Sheets V3 connector is scheduled to be deprecated in early March 2020. You must re-authenticate all your existing Google Sheets authentications before mid-February 2020, when the TD V3 connector is deprecated. Failure to do so results in a disruption of your Google Sheets jobs. For more information, refer to the Appendix in Google Sheets - Export.

    View Article
  • This is a summary of new Arm Treasure Data features and improvements introduced in the November 1st, 2019 release. If you have any product feature requests, submit them to feedback.treasuredata.com.


    Identity Federation (BETA - Coming Soon)

    The Treasure Data Identity Federation feature works with your Identity Provider and enables your TD account users to use one ID to log into your Treasure Data accounts, even if the user is assigned multiple accounts (such as development account, system test account and production account).

    You configure your Identity Provider (IdP) to authenticate your Treasure Data users and thereby control the login policy for your users through the IdP.

    The initial implementation of Identity Federation supports Azure Active Directory using SAML 2.0 protocol.

    If you are interested in participating in the Identity Federation beta, contact your Customer Success Representative.

    Sunsetting HTTP Support

The HTTP endpoints in both the Tokyo and US regions are being deprecated for security reasons. All customers who are currently accessing a TD HTTP endpoint will be directly notified about the deprecation.

    If you are using any of the following addresses to access Treasure Data, you must migrate to HTTPS:

US region:
http://api.treasuredata.com
http://api.treasure-data.com
http://api-import.treasuredata.com
http://api-workflow.treasuredata.com

Tokyo region:
http://api.treasuredata.co.jp
http://api-import.treasuredata.co.jp
http://api-workflow.treasuredata.co.jp

Server-Side First Party Cookies - Usage Allowances

    The Server-Side First Party Cookies feature was released for general availability last month. For further information on Server-Side First Party Cookies feature, refer to the technical documentation.

    This feature is automatically included for current CDP customers. By using this feature, customers agree to the allowance and usage limitations applicable to their pricing tier. Customers should contact their customer success representative for information on usage allowance and limitations.

    Data Connector: LiveRamp Export (BETA)

You are invited to participate in our LiveRamp Export beta. Use Treasure Data with LiveRamp both to anonymize your user segment data and to increase the reach of your advertising campaigns with more accurate targeting.

    Contact your Treasure Data Customer Success representative for more information and enablement.

    Data Connector: The Trade Desk Import

    In addition to the output connector we offer to push segments to The Trade Desk platform, you can now use the input connector to ingest The Trade Desk Advertiser, Campaign, Data Group, and Tracking Tags. With both ingest and export capabilities, you can achieve an end-to-end view of The Trade Desk platform architecture.

    Data Connector: Google Adwords Import Improvement

    In addition to existing reports provided by Google Ads (formerly known as Google Adwords) connector, you can now import data from Keyword Performance Report. The new Include Negative Keywords parameter allows you to specify whether negative keywords are included in the import result.

In addition, we have updated the Google Adwords logo to the new Google Ads logo in line with Google's rebranding, and removed the Average Position metric from existing reports. The Average Position metric was deprecated September 30, 2019.

    Data Connector: Shopify Import Improvement

You can now import an additional data target: Inventory (Level, Items, Location). This new data type supports making item recommendations or email marketing based on stock quantity.

    View Article
Complete the following steps to migrate from the legacy Salesforce data connector to the new Salesforce V2 connector. The legacy data connector uses only the REST API to import data. The new Salesforce data connector enables you to use both Bulk import and the REST API.

    Create a New Salesforce v2 Connector

    Go to Treasure Data Catalog, then search and select Salesforce v2.

    https://login.salesforce.com/

In the dialog box, enter the values that you used in your legacy Salesforce connector.

The Salesforce v2 connector requires that you remove unnecessary characters from the Login URL parameter. For example, instead of https://login.salesforce.com/?locale=jp, use https://login.salesforce.com/.

    Enter your username (your email) and password, as well as your Client ID, Client Secret and Security Token.

    Save Settings and Run the Legacy Salesforce Data Connector One Last Time

    You can save your legacy setting from TD Console or from the CLI.

    In Console UI

    Save the settings of your scheduled legacy Salesforce connector and run a final import

    Go to Integration Hub > Sources. Search for your scheduled Salesforce source, select and click Edit.

    In the dialog box, copy the settings to use later:

    Also copy any advanced settings:

    Next, you configure one final run with the legacy data connector to create a temporary table against which you can run a config-diff. You use the diff to identify and confirm the latest data imported into Treasure Data.

    Before running the final import with the legacy connector, make sure that you change the schedule to one run only:

After the job is complete, look at the config_diff in the job query information and copy it somewhere to use later.

    Create new Salesforce v2 source

    Go to Integration Hub > Authentication. Search for new Salesforce v2 connection that you created:

Click New Source. Fill in all basic settings and advanced settings that you copied in the preceding steps. Then, if you want the new source to continue ingesting from the point where the legacy connector left off, fill in the Last Record field with the config_diff information that you copied from the previous job.

    After completing the settings, choose the database and table job to populate data into, then schedule the job and provide a name for your new data connector. Click Save and then run the new data connector.

    In CLI and Workflow

Update in: type in your yml configuration from sfdc to sfdc_v2.

    For example, your existing workflow configuration might look something like this:

in:
  type: sfdc
  username: ${secret:sfdc.username}
  password: ${secret:sfdc.password}
  client_id: ${secret:sfdc.client_id}
  client_secret: ${secret:sfdc.client_secret}
  security_token: ${secret:sfdc.security_token}
  login_url: ${secret:sfdc.login_url}
  target: Lead
out: {}
exec: {}
filters: []

    Your new workflow configuration would look like this:

in:
  type: sfdc_v2
  username: ${secret:sfdc.username}
  password: ${secret:sfdc.password}
  client_id: ${secret:sfdc.client_id}
  client_secret: ${secret:sfdc.client_secret}
  security_token: ${secret:sfdc.security_token}
  login_url: ${secret:sfdc.login_url}
  target: Lead
out: {}
exec: {}
filters: []

    For result output

The SFDC connection is shared between the data connector and result output. Nothing changes in result output itself, but if you use it, you should upgrade it as well.

    In Console UI

    Save the settings of Legacy export connector

Go to the Treasure Data console. Go to the Query Editor. Open the query that uses sfdc for its connection.

Click the sfdc connector, then copy and save the details of the existing connection to use later.

Click delete to remove the legacy connection.

    Modify the existing Query (to replace the Legacy connection)

In the query, click Output results. Next, set up the SFDC v2 connector by finding and selecting the SFDC v2 export connector that you created.

In the Configuration pane, specify the fields you saved in the previous step, then click Done.

Check Output results to... to verify that you are using the created output connection. Click Save.

    It is strongly recommended to create a test target and use it for the first data export to verify that exported data looks as expected and the new export does not corrupt existing data. In your test case, choose an alternate Object for your test target.

    In CLI:

The result type protocol needs to be updated from sfdc to sfdc_v2, for instance from:

    sfdc://<username>:<password><security_token>@<hostname>/<object_name>

    to:

    sfdc_v2://<username>:<password><security_token>@<hostname>/<object_name>

    In workflow:

If you have a workflow that used the SFDC connection, you can keep your result settings the same, but you need to update result_connection to the new connection_name that you created above.

    An example of old workflow result output settings is as follows:

+td-result-output-sfdc:
  td>: queries/sample.sql
  database: sample_datasets
  result_connection: your_old_connection_name
  result_settings:
    object: object_name
    mode: append
    concurrency_mode: parallel
    retry: 2
    split_records: 10000

    An example of new workflow result output settings is as follows:

+td-result-output-sfdc:
  td>: queries/sample.sql
  database: sample_datasets
  result_connection: your_new_connection_name
  result_settings:
    object: object_name
    mode: append
    concurrency_mode: parallel
    retry: 2
    split_records: 10000

    View Article
  • The Presto query engine has some known limitations.

    Column Name Escaping

    When a column name matches the name of a reserved keyword, the name of the column needs to be quoted.

    In Hive, the quotation character is the back tick character `.

    SELECT `join` FROM mytbl

In Presto, quoting of a column name is accomplished with the double quote character instead.

    SELECT "join" FROM mytbl

    Quoting a column name in single quotes will make the query parser interpret the content as a simple string that might produce an unexpected result. For example:

SELECT "join" FROM mytbl LIMIT 10

result:
value1
value2
value3
value4
...
value10

SELECT 'join' FROM mytbl

result:
join
join
join
join
...
join

    48 Hour Limit for Presto Queries

Presto queries that run for more than 48 hours are automatically cancelled. Typically, Presto queries that run for more than a day or so end up failing anyway.

    For long running queries, you can rewrite the Presto query as a Hive query.

    Table Schemas are Required

When querying using Presto, you must set a schema for the tables. The default v column map syntax doesn't work.

    JOIN Order

Presto does not support cost-based JOIN optimizations, meaning JOINs are not automatically reordered based on table size. Make sure that smaller tables are on the right-hand side of the JOIN, and that they fit in memory. Otherwise, out of memory exceptions cause the queries to fail.

SELECT
  ...
FROM
  large_table
JOIN
  small_table

    INSERT OVERWRITE Statements are NOT Supported

    Presto does not support INSERT OVERWRITE statements. Make sure that you delete the table before using INSERT INTO. See Presto Query FAQs.
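The delete-then-insert pattern looks like the following sketch (the table name is hypothetical; Presto supports DROP TABLE IF EXISTS and CREATE TABLE AS):

```sql
-- Presto has no INSERT OVERWRITE: drop and recreate the table instead
DROP TABLE IF EXISTS method_counts;

CREATE TABLE method_counts AS
SELECT method, COUNT(1) AS cnt
FROM www_access
GROUP BY method;
```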

    Error Message NOT_SUPPORTED: NULL values are not allowed on the probe side of SemiJoin operator

This error is caused by NULL values appearing on the probe side of an IN subquery, which raises an issue with result correctness.

    The IN predicate determines if any values produced by the subquery are equal to the provided expression. The result of IN follows the standard rules for nulls. The subquery must produce exactly one column.
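The null semantics behind this error can be sketched with SQL's three-valued logic. The Python model below is an illustration only, not Presto code:

```python
# Model of SQL's three-valued IN: returns True, False, or None (SQL NULL).
def sql_in(value, subquery_values):
    if not subquery_values:
        return False          # x IN (empty set) is FALSE, even for NULL x
    if value is None:
        return None           # NULL compared with anything is unknown
    if value in subquery_values:
        return True
    if None in subquery_values:
        return None           # a NULL in the list might have matched
    return False

print(sql_in(None, [0]))      # → None: this "unknown" is what SemiJoin rejects
```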

    For example, the user column contains a NULL value, which might cause the following query to fail.

select *
from www_access
where
  user in (select 0 as user_id)

=> Query 20170526_042346_18708_6huz3 failed: NULL values are not allowed on the probe side of SemiJoin operator. See the query plan for details.

As a workaround, add an IS NOT NULL clause.

select *
from www_access
where
  user in (select 0 as user_id)
  and user is not null

    Error Message - Killed by the system because this query generate output for more than 200GB.

This error is caused by trying to generate more than 200GB of output with a single query.

As a workaround, use the result_output_redirect option or CREATE TABLE AS.

result_output_redirect solution: [Experimental] Use result_output_redirect

CREATE TABLE AS solution: Use CREATE TABLE (table) AS SELECT

Error Message - Accessing too many(N) tables. Maximum 300 tables allowed

This error is caused by trying to scan more than 300 tables with a single query.

As a workaround, tune the SQL to reduce the number of scanned tables, or divide the single query into multiple jobs.

    Presto Performance Tuning

Error Message - Creating partitions(N) exceeds the limit(500000)

This error is caused by trying to commit more than 500,000 (500K) partitions with a single query.

As a workaround, use User Defined Partitioning (UDP) or tune the parameters for UDP.

    Use User Defined Partitioning User Defined Partitioning for Presto

Tune parameters for UDP. For example:

Reduce bucket_count to 256 (default is 512).

Update max_time_range to cover a broader time range.

Increase max_file_size to 512MB (default is 256MB).

    View Article
  • When exporting data to an external source, an external source might require an additional column or marketers might want to create one that allows them to track lists back to specific master segments, segments, campaigns or dates and times.

    When creating a new activation, Treasure Data allows additional columns to be created and exported if needed using String Builder.

    To Create a New Column for Export

    Navigate to Audience Studio > Segments

    Select the segment you want to activate.

    Select Activations on the right-side panel.

    Select Create Activation or Edit Activation from the more options menu.

    Select Output Mapping in the Activations editor.

    Select + Add String under String Builder.

    Edit the Output Column Name field. Select from the following value options:

String -- Can be any value you choose. Use free text to create a custom value.
Timestamp -- Date and time of export
Segment ID -- Segment ID number
Segment Name -- Segment Name
Audience ID -- Master Segment number

    To add an additional value to the string, select + Add String. You may string multiple values together.

    After completing the String Builder, select Create & Run Now.

    View Article
  • Arm Treasure Data uses the same convention as Relational Database Management Systems (RDBMSs) for managing data sets:

    Databases

    Tables

    Schema

Unlike traditional warehousing platforms, Treasure Data allows users to store-first, schema-later. Your schema can be changed at any time, at no cost.

    Conventional warehousing platforms are schema dependent, supporting an assumptive analytics model. In this model, data elements forecasted to yield insights are defined in advance, with the structure of the data store schema.

    Performance considerations are also important in initial design and the analyst must have knowledge of the underlying structure to ensure query performance. When new columns are added to the table, the schema must change.

Big Data analysis, however, is largely non-assumptive. The analyst seeks hidden patterns, relationships, or events in the data that were not obvious from the outset. You are able to query the data wherever it is stored and without the burden of performance considerations, and exploration can create requirements for new records to support the analysis trail.

In this model, schema dependence adds a significant tax that can become prohibitive.

    The TD Approach: Store-First, Schema-Later

    Hive

    Unlike traditional warehousing platforms, TD users can assign schema even after importing data to a table. This means that you can add or remove fields at any time.

    This system is much more flexible, and schema changes no longer take days of work.

    Prerequisites

    Basic knowledge of Treasure Data, including the TD toolbelt.

    A table with some data. See Running a query and downloading results

    Understanding the TD Default Schema

    When a table is created in a TD database, it has two fields:

    time: The time that each entry was generated, in int64 UNIX time

    v: Each log entry, stored as map<string, string>

    When you look up the value of a database entry, address the information using the format: v[field1].
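For example, a Hive query against the default schema addresses fields through the v map (the field name 'path' here is a hypothetical log field):

```sql
-- Hive query against the default schema: fields live in the v map
SELECT v['path'] AS path, COUNT(1) AS cnt
FROM www_access
GROUP BY v['path']
```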

    Defining a custom schema is strongly recommended.

    Defining a Custom Schema

    Typically, the default schema defined by TD is acceptable. But a custom schema can make queries shorter, and greatly improve performance.

    To define a custom schema for a table:

    Optionally, create testdb using the following command:

    td db testdb

Optionally, create the www_access table using the following command:

    td table:create testdb www_access

Use the td schema:set command. The syntax of the command is:

$ td schema:set <database> <table> <column_name>:<type>...

where <column_name> consists of lowercase letters, numbers, and "_" only, and <type> can be one of the following: int, long, double, float, string, array<T>.

    For this example, the schema would be added as follows:

    td schema:set testdb www_access action:string labels:'array<string>' user:int

    You can now query this table with defined column names.

td query -w -d testdb "SELECT user, COUNT(1) AS cnt
  FROM www_access
  WHERE action='login'
  GROUP BY user ORDER BY cnt DESC"

See the following articles for more query syntax:

    Hive Query Language

    Presto Query Language

    Treasure Data, Presto and Hive Schema Relation

Treasure Data             | Presto    | Hive
--------------------------+-----------+-------------------
int                       | bigint    | smallint
int                       | bigint    | int
long                      | bigint    | bigint
double                    | double    | decimal
float                     | double    | float
double                    | double    | double
Convert to string or int  | boolean   | boolean
string                    | varchar   | string or varchar
string or Convert to long | date      | string
string or Convert to long | timestamp | timestamp

    You can refer to the open source documentation as well:

    Presto

    View Article
  • You can export job results from Arm Treasure Data directly to Google Sheets. In Google Sheets, you can then further analyze your data and uncover key business insights.

    You can transfer up to 2 million cells for spreadsheets that you created in or converted to Google Sheets. Read more from Google Help.

    Prerequisites

    Basic knowledge of Treasure Data, including the TD Toolbelt

    A Google account (for Google Drive)

    Authorized Treasure Data account access

    Use TD Console to Create Your Connection

    You can use the Treasure Data console to configure your connection.

    Create a New Connection

You must create and configure the data connection to be used during export prior to running your query. As part of the data connection, you provide authentication to access the integration.

    Go to Catalog and search and select Google Sheets.

    https://myaccount.google.com/permissions

    The following dialog opens:

    Account Authentication

    Access to Treasure Data Google Sheets integration requires OAuth2 authentication. The authentication requires that users manually connect their Treasure Data account to their respective Google account.

    To authenticate, complete the following procedure:

    Select an existing OAuth connection for Google, or click the link under "OAuth connection" to create a new connection

Log into your Google Sheets account in the popup window and grant access to the Treasure Data app.

You are redirected back to the integration dialog pane. Repeat the first step (create a new connection) and choose your new OAuth connection.

    Name your new Google Sheets connection.

    Configure Export Results in Your Data Connection

Go to the Treasure Data console.

    Go to Data Workbench > Queries. You can pick the query language and write your query.

    Select your database and table.

    Create or pick the query that you plan to use to export data.

Next, select the Export Results checkbox

    and select your Google Sheet connection:

After you select your Google Sheets connection, the dialog pane appears:

    Parameters:

You must specify either the spreadsheet_id OR the spreadsheet_title. You cannot use both.

Use Spreadsheet Key: If unchecked, the spreadsheet title is used as the key; otherwise, the spreadsheet ID is used.

Spreadsheet name: The title of the spreadsheet. A new spreadsheet with the title you specify is created if the spreadsheet does not exist. If multiple spreadsheets with the same title (case-insensitive) exist in the destination folder (given by Folder Key) or in any folder (if Folder Key is empty), the job fails. In this case, specify the spreadsheet key instead.

Folder key: Folder ID of the data to be exported. Used only when using Spreadsheet name. If empty, the destination folder is determined by matching the spreadsheet name. If there is no spreadsheet match, the My Drive root folder is used. Important: In order to use this Folder Key option, you must update authentications that were created before this parameter was added to this data connector. Otherwise, your export jobs fail. See Appendix 4.3.

Worksheet: The title of the sheet in the spreadsheet. A new worksheet with the title you specify is created if the sheet does not exist. The name entered is compared "as is" (including spaces) with the worksheet name in Google. If empty, data is exported to the first (leftmost) worksheet.

    Upload mode: The mode to modify data in spreadsheet. See Section 4 for details.

    Range:The initial cell position that the data is written to.

Batch rows to upload: The number of rows that are uploaded to a spreadsheet per call. The larger the batch value, the larger the payload uploaded per call, and the fewer calls are needed to complete the upload of all data.

Value Input Option: Specify either RAW to upload data to the spreadsheet directly, or USER_ENTERED to have the data parsed as if the user typed it into the UI. Numbers remain numbers, but strings may be converted to numbers, dates, etc., following the same rules that are applied when entering text into a cell via the Google Sheets UI.

    Set Cell to Null: Control the display value of the cell when an invalid value (such as dividing by 0) is encountered
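The Batch rows to upload tradeoff above can be made concrete: the number of API calls is the row count divided by the batch size, rounded up. This is a back-of-the-envelope illustration, not connector code:

```python
import math

# Number of upload calls needed for a given "Batch rows to upload" setting.
def upload_calls(total_rows, batch_rows):
    return math.ceil(total_rows / batch_rows)

print(upload_calls(120_000, 50_000))  # → 3 calls, versus 120 calls with a batch of 1000
```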

    Ways to Modify Data

    You can manipulate spreadsheet data in the following ways: Replace, Append, Truncate, Update.

    Replace

    This is the default mode. If the table already exists, the rows of the existing table are replaced with the query results. If the table does not exist yet, a new table is created.

    Append

    The query results are appended to the table. If the table does not exist yet, a new table is created.

    The result of the query is appended at the end of the spreadsheet. If, for example, you have the default sheet with 1000 empty rows, APPEND mode starts adding the rows in the query result at the row after the last one visible in the spreadsheet (i.e. 1001st row). Refer to the following example:

    Truncate

If the table already exists, the existing rows are cleared and the query results are written into the table. If the table does not exist yet, a new table is created.

    Before executing truncate mode:

After executing truncate mode with query results that contain an id column:

    Update

If the table already exists, the existing rows are updated with the query results. If the table does not exist yet, a new table is created.

    Before executing update mode:

After executing update mode, starting from the range A1, the respective columns are updated.

    Optional: Use of Scheduled Jobs for Output

You can use Scheduled Jobs with Result Output to periodically write the output result to a target destination that you specify.

    Optional: Configure Export Results in Workflow

    Within Treasure Workflow, you can specify the use of this data connector to output data.

    Available parameters are as follows:

    spreadsheet_id: String. Spreadsheet key. Required*.

    spreadsheet_title: String. Spreadsheet name. Required*.

    spreadsheet_folder: String. Default is null. Folder id.

sheet_title: String. Default is null. Worksheet name.

    mode: String(append|replace|truncate|update). Default is append.

range: String. Default is A1, which is the top left corner of a sheet. For APPEND mode, range has no effect because new rows are appended after the last row.

rows_threshold: Integer. Default is 50000; maximum is 2000000. The Google API also has a threshold of 10 MB for the request payload. This data connector automatically detects which threshold is reached first.

value_input_option: String(raw|user_entered). Default is raw.

set_nil_for_double_nan: Boolean. Default is true (for example, turns NaN into an empty string).

    *Choose either spreadsheet_id OR spreadsheet_title. The two parameters are mutually exclusive.

timezone: UTC

_export:
  td:
    database: sample_datasets

+td-result-into-target:
  td>: queries/sample.sql
  result_connection: my_googlesheet_connector
  result_settings:
    sheet_title: value1
    mode: replace
    ....

    Click here for more information on using data connectors in workflow to export data.

    Appendix

    Find the Spreadsheet Key

    You can obtain the Spreadsheet key from the access URL, as follows:

    Find the Folder Key

    You can obtain the Google Drive Folder key from the access URL, as follows:

    Required: Migrate your Existing Authentications

Due to the Google Sheets V3 deprecation scheduled for early March 2020, all your existing authentications must be updated. You must perform the update on your existing authentications before mid-February 2020.

    Until March 2020, your existing authentications, without the updates, continue to work properly when the Folder Key option is left blank.

    The upgrade requires that your existing Google Sheet Authentications apply a new Folder Key option. To use the new Folder Key option, you must select the permission View metadata for files in your Google Drive in your Google account. If you do not update your existing Authentications with the new Folder Key option, your data connector no longer works as expected and your export jobs fail.

    Complete the update as follows:

In TD Console, go to Integrations Hub > Authentications. Locate your Google Sheet Authentication object.

Click Click here to connect a new account to redirect to Authentication in Google.

    In Google, select the appropriate account and with View metadata for files in your Google Drive scope included, click Allow.

    Go back to the TD Console, open your Google Sheet Authentication again, and select the latest authentication account at the bottom of the dropdown list.

    Click Continue and Done to save the Authentication.

    Repeat steps 4 & 5 with the other authentications that use the same Google account.

To check whether your authentication includes the permission, log into your Google account in the browser and access the URL: . Check for the Arm Treasure Data permissions on Google Drive as follows:

  • You can import Google Ads reports into Arm Treasure Data using the Google Ad data connector.

    You use this same connector to export Google Ads Remarketing data. See Writing Job Results into Google Ads Remarketing Lists.

    Google has rebranded Google AdWords to Google Ads.

    Prerequisites

    Basic knowledge of Treasure Data

    Basic knowledge of Google Ads

    Use TD Console

    Create a new connection

    Go to Integrations Hub > Catalog and search and select Google Ads input. The following dialog opens.

    Keywords Performance Report

    Click to connect a new account. Log into your Google Ads account from the new window and grant Treasure Data access to your Ads campaigns:

    You will be redirected back to Treasure Data Catalog. Repeat the step to connect to a new account to choose your new OAuth connection.

    Click Continue and give your connection a name:

    Then click Done.

    Create a new source

    After creating the connection, you are automatically taken to Sources tab. Look for the connection you created and click New Source.

    The following dialog opens. Edit the details and click Next.

    In the field Ad Account enter the Customer ID in your Google Ads console UI.

    The parameters that you specify next are based on the Google Ad report type.

    Report Type

You can select from the following reports: Ad Performance, AdGroup Performance, Campaign Performance, Audience Performance, and Keywords Performance. See Appendix B for more information.

    Attributes / Segments / Metrics

Depending on the report type picked, only the attributes, segments, and metrics that are available are shown when you click Add.

    Some predefined attributes are automatically added, corresponding to the report type. See Appendix C for more information.

    If you check Use predefined metrics, a list of predefined metrics is automatically added. See Appendix D for more information. You can select additional metrics from the Metrics dropdown.

    Incremental Loading

    By enabling incremental loading, you can schedule the data loading during a time period that you specify. Note that data is fetched only when the full period of time (automatically computed based on the Date Range value) is complete.

    Date Range

Specify the date range of the report being generated. You must enter a start date and an end date if you select CUSTOM_DATE. The scheduling period mentioned in Incremental Loading is the number of days between the Start Date and End Date. See Appendix E for more information. Click Next.
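The Report Type, Date Range, and Incremental Loading choices above correspond to keys in the CLI configuration file shown later in this article. A minimal illustrative sketch (values are placeholders):

```yaml
in:
  type: google_adwords
  target: keywords_performance_report   # report type; see Appendix B
  date_range: last_7_days               # see Appendix E; custom_date also needs start_date/end_date
  incremental: true                     # fetch data only once the full period is complete
```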

    Preview

You see a preview of your data similar to the following image. The preview data doesn't reflect the actual data from your Google Ads account. It's dummy data that shows all the fields you picked, as well as the data types of those fields.

    If you wish to change anything, click Back or else click Next.

    Advanced Settings

Advanced Settings allow you to customize the guessed properties. Edit the following section, if needed.

    Choose the target database and table

    Choose an existing or create a new database and table.

    Create a new database and give your database a name. Complete similar steps for Create new table.

    Select whether to append records to an existing table or replace your existing table.

    If you want to set a different partition key seed rather than use the default key, you can specify one using the popup menu.

    Scheduling

    In the Schedule tab, you can specify a one-time transfer, or you can schedule an automated recurring transfer. If you select Once now, click Next. If you select Repeat specify your schedule options, then click Next.

    After your transfer has run, you can see the results of your transfer in the Databases tab.

    Use Command Line

Install TD Toolbelt v0.11.9 or later

    You can install the newest Treasure Data Toolbelt.

    $ td --version

    0.15.0

    Create Configuration File

Prepare a configuration file (for example, load.yml) with your Google Ads account access information, similar to the following example.

    in:

    type: google_adwords

    client_id: "<app client id>" (required, string)

    client_secret: "<app client secret>" (required, string)

    refresh_token: "<your oauth2 refresh token>" (required, string)

    developer_token: "<Ads developer token>" (required, string)

    target: audience_performance_report (required, string)

    client_customer_id: "<your Google Ads account number>" (required, string)

    date_range: custom_date (required, enum)

    start_date: 2018-04-01 (required if `custom_date` date range, date)

end_date: 2018-04-07 (required if `custom_date` date range, date)

    include_zero_impressions: true (optional, boolean, default is `true`)

incremental: true (optional, boolean, default is `false`)

include_negative_keywords: true (optional, boolean, default is `false`)

columns:

    - AdGroupName

    - CampaignStatus

    - Date

    - Clicks

    - ...

    out:

    mode: replace

This example dumps the Audience Performance report from Google Ads:

    client_id: OAuth application client id.

    client_secret: OAuth application client secret

    refresh_token: Refresh token, result of OAuth authentication flow

    target: Google Ads report type to be imported.

See Appendix B for the list of available targets.

    client_customer_id: Google Ads account number

    Google Ads Account number is the Customer ID in your Google Ads console UI

    date_range: Available date range from Ads

    See Appendix E for the list of available date_range.

start_date: The date (yyyy-MM-dd) from which report data is generated. This field is required if custom_date is selected for date_range.

end_date: The date (yyyy-MM-dd) up to which report data is imported. This field is required if date_range is custom_date.

include_zero_impressions: true means accepting all rows that have no impressions

    incremental: Support incremental loading

    columns: List of attributes, segments and metrics to be collected

    See Appendix F for the list of available attributes, segments and metrics.

    include_negative_keywords: true means accepting all negative and normal keywords. This option is applied to Keyword Performance Report only.

    For more details on available out modes, see Appendix A

    (optional): Preview data to import

    You can preview data to be imported using the command td connector:preview.

    $ td connector:preview load.yml

    +-----------------+---------------------+-----------------+----

    | id:long | displayurl:string | headline:string | ...

    +-----------------+---------------------+-----------------+----

    | 42023 | "Hello" | "Ads" |

    | 42045 | "World" | "Ads" |

    +-----------------+---------------------+-----------------+----

    Execute Load Job

You use td connector:issue to execute the job. The following are required:

the database and table where the data will be stored

the Data Connector configuration file

    $ td connector:issue load.yml --database td_sample_db --table td_sample_table --time-column updated_date

If the database or the table does not exist in TD, td connector:issue fails. Add the --auto-create-table option to create them automatically:

It's also recommended to specify the --time-column option, because Treasure Data's storage is partitioned by time (see data partitioning). If the --time-column option is not used, the data connector selects the first long or timestamp column as the partitioning time. The type of the column specified by --time-column must be either long or timestamp (use the preview results to check the available column names and types). A time column is available at the end of the output.

    $ td connector:issue load.yml --database td_sample_db --table td_sample_table --time-column updated_date --auto-create-table

If your data doesn't have a time column, you can add one using the add_time filter. Add the "time" column by including the add_time filter in your configuration file as follows.

in:
  type: google_adwords
  ...
filters:
  - type: add_time
    from_value:
      mode: upload_time
    to_column:
      name: time
out:
  type: td

    More details at add_time filter plugin.

If you have a field called time, you don't need to specify the --time-column option.

    $ td connector:issue load.yml --database td_sample_db --table td_sample_table

    Scheduled execution

You can schedule periodic data connector execution for Google Ads import. We have configured our scheduler carefully to ensure high availability. By using this feature, you no longer need a cron daemon in your local data center.

    Create the schedule

A new schedule can be created using the td connector:create command. The name of the schedule, a cron-style schedule, the database and table where the data is stored, and the data connector configuration file are required.

    $ td connector:create \

    daily_google_ads_import \

    "10 0 * * *" \

    td_sample_db \

    td_sample_table \

    load.yml

    The `cron` parameter also accepts these three options: `@hourly`, `@daily` and `@monthly`.

By default, the schedule is set up in the UTC timezone. You can set the schedule in a different timezone using the -t or --timezone option. The `--timezone` option supports only extended timezone formats like 'Asia/Tokyo' and 'America/Los_Angeles'. Timezone abbreviations like PST and CST are *not* supported and may lead to unexpected schedules.

    List the Schedules

    You can see the list of currently scheduled entries by td connector:list.

    $ td connector:list

    +-----------------------+--------------+----------+-------+--------------+-----------------+-----------------------------+

    | Name | Cron | Timezone | Delay | Database | Table | Config |

    +-----------------------+--------------+----------+-------+--------------+-----------------+-----------------------------+

    | daily_google_ads_import | 10 0 * * * | UTC | 0 | td_sample_db | td_sample_table | {"type"=>"google_adwords", ... } |

    +-----------------------+--------------+----------+-------+--------------+-----------------+-----------------------------+

    Show the Setting and History of Schedules

    td connector:show shows the execution setting of a schedule entry.

    % td connector:show daily_google_ads_import

    Name : daily_google_ads_import

    Cron : 10 0 * * *

    Timezone : UTC

    Delay : 0

    Database : td_sample_db

    Table : td_sample_table

    td connector:history shows the execution history of a schedule entry. To investigate the results of each individual execution, use td job <jobid>.

    % td connector:history daily_google_ads_import

    +--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+

    | JobID | Status | Records | Database | Table | Priority | Started | Duration |

    +--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+

    | 578066 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-18 00:10:05 +0000 | 160 |

    | 577968 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-17 00:10:07 +0000 | 161 |

    | 577914 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-16 00:10:03 +0000 | 152 |

    | 577872 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-15 00:10:04 +0000 | 163 |

    | 577810 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-14 00:10:04 +0000 | 164 |

    | 577766 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-13 00:10:04 +0000 | 155 |

    | 577710 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-12 00:10:05 +0000 | 156 |

    | 577610 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-11 00:10:04 +0000 | 157 |

    +--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+

    8 rows in set

    Delete the Schedule

    td connector:delete removes the schedule.

    $ td connector:delete daily_google_ads_import

    Appendix

    A) Modes for out plugin

You can specify the import mode in the out section of load.yml.

    append (default)

    This is the default mode and records are appended to the target table.

    in:

    ...

    out:

    mode: append

    replace (In td 0.11.10 and later)

    This mode replaces data in the target table. Any manual schema changes made to the target table remain intact with this mode.

    in:

    ...

    out:

    mode: replace

    B) Available report types

    Report Type

    Description

    ad_performance_report

    Ad Performance Report

    adgroup_performance_report

    AdGroup Performance Report

    campaign_performance_report

    Campaign Performance Report

    audience_performance_report

    Audience Performance Report

    keywords_performance_report

    Keywords Performance Report

    C) Predefined Attributes

    Report Type

    Predefined Attributes

    Ad Performance

    Id DisplayUrl Headline HeadlinePart1 HeadlinePart2

    AdGroup Performance

    AdGroupId AdGroupName

    Campaign Performance

    CampaignId CampaignName

    Audience Performance

    Id UserListName

    Keywords Performance Report

    Id AdGroupId

    D) Predefined Metrics

    Report Type

    Predefined Metrics

    Ad Performance AdGroup Performance Campaign Performance

Clicks Impressions Ctr AverageCpc Cost Conversions ViewThroughConversions CostPerConversion ConversionRate

    Audience Performance

    Clicks Impressions Ctr AverageCpc Cost AverageCpm AbsoluteTopImpressionPercentage TopImpressionPercentage

    Keywords Performance

    Clicks Impressions Ctr AverageCpc Cost Conversions ViewThroughConversions CostPerConversion ConversionRate AbsoluteTopImpressionPercentage TopImpressionPercentage

    E) Available Date Ranges

    Date Range

    Description

    TODAY

    Today only.

    YESTERDAY

    Yesterday only.

    LAST_7_DAYS

    The last 7 days not including today.

    LAST_WEEK

    The seven-day period starting with previous Monday.

    LAST_BUSINESS_WEEK

    The 5 day business week, Monday through Friday, of the previous business week.

    THIS_MONTH

    All days in the current month.

    LAST_MONTH

    All days in the previous month.

    ALL_TIME

    The entire available time range.

    CUSTOM_DATE

A custom date range. Requires start_date and end_date in yyyy-MM-dd format.

    LAST_14_DAYS

    The last 14 days not including today.

    LAST_30_DAYS

    The last 30 days not including today.

    THIS_WEEK_SUN_TODAY

    The period between the previous Sunday and the current day.

    THIS_WEEK_MON_TODAY

    The period between the previous Monday and the current day.

    LAST_WEEK_SUN_SAT

    The seven-day period starting with the previous Sunday.

    F) Available Fields

    Report Type

    Reference

    Ad Performance

    Ad Performance Report

    AdGroup Performance

    AdGroup Performance Report

    Campaign Performance

    Campaign Performance Report

    Audience Performance

    Audience Performance Report

    Keywords Performance

  • Using Tableau Desktop on Mac with Arm Treasure Data allows users to interactively explore data using Presto.

    Prerequisites

    Mac OS 10.12 or later

    Tableau Desktop for Mac v10.5 or later

    Install ODBC Driver for Mac

    Visit https://www.tableau.com/support/drivers.

    Follow these steps to install the Mac driver for Presto:

    Close Tableau Desktop.

    Download the driver from the Download link.

    Click the downloaded file.

    Double-click the Presto .pkg file to run the installer.

    Accept the defaults.

    Open Tableau Desktop and connect to Presto.

    Result output to Tableau

    Make the connection

    Start Tableau. Click Connect > Presto.

    For a complete list of data connections, click To a Server > More.

    Then complete the following steps:

    Enter Presto JDBC/ODBC Gateway server: api-presto.treasuredata.com

    Enter the port: 443

Enter the catalog name: td-presto

    Select the Authentication method: Username

Enter the Username: use your master APIKEY (see Get APIKEY).

    Select the Require SSL check box.

    Select Sign In.

    If Tableau cannot complete the connection, verify that your credentials are correct.

    Setup the data source

    On the data source page, complete the following steps:

    Select the default data source name at the top of the page, and then enter a unique data source name for use in Tableau.

    From the Schema drop-down list, select the search icon or enter the schema name in the text box and select the search icon, and then select the schema.

    In the Table text box, select the search icon or enter the table name and select the search icon, and then select the table.

    Drag the table to the canvas, and then select the sheet tab to start your analysis.

    Advanced Tips

Change the time Column's Data Type to Date & Time

After connecting to the table, you'll find a time field in Measures. The time field contains unixtime data by default. You can drag and drop the time field to Dimensions and change the data type to Date & Time:

    Limitations

    Potential Risk at Presto upgrade

Compatibility between Presto and the Tableau ODBC package isn't guaranteed when Treasure Data Presto is upgraded. We don't recommend this connection for production systems, because the connector requires regular updates of data sources.

Using Tableau Desktop with Arm Treasure Data enables you to interactively explore huge amounts of data. You can also use Treasure Data as a place to integrate and store all your data.

Connect to the TD Presto endpoint

Select "More..." in the left menu. Choose Presto and enter the connection information. Click Sign In.

Enter a Schema (database name) and Table (table name) to import data from a table, or write SQL that imports data from multiple tables. Then click Update Now to see the data in the Tableau window.

    Confirm Import in Treasure Data

In Treasure Data, create a query to track the events coming in from Tableau Desktop. It usually takes 3-5 minutes before events appear in the TD Console.

    Tips and Tricks

Convert the time Field to Datetime Type

Treasure Data treats the time column as a unix timestamp by default; however, it's convenient to handle timestamps as the Datetime type in Tableau. The following example describes how to create a calculated Datetime field for the time column.

Select Create Calculated Field...

Then, enter the name and the following function in the Formula section.

    RAWSQL_DATETIME("from_unixtime(%1)", [time])

The calculated field then appears in the list of Dimensions, with a Datetime type.

    Contact Us

If you are interested in using Treasure Data with Tableau, contact your account representative.

  • Using Tableau Online with Arm Treasure Data allows you to interactively explore huge amounts of data, and share your data discoveries across your business organizations.

    Prerequisites

    Basic knowledge of Treasure Data

A Tableau Online license

    Usage

    CLI: One Time Query

    1. JSON-style config (new and recommended)

    Add the Tableau result output destination by using the -r / result option for the td query command:

    $ td query -d mydb -r '{"type":"tableau","host":"company.online.tableau.com","username":"my_user","password":"passw0rd","ssl":true,"ssl_verify":false,"server_version":"online","datasource":"my_ds","site":"MarketingTeam","project":"","mode":"replace","chunk_size_in_mb":50,"timezone":"America/Los_Angeles"}' 'SELECT * FROM access'

    where:

    username: Your Tableau Online Username

    password: Your Tableau Online Password

    host: Tableau Online Host on which your site exists (e.g. 10ay.online.tableau.com)

    datasource: target Tableau DataSource name

    site: The URL of the site to sign in to

    To determine the value to use for the site attribute, sign in to Tableau Online and examine the value that appears after /site/ in the URL. For example, in the following URL, the site value is MarketingTeam:

    https://online.tableau.com/#/site/MarketingTeam/workbooks

    Refer here for more details

    chunk_size_in_mb: Chunk file size (in MB) to be uploaded each time, default: 200, min: 100, max: 1024

    timezone: timezone to convert from Timestamp data type to Tableau DateTime data type, default: UTC

    2. URL-style (not recommended)

Change the Tableau result output destination from JSON format to URL format, as in the following example:

    $ td query -d mydb -r 'tableau://username:password@host/datasource?mode=replace&site' 'SELECT * FROM access'

    3. Migrating Existing Output Configurations to Tableau Hyper

    Tableau has updated its data engine with a technology called Hyper. To take advantage of this, you must update your existing Treasure Data output configurations. New Treasure Data configurations automatically use the latest version.

    To upgrade your existing configurations:

    Go to Queries and select your scheduled query.

In the Query Editor, click the Export Results target.

    Choose your saved Tableau connection.

Input the Site ID value; it's required for Tableau Online.

    Select 'hyper' for Data Source Type.

    Click Done to save your configuration.

    You do not need to change the timezone configuration when migrating from the legacy Tableau to the current Tableau. You can leave the default value (UTC). See also timezone in this article.

    Console: One Time Query

    Go to the query editor on the TD Console, and type in the query. The following example query uses the access log example data set, and calculates the distribution of HTTP method per day.

    HIVE:

    # HiveQL

    SELECT

    CAST(TD_TIME_FORMAT(time, "yyyy-MM-dd 00:00:00") AS TIMESTAMP) AS `dates`,

    method AS `Method`,

    COUNT(1) AS `Count`

    FROM

    www_access

    GROUP BY

    TD_TIME_FORMAT(time, "yyyy-MM-dd 00:00:00"),

    method

    PRESTO:

    # Presto

    SELECT

    CAST(TD_TIME_FORMAT(time, 'yyyy-MM-dd 00:00:00') AS TIMESTAMP) AS "dates",

    method AS "Method",

    COUNT(1) AS "Count"

    FROM

    www_access

    GROUP BY

    TD_TIME_FORMAT(time, 'yyyy-MM-dd 00:00:00'),

    method

    We're casting the Datetime column from String type to TIMESTAMP type, for convenience in Tableau.

Tableau doesn't support fractional seconds in a timestamp. Remove fractional seconds (for example, by using the substr() function) before casting to the TIMESTAMP type in the query.

    Choose saved connection

A Choose Integration dialog displays. Select an existing Tableau Online connection. If you do not have a Saved Integration already set up, follow the next step on how to create a new connection within the Sources Catalog.

    Create a new authentication

Go to Integrations Hub > Catalog and search. Select Tableau. The following opens:

    Username: Your Tableau Online Username

    Password: Your Tableau Online Password

    Host: Tableau Online Host on which your site exists (e.g. 10az.online.tableau.com)

    Additional configuration

After you create a Tableau connection or select an existing one, you see the following Configuration popup.

    Parameters

    Description

    Default values

    Datasource Name

    The name of destination Data Source on Tableau Online

    Site ID

The URL of the site to sign in to; it's required for Tableau Online

    Project Name

    Go to your Tableau Online to get a list of projects

    Default

    Mode

    replace to replace Data Source each time, append to append to existing Data Source

    append

    Chunk File Size In MB

    Extract File is split into chunks before uploading. This option defines file size of each chunk (min: 100, max: 1024)

    200

    Timezone

    Timezone ID to use when converting from Timestamp (timezone-independent) to Tableau DateTime (timezone-dependent)

    UTC

Append mode ignores new columns when exporting data into an existing data source, due to Tableau's specification. https://onlinehelp.tableau.com/current/pro/desktop/en-us/extracting_addfromfile.html

After completing all the fields, submit the query. The system executes the query, creates the Tableau Data Extract file (.tde or .hyper), and uploads the extract file to Tableau Online.

    Go to your Tableau Online, and click Data Sources at the top left bar. You can view the list of data sources, including your TDE file.

Click New Workbook to create charts and dashboards from the browser. Drag and drop dimensions and measures from the left nav to create graphs. Then click Save to store the result.

    Console: Scheduled Query

In the previous section, you created charts through one job. Next, you can periodically refresh the dataset by using the Scheduled Jobs feature. Using this feature, you execute a query on Treasure Data and update the TDE file on Tableau Online. You can specify schedules from the query editor by using the TD_SCHEDULED_TIME() UDF, with append mode.

The following query calculates the number of records within the last 24 hours, from the time that the query is executed. By continuously running this scheduled query, you can avoid processing the entire data set every day.

    HIVE:

    # HiveQL

    SELECT

    CAST(TD_TIME_FORMAT(time, "yyyy-MM-dd 00:00:00") AS TIMESTAMP) AS `dates`,

    method AS `Method`,

    COUNT(1) AS `Count`

    FROM

    www_access

    GROUP BY

    TD_TIME_FORMAT(time, "yyyy-MM-dd 00:00:00"),

    method

    PRESTO:

    # Presto

    SELECT

    CAST(TD_TIME_FORMAT(time, 'yyyy-MM-dd 00:00:00') AS TIMESTAMP) AS "dates",

    method AS "Method",

    COUNT(1) AS "Count"

    FROM

    www_access

    GROUP BY

    TD_TIME_FORMAT(time, 'yyyy-MM-dd 00:00:00'),

    method

    Options

    Result output to Tableau supports various options. The options can be specified as URL parameters on the CLI or with the REST APIs or the Console where supported. The options are normally compatible with each other and can be combined. Where applicable, the default behavior is indicated.

    ssl

    The ssl option determines whether to use SSL for connecting to the Tableau server. When true, SSL is used. ssl=true is the default when this option is not specified.

    tableau://username:password@host/?ssl=true

    ssl_verify

The ssl_verify option determines whether to require certificate verification for the SSL communication. When true, certificate verification is required. ssl_verify=true is the default when this option is not specified.

    tableau://username:password@host/?ssl=true&ssl_verify=true

Disabling certificate verification is useful when the Tableau server's SSL certificate is self-signed.

    timezone

To convert a timestamp value, which is timezone-independent (for example, 1548979200), to Tableau DateTime, which includes day, hour, minute, and so on, the connector needs to know the target timezone.

If your query contains a TIMESTAMP column, or you cast a datetime column to TIMESTAMP, the value is exported to the Tableau server as DateTime. This means a conversion takes place, and you need to provide the target timezone as necessary.

    Treasure Data stores datetime value using UTC timezone. In most cases, leave timezone config as default (UTC), to preserve the value from Treasure Data, unless you particularly want to convert the value to another timezone.

    An example of configuring a timezone other than the default UTC is as follows:

    From the CLI:

    $ td query "..." -r '{ "type": "tableau", ..., "timezone": "America/Los_Angeles" }'

    As part of TD Workflow:

    host: "company.online.tableau.com"
    ssl: true
    ssl_verify: true
    username: "my_user"
    password: "passw0rd"
    datasource: "my_ds"
    site: "my_company"
    project: "Default"
    server_version: "online"
    timezone: "America/Los_Angeles"
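As an illustration of what the timezone option controls (a Python sketch, not connector code): the same epoch value renders as different DateTime strings depending on the target timezone.

```python
# Illustrative sketch: how a timezone-independent epoch value such as
# 1548979200 becomes a timezone-dependent DateTime string.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

epoch = 1548979200

# Default behavior (timezone: UTC) preserves the value stored in Treasure Data.
utc_dt = datetime.fromtimestamp(epoch, tz=timezone.utc)
print(utc_dt.strftime("%Y-%m-%d %H:%M:%S"))  # 2019-02-01 00:00:00

# With timezone: "America/Los_Angeles", the same instant renders differently.
la_dt = datetime.fromtimestamp(epoch, tz=ZoneInfo("America/Los_Angeles"))
print(la_dt.strftime("%Y-%m-%d %H:%M:%S"))  # 2019-01-31 16:00:00
```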

    Limitations

    The maximum result size is 250,000,000 records. If it is exceeded, the log displays the message: Extract file records limit exceeded: 250000000.

    The lowest Timestamp value is 1000-01-01 00:00:00. If a value is earlier than that, the log displays the message: invalid date value.

    Known issues:

    Appending a large dataset to an existing Data Source may result in a timeout. In that case, you will see the following messages in the job logs:

    2019-04-10 19:20:41.460 +0000 [WARN] (0001:transaction): !!! Data Source Publish is timed out. This is a known issue of Tableau when you append a large extract file to existing Data Source.

    2019-04-10 19:20:41.460 +0000 [WARN] (0001:transaction): !!! Check Tableau Console for final result of the Publish.

    The job succeeds, but you must check the final result on the Tableau Console (you can verify it by the Number of records in the Data Source).

Using Tableau Server with Arm Treasure Data allows users to interactively explore huge amounts of data and share that data across your organization.

    Tableau Server 2018.1 Release Notes

    Prerequisites

    Basic knowledge of Treasure Data

    A licensed installation of Tableau Server

    Usage

    CLI: One Time Query

    JSON-style config (new and recommended)

    Add the Tableau result output destination by using the -r / --result option of the td query command:

    $ td query -d mydb -r '{"type":"tableau","host":"my.tableauserver.com","username":"my_user","password":"passw0rd","ssl":true,"ssl_verify":false,"server_version":"10.0","datasource":"my_ds","site":"","project":"","mode":"replace","chunk_size_in_mb":50,"timezone":"PST"}' 'SELECT * FROM access'

    where:

    username: Your Tableau Server Username

    password: Your Tableau Server Password

    host: Your Tableau Server Hostname

    site: The URL of the site to sign in to (optional)

    To determine the value to use for the site attribute, sign in to Tableau Server and examine the value that appears after /site/ in the URL. For example, in the following URL, the site value is MarketingTeam:

    https://MyServer/#/site/MarketingTeam/projects

    If the site attribute is an empty string, you are signed in to the default site. Note that you always sign in to a specific site, even if you don't specify a site when you sign in.

    Refer to the Tableau documentation for more details.

    datasource: target Tableau DataSource name

    chunk_size_in_mb: Chunk file size (in MB) to be uploaded each time, default: 200, min: 100, max: 1024

    timezone: timezone to convert from Timestamp data type to Tableau DateTime data type, default: UTC
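A sketch of assembling the -r result config programmatically rather than hand-writing the JSON string (illustrative only; the keys mirror the CLI example above, and json.dumps handles the quoting):

```python
# Sketch: building the JSON result-output config from the documented keys,
# to avoid quoting mistakes when hand-writing the string in a shell.
import json

result_config = {
    "type": "tableau",
    "host": "my.tableauserver.com",
    "username": "my_user",
    "password": "passw0rd",
    "ssl": True,
    "ssl_verify": False,
    "server_version": "10.0",
    "datasource": "my_ds",
    "site": "",
    "project": "",
    "mode": "replace",
    "chunk_size_in_mb": 50,
    "timezone": "PST",
}

# Pass this string as the argument to `td query ... -r '<config>'`.
config_arg = json.dumps(result_config)
print(config_arg)
```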

    URL-style config (not recommended)

    Change the Tableau result output destination from JSON format to URL format as shown in the following example:

    $ td query -d mydb -r 'tableau://username:[email protected]/?mode=replace' 'SELECT * FROM access'

    $ td query -d mydb -r 'tableau://username:[email protected]:8443/?mode=replace&ssl=true' 'SELECT * FROM access'

    Migrating Existing Output Configurations to Tableau Hyper

    Tableau has updated its data engine with a technology called Hyper. To take advantage of this, you must update your existing Treasure Data output configurations. New Treasure Data configurations automatically use the latest version.

    To upgrade your existing configurations:

    Go to Queries and select your scheduled query.

    In the Query Editor, click the Export Results target.

    Choose your saved Tableau connection.

    Input the Site ID value; it's required for a multi-tenant Tableau Server (leave it empty if you sign in to the Default site).

    Select 'hyper' for Data Source Type.

    Click Done to save your configuration.

    You do not need to change the timezone configuration when migrating from the legacy Tableau to the current Tableau. You can leave the default value (UTC). See also timezone in this article.

    Console: One Time Query

    Go to the query editor on the TD Console, and type in the query. The following example query uses the access log example data set, and calculates the distribution of HTTP method per day.

    HIVE:

    # HiveQL

    SELECT

    CAST(TD_TIME_FORMAT(time, "yyyy-MM-dd 00:00:00") AS TIMESTAMP) AS `dates`,

    method AS `Method`,

    COUNT(1) AS `Count`

    FROM

    www_access

    GROUP BY

    TD_TIME_FORMAT(time, "yyyy-MM-dd 00:00:00"),

    method

    PRESTO:

    # Presto

    SELECT

    CAST(TD_TIME_FORMAT(time, 'yyyy-MM-dd 00:00:00') AS TIMESTAMP) AS "dates",

    method AS "Method",

    COUNT(1) AS "Count"

    FROM

    www_access

    GROUP BY

    TD_TIME_FORMAT(time, 'yyyy-MM-dd 00:00:00'),

    method

    We're casting the Datetime column from String type to TIMESTAMP type for convenience in Tableau. NOTE: Tableau doesn't support fractional seconds in a timestamp. Remove fractional seconds (for example, using the substr() function) before casting to TIMESTAMP type in the query.
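Since the formatted value is fixed-width ("yyyy-MM-dd HH:mm:ss" is 19 characters), trimming fractional seconds amounts to taking a prefix; a Python sketch mirroring what substr() does in the query:

```python
# Sketch: dropping a fractional-seconds suffix (".SSS") before a value is
# cast to TIMESTAMP, mirroring substr(ts, 1, 19) in SQL.
def strip_fractional_seconds(ts: str) -> str:
    # Keep the first 19 characters: "yyyy-MM-dd HH:mm:ss".
    return ts[:19]

print(strip_fractional_seconds("2019-02-01 00:00:00.123"))  # 2019-02-01 00:00:00
```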

    Choose saved connection

    A Choose Integration dialog displays. Select an existing Tableau Server connection. If you do not have a Saved Integration already set up, follow the next step to create a new connection within the Sources Catalog.

    Create a new authentication

    Go to Integrations Hub > Catalog and search for and select Tableau. The following dialog opens:

    Username: Your Tableau Server Username

    Password: Your Tableau Server Password

    Host: Your Tableau Server Hostname (to specify a port number, use HOST:PORT)

    Additional configuration

    After you create a Tableau connection or select an existing one, you see the following Configuration popup.

    Parameters (description, with default value where one exists):

    Datasource Name: The name of the Data Source on Tableau Server.

    Site ID: If you don't have a specific site, set an empty string for Tableau Server.

    Project Name: Go to your Tableau Server to get a list of projects. Default: Default.

    Mode: replace to replace the Data Source each time, append to append to the existing Data Source. Default: append.

    Chunk File Size In MB: The extract file is split into chunks before uploading. This option defines the file size of each chunk (min: 100, max: 1024). Default: 200.

    Timezone: Timezone ID to use when converting from Timestamp (timezone-independent) to Tableau DateTime (timezone-dependent). Default: UTC.
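The Chunk File Size In MB option determines how many upload requests a given extract needs. As a rough sketch (illustrative arithmetic, not the connector's code, and the `upload_chunks` helper is an assumption):

```python
# Sketch: how chunk size (min 100, max 1024, default 200) relates to the
# number of upload chunks for an extract file of a given size.
import math

def upload_chunks(extract_size_mb: float, chunk_size_in_mb: int = 200) -> int:
    # Clamp to the documented bounds before computing the chunk count.
    chunk = min(max(chunk_size_in_mb, 100), 1024)
    return math.ceil(extract_size_mb / chunk)

print(upload_chunks(950))       # 5 chunks of up to 200 MB
print(upload_chunks(950, 500))  # 2 chunks of up to 500 MB
```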

    Currently, you MUST set up HTTPS (SSL). See the Tableau documentation for how to set up SSL on your Tableau Server.

    Append mode ignores new columns when exporting data into an existing Data Source, due to Tableau's specification. https://onlinehelp.tableau.com/current/pro/desktop/en-us/extracting_addfromfile.html

    After completing all the fields, submit the query. The system executes the query, creates the Tableau Data Extract file (.tde or .hyper), and uploads the extract file to Tableau Server.

    Go to your Tableau Server, and click Data Sources at the top left bar. You can view the list of data sources, including your TDE file.

    Click New Workbook to create charts and dashboards from the browser. Drag and drop the dimensions and measures from the left nav to the top right nav to create graphs. Then click Save to store the result.

    Console: Scheduled Query

    In the previous section, you created the charts using one job. The next step is to periodically refresh the dataset with the Scheduled Jobs feature. This feature periodically executes the query on Treasure Data and updates the TDE file on Tableau Server. The schedules can be specified from the query editor. It is useful to combine the TD_SCHEDULED_TIME() UDF with append mode.

    The following query calculates the number of records within the last 24 hours, measured from the time the query is executed. By continuously running this scheduled query, you avoid processing the entire data set every day.

    HIVE:

    # HiveQL

    SELECT

    CAST(TD_TIME_FORMAT(time, "yyyy-MM-dd 00:00:00") AS TIMESTAMP) AS `Datetime`,

    method AS `Method`,

    COUNT(1) AS `Count`

    FROM

    www_access

    WHERE

    TD_TIME_RANGE(time,

    TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d'),

    TD_SCHEDULED_TIME())

    GROUP BY

    TD_TIME_FORMAT(time, "yyyy-MM-dd 00:00:00"),

    method

    PRESTO:

    # Presto

    SELECT

    CAST(TD_TIME_FORMAT(time, 'yyyy-MM-dd 00:00:00') AS TIMESTAMP) AS "dates",

    method AS "Method",

    COUNT(1) AS "Count"

    FROM

    www_access

    WHERE

    TD_TIME_RANGE(time,

    TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d'),

    TD_SCHEDULED_TIME())

    GROUP BY

    TD_TIME_FORMAT(time, 'yyyy-MM-dd 00:00:00'),

    method
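The 24-hour window the scheduled query reads can be sketched in epoch arithmetic (illustrative only; TD_TIME_RANGE treats the start as inclusive and the end as exclusive, and TD_TIME_ADD(..., '-1d') subtracts 86400 seconds):

```python
# Sketch: the half-open [start, end) window selected by
# TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d'), TD_SCHEDULED_TIME()).
DAY = 86400

def in_last_day(event_time: int, scheduled_time: int) -> bool:
    start = scheduled_time - DAY  # TD_TIME_ADD(..., '-1d')
    return start <= event_time < scheduled_time

scheduled = 1548979200  # 2019-02-01 00:00:00 UTC
print(in_last_day(scheduled - 1, scheduled))    # True  (inside the window)
print(in_last_day(scheduled - DAY, scheduled))  # True  (start is inclusive)
print(in_last_day(scheduled, scheduled))        # False (end is exclusive)
```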

    Options

    Result output to Tableau Server supports various options. The options can be specified as URL parameters on the CLI or with the REST APIs or the Console where supported. The options are normally compatible with each other and can be combined. Where applicable, the default behavior is indicated.

    ssl

    The ssl option determines whether to use SSL for connecting to the Tableau server. When true, SSL is used. ssl=true is the default when this option is not specified.

    tableau://username:[email protected]/?ssl=true

    tableau://username:[email protected]:443/?ssl=true

    ssl_verify

    The ssl_verify option determines whether to require certificate verification for the SSL communication. When true, certificate verification is required. ssl_verify=true is the default when this option is not specified.

    tableau://username:[email protected]/?ssl=true&ssl_verify=true

    Disabling certificate verification is particularly useful when the Tableau server's SSL certificate is self-signed.

    project

    On Tableau Server, there is always at least one 'Default' project. You can't delete or rename this project. If you omit the Project Name option, the 'Default' project is used.

    In some rare cases, especially on non-English Tableau Servers, the 'Default' project name is localized, and jobs fail if you don't provide a Project Name (because no project named 'Default' can be found).

    timezone

    To convert a timestamp value, which is timezone independent (for example, 1548979200), to a Tableau DateTime, which includes day, hour, minute, and so on, the connector needs to know the target timezone.

    If your query contains a TIMESTAMP column, or you cast a datetime column to TIMESTAMP, the value is exported to the Tableau server as DateTime. This means a conversion takes place, so provide the target timezone as necessary.

    Treasure Data stores the datetime value using the UTC timezone. In most cases, leave the timezone config at the default (UTC) to preserve the value from Treasure Data, unless you specifically want to convert the value to another timezone.

    An example of configuring a timezone other than the default UTC is as follows:

    From the CLI:

    $ td query "..." -r '{ "type": "tableau", ..., "timezone": "America/Los_Angeles" }'

    As part of TD Workflow:

    host: "tableau.company.com"
    ssl: true
    ssl_verify: false
    username: "my_user"
    password: "passw0rd"
    datasource: "my_ds"
    site: "MarketingTeam"
    project: "Default"
    server_version: "2018.1"
    timezone: "America/Los_Angeles"

    Limitations

    The maximum result size is 250,000,000 records. If it is exceeded, the log displays the message: Extract file records limit exceeded: 250000000.

    The lowest Timestamp value is 1000-01-01 00:00:00. If a value is earlier than that, the log displays the message: invalid date value.

    Known issues:

    .hyper extracts published to Tableau Server 10.5 via the REST API have a .tde file extension on the Tableau Console. These extracts are still in .hyper file format and function the same way as other .hyper extracts. The issue is fixed in 2018.1. More details: (search for Issue ID 754677)

    A Data Source belonging to a nested Project cannot be appended to by using the REST API; a Resource Not Found error is returned. Replace mode does not have this issue. A fix is planned for Tableau Server 2018.1.4. If you see this error, upgrade to the latest version of Tableau Server.

  • You can run queries in Mode Analytics and send reports generated from the queries to Arm Treasure Data.

    Prerequisites

    Mode Analytics account

    Basic knowledge of Arm Treasure Data

    Requires access to the new Presto-specific JDBC/ODBC API, currently in Private Beta. Contact your account representative if you'd like access to this new feature.

    Retrieve Treasure Data API Key

    Retrieve the Treasure Data API key from the Console's profile page. The key must be of the master API key type.

    Connect a Database

    Connect to Treasure Data from Mode. Click your profile icon on the menu, and click the Connect a Database menu item.

    Then, select Database in the Cloud tab.

    You'll find Treasure Data Presto in the database list. Click Treasure Data Presto.

    Set API key

    After you retrieve the credential, enter it in the API Key field.

    Create Reports

    Now it's time to create reports! In the editor, select the database connection you've created. Then execute the SQL queries and create reports from them.

While Arm Treasure Data is a BI-tool-agnostic service, customers like retailer MUJI use Tableau for BI / Visual Analytics. In this article, we showcase our customers' usage pattern of combining Treasure Data, Tableau Desktop, and Tableau Server. Let's begin by understanding the characteristics of each solution.

    Treasure Data: Tableau Desktop ODBC connector

    Tableau is a business intelligence software that helps people see and understand their data. There are two major products provided by Tableau Software: Tableau Desktop and Tableau Server. Tableau Desktop is a Desktop Application used to visualize and analyze data. It helps create workbooks, visualizations, dashboards, and stories.

    Users can publish visualized data to Tableau Server for sharing within an organization. Tableau Desktop is a BI designer tool, and Tableau Server is a publishing environment to share the visualizations. Tableau Online is a hosted version of Tableau Server, which doesn't require you to manage the BI server.

    Tableau Desktop Product Tour (Video)

    Tableau Server Product Tour (Video)

    Treasure Data Introduction (Video)

    Treasure Data Technical Overview & Concepts (Video)

    Tableau + Treasure Data Reference Architecture

    So, why combine Treasure Data & Tableau? Treasure Data provides a scalable backend to handle new big data sources (application logs, web logs, mobile data, sensor data, etc), while Tableau provides flexible visual analytics for existing data sources (EDW, CRM, ERP, etc).

    By combining Treasure Data and Tableau, you can quickly get insights on data sources of any type and any size. Let's walk through the following architecture diagram.

    Collect Big Data (Treasure Data)

    First, let's start collecting data into Treasure Data. Treasure Data provides various ways to collect data into the cloud in near real time. The data sources depicted here are time-series data: historical data that is produced in real time and grows rapidly as your business scales. Here are the main data collection capabilities provided by Treasure Data:

    JavaScript SDK for website tracking

    Mobile & Gaming SDK (Android, iOS, Unity, Unreal Engine) for mobile and gaming application tracking

    Treasure Agent for streaming data collection

    Bulk Loader for parallel bulk loading

    Data Connector for pre-built integrations

    Treasure Data imports almost 1 million records per second, and Treasure Data customers benefit from that scale. Setting up data collection usually takes only a couple of hours, or even a few minutes in some cases.

    Aggregate Big Data (Treasure Data)

    Now we have raw data in the cloud. To provide a better experience for BI consumers, it's a good idea to summarize this raw data into smaller sets for performance reasons. By using one of Treasure Data's embedded query engines, you can crunch big data into an aggregated format.

    Treasure Data supports Tableau Result Output, so you can directly push the aggregated results into Tableau Server. You don't need any additional infrastructure to do this. You can even automate this process by using Scheduled Jobs to periodically aggregate the data.

    Treasure Data can push the query results as a Tableau Data Extract (TDE) file. TDE is Tableau's proprietary columnar file format, optimized for efficiently slicing and dicing data (see Why Use Tableau Data Extracts). The TDE file is saved directly into Tableau Server.

    Treasure Data: Writing Job Results into Tableau Server

    Treasure Data: Writing Job Results into Tableau Online

    Design Workbooks (Tableau Desktop)

    Now we have both raw data access and aggregated data. It's time to explore the data using Tableau Desktop. Tableau offers many built-in connectors for existing data sources (EDW, CRM, ERP, Excel, etc.) that you can interact with directly.

    Tableau Desktop: List of Data Sources

    Treasure Data provides an ODBC driver for Tableau Desktop so that data analysts can access raw data.

    Tableau Desktop: Treasure Data ODBC connector

    Analysts can choose any of the above methods depending on their needs. You can also join across these data sources. For example, you can create a join between Salesforce.com data and a TDE file, or even join multiple TDE files. When the workbook is created, Tableau Desktop can publish it to Tableau Server.

    Share the Workbooks (Tableau Server)

    Now everything is set. Analysts can publish workbooks to the server, and consumers can view them from their browsers. Analysts can quickly iterate on the data and reports by having access to all the data sources, so they're now self-reliant.

    Tableau Desktop: Publishing to Tableau Server (Video)

    Summary

    Tableau + Treasure Data empowers data-driven companies to rapidly explore data and get insights. By combining these two solutions, your team can focus on insights, not infrastructure, with an industry-leading visual analytics tool. If you have questions, contact us. For the next steps, follow these links to learn how to connect to Treasure Data from each Tableau product.

    Treasure Data and Tableau: A complete big data analytics solution in days (White Paper)

    Treasure Data: Writing Job Results into Tableau Server

    Treasure Data: Writing Job Results into Tableau Online

Maximize your data resources by using Treasure Data with Google DoubleClick for Publishers (DFP). Google DoubleClick for Publishers is now known as Google Ad Manager.

    You can create audience lists in your Google DoubleClick for Publishers (DFP) using data held in Treasure Data. Follow these steps to move cookies and Mobile Advertising Identifiers to new or existing audience lists within Google DFP.

    Prerequisites

    Basic knowledge of Treasure Data, including the TD Toolbelt

    A DFP Account

    Authorized Treasure Data DMP access to your Google DFP Account

    Grant access for Treasure Data

    Treasure Data's data connector requires permissions to create audience segments in your Google DFP account. Use the Google Contact Us form to reach the DoubleClick for Publishers Support team and request that Treasure Data be granted access to your DFP account. Provide the following information in the form:

    Request: Grant Treasure Data permissions

    Your DoubleClick for Publishers account ID (referred to by Google as the Audience Link ID)

    Treasure Data DMP:

    Customer ID: 140-996-0635

    NID: treasuredata_dmp

    To find your Audience Link ID: in Google Bid Manager, click Admin > Global settings > All network settings.

    You are sending this information so that Google recognizes Treasure Data and connects your Google DFP account to Treasure Data.

    To export data, you create or select an existing connection, create or reuse a query, and then run the query to export your audience lists.

    Create or reuse a query

    Go to Treasure Data console

    Go to Query Editor. You can pick the query language and write your query.

    Select your database and table.

    Access the query that you plan to use to export data. Here is a sample query:

    SELECT DISTINCT "cookie", "list_name", "time" FROM "google_dfp_ddp"

    Create or select an existing DFP connection

    Option 1: Create a connector

    Go to Integrations Hub > Catalog and search for and select DoubleClick for Publishers. Complete the required fields on the first pane. In the Audience Link ID field, enter the ID that you use in your DFP. Next, complete the Configuration fields.

    Tip: Sometimes you need to define the column mapping before writing the query.

    Option 2: Select a connector

    Go to the Treasure Data console. Go to the Query Editor. Access the query that you plan to use to export data. Click Output results. The Choose Saved Connection dialog opens. Type the connection name in the search box to filter, and click the connection that you want. Next, complete the Configuration fields.

    Tip: You can also create a new connection from here by clicking Create New Connection.

    Creating a new connection:

    Additional connector configuration information

    Selecting an existing connection:

    Configure the connection by specifying the parameters

    After you select your Google DoubleClick for Publishers (DFP) connection, the Configuration dialog pane appears:

    Source column name mappings (optional)

    Define the mapping from the Google DFP column names to the output column names that you specify in your query. You specify the target column and then the source column. For example, if google_cookie is the identifier column in your TD data source, define the mapping as cookie:google_cookie. If the source column in a mapping is omitted, the target column name is used; for example, cookie is the same as the cookie:cookie mapping.

    See Additional connector configuration information for more information about supported column names and usage.
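The target:source mapping convention above can be sketched as follows (illustrative only; the parse_mappings helper and the comma-separated multi-mapping form are assumptions, not the connector's actual code):

```python
# Sketch: parsing "target:source" column mappings such as
# "cookie:google_cookie". When the source half is omitted, the target
# name doubles as the source ("cookie" == "cookie:cookie").
def parse_mappings(spec: str) -> dict:
    mappings = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        target, _, source = item.partition(":")
        mappings[target] = source or target
    return mappings

print(parse_mappings("cookie:google_cookie,list_name"))
# {'cookie': 'google_cookie', 'list_name': 'list_name'}
```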

    Cookie or mobile identifier column header

    Specify the original source of the user cookie or mobile identifier.

    You must select one of the options:

    cookie_encrypted: Encrypted identifier (for example, Web), a cookie hash of user id

    cookie_idfa: iOS Advertising Identifier

    cookie_adid: Android Advertising Identifier

    See Cookie or Mobile Identifier Name below for more information on how identifier names are handled.

    Execute the query

    Either save the query with a name and then run it, or just run the query. When the query completes successfully, the result is automatically processed. Plan to transfer your data at least 24 hours ahead of when you need the audience lists (also referred to as segments) to be available in Google DFP.

    Additional connector configuration information

    The manner in which Google handles your data from Treasure Data affects how you configure the connector.

    Column Mappings

    Google DFP reads the data source table by columns and uses the following column name mappings to process each row of data:

    cookie: The encrypted Google ID or Mobile Advertising Identifier that is used in id matching. This column contains the cookie hash or mobile identifier of your users.

    list_name: This column contains the name of the audience list (segment) that you want to create in your DFP audience. If the list name does not exist in DFP, a new list is created. If the list name does exist, the existing list is updated.

    timestamp (optional): The timestamp (seconds since the epoch). If this column does not exist or a value is missing, the current timestamp is used.

    delete (optional): This column contains boolean values (`false` or `true`) or numbers (`0` or `1`) indicating whether the cookie is to be added to or removed from the given audience segment. The value defaults to false if it is left blank or the column is not provided.
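The accepted `delete` values could be normalized as in this sketch (illustrative; the to_delete_flag helper is an assumption and the connector's actual parsing may differ):

```python
# Sketch: normalizing the optional `delete` column — booleans (`true`/`false`)
# or numbers (`1`/`0`), defaulting to false when blank or absent.
def to_delete_flag(value) -> bool:
    if value is None:
        return False  # column missing -> add the cookie (documented default)
    text = str(value).strip().lower()
    if text in ("", "false", "0"):
        return False
    if text in ("true", "1"):
        return True
    raise ValueError(f"unsupported delete value: {value!r}")

print(to_delete_flag(None))    # False
print(to_delete_flag("true"))  # True
print(to_delete_flag(0))       # False
```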

    Cookie or Mobile Identifier Name

    The upload process supports several different identifiers, and it is important that each type of identifier is uploaded to the segment using the correct upload file format. Generally, the identifiers fall into one of two categories: encrypted identifiers (anything obtained from Google systems) and raw identifiers (obtained from an external system or source). Any encrypted identifier is uploaded using the cookie_encrypted file format, and raw identifiers are uploaded in a type-specific upload format such as Mobile Advertising Identifiers, for example cookie_idfa or cookie_adid.

    Supported cookie types are cookie_encrypted, cookie_idfa, and cookie_adid.

    Appendix

    Limitationsfrom Google DoubleClick for Publishers

    It may take up to 24 hours from the time of query completion for updates to audience lists to be visible in DFP.

