Treasure Data's Frequently Asked Questions page is a central hub where customers can go with their most common questions. These are the 586 most popular questions Treasure Data receives.
You can import worksheets from Google Sheets into Arm Treasure Data using the Google Sheets connector.
Prerequisites
Basic knowledge of Treasure Data
Basic knowledge of Google Sheets
Use TD Console
Create a new connection
Go to Integrations Hub > Catalog, then search for and select Google Sheets input. The following dialog opens.
Click to connect a new account. Log into your Google Sheets account from the new window and grant Treasure Data access:
You are redirected back to the Treasure Data Catalog. Repeat the previous step and select your new OAuth connection.
Click Continue and give your connection a name:
Then click Done.
Create a new source
After creating the connection, you are automatically taken to the Sources tab. Look for the connection you created and click New Source.
The following dialog opens. Edit the details and click Next.
Spreadsheet Key: ID of the spreadsheet
Worksheet Title: worksheet name in the spreadsheet. It is not case-sensitive, meaning `sheet1` is the same as `SHEET1` or `Sheet1`.
If blank, the first worksheet of the spreadsheet is fetched.
Range to fetch: the range of the data in the worksheet to fetch. For example, A1:B2.
If blank, the whole worksheet is fetched.
If blank and Use first row/column as header is selected, only rows/columns that have a header are fetched.
Use first row/column as header: use the first row or first column (depending on the major dimension) as the header names for the schema.
When this option is selected, Skip first rows/columns must be at least 1.
Skip first rows/columns: number of leading rows/columns to skip before reading data.
Major dimension: the dimension of the sheet that is fetched
Rows: operates on the rows of a sheet.
Columns: operates on the columns of a sheet.
Value render options: format of the values returned from the sheet
Formatted value: values are calculated and formatted in the reply according to the cell's formatting. Formatting is based on the spreadsheet's locale, not the requesting user's locale. For example, if A1 is 1.23 and A2 is =A1 and formatted as currency, then A2 returns "$1.23".
Unformatted value: values are calculated, but not formatted in the reply. For example, if A1 is 1.23 and A2 is =A1 and formatted as currency, then A2 would return the number 1.23.
Datetime render options: datetime format of the values returned from the sheet
Formatted String: instructs date, time, datetime, and duration fields to be output as strings in their given number format (which is dependent on the spreadsheet locale).
Timestamp formats that we can support:
dd/MM/yyyy
dd/MM/yyyy hh:mm:ss
dd/MM/yyyy can be rearranged to MM/dd/yyyy and yyyy/MM/dd
The / separator can be replaced by . or -. For example: 12.12.2012 or 12-12-2012
Serial number: instructs date, time, datetime, and duration fields to be output as doubles in "serial number" format, as popularized by Lotus 1-2-3.
Datetimes without a timezone, as well as date-only and time-only values, are supported.
For example: 12-12-2012, 15:12:20, 12-12-2012 15:12:20
The datetime formats below are supported for both datetime render options:
dd/MM/yyyyThh:mm:ss
dd/MM/yyyyThh:mm:ss +0000
dd/MM/yyyyThh:mm:ss +01:00
dd/MM/yyyyThh:mm:ssZ
Columns containing multiple datetime formats are not supported and can lead to unexpected behavior.
When a value can't be converted to a datetime, the job stops with an exception message. Enable Skip invalid cell(s) to ignore these errors.
Edit your parameters and click Next.
Preview your data. To make changes, click Advanced Settings, otherwise click Next.
Advanced Settings:
Skip invalid cell(s): ignores errors when importing data, such as a string value in a double column or cells that don't follow the timestamp format. All invalid cells are imported as null.
Change your column names and data types, then click Save.
Select a database and a table where you want to transfer your data and click Next.
Specify the schedule of the data transfer using the following dialog and click Next:
Name your source and click Create And Run
Use Command Line
Install td command
Install the newest Treasure Data Toolbelt.
Create Seed Config File (seed.yml)
in:
  type: google_sheets
  client_id: ***
  client_secret: ***
  refresh_token: ***
  key: <YOUR_SPREADSHEET_ID>
  worksheet_title: <YOUR_WORKSHEET_NAME>
  range: <YOUR_DATA_RANGE>
  major_dimension: rows                      # rows or columns
  value_render_option: formatted_value       # formatted_value or unformatted_value
  datetime_render_option: formatted_string   # formatted_string or serial_number
  header_line: true                          # true or false
  skip_header_lines: 1
  skip_invalid_format_cells: false
out:
  mode: append
For more details on available out modes, see Appendix.
Guess Fields (Generate load.yml)
Use connector:guess. This command automatically reads the target data, and intelligently guesses the data format.
$ td connector:guess seed.yml -o load.yml
If you open the load.yml file, you see guessed file format definitions including, in some cases, file formats, encodings, column names, and types.
in:
  type: google_sheets
  major_dimension: columns
  client_id: ***
  client_secret: ***
  refresh_token: ***
  worksheet_title: Sheet1
  range: a10:k19
  value_input_option: RAW
  value_render_option: formatted_value
  datetime_render_option: serial_number
  header_line: 'true'
  skip_header_lines: 1
  columns:
  - {name: 'column1', type: timestamp, format: '%Y-%m-%d %k:%M:%S'}
  - {name: 'column2', type: string}
  - {name: 'column3', type: string}
  - {name: 'column4', type: string}
out: {mode: append}
exec: {}
filters:
- rules:
  - {rule: upper_to_lower}
  - pass_types:
    - a-z
    - 0-9
    pass_characters: _
    replace: _
    rule: character_types
  - pass_types:
    - a-z
    pass_characters: _
    prefix: _
    rule: first_character_types
  - {rule: unique_number_suffix, max_length: 128}
  type: rename
- from_value: {mode: upload_time}
Then you can preview how the system parses the file by using preview command.
$ td connector:preview load.yml
If the system detects a column name or type unexpectedly, modify load.yml directly and preview again.
The data connector supports parsing of "boolean", "long", "double", "string", and "timestamp" types.
Execute Load Job
Submit the load job. It may take a couple of hours depending on the data size. Users need to specify the database and table where their data is stored.
$ td connector:issue load.yml --database td_sample_db --table td_sample_table
The preceding command assumes that you have already created the database (td_sample_db) and table (td_sample_table). If the database or the table does not exist in TD, this command will not succeed, so create the database and table manually or use the --auto-create-table option with the td connector:issue command to create them automatically:
$ td connector:issue load.yml --database td_sample_db --table td_sample_table --time-column created_at --auto-create-table
You can assign a time-format column as the partitioning key by using the --time-column option.
Scheduled execution
You can schedule periodic data connector execution for periodic Google sheet import. We carefully configure our scheduler to ensure high availability. By using this feature, you no longer need a cron daemon on your local data center.
Create the schedule
A new schedule can be created by using the td connector:create command. The name of the schedule, cron-style schedule, the database and table where their data is stored, and the data connector configuration file are required.
$ td connector:create \
daily_google_sheets_import \
"10 0 * * *" \
td_sample_db \
td_sample_table \
load.yml
The `cron` parameter also accepts three special options: `@hourly`, `@daily` and `@monthly`.
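For example, to run the same import once a day you could pass `@daily` in place of the cron expression (the schedule, database, and table names here simply reuse the ones from the example above):
$ td connector:create \
    daily_google_sheets_import \
    "@daily" \
    td_sample_db \
    td_sample_table \
    load.yml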
List the Schedules
You can see the list of scheduled entries by td connector:list.
$ td connector:list
+---------------------------+--------------+----------+-------+--------------+-----------------+---------------------------------+
| Name | Cron | Timezone | Delay | Database | Table | Config |
+---------------------------+--------------+----------+-------+--------------+-----------------+---------------------------------+
|daily_google_sheets_import | 10 0 * * * | UTC | 0 | td_sample_db | td_sample_table | {"type"=>"google_sheets", ... } |
+---------------------------+--------------+----------+-------+--------------+-----------------+---------------------------------+
Show the Setting and History of Schedules
td connector:show shows the execution setting of a schedule entry.
$ td connector:show daily_google_sheets_import
Name     : daily_google_sheets_import
Cron : 10 0 * * *
Timezone : UTC
Delay : 0
Database : td_sample_db
Table : td_sample_table
Config
---
// Displayed load.yml configuration.
td connector:history shows the execution history of a schedule entry. To investigate the results of each individual execution, use td job <jobid>.
Delete the Schedule
td connector:delete removes the schedule.
$ td connector:delete daily_google_sheets_import
Obtain required Google API credentials
For information on how to obtain Google API credentials, refer to the following:
You can import Salesforce Marketing Cloud (beta) Data Source objects into Treasure Data using this data connector. Contact us for further details about the private beta.
To see sample workflows for importing Salesforce Marketing Cloud data, go to Treasure Boxes.
You can use the same connection to export SFMC data. See Exporting Data to SFMC (BETA).
Prerequisites
Basic knowledge of Treasure Data
Basic knowledge of Salesforce Marketing Cloud
How to create an install package in Salesforce Marketing Cloud
After August 1, 2019, you cannot create a Legacy Package in Salesforce Marketing Cloud; therefore, we highly recommend that you create an Enhanced Package.
How to create legacy package
Log on to your Salesforce Marketing Cloud account
On the Welcome Page, click your name in the top right corner, then click Setup.
In the left-side menu of the new screen, click App > Installed Packages.
On the Installed Packages screen, click New.
On the New Package Details pop-up, enter the Name and Description. Uncheck (deselect) the checkbox Create with enhanced functionality (recommended), then click Save.
On the Package Details screen, click Add Component.
On the Add Component pop-up, select API Integration, then click Next.
On the Add API Integration screen:
Scroll to the CHANNELS section and select the Read checkbox on Email.
Scroll to the CONTACT section and select the Read checkbox on Audiences and List and Subscribers.
Scroll to the DATA section and select the Read and Write checkboxes on Data Extensions, and Read for Tracking Events.
Scroll to the HUB section and select the Read checkbox on Campaign.
On the Installed Packages screen, scroll down to the Components panel, then take note of the Client Id and Client Secret. You use this information to write data from Treasure Data to Salesforce Marketing Cloud.
How to create enhanced functionality package
Log on to your Salesforce Marketing Cloud account.
On the Welcome Page, click your name in the top right corner, then click Setup.
In the left-side menu of the new screen, click App > Installed Packages.
On the Installed Packages screen, click New.
On the New Package Details pop-up, enter the Name and Description and select the checkbox Create with enhanced functionality (recommended). Then click Save.
On the Add Component pop-up, select API Integration, then click Next.
On the Choose Your Integration Type screen, select Server-to-Server, then click Next.
On the Set Server-to-Server Properties screen:
Scroll to the CHANNELS section and select the Read checkbox on Email.
Scroll to the CONTACT section and select the Read checkbox on Audiences and List and Subscribers.
Scroll to the DATA section and select the Read and Write checkboxes on Data Extensions, and Read for Tracking Events.
Scroll to the HUB section and select the Read checkbox on Campaign.
On the Installed Packages screen, scroll down to the Components panel, then take note of the Client Id, Client Secret and Authentication Base URI. You will use this information to write data from Treasure Data to Salesforce Marketing Cloud.
Use the TD Console to create your connection
You can use TD console to create your data connector.
Create a new connection
When you configure a data connection, you provide authentication to access the integration. In Treasure Data, you configure the authentication and then specify the source information.
Go to Integrations Hub > Catalog, then search for and select Salesforce Marketing Cloud.
Create Legacy Package integration
On the New Authentication screen, select Package Type as Legacy Package, then enter the Client Id and Client Secret (which you obtained when you created the legacy package in SFMC). Then click Continue.
Enter a Name for this integration and click Done.
Create Enhanced Package integration
On the New Authentication screen, select Package Type as Enhanced Functionality Package, then enter the Client Id, Client Secret and Authentication Base URI (which you obtained when you created the enhanced package in SFMC). Optionally, you can specify an Account Identifier or MID to access multiple BUs (more detail in How to get MID) and a Scope to limit the token's scope (more detail in API scopes). Then click Continue.
Enter a Name for this integration and click Done.
Create a new transfer
After creating the connection, you are automatically taken to Sources. Look for the connection you created and click New Transfer.
The following dialog opens. Complete the details and click Next.
Next, you see a Preview of your data similar to the following dialog. To make changes, click Advanced Settings otherwise, click Next.
From here, if you want to change some options such as skipping on errors or rate limits, click Advanced Settings:
Select the database and table where you want to transfer the data:
Specify the schedule of the data transfer using the following dialog and click Start Transfer:
You will see the new data transfer in progress listed under the My Input Transfers tab, and a corresponding job is listed in the Jobs section.
Use Command Line
Install td command v0.11.9 or later
You can install the newest Treasure Data Toolbelt.
$ td --version
0.15.3
Create Configuration File
Prepare a configuration file (for example, load.yml) as shown in the following example, with your Salesforce Marketing Cloud account access information.
in:
  type: salesforce_marketing_cloud
  client_id: xxxxxxxxxx
  client_secret: xxxxxxxxxx
  target: campaign   # required, see Appendix B for the list of available targets
out:
  mode: replace
This example dumps Salesforce Marketing Cloud Campaign Data Source:
client_id: Salesforce Marketing Cloud client id.
client_secret: Salesforce Marketing Cloud client secret.
target: Salesforce Marketing Cloud entity object to be imported.
See Appendix B for the list of available targets.
For more details on available out modes, see Appendix A.
(optional): Preview data to import
You can preview data to be imported using the command td connector:preview.
$ td connector:preview load.yml
+-----------------+---------------------+--------------------+----
| id:long | name:string | description:string | ...
+-----------------+---------------------+--------------------+----
| 42023 | "Hello" | apps |
| 42045 | "World" | apps |
+-----------------+---------------------+--------------------+----
Execute Load Job
Submit the load job. It may take a couple of hours depending on the data size. Users need to specify the database and table where their data is stored.
It is recommended to specify the --time-column option, since Treasure Data's storage is partitioned by time (see also data partitioning). If the option is not given, the data connector selects the first long or timestamp column as the partitioning time. The type of the column specified by --time-column must be either long or timestamp.
If your data doesn't have a time column, you can add one using the add_time filter option. More details are in the add_time filter plugin documentation.
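Below is a minimal sketch of how the add_time filter might look in load.yml, assuming you want a time column populated with the upload time (the exact settings are illustrative; check the add_time filter plugin documentation for your case):
filters:
  - type: add_time
    to_column:
      name: time
      type: timestamp
    from_value:
      mode: upload_time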
$ td connector:issue load.yml --database td_sample_db --table td_sample_table --time-column modifieddate
The preceding command assumes you have already created the database (td_sample_db) and table (td_sample_table). If the database or the table does not exist in TD, this command will not succeed, so create the database and table manually or use the --auto-create-table option with the td connector:issue command to create them automatically:
$ td connector:issue load.yml --database td_sample_db --table td_sample_table --time-column modifieddate --auto-create-table
You can assign a time-format column as the partitioning key by using the --time-column option.
Scheduled execution
You can schedule periodic data connector execution for periodic Salesforce Marketing Cloud import. We configure our scheduler carefully to ensure high availability. By using this feature, you no longer need a cron daemon on your local data center.
Create the schedule
A new schedule can be created using the td connector:create command. The name of the schedule, cron-style schedule, the database and table where their data will be stored, and the Data Connector configuration file are required.
$ td connector:create \
daily_salesforce_marketing_cloud_import \
"10 0 * * *" \
td_sample_db \
td_sample_table \
load.yml
The `cron` parameter also accepts these three options: `@hourly`, `@daily` and `@monthly`.
By default, the schedule is set up in the UTC timezone. You can set the schedule in a different timezone using the -t or --timezone option. The `--timezone` option only supports extended timezone formats like 'Asia/Tokyo', 'America/Los_Angeles' etc. Timezone abbreviations like PST, CST are *not* supported and may lead to unexpected schedules.
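For example, to run this import on a Tokyo schedule you might add the timezone option when creating it (a sketch reusing the schedule above):
$ td connector:create \
    daily_salesforce_marketing_cloud_import \
    "10 0 * * *" \
    td_sample_db \
    td_sample_table \
    load.yml \
    -t Asia/Tokyo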
List the Schedules
You can see the list of currently scheduled entries by td connector:list.
$ td connector:list
+-----------------------------------------+--------------+----------+-------+--------------+-----------------+----------------------------------------------+
| Name | Cron | Timezone | Delay | Database | Table | Config |
+-----------------------------------------+--------------+----------+-------+--------------+-----------------+----------------------------------------------+
| daily_salesforce_marketing_cloud_import | 10 0 * * * | UTC | 0 | td_sample_db | td_sample_table | {"type"=>"salesforce_marketing_cloud", ... } |
+-----------------------------------------+--------------+----------+-------+--------------+-----------------+----------------------------------------------+
Show the Setting and History of Schedules
td connector:show shows the execution setting of a schedule entry.
% td connector:show daily_salesforce_marketing_cloud_import
Name : daily_salesforce_marketing_cloud_import
Cron : 10 0 * * *
Timezone : UTC
Delay : 0
Database : td_sample_db
Table : td_sample_table
td connector:history shows the execution history of a schedule entry. To investigate the results of each individual execution, use td job <jobid>.
% td connector:history daily_salesforce_marketing_cloud_import
+--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+
| JobID | Status | Records | Database | Table | Priority | Started | Duration |
+--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+
| 578066 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-18 00:10:05 +0000 | 160 |
| 577968 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-17 00:10:07 +0000 | 161 |
| 577914 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-16 00:10:03 +0000 | 152 |
| 577872 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-15 00:10:04 +0000 | 163 |
| 577810 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-14 00:10:04 +0000 | 164 |
| 577766 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-13 00:10:04 +0000 | 155 |
| 577710 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-12 00:10:05 +0000 | 156 |
| 577610 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-11 00:10:04 +0000 | 157 |
+--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+
8 rows in set
Delete the Schedule
td connector:delete removes the schedule.
$ td connector:delete daily_salesforce_marketing_cloud_import
Incremental Loading
For Data Extensions
Treasure Data supports incremental loading for Data Extensions that have a date field.
If incremental: true is set, the data connector loads records according to the range specified by the from_date and the fetch_days for the specified date field.
For example:
incremental_column_name: mydatefield
from_date: "2018-02-01T00:00:00.000Z"
fetch_days: 2
1st iteration: The data connector fetches records from Feb 01 00:00:00 UTC 2018 to Feb 03 00:00:00 UTC 2018.
2nd iteration: The data connector fetches records for the next available 2-day period, from Feb 03 00:00:00 UTC 2018 until Feb 05 00:00:00 UTC 2018. This process repeats for each successive iteration.
When the increment includes the present date, additional records are fetched as each complete time period becomes available.
If incremental: false is set, the data connector loads all records for the specified target. This is a one-time activity.
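A minimal sketch of the corresponding in: section for an incremental Data Extension import (the extension name and date field are placeholders; the full parameter list is in Appendix C):
in:
  type: salesforce_marketing_cloud
  client_id: xxxxxxxxxx
  client_secret: xxxxxxxxxx
  target: data_extension
  data_extension_names:
    - my_data_extension          # placeholder name
  incremental: true
  incremental_column_name: mydatefield
  from_date: "2018-02-01T00:00:00.000Z"
  fetch_days: 2
out:
  mode: append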
For Email Events
Treasure Data supports incremental loading for Email Events based on their event date.
If incremental: true is set:
1st iteration: the connector loads the data from the specified Start Time (from all time if not set) to the specified End Time (the time the job was scheduled to run, if not set).
2nd iteration: the connector loads the data from the previous End Time to the time the job is scheduled to run.
If incremental: false is set, the data connector loads all records for the specified target. This is a one-time activity.
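A minimal sketch of an incremental Email Events in: section, assuming you only want events from a given start time onward (see Appendix C for the full parameter list):
in:
  type: salesforce_marketing_cloud
  client_id: xxxxxxxxxx
  client_secret: xxxxxxxxxx
  target: email_event
  email_event_types:
    - value: all
  start_time: "2019-03-26T07:20:00.000Z"
  incremental: true
out:
  mode: append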
Appendix
A) Modes for out plugin
You can specify the file import mode in the out section of load.yml.
append (default)
This is the default mode and records are appended to the target table.
in:
...
out:
mode: append
replace (In td 0.11.10 and later)
This mode replaces data in the target table. Any manual schema changes made to the target table remain intact with this mode.
in:
...
out:
mode: replace
B) Available targets
The available targets are:
Campaign - the email campaign
Contact - the contact list
Data Extensions - data extensions, for flexible data storage
Email Events - the events associated with an email marketing send
The target Contact provides an option to ingest data in multiple requests in case you have too many contact model attributes. If you select this option, you also input the number of attributes per request. The default value of this field is 100.
In some cases, your data breaks because some attributes are faulty. You can use the option Attribute set names will be ignored to skip faulty attributes. This option helps you ingest data without errors from Salesforce Marketing Cloud's API.
The target Data Extensions provides an option to filter the data extensions that you want to ingest and another option to enable ingesting shared data extensions.
The target Email Events provides options to filter by:
Email name matching pattern (exact or partial match)
Email event types that you want to ingest
Two options that define a date range, so that only events within a specific datetime range are ingested.
C) Sample configuration for each target
For Campaign target
in:
  type: salesforce_marketing_cloud
  client_id: XXXXXX
  client_secret: XXXXXX
  target: campaign
out:
  mode: replace
Parameters:
type: must be salesforce_marketing_cloud
client_id: Salesforce Marketing Cloud Client ID
client_secret: Salesforce Marketing Cloud Client Secret
target: must be campaign (in this case)
For Contact target
in:
  type: salesforce_marketing_cloud
  client_id: XXXXXX
  client_secret: XXXXXX
  target: contact
  contact_page_size: 50
  contact_attributes_request_limit: 100
  contact_multiple_requests: true
  ignore_attribute_set_names:
    - attribute_1
    - attribute_2
out:
  mode: replace
Parameters:
type: must be salesforce_marketing_cloud
client_id: Salesforce Marketing Cloud Client ID
client_secret: Salesforce Marketing Cloud Client Secret
target: must be contact (in this case)
contact_page_size: Number of records per page for the contact target. This option is useful when you have a large amount of data. Default: 1000
contact_attributes_request_limit: Number of attributes to ingest for each contact in a single request. When the number of attributes is greater than this value, the request is split into multiple requests. Default: 100
contact_multiple_requests: Enable the plugin to ingest contact details with many attributes in multiple requests (only takes effect when the number of attributes is greater than 100). Default: false
ignore_attribute_set_names: List of attributes to ignore. Useful when you want to eliminate unnecessary attributes. Default: null
For Data Extension target
in:
  type: salesforce_marketing_cloud
  client_id: XXXXXX
  client_secret: XXXXXX
  target: data_extension
  data_extension_names:
    - data_extension_1
    - data_extension_2
  incremental: true
  shared_data_extension: true
  incremental_column_name: date
  from_date: "2018-11-30T17:00:00.000Z"
  fetch_days: 1
out:
  mode: replace
Parameters:
type: must be salesforce_marketing_cloud
client_id: Salesforce Marketing Cloud Client ID
client_secret: Salesforce Marketing Cloud Client Secret
target: must be data_extension (in this case)
data_extension_names: List of data extension names you want to ingest. Leave null to ingest all. Default: null
shared_data_extension: Set this flag to true when you want to ingest data from shared data extensions. Default: false
incremental: Set this flag to true if you only want to get newer data between each scheduled run. Default: false
incremental_column_name: Column used for incremental loading. Must be set if incremental = true. Default: null
from_date: Ingest data starting from this date. Default: null
fetch_days: Duration to ingest, from from_date to from_date + fetch_days. Default: 1
For Email Event target
in:
  type: salesforce_marketing_cloud
  client_id: XXXXXX
  client_secret: XXXXXX
  target: email_event
  search_term: "email name"
  search_mode: "exact"
  email_event_types:
    - value: all
  start_time: "2019-03-26T07:20:00.000Z"
  end_time: "2019-03-26T07:53:00.000Z"
  incremental: true
out:
  mode: replace
Parameters:
type: must be salesforce_marketing_cloud
client_id: Salesforce Marketing Cloud Client ID
client_secret: Salesforce Marketing Cloud Client Secret
target: must be email_event (in this case)
search_term: Name of the email you want to ingest. Default: null
search_mode: Mode for the name matching pattern. Must be exact or partial only. Default: exact
email_event_types: List of event types you want to import, each as an item of the form "- value: <type>". Valid types: all, bounce, click, forward, forwardOptIn, open, sent, survey, unsub. Default: null
start_time: Ingest data from this time. Default: null
end_time: Ingest data up to this time. Default: null
incremental: Set this flag to true if you only want to get newer data between each scheduled run. Default: false
You can connect Salesforce with Treasure Data for greater control over your Salesforce data and better integration with the other business applications in your marketing and sales operations stack.
Integrating Salesforce with Treasure Data makes it easy to:
Add new features to Salesforce. For example, you can prevent churn by tracking web usage and receiving alerts when customers' product usage declines.
Use Salesforce data to improve other parts of your marketing stack. For example, you can increase your Facebook Ads ROI by automatically removing new customers from your Facebook Custom Audiences.
For sample workflows on importing data from Salesforce, go to Treasure Boxes.
If you don't have a Treasure Data account yet, contact us so that we can get you set up.
Connect to Salesforce using the TD Console
You can configure your connection using TD Console. Instructions on how to connect using the command line are in the next section.
In Salesforce
Go to Setup -> Apps -> App Manager and click New Connected App. ( The steps may vary depending on the version of Salesforce you are running.)
In Spring 19 classic UI: Setup -> Build -> Create -> Apps -> Connected Apps -> New
From the Setup -> Build -> Create (Apps), you can see all your connected apps
Click on your app name to go to the page where you can view and manage all information about your connected app. From there, you can get your Consumer Key (client_id) and Consumer Secret (client_secret), which you need for the next step.
For secure account access, get a Salesforce Security Token if you don't already have one. To get a security token, go to Account -> Settings -> Reset My Security Token and click Reset Security Token. You'll receive your Security Token by email.
In Spring 19 classic UI: My account -> My Settings -> Personal -> Reset My Security Token
Create New Connection in TD Console
Go to Integrations Hub > Catalog and search for Salesforce.
In the dialog box, enter https://login.salesforce.com/ as the login URL.
If you are using the newer Salesforce v2 connector, make sure to remove unnecessary characters from the Login URL parameter. For example, instead of https://login.salesforce.com/?locale=jp, use https://login.salesforce.com/.
Enter your username (your email) and password, as well as your Client ID, Client Secret and Security Token.
Click Continue.
Give your connection a descriptive name and click Create Connection.
Authentication
The method you use to authenticate Treasure Data with Salesforce affects the steps you take to enable the data connector import from Salesforce.
Using Credentials
You need a client ID and client secret to authenticate using credentials. You can find more information on how to acquire new, or retrieve existing, client ID and client secret credentials.
Using Session ID
Not applicable for import. Although you see the Session ID option in the drop-down box, you cannot use it for import. Session ID can be used to authenticate for export (result output) only. See Writing Job Results into Salesforce.
Using OAuth2
Using OAuth is the most common authentication method.
Go to Integrations Hub > Catalog and search for Salesforce.
Click on the Salesforce connector. In the dialog to create a new connector, open the Authenticate method dropdown list and choose OAuth.
Click on the Click here to connect to a new account. You are redirected to the Salesforce login if you haven't logged in yet, or the consent page to grant access to Treasure Data.
After you grant access to Treasure Data, you are redirected back to TD Console. Choose the Salesforce connector again, then choose the OAuth Authenticate method as in the preceding step. You will see an OAuth connection with your account name in the dropdown list. Choose the account you want to use and then proceed to create the connection.
Specify parameters. The parameters are as follows:
Initial Retry Delay: The connector waits the number of seconds that you specify before retrying the import.
Retry Limit: During the import of data into Treasure Data, a number of factors can affect the process, such as network fluctuations and the concurrency limit on the Salesforce server. The connector retries the import the number of times you specify.
Your Salesforce account profile must be API Enabled. You can:
1. Setup -> Administer -> Manage Users -> Users
Find the profile your user is assigned.
2. Setup -> Administer -> Manage Users -> Profiles -> Edit -> Administrative Permissions
Select API Enabled.
If you are unable to edit a pre-defined profile in your organization, clone the profile to create a new profile and then edit the new profile.
Create Transfer
In Authentications, locate the Salesforce connection and click New Source.
Enter the name of the database you want to import in the Target field and click Next.
You see a preview of your data. At this stage, you can click on Advanced Setting to manually modify the SOQL, where condition and the schema.
Click Next to go to Transfer data from Salesforce.
If you are creating a new database, select Create new database and give your database a name. Complete the same steps for Create new table.
Select whether to append records to an existing table or replace your existing table.
If you want to set a different partition key seed rather than use the default key, you can specify one using the popup menu.
In the When tab, you can specify a one-time transfer, or you can schedule an automated recurring transfer. If you select Once now, click Start Transfer. If you select Repeat, specify your schedule options, and then click Schedule Transfer.
After your transfer has run, you can see the results of your transfer in the Databases tab.
Create Salesforce connection using the command line
Install the Treasure Data Toolbelt
Open a terminal and run the following command to install the newest Treasure Data Toolbelt.
$ td --version
0.15.3
In Salesforce
Go to Setup -> Apps -> App Manager and click New Connected App. ( The procedure may vary depending on what version of Salesforce you are running.)
Click Manage Connected Apps and get your Consumer Key and Consumer Secret, which you need for the next step.
For secure account access, if you don't have one already, you should get a Salesforce Security Token. To do this, go to Account -> Settings -> Reset My Security Token and click Reset Security Token. You'll receive your Security Token by email.
Create a Seed Config File (seed.yml)
Using a text editor, create a file called seed.yml. Copy and paste the information below, replacing the placeholder text with your Salesforce credentials. This configuration dumps the Account object specified in the target field because replace mode is specified. For more details on available out modes, see the Appendix.
in:
  type: sfdc
  username: "USERNAME"                      # your username
  password: "PASSWORD"                      # your password
  security_token: "SECURITY_TOKEN"          # your Security Token
  client_id: "CLIENT_ID"                    # your app's consumer key
  client_secret: "CLIENT_SECRET"            # your app's consumer secret
  login_url: https://login.salesforce.com/  # test.salesforce.com for sandbox
  incremental: false                        # 'full dump', see Scheduled execution below
  target: Account                           # Salesforce Object you want to import
out:
  mode: replace
All Salesforce Objects are supported for the target option. Some common objects include:
Opportunity
Contact
Lead
Account
Event
Task
(For more information about Salesforce Objects, refer to the full reference at the Salesforce Developer portal.)
The objects, fields and records accessible to you depend on your Salesforce license, security configuration and API access configuration in your Salesforce organization. Check your configurations if you see authorization errors in the next step.
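For example, to import Opportunities instead of Accounts you would change only the target line in the same seed file (a sketch; everything else stays as shown above):
in:
  type: sfdc
  # ... credentials and login_url as above ...
  incremental: false
  target: Opportunity
out:
  mode: replace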
Run the Guess Fields command (generate load.yml)
Run the following command in your terminal:
$ td connector:guess seed.yml -o load.yml
Connector:guess automatically reads the target data and intelligently guesses the data format.
Open the file load.yml, where you'll see guessed file format definitions including, in some cases, file formats, encodings, column names, and types.
in:
  type: sfdc
  username: "USERNAME"
  password: "PASSWORD"
  security_token: "SECURITY_TOKEN"
  client_id: "CLIENT_ID"
  client_secret: "CLIENT_SECRET"
  login_url: https://login.salesforce.com/
  ...
filters:
  ...
out:
  mode: replace
You can preview how the system will parse the file by using the preview command.
$ td connector:preview load.yml
If the system detects the wrong column name or type, modify load.yml and preview again.
The Data Connector supports parsing of boolean, long, double, string, and timestamp types.
Execute the load job
In your terminal, submit the load job as shown in the following example. Processing might take a couple of hours depending on the data size. Users need to specify the database and table where their data is stored.
$ td connector:issue load.yml --database td_sample_db --table td_sample_table \
--time-column createddate
The preceding command assumes you have already created the database (td_sample_db) and table (td_sample_table). If the database or the table does not exist in TD, the command will not succeed, so create the database and table manually or use the --auto-create-table option with the td connector:issue command, as shown below, to auto-create the database and table.
$ td connector:issue load.yml --database td_sample_db --table td_sample_table \
--time-column createddate --auto-create-table
You can assign a time-format column as the partitioning key by using the --time-column option.
If you want to ingest records that have been deleted in SFDC, specify include_deleted_records: true. By default, ingested records don't include records that you have previously deleted in Salesforce.
Here's an example of a load file using the include_deleted_records option:
in:
  type: sfdc
  username: "USERNAME"
  password: "PASSWORD"
  security_token: "SECURITY_TOKEN"
  client_id: "CLIENT_ID"
  client_secret: "CLIENT_SECRET"
  login_url: https://login.salesforce.com/
  include_deleted_records: true   # ingest records that have been deleted
  ...
filters:
  ...
out:
  mode: replace
Scheduled execution using the command line
You can schedule periodic Data Connector execution for periodic SFDC import. We manage our scheduler carefully to ensure high availability. By using this feature, you no longer need a cron daemon on your local data center.
For the scheduled import, Data Connector for SFDC imports all Objects that match the specified target.
Scheduled execution supports additional configuration parameters that control the behavior of the Data Connector during its periodic attempts to fetch data from SFDC:
incremental: This configuration is used to control the load mode, which governs how the Data Connector fetches data from SFDC based on one of the native timestamp fields associated with each Object.
incremental: true (default) In this mode, the data connector fetches only records of the specified Salesforce Object type that have been updated since the previous run of the connector. This mode is useful when you want to fetch just the Object targets that have changed since the previous scheduled run. This mode is typically combined with writing data into the destination table using append mode.
incremental_columns (required): This option is required for incremental mode, to load only the necessary data from SFDC. Make sure these columns are selected if you use a custom SOQL.
incremental: false In this mode, the data connector fetches all the records of the specified Salesforce Object type, regardless of when they were last updated. This mode is best combined with writing data into a destination table using replace mode.
soql: This configuration allows you to use a custom SOQL (Salesforce Object Query Language) query to filter data. With SOQL, you can retrieve data from a single object or from multiple objects that are related to each other. You can pick only the columns you are interested in and filter or sort the output with your own conditional statement.
where: This configuration allows you to add a filtering condition to the custom SOQL.
columns: This configuration is used to define a custom schema for the data to be imported into Treasure Data. You can define only the columns that you are interested in, but make sure they exist in the object that you are fetching; otherwise, these columns aren't available in the result.
last_record: This configuration is used to control the last record from the previous load job. It requires an object with a key for the column name and a value for the column's value. The key needs to match the SFDC (SOQL) column name.
Here's an example of a seed file using incremental mode combined with append mode for output.
in:
  type: sfdc
  username: "USERNAME"
  password: "PASSWORD"
  security_token: "SECURITY_TOKEN"
  client_id: "CLIENT_ID"
  client_secret: "CLIENT_SECRET"
  login_url: https://login.salesforce.com/
  target: Account
  use_rest: false
  incremental: true
  incremental_columns:
    - LastModifiedDate
out:
  mode: append
The preceding configuration has a column named LastModifiedDate, which is used to load records incrementally. The Data Connector will generate SOQL (Salesforce Object Query Language) queries from that configuration as shown in the following.
-- first execution:
SELECT Id, Name, ... FROM Account ORDER BY LastModifiedDate
-- second execution:
SELECT Id, Name, ... FROM Account WHERE (LastModifiedDate > 2015-10-21T10:33:07+0000) ORDER BY LastModifiedDate
-- '2015-10-21T10:33:07+0000' is the latest modified record time from first execution
-- third execution:
SELECT Id, Name, ... FROM Account WHERE (LastModifiedDate > 2015-10-30T22:30:41+0000) ORDER BY LastModifiedDate
-- '2015-10-30T22:30:41+0000' is the latest modified record time from second execution
You can see that the WHERE clause was updated on every execution to load modified records only.
Here's an example of a seed file using non-incremental mode (incremental: false) combined with replace mode for output.
in:
  type: sfdc
  username: "USERNAME"
  password: "PASSWORD"
  security_token: "SECURITY_TOKEN"
  client_id: "CLIENT_ID"
  client_secret: "CLIENT_SECRET"
  login_url: https://login.salesforce.com/
  incremental: false
  target: Account
out:
  mode: replace
With the preceding configuration, the Data Connector always loads all records and replaces all existing records with loaded records.
The following is an example of a seed file using incremental mode and last_record combined with append mode for output.
in:
  type: sfdc
  username: "USERNAME"
  password: "PASSWORD"
  security_token: "SECURITY_TOKEN"
  client_id: "CLIENT_ID"
  client_secret: "CLIENT_SECRET"
  login_url: https://login.salesforce.com/
  target: Account
  incremental: true
  incremental_columns:
    - LastModifiedDate
  last_record:
    - {"key": "LastModifiedDate", "value": "2017-01-01 00:00:00UTC"}
out:
  mode: append
In the preceding configuration, Data Connector loads all records since 2017-01-01 00:00:00.
Here's an example of a seed file using incremental mode combined with a custom SOQL, a where condition, and a columns schema.
in:
  type: sfdc
  username: "USERNAME"
  password: "PASSWORD"
  security_token: "SECURITY_TOKEN"
  client_id: "CLIENT_ID"
  client_secret: "CLIENT_SECRET"
  login_url: https://login.salesforce.com/
  target: Account
  incremental: true
  incremental_columns:
    - LastModifiedDate
  soql: "select Id, Name, Description, LastModifiedDate from Account"
  where: "ownership = 'Public'"
  columns:
    - name: Id
      type: string
    - name: Name
      type: string
    - name: Desc
      type: string
    - name: LastModifiedDate
      type: timestamp
out:
  mode: append
As you can see, the SOQL selects fewer columns, with a filter on the ownership field. The schema also defines a field name that does not exist in the SOQL (i.e., Desc), so this column won't be available in the imported result.
Create the schedule
A new schedule can be created by using the td connector:create command. The name of the schedule, cron-style schedule, database and table where the data will be stored, and the data connector configuration file are required.
$ td connector:create \
daily_sfdc_import \
"10 0 * * *" \
td_sample_db \
td_sample_table \
load.yml
The cron parameter also accepts three special options: @hourly, @daily and @monthly.
By default, the schedule is set in the UTC time zone. You can set the schedule in a time zone using the `-t` or `--timezone` option. The `--timezone` option supports only extended timezone formats like 'Asia/Tokyo', 'America/Los_Angeles' etc. Timezone abbreviations like PST, CST are *not* supported and may lead to unexpected schedules.
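For example, to run the daily SFDC import on a Tokyo schedule you might add the option when creating it (a sketch reusing the command above):
$ td connector:create \
    daily_sfdc_import \
    "10 0 * * *" \
    td_sample_db \
    td_sample_table \
    load.yml \
    --timezone Asia/Tokyo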
List the Schedules
You can see the list of scheduled entries by entering the command td connector:list.
$ td connector:list
+-------------------+--------------+----------+-------+--------------+-----------------+------------------------+
| Name | Cron | Timezone | Delay | Database | Table | Config |
+-------------------+--------------+----------+-------+--------------+-----------------+------------------------+
| daily_sfdc_import | 10 0 * * * | UTC | 0 | td_sample_db | td_sample_table | {"type"=>"sfdc", ... } |
+-------------------+--------------+----------+-------+--------------+-----------------+------------------------+
Show the Setting and History of Schedules
td connector:show shows the execution setting of a schedule entry.
% td connector:show daily_sfdc_import
Name : daily_sfdc_import
Cron : 10 0 * * *
Timezone : UTC
Delay : 0
Database : td_sample_db
Table : td_sample_table
td connector:history shows the execution history of a schedule entry. To investigate the results of each individual execution, use td job <jobid>.
% td connector:history daily_sfdc_import
+--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+
| JobID | Status | Records | Database | Table | Priority | Started | Duration |
+--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+
| 578066 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-18 00:10:05 +0000 | 160 |
| 577968 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-17 00:10:07 +0000 | 161 |
| 577914 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-16 00:10:03 +0000 | 152 |
| 577872 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-15 00:10:04 +0000 | 163 |
| 577810 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-14 00:10:04 +0000 | 164 |
| 577766 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-13 00:10:04 +0000 | 155 |
| 577710 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-12 00:10:05 +0000 | 156 |
| 577610 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-11 00:10:04 +0000 | 157 |
+--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+
8 rows in set
Delete the Schedule
td connector:delete removes the schedule.
$ td connector:delete daily_sfdc_import
Appendix
A) Modes for out plugin
You can specify the file import mode in the out section of seed.yml.
append (default)
This is the default mode. Records are appended to the target table.
in:
...
out:
mode: append
replace (In td 0.11.10 and later)
This mode replaces data in the target table. Any manual schema changes made to the target table remain intact with this mode.
in:
...
out:
mode: replace
B) Use synchronous transfer only
There are 2 options to ingest records from Salesforce to Treasure Data:
Bulk API
REST API
The Bulk API offers faster ingestion but also has a limit of 10,000 batch allocations within a 24-hour period. If your target is large, your entire available batch allocation might be consumed, eventually causing your job to fail.
You might encounter the batch allocation limitation during the first ingestion of your data, when all of the records are ingested.
If that ingestion of all records is one of your use cases, consider enabling the synchronous transfer only option and using the REST API. The use of the REST API avoids the batch allocation limitation but might be slower.
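On the command line, the equivalent appears to be the use_rest flag shown in the incremental example earlier; a minimal sketch, assuming use_rest: true switches the connector to the REST API instead of the Bulk API:
in:
  type: sfdc
  username: "USERNAME"
  password: "PASSWORD"
  security_token: "SECURITY_TOKEN"
  client_id: "CLIENT_ID"
  client_secret: "CLIENT_SECRET"
  login_url: https://login.salesforce.com/
  target: Account
  use_rest: true   # assumption: use the REST API (synchronous transfer) rather than the Bulk API
out:
  mode: replace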
Access control settings are managed within TD Console.
Account owners and administrators have additional control over access to resources and capabilities.
The following access controls pertain to all account team members.
REST APIs Access
REST API access is controlled through API keys. Almost every REST API call needs to be issued with a valid API key for authentication and resource authorization purposes.
There are two types of API keys: Master and Write-only. By default, every new user is created with one Master and one Write-only API key; however, any user can generate any number of the two types of API keys. Any of the API Keys can be revoked at any time by the user themselves or any user having Manage User permissions.
Master: This is the traditional type of API key, and it can be used to perform all permitted operations listed in the table below (see Actions You Perform with TD Console or Master API Key) based on the user's permission level and access, without exception.
Write-only: This API key type provides an additional layer of security in controlling access to a Treasure Data account through the REST APIs, especially when access has to be provided to third parties or API keys need to be embedded in ingestion libraries (see, for example, Web Page integrations). Based on the permissions and access levels associated with a user, the user's Write-only API key only allows importing data into Treasure Data, and only into the databases the user has write access to.
Read how you can configure your keys.
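As an illustration of how a key is used, REST calls are authenticated by sending the key in the Authorization header. For example, a read request such as listing databases (shown below with a placeholder key) would require a Master key rather than a Write-only key:
$ curl -H "Authorization: TD1 YOUR_MASTER_API_KEY" \
    https://api.treasuredata.com/v3/database/list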
Actions According to Access Level
This section describes access at the most basic level. Policies and permissions do not change these access rules.
Actions You Perform with TD Console or Master API Key
The following table lists actions that can be performed through either TD Console or the REST APIs. Some actions might not be available at the REST API level and some other actions (for example, Data Import) are not available within TD Console.
Action
Owner
Admin
Full Access
Query only
Import only
Add User
OK
OK
Manage User
OK
OK (1)
Delete User
OK
OK (1)
List Databases
OK
OK
OK
OK
OK
Create Database
OK
OK
n/a (2)
n/a (2)
n/a (2)
Manage Database
OK
OK
OK (2)
OK (2)
OK (2)
Delete Database
OK
OK
OK (2)
OK (2)
OK (2)
Show Table
OK
OK
OK
OK
OK
List Tables
OK
OK
OK
OK
Create Table
OK
OK
OK
Delete Table
OK
OK
OK
Data Import (td-agent)
OK
OK
OK
OK
Data Import (Result Output to TD)
OK
OK
OK
OK
Data Import (Bulk Import)
OK
OK
OK
OK
Data Import (embulk-output-td)
OK
OK
OK
Data Import (Data Connector)
OK
OK
OK
OK
Data Import (FileUploader v2)
OK
OK
OK
OK
Data Import (Insert Into)
OK
OK
OK
(3)
Delete Data
OK
OK
OK
Issue Query
OK
OK
OK
OK
Kill Own Query
OK
OK
OK
OK
Kill Query from Others
OK
OK
OK
(4)
Export Table
OK
OK
OK
OK
Notes:
Administrator users can only Manage User and Delete User for Restricted users; they are not allowed to manage or delete other Administrator users or the account Owner. The Manage User permission includes granting or revoking the Administrator role; therefore, an Administrator user can promote a user to Administrator but cannot demote a user from Administrator to Restricted user.
Any user (including Restricted ones) can create a new database, and they will own and have all permissions for it. Full Access, Query-only, and Import-only actions for Create Database don't apply in that case. Restricted users can only Delete and Manage databases they created (and therefore own). Administrators and the Owner can always manage databases.
While the end goal of INSERT INTO is to write the result back into a table, it requires special permissions. The executing user must possess read (Query-only, Full Access, Admin, or Owner) permissions for all the databases accessed by the query, as well as read and write permission (Full Access, Admin, or Owner) to the database the result is inserted into; the query will fail otherwise. See the example after these notes.
Restricted users with Query-only permission can see all the jobs running on the databases they have Query-only permission for, but will not be able to kill a job unless it is their own.
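As an illustration of note 3, a hypothetical query like the following would need read access to td_source_db and write access to td_dest_db (the database and table names are placeholders):
-- requires Query-only (or higher) on td_source_db
-- and Full Access (or higher) on td_dest_db
INSERT INTO td_dest_db.td_dest_table
SELECT time, user_id
FROM td_source_db.td_sample_table;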
Actions You Perform with TD Console or the Write-only API Key
Action
Owner
Admin
Full Access
Query only
Import only
Add User
Manage User
Delete User
List Databases
Create Database
OK
OK
n/a (1)
n/a (1)
n/a (1)
Manage Database
Delete Database
List Tables
Create Table
OK
OK
OK
OK
Delete Table
Data Import (td-agent)
OK
OK
OK
OK
Data Import (Result Output to TD)
OK
OK
OK
OK
Data Import (Bulk Import)
(2)
(2)
(2)
(2)
Data Import (embulk-output-td)
(2)
(2)
(2)
Data Import (Data Connector)
(2)
(2)
(2)
(2)
Data Import (Insert Into)
(3)
(3)
(3)
(3)
(3)
Issue Query
Kill Own Query
Kill Query from Others
Export Table
Notes:
Any user (including Restricted ones) can create a new database, and they will own and have all permissions for it. Full Access, Query-only, and Import-only permissions for Create Database don't apply in that case. Restricted users can only Delete and Manage databases they created (and therefore own). Administrators and the Owner can always manage databases.
Bulk Import requires the ability to check the status of a job, and this is not possible using a Write-only API key.
INSERT INTO requires the ability to execute a query, which is not allowed using a Write-only API key.
IP Whitelist
IP whitelists restrict access to Treasure Data accounts through specified sets of IP addresses. There are two types of whitelists: default account IP whitelist and user IP whitelist.
The Account Owner and Administrators set the default account IP whitelist and user IP whitelists. A user IP whitelist is viewable from a user's profile page. Restricted users can view but cannot set their own IP whitelist.
Refer to the article IP Whitelist for more information.
User Permissions are managed within TD Console > Administration. Only account owners and administrators can see this view.
Permissions for Databases:
In a Standard account: Yes
With the Policies feature: Yes. Set at the user level; not part of a policy.
Permissions for Workflows:
In a Standard account: Yes
With the Policies feature: Yes. Set as part of a policy.
Permissions for Master Segments in Data Workbench:
In a Standard account: Option is not viewable. Included as part of Audience Studio.
With the Policies feature: Yes - you can set more restrictions.
Permissions for Audience Studio:
In a Standard account: Yes. Full Access or None.
With the Policies feature: Yes - you can set more restrictions.
The User page displays the users associated with an account.
Permissions without Policies
You click Users > Permissions to specify or modify the permissions of account users.
Permissions in the Policies Feature
For TD accounts with the Policies feature, you click Users > Policies > Show All Permissions to view permissions that are associated with account users. You cannot assign permissions to users in this view.
To assign users to permissions, you click Policies. You create policies that specify permissions. You then assign users to policies. Policies are collections of permissions.
User Permissions
Local Database
You can specify users' access levels to databases in the account.
Import Only - can run import jobs against any database in the account
Query Only - can view and run queries against any database in the account
Full Access - can perform any operation on databases in the account
The access levels for databases mimic read & write, read, and write permissions.
On the My Settings > Database page, all users can specify permissions for databases that they have created. Only account owners and administrators can specify permissions for any database in the account.
Workflows and Audience Studio
Only account owners and administrators can view and specify permissions for Audience Studio and Workflows. For details on how permissions affect user actions in TD workflow, refer to the matrix. The matrix lists actions, both in TD Console and with commands in the TD Toolbelt CLI, that are controlled by permissions.
In the Permissions without Policies version, click a User view, and then click the Permissions tab. If you have the Policies feature, click Policies and go to the Permissions tab.
Users must be assigned specific permissions in the following areas:
Data Workbench
You can specify users' access to workflows and master segments. Workflow permissions are standard. Master segment permissions are only for accounts that include Audience Studio.
Workflows: the ability to view, edit, and run workflows.
View - view the list of all workflows
Run - run all user-defined workflows. Select View to also view the list of all workflows.
Edit - create and edit all user-defined workflows. Select View to also view the list of all workflows. Select Run to run any user-defined workflow.
If no permission is selected, then users in the policy do not have any workflow permissions.
Tip: If you plan to create a Master Segment, select View in Workflow so that you can view the workflow process of creating a Master Segment.
Permissions specified control workflow activities in TD console and the TD toolbelt command-line interface.
Master Segments: For accounts that include the Policy feature and Audience Studio.
View - users can only see Master Segments in the navigation pane. Users cannot edit.
Full control - Run, create and edit all Master Segments.
If no permission is selected, then users in the policy do not have any permissions for all master segments.
When you grant Full Control, you are also granting access to View, even if the View option is not selected or is deselected. With Full Control, you are enabling a user to view as well as create, edit, and run all Master Segments.
Tip: If you plan to create Master Segments, select View in Workflow so that you can view the workflow process of creating a Master Segment.
Audience Studio
If your account includes Audience Studio, you can specify users' access to Segment Folders, Profiles, Predictive Scoring and Profiles API Tokens.
None - users do not see Audience Studio in the navigation; users cannot view any Audience Studio features
Limited Access - account owners and administrators can select a master segment. From this view, the account owners and administrators can drill down to lower panes that enable selection of specific Segment folders.
Full access - users can view all features of Audience Studio, including, profiles, segment folders, predictive scoring, and API tokens
Segment Folders
When you specify Audience Studio: Limited Access and specify a specific master segment in which to configure permissions, you can specify:
For all folders
View - users can view all segment folders and their contents within the specified master segment
Full control - users can view, create, delete, and modify any segment folders and their content within the specified master segment; users can see profiles, predictive scoring, and API tokens within the specified master segment
For specific folders
View - users can view specific segment folders within the specified master segment
Full control - users can view and edit specific segment folders within the specified master segment; they cannot see profiles, predictive scoring, and API tokens associated with the segments
Frequently Asked Questions
Why doesn't my view show Master Segments? Why doesn't my view have Policies as an option?
Master Segments and Audience Studio are visible only if the features are included as part of your account.
As an administrator, how do I change my email address?
Administrators view and edit their own profiles in the Users view. Go to "My Profile" -> Personal Info and change your email address.
As an Owner, how do I change the account name or account email?
If you use Google SSO to access TD Console you must contact Support to change the account owner name or to change the account email listed in the Account Information page. Otherwise, to edit your account owner name or email, go to "My Profile" -> Personal Info.
As an account owner or administrator, you can define the criteria for Treasure Data user passwords. All users of your Treasure Data account must adhere to the set of criteria that you specify. When you change the password criteria, existing users receive a notification and must update their passwords if they do not already meet the updated criteria.
Specify the Password Policy
Account owners and Administrators can specify the account password policy from Control Panel > Administration > Login Settings.
Alternatively, you can access the password policy from the Admin section of TD Console. Click the Security tab to configure the password policy.
Password Configuration
Minimum Length
The minimum character length of user passwords. Default: 8
Expiration
Specifies how frequently users must change their password, in days. For example, the value '60 days' means the password must be changed every 60 days. Default: Never
History
Specifies how many previous passwords are saved for each user. Users are not allowed to reuse those saved passwords. Default: 6. For example, if you specify the value '4', users must create a password that does not match any of their last 4 passwords.
Complexity
Specifies the types of characters that a password must contain. Refer to the following image:
Login Configuration
Maximum Invalid Login Attempts
The number of incorrect attempts a user is allowed to make when logging in. When the maximum number is exceeded, the user receives an email that allows them to unlock their profile and attempt to log in again. Alternatively, the user can reset their password from the login pane. If the user takes no action, the profile is unlocked 1 hour after the last failed attempt.
User Session Timeout
Specifies how long a user can remain logged in while inactive. Default: 24 hours
Backward Compatibility
New accounts default to the password policy you specify. When you change any password criteria, all users receive an email notification explaining that the security requirements have changed and that they might have to change their password to meet the new criteria. If their current passwords already meet the changed criteria, users can log in without updating their password.
Account users also receive a notice when their current password expires and must be changed.
Password Reset
Account owners and Administrators do not reset passwords for users but can send a request for the user to update their password.
To send the request:
Go to the control panel > Users
Select a specific user
Click to edit the user personal information
Click Reset Password
IP Whitelist
You can specify IP addresses at the account level. You might want to specify an IP address when you expect most of your team to use a common IP address to access Treasure Data. If your navigation provides a control panel, you no longer specify the IP whitelist on the same page where you set the password and login configuration; you set system-level IP addresses from a separate page.
Arm Treasure Data provides permissions that enable account owners and administrators to control the access and capabilities of their users.
Permissions enable users to:
perform an operation, such as table modification or segment activation
act upon a specified entity, such as a database, or segment
The following illustration shows workflow permission options and the area of TD Console that the permissions affect.
You can read more information on Permission options.
You can also acquire Treasure Data's Policies feature to increase your level of control and security, and make it easier to manage user access to features and data.
Policies combine permissions together. You can configure multiple policies and assign one or more policies to one or more users.
If you change permissions within a policy, the effect is immediate (after the changes are saved) and the permission changes affect all the users who are assigned to the policy.
You can read more information on Policies options.
The Policies feature is available only to accounts that have purchased the feature. Owners and administrators manage access control policies in TD Console.
You grant users access rights through the use of policies. Policies combine permissions together.
Permission areas are:
Workflow
Master Segments in Data Workbench
Audience Studio, which includes segment folders and can also include profiles, predictive scoring, and API tokens
Databases are not part of policy permissions.
Planning your Policy Structure
Create your policy structure in a manner that works best for your organization. You might want to create most of your policies at one time, perhaps using a table or matrix to plan out the combinations of permissions. Or, you might create policies as the need arises.
As you create policies, consider:
Establishing a naming convention for your policies
In the policy name, you might want to include the scope or area of permission, the user role, or perhaps a vendor or region assignment. For example:
Full Control X Project EU
Local Admin VendorX Segments
Provide comprehensive descriptions
You can use the description field to list permissions and levels of access that are contained in the policy.
Refer to policy scenarios for ideas on how to structure your policies.
Defining Policies
In the Control Panel > Administration, click Policies to access the Policy work area. In this area, you can create, delete or apply policies.
Creating Policies
Click Actions > Create Policy and provide a meaningful name and description to your policy.
Click Permissions and then the edit pencil.
In the following example, you don't want policy members to view all master segments, and therefore leave the Configuration for All Master Segments options unselected:
You then select Limited Access with your ultimate goal in mind, specifying access to specific (not all) segments. You are specifying limited access to Audience Studio:
To specify specific segments, first select the Master Segment that contains the segment folders that you want to specify, and then click Add.
You have the option to enable users to view or have full control permission for the Test Master Segment but you opt not to grant that permission. You are designing this policy to grant permission only to a specific segment folder.
Click to edit.
Select the segments that you want to specify from the dropdown box. Click Add.
Specify the permissions for the segment folder that you added.
In the preceding example, users assigned to this policy have full control to one segment folder associated with the 'Test Master Segment'.
When you grant Full Control, you are also granting access to View, even if the View option is not selected or is deselected. With Full Control, you are enabling a user to view as well as create, edit, and activate all segments in the specified folder.
Frequently Asked Questions
What happens when you enable or disable the Policies feature
As the account owner, be aware of the actions that occur when you enable or disable policies.
Enabling the Policies Feature
When you enable the Policies feature, you must:
Create policies that contain specified permissions
Assign existing users to policies
Disabling the Policies Feature
When you disable the Policies feature, the Policies pane is no longer visible. To view the permissions for your users, you must:
Go to Administration > Users
Select to view a user.
Click Permissions to view and specify permissions.
Administrators can specify the maximum length of Web browser sessions.
If your navigation provides a control panel, you can specify the session timeout from Control Panel > Administration > Login Settings > Login Configuration.
Otherwise, click Admin > Security. Then click the pencil to edit.
The default value is 1 week. For example, with 1 week specified, users need to log in only once every 7 days. Only administrators can set the maximum session length for users.
When a user is deleted from your account, the resources created by that user are affected. The following table summarizes how resources are affected when the user who created them is deleted:
If an administrator deletes the user who owns an audience, the workflow of the audience will fail. To avoid this failure, request that the audience owner be changed to another user before deleting the original owner.
Object
Re-Assigned To
Status
Notes
API keys
Owner or an Administrator can specify reassignment
Deleted
The authority level of the deleted user is changed to the level of the re-assigned user. API keys that are deleted are invalid and cannot be reused.
Audience owner
n/a
Deleted
If you delete the user that is the owner of an audience, the workflow associated with the audience fails.
To fix, request a change of the audience owner to anyone other than the user who needs to be deleted.
Databases
Account Owner
Continuously available
The Account Owner is notified by email.
Tables
(None)
Continuously available
-
Authentications
(None)
Continuously available
-
Sources
Account Owner
Available
The Account Owner is notified by email. The schedule is deactivated. Re-enable it manually, as necessary.
Queries
Account Owner
Available
The Account Owner is notified by email. The schedule is deactivated. Re-enable it manually, as necessary.
Jobs
(None)
Continuously available
However, scheduled data transfers are stopped.
Workflows
(Owner is invisible)
Continuously available
Limitation: The Treasure Data related operators in the workflow (such as td>, td_run>) fail if the td.apikey secret is not set for projects that the deleted user created. To avoid failed runs, set the td.apikey secret. Refer to workflow secrets.
Permissions
(None)
n/a
Policies
(None)
Available
If a deleted user was the only user assigned, the policies are still visible but not active because no user is assigned.
The example policies described herein cover typical policy structures that manage:
a subsidiary or subvendor
employee security clearance levels that dictate access to data
project-centered, cross-functional teams that demand role-based policies
In these scenarios, we specify the database permissions in a separate table because database permissions are assigned at the user level, not the policy level.
Scenario 1
Your company works with a vendor, and the vendor is responsible for the marketing campaigns in a specific region. You want the vendor to view a specific master segment (a collection of user profiles based on a physical region) and work with full access to segments that are applicable to that region. The company manages multiple master segments, with each region having its own master segment. In this scenario, we are dealing with just one region.
The team does not work with databases or workflows.
Policy Plan
One policy.
User role:
Vendor - access to all segments within a specified master segment
User personas:
Jorge, Christian, Kerry, Zach, and Dominique are all Marketing Analysts. Permissions: View for only a specific master segment, Region 1; Full Control for specified segment folders.
Within the specified master segment, the vendor has full access, including the ability to create, modify, and delete all segment folders and segments in that master segment.
The vendor can view profiles and can use the predictive scoring and create profile APIs to feed data back into the segment.
Workflow is not enabled for this scenario and database activity is not applicable.
Policy Implementation
Policies and Permissions configuration for Scenario 1:
Policy Name
Workflow
All Master Segments in Data Workbench view
Audience Studio: Master Segment
Audience Studio: Specific Folders Segments
Policy Applied To:
WeCanCompany Region 1 Segment Access
Do not select any permission
Do not select any permission
Select Limited Access and select the Region 1 master segment from the drop-down list of master segments; select View permission for master segment config and select View and Full Control for All Folders
n/a
By selecting All Folders, you don't need to drill down to specify further permissions
Jorge
Christian
Kerry
Zach
Dominique
Database access for WeCan Company
User Name
Database Name
Level
Each user in WeCan Company
n/a
n/a
Scenario 2
You want to separate marketing analyst work areas by policies and permissions and apply additional restrictions to new analysts. The account includes two master segments.
Policy Plan
Four policies. Two policies for each master segment. Within each master segment, one policy is for regular users and a second policy is for new employees.
The team does not work with databases or workflows.
User roles:
Regular Marketing Analyst - a view of all master segments, full access to a specific master segment, and full access to all segment folders in that master segment.
New Marketing Analyst - no view of any master segments in Data Workbench; access to a specific master segment, and a mix of access to specified segment folders
User personas:
George, Silas, Reba: Regular Marketing Analysts - full control of the master segment Retail and full access to specified segment folders: current, churn, trend
Karen, Hoa, Rupa: Regular Marketing Analysts - full control of the master segment Commercial and full access to specified segment folders: current, churn, trend
Nicki, Ted: New Marketing Analyst - Access to the master segment Retail; full access to the segment folder current and view only to segment folders: churn and trend
Laura, Paul, Gina, Derek: New Marketing Analyst - Access to master segment Commercial; full access to the segment folder current and view only to segment folders: churn and trend
Regular marketing analysts can view the behaviors and attribute specifications for all master segments. Access to Audience Studio is limited to their specified master segment (either Retail or Commercial).
Within their master segment, they can view and act upon all features of Audience Studio, including profiles, segment folders, predictive scoring, and API tokens. They can create, delete, and modify all segment folders in the specified master segment (named current, churn, and trend); each folder contains multiple subfolders, and the subfolders contain multiple segments. Hence, regular marketing analysts can view, analyze, and activate segments.
New employees are assigned view access to specific segment folders (churn and trend). The intent is for new employees to monitor and report on segment data.
Workflow is not enabled for this scenario and database activity is not applicable.
Policy Implementation
Policies and Permissions configuration for Scenario 2:
Policy Name
Workflow
All Master Segments in Data Workbench view
Audience Studio: Master Segment
Audience Studio: Specific Folders Segments
Policy Applied To:
Retail Regular Analyst
Do not select any permission
View
Select Limited Access and select the Retail master segment; select View and Full Control permission for master segment config and in All Segment Folders: select View and Full Control
n/a
George, Silas, Reba
Retail New Analyst Access
Do not select any permission
Do not select any permission
Select Limited Access and select the Retail master segment; for master segment config do not select any permission. In All Segment Folders: Do not select any permission
Click to edit.
Folders and Segments:
Select View for segments: churn and trend
Nicki, Ted
Commercial Regular Analyst
Do not select any permission
View
Select Limited Access and select the Commercial master segment; select View and Full Control permission for master segment config and in All Segment Folders: select View and Full Control
n/a
Karen, Hoa, Rupa
Commercial New Analyst Access
Do not select any permission
Do not select any permission
Select Limited Access and select the Commercial master segment; for master segment config do not select any permission. In All Segment Folders: Do not select any permission
Click to edit.
Folders and Segments:
Select View for segments: churn and trend
Laura, Paul, Gina, Derek
Database access for Retail
User Name
Database Name
Level
Each marketing analyst
n/a
n/a
Scenario 3
You currently use Treasure Data and have a standard permission set. You are now adding the Policies feature and have more variance in how permissions must be set for users. This scenario focuses on the gaming group.
Current Permissions Plan
Current user roles:
Manager (CMO) - periodically checks master segment and segment data; guides data initiatives and campaigns
Data Engineers - work on data ingestion and extract-transform-load, and perform some data cleansing. Require access to databases (full access or import only), workflows for ingesting (view, edit, run). Data Engineer is the only role who works with workflows.
Data Analysts - review data across different sets, clean data by using job queries, and create master segments. Merge data or create new dns (domain name system) databases, and have view access to all other databases. Require full access to master segments in Data Workbench.
Marketing Analysts - review master segments, create segments and subsegments and prepare segments for activation. Requires full access to only specific master segments. Must have full access to segments within their specified master segment.
Policy Plan
Establish policies. Distinguish employees from contractors.
Four policies for the Gaming group. Plus a policy for the Chief Marketing Officer (CMO) of the company.
User roles:
Marketing management (CMO and executive team) - full access to all master segments in Data Workbench, full Audience Studio access
Data Engineer Gaming Group - full access to specific databases; full access to workflow. No access to Audience Studio.
Data Analyst Gaming Group - a variety of database permissions, which is OK because database permissions are specified at the user permissions (not policy) level. View only for workflow and full access to Data Workbench master segments. Limited access to Audience Studio, with access to only Gaming segments. Data Analysts don't create or activate segments.
Marketing Analyst Gaming Group - view of all master segments. Requires full access to only specific master segments, but within those segments, full control with full permission to create segments and subsegments and prepare segments for activation.
Consultant Marketing Analyst - full access to a specific master segment, and full access to all folders in that master segment.
User personas:
Kelly: Manager (CMO) - full access to all master segments, including Gaming. Also has full access in Audience Studio.
Cam: Director of Gaming
Cathy, Steve, Eke: Data Engineers - full access to specific databases: us_customers, us_support, and us_marketing; full access to workflow. No access to Audience Studio.
Josh: Data Analyst - database query permission: us_customers, us_support, and us_marketing; view only for workflow, full access to Data Workbench master segments, and can look at Gaming segments in Audience Studio
August, Gina: Marketing Analysts - limited access to Data Workbench master segments, with full access to the master segment US-West and view access to all other master segments (US-Central, US-East, Asia-East, Europe-West). Full access to segments in US-West as well.
Petra: Consultant Marketing Analyst - limited access to master segments in Data Workbench, with full access to only the US-West master segment. Full access to the California segment folder, and view access to the other folders in the US-West master segment (Oregon, Arizona).
The marketing analysts can view profiles and can use the predictive scoring and create profile APIs to feed data back into the segments that they have permission to access. In other words, they can view, analyze, and activate segments.
Workflow is enabled for two policies in this scenario: Data Engineer - full access and Marketing Analyst - view only.
The data analyst and marketing analyst are also given permission by other users to access databases.
Policy Implementation
Policies and Permissions configuration for Scenario 3:
Policy Name
Workflow
All Master Segments in Data Workbench view
Audience Studio: Master Segment
Audience Studio: Specific Folders Segments
Policy Applied To:
Full Access
Executive Office
Do not select any permission
Full Access
Full Access
n/a
Kelly
Cam
Data Engineers
View
Run User-defined
Edit User-Defined
Do not select any permission
Do not select any permission
n/a
Cathy
Steve
Eke
Data Analysts
View
Full Access
Select Limited Access and select the Gaming US-West master segment; in All Segment Folders: select View
n/a
Josh
Marketing Analysts
Do not select any permission
View
Select Limited Access and select the Gaming US-West master segment; in All Segment Folders: select View and Full Control
Select the US-Central master segment; in All Segment Folders: select View
Select the US-East master segment; in All Segment Folders: select View
Select the Asia-East master segment; in All Segment Folders: select View
Select the Europe-West master segment; in All Segment Folders: select View
August
Gina
Consultant: Marketing Analyst
Do not select any permission
Do not select any permissions
Select Limited Access and select the Gaming US-West master segment; in All Segment Folders: select View
Then click to edit the US-West master segment and select the California, Arizona, and Oregon segment folders
For each segment folder, specify the permissions: California - View, Full Control
Arizona - View
Oregon - View
Petra
Database access for Scenario 3
User Name
Database Name
Level
Each data engineer: Cathy, Steve, Eke
us_customers
us_support
us_marketing
full access
Data analyst: Josh
us_customers
us_support
us_marketing
query access
You can opt to log into TD Console using Google SSO. Google, acting as an Identity Provider (IdP), manages your email as your sign-in method.
How to specify use of Google SSO
When you convert your user sign-in to Google SSO, neither you nor your TD administrator can change your user account from Google SSO back to Email and Password login mode. You must contact Treasure Data Support to revert back to Email and Password.
To convert your sign-in to Google SSO, complete the following steps:
Open TD Console.
Access My Settings > My Profile.
Click Sign In with Google.
You see the following warning:
If you click Continue, you are taken to the Google SSO page.
The next time that you log into TD Console, you see an indication that you are signed in through Google SSO:
To add new users to Arm Treasure Data, you send an invitation to join. You invite new users on an individual basis. You must be an account owner or administrator to invite a new user to your account.
Open TD Console.
Navigate to the Control Panel.
Click Users.
Click Actions > Invite User.
Type the email address for the user you want to invite.
Click Save to add the user.
If the user already exists but is not visible in your list of users, you receive a message indicating that an invitation is already pending for the user.
The Treasure Data system sends an email to the user with a link for them to accept.
You must be an account owner or administrator to change the role of an existing user.
Open TD Console.
Navigate to Control Panel > Users.
Select a specific user
Select the edit pencil to edit the user personal information.
Scroll down to User Type and select the desired role.
You can log into TD Console using Google aliases. Arm Treasure Data views each Google SSO alias as a unique email identifier. Google manages the aliases, resolving the aliases to your email address.
You can log into an Arm Treasure Data account using different Google aliases, for example:
You can sign into your account as an admin and then, with an alias, sign into the same account as a restricted user. Or you can use aliases to sign into your account with different permissions enabled.
Connect to Arm Treasure Data's Presto using our JDBC Driver.
Setup
Download presto-jdbc driver
To use a secure HTTPS (port 443) connection, you need presto-jdbc 0.148 or later.
Use the following parameters to connect to the Presto server:
Driver name: io.prestosql.jdbc.PrestoDriver
user: (Your TD API key)
password: (dummy string. This value will be ignored)
In presto-jdbc 0.180 and later, because of a bug, a dummy string must be set as the password.
Connection URL examples
The connection URL needs to have the following format: jdbc:presto://api-presto.treasuredata.com:443/td-presto/(database name)?SSL=true
Examples
jdbc:presto://api-presto.treasuredata.com:443/td-presto?SSL=true
jdbc:presto://api-presto.treasuredata.com:443/td-presto/sample_datasets?SSL=true
Example Code
$ javac Sample.java
$ java -cp .:presto-jdbc-0.152.jar Sample (your TD API key)
time=1412351990, method=GET, path=/category/office
time=1412351976, method=GET, path=/item/software/2265
time=1412351961, method=GET, path=/category/toys
time=1412351947, method=GET, path=/item/electronics/3305
time=1412351932, method=GET, path=/category/software
Sample.java
import java.sql.*;

class Sample {
    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("Provide your TD API key as an argument");
            return;
        }
        String apikey = args[0];
        try {
            Connection conn = DriverManager.getConnection(
                "jdbc:presto://api-presto.treasuredata.com:443/td-presto/sample_datasets?SSL=true",
                apikey, "dummy_password");
            Statement stmt = conn.createStatement();
            try {
                ResultSet rs = stmt.executeQuery("SELECT time, method, path from www_access limit 5");
                while (rs.next()) {
                    long time = rs.getLong(1);
                    String method = rs.getString(2);
                    String path = rs.getString(3);
                    System.out.println(String.format("time=%s, method=%s, path=%s", time, method, path));
                }
            }
            finally {
                stmt.close();
                conn.close();
            }
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Limitations
The Presto JDBC connection has several limitations:
setAutoCommit(false) throws an exception because transactions are not supported.
presto-jdbc queries share the same Presto resources you have available for any other ad hoc or scheduled queries that you submit from the TD Console or through the REST API.
presto-jdbc queries are shown in your account's job log, with the exception of the following standard entries:
Query result table
Query plan
Progress log
The query result is converted into JSON before being submitted back to separate tools for display. As such, very large result tables may take some time before being displayed. If you have latency problems, contact support.
Certain BI tools issue information_schema requests that cause failed queries in our system. This may be an issue in your tool during first use.
If you see any unexpected errors/bugs/logs that occur during beta use, contact support.
Our aim is to keep this API up as much as possible. Specifically, we expect to achieve a downtime of less than 1 hour total per month.
Treasure Data Workflow provides an easy way to leverage custom Python scripts for sentiment analysis with TensorFlow and export the resulting model to Amazon S3. Machine Learning algorithms can be run as part of your scheduled workflows, using Python custom scripts. This article introduces the steps to run a sentiment analysis algorithm within a Treasure Data Workflow.
Sentiment analysis classifies movie review texts as positive or negative using TensorFlow and TensorFlow Hub. See the official documentation.
Sentiment Analysis using Python Custom Scripts
There are two versions of the algorithm discussed in this article:
Example Workflow with TensorFlow using Amazon S3
Example Workflow with TensorFlow without Amazon S3
Example Workflow using TensorFlow with Amazon S3
The workflow:
Fetches review data from Treasure Data
Builds a model with TensorFlow
Stores the model on S3
Predicts polarities for unknown review data and writes it back to Treasure Data
Prerequisites
Make sure that the custom scripts feature is enabled for your TD account.
Download and install the TD Toolbelt and the TD Toolbelt Workflow module. For more information, see TD Workflow Quickstart.
Basic Knowledge of Treasure Data Workflow syntax
AWS S3
S3 Secrets
Run the Example Workflow
Download the sentimental-analysis project from this repository
In the Terminal window, change directory to sentimental-analysis
Run data.sh to ingest training and test data on Treasure Data. About 80 million records are fetched to build the model. The script also creates a database named sentiment and tables named movie_review_train and movie_review_test to store the data. For example:
$ ./data.sh
Assume that the input table is:
rowid
sentence
sentiment
polarity
1-10531
"Bela Lugosi revels in his role as European horticulturist (sic) Dr. Lorenz in this outlandish...
2
0
1-10960
Fragmentaric movie about a couple of people in Austria during a heatwave. This kind of...
3
0
1-24370
I viewed the movie together with a homophobic friend, my wife and her female friend. So...
7
1
Run the example workflow as follows:
$ td workflow push sentiment
$ td workflow secrets \
    --project sentiment \
    --set apikey \
    --set endpoint \
    --set s3_bucket \
    --set aws_access_key_id \
    --set aws_secret_access_key
# Set secrets from STDIN like: apikey=1/xxxxx, endpoint=https://api.treasuredata.com, s3_bucket=my_bucket,
# aws_access_key_id=AAAAAAAAAA, aws_secret_access_key=XXXXXXXXX
$ td workflow start sentiment sentiment-analysis --session now
Results of the script are stored in the test_predicted_polarities table in Treasure Data.
To view the table:
Log into TD Console.
Search for the sentiment database.
Locate the test_predicted_polarities table.
The prediction results are stored in this table as shown below:
rowid
predicted_polarity
1-21643
0
1-22967
1
Example Workflow using TensorFlow without Amazon S3
The workflow:
Fetches review data from Treasure Data
Builds a model with TensorFlow
Predicts polarities for unknown review data and writes the data back to Treasure Data
Prerequisites
Make sure this feature is enabled for your TD account.
Download and install the TD Toolbelt and the TD Toolbelt Workflow module. For more information, see TD Workflow Quickstart.
Basic Knowledge of Treasure Data Workflow syntax
Run the Example Workflow
Download the sentimental-analysis project from this repository
From the command line Terminal window, change directory to sentimental-analysis. For example:
cd sentiment-analysis
Run data.sh to ingest training and test data on Treasure Data. About 80 million records are fetched to build the model. The script also creates a database named sentiment and tables named movie_review_train and movie_review_test to store the data.
$ ./data.sh
Assume that the input table is as follows:
rowid
sentence
sentiment
polarity
1-10531
"Bela Lugosi revels in his role as European horticulturist (sic) Dr. Lorenz in this outlandish...
2
0
1-10960
Fragmentaric movie about a couple of people in Austria during a heatwave. This kind of...
3
0
1-24370
I viewed the movie together with a homophobic friend, my wife and her female friend. So...
7
1
Run the example workflow as follows:
$ td workflow push sentiment
$ td workflow secrets \
    --project sentiment \
    --set apikey \
    --set endpoint
# Set secrets from STDIN like: apikey=1/xxxxx, endpoint=https://api.treasuredata.com
$ td workflow start sentiment sentiment-analysis-simple --session now
Results of the script are stored in the test_predicted_polarities table in Treasure Data.
To view the table:
Log into TD Console.
Search for the sentiment database.
Locate the test_predicted_polarities table.
The prediction results should be similar to the following:
rowid
predicted_polarity
1-21643
0
1-22967
1
Review the Workflow Custom Python Script
Review the contents of the sentimental-analysis directory:
sentiment-analysis.dig - This is the TD Workflow YAML file for sentiment analysis with TensorFlow.
sentiment.py - This is the Custom Python script with TensorFlow. It builds a prediction model with existing data and predicts polarity to unknown data.
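The full sentiment.py is in the repository. As a rough, hedged sketch of the approach it describes (reading the labeled reviews with pytd and training a classifier on a TF Hub text embedding), the core steps could look like the following. The table and column names match the input table shown earlier; the estimator settings are illustrative assumptions, not the script's actual values.

import tensorflow as tf
import tensorflow_hub as hub
import pytd.pandas_td as td

# Read the labeled training reviews from the sentiment database (assumed setup).
engine = td.create_engine('presto:sentiment')
train_df = td.read_td("select sentence, polarity from movie_review_train", engine)

# Embed raw sentences with a pre-trained TF Hub module.
embedded_text_feature_column = hub.text_embedding_column(
    key="sentence",
    module_spec="https://tfhub.dev/google/nnlm-en-dim128/1")

# Train a small feed-forward classifier on the embedded text (illustrative sizes).
estimator = tf.estimator.DNNClassifier(
    hidden_units=[500, 100],
    feature_columns=[embedded_text_feature_column],
    n_classes=2)
train_input_fn = tf.compat.v1.estimator.inputs.pandas_input_fn(
    train_df, train_df["polarity"], num_epochs=None, shuffle=True)
estimator.train(input_fn=train_input_fn, steps=1000)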
In this example, we use a pre-trained model in TensorFlow Hub for word embedding of English text.
embedded_text_feature_column = hub.text_embedding_column(
    key="sentence",
    module_spec="https://tfhub.dev/google/nnlm-en-dim128/1")
If you want to change this model to another one, for example, a Japanese model, you can modify it as follows:
embedded_text_feature_column = hub.text_embedding_column(
    key="sentence",
    module_spec="https://tfhub.dev/google/nnlm-ja-dim128/1")
Before word embedding, you need to prepare tokenized sentences for Japanese.
Because this custom script also saves the trained TensorFlow model with movie reviews to Amazon S3, you can build your prediction server with TensorFlow Serving.
To change the serving_input_receiver_fn, modify the following code:
feature_spec = tf.feature_column.make_parse_example_spec([embedded_text_feature_column])
serving_input_receiver_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(feature_spec)
estimator.export_saved_model(EXPORT_DIR_BASE, serving_input_receiver_fn)
See the TensorFlow documentation for details.
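The upload of the exported SavedModel directory to S3 is not shown in this article. A minimal, hedged sketch of that step using boto3 could look like this; the environment variable names for the bucket and credentials are placeholders, not values taken from the example project.

import os
import boto3

# export_saved_model() returns the path of a timestamped SavedModel directory (as bytes).
export_path = estimator.export_saved_model(EXPORT_DIR_BASE, serving_input_receiver_fn).decode("utf-8")

s3 = boto3.client(
    "s3",
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"])

# Upload every file under the exported directory, preserving relative paths as object keys.
for root, _, files in os.walk(export_path):
    for name in files:
        local_path = os.path.join(root, name)
        key = os.path.relpath(local_path, EXPORT_DIR_BASE)
        s3.upload_file(local_path, os.environ["S3_BUCKET"], key)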
Treasure Data Workflow provides an easy way to predict continuous values, like a price or energy consumption, using a Linear Regression predictor. Machine Learning algorithms can be run as part of your scheduled workflows, using Python custom scripts. This article introduces feature selection using scikit-learn in a Python script, which selects important features and builds a partial query for Hivemall to predict house prices.
Feature selection is a common machine learning technique used to build a simplified model for understanding and to enhance generalization by removing irrelevant or redundant information.
Feature Selection using Python Custom Scripts
This article describes how to predict house prices using the Boston house pricing data set with a Linear Regression predictor. Feature selection from scikit-learn helps identify meaningful attributes from which to create supervised models.
Example Workflow to Predict House Prices
The workflow:
Splits the training and test data sets
Selects important features with scikit-learn
Builds a partial query for Hivemall
Trains and evaluates the model with selected features on Hivemall
Prerequisites
Make sure that the custom scripts feature is enabled for your TD account.
Download and install the TD Toolbelt and the TD Toolbelt Workflow module. For more information, see TD Workflow Quickstart.
Basic Knowledge of Treasure Workflow's syntax
Run the Example Workflow
Download the house price prediction project.
From the command line terminal window, change to the house-price-prediction directory. For example:
cd house-price-prediction
Run data.sh to ingest training and test data on Treasure Data. The script uses the Boston Housing Dataset with about 506 cases to build the model. The script also creates a database named boston and table named house_prices to store the data.
$ ./data.sh
Input
Assume the input table is:
crim
double
zn
double
indus
double
chas
int
nox
double
rm
double
age
double
dis
double
rad
int
tax
int
ptratio
double
b
double
lstat
double
medv
double
0.00632
18
2.31
0
0.538
6.575
65.2
4.09
1
296
15.3
396.9
4.98
24
0.02731
0
7.07
0
0.469
6.421
78.9
4.9671
2
242
17.8
396.9
9.14
21.6
0.02729
0
7.07
0
0.469
7.185
61.1
4.9671
2
242
17.8
392.83
4.03
34.7
0.03237
0
2.18
0
0.458
6.998
45.8
6.0622
3
222
18.7
394.63
2.94
33.4
...
...
...
...
...
...
...
...
...
...
...
...
...
...
Where medv is the median value of owner-occupied homes in $1000's and is the target value for regression. chas and rad are categorical values, and the other features are quantitative.
By default, the top 4 correlated columns with medv are used.
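The exact selection logic lives in the project code; as a hedged illustration of what "top 4 correlated columns" means, a small pandas version could look like the following, assuming df holds the house_prices table:

# Rank features by the absolute value of their correlation with the target medv.
correlations = df.corr()['medv'].drop('medv').abs()
top_features = correlations.sort_values(ascending=False).head(4).index.tolist()
print(top_features)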
Run the example workflow as follows:
$ td workflow push regressor
# Set credentials for this workflow
# export TD_API_KEY=1/xxxxx
# export TD_API_SERVER=https://api.treasuredata.com
$ td wf secrets --project regressor --set apikey --set endpoint
# Enter apikey e.g.) 1234/XXXXXXX, and endpoint e.g.) https://api.treasuredata.com
$ td wf start regressor regression-py
The predicted results for the price of houses are stored in the predictions table.
To view the table:
Log into TD Console.
Search for the boston database.
Locate the predictions table
Output
This workflow outputs predicted results
rowid
string
predicted_price
double
1-10
33.97034232809395
1-121
30.3377696027913
...
...
Review the Workflow Custom Python Script
Review the contents of the directory:
regression-py.dig - Example workflow for house price prediction.
task/__init__.py - Custom Python script with scikit-learn. It selects important features and builds a partial query for Hivemall.
This example uses scikit-learn's SelectFromModel function, which enables selection of features when building a predictive model.
reg = ExtraTreesRegressor()
reg = reg.fit(X, y)
model = SelectFromModel(reg, prefit=True)
feature_idx = model.get_support()
feature_name = df.drop(columns=['medv']).columns[feature_idx]
selected_features = set(feature_name)
...(snip)...
feature_query = self._feature_column_query(selected_features, feature_types=feature_types)
In this example, we use ExtraTreesRegressor to get feature importances, but you can use other estimators, such as LassoCV.
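As a hedged sketch, swapping in LassoCV would keep the rest of the selection code unchanged; the cross-validation and threshold settings below are illustrative assumptions rather than values from the example project.

from sklearn.linear_model import LassoCV
from sklearn.feature_selection import SelectFromModel

# Fit a cross-validated Lasso and keep features with non-negligible coefficients.
reg = LassoCV(cv=5).fit(X, y)
model = SelectFromModel(reg, prefit=True, threshold=1e-5)
feature_idx = model.get_support()
selected_features = set(df.drop(columns=['medv']).columns[feature_idx])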
In the example code, there is a _create_vectorize_table function for creating a vectorized table to train Hivemall models with Python.
self._create_vectorize_table(engine_hive, dbname, "train", "{}_train".format(source_table), feature_query)
While you can use this function within the custom Python script instead of exporting the partial query, we recommend that you export queries. Exporting queries gives you the benefit of Digdag parallelization and manageability on TD Console.
Treasure Workflow provides for prediction of time-series values, like sales revenue or page views, using Facebook Prophet. Machine Learning algorithms can be run from a custom Python script as part of your scheduled workflows. Use Facebook Prophet in your Python custom script for time series analysis and sales data prediction.
Prophet is a procedure for forecasting time series data. It can learn probability distributions from incomplete data sets and from shifts in trends. Details of the data used in these examples can be found in Prophet's official documentation.
These examples predict sales data with Facebook Prophet, using time series analysis to predict continuous values from past data. Pick the example to use depending on your access to S3 and Slack.
Example Workflow to Predict Future Sales with S3 and Slack
Example Workflow to Predict Future Sales without S3 or Slack
Example Workflow to Predict Future Sales with S3 and Slack
The workflow:
Fetches past sales data from Treasure Data
Builds a model with Prophet
Predicts future sales and writes the results back to Treasure Data
Uploads predicted figures to Amazon S3
Sends a notification to Slack
Prerequisites
Make sure that the custom scripts feature is enabled for your TD account.
Download and install the TD Toolbelt and the TD Toolbelt Workflow module. For more information, see TD Workflow Quickstart.
Basic Knowledge of Treasure Workflow's syntax
S3 bucket and associated credentials
Create your Slack webhook URL
Running the Example Workflow
Download the project from this repository
In the command line Terminal window, change directory to timeseries. For example:
cd timeseries
Run data.sh to prepare example data. It creates a database called timeseries and table called retail_sales.
$ ./data.sh
Run the example workflow as follows:
$ td workflow push prophet
$ td workflow secrets \
    --project prophet \
    --set apikey \
    --set endpoint \
    --set s3_bucket \
    --set aws_access_key_id \
    --set aws_secret_access_key \
    --set slack_webhook_url
# Set secrets from STDIN like: apikey=1/xxxxx, endpoint=https://api.treasuredata.com, s3_bucket=$S3_BUCKET,
# aws_access_key_id=AAAAAAAAAA, aws_secret_access_key=XXXXXXXXX, slack_webhook_url=https://hooks.slack.com/services/XXXXXXX/XXXXXX/XXXXXX
# Where XXXXXXX/XXXXXX/XXXXXX is the value of the Slack URL where you want information to be populated automatically.
$ td workflow start prophet predict_sales --session now
A notification of prediction results is sent to your Slack:
Prediction results are stored in the predicted_sales table in the timeseries database.
Validate predicted sales values in TD Console with SQL as follows:
select ds, yhat, yhat_lower, yhat_upper
from timeseries.predicted_sales
order by ds desc
limit 100;
You can see the predicted sales numbers in the yhat column, as follows:
Example Workflow to Predict Future Sales without S3 or Slack
This example is for when you do not have access to S3 or Slack.
The workflow:
Fetches past sales data from Treasure Data
Builds a model with Prophet
Predicts future sales and writes back to Treasure Data
Prerequisites
Make sure custom scripts feature is enabled for your TD account.
Download and install the TD Toolbelt and the TD Toolbelt Workflow module. For more information, see TD Workflow Quickstart.
Basic Knowledge of Treasure Workflow's syntax
Running the Example Workflow
Download the project from this repository
In the command line Terminal window, change directory to timeseries. For example:
cd timeseries
Run data.sh to prepare example data. It creates a database called timeseries and table called retail_sales.
$ ./data.sh
Run the example workflow as follows:
$ td workflow push prophet
$ td workflow secrets \
    --project prophet \
    --set apikey \
    --set endpoint
# Set secrets from STDIN like: apikey=1/xxxxx, endpoint=https://api.treasuredata.com
$ td workflow start prophet predict_sales_simple --session now
Predictions are stored in Treasure Data.
Review the Workflow Custom Python Script
Review the contents of the directory:
predict_sales.dig - Example workflow for sales prediction and notification to Slack.
predict.py - Custom Python script with Prophet. It also uploads figures to Amazon S3.
If you don't need to send a notification to Slack, you can remove the "+send_graph" step from predict_sales.dig.
This example uses a dataset for sales data. You can use your own data in Treasure Data after modifying the query in the read_td function.
You must prepare the ds (datestamp) column and the y column. The y column represents the target numerical values for forecasting, such as sales values or page views (PVs) of web sites.
Here is example code for page view logs:
import pytd.pandas_td as td

engine = td.create_engine('presto:sample_datasets')
start_date = '2014-01-01'
end_date = '2018-12-31'
df = td.read_td(f"""
select
  TD_TIME_FORMAT(TD_DATE_TRUNC('minute', time), 'yyyy-MM-dd HH:mm:ss') as ds,
  count(1) as y
from www_access
where TD_TIME_RANGE(time, '{start_date}', '{end_date}', 'PDT')
group by 1
order by 1
""", engine)
m = Prophet(changepoint_prior_scale=0.01).fit(df)
future = m.make_future_dataframe(periods=5, freq='M')
fcst = m.predict(future)
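The workflow's Python script handles writing the forecast back to Treasure Data. As a hedged sketch (the database and table names match this article, while the pytd client settings are assumptions), the write-back could be done with pytd like this:

import os
import pytd

client = pytd.Client(apikey=os.environ["TD_API_KEY"],
                     endpoint="https://api.treasuredata.com",
                     database="timeseries")

# Keep only the forecast columns selected by the validation query shown earlier.
result = fcst[["ds", "yhat", "yhat_lower", "yhat_upper"]]
client.load_table_from_dataframe(result, "predicted_sales", if_exists="overwrite")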
Treasure Workflow provides an easy way to modify data with PySpark. You can port existing PySpark code to your custom scripting environment in Arm Treasure Data.
PySpark is a Python API for Spark. You can run PySpark as part of your scheduled workflows, using Python custom scripts. Apache Spark is widely used for distributed computation for Machine Learning and general data manipulation.
The following example shows how to complete basic data manipulations using stand-alone mode PySpark with Python Custom scripts.
Data Manipulation using Python Custom Scripts and PySpark
PySpark is the Python interface to Apache Spark, a unified analytics engine for large-scale data processing. This example shows how to load data from Treasure Data, process it with PySpark, execute SparkSQL, and upload the modified data back to Treasure Data.
In this example, you use td-pyspark, a Python library that provides a handy way to use PySpark and Treasure Data based on td-spark.
Example Workflow to Process Data with td-spark
The workflow that you create in this example completes the following actions:
Fetch data from Treasure Data
Process data as Spark DataFrame
Manipulate data with SparkSQL
Upload Spark DataFrame to Treasure Data
Upload pandas DataFrame to Treasure Data
Prerequisites
To follow the steps in this example, you must have a basic knowledge of Treasure Workflow syntax and set up your environment as follows:
Download and install the TD Toolbelt and the TD Toolbelt Workflow module. See TD Workflow Quickstart
Ensure that the custom scripts feature is enabled for your TD account.
Run the Example Workflow
You retrieve code from Treasure Data GitHub to run in your environment.
Download the project from the GitHub repository
In your Terminal window, change the directory to pyspark.
Run the example workflow as follows:
$ td workflow push td-spark
$ td workflow secrets \
    --project td-spark \
    --set apikey \
    --set endpoint
# Set secrets from STDIN like: apikey=1/xxxxx, endpoint=https://api.treasuredata.com
$ td workflow start td-spark pyspark --session now
The workflow fetches www_access data in the sample_datasets database, then filters the data by HTTP GET method requests and adds a time column in string format. The manipulated data is stored in the www_access_processed table in the td_spark_example database.
The data processed with PySpark, filtered to only GET requests, is as follows:
Additionally, you can find the pandas dataframe in the pandas_table as follows:
Review the Workflow Custom Python Script
Review the contents of the directory and note the following files:
pyspark.dig
Example workflow for data manipulation with PySpark.
The workflow YAML file pyspark.dig invokes the Python script for manipulating data with PySpark. It has three main tasks:
The td_spark_process task creates the www_access_processed table in the td_spark_example database by using Spark DataFrame execution.
The td_spark_sql executes SparkSQL.
The td_spark_upload task uploads the processed pandas data frames to the TD database.
td-spark.py
Custom Python script with PySpark and td-pyspark.
The Python script td-spark.py houses all the python functions invoked by the pyspark.dig file. Three main functions illustrate processing data with PySpark:
The process_data function uses PySpark to filter the data to only GET-method HTTP requests and adds a time column converted into string format.
The execute_sql function demonstrates how you can use SparkSQL to process data
The upload_dataframe function uses the pandas dataframe to upload data to Treasure Data after converting the data frame into a Spark dataframe.
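The actual implementation is in td-spark.py; the following is a rough, hedged sketch of those three steps using td-pyspark. The database and table names match this article, but details such as the added column name and the sample pandas data are assumptions.

import pandas as pd
import td_pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_unixtime

builder = SparkSession.builder.appName("pyspark-example")
td = td_pyspark.TDSparkContextBuilder(builder).build()
spark = builder.getOrCreate()  # the SparkSession created by the builder

# process_data: keep only GET requests and add a string-formatted time column.
df = td.table("sample_datasets.www_access").df()
processed = (df.filter(col("method") == "GET")
               .withColumn("time_str", from_unixtime(col("time"))))
td.create_database_if_not_exists("td_spark_example")
td.create_or_replace(processed, "td_spark_example.www_access_processed")

# execute_sql: run SparkSQL over the processed data.
processed.createOrReplaceTempView("access")
spark.sql("SELECT path, count(1) AS cnt FROM access GROUP BY path").show()

# upload_dataframe: convert a pandas DataFrame to Spark and upload it.
pdf = pd.DataFrame({"name": ["alice", "bob"], "score": [90, 75]})
td.create_or_replace(spark.createDataFrame(pdf), "td_spark_example.pandas_table")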
You can use td-pyspark to bridge the results of data manipulations in Databricks with your data in Arm Treasure Data.
Databricks builds on top of Apache Spark, providing an easy-to-use interface for accessing Spark. PySpark is a Python API for Spark. Treasure Data's td-pyspark is a Python library that provides a handy way to use PySpark and Treasure Data, based on td-spark.
Prerequisites
To follow the steps in this example, you must have the following items:
Treasure Data API key
td-spark feature enabled
Configuring your Databricks Environment
You create a cluster, install the td-pyspark libraries, and configure a notebook for your connection code.
Create a Cluster on Databricks
Click the Cluster icon.
Click Create Cluster.
Provide a cluster name, select version Spark 2.4.3 or later as the Databricks Runtime Version and select 3 as the Python Version.
Install the td-pyspark Libraries
Access the Treasure Data Apache Spark Driver Release Notes. From the article, you click on links to download code.
Click to download
td-spark-assembly_2.11-19.7.0.jar
Select PyPI.
When download completes, you see the following:
Specify your TD API Key and Site
In the Spark configuration, you specify the Treasure Data API key and enter the environment variables.
An example of the format is as follows. You provide the actual values:
spark.td.apikey (Your TD API KEY)
spark.td.site (Your site: us, jp, eu01)
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.sql.execution.arrow.enabled true
Restart Cluster and Begin Work in Databricks
Restart your Spark cluster. Create a notebook. Create a script similar to the following code:
%python
from pyspark.sql import *
import td_pyspark

SAMPLE_TABLE = "sample_datasets.www_access"
td = td_pyspark.TDSparkContext(spark)
df = td.table(SAMPLE_TABLE).within("-10y").df()
df.show()
TDSparkContext is an entry point to access td_pyspark's functionalities. As shown in the preceding code sample, to create TDSparkContext, pass your SparkSession (spark) to TDSparkContext:
td = TDSparkContext(spark)
You see a result similar to the following:
Your connection is working.
Interacting with Treasure Data from Databricks
In Databricks, you can run select and insert queries to Treasure Data or query back data from Treasure Data. You can also create and delete databases and tables.
In Databricks, you can use the following commands:
Read Tables as DataFrames
To read a table, use td.table(table_name):
df = td.table("sample_datasets.www_access").df()
df.show()
Change the Database used in Treasure Data
To change the context database, use td.use(database_name):
td.use("sample_datasets")
# Accesses sample_datasets.www_access
df = td.table("www_access").df()
By calling .df(), your table data is read as a Spark DataFrame. The usage of the DataFrame is the same as with standard PySpark. See also the PySpark DataFrame documentation.
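Because the returned object is an ordinary Spark DataFrame, the usual PySpark transformations apply. A small, hedged example using the method and path columns of the www_access sample table:

from pyspark.sql.functions import col

df = td.table("sample_datasets.www_access").df()

# Count the ten most requested paths among GET requests.
top_paths = (df.filter(col("method") == "GET")
               .groupBy("path")
               .count()
               .orderBy(col("count").desc())
               .limit(10))
top_paths.show()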
Submit presto queries
If your Spark cluster is small, reading all of the data as an in-memory DataFrame might be difficult. In this case, you can use Presto, a distributed SQL query engine, to reduce the amount of data processed with PySpark.
q = td.presto("select code, * from sample_datasets.www_access")
q.show()
q = td.presto("select code, count(*) from sample_datasets.www_access group by 1")
q.show()
You see:
Create or drop a database
td.create_database_if_not_exists("<db_name>")
td.drop_database_if_exists("<db_name>")
Upload DataFrames to Treasure Data
To save your local DataFrames as a table, you have two options:
Insert the records in the input DataFrame to the target table
Create or replace the target table with the content of the input DataFrame
td.insert_into(df, "mydb.tbl1")
td.create_or_replace(df, "mydb.tbl2")
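For example, a pandas DataFrame built in the notebook can be converted to a Spark DataFrame and then uploaded. This is a hedged sketch; mydb.scores is a placeholder table name.

import pandas as pd

pdf = pd.DataFrame({"user_id": [1, 2, 3], "score": [0.2, 0.7, 0.9]})
spark_df = spark.createDataFrame(pdf)          # spark is the notebook's SparkSession
td.create_or_replace(spark_df, "mydb.scores")  # creates or replaces the target table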
Checking Databricks in Treasure Data
You can use TD Toolbelt to check your database from the command line. Alternatively, if you have TD Console, you can check your databases and queries. Read about Database and Table Management.
You can use td-pyspark to bridge the results of data manipulations in Amazon EMR (Elastic MapReduce) with your data in Arm Treasure Data.
Amazon EMR is an AWS tool for big data processing and analysis, providing an easy-to-use interface for accessing Spark. PySpark is a Python API for Spark. Treasure Data's td-pyspark is a Python library that provides a handy way to use PySpark and Treasure Data, based on td-spark.
Prerequisites
To follow the steps in this example, you must have the following Treasure Data items:
Treasure Data API key
td-spark feature enabled
Configuring your Amazon EMR Environment
You create a key pair and a cluster, install the td-pyspark libraries, and configure a notebook for your connection code.
Log into the AWS Management Console. Under Find Services, enter EMR. Amazon EMR cluster nodes run on Amazon EC2 instances.
Create an EC2 Key Pair
When you create the key pair in Amazon, you provide a name, and a file with the .pem extension is generated. You download the generated file to your local computer.
For more information about creating an Amazon EC2 key pair, see Amazon EC2 Key Pairs.
You refer to the key pair when you create a cluster; you specify the Amazon EC2 key pair that is used for SSH connections to all cluster instances.
Create a Cluster on Amazon EMR
Complete the configuration fields. Provide a cluster name and a folder location for the cluster data, and select Spark 2.4.3 or later as the Application.
Specify the instance and key pair.
Prepare the Proxy Server
To grant inbound access permission from your local computer, you specify a secure protocol.
Select the security group for Master:
In the Amazon EMR console, find the Master node that you want to access:
With the proxy established, you log onto the Master node instance. On your local computer, access the AWS Master node instance through an ssh protocol.
$ ssh -i ~/<your_aws_keypair_key.pem> hadoop@<Master public DNS>
You see a connection confirmation on your local computer.
Install the td-pyspark Libraries
Access the Treasure Data Apache Spark Driver Release Notes. From the article, you click on links to download code.
Inside your EMR instance, click to download the library to your Master node.
td-spark-assembly_2.11-19.7.0.jar
Still within the Master node instance, run the following command to install td-pyspark:
$ pip install td_pyspark
Create a Configuration File and Specify your TD API Key and Site
In the Master node instance, create a td-spark.conf file. In the configuration file, specify your TD API key, TD site parameters, and the Spark environment.
An example of the format is as follows. You provide the actual values:
spark.td.apikey (Your TD API KEY)
spark.td.site (Your site: us, jp, eu01)
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.sql.execution.arrow.enabled true
Launch PySpark
Launch PySpark. In your command, include the arguments as shown in the following example:
% pyspark --driver-class-path ./td_pyspark-19.7.0/td_pyspark/jars/td-spark-assembly.jar --properties-file ./td-spark.conf
You see something similar to the following:
Then load td_pyspark, as follows. Note the prompt symbol changes to >>>:
>>> import td_pyspark
>>> from pyspark import SparkContext
>>> from pyspark.sql import SparkSession
>>> builder = SparkSession.builder.appName("td-pyspark-test")
>>> td = td_pyspark.TDSparkContextBuilder(builder).build()
>>> df = td.table("sample_datasets.www_access").within("+2d/2014-10-04").df()
>>> df.show()
TDSparkContextBuilder is an entry point to access td_pyspark's functionalities. As shown in the preceding code sample, you read tables in Treasure Data as data frames:
df = td.table("tablename").df()
You see a result similar to the following:
Your connection is working.
Interacting with Treasure Data in your Master Node Instance
You can run select and insert queries to Treasure Data or query back data from Treasure Data. You can also create and delete databases and tables.
You can use the following commands:
Read Tables as DataFrames
To read a table, use td.table(table_name):
df = td.table("sample_datasets.www_access").df()
df.show()
Change the Database used in Treasure Data
To change the context database, use td.use(database_name):
td.use("sample_datasets")
# Accesses sample_datasets.www_access
df = td.table("www_access").df()
By calling .df(), your table data is read as a Spark DataFrame. The usage of the DataFrame is the same as with standard PySpark. See also the PySpark DataFrame documentation.
Submit presto queries
If your Spark cluster is small, reading all of the data as an in-memory DataFrame might be difficult. In this case, you can use Presto, a distributed SQL query engine, to reduce the amount of data processed with PySpark.
q = td.presto("select code, * from sample_datasets.www_access")
q.show()
q = td.presto("select code, count(*) from sample_datasets.www_access group by 1")
q.show()
You see:
Create or drop a database
td.create_database_if_not_exists("<db_name>")td.drop_database_if_exists("<db_name>")
Upload DataFrames to Treasure Data
To save your local DataFrames as a table, you have two options:
Insert the records in the input DataFrame to the target table
Create or replace the target table with the content of the input DataFrame
td.insert_into(df, "mydb.tbl1")td.create_or_replace(df, "mydb.tbl2")
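For example, the following minimal sketch builds a small DataFrame locally and uploads it. It assumes the SparkSession and td context created earlier; the database and table names (mydb.tbl1, mydb.tbl2) are illustrative, so replace them with a database you own.
from pyspark.sql import Row, SparkSession

# Reuse the session created by the earlier builder; `td` is the TDSparkContext built above.
spark = SparkSession.builder.getOrCreate()
local_df = spark.createDataFrame([Row(id=1, name="alice"), Row(id=2, name="bob")])

td.create_database_if_not_exists("mydb")     # make sure the target database exists
td.insert_into(local_df, "mydb.tbl1")        # append records to mydb.tbl1
td.create_or_replace(local_df, "mydb.tbl2")  # create or replace mydb.tbl2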
Checking Amazon EMR in Treasure Data
You can use the td toolbelt to check your database from the command line. Alternatively, if you have TD Console, you can check your databases and queries there.
You can use td-pyspark to bridge the results of data manipulations in Google Colab with your data in Arm Treasure Data.
Google Colab notebooks make it easy to model with PySpark in Google. PySpark is a Python API for Spark. Treasure Data's td-pyspark is a Python library that provides a handy way to use PySpark and Treasure Data based on td-spark.
Prerequisites
To follow the steps in this example, you must have the following Treasure Data items:
Treasure Data API key
td-spark feature enabled
Configuring your Google Colab Environment
You create a notebook, install the pyspark and td-pyspark libraries, and configure the notebook for your connection code.
Create a Notebook in Google Colab
Open Google Colab. Click File > New Python 3 notebook.
Ensure that the runtime is connected. The notebook shows a green check in the top right corner.
Prepare your Environment for the PySpark and TD-PySpark Libraries
Click the icon to add a code cell:
Enter the following code:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!pip install pyspark td-pyspark
Create and Upload the td-spark.conf File
You specify your TD API key and site on your local file system. Create a file as follows:
An example of the format is as follows. You provide the actual values:
spark.td.apikey (Your TD API KEY)
spark.td.site (Your site: us, jp, eu01)
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.sql.execution.arrow.enabled true
Name the file td-spark.conf and upload the file by clicking Files > Upload on the Google Colab menu. Verify that the td-spark.conf file is saved in the /content directory.
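Alternatively, as a minimal sketch, you can write the file from a notebook code cell instead of uploading it manually. Replace the placeholder values with your own, and avoid sharing a notebook that embeds your API key.
# Write td-spark.conf into the /content directory from a Colab code cell.
conf = """spark.td.apikey (Your TD API KEY)
spark.td.site (Your site: us, jp, eu01)
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.sql.execution.arrow.enabled true
"""
with open("/content/td-spark.conf", "w") as f:
    f.write(conf)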
Run the Installation and Begin Work in Google Colab
Run the current cell by selecting the cell and pressing Shift + Enter.
Create a second code cell and create a script similar to the following code:
import os
os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /usr/local/lib/python2.7/dist-packages/td_pyspark/jars/td-spark-assembly.jar --properties-file /content/td-spark.conf pyspark-shell"
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

import td_pyspark
from pyspark import SparkContext
from pyspark.sql import SparkSession

builder = SparkSession.builder.appName("td-pyspark-test")
td = td_pyspark.TDSparkContextBuilder(builder).build()
df = td.table("sample_datasets.www_access").within("-10y").df()
df.show()
TDSparkContextBuilder is an entry point to access td_pyspark's functionality. As shown in the preceding code sample, you read tables in Treasure Data as data frames:
df = td.table("tablename").df()
You see a result similar to the following:
Your connection is working.
Interacting with Treasure Data from Google Colab
In Google Colab, you can run select and insert queries to Treasure Data or query back data from Treasure Data. You can also create and delete databases and tables.
In Google Colab, you can use the following commands:
Read Tables as DataFrames
By calling .df(), your table data is read as a Spark DataFrame. Usage of the DataFrame is the same as in PySpark. See also the PySpark DataFrame documentation.
df = td.table("sample_datasets.www_access").df()df.show()
Submit Presto Queries
If your Spark cluster is small, reading all of the data as in-memory DataFrame might be difficult. In this case, you can use Presto, a distributed SQL query engine, to reduce the amount of data processing with PySpark.
q = td.presto("select code, * from sample_datasets.www_access")
q.show()
q = td.presto("select code, count(*) from sample_datasets.www_access group by 1")
q.show()
You see:
Create or Drop a Database
td.create_database_if_not_exists("<my_db>"td.drop_database_if_exists("<my_db>")
Upload DataFrames to Treasure Data
To save your local DataFrames as a table, you have two options:
Insert the records in the input DataFrame to the target table
Create or replace the target table with the content of the input DataFrame
td.insert_into(df, "mydb.table1")td.create_or_replace(df, "mydb.table2")
Checking Google Colab in Treasure Data
You can use the td toolbelt to check your database from the command line. Alternatively, if you have TD Console, you can check your databases and queries there.
After creating a segment, you can activate it through the activations editor, or you can download it as a CSV if you want to see the segmented data in detail. Because there is no CSV download option directly in the Segment Editor, follow these steps to export a segment as a CSV:
Create a segment. For more information on this go to Creating a New Segment.
After creating the segment, navigate to Audience Studio > Segments.
Select the segment you want to export.
Select Edit Segment Rules to open the Rule Builder.
Select the SQL Query tab.
Copy the query from SQL Query tab and close the Rule Builder.
Navigate to Data Workbench > Queries.
Select New Query.
Paste the copied SQL query into the query editor.
Select Run.
After the query is complete, select the Query Result tab and select Download to get the CSV file.
This article lists the changelog of td-agent. The older versions of td-agent are here.
td-agent3 (New stable)
This version uses Fluentd v1. This article describes the differences between td-agent2 and td-agent3. If you have a new deployment, try this version first.
2019/11/01: version v3.5.1
fluentd v1.7.4
ruby v2.4.9
td v0.16.5
tzinfo-data v1.2019.3
aws-sdk-s3 v1.52.0
fluent-plugin-s3 v1.2.0
rdkafka v0.7.0
fluent-plugin-kafka v0.12.1
fluent-plugin-elasticsearch v3.5.5
fluent-plugin-prometheus v1.7.0
winevt_c v0.6.1
fluent-plugin-windows-eventlog v0.4.3
td-agent.conf on Windows will not be overwritten from the next version. See also https://github.com/treasure-data/omnibus-td-agent/pull/215
2019/08/27: version v3.5.0
fluentd v1.7.0
tzinfo v2.0.0
tzinfo-data v1.2019.2
aws-sdk-s3 v1.46.0
fluent-plugin-s3 v1.1.11
fluent-plugin-webhdfs v1.2.4
ruby-kafka v0.7.10
rdkafka v0.6.0
fluent-plugin-kafka v0.11.1
elasticsearch v6.8.0
fluent-plugin-elasticsearch v3.5.4
fluent-plugin-prometheus v1.5.0
fluent-plugin-windows-eventlog v0.3.0
Use /etc/default/td-agent as EnvironmentFile on Ubuntu/Debian
2019/05/18: version v3.4.1
ruby v2.4.6
fluent-plugin-s3 v1.1.10
fluent-plugin-rewrite-tag-filter v2.2.0
fluent-plugin-kafka v0.9.4
fluent-plugin-elasticsearch v3.5.1
fluent-plugin-prometheus v1.4.0
2019/05/04: version v3.4.0
Fluentd v1.4.2
This includes several additional patches (TLS fix for Windows, tag parameters in in_forward, etc)
fluent-plugin-kafka v0.9.2
fluent-plugin-elasticsearch v3.4.3
fluent-plugin-s3 v1.1.9
serverengine v2.1.1
aws-sdk-core v3.48.3
aws-sdk-s3 v1.36.0
ruby-kafka v0.7.6
rdkafka 0.4.2 (only Linux)
tzinfo-data v1.2019.1
oj v3.7.11
Use /etc/sysconfig/td-agent as EnvironmentFile in systemd
2019/01/10: version v3.3.0
ruby v2.4.5
Fluentd v1.3.3
fluent-plugin-kafka v0.8.3
fluent-plugin-elasticsearch v3.0.1
fluent-plugin-s3 v1.1.7
fluent-plugin-rewrite-tag-filter v2.1.1
serverengine v2.1.0
This fixes permission security issue of unix domain socket.
aws-sdk-core v3.44.2
aws-sdk-s3 v1.30.0
ruby-kafka v0.7.4
Drop kafka v0.10 support. If you want to connect to kafka v0.10, you need to downgrade ruby-kafka and fluent-plugin-kafka version.
tzinfo-data v1.2018.9
oj v3.7.4
td v0.16.4
2018/10/15: version v3.2.1
Fluentd v1.2.6
fluent-plugin-kafka v0.7.9
fluent-plugin-elasticsearch v2.11.11
fluent-plugin-s3 v1.1.6
serverengine 2.0.7
ruby-kafka v0.6.8
2018/06/13: version v3.2.0
Fluentd v1.2.2
fluent-plugin-kafka v0.7.3
fluent-plugin-elasticsearch v2.10.3
fluent-plugin-s3 v1.1.3
fluent-plugin-webhdfs v1.2.3
fluent-plugin-rewrite-tag-filter v2.1.0
Add fluent-plugin-record-modifier v1.1.0
postgresql 9.6.9
msgpack-ruby 1.2.4
td 0.16.1
bundler 1.16.2
yajl-ruby 1.4.0
tzinfo-data 1.2018.5
Support Ubuntu 18.04 64bit
2017/12/19: version v3.1.1
Fluentd v1.0.2
fluent-plugin-kafka v0.6.5
fluent-plugin-elasticsearch v2.4.0
jemalloc 4.5.0
postgresql 9.6.6
msgpack-ruby 1.2.0
dig_rb 1.0.1
Add RuntimeDirectory to systemd unit
2017/12/1: version v3.1.0
Ruby 2.4.2
Fluentd v0.14.25
fluent-plugin-s3 v1.1.0
fluent-plugin-kafka v0.6.4
fluent-plugin-webhdfs v1.2.2
fluent-plugin-td v1.0.0
fluent-plugin-rewrite-tag-filter v2.0.1
fluent-plugin-windows-eventlog v0.2.2
fluent-plugin-elasticsearch v2.3.0
bundler 1.6.0
msgpack-ruby 1.1.0
yajl-ruby 1.3.1
cool.io 1.5.3
oj 3.3.9
aws-sdk-s3 1.8.0
aws-sdk-sqs 1.3.0
td 0.15.7
2017/05/18: version v3.0.1
Ruby 2.4.1
Fluentd v0.14.16
fluent-plugin-s3 v1.0.0.rc3
fluent-plugin-kafka v0.5.5
fluent-plugin-webhdfs v1.1.1
fluent-plugin-windows-eventlog v0.2.0
fluent-plugin-elasticsearch v1.9.5
msgpack-ruby 1.1.0
2017/01/08: version v3.0.0
Ruby 2.4.0
Fluentd v0.14.11
fluent-plugin-s3 v1.0.0.rc1
fluent-plugin-kafka v0.5.0
fluent-plugin-td v1.0.0.rc1
fluent-plugin-elasticsearch v1.9.2
yajl-ruby v1.3.0
msgpack-ruby v1.0.2
Remove fluent-plugin-scribe and fluent-plugin-mongo from bundled plugins
Added Windows support
Dropped CentOS 5, Ubuntu Lucid, Ubuntu Precise, Debian Wheezy support
td-agent2 (Old stable)
This version uses Fluentd v0.12 and Ruby 2.5/2.1. This article describes the differences between td-agent2 and td-agent3. td-agent2 is for existing td-agent2 and Fluentd v0.12 users. If you have a new deployment, try td-agent3 first.
2018/06/28: version v2.5.0
Fluentd v0.12.43
ruby 2.5.1
2017/01/01: Repository GPG Key Update
From 2017/01/01, the major Linux packaging system apt has dropped support for SHA1-signed GPG keys for package signing, because the SHA1 algorithm is considered weak.
Treasure Agent's GPG key, located at https://packages.treasuredata.com/GPG-KEY-td-agent, has been upgraded. To update your GPG key for better security:
Update the GPG key for Treasure Agent
If you're using the old GPG key, you'll see an error message similar to the following:
W: Signature by key XYZXYZXYZXYZXYZ uses weak digest algorithm (SHA1)
2017/10/04: version v2.3.6
fluentd v0.12.40
fluent-plugin-s3 v0.8.5
fluent-plugin-mongo v0.8.1
fluent-plugin-kafka v0.6.1
fluent-plugin-rewrite-tag-filter v1.5.6
fluentd-ui v0.4.4
cool.io v1.5.1
aws-sdk v2.10.45
ruby-kafka v0.4.2
rubygems v2.6.13
2017/04/26: version v2.3.5
fluentd v0.12.35
fluent-plugin-s3 v0.8.2
fluent-plugin-mongo v0.8.0
fluent-plugin-kafka v0.5.5
fluentd-ui v0.4.4
cool.io v1.4.6
yajl-ruby v1.3.0
aws-sdk v2.9.9
td-client v0.8.85
tzinfo v1.2.3
tzinfo-data v1.2017.2
ruby-kafka v0.3.17
mongo v2.2.7
fix logrotate script
2016/12/31: version v2.3.4
fluentd v0.12.31
fluent-plugin-s3 v0.8.0
fluent-plugin-mongo v0.7.16
fluent-plugin-kafka v0.4.1
fluentd-ui v0.4.4.rc1
td v0.15.0
oj v2.18.0
aws-sdk v2.6.42
td-client v0.8.85
tzinfo-data v1.2016.10
ruby-kafka v0.3.15
httpclient v2.8.2.4
Add 32bit package to Ubuntu 16.04
2016/09/30: version v2.3.3
fluentd v0.12.29
fluent-plugin-s3 v0.7.1
fluent-plugin-mongo v0.7.15
fluent-plugin-kafka v0.3.1
fluent-plugin-td v0.10.29
fluentd-ui v0.4.3
td v0.15.0
cool.io v1.4.5
aws-sdk v2.6.3
td-client v0.8.83
tzinfo-data v1.2016.6
ruby-kafka v0.3.15
jemalloc v4.2.1
2016/07/10: version v2.3.2
fluentd v0.12.26
fluent-plugin-s3 v0.6.8
fluent-plugin-webhdfs v0.4.2
fluent-plugin-mongo v0.7.13
fluent-plugin-rewrite-tag-filter v1.5.5
td v0.14.0
ruby v2.1.10
cool.io v1.4.4
sigdump v0.2.4
httpclient v2.7.2
oj v2.15.1
aws-sdk v2.3.14
webhdfs v0.8.0
tzinfo-data v1.2016.5
Ubuntu 16.04 64bit support
2016/02/11: version v2.3.1
fluentd v0.12.20
fluent-plugin-s3 v0.6.5
fluent-plugin-td v0.10.28
fluent-plugin-td-monitoring v0.2.2
fluent-plugin-mongo v0.7.12
fluent-plugin-rewrite-tag-filter v1.5.4
td v0.13.2
rubygems v2.4.8
httpclient v2.7.1
oj v2.14.4
aws-sdk v2.2.14
webhdfs v0.7.4
tzinfo-data v1.2016.1
2015/12/24: version v2.3.0
fluentd v0.12.19
fluent-plugin-s3 v0.6.4
fluent-plugin-td v0.10.28
fluent-plugin-mongo v0.7.11
fluent-plugin-rewrite-tag-filter v1.5.3
td v0.13.0
ruby v2.1.8
bundler v1.10.6
cool.io v1.4.2
oj v2.14.2
tzinfo-data v1.205.07
2015/06/29: version v2.2.1
fluentd v0.12.12
fluent-plugin-s3 v0.5.9
fluent-plugin-mongo v0.7.10
fluent-plugin-td v0.10.27
fluent-plugin-td-monitoring v0.2.1
td v0.11.10
fluentd-ui v0.4.2
rpm: Fix -e issue in init script
deb: Improve stop mechanism to prevent killing other td-agent processes in containers
deb: Add STOPTIMEOUT parameter to control shutdown td-agent process same as rpm
2015/04/06: version v2.2.0
fluentd v0.12.7
fluent-plugin-s3 v0.5.7
fluent-plugin-mongo v0.7.8
td v0.11.8.2
fluentd-ui v0.3.18
ruby 2.1.5
json 1.8.2
cool.io 1.3.0
Amazon Linux support
2015/03/26: version v2.1.5-1
Put logrotate setting in deb package
Fix status command in rpm init script
Include postgresql
2015/03/03: version v2.1.5
fluentd v0.10.61
fluent-plugin-s3 v0.5.6
fluent-plugin-mongo v0.7.6
fluent-plugin-td v0.10.26
td v0.11.8
td-client v0.8.68
2015/02/09: version v2.1.4
fluentd v0.10.60
fluent-plugin-s3 v0.5.3
fluent-plugin-mongo v0.7.5
fluent-plugin-td v0.10.25
fluentd-ui v0.3.14
msgpack 0.5.11
2014/12/23: version v2.1.3
fluentd v0.10.58
fluent-plugin-scribe v0.10.14
fluent-plugin-s3 v0.5.1
fluent-plugin-td v0.10.23
fluent-plugin-td-monitoring v0.2.0
fluentd-ui v0.3.11
RHEL/CentOS 7 support
2014/11/17: version v2.1.2
fluentd v0.10.57
fluent-plugin-scribe v0.1.3
fluent-plugin-mongo v0.7.4
fluent-plugin-s3 v0.4.3
fluent-plugin-webhdfs v0.4.1
fluent-plugin-td-monitoring v0.1.4
Remove fluent-plugin-flume
fluentd-ui v0.3.7
jemalloc 2.2.5
rubygems 2.2.1
Create /var/run/td-agent during post installation on rpm
2014/10/22: version v2.1.1
Ruby 2.1.3
fluentd v0.10.55
td v0.11.5
td-client v0.8.66
fluent-plugin-scribe v0.1.2
fluent-plugin-s3 v0.4.1
fluent-plugin-webhdfs v0.3.1
fluent-plugin-td v0.10.22
fluent-plugin-td-monitoring v0.1.3
Fix option handling in configtest
rpm: default timeout to 120 seconds
Change v1 non-quoted string literal behavior: https://groups.google.com/forum/#!topic/fluentd/4C36iAyI87E
2014/09/05: version v2.1.0
Ruby 2.1.2
fluentd v0.10.53
Use use-v1-config by default
jemalloc v3.6.0
msgpack-ruby v0.5.8
Cool.io v1.2.4
td-agent1 (Unmaintained, Not Recommended for New Deployments)
2014/10/20: version v1.1.21
fluentd v0.10.55
td v0.11.5
td-client v0.8.66
fluent-plugin-scribe v0.1.2
fluent-plugin-s3 v0.4.1
fluent-plugin-webhdfs v0.3.1
fluent-plugin-td v0.10.22
fluent-plugin-td-monitoring v0.1.3
Fix option handling in configtest
rpm: default timeout to 120 seconds
deb: fix cleanup routine in purge
2014/06/19: version v1.1.20
fluentd v0.10.50
td-client v0.8.61
fluent-plugin-s3 v0.4.0
fluent-plugin-td v0.10.20
fluent-plugin-td-monitoring v0.1.2
rspec v3.0.0
Improve stop routine of init script in RPM
Set ulimit -n 65536 at start
2014/04/01: version v1.1.19
fluentd v0.10.45
td-client v0.8.58
td v0.10.99
fluent-plugin-s3 v0.3.7
fluent-plugin-td v0.10.18
fluent-plugin-rewrite-tag-filter v1.4.1
fluent-plugin-mongo v0.7.3
fluent-plugin-td-monitoring v0.1.1
bundler to v1.3.6
webhdfs v0.5.5
aws-sdk-ruby v1.38.0
2013/12/05: version v1.1.18
ruby v1.9.3-p484 (security fix)
fluentd v0.10.41
td-client v0.8.56
td v0.10.96
fluent-plugin-s3 v0.3.5
fluent-plugin-td v0.10.17
fluent-plugin-rewrite-tag-filter v1.3.1
2013/09/25: version v1.1.17
fluentd v0.10.39
td-client v0.8.55
td v0.10.89
fluent-plugin-td v0.10.16
Fix configtest permission issue at restart
Create /etc/default/td-agent to set daemon options at Debian and Ubuntu
2013/08/30: version v1.1.16
fluentd v0.10.38
td-client v0.8.54
td v0.10.86
Add configtest and use configtest at restart
2013/08/02: version v1.1.15
fluentd v0.10.36
td-client v0.8.53
td v0.10.84
fluent-plugin-s3 v0.3.4
fluent-plugin-webhdfs v0.2.1
fluent-plugin-mongo v0.7.1
fluent-plugin-td v0.10.15
cool.io v1.1.1
2013/06/24: version v1.1.14
fluentd v0.10.35
td-client v0.8.52
td v0.10.82
fluent-plugin-s3 v0.3.3
fluent-plugin-webhdfs v0.2.0
webhdfs v0.5.3
bson_ext v1.8.6
bson v1.8.6
mongo v1.8.6
yajl-ruby v1.1.0
json v1.7.7
2013/04/23: version v1.1.13
fluent-plugin-td v0.10.14
td-client v0.8.48
td v0.10.76
2013/04/04: version v1.1.12
fluentd v0.10.33
fluent-plugin-s3 v0.3.1
fluent-plugin-mongo v0.7.0
fluent-plugin-webhdfs v0.1.2
msgpack v0.4.7
bson_ext v1.8.4
bson v1.8.4
mongo v1.8.4
iobuffer v1.1.2
2012/12/07: version v1.1.11
fluentd v0.10.30
fluent-plugin-s3 v0.2.5
fluent-plugin-td v0.10.13
fluent-plugin-mongo v0.6.11
2012/10/16: version v1.1.10.3
td-client v0.8.34 to set the timeout for connect(2)
2012/10/15: version v1.1.10.2
fluent-plugin-td v0.10.12
2012/10/15: version v1.1.10.1
fixed /usr/bin/td regression of setting GEM_HOME and GEM_PATH
2012/10/15: version v1.1.10
fluentd v0.10.27
fluent-plugin-mongo v0.6.9
fluent-plugin-webhdfs v0.5.1
fluent-plugin-td v0.10.11
enable debug agent by default (td-agent.conf)
set GEM_HOME and GEM_PATH at /usr/[s]bin/scripts to avoid RVM conflicts
2012/08/27: version 1.1.9
fixed /usr/bin/td doesn't take any command line argument
Ubuntu LTS Precise 12.04 is now supported
allow overriding the variables by /etc/sysconfig/td-agent (rpm)
2012/07/23: version 1.1.8
fluentd v0.10.25
fixed critical problem of duplicate daemon launch problem (both deb and rpm)
2012/06/12: version 1.1.7
bson_ext v1.6.4
bson v1.6.4
mongo v1.6.4
fluent-plugin-td v0.10.7
td v0.10.25 (new)
install /usr/bin/td (new)
2012/05/20: version 1.1.6
fluent-plugin-flume v0.1.1
remove ruby package dependency (rpm only)
2012/05/02: version 1.1.5
ruby v1.9.3-p194 (security fix)
fluentd v0.10.22
add with-libyaml-dir to ruby's configure options
2012/04/23: version 1.1.4.4
add td-libyaml to the dependency (rpm only)
support debian 5 lenny (deb only)
2012/04/17: version 1.1.4
use ruby-1.9.3-p125
use jemalloc v2.2.5, to avoid memory fragmentations
fluentd v0.10.19
fluent-plugin-mongo v0.6.7
fluent-plugin-td v0.10.6
2012/03/24: version 1.1.3.1
not to start daemon, at the first installation. (rpm only)
fix prelink Ruby breakage problem on CentOS 4 (rpm only)
fix /etc/init.d/td-agent status problem on CentOS 4 (rpm only)
2012/03/10: version 1.1.3
fluent-plugin-mongo v0.6.6
2012/02/22: version 1.1.2.2
reduced # of dependent packages of RPM. The build time dependencies were unexpectedly included.
2012/02/21: version 1.1.2.1
fixed problem of removing /etc/prelink.d/td-agent.conf file, when upgrading (rpm only)
2012/02/15: version 1.1.2
fluentd v0.10.10
fluent-plugin-td v0.10.5
fluent-plugin-scribe v0.10.7
fluent-plugin-mongo v0.6.3
fluent-plugin-s3 v0.2.2 (New!)
2012/01/24: version 1.1.1
fluentd v0.10.9
fluent-plugin-scribe v0.10.6
fluent-plugin-mongo v0.6.2
fix /var/run/td-agent/ creation in init.d script
fix Ruby interpreter breakings by prelink, on 32-bit platform
2011/11/11: version 1.1.0
fluentd v0.10.6
fluent-plugin-td v0.10.2
fluent-plugin-scribe v0.10.3
fluent-plugin-mongo v0.4.0 (new)
prevent prelink from breaking the Ruby interpreter in the nightly job
2011/10/10: version 1.0.11
fix gem installation order
2011/10/05: version 1.0.10
fix postinst script for not overwriting user-installed config file
2011/10/01: version 1.0.9
fluent v0.9.16
fluent-plugin-scribe v0.9.10
2011/09/20: version 1.0.8
fluent v0.9.14
fluent-plugin-td v0.9.10
Caution
DMP activation on Google Display Ads has been paused. Google stopped whitelisting new accounts for this feature, so new accounts can't use it. The Google change does not affect accounts that are already whitelisted; existing accounts can still use the feature.
Leverage your data in Arm Treasure Data as you run marketing campaigns using Google AdWords.
You can create audience lists in your Google AdWords - DFP using data held in Treasure Data. Follow these steps to move cookies, IDFAs and ADIDs to new or existing audience lists within Google AdWords - DFP.
Prerequisites
Basic knowledge of Treasure Data, including the toolbelt
A Google AdWords Account
Authorized Treasure Data DMP access to your Google AdWords Account
Grant access for Treasure Data
The Treasure Data Google AdWords for DDP connector requires permissions to create audience segments in your Google AdWords account. Send an email to AdWords Support to grant Treasure Data access to your AdWords account. Provide the following information in the email:
Request: Grant Treasure Data permissions
Treasure Data DMP:
Customer-ID: 140-996-0635
NID: treasuredata_dmp
You are sending information so that Google recognizes Treasure Data and identifies your TD account to your Google AdWords account.
Usage
To export data, you create or select an existing connection, create or reuse a query, and then run the query to export your audience lists.
Create or reuse a query
Go to Treasure Data console
Go to the Query Editor. You can pick the query language and write your query.
Select your database and table.
Access the query that you plan to use to export data. Here is a sample query:
SELECT DISTINCT cookie, list_name, time FROM google_adwords_ddp
Create or select an existing Google AdWords DDP connection
Option
Task Steps
Tip
Create a connector
- Go to Integrations Hub - Click on the Google AdWords DDP icon from Sources Catalog. - Complete the required fields on the first pane. - In the AdWords Customer ID field, enter the ID that you use in Google Adwords. Next you complete the Configuration fields.
Sometimes you need to define the column mapping before writing the query.
Select a connector
- Go to Treasure Data console. - Go to Query Editor. - Access the query that you plan to use to export data. - Click Output results. The Choose Saved Connection dialog opens. - Type the connection name in the search box to filter and click the connection that you want. Next you complete the Configuration fields.
You can also create a new connection from here by clicking on Create New Connection.
Creating a new connection:
Selecting an existing connection:
Configure the connection by specifying the parameters
After you select your Google AdWords DDP connection, the Configuration dialog pane appears:
Source column name mappings (optional)
Define the mapping between Google DDP column names and the output column names that you specify in your query. You specify the target column and then the source column. For example, if google_cookie is the identifier column in your TD data source, define the mapping as cookie:google_cookie. If the source column in the mapping is missing, the target column name is used. For example, cookie is the same as the cookie:cookie mapping.
See Additional connector configuration information for more information about supported column names and usage.
Cookie or mobile identifier column header
Specify the original source of the user cookie or mobile identifier.
You must select one of the options:
cookie_encrypted: Encrypted identifier (for example, Web), a cookie hash of user id
cookie_idfa: iOS Advertising Identifier
cookie_adid: Android Advertising Identifier
See Additional connector configuration information for more information on how identifier names are handled.
Execute the query
Either save the query with a name and then run the query or just run the query. When the query completes successfully, the result is automatically processed. Plan to transfer your data at least 24 hours ahead of when you need the audience lists (also referred to as segments) to be in Google AdWords.
Additional connector configuration information
The manner in which Google handles your data from Treasure Data affects how you configure the connector.
Column Mappings
The Google AdWords (via DDP) connector reads the data source table by columns and uses the following column name mappings to process each row of data:
cookie: The encrypted Google ID or Mobile Advertising Identifier that DDP will use in id matching. This column contains the cookie hash or mobile identifier of your users.
list_name: This column contains the name of audience list (segment) that you want to create in your AdWords audience. If the list name does not exist in Google AdWords, a new list is created. If the list name does exist, the existing list is updated.
timestamp (optional): The timestamp (seconds since EPOCH). If this column does not exist or is missing, a current timestamp is used.
delete (optional): This column contains boolean values (false or true) or numbers (0 or 1) to indicate whether the cookie is to be added to or removed from the given audience segment. By default, the value is false if it is left blank or the column is not provided.
Cookie or Mobile Identifier Name
The upload process supports several different identifiers, and it is important that types of identifiers are uploaded to the segment using the correct upload file format. Generally, the identifiers fall into one of two categories: encrypted identifiers (anything obtained from Google systems), and raw identifiers (obtained from an external system or source). Any encrypted identifier is uploaded using the cookie_encrypted file format, and raw identifiers are uploaded in a type-specific upload file format such as Mobile Advertising Identifiers, for example cookie_idfa or cookie_adid.
Supported cookie types are cookie_encrypted, cookie_idfa, and cookie_adid.
Appendix: Limitations from Google AdWords
It may take up to 24 hours for updates to audience lists to be visible in Google AdWords. Expect to wait up to 24 hours from query completion for changes to be reflected in Google AdWords.
You can write job results directly to your Google BigQuery. You can view how to import data from Google BigQuery here.
Prerequisites
Basic knowledge of Arm Treasure Data, including the TD Toolbelt.
A Google Cloud Platform account
Get Google Cloud Platform Credential
To use this feature, you need your Project ID and a JSON credential.
JSON Credential
The integration with Google BigQuery is based on server-to-server API authentication. Go to your Google Developer Console, and click Credentials under APIs & auth at the left menu. Select Service account.
Select the JSON based key type, which is recommended by Google. The key is downloaded automatically by the browser.
Project ID
Go to your Google Developer Console, and click Home on the left menu. You see your Project ID.
Usage
We support "append, replace, replace backup, delete" mode.
Create Dataset and Table on BigQuery
Create your Dataset and Table from your BigQuery console.
Use from TD Console
Go to Treasure Data Console, go to the query editor, click Add for Result Export, and select BigQuery. Complete all the information as follows:
Write the query. The query results must match the pre-defined schema on BigQuery. After you execute the job, Treasure Data query results are automatically imported into Google BigQuery.
Append Mode:
When the table already exists in BigQuery, you see a pane similar to the following:
Fill in the Project ID, Dataset Name, and Table Name obtained from BigQuery.
Write the query. The query results must match the pre-defined schema on BigQuery. After you execute the job, Treasure Data query results are automatically imported into Google BigQuery.
When the table does not exist in BigQuery, you see a pane similar to the following:
Fill in the Project ID and Dataset Name obtained from BigQuery.
Provide a name for the table that will be created in the BigQuery dataset. Select Auto-create table to provide a schema for the new table. The value in the Schema JSON field must match the query results (number of fields and data types). After you execute the job, a new table with the name and schema you specified is created, and Treasure Data query results are automatically imported into this table in Google BigQuery.
Example of data in the JSON Schema field:
[{"name":"id","type":"INTEGER","mode": "REQUIRED"},{"name":"name","type":"STRING"}]
Replace Mode:
Fill in the Project ID, Dataset Name, and Table Name obtained from BigQuery.
Select Auto-create table and provide a schema for the table to be replaced. The value in the Schema JSON field must match the query results (number of fields and data types). Choose Replace from the drop-down. After you execute the job, if the table already exists in BigQuery, or if the value in Schema JSON differs from the table schema, the table is replaced. If the table does not exist in BigQuery, a new table is created with the name and schema you specified. Treasure Data query results are automatically imported into this table.
Replace Backup Mode:
Replace backup mode is similar to replace mode, but in replace backup mode, both data and schema are backed up into a table that is assigned the old name plus "_old" as the suffix.
Delete Mode:
Fill in the Project ID, Dataset Name, and Table Name obtained from BigQuery.
Select Auto-create table and provide a schema for the new table. The value in the Schema JSON field must match the query results (number of fields and data types). Choose Delete from the drop-down. If the table already exists in BigQuery, when you execute the job the table is deleted and a new table is created with the Schema JSON you provided. If the table does not exist in BigQuery, a new table is created with the name and schema you specified. Treasure Data query results are automatically imported into this newly created table.
Notes
Data conversion
Treasure Data types are automatically converted to the corresponding BigQuery types, as indicated in the following table. If you include unsupported types in the Schema JSON field, or if the query result's data types do not match the values in Schema JSON, you receive errors.
Treasure Data → BigQuery
string → STRING
long → INTEGER
double → FLOAT
long (true is 1, false is 0) → BOOLEAN
string (yyyy-MM-dd HH:mm:ss.SSS) → TIMESTAMP
string (as JSON) → RECORD
string (as JSON) → REPEATED (PRIMITIVE or RECORD)
Use from CLI
The following command allows you to set a scheduled query with Results Output to BigQuery.
Designate your json_key and escape newlines with backslashes.
For example,
$ td sched:create scheduled_bigquery "10 6 14 12 *" \
-d dataconnector_db "SELECT id,account,purchase,comment,time FROM data_connectors" \
-r '{"type":"bigquery","project":"YOUR_PROJECT","dataset":"YOUR_DB","table":"YOUR_TABLE","auto_create_table":true,"max_bad_records":0,"ignore_unknown_values":true,"allow_quoted_newlines":true,"schema_file":"[{\"name\": \"id\", \"type\": \"INTEGER\"}, {\"name\": \"account\", \"type\": \"STRING\"},{\"name\": \"purchase\", \"type\": \"STRING\"}, {\"name\": \"comment\", \"type\": \"STRING\", \"mode\": \"REQUIRED\"}, {\"name\": \"time\", \"type\": \"TIMESTAMP\", \"mode\": \"REQUIRED\"}]", "json_keyfile":"{\"private_key_id\": \"ABDE\", \"private_key\": \"-----BEGIN PRIVATE KEY-----\\nABCDE\\nABCDE\\nABCDE\\n-----END PRIVATE KEY-----\\n\", \"client_email\": \"ABCDE.gserviceaccount.com\", \"client_id\": \"ABCDE.apps.googleusercontent.com\", \"type\": \"service_account\"}"}'
The Hive query engine has some limitations at this stage.
SELECT row limit
The maximum number of rows for SELECT queries is 2^31 (2,147,483,647) on both CDH4 and HDP2. If the number of rows exceeds the limit, the query fails.
To work around this issue, use INSERT INTO, which can insert more records.
Division
When two values (of any type) are divided in Hive, the result of the division is always automatically upgraded to double. To obtain an integer result from a division, cast the result of the division to the needed type. For example:
SELECT
CAST((column1 / column2) AS bigint) as division
FROM
table
The Presto engine behaves differently with divisions and does not automatically upgrade the result to double.
Trim Double-Byte Space
The TRIM function in Hive can't handle double-byte spaces. For example:
SELECT
TRIM('a　') as value_keyword
FROM
table
=> 'a　' (Expected result is 'a')
As a workaround, use regexp_replace:
SELECT
regexp_replace(TRIM('a　'), '^　+|　+$', '') as value_keyword
FROM
table
Hive Floating Point
Hive adds extra decimals to floating point values (see https://issues.apache.org/jira/browse/HIVE-3715). The following example illustrates the behavior.
1. Create a table floattest with 2 columns:
floatval for float datatype
doubleval for double datatype
2. Insert values into the floattest table:
123123.2 as floatval
123123.2 as doubleval
3. Select cast(floatval as double) from floattest, or select floatval from floattest.
returned: 123123.203125.
This is consistent with the output from Java because the data is stored as a double.
4. Select doubleval from floattest
returned : 123123.2
If you are concerned about the precision of numbers in your output, use double instead of float.
Hive's reserved keywords (Why do I get a warning about reserved keywords?)
Treasure Data uses Hive as a querying language. One of Hive's limitations is that a table name cannot be a reserved keyword. Thus, if your table (or database) name is identical to a reserved keyword, you won't be able to query data in that table (or database).
Hive also has non-reserved keywords, which have a defined meaning in the language but are also allowed as identifiers. These keywords allow similar unrestricted usage; in particular, they are allowed as column or table names.
For example, if you create an items table and store data in it, you won't be able to execute the following Hive query, because items is a keyword in Hive.
SELECT v FROM items
The following are Reserved Keywords in Hive:
ALL,ALTER,AND,ARRAY,AS,AUTHORIZATION,BETWEEN,BIGINT,BINARY,BOOLEAN,BOTH,BY,CASE,CAST,
CHAR,COLUMN,CONF,CREATE,CROSS,CUBE,CURRENT,CURRENT_DATE,CURRENT_TIMESTAMP,CURSOR,
DATABASE,DATE,DECIMAL,DELETE,DESCRIBE,DISTINCT,DOUBLE,DROP,ELSE,END,EXCHANGE,EXISTS,
EXTENDED,EXTERNAL,FALSE,FETCH,FLOAT,FOLLOWING,FOR,FROM,FULL,FUNCTION,GRANT,GROUP,
GROUPING,HAVING,IF,IMPORT,IN,INNER,INSERT,INT,INTERSECT,INTERVAL,INTO,IS,JOIN,LATERAL,
LEFT,LESS,LIKE,LOCAL,MACRO,MAP,MORE,NONE,NOT,NULL,OF,ON,OR,ORDER,OUT,OUTER,OVER,
PARTIALSCAN,PARTITION,PERCENT,PRECEDING,PRESERVE,PROCEDURE,RANGE,READS,REDUCE,REVOKE,
RIGHT,ROLLUP,ROW,ROWS,SELECT,SET,SMALLINT,TABLE,TABLESAMPLE,THEN,TIMESTAMP,TO,TRANSFORM,
TRIGGER,TRUE,TRUNCATE,UNBOUNDED,UNION,UNIQUEJOIN,UPDATE,USER,USING,VALUES,VARCHAR,WHEN,
WHERE,WINDOW,WITH
The following are Non-reserved Keywords in Hive:
ADD,ADMIN,AFTER,ANALYZE,ARCHIVE,ASC,BEFORE,BUCKET,BUCKETS,CASCADE,CHANGE,CLUSTER,
CLUSTERED,CLUSTERSTATUS,COLLECTION,COLUMNS,COMMENT,COMPACT,COMPACTIONS,COMPUTE,
CONCATENATE,CONTINUE,DATA,DATABASES,DATETIME,DAY,DBPROPERTIES,DEFERRED,DEFINED,
DELIMITED,DEPENDENCY,DESC,DIRECTORIES,DIRECTORY,DISABLE,DISTRIBUTE,ELEM_TYPE,ENABLE,
ESCAPED,EXCLUSIVE,EXPLAIN,EXPORT,FIELDS,FILE,FILEFORMAT,FIRST,FORMAT,FORMATTED,
FUNCTIONS,HOLD_DDLTIME,HOUR,IDXPROPERTIES,IGNORE,INDEX,INDEXES,INPATH,INPUTDRIVER,
INPUTFORMAT,ITEMS,JAR,KEYS,KEY_TYPE,LIMIT,LINES,LOAD,LOCATION,LOCK,LOCKS,LOGICAL,LONG,
MAPJOIN,MATERIALIZED,MINUS,MINUTE,MONTH,MSCK,NOSCAN,NO_DROP,OFFLINE,OPTION,OUTPUTDRIVER,
OUTPUTFORMAT,OVERWRITE,OWNER,PARTITIONED,PARTITIONS,PLUS,PRETTY,PRINCIPALS,PROTECTION,
PURGE,READ,READONLY,REBUILD,RECORDREADER,RECORDWRITER,REGEXP,RELOAD,RENAME,REPAIR,
REPLACE,RESTRICT,REWRITE,RLIKE,ROLE,ROLES,SCHEMA,SCHEMAS,SECOND,SEMI,SERDE,
SERDEPROPERTIES,SERVER,SETS,SHARED,SHOW,SHOW_DATABASE,SKEWED,SORT,SORTED,SSL,STATISTICS,
STORED,STREAMTABLE,STRING,STRUCT,TABLES,TBLPROPERTIES,TEMPORARY,TERMINATED,TINYINT,TOUCH,
TRANSACTIONS,UNARCHIVE,UNDO,UNIONTYPE,UNLOCK,UNSET,UNSIGNED,URI,USE,UTC,UTCTIMESTAMP,
VALUE_TYPE,VIEW,WHILE,YEAR
If you encounter a reserved keyword issue in your query, wrap the infringing keyword in backticks (`) as shown below.
SELECT `keyword_column`, COUNT(1) AS cnt FROM table_name GROUP BY `keyword_column`
Query: Hive 0.13 Query Returns Unexpected Null Results
Hive jobs complete successfully but, on rare occasions, produce an incorrect result, outputting NULL values for one or more columns, even when NULL is filtered out by the WHERE clause or is not present in the source data. There are numerous SQL patterns that might produce this error, but they typically involve HiveQL with subqueries.
Workaround: Create intermediate tables that compute the values of the subqueries, and then use the intermediate tables to rewrite the main query.
For example:
Assume that in the source data table test_join, test_join.pid contains no NULL values.
The following query produces unexpected NULL values in the pcnt column:
SELECT t0.pid, t2.cnt AS pcnt
FROM
(SELECT distinct pid FROM test_join ) t0
JOIN
(SELECT pid,
SUM(paid_price) AS weight,
COUNT(*) AS cnt
FROM test_join
GROUP BY pid
ORDER BY weight DESC
-- With this ORDER BY, the pcnt value of every query result is NULL.
-- Without this line, the pcnt value is correct.
) t2
ON t0.pid = t2.pid
One way to rewrite such a query is to break out subqueries into separate tables, as follows:
CREATE TABLE distinct_pid as (SELECT distinct pid FROM test_join);
CREATE TABLE weight_tbl as (SELECT pid,
SUM(paid_price) AS weight,
COUNT(*) AS cnt
FROM test_join
GROUP BY pid
ORDER BY weight DESC);
SELECT t0.pid, t2.cnt AS pcnt
FROM
distinct_pid t0
JOIN
weight_tbl t2
ON t0.pid = t2.pid ;
Contact support if you need help developing a workaround for your situation.
You can integrate Amplitude with Arm Treasure Data to enhance data, add more data points, and integrate the data Amplitude collects with all the other parts of your marketing stack.
To view sample workflows for importing data from Amplitude, view Treasure Boxes.
You use this same connector to create and export Amplitude Events. See Exporting Event Data to Amplitude.
Integrating Amplitude with Treasure Data makes it easy to:
Add new features to Amplitude. As an example, you can use Treasure Data to unify Amplitude behavior data with customer data from Salesforce, creating personally identifiable records with cross-device behavior tracking.
Use data collected from Amplitude to make the rest of your marketing stack smarter. For example, you can use Treasure Data and Amplitude to reduce churn by setting up automated triggers to create segments based on usage and feeding them into custom nurture funnels in Marketo.
If you don't have a Treasure Data account, contact us so we can get you set up!
How to connect to Amplitude using the Treasure Data Console
Connecting to Amplitude using the Treasure Data Console is quick and easy. Alternatively, you can use a command line interface to create connections. (For instructions to connect using the command line, click here. )
In Amplitude
Go to Settings -> Projects and get your API Key and Secret Key, which you need for the next step.
In the Treasure Data console
Go to Connections -> Sources Catalog and click on the Amplitude tile.
Enter your API Key and Secret Key and click Continue.
Give your connection a descriptive name and select Create Connection.
How to transfer your Amplitude data to Treasure Data
Select the Amplitude connection you created in Connections -> My Connections.
Click on New Transfer.
Enter the start time in the format YYYYMMDDTHH.
Optionally, you can specify an End Time using the same format. If End Time is not specified the default is the current time of your browser timezone.
Click Next to preview your data.
You'll see a preview of your data. Click Next.
If you are creating a new database, check Create new database and give your database a name. Do the same with Create new table.
Select whether to append records to an existing table or replace your existing table.
If you want to set a different partition key seed than the default key, you can specify a key using the popup menu.
In the When tab, you can specify a one-time transfer, or you can schedule an automated recurring transfer. If you select Once now, click Start Transfer. If you select Repeat, specify your schedule options, then click Schedule Transfer.
After your transfer runs, you can see the results of your transfer in the Databases tab.
How to connect to Amplitude using the command line
Before you can use the command line to configure a connector, you must install the Treasure Data Toolbelt.
Install the Treasure Data Toolbelt
Install the newest Treasure Data Toolbelt.
$ td --version
0.15.3
Create a seed file
First, create a file called seed.yml as shown in the following example, with your credentials.
in:
type: amplitude
api_key: "<YOUR_API_KEY>"
secret_key: "<YOUR_SECRET_KEY>"
start_time: "20160901T03" # UTC Timezone. The format is yyyymmddThh. "T" is a static character.
Run the Guess Fields command
Run the following command in your terminal:
$ td connector:guess seed.yml -o load.yml
The connector:guess command automatically reads the target data and intelligently guesses the data format.
Open the file load.yml, where you'll see guessed file format definitions including, in some cases, file formats, encodings, column names, and types.
in: {type: amplitude, api_key: API_KEY, secret_key: SECRET_KEY,
start_time: 20160901T03}
out: {}
exec: {}
filters:
- type: rename
rules:
- rule: upper_to_lower
- rule: character_types
pass_types: ["a-z", "0-9"]
pass_characters: "_"
replace: "_"
- rule: first_character_types
pass_types: ["a-z"]
pass_characters: "_"
prefix: "_"
- rule: unique_number_suffix
max_length: 128
- type: add_time
to_column: {name: time}
from_value: {mode: upload_time}
For more details on the rename filter, see rename filter plugin for Data Connector.
For more details on available out modes, see Appendix.
You can preview how the system will parse the file by using the preview command.
$ td connector:preview load.yml
+----------+-------------------+----------------------+----------------------------------------+----------------------+-------------------------------+-----------------------------+-------------------------------+---------------+-----------------+-------------------+-----------------------------+---------------------+--------------+-------------+-------------+-------------+----------------+----------------------------------+------------------------------+-----------------+----------------+-------------------+---------------------+----------------------------+---------------------+-----------------------+--------------------+----------------------+---------------------+---------------------+-----------------+-----------------+-------------+---------------+------------+----------------+-------------------+----------------+----------------------+-----------------------+---------------------------+--------------------------+----------------------------------------+-------------------+-------------------------------+
| app:long | amplitude_id:long | user_id:string | device_id:string | event_time:timestamp | server_upload_time:timestamp | client_event_time:timestamp | client_upload_time:timestamp | event_id:long | session_id:long | event_type:string | amplitude_event_type:string | version_name:string | _schema:long | adid:string | groups:json | idfa:string | library:string | processed_time:timestamp | user_creation_time:timestamp | platform:string | os_name:string | os_version:string | device_brand:string | device_manufacturer:string | device_model:string | device_carrier:string | device_type:string | device_family:string | location_lat:double | location_lng:double | country:string | language:string | city:string | region:string | dma:string | revenue:double | ip_address:string | paying:boolean | start_version:string | event_properties:json | user_properties:json | data:json | uuid:string | _insert_id:string | time:timestamp |
+----------+-------------------+----------------------+----------------------------------------+----------------------+-------------------------------+-----------------------------+-------------------------------+---------------+-----------------+-------------------+-----------------------------+---------------------+--------------+-------------+-------------+-------------+----------------+----------------------------------+------------------------------+-----------------+----------------+-------------------+---------------------+----------------------------+---------------------+-----------------------+--------------------+----------------------+---------------------+---------------------+-----------------+-----------------+-------------+---------------+------------+----------------+-------------------+----------------+----------------------+-----------------------+---------------------------+--------------------------+----------------------------------------+-------------------+-------------------------------+
| 154962 | 10158089806 | "[email protected]" | "600bfc44-73b1-5e36-bc2f-d230dad9b1fd" | nil | "2016-09-13 04:25:31.405 UTC" | nil | "2016-09-13 04:25:31.405 UTC" | 823811538 | -1 | "watch_tutorial" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 04:25:32.791225 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "United States" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test A\"}" | "{\"first_event\":true}" | "1bc49492-796a-11e6-ae6a-22000a2589e5" | nil | "2016-09-28 08:00:10.137 UTC" |
| 154962 | 10158089806 | "[email protected]" | "600bfc44-73b1-5e36-bc2f-d230dad9b1fd" | nil | "2016-09-13 04:25:36.952 UTC" | nil | "2016-09-13 04:25:36.952 UTC" | 38872708 | -1 | "watch_tutorial" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 04:25:38.171848 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "United States" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test A\"}" | "{}" | "1f635fde-796a-11e6-ae6a-22000a2589e5" | nil | "2016-09-28 08:00:10.137 UTC" |
| 154962 | 10158089806 | "[email protected]" | "600bfc44-73b1-5e36-bc2f-d230dad9b1fd" | nil | "2016-09-13 05:00:03.679 UTC" | nil | "2016-09-13 05:00:03.679 UTC" | 689967922 | -1 | "watch_tutorial" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 05:00:04.692970 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "United States" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test A\"}" | "{}" | "ef211fbe-796e-11e6-ae6a-22000a2589e5" | nil | "2016-09-28 08:00:10.137 UTC" |
| 154962 | 10158359360 | "[email protected]" | "b37e7b0d-5474-5618-ae0a-e3d5b36b35f6" | nil | "2016-09-13 06:00:25.511 UTC" | nil | "2016-09-13 06:00:25.511 UTC" | 44814028 | -1 | "eventtest" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 06:00:26.509410 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "Japan" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test B\"}" | "{}" | "5da3baf2-7977-11e6-ba91-22000a5680d9" | nil | "2016-09-28 08:00:10.137 UTC" |
| 154962 | 10158089806 | "[email protected]" | "600bfc44-73b1-5e36-bc2f-d230dad9b1fd" | nil | "2016-09-13 06:00:25.541 UTC" | nil | "2016-09-13 06:00:25.541 UTC" | 847463167 | -1 | "watch_tutorial" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 06:00:26.735517 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "United States" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test A\"}" | "{}" | "5e0badba-7977-11e6-ae6a-22000a2589e5" | nil | "2016-09-28 08:00:10.137 UTC" |
| 154962 | 10158359360 | "[email protected]" | "b37e7b0d-5474-5618-ae0a-e3d5b36b35f6" | nil | "2016-09-13 07:00:00.437 UTC" | nil | "2016-09-13 07:00:00.437 UTC" | 808687647 | -1 | "eventtest" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 07:00:01.389360 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "Japan" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test B\"}" | "{}" | "b07c57b8-797f-11e6-ba91-22000a5680d9" | nil | "2016-09-28 08:00:10.137 UTC" |
| 154962 | 10158089806 | "[email protected]" | "600bfc44-73b1-5e36-bc2f-d230dad9b1fd" | nil | "2016-09-13 07:00:00.589 UTC" | nil | "2016-09-13 07:00:00.589 UTC" | 145913611 | -1 | "watch_tutorial" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 07:00:01.178515 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "United States" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test A\"}" | "{}" | "b091fe4c-797f-11e6-ae6a-22000a2589e5" | nil | "2016-09-28 08:00:10.137 UTC" |
| 154962 | 10158089806 | "[email protected]" | "600bfc44-73b1-5e36-bc2f-d230dad9b1fd" | nil | "2016-09-13 08:00:26.193 UTC" | nil | "2016-09-13 08:00:26.193 UTC" | 118354173 | -1 | "watch_tutorial" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 08:00:26.888901 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "United States" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test A\"}" | "{}" | "21a9a762-7988-11e6-ae6a-22000a2589e5" | nil | "2016-09-28 08:00:10.137 UTC" |
| 154962 | 10158359360 | "[email protected]" | "b37e7b0d-5474-5618-ae0a-e3d5b36b35f6" | nil | "2016-09-13 08:00:26.297 UTC" | nil | "2016-09-13 08:00:26.297 UTC" | 177419860 | -1 | "eventtest" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 08:00:27.323531 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "Japan" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test B\"}" | "{}" | "21ebf04a-7988-11e6-ba91-22000a5680d9" | nil | "2016-09-28 08:00:10.137 UTC" |
| 154962 | 10158359360 | "[email protected]" | "b37e7b0d-5474-5618-ae0a-e3d5b36b35f6" | nil | "2016-09-13 09:00:00.890 UTC" | nil | "2016-09-13 09:00:00.890 UTC" | 256548632 | -1 | "eventtest" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 09:00:01.668570 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "Japan" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test B\"}" | "{}" | "7409a27a-7990-11e6-ba91-22000a5680d9" | nil | "2016-09-28 08:00:10.137 UTC" |
| 154962 | 10158089806 | "[email protected]" | "600bfc44-73b1-5e36-bc2f-d230dad9b1fd" | nil | "2016-09-13 09:00:00.982 UTC" | nil | "2016-09-13 09:00:00.982 UTC" | 496536219 | -1 | "watch_tutorial" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 09:00:01.650288 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "United States" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test A\"}" | "{}" | "746275a8-7990-11e6-ae6a-22000a2589e5" | nil | "2016-09-28 08:00:10.137 UTC" |
| 154962 | 10158089806 | "[email protected]" | "600bfc44-73b1-5e36-bc2f-d230dad9b1fd" | nil | "2016-09-13 10:16:46.350 UTC" | nil | "2016-09-13 10:16:46.350 UTC" | 958097131 | -1 | "watch_tutorial" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 10:16:46.663451 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "United States" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test A\"}" | "{}" | "2d2fb2bc-799b-11e6-ae6a-22000a2589e5" | nil | "2016-09-28 08:00:10.137 UTC" |
| 154962 | 10158359360 | "[email protected]" | "b37e7b0d-5474-5618-ae0a-e3d5b36b35f6" | nil | "2016-09-13 10:16:46.393 UTC" | nil | "2016-09-13 10:16:46.393 UTC" | 474656866 | -1 | "eventtest" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 10:16:47.669048 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "Japan" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test B\"}" | "{}" | "2dc7129c-799b-11e6-ba91-22000a5680d9" | nil | "2016-09-28 08:00:10.137 UTC" |
| 154962 | 10158089806 | "[email protected]" | "600bfc44-73b1-5e36-bc2f-d230dad9b1fd" | nil | "2016-09-13 14:25:43.820 UTC" | nil | "2016-09-13 14:25:43.820 UTC" | 562771389 | -1 | "watch_tutorial" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 14:25:44.723185 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "United States" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test A\"}" | "{}" | "f4f3f0ca-79bd-11e6-ae6a-22000a2589e5" | nil | "2016-09-28 08:00:10.137 UTC" |
| 154962 | 10158359360 | "[email protected]" | "b37e7b0d-5474-5618-ae0a-e3d5b36b35f6" | nil | "2016-09-13 14:25:43.825 UTC" | nil | "2016-09-13 14:25:43.825 UTC" | 714909881 | -1 | "eventtest" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 14:25:45.332413 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "Japan" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test B\"}" | "{}" | "f5472bb4-79bd-11e6-ba91-22000a5680d9" | nil | "2016-09-28 08:00:10.137 UTC" |
| 154962 | 10158359360 | "[email protected]" | "b37e7b0d-5474-5618-ae0a-e3d5b36b35f6" | nil | "2016-09-13 16:25:52.992 UTC" | nil | "2016-09-13 16:25:52.992 UTC" | 966142739 | -1 | "eventtest" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 16:25:54.925309 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "Japan" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test B\"}" | "{}" | "be8c64a2-79ce-11e6-ba91-22000a5680d9" | nil | "2016-09-28 08:00:10.137 UTC" |
| 154962 | 10158089806 | "[email protected]" | "600bfc44-73b1-5e36-bc2f-d230dad9b1fd" | nil | "2016-09-13 16:25:52.980 UTC" | nil | "2016-09-13 16:25:52.980 UTC" | 136550534 | -1 | "watch_tutorial" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 16:25:55.159777 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "United States" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test A\"}" | "{}" | "beb0e872-79ce-11e6-ae6a-22000a2589e5" | nil | "2016-09-28 08:00:10.137 UTC" |
| 154962 | 10158089806 | "[email protected]" | "600bfc44-73b1-5e36-bc2f-d230dad9b1fd" | nil | "2016-09-13 18:26:06.237 UTC" | nil | "2016-09-13 18:26:06.237 UTC" | 911332926 | -1 | "watch_tutorial" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 18:26:07.308899 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "United States" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test A\"}" | "{}" | "897c0d24-79df-11e6-ae6a-22000a2589e5" | nil | "2016-09-28 08:00:10.137 UTC" |
| 154962 | 10158359360 | "[email protected]" | "b37e7b0d-5474-5618-ae0a-e3d5b36b35f6" | nil | "2016-09-13 18:26:06.582 UTC" | nil | "2016-09-13 18:26:06.582 UTC" | 637365591 | -1 | "eventtest" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 18:26:07.932679 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "Japan" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test B\"}" | "{}" | "89d72740-79df-11e6-ba91-22000a5680d9" | nil | "2016-09-28 08:00:10.137 UTC" |
| 154962 | 10158089806 | "[email protected]" | "600bfc44-73b1-5e36-bc2f-d230dad9b1fd" | nil | "2016-09-13 21:26:14.886 UTC" | nil | "2016-09-13 21:26:14.886 UTC" | 325299849 | -1 | "watch_tutorial" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 21:26:15.879945 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "United States" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test A\"}" | "{}" | "b3e28336-79f8-11e6-ae6a-22000a2589e5" | nil | "2016-09-28 08:00:10.137 UTC" |
| 154962 | 10158359360 | "[email protected]" | "b37e7b0d-5474-5618-ae0a-e3d5b36b35f6" | nil | "2016-09-13 21:26:14.881 UTC" | nil | "2016-09-13 21:26:14.881 UTC" | 275591044 | -1 | "eventtest" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-13 21:26:15.902873 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "Japan" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test B\"}" | "{}" | "b3e89ca8-79f8-11e6-ba91-22000a5680d9" | nil | "2016-09-28 08:00:10.137 UTC" |
| 154962 | 10158089806 | "[email protected]" | "600bfc44-73b1-5e36-bc2f-d230dad9b1fd" | nil | "2016-09-14 01:29:27.029 UTC" | nil | "2016-09-14 01:29:27.029 UTC" | 335311243 | -1 | "watch_tutorial" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-14 01:29:28.287592 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "United States" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test A\"}" | "{}" | "ada09ba8-7a1a-11e6-9f4f-22000a2589e5" | nil | "2016-09-28 08:00:10.137 UTC" |
| 154962 | 10158359360 | "[email protected]" | "b37e7b0d-5474-5618-ae0a-e3d5b36b35f6" | nil | "2016-09-14 01:29:26.988 UTC" | nil | "2016-09-14 01:29:26.988 UTC" | 633523172 | -1 | "eventtest" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-14 01:29:28.350201 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "Japan" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test B\"}" | "{}" | "ada9cae8-7a1a-11e6-bead-22000a5680d9" | nil | "2016-09-28 08:00:10.137 UTC" |
| 154962 | 10158359360 | "[email protected]" | "b37e7b0d-5474-5618-ae0a-e3d5b36b35f6" | nil | "2016-09-14 02:00:26.207 UTC" | nil | "2016-09-14 02:00:26.207 UTC" | 449936451 | -1 | "eventtest" | nil | nil | 10 | nil | "{}" | nil | "http/1.0" | "2016-09-14 02:00:26.662638 UTC" | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | nil | "Japan" | nil | nil | nil | nil | nil | "127.0.0.1" | nil | nil | "{}" | "{\"Cohort\":\"Test B\"}" | "{}" | "0157e478-7a1f-11e6-bead-22000a5680d9" | nil | "2016-09-28 08:00:10.137 UTC" |
+----------+-------------------+----------------------+----------------------------------------+----------------------+-------------------------------+-----------------------------+-------------------------------+---------------+-----------------+-------------------+-----------------------------+---------------------+--------------+-------------+-------------+-------------+----------------+----------------------------------+------------------------------+-----------------+----------------+-------------------+---------------------+----------------------------+---------------------+-----------------------+--------------------+----------------------+---------------------+---------------------+-----------------+-----------------+-------------+---------------+------------+----------------+-------------------+----------------+----------------------+-----------------------+---------------------------+--------------------------+----------------------------------------+-------------------+-------------------------------+
24 rows in set
Update load.yml and use 'td connector:preview load.yml' to preview again.
Use 'td connector:issue load.yml' to run Server-side bulk load.
Execute a Load Job
Finally, submit the load job. It may take a couple of hours depending on the size of your data.
NOTE: The Amplitude connector provides a time column automatically, but you can also specify any timestamp column with the --time-column option, such as --time-column server_upload_time.
$ td connector:issue load.yml --database td_sample_db --table td_sample_table
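For example, to partition by the server_upload_time column mentioned in the note, the same command can include --time-column (a sketch using the placeholders above):
$ td connector:issue load.yml --database td_sample_db --table td_sample_table \
    --time-column server_upload_time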
Scheduled execution
You can also schedule incremental, periodic Data Connector execution from the command line, removing the need for a cron daemon on your local data center.
For the first scheduled import, the Data Connector for Amplitude imports all of your data. On the second and subsequent runs, only newly added files are imported.
Create the schedule
A new schedule can be created by using the td connector:create command.
The name of the schedule, cron-style schedule, database and table where data will be stored, and the Data Connector configuration file are required.
$ td connector:create \
daily_import \
"10 0 * * *" \
td_sample_db \
td_sample_table \
load.yml
The `cron` parameter also accepts three special options: `@hourly`, `@daily` and `@monthly`.
By default, the schedule is set in the UTC time zone. You can set the schedule in a time zone using the `-t` or `--timezone` option. The `--timezone` option only supports extended timezone formats like 'Asia/Tokyo', 'America/Los_Angeles' etc. Timezone abbreviations like PST, CST are *not* supported and may lead to unexpected results.
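For example, the following is a sketch of creating the same schedule with the @daily shortcut in the Asia/Tokyo time zone; the names are the placeholders used above:
$ td connector:create \
    daily_import \
    "@daily" \
    td_sample_db \
    td_sample_table \
    load.yml \
    --timezone Asia/Tokyo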
List the Schedules
You can see the list of scheduled entries by running the command td connector:list.
$ td connector:list
+--------------+------------+----------+-------+--------------+-----------------+----------------------------+
| Name | Cron | Timezone | Delay | Database | Table | Config |
+--------------+------------+----------+-------+--------------+-----------------+----------------------------+
| daily_import | 10 0 * * * | UTC | 0 | td_sample_db | td_sample_table | {"in"=>{"type"=>"amp", ... |
+--------------+------------+----------+-------+--------------+-----------------+----------------------------+
Show the Settings and Schedule History
td connector:show shows the execution settings of a schedule entry.
% td connector:show daily_import
Name : daily_import
Cron : 10 0 * * *
Timezone : UTC
Delay : 0
Database : td_sample_db
Table : td_sample_table
Config
---
in:
type: amplitude
api_key: "<YOUR_API_KEY>"
secret_key: "<YOUR_SECRET_KEY>"
start_time: "20160901T03" # UTC Timezone. The format is yyyymmddThh. "T" is a static character.
out:
mode: append
td connector:history shows the execution history of a schedule entry. To investigate the results of each individual run, use td job <jobid>.
% td connector:history daily_import
+--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+
| JobID | Status | Records | Database | Table | Priority | Started | Duration |
+--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+
| 578066 | success | 10000 | td_sample_db | td_sample_table | 0 | 2016-09-18 00:10:05 +0000 | 160 |
| 577968 | success | 10000 | td_sample_db | td_sample_table | 0 | 2016-09-17 00:10:07 +0000 | 161 |
| 577914 | success | 10000 | td_sample_db | td_sample_table | 0 | 2016-09-16 00:10:03 +0000 | 152 |
| 577872 | success | 10000 | td_sample_db | td_sample_table | 0 | 2016-09-15 00:10:04 +0000 | 163 |
| 577810 | success | 10000 | td_sample_db | td_sample_table | 0 | 2016-09-14 00:10:04 +0000 | 164 |
| 577766 | success | 10000 | td_sample_db | td_sample_table | 0 | 2016-09-13 00:10:04 +0000 | 155 |
| 577710 | success | 10000 | td_sample_db | td_sample_table | 0 | 2016-09-12 00:10:05 +0000 | 156 |
| 577610 | success | 10000 | td_sample_db | td_sample_table | 0 | 2016-09-11 00:10:04 +0000 | 157 |
+--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+
8 rows in set
Delete the Schedule
td connector:delete removes the schedule.
$ td connector:delete daily_import
Configuration
See the following options available for the in section (a combined example follows):
api_key (string, required): API key.
secret_key (string, required): API Secret key.
start_time (string, required): First hour included in the data series, formatted YYYYMMDDTHH (e.g. 20150201T05), in the UTC timezone.
end_time (string, optional; default: the processed time): Last hour included in the data series, formatted YYYYMMDDTHH (e.g. 20150203T20), in the UTC timezone.
incremental (bool, optional; default: true): true for mode: append, false for mode: replace (see below).
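Putting these options together, a load.yml for this connector might look like the following sketch; the keys and hour values are placeholders, and end_time and incremental are optional as described above:
in:
  type: amplitude
  api_key: "<YOUR_API_KEY>"
  secret_key: "<YOUR_SECRET_KEY>"
  start_time: "20160901T03"   # UTC timezone, formatted yyyymmddThh
  end_time: "20160903T20"     # optional; defaults to the processed time
  incremental: true           # optional; true for mode: append
out:
  mode: append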
Appendix
A) Modes for out plugin
You can specify the file import mode in the out section of seed.yml.
append (default)
This is the default mode, which appends records to the target table.
in:
...
out:
mode: append
replace (In td 0.11.10 and later)
This mode replaces data in the target table. Any manual schema changes made to the target table remain intact with this mode.
in:
...
out:
mode: replace
The data connector for Amazon S3 enables you to import the data from your JSON, TSV, and CSV files stored in an S3 bucket.
For sample workflows on importing data from files stored in an S3 bucket, go to the Treasure Box on Github.
Prerequisites
Basic knowledge of Arm Treasure Data
Use the TD Console to create your connection
You can use TD Console to create your data connector.
Create a new connection
When you configure a data connection, you provide authentication to access the integration. In Treasure Data, you configure the authentication and then specify the source information.
Go to Catalog and search and select AWS S3. Click Create.
Using Credentials to Authenticate
You need a client ID and access keys to authenticate using credentials.
Register your credential. Set the following parameters.
Endpoint: the S3 endpoint name. You can find region and endpoint information in the AWS documentation. (Ex. s3-ap-northeast-1.amazonaws.com)
Authentication Method:
basic: uses access_key_id and secret_access_key to authenticate. See here.
Access Key ID
Secret access key
anonymous: uses anonymous access. This auth method can access only public files.
session: uses temporarily generated access_key_id, secret_access_key, and session_token. (This authentication method is only available for data import; it can't be used with data export for now.)
Access Key ID
Secret access key
Secret token
Click Continue. Name your new AWS S3 connection. Click Done.
Transfer data from AWS S3
Next, you create a New Source on the Authentications page. You can prepare an ad hoc data connector job or a scheduled data connector job. The following four steps are required in this section.
Fetch from
You need to register the information that you want to ingest.
Bucket: provide the S3 bucket name (Ex. your_bucket_name)
Path Prefix: specify a prefix for target keys. (Ex. logs/data_)
Path Regex: use a regexp to match file paths. If a file path doesn't match the specified pattern, the file is skipped. For example, if you specify the pattern .csv$, then a file is skipped if its path doesn't match that pattern.
Skip Glacier Objects: select this to skip processing objects stored in the Amazon Glacier storage class. If objects are stored in the Glacier storage class but this option is not checked, an exception is thrown.
Filter by Modified Time: choose how to filter files for ingestion:
If it is unchecked (default):
Start after path: inserts the last_path parameter so that the first execution skips files before the path. (Ex. logs/data_20170101.csv)
Incremental: enables incremental loading. If incremental loading is enabled, the config diff for the next execution includes the last_path parameter so that the next execution skips files before that path. Otherwise, last_path is not included.
If it is checked:
Modified after: inserts the last_modified_time parameter so that the first execution skips files that were modified before the specified timestamp. (Ex. 2019-06-03T10:30:19.806Z)
Incremental by Modified Time: enables incremental loading by modified time. If incremental loading is enabled, the config diff for the next execution includes the last_modified_time parameter so that the next execution skips files that were modified before that time. Otherwise, last_modified_time is not included.
You can limit the access to your S3 bucket/IAM user by using a list of static IPs. Contact [email protected] if you need it.
There are instances where you might need to scan all the files in a directory (such as from the top level directory "/"). In such instances, you must use the CLI to do the import (see the Appendix for instructions on how to import using the CLI.)
Example: CloudFront
Amazon CloudFront is a web service that speeds up distribution of your static and dynamic web content. You can configure CloudFront to create log files that contain detailed information about every user request that CloudFront receives. If you enable logging, you can save CloudFront logfiles, shown as follows:
[your_bucket] - [logging] - [E231A697YXWD39.2017-04-23-15.a103fd5a.gz]
[your_bucket] - [logging] - [E231A697YXWD39.2017-04-23-15.b2aede4a.gz]
[your_bucket] - [logging] - [E231A697YXWD39.2017-04-23-16.594fa8e6.gz]
[your_bucket] - [logging] - [E231A697YXWD39.2017-04-23-16.d12f42f9.gz]
In this case, the Fetch from settings are as shown below (an equivalent seed.yml sketch for the CLI follows the list):
Bucket: your_bucket
Path Prefix: logging/
Path Regex: .gz$ (Not Required)
Start after path: logging/E231A697YXWD39.2017-04-23-15.b2aede4a.gz (Assuming that you want to import the log files from 2017-04-23-16.)
Incremental: true (if you want to schedule this job.)
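For reference, a minimal seed.yml sketch of the same settings for the CLI-based flow described in the Appendix might look like this; the credentials are placeholders, and Start after path corresponds to the last_path parameter described above:
in:
  type: s3
  access_key_id: XXXXXXXXXX
  secret_access_key: YYYYYYYYYY
  bucket: your_bucket
  path_prefix: logging/
  # Start after path: the first run skips files at or before this key
  last_path: logging/E231A697YXWD39.2017-04-23-15.b2aede4a.gz
out:
  mode: append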
Preview
In this section, you can see a preview of the data you configured. If you cannot reach this Preview page, you might get errors when you try to proceed with the import using TD Console. See Preview for more information.
If you would like to set specific column names, choose Advanced Settings.
Advanced Settings
Advanced Settings allow you to edit guessed properties. Edit the following section, if you need to.
Default timezone: changes the time zone of timestamp columns if the value itself doesn't include a time zone.
Columns:
Name: changes the name of the column. Column names may consist only of lowercase letters, numbers, and _.
Type: parses a value as the specified type and then stores it after converting it to the Treasure Data schema.
boolean
long
timestamp: is imported as String type in Treasure Data (Ex. 2017-04-01 00:00:00.000)
double
string
json
Total file count limit: maximum number of files to read. (optional)
Transfer to
In this phase, select your target database and table that you want to import data to.
Mode: Append/Replace
Partition key Seed: choose a long or timestamp column to use as the partitioning time. By default, upload_time is used as the time column via the add_time filter.
When
In this phase, you can set an ad hoc or schedule configuration for your job.
When
Once now: sets a one-time job.
Repeat
Schedule: accepts @hourly, @daily, and @monthly, as well as custom cron.
Delay Transfer: adds a delay to the execution time.
TimeZone: supports extended timezone formats like Asia/Tokyo.
My Input Transfers
Your data connector jobs are listed on the jobs page of the TD Console.
Scheduled Execution
You can schedule periodic data connector execution for incremental S3 file import. We configure our scheduler carefully to ensure high availability.
For the scheduled import, the Data Connector for Amazon S3 initially imports all files that match the specified prefix (e.g. path_prefix: path/to/sample_ matches path/to/sample_201501.csv.gz, path/to/sample_201502.csv.gz, ..., path/to/sample_201505.csv.gz) and then remembers one of these fields, depending on the condition:
If use_modified_time is disabled, the last path (path/to/sample_201505.csv.gz) is saved for the next execution. On the second and subsequent runs, the connector imports only files that come after the last path in alphabetical (lexicographic) order (path/to/sample_201506.csv.gz, ...).
Otherwise, the time that the job is executed (e.g. 2019-06-17T12:00:00.000Z) is saved for the next execution. On the second and subsequent runs, the connector imports only files that were modified after that execution time.
Create a Schedule using the TD Toolbelt
A new schedule can be created using the td connector:create command.
$ td connector:create daily_import "10 0 * * *" \
td_sample_db td_sample_table load.yml
It's also recommended to specify the --time-column option, because Treasure Data's storage is partitioned by time (see also data partitioning).
$ td connector:create daily_import "10 0 * * *" \
td_sample_db td_sample_table load.yml \
--time-column created_at
The `cron` parameter also accepts three special options: `@hourly`, `@daily` and `@monthly`.
By default, the schedule is set up in the UTC time zone. You can set the schedule in a different time zone using the -t or --timezone option. The `--timezone` option supports only extended timezone formats like 'Asia/Tokyo', 'America/Los_Angeles' etc. Timezone abbreviations like PST and CST are *not* supported and may lead to unexpected schedules.
List All Schedules
You can see the list of currently scheduled entries by running the command td connector:list.
$ td connector:list
+--------------+--------------+----------+-------+--------------+-----------------+------------------------------------------+
| Name | Cron | Timezone | Delay | Database | Table | Config |
+--------------+--------------+----------+-------+--------------+-----------------+------------------------------------------+
| daily_import | 10 0 * * * | UTC | 0 | td_sample_db | td_sample_table | {"in"=>{"type"=>"s3", "access_key_id"... |
+--------------+--------------+----------+-------+--------------+-----------------+------------------------------------------+
Show Schedule Settings And History
td connector:show shows the execution setting of a schedule entry.
% td connector:show daily_import
Name : daily_import
Cron : 10 0 * * *
Timezone : UTC
Delay : 0
Database : td_sample_db
Table : td_sample_table
Config
---
in:
type: s3
access_key_id: XXXXXXXXXX
secret_access_key: YYYYYYYYYY
endpoint: s3.amazonaws.com
bucket: sample_bucket
path_prefix: path/to/sample_
parser:
charset: UTF-8
...
td connector:history shows the execution history of a schedule entry. To investigate the results of each individual run, use td job <jobid>.
% td connector:history daily_import
+--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+
| JobID | Status | Records | Database | Table | Priority | Started | Duration |
+--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+
| 578066 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-18 00:10:05 +0000 | 160 |
| 577968 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-17 00:10:07 +0000 | 161 |
| 577914 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-16 00:10:03 +0000 | 152 |
| 577872 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-15 00:10:04 +0000 | 163 |
| 577810 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-14 00:10:04 +0000 | 164 |
| 577766 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-13 00:10:04 +0000 | 155 |
| 577710 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-12 00:10:05 +0000 | 156 |
| 577610 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-11 00:10:04 +0000 | 157 |
+--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+
8 rows in set
Delete Schedule
td connector:delete removes the schedule.
$ td connector:delete daily_import
FAQ for the S3 Data Connector
Q: Data Connector for S3 job is running for a long time, what can I do?
Check the count of S3 files that your connector job is ingesting. If there are over 10,000 files, performance degrades. To mitigate this issue, you can do the following (as sketched below):
Narrow the path_prefix option to reduce the count of S3 files.
Set the min_task_size option to 268435456 (256 MB).
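As a rough sketch of these mitigations in load.yml (the narrower prefix is a placeholder, and min_task_size is assumed to sit in the in section as the FAQ implies):
in:
  type: s3
  ...
  path_prefix: path/to/2019/   # narrower prefix to reduce the number of files
  min_task_size: 268435456     # 256 MB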
Q: How do I troubleshoot data import problems?
Review the job log. Warning and errors provide information about the success of your import. For example, you can identify the source file names associated with import errors.
Appendix
A) Optional Alternative: Use the CLI to Configure the Connector
Before setting up the connector, install the td command by installing the most current Treasure Data Toolbelt.
Create Seed Config File (seed.yml)
Prepare the seed.yml as shown in the following example, with your AWS access key and secret access key. You must also specify the bucket name and the source file name (or a prefix for multiple files).
in:
type: s3
access_key_id: XXXXXXXXXX
secret_access_key: YYYYYYYYYY
bucket: sample_bucket
# path to the *.json or *.csv or *.tsv file on your s3 bucket
path_prefix: path/to/sample_file
out:
mode: append
The Data Connector for Amazon S3 imports all files that match the specified prefix (e.g. path_prefix: path/to/sample_ matches path/to/sample_201501.csv.gz, path/to/sample_201502.csv.gz, ..., path/to/sample_201505.csv.gz).
Using path_prefix with a leading '/' can lead to unintended results. For example, "path_prefix: /path/to/sample_file" results in the plugin looking for files in s3://sample_bucket//path/to/sample_file, which is different on S3 from the intended path of s3://sample_bucket/path/to/sample_file.
For more details on available out modes, see Appendix.
Guess Fields (Generate load.yml)
Use connector:guess. This command automatically reads the source files, and assesses (uses logic to guess) the file format and its field/columns.
$ td connector:guess seed.yml -o load.yml
If you open load.yml, you'll see the assessed file format definitions, including file formats, encodings, column names, and types.
in:
type: s3
access_key_id: XXXXXXXXXX
secret_access_key: YYYYYYYYYY
bucket: sample_bucket
path_prefix: path/to/sample_file
parser:
charset: UTF-8
newline: CRLF
type: csv
delimiter: ','
quote: '"'
escape: ''
skip_header_lines: 1
columns:
- name: id
type: long
- name: company
type: string
- name: customer
type: string
- name: created_at
type: timestamp
format: '%Y-%m-%d %H:%M:%S'
out:
mode: append
Then, you can see a preview of the data using the td connector:preview command.
$ td connector:preview load.yml
+-------+---------+----------+---------------------+
| id | company | customer | created_at |
+-------+---------+----------+---------------------+
| 11200 | AA Inc. | David | 2015-03-31 06:12:37 |
| 20313 | BB Imc. | Tom | 2015-04-01 01:00:07 |
| 32132 | CC Inc. | Fernando | 2015-04-01 10:33:41 |
| 40133 | DD Inc. | Cesar | 2015-04-02 05:12:32 |
| 93133 | EE Inc. | Jake | 2015-04-02 14:11:13 |
+-------+---------+----------+---------------------+
The guess command needs more than 3 rows and 2 columns in the source data file, because it assesses the column definitions using sample rows from the source data.
If the system guesses a column name or column type incorrectly, modify load.yml directly and preview again.
Currently, the Data Connector supports parsing of boolean, long, double, string, and timestamp types.
The `preview` command downloads one file from the specified bucket and displays the results from that file. This may cause a difference in results between the preview and issue commands. See Preview for more information.
Execute Load Job
Submit the load job. It may take a couple of hours depending on the size of the data. Specify the Treasure Data database and table where the data should be stored.
It's also recommended to specify the --time-column option, because Treasure Data's storage is partitioned by time (see data partitioning). If the option is not provided, the data connector chooses the first long or timestamp column as the partitioning time. The column specified by --time-column must be of either long or timestamp type.
If your data doesn't have a time column, you can add one by using the add_time filter option. For more details, see the add_time filter plugin (a rough sketch appears at the end of this section).
$ td connector:issue load.yml --database td_sample_db --table td_sample_table \
--time-column created_at
The connector:issue command assumes that you have already created the database (td_sample_db) and table (td_sample_table). If the database or table does not exist in TD, this command will not succeed, so create the database and table manually or use the --auto-create-table option with the td connector:issue command to create them automatically:
$ td connector:issue load.yml --database td_sample_db --table td_sample_table --time-column created_at --auto-create-table
The data connector does not sort records on server-side. To use time-based partitioning effectively, sort records in files beforehand.
If you have a field called time, you don't have to specify the --time-column option.
$ td connector:issue load.yml --database td_sample_db --table td_sample_table
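As a rough sketch only (the exact keys are defined by the add_time filter plugin documentation, and created_at is a hypothetical source column), adding a time column in load.yml might look like a filters section placed alongside the in and out sections:
filters:
  - type: add_time
    from_column:
      name: created_at       # hypothetical existing timestamp column
    to_column:
      name: time
      type: timestamp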
B) Modes for out plugin
You can specify file import mode in out section of seed.yml.
append (default)
This is the default mode and records are appended to the target table.
in:
...
out:
mode: append
replace
This mode replaces data in the target table. Any manual schema changes made to the target table remain intact with this mode.
in:
...
out:
mode: replace
C) IAM Permissions
The IAM credentials specified in the YML configuration file and used for the connector:guess and connector:issue commands need to be allowed permissions for the AWS S3 resources that they need to access. If the IAM user does not possess these permissions, configure the user with one of the predefined Policy Definitions or create a new Policy Definition in JSON format.
The following example is based on the Policy Definition reference format and gives the IAM user read-only permission (through the GetObject and ListBucket actions) for your-bucket:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::your-bucket",
"arn:aws:s3:::your-bucket/*"
]
}
]
}
Replace your-bucket with the actual name of your bucket.
D) Use AWS Security Token Service (STS) as temporary credentials provider
In certain cases, IAM basic authentication through access_key_id and secret_access_key might be too risky (although the secret_access_key is never clearly shown when a job is executed or after a session is created).
The S3 data connector can use AWS Secure Token Service (STS) provided Temporary Security Credentials. Using AWS STS, any IAM user can use his own access_key_id and secret_access_key to create a set of temporary new_access_key_id, new_secret_access_key, and session_token keys with an associated expiration time, after which the credentials become invalid. There are essentially 2 types of Temporary Security Credentials:
Session Token: the simplest Security Credentials, with an associated expiration time. The temporary credentials give access to all resources that the original IAM credentials used to generate them had. These credentials are valid as long as they have not expired and the permissions of the original IAM credentials don't change.
Federation Token: adds an extra layer of permission control over the Session Token above. When generating a Federation Token, the IAM user is required to specify a Permission Policy definition. The scope can be used to further narrow down which of the resources accessible to the IAM user the bearer of the Federation Token should get access to. Any Permission Policy definition can be used, but the scope of the permission is limited to all or a subset of the permissions the IAM user that generated the token had. As with the Session Token, the Federation Token credentials are valid as long as they have not expired and the permissions associated with the original IAM credentials don't change.
AWS STS Temporary Security Credentials can be generated using the AWS CLI or the AWS SDK in the language of your choice.
Session Token
$ aws sts get-session-token --duration-seconds 900
{
"Credentials": {
"SecretAccessKey": "YYYYYYYYYY",
"SessionToken": "ZZZZZZZZZZ",
"Expiration": "2015-12-23T05:11:14Z",
"AccessKeyId": "XXXXXXXXXX"
}
}
Federation Token
$ aws sts get-federation-token --name temp_creds --duration-seconds 900 \
--policy '{"Statement": [{"Effect": "Allow", "Action": ["s3:GetObject", "s3:ListBucket"], "Resource": "arn:aws:s3:::bucketname"}]}'
{
"FederatedUser": {
"FederatedUserId": "523683666290:temp_creds",
"Arn": "arn:aws:sts::523683666290:federated-user/temp_creds"
},
"Credentials": {
"SecretAccessKey": "YYYYYYYYYY",
"SessionToken": "ZZZZZZZZZZ",
"Expiration": "2015-12-23T06:06:17Z",
"AccessKeyId": "XXXXXXXXXX"
},
"PackedPolicySize": 16
}
where:
temp_creds is the name of the federated token/user
bucketname is the name of the bucket to give access to; refer to the ARN specification for more details
s3:GetObject and s3:ListBucket are the basic read operations for an AWS S3 bucket
AWS STS credentials cannot be revoked. They will remain effective until expired, or until you delete or remove the permissions of the original IAM user used to generate the credentials.
When your Temporary Security Credentials are generated, copy the SecretAccessKey, AccessKeyId, and SessionToken into your seed.yml file as follows.
in:
type: s3
auth_method: session
access_key_id: XXXXXXXXXX
secret_access_key: YYYYYYYYYY
session_token: ZZZZZZZZZZ
bucket: sample_bucket
path_prefix: path/to/sample_file
and execute the Data Connector for S3 as usual.
Because STS credentials expire after the specified amount of time, a data connector job that uses the credentials might eventually start failing once they expire. Currently, if the STS credentials are reported expired, the data connector job retries up to the maximum number of times (5) and eventually completes with 'error' status.
E) IP Whitelist and VPC
You can find the IP addresses that TD is using here.
Look up the region of your TD Console from the URL you use to log in to TD, then refer to the "Data Connector" entry for your region at that URL.
Region of TD Console and URL:
US: https://console.treasuredata.com
Tokyo: https://console.treasuredata.co.jp
EU01: https://console.eu01.treasuredata.com
You also need to set up an access route in AWS if your AWS S3 bucket is located in the same region as your TD region. If so, configure access via "vpc-df7066ba".
GDPR (General Data Protection Regulation) was introduced in Europe on May 25, 2018. The regulation is focused on the collection of personal information, and brings strict penalties of up to 4% of gross annual worldwide revenue or €20M (whichever is greater) for each violation. The rules and penalties can apply to companies based in any country, and customers based in any country, if a company has any sustained business presence in Europe (such as a small sales or support office) or does business with any European subjects.
Arm Treasure Data helps you to comply with GDPR in the following ways:
Data Retention Policy
SDK Controls
Sites and Endpoints
Data Retention Policy
Our data retention policy is described here.
Data Retention in Presto
Customer data that you collect using Treasure Data can be permanently deleted by using the Presto DELETE function.
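As a minimal sketch only (the table name events is hypothetical, and td_client_id is one of the identifiers discussed below), such a deletion might look like:
-- Hypothetical table and condition; adjust to your schema
DELETE FROM events
WHERE td_client_id = '<client id of the data subject>'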
SDK Controls
The Treasure Data JavaScript SDK and Android, Unity, and iOS SDKs are all able to collect data that counts as personal data under GDPR. To protect Treasure Data customers (and Treasure Data) from legal jeopardy under GDPR, the SDKs have been updated to minimize the collection of personal data by default. Treasure Data customers who use such personal data must make code changes to continue to collect such data and must notify their subscribers about consent for data collection and use.
For the JavaScript SDKs, this personal information includes the td_client_id and IP address of a web site visitor. For the mobile SDKs, the personal information includes the td_uuid and IP address of the application sending events.
GDPR-ready releases of our SDKs have been made available prior to the May 25 GDPR deadline, including:
Release 2.1.0 (JS SDK) - https://github.com/treasure-data/td-js-sdk#data-privacy
Release 0.1.9 (Unity SDK) - https://github.com/treasure-data/td-unity-sdk-package#gdpr-compliance
Release 0.1.27 (iOS SDK) - https://github.com/treasure-data/td-ios-sdk#gdpr-compliance
Release 0.1.18 (Android SDK) - https://github.com/treasure-data/td-android-sdk#opt-out
How do I ensure I am in compliance with GDPR when using the SDK?
Developers using the SDKs should upgrade to the latest SDK.
Consult with your Privacy officer and Legal team before collecting or enabling collection of personal data using Treasure Data provided SDKs. Then, after appropriate reviews by your company, you can explicitly enable collection of personal data for events if this is in line with your legal obligations and your company's chosen data privacy posture.
The documentation for each SDK, hosted in GitHub, explains the new data privacy-related controls and how to re-enable data collection. If you have technical questions about the use of the SDKs, contact Treasure Data support or customer success for implementation guidance.
EU Site and Endpoints (aws)
Arm Treasure Data accounts can use a data site in Europe. Data that is assigned to the EU service physically resides in Germany, to comply with both GDPR and the German Privacy Act (Bundesdatenschutzgesetz). For details, see Sites and Endpoints.
Treasure Data and Segmentation for Marketing
You can read more information about Treasure Data as a customer data platform and GDPR here.
You can write job results from Arm Treasure Data directly to the Criteo Marketing API.
Prerequisites
Basic knowledge of Treasure Data, including the toolbelt
A Criteo Marketing API account
Authorized Treasure Data account access to Criteo
Use the TD Console to create your connection
You can use the Treasure Data console to configure your connection.
Create a new connection
You must create and configure the data connection to be used during export prior to running your query. As part of the data connection, you provide authentication to access the integration.
Go to Catalog and search and select Criteo.
https://api.criteo.com/marketing/swagger/ui/index#!/Audiences/Audiences_GetAudiences
The following dialog opens.
Authenticating your connection
You need a client ID and client secret to authenticate using credentials. To get a client ID and client secret, use the Criteo Management Center. A Criteo Marketing Account is required to use the Management Center.
Click Setup > Users.
Then click CREATE API USER.
Enter a contact Email address and select one of the following roles:
View only
Business Manager
Financial Manager
To use the Criteo Marketing API, select Business Manager. The View only and Financial Manager roles don't have enough privileges to input and pull data through the API.
When done, click Add user. You receive a confirmation message that includes your Client ID and Client Secret. Save the information because you use it later to configure the Criteo data connector.
Configure Export Results in Your Data Connection
In this step, you create or reuse a query. In the query, you configure the data connection.
Note: Sometimes you need to define the column mapping in the query.
Configure the Connection by Specifying the Parameters
Go to the TD Console. Go to Data Workbench > Queries. Access the query that you plan to use to export data.
Click Export Results located at top of your query editor. The Choose Integration dialog opens. Type the connection name in the search box to filter and select your connection.
Parameters:
Advertiser ID (required): the ID that audiences are grouped under.
Maximum Identities Per Request (optional): the maximum number of IDs to include per request. The maximum for the Criteo interface is 50000. Default: 50000.
Retry Limit (optional): the number of times to attempt a network operation. Default: 7.
Connect Timeout in Seconds (optional): the time, in seconds, to wait before aborting a connection operation. Default: 300 (5 minutes).
Read Timeout in Seconds (optional): the time, in seconds, to wait before aborting a read operation. Default: 900 (15 minutes).
Write Timeout in Seconds (optional): the time, in seconds, to wait before aborting a write operation. Default: 900 (15 minutes).
The following is a sample configuration:
Example of a Query to Populate Criteo
From Treasure Data, run the following query with Export Results into a connection for Criteo:
Code Example
SELECT audience_id_column AS id, audience_name_column AS name, operation_column AS operation, schema_column AS schema, userid_column AS userid, gumid_column AS gumid FROM your_table;
Column Naming
Data exported to Criteo must follow the Criteo schema. Supported column names are:
id : The audience ID
name : The audience name
operation : The operation (add or remove) to perform on an audience
schema : The type of identifier represented in the row of data. In this case, you must specify 'gum'.
userid : The identifier to add or remove from an audience
gumid : The GUM caller ID, data required for gum schema. In this case, you must specify 359 because the caller is TreasureData.
Example:
SELECT audience_id_column AS id, audience_name_column AS name, 'add' AS operation, 'gum' AS schema, ${criteo_id} AS userid, 359 AS gumid FROM your_table
Optional: Use of Scheduled Jobs for Output
You can use Scheduled Jobs with Result Output, to periodically write the output result to a target destination that you specify.
Optional: Configure Export Results in Workflows
Within Treasure Workflow, you can specify the use of the data connector to output data.
timezone: UTC

_export:
  td:
    database: sample_datasets

+td-result-into-criteo:
  td>: queries/sample.sql
  result_connection: your_connections_name
  result_settings:
    advertiser_id: 12345
    request_threshold: 50000
    request_retries: 7
    request_connect_timeout: 300
    request_read_timeout: 900
    request_write_timeout: 900
Click here for more information on using data connectors in Workflows to export data.
Appendix
FAQ for Export into Criteo
Q: Where do I confirm audience ID?
To confirm the audience ID, check with the tool provided by Criteo (https://api.criteo.com/marketing/swagger/ui/index#!/Audiences/Audiences_GetAudiences).
Q: Are both the audience ID and audience name required to not be NULL?
No. At least one of the audience ID or audience name must be non-NULL; both are not required. If the audience ID and audience name are both provided, then the values must belong to the same audience.
Q: If audience ID or audience name do not exist in the list of audiences, will they be created?
If an audience name is provided and the audience ID is NULL, an audience is created if the audience name is not found in the list of audiences.
If an audience ID is provided, an audience cannot be created if the audience ID is not found in the list of audiences. The Criteo API does not allow the audience ID to be specified when creating an audience.
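For illustration only, a query like the following sketch creates (or reuses) an audience by name while leaving the ID NULL; the source column and table names are placeholders that follow the column-naming rules above:
SELECT NULL AS id, 'new_audience_name' AS name, 'add' AS operation, 'gum' AS schema, userid_column AS userid, 359 AS gumid FROM your_table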
Q: How long does it take for my audience to be updated when adding or removing users?
Criteo can take up to 24 hours to process the data and make it available for display. The data is stored in Criteo's system, but takes some time to be processed.
Q: How do I check the upload result?
Check with the tool provided by Criteo. If the upload succeeds, the tool shows the result count in AudienceResponse.nbLines as JSON.
FAQ for Log Messages
Q: Log: Columns [<name(s)>] were not found.
The column names listed in the message were missing from the query.
Check the query to ensure that the column names are not missing or they are not misspelled.
Q: Log: Created audience id '<id>' for audience name '<name>'
The audience name was not found within the list of audiences for the advertiser ID provided. The audience is created only if the name does not exist and the ID value provided by the query is NULL.
Q: Log: Audience id '<id>' is not found in the list of audiences for advertiser '<id>'. Cannot create a new audience for non-existent id
If the audience ID value provided by the query is not NULL and the audience ID does not exist within the list of audiences for the advertiser ID provided, then the audience cannot be created by the connector. The audience ID provided does not match the audience ID generated by the Criteo API and causes a data mismatch.
Create an audience and update data entries of unknown audience IDs with newly created audience IDs.
Q: Log: Audience name '<name>' is not found in the list of audiences for advertiser '<id>' but audience id '<id>' is found. Cannot create a new audience due to name mismatch
Both the audience ID and name were provided by the query, but the audience name was not found in the list of audiences for the advertiser ID provided. Providing the name as part of the Criteo API operation is invalid.
Make the name NULL in the database or update the audience name to match the name provided by the query.
Q: Log: Audience id '<id>' and audience name '<name>' do not represent the same audience for advertiser '<id>'
The audience name associated with ID does not match the audience name provided by the query.
If both the audience ID and name are provided by the query and are not NULL, both values must be associated with the same audience. You can update the audience name associated with the ID to match the name provided by the query or you can modify the row of data and set the name to match the name value within the audience.
You can write job results to your Salesforce.com organization. For sample workflows of SFDC, view Treasure Boxes.
Prerequisites
Basic knowledge of Arm Treasure Data, including the toolbelt
Salesforce.com organization and username, password, and security token for API integration
User has API Enabled permission
Target Salesforce.com Object should exist with read/write permissions for the User
Architecture
A front-end application streams data to be collected in Treasure Data via the Log/data collector daemon or Mobile SDKs. You can also bulk import your data using Bulk Import from the CLI. A scheduled query is set up in Treasure Data to run periodically on the data and write the result of each query execution into your Salesforce.com Object.
View Bulk Data Load Job Details
The above is a fairly common architecture; for example:
Example 1: Ranking: What are the Top N of X?
Every social/mobile application calculates the top N of X (ex: top 5 movies watched today). Treasure Data already handles the raw data warehousing; the write-to-Salesforce.com feature enables Treasure Data to find the top N data as well.
Example 2: Dashboard Application
If you're a data scientist, you need to keep track of a range of metrics every hour/day/month and make them accessible via visualizations. Using this write-to-Salesforce.com feature, you can streamline the process and focus on building visualizations of your query results via Reports and Dashboards in your Salesforce.com organization.
Result Output URL Format
Format
The result output target is represented by a URL with the following format:
sfdc://<username>:<password><security_token>@<hostname>/<object_name>
where:
sfdc is an identifier for result output to Salesforce.com;
username and password are the credential to your Salesforce.com organization;
security_token is the additional credential for API access;
hostname is the host name of the Salesforce.com organization. Usually this is login.salesforce.com for production environments and test.salesforce.com for sandbox environments. If you configure a custom domain for your organization, specify the hostname you're using for login;
object_name is the target Salesforce.com Object API name (e.g. ResultOutput__c). The Object and columns for data integration must be defined beforehand;
For example with:
username: [email protected]
password: PASSWORD
security_token: 7SMvicR9ojdPz0XLtlWi3Rtw
The URL will look like:
sfdc://user%40treasure-data.com:[email protected]/Account
Make sure that you escape the '@' in the username with '%40'.
Options
Result output to Salesforce.com supports various options that can be specified as optional URL parameters. The options are compatible with each other and can be combined. Where applicable, the default behavior is indicated.
Update mode option
Controls the various ways of modifying the database data.
Append
Truncate
Update
mode=append (default)
The append mode is the default that is used when no mode option is provided in the URL. In this mode, the query results are appended to the object.
Because mode=append is the default behavior, these two URLs are equivalent:
sfdc://.../Contact
sfdc://.../Contact?mode=append
mode=truncate
With the truncate mode the system first truncates the existing records in the Salesforce.com Object and moves them into the Trashbin, then inserts the query results.
Example:
sfdc://.../CustomObject__c?mode=truncate
You can specify the hard_delete=true option for mode=truncate to delete records instead of moving it to the Trashbin. To use this option, the user must have the 'Bulk API Hard Delete' permission.
sfdc://.../CustomObject__c?mode=truncate&hard_delete=true
mode=update
With the update mode, a row is inserted unless it would cause a duplicate value in the external key columns specified in the unique parameter. In such a case, an update is performed instead. The unique parameter is required with this mode and must be defined as an external key when used with the update mode.
Example:
sfdc://.../Contact?mode=update&unique=CustomerId__c
The default behavior for the update mode is actually upsert. If you do not want to upsert but only update, you can add the upsert=false option. Then it updates existing records based on unique parameter match and does not insert new records.
sfdc://.../Contact?mode=update&unique=CustomerId__c&upsert=false
Upload concurrency_mode option
The concurrency_mode option controls how the data is uploaded to the Salesforce.com organization. The default mode is parallel; it is the recommended method for most situations.
concurrency_mode=parallel (default)
With the parallel method, data is uploaded in parallel. This is the most reliable and effective method and it is recommended for most situations.
Because concurrency_mode=parallel is the default behavior, these two URLs are equivalent:
sfdc://.../CustomObject__c
sfdc://.../CustomObject__c?concurrency_mode=parallel
concurrency_mode=serial
Uploading records in parallel is recommended. However, if you see UNABLE_TO_LOCK_ROW in an error message, try concurrency_mode=serial instead.
sfdc://.../CustomObject__c?concurrency_mode=serial
Updating a Salesforce.com Object acquires a lock for the Object and any parent Object referenced by columns. If you upload objects in parallel and multiple objects reference the same parent object, Salesforce.com is not able to acquire the lock for the insert/update and returns an UNABLE_TO_LOCK_ROW error. In such cases, specify the concurrency_mode=serial option.
Authentication session_id option
If you have a Salesforce.com Session ID, you can authenticate with the session_id option instead of username, password, and security token (that is, username, password, and security token can be omitted from the URL).
sfdc://login.salesforce.com/Contact?session_id=3deT2aQjYQbIRN0M...jB1tHBb7UW0K!M
Retry option
This option sets the number of attempts the Treasure Data export worker makes to write the result to the configured Salesforce.com destination if errors occur. If the export fails more than the set number of retries, the query fails. The default is retry=2, but you can set it to virtually any number. The number of retries affects the overall duration of a query.
sfdc://.../CustomObject__c?retry=5
Split Records options
The Treasure Data result export splits the records in the result of a query into chunks of 10000 records by default and bulk uploads one chunk at a time. The split_records option configures the size of this chunk, if required.
sfdc://.../CustomObject__c?split_records=100
Usage
CLI
To output the result of a single query to your Salesforce.com organization, add the -r / --result option to the td query command. After the job is finished, the results are written into your Salesforce.com organization Object:
$ td query -w -d testdb \
--result 'sfdc://login.salesforce.com/CustomObject__c?session_id=.....' \
"SELECT code as Code__c, COUNT(1) as Count__c FROM www_access GROUP BY code"
To create a scheduled query whose output is systematically written to your Salesforce.com organization, add the -r / --result option when creating the schedule through the td sched:create command:
$ td sched:create hourly_count_example "0 * * * *" -d testdb \
--result 'sfdc://user%40treasure-data.com:[email protected]/CustomObject__c' \
"SELECT COUNT(*) as Count__c FROM www_access"
Using the TD Console
In the Queries tab, click New Query.
Select a database and table from the popup menus. Enter your query and select Output results.
To avoid any issues with result export, define column aliases in your query such that resulting column names from the query match the Salesforce field names for default fields and API names (usually ending with __c) for custom fields.
To export the result of a query or schedule to Salesforce.com, specify Result Export information for a new query or an existing job/query:
Troubleshooting
If a Result Output to SFDC job fails with the following type of error, you can see the error details for the job on SFDC. As shown in the following example, check job XXXXXXXXXXX on SFDC to find the details of the error.
17/05/01 03:35:05 INFO sfdc.BulkAPIJob: Job XXXXXXXXXXX finished: Total 1, Completed 0, Failed 1
17/05/01 03:35:05 INFO sfdc.BulkAPIClient: Batch jobs failed (1/1)
Arm Treasure Data can publish user segments into Salesforce Marketing Cloud (ExactTarget), and enable you to send personalized emails to your customers. You can run data-driven email campaigns, by using your first party data from Web, Mobile, CRM, and other data sources.
For sample workflows of Salesforce Marketing Cloud (ExactTarget), view Treasure Boxes.
Treasure Data provides two methods to write job results to Salesforce Marketing Cloud (SFMC), our SFTP and SFMC plugins.
This SFMC plugin can be used for small and medium data sets. We recommend you use SFTP for large data sets.
Synchronous or Asynchronous API
This data connector supports two kinds of APIs to put records into SFMC: Synchronous APIs and Asynchronous APIs.
As a best practice, if you are trying to put a result set with fewer than 100,000 records into SFMC, the Synchronous APIs are a great fit. However, if you are planning to send a dataset of a million or more records, consider using the Asynchronous APIs because they provide higher availability and reliability in comparison with the Synchronous APIs.
Note: In order to use the Asynchronous APIs, you must ensure that your SFMC account is enabled for the Data Extensions Async REST API from Salesforce Marketing Cloud, or you can create a case to ask Salesforce Marketing Cloud Support to enable it. See more in Create support case in Salesforce Marketing Cloud.
Prerequisites
Basic knowledge of Treasure Data
Basic knowledge of Salesforce Marketing Cloud
TD account
This connector also requires that the Data Extension have a primary key that uniquely identifies each record.
How to create an installed app in Salesforce Marketing Cloud
Currently we support both the Legacy and Enhanced Packages in Salesforce Marketing Cloud.
However, after August 1, 2019, you cannot create the Legacy Package; therefore, we highly recommend creating an Enhanced Package.
How to create legacy package
Log on to your Salesforce Marketing Cloud account
On the Welcome Page, click your name in the top right corner, then click Setup.
On the left side menu of the new screen, click App > Installed Packages.
On the Installed Packages screen, click New.
On the New Package Details pop up, enter the Name and Description. Uncheck (deselect) the checkbox Create with enhanced functionality (recommended), then click Save.
On the Package Details screen, click Add Component.
On the Add Component pop up, select API Integration, then click Next.
On the Add API Integration screen, scroll to the Data section and select the Read and Write checkboxes on Data Extension. You need these permissions in order to write data to Salesforce Marketing Cloud. Then, click Save.
On the Installed Packages screen, scroll down to the Components panel, then take note of the Client Id and Client Secret. You will use this information to write data from Treasure Data to Salesforce Marketing Cloud.
How to create enhanced functionality package
Log on to your Salesforce Marketing Cloud account.
On the Welcome Page, click your name in the top right corner, then click Setup.
On the left side menu of the new screen, click App > Installed Packages.
On the Installed Packages screen, click New.
On the New Package Details pop up, enter the Name and Description and select the checkbox Create with enhanced functionality (recommended). Then click Save.
On the Add Component pop up, select API Integration, then click Next.
On the Choose Your Integration Type screen, select Server-to-Server, then click Next.
On the Set Server-to-Server Properties screen, scroll to the Data section and select the Read and Write checkboxes on Data Extension. You need these permissions in order to write data to Salesforce Marketing Cloud. Then, click Save.
On the Installed Packages screen, scroll down to the Components panel, and take note of the Client Id, Client Secret, and Authentication Base URI. You will use this information to write data from Treasure Data to Salesforce Marketing Cloud.
Using Treasure Data console
Complete the following steps to export TD query results to a Salesforce Marketing Cloud Data Extension:
Create a new data connection
A new data connection can be used to export results to an SFMC Data Extension.
Go to Integrations Hub > Catalog and search and select Salesforce Marketing Cloud.
Create Legacy Package integration
On the New Authentication screen, select Package Type as Legacy Package, then enter the Client Id and Client Secret (which you obtained when you created the legacy package in SFMC). Then click Continue.
Enter a Name for this integration and click Done.
Create Enhanced Package integration
On the New Authentication screen, select Package Type as Enhanced Functionality Package, then enter the Client Id, Client Secret, and Authentication Base URI (which you obtained when you created the enhanced package in SFMC). Optionally, you can specify an Account identifier or MID to access multiple BUs (more detail in How to get MID) and a Scope to limit the token's scope (more detail in API scopes). Then, click Continue.
Enter a Name for this integration and click Done.
Configure the query results output connector
Navigate to Data Workbench > Queries > New Query.
Enter your query then click the Export Results checkbox.
On Choose Integration, select the integration you created in the previous step.
On the Export Results screen, enter a Data Extension Name, and select the Using Async API checkbox if you want to use the Asynchronous API; otherwise, uncheck the checkbox to use the Synchronous API. Then, click Done.
Run the query
After configuring the output connector, and running a query, the query results are written to the configured data extension.
Plugin configuration and options
Available configurations
client_id: Installed package client ID
client_secret: Installed package client secret
auth_type: Authentication type (enum: v1 for legacy, v2 for enhanced)
auth_uri: Authentication URI (required if auth_type=v2)
de_name: Data Extension name
async: Use the asynchronous or synchronous API from SFMC
continue_on_failure: Continue running if there is an insertion error (applies only to the synchronous API)
Example configurations
Example configuration for using Synchronous APIs to upsert a data extension using the Legacy package.
out:
  type: salesforce_marketing_cloud
  client_id: Th1s1s4n3x4mpl3Cl13nt1ds
  client_secret: Th1s1s4F4k3dCl13ntS3cr3t
  auth_type: v1
  de_name: data_extension_name
  continue_on_failure: false
Example configuration for using Synchronous APIs to upsert a data extension using the Enhanced package.
out:
  type: salesforce_marketing_cloud
  client_id: Th1s1s4n3x4mpl3Cl13nt1ds
  client_secret: Th1s1s4F4k3dCl13ntS3cr3t
  auth_type: v2
  auth_uri: https://th1s1sf4k3d1nst4nc3h0st.auth.marketingcloudapis.com/
  de_name: data_extension_name
  continue_on_failure: false
Example configuration for using Asynchronous APIs to upsert a data extension using the Legacy package.
out:
  type: salesforce_marketing_cloud
  client_id: Th1s1s4n3x4mpl3Cl13nt1ds
  client_secret: Th1s1s4F4k3dCl13ntS3cr3t
  auth_type: v1
  de_name: data_extension_name
  async: true
Example configuration for using Asynchronous APIs to upsert a data extension using the Enhanced package.
out:
  type: salesforce_marketing_cloud
  client_id: Th1s1s4n3x4mpl3Cl13nt1ds
  client_secret: Th1s1s4F4k3dCl13ntS3cr3t
  auth_type: v2
  auth_uri: https://th1s1sf4k3d1nst4nc3h0st.auth.marketingcloudapis.com/
  de_name: data_extension_name
  async: true
Using Treasure Data CLI (Command-line interface)
You can use Treasure Data CLI to run the query and write output result to Salesforce Marketing Cloud.
How to run a query with the Treasure Data CLI
Example using Asynchronous APIs with the Legacy package
$ td query -d database_name -r '{"type": "salesforce_marketing_cloud","client_id":"Th1s1s4n3x4mpl3Cl13nt1ds","client_secret":"Th1s1s4F4k3dCl13ntS3cr3t","auth_type":"v1","de_name":"data_extension_name","async":true}' "select primary_key, mytext, myinteger, mydecimal, myphone, myemail, mydate, mylocale, myboolean=1 as myboolean from my_1_million" -T presto
Example using Asynchronous APIs with Enhanced package
$ td query -d database_name -r '{"type": "salesforce_marketing_cloud","client_id":"Th1s1s4n3x4mpl3Cl13nt1ds","client_secret":"Th1s1s4F4k3dCl13ntS3cr3t","auth_type":"v2","auth_uri":"https://th1s1sf4k3d1nst4nc3h0st.auth.marketingcloudapis.com/","de_name":"data_extension_name","async":true}' "select primary_key, mytext, myinteger, mydecimal, myphone, myemail, mydate, mylocale, myboolean=1 as myboolean from my_1_million" -T presto
Example using Synchronous APIs with Legacy package
$ td query -d database_name -r '{"type": "salesforce_marketing_cloud","client_id":"Th1s1s4n3x4mpl3Cl13nt1ds","client_secret":"Th1s1s4F4k3dCl13ntS3cr3t","auth_type":"v1","de_name":"data_extension_name","async":false,"continue_on_failure":true}' "select primary_key, mytext, myinteger, mydecimal, myphone, myemail, mydate, mylocale, myboolean=1 as myboolean from my_1_million" -T presto
Example using Synchronous APIs with Enhanced package
$ td query -d database_name -r '{"type": "salesforce_marketing_cloud","client_id":"Th1s1s4n3x4mpl3Cl13nt1ds","client_secret":"Th1s1s4F4k3dCl13ntS3cr3t","auth_type":"v2","auth_uri":"https://th1s1sf4k3d1nst4nc3h0st.auth.marketingcloudapis.com/","de_name":"data_extension_name","async":false,"continue_on_failure":true}' "select primary_key, mytext, myinteger, mydecimal, myphone, myemail, mydate, mylocale, myboolean=1 as myboolean from my_1_million" -T presto
How to map between query result and Data Extension column name
You can use the as keyword in the query to map between query result and Data Extension column.
Example:
Given the following Data Extension that contains subscriber information:
name: de_1
Columns: primary_key, cust_email, cust_first_name, cust_last_name, cust_phone_number
Given the following table that contains user account information:
name: account_table
Columns: id, login_id, first_name, last_name, contact_number
You can use the following query with the as keyword to map between the table and the Data Extension.
select id as primary_key, login_id as cust_email, first_name as cust_first_name, last_name as cust_last_name, contact_number as cust_phone_number from account_table;
This article describes how to use the data connector for Microsoft SQL Server, which allows you to directly import data from your Microsoft SQL Server to Arm Treasure Data.
For sample workflows on how to import data from your Microsoft SQL Server, view Treasure Boxes.
Prerequisites
Basic knowledge of Treasure Data
Basic knowledge of Microsoft SQL Server
A Microsoft SQL Server instance running remotely, for example on RDS.
Use the TD Console to Create Your Connection
You can use the TD Console to configure your connection.
Create a New Connection
When you configure a data connection, you provide authentication to access the integration. In Treasure Data, you configure the authentication and then specify the source information.
Go to Integrations Hub -> Catalog and search and select Microsoft SQL Server.
List of Options for Microsoft SQL Server Data Connector
The following dialog opens.
Provide the required credentials, including the host name, port, user name and password required to access Microsoft SQL Server.
Click Continue.
Name your new Microsoft SQL Server Connection. Click New Source.
Transfer Your Microsoft SQL Server Data to Treasure Data
You created your authenticated connection, now you create the source connection.
Complete the details and click Next.
If you want to specify the schema, select Use custom SELECT query as follows:
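For example, a custom SELECT query might look like the following sketch; the table, columns, and filter are placeholders for your own schema:
SELECT id, name, price, created_at
FROM test_table
WHERE created_at >= '2019-01-01'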
Preview
You'll see a preview of your data. To make changes, click Advanced Settings; otherwise, click Next.
Choose the Target Database and Table
Choose an existing database and table, or create new ones, where you want to transfer the data:
Create new database and give your database a name. Complete similar steps for Create new table.
Select whether to append records to an existing table or replace your existing table.
If you want to set a different partition key seed rather than use the default key, you can specify one.
Scheduling
In the Schedule tab, you can specify a one-time transfer, or you can schedule an automated recurring transfer. If you select Once now, click Start Transfer. If you select Repeat specify your schedule options, then click Schedule Transfer.
Name your Source connector. You can just save the connector or click Create & Run.
Use Command Line to Create Your Microsoft SQL Server Connection
You can use the CLI to configure your connection.
Install the Treasure Data Toolbelt
Install the newest TD Toolbelt.
Create Seed Config File (seed.yml)
Prepare seed.yml with your Microsoft SQL Server access information:
in:
  type: sqlserver
  host: sqlserver_host_name
  port: 1433
  user: test_user
  password: test_password
  database: test_database
  table: test_table
out:
  mode: replace
This example will dump all records inside the table. You can have more detailed control with additional parameters.
For more details on available out modes, see the Appendix.
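As a sketch of that finer control (assuming the select, where, and order_by options of the underlying JDBC-based input plugin are available for your account), you could narrow the import like this:
in:
  type: sqlserver
  host: sqlserver_host_name
  port: 1433
  user: test_user
  password: test_password
  database: test_database
  table: test_table
  select: "id, name, created_at"
  where: "created_at >= '2019-01-01'"
  order_by: "id ASC"
out:
  mode: replace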
Guess Fields (Generate load.yml)
Next use connector:guess. This command automatically reads the target data, and intelligently guesses the data format.
$ td connector:guess seed.yml -o load.yml
If you open up load.yml, you'll see guessed file format definitions including, in some cases, file formats, encodings, column names, and types.
Optional: Preview data to be imported
You can preview data to be imported using the command td connector:preview.
$ td connector:preview load.yml
+---------+--------------+----------------------------------+------------+---------------------------+
| id:long | name:string | description:string | price:long | created_at:timestamp |
+---------+--------------+----------------------------------+------------+---------------------------+
| 1 | "item name1" | "26e3c3625366591bc2ffc6e262976e" | 2419 | "2014-02-16 13:01:06 UTC" |
| 2 | "item name2" | "3e9dd9474dacb78afd607f9e0a3366" | 1298 | "2014-05-24 13:59:26 UTC" |
| 3 | "item name3" | "9b6c9e4a140284d3951681e9e047f6" | 9084 | "2014-06-21 00:18:21 UTC" |
| 4 | "item name4" | "a11faf5e63c1b02a3d4c2b5cbb7331" | 669 | "2014-05-02 03:44:08 UTC" |
| 6 | "item name6" | "6aa15471c373ddc8a6469e1c918f98" | 3556 | "2014-03-29 08:30:23 UTC" |
+---------+--------------+----------------------------------+------------+---------------------------+
Execute Load Job
Finally, submit the load job. The job may take a couple of hours to run depending on the data size. You need to specify the database and table where the data will be stored.
It is recommended to specify the --time-column option, because Treasure Data's storage is partitioned by time (see also architecture). If the option is not given, the Data Connector chooses the first long or timestamp column as the partitioning time. The type of the column specified by --time-column must be either long or timestamp.
If your data doesn't have a time column, you can add it using the add_time filter option. For more details, see the add_time filter plugin.
$ td connector:issue load.yml --database td_sample_db --table td_sample_table
The above command assumes you have already created the database (td_sample_db) and table (td_sample_table). If the database or the table does not exist in TD, this command will not succeed. Create the database and table manually, or use the --auto-create-table option with the td connector:issue command to create them automatically:
$ td connector:issue load.yml --database td_sample_db --table td_sample_table --time-column created_at --auto-create-table
You can assign a time-format column to the partitioning key by using the --time-column option.
You can load records incrementally by specifying columns in your table by utilizing the incremental_columns and last_record options.
in:
  type: sqlserver
  host: sqlserver_host_name
  port: 1433
  user: test_user
  password: test_password
  database: test_database
  table: test_table
  incremental: true
  incremental_columns: [id, sub_id]
  last_record: [10000, 300]
out:
  mode: append
exec: {}
The plugin then automatically rewrites the query and sorts the values internally.
# when last_record wasn't given
SELECT * FROM(
...original query is here
)
ORDER BY id, sub_id
# when last_record was given
SELECT * FROM(
...original query is here
)
WHERE id > 10000 OR (id = 10000 AND sub_id > 300)
ORDER BY id, sub_id
If you use scheduled execution, the plugin automatically generates last_record and holds it internally, so it can be used at the next scheduled execution.
in:
  type: sqlserver
  ...
out:
  ...
Config Diff
---
in:
  last_record:
    - 20000
    - 400
The `query` option is not available when you set `incremental: true`.
Only strings and integers are supported as incremental_columns.
Scheduled execution
You can schedule periodic Data Connector execution for periodic Microsoft SQL Server import. We carefully configure our scheduler to ensure high availability. By using this feature, you no longer need a cron daemon on your local data center.
Create the schedule
A new schedule can be created using the td connector:create command. The name of the schedule, cron-style schedule, the database and table where their data will be stored, and the Data Connector configuration file are required.
$ td connector:create \
daily_sqlserver_import \
"10 0 * * *" \
td_sample_db \
td_sample_table \
load.yml
The `cron` parameter also accepts three options: `@hourly`, `@daily` and `@monthly`.
By default, the schedule is set up in the UTC timezone. You can set the schedule in a different timezone using the -t or --timezone option. Note that the `--timezone` option supports only extended timezone formats like 'Asia/Tokyo' and 'America/Los_Angeles'. Timezone abbreviations like PST and CST are *not* supported and may lead to unexpected schedules.
List the Schedules
You can see the list of scheduled entries by td connector:list.
$ td connector:list
+-------------------------+--------------+----------+-------+--------------+-----------------+------------------------------+
| Name | Cron | Timezone | Delay | Database | Table | Config |
+-------------------------+--------------+----------+-------+--------------+-----------------+------------------------------+
| daily_sqlserver_import | 10 0 * * * | UTC | 0 | td_sample_db | td_sample_table | {"type"=>"sqlserver", ... } |
+-------------------------+--------------+----------+-------+--------------+-----------------+------------------------------+
Show the Setting and History of Schedules
td connector:show shows the execution setting of a schedule entry.
% td connector:show daily_sqlserver_import
Name : daily_sqlserver_import
Cron : 10 0 * * *
Timezone : UTC
Delay : 0
Database : td_sample_db
Table : td_sample_table
td connector:history shows the execution history of a schedule entry. To investigate the results of each individual execution, please use td job <jobid>.
% td connector:history daily_sqlserver_import
+--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+
| JobID | Status | Records | Database | Table | Priority | Started | Duration |
+--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+
| 578066 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-18 00:10:05 +0000 | 160 |
| 577968 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-17 00:10:07 +0000 | 161 |
| 577914 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-16 00:10:03 +0000 | 152 |
| 577872 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-15 00:10:04 +0000 | 163 |
| 577810 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-14 00:10:04 +0000 | 164 |
| 577766 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-13 00:10:04 +0000 | 155 |
| 577710 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-12 00:10:05 +0000 | 156 |
| 577610 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-11 00:10:04 +0000 | 157 |
+--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+
8 rows in set
Delete the Schedule
td connector:delete will remove the schedule.
$ td connector:delete daily_sqlserver_import
Appendix
Modes for out plugin
You can specify the file import mode in the out section of load.yml.
append (default)
This is the default mode and records are appended to the target table.
in:
  ...
out:
  mode: append
replace (In td 0.11.10 and later)
This mode replaces data in the target table. Please note that any manual schema changes made to the target table will remain intact with this mode.
in:
  ...
out:
  mode: replace
This is a summary of new Arm Treasure Data features and improvements introduced in the December 1st, 2019 release. If you have any product feature requests, submit them to feedback.treasuredata.com.
Option: you can view a video summary of our December releases:
Treasure Data Python Interface to REST APIs, Query Engine
Arm Treasure Data offers pytd v1.0.0, a Python interface that makes it easier to consume Treasure Data's REST APIs, Presto query engine, and Plazma primary storage.
For ingestion, pytd supports Bulk Import APIs to convert data into a CSV/msgpack file and enable upload in batch. For small volumes of data, pytd supports the Presto INSERT INTO query enabling you to insert single rows into your table.
You can also use pytd with td-spark, which uses a local customized Spark instance to directly write DataFrames to TD's primary storage system. These seamless connections enable your Python code to efficiently read and write large volumes of data from and to Treasure Data.
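As a minimal sketch (assuming pytd v1.0.0, an API key in the TD_API_KEY environment variable, and example database and table names), reading and writing data with pytd looks like this:
import os
import pandas as pd
import pytd

# Connect to Treasure Data; the endpoint and database here are examples.
client = pytd.Client(apikey=os.environ["TD_API_KEY"],
                     endpoint="https://api.treasuredata.com/",
                     database="sample_datasets")

# Run a Presto query; the result is a dict of column names and rows.
res = client.query("SELECT symbol, COUNT(1) AS cnt FROM nasdaq GROUP BY symbol")
df = pd.DataFrame(res["data"], columns=res["columns"])

# Write the DataFrame back to TD using the Bulk Import API.
client.load_table_from_dataframe(df, "symbol_counts",
                                 writer="bulk_import", if_exists="overwrite")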
Comparing pytd to td-client-python and pandas-td
pytd is the new recommended SDK, although we will continue to support the existing td-client-python. The td-client-python package is a basic minimal SDK, whereas pytd provides more user-friendly interfaces and gives access to Plazma API.
Use pytd with custom scripts in TD Workflows to make day-to-day data analytics work more productive. The pytd package offers functions compatible with the Python package pandas-td, implemented more efficiently. We recommend that you switch to pytd if you are currently using pandas-td.
To learn more about pytd, refer to the following documents.
OSS documentation: https://pytd-doc.readthedocs.io/en/latest/#
pandas-td compatible guide: https://support.treasuredata.com/hc/en-us/sections/360003775873-Python-Library-Reference-for-pytd-pandas-td-
Identity Federation (Beta)
The Treasure Data Identity Federation feature works with your Identity Provider (IdP) and enables your TD account users to use one ID to log into your Treasure Data accounts, even if the user is assigned multiple TD accounts.
You configure your IdP to authenticate your Treasure Data users and thereby control the login policy for your users through the IdP.
A key benefit of Identity Federation is heightened security and tighter authentication for both on-premise and cloud applications. You can centrally manage all users and their respective permissions through your corporate directory service.
The initial implementation of Identity Federation supports Azure Active Directory using SAML 2.0 protocol.
If you are interested in participating in the Identity Federation beta, contact your Customer Success Representative.
User Experience Improvements
Arm Treasure Data continues to improve the ease of use in TD Console.
Changes to Segment Timestamp Operators
Segmentation Timestamp Operators contribute to the rules that define a segment. These timestamp operators are now easier to understand and configure.
Google Sheets - Export
As part of this upgrade, some operators were kept and changed, and other operators were removed. See Segment Timestamp Operators for a summary of the segment timestamp operators. The article includes a table that reflects the changes.
Import Integrations
Arm Treasure Data offers some new features that improve your experience when creating and editing data transfers for sources.
Choose to Save or Save and Run Your Configured Data Connector
You can choose to configure your data connector, save it and run the import immediately, or to just save the configured data connector to run at a later time.
Configure Storage with More Assurance
For every new data transfer, you can manually select a database and table, instead of a default database and table being selected for you. This is particularly useful when you already have a similar, existing data transfer and want the specified location for the imported data to be different.
As you define a database and table, you can refer to the help text.
Also, a new user interface is coming soon:
Coming Soon: Google Sheet Import
You can now use the Google Sheet data connector to import Google Sheet content into your database. You can bring in data from a Google Spreadsheet file without manually converting the data to CSV before import, and you can retain the format of the original spreadsheet.
Contact your Treasure Data Customer Success representative for more information and enablement.
Data Connector: Magento Import (Beta)
To aid in building Customer Journeys from your Magento eCommerce sites, the Magento data connector imports data directly from Magento. The data connector also helps to ingest from the APIs of Magento Commerce modules, third-party modules, and extension attributes that are installed on your system.
If you are interested in participating in the beta, contact your Treasure Data Customer Success representative.
Data Connector: Google Sheet Export Improvement
A new parameter is added to the Google Sheet Export data connector. You can now specify a destination folder key when using a spreadsheet name to export data. You must re-authenticate to access Google Drive metadata in order to use this new parameter.
If multiple spreadsheets with the same name exist in the destination folder, your jobs fail. You can use the spreadsheet key to specify the correct destination spreadsheet to avoid job failure.
Action Required: Google Sheet Re-Authentication
The current version of the Google sheets V3 connector is scheduled to be deprecated in early March 2020. You must re-authenticate all your existing Google Sheet Authentications before mid February 2020 when the TD V3 connector is deprecated. Failure to do so results in a disruption of your Google sheets jobs. For more information, refer to the Appendix in Google Sheets - Export.
This is a summary of new Arm Treasure Data features and improvements introduced in the November 1st, 2019 release. If you have any product feature requests, submit them to feedback.treasuredata.com.
Option: you can view a video summary of our November releases:
Identity Federation (BETA - Coming Soon)
The Treasure Data Identity Federation feature works with your Identity Provider and enables your TD account users to use one ID to log into your Treasure Data accounts, even if the user is assigned multiple accounts (such as development account, system test account and production account).
You configure your Identity Provider (IdP) to authenticate your Treasure Data users and thereby control the login policy for your users through the IdP.
The initial implementation of Identity Federation supports Azure Active Directory using SAML 2.0 protocol.
If you are interested in participating in the Identity Federation beta, contact your Customer Success Representative.
Sunsetting HTTP Support
The HTTP endpoints in both the Tokyo and US regions are being deprecated for security reasons. All customers who are currently accessing a TD HTTP endpoint will be directly notified about the deprecation.
If you are using any of the following addresses to access Treasure Data, you must migrate to HTTPS:
US region
http://api.treasuredata.com
http://api.treasure-data.com
http://api-import.treasuredata.com
http://api-workflow.treasuredata.com
Tokyo region
http://api.treasuredata.co.jp
http://api-import.treasuredata.co.jp
http://api-workflow.treasuredata.co.jp
Server-Side First Party Cookies - Usage Allowances
The Server-Side First Party Cookies feature was released for general availability last month. For further information on Server-Side First Party Cookies feature, refer to the technical documentation.
This feature is automatically included for current CDP customers. By using this feature, customers agree to the allowance and usage limitations applicable to their pricing tier. Customers should contact their customer success representative for information on usage allowance and limitations.
Data Connector: LiveRamp Export (BETA)
You are invited to participate in our LiveRamp (Beta). Use Treasure Data with LiveRamp both to anonymize your user segment data and increase the reach of your advertising campaigns, with more accurate targeting.
Contact your Treasure Data Customer Success representative for more information and enablement.
Data Connector: The Trade Desk Import
In addition to the output connector we offer to push segments to The Trade Desk platform, you can now use the input connector to ingest The Trade Desk Advertiser, Campaign, Data Group, and Tracking Tags. With both ingest and export capabilities, you can achieve an end-to-end view of The Trade Desk platform architecture.
Data Connector: Google Adwords Import Improvement
In addition to existing reports provided by Google Ads (formerly known as Google Adwords) connector, you can now import data from Keyword Performance Report. The new Include Negative Keywords parameter allows you to specify whether negative keywords are included in the import result.
In addition, we have updated the Google Adwords logo to the new Google Ads logo in accordance with Google's rebranding, and removed the Average Position metric from existing reports. The Average Position metric was deprecated September 30, 2019.
Data Connector: Shopify Import Improvement
You can now import an additional data target: Inventory (Level, Items, Location). This new data type supports the making of item recommendations or mail marketing based on stock quantity.
Complete the following steps to migrate from the legacy Salesforce data connector to the new Salesforce V2 connector. The legacy data connector uses only REST API to import data. The new Salesforce data connector enables you to use Bulk import and REST API.
Create a New Salesforce v2 Connector
Go to Treasure Data Catalog, then search and select Salesforce v2.
https://login.salesforce.com/
In the dialog box, enter the values that you enter in your legacy Salesforce connector.
The Salesforce v2 connector requires that you remove unnecessary parameters from the Login URL. For example, instead of https://login.salesforce.com/?locale=jp, use https://login.salesforce.com/.
Enter your username (your email) and password, as well as your Client ID, Client Secret and Security Token.
Save Settings and Run the Legacy Salesforce Data Connector One Last Time
You can save your legacy setting from TD Console or from the CLI.
In Console UI
Save the settings of your scheduled legacy Salesforce connector and run a final import
Go to Integration Hub > Sources. Search for your scheduled Salesforce source, select and click Edit.
In the dialog box, copy the settings to use later:
Also copy any advanced settings:
Next, you configure one final run with the legacy data connector to create a temporary table against which you can run a config-diff. You use the diff to identify and confirm the latest data imported into Treasure Data.
Before running the final import with the legacy connector, make sure that you change the schedule to one run only:
After the job is complete, view and copy the config_diff from the job query information to use later.
Create new Salesforce v2 source
Go to Integration Hub > Authentication. Search for new Salesforce v2 connection that you created:
Click New Source. Fill in all basic settings and advanced settings that you copied in the preceding steps. Then, if you want the new source to continue ingesting from the point where the legacy connector left off, fill in the Last Record field with the config_diff information that you copied from the previous job.
After completing the settings, choose the database and table job to populate data into, then schedule the job and provide a name for your new data connector. Click Save and then run the new data connector.
In CLI and Workflow
Update in: type in your yml configuration from sfdc to sfdc_v2.
For example, your existing workflow configuration might look something like this:
in:
  type: sfdc
  username: ${secret:sfdc.username}
  password: ${secret:sfdc.password}
  client_id: ${secret:sfdc.client_id}
  client_secret: ${secret:sfdc.client_secret}
  security_token: ${secret:sfdc.security_token}
  login_url: ${secret:sfdc.login_url}
  target: Lead
out: {}
exec: {}
filters: []
Your new workflow configuration would look like this:
in:
  type: sfdc_v2
  username: ${secret:sfdc.username}
  password: ${secret:sfdc.password}
  client_id: ${secret:sfdc.client_id}
  client_secret: ${secret:sfdc.client_secret}
  security_token: ${secret:sfdc.security_token}
  login_url: ${secret:sfdc.login_url}
  target: Lead
out: {}
exec: {}
filters: []
For result output
The SFDC connection is shared between the data connector and result output. Although nothing changes in result output itself, if you use it, you should upgrade it as well.
In Console UI
Save the settings of Legacy export connector
Go to the Treasure Data console. Go to the Query Editor. Open the query that uses sfdc for its connection.
Click the sfdc connector, then copy and save the details of existing connection to use later.
Click delete to remove the legacy one.
Modify the existing Query (to replace the Legacy connection)
In the query, click Output results. Next, set up the SFDC v2 connector by finding and selecting the SFDC v2 export connector that you created.
In the Configuration pane, specify the fields you saved in the previous step, then click Done.
Check Output results to... to verify that you are using the created output connection. Click Save.
It is strongly recommended to create a test target and use it for the first data export to verify that exported data looks as expected and the new export does not corrupt existing data. In your test case, choose an alternate Object for your test target.
In CLI:
The result type protocol needs to be updated from sfdc to sfdc_v2, for instance from:
sfdc://<username>:<password><security_token>@<hostname>/<object_name>
to:
sfdc_v2://<username>:<password><security_token>@<hostname>/<object_name>
In workflow:
If you have a workflow that used SFDC, you can keep your result settings the same, but you need to update result_connection to the new connection name that you created above.
An example of old workflow result output settings is as follows:
+td-result-output-sfdc:
  td>: queries/sample.sql
  database: sample_datasets
  result_connection: your_old_connection_name
  result_settings:
    object: object_name
    mode: append
    concurrency_mode: parallel
    retry: 2
    split_records: 10000
An example of new workflow result output settings is as follows:
+td-result-output-sfdc:
  td>: queries/sample.sql
  database: sample_datasets
  result_connection: your_new_connection_name
  result_settings:
    object: object_name
    mode: append
    concurrency_mode: parallel
    retry: 2
    split_records: 10000
The Presto query engine has some known limitations.
Column Name Escaping
When a column name matches the name of a reserved keyword, the name of the column needs to be quoted.
In Hive, the quotation character is the back tick character `.
SELECT `join` FROM mytbl
In Presto, quoting of a column name is accomplished with the double quote character instead.
SELECT "join" FROM mytbl
Quoting a column name in single quotes will make the query parser interpret the content as a simple string that might produce an unexpected result. For example:
SELECT "join" FROM mytbl LIMIT 10
result:
value1
value2
value3
value4
...
value10
SELECT 'join' FROM mytbl
result:
join
join
join
join
...
join
48 Hour Limit for Presto Queries
Presto queries that run more than 48 hours are automatically cancelled. Typically, Presto queries that run more than a day or so are failing queries.
For long running queries, you can rewrite the Presto query as a Hive query.
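For example, the same statement can be submitted to the Hive engine from the CLI simply by switching the -T flag (the database and query here are placeholders):
$ td query -w -d testdb -T hive "SELECT COUNT(1) FROM www_access"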
Table Schemas are Required
When querying with Presto, you must set a schema for the tables. The v[...] syntax used by the default schema doesn't work.
JOIN Order
Presto does not support cost-based JOIN optimizations, which means that JOINs are not automatically reordered based on table size. Make sure that smaller tables are on the right-hand side of the JOIN, and that they fit in memory. Otherwise, out-of-memory exceptions cause the queries to fail.
SELECT
...
FROM
large_table
JOIN
small_table
INSERT OVERWRITE Statements are NOT Supported
Presto does not support INSERT OVERWRITE statements. Make sure that you delete the table before using INSERT INTO. See Presto Query FAQs.
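As a sketch of that workaround (the table and column names are illustrative), drop the existing table first and then use INSERT INTO:
DROP TABLE IF EXISTS my_output;
CREATE TABLE my_output (path varchar, cnt bigint);
INSERT INTO my_output
SELECT path, COUNT(1) AS cnt FROM www_access GROUP BY path;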
Error Message NOT_SUPPORTED: NULL values are not allowed on the probe side of SemiJoin operator
This error is caused by NULL values appearing in the column that is compared with an IN subquery, which raises a correctness issue for the result.
The IN predicate determines whether any value produced by the subquery is equal to the provided expression. The result of IN follows the standard SQL rules for nulls. The subquery must produce exactly one column.
For example, if the user column contains a NULL value, the following query might fail.
select *
from www_access
where
user in (select 0 as user_id)
=> Query 20170526_042346_18708_6huz3 failed: NULL values are not allowed on the probe side of SemiJoin operator. See the query plan for details.
As a workaround, add an IS NOT NULL condition:
select *
from www_access
where
user in (select 0 as user_id)
and user is not null
Error Message - Killed by the system because this query generate output for more than 200GB.
This error is caused by trying to generate more than 200GB of output with a single query.
As a workaround, use the result_output_redirect option or CREATE TABLE AS.
result_output_redirect option solution: [Experimental] Use result_output_redirect.
CREATE TABLE AS solution: Use CREATE TABLE (table) AS SELECT.
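For example, writing the large result into a table with CREATE TABLE AS avoids streaming it back as query output (the table names are illustrative):
DROP TABLE IF EXISTS large_result;
CREATE TABLE large_result AS
SELECT * FROM www_access;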
Error Message - Accessing too many (N) tables. Maximum 300 tables allowed
This error is caused by trying to scan more than 300 tables with a single query.
As a workaround, tune the SQL to reduce the number of tables scanned, or divide the single query into multiple jobs.
Presto Performance Tuning
Error Message - Creating partitions (N) exceeds the limit (500000)
This error is caused by trying to commit more than 500,000 (500K) partitions with a single query.
As a workaround, use User Defined Partitioning (UDP) or tune the parameters for UDP.
Use User Defined Partitioning for Presto.
Tune parameters for UDP. For example:
Reduce bucket_count to 256 (default is 512).
Update max_time_range to adopt a broader time range.
Increase max_file_size to 512MB (default is 256MB).
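A minimal UDP sketch (the table names and the user_id bucketing column are illustrative; see the User Defined Partitioning documentation for the exact properties supported on your account):
CREATE TABLE my_udp_table
WITH (bucketed_on = array['user_id'], bucket_count = 256)
AS SELECT * FROM source_table;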
When exporting data to an external source, an external source might require an additional column or marketers might want to create one that allows them to track lists back to specific master segments, segments, campaigns or dates and times.
When creating a new activation, Treasure Data allows additional columns to be created and exported if needed using String Builder.
To Create a New Column for Export
Navigate to Audience Studio > Segments
Select the segment you want to activate.
Select Activations on the right-side panel.
Select Create Activation or Edit Activation from the more options menu.
Select Output Mapping in the Activations editor.
Select + Add String under String Builder.
Edit the Output Column Name field. Select from the following value options:
String -- Can be any value you choose. Use free text to create a custom value.
Timestamp -- Date and time of export
Segment ID -- Segment ID number
Segment Name -- Segment Name
Audience ID -- Master Segment number
To add an additional value to the string, select + Add String. You may string multiple values together.
After completing the String Builder, select Create & Run Now.
Arm Treasure Data uses the same convention as Relational Database Management Systems (RDBMSs) for managing data sets:
Databases
Tables
Schema
Unlike traditional warehousing platforms, Treasure Data allows users to store-first, schema-later. Your schema can be changed at any time, with no cost.
Conventional warehousing platforms are schema dependent, supporting an assumptive analytics model. In this model, data elements forecasted to yield insights are defined in advance, with the structure of the data store schema.
Performance considerations are also important in initial design and the analyst must have knowledge of the underlying structure to ensure query performance. When new columns are added to the table, the schema must change.
Big Data analysis, however, is largely non-assumptive. The analyst seeks hidden patterns, relationships, or events in the data that were not obvious from the outset. You are able to query the data wherever it is stored and without the burden of performance considerations, and exploration can create requirements for new records to support the analysis trail.
In this model, schema dependence adds a significant tax that can become prohibitive.
The TD Approach: Store-First, Schema-Later
Hive
Unlike traditional warehousing platforms, TD users can assign schema even after importing data to a table. This means that you can add or remove fields at any time.
This system is much more flexible, and schema changes no longer take days of work.
Prerequisites
Basic knowledge of Treasure Data, including the TD toolbelt.
A table with some data. See Running a query and downloading results
Understanding the TD Default Schema
When a table is created in a TD database, it has two fields:
time: The time that each entry was generated, in int64 UNIX time
v: Each log entry, stored as map<string, string>
When you look up the value of a database entry, address the information using the format: v[field1].
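For example, with the default schema you reference fields through the v map in a Hive query; the field name used here is illustrative:
SELECT v['path'] AS path, COUNT(1) AS cnt
FROM www_access
GROUP BY v['path']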
Defining a custom schema is strongly recommended.
Defining a Custom Schema
Typically, the default schema defined by TD is acceptable. But a custom schema can make queries shorter, and greatly improve performance.
To define a custom schema for a table:
Optionally, create testdb using the following command:
td db testdb
Optionally, create the www_access table using the following command:
td table:create testdb www_access
Use the td schema:set command. The syntax of the command is: $ td schema:set <database> <table> <column_name>:<type> ... where <column_name> consists of lowercase letters, numbers, and "_" only, and <type> can be one of the following: int, long, double, float, string, array<T>.
For this example, the schema would be added as follows:
td schema:set testdb www_access action:string labels:'array<string>' user:int
You can now query this table with defined column names.
td query -w -d testdb "SELECT user, COUNT(1) AS cnt
FROM www_access
WHERE action='login'
GROUP BY user ORDER BY cnt DESC"
You can see the following articles for more query syntax:
Hive Query Language
Presto Query Language
Treasure Data, Presto and Hive Schema Relation
Treasure Data              | Presto    | Hive
int                        | bigint    | smallint
int                        | bigint    | int
long                       | bigint    | bigint
double                     | double    | decimal
float                      | double    | float
double                     | double    | double
Convert to string or int   | boolean   | boolean
string                     | varchar   | string or varchar
string or Convert to long  | date      | string
string or Convert to long  | timestamp | timestamp
You can refer to the open source documentation as well:
Presto
You can export job results from Arm Treasure Data directly to Google Sheets. In Google Sheets, you can then further analyze your data and uncover key business insights.
You can transfer up to 2 million cells for spreadsheets that you created in or converted to Google Sheets. Read more from Google Help.
Prerequisites
Basic knowledge of Treasure Data, including the TD Toolbelt
A Google account (for Google Drive)
Authorized Treasure Data account access
Use TD Console to Create Your Connection
You can use the Treasure Data console to configure your connection.
Create a New Connection
You must create and configure the data connection to be used during export, prior to running your query. As part of the data connection, you provide authentication to access the integration.
Go to Catalog and search and select Google Sheets.
https://myaccount.google.com/permissions
The following dialog opens:
Account Authentication
Access to Treasure Data Google Sheets integration requires OAuth2 authentication. The authentication requires that users manually connect their Treasure Data account to their respective Google account.
To authenticate, complete the following procedure:
Select an existing OAuth connection for Google, or click the link under "OAuth connection" to create a new connection
Log into your Google Sheets account in the popup window and grant access to the Treasure Data app.
You are redirected back to integration dialog pane. Repeat the first step (Create a new connection) and choose your new OAuth connection.
Name your new Google Sheets connection.
Configure Export Results in Your Data Connection
Go to the Treasure Data console.
Go to Data Workbench > Queries. You can pick the query language and write your query.
Select your database and table.
Create or pick the query that you plan to use to export data.
Next, select the Export Results checkbox and select your Google Sheets connection:
After you select your Google Sheets connection, the dialog pane appears:
Parameters:
You must specify either the spreadsheet_id OR the spreadsheet_title. You cannot use both.
Use Spreadsheet Key: If it is unchecked, the spreadsheet title is used as the key; otherwise, the spreadsheet ID is used.
Spreadsheet name: The title of spreadsheet. A new spreadsheet with the title you specify is created if the spreadsheet does not exist. If multiple spreadsheets with the same title (non-case-sensitive) exist in the destination folder (given by Folder Key) or any folder (if Folder Key is empty), the job fails. In this case, specify by spreadsheet key instead.
Folder key: Folder ID of the destination for the exported data. Used only when using Spreadsheet name. If empty, the destination folder is determined by matching the spreadsheet name; if there is no spreadsheet match, the My Drive root folder is used. Important: In order to use this Folder Key option, you must update authentications that were created before this parameter was added to this data connector. Otherwise, your export jobs fail. See Appendix 4.3.
Worksheet: The title of the sheet in the spreadsheet. A new sheet with the title you specify is created if the sheet does not exist. The name entered is compared "as is" (including spaces) with the worksheet name in Google. If empty, data is exported to the first (leftmost) worksheet.
Upload mode: The mode to modify data in spreadsheet. See Section 4 for details.
Range: The initial cell position that the data is written to.
Batch rows to upload: The number of rows that are uploaded to a spreadsheet per call. The larger the batch value, the larger the payload uploaded per call and the fewer calls needed to complete the upload of all data.
Value Input Option: Specify either RAW to upload data to the spreadsheet directly, or USER_ENTERED to have the data parsed as if the user typed it into the UI. Numbers remain as numbers, but strings may be converted to numbers, dates, etc., following the same rules that are applied when entering text into a cell via the Google Sheets UI.
Set Cell to Null: Control the display value of the cell when an invalid value (such as dividing by 0) is encountered
Ways to Modify Data
You can manipulate spreadsheet data in the following ways: Replace, Append, Truncate, Update.
Replace
This is the default mode. If the table already exists, the rows of the existing table are replaced with the query results. If the table does not exist yet, a new table is created.
Append
The query results are appended to the table. If the table does not exist yet, a new table is created.
The result of the query is appended at the end of the spreadsheet. If, for example, you have the default sheet with 1000 empty rows, APPEND mode starts adding the rows in the query result at the row after the last one visible in the spreadsheet (i.e. 1001st row). Refer to the following example:
Truncate
If the table already exists, the existing rows are cleared, query results are updated into the table. If the table does not exist yet, a new table is created.
Before executing truncate mode:
After executing truncate mode with query results contain id column:
Update
If the table already exists, the existing rows are updated into the table by query results. If the table does not exist yet, a new table is created.
Before executing update mode:
After executing update mode, from the range A1, the perspective column is updated.
Optional: Use of Scheduled Jobs for Output
You can use Scheduled Jobs with Result Output, to periodically write the output result to a target destination that you specify.
Optional: Configure Export Results in Workflow
Within Treasure Workflow, you can specify the use of this data connector to output data.
Available parameters are as follows:
spreadsheet_id: String. Spreadsheet key. Required*.
spreadsheet_title: String. Spreadsheet name. Required*.
spreadsheet_folder: String. Default is null. Folder id.
sheet_title: String. Default is null. Worksheet name.
mode: String (append|replace|truncate|update). Default is append.
range: String. Default is A1, which is the top left corner of a sheet. For APPEND mode, range has no effect because new rows are appended after the last row.
rows_threshold: Integer. Default is 50000, maximum is 2000000. The Google API has a threshold of 10MB for the request payload; this data connector automatically detects which threshold is reached first.
value_input_option: String (raw|user_entered). Default is raw.
set_nil_for_double_nan: Default is true (for example, turns NaN into an empty string).
*Choose either spreadsheet_id OR spreadsheet_title. The two parameters are mutually exclusive.
timezone: UTC

_export:
  td:
    database: sample_datasets

+td-result-into-target:
  td>: queries/sample.sql
  result_connection: my_googlesheet_connector
  result_settings:
    sheet_title: value1
    mode: replace
    ....
Click here for more information on using data connectors in workflow to export data.
Appendix
Find the Spreadsheet Key
You can obtain the Spreadsheet key from the access URL, as follows:
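A typical spreadsheet URL has the following form; the spreadsheet key is the long identifier between /d/ and /edit:
https://docs.google.com/spreadsheets/d/<SPREADSHEET_KEY>/edit#gid=0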
Find the Folder Key
You can obtain the Google Drive Folder key from the access URL, as follows:
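A typical Google Drive folder URL has the following form; the folder key is the identifier after /folders/:
https://drive.google.com/drive/folders/<FOLDER_KEY>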
Required: Migrate your Existing Authentications
Due to the Google Sheets V3 deprecation scheduled for early March 2020, all your existing authentications must be updated. You must perform the update on your existing authentications before mid-February 2020.
Until March 2020, your existing authentications, without the updates, continue to work properly when the Folder Key option is left blank.
The upgrade requires that your existing Google Sheet Authentications apply a new Folder Key option. To use the new Folder Key option, you must select the permission View metadata for files in your Google Drive in your Google account. If you do not update your existing Authentications with the new Folder Key option, your data connector no longer works as expected and your export jobs fail.
Complete the update as follows:
In TD Console, go to Integrations Hub> Authentications. Locate your Google Sheet Authentication object.
Click Click here to connect a new account to be redirected to Authentication in Google.
In Google, select the appropriate account and with View metadata for files in your Google Drive scope included, click Allow.
Go back to the TD Console, open your Google Sheet Authentication again, and select the latest authentication account at the bottom of the dropdown list.
Click Continue and Done to save the Authentication.
Repeat steps 4 & 5 with the other authentications that use the same Google account.
To check whether your authentication includes the permission, log into your Google account and, in the browser, access the URL https://myaccount.google.com/permissions. Check for the Arm Treasure Data permissions on Google Drive as follows:
You can import Google Ads reports into Arm Treasure Data using the Google Ads data connector.
You use this same connector to export Google Ads Remarketing data. See Writing Job Results into Google Ads Remarketing Lists.
Google has rebranded Google AdWords to Google Ads.
Prerequisites
Basic knowledge of Treasure Data
Basic knowledge of Google Ads
Use TD Console
Create a new connection
Go to Integrations Hub > Catalog and search and select Google Ads input. The following dialog opens.
Keywords Performance Report
Click to connect a new account. Log into your Google Ads account from the new window and grant Treasure Data access to your Ads campaigns:
You will be redirected back to Treasure Data Catalog. Repeat the step to connect to a new account to choose your new OAuth connection.
Click Continue and give your connection a name:
Then click Done.
Create a new source
After creating the connection, you are automatically taken to Sources tab. Look for the connection you created and click New Source.
The following dialog opens. Edit the details and click Next.
In the field Ad Account enter the Customer ID in your Google Ads console UI.
The parameters that you specify next are based on the Google Ad report type.
Report Type
You can select from the following reports: Ad Performance, AdGroup Performance, Campaign Performance, Audience Performance, and Keywords Performance. See Appendix B for more information.
Attributes / Segments / Metrics
Depending on the report type picked, only the attributes, segments, and metrics that are available are shown when you click Add.
Some predefined attributes are automatically added, corresponding to the report type. See Appendix C for more information.
If you check Use predefined metrics, a list of predefined metrics is automatically added. See Appendix D for more information. You can select additional metrics from the Metrics dropdown.
Incremental Loading
By enabling incremental loading, you can schedule the data loading during a time period that you specify. Note that data is fetched only when the full period of time (automatically computed based on the Date Range value) is complete.
Date Range
Specify the date range of the report being generated. You must enter a start date and an end date if you select CUSTOM_DATE. The period for scheduling mentioned in Incremental Loading is the number of days between Start Date and End Date. See Appendix E for more information. Click Next.
Preview
You see a preview of your data similar to the following image. The preview data doesn't reflect the actual data from your Google Ads account. It's just dummy data to show all the fields you picked, as well as the data type of those fields.
If you wish to change anything, click Back or else click Next.
Advanced Settings
Advanced Settings allow you to customize the guessed properties. Edit the following section, if needed.
Choose the target database and table
Choose an existing or create a new database and table.
Create a new database and give your database a name. Complete similar steps for Create new table.
Select whether to append records to an existing table or replace your existing table.
If you want to set a different partition key seed rather than use the default key, you can specify one using the popup menu.
Scheduling
In the Schedule tab, you can specify a one-time transfer, or you can schedule an automated recurring transfer. If you select Once now, click Next. If you select Repeat specify your schedule options, then click Next.
After your transfer has run, you can see the results of your transfer in the Databases tab.
Use Command Line
Install TD Toolbelt v0.11.9 or later
You can install the newest Treasure Data Toolbelt.
$ td --version
0.15.0
Create Configuration File
Prepare a configuration file (for example, load.yml) with your Google Ads account access information, similar to the following example.
in:
  type: google_adwords
  client_id: "<app client id>" (required, string)
  client_secret: "<app client secret>" (required, string)
  refresh_token: "<your oauth2 refresh token>" (required, string)
  developer_token: "<Ads developer token>" (required, string)
  target: audience_performance_report (required, string)
  client_customer_id: "<your Google Ads account number>" (required, string)
  date_range: custom_date (required, enum)
  start_date: 2018-04-01 (required if `custom_date` date range, date)
  end_date: 2018-04-07 (required if `custom_date` date range, date)
  include_zero_impressions: true (optional, boolean, default is `true`)
  incremental: true (optional, boolean, default is `false`)
  include_negative_keywords: true (optional, boolean, default is `false`)
  columns:
    - AdGroupName
    - CampaignStatus
    - Date
    - Clicks
    - ...
out:
  mode: replace
This example dumps Audience Performance report from Google Ads:
client_id: OAuth application client id.
client_secret: OAuth application client secret
refresh_token: Refresh token, result of OAuth authentication flow
target: Google Ads report type to be imported.
See Appendix B for the list of available target.
client_customer_id: Google Ads account number
Google Ads Account number is the Customer ID in your Google Ads console UI
date_range: Available date range from Ads
See Appendix E for the list of available date_range.
start_date: From which date (yyyy-MM-dd) report data to be generated. This field is required if custom_date is selected for date_range.
end_date: To which date (yyyy-MM-dd) product data to be imported. This field is required if date_range is custom_date.
include_zero_impressions: true means accepting all rows that have no impression
incremental: Support incremental loading
columns: List of attributes, segments and metrics to be collected
See Appendix F for the list of available attributes, segments and metrics.
include_negative_keywords: true means accepting all negative and normal keywords. This option is applied to Keyword Performance Report only.
For more details on available out modes, see Appendix A
(optional): Preview data to import
You can preview data to be imported using the command td connector:preview.
$ td connector:preview load.yml
+-----------------+---------------------+-----------------+----
| id:long | displayurl:string | headline:string | ...
+-----------------+---------------------+-----------------+----
| 42023 | "Hello" | "Ads" |
| 42045 | "World" | "Ads" |
+-----------------+---------------------+-----------------+----
Execute Load Job
You use td connector:issue to execute the job. The following are required:
name of the schedule
cron-style schedule
database and table where their data will be stored
the Data Connector configuration file
$ td connector:issue load.yml --database td_sample_db --table td_sample_table --time-column updated_date
If the database or the table do not exist in TD, td connector:issue fails
It's also recommended to specify the --time-column option, because Treasure Data's storage is partitioned by time (see data partitioning). If the --time-column option is not given, the data connector selects the first long or timestamp column as the partitioning time. The type of the column specified by --time-column must be either long or timestamp (use Preview results to check the available column names and types). A time column is available at the end of the output.
$ td connector:issue load.yml --database td_sample_db --table td_sample_table --time-column updated_date --auto-create-table
If your data doesn't have a time column, you can add it using the add_time filter. You add the "time" column by adding the add_time filter to your configuration file as follows.
in:
  type: google_adwords
  ...
filters:
  - type: add_time
    from_value:
      mode: upload_time
    to_column:
      name: time
out:
  type: td
More details at add_time filter plugin.
If you have a field called time, you don't need to specify the --time-column option.
$ td connector:issue load.yml --database td_sample_db --table td_sample_table
Scheduled execution
You can schedule periodic data connector execution for Google Ads import. We have configured our scheduler carefully to ensure high availability. By using this feature, you no longer need a cron daemon on your local data center.
Create the schedule
A new schedule can be created using the td connector:create command. The name of the schedule, cron-style schedule, the database and table where their data is stored, and the data connector configuration file are required.
$ td connector:create \
daily_google_ads_import \
"10 0 * * *" \
td_sample_db \
td_sample_table \
load.yml
The `cron` parameter also accepts these three options: `@hourly`, `@daily` and `@monthly`.
By default, the schedule is set up in the UTC timezone. You can set the schedule in a different timezone using the -t or --timezone option. The `--timezone` option supports only extended timezone formats like 'Asia/Tokyo' and 'America/Los_Angeles'. Timezone abbreviations like PST and CST are *not* supported and may lead to unexpected schedules.
List the Schedules
You can see the list of currently scheduled entries by td connector:list.
$ td connector:list
+-----------------------+--------------+----------+-------+--------------+-----------------+-----------------------------+
| Name | Cron | Timezone | Delay | Database | Table | Config |
+-----------------------+--------------+----------+-------+--------------+-----------------+-----------------------------+
| daily_google_ads_import | 10 0 * * * | UTC | 0 | td_sample_db | td_sample_table | {"type"=>"google_adwords", ... } |
+-----------------------+--------------+----------+-------+--------------+-----------------+-----------------------------+
Show the Setting and History of Schedules
td connector:show shows the execution setting of a schedule entry.
% td connector:show daily_google_ads_import
Name : daily_google_ads_import
Cron : 10 0 * * *
Timezone : UTC
Delay : 0
Database : td_sample_db
Table : td_sample_table
td connector:history shows the execution history of a schedule entry. To investigate the results of each individual execution, use td job <jobid>.
% td connector:history daily_google_ads_import
+--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+
| JobID | Status | Records | Database | Table | Priority | Started | Duration |
+--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+
| 578066 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-18 00:10:05 +0000 | 160 |
| 577968 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-17 00:10:07 +0000 | 161 |
| 577914 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-16 00:10:03 +0000 | 152 |
| 577872 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-15 00:10:04 +0000 | 163 |
| 577810 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-14 00:10:04 +0000 | 164 |
| 577766 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-13 00:10:04 +0000 | 155 |
| 577710 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-12 00:10:05 +0000 | 156 |
| 577610 | success | 10000 | td_sample_db | td_sample_table | 0 | 2015-04-11 00:10:04 +0000 | 157 |
+--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+
8 rows in set
Delete the Schedule
td connector:delete removes the schedule.
$ td connector:delete daily_google_ads_import
Appendix
A) Modes for out plugin
You can specify the file import mode in the out section of load.yml.
append (default)
This is the default mode and records are appended to the target table.
in:
  ...
out:
  mode: append
replace (In td 0.11.10 and later)
This mode replaces data in the target table. Any manual schema changes made to the target table remain intact with this mode.
in:
  ...
out:
  mode: replace
B) Available report types
Report Type
Description
ad_performance_report
Ad Performance Report
adgroup_performance_report
AdGroup Performance Report
campaign_performance_report
Campaign Performance Report
audience_performance_report
Audience Performance Report
keywords_performance_report
Keywords Performance Report
C) Predefined Attributes
Report Type
Predefined Attributes
Ad Performance
Id DisplayUrl Headline HeadlinePart1 HeadlinePart2
AdGroup Performance
AdGroupId AdGroupName
Campaign Performance
CampaignId CampaignName
Audience Performance
Id UserListName
Keywords Performance Report
Id AdGroupId
D) Predefined Metrics
Report Type
Predefined Metrics
Ad Performance AdGroup Performance Campaign Performance
Clicks Impressions Ctr AverageCpc Cost Conversions ViewThroughConversions CostPerConversion ConversionRate
Audience Performance
Clicks Impressions Ctr AverageCpc Cost AverageCpm AbsoluteTopImpressionPercentage TopImpressionPercentage
Keywords Performance
Clicks Impressions Ctr AverageCpc Cost Conversions ViewThroughConversions CostPerConversion ConversionRate AbsoluteTopImpressionPercentage TopImpressionPercentage
E) Available Date Ranges
TODAY: Today only.
YESTERDAY: Yesterday only.
LAST_7_DAYS: The last 7 days, not including today.
LAST_WEEK: The seven-day period starting with the previous Monday.
LAST_BUSINESS_WEEK: The five-day business week, Monday through Friday, of the previous business week.
THIS_MONTH: All days in the current month.
LAST_MONTH: All days in the previous month.
ALL_TIME: The entire available time range.
CUSTOM_DATE: A custom date range; requires start_date and end_date in yyyy-MM-dd format (see the sketch after this list).
LAST_14_DAYS: The last 14 days, not including today.
LAST_30_DAYS: The last 30 days, not including today.
THIS_WEEK_SUN_TODAY: The period between the previous Sunday and the current day.
THIS_WEEK_MON_TODAY: The period between the previous Monday and the current day.
LAST_WEEK_SUN_SAT: The seven-day period starting with the previous Sunday.
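As an illustration only, here is a hypothetical sketch of how CUSTOM_DATE might look in the in section of load.yml; apart from start_date and end_date (described above), the key names and the connector type value are assumptions and should be checked against your actual connector schema.
in:
  type: google_ads          # assumed type name, for illustration only
  report_type: campaign_performance_report
  date_range: CUSTOM_DATE   # assumed key name
  start_date: 2015-04-01    # yyyy-MM-dd, required for CUSTOM_DATE
  end_date: 2015-04-30      # yyyy-MM-dd, required for CUSTOM_DATE
  ...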
F) Available Fields
Ad Performance: Ad Performance Report
AdGroup Performance: AdGroup Performance Report
Campaign Performance: Campaign Performance Report
Audience Performance: Audience Performance Report
Keywords Performance: Keywords Performance Report
Using Tableau Desktop on Mac with Arm Treasure Data allows users to interactively explore data using Presto.
Prerequisites
Mac OS 10.12 or later
Tableau Desktop for Mac v10.5 or later
Install ODBC Driver for Mac
Visit https://www.tableau.com/support/drivers.
Follow these steps to install the Mac driver for Presto:
Close Tableau Desktop.
Download the driver from the Download link.
Click the downloaded file.
Double-click the Presto .pkg file to run the installer.
Accept the defaults.
Open Tableau Desktop and connect to Presto.
Result output to Tableau
Make the connection
Start Tableau. Click Connect > Presto.
For a complete list of data connections, click To a Server > More.
Then complete the following steps:
Enter Presto JDBC/ODBC Gateway server: api-presto.treasuredata.com
Enter the port: 443
Enter the catalog name: td-presto
Select the Authentication method: Username
Enter the Username: use your master APIKEY (Ref. Get APIKEY)
Select the Require SSL check box.
Select Sign In.
If Tableau cannot complete the connection, verify that your credentials are correct.
Set up the data source
On the data source page, complete the following steps:
Select the default data source name at the top of the page, and then enter a unique data source name for use in Tableau.
From the Schema drop-down list, select the search icon or enter the schema name in the text box and select the search icon, and then select the schema.
In the Table text box, select the search icon or enter the table name and select the search icon, and then select the table.
Drag the table to the canvas, and then select the sheet tab to start your analysis.
Advanced Tips
Change the time column's Data Type to Date & Time
After connecting to the table, you'll find a time field in Measures. The time field contains Unix time data by default. You can drag and drop the time field to Dimensions and change its data type to Date & time:
Limitations
Potential Risk at Presto upgrade
Compatibility between Presto and the Tableau ODBC package isn't guaranteed when Treasure Data Presto is upgraded. For production systems that require regular updates of data sources, we recommend using Result Output to Tableau instead.
Using Tableau Desktop with Arm Treasure Data enables you to interactively explore huge amounts of data. You can also use Treasure Data as a place to integrate and store all your data.
Connect to TD presto endpoint
Select "More..." on the left menu.Choose Prestoand enter the connect information. Click Sign In.
Enter a Schema (database name) and Table (table name) to import data from a table, or to write SQL that imports data from multiple tables. Then click Update Nowto see the data in the Tableau window.
Confirm Import in Treasure Data
In Treasure Data, create a query to track the events coming in from Tableau Desktop. It usually takes 3-5 minutes before events appear in the TD Console.
Tips and Tricks
Convert the time Field to Datetime Type
Treasure Data treats the time column as a Unix timestamp by default; however, it's often more convenient to handle timestamps with the Datetime type in Tableau. The following example describes how to create a calculated Datetime field from the time column.
Select Create Calculated Field...
Then enter a name and the following function in the Formula section.
RAWSQL_DATETIME("from_unixtime(%1)", [time])
Next, the calculated field appears in a list of Dimensions, with a Datetime type.
Contact Us
If you are interested in using Treasure Data with Tableau, contact your account representative.
Using Tableau Online with Arm Treasure Data allows you to interactively explore huge amounts of data, and share your data discoveries across your organization.
Prerequisites
Basic knowledge of Treasure Data
A Tableau Online license
Usage
CLI: One Time Query
1. JSON-style config (new and recommended)
Add the Tableau result output destination by using the -r / --result option of the td query command:
$ td query -d mydb -r '{"type":"tableau","host":"company.online.tableau.com","username":"my_user","password":"passw0rd","ssl":true,"ssl_verify":false,"server_version":"online","datasource":"my_ds","site":"MarketingTeam","project":"","mode":"replace","chunk_size_in_mb":50,"timezone":"America/Los_Angeles"}' 'SELECT * FROM access'
where:
username: Your Tableau Online Username
password: Your Tableau Online Password
host: Tableau Online Host on which your site exists (e.g. 10ay.online.tableau.com)
datasource: target Tableau DataSource name
site: The URL of the site to sign in to
To determine the value to use for the site attribute, sign in to Tableau Online and examine the value that appears after /site/ in the URL. For example, in the following URL, the site value is MarketingTeam:
https://online.tableau.com/#/site/MarketingTeam/workbooks
Refer to the Tableau Online documentation for more details.
chunk_size_in_mb: Chunk file size (in MB) to be uploaded each time, default: 200, min: 100, max: 1024
timezone: timezone to convert from Timestamp data type to Tableau DateTime data type, default: UTC
2. URL-style (not recommended)
Change the Tableau result output destination from JSON format to URL format as in the following example:
$ td query -d mydb -r 'tableau://username:password@host/datasource?mode=replace&site' 'SELECT * FROM access'
3. Migrating Existing Output Configurations to Tableau Hyper
Tableau has updated its data engine with a technology called Hyper. To take advantage of this, you must update your existing Treasure Data output configurations. New Treasure Data configurations automatically use the latest version.
To upgrade your existing configurations:
Go to Queries and select your scheduled query.
In the Query Editor, click the Export Results target.
Choose your saved Tableau connection.
Input the Site ID value; it's required for Tableau Online.
Select 'hyper' for Data Source Type.
Click Done to save your configuration.
You do not need to change the timezone configuration when migrating from the legacy Tableau to the current Tableau. You can leave the default value (UTC). See also timezone in this article.
Console: One Time Query
Go to the query editor on the TD Console, and type in the query. The following example query uses the access log example data set, and calculates the distribution of HTTP method per day.
HIVE:
# HiveQL
SELECT
CAST(TD_TIME_FORMAT(time, "yyyy-MM-dd 00:00:00") AS TIMESTAMP) AS `dates`,
method AS `Method`,
COUNT(1) AS `Count`
FROM
www_access
GROUP BY
TD_TIME_FORMAT(time, "yyyy-MM-dd 00:00:00"),
method
PRESTO:
# Presto
SELECT
CAST(TD_TIME_FORMAT(time, 'yyyy-MM-dd 00:00:00') AS TIMESTAMP) AS "dates",
method AS "Method",
COUNT(1) AS "Count"
FROM
www_access
GROUP BY
TD_TIME_FORMAT(time, 'yyyy-MM-dd 00:00:00'),
method
We're casting the Datetime column from String type to TIMESTAMP type, for convenience in Tableau.
Tableau doesn't support fractional seconds in a timestamp. Remove fractional seconds (for example, by using the substr() function) before casting to the TIMESTAMP type in the query, as in the sketch below.
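For example, a minimal Presto-style sketch of trimming fractional seconds before the cast is shown below; the table name your_table and the string column created_at (assumed to look like 'yyyy-MM-dd HH:mm:ss.SSS') are placeholders for illustration only.
# Presto
SELECT
CAST(substr(created_at, 1, 19) AS TIMESTAMP) AS "event_time" -- keeps only 'yyyy-MM-dd HH:mm:ss'
FROM
your_table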
Choose saved connection
A Choose Integration dialog displays. Select an existing Tableau Online connection. If you do not have a Saved Integration already set up, follow the next step to create a new connection within the Sources Catalog.
Create a new authentication
Go to Integrations Hub > Catalog and search for Tableau. Select Tableau. The following dialog opens:
Username: Your Tableau Online Username
Password: Your Tableau Online Password
Host: Tableau Online Host on which your site exists (e.g. 10az.online.tableau.com)
Additional configuration
After you create a Tableau connection or select an existing one, you see the following Configuration popup.
Parameters (defaults shown where applicable):
Datasource Name: The name of the destination Data Source on Tableau Online.
Site ID: The URL of the site to sign in to; it's required for Tableau Online.
Project Name: Go to your Tableau Online site to get a list of projects. Default: Default.
Mode: replace to replace the Data Source each time, append to append to the existing Data Source. Default: append.
Chunk File Size In MB: The Extract file is split into chunks before uploading. This option defines the file size of each chunk (min: 100, max: 1024). Default: 200.
Timezone: Timezone ID to use when converting from Timestamp (timezone-independent) to Tableau DateTime (timezone-dependent). Default: UTC.
Because of Tableau's specification, Append mode ignores new columns when exporting data into an existing data source. https://onlinehelp.tableau.com/current/pro/desktop/en-us/extracting_addfromfile.html
After completing all the fields, submit the query. The system executes the query, creates the Tableau Data Extract file (.tde or .hyper), and uploads the extract file to Tableau Online.
Go to your Tableau Online, and click Data Sources at the top left bar. You can view the list of data sources, including your TDE file.
Click New Workbook to create the charts and dashboard from the browser. Drag and drop the dimensions and measures from the left nav, to top right nav to create graphs. Then click Save to store the result.
Console: Scheduled Query
In the previous section, you created charts through one job. Next, you can periodically refresh the dataset by using the Scheduled Jobs feature. Using this feature, you execute a query on Treasure Data and update the extract file on Tableau Online. You can specify schedules from the query editor by using the TD_SCHEDULED_TIME() UDF with append mode.
The following query calculates the number of records within the last 24 hours from the time the query is executed. By continuously running this scheduled query, you avoid processing the entire data set every day.
HIVE:
# HiveQL
SELECT
CAST(TD_TIME_FORMAT(time, "yyyy-MM-dd 00:00:00") AS TIMESTAMP) AS `dates`,
method AS `Method`,
COUNT(1) AS `Count`
FROM
www_access
WHERE
TD_TIME_RANGE(time,
TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d'),
TD_SCHEDULED_TIME())
GROUP BY
TD_TIME_FORMAT(time, "yyyy-MM-dd 00:00:00"),
method
PRESTO:
# Presto
SELECT
CAST(TD_TIME_FORMAT(time, 'yyyy-MM-dd 00:00:00') AS TIMESTAMP) AS "dates",
method AS "Method",
COUNT(1) AS "Count"
FROM
www_access
WHERE
TD_TIME_RANGE(time,
TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d'),
TD_SCHEDULED_TIME())
GROUP BY
TD_TIME_FORMAT(time, 'yyyy-MM-dd 00:00:00'),
method
Options
Result output to Tableau supports various options. The options can be specified as URL parameters on the CLI or with the REST APIs or the Console where supported. The options are normally compatible with each other and can be combined. Where applicable, the default behavior is indicated.
ssl
The ssl option determines whether to use SSL for connecting to the Tableau server. When true, SSL is used. ssl=true is the default when this option is not specified.
tableau://username:password@host/?ssl=true
ssl_verify
The ssl_verify option determines whether to require certificate verification for the SSL communication. When true, the certificate verification is required. ssl_verify=true is the default when this option is not specified.
tableau://username:password@host/?ssl=true&ssl_verify=true
Disabling certificate verification is useful when the Tableau server's SSL certificate is self-signed.
timezone
To convert a timestamp value (which is timezone-independent, for example 1548979200) to Tableau DateTime (which includes day, hour, minute, and so on), the connector needs to know the target timezone.
If your query contains a TIMESTAMP column, or you cast a datetime column to TIMESTAMP, the value is exported to the Tableau server as DateTime. This means a conversion takes place, and you need to provide the target timezone as necessary.
Treasure Data stores datetime values in UTC. In most cases, leave the timezone config as the default (UTC) to preserve the value from Treasure Data, unless you specifically want to convert the value to another timezone.
An example of configuring a timezone other than the default UTC is as follows:
From the CLI:
$ td query "..." -r '{ "type": "tableau", ..., "timezone": "America/Los_Angeles" }'
As part of TD Workflow:
host: "company.online.tableau.com"ssl: truessl_verify: trueusername: "my_user"password: "passw0rd"
datasource: "my_ds"site: "my_company"project: "Default" server_version: "online"
timezone: "America/Los_Angeles"
Limitations
The maximum result is 250,000,000 records. If it is exceeded, the log displays the message: Extract file records limit exceeded: 250000000.
The lowest Timestamp value is 1000-01-01 00:00:00. If a value is earlier than this, the log displays the message: invalid date value.
Known issues:
Appending a large dataset to an existing Data Source may result in a timeout. In that case, you will see the following message in the job logs:
2019-04-10 19:20:41.460 +0000 [WARN] (0001:transaction): !!! Data Source Publish is timed out. This is a known issue of Tableau when you append a large extract file to existing Data Source.
2019-04-10 19:20:41.460 +0000 [WARN] (0001:transaction): !!! Check Tableau Console for final result of the Publish.
The job succeeds, but you must check the final result on the Tableau Console (you can verify by the Number of records in the Data Source).
Using Tableau Server with Arm Treasure Data allows users to interactively explore huge amounts of data, and share data insights across your organization.
Tableau Server 2018.1 Release Notes
Prerequisites
Basic knowledge of Treasure Data
A Tableau Server license and installation
Usage
CLI: One Time Query
JSON-style config (new and recommended)
Add the Tableau result output destination by using the -r / --result option of the td query command:
$ td query -d mydb -r '{"type":"tableau","host":"my.tableauserver.com","username":"my_user","password":"passw0rd","ssl":true,"ssl_verify":false,"server_version":"10.0","datasource":"my_ds","site":"","project":"","mode":"replace","chunk_size_in_mb":50,"timezone":"PST"}' 'SELECT * FROM access'
where:
username: Your Tableau Server Username
password: Your Tableau Server Password
host: Your Tableau Server Hostname
site: The URL of the site to sign in to (optional)
To determine the value to use for the site attribute, sign in to Tableau Server and examine the value that appears after /site/ in the URL. For example, in the following URL, the site value is MarketingTeam:
https://MyServer/#/site/MarketingTeam/projects
If the site attribute is an empty string, you are signed in to the default site. Note that you always sign in to a specific site, even if you don't specify a site when you sign in.
Refer to the Tableau Server documentation for more details.
datasource: target Tableau DataSource name
chunk_size_in_mb: Chunk file size (in MB) to be uploaded each time, default: 200, min: 100, max: 1024
timezone: timezone to convert from Timestamp data type to Tableau DateTime data type, default: UTC
URL-style config (not recommended)
Change the Tableau result output destination from JSON format to URL format as shown in the following example:
$ td query -d mydb -r 'tableau://username:[email protected]/?mode=replace' 'SELECT * FROM access'
$ td query -d mydb -r 'tableau://username:[email protected]:8443/?mode=replace&ssl=true' 'SELECT * FROM access'
Migrating Existing Output Configurations to Tableau Hyper
Tableau has updated its data engine with a technology called Hyper. To take advantage of this, you must update your existing Treasure Data output configurations. New Treasure Data configurations automatically use the latest version.
To upgrade your existing configurations:
Go to Queries and select your scheduled query.
In the Query Editor, click the Export Results target.
Choose your saved Tableau connection.
Input the Site ID value; it's required for multi-tenant Tableau Server (leave it empty if you sign in to the Default site).
Select 'hyper' for Data Source Type.
Click Done to save your configuration.
You do not need to change the timezone configuration when migrating from the legacy Tableau to the current Tableau. You can leave the default value (UTC). See also timezone in this article.
Console: One Time Query
Go to the query editor on the TD Console, and type in the query. The following example query uses the access log example data set, and calculates the distribution of HTTP method per day.
HIVE:
# HiveQL
SELECT
CAST(TD_TIME_FORMAT(time, "yyyy-MM-dd 00:00:00") AS TIMESTAMP) AS "dates",
method AS `Method`,
COUNT(1) AS `Count`
FROM
www_access
GROUP BY
TD_TIME_FORMAT(time, "yyyy-MM-dd 00:00:00"),
method
PRESTO:
# Presto
SELECT
CAST(TD_TIME_FORMAT(time, 'yyyy-MM-dd 00:00:00') AS TIMESTAMP) AS "dates",
method AS "Method",
COUNT(1) AS "Count"
FROM
www_access
GROUP BY
TD_TIME_FORMAT(time, 'yyyy-MM-dd 00:00:00'),
method
We're casting the Datetime column from String type to TIMESTAMP type, for convenience in Tableau. NOTE: Tableau doesn't support fractional seconds in a timestamp. Remove fractional seconds (for example, using the substr() function) before casting to the TIMESTAMP type in the query.
Choose saved connection
A Choose Integration dialog displays. Select an existing Tableau Server connection. If you do not have a Saved Integration already set up, follow the next step to create a new connection within the Sources Catalog.
Create a new authentication
Go to Integrations Hub > Catalog, then search for and select Tableau. The following dialog opens:
Username: Your Tableau Server Username
Password: Your Tableau Server Password
Host: Your Tableau Server hostname (to specify a port number, use HOST:PORT)
Additional configuration
After you create a Tableau connection or select an existing one, you see the following Configuration popup.
Parameters (defaults shown where applicable):
Datasource Name: The name of the Data Source on Tableau Server.
Site ID: If you don't have a specific Site, set an empty string for Tableau Server.
Project Name: Go to your Tableau Server to get a list of projects. Default: Default.
Mode: replace to replace the Data Source each time, append to append to the existing Data Source. Default: append.
Chunk File Size In MB: The Extract file is split into chunks before uploading. This option defines the file size of each chunk (min: 100, max: 1024). Default: 200.
Timezone: Timezone ID to use when converting from Timestamp (timezone-independent) to Tableau DateTime (timezone-dependent). Default: UTC.
Currently, you MUST set up https (SSL). See the Tableau documentation for how to set up SSL on your Tableau Server.
Because of Tableau's specification, Append mode ignores new columns when exporting data into an existing data source. https://onlinehelp.tableau.com/current/pro/desktop/en-us/extracting_addfromfile.html
After completing all the fields, submit the query. The system executes the query, creates the Tableau Data Extract file (.tde or .hyper), and uploads the extract file to Tableau Server.
Go to your Tableau Server, and click Data Sources at the top left bar. You can view the list of data sources, including your TDE file.
Click New Workbook to create the charts and dashboard from the browser. Drag and drop the dimensions and measures from the left nav, to top right nav to create graphs. Then click Save to store the result.
Console: Scheduled Query
In the previous section, you created the charts using one job. Next, you can periodically refresh the dataset by using the Scheduled Jobs feature. Using this feature, you periodically execute the query on Treasure Data and update the extract file on Tableau Server. You can specify schedules from the query editor; it is useful to combine the TD_SCHEDULED_TIME() UDF with append mode.
The following query calculates the number of records within the last 24 hours from the time the query is executed. By continuously running this scheduled query, you avoid processing the entire data set every day.
HIVE:
# HiveQL
SELECT
CAST(TD_TIME_FORMAT(time, "yyyy-MM-dd 00:00:00") AS TIMESTAMP) AS `Datetime`,
method AS `Method`,
COUNT(1) AS `Count`
FROM
www_access
WHERE
TD_TIME_RANGE(time,
TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d'),
TD_SCHEDULED_TIME())
GROUP BY
TD_TIME_FORMAT(time, "yyyy-MM-dd 00:00:00"),
method
PRESTO:
# Presto
SELECT
CAST(TD_TIME_FORMAT(time, 'yyyy-MM-dd 00:00:00') AS TIMESTAMP) AS "dates",
method AS "Method",
COUNT(1) AS "Count"
FROM
www_access
WHERE
TD_TIME_RANGE(time,
TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d'),
TD_SCHEDULED_TIME())
GROUP BY
TD_TIME_FORMAT(time, 'yyyy-MM-dd 00:00:00'),
method
Options
Result output to Tableau Server supports various options. The options can be specified as URL parameters on the CLI or with the REST APIs or the Console where supported. The options are normally compatible with each other and can be combined. Where applicable, the default behavior is indicated.
ssl
The ssl option determines whether to use SSL for connecting to the Tableau server. When true, SSL is used. ssl=true is the default when this option is not specified.
tableau://username:[email protected]/?ssl=true
tableau://username:[email protected]:443/?ssl=true
ssl_verify
The ssl_verify option determines whether to require certificate verification for the SSL communication. When true, the certificate verification is required. ssl_verify=true is the default when this option is not specified.
tableau://username:[email protected]/?ssl=true&ssl_verify=true
Disabling certificate verification is particularly useful when the Tableau server's SSL certificate is self-signed.
project
On Tableau Server, there is always at least one 'Default' project. You can't delete or rename this project. If you omit the Project Name option, the 'Default' project is used.
In some rare cases, especially on non-English Tableau Server installations, the 'Default' project name is localized, and jobs fail if you don't provide a Project Name (because no project named 'Default' can be found). In that case, set the project explicitly, as in the sketch below.
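For illustration, a minimal sketch of setting the project explicitly from the CLI, following the same elided-config pattern as the timezone example below; "Marketing" is a placeholder project name and the ellipsis stands for your other connection settings:
$ td query "..." -r '{ "type": "tableau", ..., "project": "Marketing" }'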
timezone
To convert a timestamp value (which is timezone-independent, for example 1548979200) to Tableau DateTime (which includes day, hour, minute, and so on), the connector needs to know the target timezone.
If your query contains a TIMESTAMP column, or you cast a datetime column to TIMESTAMP, the value is exported to the Tableau server as DateTime. This means a conversion takes place, and you need to provide the target timezone as necessary.
Treasure Data stores datetime values in UTC. In most cases, leave the timezone config as the default (UTC) to preserve the value from Treasure Data, unless you specifically want to convert the value to another timezone.
An example of configuring a timezone other than the default UTC is as follows:
From the CLI:
$ td query "..." -r '{ "type": "tableau", ..., "timezone": "America/Los_Angeles" }'
As part of TD Workflow:
host: "tableau.company.com"ssl: truessl_verify: falseusername: "my_user"password: "passw0rd"
datasource: "my_ds"site: "MarketingTeam"project: "Default"server_version: "2018.1"
timezone: "America/Los_Angeles"
Limitations
The maximum result is 250,000,000 records. If it is exceeded, the log displays the message: Extract file records limit exceeded: 250000000.
The lowest Timestamp value is 1000-01-01 00:00:00. If a value is earlier than this, the log displays the message: invalid date value.
Known issues:
.hyper extracts published to Tableau Server 10.5 via the REST API will have a .tde file extension on the Tableau Console. These extracts are still in .hyper file format and function the same way as other .hyper extracts. The issue is fixed in 2018.1. More details: search for Issue ID 754677.
A Data Source belonging to a nested Project cannot be appended to by using the REST API. A Resource Not Found error is returned. Replace mode does not have this issue. A fix is planned for Tableau Server 2018.1.4. If you see such an error, upgrade to the latest version of Tableau Server.
You can run queries in Mode Analytics and send reports generated from the queries to Arm Treasure Data.
Prerequisites
Mode Analytics account
Basic knowledge of Arm Treasure Data
Requires access to the new Presto-specific JDBC/ODBC API, currently in Private Beta. Contact your account representative if you'd like access to this new feature.
Retrieve Treasure Data API Key
Retrieve the Treasure Data API key from the Console's profile page. The key must be the master API key type.
Connect a Database
Connect to Treasure Data from Mode. Click your profile icon button on the menu, and click Connect a Database menu.
Then, select Database in the Cloud tab.
You'll find Treasure Data Presto in the database list. Click Treasure Data Presto.
Set API key
After you retrieve the credential, enter it in the API Key field.
Create Reports
Now it's time to create reports! In the editor, select the database connection you've created. Then execute the SQL queries and create reports from the queries.
While Arm Treasure Data is a BI-tool-agnostic service, customers like retailer MUJI are using Tableau for BI / visual analytics. In this article, we showcase our customers' usage pattern of combining Treasure Data, Tableau Desktop, and Tableau Server. Let's begin by understanding the characteristics of each solution.
Treasure Data: Tableau Desktop ODBC connector
Tableau is a business intelligence software that helps people see and understand their data. There are two major products provided by Tableau Software: Tableau Desktop and Tableau Server. Tableau Desktop is a Desktop Application used to visualize and analyze data. It helps create workbooks, visualizations, dashboards, and stories.
Users can publish visualized data to Tableau Server for sharing within an organization. Tableau Desktop is a BI designer tool, and Tableau Server is a publishing environment to share the visualizations. Tableau Online is a hosted version of Tableau Server, which doesn't require you to manage the BI server.
Tableau Desktop Product Tour (Video)
Tableau Server Product Tour (Video)
Treasure Data Introduction (Video)
Treasure Data Technical Overview & Concepts (Video)
Tableau + Treasure Data Reference Architecture
So, why combine Treasure Data & Tableau? Treasure Data provides a scalable backend to handle new big data sources (application logs, web logs, mobile data, sensor data, etc), while Tableau provides flexible visual analytics for existing data sources (EDW, CRM, ERP, etc).
By combining Treasure Data and Tableau, you can quickly get insights on any type of data source of any size. Let's walk through the following architecture diagram.
Collect Big Data (Treasure Data)
First, let's start collecting data into Treasure Data. Treasure Data provides various ways to collect data into the cloud in near real time. The data sources depicted here are time-series data, which means the data is historical, produced in real time, and growing rapidly as your business scales. Here are the main data collection capabilities provided by Treasure Data:
JavaScript SDK for website tracking
Mobile & Gaming SDK (Android, iOS, Unity, Unreal Engine) for mobile and gaming application tracking
Treasure Agent for streaming data collection
Bulk Loader for parallel bulk loading
Data Connector for pre-built integrations
Treasure Data imports almost 1 Million records per second, and Treasure Data customers benefit from such scale. Setting up the data collection usually takes only a couple of hours, or even a few minutes in some cases.
Aggregate Big Data (Treasure Data)
Now we have raw data in the cloud. To provide a better experience for the BI consumers, it's a good idea to summarize this raw data into smaller sets for performance reasons. By using one of Treasure Data's embedded query engines, you can crunch big data into an aggregated format.
Treasure Data supports Tableau Result Output so you can directly push the aggregated results into Tableau Server. You don't need any additional infrastructure to do this. You can even automate this process by using Scheduled Jobs to periodically aggregate the data.
Treasure Data can push the query results as a Tableau Data Extract (TDE) file. TDE is Tableau's proprietary columnar file format, optimized for efficiently slicing and dicing data (see Why Use Tableau Data Extracts). The TDE file is saved directly into Tableau Server.
Treasure Data: Writing Job Results into Tableau Server
Treasure Data: Writing Job Results into Tableau Online
Design Workbooks (Tableau Desktop)
Now we have raw data access and aggregated data too. It's time to explore the data using Tableau Desktop. Tableau offers many built-in connectors for existing data sources (EDW, CRM, ERP, Excel, etc.) that you can interact with directly.
Tableau Desktop: List of Data Sources
Treasure Data provides an ODBC driver for Tableau Desktop so that data analysts can have raw data access.
Tableau Desktop: Treasure Data ODBC connector
Analysts can choose any of the above methods depending on their needs. You can also join across these data sources. For example, you can create a join between Salesforce.com data and a TDE file, or even join multiple TDE files. When the workbook is created, Tableau Desktop can publish it to Tableau Server.
Share the Workbooks (Tableau Server)
Now everything is set. Analysts can publish workbooks to the server, and consumers can view them from their browsers. Analysts can quickly iterate on the data and reports by having access to all the data sources, so they're now self-reliant.
Tableau Desktop: Publishing to Tableau Server (Video)
Summary
Tableau + Treasure Data empowers data-driven companies to rapidly explore data and get insights. By combining these two solutions, your team can focus on insights, not infrastructure, with an industry-leading visual analytics tool. If you have questions, contact us. For the next steps, follow these links to learn how to connect to Treasure Data from each Tableau product.
Treasure Data and Tableau: A complete big data analytics solution in days (White Paper)
Treasure Data: Writing Job Results into Tableau Server
Treasure Data: Writing Job Results into Tableau Online
Maximize your data resources by using Treasure Data with Google DoubleClick for Publishers (DFP). Google DoubleClick for Publishers is now known as Google Ad Manager.
You can create audience lists in your Google DoubleClick for Publishers (DFP) using data held in Treasure Data. Follow these steps to move cookies and Mobile Advertising Identifiers to new or existing audience lists within Google DFP.
Prerequisites
Basic knowledge of Treasure Data, including the TD Toolbelt
A DFP Account
Authorized Treasure Data DMP access to your Google DFP Account
Grant access for Treasure Data
Treasure Data's data connector requires permissions to create audience segments in your Google DFP account. Use the Google Contact Us form to reach the DoubleClick for Publishers Support team and request that Treasure Data be granted access to your DFP account. Provide the following information in the form:
Request: Grant Treasure Data permissions
Your DoubleClick for Publishers account ID (referred to by Google as the Audience Link ID)
Treasure Data DMP:
Customer ID: 140-996-0635
NID: treasuredata_dmp
Instructions on how to find your Audience Link ID. Basically, in Google Bid Manager, you click Admin > Global settings > All network settings to find the Audience Link ID.
You are sending information so that Google recognizes Treasure Data and connects your Google DFP account to Treasure Data.
To export data, you create or select an existing connection, create or reuse a query, and then run the query to export your audience lists.
Create or reuse a query
Go to the Treasure Data console.
Go to Query Editor. You can pick the query language and write your query.
Select your database and table.
Access the query that you plan to use to export data. Here is a sample query:
SELECT DISTINCT "cookie", "list_name", "time" FROM "google_dfp_ddp"
Create or select an existing DFP connection
Option 1: Create a connector
Steps: Go to Integrations Hub > Catalog, then search for and select DoubleClick for Publishers. Complete the required fields on the first pane. In the Audience Link ID field, enter the ID that you use in your DFP. Next, complete the Configuration fields.
Tip: Sometimes you need to define the column mapping before writing the query.
Option 2: Select a connector
Steps: Go to the Treasure Data console and open the Query Editor. Access the query that you plan to use to export data. Click Output results; the Choose Saved Connection dialog opens. Type the connection name in the search box to filter, and click the connection that you want. Next, complete the Configuration fields.
Tip: You can also create a new connection from here by clicking Create New Connection.
Creating a new connection:
Additional connector configuration information
Selecting an existing connection:
Configure the connection by specifying the parameters
After you select your Google DoubleClick for Publishers (DFP) connection, the Configuration dialog pane appears:
Source column name mappings (optional)
Define the mapping from the output column names in your query to the Google DFP column names. You specify the target column and then the source column. For example, if google_cookie is the identifier column in your TD data source, define the mapping as cookie:google_cookie. If the source column in the mapping is missing, the target column name is used; for example, cookie is the same as the cookie:cookie mapping.
See Additional connector configuration information for more information about supported column names and usage.
Cookie or mobile identifier column header
Specify the original source of the user cookie or mobile identifier.
You must select one of the options:
cookie_encrypted: Encrypted identifier (for example, Web), a cookie hash of user id
cookie_idfa: iOS Advertising Identifier
cookie_adid: Android Advertising Identifier
See Cookie or Mobile Identifier Name for more information on how identifier names are handled.
Execute the query
Either save the query with a name and then run it, or just run the query. When the query completes successfully, the result is automatically processed. Plan to transfer your data at least 24 hours ahead of when you need the audience lists (also referred to as segments) to be available in Google DFP.
Additional connector configuration information
The manner in which Google handles your data from Treasure Data affects how you configure the connector.
Column Mappings
The Google DFP connector reads the data source table by columns and uses the following column name mappings to process each row of data (a sample query illustrating these columns follows the list):
cookie: The encrypted Google ID or Mobile Advertising Identifier that is used in ID matching. This column contains the cookie hash or mobile identifier of your users.
list_name: This column contains the name of the audience list (segment) that you want to create in your DFP audience. If the list name does not exist in DFP, a new list is created. If the list name does exist, the existing list is updated.
timestamp (optional): The timestamp (seconds since epoch). If this column does not exist or is missing, the current timestamp is used.
delete (optional): This column contains boolean values (`false` or `true`) or numbers (`0` or `1`) to indicate whether the cookie is to be added to or removed from the given audience segment. The value defaults to false if it is left blank or if the column is not provided.
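For illustration, a minimal Presto-style sketch of a query that emits these columns, extending the sample query earlier in this article; the constant false for the delete column is a placeholder for whatever expression you use to flag removals.
SELECT DISTINCT
"cookie",
"list_name",
"time",
false AS "delete" -- optional: true (or 1) removes the cookie from the segment
FROM "google_dfp_ddp"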
Cookie or Mobile Identifier Name
The upload process supports several different identifiers, and it is important that types of identifiers are uploaded to the segment using the correct upload file format. Generally, the identifiers fall into one of two categories: encrypted identifiers (anything obtained from Google systems), and raw identifiers (obtained from an external system or source). Any encrypted identifier is uploaded using the cookie_encrypted file format, and raw identifiers are uploaded in a type-specific upload file format such as Mobile Advertising Identifiers, for example cookie_idfa or cookie_adid.
Supported cookie types are cookie_encrypted, cookie_idfa, and cookie_adid.
Appendix
Limitations from Google DoubleClick for Publishers
It may take up to 24 hours for updates to audience lists to be visible in DFP. Expect to wait up to 24 hours from the time of query completion for changes to be reflected in DFP.