Imply Data FAQs | Comparably
Imply was founded by members of the original Druid team, who also built the ingestion and visualization layers around the project. read more
EMPLOYEE
PARTICIPANTS
8
TOTAL
RATINGS
113

Imply Data FAQs

Imply Data's Frequently Asked Questions page is a central hub where its customers can always go to with their most common questions. These are the 61 most popular questions Imply Data receives.

Frequently Asked Questions About Imply Data

  • WARNING: This was performed as a proof of concept and has not been tested in production environments.

    Azure Event Hubs are fully managed, real-time data ingestion service that can support streaming millions of events per second. It is very similar to Apache Kafka.

    Kafka and Event Hub Conceptual Mapping

    Kafka Concept

    Event Hub Concept

    Cluster

    Namespace

    Topic

    Event Hub

    Partition

    Partition

    Consumer Group

    Consumer Group

    Offset

    Offset

    Azure Event Hubs allows accessing the data (publishing and consuming) via Apache Kafka API. Please see the link below for more information

    https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-for-kafka-ecosystem-overview

    Apache Druid Kafka Ingestion services leverages Kafka API to consume messages. This article focuses on how to leverage the same to consume messages from Azure Event Hubs (via Kafka API)

    Setup Namespace and Event Hub in Azure

    Login to Azure Portal and select the Event Hubs from Azure Services and click "Add" to create a new Namespace

    https://imply.io/videos/druid-data-loader-kafka-walk-through

    In the following Create Namespace UI please make sure "Enable Kafka" is checked. This allows the Event Hub to be accessed via the Kafka API

    Make a note of the name space name provided -imply-data-namespace.servicebus.windows.net, this is the bootstrap server configuration that we need to provide for Kafka Consumer.

    Once created the name space will be active as below

    Click on the Namespace to open the name space details page and click on the "Event Hub" to add a new event hub (topic in kafka terminology).

    Provide the topic name in the Create Event Hub screen and adjust the partition and message retention settings as needed

    The above should create a new event hub. Now we need to add access policies to this newly created event hub. This access policy information is needed for us to connect to this topic from Kafka API. Open the details page of the newly created event hub and click on the "Shared access policies" as below, and then click on the "Add" button to add a new access policy

    Create access policy like below, which creates an all access policy.

    Once created copy the "Connection string-primary key" information as below. That is needed for accessing via Kafka API

    Now we are all set to start consuming data from this event hub. Following parameters should be provided to consume data

    "bootstrap.servers":"imply-data-namespace.servicebus.windows.net:9093",

    "security.protocol":"SASL_SSL",

    "sasl.mechanism":"PLAIN",

    "sasl.jaas.config":"org.apache.kafka.common.security.plain.PlainLoginModulerequiredusername=\"$ConnectionString\"password=\"<textcopiedfromConnectionstring-primarykey>\";"

    See the Druid Data Loader section for more information on this.

    Druid Data Loader to Azure Event Hubs

    I have an Imply quick start single node cluster running in one of the Azure VM's. Please see below link for more information on quick start guide

    https://docs.imply.io/on-prem/quickstart

    I'm going to use the Apache Kafka Data Loader in this example to load data from Azure Event Hub. For a detailed look on the data loader for Kafka please take a look at the below tutorial

    https://imply.io/videos/druid-data-loader-kafka-walk-through

    Below is the configuration in the data loader for connecting to Azure Event Hub

    Below is the complete ioConfig the above configuration generates

    "ioConfig": {

    "type": "kafka",

    "consumerProperties": {

    "bootstrap.servers": "imply-data-namespace.servicebus.windows.net:9093",

    "group.id": "$Default",

    "request.timeout.ms": "60000",

    "security.protocol": "SASL_SSL",

    "sasl.mechanism": "PLAIN",

    "sasl.jaas.config": "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"$ConnectionString\" password=\"Endpoint=sb://imply-data-namespace.servicebus.windows.net/;SharedAccessKeyName=all-access;SharedAccessKey=keyFromAzureUI=;EntityPath=new-topic\";"

    },

    "topic": "new-topic"

    },

    NOTE - the value for

    "sasl.jaas.config": "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"$ConnectionString\" password=\"<textcopiedfromConnectionstring-primarykey>\";"

    I have already published some data to the Event Hub by using Kafka API. So by hitting preview, you should see the data from Event Hub in the Druid Data Loader like below

    There you have it, the data from Azure Event Hub can be consumed via Druid Data Loader and a supervisor task can be submitted to create a Druid Datasource. For creating data sources and supervisor tasks follow the below tutorial.

    View Article
  • Issue Description

    "Balancer move segments queue has a segment stuck" in the coordinator log may be caused by the coordinator looking in the wrong place for the segment/s it wants to move.

    Implications

    When this happens then coordinator won't be able to balance segments on the cluster. You may see balancing happening but at a very slow speed. The coordinator will have list of errors for the given historical node.

    Symptoms:

    Check the coordinator log for errors similar to this:

    Failed to connect to host[<host name>:8283]

    connection timed out: <host name>/<ip>:8283

    Debugging

    Run Telnet command to test connectivity with the historical node with the same host and port mentioned in the log from the host which is running the coordinator process. It should fail to connect.

    Now run same command from on this failing historical node and it should connect.

    > telnet <hostname> <port>

    Resolution

    Check the hosts file on the coordinator for a missing/incorrect entry for the failing historical node. Once this hosts file is fixed the issue will resolve.

    View Article
  • Warning

    This is not a recommended or supported process. It was done as a proof of concept only, and has significant limitations. As Power Query attempts to pull the entire result set of the query onto the client, this process does not scale to "big data" amounts of data.

    Prerequisites

    This POC was done using the Imply Quickstart. You can find instructions on setting up the Quickstart here: https://docs.imply.io/on-prem/quickstart

    It was completed using Microsoft Excel 2019 on Windows. Power Query used to be called "Get and Transform" in Excel 2016 and was not available in previous versions and at the time of writing, was only available for Windows.

    Alright... let's get into it.

    From Excel go to Data-->Get Data-->From Other Sources -->Blank Query

    With Power Query open enter the following expression, replacing the IP address with your Broker IP or hostname and the query value with your SQL query.

    = Csv.Document(Web.Contents("http://192.168.56.1:8888/druid/v2/sql/", [Headers=[#"Content-Type"="application/json"], Content=Text.ToBinary("{""query"": ""select * from wikipedia limit 10000"",""resultFormat"":""csv"",""header"":""true""}")]))

    Pressing the Check Mark will execute the query and display a preview

    You will notice that the headers are not in the header row, so the next step is to click Use First Row as Headers

    After that is complete you can give your query a name, and click Close & Load

    And there you have it, you have extracted data from Apache Druid into Excel.

    View Article
  • This article provides steps for one method to test avro ingestion locally using the Imply distribution. The example sends nested avro using parser type: avro_stream and avroBytesDecoder type: schema_registry.

    1. Addextension "druid-avro-extensions" to the loadList in conf-quickstart/druid/_common/common.runtime.properties if running in quickstart mode, and conf/druid/_common/common.runtime.properties if running in cluster mode.

    2. Download Confluent quickstart, install: https://docs.confluent.io/current/quickstart/cos-quickstart.html. We will be using Confluent's supplied Kafka and schema-registry, so make sure no other Kafka process is currently running (port :9092).

    3. Start Confluent from the Confluent directory. Your terminal should look something like this:

    bin/confluent start

    Using CONFLUENT_CURRENT: /var/folders/yy/dk4fdc7n2zdf2gn1n5sblq980000gn/T/confluent.ZLnYVjH0

    Starting zookeeper

    zookeeper is [UP]

    Starting kafka

    kafka is [UP]

    Starting schema-registry

    schema-registry is [UP]

    Starting kafka-rest

    kafka-rest is [UP]

    Starting connect

    connect is [UP]

    Starting ksql-server

    ksql-server is [UP]

    If one of the services doesn't start, you may have port conflicts. After fixing conflicts, you may need to cycle the affected service, e.g.:

    bin/kafka-rest-stop

    bin/kafka-rest-start

    bin/zookeeper-server-stop

    bin/zookeeper-server-start

    4. If running Confluent and Imply on different machines, skip this step.

    Before starting the Imply quickstart, change the ports for the coordinator and broker inconf-quickstart/druid/broker/runtime.properties andconf-quickstart/druid/coordinator/runtime.properties to avoid conflicts with Confluent ports. Change the zookeeper port inconf-quickstart/zk/zoo.cfg.

    If you plan to use the Imply UI and you have never launched localhost:9095, it will be sufficient to modify the coordinator and broker ports in:conf-quickstart/pivot/config.yaml, where "host" = the broker. If you have previously used this Imply UI distribution, modify the port values in the UI under Settings -> My Druid.

    If you don't want to change Imply ports, you will need to change ports in the Confluent distribution.

    5. Create a json record file that represents the Avro you will be ingesting. For our example this record (location.json) is:

    {"hilltop":{"timestamp":"2018-06-21T18:00:00Z","view":"ocean"}}

    6. Create a schema that will describe your data. The schema for location.json is:

    http://druid.io/docs/latest/ingestion/flatten-json.html

    7. Create a Kafka topic ("avro-test") and send the schema registry and sample file to Confluent Kafka:

    bin/kafka-avro-console-producer --broker-list localhost:9092 --topic avro-test --property value.schema='{"type":"record","name":"location","fields":[{"name":"hilltop","type":{"type":"record","name":"anotherNameDescribingType","fields":[{"name":"timestamp","type":"string","doc":"Local time","default":""},{"name":"view","type":"string","doc":"doYouSeeWhatISee","default":""}]}}]}' < location.json

    8. In a different terminal window, check that the topic has been created:

    bin/kafka-topics --list --zookeeper localhost:2181

    __confluent.support.metrics

    __consumer_offsets

    _confluent-ksql-default__command_topic

    _schemas

    avro-test

    9. Ingest the record into Druid by submitting the supervisor spec from the Imply directory. If running Confluent and Imply on separate machines, edit the "url" and "bootstrap.servers" in the spec.

    curl -XPOST -H'Content-Type: application/json' -d @quickstart/avroSupervisor.json http://localhost:8090/druid/indexer/v1/supervisor

    {"id":"location"}\% <- the output id will be the name of the dataSource.

    This supervisor spec looks like this:

    Note that the parseSpec format is "avro". The Confluent process is taking the json file and translating it into avro, and Druid is consuming the avro data.

    Seefor more about creating flattenSpecs.

    10. Check that Druid has received the data, either from the terminal:

    bin/dsql -H localhost:8084 # broker port

    Welcome to dsql, the command-line client for Druid SQL.

    Type "\h" for help.

    dsql> \d

    TABLE_SCHEMA TABLE_NAME

    druid location

    druid wikipedia

    Retrieved 4 rows in 0.03s.

    dsql> select * from "location";

    __time view

    2018-06-21T18:00:00.000Z ocean

    Retrieved 1 row in 0.14s.

    or directly in the Imply UI (localhost:9095).

    View Article
  • If Kafka is configured with SASL / SCRAM, then Druid, which is the Kafka consumer, should pass the SASL / SCRAM credentials in theconsumerProperties block of ioConfig section of the Kafka supervisor spec:

    "ioConfig": {

    "topic": "some_topic",

    "consumerProperties": {

    "bootstrap.servers": "host1:9092,host2:9092,host3:9092",

    "sasl.mechanism": "SCRAM-SHA-256",

    "security.protocol": "SASL_SSL",

    "sasl.jaas.config": "org.apache.kafka.common.security.scram.ScramLoginModule required username=\"some_userID\" password=\"some_password\";"

    }

    "taskCount": 1,

    "replicas": 1,

    "taskDuration": "PT1H"

    }

    View Article
  • When a kill task is submitted, it completes these three steps:

    1. Gets a list of all segments marked as unused (used=0) in the metadata store.

    2. Loops through each segment, deletes from druid_segments where id = x. This process takes a couple hundred milliseconds per segment, so deleting 10k segments can take ~30 min.

    3. Once step 2 is complete, removes each segment from deep storage.

    If the kill task gets interrupted, there is a chance that segments get deleted from the metadata store (step 2) but not deep storage. If this happens, segments must be removed from deep storage manually.

    View Article
  • Objective:

    In this article, we will go over step by step to deploy Imply manager on-prem using Kubernetes, on 4 nodes: 1x master, 1x query, and 2x data nodes.

    Environment setup:

    1. In this article, We use external Zookeeper for Druid

    2. We use external HDFS as deep storage

    3. We use external MySQL as meta DB for both Imply Manager, and Druid.

    4. We use CentOS 7 on all the nodes, but the OS in Imply on-prem manager is Ubuntu based.

    Prerequisites:

    0. Create an user account at https://www.docker.com

    Note: Please contact Imply sales representatives FIRST for account setup, provide us your docker username, so we grant permission to download from our private repo.

    Install Docker on all 4 nodes:

    1. Install yum and git utilities on all nodes:

    yum install -y yum-utils device-mapper-persistent-data lvm2

    yum install -y git

    yum-config-manager --enable rhel-7-server-extras-rpms

    2. Setup Docker repo, and install Docker (community edition) on all nodes.

    yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo

    yum install -yftp://bo.mirror.garr.it/1/slc/centos/7.1.1503/extras/x86_64/Packages/container-selinux-2.9-4.el7.noarch.rpm

    yum -y install docker-ce docker-ce-cli containerd.io

    3. Start Docker and verify successful installation

    # systemctl start docker

    # docker version

    Client:

    Version: 18.09.2

    API version: 1.39

    Go version: go1.10.6

    Git commit: 6247962

    Built: Sun Feb 10 04:13:27 2019

    OS/Arch: linux/amd64

    Experimental: false

    Server: Docker Engine - Community

    Engine:

    Version: 18.09.2

    API version: 1.39 (minimum version 1.12)

    Go version: go1.10.6

    Git commit: 6247962

    Built: Sun Feb 10 03:47:25 2019

    OS/Arch: linux/amd64

    Experimental: false

    Install kubernetes on all 4 nodes:

    1. Add Kubernetes repo to all nodes:

    [root@ip-172-31-7-220 Docker]# cat /etc/yum.repos.d/kubernetes.repo

    [kubernetes]

    name=Kubernetes

    baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64

    enabled=1

    gpgcheck=1

    repo_gpgcheck=1

    gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg

    2. Configure environment settings on all nodes:

    # setenforce 0

    # vi /etc/selinux/config

    SELINUX=permissive ##Change if it is enforcing

    # cat <<EOF > /etc/sysctl.d/k8s.conf

    net.bridge.bridge-nf-call-ip6tables = 1

    net.bridge.bridge-nf-call-iptables = 1

    EOF

    3. Install Kubernetes on all nodes:

    [root@ip-172-31-7-220 Docker]# yum -y install kubelet kubeadm kubectl

    Build a Kubernetes cluster

    1. Initialize Kubernetes admin on one of the nodes, as root user, due to privilege requirements.

    [rootr@ip-172-31-7-220 ec2-user]# kubeadm init

    [init] Using Kubernetes version: v1.15.1

    .

    Your Kubernetes master has initialized successfully!

    To start using your cluster, you need to run the following as a regular user:

    mkdir -p $HOME/.kube

    sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config

    sudo chown $(id -u):$(id -g) $HOME/.kube/config

    You should now deploy a pod network to the cluster.

    Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:

    https://kubernetes.io/docs/concepts/cluster-administration/addons/

    You can now join any number of machines by running the following on each node

    as root:

    kubeadm join 172.31.7.220:6443 --token 4mhtc3.27s6iiaxcggwsi38 --discovery-token-ca-cert-hash sha256:be9ad8be83276854ba93a213bbddc61af476bb016f9f7939a713baec2fe9c7d1

    2. Follow the on screen instruction to copy the config files over as regular user. ("ec2-user") in this case.

    mkdir -p $HOME/.kube

    sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config

    sudo chown $(id -u):$(id -g) $HOME/.kube/config

    3. Follow the on screen instruction to add the other nodes to this Kubernetes cluster

    # kubeadm join 172.31.7.220:6443 --token 4mhtc3.27s6iiaxcggwsi38 --discovery-token-ca-cert-hash sha256:be9ad8be83276854ba93a213bbddc61af476bb016f9f7939a713baec2fe9c7d1

    ..

    4. Verify all nodes have joined this Kubernetes cluster byrunning command 'kubectl get nodes' on the master to see this node join the cluster.

    [ec2-user@ip-172-31-7-220 sysctl.d]# kubectl get nodes

    NAME STATUS ROLES AGE VERSION

    ip-172-31-4-67.us-west-2.compute.internal NotReady <none> 34m v1.15.1

    ip-172-31-5-160.us-west-2.compute.internal NotReady <none> 37m v1.15.1

    ip-172-31-5-3.us-west-2.compute.internal NotReady <none> 35m v1.15.1

    ip-172-31-7-220.us-west-2.compute.internal NotReady master 39m v1.15.1

    Note, all the nodes are showing 'NotReady' under STATUS column.

    5. To make the nodes ready, We need to create a pod network. We can use some sample pod network settings on the internet to create one.

    # kubectl apply -n kube-system -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version|base64 | tr -d '\n')"

    serviceaccount/weave-net created

    clusterrole.rbac.authorization.k8s.io/weave-net created

    clusterrolebinding.rbac.authorization.k8s.io/weave-net created

    role.rbac.authorization.k8s.io/weave-net created

    rolebinding.rbac.authorization.k8s.io/weave-net created

    daemonset.extensions/weave-net created

    6. Give it some time and the nodes will become Ready.

    [ec2-user@ip-172-31-7-220 ec2-user]# kubectl get nodes

    NAME STATUS ROLES AGE VERSION

    ip-172-31-4-67.us-west-2.compute.internal NotReady <none> 34m v1.15.1

    ip-172-31-5-160.us-west-2.compute.internal NotReady <none> 37m v1.15.1

    ip-172-31-5-3.us-west-2.compute.internal NotReady <none> 35m v1.15.1

    ip-172-31-7-220.us-west-2.compute.internal NotReady master 39m v1.15.1

    [root@ip-172-31-7-220 ec2-user]# kubectl get nodes

    NAME STATUS ROLES AGE VERSION

    ip-172-31-4-67.us-west-2.compute.internal Ready <none> 36m v1.15.1

    ip-172-31-5-160.us-west-2.compute.internal Ready <none> 38m v1.15.1

    ip-172-31-5-3.us-west-2.compute.internal Ready <none> 37m v1.15.1

    ip-172-31-7-220.us-west-2.compute.internal Ready master 40m v1.15.1

    7. (OPTIONAL) Kubernetes also relies on labeling nodes to assign distributed services. Here we assign one node as Druid's master, two as data, and the last as query. Skipping this step will let Kubernetes decide which node to spawn which service.

    # kubectl label nodes ip-172-31-7-220.us-west-2.compute.internal nodeType=master

    # kubectl label nodes ip-172-31-5-160.us-west-2.compute.internal nodeType=data

    # kubectl label nodes ip-172-31-5-3.us-west-2.compute.internal nodeType=data

    # kubectl label nodes ip-172-31-4-67.us-west-2.compute.internal nodeType=query

    8. (OPTIONAL) By default, no pods can be scheduled to spawn on Kubernete's master, which is also recommended in production environment. However, for this lab, we remove this restriction.

    kubectl taint nodes --all node-role.kubernetes.io/master-

    Now we are ready to spawn the Imply on-prem using Kubernetes.

    Install helm on the master node:

    1. Install "home brew" first

    [ec2-user@ip-172-31-7-220 ~]$ sh -c "$(curl -fsSL https://raw.githubusercontent.com/Linuxbrew/install/master/install.sh)"

    ==> Installing Ruby to /home/linuxbrew/.linuxbrew/Homebrew/Library/Homebrew/vendor

    2. Install "Kubernetes-helm"

    [ec2-user@ip-172-31-7-220 ~]$ brew install kubernetes-helm

    3. Initialize helm:

    [ec2-user@ip-172-31-7-220 ec2-user]# helm init --history-max 200

    Configure and install Imply on-prem Manager

    Now we are ready to deploy the Imply Manager on this 4-node cluster.

    1. Add the Imply Manager repo, update using the helm, and verify:

    [ec2-user@ip-172-31-7-220 ec2-user]# helm repo add imply https://static.imply.io/onprem/helm

    [ec2-user@ip-172-31-7-220 ec2-user]# helm repo update

    Hang tight while we grab the latest from your chart repositories...

    ...Skip local chart repository

    ...Successfully got an update from the "imply" chart repository

    ...Successfully got an update from the "stable" chart repository

    Update Complete.

    [ec2-user@ip-172-31-7-220 ~]$ helm repo list

    NAME URL

    stable https://kubernetes-charts.storage.googleapis.com

    local http://127.0.0.1:8879/charts

    imply https://static.imply.io/onprem/helm

    [ec2-user@ip-172-31-7-220 ~]$ helm search imply

    NAME CHART VERSION APP VERSION DESCRIPTION

    imply/imply 0.1.1 3.0.6-PREVIEW Imply Manager (PREVIEW)

    2. create helm secrets, using your Docker login credential

    [ec2-user@ip-172-31-7-220 ~]$ kubectl create secret docker-registry regcred --docker-server=https://index.docker.io/v1/

    --docker-username=<USERNAME> --docker-email=<EMAIL> --docker-password=<PASSWORD>

    secret/regcred created

    3. Download the helm chart from Imply

    [ec2-user@ip-172-31-7-220 ~]$ helm fetch --untar imply/imply

    4. Find the "value.yaml" file under "imply" directory you just download, and modify it as following example

    [ec2-user@ip-172-31-7-220 imply]$ cat values.yaml

    meOverride:

    fullnameOverride:

    images:

    manager:

    repository: imply/onprem-manager

    tag: 3.0.6-PREVIEW

    agent:

    repository: imply/onprem-agent

    tag: 2019-07-30-PREVIEW

    pullPolicy: IfNotPresent

    imagePullSecrets:

    - name: regcred

    manager:

    licenseKey: |

    {"name":"Ming","expiryDate":"3000-01-01","features":["alerts","cross-tab"]}|2019-07-25|<HASH>

    initialClusterName: default

    metadataStore:

    type: mysql

    host: ip-172-31-4-67.us-west-2.compute.internal

    port: 3306

    user: root

    password: hadoop

    database: imply-manager

    resources:

    requests:

    cpu: 200m

    memory: 500M

    nodeSelector:

    nodeType: master

    tolerations: []

    affinity: {}

    annotations: {}

    serviceType: ClusterIP

    port: 9097

    druid:

    metadataStore:

    type: mysql

    host: ip-172-31-4-67.us-west-2.compute.internal

    port: 3306

    user: root

    password: hadoop

    database: imply-k8s

    zk:

    connectString: ip-172-31-4-67.us-west-2.compute.internal,ip-172-31-5-160.us-west-2.compute.internal,ip-172-31-5-3.us-west-2.compute.internal

    basePath: imply

    deepStorage:

    type: hdfs

    baseLocation: "hdfs://ip-172-31-7-220.us-west-2.compute.internal/druid-k8s/segments"

    ingress:

    enabled: false

    annotations: {}

    manager:

    host: manager.mydomain.com

    path: /*

    pivot:

    host: mydomain.com

    path: /*

    router:

    host: api.mydomain.com

    path: /*

    tls: []

    nfsVolumeClaim:

    enabled: false

    mountPath: "/mnt/nfs"

    baseLocationOverride: "/mnt/nfs/druid"

    storageClassName: imply-nfs

    resources:

    requests:

    storage: 8Gi

    master:

    replicaCount: 1

    resources:

    requests:

    cpu: 200m

    memory: 800M

    nodeSelector:

    nodeType: master

    tolerations: []

    affinity: {}

    annotations: {}

    coordinatorPort: 8081

    overlordPort: 8090

    query:

    replicaCount: 1

    resources:

    requests:

    cpu: 300m

    memory: 1200M

    nodeSelector:

    nodeType: query

    tolerations: []

    affinity: {}

    annotations: {}

    serviceType: ClusterIP

    routerPort: 8888

    routerTlsPort: 9088

    pivotPort: 9095

    data:

    replicaCount: 2

    resources:

    requests:

    cpu: 300m

    memory: 1500M

    nodeSelector:

    nodeType: data

    tolerations: []

    affinity: {}

    annotations: {}

    Helm Chart brief explanation :

    - "images: manager: repository:" and "images: agent: repository:" : This chart will download from Imply's two private repos: "imply/onprem-manager" and "imply/onprem-agent"

    - "manager: licenseKey:" : You need to obtain the license key to be able to use Imply Manager

    - "manager:metadataStore:" and "druid:metadataStore" : The external MySQL DB is set on host "ip-172-31-4-67.us-west-2.compute.internal", with username and password, database name provided here, for both Imply Manager, and Druid to use.

    - "manager:nodeSelector" : Kubernetes node that's labeled with "nodeType=master" will host the Imply Manager service

    - "manager:port" : Imply Manager runs on port 9097 by default

    - "druid:zk:" and "druid:deepStorage" : Information of the external zookeeper and deep storage is also provided here

    - "master: replicaCount:" : decides number of pods spawned to run the master processes: Overlord and Coordinator. In our example, we only configure one master service.

    Similar configurations are set for "query" and "data" services as well.

    5. Go to the parent directory, and Deploy Imply with helm:

    [ec2-user@ip-172-31-7-220 ~]$ cd ..

    [ec2-user@ip-172-31-7-220 ~]$ helm install imply

    NAME: rafting-iguana

    LAST DEPLOYED: Wed Aug 21 17:58:42 2019

    NAMESPACE: default

    STATUS: DEPLOYED

    Note helm generated a random instance name for us: "rafting-iguana"

    6. Verify:

    [ec2-user@ip-172-31-7-220 ~]$ helm status rafting-iguana

    LAST DEPLOYED: Wed Aug 21 17:58:42 2019

    NAMESPACE: default

    STATUS: DEPLOYED

    ...

    ==> v1/Pod(related)

    NAME READY STATUS RESTARTS AGE

    rafting-iguana-imply-data-b4d7d6bdf-kwnx2 0/1 Pending 0 12m

    rafting-iguana-imply-data-b4d7d6bdf-m7t2b 0/1 Pending 0 12m

    rafting-iguana-imply-manager-dcb9bc875-qzf9g 0/1 Pending 0 12m

    rafting-iguana-imply-master-54876c5658-fph2t 0/1 Pending 0 12m

    rafting-iguana-imply-query-fd6b7cfd6-gx4dc 0/1 Pending 0 12m

    rafting-iguana-mysql-6dc56fbb8f-krwpl 0/1 Pending 0 12m

    rafting-iguana-nfs-server-provisioner-0 1/1 Running 0 12m

    rafting-iguana-zookeeper-0 0/1 Pending 0 12m

    ==> v1/Secret

    NAME TYPE DATA AGE

    rafting-iguana-mysql Opaque 2 12m

    ==> v1/Service

    NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE

    rafting-iguana-imply-manager ClusterIP 10.107.154.8 9097/TCP,9989/TCP,9998/TCP,9999/TCP 12m

    rafting-iguana-imply-query ClusterIP 10.102.158.76 8888/TCP,9088/TCP,9095/TCP 12m

    rafting-iguana-mysql ClusterIP 10.107.64.253 3306/TCP 12m

    rafting-iguana-nfs-server-provisioner ClusterIP 10.97.196.46 2049/TCP,20048/TCP,51413/TCP,51413/UDP 12m

    rafting-iguana-zookeeper ClusterIP 10.97.210.171 2181/TCP 12m

    rafting-iguana-zookeeper-headless ClusterIP None 2181/TCP,3888/TCP,2888/TCP 12m

    Wait for awhile until all instance status are "Ready"

    [ec2-user@ip-172-31-7-220 ~]$ kubectl get pods

    NAME READY STATUS RESTARTS AGE

    rafting-iguana-imply-data-b4d7d6bdf-kwnx2 1/1 Running 2 29m

    rafting-iguana-imply-data-b4d7d6bdf-m7t2b 1/1 Running 2 29m

    rafting-iguana-imply-manager-dcb9bc875-qzf9g 1/1 Running 0 29m

    rafting-iguana-imply-master-54876c5658-fph2t 1/1 Running 0 29m

    rafting-iguana-imply-query-fd6b7cfd6-gx4dc 1/1 Running 1 29m

    rafting-iguana-mysql-6dc56fbb8f-krwpl 0/1 Pending 0 29m

    rafting-iguana-nfs-server-provisioner-0 1/1 Running 0 29m

    rafting-iguana-zookeeper-0 0/1 Pending 0 29m

    Find details about each kubernetes instances. Make sure they are running on nodes they are assigned to with "nodeType". For example:

    [ec2-user@ip-172-31-7-220 ~]$ kubectl describe pod rafting-iguana-imply-master-54876c5658-fph2t

    Name: rafting-iguana-imply-master-54876c5658-fph2t

    Namespace: default

    Priority: 0

    Node: ip-172-31-7-220.us-west-2.compute.internal/172.31.7.220

    Start Time: Wed, 21 Aug 2019 18:24:21 +0000

    Labels: app.kubernetes.io/instance=rafting-iguana

    app.kubernetes.io/name=imply-master

    pod-template-hash=54876c5658

    Annotations:

    Status: Running

    IP: 10.44.0.1

    ....

    Node-Selectors: nodeType=master

    Access master UIs, using the Kubernetes instance internal IPs. For example, "10.44.0.1:8081" and "10.44.01:8090", and so on.

    Run a sample ingestion task and query task to ensure full functionality of Imply Manager.

    Congratulations. You now have Imply Manager successfully set up on your multi-node cluster.

    Troubleshoot some common issues:

    1. Unable to install Imply with "failed to download" error, even after updating the repo again as suggested

    [ec2-user@ip-172-31-7-220 imply]$ helm install imply

    Error: failed to download "imply" (hint: running `helm repo update` may help)

    [ec2-user@ip-172-31-7-220 imply]$ helm repo update

    Hang tight while we grab the latest from your chart repositories...

    ...Skip local chart repository

    ...Successfully got an update from the "imply" chart repository

    ...Successfully got an update from the "stable" chart repository

    Update Complete.

    [ec2-user@ip-172-31-7-220 imply]$ helm install imply --debug

    [debug] Created tunnel using local port: '44054'

    [debug] SERVER: "127.0.0.1:44054"

    [debug] Original chart version: ""

    Error: Non-absolute URLs should be in form of repo_name/path_to_chart, got: imply

    Solution : Need to execute the "helm install imply" command in the SAME directory as where you run

    helm fetch --untar imply/imply

    2. Getting "no available release name found" error

    [ec2-user@ip-172-31-7-220 ~]$ helm install imply --debug

    [debug] Created tunnel using local port: '38716'

    [debug] SERVER: "127.0.0.1:38716"

    [debug] Original chart version: ""

    [debug] CHART PATH: /home/ec2-user/imply

    Error: no available release name found

    Another symptom is same error messages are found in the system pod "till"

    [ec2-user@ip-172-31-7-220 ~]$ kubectl get pods -n kube-system

    NAME READY STATUS RESTARTS AGE

    .....

    tiller-deploy-65867875cb-g4bsz 1/1 Running 0 26m

    ...

    [ec2-user@ip-172-31-7-220 ~]$ kubectl logs tiller-deploy-65867875cb-g4bsz -n kube-system

    ....

    [storage/driver] 2019/08/12 21:03:13 get: failed to get "winning-albatross.v1": configmaps "winning-albatross.v1" is forbidden: User "system:serviceaccount:kube-system:default" cannot get resource "configmaps" in API group "" in the namespace "kube-system"

    [tiller] 2019/08/12 21:03:13 info: generated name winning-albatross is taken. Searching again.

    [storage] 2019/08/12 21:03:13 getting release "oily-turtle.v1"

    [storage/driver] 2019/08/12 21:03:13 get: failed to get "oily-turtle.v1": configmaps "oily-turtle.v1" is forbidden: User "system:serviceaccount:kube-system:default" cannot get resource "configmaps" in API group "" in the namespace "kube-system"

    [tiller] 2019/08/12 21:03:13 info: generated name oily-turtle is taken. Searching again.

    [storage] 2019/08/12 21:03:13 getting release "plundering-flee.v1"

    [storage/driver] 2019/08/12 21:03:13 get: failed to get "plundering-flee.v1": configmaps "plundering-flee.v1" is forbidden: User "system:serviceaccount:kube-system:default" cannot get resource "configmaps" in API group "" in the namespace "kube-system"

    [tiller] 2019/08/12 21:03:13 info: generated name plundering-flee is taken. Searching again.

    [tiller] 2019/08/12 21:03:13 warning: No available release names found after 5 tries

    [tiller] 2019/08/12 21:03:13 failed install prepare step: no available release name found

    Solution: We need to reset by deleting this "tiller" instance, and creating a new one. Then reset the "helm" as well.

    [ec2-user@ip-172-31-7-220 ~]$ kubectl delete pod tiller-deploy-65867875cb-g4bsz -n kube-system

    pod "tiller-deploy-65867875cb-g4bsz" deleted

    [ec2-user@ip-172-31-7-220 ~]$ kubectl create serviceaccount --namespace kube-system tiller

    serviceaccount/tiller created

    [ec2-user@ip-172-31-7-220 ~]$ kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller

    clusterrolebinding.rbac.authorization.k8s.io/tiller-cluster-rule created

    [ec2-user@ip-172-31-7-220 ~]$ helm init --service-account tiller

    $HELM_HOME has been configured at /home/ec2-user/.helm.

    Warning: Tiller is already installed in the cluster.

    (Use --client-only to suppress this message, or --upgrade to upgrade Tiller to the current version.)

    [ec2-user@ip-172-31-7-220 ~]$ kubectl get pods -n kube-system

    NAME READY STATUS RESTARTS AGE

    ....

    tiller-deploy-65867875cb-lwtbg 1/1 Running 0 51s

    ....

    [ec2-user@ip-172-31-7-220 ~]$ helm reset --force

    Tiller (the Helm server-side component) has been uninstalled from your Kubernetes Cluster.

    Finally, re-initialize helm and tiller, and install Imply again:

    [ec2-user@ip-172-31-7-220 ~]$ helm init --service-account tiller

    $HELM_HOME has been configured at /home/ec2-user/.helm.

    Tiller (the Helm server-side component) has been installed into your Kubernetes Cluster.

    [ec2-user@ip-172-31-7-220 ~]$ helm install imply --debug

    Tear down the Kubernetes cluster:

    [ec2-user@ip-172-31-7-220 on-prem]$ helm ls --all

    NAME REVISION UPDATED STATUS CHART APP VERSION NAMESPACE

    rafting-iguana 1 Mon Aug 12 21:30:17 2019 DEPLOYED imply-0.1.1 3.0.6-PREVIEW default

    [ec2-user@ip-172-31-7-220 on-prem]$ helm delete rafting-iguana

    release "rafting-iguana” deleted

    [ec2-user@ip-172-31-7-220 on-prem]$ kubectl get pods

    No resources found.

    View Article
  • SYMPTOM :In case of ingestion failure from AWS S3 with following error messages inlog:

    2018-02-15T17:18:05,600 INFO [<DATASOURCE_NAME>] io.druid.segment.realtime.appenderator.AppenderatorImpl - Committing metadata[AppenderatorDriverMetadata{activeSegments={<SEGMENT_NAME>_2018-02-02T23:00:00.000Z=[<DATASOURCE_NAME>_2018-02-02T23:00:00.000Z_2018-02-03T00:00:00.000Z_2018-02-15T16:40:08.642Z_1]}, publishPendingSegments={<SEGMENT_NAME>_2018-02-02T23:00:00.000Z=[<DATASOURCE_NAME>_2018-02-02T23:00:00.000Z_2018-02-03T00:00:00.000Z_2018-02-15T16:40:08.642Z_1]}, lastSegmentIds={<SEGMENT_NAME>_2018-02-02T23:00:00.000Z=<DATASOURCE_NAME>_2018-02-02T23:00:00.000Z_2018-02-03T00:00:00.000Z_2018-02-15T16:40:08.642Z_1}, callerMetadata=null}] for sinks[<DATASOURCE_NAME>_2018-02-02T23:00:00.000Z_2018-02-03T00:00:00.000Z_2018-02-15T16:40:08.642Z_1:1].

    2018-02-15T17:18:05,600 INFO [task-runner-0-priority-0] io.druid.segment.realtime.appenderator.AppenderatorDriver - Persisted pending data in 368ms.

    2018-02-15T17:18:05,601 INFO [task-runner-0-priority-0] io.druid.segment.realtime.appenderator.AppenderatorImpl - Shutting down...

    2018-02-15T17:18:05,601 INFO [appenderator_persist_0] io.druid.segment.realtime.appenderator.AppenderatorImpl - Removing sink for segment[<DATASOURCE_NAME>_2018-02-02T23:00:00.000Z_2018-02-03T00:00:00.000Z_2018-02-15T16:40:08.642Z_1].

    2018-02-15T17:18:05,605 ERROR [task-runner-0-priority-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Exception while running task[IndexTask{id=<SEGMENT_NAME>_2018-02-02T23:00:00.000Z, type=index, dataSource=<DATASOURCE_NAME>}]

    java.lang.IllegalStateException: java.net.SocketException: Connection reset

    at org.apache.commons.io.LineIterator.hasNext(LineIterator.java:106) ~[commons-io-2.5.jar:2.5]

    at io.druid.data.input.impl.FileIteratingFirehose.hasMore(FileIteratingFirehose.java:66) ~[druid-api-0.11.0-iap4.jar:0.11.0-iap4]

    at io.druid.indexing.common.task.IndexTask.generateAndPublishSegments(IndexTask.java:628) ~[druid-indexing-service-0.11.0-iap4.jar:0.11.0-iap4]

    at io.druid.indexing.common.task.IndexTask.run(IndexTask.java:233) ~[druid-indexing-service-0.11.0-iap4.jar:0.11.0-iap4]

    at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:436) [druid-indexing-service-0.11.0-iap4.jar:0.11.0-iap4]

    at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:408) [druid-indexing-service-0.11.0-iap4.jar:0.11.0-iap4]

    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_152]

    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_152]

    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_152]

    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_152]

    Caused by: java.net.SocketException: Connection reset

    at java.net.SocketInputStream.read(SocketInputStream.java:210) ~[?:1.8.0_152]

    at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_152]

    at sun.security.ssl.InputRecord.readFully(InputRecord.java:465) ~[?:1.8.0_152]

    at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:593) ~[?:1.8.0_152]

    at sun.security.ssl.InputRecord.read(InputRecord.java:532) ~[?:1.8.0_152]

    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:983) ~[?:1.8.0_152]

    at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:940) ~[?:1.8.0_152]

    at sun.security.ssl.AppInputStream.read(AppInputStream.java:105) ~[?:1.8.0_152]

    at org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:198) ~[httpcore-4.4.3.jar:4.4.3]

    at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178) ~[httpcore-4.4.3.jar:4.4.3]

    at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137) ~[httpclient-4.5.1.jar:4.5.1]

    at org.jets3t.service.io.InterruptableInputStream.read(InterruptableInputStream.java:78) ~[jets3t-0.9.4.jar:0.9.4]

    at org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.read(HttpMethodReleaseInputStream.java:146) ~[jets3t-0.9.4.jar:0.9.4]

    at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284) ~[?:1.8.0_152]

    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326) ~[?:1.8.0_152]

    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) ~[?:1.8.0_152]

    at java.io.InputStreamReader.read(InputStreamReader.java:184) ~[?:1.8.0_152]

    at java.io.BufferedReader.fill(BufferedReader.java:161) ~[?:1.8.0_152]

    at java.io.BufferedReader.readLine(BufferedReader.java:324) ~[?:1.8.0_152]

    at java.io.BufferedReader.readLine(BufferedReader.java:389) ~[?:1.8.0_152]

    at org.apache.commons.io.LineIterator.hasNext(LineIterator.java:95) ~[commons-io-2.5.jar:2.5]

    ... 9 more

    2018-02-15T17:18:05,610 INFO [task-runner-0-priority-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [<SEGMENT_NAME>_2018-02-02T23:00:00.000Z] status changed to [FAILED].

    ROOT CAUSE :the error means that the connection was reset while reading inputs from s3. This is a quite common, but transient error in s3. So, usually Druid can retry on this kind of transient errors. But, currently, to turn on the retrying, prefetching should also be enabled. Prefetching is a feature that Druid can prefetch input files from s3 to local disk to make ingestion get faster.

    RESOLUTION : To enable prefetching, please set some configurations like below or leave them as null to use their default values in your ioConfig. Please see http://druid.io/docs/latest/development/extensions-core/s3.html#statics3firehose for more details.

    "ioConfig" : {

    "type" : "index",

    "firehose" : {

    "type" : "static-s3",

    "prefixes" : [ "s3://path/to/data/2018/02/02/23/" ],

    "maxCacheCapacityBytes" : 1073741824,

    "maxFetchCapacityBytes" : 1073741824,

    "prefetchTriggerBytes" : 536870912,

    "fetchTimeout" : 200000,

    "maxFetchRetry" : 3

    },

    "appendToExisting" : false

    },

    In our next release, retrying will be available even if prefetching is disabled.

    View Article
  • Hadoop Setup

    Following are the configurations files required to be copied over to Druid conf folders:

    For HDFS as a deep storage, hdfs-site.xml, core-site.xml

    For ingestion, mapred-site.xml, yarn-site.xml

    HDFS Folders and permissions

    Choose any folder name for the druid deep storage, for example 'druid'

    Create the folder in hdfs under the required parent folder. For example, hdfs dfs -mkdir /druid OR hdfs dfs -mkdir /apps/druid

    Give druid processes appropriate permissions for the druid processes to access this folder. This would ensure that druid is able to create necessary folders like data and indexing_log in HDFS. For example, if druid processes run as user 'root', then

    hdfs dfs -chown root:root /apps/druid

    OR

    hdfs dfs -chmod 777 /apps/druid

    Druid creates necessary sub-folders to store data and index under this newly created folder.

    Druid Setup

    Edit common.runtime.properties at conf/druid/_common/common.runtime.properties to include the HDFS properties. Folders used for the location are same as the ones used for example above.

    common_runtime_properties

    # Deep storage

    #

    # For HDFS:

    druid.storage.type=hdfs

    druid.storage.storageDirectory=/druid/segments

    # OR

    # druid.storage.storageDirectory=/apps/druid/segments

    #

    # Indexing service logs

    #

    # For HDFS:

    druid.indexer.logs.type=hdfs

    druid.indexer.logs.directory=/druid/indexing-logs

    # OR

    # druid.storage.storageDirectory=/apps/druid/indexing-logs

    Note: Comment out Local storage and S3 Storage parameters in the file

    Also include hdfs-storage core extension to conf/druid/_common/common.runtime.

    #

    # Extensions

    #

    druid.extensions.directory=dist/druid/extensions

    druid.extensions.hadoopDependenciesDir=dist/druid/hadoop-dependencies

    druid.extensions.loadList=["druid-parser-route", "mysql-metadata-storage", "druid-hdfs-storage", "druid-kerberos"]

    Hadoop Jars

    Ensure that Druid has necessary jars to support the Hadoop version.

    Find the hadoop version using command, hadoop version

    In case there are other softwares used with hadoop, like WanDisco, ensure that

    the necessary libraries are available

    add the requisite extensions to druid.extensions.loadlist in conf/druid/_common/common.runtime.properties

    Kerberos setup

    Create a headless keytab which would have access to the druid data and index.

    Edit conf/druid/_common/common.runtime.properties and add the following properties:

    druid.hadoop.security.kerberos.principal

    druid.hadoop.security.kerberos.keytab

    For example

    [email protected]

    druid.hadoop.security.kerberos.keytab=/etc/security/keytabs/hdfs.headless.keytab

    Restart Druid Services

    With the above changes, restart Druid. This would ensure that Druid works with Kerberized Hadoop

    View Article
  • Introduction

    In a typical Imply cluster, we have:

    Master services (Overlord, Coordinator)

    Data services (MiddleManager, Historical)

    Query services (Broker, Imply UI)

    Zookeeper servers

    Deep Storage (S3, HDFS, NFS etc)

    Metadata DB (MySQL, PostgresSQL etc) for Druid and Pivot if Imply is used

    Migrating the cluster is simple since we only need to move (or maintain) data from Deep Storage and Metadata DB (Druid / Pivot). The other services contain only transient data, and they are repopulated anyway when the new cluster is started.

    Following are the possible migration plans:

    Migration where there is no change in Deep Storage and the Metadata DB

    Migration where there is change in Deep Storage / Metadata DB

    Migration from Apache Druid to Imply

    Multiple scenario's are presented in the rest of the article.

    Scenario 1 : Old and new clusters share same Deep Storage and DB servers

    Druid services need to be rebooted.

    Steps:

    Configure the new clusters with the same Deep Storage path and DB server address as the old

    Stop the old cluster

    Make sure the old cluster is COMPLETELY down before starting new cluster, or Druid data can be corrupted

    Start the new cluster

    Scenario 2 :

    Old and new clusters use different Deep Storage and DB servers.

    This scenario would apply if you are using Imply Cloud and

    Old deep storage is accessible from new imply cluster

    Old metadata DB and new metadata DB is MySQL

    New deep storage isclean with no data

    New metadata DB is brand new and clean with no data

    Steps:

    Configure the new clusters with different Deep Storage path and DB server address

    Copy tables "druid_config, druid_datasource, druid_supervisors, druid_segments" from old DB to new DBby following the next step

    mysqldump on the source metadata DB ( and Pivot if used)and then import the mysqldump into new target metadata DB. Output file is given .sql extension because it contains sql commands. The -p command in mysqlimport asks for a password

    mysqldump -h <host_name> -u <user_name> -p --single-transaction --skip-add-drop-table --no-create-info --no-create-db <db_name> druid_config druid_dataSource druid_supervisors druid_segments > output_file.sql

    Start the new cluster

    Coordinator automatically starts reading the old segments' metadata in the new DB, andthen historical nodes load them from the old Deep Storage. Note that the data in old Deep Storage is kept intact.

    Old cluster will keep writing to the old metadata DB and old Deep Storage

    New cluster will write to the new metadata DB and new Deep Storage

    Note #1:Once you do this migration, the old and new cluster will share the samedata segment files in deep storage for any data ingested before the migration. (Data ingested after the migration will go to different files.) It is important toavoid running kill tasks(permanent data deletion) on datasources that may have segments between two clusters, because it will cause the clusters to delete each others' data.

    Note #2: If the new Druid cluster shares the same ZooKeeper quorum as the old, it must use a different base znode path, by configuring the property druid.zk.paths.base in Druid's common.runtime.properties to a different name, such as /druid-newcluster. The default value is /druid.

    Scenario 3 :

    Old and new clusters use different Deep Storage and DB servers. This scenario would apply if you are using Imply Cloud and

    Old deep storage is accessible from new imply cluster

    Old metadata DB and new metadata DB is MySQL

    New deep storage has somedata in it

    New metadata DB has some data in it

    Steps:

    Make sure there are no collisions in the paths between old deep storage and new deep storage

    If there are collisions, change the path of source deep storage to something else and give the same paths when you modify source mysqldump file of source metadata DB

    Copy the data from old deep storage to new deep storage

    configure the new clusters with different Deep Storage path and DB server address.

    mysqldump on the source metadata DB (and excluding the DDL

    mysqldump -h <host_name> -u <user_name> -p --single-transaction --skip-add-drop-table --no-create-info --no-create-db <db_name> druid_config druid_dataSource druid_supervisors druid_segments > source_output_file.sql

    We have to be careful in this scenario NOT to overwrite the target metadata DB

    Change the location of the segments in druid_segments table in the above source mysqldump file to point to new deep storage location

    sed -i .bak 's/\\"bucket\\":\\"<old_segment_name>\\"/\\"bucket\\":\\"<new_segment_name>\\"/' /dir/source_output_file.sql

    copy tables "druid_config, druid_datasource, druid_supervisors, druid_segments" from old metadata DB to new metadata DB by following the next step

    import the above modified source mysqldump file into new target metadata DB

    mysql -h <host_name> -u <user_name> -p <db_name> < /dir/source_output_file.sql

    Old cluster will keep writing to the old metadata DB and old Deep Storage

    New cluster will write to the new DB and new Deep Storage

    Scenario 4 :

    Old and new clusters use different Deep Storage and DB servers. This scenario would apply if you are using Imply Cloud and

    Old deep storage is NOT accessible from new imply cluster

    Old metadata DB and new metadata DB is MySQL

    New deep storage is clean with nodata in it

    New metadata DB is brand new and has no data in it

    Steps:

    Copy the data from old deep storage to new deep storage(may be using somestaging area as an intermediate location)

    configure the new clusters with different Deep Storage path and DB server address.

    mysqldump on the source metadata DB

    mysqldump -h <host_name> -u <user_name> -p --single-transaction --skip-add-drop-table --no-create-info --no-create-db <db_name> druid_config druid_dataSource druid_supervisors druid_segments > output_file.sql

    Change the location of the segments in druid_segments table in the above mysqldump file to point to new deep storage location

    sed -i .bak 's/\\"bucket\\":\\"<old_bucket_name>\\"/\\"bucket\\":\\"<new_bucket_name>\\"/' /tmp/output_file.sql

    Copy tables "druid_config, druid_datasource, druid_supervisors, druid_rules, druid_segments" from old DB to new DB by following the next step

    import the above modified source mysqldump file into new target metadata DB

    mysql -h <host_name> -u <user_name> -p <db_name> < /dir/output_file.sql

    Drop druid_rules table from target mysql. This will be re-created once the cluster is started

    Start the new cluster.

    Coordinator automatically starts reading the old segments' metadata in the new DB, andthen historical nodes load them from the NEW Deep Storage. Note that the data in old Deep Storage is kept intact.

    Old cluster will keep writing to the old DB and old Deep Storage.

    New cluster will write to the new DB and new Deep Storage.

    Scenario 5 :

    Old and new clusters use different Deep Storage and DB servers. This scenario would apply if you are using Imply Cloud and

    Old deep storage is NOT accessible from new imply cluster

    Old metadata DB and new metadata DB is MySQL

    New deep storage has somedata in it

    New metadata DB has some data in it

    Steps:

    Make sure there are no collisions in the paths between old deep storage and new deep storage

    If there are collisions, change the path of source deep storage to something else and give the same paths when you modify source mysqldump file of source metadata DB

    Copy the data from old deep storage to new deep storage(may be using somestaging area as an intermediate location)

    configure the new clusters with different Deep Storage path and DB server address

    mysqldump on the source metadata DB excluding the DDL

    mysqldump -h <host_name> -u <user_name> -p --skip-add-drop-table --no-create-info --no-create-db <db_name> druid_config druid_dataSource druid_supervisors druid_segments > source_output_file.sql

    We have to be careful in this scenario NOT to overwrite the target metadata DB

    Change the location of the segments in druid_segments table in the above source mysqldump file to point to new deep storage location

    sed -i .bak 's/\\"bucket\\":\\"<old_segment_name>\\"/\\"bucket\\":\\"<new_segment_name>\\"/' /dir/source_output_file.sql

    Copy tables "druid_config, druid_datasource, druid_supervisors, druid_segments" from old metadata DB to new metadata DB by following the next step.

    import the above modified source mysqldump file into new target metadata DB.

    mysql -h <host_name> -u <user_name> -p <db_name> < /dir/source_output_file.sql

    Start the new cluster.

    Coordinator automatically starts reading the old segments' metadata in the new DB, andthen historical nodes load them from the NEW Deep Storage. Note that the data in old Deep Storage is kept intact

    Old cluster will keep writing to the old metadata DB and old Deep Storage

    New cluster will write to the new DB and new Deep Storage

    Scenario 6:

    Apache Druid to Imply

    All of the above scenarios can be considered for migrating from Apache Druid to Imply.

    Guidelines

    Review Imply Release notes for changes in the new version

    Upgrade with Cluster restarts. Rolling upgrade is a possibility, but not preferred

    Move to an Apache Druid version matching with that of Imply's Druid Version before the migration. This is a required condition if a rolling upgrade needs to be performed

    Use the same configuration parameters for the services as the Apache Druid versions configuration after reviewing release notes for parameter changes

    if the `druid.segmentCache.locations` are changed, copy over the segment cache from the existing Apache Druid cluster before restarting services using the new Imply version

    While starting the cluster in Imply Version, following steps are preferable when the Segment Count is very high:

    Start all the Historicals and await Lifecyle to be started

    Start all the Master Services (Co-ordinator / Overlord) and await Lifecycle to be started

    Start Query Node services (Broker / Router) and await Lifecyle to be started

    Start Middlemanager service and await Lifecycle to be started

    Resume Supervisors / tasks

    View Article
  • Druid's overlord console provides an overview of current tasks, pending tasks, waiting tasks, and tasks that have completed in the last 24 hours.

    Druid's coordinator console provides an overview of datasources, segments, data size per segment, and cluster usage.

    To configure the ability to see these consoles in your browser, follow these steps:

    1. Log into your AWS account that is associated with Imply Cloud

    2. In the Imply UI, go to the API view under Manage / Clusters:

    3. Under AWS Consoles, clicked on "Unmanaged". This should take you to your AWS Console.

    4. In the AWS Console, create rules to allow your public IP address to access the overlord and coordinator consoles:

    5. Back in the API view in the Imply UI, note the Druid username and password:

    6. Copy the link from the top of the API view, and you're done!

    OR Navigate to the Servers page:

    using the EXTERNAL IP of a master node, access one of the consoles by entering this address + port in a new browser tab:

    overlord console: https://<EXT-IP-MASTER-NODE>:8290

    coordinator console: https://<EXT-IP-MASTER-NODE>:8281

    Druid will change leadership among master nodes, so if the first IP times out, try another master IP until prompted for the username and password from #5 above.

    View Article
  • This can be done by adding custom files to cluster specification - The init script looks for a specially named custom file called user-init, the file has to be called that, and it has to be marked as 'is executable'[Please refer the screenshots].

    the user-init file should be a script (should start with something like #!/bin/bash or similar) and it'll run after the rest of our init scripts finish and you can use that to configure whatever custom service you want to configure.

    You can also use the custom file mechanism to upload additional files for SignalFX - you could use 'Location Type' Default and the files would be placed at /opt/imply/user and you can reference them there.

    Example: I am showing you an example where a user wants to install `SignalFx agent` on each imply managed EC2 instance

    Step1 :

    here, I have created a `user-init` file and placed it to into `s3-bucket`

    eg:

    Location:

    s3://imply-signalfx/user-init

    The user-init file has below content:

    #!/bin/bash

    curl -sSL https://dl.signalfx.com/signalfx-agent.sh > /tmp/signalfx-agent.sh

    sudo sh /tmp/signalfx-agent.sh <SOME_ID> --realm us0

    Step 2:

    Login to imply cloud manager UI > Select the Cluster > Click on `manage` > click on `setup` >navigate to `Advance Config' and expand it > Now click on to ` Add custom file`

    https://`,

    Step 3:

    When you Add the `custom file` it will ask to you provide below details :

    a)Custome File Path: The file path must start with `s3://, http://,

    In this example, the path should be :

    s3://imply-signalfx/user-init

    b)Location Type: I am using the default location, in this case, the init file will be placed inside this directory :

    /opt/imply/user/

    c)Extra Properties: In this case, it should be:

    Is executable

    Example:

    Once you are done with the above configuration, you need to update the cluster!

    What is the impact on the running cluster?

    As It requires a cluster update and the instances will be recreated so if your cluster is HA and you do a rolling update, there should not be any downtime, otherwise, there will be downtime. If there is an error in the script, you should be able to roll back the update.

    Suggestion to be on the safe side :

    I will suggest to spin up a TEST cluster and try this first before doing it in PRODUCTION.

    View Article
  • Here are some handpicked options for hardening Druid system when dealing with demanding scenario like group by on large datasets which may return a very large query response. This article also provides recommendation guideline for setting these parameters.

    Querying a large dataset

    Druid provides many tunable parameters at broker and query level. Following are list of advanced configurations currently supported by Druid.

    Property

    Description

    Default

    Recommendation

    druid.server.http.defaultQueryTimeout

    Query timeout in millis, beyond which unfinished queries will be cancelled. 0 timeout meansno timeout. To set the default timeout, see Broker configuration

    300000 milli

    druid.broker.http.readTimeout

    The timeout for data reads from Historical servers and real-time tasks.

    PT15M

    leave it alone at first, but you may need to raise it if TTFB (Time to First Byte) is high

    druid.router.http.readTimeout

    timeout of inflight query responses. Router will terminate running queries if they run longer than this duration.

    PT8M

    Make sure this duration is enough for longest running query.

    Handle partial query response at client side since http 200 response is already sent.

    druid.broker.http.maxQueuedBytes

    Maximum number of bytes queued per query before exerting backpressure on the channel to the data server. Similar tomaxScatterGatherBytes, except unlike that configuration, this one will trigger backpressure rather than query failure. Zero means disabled.

    0 (disabled)

    Always set this parameter on broker. Try 10M (10000000) if you have <100 historicals; consider higher #'s if you have more historicals.

    Make sure it fits within your heap

    this setting isper queryand so usage could be up to druid.server.http.numThreads * druid.broker.http.maxQueuedBytes)

    druid.processing.buffer.sizeBytes

    This specifies a buffer size for the storage of intermediate results. The computation engine in both the Historical and Realtime processes will use a scratch buffer of this size to do all of their intermediate computations off-heap. Larger values allow for more aggregations in a single pass over the data while smaller values can require more passes depending on the query that is being executed.

    auto (max 1GB)

    leave it alone at first, but you may consider raising it if you have spare memory

    druid.query.groupBy.maxMergingDictionarySize

    Maximum amount of heap space (approximately) to use for the string dictionary during merging. When the dictionary exceeds this size, a spill to disk will be triggered.

    100000000

    druid.query.groupBy.maxOnDiskStorage

    Amount of space on disk used for aggregation, per query, in bytes. By default, this is 0, which means aggregation will not use disk.

    0 (disabled)

    always set this, on historicals.

    Try 10GB at first if you have enough disk space, raise if needed.

    This setting isper queryhence actual usage could be up todruid.server.http.numThreads * druid.query.groupBy.maxOnDiskStorage

    Note: Please check your Druid version for support of these advanced options.

    Downloading large dataset

    Apply HTTP Compression

    Apache Druid (incubating) supports http request decompression and response compression, to use this,Activate by settingtheAccept-Encoding: gzipheader on your client requests.

    This does not require server configuration.

    Recommendation:consider this if network is more of a bottleneck than CPU (this trades off CPU for network). Not recommended for smaller result-sets since this option increases CPU usage on broker and on client.

    Detect Truncated Responses

    Truncated response can be detected by validating returned data. Each format type provides a way to detect if data is truncated. Let's see what each response type works in this scenario. Some of the response formats are better than others but it all depends on what implementation works with external systems.

    Always set resultFormatif using SQL queries.

    Query Type

    Parameter

    Value

    Content Type

    Format Type

    Detection

    Recommendation

    SQL

    resultFormat

    array

    application/json

    JSON array of JSON arrays

    Validate Response JSON

    Preferred for large result sets

    SQL

    resultFormat

    arrayLines

    text/plain

    Like "array", but the JSON arrays are separated by newlines instead of being wrapped in a JSON array. This can make it easier to parse the entire response set as a stream, if you do not have ready access to a streaming JSON parser.

    Presence of one blank line in the end

    Preferred for large result sets

    make sure to implement a truncation checker

    Native Druid

    resultAsArray

    true

    application/json

    groupBy v2 queries now use an array-based representation of result rows, rather than the map-based representation used by prior versions of Druid. This provides faster generation and processing of result sets. Out of the box, this change is backwards-compatible.

    Validate Response JSON

    Always set this parameter.

    Only supported in 3.1+

    Documentation link resultsAsArray

    Other Formats

    Druid supports additional formats which are excluded from this KB article due to their relevance. These formats are:

    resultFormat=[object, objectLines, csv] for SQL queries are excluded.

    resultFormat=[list, compactedList] for Druid Native queries only apply for scan queries and not for GROUP BYhence excluded.

    Additional Recommendations

    Try avoiding "ORDER BY" if at all humanly possible.

    Colocate clients to reduce network latency.

    Avoid server going down.

    Implement retry in case of error 500 or truncated response.

    Further Reading

    https://druid.apache.org/docs/latest/operations/http-compression.html

    https://druid.apache.org/docs/latest/configuration/realtime.html

    https://druid.apache.org/docs/latest/querying/groupbyquery.html#query-context

    https://druid.apache.org/docs/latest/querying/sql

    https://docs.imply.io/3.1/on-prem/misc/release

    View Article
  • Once configuration has been ruled out as the likely cause of slowness and you have a repeatably slow query on your hands, a useful next step is to perform some light profiling to capture a recording of your Druid Broker or Historical (or any other java) processes for analysis.

    There are three steps:

    Download Swiss Java Knife (SJK) onto the servers that host the Druid processes you'd like to analyze. This will generally be a Broker or Historical, or perhaps an indexing task.

    Use SJK to capture stack dumps.

    Use SJK to generate flame graphs from the stack dumps.

    Download SJK

    For this, we will use Swiss Java Knife, in the form of a .jar that can be downloaded from maven here.

    Capture stack dumps

    To prepare the recording, you will need to copy 'sjk-{version}.jar` to the Druid servers you wish to collect profiling data from. Once this is complete, the next step is to find the process ids of the process we wish to sample by running

    jcmd

    to print all running java processes. After this, we sample the process using the 'sjk-{version}.jar' that we copied to the Druid servers:

    java -jar sjk-{version}.jar stcap -p {pid} -o druid-{pid}.std -t 30s

    This will generate a file containing sampled thread dumps, specified by the argument '-o historical-{pid}.std. Adjust the '-t 30s' to a value that is appropriate for the length of the query you are recording.

    We recommend disabling cache (by setting "useCache" to "false" in your query context) and running your query at least 3 times, or repeatedly for 30 seconds, whichever is longer. This ensures enough data is collected for analysis.

    Generate flame graphs

    Next up, we can create a flame graph from the output:

    java -jar sjk.jar flame -f druid-{pid}.std -o druid-{pid}-report.html

    Which will generate an html page that contains something like:

    an interactive flame graph that can be used to help determine what Druid was doing while processing the query or queries.

    When reading flame graphs, it's important to look at query-related threads.There are two major types of threads involved in queries: processing threads, and HTTP server threads. When a query is ongoing, both types of threads will change their names to note the query type and datasource being queried, making them easier to find.

    The tool has a thread filter widget in the top-right which case be used to filter threads by name. To filter in on certain threads, first click "none" to clear the list of selected threads, then type in a filter, and finally either select individual threads or click "all" to select all threads matching the filter.

    View Article
  • Tableau 2018.3 introduced support for JDBC connections, which you can use to connect to Druid. This article documents the technique for Imply 2.8+, but it also works for Apache Druid 0.14.0+.

    Please note that not all Tableau functions have been fully tested and verified.

    To do this, you will need:

    Tableau 2018.3 or later (Windows or Mac).

    Imply 2.8.0 or later.

    The AvaticaJDBC driver, which you can get from https://static.imply.io/support/avatica-1.12.0.jar. Druid uses an Avatica JDBC server.

    Once you have this software installed, follow these steps:

    Place the "avatica-1.12.0.jar" file in Tableau's Drivers folder. On Windows this isC:\Program Files\Tableau\Drivers. On Mac it is ~/Library/Tableau/Drivers.

    Launch Tableau and choose the "Other databases (JDBC)" connection option.

    Enter the URL "jdbc:avatica:remote:url=http://DRUID-QUERY-SERVER:8082/druid/v2/sql/avatica/" (replacingDRUID-QUERY-SERVER with the hostname or IP address of your Druid query broker).

    Select thedialect "SQL92".

    Enter a username and password if you have enabled authentication on your Druid cluster. Otherwise, leave them blank.

    You can set connection context parameters through a properties file, if you like. If you do not need any special context parameters, leave the properties section blank.

    Click "Sign In".

    Choose database "druid", schema "druid", and then drag the table you wish to query up to the tables area and click "Sheet 1" to open a worksheet.Joins are not supported, so only use one table at a time.

    Setting up a TLS truststore

    Some extra steps are necessary to connect to a TLS-enabled Druid cluster where you're using your own CA. This includes Imply Cloud clusters (where Imply configures TLS out of the box, with a cluster-specific CA) or an on-premise Druid cluster with TLS enabled.

    Save a copy of the CA to your local machine. If you are using Imply Cloud you can download it from the "API" tab.

    Use keytool to create a truststore: keytool -import -alias druid -file downloaded_crt_file.crt -storetype JKS -keystore truststore.jks

    Use a JDBC URL with the truststore parameters set, such asjdbc:avatica:remote:url=https://DRUID-QUERY-SERVER:9088/druid/v2/sql/avatica/;truststore=/path/to/truststore.jks;truststore_password=changeit;

    View Article
  • Query logging (request logging) will capture the queries made and users who made them in the broker log. To enable:

    1. Add the following to the imply/conf/druid/_common/common.runtime.properties :

    druid.request.logging.type=slf4j

    2. Restart your Imply processes.

    3. In the imply/var/sv/druid/broker.log you will see queries logged in JSON format with the user id.

    Example output:

    2018-11-07T16:14:15,378 INFO [DruidSchema-Cache-0] io.druid.sql.calcite.schema.DruidSchema - Refreshed metadata for dataSource[wikipedia] in 447 ms (1 segments queried, 0 segments left).

    2018-11-07T16:14:26,574 INFO [qtp404994500-119] io.druid.server.log.LoggingRequestLogger - 2018-11-07T16:14:26.303Z 0:0:0:0:0:0:0:1 {"queryType":"topN","dataSource":{"type":"table","name":"wikipedia"},"virtualColumns":[],"dimension":{"type":"default","dimension":"page","outputName":"page","outputType":"STRING"},"metric":{"type":"LegacyTopNMetricSpec","metric":"sum_deleted"},"threshold":100,"intervals":{"type":"LegacySegmentSpec","intervals":["2016-06-26T21:32:00.000Z/2016-06-27T21:32:00.000Z"]},"filter":null,"granularity":{"type":"all"},"aggregations":[{"type":"doubleSum","name":"sum_deleted","fieldName":"deleted","expression":null}],"postAggregations":[],"context":{"implyDataCube":"wikipedia","implyFeature":"visualization","implyUser":"user","implyView":"wikipedia","implyViewTitle":"Wikipedia","priority":1,"queryId":"9fee261c-dd0d-4949-8376-09661b762841","timeout":40000},"descending":false} {"query/time":265,"query/bytes":7761,"success":true,"identity":"allowAll"}

    References:

    https://druid.apache.org/docs/latest/configuration/index.html#request-logging

    View Article
  • Druid Version: 0.12.3

    Note: Foreign Key relationships displayed are logical and for display purposes only. There are no foreign key constraints created by druid.

    View Article
  • Following steps need to be followed if MapR Hadoop is deep storage:

    Install and configure MapR client

    You need to install the MapR client that matches your MapR Hadoop installation on all middle manager/historical nodes.

    . https://mapr.com/docs/52/AdvancedInstallation/SettingUptheClient-install-mapr-client.html

    https://mapr.com/docs/61/MapR-DB/Installing-mapr-client.html

    Run the configure.sh command for the MapR client

    Un-secure cluster example...

    In the following examples, the -N parameter specifies the cluster name, the-c(lowercase) option specifies a client configuration, the-C(uppercase) parameter specifies the CLDB nodes, and -HS specifies the HistoryServer node:

    Example:

    sudo /opt/mapr/server/configure.sh -Nmy.cluster.com-c -C mynode01:7222 -HS nodeA

    jar changes

    Move imply/dist/druid/extensions/druid-hdfs-storage to a backup location

    Copy all mapr jars from /opt/mapr/lib to new directory imply/dist/druid/extensions/druid-hdfs-storage

    Copy druid-hdfs-storage-0.14.0-incubating-iap4.jar from the backup directory into the new druid-hdfs-storage directory.

    rm dist/druid/extensions/druid-hdfs-storage/jackson-*.jar

    rm dist/druid/extensions/druid-hdfs-storage/guava-14.0.1.jar

    rm dist/druid/extensions/druid-hdfs-storage/joda-time-2.0.jar (if it exists)

    Common.runtime.properties (modifications)

    druid-hdfs-storage should remain as an extension in thecommon.runtime.properties

    common.runtime.properties needs to include

    druid.storage.type=hdfs

    druid.storage.storageDirectory=maprfs:///data/default-rack

    In your common.runtime.properties comment out following parameter ###druid.extensions.hadoopDependenciesDir=dist/druid/hadoop-dependencies

    In your common.runtime.properties set druid.extensions.useExtensionClassloaderFirst=true

    This can be placed after your druid.extension.loadList

    Verification

    Verify connectivity from druid middle manager to maprfs hadoop cluster by running the following (you should get a list of directory contents). If this does not work most likely information was entered wrong for MapR client when you ran configure.sh.

    hadoop fs -ls /

    Last step

    Restart all Imply processes on your middlemanagers/historical nodes

    Reference

    MapR Documentation

    View Article
  • This is to outline steps involved in ingesting data in batch mode using Qubole Hadoop into Imply cloud. Integrating Qubole <> Imply & Druid is just a few clicks away.

    Instructions:

    All Imply clusters use java 8 by default at the time of this authoring this document. Please make sure your Qubole cluster uses java 8 as well. If using 7, you can change to 8 using the following bootstrap script:

    Bootstrap script:

    #!/bin/bash

    source /usr/lib/hustler/bin/qubole-bash-lib.sh

    export PROFILE_FILE=${PROFILE_FILE:-/etc/profile}

    export HADOOP_ETC_DIR=${HADOOP_ETC_DIR:-/usr/lib/hadoop2/etc/hadoop}

    function restart_master_services() {

    monit unmonitor namenode

    monit unmonitor timelineserver

    monit unmonitor historyserver

    monit unmonitor resourcemanager

    /bin/su -s /bin/bash -c '/usr/lib/hadoop2/sbin/yarn-daemon.sh stop timelineserver' yarn

    /bin/su -s /bin/bash -c 'HADOOP_LIBEXEC_DIR=/usr/lib/hadoop2/libexec /usr/lib/hadoop2/sbin/mr-jobhistory-daemon.sh stop historyserver' mapred

    /bin/su -s /bin/bash -c '/usr/lib/hadoop2/sbin/yarn-daemon.sh stop resourcemanager' yarn

    /bin/su -s /bin/bash -c '/usr/lib/hadoop2/sbin/hadoop-daemon.sh stop namenode' hdfs

    /bin/su -s /bin/bash -c '/usr/lib/hadoop2/sbin/hadoop-daemon.sh start namenode' hdfs

    /bin/su -s /bin/bash -c '/usr/lib/hadoop2/sbin/yarn-daemon.sh start resourcemanager' yarn

    /bin/su -s /bin/bash -c 'HADOOP_LIBEXEC_DIR=/usr/lib/hadoop2/libexec /usr/lib/hadoop2/sbin/mr-jobhistory-daemon.sh start historyserver' mapred

    /bin/su -s /bin/bash -c '/usr/lib/hadoop2/sbin/yarn-daemon.sh start timelineserver' yarn

    sudo /usr/lib/zeppelin/bin/zeppelin-daemon.sh stop # as root user

    sudo /usr/lib/zeppelin/bin/zeppelin-daemon.sh start # as root user

    monit monitor namenode

    monit monitor resourcemanager

    monit monitor historyserver

    monit monitor timelineserver

    }

    function restart_worker_services() {

    monit unmonitor datanode

    /bin/su -s /bin/bash -c '/usr/lib/hadoop2/sbin/hadoop-daemon.sh stop datanode' hdfs

    /bin/su -s /bin/bash -c '/usr/lib/hadoop2/sbin/hadoop-daemon.sh start datanode' hdfs

    monit monitor datanode

    # No need to restart nodemanager since it starts only

    # after thhe bootstrap is finished

    }

    function use_java8() {

    export JAVA_HOME=/usr/lib/jvm/java-1.8.0

    export PATH=$JAVA_HOME/bin:$PATH

    echo "export JAVA_HOME=/usr/lib/jvm/java-1.8.0" >> "$PROFILE_FILE"

    echo "export PATH=$JAVA_HOME/bin:$PATH" >> "$PROFILE_FILE"

    sudo echo "export JAVA_HOME=/usr/lib/jvm/java-1.8.0_60" >> /usr/lib/zeppelin/conf/zeppelin-env.sh

    sed -i 's/java-1.7.0/java-1.8.0/' "$HADOOP_ETC_DIR/hadoop-env.sh"

    rm -rf /usr/bin/java

    ln -s $JAVA_HOME/bin/java /usr/bin/java

    is_master=$(nodeinfo is_master)

    if [[ "$is_master" == "1" ]]; then

    restart_master_services

    else

    restart_worker_services

    fi

    }

    use_java8

    Restart the cluster to reflect the Java 8.

    2.Get the master node IP address/hostname from Qubole cluster. You can find it from here: https://us.qubole.com/clusters#/details/<cluster_id>/instances?tab=Nodes. Please make sure to replace <cluster_id> with your cluster id.

    Sample Screenshot:

    Imply Cloud Documents

    You can use private IP and make sure that the traffic can be reached from Qubole cluster to Imply cluster. Please contact your network administrator for help.

    3.You then have to tell Imply middle manager to talk to Hadoop master for data processing. Go to Imply clusters page, select the desired cluster, click on Manage, click on setup, scroll down and click on Advanced Config.

    Find Middle Manger advanced configuration box and then add the following entries related to your Qubole cluster:

    hadoop.fs.defaultFS=hdfs://<Qubole_cluster_master_node_IP>:9000

    hadoop.yarn.resourcemanager.hostname=<Qubole_cluster_master_node_IP>

    hadoop.yarn.application.classpath=/usr/lib/hadoop2/etc/hadoop,/usr/lib/hadoop2/*,/usr/lib/hadoop2/lib/*,/usr/lib/hadoop2/share/hadoop/common/*,/usr/lib/hadoop2/share/hadoop/common/lib/*,/usr/lib/hadoop2/share/hadoop/hdfs/*,

    /usr/lib/hadoop2/share/hadoop/hdfs/lib/*,/usr/lib/hadoop2/share/hadoop/yarn/*,/usr/lib/hadoop2/share/hadoop/yarn/lib/*,/usr/lib/hadoop2/share/hadoop/mapreduce/*,/usr/lib/hadoop2/share/hadoop/mapreduce/lib/*,/usr/lib/hadoop2/share/hadoop/tools/*,

    /usr/lib/hadoop2/share/hadoop/tools/lib/*,/usr/lib/hadoop2/share/hadoop/qubole/*,/usr/lib/hadoop2/share/hadoop/qubole/lib/*

    hadoop.mapreduce.framework.name =yarn

    hadoop.fs.s3n.awsAccessKeyId=<access_key>

    hadoop.fs.s3n.awsSecretAccessKey=<secret_key>

    Sample Screenshot:

    Restart the Imply cluster to reflect the added entires.

    4.Include the following job properties in your ingestion specification to avoid block and multipart size errors:

    Tuning Config:

    "tuningConfig": {

    "type": "hadoop",

    "jobProperties": {

    "mapreduce.job.user.classpath.first": "true",

    "fs.s3a.readahead.range": "65536",

    "fs.s3a.multipart.size": "104857600",

    "fs.s3a.block.size": "33554432"

    }

    }

    The above four steps should establish communication between Qubole hadoop and Imply clusters and data ingestion should directly hit deep storage after aggregations.

    Reference:

    Qubole

    Qubole Documents

    View Article
  • Objective:

    Step by step guide to configure Druid to run Hadoop ingestion with Google Cloud Platform.

    Step 1: Login to GCP console:

    Go to GCP console page https://console.cloud.google.com/, and log in with your credential.

    Step 2: Create a service account that will access the bucket:

    Browse to "IAM & Admin" -> "Service Accounts" -> "Create Service Account",Create a service account, in this example "imply-cs"

    At permission section of the "imply-cs", make sure your user name is listed as one of the owners

    Click on "Create key"

    Download the JSON file that created, and save for later use. That's the authentication/authorization credential for your Druid hosts to access GCP.

    Step 3: Apply the bucket access permission to the service account just created:

    Browse to your GCP bucket through Google navigation menu -> storage -> browser -> YOUR BUCKET ("cs-bucket1" in this example)

    Configure user access permission on the bucket

    Add the service account as Storage Admin to this bucket.

    Step 4: Configure environmental variable on all Druid hosts

    Open SSH sessions to all druid hosts, and save the JSON file downloaded in step 2 to Imply Druid's conf (./conf) directory on all hosts. Then in your bash properties file, e.g, '~/.bash_properties' on RHEL/CentOS, create a new environment variableGOOGLE_APPLICATION_CREDENTIALS. Replace[PATH]with the file path of the JSON file that contains your service account key, and[FILE_NAME]with the filename. For example:

    [root@ip-172-31-2-115 ec2-user]# cat ~/.bash_profile

    # .bash_profile

    export GOOGLE_APPLICATION_CREDENTIALS="/imply-<VERSION>/conf/googleauth.json"

    [root@ip-172-31-2-115 ec2-user]# cat /imply-<VERSION>/conf/googleauth.json

    {

    "type": "service_account",

    "project_id": "<ID>",

    "private_key_id": "<KEY_ID>",

    "private_key": "-----BEGIN PRIVATE KEY-----<KEY>-----END PRIVATE KEY-----\n",

    "client_email": "imply-cs@<ID>.iam.gserviceaccount.com",

    "client_id": "<CLIENT_ID>",

    "auth_uri": "https://accounts.google.com/o/oauth2/auth",

    "token_uri": "https://oauth2.googleapis.com/token",

    "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",

    "client_x509_cert_url": "<URL>"

    }

    Step 5: Add "Druid-google-extensions" to the Druid's "druid.extensions.loadList" in the Druid's "common.running.properties".

    Step 6: Add following to replace the current deep storage config in "common.running.properties"

    druid.storage.type=google

    druid.google.bucket=cs-bucket1

    druid.google.prefix=druid/segments

    Step 7: Add following to replace the current index logging config in "common.running.properties"

    druid.indexer.logs.type=google

    druid.indexer.logs.bucket=cs-bucket1

    druid.indexer.logs.prefix=druid/indexing-logs

    Step 8: Download the gcs connector driver, and place it to both Druid's lib and "druid-google-extension" directories

    wget https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop2-latest.jar

    /imply-2.8.17/dist/druid/extensions/druid-google-extensions/

    /imply-2.8.17/dist/druid/lib/

    Step 9: Configure the Hadoop to interact with GCP

    Copy the same "googleauth.json" in step 4 to "/etc/hadoop/conf" directory on all Hadoop hosts.

    Change the file permission to 444

    Add following config to "core-site.xml" file on all Hadoop hosts

    fs.gs.auth.service.account.json.keyfile=/etc/hadoop/conf/googleauth.json

    fs.gs.working.dir=/

    fs.gs.path.encoding=uri-path

    fs.gs.reported.permissions=777

    Restart Hadoop cluster.

    Test GCP access using HDFS client

    hadoop fs -ls gs://cs-bucket1/

    Step 10: Copy the latest "core-site.xml" to /imply-<VERSION>/conf/druid/_common directory on all Druid hosts.

    Restart Druid cluster.

    Step 11: Upload test raw data JSON file "wikipedia-2016-06-27-sampled.json" to GCP bucket "cs-bucket1"

    Step 12: Run Hadoop ORC ingestion using following spec:

    {

    "type" : "index_hadoop",

    "spec" : {

    "dataSchema" : {

    "dataSource" : "wikipedia-hadoop",

    "parser" : {

    "type" : "hadoopyString",

    "parseSpec" : {

    "format" : "json",

    "dimensionsSpec" : {

    "dimensions" : [ "channel", "cityName", "comment", "countryIsoCode", "countryName", "isAnonymous", "isMinor", "isNew", "isRobot", "isUnpatrolled", "metroCode", "namespace", "page", "regionIsoCode", "regionName", "user", {

    "name" : "commentLength",

    "type" : "long"

    }, {

    "name" : "deltaBucket",

    "type" : "long"

    }, "flags", "diffUrl", {

    "name" : "added",

    "type" : "long"

    }, {

    "name" : "deleted",

    "type" : "long"

    }, {

    "name" : "delta",

    "type" : "long"

    } ]

    },

    "timestampSpec" : {

    "column" : "timestamp",

    "format" : "iso"

    }

    }

    },

    "metricsSpec" : [ ],

    "granularitySpec" : {

    "type" : "uniform",

    "segmentGranularity" : "DAY",

    "queryGranularity" : {

    "type" : "none"

    },

    "rollup" : false,

    "intervals" : [ "2016-06-20T00:00:00.000Z/2016-07-02T00:00:00.000Z" ]

    },

    "transformSpec" : {

    "filter" : null,

    "transforms" : [ ]

    }

    },

    "ioConfig" : {

    "type" : "hadoop",

    "inputSpec" : {

    "type" : "static",

    "paths" : "gs://cs-bucket1/wikipedia-2016-06-27-sampled.json"

    },

    "metadataUpdateSpec" : null,

    "segmentOutputPath" : null

    },

    "tuningConfig" : {

    "type" : "hadoop",

    "workingPath" : null,

    "partitionsSpec" : {

    "type" : "hashed",

    "targetPartitionSize" : 5000000,

    "maxPartitionSize" : 7500000,

    "assumeGrouped" : false,

    "numShards" : -1,

    "partitionDimensions" : [ ]

    },

    "shardSpecs" : { },

    "indexSpec" : {

    "bitmap" : {

    "type" : "concise"

    },

    "dimensionCompression" : "lz4",

    "metricCompression" : "lz4",

    "longEncoding" : "longs"

    },

    "maxRowsInMemory" : 1000000,

    "maxBytesInMemory" : 0,

    "leaveIntermediate" : false,

    "cleanupOnFailure" : true,

    "overwriteFiles" : false,

    "ignoreInvalidRows" : false,

    "jobProperties" : {

    "mapreduce.job.user.classpath.first" : "true",

    "mapreduce.map.java.opts" : "-Duser.timezone=UTC -Dfile.encoding=UTF-8",

    "mapreduce.reduce.java.opts" : "-Duser.timezone=UTC -Dfile.encoding=UTF-8"

    },

    "combineText" : false,

    "useCombiner" : false,

    "buildV9Directly" : true,

    "numBackgroundPersistThreads" : 0,

    "forceExtendableShardSpecs" : true,

    "useExplicitVersion" : false,

    "allowedHadoopPrefix" : [ ],

    "logParseExceptions" : false,

    "maxParseExceptions" : 0

    }

    },

    "hadoopDependencyCoordinates" : null,

    "classpathPrefix" : null,

    "context" : { },

    "dataSource" : "wikipedia-hadoop"

    }

    View Article
  • Objective:

    Step by step guide to configure Druid to ingest files from Google Cloud Platform.

    Step 1: Login to GCP console:

    Go to GCP console page https://console.cloud.google.com/, and log in with your credential.

    Step 2: Create a service account that will access the bucket:

    Browse to "IAM & Admin" -> "Service Accounts" -> "Create Service Account",Create a service account, in this example "imply-cs"

    At permission section of the "imply-cs", make sure your user name is listed as one of the owners

    Click on "Create key"

    Download the JSON file that created, and save for later use. That's the authentication/authorization credential for your Druid hosts to access GCP.

    Step 3: Apply the bucket access permission to the service account just created:

    Browse to your GCP bucket through Google navigation menu -> storage -> browser -> YOUR BUCKET ("cs-bucket1" in this example)

    Configure user access permission on the bucket

    Add the service account as Storage Admin to this bucket.

    Step 4: Configure environmental variable on all Druid hosts

    Open SSH sessions to all druid hosts, and save the JSON file downloaded in step 2 to Imply Druid's conf (./conf) directory on all hosts. Then in your bash properties file, e.g, '~/.bash_properties' on RHEL/CentOS, create a new environment variableGOOGLE_APPLICATION_CREDENTIALS. Replace[PATH]with the file path of the JSON file that contains your service account key, and[FILE_NAME]with the filename. For example:

    [root@ip-172-31-2-115 ec2-user]# cat ~/.bash_profile

    # .bash_profile

    export GOOGLE_APPLICATION_CREDENTIALS="/imply-<VERSION>/conf/googleauth.json"

    [root@ip-172-31-2-115 ec2-user]# cat /imply-<VERSION>/conf/googleauth.json

    {

    "type": "service_account",

    "project_id": "<ID>",

    "private_key_id": "<KEY_ID>",

    "private_key": "-----BEGIN PRIVATE KEY-----<KEY>-----END PRIVATE KEY-----\n",

    "client_email": "imply-cs@<ID>.iam.gserviceaccount.com",

    "client_id": "<CLIENT_ID>",

    "auth_uri": "https://accounts.google.com/o/oauth2/auth",

    "token_uri": "https://oauth2.googleapis.com/token",

    "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",

    "client_x509_cert_url": "<URL>"

    }

    Step 5: Add "Druid-google-extensions" to the Druid's "druid.extensions.loadList" in the Druid's "common.running.properties".

    Step 6: Add following to replace the current deep storage config in "common.running.properties"

    druid.storage.type=google

    druid.google.bucket=cs-bucket1

    druid.google.prefix=druid/segments

    Step 7: Add following to replace the current index logging config in "common.running.properties"

    druid.indexer.logs.type=google

    druid.indexer.logs.bucket=cs-bucket1

    druid.indexer.logs.prefix=druid/indexing-logs

    Step 8: restart all Druid services.

    Step 9: Upload raw data file "wikipedia-2016-06-27-sampled.json" to GCP bucket "cs-bucket1".

    Step 10: POST native batch ingestion to start ingestion:

    {

    "type": "index",

    "spec": {

    "dataSchema": {

    "dataSource": "wikipedia",

    "parser": {

    "type": "string",

    "parseSpec": {

    "format": "json",

    "dimensionsSpec": {

    "dimensions": [

    "isRobot",

    "diffUrl",

    {

    "name": "added",

    "type": "long"

    },

    "channel",

    "flags",

    {

    "name": "delta",

    "type": "long"

    },

    "isUnpatrolled",

    "isNew",

    {

    "name": "deltaBucket",

    "type": "long"

    },

    "isMinor",

    "isAnonymous",

    {

    "name": "deleted",

    "type": "long"

    },

    "namespace",

    "comment",

    "page",

    {

    "name": "commentLength",

    "type": "long"

    },

    "user",

    "countryIsoCode",

    "regionName",

    "cityName",

    "countryName",

    "regionIsoCode",

    {

    "name": "metroCode",

    "type": "long"

    }

    ]

    },

    "timestampSpec": {

    "column": "timestamp",

    "format": "iso"

    }

    }

    },

    "granularitySpec": {

    "type": "uniform",

    "segmentGranularity": "DAY",

    "rollup": false,

    "queryGranularity": "none"

    },

    "metricsSpec": []

    },

    "ioConfig": {

    "type": "index",

    "firehose": {

    "type":"static-google-blobstore",

    "blobs”: [

    {

    "bucket":"cs-bucket1",

    "path":"wikipedia-2016-06-27-sampled.json"

    } ]

    },

    "appendToExisting": false

    },

    "tuningConfig": {

    "type": "index",

    "forceExtendableShardSpecs": true,

    "maxRowsInMemory": 1000000,

    "reportParseExceptions": false,

    "maxParseExceptions": 100,

    "maxSavedParseExceptions": 10

    }

    }

    }

    View Article
  • There are some instances where you want to do a planned shut down of your on-premise Imply Druid cluster such as network maintenance, operating system updates, or hardware maintenance.

    Below are the steps to do so.

    Assumptions:

    The Imply cluster is running colocated services as the following:

    Pre-requisites:

    Notify your end users the cluster will be shut down, hence they are aware services being powered by Imply will not be operational.

    If you have a streaming ingestion, ensure retention policies for the source systems are longer than the planned downtime. For instance, if you have a downtime window of 4 hours, make sure that the retention policy is at least 8 hours to prevent any gaps in data.

    Stop all ingestions jobs.

    *NOTE make sure to have a backup of the ingestion spec since you may need to resubmit the ingestion jobs.

    Steps

    Stop the Query Servers (Broker, Pivot, Router). On the Query Server machine log in as the user running the Imply druid processes. Then run the commandbin/service

    --down. Verify the service is no longer running by executing a ps -ef | grep Imply.

    Stop the Data Servers (Middle Manager, Historical)On the Data Server machine log in as the user running the Imply druid processes. Then run the commandbin/service

    --down.Verify the service is no longer running by executing a ps -ef | grep Imply.

    Stop the Master Servers (Overlord, Coordinator, Zookeeper). On the Master Server machine log in as the user running the Imply druid processes. Then run the commandbin/service --down.Verify the service is no longer running by executing a ps -ef | grep Imply.

    View Article
  • A Druid cluster can only be as stable as its Zookeeper. By design, Druid nodes are prevented from talking to each other. Zookeeper is responsible for much of the communication that keeps the Druid cluster functioning.

    ZK Documentation

    For maximum ZK stability, follow these best practices:

    Architecture

    Keep a dedicated ZK for Druid; avoid sharing with other products / applications

    Maintain an odd number of zookeepers. For increased reliability (high availability), use 3 or 5 ZK's so that it can maintain a quorum in voting decisions (ceiling(n/2)). Voting decisions include IP discovery, determining which overlord or coordinator is the leader, etc.

    For example, with 3 ZK's, 1 ZK can be lost and the cluster can still function because 2 are left, which = ceil(3/2). In a cluster with 2 ZK's, only 1 is needed to maintain a "quorum", so it does not provide high availability. Two ZK's doubles the probability of failure and therefore is actually less preferable than 1 ZK.

    Imply prefers to colocate ZK on master nodes rather than data or query nodes. This is because the coordinator and overlord are normally far less work intensive than data or query nodes

    Never put ZK behind a load balancer. This can work if all ZK servers are healthy, but if one goes down it can be difficult for the client to find a replacement. Because ZK recovery is hampered in this case, it defeats thepurpose of high availability.

    ZK Data and Log files should be on disks, which have least contention from other I/O activities. Ideally the ZK data and ZK transaction log files should be on different disks, so that they don't contend for the IO resource. Note that, it isn't enough to just have partitions but they have to be different disks to ensure performance. If possible use high speed disks like SSD's for dataDir and dataLogDir locations.

    Configuration

    List ZK IP's as hosts rather than DNS names in the common.runtime.properties. The reason for this is that there are some bugs in the part of the ZK code that handles DNS mappings, including one where ZK won't retry the connection if DNS becomes unavailable.

    Example common.runtime.properties:

    druid.zk.service.host=172.16.xxx.x1,172.16.xxx.x2,172.16.xxx.x3

    When running in HA mode, add all ZK server IP's to zoo.cfg to ensure that ZK can properly elect its own leader:

    ...

    #

    # Ensemble

    #

    server.1=172.16.xxx.x1:2888:3888

    server.2=172.16.xxx.x2:2888:3888

    server.3=172.16.xxx.x3:2888:3888

    In the dataDir for the respective ZK servers, ensure that the myid file has the same number as defined by server.x in the above. For example, if dataDir is defined as /opt/imply-2.8.7/var/zk, on server.1, myid file should have an entry of 1. In case of server.2, it should be 2. This can be verified by doing a cat/opt/imply-2.8.7/var/zk/myid on server.1 (172.16.xxx.x1)

    Try to keep the Zookeeper log transaction and data dir to be in independent disks. These can be defined in zoo.cfg using dataDir and dataLogDir parameters.

    In ZK version 3.4.0 and above, enable auto purging of transaction logs usingautopurge.purgeInterval=1 in zoo.cfg. This would ensure that the older transaction logs are cleaned up regularly and avoids disk full situations. By default 3 transaction logs are retained and this is controlled byautopurge.snapRetainCount in zoo.cfg.

    All ZK servers should use the same configuration

    Refer to for more information.

    View Article
  • Symptoms:

    "Too many open files" messages showing up in your logs? Tasks failing?

    First step - check your ulimit

    ulimit -a

    If the value is < 10,000 you should probably increase it.

    Root Cause:

    The default open file settings for linux or often too low for an Apache Druid system, which can be interacting with thousands of files simultaneously.

    Solution:

    Per session solution (will be reverted on reboot):

    NOTE: Ensure you have root access

    Execute the following

    ulimit -n 65536

    ulimit -u 65536

    Permanent solution:

    vi /etc/security/limits.conf

    Ensure the following values are set:

    * soft nproc 65535

    * hard nproc 65535

    * soft nofile 65535

    * hard nofile 65535

    Reference:

    https://access.redhat.com/solutions/61334

    https://www.youtube.com/watch?v=6uikJTnmtgw

    View Article
  • Druid SQL

    The simplest way to make Druid SQL queries from the command line is using thedsqltool included with the Imply distribution:

    $ bin/dsql

    Welcome to dsql, the command-line client for Druid SQL.

    Connected to [http://localhost:8082/].

    Type "\h" for help.

    dsql> SELECT COUNT(*) from wikipedia;

    EXPR$0

    24433

    Retrieved 1 row in 0.05s.

    By default, thedsql tool connects to a broker on localhost without security enabled. If you need to connect to another broker, use a command like:

    bin/dsql -H http://your-broker:8082/

    If you need to connect to a broker with security enabled, try the following (Imply 2.8 or later required):

    bin/dsql -H https://your-broker:8282/ -u admin:yourpassword

    Refer to the Imply Druid SQL documentation, or the community Druid SQL documentation, for more details about the kinds of queries you can issue.

    You can also make Druid SQL queries using Pivot through the "SQL" tab.

    Native Druid queries

    To issue native Druid JSON queries from the command line, run:

    curl -XPOST [email protected] -H'Content-Type: application/json' http://your-broker:8082/druid/v2/

    Wherequery.json is a file containing the query JSON you wish to issue. Refer to the Druid querying documentation for details about the kinds of queries you can issue.

    View Article
  • For an overview of Kafka security features and guidance for setting up Kerberized Kafka, please see:

    https://www.confluent.io/blog/apache-kafka-security-authorization-authentication-encryption/

    In order for Druid to connect with Kerberized Kafka, a JAAS conf file needs to be set up on each service. There are two alternate methods of authentication from within the JAAS conf file:

    1. setting useTicketCache=true

    2. create keytab based authentication. We provide an example of this below.

    1. Create keytab for Druid

    Where <druid-user> is the user that runs the Druid services:

    addprinc <druid-user>@EXAMPLE.COM

    ktadd -k /keytab/<druid-user>.keytab <druid-user>@{REALM}

    2. Create JAAS conf file for Druid as Kafka Client:

    /path/to/client_jaas.conf

    *Note: The Client section is only necessary if ZK is Kerberized

    KafkaClient {

    com.sun.security.auth.module.Krb5LoginModule required

    useTicketCache=false

    useKeyTab=true

    storeKey=true

    keyTab="/keytab/<druid-user>.keytab"

    principal="<druid-user>@EXAMPLE.COM";

    };

    Client {

    com.sun.security.auth.module.Krb5LoginModule required

    useTicketCache=false

    useKeyTab=true

    storeKey=true

    keyTab="/keytab/<druid-user>.keytab"

    principal="<druid-user>@EXAMPLE.COM";

    };

    3.EditDruidproperties

    In the jvm.config for each service, add the path to the JAAS conf file:

    -Djava.security.auth.login.config=/client_jaas.conf

    Inconf/middleManager/runtime.properties(or conf-quickstart if running quickstart), add to druid.indexer.runner.javaOpts:

    -Djava.security.auth.login.config=/client_jaas.conf

    In conf/druid/_common/common.runtime.properties (or conf-quickstart if running quickstart) - NOTE: this part is only necessary if ZK is Kerberized:

    druid.zk.service.host={host}.example.com

    Note: This procedure was tested with Imply 2.6.8 and Kafka kafka_2.11-2.0.0.

    View Article
  • These instructions have been tested with Imply 2.8 (based on Druid 0.13.0).

    Oracle Cloud provides an S3-compatible Object Storage API, which can be used as a Druid deep storage. This requires Imply 2.8 or later. To achieve this, add the following properties to your common.runtime.properties:

    druid.storage.type=s3

    druid.storage.bucket=yourBucket

    druid.storage.baseKey=druid/segments

    druid.storage.disableAcl=true

    druid.s3.endpoint.url=<hostname>:443

    druid.s3.endpoint.signingRegion=<region>

    druid.s3.accessKey=<accessKey>

    druid.s3.secretKey=<secretKey>

    druid.s3.disableChunkedEncoding=true

    druid.s3.enablePathStyleAccess=true

    druid.s3.forceGlobalBucketAccessEnabled=false

    See https://docs.cloud.oracle.com/iaas/Content/Object/Tasks/s3compatibleapi.htm for more information about Oracle's S3 compatible Object Storage API.

    View Article
  • About Swift

    OpenStack Swift is a self-hosted option that offers enough compatibility with the S3 API to be used as deep storage via Druid's S3 integration.

    Imply 2.7.x and earlier (Druid 0.12.x and earlier)

    These versions use the JetS3t library to access S3 APIs.

    Configure S3 deep storage the usual way: add "druid-s3-extensions" to your extension list, configure access and secret keys, and configure a bucket and path.

    Create a file "conf/druid/_common/jets3t.properties" (or "conf-quickstart/druid/_common/jets3t.properties" if you are using the quickstart config) and include the following lines:

    s3service.s3-endpoint= swift.example.com

    s3service.s3-endpoint-http-port=8080

    s3service.https-only=false

    s3service.disable-dns-buckets=true

    s3service.enable-storage-classes=false

    Imply 2.8 and later

    These versions use the Amazon aws-java-sdk to access S3 APIs. We do not have specifics at this time about how to integrate these versions with Swift for deep storage.

    View Article
  • Where do Kafka messages go when they Enter the Druid? Do they morph into insects? Do they exist at all?

    This diagram shows the relationship between partitions on the Kafka broker, ingestion tasks, and Druid segments during Kafka Indexing. Druid parameters that influence ingestion efficiency and number of segments are noted.

    The Kafka Indexing Service works differently before and after Druid 0.12 (Imply 2.5.x). The main difference is that pre-0.12, segments could not mix data from different partitions. This can have a large impact on total number of segments generated when using <0.12.

    http://druid.io/docs/latest/tutorials/tutorial-kafka.html

    Additional references:

    http://druid.io/docs/latest/development/extensions-core/kafka-ingestion.html

    View Article
  • Challenge : We do see this error message when running Hadoop_index job on EMR instances with Dimension Partitioning

    Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in InMemoryMerger - Thread to merge in-memory shuffled map-outputs

    at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)

    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:377)

    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)

    at java.security.AccessController.doPrivileged(Native Method)

    at javax.security.auth.Subject.doAs(Subject.java:422)

    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)

    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

    Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/attempt_1543312496194_0001_r_000000_1005/map_421.out

    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:402)

    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)

    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)

    at org.apache.hadoop.mapred.YarnOutputFiles.getInputFileForWrite(YarnOutputFiles.java:213)

    at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl$InMemoryMerger.merge(MergeManagerImpl.java:459)

    at org.apache.hadoop.mapreduce.task.reduce.MergeThread.run(MergeThread.java:94)

    Root Cause : EMR instances might not have sufficient space or not adequately configured as Dimension Partitioning requires more temporary space for shuffle as it has to re-group.

    Solution : Either upgrade EMR instances to higher configuration or run the job with lesser intervals like 1 hour or 6 hours instead of 1 day or 1 week.

    Reference Links :

    https://aws.amazon.com/emr/

    https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html

    View Article
  • ***Imply Version: 2.7.5 and higher***

    With the added ability to perform UNION ALL operations on datasets, Imply also offers the ability to create a data cube from multiple datasets.

    Before we begin, there are a couple of things to note about this capability.

    This is a logical operation at query time, so no changes are made to the underlying druid datasets

    This is a UNION ALL operation, so duplicates between datasets will be preserved

    Any dimensions that are missing from a dataset will be treated as NULL values

    The Gist

    To create a data cube from multiple data sources, add the following JSON to the Data Cube Options field in the Advanced section of the Data Cube Properties. With TABLE_A and TABLE_B being the names of the data sources you wish to use.

    {

    "altDataSource": [

    "TABLE_A",

    "TABLE_B"

    ]

    }

    The Tutorial

    Getting Started - Loading the data

    For this example, I will be using sample data from https://quantquote.com/historical-stock-data.

    First, let's create two datasets, for my example, I am using Minute and Second resolution data provided as a sample from Quantquote. The datasets are available here:

    https://quantquote.com/sample/SPY_SECOND_TRADE.csv

    https://quantquote.com/sample/SPY_MINUTE_TRADE.csv

    The data format is CSV, with a header that contains the column names.

    You'll notice that the Minute level data contains a date field, while the Second level data does not.

    For these datasets, we will not be using a time column or a roll-up

    Leaving the defaults for the columns

    And we will go ahead and Automatically Create a data cube for each of the data sources, just for comparison purposes.

    A quick review and we go ahead and load the data!

    Create the Data Cube

    We are going to create a new data cube, and initially select the data source with all of the fields you wish to include in the cube (in this case we start with the By Minute data source, as it contains the date field that does not exist in the Second level dataset)

    Go ahead and change the name, so we don't confuse it with our existing data cubes.

    Jump down to the Advance Tab and scroll to the bottom to the Data Cube Options and enter the following JSON.

    {

    "altDataSource": [

    "SPY_MINUTE_TRADE",

    "SPY_SECOND_TRADE"

    ]

    }

    Click Save and We're DONE

    Validation

    Spy Minute Trade = 748 Events

    Spy Second Trade = 19.15k Events

    Spy Combined Trade = 19,150 + 748 = 19,898 Events

    View Article
  • Imply 2.8 and later

    Imply 2.8 and later provide log rotation out of the boxon supported platforms (Linux x86_64 and MacOS) through thesupervise program. Logs appear invar/sv/<service>/current and are rotated to other files in that same directory, before eventually being deleted.

    The remainder of this article is useful for Imply 2.7 and earlier.

    Imply 2.7 and earlier

    In Imply 2.7.x and earlier, Druid is configured to use `ConsoleAppender` for logging. However, ConsoleAppender does not roll over old logs and does not do archiving, so all logs of a service are written to a single file, which can grow very large in size over time.

    A popular solution is to use `RollingFileAppender` for logging. More details can be found here: https://logging.apache.org/log4j/2.0/manual/appenders.html#RollingFileAppender

    The following is an example of configuring `RollingFileAppender` for coordinator logging:

    1. Configure`jvm.config` for coordinator by adding `-Dservice` at the end of file with the name of each service. It is important that each jvm.config have a different value for "service"!

    [root@ip-172-31-12-141 coordinator]# cat /imply/imply-2.5.14/conf/druid/coordinator/jvm.config

    -server

    -Xms256m

    -Xmx256m

    -Duser.timezone=UTC

    -Dfile.encoding=UTF-8

    -Djava.io.tmpdir=var/tmp

    -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager

    -Dderby.stream.error.file=var/druid/derby.log

    -Dservice=coordinator

    Similar configuration can be applied to other services, with changes in `-Dservice` value respectively.

    2. back up the original the log4j2.xml file, then replace the content with following:

    <?xml version="1.0" encoding="UTF-8"?>

    <Configuration status="warn" name="Imply" packages="">

    <Appenders>

    <RollingFile name="RollingFile" fileName="var/sv/${sys:service}-service.log"

    filePattern="var/sv/${sys:service}-service-\%d{yyyy-MM-dd}.\%i.log.gz">

    <PatternLayout>

    <Pattern>\%d{ISO8601} \%p [\%t] \%c - \%m\%n</Pattern>

    </PatternLayout>

    <Policies>

    <TimeBasedTriggeringPolicy interval="1" modulate="true"/>

    <SizeBasedTriggeringPolicy size="100 MB"/>

    </Policies>

    <DefaultRolloverStrategy max="10"/>

    </RollingFile>

    </Appenders>

    <Loggers>

    <Root level="info">

    <AppenderRef ref="RollingFile"/>

    </Root>

    </Loggers>

    </Configuration>

    In this new configuration, we roll over and archive old logs in to gz format daily, or if the current log file grows larger than 100MB.

    3. IMPORTANT:`RollingFileAppender` will break ingestion task logging. So we need to instruct tasks to still use `ConsoleAppender` for logging. To do so:

    3a. Keep the default log4j2.xml and rename it to log4j2-task.xml. It will have the following contents:

    <?xml version="1.0" encoding="UTF-8" ?>

    <Configuration status="WARN">

    <Appenders>

    <Console name="Console" target="SYSTEM_OUT">

    <PatternLayout pattern="\%d{ISO8601} \%p [\%t] \%c - \%m\%n"/>

    </Console>

    </Appenders>

    <Loggers>

    <Root level="info">

    <AppenderRef ref="Console"/>

    </Root>

    </Loggers>

    </Configuration>

    3b. add additional property `-Dlog4j.configurationFile` to middleManager's runtime.properties --> `druid.indexer.runner.javaOpts`, and point to the path of the `log4j2-task.xml` file:

    [root@ip-172-31-12-141 middleManager]# cat /imply/imply-2.5.14/conf/druid/middleManager/runtime.properties

    ...

    # Task launch parameters

    druid.indexer.runner.javaOpts=-server -Xms512m -Xmx512m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager -Dlog4j.configurationFile=conf/druid/_common/log4j2-task.xml

    ...

    4. The var/sv directory will look like this after the change is applied:

    [root@ip-172-31-12-141 sv]# ls -lh

    -rw-r--r--. 1 root root 37K Jun 24 00:00 broker-service-2018-06-23.log.gz

    -rw-r--r--. 1 root root 14K Jun 25 02:00 broker-service-2018-06-24.log.gz

    -rw-r--r--. 1 root root 188K Jun 25 22:55 broker-service.log

    -rw-r--r--. 1 root root 1K Jun 25 22:55 broker.log

    -rw-r--r--. 1 root root 574K Jun 25 22:55 imply-ui.log

    In this example, "broker-service.log" is Druid's log4j2 log, and "broker.log" is standard out. The standard-out log file will generally contain nothing at first. Eventually, it may accumulate logs that do not go through log4j2, including:

    GC logs (from the JVM PrintGCTimeStamps option).

    Crash logs.

    Thread dumps generated by "kill -3".

    View Article
  • The feature will be introduced in the 0.11 release (& part of Imply 2.4.0)

    For more details, please refer to : http://druid.io/docs/latest/operations/tls-support.html

    For TLS security on Cloud : https://docs.imply.io/cloud/deploy/security

    View Article
  • This feature is available for Imply Druid 2.6.0 or greater

    Step 1:Make sure toinclude druid-basic-securityas an extension.

    Step 2: Also add following configs to thecommon.runtime.properties:

    druid.auth.authenticatorChain=["basic"]

    druid.auth.authenticator.basic.type=basic

    druid.auth.authenticator.basic.initialAdminPassword=priest

    druid.auth.authenticator.basic.initialInternalClientPassword=warlock

    druid.auth.authenticator.basic.authorizerName=basic

    druid.escalator.type=basic

    druid.escalator.internalClientUsername=druid_system

    druid.escalator.internalClientPassword=warlock

    druid.escalator.authorizerName=basic

    druid.auth.authorizers=["basic"]

    druid.auth.authorizer.basic.type=basic

    When the extension is first loaded, it looks at these properties to determine what initial password to set for the `admin` user and the `druid_system` user. After that these properties do nothing

    Both `admin` and `druid_system` have R/W * privileges, and are automatically created if an initial password is set for them. In our case, `priest` is the password for `admin` and `warlock` is for `druid_system`.

    Step 3: Start the Druid cluster and you can now create and manager users:

    [root@mf-amos263 extensions]# curl -u admin:priest -X POST http://localhost:8081/druid-ext/basic-security/authentication/db/basic/users/ocean

    [root@mf-amos263 extensions]#

    [root@mf-amos263 extensions]# curl -u admin:priest -X GET http://localhost:8081/druid-ext/basic-security/authentication/db/basic/users/

    ["admin","druid_system","ocean"]

    [root@mf-amos263 extensions]#

    [root@mf-amos263 extensions]# curl -u druid_system:warlock -X GET http://localhost:8081/druid-ext/basic-security/authentication/db/basic/users/

    ["admin","druid_system","ocean"]

    To further managing the users and their roles, please refer to: https://docs.imply.io/cloud/deploy/security#managing-druid-api-users

    For other functions in druid-basic-security, please check reference.

    Reference: http://druid.io/docs/latest/development/extensions-core/druid-basic-security.html

    View Article
  • SYMPTOM:

    Ingestion from S3 fails with a message similar to below:

    <Code>NoSuchKey</>

    <Message>The specified key does not exist</Message>

    Ingestion spec has ioConfig block similar to the following:

    "ioConfig": {

    "type": "index",

    "firehose": {

    "fetchTimeout": 300000,

    "type": "static-s3",

    "prefixes": ["s3://not-real-bucket/filepath/dt=2018-10-31"]

    },

    "appendToExisting": "false"

    }

    ROOT CAUSE:

    The cause of this issue is a bug in the jetS3t library when using AWS S3 Signature V4.

    WORKAROUND:

    The workaround is to revert to using Signature V2 for AWS S3. This can be enabled in the Experimental Features section of the cluster settings as illustrated below.

    Resolution:

    Beginning with Imply 2.8.0, jetS3t is no longer used. Instead the native AWS S3 API is leveraged, thus resolving this issue.

    View Article
  • Question: Over time there might be millions of log files stored on to hdfs. How to configure auto purging logs older than x number of days?

    Answer: Admins can enable auto killing indexing logs by settingdruid.indexer.logs.kill.enabledto true. and configuredruid.indexer.logs.kill.durationToRetainto retain task logs that were created in lastxmilliseconds.

    Please see http://druid.io/docs/latest/configuration/indexing-service.html#task-logging for more details.

    View Article
  • The following ingestion spec will help renaming dimension during ingestion while keep its values intact. In this case, original column names are `user` and `regionName`, and after ingestion, the column names are changed to `RENAME_User` and "RENAME_RegionName` in dataSource:

    {

    "type" : "index",

    "spec" : {

    "dataSchema" : {

    "dataSource" : "wikipedia-rename",

    ...

    "dimensionsSpec" : {

    "dimensions" : [ "RENAME_User", .... "RENAME_RegionName", ... ]

    }

    }

    },

    ...

    "transformSpec" : {

    "filter" : null,

    "transforms" : [ {

    "type" : "expression",

    "name" : "RENAME_User",

    "expression" : "(\"user\")"

    }, {

    "type" : "expression",

    "name" : "RENAME_RegionName",

    "expression" : "(\"regionName\")"

    } ]

    }

    }

    ...

    }

    View Article
  • SYMPTOM:

    During data ingestion, it hits error

    2018-04-30T21:34:50,735 INFO [appenderator_persist_0] io.druid.segment.realtime.appenderator.AppenderatorImpl - Removing sink for segment[SEGMENT_NAME].

    2018-04-30T21:34:50,758 ERROR [task-runner-0-priority-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Exception while running task[IndexTa

    sk{id=index_SEGMENT, type=index, dataSource=DATASOURCE}]

    java.lang.UnsupportedOperationException: Numeric columns do not support multivalue rows.

    at io.druid.segment.DoubleDimensionIndexer.processRowValsToUnsortedEncodedKeyComponent(DoubleDimensionIndexer.java:40) ~[druid-processing-0.12.0-iap7.jar:0.12.0-iap7]

    at io.druid.segment.DoubleDimensionIndexer.processRowValsToUnsortedEncodedKeyComponent(DoubleDimensionIndexer.java:33) ~[druid-processing-0.12.0-iap7.jar:0.12.0-iap7]

    at io.druid.segment.incremental.IncrementalIndex.toTimeAndDims(IncrementalIndex.java:555) ~[druid-processing-0.12.0-iap7.jar:0.12.0-iap7]

    at io.druid.segment.incremental.IncrementalIndex.add(IncrementalIndex.java:503) ~[druid-processing-0.12.0-iap7.jar:0.12.0-iap7]

    at io.druid.segment.realtime.plumber.Sink.add(Sink.java:157) ~[druid-server-0.12.0-iap7.jar:0.12.0-iap7]

    at io.druid.segment.realtime.appenderator.AppenderatorImpl.add(AppenderatorImpl.java:223) ~[druid-server-0.12.0-iap7.jar:0.12.0-iap7]

    at io.druid.segment.realtime.appenderator.BaseAppenderatorDriver.append(BaseAppenderatorDriver.java:293) ~[druid-server-0.12.0-iap7.jar:0.12.0-iap7]

    ROOT CAUSE:

    A row containing multi value is found during date ingestion, and indexer does not know how to handle.

    WORKAROUND:

    Create a dummy config for property`"listDelimiter"` under `parseSpec`. You can fill in any string as the value.

    SOLUTION :

    "parseSpec" : {

    "format" : "csv",

    "hasHeaderRow": true,

    "listDelimiter": "abc",

    "timestampSpec": {

    "column": "TimeStamp"

    },

    View Article
  • This is a feature added to Imply 2.5.8 and later.

    As shown in the following example,it is now possible to concatenate two columns together in a derived dimension using a formula such as:

    `$columnA ++ ', ' ++ $columnB`

    View Article
  • What are you trying to accomplish:

    Execute a thread dump to troubleshoot an unresponsive druid process.

    Workaroundor Solution:

    Run the command as the linux user that is running the process. For instance,

    ps -f | grep java

    [root@localhost ~]# ps -ef | grep java

    root 2341 2318 0 14:41 ? 00:00:19 java -server -Xms128m -Xmx128m -Duser.timezone=UTC -cp /tmp/imply-2.7.8/bin/../dist/zk/lib/*:/tmp/imply-2.7.8/bin/../dist/zk/*:/tmp/imply-2.7.8/conf-quickstart/zk org.apache.zookeeper.server.quorum.QuorumPeerMain /tmp/imply-2.7.8/conf-quickstart/zk/zoo.cfg

    Based on this output, you can see that the rootuser is running the process. Hence, in order to run the jstack command, you will need to be logged in as this user.

    Root Cause:

    Executing the command with a linux user that isn't the same as the user running the druid process.

    View Article
  • SYMPTOM:

    Indexing tasks from Kafka or Kinesis are finishing without any data ingested. From task logs we can find similar logs like following:

    2018-07-02T07:46:02,783 ERROR [task-runner-0-priority-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Exception while running task[KinesisIndexTask{id=index_kinesis_xxxx, type=index_kinesis, dataSource=xxxxxx}]

    io.druid.java.util.common.ISE: Starting sequenceNumber [49585811274695675412070234775862022455403125352932835346] is no longer available for partition [shardId-000000000001] (earliest: [49585811274695675412070251479546550741235996701713498130]) and resetOffsetAutomatically is not enabled

    Or

    2018-09-03T12:55:14,522 WARN [task-runner-0-priority-0] io.druid.indexing.kafka.KafkaIndexTask - OffsetOutOfRangeException with message [Offsets out of range with no configured reset policy for partitions: {Druid-XXXX-X=7263946670}]

    And at the end, we can see:

    2018-09-03T12:55:15,207 INFO [publish-0] io.druid.segment.realtime.appenderator.BaseAppenderatorDriver - Nothing to publish, skipping publish step.

    ......

    2018-09-03T12:55:15,232 INFO [task-runner-0-priority-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_kafka_Test-K_40675c261af88df_pcigmfli] status changed to [SUCCESS].

    ..... ......

    2018-09-03T12:55:15,233 INFO [task-runner-0-priority-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {

    "id" : "index_kafka_Test-K_40675c261af88df_pcigmfli",

    "status" : "SUCCESS",

    "duration" : 3601087,

    "errorMsg" : null

    }

    ROOT CAUSE:

    The ingestion is trying to read messages that are not available in Kafka, it patiently waits until the `taskDuration` is over, or the supervisor is reset. The most common reason is that the ingestion is not able to ingest all messages from Kafka before they were deleted due to Kafka retention policies. It could be because the ingestion is slow and needs to be scaled out / tuned, or the ingestion could have just been turned off for long enough that Kafka deleted the messages it was pointed at.

    WORKAROUND:

    Druid admin shall reset the Supervisor when this happens, which basically skips ingestion forward to the latest Kafka messages. This reset can be done by using the "reset supervisor" command in the UI or by using the reset API described on http://druid.io/docs/latest/development/extensions-core/kafka-ingestion.html.

    When reset, Supervisor will kill all currently on-going ingestion tasks, and launch new tasks after reset. Druid also will make sure to load exactly whatever is in Kinesis/Kafka only once.

    The reason it doesn't reset automatically by default is due to possible scenarios where messages are only temporarily not available (like some temporary issue on the Kafka side that can be fixed quickly). In another hand, admin can enable the `resetOffsetAutomatically` config if this resetting needs to be automatic.

    Other options include using`useEarliestSequenceNumber` in Kinesis ingestion, while `useEarliestOffset` in Kafka property ingestion. If one is configured for another ingestion method, that property will be ignored during ingestion. These two properties serve same purpose in each ingestion method of their own.

    In production, we recommend to monitor Kafka ingestion lag in order to prevent ingest from falling behind like this. There are two main ways to do it. One is manually, through Clarity's Ingestion tab and the "Kafka Lag" metric (see attached screenshot). Another is by API, using the supervisor status API and looking at the "aggregateLag" value. In both cases, the lag is measured in terms of number of messages. So if the normal expectation is to read 100 messages per second, then a lag value of 100,000 means there are about 16 minutes behind.

    Finally: the fact that this condition, when occurring for an entire task duration, is reported as SUCCESS instead of FAILED is confusing and we will be adjusting that. In general our preference is that user only see SUCCESS when the connection to Kafka is healthy.

    View Article
  • For instance, we have two datasources :workbook1andworkbook4, with different schemas, but some common columns

    [root@ip-172-31-2-115 tmp]# curl -X POST http://ip-172-31-10-21.us-west-2.compute.internal:8082/druid/v2/?pretty -H 'Content-Type:application/json' -d @query_workbook1.json

    [ {

    "segmentId" : "Workbook1_2018-01-01T00:00:00.000Z_2018-02-01T00:00:00.000Z_2018-06-25T20:57:45.505Z",

    "columns" : [ "__time", "card number", "email id", "name" ],

    "events" : [ [ 1514764800000, "123456", "[email protected]", "jason b" ], [ 1514851200000, "234567", "[email protected]", "al c" ], [ 1514937600000, "345678", "[email protected]", "brian K" ], [ 1515024000000, "456789", "[email protected]", "jason b" ], [ 1515110400000, "567890", "[email protected]", "henry G" ] ]

    [root@ip-172-31-2-115 tmp]# curl -X POST http://ip-172-31-10-21.us-west-2.compute.internal:8082/druid/v2/?pretty -H 'Content-Type:application/json' -d @query_workbook4.json

    [ {

    "segmentId" : "Workbook4_2018-01-01T00:00:00.000Z_2018-02-01T00:00:00.000Z_2018-06-25T21:36:28.017Z",

    "columns" : [ "__time", "card number", "email id", "address", "zip" ],

    "events" : [ [ 1515628800000, "654321", "[email protected]", "123 any street", "95434" ], [ 1515715200000, "765432", "[email protected]", "345 unknown ave", "78643" ], [ 1515801600000, "876543", "[email protected]", "653 where blvd", "88976" ], [ 1515888000000, "987654", "[email protected]", "1 goto block", "22345" ], [ 1515974400000, "987651", "[email protected]", "889 mapper ave", "12001" ] ]

    } ]

    Then I can create a UNON query, to findcard numberforemail id = [email protected], from both datasources:

    [root@ip-172-31-2-115 tmp]# cat query_workbook_union2.json

    {

    "queryType": "scan",

    "dataSource": {

    "type": "union",

    "dataSources": ["Workbook1", "Workbook4"]

    },

    "intervals": {

    "type": "intervals",

    "intervals": [

    "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"

    ]

    },

    "virtualColumns": [],

    "resultFormat": "compactedList",

    "batchSize": 20480,

    "limit": 5000,

    "filter": {

    "type": "selector",

    "dimension": "email id",

    "value": "[email protected]"

    },

    "columns": [

    "__time",

    "card number",

    "email id"

    ],

    "legacy": false,

    "context": {

    "timeout": 300000

    },

    "descending": false,

    "granularity": {

    "type": "all"

    }

    }

    and the output reads:

    [root@ip-172-31-2-115 tmp]# curl -X POST http://ip-172-31-10-21.us-west-2.compute.internal:8082/druid/v2/?pretty -H 'Content-Type:application/json' -d @query_workbook_union2.json

    [ {

    "segmentId" : "Workbook1_2018-01-01T00:00:00.000Z_2018-02-01T00:00:00.000Z_2018-06-25T20:57:45.505Z",

    "columns" : [ "__time", "card number", "email id" ],

    "events" : [ [ 1514764800000, "123456", "[email protected]" ], [ 1515024000000, "456789", "[email protected]" ] ]

    }, {

    "segmentId" : "Workbook4_2018-01-01T00:00:00.000Z_2018-02-01T00:00:00.000Z_2018-06-25T21:36:28.017Z",

    "columns" : [ "__time", "card number", "email id" ],

    "events" : [ [ 1515628800000, "654321", "[email protected]" ], [ 1515888000000, "987654", "[email protected]" ] ]

    } ]

    The UNION query performance is the same as querying two datasources simultaneously, so there's very minimal performance difference.

    View Article
  • Overview

    This article describes how to establish a VPC peering connection to allow traffic from another VPC in either the same or a different AWS account. This allows services located in other VPCs to query Imply without requiring the Imply Cloud clusters to be reachable publicly over the Internet.

    We will first go through the straightforward case, where none of the peered VPCs have overlapping CIDR blocks (the IP address ranges for all peered VPCs are disjoint). Following this, we will describe a scenario where there is an overlap, and present a strategy for setting up a network topology to handle this.

    This article does not cover the following two cases:

    Setting up a peering connection between different AWS regions: AWS now provides limited support for managed peering between some regions, but this is an advanced case as there are a number of restrictions that must be considered. Please consult the AWS documentation on multi-region VPC peering for more details.

    Setting up a peering connection if the Imply Cloud VPC (the one created in your AWS account) has an overlapping CIDR block with the VPC you're trying to peer: VPC peering is not supported by AWS in this case. You may contact us to request that your Imply Cloud VPC use a different CIDR block to remove the conflict. Changing the CIDR block of the VPC will incur downtime of your Imply Cloud cluster.

    No Overlapping VPCs

    There are two main steps in peering VPCs when a given VPC does not have multiple peering connections which have overlapping network ranges:

    Establish the peering connection between the VPCs.

    Configure the route tables so that packets are routed to the correct destination.

    Create a Peering Connection

    Using the AWS web interface, go to the VPC Dashboard and inspect your configured VPCs. Identify the Imply Cloud VPC (the one tagged as 'imply-{accountId}-network-vpc') and the VPC you wish to peer with:

    here

    Here, we wish to peer the Imply VPC (vpc-ce78dbb7) with non-overlapping-vpc (vpc-641fee1c). Note the CIDR ranges, 10.50.0.0/21 and 172.20.0.0/20 respectively, and ensure that they are not overlapping.

    Go to the Peering Connections section of the VPC Dashboard and note the active peering connections:

    You should see at least one entry, which is the peering connection used by Imply to enable communication between Imply's servers (the Imply Cloud Manager) and the Imply clusters running in your account. If the VPC you wish to peer with has a CIDR range that overlaps with this entry, you fall into the overlapping VPC category and should continue to read that section after understanding the basic case.

    In this example, the Imply VPC (vpc-ce78dbb7) is peered to a VPC in Imply's AWS account (vpc-828a54fb) which has a CIDR block of 172.19.0.0/20. This corresponds to an IP address range of 172.19.0.0 - 172.19.15.255 which does not overlap the 172.20.0.0/20 block of non-overlapping-vpc so no special configuration will be required.

    Click on Create Peering Connection, optionally provide a name tag, and put the Imply VPC (here vpc-ce78dbb7) as the requester and the other VPC (here vpc-641fee1c) as the accepter. You will see a new entry in the peering table with a status of Pending Acceptance:

    To accept the peering request, right-click on the peering connection and choose Accept Request. Upon acceptance, the peering connection will change status to be Active.

    You will also want to configure the peering connection to allow DNS names to be resolved over the link. This will allow you to address the query elastic load balancer using its assigned DNS name from the other VPC. To do this, right-click on the peering connection and select Edit DNS Settings. On this page, ensure that the option allowing your VPC to resolve DNS requests for the Imply VPC hosts is checked. In this example, we want non-overlapping-vpc (vpc-641fee1c) to be able to resolve DNS requests of Imply VPC (vpc-ce78dbb7):

    Click on Save.

    Setup Route Tables

    Route tables are used to provide instructions to your subnet's virtual router about where network traffic with a given IP address should be sent to. Common destinations include: other instances within the same VPC, an Internet Gateway for external traffic over the Internet, and a peering connection for traffic destined for a different peered VPC. Creating a VPC creates a default route table which is used implicitly if no other route tables are associated with a subnet. However, strictly speaking, route tables are associated with subnets and not with VPCs, which is an important distinction that we will utilize when handling the overlapping peered VPC scenario below. For more information on route tables, see Amazon's documentation.

    Click on the Subnets section to view the available subnets for your account in this region:

    The first 6 subnets belong to the Imply VPC (vpc-ce78dbb7). Imply typically creates two subnets (one for static and one for dynamically allocated addresses) in three different availability zones to provide protection against failures of an availability zone. Note that they are all associated with the route table rtb-7122e509. The subnet labelled non-overlapping is a subnet in the non-overlapping-vpc (vpc-641fee1c) which, for this example, will be the subnet used by the EC2 instance that requires connectivity to Imply Cloud. This subnet uses the route table rtb-8c4492f7. We will need to configure these two route tables to route traffic destined for the other subnet to the VPC peering connection.

    Starting with the route table for the Imply VPC (here rtb-7122e509):

    Click on the link in the Route table column and then click on the Routes tab:

    Routes are matched by finding the most specific entry that matches the destination address. In this example:

    Traffic destined for an address in 10.50.0.0/21 (10.50.0.0 - 10.50.7.255) remain local and are routed to other EC2 instances in this VPC.

    Traffic destined for an address in 172.19.0.0/20 (172.19.0.0-172.19.15.255) are routed to pcx-d3c1e8ba, which is the peering connection link to a VPC in Imply's AWS account. From here this traffic gets routed to one of the EC2 instances powering the Imply Cloud Manager.

    All other traffic (0.0.0.0/0 matches all IPs) is routed to igw-acd20cca which is the Internet Gateway allowing the instances to make outbound requests externally over the Internet.

    To this table, we will need to add an entry to route traffic destined for 172.20.0.0/22 (the range of the non-overlapping subnet) to pcx-6ca76b04 (the VPC peering connection we established). Alternatively, we can specify the range for the entire non-overlapping-vpc VPC (172.20.0.0/20) to allow other subnets within that VPC to also communicate over the peering link.

    Click on Edit, click on Add another route, and using this example, set:

    Destination: 172.20.0.0/22

    Target: pcx-6ca76b04

    Click on Save.

    We will now need to do a similar configuration for the route table rtb-8c4492f7. Here, we will want to add an entry to route all traffic destined for the Imply Cloud VPC (10.50.0.0/21) to the peering link pcx-6ca76b04. Select this route table, click on the Routes tab, click on Edit, click on Add another route, and using this example, set:

    Destination: 10.50.0.0/21

    Target: pcx-6ca76b04

    Click on Save.

    This completes the setup of the peering connection and configuration of the route tables.

    Testing the Peering Connection

    We can test the peering connection by making an API request to Druid from an instance in the peered VPC. We will do this by making a call to the query load balancer rather than to an individual instance. You can get the hostname of the load balancer from the API section of the Cloud Manager:

    Note that for VPC peering connections, we use the Private endpoints rather than the Public ones (which are used when you want the cluster to be accessible over the public Internet).

    Even though the peering connection is in place, the security group for the load balancer will not by default permit traffic from your EC2 instance. In order to allow the connection to be established, we will need to add an entry to the Inbound table of the Imply Cloud ELB Unmanaged security group.

    First, determine which security group is associated with the EC2 instance the request will originate from. From the EC2 Dashboard, click on Instances, and find the details of your instance:

    Here, the security group is launch-wizard-2. Clicking on the security group name will bring you to the Security Groups section where, in this example, we find that launch-wizard-2 is identified with group ID: sg-dee0b1ae.

    Staying in the Security Groups section, look for the security group with the description 'Imply Cloud ELB Unmanaged'. You may have to clear your filter for it to appear in the table. Select this entry, click on the Inbound tab, click Edit, and add a rule as follows:

    Type: All TCP

    Protocol: TCP

    Port Range: 0 - 65535

    Source: Custom: sg-dee0b1ae

    Click on Save.

    You should now be able to make requests from your instance to the Imply Cloud query load balancer, as an example:

    ubuntu@ip-172-20-2-51:~$ curl -k -u admin https://internal-imply-e04-elbinter-qqsuirj9f9ep-901811362.us-west-2.elb.amazonaws.com:9088/druid/v2/datasources

    Enter host password for user 'admin':

    ["wikipedia"]

    ubuntu@ip-172-20-2-51:~$ curl -k -u admin -XPOST -H'Content-Type:application/json' -d '{"queryType": "timeBoundary", "dataSource": "wikipedia"}' https://internal-imply-e04-elbinter-qqsuirj9f9ep-901811362.us-west-2.elb.amazonaws.com:9088/druid/v2/

    Enter host password for user 'admin':

    [{"timestamp":"2016-06-27T00:00:11.080Z","result":{"maxTime":"2016-06-27T21:31:02.498Z","minTime":"2016-06-27T00:00:11.080Z"}}]

    Overlapping VPCs

    We now move on to the case where the Imply VPC needs to be peered to multiple VPCs that have overlapping CIDR blocks. This happens most frequently when the VPC you wish to create a peering connection to has a network range that overlaps the VPC in Imply's AWS account. Continuing the previous example, we now want to establish a peering connection between the Imply VPC (vpc-ce78dbb7, 10.50.0.0/21) and overlapping-vpc (vpc-3607f64e, 172.19.0.0/20). Note that while these two VPCs don't overlap (if they did, a peering connection would not be allowed), the Imply VPC already has a peering connection to vpc-828a54fb which has a CIDR block of 172.19.0.0/20, conflicting with overlapping-vpc. vpc-828a54fb is a VPC in Imply's AWS account, and this peering connection is crucial for the operation of Imply Cloud and cannot be modified or removed.

    AWS allows peering connections to be made to multiple VPCs which have overlapping CIDR blocks, but there is an additional complication when trying to resolve the route table entries: given a packet that should be sent to 172.19.0.3 (an address within the 172.19.0.0/20 CIDR), how does the router know which of the two VPC peering connections to route the message to?

    As mentioned previously, we will need to leave the subnets and route table entries set up by Imply Cloud unmodified, since modifying or removing these will cause your clusters to be unreachable by the Imply Cloud Manager. Remembering that the routers (and their corresponding route tables) are associated with subnets rather than VPCs, our strategy here will be to set up additional subnets with a separate route table that will route traffic for 172.19.0.0/20 to your other VPC instead of Imply's VPC. Into these subnets, we will add a new elastic load balancer which will act as a proxy to the Imply query servers, and the route table rules will ensure that traffic passing through this load balancer will be routed to your VPC instead of Imply's.

    This works particularly nicely in AWS because an auto-scaling group (which is used to manage the query EC2 instances) can be associated with multiple elastic load balancers. Hence, as the EC2 instances are created, fail, and are replaced, the elastic load balancer will continue pointing at the correct online set of instances instead of pointing at instances which have died. When we are done with this setup, the query auto-scaling group will be associated with three load balancers:

    one for communication internally within the VPC and with the VPC in Imply's AWS account (created by Imply)

    one for communication externally over the internet (created by Imply, but by default blocked using firewall rules)

    one for communication internally with your peered VPC using the VPC peering connection (this is what we will be setting up here)

    Create a Peering Connection

    The first step is to create a peering connection using the steps detailed in the similarly named section for the non-overlapping case. In our example, we will be peering the Imply VPC (vpc-ce78dbb7) and overlapping-vpc (vpc-3607f64e):

    Note that the peering connection we created (pcx-ceb27ea6) links us to a VPC with a CIDR 172.19.0.0/20 which is the same network range as the peering connection created by Imply (pcx-d3c1e8ba) and that this is a permitted operation.

    Create New Subnets

    We will need to create an additional subnet for each availability zone used by the Imply Cloud cluster. There will be 2 or 3 availability zones used, depending on which region you are in. To identify the availability zones, inspect the Availability Zone column (or view the detail descriptions) for each Imply Cloud subnet (beginning with imply-...).

    On the Subnet page, click Create subnet. Give your subnet a recognizable name, and ensure that it is created in the Imply VPC (in our example, vpc-ce78dbb7). For the IPv4 CIDR block, you will need to choose a block of IP addresses within the Imply VPC block that doesn't conflict with any of the existing 6 subnets; in our example, we used 10.50.6.0/26. Set the Availability Zone to correspond to one of the zones used by Imply Cloud. Clicking Create gives us the following subnets:

    The one that we created is called imply-overlap (subnet-d5eea2ac). Note that it uses a different route table (rtb-a822e5d0) than the subnets created by Imply Cloud. This is actually the default route table for the Imply VPC which was previously unused, since the 6 subnets created by Imply were explicitly associated with a custom route table. It is fine to use the default route table, but alternatively, you can also create an additional route table and then explicitly associate the imply-overlap subnet with this route table.

    Repeat the previous step until you have created a subnet for each availability zone used by Imply Cloud.

    Setup Route Table

    This is performed similarly to the non-overlapping case, but using the route table associated with the newly created subnets instead of the one used by the existing 6 subnets. In this case, our new route table entries look like this:

    For the route table associated with the new subnets in the Imply VPC (rtb-a822e5d0):

    Destination: 172.19.0.0/20

    Target: pcx-ceb27ea6

    For the route table associated with the subnet in the other VPC that was peered with the Imply VPC (in this example, the route table for overlapping subnet-567e171d, which is rtb-48598f33):

    Destination: 10.50.6.0/26

    Target: pcx-ceb27ea6

    Create a New Elastic Load Balancer

    Now that we have created new subnets which route traffic from 172.19.0.0/20 to your peered VPC, we will need to create an elastic load balancer which uses those subnets and proxies requests to the Imply Cloud query nodes. We will inspect the existing load balancer setup by Imply and create an additional one with similar configuration:

    In the Load Balancers section of the EC2 Dashboard:

    Click on Create Load Balancer

    For this example, we are using the Classic Load Balancer, but a similar setup can be done using one of the newer generation load balancers. Click on Create.

    Load Balancer name: provide a recognizable name for the load balancer

    Create LB Inside: select the Imply Cloud VPC, in this example vpc-ce78dbb7

    Create an internal load balancer: this should be checked so that the load balancer can be used over the peering connection

    Listener Configuration: This should have the same entries as the reference load balancer:

    TCP 8888 -> TCP 8888

    TCP 9088 -> TCP 9088

    TCP 9095 -> TCP 9095

    Select Subnets: select the subnets you created previously (in this example we only have one, subnet-d5eea2ac, but you should have one for each availability zone)

    Click on Next: Assign Security Groups

    Select the security groups with description 'Imply Cloud ELB Unmanaged' and 'Imply Cloud Default'

    Click on Next: Configure Security Settings

    Click on Next: Configure Health Check

    Use the same entries as the reference load balancer:

    Ping Protocol: HTTPS

    Ping Port: 9095

    Ping Path: /health

    Click on Next: Add EC2 Instances

    We will not select any EC2 instances on this page, but will later associate this elastic load balancer with our auto-scaling group.

    Click on Next: Add Tags

    Click on Review and Create and then click Create.

    Note the load balancer name in the AWS console, in our example it is imply-query-my-peering:

    Add Load Balancer to Query Auto Scaling Group

    Go to the Auto Scaling Groups section of the EC2 Dashboard and locate the auto-scaling group tagged with a name of the form imply-{clusterId}-Query-... Right-click on this auto-scaling group and select Edit.

    Modify the Classic Load Balancers section to include an additional entry for the load balancer you created above, and then click on Save.

    To confirm this was done correctly, return to the Load Balancers section and inspect the load balancer you created. Status should show 'x of x instances in service' where x is non-zero, and the Instances tab should show a list of all the query instances in your cluster.

    Testing the Peering Connection

    Similarly to the non-overlapping case, we should now be able to make query requests from our VPC to the Imply VPC, but we will be communicating with the load balancer we created instead of the one listed in the API section of the Imply Cloud Manager. For this example, our load balancer has the DNS name of internal-imply-query-my-peering-1218651714.us-west-2.elb.amazonaws.com. You will still have to add an entry to the Imply Cloud ELB Unmanaged security group to permit inbound access from the origin instance's security group as described in the non-overlapping section.

    If everything was correctly set up, issuing Druid API calls to the new load balancer should return a response:

    ubuntu@ip-172-19-2-170:~$ curl -k -u admin https://internal-imply-query-my-peering-1218651714.us-west-2.elb.amazonaws.com:9088/druid/v2/datasources

    Enter host password for user 'admin':

    ["wikipedia"]

    ubuntu@ip-172-19-2-170:~$ curl -k -u admin -XPOST -H'Content-Type:application/json' -d '{"queryType": "timeBoundary", "dataSource": "wikipedia"}' https://internal-imply-query-my-peering-1218651714.us-west-2.elb.amazonaws.com:9088/druid/v2/

    Enter host password for user 'admin':

    [{"timestamp":"2016-06-27T00:00:11.080Z","result":{"maxTime":"2016-06-27T21:31:02.498Z","minTime":"2016-06-27T00:00:11.080Z"}}]

    Additional Notes about the New Query Load Balancer

    This strategy works by binding an additional load balancer to the query auto-scaling group to allow requests made through this load balancer to be routed back to your peered VPC instead of the VPC in Imply's AWS account. At first glance, this may seem to only allow queries to be made from your VPC, while not supporting calls to Druid's other APIs such as to submit indexing tasks. Actually, the same query load balancer can be used to access Druid's other APIs for the coordinator and overlord services (running on the master nodes) by using the router's management proxy functionality which is enabled by default in Imply Cloud. For more information on this, see the router documentation .

    One caveat about the router management proxy is that it cannot handle requests to serve the coordinator or overlord consoles. If you need access to these consoles over the peering connection, one strategy is to create additional load balancers (one for the coordinator and one for the overlord) according to the steps described above. You would then add these load balancers to the Classic Load Balancers section of the master auto-scaling groups (you will have 1 of these in the non-HA configuration and 3 of them in the HA configuration). For the health check endpoints, use the following:

    Coordinator: HTTPS:8281/druid/coordinator/v1/isLeader

    Overlord: HTTPS:8290/druid/indexer/v1/isLeader

    This will cause the active coordinator and overlord to be the in-service instance, while removing the slave instances from the load balancers and thus preventing traffic from being routed to them. As leadership changes in a highly available clusters, instances will be registered with and removed from the load balancer to ensure that all traffic is routed to the active coordinator/overlord. This configuration will support loading of the coordinator and overlord web consoles over the peered connection.

    View Article
  • Druid can interact with Hadoop in two ways:

    Use HDFS for deep storage using the druid-hdfs-storage extension.

    Batch-load data from Hadoop using Map/Reduce jobs.

    These are not necessarily linked together; you can load data with Hadoop jobs into a non-HDFS deep storage (like S3), and you can use HDFS for deep storage even if you're loading data from streams rather than using Hadoop jobs.

    If you use Hadoop Map/Reduce jobs to load data, then these jobs will scan through your raw data and produce optimized Druid data segments in your configured deep storage. The data will then be loaded by Druid Historical Nodes. Once loading is complete, Hadoop and YARN are not involved in the query path of Druid in any way.

    The main advantage of loading data using Hadoop is that it automatically parallelizes the batch data loading process, it uses YARN resources instead of using your Druid machines (leaving your Druid machines free to handle queries), and it can leverage data that resides in your existing Hadoop cluster.

    For more information, see our documentation at: https://docs.imply.io/on-premise/manage-data/ingestion-files#hadoop-based-ingestion

    Working with specific distributions

    Sometimes, specific configurations are needed to integrate Druid with specific Hadoop distributions. These are generally needed when the versions of common Java dependencies (like Jackson, Guava, and so on) differ between your version of Druid and your version of Hadoop.

    The Druid documentation has some general tips that apply to many distributions. See here for details: http://druid.io/docs/latest/operations/other-hadoop.html. In particular, three go-to tips include:

    Place Hadoop XMLs on Druid's classpath.

    Classloader modification on Hadoop, using "mapreduce.job.classloader" or "mapreduce.job.user.classpath.first". Additionally, "mapreduce.job.classloader.system.classes" can be useful to customize class loading even further. The purpose of all of these settings is to control which versions of which dependencies are used for which code paths.

    Use specific versions of Hadoop libraries. Druid is bundled with Apache Hadoop, but you can install libraries specific to your distribution and have Druid use those instead.

    We have collected some additional tips here for working with specific Hadoop distributions.

    Amazon EMR (Elastic MapReduce)

    Imply Cloud supports integration with Amazon EMR. For a setup guide, see: https://docs.imply.io/cloud/manage-data/emr

    Apache Hadoop

    Place your Hadoop XMLs on Druid's classpath.

    Set"mapreduce.job.classloader": true in your Druid jobProperties.

    Replacing libraries is generally not necessary, since Druid is built against Apache Hadoop by default.

    View Article
  • SYMPTOM : When there is a division in Druid SQL, users may get 0 as result.

    For example: the return on `Saving_ratio` is always 0 in the following case.

    ROOT CAUSE:

    All numbers were ingested as of type `longSum` (integer), and in Druid SQL integer divided by integer uses integer division and rounds down. Therefore, the `Saving_ratio` is always zero since'Save` is less than `Income`

    RESOLUTION:

    Ingest the data as `floatSum` instead:

    "metricsSpec": [

    {

    "type": "count",

    "name": "count"

    },

    {

    "type": "floatSum",

    "name": "Income",

    "fieldName": "Income",

    "expression": null

    },

    {

    "type": "floatSum",

    "name": "Food_spending",

    "fieldName": "Food",

    "expression": null

    },

    {

    "type": "floatSum",

    "name": "Clothing_spending",

    "fieldName": "Clothing",

    "expression": null

    },

    {

    "type": "floatSum",

    "name": "Save",

    "fieldName": "Save",

    "expression": null

    }

    ],

    View Article
  • With Imply UI's latest "SUBSET FILTER FORMULA" feature, users can filter out the unrelated data by default right at start of opening a data cube.

    However, sometimes when a filter is added for a column, the output shows 0 result. For example, user wants to filter only rows with only `Country Iso Code` = `US` in the Wikipedia data source:

    But the result shows 0 results

    ROOT CAUSE: The problem is admin configured SUBSET FILTER FORMULA using the NAME in the configuration, instead of the actual COLUMN:

    After the configuration change, the output shows up correctly.

    View Article
  • ISSUE :Druid SQL doesn't support multi-value dimensions yet. This is a potential feature request.

    WORKAROUND : Try Druid's native query language -- i.e. timeseries, topN, groupBy queries -- instead, rather than SQL when you need to use multi-value dimensions.

    View Article
  • When creating a new Druid DataSource, admins may want to add new dimensions that's not in raw data, and fill the columns with default values. This can be achieved by using `transformSpec` during data ingestion.

    Here is a example of adding two dummy columns `dummyCol1` and `dummyCol2` to `Wikipedia` dataSource in ingestion spec :

    {

    "type" : "index",

    "spec" : {

    "dataSchema" : {

    "dataSource" : "wikipedia",

    "parser" : {

    "type" : "string",

    "parseSpec" : {

    "format" : "json",

    "timestampSpec" : {

    "column" : "timestamp",

    "format" : "iso"

    },

    "dimensionsSpec" : {

    "dimensions" : [ "diffUrl", "isRobot", {

    "name" : "added",

    "type" : "long"

    }, .....

    "name" : "metroCode",

    "type" : "long"

    }, "dummyCol1", "dummyCol2" ]

    }

    }

    },

    "metricsSpec" : [ ],

    "granularitySpec" : {

    ...

    },

    "rollup" : false,

    "intervals" : null

    },

    "transformSpec" : {

    "filter" : null,

    "transforms" : [ {

    "type" : "expression",

    "name" : "dummyCol1",

    "expression" : "nvl(\"dummyCol1\", 'HAPPY')"

    }, {

    "type" : "expression",

    "name" : "dummyCol2",

    "expression" : "nvl(\"dummyCol2\", 'JOY')"

    } ]

    }

    },

    "ioConfig" : {

    "type" : "index",

    "firehose" : {

    "type" : "http",

    "uris" : [ "https://SERVER/data/wikipedia.json.gz" ],

    ...

    },

    "appendToExisting" : false

    },

    "tuningConfig" : {

    ...

    }

    },

    "dataSource" : "wikipedia"

    }

    After ingestion is done, we can see the two new dimensions are created :

    View Article
  • Lookups are a concept in Druid where dimension values are (optionally) replaced with new values. Using lookups can improve querying experience with higher level grouping by default.

    For example, in a scenario where a lookup table containsmultiple dimension values 'US-EN', 'GB-EN' and 'AU-EN' correspond to a common channel name 'English', users can send Druid query with 'English' filter, and the result will return all rows containing 'US-EN', 'GB-EN' or 'AU-EN'.

    Similarly, users can query with filtering either 'US-EN', 'GB-EN' or 'AU-EN', and the result will be presented with 'English' as the channel name for all. This is callednon-injective lookup where two or more values in the domain would map to the same value in the range.

    In a non-default injective lookup scenario, where the lookup table mapping is exclusive for each entry, then users can still get some performance boosts by having the lookup operations be done on the brokers.

    To enable injective lookup, add ""injective":true" under lookups config :

    For more details about injective property in lookups, please refer to: http://druid.io/docs/latest/development/extensions-core/lookups-cached-global.html

    View Article

Curious about Imply Data?

Anonymously Ask Imply Data Any Question

Ask Anonymous Question

×
Rate your company