Imply was founded by members of the original Druid team, who also built the ingestion and visualization layers around the project.

CEO: Fangjin Yang

Imply Data FAQs

Frequently Asked Questions About Imply Data

Ingesting Data to Druid from Azure Event Hubs

WARNING: This was performed as a proof of concept and has not been tested in production environments.
Azure Event Hubs is a fully managed, real-time data ingestion service that can stream millions of events per second. It is conceptually very similar to Apache Kafka.
Kafka and Event Hub Conceptual Mapping
Kafka Concept    | Event Hub Concept
Cluster          | Namespace
Topic            | Event Hub
Partition        | Partition
Consumer Group   | Consumer Group
Offset           | Offset
Azure Event Hubs allows publishing and consuming data via the Apache Kafka API. See the link below for more information:
https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-for-kafka-ecosystem-overview
Apache Druid's Kafka ingestion service uses the Kafka API to consume messages. This article shows how to use the same mechanism to consume messages from Azure Event Hubs via the Kafka API.
Setup Namespace and Event Hub in Azure
Log in to the Azure Portal, select Event Hubs from Azure Services, and click "Add" to create a new namespace.
In the Create Namespace UI, make sure "Enable Kafka" is checked. This allows the Event Hub to be accessed via the Kafka API.
Make a note of the namespace name you provide, here imply-data-namespace.servicebus.windows.net. This is the bootstrap server that the Kafka consumer needs.
Once created, the namespace shows as active.
Click the namespace to open its details page, then click "Event Hub" to add a new event hub (a topic in Kafka terminology).
Provide the topic name in the Create Event Hub screen and adjust the partition and message retention settings as needed.
This creates a new event hub. Next, add an access policy to it; the policy information is needed to connect to this topic from the Kafka API. Open the details page of the newly created event hub, click "Shared access policies", and then click "Add" to add a new access policy.
Create an access policy that grants all access.
Once created, copy the "Connection string-primary key" value. It is needed for access via the Kafka API.
Now we are all set to start consuming data from this event hub. The following parameters must be provided to consume data:
"bootstrap.servers":"imply-data-namespace.servicebus.windows.net:9093",
"security.protocol":"SASL_SSL",
"sasl.mechanism":"PLAIN",
"sasl.jaas.config":"org.apache.kafka.common.security.plain.PlainLoginModule required username=\"$ConnectionString\" password=\"<text copied from Connection string-primary key>\";"
See the Druid Data Loader section for more information on this.
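The four parameters above can be assembled programmatically, which avoids hand-escaping the quotes inside sasl.jaas.config. The following is a minimal Python sketch, not part of the original procedure; the function name event_hubs_consumer_properties and the <key> placeholder are hypothetical.

```python
import json


def event_hubs_consumer_properties(namespace_host, connection_string):
    """Build Kafka consumer properties for an Event Hubs namespace.

    namespace_host: e.g. "imply-data-namespace.servicebus.windows.net"
    connection_string: the "Connection string-primary key" value from Azure
    """
    # The username is the literal string $ConnectionString; the password is
    # the full connection string copied from the access policy.
    jaas = (
        "org.apache.kafka.common.security.plain.PlainLoginModule required "
        'username="$ConnectionString" '
        f'password="{connection_string}";'
    )
    return {
        "bootstrap.servers": f"{namespace_host}:9093",
        "security.protocol": "SASL_SSL",
        "sasl.mechanism": "PLAIN",
        "sasl.jaas.config": jaas,
    }


props = event_hubs_consumer_properties(
    "imply-data-namespace.servicebus.windows.net",
    "Endpoint=sb://imply-data-namespace.servicebus.windows.net/;"
    "SharedAccessKeyName=all-access;SharedAccessKey=<key>;EntityPath=new-topic",
)
# json.dumps escapes the inner double quotes as \" automatically.
print(json.dumps(props, indent=2))
```

Serializing with json.dumps rather than writing the JSON by hand keeps the escaped quotes consistent with the form shown above.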
Druid Data Loader to Azure Event Hubs
I have an Imply quickstart single-node cluster running on an Azure VM. See the link below for more information on the quickstart guide:
https://docs.imply.io/on-prem/quickstart
This example uses the Apache Kafka data loader to load data from Azure Event Hub. For a detailed look at the data loader for Kafka, see the tutorial below:
https://imply.io/videos/druid-data-loader-kafka-walk-through
Below is the configuration in the data loader for connecting to Azure Event Hub
Below is the complete ioConfig the above configuration generates:
"ioConfig": {
  "type": "kafka",
  "consumerProperties": {
    "bootstrap.servers": "imply-data-namespace.servicebus.windows.net:9093",
    "group.id": "$Default",
    "request.timeout.ms": "60000",
    "security.protocol": "SASL_SSL",
    "sasl.mechanism": "PLAIN",
    "sasl.jaas.config": "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"$ConnectionString\" password=\"Endpoint=sb://imply-data-namespace.servicebus.windows.net/;SharedAccessKeyName=all-access;SharedAccessKey=keyFromAzureUI=;EntityPath=new-topic\";"
  },
  "topic": "new-topic"
},
NOTE - the value for sasl.jaas.config takes the following form, where the password is the full connection string copied from Azure:
"sasl.jaas.config": "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"$ConnectionString\" password=\"<text copied from Connection string-primary key>\";"
I have already published some data to the Event Hub using the Kafka API, so clicking Preview should show the data from Event Hub in the Druid data loader.
There you have it: data from Azure Event Hub can be consumed via the Druid data loader, and a supervisor task can be submitted to create a Druid datasource. To create datasources and supervisor tasks, follow the Druid data loader tutorial.
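The connection string copied from Azure already contains the namespace host and, via EntityPath, the topic name, so the bootstrap server and topic can be derived from it rather than typed twice. A small Python sketch of that idea follows; the helper names are hypothetical, and the only assumption beyond the article is that the string is a semicolon-separated list of key=value pairs, as in the example above.

```python
from urllib.parse import urlparse


def parse_connection_string(conn):
    """Split an Event Hubs connection string into a dict of its parts."""
    parts = {}
    for item in conn.strip().strip(";").split(";"):
        # Split on the first "=" only: SharedAccessKey values may end in "=".
        key, _, value = item.partition("=")
        parts[key] = value
    return parts


def bootstrap_and_topic(conn):
    """Return (bootstrap.servers, topic) for the Kafka endpoint on port 9093."""
    parts = parse_connection_string(conn)
    host = urlparse(parts["Endpoint"]).hostname
    return f"{host}:9093", parts["EntityPath"]


conn = (
    "Endpoint=sb://imply-data-namespace.servicebus.windows.net/;"
    "SharedAccessKeyName=all-access;SharedAccessKey=keyFromAzureUI=;"
    "EntityPath=new-topic"
)
print(bootstrap_and_topic(conn))
```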
Balancer move segments queue has a segment stuck

Issue Description
"Balancer move segments queue has a segment stuck" in the coordinator log may be caused by the coordinator looking in the wrong place for the segment/s it wants to move.
Implications
When this happens, the coordinator cannot balance segments on the cluster. You may see balancing happen, but very slowly. The coordinator log will contain a list of errors for the affected historical node.
Symptoms:
Check the coordinator log for errors similar to this:
Failed to connect to host[<host name>:8283]
connection timed out: <host name>/<ip>:8283
Debugging
Run the telnet command from the host running the coordinator process to test connectivity to the historical node, using the same host and port shown in the log. It should fail to connect.
Now run the same command on the failing historical node itself; it should connect.
> telnet <hostname> <port>
Resolution
Check the hosts file on the coordinator for a missing or incorrect entry for the failing historical node. Once the hosts file is fixed, the issue will resolve.
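Where telnet is not installed, the same connectivity test can be approximated with a few lines of Python. This is only a sketch of the check described under Debugging; the function name can_connect and the example host name in the comment are hypothetical.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name (using the hosts file and DNS)
        # and then attempts the TCP handshake, much like telnet does.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers resolution failures, refused connections, and timeouts.
        return False


# Example, run on the coordinator host with the host and port from the log:
#   can_connect("historical-1.example.internal", 8283)
```

Running it on the coordinator host and then on the historical node itself mirrors the two telnet runs: a False on the coordinator and a True on the historical points to name resolution on the coordinator.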
Connecting Microsoft Excel or Power BI to Apache Druid via Power Query

Warning
This is not a recommended or supported process. It was done as a proof of concept only, and has significant limitations. As Power Query attempts to pull the entire result set of the query onto the client, this process does not scale to "big data" amounts of data.
Prerequisites
This POC was done using the Imply Quickstart. You can find instructions on setting up the Quickstart here: https://docs.imply.io/on-prem/quickstart
It was completed using Microsoft Excel 2019 on Windows. Power Query was called "Get and Transform" in Excel 2016 and was not available in earlier versions. At the time of writing, it was only available for Windows.
Alright... let's get into it.
From Excel go to Data-->Get Data-->From Other Sources -->Blank Query
With Power Query open enter the following expression, replacing the IP address with your Broker IP or hostname and the query value with your SQL query.
= Csv.Document(Web.Contents("http://192.168.56.1:8888/druid/v2/sql/", [Headers=[#"Content-Type"="application/json"], Content=Text.ToBinary("{""query"": ""select * from wikipedia limit 10000"",""resultFormat"":""csv"",""header"":""true""}")]))
Pressing the Check Mark will execute the query and display a preview
You will notice that the headers are not in the header row, so the next step is to click Use First Row as Headers
After that is complete you can give your query a name, and click Close & Load
And there you have it, you have extracted data from Apache Druid into Excel.
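For comparison, the Power Query expression above is an HTTP POST of a small JSON document to the Druid SQL endpoint. The Python sketch below builds the same request outside Excel; the function names are hypothetical, and it passes header as a JSON boolean where the expression above passes the string "true".

```python
import json
import urllib.request


def build_sql_request(base_url, sql):
    """Build the POST request that the Power Query expression sends."""
    body = json.dumps(
        {"query": sql, "resultFormat": "csv", "header": True}
    ).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/druid/v2/sql/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_csv(base_url, sql, timeout=30):
    """Run the query and return the CSV text (first line is the header row)."""
    request = build_sql_request(base_url, sql)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


request = build_sql_request(
    "http://192.168.56.1:8888", "select * from wikipedia limit 10000"
)
print(request.full_url)
```

As with Power Query, the whole result set is pulled to the client, so the same scaling caveat applies.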
Avro ingestion from Kafka using Confluent Schema Registry

This article provides steps for one method to test avro ingestion locally using the Imply distribution. The example sends nested avro using parser type: avro_stream and avroBytesDecoder type: schema_registry.
1. Add the extension "druid-avro-extensions" to the loadList in conf-quickstart/druid/_common/common.runtime.properties if running in quickstart mode, or in conf/druid/_common/common.runtime.properties if running in cluster mode.
2. Download Confluent quickstart, install: https://docs.confluent.io/current/quickstart/cos-quickstart.html. We will be using Confluent's supplied Kafka and schema-registry, so make sure no other Kafka process is currently running (port :9092).
3. Start Confluent from the Confluent directory. Your terminal should look something like this:
bin/confluent start
Using CONFLUENT_CURRENT: /var/folders/yy/dk4fdc7n2zdf2gn1n5sblq980000gn/T/confluent.ZLnYVjH0
Starting zookeeper
zookeeper is [UP]
Starting kafka
kafka is [UP]
Starting schema-registry
schema-registry is [UP]
Starting kafka-rest
kafka-rest is [UP]
Starting connect
connect is [UP]
Starting ksql-server
ksql-server is [UP]
If one of the services doesn't start, you may have port conflicts. After fixing conflicts, you may need to cycle the affected service, e.g.:
bin/kafka-rest-stop
bin/kafka-rest-start
bin/zookeeper-server-stop
bin/zookeeper-server-start
4. If running Confluent and Imply on different machines, skip this step.
Before starting the Imply quickstart, change the ports for the coordinator and broker in conf-quickstart/druid/broker/runtime.properties and conf-quickstart/druid/coordinator/runtime.properties to avoid conflicts with Confluent ports. Change the zookeeper port in conf-quickstart/zk/zoo.cfg.
If you plan to use the Imply UI and you have never launched localhost:9095, it will be sufficient to modify the coordinator and broker ports in conf-quickstart/pivot/config.yaml, where "host" = the broker. If you have previously used this Imply UI distribution, modify the port values in the UI under Settings -> My Druid.
If you don't want to change Imply ports, you will need to change ports in the Confluent distribution.
5. Create a json record file that represents the Avro you will be ingesting. For our example this record (location.json) is:
{"hilltop":{"timestamp":"2018-06-21T18:00:00Z","view":"ocean"}}
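6. Create a schema that describes your data. The schema for location.json is the Avro record supplied as value.schema in step 7. Because the record is nested, it is flattened at ingestion time with a flattenSpec:
http://druid.io/docs/latest/ingestion/flatten-json.html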
7. Create a Kafka topic ("avro-test") and send the schema registry and sample file to Confluent Kafka:
bin/kafka-avro-console-producer --broker-list localhost:9092 --topic avro-test --property value.schema='{"type":"record","name":"location","fields":[{"name":"hilltop","type":{"type":"record","name":"anotherNameDescribingType","fields":[{"name":"timestamp","type":"string","doc":"Local time","default":""},{"name":"view","type":"string","doc":"doYouSeeWhatISee","default":""}]}}]}' < location.json
8. In a different terminal window, check that the topic has been created:
bin/kafka-topics --list --zookeeper localhost:2181
__confluent.support.metrics
__consumer_offsets
_confluent-ksql-default__command_topic
_schemas
avro-test
9. Ingest the record into Druid by submitting the supervisor spec from the Imply directory. If running Confluent and Imply on separate machines, edit the "url" and "bootstrap.servers" in the spec.
curl -XPOST -H'Content-Type: application/json' -d @quickstart/avroSupervisor.json http://localhost:8090/druid/indexer/v1/supervisor
{"id":"location"}\% <- the output id will be the name of the dataSource.
This supervisor spec looks like this:
Note that the parseSpec format is "avro". The Confluent process is taking the json file and translating it into avro, and Druid is consuming the avro data.
See the flatten-json documentation linked in step 6 for more about creating flattenSpecs.
10. Check that Druid has received the data, either from the terminal:
bin/dsql -H localhost:8084 # broker port
Welcome to dsql, the command-line client for Druid SQL.
Type "\h" for help.
dsql> \d
TABLE_SCHEMA TABLE_NAME
druid location
druid wikipedia
Retrieved 4 rows in 0.03s.
dsql> select * from "location";
__time view
2018-06-21T18:00:00.000Z ocean
Retrieved 1 row in 0.14s.
or directly in the Imply UI (localhost:9095).
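The supervisor spec submitted in step 9 is not reproduced in this article. As a rough illustration only, a spec for this example might look like the Python sketch below, which assumes the parser-based spec layout used by Druid releases of that period, a schema registry at http://localhost:8081, and JSONPath flatten expressions for the nested "hilltop" record. Every value that does not appear in the steps above is an assumption, and this is not the original avroSupervisor.json.

```python
import json

# Hypothetical reconstruction; field values not named in the article are assumptions.
supervisor_spec = {
    "type": "kafka",
    "dataSchema": {
        "dataSource": "location",  # matches the id returned in step 9
        "parser": {
            "type": "avro_stream",
            "avroBytesDecoder": {
                "type": "schema_registry",
                "url": "http://localhost:8081",  # assumed registry address
            },
            "parseSpec": {
                "format": "avro",
                "timestampSpec": {"column": "timestamp", "format": "iso"},
                "flattenSpec": {
                    "useFieldDiscovery": False,
                    "fields": [
                        # Pull the nested fields up to top-level columns.
                        {"type": "path", "name": "timestamp", "expr": "$.hilltop.timestamp"},
                        {"type": "path", "name": "view", "expr": "$.hilltop.view"},
                    ],
                },
                "dimensionsSpec": {"dimensions": ["view"]},
            },
        },
        "metricsSpec": [],
        "granularitySpec": {
            "type": "uniform",
            "segmentGranularity": "DAY",
            "queryGranularity": "NONE",
        },
    },
    "ioConfig": {
        "topic": "avro-test",
        "consumerProperties": {"bootstrap.servers": "localhost:9092"},
        "useEarliestOffset": True,
    },
}

print(json.dumps(supervisor_spec, indent=2))
```

The two values to edit when Confluent and Imply run on separate machines are the decoder "url" and "bootstrap.servers", as noted in step 9.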
Kafka, SASL / SCRAM credentials in supervisor spec

If Kafka is configured with SASL / SCRAM, then Druid, as the Kafka consumer, should pass the SASL / SCRAM credentials in the consumerProperties block of the ioConfig section of the Kafka supervisor spec:
"ioConfig": {
  "topic": "some_topic",
  "consumerProperties": {
    "bootstrap.servers": "host1:9092,host2:9092,host3:9092",
    "sasl.mechanism": "SCRAM-SHA-256",
    "security.protocol": "SASL_SSL",
    "sasl.jaas.config": "org.apache.kafka.common.security.scram.ScramLoginModule required username=\"some_userID\" password=\"some_password\";"
  },
  "taskCount": 1,
  "replicas": 1,
  "taskDuration": "PT1H"
}
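Hand-written JSON of this kind is easy to break with a missing comma or an unescaped quote. One way to sidestep that, sketched here in Python with a hypothetical helper name, is to build the ioConfig as a dictionary and let the JSON serializer handle punctuation and escaping. The rejection of quotes and backslashes in credentials is a conservative assumption for this sketch, not a documented Kafka rule.

```python
import json


def scram_io_config(topic, bootstrap_servers, username, password,
                    mechanism="SCRAM-SHA-256"):
    """Build the ioConfig section for a SASL/SCRAM-secured Kafka cluster."""
    for value in (username, password):
        if '"' in value or "\\" in value:
            # Escaping rules inside sasl.jaas.config are out of scope here.
            raise ValueError("credentials with quotes or backslashes need manual escaping")
    jaas = (
        "org.apache.kafka.common.security.scram.ScramLoginModule required "
        f'username="{username}" password="{password}";'
    )
    return {
        "topic": topic,
        "consumerProperties": {
            "bootstrap.servers": ",".join(bootstrap_servers),
            "sasl.mechanism": mechanism,
            "security.protocol": "SASL_SSL",
            "sasl.jaas.config": jaas,
        },
        "taskCount": 1,
        "replicas": 1,
        "taskDuration": "PT1H",
    }


io_config = scram_io_config(
    "some_topic",
    ["host1:9092", "host2:9092", "host3:9092"],
    "some_userID",
    "some_password",
)
print(json.dumps({"ioConfig": io_config}, indent=2))
```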
How do kill tasks work?

When a kill task is submitted, it completes these three steps:
1. Gets a list of all segments marked as unused (used=0) in the metadata store.
2. Loops through each segment, deletes from druid_segments where id = x. This process takes a couple hundred milliseconds per segment, so deleting 10k segments can take ~30 min.
3. Once step 2 is complete, removes each segment from deep storage.
If the kill task gets interrupted, there is a chance that segments get deleted from the metadata store (step 2) but not deep storage. If this happens, segments must be removed from deep storage manually.
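The time estimate in step 2 is simple arithmetic. A short Python sketch, taking "a couple hundred milliseconds" as 0.2 seconds per segment (an assumed round figure), shows where the ~30 minutes comes from:

```python
def estimate_metadata_delete_seconds(segment_count, seconds_per_segment=0.2):
    """Rough duration of step 2: one metadata-store delete per segment."""
    return segment_count * seconds_per_segment


seconds = estimate_metadata_delete_seconds(10_000)
minutes = seconds / 60
print(f"{seconds:.0f} s, about {minutes:.0f} min")
```

At 0.2 seconds each, 10,000 segments take about 2,000 seconds, or roughly 33 minutes, which is a long window in which an interrupted task can leave deep storage and the metadata store out of sync.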
How to launch multi-node on-prem Imply Manager with Kubernetes

Objective:
In this article, we go step by step through deploying Imply Manager on-prem using Kubernetes on 4 nodes: 1x master, 1x query, and 2x data nodes.
Environment setup:
1. We use an external ZooKeeper for Druid.
2. We use external HDFS as deep storage.
3. We use an external MySQL as the metadata DB for both Imply Manager and Druid.
4. We use CentOS 7 on all the nodes, but the OS in the Imply on-prem manager is Ubuntu based.
Prerequisites:
0. Create a user account at https://www.docker.com
Note: Please contact an Imply sales representative FIRST for account setup and provide your Docker username, so that we can grant permission to download from our private repo.
Install Docker on all 4 nodes:
1. Install yum and git utilities on all nodes:
yum install -y yum-utils device-mapper-persistent-data lvm2
yum install -y git
yum-config-manager --enable rhel-7-server-extras-rpms
2. Setup Docker repo, and install Docker (community edition) on all nodes.
yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
yum install -y ftp://bo.mirror.garr.it/1/slc/centos/7.1.1503/extras/x86_64/Packages/container-selinux-2.9-4.el7.noarch.rpm
yum -y install docker-ce docker-ce-cli containerd.io
3. Start Docker and verify successful installation
# systemctl start docker
# docker version
Client:
Version: 18.09.2
API version: 1.39
Go version: go1.10.6
Git commit: 6247962
Built: Sun Feb 10 04:13:27 2019
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 18.09.2
API version: 1.39 (minimum version 1.12)
Go version: go1.10.6
Git commit: 6247962
Built: Sun Feb 10 03:47:25 2019
OS/Arch: linux/amd64
Experimental: false
Install kubernetes on all 4 nodes:
1. Add Kubernetes repo to all nodes:
[root@ip-172-31-7-220 Docker]# cat /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
2. Configure environment settings on all nodes:
# setenforce 0
# vi /etc/selinux/config
SELINUX=permissive ##Change if it is enforcing
# cat <<EOF > /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
EOF
3. Install Kubernetes on all nodes:
[root@ip-172-31-7-220 Docker]# yum -y install kubelet kubeadm kubectl
Build a Kubernetes cluster
1. Initialize Kubernetes admin on one of the nodes, as root user, due to privilege requirements.
[root@ip-172-31-7-220 ec2-user]# kubeadm init
[init] Using Kubernetes version: v1.15.1
.
Your Kubernetes master has initialized successfully!
To start using your cluster, you need to run the following as a regular user:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
https://kubernetes.io/docs/concepts/cluster-administration/addons/
You can now join any number of machines by running the following on each node
as root:
kubeadm join 172.31.7.220:6443 --token 4mhtc3.27s6iiaxcggwsi38 --discovery-token-ca-cert-hash sha256:be9ad8be83276854ba93a213bbddc61af476bb016f9f7939a713baec2fe9c7d1
2. Follow the on-screen instructions to copy the config files over as a regular user ("ec2-user" in this case).
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
3. Follow the on-screen instructions to add the other nodes to this Kubernetes cluster.
# kubeadm join 172.31.7.220:6443 --token 4mhtc3.27s6iiaxcggwsi38 --discovery-token-ca-cert-hash sha256:be9ad8be83276854ba93a213bbddc61af476bb016f9f7939a713baec2fe9c7d1
..
4. Verify that all nodes have joined the Kubernetes cluster by running 'kubectl get nodes' on the master.
[ec2-user@ip-172-31-7-220 sysctl.d]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-172-31-4-67.us-west-2.compute.internal NotReady <none> 34m v1.15.1
ip-172-31-5-160.us-west-2.compute.internal NotReady <none> 37m v1.15.1
ip-172-31-5-3.us-west-2.compute.internal NotReady <none> 35m v1.15.1
ip-172-31-7-220.us-west-2.compute.internal NotReady master 39m v1.15.1
Note that all the nodes show 'NotReady' under the STATUS column.
5. To make the nodes ready, we need to create a pod network. Here we apply the Weave Net manifest:
# kubectl apply -n kube-system -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version|base64 | tr -d '\n')"
serviceaccount/weave-net created
clusterrole.rbac.authorization.k8s.io/weave-net created
clusterrolebinding.rbac.authorization.k8s.io/weave-net created
role.rbac.authorization.k8s.io/weave-net created
rolebinding.rbac.authorization.k8s.io/weave-net created
daemonset.extensions/weave-net created
6. Give it some time and the nodes will become Ready.
[ec2-user@ip-172-31-7-220 ec2-user]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-172-31-4-67.us-west-2.compute.internal NotReady <none> 34m v1.15.1
ip-172-31-5-160.us-west-2.compute.internal NotReady <none> 37m v1.15.1
ip-172-31-5-3.us-west-2.compute.internal NotReady <none> 35m v1.15.1
ip-172-31-7-220.us-west-2.compute.internal NotReady master 39m v1.15.1
[root@ip-172-31-7-220 ec2-user]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-172-31-4-67.us-west-2.compute.internal Ready <none> 36m v1.15.1
ip-172-31-5-160.us-west-2.compute.internal Ready <none> 38m v1.15.1
ip-172-31-5-3.us-west-2.compute.internal Ready <none> 37m v1.15.1
ip-172-31-7-220.us-west-2.compute.internal Ready master 40m v1.15.1
7. (OPTIONAL) Kubernetes also relies on labeling nodes to assign distributed services. Here we assign one node as Druid's master, two as data, and the last as query. Skipping this step will let Kubernetes decide which node to spawn which service.
# kubectl label nodes ip-172-31-7-220.us-west-2.compute.internal nodeType=master
# kubectl label nodes ip-172-31-5-160.us-west-2.compute.internal nodeType=data
# kubectl label nodes ip-172-31-5-3.us-west-2.compute.internal nodeType=data
# kubectl label nodes ip-172-31-4-67.us-west-2.compute.internal nodeType=query
8. (OPTIONAL) By default, no pods are scheduled on the Kubernetes master, which is also the recommended setting for production environments. For this lab, however, we remove this restriction.
kubectl taint nodes --all node-role.kubernetes.io/master-
Now we are ready to spawn the Imply on-prem using Kubernetes.
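The readiness check in steps 4 and 6 can also be scripted instead of read by eye. The Python sketch below parses the plain-text output of 'kubectl get nodes' as shown above and lists the nodes whose STATUS is not exactly "Ready"; the function name is hypothetical, and the parsing assumes the default column layout.

```python
def not_ready_nodes(kubectl_output):
    """Return node names whose STATUS column is anything other than 'Ready'."""
    lines = [line for line in kubectl_output.strip().splitlines() if line.strip()]
    pending = []
    for line in lines[1:]:  # skip the NAME STATUS ROLES AGE VERSION header
        name, status = line.split()[:2]
        if status != "Ready":
            pending.append(name)
    return pending


sample = """\
NAME STATUS ROLES AGE VERSION
ip-172-31-4-67.us-west-2.compute.internal Ready <none> 36m v1.15.1
ip-172-31-5-160.us-west-2.compute.internal NotReady <none> 38m v1.15.1
ip-172-31-5-3.us-west-2.compute.internal Ready <none> 37m v1.15.1
ip-172-31-7-220.us-west-2.compute.internal Ready master 40m v1.15.1
"""
print(not_ready_nodes(sample))
```

An empty list means every node has become Ready and the deployment can proceed.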
Install helm on the master node:
1. Install Homebrew (Linuxbrew) first
[ec2-user@ip-172-31-7-220 ~]$ sh -c "$(curl -fsSL https://raw.githubusercontent.com/Linuxbrew/install/master/install.sh)"
==> Installing Ruby to /home/linuxbrew/.linuxbrew/Homebrew/Library/Homebrew/vendor
2. Install "Kubernetes-helm"
[ec2-user@ip-172-31-7-220 ~]$ brew install kubernetes-helm
3. Initialize helm:
[ec2-user@ip-172-31-7-220 ec2-user]# helm init --history-max 200
Configure and install Imply on-prem Manager
Now we are ready to deploy the Imply Manager on this 4-node cluster.
1. Add the Imply Manager repo, update using the helm, and verify:
[ec2-user@ip-172-31-7-220 ec2-user]# helm repo add imply https://static.imply.io/onprem/helm
[ec2-user@ip-172-31-7-220 ec2-user]# helm repo update
Hang tight while we grab the latest from your chart repositories...
...Skip local chart repository
...Successfully got an update from the "imply" chart repository
...Successfully got an update from the "stable" chart repository
Update Complete.
[ec2-user@ip-172-31-7-220 ~]$ helm repo list
NAME URL
stable https://kubernetes-charts.storage.googleapis.com
local http://127.0.0.1:8879/charts
imply https://static.imply.io/onprem/helm
[ec2-user@ip-172-31-7-220 ~]$ helm search imply
NAME CHART VERSION APP VERSION DESCRIPTION
imply/imply 0.1.1 3.0.6-PREVIEW Imply Manager (PREVIEW)
2. Create a Kubernetes image pull secret using your Docker login credentials:
[ec2-user@ip-172-31-7-220 ~]$ kubectl create secret docker-registry regcred --docker-server=https://index.docker.io/v1/ \
--docker-username=<USERNAME> --docker-email=<EMAIL> --docker-password=<PASSWORD>
secret/regcred created
3. Download the helm chart from Imply
[ec2-user@ip-172-31-7-220 ~]$ helm fetch --untar imply/imply
4. Find the "values.yaml" file under the "imply" directory you just downloaded, and modify it as in the following example:
[ec2-user@ip-172-31-7-220 imply]$ cat values.yaml
nameOverride:
fullnameOverride:
images:
  manager:
    repository: imply/onprem-manager
    tag: 3.0.6-PREVIEW
  agent:
    repository: imply/onprem-agent
    tag: 2019-07-30-PREVIEW
  pullPolicy: IfNotPresent
imagePullSecrets:
  - name: regcred
manager:
  licenseKey: |
    {"name":"Ming","expiryDate":"3000-01-01","features":["alerts","cross-tab"]}|2019-07-25|<HASH>
  initialClusterName: default
  metadataStore:
    type: mysql
    host: ip-172-31-4-67.us-west-2.compute.internal
    port: 3306
    user: root
    password: hadoop
    database: imply-manager
  resources:
    requests:
      cpu: 200m
      memory: 500M
  nodeSelector:
    nodeType: master
  tolerations: []
  affinity: {}
  annotations: {}
  serviceType: ClusterIP
  port: 9097
druid:
  metadataStore:
    type: mysql
    host: ip-172-31-4-67.us-west-2.compute.internal
    port: 3306
    user: root
    password: hadoop
    database: imply-k8s
  zk:
    connectString: ip-172-31-4-67.us-west-2.compute.internal,ip-172-31-5-160.us-west-2.compute.internal,ip-172-31-5-3.us-west-2.compute.internal
    basePath: imply
  deepStorage:
    type: hdfs
    baseLocation: "hdfs://ip-172-31-7-220.us-west-2.compute.internal/druid-k8s/segments"
ingress:
  enabled: false
  annotations: {}
  manager:
    host: manager.mydomain.com
    path: /*
  pivot:
    host: mydomain.com
    path: /*
  router:
    host: api.mydomain.com
    path: /*
  tls: []
nfsVolumeClaim:
  enabled: false
  mountPath: "/mnt/nfs"
  baseLocationOverride: "/mnt/nfs/druid"
  storageClassName: imply-nfs
  resources:
    requests:
      storage: 8Gi
master:
  replicaCount: 1
  resources:
    requests:
      cpu: 200m
      memory: 800M
  nodeSelector:
    nodeType: master
  tolerations: []
  affinity: {}
  annotations: {}
  coordinatorPort: 8081
  overlordPort: 8090
query:
  replicaCount: 1
  resources:
    requests:
      cpu: 300m
      memory: 1200M
  nodeSelector:
    nodeType: query
  tolerations: []
  affinity: {}
  annotations: {}
  serviceType: ClusterIP
  routerPort: 8888
  routerTlsPort: 9088
  pivotPort: 9095
data:
  replicaCount: 2
  resources:
    requests:
      cpu: 300m
      memory: 1500M
  nodeSelector:
    nodeType: data
  tolerations: []
  affinity: {}
  annotations: {}
Helm Chart brief explanation :
- "images: manager: repository:" and "images: agent: repository:" : This chart will download from Imply's two private repos: "imply/onprem-manager" and "imply/onprem-agent"
- "manager: licenseKey:" : You need to obtain the license key to be able to use Imply Manager
- "manager:metadataStore:" and "druid:metadataStore" : The external MySQL DB is set on host "ip-172-31-4-67.us-west-2.compute.internal", with username and password, database name provided here, for both Imply Manager, and Druid to use.
- "manager:nodeSelector" : Kubernetes node that's labeled with "nodeType=master" will host the Imply Manager service
- "manager:port" : Imply Manager runs on port 9097 by default
- "druid:zk:" and "druid:deepStorage" : Information of the external zookeeper and deep storage is also provided here
- "master: replicaCount:" : decides number of pods spawned to run the master processes: Overlord and Coordinator. In our example, we only configure one master service.
Similar configurations are set for "query" and "data" services as well.
5. Go to the parent directory, and Deploy Imply with helm:
[ec2-user@ip-172-31-7-220 ~]$ cd ..
[ec2-user@ip-172-31-7-220 ~]$ helm install imply
NAME: rafting-iguana
LAST DEPLOYED: Wed Aug 21 17:58:42 2019
NAMESPACE: default
STATUS: DEPLOYED
Note helm generated a random instance name for us: "rafting-iguana"
6. Verify:
[ec2-user@ip-172-31-7-220 ~]$ helm status rafting-iguana
LAST DEPLOYED: Wed Aug 21 17:58:42 2019
NAMESPACE: default
STATUS: DEPLOYED
...
==> v1/Pod(related)
NAME READY STATUS RESTARTS AGE
rafting-iguana-imply-data-b4d7d6bdf-kwnx2 0/1 Pending 0 12m
rafting-iguana-imply-data-b4d7d6bdf-m7t2b 0/1 Pending 0 12m
rafting-iguana-imply-manager-dcb9bc875-qzf9g 0/1 Pending 0 12m
rafting-iguana-imply-master-54876c5658-fph2t 0/1 Pending 0 12m
rafting-iguana-imply-query-fd6b7cfd6-gx4dc 0/1 Pending 0 12m
rafting-iguana-mysql-6dc56fbb8f-krwpl 0/1 Pending 0 12m
rafting-iguana-nfs-server-provisioner-0 1/1 Running 0 12m
rafting-iguana-zookeeper-0 0/1 Pending 0 12m
==> v1/Secret
NAME TYPE DATA AGE
rafting-iguana-mysql Opaque 2 12m
==> v1/Service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
rafting-iguana-imply-manager ClusterIP 10.107.154.8 9097/TCP,9989/TCP,9998/TCP,9999/TCP 12m
rafting-iguana-imply-query ClusterIP 10.102.158.76 8888/TCP,9088/TCP,9095/TCP 12m
rafting-iguana-mysql ClusterIP 10.107.64.253 3306/TCP 12m
rafting-iguana-nfs-server-provisioner ClusterIP 10.97.196.46 2049/TCP,20048/TCP,51413/TCP,51413/UDP 12m
rafting-iguana-zookeeper ClusterIP 10.97.210.171 2181/TCP 12m
rafting-iguana-zookeeper-headless ClusterIP None 2181/TCP,3888/TCP,2888/TCP 12m
Wait a while until the Imply pods are ready (READY 1/1, STATUS Running):
[ec2-user@ip-172-31-7-220 ~]$ kubectl get pods
NAME READY STATUS RESTARTS AGE
rafting-iguana-imply-data-b4d7d6bdf-kwnx2 1/1 Running 2 29m
rafting-iguana-imply-data-b4d7d6bdf-m7t2b 1/1 Running 2 29m
rafting-iguana-imply-manager-dcb9bc875-qzf9g 1/1 Running 0 29m
rafting-iguana-imply-master-54876c5658-fph2t 1/1 Running 0 29m
rafting-iguana-imply-query-fd6b7cfd6-gx4dc 1/1 Running 1 29m
rafting-iguana-mysql-6dc56fbb8f-krwpl 0/1 Pending 0 29m
rafting-iguana-nfs-server-provisioner-0 1/1 Running 0 29m
rafting-iguana-zookeeper-0 0/1 Pending 0 29m
Find details about each Kubernetes pod and make sure it is running on the node assigned to it via "nodeType". For example:
[ec2-user@ip-172-31-7-220 ~]$ kubectl describe pod rafting-iguana-imply-master-54876c5658-fph2t
Name: rafting-iguana-imply-master-54876c5658-fph2t
Namespace: default
Priority: 0
Node: ip-172-31-7-220.us-west-2.compute.internal/172.31.7.220
Start Time: Wed, 21 Aug 2019 18:24:21 +0000
Labels: app.kubernetes.io/instance=rafting-iguana
app.kubernetes.io/name=imply-master
pod-template-hash=54876c5658
Annotations:
Status: Running
IP: 10.44.0.1
....
Node-Selectors: nodeType=master
Access the master UIs using the pod's internal IP. For example, "10.44.0.1:8081" and "10.44.0.1:8090", and so on.
Run a sample ingestion task and query task to ensure full functionality of Imply Manager.
Congratulations. You now have Imply Manager successfully set up on your multi-node cluster.
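A pod whose nodeSelector matches no node label stays in Pending, so it can help to cross-check the nodeSelector values in values.yaml against the labels applied with 'kubectl label nodes'. The following Python sketch does that comparison on plain dictionaries; the function name and the dictionary layout are hypothetical, and the data is copied from the example above.

```python
def services_without_matching_node(selectors, node_labels):
    """Return services whose required nodeType is not present on any node.

    selectors: service name -> nodeType required by its nodeSelector
    node_labels: node name -> nodeType label applied to that node
    """
    available = set(node_labels.values())
    return sorted(
        service for service, node_type in selectors.items()
        if node_type not in available
    )


selectors = {"manager": "master", "master": "master", "query": "query", "data": "data"}
node_labels = {
    "ip-172-31-7-220.us-west-2.compute.internal": "master",
    "ip-172-31-5-160.us-west-2.compute.internal": "data",
    "ip-172-31-5-3.us-west-2.compute.internal": "data",
    "ip-172-31-4-67.us-west-2.compute.internal": "query",
}
print(services_without_matching_node(selectors, node_labels))  # prints []
```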
Troubleshoot some common issues:
1. Unable to install Imply with "failed to download" error, even after updating the repo again as suggested
[ec2-user@ip-172-31-7-220 imply]$ helm install imply
Error: failed to download "imply" (hint: running `helm repo update` may help)

[ec2-user@ip-172-31-7-220 imply]$ helm repo update
Hang tight while we grab the latest from your chart repositories...
...Skip local chart repository
...Successfully got an update from the "imply" chart repository
...Successfully got an update from the "stable" chart repository
Update Complete.
[ec2-user@ip-172-31-7-220 imply]$ helm install imply --debug
[debug] Created tunnel using local port: '44054'
[debug] SERVER: "127.0.0.1:44054"
[debug] Original chart version: ""
Error: Non-absolute URLs should be in form of repo_name/path_to_chart, got: imply
Solution: Run the "helm install imply" command in the SAME directory where you ran
helm fetch --untar imply/imply
2. Getting "no available release name found" error
[ec2-user@ip-172-31-7-220 ~]$ helm install imply --debug
[debug] Created tunnel using local port: '38716'
[debug] SERVER: "127.0.0.1:38716"
[debug] Original chart version: ""
[debug] CHART PATH: /home/ec2-user/imply
Error: no available release name found
Another symptom is that the same error messages appear in the log of the system pod "tiller":
[ec2-user@ip-172-31-7-220 ~]$ kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
.....
tiller-deploy-65867875cb-g4bsz 1/1 Running 0 26m
...
[ec2-user@ip-172-31-7-220 ~]$ kubectl logs tiller-deploy-65867875cb-g4bsz -n kube-system
....
[storage/driver] 2019/08/12 21:03:13 get: failed to get "winning-albatross.v1": configmaps "winning-albatross.v1" is forbidden: User "system:serviceaccount:kube-system:default" cannot get resource "configmaps" in API group "" in the namespace "kube-system"
[tiller] 2019/08/12 21:03:13 info: generated name winning-albatross is taken. Searching again.
[storage] 2019/08/12 21:03:13 getting release "oily-turtle.v1"
[storage/driver] 2019/08/12 21:03:13 get: failed to get "oily-turtle.v1": configmaps "oily-turtle.v1" is forbidden: User "system:serviceaccount:kube-system:default" cannot get resource "configmaps" in API group "" in the namespace "kube-system"
[tiller] 2019/08/12 21:03:13 info: generated name oily-turtle is taken. Searching again.
[storage] 2019/08/12 21:03:13 getting release "plundering-flee.v1"
[storage/driver] 2019/08/12 21:03:13 get: failed to get "plundering-flee.v1": configmaps "plundering-flee.v1" is forbidden: User "system:serviceaccount:kube-system:default" cannot get resource "configmaps" in API group "" in the namespace "kube-system"
[tiller] 2019/08/12 21:03:13 info: generated name plundering-flee is taken. Searching again.
[tiller] 2019/08/12 21:03:13 warning: No available release names found after 5 tries
[tiller] 2019/08/12 21:03:13 failed install prepare step: no available release name found
Solution: We need to reset by deleting this "tiller" instance, and creating a new one. Then reset the "helm" as well.
[ec2-user@ip-172-31-7-220 ~]$ kubectl delete pod tiller-deploy-65867875cb-g4bsz -n kube-system
pod "tiller-deploy-65867875cb-g4bsz" deleted
[ec2-user@ip-172-31-7-220 ~]$ kubectl create serviceaccount --namespace kube-system tiller
serviceaccount/tiller created
[ec2-user@ip-172-31-7-220 ~]$ kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
clusterrolebinding.rbac.authorization.k8s.io/tiller-cluster-rule created
[ec2-user@ip-172-31-7-220 ~]$ helm init --service-account tiller
$HELM_HOME has been configured at /home/ec2-user/.helm.
Warning: Tiller is already installed in the cluster.
(Use --client-only to suppress this message, or --upgrade to upgrade Tiller to the current version.)
[ec2-user@ip-172-31-7-220 ~]$ kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
....
tiller-deploy-65867875cb-lwtbg 1/1 Running 0 51s
....
[ec2-user@ip-172-31-7-220 ~]$ helm reset --force
Tiller (the Helm server-side component) has been uninstalled from your Kubernetes Cluster.
Finally, re-initialize helm and tiller, and install Imply again:
[ec2-user@ip-172-31-7-220 ~]$ helm init --service-account tiller
$HELM_HOME has been configured at /home/ec2-user/.helm.
Tiller (the Helm server-side component) has been installed into your Kubernetes Cluster.
[ec2-user@ip-172-31-7-220 ~]$ helm install imply --debug
Tear down the Kubernetes cluster:
[ec2-user@ip-172-31-7-220 on-prem]$ helm ls --all
NAME REVISION UPDATED STATUS CHART APP VERSION NAMESPACE
rafting-iguana 1 Mon Aug 12 21:30:17 2019 DEPLOYED imply-0.1.1 3.0.6-PREVIEW default
[ec2-user@ip-172-31-7-220 on-prem]$ helm delete rafting-iguana
release "rafting-iguana" deleted
[ec2-user@ip-172-31-7-220 on-prem]$ kubectl get pods
No resources found.
Ingestion failed from AWS S3 due to error "java.net.SocketException: Connection reset"

SYMPTOM: Ingestion from AWS S3 fails with the following error messages in the log:
2018-02-15T17:18:05,600 INFO [<DATASOURCE_NAME>] io.druid.segment.realtime.appenderator.AppenderatorImpl - Committing metadata[AppenderatorDriverMetadata{activeSegments={<SEGMENT_NAME>_2018-02-02T23:00:00.000Z=[<DATASOURCE_NAME>_2018-02-02T23:00:00.000Z_2018-02-03T00:00:00.000Z_2018-02-15T16:40:08.642Z_1]}, publishPendingSegments={<SEGMENT_NAME>_2018-02-02T23:00:00.000Z=[<DATASOURCE_NAME>_2018-02-02T23:00:00.000Z_2018-02-03T00:00:00.000Z_2018-02-15T16:40:08.642Z_1]}, lastSegmentIds={<SEGMENT_NAME>_2018-02-02T23:00:00.000Z=<DATASOURCE_NAME>_2018-02-02T23:00:00.000Z_2018-02-03T00:00:00.000Z_2018-02-15T16:40:08.642Z_1}, callerMetadata=null}] for sinks[<DATASOURCE_NAME>_2018-02-02T23:00:00.000Z_2018-02-03T00:00:00.000Z_2018-02-15T16:40:08.642Z_1:1].
2018-02-15T17:18:05,600 INFO [task-runner-0-priority-0] io.druid.segment.realtime.appenderator.AppenderatorDriver - Persisted pending data in 368ms.
2018-02-15T17:18:05,601 INFO [task-runner-0-priority-0] io.druid.segment.realtime.appenderator.AppenderatorImpl - Shutting down...
2018-02-15T17:18:05,601 INFO [appenderator_persist_0] io.druid.segment.realtime.appenderator.AppenderatorImpl - Removing sink for segment[<DATASOURCE_NAME>_2018-02-02T23:00:00.000Z_2018-02-03T00:00:00.000Z_2018-02-15T16:40:08.642Z_1].
2018-02-15T17:18:05,605 ERROR [task-runner-0-priority-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Exception while running task[IndexTask{id=<SEGMENT_NAME>_2018-02-02T23:00:00.000Z, type=index, dataSource=<DATASOURCE_NAME>}]
java.lang.IllegalStateException: java.net.SocketException: Connection reset
at org.apache.commons.io.LineIterator.hasNext(LineIterator.java:106) ~[commons-io-2.5.jar:2.5]
at io.druid.data.input.impl.FileIteratingFirehose.hasMore(FileIteratingFirehose.java:66) ~[druid-api-0.11.0-iap4.jar:0.11.0-iap4]
at io.druid.indexing.common.task.IndexTask.generateAndPublishSegments(IndexTask.java:628) ~[druid-indexing-service-0.11.0-iap4.jar:0.11.0-iap4]
at io.druid.indexing.common.task.IndexTask.run(IndexTask.java:233) ~[druid-indexing-service-0.11.0-iap4.jar:0.11.0-iap4]
at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:436) [druid-indexing-service-0.11.0-iap4.jar:0.11.0-iap4]
at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:408) [druid-indexing-service-0.11.0-iap4.jar:0.11.0-iap4]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_152]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_152]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_152]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_152]
Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:210) ~[?:1.8.0_152]
at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_152]
at sun.security.ssl.InputRecord.readFully(InputRecord.java:465) ~[?:1.8.0_152]
at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:593) ~[?:1.8.0_152]
at sun.security.ssl.InputRecord.read(InputRecord.java:532) ~[?:1.8.0_152]
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:983) ~[?:1.8.0_152]
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:940) ~[?:1.8.0_152]
at sun.security.ssl.AppInputStream.read(AppInputStream.java:105) ~[?:1.8.0_152]
at org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:198) ~[httpcore-4.4.3.jar:4.4.3]
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178) ~[httpcore-4.4.3.jar:4.4.3]
at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137) ~[httpclient-4.5.1.jar:4.5.1]
at org.jets3t.service.io.InterruptableInputStream.read(InterruptableInputStream.java:78) ~[jets3t-0.9.4.jar:0.9.4]
at org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.read(HttpMethodReleaseInputStream.java:146) ~[jets3t-0.9.4.jar:0.9.4]
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284) ~[?:1.8.0_152]
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326) ~[?:1.8.0_152]
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) ~[?:1.8.0_152]
at java.io.InputStreamReader.read(InputStreamReader.java:184) ~[?:1.8.0_152]
at java.io.BufferedReader.fill(BufferedReader.java:161) ~[?:1.8.0_152]
at java.io.BufferedReader.readLine(BufferedReader.java:324) ~[?:1.8.0_152]
at java.io.BufferedReader.readLine(BufferedReader.java:389) ~[?:1.8.0_152]
at org.apache.commons.io.LineIterator.hasNext(LineIterator.java:95) ~[commons-io-2.5.jar:2.5]
... 9 more
2018-02-15T17:18:05,610 INFO [task-runner-0-priority-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [<SEGMENT_NAME>_2018-02-02T23:00:00.000Z] status changed to [FAILED].
ROOT CAUSE: The error means that the connection was reset while reading input from S3. This is a fairly common but transient S3 error, and Druid can usually retry on transient errors of this kind. Currently, however, retrying is only turned on when prefetching is also enabled. Prefetching is a feature that lets Druid fetch input files from S3 to local disk ahead of time to make ingestion faster.
RESOLUTION: To enable prefetching, set the configurations below in your ioConfig, or leave them as null to use their default values. See http://druid.io/docs/latest/development/extensions-core/s3.html#statics3firehose for more details.
"ioConfig" : {
  "type" : "index",
  "firehose" : {
    "type" : "static-s3",
    "prefixes" : [ "s3://path/to/data/2018/02/02/23/" ],
    "maxCacheCapacityBytes" : 1073741824,
    "maxFetchCapacityBytes" : 1073741824,
    "prefetchTriggerBytes" : 536870912,
    "fetchTimeout" : 200000,
    "maxFetchRetry" : 3
  },
  "appendToExisting" : false
},
In our next release, retrying will be available even if prefetching is disabled.
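The byte values in the ioConfig above are easier to review when written as multiples of MiB and GiB. The following is a minimal Python sketch that builds the same settings. The helper name build_s3_io_config is hypothetical and is not part of Druid; the field names and values are copied from the example above.

```python
import json

MIB = 1024 * 1024
GIB = 1024 * MIB


def build_s3_io_config(prefix: str) -> dict:
    """Build an index-task ioConfig with static-s3 prefetching enabled.

    Illustrative helper only (not a Druid API); field names and values
    mirror the ioConfig example in the article.
    """
    return {
        "type": "index",
        "firehose": {
            "type": "static-s3",
            "prefixes": [prefix],
            "maxCacheCapacityBytes": 1 * GIB,   # 1073741824
            "maxFetchCapacityBytes": 1 * GIB,   # 1073741824
            "prefetchTriggerBytes": 512 * MIB,  # 536870912
            "fetchTimeout": 200000,             # same value as the example
            "maxFetchRetry": 3,
        },
        "appendToExisting": False,
    }


if __name__ == "__main__":
    config = build_s3_io_config("s3://path/to/data/2018/02/02/23/")
    print(json.dumps({"ioConfig": config}, indent=2))
```

Generating the fragment this way makes it harder to mistype a byte count when the cache or fetch capacity is tuned later.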
Configuring Druid to use a Kerberized Hadoop as Deep Storage

Hadoop Setup
The following configuration files must be copied to the Druid conf folders:
For HDFS as a deep storage, hdfs-site.xml, core-site.xml
For ingestion, mapred-site.xml, yarn-site.xml
HDFS Folders and permissions
Choose any folder name for the Druid deep storage, for example 'druid'.
Create the folder in HDFS under the required parent folder. For example: hdfs dfs -mkdir /druid OR hdfs dfs -mkdir /apps/druid
Give the Druid processes appropriate permissions to access this folder, so that Druid can create the necessary folders, such as data and indexing_log, in HDFS. For example, if the Druid processes run as user 'root', then
hdfs dfs -chown root:root /apps/druid
OR
hdfs dfs -chmod 777 /apps/druid
Druid creates necessary sub-folders to store data and index under this newly created folder.
Druid Setup
Edit common.runtime.properties at conf/druid/_common/common.runtime.properties to include the HDFS properties. The folder locations are the same as in the examples above.
common_runtime_properties
# Deep storage
#
# For HDFS:
druid.storage.type=hdfs
druid.storage.storageDirectory=/druid/segments
# OR
# druid.storage.storageDirectory=/apps/druid/segments
#
# Indexing service logs
#
# For HDFS:
druid.indexer.logs.type=hdfs
druid.indexer.logs.directory=/druid/indexing-logs
# OR
# druid.indexer.logs.directory=/apps/druid/indexing-logs
Note: Comment out Local storage and S3 Storage parameters in the file
Also include hdfs-storage core extension to conf/druid/_common/common.runtime.
#
# Extensions
#
druid.extensions.directory=dist/druid/extensions
druid.extensions.hadoopDependenciesDir=dist/druid/hadoop-dependencies
druid.extensions.loadList=["druid-parser-route", "mysql-metadata-storage", "druid-hdfs-storage", "druid-kerberos"]
Hadoop Jars
Ensure that Druid has the necessary jars to support the Hadoop version.
Find the Hadoop version using the command: hadoop version
If other software is used with Hadoop, such as WANdisco, ensure that the necessary libraries are available.
Add the requisite extensions to druid.extensions.loadList in conf/druid/_common/common.runtime.properties
Kerberos setup
Create a headless keytab that has access to the Druid data and index.
Edit conf/druid/_common/common.runtime.properties and add the following properties:
druid.hadoop.security.kerberos.principal
druid.hadoop.security.kerberos.keytab
For example
druid.hadoop.security.kerberos.principal=<principal>@<REALM>
druid.hadoop.security.kerberos.keytab=/etc/security/keytabs/hdfs.headless.keytab
Restart Druid Services
With the above changes in place, restart Druid so that it works with the Kerberized Hadoop cluster.
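Before restarting, it can help to confirm that the properties file contains every setting described above. The following is a minimal Python sketch of such a check, assuming a simple key=value properties file. The function names are hypothetical, and the list of required keys is taken only from this article, not from an exhaustive Druid reference.

```python
REQUIRED_KEYS = [
    "druid.storage.type",
    "druid.storage.storageDirectory",
    "druid.indexer.logs.type",
    "druid.indexer.logs.directory",
    "druid.hadoop.security.kerberos.principal",
    "druid.hadoop.security.kerberos.keytab",
]

# Extensions shown in the loadList example of this article.
REQUIRED_EXTENSIONS = ["druid-hdfs-storage", "druid-kerberos"]


def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def check_hdfs_kerberos_config(text: str) -> list:
    """Return a list of problems; an empty list means all checks passed."""
    props = parse_properties(text)
    problems = [f"missing property: {k}" for k in REQUIRED_KEYS if k not in props]
    for key in ("druid.storage.type", "druid.indexer.logs.type"):
        if key in props and props[key] != "hdfs":
            problems.append(f"{key} should be hdfs, found {props[key]}")
    load_list = props.get("druid.extensions.loadList", "")
    for ext in REQUIRED_EXTENSIONS:
        if ext not in load_list:
            problems.append(f"extension not in loadList: {ext}")
    return problems
```

Run it against the contents of conf/druid/_common/common.runtime.properties; a commented-out property is reported as missing.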
Migrate existing Druid Cluster to a new Imply cluster

Introduction
In a typical Imply cluster, we have:
Master services (Overlord, Coordinator)
Data services (MiddleManager, Historical)
Query services (Broker, Imply UI)
Zookeeper servers
Deep Storage (S3, HDFS, NFS etc)
Metadata DB (MySQL, PostgreSQL etc) for Druid and Pivot if Imply is used
Migrating the cluster is simple since we only need to move (or maintain) data from Deep Storage and Metadata DB (Druid / Pivot). The other services contain only transient data, and they are repopulated anyway when the new cluster is started.
Following are the possible migration plans:
Migration where there is no change in Deep Storage and the Metadata DB
Migration where there is change in Deep Storage / Metadata DB
Migration from Apache Druid to Imply
Multiple scenarios are presented in the rest of the article.
Scenario 1 : Old and new clusters share the same Deep Storage and DB servers
Druid services need to be restarted.
Steps:
Configure the new clusters with the same Deep Storage path and DB server address as the old
Stop the old cluster
Make sure the old cluster is COMPLETELY down before starting new cluster, or Druid data can be corrupted
Start the new cluster
Scenario 2 :
Old and new clusters use different Deep Storage and DB servers.
This scenario would apply if you are using Imply Cloud and
Old deep storage is accessible from the new Imply cluster
Old metadata DB and new metadata DB are both MySQL
New deep storage is clean with no data
New metadata DB is brand new and clean with no data
Steps:
Configure the new cluster with a different Deep Storage path and DB server address
Copy tables "druid_config, druid_dataSource, druid_supervisors, druid_segments" from the old DB to the new DB by following the next step
Run mysqldump on the source metadata DB (and Pivot if used), and then import the dump into the new target metadata DB. The output file is given a .sql extension because it contains SQL commands. The -p option makes mysqldump prompt for a password
mysqldump -h <host_name> -u <user_name> -p --single-transaction --skip-add-drop-table --no-create-info --no-create-db <db_name> druid_config druid_dataSource druid_supervisors druid_segments > output_file.sql
Start the new cluster
The Coordinator automatically starts reading the old segments' metadata in the new DB, and then Historical nodes load them from the old Deep Storage. Note that the data in old Deep Storage is kept intact.
Old cluster will keep writing to the old metadata DB and old Deep Storage
New cluster will write to the new metadata DB and new Deep Storage
Note #1: Once you do this migration, the old and new clusters will share the same data segment files in deep storage for any data ingested before the migration. (Data ingested after the migration will go to different files.) It is important to avoid running kill tasks (permanent data deletion) on datasources that may have segments shared between the two clusters, because it will cause the clusters to delete each other's data.
Note #2: If the new Druid cluster shares the same ZooKeeper quorum as the old, it must use a different base znode path, by configuring the property druid.zk.paths.base in Druid's common.runtime.properties to a different name, such as /druid-newcluster. The default value is /druid.
Scenario 3 :
Old and new clusters use different Deep Storage and DB servers. This scenario would apply if you are using Imply Cloud and
Old deep storage is accessible from the new Imply cluster
Old metadata DB and new metadata DB are both MySQL
New deep storage has some data in it
New metadata DB has some data in it
Steps:
Make sure there are no collisions in the paths between old deep storage and new deep storage
If there are collisions, change the path of source deep storage to something else and give the same paths when you modify source mysqldump file of source metadata DB
Copy the data from old deep storage to new deep storage
Configure the new cluster with a different Deep Storage path and DB server address.
Run mysqldump on the source metadata DB, excluding the DDL
mysqldump -h <host_name> -u <user_name> -p --single-transaction --skip-add-drop-table --no-create-info --no-create-db <db_name> druid_config druid_dataSource druid_supervisors druid_segments > source_output_file.sql
We have to be careful in this scenario NOT to overwrite the target metadata DB
Change the location of the segments in the druid_segments table in the above source mysqldump file to point to the new deep storage location
sed -i.bak 's/\\"bucket\\":\\"<old_bucket_name>\\"/\\"bucket\\":\\"<new_bucket_name>\\"/g' /dir/source_output_file.sql
Copy tables "druid_config, druid_dataSource, druid_supervisors, druid_segments" from the old metadata DB to the new metadata DB by following the next step
Import the above modified source mysqldump file into the new target metadata DB
mysql -h <host_name> -u <user_name> -p <db_name> < /dir/source_output_file.sql
Old cluster will keep writing to the old metadata DB and old Deep Storage
New cluster will write to the new DB and new Deep Storage
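The bucket rewrite in the dump file can also be done without sed. The following is a minimal Python sketch, assuming the dump stores the segment payload with backslash-escaped double quotes, as the sed pattern above implies. The function names are hypothetical. It works on bytes so that any non-text data in the dump is left untouched, and it replaces every occurrence, including several on one line.

```python
import shutil
from pathlib import Path


def rewrite_bucket(dump: bytes, old_bucket: str, new_bucket: str) -> bytes:
    """Replace every escaped "bucket":"<old>" entry with the new bucket name."""
    # In the dump file the JSON quotes appear as \" (backslash + quote).
    old = b'\\"bucket\\":\\"' + old_bucket.encode("utf-8") + b'\\"'
    new = b'\\"bucket\\":\\"' + new_bucket.encode("utf-8") + b'\\"'
    return dump.replace(old, new)


def rewrite_dump_file(path: str, old_bucket: str, new_bucket: str) -> int:
    """Rewrite the dump in place, keeping a .bak copy; return the match count."""
    src = Path(path)
    original = src.read_bytes()
    needle = b'\\"bucket\\":\\"' + old_bucket.encode("utf-8") + b'\\"'
    count = original.count(needle)
    shutil.copyfile(src, src.with_name(src.name + ".bak"))
    src.write_bytes(rewrite_bucket(original, old_bucket, new_bucket))
    return count
```

Comparing the returned count with the number of rows in druid_segments is a quick way to confirm that no segment still points at the old bucket.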
Scenario 4 :
Old and new clusters use different Deep Storage and DB servers. This scenario would apply if you are using Imply Cloud and
Old deep storage is NOT accessible from the new Imply cluster
Old metadata DB and new metadata DB are both MySQL
New deep storage is clean with no data in it
New metadata DB is brand new and has no data in it
Steps:
Copy the data from old deep storage to new deep storage (possibly using a staging area as an intermediate location)
Configure the new cluster with a different Deep Storage path and DB server address.
Run mysqldump on the source metadata DB
mysqldump -h <host_name> -u <user_name> -p --single-transaction --skip-add-drop-table --no-create-info --no-create-db <db_name> druid_config druid_dataSource druid_supervisors druid_segments > output_file.sql
Change the location of the segments in the druid_segments table in the above mysqldump file to point to the new deep storage location
sed -i.bak 's/\\"bucket\\":\\"<old_bucket_name>\\"/\\"bucket\\":\\"<new_bucket_name>\\"/g' /dir/output_file.sql
Copy tables "druid_config, druid_dataSource, druid_supervisors, druid_rules, druid_segments" from the old DB to the new DB by following the next step
Import the above modified source mysqldump file into the new target metadata DB
mysql -h <host_name> -u <user_name> -p <db_name> < /dir/output_file.sql
Drop the druid_rules table from the target MySQL DB. It will be re-created once the cluster is started
Start the new cluster.
The Coordinator automatically starts reading the old segments' metadata in the new DB, and then Historical nodes load them from the NEW Deep Storage. Note that the data in old Deep Storage is kept intact.
Old cluster will keep writing to the old DB and old Deep Storage.
New cluster will write to the new DB and new Deep Storage.
Scenario 5 :
Old and new clusters use different Deep Storage and DB servers. This scenario would apply if you are using Imply Cloud and
Old deep storage is NOT accessible from the new Imply cluster
Old metadata DB and new metadata DB are both MySQL
New deep storage has some data in it
New metadata DB has some data in it
Steps:
Make sure there are no collisions in the paths between old deep storage and new deep storage
If there are collisions, change the path of source deep storage to something else and give the same paths when you modify source mysqldump file of source metadata DB
Copy the data from old deep storage to new deep storage (possibly using a staging area as an intermediate location)
Configure the new cluster with a different Deep Storage path and DB server address
Run mysqldump on the source metadata DB, excluding the DDL
mysqldump -h <host_name> -u <user_name> -p --skip-add-drop-table --no-create-info --no-create-db <db_name> druid_config druid_dataSource druid_supervisors druid_segments > source_output_file.sql
We have to be careful in this scenario NOT to overwrite the target metadata DB
Change the location of the segments in the druid_segments table in the above source mysqldump file to point to the new deep storage location
sed -i.bak 's/\\"bucket\\":\\"<old_bucket_name>\\"/\\"bucket\\":\\"<new_bucket_name>\\"/g' /dir/source_output_file.sql
Copy tables "druid_config, druid_dataSource, druid_supervisors, druid_segments" from the old metadata DB to the new metadata DB by following the next step.
Import the above modified source mysqldump file into the new target metadata DB.
mysql -h <host_name> -u <user_name> -p <db_name> < /dir/source_output_file.sql
Start the new cluster.
The Coordinator automatically starts reading the old segments' metadata in the new DB, and then Historical nodes load them from the NEW Deep Storage. Note that the data in old Deep Storage is kept intact.
Old cluster will keep writing to the old metadata DB and old Deep Storage
New cluster will write to the new DB and new Deep Storage
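The collision check called for in the scenarios where the new deep storage already holds data can be sketched as a set intersection over two path listings. This is a minimal Python sketch: how the listings are produced (for example, exported from your storage tooling) is left open, and the function names and base prefixes are hypothetical.

```python
def relative_keys(keys, base_prefix):
    """Strip the deep storage base prefix so both listings are comparable."""
    base = base_prefix.rstrip("/") + "/"
    return {key[len(base):] for key in keys if key.startswith(base)}


def find_path_collisions(old_keys, old_base, new_keys, new_base):
    """Return the sorted relative paths that exist in both deep storages."""
    old_relative = relative_keys(old_keys, old_base)
    new_relative = relative_keys(new_keys, new_base)
    return sorted(old_relative & new_relative)


if __name__ == "__main__":
    # Hypothetical listings; real ones would come from your storage tooling.
    old_listing = [
        "druid/segments/wikipedia/2019-01-01/0/index.zip",
        "druid/segments/clicks/2019-01-01/0/index.zip",
    ]
    new_listing = ["prod/segments/wikipedia/2019-01-01/0/index.zip"]
    for path in find_path_collisions(
        old_listing, "druid/segments", new_listing, "prod/segments"
    ):
        print("collision:", path)
```

An empty result means the old data can be copied under the new base path without overwriting existing segment files.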
Scenario 6:
Apache Druid to Imply
All of the above scenarios can be considered for migrating from Apache Druid to Imply.
Guidelines
Review Imply Release notes for changes in the new version
Upgrade with cluster restarts. A rolling upgrade is possible, but not preferred
Move to the Apache Druid version that matches Imply's Druid version before the migration. This is required if a rolling upgrade is to be performed
Use the same configuration parameters for the services as in the Apache Druid configuration, after reviewing the release notes for parameter changes
If the `druid.segmentCache.locations` are changed, copy over the segment cache from the existing Apache Druid cluster before restarting services using the new Imply version
When starting the cluster on the Imply version, the following order is preferable when the segment count is very high:
Start all the Historicals and wait for the Lifecycle to be started
Start all the Master services (Coordinator / Overlord) and wait for the Lifecycle to be started
Start the Query services (Broker / Router) and wait for the Lifecycle to be started
Start the MiddleManager service and wait for the Lifecycle to be started
Resume Supervisors / tasks
Accessing overlord console and coordinator console in Imply Cloud

Druid's overlord console provides an overview of current tasks, pending tasks, waiting tasks, and tasks that have completed in the last 24 hours.
Druid's coordinator console provides an overview of datasources, segments, data size per segment, and cluster usage.
To configure the ability to see these consoles in your browser, follow these steps:
1. Log into your AWS account that is associated with Imply Cloud
2. In the Imply UI, go to the API view under Manage / Clusters:
3. Under AWS Consoles, click on "Unmanaged". This should take you to your AWS Console.
4. In the AWS Console, create rules to allow your public IP address to access the overlord and coordinator consoles:
5. Back in the API view in the Imply UI, note the Druid username and password:
6. Copy the link from the top of the API view, and you're done!
OR navigate to the Servers page:
Using the EXTERNAL IP of a master node, access one of the consoles by entering this address + port in a new browser tab:
overlord console: https://<EXT-IP-MASTER-NODE>:8290
coordinator console: https://<EXT-IP-MASTER-NODE>:8281
Druid will change leadership among master nodes, so if the first IP times out, try another master IP until prompted for the username and password from #5 above.
How to deploy custom applications on imply cloud instances using Imply-cloud-manager UI

This can be done by adding custom files to the cluster specification. The init script looks for a specially named custom file called user-init. The file must have exactly that name, and it must be marked as 'Is executable' [please refer to the screenshots].
The user-init file should be a script (it should start with something like #!/bin/bash) and it runs after the rest of our init scripts finish. You can use it to configure whatever custom service you want.
You can also use the custom file mechanism to upload additional files for SignalFx. If you use 'Location Type' Default, the files are placed at /opt/imply/user and you can reference them there.
Example: a user wants to install the `SignalFx agent` on each Imply-managed EC2 instance.
Step1 :
Here, I have created a `user-init` file and placed it in an S3 bucket, at this location:
s3://imply-signalfx/user-init
The user-init file has the content below:
#!/bin/bash
curl -sSL https://dl.signalfx.com/signalfx-agent.sh > /tmp/signalfx-agent.sh
sudo sh /tmp/signalfx-agent.sh <SOME_ID> --realm us0
Step 2:
Log in to the Imply Cloud Manager UI > select the cluster > click `manage` > click `setup` > navigate to `Advance Config` and expand it > click `Add custom file`
Step 3:
When you add the custom file, you are asked to provide the details below:
a) Custom File Path: The file path must start with a scheme such as `s3://`, `http://`, or `https://`.
In this example, the path should be:
s3://imply-signalfx/user-init
b) Location Type: I am using the default location. In this case, the init file is placed inside this directory:
/opt/imply/user/
c) Extra Properties: In this case, it should be:
Is executable
Example:
Once you are done with the above configuration, you need to update the cluster!
What is the impact on the running cluster?
This requires a cluster update, and the instances will be recreated. If your cluster is HA and you do a rolling update, there should not be any downtime; otherwise, there will be downtime. If there is an error in the script, you should be able to roll back the update.
Suggestion to be on the safe side:
Spin up a TEST cluster and try this there first before doing it in PRODUCTION.
Tuning Druid for Large Result Sets

Here are some handpicked options for hardening a Druid system in demanding scenarios, such as a group by on a large dataset that may return a very large query response. This article also provides recommendations for setting these parameters.
Querying a large dataset
Druid provides many tunable parameters at the broker and query level. The following advanced configurations are currently supported by Druid.
Property: druid.server.http.defaultQueryTimeout
Description: Query timeout in milliseconds, beyond which unfinished queries will be cancelled. A timeout of 0 means no timeout. To set the default timeout, see Broker configuration.
Default: 300000 ms
Property: druid.broker.http.readTimeout
Description: The timeout for data reads from Historical servers and real-time tasks.
Default: PT15M
Recommendation: Leave it alone at first, but you may need to raise it if TTFB (Time to First Byte) is high.
Property: druid.router.http.readTimeout
Description: Timeout of in-flight query responses. The Router will terminate running queries if they run longer than this duration.
Default: PT8M
Recommendation: Make sure this duration is long enough for the longest-running query. Handle partial query responses on the client side, since the HTTP 200 response has already been sent.
Property: druid.broker.http.maxQueuedBytes
Description: Maximum number of bytes queued per query before exerting backpressure on the channel to the data server. Similar to maxScatterGatherBytes, except that this configuration triggers backpressure rather than query failure. Zero means disabled.
Default: 0 (disabled)
Recommendation: Always set this parameter on the Broker. Try 10M (10000000) if you have fewer than 100 Historicals; consider higher numbers if you have more Historicals. Make sure it fits within your heap (this setting is per query, so usage could be up to druid.server.http.numThreads * druid.broker.http.maxQueuedBytes).
Property: druid.processing.buffer.sizeBytes
Description: This specifies a buffer size for the storage of intermediate results. The computation engine in both the Historical and Realtime processes will use a scratch buffer of this size to do all of their intermediate computations off-heap. Larger values allow for more aggregations in a single pass over the data, while smaller values can require more passes depending on the query that is being executed.
Default: auto (max 1GB)
Recommendation: Leave it alone at first, but consider raising it if you have spare memory.
Property: druid.query.groupBy.maxMergingDictionarySize
Description: Maximum amount of heap space (approximately) to use for the string dictionary during merging. When the dictionary exceeds this size, a spill to disk will be triggered.
Default: 100000000
Property: druid.query.groupBy.maxOnDiskStorage
Description: Amount of space on disk used for aggregation, per query, in bytes. By default, this is 0, which means aggregation will not use disk.
Default: 0 (disabled)
Recommendation: Always set this on Historicals. Try 10GB at first if you have enough disk space, and raise it if needed. This setting is per query, so actual usage could be up to druid.server.http.numThreads * druid.query.groupBy.maxOnDiskStorage.
Note: Please check your Druid version for support of these advanced options.
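Because maxQueuedBytes and maxOnDiskStorage are per-query settings, the worst case scales with druid.server.http.numThreads. The following is a short Python sketch of that arithmetic. The thread count of 40 is a hypothetical value chosen only for illustration, and "10GB" is assumed here to mean 10^10 bytes.

```python
def worst_case_bytes(num_threads: int, per_query_bytes: int) -> int:
    """Upper bound when every HTTP thread runs a query at the per-query limit."""
    return num_threads * per_query_bytes


if __name__ == "__main__":
    num_threads = 40                 # hypothetical druid.server.http.numThreads
    max_queued_bytes = 10_000_000    # the suggested 10M starting point
    max_on_disk = 10 * 10**9         # "10GB", assumed to mean 10^10 bytes

    heap_bound = worst_case_bytes(num_threads, max_queued_bytes)
    disk_bound = worst_case_bytes(num_threads, max_on_disk)
    print(f"Broker heap upper bound: {heap_bound} bytes")      # 400000000
    print(f"Historical disk upper bound: {disk_bound} bytes")  # 400000000000
```

With these assumed values the Broker needs about 400 MB of heap headroom for queued bytes, and each Historical needs up to about 400 GB of spill space in the worst case.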
Downloading large dataset
Apply HTTP Compression
Apache Druid (incubating) supports HTTP request decompression and response compression. To use response compression, set the Accept-Encoding: gzip header on your client requests.
This does not require server configuration.
Recommendation: Consider this if the network is more of a bottleneck than CPU (this trades off CPU for network). Not recommended for smaller result sets, since this option increases CPU usage on the broker and on the client.
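On the client side, requesting compression comes down to sending the header and decompressing the body when the server answers with gzip. The following is a minimal Python sketch using only the standard library. The function names are hypothetical, and the broker URL is passed in by the caller; a URL such as http://DRUID-QUERY-SERVER:8082/druid/v2/sql is an assumption about your deployment.

```python
import gzip
import json
import urllib.request


def build_sql_request(broker_url: str, sql: str) -> urllib.request.Request:
    """Build a POST request that asks the server for a gzip-compressed response."""
    body = json.dumps({"query": sql, "resultFormat": "array"}).encode("utf-8")
    return urllib.request.Request(
        broker_url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Accept-Encoding": "gzip",
        },
        method="POST",
    )


def decode_body(raw: bytes, content_encoding) -> str:
    """Decompress the body only if the server actually compressed it."""
    if content_encoding and content_encoding.lower() == "gzip":
        raw = gzip.decompress(raw)
    return raw.decode("utf-8")


def run_query(broker_url: str, sql: str) -> str:
    """Send the query and return the decoded response text."""
    request = build_sql_request(broker_url, sql)
    with urllib.request.urlopen(request) as response:
        return decode_body(response.read(), response.headers.get("Content-Encoding"))
```

Checking the Content-Encoding response header before decompressing keeps the client working when the server returns an uncompressed body.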
Detect Truncated Responses
A truncated response can be detected by validating the returned data. Each format type provides a way to detect whether data is truncated. Let's see how each response type works in this scenario. Some response formats are better than others, but the choice depends on what implementation works with your external systems.
Always set resultFormat if using SQL queries.
Query Type: SQL
Parameter: resultFormat
Value: array
Content Type: application/json
Format Type: JSON array of JSON arrays
Detection: Validate the response JSON
Recommendation: Preferred for large result sets
Query Type: SQL
Parameter: resultFormat
Value: arrayLines
Content Type: text/plain
Format Type: Like "array", but the JSON arrays are separated by newlines instead of being wrapped in a JSON array. This can make it easier to parse the entire response set as a stream, if you do not have ready access to a streaming JSON parser.
Detection: Presence of one blank line at the end
Recommendation: Preferred for large result sets. Make sure to implement a truncation checker.
Query Type: Native Druid
Parameter: resultAsArray
Value: true
Content Type: application/json
Format Type: groupBy v2 queries now use an array-based representation of result rows, rather than the map-based representation used by prior versions of Druid. This provides faster generation and processing of result sets. Out of the box, this change is backwards-compatible.
Detection: Validate the response JSON
Recommendation: Always set this parameter. Only supported in 3.1+. Documentation link: resultAsArray
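The two detection rules above can be sketched as small client-side checks. This is a minimal Python sketch with hypothetical function names. For arrayLines it assumes that each row ends with a newline and that a complete response then ends with one blank line, as described above.

```python
import json


def parse_array_response(body: str) -> list:
    """resultFormat=array: the whole body must parse as one JSON array."""
    try:
        rows = json.loads(body)
    except json.JSONDecodeError as exc:
        raise ValueError("truncated or invalid 'array' response") from exc
    if not isinstance(rows, list):
        raise ValueError("unexpected 'array' response: top level is not a list")
    return rows


def parse_array_lines_response(body: str) -> list:
    """resultFormat=arrayLines: a complete body ends with one blank line."""
    if not body.endswith("\n\n"):
        raise ValueError("truncated 'arrayLines' response: no trailing blank line")
    # Each non-empty line is one JSON array holding a single result row.
    return [json.loads(line) for line in body.splitlines() if line.strip()]
```

Both functions raise ValueError on a truncated body, so a caller can treat that one exception type as the signal to retry.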
Other Formats
Druid supports additional formats, which are excluded from this KB article because they are less relevant. These formats are:
resultFormat=[object, objectLines, csv] for SQL queries.
resultFormat=[list, compactedList] for Druid native queries, which apply only to scan queries and not to GROUP BY.
Additional Recommendations
Try avoiding "ORDER BY" if at all humanly possible.
Colocate clients to reduce network latency.
Avoid server going down.
Implement retry in case of error 500 or truncated response.
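The retry recommendation can be sketched as a small loop around the query call. This is a minimal Python sketch under stated assumptions: fetch is a hypothetical caller-supplied function that returns a (status_code, body) pair, is_complete is a caller-supplied truncation check, and the attempt count and backoff are illustrative values, not Druid recommendations.

```python
import time


def query_with_retry(fetch, is_complete, max_attempts=3, backoff_seconds=1.0,
                     sleep=time.sleep):
    """Retry on HTTP 500 or a truncated body; fail fast on other statuses."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        status, body = fetch()
        if status == 200 and is_complete(body):
            return body
        if status == 200:
            last_error = "truncated response"
        elif status == 500:
            last_error = "HTTP 500"
        else:
            raise RuntimeError(f"non-retryable HTTP status {status}")
        if attempt < max_attempts:
            # Linear backoff: wait a little longer before each new attempt.
            sleep(backoff_seconds * attempt)
    raise RuntimeError(f"giving up after {max_attempts} attempts: {last_error}")
```

Passing sleep as a parameter keeps the loop easy to test, and is_complete can be whichever truncation check matches the result format in use.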
Further Reading
https://druid.apache.org/docs/latest/operations/http-compression.html
https://druid.apache.org/docs/latest/configuration/realtime.html
https://druid.apache.org/docs/latest/querying/groupbyquery.html#query-context
https://druid.apache.org/docs/latest/querying/sql
https://docs.imply.io/3.1/on-prem/misc/release
Profiling Druid queries using flame graphs

Once configuration has been ruled out as the likely cause of slowness and you have a repeatably slow query on your hands, a useful next step is to perform some light profiling to capture a recording of your Druid Broker or Historical (or any other java) processes for analysis.
There are three steps:
Download Swiss Java Knife (SJK) onto the servers that host the Druid processes you'd like to analyze. This will generally be a Broker or Historical, or perhaps an indexing task.
Use SJK to capture stack dumps.
Use SJK to generate flame graphs from the stack dumps.
Download SJK
For this, we will use Swiss Java Knife, in the form of a .jar that can be downloaded from maven here.
Capture stack dumps
To prepare the recording, you will need to copy 'sjk-{version}.jar' to the Druid servers you wish to collect profiling data from. Once this is complete, the next step is to find the process IDs of the processes we wish to sample by running
jcmd
to print all running java processes. After this, we sample the process using the 'sjk-{version}.jar' that we copied to the Druid servers:
java -jar sjk-{version}.jar stcap -p {pid} -o druid-{pid}.std -t 30s
This will generate a file containing sampled thread dumps, specified by the argument '-o druid-{pid}.std'. Adjust the '-t 30s' to a value that is appropriate for the length of the query you are recording.
We recommend disabling cache (by setting "useCache" to "false" in your query context) and running your query at least 3 times, or repeatedly for 30 seconds, whichever is longer. This ensures enough data is collected for analysis.
Generate flame graphs
Next up, we can create a flame graph from the output:
java -jar sjk-{version}.jar flame -f druid-{pid}.std -o druid-{pid}-report.html
This generates an HTML page containing an interactive flame graph that can be used to help determine what Druid was doing while processing the query or queries.
When reading flame graphs, it's important to look at query-related threads. There are two major types of threads involved in queries: processing threads, and HTTP server threads. When a query is ongoing, both types of threads will change their names to note the query type and datasource being queried, making them easier to find.
The tool has a thread filter widget in the top-right which can be used to filter threads by name. To filter in on certain threads, first click "none" to clear the list of selected threads, then type in a filter, and finally either select individual threads or click "all" to select all threads matching the filter.
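When several processes need to be profiled, the capture and flame graph commands shown earlier can be generated per process ID. The following is a minimal Python sketch that only builds the argument lists. The function names are hypothetical, and the jar path is supplied by the caller because the SJK version is not fixed here.

```python
def stcap_command(jar_path: str, pid: int, duration: str = "30s") -> list:
    """Arguments for capturing sampled thread dumps from one Java process."""
    return [
        "java", "-jar", jar_path, "stcap",
        "-p", str(pid),
        "-o", f"druid-{pid}.std",
        "-t", duration,
    ]


def flame_command(jar_path: str, pid: int) -> list:
    """Arguments for rendering the captured dumps as an HTML flame graph."""
    return [
        "java", "-jar", jar_path, "flame",
        "-f", f"druid-{pid}.std",
        "-o", f"druid-{pid}-report.html",
    ]

# Example use (requires Java and the SJK jar on the host):
#   import subprocess
#   subprocess.run(stcap_command("/path/to/sjk.jar", 12345), check=True)
#   subprocess.run(flame_command("/path/to/sjk.jar", 12345), check=True)
```

Deriving both file names from the process ID keeps the capture output and the report paired when more than one Druid process is sampled on the same host.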
Connecting Tableau to Druid with JDBC

Tableau 2018.3 introduced support for JDBC connections, which you can use to connect to Druid. This article documents the technique for Imply 2.8+, but it also works for Apache Druid 0.14.0+.
Please note that not all Tableau functions have been fully tested and verified.
To do this, you will need:
Tableau 2018.3 or later (Windows or Mac).
Imply 2.8.0 or later.
The Avatica JDBC driver, which you can get from https://static.imply.io/support/avatica-1.12.0.jar. Druid uses an Avatica JDBC server.
Once you have this software installed, follow these steps:
Place the "avatica-1.12.0.jar" file in Tableau's Drivers folder. On Windows this is C:\Program Files\Tableau\Drivers. On Mac it is ~/Library/Tableau/Drivers.
Launch Tableau and choose the "Other databases (JDBC)" connection option.
Enter the URL "jdbc:avatica:remote:url=http://DRUID-QUERY-SERVER:8082/druid/v2/sql/avatica/" (replacing DRUID-QUERY-SERVER with the hostname or IP address of your Druid query broker).
Select the dialect "SQL92".
Enter a username and password if you have enabled authentication on your Druid cluster. Otherwise, leave them blank.
You can set connection context parameters through a properties file, if you like. If you do not need any special context parameters, leave the properties section blank.
Click "Sign In".
Choose database "druid" and schema "druid", then drag the table you wish to query up to the tables area and click "Sheet 1" to open a worksheet. Joins are not supported, so only use one table at a time.
Setting up a TLS truststore
Some extra steps are necessary to connect to a TLS-enabled Druid cluster where you're using your own CA. This includes Imply Cloud clusters (where Imply configures TLS out of the box, with a cluster-specific CA) or an on-premise Druid cluster with TLS enabled.
Save a copy of the CA to your local machine. If you are using Imply Cloud you can download it from the "API" tab.
Use keytool to create a truststore: keytool -import -alias druid -file downloaded_crt_file.crt -storetype JKS -keystore truststore.jks
Use a JDBC URL with the truststore parameters set, such as jdbc:avatica:remote:url=https://DRUID-QUERY-SERVER:9088/druid/v2/sql/avatica/;truststore=/path/to/truststore.jks;truststore_password=changeit;
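The plain and TLS forms of the JDBC URL differ only in the scheme, the port and the trailing truststore parameters. A small Python sketch of how the two URLs in this article are assembled is shown below; the function and parameter names are illustrative and are not part of Tableau, Avatica or Druid.

```python
def avatica_url(host, port=8082, tls=False,
                truststore=None, truststore_password=None):
    """Build an Avatica JDBC URL of the shape used in this article."""
    scheme = "https" if tls else "http"
    url = f"jdbc:avatica:remote:url={scheme}://{host}:{port}/druid/v2/sql/avatica/"
    if truststore:
        # Truststore parameters are appended as ;key=value; pairs.
        url += f";truststore={truststore};truststore_password={truststore_password};"
    return url


if __name__ == "__main__":
    print(avatica_url("DRUID-QUERY-SERVER"))
    print(avatica_url("DRUID-QUERY-SERVER", port=9088, tls=True,
                      truststore="/path/to/truststore.jks",
                      truststore_password="changeit"))
```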
Enable Query or Request logging for Druid

Query logging (request logging) will capture the queries made and users who made them in the broker log. To enable:
1. Add the following to imply/conf/druid/_common/common.runtime.properties:
druid.request.logging.type=slf4j
2. Restart your Imply processes.
3. In the imply/var/sv/druid/broker.log you will see queries logged in JSON format with the user id.
Example output:
2018-11-07T16:14:15,378 INFO [DruidSchema-Cache-0] io.druid.sql.calcite.schema.DruidSchema - Refreshed metadata for dataSource[wikipedia] in 447 ms (1 segments queried, 0 segments left).
2018-11-07T16:14:26,574 INFO [qtp404994500-119] io.druid.server.log.LoggingRequestLogger - 2018-11-07T16:14:26.303Z 0:0:0:0:0:0:0:1 {"queryType":"topN","dataSource":{"type":"table","name":"wikipedia"},"virtualColumns":[],"dimension":{"type":"default","dimension":"page","outputName":"page","outputType":"STRING"},"metric":{"type":"LegacyTopNMetricSpec","metric":"sum_deleted"},"threshold":100,"intervals":{"type":"LegacySegmentSpec","intervals":["2016-06-26T21:32:00.000Z/2016-06-27T21:32:00.000Z"]},"filter":null,"granularity":{"type":"all"},"aggregations":[{"type":"doubleSum","name":"sum_deleted","fieldName":"deleted","expression":null}],"postAggregations":[],"context":{"implyDataCube":"wikipedia","implyFeature":"visualization","implyUser":"user","implyView":"wikipedia","implyViewTitle":"Wikipedia","priority":1,"queryId":"9fee261c-dd0d-4949-8376-09661b762841","timeout":40000},"descending":false} {"query/time":265,"query/bytes":7761,"success":true,"identity":"allowAll"}
References:
https://druid.apache.org/docs/latest/configuration/index.html#request-logging
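Each request log entry in the example output carries a timestamp, the remote address, the query JSON and a second JSON object with statistics. As a rough sketch, assuming every entry follows the layout of that one example (other query types or Druid versions may log differently), the fields can be pulled apart in Python like this. The function name and the shortened sample line are illustrative.

```python
import json

MARKER = "LoggingRequestLogger - "


def parse_request_log_line(line):
    """Split one broker request-log line into (timestamp, remote_addr, query, stats)."""
    payload = line.split(MARKER, 1)[1]
    timestamp, remote_addr, rest = payload.split(" ", 2)
    decoder = json.JSONDecoder()
    # The query JSON comes first; raw_decode reports where it ends.
    query, end = decoder.raw_decode(rest)
    # The statistics JSON follows after a single space.
    stats, _ = decoder.raw_decode(rest[end:].lstrip())
    return timestamp, remote_addr, query, stats


# Shortened version of the example entry above.
SAMPLE = (
    '2018-11-07T16:14:26,574 INFO [qtp404994500-119] '
    'io.druid.server.log.LoggingRequestLogger - '
    '2018-11-07T16:14:26.303Z 0:0:0:0:0:0:0:1 '
    '{"queryType":"topN","dataSource":{"type":"table","name":"wikipedia"},'
    '"context":{"implyUser":"user"}} '
    '{"query/time":265,"query/bytes":7761,"success":true,"identity":"allowAll"}'
)

if __name__ == "__main__":
    ts, addr, query, stats = parse_request_log_line(SAMPLE)
    print(ts, addr, query["queryType"], stats["query/time"])
```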
Druid Metastore Database Schema

Druid Version: 0.12.3
Note: Foreign key relationships displayed are logical and for display purposes only. Druid does not create any foreign key constraints.
MapR Hadoop as deep storage for Druid cluster using mapr client

Follow these steps to use MapR Hadoop as deep storage:
Install and configure MapR client
You need to install the MapR client that matches your MapR Hadoop installation on all middle manager/historical nodes.
https://mapr.com/docs/52/AdvancedInstallation/SettingUptheClient-install-mapr-client.html
https://mapr.com/docs/61/MapR-DB/Installing-mapr-client.html
Run the configure.sh command for the MapR client
Un-secure cluster example...
In the following example, the -N parameter specifies the cluster name, the -c (lowercase) option specifies a client configuration, the -C (uppercase) parameter specifies the CLDB nodes, and -HS specifies the HistoryServer node:
Example:
sudo /opt/mapr/server/configure.sh -N my.cluster.com -c -C mynode01:7222 -HS nodeA
jar changes
Move imply/dist/druid/extensions/druid-hdfs-storage to a backup location
Copy all mapr jars from /opt/mapr/lib to new directory imply/dist/druid/extensions/druid-hdfs-storage
Copy druid-hdfs-storage-0.14.0-incubating-iap4.jar from the backup directory into the new druid-hdfs-storage directory.
rm dist/druid/extensions/druid-hdfs-storage/jackson-*.jar
rm dist/druid/extensions/druid-hdfs-storage/guava-14.0.1.jar
rm dist/druid/extensions/druid-hdfs-storage/joda-time-2.0.jar (if it exists)
Common.runtime.properties (modifications)
druid-hdfs-storage should remain as an extension in the common.runtime.properties
common.runtime.properties needs to include
druid.storage.type=hdfs
druid.storage.storageDirectory=maprfs:///data/default-rack
In your common.runtime.properties, comment out the following parameter:
###druid.extensions.hadoopDependenciesDir=dist/druid/hadoop-dependencies
In your common.runtime.properties, set druid.extensions.useExtensionClassloaderFirst=true
This can be placed after your druid.extensions.loadList
Verification
Verify connectivity from the Druid middle manager to the MapR Hadoop cluster by running the following command (you should get a list of directory contents). If this does not work, the information entered for the MapR client when you ran configure.sh was most likely wrong.
hadoop fs -ls /
Last step
Restart all Imply processes on your middlemanagers/historical nodes
Reference
MapR Documentation
Qubole - Imply Cloud Integration

This article outlines the steps involved in ingesting data in batch mode from Qubole Hadoop into Imply Cloud. Integrating Qubole <> Imply & Druid is just a few clicks away.
Instructions:
1. All Imply clusters use Java 8 by default at the time of writing. Please make sure your Qubole cluster uses Java 8 as well. If it uses Java 7, you can change to Java 8 using the following bootstrap script:
Bootstrap script:
#!/bin/bash
source /usr/lib/hustler/bin/qubole-bash-lib.sh
export PROFILE_FILE=${PROFILE_FILE:-/etc/profile}
export HADOOP_ETC_DIR=${HADOOP_ETC_DIR:-/usr/lib/hadoop2/etc/hadoop}
function restart_master_services() {
  monit unmonitor namenode
  monit unmonitor timelineserver
  monit unmonitor historyserver
  monit unmonitor resourcemanager
  /bin/su -s /bin/bash -c '/usr/lib/hadoop2/sbin/yarn-daemon.sh stop timelineserver' yarn
  /bin/su -s /bin/bash -c 'HADOOP_LIBEXEC_DIR=/usr/lib/hadoop2/libexec /usr/lib/hadoop2/sbin/mr-jobhistory-daemon.sh stop historyserver' mapred
  /bin/su -s /bin/bash -c '/usr/lib/hadoop2/sbin/yarn-daemon.sh stop resourcemanager' yarn
  /bin/su -s /bin/bash -c '/usr/lib/hadoop2/sbin/hadoop-daemon.sh stop namenode' hdfs
  /bin/su -s /bin/bash -c '/usr/lib/hadoop2/sbin/hadoop-daemon.sh start namenode' hdfs
  /bin/su -s /bin/bash -c '/usr/lib/hadoop2/sbin/yarn-daemon.sh start resourcemanager' yarn
  /bin/su -s /bin/bash -c 'HADOOP_LIBEXEC_DIR=/usr/lib/hadoop2/libexec /usr/lib/hadoop2/sbin/mr-jobhistory-daemon.sh start historyserver' mapred
  /bin/su -s /bin/bash -c '/usr/lib/hadoop2/sbin/yarn-daemon.sh start timelineserver' yarn
  sudo /usr/lib/zeppelin/bin/zeppelin-daemon.sh stop # as root user
  sudo /usr/lib/zeppelin/bin/zeppelin-daemon.sh start # as root user
  monit monitor namenode
  monit monitor resourcemanager
  monit monitor historyserver
  monit monitor timelineserver
}
function restart_worker_services() {
  monit unmonitor datanode
  /bin/su -s /bin/bash -c '/usr/lib/hadoop2/sbin/hadoop-daemon.sh stop datanode' hdfs
  /bin/su -s /bin/bash -c '/usr/lib/hadoop2/sbin/hadoop-daemon.sh start datanode' hdfs
  monit monitor datanode
  # No need to restart nodemanager since it starts only
  # after the bootstrap is finished
}
function use_java8() {
  export JAVA_HOME=/usr/lib/jvm/java-1.8.0
  export PATH=$JAVA_HOME/bin:$PATH
  echo "export JAVA_HOME=/usr/lib/jvm/java-1.8.0" >> "$PROFILE_FILE"
  echo "export PATH=$JAVA_HOME/bin:$PATH" >> "$PROFILE_FILE"
  sudo echo "export JAVA_HOME=/usr/lib/jvm/java-1.8.0_60" >> /usr/lib/zeppelin/conf/zeppelin-env.sh
  sed -i 's/java-1.7.0/java-1.8.0/' "$HADOOP_ETC_DIR/hadoop-env.sh"
  rm -rf /usr/bin/java
  ln -s $JAVA_HOME/bin/java /usr/bin/java
  is_master=$(nodeinfo is_master)
  if [[ "$is_master" == "1" ]]; then
    restart_master_services
  else
    restart_worker_services
  fi
}
use_java8
Restart the cluster so that it picks up Java 8.
2. Get the master node IP address/hostname from the Qubole cluster. You can find it here: https://us.qubole.com/clusters#/details/<cluster_id>/instances?tab=Nodes. Make sure to replace <cluster_id> with your cluster ID.
Imply Cloud Documents
You can use the private IP, provided that traffic from the Qubole cluster can reach the Imply cluster. Please contact your network administrator for help.
3. You then have to tell the Imply middle manager to talk to the Hadoop master for data processing. Go to the Imply clusters page, select the desired cluster, click on Manage, click on setup, scroll down and click on Advanced Config.
Find the Middle Manager advanced configuration box and add the following entries for your Qubole cluster:
hadoop.fs.defaultFS=hdfs://<Qubole_cluster_master_node_IP>:9000
hadoop.yarn.resourcemanager.hostname=<Qubole_cluster_master_node_IP>
hadoop.yarn.application.classpath=/usr/lib/hadoop2/etc/hadoop,/usr/lib/hadoop2/*,/usr/lib/hadoop2/lib/*,/usr/lib/hadoop2/share/hadoop/common/*,/usr/lib/hadoop2/share/hadoop/common/lib/*,/usr/lib/hadoop2/share/hadoop/hdfs/*,/usr/lib/hadoop2/share/hadoop/hdfs/lib/*,/usr/lib/hadoop2/share/hadoop/yarn/*,/usr/lib/hadoop2/share/hadoop/yarn/lib/*,/usr/lib/hadoop2/share/hadoop/mapreduce/*,/usr/lib/hadoop2/share/hadoop/mapreduce/lib/*,/usr/lib/hadoop2/share/hadoop/tools/*,/usr/lib/hadoop2/share/hadoop/tools/lib/*,/usr/lib/hadoop2/share/hadoop/qubole/*,/usr/lib/hadoop2/share/hadoop/qubole/lib/*
hadoop.mapreduce.framework.name=yarn
hadoop.fs.s3n.awsAccessKeyId=<access_key>
hadoop.fs.s3n.awsSecretAccessKey=<secret_key>
Restart the Imply cluster so that it picks up the added entries.
4. Include the following job properties in your ingestion specification to avoid block and multipart size errors:
Tuning Config:
"tuningConfig": {
  "type": "hadoop",
  "jobProperties": {
    "mapreduce.job.user.classpath.first": "true",
    "fs.s3a.readahead.range": "65536",
    "fs.s3a.multipart.size": "104857600",
    "fs.s3a.block.size": "33554432"
  }
}
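The three S3A sizes above are byte counts: 64 KiB for the read-ahead range, 100 MiB for the multipart size and 32 MiB for the block size. As an illustration only, this Python sketch derives the same strings from named units and prints the resulting tuningConfig fragment; the variable names are assumptions made for the example.

```python
import json

KIB = 1024
MIB = 1024 * KIB

# Same values as the tuningConfig above, expressed in readable units.
job_properties = {
    "mapreduce.job.user.classpath.first": "true",
    "fs.s3a.readahead.range": str(64 * KIB),
    "fs.s3a.multipart.size": str(100 * MIB),
    "fs.s3a.block.size": str(32 * MIB),
}

tuning_config = {"type": "hadoop", "jobProperties": job_properties}

if __name__ == "__main__":
    print(json.dumps({"tuningConfig": tuning_config}, indent=2))
```

Hadoop job properties are passed as strings, which is why the computed sizes are converted with str().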
The above four steps should establish communication between Qubole hadoop and Imply clusters and data ingestion should directly hit deep storage after aggregations.
Reference:
Qubole
Qubole Documents
How to run Hadoop batch ingestion using GCP

Objective:
Step by step guide to configure Druid to run Hadoop ingestion with Google Cloud Platform.
Step 1: Login to GCP console:
Go to the GCP console page https://console.cloud.google.com/ and log in with your credentials.
Step 2: Create a service account that will access the bucket:
Browse to "IAM & Admin" -> "Service Accounts" -> "Create Service Account" and create a service account ("imply-cs" in this example).
In the permissions section of "imply-cs", make sure your user name is listed as one of the owners.
Click on "Create key".
Download the JSON file that is created and save it for later use. It is the authentication/authorization credential that your Druid hosts use to access GCP.
Step 3: Apply the bucket access permission to the service account just created:
Browse to your GCP bucket through Google navigation menu -> storage -> browser -> YOUR BUCKET ("cs-bucket1" in this example)
Configure user access permission on the bucket
Add the service account as Storage Admin to this bucket.
Step 4: Configure environmental variable on all Druid hosts
Open SSH sessions to all Druid hosts, and save the JSON file downloaded in step 2 to Imply Druid's conf (./conf) directory on all hosts. Then, in your bash profile (e.g. '~/.bash_profile' on RHEL/CentOS), create a new environment variable GOOGLE_APPLICATION_CREDENTIALS set to the full path of the JSON file that contains your service account key. For example:
[root@ip-172-31-2-115 ec2-user]# cat ~/.bash_profile
# .bash_profile
export GOOGLE_APPLICATION_CREDENTIALS="/imply-<VERSION>/conf/googleauth.json"
[root@ip-172-31-2-115 ec2-user]# cat /imply-<VERSION>/conf/googleauth.json
{
"type": "service_account",
"project_id": "<ID>",
"private_key_id": "<KEY_ID>",
"private_key": "-----BEGIN PRIVATE KEY-----<KEY>-----END PRIVATE KEY-----\n",
"client_email": "imply-cs@<ID>.iam.gserviceaccount.com",
"client_id": "<CLIENT_ID>",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://oauth2.googleapis.com/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url": "<URL>"
}
Step 5: Add "druid-google-extensions" to "druid.extensions.loadList" in Druid's "common.runtime.properties".
Step 6: Add the following to replace the current deep storage config in "common.runtime.properties":
druid.storage.type=google
druid.google.bucket=cs-bucket1
druid.google.prefix=druid/segments
Step 7: Add the following to replace the current indexing log config in "common.runtime.properties":
druid.indexer.logs.type=google
druid.indexer.logs.bucket=cs-bucket1
druid.indexer.logs.prefix=druid/indexing-logs
Step 8: Download the GCS connector jar and place it in both Druid's lib directory and the "druid-google-extensions" directory:
wget https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop2-latest.jar
/imply-2.8.17/dist/druid/extensions/druid-google-extensions/
/imply-2.8.17/dist/druid/lib/
Step 9: Configure the Hadoop to interact with GCP
Copy the same "googleauth.json" in step 4 to "/etc/hadoop/conf" directory on all Hadoop hosts.
Change the file permission to 444
Add following config to "core-site.xml" file on all Hadoop hosts
fs.gs.auth.service.account.json.keyfile=/etc/hadoop/conf/googleauth.json
fs.gs.working.dir=/
fs.gs.path.encoding=uri-path
fs.gs.reported.permissions=777
Restart Hadoop cluster.
Test GCP access using HDFS client
hadoop fs -ls gs://cs-bucket1/
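The four settings in Step 9 are listed as key=value pairs, while core-site.xml expects one property element per setting. As a minimal sketch, the following Python snippet renders the same four pairs as property elements that can be pasted inside the existing configuration element. The helper name is illustrative; the keys and values are exactly those listed above.

```python
import xml.etree.ElementTree as ET

GCS_PROPERTIES = {
    "fs.gs.auth.service.account.json.keyfile": "/etc/hadoop/conf/googleauth.json",
    "fs.gs.working.dir": "/",
    "fs.gs.path.encoding": "uri-path",
    "fs.gs.reported.permissions": "777",
}


def to_core_site_properties(props):
    """Render key/value pairs as Hadoop <property> elements, one per line."""
    blocks = []
    for name, value in props.items():
        prop = ET.Element("property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
        blocks.append(ET.tostring(prop, encoding="unicode"))
    return "\n".join(blocks)


if __name__ == "__main__":
    print(to_core_site_properties(GCS_PROPERTIES))
```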
Step 10: Copy the latest "core-site.xml" to /imply-<VERSION>/conf/druid/_common directory on all Druid hosts.
Restart Druid cluster.
Step 11: Upload test raw data JSON file "wikipedia-2016-06-27-sampled.json" to GCP bucket "cs-bucket1"
Step 12: Run Hadoop batch ingestion using the following spec:
{
"type" : "index_hadoop",
"spec" : {
"dataSchema" : {
"dataSource" : "wikipedia-hadoop",
"parser" : {
"type" : "hadoopyString",
"parseSpec" : {
"format" : "json",
"dimensionsSpec" : {
"dimensions" : [ "channel", "cityName", "comment", "countryIsoCode", "countryName", "isAnonymous", "isMinor", "isNew", "isRobot", "isUnpatrolled", "metroCode", "namespace", "page", "regionIsoCode", "regionName", "user", {
"name" : "commentLength",
"type" : "long"
}, {
"name" : "deltaBucket",
"type" : "long"
}, "flags", "diffUrl", {
"name" : "added",
"type" : "long"
}, {
"name" : "deleted",
"type" : "long"
}, {
"name" : "delta",
"type" : "long"
} ]
},
"timestampSpec" : {
"column" : "timestamp",
"format" : "iso"
}
}
},
"metricsSpec" : [ ],
"granularitySpec" : {
"type" : "uniform",
"segmentGranularity" : "DAY",
"queryGranularity" : {
"type" : "none"
},
"rollup" : false,
"intervals" : [ "2016-06-20T00:00:00.000Z/2016-07-02T00:00:00.000Z" ]
},
"transformSpec" : {
"filter" : null,
"transforms" : [ ]
}
},
"ioConfig" : {
"type" : "hadoop",
"inputSpec" : {
"type" : "static",
"paths" : "gs://cs-bucket1/wikipedia-2016-06-27-sampled.json"
},
"metadataUpdateSpec" : null,
"segmentOutputPath" : null
},
"tuningConfig" : {
"type" : "hadoop",
"workingPath" : null,
"partitionsSpec" : {
"type" : "hashed",
"targetPartitionSize" : 5000000,
"maxPartitionSize" : 7500000,
"assumeGrouped" : false,
"numShards" : -1,
"partitionDimensions" : [ ]
},
"shardSpecs" : { },
"indexSpec" : {
"bitmap" : {
"type" : "concise"
},
"dimensionCompression" : "lz4",
"metricCompression" : "lz4",
"longEncoding" : "longs"
},
"maxRowsInMemory" : 1000000,
"maxBytesInMemory" : 0,
"leaveIntermediate" : false,
"cleanupOnFailure" : true,
"overwriteFiles" : false,
"ignoreInvalidRows" : false,
"jobProperties" : {
"mapreduce.job.user.classpath.first" : "true",
"mapreduce.map.java.opts" : "-Duser.timezone=UTC -Dfile.encoding=UTF-8",
"mapreduce.reduce.java.opts" : "-Duser.timezone=UTC -Dfile.encoding=UTF-8"
},
"combineText" : false,
"useCombiner" : false,
"buildV9Directly" : true,
"numBackgroundPersistThreads" : 0,
"forceExtendableShardSpecs" : true,
"useExplicitVersion" : false,
"allowedHadoopPrefix" : [ ],
"logParseExceptions" : false,
"maxParseExceptions" : 0
}
},
"hadoopDependencyCoordinates" : null,
"classpathPrefix" : null,
"context" : { },
"dataSource" : "wikipedia-hadoop"
}
How to run native batch ingestion using GCP

Objective:
Step by step guide to configure Druid to ingest files from Google Cloud Platform.
Step 1: Login to GCP console:
Go to the GCP console page https://console.cloud.google.com/ and log in with your credentials.
Step 2: Create a service account that will access the bucket:
Browse to "IAM & Admin" -> "Service Accounts" -> "Create Service Account" and create a service account ("imply-cs" in this example).
In the permissions section of "imply-cs", make sure your user name is listed as one of the owners.
Click on "Create key".
Download the JSON file that is created and save it for later use. It is the authentication/authorization credential that your Druid hosts use to access GCP.
Step 3: Apply the bucket access permission to the service account just created:
Browse to your GCP bucket through Google navigation menu -> storage -> browser -> YOUR BUCKET ("cs-bucket1" in this example)
Configure user access permission on the bucket
Add the service account as Storage Admin to this bucket.
Step 4: Configure environmental variable on all Druid hosts
Open SSH sessions to all Druid hosts, and save the JSON file downloaded in step 2 to Imply Druid's conf (./conf) directory on all hosts. Then, in your bash profile (e.g. '~/.bash_profile' on RHEL/CentOS), create a new environment variable GOOGLE_APPLICATION_CREDENTIALS set to the full path of the JSON file that contains your service account key. For example:
[root@ip-172-31-2-115 ec2-user]# cat ~/.bash_profile
# .bash_profile
export GOOGLE_APPLICATION_CREDENTIALS="/imply-<VERSION>/conf/googleauth.json"
[root@ip-172-31-2-115 ec2-user]# cat /imply-<VERSION>/conf/googleauth.json
{
"type": "service_account",
"project_id": "<ID>",
"private_key_id": "<KEY_ID>",
"private_key": "-----BEGIN PRIVATE KEY-----<KEY>-----END PRIVATE KEY-----\n",
"client_email": "imply-cs@<ID>.iam.gserviceaccount.com",
"client_id": "<CLIENT_ID>",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://oauth2.googleapis.com/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url": "<URL>"
}
Step 5: Add "druid-google-extensions" to "druid.extensions.loadList" in Druid's "common.runtime.properties".
Step 6: Add the following to replace the current deep storage config in "common.runtime.properties":
druid.storage.type=google
druid.google.bucket=cs-bucket1
druid.google.prefix=druid/segments
Step 7: Add the following to replace the current indexing log config in "common.runtime.properties":
druid.indexer.logs.type=google
druid.indexer.logs.bucket=cs-bucket1
druid.indexer.logs.prefix=druid/indexing-logs
Step 8: Restart all Druid services.
Step 9: Upload the raw data file "wikipedia-2016-06-27-sampled.json" to GCP bucket "cs-bucket1".
Step 10: POST the following native batch ingestion spec to start ingestion:
{
"type": "index",
"spec": {
"dataSchema": {
"dataSource": "wikipedia",
"parser": {
"type": "string",
"parseSpec": {
"format": "json",
"dimensionsSpec": {
"dimensions": [
"isRobot",
"diffUrl",
{
"name": "added",
"type": "long"
},
"channel",
"flags",
{
"name": "delta",
"type": "long"
},
"isUnpatrolled",
"isNew",
{
"name": "deltaBucket",
"type": "long"
},
"isMinor",
"isAnonymous",
{
"name": "deleted",
"type": "long"
},
"namespace",
"comment",
"page",
{
"name": "commentLength",
"type": "long"
},
"user",
"countryIsoCode",
"regionName",
"cityName",
"countryName",
"regionIsoCode",
{
"name": "metroCode",
"type": "long"
}
]
},
"timestampSpec": {
"column": "timestamp",
"format": "iso"
}
}
},
"granularitySpec": {
"type": "uniform",
"segmentGranularity": "DAY",
"rollup": false,
"queryGranularity": "none"
},
"metricsSpec": []
},
"ioConfig": {
"type": "index",
"firehose": {
"type":"static-google-blobstore",
"blobs": [
{
"bucket":"cs-bucket1",
"path":"wikipedia-2016-06-27-sampled.json"
} ]
},
"appendToExisting": false
},
"tuningConfig": {
"type": "index",
"forceExtendableShardSpecs": true,
"maxRowsInMemory": 1000000,
"reportParseExceptions": false,
"maxParseExceptions": 100,
"maxSavedParseExceptions": 10
}
}
}
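Step 10 does not show how the spec is posted. Druid accepts ingestion tasks on the Overlord at /druid/indexer/v1/task, so a minimal Python sketch of the submission could look like the following. The Overlord address and the spec file name are hypothetical placeholders, and the helper name is illustrative.

```python
import json
import urllib.request


def build_task_request(overlord_url, spec):
    """Build (but do not send) the HTTP POST that submits an ingestion spec."""
    body = json.dumps(spec).encode("utf-8")
    return urllib.request.Request(
        url=overlord_url.rstrip("/") + "/druid/indexer/v1/task",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical file holding the spec shown above.
    with open("native-batch-spec.json") as fh:
        spec = json.load(fh)
    # Hypothetical Overlord address; adjust host and port to your cluster.
    request = build_task_request("http://your-overlord:8090", spec)
    with urllib.request.urlopen(request) as response:
        print(response.read().decode("utf-8"))
```

Loading the spec with json.load before sending also catches syntax problems in the file, such as a stray typographic quote, before the task reaches Druid.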
How to properly shutdown an on-premise Imply Druid Cluster

There are some instances where you want to do a planned shut down of your on-premise Imply Druid cluster such as network maintenance, operating system updates, or hardware maintenance.
Below are the steps to do so.
Assumptions:
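The Imply cluster runs colocated services as follows: Query Servers (Broker, Pivot, Router), Data Servers (Middle Manager, Historical), and Master Servers (Overlord, Coordinator, Zookeeper).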
Pre-requisites:
Notify your end users that the cluster will be shut down, so they are aware that services powered by Imply will not be operational.
If you have streaming ingestion, ensure that retention policies for the source systems are longer than the planned downtime. For instance, if you have a downtime window of 4 hours, make sure that the retention policy is at least 8 hours to prevent any gaps in data.
Stop all ingestion jobs.
*NOTE: Make sure to have a backup of each ingestion spec, since you may need to resubmit the ingestion jobs.
Steps
Stop the Query Servers (Broker, Pivot, Router). On the Query Server machine, log in as the user running the Imply Druid processes. Then run the command bin/service --down. Verify the service is no longer running by executing ps -ef | grep Imply.
Stop the Data Servers (Middle Manager, Historical). On the Data Server machine, log in as the user running the Imply Druid processes. Then run the command bin/service --down. Verify the service is no longer running by executing ps -ef | grep Imply.
Stop the Master Servers (Overlord, Coordinator, Zookeeper). On the Master Server machine, log in as the user running the Imply Druid processes. Then run the command bin/service --down. Verify the service is no longer running by executing ps -ef | grep Imply.
Zookeeper: Best Practices for Imply

A Druid cluster can only be as stable as its Zookeeper. By design, Druid nodes are prevented from talking to each other. Zookeeper is responsible for much of the communication that keeps the Druid cluster functioning.
ZK Documentation
For maximum ZK stability, follow these best practices:
Architecture
Keep a dedicated ZK for Druid; avoid sharing with other products / applications
Maintain an odd number of ZooKeeper servers. For increased reliability (high availability), use 3 or 5 ZK's so that the ensemble can maintain a quorum, that is, a strict majority of floor(n/2) + 1 servers, in voting decisions. Voting decisions include IP discovery, determining which overlord or coordinator is the leader, etc.
For example, with 3 ZK's, 1 ZK can be lost and the cluster can still function because 2 are left, which is the quorum for 3 (floor(3/2) + 1 = 2). In a cluster with 2 ZK's, the quorum is also 2, so losing either server stops the ensemble and it does not provide high availability. Two ZK's double the probability of failure and are therefore actually less preferable than 1 ZK.
Imply prefers to colocate ZK on master nodes rather than data or query nodes. This is because the coordinator and overlord are normally far less work intensive than data or query nodes.
Never put ZK behind a load balancer. This can work if all ZK servers are healthy, but if one goes down it can be difficult for the client to find a replacement. Because ZK recovery is hampered in this case, it defeats the purpose of high availability.
ZK data and log files should be on the disks with the least contention from other I/O activity. Ideally the ZK data and ZK transaction log files should be on different disks, so that they don't contend for I/O resources. Note that separate partitions are not enough; they have to be different physical disks to ensure performance. If possible, use high-speed disks such as SSDs for the dataDir and dataLogDir locations.
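The sizing advice above follows from the fact that a ZooKeeper ensemble stays available only while a strict majority of its servers is up. As a rough illustration, this Python sketch computes the quorum size and the number of servers that can be lost for a few ensemble sizes. The function names are illustrative and are not part of ZooKeeper or Druid.

```python
def quorum_size(ensemble_size):
    """Smallest strict majority of the ensemble."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many servers can be lost while a quorum still remains."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5):
        print(f"{n} server(s): quorum={quorum_size(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

Even ensemble sizes add a server without adding any failure tolerance (2 tolerates as many failures as 1, and 4 as many as 3), which is why odd sizes are recommended.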
Configuration
List ZK IP's as hosts rather than DNS names in the common.runtime.properties. The reason for this is that there are some bugs in the part of the ZK code that handles DNS mappings, including one where ZK won't retry the connection if DNS becomes unavailable.
Example common.runtime.properties:
druid.zk.service.host=172.16.xxx.x1,172.16.xxx.x2,172.16.xxx.x3
When running in HA mode, add all ZK server IP's to zoo.cfg to ensure that ZK can properly elect its own leader:
...
#
# Ensemble
#
server.1=172.16.xxx.x1:2888:3888
server.2=172.16.xxx.x2:2888:3888
server.3=172.16.xxx.x3:2888:3888
In the dataDir for the respective ZK servers, ensure that the myid file has the same number as defined by server.x in the above. For example, if dataDir is defined as /opt/imply-2.8.7/var/zk, the myid file on server.1 should contain 1, and on server.2 it should contain 2. This can be verified by running cat /opt/imply-2.8.7/var/zk/myid on server.1 (172.16.xxx.x1).
Try to keep the ZooKeeper transaction log and data directory on independent disks. These can be defined in zoo.cfg using the dataLogDir and dataDir parameters.
In ZK version 3.4.0 and above, enable auto purging of transaction logs using autopurge.purgeInterval=1 in zoo.cfg. This ensures that older transaction logs are cleaned up regularly and avoids disk-full situations. By default 3 transaction logs are retained; this is controlled by autopurge.snapRetainCount in zoo.cfg.
All ZK servers should use the same configuration
Refer to the ZK documentation for more information.
If you're having a "too many open files" problem.. I feel bad for you son, but I got 99 problems and a ulimit setting ain't one.

Symptoms:
"Too many open files" messages showing up in your logs? Tasks failing?
First step - check your ulimit
ulimit -a
If the value is < 10,000 you should probably increase it.
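The same check can be scripted. The following Python sketch reads the open-file limit of the current process through the standard resource module (available on Unix-like systems only) and compares it with the 10,000 threshold mentioned above. The helper name and the constant are illustrative.

```python
import resource

RECOMMENDED_MIN = 10_000  # threshold suggested in this article


def limit_too_low(soft_limit, minimum=RECOMMENDED_MIN):
    """Return True if the given open-file soft limit is below the minimum."""
    if soft_limit == resource.RLIM_INFINITY:
        # An unlimited setting can never be too low.
        return False
    return soft_limit < minimum


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"open files: soft={soft}, hard={hard}")
    if limit_too_low(soft):
        print("The open-file limit is probably too low for Druid.")
```

Run it as the same user that runs the Druid processes, since limits are applied per user and per session.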
Root Cause:
The default open file settings for Linux are often too low for an Apache Druid system, which can be interacting with thousands of files simultaneously.
Solution:
Per session solution (will be reverted on reboot):
NOTE: Ensure you have root access
Execute the following
ulimit -n 65536
ulimit -u 65536
Permanent solution:
vi /etc/security/limits.conf
Ensure the following values are set:
* soft nproc 65535
* hard nproc 65535
* soft nofile 65535
* hard nofile 65535
Reference:
https://access.redhat.com/solutions/61334
https://www.youtube.com/watch?v=6uikJTnmtgw
Querying Druid from the command line

Druid SQL
The simplest way to make Druid SQL queries from the command line is using the dsql tool included with the Imply distribution:
$ bin/dsql
Welcome to dsql, the command-line client for Druid SQL.
Connected to [http://localhost:8082/].
Type "\h" for help.
dsql> SELECT COUNT(*) from wikipedia;
EXPR$0
24433
Retrieved 1 row in 0.05s.
By default, the dsql tool connects to a broker on localhost without security enabled. If you need to connect to another broker, use a command like:
bin/dsql -H http://your-broker:8082/
If you need to connect to a broker with security enabled, try the following (Imply 2.8 or later required):
bin/dsql -H https://your-broker:8282/ -u admin:yourpassword
Refer to the Imply Druid SQL documentation, or the community Druid SQL documentation, for more details about the kinds of queries you can issue.
You can also make Druid SQL queries using Pivot through the "SQL" tab.
Native Druid queries
To issue native Druid JSON queries from the command line, run:
curl -XPOST -d @query.json -H'Content-Type: application/json' http://your-broker:8082/druid/v2/
Where query.json is a file containing the query JSON you wish to issue. Refer to the Druid querying documentation for details about the kinds of queries you can issue.
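As an example of what query.json might contain, the Python sketch below writes a small native timeseries query that counts the rows of a datasource over one interval. The datasource name comes from the dsql example above; the interval and the helper name are assumptions made for illustration.

```python
import json


def row_count_query(datasource, interval):
    """Native timeseries query that counts the rows in one interval."""
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": "all",
        "intervals": [interval],
        "aggregations": [{"type": "count", "name": "rows"}],
    }


if __name__ == "__main__":
    # The interval is an assumed example; use one that covers your data.
    query = row_count_query("wikipedia", "2016-06-27/2016-06-28")
    with open("query.json", "w") as fh:
        json.dump(query, fh, indent=2)
```

The resulting file can then be posted with the curl command shown above.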
Kerberized Kafka Ingestion

For an overview of Kafka security features and guidance for setting up Kerberized Kafka, please see:
https://www.confluent.io/blog/apache-kafka-security-authorization-authentication-encryption/
In order for Druid to connect with Kerberized Kafka, a JAAS conf file needs to be set up on each service. There are two alternate methods of authentication from within the JAAS conf file:
1. setting useTicketCache=true
2. create keytab based authentication. We provide an example of this below.
1. Create keytab for Druid
Where <druid-user> is the user that runs the Druid services:
addprinc <druid-user>@EXAMPLE.COM
ktadd -k /keytab/<druid-user>.keytab <druid-user>@EXAMPLE.COM
2. Create JAAS conf file for Druid as Kafka Client:
/path/to/client_jaas.conf
*Note: The Client section is only necessary if ZK is Kerberized
KafkaClient {
com.sun.security.auth.module.Krb5LoginModule required
useTicketCache=false
useKeyTab=true
storeKey=true
keyTab="/keytab/<druid-user>.keytab"
principal="<druid-user>@EXAMPLE.COM";
};
Client {
com.sun.security.auth.module.Krb5LoginModule required
useTicketCache=false
useKeyTab=true
storeKey=true
keyTab="/keytab/<druid-user>.keytab"
principal="<druid-user>@EXAMPLE.COM";
};
3. Edit Druid properties
In the jvm.config for each service, add the path to the JAAS conf file:
-Djava.security.auth.login.config=/path/to/client_jaas.conf
In conf/middleManager/runtime.properties (or conf-quickstart if running quickstart), add to druid.indexer.runner.javaOpts:
-Djava.security.auth.login.config=/path/to/client_jaas.conf
In conf/druid/_common/common.runtime.properties (or conf-quickstart if running quickstart), set the following. NOTE: this part is only necessary if ZK is Kerberized:
druid.zk.service.host={host}.example.com
Note: This procedure was tested with Imply 2.6.8 and Kafka kafka_2.11-2.0.0.
Using Oracle Cloud S3-compatible Object Storage as a deep storage for Druid

These instructions have been tested with Imply 2.8 (based on Druid 0.13.0).
Oracle Cloud provides an S3-compatible Object Storage API, which can be used as a Druid deep storage. This requires Imply 2.8 or later. To achieve this, add the following properties to your common.runtime.properties:
druid.storage.type=s3
druid.storage.bucket=yourBucket
druid.storage.baseKey=druid/segments
druid.storage.disableAcl=true
druid.s3.endpoint.url=<hostname>:443
druid.s3.endpoint.signingRegion=<region>
druid.s3.accessKey=<accessKey>
druid.s3.secretKey=<secretKey>
druid.s3.disableChunkedEncoding=true
druid.s3.enablePathStyleAccess=true
druid.s3.forceGlobalBucketAccessEnabled=false
See https://docs.cloud.oracle.com/iaas/Content/Object/Tasks/s3compatibleapi.htm for more information about Oracle's S3 compatible Object Storage API.
Using OpenStack Swift as deep storage for Druid

About Swift
OpenStack Swift is a self-hosted option that offers enough compatibility with the S3 API to be used as deep storage via Druid's S3 integration.
Imply 2.7.x and earlier (Druid 0.12.x and earlier)
These versions use the JetS3t library to access S3 APIs.
Configure S3 deep storage the usual way: add "druid-s3-extensions" to your extension list, configure access and secret keys, and configure a bucket and path.
Create a file "conf/druid/_common/jets3t.properties" (or "conf-quickstart/druid/_common/jets3t.properties" if you are using the quickstart config) and include the following lines:
s3service.s3-endpoint=swift.example.com
s3service.s3-endpoint-http-port=8080
s3service.https-only=false
s3service.disable-dns-buckets=true
s3service.enable-storage-classes=false
Imply 2.8 and later
These versions use the Amazon aws-java-sdk to access S3 APIs. We do not have specifics at this time about how to integrate these versions with Swift for deep storage.
View Article
Kafka Indexing -> Segment Flow

Where do Kafka messages go when they Enter the Druid? Do they morph into insects? Do they exist at all?
This diagram shows the relationship between partitions on the Kafka broker, ingestion tasks, and Druid segments during Kafka Indexing. Druid parameters that influence ingestion efficiency and number of segments are noted.
The Kafka Indexing Service works differently before and after Druid 0.12 (Imply 2.5.x). The main difference is that pre-0.12, segments could not mix data from different partitions. This can have a large impact on total number of segments generated when using <0.12.
http://druid.io/docs/latest/tutorials/tutorial-kafka.html
Additional references:
http://druid.io/docs/latest/development/extensions-core/kafka-ingestion.html
View Article
Temporary space issue with Dimension Partitioning on EMR instances in Hadoop_index job

Challenge: The following error appears when running a Hadoop_index job on EMR instances with dimension partitioning:
Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in InMemoryMerger - Thread to merge in-memory shuffled map-outputs
at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:377)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/attempt_1543312496194_0001_r_000000_1005/map_421.out
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:402)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
at org.apache.hadoop.mapred.YarnOutputFiles.getInputFileForWrite(YarnOutputFiles.java:213)
at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl$InMemoryMerger.merge(MergeManagerImpl.java:459)
at org.apache.hadoop.mapreduce.task.reduce.MergeThread.run(MergeThread.java:94)
Root Cause: Dimension partitioning has to re-group the data, so it needs more temporary space for the shuffle phase. The EMR instances either do not have enough local disk space or are not configured adequately for it.
Solution: Either upgrade the EMR instances to a larger configuration, or run the job over smaller intervals, such as 1 hour or 6 hours instead of 1 day or 1 week.
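As a rough illustration of the "smaller intervals" option, the sketch below splits one day into 6-hour ISO-8601 intervals that could each be used as the interval of a separate job. The function name `split_interval` and the dates are hypothetical, and naive datetimes are assumed to be UTC.

```python
from datetime import datetime, timedelta

def split_interval(start, end, step):
    """Split [start, end) into consecutive ISO-8601 intervals of at most `step`."""
    intervals = []
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)
        # Naive datetimes are assumed to be UTC, hence the literal "Z" suffix.
        intervals.append(f"{cursor.isoformat()}Z/{upper.isoformat()}Z")
        cursor = upper
    return intervals

# Illustrative dates only: one day split into four 6-hour chunks.
day = split_interval(datetime(2018, 11, 1), datetime(2018, 11, 2), timedelta(hours=6))
for interval in day:
    print(interval)
```

Each printed interval covers a quarter of the day, so each job shuffles correspondingly less data at a time.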
Reference Links :
https://aws.amazon.com/emr/
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html
View Article
Creating a Data Cube From Multiple Datasets

***Imply Version: 2.7.5 and higher***
With the added ability to perform UNION ALL operations on datasets, Imply also offers the ability to create a data cube from multiple datasets.
Before we begin, there are a couple of things to note about this capability.
This is a logical operation at query time, so no changes are made to the underlying Druid datasets
This is a UNION ALL operation, so duplicates between datasets will be preserved
Any dimensions that are missing from a dataset will be treated as NULL values
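The three notes above can be illustrated with a small in-memory sketch. This is not how Imply implements the feature; `union_all` is a hypothetical helper and the rows are made up, purely to show that duplicates survive and that missing dimensions come back as NULL (`None` in Python).

```python
def union_all(*datasets):
    """Concatenate row lists, keeping duplicates and filling missing columns with None."""
    columns = []
    for rows in datasets:
        for row in rows:
            for name in row:
                if name not in columns:
                    columns.append(name)
    combined = []
    for rows in datasets:
        for row in rows:
            # row.get() returns None for a dimension this dataset does not have.
            combined.append({name: row.get(name) for name in columns})
    return combined

# Made-up rows: table_b has no "date" dimension and repeats one row of table_a.
table_a = [
    {"symbol": "SPY", "close": 127.5},
    {"symbol": "SPY", "close": 128.0, "date": "2012-01-03"},
]
table_b = [{"symbol": "SPY", "close": 127.5}]

combined = union_all(table_a, table_b)
for row in combined:
    print(row)
```

The source lists are left untouched, which mirrors the point that the union is a query-time operation.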
The Gist
To create a data cube from multiple data sources, add the following JSON to the Data Cube Options field in the Advanced section of the Data Cube Properties, where TABLE_A and TABLE_B are the names of the data sources you wish to use.
{
"altDataSource": [
"TABLE_A",
"TABLE_B"
]
}
The Tutorial
Getting Started - Loading the data
For this example, I will be using sample data from https://quantquote.com/historical-stock-data.
First, let's create two datasets. For my example, I am using the Minute and Second resolution data provided as samples by QuantQuote. The datasets are available here:
https://quantquote.com/sample/SPY_SECOND_TRADE.csv
https://quantquote.com/sample/SPY_MINUTE_TRADE.csv
The data format is CSV, with a header that contains the column names.
You'll notice that the Minute level data contains a date field, while the Second level data does not.
For these datasets, we will not be using a time column or a roll-up
Leaving the defaults for the columns
And we will go ahead and Automatically Create a data cube for each of the data sources, just for comparison purposes.
A quick review and we go ahead and load the data!
Create the Data Cube
We are going to create a new data cube, and initially select the data source with all of the fields you wish to include in the cube (in this case we start with the By Minute data source, as it contains the date field that does not exist in the Second level dataset)
Go ahead and change the name, so we don't confuse it with our existing data cubes.
Jump down to the Advanced tab, scroll to the bottom to the Data Cube Options, and enter the following JSON.
{
"altDataSource": [
"SPY_MINUTE_TRADE",
"SPY_SECOND_TRADE"
]
}
Click Save and we're done.
Validation
Spy Minute Trade = 748 Events
Spy Second Trade = 19.15k Events
Spy Combined Trade = 19,150 + 748 = 19,898 Events
View Article
How to configure log rotation in Druid using RollingFileAppender

Imply 2.8 and later
Imply 2.8 and later provide log rotation out of the box on supported platforms (Linux x86_64 and macOS) through the supervise program. Logs appear in var/sv/<service>/current and are rotated to other files in that same directory, before eventually being deleted.
The remainder of this article is useful for Imply 2.7 and earlier.
Imply 2.7 and earlier
In Imply 2.7.x and earlier, Druid is configured to use `ConsoleAppender` for logging. However, ConsoleAppender does not roll over old logs and does not do archiving, so all logs of a service are written to a single file, which can grow very large in size over time.
A popular solution is to use `RollingFileAppender` for logging. More details can be found here: https://logging.apache.org/log4j/2.0/manual/appenders.html#RollingFileAppender
The following is an example of configuring `RollingFileAppender` for coordinator logging:
1. Configure `jvm.config` for the coordinator by adding `-Dservice` at the end of the file, set to the name of the service. It is important that each service's jvm.config has a different value for "service"!
[root@ip-172-31-12-141 coordinator]# cat /imply/imply-2.5.14/conf/druid/coordinator/jvm.config
-server
-Xms256m
-Xmx256m
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
-Djava.io.tmpdir=var/tmp
-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
-Dderby.stream.error.file=var/druid/derby.log
-Dservice=coordinator
Similar configuration can be applied to other services, with changes in `-Dservice` value respectively.
2. Back up the original log4j2.xml file, then replace its content with the following:
<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="warn" name="Imply" packages="">
<Appenders>
<RollingFile name="RollingFile" fileName="var/sv/${sys:service}-service.log"
filePattern="var/sv/${sys:service}-service-%d{yyyy-MM-dd}.%i.log.gz">
<PatternLayout>
<Pattern>%d{ISO8601} %p [%t] %c - %m%n</Pattern>
</PatternLayout>
<Policies>
<TimeBasedTriggeringPolicy interval="1" modulate="true"/>
<SizeBasedTriggeringPolicy size="100 MB"/>
</Policies>
<DefaultRolloverStrategy max="10"/>
</RollingFile>
</Appenders>
<Loggers>
<Root level="info">
<AppenderRef ref="RollingFile"/>
</Root>
</Loggers>
</Configuration>
In this new configuration, old logs are rolled over and archived in gz format daily, or whenever the current log file grows larger than 100 MB.
3. IMPORTANT: `RollingFileAppender` will break ingestion task logging, so we need to instruct tasks to keep using `ConsoleAppender` for logging. To do so:
3a. Keep the default log4j2.xml and rename it to log4j2-task.xml. It will have the following contents:
<?xml version="1.0" encoding="UTF-8" ?>
<Configuration status="WARN">
<Appenders>
<Console name="Console" target="SYSTEM_OUT">
<PatternLayout pattern="%d{ISO8601} %p [%t] %c - %m%n"/>
</Console>
</Appenders>
<Loggers>
<Root level="info">
<AppenderRef ref="Console"/>
</Root>
</Loggers>
</Configuration>
3b. Add the property `-Dlog4j.configurationFile` to `druid.indexer.runner.javaOpts` in the middleManager's runtime.properties, and point it to the path of the `log4j2-task.xml` file:
[root@ip-172-31-12-141 middleManager]# cat /imply/imply-2.5.14/conf/druid/middleManager/runtime.properties
...
# Task launch parameters
druid.indexer.runner.javaOpts=-server -Xms512m -Xmx512m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager -Dlog4j.configurationFile=conf/druid/_common/log4j2-task.xml
...
4. The var/sv directory will look like this after the change is applied:
[root@ip-172-31-12-141 sv]# ls -lh
-rw-r--r--. 1 root root 37K Jun 24 00:00 broker-service-2018-06-23.log.gz
-rw-r--r--. 1 root root 14K Jun 25 02:00 broker-service-2018-06-24.log.gz
-rw-r--r--. 1 root root 188K Jun 25 22:55 broker-service.log
-rw-r--r--. 1 root root 1K Jun 25 22:55 broker.log
-rw-r--r--. 1 root root 574K Jun 25 22:55 imply-ui.log
In this example, "broker-service.log" is Druid's log4j2 log, and "broker.log" is standard out. The standard-out log file will generally contain nothing at first. Eventually, it may accumulate logs that do not go through log4j2, including:
GC logs (from the JVM PrintGCTimeStamps option).
Crash logs.
Thread dumps generated by "kill -3".
View Article
Does Druid support TLS/SSL ?

Yes. TLS support was introduced in the Druid 0.11 release (part of Imply 2.4.0).
For more details, please refer to : http://druid.io/docs/latest/operations/tls-support.html
For TLS security on Cloud : https://docs.imply.io/cloud/deploy/security
View Article
How to create/manage users through API for on-premise cluster?

This feature is available for Imply Druid 2.6.0 or greater
Step 1: Make sure to include druid-basic-security as an extension.
Step 2: Add the following configs to the common.runtime.properties:
druid.auth.authenticatorChain=["basic"]
druid.auth.authenticator.basic.type=basic
druid.auth.authenticator.basic.initialAdminPassword=priest
druid.auth.authenticator.basic.initialInternalClientPassword=warlock
druid.auth.authenticator.basic.authorizerName=basic
druid.escalator.type=basic
druid.escalator.internalClientUsername=druid_system
druid.escalator.internalClientPassword=warlock
druid.escalator.authorizerName=basic
druid.auth.authorizers=["basic"]
druid.auth.authorizer.basic.type=basic
When the extension is first loaded, it looks at these properties to determine what initial password to set for the `admin` user and the `druid_system` user. After that, these properties do nothing.
Both `admin` and `druid_system` have R/W * privileges, and are automatically created if an initial password is set for them. In our case, `priest` is the password for `admin` and `warlock` is the password for `druid_system`.
Step 3: Start the Druid cluster. You can now create and manage users:
[root@mf-amos263 extensions]# curl -u admin:priest -X POST http://localhost:8081/druid-ext/basic-security/authentication/db/basic/users/ocean
[root@mf-amos263 extensions]#
[root@mf-amos263 extensions]# curl -u admin:priest -X GET http://localhost:8081/druid-ext/basic-security/authentication/db/basic/users/
["admin","druid_system","ocean"]
[root@mf-amos263 extensions]#
[root@mf-amos263 extensions]# curl -u druid_system:warlock -X GET http://localhost:8081/druid-ext/basic-security/authentication/db/basic/users/
["admin","druid_system","ocean"]
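The same "create user" call can be scripted. The following is a minimal sketch using only the Python standard library; the helper name `build_create_user_request` is hypothetical, while the endpoint path, host, and credentials are the ones from the curl example above. The request is only built here, not sent.

```python
import base64
import urllib.request

def build_create_user_request(coordinator_url, admin_user, admin_password, new_user):
    """Build (but do not send) the POST request that creates a basic-security user."""
    url = (f"{coordinator_url}/druid-ext/basic-security/authentication"
           f"/db/basic/users/{new_user}")
    # HTTP Basic auth: base64 of "user:password", equivalent to curl -u.
    token = base64.b64encode(f"{admin_user}:{admin_password}".encode()).decode()
    return urllib.request.Request(
        url, method="POST", headers={"Authorization": f"Basic {token}"}
    )

request = build_create_user_request("http://localhost:8081", "admin", "priest", "ocean")
print(request.get_method(), request.full_url)
# Against a running cluster, the request could be sent with:
# urllib.request.urlopen(request)
```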
To further manage the users and their roles, please refer to: https://docs.imply.io/cloud/deploy/security#managing-druid-api-users
For other functions in druid-basic-security, please check reference.
Reference: http://druid.io/docs/latest/development/extensions-core/druid-basic-security.html
View Article
Issue: Unable to load data from S3 firehose when "=" in prefix

SYMPTOM:
Ingestion from S3 fails with a message similar to below:
<Code>NoSuchKey</Code>
<Message>The specified key does not exist</Message>
Ingestion spec has ioConfig block similar to the following:
"ioConfig": {
"type": "index",
"firehose": {
"fetchTimeout": 300000,
"type": "static-s3",
"prefixes": ["s3://not-real-bucket/filepath/dt=2018-10-31"]
},
"appendToExisting": "false"
}
ROOT CAUSE:
The cause of this issue is a bug in the jetS3t library when using AWS S3 Signature V4.
WORKAROUND:
The workaround is to revert to using Signature V2 for AWS S3. This can be enabled in the Experimental Features section of the cluster settings as illustrated below.
Resolution:
Beginning with Imply 2.8.0, jetS3t is no longer used. Instead the native AWS S3 API is leveraged, thus resolving this issue.
View Article
How to auto purge task logs in HDFS?

Question: Over time, millions of log files might accumulate in HDFS. How do you configure automatic purging of logs older than x days?
Answer: Admins can enable automatic deletion of indexing task logs by setting druid.indexer.logs.kill.enabled to true, and configure druid.indexer.logs.kill.durationToRetain to retain only the task logs that were created in the last x milliseconds.
Please see http://druid.io/docs/latest/configuration/indexing-service.html#task-logging for more details.
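Because the retention is expressed in milliseconds, it is easy to get the number of zeros wrong. Here is a small sketch that converts days to milliseconds and prints the two properties named above; the helper name `retention_properties` and the 30-day value are only examples.

```python
from datetime import timedelta

def retention_properties(days):
    """Return the two task-log purge properties for a retention period given in days."""
    millis = int(timedelta(days=days).total_seconds() * 1000)
    return [
        "druid.indexer.logs.kill.enabled=true",
        f"druid.indexer.logs.kill.durationToRetain={millis}",
    ]

# Example: keep task logs for 30 days.
print("\n".join(retention_properties(30)))
```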
View Article
How to rename dimension during raw data ingestion

The following ingestion spec renames dimensions during ingestion while keeping their values intact. In this case, the original column names are `user` and `regionName`, and after ingestion the column names in the dataSource are `RENAME_User` and `RENAME_RegionName`:
{
"type" : "index",
"spec" : {
"dataSchema" : {
"dataSource" : "wikipedia-rename",
...
"dimensionsSpec" : {
"dimensions" : [ "RENAME_User", .... "RENAME_RegionName", ... ]
}
}
},
...
"transformSpec" : {
"filter" : null,
"transforms" : [ {
"type" : "expression",
"name" : "RENAME_User",
"expression" : "(\"user\")"
}, {
"type" : "expression",
"name" : "RENAME_RegionName",
"expression" : "(\"regionName\")"
} ]
}
}
...
}
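When many columns need renaming, the `transforms` list can be generated rather than written by hand. This is a sketch under the assumption that each rename is a plain expression transform like the two above; `rename_transforms` is a hypothetical helper, and column names containing double quotes would need extra escaping.

```python
import json

def rename_transforms(renames):
    """Build one expression transform per {old_name: new_name} pair."""
    return [
        {"type": "expression", "name": new_name, "expression": f'("{old_name}")'}
        for old_name, new_name in renames.items()
    ]

transform_spec = {
    "filter": None,
    "transforms": rename_transforms(
        {"user": "RENAME_User", "regionName": "RENAME_RegionName"}
    ),
}
# json.dumps takes care of escaping the inner double quotes.
print(json.dumps(transform_spec, indent=2))
```

The new names still have to be listed under `dimensions` in the `dimensionsSpec`, as in the spec above.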
View Article
Data ingestion failure due to error `Numeric columns do not support multivalue rows`

SYMPTOM:
Data ingestion fails with the following error:
2018-04-30T21:34:50,735 INFO [appenderator_persist_0] io.druid.segment.realtime.appenderator.AppenderatorImpl - Removing sink for segment[SEGMENT_NAME].
2018-04-30T21:34:50,758 ERROR [task-runner-0-priority-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Exception while running task[IndexTask{id=index_SEGMENT, type=index, dataSource=DATASOURCE}]
java.lang.UnsupportedOperationException: Numeric columns do not support multivalue rows.
at io.druid.segment.DoubleDimensionIndexer.processRowValsToUnsortedEncodedKeyComponent(DoubleDimensionIndexer.java:40) ~[druid-processing-0.12.0-iap7.jar:0.12.0-iap7]
at io.druid.segment.DoubleDimensionIndexer.processRowValsToUnsortedEncodedKeyComponent(DoubleDimensionIndexer.java:33) ~[druid-processing-0.12.0-iap7.jar:0.12.0-iap7]
at io.druid.segment.incremental.IncrementalIndex.toTimeAndDims(IncrementalIndex.java:555) ~[druid-processing-0.12.0-iap7.jar:0.12.0-iap7]
at io.druid.segment.incremental.IncrementalIndex.add(IncrementalIndex.java:503) ~[druid-processing-0.12.0-iap7.jar:0.12.0-iap7]
at io.druid.segment.realtime.plumber.Sink.add(Sink.java:157) ~[druid-server-0.12.0-iap7.jar:0.12.0-iap7]
at io.druid.segment.realtime.appenderator.AppenderatorImpl.add(AppenderatorImpl.java:223) ~[druid-server-0.12.0-iap7.jar:0.12.0-iap7]
at io.druid.segment.realtime.appenderator.BaseAppenderatorDriver.append(BaseAppenderatorDriver.java:293) ~[druid-server-0.12.0-iap7.jar:0.12.0-iap7]
ROOT CAUSE:
A row containing a multi-value field was found during data ingestion, and the indexer does not know how to handle it for a numeric column.
WORKAROUND:
Create a dummy config for the property `"listDelimiter"` under `parseSpec`. You can fill in any string as the value.
SOLUTION :
"parseSpec" : {
"format" : "csv",
"hasHeaderRow": true,
"listDelimiter": "abc",
"timestampSpec": {
"column": "TimeStamp"
},
View Article
How to concatenate two columns in Imply UI

This is a feature added to Imply 2.5.8 and later.
As shown in the following example, it is now possible to concatenate two columns together in a derived dimension using a formula such as:
`$columnA ++ ', ' ++ $columnB`
View Article
Receiving error message "Error attaching to process: Doesn't appear to be a HotSpot VM (could not find symbol "gHotSpotVMTypes" when trying to execute jstack command

What are you trying to accomplish:
Execute a thread dump to troubleshoot an unresponsive druid process.
Workaround or Solution:
Run the command as the linux user that is running the process. For instance,
ps -ef | grep java
[root@localhost ~]# ps -ef | grep java
root 2341 2318 0 14:41 ? 00:00:19 java -server -Xms128m -Xmx128m -Duser.timezone=UTC -cp /tmp/imply-2.7.8/bin/../dist/zk/lib/*:/tmp/imply-2.7.8/bin/../dist/zk/*:/tmp/imply-2.7.8/conf-quickstart/zk org.apache.zookeeper.server.quorum.QuorumPeerMain /tmp/imply-2.7.8/conf-quickstart/zk/zoo.cfg
Based on this output, you can see that the root user is running the process. Hence, in order to run the jstack command, you will need to be logged in as this user.
Root Cause:
Executing the command with a linux user that isn't the same as the user running the druid process.
View Article
Issue: Indexing tasks from Kafka or Kinesis are finishing successfully, but without any data ingested.

SYMPTOM:
Indexing tasks from Kafka or Kinesis are finishing without any data ingested. The task logs contain entries similar to the following:
2018-07-02T07:46:02,783 ERROR [task-runner-0-priority-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Exception while running task[KinesisIndexTask{id=index_kinesis_xxxx, type=index_kinesis, dataSource=xxxxxx}]
io.druid.java.util.common.ISE: Starting sequenceNumber [49585811274695675412070234775862022455403125352932835346] is no longer available for partition [shardId-000000000001] (earliest: [49585811274695675412070251479546550741235996701713498130]) and resetOffsetAutomatically is not enabled
Or
2018-09-03T12:55:14,522 WARN [task-runner-0-priority-0] io.druid.indexing.kafka.KafkaIndexTask - OffsetOutOfRangeException with message [Offsets out of range with no configured reset policy for partitions: {Druid-XXXX-X=7263946670}]
And at the end, we can see:
2018-09-03T12:55:15,207 INFO [publish-0] io.druid.segment.realtime.appenderator.BaseAppenderatorDriver - Nothing to publish, skipping publish step.
......
2018-09-03T12:55:15,232 INFO [task-runner-0-priority-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_kafka_Test-K_40675c261af88df_pcigmfli] status changed to [SUCCESS].
..... ......
2018-09-03T12:55:15,233 INFO [task-runner-0-priority-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
"id" : "index_kafka_Test-K_40675c261af88df_pcigmfli",
"status" : "SUCCESS",
"duration" : 3601087,
"errorMsg" : null
}
ROOT CAUSE:
The ingestion task is trying to read messages that are no longer available in Kafka. It patiently waits until the `taskDuration` is over or the supervisor is reset. The most common reason is that the ingestion was not able to read all messages from Kafka before they were deleted by Kafka's retention policies. This can happen because the ingestion is slow and needs to be scaled out or tuned, or because the ingestion was turned off for long enough that Kafka deleted the messages it was pointed at.
WORKAROUND:
The Druid admin should reset the supervisor when this happens, which skips ingestion forward to the latest Kafka messages. The reset can be done with the "reset supervisor" command in the UI or with the reset API described at http://druid.io/docs/latest/development/extensions-core/kafka-ingestion.html.
When reset, the supervisor kills all ongoing ingestion tasks and launches new tasks. Druid still ensures that whatever is in Kinesis/Kafka is loaded exactly once.
The supervisor does not reset automatically by default because messages may be only temporarily unavailable (for example, due to a temporary issue on the Kafka side that can be fixed quickly). On the other hand, the admin can enable the `resetOffsetAutomatically` config if the reset needs to be automatic.
Other options include `useEarliestSequenceNumber` for Kinesis ingestion and `useEarliestOffset` for Kafka ingestion. The two properties serve the same purpose in their respective ingestion methods; if one is configured for the other ingestion method, it is ignored.
In production, we recommend monitoring Kafka ingestion lag to prevent ingestion from falling behind like this. There are two main ways to do it. One is manually, through Clarity's Ingestion tab and the "Kafka Lag" metric (see attached screenshot). The other is by API, using the supervisor status API and looking at the "aggregateLag" value. In both cases, the lag is measured in number of messages. So if the normal expectation is to read 100 messages per second, a lag value of 100,000 means ingestion is about 1,000 seconds, or roughly 16 to 17 minutes, behind.
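The conversion from a message-count lag to a time estimate is simple arithmetic. Here is a minimal sketch, assuming a steady read rate; the function name `lag_minutes` is hypothetical and the inputs are the figures from the paragraph above.

```python
def lag_minutes(aggregate_lag, messages_per_second):
    """Estimate how many minutes ingestion is behind, given a lag in messages."""
    if messages_per_second <= 0:
        raise ValueError("messages_per_second must be positive")
    return aggregate_lag / messages_per_second / 60

# 100,000 messages of lag at 100 messages per second.
print(round(lag_minutes(100_000, 100), 1))  # 16.7
```

A value like this can be compared against the Kafka retention period to judge how close ingestion is to losing messages.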
Finally, the fact that this condition, when it lasts for an entire task duration, is reported as SUCCESS instead of FAILED is confusing, and we will be adjusting that. In general, our preference is that users only see SUCCESS when the connection to Kafka is healthy.
View Article
How to use `UNION` to run single query on multiple dataSources

For instance, we have two datasources, Workbook1 and Workbook4, with different schemas but some common columns:
[root@ip-172-31-2-115 tmp]# curl -X POST http://ip-172-31-10-21.us-west-2.compute.internal:8082/druid/v2/?pretty -H 'Content-Type:application/json' -d @query_workbook1.json
[ {
"segmentId" : "Workbook1_2018-01-01T00:00:00.000Z_2018-02-01T00:00:00.000Z_2018-06-25T20:57:45.505Z",
"columns" : [ "__time", "card number", "email id", "name" ],
"events" : [ [ 1514764800000, "123456", "[email protected]", "jason b" ], [ 1514851200000, "234567", "[email protected]", "al c" ], [ 1514937600000, "345678", "[email protected]", "brian K" ], [ 1515024000000, "456789", "[email protected]", "jason b" ], [ 1515110400000, "567890", "[email protected]", "henry G" ] ]
} ]
[root@ip-172-31-2-115 tmp]# curl -X POST http://ip-172-31-10-21.us-west-2.compute.internal:8082/druid/v2/?pretty -H 'Content-Type:application/json' -d @query_workbook4.json
[ {
"segmentId" : "Workbook4_2018-01-01T00:00:00.000Z_2018-02-01T00:00:00.000Z_2018-06-25T21:36:28.017Z",
"columns" : [ "__time", "card number", "email id", "address", "zip" ],
"events" : [ [ 1515628800000, "654321", "[email protected]", "123 any street", "95434" ], [ 1515715200000, "765432", "[email protected]", "345 unknown ave", "78643" ], [ 1515801600000, "876543", "[email protected]", "653 where blvd", "88976" ], [ 1515888000000, "987654", "[email protected]", "1 goto block", "22345" ], [ 1515974400000, "987651", "[email protected]", "889 mapper ave", "12001" ] ]
} ]
Then I can create a UNION query to find `card number` for `email id` = [email protected] from both datasources:
[root@ip-172-31-2-115 tmp]# cat query_workbook_union2.json
{
"queryType": "scan",
"dataSource": {
"type": "union",
"dataSources": ["Workbook1", "Workbook4"]
},
"intervals": {
"type": "intervals",
"intervals": [
"-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"
]
},
"virtualColumns": [],
"resultFormat": "compactedList",
"batchSize": 20480,
"limit": 5000,
"filter": {
"type": "selector",
"dimension": "email id",
"value": "[email protected]"
},
"columns": [
"__time",
"card number",
"email id"
],
"legacy": false,
"context": {
"timeout": 300000
},
"descending": false,
"granularity": {
"type": "all"
}
}
and the output reads:
[root@ip-172-31-2-115 tmp]# curl -X POST http://ip-172-31-10-21.us-west-2.compute.internal:8082/druid/v2/?pretty -H 'Content-Type:application/json' -d @query_workbook_union2.json
[ {
"segmentId" : "Workbook1_2018-01-01T00:00:00.000Z_2018-02-01T00:00:00.000Z_2018-06-25T20:57:45.505Z",
"columns" : [ "__time", "card number", "email id" ],
"events" : [ [ 1514764800000, "123456", "[email protected]" ], [ 1515024000000, "456789", "[email protected]" ] ]
}, {
"segmentId" : "Workbook4_2018-01-01T00:00:00.000Z_2018-02-01T00:00:00.000Z_2018-06-25T21:36:28.017Z",
"columns" : [ "__time", "card number", "email id" ],
"events" : [ [ 1515628800000, "654321", "[email protected]" ], [ 1515888000000, "987654", "[email protected]" ] ]
} ]
The UNION query performance is the same as querying two datasources simultaneously, so there's very minimal performance difference.
View Article
Accessing your Imply Cloud VPC from another VPC using a peering connection

Overview
This article describes how to establish a VPC peering connection to allow traffic from another VPC in either the same or a different AWS account. This allows services located in other VPCs to query Imply without requiring the Imply Cloud clusters to be reachable publicly over the Internet.
We will first go through the straightforward case, where none of the peered VPCs have overlapping CIDR blocks (the IP address ranges for all peered VPCs are disjoint). Following this, we will describe a scenario where there is an overlap, and present a strategy for setting up a network topology to handle this.
This article does not cover the following two cases:
Setting up a peering connection between different AWS regions: AWS now provides limited support for managed peering between some regions, but this is an advanced case as there are a number of restrictions that must be considered. Please consult the AWS documentation on multi-region VPC peering for more details.
Setting up a peering connection if the Imply Cloud VPC (the one created in your AWS account) has an overlapping CIDR block with the VPC you're trying to peer: VPC peering is not supported by AWS in this case. You may contact us to request that your Imply Cloud VPC use a different CIDR block to remove the conflict. Changing the CIDR block of the VPC will incur downtime of your Imply Cloud cluster.
No Overlapping VPCs
There are two main steps in peering VPCs when a given VPC does not have multiple peering connections which have overlapping network ranges:
Establish the peering connection between the VPCs.
Configure the route tables so that packets are routed to the correct destination.
Create a Peering Connection
Using the AWS web interface, go to the VPC Dashboard and inspect your configured VPCs. Identify the Imply Cloud VPC (the one tagged as 'imply-{accountId}-network-vpc') and the VPC you wish to peer with:
Here, we wish to peer the Imply VPC (vpc-ce78dbb7) with non-overlapping-vpc (vpc-641fee1c). Note the CIDR ranges, 10.50.0.0/21 and 172.20.0.0/20 respectively, and ensure that they are not overlapping.
Go to the Peering Connections section of the VPC Dashboard and note the active peering connections:
You should see at least one entry, which is the peering connection used by Imply to enable communication between Imply's servers (the Imply Cloud Manager) and the Imply clusters running in your account. If the VPC you wish to peer with has a CIDR range that overlaps with this entry, you fall into the overlapping VPC category and should continue to read that section after understanding the basic case.
In this example, the Imply VPC (vpc-ce78dbb7) is peered to a VPC in Imply's AWS account (vpc-828a54fb) which has a CIDR block of 172.19.0.0/20. This corresponds to an IP address range of 172.19.0.0 - 172.19.15.255 which does not overlap the 172.20.0.0/20 block of non-overlapping-vpc so no special configuration will be required.
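The overlap check described above can be reproduced with Python's standard `ipaddress` module. This is only a sketch using the CIDR blocks from this example; the variable names are made up.

```python
import ipaddress

# CIDR blocks taken from the example in this article.
imply_vpc = ipaddress.ip_network("10.50.0.0/21")
manager_vpc = ipaddress.ip_network("172.19.0.0/20")
other_vpc = ipaddress.ip_network("172.20.0.0/20")

# First and last address of the Imply Cloud Manager VPC block.
print(manager_vpc[0], manager_vpc[-1])  # 172.19.0.0 172.19.15.255

# The VPC to be peered must not overlap either existing block.
print(other_vpc.overlaps(manager_vpc), other_vpc.overlaps(imply_vpc))  # False False
```

If either `overlaps` call returned True, the overlapping VPC scenario would apply instead.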
Click on Create Peering Connection, optionally provide a name tag, and put the Imply VPC (here vpc-ce78dbb7) as the requester and the other VPC (here vpc-641fee1c) as the accepter. You will see a new entry in the peering table with a status of Pending Acceptance:
To accept the peering request, right-click on the peering connection and choose Accept Request. Upon acceptance, the peering connection will change status to be Active.
You will also want to configure the peering connection to allow DNS names to be resolved over the link. This will allow you to address the query elastic load balancer using its assigned DNS name from the other VPC. To do this, right-click on the peering connection and select Edit DNS Settings. On this page, ensure that the option allowing your VPC to resolve DNS requests for the Imply VPC hosts is checked. In this example, we want non-overlapping-vpc (vpc-641fee1c) to be able to resolve DNS requests of Imply VPC (vpc-ce78dbb7):
Click on Save.
Setup Route Tables
Route tables are used to provide instructions to your subnet's virtual router about where network traffic with a given IP address should be sent to. Common destinations include: other instances within the same VPC, an Internet Gateway for external traffic over the Internet, and a peering connection for traffic destined for a different peered VPC. Creating a VPC creates a default route table which is used implicitly if no other route tables are associated with a subnet. However, strictly speaking, route tables are associated with subnets and not with VPCs, which is an important distinction that we will utilize when handling the overlapping peered VPC scenario below. For more information on route tables, see Amazon's documentation.
Click on the Subnets section to view the available subnets for your account in this region:
The first 6 subnets belong to the Imply VPC (vpc-ce78dbb7). Imply typically creates two subnets (one for static and one for dynamically allocated addresses) in three different availability zones to provide protection against failures of an availability zone. Note that they are all associated with the route table rtb-7122e509. The subnet labelled non-overlapping is a subnet in the non-overlapping-vpc (vpc-641fee1c) which, for this example, will be the subnet used by the EC2 instance that requires connectivity to Imply Cloud. This subnet uses the route table rtb-8c4492f7. We will need to configure these two route tables to route traffic destined for the other subnet to the VPC peering connection.
Starting with the route table for the Imply VPC (here rtb-7122e509):
Click on the link in the Route table column and then click on the Routes tab:
Routes are matched by finding the most specific entry that matches the destination address. In this example:
Traffic destined for an address in 10.50.0.0/21 (10.50.0.0 - 10.50.7.255) remains local and is routed to other EC2 instances in this VPC.
Traffic destined for an address in 172.19.0.0/20 (172.19.0.0 - 172.19.15.255) is routed to pcx-d3c1e8ba, which is the peering connection link to a VPC in Imply's AWS account. From here this traffic gets routed to one of the EC2 instances powering the Imply Cloud Manager.
All other traffic (0.0.0.0/0 matches all IPs) is routed to igw-acd20cca which is the Internet Gateway allowing the instances to make outbound requests externally over the Internet.
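The "most specific entry wins" rule can be sketched in a few lines of Python. This models only the matching logic, not AWS itself; `pick_target` is a hypothetical helper, and the routes are the three entries listed above plus, in the second call, the 172.20.0.0/22 peering route from this example.

```python
import ipaddress

# The three routes described above: (destination CIDR, target).
ROUTES = [
    ("10.50.0.0/21", "local"),
    ("172.19.0.0/20", "pcx-d3c1e8ba"),
    ("0.0.0.0/0", "igw-acd20cca"),
]

def pick_target(destination, routes):
    """Return the target of the most specific (longest-prefix) matching route."""
    address = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in routes
        if address in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None
    # The longest prefix is the most specific match.
    return max(matches, key=lambda match: match[0].prefixlen)[1]

# Without a peering route, traffic to the other VPC falls through to the Internet Gateway.
print(pick_target("172.20.2.51", ROUTES))
# With the 172.20.0.0/22 route in place, the same address goes over the peering link.
print(pick_target("172.20.2.51", ROUTES + [("172.20.0.0/22", "pcx-6ca76b04")]))
```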
To this table, we will need to add an entry to route traffic destined for 172.20.0.0/22 (the range of the non-overlapping subnet) to pcx-6ca76b04 (the VPC peering connection we established). Alternatively, we can specify the range for the entire non-overlapping-vpc VPC (172.20.0.0/20) to allow other subnets within that VPC to also communicate over the peering link.
Click on Edit, click on Add another route, and using this example, set:
Destination: 172.20.0.0/22
Target: pcx-6ca76b04
Click on Save.
We will now need to do a similar configuration for the route table rtb-8c4492f7. Here, we will want to add an entry to route all traffic destined for the Imply Cloud VPC (10.50.0.0/21) to the peering link pcx-6ca76b04. Select this route table, click on the Routes tab, click on Edit, click on Add another route, and using this example, set:
Destination: 10.50.0.0/21
Target: pcx-6ca76b04
Click on Save.
This completes the setup of the peering connection and configuration of the route tables.
Testing the Peering Connection
We can test the peering connection by making an API request to Druid from an instance in the peered VPC. We will do this by making a call to the query load balancer rather than to an individual instance. You can get the hostname of the load balancer from the API section of the Cloud Manager:
Note that for VPC peering connections, we use the Private endpoints rather than the Public ones (which are used when you want the cluster to be accessible over the public Internet).
Even though the peering connection is in place, the security group for the load balancer will not by default permit traffic from your EC2 instance. In order to allow the connection to be established, we will need to add an entry to the Inbound table of the Imply Cloud ELB Unmanaged security group.
First, determine which security group is associated with the EC2 instance the request will originate from. From the EC2 Dashboard, click on Instances, and find the details of your instance:
Here, the security group is launch-wizard-2. Clicking on the security group name will bring you to the Security Groups section where, in this example, we find that launch-wizard-2 is identified with group ID: sg-dee0b1ae.
Staying in the Security Groups section, look for the security group with the description 'Imply Cloud ELB Unmanaged'. You may have to clear your filter for it to appear in the table. Select this entry, click on the Inbound tab, click Edit, and add a rule as follows:
Type: All TCP
Protocol: TCP
Port Range: 0 - 65535
Source: Custom: sg-dee0b1ae
Click on Save.
You should now be able to make requests from your instance to the Imply Cloud query load balancer, as an example:
ubuntu@ip-172-20-2-51:~$ curl -k -u admin https://internal-imply-e04-elbinter-qqsuirj9f9ep-901811362.us-west-2.elb.amazonaws.com:9088/druid/v2/datasources
Enter host password for user 'admin':
["wikipedia"]
ubuntu@ip-172-20-2-51:~$ curl -k -u admin -XPOST -H'Content-Type:application/json' -d '{"queryType": "timeBoundary", "dataSource": "wikipedia"}' https://internal-imply-e04-elbinter-qqsuirj9f9ep-901811362.us-west-2.elb.amazonaws.com:9088/druid/v2/
Enter host password for user 'admin':
[{"timestamp":"2016-06-27T00:00:11.080Z","result":{"maxTime":"2016-06-27T21:31:02.498Z","minTime":"2016-06-27T00:00:11.080Z"}}]
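The same timeBoundary request can be prepared from a script. The following is a minimal sketch using only the Python standard library; the helper name build_time_boundary_request is hypothetical, and the function only builds the request without sending it.

```python
import base64
import json
import urllib.request

def build_time_boundary_request(base_url, datasource, user, password):
    """Build (but do not send) a POST request for a Druid timeBoundary query."""
    body = json.dumps({"queryType": "timeBoundary", "dataSource": datasource})
    # HTTP basic auth, equivalent to curl's -u option.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/druid/v2/",
        data=body.encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + token,
        },
        method="POST",
    )
```

The request can then be passed to urllib.request.urlopen. Note that curl's -k option skips certificate verification; the Python equivalent is an SSL context with verification disabled, which is only appropriate for testing.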
Overlapping VPCs
We now move on to the case where the Imply VPC needs to be peered to multiple VPCs that have overlapping CIDR blocks. This happens most frequently when the VPC you wish to create a peering connection to has a network range that overlaps the VPC in Imply's AWS account. Continuing the previous example, we now want to establish a peering connection between the Imply VPC (vpc-ce78dbb7, 10.50.0.0/21) and overlapping-vpc (vpc-3607f64e, 172.19.0.0/20). Note that while these two VPCs don't overlap (if they did, a peering connection would not be allowed), the Imply VPC already has a peering connection to vpc-828a54fb which has a CIDR block of 172.19.0.0/20, conflicting with overlapping-vpc. vpc-828a54fb is a VPC in Imply's AWS account, and this peering connection is crucial for the operation of Imply Cloud and cannot be modified or removed.
AWS allows peering connections to be made to multiple VPCs which have overlapping CIDR blocks, but there is an additional complication when trying to resolve the route table entries: given a packet that should be sent to 172.19.0.3 (an address within the 172.19.0.0/20 CIDR), how does the router know which of the two VPC peering connections to route the message to?
As mentioned previously, we will need to leave the subnets and route table entries set up by Imply Cloud unmodified, since modifying or removing these will cause your clusters to be unreachable by the Imply Cloud Manager. Remembering that the routers (and their corresponding route tables) are associated with subnets rather than VPCs, our strategy here will be to set up additional subnets with a separate route table that will route traffic for 172.19.0.0/20 to your other VPC instead of Imply's VPC. Into these subnets, we will add a new elastic load balancer which will act as a proxy to the Imply query servers, and the route table rules will ensure that traffic passing through this load balancer will be routed to your VPC instead of Imply's.
This works particularly nicely in AWS because an auto-scaling group (which is used to manage the query EC2 instances) can be associated with multiple elastic load balancers. Hence, as the EC2 instances are created, fail, and are replaced, the elastic load balancer will continue pointing at the correct online set of instances instead of pointing at instances which have died. When we are done with this setup, the query auto-scaling group will be associated with three load balancers:
one for communication internally within the VPC and with the VPC in Imply's AWS account (created by Imply)
one for communication externally over the internet (created by Imply, but by default blocked using firewall rules)
one for communication internally with your peered VPC using the VPC peering connection (this is what we will be setting up here)
Create a Peering Connection
The first step is to create a peering connection using the steps detailed in the similarly named section for the non-overlapping case. In our example, we will be peering the Imply VPC (vpc-ce78dbb7) and overlapping-vpc (vpc-3607f64e):
Note that the peering connection we created (pcx-ceb27ea6) links us to a VPC with a CIDR 172.19.0.0/20 which is the same network range as the peering connection created by Imply (pcx-d3c1e8ba) and that this is a permitted operation.
Create New Subnets
We will need to create an additional subnet for each availability zone used by the Imply Cloud cluster. There will be 2 or 3 availability zones used, depending on which region you are in. To identify the availability zones, inspect the Availability Zone column (or view the detail descriptions) for each Imply Cloud subnet (beginning with imply-...).
On the Subnet page, click Create subnet. Give your subnet a recognizable name, and ensure that it is created in the Imply VPC (in our example, vpc-ce78dbb7). For the IPv4 CIDR block, you will need to choose a block of IP addresses within the Imply VPC block that doesn't conflict with any of the existing 6 subnets; in our example, we used 10.50.6.0/26. Set the Availability Zone to correspond to one of the zones used by Imply Cloud. Clicking Create gives us the following subnets:
The one that we created is called imply-overlap (subnet-d5eea2ac). Note that it uses a different route table (rtb-a822e5d0) than the subnets created by Imply Cloud. This is actually the default route table for the Imply VPC which was previously unused, since the 6 subnets created by Imply were explicitly associated with a custom route table. It is fine to use the default route table, but alternatively, you can also create an additional route table and then explicitly associate the imply-overlap subnet with this route table.
Repeat the previous step until you have created a subnet for each availability zone used by Imply Cloud.
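Before creating each subnet, it can help to check the candidate CIDR block programmatically. The sketch below uses Python's standard ipaddress module; the list of existing subnets is HYPOTHETICAL (the actual CIDRs of the 6 Imply subnets are not stated here), so substitute the values shown in your console.

```python
import ipaddress

def is_usable_subnet(candidate, vpc_cidr, existing_subnets):
    """True if candidate lies inside the VPC block and overlaps no existing subnet."""
    cand = ipaddress.ip_network(candidate)
    vpc = ipaddress.ip_network(vpc_cidr)
    if not cand.subnet_of(vpc):
        return False
    return not any(
        cand.overlaps(ipaddress.ip_network(subnet)) for subnet in existing_subnets
    )

# HYPOTHETICAL existing subnets, for illustration only.
existing = [
    "10.50.0.0/24", "10.50.1.0/24", "10.50.2.0/24",
    "10.50.3.0/24", "10.50.4.0/24", "10.50.5.0/24",
]

print(is_usable_subnet("10.50.6.0/26", "10.50.0.0/21", existing))
```

Under these assumed values, 10.50.6.0/26 is accepted, while a block that collides with an existing subnet or falls outside 10.50.0.0/21 is rejected.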
Setup Route Table
This is performed similarly to the non-overlapping case, but using the route table associated with the newly created subnets instead of the one used by the existing 6 subnets. In this case, our new route table entries look like this:
For the route table associated with the new subnets in the Imply VPC (rtb-a822e5d0):
Destination: 172.19.0.0/20
Target: pcx-ceb27ea6
For the route table associated with the subnet in the other VPC that was peered with the Imply VPC (in this example, the route table for overlapping subnet-567e171d, which is rtb-48598f33):
Destination: 10.50.6.0/26
Target: pcx-ceb27ea6
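The effect of using a separate route table can be sketched as follows. This is a simplified model, not AWS code: it lists only the entries relevant to this example, assumes each route table also carries the implicit local route for the VPC block, and the function name next_hop is hypothetical.

```python
import ipaddress

# Each subnet is associated with exactly one route table.
ROUTE_TABLES = {
    "rtb-7122e509": {  # used by the 6 subnets created by Imply
        "10.50.0.0/21": "local",
        "172.19.0.0/20": "pcx-d3c1e8ba",
    },
    "rtb-a822e5d0": {  # used by the new imply-overlap subnets
        "10.50.0.0/21": "local",
        "172.19.0.0/20": "pcx-ceb27ea6",
    },
}

def next_hop(route_table_id, destination):
    """Return the most specific matching target in the given route table, or None."""
    addr = ipaddress.ip_address(destination)
    best = None
    for cidr, target in ROUTE_TABLES[route_table_id].items():
        network = ipaddress.ip_network(cidr)
        if addr in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, target)
    return best[1] if best else None
```

The same destination, 172.19.0.3, resolves to Imply's peering connection from the original subnets and to your peering connection from the new subnets, which removes the ambiguity between the two overlapping CIDR blocks.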
Create a New Elastic Load Balancer
Now that we have created new subnets which route traffic destined for 172.19.0.0/20 to your peered VPC, we will need to create an elastic load balancer which uses those subnets and proxies requests to the Imply Cloud query nodes. We will inspect the existing load balancer set up by Imply and create an additional one with a similar configuration:
In the Load Balancers section of the EC2 Dashboard:
Click on Create Load Balancer
For this example, we are using the Classic Load Balancer, but a similar setup can be done using one of the newer generation load balancers. Click on Create.
Load Balancer name: provide a recognizable name for the load balancer
Create LB Inside: select the Imply Cloud VPC, in this example vpc-ce78dbb7
Create an internal load balancer: this should be checked so that the load balancer can be used over the peering connection
Listener Configuration: This should have the same entries as the reference load balancer:
TCP 8888 -> TCP 8888
TCP 9088 -> TCP 9088
TCP 9095 -> TCP 9095
Select Subnets: select the subnets you created previously (in this example we only have one, subnet-d5eea2ac, but you should have one for each availability zone)
Click on Next: Assign Security Groups
Select the security groups with description 'Imply Cloud ELB Unmanaged' and 'Imply Cloud Default'
Click on Next: Configure Security Settings
Click on Next: Configure Health Check
Use the same entries as the reference load balancer:
Ping Protocol: HTTPS
Ping Port: 9095
Ping Path: /health
Click on Next: Add EC2 Instances
We will not select any EC2 instances on this page, but will later associate this elastic load balancer with our auto-scaling group.
Click on Next: Add Tags
Click on Review and Create and then click Create.
Note the load balancer name in the AWS console; in our example it is imply-query-my-peering:
Add Load Balancer to Query Auto Scaling Group
Go to the Auto Scaling Groups section of the EC2 Dashboard and locate the auto-scaling group tagged with a name of the form imply-{clusterId}-Query-... Right-click on this auto-scaling group and select Edit.
Modify the Classic Load Balancers section to include an additional entry for the load balancer you created above, and then click on Save.
To confirm this was done correctly, return to the Load Balancers section and inspect the load balancer you created. Status should show 'x of x instances in service' where x is non-zero, and the Instances tab should show a list of all the query instances in your cluster.
Testing the Peering Connection
Similarly to the non-overlapping case, we should now be able to make query requests from our VPC to the Imply VPC, but we will be communicating with the load balancer we created instead of the one listed in the API section of the Imply Cloud Manager. For this example, our load balancer has the DNS name of internal-imply-query-my-peering-1218651714.us-west-2.elb.amazonaws.com. You will still have to add an entry to the Imply Cloud ELB Unmanaged security group to permit inbound access from the origin instance's security group as described in the non-overlapping section.
If everything was correctly set up, issuing Druid API calls to the new load balancer should return a response:
ubuntu@ip-172-19-2-170:~$ curl -k -u admin https://internal-imply-query-my-peering-1218651714.us-west-2.elb.amazonaws.com:9088/druid/v2/datasources
Enter host password for user 'admin':
["wikipedia"]
ubuntu@ip-172-19-2-170:~$ curl -k -u admin -XPOST -H'Content-Type:application/json' -d '{"queryType": "timeBoundary", "dataSource": "wikipedia"}' https://internal-imply-query-my-peering-1218651714.us-west-2.elb.amazonaws.com:9088/druid/v2/
Enter host password for user 'admin':
[{"timestamp":"2016-06-27T00:00:11.080Z","result":{"maxTime":"2016-06-27T21:31:02.498Z","minTime":"2016-06-27T00:00:11.080Z"}}]
Additional Notes about the New Query Load Balancer
This strategy works by binding an additional load balancer to the query auto-scaling group to allow requests made through this load balancer to be routed back to your peered VPC instead of the VPC in Imply's AWS account. At first glance, this may seem to only allow queries to be made from your VPC, while not supporting calls to Druid's other APIs, such as submitting indexing tasks. In fact, the same query load balancer can be used to access Druid's other APIs for the coordinator and overlord services (running on the master nodes) by using the router's management proxy functionality, which is enabled by default in Imply Cloud. For more information on this, see the router documentation.
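As a rough illustration of how the management proxy decides where a request goes, the sketch below encodes the path conventions described in the Apache Druid router documentation as understood here: /druid/coordinator/ and /druid/indexer/ are routed implicitly, and /proxy/coordinator/ and /proxy/overlord/ are explicit prefixes. The function name proxied_service is hypothetical, and the authoritative rules are those in the router documentation.

```python
def proxied_service(path):
    """Return which service the router would proxy this path to, or None."""
    if path.startswith(("/druid/coordinator/", "/proxy/coordinator/")):
        return "coordinator"
    if path.startswith(("/druid/indexer/", "/proxy/overlord/")):
        return "overlord"
    # Anything else (for example /druid/v2/ queries) is not a management request.
    return None
```

Under this assumption, a task submission sent to /druid/indexer/v1/task on the query load balancer would be forwarded to the overlord.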
One caveat about the router management proxy is that it cannot handle requests to serve the coordinator or overlord consoles. If you need access to these consoles over the peering connection, one strategy is to create additional load balancers (one for the coordinator and one for the overlord) according to the steps described above. You would then add these load balancers to the Classic Load Balancers section of the master auto-scaling groups (you will have 1 of these in the non-HA configuration and 3 of them in the HA configuration). For the health check endpoints, use the following:
Coordinator: HTTPS:8281/druid/coordinator/v1/isLeader
Overlord: HTTPS:8290/druid/indexer/v1/isLeader
This will cause the active coordinator and overlord to be the in-service instances, while removing the standby instances from the load balancers and thus preventing traffic from being routed to them. As leadership changes in a highly available cluster, instances will be registered with and removed from the load balancer to ensure that all traffic is routed to the active coordinator/overlord. This configuration will support loading of the coordinator and overlord web consoles over the peered connection.
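The health check behaviour can be modelled with a short sketch. It assumes that the isLeader endpoints answer HTTP 200 on the current leader and a non-200 status (404 in the Druid documentation, to the best of our knowledge) on the others; the instance IDs below are hypothetical.

```python
def in_service(health_status_by_instance):
    """Return the instances whose isLeader health check answered HTTP 200."""
    return sorted(
        instance
        for instance, status in health_status_by_instance.items()
        if status == 200
    )

# HYPOTHETICAL instance IDs and statuses for a 3-master HA cluster.
statuses = {"i-master-a": 404, "i-master-b": 200, "i-master-c": 404}
print(in_service(statuses))
```

Only the current leader stays in service; after a leadership change, the next round of health checks moves traffic to the new leader.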
How to run query with filtering on lookups

Integration with Hadoop

Druid can interact with Hadoop in two ways:
Use HDFS for deep storage using the druid-hdfs-storage extension.
Batch-load data from Hadoop using Map/Reduce jobs.
These are not necessarily linked together; you can load data with Hadoop jobs into a non-HDFS deep storage (like S3), and you can use HDFS for deep storage even if you're loading data from streams rather than using Hadoop jobs.
If you use Hadoop Map/Reduce jobs to load data, then these jobs will scan through your raw data and produce optimized Druid data segments in your configured deep storage. The data will then be loaded by Druid Historical Nodes. Once loading is complete, Hadoop and YARN are not involved in the query path of Druid in any way.
The main advantages of loading data using Hadoop are that it automatically parallelizes the batch data loading process, it uses YARN resources instead of your Druid machines (leaving your Druid machines free to handle queries), and it can leverage data that resides in your existing Hadoop cluster.
For more information, see our documentation at: https://docs.imply.io/on-premise/manage-data/ingestion-files#hadoop-based-ingestion
Working with specific distributions
Sometimes, specific configurations are needed to integrate Druid with specific Hadoop distributions. These are generally needed when the versions of common Java dependencies (like Jackson, Guava, and so on) differ between your version of Druid and your version of Hadoop.
The Druid documentation has some general tips that apply to many distributions. See here for details: http://druid.io/docs/latest/operations/other-hadoop.html. In particular, three go-to tips include:
Place Hadoop XMLs on Druid's classpath.
Classloader modification on Hadoop, using "mapreduce.job.classloader" or "mapreduce.job.user.classpath.first". Additionally, "mapreduce.job.classloader.system.classes" can be useful to customize class loading even further. The purpose of all of these settings is to control which versions of which dependencies are used for which code paths.
Use specific versions of Hadoop libraries. Druid is bundled with Apache Hadoop, but you can install libraries specific to your distribution and have Druid use those instead.
We have collected some additional tips here for working with specific Hadoop distributions.
Amazon EMR (Elastic MapReduce)
Imply Cloud supports integration with Amazon EMR. For a setup guide, see: https://docs.imply.io/cloud/manage-data/emr
Apache Hadoop
Place your Hadoop XMLs on Druid's classpath.
Set "mapreduce.job.classloader": true in your Druid jobProperties.
Replacing libraries is generally not necessary, since Druid is built against Apache Hadoop by default.
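As a minimal sketch of the classloader tip, the snippet below adds the property to the jobProperties map of a Hadoop tuningConfig and prints the resulting JSON fragment. The helper name is hypothetical, and the value is written as the string "true" on the assumption that Hadoop configuration values are passed as strings.

```python
import json

def with_classloader_isolation(tuning_config):
    """Return a copy of a Hadoop tuningConfig with mapreduce.job.classloader set."""
    updated = dict(tuning_config)
    # Copy jobProperties so the caller's dictionary is not modified.
    job_properties = dict(updated.get("jobProperties", {}))
    job_properties["mapreduce.job.classloader"] = "true"
    updated["jobProperties"] = job_properties
    return updated

print(json.dumps(with_classloader_isolation({"type": "hadoop"}), indent=2))
```

Any jobProperties already present in the tuningConfig are preserved, so the same helper can be applied to an existing ingestion spec.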
How to avoid getting division result = 0 in Druid SQL.

SYMPTOM: When a Druid SQL query contains a division, users may get 0 as the result.
For example: the value returned for `Saving_ratio` is always 0 in the following case.
ROOT CAUSE:
All numbers were ingested with type `longSum` (integer), and in Druid SQL an integer divided by an integer uses integer division and rounds down. Therefore, `Saving_ratio` is always zero, since `Save` is less than `Income`.
RESOLUTION:
Ingest the data as `floatSum` instead:
"metricsSpec": [
  {
    "type": "count",
    "name": "count"
  },
  {
    "type": "floatSum",
    "name": "Income",
    "fieldName": "Income",
    "expression": null
  },
  {
    "type": "floatSum",
    "name": "Food_spending",
    "fieldName": "Food",
    "expression": null
  },
  {
    "type": "floatSum",
    "name": "Clothing_spending",
    "fieldName": "Clothing",
    "expression": null
  },
  {
    "type": "floatSum",
    "name": "Save",
    "fieldName": "Save",
    "expression": null
  }
],
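The root cause is easy to reproduce outside Druid. The sketch below is plain Python, not Druid SQL, and the sample values for Save and Income are hypothetical; it only illustrates why integer inputs give 0 while floating-point inputs give the expected ratio.

```python
def saving_ratio_long(save, income):
    """Integer division, as happens when both columns are longSum metrics."""
    # For non-negative integers, floor division matches SQL integer division.
    return save // income

def saving_ratio_float(save, income):
    """Floating-point division, as happens when the columns are floatSum metrics."""
    return float(save) / float(income)

# HYPOTHETICAL sample values where Save is less than Income.
print(saving_ratio_long(30, 100))
print(saving_ratio_float(30, 100))
```

Whenever Save is smaller than Income, the integer version returns 0, while the floating-point version returns the fractional ratio.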
How to configure "SUBSET FILTER FORMULA" properly in Imply data cube

With Imply UI's latest "SUBSET FILTER FORMULA" feature, users can filter out unrelated data by default as soon as a data cube is opened.
However, sometimes when a filter is added for a column, the output shows 0 results. For example, a user wants to keep only the rows with `Country Iso Code` = `US` in the Wikipedia data source:
But the output shows 0 results.
ROOT CAUSE: The admin configured the SUBSET FILTER FORMULA using the NAME in the configuration instead of the actual COLUMN:
After the configuration change, the output shows up correctly.
Imply Druid SQL doesn't support multi-value dimensions

ISSUE: Druid SQL doesn't support multi-value dimensions yet. This is a potential feature request.
WORKAROUND: When you need to use multi-value dimensions, use Druid's native query language (i.e. timeseries, topN, or groupBy queries) rather than SQL.
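As a minimal sketch of the workaround, the snippet below builds a native groupBy query as a Python dictionary and prints it as JSON. The dimension name tags and the interval are hypothetical placeholders; the resulting JSON would be POSTed to the /druid/v2/ endpoint like any other native query.

```python
import json

def build_group_by_query(datasource, dimension, interval):
    """Build a native groupBy query that counts rows per dimension value."""
    return {
        "queryType": "groupBy",
        "dataSource": datasource,
        "granularity": "all",
        "intervals": [interval],
        "dimensions": [dimension],
        "aggregations": [{"type": "count", "name": "rows"}],
    }

# "tags" is a HYPOTHETICAL multi-value dimension; the interval is illustrative.
query = build_group_by_query("wikipedia", "tags", "2016-06-27/2016-06-28")
print(json.dumps(query, indent=2))
```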
How to create a new dimension and fill the column with default value during ingestion

When creating a new Druid DataSource, admins may want to add new dimensions that are not in the raw data, and fill the columns with default values. This can be achieved by using `transformSpec` during data ingestion.
Here is an example of adding two dummy columns `dummyCol1` and `dummyCol2` to the `Wikipedia` dataSource in the ingestion spec:
{
  "type" : "index",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "wikipedia",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "json",
          "timestampSpec" : {
            "column" : "timestamp",
            "format" : "iso"
          },
          "dimensionsSpec" : {
            "dimensions" : [ "diffUrl", "isRobot", {
              "name" : "added",
              "type" : "long"
            }, .....
              "name" : "metroCode",
              "type" : "long"
            }, "dummyCol1", "dummyCol2" ]
          }
        }
      },
      "metricsSpec" : [ ],
      "granularitySpec" : {
        ...
        },
        "rollup" : false,
        "intervals" : null
      },
      "transformSpec" : {
        "filter" : null,
        "transforms" : [ {
          "type" : "expression",
          "name" : "dummyCol1",
          "expression" : "nvl(\"dummyCol1\", 'HAPPY')"
        }, {
          "type" : "expression",
          "name" : "dummyCol2",
          "expression" : "nvl(\"dummyCol2\", 'JOY')"
        } ]
      }
    },
    "ioConfig" : {
      "type" : "index",
      "firehose" : {
        "type" : "http",
        "uris" : [ "https://SERVER/data/wikipedia.json.gz" ],
        ...
      },
      "appendToExisting" : false
    },
    "tuningConfig" : {
      ...
    }
  },
  "dataSource" : "wikipedia"
}
After ingestion is done, we can see that the two new dimensions have been created:
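The effect of the two nvl transforms on each input row can be approximated in plain Python. This is only a sketch of the idea: it treats a missing or None value as null and ignores the details of Druid's null-handling modes, and the function names are hypothetical.

```python
def nvl(value, default):
    """Return default when value is null (None here), otherwise value."""
    return default if value is None else value

def apply_transforms(row):
    """Mimic the transformSpec: fill dummyCol1 and dummyCol2 with defaults."""
    out = dict(row)
    out["dummyCol1"] = nvl(row.get("dummyCol1"), "HAPPY")
    out["dummyCol2"] = nvl(row.get("dummyCol2"), "JOY")
    return out

print(apply_transforms({"isRobot": "false"}))
```

Because the raw data contains neither column, every ingested row ends up with dummyCol1 = 'HAPPY' and dummyCol2 = 'JOY'.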
Use lookups to improve SQL user experience

Lookups are a concept in Druid where dimension values are (optionally) replaced with new values. Using lookups can improve the querying experience by providing higher-level grouping by default.
For example, in a scenario where a lookup table maps multiple dimension values 'US-EN', 'GB-EN' and 'AU-EN' to a common channel name 'English', users can send a Druid query with an 'English' filter, and the result will return all rows containing 'US-EN', 'GB-EN' or 'AU-EN'.
Similarly, users can filter on any of 'US-EN', 'GB-EN' or 'AU-EN', and the result will be presented with 'English' as the channel name for all of them. This is called a non-injective lookup, where two or more values in the domain map to the same value in the range.
In an injective lookup scenario (not the default), where each entry in the lookup table maps to a distinct value, users can also get a performance boost by having the lookup operations done on the brokers.
To enable injective lookup, add "injective": true under the lookups config:
For more details about injective property in lookups, please refer to: http://druid.io/docs/latest/development/extensions-core/lookups-cached-global.html
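The difference between injective and non-injective lookups can be illustrated with a small Python sketch. The mapping reuses the channel example above; the helper names and the second, country-code mapping are hypothetical.

```python
CHANNEL_LOOKUP = {"US-EN": "English", "GB-EN": "English", "AU-EN": "English"}

def is_injective(lookup):
    """A lookup is injective when no two keys map to the same value."""
    return len(set(lookup.values())) == len(lookup)

def keys_for(lookup, value):
    """Return every original dimension value that maps to the given lookup value."""
    return sorted(key for key, mapped in lookup.items() if mapped == value)

print(is_injective(CHANNEL_LOOKUP))
print(keys_for(CHANNEL_LOOKUP, "English"))
```

Filtering on 'English' expands to all three original values, so this lookup is non-injective and should not be marked with "injective": true.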