Latest News

Tuesday, January 23, 2024

In today's data-hungry world, building efficient pipelines to ingest, process, and deliver insights is vital. Platforms like Azure empower data engineers to craft robust and scalable pipelines like never before. 

This guide dives deep into the essential components and best practices of crafting Azure data pipelines, equipping you with practical tips to unleash the full potential of your data flow.

Understanding Data Pipelines:

A data pipeline is a series of interconnected processes that extract, transform, and load (ETL) data from various sources into a target destination, typically a data warehouse, database, or analytical system. The goal is to ensure data is collected, cleansed, and transformed into a usable format for analysis and decision-making.

Components of a Data Pipeline:

  1. Data Sources: Identify the sources of data, which can include databases, APIs, logs, and external feeds. Azure offers connectors for various sources such as Azure SQL Database, Azure Blob Storage, and more.
  2. Data Transformation: This stage involves cleansing, enriching, and transforming the raw data into a structured format. Azure Data Factory, Azure Databricks, and Azure HDInsight are popular tools for this purpose.
  3. Data Movement: Move data efficiently between different storage solutions and services within Azure using Azure Data Factory or the Copy Data tool.
  4. Data Loading: Load the transformed data into the destination, which could be Azure SQL Data Warehouse, Azure Synapse Analytics, or another database.
  5. Orchestration: Tools like Azure Logic Apps or Apache Airflow can orchestrate the entire pipeline, ensuring the right steps are executed in the correct order (a minimal code sketch of wiring these pieces together follows this list).
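
As an illustration of how these components can be wired together programmatically, here is a minimal sketch using the azure-mgmt-datafactory Python SDK to define a simple copy pipeline. The subscription, resource group, factory, and dataset names are placeholders, and the exact model classes and constructor arguments vary between SDK versions, so treat this as an outline rather than a drop-in script.

# Sketch only: registering a copy activity as an ADF pipeline with the Python SDK.
# All names below are placeholders; model classes differ slightly across SDK versions.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, CopyActivity, DatasetReference, BlobSource, AzureSqlSink,
)

subscription_id = "<subscription-id>"
resource_group = "rg-data-platform"
factory_name = "adf-customer-analytics"

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Copy raw customer files from Blob Storage into an Azure SQL staging table;
# the referenced datasets are assumed to exist already in the factory.
copy_activity = CopyActivity(
    name="CopyCustomersToStaging",
    inputs=[DatasetReference(type="DatasetReference", reference_name="BlobCustomerDataset")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="SqlStagingDataset")],
    source=BlobSource(),
    sink=AzureSqlSink(),
)

pipeline = PipelineResource(activities=[copy_activity])
adf_client.pipelines.create_or_update(resource_group, factory_name, "customer_ingest_pipeline", pipeline)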

Best Practices for Azure Data Pipeline Design:

  • Scalability and Elasticity: Leverage Azure's scalability by using services like Azure Databricks or Azure Synapse Analytics to handle varying data workloads.
  • Data Security and Compliance: Implement Azure's security features to protect sensitive data at rest and in transit. Use Azure Key Vault for managing keys and secrets.
  • Modularity: Design pipelines as modular components to facilitate reusability and easier maintenance. This also helps in debugging and troubleshooting.
  • Monitoring and Logging: Implement robust monitoring and logging using Azure Monitor and Azure Log Analytics to track pipeline performance and identify issues.
  • Data Partitioning: When dealing with large datasets, use partitioning strategies to optimize data storage and retrieval efficiency.
  • Backup and Disaster Recovery: Ensure data integrity and availability by implementing backup and disaster recovery solutions provided by Azure.

Building a Customer Analytics Pipeline (Example):

Let's consider an example of building a customer analytics pipeline in Azure:

  • Data Extraction: Extract customer data from Azure SQL Database and external CRM APIs.
  • Data Transformation: Use Azure Databricks to cleanse and transform the data, calculating metrics like customer lifetime value and segmentation (see the sketch after this list).
  • Data Loading: Load the transformed data into Azure Synapse Analytics for further analysis.
  • Orchestration: Use Azure Data Factory to schedule and orchestrate the entire process.
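
To make the transformation step concrete, here is a minimal PySpark sketch of the kind of cleansing and aggregation that might run in Azure Databricks. The table and column names (raw.orders, customer_id, order_id, order_total, order_date) are hypothetical and only illustrate the pattern.

# Minimal PySpark sketch of the Databricks transformation step.
# Table and column names are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("customer-analytics").getOrCreate()

# Extract: read raw orders previously landed by the ingestion step.
orders = spark.read.table("raw.orders")

# Cleanse: drop records without a customer id and de-duplicate.
clean = orders.dropna(subset=["customer_id"]).dropDuplicates(["order_id"])

# Transform: a simple customer lifetime value (CLV) proxy and segmentation.
clv = (
    clean.groupBy("customer_id")
         .agg(F.sum("order_total").alias("lifetime_value"),
              F.count("order_id").alias("order_count"),
              F.max("order_date").alias("last_order_date"))
         .withColumn("segment",
                     F.when(F.col("lifetime_value") > 10000, "high")
                      .when(F.col("lifetime_value") > 1000, "medium")
                      .otherwise("low"))
)

# Load: write the result to a table for Synapse (or downstream jobs) to pick up.
clv.write.mode("overwrite").saveAsTable("analytics.customer_clv")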

Conclusion:

Creating efficient data pipelines in Azure requires a solid understanding of the platform's services and of data engineering principles. By following best practices around scalability, security, and performance, and by taking advantage of the broad Azure ecosystem, you can build pipelines that deliver accurate, timely, and actionable insights for your organization. Tailor these practices to your own use case and iterate continuously to improve the pipeline's efficiency and reliability.

Monday, October 30, 2023

Apache Hive is a distributed, fault-tolerant data warehousing system that enables large-scale analytics. The Hive Metastore (HMS) is an essential part of many data lake architectures because it offers a central repository of metadata that can be readily queried to support data-driven decisions. Hive is built on Apache Hadoop and supports storage on HDFS as well as object stores such as S3, ADLS, and GCS. Hive users can use SQL to read, write, and manage petabytes of data.


Hive Metastore Server (HMS):

  • The Hive Metastore (HMS) is the central repository of metadata for Hive tables and partitions, stored in a relational database; clients such as Hive, Impala, and Spark access it through the metastore service API.

  • It has become a fundamental component of data lakes that make use of a wide range of open-source tools, including Apache Spark and Presto.

  • In fact, the Hive Metastore serves as the foundation for an entire ecosystem of such tools.

Hive ACID:

Hive provides full ACID support for ORC tables and insert-only support for all other formats.

ACID stands for four traits of database transactions:  
  1. Atomicity (an operation either succeeds completely or fails, it does not leave partial data).
  2. Consistency (once an application performs an operation the results of that operation are visible to it in every subsequent operation).
  3. Isolation (an incomplete operation by one user does not cause unexpected side effects for other users).
  4. Durability (once an operation is complete it will be preserved even in the face of machine or system failure).

These traits have long been expected of database systems as part of their transaction functionality.  
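
As a hedged illustration of what ACID support looks like in practice, the sketch below creates a transactional ORC table and modifies a row. It uses the PyHive client purely for convenience; the same statements can be run from Beeline, and the connection details and table name are placeholders.

# Sketch: creating and modifying a transactional (ACID) ORC table via PyHive.
# Host, port, user, and table names are placeholders; requires a Hive 3.x
# installation with transactions enabled (older versions also require bucketing).
from pyhive import hive

conn = hive.Connection(host="hive-server.example.com", port=10000, username="etl_user")
cursor = conn.cursor()

# Full ACID support requires the ORC format plus the transactional table property.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS employees (
        id INT,
        name STRING,
        salary DOUBLE
    )
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true')
""")

# INSERT/UPDATE/DELETE become available on transactional tables.
cursor.execute("INSERT INTO employees VALUES (1, 'Employee_Name1', 1000)")
cursor.execute("UPDATE employees SET salary = 1200 WHERE id = 1")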

Hive stores the data in HDFS and the schema in an RDBMS (Derby, MySQL, etc.):

  1. When a user creates a table, its schema is recorded in the RDBMS.
  2. When data is loaded, files are created in HDFS. Users can also put files directly into HDFS without interacting with the RDBMS.
  3. Schema-on-read: when the table is queried, Hive checks the schema, most importantly the line delimiter and the field delimiter.

Rows and fields are read from the file according to these delimiters, and the resulting table is returned to the user.

e.g.

As per the table definition, the line delimiter is '\n' (newline) and the field delimiter is ',' (comma).

The file in HDFS would then look like:

1,Employee_Name1,1000

2,Employee_Name2,2000

While reading this file, Hive would map it to a table with 2 rows and 3 columns.
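
A hedged sketch of a table definition that would match this file is shown below. spark.sql is used purely as a convenient way to run HiveQL from Python; the same DDL could be run in Beeline. The database location, column types, and path are placeholders.

# Sketch: a Hive table whose delimiters match the file above.
# Run here through PySpark with Hive support; names and paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS employee (
        id INT,
        name STRING,
        salary INT
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\\n'
    STORED AS TEXTFILE
    LOCATION '/data/employee'
""")

# Schema-on-read: Hive applies the delimiters only when the file is queried.
spark.sql("SELECT * FROM employee").show()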

Interesting part -

  • Even if the file we put directly into HDFS is something arbitrary, such as the lyrics of a song, Hive will not throw an exception.
  • Hive simply checks the line delimiter to split the file into rows, and the field delimiter to split each row into columns.
  • If neither delimiter is present in the file, all of the song lyrics end up in the first column of the first row of the table.

Friday, September 1, 2023


 The level of interdependence between the variables in your dataset should be identified and quantified. Being aware of these interdependencies helps you prepare your data to meet the expectations of machine learning algorithms, such as linear regression, whose performance can degrade when input variables are highly correlated.

In this tutorial you will learn how to compute correlation for different types of variables and relationships. Correlation is the statistical summary of the relationship between variables.

There are 4 sections in this tutorial; they are as follows:

  1. Correlation: What Is It?
  2. Covariance of the Test Dataset
  3. Correlation by Pearson
  4. Using Spearman's Correlation

Correlation: What Is It?

There are many different ways that variables in a dataset might be related or connected.
For instance:
  • The values of one variable may influence or be dependent upon the values of another.
  • There may be a very slight correlation between two variables.
  • Two variables may both depend on a third, unknown variable.
The ability to better grasp the relationships between variables can be helpful in data analysis and modeling. The term "correlation" refers to the statistical association between two variables.

A correlation can be positive, meaning both variables move in the same direction, or negative, meaning that when one variable's value increases, the other's decreases. A correlation can also be neutral or zero, meaning the variables are unrelated.
    1. Positive correlation: both variables change in the same direction.
    2. Neutral correlation: no relationship between the changes in the variables.
    3. Negative correlation: the variables change in opposite directions.

Impact of Correlations on ML Models:

  • Multicollinearity, a close relationship between two or more variables, can cause some algorithms to perform worse. In linear regression, for instance, one of the offending correlated variables should be removed to improve the model's skill (a short detection sketch follows this list).
  • We may also be interested in the correlation between input variables and the output variable, to gain insight into which variables may or may not be relevant as inputs for building a model.
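
As a hedged sketch of the first point, the snippet below builds a small pandas DataFrame with synthetic, correlated features and flags pairs whose absolute correlation exceeds a chosen threshold; the 0.9 cutoff and the column names are arbitrary choices.

# Sketch: flagging highly correlated input features before modelling.
# The synthetic columns and the 0.9 threshold are arbitrary choices.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000
x1 = rng.normal(0, 1, n)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 * 0.95 + rng.normal(0, 0.1, n),   # nearly collinear with x1
    "x3": rng.normal(0, 1, n),                  # independent feature
})

corr = df.corr().abs()
# Look only at the upper triangle so each pair is reported once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_review = [(a, b, round(upper.loc[a, b], 3))
             for a in upper.index for b in upper.columns
             if pd.notna(upper.loc[a, b]) and upper.loc[a, b] > 0.9]
print(to_review)   # e.g. [('x1', 'x2', 0.995)] -> consider dropping one of the pair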

Generating Data for the Correlation Analysis:


# generate related variables
from numpy import mean
from numpy import std
from numpy.random import randn
from numpy.random import seed
from matplotlib import pyplot
# seed random number generator
seed(1)
# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
# summarize
print('data1: mean=%.3f stdv=%.3f' % (mean(data1), std(data1)))
print('data2: mean=%.3f stdv=%.3f' % (mean(data2), std(data2)))
# plot
pyplot.scatter(data1, data2)
pyplot.show()

data1: mean=100.776 stdv=19.620
data2: mean=151.050 stdv=22.358
Covariance of the Test Dataset:

Variables can be related by a linear relationship, one that is consistently additive across the two data samples. This relationship can be summarized by the covariance between the two variables: the average of the products of the two centered samples (each value minus its sample mean).

The formula for computing the sample covariance is outlined as follows:

cov(X, Y) = sum((x - mean(X)) * (y - mean(Y))) / (n - 1)


Using the mean in the calculation implies that each data sample should have a Gaussian or Gaussian-like distribution. The sign of the covariance indicates whether the variables change together (positive) or in opposite directions (negative), but its magnitude is not easily interpreted. A covariance of zero indicates that the variables have no linear relationship.

NumPy's cov() function computes a covariance matrix for multiple variables.
covariance = cov(data1, data2)

The diagonal of the matrix contains the covariance of each variable with itself, i.e., its variance. The other entries are the covariance between the two paired variables; since only two variables are considered here, those two entries are identical.

We can derive the covariance matrix for the given pair of variables in our test scenario. Here's the complete illustration:


from numpy.random import randn
from numpy.random import seed
from numpy import cov

# Set random seed
seed(1)

# Prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)

# Calculate covariance matrix
covariance = cov(data1, data2)
print(covariance)

[[385.33297729 389.7545618 ]
 [389.7545618  500.38006058]]

Covariance and covariance matrix play a crucial role in statistics and multivariate analysis for describing relationships among variables.

By executing the example, the covariance matrix is computed and displayed.

Since the dataset involves variables drawn from Gaussian distribution and exhibits linear correlation, covariance is a suitable approach for characterization.

The covariance between the two variables is 389.75. This positive value indicates that the variables change together in the expected direction.

Using covariance as a standalone statistical tool is problematic due to its complex interpretation, prompting the introduction of Pearson's correlation coefficient.


Pearson’s Correlation:

The Pearson correlation coefficient, named after Karl Pearson, summarizes linear relationship strength between data samples.

It's calculated by dividing the covariance of variables by the product of their standard deviations, normalizing the covariance for an interpretable score.

 

Pearson's correlation coefficient = covariance(X, Y) / (stdv(X) * stdv(Y))

The requirement for mean and standard deviation use implies a Gaussian or Gaussian-like distribution for the data samples.

The outcome of the calculation, the correlation coefficient, provides insights into the relationship.

The coefficient ranges from -1 to 1, signifying the extent of correlation, with 0 meaning no correlation. Typically, values below -0.5 or above 0.5 indicate notable correlation, while values inside that range suggest a weaker association.

To compute the Pearson’s correlation coefficient for data samples of equal length, one can utilize the pearsonr() function from SciPy.

# calculate the Pearson's correlation between two variables
from numpy.random import randn
from numpy.random import seed
from scipy.stats import pearsonr
# seed random number generator
seed(1)
# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
# calculate Pearson's correlation
corr, _ = pearsonr(data1, data2)
print('Pearsons correlation: %.3f' % corr)

Pearsons correlation: 0.888

The variables show a strong positive correlation: the coefficient of 0.888 is close to 1.0, indicating a high degree of association.

Pearson's correlation coefficient assesses relationships among multiple variables. This involves creating a correlation matrix by calculating interactions between each variable pair. The matrix is symmetrical, with 1.0 on the diagonal due to perfect self-correlation in each column.
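
For example, NumPy's corrcoef() function returns this matrix directly; the short sketch below reuses the same two test variables.

# Correlation matrix for the two test variables using NumPy's corrcoef().
from numpy.random import randn, seed
from numpy import corrcoef

seed(1)
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)

# 2x2 matrix: 1.0 on the diagonal, Pearson's r (about 0.888) off the diagonal.
print(corrcoef(data1, data2))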

Spearman’s Correlation

Variables can exhibit varying nonlinear relationships across their distributions. The distribution of these variables may also deviate from the Gaussian pattern.

For such cases, the Spearman's correlation coefficient, named after Charles Spearman, measures the strength of association between data samples. It's applicable for both nonlinear and linear relationships, with slightly reduced sensitivity (lower coefficients) in the latter case.

Similar to the Pearson correlation, scores range between -1 and 1 for perfect negative and positive correlations, respectively. However, Spearman's coefficient utilizes rank-based statistics rather than sample values, making it suitable for non-parametric analysis, where data distribution assumptions like Gaussian aren't made.

Spearman's correlation coefficient = covariance(rank(X), rank(Y)) / (stdv(rank(X)) * stdv(rank(Y)))

# calculate the Spearman's correlation between two variables
from numpy.random import randn
from numpy.random import seed
from scipy.stats import spearmanr
# seed random number generator
seed(1)
# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
# calculate spearman's correlation
corr, _ = spearmanr(data1, data2)
print('Spearmans correlation: %.3f' % corr)


Spearmans correlation: 0.872

Even though we know the data is Gaussian and the relationship between the variables is linear, the nonparametric rank-based method still reports a strong correlation of 0.872.

Wednesday, August 30, 2023


Question: 

Lanternfish

You are in the presence of a specific species of lanternfish. They have one special attribute: each lanternfish creates a new lanternfish once every 7 days.

However, this process isn’t necessarily synchronized between every lanternfish - one lanternfish might have 2 days left until it creates another lanternfish, while another might have 4. So, you can model each fish as a single number that represents the number of days until it creates a new lanternfish.

Furthermore, you reason, a new lanternfish would surely need slightly longer before it’s capable of producing more lanternfish: two more days for its first cycle.

So, suppose you have a lanternfish with an internal timer value of 3:

After one day, its internal timer would become 2.

After another day, its internal timer would become 1.

After another day, its internal timer would become 0.

After another day, its internal timer would reset to 6, and it would create a new lanternfish with an internal timer of 8.

After another day, the first lanternfish would have an internal timer of 5, and the second lanternfish would have an internal timer of 7.

A lanternfish that creates a new fish resets its timer to 6, not 7 (because 0 is included as a valid timer value). The new lanternfish starts with an internal timer of 8 and does not start counting down until the next day.

For example, suppose you were given the following list:

3,4,3,1,2

This list means that the first fish has an internal timer of 3, the second fish has an internal timer of 4, and so on until the fifth fish, which has an internal timer of 2. Simulating these fish over several days would proceed as follows:

Initial state: 3,4,3,1,2

After 1 day: 2,3,2,0,1

After 2 days: 1,2,1,6,0,8

After 3 days: 0,1,0,5,6,7,8

After 4 days: 6,0,6,4,5,6,7,8,8

After 5 days: 5,6,5,3,4,5,6,7,7,8

After 6 days: 4,5,4,2,3,4,5,6,6,7

After 7 days: 3,4,3,1,2,3,4,5,5,6

After 8 days: 2,3,2,0,1,2,3,4,4,5

After 9 days: 1,2,1,6,0,1,2,3,3,4,8

After 10 days: 0,1,0,5,6,0,1,2,2,3,7,8

After 11 days: 6,0,6,4,5,6,0,1,1,2,6,7,8,8,8

After 12 days: 5,6,5,3,4,5,6,0,0,1,5,6,7,7,7,8,8

After 13 days: 4,5,4,2,3,4,5,6,6,0,4,5,6,6,6,7,7,8,8

After 14 days: 3,4,3,1,2,3,4,5,5,6,3,4,5,5,5,6,6,7,7,8

After 15 days: 2,3,2,0,1,2,3,4,4,5,2,3,4,4,4,5,5,6,6,7

After 16 days: 1,2,1,6,0,1,2,3,3,4,1,2,3,3,3,4,4,5,5,6,8

After 17 days: 0,1,0,5,6,0,1,2,2,3,0,1,2,2,2,3,3,4,4,5,7,8

After 18 days: 6,0,6,4,5,6,0,1,1,2,6,0,1,1,1,2,2,3,3,4,6,7,8,8,8,8

Each day, a 0 becomes a 6 and adds a new 8 to the end of the list, while each other number decreases by 1 if it was present at the start of the day.

In this example, after 18 days, there are a total of 26 fish.

Question 1 (easy): How many lanternfish would there be after 80 days?

Question 2 (harder): How many lanternfish would there be after 400 days?
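
A hedged sketch of one way to answer both questions: simulating individual fish works for 80 days, but the population grows exponentially, so for 400 days it is better to keep only a count of fish per timer value (9 buckets) and shift those counts each day.

# Sketch: counting lanternfish by timer value instead of simulating each fish.
def count_fish(initial_timers, days):
    # buckets[t] = number of fish whose timer is currently t (0..8)
    buckets = [0] * 9
    for t in initial_timers:
        buckets[t] += 1
    for _ in range(days):
        spawning = buckets[0]
        # Every other timer shifts down by one day.
        buckets = buckets[1:] + [0]
        buckets[6] += spawning   # parents reset to 6
        buckets[8] += spawning   # newborns start at 8
    return sum(buckets)

example = [3, 4, 3, 1, 2]
print(count_fish(example, 18))    # 26, matching the walkthrough above
print(count_fish(example, 80))    # question 1 for the example input
print(count_fish(example, 400))   # question 2; far too many fish to simulate one by one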

Wednesday, July 26, 2023


 

What is Llama 2?

Llama 2 is the next generation of Meta's open source large language model.

  • Llama 2 is a transformer-based language model developed by researchers at Meta AI.
  • The model is trained on a large corpus of text data and is designed to generate coherent and contextually relevant text.
  • Llama 2 uses a multi-layer, decoder-only transformer architecture to generate text.
  • The model is trained and evaluated on a variety of tasks, including language translation, text summarization, and text generation.
  • Llama 2 has achieved state-of-the-art results on several benchmark datasets.
  • The model's architecture and training procedures are made publicly available to encourage further research and development in the field of natural language processing.
  • Llama 2 has many potential applications, including chatbots and language translation.
How to download Llama 2?
1. From the Meta git repository using download.sh
2. From Hugging Face

1. From the Meta git repository using download.sh

Below are the steps to download Llama 2 from the Meta website.

  • Go to the Meta website https://ai.meta.com/llama/

  • Click on Download and provide the details in the form.

  • Accept the terms and conditions and continue.

  • Once you submit the form, you will receive an email from Meta with instructions to download the model from the git repository. You can download Llama 2 locally using the download.sh script from the repository.

  • Run download.sh; it will ask for the authentication URL from Meta (the URL expires in 24 hours). After providing it, you will be prompted for the model size, i.e. 7B, 13B, or 70B, and the corresponding model will be downloaded.
Downloaded file in Google Colab: 

2. From Hugging Face

Once you get the acceptance email from Meta, log in to Hugging Face.
Link: https://huggingface.co/meta-llama


Select any model and submit a request for access from Hugging Face.


Note: This is a form to enable access to Llama 2 on Hugging Face after you have been granted access from Meta. Please visit the Meta website and accept the license terms and acceptable use policy before submitting this form. Requests will be processed in 1-2 days.

You will receive an 'access granted' email from Hugging Face.


3. Create an Access Token from 'Settings' in your Hugging Face account.


4. Llama 2 with LangChain and Hugging Face in Google Colab
1. Change the runtime type to GPU in Google Colab.


For demonstration purposes, I have used the meta-llama/Llama-2-7b-chat-hf model in the code.
Step 1
Install the packages below as part of requirements.txt.
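
The original screenshot of the package list is not shown; a plausible set for this walkthrough (an assumption, adjust to your environment) can be installed in a Colab cell:

# Plausible package set for this walkthrough (assumed; the original list is not shown).
!pip install -q transformers accelerate torch huggingface_hub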


Step 2
Log in to the Hugging Face CLI using the previously generated Access Token.
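
From a notebook, the same login can be done with the huggingface_hub helper; a minimal sketch, with the token value as a placeholder:

# Authenticate to Hugging Face with the Access Token created earlier.
from huggingface_hub import login

login(token="hf_xxx")  # placeholder token; alternatively run: !huggingface-cli login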



Step 3
Install LangChain.
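
A one-line install, assuming the 2023-era langchain package used in this post:

!pip install -q langchain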

Step 4
Import all packages
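
A plausible import cell for the steps that follow (assumed; these import paths match the LangChain releases current at the time of writing, newer versions move some of them into langchain_community):

import torch
from transformers import AutoTokenizer, pipeline
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain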


Step 5
Create the pipeline using transformers.pipeline.
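
A minimal sketch of the text-generation pipeline, continuing from the imports in Step 4 and assuming gated access to the model plus a GPU runtime:

# Continues from the imports in Step 4; assumes access to the gated model and a GPU runtime.
model_id = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,  # half precision so the 7B model fits on a single Colab GPU
    device_map="auto",
    max_new_tokens=256,
)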


Step 6
Create the LLM using HuggingFacePipeline from LangChain.
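
Wrapping the pipeline as a LangChain LLM is then a one-liner; a sketch:

# Wrap the transformers pipeline so LangChain can drive it.
llm = HuggingFacePipeline(pipeline=pipe)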


Step 7
Create the prompt template and run it using LLMChain from LangChain.
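
A sketch of a prompt template and chain, using the Llama 2 chat [INST] format; the system message and question are illustrative only:

# Llama 2 chat-style prompt template; the system message and question are examples.
template = """<s>[INST] <<SYS>>
You are a helpful assistant. Answer concisely.
<</SYS>>
{question} [/INST]"""

prompt = PromptTemplate(template=template, input_variables=["question"])
chain = LLMChain(llm=llm, prompt=prompt)

print(chain.run(question="Explain what a data pipeline is in two sentences."))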

Below are some use cases.
1. Summarize a paragraph
2. Ask for information / answer a question
3. Named entity recognition (NER)
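
Hedged examples of the three use cases, reusing the chain above; the input texts are placeholders:

# 1. Summarize a paragraph
print(chain.run(question="Summarize in one sentence: Apache Hive is a distributed, "
                         "fault-tolerant data warehousing system built on Hadoop that "
                         "lets users manage petabytes of data with SQL."))

# 2. Ask for information / answer a question
print(chain.run(question="What is the capital of France?"))

# 3. Named entity recognition
print(chain.run(question="List the people and organizations mentioned in this sentence: "
                         "'Sundar Pichai announced new AI features at Google I/O.'"))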



