Wednesday, July 8, 2020

Pig Tutorial

Pig Tutorial: Apache Pig Architecture and Twitter Case Study

Last updated on May 20, 2020

Shubham Sinha is a Big Data and Hadoop expert working as a Research Analyst at Edureka.

As we mentioned in our Hadoop Ecosystem blog, Apache Pig is an essential part of our Hadoop ecosystem.
So, I would like to take you through this Apache Pig tutorial, which is a part of our Hadoop Tutorial Series. Learning it will help you understand and seamlessly execute the projects required for Big Data Hadoop Certification.

In this Apache Pig Tutorial blog, I will talk about:
- Apache Pig vs MapReduce
- Introduction to Apache Pig
- Where to use Apache Pig?
- Twitter Case Study
- Apache Pig Architecture
- Pig Latin Data Model
- Apache Pig Schema

Before starting with the Apache Pig tutorial, I would like you to ask yourself a question: when MapReduce was already there for Big Data analytics, why did Apache Pig come into the picture?

The sweet and simple answer is: approximately 10 lines of Pig code are equal to 200 lines of MapReduce code. Writing MapReduce jobs in Java is not an easy task for everyone. A look at typical MapReduce Java code is enough to understand the complexities. Thus, Apache Pig emerged as a boon for programmers who were not good with Java or Python. Even someone who knows Java and is good with MapReduce will often prefer Apache Pig because of the ease of working with it. Let us take a look now.

Apache Pig Tutorial: Apache Pig vs MapReduce

Programmers face difficulty writing MapReduce tasks, as it requires Java or Python programming knowledge. For them, Apache Pig is a savior.
- Pig Latin is a high-level data flow language, whereas MapReduce is a low-level data processing paradigm.
- Without writing complex Java implementations in MapReduce, programmers can achieve the same results very easily using Pig Latin.
- Apache Pig uses a multi-query approach (i.e. a single Pig Latin query can accomplish multiple MapReduce tasks), which reduces the length of the code by 20 times. Hence, this reduces the development period by almost 16 times.
- Pig provides many built-in operators to support data operations like joins, filters, ordering, sorting, etc., whereas performing the same functions in MapReduce is a humongous task.
- Performing a Join operation in Apache Pig is simple.
  In MapReduce, by contrast, a Join between data sets is difficult, as it requires multiple MapReduce tasks to be executed sequentially to fulfill the job.
- In addition, Pig provides nested data types like tuples, bags, and maps that are missing from MapReduce. I will explain these data types in a while.

Now that we know why Apache Pig came into the picture, you would be curious to know what Apache Pig is. Let us move ahead in this Apache Pig tutorial blog and go through the introduction and features of Apache Pig.

Apache Pig Tutorial: Introduction to Apache Pig

Apache Pig is a platform used to analyze large data sets by representing them as data flows. It is designed to provide an abstraction over MapReduce, reducing the complexities of writing a MapReduce program. We can perform data manipulation operations very easily in Hadoop using Apache Pig.

The features of Apache Pig are:
- Pig enables programmers to write complex data transformations without knowing Java.
- Apache Pig has two main components: the Pig Latin language and the Pig run-time environment, in which Pig Latin programs are executed.
- For Big Data analytics, Pig gives a simple data flow language known as Pig Latin, which has functionalities similar to SQL like join, filter, limit, etc.
- Developers who work with scripting languages and SQL can leverage Pig Latin. This gives developers ease of programming with Apache Pig. Pig Latin provides various built-in operators like join, sort, filter, etc. to read, write, and process large data sets. Thus it is evident that Pig has a rich set of operators.
- Programmers write scripts using Pig Latin to analyze data, and these scripts are internally converted to Map and Reduce tasks by the Pig MapReduce engine.
  Before Pig, writing MapReduce tasks was the only way to process the data stored in HDFS.
- If programmers want custom functions that are unavailable in Pig, Pig allows them to write User Defined Functions (UDFs) in languages such as Java, Python, Ruby, Jython, JRuby, etc. and embed them in a Pig script. This provides extensibility to Apache Pig.
- Pig can process any kind of data, i.e. structured, semi-structured or unstructured data, coming from various sources.
- Approximately 10 lines of Pig code are equal to 200 lines of MapReduce code.
- It can handle inconsistent schemas (in the case of unstructured data).
- Apache Pig extracts the data, performs operations on that data and dumps the data in the required format in HDFS, i.e. ETL (Extract, Transform, Load).
- Apache Pig automatically optimizes the tasks before execution, i.e. automatic optimization.
- It allows programmers and developers to concentrate on the whole operation rather than creating mapper and reducer functions separately.

Now that we know what Apache Pig is, let us understand where we can use it and which use cases suit it best.

Apache Pig Tutorial: Where to use Apache Pig?

Apache Pig is used for analyzing and performing tasks involving ad-hoc processing. Apache Pig is used:
- Where we need to process huge data sets like web logs, streaming online data, etc.
- Where we need data processing for search platforms (where different types of data need to be processed). For example, Yahoo uses Pig for 40% of its jobs, including news feeds and the search engine.
- Where we need to process time-sensitive data loads. Here, data needs to be extracted and analyzed quickly. For example, machine learning algorithms require time-sensitive data loads: Twitter needs to quickly extract data on customer activities (i.e. tweets, retweets and likes), analyze the data to find patterns in customer behavior, and make recommendations immediately, like trending tweets.

Now, in our Apache Pig tutorial, let us go through the Twitter case study to better understand how Apache Pig helps in analyzing data and makes business understanding easier.

Apache Pig Tutorial: Twitter Case Study

I will take you through a case study of Twitter, where Twitter adopted Apache Pig.

Twitter's data was growing at an accelerating rate (i.e. 10 TB of data per day). Thus, Twitter decided to move the archived data to HDFS and adopt Hadoop for extracting business value out of it.

Their major aim was to analyse data stored in Hadoop to come up with the following insights on a daily, weekly or monthly basis.

Counting operations:
- How many requests does Twitter serve in a day?
- What is the average latency of the requests?
- How many searches happen each day on Twitter?
- How many unique queries are received?
- How many unique users come to visit?
- What is the geographic distribution of the users?

Correlating Big Data:
- How does usage differ for mobile users?
- Cohort analysis: analyzing data by categorizing users based on their behavior.
- What goes wrong when a site problem occurs?
- Which features do users often use?
- Search correction and search suggestions.

Research on Big Data to produce better outcomes, like:
- What can Twitter analyze about users from their tweets?
- Who follows whom, and on what basis?
- What is the ratio of followers to following?
- What is the reputation of the user?
- and many more

So, for analyzing data, Twitter used MapReduce initially, which is parallel computing over HDFS (i.e. Hadoop Distributed File System).

For example, they wanted to analyse how many tweets are stored per user in the given tweet table.

Using MapReduce, this problem is solved sequentially, as shown in the image below:

The MapReduce program first inputs the key as rows and sends the tweet table information to the mapper function.
Then the mapper function selects the user id and associates a unit value (i.e. 1) with every user id. The shuffle function sorts the same user ids together. At last, the reduce function adds up all the tweets belonging to the same user. The output is the user id, combined with the user name and the number of tweets per user.

But while using MapReduce, they faced some limitations:
- Analysis typically needs to be done in Java.
- Joins need to be written in Java, which makes them longer and more error-prone.
- For projections and filters, custom code needs to be written, which makes the whole process slower.
- The job is divided into many stages while using MapReduce, which makes it difficult to manage.

So, Twitter moved to Apache Pig for analysis. Now, joining data sets, grouping them, sorting them and retrieving data becomes easier and simpler. You can see in the image below how Twitter used Apache Pig to analyse their large data set.

Twitter had both semi-structured data like Twitter Apache logs, Twitter search logs, Twitter MySQL query logs and application logs, and structured data like tweets, users, block notifications, phones, favorites, saved searches, retweets, authentications, SMS usage, user followings, etc., all of which can be easily processed by Apache Pig.

Twitter dumps all its archived data on HDFS. It has two tables, i.e. user data and tweets data. User data contains information about the users like username, followers, followings, number of tweets, etc., while tweet data contains the tweet, its owner, number of retweets, number of likes, etc.
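
The map, shuffle and reduce stages described earlier in this section can be sketched in plain Python. This is only an in-memory illustration of the idea, not Hadoop code: the sample rows and the function names (map_phase, shuffle_phase, reduce_phase, tweets_per_user) are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical tweet table rows: (user_id, user_name, tweet_text).
TWEETS = [
    (1, "Jay", "xyz"),
    (2, "Ellie", "abc"),
    (1, "Jay", "pqr"),
    (3, "Sam", "stu"),
    (2, "Ellie", "vxy"),
    (1, "Jay", "lmn"),
]


def map_phase(rows):
    """Emit (user_id, 1) for every tweet row."""
    return [(user_id, 1) for user_id, _name, _text in rows]


def shuffle_phase(pairs):
    """Bring all values for the same user_id together, sorted by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(grouped):
    """Sum the unit values to get the tweet count per user_id."""
    return {key: sum(values) for key, values in grouped.items()}


def tweets_per_user(rows):
    """Run the three stages one after another."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(tweets_per_user(TWEETS))
```

Even for this small counting task, three separate stages have to be written and wired together by hand, which is the overhead the limitations above refer to.
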
Now, Twitter uses this data to analyse their customers' behavior and improve on their past experiences.

We will see how Apache Pig solves the same problem that was solved by MapReduce:

Question: How many tweets are stored per user in the given tweet tables?

The image below shows the approach of Apache Pig to solve the problem. The step-by-step solution is as follows:

STEP 1: First of all, Twitter imports the Twitter tables (i.e. user table and tweet table) into HDFS.

STEP 2: Then Apache Pig loads (LOAD) the tables into the Apache Pig framework.

STEP 3: Then it joins and groups the tweet table and user table using the COGROUP command, as shown in the image. This results in the inner bag data type, which we will discuss later in this blog.

Example of inner bags produced (refer to the image):
(1,{(1,Jay,xyz),(1,Jay,pqr),(1,Jay,lmn)})
(2,{(2,Ellie,abc),(2,Ellie,vxy)})
(3,{(3,Sam,stu)})

STEP 4: Then the tweets are counted per user using the COUNT command, so that the total number of tweets per user can be easily calculated.

Example of tuples produced as (id, tweet count) (refer to the image):
(1,3)
(2,2)
(3,1)

STEP 5: At last, the result is joined with the user table to extract the user name.

Example of tuples produced as (id, name, tweet count) (refer to the image):
(1,Jay,3)
(2,Ellie,2)
(3,Sam,1)

STEP 6: Finally, this result is stored back in HDFS.

Pig is not limited to this operation. It can perform the various other operations which I mentioned earlier in this use case.

These insights help Twitter to perform sentiment analysis and develop machine learning algorithms based on user behaviors and patterns.

Now, after knowing the Twitter case study, let us take a deep dive in this Apache Pig tutorial and understand the architecture of Apache Pig and Pig Latin's data model.
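
STEPS 3 to 5 of the case study above can be mimicked in plain Python to see what shape the data takes at each stage. This is a sketch and not Pig Latin: the sample tables and the helper names (cogroup_by_user, count_tweets, join_with_users) are assumptions, and the grouping function only imitates the tweet side of the COGROUP output shown in the example.

```python
# Hypothetical sample tables, modelled on the example output above.
USERS = [(1, "Jay"), (2, "Ellie"), (3, "Sam")]
TWEETS = [
    (1, "Jay", "xyz"), (1, "Jay", "pqr"), (1, "Jay", "lmn"),
    (2, "Ellie", "abc"), (2, "Ellie", "vxy"),
    (3, "Sam", "stu"),
]


def cogroup_by_user(users, tweets):
    """STEP 3: collect each user's tweets into an inner bag (a list here)."""
    bags = {user_id: [] for user_id, _name in users}
    for tweet in tweets:
        bags[tweet[0]].append(tweet)
    return [(user_id, bag) for user_id, bag in bags.items()]


def count_tweets(grouped):
    """STEP 4: replace each inner bag with its size."""
    return [(user_id, len(bag)) for user_id, bag in grouped]


def join_with_users(counts, users):
    """STEP 5: attach the user name to every (id, count) tuple."""
    names = dict(users)
    return [(user_id, names[user_id], total) for user_id, total in counts]


if __name__ == "__main__":
    grouped = cogroup_by_user(USERS, TWEETS)
    counts = count_tweets(grouped)
    for row in join_with_users(counts, USERS):
        print(row)
```

Each helper corresponds to one step, so the intermediate values can be compared directly with the example tuples listed for STEP 3, STEP 4 and STEP 5.
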
Understanding the architecture helps us see how Pig works internally. Apache Pig draws its strength from its architecture.

You can check out the video "Pig Tutorial | Edureka", where all the concepts related to Pig are discussed.

Apache Pig Tutorial: Architecture

For writing a Pig script, we need the Pig Latin language, and to execute it, we need an execution environment. The architecture of Apache Pig is shown in the image below.

Pig Latin Scripts

Initially, as illustrated in the image, we submit Pig scripts, written in Pig Latin using built-in operators, to the Apache Pig execution environment.

There are three ways to execute a Pig script:
- Grunt Shell: This is Pig's interactive shell, provided to execute all Pig scripts.
- Script File: Write all the Pig commands in a script file and execute the Pig script file. This is executed by the Pig Server.
- Embedded Script: If some functions are unavailable in the built-in operators, we can programmatically create User Defined Functions to provide that functionality using other languages like Java, Python, Ruby, etc. and embed them in the Pig Latin script file. Then, execute that script file.

Parser

As the image shows, after passing through Grunt or the Pig Server, Pig scripts are passed to the parser. The parser does type checking and checks the syntax of the script. The parser outputs a DAG (directed acyclic graph), which represents the Pig Latin statements and logical operators. The logical operators are represented as the nodes and the data flows are represented as edges.

Optimizer

Then the DAG is submitted to the optimizer. The optimizer performs optimization activities like split, merge, transform, and reorder operators. This optimizer provides the automatic optimization feature of Apache Pig.
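
The DAG that the parser hands to the optimizer can be pictured as a small graph whose nodes are logical operators and whose edges are data flows. The sketch below is plain Python and only an illustration: the operator names form a hypothetical plan loosely modelled on the tweet-count example, and topological_order is a helper written for this example, not part of Pig.

```python
from collections import deque

# Hypothetical logical plan: each key is an operator (node) and each value
# lists the operators that consume its output (edges = data flows).
PLAN = {
    "LOAD users": ["COGROUP"],
    "LOAD tweets": ["COGROUP"],
    "COGROUP": ["COUNT"],
    "COUNT": ["STORE"],
    "STORE": [],
}


def topological_order(plan):
    """Return the operators in an order that respects every data flow."""
    indegree = {node: 0 for node in plan}
    for targets in plan.values():
        for target in targets:
            indegree[target] += 1
    ready = deque(node for node, degree in indegree.items() if degree == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for target in plan[node]:
            indegree[target] -= 1
            if indegree[target] == 0:
                ready.append(target)
    if len(order) != len(plan):
        raise ValueError("plan contains a cycle, so it is not a DAG")
    return order


if __name__ == "__main__":
    print(topological_order(PLAN))
```

Because the graph is acyclic, there is always at least one order in which every operator runs only after the operators feeding it, which is what lets later stages rearrange and compile the plan.
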
The optimizer basically aims to reduce the amount of data in the pipeline at any point in time while processing the extracted data, and for that it applies rules like:
- PushUpFilter: If there are multiple conditions in the filter and the filter can be split, Pig splits the conditions and pushes up each condition separately. Applying these conditions earlier helps in reducing the number of records remaining in the pipeline.
- PushDownForEachFlatten: Applying flatten, which produces a cross product between a complex type such as a tuple or a bag and the other fields in the record, as late as possible in the plan. This keeps the number of records low in the pipeline.
- ColumnPruner: Omitting columns that are never used or no longer needed, reducing the size of the record. This can be applied after each operator, so that fields can be pruned as aggressively as possible.
- MapKeyPruner: Omitting map keys that are never used, reducing the size of the record.
- LimitOptimizer: If the limit operator is applied immediately after a load or sort operator, Pig converts the load or sort operator into a limit-sensitive implementation, which does not require processing the whole data set. Applying the limit earlier reduces the number of records.

This is just a flavor of the optimization process. In addition, it also performs Join, Order By and Group By functions.

To turn off automatic optimization, you can execute this command:

pig -optimizer_off [opt_rule | all ]

Compiler

After the optimization process, the compiler compiles the optimized code into a series of MapReduce jobs. The compiler is responsible for automatically converting Pig jobs into MapReduce jobs.

Execution engine

Finally, as shown in the figure, these MapReduce jobs are submitted to the execution engine. The MapReduce jobs are then executed and produce the required result.
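
To see why a rule such as PushUpFilter matters, the sketch below compares two equivalent plans in plain Python: one filters after a join, the other pushes the filter up before it. It only illustrates the idea and is not Pig's optimizer; the sample relations, the filter on user id and all function names are assumptions.

```python
# Hypothetical sample relations: users (id, name) and tweets (user id, text).
USERS = [(1, "Jay"), (2, "Ellie"), (3, "Sam")]
TWEETS = [
    (1, "xyz"), (1, "pqr"), (1, "lmn"),
    (2, "abc"), (2, "vxy"),
    (3, "stu"),
]


def join(users, tweets):
    """Nested-loop join on user id; returns the rows and the pairs compared."""
    rows, compared = [], 0
    for user_id, name in users:
        for tweet_user_id, text in tweets:
            compared += 1
            if user_id == tweet_user_id:
                rows.append((user_id, name, text))
    return rows, compared


def filter_after_join(users, tweets, wanted_id):
    """Unoptimized plan: join everything, then filter."""
    rows, compared = join(users, tweets)
    return [row for row in rows if row[0] == wanted_id], compared


def filter_before_join(users, tweets, wanted_id):
    """Plan with the filter pushed up: filter both inputs, then join."""
    kept_users = [user for user in users if user[0] == wanted_id]
    kept_tweets = [tweet for tweet in tweets if tweet[0] == wanted_id]
    return join(kept_users, kept_tweets)


if __name__ == "__main__":
    late, late_cost = filter_after_join(USERS, TWEETS, 2)
    early, early_cost = filter_before_join(USERS, TWEETS, 2)
    print(late == early, late_cost, early_cost)
```

Both plans return the same rows, but the second one feeds far fewer records into the join, which is the effect the optimizer is after when it reduces the data in the pipeline.
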
The result can be displayed on the screen using the DUMP statement and can be stored in HDFS using the STORE statement.

After understanding the architecture, I will now explain Pig Latin's data model.

Apache Pig Tutorial: Pig Latin Data Model

The data model of Pig Latin enables Pig to handle all types of data. Pig Latin can handle both atomic data types like int, float, long, double, etc. and complex data types like tuple, bag and map. I will explain them individually. The image below shows the data types and their corresponding classes, using which we can implement them.

Atomic / Scalar Data Type

Atomic or scalar data types are the basic data types which are used in all languages, like string, int, float, long, double, char[], byte[]. These are also called primitive data types. The value of each cell in a field (column) is an atomic data type, as shown in the image below.

For fields, positional indexes are generated by the system automatically (also known as positional notation). They are represented by $ and start from $0, growing to $1, $2, and so on. In the image below, $0 = S.No., $1 = Bands, $2 = Members, $3 = Origin.

Examples of scalar values are 1, Linkin Park, 7, California, etc.

Now we will talk about the complex data types in Pig Latin, i.e. tuple, bag and map.

Tuple

A tuple is an ordered set of fields which may contain different data types for each field. You can think of it as a record stored in a row of a relational database. A tuple is a set of cells from a single row, as shown in the image above. The elements inside a tuple do not necessarily need to have a schema attached to them.

A tuple is represented by the () symbol.

Example of a tuple: (1, Linkin Park, 7, California)

Since tuples are ordered, we can access the fields in each tuple using the indexes of the fields; for example, $1 from the above tuple returns the value Linkin Park.
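
Positional notation maps naturally onto indexing an ordered sequence. The Python sketch below is an analogy only: field is a hypothetical helper that resolves a $n reference against a tuple, and it says nothing about how Pig itself is implemented.

```python
# The example tuple from the text, without any schema attached.
BAND = (1, "Linkin Park", 7, "California")


def field(record, position):
    """Resolve a positional reference such as "$1" against a tuple."""
    if not position.startswith("$") or not position[1:].isdigit():
        raise ValueError(f"not a positional reference: {position!r}")
    return record[int(position[1:])]


if __name__ == "__main__":
    print(field(BAND, "$1"))
```

No field names are needed for the lookup, which is why positional notation still works when no schema has been declared.
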
You can notice that the above tuple doesn't have any schema attached to it.

Bag

A bag is a collection of tuples, and these tuples are a subset of rows or entire rows of a table. A bag can contain duplicate tuples; they are not required to be unique.

The bag has a flexible schema, i.e. tuples within the bag can have a different number of fields. A bag can also have tuples with different data types.

A bag is represented by the {} symbol.

Example of a bag: {(Linkin Park, 7, California), (Metallica, 8), (Mega Death, Los Angeles)}

But for Apache Pig to effectively process bags, the fields and their respective data types need to be in the same sequence.

Set of bags: {(Linkin Park, 7, California), (Metallica, 8), (Mega Death, Los Angeles)}, {(Metallica, 8, Los Angeles), (Mega Death, 8), (Linkin Park, California)}

There are two types of bag, i.e. outer bag (or relation) and inner bag.

An outer bag or relation is nothing but a bag of tuples. Here, relations are similar to relations in relational databases. To understand it better, let us take an example:

{(Linkin Park, California), (Metallica, Los Angeles), (Mega Death, Los Angeles)}

The above bag expresses the relation between the bands and their place of origin.

On the other hand, an inner bag is a bag inside a tuple. For example, if we group the band tuples by the band's origin, we will get:

(Los Angeles, {(Metallica, Los Angeles), (Mega Death, Los Angeles)})
(California, {(Linkin Park, California)})

Here, the first field type is a string while the second field type is a bag, which is an inner bag within a tuple.

Map

A map is a set of key-value pairs used to represent data elements. The key must be a chararray and should be unique, like a column name, so it can be indexed and the value associated with it can be accessed on the basis of the key.
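
The complex types described so far have rough counterparts in Python containers, which can help build intuition. In the sketch below a list stands in for a bag and a dict for a map. The helper group_by_origin and the sample map values are assumptions for illustration, and the analogy is loose (a Python list is ordered, for instance).

```python
# Outer bag (relation): a collection of tuples; duplicates would be allowed.
BANDS = [
    ("Linkin Park", "California"),
    ("Metallica", "Los Angeles"),
    ("Mega Death", "Los Angeles"),
]


def group_by_origin(bands):
    """Group band tuples by origin; each result tuple holds an inner bag."""
    groups = {}
    for band in bands:
        groups.setdefault(band[1], []).append(band)
    return [(origin, inner_bag) for origin, inner_bag in groups.items()]


# Map: unique string keys, values of any type (sample values are assumed).
BAND_INFO = {"band": "Linkin Park", "members": 7}

if __name__ == "__main__":
    for row in group_by_origin(BANDS):
        print(row)
    print(BAND_INFO["members"])
```

Each grouped row pairs a plain value with a nested collection of tuples, mirroring the inner bag example in the Bag section, while the dict lookup mirrors accessing a map value by its key.
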
The value in a map can be of any data type.

Maps are represented by the [] symbol, and key and value are separated by the # symbol, as you can see in the image above.

Example of maps: [band#Linkin Park, members#7], [band#Metallica, members#8]

Now that we have learned Pig Latin's data model, we will understand how Apache Pig handles schema as well as how it works with schema-less data.

Apache Pig Tutorial: Schema

A schema assigns a name to the field and declares the data type of the field. Schema is optional in Pig Latin, but Pig encourages you to use one whenever possible, as error checking becomes more efficient while parsing the script, which results in efficient execution of the program. A schema can be declared with both simple and complex data types. During LOAD, if the schema is declared, it is also attached to the data.

A few points on schema in Pig:
- If the schema only includes the field name, the data type of the field is considered to be bytearray.
- If you assign a name to the field, you can access the field by both the field name and the positional notation, whereas if the field name is missing, we can only access it by the positional notation, i.e. $ followed by the index number.
- If you perform any operation which is a combination of relations (like JOIN, COGROUP, etc.) and any of the relations is missing a schema, the resulting relation will have a null schema.
- If the schema is null, Pig will consider the field as bytearray, and the real data type of the field will be determined dynamically.

I hope this Apache Pig tutorial blog is informative and you liked it. In this blog, you got to know the basics of Apache Pig, its data model and its architecture. The Twitter case study should have helped you connect the concepts better.
In my next blog of the Hadoop Tutorial Series, we will be covering the installation of Apache Pig, so that you can get your hands dirty while working practically on Pig and executing Pig Latin commands.

Now that you have understood the Apache Pig Tutorial, check out the Hadoop training by Edureka, a trusted online learning company with a network of more than 250,000 satisfied learners spread across the globe. The Edureka Big Data Hadoop Certification Training course helps learners become experts in HDFS, Yarn, MapReduce, Pig, Hive, HBase, Oozie, Flume and Sqoop using real-time use cases in the Retail, Social Media, Aviation, Tourism and Finance domains.

Got a question for us? Please mention it in the comments section and we will get back to you.
