Introduction


Transcript

Hello, everyone, my name is Eric Goh. I am the instructor for this introduction to data and text mining using DSTK 3. In this course I'm going to explain why we do data and text mining, and how to use the DSTK 3 software to do all these data and text mining tasks. The ebook I am using now can actually be downloaded at the dstk.tech website; we will explore the dstk.tech website later. So what is DSTK 3? DSTK, the Data Science Toolkit, is a set of data and text mining software that was developed to follow very closely the CRISP-DM model. DSTK offers data understanding using statistical and text analysis, including data visualizations, where we plot histograms, pie charts and so on.

Then at the data preparation stage, DSTK offers normalization and text pre-processing. We also have feature selection, to help us select which of the variables are important for prediction or predictive analytics, and DSTK also has modeling and evaluation for the machine learning and statistical learning algorithms. So, in general, DSTK 3 is a set of data and text mining software with easy-to-use features and an easy-to-use interface to do all this data and text mining, okay. So, what is data science? Data science is very popular nowadays; it is a very hot trend. According to Wikipedia, data science is an interdisciplinary field.

Okay, it is about processes and systems to extract information or insight from data in various forms, whether that data is structured or unstructured. Data science is interdisciplinary because, as many websites and many people recognize, for data science you need to have expertise in mathematics or statistics, you need expertise in computer science or programming and in machine learning, and then you need domain expertise. Domain expertise means that if you are in the business field doing data analysis or data science, you need to have domain expertise in business, okay. What is data mining? Data science is a field that is interdisciplinary; data mining is a process to discover patterns from data using machine learning, statistics, and databases or data warehouses, okay. It usually results in the creation of prediction models, to predict the output of one variable based on new input data.

So, in prediction modeling, I will say we have a set of training data, historical data, that we use to train the model or classifier, and the model will then be used on a new set of data, to predict one of the columns based on the new set of inputs. We will expand on this prediction or predictive analytics later. So, in data mining we have this CRISP-DM model here. What is business understanding? Business understanding, I will say, is to understand what we are going to mine from the data, or what is the knowledge we need to get from the data. Then data understanding, I will say, is more on data exploration, to understand what is inside the data and what the columns are.

So, for data understanding we can use statistical analysis. We can use the mean, median, standard deviation and variance as descriptive statistics on each of the variables, or each column. We can also use inferential statistics to test between two variables or two columns, or you can say two samples. For data understanding or exploration we can also use visualization, like bar charts and histograms, to visualize the data. From there we can understand the data more carefully. Then we come to the data preparation stage, where we try to prepare the data for prediction or predictive modeling in the next stage, the modeling stage. For data preparation we can do data transformations, like logarithmic transformations.
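The descriptive statistics just mentioned (mean, median, standard deviation, variance) can be sketched with Python's standard library; the numeric column here is made up for illustration:

```python
import statistics

# A hypothetical numeric column from a spreadsheet
ages = [23, 29, 31, 35, 35, 42, 51]

mean = statistics.mean(ages)          # central tendency
median = statistics.median(ages)      # middle value, robust to outliers
stdev = statistics.stdev(ages)        # sample standard deviation
variance = statistics.variance(ages)  # sample variance (stdev squared)

print(mean, median, round(stdev, 2), round(variance, 2))
```

Note that `statistics.stdev` and `statistics.variance` compute the sample (n-1) versions; `pstdev` and `pvariance` give the population versions.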

We can do normalization, like standard score (z-score) normalization or feature scaling normalization, and we can remove all the missing values here. At this stage we can also use correlation to find the important variables and then reduce the number of columns in the data. So, at this stage we try to transform the data. Then the modeling stage is next: modeling is where we create our classifiers, models or prediction models. For the models or classifiers we can use statistical learning: regression, that is linear regression and multiple linear regression. Logistic regression and Naive Bayes are more probabilistic techniques, and they are also inside statistical learning.
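As a minimal sketch of the two normalizations mentioned here, z-score (standard score) and min-max feature scaling, in plain Python; the column values are made up for illustration:

```python
import statistics

values = [10.0, 20.0, 30.0, 40.0, 50.0]

# Z-score normalization: (x - mean) / standard deviation
mu = statistics.mean(values)
sigma = statistics.stdev(values)
z_scores = [(x - mu) / sigma for x in values]

# Min-max feature scaling: rescale values into the [0, 1] range
lo, hi = min(values), max(values)
scaled = [(x - lo) / (hi - lo) for x in values]

print(scaled)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Z-scores are useful when features have very different units; min-max scaling is common when an algorithm expects inputs in a fixed range.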

Inside machine learning modeling, we can also train models using machine learning algorithms like neural networks, KNN, SVM, and so on. Then at the evaluation stage, we evaluate how accurate the prediction is, how accurately the classifier predicts the data, okay. So, if we want to improve the model to get higher precision or accuracy, we can go back to the data preparation stage, or we can also go back to the business understanding stage, then data understanding, and then prepare our data again. So this process does not just go in one direction: business understanding, data understanding, then data preparation, modeling and evaluation; if we have any issue, we can always go back to a previous stage. Let's say at the evaluation stage the model is not accurate enough: we can go back to the business understanding stage, or we can go back to the data preparation stage from the modeling stage.
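To make the modeling and evaluation stages concrete, here is a toy sketch: a 1-nearest-neighbour classifier (the simplest form of the KNN mentioned above), trained on a few made-up 2-D points and scored with accuracy. This is purely illustrative, not how DSTK implements it:

```python
def predict_1nn(train, label_of, point):
    """Predict the label of `point` as the label of its nearest training point."""
    nearest = min(train, key=lambda t: sum((a - b) ** 2 for a, b in zip(t, point)))
    return label_of[nearest]

# Made-up 2-D training data with two classes, "A" and "B"
train = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 7.5)]
label_of = {(1.0, 1.0): "A", (1.5, 2.0): "A", (8.0, 8.0): "B", (9.0, 7.5): "B"}

# Evaluation stage: accuracy = correct predictions / total predictions
test_points = [((1.2, 1.1), "A"), ((8.5, 8.0), "B"), ((2.0, 2.0), "A")]
correct = sum(predict_1nn(train, label_of, p) == y for p, y in test_points)
accuracy = correct / len(test_points)
print(accuracy)
```

If the accuracy were too low, CRISP-DM says we would loop back to data preparation or business understanding rather than stopping here.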

Okay, then deployment means we can use the model that we have trained to predict new data. You can also say that, based on the model we create in the modeling and evaluation stages, from there we create software, a prediction system, using the classifier to predict some of the features and offer suggestions; commonly this is called a recommendation system. So we can also work towards this direction in the deployment stage. Okay, yeah. So these are the explanations of each of the stages.

And now we talk about text mining. Data mining, I will say, is generally more on numeric data; text mining, I will say, is more on text data. With numerical data we can see clearly that inside Excel, inside a spreadsheet, all the columns are in numerical form, 0, 1, 2, 3, 4, or in some decimal form, 1.1, 1.2 and so on. Text data will be more like forum data: let's say we download all the responses on a forum, all the posts on the forum, and then put them into an Excel file. So in each row, we will have one of the forum responses.

So in the first row, let's say we have one response; in the second row, we have a second response; and so on, each row has one response from the forum. All these data are in textual form; they won't be in numerical form like 1, 2, 3, 4, 5. So in the forum, let's say people discuss how good a phone is. People may talk about the Xiaomi phone. They may say the Xiaomi is good, the sound is good.

So, this is one response. Then another guy may say he modified the cable, that he needs to use a silver-plated wire to get that sound. So, this is a second response. Each of the responses will not be in number form; it will be text like this. This is where we use text mining, okay. Data mining is more numerical; text mining is where we mine from all these text responses and try to get useful information or useful insight. So that's text mining. So, what is the process? First, we extract documents.

So, one response can be one document, okay; one response from the forum can be one document. You can extract different documents and decide what type of documents you want, and then you can tokenize them, extract features, and come out with TF-IDF entries. And then you do the rest as in the data mining process: understand the data, prepare the data, do the predictive modeling, train the classifier and then evaluate the prediction models. Okay. So these are the detailed explanations of the process.
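The extract-documents-then-TF-IDF step can be sketched in plain Python. Each forum response is one document; TF-IDF weighs a term by how often it appears in a document, discounted by how many documents contain it. The exact formula varies between tools; this is one common unsmoothed variant, with made-up responses:

```python
import math

# Each forum response is one "document"
docs = [
    "the phone sound is good",
    "the phone battery is bad",
    "the good sound needs a good wire",
]

def tf_idf(term, doc, docs):
    words = doc.split()
    tf = words.count(term) / len(words)        # term frequency in this document
    df = sum(term in d.split() for d in docs)  # number of documents containing the term
    idf = math.log(len(docs) / df)             # rarer terms get a higher weight
    return tf * idf

print(round(tf_idf("sound", docs[0], docs), 4))
print(tf_idf("the", docs[0], docs))  # "the" is in every document, so its score is 0
```

This is why TF-IDF pushes down very common words and pushes up words that are distinctive to a document.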

And then in analytics, people usually talk about three types of analytics. One is descriptive analytics, one is predictive analytics, one is prescriptive analytics. Descriptive analytics, I will say, is more along the lines of using descriptive statistics. In this space, we use a more manual approach to identify patterns and new knowledge from an existing data set, to answer what has happened.

So, as I said, it is based more on using descriptive statistics and data visualization to present the information in the data, to understand more about the data. I will say this one is more on the data understanding side. Then predictive analytics is where we create a model: we use a set of training data, or you can say historical or past data, previous data, and we train a model or classifier from this training data. Once we have this model and a new set of data, we put the new set of data into the model to let it predict one of the variables, which we call the target variable. So, this is predictive analytics, and I will say it sits in the modeling and evaluation stages. Prescriptive analytics is one we seldom hear people talk about; predictive analytics is very popular, descriptive analytics is also very popular. Prescriptive analytics is to use simulation and optimization algorithms to advise on what should be done to achieve a certain outcome.
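A tiny sketch of prescriptive analytics as just defined: given a prediction model (here just an assumed linear formula relating ad spend to sales, both made up), search over candidate input values for the one whose predicted outcome is closest to a desired target. Real prescriptive analytics would use proper simulation or optimization algorithms rather than this brute-force scan:

```python
# Assumed prediction model (for illustration only): predicted sales from ad spend
def predict_sales(ad_spend):
    return 50 + 3.2 * ad_spend

target = 210  # the outcome we want to achieve

# Try candidate inputs and keep the one whose prediction is closest to the target
candidates = range(0, 101)
best_input = min(candidates, key=lambda x: abs(predict_sales(x) - target))
print(best_input, predict_sales(best_input))
```

The prediction model answers "what will happen for this input"; the search around it answers "which input should we choose", which is the prescriptive question.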

So, let's say you have a set of past data, and in that data you have a target variable, the outcome. Based on this past data, you want to achieve a certain value in the target variable, a certain outcome; then what should the input be, what should you feed in as the input for the data? So this is a more complicated form of analytics. I will say you can take a predictive analytics prediction model, then try to modify the input values and see whether you get the outcome that you want, or you can use other methods, that is, simulation and optimization algorithms. Okay. So nowadays people talk a lot about big data.

So what is big data? Big data is very large-scale data that usually requires many computers to process. If the data cannot be processed by one computer, one laptop, it usually is big data. The properties of big data are volume, velocity and variety. Volume, velocity, variety: these are the three Vs that are very popular when we talk about big data.

Okay? So velocity means the amount of data grows very fast over time; volume means the amount of data is huge, requiring a lot of computers, more than one computer, to store; and variety usually means there are too many variables or too many types of data inside the data. Okay. So, usually, how do we deal with big data? For big data, I will say, we use parallel systems, parallel processing, or distributed systems. So we usually use Hadoop.

Nowadays we also have a newer one, which is Apache Spark. So we usually use this type of Hadoop system or Apache Spark system to process the big data. Why do we use this type of system? Because this type of system allows you to process the data using distributed computing on many computers. A group of computers like this is usually called a cluster. So when we deal with big data that cannot be processed or cannot be analyzed on one computer, we will usually need to use a lot more computers.
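Hadoop popularized the MapReduce model for this kind of cluster processing. As a toy single-machine sketch (a real Hadoop or Spark job distributes these same steps across the cluster), here is word count written in the map-then-reduce style:

```python
from collections import Counter

lines = [
    "big data needs many computers",
    "many computers form a cluster",
]

# Map step: each line is split into (word, 1) pairs
# (on a cluster, different machines would map different lines in parallel)
mapped = [(word, 1) for line in lines for word in line.split()]

# Reduce step: counts for the same word are summed together
counts = Counter()
for word, n in mapped:
    counts[word] += n

print(counts["many"], counts["computers"])  # 2 2
```

The point of the model is that both steps parallelize cleanly: maps are independent per line, and reduces are independent per word.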

So we use this Hadoop system or distributed system to process the data, to analyze the data: the system uses many computers to process and analyze the data and then gives us a result. For Apache Spark and Hadoop with big data, we can use the Python or Java languages; we also have R plugins to interface with the Hadoop system or Apache Spark. So for DSTK 3, I have developed plugins using R, so that we can interface with the Hadoop and Apache Spark systems and then use them to process big data. So why do we use this Data Science Toolkit 3? Why do we use DSTK 3 for data and text mining? That's mainly because DSTK has a lot of features, a lot of those that are normally used all the time.

These are mainly features like descriptive statistics, inferential statistics, data visualization and normalization, and then for predictive analytics DSTK 3 has linear regression, multiple linear regression and neural networks, which are the most common algorithms for doing prediction modeling. DSTK also offers other common features for text analytics, which include POS tagging, named entity recognition, sentiment analysis, TF-IDF, and so on. So DSTK 3 allows the user to do a first-round analysis, and then they can use more advanced software like SPSS Modeler, SAS Enterprise Miner or SPSS Statistics to further analyze the data, to mine the data more deeply. Okay, so, in conclusion, we have come to the end of the first chapter. We have introduced what data science is, what data mining is, what text mining is and what big data is.

And then what the Data Science Toolkit 3 is. Data science is a field; data mining is more of a process; text mining is mining knowledge from textual data; and big data is a very large set of data that cannot be processed by one computer. So in this chapter we have covered the basics of what data science, data mining, text mining and big data are, and what DSTK 3 is. Okay, with that, we have come to the end of this chapter.
